
Conversation

@kevinyang-cky (Collaborator)

see #21

Removed duplicate description of distributed scalers and corrected DataFrame creation and reading methods.
@kevinyang-cky changed the title from "Address issue #21" to "Address issue 21" on Nov 12, 2025
This module mirrors the code structure and methodology of distributed.py, but focuses specifically on implementing distributed tensor-scaling classes for PyTorch. 

Obsolete attributes (e.g., self.is_array) and unused methods (such as extract_array, get_column_order, and package_transformed_x) from distributed.py have been removed. The extract_x_columns method has also been simplified.

For the fit method, input tensors are expected to be free of NaN values—a reasonable requirement since training datasets should not contain NaNs.
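To make that expectation concrete, here is a minimal sketch of the kind of guard a caller could apply before fitting; the helper name and error message are illustrative and not taken from the diff:

```python
import torch

def assert_no_nans(x: torch.Tensor) -> torch.Tensor:
    # fit() assumes NaN-free inputs, so any cleaning or imputation happens upstream.
    if torch.isnan(x).any():
        raise ValueError("Input tensor contains NaN values; clean the data before fitting.")
    return x
```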

The module requires PyTorch 2.8.0, which is enforced via an assertion at initialization.
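A rough sketch of what such an init-time guard could look like (the exact assertion in the module is not shown here, and the version floor is later relaxed to 2.0.0 during review):

```python
import torch
from packaging.version import Version

# Illustrative version guard; strip any local build suffix such as "+cu121" before comparing.
TORCH_MIN_VERSION = "2.8.0"
assert Version(torch.__version__.split("+")[0]) >= Version(TORCH_MIN_VERSION), (
    f"distributed_tensor requires PyTorch >= {TORCH_MIN_VERSION}, found {torch.__version__}"
)
```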
@kevinyang-cky changed the title from "Address issue 21" to "Address issue 21 and add distributed_tensor module" on Nov 21, 2025
save_scaler is commented out for now, as the custom serialization for tensors still needs to be built.
Moving the tests for the distributed_tensor module to a separate script.
Add unit tests for DStandardScalerTensor and DMinMaxScalerTensor, following the example in distributed_test.py for DStandardScaler and DMinMaxScaler.
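As a sketch of what one of these parity tests might look like (the module paths and the fit/transform signatures are assumptions here, following the scikit-learn-style usage shown in the docs):

```python
import numpy as np
import torch
from bridgescaler.distributed import DStandardScaler               # path assumed
from bridgescaler.distributed_tensor import DStandardScalerTensor  # path assumed


def test_dstandard_scaler_tensor_matches_ndarray():
    rng = np.random.default_rng(0)
    x = rng.normal(size=(100, 4)).astype(np.float32)

    dss_np = DStandardScaler()
    dss_np.fit(x)

    dss_t = DStandardScalerTensor()
    dss_t.fit(torch.from_numpy(x))

    # The ndarray and tensor scalers should produce matching transforms
    # to within floating-point tolerance.
    np.testing.assert_allclose(
        np.asarray(dss_np.transform(x)),
        dss_t.transform(torch.from_numpy(x)).numpy(),
        rtol=1e-5,
        atol=1e-6,
    )
```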
@kevinyang-cky (Collaborator, Author) commented Nov 22, 2025

Besides addressing issue #21, I have also added the distributed_tensor module distributed_tensor.py, which is the torch.tensor version of DStandardScaler and DMinMaxScaler. The unit test script is distributed_tensor_test.py.

DStandardScalerTensor and DMinMaxScalerTensor were also tested against the example in the docs and produced identical results (see screenshots below). Keep in mind, though, that in that example dss_combined = np.sum([dss_1, dss_2]) cannot simply be converted to dss_combined = torch.sum([dss_1, dss_2]), since torch.sum() only accepts tensors. Use dss_combined = dss_1 + dss_2 instead. I can also add this check to the unit test script if it is worth it.
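To make the caveat concrete, a small sketch (the import path is assumed, and the scalers are taken to support + for combining fitted statistics, as described above):

```python
import torch
from bridgescaler.distributed_tensor import DStandardScalerTensor  # path assumed

dss_1 = DStandardScalerTensor()
dss_1.fit(torch.randn(50, 3))
dss_2 = DStandardScalerTensor()
dss_2.fit(torch.randn(80, 3))

# np.sum([dss_1, dss_2]) works for the ndarray scalers because np.sum reduces the
# list with the scalers' addition operator, but torch.sum() only accepts tensors,
# so the tensor scalers are combined with + directly:
dss_combined = dss_1 + dss_2
```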

Happy Thanksgiving! :)


[Screenshots: ndarray_version and tensor_version results]

@djgagne (Collaborator) left a comment


I have a few small requested changes to help the CI tests pass and to keep the module working with more than just the base version of PyTorch.

The restriction to PyTorch 2.8.0 applied only to an early iteration of the code and is no longer relevant. 

According to the documentation, the "unbiased" argument in torch.var was renamed to "correction" beginning with PyTorch 2.0; therefore, impose a minimum version requirement of 2.0.0.

Tested the module with the latest version 2.9.1, and other versions >= 2.0.0 worked fine.
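For reference, the renamed argument in question (a minimal illustration assuming PyTorch >= 2.0, where correction replaces the old unbiased flag):

```python
import torch

x = torch.arange(6, dtype=torch.float32).reshape(3, 2)

# correction=1 gives the unbiased (sample) variance, equivalent to the old unbiased=True;
# correction=0 gives the biased (population) variance, equivalent to unbiased=False.
var_sample = torch.var(x, dim=0, correction=1)
var_population = torch.var(x, dim=0, correction=0)
```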
@kevinyang-cky (Collaborator, Author)

Comments addressed, and CI tests passed.

@djgagne (Collaborator) left a comment


LGTM

@djgagne merged commit e87bd69 into main on Dec 4, 2025
1 check passed