The current implementation of `type_infer` is not suitable for distributed compute environments (i.e., it is not scalable); currently, `type_infer` can only be executed on a single node and needs to load the data into memory. This makes `type_infer` unsuitable for analyzing large datasets that do not fit in RAM.

The internal workings of `type_infer` allow for a relatively straightforward implementation of execution in distributed compute environments: one could use subsets of columns (each subset loaded into a different worker) to infer the data types, and then apply something like a voting mechanism to choose one type over another.

The voting mechanism shall be aware of the data type hierarchy. For example, consider the case of having 4 workers: worker 1 identifies a subset of a column as being of type `text`, while workers 2, 3, and 4 identify the rest of the subsets as being of type `integer`. Because `text` is a more general data type than `integer` (one level higher in the data type hierarchy), the entire column should be cast as `text` instead of `integer`, even though there are more votes for the latter. It is worth mentioning that the current implementation does not handle this situation; it might seem like an edge case, but it is likely very common.

The proposed implementation shall use `torch.distributed` to distribute the work across nodes. Because `torch` is a heavy dependency, this capability shall be available only if the user installs `type_infer` by running the corresponding opt-in install command. All of the distributed modules should be encapsulated into a sub-module called `distributed` to avoid breaking the already existing code-base.
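A minimal sketch of the hierarchy-aware voting described above could look like the following. The type names, their ranks, and the function name are illustrative assumptions for this issue, not `type_infer`'s actual hierarchy or API:

```python
# Illustrative type hierarchy: higher rank = more general type.
# These names and ranks are assumptions, not type_infer's real hierarchy.
TYPE_RANK = {
    "integer": 0,
    "float": 1,
    "text": 2,
}

def resolve_column_type(worker_votes):
    """Pick a single type for a column given one vote per worker subset.

    The most general type observed (highest rank) wins, even if a more
    specific type received more votes -- mirroring the text-vs-integer
    example above.
    """
    if not worker_votes:
        raise ValueError("no votes to resolve")
    return max(worker_votes, key=lambda t: TYPE_RANK[t])

# The 4-worker example from above: 1 vote for "text", 3 for "integer".
votes = ["text", "integer", "integer", "integer"]
print(resolve_column_type(votes))  # "text" wins despite fewer votes
```

A plain majority vote would return `integer` here and silently truncate the textual values, which is why resolution by hierarchy rank (rather than vote count) is the safer default.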