generate_statistics_from_pyarrow table or parquet #92
Comments
In this issue, it's mentioned that: "Towards the goal of adding support for computing statistics over structured data (e.g., arbitrary protocol buffers, parquet data), GenerateStatistics API will take Arrow tables as input instead of Dict[FeatureName, ndarray]." |
One thing to note is that the GenerateStatistics API will only accept Arrow tables whose columns are ListArray of primitive types (e.g., int8, int16, int32, int64, uint8, uint16, uint32, uint64, float16, float32, float64, binary, string, unicode), so Arrow tables that are not in that format will not work with the API. |
Hi, thanks for the prompt answer! Sorry, I did not search enough before opening the issue. About the limitation of the input data to Arrow tables of ListArray of primitives: does it exclude list-of-list dtypes? If I use those dtypes, what can I do? When the data was converted to TFRecords, it seemed to work when printing the facets: I had access to the quantiles of the length of the records. |
Many people use pandas or spark in their data preparation stage. Those tools dump parquet files in which every column is an array of 'primitive type', not an array of 'list of primitive types'.
Just wanted to share my interest in this kind of feature. I would be happy to help if needed. |
The following should work and not cost too much in terms of memory! |
One caveat to that solution is that your original columns must not contain nulls (nil, None, etc.). |
Hi, thanks for your answer. Indeed, my dataframe was already null-safe (nulls replaced by 0), so I did not notice it. If I can access the null mask of the array, it should not be too hard to build the ListArray correctly, but I'm not sure how to get it. Thanks |
I think there may be an easy patch to your existing solution so it handles nulls correctly, but I'm not familiar with pandas APIs. Basically, we want to translate something like pd.Series([1, 2, None, 3]) to pa.array([[1], [2], None, [3]]). Note that pa.array([[1], [2], None, [3]]) is essentially
So essentially, you can write a function that:
|
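A minimal sketch of such a translation, not the code posted in this thread. It assumes the column is a flat sequence of scalars in which missing values appear either as None or as float NaN (which is how pandas stores missing values in a numeric Series). The helper names `wrap_scalars` and `to_list_array` are hypothetical.

```python
import math


def is_null(value):
    """Treat None and float NaN as missing (assumption: NaN means null)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def wrap_scalars(column):
    """Wrap each non-null scalar in a one-element list; keep nulls as None."""
    return [None if is_null(value) else [value] for value in column]


def to_list_array(column):
    """Build a pyarrow list array from a flat column (hypothetical helper)."""
    import pyarrow as pa  # imported lazily; only this function needs pyarrow

    return pa.array(wrap_scalars(column))


nested = wrap_scalars([1, 2, None, 3])
```

Here `nested` is `[[1], [2], None, [3]]`, the shape described above. Building nested Python lists copies every value, so this is only meant to show the target layout, not an efficient conversion.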
@tanguycdls : if you have something workable, do you mind making a contribution? |
Hi @brills, I have not had time yet to go through it and merge all the pieces together. If I do, I will share my code! |
Hi, I worked a bit on the pyarrow side today. Actually, ListArray does not have a mask parameter in the from_arrays function. Are you running a nightly version? https://github.com/apache/arrow/blob/c49b960d2b697135c8de45222c6377e427ba8aad/python/pyarrow/array.pxi#L1402 To get a null you need: offset[j] = None --> arr[j] = None. I did the following:
The to_pandas transformation is costly, so I still need to figure out how to avoid it! EDIT: I added the full example to use it directly in Beam: If you're interested, I can make a PR out of it; it will be easier to check that all the cases are OK. |
It's true that ListArray.from_arrays doesn't accept a mask parameter, but the offsets parameter actually serves two purposes: So you should be able to do something like
you can use |
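The offsets approach discussed above (a null offset at position j yields a null list at position j) can be sketched as follows. This is an illustration under stated assumptions, not the code from the thread: missing values are assumed to be None, and the helper names `offsets_with_nulls` and `to_list_array` are hypothetical. `pa.ListArray.from_arrays(offsets, values)` is the pyarrow call referred to in the comments.

```python
def offsets_with_nulls(column):
    """Return (offsets, values); a None offset marks a null list entry."""
    offsets, values = [], []
    for item in column:
        if item is None:
            offsets.append(None)         # null offset -> null list at this position
        else:
            offsets.append(len(values))  # start of a one-element list
            values.append(item)
    offsets.append(len(values))          # the final offset must be non-null
    return offsets, values


def to_list_array(column):
    """Build a ListArray whose nulls come from null offsets."""
    import pyarrow as pa  # imported lazily; only this function needs pyarrow

    offsets, values = offsets_with_nulls(column)
    return pa.ListArray.from_arrays(pa.array(offsets, type=pa.int32()),
                                    pa.array(values))


offsets, values = offsets_with_nulls([1, 2, None, 3])
```

For the input `[1, 2, None, 3]`, `offsets` is `[0, 1, None, 2, 3]` and `values` is `[1, 2, 3]`. Only the non-null values are stored, so no nested Python lists are needed.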
Thanks for the help @brills !
It simplifies the code a lot:
I will try to contribute that code as soon as I have more time on my side and once we're sure everything works well! |
Nice! If you don't mind, I can add this to tfx_bsl and revise TFDV's generate_statistics_from_pyarrow. We recently found that other libraries could also benefit from this adapter. |
Sorry, I meant to revise generate_statistics_from_pandas. I could imagine a generate_statistics_from_pyarrow, but that will take longer (contributions are welcome!) |
OK, in that case I can try to contribute generate_statistics_from_parquet or a PyArrow table variant. We are more interested in that use case, and calling the private method seems to be a bad idea! |
tensorflow/tfx-bsl@d6cc2b8 added the conversion function to tfx_bsl |
@brills I created the generate_statistics_from_parquet method, based on the conversion function added to tfx_bsl:
To verify the correctness, I compared the statistics generated from the above method with
Most statistics are correct, apart from a slight discrepancy in the histogram statistics for
Does the GitHub gist above decode the Parquet file correctly? Or is additional postprocessing required? |
The histogram is computed using an approximate algorithm, and due to the nondeterministic nature of Beam, the result may even change across runs. So if the diff is slight, it could be expected. |
@brills Thanks for the clarification. In that case, can I submit a pull request based on the GitHub gist I posted? Or is more work required? |
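Since the histogram can differ slightly between runs, an exact equality check between two sets of statistics is likely to fail spuriously. A hedged sketch of a tolerance-based comparison follows. The function name `histograms_close` and the tolerance values are arbitrary assumptions for illustration; this is not part of TFDV.

```python
import math


def histograms_close(counts_a, counts_b, rel_tol=0.05, abs_tol=1.0):
    """Compare two lists of bucket counts, allowing small per-bucket drift."""
    if len(counts_a) != len(counts_b):
        return False
    return all(
        math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)
        for a, b in zip(counts_a, counts_b)
    )


similar = histograms_close([100, 200, 50], [101, 198, 50])
different = histograms_close([100, 200], [100, 300])
```

With these tolerances, `similar` is True and `different` is False. Suitable thresholds depend on the dataset size and number of buckets.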
@brills @khorshuheng Any update on when generate_statistics_from_parquet() will be officially added to tfdv? I could really benefit from it. |
The
|
/type feature
Hi, since TFRecords are already converted to PyArrow tables to compute statistics, how hard would it be to add an option to read a PyArrow file or Parquet file directly?
data-validation/tensorflow_data_validation/utils/stats_gen_lib.py
Line 106 in bf40237
If my understanding of that code is correct, could we replace beam.io.textio.ReadFromText with beam.io.parquetio.ReadFromParquet? If so, would we need to extract features, or would the PyArrow schema be enough?
My aim would be to use TFDV to extract data features and visualise them using facets.
Thanks