Hi,
I searched online for whether tfdv can be used to validate data that contains text, for instance a dataset of sentences that have to be mapped to labels. I could not find any really useful tutorials; the ones I found only cover numerical features such as height and weight.
After looking around in the data-validation package, I found a couple of files that seem to be related to this:
https://github.com/tensorflow/data-validation/blob/master/tensorflow_data_validation/statistics/generators/natural_language_stats_generator.py
and
https://github.com/tensorflow/data-validation/blob/master/tensorflow_data_validation/statistics/generators/natural_language_domain_inferring_stats_generator.py
Furthermore, on the TensorFlow website page for the StatsOptions class I found the following arguments:
https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/StatsOptions
enable_semantic_domain_stats
If True statistics for semantic domains are generated (e.g: image, text domains).
semantic_domain_stats_sample_rate
An optional sampling rate for semantic domain statistics. If specified, semantic domain statistics is computed over a sample.
vocab_paths
An optional dictionary mapping vocab names to paths. Used in the schema when specifying a NaturalLanguageDomain. The paths can either be to GZIP-compressed TF record files that have a tfrecord.gz suffix or to text files.
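For reference, a minimal sketch of how the three StatsOptions arguments quoted in this post might be passed together. The helper name `make_text_stats_options` and its parameters are hypothetical; only the three keyword arguments come from the StatsOptions documentation, and the sketch assumes tensorflow_data_validation is installed.

```python
def make_text_stats_options(vocab_name, vocab_path, sample_rate=None):
    """Build StatsOptions with semantic domain statistics turned on.

    vocab_name and vocab_path are placeholders for a vocabulary that a
    NaturalLanguageDomain in the schema could later refer to by name.
    """
    # Imported inside the function so this module still loads on machines
    # where tensorflow_data_validation is not installed.
    import tensorflow_data_validation as tfdv

    return tfdv.StatsOptions(
        enable_semantic_domain_stats=True,
        semantic_domain_stats_sample_rate=sample_rate,
        vocab_paths={vocab_name: vocab_path},
    )
```

The returned options object would then be passed as `stats_options` to one of the statistics functions, for example `tfdv.generate_statistics_from_csv`.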
These arguments and files suggest that tfdv can be used to analyze and validate data for NLP and text classification problems.
However, it is unclear to me how one would use these features to validate text-based data.
I have enabled the enable_semantic_domain_stats argument, and this does give information such as sequence length.
How would one extend this to validate vocabularies, for example by checking the ratio of known to unknown words?
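To make the known/unknown word ratio concrete, here is a plain-Python sketch of the quantity I would like to validate. It does not use tfdv at all; the function name `oov_ratio`, the lowercasing, and the whitespace tokenization are assumptions made only for illustration.

```python
from collections import Counter
from typing import Iterable, Set


def oov_ratio(sentences: Iterable[str], vocab: Set[str]) -> float:
    """Return the fraction of whitespace-separated tokens missing from vocab."""
    # Count, for every token, whether it is known (True) or unknown (False).
    counts = Counter(
        token in vocab
        for sentence in sentences
        for token in sentence.lower().split()
    )
    total = counts[True] + counts[False]
    # By convention, an empty dataset has no unknown tokens.
    return counts[False] / total if total else 0.0


vocab = {"the", "cat", "sat", "on", "mat"}
sentences = ["The cat sat", "the dog sat on the mat"]
# 9 tokens in total, of which only "dog" is unknown.
ratio = oov_ratio(sentences, vocab)
```

A validation step could then compare `ratio` against a chosen maximum and flag the dataset when it is exceeded.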
Any tips or thoughts are highly appreciated!
Kind Regards,
Caspar