cd transforms/universal/noop/python
mkdir input
cp .../test.parquet input
make venv
source venv/bin/activate
python src/noop_transform_python.py --data_local_config "{'input_folder': 'input', 'output_folder':'output'}"
09:37:35 INFO - Launching noop transform
09:37:35 INFO - noop parameters are : {'sleep_sec': 1, 'pwd': 'nothing'}
09:37:35 INFO - pipeline id pipeline_id
09:37:35 INFO - code location None
09:37:35 INFO - data factory data_ is using local data access: input_folder - input output_folder - output
09:37:35 INFO - data factory data_ max_files -1, n_sample -1
09:37:35 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
09:37:35 INFO - orchestrator noop started at 2024-11-20 09:37:35
09:37:35 INFO - Number of files is 1, source profile {'max_file_size': 341.400185585022, 'min_file_size': 341.400185585022, 'total_file_size': 341.400185585022}
09:37:39 ERROR - Failed to convert byte array to arrow table, exception Nested data conversions not implemented for chunked array outputs. Skipping it
09:37:39 WARNING - Transformation of file to table failed
09:37:39 INFO - Completed 1 files (100.0%) in 0.07 min
09:37:39 INFO - Done processing 1 files, waiting for flush() completion.
09:37:39 INFO - done flushing in 0.0 sec
09:37:39 INFO - Completed execution in 0.07 min, execution result 0
Segmentation fault: 11
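
The ERROR line above comes out of pyarrow itself. Judging from the commit messages below, the framework's convert_binary_to_arrow() helper hands raw parquet bytes to pyarrow to read as a table, roughly like this (a minimal sketch — the helper name and body here are illustrative, not the actual library code):

import pyarrow as pa
import pyarrow.parquet as pq

def read_parquet_bytes(data: bytes) -> pa.Table:
    # For some files with large nested columns (e.g. list<binary>),
    # this raises pyarrow.lib.ArrowNotImplementedError:
    # "Nested data conversions not implemented for chunked array outputs"
    return pq.read_table(pa.BufferReader(data))
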
* add polars to try and read some troublesome parquet files to arrow tables (see the sketch after this commit list)
Signed-off-by: David Wood <[email protected]>
* fix bug in convert_binary_to_arrow() by returning table from polars
Signed-off-by: David Wood <[email protected]>
* update convert_binary_to_arrow() by catching exceptions from polars
Signed-off-by: David Wood <[email protected]>
* change filter's duckdb setting to allow large buffers on arrow tables
Signed-off-by: David Wood <[email protected]>
* turn off changes to filter for now
Signed-off-by: David Wood <[email protected]>
* add polars to core library
Signed-off-by: David Wood <[email protected]>
* add comment to say why we're adding polars for reading some parquet files
Signed-off-by: David Wood <[email protected]>
* pin core lib polars>=1.16.0
Signed-off-by: David Wood <[email protected]>
* change failure on polars read from warning to error
Signed-off-by: David Wood <[email protected]>
* remove comments on duckdb settings for multimodal in FilterTransform.init().
Signed-off-by: David Wood <[email protected]>
* downgrade polars to >=1.9.0
Signed-off-by: David Wood <[email protected]>
---------
Signed-off-by: David Wood <[email protected]>
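
Taken together, the commits above describe a fallback path: try pyarrow first, and if it cannot read the bytes, retry with polars and convert the resulting DataFrame to an Arrow table. A minimal sketch of that approach (the function name matches the commits, but the body here is illustrative, not the actual library code):

import io

import polars as pl  # the core lib pins polars>=1.9.0
import pyarrow as pa
import pyarrow.parquet as pq

def convert_binary_to_arrow(data: bytes) -> pa.Table | None:
    try:
        # normal path: let pyarrow read the parquet bytes
        return pq.read_table(pa.BufferReader(data))
    except Exception:
        # pyarrow chokes on some files with large nested columns;
        # fall through and let polars try
        pass
    try:
        # polars reads the same bytes and hands back an Arrow table
        return pl.read_parquet(io.BytesIO(data)).to_arrow()
    except Exception as e:
        # per the commits, a polars read failure is reported as an error
        print(f"polars failed to read parquet bytes: {e}")
        return None

The duckdb change mentioned above (allowing large buffers on Arrow tables, e.g. DuckDB's arrow_large_buffer_size option) was backed out ("turn off changes to filter for now"), so it is not part of this sketch.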
Search before asking
Component
Library/core
What happened + What you expected to happen
I have a parquet file with a column containing a list of images stored as byte arrays. Under some circumstances such files are not readable by pyarrow: the read fails with "Nested data conversions not implemented for chunked array outputs" and, as the log above shows, the run then ends in a segmentation fault.
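
For context, a column like this is list<binary> in Arrow terms. A tiny file with that schema can be written as below (illustrative only; at this size it reads back fine — the failure needs a nested column chunk large enough that pyarrow has to split it into multiple chunks):

import pyarrow as pa
import pyarrow.parquet as pq

# one row whose "images" cell is a list of two (fake) image byte arrays
images = pa.array([[b"\x89PNG...", b"\xff\xd8\xff..."]], type=pa.list_(pa.binary()))
table = pa.table({"images": images})
pq.write_table(table, "tiny.parquet")
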
Reproduction script
Grab https://ibm.ent.box.com/file/1684883605503?s=9qcne0iubeji6t6a77gh2sxgi29k1tjp and save it as test.parquet, then run the steps shown at the top of this issue.
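
Assuming the file has been downloaded, the pyarrow failure should also be reproducible directly (a sketch; the expected exception matches the message in the log above):

import pyarrow.parquet as pq

# expected to fail with "Nested data conversions not implemented
# for chunked array outputs" on the linked file
table = pq.read_table("test.parquet")
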
Anything else
OS
Ubuntu, MacOS (limited support)
Python
3.11.x
Are you willing to submit a PR?