Pii Modifier should work with `DocumentDataset` on cudf #418

praateekmahajan · 2024-12-10T16:59:17Z

Is your feature request related to a problem? Please describe.

(Not urgent, since we have to spill to host memory anyway, but we might benefit from faster I/O and dataset filtering, e.g. in #417.)

I noticed an oddity in the PII examples, scripts, and docs: PII modification doesn't work when the dataset is read with `DocumentDataset.read_*(backend="cudf")`, given that:

- We call `text.tolist()` (here)
- `cudf.Series` doesn't support `tolist()` (here)

All of the examples, scripts, and docs read the dataset using dask (pandas), but pass `device='gpu'` to the Modifier.

Describe the solution you'd like
The code works with `DocumentDataset` on the cudf backend.
I think we might just need `to_arrow().tolist()` when the series is a cudf type (cudf's Arrow conversion method is `to_arrow()`).
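
A minimal sketch of the conversion suggested above, not the actual modifier code. The helper name `series_to_list` is hypothetical, and detecting cudf by the type's module name is an assumption made so the snippet does not need cudf installed. It relies on `cudf.Series.to_arrow()` returning a `pyarrow.Array`, whose `to_pylist()` yields plain Python objects.

```python
def series_to_list(text):
    """Return the values of a pandas or cudf Series as a plain Python list.

    Hypothetical helper: the name and the module-name check are assumptions,
    not part of the library.
    """
    # Detect cudf without importing it, so CPU-only environments still work.
    if type(text).__module__.split(".")[0] == "cudf":
        # cudf.Series has no tolist(); go through Arrow, which copies the
        # column from device to host memory.
        return text.to_arrow().to_pylist()
    # pandas.Series (and anything else list-like with tolist()).
    return text.tolist()
```

The modifier could then call `series_to_list(text)` in place of `text.tolist()`. The device-to-host copy is unavoidable here, which matches the note above about spilling to host memory.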


praateekmahajan added the enhancement New feature or request label Dec 10, 2024

sithape2025 added the jira label Mar 7, 2025
