Extend support to non-English languages for PII Deidentifier #554

hamsarajan · 2025-02-18T10:24:58Z

Is your feature request related to a problem? Please describe.

My team is working on removing PII from text data in Southeast Asian languages. When we use the PIIDeIdentifier on these languages, it raises the following error: ValueError: No matching recognizers were found to serve the request. It appears to support only English.

Describe the solution you'd like
It would be helpful if PII could be detected in Southeast Asian languages (e.g., Bahasa Indonesia, Thai, Vietnamese).

Describe alternatives you've considered
The underlying package is Presidio, which uses spaCy and Stanza NER models as part of its detection. There are spaCy and Stanza models that support some of the Southeast Asian languages, and they could be adapted for this use case.
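
To make the alternative above concrete, here is a minimal sketch of how Presidio's analyzer might be pointed at non-English NER models. It assumes the `presidio-analyzer` package and the chosen models are installed. The helper names `build_nlp_configuration` and `build_analyzer` are hypothetical, and the model names in the usage note are placeholders rather than a recommendation; which models exist for a given language needs to be checked against the spaCy and Stanza model listings. How PIIDeIdentifier would expose this configuration is not covered here.

```python
from typing import Dict


def build_nlp_configuration(models: Dict[str, str], engine: str = "spacy") -> dict:
    """Build an NLP engine configuration dict in the shape Presidio expects.

    `models` maps a language code (e.g. "en") to a model name for `engine`.
    """
    return {
        "nlp_engine_name": engine,
        "models": [
            {"lang_code": lang, "model_name": name}
            for lang, name in models.items()
        ],
    }


def build_analyzer(models: Dict[str, str], engine: str = "spacy"):
    """Create a Presidio AnalyzerEngine that accepts the given languages.

    Presidio is imported lazily so the configuration helper above can be
    used and tested without Presidio installed.
    """
    from presidio_analyzer import AnalyzerEngine
    from presidio_analyzer.nlp_engine import NlpEngineProvider

    provider = NlpEngineProvider(
        nlp_configuration=build_nlp_configuration(models, engine)
    )
    nlp_engine = provider.create_engine()
    # supported_languages must list every language the analyzer will be
    # asked to handle; here it is taken from the keys of `models`.
    return AnalyzerEngine(
        nlp_engine=nlp_engine,
        supported_languages=list(models),
    )
```

Under these assumptions, a call such as `build_analyzer({"en": "en", "vi": "vi"}, engine="stanza")` followed by `analyzer.analyze(text=..., language="vi")` would route Vietnamese text through the Stanza model. The model names in that call are assumed, and detection quality for each language would still need to be evaluated.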

sithape2025 · 2025-03-05T21:50:26Z

@singhva Can you take a look at this?

hamsarajan added the "enhancement" (New feature or request) label Feb 18, 2025

sithape2025 added the "jira" label Mar 7, 2025
