Extend support to non-English languages for PII Deidentifier #554

Open
hamsarajan opened this issue Feb 18, 2025 · 1 comment
Labels
enhancement (New feature or request), jira

Comments

@hamsarajan

Is your feature request related to a problem? Please describe.

My team is currently working on removing PII from text data in Southeast Asian languages. When we use the PIIDeIdentifier on these languages, it raises the following error: ValueError: No matching recognizers were found to serve the request. It appears to support only English.

Describe the solution you'd like
It would be helpful if PII could be detected in Southeast Asian languages (e.g., Bahasa Indonesia, Thai, Vietnamese).

Describe alternatives you've considered
The underlying package is Presidio, which uses spaCy and Stanza NER models as part of its detection. spaCy and Stanza provide models that support some of the Southeast Asian languages, and these could be adapted for this use case.
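As a rough illustration of that alternative, the sketch below builds a multi-language NLP engine configuration and hands it to Presidio's `NlpEngineProvider` and `AnalyzerEngine`. The helper names (`build_nlp_config`, `build_analyzer`) are hypothetical, and the model names are assumptions: they must be checked against what spaCy or Stanza actually publish for each language. How the PIIDeIdentifier wires up its analyzer is not shown in this issue, so this covers only the Presidio side.

```python
from typing import Dict


def build_nlp_config(engine: str, models: Dict[str, str]) -> dict:
    """Build a Presidio NLP-engine configuration covering several languages.

    `models` maps a language code to a model name. The names are not
    validated here; confirm that each one exists for the chosen engine.
    """
    return {
        "nlp_engine_name": engine,
        "models": [
            {"lang_code": lang, "model_name": name}
            for lang, name in models.items()
        ],
    }


def build_analyzer(config: dict):
    """Create an AnalyzerEngine that accepts every language listed in `config`."""
    # Imported lazily so the config helper works without Presidio installed.
    from presidio_analyzer import AnalyzerEngine
    from presidio_analyzer.nlp_engine import NlpEngineProvider

    nlp_engine = NlpEngineProvider(nlp_configuration=config).create_engine()
    languages = [m["lang_code"] for m in config["models"]]
    return AnalyzerEngine(nlp_engine=nlp_engine, supported_languages=languages)


# Assumed model names for illustration only (Stanza models are typically
# addressed by language code); verify availability before relying on them.
config = build_nlp_config("stanza", {"en": "en", "vi": "vi", "th": "th"})
```

With the relevant models installed, `build_analyzer(config).analyze(text=some_text, language="vi")` should no longer fail for lack of a matching recognizer, although detection quality would depend on the NER model behind each language.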

@hamsarajan hamsarajan added the enhancement New feature or request label Feb 18, 2025
@sithape2025
Collaborator

@singhva Can you take a look at this?
