Summary
clean_unicode() has 100+ Unicode-to-ASCII mappings via the internal unicode_map, but it hasn't been tested against a real uncurated chemical dataset. The function's check_unhandled() mechanism warns about unmapped characters — we need to run it against production data to identify any gaps.
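The audit described above can be illustrated with a short Python sketch. It is not the actual check_unhandled() implementation, which this issue does not show. The names KNOWN_MAP, find_unhandled, and describe are hypothetical, and KNOWN_MAP holds only a couple of illustrative entries rather than the real unicode_map.

```python
from collections import Counter
import unicodedata

# Hypothetical stand-in for the real unicode_map (illustrative entries only).
KNOWN_MAP = {"\u00b0": "deg", "\u03b1": "alpha"}


def find_unhandled(names, known_map=KNOWN_MAP):
    """Count non-ASCII characters that have no entry in known_map."""
    counts = Counter()
    for name in names:
        for ch in name:
            if ord(ch) > 127 and ch not in known_map:
                counts[ch] += 1
    return counts


def describe(counts):
    """Format each unhandled character as 'U+XXXX NAME: count' for review."""
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}: {n}"
        for ch, n in counts.most_common()
    ]


# Example names taken from this issue, written with explicit escapes.
sample = [
    "Palygorskite fibers (> 5\u00b5m in length)",
    "TRIM\u00ae VX",
    "Vertasil\u00ae Trisiloxanyl-cannabidiol",
]
unhandled = find_unhandled(sample)
report = describe(unhandled)
```

Running the same scan over a full column of chemical names would give a frequency-ranked list of characters to consider adding to the map.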
Motivation
The PRE_POST_CURATION_PLAN.md identifies three specific Unicode issues found in real chemical name data (two characters and one whitespace problem):
| Character | Unicode | Example | Expected Mapping |
|---|---|---|---|
| µ (micro sign) | U+00B5 | "Palygorskite fibers (> 5µm in length)" | u or micro |
| ® (registered trademark) | U+00AE | "TRIM® VX" | Strip or (R) |
| Double spaces after ® | n/a | "Vertasil® Trisiloxanyl-cannabidiol" | Collapse to single space |
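The expected mappings in the table can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the clean_unicode() implementation: SKETCH_MAP and to_ascii_sketch are hypothetical names, and the sketch arbitrarily picks "u" and "(R)" where the table allows either of two options.

```python
import re

# Hypothetical subset of a Unicode-to-ASCII map; the real unicode_map may differ.
SKETCH_MAP = {
    "\u00b5": "u",    # micro sign -> "u" (the table also allows "micro")
    "\u00ae": "(R)",  # registered sign -> "(R)" (the table also allows stripping)
}


def to_ascii_sketch(text, mapping=SKETCH_MAP):
    """Replace mapped characters, then collapse runs of spaces to one."""
    replaced = "".join(mapping.get(ch, ch) for ch in text)
    return re.sub(r" {2,}", " ", replaced).strip()


micro = to_ascii_sketch("5\u00b5m")        # "5um"
trim = to_ascii_sketch("TRIM\u00ae VX")    # "TRIM(R) VX"
# Input deliberately contains two spaces after the registered sign.
vertasil = to_ascii_sketch("Vertasil\u00ae  Trisiloxanyl-cannabidiol")
```

Collapsing whitespace after the character replacement matters because either the source data or a strip-style mapping can leave two adjacent spaces behind.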
Action Items
- Run clean_unicode() against the 172-row test dataset (chemical_validation_test.csv), specifically the 3 Unicode test records
- Run it against the production dataset (uncurated_chemicals_2023-05-16_12-43-41.csv)
- Verify that µ (U+00B5) is in the current unicode_map; if not, add it
- Verify that ® (U+00AE) is in the current unicode_map; if not, add it
- Review check_unhandled() output to identify any additional unmapped characters
- Add any missing mappings to the unicode_map dataset
Tests
After any additions:
- clean_unicode("5µm") produces ASCII output (e.g., "5um" or "5microm")
- clean_unicode("TRIM® VX") produces ASCII output (e.g., "TRIM(R) VX" or "TRIM VX")
- clean_unicode("Vertasil® Trisiloxanyl-cannabidiol") produces clean ASCII with no double spaces
- Existing clean_unicode() tests still pass
Context
clean_unicode() is already far more comprehensive than the Python equivalent (100+ vs 8 mappings). This issue is about verifying coverage against real data rather than a major rewrite.
Source: PRE_POST_CURATION_PLAN.md section 12.3