Evaluate `clean_unicode()` against real chemical data for mapping gaps

## Summary

`clean_unicode()` has 100+ Unicode-to-ASCII mappings via the internal `unicode_map`, but it hasn't been tested against a real uncurated chemical dataset. The function's `check_unhandled()` mechanism warns about unmapped characters — we need to run it against production data to identify any gaps.

## Motivation

`PRE_POST_CURATION_PLAN.md` identifies three specific issues in real chemical name data (two Unicode characters and one whitespace artifact):

| Character | Unicode | Example | Expected Mapping |
|-----------|---------|---------|-----------------|
| `µ` (micro sign) | U+00B5 | `"Palygorskite fibers (> 5µm in length)"` | `u` or `micro` |
| `®` (registered trademark) | U+00AE | `"TRIM® VX"` | Strip or `(R)` |
| Double spaces after `®` | — | `"Vertasil®  Trisiloxanyl-cannabidiol"` | Collapse to single space |

## Action Items

- [ ] Test `clean_unicode()` against the 172-row test dataset (`chemical_validation_test.csv`) — specifically the 3 Unicode test records
- [ ] If available, test against the full 12,144-row uncurated dataset (`uncurated_chemicals_2023-05-16_12-43-41.csv`)
- [ ] Check whether `µ` (U+00B5) is in the current `unicode_map` — if not, add it
- [ ] Check whether `®` (U+00AE) is in the current `unicode_map` — if not, add it
- [ ] Review the `check_unhandled()` output to identify any additional unmapped characters
- [ ] Add any missing mappings to the internal `unicode_map` dataset
- [ ] Verify that double spaces are collapsed to a single space (may need a post-processing step)
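
As a cross-check on the dataset steps above, a rough inventory of every non-ASCII character in the CSVs could be built independently of `clean_unicode()`. The following is a minimal standalone Python sketch, not part of the package: the helper names `count_non_ascii`, `report`, and `scan_csv` are hypothetical, and it assumes the files are UTF-8 encoded and that scanning every column is acceptable, since this issue does not name the chemical-name column.

```python
import csv
import unicodedata
from collections import Counter


def count_non_ascii(rows):
    """Count each non-ASCII character across an iterable of rows of strings."""
    counts = Counter()
    for row in rows:
        for cell in row:
            counts.update(ch for ch in cell if ord(ch) > 127)
    return counts


def report(counts):
    """Format counts as 'U+XXXX<TAB>NAME<TAB>count' lines, most frequent first."""
    lines = []
    for ch, n in counts.most_common():
        name = unicodedata.name(ch, "UNKNOWN")
        lines.append(f"U+{ord(ch):04X}\t{name}\t{n}")
    return lines


def scan_csv(path, encoding="utf-8"):
    """Scan every cell of a CSV file (assumption: all columns are relevant)."""
    with open(path, newline="", encoding=encoding) as fh:
        return count_non_ascii(csv.reader(fh))
```

Calling `report(scan_csv("chemical_validation_test.csv"))` would then list each character by code point, which can be compared by hand against the keys in `unicode_map` and against what `check_unhandled()` reports.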

## Tests

After any additions:
- [ ] `clean_unicode("5µm")` produces ASCII output (e.g., `"5um"` or `"5microm"`)
- [ ] `clean_unicode("TRIM® VX")` produces ASCII output (e.g., `"TRIM(R) VX"` or `"TRIM VX"`)
- [ ] `clean_unicode("Vertasil®  Trisiloxanyl-cannabidiol")` produces clean ASCII with no double spaces
- [ ] Existing `clean_unicode()` tests still pass
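
The expectations above can be made concrete with a small stand-in. This is a hedged Python sketch of the intended behavior only, not the `clean_unicode()` implementation: `SKETCH_MAP`, `clean_sketch`, and `unhandled` are hypothetical names, the map holds just the two characters from this issue, and it arbitrarily picks `u` for `µ` and stripping for `®` from the options listed in the Motivation table.

```python
import re

# Hypothetical two-entry map; the real `unicode_map` is not reproduced here.
SKETCH_MAP = {
    "\u00b5": "u",  # MICRO SIGN -> "u" (the issue also allows "micro")
    "\u00ae": "",   # REGISTERED SIGN -> stripped (the issue also allows "(R)")
}


def clean_sketch(text, mapping=SKETCH_MAP):
    """Replace mapped characters, then collapse runs of spaces and trim."""
    replaced = "".join(mapping.get(ch, ch) for ch in text)
    # Post-processing step: collapse two or more spaces into one.
    collapsed = re.sub(r" {2,}", " ", replaced)
    return collapsed.strip()


def unhandled(text, mapping=SKETCH_MAP):
    """Return the sorted non-ASCII characters that have no mapping."""
    return sorted({ch for ch in text if ord(ch) > 127 and ch not in mapping})


print(clean_sketch("5\u00b5m"))          # 5um
print(clean_sketch("TRIM\u00ae VX"))     # TRIM VX
print(clean_sketch("Vertasil\u00ae  Trisiloxanyl-cannabidiol"))
# Vertasil Trisiloxanyl-cannabidiol
```

If `®` is mapped to `(R)` instead of being stripped, the double space in the Vertasil record still survives the character replacement, so the space-collapsing step is needed under either mapping choice.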

## Context

`clean_unicode()` is already far more comprehensive than the Python equivalent (100+ vs 8 mappings). This issue is about verifying coverage against real data rather than a major rewrite.

Source: `PRE_POST_CURATION_PLAN.md` section 12.3
