feat(0.4.0): dataset upload + streaming #2
Mirrors httpx.Response.raise_for_status so callers can opt into exception semantics on mid-stream failures instead of having to remember to check .error before consuming .results.
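A minimal sketch of the opt-in pattern described above, assuming a dataclass-style result object. The field types and the `DatasetStreamError` exception class are hypothetical stand-ins; the PR does not show what the real method raises.

```python
from dataclasses import dataclass, field
from typing import List, Optional


class DatasetStreamError(RuntimeError):
    """Hypothetical exception type; the real library may raise something else."""


@dataclass
class DatasetMappingResult:
    # Field names follow the PR text; the types are assumptions for this sketch.
    results: List[dict] = field(default_factory=list)
    error: Optional[str] = None

    def raise_for_error(self) -> None:
        """Raise if the stream failed part-way; do nothing on a clean finish."""
        if self.error is not None:
            raise DatasetStreamError(self.error)


# A clean run and a run that broke mid-stream but kept its partial results.
ok = DatasetMappingResult(results=[{"name": "glucose"}])
partial = DatasetMappingResult(results=[{"name": "glucose"}], error="connection reset")

ok.raise_for_error()  # no-op: nothing went wrong
try:
    partial.raise_for_error()
except DatasetStreamError as exc:
    print(f"stream failed after {len(partial.results)} result(s): {exc}")
```

Callers who prefer to inspect `.error` themselves can skip the call; the partial results stay available either way.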
Uploads TSV/CSV via multipart to POST /map/dataset/stream and yields MappingResult per NDJSON line. Uses AsyncExitStack to ensure the HTTP stream closes before the file handle, so cancellation mid-handshake cannot leave httpx reading from a closed fh. Rejects commas in column names and annotator lists at the boundary to prevent silent wire-format splitting.
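The close ordering relies on `AsyncExitStack` unwinding in last-in, first-out order. The sketch below illustrates this with stand-in context managers rather than a real httpx stream, and pairs it with a hypothetical `join_columns` helper showing the comma guard. It assumes the column list travels as one comma-joined value, which the PR implies but does not spell out.

```python
import asyncio
import contextlib

closed = []  # records the order in which resources are released


@contextlib.contextmanager
def fake_file():
    # Stand-in for the open upload file handle.
    try:
        yield "fh"
    finally:
        closed.append("file")


@contextlib.asynccontextmanager
async def fake_stream():
    # Stand-in for the streaming HTTP response (not the real client call).
    try:
        yield "stream"
    finally:
        closed.append("stream")


def join_columns(columns):
    """Hypothetical helper: reject commas before joining into one value."""
    for col in columns:
        if "," in col:
            raise ValueError(f"column name may not contain a comma: {col!r}")
    return ",".join(columns)


async def demo():
    async with contextlib.AsyncExitStack() as stack:
        stack.enter_context(fake_file())                # entered first, closed last
        await stack.enter_async_context(fake_stream())  # entered last, closed first


asyncio.run(demo())
print(closed)
```

Because the file is registered before the stream, the stream is always torn down while the file handle is still open, including when the block exits through cancellation.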
Live-API spot check confirmed the /map/dataset/stream endpoint emits a slimmer per-row shape than /map/batch — no assigned_ids block means no scores for downstream extraction.
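To make the slimmer shape concrete, here is a rough sketch of parsing such NDJSON lines permissively. The key names come from the spot check reported in this PR; the sample values and the `parse_row` helper are invented for illustration and are not the library's parser.

```python
import json

# Sample lines shaped like the per-row payload from the spot check; values are made up.
ndjson_lines = [
    '{"row_index": 0, "name": "glucose", "chosen_kg_id": "KG:1", "curies": ["X:1"], "kg_ids": ["KG:1"]}',
    "",
    '{"row_index": 1, "name": "lactate", "chosen_kg_id": null, "curies": [], "kg_ids": []}',
]


def parse_row(line):
    """Parse one NDJSON line, ignoring keys this sketch does not model."""
    raw = json.loads(line)
    return {
        "name": raw.get("name"),
        "chosen_kg_id": raw.get("chosen_kg_id"),
        # Score extraction needs an assigned_ids block. Its layout is not shown
        # in this PR, so the sketch only reports whether one was present.
        "has_assigned_ids": "assigned_ids" in raw,
    }


# Blank lines are skipped rather than treated as parse errors.
rows = [parse_row(line) for line in ndjson_lines if line.strip()]
for row in rows:
    print(row)
```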
Blocking sync entry point that wraps map_dataset_file_iter with tqdm progress, on_result callback, and asyncio.run bridge. Uses explicit __anext__ iteration so the transport-error try/except wraps only the iterator call — on_result exceptions are positioned outside it and propagate unwrapped. contextlib.aclosing guarantees the generator's file/stream contexts unwind on early break or callback-raised exception rather than waiting for GC. Initial-request errors propagate unchanged; mid-stream failures are captured into DatasetMappingResult.error. The discriminator is a streaming_started flag rather than exception type, so future timeout-wrapping refactors cannot turn a captured mid-stream failure into a propagated one.
- README gets a 20-line Dataset upload subsection with map_dataset_file_sync example including raise_for_error, highlighting the required name_column / provided_id_columns contract.
- Notebook adds two cells: sync + tqdm + raise_for_error teaches the check-before-use idiom; async iterator demonstrates per-result streaming against the live API.
- Fixture lives at tests/fixtures/metabolites_sample.tsv (10 rows) so both test suite and notebook share a single canonical sample.
Greptile Summary
This PR introduces dataset file streaming over
Confidence Score: 5/5
Safe to merge; the only finding is a minor type annotation mismatch that does not affect runtime behavior.
All findings are P2. The exception-handling contract, resource cleanup, and comma-validation guard are correctly implemented and well-tested. The timeout annotation issue is a cosmetic/mypy concern with no runtime impact.
src/ddharmon/client.py: the timeout annotation on init is the only outstanding item.
Important Files Changed
Sequence Diagram
sequenceDiagram
participant Caller
participant map_dataset_file_sync
participant BioMapperClient
participant map_dataset_file_iter
participant BioMapper API
Caller->>map_dataset_file_sync: path, name_column, ...
map_dataset_file_sync->>BioMapperClient: async with (asyncio.run)
BioMapperClient->>map_dataset_file_iter: aclosing(gen)
map_dataset_file_iter->>BioMapper API: POST /map/dataset/stream (multipart)
BioMapper API-->>map_dataset_file_iter: 200 OK + NDJSON stream
loop per NDJSON line
BioMapper API-->>map_dataset_file_iter: line
map_dataset_file_iter-->>map_dataset_file_sync: yield MappingResult
map_dataset_file_sync->>map_dataset_file_sync: append + streaming_started=True
opt on_result
map_dataset_file_sync->>Caller: on_result(r) callback
end
end
alt clean finish (StopAsyncIteration)
map_dataset_file_sync-->>Caller: DatasetMappingResult(results, error=None)
else mid-stream transport error (streaming_started=True)
map_dataset_file_sync-->>Caller: DatasetMappingResult(results, error=str(exc))
else initial-request error (streaming_started=False)
map_dataset_file_sync-->>Caller: raises unwrapped (BioMapperAuthError, etc.)
end
Reviews (2): Last reviewed commit: "ci: add GitHub Actions workflow + clean ..."
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Retain comments while modifying iterator handling.
The previous 'Fix comment formatting' commit de-indented the while loop by one level too many, placing it outside the async with block. Python refused to parse the file (IndentationError: expected an indented block after 'with' statement on line 172). Re-indent the loop and its comments to sit inside the async with, restoring import and test parity.
Workflow runs ruff, mypy --strict, and pytest on every push and pull request across all branches, with a matrix over Python 3.11 and 3.12 (matching pyproject classifiers). Concurrency group cancels superseded runs on the same ref so pushing again doesn't double-queue.
Also cleans up 13 preexisting ruff warnings so CI lands green on the first run:
- Auto-fixed 9 (unused math import in test_export.py; 8 UP037 quoted-type-annotations in test_metabolon.py now that from __future__ import annotations is effective).
- Refactored try/except ValueError: pass in _raise_for_status to contextlib.suppress(ValueError) for clarity.
- Kept three idiomatic uses of typing.Any with short noqa justifications: **httpx_kwargs forwarded verbatim to httpx.AsyncClient; *args on __aexit__ (the standard (exc_type, exc, tb) tuple); and results_to_dataframe return type (pandas is an optional extra).
Summary
Adds file-based mapping backed by POST /map/dataset/stream (NDJSON). Two public entry points:
- BioMapperClient.map_dataset_file_iter(path, ...): async method yielding MappingResult per NDJSON line as it arrives. For live dashboards / per-result streaming (Entity Linker UI).
- map_dataset_file_sync(path, *, progress=False, on_result=None, total_hint=None, ...): free function in new src/ddharmon/dataset.py. Blocks, returns a completed DatasetMappingResult. For notebooks + scripts.
Plan: ~/.claude/plans/2026-04-14-001-feat-ddharmon-0.4.0-dataset-upload-streaming-plan.md
What changed
- src/ddharmon/models.py: DatasetMappingResult with .raise_for_error() (mirrors httpx.Response.raise_for_status)
- src/ddharmon/client.py: map_dataset_file_iter; contextlib.AsyncExitStack ensures stream closes before file handle on cancellation; _dataset_query_params rejects commas in column/annotator names to prevent silent wire-format splitting
- src/ddharmon/dataset.py: map_dataset_file_sync. Explicit __anext__ loop so on_result exceptions propagate unwrapped while transport errors get captured into .error
- src/ddharmon/__init__.py + pyproject.toml: 0.3.0 to 0.4.0; export new symbols
- tests/fixtures/metabolites_sample.tsv
Scope tightening from the plan draft
Review (scope-guardian, product-lens, adversarial) surfaced that the originally-proposed third entry point (map_dataset_file async collector) had no named user: the UI uses the iterator directly, and the notebook uses the sync wrapper. Dropped, along with free-function wrappers around it. dataset.py contains only map_dataset_file_sync. Async callers wanting a list write [r async for r in c.map_dataset_file_iter(...)], a one-liner.
Live-API spot check findings (affected the implementation)
The plan had three Deferred-to-Implementation questions; ran a 5-row upload against the live API to resolve:
- provided_id_columns wire format
- Per-row shape: {row_index, name, chosen_kg_id, curies, kg_ids}, with no assigned_ids block. Our permissive RawApiResult handles the extra row_index fine (silently ignored); the missing assigned_ids means confidence_score is always None for dataset results. Docstring updated to warn callers.
- DatasetMappingResult.stats stays {} (documented)
- Content type: application/x-ndjson (OpenAPI schema declared application/json); we read aiter_lines() regardless, so this doesn't affect behavior
Exception-handling contract
The sync wrapper uses a streaming_started boolean flag (not exception type) to discriminate:
- Initial-request errors (BioMapperAuthError, BioMapperRateLimitError, BioMapperServerError, BioMapperTimeoutError, httpx.HTTPStatusError): propagate unwrapped
- Mid-stream transport errors: captured into DatasetMappingResult.error; partial results preserved in .results
- Per-result errors: kept as MappingResult.error values; stream continues
- Callback (on_result) exceptions: propagate unwrapped, replace return value (partial results NOT returned)
- asyncio.CancelledError: always propagates unwrapped
Flag-based discriminator is exception-type-agnostic: future refactors that wrap mid-stream timeouts as typed errors won't silently flip semantics.
Quality
- mypy --strict clean across all 10 source files
- Live-API spot check against biomapper.expertintheloop.io/api/v1/map/dataset/stream with a 5-row TSV
Test plan
- poetry run pytest: 138 pass
- poetry run mypy src/ddharmon/: clean
- poetry run jupyter nbconvert --to notebook --execute: clean
- map_dataset_file_iter: verified row order, resolution, no spurious <unknown> parse errors
Pre-merge checklist
Generated with Claude Code.