SAX-based streaming parser for large XML files (#68) #71
Conversation
Restores duckdb and extension-ci-tools from symlinks to correct submodule state. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Register streaming (BOOLEAN) and sax_threshold (UBIGINT, default 64MB) named parameters for read_xml. Parameters are parsed and stored in XMLSchemaOptions but have no behavioral effect yet — SAX implementation follows in subsequent commits. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Create xml_sax_reader.hpp/cpp with:
- SAXRecordAccumulator state machine (SEEKING → IN_RECORD → COMPLETE)
- SAX2 callbacks for startElementNs/endElementNs/characters/cdataBlock
- SAXStreamReader with push parser loop and chunked file reading
- InferSchemaFromStream using synthetic XML fragment approach
- AccumulatorToRow for converting accumulated data to DuckDB Values
- ConvertToValuePublic wrapper on XMLSchemaInference

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add SAX state fields to XMLReadGlobalState
- Add SAX/DOM mode selection in ReadDocumentFunction based on streaming param, sax_threshold, and file size
- Add SAX extraction loop alongside DOM extraction loop
- Add SAX schema inference path in ReadDocumentBind and ReadXMLBind
- Fix record collection: completed records are now collected immediately in the EndElementNs callback via SAXCallbackContext.completed_records, preventing records from being dropped when multiple records fit in a single 64KB parser chunk

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…tion

- Add test fixture with 200 records for behavioral equivalence testing
- Add edge case fixtures: empty file, single record, UTF-8, nullstr
- Test DOM vs SAX count and data equivalence
- Test type inference, attribute extraction, record_element in SAX mode
- Test sax_threshold=0 forcing SAX mode
- Test UTF-8 preservation (regression for #64)
- Test nullstr interaction with SAX mode
- Add bind-time validation: BinderException when streaming=true with complex XPath record_element (predicates, axes, functions)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…threshold

- Remove sax_threshold parameter (redundant with maximum_file_size)
- Change streaming default to true
- SAX mode now activates when file exceeds maximum_file_size (instead of erroring), providing a graceful fallback for large files
- streaming=false preserves original behavior (error on oversized files)
- Remove bind-time XPath validation (complex XPath + oversized file falls back to DOM error naturally)
- Update existing tests to use streaming=false where they test the file-size-limit error behavior
- Update docs to reflect new default and behavior

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Comprehensive test suite verifying SAX mode produces identical results to DOM mode for: datetime_format, record_element, large row counts (3000 rows across chunk boundaries), cross-record attribute discovery, aggregations, type inference, and attr_mode='discard'. Also fixes SAX fragment builder to respect attr_mode='discard' by skipping record-level attributes in synthetic XML. Known difference: SAX emits attributes as elements in synthetic XML, so numeric attribute values (like id) get type-inferred as INTEGER instead of VARCHAR. Documented in test comments. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pull request overview
This PR adds a SAX (libxml2 push-parser) code path to read_xml that is intended to handle oversized XML files by parsing records incrementally instead of building a full DOM, controlled by a new streaming parameter (default true).
Changes:
- Introduces `SAXStreamReader` + `SAXRecordAccumulator` and wires them into `read_xml` for schema inference and extraction when files exceed `maximum_file_size`.
- Adds a public wrapper `XMLSchemaInference::ConvertToValuePublic` to reuse existing scalar conversion logic from the SAX path.
- Adds documentation and SQL/XML fixtures + tests covering SAX-vs-DOM equivalence and parameter behavior.
Reviewed changes
Copilot reviewed 21 out of 23 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| src/xml_sax_reader.cpp | New SAX push-parser implementation for record accumulation, schema inference via synthetic XML, and row conversion |
| src/include/xml_sax_reader.hpp | Public interface/types for the SAX reader and accumulator |
| src/xml_reader_functions.cpp | Adds streaming parameter, selects SAX vs DOM, and integrates SAX schema inference + extraction |
| src/include/xml_reader_functions.hpp | Extends global state with SAX-related fields |
| src/xml_schema_inference.cpp | Adds ConvertToValuePublic wrapper |
| src/include/xml_schema_inference.hpp | Adds streaming option + declaration for ConvertToValuePublic |
| CMakeLists.txt | Builds the new SAX source file |
| docs/parameters.rst | Documents the new streaming parameter |
| docs/changelog.rst | Changelog entry for SAX streaming |
| README.md | Mentions streaming parameter |
| test/sql/*.test | Adds SAX-specific tests and updates existing max-file-size tests for streaming=false behavior |
| test/xml/sax_test_*.xml | Adds XML fixtures for SAX mode tests |
```cpp
// Deeper nested element: accumulate raw XML
acc->nested_depth = relative_depth;
acc->nested_xml += "<" + name;

// Add attributes to raw XML
for (int i = 0; i < nb_attributes; i++) {
	const char *attr_localname = reinterpret_cast<const char *>(attributes[i * 5]);
	const char *value_start = reinterpret_cast<const char *>(attributes[i * 5 + 3]);
	const char *value_end = reinterpret_cast<const char *>(attributes[i * 5 + 4]);
	std::string attr_value(value_start, value_end);
	acc->nested_xml += " " + std::string(attr_localname) + "=\"" + attr_value + "\"";
}
acc->nested_xml += ">";
```
nested_xml is reconstructed by concatenating tag/attribute strings and raw characters() data without escaping. Because libxml2 SAX callbacks provide decoded character data (e.g., `&amp;` arrives as `&`), this reconstruction can produce invalid XML whenever text or attribute values contain `&`, `<`, `>` or quotes, and can also mishandle namespaced attributes (the prefix is ignored here). Consider using xmlEncodeSpecialChars (or an existing XML escape helper) for text/attribute values and include prefixes consistently when namespaces='keep'.
```cpp
// Build a synthetic XML document from accumulated records
// This lets us reuse the existing DOM-based InferSchema
std::ostringstream xml;
xml << "<root>";

for (const auto &record : records) {
	xml << "<record>";

	// Emit record-level attributes as elements for schema inference
	// (skip when attr_mode='discard' to match DOM behavior)
	if (options.attr_mode != "discard") {
		for (const auto &attr : record.current_attributes) {
			// Skip child element attributes (contain a dot)
			if (attr.first.find('.') != std::string::npos) {
				continue;
			}
			xml << "<" << attr.first << ">" << attr.second << "</" << attr.first << ">";
		}
	}

	// Emit scalar values
	for (const auto &val : record.current_values) {
		xml << "<" << val.first << ">" << val.second << "</" << val.first << ">";
	}

	// Emit list values as repeated elements
	for (const auto &list : record.current_lists) {
		for (const auto &item : list.second) {
			xml << "<" << list.first << ">" << item << "</" << list.first << ">";
		}
	}
```
When building the synthetic XML for schema inference, scalar/list/attribute values are inserted directly into element bodies. If a value contains XML-special characters (e.g., `&`, `<`), the synthetic document becomes malformed and InferSchema may fail or infer incorrectly. Please escape text content when emitting `<col>...</col>` (while still allowing the intentional nested-XML case, if needed, by explicitly distinguishing "raw XML" vs "text").
```cpp
// Strip "//" prefix if present
if (tag.size() >= 2 && tag[0] == '/' && tag[1] == '/') {
	tag = tag.substr(2);
}
```
record_element handling only strips a leading // and otherwise treats the string as a literal tag name. If record_element contains a path (e.g., /root/item or a/b), record_tag will never match element_name, resulting in zero rows. Since SAX mode only supports simple tag-name matching, consider validating options.record_element here (reject/flag any XPath tokens like '/', '@', '*', predicates, etc.) so failures are explicit and consistent with the DOM fallback logic.
Suggested change:

```cpp
// Strip "//" prefix if present (common XPath-style shorthand)
if (tag.size() >= 2 && tag[0] == '/' && tag[1] == '/') {
	tag = tag.substr(2);
}
// SAX mode only supports simple element-name matching, not full XPath.
// Reject obvious XPath tokens so failures are explicit and consistent
// with the DOM fallback logic.
if (tag.find_first_of("/@*[]") != std::string::npos) {
	throw InvalidInputException(
	    "XML SAX reader: record_element '%s' appears to be an XPath expression "
	    "and is not supported in SAX mode. Please specify a simple element name "
	    "or use a mode that supports XPath.",
	    options.record_element);
}
```
src/xml_sax_reader.cpp (outdated)
```cpp
xmlSAXHandler handler;
memset(&handler, 0, sizeof(handler));
```
memset is used here but <cstring> isn’t included in this translation unit. It may compile incidentally via transitive includes, but it’s not guaranteed and can break on stricter toolchains. Add the appropriate header (or value-initialize xmlSAXHandler handler{} instead of using memset).
Suggested change:

```cpp
xmlSAXHandler handler{};
```
```cpp
if (use_sax) {
	// SAX mode: read all records via SAX push parser
	gstate.sax_records = SAXStreamReader::ReadRecords(filename, schema_options);
	gstate.sax_record_index = 0;
	gstate.file_loaded = true;
```
In SAX mode this loads all records into gstate.sax_records via SAXStreamReader::ReadRecords(...) before producing any output. For very large files this can still be O(file size) (or worse) memory, defeating the intended streaming/low-memory behavior and potentially reintroducing OOMs. Consider keeping the libxml2 push-parser context in XMLReadGlobalState and incrementally parsing chunks until either (a) STANDARD_VECTOR_SIZE rows are ready or (b) EOF, emitting rows as soon as a record completes instead of accumulating the entire file in a vector.
src/xml_reader_functions.cpp (outdated)
```cpp
bool has_complex_xpath = false;
if (!schema_options.record_element.empty()) {
	const auto &re = schema_options.record_element;
	has_complex_xpath = (re.find('[') != std::string::npos ||
	                     re.find("::") != std::string::npos ||
	                     re.find('(') != std::string::npos);
}
if (!has_complex_xpath) {
	use_sax = true;
}
```
The has_complex_xpath heuristic doesn’t treat path expressions (e.g., /root/entry, a/b, .//item) as complex. If a user supplies such an XPath in record_element, use_sax may be enabled but MatchesRecordTag will never match, yielding 0 rows without an error. Please expand the complexity check (at least detect / beyond a leading //, plus other XPath tokens like @, *, ..) and fall back to DOM (or throw a clear error) when SAX can’t evaluate the expression.
src/xml_sax_reader.cpp (outdated)
```cpp
// Open file and read in chunks
std::ifstream file(filename, std::ios::binary);
if (!file.is_open()) {
	throw IOException("Could not open file '%s' for SAX parsing", filename);
}
```
ReadRecords opens files with std::ifstream, bypassing DuckDB’s FileSystem abstraction used elsewhere (fs.OpenFile). This will break reading from non-local filesystems (e.g., http/s3/zip/vfs) and can behave inconsistently with DuckDB path handling. Consider changing ReadRecords (and InferSchemaFromStream) to read via a FileHandle/FileSystem provided by the caller, or pass in a read callback so streaming works for all supported backends.
```cpp
} else if (relative_depth == 1) {
	// Closing a direct child of the record
	std::string value = acc->current_text;
	if (!acc->nested_xml.empty()) {
		// If we accumulated nested XML, use that instead
		value = acc->nested_xml;
		acc->nested_xml.clear();
	}
```
SAX path stores acc->current_text verbatim (including indentation/newlines and surrounding whitespace). DOM extraction runs CleanTextContent(...) before type conversion; without similar trimming here, numeric/date/boolean conversion can fail or produce different results for pretty-printed XML like <value>\n 10\n</value>. Consider normalizing/cleaning text content before storing or right before calling ConvertToValuePublic (and do the same for nested text when appropriate).
Code review found 9 issues (all line references at commit 074a8a3):

- duckdb_webbed/src/xml_sax_reader.cpp, lines 384–399
- duckdb_webbed/src/xml_sax_reader.cpp, lines 321–323
- duckdb_webbed/src/xml_sax_reader.cpp, lines 344–355
- duckdb_webbed/src/xml_sax_reader.cpp, lines 87–94 and 179–187
- duckdb_webbed/src/xml_sax_reader.cpp, lines 67–78; duckdb_webbed/src/xml_reader_functions.cpp, lines 451–461
- duckdb_webbed/src/xml_sax_reader.cpp, lines 200–202; duckdb_webbed/src/xml_reader_functions.cpp, lines 726–728
- duckdb_webbed/src/xml_sax_reader.cpp, lines 379–387
- duckdb_webbed/src/xml_sax_reader.cpp, lines 57–64
- duckdb_webbed/src/xml_sax_reader.cpp, lines 495–503

🤖 Generated with Claude Code
Fixes all 8 issues from Copilot review:
1. Incremental streaming: push parser context now persists in
XMLReadGlobalState across scan calls. Records are emitted as they
complete instead of loading the entire file into memory.
2. XML escaping: add XmlEscapeText/XmlEscapeAttr helpers. Applied to
nested XML reconstruction, attribute values, and synthetic XML
fragment builder. Fixes malformed XML on special characters.
3. Whitespace trimming: add TrimWhitespace() matching DOM's
CleanTextContent() (UTF-8-safe). Applied when storing scalar values
in EndElementNs. Fixes type inference failures on pretty-printed XML.
4. DuckDB FileSystem: replace std::ifstream with FileSystem::OpenFile()
+ FileHandle::Read(). Supports S3, HTTP, VFS backends.
5. Path-style XPath detection: HasComplexXPath() now detects /, @, *,
[], (), :, . after stripping leading //. Falls back to DOM or errors
if file too big for DOM.
6. SAX record_element validation: ReadRecords throws
InvalidInputException if stripped tag contains XPath tokens.
7. memset -> value init: xmlSAXHandler handler{}.
8. Remove <fstream> include (no longer needed).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Resolved conflicts in xml_schema_inference.hpp (streaming default), xml_reader_functions.cpp (sax_threshold removal, streaming extraction), and xml_sax_streaming.test (full test suite). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
- SAX streaming parser built on the libxml2 push parser (`xmlCreatePushParserCtxt` + `xmlParseChunk`)
- Files exceeding `maximum_file_size` automatically use SAX mode instead of erroring, reducing peak memory from ~4x file size (DOM) to proportional to a single record
- Controlled by the `streaming` parameter (default: `true`); set `streaming := false` for the original behavior (error on oversized files)
- Schema inference reuses `InferSchema` via synthetic XML fragments built from SAX-accumulated sample records

How it works
Two-pass approach:
1. Inference pass: accumulate a sample of records via SAX, build a synthetic XML fragment, and reuse `InferSchema()`
2. Extraction pass: accumulate each record and convert it to DuckDB values via `AccumulatorToRow()`

Mode selection:
- `streaming=true` (default) + file ≤ `maximum_file_size` → DOM (fast for small files)
- `streaming=true` (default) + file > `maximum_file_size` → SAX (graceful fallback)
- `streaming=false` + file > `maximum_file_size` → error (original behavior)
- Complex XPath `record_element` (predicates, axes) always uses DOM

Known limitations
- SAX emits attributes as elements in the synthetic inference fragment, so numeric attribute values (e.g. `id="1"`) get type-inferred as INTEGER instead of VARCHAR (DOM keeps attributes as VARCHAR)

Test plan
Closes #68
🤖 Generated with Claude Code