SAX-based streaming parser for large XML files (#68) #71
Conversation
Restores duckdb and extension-ci-tools from symlinks to correct submodule state. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Register streaming (BOOLEAN) and sax_threshold (UBIGINT, default 64MB) named parameters for read_xml. Parameters are parsed and stored in XMLSchemaOptions but have no behavioral effect yet — SAX implementation follows in subsequent commits. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Create xml_sax_reader.hpp/cpp with:
- SAXRecordAccumulator state machine (SEEKING → IN_RECORD → COMPLETE)
- SAX2 callbacks for startElementNs/endElementNs/characters/cdataBlock
- SAXStreamReader with push parser loop and chunked file reading
- InferSchemaFromStream using synthetic XML fragment approach
- AccumulatorToRow for converting accumulated data to DuckDB Values
- ConvertToValuePublic wrapper on XMLSchemaInference

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add SAX state fields to XMLReadGlobalState
- Add SAX/DOM mode selection in ReadDocumentFunction based on streaming param, sax_threshold, and file size
- Add SAX extraction loop alongside DOM extraction loop
- Add SAX schema inference path in ReadDocumentBind and ReadXMLBind
- Fix record collection: completed records are now collected immediately in the EndElementNs callback via SAXCallbackContext.completed_records, preventing records from being dropped when multiple records fit in a single 64KB parser chunk

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…tion

- Add test fixture with 200 records for behavioral equivalence testing
- Add edge case fixtures: empty file, single record, UTF-8, nullstr
- Test DOM vs SAX count and data equivalence
- Test type inference, attribute extraction, record_element in SAX mode
- Test sax_threshold=0 forcing SAX mode
- Test UTF-8 preservation (regression for #64)
- Test nullstr interaction with SAX mode
- Add bind-time validation: BinderException when streaming=true with complex XPath record_element (predicates, axes, functions)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…threshold

- Remove sax_threshold parameter (redundant with maximum_file_size)
- Change streaming default to true
- SAX mode now activates when file exceeds maximum_file_size (instead of erroring), providing a graceful fallback for large files
- streaming=false preserves original behavior (error on oversized files)
- Remove bind-time XPath validation (complex XPath + oversized file falls back to DOM error naturally)
- Update existing tests to use streaming=false where they test the file-size-limit error behavior
- Update docs to reflect new default and behavior

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Comprehensive test suite verifying SAX mode produces identical results to DOM mode for: datetime_format, record_element, large row counts (3000 rows across chunk boundaries), cross-record attribute discovery, aggregations, type inference, and attr_mode='discard'. Also fixes SAX fragment builder to respect attr_mode='discard' by skipping record-level attributes in synthetic XML. Known difference: SAX emits attributes as elements in synthetic XML, so numeric attribute values (like id) get type-inferred as INTEGER instead of VARCHAR. Documented in test comments. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pull request overview
This PR adds a SAX (libxml2 push-parser) code path to read_xml that is intended to handle oversized XML files by parsing records incrementally instead of building a full DOM, controlled by a new streaming parameter (default true).
Changes:
- Introduces `SAXStreamReader` + `SAXRecordAccumulator` and wires them into `read_xml` for schema inference and extraction when files exceed `maximum_file_size`.
- Adds a public wrapper `XMLSchemaInference::ConvertToValuePublic` to reuse existing scalar conversion logic from the SAX path.
- Adds documentation and SQL/XML fixtures + tests covering SAX-vs-DOM equivalence and parameter behavior.
Reviewed changes
Copilot reviewed 21 out of 23 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| src/xml_sax_reader.cpp | New SAX push-parser implementation for record accumulation, schema inference via synthetic XML, and row conversion |
| src/include/xml_sax_reader.hpp | Public interface/types for the SAX reader and accumulator |
| src/xml_reader_functions.cpp | Adds streaming parameter, selects SAX vs DOM, and integrates SAX schema inference + extraction |
| src/include/xml_reader_functions.hpp | Extends global state with SAX-related fields |
| src/xml_schema_inference.cpp | Adds ConvertToValuePublic wrapper |
| src/include/xml_schema_inference.hpp | Adds streaming option + declaration for ConvertToValuePublic |
| CMakeLists.txt | Builds the new SAX source file |
| docs/parameters.rst | Documents the new streaming parameter |
| docs/changelog.rst | Changelog entry for SAX streaming |
| README.md | Mentions streaming parameter |
| test/sql/*.test | Adds SAX-specific tests and updates existing max-file-size tests for streaming=false behavior |
| test/xml/sax_test_*.xml | Adds XML fixtures for SAX mode tests |
```cpp
// Deeper nested element: accumulate raw XML
acc->nested_depth = relative_depth;
acc->nested_xml += "<" + name;

// Add attributes to raw XML
for (int i = 0; i < nb_attributes; i++) {
	const char *attr_localname = reinterpret_cast<const char *>(attributes[i * 5]);
	const char *value_start = reinterpret_cast<const char *>(attributes[i * 5 + 3]);
	const char *value_end = reinterpret_cast<const char *>(attributes[i * 5 + 4]);
	std::string attr_value(value_start, value_end);
	acc->nested_xml += " " + std::string(attr_localname) + "=\"" + attr_value + "\"";
}
acc->nested_xml += ">";
```
nested_xml is reconstructed by concatenating tag/attribute strings and raw characters() data without escaping. Because libxml2 SAX callbacks provide decoded character data (e.g., `&amp;` arrives as `&`), this reconstruction can produce invalid XML whenever text or attribute values contain `&`, `<`, `>` or quotes, and can also mishandle namespaced attributes (the prefix is ignored here). Consider using xmlEncodeSpecialChars (or an existing XML escape helper) for text/attribute values and include prefixes consistently when namespaces='keep'.
```cpp
// Build a synthetic XML document from accumulated records
// This lets us reuse the existing DOM-based InferSchema
std::ostringstream xml;
xml << "<root>";

for (const auto &record : records) {
	xml << "<record>";

	// Emit record-level attributes as elements for schema inference
	// (skip when attr_mode='discard' to match DOM behavior)
	if (options.attr_mode != "discard") {
		for (const auto &attr : record.current_attributes) {
			// Skip child element attributes (contain a dot)
			if (attr.first.find('.') != std::string::npos) {
				continue;
			}
			xml << "<" << attr.first << ">" << attr.second << "</" << attr.first << ">";
		}
	}

	// Emit scalar values
	for (const auto &val : record.current_values) {
		xml << "<" << val.first << ">" << val.second << "</" << val.first << ">";
	}

	// Emit list values as repeated elements
	for (const auto &list : record.current_lists) {
		for (const auto &item : list.second) {
			xml << "<" << list.first << ">" << item << "</" << list.first << ">";
		}
	}
```
When building the synthetic XML for schema inference, scalar/list/attribute values are inserted directly into element bodies. If a value contains XML-special characters (e.g., `&`, `<`), the synthetic document becomes malformed and InferSchema may fail or infer incorrectly. Please escape text content when emitting `<col>...</col>` (while still allowing the intentional nested-XML case, if needed, by explicitly distinguishing "raw XML" vs "text").
```cpp
// Strip "//" prefix if present
if (tag.size() >= 2 && tag[0] == '/' && tag[1] == '/') {
	tag = tag.substr(2);
}
```
record_element handling only strips a leading // and otherwise treats the string as a literal tag name. If record_element contains a path (e.g., /root/item or a/b), record_tag will never match element_name, resulting in zero rows. Since SAX mode only supports simple tag-name matching, consider validating options.record_element here (reject/flag any XPath tokens like '/', '@', '*', predicates, etc.) so failures are explicit and consistent with the DOM fallback logic.
Suggested change:

```cpp
// Strip "//" prefix if present (common XPath-style shorthand)
if (tag.size() >= 2 && tag[0] == '/' && tag[1] == '/') {
	tag = tag.substr(2);
}
// SAX mode only supports simple element-name matching, not full XPath.
// Reject obvious XPath tokens so failures are explicit and consistent
// with the DOM fallback logic.
if (tag.find_first_of("/@*[]") != std::string::npos) {
	throw InvalidInputException(
	    "XML SAX reader: record_element '%s' appears to be an XPath expression "
	    "and is not supported in SAX mode. Please specify a simple element name "
	    "or use a mode that supports XPath.",
	    options.record_element);
}
```
src/xml_sax_reader.cpp (outdated)
```cpp
xmlSAXHandler handler;
memset(&handler, 0, sizeof(handler));
```
memset is used here but <cstring> isn’t included in this translation unit. It may compile incidentally via transitive includes, but it’s not guaranteed and can break on stricter toolchains. Add the appropriate header (or value-initialize xmlSAXHandler handler{} instead of using memset).
Suggested change:

```cpp
xmlSAXHandler handler{};
```
```cpp
if (use_sax) {
	// SAX mode: read all records via SAX push parser
	gstate.sax_records = SAXStreamReader::ReadRecords(filename, schema_options);
	gstate.sax_record_index = 0;
	gstate.file_loaded = true;
```
In SAX mode this loads all records into gstate.sax_records via SAXStreamReader::ReadRecords(...) before producing any output. For very large files this can still be O(file size) (or worse) memory, defeating the intended streaming/low-memory behavior and potentially reintroducing OOMs. Consider keeping the libxml2 push-parser context in XMLReadGlobalState and incrementally parsing chunks until either (a) STANDARD_VECTOR_SIZE rows are ready or (b) EOF, emitting rows as soon as a record completes instead of accumulating the entire file in a vector.
src/xml_reader_functions.cpp (outdated)
```cpp
bool has_complex_xpath = false;
if (!schema_options.record_element.empty()) {
	const auto &re = schema_options.record_element;
	has_complex_xpath = (re.find('[') != std::string::npos ||
	                     re.find("::") != std::string::npos ||
	                     re.find('(') != std::string::npos);
}
if (!has_complex_xpath) {
	use_sax = true;
}
```
The has_complex_xpath heuristic doesn’t treat path expressions (e.g., /root/entry, a/b, .//item) as complex. If a user supplies such an XPath in record_element, use_sax may be enabled but MatchesRecordTag will never match, yielding 0 rows without an error. Please expand the complexity check (at least detect / beyond a leading //, plus other XPath tokens like @, *, ..) and fall back to DOM (or throw a clear error) when SAX can’t evaluate the expression.
src/xml_sax_reader.cpp (outdated)
```cpp
// Open file and read in chunks
std::ifstream file(filename, std::ios::binary);
if (!file.is_open()) {
	throw IOException("Could not open file '%s' for SAX parsing", filename);
}
```
ReadRecords opens files with std::ifstream, bypassing DuckDB’s FileSystem abstraction used elsewhere (fs.OpenFile). This will break reading from non-local filesystems (e.g., http/s3/zip/vfs) and can behave inconsistently with DuckDB path handling. Consider changing ReadRecords (and InferSchemaFromStream) to read via a FileHandle/FileSystem provided by the caller, or pass in a read callback so streaming works for all supported backends.
```cpp
} else if (relative_depth == 1) {
	// Closing a direct child of the record
	std::string value = acc->current_text;
	if (!acc->nested_xml.empty()) {
		// If we accumulated nested XML, use that instead
		value = acc->nested_xml;
		acc->nested_xml.clear();
	}
```
SAX path stores acc->current_text verbatim (including indentation/newlines and surrounding whitespace). DOM extraction runs CleanTextContent(...) before type conversion; without similar trimming here, numeric/date/boolean conversion can fail or produce different results for pretty-printed XML like <value>\n 10\n</value>. Consider normalizing/cleaning text content before storing or right before calling ConvertToValuePublic (and do the same for nested text when appropriate).
Code review found 9 issues (all line references at commit 074a8a3):

- duckdb_webbed/src/xml_sax_reader.cpp, lines 384–399
- duckdb_webbed/src/xml_sax_reader.cpp, lines 321–323
- duckdb_webbed/src/xml_sax_reader.cpp, lines 344–355
- duckdb_webbed/src/xml_sax_reader.cpp, lines 87–94 and 179–187
- duckdb_webbed/src/xml_sax_reader.cpp, lines 67–78; duckdb_webbed/src/xml_reader_functions.cpp, lines 451–461
- duckdb_webbed/src/xml_sax_reader.cpp, lines 200–202; duckdb_webbed/src/xml_reader_functions.cpp, lines 726–728
- duckdb_webbed/src/xml_sax_reader.cpp, lines 379–387
- duckdb_webbed/src/xml_sax_reader.cpp, lines 57–64
- duckdb_webbed/src/xml_sax_reader.cpp, lines 495–503

🤖 Generated with Claude Code
Fixes all 8 issues from Copilot review:
1. Incremental streaming: push parser context now persists in
XMLReadGlobalState across scan calls. Records are emitted as they
complete instead of loading the entire file into memory.
2. XML escaping: add XmlEscapeText/XmlEscapeAttr helpers. Applied to
nested XML reconstruction, attribute values, and synthetic XML
fragment builder. Fixes malformed XML on special characters.
3. Whitespace trimming: add TrimWhitespace() matching DOM's
CleanTextContent() (UTF-8-safe). Applied when storing scalar values
in EndElementNs. Fixes type inference failures on pretty-printed XML.
4. DuckDB FileSystem: replace std::ifstream with FileSystem::OpenFile()
+ FileHandle::Read(). Supports S3, HTTP, VFS backends.
5. Path-style XPath detection: HasComplexXPath() now detects /, @, *,
[], (), :, . after stripping leading //. Falls back to DOM or errors
if file too big for DOM.
6. SAX record_element validation: ReadRecords throws
InvalidInputException if stripped tag contains XPath tokens.
7. memset -> value init: xmlSAXHandler handler{}.
8. Remove <fstream> include (no longer needed).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Resolved conflicts in xml_schema_inference.hpp (streaming default), xml_reader_functions.cpp (sax_threshold removal, streaming extraction), and xml_sax_streaming.test (full test suite). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
- SAX streaming parser built on the libxml2 push parser (`xmlCreatePushParserCtxt` + `xmlParseChunk`)
- Files exceeding `maximum_file_size` automatically use SAX mode instead of erroring, reducing peak memory from ~4x file size (DOM) to proportional to a single record
- Controlled by the `streaming` parameter (default: `true`); set `streaming := false` for the original behavior (error on oversized files)
- Schema inference reuses `InferSchema` via synthetic XML fragments built from SAX-accumulated sample records

How it works
Two-pass approach:
1. Inference pass: accumulate a sample of records via SAX, build a synthetic XML fragment, and reuse `InferSchema()`
2. Extraction pass: accumulate each record and convert it to DuckDB values via `AccumulatorToRow()`

Mode selection:
- `streaming=true` (default) + file ≤ `maximum_file_size` → DOM (fast for small files)
- `streaming=true` (default) + file > `maximum_file_size` → SAX (graceful fallback)
- `streaming=false` + file > `maximum_file_size` → error (original behavior)
- Complex XPath `record_element` (predicates, axes) always uses DOM

Known limitations
- SAX emits attributes as elements in the synthetic inference fragment, so numeric attribute values (e.g. `id="1"`) get type-inferred as INTEGER instead of VARCHAR (DOM keeps attributes as VARCHAR)

Test plan
Closes #68
🤖 Generated with Claude Code