SAX-based streaming parser for large XML files (#68)#71

Merged
teaguesterling merged 9 commits into main from feature/sax-parser on Mar 29, 2026

Conversation

@teaguesterling
Owner

Summary

  • Adds SAX-based streaming XML parser using libxml2's push parser API (xmlCreatePushParserCtxt + xmlParseChunk)
  • Files exceeding maximum_file_size automatically use SAX mode instead of erroring, reducing peak memory from roughly 4x the file size (DOM) to roughly the size of a single record
  • Controlled by new streaming parameter (default: true). Set streaming:=false for original behavior (error on oversized files)
  • SAX mode reads files in 64KB chunks with a state machine accumulator (SEEKING_RECORD → IN_RECORD → RECORD_COMPLETE)
  • Schema inference reuses existing InferSchema via synthetic XML fragments built from SAX-accumulated sample records
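The state machine in the bullets above can be reduced to a small sketch. This is illustrative only (a hypothetical `MiniAccumulator`, not the PR's `SAXRecordAccumulator`, which also tracks text content, attributes, and lists):

```cpp
#include <string>

// Hypothetical reduction of the accumulator state machine; the PR's
// SAXRecordAccumulator also tracks text content, attributes, and lists.
enum class AccState { SEEKING_RECORD, IN_RECORD, RECORD_COMPLETE };

struct MiniAccumulator {
    AccState state = AccState::SEEKING_RECORD;
    int depth = 0;              // depth relative to the record element
    std::string record_tag;     // element name that delimits one record

    void OnStartElement(const std::string &name) {
        if (state == AccState::SEEKING_RECORD && name == record_tag) {
            state = AccState::IN_RECORD;
            depth = 0;
        } else if (state == AccState::IN_RECORD) {
            depth++;            // descending into a child of the record
        }
    }

    void OnEndElement(const std::string &name) {
        if (state != AccState::IN_RECORD) {
            return;
        }
        if (depth == 0 && name == record_tag) {
            state = AccState::RECORD_COMPLETE;   // one full record accumulated
        } else {
            depth--;
        }
    }
};
```

Driving this from SAX start/end-element callbacks yields one completed record per RECORD_COMPLETE transition, independent of file size.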

How it works

Two-pass approach:

  1. Schema inference pass: SAX-accumulate first N records, build synthetic XML fragments, feed to existing InferSchema()
  2. Extraction pass: SAX-stream records one at a time, convert to DuckDB Values via AccumulatorToRow()
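The synthetic-fragment trick in the inference pass can be sketched with a hypothetical helper (the real builder also handles attributes, lists, and entity escaping):

```cpp
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Illustrative helper (not the extension's API): wrap accumulated
// (column -> text) samples in <root>/<record> so a DOM-based schema
// inferencer can be reused. Real code must entity-escape the values.
std::string BuildSyntheticDoc(const std::vector<std::map<std::string, std::string>> &records) {
    std::ostringstream xml;
    xml << "<root>";
    for (const auto &record : records) {
        xml << "<record>";
        for (const auto &kv : record) {
            xml << "<" << kv.first << ">" << kv.second << "</" << kv.first << ">";
        }
        xml << "</record>";
    }
    xml << "</root>";
    return xml.str();
}
```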

Mode selection:

  • streaming=true (default) + file ≤ maximum_file_size → DOM (fast for small files)
  • streaming=true (default) + file > maximum_file_size → SAX (graceful fallback)
  • streaming=false + file > maximum_file_size → error (original behavior)
  • HTML files always use DOM (libxml2 HTML parser is DOM-only)
  • Complex XPath record_element (predicates, axes) always uses DOM
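The selection rules above condense into one decision function. A sketch with hypothetical names (REJECT stands for the original oversized-file error):

```cpp
#include <cstdint>

// Sketch of the mode-selection rules; REJECT stands for the original
// oversized-file error. Names are hypothetical, not the merged code.
enum class ParseMode { DOM, SAX, REJECT };

ParseMode SelectMode(bool streaming, uint64_t file_size, uint64_t maximum_file_size,
                     bool is_html, bool complex_xpath) {
    if (file_size <= maximum_file_size) {
        return ParseMode::DOM;           // small enough: DOM is fast
    }
    if (!streaming || is_html || complex_xpath) {
        return ParseMode::REJECT;        // oversized and SAX cannot take over
    }
    return ParseMode::SAX;               // graceful streaming fallback
}
```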

Known limitations

  • Nested STRUCT extraction: SAX accumulates nested elements as raw XML strings. Deep STRUCT types work for schema inference but extraction may produce raw XML values instead of proper STRUCT decomposition. Flat records with scalars, attributes, and LIST columns work correctly.
  • Attribute type inference: SAX emits attributes as elements in synthetic XML for inference, so numeric attributes (e.g., id="1") get type-inferred as INTEGER instead of VARCHAR (DOM keeps attributes as VARCHAR).
  • HTML: SAX mode not available for HTML (libxml2 limitation).

Test plan

  • 67 test cases, 2477 assertions — zero regressions
  • Parameter acceptance tests (streaming, maximum_file_size interaction)
  • DOM vs SAX behavioral equivalence: record counts, data values, type inference
  • Large row count (3000 rows spanning chunk boundaries)
  • Cross-record attribute discovery in SAX mode
  • datetime_format presets work in SAX mode
  • record_element parameter in SAX mode
  • attr_mode='discard' in SAX mode
  • Edge cases: empty files, single records, UTF-8, nullstr
  • Complex XPath + oversized file falls back to DOM error
  • streaming=false preserves original error behavior

Closes #68

🤖 Generated with Claude Code

teaguesterling and others added 8 commits March 28, 2026 13:57
Restores duckdb and extension-ci-tools from symlinks to correct
submodule state.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Register streaming (BOOLEAN) and sax_threshold (UBIGINT, default 64MB)
named parameters for read_xml. Parameters are parsed and stored in
XMLSchemaOptions but have no behavioral effect yet — SAX implementation
follows in subsequent commits.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Create xml_sax_reader.hpp/cpp with:
- SAXRecordAccumulator state machine (SEEKING → IN_RECORD → COMPLETE)
- SAX2 callbacks for startElementNs/endElementNs/characters/cdataBlock
- SAXStreamReader with push parser loop and chunked file reading
- InferSchemaFromStream using synthetic XML fragment approach
- AccumulatorToRow for converting accumulated data to DuckDB Values
- ConvertToValuePublic wrapper on XMLSchemaInference

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add SAX state fields to XMLReadGlobalState
- Add SAX/DOM mode selection in ReadDocumentFunction based on
  streaming param, sax_threshold, and file size
- Add SAX extraction loop alongside DOM extraction loop
- Add SAX schema inference path in ReadDocumentBind and ReadXMLBind
- Fix record collection: completed records are now collected immediately
  in the EndElementNs callback via SAXCallbackContext.completed_records,
  preventing records from being dropped when multiple records fit in a
  single 64KB parser chunk

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…tion

- Add test fixture with 200 records for behavioral equivalence testing
- Add edge case fixtures: empty file, single record, UTF-8, nullstr
- Test DOM vs SAX count and data equivalence
- Test type inference, attribute extraction, record_element in SAX mode
- Test sax_threshold=0 forcing SAX mode
- Test UTF-8 preservation (regression for #64)
- Test nullstr interaction with SAX mode
- Add bind-time validation: BinderException when streaming=true with
  complex XPath record_element (predicates, axes, functions)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…threshold

- Remove sax_threshold parameter (redundant with maximum_file_size)
- Change streaming default to true
- SAX mode now activates when file exceeds maximum_file_size (instead of
  erroring), providing a graceful fallback for large files
- streaming=false preserves original behavior (error on oversized files)
- Remove bind-time XPath validation (complex XPath + oversized file
  falls back to DOM error naturally)
- Update existing tests to use streaming=false where they test the
  file-size-limit error behavior
- Update docs to reflect new default and behavior

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Comprehensive test suite verifying SAX mode produces identical results
to DOM mode for: datetime_format, record_element, large row counts
(3000 rows across chunk boundaries), cross-record attribute discovery,
aggregations, type inference, and attr_mode='discard'.

Also fixes SAX fragment builder to respect attr_mode='discard' by
skipping record-level attributes in synthetic XML.

Known difference: SAX emits attributes as elements in synthetic XML,
so numeric attribute values (like id) get type-inferred as INTEGER
instead of VARCHAR. Documented in test comments.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings March 29, 2026 17:13
Contributor

Copilot AI left a comment


Pull request overview

This PR adds a SAX (libxml2 push-parser) code path to read_xml that is intended to handle oversized XML files by parsing records incrementally instead of building a full DOM, controlled by a new streaming parameter (default true).

Changes:

  • Introduces SAXStreamReader + SAXRecordAccumulator and wires them into read_xml for schema inference and extraction when files exceed maximum_file_size.
  • Adds a public wrapper XMLSchemaInference::ConvertToValuePublic to reuse existing scalar conversion logic from the SAX path.
  • Adds documentation and SQL/XML fixtures + tests covering SAX-vs-DOM equivalence and parameter behavior.

Reviewed changes

Copilot reviewed 21 out of 23 changed files in this pull request and generated 8 comments.

Summary per file:
src/xml_sax_reader.cpp: New SAX push-parser implementation for record accumulation, schema inference via synthetic XML, and row conversion
src/include/xml_sax_reader.hpp: Public interface/types for the SAX reader and accumulator
src/xml_reader_functions.cpp: Adds streaming parameter, selects SAX vs DOM, and integrates SAX schema inference + extraction
src/include/xml_reader_functions.hpp: Extends global state with SAX-related fields
src/xml_schema_inference.cpp: Adds ConvertToValuePublic wrapper
src/include/xml_schema_inference.hpp: Adds streaming option + declaration for ConvertToValuePublic
CMakeLists.txt: Builds the new SAX source file
docs/parameters.rst: Documents the new streaming parameter
docs/changelog.rst: Changelog entry for SAX streaming
README.md: Mentions streaming parameter
test/sql/*.test: Adds SAX-specific tests and updates existing max-file-size tests for streaming=false behavior
test/xml/sax_test_*.xml: Adds XML fixtures for SAX mode tests


Comment on lines +159 to +171
// Deeper nested element: accumulate raw XML
acc->nested_depth = relative_depth;
acc->nested_xml += "<" + name;

// Add attributes to raw XML
for (int i = 0; i < nb_attributes; i++) {
    const char *attr_localname = reinterpret_cast<const char *>(attributes[i * 5]);
    const char *value_start = reinterpret_cast<const char *>(attributes[i * 5 + 3]);
    const char *value_end = reinterpret_cast<const char *>(attributes[i * 5 + 4]);
    std::string attr_value(value_start, value_end);
    acc->nested_xml += " " + std::string(attr_localname) + "=\"" + attr_value + "\"";
}
acc->nested_xml += ">";

Copilot AI Mar 29, 2026


nested_xml is reconstructed by concatenating tag/attribute strings and raw characters() data without escaping. Because libxml2 SAX callbacks provide decoded character data (e.g., &amp; becomes &), this reconstruction can produce invalid XML whenever text or attribute values contain &, <, > or quotes, and can also mis-handle namespaced attributes (prefix is ignored here). Consider using xmlEncodeSpecialChars (or an existing XML escape helper) for text/attribute values and include prefixes consistently when namespaces='keep'.
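A minimal escape helper along the lines the reviewer suggests might look like this (the merged fix adds XmlEscapeText/XmlEscapeAttr helpers; this sketch is not the actual implementation):

```cpp
#include <string>

// Minimal sketch of text/attribute escaping for re-serialized XML.
// Not the extension's implementation; shown for illustration only.
std::string XmlEscapeText(const std::string &in) {
    std::string out;
    out.reserve(in.size());
    for (char c : in) {
        switch (c) {
        case '&':  out += "&amp;";  break;
        case '<':  out += "&lt;";   break;
        case '>':  out += "&gt;";   break;
        case '"':  out += "&quot;"; break;  // needed when used in attribute values
        default:   out += c;        break;
        }
    }
    return out;
}
```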

Comment on lines +369 to +399
// Build a synthetic XML document from accumulated records
// This lets us reuse the existing DOM-based InferSchema
std::ostringstream xml;
xml << "<root>";

for (const auto &record : records) {
    xml << "<record>";

    // Emit record-level attributes as elements for schema inference
    // (skip when attr_mode='discard' to match DOM behavior)
    if (options.attr_mode != "discard") {
        for (const auto &attr : record.current_attributes) {
            // Skip child element attributes (contain a dot)
            if (attr.first.find('.') != std::string::npos) {
                continue;
            }
            xml << "<" << attr.first << ">" << attr.second << "</" << attr.first << ">";
        }
    }

    // Emit scalar values
    for (const auto &val : record.current_values) {
        xml << "<" << val.first << ">" << val.second << "</" << val.first << ">";
    }

    // Emit list values as repeated elements
    for (const auto &list : record.current_lists) {
        for (const auto &item : list.second) {
            xml << "<" << list.first << ">" << item << "</" << list.first << ">";
        }
    }

Copilot AI Mar 29, 2026


When building the synthetic XML for schema inference, scalar/list/attribute values are inserted directly into element bodies. If a value contains XML-special characters (e.g., &, <), the synthetic document becomes malformed and InferSchema may fail or infer incorrectly. Please escape text content when emitting <col>...</col> (while still allowing the intentional nested-XML case, if needed, by explicitly distinguishing “raw XML” vs “text”).

Comment on lines +305 to +308
// Strip "//" prefix if present
if (tag.size() >= 2 && tag[0] == '/' && tag[1] == '/') {
tag = tag.substr(2);
}

Copilot AI Mar 29, 2026


record_element handling only strips a leading // and otherwise treats the string as a literal tag name. If record_element contains a path (e.g., /root/item or a/b), record_tag will never match element_name, resulting in zero rows. Since SAX mode only supports simple tag-name matching, consider validating options.record_element here (reject/flag any XPath tokens like '/', '@', '*', predicates, etc.) so failures are explicit and consistent with the DOM fallback logic.

Suggested change
// Strip "//" prefix if present
if (tag.size() >= 2 && tag[0] == '/' && tag[1] == '/') {
    tag = tag.substr(2);
}
// Strip "//" prefix if present (common XPath-style shorthand)
if (tag.size() >= 2 && tag[0] == '/' && tag[1] == '/') {
    tag = tag.substr(2);
}
// SAX mode only supports simple element-name matching, not full XPath.
// Reject obvious XPath tokens so failures are explicit and consistent
// with the DOM fallback logic.
if (tag.find_first_of("/@*[]") != std::string::npos) {
    throw InvalidInputException(
        "XML SAX reader: record_element '%s' appears to be an XPath expression "
        "and is not supported in SAX mode. Please specify a simple element name "
        "or use a mode that supports XPath.",
        options.record_element);
}

Comment on lines +282 to +283
xmlSAXHandler handler;
memset(&handler, 0, sizeof(handler));

Copilot AI Mar 29, 2026


memset is used here but <cstring> isn’t included in this translation unit. It may compile incidentally via transitive includes, but it’s not guaranteed and can break on stricter toolchains. Add the appropriate header (or value-initialize xmlSAXHandler handler{} instead of using memset).

Suggested change
xmlSAXHandler handler;
memset(&handler, 0, sizeof(handler));
xmlSAXHandler handler{};

Comment on lines +725 to +729
if (use_sax) {
// SAX mode: read all records via SAX push parser
gstate.sax_records = SAXStreamReader::ReadRecords(filename, schema_options);
gstate.sax_record_index = 0;
gstate.file_loaded = true;

Copilot AI Mar 29, 2026


In SAX mode this loads all records into gstate.sax_records via SAXStreamReader::ReadRecords(...) before producing any output. For very large files this can still be O(file size) (or worse) memory, defeating the intended streaming/low-memory behavior and potentially reintroducing OOMs. Consider keeping the libxml2 push-parser context in XMLReadGlobalState and incrementally parsing chunks until either (a) STANDARD_VECTOR_SIZE rows are ready or (b) EOF, emitting rows as soon as a record completes instead of accumulating the entire file in a vector.

Comment on lines +701 to +710
bool has_complex_xpath = false;
if (!schema_options.record_element.empty()) {
    const auto &re = schema_options.record_element;
    has_complex_xpath = (re.find('[') != std::string::npos ||
                         re.find("::") != std::string::npos ||
                         re.find('(') != std::string::npos);
}
if (!has_complex_xpath) {
    use_sax = true;
}

Copilot AI Mar 29, 2026


The has_complex_xpath heuristic doesn’t treat path expressions (e.g., /root/entry, a/b, .//item) as complex. If a user supplies such an XPath in record_element, use_sax may be enabled but MatchesRecordTag will never match, yielding 0 rows without an error. Please expand the complexity check (at least detect / beyond a leading //, plus other XPath tokens like @, *, ..) and fall back to DOM (or throw a clear error) when SAX can’t evaluate the expression.

Comment on lines +321 to +325
// Open file and read in chunks
std::ifstream file(filename, std::ios::binary);
if (!file.is_open()) {
throw IOException("Could not open file '%s' for SAX parsing", filename);
}

Copilot AI Mar 29, 2026


ReadRecords opens files with std::ifstream, bypassing DuckDB’s FileSystem abstraction used elsewhere (fs.OpenFile). This will break reading from non-local filesystems (e.g., http/s3/zip/vfs) and can behave inconsistently with DuckDB path handling. Consider changing ReadRecords (and InferSchemaFromStream) to read via a FileHandle/FileSystem provided by the caller, or pass in a read callback so streaming works for all supported backends.

Comment on lines +208 to +215
} else if (relative_depth == 1) {
// Closing a direct child of the record
std::string value = acc->current_text;
if (!acc->nested_xml.empty()) {
// If we accumulated nested XML, use that instead
value = acc->nested_xml;
acc->nested_xml.clear();
}

Copilot AI Mar 29, 2026


SAX path stores acc->current_text verbatim (including indentation/newlines and surrounding whitespace). DOM extraction runs CleanTextContent(...) before type conversion; without similar trimming here, numeric/date/boolean conversion can fail or produce different results for pretty-printed XML like <value>\n 10\n</value>. Consider normalizing/cleaning text content before storing or right before calling ConvertToValuePublic (and do the same for nested text when appropriate).

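The normalization this comment asks for amounts to trimming surrounding whitespace before type conversion. A sketch (the merged fix adds a TrimWhitespace() helper described as UTF-8-safe; this ASCII-only version is illustrative):

```cpp
#include <string>

// Illustrative ASCII-only trim; the merged helper is described as
// UTF-8-safe and mirrors the DOM path's CleanTextContent().
std::string TrimWhitespace(const std::string &text) {
    const char *ws = " \t\n\r";
    auto begin = text.find_first_not_of(ws);
    if (begin == std::string::npos) {
        return "";                       // all-whitespace content becomes empty
    }
    auto end = text.find_last_not_of(ws);
    return text.substr(begin, end - begin + 1);
}
```

Applied before ConvertToValuePublic, this makes pretty-printed values like `<value>\n  10\n</value>` convert the same way in SAX and DOM modes.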
@teaguesterling
Owner Author

teaguesterling commented Mar 29, 2026

Code review

Found 9 issues:

  1. XML injection in synthetic XML for schema inference -- InferSchemaFromStream concatenates SAX-decoded text values directly into synthetic XML without entity-escaping. The SAX characters callback receives already-decoded text (e.g. &amp; becomes &), so re-inserting it raw produces malformed XML when values contain &, <, or >. This breaks schema inference for any file with special characters in values.

            }
            xml << "<" << attr.first << ">" << attr.second << "</" << attr.first << ">";
        }
    }
    // Emit scalar values
    for (const auto &val : record.current_values) {
        xml << "<" << val.first << ">" << val.second << "</" << val.first << ">";
    }
    // Emit list values as repeated elements
    for (const auto &list : record.current_lists) {
        for (const auto &item : list.second) {
            xml << "<" << list.first << ">" << item << "</" << list.first << ">";
        }
    }

  2. std::ifstream bypasses DuckDB's virtual filesystem -- The SAX reader opens files via std::ifstream instead of DuckDB's FileSystem abstraction. The DOM path uses fs.OpenFile() which supports S3, HTTPFS, etc. SAX mode will fail with a file-not-found error for any non-local file that exceeds maximum_file_size.

// Open file and read in chunks
std::ifstream file(filename, std::ios::binary);
if (!file.is_open()) {

  3. xmlDoc memory leak after SAX parsing -- xmlFreeParserCtxt(parser_ctx) is called without first calling xmlFreeDoc(parser_ctx->myDoc). libxml2's push parser builds an internal xmlDoc during SAX parsing and xmlFreeParserCtxt does not free it. This leaks an xmlDoc on every SAX parse invocation.

int result = xmlParseChunk(parser_ctx, buffer.data(), static_cast<int>(bytes_read), 0);
if (result != 0 && !options.ignore_errors) {
xmlFreeParserCtxt(parser_ctx);
throw IOException("SAX parsing error in file '%s'", filename);
}
}
// Finalize parsing
xmlParseChunk(parser_ctx, nullptr, 0, 1 /* terminate */);
xmlFreeParserCtxt(parser_ctx);

  4. Asymmetric SAX callbacks when stop_parsing is set -- SAXStartElementNs returns early without incrementing current_depth or pushing to element_stack, but SAXEndElementNs still decrements/pops even when stop_parsing is true. This causes current_depth to drift negative after early termination. Consider also calling xmlStopParser(parser_ctx) from within the callback to halt libxml2 immediately.

if (sax_ctx->stop_parsing) {
    return;
}
std::string name = ResolveElementName(localname, prefix, acc->namespace_mode);
acc->current_depth++;
acc->element_stack.push_back(name);

if (sax_ctx->stop_parsing) {
    acc->current_depth--;
    if (!acc->element_stack.empty()) {
        acc->element_stack.pop_back();
    }
    return;
}

  5. XPath path expressions silently return 0 rows in SAX mode -- record_element values like //parent/child or /root/items/item pass the has_complex_xpath guard, which only checks for [, ::, and (. But MatchesRecordTag compares the full path string against the SAX local element name, which never contains /. Result: zero rows with no error.

static bool MatchesRecordTag(const std::string &element_name, const std::string &record_tag) {
    if (record_tag.empty()) {
        return false;
    }
    // Strip any leading "//" from XPath-style record element specification
    std::string tag = record_tag;
    if (tag.size() >= 2 && tag[0] == '/' && tag[1] == '/') {
        tag = tag.substr(2);
    }
    return element_name == tag;
}

// File exceeds max size — use SAX if possible
bool has_complex_xpath = false;
if (!schema_options.record_element.empty()) {
    const auto &re = schema_options.record_element;
    has_complex_xpath = (re.find('[') != std::string::npos ||
                         re.find("::") != std::string::npos ||
                         re.find('(') != std::string::npos);
}
if (!has_complex_xpath) {
    use_sax_inference = true;
}
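The broader detection adopted in the follow-up fix (treating any XPath token remaining after a stripped leading // as complex) can be sketched as follows; the token set mirrors the fix notes and is illustrative, not the merged implementation:

```cpp
#include <string>

// Sketch: classify record_element as "complex XPath" if any token
// survives after stripping a leading "//". Token set is illustrative.
bool HasComplexXPath(const std::string &record_element) {
    std::string tag = record_element;
    if (tag.size() >= 2 && tag[0] == '/' && tag[1] == '/') {
        tag = tag.substr(2);             // "//item" is simple shorthand
    }
    return tag.find_first_of("/@*[]():.") != std::string::npos;
}
```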

  6. SAX mode loads all records into memory before output -- ReadRecords() pushes every record into completed_records and the full vector is assigned to gstate.sax_records. The docs claim peak memory "proportional to a single record" but the implementation materializes all records. For files with millions of records this negates the streaming benefit.

if (sax_ctx->completed_records) {
sax_ctx->completed_records->push_back(*acc);
acc->Reset();

// SAX mode: read all records via SAX push parser
gstate.sax_records = SAXStreamReader::ReadRecords(filename, schema_options);
gstate.sax_record_index = 0;

  7. Attribute/element distinction lost in schema inference -- InferSchemaFromStream emits record-level attributes as plain child elements in the synthetic XML (same block as issue 1). This causes is_attribute to be false for all columns, even those from attributes. While AccumulatorToRow has a fallback check via HasAttribute(), the schema metadata is inaccurate and can cause type inference differences between DOM and SAX modes.

if (options.attr_mode != "discard") {
    for (const auto &attr : record.current_attributes) {
        // Skip child element attributes (contain a dot)
        if (attr.first.find('.') != std::string::npos) {
            continue;
        }
        xml << "<" << attr.first << ">" << attr.second << "</" << attr.first << ">";
    }
}

  8. namespaces='expand' mode not supported in SAX path -- ResolveElementName only handles "keep" (prepend prefix) and the default "strip" (local name only). The "expand" mode, which should substitute the full namespace URI, silently falls through to "strip" behavior.

static std::string ResolveElementName(const xmlChar *localname, const xmlChar *prefix,
                                      const std::string &namespace_mode) {
    std::string name = reinterpret_cast<const char *>(localname);
    if (namespace_mode == "keep" && prefix != nullptr) {
        name = std::string(reinterpret_cast<const char *>(prefix)) + ":" + name;
    }
    return name;
}
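One way to support 'expand' alongside 'keep' and 'strip', sketched with a hypothetical signature (the uri argument would come from the SAX callback's namespace-URI parameter; Clark notation {uri}local is one common convention, not necessarily what the extension emits):

```cpp
#include <string>

// Hypothetical sketch of handling 'expand' alongside 'keep'/'strip';
// uri would come from the SAX callback's namespace-URI argument.
std::string ResolveName(const std::string &local, const std::string &prefix,
                        const std::string &uri, const std::string &mode) {
    if (mode == "keep" && !prefix.empty()) {
        return prefix + ":" + local;     // preserve the source prefix
    }
    if (mode == "expand" && !uri.empty()) {
        return "{" + uri + "}" + local;  // substitute full namespace URI
    }
    return local;                        // default 'strip': local name only
}
```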

  9. Missing CleanTextContent normalization in SAX path -- The DOM path calls CleanTextContent() on every text value for whitespace normalization. The SAX path passes raw character data directly to ConvertToValuePublic without any cleaning, causing DOM/SAX divergence for text with leading/trailing whitespace or internal whitespace runs.

// Scalar value
std::string text = accumulator.GetValue(col_name);
if (text.empty()) {
    value = Value(); // NULL
} else if (IsNullString(text, options)) {
    value = Value(); // NULL
} else {
    value = XMLSchemaInference::ConvertToValuePublic(text, col_type, options, datetime_fmt);
}

🤖 Generated with Claude Code


Fixes all 8 issues from Copilot review:

1. Incremental streaming: push parser context now persists in
   XMLReadGlobalState across scan calls. Records are emitted as they
   complete instead of loading entire file into memory.

2. XML escaping: add XmlEscapeText/XmlEscapeAttr helpers. Applied to
   nested XML reconstruction, attribute values, and synthetic XML
   fragment builder. Fixes malformed XML on special characters.

3. Whitespace trimming: add TrimWhitespace() matching DOM's
   CleanTextContent() (UTF-8-safe). Applied when storing scalar values
   in EndElementNs. Fixes type inference failures on pretty-printed XML.

4. DuckDB FileSystem: replace std::ifstream with FileSystem::OpenFile()
   + FileHandle::Read(). Supports S3, HTTP, VFS backends.

5. Path-style XPath detection: HasComplexXPath() now detects /, @, *,
   [], (), :, . after stripping leading //. Falls back to DOM or errors
   if file too big for DOM.

6. SAX record_element validation: ReadRecords throws
   InvalidInputException if stripped tag contains XPath tokens.

7. memset -> value init: xmlSAXHandler handler{}.

8. Remove <fstream> include (no longer needed).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@teaguesterling teaguesterling merged commit 4b33c43 into main Mar 29, 2026
10 of 14 checks passed
teaguesterling added a commit that referenced this pull request Mar 29, 2026
Resolved conflicts in xml_schema_inference.hpp (streaming default),
xml_reader_functions.cpp (sax_threshold removal, streaming extraction),
and xml_sax_streaming.test (full test suite).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>


Development

Successfully merging this pull request may close these issues.

SAX-based streaming parser for very large XML files

2 participants