Commit 6497b69

docs: update all documentation for v2.0.0 release
- changelog.rst: expand v1.6.0 with SAX details, limitations, test stats
- README.md: rename to "DuckDB Webbed Extension", add streaming features section, update test coverage stats
- schema_inference.rst: add SAX streaming impact section documenting sample-based inference and record_element limitations
- quickstart.rst: add sections for parse_xml, datetime_format, nullstr, and large file streaming
- functions/file_reading.rst: add missing parameters (datetime_format, nullstr, streaming, columns) to read_xml parameter table
- .gitignore: exclude generated stress test files
- Add generate_stress_test.py for reproducible large file testing

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent de93981 commit 6497b69

7 files changed

Lines changed: 214 additions & 6 deletions


.gitignore

Lines changed: 4 additions & 0 deletions
@@ -22,6 +22,10 @@ devenv.local.nix
 # Locally cloned vcpkg
 vcpkg/
 
+# Generated stress test files (use test/xml/generate_stress_test.py to recreate)
+test/xml/sax_stress_test.xml
+test/xml/sax_stress_test.xml.gz
+
 # blq
 .lq/*
 !.lq/hooks/

README.md

Lines changed: 10 additions & 3 deletions
@@ -2,7 +2,7 @@
 [![Documentation](https://img.shields.io/badge/docs-readthedocs-blue)](https://duckdb-webbed.readthedocs.io)
 
-# DuckDB XML Extension
+# DuckDB Webbed Extension
 
 A comprehensive XML and HTML processing extension for DuckDB that enables SQL-native analysis of structured documents with intelligent schema inference and powerful XPath-based data extraction.

@@ -14,16 +14,23 @@ A comprehensive XML and HTML processing extension for DuckDB that enables SQL-na
 - Convert between XML, HTML, and JSON formats
 - Read files directly into DuckDB tables
 
-### 📊 **Smart Schema Inference**
+### 📊 **Smart Schema Inference**
 - Automatically flatten XML documents into relational tables
 - Intelligent type detection (dates, numbers, booleans)
 - Configurable element and attribute handling
+- Custom datetime format control with presets and format strings
+
+### 🚀 **Streaming for Large Files**
+- SAX-based streaming parser for files exceeding ``maximum_file_size``
+- Peak memory proportional to a single record, not the entire file
+- Automatic fallback: DOM for small files, SAX for large files
+- Controlled by ``streaming`` parameter (default: true)
 
 ### 🛠 **Production Ready**
 - Built on libxml2 for robust parsing
 - Comprehensive error handling
 - Memory-safe RAII implementation
-- 100% test coverage
+- 68 test suites, 2511 assertions
 
 ---
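The streaming design summarized in the README diff above (push-parse the file in fixed-size chunks, hold only the current record in memory) can be sketched with Python's stdlib SAX parser. This is a stand-in for the libxml2 push parser the extension actually uses; the record tag and 64KB chunk size are illustrative.

```python
import io
import xml.sax

class RecordCounter(xml.sax.ContentHandler):
    """Counts records while holding at most one record's fields in memory."""

    def __init__(self, record_tag):
        super().__init__()
        self.record_tag = record_tag
        self.count = 0
        self.current = None  # fields of the record being built

    def startElement(self, name, attrs):
        if name == self.record_tag:
            self.current = dict(attrs)

    def endElement(self, name):
        if name == self.record_tag:
            self.count += 1
            self.current = None  # record emitted; its memory is released

def stream_count(xml_bytes, record_tag, chunk_size=64 * 1024):
    # Push-style parsing: feed the document chunk by chunk instead of
    # loading it whole, so peak memory tracks one record, not the file.
    handler = RecordCounter(record_tag)
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    buf = io.BytesIO(xml_bytes)
    while True:
        chunk = buf.read(chunk_size)
        if not chunk:
            break
        parser.feed(chunk)
    parser.close()
    return handler.count

doc = (b"<catalog>"
       + b"".join(b'<item id="%d"><name>x</name></item>' % i for i in range(1000))
       + b"</catalog>")
print(stream_count(doc, "item"))  # 1000
```

In the extension this mode only engages for files larger than ``maximum_file_size``; smaller files still get a full DOM parse.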

docs/changelog.rst

Lines changed: 22 additions & 3 deletions
@@ -8,9 +8,28 @@ v1.6.0 (Current)
 
 - SAX-based streaming parser for very large XML files — files exceeding
   ``maximum_file_size`` are automatically parsed using SAX mode, reducing peak
-  memory from ~4x file size (DOM) to proportional to a single record. Controlled
-  by ``streaming`` parameter (default: true). Set ``streaming:=false`` for the
-  original behavior of erroring on oversized files (Issue #68)
+  memory from ~4x file size (DOM) to proportional to a single record (Issue #68)
+
+- New ``streaming`` parameter (default: ``true``). When enabled, oversized XML
+  files are streamed via libxml2's SAX push parser in 64KB chunks instead of
+  building a full DOM tree. Set ``streaming:=false`` to restore the previous
+  behavior of erroring on oversized files.
+- SAX mode supports simple tag-name ``record_element`` values (e.g., ``'item'``).
+  XPath expressions automatically fall back to DOM parsing.
+- Not available for HTML files (libxml2 HTML parser is DOM-only).
+
+**Limitations**
+
+- SAX mode currently handles flat records (scalars, attributes, repeated elements).
+  Nested STRUCT extraction from SAX events is not yet implemented — deeply nested
+  records fall back to raw XML string values.
+
+**Testing**
+
+- 68 test suites, 2511 assertions
+- Comprehensive DOM/SAX equivalence tests covering type inference, datetime_format,
+  record_element, cross-record attribute discovery, large row counts (3000 rows
+  across chunk boundaries), UTF-8 content, and nullstr interaction
 
 v1.5.0
 -----------------
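The DOM/SAX equivalence testing mentioned in the changelog entry above can be illustrated with a small stdlib sketch: parse the same document once with a tree parser and once with a streaming SAX handler, and require identical records. Here ``xml.etree`` and ``xml.sax`` stand in for libxml2's DOM and SAX modes, and the flat-record shape mirrors the documented SAX limitation.

```python
import xml.etree.ElementTree as ET
import xml.sax

DOC = """<root>
  <item id="1"><name>a</name></item>
  <item id="2"><name>b</name></item>
</root>"""

def dom_records(xml_text, record_tag):
    # Tree-based extraction: attributes plus child-element text per record.
    root = ET.fromstring(xml_text)
    return [{**el.attrib, **{c.tag: c.text for c in el}}
            for el in root.iter(record_tag)]

class Collector(xml.sax.ContentHandler):
    """Event-based extraction of the same flat records."""

    def __init__(self, record_tag):
        super().__init__()
        self.tag = record_tag
        self.records = []
        self.rec = None    # record under construction
        self.field = None  # child element under construction
        self.text = ""

    def startElement(self, name, attrs):
        if name == self.tag:
            self.rec = dict(attrs)
        elif self.rec is not None:
            self.field, self.text = name, ""

    def characters(self, content):
        if self.field is not None:
            self.text += content

    def endElement(self, name):
        if name == self.tag:
            self.records.append(self.rec)
            self.rec = None
        elif self.rec is not None and name == self.field:
            self.rec[name] = self.text
            self.field = None

def sax_records(xml_text, record_tag):
    handler = Collector(record_tag)
    xml.sax.parseString(xml_text.encode(), handler)
    return handler.records

print(dom_records(DOC, "item") == sax_records(DOC, "item"))  # True
```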

docs/functions/file_reading.rst

Lines changed: 12 additions & 0 deletions
@@ -71,6 +71,18 @@ Read XML files with automatic schema inference.
    * - ``namespaces``
      - VARCHAR
      - Namespace handling: 'strip', 'expand', 'keep' (default: 'strip')
+   * - ``columns``
+     - STRUCT
+     - Explicit column schema (e.g., ``{name: 'VARCHAR', price: 'DOUBLE'}``)
+   * - ``datetime_format``
+     - VARCHAR or VARCHAR[]
+     - Controls date/time detection. Accepts ``'auto'`` (default), ``'none'``, preset names (``'us'``, ``'eu'``, ``'iso'``), custom strftime strings, or a list of formats.
+   * - ``nullstr``
+     - VARCHAR or VARCHAR[]
+     - String value(s) to interpret as NULL (e.g., ``'N/A'`` or ``['N/A', '-']``)
+   * - ``streaming``
+     - BOOLEAN
+     - Enable SAX streaming for files exceeding ``maximum_file_size`` (default: true). SAX mode only supports simple tag names for ``record_element``. Not available for HTML.
 
 **Examples:**
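Since ``datetime_format`` accepts strftime-style patterns, a candidate format string can be sanity-checked with Python's ``strptime`` before passing it to ``read_xml``. The 'eu' (DD/MM/YYYY) and 'us' (MM/DD/YYYY) orderings below follow the quickstart examples; the exact pattern lists behind each preset are an extension detail, and DuckDB's own strptime is the authority on supported specifiers.

```python
from datetime import datetime

# Verify that each pattern parses its sample the way the docs describe.
checks = [
    ("%Y/%m/%d", "2024/03/15"),  # custom format from the parameter table
    ("%d/%m/%Y", "15/03/2024"),  # 'eu'-style day-first ordering
    ("%m/%d/%Y", "03/15/2024"),  # 'us'-style month-first ordering
]
for fmt, value in checks:
    parsed = datetime.strptime(value, fmt).date()
    print(f"{value!r} with {fmt!r} -> {parsed}")  # all three yield 2024-03-15
```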

docs/quickstart.rst

Lines changed: 60 additions & 0 deletions
@@ -106,6 +106,66 @@ Extracting Links and Images from HTML
        (unnest(html_extract_images(html))).alt as alt_text
    FROM read_html_objects('page.html');
 
+Parsing XML/HTML Strings
+------------------------
+
+Parse XML or HTML content directly from strings:
+
+.. code-block:: sql
+
+   -- Parse an XML string with schema inference
+   SELECT * FROM parse_xml('<data><item><name>Widget</name><price>9.99</price></item></data>');
+
+   -- Parse HTML content
+   SELECT * FROM parse_html('<div><p>Hello</p><p>World</p></div>', record_element := 'p');
+
+Controlling Date/Time Parsing
+-----------------------------
+
+Use ``datetime_format`` to control how dates and timestamps are detected:
+
+.. code-block:: sql
+
+   -- Parse European dates (DD/MM/YYYY)
+   SELECT * FROM read_xml('data.xml', datetime_format := 'eu');
+
+   -- Parse US dates (MM/DD/YYYY)
+   SELECT * FROM read_xml('data.xml', datetime_format := 'us');
+
+   -- Use a custom format string
+   SELECT * FROM read_xml('data.xml', datetime_format := '%Y/%m/%d');
+
+   -- Disable date detection entirely
+   SELECT * FROM read_xml('data.xml', datetime_format := 'none');
+
+Handling NULL Values
+--------------------
+
+Use ``nullstr`` to specify values that should be treated as NULL:
+
+.. code-block:: sql
+
+   -- Treat "N/A" and "-" as NULL
+   SELECT * FROM read_xml('data.xml', nullstr := ['N/A', '-']);
+
+Processing Large Files
+----------------------
+
+Files exceeding ``maximum_file_size`` (128MB by default) are automatically streamed
+using a SAX-based parser that processes XML in chunks — peak memory stays proportional
+to a single record rather than the entire file:
+
+.. code-block:: sql
+
+   -- Large files are streamed automatically (default behavior)
+   SELECT count(*) FROM read_xml('huge_file.xml');
+
+   -- Force DOM mode (errors if file is too large)
+   SELECT * FROM read_xml('file.xml', streaming := false);
+
+   -- Adjust the file size limit for DOM parsing
+   SELECT * FROM read_xml('file.xml', maximum_file_size := 268435456); -- 256MB
+
 Extracting HTML Tables
 ----------------------
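The ``nullstr`` behavior added to the quickstart above amounts to replacing sentinel strings with NULL before values are cast to column types. A conceptual sketch, not the extension's actual code:

```python
def apply_nullstr(values, nullstr):
    # Accept a single sentinel or a list, as the nullstr parameter does.
    sentinels = set(nullstr) if isinstance(nullstr, (list, tuple)) else {nullstr}
    # Matching cell values become None (NULL); everything else passes through.
    return [None if v in sentinels else v for v in values]

cells = ["9.99", "N/A", "-", "12.50"]
print(apply_nullstr(cells, ["N/A", "-"]))  # ['9.99', None, None, '12.50']
```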

docs/schema_inference.rst

Lines changed: 26 additions & 0 deletions
@@ -170,6 +170,32 @@ Limit parsing depth with ``max_depth``:
    -- Unlimited (capped at 10 for safety)
    SELECT * FROM read_xml('deep.xml', max_depth := -1);
 
+SAX Streaming and Schema Inference
+-----------------------------------
+
+When files exceed ``maximum_file_size`` (128MB by default), the extension uses SAX-based
+streaming instead of building a full DOM tree. This affects schema inference in two ways:
+
+1. **Sample-based inference** — SAX mode reads the first ``sample_size`` records (default: 50)
+   to infer the schema, then streams the rest for extraction. The schema is not revised after
+   the sample window, so columns or types that only appear in later records may not be detected.
+
+2. **Simple ``record_element`` only** — SAX mode matches record elements by simple tag name
+   (e.g., ``'item'``). XPath expressions like ``'//ns:item[@type="active"]'`` or path-based
+   patterns like ``'/root/data/item'`` require DOM parsing. When the file is too large for DOM
+   and the ``record_element`` contains XPath syntax, the extension raises an error.
+
+.. code-block:: sql
+
+   -- Works in SAX mode (simple tag name)
+   SELECT * FROM read_xml('huge.xml', record_element := 'item');
+
+   -- Falls back to DOM (XPath expression)
+   SELECT * FROM read_xml('data.xml', record_element := '//item[@status="active"]');
+
+Set ``streaming := false`` to force DOM mode for any file (will error if the file exceeds
+``maximum_file_size``).
+
 Common Patterns
 ---------------
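The sample-based inference described in the new schema_inference section can be sketched in a few lines: only the first ``sample_size`` records contribute to the schema, which is then fixed for the rest of the stream. The type names and widening order below are illustrative, not the extension's actual rules.

```python
# Widening order for crude scalar type sniffing.
RANK = {"BIGINT": 0, "DOUBLE": 1, "VARCHAR": 2}

def sniff(value):
    # Try BIGINT, then DOUBLE, else fall back to VARCHAR.
    try:
        int(value)
        return "BIGINT"
    except ValueError:
        pass
    try:
        float(value)
        return "DOUBLE"
    except ValueError:
        return "VARCHAR"

def infer_schema(records, sample_size=50):
    schema = {}
    # Records past the sample window never add columns or widen types,
    # which is exactly the limitation the docs describe.
    for rec in records[:sample_size]:
        for col, val in rec.items():
            t = sniff(val)
            if col not in schema or RANK[t] > RANK[schema[col]]:
                schema[col] = t
    return schema

rows = [{"id": "1", "price": "9.99"},
        {"id": "2", "price": "10", "note": "hi"}]
print(infer_schema(rows))
# {'id': 'BIGINT', 'price': 'DOUBLE', 'note': 'VARCHAR'}
```

With ``sample_size=1`` the ``note`` column (first seen in record 2) is missed entirely, mirroring the documented pitfall.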

test/xml/generate_stress_test.py

Lines changed: 80 additions & 0 deletions
#!/usr/bin/env python3
"""Generate a large XML file for SAX streaming stress testing.

Usage: python3 generate_stress_test.py [num_records] [output_path]

Default: 1,000,000 records (~382MB) written to test/xml/sax_stress_test.xml

The generated file contains varied record content designed to exercise:
- UTF-8 characters (Cyrillic, Japanese, Turkish, French)
- XML special characters (&, <, >, ")
- Multiple data types (INTEGER, DOUBLE, BOOLEAN, DATE, VARCHAR)
- Attributes on record elements (id, sku)
- 10 rotating categories including one with & (food & beverage)
"""
import os
import sys


def generate(num_records=1000000, output_path=None):
    if output_path is None:
        output_path = os.path.join(os.path.dirname(__file__), "sax_stress_test.xml")

    categories = [
        "electronics", "clothing", "food & beverage", "toys", "books & media",
        "health", "automotive", "sports", "home & garden", "office",
    ]

    special_names = [
        'Standard Item',
        'Item with "quotes"',
        "Item with <brackets>",
        'Item with & ampersand',
        'Таймаут продукт',
        '日本語の製品',
        'Ürün açıklaması',
        'Produit français',
    ]

    with open(output_path, "w", encoding="utf-8") as f:
        f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        f.write("<catalog>\n")

        for i in range(1, num_records + 1):
            cat = categories[i % len(categories)]
            name_base = special_names[i % len(special_names)]
            price = round(0.01 + (i % 99999) * 0.01, 2)
            quantity = i % 10000
            month = (i % 12) + 1
            day = (i % 28) + 1
            active = "true" if i % 3 != 0 else "false"
            rating = round(1.0 + (i % 50) * 0.1, 1)

            name = f"{name_base} #{i}"
            name = name.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;").replace('"', "&quot;")
            cat_escaped = cat.replace("&", "&amp;")
            desc = f"This is the description for product {i}. It contains enough text to make each record larger."

            f.write(f'  <product id="{i}" sku="SKU-{i:07d}">\n')
            f.write(f"    <name>{name}</name>\n")
            f.write(f"    <category>{cat_escaped}</category>\n")
            f.write(f"    <price>{price}</price>\n")
            f.write(f"    <quantity>{quantity}</quantity>\n")
            f.write(f"    <date>2024-{month:02d}-{day:02d}</date>\n")
            f.write(f"    <active>{active}</active>\n")
            f.write(f"    <rating>{rating}</rating>\n")
            f.write(f"    <description>{desc}</description>\n")
            f.write("  </product>\n")

            if i % 200000 == 0:
                print(f"  Generated {i}/{num_records} records...", file=sys.stderr)

        f.write("</catalog>\n")

    size_mb = os.path.getsize(output_path) / (1024 * 1024)
    print(f"Generated {output_path}: {size_mb:.1f} MB, {num_records} records")


if __name__ == "__main__":
    n = int(sys.argv[1]) if len(sys.argv) > 1 else 1000000
    p = sys.argv[2] if len(sys.argv) > 2 else None
    generate(n, p)
