Commit 6497b69

docs: update all documentation for v2.0.0 release
- changelog.rst: expand v1.6.0 with SAX details, limitations, test stats
- README.md: rename to "DuckDB Webbed Extension", add streaming features section, update test coverage stats
- schema_inference.rst: add SAX streaming impact section documenting sample-based inference and record_element limitations
- quickstart.rst: add sections for parse_xml, datetime_format, nullstr, and large file streaming
- functions/file_reading.rst: add missing parameters (datetime_format, nullstr, streaming, columns) to read_xml parameter table
- .gitignore: exclude generated stress test files
- Add generate_stress_test.py for reproducible large file testing

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent de93981 commit 6497b69

7 files changed

Lines changed: 214 additions & 6 deletions


.gitignore

Lines changed: 4 additions & 0 deletions
@@ -22,6 +22,10 @@ devenv.local.nix
 # Locally cloned vcpkg
 vcpkg/
 
+# Generated stress test files (use test/xml/generate_stress_test.py to recreate)
+test/xml/sax_stress_test.xml
+test/xml/sax_stress_test.xml.gz
+
 # blq
 .lq/*
 !.lq/hooks/

README.md

Lines changed: 10 additions & 3 deletions
@@ -2,7 +2,7 @@
 [![Documentation](https://img.shields.io/badge/docs-readthedocs-blue)](https://duckdb-webbed.readthedocs.io)
 
-# DuckDB XML Extension
+# DuckDB Webbed Extension
 
 A comprehensive XML and HTML processing extension for DuckDB that enables SQL-native analysis of structured documents with intelligent schema inference and powerful XPath-based data extraction.

@@ -14,16 +14,23 @@ A comprehensive XML and HTML processing extension for DuckDB that enables SQL-na
 - Convert between XML, HTML, and JSON formats
 - Read files directly into DuckDB tables
 
-### 📊 **Smart Schema Inference**
+### 📊 **Smart Schema Inference**
 - Automatically flatten XML documents into relational tables
 - Intelligent type detection (dates, numbers, booleans)
 - Configurable element and attribute handling
+- Custom datetime format control with presets and format strings
+
+### 🚀 **Streaming for Large Files**
+- SAX-based streaming parser for files exceeding ``maximum_file_size``
+- Peak memory proportional to a single record, not the entire file
+- Automatic fallback: DOM for small files, SAX for large files
+- Controlled by ``streaming`` parameter (default: true)
 
 ### 🛠 **Production Ready**
 - Built on libxml2 for robust parsing
 - Comprehensive error handling
 - Memory-safe RAII implementation
-- 100% test coverage
+- 68 test suites, 2511 assertions
 
 ---
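The streaming design summarized in the README diff above (push-parse the file in fixed-size chunks, hold only the current record in memory) can be sketched with Python's stdlib SAX parser. This is a stand-in for the libxml2 push parser the extension actually uses; the record tag and 64KB chunk size are illustrative.

```python
import io
import xml.sax

class RecordCounter(xml.sax.ContentHandler):
    """Counts records while holding at most one record's fields in memory."""

    def __init__(self, record_tag):
        super().__init__()
        self.record_tag = record_tag
        self.count = 0
        self.current = None  # fields of the record being built

    def startElement(self, name, attrs):
        if name == self.record_tag:
            self.current = dict(attrs)

    def endElement(self, name):
        if name == self.record_tag:
            self.count += 1
            self.current = None  # record emitted; its memory is released

def stream_count(xml_bytes, record_tag, chunk_size=64 * 1024):
    # Push-style parsing: feed the document chunk by chunk instead of
    # loading it whole, so peak memory tracks one record, not the file.
    handler = RecordCounter(record_tag)
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    buf = io.BytesIO(xml_bytes)
    while True:
        chunk = buf.read(chunk_size)
        if not chunk:
            break
        parser.feed(chunk)
    parser.close()
    return handler.count

doc = (b"<catalog>"
       + b"".join(b'<item id="%d"><name>x</name></item>' % i for i in range(1000))
       + b"</catalog>")
print(stream_count(doc, "item"))  # 1000
```

In the extension this mode only engages for files larger than ``maximum_file_size``; smaller files still get a full DOM parse.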

docs/changelog.rst

Lines changed: 22 additions & 3 deletions
@@ -8,9 +8,28 @@ v1.6.0 (Current)
 
 - SAX-based streaming parser for very large XML files — files exceeding
   ``maximum_file_size`` are automatically parsed using SAX mode, reducing peak
-  memory from ~4x file size (DOM) to proportional to a single record. Controlled
-  by ``streaming`` parameter (default: true). Set ``streaming:=false`` for the
-  original behavior of erroring on oversized files (Issue #68)
+  memory from ~4x file size (DOM) to proportional to a single record (Issue #68)
+
+- New ``streaming`` parameter (default: ``true``). When enabled, oversized XML
+  files are streamed via libxml2's SAX push parser in 64KB chunks instead of
+  building a full DOM tree. Set ``streaming:=false`` to restore the previous
+  behavior of erroring on oversized files.
+- SAX mode supports simple tag-name ``record_element`` values (e.g., ``'item'``).
+  XPath expressions automatically fall back to DOM parsing.
+- Not available for HTML files (libxml2 HTML parser is DOM-only).
+
+**Limitations**
+
+- SAX mode currently handles flat records (scalars, attributes, repeated elements).
+  Nested STRUCT extraction from SAX events is not yet implemented — deeply nested
+  records fall back to raw XML string values.
+
+**Testing**
+
+- 68 test suites, 2511 assertions
+- Comprehensive DOM/SAX equivalence tests covering type inference, datetime_format,
+  record_element, cross-record attribute discovery, large row counts (3000 rows
+  across chunk boundaries), UTF-8 content, and nullstr interaction
 
 v1.5.0
 -----------------
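The DOM/SAX equivalence testing mentioned in the changelog entry above can be illustrated with a small stdlib sketch: parse the same document once with a tree parser and once with a streaming SAX handler, and require identical records. Here ``xml.etree`` and ``xml.sax`` stand in for libxml2's DOM and SAX modes, and the flat-record shape mirrors the documented SAX limitation.

```python
import xml.etree.ElementTree as ET
import xml.sax

DOC = """<root>
  <item id="1"><name>a</name></item>
  <item id="2"><name>b</name></item>
</root>"""

def dom_records(xml_text, record_tag):
    # Tree-based extraction: attributes plus child-element text per record.
    root = ET.fromstring(xml_text)
    return [{**el.attrib, **{c.tag: c.text for c in el}}
            for el in root.iter(record_tag)]

class Collector(xml.sax.ContentHandler):
    """Event-based extraction of the same flat records."""

    def __init__(self, record_tag):
        super().__init__()
        self.tag = record_tag
        self.records = []
        self.rec = None    # record under construction
        self.field = None  # child element under construction
        self.text = ""

    def startElement(self, name, attrs):
        if name == self.tag:
            self.rec = dict(attrs)
        elif self.rec is not None:
            self.field, self.text = name, ""

    def characters(self, content):
        if self.field is not None:
            self.text += content

    def endElement(self, name):
        if name == self.tag:
            self.records.append(self.rec)
            self.rec = None
        elif self.rec is not None and name == self.field:
            self.rec[name] = self.text
            self.field = None

def sax_records(xml_text, record_tag):
    handler = Collector(record_tag)
    xml.sax.parseString(xml_text.encode(), handler)
    return handler.records

print(dom_records(DOC, "item") == sax_records(DOC, "item"))  # True
```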

docs/functions/file_reading.rst

Lines changed: 12 additions & 0 deletions
@@ -71,6 +71,18 @@ Read XML files with automatic schema inference.
    * - ``namespaces``
      - VARCHAR
      - Namespace handling: 'strip', 'expand', 'keep' (default: 'strip')
+   * - ``columns``
+     - STRUCT
+     - Explicit column schema (e.g., ``{name: 'VARCHAR', price: 'DOUBLE'}``)
+   * - ``datetime_format``
+     - VARCHAR or VARCHAR[]
+     - Controls date/time detection. Accepts ``'auto'`` (default), ``'none'``, preset names (``'us'``, ``'eu'``, ``'iso'``), custom strftime strings, or a list of formats.
+   * - ``nullstr``
+     - VARCHAR or VARCHAR[]
+     - String value(s) to interpret as NULL (e.g., ``'N/A'`` or ``['N/A', '-']``)
+   * - ``streaming``
+     - BOOLEAN
+     - Enable SAX streaming for files exceeding ``maximum_file_size`` (default: true). SAX mode only supports simple tag names for ``record_element``. Not available for HTML.
 
 **Examples:**
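Since ``datetime_format`` accepts strftime-style patterns, a candidate format string can be sanity-checked with Python's ``strptime`` before passing it to ``read_xml``. The 'eu' (DD/MM/YYYY) and 'us' (MM/DD/YYYY) orderings below follow the quickstart examples; the exact pattern lists behind each preset are an extension detail, and DuckDB's own strptime is the authority on supported specifiers.

```python
from datetime import datetime

# Verify that each pattern parses its sample the way the docs describe.
checks = [
    ("%Y/%m/%d", "2024/03/15"),  # custom format from the parameter table
    ("%d/%m/%Y", "15/03/2024"),  # 'eu'-style day-first ordering
    ("%m/%d/%Y", "03/15/2024"),  # 'us'-style month-first ordering
]
for fmt, value in checks:
    parsed = datetime.strptime(value, fmt).date()
    print(f"{value!r} with {fmt!r} -> {parsed}")  # all three yield 2024-03-15
```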

docs/quickstart.rst

Lines changed: 60 additions & 0 deletions
@@ -106,6 +106,66 @@ Extracting Links and Images from HTML
        (unnest(html_extract_images(html))).alt as alt_text
    FROM read_html_objects('page.html');
 
+Parsing XML/HTML Strings
+------------------------
+
+Parse XML or HTML content directly from strings:
+
+.. code-block:: sql
+
+   -- Parse an XML string with schema inference
+   SELECT * FROM parse_xml('<data><item><name>Widget</name><price>9.99</price></item></data>');
+
+   -- Parse HTML content
+   SELECT * FROM parse_html('<div><p>Hello</p><p>World</p></div>', record_element := 'p');
+
+Controlling Date/Time Parsing
+-----------------------------
+
+Use ``datetime_format`` to control how dates and timestamps are detected:
+
+.. code-block:: sql
+
+   -- Parse European dates (DD/MM/YYYY)
+   SELECT * FROM read_xml('data.xml', datetime_format := 'eu');
+
+   -- Parse US dates (MM/DD/YYYY)
+   SELECT * FROM read_xml('data.xml', datetime_format := 'us');
+
+   -- Use a custom format string
+   SELECT * FROM read_xml('data.xml', datetime_format := '%Y/%m/%d');
+
+   -- Disable date detection entirely
+   SELECT * FROM read_xml('data.xml', datetime_format := 'none');
+
+Handling NULL Values
+--------------------
+
+Use ``nullstr`` to specify values that should be treated as NULL:
+
+.. code-block:: sql
+
+   -- Treat "N/A" and "-" as NULL
+   SELECT * FROM read_xml('data.xml', nullstr := ['N/A', '-']);
+
+Processing Large Files
+----------------------
+
+Files exceeding ``maximum_file_size`` (128MB by default) are automatically streamed
+using a SAX-based parser that processes XML in chunks — peak memory stays proportional
+to a single record rather than the entire file:
+
+.. code-block:: sql
+
+   -- Large files are streamed automatically (default behavior)
+   SELECT count(*) FROM read_xml('huge_file.xml');
+
+   -- Force DOM mode (errors if file is too large)
+   SELECT * FROM read_xml('file.xml', streaming := false);
+
+   -- Adjust the file size limit for DOM parsing
+   SELECT * FROM read_xml('file.xml', maximum_file_size := 268435456); -- 256MB
+
 Extracting HTML Tables
 ----------------------
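The ``nullstr`` behavior added to the quickstart above amounts to replacing sentinel strings with NULL before values are cast to column types. A conceptual sketch, not the extension's actual code:

```python
def apply_nullstr(values, nullstr):
    # Accept a single sentinel or a list, as the nullstr parameter does.
    sentinels = set(nullstr) if isinstance(nullstr, (list, tuple)) else {nullstr}
    # Matching cell values become None (NULL); everything else passes through.
    return [None if v in sentinels else v for v in values]

cells = ["9.99", "N/A", "-", "12.50"]
print(apply_nullstr(cells, ["N/A", "-"]))  # ['9.99', None, None, '12.50']
```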

docs/schema_inference.rst

Lines changed: 26 additions & 0 deletions
@@ -170,6 +170,32 @@ Limit parsing depth with ``max_depth``:
    -- Unlimited (capped at 10 for safety)
    SELECT * FROM read_xml('deep.xml', max_depth := -1);
 
+SAX Streaming and Schema Inference
+-----------------------------------
+
+When files exceed ``maximum_file_size`` (128MB by default), the extension uses SAX-based
+streaming instead of building a full DOM tree. This affects schema inference in two ways:
+
+1. **Sample-based inference** — SAX mode reads the first ``sample_size`` records (default: 50)
+   to infer the schema, then streams the rest for extraction. The schema is not revised after
+   the sample window, so columns or types that only appear in later records may not be detected.
+
+2. **Simple ``record_element`` only** — SAX mode matches record elements by simple tag name
+   (e.g., ``'item'``). XPath expressions like ``'//ns:item[@type="active"]'`` or path-based
+   patterns like ``'/root/data/item'`` require DOM parsing. When the file is too large for DOM
+   and the ``record_element`` contains XPath syntax, the extension raises an error.
+
+.. code-block:: sql
+
+   -- Works in SAX mode (simple tag name)
+   SELECT * FROM read_xml('huge.xml', record_element := 'item');
+
+   -- Falls back to DOM (XPath expression)
+   SELECT * FROM read_xml('data.xml', record_element := '//item[@status="active"]');
+
+Set ``streaming := false`` to force DOM mode for any file (will error if the file exceeds
+``maximum_file_size``).
+
 Common Patterns
 ---------------
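The sample-based inference described in the new schema_inference section can be sketched in a few lines: only the first ``sample_size`` records contribute to the schema, which is then fixed for the rest of the stream. The type names and widening order below are illustrative, not the extension's actual rules.

```python
# Widening order for crude scalar type sniffing.
RANK = {"BIGINT": 0, "DOUBLE": 1, "VARCHAR": 2}

def sniff(value):
    # Try BIGINT, then DOUBLE, else fall back to VARCHAR.
    try:
        int(value)
        return "BIGINT"
    except ValueError:
        pass
    try:
        float(value)
        return "DOUBLE"
    except ValueError:
        return "VARCHAR"

def infer_schema(records, sample_size=50):
    schema = {}
    # Records past the sample window never add columns or widen types,
    # which is exactly the limitation the docs describe.
    for rec in records[:sample_size]:
        for col, val in rec.items():
            t = sniff(val)
            if col not in schema or RANK[t] > RANK[schema[col]]:
                schema[col] = t
    return schema

rows = [{"id": "1", "price": "9.99"},
        {"id": "2", "price": "10", "note": "hi"}]
print(infer_schema(rows))
# {'id': 'BIGINT', 'price': 'DOUBLE', 'note': 'VARCHAR'}
```

With ``sample_size=1`` the ``note`` column (first seen in record 2) is missed entirely, mirroring the documented pitfall.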

test/xml/generate_stress_test.py

Lines changed: 80 additions & 0 deletions
#!/usr/bin/env python3
"""Generate a large XML file for SAX streaming stress testing.

Usage: python3 generate_stress_test.py [num_records] [output_path]

Default: 1,000,000 records (~382MB) written to test/xml/sax_stress_test.xml

The generated file contains varied record content designed to exercise:
- UTF-8 characters (Cyrillic, Japanese, Turkish, French)
- XML special characters (&, <, >, ")
- Multiple data types (INTEGER, DOUBLE, BOOLEAN, DATE, VARCHAR)
- Attributes on record elements (id, sku)
- 10 rotating categories including one with & (food & beverage)
"""
import os
import sys


def generate(num_records=1000000, output_path=None):
    if output_path is None:
        output_path = os.path.join(os.path.dirname(__file__), "sax_stress_test.xml")

    categories = [
        "electronics", "clothing", "food & beverage", "toys", "books & media",
        "health", "automotive", "sports", "home & garden", "office",
    ]

    special_names = [
        'Standard Item',
        'Item with "quotes"',
        "Item with <brackets>",
        'Item with & ampersand',
        'Таймаут продукт',
        '日本語の製品',
        'Ürün açıklaması',
        'Produit français',
    ]

    with open(output_path, "w", encoding="utf-8") as f:
        f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        f.write("<catalog>\n")

        for i in range(1, num_records + 1):
            cat = categories[i % len(categories)]
            name_base = special_names[i % len(special_names)]
            price = round(0.01 + (i % 99999) * 0.01, 2)
            quantity = i % 10000
            month = (i % 12) + 1
            day = (i % 28) + 1
            active = "true" if i % 3 != 0 else "false"
            rating = round(1.0 + (i % 50) * 0.1, 1)

            name = f"{name_base} #{i}"
            name = name.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;").replace('"', "&quot;")
            cat_escaped = cat.replace("&", "&amp;")
            desc = f"This is the description for product {i}. It contains enough text to make each record larger."

            f.write(f'  <product id="{i}" sku="SKU-{i:07d}">\n')
            f.write(f"    <name>{name}</name>\n")
            f.write(f"    <category>{cat_escaped}</category>\n")
            f.write(f"    <price>{price}</price>\n")
            f.write(f"    <quantity>{quantity}</quantity>\n")
            f.write(f"    <date>2024-{month:02d}-{day:02d}</date>\n")
            f.write(f"    <active>{active}</active>\n")
            f.write(f"    <rating>{rating}</rating>\n")
            f.write(f"    <description>{desc}</description>\n")
            f.write("  </product>\n")

            if i % 200000 == 0:
                print(f"  Generated {i}/{num_records} records...", file=sys.stderr)

        f.write("</catalog>\n")

    size_mb = os.path.getsize(output_path) / (1024 * 1024)
    print(f"Generated {output_path}: {size_mb:.1f} MB, {num_records} records")


if __name__ == "__main__":
    n = int(sys.argv[1]) if len(sys.argv) > 1 else 1000000
    p = sys.argv[2] if len(sys.argv) > 2 else None
    generate(n, p)
