read_html and read_xml Feature Parity Analysis

Date: 2025-11-27
Issue: #18
Branch: issue-18-html-xml-feature-parity

Executive Summary

read_html and read_xml have near-complete parameter parity at the API level: all 16 core parameters are present in both functions, and the only difference is an extra filename parameter on read_html. The significant gap is in test coverage, where XML has 26 test files and HTML has only 5.

Parameter Comparison

Parameters Present in Both Functions ✓

Both read_html and read_xml support these parameters:

File Processing:
- ignore_errors (BOOLEAN) - Skip invalid files
- maximum_file_size (BIGINT) - File size limit
- union_by_name (BOOLEAN) - Merge schemas by column name

Schema Inference:
- root_element (VARCHAR) - Starting element for parsing
- record_element (VARCHAR) - XPath/tag for row elements
- attr_mode (VARCHAR) - Attribute handling: 'columns', 'prefixed', 'map', 'discard'
- attr_prefix (VARCHAR) - Prefix when attr_mode='prefixed'
- text_key (VARCHAR) - Key for mixed text content
- namespaces (VARCHAR) - Namespace handling: 'strip', 'expand', 'keep'
- empty_elements (VARCHAR) - Empty element handling: 'null', 'string', 'object'
- auto_detect (BOOLEAN) - Enable automatic schema detection
- max_depth (INTEGER) - Maximum nesting depth
- unnest_as (VARCHAR) - How to unnest: 'columns' or 'struct'

Type Control:
- force_list (VARCHAR or LIST(VARCHAR)) - Elements always as LIST
- columns (ANY) - Explicit schema specification
- all_varchar (BOOLEAN) - Force all scalars to VARCHAR

Parameters Unique to read_html

filename (BOOLEAN) - Include filename in output

Assessment: read_html likely has this extra parameter because HTML files are often processed in batches, where tracking the source file matters.

Parameters Unique to read_xml

None. XML has no parameters that HTML lacks.
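
The comparison above amounts to a set difference between the two parameter lists. A minimal Python sketch, using only the parameter names listed in this document (the variable names are illustrative and not part of either function):

```python
# Parameter names copied from the lists above; variable names are illustrative.
SHARED = {
    # File processing
    "ignore_errors", "maximum_file_size", "union_by_name",
    # Schema inference
    "root_element", "record_element", "attr_mode", "attr_prefix",
    "text_key", "namespaces", "empty_elements", "auto_detect",
    "max_depth", "unnest_as",
    # Type control
    "force_list", "columns", "all_varchar",
}

html_params = SHARED | {"filename"}
xml_params = set(SHARED)

only_html = html_params - xml_params
only_xml = xml_params - html_params

print(len(html_params & xml_params))  # count of shared parameters
print(sorted(only_html))              # parameters unique to read_html
print(sorted(only_xml))               # parameters unique to read_xml
```

Keeping the lists as sets makes it cheap to rerun the audit whenever either function gains a parameter.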

Test Coverage Comparison

XML Test Files (26 total)

✓ xml_all_varchar.test
✓ xml_array_support.test
✓ xml_basic.test
✓ xml_complex_types.test
✓ xml_deep_hierarchies.test
✓ xml_document_analysis.test
✓ xml_document_formatting.test
✓ xml_enhanced_simple.test
✓ xml_enhanced_to_xml.test
✓ xml_force_list.test
✓ xml_function_fixes.test
✓ xml_hybrid_schemas.test
✓ xml_json_conversion.test
✓ xml_large_files.test
✓ xml_large_row_count.test
✓ xml_max_depth.test
✓ xml_replacement_scan.test
✓ xml_rss_feed.test
✓ xml_schema_errors.test
✓ xml_schema_validation.test
✓ xml_table_functions.test
✓ xml_type_casting.test
✓ xml_type_inference_order.test
✓ xml_union_by_name.test
✓ xml_validation.test
✓ xml_xpath_extraction.test

HTML Test Files (5 total)

✓ html_basic.test
✓ html_basic_functions.test
✓ html_entity_encoding.test
✓ html_extraction.test
✓ html_file_reading.test
html_schema_inference.test.future (disabled)

Missing HTML Test Coverage

Features tested for XML but NOT for HTML:

all_varchar - Force scalar types to VARCHAR
array_support - Array/list handling
complex_types - Complex nested structures
deep_hierarchies - Deep nesting scenarios
document_analysis - Document structure analysis
document_formatting - Output formatting
enhanced_simple - Enhanced simple queries
enhanced_to_xml - Conversion back to XML
force_list - Force elements to LIST type
function_fixes - Function-specific fixes
hybrid_schemas - Mixed schema handling
json_conversion - XML to JSON conversion
large_files - Large file handling
large_row_count - Many rows handling
max_depth - Max depth parameter
replacement_scan - Replacement scan functionality
rss_feed - RSS feed parsing (domain-specific to XML)
schema_errors - Error handling in schema inference
schema_validation - Schema validation
table_functions - Table function variations
type_casting - Type casting functionality
type_inference_order - Type inference priority
union_by_name - Union by name functionality
validation - General validation
xpath_extraction - XPath queries
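
The list above can be derived mechanically from the two sets of test file names. The following sketch assumes that a feature is identified by the file name with its xml_ or html_ prefix and .test suffix removed; that naming rule is an assumption drawn from the file lists in this document, and the disabled .test.future file is left out.

```python
XML_TESTS = [
    "xml_all_varchar.test", "xml_array_support.test", "xml_basic.test",
    "xml_complex_types.test", "xml_deep_hierarchies.test",
    "xml_document_analysis.test", "xml_document_formatting.test",
    "xml_enhanced_simple.test", "xml_enhanced_to_xml.test",
    "xml_force_list.test", "xml_function_fixes.test",
    "xml_hybrid_schemas.test", "xml_json_conversion.test",
    "xml_large_files.test", "xml_large_row_count.test", "xml_max_depth.test",
    "xml_replacement_scan.test", "xml_rss_feed.test",
    "xml_schema_errors.test", "xml_schema_validation.test",
    "xml_table_functions.test", "xml_type_casting.test",
    "xml_type_inference_order.test", "xml_union_by_name.test",
    "xml_validation.test", "xml_xpath_extraction.test",
]
HTML_TESTS = [
    "html_basic.test", "html_basic_functions.test",
    "html_entity_encoding.test", "html_extraction.test",
    "html_file_reading.test",
]


def feature(filename, prefix):
    """Strip the format prefix and the '.test' suffix from a test file name."""
    stem = filename[: -len(".test")]
    return stem[len(prefix):]


xml_features = {feature(name, "xml_") for name in XML_TESTS}
html_features = {feature(name, "html_") for name in HTML_TESTS}

# Features with an XML test file but no HTML test file of the same name.
missing = sorted(xml_features - html_features)
print(len(missing))
for name in missing:
    print(name)
```

A name-based comparison only shows which files are absent. It does not tell you whether an existing HTML test such as html_extraction.test already exercises part of a feature listed here.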

Recommendations

Priority 1: High-Value Tests (should apply to HTML)

These features are fundamental and should work identically in HTML:

Priority 2: Medium-Value Tests (likely applicable)

document_analysis - Structure analysis
hybrid_schemas - Mixed schemas
large_row_count - Many rows
replacement_scan - Replacement scan
schema_validation - Validation
table_functions - Function variations
validation - General validation

Priority 3: XML-Specific (may not apply to HTML)

These are specific to XML format and may not be relevant:

~~rss_feed~~ - RSS is XML-specific
~~json_conversion~~ - XML⟷JSON specific
~~enhanced_to_xml~~ - Conversion to XML
~~xpath_extraction~~ - XPath is primarily an XML technology (though it also works on parsed HTML)
~~document_formatting~~ - XML-specific formatting

Implementation Plan

Phase 1: Verify Parameter Functionality

Create basic tests to verify all shared parameters work correctly in read_html:

Test all_varchar parameter
Test force_list parameter
Test max_depth parameter
Test union_by_name with multiple HTML files
Test columns explicit schema
Test error handling with ignore_errors
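
One way to keep the Phase 1 checks uniform is to generate the queries from a small table of cases. The sketch below is only an illustration: it assumes read_html is a DuckDB table function that accepts named arguments in the `name := value` form, and the file paths and parameter values (such as 'li' and 2) are hypothetical placeholders, not files or settings from this repository.

```python
def sql_literal(value):
    """Render a Python bool, int, or str as a SQL literal."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return str(value)
    if isinstance(value, str):
        return "'" + value.replace("'", "''") + "'"
    raise TypeError(f"unsupported parameter type: {type(value).__name__}")


def build_query(function, path, **params):
    """Build a SELECT over a table function with named arguments."""
    args = [sql_literal(path)]
    args += [f"{name} := {sql_literal(value)}" for name, value in params.items()]
    return f"SELECT * FROM {function}({', '.join(args)});"


# Hypothetical paths and values, one case per Phase 1 item.
PHASE_1_CASES = [
    ("test/html/sample.html", {"all_varchar": True}),
    ("test/html/sample.html", {"force_list": "li"}),
    ("test/html/sample.html", {"max_depth": 2}),
    ("test/html/*.html", {"union_by_name": True}),
    ("test/html/broken.html", {"ignore_errors": True}),
]

for path, params in PHASE_1_CASES:
    print(build_query("read_html", path, **params))
```

The columns parameter is omitted because its value is a schema specification rather than a scalar, so it would need its own rendering rule.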

Phase 2: Add Critical Test Coverage

Port the most important XML tests to HTML equivalents:

html_all_varchar.test - Based on xml_all_varchar.test
html_array_support.test - List/array handling
html_complex_types.test - Nested structures
html_force_list.test - Force list parameter
html_type_inference.test - Type detection priority
html_union_by_name.test - Schema merging
html_schema_errors.test - Error handling

Phase 3: Comprehensive Coverage

Add remaining applicable tests:

html_deep_hierarchies.test
html_max_depth.test
html_large_files.test
html_type_casting.test
html_validation.test

Current Status

Parameter audit completed
Test coverage analysis completed
Create Priority 1 tests (5 test suites, 34 test cases)
Create Priority 2 tests (4 test suites, 28 test cases)
Document any HTML-specific limitations
Update README with HTML feature documentation

Test Suites Added

Priority 1 (Critical Features):

✅ html_all_varchar.test - 7 tests
✅ html_force_list.test - 7 tests
✅ html_union_by_name.test - 6 tests
✅ html_type_inference.test - 7 tests
✅ html_max_depth.test - 7 tests

Priority 2 (Important Features):

✅ html_complex_types.test - 7 tests
✅ html_schema_errors.test - 7 tests
✅ html_validation.test - 8 tests
✅ html_large_files.test - 7 tests

Total New Coverage: 9 test suites, 62 test cases

HTML Test Files: 5 → 14 files (180% increase)
Test Case Count: ~25 → ~87 cases (248% increase)
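
The percentage increases follow from the before and after counts as (new - old) / old. A short check, where the function name is illustrative:

```python
def percent_increase(old, new):
    """Relative growth from old to new, as a percentage of old."""
    return (new - old) * 100 / old


print(percent_increase(5, 14))   # HTML test files
print(percent_increase(25, 87))  # approximate test case count
```

The first call returns 180.0 and the second 248.0, matching the figures above. The test case counts are approximate in the source, so the second result is only as accurate as those estimates.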

Notes

HTML parsing uses libxml2's HTML parser, which is more lenient than the XML parser.
HTML may have different nesting structures (e.g., implicit tags like <tbody>).
Some HTML documents may not have well-defined "records" the way XML documents do.
XPath works on HTML once the document has been parsed into an XML-style tree structure.
