Date: 2025-11-27
Issue: #18
Branch: issue-18-html-xml-feature-parity
Analysis shows that read_html and read_xml have near-complete parameter parity at the API level: all 16 core parameters are present in both functions, and read_html additionally accepts filename. However, there is a significant test coverage gap: XML has 26 test files while HTML has only 5 active ones.
Both read_html and read_xml support these parameters:
File Processing:
- ignore_errors (BOOLEAN) - Skip invalid files
- maximum_file_size (BIGINT) - File size limit
- union_by_name (BOOLEAN) - Merge schemas by column name

Schema Inference:
- root_element (VARCHAR) - Starting element for parsing
- record_element (VARCHAR) - XPath/tag for row elements
- attr_mode (VARCHAR) - Attribute handling: 'columns', 'prefixed', 'map', 'discard'
- attr_prefix (VARCHAR) - Prefix when attr_mode='prefixed'
- text_key (VARCHAR) - Key for mixed text content
- namespaces (VARCHAR) - Namespace handling: 'strip', 'expand', 'keep'
- empty_elements (VARCHAR) - Empty element handling: 'null', 'string', 'object'
- auto_detect (BOOLEAN) - Enable automatic schema detection
- max_depth (INTEGER) - Maximum nesting depth
- unnest_as (VARCHAR) - How to unnest: 'columns' or 'struct'

Type Control:
- force_list (VARCHAR or LIST(VARCHAR)) - Elements always returned as LIST
- columns (ANY) - Explicit schema specification
- all_varchar (BOOLEAN) - Force all scalars to VARCHAR
Only in read_html:
- filename (BOOLEAN) - Include filename in output

Assessment: read_html likely has this extra parameter because HTML files are often processed in batches, where tracking the source file matters.

Only in read_xml: none. read_xml has no parameters that read_html lacks.
XML test files (26):
- ✓ xml_all_varchar.test
- ✓ xml_array_support.test
- ✓ xml_basic.test
- ✓ xml_complex_types.test
- ✓ xml_deep_hierarchies.test
- ✓ xml_document_analysis.test
- ✓ xml_document_formatting.test
- ✓ xml_enhanced_simple.test
- ✓ xml_enhanced_to_xml.test
- ✓ xml_force_list.test
- ✓ xml_function_fixes.test
- ✓ xml_hybrid_schemas.test
- ✓ xml_json_conversion.test
- ✓ xml_large_files.test
- ✓ xml_large_row_count.test
- ✓ xml_max_depth.test
- ✓ xml_replacement_scan.test
- ✓ xml_rss_feed.test
- ✓ xml_schema_errors.test
- ✓ xml_schema_validation.test
- ✓ xml_table_functions.test
- ✓ xml_type_casting.test
- ✓ xml_type_inference_order.test
- ✓ xml_union_by_name.test
- ✓ xml_validation.test
- ✓ xml_xpath_extraction.test
HTML test files (5 active, 1 disabled):
- ✓ html_basic.test
- ✓ html_basic_functions.test
- ✓ html_entity_encoding.test
- ✓ html_extraction.test
- ✓ html_file_reading.test
- html_schema_inference.test.future (disabled)
Features tested for XML but NOT for HTML:
- all_varchar - Force scalar types to VARCHAR
- array_support - Array/list handling
- complex_types - Complex nested structures
- deep_hierarchies - Deep nesting scenarios
- document_analysis - Document structure analysis
- document_formatting - Output formatting
- enhanced_simple - Enhanced simple queries
- enhanced_to_xml - Conversion back to XML
- force_list - Force elements to LIST type
- function_fixes - Function-specific fixes
- hybrid_schemas - Mixed schema handling
- json_conversion - XML to JSON conversion
- large_files - Large file handling
- large_row_count - Many rows handling
- max_depth - Max depth parameter
- replacement_scan - Replacement scan functionality
- rss_feed - RSS feed parsing (domain-specific to XML)
- schema_errors - Error handling in schema inference
- schema_validation - Schema validation
- table_functions - Table function variations
- type_casting - Type casting functionality
- type_inference_order - Type inference priority
- union_by_name - Union by name functionality
- validation - General validation
- xpath_extraction - XPath queries
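The gap list above can be reproduced mechanically from the two file listings. The following is a minimal sketch, assuming a feature name is the file name with its xml_ or html_ prefix and .test suffix removed; the helper name feature_name is hypothetical and not part of the repository.

```python
# Test file names copied from the two listings in this analysis.
# The disabled html_schema_inference.test.future is intentionally left out.
XML_TESTS = """
xml_all_varchar.test xml_array_support.test xml_basic.test
xml_complex_types.test xml_deep_hierarchies.test xml_document_analysis.test
xml_document_formatting.test xml_enhanced_simple.test xml_enhanced_to_xml.test
xml_force_list.test xml_function_fixes.test xml_hybrid_schemas.test
xml_json_conversion.test xml_large_files.test xml_large_row_count.test
xml_max_depth.test xml_replacement_scan.test xml_rss_feed.test
xml_schema_errors.test xml_schema_validation.test xml_table_functions.test
xml_type_casting.test xml_type_inference_order.test xml_union_by_name.test
xml_validation.test xml_xpath_extraction.test
""".split()

HTML_TESTS = """
html_basic.test html_basic_functions.test html_entity_encoding.test
html_extraction.test html_file_reading.test
""".split()


def feature_name(filename, prefix):
    """Strip the format prefix and the .test suffix (hypothetical helper)."""
    return filename[len(prefix):-len(".test")]


xml_features = {feature_name(name, "xml_") for name in XML_TESTS}
html_features = {feature_name(name, "html_") for name in HTML_TESTS}

# Features that have an XML test file but no HTML test file of the same name.
xml_only = sorted(xml_features - html_features)

for feature in xml_only:
    print(feature)
print(len(xml_only), "features tested for XML only")
```

Only basic appears on both sides, which is why the list has 25 entries rather than 26. Matching on file name is a rough proxy: an HTML test with a different name may still exercise the same feature.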
These features are fundamental and should work identically in HTML:
- all_varchar - Test that the parameter works
- array_support - List/array handling
- complex_types - Nested structures
- deep_hierarchies - Deep nesting
- force_list - Force list parameter
- large_files - Large file support
- max_depth - Max depth parameter
- schema_errors - Error handling
- type_casting - Type conversions
- type_inference_order - Correct type priority
- union_by_name - Schema merging

- document_analysis - Structure analysis
- hybrid_schemas - Mixed schemas
- large_row_count - Many rows
- replacement_scan - Replacement scan
- schema_validation - Validation
- table_functions - Function variations
- validation - General validation
These are specific to the XML format and may not be relevant to HTML:
- rss_feed - RSS is XML-specific
- json_conversion - XML⟷JSON specific
- enhanced_to_xml - Conversion to XML
- xpath_extraction - XPath is primarily an XML feature (though it also works on parsed HTML)
- document_formatting - XML-specific formatting
Create basic tests to verify all shared parameters work correctly in read_html:
- Test all_varchar parameter
- Test force_list parameter
- Test max_depth parameter
- Test union_by_name with multiple HTML files
- Test columns explicit schema
- Test error handling with ignore_errors
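As a starting point for these parameter tests, the queries can be generated from a small table of cases. This is only a sketch: the fixture paths and the argument values ('item', 2) are hypothetical placeholders, and columns is left out because this analysis does not specify the value format it expects. The name = value form is DuckDB's named-argument syntax for table functions.

```python
# Hypothetical fixture paths; this analysis does not name any test data files.
SINGLE_FILE = "test/html/sample.html"
MANY_FILES = "test/html/*.html"

# Parameter name -> (source path, SQL literal). All values are illustrative.
PARAM_CASES = {
    "all_varchar": (SINGLE_FILE, "true"),
    "force_list": (SINGLE_FILE, "'item'"),
    "max_depth": (SINGLE_FILE, "2"),
    "union_by_name": (MANY_FILES, "true"),
    "ignore_errors": (MANY_FILES, "true"),
}


def build_query(param, path, literal):
    """Return one read_html query that sets a single named parameter."""
    return f"SELECT * FROM read_html('{path}', {param} = {literal});"


queries = [
    build_query(param, path, literal)
    for param, (path, literal) in PARAM_CASES.items()
]

for query in queries:
    print(query)
```

Each generated query exercises exactly one parameter, so a failure points at a single option rather than at an interaction between several.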
Port the most important XML tests to HTML equivalents:
- html_all_varchar.test - Based on xml_all_varchar.test
- html_array_support.test - List/array handling
- html_complex_types.test - Nested structures
- html_force_list.test - Force list parameter
- html_type_inference.test - Type detection priority
- html_union_by_name.test - Schema merging
- html_schema_errors.test - Error handling
Add remaining applicable tests:
- html_deep_hierarchies.test
- html_max_depth.test
- html_large_files.test
- html_type_casting.test
- html_validation.test
- Parameter audit completed
- Test coverage analysis completed
- Create Priority 1 tests (5 test suites, 34 test cases)
- Create Priority 2 tests (4 test suites, 28 test cases)
- Document any HTML-specific limitations
- Update README with HTML feature documentation
Priority 1 (Critical Features):
- ✅ html_all_varchar.test - 7 tests
- ✅ html_force_list.test - 7 tests
- ✅ html_union_by_name.test - 6 tests
- ✅ html_type_inference.test - 7 tests
- ✅ html_max_depth.test - 7 tests
Priority 2 (Important Features):
- ✅ html_complex_types.test - 7 tests
- ✅ html_schema_errors.test - 7 tests
- ✅ html_validation.test - 8 tests
- ✅ html_large_files.test - 7 tests
Total New Coverage: 9 test suites, 62 test cases
HTML Test Files: 5 → 14 files (180% increase)
Test Case Count: ~25 → ~87 cases (248% increase)
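The percentages are relative increases over the starting count, not ratios of new to old. A short check, using the approximate case counts quoted above as given:

```python
def percent_increase(before, after):
    """Relative growth from before to after, expressed in percent."""
    return (after - before) / before * 100


file_growth = round(percent_increase(5, 14))
case_growth = round(percent_increase(25, 87))

print(f"HTML test files: {file_growth}% increase")
print(f"Test cases: {case_growth}% increase")
```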
- HTML parsing uses libxml2's HTML parser, which is more lenient than the XML parser
- HTML may have different nesting structures (e.g., implicit tags like <tbody>)
- Some HTML documents may not have well-defined "records" the way XML documents do
- XPath works on HTML once the document has been parsed into an XML-style tree structure
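The difference in strictness noted above can be illustrated without libxml2. The sketch below uses Python's standard-library parsers as stand-ins (an assumption made for portability; they are not the parsers the extension uses): the strict XML parser rejects an unclosed <br>, while the HTML parser accepts the same input.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Valid HTML, but not well-formed XML: <br> is never closed.
SNIPPET = "<p>one<br>two</p>"


class TagCollector(HTMLParser):
    """Record the name of every start tag the HTML parser reports."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def parses_as_xml(text):
    """Return True if the strict XML parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


collector = TagCollector()
collector.feed(SNIPPET)
collector.close()

xml_ok = parses_as_xml(SNIPPET)
print("accepted by XML parser:", xml_ok)
print("tags seen by HTML parser:", collector.tags)
```

For HTML tests this suggests that malformed-input cases ported from XML may need different expectations, since input that is an error for read_xml can be valid for read_html.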