Regression on parsing invalid URLs #2382

kamil-certat · 2023-06-22T11:37:42Z

As a continuation of #2377, we have a regression in parsing invalid URLs. Previously, urllib was much more liberal in processing URLs; now it rejects many more cases.

We use it to sanitize URLs, and html_parser is an example of a bot that relies on the liberal behavior in its tests:

intelmq/intelmq/tests/bots/parsers/html_table/test_parser_column_split.py

Line 47 in 61c45ac

    EXAMPLE_EVENT2['source.url'] = "http://[D] lingvaworld.ru/media/system/css/messg.jpg"

intelmq/intelmq/tests/bots/parsers/html_table/test_parser_column_split.py

Lines 73 to 80 in 61c45ac

    def test_event_without_split(self):
        self.sysconfig = {"columns": ["time.source", "source.url", "malware.hash.md5",
                                      "source.ip", "__IGNORE__"],
                          "skip_head": True,
                          "default_url_protocol": "http://",
                          "type": "malware-distribution"}
        self.run_bot()
        self.assertMessageEqual(0, EXAMPLE_EVENT2)

In patched Python versions (e.g. 3.11.4), this URL is rejected. We need to either decide against allowing such URLs, or redesign our sanitization.

Temporarily, the test is skipped to unblock other work.
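
To make the difference concrete, here is a minimal sketch of how the rejection shows up when calling `urllib.parse.urlparse` directly. The helper name `parse_url_or_none` is hypothetical and is not part of IntelMQ; it only illustrates that patched Python versions raise `ValueError` for the test URL, while older versions return a parse result with a bogus netloc. The outcome printed depends on the Python version running it.

```python
from typing import Optional
from urllib.parse import ParseResult, urlparse


def parse_url_or_none(url: str) -> Optional[ParseResult]:
    """Hypothetical helper: return the parsed URL, or None if urllib rejects it."""
    try:
        return urlparse(url)
    except ValueError:
        # Stricter urllib versions raise ValueError for malformed bracketed hosts.
        return None


# The URL used in test_parser_column_split.py
url = "http://[D] lingvaworld.ru/media/system/css/messg.jpg"
parsed = parse_url_or_none(url)
if parsed is None:
    print("rejected by urllib")
else:
    print("accepted, netloc =", repr(parsed.netloc))
```

Catching the exception like this would be one way to decide per URL whether to drop it or fall back to custom sanitization; which of the two options above is preferable is still open.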


Stricter validation in urllib causes trouble when processing invalid URLs. The correct solution on our side is currently unclear, see certtools#2382

kamil-certat added the bug, component: bots, and component: core labels Jun 22, 2023
