XML parser

The json, xml and html parsers have the most complex source files after the dynamic parser. It is a JSON file, and it can ΝΟΤ run only by providing an url, a lot of configuration is needed.

It is a must to know the structure of the response that is going to be parsed.

Scrape

The XML response will be parsed into a json using fast-xml-parser and we apply the following options:

new XMLParser({
    allowBooleanAttributes: true,
    attributeNamePrefix: '',
    cdataPropName: '__cdata',
    commentPropName: '__comment',
    textNodeName: '__text',
    ignoreAttributes: false,
    ignoreDeclaration: false,
    ignorePiTags: false,
    numberParseOptions: {
        leadingZeros: false,
        hex: false,
        eNotation: false,
    },
    preserveOrder: false,
    processEntities: true,
    removeNSPrefix: false,
    trimValues: true,
    unpairedTags: unpairedTags // Like <br>
});

So for example the following xml code:

<?xml version='1.0' encoding='utf-8'?>
<rss version='2.0' xmlns:atom='http://www.w3.org/2005/Atom'>
    <channel>
        <atom:link href='https://example.com' rel='self'
                   type='application/rss+xml'/>
        <title>Channel title</title>
        <link>https://example.com/channel-link</link>
        <item>
            <title attr="test">Item 1</title>
            <pubDate>Sun, 31 Jul 2022 23:59:39 +0300</pubDate>
            <guid isPermaLink='false'>Sun, 31 Jul 2022 23:59:39 +0300416759</guid>
        </item>
        <item>
            <title>Item 2</title>
            <guid isPermaLink='false'>Tue, 01 Mar 2022 20:13:46 +0300391866</guid>
        </item>
    </channel>
</rss>

will be transformed into:

{
    "?xml": {
        "version": "1.0",
        "enconding": "utf-8"
    },
    "rss": {
        "channel": {
            "atom:link": {
                "href": "https://example.com",
                "rel": "self",
                "type": "application/rss+xml"
            },
            "title": "Channel title",
            "link": "https://example.com/channel-link",
            "item": [
                {
                    "title": {
                        "__text": "Item 1",
                        "attr": "test"
                    },
                    "pubDate": "Sun, 31 Jul 2022 23:59:39 +0300",
                    "guid": {
                        "__text": "Sun, 31 Jul 2022 23:59:39 +0300416759",
                        "isPermaLink": "false"
                    }
                },
                {
                    "title": "Item 2",
                    "guid": {
                        "__text": "Tue, 01 Mar 2022 20:13:46 +0300391866",
                        "isPermaLink": "false"
                    }
                }
            ]
        }
    }
}

After that the only think that remains is to parse it as a JSON. That means that the xml parser's source file is almost identical to the one from the json parser.

`unpairedTags`

From fast-xml-parser documentation:

Unpaired Tags are the tags which don't have matching closing tag. Eg <br> in HTML. You can parse unpaired tags by providing their list to the parser, validator and builder.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Files

xml.md

xml.md

XML parser

Scrape

`unpairedTags`

Files

xml.md

Latest commit

History

xml.md

File metadata and controls

XML parser

Scrape

unpairedTags

`unpairedTags`