Custom Crawler Dynamics

I use cef.browser to connect to the Chrome DevTools Protocol.

https://chromedevtools.github.io/

Dynamic DOM Analysis

First, get the node id from the backend node id.

https://chromedevtools.github.io/devtools-protocol/tot/DOM/#method-pushNodesByBackendIdsToFrontend

custom-crawler/CustomCrawler/CustomCrawlerDynamicsHover.xaml.cs, line 67 in 47e1dc0

var nn = await CustomCrawlerDynamics.ss.SendAsync(new PushNodesByBackendIdsToFrontendCommand { BackendNodeIds = new long[] { nodeid } });

Second, get the node's stack trace information.

https://chromedevtools.github.io/devtools-protocol/tot/DOM/#method-getNodeStackTraces

custom-crawler/CustomCrawler/CustomCrawlerDynamicsHover.xaml.cs, line 68 in 47e1dc0

var mm = await CustomCrawlerDynamics.ss.SendAsync(new GetNodeStackTracesCommand { NodeId = nn.Result.NodeIds[0] });
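For readers who are not using the same client library, the two calls above correspond to the following raw protocol messages. This is a minimal sketch, not the project's code: the CdpMessages class and its method names are hypothetical, and only the protocol method and parameter names come from the DevTools Protocol documentation. It also includes DOM.setNodeStackTracesEnabled, because the protocol leaves node stack trace capture disabled by default.

```csharp
using System;
using System.Text.Json;

// Hypothetical helper: builds the JSON text of DevTools Protocol requests.
public static class CdpMessages
{
    // A protocol request has the shape {"id":..., "method":..., "params":{...}}.
    public static string Build(int id, string method, object parameters) =>
        JsonSerializer.Serialize(new { id, method, @params = parameters });

    // Stack traces are only captured for nodes created after this is enabled.
    public static string EnableNodeStackTraces(int id) =>
        Build(id, "DOM.setNodeStackTracesEnabled", new { enable = true });

    // Step 1: resolve a backend node id to a (frontend) node id.
    public static string PushNodesByBackendIds(int id, long backendNodeId) =>
        Build(id, "DOM.pushNodesByBackendIdsToFrontend",
              new { backendNodeIds = new[] { backendNodeId } });

    // Step 2: ask for the stack trace recorded when the node was created.
    public static string GetNodeStackTraces(int id, long nodeId) =>
        Build(id, "DOM.getNodeStackTraces", new { nodeId });
}
```

The strings would be sent over whatever DevTools connection is in use; the transport is left out here because it depends on the client.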

Find Downloaded (Requested) Data Associated With Dynamically Added Elements

For data shown in the Network tab of the developer tools (files such as .js, .json, .html), you can see which code made the request. Because a stack trace is recorded per function, the request route differs from the element creation route implemented in the previous post. Combining the two routes well is how you reach the goal.

The request route and the element creation route must differ, because a function that issues a download request cannot do a second thing, such as adding nodes, at the same time. Looking at the picture above, how can we tell whether e.data influenced element creation?

The two routes are identical up to Load (requested by an HTML script) -> a.js -> b.js -> c.js. After that, the element route calls f.js and the request route calls d.js. So the branch is in c.js.

In c.js, a function receives data through d.js and then calls f.js. Assuming that e.data always affects element creation, d.js is called first and f.js afterwards. So the check is whether d.js was called along the element creation route.

Static analysis is needed to do this accurately, but there is a way to verify it with high probability without static analysis. The method is as follows.

1. Calculate the parse tree path of the part of c.js that calls f.js.
2. Calculate the parse tree path of the part of c.js that calls d.js.
3. Compare the two paths to check whether they share a common prefix of function, call, and block nodes, and how long that prefix is.
4. If the length is 0, the two operations are unrelated with high probability.
5. If the length is 1 or more, they can be considered related with high probability.

The only exception is step 4 of the method above: when the .js file runs code immediately, without any function declaration. In this case the two paths are the same from the load stage, so you can check from there whether they are related. In addition, though it is rare, element creation is sometimes performed with data loaded in advance, which cannot be confirmed without static analysis.
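The path comparison described above can be sketched as follows. This is an illustration under assumptions, not the project's pick_candidate: the PathNode record and the ParsePathCompare class are hypothetical, a path is assumed to be listed from the root of the file down to the call site, and the node kinds use ESTree type names as an assumed stand-in for "function, call, and block".

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical path element: node kind plus its start position in the file.
public record PathNode(string Kind, int Line, int Column);

public static class ParsePathCompare
{
    // Assumption: only these ESTree node kinds count toward the shared prefix.
    static readonly HashSet<string> Relevant = new()
    {
        "FunctionDeclaration", "FunctionExpression", "ArrowFunctionExpression",
        "CallExpression", "BlockStatement"
    };

    // Number of leading function/call/block nodes that both paths share.
    public static int CommonPrefixLength(IEnumerable<PathNode> a, IEnumerable<PathNode> b)
    {
        var fa = a.Where(n => Relevant.Contains(n.Kind)).ToList();
        var fb = b.Where(n => Relevant.Contains(n.Kind)).ToList();

        int i = 0;
        // Records compare by value, so equal kind and position means the same node.
        while (i < fa.Count && i < fb.Count && fa[i] == fb[i])
            i++;
        return i;
    }
}
```

A result of 0 corresponds to "unrelated with high probability"; anything larger means the two call sites sit inside at least one common function, call, or block.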

Pre-processing is essential to make this work easier. Save all downloaded .js data and build their parse trees. In addition, the call path of every downloaded file should be marked on the .js file it came from. Then, when a request is made to find the data associated with a particular element, the results can be obtained in the following way. I use the picture above as an example.

1. Start in reverse order of element creation.
2. List all the files downloaded from f.js.
3. First, compare whether the request is made in the same function, and compare the line/column number.
- If they match, it is a perfect match. This file affects element creation.
- If they do not match, check by comparing parse trees.
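The lookup above can be sketched roughly as below. Everything here is hypothetical scaffolding for illustration: the Frame and Request records, the MatchKind enum, and the RequestMatcher class are not part of the project, and the creation stack is assumed to be ordered innermost frame first, which is what "reverse order of element creation" is taken to mean.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical call frame: script URL, function name, and position.
public record Frame(string Url, string Function, int Line, int Column);

// Hypothetical downloaded resource with the call stack that requested it.
public record Request(string ResourceUrl, IReadOnlyList<Frame> InitiatorStack);

public enum MatchKind { Exact, NeedsParseTreeComparison }

public static class RequestMatcher
{
    public static Dictionary<Request, MatchKind> Find(
        IReadOnlyList<Frame> creationStack, IReadOnlyList<Request> requests)
    {
        var result = new Dictionary<Request, MatchKind>();

        // Walk the element creation route, innermost frame first.
        foreach (var frame in creationStack)
        {
            foreach (var req in requests)
            {
                // Only requests issued from the same script are candidates.
                if (!req.InitiatorStack.Any(f => f.Url == frame.Url))
                    continue;

                // Same function, line, and column: the request affects the element.
                if (req.InitiatorStack.Contains(frame))
                    result[req] = MatchKind.Exact;
                else if (!result.ContainsKey(req))
                    result[req] = MatchKind.NeedsParseTreeComparison;
            }
        }
        return result;
    }
}
```

Requests marked NeedsParseTreeComparison are the ones that would go on to the parse tree path comparison; requests from unrelated scripts never enter the result.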

Note) If you know only the .js file and the line and column number, you can obtain the parse tree path as follows. There are many JS parsers, so you can use any of them.

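// Appends to result the chain of nodes, outermost first, whose source range contains (line, column).
// bb is the author's helper type (not shown here); it is assumed to be an INode that carries
// only a start position, so it can serve as the search key.
void find_internal(ref List<INode> result, IEnumerable<INode> node, int line, int column)
{
    if (node == null)
        return;

    // Materialize once so the sequence is not enumerated repeatedly.
    var nrr = node.ToList();
    if (nrr.Count == 0)
        return;

    // Sibling nodes are assumed to be ordered by start position.
    var ii = nrr.BinarySearch(new bb(line, column), Comparer<INode>.Create((x, y) =>
    {
        if (x.Location.Start.Line != y.Location.Start.Line)
            return x.Location.Start.Line.CompareTo(y.Location.Start.Line);
        return x.Location.Start.Column.CompareTo(y.Location.Start.Column);
    }));

    if (nrr.Count == 1)
        ii = 0;

    // No exact hit: ~ii is the first node starting after the position,
    // so the candidate is the node just before it.
    if (ii < 0)
        ii = ~ii - 1;

    if (ii < 0 || ii >= nrr.Count)
        return;

    var z = nrr[ii];
    var start = z.Location.Start;
    var end = z.Location.End;

    // Reject the candidate if the position lies outside its range. Columns are
    // checked on the first and last line, also for nodes that span several lines.
    if (start.Line > line || (start.Line == line && start.Column > column))
        return;
    if (end.Line < line || (end.Line == line && end.Column < column))
        return;

    result.Add(z);

    find_internal(ref result, z.ChildNodes, line, column);
}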

The pick_candidate method below is the function that compares two parse trees.

custom-crawler/CustomCrawler/CustomCrawlerDynamics.xaml.cs, line 470 in 47e1dc0

public List<(CallFrame, int, int, int)> pick_candidate(string url, List<Esprima.Ast.INode> node, string function_name, int line, int column)
