Revise root server analysis script #164

@huitema

The secondary analysis of M3 data starts with the script
load_l_root_folders.py, which uses the Python module m3summary.py. I
just wrote a description of what this script does on the ithitools wiki
page:
https://github.com/private-octopus/ithitools/wiki/Secondary-root-server-statistics.

We have discussed the output of this script in the past. If I
summarize correctly, the issues were:

  1. The name "useless" carries a value judgement. It would be better to
    replace it by something neutral, like "repeated".

  2. The summary lines mix atomic counts, like the number of DGA
    queries, and subtotals, like the number of NX domain queries or the
    number of "other" queries.

  3. It is unclear how the categories relate to the published M3
    submetrics (M3.1, etc.).

  4. We may want to add a few more atomic categories.

I think there are some easy fixes, without changing the ithitools code:

  1. Replace "useless" with "repeated". Possibly replace "useful" with
    something else. Suggestions?

  2. Remove the "subtotals": "queries", "nx_domain", and "others".

  3. Replace the name "rfc6761" by "other_rfc6761" for clarity -- implying
    it excludes .local and .localhost. The sum of local, localhost and
    other_rfc6761 can be used to recompute the metric M3.3.1.

  4. Add "other_frequent_names" after .mail -- total queries for
    "frequently found TLD" excluding home, lan, internal, ip, localdomain,
    corp, and mail. The sum of other_frequent_names, home, lan, internal,
    ip, localdomain, corp, and mail can be used to recompute the metric M3.3.2.

  5. Move dga and jumbo after other_frequent_names, to make it clear that
    these do not include queries to RFC6761 or frequent names.

  6. Add columns for the frequently found categories listed in M3.3:
    bad_syntax, binary, ipv4, numeric. Question: should I also add entries
    for domain names of length 1 to 6? This is a bit more than 1% of all
    total queries, and the total of dga, jumbo, bad_syntax, binary, ipv4,
    numeric and length 1-6 would match the definition of M3.3.3.

  7. Define "other_names" as the count of all NX domain queries minus
    those listed in the previous categories, so the sum of all columns from
    "local" to "other_names" equals the total number of queries.
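
The easy fixes above mostly amount to renaming columns, dropping subtotals and defining a few sums. A minimal sketch of that arithmetic in Python, assuming each summary line is available as a dict keyed by column name. The actual layout produced by m3summary.py may differ, and the helper names below are hypothetical:

```python
# Column names follow the proposal above; the real header written by
# m3summary.py may differ (assumption).
RENAMES = {"useless": "repeated", "rfc6761": "other_rfc6761"}
SUBTOTALS = {"queries", "nx_domain", "others"}

RFC6761_COLUMNS = ["local", "localhost", "other_rfc6761"]
FREQUENT_COLUMNS = ["home", "lan", "internal", "ip", "localdomain",
                    "corp", "mail", "other_frequent_names"]
# Lengths 1 to 6 are left out here because that is still an open question.
PATTERN_COLUMNS = ["dga", "jumbo", "bad_syntax", "binary", "ipv4", "numeric"]


def revise_row(row):
    """Fixes 1 to 3: rename columns and drop the subtotal columns."""
    return {RENAMES.get(k, k): v for k, v in row.items() if k not in SUBTOTALS}


def category_sums(row):
    """Raw counts from which M3.3.1 and M3.3.2 could be recomputed."""
    return {
        "rfc6761_total": sum(row.get(c, 0) for c in RFC6761_COLUMNS),
        "frequent_total": sum(row.get(c, 0) for c in FREQUENT_COLUMNS),
    }


def add_other_names(row, nx_domain_total):
    """Fix 7: other_names = NX domain queries minus all listed categories."""
    listed = sum(row.get(c, 0)
                 for c in RFC6761_COLUMNS + FREQUENT_COLUMNS + PATTERN_COLUMNS)
    out = dict(row)
    out["other_names"] = nx_domain_total - listed
    return out
```

The sums are kept as raw counts, since turning them into the published M3.3.x values needs a denominator that this sketch does not assume.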

If we do change the ithitools code, we can add the following:

  • Change the way the overflow of the "frequent names" is computed, so
    the names matching the encoded "frequent name" list are listed as
    "other_frequent_names", instead of being lumped with the dga, jumbo and
    short names categories.

  • Split the dga definition between dga_single (single-part dga names)
    and dga_multi (multi-part dga names).

  • Add the "jumbo" pattern to the M3.3.3 list.

  • Maybe remove the metric M3.3.4, since nearly every component of it
    will match the "frequent names" or "frequent patterns" classes.

  • Isolate other categories per suggestion.
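
To illustrate the proposed dga_single / dga_multi split, here is a small Python sketch. It assumes a name has already been classified as DGA by some other means (that detection is not shown and is not described in this issue), and it only decides which of the two columns the name would be counted in. The function name is hypothetical, and the real change would be made in the ithitools code:

```python
from collections import Counter


def dga_column(name):
    """Pick the column for a name that was already classified as DGA.

    A name with one label counts as dga_single, anything with more
    labels as dga_multi. A trailing root dot is ignored.
    """
    labels = [label for label in name.strip(".").split(".") if label]
    return "dga_single" if len(labels) <= 1 else "dga_multi"


def count_dga(names):
    """Tally dga_single and dga_multi over an iterable of DGA names."""
    return Counter(dga_column(n) for n in names)
```

For example, `count_dga(["abcdefghijkl", "abcdef.ghijkl."])` would count one name in each column.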
