Revise root server analysis script #164

The secondary analysis of M3 data starts with the script
load_l_root_folders.py, which uses the Python module m3summary.py. I
just wrote a description of what this script does on the ithitools wiki
page:
https://github.com/private-octopus/ithitools/wiki/Secondary-root-server-statistics.

We have discussed the output of this script in the past. If I
summarize correctly, the issues were:

1) The name "useless" carries a value judgement. It would be better to
replace it with something neutral, like "repeated".

2) The summary lines mix atomic counts, like the number of DGA
queries, and subtotals, like the number of NX domain queries or the
number of "other" queries.

3) It is unclear how the categories relate to the published M3
submetrics (M3.1, etc.).

4) We may want to add a few more atomic categories.

I think there are some easy fixes, without changing the ithitools code:

1) Replace "useless" with "repeated". Possibly replace "useful" with
something else. Suggestions?

2) Remove the "subtotals": "queries", "nx_domain", and "others".

3) Replace the name "rfc6761" with "other_rfc6761" for clarity, implying
that it excludes .local and .localhost. The sum of local, localhost and
other_rfc6761 can be used to recompute the metric M3.3.1.

4) Add "other_frequent_names" after .mail: the total queries for
"frequently found TLDs" excluding home, lan, internal, ip, localdomain,
corp, and mail. The sum of other_frequent_names, home, lan, internal,
ip, localdomain, corp, and mail can be used to recompute the metric M3.3.2.

5) Move dga and jumbo after other_frequent_names, to make it clear that
these do not include queries to RFC6761 or frequent names.

6) Add columns for the frequently found categories listed in M3.3:
bad_syntax, binary, ipv4, numeric. Question: should I also add entries
for domain names of length 1 to 6? These account for a bit more than 1%
of all queries, and the total of dga, jumbo, bad_syntax, binary, ipv4,
numeric and length 1-6 would match the definition of M3.3.3.

7) Define "other_names" as the count of all NX domain queries minus
those listed in the previous categories, so the sum of all columns from
"local" to "other_names" equals the total number of queries.

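The column arithmetic in items 3 to 7 can be sketched as follows. This is
only an illustration, not the m3summary.py code: the helper names are
hypothetical, "short_names" is a placeholder name for the length 1-6
category (which this issue leaves undecided), and the counts in the usage
example are made-up values.

```python
# Column groups as proposed in this issue. "short_names" is a hypothetical
# column name for domain names of length 1 to 6.
RFC6761_COLUMNS = ["local", "localhost", "other_rfc6761"]
FREQUENT_NAME_COLUMNS = ["home", "lan", "internal", "ip", "localdomain",
                         "corp", "mail", "other_frequent_names"]
FREQUENT_PATTERN_COLUMNS = ["dga", "jumbo", "bad_syntax", "binary", "ipv4",
                            "numeric", "short_names"]


def group_total(row, columns):
    """Sum the atomic counts of the given columns (missing columns count as 0)."""
    return sum(row.get(column, 0) for column in columns)


def m33_counts(row):
    """Query counts from which M3.3.1, M3.3.2 and M3.3.3 can be recomputed."""
    return {
        "M3.3.1": group_total(row, RFC6761_COLUMNS),
        "M3.3.2": group_total(row, FREQUENT_NAME_COLUMNS),
        "M3.3.3": group_total(row, FREQUENT_PATTERN_COLUMNS),
    }


def other_names(row, nx_domain_total):
    """NX domain queries not accounted for by any listed category (item 7)."""
    return nx_domain_total - sum(m33_counts(row).values())


# Made-up counts, for illustration only.
example_row = {
    "local": 10, "localhost": 5, "other_rfc6761": 2,
    "home": 8, "lan": 4, "internal": 3, "ip": 1, "localdomain": 1,
    "corp": 2, "mail": 1, "other_frequent_names": 6,
    "dga": 20, "jumbo": 3, "bad_syntax": 2, "binary": 1, "ipv4": 1,
    "numeric": 1, "short_names": 4,
}
example_counts = m33_counts(example_row)
example_other = other_names(example_row, nx_domain_total=100)
```

Keeping only atomic counts in the row means each subtotal is derived in one
place, which is the point of removing "queries", "nx_domain" and "others"
from the summary lines.
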
If we do change the ithitools code, we can add the following:

* Change the way the overflow of the "frequent names" is computed, so
the names matching the encoded "frequent name" list are listed as
"other_frequent_names", instead of being lumped with the dga, jumbo and
short names categories.

* Split the dga definition between dga_single (single-part dga names)
and dga_multi (multi-part dga names).

* Add the "jumbo" pattern to the M3.3.3 list.

* Maybe remove the metric M3.3.4, since pretty much every component of
it would match the "frequent names" or "frequent patterns" classes.

* Isolate other categories as suggested.
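
As a rough sketch of the dga_single / dga_multi split proposed above,
assuming that "part" means a DNS label and that the name has already been
classified as DGA by the existing ithitools logic (neither is confirmed
here), the column could be chosen by counting labels. The function name
dga_column is hypothetical.

```python
def dga_column(name):
    """Pick the proposed column for a name already classified as DGA.

    Assumption: a "part" is a DNS label, so a name with a single non-empty
    label goes to dga_single and anything longer goes to dga_multi.
    """
    # Drop the trailing root dot, if any, and ignore empty labels.
    labels = [label for label in name.rstrip(".").split(".") if label]
    return "dga_single" if len(labels) <= 1 else "dga_multi"
```

For example, under these assumptions a query for "qwertyuiopas." would be
counted in dga_single, and "abc.qwertyuiopas" in dga_multi.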

