The secondary analysis of M3 data starts with the script
load_l_root_folders.py, which uses the Python module m3summary.py. I
just wrote a description of what this script does in the ithitools wiki
page:
https://github.com/private-octopus/ithitools/wiki/Secondary-root-server-statistics.
We have discussed the output of this script in the past. If I
summarize correctly, the issues were:
- The name "useless" carries a value judgement. It would be better to
  replace it with something neutral, like "repeated".
- The summary lines mix atomic counts, like the number of DGA
  queries, and subtotals, like the number of NX domain queries or the
  number of "other" queries.
- It is unclear how the categories relate to the published M3 submetrics
  (M3.1, etc.).
- We may want to add a few more atomic categories.

I think there are some easy fixes, without changing the ithitools code:
- Replace "useless" by "repeated". Possibly replace "useful" by
  something else. Suggestions?
- Remove the "subtotals": "queries", "nx_domain", and "others".
- Replace the name "rfc6761" by "other_rfc6761" for clarity -- implying
  it excludes .local and .localhost. The sum of local, localhost and
  other_rfc6761 can be used to recompute the metric M3.3.1.
- Add "other_frequent_names" after .mail -- total queries for
  "frequently found TLD" excluding home, lan, internal, ip, localdomain,
  corp, and mail. The sum of other_frequent_names, home, lan, internal,
  ip, localdomain, corp, and mail can be used to recompute the metric M3.3.2.
- Move dga and jumbo after other_frequent_names, to make it clear that
  these do not include queries to RFC6761 or frequent names.
- Add columns for the frequently found categories listed in M3.3:
  bad_syntax, binary, ipv4, numeric. Question: should I also add entries
  for domain names of length 1 to 6? These account for a bit more than 1% of
  total queries, and the total of dga, jumbo, bad_syntax, binary, ipv4,
  numeric and length 1-6 would match the definition of M3.3.3.
- Define "other_names" as the count of all NX domain queries minus
  those listed in the previous categories, so the sum of all columns from
  "local" to "other_names" equals the total number of queries.

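The column arithmetic proposed in this list can be sketched in a few lines of Python. This is only an illustration, not the m3summary.py code: the grouping constants, the helper functions, the dictionary layout of a summary row, and the column name `length_1_6` (which depends on the open question above) are all assumptions.

```python
# Hypothetical column groups, following the proposal in this issue.
# "length_1_6" is an assumed name for the still-undecided short-name column.
M331_COLUMNS = ["local", "localhost", "other_rfc6761"]
M332_COLUMNS = ["home", "lan", "internal", "ip", "localdomain",
                "corp", "mail", "other_frequent_names"]
M333_COLUMNS = ["dga", "jumbo", "bad_syntax", "binary", "ipv4",
                "numeric", "length_1_6"]


def group_sum(row, columns):
    """Sum the counts of the given columns; missing columns count as 0."""
    return sum(row.get(column, 0) for column in columns)


def other_names(nx_domain_total, row):
    """NX domain queries not covered by any of the listed categories."""
    listed = group_sum(row, M331_COLUMNS + M332_COLUMNS + M333_COLUMNS)
    return nx_domain_total - listed
```

With this layout, `group_sum(row, M331_COLUMNS)` gives the count behind M3.3.1, and adding `other_names(...)` to the three group sums returns the NX domain count that was passed in, which is the consistency property the last bullet asks for.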
If we do change the ithitools code, we can add the following:
- Change the way the overflow of the "frequent names" is computed, so
  the names matching the encoded "frequent name" list are listed as
  "other_frequent_names", instead of being lumped with the dga, jumbo and
  short names categories.
- Split the dga definition between dga_single (1 part dga name) and
  dga_multi (multi-part dga names).
- Add the "jumbo" pattern to the M3.3.3 list.
- Maybe remove the metric M3.3.4, since pretty much every component of
  it will match the "frequent names" or "frequent patterns" classes.
- Isolate other categories per suggestion.
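
As a rough illustration of the dga_single / dga_multi split, the sketch below buckets names by their number of labels. It assumes the names have already been classified as DGA by the existing ithitools logic, which is not reproduced here; the function name `dga_column` is hypothetical.

```python
def dga_column(name):
    """Return the proposed column for a name already classified as DGA.

    Assumption: "1 part" means a single label, and anything with two
    or more labels counts as multi-part.
    """
    labels = [label for label in name.strip(".").split(".") if label]
    return "dga_single" if len(labels) <= 1 else "dga_multi"
```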