ENH: Using msgpack instead of json #1819

Draft
wants to merge 45 commits into
base: develop

@waldbauer-certat waldbauer-certat commented Mar 17, 2021

NOTE: This is a proof of concept and is still being heavily tested!

Introduction

Msgpack (MessagePack) is a (de)serialization format that is similar to JSON, but more optimized for m2m (machine-to-machine) communication. There are certainly better protocols such as protobuf, FlatBuffers, Cap'n Proto, SBE and so on, but they don't fit into IntelMQ very well. Msgpack uses a key-value pattern (like JSON), so there won't be any major change. The real "magic" is in how the data is stored: JSON is very human-readable because it serializes to text, whereas msgpack packs data into a binary format, which results in smaller size and faster processing; see the benchmark below.
If you want to know some specs, check it out here.
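To make the text-versus-binary difference concrete, here is a minimal sketch that hand-encodes a tiny subset of the MessagePack format (nil, booleans, small unsigned integers, short strings, small maps) using only the standard library. The `pack` function and the `sample` dictionary are illustrative assumptions; this is not the code from this pull request and not a replacement for the real msgpack library.

```python
import json
import struct


def pack(obj) -> bytes:
    """Encode a small subset of MessagePack (illustration only)."""
    if obj is None:
        return b"\xc0"  # nil
    if obj is True:
        return b"\xc3"
    if obj is False:
        return b"\xc2"
    if isinstance(obj, int):
        if 0 <= obj <= 0x7F:
            return bytes([obj])  # positive fixint: the value is the byte
        if 0 <= obj <= 0xFF:
            return b"\xcc" + struct.pack(">B", obj)  # uint 8
        if 0 <= obj <= 0xFFFF:
            return b"\xcd" + struct.pack(">H", obj)  # uint 16
        if 0 <= obj <= 0xFFFFFFFF:
            return b"\xce" + struct.pack(">I", obj)  # uint 32
        raise ValueError("integer outside the subset handled by this sketch")
    if isinstance(obj, str):
        data = obj.encode("utf-8")
        if len(data) <= 31:
            return bytes([0xA0 | len(data)]) + data  # fixstr: length in the type byte
        if len(data) <= 0xFF:
            return b"\xd9" + bytes([len(data)]) + data  # str 8
        raise ValueError("string too long for this sketch")
    if isinstance(obj, dict):
        if len(obj) > 15:
            raise ValueError("map too large for this sketch")
        out = bytes([0x80 | len(obj)])  # fixmap: entry count in the type byte
        for key, value in obj.items():
            out += pack(key) + pack(value)
        return out
    raise TypeError(f"unsupported type: {type(obj).__name__}")


sample = {"compact": True, "schema": 0}
as_json = json.dumps(sample, separators=(",", ":")).encode("utf-8")
as_msgpack = pack(sample)
print(len(as_json), len(as_msgpack))
```

The saving comes from dropping quotes, colons, commas and braces: type and length are folded into a single prefix byte per value, and small integers and booleans take one byte each.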

Msgpack itself is available for multiple languages such as Go, Python, JavaScript, PHP and so on.

In addition, Redis, our internal message queue, is also capable of using msgpack within its Lua API.

What's the goal?

  • Faster processing time for (de)serialization
  • Smaller memory footprint
  • No breaking change
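One way the "no breaking change" goal could be approached during a migration is to tell the two formats apart by their first byte. The sketch below is a hypothetical helper, not part of this pull request: it assumes every queued message is a top-level map/object and that JSON payloads carry no leading whitespace.

```python
def detect_format(raw: bytes) -> str:
    """Guess whether a serialized top-level map is JSON or MessagePack.

    Hypothetical helper for illustration; assumes the payload is a
    map/object and that JSON has no leading whitespace.
    """
    if not raw:
        raise ValueError("empty message")
    first = raw[0]
    if first == 0x7B:  # ASCII "{" opens a JSON object
        return "json"
    # MessagePack maps start with fixmap (0x80-0x8f), map 16 (0xde) or map 32 (0xdf)
    if 0x80 <= first <= 0x8F or first in (0xDE, 0xDF):
        return "msgpack"
    raise ValueError(f"unrecognized leading byte: {first:#04x}")
```

This works because a MessagePack map never starts with 0x7b; that byte would be the positive fixint 123, which is not a valid top-level message under the assumption above.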

Benchmark

For the benchmark, data was extracted from spamhaus-drop-collector, parsed by spamhaus-drop-parser and measured in deduplicator-expert. 460 events were processed in total.

I've tested the bots above and they worked fine with this change. It might break other bots (which I haven't tested yet).
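The measurement approach described above (per-event timing, then the median over all events) can be sketched as follows. This is an assumed harness, not the one used for this pull request: `median_ns` is a hypothetical helper, the events are synthetic placeholders rather than real Spamhaus DROP data, and only the standard-library JSON functions are timed. Any other serializer, such as a msgpack one, could be passed as `func` in the same way.

```python
import json
import statistics
import time
from typing import Callable, Iterable


def median_ns(func: Callable[[object], object], items: Iterable[object]) -> float:
    """Call func once per item and return the median wall-clock time in ns."""
    timings = []
    for item in items:
        start = time.perf_counter_ns()
        func(item)
        timings.append(time.perf_counter_ns() - start)
    return statistics.median(timings)


# Synthetic placeholder events; field names and values are illustrative only.
events = [
    {"source.ip": f"192.0.2.{i % 256}", "feed.name": "example-feed"}
    for i in range(460)
]
serialized = [json.dumps(event) for event in events]

json_serialize_ns = median_ns(json.dumps, events)
json_deserialize_ns = median_ns(json.loads, serialized)
median_size = statistics.median(len(s.encode("utf-8")) for s in serialized)

print(f"serialize: {json_serialize_ns} ns, deserialize: {json_deserialize_ns} ns, "
      f"median size: {median_size} bytes")
```

Absolute numbers from such a harness depend heavily on the machine, the Python version and the event contents, so they are only comparable within a single run.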

| Type | Median data size |
|---|---|
| JSON | 387 bytes |
| MSGPACK | 329 bytes |
| Diff | 58 bytes (16.20%) |

Serialize

| Type | Median execution time in ns |
|---|---|
| JSON | 39286 |
| MSGPACK | 23483 |
| Diff | 15803 (50.35%) |

Deserialize

| Type | Median execution time in ns |
|---|---|
| JSON | 30713 |
| MSGPACK | 12602 |
| Diff | 18111 (83.62%) |
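The reported percentages appear to be the absolute difference divided by the mean of the two values. That is an inference from the numbers, not something the author states. The sketch below uses a hypothetical helper, `pct_diff`, to reproduce them. The JSON deserialization figure (30713 ns) is the one given in the commit message of this pull request.

```python
def pct_diff(a: float, b: float) -> float:
    """Absolute difference relative to the mean of a and b, in percent."""
    return abs(a - b) / ((a + b) / 2) * 100


size = pct_diff(387, 329)            # median data size in bytes
serialize = pct_diff(39286, 23483)   # median serialization time in ns
deserialize = pct_diff(30713, 12602) # median deserialization time in ns

print(f"{size:.2f}% {serialize:.2f}% {deserialize:.2f}%")
```

Measured against the JSON baseline instead, the same figures would correspond to reductions of roughly 15% in size, 40% in serialization time and 59% in deserialization time.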

To sum up, switching from JSON to msgpack results in faster (de)serialization and a lower memory footprint.

@waldbauer-certat waldbauer-certat force-pushed the waldbauer/msgpack-poc branch 13 times, most recently from 23cd283 to 9bce822 Compare March 18, 2021 12:10
@waldbauer-certat waldbauer-certat force-pushed the waldbauer/msgpack-poc branch 4 times, most recently from 5c6bdd5 to 9ab334e Compare April 1, 2021 09:32
@waldbauer-certat waldbauer-certat force-pushed the waldbauer/msgpack-poc branch 2 times, most recently from 6d9e656 to 40e4ae1 Compare June 30, 2021 15:41
@ghost ghost added the needs: feedback label Aug 20, 2021
waldbauer-certat and others added 29 commits July 15, 2022 13:21
Signed-off-by: Sebastian Waldbauer <[email protected]>
This commit adds license information to a lot of files and adds a
.reuse/dep5 file that lists the license information for some folders

The commit also changes the main license in setup.cfg from AGPL-3.0-only
to AGPL-3.0-or-later because only one file has the AGPL-3.0-only file as
license and multiple files have the AGPL-3.0-or-later in the license
header.

It also removes the cef_logo.png file, as there is no information about
the license anywhere to be found. It is now included directly from the
website of the European Union.

Closes #1633
and add legacy tag to shadowserver caida config
and add legacy tag to the configs it replaces

and update changelog and documentation accordingly
fix mapping
use compromised type if the data indicates an active webshell
plus add testcases
add changelog
update bots documentation
enhance mappings
add 4/6 agnostic mapping for `Sinkhole-Events` as well
document feeds with IPv4 and IPv6 better and shorter
This commit adds a license header or a license file to most of the
files, or documents the license in the .reuse/dep5 license file.

Some of the process was automated, first by listing all the files that
are not reuse lint compliant:
> reuse lint > ../reuse.lst
This list was then modified to remove metainformation and only list
filenames. Also a couple of filenames that need manual modification were
removed.

Then using git and reuse:
> for file in `cat ../reuse.lst`; do year=`git log --reverse --pretty="format:%ai" $file | head -1 | cut -d "-" -f 1`;  author=`git log --reverse --pretty="format:%an" $file|head -1`; reuse addheader --copyright="$author" --year="$year" --license="AGPL-3.0-or-later" --skip-unrecognised $file; done

Then the same process was repeated for files reuse does not recognize,
like csv and json files or REQUIREMENTS.txt files.
match with RSIT in the taxonomy intrusions:
compromised -> system-compromise
unauthorized-command -> system-compromise
unauthorized-login -> system-compromise
 adapt bots depending on the name
add changelog and news entries, including SQL update statements
merged into information-content-security > unauthorised-information-modification

adapt bots depending on the name
add changelog and news entries, including SQL update statements
was renamed and marked as deprecated in 2.0.0.beta1
#1404
Compatibility with the deprecated configuration format (before 1.0.0.dev7) was removed.
#1404
The deprecated shell scripts
- `update-asn-data`
- `update-geoip-data`
- `update-tor-nodes`
- `update-rfiprisk-data`
have been removed in favor of the built-in update mechanisms (see the bots' documentation). A crontab file for calling all new update commands can be found in `contrib/cron-jobs/intelmq-update-database`.

#1404
add two n6 images directly to the repository, as they are not displayed
on readthedocs otherwise: The other websites hosting the images block
loading images if the referer does not match a whitelist. we can't add a
noreferer HTML attribute in rst as well. the option left is to add the
files, that only implies adding the licensing information and the
AGPL-3.0 license text as well.

add two illustrations of the flow from n6 to intelmq and vice versa, own
work.

some textual improvements in the document itself.
The Aggregate Expert might be used to aggregate events within a given
timespan and threshold.

Signed-off-by: Sebastian Waldbauer <[email protected]>
Using msgpack instead of json results in faster (de)serialization and
less memory usage. Redis is also capable of msgpack within its lua
api, e.g. https://github.com/kengonakajima/lua-msgpack-native.

====== Benchmark =======
JSON median size: 387
MSGPACK median size: 329
------------------------
Diff: 16.20%

JSON
* Serialize: 39286
* Deserialize: 30713

MSGPACK
* Serialize: 23483
* Deserialize: 12602
---------------------
DIFF
* Serialize: 50.35%
* Deserialize: 83.62%

Data extracted from spamhaus-collector
Measurements based on deduplicator-expert
460 events in total processed by deduplicator-expert

Signed-off-by: Sebastian Waldbauer <[email protected]>
5 participants