Commit ae5a748

docs: updated README and CONTRIBUTING for information on how to contribute to the malware analyzer
Signed-off-by: Carl Flottmann <[email protected]>
1 parent ad6f587 commit ae5a748

2 files changed

+46
-1
lines changed

CONTRIBUTING.md

Lines changed: 4 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -72,6 +72,10 @@ See below for instructions to set up the development environment.
7272
- PRs should be merged using the `Squash and merge` strategy. In most cases a single commit with
7373
a detailed commit message body is preferred. Make sure to keep the `Signed-off-by` line in the body.
7474

75+
### PyPI Malware Detection Contribution
76+
77+
Please see the [README for the malware analyzer](./src/macaron/malware_analyzer/README.md) for information on contributing heuristics and code patterns.
78+
7579
## Branching model
7680

7781
* The `main` branch is only used for releases and the `staging` branch is used for development. We only merge to `main` when we want to create a new release for Macaron.

src/macaron/malware_analyzer/README.md

Lines changed: 42 additions & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -1,4 +1,4 @@
1-
# Implementation of Heuristic Malware Detector
1+
# Implementation of Malware Detector
22

33
## PyPI Ecosystem
44

@@ -63,6 +63,47 @@ The following analyzer has been added in as an experimental feature, available b
6363

6464
This feature is currently a work in progress, and supports detection of code obfuscation techniques and remote exfiltration behaviors. It uses Semgrep OSS for detection.
6565

66+
### Contributing
67+
68+
When contributing an analyzer, it must meet the following requirements:
69+
70+
- The analyzer must be implemented in a separate file, placed in the relevant folder based on what it analyzes ([metadata](./pypi_heuristics/metadata/) or [sourcecode](./pypi_heuristics/sourcecode/)).
71+
- The analyzer must inherit from the `BaseHeuristicAnalyzer` class and implement the `analyze` function, returning relevant information specific to the analysis.
72+
- The analyzer must be added to the list of analyzers in `detect_malicious_metadata_check.py` to be run.
73+
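The three requirements above can be sketched as follows. This is a minimal, self-contained illustration only: the `BaseHeuristicAnalyzer` and `HeuristicResult` defined here are simplified stand-ins rather than Macaron's actual classes, and the `analyze` signature, the `EmptyDescriptionAnalyzer` heuristic, and the `ANALYZERS` list are all assumptions made for the example. Consult the real base class and `detect_malicious_metadata_check.py` for the actual interfaces.

```python
from abc import ABC, abstractmethod
from enum import Enum


class HeuristicResult(Enum):
    """Hypothetical result values used only in this sketch."""

    PASS = "PASS"
    FAIL = "FAIL"


class BaseHeuristicAnalyzer(ABC):
    """Simplified stand-in for Macaron's base class (assumed interface)."""

    def __init__(self, name: str) -> None:
        self.name = name

    @abstractmethod
    def analyze(self, package_metadata: dict) -> tuple[HeuristicResult, dict]:
        """Return a result plus details specific to this analysis."""


class EmptyDescriptionAnalyzer(BaseHeuristicAnalyzer):
    """Hypothetical metadata heuristic: fail packages that have no description."""

    def __init__(self) -> None:
        super().__init__(name="empty_description_analyzer")

    def analyze(self, package_metadata: dict) -> tuple[HeuristicResult, dict]:
        # Treat a missing, None, or whitespace-only description as empty.
        description = (package_metadata.get("description") or "").strip()
        if description:
            return HeuristicResult.PASS, {"description_length": len(description)}
        return HeuristicResult.FAIL, {"description_length": 0}


# Hypothetical registration list, standing in for the analyzer list in the check.
ANALYZERS = [EmptyDescriptionAnalyzer]

if __name__ == "__main__":
    for analyzer_cls in ANALYZERS:
        analyzer = analyzer_cls()
        print(analyzer.name, analyzer.analyze({"description": ""}))
```

In this sketch each analyzer lives in its own class and returns both a verdict and analysis-specific details, which mirrors the "separate file, inherit, register" structure the requirements describe.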
74+
**Contributing Code Pattern Rules**
75+
76+
When contributing more Semgrep rules for `pypi_sourcecode_analyzer.py` to use, the following requirements must be met:
77+
78+
- Semgrep `.yaml` rules are stored in `src/macaron/resources/pypi_malware_rules` and are named after the category of code behavior they detect.
79+
- If the rule falls under one of the already defined categories, place it within that category's `.yaml` file; otherwise, create a new `.yaml` file using the category name.
80+
- Each rule ID must be prefixed by the category followed by a single underscore ('_'). For example, each rule ID in `obfuscation.yaml` is prefixed with `obfuscation_`, followed by an ID which uses a hyphen ('-') as a separator.
81+
- Tests must be written for each rule contributed. These are stored in `tests/malware_analyzer/pypi/test_pypi_sourcecode_analyzer.py`.
82+
- These tests are written on a per-category basis, running each category individually. Each category must have a folder under `tests/malware_analyzer/pypi/resources/sourcecode_samples`.
83+
- Within these folders, there must be sample code patterns for testing, and a file `expected_results.json` with the expected JSON output of the analyzer for that category.
84+
- Each sample code pattern `.py` file must not have executable permissions and must include code that prevents it from being accidentally imported or run. The current files use this method:
85+
86+
```
"""
Running this code will not produce any malicious behavior, but code isolation measures are
in place for safety.
"""

import sys

# ensure no symbols are exported so this code cannot accidentally be used
__all__ = []
sys.exit()

def test_function():
    """
    All code to be tested will be defined inside this function, so it is all local to it. This is
    to isolate the code to be tested, as it exists to replicate the patterns present in malware
    samples.
    """
    sys.exit()
```
106+
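The rule ID convention and the sample-folder requirements listed above lend themselves to a quick automated check. The sketch below is not part of Macaron: the function names are hypothetical, the exact ID pattern (lowercase alphanumeric words joined by hyphens) is an assumption about what counts as a valid ID, and the folder check assumes sample `.py` files sit directly inside the category folder.

```python
import re
import stat
from pathlib import Path


def is_valid_rule_id(category: str, rule_id: str) -> bool:
    """Check that rule_id is '<category>_' followed by a hyphen-separated ID."""
    # Assumed ID shape: lowercase alphanumeric words joined by single hyphens.
    pattern = re.compile(rf"{re.escape(category)}_[a-z0-9]+(?:-[a-z0-9]+)*")
    return pattern.fullmatch(rule_id) is not None


def check_sample_dir(category_dir: Path) -> list[str]:
    """Return a list of problems found in one category's sample folder."""
    problems = []
    if not (category_dir / "expected_results.json").is_file():
        problems.append("missing expected_results.json")
    # Any of the user, group, or other execute bits counts as executable.
    exec_bits = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH
    for sample in sorted(category_dir.glob("*.py")):
        if sample.stat().st_mode & exec_bits:
            problems.append(f"{sample.name} is executable")
    return problems


if __name__ == "__main__":
    print(is_valid_rule_id("obfuscation", "obfuscation_decode-and-execute"))
    print(is_valid_rule_id("obfuscation", "obfuscation-decode"))
```

A check like this could run before the per-category tests, so that a misnamed rule or an executable sample file is reported directly rather than surfacing as a confusing test failure.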
66107
### Confidence Score Motivation
67108

68109
The original seven heuristics that started this work were Empty Project Link, Unreachable Project Links, One Release, High Release Frequency, Unchanged Release, Closer Release Join Date, and Suspicious Setup. These heuristics (excluding those with a dependency) were run on 1167 packages from trusted organizations, with the following results:
