Commit 4fbf160

chore: add minor code improvements to the typosquatting heuristic (#1122)
Signed-off-by: Amine <[email protected]>
1 parent 014c8d2 commit 4fbf160

3 files changed: +55 -1 lines changed

scripts/find_packages.sh

Lines changed: 46 additions & 0 deletions
@@ -0,0 +1,46 @@
+#!/usr/bin/env bash
+
+# Copyright (c) 2022 - 2025, Oracle and/or its affiliates. All rights reserved.
+# Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/.
+
+#
+# This script fetches the list of top PyPI packages and saves them to a file.
+# It downloads the data from https://hugovk.github.io/top-pypi-packages/top-pypi-packages-30-days.json,
+# extracts the top 5000 package names using jq, and saves them to the specified location.
+#
+# If the destination file already exists, the script will do nothing.
+#
+# Usage: ./find_packages.sh [FOLDER] [FILE]
+# - FOLDER: The destination folder (default: ../src/macaron/resources)
+# - FILE: The destination filename (default: popular_packages.txt)
+#
+# Dependencies: curl, jq.
+
+# Set default values
+DEFAULT_FOLDER="../src/macaron/resources"
+DEFAULT_FILE="popular_packages.txt"
+
+# Override with provided arguments if they exist
+FOLDER=${1:-$DEFAULT_FOLDER}
+FILE=${2:-$DEFAULT_FILE}
+
+FULL_PATH="$FOLDER/$FILE"
+URL="https://hugovk.github.io/top-pypi-packages/top-pypi-packages-30-days.json"
+
+# Check if file exists
+if [ -f "$FULL_PATH" ]; then
+    echo "$FULL_PATH already exists. Nothing to do."
+else
+    echo "$FULL_PATH not found. Fetching top PyPI packages..."
+
+    # Ensure the directory exists
+    mkdir -p "$FOLDER"
+
+    # Fetch and process JSON using curl and jq
+    if curl -s "$URL" | jq -r '.rows[:5000] | sort_by(-.download_count) | .[].project' > "$FULL_PATH"; then
+        echo "Successfully saved top 5000 packages to $FULL_PATH"
+    else
+        echo "Failed to fetch or process package data."
+        exit 1
+    fi
+fi
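
The core of the script is the jq filter plus the "skip if the file exists" guard. The sketch below restates both in Python so the behaviour can be checked without network access. It is an illustration only, not part of the commit: `top_projects`, `write_if_missing`, and the `SAMPLE` data are hypothetical names and made-up values, and the only assumption about the JSON is the shape the filter itself relies on (a `rows` array whose entries have `project` and `download_count`).

```python
import json
import tempfile
from pathlib import Path


def top_projects(payload: dict, limit: int = 5000) -> list[str]:
    """Mirror the jq filter: .rows[:limit] | sort_by(-.download_count) | .[].project"""
    rows = payload["rows"][:limit]  # slice first, exactly as the filter does
    rows = sorted(rows, key=lambda row: -row["download_count"])  # stable, descending
    return [row["project"] for row in rows]


def write_if_missing(path: Path, names: list[str]) -> bool:
    """Write one name per line unless the file already exists (the script's guard)."""
    if path.is_file():
        return False  # corresponds to "already exists. Nothing to do."
    path.parent.mkdir(parents=True, exist_ok=True)  # corresponds to mkdir -p
    path.write_text("\n".join(names) + "\n", encoding="utf-8")
    return True


# Made-up sample in the shape the filter expects; these are not real download figures.
SAMPLE = json.loads(
    '{"rows": [{"project": "alpha", "download_count": 10},'
    ' {"project": "beta", "download_count": 30},'
    ' {"project": "gamma", "download_count": 20}]}'
)

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "resources" / "popular_packages.txt"
        print(write_if_missing(target, top_projects(SAMPLE)))  # first run writes the file
        print(write_if_missing(target, top_projects(SAMPLE)))  # second run is a no-op
```

Two things follow from the script as shown: the default `FOLDER` is a relative path, so it resolves against the current working directory, and an existing destination file is never overwritten, so refreshing the list appears to require removing the old file first.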

src/macaron/malware_analyzer/README.md

Lines changed: 8 additions & 0 deletions
@@ -56,6 +56,14 @@ When a heuristic fails, with `HeuristicResult.FAIL`, then that is an indicator b
 - **Description**: Checks if the package name is suspiciously similar to any package name in a predefined list of popular packages. The similarity check incorporates the Jaro-Winkler distance and considers keyboard layout proximity to identify potential typosquatting.
 - **Rule**: Return `HeuristicResult.FAIL` if the similarity ratio between the package name and any popular package name meets or exceeds a defined threshold; otherwise, return `HeuristicResult.PASS`.
 - **Dependency**: None.
+
+> **Note**: This heuristic relies on a list of popular packages stored in [`src/macaron/resources/popular_packages.txt`](../resources/popular_packages.txt). Maintainers should periodically update this list by running the [`find_packages.sh`](../../../scripts/find_packages.sh) script from the project root directory. This ensures the typosquatting detection remains effective against the latest popular packages.
+>
+> Example:
+> ```bash
+> ./scripts/find_packages.sh
+> ```
+> The script will download the top 5000 PyPI packages and update the resource file automatically.
 ### Source Code Analysis with Semgrep
 **PyPI Source Code Analyzer**
 - **Description**: Uses Semgrep, with default rules written in `src/macaron/resources/pypi_malware_rules` and custom rules available by supplying a path to `custom_semgrep_rules` in `defaults.ini`, to scan the package `.tar` source code.
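
The typosquatting rule in this README hunk is stated only in words. As a rough illustration, and not Macaron's actual implementation, the sketch below computes a textbook Jaro-Winkler similarity and applies a "meets or exceeds the threshold" check against a list of popular names. It leaves out the keyboard-layout adjustment the README mentions. The function names, the 0.95 threshold, and the choice to skip exact matches are all assumptions made for the example.

```python
from typing import Iterable, Optional


def jaro(s1: str, s2: str) -> float:
    """Plain Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)  # how far apart matching characters may be
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order, then halve.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2 + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, scaling: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix of up to four characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * scaling * (1.0 - base)


def find_similar(name: str, popular: Iterable[str], threshold: float = 0.95) -> Optional[str]:
    """Return the first popular name whose similarity meets the (hypothetical) threshold."""
    for candidate in popular:
        if candidate == name:
            continue  # assumption: an exact match is the popular package itself
        if jaro_winkler(name, candidate) >= threshold:
            return candidate
    return None


if __name__ == "__main__":
    print(find_similar("reqeusts", ["numpy", "requests"]))
    print(find_similar("zzz", ["numpy", "requests"]))
```

In this sketch a non-`None` result would correspond to `HeuristicResult.FAIL` and `None` to `HeuristicResult.PASS`.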

src/macaron/malware_analyzer/pypi_heuristics/metadata/typosquatting_presence.py

Lines changed: 1 addition & 1 deletion
@@ -281,7 +281,7 @@ def analyze(self, pypi_package_json: PyPIPackageJsonAsset) -> tuple[HeuristicRes

 distance_ratio = self.ratio(package_name, popular_package)
 if distance_ratio >= self.distance_ratio_threshold:
-    logger.info(
+    logger.debug(
         "Potential typosquatting detected: '%s' is similar to popular package '%s' (ratio: %.3f)",
         package_name,
         popular_package,
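
The practical effect of switching `logger.info` to `logger.debug` can be seen with the standard `logging` module alone. The sketch below is a standalone illustration: the logger name `typosquatting_demo`, the `capture` helper, and the sample values are hypothetical and are not taken from Macaron.

```python
import io
import logging


def capture(level: int) -> str:
    """Emit the message at DEBUG and return whatever a logger set to `level` lets through."""
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    logger = logging.getLogger("typosquatting_demo")  # hypothetical logger name
    logger.handlers = [handler]  # replace any handlers left over from earlier calls
    logger.propagate = False  # keep the demo output out of the root logger
    logger.setLevel(level)
    logger.debug(
        "Potential typosquatting detected: '%s' is similar to popular package '%s' (ratio: %.3f)",
        "reqeusts",  # made-up package name
        "requests",
        0.971,  # made-up ratio
    )
    handler.flush()
    return stream.getvalue()


if __name__ == "__main__":
    print(repr(capture(logging.INFO)))   # logger at INFO: the debug message is dropped
    print(repr(capture(logging.DEBUG)))  # logger at DEBUG: the message is emitted
```

With the logger at `INFO` the message is filtered out, so after this change the detection message is only visible when debug logging is enabled.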
