MTL-TABlock: Function-level Type-aware Tracking Script Blocking via Multi-task Learning
MTL-TABlock addresses the privacy–usability tension caused by mixed scripts: a single script often contains both tracking logic and essential site functionality, so script-level or domain-level blocking can easily break pages.
This project operates at function granularity. It builds function-level behavioral graphs from browser runtime signals and trains a multi-task learning (MTL) model to jointly perform:
- Tracking function detection (tracking vs. benign)
- Tracking function subtype identification (type-aware), covering: Storage Tracking / Network Beacon / Fingerprinting / Conversion Analytics
Based on the predicted subtype, MTL-TABlock generates and injects type-aware surrogates (compatible replacement functions) to block tracking behavior while preserving page functionality as much as possible.
Note: This repository is a research / reproduction-oriented prototype.
- Key Features
- Method Overview
- Repository Layout
- Requirements
- Quick Start
- Surrogate Strategies
- Extension Injection Approaches
- Data & Compliance
- Citation
- License
- Function-level Behavior Graphs: Captures network requests, DOM mutations, storage accesses, key Web API calls, and call stacks to build fine-grained graph representations.
- Structural + Contextual Feature Fusion: Encodes both graph structure (call/interaction relations) and runtime semantics.
- Multi-Task Learning (MTL): A primary task for tracking detection plus an auxiliary task for subtype classification to drive downstream, subtype-specific blocking.
- Type-aware Surrogate Generation: Produces interface-compatible replacements (preserving return shapes and sync/async semantics) to reduce breakage versus naïve “no-op” stubs.
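The multi-task objective in the list above can be read as a detection loss plus a weighted subtype loss. The following is a minimal sketch under stated assumptions, not the code in `5_model_training/mtl_model.py`: the names `joint_loss` and `aux_weight` are hypothetical, and it assumes the subtype term is only counted for functions labeled as tracking.

```python
import math
from typing import Optional, Sequence

# Subtype order is an assumption made for this sketch.
SUBTYPES = ["storage_tracking", "network_beacon", "fingerprinting", "conversion_analytics"]


def binary_cross_entropy(p_tracking: float, is_tracking: int) -> float:
    """Detection loss for one function (primary task)."""
    eps = 1e-12
    p = min(max(p_tracking, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(is_tracking * math.log(p) + (1 - is_tracking) * math.log(1.0 - p))


def softmax(logits: Sequence[float]) -> list:
    """Numerically stable softmax over subtype logits."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def joint_loss(
    p_tracking: float,
    is_tracking: int,
    subtype_logits: Sequence[float],
    subtype_index: Optional[int],
    aux_weight: float = 0.5,  # hypothetical hyperparameter
) -> float:
    """Primary detection loss plus a weighted auxiliary subtype loss."""
    loss = binary_cross_entropy(p_tracking, is_tracking)
    # Benign functions have no subtype, so the auxiliary term is skipped.
    if is_tracking == 1 and subtype_index is not None:
        probs = softmax(subtype_logits)
        loss += aux_weight * -math.log(probs[subtype_index])
    return loss
```

For example, `joint_loss(0.9, 1, [2.0, 0.1, 0.1, 0.1], 0)` scores a function predicted as tracking with the first subtype; raising `aux_weight` makes subtype errors count more against the shared representation.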
The overall pipeline consists of six stages:
- Data Collection: Chrome extension + automated crawling to collect runtime signals (network/DOM/storage/WebAPI/call stack).
- Graph Construction: Build site/script-level behavior graphs (function nodes, network nodes, storage nodes, WebAPI nodes, and interaction edges).
- Feature Extraction: Compute structural and contextual features to form function-level samples.
- Function Annotation: Derive training labels (tracking/benign + subtype) using filter lists (EasyList/EasyPrivacy) and high-confidence rules.
- Model Training: Train the MTL model (tracking detection + subtype classification).
- Surrogate Generation & Deployment: Identify target functions and inject type-aware surrogates at runtime.
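Stages 2 to 4 above can be illustrated with a small, self-contained sketch. It assumes a simplified event format (`function`, `kind`, `target`) and uses a plain blocked-domain set as a stand-in for EasyList/EasyPrivacy matching; all function names here are hypothetical and the repository's own modules are likely more detailed.

```python
from collections import defaultdict
from urllib.parse import urlparse


def build_behavior_graph(events):
    """Map each function node to the resource nodes it touched."""
    graph = defaultdict(set)
    for event in events:
        function_node = ("function", event["function"])
        resource_node = (event["kind"], event["target"])
        graph[function_node].add(resource_node)
    return dict(graph)


def structural_features(graph):
    """Count outgoing edges per resource kind for every function."""
    features = {}
    for (_, function_id), neighbors in graph.items():
        counts = {"network": 0, "storage": 0, "webapi": 0}
        for kind, _target in neighbors:
            counts[kind] = counts.get(kind, 0) + 1
        counts["out_degree"] = len(neighbors)
        features[function_id] = counts
    return features


def host_matches(host, blocked_domains):
    """True if host equals a blocked domain or is one of its subdomains."""
    if host is None:
        return False
    return any(host == d or host.endswith("." + d) for d in blocked_domains)


def label_functions(graph, blocked_domains):
    """Label a function as tracking if it contacts a blocked domain.

    This is a simplification: real filter-list rules have richer syntax.
    """
    labels = {}
    for (_, function_id), neighbors in graph.items():
        labels[function_id] = any(
            kind == "network" and host_matches(urlparse(target).hostname, blocked_domains)
            for kind, target in neighbors
        )
    return labels


# Hypothetical runtime events, as a collection server might store them.
events = [
    {"function": "app.js@sendPing", "kind": "network",
     "target": "https://stats.tracker.example/collect"},
    {"function": "app.js@sendPing", "kind": "storage", "target": "cookie:_uid"},
    {"function": "app.js@renderMenu", "kind": "webapi", "target": "Document.querySelector"},
]
graph = build_behavior_graph(events)
features = structural_features(graph)
labels = label_functions(graph, {"tracker.example"})
```

Here `app.js@sendPing` ends up with one network edge and one storage edge and is labeled as tracking, while `app.js@renderMenu` stays benign even though both live in the same script, which is the mixed-script case the project targets.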
The layout below follows the project’s stage-based organization.
.
├── 1_data_collection
│   ├── browser_extension
│   │   ├── manifest.json
│   │   ├── background.js
│   │   ├── content.js
│   │   ├── inject.js
│   │   ├── basic.html
│   │   └── breakpoint.json
│   ├── data_collection_server
│   │   ├── package.json
│   │   ├── package-lock.json
│   │   └── server.js
│   └── selenium_crawler
│       ├── crawler_main.py
│       └── crawler_with_hook.py
├── 2_graph_construction
│   ├── graph_builder_main.py
│   ├── graph_population.py
│   ├── graph_population_with_callstack.py
│   └── node_handlers
│       ├── event_handler.py
│       ├── info_share_handler.py
│       ├── network_node_handler.py
│       ├── redirection_edge_handler.py
│       └── storage_node_handler.py
├── 3_feature_extraction
│   ├── feature_extractor_main.py
│   ├── contextual_features.py
│   ├── structural_features.py
│   ├── network_features.py
│   └── network_features_methods.py
├── 4_function_annotation
│   ├── tracking_annotation.py
│   ├── subtype_annotation.py
│   └── filter_lists
│       ├── easylist_parser.py
│       └── high_confidence_rules.py
├── 5_model_training
│   ├── model_main.py
│   └── mtl_model.py
├── 6_surrogate_generation
│   ├── surrogate_main.py
│   ├── surrogate_generator.py
│   ├── function_replacer.py
│   ├── parentheses_balance.py
│   └── surrogate_templates
│       ├── storage_tracking_surrogate.py
│       ├── network_beacon_surrogate.py
│       ├── fingerprinting_surrogate.py
│       └── conversion_analytics_surrogate.py
├── utils
│   └── common_utils.py
└── requirements.txt
- Python: 3.8+ (recommended 3.10+)
- Node.js: 16+ (for data_collection_server)
- Selenium + matching ChromeDriver

pip install -r requirements.txt

cd 1_data_collection/data_collection_server
npm install

cd 1_data_collection/data_collection_server
node server.js

- Open chrome://extensions
- Enable Developer mode
- Click Load unpacked
- Select: 1_data_collection/browser_extension/

cd 1_data_collection/selenium_crawler
python crawler_main.py

For stronger instrumentation/hooking (if available in your environment):
python crawler_with_hook.py

cd 2_graph_construction
python graph_builder_main.py

cd 3_feature_extraction
python feature_extractor_main.py

cd 4_function_annotation
python tracking_annotation.py
python subtype_annotation.py

cd 5_model_training
python model_main.py

cd 6_surrogate_generation
python surrogate_main.py

Surrogate templates are under 6_surrogate_generation/surrogate_templates/:
- Storage Tracking: Replace cross-site linkable identifiers with site-scoped pseudo-identifiers to reduce cross-site correlation while keeping site behavior stable.
- Network Beacon: Suppress actual exfiltration while returning “success” semantics (e.g., resolved Promises) to preserve control flow.
- Fingerprinting: Return low-entropy, origin-stable, cross-origin-unlinkable pseudo-fingerprints.
- Conversion Analytics: Locally emulate assignment/variant logic and return stable, business-compatible results without contacting remote endpoints.
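The four strategies above could be rendered into JavaScript stubs roughly as follows. This is an illustrative sketch, not the templates shipped in `surrogate_templates/`: `site_scoped_id`, `render_surrogate`, the HMAC-based derivation, and the `"control"`/`"variant_a"` variant names are all assumptions made for the example.

```python
import hashlib
import hmac
import json


def site_scoped_id(secret: bytes, origin: str, length: int = 16) -> str:
    """Derive an identifier that is stable per origin but differs across origins.

    Assumption: a per-installation secret keyed through HMAC-SHA256.
    """
    digest = hmac.new(secret, origin.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:length]


def render_surrogate(subtype: str, function_name: str, origin: str, secret: bytes) -> str:
    """Return JavaScript source for a replacement function of the given subtype."""
    pseudo_id = site_scoped_id(secret, origin)
    if subtype == "network_beacon":
        # Nothing is sent; callers awaiting the result still see success.
        body = "return Promise.resolve();"
    elif subtype in ("storage_tracking", "fingerprinting"):
        # json.dumps yields a safely quoted JavaScript string literal.
        body = f"return {json.dumps(pseudo_id)};"
    elif subtype == "conversion_analytics":
        # Pick a variant deterministically per origin, with no remote call.
        variants = ["control", "variant_a"]
        choice = variants[int(pseudo_id, 16) % len(variants)]
        body = f"return {{variant: {json.dumps(choice)}}};"
    else:
        raise ValueError(f"unknown subtype: {subtype}")
    return f"function {function_name}() {{ {body} }}"


# Hypothetical usage with a made-up function name and origin.
js_source = render_surrogate("network_beacon", "sendPing", "https://shop.example", b"local-secret")
```

Keying the identifier on the origin is what keeps it stable for one site while making values from two different sites unlinkable, provided the secret never leaves the browser.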
Two injection approaches are described:
- Manifest V2 (MV2): Use Chrome DevTools Protocol (CDP) to intercept scripts before execution and replace them with surrogates.
- Manifest V3 (MV3): Use declarativeNetRequest to redirect target script requests to locally packaged surrogate files (MV3 cannot directly rewrite response bodies in the same way as MV2).
The directory 1_data_collection/browser_extension/ can serve as the starting point for extension-based instrumentation and injection logic.
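For the MV3 approach, the redirect rules can be generated offline and shipped as a static ruleset. The sketch below builds one rule in the declarativeNetRequest rule format; the helper name `build_redirect_rule`, the URL filter, and the `/surrogates/` path are hypothetical examples, not files from this repository.

```python
import json


def build_redirect_rule(rule_id: int, url_filter: str, surrogate_path: str) -> dict:
    """Build one declarativeNetRequest rule redirecting a script to a packaged file."""
    if not surrogate_path.startswith("/"):
        # extensionPath is relative to the extension root and must start with "/".
        raise ValueError("surrogate_path must start with '/'")
    return {
        "id": rule_id,
        "priority": 1,
        "action": {
            "type": "redirect",
            "redirect": {"extensionPath": surrogate_path},
        },
        "condition": {
            "urlFilter": url_filter,
            "resourceTypes": ["script"],
        },
    }


# Hypothetical mapping from a mixed script URL to its surrogate copy.
rules = [
    build_redirect_rule(1, "||tracker.example/analytics.js", "/surrogates/analytics.js"),
]
rules_json = json.dumps(rules, indent=2)
```

The resulting JSON would be referenced from the extension manifest as a rule resource, and the surrogate files also need to be listed under `web_accessible_resources` for the redirect to succeed. Because the redirect swaps the whole script, each surrogate file has to contain the original script with only the targeted functions replaced.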
Dynamic collection involves scripts, network requests, and runtime context. Please ensure that you:
- Collect data only under lawful conditions (authorization / academic research / compliant testing);
- Avoid collecting or storing sensitive personal data;
- Apply anonymization and access control before releasing datasets/models.
If this project is helpful for your research, please cite the paper:
@article{mtl_tablock,
title = {MTL-TABlock: Function-level Type-aware Tracking Script Blocking via Multi-task Learning},
author = {Zhanhui Yuan and Zhi Yang and Jinglei Tan and Hao Hu and Hongqi Zhang},
note = {See MTL-TABlock.md in this repository for the full text},
}
Add your preferred open-source license (e.g., MIT / Apache-2.0 / GPL-3.0) and any third-party notices (EasyList/EasyPrivacy, etc.) before publishing.