- Overview
- Code Structure
- Core Files
- Quick Start
- Evaluation Metrics
- Competition Rules
- Important Notes
- Submission Guidelines
This competition focuses on optimizing Vision-Language Models (VLM) for inference performance. Participants are required to modify the VLMModel class in evaluation_wrapper.py to achieve better Time-To-First-Token (TTFT) and Throughput while maintaining accuracy.
AICASGC/
├── benchmark.py # Benchmark script (not recommended to modify)
├── evaluation_wrapper.py # Model wrapper (participants implement optimizations here)
├── requirements.txt # Python dependencies
├── data/ # Validation dataset
│ ├── data-*.arrow # Dataset files
│ ├── dataset_info.json # Dataset metadata
│ └── state.json # Dataset state
├── Qwen3-VL-2B-Instruct/ # Model weights directory (participants need to download)
└── README.md / README_CN.md # Documentation
benchmark.py- Self-testing benchmark script (⚠️ Not recommended to modify)evaluation_wrapper.py- Model wrapper where participants implement optimizationsQwen3-VL-2B-Instruct/- Competition model weights (participants need to download, see "Quick Start" section)data/- Validation datasetrequirements.txt- Python dependencies
The model files are large and need to be downloaded separately. Please create the model directory first, then download the model:
# Create model directory
mkdir -p Qwen3-VL-2B-Instruct
# Install huggingface_hub (if not installed)
pip install -U huggingface_hub
# Set mirror endpoint (recommended for users in China, faster download)
export HF_ENDPOINT=https://hf-mirror.com
# Download model to specified directory
huggingface-cli download \
--resume-download \
Qwen/Qwen3-VL-2B-Instruct \
--local-dir ./Qwen3-VL-2B-Instruct \
--local-dir-use-symlinks FalseNote:
- Model size is approximately 4-5GB, download may take some time
- If download is interrupted, you can rerun the command and it will resume automatically (
--resume-download) - After download completes, the
Qwen3-VL-2B-Instruct/folder will contain all model files - Ensure you have sufficient disk space (at least 5GB)
pip install -r requirements.txtpython benchmark.py \
--model-path ./Qwen3-VL-2B-Instruct \
--dataset-path ./data \
--output result.json \
--num-samples 100Edit the VLMModel class in evaluation_wrapper.py. The optimization architecture uses modular design, where each optimization direction corresponds to an independent method.
Before starting optimizations, you can explore the model structure to understand optimization targets:
class VLMModel:
def __init__(self, model_path: str, device: str = "cuda:0"):
# ... load model ...
# Optional: Explore model structure
self._explore_model_structure() # Will print model structure informationIn the __init__ method, enable/disable different optimizations by commenting/uncommenting:
class VLMModel:
def __init__(self, model_path: str, device: str = "cuda:0"):
# ... load model ...
# ================================================================
# Participant Optimization Area - Enable/disable optimization methods
# ================================================================
# 1. Vision Encoder Acceleration (optimize high-resolution image processing)
# self._optimize_vision_encoder()
# 2. KV Cache Management (optimize memory fragmentation during generation)
# self._optimize_kv_cache()
# 3. Cross-modal Connector Optimization (optimize Cross-modal Connector)
# self._optimize_cross_modal_connector()
# 4. Flash Attention Optimization
# self._enable_flash_attention()
# 5. Quantization Optimization
# self._apply_quantization()Implement your optimization logic in each optimization method. For example, optimizing Vision Encoder:
def _optimize_vision_encoder(self):
"""Find this method in evaluation_wrapper.py and implement your optimization"""
# Example: Replace attention operator
# from your_optimization import optimized_attention
# if hasattr(self._model, 'vision_model'):
# for layer in self._model.vision_model.encoder.layers:
# layer.self_attn.forward = optimized_attention
# TODO: Implement your Vision Encoder optimization
passImportant Notes:
- Benchmark directly calls
self.model.generate()for performance testing - Your optimizations should modify
self.modelor its operators via Monkey Patch in optimization methods - All optimization methods are called in
__init__, and optimizations take effect automatically - The
generate()method is optional and mainly for debugging
python benchmark.py \
--model-path ./Qwen3-VL-2B-Instruct \
--dataset-path ./data \
--output result_optimized.json \
--num-samples 100python benchmark.py \
--model-path ./Qwen3-VL-2B-Instruct \
--dataset-path ./data \
--output result.json \
--num-samples 5000The final score is calculated as:
Final Score = 0.4 × Accuracy + 0.3 × TTFT_Improvement + 0.3 × Throughput_Improvement
-
TTFT (Time To First Token): Time from input preparation to first token generation (in milliseconds)
- Includes: image encoding, text encoding, cross-modal interaction, prefill stage, first token generation
- Baseline: ~80ms
- Improvement = (Baseline - Your_TTFT) / Baseline
-
Throughput: End-to-end token generation rate (tokens per second)
- Baseline: ~55 tokens/sec
- Improvement = (Your_Throughput - Baseline) / Baseline
-
Accuracy: VQA accuracy on validation set (5000 samples)
- Soft matching with multiple ground truth answers
-
Do not modify
benchmark.py- This benchmark script is for self-testing only
- Final evaluation will use a separate official benchmark system
- Modifying this file may lead to inconsistencies between your local results and final evaluation results
-
Only modify
evaluation_wrapper.py -
Maintain required properties
- The
VLMModelclass must exposeprocessor,model, anddeviceproperties - Benchmark uses these properties to access the model and processor
- The
generate()method is optional and mainly for debugging
- The
-
Prohibited behaviors
- Do not hardcode answers
- Do not modify the dataset
- Do not use external APIs or services
- All optimizations must be local and self-contained
-
Encouraged: Operator replacement and kernel optimization - Rewrite or replace standard operator implementations (such as Attention, LayerNorm, Conv2d, etc.) using Triton, CUDA C++, etc.
-
Encouraged: Memory and cache optimization - Optimize KV Cache memory layout, reduce memory fragmentation, optimize GPU memory access patterns
-
Encouraged: Compilation and graph optimization - Use torch.compile for computation graph optimization, custom kernel scheduling
-
Encouraged: Attention mechanism optimization - Implement Flash Attention, memory-efficient attention, sparse attention
-
Encouraged: Generation process optimization - Optimize decoding strategies, cache management, generation configuration parameters
Not Permitted:
-
Using external services: Prohibited from calling external APIs, cloud services, or any functionality requiring network connection
-
Data and answer cheating: Prohibited from training on test data, pre-computing answers, hardcoding outputs
-
Model replacement and tampering: Participants should focus on operator-level optimization. Do not use additional datasets to train the model, change model architecture, or directly modify weight values.
-
Overfitting optimization: Prohibited from using conditional branches or special processing for specific evaluation samples
-
Black-box tool application: Behavior of only modifying configuration files without substantive code contributions is not recognized
-
Environment manipulation: Prohibited from interfering with fair evaluation by modifying system environment, GPU frequency locking, etc.
- The provided
benchmark.pyuses fixed order (first N samples from index 0) - When you run
--num-samples 100, it evaluates samples 0-99 - This ensures reproducibility for local self-testing
- Note: The official evaluation system used by the competition committee may employ different sampling strategies (including random sampling) for final verification
The benchmark automatically records detailed hardware information:
- Python version, PyTorch version, CUDA version
- GPU name, memory, compute capability
- CPU model, cores, frequency
- System information (OS, kernel, architecture)
- PPU information (if available)
This information is saved in result.json under system_info for statistical analysis.
- Warmup: 10 samples are used for GPU warmup before actual measurement
- TTFT Measurement: Measures time from input preparation to first token (includes all preprocessing)
- Throughput Measurement: Measures end-to-end generation time for 128 tokens
- State Isolation: GPU cache is cleared between measurements to ensure fairness
- The
--random-seedparameter only affects PyTorch's random number generator - It does NOT affect sample selection order (which is always fixed)
- Use it for reproducibility of model inference randomness
The result.json file contains:
{
"system_info": {
"timestamp": "...",
"python_version": "...",
"torch_version": "...",
"cuda_version": "...",
"gpu_name": "...",
...
},
"performance": {
"avg_ttft_ms": 90.55,
"avg_throughput_tokens_per_sec": 57.77
},
"answers": [
{
"question_id": 34602,
"prediction": "your answer text here"
},
...
]
}-
result.json- Generated by runningbenchmark.py- Contains predictions for all samples
- Must include valid
performancemetrics - Important: The
result.jsonuploaded to the Tianchi platform is for reference only. Final scores will be evaluated by the competition committee using standardized hardware and the official evaluation system.
-
Your optimized code -
evaluation_wrapper.pycontaining your optimizedVLMModelclass -
Docker image - Container with your optimized environment
- Self-Testing: Use the provided
benchmark.pyto test your optimizations locally - Submission: Upload your
result.jsonto the Tianchi platform (for reference only) - Official Evaluation: The competition committee will evaluate your code using:
- Docker image submission
- Standardized hardware environment
- Official evaluation code
- Full validation set with random sampling for verification
- Final Ranking: Based on the final score calculated by the official evaluation system
We hope you will focus on operator-level optimization, kernel replacement, and efficient memory management. Remember: accuracy and speed are equally important! Good luck!