A research project exploring which attention heads in CodeBERT are most critical for mapping natural language to code.
Investigate which attention heads and layers in BERT-based models (such as CodeBERT or GraphCodeBERT) are most influential in mapping natural language queries to code snippets.
- Curate a dataset of paired natural language queries and corresponding code snippets.
- Use publicly available sources such as GitHub, the CodeSearchNet corpus, or curated academic benchmarks.
- Utilize pre-trained code retrieval models (e.g., CodeBERT, GraphCodeBERT).
- Pass query–code pairs through the model and extract attention weights from each layer and head.
- Focus on both query-to-code and code-to-query attention when applicable; a minimal extraction sketch follows below.
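A minimal extraction sketch, assuming the publicly released `microsoft/codebert-base` checkpoint and an illustrative query–code pair; the separator-based span detection reflects RoBERTa-style pair encoding and should be adapted to the chosen model.

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "microsoft/codebert-base"  # assumed checkpoint; swap in GraphCodeBERT as needed
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, output_attentions=True)
model.eval()

# Illustrative query-code pair.
query = "sort a list of integers in descending order"
code = "def sort_desc(xs):\n    return sorted(xs, reverse=True)"

# RoBERTa-style pair encoding: <s> query </s></s> code </s>
inputs = tokenizer(query, code, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: tuple with one tensor per layer, each of shape
# (batch, num_heads, seq_len, seq_len).
attentions = torch.stack(outputs.attentions).squeeze(1)  # (layers, heads, seq, seq)

# Locate the query and code token spans via the separator tokens.
sep_positions = (inputs["input_ids"][0] == tokenizer.sep_token_id).nonzero().flatten()
query_end = sep_positions[0].item()
query_slice = slice(1, query_end)        # skip <s>
code_slice = slice(query_end + 2, -1)    # skip the two </s> separators and the final </s>

# Mean cross-attention per (layer, head): query tokens attending to code tokens
# and vice versa.
q2c = attentions[:, :, query_slice, code_slice].mean(dim=(-1, -2))  # (layers, heads)
c2q = attentions[:, :, code_slice, query_slice].mean(dim=(-1, -2))  # (layers, heads)
print(q2c.shape)  # e.g. torch.Size([12, 12]) for codebert-base
```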
- Compute and analyze the distribution of attention activations across layers and heads.
- Identify statistically significant attention patterns, such as:
  - Heads that consistently show high activation for specific structures (e.g., loops, functions).
  - Layers where most meaningful semantic alignment occurs.
- Use statistical tests (e.g., ANOVA, t-tests) to assess significance; a minimal testing sketch follows below.
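A minimal testing sketch, using random placeholder arrays in place of the real aggregated query-to-code activations and construct labels; the chosen layer/head indices are illustrative, and a multiple-comparison correction (e.g., Bonferroni) would be needed when scanning all heads.

```python
import numpy as np
from scipy import stats

# Placeholder activations and labels; replace with the aggregated query-to-code
# scores and construct labels computed from the extraction step.
n_examples, n_layers, n_heads = 500, 12, 12
rng = np.random.default_rng(0)
q2c_scores = rng.random((n_examples, n_layers, n_heads))
has_loop = rng.integers(0, 2, n_examples).astype(bool)

# One-way ANOVA: do mean activations differ across layers?
per_layer = [q2c_scores[:, layer, :].mean(axis=1) for layer in range(n_layers)]
f_stat, p_layer = stats.f_oneway(*per_layer)
print(f"layer effect: F={f_stat:.2f}, p={p_layer:.3g}")

# Welch's t-test for one (layer, head): loop-containing vs. other snippets.
# With 12 x 12 heads, correct for multiple comparisons when scanning them all.
layer, head = 5, 3
t_stat, p_head = stats.ttest_ind(
    q2c_scores[has_loop, layer, head],
    q2c_scores[~has_loop, layer, head],
    equal_var=False,
)
print(f"head ({layer},{head}) loop vs. other: t={t_stat:.2f}, p={p_head:.3g}")
```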
- Apply clustering or dimensionality reduction techniques such as PCA or t-SNE to visualize attention patterns; a visualization sketch follows below.
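A visualization sketch, assuming each example is summarized by its flattened layer × head activation profile; the placeholder arrays, PCA component count, and t-SNE perplexity are illustrative choices.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Placeholder per-example activation profiles and labels; replace with the
# real aggregated scores from the extraction step.
rng = np.random.default_rng(0)
q2c_scores = rng.random((500, 12, 12))
has_loop = rng.integers(0, 2, 500).astype(bool)

X = q2c_scores.reshape(len(q2c_scores), -1)  # (n_examples, layers * heads)

# PCA to reduce noise, then t-SNE for a 2-D map of attention profiles.
X_pca = PCA(n_components=20, random_state=0).fit_transform(X)
X_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X_pca)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=has_loop, cmap="coolwarm", s=10)
plt.title("t-SNE of per-example attention profiles")
plt.show()
```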
- Correlate observed patterns with:
  - Types of code constructs.
  - Natural language instruction patterns.
- Interpret the role of specific heads in semantic matching between query and code; a construct-labeling sketch follows below.
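A construct-labeling sketch for relating head activations to code constructs, assuming Python-only snippets parsed with the standard-library `ast` module; the snippets, label categories, and activation values are illustrative placeholders.

```python
import ast
import numpy as np

def construct_label(code: str) -> str:
    """Rough construct tag for a Python snippet (illustrative categories)."""
    try:
        nodes = list(ast.walk(ast.parse(code)))
    except SyntaxError:
        return "unparsable"
    if any(isinstance(n, (ast.For, ast.While)) for n in nodes):
        return "loop"
    if any(isinstance(n, ast.FunctionDef) for n in nodes):
        return "function"
    return "other"

# Illustrative snippets and placeholder activations for one (layer, head);
# in practice, use the dataset snippets and the aggregated attention scores.
snippets = [
    "for x in xs:\n    total += x",
    "def add(a, b):\n    return a + b",
    "result = a + b",
]
labels = np.array([construct_label(s) for s in snippets])
head_activation = np.array([0.31, 0.12, 0.08])

for category in np.unique(labels):
    mean_act = head_activation[labels == category].mean()
    print(f"mean activation for {category!r}: {mean_act:.3f}")
```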
- CodeSearchNet: a large-scale dataset of code snippets paired with their natural language descriptions; a loading sketch follows this list.
- Synthetic Data: supplementary synthetic query–code datasets sourced from Kaggle.
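A loading sketch for CodeSearchNet via the Hugging Face `datasets` library, assuming the `code_search_net` dataset and its `python` configuration remain downloadable from the Hub; the field names follow the published dataset card.

```python
from datasets import load_dataset

# trust_remote_code may be required for script-based datasets in recent
# `datasets` releases; availability of the original data files is assumed.
ds = load_dataset("code_search_net", "python", split="train[:1000]", trust_remote_code=True)

# Pair each natural language docstring with its code snippet.
pairs = [
    (example["func_documentation_string"], example["func_code_string"])
    for example in ds
]
print(pairs[0][0][:80])
```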