Pretrained vs Own-Trained F1 Score Discrepancy #24

**Description:**  
While running the ATLAS system, we observed a significant discrepancy in **F1 scores** between the pretrained model and the model we trained ourselves. The pretrained model matches the paper’s results to within 1% at both levels, but the own-trained model's F1 is about 23.5% lower at M1 (event level) and about 35.0% lower at M3 (entity level).  

We consider reproducibility differences of up to **±2%** acceptable, but both own-trained deviations are far outside that margin.
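For reference, the tolerance check can be sketched as follows. This assumes the ±2% margin refers to the relative F1 difference against the paper, which is how the F1 % Diff column in the results appears to be computed. The helper names `relative_diff_pct` and `within_tolerance` are illustrative and not part of ATLAS.

```python
def relative_diff_pct(measured: float, reference: float) -> float:
    """Relative difference of `measured` against `reference`, in percent."""
    return (measured - reference) / reference * 100.0


def within_tolerance(measured: float, reference: float, tol_pct: float = 2.0) -> bool:
    """True if the relative difference stays within +/- tol_pct percent.

    Assumption: the tolerance is relative to the reference, not absolute.
    """
    return abs(relative_diff_pct(measured, reference)) <= tol_pct


# Rounded F1 scores as reported in this issue (the paper is the reference).
paper_f1 = {"M1": 0.9998, "M3": 0.9722}
runs = {
    "Own-Trained": {"M1": 0.7648, "M3": 0.6316},
    "Pre-Trained": {"M1": 0.9999, "M3": 0.9796},
}

for name, scores in runs.items():
    for level, f1 in scores.items():
        diff = relative_diff_pct(f1, paper_f1[level])
        status = "OK" if within_tolerance(f1, paper_f1[level]) else "exceeds"
        print(f"{name} {level}: {diff:+.2f}% ({status})")
```

Because the inputs are the rounded four-decimal scores, the printed differences can deviate from the reported column in the second decimal place.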

---

### **Results Summary**

| Model           | Level            |   TP |  FP |     TN |   FN | Precision | Recall | F1 Score | F1 % Diff vs. Paper |
| :-------------- | :--------------- | ---: | --: | -----: | ---: | --------: | -----: | -------: | ------------------: |
| **Paper**       | M1 (Event Level) | 8168 |   3 |  24304 |    0 |    0.9996 | 1.0000 |   0.9998 |                   — |
| **Own-Trained** | M1 (Event Level) | 5299 | 379 | 243137 | 2881 |    0.9333 | 0.6478 |   0.7648 |        **−23.5103** |
| **Pre-Trained** | M1 (Event Level) | 8180 |   1 | 243494 |    0 |    0.9999 | 1.0000 |   0.9999 |          **0.0123** |

| Model           | Level             | TP | FP |    TN | FN | Precision | Recall | F1 Score | F1 % Diff vs. Paper |
| :-------------- | :---------------- | -: | -: | ----: | -: | --------: | -----: | -------: | ------------------: |
| **Paper**       | M3 (Entity Level) | 35 |  1 | 24423 |  1 |    0.9722 | 0.9722 |   0.9722 |                   — |
| **Own-Trained** | M3 (Entity Level) | 18 | 15 |  1263 |  6 |    0.5455 | 0.7500 |   0.6316 |        **−35.0376** |
| **Pre-Trained** | M3 (Entity Level) | 24 |  1 |  1308 |  0 |    0.9600 | 1.0000 |   0.9796 |          **0.7580** |
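As a sanity check on the numbers above, precision, recall, and F1 can be recomputed directly from the confusion counts. This is a minimal sketch using the standard definitions; the `Confusion` class is a hypothetical helper written for this issue and does not reflect how ATLAS computes its metrics internally.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Confusion-matrix counts for one model at one evaluation level."""
    tp: int
    fp: int
    tn: int
    fn: int

    @property
    def precision(self) -> float:
        denom = self.tp + self.fp
        return self.tp / denom if denom else 0.0

    @property
    def recall(self) -> float:
        denom = self.tp + self.fn
        return self.tp / denom if denom else 0.0

    @property
    def f1(self) -> float:
        # Equivalent to the harmonic mean of precision and recall.
        denom = 2 * self.tp + self.fp + self.fn
        return 2 * self.tp / denom if denom else 0.0


# Own-trained counts copied from the tables above.
own_m1 = Confusion(tp=5299, fp=379, tn=243137, fn=2881)
own_m3 = Confusion(tp=18, fp=15, tn=1263, fn=6)

for label, c in [("Own-Trained M1", own_m1), ("Own-Trained M3", own_m3)]:
    print(f"{label}: P={c.precision:.4f} R={c.recall:.4f} F1={c.f1:.4f}")
```

The recomputed values agree with the reported ones, so the gap is not a metric-calculation error. The counts also show where the loss comes from: at M1 it is mostly missed detections (2881 false negatives, recall 0.6478), while at M3 it is mostly false positives (15 against 18 true positives, precision 0.5455).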

---

**Request:**  
Any ideas or guidance on what could be causing this F1 discrepancy between the pretrained and own-trained models?

