Multiple trials on the same dataset return very different results #27

@rudra0713

Description

Strange Variability in SummaC Scores Across Multiple Runs

Hi,

Thanks for sharing the codebase. I am running into a very strange issue when I run this codebase on my custom data.

Setup

I have a list of source sentences, which I concatenate to create the source document. I also have LLM-generated summaries, each computed from a different subset of the source sentences. There are 15 summaries in total.

Issue: Inconsistent Results

During the first trial, the performance looked like this:

SUMMAC EVALUATION RESULTS – CONFIG 4
Config Index: 4

Model                3        5       10       20      all
----------------------------------------------------------------
llama3.1:8b      0.5514   0.5405   0.5252   0.5226   0.5671
llama3.3:70b     0.4986   0.5365   0.5435   0.5571   0.5497
gemma3:27b       0.5203   0.5124   0.5071   0.5135   0.5100

When I executed the same code a second time, the performance looked like this:

SUMMAC EVALUATION RESULTS – CONFIG 4
Config Index: 4

Model                3        5       10       20      all
----------------------------------------------------------------
llama3.1:8b      0.0000   0.0000   0.0000   0.0000   0.0000
llama3.3:70b     0.0000   0.0000   0.0000   0.0000   0.0000
gemma3:27b       0.0000   0.0000   0.0000   0.0000   0.0000

I even tried with 3 configurations instead of 15. Across the two trials, the results were dramatically different.

Trial 1:

Config 4 - Comments per topic: 5

llama3.1:8b          1.0000
llama3.3:70b         1.0000
gemma3:27b           0.9999

Trial 2:

Config 4 - Comments per topic: 5

llama3.1:8b          0.2663
llama3.3:70b         0.1708
gemma3:27b           0.4512

Code

I am quite baffled by this level of variability in the results. Here is my code:

import json
import os

import torch
from summac.model_summac import SummaCConv

config_index = 4  # the config being evaluated (Config 4 in the outputs above)

COMMENTS_PER_TOPIC_LIST = ["3", "5", "10", "20", "all"]
MODEL_TAGS = ["llama3.1:8b", "llama3.3:70b", "gemma3:27b"]

# Paths
json_path = f"Reddit_Analysis-main/representative_sentences_config_{config_index}.json"

# GPU
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}\n")

# Load SummaC model once
print("Loading SummaC model...")
model = SummaCConv(models=["vitc"], bins="percentile", granularity="sentence",
                   nli_labels="e", device=device, agg="mean", batch_size=32)
print("Model loaded.\n")

# Load source document once
with open(json_path) as f:
    data = json.load(f)
    sentences = [s for obj in data for topic in obj["topics"] for s in topic["representative sentences"]]

print(f"number of source sentences: {len(sentences)}")
source_doc = "\n".join(sentences)
source_preview = source_doc[:100].replace("\n", " ")
print("source preview:")
print(source_preview)

# Store results
results = {}

# Run evaluations
for model_tag in MODEL_TAGS:
    results[model_tag] = {}
    for cpt in COMMENTS_PER_TOPIC_LIST:
        summary_path = f"prompt_7/representative_sentences_summary_config_{config_index}_comments_per_topic_{cpt}_{model_tag}.txt"

        print(f"Evaluating: {model_tag} | comments_per_topic={cpt}")

        if not os.path.exists(summary_path):
            print(f"  ERROR: File not found: {summary_path}")
            results[model_tag][cpt] = None
            continue

        # Load summary
        with open(summary_path) as f:
            summary = f.read().strip()
        print("\n" + "=" * 80)
        print(f"SUMMARY PREVIEW — Config {config_index} | model={model_tag} | comments_per_topic={cpt}")
        print("-" * 80)
        print("Summary (full):")
        print(summary)

        # Clear GPU cache to prevent any state accumulation
        if device == "cuda":
            torch.cuda.empty_cache()
        # Compute score
        score = model.score([source_doc], [summary])["scores"][0]
        results[model_tag][cpt] = round(score, 4)
        print(f"  Score: {score:.4f}\n")
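To rule out ordinary RNG nondeterminism before anything else, one thing I plan to try is pinning every random seed once before constructing `SummaCConv`. A minimal sketch (the `set_all_seeds` helper name is my own, not part of SummaC):

```python
import random

import numpy as np
import torch


def set_all_seeds(seed: int = 0) -> None:
    """Pin Python, NumPy, and PyTorch RNGs so repeated runs start identically."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)  # seeds CPU RNG (and, in recent PyTorch, CUDA too)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)  # explicit, for multi-GPU setups


# Call once per run, before SummaCConv(...) is constructed.
set_all_seeds(42)
```

If scores still swing between 0.0 and 1.0 with seeds pinned, the variability would presumably be coming from model or file state rather than from random initialization.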

Questions

  1. Is this behavior expected?
  2. What could be causing such dramatic variability in scores (ranging from 0.0 to 1.0)?
  3. Are there any known issues with running multiple evaluations sequentially?

Any insights would be greatly appreciated!
