Multiple trials on the same dataset return very different results #27

@rudra0713

Description

Strange Variability in SummaC Scores Across Multiple Runs

Hi,

Thanks for sharing the codebase. I am running into a very strange issue when I run this codebase on my custom data.

Setup

I have a list of source sentences, which I concatenate to create the source document. I also have LLM-generated summaries, each computed from a different subset of the source sentences. There are 15 summaries in total.

Issue: Inconsistent Results

During the first trial, the performance looked like this:

SUMMAC EVALUATION RESULTS – CONFIG 4
Config Index: 4

Model                3        5       10       20      all
----------------------------------------------------------------
llama3.1:8b      0.5514   0.5405   0.5252   0.5226   0.5671
llama3.3:70b     0.4986   0.5365   0.5435   0.5571   0.5497
gemma3:27b       0.5203   0.5124   0.5071   0.5135   0.5100

When I executed the same code a second time, the performance looked like this:

SUMMAC EVALUATION RESULTS – CONFIG 4
Config Index: 4

Model                3        5       10       20      all
----------------------------------------------------------------
llama3.1:8b      0.0000   0.0000   0.0000   0.0000   0.0000
llama3.3:70b     0.0000   0.0000   0.0000   0.0000   0.0000
gemma3:27b       0.0000   0.0000   0.0000   0.0000   0.0000

I even tried with 3 configurations instead of 15. Across the two trials, the results were dramatically different.

Trial 1:

Config 4 - Comments per topic: 5

llama3.1:8b          1.0000
llama3.3:70b         1.0000
gemma3:27b           0.9999

Trial 2:

Config 4 - Comments per topic: 5

llama3.1:8b          0.2663
llama3.3:70b         0.1708
gemma3:27b           0.4512

Code

I am quite baffled by this level of variability in the results. Here is my code:

import json
import os

import torch
from summac.model_summac import SummaCConv

config_index = 4  # the config being evaluated (Config 4 in the outputs above)

COMMENTS_PER_TOPIC_LIST = ["3", "5", "10", "20", "all"]
MODEL_TAGS = ["llama3.1:8b", "llama3.3:70b", "gemma3:27b"]

# Paths
json_path = f"Reddit_Analysis-main/representative_sentences_config_{config_index}.json"

# GPU
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}\n")

# Load SummaC model once
print("Loading SummaC model...")
model = SummaCConv(models=["vitc"], bins="percentile", granularity="sentence",
                   nli_labels="e", device=device, agg="mean", batch_size=32)
print("Model loaded.\n")

# Load source document once
with open(json_path) as f:
    data = json.load(f)
    sentences = [s for obj in data for topic in obj["topics"] for s in topic["representative sentences"]]

print(f"number of source sentences: {len(sentences)}")
source_doc = "\n".join(sentences)
source_preview = source_doc[:100].replace("\n", " ")
print("source preview:")
print(source_preview)

# Store results
results = {}

# Run evaluations
for model_tag in MODEL_TAGS:
    results[model_tag] = {}
    for cpt in COMMENTS_PER_TOPIC_LIST:
        summary_path = f"prompt_7/representative_sentences_summary_config_{config_index}_comments_per_topic_{cpt}_{model_tag}.txt"

        print(f"Evaluating: {model_tag} | comments_per_topic={cpt}")

        if not os.path.exists(summary_path):
            print(f"  ERROR: File not found: {summary_path}")
            results[model_tag][cpt] = None
            continue

        # Load summary
        with open(summary_path) as f:
            summary = f.read().strip()
        print("\n" + "=" * 80)
        print(f"SUMMARY PREVIEW — Config {config_index} | model={model_tag} | comments_per_topic={cpt}")
        print("-" * 80)
        print("Summary (full):")
        print(summary)

        # Clear GPU cache to prevent any state accumulation
        if device == "cuda":
            torch.cuda.empty_cache()
        # Compute score
        score = model.score([source_doc], [summary])["scores"][0]
        results[model_tag][cpt] = round(score, 4)
        print(f"  Score: {score:.4f}\n")
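To rule out ordinary RNG nondeterminism before anything else, one thing I plan to try is pinning every random seed once before constructing `SummaCConv`. A minimal sketch (the `set_all_seeds` helper name is my own, not part of SummaC):

```python
import random

import numpy as np
import torch


def set_all_seeds(seed: int = 0) -> None:
    """Pin Python, NumPy, and PyTorch RNGs so repeated runs start identically."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)  # seeds CPU RNG (and, in recent PyTorch, CUDA too)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)  # explicit, for multi-GPU setups


# Call once per run, before SummaCConv(...) is constructed.
set_all_seeds(42)
```

If scores still swing between 0.0 and 1.0 with seeds pinned, the variability would presumably be coming from model or file state rather than from random initialization.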

Questions

  1. Is this behavior expected?
  2. What could be causing such dramatic variability in scores (ranging from 0.0 to 1.0)?
  3. Are there any known issues with running multiple evaluations sequentially?

Any insights would be greatly appreciated!
