Strange Variability in SummaC Scores Across Multiple Runs
Hi,
Thanks for sharing the codebase. I am running into a very strange issue when applying it to my own data.
Setup
I have a list of source sentences, which I concatenate to form the source document. I also have several LLM-generated summaries, each produced from a different subset of the source sentences; there are 15 summaries in total.
Issue: Inconsistent Results
During the first trial, the performance looked like this:
SUMMAC EVALUATION RESULTS – CONFIG 4
Config Index: 4
Model           3        5        10       20       all    (columns = comments_per_topic)
-----------------------------------------------------------------------------------------
llama3.1:8b     0.5514   0.5405   0.5252   0.5226   0.5671
llama3.3:70b    0.4986   0.5365   0.5435   0.5571   0.5497
gemma3:27b      0.5203   0.5124   0.5071   0.5135   0.5100
When I executed the same code a second time, the results looked like this:
SUMMAC EVALUATION RESULTS – CONFIG 4
Config Index: 4
Model           3        5        10       20       all
-------------------------------------------------------
llama3.1:8b     0.0000   0.0000   0.0000   0.0000   0.0000
llama3.3:70b    0.0000   0.0000   0.0000   0.0000   0.0000
gemma3:27b      0.0000   0.0000   0.0000   0.0000   0.0000
I even tried with 3 configurations instead of 15, and the results still differed dramatically between two trials of the same code.
Trial 1:
Config 4 - Comments per topic: 5
llama3.1:8b 1.0000
llama3.3:70b 1.0000
gemma3:27b 0.9999
Trial 2:
Config 4 - Comments per topic: 5
llama3.1:8b 0.2663
llama3.3:70b 0.1708
gemma3:27b 0.4512
Code
I am quite baffled by this variability. Here is my code:
import json
import os

import torch
from summac.model_summac import SummaCConv

COMMENTS_PER_TOPIC_LIST = ["3", "5", "10", "20", "all"]
MODEL_TAGS = ["llama3.1:8b", "llama3.3:70b", "gemma3:27b"]
config_index = 4

# Paths
json_path = f"Reddit_Analysis-main/representative_sentences_config_{config_index}.json"

# GPU
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}\n")

# Load SummaC model once
print("Loading SummaC model...")
model = SummaCConv(models=["vitc"], bins="percentile", granularity="sentence",
                   nli_labels="e", device=device, agg="mean", batch_size=32)
print("Model loaded.\n")

# Load source document once
with open(json_path) as f:
    data = json.load(f)
sentences = [s for obj in data for topic in obj["topics"]
             for s in topic["representative sentences"]]
print(f"number of source sentences: {len(sentences)}")
source_doc = "\n".join(sentences)
source_preview = source_doc[:100].replace("\n", " ")
print("source preview:")
print(source_preview)

# Store results
results = {}

# Run evaluations
for model_tag in MODEL_TAGS:
    results[model_tag] = {}
    for cpt in COMMENTS_PER_TOPIC_LIST:
        summary_path = (f"prompt_7/representative_sentences_summary_config_{config_index}"
                        f"_comments_per_topic_{cpt}_{model_tag}.txt")
        print(f"Evaluating: {model_tag} | comments_per_topic={cpt}")
        if not os.path.exists(summary_path):
            print(f"  ERROR: File not found: {summary_path}")
            results[model_tag][cpt] = None
            continue

        # Load summary
        with open(summary_path) as f:
            summary = f.read().strip()
        print("\n" + "=" * 80)
        print(f"SUMMARY PREVIEW — Config {config_index} | model={model_tag} | comments_per_topic={cpt}")
        print("-" * 80)
        print("Summary (full):")
        print(summary)

        # Clear GPU cache to prevent any state accumulation
        if device == "cuda":
            torch.cuda.empty_cache()

        # Compute score
        score = model.score([source_doc], [summary])["scores"][0]
        results[model_tag][cpt] = round(score, 4)
        print(f"  Score: {score:.4f}\n")
Questions
- Is this behavior expected?
- What could be causing such dramatic variability in scores (ranging from 0.0 to 1.0)?
- Are there any known issues with running multiple evaluations sequentially?
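To help narrow this down, I also tried isolating the scorer itself: score the exact same (source, summary) pair several times within one process and look at the spread. Below is a minimal, self-contained sketch of that check; `check_score_stability` and `stub_score_fn` are hypothetical names I introduced for illustration, and in practice I substitute the real scorer, e.g. `lambda src, summ: model.score([src], [summ])["scores"][0]` using the SummaCConv instance from the code above.

```python
import statistics

def check_score_stability(score_fn, source, summary, n_runs=5):
    """Score the same (source, summary) pair n_runs times and report the
    spread; a deterministic scorer should give a stdev of exactly 0.0."""
    scores = [score_fn(source, summary) for _ in range(n_runs)]
    return {
        "scores": scores,
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores),
    }

# Stand-in scorer for illustration only; with summac installed you would use
# the real model instead, e.g.:
#   score_fn = lambda src, summ: model.score([src], [summ])["scores"][0]
stub_score_fn = lambda src, summ: 0.55

report = check_score_stability(stub_score_fn, "source text", "summary text")
print(report["mean"], report["stdev"])  # a deterministic scorer prints 0.55 0.0
```

With the stub the spread is trivially zero; the interesting case is whether the real model's spread within one process is also zero while scores still differ across separate runs of the script.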
Any insights would be greatly appreciated!