I tried running both models (ZS and Conv) on the RAGTruth dataset (https://github.com/ParticleMedia/RAGTruth).
The steps: I filtered the RAGTruth dataset down to the summarization task and fed those examples to the models:
```python
model_zs = SummaCZS(granularity="sentence", model_name="vitc", device="cuda")
model_conv = SummaCConv(models=["vitc"], bins='percentile', granularity="sentence", nli_labels="e", device="cuda", start_file="default", agg="mean")
```
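For reference, this is roughly how I filter and score the pairs. The column names (`task_type`, `source_info`, `response`) come from my own preprocessing of the RAGTruth JSON files into a dataframe, so treat them as assumptions rather than the dataset's exact schema:

```python
import pandas as pd

# Keep only the summarization-task examples; ragtruth_df is my dataframe built
# from the RAGTruth source/response files (column names are assumptions).
result_df = ragtruth_df[ragtruth_df["task_type"] == "Summary"].copy()

# SummaC scoring: score(sources, generations) returns a dict with a "scores" list.
result_df["zs_pred_score"] = model_zs.score(
    result_df["source_info"].tolist(), result_df["response"].tolist()
)["scores"]
result_df["conv_pred_score"] = model_conv.score(
    result_df["source_info"].tolist(), result_df["response"].tolist()
)["scores"]
```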
I considered an example in the RAGTruth dataset hallucinated if it had any hallucination labels (annotated spans) reported against it. Since the models output a consistency score, I used 1 - hallucination label as the ground-truth label, so that 1 means consistent and 0 means hallucinated.
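A minimal sketch of that label construction, assuming the annotated spans for each response live in a list column I call `labels` (hypothetical name from my preprocessing):

```python
# Any non-empty list of annotated spans counts as a hallucinated response.
result_df["hallucinated"] = result_df["labels"].apply(lambda spans: int(len(spans) > 0))

# Flip it so the ground truth matches the direction of the consistency score:
# 1 = consistent (no annotated hallucination), 0 = hallucinated.
result_df["label"] = 1 - result_df["hallucinated"]
```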
Later on I used the util code to choose the best threshold, e.g.:

```python
best_thresholds_conv = choose_best_threshold(result_df['label'], result_df['conv_pred_score'])
```
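To turn that threshold into the reported F1, I binarize the scores and score them with scikit-learn. This is a sketch assuming `best_thresholds_conv` is (or contains) a single scalar threshold; adjust the unpacking if the util returns a tuple:

```python
from sklearn.metrics import f1_score

# Predict "consistent" (1) when the consistency score clears the chosen threshold,
# then compare against the ground-truth labels built above.
conv_preds = (result_df["conv_pred_score"] >= best_thresholds_conv).astype(int)
print("Conv F1:", f1_score(result_df["label"], conv_preds))
```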
I am getting an F1 score of around 0.6 on this dataset. I will paste the exact results as a comment.