The current results can be improved by addressing the following issues:
Statistical Test Consistency in Part 2 Results
- Currently, either a paired t-test or a Wilcoxon test is used for each arm, depending on the outcome of a normality check.
- Proposal: Use the Wilcoxon signed-rank test for all arms, so that every arm is evaluated with the same method and the results are directly comparable.

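As an illustration of the proposal above, here is a minimal, dependency-free sketch of an exact two-sided Wilcoxon signed-rank test for paired samples. The function names are hypothetical and not taken from the report's code. It enumerates every sign pattern, so it is only practical for small samples; in practice a library routine such as `scipy.stats.wilcoxon` would be the usual choice.

```python
from itertools import product


def average_ranks(values):
    """Return 1-based ranks of `values`; tied values share their average rank."""
    return [
        sum(v < x for v in values) + (sum(v == x for v in values) + 1) / 2
        for x in values
    ]


def wilcoxon_signed_rank(x, y):
    """Exact two-sided Wilcoxon signed-rank test for paired samples.

    Returns (W_plus, p_value). Zero differences are dropped.
    Enumerates all 2**n sign patterns, so keep n small (roughly n <= 20).
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    if not diffs:
        raise ValueError("all paired differences are zero")
    ranks = average_ranks([abs(d) for d in diffs])
    total = sum(ranks)
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    observed = min(w_plus, total - w_plus)
    # Under H0 each rank is equally likely to carry a + or a - sign.
    hits = 0
    for signs in product((False, True), repeat=len(diffs)):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if min(w, total - w) <= observed:
            hits += 1
    return w_plus, hits / 2 ** len(diffs)


# Hypothetical paired scores (in percent) for one arm, with and without AI.
with_ai = [82, 75, 91, 68, 88, 79]
without_ai = [78, 70, 85, 67, 80, 77]
w_plus, p_value = wilcoxon_signed_rank(with_ai, without_ai)
print(f"W+ = {w_plus}, two-sided p = {p_value}")
```

Applying the same function to every arm removes the dependence on a per-arm normality check.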
Non-Inferiority Threshold Clarification
- The non-inferiority p-value and statistical test rely on a threshold (the non-inferiority margin) that is currently unspecified.
- Action Needed: Define and document this threshold. The suggested value is 10%.

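To show why the threshold has to be stated explicitly, the sketch below runs a one-sided exact Wilcoxon signed-rank test on differences shifted by the margin. It rests on assumptions not confirmed by the report: the 10% is read as an absolute margin of 10 percentage points on the score difference (AI minus reference), and the function name is hypothetical. A relative margin would need a different shift.

```python
from itertools import product


def noninferiority_p_value(ai_scores, reference_scores, margin):
    """One-sided exact Wilcoxon signed-rank test for non-inferiority.

    H0: AI is worse than the reference by at least `margin`.
    H1: AI is worse by less than `margin` (non-inferior).
    Scores and margin must use the same units (here: percentage points).
    """
    # Shift each paired difference by the margin, then drop exact zeros.
    shifted = [a - r + margin for a, r in zip(ai_scores, reference_scores)]
    shifted = [d for d in shifted if d != 0]
    if not shifted:
        raise ValueError("all shifted differences are zero")
    abs_d = [abs(d) for d in shifted]
    # Average ranks of |d| (ties share their mean rank).
    ranks = [
        sum(v < x for v in abs_d) + (sum(v == x for v in abs_d) + 1) / 2
        for x in abs_d
    ]
    w_plus = sum(r for r, d in zip(ranks, shifted) if d > 0)
    # Exact null distribution: every sign pattern is equally likely.
    hits = sum(
        sum(r for r, s in zip(ranks, signs) if s) >= w_plus
        for signs in product((False, True), repeat=len(ranks))
    )
    return hits / 2 ** len(ranks)


# Hypothetical paired scores in percent; margin = 10 percentage points (assumed).
ai = [78, 70, 85, 67, 80, 77]
reference = [82, 75, 91, 68, 88, 79]
print(noninferiority_p_value(ai, reference, margin=10))
```

The same data can give very different p-values under different margins, which is why the chosen value and its interpretation (absolute or relative) should be documented alongside the results.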
Missing Weighted F-Score Table in Part 2 Results
- The report lacks weighted F-score results broken down by phenotype and by AI arm; the current results are aggregated across all phenotypes.
- Action Needed: Add a table showing weighted F-scores for each phenotype and AI arm.

Include Unweighted F-Score Results
- Unweighted F-score results are not currently included.
- Action Needed: Add these to the report for completeness.

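The two F-score items above could be served by a single table. The sketch below is illustrative only and assumes that F-scores are available per concept, that "weighted" means a support-weighted mean of those per-concept scores, and that "unweighted" means their plain mean. The row layout, arm names, and numbers are all hypothetical.

```python
from collections import defaultdict

# Hypothetical per-concept results: (ai_arm, phenotype, concept_f1, concept_support).
rows = [
    ("arm_A", "phenotype_1", 0.90, 30),
    ("arm_A", "phenotype_1", 0.50, 10),
    ("arm_A", "phenotype_2", 0.80, 20),
    ("arm_B", "phenotype_1", 0.70, 30),
    ("arm_B", "phenotype_1", 0.90, 10),
    ("arm_B", "phenotype_2", 0.60, 20),
]


def f_score_table(rows):
    """Return {(arm, phenotype): {"weighted_f1": ..., "unweighted_f1": ...}}.

    Assumes every group has positive total support.
    """
    groups = defaultdict(list)
    for arm, phenotype, f1, support in rows:
        groups[(arm, phenotype)].append((f1, support))
    table = {}
    for key, items in sorted(groups.items()):
        total_support = sum(s for _, s in items)
        table[key] = {
            # Support-weighted mean of per-concept F1 (assumed meaning of "weighted").
            "weighted_f1": sum(f * s for f, s in items) / total_support,
            # Plain mean of per-concept F1 (assumed meaning of "unweighted").
            "unweighted_f1": sum(f for f, _ in items) / len(items),
        }
    return table


table = f_score_table(rows)
for (arm, phenotype), scores in table.items():
    print(
        f"{arm:6} {phenotype:12} "
        f"weighted={scores['weighted_f1']:.3f} "
        f"unweighted={scores['unweighted_f1']:.3f}"
    )
```

Showing both columns side by side makes it visible when a few high-support concepts dominate the weighted score.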
Agreement Metrics in Part 1 Results
- Agreement is currently pooled across all AI arms.
- Action Needed:
  - Report agreement (overlap bars) separately for each AI arm and each disease.
  - Weight these bars by concept prevalence.

Precision and Recall Reporting
- Only F-score is reported.
- Action Needed: Report precision and recall separately for each AI arm and each disease.
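
A minimal sketch of the breakdown requested above, assuming true-positive, false-positive, and false-negative counts are available per annotated case and can be pooled within each (AI arm, disease) group. The input layout and all names are hypothetical.

```python
from collections import Counter

# Hypothetical confusion counts per annotated case: (ai_arm, disease, tp, fp, fn).
counts = [
    ("arm_A", "disease_1", 8, 2, 0),
    ("arm_A", "disease_1", 4, 2, 4),
    ("arm_B", "disease_1", 6, 0, 6),
]


def precision_recall_by_group(counts):
    """Pool counts per (arm, disease) and return {(arm, disease): (precision, recall)}."""
    pooled = {}
    for arm, disease, tp, fp, fn in counts:
        pooled.setdefault((arm, disease), Counter()).update(tp=tp, fp=fp, fn=fn)
    report = {}
    for key, c in pooled.items():
        predicted = c["tp"] + c["fp"]  # everything the arm flagged
        actual = c["tp"] + c["fn"]     # everything that should have been flagged
        precision = c["tp"] / predicted if predicted else float("nan")
        recall = c["tp"] / actual if actual else float("nan")
        report[key] = (precision, recall)
    return report


for (arm, disease), (precision, recall) in precision_recall_by_group(counts).items():
    print(f"{arm} {disease}: precision={precision:.2f} recall={recall:.2f}")
```

Reporting the two values separately shows whether a given F-score is driven by missed concepts or by spurious ones.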