6 Performance Metrics Analysis

This chapter reports performance metrics for the AI and for the pathologists: sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), ROC curves, and detailed confusion matrices.

6.1 AI Performance Metrics

6.1.1 Phase 1: AI vs Report Diagnosis

AI Performance Against Original Report Diagnoses

This analysis compares AI predictions with the original pathology report diagnoses from Phase 1 of the study.

6.1.2 Confusion Matrix: AI vs Report

Note for Pathologists: This confusion matrix visualizes the matches and mismatches between AI predictions and Phase 1 report diagnoses.

AI Predictions (rows) vs Report Diagnoses (columns)
AI \ Report	Malignant	Benign
Malignant	215	28
Benign	1	607

6.1.3 Performance Metrics

Note for Pathologists: This table evaluates the AI’s standalone performance against the original pathology report diagnoses from Phase 1.

AI Performance Metrics with 95% CI
Metric	Estimate	Lower_CI	Upper_CI
Sensitivity	0.995	0.974	1.000
Specificity	0.956	0.937	0.971
PPV	0.885	0.838	0.922
NPV	0.998	0.991	1.000
Accuracy	0.966	NA	NA
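The point estimates in this table can be reproduced from the confusion matrix in 6.1.2, reading rows as AI predictions and columns as report diagnoses (the orientation that matches the reported values). The Python sketch below does this and also computes an exact (Clopper-Pearson) binomial interval by bisection. The report does not state which interval method was used, so Clopper-Pearson is an assumption here, and all function names are illustrative.

```python
from math import comb

# Counts from the AI vs report confusion matrix (report diagnosis as reference).
TP, FP, FN, TN = 215, 28, 1, 607


def point_metrics(tp, fp, fn, tn):
    """Standard 2x2 diagnostic metrics as plain proportions."""
    return {
        "Sensitivity": tp / (tp + fn),
        "Specificity": tn / (tn + fp),
        "PPV": tp / (tp + fp),
        "NPV": tn / (tn + fn),
        "Accuracy": (tp + tn) / (tp + fp + fn + tn),
    }


def upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def _bisect(f, target, steps=60):
    """Solve f(p) = target on [0, 1] for a function f that increases in p."""
    lo, hi = 0.0, 1.0
    for _ in range(steps):
        mid = (lo + hi) / 2
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided interval for k successes in n trials (assumed method)."""
    lower = 0.0 if k == 0 else _bisect(lambda p: upper_tail(k, n, p), alpha / 2)
    upper = 1.0 if k == n else _bisect(lambda p: upper_tail(k + 1, n, p), 1 - alpha / 2)
    return lower, upper


metrics = point_metrics(TP, FP, FN, TN)
sens_ci = clopper_pearson(TP, TP + FN)

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
print(f"Sensitivity 95% CI: {sens_ci[0]:.3f} to {sens_ci[1]:.3f}")
```

The five point estimates round to the values in the table. The sensitivity bounds from this sketch can be compared with the reported 0.974 and 1.000 to check whether the exact method is a plausible match.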

6.2 Pathologist Performance Metrics

6.2.1 Individual Pathologist Performance (Phase 2)

6.2.1.1 Without AI

Pathologist Performance Without AI Assistance

This analysis evaluates each pathologist’s diagnostic accuracy without AI, using the research consensus diagnosis as the reference diagnosis.

Note for Pathologists: Detailed sensitivity, specificity, and accuracy metrics for each pathologist’s unaided evaluation (Phase 2) against the research consensus (Reference Diagnosis).

Pathologist Performance Without AI
Pathologist	Sensitivity	Sensitivity_Lower	Sensitivity_Upper	Specificity	Specificity_Lower	Specificity_Upper	PPV	NPV	Accuracy
P1_noAI	0.868	0.814	0.911	0.990	0.978	0.996	0.967	0.957	0.959
P2_noAI	0.922	0.876	0.955	0.992	0.981	0.997	0.974	0.974	0.974
P3_noAI	0.829	0.771	0.878	0.990	0.978	0.996	0.966	0.945	0.949
P4_noAI	0.927	0.882	0.958	0.983	0.970	0.992	0.950	0.975	0.969

6.2.1.2 With AI

Pathologist Performance With AI Assistance

This analysis evaluates each pathologist’s diagnostic accuracy with AI assistance, using the research consensus diagnosis as the reference diagnosis.

Note for Pathologists: Performance metrics for each pathologist when assisted by AI.

Pathologist Performance With AI
Pathologist	Sensitivity	Sensitivity_Lower	Sensitivity_Upper	Specificity	Specificity_Lower	Specificity_Upper	PPV	NPV	Accuracy
P1_withAI	0.946	0.906	0.973	0.995	0.986	0.999	0.985	0.982	0.983
P2_withAI	0.961	0.925	0.983	0.992	0.981	0.997	0.975	0.987	0.984
P3_withAI	0.927	0.882	0.958	0.993	0.983	0.998	0.979	0.976	0.977
P4_withAI	0.941	0.900	0.969	0.988	0.976	0.995	0.965	0.980	0.977

6.2.2 Comparison: Performance Change with AI

Impact of AI on Pathologist Performance

This analysis compares diagnostic performance metrics before and after AI assistance to quantify the improvement or change in accuracy.

Note for Pathologists: This comparison highlights the absolute change in sensitivity, specificity, and accuracy for each pathologist when using AI.

Performance Comparison: Without AI vs With AI
Pathologist	Sensitivity_Without AI	Sensitivity_With AI	Specificity_Without AI	Specificity_With AI	Accuracy_Without AI	Accuracy_With AI	Sensitivity_Change	Specificity_Change	Accuracy_Change
P1	0.868	0.946	0.990	0.995	0.959	0.983	0.078	0.005	0.024
P2	0.922	0.961	0.992	0.992	0.974	0.984	0.039	0.000	0.010
P3	0.829	0.927	0.990	0.993	0.949	0.977	0.098	0.003	0.027
P4	0.927	0.941	0.983	0.988	0.969	0.977	0.015	0.005	0.007
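The change columns are simple differences (with AI minus without AI). The Python sketch below recomputes them from the rounded values in the table and lists any entries that differ from the reported change. Three entries differ by 0.001, which is what one would expect if the report computed changes from unrounded values before rounding; that explanation is an assumption, not something the report states.

```python
# Rounded values copied from the comparison table:
# metric -> (without AI, with AI, reported change)
table = {
    "P1": {"Sensitivity": (0.868, 0.946, 0.078),
           "Specificity": (0.990, 0.995, 0.005),
           "Accuracy": (0.959, 0.983, 0.024)},
    "P2": {"Sensitivity": (0.922, 0.961, 0.039),
           "Specificity": (0.992, 0.992, 0.000),
           "Accuracy": (0.974, 0.984, 0.010)},
    "P3": {"Sensitivity": (0.829, 0.927, 0.098),
           "Specificity": (0.990, 0.993, 0.003),
           "Accuracy": (0.949, 0.977, 0.027)},
    "P4": {"Sensitivity": (0.927, 0.941, 0.015),
           "Specificity": (0.983, 0.988, 0.005),
           "Accuracy": (0.969, 0.977, 0.007)},
}


def absolute_change(without_ai, with_ai):
    """Absolute change, rounded to the table's three decimals."""
    return round(with_ai - without_ai, 3)


# Collect entries where the recomputed change differs from the reported one.
mismatches = []
for reader, reader_metrics in table.items():
    for metric, (before, after, reported) in reader_metrics.items():
        recomputed = absolute_change(before, after)
        if abs(recomputed - reported) > 1e-9:
            mismatches.append((reader, metric, recomputed, reported))

for reader, metric, recomputed, reported in mismatches:
    print(f"{reader} {metric}: recomputed {recomputed:.3f}, reported {reported:.3f}")
```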

6.3 Detailed Confusion Matrices

6.3.1 Pathologist 1

Note for Pathologists: Detailed breakdown of true/false positives and negatives for each pathologist, comparing ‘No AI’ vs ‘With AI’ scenarios.

6.3.2 Pathologist 2

6.3.3 Pathologist 3

6.3.4 Pathologist 4

6.4 Statistical Significance of Performance Changes

McNemar’s Test for Diagnostic Changes

This section tests whether the changes in diagnosis with AI assistance are statistically significant, using McNemar’s test for paired data.

6.4.1 McNemar’s Tests for Diagnostic Changes

Note for Pathologists: Statistical tests (McNemar’s) to determine whether the changes in diagnosis between the unaided and AI-assisted reads (from benign to malignant, or vice versa) are statistically significant.

6.4.1.1 P1

Contingency Table (rows: without AI; columns: with AI; 0 = benign, 1 = malignant):

Without AI \ With AI	0	1
0	607	17
1	4	180

McNemar Chi-squared: 6.857
p-value: 0.00883
Discordant pairs: 21
Changed to Malignant: 17
Changed to Benign: 4

6.4.1.2 P2

Contingency Table (rows: without AI; columns: with AI; 0 = benign, 1 = malignant):

Without AI \ With AI	0	1
0	605	10
1	2	192

McNemar Chi-squared: 4.083
p-value: 0.0433
Discordant pairs: 12
Changed to Malignant: 10
Changed to Benign: 2

6.4.1.3 P3

Contingency Table (rows: without AI; columns: with AI; 0 = benign, 1 = malignant):

Without AI \ With AI	0	1
0	612	22
1	4	172

McNemar Chi-squared: 11.115
p-value: 0.000856
Discordant pairs: 26
Changed to Malignant: 22
Changed to Benign: 4

6.4.1.4 P4

Contingency Table (rows: without AI; columns: with AI; 0 = benign, 1 = malignant):

Without AI \ With AI	0	1
0	599	10
1	10	190

McNemar Chi-squared: 0
p-value: 1
Discordant pairs: 20
Changed to Malignant: 10
Changed to Benign: 10
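The statistics reported for P1 to P4 depend only on the two discordant counts (changed to malignant, changed to benign). The Python sketch below reproduces them with the continuity-corrected McNemar statistic, where the correction is skipped when the two counts are equal. The reported values are consistent with this variant, but the report does not name the software or settings used, so treat the choice as an assumption. The function name is illustrative.

```python
from math import erfc, sqrt


def mcnemar(to_malignant, to_benign, correct=True):
    """McNemar chi-squared statistic and p-value from the discordant counts.

    With correct=True a continuity correction of 1 is subtracted from
    |b - c|, except when b == c (then the statistic is 0).
    """
    n_discordant = to_malignant + to_benign
    if n_discordant == 0:
        raise ValueError("McNemar's test is undefined without discordant pairs")
    diff = abs(to_malignant - to_benign)
    if correct and diff > 0:
        diff -= 1
    statistic = diff**2 / n_discordant
    # Survival function of a chi-squared variable with 1 degree of freedom.
    p_value = erfc(sqrt(statistic / 2))
    return statistic, p_value


# Discordant counts copied from the contingency tables above:
# (changed to malignant, changed to benign)
discordant = {"P1": (17, 4), "P2": (10, 2), "P3": (22, 4), "P4": (10, 10)}

results = {reader: mcnemar(b, c) for reader, (b, c) in discordant.items()}
for reader, (statistic, p_value) in results.items():
    print(f"{reader}: chi-squared = {statistic:.3f}, p = {p_value:.3g}")
```

For P1 to P3 the changes toward malignant outnumber the changes toward benign, and the test is significant at the 0.05 level. For P4 the two counts are equal, so the statistic is 0 and the p-value is 1.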

6.5 Accuracy by Case Difficulty

This section analyzes whether the difficulty of a case (defined here by initial disagreement) affects diagnostic accuracy, and whether AI provides more benefit in difficult cases.

6.5.1 Accuracy Rates by Difficulty

We define “Accuracy” as agreement with the Reference Diagnosis (Research Diagnosis).

Note for Pathologists: Accuracy rates stratified by the difficulty of the cases (as defined by initial disagreement).

6.5.2 Visualization of AI Benefit

Does the improvement (With AI - No AI) depend on difficulty?

Note for Pathologists: Visualizing how accuracy varies with case difficulty and AI assistance.
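As a minimal sketch of the stratified calculation, the Python example below computes accuracy per difficulty level for both reading conditions, plus the benefit (With AI - No AI). The records are invented toy data used only to show the arithmetic; the field layout, the difficulty labels, and the function name are assumptions and do not come from the study.

```python
from collections import defaultdict

# Hypothetical toy records, NOT study data:
# (difficulty, reference diagnosis, call without AI, call with AI); 1 = malignant.
toy_records = [
    ("easy", 1, 1, 1),
    ("easy", 0, 0, 0),
    ("easy", 1, 1, 1),
    ("difficult", 1, 0, 1),
    ("difficult", 0, 0, 0),
    ("difficult", 1, 0, 0),
    ("difficult", 0, 1, 0),
]


def accuracy_by_difficulty(records):
    """Accuracy (agreement with the reference) per difficulty level."""
    # Per level: [number of cases, correct without AI, correct with AI]
    counts = defaultdict(lambda: [0, 0, 0])
    for difficulty, reference, no_ai, with_ai in records:
        entry = counts[difficulty]
        entry[0] += 1
        entry[1] += int(no_ai == reference)
        entry[2] += int(with_ai == reference)
    return {
        level: {
            "n": n,
            "no_ai": correct_no_ai / n,
            "with_ai": correct_with_ai / n,
            "benefit": (correct_with_ai - correct_no_ai) / n,
        }
        for level, (n, correct_no_ai, correct_with_ai) in counts.items()
    }


summary = accuracy_by_difficulty(toy_records)
for level, row in summary.items():
    print(level, row)
```

In this toy example the benefit is 0 for easy cases and 0.5 for difficult cases, which illustrates the kind of pattern the visualization is meant to reveal.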