2 Introduction:

Prostate cancer represents a substantial global health burden, ranking as the second most commonly diagnosed cancer and the fifth leading cause of cancer death among men worldwide. In 2022, an estimated 1.5 million new cases and nearly 400,000 deaths were attributed to prostate cancer globally (World Health Organization (2024)). This profound public health impact underscores the critical importance of accurate and efficient diagnostic pathways.

The integration of artificial intelligence (AI) into diagnostics has emerged as a transformative advancement in digital pathology. Prostate cancer is an ideal target for AI algorithms because of its high incidence, the large volume of routine biopsies, and the reliance on Gleason grading for prognosis. While Gleason grading is a long-standing metric, it is a subjective assessment with well-documented challenges in inter-observer reproducibility. Furthermore, concordance between biopsy and radical prostatectomy grading is often suboptimal; Epstein et al. (2012) reported that 36.3% of malignancies graded as Gleason score 5-6 on biopsy were upgraded at radical prostatectomy, highlighting the inherent limitations of needle biopsy sampling. Numerous studies have reported moderate to low agreement among pathologists, particularly for intermediate-risk groups (Allsbrook Jr et al. (2001), Singh et al. (2011), Melia et al. (2006), Nakai, Tanaka, Anai, et al. (2015), Salmo (2015), Ozkan et al. (2016)). Validating these concerns, a recent nationwide study in the Netherlands found that the proportion of ISUP Grade 1 diagnoses varied dramatically from 19.7% to 44.3% across laboratories, with significant inter-pathologist variation in 71% of centers, highlighting the systemic nature of this inconsistency (Flach et al. (2021)). This variability can lead to inconsistent risk stratification and treatment planning.

To address these limitations, AI has emerged as a powerful tool to augment pathologic diagnosis. A significant milestone in this field is the FDA authorization of Paige Prostate, the first AI-based software to assist pathologists in the detection of prostate cancer (U.S. Food and Drug Administration (2021)). The system was trained on a massive dataset of whole-slide images without pixel-level annotations (Campanella et al. (2019)) and has demonstrated clinical utility.

The diagnosis of prostate cancer relies on core needle biopsies (CNB), with current guidelines recommending extended-sampling protocols of 10-12 cores to optimize cancer detection while minimizing patient burden. While multiparametric MRI (mpMRI) has enhanced diagnostic precision through targeted biopsies, systematic sampling remains essential, as studies show that 6-10% of clinically significant cancers may be detected only outside MRI-identified regions. Histopathological parameters such as the number of positive core biopsies, Gleason grade, and tumor extent are critical for accurate risk stratification and determining the appropriate disease management strategy (Mottet et al. (2025)). Current guidelines further emphasize the reporting of the percentage of Gleason pattern 4 in Gleason score 7 tumors to refine prognostic stratification (Mottet et al. (2025)). AI algorithms can assist pathologists by detecting all tumor foci and accurately assessing these parameters, thereby minimizing inter-observer variability and ensuring more consistent assessments (Kartasalo et al. (2021)).

Recent studies highlight the benefits of AI in prostate cancer, including enhanced detection, accurate quantification, consistent grading, improved inter-observer agreement (Silva et al. (2021), Bulten et al. (2022), Eloy et al. (2023), Steiner et al. (2020)), and time efficiency (Silva et al. (2021), Eloy et al. (2023)). However, the generalizability of these AI models across different pathology laboratories and scanning conditions remains a key area of investigation. Over the past two years, we have integrated digital pathology into routine diagnostics and evaluated image analysis solutions. This study aims to investigate the contribution of the Paige Prostate AI application to the diagnosis of prostate cancer and inter-observer agreement in a real-world clinical setting.

3 Materials and Methods:

3.0.1 Preparation of images:

Eight hundred thirty-six prostate CNBs with whole slide images from 60 consecutive cases in the Memorial Hospitals Group Pathology Department archive were included. Hematoxylin and eosin (H&E) stained slides were scanned on a Leica Aperio AT2 scanner at 20x or 40x magnification. To simulate a routine workflow, the cases were anonymized with the svs-deidentifier program (Pearce, T. (2020). svs-deidentifier (Version 0.9.1) [Computer software]. https://github.com/pearcetm/svs-deidentifier) and imported into the Paige system as virtual patients.

3.0.2 Study design:

The study was conducted in two phases. In phase I, the diagnoses in the report were retrieved by an expert pathologist (F.A.) and compared with the AI analysis output. The design of phase I is summarized in Figure 1.

Figure 1: Phase I, flowchart detailing the cases and distribution of the categorization.

In phase II, four pathologists, blinded to the original diagnosis and scores, initially reviewed the cases without AI assistance. After a two-week washout period, the AI modules were activated on the Paige interface, and the pathologists re-evaluated the cases with AI assistance.

Pathologists independently completed a standardized Google Sheet while blinded to each other’s diagnoses, IHC results, and previous reports. For each core, they clicked a cell to record the starting timestamp and then recorded tumor presence and tumor characteristics as they would in routine workflow; when the assessment was finished, they clicked another cell to record the finishing time. In the second assessment, they were not given access to their previous sheets and used the AI interface during evaluation. They were asked to rate the necessity and helpfulness of AI and to record any comments. They were required to assign each core to one of three diagnostic categories: “Benign”, “Suspicious (requires IHC/consultation)”, or “Malignant”.
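
As an illustration of how per-core evaluation times can be derived from the two timestamp cells, the following Python sketch computes each core’s duration and the median per pathologist. The file name and column names (“pathologist”, “core_id”, “start_time”, “finish_time”) are hypothetical placeholders, not the actual sheet headers; the real analysis workflow may differ.

```python
import pandas as pd

# Sketch: derive per-core evaluation time from the start/finish timestamp
# cells recorded in the study sheet. File and column names are hypothetical.
reads = pd.read_csv("phase2_reads.csv", parse_dates=["start_time", "finish_time"])
reads["duration_s"] = (reads["finish_time"] - reads["start_time"]).dt.total_seconds()

# Median evaluation time per pathologist (e.g., for the unassisted read)
print(reads.groupby("pathologist")["duration_s"].median())
```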

The results were analyzed for changes in diagnosis and level of agreement before and after the use of AI, the evaluation time for each core, and the pathologists’ subjective interpretations. Requests for immunohistochemistry and for consultation were also compared, as were the AI results.

3.0.3 Statistical analysis:

Diagnoses and agreements were presented as contingency tables with proportions. Diagnoses were categorized as benign, suspicious, and malignant, and later regrouped as benign versus others. For phase I, interrater reliability between the report diagnosis and the AI diagnosis was assessed using Cohen’s kappa and the agreement percentage. After the cases were re-evaluated with immunohistochemistry, a reference diagnosis was defined for each core. Using this reference diagnosis, the diagnostic test statistics (sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV)) were calculated for the AI model. Both interrater reliability and diagnostic test statistics were calculated separately at the case and core levels.
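
A minimal sketch of the core-based Phase I comparisons is shown below. The actual analyses were performed in jamovi and R; the Python snippet is illustrative only, and the file name and column names (“report”, “ai”, “reference”) are hypothetical.

```python
import pandas as pd
from sklearn.metrics import cohen_kappa_score, confusion_matrix

# Core-based Phase I comparisons; file and column names are hypothetical.
cores = pd.read_csv("phase1_cores.csv")

# Interrater reliability between report and AI (three categories)
kappa = cohen_kappa_score(cores["report"], cores["ai"])
agreement = (cores["report"] == cores["ai"]).mean()

# Diagnostic test statistics against the reference diagnosis, after
# regrouping as benign vs. others (positive class = "not benign")
y_true = (cores["reference"] != "benign").astype(int)
y_pred = (cores["ai"] != "benign").astype(int)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
ppv = tp / (tp + fp)
npv = tn / (tn + fn)
print(kappa, agreement, sensitivity, specificity, ppv, npv)
```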

For phase II, interrater reliability among pathologists with and without AI was assessed using Fleiss’ kappa and the agreement percentage, again at both the case and core levels. The diagnosis time for each core with and without AI was compared using the Wilcoxon signed-rank test, and the diagnosis times for concordant and discordant cores were evaluated separately. Pathologists’ opinions and comments on the use of AI were recorded for each core, and diagnosis times were assessed for these cores. Tumor percentage estimates were categorized in 10% increments per CAP guidelines, and the ratios were compared using Pearson correlation. Cores whose diagnosis changed with AI use were noted and evaluated separately. Agreement on diagnosis and Gleason grade group with and without AI was evaluated for each core using Fleiss’ kappa, and changes in agreement percentage with AI use were recorded.
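
For illustration, the Phase II statistics described above (Fleiss’ kappa, the Wilcoxon signed-rank test, and Pearson correlation) can be computed as in the following Python sketch. The toy values are placeholders; the actual computations were performed in jamovi and R.

```python
import numpy as np
from scipy.stats import wilcoxon, pearsonr
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Toy ratings: rows = cores, columns = the four pathologists;
# 0 = benign, 1 = suspicious, 2 = malignant.
ratings_no_ai = np.array([[0, 0, 1, 0], [2, 2, 2, 1], [0, 1, 0, 0], [2, 2, 2, 2]])
ratings_ai    = np.array([[0, 0, 0, 0], [2, 2, 2, 2], [0, 0, 0, 0], [2, 2, 2, 2]])

def fleiss(ratings):
    table, _ = aggregate_raters(ratings)   # per-core counts of each category
    return fleiss_kappa(table)

print(fleiss(ratings_no_ai), fleiss(ratings_ai))

# Paired per-core diagnosis times (seconds) without and with AI
time_no_ai = np.array([30, 42, 28, 55, 33, 40])
time_ai    = np.array([25, 38, 30, 41, 31, 35])
print(wilcoxon(time_no_ai, time_ai))

# Correlation between AI-derived and pathologist-estimated tumor percentages
print(pearsonr([10, 30, 50, 80], [15, 30, 45, 70]))
```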

For all analyses, p < 0.05 was considered significant. Analyses were performed with jamovi (The jamovi project (2024), Version 2.6 [Computer software]; https://www.jamovi.org) and R (R Core Team (2024), Version 4.4 [Computer software]; https://cran.r-project.org), with R packages retrieved from the CRAN snapshot of 2024-08-07.

4 Results:

Cases had a median of 13.5 blocks (range: 8-22). The Phase I analytical cohort comprised 829 biopsy cores: 851 slides were uploaded, and 22 were excluded per the canonical exclude-list (19 duplicate rescans, 2 accidentally uploaded IHC-stained slides, and 1 slide on which the Paige website did not run). After excluding cores that could not be re-scored in the second-reader phase, 810 cores were re-read by all four pathologists with and without AI assistance in Phase II; of these, 138 had a parseable Gleason score from every interpreter and constitute the inter-rater complete-cases subset used for the Fleiss kappa.

4.1 Phase I:

The specific diagnoses of the cores were compared with the AI results. Within the 829-core Phase I analytical cohort, AI and the original pathology report agreed on 800 of 829 cores (96.5%; Cohen kappa = 0.909, p < .001). The 29 discordant cores comprised 1 core that AI labelled benign while the report labelled it malignant (Dx_Research = adenocarcinoma) and 28 cores that AI flagged as suspicious for adenocarcinoma while the report labelled them benign. After expert re-evaluation with IHC, the reference diagnoses of these 28 cores were 18 benign, 1 ASAP, and 9 adenocarcinoma (minute foci with low-grade Gleason patterns; additional tumor foci were present in other cores of the same cases).

When evaluated on a case basis with final IHC confirmation, the agreement between AI and the final diagnosis increased to 97.6% (kappa = 0.94, p < 0.0001). In this analysis, AI demonstrated overdiagnosis in 4 cases. The AI achieved a sensitivity of 99.6% and specificity of 97.0%, with positive predictive values (PPV) of 92.1% and 90.0% and negative predictive values (NPV) of 99.8% and 100% at the core-based and case-based analysis levels, respectively. For further comparisons, these IHC-confirmed results were used as the reference diagnosis.
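
For reference, these metrics follow the standard definitions derived from the 2x2 classification of the AI output against the reference diagnosis: sensitivity = TP / (TP + FN), specificity = TN / (TN + FP), PPV = TP / (TP + FP), and NPV = TN / (TN + FN), where TP, FP, TN, and FN denote true-positive, false-positive, true-negative, and false-negative cores (or cases), respectively.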

4.2 Phase II:

4.2.1 The effect of AI use on interobserver agreement:

The results show that agreement among pathologists increases with AI use. Agreement among pathologists, both with and without AI assistance, was calculated as follows:

The interobserver agreement percentage among pathologists was 73% (Kappa = 0.69, p<0.001) without AI assistance when the diagnostic categories were ‘Malignant,’ ‘Benign,’ and ‘Suspicious’. The agreement percentage increased to 91% with the integration of AI (Kappa = 0.88, p<0.001).

When diagnostic categories were further grouped as “Benign versus Others”, interobserver agreement without AI was 76% (kappa = 0.70, p < 0.001), which increased to 93% with AI (kappa = 0.90, p < 0.001).

AI assistance led to changes in diagnoses among pathologists. For Pathologist 1, benign diagnoses increased from 60% to 70%, malignant diagnoses increased from 23% to 24%, and suspicious diagnoses decreased from 17% to 6%. For Pathologist 2, benign diagnoses increased from 71% to 73%, malignant diagnoses increased from 24% to 25%, and suspicious diagnoses decreased from 5% to 2%. For Pathologist 3, benign diagnoses increased from 73% to 74%, malignant diagnoses increased from 22% to 24%, and suspicious diagnoses decreased from 5% to 2%. For Pathologist 4, benign diagnoses increased from 65% to 73%, malignant diagnoses remained stable at 25%, and suspicious diagnoses decreased from 10% to 2%.

Overall, across all pathologists, benign diagnoses increased from 67% to 72%, malignant diagnoses increased from 23% to 25%, and suspicious diagnoses decreased from 9.4% to 3.3%. These findings suggest that AI assistance reduces diagnostic uncertainty among pathologists, shifting equivocal “suspicious” interpretations toward definitive benign or malignant diagnoses.
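
A sketch of how these per-pathologist category distributions can be tabulated from the study sheets is shown below (Python; the file and column names “pathologist”, “phase”, and “diagnosis” are hypothetical placeholders).

```python
import pandas as pd

# Proportion of benign / suspicious / malignant calls per pathologist,
# separately for the unassisted and AI-assisted reads.
reads = pd.read_csv("phase2_reads.csv")
shift = (
    reads.groupby(["pathologist", "phase"])["diagnosis"]
         .value_counts(normalize=True)
         .unstack(fill_value=0)
         .round(2)
)
print(shift)
```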

4.2.2 The effect of AI use on the evaluation time:

The median diagnosis time per core changed significantly with AI use; however, the direction of change was heterogeneous among pathologists. For P1 and P4, the duration increased with AI use (30 vs. 35 seconds, p<0.001, and 34 vs. 35 seconds, p<0.001, respectively). For P2 and P3, the duration decreased with AI use (40 vs. 35 seconds, p=0.001, and 36 vs. 28 seconds, p<0.001, respectively).

5 Discussion

Paige Prostate is a machine learning algorithm developed using the digital pathology slide archive of the Memorial Sloan Kettering Cancer Center (MSKCC). It classifies whole slide images (WSIs) of prostate core biopsies as ‘suspicious’ for prostate adenocarcinoma when histological features of adenocarcinoma or glandular atypia are identified, such as focal glandular atypia (FGA), high-grade prostatic intraepithelial neoplasia with adjacent atypical glands (PIN-ATYP), or atypical small acinar proliferation (ASAP). In the absence of these lesions, the algorithm categorizes the slide as ‘not suspicious’ for prostate adenocarcinoma (Raciti et al. (2020)). In parallel, the grading and quantification modules estimate the Gleason score, including primary and secondary Gleason patterns, as well as the cancer length and percentage in each CNB (Raciti et al. (2020)). In recent years, multiple studies have demonstrated that Paige Prostate functions effectively as a prescreening tool and a reliable second reader (2), significantly reducing time to diagnosis (by ~13% on average) and enhancing diagnostic accuracy, specifically increasing pathologist sensitivity from 74% to 90% while maintaining high specificity (Raciti et al. (2020), 3).

In the study conducted by Silva et al. (4), 600 core biopsies taken from 100 patients were evaluated. That study reported high sensitivity (0.99; CI 0.96–1.0), NPV (1.0; CI 0.98–1.0), and specificity (0.93; CI 0.90–0.96) at the sample level when Paige Prostate was used to evaluate the slides. In another study utilizing the Paige Prostate algorithm, 465 of 475 cases classified as ‘suspected cancer’ and 1,371 of 1,382 cases classified as ‘unsuspected cancer’ were concordant with the reference diagnosis, corresponding to a PPV of 97.9%, an NPV of 99.2%, a sensitivity of 97.7%, and a specificity of 99.3% (5). In the present study, 829 prostate core biopsy specimens were evaluated (Phase I analytical cohort: 851 slides uploaded, 22 excluded per the canonical exclude-list for duplicate rescans, accidentally uploaded IHC-stained slides, and slides on which the Paige website did not run). The Paige Prostate-assisted assessment demonstrated a positive predictive value (PPV) of 92.1% at the core level and 90.0% at the case level, while the negative predictive value (NPV) was 99.8% and 100% for core-based and case-based analyses, respectively.

In the second phase of our study, which assessed inter-observer agreement, a significant improvement in concordance among observers was observed when the Paige Prostate algorithm supported the pathologists’ evaluations. To assess inter-observer agreement more precisely, cores were categorized into three groups: complete agreement (all four observers concurred), majority agreement (three of four observers concurred), and no agreement. Inter-observer agreement was complete in 80% of cores (617 cores), and this level of agreement remained unchanged following AI-assisted assessment. With the use of Paige Prostate, 126 cores (16.3%) initially classified as majority agreement were reclassified as complete agreement, and 10 cores previously demonstrating no agreement were upgraded to majority agreement. Conversely, in two cases initially showing complete agreement, a regression to no agreement was observed. When the last 2 cases are evaluated ……
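
The categorization into complete, majority, and no agreement follows directly from the per-core counts of the four diagnoses; a minimal sketch (Python, with toy diagnoses only) is shown below.

```python
import numpy as np

# Reduce the four diagnoses for each core to the agreement categories used
# above: "complete" (4/4 concur), "majority" (3/4 concur), otherwise "none".
def agreement_level(row):
    counts = np.unique(row, return_counts=True)[1]
    top = counts.max()
    if top == 4:
        return "complete"
    if top == 3:
        return "majority"
    return "none"

# Toy diagnoses (B = benign, S = suspicious, M = malignant)
ratings = np.array([["B", "B", "B", "B"],
                    ["M", "M", "M", "S"],
                    ["B", "M", "S", "B"]])
print([agreement_level(r) for r in ratings])   # ['complete', 'majority', 'none']
```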

  • In the four cores that the expert pathologist re-diagnosed as tumor, other cores of the same case also contained tumor; the original reviewer may have missed these foci because the diagnosis was already established on other cores. This remains a debatable point.
  • The effect of AI on IHC ordering.
  • The effect of AI on requests for second opinions.
  • Whether sensitivity differs among pathologists who took longer to reach a diagnosis.
  • Whether lack of familiarity with the Paige platform (versus Sectra) had an impact on evaluation time; in other words, whether using two separate platforms erodes efficiency.


  1. The power of AI to make accurate diagnoses:
    Considering only the “benign” versus “malignant” categories and accepting the reference diagnosis as the final diagnosis, did concordance with the reference diagnosis increase for diagnoses rendered with AI assistance?

- Is AI helpful in suspicious cases?
It was observed that the majority of the ‘suspicious’ cases (suspicious, IHC requested, consultation requested) were diagnosed as benign by the pathologists after AI evaluation, and a small portion were diagnosed as malignant. This may indicate that the pathologists’ assessment sensitivity is higher than that of the AI (statistics could be provided).

- No significant overdiagnosis was observed in the AI tumor evaluation (high specificity? statistics could be provided).

- AI can miss tumors, especially when tissue processing quality is poor.

Diagnostic Agreement and Accuracy

Our study demonstrates that AI assistance significantly enhances both the diagnostic accuracy and the inter-observer agreement of pathologists evaluating prostate biopsies. The improvement in inter-observer agreement from a Fleiss kappa of 0.69 (substantial agreement) to 0.88 (almost perfect agreement) underscores the utility of the Paige Prostate algorithm as a robust diagnostic aid. These findings align with recent multi-institutional studies, such as those by Perincheri et al. (2021), Pantanowitz et al. (2020), and Jung et al. (2022), which reported similar gains in sensitivity and specificity. Consistent with these observations, the PANDA challenge results demonstrated that deep learning algorithms achieved a quadratically weighted kappa of 0.876, significantly outperforming general pathologists (kappa 0.765) and matching uropathologist experts (Bulten et al. (2022)). Marrón-Esquivel et al. (2023) further corroborated these findings, reporting that their deep learning models achieved a kappa of 0.826, surpassing the inter-pathologist agreement of 0.695 observed in the same cohort (Marrón-Esquivel et al. (2023)). This capability is particularly valuable in settings where expert uropathologist consultation is not readily available, as AI can effectively “democratize” expert-level grading accuracy.
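
For context, the quadratically weighted kappa used as the PANDA metric can be computed between reference and predicted ISUP grade groups as in the following sketch (Python, with hypothetical toy values only).

```python
from sklearn.metrics import cohen_kappa_score

# Quadratically weighted kappa between reference and predicted ISUP grade
# groups (0 = benign, 1-5 = grade groups 1-5); toy values for illustration.
reference = [0, 1, 2, 3, 5, 4, 1]
predicted = [0, 1, 2, 2, 5, 4, 2]
print(cohen_kappa_score(reference, predicted, weights="quadratic"))
```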

Inter-observer variability in Gleason grading is a well-documented challenge in urologic pathology. Previous studies have consistently reported moderate to substantial variability; for instance, Allsbrook et al. (2001) reported a kappa of ~0.43 among general pathologists (Allsbrook Jr et al. (2001)), and Singh et al. (2011) found an overall agreement of only 68% (Singh et al. (2011)). Similarly, Melia et al. (2006) investigated reproducibility among UK uropathologists and reported an overall kappa of 0.54, with notably lower agreement (kappa 0.33) for Gleason score 2-4, distinguishing these from Gleason score 6 (Melia et al. (2006)). Ozkan et al. (2016) likewise observed moderate concordance (kappa ~0.43) for Gleason scores (Ozkan et al. (2016)). By providing a standardized “second opinion,” the AI system in our cohort helped bridge this gap, ensuring greater consistency across participating pathologists. This is increasingly relevant as guidelines now recognize the potential of AI to standardize grading and assist in quantifying risk-stratifying features like percentage Gleason 4 (Mottet et al. (2025)). This effect mirrors findings by Bulten et al. (2020) and Steiner et al. (2020), who observed that AI assistance improved agreement with subspecialists, particularly in identifying Grade Group 1 biopsies.

5.0.1 Perineural Invasion (PNI) Detection

Accurate detection of perineural invasion (PNI) is critical as it serves as an independent predictor of extraprostatic extension. In our cohort, the AI system demonstrated high sensitivity for PNI, flagging potential invasion in approximately 16% of cores, compared to a detection rate of ~4% by pathologists during routine review. This higher “flagging rate” highlights the AI’s utility as a screening tool to ensure no subtle PNI is overlooked. However, it also introduces a need for careful verification, as the system may flag nerve-adjacent tumor cells that do not strictly meet the criteria for invasion. Inter-observer agreement for PNI improved modestly with AI assistance (Fleiss’ Kappa increasing from 0.62 to 0.66), suggesting that while AI draws attention to the finding, the final determination remains a subjective decision requiring pathologist expertise.

5.0.2 Discordance and AI “Rescue”

A key finding of our study is the “rescue effect” of AI in reclassifying cases initially misdiagnosed or deemed indeterminate. The AI system successfully corrected diagnostic errors in a substantial number of cases, particularly in “Moderate” difficulty cases where subjective uncertainty is highest. By offering a confident binary classification, the AI encouraged pathologists to re-evaluate subtle features they might have otherwise dismissed or classified as “atypical,” thereby reducing the rate of ambiguous diagnoses.

5.0.3 Impact on Efficiency and Workflow

The impact of AI on diagnosis time was heterogeneous in our study. While Eloy et al. (2023) reported a ~20% reduction in median reporting time and Baidoshvili et al. (2021) observed a 68% time gain in a fully digital workflow (Eloy et al. (2023), Baidoshvili et al. (2021)), others have noted potential efficiency challenges in digital adoption; for instance, Hanna et al. (2019) reported a ~19% decrease in efficiency during the initial validation of full digital signout (Hanna et al. (2019)). Our participating pathologists noted that time savings were most pronounced in benign cases. Conversely, malignant cases often required equal or increased evaluation time, as pathologists spent additional moments verifying the AI’s tumor markup and grading against their own assessment. This “verification overhead” suggests that the “slowing down” effect may be a transient phase of adopting the new workflow, likely to diminish as user confidence grows and the learning curve flattens. Furthermore, our study utilized a separate AI interface alongside the routine viewing system, necessitating navigation between two platforms. Deeper integration options, such as embedding AI results directly into the primary PACS viewer via API modules, could eliminate this switching cost and further streamline the diagnostic process (Paige.AI (2022)).

Beyond time efficiency, a critical advantage was the reduction in diagnostic uncertainty. In our cohort, the proportion of “suspicious” (ASAP) diagnoses decreased significantly from 9.4% to 3.3% with AI assistance, as pathologists felt more confident in classifying lesions as definitive benign or malignant. This mirrors findings by Eloy et al., who reported a 30% reduction in ASAP diagnoses, a 20% drop in IHC requests, and a 40% decrease in requests for second opinions when AI was used (Eloy et al. (2023)). By highlighting minute foci or confidently ruling out cancer in questionable areas, the AI system acts as a “safety net” that optimizes laboratory resource utilization and potentially reduces the need for ancillary testing.

5.0.4 Challenges and Automation Bias

The interaction between pathologists and AI tools introduces the risk of “automation bias,” where clinicians may over-rely on AI predictions or change their correct diagnoses to match an incorrect AI result. Evans and Snead (2023) highlight that while AI can improve efficiency, it can also lead to errors if pathologists treat AI output as infallible (Evans and Snead (2023)). In our study, while AI successfully identified minute cancers missed by initial review, we also observed instances where artifacts (e.g., blurred images, tissue folds) led to discordant AI classifications. This underscores the “unsafe failure mode” described by Evans and Snead, in which AI continues to make predictions on poor-quality data instead of flagging it. Kartasalo et al. (2021) emphasize that reliable clinical implementation requires AI systems capable of anomaly detection, flagging cases that fall outside the model’s training distribution (e.g., rare benign mimics or artifacts) for human review rather than forcing a potentially erroneous classification (Kartasalo et al. (2021)).

5.0.5 Beyond Diagnosis: Prognostic Potential

While this study focused on the diagnostic accuracy and efficiency of AI in detecting prostate cancer, the potential of AI extends significantly into prognostic stratification and personalized treatment planning. Esteva et al. (2022) and Spratt et al. (2023) demonstrated this by developing multimodal AI models using digital pathology images to predict long-term outcomes and identify patients who would benefit from androgen deprivation therapy (ADT) (Esteva et al. (2022), Spratt et al. (2023)). This suggests that the integration of AI tools like Paige Prostate not only streamlines the diagnostic workflow but may effectively pave the way for AI-driven precision medicine, where pathology slides provide deep insights into treatment response beyond standard Gleason grading.

5.0.6 Limitations and Barriers to AI Adoption

Despite the promising diagnostic capabilities demonstrated in this and other studies, the widespread clinical adoption of AI in prostate pathology faces significant challenges that must be acknowledged. A primary technical challenge is the requirement for large, high-quality, and diverse datasets for robust training and validation. AI models must perform reliably across different laboratories, whole-slide scanners, staining protocols, and varied patient populations to ensure generalizability. Technical inconsistencies in whole-slide image production can degrade AI performance, and bias in training data can lead to poor generalizability in real-world settings. However, recent “in the wild” evaluations offer encouraging evidence for robustness. Faryna et al. (2024) demonstrated that commercial algorithms like Paige Prostate achieved high agreement (QWK 0.860) with pathologists on a diverse, crowdsourced dataset comprising images from multiple scanners and institutions, performing on par with or better than academic models optimized for specific benchmark tasks (Faryna et al. (2024)).

The morphological complexity of the prostate gland, with its plethora of benign mimics and unusual morphological variants, poses challenges for AI algorithms to accurately differentiate subtle or atypical findings that often require nuanced human expert judgment. Furthermore, widespread clinical adoption necessitates substantial investment in digital pathology infrastructure, including high-throughput scanners, robust storage solutions, and integrated laboratory information systems. The absence of standardized WSI file formats across vendors further complicates interoperability and data sharing.

A critical psychological barrier is the “black box” nature of many AI decision-making processes, which can create a lack of interpretability and foster distrust among clinicians. There is a recognized risk that users may over-rely on AI output, potentially leading to diagnostic errors or a decline in clinical competence if critical thinking is superseded by algorithmic suggestions. However, AI can also serve as a valuable educational tool. Nakai et al. (2015) demonstrated that continuous feedback from expert urological pathologists significantly improved general pathologists’ concordance rates from 47.5% to 78.7% over time, but noted that such expert review is resource-intensive (Nakai, Tanaka, Shimada, et al. (2015)). AI offers a scalable alternative for providing this consistent, expert-level feedback during routine sign-out.

Beyond education, AI facilitates comprehensive quality assurance. In traditional practice, QA is often limited to random retrospective reviews of a small percentage of cases. Janowczyk et al. (2020) highlight that AI deployment enables “computational second reads” on 100% of the caseload with minimal additional effort, ensuring that no significant malignancies are missed due to human fatigue or oversight (Janowczyk, Leo, and Rubin (2020)). This shifts the paradigm towards “Augmented Human Intelligence,” where the pathologist’s final diagnosis is fortified by an exhaustive computational safety net.

To further address interpretability concerns, recent advances in explainable AI (XAI) offer promising solutions. Mittmann et al. (2025) developed GleasonXAI, an inherently explainable AI system that uses pathologist-defined terminology and provides interpretable outputs aligned with ISUP guidelines, achieving comparable performance to conventional approaches (Dice score 0.713 vs 0.691) while offering transparent decision-making (Mittmann et al. (2025)). Such explainable AI approaches may increase pathologist trust and confidence while maintaining diagnostic accuracy. Nevertheless, it remains paramount that the reporting pathologist ultimately retains responsibility for accepting or rejecting the diagnosis proposed by AI. Additionally, streamlined regulatory approval processes and robust standardization are essential, along with ethical considerations including data privacy, informed consent, addressing algorithmic biases, and establishing clear accountability for errors.

While AI can lead to cost savings in some areas (e.g., reduced IHC use, as demonstrated in the CONFIDENT P trial with estimated savings of €1,700), the initial implementation and ongoing costs of digital pathology infrastructure and AI licenses can be substantial. These multifaceted challenges underscore that successful AI integration requires a holistic approach addressing technical, operational, ethical, and psychological barriers.

5.0.7 Limitations

The study has limitations, primarily related to its retrospective design and the use of a single-center cohort. Additionally, the two-week washout period between the unassisted and AI-assisted reads was necessary but introduces potential recall bias, although the randomized reading order attempted to mitigate this. The reliance on a specific scanner and staining protocol may also affect generalizability, as AI performance can vary across different laboratory preparations. Regarding tumor quantification, while AI excelled at detection, discrepancies were noted in tumor burden quantification (tumor length and percentage): the algorithm’s segmentation occasionally included non-tumor tissue or excluded sparse tumor glands, leading to differences between AI-calculated and pathologist-estimated values.

6 Conclusion

Paige Prostate was found to be helpful for the interpretation of prostate core needle biopsies. However, tissue processing and scanning artifacts can cause errors; therefore, images should be checked for quality before AI application. The discrepant cores had no impact on patient management, as they represented small foci and other cores of the same cases also contained tumor. The model can be helpful in routine diagnostic practice in appropriate settings.

6.1 Acknowledgment

Part of this study was presented as a poster at the European Congress of Digital Pathology (ECDP 2023: “A retrospective evaluation of an artificial intelligence solution for prostate biopsies”).

The authors would like to thank Dr. Juan Retamero from Paige for his support in providing access to the model, and Yucehan Dogan from Apaz Medikal for providing scripts for batch export of whole slide images from the Sectra PACS system.