Identifying High-Risk Cancer Patients on Breast Cancer Pathology Reports with Large Language Models

9 pages • Published: April 19, 2026

Abstract

Breast cancer subtypes defined by estrogen receptor (ER), progesterone receptor (PR), and human epidermal growth factor receptor 2 (HER2) status guide treatment decisions, yet manual extraction of these biomarkers from pathology reports is time-consuming and error-prone. We present an end-to-end NLP pipeline that automates high-risk subtype identification (HER2-positive and triple-negative) from digital core-biopsy reports. A corpus of 2,722 reports (2,401 non-synoptic, 321 synoptic) was annotated in Doccano, yielding 16,706 question–answer pairs. Reports were pre-processed and then split using a multi-stratified sampling approach into training (59%), validation (17%), and held-out test (24%) sets. We fine-tuned BioMedBERT on SQuAD 2.0 and then on our domain-specific dataset, employing hyperparameter optimization and prediction post-processing. On the held-out test data, our model achieved 99.79% accuracy on synoptic reports and 98.83% on non-synoptic reports, outperforming human annotators and maintaining robust performance across report formats and biomarker classes. By automatically flagging eligible patients for neoadjuvant chemotherapy triage, this pipeline has the potential to streamline clinical workflows, reduce treatment delays, and improve outcomes for high-risk breast cancer patients.

Keyphrases: clinical documents, large language models, natural language processing, triple negative breast cancers

In: Jernej Masnec, Hamid Reza Karimian, Parisa Kordjamshidi and Yan Li (editors). Proceedings of AI for Accelerated Research Symposium, vol 3, pages 27-35.
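
As an illustration of the high-risk definition used in the abstract (HER2-positive or triple-negative), the final flagging step could be sketched as below. This is a minimal sketch and not the authors' implementation: the `Biomarkers` type and the `high_risk_subtype` function are hypothetical names, and the sketch assumes each biomarker has already been extracted and resolved to a clear positive or negative status, so equivocal or missing results are not handled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Biomarkers:
    """Hypothetical container for already-extracted receptor statuses."""
    er_positive: bool
    pr_positive: bool
    her2_positive: bool


def high_risk_subtype(b: Biomarkers) -> Optional[str]:
    """Return the high-risk subtype label, or None if the case is not high-risk.

    Assumption: every status is a definite positive/negative call.
    """
    if b.her2_positive:
        return "HER2-positive"
    # Triple-negative: ER, PR, and HER2 are all negative.
    if not b.er_positive and not b.pr_positive:
        return "triple-negative"
    return None


if __name__ == "__main__":
    case = Biomarkers(er_positive=False, pr_positive=False, her2_positive=False)
    print(high_risk_subtype(case))
```

In a real pipeline the three statuses would come from the question-answering model's post-processed predictions; here they are supplied by hand purely to show the decision rule.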

