Identifying High-Risk Cancer Patients on Breast Cancer Pathology Reports with Large Language Models

9 pages • Published: April 19, 2026

Abstract

Breast cancer subtypes defined by estrogen receptor (ER), progesterone receptor (PR), and human epidermal growth factor receptor 2 (HER2) status guide treatment decisions, yet manual extraction of these biomarkers from pathology reports is time‑consuming and error‑prone. We present an end‑to‑end NLP pipeline that automates high‑risk subtype identification (HER2‑positive and triple‑negative) from digital core‑biopsy reports. A corpus of 2,722 reports (2,401 non‑synoptic, 321 synoptic) was annotated in Doccano, yielding 16,706 question–answer pairs. Reports were pre-processed and then split using a multi-stratified sampling approach into training (59%), validation (17%), and held‑out test (24%) sets. We fine‑tuned BioMedBERT on SQuAD 2.0 and then on our domain‑specific dataset, employing hyperparameter optimization and prediction post-processing. On the held-out test data, our model achieved 99.79% accuracy on synoptic reports and 98.83% on non‑synoptic reports, outperforming human annotators and maintaining robust performance across report formats and biomarker classes. By automatically flagging eligible patients for neoadjuvant chemotherapy triage, this pipeline has the potential to streamline clinical workflows, reduce treatment delays, and improve outcomes for high‑risk breast cancer patients.
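The final step described above, turning extracted ER, PR, and HER2 statuses into a high-risk flag, can be illustrated with a minimal sketch. It assumes the question-answering model's answers have already been normalized to the strings "positive" or "negative". The function names, the label strings, and the handling of missing or equivocal values are hypothetical and are not taken from the paper.

```python
from typing import Optional

# Subtypes the abstract names as high-risk.
HIGH_RISK_SUBTYPES = {"HER2-positive", "triple-negative"}


def _normalize(status: Optional[str]) -> str:
    """Lower-case and trim an extracted status; None becomes an empty string."""
    return (status or "").strip().lower()


def classify_subtype(er: Optional[str], pr: Optional[str], her2: Optional[str]) -> str:
    """Map extracted biomarker statuses to a coarse subtype label.

    Assumption: each status is "positive", "negative", or something else
    (missing, equivocal), which is treated as unknown.
    """
    er, pr, her2 = _normalize(er), _normalize(pr), _normalize(her2)

    # HER2-positive disease is flagged regardless of hormone receptor status.
    if her2 == "positive":
        return "HER2-positive"
    # Without a definite HER2 result, no subtype can be assigned.
    if her2 != "negative":
        return "indeterminate"
    # HER2 is negative from here on.
    if er == "negative" and pr == "negative":
        return "triple-negative"
    # Any positive hormone receptor rules out triple-negative.
    if "positive" in (er, pr):
        return "other"
    # One hormone receptor is negative and the other is unknown.
    return "indeterminate"


def is_high_risk(er: Optional[str], pr: Optional[str], her2: Optional[str]) -> bool:
    """True when the report should be flagged for triage under this sketch."""
    return classify_subtype(er, pr, her2) in HIGH_RISK_SUBTYPES
```

Under these assumptions, a report with a missing or equivocal result is labeled "indeterminate" rather than silently treated as low-risk, so it can be routed to manual review.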

Keyphrases: clinical documents, large language models, natural language processing, triple negative breast cancers

In: Jernej Masnec, Hamid Reza Karimian, Parisa Kordjamshidi and Yan Li (editors). Proceedings of AI for Accelerated Research Symposium, vol 3, pages 27-35.

Links:	https://easychair.org/publications/paper/tkh7
	https://doi.org/10.29007/glv9

BibTeX entry

@inproceedings{AIAS2025:Identifying_High_Risk_Cancer,
  author    = {Trevor Kwan and Jaimie Lee and Raymond Ng},
  title     = {Identifying High-Risk Cancer Patients on Breast Cancer Pathology Reports with Large Language Models},
  booktitle = {Proceedings of AI for Accelerated Research Symposium},
  editor    = {Jernej Masnec and Hamid Reza Karimian and Parisa Kordjamshidi and Yan Li},
  series    = {EPiC Series in Technology},
  volume    = {3},
  publisher = {EasyChair},
  bibsource = {EasyChair, https://easychair.org},
  issn      = {2516-2322},
  url       = {https://easychair.org/publications/paper/tkh7},
  doi       = {10.29007/glv9},
  pages     = {27--35},
  year      = {2026}}
