Clin Chem. 2026 Feb 12:hvag006. doi: 10.1093/clinchem/hvag006. Online ahead of print.
ABSTRACT
BACKGROUND: Feature extraction via manual chart review is often used for both patient care and research, but it is time-intensive and costly. Recent improvements in natural language processing present novel opportunities to perform high-throughput automated feature extraction. Here, we assessed the accuracy of large language models (LLMs) for structured feature extraction from clinical and anatomic pathology notes.
METHODS: We assessed the accuracy of feature extraction by the OpenAI GPT-4o and GPT-5 models across 3 pathology data sets: cardiac transplant pathology reports, hemoglobin variant test interpretations, and urine drug test interpretations. For each case, model-derived features were compared with manual labels from expert clinicians. We also developed a novel web application to enable rapid development and prototyping of structured function calls to commonly used LLMs.
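The abstract does not include the underlying code; as a minimal sketch of the kind of structured function call described here, assuming the OpenAI Python SDK's function-calling interface, one feature extraction might look like the following. The tool name `extract_features`, the schema fields, and the prompt text are illustrative assumptions, not the authors' actual toolbuilder output.

```python
"""Sketch of structured feature extraction via an OpenAI function call.

Assumptions (not from the abstract): the tool name, JSON schema fields,
and prompt wording are illustrative only.
"""
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# JSON schema describing the structured features to pull from a report.
extraction_tool = {
    "type": "function",
    "function": {
        "name": "extract_features",
        "description": "Extract structured findings from a pathology report.",
        "parameters": {
            "type": "object",
            "properties": {
                "hemoglobin_variant_detected": {
                    "type": "string",
                    "enum": ["positive", "negative", "indeterminate"],
                },
                "variant_name": {"type": "string"},
            },
            "required": ["hemoglobin_variant_detected"],
        },
    },
}

def extract(report_text: str, model: str = "gpt-4o") -> dict:
    """Force the model to respond via the extraction tool and return its arguments."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Extract the requested features from the report."},
            {"role": "user", "content": report_text},
        ],
        tools=[extraction_tool],
        tool_choice={"type": "function", "function": {"name": "extract_features"}},
    )
    call = response.choices[0].message.tool_calls[0]
    return json.loads(call.function.arguments)
```

The returned dictionary can then be compared field by field against the expert clinicians' manual labels.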
RESULTS: We first developed a “toolbuilder” application to design structured feature extractions from clinical text. Using this application, current LLMs achieved high accuracy, with error rates near 5% for simple use cases and near 10% for more complex ones. Performance was strongly influenced by model type but was not drastically improved by prompt engineering or other input adaptations. Across a range of features, expert-LLM concordance was extremely high (κ>0.9) and only slightly below inter-expert concordance. Model errors most commonly involved confusion between negative and indeterminate findings, suggesting that the models were overconfident when information was limited.
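The concordance statistic reported above is Cohen's κ computed over paired expert and model labels. A small sketch of that calculation, assuming three-way labels (positive/negative/indeterminate) and scikit-learn's `cohen_kappa_score`; the example label lists are illustrative, not study data:

```python
"""Sketch of the concordance and error-rate calculations described above."""
from sklearn.metrics import cohen_kappa_score

# Illustrative paired labels (not the study data).
expert_labels = ["positive", "negative", "indeterminate", "negative", "positive"]
model_labels  = ["positive", "negative", "negative", "negative", "positive"]

# Cohen's kappa: chance-corrected agreement between expert and model labels.
kappa = cohen_kappa_score(expert_labels, model_labels)

# Error rate: fraction of cases where the model disagrees with the expert label.
errors = sum(e != m for e, m in zip(expert_labels, model_labels))
error_rate = errors / len(expert_labels)

print(f"kappa = {kappa:.2f}, error rate = {error_rate:.1%}")
```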
CONCLUSION: These findings suggest that LLM tools can provide significant value in automating time- and cost-intensive clinical note feature extraction and annotation.
PMID:41677051 | DOI:10.1093/clinchem/hvag006