UTSW builds AI-driven system to improve data collection
Powerful new tool speeds process of extracting information for research with near-perfect accuracy

DALLAS – July 28, 2025 – A multidisciplinary team at UT Southwestern Medical Center has developed an artificial intelligence (AI)-enabled pipeline that can quickly and accurately extract relevant information from complex, free-text medical records. The team’s novel approach, featured in npj Digital Medicine, could dramatically reduce the time needed to create analysis-ready data for research studies.

“Constructing highly detailed, accurate datasets from free-text medical records is extremely time-consuming, often requiring extensive manual chart review,” said study first author David Hein, M.S., Data Scientist in the Lyda Hill Department of Bioinformatics at UT Southwestern. “Our study demonstrates one approach for creating AI-powered large language models (LLMs) that simplify the process of collecting and organizing medical data for analysis. By automating both data extraction and standardization through AI, we can make large-scale clinical research more efficient.”
To develop the pipeline, researchers applied an LLM to more than 2,200 kidney cancer pathology reports and evaluated its ability to recognize and categorize distinct types of tumors. In close collaboration among AI scientists, pathologists, clinicians, and statisticians, the team refined the workflow over multiple rounds of testing, improving its handling of complex, nuanced information. Their findings were validated against existing electronic medical record (EMR) data to ensure reliability.

The results were striking – 99% accuracy in identifying tumor types and 97% accuracy in detecting whether the cancer had metastasized.
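
As a rough illustration of how agreement figures like these can be computed, the sketch below compares labels extracted by a model against reference labels from existing records. The function name, label values, and sample data are hypothetical and are not taken from the study.

```python
def accuracy(extracted, reference):
    """Return the fraction of records whose extracted label matches the reference."""
    if len(extracted) != len(reference):
        raise ValueError("label lists must be the same length")
    if not reference:
        raise ValueError("no records to compare")
    matches = sum(e == r for e, r in zip(extracted, reference))
    return matches / len(reference)


# Hypothetical toy data: one of four extracted labels disagrees with the reference.
extracted_subtypes = ["clear cell", "papillary", "clear cell", "chromophobe"]
reference_subtypes = ["clear cell", "papillary", "clear cell", "papillary"]

print(accuracy(extracted_subtypes, reference_subtypes))  # prints 0.75
```

In practice, a disagreement does not always mean the model is wrong; the reference record itself may need review.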
“The biggest challenge in training AI to extract data from narrative reports is that clinicians use a wide range of open-ended terms to describe the same finding,” said study co-leader Payal Kapur, M.D., Professor of Pathology and Urology. “It’s not as simple as counting ‘yes-no’ results. Every report contains hundreds of details in narrative form. But with proper input and oversight, an AI model can efficiently review and categorize vast amounts of records with speed and accuracy.”
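
The standardization problem described here, where many phrasings point to the same finding, can be illustrated with a deliberately simple sketch. It uses a fixed lookup table rather than the LLM-based approach in the study, and the synonym list, category names, and `standardize` function are assumptions made for illustration only.

```python
# Hypothetical synonym table mapping free-text phrasings to one standard category.
SYNONYMS = {
    "clear cell renal cell carcinoma": "clear cell RCC",
    "renal cell carcinoma, clear cell type": "clear cell RCC",
    "ccrcc": "clear cell RCC",
    "papillary renal cell carcinoma": "papillary RCC",
    "prcc": "papillary RCC",
}


def standardize(term):
    """Map a free-text term to a standard category, or 'unmapped' if unknown."""
    # Lowercase and collapse whitespace so trivial variations still match.
    key = " ".join(term.lower().split())
    return SYNONYMS.get(key, "unmapped")


print(standardize("ccRCC"))
print(standardize("  Renal cell carcinoma,  clear cell type "))
print(standardize("oncocytoma"))
```

A fixed table like this fails on any phrasing it has not seen, which is why open-ended narrative text calls for a more flexible approach with expert oversight.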
In a final step, the team tested the pipeline on a broader dataset of more than 3,500 internal kidney cancer pathology reports and obtained similar results – a process facilitated by the high-quality, curated data and pipelines available through UT Southwestern’s Kidney Cancer Program.
“The key is collaborative teamwork across specialties to refine AI instructions and ensure accuracy,” said study co-author James Brugarolas, M.D., Ph.D., Director of the Kidney Cancer Program, Professor of Internal Medicine in the Division of Hematology and Oncology, and member of the Cellular Networks in Cancer Research Program of the Harold C. Simmons Comprehensive Cancer Center.

While this study focused on kidney cancer, the approach may have broader applications to other tumor types, the authors said.
“There is no ‘one-size-fits-all’ model for medical data extraction,” said study co-leader Andrew Jamieson, Ph.D., Assistant Professor and Principal Investigator in the Lyda Hill Department of Bioinformatics. “But our study outlines key strategies that can help other researchers use AI-powered LLMs more effectively in their own specialties. We’re excited to continue refining this process and expanding AI’s role in medical research.”
Other UTSW researchers who contributed to the study are Bingqing Xie, Ph.D., Assistant Professor of Internal Medicine in the Division of Hematology and Oncology and Kidney Cancer Program; Joseph Vento, M.D., Assistant Professor of Internal Medicine in the Division of Hematology and Oncology; Lindsay Cowell, Ph.D., Professor, Peter O’Donnell Jr. School of Public Health and Department of Immunology; Scott Christley, Ph.D., Computational Biologist, O’Donnell School of Public Health; Ameer Hamza Shakur, Ph.D., Data Scientist/Machine Learning Engineer, Lyda Hill Department of Bioinformatics; Michael Holcomb, M.S., Lead Data Scientist, Lyda Hill Department of Bioinformatics; Alana Christie, M.S., Biostatistical Consultant, Simmons Cancer Center and Kidney Cancer Program; Neil Rakheja, student intern, Simmons Cancer Center; and AJ Jain, Ph.D. candidate, Biomedical Engineering.

Dr. Kapur holds the Jan and Bob Pickens Distinguished Professorship in Medical Science, in Memory of Jerry Knight Rymer and Annette Brannon Rymer and Mr. and Mrs. W.L. Pickens.
Dr. Brugarolas holds the Sherry Wigley Crow Cancer Research Endowed Chair in Honor of Robert Lewis Kirby, M.D.
Drs. Kapur, Brugarolas, and Cowell are members of the Simmons Cancer Center.
The study was funded by a grant from the National Cancer Institute’s Kidney Cancer Specialized Program of Research Excellence (P50 CA196516) and an endowment from the Brock Fund for Medical Science Chair in Pathology.
About UT Southwestern Medical Center
UT Southwestern, one of the nation’s premier academic medical centers, integrates pioneering biomedical research with exceptional clinical care and education. The institution’s faculty members have received six Nobel Prizes and include 25 members of the National Academy of Sciences, 24 members of the National Academy of Medicine, and 14 Howard Hughes Medical Institute Investigators. The full-time faculty of more than 3,200 is responsible for groundbreaking medical advances and is committed to translating science-driven research quickly to new clinical treatments. UT Southwestern physicians provide care in more than 80 specialties to more than 140,000 hospitalized patients and more than 360,000 emergency room cases, and they oversee nearly 5.1 million outpatient visits a year.