Development and Validation of a Parsimonious Risk Stratification Model for Pancreatic Cancer
Mavromatis, Lucas A; Zlatanic, Viktor; Agarunov, Emil; Sanoba, Shenin A; Kluger, Michael D; Horwitz, Leora I; Razavian, Narges; Maitra, Anirban; Gonda, Tamas A; Grams, Morgan E
IMPORTANCE/UNASSIGNED:Pancreatic ductal adenocarcinoma (PDAC) is a leading cause of cancer deaths in the US. Although early detection improves survival, the rarity of the disease has made population-wide screening impractical. OBJECTIVE/UNASSIGNED:To develop and validate a parsimonious, interpretable, and generalizable model predicting incident PDAC, termed PRIME (PDAC Risk Model for Earlier Detection), using routinely available electronic health record (EHR) data. DESIGN, SETTING, AND PARTICIPANTS/UNASSIGNED:This cohort study used the Optum Labs Data Warehouse, a longitudinal, deidentified US EHR and claims database. Adults 40 years or older with an outpatient clinical encounter between 2016 and 2018 were included. Participants from 23 health systems (n = 4 859 833) comprised the training cohort; 31 additional systems (n = 5 619 091) served as the validation cohort. International validation was conducted in the UK Biobank (n = 498 754). Data analysis occurred July 2025 to January 2026. EXPOSURES/UNASSIGNED:Demographics, diagnosis codes, and routinely measured laboratory values were evaluated. Elastic-net regularization with 10-fold cross-validation selected the predictor set. MAIN OUTCOMES AND MEASURES/UNASSIGNED:Incident PDAC was identified by International Classification of Diseases, Ninth and Tenth Revisions (ICD-9/10) codes. Model performance was assessed using time-dependent area under the curve (AUC) and calibration metrics. RESULTS/UNASSIGNED:Overall, the study included more than 11 million adults (2.1% Asian individuals, 8.4% Black individuals, 4.3% Hispanic/Latino individuals, 82.7% White individuals, and 2.4% other race/ethnicity by EHR reporting). In the training cohort (mean [SD] age, 60.4 [11] years), 14 405 individuals were diagnosed with PDAC (incidence, 55 per 100 000 person-years) over a mean (SD) of 5.4 (2.5) years; in the validation cohort, 11 693 individuals were diagnosed with PDAC (54 per 100 000 person-years) over a mean (SD) of 3.9 (2.5) years. PRIME retained 19 predictors, including history of pancreatitis, gastrointestinal disorders, prior cancers, type 2 diabetes, elevated aspartate aminotransferase levels, smoking, non-type-O blood, and male sex. Discrimination was strong at the 36-month time horizon (AUC = 0.75 in both the training and validation cohorts), with good calibration. In the validation cohort, patients in the top 1% of predicted risk had substantially higher PDAC risk (HR, 7.63; 95% CI, 6.85-8.49) than average-risk patients. In the UK Biobank, PRIME achieved a 36-month AUC of 0.71 with good calibration. CONCLUSIONS AND RELEVANCE/UNASSIGNED:In this cohort study, PRIME, a transparent EHR-based model, effectively stratified PDAC risk across diverse US health systems and generalized internationally. Prospective studies should evaluate EHR-guided PDAC case-finding and integration with blood-based early-detection assays.
PMCID: 13022769
PMID: 41885821
ISSN: 2374-2445
CID: 6018542
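The predictor-selection step described above (elastic-net regularization with 10-fold cross-validation) can be sketched in a few lines of scikit-learn. This is a minimal illustration on synthetic data, not the authors' Optum Labs pipeline; the cohort size, feature count, and hyperparameter grid are assumptions.

```python
# Minimal sketch of elastic-net predictor selection with 10-fold CV.
# Illustrative only: features and labels are synthetic, not the study's
# Optum Labs data or its actual preprocessing.
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

rng = np.random.default_rng(0)
n, p = 5000, 40                      # hypothetical cohort and feature count
X = rng.normal(size=(n, p))          # demographics, diagnoses, lab values
y = rng.binomial(1, 0.01, size=n)    # rare incident-PDAC outcome

model = LogisticRegressionCV(
    Cs=10,
    cv=10,                           # 10-fold cross-validation
    penalty="elasticnet",
    solver="saga",                   # the solver supporting elastic net
    l1_ratios=[0.1, 0.5, 0.9],
    scoring="roc_auc",
    max_iter=5000,
)
model.fit(X, y)

# Features with nonzero coefficients form the retained predictor set.
selected = np.flatnonzero(model.coef_[0])
print(f"{selected.size} predictors retained:", selected)
```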
Catalyzing Health AI by Fixing Payment Systems
Razavian, Narges; Batchu-Green, Prem; Chowdhry, Vikas; Elemento, Olivier; Rajpurkar, Pranav; Saria, Suchi; Shah, Nigam H; Topol, Eric J
Despite rapid advances in artificial intelligence (AI) across sectors, health care remains one of the least transformed domains. This stagnation is not due to a lack of data, clinical need, or innovation, but rather to persistent regulatory and economic misalignment. Even AI tools cleared by the U.S. Food and Drug Administration that meet clinical efficacy standards often face major barriers to adoption, largely driven by outdated reimbursement frameworks and fragmented incentives among stakeholders. The result is a systemic failure to deploy technologies that could meaningfully reduce clinician workload, shorten wait times, and improve patients' lives. In this article, we examine the reimbursement landscape for health AI, focusing first on tools that fit existing regulatory pathways, outlining payment barriers and proposing policy reforms. These include resolving Current Procedural Terminology adoption bottlenecks, addressing integration overhead, and aligning pricing models with AI cost structures. We then extend the discussion to the emerging domain of generative AI in health care, highlighting the urgent need for prospective regulatory frameworks to ensure patient benefits. (Funded by the National Institutes of Health and the Leukemia and Lymphoma Society.)
PMCID: 12900248
PMID: 41695240
ISSN: 2836-9386
CID: 6004322
Robust Disease Prognosis via Diagnostic Knowledge Preservation: A Sequential Learning Approach
Rajamohan, Haresh Rengaraj; Xu, Yanqi; Zhu, Weicheng; Kijowski, Richard; Cho, Kyunghyun; Geras, Krzysztof J; Razavian, Narges; Deniz, Cem M
Accurate disease prognosis is essential for patient care but is often hindered by the lack of long-term data. This study explores deep learning training strategies that utilize large, accessible diagnostic datasets to pretrain models aimed at predicting future disease progression in knee osteoarthritis (OA), Alzheimer's disease (AD), and breast cancer (BC). While diagnostic pretraining improves prognostic task performance, naive fine-tuning for prognosis can cause 'catastrophic forgetting,' in which the model's original diagnostic accuracy degrades, a significant patient safety concern in real-world settings. To address this, we propose a sequential learning strategy with experience replay. We used cohorts with knee radiographs, brain MRIs, and digital mammograms to predict 4-year structural worsening in OA, 2-year cognitive decline in AD, and 5-year cancer diagnosis in BC. Our results showed that diagnostic pretraining on larger datasets improved prognosis model performance compared to standard baselines, boosting both the area under the receiver operating characteristic curve (AUROC) (e.g., knee OA external: 0.77 vs. 0.747; breast cancer: 0.874 vs. 0.848) and the area under the precision-recall curve (AUPRC) (e.g., Alzheimer's disease: 0.752 vs. 0.683). Additionally, a sequential learning approach with experience replay achieved prognostic performance comparable to dedicated single-task models (e.g., breast cancer AUROC 0.876 vs. 0.874) while also preserving diagnostic ability. This method maintained diagnostic accuracy (e.g., breast cancer balanced accuracy 50.4% vs. 50.9% for a dedicated diagnostic model), unlike simpler multitask methods prone to catastrophic forgetting (e.g., 37.7%). Our findings show that leveraging large diagnostic datasets is a reliable and data-efficient way to enhance prognostic models while maintaining essential diagnostic capability.
PMCID: 12486016
PMID: 41040735
CID: 5973072
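A minimal sketch of the experience-replay idea described above: while fine-tuning on the prognostic task, each batch is mixed with stored diagnostic examples so the diagnostic task keeps receiving gradient signal. The two-headed model interface, replay buffer, and loss weighting below are hypothetical placeholders, not the paper's exact architecture.

```python
# Sketch of sequential learning with experience replay (PyTorch).
# Fine-tune a pretrained diagnostic model on a prognostic task while
# replaying stored diagnostic examples to mitigate catastrophic forgetting.
# `model(x, task=...)` and `replay_buffer.sample(...)` are hypothetical
# interfaces for a two-headed network and a stored-example buffer.
import torch
import torch.nn.functional as F

def finetune_with_replay(model, prognosis_loader, replay_buffer,
                         epochs=5, replay_weight=0.5, lr=1e-4):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x_prog, y_prog in prognosis_loader:
            # Draw a matching batch of stored diagnostic examples.
            x_diag, y_diag = replay_buffer.sample(len(x_prog))

            prog_logits = model(x_prog, task="prognosis")
            diag_logits = model(x_diag, task="diagnosis")

            # Joint loss: new prognostic task plus replayed diagnostic task.
            loss = (F.cross_entropy(prog_logits, y_prog)
                    + replay_weight * F.cross_entropy(diag_logits, y_diag))

            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```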
Predicting hematoma expansion after intracerebral hemorrhage: a comparison of clinician prediction with deep learning radiomics models
Yu, Boyang; Melmed, Kara R; Frontera, Jennifer; Zhu, Weicheng; Huang, Haoxu; Qureshi, Adnan I; Maggard, Abigail; Steinhof, Michael; Kuohn, Lindsey; Kumar, Arooshi; Berson, Elisa R; Tran, Anh T; Payabvash, Seyedmehdi; Ironside, Natasha; Brush, Benjamin; Dehkharghani, Seena; Razavian, Narges; Ranganath, Rajesh
BACKGROUND:Early prediction of hematoma expansion (HE) following nontraumatic intracerebral hemorrhage (ICH) may inform preemptive therapeutic interventions. We sought to identify how accurately machine learning (ML) radiomics models predict HE compared with expert clinicians using head computed tomography (HCT). METHODS:We used data from 900 study participants with ICH enrolled in the Antihypertensive Treatment of Acute Cerebral Hemorrhage 2 Study. ML models were developed using baseline HCT images as well as admission clinical data in a training cohort (n = 621), and their performance was evaluated in an independent test cohort (n = 279) to predict HE (defined as hematoma growth by ≥ 33% or > 6 mL at 24 h). We simultaneously surveyed expert clinicians and asked them to predict HE using the same initial HCT images and clinical data. Areas under the receiver operating characteristic curve (AUCs) were compared between clinician predictions, ML models using radiomic data only (a random forest classifier and a deep learning imaging model), and ML models using both radiomic and clinical data (three random forest classifier models using different feature combinations). Kappa values assessing interrater reliability among expert clinicians were calculated. The best-performing model was compared with clinician prediction. RESULTS:The AUC for expert clinician prediction of HE was 0.591, with a kappa of 0.156 for interrater variability, compared with ML models using radiomic data only (a deep learning model using image input, AUC 0.680) and using both radiomic and clinical data (a random forest model, AUC 0.677). The intraclass correlation coefficient between clinical judgment and the best-performing ML model was 0.47 (95% confidence interval 0.23-0.75). CONCLUSIONS:We introduced supervised ML algorithms demonstrating that automated HE prediction may outperform practicing clinicians. Despite overall moderate AUCs, our results set a new relative benchmark for performance on these tasks, which even expert clinicians find challenging. These results emphasize the need for continued improvements and further enhanced clinical decision support to optimally manage patients with ICH.
PMID: 39920546
ISSN: 1556-0961
CID: 5784422
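As an illustration of the radiomics-plus-clinical modeling compared above, here is a minimal random forest sketch with a development/test split mirroring the study's cohort sizes. The feature matrices are synthetic stand-ins for the extracted radiomic and clinical variables, not the ATACH-2 data.

```python
# Illustrative sketch of a radiomics + clinical random forest classifier:
# train on a development cohort, report AUC on a held-out test cohort.
# Synthetic data; cohort sizes echo the study, features are placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X_train = rng.normal(size=(621, 120))     # radiomic + clinical features
y_train = rng.binomial(1, 0.3, size=621)  # hematoma expansion label
X_test = rng.normal(size=(279, 120))
y_test = rng.binomial(1, 0.3, size=279)

clf = RandomForestClassifier(n_estimators=500, random_state=0)
clf.fit(X_train, y_train)

# Evaluate discrimination on the independent test cohort.
auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
print(f"Test AUC: {auc:.3f}")
```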
Identification of patients at risk for pancreatic cancer in a 3-year timeframe based on machine learning algorithms
Zhu, Weicheng; Chen, Long; Aphinyanaphongs, Yindalon; Kastrinos, Fay; Simeone, Diane M; Pochapin, Mark; Stender, Cody; Razavian, Narges; Gonda, Tamas A
Early detection of pancreatic cancer (PC) remains challenging, largely due to the low population incidence and few known risk factors. However, screening in at-risk populations and detection of early cancer have the potential to significantly alter survival. In this study, we aimed to develop a predictive model to identify patients at risk of developing new-onset PC within a 2.5- to 3-year time frame. We used the electronic health records (EHR) of a large medical system from 2000 to 2021 (N = 537,410). The EHR data analyzed in this work consist of patients' demographic information, diagnosis records, and laboratory values, which were used to identify patients diagnosed with pancreatic cancer and the risk factors serving as inputs to the machine learning algorithm. We identified 73 risk factors for pancreatic cancer with a phenome-wide association study (PheWAS) on a matched case-control cohort and, based on these, built a large-scale machine learning model on the EHR data. A temporally stratified validation was performed on patients not included in any stage of model training. The model showed an AUROC of 0.742 [0.727, 0.757], which was similar in both the general population and the subset of the population with prior cross-sectional imaging. The rate of pancreatic cancer diagnosis among those in the top 1 percentile of the risk score was sixfold higher than in the general population. Our model leverages data extracted from a 6-month window of the electronic health record to identify patients at nearly sixfold higher-than-baseline risk of developing pancreatic cancer 2.5-3 years from evaluation. This approach offers an opportunity to define an enriched population entirely from static data, where current screening may be recommended.
PMID: 40188106
ISSN: 2045-2322
CID: 5819542
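The top-percentile enrichment reported above (roughly sixfold in the top 1% of risk scores) reduces to a simple ratio of outcome rates. A sketch on synthetic scores, where the risk-outcome relationship is invented purely for illustration:

```python
# Sketch: fold-enrichment of outcome rate in the top percentile of a risk
# score versus the overall population. Scores and labels are synthetic.
import numpy as np

def top_percentile_enrichment(scores, labels, pct=1.0):
    """Outcome rate in the top pct% of scores, relative to overall rate."""
    cutoff = np.percentile(scores, 100 - pct)
    return labels[scores >= cutoff].mean() / labels.mean()

rng = np.random.default_rng(0)
scores = rng.random(100_000)
# Invented risk-correlated outcome: higher score -> higher event probability.
labels = rng.binomial(1, np.clip(scores**4 / 50, 0, 1))
print(f"Top-1% enrichment: {top_percentile_enrichment(scores, labels):.1f}x")
```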
Evaluating Large Language Models in extracting cognitive exam dates and scores
Zhang, Hao; Jethani, Neil; Jones, Simon; Genes, Nicholas; Major, Vincent J; Jaffe, Ian S; Cardillo, Anthony B; Heilenbach, Noah; Ali, Nadia Fazal; Bonanni, Luke J; Clayburn, Andrew J; Khera, Zain; Sadler, Erica C; Prasad, Jaideep; Schlacter, Jamie; Liu, Kevin; Silva, Benjamin; Montgomery, Sophie; Kim, Eric J; Lester, Jacob; Hill, Theodore M; Avoricani, Alba; Chervonski, Ethan; Davydov, James; Small, William; Chakravartty, Eesha; Grover, Himanshu; Dodson, John A; Brody, Abraham A; Aphinyanaphongs, Yindalon; Masurkar, Arjun; Razavian, Narges
Ensuring the reliability of Large Language Models (LLMs) in clinical tasks is crucial. Our study assesses two state-of-the-art LLMs (ChatGPT and LlaMA-2) for extracting clinical information, focusing on cognitive tests such as the MMSE and CDR. Our data consisted of 135,307 clinical notes (Jan 12th, 2010 to May 24th, 2023) mentioning MMSE, CDR, or MoCA. After applying inclusion criteria, 34,465 notes remained, of which 765 were processed by ChatGPT (GPT-4) and LlaMA-2, with 22 experts reviewing the responses. ChatGPT successfully extracted MMSE and CDR instances with dates from 742 notes. We used 20 notes for fine-tuning and training the reviewers. The remaining 722 were assigned to reviewers, with 309 assigned to two reviewers simultaneously. Inter-rater agreement (Fleiss' kappa), precision, recall, true/false negative rates, and accuracy were calculated. Our study follows TRIPOD reporting guidelines for model validation. For MMSE information extraction, ChatGPT (vs. LlaMA-2) achieved accuracy of 83% (vs. 66.4%), sensitivity of 89.7% (vs. 69.9%), a true-negative rate of 96% (vs. 60.0%), and precision of 82.7% (vs. 62.2%). For CDR, the results were lower overall, with accuracy of 87.1% (vs. 74.5%), sensitivity of 84.3% (vs. 39.7%), a true-negative rate of 99.8% (vs. 98.4%), and precision of 48.3% (vs. 16.1%). We qualitatively evaluated the MMSE errors of ChatGPT and LlaMA-2 on double-reviewed notes. LlaMA-2 errors included 27 cases of total hallucination, 19 cases of reporting other scores instead of the MMSE, 25 missed scores, and 23 cases of reporting only the wrong date. In comparison, ChatGPT's errors included only 3 cases of total hallucination, 17 cases of reporting the wrong test instead of the MMSE, and 19 cases of reporting a wrong date. In this diagnostic/prognostic study of ChatGPT and LlaMA-2 for extracting cognitive exam dates and scores from clinical notes, ChatGPT exhibited high accuracy and better performance than LlaMA-2. The use of LLMs could benefit dementia research and clinical care by identifying patients eligible for treatment initiation or clinical trial enrollment. Rigorous evaluation of LLMs is crucial to understanding their capabilities and limitations.
PMCID: 11634005
PMID: 39661652
ISSN: 2767-3170
CID: 5762692
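The inter-rater agreement metric used in this study, Fleiss' kappa over double-reviewed notes, can be computed with statsmodels. A sketch on synthetic ratings; the two-category correctness coding below is an assumption for illustration:

```python
# Sketch: Fleiss' kappa for inter-rater agreement on double-reviewed notes.
# Ratings are synthetic (subjects x raters), coded 0 = incorrect, 1 = correct.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

rng = np.random.default_rng(0)
# 309 double-reviewed notes, 2 reviewers each, mostly agreeing by design.
ratings = rng.binomial(1, 0.85, size=(309, 2))

# Convert raw ratings into a subjects x categories count table.
table, _ = aggregate_raters(ratings)
print(f"Fleiss' kappa: {fleiss_kappa(table):.3f}")
```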
Predicting Risk of Alzheimer's Diseases and Related Dementias with AI Foundation Model on Electronic Health Records
Zhu, Weicheng; Tang, Huanze; Zhang, Hao; Rajamohan, Haresh Rengaraj; Huang, Shih-Lun; Ma, Xinyue; Chaudhari, Ankush; Madaan, Divyam; Almahmoud, Elaf; Chopra, Sumit; Dodson, John A; Brody, Abraham A; Masurkar, Arjun V; Razavian, Narges
Early identification of Alzheimer's disease (AD) and AD-related dementias (ADRD) has high clinical significance, both because of the potential to slow decline by initiating FDA-approved therapies and managing modifiable risk factors, and to help persons living with dementia and their families plan before cognitive loss makes doing so challenging. However, substantial racial and ethnic disparities in early diagnosis currently lead to additional inequities in care, underscoring the need for accurate and inclusive risk assessment programs. In this study, we trained an artificial intelligence foundation model to represent electronic health record (EHR) data using a cohort of 1.2 million patients within a large health system. Building on this foundation EHR model, we developed a predictive Transformer model, named TRADE, capable of identifying risk for AD/ADRD and mild cognitive impairment (MCI) by analyzing sequences of past visit records. Among individuals 65 and older, our model generated risk predictions for various future timeframes. On the held-out validation set, the model achieved an area under the receiver operating characteristic curve (AUROC) of 0.772 (95% CI: 0.770, 0.773) for identifying AD/ADRD/MCI risk at 1 year, and an AUROC of 0.735 (95% CI: 0.734, 0.736) at 5 years. The positive predictive values (PPV) at 5 years among individuals with the top 1% and 5% highest estimated risks were 39.2% and 27.8%, respectively. These results demonstrate significant improvements over current EHR-based AD/ADRD/MCI risk assessment models, paving the way for better prognosis and management of AD/ADRD/MCI at scale.
PMCID: 11071573
PMID: 38712223
CID: 5662732
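The PPV-at-top-risk metric reported above has a direct implementation: take the top 1% or 5% of estimated risks and measure the observed outcome rate among those flagged. A sketch on synthetic scores and outcomes:

```python
# Sketch: positive predictive value among the highest-risk percentiles.
# Scores and outcomes are synthetic placeholders, not TRADE predictions.
import numpy as np

def ppv_at_top(scores, outcomes, pct):
    """Observed outcome rate among the top pct% of risk scores."""
    cutoff = np.percentile(scores, 100 - pct)
    return outcomes[scores >= cutoff].mean()

rng = np.random.default_rng(0)
scores = rng.random(50_000)
# Invented risk-correlated outcome for illustration.
outcomes = rng.binomial(1, 0.4 * scores**3)
for pct in (1, 5):
    print(f"PPV in top {pct}%: {ppv_at_top(scores, outcomes, pct):.1%}")
```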
Evaluating Large Language Models in Extracting Cognitive Exam Dates and Scores
Zhang, Hao; Jethani, Neil; Jones, Simon; Genes, Nicholas; Major, Vincent J; Jaffe, Ian S; Cardillo, Anthony B; Heilenbach, Noah; Ali, Nadia Fazal; Bonanni, Luke J; Clayburn, Andrew J; Khera, Zain; Sadler, Erica C; Prasad, Jaideep; Schlacter, Jamie; Liu, Kevin; Silva, Benjamin; Montgomery, Sophie; Kim, Eric J; Lester, Jacob; Hill, Theodore M; Avoricani, Alba; Chervonski, Ethan; Davydov, James; Small, William; Chakravartty, Eesha; Grover, Himanshu; Dodson, John A; Brody, Abraham A; Aphinyanaphongs, Yindalon; Masurkar, Arjun; Razavian, Narges
IMPORTANCE/UNASSIGNED:Large language models (LLMs) are increasingly used for medical tasks, and ensuring their reliability is vital to avoid false results. Our study assesses two state-of-the-art LLMs (ChatGPT and LlaMA-2) for extracting clinical information, focusing on cognitive tests such as the MMSE and CDR. OBJECTIVE/UNASSIGNED:To evaluate the performance of ChatGPT and LlaMA-2 in extracting MMSE and CDR scores, including their associated dates. METHODS/UNASSIGNED:Our data consisted of 135,307 clinical notes (Jan 12th, 2010 to May 24th, 2023) mentioning MMSE, CDR, or MoCA. After applying inclusion criteria, 34,465 notes remained, of which 765 were processed by ChatGPT (GPT-4) and LlaMA-2, with 22 experts reviewing the responses. ChatGPT successfully extracted MMSE and CDR instances with dates from 742 notes. We used 20 notes for fine-tuning and training the reviewers. The remaining 722 were assigned to reviewers, with 309 assigned to two reviewers simultaneously. Inter-rater agreement (Fleiss' kappa), precision, recall, true/false negative rates, and accuracy were calculated. Our study follows TRIPOD reporting guidelines for model validation. RESULTS/UNASSIGNED:For MMSE information extraction, ChatGPT (vs. LlaMA-2) achieved accuracy of 83% (vs. 66.4%), sensitivity of 89.7% (vs. 69.9%), a true-negative rate of 96% (vs. 60.0%), and precision of 82.7% (vs. 62.2%). For CDR, the results were lower overall, with accuracy of 87.1% (vs. 74.5%), sensitivity of 84.3% (vs. 39.7%), a true-negative rate of 99.8% (vs. 98.4%), and precision of 48.3% (vs. 16.1%). We qualitatively evaluated the MMSE errors of ChatGPT and LlaMA-2 on double-reviewed notes. LlaMA-2 errors included 27 cases of total hallucination, 19 cases of reporting other scores instead of the MMSE, 25 missed scores, and 23 cases of reporting only the wrong date. In comparison, ChatGPT's errors included only 3 cases of total hallucination, 17 cases of reporting the wrong test instead of the MMSE, and 19 cases of reporting a wrong date. CONCLUSIONS/UNASSIGNED:In this diagnostic/prognostic study of ChatGPT and LlaMA-2 for extracting cognitive exam dates and scores from clinical notes, ChatGPT exhibited high accuracy and better performance than LlaMA-2. The use of LLMs could benefit dementia research and clinical care by identifying patients eligible for treatment initiation or clinical trial enrollment. Rigorous evaluation of LLMs is crucial to understanding their capabilities and limitations.
PMCID: 10888985
PMID: 38405784
CID: 5722422
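For readers curious how such an extraction call might look in practice, below is a hedged sketch using the OpenAI Python client. The prompt wording, JSON schema, and model name are illustrative assumptions; the study's actual prompts and pipeline are not reproduced here.

```python
# Hedged illustration of score/date extraction with a chat model.
# Prompt text and output schema are assumptions for illustration only.
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment

PROMPT = (
    "From the clinical note below, extract every MMSE and CDR score with "
    "its associated date. Respond as JSON: "
    '[{"test": ..., "score": ..., "date": ...}]. '
    "If no score is present, return []."
)

def extract_cognitive_scores(note_text: str) -> str:
    """Return the model's raw JSON-formatted extraction for one note."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": f"{PROMPT}\n\n{note_text}"}],
        temperature=0,  # deterministic output for extraction tasks
    )
    return response.choices[0].message.content
```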
Author Correction: Generalizable deep learning model for early Alzheimer's disease detection from structural MRIs
Liu, Sheng; Masurkar, Arjun V; Rusinek, Henry; Chen, Jingyun; Zhang, Ben; Zhu, Weicheng; Fernandez-Granda, Carlos; Razavian, Narges
PMID: 37783742
ISSN: 2045-2322
CID: 5735542
Deep learning integrates histopathology and proteogenomics at a pan-cancer level
Wang, Joshua M; Hong, Runyu; Demicco, Elizabeth G; Tan, Jimin; Lazcano, Rossana; Moreira, Andre L; Li, Yize; Calinawan, Anna; Razavian, Narges; Schraink, Tobias; Gillette, Michael A; Omenn, Gilbert S; An, Eunkyung; Rodriguez, Henry; Tsirigos, Aristotelis; Ruggles, Kelly V; Ding, Li; Robles, Ana I; Mani, D R; Rodland, Karin D; Lazar, Alexander J; Liu, Wenke; Fenyö, David; ,
We introduce an approach that integrates pathology imaging with transcriptomics and proteomics to identify predictive histology features associated with critical clinical outcomes in cancer. We utilize 2,755 H&E-stained histopathological slides from 657 patients across 6 cancer types from CPTAC. Our models effectively recapitulate distinctions readily made by human pathologists: tumor vs. normal (AUROC = 0.995) and tissue of origin (AUROC = 0.979). We further investigate predictive power on tasks not normally performed from H&E alone, including TP53 prediction and pathologic stage. Importantly, we describe predictive morphologies not previously utilized in a clinical setting. The incorporation of transcriptomics and proteomics identifies pathway-level signatures and cellular processes driving predictive histology features. Model generalizability and interpretability are confirmed using TCGA. We propose a classification system for these tasks and suggest potential clinical applications for this integrated human and machine learning approach. A publicly available web-based platform implements these models.
PMCID: 10518635
PMID: 37582371
ISSN: 2666-3791
CID: 5590072
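Slide-level evaluation in this kind of pipeline is commonly done by aggregating tile-level predictions; a minimal sketch, assuming mean-pooling of tile probabilities and synthetic data rather than the authors' CPTAC setup or architecture:

```python
# Sketch: aggregate tile-level tumor probabilities to slide level by mean
# pooling, then compute slide-level AUROC. Data are synthetic placeholders.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n_slides, tiles_per_slide = 200, 50
slide_labels = rng.binomial(1, 0.5, size=n_slides)  # 1 = tumor slide

# Invented tile probabilities: tumor slides skew high, normal slides low.
tile_probs = np.clip(
    slide_labels[:, None] * 0.6
    + rng.normal(0.2, 0.15, size=(n_slides, tiles_per_slide)),
    0, 1)

slide_probs = tile_probs.mean(axis=1)  # mean-pool tiles to slide level
print(f"Slide-level AUROC: {roc_auc_score(slide_labels, slide_probs):.3f}")
```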