NYUHSL Faculty Bibliography

Searched for:

person:aphiny01

in-biosketch:yes

Total Results:

Scientific reports. 2025:15(1).DOI: 10.1038/s41598-025-89607-8

Identification of patients at risk for pancreatic cancer in a 3-year timeframe based on machine learning algorithms

Zhu, Weicheng; Chen, Long; Aphinyanaphongs, Yindalon; Kastrinos, Fay; Simeone, Diane M; Pochapin, Mark; Stender, Cody; Razavian, Narges; Gonda, Tamas A

Early detection of pancreatic cancer (PC) remains challenging largely due to the low population incidence and few known risk factors. However, screening in at-risk populations and detection of early cancer have the potential to significantly alter survival. In this study, we aim to develop a predictive model to identify patients at risk for developing new-onset PC within a 2.5- to 3-year time frame. We used the Electronic Health Records (EHR) of a large medical system from 2000 to 2021 (N = 537,410). The EHR data analyzed in this work consists of patients' demographic information, diagnosis records, and lab values, which are used to identify patients who were diagnosed with pancreatic cancer and the risk factors used in the machine learning algorithm for prediction. We identified 73 risk factors of pancreatic cancer with a Phenome-wide Association Study (PheWAS) on a matched case-control cohort. Based on these risk factors, we built a large-scale machine learning algorithm on EHR data. A temporally stratified validation based on patients not included in any stage of the training of the model was performed. This model showed an AUROC of 0.742 [0.727, 0.757], which was similar in both the general population and in a subset of the population with prior cross-sectional imaging. The rate of diagnosis of pancreatic cancer in those in the top 1 percentile of the risk score was 6-fold higher than in the general population. Our model leverages data extracted from a 6-month window of time in the electronic health record to identify patients at nearly sixfold higher than baseline risk of developing pancreatic cancer 2.5-3 years from evaluation. This approach offers an opportunity to define an enriched population entirely based on static data, where current screening may be recommended.

PMID: 40188106

ISSN: 2045-2322

CID: 5819542

Journal of medical Internet research. 2025:27.DOI: 10.2196/67967

Large Language Model-Based Assessment of Clinical Reasoning Documentation in the Electronic Health Record Across Two Institutions: Development and Validation Study

Schaye, Verity; DiTullio, David; Guzman, Benedict Vincent; Vennemeyer, Scott; Shih, Hanniel; Reinstein, Ilan; Weber, Danielle E; Goodman, Abbie; Wu, Danny T Y; Sartori, Daniel J; Santen, Sally A; Gruppen, Larry; Aphinyanaphongs, Yindalon; Burk-Rafel, Jesse

BACKGROUND:Clinical reasoning (CR) is an essential skill; yet, physicians often receive limited feedback. Artificial intelligence holds promise to fill this gap. OBJECTIVE:We report the development of named entity recognition (NER), logic-based and large language model (LLM)-based assessments of CR documentation in the electronic health record across 2 institutions (New York University Grossman School of Medicine [NYU] and University of Cincinnati College of Medicine [UC]). METHODS:[...]-scores for the NER, logic-based model and area under the receiver operating characteristic curve (AUROC) and area under the precision-recall curve (AUPRC) for the LLMs. RESULTS:[...]-scores 0.80, 0.74, and 0.80 for D0, D1, D2, respectively. The GatorTron LLM performed best for EA2 scores, with AUROC/AUPRC of 0.75/0.69. CONCLUSIONS:This is the first multi-institutional study to apply LLMs for assessing CR documentation in the electronic health record. Such tools can enhance feedback on CR. Lessons learned by implementing these models at distinct institutions support the generalizability of this approach.

PMID: 40117575

ISSN: 1438-8871

CID: 5813782

Transplantation. 2025:109(3):399-402.DOI: 10.1097/TP.0000000000005261

Trials and Tribulations: Responses of ChatGPT to Patient Questions About Kidney Transplantation

Xu, Jingzhi; Mankowski, Michal; Vanterpool, Karen B; Strauss, Alexandra T; Lonze, Bonnie E; Orandi, Babak J; Stewart, Darren; Bae, Sunjae; Ali, Nicole; Stern, Jeffrey; Mattoo, Aprajita; Robalino, Ryan; Soomro, Irfana; Weldon, Elaina; Oermann, Eric K; Aphinyanaphongs, Yin; Sidoti, Carolyn; McAdams-DeMarco, Mara; Massie, Allan B; Gentry, Sommer E; Segev, Dorry L; Levan, Macey L

PMID: 39477825

ISSN: 1534-6080

CID: 5747132

Journal of the American Medical Informatics Association. 2025:32(2):268-274.DOI: 10.1093/jamia/ocae285

Health system-wide access to generative artificial intelligence: the New York University Langone Health experience

Malhotra, Kiran; Wiesenfeld, Batia; Major, Vincent J; Grover, Himanshu; Aphinyanaphongs, Yindalon; Testa, Paul; Austrian, Jonathan S

OBJECTIVES:The study aimed to assess the usage and impact of a private and secure instance of a generative artificial intelligence (GenAI) application in a large academic health center. The goal was to understand how employees interact with this technology and the influence on their perception of skill and work performance. MATERIALS AND METHODS:New York University Langone Health (NYULH) established a secure, private, and managed Azure OpenAI service (GenAI Studio) and granted widespread access to employees. Usage was monitored and users were surveyed about their experiences. RESULTS:Over 6 months, more than 1007 individuals applied for access, with high usage among research and clinical departments. Users felt prepared to use the GenAI Studio, found it easy to use, and would recommend it to a colleague. Users employed the GenAI Studio for diverse tasks such as writing, editing, summarizing, data analysis, and idea generation. Challenges included difficulties in educating the workforce in constructing effective prompts, as well as token and API limitations. DISCUSSION:The study demonstrated high interest in and extensive use of GenAI in a healthcare setting, with users employing the technology for diverse tasks. While users identified several challenges, they also recognized the potential of GenAI and indicated a need for more instruction and guidance on effective usage. CONCLUSION:The private GenAI Studio provided a useful tool for employees to augment their skills and apply GenAI to their daily tasks. The study underscored the importance of workforce education when implementing system-wide GenAI and provided insights into its strengths and weaknesses.

PMCID:11756645

PMID: 39584477

ISSN: 1527-974x

CID: 5778212

Nature medicine. 2025:31(2):618-626.DOI: 10.1038/s41591-024-03445-1

Medical large language models are vulnerable to data-poisoning attacks

Alber, Daniel Alexander; Yang, Zihao; Alyakin, Anton; Yang, Eunice; Rai, Sumedha; Valliani, Aly A; Zhang, Jeff; Rosenbaum, Gabriel R; Amend-Thomas, Ashley K; Kurland, David B; Kremer, Caroline M; Eremiev, Alexander; Negash, Bruck; Wiggan, Daniel D; Nakatsuka, Michelle A; Sangwon, Karl L; Neifert, Sean N; Khan, Hammad A; Save, Akshay Vinod; Palla, Adhith; Grin, Eric A; Hedman, Monika; Nasir-Moin, Mustafa; Liu, Xujin Chris; Jiang, Lavender Yao; Mankowski, Michal A; Segev, Dorry L; Aphinyanaphongs, Yindalon; Riina, Howard A; Golfinos, John G; Orringer, Daniel A; Kondziolka, Douglas; Oermann, Eric Karl

The adoption of large language models (LLMs) in healthcare demands a careful analysis of their potential to spread false medical knowledge. Because LLMs ingest massive volumes of data from the open Internet during training, they are potentially exposed to unverified medical knowledge that may include deliberately planted misinformation. Here, we perform a threat assessment that simulates a data-poisoning attack against The Pile, a popular dataset used for LLM development. We find that replacement of just 0.001% of training tokens with medical misinformation results in harmful models more likely to propagate medical errors. Furthermore, we discover that corrupted models match the performance of their corruption-free counterparts on open-source benchmarks routinely used to evaluate medical LLMs. Using biomedical knowledge graphs to screen medical LLM outputs, we propose a harm mitigation strategy that captures 91.9% of harmful content (F1 = 85.7%). Our algorithm provides a unique method to validate stochastically generated LLM outputs against hard-coded relationships in knowledge graphs. In view of current calls for improved data provenance and transparent LLM development, we hope to raise awareness of emergent risks from LLMs trained indiscriminately on web-scraped data, particularly in healthcare where misinformation can potentially compromise patient safety.

PMID: 39779928

ISSN: 1546-170x

CID: 5782182

Nature medicine. 2025:31(1):60-69.DOI: 10.1038/s41591-024-03425-5

The TRIPOD-LLM reporting guideline for studies using large language models

Gallifant, Jack; Afshar, Majid; Ameen, Saleem; Aphinyanaphongs, Yindalon; Chen, Shan; Cacciamani, Giovanni; Demner-Fushman, Dina; Dligach, Dmitriy; Daneshjou, Roxana; Fernandes, Chrystinne; Hansen, Lasse Hyldig; Landman, Adam; Lehmann, Lisa; McCoy, Liam G; Miller, Timothy; Moreno, Amy; Munch, Nikolaj; Restrepo, David; Savova, Guergana; Umeton, Renato; Gichoya, Judy Wawira; Collins, Gary S; Moons, Karel G M; Celi, Leo A; Bitterman, Danielle S

Large language models (LLMs) are rapidly being adopted in healthcare, necessitating standardized reporting guidelines. We present transparent reporting of a multivariable model for individual prognosis or diagnosis (TRIPOD)-LLM, an extension of the TRIPOD + artificial intelligence statement, addressing the unique challenges of LLMs in biomedical applications. TRIPOD-LLM provides a comprehensive checklist of 19 main items and 50 subitems, covering key aspects from title to discussion. The guidelines introduce a modular format accommodating various LLM research designs and tasks, with 14 main items and 32 subitems applicable across all categories. Developed through an expedited Delphi process and expert consensus, TRIPOD-LLM emphasizes transparency, human oversight and task-specific performance reporting. We also introduce an interactive website ( https://tripod-llm.vercel.app/ ) facilitating easy guideline completion and PDF generation for submission. As a living document, TRIPOD-LLM will evolve with the field, aiming to enhance the quality, reproducibility and clinical applicability of LLM research in healthcare through comprehensive reporting.

PMID: 39779929

ISSN: 1546-170x

CID: 5777972

PLOS digital health. 2024:3(12).DOI: 10.1371/journal.pdig.0000685

Evaluating Large Language Models in extracting cognitive exam dates and scores

Zhang, Hao; Jethani, Neil; Jones, Simon; Genes, Nicholas; Major, Vincent J; Jaffe, Ian S; Cardillo, Anthony B; Heilenbach, Noah; Ali, Nadia Fazal; Bonanni, Luke J; Clayburn, Andrew J; Khera, Zain; Sadler, Erica C; Prasad, Jaideep; Schlacter, Jamie; Liu, Kevin; Silva, Benjamin; Montgomery, Sophie; Kim, Eric J; Lester, Jacob; Hill, Theodore M; Avoricani, Alba; Chervonski, Ethan; Davydov, James; Small, William; Chakravartty, Eesha; Grover, Himanshu; Dodson, John A; Brody, Abraham A; Aphinyanaphongs, Yindalon; Masurkar, Arjun; Razavian, Narges

Ensuring reliability of Large Language Models (LLMs) in clinical tasks is crucial. Our study assesses two state-of-the-art LLMs (ChatGPT and LlaMA-2) for extracting clinical information, focusing on cognitive tests like MMSE and CDR. Our data consisted of 135,307 clinical notes (Jan 12th, 2010 to May 24th, 2023) mentioning MMSE, CDR, or MoCA. After applying inclusion criteria, 34,465 notes remained, of which 765 were processed by ChatGPT (GPT-4) and LlaMA-2, and 22 experts reviewed the responses. ChatGPT successfully extracted MMSE and CDR instances with dates from 742 notes. We used 20 notes for fine-tuning and training the reviewers. The remaining 722 were assigned to reviewers, with 309 each assigned to two reviewers simultaneously. Inter-rater agreement (Fleiss' Kappa), precision, recall, true/false negative rates, and accuracy were calculated. Our study follows TRIPOD reporting guidelines for model validation. For MMSE information extraction, ChatGPT (vs. LlaMA-2) achieved accuracy of 83% (vs. 66.4%), sensitivity of 89.7% (vs. 69.9%), true-negative rates of 96% (vs. 60.0%), and precision of 82.7% (vs. 62.2%). For CDR, the results were lower overall, with accuracy of 87.1% (vs. 74.5%), sensitivity of 84.3% (vs. 39.7%), true-negative rates of 99.8% (vs. 98.4%), and precision of 48.3% (vs. 16.1%). We qualitatively evaluated the MMSE errors of ChatGPT and LlaMA-2 on double-reviewed notes. LlaMA-2 errors included 27 cases of total hallucination, 19 cases of reporting other scores instead of MMSE, 25 missed scores, and 23 cases of reporting only the wrong date. In comparison, ChatGPT's errors included only 3 cases of total hallucination, 17 cases of wrong test reported instead of MMSE, and 19 cases of reporting a wrong date. In this diagnostic/prognostic study of ChatGPT and LlaMA-2 for extracting cognitive exam dates and scores from clinical notes, ChatGPT exhibited high accuracy, with better performance compared to LlaMA-2.
The use of LLMs could benefit dementia research and clinical care by identifying eligible patients for treatment initiation or clinical trial enrollment. Rigorous evaluation of LLMs is crucial to understanding their capabilities and limitations.

PMCID:11634005

PMID: 39661652

ISSN: 2767-3170

CID: 5762692

Journal of arthroplasty. 2024.DOI: 10.1016/j.arth.2024.10.100

Utilization of Machine Learning Models to More Accurately Predict Case Duration in Primary Total Joint Arthroplasty

Dellicarpini, Gennaro; Passano, Brandon; Yang, Jie; Yassin, Sallie M; Becker, Jacob; Aphinyanaphongs, Yindalon; Capozzi, James

INTRODUCTION:Accurate operative scheduling is essential for the appropriation of operating room (OR) resources. We sought to implement a machine learning (ML) model to predict primary total hip arthroplasty (THA) and total knee arthroplasty (TKA) case time. METHODS:A total of 10,590 THAs and 12,179 TKAs between July 2017 and December 2022 were retrospectively identified. Cases were chronologically divided into training, validation, and test sets. The test set cohort included 1,588 TKAs and 1,204 THAs. Four machine learning algorithms were developed: linear ridge regression (LR), random forest (RF), XGBoost (XGB), and explainable boosting machine (EBM). Each model's case time estimate was compared to the scheduled estimate measured in 15-minute "wait" time blocks ("underbooking") and "excess" time blocks ("overbooking"). Surgical case time was recorded, and SHAP (Shapley Additive exPlanations) values were assigned to patient characteristics, surgical information, and the patient's medical condition to understand feature importance. RESULTS:The most predictive model input was "median previous 30 procedure case times." The XGBoost model outperformed the other models in predicting both TKA and THA case times. The model reduced TKA "excess time blocks" by 85 blocks (P < 0.001) and "wait time blocks" by 96 blocks (P < 0.001). The model did not significantly reduce "excess time blocks" in THA (P = 0.89) but did significantly reduce "wait time blocks" by 134 blocks (P < 0.001). In total, the model improved TKA operative booking by 181 blocks (2,715 minutes) and THA operative booking by 138 blocks (2,070 minutes). CONCLUSIONS:Machine learning outperformed a traditional method of scheduling total joint arthroplasty (TJA) cases. The median time of the prior 30 surgical cases had the greatest influence on the accuracy of scheduled case time.
As ML models improve, surgeons should consider machine learning utilization in case scheduling; however, the median time of the prior 30 surgical cases may serve as an adequate alternative.

PMID: 39477036

ISSN: 1532-8406

CID: 5747082

JAMIA open. 2024:7(3).DOI: 10.1093/jamiaopen/ooae078

Development and evaluation of an artificial intelligence-based workflow for the prioritization of patient portal messages

Yang, Jie; So, Jonathan; Zhang, Hao; Jones, Simon; Connolly, Denise M; Golding, Claudia; Griffes, Esmelin; Szerencsy, Adam C; Wu, Tzer Jason; Aphinyanaphongs, Yindalon; Major, Vincent J

OBJECTIVES:Accelerating demand for patient messaging has impacted the practice of many providers. Messages are not recommended for urgent medical issues, but some do require rapid attention. This presents an opportunity for artificial intelligence (AI) methods to prioritize review of messages. Our study aimed to highlight some patient portal messages for prioritized review using a custom AI system integrated into the electronic health record (EHR). MATERIALS AND METHODS:We developed a Bidirectional Encoder Representations from Transformers (BERT)-based large language model using 40 132 patient-sent messages to identify patterns involving high acuity topics that warrant an immediate callback. The model was then implemented into 2 shared pools of patient messages managed by dozens of registered nurses. A primary outcome, such as the time before messages were read, was evaluated with a difference-in-difference methodology. RESULTS:[...] = 396 466), an improvement exceeding the trend was observed in the time high-scoring messages sit unread (21 minutes, 63 vs 42 for messages sent outside business hours). DISCUSSION:Our work shows great promise in improving care when AI is aligned with human workflow. Future work involves audience expansion, aiding users with suggested actions, and drafting responses. CONCLUSION:Many patients utilize patient portal messages, and while most messages are routine, a small fraction describe alarming symptoms. Our AI-based workflow shortens the turnaround time to get a trained clinician to review these messages to provide safer, higher-quality care.

PMCID:11328532

PMID: 39156046

ISSN: 2574-2531

CID: 5680362

Clinical transplantation. 2024:38(10).DOI: 10.1111/ctr.15466

ChatGPT Solving Complex Kidney Transplant Cases: A Comparative Study With Human Respondents

Mankowski, Michal A; Jaffe, Ian S; Xu, Jingzhi; Bae, Sunjae; Oermann, Eric K; Aphinyanaphongs, Yindalon; McAdams-DeMarco, Mara A; Lonze, Bonnie E; Orandi, Babak J; Stewart, Darren; Levan, Macey; Massie, Allan; Gentry, Sommer; Segev, Dorry L

INTRODUCTION:ChatGPT has shown the ability to answer clinical questions in general medicine but may be constrained by the specialized nature of kidney transplantation. Thus, it is important to explore how ChatGPT can be used in kidney transplantation and how its knowledge compares to human respondents. METHODS:We prompted ChatGPT versions 3.5, 4, and 4 Visual (4 V) with 12 multiple-choice questions related to six kidney transplant cases from 2013 to 2015 American Society of Nephrology (ASN) fellowship program quizzes. We compared the performance of ChatGPT with US nephrology fellowship program directors, nephrology fellows, and the audience of the ASN's annual Kidney Week meeting. RESULTS:Overall, ChatGPT 4 V correctly answered 10 out of 12 questions, showing a performance level comparable to nephrology fellows (group majority correctly answered 9 of 12 questions) and training program directors (11 of 12). This surpassed ChatGPT 4 (7 of 12 correct) and 3.5 (5 of 12). All three ChatGPT versions failed to correctly answer questions where the consensus among human respondents was low. CONCLUSION:Each iterative version of ChatGPT performed better than the prior version, with version 4 V achieving performance on par with nephrology fellows and training program directors. While it shows promise in understanding and answering kidney transplantation questions, ChatGPT should be seen as a complementary tool to human expertise rather than a replacement.

PMCID:11441623

PMID: 39329220

ISSN: 1399-0012

CID: 5714092