Automating the Referral of Bone Metastases Patients With and Without the Use of Large Language Models
Sangwon, Karl L; Han, Xu; Becker, Anton; Zhang, Yuchong; Ni, Richard; Zhang, Jeff; Alber, Daniel Alexander; Alyakin, Anton; Nakatsuka, Michelle; Fabbri, Nicola; Aphinyanaphongs, Yindalon; Yang, Jonathan T; Chachoua, Abraham; Kondziolka, Douglas; Laufer, Ilya; Oermann, Eric Karl
BACKGROUND AND OBJECTIVES/OBJECTIVE:Bone metastases affect more than 4.8% of patients with cancer annually, and spinal metastases in particular require urgent intervention to prevent neurological complications. However, the current process of manually reviewing radiological reports can delay specialist referrals. We hypothesized that natural language processing (NLP) review of routine radiology reports could automate the referral process for timely multidisciplinary care of spinal metastases. METHODS:We assessed 3 NLP models-a rule-based regular expression (RegEx) model, GPT-4, and a specialized Bidirectional Encoder Representations from Transformers (BERT) model (NYUTron)-for automated detection and referral of bone metastases. Study inclusion criteria targeted patients with active cancer diagnoses who underwent advanced imaging (computed tomography, MRI, or positron emission tomography) without a previous specialist referral. We defined 2 separate tasks: identifying clinically significant bone metastatic terms (lexical detection) and identifying cases needing specialist follow-up (clinical referral). Models were developed using 3754 hand-labeled advanced imaging studies in 2 phases: phase 1 focused on spine metastases, and phase 2 generalized to bone metastases. Standard performance metrics were evaluated and compared across all stages and tasks. RESULTS:In the lexical detection task, a simple RegEx model achieved the highest performance (sensitivity 98.4%, specificity 97.6%, F1 = 0.965), followed by NYUTron (sensitivity 96.8%, specificity 89.9%, F1 = 0.787). For the clinical referral task, RegEx also demonstrated superior performance (sensitivity 92.3%, specificity 87.5%, F1 = 0.936), followed by a fine-tuned NYUTron model (sensitivity 90.0%, specificity 66.7%, F1 = 0.750). CONCLUSION/CONCLUSIONS:An NLP-based automated referral system can accurately identify patients with bone metastases requiring specialist evaluation. A simple RegEx model, built on syntax-based identification and expert-informed rules, recommended referral patients more efficiently than the more advanced NLP models. This system could significantly reduce missed follow-ups and enhance timely intervention for patients with bone metastases.
PMID: 40823772
ISSN: 1524-4040
CID: 5908782
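The abstract above does not reproduce the RegEx rules themselves; the following is a minimal, hypothetical sketch of the lexical-detection approach it describes, where the term patterns and negation handling are illustrative assumptions rather than the authors' actual model:

```python
import re

# Hypothetical term patterns for bone/spinal metastasis mentions; the
# paper's expert-informed rules are not published in the abstract.
METASTASIS_PATTERNS = [
    r"\b(osseous|bony|bone|spinal|vertebral)\s+met(s|astas(i|e)s)\b",
    r"\bmetastatic\s+(disease|lesion)s?\b",
    r"\bpathologic(al)?\s+(compression\s+)?fracture\b",
]

# Simple negation cue so phrases like "no evidence of osseous metastasis"
# are not flagged (a common pitfall of purely lexical matching).
NEGATION = re.compile(r"\b(no|without|negative for)\b[^.]{0,60}$", re.IGNORECASE)

def flag_report(text: str) -> bool:
    """Return True if the report likely mentions bone metastases."""
    for pattern in METASTASIS_PATTERNS:
        for match in re.finditer(pattern, text, re.IGNORECASE):
            prefix = text[:match.start()]
            if not NEGATION.search(prefix):
                return True
    return False

print(flag_report("New osseous metastasis at L3 with pathologic fracture."))  # True
print(flag_report("No evidence of osseous metastasis."))                      # False
```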
The TRIPOD-LLM reporting guideline for studies using large language models: a Korean translation
Gallifant, Jack; Afshar, Majid; Ameen, Saleem; Aphinyanaphongs, Yindalon; Chen, Shan; Cacciamani, Giovanni; Demner-Fushman, Dina; Dligach, Dmitriy; Daneshjou, Roxana; Fernandes, Chrystinne; Hansen, Lasse Hyldig; Landman, Adam; Lehmann, Lisa; McCoy, Liam G; Miller, Timothy; Moreno, Amy; Munch, Nikolaj; Restrepo, David; Savova, Guergana; Umeton, Renato; Gichoya, Judy Wawira; Collins, Gary S; Moons, Karel G M; Celi, Leo A; Bitterman, Danielle S
PMID: 40739974
ISSN: 2234-2591
CID: 5903622
Utilization of Machine Learning Models to More Accurately Predict Case Duration in Primary Total Joint Arthroplasty
Dellicarpini, Gennaro; Passano, Brandon; Yang, Jie; Yassin, Sallie M; Becker, Jacob; Aphinyanaphongs, Yindalon; Capozzi, James
INTRODUCTION/BACKGROUND:Accurate operative scheduling is essential for the appropriate allocation of operating room (OR) resources. We sought to implement a machine learning (ML) model to predict primary total hip arthroplasty (THA) and total knee arthroplasty (TKA) case times. METHODS:A total of 10,590 THAs and 12,179 TKAs between July 2017 and December 2022 were retrospectively identified. Cases were chronologically divided into training, validation, and test sets. The test set cohort included 1,588 TKAs and 1,204 THAs. Four machine learning algorithms were developed: linear ridge regression (LR), random forest (RF), XGBoost (XGB), and explainable boosting machine (EBM). Each model's case time estimate was compared with the scheduled estimate, measured in 15-minute "wait" time blocks ("underbooking") and "excess" time blocks ("overbooking"). Surgical case time was recorded, and SHAP (Shapley Additive exPlanations) values were assigned to patient characteristics, surgical information, and the patient's medical condition to understand feature importance. RESULTS:The most predictive model input was "median previous 30 procedure case times." The XGBoost model outperformed the other models in predicting both TKA and THA case times. The model reduced TKA "excess" time blocks by 85 blocks (P < 0.001) and "wait" time blocks by 96 blocks (P < 0.001). The model did not significantly reduce "excess" time blocks in THA (P = 0.89) but did significantly reduce "wait" time blocks by 134 blocks (P < 0.001). In total, the model improved TKA operative booking by 181 blocks (2,715 minutes) and THA operative booking by 138 blocks (2,070 minutes). CONCLUSIONS:Machine learning outperformed a traditional method of scheduling total joint arthroplasty (TJA) cases. The median case time of the prior 30 surgical cases was the most influential input on scheduling accuracy. As ML models improve, surgeons should consider using machine learning in case scheduling; however, the median of the prior 30 surgical cases may serve as an adequate alternative.
PMID: 39477036
ISSN: 1532-8406
CID: 5747082
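A minimal sketch of the modeling pattern described in the abstract above, pairing an XGBoost regressor with SHAP feature attributions; the feature set and synthetic data are assumptions for illustration (only the "median previous 30 procedure case times" input is taken from the abstract):

```python
import numpy as np
import pandas as pd
import xgboost as xgb
import shap

rng = np.random.default_rng(0)

# Synthetic stand-in for the study's features; "median_prev_30_case_time"
# mirrors the most predictive input reported in the abstract.
n = 2000
X = pd.DataFrame({
    "median_prev_30_case_time": rng.normal(90, 15, n),
    "patient_age": rng.integers(45, 90, n),
    "bmi": rng.normal(30, 5, n),
    "asa_class": rng.integers(1, 5, n),
})
y = X["median_prev_30_case_time"] + 0.3 * X["bmi"] + rng.normal(0, 10, n)

# Chronological split, as in the study: train on earlier cases, test on later.
train, test = X.iloc[:1500], X.iloc[1500:]
model = xgb.XGBRegressor(n_estimators=300, max_depth=4, learning_rate=0.05)
model.fit(train, y[:1500])

# SHAP values attribute each prediction to the input features.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(test)
print(pd.Series(np.abs(shap_values).mean(axis=0), index=X.columns)
      .sort_values(ascending=False))
```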
Identification of patients at risk for pancreatic cancer in a 3-year timeframe based on machine learning algorithms
Zhu, Weicheng; Chen, Long; Aphinyanaphongs, Yindalon; Kastrinos, Fay; Simeone, Diane M; Pochapin, Mark; Stender, Cody; Razavian, Narges; Gonda, Tamas A
Early detection of pancreatic cancer (PC) remains challenging, largely because of its low population incidence and few known risk factors. However, screening in at-risk populations and detection of early cancer have the potential to significantly alter survival. In this study, we aimed to develop a predictive model to identify patients at risk of developing new-onset PC within a 2.5- to 3-year time frame. We used the Electronic Health Records (EHR) of a large medical system from 2000 to 2021 (N = 537,410). The EHR data analyzed in this work consist of patients' demographic information, diagnosis records, and lab values, which were used to identify patients diagnosed with pancreatic cancer and the risk factors used in the machine learning algorithm for prediction. We identified 73 risk factors for pancreatic cancer with a Phenome-wide Association Study (PheWAS) on a matched case-control cohort. Based on these factors, we built a large-scale machine learning algorithm on EHR data. A temporally stratified validation, based on patients not included in any stage of model training, was performed. The model achieved an AUROC of 0.742 [0.727, 0.757], which was similar in both the general population and the subset of patients who had prior cross-sectional imaging. The rate of pancreatic cancer diagnosis among those in the top 1 percentile of the risk score was sixfold higher than in the general population. Our model leverages data extracted from a 6-month window of the electronic health record to identify patients at nearly sixfold higher than baseline risk of developing pancreatic cancer 2.5-3 years from evaluation. This approach offers an opportunity to define an enriched population entirely from static data, in which current screening may be recommended.
PMID: 40188106
ISSN: 2045-2322
CID: 5819542
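The headline numbers in the abstract above (AUROC and top-percentile enrichment) can be computed on any held-out risk score, as in this minimal sketch with synthetic data standing in for the study's cohort:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical held-out predictions: risk scores and 3-year PC labels.
rng = np.random.default_rng(1)
y_true = rng.binomial(1, 0.001, 100_000)             # rare outcome
y_score = rng.normal(0, 1, 100_000) + 2.0 * y_true   # imperfect signal

print("AUROC:", roc_auc_score(y_true, y_score))

# Enrichment in the top 1 percentile of risk, as reported in the abstract.
cutoff = np.quantile(y_score, 0.99)
top = y_true[y_score >= cutoff]
print(f"Top-1% incidence is {top.mean() / y_true.mean():.1f}x the overall rate")
```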
Large Language Model-Based Assessment of Clinical Reasoning Documentation in the Electronic Health Record Across Two Institutions: Development and Validation Study
Schaye, Verity; DiTullio, David; Guzman, Benedict Vincent; Vennemeyer, Scott; Shih, Hanniel; Reinstein, Ilan; Weber, Danielle E; Goodman, Abbie; Wu, Danny T Y; Sartori, Daniel J; Santen, Sally A; Gruppen, Larry; Aphinyanaphongs, Yindalon; Burk-Rafel, Jesse
BACKGROUND:Clinical reasoning (CR) is an essential skill, yet physicians often receive limited feedback. Artificial intelligence holds promise to fill this gap. OBJECTIVE:We report the development of named entity recognition (NER), logic-based, and large language model (LLM)-based assessments of CR documentation in the electronic health record across 2 institutions (New York University Grossman School of Medicine [NYU] and University of Cincinnati College of Medicine [UC]). METHODS:Performance was evaluated using F1-scores for the NER, logic-based model and area under the receiver operating characteristic curve (AUROC) and area under the precision-recall curve (AUPRC) for the LLMs. RESULTS:The NER, logic-based model achieved F1-scores of 0.80, 0.74, and 0.80 for D0, D1, and D2, respectively. The GatorTron LLM performed best for EA2 scores (AUROC/AUPRC 0.75/0.69). CONCLUSIONS:This is the first multi-institutional study to apply LLMs for assessing CR documentation in the electronic health record. Such tools can enhance feedback on CR. Lessons learned by implementing these models at distinct institutions support the generalizability of this approach.
PMID: 40117575
ISSN: 1438-8871
CID: 5813782
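For reference, the metrics reported above (F1 for the discrete NER/logic-based outputs, AUROC/AUPRC for the LLM scores) map onto standard scikit-learn calls; the labels and scores below are made up for illustration:

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score, average_precision_score

# Hypothetical labels and model outputs for one documentation domain.
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
ner_pred = np.array([1, 0, 1, 0, 0, 1, 0, 1, 1, 0])                # binary NER/logic output
llm_prob = np.array([.9, .2, .7, .6, .1, .8, .3, .4, .75, .2])     # LLM probability

print("F1 (NER/logic):", f1_score(y_true, ner_pred))
print("AUROC (LLM):", roc_auc_score(y_true, llm_prob))
print("AUPRC (LLM):", average_precision_score(y_true, llm_prob))
```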
Trials and Tribulations: Responses of ChatGPT to Patient Questions About Kidney Transplantation
Xu, Jingzhi; Mankowski, Michal; Vanterpool, Karen B; Strauss, Alexandra T; Lonze, Bonnie E; Orandi, Babak J; Stewart, Darren; Bae, Sunjae; Ali, Nicole; Stern, Jeffrey; Mattoo, Aprajita; Robalino, Ryan; Soomro, Irfana; Weldon, Elaina; Oermann, Eric K; Aphinyanaphongs, Yin; Sidoti, Carolyn; McAdams-DeMarco, Mara; Massie, Allan B; Gentry, Sommer E; Segev, Dorry L; Levan, Macey L
PMID: 39477825
ISSN: 1534-6080
CID: 5747132
Health system-wide access to generative artificial intelligence: the New York University Langone Health experience
Malhotra, Kiran; Wiesenfeld, Batia; Major, Vincent J; Grover, Himanshu; Aphinyanaphongs, Yindalon; Testa, Paul; Austrian, Jonathan S
OBJECTIVES/OBJECTIVE:The study aimed to assess the usage and impact of a private, secure instance of a generative artificial intelligence (GenAI) application in a large academic health center. The goal was to understand how employees interact with this technology and its influence on their perception of skill and work performance. MATERIALS AND METHODS/METHODS:New York University Langone Health (NYULH) established a secure, private, and managed Azure OpenAI service (GenAI Studio) and granted widespread access to employees. Usage was monitored, and users were surveyed about their experiences. RESULTS:Over 6 months, 1007 individuals applied for access, with high usage among research and clinical departments. Users felt prepared to use GenAI Studio, found it easy to use, and would recommend it to a colleague. Users employed GenAI Studio for diverse tasks such as writing, editing, summarizing, data analysis, and idea generation. Challenges included difficulty in educating the workforce to construct effective prompts, as well as token and API limitations. DISCUSSION/CONCLUSIONS:The study demonstrated high interest in and extensive use of GenAI in a healthcare setting, with users employing the technology for diverse tasks. While users identified several challenges, they also recognized the potential of GenAI and indicated a need for more instruction and guidance on effective usage. CONCLUSION/CONCLUSIONS:The private GenAI Studio provided a useful tool for employees to augment their skills and apply GenAI to their daily tasks. The study underscored the importance of workforce education when implementing system-wide GenAI and provided insights into its strengths and weaknesses.
PMCID:11756645
PMID: 39584477
ISSN: 1527-974x
CID: 5778212
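As a rough sketch of what a private, managed Azure OpenAI deployment like the one described above looks like from the user's side, here is an example using the openai Python client; the endpoint, key handling, and deployment name are placeholders, not NYULH's actual configuration:

```python
from openai import AzureOpenAI

# Hypothetical endpoint and deployment names; a real instance such as
# GenAI Studio would sit behind institution-managed credentials and logging.
client = AzureOpenAI(
    azure_endpoint="https://example-private-instance.openai.azure.com",
    api_key="YOUR_KEY",
    api_version="2024-02-01",
)

response = client.chat.completions.create(
    model="gpt-4",  # name of the private model deployment
    messages=[
        {"role": "system", "content": "You are a helpful writing assistant."},
        {"role": "user", "content": "Summarize this meeting note in two sentences: ..."},
    ],
)
print(response.choices[0].message.content)
```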
Medical large language models are vulnerable to data-poisoning attacks
Alber, Daniel Alexander; Yang, Zihao; Alyakin, Anton; Yang, Eunice; Rai, Sumedha; Valliani, Aly A; Zhang, Jeff; Rosenbaum, Gabriel R; Amend-Thomas, Ashley K; Kurland, David B; Kremer, Caroline M; Eremiev, Alexander; Negash, Bruck; Wiggan, Daniel D; Nakatsuka, Michelle A; Sangwon, Karl L; Neifert, Sean N; Khan, Hammad A; Save, Akshay Vinod; Palla, Adhith; Grin, Eric A; Hedman, Monika; Nasir-Moin, Mustafa; Liu, Xujin Chris; Jiang, Lavender Yao; Mankowski, Michal A; Segev, Dorry L; Aphinyanaphongs, Yindalon; Riina, Howard A; Golfinos, John G; Orringer, Daniel A; Kondziolka, Douglas; Oermann, Eric Karl
The adoption of large language models (LLMs) in healthcare demands a careful analysis of their potential to spread false medical knowledge. Because LLMs ingest massive volumes of data from the open Internet during training, they are potentially exposed to unverified medical knowledge that may include deliberately planted misinformation. Here, we perform a threat assessment that simulates a data-poisoning attack against The Pile, a popular dataset used for LLM development. We find that replacement of just 0.001% of training tokens with medical misinformation results in harmful models more likely to propagate medical errors. Furthermore, we discover that corrupted models match the performance of their corruption-free counterparts on open-source benchmarks routinely used to evaluate medical LLMs. Using biomedical knowledge graphs to screen medical LLM outputs, we propose a harm mitigation strategy that captures 91.9% of harmful content (F1 = 85.7%). Our algorithm provides a unique method to validate stochastically generated LLM outputs against hard-coded relationships in knowledge graphs. In view of current calls for improved data provenance and transparent LLM development, we hope to raise awareness of emergent risks from LLMs trained indiscriminately on web-scraped data, particularly in healthcare where misinformation can potentially compromise patient safety.
PMID: 39779928
ISSN: 1546-170x
CID: 5782182
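The mitigation described above validates LLM outputs against hard-coded relationships in biomedical knowledge graphs. A toy sketch of that screening idea follows; the triple store and the relation extractor are stubs standing in for real components:

```python
# Toy triple store of verified (subject, relation, object) facts; a real
# system would query a curated biomedical knowledge graph instead.
KNOWN_TRIPLES = {
    ("metformin", "treats", "type 2 diabetes"),
    ("aspirin", "treats", "pain"),
}

def extract_triples(text: str) -> list[tuple[str, str, str]]:
    """Stub relation extractor; in practice this is itself an NLP model."""
    text = text.lower()
    candidates = [("metformin", "cures", "type 2 diabetes"),
                  ("metformin", "treats", "type 2 diabetes")]
    return [(s, r, o) for s, r, o in candidates
            if s in text and r in text and o in text]

def flag_unverified(llm_output: str) -> bool:
    """Flag output containing claims absent from the knowledge graph."""
    return any(t not in KNOWN_TRIPLES for t in extract_triples(llm_output))

print(flag_unverified("Metformin treats type 2 diabetes."))  # False: verified
print(flag_unverified("Metformin cures type 2 diabetes."))   # True: flagged
```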
The TRIPOD-LLM reporting guideline for studies using large language models
Gallifant, Jack; Afshar, Majid; Ameen, Saleem; Aphinyanaphongs, Yindalon; Chen, Shan; Cacciamani, Giovanni; Demner-Fushman, Dina; Dligach, Dmitriy; Daneshjou, Roxana; Fernandes, Chrystinne; Hansen, Lasse Hyldig; Landman, Adam; Lehmann, Lisa; McCoy, Liam G; Miller, Timothy; Moreno, Amy; Munch, Nikolaj; Restrepo, David; Savova, Guergana; Umeton, Renato; Gichoya, Judy Wawira; Collins, Gary S; Moons, Karel G M; Celi, Leo A; Bitterman, Danielle S
Large language models (LLMs) are rapidly being adopted in healthcare, necessitating standardized reporting guidelines. We present transparent reporting of a multivariable model for individual prognosis or diagnosis (TRIPOD)-LLM, an extension of the TRIPOD + artificial intelligence statement, addressing the unique challenges of LLMs in biomedical applications. TRIPOD-LLM provides a comprehensive checklist of 19 main items and 50 subitems, covering key aspects from title to discussion. The guidelines introduce a modular format accommodating various LLM research designs and tasks, with 14 main items and 32 subitems applicable across all categories. Developed through an expedited Delphi process and expert consensus, TRIPOD-LLM emphasizes transparency, human oversight and task-specific performance reporting. We also introduce an interactive website ( https://tripod-llm.vercel.app/ ) facilitating easy guideline completion and PDF generation for submission. As a living document, TRIPOD-LLM will evolve with the field, aiming to enhance the quality, reproducibility and clinical applicability of LLM research in healthcare through comprehensive reporting.
PMID: 39779929
ISSN: 1546-170x
CID: 5777972
Evaluating Large Language Models in extracting cognitive exam dates and scores
Zhang, Hao; Jethani, Neil; Jones, Simon; Genes, Nicholas; Major, Vincent J; Jaffe, Ian S; Cardillo, Anthony B; Heilenbach, Noah; Ali, Nadia Fazal; Bonanni, Luke J; Clayburn, Andrew J; Khera, Zain; Sadler, Erica C; Prasad, Jaideep; Schlacter, Jamie; Liu, Kevin; Silva, Benjamin; Montgomery, Sophie; Kim, Eric J; Lester, Jacob; Hill, Theodore M; Avoricani, Alba; Chervonski, Ethan; Davydov, James; Small, William; Chakravartty, Eesha; Grover, Himanshu; Dodson, John A; Brody, Abraham A; Aphinyanaphongs, Yindalon; Masurkar, Arjun; Razavian, Narges
Ensuring the reliability of Large Language Models (LLMs) in clinical tasks is crucial. Our study assesses two state-of-the-art LLMs (ChatGPT and LlaMA-2) for extracting clinical information, focusing on cognitive tests such as the MMSE and CDR. Our data consisted of 135,307 clinical notes (January 12, 2010 to May 24, 2023) mentioning MMSE, CDR, or MoCA. After applying inclusion criteria, 34,465 notes remained, of which 765 were processed by ChatGPT (GPT-4) and LlaMA-2, and 22 experts reviewed the responses. ChatGPT successfully extracted MMSE and CDR instances with dates from 742 notes. We used 20 notes for fine-tuning and training the reviewers. The remaining 722 were assigned to reviewers, 309 of which were assigned to two reviewers simultaneously. Inter-rater agreement (Fleiss' kappa), precision, recall, true/false negative rates, and accuracy were calculated. Our study follows TRIPOD reporting guidelines for model validation. For MMSE information extraction, ChatGPT (vs. LlaMA-2) achieved accuracy of 83% (vs. 66.4%), sensitivity of 89.7% (vs. 69.9%), a true-negative rate of 96% (vs. 60.0%), and precision of 82.7% (vs. 62.2%). For CDR, the results were lower overall, with accuracy of 87.1% (vs. 74.5%), sensitivity of 84.3% (vs. 39.7%), a true-negative rate of 99.8% (vs. 98.4%), and precision of 48.3% (vs. 16.1%). We qualitatively evaluated the MMSE errors of ChatGPT and LlaMA-2 on the double-reviewed notes. LlaMA-2's errors included 27 cases of total hallucination, 19 cases of reporting other scores instead of MMSE, 25 missed scores, and 23 cases of reporting only the wrong date. In comparison, ChatGPT's errors included only 3 cases of total hallucination, 17 cases of reporting the wrong test instead of MMSE, and 19 cases of reporting a wrong date. In this diagnostic/prognostic study of ChatGPT and LlaMA-2 for extracting cognitive exam dates and scores from clinical notes, ChatGPT exhibited high accuracy and better performance than LlaMA-2. The use of LLMs could benefit dementia research and clinical care by identifying patients eligible for treatment initiation or clinical trial enrollment. Rigorous evaluation of LLMs is crucial to understanding their capabilities and limitations.
PMCID:11634005
PMID: 39661652
ISSN: 2767-3170
CID: 5762692
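A minimal sketch of the extraction pattern evaluated above, prompting a chat model to pull test names, scores, and dates from a note; the prompt wording, JSON schema, and model name are assumptions, not the study's actual protocol:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

note = "Seen 03/14/2022. MMSE 24/30 today; prior CDR 0.5 on 01/10/2021."

prompt = (
    "Extract every cognitive test result from the clinical note below. "
    'Return JSON: [{"test": ..., "score": ..., "date": ...}]. '
    "If a field is absent, use null. Do not invent values.\n\n" + note
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,  # deterministic output reduces hallucinated scores
)
print(response.choices[0].message.content)
```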