Evaluating Large Language Models in extracting cognitive exam dates and scores
Zhang, Hao; Jethani, Neil; Jones, Simon; Genes, Nicholas; Major, Vincent J; Jaffe, Ian S; Cardillo, Anthony B; Heilenbach, Noah; Ali, Nadia Fazal; Bonanni, Luke J; Clayburn, Andrew J; Khera, Zain; Sadler, Erica C; Prasad, Jaideep; Schlacter, Jamie; Liu, Kevin; Silva, Benjamin; Montgomery, Sophie; Kim, Eric J; Lester, Jacob; Hill, Theodore M; Avoricani, Alba; Chervonski, Ethan; Davydov, James; Small, William; Chakravartty, Eesha; Grover, Himanshu; Dodson, John A; Brody, Abraham A; Aphinyanaphongs, Yindalon; Masurkar, Arjun; Razavian, Narges
Ensuring reliability of Large Language Models (LLMs) in clinical tasks is crucial. Our study assesses two state-of-the-art LLMs (ChatGPT and LlaMA-2) for extracting clinical information, focusing on cognitive tests such as the MMSE and CDR. Our data consisted of 135,307 clinical notes (Jan 12th, 2010 to May 24th, 2023) mentioning MMSE, CDR, or MoCA. After applying inclusion criteria, 34,465 notes remained, of which 765 were processed with ChatGPT (GPT-4) and LlaMA-2, and 22 experts reviewed the responses. ChatGPT successfully extracted MMSE and CDR instances with dates from 742 notes. We used 20 notes for fine-tuning and reviewer training. The remaining 722 were assigned to reviewers, of which 309 were each assigned to two reviewers simultaneously. Inter-rater agreement (Fleiss' Kappa), precision, recall, true/false negative rates, and accuracy were calculated. Our study follows TRIPOD reporting guidelines for model validation. For MMSE information extraction, ChatGPT (vs. LlaMA-2) achieved accuracy of 83% (vs. 66.4%), sensitivity of 89.7% (vs. 69.9%), true-negative rates of 96% (vs. 60.0%), and precision of 82.7% (vs. 62.2%). For CDR information extraction, precision was markedly lower: accuracy of 87.1% (vs. 74.5%), sensitivity of 84.3% (vs. 39.7%), true-negative rates of 99.8% (vs. 98.4%), and precision of 48.3% (vs. 16.1%). We qualitatively evaluated the MMSE errors of ChatGPT and LlaMA-2 on double-reviewed notes. LlaMA-2 errors included 27 cases of total hallucination, 19 cases of reporting other scores instead of MMSE, 25 missed scores, and 23 cases of reporting only the wrong date. In comparison, ChatGPT's errors included only 3 cases of total hallucination, 17 cases of reporting the wrong test instead of MMSE, and 19 cases of reporting a wrong date. In this diagnostic/prognostic study of ChatGPT and LlaMA-2 for extracting cognitive exam dates and scores from clinical notes, ChatGPT exhibited high accuracy and performed better than LlaMA-2. The use of LLMs could benefit dementia research and clinical care by identifying patients eligible for treatment initiation or clinical trial enrollment. Rigorous evaluation of LLMs is crucial to understanding their capabilities and limitations.
PMCID:11634005
PMID: 39661652
ISSN: 2767-3170
CID: 5762692
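As a rough illustration of the evaluation metrics named in the abstract above (accuracy, sensitivity/recall, precision, true-negative rate, and Fleiss' Kappa), the following Python sketch computes them from binary expert judgments; the data, variable names, and toolchain (scikit-learn, statsmodels) are assumptions for illustration, not the study's code.

import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Toy expert ground truth and LLM extraction correctness (1 = score present / correctly extracted).
y_true = np.array([1, 1, 0, 1, 0, 0, 1, 1, 0, 1])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 1, 0, 1])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("accuracy:", accuracy_score(y_true, y_pred))
print("sensitivity (recall):", recall_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("true-negative rate:", tn / (tn + fp))

# Inter-rater agreement for double-reviewed notes: one row per note, one column per reviewer.
ratings = np.array([[1, 1], [0, 0], [1, 0], [1, 1], [0, 0], [1, 1]])
table, _ = aggregate_raters(ratings)
print("Fleiss' Kappa:", fleiss_kappa(table))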
Evaluating Patient-Oriented Echocardiogram Reports Augmented by Artificial Intelligence [Letter]
Martin, Jacob A; Hill, Theodore; Saric, Muhamed; Vainrib, Alan F; Bamira, Daniel; Bernard, Samuel; Ro, Richard; Zhang, Hao; Austrian, Jonathan S; Aphinyanaphongs, Yindalon; Koesmahargyo, Vidya; Williams, Mathew R; Chinitz, Larry A; Jankelson, Lior
PMID: 39093252
ISSN: 1876-7591
CID: 5743582
Trials and Tribulations: Responses of ChatGPT to Patient Questions About Kidney Transplantation
Xu, Jingzhi; Mankowski, Michal; Vanterpool, Karen B; Strauss, Alexandra T; Lonze, Bonnie E; Orandi, Babak J; Stewart, Darren; Bae, Sunjae; Ali, Nicole; Stern, Jeffrey; Mattoo, Aprajita; Robalino, Ryan; Soomro, Irfana; Weldon, Elaina; Oermann, Eric K; Aphinyanaphongs, Yin; Sidoti, Carolyn; McAdams-DeMarco, Mara; Massie, Allan B; Gentry, Sommer E; Segev, Dorry L; Levan, Macey L
PMID: 39477825
ISSN: 1534-6080
CID: 5747132
Utilization of Machine Learning Models to More Accurately Predict Case Duration in Primary Total Joint Arthroplasty
Dellicarpini, Gennaro; Passano, Brandon; Yang, Jie; Yassin, Sallie M; Becker, Jacob; Aphinyanaphongs, Yindalon; Capozzi, James
INTRODUCTION/BACKGROUND:Accurate operative scheduling is essential for the appropriate allocation of operating room (OR) resources. We sought to implement a machine learning (ML) model to predict primary total hip (THA) and total knee arthroplasty (TKA) case time. METHODS:A total of 10,590 THAs and 12,179 TKAs between July 2017 and December 2022 were retrospectively identified. Cases were chronologically divided into training, validation, and test sets. The test set cohort included 1,588 TKAs and 1,204 THAs. Four machine learning algorithms were developed: linear ridge regression (LR), random forest (RF), XGBoost (XGB), and explainable boosting machine (EBM). Each model's case time estimate was compared to the scheduled estimate, measured in 15-minute "wait" time blocks ("underbooking") and "excess" time blocks ("overbooking"). Surgical case time was recorded, and SHAP (Shapley Additive exPlanations) values were computed for patient characteristics, surgical information, and the patient's medical condition to understand feature importance. RESULTS:The most predictive model input was "median previous 30 procedure case times." The XGBoost model outperformed the other models in predicting both TKA and THA case times. The model reduced TKA "excess" time blocks by 85 blocks (P < 0.001) and "wait" time blocks by 96 blocks (P < 0.001). The model did not significantly reduce "excess" time blocks in THA (P = 0.89) but did significantly reduce "wait" time blocks by 134 blocks (P < 0.001). In total, the model improved TKA operative booking by 181 blocks (2,715 minutes) and THA operative booking by 138 blocks (2,070 minutes). CONCLUSIONS:Machine learning outperformed a traditional method of scheduling total joint arthroplasty (TJA) cases. The median case time of the prior 30 surgical cases was the most influential feature for scheduling accuracy. As ML models improve, surgeons should consider using machine learning in case scheduling; however, the median of the prior 30 surgical cases may serve as an adequate alternative.
PMID: 39477036
ISSN: 1532-8406
CID: 5747082
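The following Python sketch illustrates, under assumed data and feature names, the kind of pipeline the abstract above describes: an XGBoost regressor for case duration, SHAP values for feature importance, and a comparison of estimates against actual times in 15-minute "excess" (overbooking) and "wait" (underbooking) blocks. It is not the authors' implementation.

import numpy as np
import pandas as pd
import xgboost as xgb
import shap

rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "median_prev_30_case_times": rng.normal(95, 10, n),  # hypothetical feature names
    "patient_bmi": rng.normal(30, 5, n),
    "asa_class": rng.integers(1, 5, n).astype(float),
})
actual = X["median_prev_30_case_times"] + rng.normal(0, 12, n)  # simulated actual case minutes

model = xgb.XGBRegressor(n_estimators=200, max_depth=4).fit(X, actual)
pred = model.predict(X)

# Mean absolute SHAP value per feature as a simple importance ranking.
shap_values = shap.TreeExplainer(model).shap_values(X)
print(pd.Series(np.abs(shap_values).mean(axis=0), index=X.columns).sort_values(ascending=False))

def booking_blocks(estimate, actual_minutes, block=15):
    """Count 15-minute blocks of overbooking ("excess") and underbooking ("wait")."""
    excess = np.floor(np.maximum(estimate - actual_minutes, 0) / block).sum()
    wait = np.floor(np.maximum(actual_minutes - estimate, 0) / block).sum()
    return int(excess), int(wait)

print("model estimate   (excess, wait):", booking_blocks(pred, actual))
print("fixed 105-minute (excess, wait):", booking_blocks(np.full(n, 105.0), actual))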
Development and evaluation of an artificial intelligence-based workflow for the prioritization of patient portal messages
Yang, Jie; So, Jonathan; Zhang, Hao; Jones, Simon; Connolly, Denise M; Golding, Claudia; Griffes, Esmelin; Szerencsy, Adam C; Wu, Tzer Jason; Aphinyanaphongs, Yindalon; Major, Vincent J
OBJECTIVES/UNASSIGNED:Accelerating demand for patient messaging has impacted the practice of many providers. Messages are not recommended for urgent medical issues, but some do require rapid attention. This presents an opportunity for artificial intelligence (AI) methods to prioritize review of messages. Our study aimed to highlight some patient portal messages for prioritized review using a custom AI system integrated into the electronic health record (EHR). MATERIALS AND METHODS/UNASSIGNED:We developed a Bidirectional Encoder Representations from Transformers (BERT)-based large language model using 40 132 patient-sent messages to identify patterns involving high-acuity topics that warrant an immediate callback. The model was then implemented into 2 shared pools of patient messages managed by dozens of registered nurses. The primary outcome, the time before messages were read, was evaluated with a difference-in-differences methodology. RESULTS/UNASSIGNED:Among patient messages sent during the study period (n = 396 466), an improvement exceeding the underlying trend was observed in the time high-scoring messages sat unread (21 minutes; 63 vs 42 minutes for messages sent outside business hours). DISCUSSION/UNASSIGNED:Our work shows great promise in improving care when AI is aligned with human workflow. Future work involves audience expansion, aiding users with suggested actions, and drafting responses. CONCLUSION/UNASSIGNED:Many patients utilize patient portal messages, and while most messages are routine, a small fraction describe alarming symptoms. Our AI-based workflow shortens the turnaround time for a trained clinician to review these messages, supporting safer, higher-quality care.
PMCID:11328532
PMID: 39156046
ISSN: 2574-2531
CID: 5680362
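A minimal, hypothetical sketch of the approach described in the abstract above: fine-tuning a BERT encoder to classify patient messages as high acuity (warranting immediate callback) or routine, using the Hugging Face transformers and datasets libraries. The model name, texts, labels, and training settings are placeholders, not the deployed system.

from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

texts = ["I have crushing chest pain and can't catch my breath",  # made-up examples
         "Could you please refill my cholesterol medication?"]
labels = [1, 0]  # 1 = high acuity, warrants immediate callback

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

ds = Dataset.from_dict({"text": texts, "label": labels})
ds = ds.map(lambda batch: tokenizer(batch["text"], truncation=True,
                                    padding="max_length", max_length=128), batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="acuity_model", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=ds,
)
trainer.train()
# At inference, the softmax probability of the high-acuity class becomes the
# prioritization score used to surface messages for earlier review.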
ChatGPT Solving Complex Kidney Transplant Cases: A Comparative Study With Human Respondents
Mankowski, Michal A; Jaffe, Ian S; Xu, Jingzhi; Bae, Sunjae; Oermann, Eric K; Aphinyanaphongs, Yindalon; McAdams-DeMarco, Mara A; Lonze, Bonnie E; Orandi, Babak J; Stewart, Darren; Levan, Macey; Massie, Allan; Gentry, Sommer; Segev, Dorry L
INTRODUCTION/BACKGROUND:ChatGPT has shown the ability to answer clinical questions in general medicine but may be constrained by the specialized nature of kidney transplantation. Thus, it is important to explore how ChatGPT can be used in kidney transplantation and how its knowledge compares to human respondents. METHODS:We prompted ChatGPT versions 3.5, 4, and 4 Visual (4V) with 12 multiple-choice questions related to six kidney transplant cases from the 2013 to 2015 American Society of Nephrology (ASN) fellowship program quizzes. We compared the performance of ChatGPT with US nephrology fellowship program directors, nephrology fellows, and the audience of the ASN's annual Kidney Week meeting. RESULTS:Overall, ChatGPT 4V correctly answered 10 out of 12 questions, showing a performance level comparable to nephrology fellows (the group majority correctly answered 9 of 12 questions) and training program directors (11 of 12). This surpassed ChatGPT 4 (7 of 12 correct) and 3.5 (5 of 12). All three ChatGPT versions failed to correctly answer questions where the consensus among human respondents was low. CONCLUSION/CONCLUSIONS:Each iterative version of ChatGPT performed better than the prior version, with version 4V achieving performance on par with nephrology fellows and training program directors. While it shows promise in understanding and answering kidney transplantation questions, ChatGPT should be seen as a complementary tool to human expertise rather than a replacement.
PMCID:11441623
PMID: 39329220
ISSN: 1399-0012
CID: 5714092
Evaluation of GPT-4 ability to identify and generate patient instructions for actionable incidental radiology findings
Woo, Kar-Mun C; Simon, Gregory W; Akindutire, Olumide; Aphinyanaphongs, Yindalon; Austrian, Jonathan S; Kim, Jung G; Genes, Nicholas; Goldenring, Jacob A; Major, Vincent J; Pariente, Chloé S; Pineda, Edwin G; Kang, Stella K
OBJECTIVES/OBJECTIVE:To evaluate the proficiency of a HIPAA-compliant version of GPT-4 in identifying actionable, incidental findings from unstructured radiology reports of Emergency Department patients. To assess appropriateness of artificial intelligence (AI)-generated, patient-facing summaries of these findings. MATERIALS AND METHODS/METHODS:Radiology reports extracted from the electronic health record of a large academic medical center were manually reviewed to identify non-emergent, incidental findings with high likelihood of requiring follow-up, further sub-stratified as "definitely actionable" (DA) or "possibly actionable-clinical correlation" (PA-CC). Instruction prompts to GPT-4 were developed and iteratively optimized using a validation set of 50 reports. The optimized prompt was then applied to a test set of 430 unseen reports. GPT-4 performance was primarily graded on accuracy identifying either DA or PA-CC findings, then secondarily for DA findings alone. Outputs were reviewed for hallucinations. AI-generated patient-facing summaries were assessed for appropriateness via Likert scale. RESULTS:For the primary outcome (DA or PA-CC), GPT-4 achieved 99.3% recall, 73.6% precision, and 84.5% F-1. For the secondary outcome (DA only), GPT-4 demonstrated 95.2% recall, 77.3% precision, and 85.3% F-1. No findings were "hallucinated" outright. However, 2.8% of cases included generated text about recommendations that were inferred without specific reference. The majority of True Positive AI-generated summaries required no or minor revision. CONCLUSION/CONCLUSIONS:GPT-4 demonstrates proficiency in detecting actionable, incidental findings after refined instruction prompting. AI-generated patient instructions were most often appropriate, but rarely included inferred recommendations. While this technology shows promise to augment diagnostics, active clinician oversight via "human-in-the-loop" workflows remains critical for clinical implementation.
PMID: 38778578
ISSN: 1527-974x
CID: 5654832
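For readers unfamiliar with instruction prompting, the sketch below shows a generic way to ask a chat LLM to flag actionable incidental findings in a radiology report and draft a patient-facing summary, using the OpenAI Python client. The prompt wording, model name, and report text are placeholders; the study used an iteratively optimized prompt on a HIPAA-compliant GPT-4 instance, which is not reproduced here.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_PROMPT = (
    "You review emergency department radiology reports. List any non-emergent "
    "incidental findings, label each as 'definitely actionable' or 'possibly "
    "actionable - clinical correlation', and write a brief patient-facing summary "
    "with follow-up instructions. If there are none, say so. Do not invent findings."
)

report = "CT abdomen/pelvis: ... incidental 8 mm left lower lobe pulmonary nodule ..."

response = client.chat.completions.create(
    model="gpt-4",
    temperature=0,
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": report},
    ],
)
print(response.choices[0].message.content)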
Large Language Model-Based Responses to Patients' In-Basket Messages
Small, William R; Wiesenfeld, Batia; Brandfield-Harvey, Beatrix; Jonassen, Zoe; Mandal, Soumik; Stevens, Elizabeth R; Major, Vincent J; Lostraglio, Erin; Szerencsy, Adam; Jones, Simon; Aphinyanaphongs, Yindalon; Johnson, Stephen B; Nov, Oded; Mann, Devin
IMPORTANCE/UNASSIGNED:Virtual patient-physician communications have increased since 2020 and negatively impacted primary care physician (PCP) well-being. Generative artificial intelligence (GenAI) drafts of patient messages could potentially reduce health care professional (HCP) workload and improve communication quality, but only if the drafts are considered useful. OBJECTIVES/UNASSIGNED:To assess PCPs' perceptions of GenAI drafts and to examine linguistic characteristics associated with equity and perceived empathy. DESIGN, SETTING, AND PARTICIPANTS/UNASSIGNED:This cross-sectional quality improvement study tested the hypothesis that PCPs' ratings of GenAI drafts (created using the electronic health record [EHR] standard prompts) would be equivalent to HCP-generated responses on 3 dimensions. The study was conducted at NYU Langone Health using private patient-HCP communications at 3 internal medicine practices piloting GenAI. EXPOSURES/UNASSIGNED:Randomly assigned patient messages coupled with either an HCP message or the draft GenAI response. MAIN OUTCOMES AND MEASURES/UNASSIGNED:PCPs rated responses' information content quality (eg, relevance) and communication quality (eg, verbosity) on Likert scales, and indicated whether they would use the draft or start anew (usable vs unusable). Branching logic further probed for empathy, personalization, and professionalism of responses. Computational linguistics methods assessed content differences in HCP vs GenAI responses, focusing on equity and empathy. RESULTS/UNASSIGNED:A total of 16 PCPs (8 [50.0%] female) reviewed 344 messages (175 GenAI drafted; 169 HCP drafted). Both GenAI and HCP responses were rated favorably. GenAI responses were rated higher for communication style than HCP responses (mean [SD], 3.70 [1.15] vs 3.38 [1.20]; P = .01, U = 12 568.5) but were similar to HCPs on information content (mean [SD], 3.53 [1.26] vs 3.41 [1.27]; P = .37; U = 13 981.0) and usable draft proportion (mean [SD], 0.69 [0.48] vs 0.65 [0.47], P = .49, t = -0.6842). Usable GenAI responses were considered more empathetic than usable HCP responses (32 of 86 [37.2%] vs 13 of 79 [16.5%]; difference, 125.5%), possibly attributable to more subjective (mean [SD], 0.54 [0.16] vs 0.31 [0.23]; P < .001; difference, 74.2%) and positive (mean [SD] polarity, 0.21 [0.14] vs 0.13 [0.25]; P = .02; difference, 61.5%) language; they were also numerically longer (mean [SD] word count, 90.5 [32.0] vs 65.4 [62.6]; difference, 38.4%), although this difference was not statistically significant (P = .07), and more linguistically complex (mean [SD] score, 125.2 [47.8] vs 95.4 [58.8]; P = .002; difference, 31.2%). CONCLUSIONS/UNASSIGNED:In this cross-sectional study of PCP perceptions of an EHR-integrated GenAI chatbot, GenAI drafts were rated higher for communication style and perceived empathy than HCP responses, highlighting their potential to enhance patient-HCP communication. However, GenAI drafts were less readable than HCPs', a significant concern for patients with low health or English literacy.
PMCID:11252893
PMID: 39012633
ISSN: 2574-3805
CID: 5686582
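As a hedged illustration of the computational-linguistics features mentioned above (polarity, subjectivity, and length of draft replies), the snippet below uses TextBlob on invented text; the study's actual feature set and toolchain are not specified here.

from textblob import TextBlob

drafts = {
    "genai": "I'm so sorry you're feeling unwell. Thank you for reaching out; let's ...",
    "hcp": "Please schedule a follow-up visit to discuss your results.",
}

for source, text in drafts.items():
    blob = TextBlob(text)
    print(source,
          "polarity:", round(blob.sentiment.polarity, 2),          # positive vs negative tone
          "subjectivity:", round(blob.sentiment.subjectivity, 2),  # subjective vs objective language
          "words:", len(blob.words))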
The First Generative AI Prompt-A-Thon in Healthcare: A Novel Approach to Workforce Engagement with a Private Instance of ChatGPT
Small, William R; Malhotra, Kiran; Major, Vincent J; Wiesenfeld, Batia; Lewis, Marisa; Grover, Himanshu; Tang, Huming; Banerjee, Arnab; Jabbour, Michael J; Aphinyanaphongs, Yindalon; Testa, Paul; Austrian, Jonathan S
BACKGROUND:Healthcare crowdsourcing events (e.g. hackathons) facilitate interdisciplinary collaboration and encourage innovation. Peer-reviewed research has not yet considered a healthcare crowdsourcing event focusing on generative artificial intelligence (GenAI), which generates text in response to detailed prompts and has vast potential for improving the efficiency of healthcare organizations. Our event, the New York University Langone Health (NYULH) Prompt-a-thon, primarily sought to inspire and build AI fluency within our diverse NYULH community, and to foster collaboration and innovation. Secondarily, we sought to analyze how participants' experience was influenced by their prior GenAI exposure and whether they received sample prompts during the workshop. METHODS:Executing the event required the assembly of an expert planning committee, who recruited diverse participants, anticipated technological challenges, and prepared the event. The event was composed of didactics and workshop sessions, which educated participants and allowed them to experiment with GenAI on real healthcare data. Participants were given novel "project cards" associated with each dataset that illuminated the tasks GenAI could perform and, for a random set of teams, sample prompts to help them achieve each task (the public repository of project cards can be found at https://github.com/smallw03/NYULH-Generative-AI-Prompt-a-thon-Project-Cards). Afterwards, participants were asked to fill out a survey with 7-point Likert-style questions. RESULTS:Our event was successful in educating and inspiring hundreds of enthusiastic in-person and virtual participants across our organization on the responsible use of GenAI in a low-cost and technologically feasible manner. On average, participants responded positively to each of the survey questions (e.g., confidence in their ability to use and trust GenAI). Critically, participants reported a self-perceived increase in their likelihood of using and promoting colleagues' use of GenAI for their daily work. Survey responses did not differ significantly between teams that did and did not receive sample prompts with their project task descriptions. CONCLUSION/CONCLUSIONS:The first healthcare Prompt-a-thon was an overwhelming success, with minimal technological failures, positive responses from diverse participants and staff, and evidence of post-event engagement. These findings will be integral to planning future events at our institution, and to others looking to engage their workforce in utilizing GenAI.
PMCID:11265701
PMID: 39042600
ISSN: 2767-3170
CID: 5686592
Development and external validation of a dynamic risk score for early prediction of cardiogenic shock in cardiac intensive care units using machine learning
Hu, Yuxuan; Lui, Albert; Goldstein, Mark; Sudarshan, Mukund; Tinsay, Andrea; Tsui, Cindy; Maidman, Samuel D; Medamana, John; Jethani, Neil; Puli, Aahlad; Nguy, Vuthy; Aphinyanaphongs, Yindalon; Kiefer, Nicholas; Smilowitz, Nathaniel R; Horowitz, James; Ahuja, Tania; Fishman, Glenn I; Hochman, Judith; Katz, Stuart; Bernard, Samuel; Ranganath, Rajesh
BACKGROUND:Myocardial infarction and heart failure are major cardiovascular diseases that affect millions of people in the US, with morbidity and mortality highest among patients who develop cardiogenic shock. Early recognition of cardiogenic shock allows prompt implementation of treatment measures. Our objective was to develop a new dynamic risk score, called CShock, to improve early detection of cardiogenic shock in the cardiac intensive care unit (ICU). METHODS:We developed and externally validated a deep learning-based risk stratification tool, called CShock, for patients admitted to the cardiac ICU with acute decompensated heart failure and/or myocardial infarction to predict the onset of cardiogenic shock. We prepared a cardiac ICU dataset using the MIMIC-III database, annotated with physician-adjudicated outcomes. This dataset, which consisted of 1,500 patients, 204 of whom had cardiogenic/mixed shock, was then used to train CShock. The features used to train the model included patient demographics, cardiac ICU admission diagnoses, routinely measured laboratory values and vital signs, and relevant features manually extracted from echocardiogram and left heart catheterization reports. We externally validated the risk model on the New York University (NYU) Langone Health cardiac ICU database, which was also annotated with physician-adjudicated outcomes. The external validation cohort consisted of 131 patients, 25 of whom experienced cardiogenic/mixed shock. RESULTS:CShock achieved an area under the receiver operating characteristic curve (AUROC) of 0.821 (95% CI 0.792-0.850). CShock was externally validated in the more contemporary NYU cohort and achieved an AUROC of 0.800 (95% CI 0.717-0.884), demonstrating its generalizability to other cardiac ICUs. Based on Shapley values, an elevated heart rate was the most predictive feature of cardiogenic shock development. The other top ten predictors were an admission diagnosis of myocardial infarction with ST-segment elevation, an admission diagnosis of acute decompensated heart failure, Braden Scale, Glasgow Coma Scale, blood urea nitrogen, systolic blood pressure, serum chloride, serum sodium, and arterial blood pH. CONCLUSIONS:The novel CShock score has the potential to provide automated detection and early warning of cardiogenic shock and to improve outcomes for the millions of patients who suffer from myocardial infarction and heart failure.
PMID: 38518758
ISSN: 2048-8734
CID: 5640892
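A minimal sketch, not the CShock code, of how a risk score's AUROC and a bootstrap 95% CI can be reported for a validation cohort like the one above; the labels and scores below are simulated.

import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, 500)                                # 1 = cardiogenic/mixed shock
risk = np.clip(0.3 * y_true + rng.normal(0.4, 0.2, 500), 0, 1)  # toy risk scores

auroc = roc_auc_score(y_true, risk)

# Nonparametric bootstrap over patients for the confidence interval.
boot = []
for _ in range(2000):
    idx = rng.integers(0, len(y_true), len(y_true))
    if y_true[idx].min() == y_true[idx].max():  # resample must contain both classes
        continue
    boot.append(roc_auc_score(y_true[idx], risk[idx]))

low, high = np.percentile(boot, [2.5, 97.5])
print(f"AUROC {auroc:.3f} (95% CI {low:.3f}-{high:.3f})")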