

Searched for: in-biosketch:true; person:smallw03
Total Results: 19


Evaluating Large Language Models in extracting cognitive exam dates and scores

Zhang, Hao; Jethani, Neil; Jones, Simon; Genes, Nicholas; Major, Vincent J; Jaffe, Ian S; Cardillo, Anthony B; Heilenbach, Noah; Ali, Nadia Fazal; Bonanni, Luke J; Clayburn, Andrew J; Khera, Zain; Sadler, Erica C; Prasad, Jaideep; Schlacter, Jamie; Liu, Kevin; Silva, Benjamin; Montgomery, Sophie; Kim, Eric J; Lester, Jacob; Hill, Theodore M; Avoricani, Alba; Chervonski, Ethan; Davydov, James; Small, William; Chakravartty, Eesha; Grover, Himanshu; Dodson, John A; Brody, Abraham A; Aphinyanaphongs, Yindalon; Masurkar, Arjun; Razavian, Narges
Ensuring reliability of Large Language Models (LLMs) in clinical tasks is crucial. Our study assesses two state-of-the-art LLMs (ChatGPT and LlaMA-2) for extracting clinical information, focusing on cognitive tests such as the MMSE and CDR. Our data consisted of 135,307 clinical notes (Jan 12th, 2010 to May 24th, 2023) mentioning MMSE, CDR, or MoCA. After applying inclusion criteria, 34,465 notes remained, of which 765 were processed by ChatGPT (GPT-4) and LlaMA-2, and 22 experts reviewed the responses. ChatGPT successfully extracted MMSE and CDR instances with dates from 742 notes. We used 20 notes for fine-tuning and training the reviewers. The remaining 722 were assigned to reviewers, with 309 notes each assigned to two reviewers simultaneously. Inter-rater agreement (Fleiss' kappa), precision, recall, true/false negative rates, and accuracy were calculated. Our study follows TRIPOD reporting guidelines for model validation. For MMSE information extraction, ChatGPT (vs. LlaMA-2) achieved accuracy of 83% (vs. 66.4%), sensitivity of 89.7% (vs. 69.9%), a true-negative rate of 96% (vs. 60.0%), and precision of 82.7% (vs. 62.2%). For CDR, precision was lower overall: accuracy was 87.1% (vs. 74.5%), sensitivity 84.3% (vs. 39.7%), the true-negative rate 99.8% (vs. 98.4%), and precision 48.3% (vs. 16.1%). We qualitatively evaluated the MMSE errors of ChatGPT and LlaMA-2 on double-reviewed notes. LlaMA-2 errors included 27 cases of total hallucination, 19 cases of reporting other scores instead of MMSE, 25 missed scores, and 23 cases of reporting only the wrong date. In comparison, ChatGPT's errors included only 3 cases of total hallucination, 17 cases of reporting the wrong test instead of MMSE, and 19 cases of reporting a wrong date. In this diagnostic/prognostic study of ChatGPT and LlaMA-2 for extracting cognitive exam dates and scores from clinical notes, ChatGPT exhibited high accuracy, with better performance compared to LlaMA-2.
The use of LLMs could benefit dementia research and clinical care by identifying patients eligible for treatment initiation or clinical trial enrollment. Rigorous evaluation of LLMs is crucial to understanding their capabilities and limitations.
PMCID:11634005
PMID: 39661652
ISSN: 2767-3170
CID: 5762692
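The metrics reported in the abstract above (accuracy, sensitivity, true-negative rate, precision) follow the standard confusion-matrix definitions. A minimal sketch of those definitions, using illustrative counts rather than the study's actual data:

```python
def extraction_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard binary-classification metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),          # recall / true-positive rate
        "true_negative_rate": tn / (tn + fp),   # specificity
        "precision": tp / (tp + fp),
    }

# Placeholder counts, not the study's data
metrics = extraction_metrics(tp=80, fp=10, tn=95, fn=15)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Note how precision can drop sharply when true positives are rare relative to false positives, which is consistent with the lower CDR precision reported despite a high true-negative rate.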

Clinical research in endometrial cancer: consensus recommendations from the Gynecologic Cancer InterGroup

Creutzberg, Carien L; Kim, Jae-Weon; Eminowicz, Gemma; Allanson, Emma; Eberst, Lauriane; Kim, Se Ik; Nout, Remi A; Park, Jeong-Yeol; Lorusso, Domenica; Mileshkin, Linda; Ottevanger, Petronella B; Brand, Alison; Mezzanzanica, Delia; Oza, Amit; Gebski, Val; Pothuri, Bhavana; Batley, Tania; Gordon, Carol; Mitra, Tina; White, Helen; Howitt, Brooke; Matias-Guiu, Xavier; Ray-Coquard, Isabelle; Gaffney, David; Small, William; Miller, Austin; Concin, Nicole; Powell, Matthew A; Stuart, Gavin; Bookman, Michael A; ,
The Gynecologic Cancer InterGroup (GCIG) Endometrial Cancer Consensus Conference on Clinical Research (ECCC) was held in Incheon, South Korea, Nov 2-3, 2023. The aims were to develop consensus statements for future trials in endometrial cancer to achieve harmonisation on design elements, select important questions, and identify unmet needs. All 33 GCIG member groups participated in the development, refinement, and finalisation of 18 statements within four topic groups, addressing adjuvant treatment in high-risk disease; treatment for metastatic and recurrent disease; trial designs for rare endometrial cancer subgroups and special circumstances; and specific methodology and adaptation for trials in low-resource settings. In addition, eight areas of unmet need were identified. This was the first GCIG Consensus Conference to include patient advocates and an expert on inclusion, diversity, equity, and access in all aspects of the process and output. Four early-career investigators were also selected for participation, ensuring that they represented different GCIG member groups and regions. Unanimous consensus was obtained for 16 of the 18 statements, with 97% concordance for the remaining two. Using the methodology described for previous Ovarian Cancer Consensus Conferences, this conference did not require a single minority statement. The high acceptance rate following active involvement in the preparation, discussion, and refinement of the statements by all representatives confirmed the consensus process within a global academic setting, and the expectation that the ECCC will lead to greater harmonisation, actualisation, inclusion, and resolution of unmet needs in clinical research for individuals living with and beyond endometrial cancer worldwide.
PMID: 39214113
ISSN: 1474-5488
CID: 5702082

The Clinical Utility of a 7-Gene Biosignature on Radiation Therapy Decision Making in Patients with Ductal Carcinoma In Situ Following Breast-Conserving Surgery: An Updated Analysis of the DCISionRT® PREDICT Study

Shah, Chirag; Whitworth, Pat; Vicini, Frank A; Narod, Steven; Gerber, Naamit; Jhawar, Sachin R; King, Tari A; Mittendorf, Elizabeth A; Willey, Shawna C; Rabinovich, Rachel; Gold, Linsey; Brown, Eric; Patel, Anushka; Vargo, John; Barry, Parul N; Rock, David; Friedman, Neil; Bedi, Gauri; Templeton, Sandra; Brown, Sheree; Gabordi, Robert; Riley, Lee; Lee, Lucy; Baron, Paul; Majithia, Lonika; Mirabeau-Beale, Kristina L; Reid, Vincent J; Hirsch, Arica; Hwang, Catherine; Pellicane, James; Maganini, Robert; Khan, Sadia; MacDermed, Dhara M; Small, William; Mittal, Karuna; Borgen, Patrick; Cox, Charles; Shivers, Steven C; Bremer, Troy
BACKGROUND:Breast-conserving surgery (BCS) followed by adjuvant radiotherapy (RT) is a standard treatment for ductal carcinoma in situ (DCIS). A low-risk patient subset that does not benefit from RT has not yet been clearly identified. The DCISionRT test provides a clinically validated decision score (DS), which is prognostic of 10-year in-breast recurrence rates (invasive and non-invasive) and is also predictive of RT benefit. This analysis presents final outcomes from the PREDICT prospective registry trial aiming to determine how often the DCISionRT test changes radiation treatment recommendations. METHODS:Overall, 2496 patients were enrolled from February 2018 to January 2022 at 63 academic and community practice sites and received DCISionRT as part of their care plan. Treating physicians reported their treatment recommendations pre- and post-test as well as the patient's preference. The primary endpoint was to identify the percentage of patients where testing led to a change in RT recommendation. The impact of the test on RT treatment recommendation was evaluated by physician specialty, treatment setting, individual clinical/pathological features, and RTOG 9804-like criteria. Multivariate logistic regression analysis was used to estimate odds ratios (ORs) for factors associated with the post-test RT recommendations. RESULTS:The RT recommendation changed for 38% of women, resulting in a 20% decrease in the overall recommendation of RT (p < 0.001). Of those women initially recommended no RT (n = 583), 31% were recommended RT post-test. The recommendation for RT post-test increased with increasing DS, from 29% to 66% to 91% for DS <2, DS 2-4, and DS >4, respectively. On multivariable analysis, DS had the strongest influence on final RT recommendation (odds ratio 22.2, 95% confidence interval 16.3-30.7), which was eightfold greater than clinicopathologic features.
Furthermore, there was an overall change in the recommendation to receive RT in 42% of those patients meeting RTOG 9804-like low-risk criteria. CONCLUSIONS:The test results provided information that changed treatment recommendations both for and against RT use in a large population of women with DCIS treated in a variety of clinical settings. Overall, clinicians changed their recommendations to include or omit RT for 38% of women based on the test results. Based on published clinical validations and the results from the current study, DCISionRT may aid in preventing the over- and undertreatment of clinicopathologically 'low-risk' and 'high-risk' DCIS patients. TRIAL REGISTRATION:ClinicalTrials.gov identifier: NCT03448926 ( https://clinicaltrials.gov/study/NCT03448926 ).
PMCID:11300542
PMID: 38916700
ISSN: 1534-4681
CID: 5973052
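The odds ratio and confidence interval quoted above (OR 22.2, 95% CI 16.3-30.7) have the shape produced by exponentiating a logistic-regression coefficient and its Wald interval. A minimal sketch of that standard transformation; the coefficient and standard error below are hypothetical values chosen only to illustrate the arithmetic, not figures from the study:

```python
import math

def odds_ratio_ci(beta: float, se: float, z: float = 1.96) -> tuple:
    """Convert a logistic-regression coefficient and its standard error
    into an odds ratio with a Wald 95% confidence interval."""
    return (
        math.exp(beta),            # point estimate
        math.exp(beta - z * se),   # lower bound
        math.exp(beta + z * se),   # upper bound
    )

# Hypothetical inputs for illustration only
or_, lo, hi = odds_ratio_ci(beta=3.10, se=0.162)
print(f"OR = {or_:.1f} (95% CI {lo:.1f}-{hi:.1f})")
```

Because the interval is computed on the log-odds scale and then exponentiated, it is asymmetric around the point estimate, as in the reported 16.3-30.7 range.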

Enhancing Secure Messaging in Electronic Health Records: Evaluating the Impact of Emoji Chat Reactions on the Volume of Interruptive Notifications

Will, John; Small, William; Iturrate, Eduardo; Testa, Paul; Feldman, Jonah
ORIGINAL:0017336
ISSN: 2566-9346
CID: 5686602

Large Language Model-Based Responses to Patients' In-Basket Messages

Small, William R; Wiesenfeld, Batia; Brandfield-Harvey, Beatrix; Jonassen, Zoe; Mandal, Soumik; Stevens, Elizabeth R; Major, Vincent J; Lostraglio, Erin; Szerencsy, Adam; Jones, Simon; Aphinyanaphongs, Yindalon; Johnson, Stephen B; Nov, Oded; Mann, Devin
IMPORTANCE:Virtual patient-physician communications have increased since 2020 and negatively impacted primary care physician (PCP) well-being. Generative artificial intelligence (GenAI) drafts of patient messages could potentially reduce health care professional (HCP) workload and improve communication quality, but only if the drafts are considered useful. OBJECTIVES:To assess PCPs' perceptions of GenAI drafts and to examine linguistic characteristics associated with equity and perceived empathy. DESIGN, SETTING, AND PARTICIPANTS:This cross-sectional quality improvement study tested the hypothesis that PCPs' ratings of GenAI drafts (created using the electronic health record [EHR] standard prompts) would be equivalent to HCP-generated responses on 3 dimensions. The study was conducted at NYU Langone Health using private patient-HCP communications at 3 internal medicine practices piloting GenAI. EXPOSURES:Randomly assigned patient messages coupled with either an HCP message or the draft GenAI response. MAIN OUTCOMES AND MEASURES:PCPs rated each response's information content quality (eg, relevance) and communication quality (eg, verbosity) on Likert scales, and indicated whether they would use the draft or start anew (usable vs unusable). Branching logic further probed the empathy, personalization, and professionalism of responses. Computational linguistics methods assessed content differences in HCP vs GenAI responses, focusing on equity and empathy. RESULTS:A total of 16 PCPs (8 [50.0%] female) reviewed 344 messages (175 GenAI drafted; 169 HCP drafted). Both GenAI and HCP responses were rated favorably.
GenAI responses were rated higher for communication style than HCP responses (mean [SD], 3.70 [1.15] vs 3.38 [1.20]; P = .01; U = 12 568.5) but were similar to HCP responses on information content (mean [SD], 3.53 [1.26] vs 3.41 [1.27]; P = .37; U = 13 981.0) and usable draft proportion (mean [SD], 0.69 [0.48] vs 0.65 [0.47]; P = .49; t = -0.6842). Usable GenAI responses were considered more empathetic than usable HCP responses (32 of 86 [37.2%] vs 13 of 79 [16.5%]; difference, 125.5%), possibly attributable to more subjective (mean [SD], 0.54 [0.16] vs 0.31 [0.23]; P < .001; difference, 74.2%) and positive (mean [SD] polarity, 0.21 [0.14] vs 0.13 [0.25]; P = .02; difference, 61.5%) language; they were also longer (mean [SD] word count, 90.5 [32.0] vs 65.4 [62.6]; difference, 38.4%), although this difference was not statistically significant (P = .07), and more linguistically complex (mean [SD] score, 125.2 [47.8] vs 95.4 [58.8]; P = .002; difference, 31.2%). CONCLUSIONS:In this cross-sectional study of PCP perceptions of an EHR-integrated GenAI chatbot, GenAI was found to communicate information better and with more empathy than HCPs, highlighting its potential to enhance patient-HCP communication. However, GenAI drafts were less readable than HCPs' responses, a significant concern for patients with low health or English literacy.
PMCID:11252893
PMID: 39012633
ISSN: 2574-3805
CID: 5686582
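The "difference" percentages in the abstract above are consistent with relative differences of the GenAI value against the HCP baseline, i.e. (GenAI - HCP) / HCP. A quick check of that interpretation against two of the reported pairs:

```python
def relative_difference_pct(genai: float, hcp: float) -> float:
    """Relative difference of a GenAI value vs. the HCP baseline, in percent."""
    return (genai - hcp) / hcp * 100

# Figures taken from the abstract above
print(round(relative_difference_pct(0.54, 0.31), 1))   # subjectivity: 74.2
print(round(relative_difference_pct(90.5, 65.4), 1))   # word count: 38.4
```

Both computed values match the percentages reported in the abstract (74.2% and 38.4%), supporting this reading of the "difference" figures.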

The First Generative AI Prompt-A-Thon in Healthcare: A Novel Approach to Workforce Engagement with a Private Instance of ChatGPT

Small, William R; Malhotra, Kiran; Major, Vincent J; Wiesenfeld, Batia; Lewis, Marisa; Grover, Himanshu; Tang, Huming; Banerjee, Arnab; Jabbour, Michael J; Aphinyanaphongs, Yindalon; Testa, Paul; Austrian, Jonathan S
BACKGROUND:Healthcare crowdsourcing events (e.g. hackathons) facilitate interdisciplinary collaboration and encourage innovation. Peer-reviewed research has not yet considered a healthcare crowdsourcing event focusing on generative artificial intelligence (GenAI), which generates text in response to detailed prompts and has vast potential for improving the efficiency of healthcare organizations. Our event, the New York University Langone Health (NYULH) Prompt-a-thon, primarily sought to inspire and build AI fluency within our diverse NYULH community and foster collaboration and innovation. Secondarily, we sought to analyze how participants' experience was influenced by their prior GenAI exposure and whether they received sample prompts during the workshop. METHODS:Executing the event required the assembly of an expert planning committee, who recruited diverse participants, anticipated technological challenges, and prepared the event. The event comprised didactics and workshop sessions, which educated participants and allowed them to experiment with GenAI on real healthcare data. Participants were given novel "project cards" associated with each dataset that illuminated the tasks GenAI could perform and, for a random set of teams, sample prompts to help them achieve each task (the public repository of project cards can be found at https://github.com/smallw03/NYULH-Generative-AI-Prompt-a-thon-Project-Cards). Afterwards, participants were asked to fill out a survey with 7-point Likert-style questions. RESULTS:Our event was successful in educating and inspiring hundreds of enthusiastic in-person and virtual participants across our organization on the responsible use of GenAI in a low-cost and technologically feasible manner. All participants responded positively, on average, to each of the survey questions (e.g., confidence in their ability to use and trust GenAI).
Critically, participants reported a self-perceived increase in their likelihood of using and promoting colleagues' use of GenAI for their daily work. No significant differences were seen in the surveys of those who received sample prompts with their project task descriptions. CONCLUSIONS:The first healthcare Prompt-a-thon was an overwhelming success, with minimal technological failures, positive responses from diverse participants and staff, and evidence of post-event engagement. These findings will be integral to planning future events at our institution, and to others looking to engage their workforce in utilizing GenAI.
PMCID:11265701
PMID: 39042600
ISSN: 2767-3170
CID: 5686592

Leveraging Electronic Health Record Data and Measuring Interdependence in the Era of Precision Education and Assessment

Sebok-Syer, Stefanie S; Small, William R; Lingard, Lorelei; Glober, Nancy K; George, Brian C; Burk-Rafel, Jesse
PURPOSE:The era of precision education is increasingly leveraging electronic health record (EHR) data to assess residents' clinical performance. But what EHR-based resident performance metrics are truly assessing is not fully understood. For instance, there is limited understanding of how EHR-based measures account for the influence of the team on an individual's performance, or conversely how an individual contributes to team performance. This study aims to elaborate on how the theoretical understandings of supportive and collaborative interdependence are captured in residents' EHR-based metrics. METHOD:Using a mixed methods study design, the authors conducted a secondary analysis of 5 existing quantitative and qualitative datasets used in previous EHR studies to investigate how aspects of interdependence shape the ways that team-based care is provided to patients. RESULTS:Quantitative analyses of 16 EHR-based metrics found variability in faculty and resident performance (both between and within resident). Qualitative analyses revealed that faculty lack awareness of their own EHR-based performance metrics, which limits their ability to act interdependently with residents in an evidence-informed fashion. The lens of interdependence elucidates how resident practice patterns develop across residency training, shifting from supportive to collaborative interdependence over time. Joint displays merging the quantitative and qualitative analyses showed that residents are aware of variability in faculty's practice patterns, and that viewing resident EHR-based measures without accounting for the interdependence of residents with faculty is problematic, particularly within the framework of precision education.
CONCLUSIONS:To prepare for this new paradigm of precision education, educators need to develop and evaluate theoretically robust models that measure interdependence in EHR-based metrics, affording more nuanced interpretation of such metrics when assessing residents throughout training.
PMID: 38207084
ISSN: 1938-808x
CID: 5686572

Evaluating Large Language Models in Extracting Cognitive Exam Dates and Scores

Zhang, Hao; Jethani, Neil; Jones, Simon; Genes, Nicholas; Major, Vincent J; Jaffe, Ian S; Cardillo, Anthony B; Heilenbach, Noah; Ali, Nadia Fazal; Bonanni, Luke J; Clayburn, Andrew J; Khera, Zain; Sadler, Erica C; Prasad, Jaideep; Schlacter, Jamie; Liu, Kevin; Silva, Benjamin; Montgomery, Sophie; Kim, Eric J; Lester, Jacob; Hill, Theodore M; Avoricani, Alba; Chervonski, Ethan; Davydov, James; Small, William; Chakravartty, Eesha; Grover, Himanshu; Dodson, John A; Brody, Abraham A; Aphinyanaphongs, Yindalon; Masurkar, Arjun; Razavian, Narges
IMPORTANCE:Large language models (LLMs) are crucial for medical tasks. Ensuring their reliability is vital to avoid false results. Our study assesses two state-of-the-art LLMs (ChatGPT and LlaMA-2) for extracting clinical information, focusing on cognitive tests like MMSE and CDR. OBJECTIVE:Evaluate ChatGPT and LlaMA-2 performance in extracting MMSE and CDR scores, including their associated dates. METHODS:Our data consisted of 135,307 clinical notes (Jan 12th, 2010 to May 24th, 2023) mentioning MMSE, CDR, or MoCA. After applying inclusion criteria, 34,465 notes remained, of which 765 were processed by ChatGPT (GPT-4) and LlaMA-2, and 22 experts reviewed the responses. ChatGPT successfully extracted MMSE and CDR instances with dates from 742 notes. We used 20 notes for fine-tuning and training the reviewers. The remaining 722 were assigned to reviewers, with 309 notes each assigned to two reviewers simultaneously. Inter-rater agreement (Fleiss' kappa), precision, recall, true/false negative rates, and accuracy were calculated. Our study follows TRIPOD reporting guidelines for model validation. RESULTS:For MMSE information extraction, ChatGPT (vs. LlaMA-2) achieved accuracy of 83% (vs. 66.4%), sensitivity of 89.7% (vs. 69.9%), a true-negative rate of 96% (vs. 60.0%), and precision of 82.7% (vs. 62.2%). For CDR, precision was lower overall: accuracy was 87.1% (vs. 74.5%), sensitivity 84.3% (vs. 39.7%), the true-negative rate 99.8% (vs. 98.4%), and precision 48.3% (vs. 16.1%). We qualitatively evaluated the MMSE errors of ChatGPT and LlaMA-2 on double-reviewed notes. LlaMA-2 errors included 27 cases of total hallucination, 19 cases of reporting other scores instead of MMSE, 25 missed scores, and 23 cases of reporting only the wrong date. In comparison, ChatGPT's errors included only 3 cases of total hallucination, 17 cases of reporting the wrong test instead of MMSE, and 19 cases of reporting a wrong date.
CONCLUSIONS:In this diagnostic/prognostic study of ChatGPT and LlaMA-2 for extracting cognitive exam dates and scores from clinical notes, ChatGPT exhibited high accuracy, with better performance compared to LlaMA-2. The use of LLMs could benefit dementia research and clinical care by identifying patients eligible for treatment initiation or clinical trial enrollment. Rigorous evaluation of LLMs is crucial to understanding their capabilities and limitations.
PMCID:10888985
PMID: 38405784
CID: 5722422

Electronic Health Record Messaging Patterns of Health Care Professionals in Inpatient Medicine

Small, William; Iturrate, Eduardo; Austrian, Jonathan; Genes, Nicholas
PMID: 38147337
ISSN: 2574-3805
CID: 5623492