Searched for: in-biosketch:true person:burkrj01

Total Results: 51


Leveraging a Large Language Model to Generate Quality Improvement Feedback for Clinical Notes

Kim, Christopher J; Gelfinbein, Joseph; Gencerliler, Nihan; Jahan, Nusrat; Udaikumar, Jahnavi; Heery, Lauren M; Goodman, Adam; Ng, Sarah; Attard, Joel; Asha, Sharmin; Burk-Rafel, Jesse; Guzman, Benedict Vincent; Hochman, Katherine A; Testa, Paul; Feldman, Jonah
BACKGROUND:Poor documentation quality can significantly affect healthcare operations, but the feedback process for clinicians to improve clinical notes is time-consuming and often insufficient. Large language models (LLMs) such as Generative Pre-trained Transformer 4 (GPT-4) have the potential to streamline this process. OBJECTIVES/OBJECTIVE:To determine whether an LLM can generate feedback to improve the medical contingency and discharge planning (MCDP) component of clinical documentation that is non-inferior to feedback by physicians. METHODS:A cross-sectional study of GPT-4 feedback and physician feedback on inpatient progress notes was conducted. A random sample of 64 inpatient progress notes identified by the validated AI Audit Tool as having a low likelihood of containing MCDP was included from adult general medicine patients hospitalized at New York University Langone Health (NYULH) in December 2023. Both the GPT-4 model and attending physicians generated feedback on these inpatient progress notes. A/B testing was then conducted on the measures of understandability, usefulness, acceptability, and impartiality. Evaluations employed 5-point Likert scales that were converted to 10-point bidirectional interval scales for interpretability, ranging from -10 (human suggestions significantly better) to +10 (GPT-4 suggestions significantly better), with a non-inferiority threshold set to -1 for the primary endpoint. RESULTS:64 inpatient progress notes were included, representing 55% female patients with a median age of 73. GPT-4 feedback was non-inferior to physician feedback in all measures: understandability (mean 1.27, 95% CI 0.73 to 1.8, P < 0.001), usefulness (mean 2.09, 95% CI 1.27 to 2.91, P < 0.001), acceptability (mean 2.07, 95% CI 1.33 to 2.81, P < 0.001), and impartiality (mean -0.20, 95% CI -0.52 to 0.12, P < 0.001). CONCLUSIONS:This study shows that an LLM can be leveraged to generate note quality feedback that is non-inferior to expert clinician feedback.
PMID: 41985489
ISSN: 1869-0327
CID: 6027922
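
The scale conversion described in the abstract above (5-point Likert ratings mapped onto a -10 to +10 bidirectional interval) can be sketched as a simple linear transform. The exact formula `(score - 3) * 5` is an assumption for illustration; the paper's precise mapping is not stated in the abstract.

```python
def likert_to_bidirectional(score: int) -> int:
    """Map a 5-point A/B comparison rating (1 = human feedback much
    better, 3 = neutral, 5 = GPT-4 feedback much better) onto the
    -10..+10 bidirectional scale described in the abstract.
    The linear form (score - 3) * 5 is an illustrative assumption."""
    if score not in range(1, 6):
        raise ValueError("Likert score must be between 1 and 5")
    return (score - 3) * 5
```

Under this mapping, a neutral rating of 3 lands at 0, and the endpoints reach -10 and +10, matching the scale range reported in the abstract.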

Large language model-based identification of venous thromboembolism diagnostic delays

Schaye, Verity; Sartori, Daniel J; Signoriello, Lexi; Malhotra, Kiran; Guzman, Benedict; Rajput, Bijal; Reinstein, Ilan; Burk-Rafel, Jesse
BACKGROUND:Delayed diagnosis of venous thromboembolism (VTE) is prevalent among hospitalized patients, yet case identification is challenging and feedback limited. OBJECTIVE:To develop a large language model (LLM)-based electronic-trigger to identify VTE diagnostic delays. METHODS:All admissions to internal medicine (IM) residents at NYU Langone Health between January 2022 and December 2023 (n = 20,843) were included. Using an open-source LLM, prompts were validated to detect (1) residents considering VTE in admission notes and (2) VTE confirmation in five types of imaging reports (n = 100 for each prompt validation set). The validated prompts were applied to determine discordance between admission note differential omitting VTE and imaging report confirming VTE. Two hospitalists reviewed discordant cases using a validated tool to identify diagnostic delays. Hospitalizations were labeled as diagnostic delays, in-hospital complication, or false-positive. Based on in-hospital complication and false-positive patterns, exclusion criteria were implemented. Positive predictive value (PPV) and negative predictive value (NPV) were calculated. RESULTS:The LLM prompts correctly classified admission notes and VTE imaging studies with high accuracy (range 98%-100%, n = 699 VTE cases identified). Of the 137 diagnostic delays the LLM-based electronic-trigger identified, 31 were true-positives, 60 in-hospital complications, and 46 false-positives. 4.4% of all VTE hospitalizations had a diagnostic delay. With the exclusion criteria, the PPV was 48% (95% confidence interval [CI], 35%-62%) and NPV was 95% (95% CI, 87%-98%). CONCLUSIONS:We developed the first LLM-based electronic-trigger to identify VTE diagnostic delays, with higher performance than existing non-LLM electronic-triggers. LLM-based approaches can facilitate diagnostic performance feedback and are scalable to other conditions and institutions.
PMID: 41058083
ISSN: 1553-5606
CID: 5951832
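
The PPV and NPV reported in the abstract above are standard confusion-matrix quantities; a minimal sketch follows (the counts in the usage note are illustrative, not the study's adjudicated totals).

```python
def ppv(tp: int, fp: int) -> float:
    """Positive predictive value: among cases the electronic-trigger
    flags, the share that are true diagnostic delays."""
    return tp / (tp + fp)

def npv(tn: int, fn: int) -> float:
    """Negative predictive value: among cases the electronic-trigger
    clears, the share that truly had no diagnostic delay."""
    return tn / (tn + fn)
```

For example, 48 true-positives among 100 flagged cases gives ppv(48, 52) = 0.48, mirroring the 48% PPV the study reports after applying its exclusion criteria.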

Sex, Race, and Ethnicity Differences Among Residents With Exceptionally High Graduate Medical Education Ratings

Kim, Jung G; Hauer, Karen E; Boscardin, Christy K; Su, Jasmine I-Shin; Holmboe, Eric S; Konopasek, Lyuba; Chen, Isabel L; Gonzalez, Cristina M; Ogedegbe, Gbenga G; Burk-Rafel, Jesse; Nguyen, Mytien; Andrews, John S; Henderson, David D; Richardson, Judee; McDade, William; Boatright, Dowin
IMPORTANCE/UNASSIGNED:Limited research exists on sex, racial, and ethnic disparities in required graduate medical education (GME) resident competency ratings across specialties during sensitive periods when career decision-making occurs. Rating disparities using an antideficit-based approach measured by exceptionally high ratings are underexplored in GME. OBJECTIVE/UNASSIGNED:To assess the association of exceptionally high ratings in the Accreditation Council for Graduate Medical Education (ACGME) Milestones during time-sensitive training periods across specialties with differences among residents' characteristics, including sex, race, and ethnicity. DESIGN, SETTING, AND PARTICIPANTS/UNASSIGNED:This cross-sectional analysis was conducted between March 15 and December 31, 2025, using 2018 to 2021 Association of American Medical Colleges and ACGME data. Postgraduate year (PGY) 2 residents training at US ACGME-accredited emergency medicine, family medicine, internal medicine, obstetrics and gynecology, pediatrics, and surgery residency programs between 2018 and 2021 who self-reported sex, race, or ethnicity were studied. EXPOSURE/UNASSIGNED:Required Milestones ratings at the end of PGY-2 training associated with resident sex and race or ethnicity (underrepresented in medicine [URiM] and Asian), while controlling for preresidency Step 2 Clinical Knowledge examination scores. MAIN OUTCOMES AND MEASURES/UNASSIGNED:Proportion and adjusted odds ratios (AORs) for exceptionally high resident-level ratings (80th percentile level) across competencies in interpersonal and communication skills, medical knowledge, patient care, practice-based learning and improvement, professionalism, and systems-based practice. 
RESULTS/UNASSIGNED:Among 19 492 PGY-2 residents across 1754 programs, 10 384 (53.3%) were female, 28 (0.14%) American Indian or Alaska Native, 4327 (22.2%) Asian, 1106 (5.7%) Black, 1008 (5.2%) Hispanic or Latinx, 3 (0.02%) Native Hawaiian or Pacific Islander, 12 269 (62.9%) White, 751 (3.9%) reporting 2 or more races, and 3423 (17.6%) classified as URiM. Exceptional rating differences were identified by sex, race, and ethnicity. Across all specialties, female residents had greater odds for 80th percentile ratings (AOR, 1.12; 95% CI, 1.05-1.21; P < .001), whereas URiM residents (AOR, 0.68; 95% CI, 0.62-0.76; P < .001) and Asian residents (AOR, 0.67; 95% CI, 0.60-0.74; P < .001) were less likely than White residents to have 80th percentile ratings. Within specialties, URiM residents in emergency medicine, family medicine, internal medicine, obstetrics and gynecology, and surgery were less likely to have 80th percentile ratings, whereas Asian residents in family medicine, internal medicine, pediatrics, and surgery were also less likely than White residents. CONCLUSION AND RELEVANCE/UNASSIGNED:In this cross-sectional national study of residents, exceptionally high ratings were associated with differing resident characteristics during crucial career planning phases. These results suggest the need for more studies to explore factors of resident success during GME training.
PMCID:13036576
PMID: 41910971
ISSN: 2574-3805
CID: 6021292

The impact of shifting hospitalist switch days from Monday to Tuesday

Nguyen, Larry; Messing, Lauren; Hochman, Katherine A; Quiñones-Camacho, Adriana; Burk-Rafel, Jesse; Verplanke, Benjamin
There are limited data on which hospitalist switch day is optimal for hospital operations and throughput. A quality improvement intervention was implemented, changing the hospitalist switch day from Monday to Tuesday. Retrospective observational analysis revealed an increase in Monday discharges (1.3%, p = .01), a decrease in Tuesday discharges (-1.6%, p < .005), and a significant reduction in 30-day unplanned readmission rates (-1.5%, p = .003), with no significant changes in the average length of stay. Additional studies are needed to further verify these findings in different hospital settings and to consider other switch day patterns.
PMID: 41186934
ISSN: 1553-5606
CID: 5959692

Macy Foundation Innovation Report Part II: From Hype to Reality: Innovators' Visions for Navigating AI Integration Challenges in Medical Education

Gin, Brian C; LaForge, Kate; Burk-Rafel, Jesse; Boscardin, Christy K
PURPOSE/OBJECTIVE:Artificial intelligence (AI) promises to significantly impact medical education, yet its implementation raises important questions about educational effectiveness, ethical use, and equity. In the second part of a 2-part innovation report, which was commissioned by the Josiah Macy Jr. Foundation to inform discussions at a conference on AI in medical education, the authors explore the perspectives of innovators actively integrating AI into medical education, examining their perceptions regarding the impacts, opportunities, challenges, and strategies for successful AI adoption and risk mitigation. METHOD/METHODS:Semi-structured interviews were conducted with 25 medical education AI innovators-including learners, educators, institutional leaders, and industry representatives-from June to August 2024. Interviews explored participants' perceptions of AI's influence on medical education, challenges to integration, and strategies for mitigating challenges. Transcripts were analyzed using thematic analysis to identify themes and synthesize participants' recommendations for AI integration. RESULTS:Innovators' responses were synthesized into 2 main thematic areas: (1) AI's impact on teaching, learning, and assessment, and (2) perceived threats and strategies for mitigating them. Participants identified AI's potential to enact precision education through virtual tutors and standardized patients, support active learning formats, enable centralized teaching, and facilitate cognitive offloading. AI-enhanced assessments could automate grading, predict learner trajectories, and integrate performance data from clinical interactions. Yet, innovators expressed concerns over threats to transparency and validity, potential propagation of biases, risks of over-reliance and deskilling, and institutional disparities. 
Proposed mitigation strategies emphasized validating AI outputs, establishing foundational competencies, fostering collaboration and open-source sharing, enhancing AI literacy, and maintaining robust ethical standards. CONCLUSIONS:AI innovators in medical education envision transformative opportunities for individualized learning and precision education, balanced against critical threats. Realizing these benefits requires proactive, collaborative efforts to establish rigorous validation frameworks; uphold foundational medical competencies; and prioritize ethical, equitable AI integration.
PMID: 40479503
ISSN: 1938-808x
CID: 5862832

Evaluating Hospital Course Summarization by an Electronic Health Record-Based Large Language Model

Small, William R; Austrian, Jonathan; O'Donnell, Luke; Burk-Rafel, Jesse; Hochman, Katherine A; Goodman, Adam; Zaretsky, Jonah; Martin, Jacob; Johnson, Stephen; Major, Vincent J; Jones, Simon; Henke, Christian; Verplanke, Benjamin; Osso, Jwan; Larson, Ian; Saxena, Archana; Mednick, Aron; Simonis, Choumika; Han, Joseph; Kesari, Ravi; Wu, Xinyuan; Heery, Lauren; Desel, Tenzin; Baskharoun, Samuel; Figman, Noah; Farooq, Umar; Shah, Kunal; Jahan, Nusrat; Kim, Jeong Min; Testa, Paul; Feldman, Jonah
IMPORTANCE/UNASSIGNED:Hospital course (HC) summarization represents an increasingly onerous discharge summary component for physicians. Literature supports large language models (LLMs) for HC summarization, but whether physicians can effectively partner with electronic health record-embedded LLMs to draft HCs is unknown. OBJECTIVES/UNASSIGNED:To compare the editing effort required by time-constrained resident physicians to improve LLM- vs physician-generated HCs toward a novel 4Cs (complete, concise, cohesive, and confabulation-free) HC. DESIGN, SETTING, AND PARTICIPANTS/UNASSIGNED:Quality improvement study using a convenience sample of 10 internal medicine resident editors, 8 hospitalist evaluators, and randomly selected general medicine admissions in December 2023 lasting 4 to 8 days at New York University Langone Health. EXPOSURES/UNASSIGNED:Residents and hospitalists reviewed randomly assigned patient medical records for 10 minutes. Residents blinded to author type edited each HC pair (physician and LLM) for quality in 3 minutes, followed by comparative ratings by attending hospitalists. MAIN OUTCOMES AND MEASURES/UNASSIGNED:Editing effort was quantified by analyzing the edits that occurred on the HC pairs after controlling for length (percentage edited) and the degree to which the original HCs' meaning was altered (semantic change). Hospitalists compared edited HC pairs with A/B testing on the 4Cs (5-point Likert scales converted to 10-point bidirectional scales). RESULTS/UNASSIGNED:Among 100 admissions, compared with physician HCs, residents edited a smaller percentage of LLM HCs (LLM mean [SD], 31.5% [16.6%] vs physicians, 44.8% [20.0%]; P < .001). Additionally, LLM HCs required less semantic change (LLM mean [SD], 2.4% [1.6%] vs physicians, 4.9% [3.5%]; P < .001).
Attending physicians deemed LLM HCs to be more complete (mean [SD] difference LLM vs physicians on 10-point bidirectional scale, 3.00 [5.28]; P < .001), similarly concise (mean [SD], -1.02 [6.08]; P = .20), and cohesive (mean [SD], 0.70 [6.14]; P = .60), but with more confabulations (mean [SD], -0.98 [3.53]; P = .002). The composite scores were similar (mean [SD] difference LLM vs physician on 40-point bidirectional scale, 1.70 [14.24]; P = .46). CONCLUSIONS AND RELEVANCE/UNASSIGNED:Electronic health record-embedded LLM HCs required less editing than physician-generated HCs to approach a quality standard, resulting in HCs that were comparably or more complete, concise, and cohesive, but contained more confabulations. Despite the potential influence of artificial time constraints, this study supports the feasibility of a physician-LLM partnership for writing HCs and provides a basis for monitoring LLM HCs in clinical practice.
PMID: 40802185
ISSN: 2574-3805
CID: 5906762
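
The "percentage edited" outcome above compares each original hospital course with its edited version. One plausible proxy uses the similarity ratio from Python's difflib; this is an illustrative assumption, since the study's exact edit metric is not specified in the abstract.

```python
import difflib

def percent_edited(original: str, edited: str) -> float:
    """Share of text changed between an original hospital course and
    its edited version, computed as (1 - similarity ratio) * 100.
    difflib's SequenceMatcher ratio is a stand-in for whatever
    length-controlled edit metric the study actually used."""
    ratio = difflib.SequenceMatcher(None, original, edited).ratio()
    return (1.0 - ratio) * 100.0
```

An unedited note scores 0.0; a fully rewritten note approaches 100.0, consistent with the percentage-edited framing in the abstract.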

Large Language Model-Augmented Strategic Analysis of Innovation Projects in Graduate Medical Education

Winkel, Abigail Ford; Burk-Rafel, Jesse; Terhune, Kyla; Garibaldi, Brian T; DeWaters, Ami L; Co, John Patrick T; Andrews, John S
PMCID:12080501
PMID: 40386486
ISSN: 1949-8357
CID: 5852792

How Data Analytics Can Be Leveraged to Enhance Graduate Clinical Skills Education

Garibaldi, Brian T; Hollon, McKenzie; Knopp, Michelle I; Winkel, Abigail Ford; Burk-Rafel, Jesse; Caretta-Weyer, Holly A
PMCID:12080502
PMID: 40386478
ISSN: 1949-8357
CID: 5852752

Artificial intelligence based assessment of clinical reasoning documentation: an observational study of the impact of the clinical learning environment on resident documentation quality

Schaye, Verity; DiTullio, David J; Sartori, Daniel J; Hauck, Kevin; Haller, Matthew; Reinstein, Ilan; Guzman, Benedict; Burk-Rafel, Jesse
BACKGROUND:Objective measures and large datasets are needed to determine aspects of the Clinical Learning Environment (CLE) impacting the essential skill of clinical reasoning documentation. Artificial Intelligence (AI) offers a solution. Here, the authors sought to determine what aspects of the CLE might be impacting resident clinical reasoning documentation quality assessed by AI. METHODS:In this observational, retrospective cross-sectional analysis of hospital admission notes from the Electronic Health Record (EHR), all categorical internal medicine (IM) residents who wrote at least one admission note during the study period (July 1, 2018 to June 30, 2023) at two sites of NYU Grossman School of Medicine's IM residency program were included. Clinical reasoning documentation quality of admission notes was determined to be low- or high-quality using a supervised machine learning model. From note-level data, the shift (day or night) and note index within shift (if a note was first, second, etc. within shift) were calculated. These aspects of the CLE were included as potential markers of workload, which have been shown to have a strong relationship with resident performance. Patient data were also captured, including age, sex, Charlson Comorbidity Index, and primary diagnosis. The relationship between these variables and clinical reasoning documentation quality was analyzed using generalized estimating equations accounting for resident-level clustering. RESULTS:Across 37,750 notes authored by 474 residents, patients who were older, had more pre-existing comorbidities, and presented with certain primary diagnoses (e.g., infectious and pulmonary conditions) were associated with higher clinical reasoning documentation quality.
When controlling for these and other patient factors, variables associated with clinical reasoning documentation quality included academic year (adjusted odds ratio, aOR, for high-quality: 1.10; 95% CI 1.06-1.15; P <.001), night shift (aOR 1.21; 95% CI 1.13-1.30; P <.001), and note index (aOR 0.93; 95% CI 0.90-0.95; P <.001). CONCLUSIONS:AI can be used to assess complex skills such as clinical reasoning in authentic clinical notes that can help elucidate the potential impact of the CLE on resident clinical reasoning documentation quality. Future work should explore residency program and systems interventions to optimize the CLE.
PMCID:12016287
PMID: 40264096
ISSN: 1472-6920
CID: 5830212
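
The adjusted odds ratios reported above come from a logistic generalized estimating equations model; the standard mapping from a fitted coefficient and its standard error to an aOR with a Wald 95% confidence interval can be sketched as follows. The `beta` and `se` inputs would come from a fitted model (e.g., a GEE with resident-level clustering), not shown here.

```python
import math

def aor_with_ci(beta: float, se: float, z: float = 1.96):
    """Exponentiate a logistic-model coefficient into an adjusted odds
    ratio with a Wald 95% confidence interval. Returns
    (aOR, lower bound, upper bound)."""
    return (math.exp(beta),
            math.exp(beta - z * se),
            math.exp(beta + z * se))
```

For instance, a coefficient of log(1.21) corresponds to the aOR of 1.21 reported for night-shift notes; the interval width is driven entirely by the coefficient's standard error.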

Large Language Model-Based Assessment of Clinical Reasoning Documentation in the Electronic Health Record Across Two Institutions: Development and Validation Study

Schaye, Verity; DiTullio, David; Guzman, Benedict Vincent; Vennemeyer, Scott; Shih, Hanniel; Reinstein, Ilan; Weber, Danielle E; Goodman, Abbie; Wu, Danny T Y; Sartori, Daniel J; Santen, Sally A; Gruppen, Larry; Aphinyanaphongs, Yindalon; Burk-Rafel, Jesse
BACKGROUND:Clinical reasoning (CR) is an essential skill, yet physicians often receive limited feedback. Artificial intelligence holds promise to fill this gap. OBJECTIVE:We report the development of named entity recognition (NER), logic-based and large language model (LLM)-based assessments of CR documentation in the electronic health record across 2 institutions (New York University Grossman School of Medicine [NYU] and University of Cincinnati College of Medicine [UC]). METHODS:Performance was assessed with F1-scores for the NER, logic-based model and with area under the receiver operating characteristic curve (AUROC) and area under the precision-recall curve (AUPRC) for the LLMs. RESULTS:The NER, logic-based model achieved F1-scores of 0.80, 0.74, and 0.80 for D0, D1, and D2, respectively. The GatorTron LLM performed best for EA2 scores (AUROC/AUPRC 0.75/0.69). CONCLUSIONS:This is the first multi-institutional study to apply LLMs for assessing CR documentation in the electronic health record. Such tools can enhance feedback on CR. Lessons learned by implementing these models at distinct institutions support the generalizability of this approach.
PMID: 40117575
ISSN: 1438-8871
CID: 5813782