Evaluating Hospital Course Summarization by an Electronic Health Record-Based Large Language Model
Small, William R; Austrian, Jonathan; O'Donnell, Luke; Burk-Rafel, Jesse; Hochman, Katherine A; Goodman, Adam; Zaretsky, Jonah; Martin, Jacob; Johnson, Stephen; Major, Vincent J; Jones, Simon; Henke, Christian; Verplanke, Benjamin; Osso, Jwan; Larson, Ian; Saxena, Archana; Mednick, Aron; Simonis, Choumika; Han, Joseph; Kesari, Ravi; Wu, Xinyuan; Heery, Lauren; Desel, Tenzin; Baskharoun, Samuel; Figman, Noah; Farooq, Umar; Shah, Kunal; Jahan, Nusrat; Kim, Jeong Min; Testa, Paul; Feldman, Jonah
IMPORTANCE/UNASSIGNED:Hospital course (HC) summarization represents an increasingly onerous discharge summary component for physicians. Literature supports large language models (LLMs) for HC summarization, but whether physicians can effectively partner with electronic health record-embedded LLMs to draft HCs is unknown. OBJECTIVES/UNASSIGNED:To compare the editing effort required by time-constrained resident physicians to improve LLM- vs physician-generated HCs toward a novel 4Cs (complete, concise, cohesive, and confabulation-free) HC. DESIGN, SETTING, AND PARTICIPANTS/UNASSIGNED:Quality improvement study using a convenience sample of 10 internal medicine resident editors, 8 hospitalist evaluators, and randomly selected general medicine admissions in December 2023 lasting 4 to 8 days at New York University Langone Health. EXPOSURES/UNASSIGNED:Residents and hospitalists reviewed randomly assigned patient medical records for 10 minutes. Residents blinded to author type edited each HC pair (physician and LLM) for quality in 3 minutes, followed by comparative ratings by attending hospitalists. MAIN OUTCOMES AND MEASURES/UNASSIGNED:Editing effort was quantified by analyzing the edits that occurred on the HC pairs after controlling for length (percentage edited) and the degree to which the original HCs' meaning was altered (semantic change). Hospitalists compared edited HC pairs with A/B testing on the 4Cs (5-point Likert scales converted to 10-point bidirectional scales). RESULTS/UNASSIGNED:Among 100 admissions, compared with physician HCs, residents edited a smaller percentage of LLM HCs (LLM mean [SD], 31.5% [16.6%] vs physicians, 44.8% [20.0%]; P < .001). Additionally, LLM HCs required less semantic change (LLM mean [SD], 2.4% [1.6%] vs physicians, 4.9% [3.5%]; P < .001).
Attending physicians deemed LLM HCs to be more complete (mean [SD] difference LLM vs physicians on 10-point bidirectional scale, 3.00 [5.28]; P < .001), similarly concise (mean [SD], -1.02 [6.08]; P = .20), and cohesive (mean [SD], 0.70 [6.14]; P = .60), but with more confabulations (mean [SD], -0.98 [3.53]; P = .002). The composite scores were similar (mean [SD] difference LLM vs physician on 40-point bidirectional scale, 1.70 [14.24]; P = .46). CONCLUSIONS AND RELEVANCE/UNASSIGNED:Electronic health record-embedded LLM HCs required less editing than physician-generated HCs to approach a quality standard, resulting in HCs that were comparably or more complete, concise, and cohesive, but contained more confabulations. Despite the potential influence of artificial time constraints, this study supports the feasibility of a physician-LLM partnership for writing HCs and provides a basis for monitoring LLM HCs in clinical practice.
PMID: 40802185
ISSN: 2574-3805
CID: 5906762
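The study above quantifies editing effort as the percentage of each draft that was edited, controlling for length. The abstract does not specify the metric's implementation, but a length-controlled, character-diff measure of this kind can be sketched with Python's standard difflib (an illustrative sketch only, not the authors' actual method):

```python
import difflib

def percent_edited(original: str, edited: str) -> float:
    """Share of the original text's characters that were changed or removed;
    a rough, length-controlled proxy for editing effort (illustrative only)."""
    if not original:
        return 0.0
    matcher = difflib.SequenceMatcher(None, original, edited)
    # Total size of character runs that survived the edit unchanged.
    unchanged = sum(block.size for block in matcher.get_matching_blocks())
    return 100.0 * (1 - unchanged / len(original))

draft = "Patient admitted with pneumonia, treated with antibiotics."
revision = "Patient admitted with community-acquired pneumonia; treated with IV antibiotics."
print(round(percent_edited(draft, revision), 1))
```

The study's second outcome, semantic change, would require a meaning-aware comparison (eg, embedding similarity) rather than a surface diff, and is beyond this sketch.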
Classifying Continuous Glucose Monitoring Documents From Electronic Health Records
Zheng, Yaguang; Iturrate, Eduardo; Li, Lehan; Wu, Bei; Small, William R; Zweig, Susan; Fletcher, Jason; Chen, Zhihao; Johnson, Stephen B
BACKGROUND:Clinical use of continuous glucose monitoring (CGM) is increasing the storage of CGM-related documents in electronic health records (EHRs); however, standardization of CGM document storage is lacking. We aimed to evaluate the sensitivity and specificity of CGM Ambulatory Glucose Profile (AGP) classification criteria. METHODS:We randomly chose 2244 (18.1%) documents from NYU Langone Health. Our document classification algorithm: (1) separated multiple-page documents into single-page images; (2) rotated all pages into an upright orientation; (3) determined the type of device using optical character recognition; and (4) tested for the presence of particular keywords in the text. Two experts in using CGM for research and clinical practice conducted an independent manual review of 62 (2.8%) reports. We calculated sensitivity (correct classification of a CGM AGP report) and specificity (correct classification of a non-CGM report) by comparing the classification algorithm against manual review. RESULTS:Among 2244 documents, 1040 (46.5%) were classified as CGM AGP reports (43.3% FreeStyle Libre and 56.7% Dexcom), 1170 (52.1%) as non-CGM reports (eg, progress notes, CGM request forms, or physician letters), and 34 (1.5%) as uncertain documents. Agreement between the two experts was 100% for sensitivity and 98.4% for specificity. When comparing the algorithm's classifications against manual review, sensitivity and specificity were 95.0% and 91.7%, respectively. CONCLUSION/CONCLUSIONS:Nearly half of CGM-related documents were AGP reports, which are useful for clinical practice and diabetes research; the remaining half were other clinical documents. Future work needs to standardize the storage of CGM-related documents in the EHR.
PMCID:11904921
PMID: 40071848
ISSN: 1932-2968
CID: 5808452
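Steps (1) through (3) of the classification algorithm above require imaging and OCR tooling; step (4), the keyword test, can be sketched on already-extracted page text. The keywords and labels below are hypothetical stand-ins, not the study's actual criteria:

```python
def classify_cgm_text(page_text: str) -> str:
    """Classify OCR-extracted page text as a CGM AGP report or not.
    Keyword lists are hypothetical examples, not the study's criteria."""
    text = page_text.lower()
    # Phrases that plausibly mark an Ambulatory Glucose Profile report.
    agp_keywords = ("ambulatory glucose profile",
                    "time in range",
                    "glucose management indicator")
    device_keywords = {"freestyle libre": "FreeStyle Libre",
                       "dexcom": "Dexcom"}
    if any(keyword in text for keyword in agp_keywords):
        for key, device in device_keywords.items():
            if key in text:
                return f"CGM AGP report ({device})"
        return "CGM AGP report (unknown device)"
    return "non-CGM document"
```

In a full pipeline, this function would run on the text produced by the OCR step for each upright single-page image, with an "uncertain" bucket for pages whose OCR output is too sparse to test.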
How Point (Single-Probability) Tasks Are Affected by Probability Format, Part 2: A Making Numbers Meaningful Systematic Review
Ancker, Jessica S; Benda, Natalie C; Sharma, Mohit M; Johnson, Stephen B; Demetres, Michelle; Delgado, Diana; Zikmund-Fisher, Brian J
UNLABELLED: HIGHLIGHTS/UNASSIGNED:Formatting a probability as 1 in X, using a foreground-only icon array, adding anecdotes to numbers, and gain-loss framing all affect probability perceptions and feelings. The evidence on communicating numbers to influence perceptions is far stronger than the evidence on using it to change health behavior or behavioral intention. Only weak evidence is available on patient preferences for verbal, graphical, and numerical probability formats.
PMCID:11848894
PMID: 39995775
ISSN: 2381-4683
CID: 5800662
How Difference Tasks Are Affected by Probability Format, Part 1: A Making Numbers Meaningful Systematic Review
Benda, Natalie C; Zikmund-Fisher, Brian J; Sharma, Mohit M; Johnson, Stephen B; Demetres, Michelle; Delgado, Diana; Ancker, Jessica S
UNLABELLED: HIGHLIGHTS/UNASSIGNED:than with 1 in X rates. Adding graphics to probabilities helps readers compute differences between probabilities.
PMCID:11848882
PMID: 39995776
ISSN: 2381-4683
CID: 5800672
Scope, Methods, and Overview Findings for the Making Numbers Meaningful Evidence Review of Communicating Probabilities in Health: A Systematic Review
Ancker, Jessica S; Benda, Natalie C; Sharma, Mohit M; Johnson, Stephen B; Demetres, Michelle; Delgado, Diana; Zikmund-Fisher, Brian J
UNLABELLED: HIGHLIGHTS/UNASSIGNED:The Making Numbers Meaningful project conducted a comprehensive systematic review of experimental and quasi-experimental research that compared 2 or more formats for presenting quantitative health information to patients or other lay audiences. The current article focuses on probability information. Based on a conceptual taxonomy, we reviewed studies based on the cognitive tasks required of participants, assessing 14 distinct possible outcomes. Our review identified 316 articles involving probability communications that generated 1,119 distinct research findings, each of which was reviewed by multiple experts for credibility. The overall pattern of findings highlights which probability communication questions have been well researched and which have not. For example, there has been far more research on communicating single probabilities than on communicating more complex information such as trends over time, and there has been a large amount of research on the effect of communication approaches on behavioral intentions but relatively little on behaviors.
PMCID:11848889
PMID: 39995784
ISSN: 2381-4683
CID: 5800712
How Difference Tasks Are Affected by Probability Format, Part 2: A Making Numbers Meaningful Systematic Review
Benda, Natalie C; Zikmund-Fisher, Brian J; Sharma, Mohit M; Johnson, Stephen B; Demetres, Michelle; Delgado, Diana; Ancker, Jessica S
UNLABELLED: HIGHLIGHTS/UNASSIGNED:Communicating relative risk differences as opposed to absolute risk differences, using numerator-only instead of part-to-whole graphics, and including anecdotes or information about others' decisions will all increase intentions to engage in a behavior. Relative risks (rather than absolute risk differences) and numerator-only graphics (rather than part-to-whole) will also increase felt and perceived effectiveness. To illustrate probability differences, people tend to prefer bar charts over icon arrays and graphics with labels over those without. All findings regarding the impact of different presentation formats for probability differences on trust produced insufficient evidence.
PMCID:11907595
PMID: 40094048
ISSN: 2381-4683
CID: 5813012
How Point (Single-Probability) Tasks Are Affected by Probability Format, Part 1: A Making Numbers Meaningful Systematic Review
Ancker, Jessica S; Benda, Natalie C; Sharma, Mohit M; Johnson, Stephen B; Demetres, Michelle; Delgado, Diana; Zikmund-Fisher, Brian J
UNLABELLED: HIGHLIGHTS/UNASSIGNED:Many researchers have studied the effects of data presentation formats of single probabilities on different outcomes. However, few findings are comparable enough to allow for strong evidence-based conclusions about the impact on identification, recall, contrast, categorization, and computation outcomes.
PMCID:11848880
PMID: 39995779
ISSN: 2381-4683
CID: 5800692
How Synthesis Tasks Are Affected by Probability Format: A Making Numbers Meaningful Systematic Review
Benda, Natalie C; Sharma, Mohit M; Ancker, Jessica S; Demetres, Michelle; Delgado, Diana; Johnson, Stephen B; Zikmund-Fisher, Brian J
UNLABELLED: HIGHLIGHTS/UNASSIGNED:This study found a moderate number of studies assessing strategies for evaluating sets of probabilities conveying information such as risks and benefits. Evidence is moderate that although presenting sets of probabilities in tables versus sentences may not affect behavioral intentions, people may prefer tables. Contrary to previous studies about probability feelings, moderate evidence suggested that narratives may not affect effectiveness feelings. Evidence was insufficient to draw conclusions regarding contrast, identification, and trust outcomes, and no studies assessed recall, categorization, computation, or discrimination outcomes.
PMCID:11848887
PMID: 39995777
ISSN: 2381-4683
CID: 5800682
How Time-Trend Tasks Are Affected by Probability Format: A Making Numbers Meaningful Systematic Review
Sharma, Mohit M; Ancker, Jessica S; Benda, Natalie C; Johnson, Stephen B; Demetres, Michelle; Delgado, Diana; Zikmund-Fisher, Brian J
UNLABELLED: HIGHLIGHTS/UNASSIGNED:This systematic review found that few studies of probability trend data compared similar formats or used comparable outcome measures. The only strong piece of evidence was that graphing probabilities over longer time periods such that the distance between curves widens will tend to increase the perceived difference between the curves. Weak evidence suggests that survival curves (versus mortality curves) may make it easier to identify the option with the highest overall survival. Weak evidence suggests that graphing probabilities over longer (rather than shorter) time periods may increase the ability to distinguish between small survival differences. Evidence was insufficient to determine whether any format influenced behaviors or behavioral intentions.
PMCID:11848886
PMID: 39995781
ISSN: 2381-4683
CID: 5800702
Large Language Model-Based Responses to Patients' In-Basket Messages
Small, William R; Wiesenfeld, Batia; Brandfield-Harvey, Beatrix; Jonassen, Zoe; Mandal, Soumik; Stevens, Elizabeth R; Major, Vincent J; Lostraglio, Erin; Szerencsy, Adam; Jones, Simon; Aphinyanaphongs, Yindalon; Johnson, Stephen B; Nov, Oded; Mann, Devin
IMPORTANCE/UNASSIGNED:Virtual patient-physician communications have increased since 2020 and negatively impacted primary care physician (PCP) well-being. Generative artificial intelligence (GenAI) drafts of patient messages could potentially reduce health care professional (HCP) workload and improve communication quality, but only if the drafts are considered useful. OBJECTIVES/UNASSIGNED:To assess PCPs' perceptions of GenAI drafts and to examine linguistic characteristics associated with equity and perceived empathy. DESIGN, SETTING, AND PARTICIPANTS/UNASSIGNED:This cross-sectional quality improvement study tested the hypothesis that PCPs' ratings of GenAI drafts (created using the electronic health record [EHR] standard prompts) would be equivalent to HCP-generated responses on 3 dimensions. The study was conducted at NYU Langone Health using private patient-HCP communications at 3 internal medicine practices piloting GenAI. EXPOSURES/UNASSIGNED:Randomly assigned patient messages coupled with either an HCP message or the draft GenAI response. MAIN OUTCOMES AND MEASURES/UNASSIGNED:PCPs rated responses' information content quality (eg, relevance), using a Likert scale, communication quality (eg, verbosity), using a Likert scale, and whether they would use the draft or start anew (usable vs unusable). Branching logic further probed for empathy, personalization, and professionalism of responses. Computational linguistics methods assessed content differences in HCP vs GenAI responses, focusing on equity and empathy. RESULTS/UNASSIGNED:A total of 16 PCPs (8 [50.0%] female) reviewed 344 messages (175 GenAI drafted; 169 HCP drafted). Both GenAI and HCP responses were rated favorably. 
GenAI responses were rated higher for communication style than HCP responses (mean [SD], 3.70 [1.15] vs 3.38 [1.20]; P = .01; U = 12 568.5) but were similar to HCPs on information content (mean [SD], 3.53 [1.26] vs 3.41 [1.27]; P = .37; U = 13 981.0) and usable draft proportion (mean [SD], 0.69 [0.48] vs 0.65 [0.47]; P = .49; t = -0.6842). Usable GenAI responses were considered more empathetic than usable HCP responses (32 of 86 [37.2%] vs 13 of 79 [16.5%]; difference, 125.5%), possibly attributable to more subjective (mean [SD], 0.54 [0.16] vs 0.31 [0.23]; P < .001; difference, 74.2%) and positive (mean [SD] polarity, 0.21 [0.14] vs 0.13 [0.25]; P = .02; difference, 61.5%) language. Usable GenAI responses were also numerically longer (mean [SD] word count, 90.5 [32.0] vs 65.4 [62.6]; difference, 38.4%), although the difference was not statistically significant (P = .07), and more linguistically complex (mean [SD] score, 125.2 [47.8] vs 95.4 [58.8]; P = .002; difference, 31.2%). CONCLUSIONS/UNASSIGNED:In this cross-sectional study of PCP perceptions of an EHR-integrated GenAI chatbot, GenAI was found to communicate information better and with more empathy than HCPs, highlighting its potential to enhance patient-HCP communication. However, GenAI drafts were less readable than HCPs', a significant concern for patients with low health or English literacy.
PMCID:11252893
PMID: 39012633
ISSN: 2574-3805
CID: 5686582
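The study above compares the subjectivity and polarity of message language, but its abstract does not name the computational linguistics toolchain used. A toy lexicon-based scorer illustrates the general idea; the lexicons and function below are hypothetical, not the study's method:

```python
# Hypothetical mini-lexicons for illustration; real analyses use
# validated sentiment lexicons or trained models.
POSITIVE = {"glad", "happy", "great", "reassuring", "good", "pleased"}
NEGATIVE = {"concerning", "bad", "worse", "unfortunately", "pain"}
SUBJECTIVE = POSITIVE | NEGATIVE | {"feel", "believe", "hope", "worry"}

def polarity_subjectivity(message: str) -> tuple[float, float]:
    """Return (polarity, subjectivity) for a message.
    Polarity: balance of positive vs negative words, -1 to 1.
    Subjectivity: share of words in the subjective lexicon, 0 to 1."""
    words = [w.strip(".,!?").lower() for w in message.split()]
    if not words:
        return 0.0, 0.0
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    subj = sum(w in SUBJECTIVE for w in words)
    polarity = (pos - neg) / max(pos + neg, 1)
    return polarity, subj / len(words)
```

Scores of this shape, computed per response, are the kind of feature that can then be compared between GenAI and HCP drafts with a rank-sum or t test.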