Searched for: in-biosketch:yes person:joness22
Total Results: 195


Evaluating Hospital Course Summarization by an Electronic Health Record-Based Large Language Model

Small, William R.; Austrian, Jonathan; O'Donnell, Luke; Burk-Rafel, Jesse; Hochman, Katherine A.; Goodman, Adam; Zaretsky, Jonah; Martin, Jacob; Johnson, Stephen; Major, Vincent J.; Jones, Simon; Henke, Christian; Verplanke, Benjamin; Osso, Jwan; Larson, Ian; Saxena, Archana; Mednick, Aron; Simonis, Choumika; Han, Joseph; Kesari, Ravi; Wu, Xinyuan; Heery, Lauren; Desel, Tenzin; Baskharoun, Samuel; Figman, Noah; Farooq, Umar; Shah, Kunal; Jahan, Nusrat; Kim, Jeong Min; Testa, Paul; Feldman, Jonah
ISI:001551557000002
ISSN: 2574-3805
CID: 5974192

Evaluating Large Language Models in extracting cognitive exam dates and scores

Zhang, Hao; Jethani, Neil; Jones, Simon; Genes, Nicholas; Major, Vincent J; Jaffe, Ian S; Cardillo, Anthony B; Heilenbach, Noah; Ali, Nadia Fazal; Bonanni, Luke J; Clayburn, Andrew J; Khera, Zain; Sadler, Erica C; Prasad, Jaideep; Schlacter, Jamie; Liu, Kevin; Silva, Benjamin; Montgomery, Sophie; Kim, Eric J; Lester, Jacob; Hill, Theodore M; Avoricani, Alba; Chervonski, Ethan; Davydov, James; Small, William; Chakravartty, Eesha; Grover, Himanshu; Dodson, John A; Brody, Abraham A; Aphinyanaphongs, Yindalon; Masurkar, Arjun; Razavian, Narges
Ensuring the reliability of Large Language Models (LLMs) in clinical tasks is crucial. Our study assesses two state-of-the-art LLMs (ChatGPT and LlaMA-2) for extracting clinical information, focusing on cognitive tests like MMSE and CDR. Our data consisted of 135,307 clinical notes (Jan 12th, 2010 to May 24th, 2023) mentioning MMSE, CDR, or MoCA. After applying inclusion criteria, 34,465 notes remained, of which 765 were processed by ChatGPT (GPT-4) and LlaMA-2, and 22 experts reviewed the responses. ChatGPT successfully extracted MMSE and CDR instances with dates from 742 notes. We used 20 notes for fine-tuning and training the reviewers. The remaining 722 were assigned to reviewers, of which 309 were assigned to two reviewers simultaneously. Inter-rater agreement (Fleiss' Kappa), precision, recall, true/false negative rates, and accuracy were calculated. Our study follows TRIPOD reporting guidelines for model validation. For MMSE information extraction, ChatGPT (vs. LlaMA-2) achieved accuracy of 83% (vs. 66.4%), sensitivity of 89.7% (vs. 69.9%), true-negative rates of 96% (vs. 60.0%), and precision of 82.7% (vs. 62.2%). For CDR information extraction, precision was markedly lower, with accuracy of 87.1% (vs. 74.5%), sensitivity of 84.3% (vs. 39.7%), true-negative rates of 99.8% (vs. 98.4%), and precision of 48.3% (vs. 16.1%). We qualitatively evaluated the MMSE errors of ChatGPT and LlaMA-2 on double-reviewed notes. LlaMA-2 errors included 27 cases of total hallucination, 19 cases of reporting other scores instead of MMSE, 25 missed scores, and 23 cases of reporting only the wrong date. In comparison, ChatGPT's errors included only 3 cases of total hallucination, 17 cases of reporting the wrong test instead of MMSE, and 19 cases of reporting a wrong date. In this diagnostic/prognostic study of ChatGPT and LlaMA-2 for extracting cognitive exam dates and scores from clinical notes, ChatGPT exhibited high accuracy and outperformed LlaMA-2. The use of LLMs could benefit dementia research and clinical care by identifying patients eligible for treatment initiation or clinical trial enrollment. Rigorous evaluation of LLMs is crucial to understanding their capabilities and limitations.
PMCID:11634005
PMID: 39661652
ISSN: 2767-3170
CID: 5762692
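
The metrics named in the abstract above (accuracy, sensitivity, true-negative rate, precision, and Fleiss' Kappa for double-reviewed notes) can be illustrated with a minimal Python sketch. The labels below are hypothetical toy data, not the study's notes.

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical per-note labels: 1 = note contains an MMSE score, 0 = it does not.
expert = np.array([1, 1, 0, 1, 0, 0, 1, 0])  # reviewer ground truth
llm    = np.array([1, 0, 0, 1, 0, 1, 1, 0])  # LLM extraction result

tn, fp, fn, tp = confusion_matrix(expert, llm).ravel()
print("accuracy:   ", (tp + tn) / len(expert))
print("sensitivity:", tp / (tp + fn))
print("tn rate:    ", tn / (tn + fp))
print("precision:  ", tp / (tp + fp))

# Inter-rater agreement on double-reviewed notes: rows = notes,
# columns = the two reviewers' category codes.
ratings = np.array([[1, 1], [1, 0], [0, 0], [1, 1], [0, 0]])
table, _ = aggregate_raters(ratings)
print("Fleiss' kappa:", fleiss_kappa(table))
```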

Quality of care after a horizontal merger between two large academic hospitals

Wissink, Ilse J A; Schinkel, Michiel; Peters-Sengers, Hessel; Jones, Simon A; Vlaar, Alexander P J; Kruijthof, Karen J; Wiersinga, W Joost
BACKGROUND/UNASSIGNED:Hospital mergers remain common, but their influence on healthcare quality varies. Data on the effects of European hospital mergers, and of academic hospital mergers in particular, are limited. This case study assesses early quality-of-care changes in two formerly competing Dutch academic hospitals that merged on June 6, 2018. METHODS/UNASSIGNED:Statistical process control and interrupted time series analysis were performed. All adult, non-psychiatric patients admitted between 01-03-2016 and 01-10-2022 were eligible for analysis. The primary outcome measure was all-cause in-hospital mortality (or hospice); secondary outcomes were unplanned 30-day readmissions to the same hospital, length of stay, and patients' hospital rating. Data were obtained from electronic health records and patient experience surveys. FINDINGS/UNASSIGNED:The mean (SD) age of the 573 813 included patients was 54·3 (18·9) years. Slightly under half were female (277 817, 48·4 %), and most admissions were acute (308 597, 53·8 %). No merger-related change in mortality was found in the first 20 months post-merger (limited to the pre-Covid-19 era). For this same period, the 30-day readmission incidence changed to a downward slope post-merger, and the length of stay shortened (immediate level-change -3·796 % (95 % CI, -5·776 % to -1·816 %) and trend-change -0·150 % per month (95 % CI, -0·307 % to 0·007 %)). Patients' hospital ratings seemed to improve post-merger. INTERPRETATION/UNASSIGNED:In this quality improvement study, a full and gradual post-merger integration strategy for a Dutch academic hospital merger was not associated with changes in in-hospital mortality and yielded slightly improved results for secondary quality-of-care outcomes.
PMCID:11490856
PMID: 39430517
ISSN: 2405-8440
CID: 5739502
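
The interrupted time series analysis named above typically takes the form of a segmented regression with an immediate level-change term and a trend-change term. A minimal sketch follows, using simulated monthly length-of-stay values and an assumed merger month rather than the study's data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({"month": np.arange(48)})      # months since study start
MERGER = 27                                      # hypothetical merger month
df["post"] = (df["month"] >= MERGER).astype(int)
df["months_post"] = np.maximum(0, df["month"] - MERGER)
# Simulated outcome with a small level drop and trend change at the merger.
df["los_pct"] = (100 - 0.1 * df["month"] - 3.8 * df["post"]
                 - 0.15 * df["months_post"] + rng.normal(0, 1, 48))

# Coefficient on `post` = immediate level change;
# coefficient on `months_post` = post-merger trend change.
fit = smf.ols("los_pct ~ month + post + months_post", data=df).fit()
print(fit.params[["post", "months_post"]])
```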

Development and evaluation of an artificial intelligence-based workflow for the prioritization of patient portal messages

Yang, Jie; So, Jonathan; Zhang, Hao; Jones, Simon; Connolly, Denise M; Golding, Claudia; Griffes, Esmelin; Szerencsy, Adam C; Wu, Tzer Jason; Aphinyanaphongs, Yindalon; Major, Vincent J
OBJECTIVES/UNASSIGNED:Accelerating demand for patient messaging has impacted the practice of many providers. Messages are not recommended for urgent medical issues, but some do require rapid attention. This presents an opportunity for artificial intelligence (AI) methods to prioritize review of messages. Our study aimed to highlight some patient portal messages for prioritized review using a custom AI system integrated into the electronic health record (EHR). MATERIALS AND METHODS/UNASSIGNED:We developed a Bidirectional Encoder Representations from Transformers (BERT)-based large language model using 40 132 patient-sent messages to identify patterns involving high-acuity topics that warrant an immediate callback. The model was then implemented into 2 shared pools of patient messages managed by dozens of registered nurses. The primary outcome, the time before messages were read, was evaluated with a difference-in-differences methodology. RESULTS/UNASSIGNED:Among messages in the evaluation period (n = 396 466), an improvement exceeding the trend was observed in the time high-scoring messages sat unread (21 minutes, 63 vs 42 for messages sent outside business hours). DISCUSSION/UNASSIGNED:Our work shows great promise in improving care when AI is aligned with human workflow. Future work involves audience expansion, aiding users with suggested actions, and drafting responses. CONCLUSION/UNASSIGNED:Many patients utilize patient portal messages, and while most messages are routine, a small fraction describe alarming symptoms. Our AI-based workflow shortens the turnaround time for a trained clinician to review these messages, supporting safer, higher-quality care.
PMCID:11328532
PMID: 39156046
ISSN: 2574-2531
CID: 5680362
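
A difference-in-differences estimate like the one described above compares the change in time-unread for messages handled under the AI workflow against the contemporaneous change for comparison messages. The sketch below is illustrative only; the pool indicator, go-live flag, and effect size are hypothetical, not the study's variables.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 2000
df = pd.DataFrame({
    "ai_pool": rng.integers(0, 2, n),  # message handled in the AI-scored pool
    "post": rng.integers(0, 2, n),     # sent after model go-live
})
# Simulated outcome: a 21-minute improvement for AI-pool messages post go-live.
df["minutes_unread"] = 60 - 21 * df["ai_pool"] * df["post"] + rng.normal(0, 10, n)

# Under the parallel-trends assumption, the interaction term is the effect.
fit = smf.ols("minutes_unread ~ ai_pool * post", data=df).fit()
print(fit.params["ai_pool:post"])
```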

Large Language Model-Based Responses to Patients' In-Basket Messages

Small, William R; Wiesenfeld, Batia; Brandfield-Harvey, Beatrix; Jonassen, Zoe; Mandal, Soumik; Stevens, Elizabeth R; Major, Vincent J; Lostraglio, Erin; Szerencsy, Adam; Jones, Simon; Aphinyanaphongs, Yindalon; Johnson, Stephen B; Nov, Oded; Mann, Devin
IMPORTANCE/UNASSIGNED:Virtual patient-physician communications have increased since 2020 and negatively impacted primary care physician (PCP) well-being. Generative artificial intelligence (GenAI) drafts of patient messages could potentially reduce health care professional (HCP) workload and improve communication quality, but only if the drafts are considered useful. OBJECTIVES/UNASSIGNED:To assess PCPs' perceptions of GenAI drafts and to examine linguistic characteristics associated with equity and perceived empathy. DESIGN, SETTING, AND PARTICIPANTS/UNASSIGNED:This cross-sectional quality improvement study tested the hypothesis that PCPs' ratings of GenAI drafts (created using the electronic health record [EHR] standard prompts) would be equivalent to HCP-generated responses on 3 dimensions. The study was conducted at NYU Langone Health using private patient-HCP communications at 3 internal medicine practices piloting GenAI. EXPOSURES/UNASSIGNED:Randomly assigned patient messages coupled with either an HCP message or the draft GenAI response. MAIN OUTCOMES AND MEASURES/UNASSIGNED:PCPs rated responses' information content quality (eg, relevance) and communication quality (eg, verbosity), each on a Likert scale, and indicated whether they would use the draft or start anew (usable vs unusable). Branching logic further probed for empathy, personalization, and professionalism of responses. Computational linguistics methods assessed content differences in HCP vs GenAI responses, focusing on equity and empathy. RESULTS/UNASSIGNED:A total of 16 PCPs (8 [50.0%] female) reviewed 344 messages (175 GenAI drafted; 169 HCP drafted). Both GenAI and HCP responses were rated favorably. GenAI responses were rated higher for communication style than HCP responses (mean [SD], 3.70 [1.15] vs 3.38 [1.20]; P = .01, U = 12 568.5) but were similar to HCPs on information content (mean [SD], 3.53 [1.26] vs 3.41 [1.27]; P = .37; U = 13 981.0) and usable draft proportion (mean [SD], 0.69 [0.48] vs 0.65 [0.47], P = .49, t = -0.6842). Usable GenAI responses were considered more empathetic than usable HCP responses (32 of 86 [37.2%] vs 13 of 79 [16.5%]; difference, 125.5%), possibly attributable to more subjective (mean [SD], 0.54 [0.16] vs 0.31 [0.23]; P < .001; difference, 74.2%) and positive (mean [SD] polarity, 0.21 [0.14] vs 0.13 [0.25]; P = .02; difference, 61.5%) language. Usable GenAI responses were also numerically longer (mean [SD] word count, 90.5 [32.0] vs 65.4 [62.6]; difference, 38.4%), although this difference was not statistically significant (P = .07), and were more linguistically complex (mean [SD] score, 125.2 [47.8] vs 95.4 [58.8]; P = .002; difference, 31.2%). CONCLUSIONS/UNASSIGNED:In this cross-sectional study of PCP perceptions of an EHR-integrated GenAI chatbot, GenAI was found to communicate information better and with more empathy than HCPs, highlighting its potential to enhance patient-HCP communication. However, GenAI drafts were less readable than HCPs', a significant concern for patients with low health or English literacy.
PMCID:11252893
PMID: 39012633
ISSN: 2574-3805
CID: 5686582
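
The U statistics above indicate Mann-Whitney tests, and the polarity and subjectivity measures are standard sentiment features; the study's exact toolchain is not specified here, but TextBlob is one common way to compute them. A minimal sketch with hypothetical example messages:

```python
from scipy.stats import mannwhitneyu
from textblob import TextBlob

genai_msgs = ["I'm so sorry you're feeling unwell; let's get you seen soon.",
              "That sounds uncomfortable. We can absolutely help with this."]
hcp_msgs   = ["Take ibuprofen 400 mg with food.",
              "Call the office if not better in 3 days."]

def polarity(msgs):
    return [TextBlob(m).sentiment.polarity for m in msgs]

def subjectivity(msgs):
    return [TextBlob(m).sentiment.subjectivity for m in msgs]

print(polarity(genai_msgs), subjectivity(genai_msgs))
print(polarity(hcp_msgs), subjectivity(hcp_msgs))

# With realistic sample sizes, compare the two distributions:
stat, p = mannwhitneyu(polarity(genai_msgs), polarity(hcp_msgs))
print(stat, p)
```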

Evaluating Large Language Models in Extracting Cognitive Exam Dates and Scores

Zhang, Hao; Jethani, Neil; Jones, Simon; Genes, Nicholas; Major, Vincent J; Jaffe, Ian S; Cardillo, Anthony B; Heilenbach, Noah; Ali, Nadia Fazal; Bonanni, Luke J; Clayburn, Andrew J; Khera, Zain; Sadler, Erica C; Prasad, Jaideep; Schlacter, Jamie; Liu, Kevin; Silva, Benjamin; Montgomery, Sophie; Kim, Eric J; Lester, Jacob; Hill, Theodore M; Avoricani, Alba; Chervonski, Ethan; Davydov, James; Small, William; Chakravartty, Eesha; Grover, Himanshu; Dodson, John A; Brody, Abraham A; Aphinyanaphongs, Yindalon; Masurkar, Arjun; Razavian, Narges
IMPORTANCE/UNASSIGNED:Large language models (LLMs) are crucial for medical tasks. Ensuring their reliability is vital to avoid false results. Our study assesses two state-of-the-art LLMs (ChatGPT and LlaMA-2) for extracting clinical information, focusing on cognitive tests like MMSE and CDR. OBJECTIVE/UNASSIGNED:Evaluate ChatGPT and LlaMA-2 performance in extracting MMSE and CDR scores, including their associated dates. METHODS/UNASSIGNED:Our data consisted of 135,307 clinical notes (Jan 12th, 2010 to May 24th, 2023) mentioning MMSE, CDR, or MoCA. After applying inclusion criteria, 34,465 notes remained, of which 765 were processed by ChatGPT (GPT-4) and LlaMA-2, and 22 experts reviewed the responses. ChatGPT successfully extracted MMSE and CDR instances with dates from 742 notes. We used 20 notes for fine-tuning and training the reviewers. The remaining 722 were assigned to reviewers, of which 309 were assigned to two reviewers simultaneously. Inter-rater agreement (Fleiss' Kappa), precision, recall, true/false negative rates, and accuracy were calculated. Our study follows TRIPOD reporting guidelines for model validation. RESULTS/UNASSIGNED:For MMSE information extraction, ChatGPT (vs. LlaMA-2) achieved accuracy of 83% (vs. 66.4%), sensitivity of 89.7% (vs. 69.9%), true-negative rates of 96% (vs. 60.0%), and precision of 82.7% (vs. 62.2%). For CDR information extraction, precision was markedly lower, with accuracy of 87.1% (vs. 74.5%), sensitivity of 84.3% (vs. 39.7%), true-negative rates of 99.8% (vs. 98.4%), and precision of 48.3% (vs. 16.1%). We qualitatively evaluated the MMSE errors of ChatGPT and LlaMA-2 on double-reviewed notes. LlaMA-2 errors included 27 cases of total hallucination, 19 cases of reporting other scores instead of MMSE, 25 missed scores, and 23 cases of reporting only the wrong date. In comparison, ChatGPT's errors included only 3 cases of total hallucination, 17 cases of reporting the wrong test instead of MMSE, and 19 cases of reporting a wrong date. CONCLUSIONS/UNASSIGNED:In this diagnostic/prognostic study of ChatGPT and LlaMA-2 for extracting cognitive exam dates and scores from clinical notes, ChatGPT exhibited high accuracy and outperformed LlaMA-2. The use of LLMs could benefit dementia research and clinical care by identifying patients eligible for treatment initiation or clinical trial enrollment. Rigorous evaluation of LLMs is crucial to understanding their capabilities and limitations.
PMCID:10888985
PMID: 38405784
CID: 5722422

Predicting Robotic Hysterectomy Incision Time: Optimizing Surgical Scheduling with Machine Learning

Shah, Vaishali; Yung, Halley C; Yang, Jie; Zaslavsky, Justin; Algarroba, Gabriela N; Pullano, Alyssa; Karpel, Hannah C; Munoz, Nicole; Aphinyanaphongs, Yindalon; Saraceni, Mark; Shah, Paresh; Jones, Simon; Huang, Kathy
BACKGROUND AND OBJECTIVES/UNASSIGNED:Operating rooms (ORs) are critical for hospital revenue and cost management, with utilization efficiency directly affecting financial outcomes. Traditional surgical scheduling often results in suboptimal OR use. We aim to build a machine learning (ML) model to predict incision times for robotic-assisted hysterectomies, enhancing scheduling accuracy and hospital finances. METHODS/UNASSIGNED:A retrospective study was conducted using data from robotic-assisted hysterectomy cases performed between January 2017 and April 2021 across 3 hospitals within a large academic health system. Cases were filtered for surgeries performed by high-volume surgeons and those with an incision time of under 3 hours (n = 2,702). Features influencing incision time were extracted from electronic medical records and used to train 5 ML models (linear ridge regression, random forest, XGBoost, CatBoost, and explainable boosting machine [EBM]). Model performance was evaluated using a dynamic monthly update process and novel metrics such as wait-time blocks and excess-time blocks. RESULTS/UNASSIGNED:The model significantly reduced wait-time blocks relative to traditional scheduling (P < .001, 95% CI [-329 to -89]), translating to approximately 52 hours over the 51-month study period. The model predicted more surgeries within a 15% range of the true incision time compared to traditional methods. Influential features included surgeon experience, number of additional procedures, body mass index (BMI), and uterine size. CONCLUSION/UNASSIGNED:The ML model enhanced the prediction of incision times for robotic-assisted hysterectomies, providing a potential solution to reduce OR underutilization and increase surgical throughput and hospital revenue.
PMCID:11741200
PMID: 39831273
ISSN: 1938-3797
CID: 5778432
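
The dynamic monthly update process described above can be approximated with an expanding training window: each month's cases are predicted by a model fit only on earlier months. The sketch below uses simulated cases and one of the learner families named in the abstract (gradient-boosted trees); the feature names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(2)
n = 2702
df = pd.DataFrame({
    "month": rng.integers(0, 51, n),            # months since study start
    "surgeon_volume": rng.integers(20, 400, n),
    "n_extra_procedures": rng.integers(0, 4, n),
    "bmi": rng.normal(30, 6, n),
    "uterine_size_cm": rng.normal(10, 3, n),
})
df["incision_min"] = (150 - 0.1 * df["surgeon_volume"]
                      + 12 * df["n_extra_procedures"] + rng.normal(0, 15, n))
features = ["surgeon_volume", "n_extra_procedures", "bmi", "uterine_size_cm"]

# Expanding-window monthly update: predict month m with a model trained
# only on months before m, mirroring deployment.
for m in range(12, 51):
    train, test = df[df["month"] < m], df[df["month"] == m]
    model = GradientBoostingRegressor().fit(train[features], train["incision_min"])
    pred = model.predict(test[features])
    hit = np.abs(pred - test["incision_min"]) <= 0.15 * test["incision_min"]
    # accumulate hit.mean() per month to compare against traditional scheduling
```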

Impact of Patient-Clinician Relationships on Pain and Objective Functional Measures for Individuals with Chronic Low Back Pain: An Experimental Study

Vorensky, Mark; Squires, Allison; Jones, Simon; Sajnani, Nisha; Castillo, Elijah; Rao, Smita
PURPOSE:To compare the effects of enhanced and limited patient-clinician relationships during patient history taking on objective functional measures and pain appraisals for individuals with chronic low back pain (CLBP). METHODS:Fifty-two (52) participants with CLBP, unaware of the two groups, were randomized using concealed allocation to an enhanced (n=26) or limited (n=26) patient-clinician relationship condition. Participants shared their history of CLBP with a clinician who enacted either enhanced or limited communication strategies. Fingertip-to-floor, one-minute lift, and Biering-Sorensen tests, and visual analogue scale for pain at rest were assessed before and after the patient-clinician relationship conditions. FINDINGS:The enhanced condition resulted in significantly greater improvements in the one-minute lift test (F(1,49)=7.47, p<.01, ηp²=0.13) and pain at rest (F(1,46)=4.63, p=.04, ηp²=0.09), but not the fingertip-to-floor or Biering-Sorensen tests, compared with the limited group. CONCLUSIONS:Even without physical treatment, differences in patient-clinician relationships acutely affected lifting performance and pain among individuals with CLBP.
PMID: 39584210
ISSN: 1548-6869
CID: 5779832
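
The F(1, df) statistics above are consistent with a group-by-time mixed ANOVA. A minimal sketch using pingouin follows, with simulated pre/post lift scores standing in for the study's measures.

```python
import numpy as np
import pandas as pd
import pingouin as pg

rng = np.random.default_rng(3)
ids = np.arange(52)
frames = []
for t in ("pre", "post"):
    frames.append(pd.DataFrame({
        "id": ids,
        "time": t,
        "group": np.where(ids < 26, "enhanced", "limited"),
        # Simulated interaction: only the enhanced group improves post-condition.
        "lift_reps": rng.normal(30, 5, 52)
                     + np.where((ids < 26) & (t == "post"), 5.0, 0.0),
    }))
df = pd.concat(frames, ignore_index=True)

# The Interaction row corresponds to the group-by-time F test.
aov = pg.mixed_anova(data=df, dv="lift_reps", within="time",
                     between="group", subject="id")
print(aov[["Source", "F", "p-unc", "np2"]])
```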

Ambulatory antibiotic prescription rates for acute respiratory infection rebound two years after the start of the COVID-19 pandemic

Stevens, Elizabeth R; Feldstein, David; Jones, Simon; Twan, Chelsea; Cui, Xingwei; Hess, Rachel; Kim, Eun Ji; Richardson, Safiya; Malik, Fatima M; Tasneem, Sumaiya; Henning, Natalie; Xu, Lynn; Mann, Devin M
BACKGROUND:During the COVID-19 pandemic, acute respiratory infection (ARI) antibiotic prescribing in ambulatory care markedly decreased. It is unclear whether antibiotic prescription rates have remained lower. METHODS:We used trend analyses of antibiotics prescribed during and after the first wave of COVID-19 to determine whether ARI antibiotic prescribing rates in ambulatory care have remained suppressed compared to pre-COVID-19 levels. Retrospective data were used from patients with ARI or UTI diagnosis code(s) for their encounter from 298 primary care and 66 urgent care practices within four academic health systems in New York, Wisconsin, and Utah between January 2017 and June 2022. The primary measures included antibiotic prescriptions per 100 non-COVID ARI encounters, encounter volume, prescribing trends, and change from expected trend. RESULTS:At baseline, during and after the first wave, the overall ARI antibiotic prescribing rates were 54.7, 38.5, and 54.7 prescriptions per 100 encounters, respectively. ARI antibiotic prescription rates saw a statistically significant decline after COVID-19 onset (step change -15.2, 95% CI: -19.6 to -4.8). During the first wave, encounter volume decreased 29.4% and, after the first wave, remained decreased by 18.8%. After the first wave, ARI antibiotic prescription rates were no longer significantly suppressed from baseline (step change 0.01, 95% CI: -6.3 to 6.2). There was no significant difference between UTI antibiotic prescription rates at baseline versus the end of the observation period. CONCLUSIONS:The decline in ARI antibiotic prescribing observed after the onset of COVID-19 was temporary, not mirrored in UTI antibiotic prescribing, and does not represent a long-term change in clinician prescribing behaviors. During a period of heightened awareness of a viral cause of ARI, a substantial and clinically meaningful decrease in clinician antibiotic prescribing was observed. Future efforts in antibiotic stewardship may benefit from continued study of factors leading to this reduction and rebound in prescribing rates.
PMCID:11198751
PMID: 38917147
ISSN: 1932-6203
CID: 5675032
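
The prescribing-rate trend analysis above rests on a simple construction: monthly antibiotic prescriptions per 100 ARI encounters, modeled with a pre-existing time trend plus a step term at pandemic onset. A minimal sketch with hypothetical encounter-level rows:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical rows: one per ARI encounter, 1 = antibiotic prescribed.
enc = pd.DataFrame({
    "date": pd.to_datetime(["2019-11-15", "2019-11-20",
                            "2020-05-10", "2020-05-12"]),
    "antibiotic_rx": [1, 0, 0, 1],
})
enc["month"] = enc["date"].dt.to_period("M").dt.to_timestamp()
monthly = (enc.groupby("month")["antibiotic_rx"]
              .mean().mul(100).rename("rx_per_100").reset_index())
monthly["t"] = range(len(monthly))
monthly["post_onset"] = (monthly["month"] >= "2020-03-01").astype(int)

# With real monthly data, the step change is the post_onset coefficient:
# fit = smf.ols("rx_per_100 ~ t + post_onset", data=monthly).fit()
```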

Menu Labeling and Calories Purchased in Restaurants in a US National Fast Food Chain

Rummo, Pasquale E; Mijanovich, Tod; Wu, Erilia; Heng, Lloyd; Hafeez, Emil; Bragg, Marie A; Jones, Simon A; Weitzman, Beth C; Elbel, Brian
IMPORTANCE/UNASSIGNED:Menu labeling has been implemented in restaurants in some US jurisdictions as early as 2008, but the extent to which menu labeling is associated with calories purchased is unclear. OBJECTIVE/UNASSIGNED:To estimate the association of menu labeling with calories and nutrients purchased and assess geographic variation in results. DESIGN, SETTING, AND PARTICIPANTS/UNASSIGNED:A cohort study was conducted with a quasi-experimental design using actual transaction data from Taco Bell restaurants from calendar years 2007 to 2014. US restaurants with menu labeling were matched to comparison restaurants using synthetic control methods. Data were analyzed from May to October 2023. EXPOSURE/UNASSIGNED:Menu labeling policies in 6 US jurisdictions. MAIN OUTCOMES AND MEASURES/UNASSIGNED:The primary outcome was calories per transaction. Secondary outcomes included total and saturated fat, carbohydrates, protein, sugar, fiber, and sodium. RESULTS/UNASSIGNED:The final sample included 2329 restaurants, with menu labeling in 474 (31 468 restaurant-month observations). Most restaurants (94.3%) were located in California. Difference-in-differences model results indicated that customers purchased 24.7 (95% CI, 23.6-25.7) fewer calories per transaction from restaurants in the menu labeling group in the 3- to 24-month follow-up period vs the comparison group, including 21.9 (95% CI, 20.9-22.9) fewer calories in the 3- to 12-month follow-up period and 25.0 (95% CI, 24.0-26.1) fewer calories in the 13- to 24-month follow-up period. Changes in the nutrient content of transactions were consistent with calorie estimates. Findings in California were similar to overall estimates in magnitude and direction; yet, among restaurants outside of California, no association was observed in the 3- to 24-month period. The outcome of menu labeling also differed by item category and time of day, with a larger decrease in the number of tacos vs other items purchased and a larger decrease in calories purchased during breakfast vs other times of the day in the 3- to 24-month period. CONCLUSIONS AND RELEVANCE/UNASSIGNED:In this quasi-experimental cohort study, fewer calories were purchased in restaurants with calorie labels compared with those with no labels, suggesting that consumers are sensitive to calorie information on menu boards, although associations differed by location.
PMID: 38100109
ISSN: 2574-3805
CID: 5588992
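
Synthetic control matching, as used above to pair labeled restaurants with comparisons, chooses nonnegative donor weights summing to one so that the weighted donors track the treated unit's pre-policy outcomes. A minimal sketch with simulated calorie series (the data and donor pool are hypothetical):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
pre_treated = rng.normal(1100, 30, 24)        # 24 pre-policy months, treated unit
pre_donors = rng.normal(1100, 30, (24, 40))   # 40 comparison restaurants

def loss(w):
    # Squared pre-period gap between treated unit and weighted donors.
    return np.sum((pre_treated - pre_donors @ w) ** 2)

n = pre_donors.shape[1]
res = minimize(loss, np.full(n, 1 / n), method="SLSQP",
               bounds=[(0, 1)] * n,
               constraints={"type": "eq", "fun": lambda w: w.sum() - 1})
weights = res.x
# Apply `weights` to the donors' post-period series to form the counterfactual.
```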