Searched for:

in-biosketch:true

person:oermae01

Total Results:

147


In Reply: Augmenting Large Language Models With Automated, Bibliometrics-Powered Literature Search for Knowledge Distillation: A Pilot Study for Common Spinal Pathologies

Kurland, David B; Alber, Daniel A; Oermann, Eric K
PMID: 41537755
ISSN: 1524-4040
CID: 5986532

Enhancing the prediction of hospital discharge disposition with extraction-based language model classification

Small, William R; Crowley, Ryan J; Pariente, Chloe; Zhang, Jeff; Eaton, Kevin P; Jiang, Lavender Yao; Oermann, Eric; Aphinyanaphongs, Yindalon
Early identification of inpatient discharges to skilled nursing facilities (SNFs) facilitates care transition planning. Predictive information in admission history and physical notes (H&Ps) is dispersed across long documents. Language models adeptly predict clinical outcomes from text but have limitations: token length constraints, noisy inputs, and opaque outputs. Therefore, we developed extraction-based language model classification (ELC): generative language models distill H&Ps into task-relevant categories ("Structured Extracted Data") before summarizing them into a concise narrative ("AI Risk Snapshot"). We hypothesized that language models utilizing AI Risk Snapshots to predict SNF discharges would perform the best. In this retrospective observational study, nine language models predicted SNF discharges from unstructured predictors (raw H&P text, truncated assessment and plan) and ELC-derived predictors (Structured Extracted Data, AI Risk Snapshots). ELC substantially reduced input length (AI Risk Snapshot median 141 tokens vs raw H&P median 2,120 tokens) and improved average AUROC and AUPRC across models. The best performance was achieved by Bio+Clinical BERT fine-tuned on AI Risk Snapshots (AUROC = .851). AI Risk Snapshots enhanced interpretability by aligning with nurse case managers' risk assessments and facilitating prompt design. Structuring and summarizing H&Ps via ELC thus mitigates the practical limitations of language models and improves SNF discharge prediction.
PMCID: 12789015
PMID: 41522677
ISSN: 3005-1959
CID: 5985892

Evaluating the Performance and Fragility of Large Language Models on the Self-Assessment for Neurological Surgeons

Vishwanath, Krithik; Alyakin, Anton; Ghosh, Mrigayu; Lee, Jin Vivian; Alber, Daniel Alexander; Sangwon, Karl L; Kondziolka, Douglas; Oermann, Eric Karl
BACKGROUND AND OBJECTIVES:The Congress of Neurological Surgeons Self-Assessment for Neurological Surgeons questions are widely used by neurosurgical residents to prepare for written board examinations. Recently, these questions have also served as benchmarks for evaluating large language models' (LLMs) neurosurgical knowledge. LLMs show significant promise for transforming neurosurgical practice; however, they are susceptible to in-text distractions and confounding factors. Given the increasing use of generative artificial intelligence and ambient dictation technologies, clinical text is at a larger risk for the inclusion of extraneous details. The aim of this study was to assess the performance of state-of-the-art LLMs on neurosurgery board-like questions and to evaluate their robustness to the inclusion of distractor statements. METHODS:A comprehensive evaluation was conducted using 28 state-of-the-art LLMs. These models were tested on 2904 neurosurgery board examination questions derived from the Congress of Neurological Surgeons Self-Assessment for Neurological Surgeons. In addition, the study introduced a distraction framework to assess the fragility of these models. The framework incorporated simple, irrelevant distractor statements containing polysemous words with clinical meanings used in nonclinical contexts to determine the extent to which such distractions degrade model performance on standard medical benchmarks. RESULTS:Six of the 28 tested LLMs achieved board-passing outcomes, with the top-performing models scoring over 15.7% above the passing threshold. When exposed to distractions, accuracy across various model architectures was significantly reduced, by as much as 20.4%, with 1 model failing that had previously passed. Both general-purpose and medical open-source models experienced greater performance declines compared with proprietary variants when subjected to the added distractors.
CONCLUSION:While current LLMs demonstrate an impressive ability to answer neurosurgery board-like examination questions, their performance is markedly vulnerable to extraneous, distracting information. These findings underscore the critical need for developing novel mitigation strategies aimed at bolstering LLM resilience against in-text distractions, particularly for safe and effective clinical deployment.
PMID: 41358748
ISSN: 1524-4040
CID: 5977102

A full life cycle biological clock based on routine clinical data and its impact in health and diseases

Wang, Kai; Liu, Fei; Wu, Wei; Hu, Changxi; Shen, Xian; Wang, Meihao; Li, Gen; Zeng, Fanxin; Liu, Li; Wong, Io Nam; Liu, Sian; Zou, Zixing; Li, Bingzhou; Li, Jinghang; Huang, Xiaoying; Jin, Shengwei; Li, Zhuomin; Xu, Hui; Chen, Gang; Chen, Xiaodong; Zhu, Ying; Li, Ping; Feng, Zhe; Wang, Winston; Cheng, Linling; Yang, Mingqi; Hou, Qiang; Lu, Wenyang; Sun, Yiwen; Li, Kun; Zhong, Tian; Sun, Zhuo; Yin, Yun; Loupy, Alexandre; Oermann, Eric; Chen, Xiangmei; Zhang, Kang
Aging research has primarily focused on adult aging clocks, leaving a critical gap in understanding a biological clock across the full life cycle, particularly during infancy and childhood. Here we introduce LifeClock, a biological clock model that predicts biological age across all life stages using routine electronic health records and laboratory test data. To enhance individualized predictions, we integrated virtual patient representations from 24,633,025 heterogeneous longitudinal clinical visits across 9,680,764 individuals and projected them into a latent space. Our approach leverages EHRFormer, a time-series transformer-based model, to analyze developmental and aging dynamics with high precision and develop accurate biological age clocks spanning infancy to old age. Our findings reveal distinct biological clock patterns across different life stages. The pediatric clock is strongly associated with children's development and accurately predicts current and future risks of major pediatric diseases, including malnutrition, growth and developmental abnormalities. The adult clock is strongly associated with aging and accurately predicts current and future risks of major age-related diseases, such as diabetes, renal failure, stroke and cardiovascular diseases. This work therefore distinguishes pediatric development from adult aging, establishing a novel framework to advance precision health by leveraging routine clinical data across the entire lifespan.
PMID: 41145791
ISSN: 1546-170x
CID: 5961022

Neuro Data Hub: A New Approach for Streamlining Medical Clinical Research

Han, Xu; Alyakin, Anton; Ciprut, Shannon; Lapierre, Cathryn; Stryker, Jaden; Golfinos, John; Kondziolka, Douglas; Oermann, Eric Karl
BACKGROUND AND OBJECTIVES:Neurosurgical clinical research depends on medical data collection and evaluation that is often laborious, time consuming, and inefficient. The goal of this work was to implement and evaluate a novel departmental data infrastructure (Neuro Data Hub) designed to provide specialized data services for neurosurgical research. Data acquisition would become available purely by request. METHODS:The Neuro Data Hub was developed through collaboration between Department Leadership and Medical Center Information Technology, integrating it with Institutional Review Board workflows and an existing Epic electronic health record Datalake infrastructure. The system implementation included monthly departmental meetings and an asynchronous Research Electronic Data Capture-based request system. Data requests submitted between August 2023 and November 2024 were analyzed and categorized as basic, complex, or Natural Language Processing (NLP)-augmented, with optional visualization and database creation services. Request volumes, types, and execution times were assessed. RESULTS:The Hub processed 39 research data requests (2.6/month), comprising 3 basic, 22 complex, and 14 NLP-augmented requests. Two complex requests included visualization services, and one NLP request included database creation. Average request execution time was 36.5 days, with NLP-augmented requests showing increasing adoption over time. CONCLUSION:The Neuro Data Hub represents a paradigm shift from centralized to department-level data services, providing specialized support for neurosurgical research and democratizing access to institutional data. While effective, implementation may be limited by institutional information technology infrastructure requirements. This model could serve as a template for any form of medical-clinical research program seeking to improve data accessibility and research capabilities.
PMCID: 12560744
PMID: 41163737
ISSN: 2834-4383
CID: 5961452

The pitfalls of multiple-choice questions in generative AI and medical education

Singh, Shrutika; Alyakin, Anton; Alber, Daniel Alexander; Stryker, Jaden; Tong, Ai Phuong S; Sangwon, Karl; Goff, Nicolas; De La Paz, Mathew; Hernandez-Rovira, Miguel; Park, Ki Yun; Leuthardt, Eric Claude; Oermann, Eric Karl
The performance of Large Language Models (LLMs) on multiple-choice question (MCQ) benchmarks is frequently cited as proof of their medical capabilities. We hypothesized that LLM performance on medical MCQs may in part be illusory and driven by factors beyond medical content knowledge and reasoning capabilities. To assess this, we created a novel benchmark of free-response questions with paired MCQs (FreeMedQA). Using this benchmark, we evaluated three state-of-the-art LLMs (GPT-4o, GPT-3.5, and LLama-3-70B-instruct) and found an average absolute deterioration of 39.43% in performance on free-response questions relative to multiple-choice (p = 1.3 × 10⁻⁵), which was greater than the human performance decline of 22.29%. To isolate the role of the MCQ format on performance, we performed a masking study, iteratively masking out parts of the question stem. At 100% masking, the average LLM multiple-choice performance was 6.70% greater than random chance (p = 0.002) with one LLM (GPT-4o) obtaining an accuracy of 37.34%. Notably, for all LLMs the free-response performance was near zero. Our results highlight the shortcomings in medical MCQ benchmarks for overestimating the capabilities of LLMs in medicine, and, broadly, the potential for improving both human and machine assessments using LLM-evaluated free-response questions.
PMCID: 12658246
PMID: 41298584
ISSN: 2045-2322
CID: 5968502

Most Roads Lead to Cushing: Mapping Neurosurgical Training Lineages in the United States

Kurland, David B; Park, Minjun; Gajjar, Avi A; Liu, Albert; Kondziolka, Douglas; Golfinos, John G; Alleyne, Cargill H; Oermann, Eric K
OBJECTIVE:Mentorship and training relationships shape the careers and influence of neurosurgeons. Network analysis can reveal structural characteristics and key individuals who support network connectivity and drive the field's development. This endeavor analyzed the U.S.-based neurosurgical training network derived from NeurosurGen.com. METHODS:A network graph was constructed representing neurosurgical training relationships, including chairperson-trainee, program director-trainee, and coresident connections. Graph- and node-level metrics, with a focus on centrality measures, were calculated for a trainer-trainee subgraph. RESULTS:The network consisted of 8840 neurosurgeons represented as nodes, and 382,143 relationships represented as edges. It evolved from an early small-world structure to a hierarchical and decentralized structure dominated by local clusters. Demographic shifts over time reflected increasing diversity and inclusion, with greater representation of female, Hispanic, Asian, and Black trainees across 285 training programs. Nodes were preferentially connected via residency, and the connectivity among underrepresented populations improved in concert with increased representation. Harvey W. Cushing was the quintessential neurosurgeon-influencer in the United States, ranking highly across most centrality measures over time. CONCLUSIONS:The neurosurgical training network is sparse but interconnected, typical of large real-world professional networks. While many small groups of neurosurgeons are closely tied within their immediate training hierarchy and peer group, in modern neurosurgery, each surgeon is only connected to a small fraction of the total network. Highly central individuals have played critical roles in linking disparate groups and shaping network structure. Increasing diversity in recent decades indicates progress toward inclusivity, although overall representation remains low.
PMID: 40914191
ISSN: 1878-8769
CID: 5966272

Automating the Referral of Bone Metastases Patients With and Without the Use of Large Language Models

Sangwon, Karl L; Han, Xu; Becker, Anton; Zhang, Yuchong; Ni, Richard; Zhang, Jeff; Alber, Daniel Alexander; Alyakin, Anton; Nakatsuka, Michelle; Fabbri, Nicola; Aphinyanaphongs, Yindalon; Yang, Jonathan T; Chachoua, Abraham; Kondziolka, Douglas; Laufer, Ilya; Oermann, Eric Karl
BACKGROUND AND OBJECTIVES:Bone metastases, which affect more than 4.8% of patients with cancer annually, and spinal metastases in particular require urgent intervention to prevent neurological complications. However, the current process of manually reviewing radiological reports leads to potential delays in specialist referrals. We hypothesized that natural language processing (NLP) review of routine radiology reports could automate the referral process for timely multidisciplinary care of spinal metastases. METHODS:We assessed 3 NLP models for automated detection and referral of bone metastases: a rule-based regular expression (RegEx) model, GPT-4, and a specialized Bidirectional Encoder Representations from Transformers (BERT) model (NYUTron). Study inclusion criteria targeted patients with active cancer diagnoses who underwent advanced imaging (computed tomography, MRI, or positron emission tomography) without previous specialist referral. We defined 2 separate tasks: identifying clinically significant bone metastatic terms (lexical detection) and identifying cases needing specialist follow-up (clinical referral). Models were developed using 3754 hand-labeled advanced imaging studies in 2 phases: phase 1 focused on spine metastases, and phase 2 generalized to bone metastases. Standard McRae's line performance metrics were evaluated and compared across all stages and tasks. RESULTS:In the lexical detection task, a simple RegEx achieved the highest performance (sensitivity 98.4%, specificity 97.6%, F1 = 0.965), followed by NYUTron (sensitivity 96.8%, specificity 89.9%, and F1 = 0.787). For the clinical referral task, RegEx also demonstrated superior performance (sensitivity 92.3%, specificity 87.5%, and F1 = 0.936), followed by a fine-tuned NYUTron model (sensitivity 90.0%, specificity 66.7%, and F1 = 0.750).
CONCLUSION:An NLP-based automated referral system can accurately identify patients with bone metastases requiring specialist evaluation. A simple RegEx model excels in syntax-based identification and expert-informed rule generation for efficient referral patient recommendation in comparison with advanced NLP models. This system could significantly reduce missed follow-ups and enhance timely intervention for patients with bone metastases.
PMID: 40823772
ISSN: 1524-4040
CID: 5908782

Introduction. Artificial intelligence in neurosurgery: transforming a data-intensive specialty

Hopkins, Benjamin S; Sutherland, Garnette R; Browd, Samuel R; Donoho, Daniel A; Oermann, Eric K; Schirmer, Clemens M; Pennicooke, Brenton; Asaad, Wael F
PMID: 40591964
ISSN: 1092-0684
CID: 5887762

Outcomes of concurrent versus non-concurrent immune checkpoint inhibition with stereotactic radiosurgery for melanoma brain metastases

Fu, Allen Ye; Bernstein, Kenneth; Zhang, Jeff; Silverman, Joshua; Mehnert, Janice; Sulman, Erik P; Oermann, Eric Karl; Kondziolka, Douglas
PURPOSE:Immune checkpoint inhibition (ICI) has revolutionized melanoma care. Stereotactic radiosurgery combined with ICI has shown promise to improve clinical outcomes in prior studies in patients who have metastatic melanoma with brain metastases. However, others have suggested that concurrent ICI with stereotactic radiosurgery can increase the risk of complications. METHODS:We present a retrospective, single-institution analysis of 98 patients with a median follow up of 17.1 months managed with immune checkpoint inhibition and stereotactic radiosurgery concurrently and non-concurrently. A total of 55 patients were included in the concurrent group and 43 patients in the non-concurrent treatment group. Cox proportional hazards models were used to assess the relation between concurrent or non-concurrent treatment and overall survival or local progression-free survival. The Wald test was used to assess significance. Significant differences between patients in both groups experiencing adverse events including adverse radiation effects, perilesional edema, and neurological deficits were tested for using the Chi-square or Fisher's exact test. RESULTS:Patients receiving concurrent versus non-concurrent ICI showed a significant increase in overall survival (median 37.1 months, 95% CI: 18.9 months - NA versus median 11.4 months, 95% CI: 6.4-33.2 months, p = 0.0056) but not local progression-free survival. There were no significant differences between groups with regard to adverse radiation effects (2% versus 3%), perilesional edema (20% versus 9%), or neurological deficits (3% versus 20%). CONCLUSION:These results suggest that the timing of ICI does not increase risk of neurological complications when delivered within 4 weeks of SRS.
PMID: 40183901
ISSN: 1573-7373
CID: 5819412