NYUHSL Faculty Bibliography

Searched for:

in-biosketch:true

person:oermae01

Total Results:

147

Neurosurgery. 2026:98(2).DOI: 10.1227/neu.0000000000003864

In Reply: Augmenting Large Language Models With Automated, Bibliometrics-Powered Literature Search for Knowledge Distillation: A Pilot Study for Common Spinal Pathologies

Kurland, David B; Alber, Daniel A; Oermann, Eric K

PMID: 41537755

ISSN: 1524-4040

CID: 5986532

Operative neurosurgery. 2026:30(2):250-259.DOI: 10.1227/ons.0000000000001646

Intraoperative Evaluation of Dural Arteriovenous Fistula Obliteration Using FLOW 800 Hemodynamic Analysis

Sangwon, Karl L; Grin, Eric A; Negash, Bruck; Wiggan, Daniel D; Lapierre, Cathryn; Raz, Eytan; Shapiro, Maksim; Laufer, Ilya; Sharashidze, Vera; Rutledge, Caleb; Riina, Howard A; Oermann, Eric K; Nossek, Erez

BACKGROUND AND OBJECTIVES:Dural arteriovenous fistula (dAVF) surgery is a microsurgical procedure that requires confirmation of obliteration using formal cerebral angiography, but the lack of intraoperative angiogram or need for postoperative angiogram in some settings necessitates a search for alternative, less invasive methods to verify surgical success. This study evaluates the use of indocyanine green videoangiography FLOW 800 hemodynamic analysis intraoperatively during cranial and spinal dAVF obliteration to confirm obliteration and predict surgical success. METHODS:A retrospective analysis was conducted using indocyanine green videoangiography FLOW 800 to intraoperatively measure 4 hemodynamic parameters (Delay Time, Speed, Time to Peak, and Rise Time) across venous drainage regions of interest pre/post-dAVF obliteration. Univariate and multivariate statistical analyses to evaluate and visualize presurgical vs postsurgical state hemodynamic changes included nonparametric statistical tests, logistic regression, and Bayesian analysis. RESULTS:A total of 14 venous drainage regions of interest from 8 patients who had successful spinal or cranial dAVF obliteration confirmed with intraoperative digital subtraction angiography were extracted. Significant hemodynamic changes were observed after dAVF obliteration, with median Speed decreasing from 13.5 to 5.5 s⁻¹ (P = .029) and Delay Time increasing from 2.07 to 7.86 s (P = .020). Bayesian logistic regression identified Delay Time as the strongest predictor of postsurgical state, with a 50% increase associated with 2.16 times higher odds of achieving obliteration (odds ratio = 4.59, 95% highest density interval: 1.07-19.95). Speed exhibited a trend toward a negative association with postsurgical state (odds ratio = 0.62, 95% highest density interval: 0.26-1.42). Receiver operating characteristic-area under the curve analysis using logistic regression demonstrated a score of 0.760, highlighting Delay Time and Speed as key features distinguishing preobliteration and postobliteration states. CONCLUSIONS:Our findings demonstrate that intraoperative FLOW 800 analysis reliably quantifies and visualizes immediate hemodynamic changes consistent with dAVF obliteration. Speed and Delay Time emerged as key indicators of surgical success, highlighting the potential of FLOW 800 as a noninvasive adjunct to traditional imaging techniques for confirming dAVF obliteration intraoperatively.

PMID: 40434390

ISSN: 2332-4260

CID: 5855352

npj health systems. 2026:3(1).DOI: 10.1038/s44401-025-00059-8

Enhancing the prediction of hospital discharge disposition with extraction-based language model classification

Small, William R; Crowley, Ryan J; Pariente, Chloe; Zhang, Jeff; Eaton, Kevin P; Jiang, Lavender Yao; Oermann, Eric; Aphinyanaphongs, Yindalon

Early identification of inpatient discharges to skilled nursing facilities (SNFs) facilitates care transition planning. Predictive information in admission history and physical notes (H&Ps) is dispersed across long documents. Language models adeptly predict clinical outcomes from text but have limitations: token length constraints, noisy inputs, and opaque outputs. Therefore, we developed extraction-based language model classification (ELC): generative language models distill H&Ps into task-relevant categories ("Structured Extracted Data") before summarizing them into a concise narrative ("AI Risk Snapshot"). We hypothesized that language models utilizing AI Risk Snapshots to predict SNF discharges would perform the best. In this retrospective observational study, nine language models predicted SNF discharges from unstructured predictors (raw H&P text, truncated assessment and plan) and ELC-derived predictors (Structured Extracted Data, AI Risk Snapshots). ELC substantially reduced input length (AI Risk Snapshot median 141 tokens vs raw H&P median 2,120 tokens) and improved average AUROC and AUPRC across models. The best performance was achieved by Bio+Clinical BERT fine-tuned on AI Risk Snapshots (AUROC = .851). AI Risk Snapshots enhanced interpretability by aligning with nurse case managers' risk assessments and facilitating prompt design. Structuring and summarizing H&Ps via ELC thus mitigates the practical limitations of language models and improves SNF discharge prediction.

PMCID:12789015

PMID: 41522677

ISSN: 3005-1959

CID: 5985892

Neurosurgery. 2025.DOI: 10.1227/neu.0000000000003878

Evaluating the Performance and Fragility of Large Language Models on the Self-Assessment for Neurological Surgeons

Vishwanath, Krithik; Alyakin, Anton; Ghosh, Mrigayu; Lee, Jin Vivian; Alber, Daniel Alexander; Sangwon, Karl L; Kondziolka, Douglas; Oermann, Eric Karl

BACKGROUND AND OBJECTIVES:The Congress of Neurological Surgeons Self-Assessment for Neurological Surgeons questions are widely used by neurosurgical residents to prepare for written board examinations. Recently, these questions have also served as benchmarks for evaluating large language models' (LLMs) neurosurgical knowledge. LLMs show significant promise for transforming neurosurgical practice; however, they are susceptible to in-text distractions and confounding factors. Given the increasing use of generative artificial intelligence and ambient dictation technologies, clinical text is at a larger risk for the inclusion of extraneous details. The aim of this study was to assess the performance of state-of-the-art LLMs on neurosurgery board-like questions and to evaluate their robustness to the inclusion of distractor statements. METHODS:A comprehensive evaluation was conducted using 28 state-of-the-art LLMs. These models were tested on 2904 neurosurgery board examination questions derived from the Congress of Neurological Surgeons Self-Assessment for Neurological Surgeons. In addition, the study introduced a distraction framework to assess the fragility of these models. The framework incorporated simple, irrelevant distractor statements containing polysemous words with clinical meanings used in nonclinical contexts to determine the extent to which such distractions degrade model performance on standard medical benchmarks. RESULTS:Six of the 28 tested LLMs achieved board-passing outcomes, with the top-performing models scoring over 15.7% above the passing threshold. When exposed to distractions, accuracy across various model architectures was significantly reduced, by as much as 20.4%, with 1 model failing that had previously passed. Both general-purpose and medical open-source models experienced greater performance declines compared with proprietary variants when subjected to the added distractors. CONCLUSIONS:While current LLMs demonstrate an impressive ability to answer neurosurgery board-like examination questions, their performance is markedly vulnerable to extraneous, distracting information. These findings underscore the critical need for developing novel mitigation strategies aimed at bolstering LLM resilience against in-text distractions, particularly for safe and effective clinical deployment.

PMID: 41358748

ISSN: 1524-4040

CID: 5977102

Nature medicine. 2025:31(12):4225-4235.DOI: 10.1038/s41591-025-04006-w

A full life cycle biological clock based on routine clinical data and its impact in health and diseases

Wang, Kai; Liu, Fei; Wu, Wei; Hu, Changxi; Shen, Xian; Wang, Meihao; Li, Gen; Zeng, Fanxin; Liu, Li; Wong, Io Nam; Liu, Sian; Zou, Zixing; Li, Bingzhou; Li, Jinghang; Huang, Xiaoying; Jin, Shengwei; Li, Zhuomin; Xu, Hui; Chen, Gang; Chen, Xiaodong; Zhu, Ying; Li, Ping; Feng, Zhe; Wang, Winston; Cheng, Linling; Yang, Mingqi; Hou, Qiang; Lu, Wenyang; Sun, Yiwen; Li, Kun; Zhong, Tian; Sun, Zhuo; Yin, Yun; Loupy, Alexandre; Oermann, Eric; Chen, Xiangmei; Zhang, Kang

Aging research has primarily focused on adult aging clocks, leaving a critical gap in understanding a biological clock across the full life cycle, particularly during infancy and childhood. Here we introduce LifeClock, a biological clock model that predicts biological age across all life stages using routine electronic health records and laboratory test data. To enhance individualized predictions, we integrated virtual patient representations from 24,633,025 heterogeneous longitudinal clinical visits across 9,680,764 individuals and projected them into a latent space. Our approach leverages EHRFormer, a time-series transformer-based model, to analyze developmental and aging dynamics with high precision and develop accurate biological age clocks spanning infancy to old age. Our findings reveal distinct biological clock patterns across different life stages. The pediatric clock is strongly associated with children's development and accurately predicts current and future risks of major pediatric diseases, including malnutrition, growth and developmental abnormalities. The adult clock is strongly associated with aging and accurately predicts current and future risks of major age-related diseases, such as diabetes, renal failure, stroke and cardiovascular diseases. This work therefore distinguishes pediatric development from adult aging, establishing a novel framework to advance precision health by leveraging routine clinical data across the entire lifespan.

PMID: 41145791

ISSN: 1546-170x

CID: 5961022

Neurosurgery practice. 2025:6(4).DOI: 10.1227/neuprac.0000000000000162

Neuro Data Hub: A New Approach for Streamlining Medical Clinical Research

Han, Xu; Alyakin, Anton; Ciprut, Shannon; Lapierre, Cathryn; Stryker, Jaden; Golfinos, John; Kondziolka, Douglas; Oermann, Eric Karl

BACKGROUND AND OBJECTIVES:Neurosurgical clinical research depends on medical data collection and evaluation that is often laborious, time consuming, and inefficient. The goal of this work was to implement and evaluate a novel departmental data infrastructure (Neuro Data Hub) designed to provide specialized data services for neurosurgical research. Data acquisition would become available purely by request. METHODS:The Neuro Data Hub was implemented through collaboration between Department Leadership and Medical Center Information Technology, integrating it with Institutional Review Board workflows and an existing Epic electronic health record Datalake infrastructure. The system implementation included monthly departmental meetings and an asynchronous Research Electronic Data Capture-based request system. Data requests submitted between August 2023 and November 2024 were analyzed and categorized as basic, complex, or Natural Language Processing (NLP)-augmented, with optional visualization and database creation services. Request volumes, types, and execution times were assessed. RESULTS:The Hub processed 39 research data requests (2.6/month), comprising 3 basic, 22 complex, and 14 NLP-augmented requests. Two complex requests included visualization services, and one NLP request included database creation. Average request execution time was 36.5 days, with NLP-augmented requests showing increasing adoption over time. CONCLUSIONS:The Neuro Data Hub represents a paradigm shift from centralized to department-level data services, providing specialized support for neurosurgical research and democratizing access to institutional data. While effective, implementation may be limited by institutional information technology infrastructure requirements. This model could serve as a template for any form of medical-clinical research program seeking to improve data accessibility and research capabilities.

PMCID:12560744

PMID: 41163737

ISSN: 2834-4383

CID: 5961452

Scientific reports. 2025:15(1).DOI: 10.1038/s41598-025-26036-7

The pitfalls of multiple-choice questions in generative AI and medical education

Singh, Shrutika; Alyakin, Anton; Alber, Daniel Alexander; Stryker, Jaden; Tong, Ai Phuong S; Sangwon, Karl; Goff, Nicolas; De La Paz, Mathew; Hernandez-Rovira, Miguel; Park, Ki Yun; Leuthardt, Eric Claude; Oermann, Eric Karl

The performance of Large Language Models (LLMs) on multiple-choice question (MCQ) benchmarks is frequently cited as proof of their medical capabilities. We hypothesized that LLM performance on medical MCQs may in part be illusory and driven by factors beyond medical content knowledge and reasoning capabilities. To assess this, we created a novel benchmark of free-response questions with paired MCQs (FreeMedQA). Using this benchmark, we evaluated three state-of-the-art LLMs (GPT-4o, GPT-3.5, and LLama-3-70B-instruct) and found an average absolute deterioration of 39.43% in performance on free-response questions relative to multiple-choice (p = 1.3 * 10^-5) which was greater than the human performance decline of 22.29%. To isolate the role of the MCQ format on performance, we performed a masking study, iteratively masking out parts of the question stem. At 100% masking, the average LLM multiple-choice performance was 6.70% greater than random chance (p = 0.002) with one LLM (GPT-4o) obtaining an accuracy of 37.34%. Notably, for all LLMs the free-response performance was near zero. Our results highlight the shortcomings in medical MCQ benchmarks for overestimating the capabilities of LLMs in medicine, and, broadly, the potential for improving both human and machine assessments using LLM-evaluated free-response questions.

PMCID:12658246

PMID: 41298584

ISSN: 2045-2322

CID: 5968502

World neurosurgery. 2025:203.DOI: 10.1016/j.wneu.2025.124433

Most Roads Lead to Cushing: Mapping Neurosurgical Training Lineages in the United States

Kurland, David B; Park, Minjun; Gajjar, Avi A; Liu, Albert; Kondziolka, Douglas; Golfinos, John G; Alleyne, Cargill H; Oermann, Eric K

OBJECTIVE:Mentorship and training relationships shape the careers and influence of neurosurgeons. Network analysis can reveal structural characteristics and key individuals who support network connectivity and drive the field's development. This endeavor analyzed the U.S.-based neurosurgical training network derived from NeurosurGen.com. METHODS:A network graph was constructed representing neurosurgical training relationships, including chairperson-trainee, program director-trainee, and coresident connections. Graph- and node-level metrics, with a focus on centrality measures, were calculated for a trainer-trainee subgraph. RESULTS:The network consisted of 8840 neurosurgeons represented as nodes, and 382,143 relationships represented as edges. It evolved from an early small-world structure to a hierarchical and decentralized structure dominated by local clusters. Demographic shifts over time reflected increasing diversity and inclusion, with greater representation of female, Hispanic, Asian, and Black trainees across 285 training programs. Nodes were preferentially connected via residency, and the connectivity among underrepresented populations improved in concert with increased representation. Harvey W. Cushing was the quintessential neurosurgeon-influencer in the United States, ranking highly across most centrality measures over time. CONCLUSIONS:The neurosurgical training network is sparse but interconnected, typical of large real-world professional networks. While many small groups of neurosurgeons are closely tied within their immediate training hierarchy and peer group, in modern neurosurgery, each surgeon is only connected to a small fraction of the total network. Highly central individuals have played critical roles in linking disparate groups and shaping network structure. Increasing diversity in recent decades indicates progress toward inclusivity, although overall representation remains low.

PMID: 40914191

ISSN: 1878-8769

CID: 5966272

Neurosurgery. 2025.DOI: 10.1227/neu.0000000000003683

Automating the Referral of Bone Metastases Patients With and Without the Use of Large Language Models

Sangwon, Karl L; Han, Xu; Becker, Anton; Zhang, Yuchong; Ni, Richard; Zhang, Jeff; Alber, Daniel Alexander; Alyakin, Anton; Nakatsuka, Michelle; Fabbri, Nicola; Aphinyanaphongs, Yindalon; Yang, Jonathan T; Chachoua, Abraham; Kondziolka, Douglas; Laufer, Ilya; Oermann, Eric Karl

BACKGROUND AND OBJECTIVES:Bone metastases, which affect more than 4.8% of patients with cancer annually, and spinal metastases in particular, require urgent intervention to prevent neurological complications. However, the current process of manually reviewing radiological reports leads to potential delays in specialist referrals. We hypothesized that natural language processing (NLP) review of routine radiology reports could automate the referral process for timely multidisciplinary care of spinal metastases. METHODS:We assessed 3 NLP models for automated detection and referral of bone metastases: a rule-based regular expression (RegEx) model, GPT-4, and a specialized Bidirectional Encoder Representations from Transformers (BERT) model (NYUTron). Study inclusion criteria targeted patients with active cancer diagnoses who underwent advanced imaging (computed tomography, MRI, or positron emission tomography) without previous specialist referral. We defined 2 separate tasks: identifying clinically significant bone metastatic terms (lexical detection) and identifying cases needing a specialist follow-up (clinical referral). Models were developed using 3754 hand-labeled advanced imaging studies in 2 phases: phase 1 focused on spine metastases, and phase 2 generalized to bone metastases. Standard machine learning performance metrics were evaluated and compared across all stages and tasks. RESULTS:In the lexical detection task, a simple RegEx achieved the highest performance (sensitivity 98.4%, specificity 97.6%, and F1 = 0.965), followed by NYUTron (sensitivity 96.8%, specificity 89.9%, and F1 = 0.787). For the clinical referral task, RegEx also demonstrated superior performance (sensitivity 92.3%, specificity 87.5%, and F1 = 0.936), followed by a fine-tuned NYUTron model (sensitivity 90.0%, specificity 66.7%, and F1 = 0.750). CONCLUSIONS:An NLP-based automated referral system can accurately identify patients with bone metastases requiring specialist evaluation. A simple RegEx model excels in syntax-based identification and expert-informed rule generation for efficient referral patient recommendation in comparison with advanced NLP models. This system could significantly reduce missed follow-ups and enhance timely intervention for patients with bone metastases.

PMID: 40823772

ISSN: 1524-4040

CID: 5908782

Neurosurgical focus. 2025:59(1).DOI: 10.3171/2025.4.FOCUS24674

Introduction. Artificial intelligence in neurosurgery: transforming a data-intensive specialty

Hopkins, Benjamin S; Sutherland, Garnette R; Browd, Samuel R; Donoho, Daniel A; Oermann, Eric K; Schirmer, Clemens M; Pennicooke, Brenton; Asaad, Wael F

PMID: 40591964

ISSN: 1092-0684

CID: 5887762