Evaluating the Performance and Fragility of Large Language Models on the Self-Assessment for Neurological Surgeons
Vishwanath, Krithik; Alyakin, Anton; Ghosh, Mrigayu; Lee, Jin Vivian; Alber, Daniel Alexander; Sangwon, Karl L; Kondziolka, Douglas; Oermann, Eric Karl
BACKGROUND AND OBJECTIVES: The Congress of Neurological Surgeons Self-Assessment for Neurological Surgeons questions are widely used by neurosurgical residents to prepare for written board examinations. Recently, these questions have also served as benchmarks for evaluating large language models' (LLMs) neurosurgical knowledge. LLMs show significant promise for transforming neurosurgical practice; however, they are susceptible to in-text distractions and confounding factors. Given the increasing use of generative artificial intelligence and ambient dictation technologies, clinical text is at greater risk of including extraneous details. The aim of this study was to assess the performance of state-of-the-art LLMs on neurosurgery board-like questions and to evaluate their robustness to the inclusion of distractor statements. METHODS: A comprehensive evaluation was conducted using 28 state-of-the-art LLMs. These models were tested on 2904 neurosurgery board examination questions derived from the Congress of Neurological Surgeons Self-Assessment for Neurological Surgeons. In addition, the study introduced a distraction framework to assess the fragility of these models. The framework incorporated simple, irrelevant distractor statements containing polysemous words with clinical meanings used in nonclinical contexts to determine the extent to which such distractions degrade model performance on standard medical benchmarks. RESULTS: Six of the 28 tested LLMs achieved board-passing outcomes, with the top-performing models scoring over 15.7% above the passing threshold. When exposed to distractions, accuracy across various model architectures was significantly reduced, by as much as 20.4%, with 1 model failing that had previously passed. Both general-purpose and medical open-source models experienced greater performance declines than proprietary variants when subjected to the added distractors.
CONCLUSIONS: While current LLMs demonstrate an impressive ability to answer neurosurgery board-like examination questions, their performance is markedly vulnerable to extraneous, distracting information. These findings underscore the critical need for developing novel mitigation strategies aimed at bolstering LLM resilience against in-text distractions, particularly for safe and effective clinical deployment.
PMID: 41358748
ISSN: 1524-4040
CID: 5977102
A full life cycle biological clock based on routine clinical data and its impact in health and diseases
Wang, Kai; Liu, Fei; Wu, Wei; Hu, Changxi; Shen, Xian; Wang, Meihao; Li, Gen; Zeng, Fanxin; Liu, Li; Wong, Io Nam; Liu, Sian; Zou, Zixing; Li, Bingzhou; Li, Jinghang; Huang, Xiaoying; Jin, Shengwei; Li, Zhuomin; Xu, Hui; Chen, Gang; Chen, Xiaodong; Zhu, Ying; Li, Ping; Feng, Zhe; Wang, Winston; Cheng, Linling; Yang, Mingqi; Hou, Qiang; Lu, Wenyang; Sun, Yiwen; Li, Kun; Zhong, Tian; Sun, Zhuo; Yin, Yun; Loupy, Alexandre; Oermann, Eric; Chen, Xiangmei; Zhang, Kang
Aging research has primarily focused on adult aging clocks, leaving a critical gap in understanding a biological clock across the full life cycle, particularly during infancy and childhood. Here we introduce LifeClock, a biological clock model that predicts biological age across all life stages using routine electronic health records and laboratory test data. To enhance individualized predictions, we integrated virtual patient representations from 24,633,025 heterogeneous longitudinal clinical visits across 9,680,764 individuals and projected them into a latent space. Our approach leverages EHRFormer, a time-series transformer-based model, to analyze developmental and aging dynamics with high precision and develop accurate biological age clocks spanning infancy to old age. Our findings reveal distinct biological clock patterns across different life stages. The pediatric clock is strongly associated with children's development and accurately predicts current and future risks of major pediatric diseases, including malnutrition, growth and developmental abnormalities. The adult clock is strongly associated with aging and accurately predicts current and future risks of major age-related diseases, such as diabetes, renal failure, stroke and cardiovascular diseases. This work therefore distinguishes pediatric development from adult aging, establishing a novel framework to advance precision health by leveraging routine clinical data across the entire lifespan.
PMID: 41145791
ISSN: 1546-170X
CID: 5961022
Neuro Data Hub: A New Approach for Streamlining Medical Clinical Research
Han, Xu; Alyakin, Anton; Ciprut, Shannon; Lapierre, Cathryn; Stryker, Jaden; Golfinos, John; Kondziolka, Douglas; Oermann, Eric Karl
BACKGROUND AND OBJECTIVES: Neurosurgical clinical research depends on medical data collection and evaluation that is often laborious, time consuming, and inefficient. The goal of this work was to implement and evaluate a novel departmental data infrastructure (Neuro Data Hub) designed to provide specialized data services for neurosurgical research, making data acquisition available purely by request. METHODS: The Hub was established through collaboration between Department Leadership and Medical Center Information Technology, integrating it with Institutional Review Board workflows and an existing Epic electronic health record Datalake infrastructure. The system implementation included monthly departmental meetings and an asynchronous Research Electronic Data Capture-based request system. Data requests submitted between August 2023 and November 2024 were analyzed and categorized as basic, complex, or Natural Language Processing (NLP)-augmented, with optional visualization and database creation services. Request volumes, types, and execution times were assessed. RESULTS: The Hub processed 39 research data requests (2.6/month), comprising 3 basic, 22 complex, and 14 NLP-augmented requests. Two complex requests included visualization services, and one NLP request included database creation. Average request execution time was 36.5 days, with NLP-augmented requests showing increasing adoption over time. CONCLUSIONS: The Neuro Data Hub represents a paradigm shift from centralized to department-level data services, providing specialized support for neurosurgical research and democratizing access to institutional data. While effective, implementation may be limited by institutional information technology infrastructure requirements. This model could serve as a template for any medical or clinical research program seeking to improve data accessibility and research capabilities.
PMCID: 12560744
PMID: 41163737
ISSN: 2834-4383
CID: 5961452
The pitfalls of multiple-choice questions in generative AI and medical education
Singh, Shrutika; Alyakin, Anton; Alber, Daniel Alexander; Stryker, Jaden; Tong, Ai Phuong S; Sangwon, Karl; Goff, Nicolas; De La Paz, Mathew; Hernandez-Rovira, Miguel; Park, Ki Yun; Leuthardt, Eric Claude; Oermann, Eric Karl
The performance of Large Language Models (LLMs) on multiple-choice question (MCQ) benchmarks is frequently cited as proof of their medical capabilities. We hypothesized that LLM performance on medical MCQs may in part be illusory and driven by factors beyond medical content knowledge and reasoning capabilities. To assess this, we created a novel benchmark of free-response questions with paired MCQs (FreeMedQA). Using this benchmark, we evaluated three state-of-the-art LLMs (GPT-4o, GPT-3.5, and Llama-3-70B-instruct) and found an average absolute deterioration of 39.43% in performance on free-response questions relative to multiple-choice (p = 1.3 × 10⁻⁵), which was greater than the human performance decline of 22.29%. To isolate the role of the MCQ format on performance, we performed a masking study, iteratively masking out parts of the question stem. At 100% masking, the average LLM multiple-choice performance was 6.70% greater than random chance (p = 0.002), with one LLM (GPT-4o) obtaining an accuracy of 37.34%. Notably, for all LLMs the free-response performance was near zero. Our results highlight that medical MCQ benchmarks can overestimate the capabilities of LLMs in medicine and, more broadly, the potential for improving both human and machine assessments using LLM-evaluated free-response questions.
PMCID: 12658246
PMID: 41298584
ISSN: 2045-2322
CID: 5968502
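Illustrative sketch (not from the paper): the abstract above describes masking out parts of the question stem and comparing multiple-choice accuracy with random chance. The snippet below shows one way such a masking step and a chance baseline could be written. The function names, the `[MASK]` token, the choice to mask trailing words, and the assumption that chance equals one over the number of answer options are all hypothetical and are not taken from the study.

```python
def mask_stem(stem, fraction, mask_token="[MASK]"):
    """Replace the trailing `fraction` of words in a question stem.

    Masking trailing words is an assumption made for illustration;
    the study's actual masking scheme is not described in the abstract.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    words = stem.split()
    n_mask = round(len(words) * fraction)
    keep = len(words) - n_mask
    return " ".join(words[:keep] + [mask_token] * n_mask)


def excess_over_chance(accuracy, n_options):
    """Accuracy minus the uniform-guessing baseline of 1 / n_options."""
    if n_options < 2:
        raise ValueError("need at least two answer options")
    return accuracy - 1.0 / n_options


# Made-up eight-word stem, masked at increasing levels.
stem = "A patient presents with sudden severe headache today"
for level in (0.0, 0.5, 1.0):
    print(level, "->", mask_stem(stem, level))
```

At a masking fraction of 1.0 the stem carries no information, so any accuracy above the guessing baseline would have to come from the answer options alone.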
Most Roads Lead to Cushing: Mapping Neurosurgical Training Lineages in the United States
Kurland, David B; Park, Minjun; Gajjar, Avi A; Liu, Albert; Kondziolka, Douglas; Golfinos, John G; Alleyne, Cargill H; Oermann, Eric K
OBJECTIVE: Mentorship and training relationships shape the careers and influence of neurosurgeons. Network analysis can reveal structural characteristics and key individuals who support network connectivity and drive the field's development. This endeavor analyzed the U.S.-based neurosurgical training network derived from NeurosurGen.com. METHODS: A network graph was constructed representing neurosurgical training relationships, including chairperson-trainee, program director-trainee, and coresident connections. Graph- and node-level metrics, with a focus on centrality measures, were calculated for a trainer-trainee subgraph. RESULTS: The network consisted of 8840 neurosurgeons represented as nodes, and 382,143 relationships represented as edges. It evolved from an early small-world structure to a hierarchical and decentralized structure dominated by local clusters. Demographic shifts over time reflected increasing diversity and inclusion, with greater representation of female, Hispanic, Asian, and Black trainees across 285 training programs. Nodes were preferentially connected via residency, and the connectivity among underrepresented populations improved in concert with increased representation. Harvey W. Cushing was the quintessential neurosurgeon-influencer in the United States, ranking highly across most centrality measures over time. CONCLUSIONS: The neurosurgical training network is sparse but interconnected, typical of large real-world professional networks. While many small groups of neurosurgeons are closely tied within their immediate training hierarchy and peer group, in modern neurosurgery, each surgeon is only connected to a small fraction of the total network. Highly central individuals have played critical roles in linking disparate groups and shaping network structure. Increasing diversity in recent decades indicates progress toward inclusivity, although overall representation remains low.
PMID: 40914191
ISSN: 1878-8769
CID: 5966272
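Illustrative sketch (not from the paper): the abstract above reports node-level centrality measures without naming them. As one common example, degree centrality divides each node's number of neighbors by the number of other nodes in the graph. The snippet below computes it on a made-up four-node graph; the node names and edges are hypothetical, and degree centrality is an assumed stand-in for whichever measures the authors used.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Degree centrality for an undirected graph given as (u, v) pairs.

    Each node's score is its neighbor count divided by (n - 1),
    where n is the number of nodes that appear in at least one edge.
    """
    neighbors = defaultdict(set)
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        neighbors[u].add(v)
        neighbors[v].add(u)
    n = len(neighbors)
    if n < 2:
        return {node: 0.0 for node in neighbors}
    return {node: len(adj) / (n - 1) for node, adj in neighbors.items()}


# Made-up trainer-trainee links; "A" trained everyone else.
toy_edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")]
scores = degree_centrality(toy_edges)
for node in sorted(scores, key=scores.get, reverse=True):
    print(node, round(scores[node], 3))
```

In this toy graph the node connected to every other node receives the maximum score of 1.0, which is the sense in which a highly central individual links otherwise separate groups.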
Automating the Referral of Bone Metastases Patients With and Without the Use of Large Language Models
Sangwon, Karl L; Han, Xu; Becker, Anton; Zhang, Yuchong; Ni, Richard; Zhang, Jeff; Alber, Daniel Alexander; Alyakin, Anton; Nakatsuka, Michelle; Fabbri, Nicola; Aphinyanaphongs, Yindalon; Yang, Jonathan T; Chachoua, Abraham; Kondziolka, Douglas; Laufer, Ilya; Oermann, Eric Karl
BACKGROUND AND OBJECTIVES: Bone metastases, which affect more than 4.8% of patients with cancer annually, and spinal metastases in particular, require urgent intervention to prevent neurological complications. However, the current process of manually reviewing radiological reports leads to potential delays in specialist referrals. We hypothesized that natural language processing (NLP) review of routine radiology reports could automate the referral process for timely multidisciplinary care of spinal metastases. METHODS: We assessed 3 NLP models for automated detection and referral of bone metastases: a rule-based regular expression (RegEx) model, GPT-4, and a specialized Bidirectional Encoder Representations from Transformers (BERT) model (NYUTron). Study inclusion criteria targeted patients with active cancer diagnoses who underwent advanced imaging (computed tomography, MRI, or positron emission tomography) without previous specialist referral. We defined 2 separate tasks: identifying clinically significant bone metastatic terms (lexical detection) and identifying cases needing specialist follow-up (clinical referral). Models were developed using 3754 hand-labeled advanced imaging studies in 2 phases: phase 1 focused on spine metastases, and phase 2 generalized to bone metastases. Standard performance metrics were evaluated and compared across all stages and tasks. RESULTS: In the lexical detection task, a simple RegEx achieved the highest performance (sensitivity 98.4%, specificity 97.6%, and F1 = 0.965), followed by NYUTron (sensitivity 96.8%, specificity 89.9%, and F1 = 0.787). For the clinical referral task, RegEx also demonstrated superior performance (sensitivity 92.3%, specificity 87.5%, and F1 = 0.936), followed by a fine-tuned NYUTron model (sensitivity 90.0%, specificity 66.7%, and F1 = 0.750).
CONCLUSIONS: An NLP-based automated referral system can accurately identify patients with bone metastases requiring specialist evaluation. Compared with advanced NLP models, a simple RegEx model excels in syntax-based identification and expert-informed rule generation for efficient patient referral recommendation. This system could significantly reduce missed follow-ups and enhance timely intervention for patients with bone metastases.
PMID: 40823772
ISSN: 1524-4040
CID: 5908782
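Illustrative sketch (not from the paper): the abstract above compares models by sensitivity, specificity, and F1 on a rule-based detection task. The snippet below shows how those three metrics follow from true and false positive and negative counts, using a toy regular expression on made-up report snippets. The pattern, the snippets, and their labels are hypothetical and are not the rules or data used in the study.

```python
import re

# Hypothetical pattern for illustration; not the rule set used in the study.
METASTASIS_PATTERN = re.compile(
    r"\b(?:osseous|bone|spinal|vertebral)\s+metasta(?:sis|ses|tic)\b",
    re.IGNORECASE,
)


def flag_report(text):
    """Return True when the report text matches the illustrative pattern."""
    return METASTASIS_PATTERN.search(text) is not None


def binary_metrics(predictions, labels):
    """Compute sensitivity, specificity, and F1 from paired booleans."""
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, y in pairs if p and y)
    fp = sum(1 for p, y in pairs if p and not y)
    tn = sum(1 for p, y in pairs if not p and not y)
    fn = sum(1 for p, y in pairs if not p and y)

    def ratio(num, den):
        # Undefined ratios (zero denominator) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }


# Made-up report snippets with made-up labels (True = metastasis present).
REPORTS = [
    ("New osseous metastases in the lumbar spine.", True),
    ("No evidence of bone metastasis.", False),
    ("Degenerative changes only.", False),
    ("Vertebral metastatic lesion at T10.", True),
    ("Lesion suspicious for metastatic disease to the spine.", True),
]
predictions = [flag_report(text) for text, _ in REPORTS]
labels = [label for _, label in REPORTS]
metrics = binary_metrics(predictions, labels)
print(metrics)
```

The toy pattern deliberately ignores negation and misses phrasings outside its keyword list, so the second snippet is a false positive and the last is a false negative. Errors of this kind are what expert-written rules have to handle in practice.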
Introduction. Artificial intelligence in neurosurgery: transforming a data-intensive specialty
Hopkins, Benjamin S; Sutherland, Garnette R; Browd, Samuel R; Donoho, Daniel A; Oermann, Eric K; Schirmer, Clemens M; Pennicooke, Brenton; Asaad, Wael F
PMID: 40591964
ISSN: 1092-0684
CID: 5887762
Outcomes of concurrent versus non-concurrent immune checkpoint inhibition with stereotactic radiosurgery for melanoma brain metastases
Fu, Allen Ye; Bernstein, Kenneth; Zhang, Jeff; Silverman, Joshua; Mehnert, Janice; Sulman, Erik P; Oermann, Eric Karl; Kondziolka, Douglas
PURPOSE: Immune checkpoint inhibition (ICI) has revolutionized melanoma care. In prior studies, stereotactic radiosurgery (SRS) combined with ICI has shown promise to improve clinical outcomes in patients who have metastatic melanoma with brain metastases. However, others have suggested that concurrent ICI with SRS can increase the risk of complications. METHODS: We present a retrospective, single-institution analysis of 98 patients with a median follow-up of 17.1 months managed with ICI and SRS concurrently and non-concurrently. A total of 55 patients were included in the concurrent group and 43 patients in the non-concurrent treatment group. Cox proportional hazards models were used to assess the relation between concurrent or non-concurrent treatment and overall survival or local progression-free survival. The Wald test was used to assess significance. Differences between the groups in adverse events, including adverse radiation effects, perilesional edema, and neurological deficits, were tested using the chi-square or Fisher's exact test. RESULTS: Patients receiving concurrent versus non-concurrent ICI showed a significant increase in overall survival (median 37.1 months, 95% CI: 18.9 months - NA versus median 11.4 months, 95% CI: 6.4-33.2 months, p = 0.0056) but not local progression-free survival. There were no significant differences between groups with regard to adverse radiation effects (2% versus 3%), perilesional edema (20% versus 9%), or neurological deficits (3% versus 20%). CONCLUSIONS: These results suggest that the timing of ICI does not increase risk of neurological complications when delivered within 4 weeks of SRS.
PMID: 40183901
ISSN: 1573-7373
CID: 5819412
Large-Scale Multi-omic Biosequence Transformers for Modeling Protein-Nucleic Acid Interactions
Chen, Sully F; Steele, Robert J; Hocky, Glen M; Lemeneh, Beakal; Lad, Shivanand P; Oermann, Eric K
The transformer architecture has revolutionized bioinformatics and driven progress in the understanding and prediction of the properties of biomolecules. To date, most biosequence transformers have been trained on a single omic (either proteins or nucleic acids) and have seen incredible success in downstream tasks in each domain, with particularly noteworthy breakthroughs in protein structural modeling. However, single-omic pre-training limits the ability of these models to capture cross-modal interactions. Here we present OmniBioTE, the largest open-source multi-omic model trained on over 250 billion tokens of mixed protein and nucleic acid data. We show that despite only being trained on unlabelled sequence data, OmniBioTE learns joint representations consistent with the central dogma of molecular biology. We further demonstrate that OmniBioTE achieves state-of-the-art results predicting the change in Gibbs free energy (∆G) of the binding interaction between a given nucleic acid and protein. Remarkably, we show that multi-omic biosequence transformers emergently learn useful structural information without any a priori structural training, allowing us to predict which protein residues are most involved in the protein-nucleic acid binding interaction. Lastly, compared to single-omic controls trained with identical compute, OmniBioTE demonstrates superior performance-per-FLOP and absolute accuracy across both multi-omic and single-omic benchmarks, highlighting the power of a unified modeling approach for biological sequences.
PMCID: 11998858
PMID: 40236839
ISSN: 2331-8422
CID: 5883432
CNS-CLIP: Transforming a Neurosurgical Journal Into a Multimodal Medical Model
Alyakin, Anton; Kurland, David; Alber, Daniel Alexander; Sangwon, Karl L; Li, Danxun; Tsirigos, Aristotelis; Leuthardt, Eric; Kondziolka, Douglas; Oermann, Eric Karl
BACKGROUND AND OBJECTIVES: Classical biomedical data science models are trained on a single modality and aimed at one specific task. However, the exponential increase in the size and capabilities of foundation models inside and outside medicine shows a shift toward task-agnostic models using large-scale, often internet-based, data. Recent research into smaller foundation models trained on specific literature, such as programming textbooks, demonstrated that they can display capabilities similar to or superior to large generalist models, suggesting a potential middle ground between small task-specific and large foundation models. This study introduces a domain-specific multimodal model, Congress of Neurological Surgeons (CNS)-Contrastive Language-Image Pretraining (CLIP), developed for neurosurgical applications, leveraging data exclusively from Neurosurgery Publications. METHODS: We constructed a multimodal data set of articles from Neurosurgery Publications through PDF data collection and figure-caption extraction using an artificial intelligence pipeline for quality control. Our final data set included 24,021 figure-caption pairs. We then developed a fine-tuning protocol for the OpenAI CLIP model. The model was evaluated on tasks including neurosurgical information retrieval, computed tomography imaging classification, and zero-shot ImageNet classification. RESULTS: CNS-CLIP demonstrated superior performance in neurosurgical information retrieval with a Top-1 accuracy of 24.56%, compared with 8.61% for the baseline. The average area under the receiver operating characteristic curve across 6 neuroradiology tasks achieved by CNS-CLIP was 0.95, slightly superior to OpenAI's CLIP at 0.94 and significantly outperforming a vanilla vision transformer at 0.62. In generalist classification, CNS-CLIP reached a Top-1 accuracy of 47.55%, a decrease from the baseline of 52.37%, demonstrating a catastrophic forgetting phenomenon.
CONCLUSIONS: This study presents a pioneering effort in building a domain-specific multimodal model using data from a medical society publication. The results indicate that domain-specific models, while less globally versatile, can offer advantages in specialized contexts. This emphasizes the importance of using tailored data and domain-focused development in training foundation models in neurosurgery and general medicine.
PMID: 39636129
ISSN: 1524-4040
CID: 5780182
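Illustrative sketch (not from the paper): the abstract above reports Top-1 accuracy for figure-caption retrieval. Under the common convention that query i is counted correct when its highest-scoring candidate is candidate i, Top-1 accuracy can be computed from a similarity matrix as below. The matrix values are made up, and the matched-index convention is an assumption about the evaluation setup rather than a detail given in the abstract.

```python
def top1_accuracy(similarity):
    """Fraction of rows whose largest entry lies on the diagonal.

    `similarity[i][j]` is the score between query i (for example, an
    image) and candidate j (for example, a caption); the correct
    candidate for query i is assumed to be candidate i.
    """
    if not similarity:
        raise ValueError("similarity matrix must not be empty")
    hits = 0
    for i, row in enumerate(similarity):
        best = max(range(len(row)), key=row.__getitem__)
        if best == i:
            hits += 1
    return hits / len(similarity)


# Made-up scores for three images against three captions.
toy_similarity = [
    [0.9, 0.1, 0.0],  # image 0 ranks its own caption first (hit)
    [0.2, 0.3, 0.5],  # image 1 ranks caption 2 first (miss)
    [0.1, 0.2, 0.7],  # image 2 ranks its own caption first (hit)
]
print(top1_accuracy(toy_similarity))
```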