Searched for:

in-biosketch:true

person:oermae01

Total Results:

154


CNS-Obsidian: A Neurosurgical Vision-Language Model Built From Scientific Publications

Alyakin, Anton; Stryker, Jaden; Alber, Daniel Alexander; Lee, Jin Vivian; Sangwon, Karl L; Duderstadt, Brandon; Save, Akshay; Kurland, David; Frome, Spencer; Singh, Shrutika; Zhang, Jeff; Yang, Eunice; Park, Ki Yun; Orillac, Cordelia; Valliani, Aly A; Neifert, Sean; Liu, Albert; Patel, Aneek; Livia, Christopher; Lau, Darryl; Laufer, Ilya; Rozman, Peter A; Hidalgo, Eveline Teresa; Riina, Howard; Feng, Rui; Hollon, Todd; Aphinyanaphongs, Yindalon; Golfinos, John G; Snyder, Laura; Leuthardt, Eric C; Kondziolka, Douglas; Oermann, Eric Karl
BACKGROUND AND OBJECTIVES/OBJECTIVE:General purpose vision-language models (VLMs) demonstrate impressive capabilities, but their opaque training on uncurated internet data poses critical limitations for high-stakes decision making, such as in neurosurgery. We present CNS-Obsidian, a neurosurgical VLM trained on peer-reviewed neurosurgical literature, and demonstrate its clinical utility compared with GPT-4o in a real-world setting. METHODS:We compiled 23 984 articles from Neurosurgery Publications journals, yielding 78 853 figures and captions. Using GPT-4o and Claude Sonnet-3.5, we converted these image-text pairs into 263 064 training samples across 3 formats: instruction fine-tuning, multiple-choice questions, and differential diagnosis. We trained CNS-Obsidian, a fine-tune of the 34-billion parameter Large Language and Visual Assistant-Next model. In a blinded, randomized deployment trial at NYU Langone Health (August 30-November 30, 2024), neurosurgeons were assigned to use either CNS-Obsidian or a Health Insurance Portability and Accountability Act-compliant GPT-4o end point as a diagnostic copilot after patient consultations. Primary outcomes were diagnostic helpfulness and accuracy, assessed through user ratings and presence of the correct diagnosis within the VLM-provided differential, respectively. RESULTS:CNS-Obsidian matched GPT-4o on synthetic questions (76.13% vs 77.54%, P = .235), but only achieved 46.81% accuracy on human-generated questions vs GPT-4o's 65.70% (P < 10^-15). In the randomized trial, 70 consultations were evaluated (32 CNS-Obsidian, 38 GPT-4o) from 959 total consults (7.3% utilization). CNS-Obsidian received positive ratings in 40.62% of cases vs 57.89% for GPT-4o (P = .230). Both models included correct diagnosis in approximately 60% of cases (59.38% vs 65.79%, P = .626).
CONCLUSION/CONCLUSIONS:Domain-specific VLMs trained on curated scientific literature can approach frontier model performance in specialized medical domains despite being orders of magnitude smaller and less expensive to train. This establishes a transparent framework for scientific communities to build specialized artificial intelligence models. However, low clinical utilization suggests chatbot interfaces may not align with specialist workflows, indicating need for alternative artificial intelligence integration strategies.
PMID: 42153721
ISSN: 1524-4040
CID: 6037862

AI-Powered Pipeline Transforms Neurosurgical Articles Into High-Quality Graphical Abstracts

Alyakin, Anton; Stryker, Jaden; Lee, Jin Vivian; Feng, Rui; Hollon, Todd; Kondziolka, Douglas; Oermann, Eric Karl
BACKGROUND AND OBJECTIVES/OBJECTIVE:articles into graphical abstracts using Cascading Style Sheets (CSS) templates and iterative prompting of a frontier vision language model and to conduct a human evaluation of this pipeline. METHODS:We developed an automated pipeline to convert extracted manuscript content into standardized graphical abstracts. The pipeline implements a custom CSS profile designed to match existing journal standards. Using Claude Sonnet-3.5, we generated structured hypertext markup language summaries organized into 6 sections: Objectives, Background, Methods, Results, Discussion, and Conclusion. The model selected up to 2 representative figures per manuscript based on caption analysis. We evaluated performance using 100 randomly selected articles published between 2020 and 2024 (95 from Neurosurgery, 4 from Operative Neurosurgery, 1 from Neurosurgery Practice). Three Editorial Review Board members independently assessed abstracts using 3 binary criteria: (1) proper formatting, (2) factual accuracy, and (3) visual appeal. RESULTS:Generated graphical abstracts achieved proper formatting in 85% of cases (95% CI: 76.7%-90.7%), factual accuracy in 99% (95% CI: 94.4%-99.9%), and visual appropriateness in 82% (95% CI: 73.3%-88.3%). Overall, 70% of abstracts (95% CI: 60.5%-78.1%) met all 3 criteria and were deemed "publication ready" without manual intervention. Error analysis revealed poor figure selection (40.0%) as the most common failure mode, followed by title replacement errors from PDF extraction (26.7%). CONCLUSION/CONCLUSIONS:Our artificial intelligence-CSS pipeline demonstrates the feasibility of automating graphical abstract generation for neurosurgical manuscripts, achieving publication-ready quality in 70% of cases with 99% factual accuracy.
This technology offers a scalable augmentation tool that can reduce the design burden for authors, enhancing visual scientific communication in neurosurgical publishing while complementing human expertise.
PMCID:13086415
PMID: 42007247
ISSN: 2834-4383
CID: 6032282
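
As a rough illustration of the 95% confidence intervals quoted in the abstract above: the abstract does not say which interval method was used. The sketch below assumes a Wilson score interval, a sample of n = 100 articles, and z = 1.96; all three are assumptions made for illustration, and the function name `wilson_interval` is hypothetical. Under those assumptions the result for 85 of 100 lands close to the reported range for proper formatting.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial proportion.

    Assumption: the source abstract does not name its interval method;
    Wilson is used here only as a plausible, commonly used choice.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# 85 of 100 abstracts rated as properly formatted (n = 100 is assumed).
low, high = wilson_interval(85, 100)
print(f"{low:.1%} to {high:.1%}")
```

Other interval methods (for example, exact binomial intervals) give slightly different bounds, so small differences from the published figures are expected.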

A Multi-AI Agent Framework for Interactive Neurosurgical Education and Evaluation: From Vignettes to Virtual Conversations

Sangwon, Karl L; Zhang, Jeff; Steele, Robert; Stryker, Jaden; Choi, Joanne J; Lee, Jin Vivian; Alber, Daniel Alexander; Valliani, Aly; Kannapadi, Nivedha; Ryoo, James; Feng, Austin; Khan, Hammad A; Neifert, Sean; Orillac, Cordelia; Weiss, Hannah K; Kim, Nora C; Kurland, David; Riina, Howard A; Kondziolka, Douglas; Mankowski, Michal; Oermann, Eric Karl
BACKGROUND AND OBJECTIVES/OBJECTIVE:Traditional medical board examinations present clinical information in static vignettes with multiple-choice (MC) questions, fundamentally different from how physicians gather and integrate data in practice. Recent advances in large language models (LLMs) offer promising approaches to creating more realistic clinical interactive conversations. However, these approaches are limited in neurosurgery, where patient communication capacity varies significantly and diagnosis heavily relies on objective data such as imaging and neurological examinations. We aimed to develop and evaluate a multi-artificial intelligence (AI) agent conversation framework for neurosurgical case assessment that enables realistic clinical interactions through simulated patients and structured access to objective clinical data. METHODS:We developed a framework to convert 608 Self-Assessment in Neurological Surgery first-order diagnosis questions into conversation sessions using 3 specialized AI agents: patient AI for subjective information, system AI for objective data, and clinical AI for diagnostic reasoning. We evaluated generative pretrained transformer 4o's (GPT-4o's) diagnostic accuracy across traditional vignettes, patient-only conversations, and patient + system AI interactions, with human benchmark testing from 10 neurosurgery residents. RESULTS:= .0030) using fewer interactions and reported high educational value of the interactive format. CONCLUSION/CONCLUSIONS:This multi-AI agent framework provides both a more challenging evaluation method for LLMs and an engaging educational tool for neurosurgical training. The significant performance drops in conversational formats suggest that traditional MC testing may overestimate LLMs' clinical reasoning capabilities, while the framework's interactive nature offers promising applications for enhancing medical education.
PMCID:13075903
PMID: 41982325
ISSN: 2834-4383
CID: 6027772

Natural Language Processing Methods Automate Molecular Marker Extraction From Glioma Pathology Reports

Maarouf, Nader I; Reinecke, David; Smith, Andrew; Markert, John E; Cogan, Theodore G; Han, Xu; Alyakin, Anton; Alber, Daniel Alexander; Park, Minjun; Goff, Nicolas K; Weiss, Hannah; Harake, Edward S; Eddy, Karen; Hollon, Todd; Oermann, Eric K; Orringer, Daniel A
BACKGROUND AND OBJECTIVES/OBJECTIVE:Molecular markers such as isocitrate dehydrogenase (IDH) and alpha-thalassemia/mental retardation syndrome X-linked (ATRX) status are essential for glioma classification and treatment planning, but their manual extraction from pathology reports creates significant research bottlenecks. This study evaluated 3 Natural Language Processing approaches with increasing computational complexity: deterministic Regular Expressions (RegEx), statistical Term Frequency-Inverse Document Frequency (TF-IDF) with logistic regression, and contextual deep learning Bidirectional Encoder Representations from Transformers (BERT). We address whether more intensive approaches provide sufficient performance benefits over simpler approaches in computational pathology research. METHODS:We analyzed pathology reports from 404 patients with glioma at Institution A and 197 at Institution B for external validation. IDH analysis included 399 (Institution A) and 193 (Institution B) patients; ATRX analysis included 361 and 130 patients, respectively. All approaches underwent identical preprocessing steps, including text normalization, terminology standardization, and context extraction. Performance was evaluated using standard classification metrics and memory usage benchmarks on internal and external validation data sets. RESULTS:Simpler approaches outperformed more intensive approaches on external validation. For IDH, RegEx achieved near-perfect accuracy (99%, area under the curve [AUC] 1.000) and TF-IDF performed exceptionally (94.2%, AUC 0.984), while BlueBERT underperformed (85.2%, AUC 0.934). For ATRX, RegEx achieved perfect accuracy (100%, AUC 1.000) and TF-IDF maintained high accuracy (98.0%, AUC 0.998), outperforming BERT-large (84.6%, AUC 0.931). BERT-based approaches required 1825-1953 MB of memory vs RegEx (0.82-5.52 MB) and TF-IDF (17.27-34.89 MB).
CONCLUSION/CONCLUSIONS:Simple Natural Language Processing approaches effectively automate molecular marker extraction from pathology reports with near-perfect accuracy while requiring minimal computational resources. This enables expanded sample sizes in retrospective studies, multi-institutional analyses of rare molecular subgroups, and accelerated biomarker research. Future work will focus on validation across larger data sets, infrastructure integration, and expansion to additional molecular markers.
PMID: 41891708
ISSN: 1524-4040
CID: 6018712
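
To make the deterministic RegEx approach described in the abstract above more concrete, here is a minimal sketch of rule-based marker extraction. The patterns, the status vocabulary, and the `extract_markers` helper are all hypothetical and written for illustration; they are not the rules used in the study, and real pathology reports would need institution-specific tuning.

```python
import re
from typing import Dict, Optional

# Hypothetical status vocabulary; actual report wording varies widely.
STATUS = r"(?P<status>mutant|mutated|wild[- ]?type|retained|lost|loss)"

# Each pattern looks for the marker name followed, within the same
# sentence and a short window, by a status term.
PATTERNS = {
    "IDH": re.compile(rf"\bIDH[12]?\b[^.\n]{{0,40}}?\b{STATUS}\b", re.IGNORECASE),
    "ATRX": re.compile(rf"\bATRX\b[^.\n]{{0,40}}?\b{STATUS}\b", re.IGNORECASE),
}

# Map spelling variants onto one label per status.
NORMALIZE = {
    "mutant": "mutant",
    "mutated": "mutant",
    "wildtype": "wildtype",
    "retained": "retained",
    "lost": "lost",
    "loss": "lost",
}


def extract_markers(report: str) -> Dict[str, Optional[str]]:
    """Return a normalized status per marker, or None when nothing matches."""
    results: Dict[str, Optional[str]] = {}
    for marker, pattern in PATTERNS.items():
        match = pattern.search(report)
        if match is None:
            results[marker] = None
            continue
        raw = re.sub(r"[- ]", "", match.group("status").lower())
        results[marker] = NORMALIZE.get(raw)
    return results


print(extract_markers("IDH1 R132H: mutant. ATRX: nuclear expression retained."))
```

This sketch ignores negation ("not mutated") and cases where the status precedes the marker name ("loss of ATRX"), both of which a production rule set would have to handle.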

LLM-assisted systematic review of large language models in clinical medicine

Chen, Sully F; Alyakin, Anton; Seas, Andreas; Yang, Eunice; Choi, Joanne J; Lee, Jin Vivian; Chen, Amelia L; Warman, Pranav I; Bitolas, Rochelle T; Steele, Robert J; Alber, Daniel A; Oermann, Eric K
Clinical evaluations of large language models (LLMs) have rapidly expanded since 2022, yet their evidence base remains opaque. The overwhelming volume of studies creates challenges for manual curation and review. However, LLMs themselves offer the scalability and capability to evaluate the ever-growing evidence base. This LLM-assisted review identified 4,609 peer-reviewed studies in clinical medicine between January 2022 and September 2025, equating to roughly 3.2 papers per day. Only 1,048 studies used real-world patient data and of these only 19 were prospective randomized trials; most addressed simulated scenarios (n = 1,857) or exam-style tasks (n = 1,704). ChatGPT and related OpenAI models constitute 65.7% of evaluated models, with Gemini/Bard a distant second constituting 13.1% of evaluated models. Patient-facing communication and education comprised 17% of tasks, followed by knowledge retrieval, and education and assessment simulation. Across 1,046 head-to-head comparisons, LLMs outperformed humans in 33% of comparisons, with a strong dependency on task realism and level of training. At least 25% of studies had sample sizes less than 30. Despite the growth of LLMs in medicine, rigorous, patient-centered evidence remains scarce, underscoring the need for larger prospective trials before clinical adoption.
PMID: 41776077
ISSN: 1546-170X
CID: 6008642

Neural and computational mechanisms underlying one-shot perceptual learning in humans

Hachisuka, Ayaka; Shor, Jonathan D; Liu, Xujin Chris; Friedman, Daniel; Dugan, Patricia; Saez, Ignacio; Panov, Fedor E; Wang, Yao; Doyle, Werner; Devinsky, Orrin; Oermann, Eric K; He, Biyu J
The ability to quickly learn and generalize is one of the brain's most impressive feats and recreating it remains a major challenge for modern artificial intelligence research. One of the most mysterious one-shot learning abilities displayed by humans is one-shot perceptual learning, whereby a single viewing experience drastically alters visual perception in a long-lasting manner. Where in the brain one-shot perceptual learning occurs and what mechanisms support it remain enigmatic. Combining psychophysics, 7 T fMRI, and intracranial recordings, we identify the high-level visual cortex as the most likely neural substrate wherein neural plasticity supports one-shot perceptual learning. We further develop a deep neural network model incorporating top-down feedback into a vision transformer, which recapitulates and predicts human behavior. The prior knowledge learnt by this model is highly similar to the neural code in the human high-level visual cortex. These results reveal the neurocomputational mechanisms underlying one-shot perceptual learning in humans.
PMCID:12873369
PMID: 41639076
ISSN: 2041-1723
CID: 6000282

Large-scale multi-omic biosequence transformers for modeling protein-nucleic acid interactions

Chen, Sully F; Steele, Robert J; Hocky, Glen M; Lemeneh, Beakal; Lad, Shivanand P; Oermann, Eric K
The transformer architecture has revolutionized bioinformatics and driven progress in the understanding and prediction of the properties of biomolecules. To date, most biosequence transformers have been trained on single-omic data (either proteins or nucleic acids) and have seen incredible success in downstream tasks in each domain, with particularly noteworthy breakthroughs in protein structural modeling. However, single-omic pretraining limits the ability of these models to capture cross-modal interactions. Here we present OmniBioTE, the largest open-source multi-omic model trained on over 250 billion tokens of mixed protein and nucleic acid data. We show that despite only being trained on unlabeled sequence data, OmniBioTE learns joint representations mapping genes to their corresponding protein sequences. We further demonstrate that OmniBioTE achieves state-of-the-art results predicting the change in Gibbs free energy (ΔG) of the binding interaction between a given nucleic acid and protein. Remarkably, we show that multi-omic biosequence transformers emergently learn useful structural information without any a priori structural training, allowing us to predict which protein residues are most involved in the protein-nucleic acid binding interaction. Compared to single-omic controls trained with identical compute, OmniBioTE also demonstrates superior performance-per-FLOP across both multi-omic and single-omic benchmarks. Together, these results highlight the power of a unified modeling approach for biological sequences and establish OmniBioTE as a foundation model for multi-omic discovery.
PMID: 41628239
ISSN: 1932-6203
CID: 5999602

In Reply: Augmenting Large Language Models With Automated, Bibliometrics-Powered Literature Search for Knowledge Distillation: A Pilot Study for Common Spinal Pathologies

Kurland, David B; Alber, Daniel A; Oermann, Eric K
PMID: 41537755
ISSN: 1524-4040
CID: 5986532

Enhancing the prediction of hospital discharge disposition with extraction-based language model classification

Small, William R; Crowley, Ryan J; Pariente, Chloe; Zhang, Jeff; Eaton, Kevin P; Jiang, Lavender Yao; Oermann, Eric; Aphinyanaphongs, Yindalon
Early identification of inpatient discharges to skilled nursing facilities (SNFs) facilitates care transition planning. Predictive information in admission history and physical notes (H&Ps) is dispersed across long documents. Language models adeptly predict clinical outcomes from text but have limitations: token length constraints, noisy inputs, and opaque outputs. Therefore, we developed extraction-based language model classification (ELC): generative language models distill H&Ps into task-relevant categories ("Structured Extracted Data") before summarizing them into a concise narrative ("AI Risk Snapshot"). We hypothesized that language models utilizing AI Risk Snapshots to predict SNF discharges would perform the best. In this retrospective observational study, nine language models predicted SNF discharges from unstructured predictors (raw H&P text, truncated assessment and plan) and ELC-derived predictors (Structured Extracted Data, AI Risk Snapshots). ELC substantially reduced input length (AI Risk Snapshot median 141 tokens vs raw H&P median 2,120 tokens) and improved average AUROC and AUPRC across models. The best performance was achieved by Bio+Clinical BERT fine-tuned on AI Risk Snapshots (AUROC = .851). AI Risk Snapshots enhanced interpretability by aligning with nurse case managers' risk assessments and facilitating prompt design. Structuring and summarizing H&Ps via ELC thus mitigates the practical limitations of language models and improves SNF discharge prediction.
PMCID:12789015
PMID: 41522677
ISSN: 3005-1959
CID: 5985892

Automating the Referral of Bone Metastases Patients With and Without the Use of Large Language Models

Sangwon, Karl L; Han, Xu; Becker, Anton; Zhang, Yuchong; Ni, Richard; Zhang, Jeff; Alber, Daniel Alexander; Alyakin, Anton; Nakatsuka, Michelle; Fabbri, Nicola; Aphinyanaphongs, Yindalon; Yang, Jonathan T; Chachoua, Abraham; Kondziolka, Douglas; Laufer, Ilya; Oermann, Eric Karl
BACKGROUND AND OBJECTIVES/OBJECTIVE:Bone metastases, which affect more than 4.8% of patients with cancer annually, and spinal metastases in particular, require urgent intervention to prevent neurological complications. However, the current process of manually reviewing radiological reports leads to potential delays in specialist referrals. We hypothesized that natural language processing (NLP) review of routine radiology reports could automate the referral process for timely multidisciplinary care of spinal metastases. METHODS:We assessed 3 NLP models for automated detection and referral of bone metastases: a rule-based regular expression (RegEx) model, GPT-4, and a specialized Bidirectional Encoder Representations from Transformers (BERT) model (NYUTron). Study inclusion criteria targeted patients with active cancer diagnoses who underwent advanced imaging (computed tomography, MRI, or positron emission tomography) without previous specialist referral. We defined 2 separate tasks: identifying clinically significant bone metastatic terms (lexical detection) and identifying cases needing specialist follow-up (clinical referral). Models were developed using 3754 hand-labeled advanced imaging studies in 2 phases: phase 1 focused on spine metastases, and phase 2 generalized to bone metastases. Standard machine learning performance metrics were evaluated and compared across all stages and tasks. RESULTS:In the lexical detection task, a simple RegEx achieved the highest performance (sensitivity 98.4%, specificity 97.6%, F1 = 0.965), followed by NYUTron (sensitivity 96.8%, specificity 89.9%, and F1 = 0.787). For the clinical referral task, RegEx also demonstrated superior performance (sensitivity 92.3%, specificity 87.5%, and F1 = 0.936), followed by a fine-tuned NYUTron model (sensitivity 90.0%, specificity 66.7%, and F1 = 0.750).
CONCLUSION/CONCLUSIONS:An NLP-based automated referral system can accurately identify patients with bone metastases requiring specialist evaluation. Compared with advanced NLP models, a simple RegEx model excels in syntax-based identification and expert-informed rule generation for efficient patient referral recommendations. This system could significantly reduce missed follow-ups and enhance timely intervention for patients with bone metastases.
PMID: 40823772
ISSN: 1524-4040
CID: 5908782
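
The sensitivity, specificity, and F1 values quoted in the abstract above are all derived from a confusion matrix. A minimal sketch of that arithmetic follows; the counts are hypothetical and are not taken from the study, and the `classification_metrics` helper is an illustrative name.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute sensitivity, specificity, precision, and F1 from raw counts."""
    if tp + fn == 0 or tn + fp == 0 or tp + fp == 0:
        raise ValueError("counts must include positives, negatives, and predictions")
    sensitivity = tp / (tp + fn)   # share of true cases that were flagged
    specificity = tn / (tn + fp)   # share of non-cases correctly left alone
    precision = tp / (tp + fp)     # share of flagged cases that were true
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=90, fp=20, tn=80, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because F1 depends on precision as well as sensitivity, a model can pair high sensitivity with a modest F1 when false positives are common, which is the pattern the abstract reports for the lower-specificity models.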