Benchmarking large language models for predictive modeling in biomedical research with a focus on reproductive health
Sarwal, Reuben; Tarca, Victor; Dubin, Claire A; Kalavros, Nikolas; Bhatti, Gaurav; Bhattacharya, Sanchita; Butte, Atul; Romero, Roberto; Stolovitzky, Gustavo; Oskotsky, Tomiko T; Tarca, Adi L; Sirota, Marina
Large language models (LLMs) are increasingly used for code generation and data analysis. This study assesses LLM performance across four predictive tasks from three DREAM challenges: gestational age regression from transcriptomics and DNA methylation and classification of preterm birth and early preterm birth from microbiome data. We prompt LLMs with task descriptions, data locations, and target outcomes and then run LLM-generated code to fit prediction models and determine accuracy on test sets. Among the eight LLMs tested, o3-mini-high, 4o, DeepseekR1, and Gemini 2.0 can complete at least one task. R code generation is more successful (14/16) than Python (7/16). OpenAI's o3-mini-high outperforms others, completing 7/8 tasks. Test set performance of the top LLM-generated models matches or exceeds the median-participating team for all four tasks and surpasses the top-performing team for one task (p = 0.02). These findings underscore the potential of LLMs to democratize predictive modeling in omics and increase research output.
PMID: 41707656
ISSN: 2666-3791
CID: 6004802
Placental epigenetic clocks derived from crowdsourcing: Implications for the study of accelerated aging in obstetrics
Bhatti, Gaurav; Sufriyana, Herdiantri; Romero, Roberto; Patel, Tushar; Tekola-Ayele, Fasil; Alsaggaf, Ibrahim; Gomez-Lopez, Nardhy; Su, Emily C Y; Done, Bogdan; Hoffmann, Steve; van Bömmel, Alena; Wan, Cen; Albrecht, Jake; Novak, Charles; Chaiworapongsa, Tinnakorn; Sirota, Marina; Aghaeepour, Nima; Stolovitzky, Gustavo; Bryant, David R; Tarca, Adi L
Epigenetic gestational age acceleration has been implicated in obstetric syndromes including preeclampsia, yet robust conclusions require accurate and unbiased epigenetic age models. Herein, we curated 1,842 public placental methylomes and organized a DREAM challenge to develop models of gestational age. Participants were blinded to the test data that we generated from 384 placentas encompassing normal and complicated pregnancies. Models developed during and post-challenge compared favorably to existing models in terms of accuracy, yet they were better calibrated throughout gestation and indicated that reports of accelerated epigenetic aging in preterm preeclampsia were likely due to modeling artifacts. The models show that accelerated aging is associated with a decrease in birthweight percentiles in male neonates delivered at term. By contrast, preterm accelerated aging was protective against delivery of a small-for-gestational-age neonate regardless of fetal sex. This work informs our understanding of the fetal sex-dimorphic role of the placenta epigenome in obstetrics.
PMCID:12356336
PMID: 40822353
ISSN: 2589-0042
CID: 5908752
Economics of AI and human task sharing for decision making in screening mammography
Ahsen, Mehmet Eren; Ayvaci, Mehmet U S; Mookerjee, Radha; Stolovitzky, Gustavo
The rising global incidence of breast cancer and the persistent shortage of specialized radiologists have heightened the demand for innovative solutions in mammography screening. Artificial intelligence (AI) has emerged as a promising tool to bridge this demand-supply gap, with potential applications ranging from full automation to integrated AI-human decision-making. This study evaluates the economic feasibility of incorporating AI into mammography screening within healthcare settings, considering full or partial integration. To evaluate the economic viability, we employ an optimization model specifically designed to minimize mammography screening costs. This model considers three distinct approaches to interpreting mammograms: an automation strategy utilizing AI exclusively, a delegation strategy involving the selective allocation of tasks between radiologists and AI, and an expert-alone strategy relying solely on radiologist decisions. Our findings underscore the significance of disease prevalence in relation to the trade-off between costs associated with false positives (e.g., follow-up expenses) and false negatives (e.g., litigation costs stemming from missed diagnoses) in shaping the AI strategy for healthcare organizations. We backtest our approach using data from an AI contest in which participants aimed to match or surpass radiologists' performance in assessing screening mammograms. The contest data support the optimality of the delegation strategy, potentially leading to cost savings of 17.5% to 30.1% compared to relying solely on human experts. Our research provides guidance for healthcare organizations considering AI integration in mammography screening, with broader implications for work design and human-AI hybrid solutions in various fields.
PMCID:11889172
PMID: 40055356
ISSN: 2041-1723
CID: 5807972
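Note on the entry above: the abstract compares an automation, a delegation, and an expert-alone strategy by expected screening cost. The sketch below is not the authors' optimization model; it is a minimal illustration, under stated assumptions, of how prevalence, false-positive cost, and false-negative cost enter such a comparison. The `Reader` class, all function names, and the simplification that deferred cases have the same prevalence as retained ones are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Reader:
    """Hypothetical reader (AI or radiologist) with fixed operating point."""
    sensitivity: float
    specificity: float
    cost_per_read: float


def error_cost(reader: Reader, prevalence: float,
               cost_fp: float, cost_fn: float) -> float:
    """Expected per-case cost of false positives and false negatives."""
    false_negatives = prevalence * (1.0 - reader.sensitivity)
    false_positives = (1.0 - prevalence) * (1.0 - reader.specificity)
    return false_positives * cost_fp + false_negatives * cost_fn


def single_reader_cost(reader: Reader, prevalence: float,
                       cost_fp: float, cost_fn: float) -> float:
    """Expected per-case cost when one reader handles every case
    (automation if the reader is the AI, expert-alone if a radiologist)."""
    return reader.cost_per_read + error_cost(reader, prevalence, cost_fp, cost_fn)


def delegation_cost(ai: Reader, radiologist: Reader, defer_fraction: float,
                    prevalence: float, cost_fp: float, cost_fn: float) -> float:
    """Expected per-case cost when the AI reads every case and defers a
    fraction of them to the radiologist.

    Assumption (illustrative only): prevalence and reader performance are
    the same in the deferred and retained subsets.
    """
    kept = (1.0 - defer_fraction) * error_cost(ai, prevalence, cost_fp, cost_fn)
    deferred = defer_fraction * single_reader_cost(
        radiologist, prevalence, cost_fp, cost_fn)
    return ai.cost_per_read + kept + deferred
```

Evaluating the three functions over a range of prevalence values shows where each strategy is cheapest under the chosen, hypothetical cost parameters.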
Extracellular vesicles, RNA sequencing, and bioinformatic analyses: Challenges, solutions, and recommendations
Miceli, Rebecca T; Chen, Tzu-Yi; Nose, Yohei; Tichkule, Swapnil; Brown, Briana; Fullard, John F; Saulsbury, Marilyn D; Heyliger, Simon O; Gnjatic, Sacha; Kyprianou, Natasha; Cordon-Cardo, Carlos; Sahoo, Susmita; Taioli, Emanuela; Roussos, Panos; Stolovitzky, Gustavo; Gonzalez-Kozlova, Edgar; Dogra, Navneet
Extracellular vesicles (EVs) are heterogeneous entities secreted by cells into their microenvironment and systemic circulation. Circulating EVs carry functional small RNAs and other molecular footprints from their cell of origin, and thus have evident applications in liquid biopsy, therapeutics, and intercellular communication. Yet, the complete transcriptomic landscape of EVs is poorly characterized due to critical limitations including variable protocols used for EV-RNA extraction, quality control, cDNA library preparation, sequencing technologies, and bioinformatic analyses. Consequently, there is a gap in knowledge and a need for a standardized approach to delineating EV-RNAs. Here, we address these gaps by (1) focusing on the large canopy of EVs and particles (EVPs), which includes, but is not limited to, exosomes and other large and small EVs, lipoproteins, exomeres/supermeres, mitochondrial-derived vesicles, RNA binding proteins, and cell-free DNA/RNA/proteins; (2) examining the potential functional roles and biogenesis of EVPs; (3) discussing various transcriptomic methods and technologies used in uncovering the cargoes of EVPs; (4) presenting a comprehensive list of RNA subtypes reported in EVPs; (5) describing different EV-RNA databases and resources specific to EV-RNA species; (6) reviewing established bioinformatics pipelines and novel strategies for reproducible EV transcriptomics analyses; (7) emphasizing the significant need for a gold standard approach in identifying EV-RNAs across studies; and (8) highlighting current challenges, discussing possible solutions, and presenting recommendations for robust and reproducible analyses of EVP-associated small RNAs. Overall, we seek to provide clarity on the transcriptomics landscape, sequencing technologies, and bioinformatic analyses of EVP-RNAs. 
Detailed portrayal of the current state of EVP transcriptomics will lead to a better understanding of how the RNA cargo of EVPs can be used in modern and targeted diagnostics and therapeutics. For the inclusion of different particles discussed in this article, we use the terms large/small EVs, non-vesicular extracellular particles (NVEPs), EPs and EVPs as defined in MISEV guidelines by the International Society of Extracellular Vesicles (ISEV).
PMCID:11613500
PMID: 39625409
ISSN: 2001-3078
CID: 5763732
Extracellular vesicles carry transcriptional 'dark matter' revealing tissue-specific information
Dogra, Navneet; Chen, Tzu-Yi; Gonzalez-Kozlova, Edgar; Miceli, Rebecca; Cordon-Cardo, Carlos; Tewari, Ashutosh K; Losic, Bojan; Stolovitzky, Gustavo
From eukaryotes to prokaryotes, all cells secrete extracellular vesicles (EVs) as part of their regular homeostasis, intercellular communication, and cargo disposal. Accumulating evidence suggests that small EVs carry functional small RNAs, potentially serving as extracellular messengers and liquid-biopsy markers. Yet, the complete transcriptomic landscape of EV-associated small RNAs during disease progression is poorly delineated due to critical limitations including the protocols used for sequencing, suboptimal alignment of short reads (20-50 nt), and uncharacterized genome annotations, often denoted as the 'dark matter' of the genome. In this study, we investigate the EV-associated small unannotated RNAs that arise from endogenous genes and are part of the genomic 'dark matter', which may play a key emerging role in regulating gene expression and translational mechanisms. To address this, we created a distinct small RNAseq dataset from human prostate cancer and benign tissues, and from EVs derived from blood (pre- and post-prostatectomy), urine, and a human prostate carcinoma epithelial cell line. We then developed an unsupervised data-based bioinformatic pipeline that recognizes biologically relevant transcriptional signals irrespective of their genomic annotation. Using this approach, we discovered distinct EV-RNA expression patterns emerging from the un-annotated genomic regions (UGRs) of the transcriptomes associated with tissue-specific phenotypes. We have named these novel EV-associated small RNAs 'EV-UGRs' or 'EV-dark matter'. Here, we demonstrate that EV-UGR expression is downregulated ∼100-fold (FDR < 0.05) in the circulating serum EVs from aggressive prostate cancer subjects. Remarkably, these EV-UGR expression signatures were regained (upregulated) after radical prostatectomy in the same follow-up patients. Finally, we developed a stem-loop RT-qPCR assay that validated prostate cancer-specific EV-UGRs for selective fluid-based diagnostics. 
Overall, using an unsupervised data-driven approach, we investigate the 'dark matter' of the EV transcriptome and demonstrate that EV-UGRs carry tissue-specific information that differs significantly pre- and post-prostatectomy in prostate cancer patients. Although further validation in randomized clinical trials is required, this new class of EV-RNAs holds promise for liquid biopsy by avoiding highly invasive biopsy procedures in prostate cancer.
PMCID:11327273
PMID: 39148266
ISSN: 2001-3078
CID: 5799502
Optimizing Clinical Trial Eligibility Design Using Natural Language Processing Models and Real-World Data: Algorithm Development and Validation
Lee, Kyeryoung; Liu, Zongzhi; Mai, Yun; Jun, Tomi; Ma, Meng; Wang, Tongyu; Ai, Lei; Calay, Ediz; Oh, William; Stolovitzky, Gustavo; Schadt, Eric; Wang, Xiaoyan
BACKGROUND:Clinical trials are vital for developing new therapies but can also delay drug development. Efficient trial data management, optimized trial protocol, and accurate patient identification are critical for reducing trial timelines. Natural language processing (NLP) has the potential to achieve these objectives. OBJECTIVE:This study aims to assess the feasibility of using data-driven approaches to optimize clinical trial protocol design and identify eligible patients. This involves creating a comprehensive eligibility criteria knowledge base integrated within electronic health records using deep learning-based NLP techniques. METHODS:We obtained data of 3281 industry-sponsored phase 2 or 3 interventional clinical trials recruiting patients with non-small cell lung cancer, prostate cancer, breast cancer, multiple myeloma, ulcerative colitis, and Crohn disease from ClinicalTrials.gov, spanning the period between 2013 and 2020. A customized bidirectional long short-term memory- and conditional random field-based NLP pipeline was used to extract all eligibility criteria attributes and convert hypernym concepts into computable hyponyms along with their corresponding values. To illustrate the simulation of clinical trial design for optimization purposes, we selected a subset of patients with non-small cell lung cancer (n=2775), curated from the Mount Sinai Health System, as a pilot study. RESULTS:The NLP pipeline achieved an F1-score of 0.83 (range 0.67-1), enabling the efficient extraction of granular criteria entities and relevant attributes from 3281 clinical trials. A standardized eligibility criteria knowledge base, compatible with electronic health records, was developed by transforming hypernym concepts into machine-interpretable hyponyms along with their corresponding values. In addition, an interface prototype demonstrated the practicality of leveraging real-world data for optimizing clinical trial protocols and identifying eligible patients. 
CONCLUSIONS:Our customized NLP pipeline successfully generated a standardized eligibility criteria knowledge base by transforming hypernym criteria into machine-readable hyponyms along with their corresponding values. A prototype interface integrating real-world patient information allows us to assess the impact of each eligibility criterion on the number of patients eligible for the trial. Leveraging NLP and real-world data in a data-driven approach holds promise for streamlining the overall clinical trial process and improving efficiency in patient identification.
PMCID:11319878
PMID: 39073872
ISSN: 2817-1705
CID: 5799492
An algorithm to identify patients aged 0-3 with rare genetic disorders
Webb, Bryn D; Lau, Lisa Y; Tsevdos, Despina; Shewcraft, Ryan A; Corrigan, David; Shi, Lisong; Lee, Seungwoo; Tyler, Jonathan; Li, Shilong; Wang, Zichen; Stolovitzky, Gustavo; Edelmann, Lisa; Chen, Rong; Schadt, Eric E; Li, Li
BACKGROUND:With over 7000 Mendelian disorders, identifying children with a specific rare genetic disorder diagnosis through structured electronic medical record data is challenging given incompleteness of records, inaccurate medical diagnosis coding, as well as heterogeneity in clinical symptoms and procedures for specific disorders. We sought to develop a digital phenotyping algorithm (PheIndex) using electronic medical records to identify children aged 0-3 diagnosed with genetic disorders or who present with illness with an increased risk for genetic disorders. RESULTS:Through expert opinion, we established 13 criteria for the algorithm and derived a score and a classification. The performance of each criterion and the classification were validated by chart review. PheIndex identified 1,088 children out of 93,154 live births who may be at an increased risk for genetic disorders. Chart review demonstrated that the algorithm achieved 90% sensitivity, 97% specificity, and 94% accuracy. CONCLUSIONS:The PheIndex algorithm can help identify when a rare genetic disorder may be present, alerting providers to consider ordering a diagnostic genetic test and/or referring a patient to a medical geneticist.
PMCID:11064409
PMID: 38698482
ISSN: 1750-1172
CID: 5799462
Modeling combination therapies in patient cohorts and cell cultures using correlated drug action
Arun, Adith S; Kim, Sung-Cheol; Ahsen, Mehmet Eren; Stolovitzky, Gustavo
Characterizing the effect of combination therapies is vital for treating diseases like cancer. We introduce correlated drug action (CDA), a baseline model for the study of drug combinations in both cell cultures and patient populations, which assumes that the efficacy of drugs in a combination may be correlated. We apply temporal CDA (tCDA) to clinical trial data, and demonstrate the utility of this approach in identifying possible synergistic combinations and others that can be explained in terms of monotherapies. Using MCF7 cell line data, we assess combinations with dose CDA (dCDA), a model that generalizes other proposed models (e.g., Bliss response-additivity, the dose equivalence principle), and introduce Excess over CDA (EOCDA), a new metric for identifying possible synergistic combinations in cell culture.
PMCID:10882105
PMID: 38390492
ISSN: 2589-0042
CID: 5799452
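Note on the entry above: the abstract positions dose CDA as a generalization of classical baselines such as Bliss, and introduces an excess-over-baseline metric (EOCDA). The sketch below does not implement CDA or EOCDA, whose formulas are not given in this listing. It illustrates the same idea with the classical Bliss independence baseline only: compute the viability expected if the two drugs acted independently, then report how far the observed combination falls below it. Function names and the sign convention are assumptions made for illustration.

```python
def bliss_expected_viability(viability_a: float, viability_b: float) -> float:
    """Expected surviving fraction under Bliss independence.

    Each input is the fraction of cells surviving the monotherapy
    (between 0 and 1). Independent action multiplies the fractions.
    """
    for v in (viability_a, viability_b):
        if not 0.0 <= v <= 1.0:
            raise ValueError("viability must lie in [0, 1]")
    return viability_a * viability_b


def excess_over_bliss(viability_a: float, viability_b: float,
                      observed_combination_viability: float) -> float:
    """Expected minus observed viability for the combination.

    Positive values mean the combination killed more cells than the
    independence baseline predicts (a candidate for synergy); negative
    values mean it killed fewer.
    """
    expected = bliss_expected_viability(viability_a, viability_b)
    return expected - observed_combination_viability
```

For example, monotherapy viabilities of 0.5 and 0.4 give a Bliss expectation of about 0.2, so an observed combination viability of 0.1 corresponds to an excess of about 0.1.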
Microbiome preterm birth DREAM challenge: Crowdsourcing machine learning approaches to advance preterm birth research
Golob, Jonathan L; Oskotsky, Tomiko T; Tang, Alice S; Roldan, Alennie; Chung, Verena; Ha, Connie W Y; Wong, Ronald J; Flynn, Kaitlin J; Parraga-Leo, Antonio; Wibrand, Camilla; Minot, Samuel S; Oskotsky, Boris; Andreoletti, Gaia; Kosti, Idit; Bletz, Julie; Nelson, Amber; Gao, Jifan; Wei, Zhoujingpeng; Chen, Guanhua; Tang, Zheng-Zheng; Novielli, Pierfrancesco; Romano, Donato; Pantaleo, Ester; Amoroso, Nicola; Monaco, Alfonso; Vacca, Mirco; De Angelis, Maria; Bellotti, Roberto; Tangaro, Sabina; Kuntzleman, Abigail; Bigcraft, Isaac; Techtmann, Stephen; Bae, Daehun; Kim, Eunyoung; Jeon, Jongbum; Joe, Soobok; Theis, Kevin R; Ng, Sherrianne; Lee, Yun S; Diaz-Gimeno, Patricia; Bennett, Phillip R; MacIntyre, David A; Stolovitzky, Gustavo; Lynch, Susan V; Albrecht, Jake; Gomez-Lopez, Nardhy; Romero, Roberto; Stevenson, David K; Aghaeepour, Nima; Tarca, Adi L; Costello, James C; Sirota, Marina
Every year, 11% of infants are born preterm with significant health consequences, with the vaginal microbiome a risk factor for preterm birth. We crowdsource models to predict (1) preterm birth (PTB; <37 weeks) or (2) early preterm birth (ePTB; <32 weeks) from 9 vaginal microbiome studies representing 3,578 samples from 1,268 pregnant individuals, aggregated from public raw data via phylogenetic harmonization. The predictive models are validated on two independent unpublished datasets representing 331 samples from 148 pregnant individuals. The top-performing models (among 148 and 121 submissions from 318 teams) achieve area under the receiver operator characteristic (AUROC) curve scores of 0.69 and 0.87 predicting PTB and ePTB, respectively. Alpha diversity, VALENCIA community state types, and composition are important features in the top-performing models, most of which are tree-based methods. This work is a model for translation of microbiome data into clinically relevant predictive models and to better understand preterm birth.
PMID: 38134931
ISSN: 2666-3791
CID: 5799442
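Note on the entry above: the abstract reports model performance as AUROC (0.69 for PTB, 0.87 for ePTB). As a reminder of what that number measures, the sketch below computes AUROC from its pairwise definition: the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. This is a generic illustration, not the challenge's scoring code, and the function name is hypothetical.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for the positive class (e.g. preterm), 0 for the negative class.
    scores: model outputs, higher meaning more likely positive.
    Runs in O(n_pos * n_neg), which is fine for small illustrative inputs.
    """
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(positives) * len(negatives))
```

With labels `[0, 0, 1, 1]` and scores `[0.1, 0.4, 0.35, 0.8]`, three of the four positive-negative pairs are ranked correctly, giving 0.75.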
Optimal linear ensemble of binary classifiers
Ahsen, Mehmet Eren; Vogel, Robert; Stolovitzky, Gustavo
MOTIVATION:The integration of vast, complex biological data with computational models offers profound insights and predictive accuracy. Yet, such models face challenges: poor generalization and limited labeled data. RESULTS:To overcome these difficulties in binary classification tasks, we developed the Method for Optimal Classification by Aggregation (MOCA) algorithm, which addresses the problem of generalization by virtue of being an ensemble learning method and can be used in problems with limited or no labeled data. We developed both an unsupervised (uMOCA) and a supervised (sMOCA) variant of MOCA. For uMOCA, we show how to infer the MOCA weights in an unsupervised way, which are optimal under the assumption of class-conditioned independent classifier predictions. When it is possible to use labels, sMOCA uses empirically computed MOCA weights. We demonstrate the performance of uMOCA and sMOCA using simulated data as well as actual data previously used in Dialogue on Reverse Engineering Assessment and Methods (DREAM) challenges. We also propose an application of sMOCA for transfer learning where we use pre-trained computational models from a domain where labeled data are abundant and apply them to a different domain with less abundant labeled data. AVAILABILITY AND IMPLEMENTATION:GitHub repository, https://github.com/robert-vogel/moca.
PMCID:11249386
PMID: 39011276
ISSN: 2635-0041
CID: 5799482
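Note on the entry above: the abstract describes MOCA as a linear ensemble of binary classifiers whose weights are either inferred without labels (uMOCA) or computed empirically (sMOCA). The sketch below shows only the general form of a linear ensemble: a weighted sum of base-classifier predictions followed by a threshold. It does not reproduce how MOCA derives its weights, and it does not use the API of the linked repository. The function names, the {-1, +1} prediction encoding, and the example weights are all hypothetical.

```python
from typing import Sequence


def linear_ensemble_score(predictions: Sequence[float],
                          weights: Sequence[float]) -> float:
    """Weighted sum of base-classifier predictions for one sample.

    predictions: one value per base classifier, here assumed to be -1
    (negative class) or +1 (positive class).
    weights: one weight per base classifier. Placeholder values; a method
    such as MOCA would estimate these, which is not shown here.
    """
    if len(predictions) != len(weights):
        raise ValueError("need exactly one weight per classifier")
    return sum(w * p for w, p in zip(weights, predictions))


def ensemble_label(predictions: Sequence[float],
                   weights: Sequence[float],
                   threshold: float = 0.0) -> int:
    """Return 1 if the ensemble score exceeds the threshold, else 0."""
    return 1 if linear_ensemble_score(predictions, weights) > threshold else 0
```

With weights `[0.5, 2.0, 0.5]`, the second classifier dominates: predictions `[1, -1, 1]` yield a score of -1.0 and label 0, even though two of the three classifiers voted positive. This is the sense in which a weighted ensemble differs from a simple majority vote.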