NYUHSL Faculty Bibliography

Searched for:

in-biosketch:yes

person:stolog01

Total Results:

139

Nature genetics. 2026. DOI: 10.1038/s41588-026-02606-x

Towards a decentralized future for open-science databases

Sharma, Gaurav; Munteanu, Viorel; Ghiasi, Nika Mansouri; Mahanta, Utkarsha; Banerjee, Jineta; Varma, Susheel; Foschini, Luca; Ellrott, Kyle; Mutlu, Onur; Ciorbă, Dumitru; Ophoff, Roel A; Bostan, Viorel; Moore, Jason H; Sousoni, Despoina; Krishnan, Arunkumar; Lucaci, Alexander G; Tull, Alba; Mason, Christopher E; Dimian, Mihai; Stolovitzky, Gustavo; Liberante, Fabio G; Oleksyk, Taras K; Mangul, Serghei

PMID: 42156561

ISSN: 1546-1718

CID: 6038122

Cell reports. Medicine. 2026:7(2). DOI: 10.1016/j.xcrm.2026.102594

Benchmarking large language models for predictive modeling in biomedical research with a focus on reproductive health

Sarwal, Reuben; Tarca, Victor; Dubin, Claire A; Kalavros, Nikolas; Bhatti, Gaurav; Bhattacharya, Sanchita; Butte, Atul; Romero, Roberto; Stolovitzky, Gustavo; Oskotsky, Tomiko T; Tarca, Adi L; Sirota, Marina

Large language models (LLMs) are increasingly used for code generation and data analysis. This study assesses LLM performance across four predictive tasks from three DREAM challenges: gestational age regression from transcriptomics and DNA methylation and classification of preterm birth and early preterm birth from microbiome data. We prompt LLMs with task descriptions, data locations, and target outcomes and then run LLM-generated code to fit prediction models and determine accuracy on test sets. Among the eight LLMs tested, o3-mini-high, 4o, DeepseekR1, and Gemini 2.0 can complete at least one task. R code generation is more successful (14/16) than Python (7/16). OpenAI's o3-mini-high outperforms others, completing 7/8 tasks. Test set performance of the top LLM-generated models matches or exceeds the median-participating team for all four tasks and surpasses the top-performing team for one task (p = 0.02). These findings underscore the potential of LLMs to democratize predictive modeling in omics and increase research output.

PMID: 41707656

ISSN: 2666-3791

CID: 6004802

[Zhong ji yi kan] = [Medicine for intermediate groups]. 2025. DOI: 10.1101/2021.03.22.21253654

Intra-tumoral epigenetic heterogeneity and aberrant molecular clocks in hepatocellular carcinoma

Restrepo, Paula; Bubie, Adrian; Craig, Amanda J; Cameron, Daniel; Labgaa, Ismail; Schwartz, Myron; Thung, Swan; Stolovitzky, Gustavo A; Losic, Bojan; Villanueva, Augusto

There is limited understanding of the epigenetic drivers of tumor evolution in hepatocellular carcinoma (HCC). Here we characterize the epigenetic contribution of methylation to intra-tumoral heterogeneity (mITH) using regional enhanced reduced-representation bisulfite sequencing DNA methylation data from 47 early stage, treatment-naive HCC biopsies across 9 patients by quantifying regional differential methylation across promoters and CpG islands, while overlapping with methylation age markers. Furthermore, we integrate these data with matching RNA-sequencing, targeted DNA sequencing, tumor-infiltrating lymphocyte (TIL), and hepatitis-B viral expression data. We found substantial mITH signatures in promoter and enhancer sites across 44% of patients in our cohort that highlight a novel axis of ITH that is not otherwise detectable from RNA analysis alone. Additionally, we identify an epigenetic tumoral aging measure that reflects a complex tumor fitness phenotype as a potential proxy for tumor clonality. Associating clinical outcomes with epigenetic tumoral age using 450k array data from 377 patients with HCC in the TCGA-LIHC single-biopsy cohort, we found evidence implying that epigenetically old tumors have lower fitness yet higher TIL burden. Our data reveal a novel, unique epigenetic axis of ITH in HCC that merits further exploration.

PMCID: 12633114

PMID: 41282903

CID: 6035742

iScience. 2025:28(8). DOI: 10.1016/j.isci.2025.113181

Placental epigenetic clocks derived from crowdsourcing: Implications for the study of accelerated aging in obstetrics

Bhatti, Gaurav; Sufriyana, Herdiantri; Romero, Roberto; Patel, Tushar; Tekola-Ayele, Fasil; Alsaggaf, Ibrahim; Gomez-Lopez, Nardhy; Su, Emily C Y; Done, Bogdan; Hoffmann, Steve; van Bömmel, Alena; Wan, Cen; Albrecht, Jake; Novak, Charles; Chaiworapongsa, Tinnakorn; Sirota, Marina; Aghaeepour, Nima; Stolovitzky, Gustavo; Bryant, David R; Tarca, Adi L

Epigenetic gestational age acceleration has been implicated in obstetric syndromes including preeclampsia, yet robust conclusions require accurate and unbiased epigenetic age models. Herein, we curated 1,842 public placental methylomes and organized a DREAM challenge to develop models of gestational age. Participants were blinded to the test data that we generated from 384 placentas encompassing normal and complicated pregnancies. Models developed during and post-challenge compared favorably to existing models in terms of accuracy, yet they were better calibrated throughout gestation and indicated that reports of accelerated epigenetic aging in preterm preeclampsia were likely due to modeling artifacts. The models show that accelerated aging is associated with a decrease in birthweight percentiles in male neonates delivered at term. By contrast, preterm accelerated aging was protective against delivery of a small-for-gestational-age neonate regardless of fetal sex. This work informs our understanding of the fetal sex-dimorphic role of the placenta epigenome in obstetrics.

PMCID: 12356336

PMID: 40822353

ISSN: 2589-0042

CID: 5908752

Nature communications. 2025:16(1). DOI: 10.1038/s41467-025-57409-1

Economics of AI and human task sharing for decision making in screening mammography

Ahsen, Mehmet Eren; Ayvaci, Mehmet U S; Mookerjee, Radha; Stolovitzky, Gustavo

The rising global incidence of breast cancer and the persistent shortage of specialized radiologists have heightened the demand for innovative solutions in mammography screening. Artificial intelligence (AI) has emerged as a promising tool to bridge this demand-supply gap, with potential applications ranging from full automation to integrated AI-human decision-making. This study evaluates the economic feasibility of incorporating artificial intelligence (AI) into mammography screening within healthcare settings, considering full or partial integration. To evaluate the economic viability, we employ an optimization model specifically designed to minimize mammography screening costs. This model considers three distinct approaches when interpreting mammograms: automation strategy utilizing AI exclusively, delegation strategy involving the selective allocation of tasks between radiologists and AI, and the expert-alone strategy relying solely on radiologist decisions. Our findings underscore the significance of disease prevalence in relation to the trade-off between costs associated with false positives (e.g., follow-up expenses) and false negatives (e.g., litigation costs stemming from missed diagnoses) in shaping the AI strategy for healthcare organizations. We backtest our approach using data from an AI contest in which participants aimed to match or surpass radiologists' performance in assessing screening mammograms for women. The contest data supports the optimality of the delegation strategy, potentially leading to cost savings of 17.5% to 30.1% compared to relying solely on human experts. Our research provides guidance for healthcare organizations considering AI integration in mammography screening, with broader implications for work design and human-AI hybrid solutions in various fields.

PMCID: 11889172

PMID: 40055356

ISSN: 2041-1723

CID: 5807972

JMIR AI. 2024:3. DOI: 10.2196/50800

Optimizing Clinical Trial Eligibility Design Using Natural Language Processing Models and Real-World Data: Algorithm Development and Validation

Lee, Kyeryoung; Liu, Zongzhi; Mai, Yun; Jun, Tomi; Ma, Meng; Wang, Tongyu; Ai, Lei; Calay, Ediz; Oh, William; Stolovitzky, Gustavo; Schadt, Eric; Wang, Xiaoyan

BACKGROUND: Clinical trials are vital for developing new therapies but can also delay drug development. Efficient trial data management, optimized trial protocol, and accurate patient identification are critical for reducing trial timelines. Natural language processing (NLP) has the potential to achieve these objectives. OBJECTIVE: This study aims to assess the feasibility of using data-driven approaches to optimize clinical trial protocol design and identify eligible patients. This involves creating a comprehensive eligibility criteria knowledge base integrated within electronic health records using deep learning-based NLP techniques. METHODS: We obtained data of 3281 industry-sponsored phase 2 or 3 interventional clinical trials recruiting patients with non-small cell lung cancer, prostate cancer, breast cancer, multiple myeloma, ulcerative colitis, and Crohn disease from ClinicalTrials.gov, spanning the period between 2013 and 2020. A customized bidirectional long short-term memory- and conditional random field-based NLP pipeline was used to extract all eligibility criteria attributes and convert hypernym concepts into computable hyponyms along with their corresponding values. To illustrate the simulation of clinical trial design for optimization purposes, we selected a subset of patients with non-small cell lung cancer (n=2775), curated from the Mount Sinai Health System, as a pilot study. RESULTS: -score (0.83, range 0.67-1), enabling the efficient extraction of granular criteria entities and relevant attributes from 3281 clinical trials. A standardized eligibility criteria knowledge base, compatible with electronic health records, was developed by transforming hypernym concepts into machine-interpretable hyponyms along with their corresponding values. In addition, an interface prototype demonstrated the practicality of leveraging real-world data for optimizing clinical trial protocols and identifying eligible patients. CONCLUSIONS: Our customized NLP pipeline successfully generated a standardized eligibility criteria knowledge base by transforming hypernym criteria into machine-readable hyponyms along with their corresponding values. A prototype interface integrating real-world patient information allows us to assess the impact of each eligibility criterion on the number of patients eligible for the trial. Leveraging NLP and real-world data in a data-driven approach holds promise for streamlining the overall clinical trial process, optimizing processes, and improving efficiency in patient identification.

PMCID: 11319878

PMID: 39073872

ISSN: 2817-1705

CID: 5799492

Bioinformatics advances. 2024:4(1). DOI: 10.1093/bioadv/vbae093

Optimal linear ensemble of binary classifiers

Ahsen, Mehmet Eren; Vogel, Robert; Stolovitzky, Gustavo

MOTIVATION: The integration of vast, complex biological data with computational models offers profound insights and predictive accuracy. Yet, such models face challenges: poor generalization and limited labeled data. RESULTS: To overcome these difficulties in binary classification tasks, we developed the Method for Optimal Classification by Aggregation (MOCA) algorithm, which addresses the problem of generalization by virtue of being an ensemble learning method and can be used in problems with limited or no labeled data. We developed both an unsupervised (uMOCA) and a supervised (sMOCA) variant of MOCA. For uMOCA, we show how to infer the MOCA weights in an unsupervised way, which are optimal under the assumption of class-conditioned independent classifier predictions. When it is possible to use labels, sMOCA uses empirically computed MOCA weights. We demonstrate the performance of uMOCA and sMOCA using simulated data as well as actual data previously used in Dialogue on Reverse Engineering and Methods (DREAM) challenges. We also propose an application of sMOCA for transfer learning where we use pre-trained computational models from a domain where labeled data are abundant and apply them to a different domain with less abundant labeled data. AVAILABILITY AND IMPLEMENTATION: GitHub repository, https://github.com/robert-vogel/moca.

PMCID: 11249386

PMID: 39011276

ISSN: 2635-0041

CID: 5799482

Journal of extracellular vesicles. 2024:13(8). DOI: 10.1002/jev2.12481

Extracellular vesicles carry transcriptional 'dark matter' revealing tissue-specific information

Dogra, Navneet; Chen, Tzu-Yi; Gonzalez-Kozlova, Edgar; Miceli, Rebecca; Cordon-Cardo, Carlos; Tewari, Ashutosh K; Losic, Bojan; Stolovitzky, Gustavo

From eukaryotes to prokaryotes, all cells secrete extracellular vesicles (EVs) as part of their regular homeostasis, intercellular communication, and cargo disposal. Accumulating evidence suggests that small EVs carry functional small RNAs, potentially serving as extracellular messengers and liquid-biopsy markers. Yet, the complete transcriptomic landscape of EV-associated small RNAs during disease progression is poorly delineated due to critical limitations including the protocols used for sequencing, suboptimal alignment of short reads (20-50 nt), and uncharacterized genome annotations, often denoted as the 'dark matter' of the genome. In this study, we investigate the EV-associated small unannotated RNAs that arise from endogenous genes and are part of the genomic 'dark matter', which may play a key emerging role in regulating gene expression and translational mechanisms. To address this, we created a distinct small RNAseq dataset from human prostate cancer & benign tissues, and EVs derived from blood (pre- & post-prostatectomy), urine, and human prostate carcinoma epithelial cell line. We then developed an unsupervised data-based bioinformatic pipeline that recognizes biologically relevant transcriptional signals irrespective of their genomic annotation. Using this approach, we discovered distinct EV-RNA expression patterns emerging from the un-annotated genomic regions (UGRs) of the transcriptomes associated with tissue-specific phenotypes. We have named these novel EV-associated small RNAs as 'EV-UGRs' or 'EV-dark matter'. Here, we demonstrate that EV-UGR gene expressions are downregulated by ∼100 fold (FDR < 0.05) in the circulating serum EVs from aggressive prostate cancer subjects. Remarkably, these EV-UGR expression signatures were regained (upregulated) after radical prostatectomy in the same follow-up patients. Finally, we developed a stem-loop RT-qPCR assay that validated prostate cancer-specific EV-UGRs for selective fluid-based diagnostics. Overall, using an unsupervised data-driven approach, we investigate the 'dark matter' of EV-transcriptome and demonstrate that EV-UGRs carry tissue-specific information that significantly alters pre- and post-prostatectomy in the prostate cancer patients. Although further validation in randomized clinical trials is required, this new class of EV-RNAs holds promise in liquid-biopsy by avoiding highly invasive biopsy procedures in prostate cancer.

PMCID: 11327273

PMID: 39148266

ISSN: 2001-3078

CID: 5799502

Cell reports. Medicine. 2024:5(1). DOI: 10.1016/j.xcrm.2023.101350

Microbiome preterm birth DREAM challenge: Crowdsourcing machine learning approaches to advance preterm birth research

Golob, Jonathan L; Oskotsky, Tomiko T; Tang, Alice S; Roldan, Alennie; Chung, Verena; Ha, Connie W Y; Wong, Ronald J; Flynn, Kaitlin J; Parraga-Leo, Antonio; Wibrand, Camilla; Minot, Samuel S; Oskotsky, Boris; Andreoletti, Gaia; Kosti, Idit; Bletz, Julie; Nelson, Amber; Gao, Jifan; Wei, Zhoujingpeng; Chen, Guanhua; Tang, Zheng-Zheng; Novielli, Pierfrancesco; Romano, Donato; Pantaleo, Ester; Amoroso, Nicola; Monaco, Alfonso; Vacca, Mirco; De Angelis, Maria; Bellotti, Roberto; Tangaro, Sabina; Kuntzleman, Abigail; Bigcraft, Isaac; Techtmann, Stephen; Bae, Daehun; Kim, Eunyoung; Jeon, Jongbum; Joe, Soobok; Theis, Kevin R; Ng, Sherrianne; Lee, Yun S; Diaz-Gimeno, Patricia; Bennett, Phillip R; MacIntyre, David A; Stolovitzky, Gustavo; Lynch, Susan V; Albrecht, Jake; Gomez-Lopez, Nardhy; Romero, Roberto; Stevenson, David K; Aghaeepour, Nima; Tarca, Adi L; Costello, James C; Sirota, Marina

Every year, 11% of infants are born preterm with significant health consequences, with the vaginal microbiome a risk factor for preterm birth. We crowdsource models to predict (1) preterm birth (PTB; <37 weeks) or (2) early preterm birth (ePTB; <32 weeks) from 9 vaginal microbiome studies representing 3,578 samples from 1,268 pregnant individuals, aggregated from public raw data via phylogenetic harmonization. The predictive models are validated on two independent unpublished datasets representing 331 samples from 148 pregnant individuals. The top-performing models (among 148 and 121 submissions from 318 teams) achieve area under the receiver operator characteristic (AUROC) curve scores of 0.69 and 0.87 predicting PTB and ePTB, respectively. Alpha diversity, VALENCIA community state types, and composition are important features in the top-performing models, most of which are tree-based methods. This work is a model for translation of microbiome data into clinically relevant predictive models and to better understand preterm birth.

PMID: 38134931

ISSN: 2666-3791

CID: 5799442

iScience. 2024:27(3). DOI: 10.1016/j.isci.2024.108905

Modeling combination therapies in patient cohorts and cell cultures using correlated drug action

Arun, Adith S; Kim, Sung-Cheol; Ahsen, Mehmet Eren; Stolovitzky, Gustavo

Characterizing the effect of combination therapies is vital for treating diseases like cancer. We introduce correlated drug action (CDA), a baseline model for the study of drug combinations in both cell cultures and patient populations, which assumes that the efficacy of drugs in a combination may be correlated. We apply temporal CDA (tCDA) to clinical trial data, and demonstrate the utility of this approach in identifying possible synergistic combinations and others that can be explained in terms of monotherapies. Using MCF7 cell line data, we assess combinations with dose CDA (dCDA), a model that generalizes other proposed models (e.g., Bliss response-additivity, the dose equivalence principle), and introduce Excess over CDA (EOCDA), a new metric for identifying possible synergistic combinations in cell culture.

PMCID: 10882105

PMID: 38390492

ISSN: 2589-0042

CID: 5799452