Massive parallelization boosts big Bayesian multidimensional scaling
Holbrook, Andrew J; Lemey, Philippe; Baele, Guy; Dellicour, Simon; Brockmann, Dirk; Rambaut, Andrew; Suchard, Marc A
Big Bayes is the computationally intensive co-application of big data and large, expressive Bayesian models for the analysis of complex phenomena in scientific inference and statistical learning. Standing as an example, Bayesian multidimensional scaling (MDS) can help scientists learn viral trajectories through space-time, but its computational burden prevents its wider use. Crucial MDS model calculations scale quadratically in the number of observations. We partially mitigate this limitation through massive parallelization using multi-core central processing units, instruction-level vectorization and graphics processing units (GPUs). Fitting the MDS model using Hamiltonian Monte Carlo, GPUs can deliver more than 100-fold speedups over serial calculations and thus extend Bayesian MDS to a big data setting. To illustrate, we employ Bayesian MDS to infer the rate at which different seasonal influenza virus subtypes use worldwide air traffic to spread around the globe. We examine 5392 viral sequences and their associated 14 million pairwise distances arising from the number of commercial airline seats per year between viral sampling locations. To adjust for shared evolutionary history of the viruses, we implement a phylogenetic extension to the MDS model and learn that subtype H3N2 spreads most effectively, consistent with its epidemic success relative to other seasonal influenza subtypes. Finally, we provide MassiveMDS, an open-source, stand-alone C++ library and rudimentary R package, and discuss program design and high-level implementation with an emphasis on important aspects of computing architecture that become relevant at scale.
PMCID:8218718
PMID: 34168419
ISSN: 1061-8600
CID: 5170652
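The quadratic scaling mentioned in the abstract above can be checked with a short sketch: 5392 observations give 5392 × 5391 / 2 unordered pairs, which matches the roughly 14 million pairwise distances quoted. The `squared_stress` function below is a hypothetical stand-in for a pairwise MDS objective, not the likelihood used in the paper or the MassiveMDS API; it only illustrates why one evaluation visits every pair.

```python
import math
from itertools import combinations


def n_pairs(n):
    """Number of unordered pairs among n observations: n(n-1)/2."""
    return n * (n - 1) // 2


def squared_stress(observed, coords):
    """Sum of squared differences between observed and latent distances.

    observed: dict mapping (i, j) with i < j to an observed distance.
    coords:   list of latent coordinate tuples, one per observation.
    Illustrative objective only (an assumption, not the paper's model).
    """
    total = 0.0
    # The loop body runs once per pair, so the cost grows as O(N^2).
    for i, j in combinations(range(len(coords)), 2):
        latent = math.dist(coords[i], coords[j])
        total += (observed[(i, j)] - latent) ** 2
    return total


if __name__ == "__main__":
    print(n_pairs(5392))  # pair count for the influenza example
```

Doubling the number of observations roughly quadruples the number of pairs, which is the cost that parallel hardware is used to offset.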
Efficient Bayesian inference of general Gaussian models on large phylogenetic trees
Bastide, Paul; Ho, Lam Si Tung; Baele, Guy; Lemey, Philippe; Suchard, Marc A.
ISI:000674675200021
ISSN: 1932-6157
CID: 5171232
Bayesian Evaluation of Temporal Signal in Measurably Evolving Populations
Duchene, Sebastian; Lemey, Philippe; Stadler, Tanja; Ho, Simon Y W; Duchene, David A; Dhanasekaran, Vijaykrishna; Baele, Guy
Phylogenetic methods can use the sampling times of molecular sequence data to calibrate the molecular clock, enabling the estimation of evolutionary rates and timescales for rapidly evolving pathogens and data sets containing ancient DNA samples. A key aspect of such calibrations is whether a sufficient amount of molecular evolution has occurred over the sampling time window, that is, whether the data can be treated as having come from a measurably evolving population. Here, we investigate the performance of a fully Bayesian evaluation of temporal signal (BETS) in sequence data. The method involves comparing the fit to the data of two models: a model in which the data are accompanied by the actual (heterochronous) sampling times, and a model in which the samples are constrained to be contemporaneous (isochronous). We conducted simulations under a wide range of conditions to demonstrate that BETS accurately classifies data sets according to whether they contain temporal signal or not, even when there is substantial among-lineage rate variation. We explore the behavior of this classification in analyses of five empirical data sets: modern samples of A/H1N1 influenza virus, the bacterium Bordetella pertussis, coronaviruses from mammalian hosts, ancient DNA from Hepatitis B virus, and mitochondrial genomes of dog species. Our results indicate that BETS is an effective alternative to other tests of temporal signal. In particular, this method has the key advantage of allowing a coherent assessment of the entire model, including the molecular clock and tree prior which are essential aspects of Bayesian phylodynamic analyses.
PMCID:7454806
PMID: 32895707
ISSN: 1537-1719
CID: 5170512
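The model comparison described in the abstract above can be sketched as a difference of log marginal likelihoods. This is a minimal illustration that assumes the comparison is summarized as a log Bayes factor; the function names, the example numbers, and the decision threshold are hypothetical and are not taken from the paper or from any software package.

```python
def log_bayes_factor(log_ml_heterochronous, log_ml_isochronous):
    """Log Bayes factor in favour of the model that uses the sampling times."""
    return log_ml_heterochronous - log_ml_isochronous


def has_temporal_signal(log_ml_heterochronous, log_ml_isochronous, threshold=0.0):
    """Return True when the heterochronous model fits better than the
    isochronous one by more than `threshold` log units.

    The default threshold of 0.0 is an assumption made for illustration.
    """
    return log_bayes_factor(log_ml_heterochronous, log_ml_isochronous) > threshold


if __name__ == "__main__":
    # Hypothetical log marginal likelihoods, not results from the study.
    print(has_temporal_signal(-10250.0, -10310.0))
```

In practice the two log marginal likelihoods would come from separate analyses of the same alignment, one with the actual sampling times and one with all samples constrained to the same time.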
Accommodating individual travel history and unsampled diversity in Bayesian phylogeographic inference of SARS-CoV-2
Lemey, Philippe; Hong, Samuel L; Hill, Verity; Baele, Guy; Poletto, Chiara; Colizza, Vittoria; O'Toole, Áine; McCrone, John T; Andersen, Kristian G; Worobey, Michael; Nelson, Martha I; Rambaut, Andrew; Suchard, Marc A
Spatiotemporal bias in genome sampling can severely confound discrete trait phylogeographic inference. This has impeded our ability to accurately track the spread of SARS-CoV-2, the virus responsible for the COVID-19 pandemic, despite the availability of unprecedented numbers of SARS-CoV-2 genomes. Here, we present an approach to integrate individual travel history data in Bayesian phylogeographic inference and apply it to the early spread of SARS-CoV-2. We demonstrate that including travel history data yields i) more realistic hypotheses of virus spread and ii) higher posterior predictive accuracy compared to including only sampling location. We further explore methods to ameliorate the impact of sampling bias by augmenting the phylogeographic analysis with lineages from undersampled locations. Our reconstructions reinforce specific transmission hypotheses suggested by the inclusion of travel history data, but also suggest alternative routes of virus migration that are plausible within the epidemiological context but are not apparent with current sampling efforts.
PMID: 33037213
ISSN: 2041-1723
CID: 5170542
Gradients Do Grow on Trees: A Linear-Time O(N)-Dimensional Gradient for Statistical Phylogenetics
Ji, Xiang; Zhang, Zhenyu; Holbrook, Andrew; Nishimura, Akihiko; Baele, Guy; Rambaut, Andrew; Lemey, Philippe; Suchard, Marc A
Calculation of the log-likelihood stands as the computational bottleneck for many statistical phylogenetic algorithms. Even worse is its gradient evaluation, often used to target regions of high probability. Order O(N)-dimensional gradient calculations based on the standard pruning algorithm require O(N^2) operations, where N is the number of sampled molecular sequences. With the advent of high-throughput sequencing, recent phylogenetic studies have analyzed hundreds to thousands of sequences, with an apparent trend toward even larger data sets as a result of advancing technology. Such large-scale analyses challenge phylogenetic reconstruction by requiring inference on larger sets of process parameters to model the increasing data heterogeneity. To make these analyses tractable, we present a linear-time algorithm for O(N)-dimensional gradient evaluation and apply it to general continuous-time Markov processes of sequence substitution on a phylogenetic tree without a need to assume either stationarity or reversibility. We apply this approach to learn the branch-specific evolutionary rates of three pathogenic viruses: West Nile virus, Dengue virus, and Lassa virus. Our proposed algorithm significantly improves inference efficiency with a 126- to 234-fold increase in maximum-likelihood optimization and a 16- to 33-fold computational performance increase in a Bayesian framework.
PMCID:7530611
PMID: 32458974
ISSN: 1537-1719
CID: 5170492
Genomic Epidemiology, Evolution, and Transmission Dynamics of Porcine Deltacoronavirus
He, Wan-Ting; Ji, Xiang; He, Wei; Dellicour, Simon; Wang, Shilei; Li, Gairu; Zhang, Letian; Gilbert, Marius; Zhu, Henan; Xing, Gang; Veit, Michael; Huang, Zhen; Han, Guan-Zhu; Huang, Yaowei; Suchard, Marc A; Baele, Guy; Lemey, Philippe; Su, Shuo
The emergence of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has shown once again that coronaviruses (CoVs) in animals are potential sources for epidemics in humans. Porcine deltacoronavirus (PDCoV) is an emerging enteropathogen of swine with a worldwide distribution. Here, we implemented and described an approach to analyze the epidemiology of PDCoV following its emergence in the pig population. We performed an integrated analysis of full genome sequence data from 21 newly sequenced viruses, along with comprehensive epidemiological surveillance data collected globally over the last 15 years. We found four distinct phylogenetic lineages of PDCoV, which differ in their geographic circulation patterns. Interestingly, we identified more frequent intra- and interlineage recombination and higher virus genetic diversity in the Chinese lineages compared with the USA lineage, where pigs are raised in different farming systems and ecological environments. Most recombination breakpoints are located in the ORF1ab gene rather than in genes encoding structural proteins. We also identified five amino acids under positive selection in the spike protein, suggesting a role for adaptive evolution. According to structural mapping, three positively selected sites are located in the N-terminal domain of the S1 subunit, which is most likely involved in binding to a carbohydrate receptor, whereas the other two are located in or near the fusion peptide of the S2 subunit and thus might affect membrane fusion. Finally, our phylogeographic investigations highlighted notable South-North transmission as well as frequent long-distance dispersal events in China that could implicate human-mediated transmission. Our findings provide new insights into the evolution and dispersal of PDCoV that contribute to our understanding of the critical factors involved in CoV emergence.
PMCID:7454817
PMID: 32407507
ISSN: 1537-1719
CID: 5170472
Radiation of the coralline red algae (Corallinophycidae, Rhodophyta) crown group as inferred from a multilocus time-calibrated phylogeny
Peña, Viviana; Vieira, Christophe; Braga, Juan Carlos; Aguirre, Julio; Rösler, Anja; Baele, Guy; De Clerck, Olivier; Le Gall, Line
The subclass Corallinophycidae is the only group of red algae characterized by the presence of calcite crystals in their cell walls. Except for the Rhodogorgonales, the remaining orders - collectively called corallines - are diverse and widely distributed, having calcified cell walls and highly variable morphology. Corallines constitute the group with the richest fossil record among marine algae. In the present study, we investigate the evolutionary history of the subclass Corallinophycidae and provide a time-calibrated phylogeny to date the radiation of the crown group and its main lineages. We use a multi-locus dataset with an extensive taxon sampling and comprehensive collection of fossil records, carefully assigned to corallines, to reconstruct a time-calibrated phylogeny of this subclass. Our molecular clock analyses suggest that the onset of crown group diversification of Corallinophycidae started in the Lower Jurassic and sped up in the Lower Cretaceous. The divergence time of the oldest order Sporolithales is estimated in the Lower Cretaceous followed by the remaining orders. We discuss the long period of more than 300 million years between the early Paleozoic records attributed to the stem group of Corallinophycidae and the radiation of the crown group. Our inferred phylogeny yields three highly-supported suprageneric lineages for the order Corallinales; we confirm the family Mastophoraceae and amend circumscription of the families Corallinaceae and Lithophyllaceae. These three families are distinguished by a combination of vegetative and reproductive features. In light of the phylogeny, we discuss the evolutionary trends of eleven morphological characters. In addition, we also highlight homoplasious characters and selected autapomorphies emerging in particular taxa.
PMID: 32360706
ISSN: 1095-9513
CID: 5170452
nosoi: A stochastic agent-based transmission chain simulation framework in R
Lequime, Sebastian; Bastide, Paul; Dellicour, Simon; Lemey, Philippe; Baele, Guy
The transmission process of an infectious agent creates a connected chain of hosts linked by transmission events, known as a transmission chain. Reconstructing transmission chains remains a challenging endeavour, except in rare cases characterized by intense surveillance and epidemiological inquiry. Inference frameworks attempt to estimate or approximate these transmission chains but the accuracy and validity of such methods generally lack formal assessment on datasets for which the actual transmission chain was observed. We here introduce nosoi, an open-source R package that offers a complete, tunable and expandable agent-based framework to simulate transmission chains under a wide range of epidemiological scenarios for single-host and dual-host epidemics. nosoi is accessible through GitHub and CRAN, and is accompanied by extensive documentation, providing help and practical examples to assist users in setting up their own simulations. Once infected, each host or agent can undergo a series of events during each time step, such as moving (between locations) or transmitting the infection, all of these being driven by user-specified rules or data, such as travel patterns between locations. nosoi is able to generate a multitude of epidemic scenarios that can, for example, be used to validate a wide range of reconstruction methods, including epidemic modelling and phylodynamic analyses. nosoi also offers a comprehensive framework to leverage empirically acquired data, allowing the user to explore how variations in parameters can affect epidemic potential. Aside from research questions, nosoi can provide lecturers with a complete teaching tool to offer students a hands-on exploration of the dynamics of epidemiological processes and the factors that impact them. Because the package does not rely on mathematical formalism but uses a more intuitive algorithmic approach, even extensive changes of the entire model can be easily and quickly implemented.
PMCID:7496779
PMID: 32983401
ISSN: 2041-210x
CID: 5170532
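The agent-based idea in the abstract above, where each infected host may transmit at every time step and the resulting links form a transmission chain, can be sketched in a few lines. This is a toy Python illustration and not the nosoi API (nosoi is an R package); the function `simulate_chain`, its parameters, and the constant per-step transmission probability are all assumptions made for the example.

```python
import random


def simulate_chain(p_transmit, n_steps, max_hosts, seed=0):
    """Simulate a toy single-host transmission chain.

    Each infected host infects one new host per time step with
    probability p_transmit (an assumed, constant rule). Returns a list
    of records, one per host, noting who infected whom and when.
    """
    rng = random.Random(seed)
    # Host 0 is the index case and has no infector.
    chain = [{"host": 0, "infected_by": None, "time": 0}]
    for t in range(1, n_steps + 1):
        # Only hosts infected before this step may transmit during it.
        n_infected_before = len(chain)
        for host in range(n_infected_before):
            if len(chain) >= max_hosts:
                return chain
            if rng.random() < p_transmit:
                chain.append({"host": len(chain), "infected_by": host, "time": t})
    return chain


if __name__ == "__main__":
    for record in simulate_chain(p_transmit=0.5, n_steps=4, max_hosts=20, seed=1):
        print(record)
```

Because the true chain is recorded during simulation, output of this kind can serve as ground truth when assessing a reconstruction method.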
Temporal signal and the phylodynamic threshold of SARS-CoV-2
Duchene, Sebastian; Featherstone, Leo; Haritopoulou-Sinanidou, Melina; Rambaut, Andrew; Lemey, Philippe; Baele, Guy
The ongoing SARS-CoV-2 outbreak marks the first time that large amounts of genome sequence data have been generated and made publicly available in near real time. Early analyses of these data revealed low sequence variation, a finding that is consistent with a recently emerging outbreak, but which raises the question of whether such data are sufficiently informative for phylogenetic inferences of evolutionary rates and time scales. The phylodynamic threshold is a key concept that refers to the point in time at which sufficient molecular evolutionary change has accumulated in available genome samples to obtain robust phylodynamic estimates. For example, before the phylodynamic threshold is reached, genomic variation is so low that even large amounts of genome sequences may be insufficient to estimate the virus's evolutionary rate and the time scale of an outbreak. We collected genome sequences of SARS-CoV-2 from public databases at eight different points in time and conducted a range of tests of temporal signal to determine if and when the phylodynamic threshold was reached, and the range of inferences that could be reliably drawn from these data. Our results indicate that by 2 February 2020, estimates of evolutionary rates and time scales had become possible. Analyses of subsequent data sets that included between 47 and 122 genomes converged at an evolutionary rate of about 1.1 × 10^-3 subs/site/year and a time of origin of around late November 2019. Our study provides guidelines to assess the phylodynamic threshold and demonstrates that establishing this threshold constitutes a fundamental step for understanding the power and limitations of early data in outbreak genome surveillance.
PMCID:7454936
PMID: 33235813
ISSN: 2057-1577
CID: 5170552
Accommodating individual travel history, global mobility, and unsampled diversity in phylogeography: a SARS-CoV-2 case study [PrePrint]
Lemey, Philippe; Hong, Samuel; Hill, Verity; Baele, Guy; Poletto, Chiara; Colizza, Vittoria; O'Toole, Áine; McCrone, John T; Andersen, Kristian G; Worobey, Michael; Nelson, Martha I; Rambaut, Andrew; Suchard, Marc A
Spatiotemporal bias in genome sequence sampling can severely confound phylogeographic inference based on discrete trait ancestral reconstruction. This has impeded our ability to accurately track the emergence and spread of SARS-CoV-2, which is the virus responsible for the COVID-19 pandemic. Despite the availability of staggering numbers of genomes on a global scale, evolutionary reconstructions of SARS-CoV-2 are hindered by the slow accumulation of sequence divergence over its relatively short transmission history. When confronted with these issues, incorporating additional contextual data may critically inform phylodynamic reconstructions. Here, we present a new approach to integrate individual travel history data in Bayesian phylogeographic inference and apply it to the early spread of SARS-CoV-2, while also including global air transportation data. We demonstrate that including travel history data for each SARS-CoV-2 genome yields more realistic reconstructions of virus spread, particularly when travelers from undersampled locations are included to mitigate sampling bias. We further explore the impact of sampling bias by incorporating unsampled sequences from undersampled locations in the analyses. Our reconstructions reinforce specific transmission hypotheses suggested by the inclusion of travel history data, but also suggest alternative routes of virus migration that are plausible within the epidemiological context but are not apparent with current sampling efforts. Although further research is needed to fully examine the performance of our new data integration approaches and to further improve them, they represent multiple new avenues for directly addressing the colossal issue of sample bias in phylogeographic inference.
PMID: 32596695
ISSN: 2692-8205
CID: 5170502