Can we identify cellular pathways implicated in cancer using gene expression data?
Shah, Nigam; Lepre, Jorge; Tu, Yuhai; Stolovitzky, Gustavo
The cancer state of a cell is characterized by alterations of important cellular processes such as cell proliferation, apoptosis, DNA-damage repair, etc. The expression of genes associated with cancer related pathways, therefore, may exhibit differences between the normal and the cancerous states. We explore various means to find these differences. We analyze 6 different pathways (p53, Ras, Brca, DNA damage repair, NF-kappaB and beta-catenin) and 4 different types of cancer: colon, pancreas, prostate and kidney. Our results are found to be mostly consistent with existing knowledge of the involvement of these pathways in different cancers. Our analysis constitutes proof of principle that it may be possible to predict the involvement of a particular pathway in cancer or other diseases by using gene expression data. Such a method would be particularly useful for the types of diseases where the biology is poorly understood.
PMID: 16452783
ISSN: 1555-3930
CID: 5821852
Quantitative noise analysis for gene expression microarray experiments
Tu, Y; Stolovitzky, G; Klein, U
A major challenge in DNA microarray analysis is to effectively dissociate actual gene expression values from experimental noise. We report here a detailed noise analysis for oligonucleotide-based microarray experiments involving reverse transcription, generation of labeled cRNA (target) through in vitro transcription, and hybridization of the target to the probe immobilized on the substrate. By designing sets of replicate experiments that bifurcate at different steps of the assay, we are able to separate the noise caused by sample preparation and the hybridization processes. We quantitatively characterize the strength of these different sources of noise and their respective dependence on the gene expression level. We find that the sample preparation noise is small, implying that the amplification process during the sample preparation is relatively accurate. The hybridization noise is found to have very strong dependence on the expression level, with different characteristics for the low and high expression values. The hybridization noise characteristics at the high expression regime are mostly Poisson-like, whereas its characteristics for the small expression levels are more complex, probably due to cross-hybridization. A method to evaluate the significance of gene expression fold changes based on noise characteristics is proposed.
PMCID:137831
PMID: 12388780
ISSN: 0027-8424
CID: 5821712
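The abstract above notes that hybridization noise at high expression is mostly Poisson-like, meaning the variance grows roughly in proportion to the signal. The following is a minimal sketch of how a fold change could be scored under that assumption alone. It is not the method proposed in the paper: the function names and the `alpha` variance scale are hypothetical, and the sketch ignores the more complex low-expression regime.

```python
import math

def fold_change_z(x1, x2, alpha=1.0):
    """Z-score for the difference x2 - x1, ASSUMING Poisson-like noise.

    Assumption (not from the paper): var(x) = alpha * x for each measurement,
    so the variance of the difference is alpha * (x1 + x2).
    """
    if x1 < 0 or x2 < 0 or alpha <= 0:
        raise ValueError("expression values must be >= 0 and alpha > 0")
    if x1 + x2 == 0:
        return 0.0
    return (x2 - x1) / math.sqrt(alpha * (x1 + x2))

def two_sided_p(z):
    """Two-sided p-value of a standard normal z-score."""
    return math.erfc(abs(z) / math.sqrt(2.0))

# The same 2-fold change is far more convincing at high expression
# than at low expression when the noise is Poisson-like.
z_low = fold_change_z(10, 20)
z_high = fold_change_z(1000, 2000)
```

Under this assumption the z-score of a fixed fold change grows with the square root of the expression level, which is why a single fold-change cutoff applied to all genes can be misleading.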
Systematic learning of gene functional classes from DNA array expression data by using multilayer perceptrons
Mateos, Alvaro; Dopazo, Joaquín; Jansen, Ronald; Tu, Yuhai; Gerstein, Mark; Stolovitzky, Gustavo
Recent advances in microarray technology have opened new ways for functional annotation of previously uncharacterised genes on a genomic scale. This has been demonstrated by unsupervised clustering of co-expressed genes and, more importantly, by supervised learning algorithms. Using prior knowledge, these algorithms can assign functional annotations based on more complex expression signatures found in existing functional classes. Previously, support vector machines (SVMs) and other machine-learning methods have been applied to a limited number of functional classes for this purpose. Here we present, for the first time, the comprehensive application of supervised neural networks (SNNs) for functional annotation. Our study is novel in that we report systematic results for ~100 classes in the Munich Information Center for Protein Sequences (MIPS) functional catalog. We found that only ~10% of these are learnable (based on the rate of false negatives). A closer analysis reveals that false positives (and negatives) in a machine-learning context are not necessarily "false" in a biological sense. We show that the high degree of interconnections among functional classes confounds the signatures that ought to be learned for a unique class. We term this the "Borges effect" and introduce two new numerical indices for its quantification. Our analysis indicates that classification systems with a lower Borges effect are better suited for machine learning. Furthermore, we introduce a learning procedure for combining false positives with the original class. We show that in a few iterations this process converges to a gene set that is learnable with considerably low rates of false positives and negatives and contains genes that are biologically related to the original class, allowing for a coarse reconstruction of the interactions between associated biological pathways. We exemplify this methodology using the well-studied tricarboxylic acid cycle.
PMCID:187551
PMID: 12421757
ISSN: 1088-9051
CID: 5821722
Analysis of gene expression microarrays for phenotype classification
Califano, A; Stolovitzky, G; Tu, Y
Several microarray technologies that monitor the level of expression of a large number of genes have recently emerged. Given DNA-microarray data for a set of cells characterized by a given phenotype and for a set of control cells, an important problem is to identify "patterns" of gene expression that can be used to predict cell phenotype. The potential number of such patterns is exponential in the number of genes. In this paper, we propose a solution to this problem based on a supervised learning algorithm, which differs substantially from previous schemes. It couples a complex, non-linear similarity metric, which maximizes the probability of discovering discriminative gene expression patterns, and a pattern discovery algorithm called SPLASH. The latter discovers efficiently and deterministically all statistically significant gene expression patterns in the phenotype set. Statistical significance is evaluated based on the probability of a pattern to occur by chance in the control set. Finally, a greedy set covering algorithm is used to select an optimal subset of statistically significant patterns, which form the basis for a standard likelihood ratio classification scheme. We analyze data from 60 human cancer cell lines using this method, and compare our results with those of other supervised learning schemes. Different phenotypes are studied. These include cancer morphologies (such as melanoma), molecular targets (such as mutations in the p53 gene), and therapeutic targets related to the sensitivity to anticancer compounds. We also analyze a synthetic data set that shows that this technique is especially well suited for the analysis of sub-phenotype mixtures. For complex phenotypes, such as p53, our method produces an encouragingly low rate of false positives and false negatives and seems to outperform the others. Similar low rates are reported when predicting the efficacy of experimental anticancer compounds. This counts among the first reported studies where drug efficacy has been successfully predicted from large-scale expression data analysis.
PMID: 10977068
ISSN: 1553-0833
CID: 5821682
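The abstract above mentions a greedy set covering step that picks a subset of significant patterns. The following is a minimal sketch of generic greedy set cover, assuming each pattern is represented by the set of phenotype samples it matches. The data layout and the names `greedy_set_cover`, `samples`, and `patterns` are hypothetical and are not taken from the paper. Greedy selection is a standard heuristic and does not in general guarantee the smallest possible cover.

```python
def greedy_set_cover(universe, subsets):
    """Pick subsets greedily until the universe is covered.

    universe: iterable of items to cover (here: phenotype sample ids).
    subsets:  dict mapping a name (here: a pattern id) to the set of items it covers.
    Returns (chosen_names, still_uncovered_items).
    """
    uncovered = set(universe)
    chosen = []
    while uncovered and subsets:
        # Take the subset that covers the most still-uncovered items;
        # ties go to the first such subset in dict order.
        best = max(subsets, key=lambda name: len(subsets[name] & uncovered))
        gained = subsets[best] & uncovered
        if not gained:
            break  # nothing left can cover the remaining items
        chosen.append(best)
        uncovered -= gained
    return chosen, uncovered

# Hypothetical example: five samples and four candidate patterns.
samples = {1, 2, 3, 4, 5}
patterns = {"p1": {1, 2, 3}, "p2": {2, 4}, "p3": {4, 5}, "p4": {3}}
chosen, missed = greedy_set_cover(samples, patterns)
```

In this example `p1` is chosen first because it covers three samples, then `p3` because it covers both remaining samples.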
Catalytic tempering: A method for sampling rough energy landscapes by Monte Carlo
Stolovitzky, G; Berne, B J
A new Monte Carlo algorithm is presented for the efficient sampling of the Boltzmann distribution of configurations of systems with rough energy landscapes. The method is based on the introduction of a fictitious coordinate y so that the dimensionality of the system is increased by one. This augmented system has a potential surface and a temperature that is made to depend on the new coordinate y in such a way that for a small strip of the y space, called the "normal region," the temperature is set equal to the temperature desired and the potential is the original rough energy potential. To enhance barrier crossing outside the "normal region," the energy barriers are reduced by truncation (with preservation of the potential minima) and the temperature is made to increase with |y|. The method, called catalytic tempering or CAT, is found to greatly improve the rate of convergence of Monte Carlo sampling in model systems and to eliminate the quasi-ergodic behavior often found in the sampling of rough energy landscapes.
PMCID:17171
PMID: 11027326
ISSN: 0027-8424
CID: 5821692
Systematic and fully automated identification of protein sequence patterns
Hart, R K; Royyuru, A K; Stolovitzky, G; Califano, A
We present an efficient algorithm to systematically and automatically identify patterns in protein sequence families. The procedure is based on the Splash deterministic pattern discovery algorithm and on a framework to assess the statistical significance of patterns. We demonstrate its application to the fully automated discovery of patterns in 974 PROSITE families (the complete subset of PROSITE families which are defined by patterns and contain DR records). Splash generates patterns with better specificity and undiminished sensitivity, or vice versa, in 28% of the families; identical statistics were obtained in 48% of the families, worse statistics in 15%, and mixed behavior in the remaining 9%. In about 75% of the cases, Splash patterns identify sequence sites that overlap more than 50% with the corresponding PROSITE pattern. The procedure is sufficiently rapid to enable its use for daily curation of existing motif and profile databases. Our results also show that the statistical significance of discovered patterns correlates well with their biological significance. The trypsin subfamily of serine proteases is used to illustrate this method's ability to exhaustively discover all motifs in a family that are statistically and biologically significant. Finally, we discuss applications of sequence patterns to multiple sequence alignment and the training of more sensitive score-based motif models, akin to the procedure used by PSI-BLAST. All results are available at http://www.research.ibm.com/spat/.
PMID: 11108480
ISSN: 1066-5277
CID: 5821702
Compositional heterogeneity within, and uniformity between, DNA sequences of yeast chromosomes
Li, W; Stolovitzky, G; Bernaola-Galván, P; Oliver, J L
The heterogeneity within, and similarities between, yeast chromosomes are studied. For the former, we show by the size distribution of domains, coding density, size distribution of open reading frames, spatial power spectra, and deviation from binomial distribution for C + G% in large moving windows that there is a strong deviation of the yeast sequences from random sequences. For the latter, not only do we graphically illustrate the similarity for the above mentioned statistics, but we also carry out a rigorous analysis of variance (ANOVA) test. The hypothesis that all yeast chromosomes are similar cannot be rejected by this test. We examine the two possible explanations of this interchromosomal uniformity: a common origin, such as genome-wide duplication (polyploidization), and a concerted evolutionary process.
PMID: 9750191
ISSN: 1088-9051
CID: 5822872
Efficiency of DNA replication in the polymerase chain reaction
Stolovitzky, G; Cecchi, G
A detailed quantitative kinetic model for the polymerase chain reaction (PCR) is developed, which allows us to predict the probability of replication of a DNA molecule in terms of the physical parameters involved in the system. The important issue of the determination of the number of PCR cycles during which this probability can be considered to be a constant is solved within the framework of the model. New phenomena of multimodality and scaling behavior in the distribution of the number of molecules after a given number of PCR cycles are presented. The relevance of the model for quantitative PCR is discussed, and a novel quantitative PCR technique is proposed.
PMCID:24026
PMID: 8917524
ISSN: 0027-8424
CID: 5822862
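The abstract above treats PCR in terms of the probability that a DNA molecule is replicated in a cycle. The following is a minimal sketch of the simplest such picture, a branching process in which every molecule is copied independently with a fixed probability `p` in each cycle. It is not the paper's kinetic model: the function names are hypothetical, and holding `p` constant over all cycles is an assumption that the abstract itself says is valid only for a limited number of cycles.

```python
import random

def expected_copies(n0, p, cycles):
    """Mean copy number after `cycles` cycles: n0 * (1 + p) ** cycles."""
    return n0 * (1.0 + p) ** cycles

def simulate_pcr(n0, p, cycles, seed=0):
    """Simulate one run; returns the copy number before cycle 1 and after each cycle.

    ASSUMPTION: p is the same in every cycle and for every molecule.
    """
    rng = random.Random(seed)
    n = n0
    counts = [n]
    for _ in range(cycles):
        # Each existing molecule is duplicated independently with probability p.
        n += sum(1 for _ in range(n) if rng.random() < p)
        counts.append(n)
    return counts

# One starting molecule, 80% per-cycle efficiency, 10 cycles.
run = simulate_pcr(n0=1, p=0.8, cycles=10, seed=1)
mean = expected_copies(1, 0.8, 10)
```

Repeating `simulate_pcr` with different seeds and a small `n0` gives a distribution of final copy numbers around the mean, since the outcome of the first few cycles has a large effect on the final count.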
A simple model of chaotic advection and scattering
Stolovitzky, Gustavo; Kaper, Tasso J.; Sirovich, Lawrence
In this work, we study a blinking vortex-uniform stream map. This map arises as an idealized, but essential, model of time-dependent convection past concentrated vorticity in a number of fluid systems. The map exhibits a rich variety of phenomena, yet it is simple enough so as to yield to extensive analytical investigation. The map's dynamics is dominated by the chaotic scattering of fluid particles near the vortex core. Studying the paths of fluid particles, it is seen that quantities such as residence time distributions and exit-vs-entry positions scale in self-similar fashions. A bifurcation is identified in which a saddle fixed point is created upstream at infinity. The homoclinic tangle formed by the transversely intersecting stable and unstable manifolds of this saddle is principally responsible for the observed self-similarity. Also, since the model is simple enough, various other properties are quantified analytically in terms of the circulation strength, stream velocity, and blinking period. These properties include: entire hierarchies of fixed points and periodic points, the parameter values at which these points undergo conservative period-doubling bifurcations, the structure of the unstable manifolds of the saddle fixed and periodic points, and the detailed structure of the resonance zones inside the vortex core region. A connection is made between a weakly dissipative version of our map and the Ikeda map from nonlinear optics. Finally, we discuss the essential ingredients that our model contains for studying how chaotic scattering induced by time-dependent flow past vortical structures produces enhanced diffusivities. (c) 1995 American Institute of Physics.
PMID: 12780224
ISSN: 1089-7682
CID: 5824932