Searched for: in-biosketch:yes person:aphiny01
Total Results: 93


Computer models for identifying instrumental citations in the biomedical literature

Fu, Lawrence D.; Aphinyanaphongs, Yindalon; Aliferis, Constantin F.
The most popular method for evaluating the quality of a scientific publication is citation count. This metric assumes that a citation is a positive indicator of the quality of the cited work. This assumption is not always true, since citations serve many purposes; as a result, citation count is an indirect and imprecise measure of impact. If instrumental citations could be reliably distinguished from non-instrumental ones, this would readily improve the performance of existing citation-based metrics by excluding the non-instrumental citations. A citation was operationally defined as instrumental if either of the following was true: the hypothesis of the citing work was motivated by the cited work, or the citing work could not have been executed without the cited work. This work investigated the feasibility of developing computer models for automatically classifying citations as instrumental or non-instrumental. Instrumental citations were manually labeled, and machine learning models were trained on a combination of content and bibliometric features. The experimental results indicate that models based on content and bibliometric features are able to automatically classify instrumental citations with high predictivity (AUC = 0.86). Additional experiments using independent hold-out data and prospective validation show that the models are generalizable and can handle unseen cases. This work demonstrates that it is feasible to train computer models to automatically identify instrumental citations.
Affiliations: [Fu, Lawrence D.; Aphinyanaphongs, Yindalon] NYU Med Ctr, Ctr Hlth Informat & Bioinformat, Dept Med, New York, NY 10016 USA; [Aliferis, Constantin F.] NYU Med Ctr, Ctr Hlth Informat & Bioinformat, Dept Pathol, New York, NY 10016 USA
ISI:000327219900020
ISSN: 0138-9130
CID: 687922
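The headline result of the entry above is an AUC of 0.86. As a reference for what that number measures, here is a minimal pure-Python sketch of AUC via its rank-sum (Mann-Whitney) equivalence; the scores and labels are invented, not data from the paper:

```python
def auc(scores, labels):
    """Area under the ROC curve, computed as the probability that a
    randomly chosen positive outscores a randomly chosen negative
    (Mann-Whitney U / rank-sum equivalence)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # each (positive, negative) pair contributes 1 for a win, 0.5 for a tie
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# hypothetical classifier scores for labeled citations (1 = instrumental)
print(auc([0.9, 0.8, 0.4, 0.2], [1, 1, 0, 0]))  # a perfect ranking gives 1.0
```

An AUC of 0.86 thus means a randomly chosen instrumental citation outscores a randomly chosen non-instrumental one 86% of the time.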

Identifying unproven cancer treatments on the health web: addressing accuracy, generalizability and scalability

Aphinyanaphongs, Yin; Fu, Lawrence D; Aliferis, Constantin F
Building machine learning models that identify unproven cancer treatments on the Health Web is a promising approach for dealing with the dissemination of false and dangerous information to vulnerable health consumers. Aside from the obvious requirement of accuracy, two issues are of practical importance in deploying these models in real-world applications. (a) Generalizability: the models must generalize to all treatments (not just the ones used in the training of the models). (b) Scalability: the models can be applied efficiently to billions of documents on the Health Web. First, we provide methods and related empirical data demonstrating strong accuracy and generalizability. Second, by combining the MapReduce distributed architecture and high-dimensionality compression via Markov Boundary feature selection, we show how to scale the application of the models to WWW-scale corpora. The present work provides evidence that (a) a very small subset of unproven cancer treatments is sufficient to build a model to identify unproven treatments on the web; (b) unproven treatments use distinct language to market their claims, and this language is learnable; (c) through distributed parallelization and state-of-the-art feature selection, it is possible to prepare the corpora and build and apply models with large scalability.
PMCID:4162393
PMID: 23920640
ISSN: 0926-9630
CID: 484192
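The scalability argument in the entry above rests on pairing MapReduce with a heavily compressed model. A toy single-machine sketch of that idea follows; the term weights are entirely hypothetical, standing in for a trained classifier reduced to a few selected features, and a real deployment would distribute the map step across nodes:

```python
# hypothetical weights for a linear model compressed to a few selected terms
WEIGHTS = {"miracle": 2.1, "cure": 1.7, "breakthrough": 0.9, "trial": -1.2}
BIAS = -1.0

def map_score(doc_id, text):
    """Map step: score one web page with the compressed linear model."""
    tokens = text.lower().split()
    score = BIAS + sum(w * tokens.count(term) for term, w in WEIGHTS.items())
    return (doc_id, score)

def reduce_flagged(scored, threshold=0.0):
    """Reduce step: collect ids of pages scoring above the threshold."""
    return sorted(doc_id for doc_id, score in scored if score > threshold)

pages = {"p1": "miracle cure guaranteed overnight",
         "p2": "randomized controlled trial results"}
scored = [map_score(i, t) for i, t in pages.items()]
print(reduce_flagged(scored))  # → ['p1']
```

Because each page is scored independently, the map step parallelizes trivially, which is what makes WWW-scale application feasible.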

A comparison of evaluation metrics for biomedical journals, articles, and websites in terms of sensitivity to topic

Fu, Lawrence D; Aphinyanaphongs, Yindalon; Wang, Lily; Aliferis, Constantin F
Evaluating the biomedical literature and health-related websites for quality are challenging information retrieval tasks. Current commonly used methods include impact factor for journals, PubMed's clinical query filters and machine learning-based filter models for articles, and PageRank for websites. Previous work has focused on the average performance of these methods without considering the topic, and it is unknown how performance varies for specific topics or focused searches. Clinicians, researchers, and users should be aware when expected performance is not achieved for specific topics. The present work analyzes the behavior of these methods for a variety of topics. Impact factor, clinical query filters, and PageRank vary widely across different topics, while a topic-specific impact factor and machine learning-based filter models are more stable. The results demonstrate that a method may perform excellently on average but struggle when used on a number of narrower topics. Topic-adjusted metrics and other topic-robust methods have an advantage in such situations. Users of traditional topic-sensitive metrics should be aware of their limitations.
PMCID:3143298
PMID: 21419864
ISSN: 1532-0480
CID: 135570
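Several entries in this list examine how impact factor behaves across topics. For reference, here is a sketch of the standard two-year impact factor computation; the journal counts are invented, not data from any of the papers:

```python
def impact_factor(cites_to_year, items_in_year, year):
    """Two-year impact factor for `year`: citations received in `year`
    to items published in the two preceding years, divided by the
    number of citable items published in those years."""
    prior = (year - 1, year - 2)
    cites = sum(cites_to_year.get(py, 0) for py in prior)
    items = sum(items_in_year.get(py, 0) for py in prior)
    return cites / items

# invented journal data: citations received in 2010, keyed by publication year
cites = {2009: 30, 2008: 20}
items = {2009: 25, 2008: 25}
print(impact_factor(cites, items, 2010))  # → 1.0
```

A topic-specific variant would restrict both the citation counts and the item counts to articles on a single topic, which is what makes the metric more stable per topic.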

Trends and developments in bioinformatics in 2010: prospects and perspectives

Aliferis, C F; Alekseyenko, A V; Aphinyanaphongs, Y; Brown, S; Fenyo, D; Fu, L; Shen, S; Statnikov, A; Wang, J
OBJECTIVES: To survey major developments and trends in the field of Bioinformatics in 2010 and their relationships to those of previous years, with emphasis on long-term trends, on best practices, on quality of the science of informatics, and on quality of science as a function of informatics. METHODS: A critical review of articles in the literature of Bioinformatics over the past year. RESULTS: Our main results suggest that Bioinformatics continues to be a major catalyst for progress in Biology and Translational Medicine, as a consequence of new assaying technologies, most predominantly Next Generation Sequencing, which are changing the landscape of modern biological and medical research. These assays critically depend on bioinformatics and have led to quick growth of corresponding informatics methods development. Clinical-grade molecular signatures are proliferating at a rapid rate. However, a highly publicized incident at a prominent university showed that deficiencies in informatics methods can lead to catastrophic consequences for important scientific projects. Developing evidence-driven protocols and best practices is greatly needed, given the serious implications for the quality of translational and basic science. CONCLUSIONS: Several exciting new methods have appeared over the past 18 months that open new roads for progress in bioinformatics methods and their impact in biomedicine. At the same time, the range of open problems of great significance is extensive, ensuring the vitality of the field for many years to come.
PMID: 21938341
ISSN: 0943-4747
CID: 174460

Text categorization models for identifying unproven cancer treatments on the web

Aphinyanaphongs, Yin; Aliferis, Constantin
The nature of the internet as a non-peer-reviewed (and largely unregulated) publication medium has allowed widespread promotion of inaccurate and unproven medical claims at an unprecedented scale. Patients with conditions that are not currently fully treatable are particularly susceptible to unproven and dangerous promises about miracle treatments. In extreme cases, fatal adverse outcomes have been documented. Most commonly, the cost is financial, psychological, and delayed application of imperfect but proven scientific modalities. To help protect patients, who may be desperately ill and thus prone to exploitation, we explored the use of machine learning techniques to identify web pages that make unproven claims. This feasibility study shows that the resulting models can identify web pages that make unproven claims in a fully automatic manner, and substantially better than previous web tools and state-of-the-art search engine technology.
PMID: 17911859
ISSN: 0926-9630
CID: 106405

A comparison of impact factor, clinical query filters, and pattern recognition query filters in terms of sensitivity to topic

Fu, Lawrence D; Wang, Lily; Aphinyanaphongs, Yindalon; Aliferis, Constantin F
Evaluating journal quality and finding high-quality articles in the biomedical literature are challenging information retrieval tasks. The most widely used method for journal evaluation is impact factor, while novel approaches for finding articles are PubMed's clinical query filters and machine learning-based filter models. The related literature has focused on the average behavior of these methods over all topics. The present study evaluates the variability of these approaches for different topics. We find that impact factor and clinical query filters are unstable for different topics, while a topic-specific impact factor and machine learning-based filter models appear more robust. Thus, when using the less stable methods for a specific topic, researchers should realize that their performance may diverge from expected average performance. Better yet, the more stable methods should be preferred whenever applicable.
PMID: 17911810
ISSN: 0926-9630
CID: 86989

A comparison of citation metrics to machine learning filters for the identification of high quality MEDLINE documents

Aphinyanaphongs, Yindalon; Statnikov, Alexander; Aliferis, Constantin F
OBJECTIVE: The present study explores the discriminatory performance of existing and novel gold-standard-specific machine learning (GSS-ML) focused filter models (i.e., models built specifically for a retrieval task and a gold standard against which they are evaluated) and compares their performance to citation count, impact factors, and non-specific machine learning (NS-ML) models (i.e., models built for a different task and/or different gold standard). DESIGN: Three gold standard corpora were constructed using the SSOAB bibliography, the ACPJ-cited treatment articles, and the ACPJ-cited etiology articles. Citation counts and impact factors were obtained for each article. Support vector machine models were used to classify the articles using combinations of content, impact factors, and citation counts as predictors. MEASUREMENTS: Discriminatory performance was estimated using the area under the receiver operating characteristic curve and n-fold cross-validation. RESULTS: For all three gold standards and tasks, GSS-ML filters outperformed citation count, impact factors, and NS-ML filters. Combinations of content with impact factor or citation count produced no or negligible improvements to the GSS machine learning filters. CONCLUSIONS: These experiments provide evidence that when building information retrieval filters focused on a retrieval task and corresponding gold standard, the filter models have to be built specifically for this task and gold standard. Under those conditions, machine learning filters outperform standard citation metrics. Furthermore, citation counts and impact factors add marginal value to discriminatory performance. Previous research that claimed better performance of citation metrics than machine learning in one of the corpora examined here is attributed to using machine learning filters built for a different gold standard and task.
PMCID:1513679
PMID: 16622165
ISSN: 1067-5027
CID: 86993
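Performance in the entry above was estimated with n-fold cross-validation. A minimal sketch of the fold construction follows; this is a deliberately simple, unstratified variant, not the authors' protocol:

```python
def k_fold_splits(n_items, k):
    """Yield (train_indices, test_indices) for k-fold cross-validation:
    each item appears in exactly one test fold."""
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

for train, test in k_fold_splits(10, 5):
    print(len(train), len(test))  # each fold: 8 train, 2 test
```

Averaging the AUC over the k held-out folds gives a performance estimate that uses every labeled article for testing exactly once.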

Using citation data to improve retrieval from MEDLINE

Bernstam, Elmer V; Herskovic, Jorge R; Aphinyanaphongs, Yindalon; Aliferis, Constantin F; Sriram, Madurai G; Hersh, William R
OBJECTIVE: To determine whether algorithms developed for the World Wide Web can be applied to the biomedical literature in order to identify articles that are important as well as relevant. DESIGN AND MEASUREMENTS: A direct comparison of eight algorithms: simple PubMed queries, clinical queries (sensitive and specific versions), vector cosine comparison, citation count, journal impact factor, PageRank, and machine learning based on polynomial support vector machines. The objective was to prioritize important articles, defined as being included in a pre-existing bibliography of important literature in surgical oncology. RESULTS: Citation-based algorithms were more effective than noncitation-based algorithms at identifying important articles. The most effective strategies were simple citation count and PageRank, which on average identified over six important articles in the first 100 results compared to 0.85 for the best noncitation-based algorithm (p < 0.001). The authors saw similar differences between citation-based and noncitation-based algorithms at 10, 20, 50, 200, 500, and 1,000 results (p < 0.001). Citation lag affects performance of PageRank more than simple citation count. However, in spite of citation lag, citation-based algorithms remain more effective than noncitation-based algorithms. CONCLUSION: Algorithms that have proved successful on the World Wide Web can be applied to biomedical information retrieval. Citation-based algorithms can help identify important articles within large sets of relevant results. Further studies are needed to determine whether citation-based algorithms can effectively meet actual user information needs.
PMCID:1380202
PMID: 16221938
ISSN: 1067-5027
CID: 86994
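PageRank, one of the citation-based algorithms compared above, can be sketched in a few lines of power iteration over a toy citation graph; the graph and damping value here are illustrative only, not the authors' implementation:

```python
def pagerank(out_links, d=0.85, iters=50):
    """Power-iteration PageRank over a dict: node -> list of cited nodes."""
    nodes = set(out_links) | {v for vs in out_links.values() for v in vs}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iters):
        new = {node: (1 - d) / n for node in nodes}
        for src in nodes:
            outs = out_links.get(src, [])
            if outs:
                for dst in outs:
                    new[dst] += d * rank[src] / len(outs)
            else:
                # dangling node: redistribute its mass uniformly
                for node in nodes:
                    new[node] += d * rank[src] / n
        rank = new
    return rank

# toy citation graph: articles a and b both cite c
ranks = pagerank({"a": ["c"], "b": ["c"]})
print(max(ranks, key=ranks.get))  # → 'c'
```

Unlike raw citation count, PageRank weights a citation by the rank of the citing article, which is also why it is more sensitive to citation lag: recent articles have accumulated neither citations nor rank to pass on.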

Prospective validation of text categorization filters for identifying high-quality, content-specific articles in MEDLINE

Aphinyanaphongs, Yindalon; Aliferis, Constantin
In prior work, we introduced a machine learning method to identify high-quality MEDLINE documents in internal medicine. The performance of the original filter models built with this corpus on years outside 1998-2000 was not assessed directly. Validating the performance of the original filter models on current corpora is crucial to validate them for use in current years, to verify that the model fitting and model error estimation procedures do not over-fit the models, and to validate consistency of the chosen ACPJ gold standard (i.e., that ACPJ editorial policies and criteria are stable over time). Our prospective validation results indicated that in the categories of treatment, etiology, diagnosis, and prognosis, the original machine learning filter models built from the 1998-2000 corpora maintained their discriminatory performance of 0.97, 0.97, 0.94, and 0.94 area under the curve in each respective category when applied to a 2005 corpus. The ACPJ is a stable, reliable gold standard, and the machine learning methodology provides robust models and model performance estimates. Machine learning filter models built with 1998-2000 corpora can be applied to identify high-quality articles in recent years.
PMCID:1839419
PMID: 17238292
ISSN: 1559-4076
CID: 106403

Text categorization models for high-quality article retrieval in internal medicine

Aphinyanaphongs, Yindalon; Tsamardinos, Ioannis; Statnikov, Alexander; Hardin, Douglas; Aliferis, Constantin F
OBJECTIVE: Finding the best scientific evidence that applies to a patient problem is becoming exceedingly difficult due to the exponential growth of medical publications. The objective of this study was to apply machine learning techniques to automatically identify high-quality, content-specific articles for one time period in internal medicine and compare their performance with previous Boolean-based PubMed clinical query filters of Haynes et al. DESIGN: The selection criteria of the ACP Journal Club for articles in internal medicine were the basis for identifying high-quality articles in the areas of etiology, prognosis, diagnosis, and treatment. Naive Bayes, a specialized AdaBoost algorithm, and linear and polynomial support vector machines were applied to identify these articles. MEASUREMENTS: The machine learning models were compared in each category with each other and with the clinical query filters using area under the receiver operating characteristic curves, 11-point average recall precision, and a sensitivity/specificity match method. RESULTS: In most categories, the data-induced models have better or comparable sensitivity, specificity, and precision than the clinical query filters. The polynomial support vector machine models perform the best among all learning methods in ranking the articles, as evaluated by area under the receiver operating characteristic curve and 11-point average recall precision. CONCLUSION: This research shows that, using machine learning methods, it is possible to automatically build models for retrieving high-quality, content-specific articles, using inclusion or citation by the ACP Journal Club as a gold standard in a given time period in internal medicine, that perform better than the 1994 PubMed clinical query filters.
PMCID:551552
PMID: 15561789
ISSN: 1067-5027
CID: 86997
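One of the metrics used above, 11-point average recall precision, is less familiar than AUC. A small sketch of the interpolated computation over a ranked list of 0/1 relevance labels (illustrative only, not the paper's evaluation code):

```python
def eleven_point_avg_precision(ranked_labels):
    """Average of interpolated precision at recall 0.0, 0.1, ..., 1.0,
    given relevance labels in ranked order (1 = relevant)."""
    total_relevant = sum(ranked_labels)
    hits, recall_precision = 0, []
    for i, rel in enumerate(ranked_labels, start=1):
        hits += rel
        recall_precision.append((hits / total_relevant, hits / i))
    points = []
    for level in (i / 10 for i in range(11)):
        # interpolated precision: best precision at any recall >= this level
        candidates = [p for r, p in recall_precision if r >= level]
        points.append(max(candidates) if candidates else 0.0)
    return sum(points) / 11

print(eleven_point_avg_precision([1, 1, 0, 0]))  # a perfect ranking → 1.0
```

Because it averages precision across the full recall range, this metric rewards rankings that keep relevant articles near the top throughout the list, not just in the first few results.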