Reusable Filtering Functions for Application in ICU data: a case study
Major, Vincent; Tanna, Monique S; Jones, Simon; Aphinyanaphongs, Yin
Complex medical data sometimes requires significant preprocessing to prepare it for analysis. This complexity can lead non-domain experts to apply simple filters to the available data or to not use the data at all. Preprocessing choices can also have serious effects on the results of a study if incorrect decisions or missteps are made. In this work, we present open-source data filters for an analysis motivated by understanding mortality in the context of sepsis-associated cardiomyopathy in the ICU. We report specific ICU filters and validations through chart review and graphs. These published filters reduce the complexity of using data in analysis by (1) encapsulating the domain expertise and feature engineering applied to the filter, (2) providing debugged, ready-to-use code, and (3) providing sensible validations. We intend these filters to evolve through pull requests and forks and to serve as common starting points for specific analyses.
PMCID:5333239
PMID: 28269881
ISSN: 1942-597x
CID: 2476222
A pilot application of automatic tweet detection of alcohol use at a music festival [Meeting Abstract]
Aphinyanaphongs, Y; Lucyk, S; Nguyen, V; Nelson, L; Krebs, P; Su, M; Smith, S W
Study Objectives: Previously, we built machine-learned models to automatically identify Tweets indicating alcohol use from 34,563 labeled Tweets collected over 24 hours during New Year's Day. The models demonstrated an estimated area under the receiver operating characteristic curve (AUROC) of 0.94 for identifying alcohol use Tweets. In this study, we validated our alcohol use model in an independently collected dataset from the Electric Zoo music festival on New York City's Randall's Island. This event attracted over 130,000 people in 2013 and resulted in two substance-associated deaths. Methods: The initial dataset contained all Tweets and Instagrams geo-tagged within 5 miles of Randall's Island, covering all event days from August 29-31, 2014. Two authors independently reviewed Tweets for drug- or alcohol-related content. 10% of the Tweets were randomly selected for dual independent review to determine agreement using a weighted Cohen's kappa. Identified Tweets were then jointly reviewed to determine those indicative of alcohol use according to previous definitions. Tweets and Instagrams were considered indicators of alcohol use if they referred to: intention to drink, the act of drinking, location at a bar or liquor store, mention of a specific brand, drinking paraphernalia (eg, flask), consequences from drinking (eg, drunk, wasted, tipsy), or alcohol-related hashtags. Our Bayesian logistic regression machine-learned model, which had been derived only from Tweets, was applied to a restricted dataset excluding Instagrams. Results: The complete geo-located collection included 11,071 Tweets and Instagrams. The restricted dataset containing only Tweets consisted of 2,928 elements, of which 82 Tweets were classified as drug- or alcohol-related (weighted kappa = 0.92). Of these, 23 Tweets explicitly referenced alcohol use (eg, "Wine at Zoo is the right play. Instadrunk;" "Wow. I am not sober;" "#clskipfridays #livesummer #Ezoo #were dumb #and drunk"). The model achieved an AUROC of 0.87 when applied to this independent Tweet validation set. Conclusion: Our machine-learned model automatically identified alcohol use at Electric Zoo with high discriminatory power. Differences between the previously estimated AUROC and the validated AUROC are likely due to language variations between the two groups. An in-depth error analysis may identify approaches to improve model performance. The ability to automate social media geosurveillance of substance behavior at events could be coupled with real-time data feeds. Model automation would allow these real-time data feeds to be analyzed for potential public health interventions (including messaging, Tweet geodensity-dependent medical presence, or other measures) to further reduce harm.
EMBASE:72032552
ISSN: 0196-0644
CID: 1840842
Pilot Study on Text Classification Methods to Identify Potential Subjects for Clinical Trials
Chapter by: Ray, Bisakha; Heffron, Sean; Kang, Stella; Aphinyanaphongs, Yindalon
in: Program & abstract book (9th Annual Machine Learning Symposium March 13, 2015)
[New York] : New York Academy of Sciences, 2015
pp. 56-56
CID: 1895872
Integrating text messaging in a safety-net office-based buprenorphine program: A feasibility study [Meeting Abstract]
Tofighi, B; Grossman, E; Bereket, S; Aphinyanaphongs, Y; Lee, J D
Aims: (1) Assess the feasibility of a text message appointment reminder (TMR) intervention; (2) determine the clinical impact of the TMR on appointment adherence. Methods: A 52-item survey was administered to 100 patients in an urban, public sector, office-based buprenorphine program between June 2013 and March 2014. Survey domains included demographic characteristics, communication patterns, and content preferences for supportive, informational, and relapse prevention text message (TM) interventions. A TMR was then sent 7, 4, and 1 day prior to each patient's upcoming appointment, followed by a 16-item survey that assessed satisfaction and feedback for the TM reminders (n = 72). Results: Respondents were predominantly African-American (42%), unemployed or reliant on public assistance (68%), and lacked permanent housing (52%). Mobile phone (MP) ownership was common (93%) with the caveat of a high turnover of phones (2) and phone numbers (2) in the past year. Most reported TM use (93%) and comfort with sending TM (79%). The feasibility survey demonstrated satisfaction with the TMR (100%), and most preferred receiving text reminders (88%) in place of telephone reminders at 6 months. There was no significant difference in appointment adherence between participants who received the TMR and patients who did not receive the reminders. Conclusions: TM-based interventions are an acceptable and feasible strategy for enhancing the delivery of care in a safety-net, office-based buprenorphine program.
EMBASE:72176978
ISSN: 0376-8716
CID: 1946352
Text Classification-based Automatic Recruitment of Patients for Clinical Trials: A Silver Standards-based Case Study
Chapter by: Ray, Bisakha; Aphinyanaphongs, Yindalon; Heffron, Sean
in: 2015 IEEE INTERNATIONAL CONFERENCE ON HEALTHCARE INFORMATICS (ICHI 2015) by Balakrishnan, P; Srivatsava, J; Fu, WT; Harabagiu, S; Wang, F [Eds]
pp. 28-33
ISBN: 978-1-4673-9548-9
CID: 2352122
Designing and Implementing INTREPID, an Intensive Program in Translational Research Methodologies for New Investigators
Plottel, Claudia S; Aphinyanaphongs, Yindalon; Shao, Yongzhao; Micoli, Keith J; Fang, Yixin; Goldberg, Judith D; Galeano, Claudia R; Stangel, Jessica H; Chavis-Keeling, Deborah; Hochman, Judith S; Cronstein, Bruce N; Pillinger, Michael H
Senior housestaff and junior faculty are often expected to perform clinical research, yet may not always have the requisite knowledge and skills to do so successfully. Formal degree programs provide such knowledge, but require a significant commitment of time and money. Short-term training programs (days to weeks) provide alternative ways to accrue essential information and acquire fundamental methodological skills. Unfortunately, published information about short-term programs is sparse. To encourage discussion and exchange of ideas regarding such programs, we here share our experience developing and implementing INtensive Training in Research Statistics, Ethics, and Protocol Informatics and Design (INTREPID), a 24-day immersion training program in clinical research methodologies. Designing, planning, and offering INTREPID was feasible, and required significant faculty commitment, support personnel and infrastructure, as well as committed trainees. Clin Trans Sci 2014; Volume #: 1-7.
PMCID:4267993
PMID: 25066862
ISSN: 1752-8062
CID: 1089772
A Comprehensive Empirical Comparison of Modern Supervised Classification and Feature Selection Methods for Text Categorization
Aphinyanaphongs, Yindalon; Fu, Lawrence D; Li, Zhiguo; Peskin, Eric R; Efstathiadis, Efstratios; Aliferis, Constantin F; Statnikov, Alexander
An important aspect of performing text categorization is selecting appropriate supervised classification and feature selection methods. A comprehensive benchmark is needed to inform best practices in this broad application field. Previous benchmarks have evaluated performance for only a few supervised classification and feature selection methods and limited ways to optimize them. The present work updates prior benchmarks by increasing the number of classifiers and feature selection methods by an order of magnitude, including recently developed, state-of-the-art methods. Specifically, this study used 229 text categorization data sets/tasks, and evaluated 28 classification methods (both well-established and proprietary/commercial) and 19 feature selection methods according to 4 classification performance metrics. We report several key findings that will be helpful in establishing best methodological practices for text categorization.
ISI:000342346500002
ISSN: 2330-1643
CID: 1313832
Text classification for automatic detection of alcohol use-related tweets: A feasibility study
Chapter by: Aphinyanaphongs, Y; Ray, B; Statnikov, A; Krebs, P
in: 2014 IEEE 15th International Conference on Information Reuse and Integration
Piscataway, NJ : IEEE, 2014
pp. 93-97
ISBN: 978-1-4799-5880-1
CID: 1515072
Computer models for identifying instrumental citations in the biomedical literature
Fu, Lawrence D.; Aphinyanaphongs, Yindalon; Aliferis, Constantin F.
The most popular method for evaluating the quality of a scientific publication is citation count. This metric assumes that a citation is a positive indicator of the quality of the cited work. This assumption is not always true, since citations serve many purposes. As a result, citation count is an indirect and imprecise measure of impact. If instrumental citations could be reliably distinguished from non-instrumental ones, existing citation-based metrics could be improved by excluding the non-instrumental citations. A citation was operationally defined as instrumental if either of the following was true: the hypothesis of the citing work was motivated by the cited work, or the citing work could not have been executed without the cited work. This work investigated the feasibility of developing computer models for automatically classifying citations as instrumental or non-instrumental. Instrumental citations were manually labeled, and machine learning models were trained on a combination of content and bibliometric features. The experimental results indicate that models based on content and bibliometric features are able to automatically classify instrumental citations with high predictivity (AUC = 0.86). Additional experiments using independent hold-out data and prospective validation show that the models are generalizable and can handle unseen cases. This work demonstrates that it is feasible to train computer models to automatically identify instrumental citations. C1 [Fu, Lawrence D.; Aphinyanaphongs, Yindalon] NYU Med Ctr, Ctr Hlth Informat & Bioinformat, Dept Med, New York, NY 10016 USA. [Aliferis, Constantin F.] NYU Med Ctr, Ctr Hlth Informat & Bioinformat, Dept Pathol, New York, NY 10016 USA.
ISI:000327219900020
ISSN: 0138-9130
CID: 687922
Identifying unproven cancer treatments on the health web: addressing accuracy, generalizability and scalability
Aphinyanaphongs, Yin; Fu, Lawrence D; Aliferis, Constantin F
Building machine learning models that identify unproven cancer treatments on the Health Web is a promising approach for dealing with the dissemination of false and dangerous information to vulnerable health consumers. Aside from the obvious requirement of accuracy, two issues are of practical importance in deploying these models in real-world applications. (a) Generalizability: The models must generalize to all treatments (not just the ones used in the training of the models). (b) Scalability: The models must be efficiently applicable to billions of documents on the Health Web. First, we provide methods and related empirical data demonstrating strong accuracy and generalizability. Second, by combining the MapReduce distributed architecture and high dimensionality compression via Markov Boundary feature selection, we show how to scale the application of the models to WWW-scale corpora. The present work provides evidence that (a) a very small subset of unproven cancer treatments is sufficient to build a model to identify unproven treatments on the web; (b) unproven treatments use distinct language to market their claims and this language is learnable; (c) through distributed parallelization and state-of-the-art feature selection, it is possible to prepare the corpora and to build and apply models at large scale.
PMCID:4162393
PMID: 23920640
ISSN: 0926-9630
CID: 484192