Natural language processing challenges in HIV/AIDS clinic notes
Hyun, Sookyung; Bakken, Suzanne; Friedman, Carol; Johnson, Stephen B
In recent years, significant progress has been achieved toward increased structured data entry using standardized health care terminologies. Concurrently, the value of narrative as the clinician's rich description of the encounter and source of vital information has been reaffirmed. Natural language processing (NLP) offers a strategy for integrating these approaches to provide structured reports for further computer processing. As part of a larger project aimed at using narrative data to enrich the online medical record, we analyzed a small sample of documents in a corpus of progress notes to identify potential challenges associated with using NLP for HIV/AIDS clinic notes. We provide illustrative examples of five types of challenges.
PMCID:1480114
PMID: 14728377
ISSN: 1942-597x
CID: 3585932
A native XML database design for clinical document research
Johnson, Stephen B; Campbell, David A; Krauthammer, Michael; Tulipano, P Karina; Mendonca, Eneida A; Friedman, Carol; Hripcsak, George
Health-care institutions are gaining an increasing interest in exploiting the data that are gathered through electronic medical records. Narrative data, generated by transcription or direct entry, represents a far greater challenge for analytic tasks. Moreover, a small number of institutions are beginning to explore deeper structuring of narrative data using natural language processing (NLP). The data produced by NLP systems has a complex, nested structure. Current electronic medical records do not have the ability to store and retrieve data of this complexity in a suitable way.
PMCID:1479907
PMID: 14728388
ISSN: 1942-597x
CID: 3585942
Automatic learning of the morphology of medical language using information compression
Mollah, Shamim Ara; Johnson, Stephen B
Conversion of free-text strings in a natural language to a standard representation (codes) is an important recurring problem in biomedical informatics. Determining the content of a string involves identifying its meaningful constituents (morphemes). One current method of identifying these constituents is to look them up in a preexisting table (lexicon). Manual construction of lexicons and grammars in complex domains such as biomedicine is extremely laborious. As an alternative to the lexico-grammatical approach, we introduce a segmentation algorithm that automatically learns lexical and structural preferences from corpora via information compression. The method is based on the Minimum Description Length (MDL) principle from classic information theory.
PMCID:1480252
PMID: 14728443
ISSN: 1942-597x
CID: 3585952
Automatic resolution of ambiguous terms based on machine learning and conceptual relations in the UMLS
Liu, Hongfang; Johnson, Stephen B; Friedman, Carol
MOTIVATION: The UMLS has been used in natural language processing applications such as information retrieval and information extraction systems. The mapping of free-text to UMLS concepts is important for these applications. To improve the mapping, we need a method to disambiguate terms that possess multiple UMLS concepts. In the general English domain, machine-learning techniques have been applied to sense-tagged corpora, in which senses (or concepts) of ambiguous terms have been annotated (mostly manually). Sense disambiguation classifiers are then derived to determine senses (or concepts) of those ambiguous terms automatically. However, manual annotation of a corpus is an expensive task. We propose an automatic method that constructs sense-tagged corpora for ambiguous terms in the UMLS using MEDLINE abstracts. METHODS: For a term W that represents multiple UMLS concepts, a collection of MEDLINE abstracts that contain W is extracted. For each abstract in the collection, occurrences of concepts that have relations with W as defined in the UMLS are automatically identified. A sense-tagged corpus, in which senses of W are annotated, is then derived based on those identified concepts. The method was evaluated on a set of 35 frequently occurring ambiguous biomedical abbreviations using a gold standard set that was automatically derived. The quality of the derived sense-tagged corpus was measured using precision and recall. RESULTS: The derived sense-tagged corpus had an overall precision of 92.9% and an overall recall of 47.4%. After removing rare senses and ignoring abbreviations with closely related senses, the overall precision was 96.8% and the overall recall was 50.6%. CONCLUSIONS: UMLS conceptual relations and MEDLINE abstracts can be used to automatically acquire knowledge needed for resolving ambiguity when mapping free-text to UMLS concepts.
PMCID:349379
PMID: 12386113
ISSN: 1067-5027
CID: 3585872
Representing nested semantic information in a linear string of text using XML
Krauthammer, Michael; Johnson, Stephen B; Hripcsak, George; Campbell, David A; Friedman, Carol
XML has been widely adopted as an important data interchange language. The structure of XML enables sharing of data elements with variable degrees of nesting as long as the elements are grouped in a strict tree-like fashion. This requirement potentially restricts the usefulness of XML for marking up written text, which often includes features that do not properly nest within other features. We encountered this problem while marking up medical text with structured semantic information from a Natural Language Processor. Traditional approaches to this problem separate the structured information from the actual text mark up. This paper introduces an alternative solution, which tightly integrates the semantic structure with the text. The resulting XML markup preserves the linearity of the medical texts and can therefore be easily expanded with additional types of information.
PMCID:2244450
PMID: 12463856
ISSN: 1531-605x
CID: 3585882
The sublanguage of cross-coverage
Stetson, Peter D; Johnson, Stephen B; Scotch, Matthew; Hripcsak, George
At Columbia-Presbyterian Medical Center, free-text "Signout" notes are typed into the electronic record by clinicians for the purpose of cross-coverage. We plan to "unlock" information about adverse events contained in these notes in a subsequent project using Natural Language Processing (NLP). To better understand the requirements for parsing, Signout notes were compared to other common medical notes (ambulatory clinic notes and discharge summaries) on a series of quantitative metrics. They are shorter (mean length 59.25 words vs. 144.11 and 340.85 for ambulatory and discharge notes, respectively) and use more abbreviations (26.88% vs. 20.07% and 3.57%). Despite being terser, Signout notes use fewer ambiguous abbreviations (8.34% vs. 9.09% and 18.02%). Differences were found using Relative Entropy and Squared Chi-square Distance in a novel fashion to compare these medical corpora. Signout notes appear to constitute a unique sublanguage of medicine. The implications for parsing free-text cross-coverage notes into coded medical data are discussed.
PMCID:2244148
PMID: 12463923
ISSN: 1531-605x
CID: 3585892
The cognitive demands of an innovative query user interface
Wang, Di; Kaufman, David R; Mendonca, Eneida A; Seol, Yoon-Hu; Johnson, Stephen B; Cimino, James J
Too often, online searches for health information are time-consuming and produce results that are not sufficiently precise to answer clinicians' or patients' questions. The PERSIVAL project is designed to circumvent this problem by personalizing and tailoring searches and presentation to the demands of the user and the particular clinical context. This paper focuses on a cognitive evaluation of one component of this project, a Query User Interface (QUI). The study examines the system's ability to allow users to easily and intuitively express their information needs. We performed several analyses including a cognitive walkthrough of the interface and quantitative estimations of cognitive load. The paper also presents a preliminary analysis of usability testing. The analyses suggest that there are features in the QUI that contribute to a greater cognitive load and result in greater effort on the part of the subject. The results of usability testing are consistent with these findings. However, subjects found it to be relatively easy and intuitive to generate well-formed queries using the interface. This study contributed to the iterative design of the interface and to the next generation of the PERSIVAL system.
PMCID:2244191
PMID: 12463945
ISSN: 1531-605x
CID: 3585902
Medical Informatics Training and Research at Columbia University
Shortliffe, E H; Johnson, S B
PMID: 27706367
ISSN: 2364-0502
CID: 3650902
Accessing heterogeneous sources of evidence to answer clinical questions
Mendonça, E A; Cimino, J J; Johnson, S B; Seol, Y H
The large and rapidly growing number of information sources relevant to health care, and the increasing amounts of new evidence produced by researchers, are improving the access of professionals and students to valuable information. However, seeking and filtering useful, valid information can still be very difficult. An online information system that conducts searches based on individual patient data can have a beneficial influence on the particular patient's outcome and educate the healthcare worker. In this paper, we describe the underlying model for a system that aims to facilitate the search for evidence based on clinicians' needs. This paper reviews studies of information needs of clinicians, describes principles of information retrieval, and examines the role that standardized terminologies can play in the integration between a clinical system and literature resources, as well as in the information retrieval process. The paper also describes a model for a digital library system that supports the integration of clinical systems with online information sources, making use of information available in the electronic medical record to enhance searches and information retrieval. The model builds on several different, previously developed techniques to identify information themes that are relevant to specific clinical data. Using a framework of evidence-based practice, the system generates well-structured questions with the intent of enhancing information retrieval. We believe that by helping clinicians to pose well-structured clinical queries and including in them relevant information from individual patients' medical records, we can enhance information retrieval and thus improve patient care.
PMID: 11515415
ISSN: 1532-0464
CID: 3650672
Comparing syntactic complexity in medical and non-medical corpora
Campbell, D A; Johnson, S B
With the growing use of Natural Language Processing (NLP) techniques as solutions in Medical Informatics, the need to quickly and efficiently create the knowledge structures used by these systems has grown concurrently. Automatic discovery of a lexicon for use by an NLP system through machine learning will require information about the syntax of medical language. Understanding the syntactic differences between medical and non-medical corpora may allow more efficient acquisition of a lexicon. Three experiments designed to quantify the syntactic differences in medical and non-medical corpora were conducted. The results show that the syntax of medical language shows less variation than non-medical language and is likely simpler. The differences were great enough to question the applicability of general language tools to medical language. These differences may reduce the difficulty of some free text machine learning problems by capitalizing on the simpler nature of narrative medical syntax.
PMCID:2243419
PMID: 11825160
ISSN: 1531-605x
CID: 3650682