MARC MeSH in the INNOPAC



Abstract

A significant difficulty confronting medical librarians who maintain Innopac systems has been how to provide authority control for MeSH (National Library of Medicine (NLM) medical subject headings). While the NLM produces MARC subject authority records, no bibliographic utility has chosen to make them available for loading. Consequently medical libraries have great difficulty giving their users effective online assistance with MeSH headings. This paper will show how one Library overcame this problem by using Innopac First Time Use Reports and other tools to successfully load MeSH records into its Innopac. Details concerning workflow, programming, and access are discussed and illustrated.

An earlier version of this paper was presented as "MARC MeSH and INNOPAC: Perfecting Authority Control of Medical Subject Headings" at the Innovative Users Group 7th Annual Meeting (Oakland, California, April 25, 1999).




Subject Authorities, MARC, MeSH and MARC MeSH

Subject authorities are crucial to the usefulness of the catalogs librarians create. Subject authority records establish the cross references and other links that users rely on to navigate the catalog's subject index and provide the apparatus that allows librarians to keep the catalog's subject index internally consistent and current. In brief, subject authority records embody the syndetic structure of the subject thesaurus on a term by term basis in the catalog.

MeSH ("Medical Subject Headings") is a national standard thesaurus for subject analysis of bio-medical literature. Produced and administered by the National Library of Medicine (NLM), MeSH is used to index Medline and other large databases and is the equivalent of the Library of Congress Subject Headings (LCSH) in more general purpose library catalogs. MeSH is about the same size as LCSH, with an entry vocabulary of over 250,000 terms. Unlike LCSH, MeSH is a true thesaurus, with a strict internal structure.

In online catalogs, subject authority control, like name and series authority control, is typically achieved by means of records encoded in the MARC format. Although subject authority records may be available in other formats, MARC is the format of choice in the online catalog world. MARC provides a principled means of moving records from system to system.

NLM updates MeSH records annually and makes MeSH available in the MARC authority format, as well as in plain text and other formats. These records are available from the NLM web site, but MARC MeSH is distributed only as a single large file that must be retrieved by ftp. The 1999 version contains about 484,000 authority records and runs to about 119 MB.

Libraries that maintain OPACs usually turn to bibliographic utilities (RLIN and OCLC) for MARC bibliographic and authority records for their catalogs. But while LCSH is universally available from the bibliographic utilities, MeSH is not. Vendors that provide batch retrospective conversion, reauthorization, and similar services also provide MeSH authority records, but their services typically require processing of large bodies of records and are typically only used for major reauthorization or retrospective conversion projects.

This leaves medical libraries in a difficult position; they need to keep their catalogs' subject authority records complete and up to date, but they have no way short of large scale reauthorization to accomplish that. The medical library community has for years lobbied OCLC to make MeSH authority records available on the same basis as LCSH. The Health Sciences OCLC Users' Group (HSOCLCUG) formally asked OCLC to load and offer MeSH authority records for downloading in 1990 and again in 1997. Despite the fact that OCLC uses the MARC MeSH file for its own internal purposes, it has not responded positively to these requests.

Older, mainframe based medical library OPACs such as NOTIS were able to get around this difficulty by loading the entire MeSH file via tape to non-public parts of the system; cataloging staff would then select MeSH authority records from this working file and move them into the public database at will. Users of Innopac and similar smaller systems have no such option. They must load subject authority records in the usual way (i.e., from a bibliographic utility) or they must laboriously hand key them or they must do without.

Subject Authorities in the NYU Medical Library OPAC

When the Ehrman Medical Library unveiled MEDCat, its Innopac-based online catalog, in 1990, subject authority records were hand keyed. Because of the workload involved, records were keyed selectively; not every MeSH heading in a bibliographic record was given a corresponding authority record. In addition, MeSH authority records selected for keying were ruthlessly edited down to their bare essentials; some cross-references and other data were purposely omitted to make record creation go faster. Even with these constraints, hundreds of non-MARC authority records were added to the catalog, where they received only the most pressing maintenance.

It was recognized from the beginning that this approach to the MeSH authority problem was at best inadequate. It was hoped that eventually the entire database would be systematically reauthorized. Meanwhile, the effectiveness of the hand keyed authorities continued to decline as the number of bibliographic records and new MeSH headings increased.

In 1997 the problem suddenly became worse. As the Library was preparing to unveil the new web version of MEDCat, Technical Services discovered that non-MARC authority records in the catalog generated completely inaccurate displays under the new interface. Non-MARC authority records appeared doubled in the subject index. For example, the MeSH heading Oncogenes appeared in the web catalog tautologically as Oncogenes Oncogenes. When we discussed the problem with Innovative, it became clear that non-MARC authority records could not be supported in the new environment. (Apparently Innovative Interfaces has remedied this problem since then.)

Technical Services presented the entire problem to the Library Management Team for evaluation. At that point the Library Systems Department and Technical Services began to examine what options were available short of complete reauthorization. Considering authority record creation and maintenance a normal part of the cataloging process, we decided that what we really wanted was a way to use Innopac's First Time Use Reports to selectively load MeSH authority records into the system.

We quickly discovered that the MeSH authority records were freely available in MARC from NLM. Our problem was how to select the desired records from this file and then how to load the chosen records into our Innopac system.

The first step was to download the MeSH file. We chose to load the file on an AIX (UNIX) server that had ample disk space and that was being phased out of its previous functions. Even given the large hard drives on contemporary desktop computers, the MARC MeSH file is big, particularly if one needs to manipulate it conveniently. The AIX server was ideal for these purposes, being large and robust.


Once the file was ftp'd, we attempted to parse it using perl scripts (perl is a computer programming language widely used for CGI scripting, systems administration, and general utility work). Because Medical Library systems staff were already comfortable with perl, its selection was quite natural. After some experimenting, we found that it was possible to parse the MARC file and extract individual records on a batch basis. The extraction script was not terribly sophisticated or elegant, but it did work reliably.
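As an illustration of what batch extraction from a MARC file involves, here is a minimal sketch. It is written in Python rather than perl and is not the authors' script; the function names are hypothetical. It relies only on the standard MARC (ISO 2709) record structure: a 24-byte leader, a directory of 12-byte entries, and the field, subfield, and record terminator bytes. Decoding the text as Latin-1 is an assumption made to keep the sketch short.

```python
# Sketch only: Python, not the authors' perl. All names are hypothetical.
RECORD_TERMINATOR = b"\x1d"
FIELD_TERMINATOR = b"\x1e"
SUBFIELD_DELIMITER = b"\x1f"


def split_records(data):
    """Yield each raw MARC record (with its terminator) from a byte string."""
    for chunk in data.split(RECORD_TERMINATOR):
        if chunk.strip():
            yield chunk + RECORD_TERMINATOR


def get_fields(record, tag):
    """Return the raw contents of every field carrying the given tag."""
    base = int(record[12:17])        # leader bytes 12-16: base address of data
    directory = record[24:base - 1]  # directory ends with a field terminator
    fields = []
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12]  # tag (3), field length (4), start (5)
        if entry[0:3].decode("ascii") == tag:
            length = int(entry[3:7])
            start = int(entry[7:12])
            # Drop the trailing field terminator counted in the length.
            fields.append(record[base + start:base + start + length - 1])
    return fields


def heading(record):
    """Return the 150 heading, written Innopac-style (e.g. 'Term|xsubdivision')."""
    fields = get_fields(record, "150")
    if not fields:
        return ""
    # Skip the two indicator bytes, then split on the subfield delimiter.
    parts = fields[0][2:].split(SUBFIELD_DELIMITER)[1:]
    out = []
    for part in parts:
        code = part[:1].decode("ascii")
        value = part[1:].decode("latin-1")  # assumption: simple stand-in decoding
        out.append(value if code == "a" and not out else "|" + code + value)
    return "".join(out)
```

A driver would read the distribution file in binary mode, pass its contents to split_records, and keep any record whose heading appears in the list of wanted headings.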

Next we began to think about how to communicate with our Innopac system. It was natural to use the standard Innopac Load MARC Records function for adding new records. Similarly we knew that the Innopac "First Time Use Report" was our best bet for easily identifying what new MeSH records were needed. Report Heading Changes is an Innopac function that allows a library to generate regular reports on the state of indexed fields in the OPAC. The function can track invalid headings, new headings, duplicate authority records, blind references, and other conditions. New MeSH headings included in recent cataloging were already part of the report, hence the report was commonly referred to as the "First Time Use Report." The problem was that the headings were buried in an elaborate and often quite long report. Could the headings be extracted in a form that could be used to match against the MARC file? This problem, like the MARC extraction problem, was tackled in perl.

Finally we arrived at a set of scripts that did the following:

We tried various approaches and eventually settled on a multipart process that was both reasonably effective and adaptable. The system was implemented in three perl scripts:

Design Decisions

Programming and testing took about 3 weeks. Although we included an option for "fuzzy" matching of headings (allowing for simultaneous right and left truncation), the option was not used except experimentally. Operationally, extraction was triggered by a strict character by character match.
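The difference between the two matching modes can be sketched as follows. This is an illustrative Python sketch, not the authors' perl; the function names are hypothetical, and treating "simultaneous right and left truncation" as a plain substring test is our reading of the description rather than a documented detail.

```python
# Sketch only: Python, not the authors' perl. All names are hypothetical.

def strict_match(wanted, candidate):
    """Character-by-character equality, the mode used operationally."""
    return wanted == candidate


def fuzzy_match(wanted, candidate):
    """Left and right truncation: the wanted text may occur anywhere."""
    return wanted in candidate


def search(wanted_headings, authority_headings, fuzzy=False):
    """Return (hits, misses): hits maps each wanted heading to its matches."""
    match = fuzzy_match if fuzzy else strict_match
    hits, misses = {}, []
    for wanted in wanted_headings:
        found = [h for h in authority_headings if match(wanted, h)]
        if found:
            hits[wanted] = found
        else:
            misses.append(wanted)
    return hits, misses
```

Under strict matching a partial heading such as "Carpal Tunnel" is reported as a miss; under fuzzy matching it would pull in every authority heading that contains it, which suggests why the fuzzy option is harder to use safely in production.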

Because the file produced by mmftu.pl is simply an ASCII list of MeSH headings, it is easy to modify the system to extract records for any list of MeSH headings, not just those from an Innopac FTU report. This makes it possible to use the system to accomplish a complete database reauthorization if desired.

The actual look up and extract process (controlled by mmextract.pl) turned out to be quite slow. Processing time varies depending upon the number of headings searched, the number of hits, and the configuration of the extract script. Overall run time averaged about 0.41 minutes per heading searched. At that rate a typical run (circa 170 headings) took about an hour and ten minutes to complete. An important consequence is that it is only feasible to operate the system on a batch basis. It was not practical to use the scripts to do single record, "on-the-fly" record extraction. Because a batch process was implicit in the reliance on the Innopac FTU report, this was not discouraging. In any case, by late 1998 we were able to reduce average run time to about 0.17 minutes per heading searched by porting the scripts from the original AIX to a new and much faster SUN system; at that rate, a typical run takes less than half an hour.
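The run times quoted above follow directly from the per-heading averages. A short Python check of the arithmetic, using only the figures given in the text (the function name is hypothetical):

```python
# Sketch only: checks the run-time arithmetic quoted in the text.

def run_minutes(headings, minutes_per_heading):
    """Estimated batch run time in minutes."""
    return headings * minutes_per_heading


aix_minutes = run_minutes(170, 0.41)  # about 69.7 minutes, roughly 1 h 10 min
sun_minutes = run_minutes(170, 0.17)  # about 28.9 minutes, under half an hour
```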

Base Headings vs. Coordinated Headings

One important design question emerged quickly. When a simple, uncoordinated heading matches, there is no question but that the matching MARC record and only that MARC record should be extracted. But when a coordinated heading matches, what records should be extracted? The coordinated record? The "base" record? Both? Another way of putting the question would be to ask, "What actually constitutes a hit?"

To illustrate this question, consider the headings Carpal Tunnel Syndrome and Carpal Tunnel Syndrome -- complications. Both are valid MeSH headings and both have MARC MeSH records. A search for the base heading will result in the extraction and loading of the MARC record:


	A1176110             Last updated: 01-21-98 Created: 01-21-98 Revision: 1
	ACODE1:            ACODE2:            ASUPPRESS: 
	001     D002349 
	008     630701 n ancnnbabn          || ana     bnz n 
	150     Carpal Tunnel Syndrome
	450     Carpal Tunnel Syndromes 
	450     Syndrome, Carpal Tunnel 
	450     Syndromes, Carpal Tunnel 
	550     Median Nerve 
	667     median nerve compression
	680     |iA complex of symptoms resulting from compression of the median nerve
	in the carpal tunnel, with pain and burning or tingling paresthesias in the
	fingers and hand, sometimes extending to the elbow. (Dorland, 27th ed)

A search for the coordinated heading will result in the extraction and loading of the MARC record:


	A119124x             Last updated: 05-29-98 Created: 05-29-98 Revision: 1
	ACODE1:         02 ACODE2:         03 ASUPPRESS: 
	001     D002349Q000150 
	008     630701 n ancnnbabn           n ana     bnz n 
	150     Carpal Tunnel Syndrome|xcomplications

The lack of cross references and catalogers' apparatus is typical of MARC MeSH records for coordinated headings. These records do not in fact contribute to the syndetic structure of the OPAC subject index. On the other hand, they do help those charged with keeping bibliographic records consistent.

When both a base and a coordinated heading appear in bibliographic records in the Innopac database there is little to be concerned with. One can opt to extract and load both records or, perhaps, only the base record. But what if the First Time Use Report reports only the coordinated heading? If one extracts and loads only the strict match, then one has not in fact helped the user by supplying the cross references found in the base heading. Such references could be quite useful, even though the library has, strictly speaking, cataloged no works indexed under that heading.

Our decision was to load both base and coordinated forms, even when only the coordinated form appeared in our subject index. This was accomplished by modifying mmftu.pl so that it would uniformly output two headings for each coordinated heading found in the FTU report.
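The "two headings for each coordinated heading" rule can be sketched as below. This is illustrative Python, not the actual mmftu.pl logic; the function names are hypothetical, and the sketch assumes headings are written with Innopac-style subfield markers (as in "Carpal Tunnel Syndrome|xcomplications"), which may not be how the FTU report presents them.

```python
# Sketch only: Python, not the authors' mmftu.pl. All names are hypothetical.

def expand(heading):
    """Return [base, coordinated] for a coordinated heading, else [heading]."""
    heading = heading.strip()
    base = heading.split("|", 1)[0].strip()  # text before the first subfield marker
    if base == heading:
        return [heading]
    return [base, heading]


def expand_all(headings):
    """Expand every heading, dropping duplicates while preserving order."""
    seen = set()
    result = []
    for heading in headings:
        for item in expand(heading):
            if item not in seen:
                seen.add(item)
                result.append(item)
    return result
```

With this rule, a report that lists only the coordinated heading still causes the base record, with its cross references, to be searched and loaded.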

The System In Practice

In April 1998 we began to search headings in the NLM distribution file and load the results in MEDCat. Since then we have searched and loaded thousands of MARC MeSH authority records, quickly eliminating all the non-MARC records in the system. Between April 1998 and March 1999 we loaded approximately 3,300 records.

Hit rate has varied depending upon the configuration of the scripts and the headings searched. The overall hit rate has been around 50%, though that average is lowered by inclusion of earlier test runs which had conspicuously low hit rates. A figure of about 60% is probably more indicative of actual system performance.


Reasons for non-matches include:

In any case, while the hits and the subject authority records they bring to the catalog are important and are, indeed, the motivation behind the project, the non-hits reports provide invaluable assistance to catalogers who keep the MEDCat subject index consistent and up-to-date. To assist in this process mmextract.pl sends the cataloging staff an email reporting on all non-hits whenever it is run. Cataloging staff use these reports to correct errors and otherwise keep the database consistent.

Complete Reauthorization

As noted above, the system is capable of being used to perform a complete subject authority reload. In late March of 1999, the Library decided to take advantage of this possibility. Beginning a week before the proposed reauthorization date of Easter weekend, we downloaded every MeSH subject heading in the bibliographic database (about 217,000 of them). This downloaded file was normalized, eliminating all headings with form and geographic subdivisions while retaining the base forms (subfield a) of such headings. The list was then deduplicated, broken into alphabetical segments, and run against the 1999 MeSH authority file. A number of experimental runs were made to uncover problem headings, which were fixed in both the list and the MEDCat bibliographic records. The final list consisted of 16,856 headings. When the final list was run, it resulted in 16,365 MeSH authority records (a 97.1% hit rate) broken into 9 separate files. On Saturday, April 3, 1999, systems staff deleted all existing subject authority records in the system (just over 5,000) and loaded the 16,365 new records.
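The preparation of the heading list can be sketched roughly as follows. This is illustrative Python, not the procedure actually used; the function names are hypothetical. It assumes Innopac-style subfield markers with the standard MARC subject subdivision codes (|v for form, |z for geographic), reads "normalization" as replacing any heading carrying such a subdivision with its subfield a, and segments by first letter, which is only one way the alphabetical split could have been done.

```python
# Sketch only: Python, not the authors' procedure. All names are hypothetical.
import re
from collections import defaultdict

# Assumption: |v marks a form subdivision and |z a geographic subdivision.
FORM_OR_GEOGRAPHIC = re.compile(r"\|[vz]")


def normalize(heading):
    """Reduce headings with form/geographic subdivisions to their base form."""
    heading = heading.strip()
    if FORM_OR_GEOGRAPHIC.search(heading):
        return heading.split("|", 1)[0].strip()
    return heading


def build_segments(headings):
    """Normalize, deduplicate, sort, and group headings by first letter."""
    unique = sorted({normalize(h) for h in headings if h.strip()})
    segments = defaultdict(list)
    for heading in unique:
        segments[heading[0].upper()].append(heading)
    return dict(segments)


def hit_rate(records_found, headings_searched):
    """Hit rate as a percentage."""
    return 100.0 * records_found / headings_searched


# Figures from the text: 16,365 records found for 16,856 headings searched.
final_rate = round(hit_rate(16365, 16856), 1)  # 97.1
```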

Obviously, the difference between the number of existing records and the number of records loaded indicates that a fairly large number of the headings in the database had been without subject authority support. For the most part these represent headings that were passed over in the days of hand-keyed authorities. Even among subject authority records that were simply replaced in the complete reload, there are probably cases in which the record loaded has significant changes in oblique (see and see also) headings. For these reasons we believe that the full reload was a valuable contribution to the usability of our Innopac system.

Drawbacks?

Any computer system, homemade or commercial, has drawbacks. This system is no different. Some of the drawbacks we have encountered are:

Advantages?

Conclusion

The system described can be said to have proven itself at NYU Medical Library. It has been successful both as an incremental system to keep up with new headings as they enter the OPAC and as a tool to accomplish complete database reauthorization. NYU intends to continue to use the system for the foreseeable future.

The fact that it works for us indicates that it may also be useful to other Medical Libraries. Whether this proves true in practice depends upon the capabilities and culture of those libraries. We will be happy to share the technology we created for this purpose with any medical libraries that want to pursue implementation of their own MARC MeSH load system.

Another possibility would be for medical libraries to collectively build an Internet-based "MARC MeSH server" that would provide real-time access to MARC MeSH records for sustaining subscribers. Such an undertaking would entail considerable development and commitments of financial support. Whether such an idea is feasible would be something for medical libraries to determine amongst themselves.

Finally, it's our conclusion that despite the local success of our system and the possibility of a collective solution somewhere in the future, medical libraries still need OCLC to provide MeSH records on the same basis as LCSH. NYU-style systems are likely to work only in larger academic medical libraries. The collective solution is only a dream. In the meantime, medical libraries of all sizes need convenient access to MeSH authority records for their OPACs. We should continue to lobby OCLC to this end.


May 1999, revised December 1999


========================================================
Stuart Spore              | voice: 212-263-1092       
Associate Director for    | fax:   212-263-6534        
Systems                   |                             
Ehrman Medical Library    | 
NYU School of Medicine    |
550 1st Ave.              | spore@library.med.nyu.edu
New York, N.Y. 10016      | http://home.nyu.edu/~spores01
========================================================
           * * * lâche pas la patate * * *  
========================================================