An earlier version of this paper was presented as "MARC MeSH and INNOPAC: Perfecting Authority Control of Medical Subject Headings" at the Innovative Users Group 7th Annual Meeting (Oakland, California, April 25, 1999).
MeSH ("Medical Subject Headings") is a national standard thesaurus for subject
analysis of bio-medical literature. Produced and administered by the National
Library of Medicine (NLM), MeSH is used to index Medline and other large
databases and is the equivalent of the Library of Congress Subject Headings
(LCSH) in more general purpose library catalogs. MeSH is about the same size
as LCSH, with an entry vocabulary of over 250,000 terms. Unlike LCSH, MeSH is
a true thesaurus, with a strict internal structure.
In online catalogs, subject authority control, like name and series authority control, is typically achieved by means of records encoded in the MARC format. Although subject authority records may be available in other formats, MARC is the format of choice in the online catalog world. MARC provides a principled means of moving records from system to system.
NLM updates MeSH records annually and makes MeSH available in the MARC authority format, as well as in plain text and other formats. These records are available from the NLM web site, but MARC MeSH is available only as a single, large file for FTP download. The 1999 version contains about 484,000 authority records and runs to about 119 MB.
Libraries that maintain OPACs usually turn to bibliographic utilities (RLIN and OCLC) for MARC bibliographic and authority records for their catalogs. But while LCSH is universally available from the bibliographic utilities, MeSH is not. Vendors that provide batch retrospective conversion, reauthorization, and similar services also provide MeSH authority records, but their services typically require processing of large bodies of records and are generally used only for major reauthorization or retrospective conversion projects.
This leaves medical libraries in a difficult position; they need to keep their catalogs' subject authority records complete and up to date, but they have no way short of large scale reauthorization to accomplish that. The medical library community has for years lobbied OCLC to make MeSH authority records available on the same basis as LCSH. The Health Sciences OCLC Users' Group (HSOCLCUG) formally asked OCLC to load and offer MeSH authority records for downloading in 1990 and again in 1997. Despite the fact that OCLC uses the MARC MeSH file for its own internal purposes, it has not responded positively to these requests.
Older, mainframe-based medical library OPACs such as NOTIS were able to get around this difficulty by loading the entire MeSH file via tape to non-public parts of the system; cataloging staff would then select MeSH authority records from this working file and move them into the public database at will. Users of Innopac and similar smaller systems have no such option. They must load subject authority records in the usual way (i.e., from a bibliographic utility), laboriously hand key them, or do without.
It was recognized from the beginning that this approach to the MeSH authority problem was at best inadequate. It was hoped that eventually the entire database would be systematically reauthorized. Meanwhile, the effectiveness of the hand keyed authorities continued to decline as the number of bibliographic records and new MeSH headings increased.
In 1997 the problem suddenly became worse. As the Library was preparing to unveil the new web version of MEDCat, Technical Services discovered that non-MARC authority records in the catalog generated completely inaccurate displays under the new interface: their headings appeared doubled in the subject index. For example, the MeSH heading Oncogenes appeared in the web catalog tautologically as Oncogenes Oncogenes. In discussing the problem with Innovative, we learned that non-MARC authority records could not be supported in the new environment. (Apparently Innovative Interfaces has since remedied this problem.)
Technical Services presented the entire problem to the Library Management Team
for evaluation. At that point the Library Systems Department and Technical
Services began to examine what options were available short of complete
reauthorization. Considering authority record creation and maintenance a
normal part of the cataloging process, we decided that what we really wanted
was a way to use Innopac's First Time Use Reports to selectively load MeSH
authority records into the system.
We quickly discovered that the MeSH authority records were freely available in MARC from NLM. Our problem was how to select the desired records from this file and then how to load the chosen records into our Innopac system.
The first step was to download the MeSH file. We chose to load the file on an AIX (UNIX) server that had ample disk space and that was being phased out of its previous functions. Even given the large hard drives on contemporary desktop computers the MARC MeSH file is big, particularly if one needs to manipulate it conveniently. The AIX server was ideal for these purposes, being large and robust.
Once the file was ftp'd, we attempted to parse it using perl scripts (perl is
a computer programming language widely used for CGI scripting, systems
administration, and general utility work). Because Medical Library systems
staff were already comfortable with perl, its selection was quite natural.
After some experimenting, we found that it was possible to parse the MARC file
and extract individual records on a batch basis. The extraction script was not
terribly sophisticated or elegant, but it did work reliably.
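The kind of parsing described above can be sketched as follows. This is a minimal illustration in Python, not the perl extraction script used in the project. It assumes the distribution file follows the standard MARC exchange layout (24-byte leader, 12-byte directory entries, and the usual field, subfield, and record terminator bytes). The function names, the " -- " subdivision separator, and the Latin-1 decoding are assumptions made for the example.

```python
RECORD_TERMINATOR = b"\x1d"
FIELD_TERMINATOR = b"\x1e"
SUBFIELD_DELIMITER = b"\x1f"


def split_records(data):
    """Yield each raw MARC record in data, including its terminator byte."""
    for chunk in data.split(RECORD_TERMINATOR):
        if chunk.strip():
            yield chunk + RECORD_TERMINATOR


def parse_fields(record):
    """Return a list of (tag, raw field bytes) pairs for one record."""
    base = int(record[12:17])        # leader 12-16: base address of data
    directory = record[24:base - 1]  # the byte at base-1 is a field terminator
    fields = []
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12]
        tag = entry[0:3].decode("ascii")
        length = int(entry[3:7])     # field length, including its terminator
        start = int(entry[7:12])     # offset relative to the base address
        raw = record[base + start:base + start + length]
        fields.append((tag, raw.rstrip(FIELD_TERMINATOR)))
    return fields


def heading(record):
    """Return the 150 (topical term) heading as text, or None if absent."""
    for tag, raw in parse_fields(record):
        if tag == "150":
            # Skip the two indicator bytes, then split into subfields;
            # each subfield starts with a one-byte code (a, x, ...).
            subfields = raw[2:].split(SUBFIELD_DELIMITER)[1:]
            # Latin-1 decoding is an assumption; it never raises an error.
            parts = [sf[1:].decode("latin-1") for sf in subfields]
            return " -- ".join(parts)
    return None
```

Reading the whole file into memory and calling heading() on each record from split_records() yields (heading, record) pairs that can then be matched against a list of wanted headings.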
Next we began to think about how to communicate with our Innopac system. It was natural to use the standard Innopac Load MARC Records function for adding new records. Similarly we knew that the Innopac "First Time Use Report" was our best bet for easily identifying what new MeSH records were needed. Report Heading Changes is an Innopac function that allows a library to generate regular reports on the state of indexed fields in the OPAC. The function can track invalid headings, new headings, duplicate authority records, blind references, and other conditions. New MeSH headings included in recent cataloging were already part of the report, hence the report was commonly referred to as the "First Time Use Report." The problem was that the headings were buried in an elaborate and often quite long report. Could the headings be extracted in a form that could be used to match against the MARC file? This problem, like the MARC extraction problem, was tackled in perl.
Finally we arrived at a set of scripts that did the following:
Because the file produced by mmftu.pl is simply an ASCII list of MeSH headings, it is easy to modify the system to extract records for any list of MeSH headings, not just those from an Innopac FTU report. This makes it possible to use the system to accomplish a complete database reauthorization if desired.
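Selecting records for an arbitrary list of headings can be sketched as follows. This is a hypothetical Python illustration, not mmextract.pl itself. It assumes the (heading, record) pairs have already been pulled from the MARC file and that the wanted headings arrive one per line. The normalization rule (case folding and whitespace collapsing) and the in-memory index are assumptions of the sketch rather than a description of how the original scripts matched headings.

```python
def normalize(heading):
    """Fold case and collapse whitespace so minor keying differences match."""
    return " ".join(heading.split()).casefold()


def build_index(pairs):
    """Map each normalized heading to its records in one pass over the file."""
    index = {}
    for heading, record in pairs:
        index.setdefault(normalize(heading), []).append(record)
    return index


def select_records(index, wanted_lines):
    """Return (hits, misses) for a list of headings, one heading per line."""
    hits, misses = [], []
    for line in wanted_lines:
        wanted = line.strip()
        if not wanted:
            continue  # ignore blank lines in the heading list
        records = index.get(normalize(wanted))
        if records:
            hits.extend(records)
        else:
            misses.append(wanted)
    return hits, misses
```

The list of misses is worth keeping, since it shows which headings still lack a matching authority record after a run.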
The actual look up and extract process (controlled by mmextract.pl) turned out to be quite slow. Processing time varies depending upon the number of headings searched, the number of hits, and the configuration of the extract script. Overall run time averaged about 0.41 minutes per heading searched. At that rate a typical run (circa 170 headings) took about an hour and ten minutes to complete. An important consequence is that it is feasible to operate the system only on a batch basis; it was not practical to use the scripts for single-record, "on-the-fly" extraction. Because a batch process was implicit in the reliance on the Innopac FTU report, this was not discouraging. In any case, by late 1998 we were able to reduce average run time to about 0.17 minutes per heading searched by porting the scripts from the original AIX server to a new and much faster Sun system; at that rate, a typical run takes less than half an hour.
To illustrate the question of base versus coordinated headings, consider the headings Carpal Tunnel Syndrome and Carpal Tunnel Syndrome -- complications. Both are valid MeSH headings and both have MARC MeSH records. A search for the base heading will result in the extraction and loading of the MARC record:
A1176110 Last updated: 01-21-98 Created: 01-21-98 Revision: 1
ACODE1: ACODE2: ASUPPRESS:
001 D002349
008 630701 n ancnnbabn || ana bnz n
150 Carpal Tunnel Syndrome
450 Carpal Tunnel Syndromes
450 Syndrome, Carpal Tunnel
450 Syndromes, Carpal Tunnel
550 Median Nerve
667 median nerve compression
680 |iA complex of symptoms resulting from compression of the median nerve in the carpal tunnel, with pain and burning or tingling paresthesias in the fingers and hand, sometimes extending to the elbow. (Dorland, 27th ed)
A search for the coordinated heading will result in the extraction and loading of the MARC record:
A119124x Last updated: 05-29-98 Created: 05-29-98 Revision: 1
ACODE1: 02 ACODE2: 03 ASUPPRESS:
001 D002349Q000150
008 630701 n ancnnbabn n ana bnz n
150 Carpal Tunnel Syndrome|xcomplications
The lack of cross references and catalogers' apparatus is typical of MARC MeSH records for coordinated headings. These records do not in fact contribute to the syndetic structure of the OPAC subject index. On the other hand, they do help those charged with keeping bibliographic records consistent.
When both a base and a coordinated heading appear in bibliographic records in the Innopac database there is little to be concerned with. One can opt to extract and load both records or, perhaps, only the base record. But what if the First Time Use Report reports only the coordinated heading? If one extracts and loads only the strict match, then one has not in fact helped the user by supplying the cross references found in the base heading. Such references could be quite useful, even though the library has cataloged no works that are indexed under that heading, strictly speaking.
Our decision was to load both base and coordinated forms, even when only the coordinated form appeared in our subject index. This was accomplished by modifying mmftu.pl so that it would uniformly output two headings for each coordinated heading found in the FTU report.
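The expansion rule described above can be sketched as follows. This is a hypothetical Python illustration, not the modified mmftu.pl. It assumes headings are written with " -- " between the base heading and its subdivision, as in the Carpal Tunnel Syndrome example. The function name and the removal of duplicates are additions made for the example.

```python
SEPARATOR = " -- "  # assumed delimiter between base heading and subdivision


def expand_headings(headings):
    """Add the base form before each coordinated heading, dropping duplicates."""
    seen = set()
    expanded = []
    for heading in headings:
        heading = heading.strip()
        if not heading:
            continue
        # Everything before the first separator is the base heading; for a
        # heading with no subdivision, the base is the heading itself.
        base = heading.split(SEPARATOR, 1)[0].strip()
        for candidate in (base, heading):
            if candidate not in seen:
                seen.add(candidate)
                expanded.append(candidate)
    return expanded
```

Given only Carpal Tunnel Syndrome -- complications, the function returns both the base heading and the coordinated heading, so the base record with its cross references is requested as well.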
In April 1998 we began to search headings in the NLM distribution file and
load the results in MEDCat. Since then we have searched and loaded thousands
of MARC MeSH authority records, quickly eliminating all the non-MARC records
in the system. Between April 1998 and March 1999 we loaded approximately
3,300 records.
Hit rate has varied depending upon the configuration of the scripts and the headings searched. The overall hit rate has been around 50%, though that average is lowered by inclusion of earlier test runs which had conspicuously low hit rates. A figure of about 60% is probably more indicative of actual system performance.
Reasons for non-matches include:
Obviously, the difference between the number of existing records and the number of records loaded indicates that a fairly large number of the headings in the database lacked subject authority support. For the most part these represent headings that were passed over in the days of hand-keyed authorities. Even in the case of subject authority records that were simply replaced in the complete reload, there are probably cases where the record loaded has significant changes in oblique (see and see also) headings. For these reasons we believe that the full reload was a valuable contribution to the usability of our Innopac system.
The fact that it works for us indicates that it may also be useful to other Medical Libraries. Whether this proves true in practice depends upon the capabilities and culture of those libraries. We will be happy to share the technology we created for this purpose with any medical libraries that want to pursue implementation of their own MARC MeSH load system.
Another possibility would be for medical libraries to collectively build an Internet-based "MARC MeSH server" that would provide real time access to MARC MeSH records over the Internet for sustaining subscribers. Such an undertaking would entail considerable development and commitments of financial support. Whether such an idea is feasible would be something for medical libraries to determine amongst themselves.
Finally, it is our conclusion that despite the local success of our system and the possibility of a collective solution somewhere in the future, medical libraries still need OCLC to provide MeSH records on the same basis as LCSH. NYU-style systems are likely to work only in larger academic medical libraries. The collective solution is only a dream. In the meantime medical libraries of all sizes need convenient access to MeSH authority records for their OPACs. We should continue to lobby OCLC to this end.
May 1999, revised December 1999
========================================================
Stuart Spore | voice: 212-263-1092
Associate Director for | fax: 212-263-6534
Systems |
Ehrman Medical Library |
NYU School of Medicine |
550 1st Ave. | spore@library.med.nyu.edu
New York, N.Y. 10016 | http://home.nyu.edu/~spores01
========================================================
* * * lâche pas la patate * * *
========================================================