Mapping and Converting Essential Federal Geographic Data Committee (FGDC) Metadata into MARC21 and Dublin Core: Towards an Alternative to the FGDC Clearinghouse

EE-IR Center Technical Report
Created: 1999-06-14
Revised: 1999-07-02, 1999-08-02, 1999-10-13
Current version: 1999-12-14
Adam Chandler and Dan Foley
Energy and Environmental Information Resources Center
University of Louisiana Lafayette
Lafayette, Louisiana
Alaaeldin M. Hafez
Center for Advanced Computer Studies
University of Louisiana Lafayette
Lafayette, Louisiana

Abstract

The purpose of this article is to raise and address a number of issues related to the conversion of Federal Geographic Data Committee metadata into MARC21 and Dublin Core. We present an analysis of 466 FGDC metadata records housed in the National Biological Information Infrastructure (NBII) node of the FGDC Clearinghouse, with special emphasis on the length of fields and the total length of records in this set. One of our contributions is a 34-element crosswalk, a proposal that takes into consideration the constraints of the MARC21 standard and the realities of user behavior. A second contribution is our discovery that the key to this conversion problem is integrating persistent uniform resource locators (PURLs) into the FGDC Clearinghouse model. Finally, we present an alternative model for managing FGDC metadata.


1. Introduction
2. Analysis of FGDC Metadata
3. FGDC to MARC21/DC Crosswalk
4. The Need for Persistent Uniform Resource Locators
5. Our Alternative to the FGDC Clearinghouse Model
6. Conclusion
7. Appendix: NBII Data
8. Notes
9. References
10. Contact Information

1. Introduction

This paper describes a continuing digital library research project at the Energy and Environmental Information Resources Center to enhance access to Federal Geographic Data Committee (FGDC) data sets.[1] It presents a mapping of selected FGDC metadata elements into Dublin Core (DC) and MARC21 metadata that is based on standard crosswalks [Mangan 1997 ; LC 1999]. The FGDC elements included in our mapping are referred to as "essential FGDC metadata." They provide the basis for a converter being developed to import FGDC metadata into the Online Computer Library Center's WorldCat, its Cooperative Online Resource Catalog (CORC) project, and into local MARC-based library catalogs. We also analyze a data set of 466 FGDC records: 1) as a criterion for selecting essential FGDC elements, and 2) in terms of FGDC record length, because record and field lengths are a limitation for MARC21 records in WorldCat and local library systems.  Working through the logic of converting FGDC metadata has led us to design an alternative to the FGDC Clearinghouse.

One impetus for this work is our discovery in 1998 that more than 50% of the queries directed at the National Biological Information Infrastructure (NBII) node of the FGDC Clearinghouse retrieve zero (0) hits for the user. To us, that number represents a failure in the system architecture. A follow-up analysis of NBII log files from July 1998 through March 1999 substantiated the earlier finding.

We are following two research threads: the first is to create an alternative Clearinghouse model that makes management and maintenance of the metadata easier for the individuals who take the time to create FGDC-compatible metadata; the second is to convert existing and future metadata to more widely used metadata standards for inclusion in systems other than the Clearinghouse. Our metadata converter model addresses both concerns. (The permanent URL for our converter project is: http://eeirc.nwrc.gov/converter). Before describing our project, however, it may be useful to offer some important definitions for readers who are not professional librarians.

WorldCat is an international bibliographic database of more than 40 million records that is maintained by the Online Computer Library Center (OCLC) in Dublin, Ohio, and is used by more than 34,000 libraries worldwide [OCLC 1999]. WorldCat records are in MARC21 format, which is the current version of the MARC (Machine Readable Cataloging) standard originally developed in the 1960s by the Library of Congress [LC 1998]. MARC21 is used in the United States and Canada. There are also other national and international MARC standards such as UKMARC and UNIMARC.

The Cooperative Online Resource Catalog (CORC) is an initiative sponsored by OCLC to promote the creation and sharing of metadata for Internet resources by libraries. Some of the main features of CORC are the integration of Dublin Core and MARC21 metadata into a single system that provides for both shared and local metadata for digital and physical items, editing in DC and MARC21 views, import and export of DC and MARC21 records, RDF/XML import and export, authority control, assisted (DDC) classification and subject heading assignment, automated keyword extraction and data extraction, link maintenance, and Unicode support [CORC 1999].

The Dublin Core Metadata Initiative is well known to researchers in the digital library community [DC 1999]. The first Dublin Core Metadata Workshop was sponsored by OCLC and the National Center for Supercomputing Applications in March 1995. Since that time, six more workshops have taken place, with the Seventh Dublin Core Workshop (DC-7) being held October 25-27, 1999 at Die Deutsche Bibliothek in Frankfurt, Germany [DC-7].

 

2. Analysis of FGDC Metadata

FGDC metadata is based on the "Content Standard for Digital Geospatial Metadata" [FGDC 1998]. The standard is available in several electronic formats, for example, as hypertext images [CSDGM Image Map 1998]. FGDC metadata has a hierarchic structure of more than 300 elements, including 199 data entry elements, that are organized into seven information sections and three supporting sections called templates:

 

  1. Identification
  2. Data Quality
  3. Spatial Data Organization
  4. Spatial Reference
  5. Entity and Attribute
  6. Distribution
  7. Metadata Reference
  8. Citation
  9. Time Period
  10. Contact

Sections and elements are either mandatory, mandatory if applicable, or optional. Templates are not used alone, but are inserted into information sections at appropriate places. Some data elements are repeatable, as are the templates. Only the Identification and Metadata Reference sections (sections 1 and 7) are mandatory in a fully compliant FGDC metadata record.

After an FGDC record is created in one of the available editors, for example the MetaMaker program created by FGDC [MetaMaker 1999] or the Army Corps of Engineers' CorpsMet [CorpsMet 1999], the structured ASCII text file is run through a parser that first checks its syntax and then outputs three versions of the record (text, HTML, SGML). All three versions are then sent to a node within the FGDC Clearinghouse, where the Isite software indexes the SGML version. The nodes are searched through one of the Clearinghouse web sites: the user's request is entered in a web form and passed to a Z39.50 client, which broadcasts it to all the selected nodes and returns the results to the user's browser as one set (see http://130.11.52.184/ ). It should be noted that the FGDC Clearinghouse has published statements indicating that on average 10% or more of nodes are not functioning at any given time. We believe that percentage may be even higher. The reader may check the status of the FGDC Clearinghouse nodes at any time at the following location: http://130.11.52.178/serverstatus.html .

Analyzing the NBII Data Set for Record Length

FGDC records provide considerably more information than is usually found in library online catalogs. This applies to both the kind and amount of information that they convey. Thus one of our goals was to determine how much they may exceed the field and record length limits of MARC21 records in OCLC's WorldCat database. MARC21 records in WorldCat have: 1) a maximum record length size of 4096 characters, 2) a maximum field size of 1230 characters in a variable field, and 3) a maximum of 50 variable fields.
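As an illustration only, the three WorldCat limits just listed can be expressed as a small check. The sketch below is written in Python rather than in the C of our converter. The flat list of (tag, value) pairs is a hypothetical record layout chosen for the example, and the record length is approximated as the sum of the field lengths, ignoring MARC leader and directory overhead.

```python
# WorldCat limits as stated in the text.
MAX_RECORD_CHARS = 4096
MAX_FIELD_CHARS = 1230
MAX_VARIABLE_FIELDS = 50


def check_worldcat_limits(fields):
    """Return a list of limit violations for one record.

    `fields` is a list of (tag, value) pairs. This layout is an assumption
    made for illustration, not the converter's actual data structure.
    """
    problems = []
    # Assumption: record length is approximated by summing field lengths.
    record_length = sum(len(value) for _tag, value in fields)
    if record_length > MAX_RECORD_CHARS:
        problems.append(f"record length {record_length} > {MAX_RECORD_CHARS}")
    if len(fields) > MAX_VARIABLE_FIELDS:
        problems.append(f"{len(fields)} fields > {MAX_VARIABLE_FIELDS}")
    for tag, value in fields:
        if len(value) > MAX_FIELD_CHARS:
            problems.append(f"field {tag} length {len(value)} > {MAX_FIELD_CHARS}")
    return problems


if __name__ == "__main__":
    sample = [("245", "Example title"), ("520", "x" * 2000)]
    for problem in check_worldcat_limits(sample):
        print(problem)
```

A record that returns an empty list fits within all three limits; any other result names the limit that would force fields to be cut or dropped.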

Accordingly, we obtained a data set of 466 metadata records from the National Biological Information Infrastructure (NBII) of the U.S. Geological Survey.[2] The output of the SGML format maps into a flat text file of 444 element cells for each record. A summary of results is presented in Table 1. For those interested in examining the data in more detail, see the Appendix: NBII Data.

Table 1: Record and Field Length Summary of 466 NBII Metadata Records

                                               record length   largest field in record
average                                        6792 bytes      2125 bytes
median                                         6474 bytes      1258 bytes
largest values of 466 NBII records             28042 bytes     9525 bytes
number of records with length > 4096           343 records
number of records with a field length > 1230                   236 records

The FGDC metadata in this set is clearly of a different type than the typical catalog record found in WorldCat. The largest record in the set contains 28042 bytes, nearly seven times the WorldCat record length limit, and the largest field value (9525 bytes) is about eight times the WorldCat field limit. About 74% of the records (343 of 466) exceed the maximum record size, while 51% (236 of 466) have at least one field that exceeds the field length limit in WorldCat, usually the abstract (element 14 in our output), the distributor liability statement (element 376), or the process description (element 135).

How does this relate to records in OCLC's CORC system, which allows the import and export of both MARC21 and Dublin Core records? According to Thomas Hickey, CORC Project Manager at OCLC, the differences in size between FGDC records in the NBII dataset and MARC21 bibliographic records in WorldCat should not be a problem for the CORC system. The only real limitation to record size in CORC is what browsers can handle. There have been some problems with records having tens of thousands of bytes, but the average FGDC record is well below this range. WorldCat may adopt CORC's XML system sometime in the future, but for now, moving very long CORC records into WorldCat would require an algorithm to cut or drop fields in order to make the record fit. In other words, the WorldCat record would display abbreviated data in some fields, but the CORC system would display the entire record. The newer Dublin Core, XML/RDF, and FGDC standards do not have field or record length limitations.

Criteria for Mapping and Converting FGDC Elements

It is not our intention to map and convert all 300-plus FGDC elements (or 195 data entry elements). Rather, we selected a smaller number of elements that we refer to as "essential FGDC metadata" for a fully compliant FGDC record. Elements were selected for three reasons: 1) they are required (mandatory) for the production of a fully compliant FGDC record; 2) they are search keys such as author, title, subject, and date that are commonly found in online library catalogs; 3) they are fields commonly used by creators of FGDC metadata that may be used as search keys by persons interested in FGDC geospatial data sets. The first two criteria are determined, respectively, from mandatory elements in the "Content Standard for Digital Geospatial Metadata" (CSDGM) and by generally accepted library practice for the selection of access points in online catalogs. The third criterion is based on a frequency analysis of the NBII data set for actual usage of FGDC elements by persons who created the metadata records. The results of this analysis are presented in Table 2.

Columns 1 and 2, respectively, give the tag number and name of each essential FGDC element as found in the CSDGM. Column 3 gives the number of times each essential element was used in the sample set (out of a possible 466 times).

Table 2: Element Frequency Count for Sample Set of 466 NBII Metadata Records

FGDC Tag    FGDC Element                      NBII Frequency
8.4         Title                             466
8.1         Originator                        466
1.6.1.1     Theme_Keyword                     466
1.6.2.2     Place_Keyword                     424
1.2.1       Abstract                          465
1.2.2       Purpose                           466
8.8.2       Publisher                         305
8.2         Publication_Date                  466
1.5.1.1     West_Bounding_Coordinate          466
1.5.1.2     East_Bounding_Coordinate          466
1.5.1.3     North_Bounding_Coordinate         466
1.5.1.4     South_Bounding_Coordinate         466
9.3.1       Beginning_Date                    345
9.3.1       Ending_Date                       345
9.1.1       Calendar_Date                     117
10.1.1      Contact_Person                    396
10.4.1      Address_Type                      461
10.4.2      Address                           459
10.4.3      City                              461
10.4.4      State_or_Province                 461
10.4.5      Postal_Code                       461
10.4.6      Country                           165
10.5        Contact_Voice_Telephone           461
10.6        Contact_Facsimile_Telephone       226
10.8        Contact_Electronic_Mail_Address   315
10.9        Hours_of_Service                  47
4.1.2.1.1   Map_Projection_Name               74
4.1.4.1     Horizontal_Datum_Name             59
1.7         Access_Constraints                466
1.8         Use_Constraints                   466
10.1.2      Contact_Organization              65
10.3        Contact_Position                  282
6.4.2.1.1   Format_Name                       257

 

3. FGDC to MARC21/DC Crosswalk

The following table (Table 3) presents our crosswalk from FGDC to Dublin Core and MARC21. It consists of 34 essential FGDC elements and is based on standard crosswalks [LC 1999 ; Mangan 1997].[3] It includes mandatory elements from the Identification and Metadata Reference sections, as well as specific elements from the Spatial Reference, Distribution, Citation, Time Period, and Contact sections. Our crosswalk has similarities to, as well as differences from, the "Metadata Entry System" for minimally compliant metadata that has been proposed recently by the Federal Geographic Data Committee [FGDC 1999?]. We recommend that the reader compare those guidelines with the elements in our crosswalk.

The crosswalk and converter represent the current state of an evolving process rather than a final product. The converter software program is written in C by one of us (Alaaeldin Hafez). It has a modular and adaptable design; that is, it is very easy to add, change, or delete particular features within its general design. However, even the best machine conversion may require some human intervention: librarians may want to edit records produced by the converter in order to adapt them to their local automated library systems.

Table 3: Crosswalk from FGDC to Dublin Core and MARC21

No.  FGDC Tag   FGDC Name                        DC Name                  MARC21 Tag
01   8.4        Title                            Title                    245 00 $a
02   1.2.1      Abstract                         Description              520 __ $a
03   1.2.2      Purpose                          Description              500 __ $a
04   8.1        Originator                       Creator.Name             720 __ $a
05   8.8.2      Publisher                        Publisher                260 0_ $b
06   8.2        Publication_Date                 Date.Issued              260 0_ $c
07   9.1.1      Calendar_Date                    Coverage.Date            513 __ $b
08   9.3.1      Beginning_Date                   Coverage.dateStart       513 __ $b
09   9.3.1      Ending_Date                      Coverage.dateEnd         513 __ $b
10   1.5.1.1    West_Bounding_Coordinate         Coverage.Box.westLimit   034 0_ $d
11   1.5.1.2    East_Bounding_Coordinate         Coverage.Box.eastLimit   034 0_ $e
12   1.5.1.3    North_Bounding_Coordinate        Coverage.Box.northLimit  034 0_ $f
13   1.5.1.4    South_Bounding_Coordinate        Coverage.Box.southLimit  034 0_ $g
14   1.6.1.1    Theme_Keyword                    Subject.Keyword          653 0_ $a
15   1.6.2.2    Place_Keyword                    Subject.Geographic       653 0_ $a
16   6.4.2.1.1  Format_Name                      Format                   856 $q
17   10.1.1     Contact_Person                                            270 $p
18   10.1.2     Contact_Organization                                      270 $q
19   10.3       Contact_Position                                          270 $q
20   10.4.1     Address_Type                                              270 $i
21   10.4.2     Address                                                   270 $a
22   10.4.3     City                                                      270 $b
23   10.4.4     State_or_Province                                         270 $c
24   10.4.5     Postal_Code                                               270 $e
25   10.4.6     Country                                                   270 $d
26   10.5       Contact_Voice_Telephone                                   270 $k
27   10.6       Contact_Facsimile_Telephone                               270 $l
28   10.8       Contact_Electronic_Mail_Address                           270 $m
29   10.9       Hours_of_Service                                          270 $r
30   1.7        Access_Constraints               Rights.Access            506 $a
31   1.8        Use_Constraints                  Rights.Use               540 $a
32                                               Identifier.URL           856 $u
33   4.1.2.1.1  Map_Projection_Name              Coverage.Box.projection  255 $b
34   4.1.4.1    Horizontal_Datum_Name            Description              342 05 $a
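To make the table-driven nature of the crosswalk concrete, the following Python sketch encodes a handful of rows from Table 3 as a lookup table and applies it to a record. It is an illustration under stated assumptions, not the converter itself (which is written in C): the function name apply_crosswalk and the representation of an FGDC record as a list of (element name, value) pairs are hypothetical. Rows are keyed by element name because Beginning_Date and Ending_Date share a tag number in the table.

```python
# A few rows of Table 3:
# FGDC element name -> (DC name or None, MARC21 tag, MARC21 subfield)
CROSSWALK = {
    "Title": ("Title", "245", "$a"),
    "Abstract": ("Description", "520", "$a"),
    "Originator": ("Creator.Name", "720", "$a"),
    "Publication_Date": ("Date.Issued", "260", "$c"),
    "Theme_Keyword": ("Subject.Keyword", "653", "$a"),
    # Contact elements have no Dublin Core mapping in the crosswalk.
    "Contact_Person": (None, "270", "$p"),
}


def apply_crosswalk(fgdc_record):
    """Map (element name, value) pairs to DC pairs and MARC21 triples.

    Elements that are not in the crosswalk are skipped, which mirrors the
    idea of converting only the "essential" elements.
    """
    dc_pairs = []
    marc_fields = []
    for name, value in fgdc_record:
        if name not in CROSSWALK:
            continue
        dc_name, tag, subfield = CROSSWALK[name]
        if dc_name is not None:
            dc_pairs.append((dc_name, value))
        marc_fields.append((tag, subfield, value))
    return dc_pairs, marc_fields


if __name__ == "__main__":
    record = [
        ("Title", "Hydrologic units maps of the Conterminous United States"),
        ("Contact_Person", "J. Doe"),          # hypothetical value
        ("Process_Description", "not mapped"),  # outside the crosswalk
    ]
    dc, marc = apply_crosswalk(record)
    print(dc)
    print(marc)
```

Keeping the mapping in a table rather than in program logic is one way to obtain the adaptability described above: adding or changing a row does not require touching the conversion routine.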

Discussion of Essential FGDC Metadata Elements

The following notes on the crosswalk are based on our cataloging experience as participants in the CORC project. The crosswalk also incorporates results of the Seventh Dublin Core Workshop (DC-7), which was held October 25-27, 1999, at Die Deutsche Bibliothek, Frankfurt, Germany. The principal activities at DC-7 were DC Working Group discussions of proposals for an initial set of DC Qualifier semantics. Discussion, revision, and approval of DC qualifiers will be completed by January 1, 2000 [Weibel 1999]. We intend to harmonize DC qualifiers in our converter with those of the Dublin Core Metadata Initiative after that date.

FGDC Title

This FGDC element corresponds to MARC21 245 $a and DC.Title. We follow CORC participants' practice of using subfield $h [electronic resource] rather than $h [computer file] as the general material designation (GMD). The former is valid under ISBD (ER) rules, although it is not yet approved in AACR2. The converter enters $h [electronic resource] after the 245 $a title field.

FGDC Originator

This FGDC element corresponds to MARC21 720 and DC.Creator. The 720 field occurs in the earlier (1997) but not the current (1999) version of the Library of Congress crosswalk [LC 1999]. We are using this field because the converter program cannot distinguish between personal and corporate names. Since the 720 field is not used in WorldCat or CORC, librarians will have to edit this field to the appropriate personal (700) or corporate (710) added entry fields. The DC Agents Working Group at DC-7 proposes the qualifier "Agent Name" for Agent elements (that is, the DC.Creator, Publisher, and Contributor elements) [Iannella 1999]. The crosswalk accordingly maps the FGDC Originator as DC.Creator.Name.

FGDC Publication Date

This FGDC element corresponds to MARC21 260 $c and DC.Date. Both FGDC and DC metadata adhere to the date format of the ISO 8601 standard, which has been summarized by Kuhn [Kuhn 1999]. For example, the date November 25, 1999 is given in FGDC as 19991125 (or YYYYMMDD without hyphens) and in DC as 1999-11-25 (or YYYY-MM-DD with hyphens). The DC Date Working Group recommends hyphenation, which is also endorsed by the World Wide Web Consortium (W3C) [Childress 1999 ; Wolf 1997]. The Working Group also recommends the qualifier "Issued" for the "date of formal issuance (e.g., publication) of the resource." The converter also maps the YYYY-MM-DD format to the MARC21 260 $c field. Users of WorldCat and CORC will want to edit the date, e.g., 1999-11-25, to "1999" in MARC21 260 $c, add "November 25, 1999" as a 500 note, and enter "e" (for expanded date) and "1999,1125" in the Date/Status (DtSt) and Dates fixed fields.
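The date handling described here is simple enough to sketch. The Python fragment below is a minimal illustration, assuming that the FGDC value is always a complete eight-digit YYYYMMDD date; partial dates and free-text values would need separate handling. The function names are hypothetical.

```python
from datetime import datetime


def fgdc_date_to_dc(fgdc_date):
    """Convert an FGDC date (YYYYMMDD) to the hyphenated DC form (YYYY-MM-DD).

    Raises ValueError if the input is not a valid calendar date in that form.
    """
    return datetime.strptime(fgdc_date, "%Y%m%d").strftime("%Y-%m-%d")


def year_for_260c(dc_date):
    """Return the four-digit year that a cataloger would keep in 260 $c."""
    return dc_date[:4]


if __name__ == "__main__":
    dc_date = fgdc_date_to_dc("19991125")
    print(dc_date)                 # 1999-11-25
    print(year_for_260c(dc_date))  # 1999
```

Parsing the value as a real date, rather than inserting hyphens by position, has the side benefit of rejecting impossible dates before they reach the catalog record.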

FGDC Calendar Date, Beginning Date, and Ending Date

These FGDC elements refer to the time period covered by the data set, either as a single date or range of dates, and not to the publication date. In MARC21, the date or dates map to a 513 $b Period Covered Note. If there is a range of dates, the converter uses subfield $b twice (once for each date). For the DC.Coverage element, the joint meeting of the DC-Date and DC-Coverage Working Groups at the 7th Dublin Core Workshop has recommended the qualifiers "Date", "dateStart", and "dateEnd" [Miller 1999]. The crosswalk therefore maps FGDC Calendar_Date, Beginning_Date, and Ending_Date, respectively, as Coverage.Date, Coverage.dateStart, and Coverage.dateEnd.

FGDC Bounding Coordinates

The four FGDC elements for west, east, north, and south bounding coordinates map to MARC21 034 $d, $e, $f, and $g and to DC.Coverage. The DC.Coverage Working Group at DC-7 is proposing the qualifier "Box," which will contain westLimit, eastLimit, northLimit, and southLimit for FGDC bounding coordinates [Miller 1999 ; Cox 1999]. For example, using the bounding coordinates of the data set "Hydrologic units maps of the Conterminous United States," the four FGDC bounding coordinates are mapped as DC.Coverage.Box.westLimit:-127.9061; eastLimit:-65.3219; northLimit:48.2825; southLimit:22.8757.

The MARC21 034 $d, $e, $f, and $g subfields (and 255 $c subfield) express longitudes and latitudes in degrees-minutes-seconds (DMS), while the FGDC values for bounding coordinates are given in decimal degrees (DD). MARC21 records ought to be edited with a DD/DMS converter such as the one provided by Gary Park of the University of Northumbria [Park 1999] (http://www.unn.ac.uk/~evgp1/gary/dec2deg.htm). We intend to add a DD/DMS (and DMS/DD) converter to our metadata converter at a future date.
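The arithmetic of a DD-to-DMS conversion is shown in the Python sketch below. It is an illustration, not the converter we intend to build: the function names are hypothetical, and the compact hemisphere-degrees-minutes-seconds string produced by format_034 is our reading of the 034 coordinate layout and should be checked against the MARC21 documentation before use. Values are rounded to the nearest whole second.

```python
def dd_to_dms(value, axis):
    """Convert decimal degrees to (hemisphere, degrees, minutes, seconds).

    `axis` is "lon" or "lat". Negative longitudes are west and negative
    latitudes are south, as in the FGDC bounding coordinates.
    """
    if axis == "lon":
        hemisphere = "W" if value < 0 else "E"
    elif axis == "lat":
        hemisphere = "S" if value < 0 else "N"
    else:
        raise ValueError("axis must be 'lon' or 'lat'")
    # Work in whole seconds so that rounding can never produce 60 seconds.
    total_seconds = round(abs(value) * 3600)
    degrees, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return hemisphere, degrees, minutes, seconds


def format_034(value, axis):
    """Format a coordinate as hemisphere + dddmmss (assumed 034 layout)."""
    hemisphere, degrees, minutes, seconds = dd_to_dms(value, axis)
    return f"{hemisphere}{degrees:03d}{minutes:02d}{seconds:02d}"


if __name__ == "__main__":
    # Bounding coordinates of the hydrologic units data set cited above.
    print(format_034(-127.9061, "lon"))  # west limit
    print(format_034(-65.3219, "lon"))   # east limit
    print(format_034(48.2825, "lat"))    # north limit
    print(format_034(22.8757, "lat"))    # south limit
```

For the west limit, for example, 127.9061 degrees is 127 degrees plus 0.9061 x 60 = 54.366 minutes, and 0.366 x 60 is about 22 seconds, giving 127 degrees 54 minutes 22 seconds west.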

FGDC Theme Keywords

These FGDC elements correspond to MARC21 653 (Index Terms--Uncontrolled) and DC.Subject.Keyword, not to MARC21 650 (Subject Added Entry--Topical Term) and DC.Subject.LCSH. This is because creators of FGDC metadata use their own keywords, not a controlled vocabulary like the Library of Congress Subject Headings (LCSH) that many libraries use in MARC21 650. Some users of the converter may prefer this uncontrolled vocabulary, but we prefer to use LCSH where possible. There is an NBII thesaurus (currently containing about 1000 uncontrolled terms) and we have started working on a crosswalk between it and LCSH. This crosswalk can be built into the converter. We think about half of the NBII terms can be mapped directly to LC subject terminology. For clarification, the thesaurus we started mapping to LCSH is a version of the keyword thesaurus incorporated into the USGS MetaWebber tool (http://biology.usgs.gov/pubs/metaweb). However, we have recently learned that the CERES/NBII Thesaurus Partnership Project (http://ceres.ca.gov/thesaurus/) is a more comprehensive and authoritative thesaurus, so we intend, if possible, to include that thesaurus in our converter instead.

FGDC Place Keywords

These FGDC elements correspond to MARC21 653 and DC.Subject.Geographic. The converter does not map to MARC21 651 (Subject Added Entry--Geographic Term) because place keywords (like theme keywords) used by creators of FGDC metadata are uncontrolled terms. Some users of the converter may prefer these terms, while others (like ourselves) may prefer to edit them to forms found in the OCLC Authority File.

FGDC Format Name

This FGDC element corresponds to MARC21 856 $q and DC.Format. Creators of FGDC metadata may use a list of domain values given in the "Content Standard for Digital Geospatial Metadata" and some libraries may prefer to use this value. (The list of values is given at element 6.4.2.1.1 in [FGDC 1998]). Our preference is to use Internet Media Types (MIME values), which are also recommended by the DC Resource Type and Format Working Group. For example, many FGDC data sets are available in Arc/Info export file format, which has the FGDC domain value "ARCE." However, we use the Arc/Info extension ".e00" to create the local format "application/e00" (it is local because, so far, we have not found Arc/Info in any MIME list).

FGDC Contact Information

These thirteen elements in the FGDC Contact template are mapped to the MARC21 270 primary address field. As mentioned above, the Contact template (section 10) can be inserted into the Identification, Distribution, or Metadata Reference sections (sections 1, 6 or 7). Contact persons or organizations may be the same or different in these three sections. We mapped contact information about the originator of the data set, rather than about the distributor contact or metadata contact, because we felt that this would be of most interest to users of FGDC data sets. For the time being, the crosswalk does not map FGDC Contact information to Dublin Core. This will be done when the basic set of qualifiers for DC Agent is established.[4]

DC Identifier/MARC21 856

The converter does not include mapping of a Uniform Resource Locator (URL) from FGDC to the DC.Identifier or MARC21 856 field. This is because the URL is a link to FGDC metadata located at a clearinghouse or other resource center. The URL does not occur in the body of the FGDC metadata itself. (The URL in an FGDC Online Linkage field in the Citation template is an optional link that points to a data set, not to metadata.) This is really a matter of meta-metadata, or data about metadata, and we are interested to learn how other libraries are addressing this issue. It should be emphasized that we do not view the meta-metadata as a replacement for the full record. Our approach is premised on the availability of the full record via a hyperlink from the Identifier (DC) or 856 (MARC) field back to the full FGDC record.

Our initial solution at the EE-IR Center has been to write DC metadata about FGDC metadata that has been placed onto a Web server and offered to users in HTML:

DC Meta-Metadata: Hydrologic Units Maps of the Conterminous United States
FGDC Metadata: http://water.usgs.gov/GIS/metadata/usgswrd/huc250k.html

Experience shows that putting FGDC metadata records in a database and offering one search engine interface is not sufficient for user needs.  The HTML version of the FGDC record cited in the above example is not unusual. We have found dozens of examples where the HTML version of the record is dressed up and offered as a Web page, making the metadata record - not a search engine - the primary user interface (a list of some of these examples is available at http://eeirc.nwrc.gov/metadata_fgdc.htm).   Huberman and Adamic (1997) contend that "social search" is a mechanism for navigating the bountiful content of the Web.  That is, exchanging links to content with one's colleagues and friends is a successful and efficient method that bypasses portal directories and search engines [Huberman 1997]. By exchanging links to known sites of value, groups of individuals reduce the costs of surfing for information. It is this strategy or pattern, we believe, that may explain the emergence of this parallel FGDC Clearinghouse system.

For example, there is a Dublin Core version of an FGDC record in our collection, the title of which is  "1:250,000 Scale Quadrangles of Landuse/Landcover GIRAS Spatial Data in the Conterminous United States"; the URL is http://eeirc.nwrc.gov/metadata/151.htm. The Identifier field in this DC record points to a full FGDC record at http://nsdi.epa.gov/nsdi/projects/giras.htm. The problem is, our colleague Suzanne Harrison has also found the identical HTML view of this FGDC metadata record at two other locations on the Web:

1. http://www.epa.gov/ngispgm3/nsdi/projects/giras.htm
2. http://www.epa.gov/envirofw/html/ndsi/projects/giras.html (This resource is no longer available.)

What is emerging out of the FGDC Clearinghouse model, therefore, is a very serious maintenance problem. It is clear from this example that there is no method for updating all the permutations of the "official" FGDC record housed in one of the Clearinghouse server nodes. The complications with maintenance of FGDC metadata created by this divergence have not been studied.

4. The Need for Persistent Uniform Resource Locators

The core of the problem we have identified is the omission of a system for creating persistent uniform resource locators for FGDC metadata. As the empirical data show, it is not possible to map a one-to-one relationship between the original FGDC source metadata and OCLC's international WorldCat database of MARC21 records. Therefore, what is needed is a surrogate of the FGDC metadata that would reside in other systems and point back to the full record.

The Persistent Uniform Resource Locator (PURL) system developed by OCLC is simple and clever (see: http://purl.oclc.org). It is well tested with over three years of active use and development in places like OCLC and the U.S. Government Printing Office. It could be used right away in an FGDC pilot project. A thorough description of the PURL system is available [Shafer 1996]. Briefly, the system works as follows. Each unique network resource is given a PURL that is associated with the object. When the PURL location is requested by a browser, the request is resolved by the PURL software, and the browser is sent to the resource. If the resource ever changes locations, it is only necessary to change one location in the PURL database that is associated with the permanent URL. In the example below, we created a PURL for this document. As you can see, typing in either address will retrieve the same document.

Example:
PURL: http://purl.oclc.org/NET/alchandler/crosswalk/
RESOURCE: http://eeirc.nwrc.gov/pubs/crosswalk/fgdc-marc-dc.htm
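The indirection at the heart of this approach can be sketched in a few lines of Python. This is a toy in-memory model written for illustration; it is not the OCLC PURL software or its interface, and the class name, method names, and the relocated address used in the demonstration are all hypothetical.

```python
class PurlRegistry:
    """Toy model of a PURL resolver: one table from PURL to current location."""

    def __init__(self):
        self._targets = {}

    def register(self, purl, resource_url):
        """Associate a new PURL with the current location of a resource."""
        if purl in self._targets:
            raise ValueError(f"PURL already registered: {purl}")
        self._targets[purl] = resource_url

    def resolve(self, purl):
        """Return the current location, as a redirect target would."""
        return self._targets[purl]

    def move(self, purl, new_resource_url):
        """Record a relocation; the PURL itself never changes."""
        if purl not in self._targets:
            raise KeyError(purl)
        self._targets[purl] = new_resource_url


if __name__ == "__main__":
    registry = PurlRegistry()
    purl = "http://purl.oclc.org/NET/alchandler/crosswalk/"
    registry.register(purl, "http://eeirc.nwrc.gov/pubs/crosswalk/fgdc-marc-dc.htm")
    print(registry.resolve(purl))

    # Hypothetical relocation: only the registry entry is edited.
    registry.move(purl, "http://example.org/moved/fgdc-marc-dc.htm")
    print(registry.resolve(purl))
```

Every catalog record that cites the PURL keeps working after the move, because the only thing that changed is the single entry in the registry.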

The same system could be used within the FGDC Clearinghouse system. Each metadata record would be validated for syntax, then registered with a unique PURL. Since each FGDC record currently has four different versions (text input, text output, SGML, and HTML), the question arises as to which record(s) receive a PURL. Further thought needs to be given to the problem of assigning the PURL and its relation to the other formats. We suspect the way to go is to make an XML/RDF file the primary FGDC metadata format. The important point here is that a solution to the problem is close at hand today. In fact, there is even a precedent within the U.S. Government for the utilization of the OCLC PURL software: the U.S. Government Printing Office uses it (see http://purl.access.gpo.gov).

Another example comes from the International Consortium for Alternative Academic Publication ( http://www.icaap.org ). In some respects, the potential for FGDC metadata inherent in the journal article identification system used by ICAAP is more easily grasped than the OCLC PURL system. The ICAAP IUICODE is used to keep track of the location of every article, but it is also used for other purposes, such as indexing and maintenance. Sosteric provides a discussion of how the IUICODE fits into the ICAAP publishing system [Sosteric 1999]. The ICAAP uses a simple database to keep track of article locations (see Figure 1). The ICAAP IUICODE system works just like the Digital Object Identifier (DOI).

Every article published in an ICAAP journal receives a unique, permanent ID. For example:

Foley, Dan. (1999). Metadata in a Digital Special Library: the Energy and Environmental Information Resources Center in Lafayette, Louisiana . Journal of Southern Academic and Special Librarianship: 01[iuicode: http://www.icaap.org/iuicode?62.01.02.04]

Typing the address http://www.icaap.org/iuicode?62.01.02.04 into a browser leads to the current location of Foley's article, http://www.icaap.org/SouthernLibrarianship/content/v01n02/foley_d01.html . The IUICODE version of this location is superior for two reasons: first, it is shorter and easier for the user to type; and second, it has the potential to be permanently valid.

 

Figure 1: ICAAP IUICODE Database

[Image: iuicode.gif]

The relationship between electronic journal publishing and scientific data management may be closer than many realize. Shawn Callahan and his colleagues call this relationship "dataset publishing" [Callahan 1996]. Within the FGDC community here in the U.S. it is an acknowledged fact that motivating scientists to write FGDC-compliant metadata for their data sets is a frustrating and, thus far, disappointing effort. Callahan argues that organizations should treat data management like scholarly publication, adopting the same cultural incentives for recognition and accountability for individual contributions.

5. Our Alternative to the FGDC Clearinghouse Model

Figure 2 below is a flow diagram for the system we believe can and should be built as a next generation version of the FGDC Clearinghouse. The black arrows show the existing FGDC Clearinghouse information flow, while the green arrows represent the next generation model we propose.

Figure 2.

[Image: converter_flow.gif]

 

There are two main problems with the existing FGDC Clearinghouse. The most serious is the difficulty of setting up and managing a Clearinghouse node. Root access on an NT or UNIX server is required to do the installation and configuration. Given this high technical threshold for creating and managing a node, maintenance of FGDC metadata records is by definition at least a two-person job. The creator of the metadata sends the original metadata record to a node where it is indexed by the node administrator. The problem is that from that point forward, whenever a change is made to the record, the node administrator must be contacted to replace the old version of the record. In addition, the Isite software that is used can only index whole directories of files, which means the server must be turned off and all the files must be re-indexed in order to change a single email address in one record!

The second problem is related to the first: the FGDC Clearinghouse system is unreliable and, furthermore, it is not scalable. As stated earlier, some 10% or more of all Clearinghouse nodes are unavailable to be searched at any one time. The scalability problem is this: each time a query is made it is sent to all the nodes selected by the user. There are at present about 179 nodes in the system, but there is a warning on the search panel that says "Disclaimer: You may experience problems when selecting more than 40 servers in a single search session." To reliably search for relevant metadata in all the nodes in the system would therefore require five separate searches at the present time. Of course, as the system grows (unless this glitch is fixed) it will require still more separate searches. This is a serious disincentive to growth for the existing FGDC model.
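The arithmetic behind the five-search figure can be sketched as follows. This is only an illustration, assuming the node count of 179 and the 40-server ceiling quoted above; the function names `searches_needed` and `batches` are hypothetical and are not part of any Clearinghouse software.

```python
import math


def searches_needed(total_nodes: int, max_per_search: int) -> int:
    """Return how many separate search sessions are needed to cover every node."""
    if max_per_search <= 0:
        raise ValueError("max_per_search must be positive")
    return math.ceil(total_nodes / max_per_search)


def batches(nodes: list, max_per_search: int) -> list:
    """Split a list of nodes into consecutive groups no larger than the limit."""
    return [nodes[i:i + max_per_search]
            for i in range(0, len(nodes), max_per_search)]


# Hypothetical node identifiers 0..178 stand in for the 179 real nodes.
node_groups = batches(list(range(179)), 40)
print(searches_needed(179, 40))       # 5
print([len(g) for g in node_groups])  # [40, 40, 40, 40, 19]
```

Four full sessions cover 160 nodes and a fifth session covers the remaining 19, so every additional 40 nodes adds another session under the current limit.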

The model we propose shifts the emphasis in the Clearinghouse from managing "nodes" to managing metadata records. This occurs because metadata in our system would reside in the individual web server directory of the creator or maintainer of the metadata records. Web servers are ubiquitous and more easily managed than Isite server software.

The key to our system is registering the location of the FGDC records in a central system that would perform three functions: 1) manage persistent URLs to the individual FGDC records; 2) ingest the registered metadata into a central search engine for searching; and 3) offer all users of the system the option of converting the FGDC metadata into formats such as MARC21 for importing into other systems more likely to be utilized by the local community of digital library patrons. The PURL for the full FGDC metadata record would be placed in the MARC record's 856 field as the online linkage for the information. Since the relationship between the absolute location of the FGDC metadata record and this PURL would be managed centrally in the converter system, if the location of the full FGDC record ever changes, the adjustment could be made ONE TIME in this centralized system, and all the distributed meta-metadata created to that point would be updated automatically via the PURL system. Considering the fact that there are 34,000 OCLC affiliates with local library systems who theoretically might import converted FGDC records into their MARC systems, we believe this simple solution is a significant innovation for the FGDC Clearinghouse.
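The one-time update described above can be illustrated with a small sketch. It is not the converter itself: the class `PurlRegistry`, the function `marc_856`, the record identifier, and every `example.org` address are hypothetical, and the 856 field is rendered in the simple "856 $u" display notation used in this paper, with indicators omitted.

```python
class PurlRegistry:
    """Toy in-memory stand-in for a central PURL table (hypothetical)."""

    def __init__(self, base: str = "http://purl.example.org/fgdc/"):
        self.base = base
        self._targets = {}  # record id -> current absolute location

    def register(self, record_id: str, location: str) -> str:
        """Store the record's current location and return its persistent URL."""
        self._targets[record_id] = location
        return self.base + record_id

    def relocate(self, record_id: str, new_location: str) -> None:
        """Record a move; the PURL handed out earlier does not change."""
        if record_id not in self._targets:
            raise KeyError(record_id)
        self._targets[record_id] = new_location

    def resolve(self, purl: str) -> str:
        """Map a PURL back to wherever the full FGDC record lives now."""
        if not purl.startswith(self.base):
            raise ValueError("not a PURL from this registry")
        return self._targets[purl[len(self.base):]]


def marc_856(purl: str) -> str:
    """Render an 856 field carrying the PURL in subfield $u (display form only)."""
    return "856 $u " + purl


registry = PurlRegistry()
purl = registry.register("rec0001", "http://server-a.example.org/meta/rec0001.sgml")
field = marc_856(purl)  # this string is what a local catalog would import

# The metadata maintainer moves the record; only the registry is edited.
registry.relocate("rec0001", "http://server-b.example.org/meta/rec0001.sgml")
print(field)
print(registry.resolve(purl))
```

The 856 field imported into any number of local catalogs stays untouched after the move, because only the central table holds the absolute location.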

 

6. Conclusion

The realities of mapping FGDC to MARC21 and Dublin Core standards are most clearly understood by examining the record and field length limits of the OCLC WorldCat system. It is our supposition that there are others who are interested in putting FGDC records into their local MARC21 library systems to increase the access points and availability of this valuable metadata. The whole notion of cooperative cataloging mandates that we look for least common denominators for our metadata standards. While it is possible that some library automation systems do not impose the same kind of limits as WorldCat, it would be counterproductive to design individual crosswalks for each library vendor's system.

It appears the CORC project's success will translate into a new way of storing metadata for OCLC over time. Given OCLC's leadership in the field, there is a good chance that the XML-based record structure will be adopted by vendors. Completion of a migration away from MARC, however, is years in the future, considering the massive investment of equipment and training in libraries. Therefore, metadata conversion efforts ought to consider the OCLC WorldCat field and record length and number limitations as constants for now.

One of the core issues we would like to highlight is the lack of a persistent URI for FGDC metadata. As the system is currently designed, an SGML version of the record is dumped into a Z39.50 database server. Each time the system re-indexes, the address for the record is changed. This design flaw embedded in the FGDC Clearinghouse model violates a core rule of networked information. No stronger statement of this is available than that made by Tim Berners-Lee of the World Wide Web Consortium:

"The most fundamental specification of Web architecture, while one of the simpler, is that of the Universal Resource Identifier, or URI. The principle that anything, absolutely anything, 'on the Web' should [be] identified distinctly by an otherwise opaque string of characters is core to the universality." [Berners-Lee 1998].

Until the problem of dynamic metadata locations is addressed it will not be possible to create meta-metadata for the FGDC records on a large scale. There are other problems with the Z39.50 FGDC Clearinghouse system, such as slow response time and unreliable search results. These are liabilities that cause the metadata searcher and creator to lose faith in the system, thus accelerating the need to export the metadata into other systems with better user interfaces. Solutions must take metadata maintenance into consideration.

An area ripe for empirical investigation is the preferences and habits of scientists when searching FGDC metadata. Myke Gluck and Bruce Fraser, for example, have shown that the appearance or format of metadata records has a very large effect on the user's perception of relevance [Gluck 1998].

Another fruitful area of digital library research is the relationship between metadata and scholarly electronic journals. We believe FGDC metadata should be peer reviewed and included in the institutional reviews of scientists for promotion and tenure.

More discussion and critical analysis is due. We hope our effort here will stimulate an exchange of ideas.

 

7. Appendix: NBII Data

These data are presented in nine worksheets in a Microsoft Excel 97 file. If you would like to download the data, they are available in a single zipped spreadsheet file, which contains nine separate worksheet tabs (see the description below for each tab). For more detailed information, please contact the authors. [ download 147 KB ]

The key to understanding the worksheet tabs is the SGML list of 444 FGDC elements, given as line numbers in a tab labeled "Key." For example, line 14 = <metadata><idinfo><descript><abstract>, which is the Abstract (FGDC 1.2.1) in the Identification information section.

Excel and other spreadsheets have a limit of about 240 columns, so we had to divide the 444 elements or line numbers into three parts: A) counts17 (sections 1 and 7) or lines 1-97 and 414-444, a division chosen because it corresponds to the two mandatory sections of FGDC; B) counts4 (section 4), or lines 169-332, chosen because it is the longest single section; and C) counts2356 (sections 2, 3, 5, and 6), or lines 98-168 and 333-413. Thus, the three parts have 128 + 164 + 152 = 444 lines.
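The three-way split can be checked with a few lines of code. This is a minimal sketch that takes the line ranges from the paragraph above; the dictionary `PARTS` and the helper names are hypothetical and do not appear in the spreadsheet file.

```python
# Line ranges (inclusive) for each part, as stated in the text.
PARTS = {
    "counts17": [(1, 97), (414, 444)],
    "counts4": [(169, 332)],
    "counts2356": [(98, 168), (333, 413)],
}


def part_size(ranges) -> int:
    """Count the lines covered by a list of inclusive (start, end) ranges."""
    return sum(end - start + 1 for start, end in ranges)


def covered_lines(parts) -> list:
    """Return every line number covered by any part, in ascending order."""
    lines = []
    for ranges in parts.values():
        for start, end in ranges:
            lines.extend(range(start, end + 1))
    return sorted(lines)


sizes = {name: part_size(ranges) for name, ranges in PARTS.items()}
print(sizes)                # {'counts17': 128, 'counts4': 164, 'counts2356': 152}
print(sum(sizes.values()))  # 444
```

Comparing `covered_lines(PARTS)` with the full sequence 1 to 444 confirms that no element is dropped and none is counted twice.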

The nine tables are interpreted as follows:

Key tab: two fields: the first is the element number, the second is the SGML element path in the input record.

Tagcounts tab: this table has only two columns: column A = the 444 FGDC elements and column B = frequency of use in the data set. For example, element 14 (<abstract>) was used in all 466 FGDC records in the data set.

Stts tab: statistical summary; five columns: column A = file names of the 466 FGDC records; B = list of line numbers (elements) that are used in that record; C = record size (characters, bytes); D = size of the longest field in that record (characters, bytes); E = line number of longest field in record (e.g., 14 = <abstract>). Average, median, and maximum for record size and longest field are given at the end of the table. This is the most useful arrangement of the data for our purposes.

Counts17 tab: Identification and Metadata Reference sections (1 and 7): column A = file names of the 466 FGDC records; columns B-DY = lines 1-97 and 414-444

Counts2356 tab: Data Quality, Spatial Data Organization, Entity and Attribute, and Distribution sections (2, 3, 5, and 6): column A = file names of the 466 FGDC records; columns B-EW = lines 98-168 and 333-413

Counts4 tab: Spatial Reference section (4): column A = file names of the 466 FGDC records; columns B-FI = lines 169-332

Maxs17 tab: sections 1 and 7; contains the maximum value for all the elements in that section for each record.

Maxs2356 tab: sections 2, 3, 5 and 6; contains the maximum value for all the elements in that section for each record.

Maxs4 tab: section 4; contains the maximum value for all the elements in that section for each particular record.
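The column ranges quoted for the three Counts tabs follow from the element counts, since column A holds the file names and the elements begin in column B. The sketch below shows the conversion from a column position to spreadsheet letters; the function names `column_letter` and `last_data_column` are hypothetical.

```python
def column_letter(index: int) -> str:
    """Convert a 1-based column position to spreadsheet letters (1 -> A, 27 -> AA)."""
    if index < 1:
        raise ValueError("column index must be at least 1")
    letters = ""
    while index > 0:
        # Shift to 0-based before dividing so that 26 maps to Z, not to "A" plus carry.
        index, remainder = divmod(index - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters


def last_data_column(n_elements: int) -> str:
    """Letter of the final element column when column A is reserved for file names."""
    return column_letter(1 + n_elements)


print(last_data_column(128))  # DY (Counts17)
print(last_data_column(152))  # EW (Counts2356)
print(last_data_column(164))  # FI (Counts4)
```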

 

8. Notes

[1.] By way of background, Adam Chandler is a systems librarian, Dan Foley is a cataloger, and Alaaeldin M. Hafez is a computer scientist. Our library, the Energy and Environmental Information Resources Center (EE-IR Center), is a digital special library of text, numeric, and geospatial data. It was formed as a partnership between the National Wetlands Research Center (NWRC) of the U.S. Geological Survey and the Center for Advanced Computer Studies of the University of Louisiana (CACS/ULL). Both partners are located in Lafayette, Louisiana. The EE-IR Center is funded by the Office of Scientific and Technical Information (OSTI) of the U.S. Department of Energy. The scope of the collection pertains to energy and the environment of Louisiana, especially the wetlands areas of South Louisiana. An area of special interest is pollution and contamination of the Lower Mississippi Watershed and offshore in the Gulf of Mexico. For more information see [Foley 1999].

The EE-IR Center is located in the NWRC Library. Other Center personnel are NWRC Librarian Judy Buys and GIS Specialist Suzanne Harrison. The work presented in this paper is funded by U.S. Dept. of Energy Grant No. DE-FG02-97ER1220. The principal investigators for our digital library project under this grant are Dr. Vijay Raghavan, CACS/USL, and Gaye Farris, Branch Chief, Technical and Informatics Branch, NWRC.

[2.] We are grateful to Susan Stitt of NBII for supplying this data set to us.

[3.] For readers unfamiliar with the MARC21 bibliographic format, the best introduction is "Understanding MARC Bibliographic: Machine Readable Cataloging" [Furrie 1998]. Throughout this paper, a three-digit number indicates a MARC tag for a particular MARC field. Fields have subfields $a, $b, $c, etc., where the dollar sign ($) is a subfield indicator. For example, the notation 856 $u refers to an Electronic Location and Access field (856) having a subfield ($u) that contains a Uniform Resource Locator (URL).

[4.] There has been considerable discussion by the DC Agents Working Group about the kind and amount of contact information that ought to be included in the basic set of DC qualifiers for Agents. Currently, the Working Group favors five qualifiers: Agent Type (person, organization, event, or object), Agent Name, Agent Affiliation, Agent Role (for example, the MARC Code List of Relators), and Agent Identifier (for example, a Uniform Resource Identifier or URI) [Iannella 1999]. Some DC user communities or implementors may want to use additional qualifiers (e.g., for e-mail or physical addresses, telephone numbers, etc.). Still other communities will add elements to the basic fifteen-element DC set or are using Dublin Core in languages other than English. There are plans for the Dublin Core Directorate to establish a registry that will define and accommodate the specific needs of different user groups.

9. References

[Berners-Lee 1998] Berners-Lee, Tim. (1998). "Web Architecture from 50,000 feet." Retrieved 5 May 1999 from: http://www.w3.org/DesignIssues/Architecture.html

[Callahan 1996] Callahan, Shawn, David Johnson, and Paul Shelley. (1996). "Dataset Publishing - a Means to Motivate Metadata Entry." Retrieved on 13 December 1999 from: http://church.computer.org/conferences/meta96/callahan/callahan.html

[Childress 1999] Childress, Eric. (1999). "DC Date Qualifiers: DC Working Draft, 07 December 1999." Available at: http://www.mailbase.ac.uk/lists/dc-date/files/qual-prop.html

[CSDGM Image Map 1998] CSDGM Image Map. (1998). "An Image Map of the Content Standard for Digital Geospatial Metadata: Version 2, 1998 (FGDC-STD-001 June 1998)." Available at: http://www.its.nbs.gov/fgdc.metadata/version2/

[CORC 1999] CORC--Cooperative Online Resource Catalog. Available at: http://www.oclc.org/oclc/research/projects/corc/

[CorpsMet] United States. Army. Corps of Engineers (1999). "CorpsMet." Available at the Corps' "Geospatial Data Clearinghouse Node" Web page: http://corpsgeo1.usace.army.mil

[Cox 1999] "DCBOX: Specification of the Spatial Limits of a Place, and Methods for Encoding This in a Text String" Available at: http://www.agcrc.csiro.au/projects/3018CO/metadata/dcbox/

[DC-7] 7th Dublin Core Metadata Workshop, October 25-27, 1999, Die Deutsche Bibliothek Frankfurt am Main, Germany. Available at: http://www.ddb.de/partner/dc7conference/

[DC 1999] Dublin Core Metadata Initiative. (1999). Available at: http://purl.org/DC/

[FGDC 1998] Federal Geographic Data Committee. (1998). "Content Standard for Digital Geospatial Metadata, Version 2, 1998." Available at: http://www.fgdc.gov/metadata/contstan.html

[FGDC 1999?] Federal Geographic Data Committee. (1999?). "Metadata Elements Included in the Metadata Entry System." Retrieved 9 September 1999 from: http://www.fgdc.gov/clearinghouse/metadataesystem/mes_description.html

[Foley 1999] Foley, Dan. (1999). "Metadata in a Digital Special Library: the Energy and Environmental Information Resources Center in Lafayette, Louisiana." Journal of Southern Academic and Special Librarianship: 01 [iuicode: http://www.icaap.org/iuicode?62.01.02.04 ]

[Furrie 1998] Furrie, Betty. (1998). "Understanding MARC Bibliographic: Machine Readable Cataloging." Fifth edition reviewed and edited by the Network Development and MARC Standards Office, Library of Congress. Available at: http://lcweb.loc.gov/marc/umb/

[Gluck 1998] Gluck, Myke, and Bruce Fraser. (1998). "Usability of Geospatial Metadata or Space-Time Matters." Presented in the "Theory and Practice of the Organization of Image and Other Visuo-Spatial Data for Retrieval: From Indexing to Metadata" Session. American Association for Information Science 1998 Annual Meeting, Pittsburgh, Pennsylvania, 25-29 October 1998.

[Huberman 1997] Huberman, Bernard A., and Lada A. Adamic. (1997). "Novelty and Social Search in the World Wide Web." Xerox Palo Alto Research Center. Available at: http://www.parc.xerox/istl/groups/iea/www/internetecologies.html

[Iannella 1999] Iannella, Renato. (1999). "DC Agent Qualifiers: DC Working Draft, 12 November 1999." Available at: http://www.mailbase.ac.uk/lists/dc-agents/files/wd-agent-qual.html

[Kuhn 1999] Kuhn, Markus. (1999). "A Summary of the International Standard Date and Time Notation." Available at: http://www.cl.cam.ac.uk/~mgk25/iso-time.html

[LC 1998] Library of Congress. Network Development and MARC Standards Office. (1998). "MARC 21: Harmonized USMARC and CAN/MARC." 22 October 1998. Available at: http://lcweb.loc.gov/marc/annmarc21.html

[LC 1999] Library of Congress. Network Development and MARC Standards Office. (1999). "Dublin Core/MARC/GILS Crosswalk." Available at: http://lcweb.loc.gov/marc/dccross.html

[Mangan 1997] Mangan, Elizabeth. (1997). "Crosswalk: FGDC Content Standards for Digital Geospatial Metadata to USMARC." Available at: http://alexandria.sdc.ucsb.edu/public-documents/metadata/fgdc2marc.html

[MetaMaker] MetaMaker. (1999). U.S. Geological Survey. Available at: http://www.emtc.usgs.gov/metamaker/nbiimker.html

[Miller 1999] Miller, Eric (1999). "DC Working Draft: 16 November 1999" [Qualifier Proposal for DC Coverage]. Available at: http://www.mailbase.ac.uk/lists/dc-coverage/files/wd-coverage-qual.htm

[NBII 1998] NBII National Program Office. (1998). "The NBII Biological Metadata Standard." Available at: http://www.nbii.gov/factsheet/factsheet3.html

[OCLC 1999] OCLC Online Computer Library Center, Inc. [home page] (1999). http://www.oclc.org/oclc/menu/home1.htm

[Park 1999] Park, Gary J. (1998). "Decimal Degrees to Degrees Converter" Available at: http://www.unn.ac.uk/~evgp1/gary/dec2deg.htm

[Shafer 1996] Shafer, Keith, Stuart Weibel, Erik Jul and Jon Fausey. (1996). "Introduction to Persistent Uniform Resource Locators." Available at: http://purl.oclc.org/OCLC/PURL/INET96

[Sosteric 1999] Sosteric, Mike. (1999). "ICAAP eXtended Markup Language: Exploiting XML and Adding Value to the Journals Production Process." D-Lib Magazine 5, no. 2 (February 1999). Available at: http://www.dlib.org/dlib/february99/02sosteric.html

[Weibel 1999] Weibel, Stuart. (1999). "Qualifier Approval Timetable". Available on the DC-General mailing list at http://www.mailbase.ac.uk/lists/dc-general/1999-11/0012.html

[Wolf 1997] Wolf, Misha, and Charles Wicksteed. (1997). "Date and Time Formats." Available at: http://www.w3.org/TR/NOTE-datetime

10. Contact Information

Adam Chandler
Systems Librarian
Energy and Environmental Information Resources Center
University of Louisiana at Lafayette
700 Cajundome Blvd.
Lafayette, LA 70506
web: http://eeirc.nwrc.gov
email: adam_chandler@usgs.gov
tel: 318-266-8697

 

Dan Foley
Metadata Librarian
Energy and Environmental Information Resources Center
University of Louisiana at Lafayette
700 Cajundome Blvd.
Lafayette, LA 70506
web: http://eeirc.nwrc.gov
email: dan_foley@usgs.gov
tel: 318-266-8539

 

Alaaeldin M. Hafez
Research Scientist
Center for Advanced Computer Studies
University of Louisiana at Lafayette
P.O. Box 44330
Conference Center Room 459
Lafayette, LA 70504-4330 USA
web: http://www.cacs.usl.edu/Departments/CACS/
email: ahafez@cacs.usl.edu
tel: 318-482-5791
This work is supported in part by a grant from the U.S. Department of Energy (under grant No. DE-FG02-97ER1220).