Enhancing Search Capabilities of Legacy Internet Resources

Marie Erie, Michelle LeBlanc and Vijay Raghavan*
The Center for Advanced Computer Studies
University of Southwestern Louisiana

Abstract: Many information resources
available over the Internet only provide browsing and pre-defined navigation capabilities.
If the user is interested in simple word matches on documents as a whole, search engines
such as Yahoo or AltaVista may be adequate. For resources of high volume, the user should
be able to specify preferences and just obtain the relevant portions of a resource. In
this paper, we study the problem of enhancing the search capability of an existing
resource by implementing and evaluating two approaches: use of (i) a structured (relational)
database, or (ii) HTML documents enhanced with meta tags.

The World Wide Web (WWW) has enabled widely dispersed audiences to access a spectrum of
information resources. A variety of Internet search tools [1] exist for retrieving links
to documents of potential user interest. Yahoo and AltaVista are examples of search
engines that normally examine entire documents, but may allow the user to search only the
title or only the body of HTML [2] documents. Searches are for simple matches on limited
combinations of words or a phrase input by the user. Results are returned as a series of
pages with links to documents that satisfy the request. Based on the brief text associated
with a link, the user may then navigate to a document, or a collection of documents within
a given Web site, and perhaps decide to search within that resource. Of course, a resource
of interest may be obtained by other paths, such as a link from a parent document or
directly from a known URL address. Once obtained, it is often the case that the server
offers the user little more than simple navigational links based on a predefined breakdown
of the information at the site. Alternatively, one could use the FIND function of the
browser to look for simple word patterns. For successful and efficient searching through
resources of high volume, the user should be able to specify one or more terms of interest
(e.g. of a certain semantic type) in a search request that returns only that portion of
the document that matches the request.

To this end, in this paper, we address the issue of enhancing the search capabilities
of legacy Internet resources (i.e., old-style, pre-existing resources developed prior to
the advent of more modern methods). In section II, we begin by establishing the
requirements and desired features of a search tool designed to augment an existing
resource. Section III introduces two approaches to creating a user search interface by
which the user may find desired information from a resource. In section IV.A, we describe
a large online document local to the University of Southwestern Louisiana (USL) as an
exemplary legacy Internet resource offering only browsing capability. In sections IV.B and
IV.C, we detail two implementations of a search tool for this resource. The two approaches
are modeled after the architectures described in section III. Both implementations enable
the user to specify search terms in particular fields of the resource in order to obtain
only that small portion of the resource matching the sets of field-term pairs. In section
IV.D, we contrast the two approaches with respect to the requirements outlined in section
II. Section V concludes by stating our preference between the two implementations.

Given that the condition of an existing Internet resource may warrant the construction
of a tailored search tool subsequent to the resource's availability online, we
outline features to be considered in designing such a tool. We propose the following
criteria for assessing alternative approaches.

With these guidelines in mind, we now outline two Internet-computing design approaches
that may be applied to the task of Internet resource search enhancement.

HTML documents are one type of document that a WWW server presents for display on
the client's browser. HTML tags dictate the document display format. CGI (Common Gateway Interface) is a
mechanism that allows servers to execute external CGI programs that accept user-specific
data from the client via HTML forms [7]. Figure 1 shows the architecture for this
interface. These HTML documents are authored with FORM, INPUT, and SUBMIT tags. This
allows the user to place values into the form. When the user submits the form, the values
are passed on to the server via a URL request that calls the CGI program which processes
the user input. The figure shows the architecture for a CGI program that queries a
structured database (DB) system by using embedded SQL (Structured Query Language)
statements. The program completes execution by generating a new HTML document that
contains the program's results. The server passes this document back to the Web
browser for display. There may be several iterations of this cycle, with new user input,
before the user is satisfied that information retrieval is complete. The HTTP protocol by
which these processes are spawned does not preserve the state of execution from one cycle
to the next. In order to maintain state information, the CGI program is written such that
when it outputs the results to the user, as a new HTML document, it includes state
information from the previous cycle in hidden fields, denoted by user INPUT tags with a
HIDDEN attribute.

A potential bottleneck in this approach is that, since processing is restricted to the
server side, a large client base could place a heavy load on the server. In addition, the
lack of statefulness associated with HTTP can lead to excessive data transfer during a
cycle of execution.

With respect to the topic of this paper, the structured database in the figure can
be thought of as the original data from which the online resource was created. In the
event the database is not structured or cannot be accessed by SQL, the database
must be ported to a structured form.

The Isite/Isearch system [3,4,9] is a text retrieval system that allows one to
retrieve documents according to several classes of queries. We describe this alternative
approach to providing search enhancement assuming that Isite is used as the information
retrieval engine. This alternative uses a client/server architecture involving Z39.50
protocol and thus maintains state information throughout a search session. The
architecture of such an application is shown in Figure 2. The architecture is similar to
that of RDB/CGI applications shown in Figure 1, with the distinction that HTTP is a
transaction-oriented protocol whereas Z39.50 is session-oriented, meaning it preserves state.
When a client contacts a server with a Z39.50 protocol request, the server establishes a
connection to a Z39.50 server via the Z39.50 Gateway. The Z39.50 Gateway and Server
function in a way somewhat analogous to the CGI program in the RDB/CGI architecture. That
is, the Z39.50 server executes the search against a database (or a distributed database).
The database system shown uses Isite's Isearch/Iindex software, which adheres to
Z39.50 information retrieval standards. The Isite database is composed of documents,
indexed by Iindex, accessible by Isearch.

Isearch features give the user many options for composing queries with search and
target elements, options not offered by an SQL-embedded CGI application without significant
programming effort. The Simple Search allows the user to perform case-insensitive search
on one or more search elements (fields). Partial matching to the left is allowed. The
Boolean Search allows the user to compose a two-term query where the two terms are related
by one of the Boolean operators AND, OR, or ANDNOT. "Full Text" is the default
search domain unless the user selects a particular element for a term from the term's
pull-down menu. The Advanced Search form accepts more complex Boolean queries that are
formed by nesting two-term Boolean expressions; for example: "Acadian AND (language
OR dialect)". To narrow a search domain from "Full Text" to hits occurring
within a single search element, the term is prefixed with the element name and a forward
slash. For example, the query "AUTHOR/Dobbs AND TITLE/education" searches for
Dobbs only in the AUTHOR element and for education only in the TITLE element.

The information targeted for return by a query may be specified by choosing target
elements from a pull-down menu. The user may also choose to view a maximum number of items
in the results set at a time. Isearch is capable of performing a weighted search based on
search term frequency. A term's weight increases in proportion to the term's
occurrence frequency within a document, but decreases as the number of documents in which
the term appears increases. The statistics for all search terms are combined to establish
a ranking among the members of the results set. The results set is ordered, for viewing,
with the highest ranked results first.

To compare these two design architectures, we selected an existing Internet resource
that has a real-world community of users, who expressed to us a wish for a
better search interface to this resource. We have implemented two different
document-specific search engines that address their needs. The next section is devoted to
the details of this work.

The Bayou State Periodical Index [10] is a guide to Louisiana periodicals which
is published yearly by the USL Edith Garland Dupre Library. It is organized into three
main sections. The first is a listing of cited Journals with publisher and distribution
information. The second and largest section, the Subject Index, lists all journal
citations alphabetically by subject. The last section is the Author Index which lists all
journal citations alphabetically by author. The existing Internet resource for this index,
The 1996 Abridged Bayou State Periodical Index, is an HTML document located at
URL: http://www.usl.edu/Departments/Library/departments/larm/abspi.html. It contains
approximately one quarter of the entries from the unabridged hard-copy version. This
resource is limited in that it only allows the user to browse the index sequentially by
subject. Additionally, the entire document is partitioned into four segments grouped by
alphabetical links for navigational purposes: A-C, D-K, L-P, Q-Z. The citations in the
unabridged (hard-copy) index contain one or more of the following attributes: subject,
author, title, journal (or periodical), volume, issue number, pages, date, and year. In
addition, in the subject index, there may exist cross-references to other subjects,
prefixed by See or See Also. For such resources of high volume, information
retrieval would be greatly enhanced by a search engine that allowed the user to go
directly to only one or a few citations by specifying values of some subset of the
available attributes (search elements). We addressed this problem by implementing the
architectures described in section III above. The details for each are outlined in the
following subsections.

In the implementation discussed in this section, we chose to create a search tool that
accesses the original source data used for the hard-copy publication rather than process
the abridged version located at the URL noted in the previous section. This
implementation is available at http://www.cacs.usl.edu/cs561-bin/Bayou.cgi.

The original data for the Bayou State Periodical Index resides in an AUTHEX Plus
[11] database system in two main files. The "database" file is composed of
records containing data on bibliographic citations. Each record is uniquely identified by its
<title> field and contains other fields for entering the citation data. A
"database" record also includes a <subject> attribute whose value is a
list of subjects under which the citation is filed. The "subject" file
contains records that are uniquely identified by <subject>. Two of the subject file fields
relevant to this implementation are the See and See also subject
cross-referencing fields. The "database" and "subject" files are
ultimately joined by the <subject> field, even though a "database"
record's subject field may contain a list and a "subject" record's
subject field may not. Since the AUTHEX Plus software does not provide
host-language-based access to its data, the "database" and "subject" files
were obtained in export format: ASCII files with delimited records and fields. The files
were parsed for relevant data items which were then loaded into a relational database,
ORACLE® [12]. Proprietary structure existing within a field in the
AUTHEX Plus export files is accounted for at the outset during parsing, thus
precluding any need for manual editing of the load data files. The tables comprising the
schema are defined as follows:

SUBJECT: SUBJ_NO, SUBJ_NAME
SUBSUB: SUBJ_NO, REF_TYPE, REFERENCE
BIBLIOGRAPHY: BIB_NO, TITLE, JOURNAL, VOLUME, ISSUE, PAGES, MONTH, YEAR
BIBSUB
AUTHORS

Assuming some knowledge of structured databases, these tables are self-explanatory, with perhaps the exception of the SUBSUB table. Its REFERENCE field holds a subject number value for one of the two mutually exclusive subject cross-reference types, See and See also. REF_TYPE holds a flag to indicate which reference type is in the REFERENCE field. These fields were placed in a separate table since most subject entries have NULL cross-reference fields. The overall design was chosen so as to compress the data while allowing full join capability.

We organized the originating HTML form into four options (selected via radio buttons): a journal list, an author list, a subject search form, and an author search form. Selecting one of the four options and then the SUBMIT button invokes a CGI script which, depending on the query type, produces an HTML page specific to that search. Each page has a link back to the originating page so as to permit the user to perform another search. A link to the 1996 Abridged Bayou State Periodical Index is also provided.

The journal list submission returns the results of a query to the BIBLIOGRAPHY table and is simply a listing of all journals cited in the Bayou State Periodical Index. The author list submission similarly returns the results of a query to the AUTHORS table. It is an HTML form which alphabetically lists all authors associated with bibliographic entries. The list is headed by an alphabetical index that allows the user to link to a particular section of the list. Each author's name is the value of a radio button; when a name is selected and submitted, the CGI script initiates a new database search for all bibliographic citations by the chosen author. The results are returned as an HTML page in the format found in the Bayou State Periodical Index's Author Index.

The Subject Search and Author Search forms allow the user to input a subject and an author name, respectively.
These searches allow for partial matching and are case-insensitive. An optional "year" field allows the user to narrow the search further. This field was included in anticipation of the addition to the database of journal citations for future years. All matching bibliographic entries are returned, organized by subject or author, depending on which form was submitted. The results returned for the Subject Search have the added feature that the See and See also cross-reference fields, if present, have the associated subject as the value of a radio button. The cross-reference may be selected and submitted, taking the user immediately to the results of that subject search. It might be desirable to add optional author input to the subject form, and vice-versa, allowing the user to refine the search. Furthermore, any other attribute of the tables could be included in a form for search purposes, accompanied by minor changes to the CGI script.
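The Subject Search described above can be sketched as a single parameterized query. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module in place of ORACLE and the original CGI script, the BIBSUB columns (BIB_NO, SUBJ_NO) are assumed because the text does not list them, the function names are hypothetical, and the rows are invented placeholders rather than entries from the index.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory stand-in for part of the schema (placeholder rows)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE SUBJECT (SUBJ_NO INTEGER PRIMARY KEY, SUBJ_NAME TEXT);
        CREATE TABLE BIBLIOGRAPHY (BIB_NO INTEGER PRIMARY KEY, TITLE TEXT,
                                   JOURNAL TEXT, YEAR INTEGER);
        -- ASSUMPTION: BIBSUB links citations to subjects; its columns are
        -- not given in the paper.
        CREATE TABLE BIBSUB (BIB_NO INTEGER, SUBJ_NO INTEGER);
    """)
    conn.executemany("INSERT INTO SUBJECT VALUES (?, ?)",
                     [(1, "Acadian Language"), (2, "Wetlands")])
    conn.executemany("INSERT INTO BIBLIOGRAPHY VALUES (?, ?, ?, ?)",
                     [(10, "Sample title A", "Sample Journal", 1996),
                      (11, "Sample title B", "Sample Journal", 1995),
                      (12, "Sample title C", "Other Journal", 1996)])
    conn.executemany("INSERT INTO BIBSUB VALUES (?, ?)",
                     [(10, 1), (11, 1), (12, 2)])
    return conn


def subject_search(conn, term, year=None):
    """Case-insensitive partial match on subject name, optionally limited to a year."""
    sql = ("SELECT s.SUBJ_NAME, b.TITLE, b.JOURNAL, b.YEAR "
           "FROM SUBJECT s "
           "JOIN BIBSUB bs ON bs.SUBJ_NO = s.SUBJ_NO "
           "JOIN BIBLIOGRAPHY b ON b.BIB_NO = bs.BIB_NO "
           "WHERE UPPER(s.SUBJ_NAME) LIKE '%' || UPPER(?) || '%'")
    params = [term]
    if year is not None:
        # The optional "year" field narrows the search further.
        sql += " AND b.YEAR = ?"
        params.append(year)
    sql += " ORDER BY s.SUBJ_NAME, b.TITLE"
    return conn.execute(sql, params).fetchall()
```

Binding the user's term as a parameter keeps form input out of the SQL text itself. Note that a `%` or `_` typed by the user would still act as a LIKE wildcard in this sketch.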
In this approach, the existing resource, The 1996 Abridged Bayou State Periodical Index, was standardized by extending the HTML document with metatags. These are tags which delimit data elements in the document to indicate information about those elements. The syntax of their use is the same as that of standard HTML tags, but a Web browser ignores them. One HTML document was produced for each set of citations under <subject>. This set of documents was indexed by the Isite software [9] for subsequent searching using the Isearch utility [3,4]. The inserted metatags are recognized by this software.

The Dublin Core scheme [13], which has been proposed as a metadata standard for bibliographic citation data, was used. The Dublin Core metadata element set is a core set in that it is a small number of elements of general applicability. Of this set, the ones relevant to the current implementation are Subject, Title, Author and Date. The extensibility of the Dublin Core scheme allows for the addition of other elements to the set. To enhance the search and selectively display certain target portions of the existing resource, Periodical, Volume and Page elements were added to the set. Figure 3 shows an example of one entry from the index's HTML source.
Figure 3: HTML tags used in the index

It was noted that <B> and </B> enclosed the subject, </B> and <I> enclosed the title, <I> and </I> enclosed the periodical, and </I> and </BR> enclosed the volume, pages, and date. The online resource was parsed and reproduced in a form that could be used for indexing and later search. Figure 4 shows the HTML metadata resulting from parsing the citation shown in Figure 3.
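The tag pattern noted above can be turned into a small parser. The following is a minimal sketch, not the parser used in this work: the function name, the dictionary keys, and the sample entry in the usage note are hypothetical, and the volume, pages, and date are left together in one field because their internal layout is not described here.

```python
import re

# One citation, following the tag pattern described in the text:
# <B>subject</B> title <I>periodical</I> volume, pages, date </BR>
ENTRY = re.compile(
    r"<B>(?P<subject>.*?)</B>"
    r"(?P<title>.*?)"
    r"<I>(?P<periodical>.*?)</I>"
    r"(?P<rest>.*?)</?BR>",
    re.IGNORECASE | re.DOTALL,
)


def parse_citation(html):
    """Return the delimited fields of one entry, or None if it does not fit."""
    match = ENTRY.search(html)
    if match is None:
        # Entries that deviate from the pattern are left for manual editing.
        return None
    # Collapse runs of whitespace inside each captured field.
    return {name: " ".join(value.split())
            for name, value in match.groupdict().items()}
```

For a made-up entry such as `<B>Acadians</B> A sample title. <I>Sample Journal</I> 12:3-9, Mar. 1996</BR>`, the function returns the subject, title, and periodical as separate fields and the remaining text as `rest`, which could then be wrapped in the chosen metatags.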
Figure 4: HTML meta-data used for citation indexing

The syntax in Figure 3 is not consistent throughout the resource, and some manual editing was required to produce accurate HTML metadata for all citations. For example, many citations have no entries for some of the fields. In addition, after the subject field, a small percentage of the citations have the See and See also fields delimited by the <I></I> tag pair. Similar inconsistencies were found in other fields.

The interface for the indexed abridged dataset offers links to a Simple Search page, a Boolean Search page, and an Advanced Search page. This menu is located at URL: http://www.cacs.usl.edu/cs561-bin/wmb2/wmb2.cgi. The Simple, Boolean and Advanced Searches are discussed above (in III.B). Search and target elements are: Subject, Title, Periodical, Volume, Page and Date. A range of dates is allowed. This built-in Isearch feature may become relevant in future years when a separate HTML page for another year is expected to be available for processing. Detailed information about the user's results is available in the Search Summary entry at the top of each search's "results page". It shows how many times each query term occurred and in how many documents, as well as the total time required to perform the search. An on-line Help page is accessible via a Netscape browser.
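The per-term counts reported in the Search Summary are the kind of statistics that a weighted search of the sort described in section III.B relies on. As a rough illustration of that idea, the sketch below uses a common textbook weighting (term frequency multiplied by a logarithmic inverse document frequency). This is an assumption made for illustration; the exact formula used by Isearch is not given here, and the function name and sample documents are hypothetical.

```python
import math
from collections import Counter


def rank(documents, query_terms):
    """Rank documents by summed term weights, highest score first.

    documents: dict mapping a document name to its text.
    A term's weight grows with its frequency in a document and shrinks
    as the number of documents containing the term grows.
    """
    n_docs = len(documents)
    counts_by_doc = {name: Counter(text.lower().split())
                     for name, text in documents.items()}
    scores = {}
    for term in (t.lower() for t in query_terms):
        doc_freq = sum(1 for counts in counts_by_doc.values() if counts[term] > 0)
        if doc_freq == 0:
            continue  # term occurs nowhere, so it contributes nothing
        # ASSUMED weighting: log-scaled inverse document frequency plus one.
        idf = math.log(n_docs / doc_freq) + 1.0
        for name, counts in counts_by_doc.items():
            if counts[term]:
                scores[name] = scores.get(name, 0.0) + counts[term] * idf
    # Sort by descending score, breaking ties by document name.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

With three toy documents, "acadian language acadian", "acadian dialect", and "wetlands", a query for the terms acadian and dialect ranks the second document first: the rarer term "dialect" outweighs the extra occurrence of "acadian" in the first document.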
Equipped with the above details of both implementations, we turn to evaluating the degree to which they meet the requirements and preferences outlined earlier.
The ability of CGI scripting to provide special-purpose applications enabled us to construct query types that are not possible with Isearch without significantly modifying its source code. Specifically, we were able to return results with values associated with radio buttons: (1) subjects associated with See and See also in Subject Search returns and (2) author names in the Author List return. The ability to initiate a new search by selecting these buttons is a very attractive feature of the RDB/CGI approach. In the current HTML metatag implementation, the user would have to make a note of any subject referenced by See or See also, return to some type of search input form, enter the subject name and submit. Since the online periodical index has been generated without author data, author searches are not available in the latter implementation.
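The radio-button mechanism credited to the RDB/CGI approach above can be sketched as a function that renders a results-page fragment. This is a hedged illustration rather than the actual CGI script: the function name, the field names `subject` and `year`, and the action path are hypothetical. It also shows a hidden field carrying state into the next request, as described in section III.

```python
from html import escape


def render_cross_refs(cross_refs, year=None, action="/cgi-bin/search.cgi"):
    """Render See / See also cross-references as a submittable HTML form.

    cross_refs: list of (reference_type, subject_name) pairs.
    year: optional search state to carry forward in a hidden field.
    action: HYPOTHETICAL script path, not the URL used in the paper.
    """
    lines = ['<FORM METHOD="POST" ACTION="%s">' % escape(action, quote=True)]
    if year is not None:
        # HTTP keeps no state between requests, so the year travels with the form.
        lines.append('<INPUT TYPE="HIDDEN" NAME="year" VALUE="%s">'
                     % escape(str(year), quote=True))
    for ref_type, subject in cross_refs:
        # Each cross-referenced subject becomes the value of a radio button.
        lines.append('<INPUT TYPE="RADIO" NAME="subject" VALUE="%s"> %s %s<BR>'
                     % (escape(subject, quote=True), escape(ref_type), escape(subject)))
    lines.append('<INPUT TYPE="SUBMIT" VALUE="Search">')
    lines.append('</FORM>')
    return "\n".join(lines)
```

Selecting a button and submitting the form sends the subject name back to the script, which can then run the subject search directly, so the user does not have to retype the cross-referenced subject.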
Although the use of HTML metatags for indexing and searching is a new and sometimes desirable approach to the problem of search enhancement, the peculiarities of a given resource may prove cumbersome enough to warrant the use of more traditional methods such as porting one database to another, even at the expense of duplicate storage. The original database for the resource of this study is maintained locally, making the relational database approach a viable option. Weighing the above requirements, we suggest that, when the source database is available, it is preferable to construct a tailored search engine that accesses the source directly rather than adopt a method requiring the processing of an output document from that source.

Acknowledgments: The implementation using HTML metatag insertions and use of Isite/Isearch software as described in section IV.C was created by Wael M. Badawy, a Ph.D. student in computer science at CACS, USL. We acknowledge Ms. Sheryl Moore and Ms. Jean Kiesel of the USL Dupre Library and Ms. Judy Buys of the National Wetlands Research Center for their guidance during the implementation of the systems described in this paper. This work is supported in part by a grant from the U.S. Department of Energy (under Grant No. DE-FG02-97ER1220).
Author biography: Marie Erie and Michelle LeBlanc are Computer Science Ph.D. students at the Center for Advanced Computer Studies (CACS), University of Southwestern Louisiana. Marie's interests are in medical visualization and physics-based modeling in computer graphics. Michelle's interests include artificial intelligence, database systems, and user interface design. Vijay Raghavan is a professor of computer science at CACS. His research interests are in information retrieval, database mining and Internet computing. He directs a project, funded by the DoE, aimed at developing an Energy and Environmental Information Resources Center.

REFERENCES