Why Don't Search Engines Work Better?

Let's face it. The problem isn't with the search engines. Adding more power, more search features, or more Web pages won't solve the problem. The reason for lousy search results isn't the engines but the Web itself, namely HTML.

HTML

Almost all Web pages are constructed using the hypertext markup language (HTML), whether version 2.0 or the newer 3.0. What "structure" the Web has is that given to it by HTML. What the search engines do is index the words in a Web document as they relate to the HTML tags used in the document. If a search engine finds a search word or phrase in the area defined by the <title> tags, it gives the document a higher score than if the word or phrase is found only in the first paragraph or farther down the page. The search engine developers won't reveal the exact algorithms used, since this is a very competitive environment, but what the search engines do is primarily a machine process for searching, scoring, and returning in ranked order the Web pages that purport to satisfy a particular search strategy. The problem is that HTML does a fairly poor job of describing the contents of a document, and the search engines are a long way from artificial intelligence.

HTML is essentially a presentation language. It defines the "look and feel" of a document, so that the browser knows how to display the document in its window. Beyond the <title> and heading (<h1> through <h6>) tags, there is little useful information regarding the intellectual content or structure of the document. HTML is simply not up to the task of describing Web documents in enough detail for a search engine to do a good job. The ideal solution would be a markup language that defines in significant detail what the document is about. For a time it was thought that the standard generalized markup language (SGML) would provide the solution, but the sheer complexity of SGML makes it too difficult for general use.
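The tag-based scoring described above can be sketched in a few lines of Python. This is only an illustration, not any engine's actual method: the developers do not publish their algorithms, so the weights, the names score_page and rank_pages, and the simple word-matching rule are all assumptions made for the example.

```python
import re
from html.parser import HTMLParser

# Hypothetical weights; real search engines do not publish theirs.
TAG_WEIGHTS = {"title": 10.0, "h1": 5.0, "h2": 3.0}
BODY_WEIGHT = 1.0  # assumed weight for text outside the tags above


class TagWeightedScorer(HTMLParser):
    """Adds up a score for one search word based on where it appears."""

    def __init__(self, term):
        super().__init__()
        self.term = term.lower()
        self.open_tags = []  # stack of currently open weighted tags
        self.score = 0.0

    def handle_starttag(self, tag, attrs):
        if tag in TAG_WEIGHTS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()

    def handle_data(self, data):
        # Count whole-word, case-insensitive matches in this run of text.
        hits = re.findall(r"[a-z0-9]+", data.lower()).count(self.term)
        if self.open_tags:
            weight = TAG_WEIGHTS[self.open_tags[-1]]
        else:
            weight = BODY_WEIGHT
        self.score += hits * weight


def score_page(html, term):
    """Score one HTML page for a single search word."""
    scorer = TagWeightedScorer(term)
    scorer.feed(html)
    scorer.close()
    return scorer.score


def rank_pages(pages, term):
    """Return (url, score) pairs sorted with the best score first."""
    scored = [(url, score_page(html, term)) for url, html in pages.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

With these assumed weights, a page that uses the word once in its <title> outranks a page that uses it twice in body text only. That is exactly the weakness at issue: the score reflects where a word sits in the markup, not what the document is about.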
Beyond its complexity, SGML has another drawback: it turns out that the best person to apply it is the author of the document! No simple and accurate machine conversion of plain text or HTML documents into SGML exists, at least to my knowledge. Also, attempts by programmers or other technicians to apply it to documents they did not author have been frustrating at best and dismal failures at worst. Most authors of Web documents are reluctant to spend the time necessary to learn SGML or, for that matter, the time necessary to apply it. They would rather put their thoughts and ideas down on paper (or in digital form) and then move on to their next project. As far as I know, no markup language that is easier to use and apply than SGML is waiting in the wings to solve this problem.

Metadata

This brings us to the subject of "metadata." (If you have not heard this word before, I suspect you will hear much more of it soon.) Simply stated, metadata is "data about data." A library catalog entry is metadata. The MARC record is a structure for metadata in machine-readable form. Efforts to apply metadata to this problem are slowly getting off the ground. Someone must devise a system of metadata document definitions of sufficient specificity and clarity to enable most documents now found on the Web to be easily and quickly described. One such example is the Dublin Core (named for Dublin, Ohio, not Ireland), a core set of metadata elements defined at an invitational workshop co-sponsored by OCLC (Online Computer Library Center) and NCSA (National Center for Supercomputing Applications) in March 1995. The Dublin Core is a good first step in identifying what elements of metadata are necessary to define an electronic document. The following fifteen elements make up the Dublin Core Metadata Element Set, as of January 15, 1997:
TITLE

For more information about the Dublin Core, good overviews can be found at the following URLs:

gopher://marvel.loc.gov/00/.listarch/usmarc/dp99.doc

Researchers are following many lines of inquiry, some seeking ways to embed metadata within HTML tags, others creating a whole new markup language. Once a system is devised, it should be possible to describe Web documents in enough detail that the incredible power available in and to search engines can be brought to bear, and we will be a step closer to making search engines earn their keep.

If you want to know more about metadata and library resources, go to IFLA's Digital Libraries: Metadata Resources Web site at www.nlc-bnc.ca/ifla/II/metadata.html. This comprehensive site has links to various Web resources for metadata projects, definitions, standards, and background documents. Or you might just go to your favorite search engine (I chose Infoseek) and peruse the thousands of Web pages you can find under the term "metadata." Enjoy!

by Michael Perkins. Perkins is business reference librarian at San Diego State University. He may be reached via e-mail at: mperkins@mail.sdsu.edu; or visit his Web page at: http://libweb.sdsu.edu/busi/perkins.html.

For more information on "On the Net," or to contribute to the column, please contact Sharyn Ladner at: 1-305-284-4067; fax: 1-305-665-7352; e-mail: sladner@umiami.ir.miami.edu.
Copyright © 1997 SLA. All rights reserved. This page was updated on May 6, 1997.