Internet Power Searching:
Finding Pearls in a Zillion Grains of Sand

by Amelia Kassel



 

During the past two years, web content has expanded enormously. Global access to hundreds of government resources and agencies worldwide, more than 1,400 Internet-based online public access catalogs (OPACs) from libraries on every continent [1], professional and trade associations, and experts in millions of subjects are just a few examples of categories of information not readily found online in the past. As the Internet erupted, search engines, metasearch engines, and intelligent agents with value-added features came on the scene and gradually refined their offerings, turning information retrieval into a more organized process than ever before. Traditional vendors used by professional searchers also became accessible on the web. For example, The Dialog Corporation, Dow Jones Interactive, LEXIS-NEXIS, OCLC FirstSearch, Ovid, SilverPlatter, and STN all now provide web-based database searching [2]. In addition, a 1997 survey of database producers on the web found remarkable progress [3]: of fifty-four leading databases from thirty-eight database producers, thirty-five searchable databases were either on the web or had been announced. Added to these, new entrepreneurial publishers, also called niche market research boutiques, entered the market. This incredible growth has made the Internet the major research tool of the late twentieth century, although not without some serious shortcomings. Unfortunately, much time can be spent, and wasted, when searching without knowing the tricks of the trade. Furthermore, the search engines are constantly changing, growing, and improving in their quality and capabilities for locating needed information. As a result, library and information professionals must learn new skills and incorporate them into their daily activities. There is no doubt that the technology has come a long way, but it still has a long way to go, and improvements are on the horizon. Nevertheless, a major challenge for information professionals is knowing how to find what's needed.

 

Search Engine Size

An April 1998 article in Science measured the size of the Internet and reported 320 million pages at that time [4]. This figure has grown to more than 380 million, plus hundreds of databases, in recent months. Nevertheless, one of the search engines, HotBot, has estimated that only 200 million pages are searchable within its system. These numbers, along with other information about search engine coverage, indicate that a large proportion of the web is not reachable at all through search engines. According to Danny Sullivan (http://searchenginewatch.com), there are both technical and physical reasons that search engine coverage is incomplete. Some of the reasons are:

Information retrieval technology may not necessarily require exact matches and returns pages with related words.

Documents that don't exist anymore are returned.

Documents are changed after an index picks them up.

Most search engines cannot index frames or image maps.

Search engines do not index sites that deliver information from complex databases, for example, such sites as Amazon.com (http://www.amazon.com), an online bookstore, or Mediafinder.com (http://www.mediafinder.com), a database of magazines, newsletters, journals, and mail order catalogs.

Sites that require passwords are not returned.

Sites that use a robots.txt file to keep files or directories off limits prevent search engines from indexing that material.
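
The robots.txt exclusion mentioned in the last item can be illustrated with a minimal Python sketch. It uses the standard library's urllib.robotparser module; the robots.txt rules, the example.com addresses, and the spider name "ExampleSpider" are all made up for illustration and do not describe any real site or search engine.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the paths are invented for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /drafts/
"""


def allowed_urls(robots_txt, urls, user_agent="ExampleSpider"):
    """Return only the URLs that a rule-abiding spider may fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


urls = [
    "http://www.example.com/index.html",
    "http://www.example.com/private/report.html",
    "http://www.example.com/drafts/notes.html",
]

# Only the pages outside the disallowed directories remain indexable.
print(allowed_urls(ROBOTS_TXT, urls))
```

A spider that honors these rules never sees the two excluded pages, so no search on that engine can return them.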

Since so many web sites cannot be reached, it is important for researchers to amass knowledge about a range of resources useful for uncovering information not found by search engines, as well as to learn how to use search engines for a range of requests.

 

Focus on Big

The new Internet economy has brought about the development of competing search engine companies, each with its own proprietary software. Sites are collected and updated differently. After a search is conducted, one search engine may provide exactly what's required within the first ten hits whereas another is useless. Frequently, there is tremendous overlap, although no two search engines are exactly alike. Since the outcome varies from search engine to search engine, researchers often find it necessary to use several search engines for the same question to get either the best or the most comprehensive results. The larger the index compiled by a search engine, the greater the chance of finding obscure material. Spiders or crawlers constantly visit sites to create searchable catalogs or indexes of web pages. Results are sorted or ranked by relevancy based on individual proprietary algorithms. Although dozens of search engines now exist, the focus here is on those that are big. One of the major search engines is AltaVista (http://www.altavista.com). It began operation in 1995 and is one of the largest. It remained unchallenged until September 1997, when HotBot (http://www.hotbot.com) began to compete and surpassed it in number of pages indexed at that time. Other search engines of note are Excite (http://www.excite.com) and Northern Light (http://www.northernlight.com). In fact, early this year, Greg R. Notess (http://www.notess.com/search) suggested that Northern Light now ranks first, followed by AltaVista and HotBot. Another very well-known and useful site is Yahoo (http://www.yahoo.com), the oldest web directory, with some 750,000 sites. It is based on user submissions and staff selections. All of the search engines mentioned here, plus Yahoo, have expanded and improved, whereas others have tapered off in size or completely disappeared. Some key features of the largest search engines follow.

 

AltaVista (http://www.altavista.com)

One of the most powerful and popular search engines.

Good for specific searches.

Offers an advanced query feature with more search options.

Allows for a natural language query.

Provides a translator between English and five languages that is useful but has been criticized as not "too good".

Offers Boolean and proximity searching.

Includes field searching.

Flaws in the retrieval algorithm have been found in the past.

AltaVista is not as user-friendly as HotBot, but once mastered is the favorite for many.

 

Excite (http://www.excite.com)

Good for searches on broad, general topics.

Fast access to a small number of relevant sites.

Adds interesting extras like a simultaneous search of the web, news headlines, sports scores, and company information and groups the relevant results on a single page.

If you find a site that is on target, you can click on "search for more documents like this one" and the search engine finds more of the same, although it doesn't work well for all types of queries.

Includes a service called NewsTracker for selecting subjects of interest and receiving daily alerts from 300 news sources.

Provides a user-friendly travel site for booking airline reservations.

A power search capability broadens the scope of a search.

Boolean searching is available by default on the home page.

HotBot (http://www.hotbot.com)

Provides a very user-friendly interface with pull-down menus.

Search results appear quickly.

Recent changes integrate material generated by human editors into the service.

Users can review 100 results at a time, which is important for quick scanning when a large number of hits are worth reviewing.

Boolean searching is an option.

Searching by continents can prove useful for some research.

HotBot was at one time the most current search engine, providing a new index every two weeks, although more recently it has been criticized for lack of freshness. This is supposed to be corrected.

Field searching can narrow research.

Stemming is now provided.

 

Northern Light (http://www.northernlight.com)

Provides content that encompasses both the web and Northern Light's Special Collections, which are articles that can be purchased from more than 5,000 publications on a pay-as-you-go basis for $1.00 to $4.00 each. Some of these publications are not available from other major commercial vendors.

Advanced, power, and industry searches narrow results by document type such as press release or product type.

Automatically refines every search by creating Custom Search Folders with similar sites by subject, source, or type.

Enterprise accounts for corporations and organizations are available.

 

Yahoo (http://www.yahoo.com)

A directory or catalog of web sites, valuable for searching broad general topics.

Contains 750,000 sites.

Offers World Yahoos, i.e., country versions.

Drill down through categories or, with a click, have the query originally sent to Yahoo "piped" (forwarded) to a major search engine. This is especially useful since Yahoo is selective rather than all-encompassing like the other search engines mentioned here.

Supports inclusion/exclusion, phrases, wildcards, and title and URL limiters.

 

DejaNews (http://www.dejanews.com) and Reference.com (http://www.reference.com)

Both DejaNews and Reference.com are search engines for newsgroups or mailing lists and can be used to identify experts who participate in various discussion groups, review major trends, or see what's being said about a company, product, or topic.

 

Where to Start

Where and how to search depends on research goals and needs. Indeed, whether to use the Internet or a traditional database is often the first decision and whether to use a narrow or broad strategy is another consideration. Fundamentally, it's necessary to become familiar with several major search engines and select the right one for the job. Much Internet research is trial and error and serendipity, too. Nonetheless, self-education is necessary and preparing for Internet research involves visiting major search engine sites to review how each works. The more that is known about a particular search engine, the better prepared the searcher will be to decide which is appropriate for each request. Each search engine provides detailed instructions about basic or simple searches and how to use more advanced or power searching techniques. Before searching, it's important to plan the search by considering unique words, phrases, and synonyms that describe the topic. Once a search is conducted, a review of results can lead to reformulating the search when what you are looking for is not found. If you find yourself spending too much time at one site, move on to the next search engine. Search results often improve when taking a search elsewhere.

 

Search Engine Basic Hints & Tips

Some search engines permit Boolean searching with AND, OR, or NOT.

Many search engines require the use of quotation marks around phrases.

Some search engines allow you to truncate a word and pick up variations but others do not.

Search engines typically do not look for articles such as "the" and "a," common connecting words such as "and" and "with," or heavily used adjectives.

Some search engines will not search on common words. HotBot, for example, ignores the search terms "Internet" and "web."
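
The hints above on phrases, truncation, and ignored words can be sketched in a few lines of Python. This is only an illustration of the general ideas, not how any particular search engine works: the stop-word list, the asterisk as a truncation symbol, and the function names parse_query and term_matches are all assumptions made for the example.

```python
import re

# Hypothetical stop-word list; real engines each keep their own.
STOP_WORDS = {"the", "a", "an", "with", "of"}


def parse_query(query):
    """Split a query into terms, keeping quoted phrases whole
    and dropping stop words that appear outside quotation marks."""
    terms = []
    for phrase, word in re.findall(r'"([^"]+)"|(\S+)', query):
        if phrase:
            terms.append(phrase.lower())
        elif word.lower() not in STOP_WORDS:
            terms.append(word.lower())
    return terms


def term_matches(term, text):
    """Match a term against text; a trailing * truncates the word."""
    text = text.lower()
    if term.endswith("*"):
        # Truncation: match any word that begins with the stem.
        return re.search(r"\b" + re.escape(term[:-1]), text) is not None
    # Otherwise require the whole word or phrase.
    return re.search(r"\b" + re.escape(term) + r"\b", text) is not None


print(parse_query('"market research" the restaurants with statistics'))
print(term_matches("librar*", "Public libraries and librarians"))
print(term_matches("cat", "Library catalog"))
```

The quoted phrase survives as a single term, the stop words drop out, and the truncated stem picks up both "libraries" and "librarians," while the untruncated "cat" does not match "catalog."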

Search Engine Advanced Hints & Tips

One of the best ways to refine searches is with power features such as field searching. Ran Hock explains that "fortunately, some web search engines do provide at least a rudimentary field search capability, but because of the immature nature of the engines, the options are neither very numerous nor particularly sophisticated." AltaVista allows date, title, URL, and language searching, plus a half-dozen other fields all related to the types of features included on the page, such as image and sound files. HotBot, similarly, provides date, title, and URL searching. In addition, it lets a user search for records that contain a sound or video file, search by page depth, search by the words included in hypertext links, and search for the presence of a variety of scripting languages and plug-ins. For a detailed discussion on this subject, see Hock's article "How to Do Field Searching in Web Search Engines: A Field Trip" [5].
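
To show why field searching narrows results, here is a minimal Python sketch. The Page record, the field_search function, and the two example.com pages are hypothetical and are not the actual implementation or syntax of AltaVista or HotBot; only the Census Bureau address comes from this article.

```python
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    title: str
    body: str


def field_search(pages, field, term):
    """Return pages whose named field contains the term (case-insensitive)."""
    term = term.lower()
    return [page for page in pages if term in getattr(page, field).lower()]


# A tiny made-up index of three pages.
pages = [
    Page("http://www.census.gov/", "U.S. Census Bureau",
         "Population statistics for the United States"),
    Page("http://www.example.com/census-history.html", "A History of Counting",
         "How the census was taken in 1890"),
    Page("http://www.example.com/news.html", "Local News",
         "Census workers visited the town"),
]

# Restricting the term to one field returns fewer, more targeted hits.
print([page.url for page in field_search(pages, "title", "census")])
print([page.url for page in field_search(pages, "url", "census")])
```

All three pages mention the census somewhere, but only one carries the word in its title, which is the kind of narrowing a title search provides.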

 

Metasearch Engines

Metasearch engines are web sites that send a search to several search engines all at once. Often, only a selected number of hits from each search engine are retrieved and then blended into a single page of results. Some well-known metasearch engines are described below.

 

Dogpile (http://www.dogpile.com)

Dogpile integrates many search engines as well as other types of sources and sorts the results by search engine. Included in the search are 1) search engines: Yahoo!, Lycos' A2Z, Excite Guide, GoTo.com, PlanetSearch, Thunderstone, What U Seek, Magellan, Lycos, WebCrawler, InfoSeek, Excite, and AltaVista; 2) Usenet: Reference.com, DejaNews, AltaVista, and DejaNews' old database; and 3) more than two dozen online news services or other types of sources.

Includes a simple and advanced search and allows Boolean operators.

Dogpile is a good way to check to see which search engine works best for a particular question.

Internet Sleuth (http://www.isleuth.com)

Internet Sleuth is a 3,000-strong collection of specialized online databases, which can also simultaneously search up to six other search sites for web pages, news, and other types of information. It's excellent for highly specialized searches of any subjects in its detailed directory.

Links popular Net search engines and allows you to specify categories like business, computers, education, sports, etc.

 

MetaCrawler (http://www.metacrawler.com)

A powerful metasearch engine that searches several popular search engines and sorts the results. It is excellent for getting a quick hit of what's out there. But if you don't see what you want in the results, its limited search options make it tough to issue really precise queries.

ProFusion (http://www.profusion.com)

Lets you select which search engines to search, including AltaVista, InfoSeek, Lycos, Excite, WebCrawler, and others. Filters results to remove duplicates and broken links.

 

SavvySearch (http://www.savvysearch.com)

Searches multiple Internet search engines, web directories such as Yahoo or Magellan, Usenet, and other sources via just one query and then returns the linked results.

Intelligent Agents

Metasearch engines can be advantageous for getting a quick overview, but because every search engine differs in how it functions and because metasearch engines provide limited results from each search engine, the outcome is incomplete. In addition, some metasearch engines are rather slow and create another problem, that of duplicates. A better solution is to consider using intelligent agents, software programs that search many search engines at once, much like metasearch engines, but which add other features such as automatically finding, analyzing, filtering, and presenting information rapidly. BullsEye, one of the most recent entrants to the marketplace, offers a trial version for download (http://www.intelliseek.com). As compared to metasearch engines, one valuable feature is that the user can specify the number of total hits and how many are desired from each search engine. As a result, a much larger list of hits is created than when using metasearch engines on the web. A unique and automated feature of BullsEye is that it can track and update searches based on the time frame selected by the user (hourly, daily, or weekly) and then e-mail updates to you.
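
The blending, per-engine limits, and duplicate removal described above can be sketched in Python. This is a simplified illustration under stated assumptions, not the method used by BullsEye or by any metasearch engine: the function name merge_results, the engine names, and the example addresses are hypothetical, and duplicates are detected only by comparing addresses after removing a trailing slash.

```python
def merge_results(results_by_engine, per_engine=10, total=50):
    """Blend ranked URL lists from several engines into one list.

    Takes at most per_engine hits from each engine, skips URLs already
    seen, and stops once total hits have been collected.
    """
    seen = set()
    merged = []
    for engine, urls in results_by_engine.items():
        for url in urls[:per_engine]:
            # Crude duplicate check: ignore a trailing slash only.
            key = url.rstrip("/")
            if key in seen:
                continue
            seen.add(key)
            merged.append((url, engine))
            if len(merged) == total:
                return merged
    return merged


# Made-up result lists from two hypothetical engines.
results = {
    "EngineA": ["http://a.example/", "http://b.example/x"],
    "EngineB": ["http://a.example", "http://c.example/"],
}

for url, engine in merge_results(results, per_engine=2, total=10):
    print(engine, url)
```

Raising per_engine and total gives the larger hit list the article attributes to intelligent agents, while small values mimic the limited sample a metasearch engine returns.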

 

Hard-to-Find Information

Two categories of hard-to-find information are industry statistics and market data. Often, this information is developed and provided by two distinct types of organizations: government agencies or professional and trade associations. Consider what agency or association would typically generate the required information and search for that first. For example, when looking for U.S. population statistics, consult the U.S. Bureau of the Census at http://www.census.gov since it is the governmental agency responsible for compiling these statistics. If you need market data about restaurants, try the National Restaurant Association at http://www.restaurant.org. A reference book for additional help with hard-to-find information is Finding Statistics Online by Paula Berinstein, Information Today, Inc., 1998 (http://www.infotoday.com). Here are some additional web sites that are useful for finding information not readily available or indexed by search engines.

 

Price's List of Lists (http://gwis2.circ.gwu.edu/~gprice/listof.htm)

The Internet contains many lists of information in the form of rankings of different people, organizations, companies, etc. This site contains a collection that is designed to be a clearinghouse for these types of resources.

 

Direct Search (http://gwis.circ.gwu.edu/~gprice/direct.htm)

This site contains links to resources not easily searchable by search engines, such as archives and library catalogs, books, news sources, and ready reference.

 

Internet Publishers & Databases

Although there is an astounding amount of free information, professional researchers have also seen the commercialization of the web during the past year. As mentioned previously, many traditional commercial database vendors that were available only through dial-up telecommunications have launched web products, and new publishers have entered the market with unique products. Here are examples of some of the new producers or products that have come onto the scene:

Hoover's Inc. (http://www.hoovers.com) provides company snapshots.

Research Bank Web (http://www.investext.com) includes three major database collections: investment research, market research, and trade association research.

Vista Information Solutions (http://www.vistainfo.com) provides environmental, property, and business-risk information on any property, business, or address in the United States.

XLS (http://www.xls.com) contains financial databases with information that can be downloaded as pre-formatted spreadsheets.

Integra (http://www.integrainfo.com) provides financial ratios based on 3.5 million private companies in 900 industries in the form of industry profiles as a way to benchmark against financial information of a specific company that the user already knows about. Also offers a new product called Prospect Profiler that includes a range of important information for sales prospecting.

VentureOne (http://www.ventureone.com) provides a database of venture capital companies, transactions, and funds.

 

Web Tools & Specialty Search Engines

A very interesting web navigation service is Alexa (http://www.alexa.com). It works in conjunction with a web browser and resides as a tool bar at the bottom of the browser. Alexa provides useful information about the sites you are visiting and suggests related sites with links to click on. This can immediately add relevant sites to the search process as one way to save time on a search. An example of a specialty search engine is Liszt (http://www.liszt.com). Liszt provides brief descriptions of some 90,000 electronic mailing lists and discussion groups. These are especially valuable for keeping up with current trends in your own profession or those related to your areas of subject expertise and interest. A search can be initiated by keyword, or there are broad categories from which to choose such as Business, Computer, Education, Politics, or Science. Another specialty search engine for finding companies from all over the world is Corporate Information (http://www.corporateinformation.com). Its new search engine and A-Z list of countries with links to sites make this a unique source for global company information.

 

Keeping Up

Keeping up with changes in search engines and the latest information necessary for professional information workers is quite a challenge. Here are some selected sources:

Cyberskeptic Guide to Internet Research (http://www.bibliodata.com) is a newsletter with articles about useful sites for searchers.

Free Pint (http://www.freepint.co.uk) is a British-based free e-mail newsletter that covers high-quality, reliable information on the web. It contains tips, tricks, and articles written by information professionals in the United Kingdom and is currently sent to more than 12,000 information professionals every two weeks.

On the Net (http://www.onlineinc.com), a column by Greg Notess, covers the information side of the Internet and is published in Online and Database.

The Search Engine Update (http://searchenginewatch.com) is a free site with a subscription-based e-mail newsletter sent twice monthly, with access to "in progress" projects and detailed information only available to subscribers.

Web Wise Ways (http://www.infotoday.com), a column by Amelia Kassel, began in October 1998 and is published in Searcher magazine. This column provides in-depth reviews of new web-based research products and compares them to traditional commercial database products when applicable.

 

What's Next for Internet Power Searchers?

Just when searchers have conquered the methods and idiosyncrasies of a search engine, it changes. My very first personal favorite, Open Text, has disappeared. I then discovered that HotBot was easy to use and most satisfactory for the majority of my research requests. Of late, Northern Light, the most significant entry to the playing field during the past year and a half, continues to add new content and features while others have remained either fairly static or in some cases deteriorated. In recent months, there has been a hush in new search engine development. Nothing much new! Nevertheless, Reva Basch points out that, with regard to search engines, "the only constant is change" [6]. This insightful comment implies, to me, that information professionals will want to continue their experimentation with search engines and acclimate themselves to changes or new features. For the moment, we can hone our skills using existing products while waiting to see what the next generation will bring. For now, searchers will need to continue to identify, collect, evaluate, and organize useful web sites and learn new tools that come onto the scene, since so much on the web is not accessible via search engines. Many of the same skills that we learned in graduate schools of library and information science are applicable to this new searching environment that we have had to meet head on.

 

Amelia Kassel is president and owner of MarketingBASE, a successful information brokerage specializing in market research, competitive intelligence, and worldwide business information since 1984. Kassel holds a Master's degree in library science (1971, UCLA) and combines an in-depth knowledge of information sources with an emphasis on the use of databases, and a knowledge of business and marketing strategies. Kassel has taught information brokering and electronic research for the University of California, Berkeley and San Jose State University, Division of Library and Information Science. A recognized author and national and international speaker, she also conducts workshops for conferences and associations.