XML Under the Hood

Information Outlook, Vol. 6, No. 12, December 2002

By Davida Scharf

Davida Scharf has been online since the 1970s. At NKR Associates Inc. (www.NKRassociates.com), she assists organizations of all types in using information technology effectively, through consulting and project management.

A New Set of Rules

The concept of XML (eXtensible Markup Language) is very familiar to librarians. It is a set of rules for organizing information that will enable complex search, retrieval, and manipulation of data in many ways. It is simply the latest in a long line of tools and techniques for creating "cataloging" or "metadata." It is an important tool, with wide-ranging implications for how information will be handled in the future. While few of us will likely be called upon to work directly with XML, we should understand something about what's under the hood of these emerging systems.

Automated library catalogs using the MARC format have been great at handling traditional library resources in a standard way—saving time and money, and facilitating communication about these resources across institutions for more than 30 years. Indeed, the MARC standard was ahead of its time. MARC, along with another library standard, the Z39.50 retrieval protocol, helped put lumbering libraries on the leading edge of networked computer applications and made them serious users of the earliest incarnations of the Internet. However, we now find that our inflexible MARC records and current online library catalogs are poorly suited to providing integrated access to a wide variety of electronic content, much of it dynamic in nature. Today, a large percentage of library "holdings" are licensed electronic sources that lie outside the online catalog. XML holds the key that will enable libraries and all Web users to unlock the door to 21st century "bibliographic control" and resource-sharing, regardless of location or format. XML documents can be used for print, the Web, or any other document medium. XML's flexibility enables designers to adopt one set of standards, tools, and methods for processing documents, regardless of their various distribution targets.

XML Only Looks Like HTML

XML is a set of rules for structuring documents and data on the Web.1 With bracketed start and end tags, XML looks similar to HTML (Hypertext Markup Language). But it differs in that the tags delimit the content in a meaningful semantic way rather than describing the presentation of the content. XML specification 1.0 provides a means of describing documents that is independent of medium. Similar to database field tags, the tags in XML delimit the pieces of data in a systematic way so that they can be manipulated and presented later in various ways. The presentation and manipulation of the content is done in other ways, by other software. Like HTML, XML is a text format, which means it can be looked at with any text editor rather than just with the program that produced it.
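To make the separation of content from presentation concrete, here is a minimal sketch in Python, using only the standard library. It assumes a small record modeled on the book example in this article's sidebar; the function names are invented for illustration. The same tagged data is presented two different ways without touching the data itself.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical record, modeled on the book example in this article's sidebar.
BOOK_XML = """
<book>
  <title>Java Secrets</title>
  <author>Elliotte Rusty Harold</author>
  <publisher>IDG Books Worldwide</publisher>
  <price>$59.99</price>
</book>
"""

def to_plain_text(book):
    """Present the record as a one-line citation."""
    return "{} by {} ({}), {}".format(
        book.findtext("title"),
        book.findtext("author"),
        book.findtext("publisher"),
        book.findtext("price"),
    )

def to_html(book):
    """Present the same record as an HTML list item."""
    return "<li><b>{}</b>, {}</li>".format(
        escape(book.findtext("title")),
        escape(book.findtext("author")),
    )

book = ET.fromstring(BOOK_XML)
print(to_plain_text(book))
print(to_html(book))
```

Because the tags say what each piece of data is rather than how it should look, a third presentation (a catalog card, say) would need only another small function, not a change to the record.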

The word "extensible" means it can be extended in any dimension or direction. While XML is called a meta-language, it isn't really a language as much as a grammar, or a set of rules for creating a language. These rules are used to create markup languages for specific purposes. Each application of XML can be unique, but it doesn't have to be. Indeed, what is happening is that standardized XML applications are being developed for broad categories of use. XML is an open standard that is license-free and platform-independent. It is a natural for librarians, because it can behave like a database to facilitate structuring content for improved "understanding." XML is not just for Web pages. It can be used to store any kind of structured information and to enclose or encapsulate information in order to pass it between different computing systems that would otherwise be unable to communicate.

XML tags describe the content in a standardized and consistent way, which makes them similar to database field names. A pair of tags including their content is called an "element," regardless of the granularity. Unlike database fields, tags can be nested, thereby enhancing meaning through a description of hierarchical relationships. Librarians, who truly understand the importance of the semantic nature of information, can appreciate the need for a technology that enables the developer to retain semantic relationships.
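The idea of nested elements can be sketched in a few lines of Python. The record below is hypothetical: the given and family element names are invented for illustration and do not come from any standard. The sketch walks the tree and reports each element with its depth in the hierarchy.

```python
import xml.etree.ElementTree as ET

# Hypothetical nested record; the element names are invented for illustration.
RECORD = """
<book>
  <title>Java Secrets</title>
  <author>
    <given>Elliotte Rusty</given>
    <family>Harold</family>
  </author>
</book>
"""

def outline(element, depth=0):
    """Return (depth, tag, text) for an element and all of its descendants."""
    text = (element.text or "").strip()
    rows = [(depth, element.tag, text)]
    for child in element:
        rows.extend(outline(child, depth + 1))
    return rows

rows = outline(ET.fromstring(RECORD))
for depth, tag, text in rows:
    # Indent each tag by its depth so the hierarchy is visible.
    print("  " * depth + tag + (": " + text if text else ""))
```

Here the author element is itself an element that contains two smaller elements, which is the kind of hierarchical relationship a flat database field cannot express directly.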

DTDs, Schemas, XSL, and More

XML is modular and usually refers to a family of standards and tools that comprise new methods for organizing and presenting information on the Web. Basic XML documents must rely on other types of XML documents to define the specifics for a particular application. A DTD (Document Type Definition) is typically used to define elements that are allowed in the group of XML documents that refer to it. A DTD is not required for documents that are considered "well formed," but it is useful as a way to describe and validate information in related XML documents. A well-formed document follows a set of rules for XML. Many DTDs already exist in journalism, law, e-commerce, and other fields.
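As a rough illustration of what "well formed" means in practice, the following Python sketch asks a standard-library parser whether a string follows the basic rules of XML. The function name is invented for this example. Note the limits of the sketch: it checks well-formedness only and does not validate a document against a DTD, which would require a validating parser.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if text parses as XML, False otherwise."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True

good = "<book><title>Java Secrets</title></book>"
bad = "<book><title>Java Secrets</book>"  # <title> is never closed

print(is_well_formed(good))
print(is_well_formed(bad))
```

The second string would be tolerated by many HTML browsers, but an XML parser rejects it because every start tag must have a matching end tag.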

A schema is similar to a DTD in that it expresses shared vocabularies and allows machines to execute the rules we create. Schemas provide a means for defining the structure, content, and semantics of XML documents and were designed to overcome some of the limitations of the DTD.

CSS (Cascading Style Sheets), which provide consistent rules browsers can interpret for formatting HTML pages on the Web, can be used with XML. But the newest language being used to perform a similar function is XSL (eXtensible Stylesheet Language). Namespaces allow the definition of an XML document that uses elements obtained from different sources. They do this by associating a prefix with each element tag name; each prefix is unambiguously identified by means of a persistent identifier, such as a URI (Uniform Resource Identifier).2 XLink and XPointer are standards that allow developers to define links among XML documents. They are similar to hyperlinks in the HTML world, but more powerful. Every day more XML technology and tools are being created and refined.
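A short Python sketch may help show how namespace prefixes keep elements from different sources apart. It assumes a record that mixes two vocabularies: the dc prefix is bound to the Dublin Core element set URI, while the loc prefix and its example.org URI are hypothetical, made up for this illustration.

```python
import xml.etree.ElementTree as ET

# The dc: prefix is bound to the Dublin Core element set URI; the loc: prefix
# and its URI are hypothetical, made up for this illustration.
DOC = """
<record xmlns:dc="http://purl.org/dc/elements/1.1/"
        xmlns:loc="http://example.org/local-catalog">
  <dc:title>Java Secrets</dc:title>
  <loc:title>Shelf label: JAVA-SEC</loc:title>
</record>
"""

# Map the prefixes used in our queries to the same URIs.
NS = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "loc": "http://example.org/local-catalog",
}

record = ET.fromstring(DOC)
dc_title = record.findtext("dc:title", namespaces=NS)
local_title = record.findtext("loc:title", namespaces=NS)
print(dc_title)
print(local_title)

# Internally the parser replaces each prefix with its URI.
print(record[0].tag)
```

Both elements are named title, yet they never collide, because the parser identifies each one by the URI behind its prefix rather than by the prefix itself.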

Making XML Useful

XML can be used on an elementary level. Writing simple XML and XSL stylesheets can be relatively easy for a person already familiar with HTML. However, for XML to reach its full potential, additional technology and software interfaces must be used. The Document Object Model (DOM) is a complicated interface that programmers will use to work with XML. It will enable developers using Java, C, or other programming languages to create methods for finding, sorting, manipulating, and displaying information in very sophisticated ways. The difference between how this is done now and how it will be done with XML under the hood is in the ability of the new technologies to communicate information in different formats, created by different vendors. This ability, often called "interoperability," refers to the full interchange of information across different computer systems.
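For readers curious about what working with the DOM looks like, here is a minimal sketch using the DOM implementation that ships with Python (xml.dom.minidom). The catalog is hypothetical, and the two sample titles and their page counts are invented; only the first entry comes from this article's sidebar. The sketch finds every book, then sorts the results by page count.

```python
from xml.dom import minidom

# Hypothetical catalog; the sample titles and their page counts are invented.
CATALOG = """
<catalog>
  <book><title>Java Secrets</title><pagecount>900</pagecount></book>
  <book><title>Sample Title B</title><pagecount>250</pagecount></book>
  <book><title>Sample Title A</title><pagecount>400</pagecount></book>
</catalog>
"""

def text_of(parent, tag):
    """Return the text inside the first <tag> element under parent."""
    node = parent.getElementsByTagName(tag)[0]
    return node.firstChild.data

dom = minidom.parseString(CATALOG)
books = dom.getElementsByTagName("book")

# Sort by page count, longest first, and keep (title, pages) pairs.
by_length = sorted(
    ((text_of(b, "title"), int(text_of(b, "pagecount"))) for b in books),
    key=lambda pair: pair[1],
    reverse=True,
)
for title, pages in by_length:
    print(title, pages)
```

The same DOM method names, such as getElementsByTagName, are available in other programming languages, which is part of what makes the interface useful across systems.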

Organizations

XML.org was formed in 1999 to minimize overlap and duplication in XML languages and XML standard initiatives by providing public access to XML information and XML schemas. Today it serves as a centralized resource to developers and others interested in XML. (www.XML.org)

OASIS (Organization for the Advancement of Structured Information Systems), founder of XML.org, is a not-for-profit, global consortium that is driving the development and adoption of standards for e-business. (www.oasis-open.org)

XML.coverpages.org, hosted by OASIS, managed by XML.org, and edited by Robin Cover, is widely regarded as the most comprehensive online reference work for XML and its parent, SGML. (www.XML.coverpages.org)

W3C (World Wide Web Consortium) was created in October 1994 to lead the World Wide Web to its full potential by developing common protocols that promote its evolution and ensure its interoperability. (www.w3c.org)

A Brief History of XML

XML is the latest in a line of standardized languages used to handle electronic information. XML, developed in 1996, has its origins in SGML (Standard Generalized Markup Language), an ISO standard developed in the early 1980s by a consortium of vendors to support computerized publishing. Publishers have been using SGML for more than a decade because it is very effective at describing text documents. Because it is difficult to implement in Web browsers, SGML users now routinely convert their SGML documents to HTML for display on the Web. XML development was also informed by experience with HTML throughout the 1990s. The consortium that originally worked on SGML is now called OASIS (Organization for the Advancement of Structured Information Systems), and its scope has expanded. OASIS is a not-for-profit global consortium whose aim is to produce worldwide standards for security, Web services, XML conformance, business transactions, electronic publishing, topic maps, and interoperability within and between marketplaces. OASIS runs the XML.org website.

The Semantic Web

XML is already under the hood in many Web applications, helping to move us all toward the so-called Semantic Web. If you haven't heard of it yet, the Semantic Web is really still a vision. But it is the vision of Tim Berners-Lee, inventor of the World Wide Web. It is a vision of the Web in which machines are able to interpret data and carry out sophisticated tasks because complex and verbose metadata are routinely coded into Web documents, enabling automatic linking and reuse of data across the Web and across various applications. An XML language called RDF (Resource Description Framework) is emerging as a key support for the Semantic Web to describe the types of resources with which librarians are familiar.

When the Semantic Web is closer to realization, many tools will surely exist for creating coded documents automatically, just as word processors today automatically convert a document to doc, txt, or html formats. There are already XML editing programs similar to Dreamweaver or FrontPage for HTML. The most often mentioned commercial program is XML SPY, which is a family of software products for creating the family of document types needed with XML.

XML and Related Technologies

CSS (Cascading Style Sheets) is a style sheet format that enforces consistency in the presentation of the HTML or XML documents on the Web that refer to it.

XSL (eXtensible Stylesheet Language) is a more fully featured type of style sheet than CSS. It contains a set of rules for enforcing consistency in presentation of XML documents on the Web.

DTD (Document Type Definition) is a description of the structure and properties of a class of XML files.

EAD (Encoded Archival Description) is a standard for encoding archival finding aids maintained by the Library of Congress in partnership with the Society of American Archivists.

RDF (Resource Description Framework) is a text format that supports resource description and metadata applications, such as bibliographies and photo collections.

Schemas express shared vocabularies and allow machines to carry out rules made by people. They provide a means for defining the structure, content, and semantics of XML documents.

SGML (Standard Generalized Markup Language) is the predecessor of XML.

XML (eXtensible Markup Language) is a family of technologies for facilitating the interchange of information on the Web.

XML at the Library of Congress

The Library of Congress (LC) has embraced XML in major ways. In cooperation with the Society of American Archivists, the LC developed the EAD (Encoded Archival Description) for archival finding aids. The LC has also been developing a schema called MARC/XML. The MARC/XML schemas, stylesheets, and software tools under development will greatly facilitate the transition of library systems to an XML environment. You can read more about this project at the MARC website, http://www.loc.gov/standards/marcxml/marcxml-overview.html. At the same time, the LC is developing a schema called METS (Metadata Encoding and Transmission Standard) for encoding metadata about electronic publications within a digital library (see http://www.loc.gov/standards/mets/METSOverview.html). This is necessary because managing digital content requires metadata about the structure of the electronic content in a way that printed publications did not. The archaic communications and storage format of existing library catalogs currently keeps them segregated from other information resources on the Web. Putting these catalogs in a more universally accepted format should keep libraries relevant and accessible in the future. The Library of Congress and other major research libraries are embracing XML3 because it promises to bring standardization and resource sharing to new heights.

ASCII of the 21st Century

XML is a powerful emerging open standard for exchanging information that promises to greatly improve the functionality of the Web. It is powerful because it is flexible. It is powerful because it should keep us from getting lost in proprietary dead ends. It is powerful because it allows data to be structured hierarchically but separates form and content in a way that makes changing the presentation of information easier and less costly. It is an evolving family of guidelines that will provide developers with the means to make complex communication and interpretation of information by machines and people much easier. It has been called the ASCII of the 21st century4 because it is simply a flexible text format that may have an impact as far-reaching as the development of ASCII did in the 20th century.

REFERENCES

Introducing the Extensible Markup Language (XML): SGML and XML as Meta-Markup Languages, Cover Pages, http://xml.coverpages.org/xmlIntro.html.

"XML in 10 points," World Wide Web Consortium (W3C), http://www.w3.org/XML/1999/XML-in-10-points, 2002.

Network Development and MARC Standards Office, Library of Congress, MARC 21 XML Schema, http://www.loc.gov/standards/marcxml/marcxml-overview.html.

1 W3C.org website, http://www.w3.org/XML

2 For a discussion of persistent identifiers on the Web, see Davida Scharf, "The DOI is coming," Information Outlook, September 2002.

3 For MARC at Stanford University, see http://laneweb.stanford.edu:2380/wiki/medlane/xmlmarc.

4 Carlos Delgado-Kloos, "XML: The ASCII of the 21st Century," Upgrade: The European Online Magazine for the IT Professional, vol. 3, no. 4, August 2002, p. 6.10. (www.upgrade-cepis.org)

 

What Simple XML Looks Like

ENGLISH TEXT

Java Secrets
by Elliotte Rusty Harold

· Publisher: IDG Books Worldwide
· ISBN: 0-764-58007-8
· Pages: 900
· Price: $59.99
· Publication Date: May 1997
· Bottom Line: Buy It

The Java virtual machine, byte code, the sun packages, native methods, stand-alone applications, and a few more naughty bits.

HTML

<dt>Java Secrets
<dd> by Elliotte Rusty Harold
<ul>
<li>Publisher: IDG Books Worldwide
<li>ISBN: 0-764-58007-8
<li>Pages: 900
<li>Price: $59.99
<li>Publication Date: May 1997
<li>Bottom Line: Buy It
</ul>
<P>The Java virtual machine, byte code, the sun packages, native methods, stand-alone applications, and a few more naughty bits.</P>

XML

<book>
  <title>Java Secrets</title>
  <author>Elliotte Rusty Harold</author>
  <publisher>IDG Books Worldwide</publisher>
  <isbn>0-764-58007-8</isbn>
  <pagecount>900</pagecount>
  <price>$59.99</price>
  <pubdate>May 1997</pubdate>
  <recommendation>Buy It</recommendation>
  <blurb>The Java virtual machine, byte code, the sun packages, native methods, stand-alone applications, and a few more naughty bits.</blurb>
</book>
