Introduction to HTML and XML Basics


Uploaded by

piyush.2005mh
Copyright
© © All Rights Reserved

UNIT 11 INTRODUCTION TO HTML AND XML
Structure

11.0 Objectives
11.1 Introduction
11.2 World Wide Web and Markup Languages
11.3 Standard Generalized Markup Language (SGML)
11.4 HyperText Markup Language (HTML)
11.4.1 Introduction to HTML
11.4.2 Features of HTML
11.4.3 Editor for HTML
11.4.4 Syntax of HTML Commands
11.4.5 Framework of a Web Page
11.5 Basic HTML Tags
11.5.1 Linking
11.5.2 URLs
11.6 HTML and the Browser
11.7 eXtensible Markup Language (XML)
11.7.1 Need for XML
11.7.2 Objectives of XML
11.7.3 Features of XML
11.7.4 How XML is Different from HTML
11.7.5 Advantages of XML
11.8 XML Syntax and Semantic Tags
11.8.1 XML Syntax
11.8.2 Semantic Tags of XML
11.9 Document Type Definition (DTD)
11.10 Implications of XML in Library and Information Activities
11.11 Summary
11.12 Answers to Self Check Exercises
11.13 Keywords
11.14 References and Further Reading

11.0 OBJECTIVES
In the previous Unit, you were acquainted with the guidelines, norms and
standards developed by various organisations and suggested by different experts
for the development of content on the Web. The Internet has changed
the way information can be organised. It is necessary to know the
form in which information can be organised on the net, which we will
study in this Unit.

After reading this Unit, you will be able to:

• understand what is meant by the World Wide Web;
• understand the concept and functions of a markup language;
• learn the structure, tags and syntax of HyperText Markup Language (HTML) and
  eXtensible Markup Language (XML); and
• know the applications of these in the creation of web pages.

11.1 INTRODUCTION
Today the Internet has changed the way of life in all fields. It has created an
instant online connection and communication the world over. Due to its feasible
technology, the Internet has grown rapidly in the past few years and gained
great popularity. It has been transformed from just a text-based environment
to a clickable and linkable world. What has made this possible is the World
Wide Web (WWW). The Internet today has become a multimedia communication
channel where data can be transferred in all formats.

The use of hypermedia and hypertext is so ingrained on the Internet that
the WWW cannot be thought of without multimedia, and WWW has become a synonym
for the term Internet. Hypertext markup language is a language for rendering
information over the Internet. It can accommodate audio, video, text and images.
It can be said that the basic feature of today's Internet is hypertext and
hyperlinking.

11.2 WORLD WIDE WEB AND MARKUP LANGUAGES
The World Wide Web (WWW) is designed for the display, distribution and searching
of information, files, and data across multiple locations over the Internet. The WWW
is used to view web documents, and these web documents are written in a
particular language supported by the WWW. This language is not a programming
language. Hypertext Markup Language is used to represent the web content on
web pages [Slack Inc, 2001].

The term markup means instructions for printing in a particular style. While
proofreading, for example, editors mark the text (e.g. by underlining) to
show that it should be set in bold when printed. Similarly, to display electronic
text in a web page on a browser, embedded instructions are given within the text
to make the parser understand how the text should appear on display [Sol, 1999].

Markup is also used for data retrieval, particularly in the library and information
field. Once the structure of a document is fixed, one can easily find which part
of the document contains which kind of data. For example, an email has a fixed
structure, which means it will look like this:

To: inder@[Link]
From: rakesh@[Link]
Date: Tue, 26 Jan 2005 01:00:58 -0800
Subject: Memo

Kindly inform me the timetable of the term end examination of the MLIS course.

Regards
Rakesh

If we observe this email we will find the following fields,

To:
From:
Date:
Subject:
Body:

It is very easy to fetch the data from email once the fields are known. This is
a typical example of markup.
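
The idea that known fields make data easy to fetch can be sketched in a few lines of Python. This is only an illustration, assuming the standard library `email` module; the addresses are placeholders on example.org, and the helper name `extract_fields` is invented for this sketch.

```python
from email import message_from_string

# Hypothetical message: the addresses are placeholders, not real ones.
RAW_EMAIL = """\
To: inder@example.org
From: rakesh@example.org
Date: Tue, 26 Jan 2005 01:00:58 -0800
Subject: Memo

Kindly inform me the timetable of the term end examination of the MLIS course.

Regards
Rakesh
"""


def extract_fields(raw):
    """Return the four header fields and the body of a plain-text email."""
    message = message_from_string(raw)
    fields = {name: message[name] for name in ("To", "From", "Date", "Subject")}
    # For a simple (non-multipart) message the payload is the body text.
    fields["Body"] = message.get_payload().strip()
    return fields


fields = extract_fields(RAW_EMAIL)
print(fields["Subject"])
print(fields["Body"])
```

Because the field names act as markup, the program never has to guess where the subject ends and the body begins.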

A 'markup language' may be no more than a loose set of markup conventions
used together for encoding texts. A markup language must specify what markup
is allowed, what markup is required, how markup is to be distinguished from text,
and what the markup means. Standard Generalised Markup Language (SGML)
provides the means for doing the first three of these only; it allows one to
describe a markup language independently of what the markup is intended to do.
To understand and act upon the markup, additional semantic information is needed,
which differs in different situations. An SGML-aware processor can analyze the
structure of an SGML-encoded document with no sense of its meaning. This
independence is necessary, given the open-ended nature of electronic textual
applications. It does not, of course, imply that the intentions of the encoder of
a text are unimportant or vacuous; only that they are formally distinct from the
encoding itself.

For an understanding of all markup languages, when described in SGML terms,
knowledge of three basic concepts is fundamental. These are the notions of a
markup entity, a markup element with its associated attributes, and a document
type. At the most basic level, texts are composed simply of streams
of symbols (characters or bytes of data, marks on a page, graphics, etc.): these
are known as entities in SGML. At a higher level of abstraction, a text is
composed of representations of objects of various kinds, linguistically or
functionally defined. Such objects do not appear randomly within a text: coherence
demands that particular types of object appear in specifiable relationship to other
objects. For example, they may be included within each other, linked to each
other by reference or simply presented sequentially. This level of description
sees texts as composed of structurally defined objects, known as elements in
SGML. The grammar defining how elements may legally be combined in a
particular class of texts is known as a document type. These three fundamental
concepts together are, it is claimed, adequate to describe all the complexities of
marked-up texts, of whatever kind and for whatever purposes.

Self Check Exercises


1) What is meant by WWW?
2) Distinguish between Hypertext, Hyperlink and Hypermedia.
3) Define a markup.
Note: i) Write your answers in the space given below.
ii) Check your answers with the answers given at the end of the Unit.

11.3 STANDARD GENERALIZED MARKUP LANGUAGE (SGML)
SGML aims to give a general structure for other markup languages. Thus, it is
a meta-language which gives rise to other markup languages; for example, XML
(eXtensible Markup Language) is a derivative of SGML. It preserves
the semantics of the text through the embedded markup. It is not meant for
formatting text; it is meant to preserve the structure of a document.

SGML is not a text formatting system (although its origins can be readily
traced to the world of electronic text formatting), nor is it a competitor for such
languages as TeX or PostScript. These languages define how the text should
appear on screen or in print. SGML by contrast is decidedly unhelpful about
how texts are to be reproduced, but it binds one to a specific structure of
document and a specific sequence of elements in the text.

HTML stands for HyperText Markup Language and is a relatively simple language.
An HTML 'page' is a plain text document with markup inserted into it. This
markup includes codes for forming hypertext links. Using it becomes easier if
one understands the basic principles behind it and takes its limitations into account
[Gorman, Dianne].

11.4 HYPERTEXT MARKUP LANGUAGE (HTML)


Hypertext Markup Language (HTML) is a structured markup language that is
used to create web pages. A markup language such as HTML is simply a
collection of codes, called elements, that are used to indicate the structure and
format of a document. Elements in HTML consist of alphanumeric tokens within
angular brackets, such as <b>, <html>, <body>, etc.

Most elements consist of paired tags: a start tag and an end tag. For example,
<b> is a start tag and </b> is the end tag. The end tag is similar to the start tag,
except that the element name is preceded by a forward slash. An element's instruction
applies to whatever content is contained between its start and end tags:

E.g. <b> This text is bold; </b> but this text is not.

Element names are not case-sensitive. An element such as <hTml> is equivalent
to <html>. However, using either upper or lower case consistently makes HTML
documents easier to understand and maintain. Element names cannot contain
spaces.
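
Start tags, end tags and case-insensitive element names can be observed with a small parser. The following is a minimal sketch using Python's standard `html.parser` module, which reports tag names in lower case; the class name `TagCollector` and the helper `tag_events` are invented for this illustration.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record start tags, end tags and text in the order they occur."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # html.parser passes the tag name already converted to lower case.
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        if data.strip():
            self.events.append(("text", data.strip()))


def tag_events(markup):
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.events


lower = tag_events("<b>This text is bold;</b> but this text is not.")
mixed = tag_events("<B>This text is bold;</b> but this text is not.")
print(lower)
print(lower == mixed)
```

Both spellings of the bold element produce the same sequence of events, which is what "not case-sensitive" means in practice.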

11.4.1 Introduction to HTML


HTML is the language with which Web pages are designed. HTML allows web
documents to be created with ease. The primary objective of using HTML would
be to build a web page that communicates readily and effectively to make the
document on the web most compelling to access and read.

11.4.2 Features of HTML

HTML is a content-based or structured markup language, where the codes
describe what the contents of the document are. This means that the codes are
used to indicate the various parts of the document, such as headings, paragraphs,
lists, etc.

It is platform-independent. This means that HTML documents are portable from
one computer system to another.

There are some misconceptions about HTML:

• HTML is not a programming language. The markup in an HTML document
  describes the contents. It does not contain processing instructions.

• HTML is not a page layout language. With only a few exceptions, HTML
  tags are concerned with the structure of a document rather than its appearance.

In practice, HTML can be seen as both a structural language and a page
layout language. For instance, the tag <H1>, i.e., the heading tag, is basically a
structural tag which says that the enclosed text is a 'heading of the first order'.
HTML also has tags such as <B>, i.e., bold, and <I>, i.e., italics, which are
formatting or page layout tags.

Some of the very basic HTML concepts, tags and features are described below.

11.4.3 Editor for HTML


An HTML document is a plain text file, and a simple text editor is all that is
needed to create the tags. However, it is important that HTML documents have
the extension .html, which is a four-letter extension. As some older editors allow
only three letters, it is important to select an editor that allows four letters as
the file extension. The MS-DOS 'edit' program may be used as an editor for
writing HTML files.

11.4.4 Syntax of HTML Commands

In general, all HTML commands will have the form:

<COMMAND> text </COMMAND>.

Two points need to be noted here: (1) all commands MUST be enclosed within
angular brackets < >; and (2) most commands are used in pairs, wherein
<COMMAND> marks the beginning and </COMMAND> marks the end.

11.4.5 Framework of a Web Page

The framework of a web page is

<HTML>
<HEAD>

<TITLE> Title of Your Page </TITLE>


</HEAD>
<BODY>

The Body of Your Page

</BODY>

</HTML>

Explanation

The <HTML> </HTML> tells the browser that your page is in HTML code.

The <HEAD> </HEAD> encloses the header of your page.

The <BODY> </BODY> is that part of your page that will actually be displayed.
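
To see how a program can tell the head from the body, the framework above can be fed to a parser. This is a rough sketch using Python's standard `html.parser`; the class name `SkeletonReader` is invented here, and the approach (a simple stack of open tags) is an assumption about how one might do it, not a description of how browsers work internally.

```python
from html.parser import HTMLParser

PAGE = """<HTML>
<HEAD>
<TITLE> Title of Your Page </TITLE>
</HEAD>
<BODY>
The Body of Your Page
</BODY>
</HTML>"""


class SkeletonReader(HTMLParser):
    """Collect the text found inside <TITLE> and inside <BODY>."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # tags that have been opened but not yet closed
        self.title = ""
        self.body = ""

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Only well-nested pages are handled in this sketch.
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()

    def handle_data(self, data):
        if "title" in self.open_tags:
            self.title += data
        elif "body" in self.open_tags:
            self.body += data


reader = SkeletonReader()
reader.feed(PAGE)
reader.close()
title = reader.title.strip()
body = reader.body.strip()
print(title)
print(body)
```

The title text comes from the head and would go to the title bar, while only the body text would be shown in the browser window.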

11.5 BASIC HTML TAGS

Some of the basic HTML tags that are used in developing HTML documents are
as follows:

Markup Tags
HTML
This element tells the browser that the file contains HTML-coded information.
The file extensions .html and .htm also indicate that this is an HTML document.
Head
The head element identifies the first part of your HTML-coded document, which
contains the title. The title is shown in the title bar of the browser's window.
Title
The title element contains your document title and identifies its content in a global
context. The title is typically displayed in the title bar at the top of the browser
window, but not inside the window itself. The title is also what is displayed on
someone's hotlist or bookmark list, so it is better to choose something descriptive,
unique, and relatively short. A title is also used to identify your page for search
engines (such as HotBot or Infoseek). Generally it is advisable to keep titles to
64 characters or fewer.
Body
The second, and largest, part of an HTML document is the body, which contains
the document content (displayed within the text area of the browser window).
The tags explained below are used within the body of your HTML document.
Headings
HTML has six levels of headings, numbered 1 through 6, with 1 being the
largest. Headings are typically displayed in larger and/or bolder fonts than normal
body text. The first heading in each document should be tagged <H1>.
The syntax of the heading element is:

<Hy> Text of heading </Hy>

where y is a number between 1 and 6 specifying the level of the heading.

Generally, it is advised not to skip levels of headings in an HTML document. For
example, do not start with a level-one heading (<H1>) and then next use a
level-three (<H3>) heading.

For example:
<html>

<head>
<title>IGNOU Homepage</title>

</head>
<body> Welcome to the Home Page of IGNOU. IGNOU is one of the open universities
of India providing distance education courses in different fields.
</body>
</html>

Paragraphs

Unlike documents in most word processors, carriage returns in HTML files are
not significant. In fact, any amount of white space (including spaces, linefeeds,
and carriage returns) is automatically compressed into a single space when
an HTML document is displayed in a browser. Word wrapping can occur at any
point in the source file without affecting how the page will be displayed.

<P>Welcome to the world of HTML.
This is the first paragraph.
While short it is
still a paragraph!</P>

In the source file there is a line break between the sentences. A Web browser
ignores this line break and starts a new paragraph only when it encounters
another <P> tag.

Important: You must indicate paragraphs with <P> elements. A browser ignores
any indentations or blank lines in the source text. Without <P> elements, the
document becomes one large paragraph. (One exception is text tagged as
'preformatted,' which is explained below.) For example, the following would
produce output identical to that of the first example:

<H1>Level-one heading</H1>
<P> Welcome to the world of HTML. This is the
first paragraph. While short it is still a
paragraph! </P> <P>And this is the second paragraph.</P>
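
The way white space is compressed can be imitated in Python. This is only an approximation of browser behaviour for ordinary (not preformatted) text, and the function name `collapse_whitespace` is invented for this sketch; it also trims leading and trailing space, which is a simplification.

```python
def collapse_whitespace(text):
    """Replace every run of spaces, tabs and line breaks with one space."""
    return " ".join(text.split())


# The same paragraph, wrapped at different points in the source file.
first = """Welcome to the world of HTML.
This is the first paragraph.
While short it is
still a paragraph!"""

second = """Welcome to the world of HTML. This is the
first paragraph. While short it is still a
paragraph!"""

same = collapse_whitespace(first) == collapse_whitespace(second)
print(collapse_whitespace(first))
print(same)
```

Since both versions collapse to the same string, a browser would display them identically, however the lines are broken in the source.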

NOTE: The </P> closing tag may be omitted. This is because the browser understands
that when it encounters a <P> tag, the previous paragraph has
ended. However, since HTML now allows certain attributes to be assigned to the
<P> tag, it is generally a good idea to include the closing tag.

Using the <P> and </P> as a paragraph container means that you can center a
paragraph by including the ALIGN=alignment attribute in your source file.

<P ALIGN=CENTER>
This is a centered paragraph.
</P>
This is a centered paragraph.

It is also possible to align a paragraph to the right instead, by including the
ALIGN=RIGHT attribute. ALIGN=LEFT is the default alignment; if no ALIGN
attribute is included, the paragraph will be left-aligned.

Lists
HTML supports unnumbered, numbered, and definition lists. Nested lists can also
be used, but use this feature sparingly because too many nested items can
become difficult to follow.

Unnumbered Lists
To make an unnumbered, bulleted list:

a) start with an opening list <UL> (for unnumbered list) tag

b) enter the <LI> (list item) tag followed by the individual item; no closing </LI>
tag is needed

c) end the entire list with a closing list </UL> tag

Below is a sample three-item list:

<UL>
<LI> apples
<LI> bananas
<LI> grapefruit
</UL>

The output is:

• apples
• bananas
• grapefruit

The <LI> items can contain multiple paragraphs. Indicate the paragraphs with
the <P> paragraph tags.

Numbered Lists
A numbered list (also called an ordered list, from which the tag name derives)
is identical to an unnumbered list, except it uses <OL> instead of <UL>. The
items are tagged using the same <LI> tag. The following HTML code:

<OL>
<LI> oranges
<LI> peaches
<LI> grapes
</OL>

produces this formatted output:

1. oranges
2. peaches
3. grapes

Definition Lists
A definition list (coded as <DL>) usually consists of alternating a definition term
(coded as <DT>) and a definition description (coded as <DD>). Web browsers
generally format the definition on a new line and indent it.

The following is an example of a definition list:

<DL>
<DT> IGNOU
<DD> IGNOU, Indira Gandhi National Open University is located in New Delhi.
<DT> IISc
<DD> IISc, the Indian Institute of Science is located in Bangalore.
</DL>

The output looks like:

IGNOU
    IGNOU, Indira Gandhi National Open University is located in New Delhi.
IISc
    IISc, the Indian Institute of Science is located in Bangalore.

The <DT> and <DD> entries can contain multiple paragraphs (indicated by <P>
paragraph tags), lists, or other definition information.

Nested Lists

Lists can be nested. You can also have a number of paragraphs, each containing
a nested list, in a single list item.

Here is a sample nested list:

<UL>
<LI> A few fruits:
    <UL>
    <LI> Apple
    <LI> Grapes
    <LI> Banana
    </UL>
<LI> Two citrus fruits:
    <UL>
    <LI> Lime
    <LI> Orange
    </UL>
</UL>

The nested list is displayed as:

• A few fruits:
    • Apple
    • Grapes
    • Banana
• Two citrus fruits:
    • Lime
    • Orange
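
How nesting turns into indentation can be sketched with a parser that counts how many lists are currently open. This is an illustrative Python sketch using the standard `html.parser` module; the class name `ListOutline` and the two-space indent are choices made for this example, not part of HTML.

```python
from html.parser import HTMLParser

NESTED = """<UL>
<LI> A few fruits:
  <UL>
  <LI> Apple
  <LI> Grapes
  <LI> Banana
  </UL>
<LI> Two citrus fruits:
  <UL>
  <LI> Lime
  <LI> Orange
  </UL>
</UL>"""


class ListOutline(HTMLParser):
    """Turn nested <UL>/<OL> lists into indented bullet lines."""

    def __init__(self):
        super().__init__()
        self.depth = 0  # number of lists currently open
        self.lines = []

    def handle_starttag(self, tag, attrs):
        if tag in ("ul", "ol"):
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in ("ul", "ol"):
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.depth > 0:
            self.lines.append("  " * (self.depth - 1) + "• " + text)


outline = ListOutline()
outline.feed(NESTED)
outline.close()
print("\n".join(outline.lines))
```

Each opening list tag pushes the items one level deeper, and each closing list tag brings them back, which is why the closing </UL> tags cannot be left out even though </LI> can.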

Self Check Exercise

4) What is HTML? Mention some basic HTML tags.

Note: i) Write your answer in the space given below.


ii) Check your answer with the answers given at the end of the Unit.
................................................................................................................

11.5.1 Linking

The chief power of HTML comes from its ability to link text and/or an image to
another document or section of a document, thus weaving 'a web' of resources.
A browser highlights the identified text or image with colour and/or underlines
to indicate that it is a hypertext link (often shortened to hyperlink or just link).

HTML's single hypertext-related tag is <A>, which stands for anchor. To include
an anchor in your document:

a) start the anchor with <A (include a space after the A)

b) specify the document you're linking to by entering the parameter
HREF="filename" followed by a closing right angle bracket (>)

c) enter the text that will serve as the hypertext link in the current document

d) enter the ending anchor tag: </A> (no space is needed before the end
anchor tag)

Here is a sample hypertext reference in a file called [Link]:

<A HREF="[Link]">Hello</A>

This entry makes the word 'Hello' the hyperlink to the document [Link],
which is in the same directory as the first document.

Relative Pathnames Versus Absolute Pathnames

You can link to documents in other directories by specifying the relative path
from the current document to the linked document. For example, a link to a file
[Link] located in the subdirectory temp would be:

<A HREF="temp/[Link]">Content</A>

These are called relative links because you are specifying the path to the linked
file relative to the location of the current file. You can also use the absolute
pathname (the complete URL) of the file, but relative links are more efficient in
accessing a server.

They also have the advantage of making your documents more 'portable' - for
instance, you can create several web pages in a single folder on your local
computer, using relative links to hyperlink one page to another, and then upload
the entire folder of web pages to your web server. The pages on the server will
then link to other pages on the server, and the copies on your hard drive will still
point to the other pages stored there.

It is important to point out that UNIX is a case-sensitive operating system where
filenames are concerned, while DOS and the MacOS are not. For instance, on
a Macintosh, '[Link]', '[Link]', and '[Link]'
are all the same name. If you make a relative hyperlink to '[Link]',
and the file is actually named '[Link]', the link will still be valid. But if
you upload all your pages to a UNIX web server, the link will no longer work.
Be sure to check your filenames before uploading.

Pathnames use the standard UNIX syntax. The UNIX syntax for the parent
directory (the directory that contains the current directory) is "..".

If you were in the [Link] file and were referring to the original document
[Link], your link would look like this:

<A HREF="../[Link]">India</A>

In general, you should use relative links whenever possible because:

a) it is easier to move a group of documents to another location (because the
relative path names will still be valid)

b) it is more efficient connecting to the server

c) there is less to type

However, use absolute pathnames when linking to documents that are not directly
related. For example, consider a group of documents that comprise a user manual.
Links within this group should be relative links. Links to other documents (perhaps
a reference to related software) should use absolute pathnames instead. This
way if you move the user manual to a different directory, none of the links would
have to be updated.
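
How a relative link is turned into a complete address can be tried out with Python's standard `urllib.parse.urljoin`. The sketch below is only an illustration: the server names (example.org, example.net) and all file names are hypothetical and are not taken from the examples in this Unit.

```python
from urllib.parse import urljoin

# Hypothetical address of the page that contains the links.
BASE = "http://www.example.org/docs/guide/start.html"

# A file in the subdirectory "temp" of the current directory.
in_subdirectory = urljoin(BASE, "temp/contents.html")

# A file in the parent directory, reached with "..".
in_parent = urljoin(BASE, "../india.html")

# An absolute URL is used unchanged, whatever the current page is.
absolute = urljoin(BASE, "http://www.example.net/manual.html")

print(in_subdirectory)
print(in_parent)
print(absolute)
```

If the whole `guide` folder were moved to another server, only BASE would change and both relative links would still resolve correctly, whereas the absolute link keeps pointing to the same place.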

11.5.2 URLs
The World Wide Web uses Uniform Resource Locators (URLs) to specify the
location of files on other servers. A URL includes the type of resource being
accessed (e.g., Web, gopher, FTP), the address of the server, and the location
of the file. The syntax is:

scheme://[Link][:port]/path/filename

where scheme is one of the following:

file      a file on your local system
ftp       a file on an anonymous FTP server
http      a file on a World Wide Web server
gopher    a file on a Gopher server
WAIS      a file on a WAIS server
news      a Usenet newsgroup
telnet    a connection to a Telnet-based service
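
The parts of a URL can be pulled apart with Python's standard `urllib.parse.urlsplit`. This is a small sketch; the URL itself is made up for the illustration (example.org is a reserved placeholder domain), and the helper name `describe_url` is invented here.

```python
from urllib.parse import urlsplit


def describe_url(url):
    """Split a URL into the scheme, server address, port and file location."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "server": parts.hostname,
        "port": parts.port,  # None when the URL does not give a port
        "path": parts.path,
    }


with_port = describe_url("http://www.example.org:8080/docs/guide.html")
without_port = describe_url("ftp://ftp.example.org/pub/readme.txt")
print(with_port)
print(without_port)
```

The port is optional, which is why it appears in square brackets in the syntax; when it is left out, the default port for the scheme is used.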

Links to Specific Sections


Anchors can also be used to move a reader to a particular section in a document
(either the same or a different document) rather than to the top, which is the
default. This type of an anchor is commonly called a named anchor because to
create the links, you insert HTML names within the document.

You can also link to a specific section in another document. That information is
presented first because understanding that helps you understand linking within
the same document.

Links between Sections of Different Documents

Suppose you want to set a link from document A ([Link]) to a specific
section in another document ([Link]). Enter the HTML coding for a link to
a named anchor:

[Link]:

In addition to the many institutes, Delhi is also home to

<a href="[Link]#JNU">Jawaharlal Nehru University</a>.

The characters after the hash (#) mark name an anchor within the [Link]
file. They tell the browser what should be displayed at the top of the window when
the link is activated. In other words, the first line in the browser window should
be the Jawaharlal Nehru University heading.

Next, create the named anchor (in this example "JNU") in [Link]:

<H2><A NAME="JNU">Jawaharlal Nehru University</A></H2>

With both of these elements in place, a reader can go directly to the JNU
reference in [Link].

NOTE: links cannot be made to specific sections within a different document
unless either there is write permission to the coded source of that document or
that document already contains in-document named anchors.

Links to Specific Sections within the Current Document


The technique is the same except the filename is omitted.

For example, to link to the JNU anchor from within [Link], enter:

...More information about
<A HREF="#JNU">Jawaharlal Nehru University</A>
is available elsewhere in this document.

Be sure to include the <A NAME=> tag at the place in your document where
you want the link to jump to (<A NAME="JNU">Jawaharlal Nehru University</A>).

Named anchors are particularly useful when you think readers will print a
document in its entirety or when you have a lot of short information you want
to place online in one file.
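
A link to a named anchor works only if the target document really contains that name. The check can be sketched in Python with the standard `urllib.parse.urldefrag` and `html.parser`; the file name `documentB.html`, the class `AnchorNames` and the helper `named_anchors` are hypothetical names used only for this illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag


class AnchorNames(HTMLParser):
    """Collect the NAME values of all <A NAME="..."> anchors."""

    def __init__(self):
        super().__init__()
        self.names = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for key, value in attrs:
                # Attribute names arrive in lower case; values keep their case.
                if key == "name" and value:
                    self.names.add(value)


def named_anchors(markup):
    parser = AnchorNames()
    parser.feed(markup)
    parser.close()
    return parser.names


# Hypothetical content of the target document.
DOCUMENT_B = '<H2><A NAME="JNU">Jawaharlal Nehru University</A></H2>'

# Split a link into the file part and the part after the hash mark.
target_file, fragment = urldefrag("documentB.html#JNU")
same_page_file, same_page_fragment = urldefrag("#JNU")

link_is_valid = fragment in named_anchors(DOCUMENT_B)
print(target_file, fragment, link_is_valid)
```

For a link within the current document the file part is empty, which matches the rule that the filename is simply omitted.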

Mailto Attribute
You can make it easy for a reader to send electronic mail to a specific person
or mail alias by including the mailto attribute in a hyperlink. The format is:

<A HREF="[Link]

For example, enter:

<A HREF="[Link] Publications</a>

to create a link that opens a mail window already addressed to
the JNU Publications Group alias.

Self Check Exercise
5) Mention the different types of links that can be created in a HTML document
Note: i) Write your answer in the space given below.
ii) Check your answer with the answers given at the end of the Unit.

................................................................................................................

11.6 HTML AND THE BROWSER


What is typed as HTML tags can be seen in its actual web display only through
a browser. It is hence necessary to view the web page frequently by
switching to the browser as and when necessary. A Windows-based
system allows you to keep both the editor window and the browser window
open, thus making this easier [Gorman, Dianne]. The popular web browsers
available nowadays are: Internet Explorer from Microsoft Corporation, Netscape
Navigator from Netscape Communications, Firefox from the Mozilla Foundation, Opera
from Opera Software, and so on. All these web browsers interpret HTML tags and
elements, and most can also parse XML, in order to display web documents properly
and to extract the documents' descriptions.

11.7 EXTENSIBLE MARKUP LANGUAGE (XML)


According to the abstract from the XML Specification version 1.0 [World Wide
Web Consortium, 2005]:

"The eXtensible Markup Language (XML) is a subset of SGML that is
completely described in this document (i.e. XML version 1.0 specification).
Its goal is to enable generic SGML to be served, received, and processed
on the Web in the way that is now possible with HTML. XML has been
designed for ease of implementation and for interoperability with both SGML
and HTML."

• XML stands for eXtensible Markup Language.
• XML is a markup language much like HTML, structurally.
• XML was designed to describe data.
• XML tags are not predefined. You must define your own tags.
• XML may use a DTD (Document Type Definition) to describe the data.
• XML with a DTD is designed to be self-descriptive.

11.7.1 Need for XML

The idea of markup was originally to format a particular kind of document. The
markup languages that carry instructions for text processing are known as procedural
markup. But later on, it was felt that markup languages could also be used for
system-to-system information interchange. This was first realized by Charles Goldfarb,
Ed Mosher and Ray Lorie when they were working with legal documents. They
designed the first markup language, known as GML (Generalized Markup Language),
based on the following observations:

• The document processing programs needed to support a common document format;
• The common format needed to be specific to their domain, for example legal
  documents; and
• To achieve a high degree of reliability, the document format would have to
  follow specific rules.

For example, consider a memorandum:

From: Akkamahadevi
To: Suchitra Pattanayak
CC: Prasenjit Kar
Date: 27.01.2002
Subject: Appointment order

We are extremely happy to inform you that you are selected as the
coordinator of Knowledge management team.

If we look into this document we find that there are six fields in it:

• Who sent the document (the From: field)
• Who the document is intended for (the To: field)
• Who has been sent a copy of the document (the CC: field)
• The date the document was written (the Date: field)
• The subject of the document (the Subject: field)
• The document body

So, if we make a fixed structure of this document then whoever writes the
document has to follow same structure. Thus, for porting information from one
system to another it will not be a problem as the structure of document is well
defined. The definition of the structure of document is known as DTD (Document
Type Definition).
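
The role that a fixed structure plays can be imitated in ordinary code. The Python sketch below is not a DTD and does not use any DTD tooling; it is only an assumption-laden illustration in which a list named `REQUIRED_FIELDS` stands in for the structure definition and a hypothetical function `parse_memo` rejects any memo that does not follow it.

```python
# The fields every memo must carry, in this order (stands in for a DTD).
REQUIRED_FIELDS = ["From", "To", "CC", "Date", "Subject"]

MEMO = """\
From: Akkamahadevi
To: Suchitra Pattanayak
CC: Prasenjit Kar
Date: 27.01.2002
Subject: Appointment order

We are extremely happy to inform you that you are selected as the
coordinator of Knowledge management team.
"""


def parse_memo(text):
    """Return the memo fields as a dict; raise ValueError if the memo
    does not follow the fixed structure."""
    header_block, _, body = text.partition("\n\n")
    names = []
    values = {}
    for line in header_block.splitlines():
        name, separator, value = line.partition(":")
        if not separator:
            raise ValueError(f"not a header line: {line!r}")
        names.append(name.strip())
        values[name.strip()] = value.strip()
    if names != REQUIRED_FIELDS:
        raise ValueError(f"expected fields {REQUIRED_FIELDS}, found {names}")
    if not body.strip():
        raise ValueError("memo body is missing")
    values["Body"] = body.strip()
    return values


memo = parse_memo(MEMO)
print(memo["To"], "|", memo["Subject"])
```

Any system that receives a memo accepted by this check knows exactly where each piece of information is, which is the benefit a well-defined document structure gives to information interchange.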

Goldfarb further fine-tuned GML and proposed SGML (Standard Generalized
Markup Language), which was approved by ISO (International Organization
for Standardization) in 1986. SGML is not a markup language in itself but
a meta-language for developing other markup languages. HTML (HyperText Markup
Language) is a derivative of SGML. HTML acts more like a formatting language,
so it is difficult to pull out what kind of data is stored inside an HTML
document. Once this difficulty was understood, the need for domain-specific tags
for information interchange was felt. Development of such tags was not
possible with HTML. Hence, XML was developed. It is often said that XML
is closer to SGML than to HTML.

11.7.2 Objectives of XML

The specification for XML has been developed with the following objectives.

i) XML shall be straightforwardly usable over the Internet.
ii) XML shall be compatible with SGML.
iii) It shall be easy to write programs which process XML.
iv) Processors should be able to read XML documents easily.
v) XML documents should be human-legible and reasonably clear.
vi) The XML design should be prepared quickly.
vii) The design of XML should be formal and concise.
viii) XML documents shall be easy to create.
ix) Terseness in XML markup is of minimal importance.

11.7.3 Features of XML

The problem of preserving semantics can be easily addressed by XML.
HTML has the problem that it does not store the semantics of data. The gravity of
the problem can be understood when someone searches the Internet for books on
Ranganathan: the results fetched by the search engines will include books on
Ranganathan as well as books by Ranganathan.

11.7.4 How XML is Different from HTML

i) XML was designed to attach semantics to data.
HTML has nothing to do with the semantics of data. It only defines how the page
should be presented (font, colour, etc.).

ii) XML is not a replacement for HTML.
Many have the misconception that XML will replace HTML, but in any case
the final presentation is done in HTML format.

iii) XML is about describing information.
HTML is about displaying information.

11.7.5 Advantages of XML

XML does not DO everything

XML is created as a way to structure, store and send information. XML is not
designed to DO everything.
<?xml version="I.O" encoding="UTF-8" ?>
-<book>
<title>Prolegomena to library c1assification<ltitle>
- <author>
<C name>Ranganathan</C name>
<1-name>S.R. </1 -name>
</author>
<edition> 3rd reprint</edition>
<place>Bangalore</place>
<publisher>Sarada Ranganathan Endowment</publisher>
<physical_ desc>640 p. </physical_ desc>
</book>

The example shows the structure of a document which describes a book titled
Prolegomena to library classification. The book has title, author, edition,
place, publisher and physical description elements. Author is further divided
into first name (f_name) and last name (l_name). The actual data is stored
inside these tags. If one views the document in a web browser, the data appears
embedded in the tags without any kind of formatting.
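
As a minimal sketch of how a program reads such a document, the Python standard-library module xml.etree.ElementTree can parse the book record and reach each value through its tag. The variable names (BOOK_XML, record) are illustrative, and the browser's "-" markers are left out because they are not part of the XML itself.

```python
import xml.etree.ElementTree as ET

# The book record from the Unit. Bytes are used so that the encoding
# declaration in the first line is handled by the parser.
BOOK_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<book>
  <title>Prolegomena to library classification</title>
  <author>
    <f_name>Ranganathan</f_name>
    <l_name>S.R.</l_name>
  </author>
  <edition>3rd reprint</edition>
  <place>Bangalore</place>
  <publisher>Sarada Ranganathan Endowment</publisher>
  <physical_desc>640 p.</physical_desc>
</book>"""

root = ET.fromstring(BOOK_XML)

# Each piece of data is reached through the tag that describes it.
record = {
    "title": root.findtext("title"),
    "f_name": root.findtext("author/f_name"),
    "l_name": root.findtext("author/l_name"),
    "place": root.findtext("place"),
}

for field, value in record.items():
    print(f"{field}: {value}")
```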

Customised Tags
In the above-mentioned example, the <book> tag is defined by the person who is
describing the document. Thus, one can see that XML provides the facility to
define user-customised tags. This is contrary to HTML, where the tags are fixed
and predefined. XML is therefore used to create domain-specific tag sets, which
facilitate information interchange within a specific domain. For example,
NewsML was developed for information interchange among news agencies such as
Reuters.

Data Exchange
As XML allows attaching semantics to the data, data can be exchanged between
incompatible systems. In the real world, the data stored in computer systems and
databases, usually are in incompatible formats. One of the most time-consuming
challenges for developers has been to exchange data among such systems over
the Internet. Converting the data to XML greatly reduces complexity, since many
applications can easily read such data.
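
A minimal sketch of such an exchange, assuming a flat record held as a Python dictionary: one system serialises the record to XML text and another rebuilds it from that text. The function names record_to_xml and xml_to_record are hypothetical helpers written for this illustration.

```python
import xml.etree.ElementTree as ET


def record_to_xml(record, root_tag="book"):
    """Serialise a flat dictionary to XML text, one child element per key."""
    root = ET.Element(root_tag)
    for key, value in record.items():
        child = ET.SubElement(root, key)
        child.text = str(value)
    return ET.tostring(root, encoding="unicode")


def xml_to_record(xml_text):
    """Rebuild the flat dictionary from the XML text."""
    root = ET.fromstring(xml_text)
    return {child.tag: child.text for child in root}


record = {
    "title": "Prolegomena to library classification",
    "place": "Bangalore",
    "publisher": "Sarada Ranganathan Endowment",
}

xml_text = record_to_xml(record)      # what the sending system transmits
received = xml_to_record(xml_text)    # what the receiving system rebuilds
print(xml_text)
print(received == record)
```

The receiving side needs to know only the tag names, not the internal storage format of the sending system.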

Share Data
With XML, plain text files can be used to share data. Since XML data is stored
in plain text format, XML provides a software and hardware independent way
of sharing data.

This makes it much easier to create data that different applications can work
with. This also makes it easier to expand or upgrade a system to new operating
systems, servers, applications, and new browsers.

XML can make data more useful


With XML, a user's data is available to more users. Since XML is independent
of hardware, software and application, a user can make his/her data available to
more than only standard HTML browsers.

Other clients and applications can access one's XML files as data sources, just
as they access databases. His/her data can be made available to all kinds of
'reading machines'.

XML can be used to create new languages


XML is the basis of the Wireless Markup Language (WML), the markup language of
the Wireless Application Protocol (WAP). WML, used to mark up Internet
applications for handheld devices like mobile phones, is written in XML.

Self Check Exercise


6) Why is XML needed over HTML?
Note: i) Write your answer in the space given below.
ii) Check your answer with the answers given at the end of the Unit.

11.8 XML SYNTAX AND SEMANTIC TAGS

11.8.1 XML Syntax

Let us consider the first line of the example at 11.7.5,

<?xml version="1.0" encoding="UTF-8" ?>

This line opens with an angle bracket and a question mark (<?) and closes with
a question mark and an angle bracket (?>). It tells the XML parser that the
document follows the XML version 1.0 specification given by W3C and that the
character encoding used for data representation is Unicode Transformation
Format-8 (UTF-8). The second line is -<book>. The leading "-" is not part of the
XML; it is the collapse marker which a web browser displays beside a tag that
has child elements. For each starting tag there is a closing tag, e.g., the tag
<book> ends with the closing tag </book>. <book> has several child elements:
<title>, <author>, <edition>, <place>, <publisher> and <physical_desc>. A child
can have further sub-children, as in the case of <author>.

- <author>
    <f_name>Ranganathan</f_name>
    <l_name>S.R.</l_name>
  </author>

Inside the tags actual data is stored for example,

<title>Prolegomena to library classification</title>

XML tags are case sensitive and should be properly nested

Unlike HTML, XML tags are case sensitive. With XML, the tag <Author> is
different from the tag <author>. Opening and closing tags must therefore be
written with the same case. All XML elements must also be properly nested: an
element opened inside another must be closed before its parent is closed. A
parser rejects improperly nested tags. For example, in the following fragment
<edition> is correctly formed, but <place> is closed while <publisher>, opened
inside it, is still open:

<edition>3rd reprint</edition>
<place>Bangalore
<publisher></place>Sarada Ranganathan Endowment</publisher>
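
Both rules can be checked mechanically. The following is a minimal sketch using Python's standard-library parser; the helper name is_well_formed is an assumption for illustration. The parser raises an error for a tag pair written in different cases and for improperly nested tags.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False if it reports an error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


good = (
    "<book><place>Bangalore</place>"
    "<publisher>Sarada Ranganathan Endowment</publisher></book>"
)
# <place> is closed while <publisher>, opened inside it, is still open.
bad_nesting = (
    "<book><place>Bangalore<publisher></place>"
    "Sarada Ranganathan Endowment</publisher></book>"
)
# <Author> and </author> differ in case, so they do not form a pair.
bad_case = "<book><Author>Ranganathan</author></book>"

for label, text in [("good", good), ("nesting", bad_nesting), ("case", bad_case)]:
    print(label, is_well_formed(text))
```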

a) All XML documents must have a root tag

The first tag in an XML document is the root tag. All XML documents must
contain a single tag pair to define the root element. All other elements must be
nested within the root element. All elements can have sub elements (children).
Sub elements must be correctly nested within their parent element. In the
previously-mentioned example, <book> is the root element and all the other tags
are children of it.

<root>
  <child>
    <subchild> ..... </subchild>
  </child>
</root>
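
The parent and child relationship can be made visible with a short Python sketch that walks the tree from the root downwards. The function name outline is hypothetical. The last lines also show that the standard-library parser refuses a document with two top-level elements.

```python
import xml.etree.ElementTree as ET


def outline(element, depth=0):
    """Return one indented line per element, parents before their children."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


doc = "<root><child><subchild>text</subchild></child></root>"
tree_lines = outline(ET.fromstring(doc))
print("\n".join(tree_lines))

# A document with two top-level elements has no single root.
try:
    ET.fromstring("<book></book><book></book>")
    two_roots_accepted = True
except ET.ParseError:
    two_roots_accepted = False
print("Two roots accepted:", two_roots_accepted)
```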

b) XML Elements
An element is a component of a document. Elements can be made up of other
elements, other types of data, or a descriptive representation that tells the XML
parser about a resource that exists in the document.

Thus,

• XML elements have simple naming rules.
• XML elements are extensible. XML documents can be extended to carry more
information.
• XML elements have relationships. All the elements inside <book> are
child elements of <book>. This relationship indicates that <title>, <author>,
<edition>, <place>, <publisher> and <physical_desc> describe the element
book.

Thus, the tags used, like <book>, <author>, <place>, <publisher>, etc., are elements.

c) Element Naming
XML elements must follow the following naming rules:

• Names can contain letters, numbers, and other characters. For example,
<author1> ... </author1>
• Names must not start with a number or a punctuation character (the underscore
is an exception). For example, it is illegal to have tags like <856> ... </856>
or <:856> ... </:856>
• Names must not start with the letters xml (or XML or Xml ...).
• Names cannot contain spaces. For example, it is illegal to have tags like
<first author> ... </first author>
• Any name can be used, no words are reserved, but the idea is to make
names descriptive. Names with an underscore separator are nice.
Examples: <f_name>, <l_name>.
• Avoid "-" and "." in names. It could be a mess if your software tried to subtract
name from first (f-name) or thought that 'name' is a property of the object 'first'
(first.name).
• Element names can be as long as you like, but names should be short and simple,
for example, <book_title>
not like,
<the_title_of_the_book>
• Non-English letters like é, ò and á are perfectly legal in XML element names,
but watch out for problems if your software vendor does not support them.
• The ":" should not be used in element names because it is reserved for
namespaces.
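
The rules above can be turned into a simple checker. The sketch below is a simplification written for illustration, not the full name grammar of the XML specification: the function check_name and its messages are assumptions, and the pattern only approximates which characters are letters.

```python
import re

# Simplified pattern: a letter or underscore first, followed by letters,
# digits, underscores, hyphens or periods. This approximates, but does
# not reproduce, the name grammar of the XML specification.
NAME_PATTERN = re.compile(r"[^\W\d][\w.\-]*")


def check_name(name):
    """Return a list of problems found in a proposed element name."""
    problems = []
    if ":" in name:
        problems.append("contains ':' (reserved for namespaces)")
    elif not NAME_PATTERN.fullmatch(name):
        problems.append("illegal: bad first character, space or symbol")
    if name.lower().startswith("xml"):
        problems.append("starts with 'xml'")
    if "-" in name or "." in name:
        problems.append("contains '-' or '.' (legal but best avoided)")
    return problems


for candidate in ["book_title", "856", "first author", "xmlbook", "f-name"]:
    print(candidate, "->", check_name(candidate) or "ok")
```

An empty list means that the name passes all the checks in this sketch.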

d) XML Attributes
Attributes are used to provide additional information about elements. In HTML
we often use attributes to get an extra effect while formatting. For example,

<font size="12" color="red">Hello World</font>

will show the text "Hello World" in font size 12 and in red colour. The size and
color used here are pre-defined attributes of <font>.

Similarly, in XML also one can define attributes. XML elements can have
attributes in name/value pairs just like in HTML. Attribute values must be
quoted; it is illegal to omit the quotation marks. The earlier book example can
be extended with an attribute as follows:

1- <?xml version="1.0" encoding="UTF-8"?>
2- <book>
3-   <title>Prolegomena to library classification</title>
4-   <author authorship="primary">
5-     <f_name>Ranganathan</f_name>
6-     <l_name>S.R.</l_name>
7-   </author>
8-   <edition>3rd reprint</edition>
9-   <place>Bangalore</place>
10-  <publisher>Sarada Ranganathan Endowment</publisher>
11-  <physical_desc>640 p.</physical_desc>
12- </book>

(NOTE: Here 1, 2, 3, etc. represent the line numbers of the program.)

Line 4, <author authorship="primary">, has an attribute called authorship
which has the value "primary". One can have any number of attributes associated
with a single element.

There are some problems associated with using attributes:

• attributes cannot contain multiple values (child elements can)
• attributes are not easily expandable (for future changes)
• attributes cannot describe structures (child elements can)
• attributes are more difficult to manipulate by program code
• attribute values are not easy to test against a DTD

So, it is generally better to use child elements instead of attributes to
describe an object.
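
The contrast between an attribute and child elements can be seen in a minimal Python sketch. The variable names are illustrative. The attribute yields a single string, the child elements yield a structure, and an unquoted attribute value is rejected by the parser.

```python
import xml.etree.ElementTree as ET

doc = """<book>
  <author authorship="primary">
    <f_name>Ranganathan</f_name>
    <l_name>S.R.</l_name>
  </author>
</book>"""

author = ET.fromstring(doc).find("author")

# An attribute holds one plain string value.
role = author.get("authorship")

# Child elements can carry structure: here, the parts of the name.
name_parts = {child.tag: child.text for child in author}

print(role)
print(name_parts)

# Omitting the quotation marks around an attribute value is an error.
try:
    ET.fromstring("<author authorship=primary></author>")
    unquoted_accepted = True
except ET.ParseError:
    unquoted_accepted = False
print("Unquoted value accepted:", unquoted_accepted)
```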

11.8.2 Semantic Tags of XML

XML was designed to attach semantics to data, i.e., to add context to the data.
It does so by allowing one to define one's own tags. For example,

<?xml version="1.0" encoding="UTF-8"?>
-<book>
    <title>Prolegomena to library classification</title>
  - <author>
      <f_name>Ranganathan</f_name>
      <l_name>S.R.</l_name>
    </author>
    <edition>3rd reprint</edition>
    <place>Bangalore</place>
    <publisher>Sarada Ranganathan Endowment</publisher>
    <physical_desc>640 p.</physical_desc>
</book>

The example shows the structure of a document which describes a book titled
Prolegomena to library classification. The book has title, author, edition,
place, publisher and physical description elements. Author is further divided
into first name (f_name) and last name (l_name). The actual data is stored
inside these tags. These tags provide context to the whole structure of the
document; hence they are known as semantic tags.

Self Check Exercise

7) What are semantic tags?
Note: i) Write your answer in the space given below.
ii) Check your answer with the answers given at the end of the Unit.

11.9 DOCUMENT TYPE DEFINITION (DTD)


It is possible to define your own structure for an XML document and ask others
to write their XML documents against this schema, so that mistakes are avoided.
A schema is nothing but the logical structure of a document; here the schema is
called a DTD (Document Type Definition). A document whose syntax is correct is
known as a Well-formed document, whether or not a DTD exists for it. A
well-formed document that also conforms to its DTD is called a Valid document.

A DTD must therefore be defined before a document can be Valid. The DTD used
for validation is declared in the document type declaration (<!DOCTYPE ...>)
near the top of the XML file. You may refer to Unit 12 for further discussion
on DTD.
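
The difference between well-formed and valid can be sketched in Python under some stated assumptions. The shortened DTD below (three child elements only) is invented for illustration. The standard-library parser checks well-formedness but does not validate against a DTD, so the function follows_book_model is a hand-written stand-in that tests only the one rule given in the <!ELEMENT book ...> declaration.

```python
import xml.etree.ElementTree as ET

# Hypothetical DTD, shortened for illustration: a book must contain
# exactly title, author and edition, in that order.
DOC = """<?xml version="1.0"?>
<!DOCTYPE book [
  <!ELEMENT book (title, author, edition)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT author (#PCDATA)>
  <!ELEMENT edition (#PCDATA)>
]>
<book>
  <title>Prolegomena to library classification</title>
  <author>Ranganathan</author>
  <edition>3rd reprint</edition>
</book>"""

EXPECTED_CHILDREN = ["title", "author", "edition"]


def follows_book_model(xml_text):
    """Hand-written check of the single content rule declared for <book>."""
    root = ET.fromstring(xml_text)  # raises ParseError if not well-formed
    return root.tag == "book" and [c.tag for c in root] == EXPECTED_CHILDREN


# Removing <edition> leaves the document well-formed but breaks the rule.
incomplete = DOC.replace("  <edition>3rd reprint</edition>\n", "")

complete_ok = follows_book_model(DOC)
incomplete_ok = follows_book_model(incomplete)
print(complete_ok, incomplete_ok)
```

In practice a validating parser would perform this check for every declaration in the DTD; the sketch only shows what kind of mistake validation catches.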

11.10 IMPLICATIONS OF XML IN LIBRARY AND INFORMATION ACTIVITIES
In the context of library and information activities, the most important
implication is that XML can be used as a common platform for information
exchange, provided everyone agrees to a common set of tags. As we know, the many
variant versions of MARC, each of them a 'standard MARC', have created a kind of
non-standardisation. In such a condition XML can be very useful.

XML can also be used in digital libraries. It can be used for document
surrogates, as in a catalogue. On the web, a great amount of bibliographic data
exchange takes place in XML.

With XML one can define the tags. These tags have semantic value; for example,
the 'author' tag contains the name of the author. Once we define a set of tags
in a particular subject field, it becomes easy to transport data from one
machine to another. For example, NewsML <[Link]> is a very good initiative in
this direction, as a lot of news information has to be transferred from one
place to another. The NewsML tag set provides a standard for data interchange
among the news agencies. Currently Reuters is taking care of NewsML.

Searching is another area where XML may be of great help. As it provides
context to the search term, searching becomes efficient. XML can improve the
search efficiency of current search engines. There are projects under
development to identify schemas to perform search. RDF (Resource Description
Framework) is one such initiative in this direction.

Finally, it is sometimes felt that formatted display is a tedious job in XML.
This is because we currently live in the world of HTML, whereas the objective of
XML is not display in the browser but storing data in a more meaningful manner.
However, the technology is evolving rapidly, and interface tools for writing
formatted XML documents may become available in the near future.

Self Check Exercise


8) Describe the library applications of XML.
Note: i) Write your answer in the space given below.
ii) Check your answer with the answers given at the end of the Unit.

11.11 SUMMARY
In this Unit the concepts of WWW, Hypertext, Hypermedia and Markup Language,
which are the foundation of the Web, are discussed. HTML is derived from SGML
and is used to render information on the web; XML is another derivative of SGML.
One should know at least the basic HTML tags in order to put information on the
Internet. Though HTML has certain problems associated with it, for example, the
inability to support efficient search, it is still widely used for web page
design. XML preserves the context of a term as well as its semantics. An XML
file, like an HTML file, is a plain text file, in which one can define his/her
own tags.

11.12 ANSWERS TO SELF CHECK EXERCISES


1) World Wide Web (WWW) is actually a collection of traditional Internet access
methods (FTP, Gopher, Telnet, etc.) and a new communications method called
HyperText Transfer Protocol (HTTP).
WWW uses the concept of a page for viewing information. Each page is
actually a single text file written in something called HyperText Markup
Language (HTML). This HTML file is retrieved from a remote computer,
known as the HTTP Server, by a WWW browser, and is used to determine
the appearance of that particular WWW page. An HTML document can
contain pointers to other HTML documents, graphics, files, sounds, and even
descriptions for buttons and other on-screen elements for displaying data.
This interconnection of HTML documents on computers all over the Internet,
each containing pointers to other HTML documents on other computers on
the Internet, has created a kind of web of virtual documents, and that is how
the term 'web' came about.

2) Hypertext is basically the same as regular text (it can be stored, read,
searched, or edited), with an important exception: hypertext contains
connections within the text to other documents.
When selecting a specific part of a document gives access to another
document, this is known as a hyperlink, and such links can create a complex
virtual web of connections.

Hypermedia is hypertext with a difference: hypermedia documents contain
links not only to other pieces of text, but also to other forms of media such
as sounds, images, animation and movies.

3) The word markup was originally used to describe annotation or other marks
within a text intended to instruct a compositor or typist how a particular
passage should be printed or laid out.

A 'markup language' may be no more than a loose set of markup conventions
used together for encoding texts. A markup language must specify what
markup is allowed and whereabouts, what markup is required, how markup
is to be distinguished from text, and what the markup means.

4) HTML is a content-based structured markup language where the codes describe
what the contents are. Some of the basic tags of HTML are Head, Title, Headings
and Body.
5) Different types of links that can be created are: (i) linking of documents in
other directories or websites, (ii) linking to specific sections of documents,
(iii) linking between sections of different documents, (iv) linking to specific
sections of current documents, etc.

6) eXtensible Markup Language is a kind of markup language.

It has certain advantages over HTML.

• XML can carry data.
• XML was designed to describe data and to focus on what data is.
• HTML is about displaying information. XML is about describing information.
• XML is extensible. One can define one's own tags.
• XML is used to exchange data, which is very difficult with HTML.
• XML is also considered a meta-language. Thus, XML can be used to
create new languages.

7) XML was designed to attach semantics to data, i.e., to add context to the
data. It does so by allowing one to define one's own tags. For example,

<?xml version="1.0" encoding="UTF-8"?>
-<book>
    <title>Prolegomena to library classification</title>
  - <author>
      <f_name>Ranganathan</f_name>
      <l_name>S.R.</l_name>
    </author>
    <edition>3rd reprint</edition>
    <place>Bangalore</place>
    <publisher>Sarada Ranganathan Endowment</publisher>
    <physical_desc>640 p.</physical_desc>
</book>

The example shows the structure of a document which describes a book
titled Prolegomena to library classification. The book has title, author,
edition, place, publisher and physical description elements. Author is further
divided into first name (f_name) and last name (l_name). The actual data is
stored inside these tags. These tags provide context to the whole structure of
the document; hence they are known as semantic tags.

8) XML has implications in the library environment. The first and foremost use
of XML is in information exchange. As we know, we are sitting on a heap of
MARCs, and ironically this heap of standard MARCs has created a kind of
non-standardisation. In such a condition XML can be used as a common
platform for information exchange, provided everyone accepts a common set of
tags.
XML can also be used in digital libraries. It can be used for document
surrogates, as in a catalogue. It would still be an ambitious statement to say
that XML can beat DBMS (Database Management Systems) and can be a solution
for BDBMS (Bibliographic Database Management Systems), but on the web a
great amount of bibliographic data exchange takes place using XML.

Searching is another area where XML is of great help. As it provides context
to the search term, searching becomes efficient, particularly when we agree
to follow a set of tags. XML can improve the search efficiency of current
search engines. There are projects under development to identify schemas to
perform search. RDF (Resource Description Framework) is one initiative in
this direction.

11.13 KEYWORDS
Assistive Technologies : Devices used by people with disabilities to access
computers. Some assistive technologies include text-to-speech screen readers,
alternative keyboards and mice, head pointing devices, voice recognition
software, and screen magnification software.

Attribute : A setting for a tag that affects the way the tag is displayed.

Browser : A program used to access and display web pages. Graphical browsers
can display images and many different text fonts; non-graphical browsers
cannot.

CGI : Common Gateway Interface is a way to allow users to provide information
to scripts attached to web pages, usually through forms.

Cyberspace : The imaginary space users of the web move around in. A metaphor
that many people take almost literally.

Domain Name : The name of an Internet site, for example [Link] or [Link].

Font : A font, strictly speaking, is a set of characters that all belong to
the same size and style of a typeface. For example, Courier.

Forms : The mechanism by which web pages become interactive, allowing users to
supply input to CGI or other scripts.

FTP : File Transfer Protocol, a way to exchange files with other sites on the
Internet.

Gopher : A protocol that is older than HTTP and serves a similar purpose,
allowing users to tunnel through cyberspace in search of information.

Graphic : A picture or illustration, also called image.

HTTP : HyperText Transfer Protocol, the conventions used by web browsers and
servers to transfer web pages.

Hypermedia : A combination of hypertext and multimedia that allows users to
move in a non-linear fashion through text, images, sounds, and other
information.

Hypertext : A collection of documents joined by links so that users can read
it in a variety of different orders.

Image File : A file containing an image.

Indexers : Programs that read pages throughout the web and add a description
of their contents to a database that can be searched by users looking for
specific information.

Link : The anchor tag <A> is used to define both anchors and links. A link is
a directive to a browser: when a user selects a link, a new page is loaded.
Some people call a link a hotlink or hyperlink.

Multimedia : The combination of several different communications techniques:
for example sound, written text, still pictures, and moving pictures.

Nested : An element that is entirely contained within another element. For
example, the phrase 'the quick brown fox' contains a bold element (the word
'quick') nested within an italic element (the entire phrase). Some browsers
will display the word 'quick' only as bold, others will display it as both
bold and italic.

Plug-ins : Software programs that enhance other programs or applications on
your computer. There are plug-ins for Internet browsers, graphics programs,
and other applications.

Server : A program running on an Internet site that makes the web pages at
that site available to browsers throughout the Internet.

Site : Internet website.

Tags : Markup embedded in a document that describes the information it
encloses.

Unicode : The universal character encoding, maintained by the Unicode
Consortium. This encoding standard provides the basis for processing, storage
and interchange of text data in any language in all modern software and ICT
protocols. It was originally designed around 16-bit (two-byte) character
codes; it now covers a larger code space and is stored using encoding forms
such as UTF-8, UTF-16 and UTF-32.

URI : Uniform Resource Identifier. URIs have been known by many names: WWW
addresses, Universal Document Identifiers, Universal Resource Identifiers, and
finally the combination of Uniform Resource Locators (URL) and Names (URN). As
far as HTTP is concerned, Uniform Resource Identifiers are simply formatted
strings that identify, via name, location, or any other characteristic, a
resource.

W3C : An international industry consortium which develops common protocols
that promote WWW evolution and ensure its interoperability. W3C develops
interoperable technologies (specifications, guidelines, software, and tools)
to lead the Web to its full potential as a forum for information, commerce,
communication, and collective understanding.
