Introduction to HTML and XML Basics
Structure
11.0 Objectives
11.1 Introduction
11.2 World Wide Web and Markup Languages
11.3 Standard Generalized Markup Language (SGML)
11.4 HyperText Markup Language (HTML)
11.4.1 Introduction to HTML
11.4.2 Features of HTML
11.4.3 Editor for HTML
11.4.4 Syntax of HTML Commands
11.4.5 Framework of a Web Page
11.0 OBJECTIVES
In the previous Unit, you were acquainted with the guidelines, norms and
standards developed or suggested by various organisations and experts for the
development of content on the Web. The Internet has changed the way
information can be organised. It is necessary to know the forms in which
information can be organised on the net, which we will be studying in this
Unit.
11.1 INTRODUCTION
Today the Internet has changed the way of life in all fields. It has created
instant online connection and communication the world over. Owing to its
feasible technology, the Internet has grown rapidly in the past few years and
gained great popularity. It has been transformed from a purely text-based
environment into a clickable, linkable world. What has made this possible is
the World Wide Web (WWW). The Internet today has become a multimedia
communication channel through which data can be transferred in all formats.
The use of hypermedia and hypertext is so ingrained on the Internet that the
WWW cannot be thought of without multimedia, and WWW has become a synonym
for the term Internet. HyperText Markup Language is a language for rendering
information over the Internet. It can accommodate audio, video, text and
images. It can be said that the basic feature of today's Internet is hypertext
and hyperlinking.
The term markup means instructions for printing in a particular style, just as,
while proofreading, editors mark the text (e.g. by underlining) to have it set
in bold while printing. Similarly, to display electronic text in a web page on
browsers, embedded instructions are given within the text to make the parser
understand how the text should appear on display [Sol, 1999].
But markup is also used for data retrieval, particularly in the library and
information field. Once the structure of a document is fixed, one can easily
find which part of the document contains which kind of data. For example, an
email has a fixed structure, which means it will look like:
To: inder@[Link]
From: rakesh@[Link]
Date: Tue, 26 Jan 2005 01:00:58 -0800
Subject: Memo
Kindly inform me the timetable of the term end examination of the MLIS course.
Regards
Rakesh
To:
From:
Date:
Subject:
Body:
It is very easy to fetch the data from email once the fields are known. This is
a typical example of markup.
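The point about fetching data by field can be sketched in a few lines of Python. This is only an illustration using the standard `email` module; the addresses are hypothetical placeholders, since the real ones are not given in the text.

```python
from email import message_from_string

# Hypothetical message modelled on the memo above; the addresses are
# placeholders, not the ones from the original example.
raw_message = (
    "To: inder@example.org\n"
    "From: rakesh@example.org\n"
    "Date: Tue, 26 Jan 2005 01:00:58 -0800\n"
    "Subject: Memo\n"
    "\n"
    "Kindly inform me the timetable of the term end examination.\n"
)

message = message_from_string(raw_message)

# Each header is fetched by its field name, not by its position in the text.
fields = {name: message[name] for name in ("To", "From", "Date", "Subject")}
body = message.get_payload().strip()

for name, value in fields.items():
    print(f"{name}: {value}")
print("Body:", body)
```

Because the structure is fixed, the program never has to guess where the subject or the sender is; it simply asks for the field by name.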
11.3 STANDARD GENERALIZED MARKUP LANGUAGE (SGML)
SGML aims to give a general structure for other markup languages. Thus, it is
a meta-language which gives rise to other markup languages; for example, XML
(eXtensible Markup Language) is a derivative of SGML. It preserves the
semantics of the text through embedded markup. It is not meant for formatting
text; it was meant to preserve the structure of a document.
SGML is not a text formatting system (although its origins can be readily
traced to the world of electronic text formatting), nor is it a competitor for
such languages as TeX or PostScript. These languages define how the text should
appear on screen or in print. SGML, by contrast, is decidedly unhelpful about
how texts are to be reproduced, but it binds one to a specific document
structure and sequence of elements in the text.
HTML stands for HyperText Markup Language and is a relatively simple language.
An HTML 'page' is a plain text document with markup inserted into it. This
markup includes codes for forming hypertext links. Using it becomes easier if
one understands the basic principles behind it and takes its limitations into
account [Gorman, Dianne].
Most elements consist of paired tags: a start tag and an end tag. For example,
<b> is a start tag and </b> is the end tag. The end tag is similar to the start
tag, except that the element name is preceded by a forward slash. An element's
instruction applies to whatever content is contained between its start and end
tags:
E.g. <b>This text is bold;</b> but this text is not.
There are some misconceptions about HTML:
• HTML is not a page layout language. With only a few exceptions, HTML
tags are concerned with the structure of a document rather than its appearance.
Some of the very basic HTML concepts, tags and features are described below.
Two points need to be noted here: (1) all commands MUST be enclosed within
angle brackets < >; and (2) most commands are used in pairs, wherein
<COMMAND> marks the beginning and </COMMAND> marks the end.
11.4.5 Framework of a Web Page
<HTML>
<HEAD>
</HEAD>
<BODY>
</BODY>
</HTML>
Explanation
The <HTML> </HTML> pair tells the browser that your page is written in HTML.
The <BODY> </BODY> pair encloses the part of your page that will actually be
displayed.
11.5 BASIC HTML TAGS
Some of the basic HTML tags that are used in developing HTML documents are
as follows:
Markup Tags
HTML
This element tells the browser that the file contains HTML-coded information.
The file extensions .html and .htm also indicate that this is an HTML document.
Head
The head element identifies the first part of your HTML-coded document that
contains the title. The title is shown in the title bar of browser's window.
Title
The title element contains your document title and identifies its content in a
global context. The title is typically displayed in the title bar at the top of
the browser window, but not inside the window itself. The title is also what is
displayed on someone's hotlist or bookmark list, so it is better to choose
something descriptive, unique, and relatively short. A title is also used to
identify your page for search engines (such as HotBot or Infoseek). Generally
it is advisable to keep titles to 64 characters or fewer.
Body
The second, and largest, part of an HTML document is the body, which contains
the document content (displayed within the text area of the browser window).
The tags explained below are used within the body of your HTML document.
Headings
HTML has six levels of headings, numbered 1 through 6, with 1 being the
largest. Headings are typically displayed in larger and/or bolder fonts than
normal body text. The first heading in each document should be tagged <H1>.
The syntax of the heading element is:
<Hy>Text of heading</Hy>
where y is a number between 1 and 6 specifying the level of the heading.
For example:
<html>
<head>
<title>IGNOU Homepage</title>
</head>
<body> Welcome to the Home Page of IGNOU. IGNOU is one of the open universities
of India providing distance education courses in different fields.
</body>
</html>
Paragraphs
Unlike documents in most word processors, carriage returns in HTML files are
not significant. In fact, any amount of white space, including spaces,
linefeeds, and carriage returns, is automatically compressed into a single
space when an HTML document is displayed in a browser. Word wrapping can occur
at any point in the source file without affecting how the page will be
displayed.
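The white-space rule just described can be imitated with a small Python sketch. It is a simplification: a real browser applies more detailed rules (for instance, preformatted text is exempt), and the function name `collapse_whitespace` is an illustrative choice, not part of any browser or library.

```python
import re


def collapse_whitespace(source_text: str) -> str:
    """Replace every run of spaces, tabs and line breaks with one space."""
    return re.sub(r"\s+", " ", source_text).strip()


# The source wraps in odd places and mixes several kinds of white space.
source = "Welcome to the world\n   of HTML.\r\n\tThis is the   first paragraph."
print(collapse_whitespace(source))
```

However the source lines are broken, the text that reaches the reader is the same single run of words, which is why paragraphs have to be marked explicitly.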
In the source file there is a line break between the sentences. A Web browser
ignores this line break and starts a new paragraph only when it encounters
another <P> tag.
Important: You must indicate paragraphs with <P> elements. A browser ignores
any indentations or blank lines in the source text. Without <P> elements, the
document becomes one large paragraph. (One exception is text tagged as
'preformatted,' which is explained below.) For example, the following would
produce identical output as the first example:
<H1>Level-one heading</H1>
<P> Welcome to the world of HTML. This is the
first paragraph. While short it is still a
paragraph! </P> <P>And this is the second paragraph.</P>
NOTE: The </P> closing tag may be omitted. This is because the browser
understands that when it encounters a <P> tag, the previous paragraph has
ended. However, since HTML now allows certain attributes to be assigned to the
<P> tag, it is generally a good idea to include the closing tag.
Using the <P> and </P> as a paragraph container means that you can center a
paragraph by including the ALIGN=alignment attribute in your source file.
<P ALIGN=CENTER>
This is a centered paragraph.
</P>
This is a centered paragraph.
Lists
HTML supports unnumbered, numbered, and definition lists. Nested lists can also
be used, but use this feature sparingly because too many nested items can
become difficult to follow.
Unnumbered Lists
To make an unnumbered, bulleted list:
a) start with an opening list <UL> (for unnumbered list) tag
b) enter the <LI> (list item) tag followed by the individual item; no closing
</LI> tag is needed
c) end the entire list with a closing list </UL> tag
<UL>
<LI> apples
<LI> bananas
<LI> grapefruit
</UL>
The <LI> items can contain multiple paragraphs. Indicate the paragraphs with
the <P> paragraph tags.
Numbered Lists
A numbered list (also called an ordered list, from which the tag name derives)
is identical to an unnumbered list, except it uses <OL> instead of <UL>. The
items are tagged using the same <LI> tag. The following HTML code:
<OL>
<LI> oranges
<LI> peaches
<LI> grapes
</OL>
produces this formatted output:
1. oranges
2. peaches
3. grapes
Definition Lists
A definition list (coded as <DL>) usually consists of alternating a definition
term (coded as <DT>) and a definition description (coded as <DD>). Web browsers
generally format the definition on a new line and indent it.
<DL>
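<DT> IGNOU
<DD> IGNOU, Indira Gandhi National Open University is located in New Delhi.
<DT> IISc
<DD> IISc, the Indian Institute of Science is located in Bangalore.
</DL>
The output looks like:
IGNOU
    IGNOU, Indira Gandhi National Open University is located in New Delhi.
IISc
    IISc, the Indian Institute of Science is located in Bangalore.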
The <DT> and <DD> entries can contain multiple paragraphs (indicated by <P>
paragraph tags), lists, or other definition information.
Nested Lists
Lists can be nested. You can also have a number of paragraphs, each containing
a nested list, in a single list item.
[Link]
The chief power of HTML comes from its ability to link text and/or an image to
another document or section of a document, thus weaving 'a web' of resources.
A browser highlights the identified text or image with colour and/or underlines
to indicate that it is a hypertext link (often shortened to hyperlink or just
link).
HTML's single hypertext-related tag is <A>, which stands for anchor. To include
an anchor in your document:
c) enter the text that will serve as the hypertext link in the current document
d) enter the ending anchor tag: </A> (no space is needed before the end
anchor tag)
This entry makes the word 'Hello' the hyperlink to the document [Link],
which is in the same directory as the first document.
You can link to documents in other directories by specifying the relative path
from the current document to the linked document. For example, a link to a file
[Link] located in the subdirectory temp would be:
These are called relative links because you are specifying the path to the linked
file relative to the location of the current file. You can also use the absolute
pathname (the complete URL) of the file, but relative links are more efficient
in accessing a server.
They also have the advantage of making your documents more 'portable' - for
instance, you can create several web pages in a single folder on your local
computer, using relative links to hyperlink one page to another, and then upload
the entire folder of web pages to your web server. The pages on the server will
then link to other pages on the server, and the copies on your hard drive will still
point to the other pages stored there.
Pathnames use the standard UNIX syntax. The UNIX syntax for the parent
directory (the directory that contains the current directory) is "..".
If you were in the [Link] file and were referring to the original document
[Link], your link would look like this:
In general, you should use relative links whenever possible because:
However, use absolute pathnames when linking to documents that are not directly
related. For example, consider a group of documents that comprise a user manual.
Links within this group should be relative links. Links to other documents (perhaps
a reference to related software) should use absolute pathnames instead. This
way if you move the user manual to a different directory, none of the links would
have to be updated.
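How a browser turns a relative link into a full address can be sketched with Python's standard `urllib.parse.urljoin`. The server name and file names below are hypothetical placeholders chosen for illustration; they do not come from the text.

```python
from urllib.parse import urljoin

# Hypothetical address of the document that contains the links.
current_page = "http://www.example.org/manual/chapter1.html"

# A file in the same directory as the current document.
same_directory = urljoin(current_page, "chapter2.html")

# A file in the subdirectory "temp" of the current directory.
subdirectory = urljoin(current_page, "temp/notes.html")

# ".." climbs to the parent directory, following the UNIX syntax.
parent_directory = urljoin(current_page, "../index.html")

# An absolute pathname (a complete URL) ignores the current location.
absolute = urljoin(current_page, "http://www.example.com/software.html")

for resolved in (same_directory, subdirectory, parent_directory, absolute):
    print(resolved)
```

If the whole manual folder moves to another directory or server, only `current_page` changes; the relative links still resolve correctly, while the absolute one keeps pointing to the same outside document.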
11.5.2 URLs
The World Wide Web uses Uniform Resource Locators (URLs) to specify the
location of files on other servers. A URL includes the type of resource being
accessed (e.g., Web, gopher, FTP), the address of the server, and the location
of the file. The syntax is:
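A minimal sketch of how the three parts named above (type of resource, server address, file location) can be separated, using Python's standard `urllib.parse.urlparse`. The URL itself is a hypothetical placeholder.

```python
from urllib.parse import urlparse

# Hypothetical URL used only to show the three parts.
url = "http://www.example.org/library/catalogue.html"
parts = urlparse(url)

print("Type of resource:", parts.scheme)
print("Server address:  ", parts.netloc)
print("File location:   ", parts.path)
```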
You can also link to a specific section in another document. That information is
presented first because understanding that helps you understand linking within
the same document.
[Link]:
The characters after the hash (#) mark are used as a label (a named anchor)
within the [Link] file. The label tells the browser what should be displayed
at the top of the window when the link is activated. In other words, the first
line in the browser window should be the Jawaharlal Nehru University heading.
Next, to create the named anchor (in this example "JNU") in [Link]:
With both of these elements in place, a reader can go directly to the JNU
reference in [Link].
For example, to link to the JNU anchor from within [Link], enter:
Be sure to include the <A NAME=> tag at the place in your document where
you want the link to jump to (<A NAME="JNU">Jawaharlal Nehru University</A>).
Named anchors are particularly useful when you think readers will print a
document in its entirety or when you have a lot of short information you want
to place online in one file.
Mailto Attribute
You can make it easy for a reader to send electronic mail to a specific person
or mail alias by including the mailto attribute in a hyperlink. The format is:
<A HREF="[Link]
to create a mail window that is already addressed to the JNU Publications
Group alias.
Self Check Exercise
5) Mention the different types of links that can be created in an HTML document.
Note: i) Write your answer in the space given below.
ii) Check your answer with the answers given at the end of the Unit.
• XML may use a DTD (Document Type Definition) to describe the data.
• XML with a DTD is designed to be self-descriptive.
The idea of markup was to format a particular kind of document. The markup
languages that carry instructions for text processing are known as procedural
markup. But later on, it was felt that markup languages could be used for
system-to-system information interchange. This was first realized by Charles
Goldfarb, Ed Mosher and Ray Lorie when they were working with legal documents.
They designed the first markup language, known as GML (Generalized Markup
Language), based on the following observations:
• The document processing programs needed to support a common document format;
• The common format needed to be specific to their domain, for example legal
documents; and
• To achieve a high degree of reliability, the document format would have to
follow specific rules.
From: Akkamahadevi
To: Suchitra Pattanayak
CC: Prasenjit Kar
Date: 27.01.2002
Subject: Appointment order
We are extremely happy to inform you that you are selected as the
coordinator of Knowledge management team.
If we look at this document, we find that it has six fields. If we fix the
structure of this document, then whoever writes the document has to follow the
same structure. Porting information from one system to another is then not a
problem, as the structure of the document is well defined. The definition of
the structure of a document is known as a DTD (Document Type Definition).
Goldfarb further fine-tuned GML and proposed SGML (Standard Generalized Markup
Language), which was approved by ISO (International Organization for
Standardization) in 1986. It was not a markup language in itself but a
meta-language for developing other markup languages. HTML (HyperText Markup
Language) is a derivative of SGML. HTML acts more like a formatting language,
so it is difficult to work out what kind of data is stored inside an HTML
document. Once this difficulty was understood, the need for domain-specific
tags for information interchange was felt. Developing such tags was not
possible with HTML. Hence, XML was developed. It is often said that XML is
closer to SGML than to HTML.
The specification for XML has been developed with the following objectives.
Many have the misconception that XML will replace HTML; whatever the case, the
final presentation is still done in HTML format.
iii) XML is about describing information.
HTML is about displaying information.
XML is created as a way to structure, store and send information. XML is not
designed to DO everything.
<?xml version="1.0" encoding="UTF-8" ?>
-<book>
<title>Prolegomena to library classification</title>
- <author>
<f_name>Ranganathan</f_name>
<l_name>S.R.</l_name>
</author>
<edition>3rd reprint</edition>
<place>Bangalore</place>
<publisher>Sarada Ranganathan Endowment</publisher>
<physical_desc>640 p.</physical_desc>
</book>
The example shows the structure of a document which describes a book titled
Prolegomena to library classification. The book has title, author, edition,
place, publisher and physical description elements. Author is further divided
into first name (f_name) and last name (l_name). The actual data is stored
inside these tags. If one views the document in a web browser, the data appears
embedded in the tags without any kind of formatting.
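Because each piece of data sits inside a named tag, a program can ask for it by name. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; the record is the book example with the browser's "-" markers removed, and the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

book_xml = """<?xml version="1.0" encoding="UTF-8"?>
<book>
  <title>Prolegomena to library classification</title>
  <author>
    <f_name>Ranganathan</f_name>
    <l_name>S.R.</l_name>
  </author>
  <edition>3rd reprint</edition>
  <place>Bangalore</place>
  <publisher>Sarada Ranganathan Endowment</publisher>
  <physical_desc>640 p.</physical_desc>
</book>"""

# Parse the record; bytes are passed so the encoding declaration is honoured.
root = ET.fromstring(book_xml.encode("utf-8"))

# Child elements of <book>, in document order.
child_tags = [child.tag for child in root]

# Data is retrieved by tag name; <author> has its own sub-children.
title = root.findtext("title")
author_first = root.findtext("author/f_name")
author_last = root.findtext("author/l_name")

print(child_tags)
print(title)
print(author_first, author_last)
```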
Customised Tags
In the above-mentioned example, the <book> tag is defined by the person who is
describing the document. Thus, one can see that XML provides the facility to
define user-customized tags. It is contrary to HTML where the tags are fixed
and predefined. So XML is used to create domain-specific tag sets, which
facilitate information interchange within a specific domain. For example,
NewsML was developed for information interchange among news agencies such as
Reuters and others.
Data Exchange
As XML allows attaching semantics to the data, data can be exchanged between
incompatible systems. In the real world, the data stored in computer systems and
databases, usually are in incompatible formats. One of the most time-consuming
challenges for developers has been to exchange data among such systems over
the Internet. Converting the data to XML greatly reduces complexity, since many
applications can easily read such data.
Share Data
With XML, plain text files can be used to share data. Since XML data is stored
in plain text format, XML provides a software and hardware independent way
of sharing data.
This makes it much easier to create data that different applications can work
with. This also makes it easier to expand or upgrade a system to new operating
systems, servers, applications, and new browsers.
Other clients and applications can access one's XML files as data sources, like
they are accessing databases. His/her data can be made available to all kinds of
'reading machines'.
11.8 XML SYNTAX AND SEMANTIC TAGS
The first line of the example, <?xml version="1.0" encoding="UTF-8"?>, opens
and closes with an angle bracket and a question mark. It tells the XML parser
that the document follows the XML version 1.0 specification given by W3C and
that the character encoding used for data representation is Unicode
Transformation Format-8 (UTF-8). The second line is -<book>; the leading "-" is
the collapsible marker a browser shows for a tag that has child elements. For
each starting tag there is a closing tag, e.g. the tag <book> ends with the
closing tag </book>. <book> has several child elements: <title>, <author>,
<edition>, <place>, <publisher> and <physical_desc>. A child can have further
sub-children, as in the case of <author>.
- <author>
<f_name>Ranganathan</f_name>
<l_name>S.R.</l_name>
</author>
Unlike HTML, XML tags are case sensitive. With XML, the tag <Author> is
different from the tag <author>. Opening and closing tags must therefore be
written with the same case. All XML elements must be properly nested. Improper
nesting of tags makes no sense to parser. For example,
<root>
<child>
<subchild> .....</subchild>
</child>
</root>
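The two rules above (matching case and proper nesting) can be checked mechanically. The sketch below uses Python's standard `xml.etree.ElementTree`, which raises `ParseError` for text that is not well formed; the helper name `is_well_formed` and the sample strings are illustrative assumptions.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the parser accepts the text, False otherwise."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


samples = {
    "properly nested": "<root><child><subchild>text</subchild></child></root>",
    "improperly nested": "<root><child><subchild>text</child></subchild></root>",
    "mismatched case": "<Author>Ranganathan</author>",
}

for label, xml_text in samples.items():
    print(f"{label}: {is_well_formed(xml_text)}")
```

Only the first sample is accepted. An HTML browser would usually tolerate the other two, which is one practical difference between the two languages.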
b) XML Elements
An element is a component of a document. Elements can be made up of other
elements, other types of data, or a descriptive representation that tells the
XML parser about a resource that exists in the document.
Thus, the tags used, like <book>, <author>, <place>, <publisher>, etc., are
elements.
c) Element Naming
XML elements must follow the following naming rules:
• Names can contain letters, numbers, and other characters. For example,
<author1> ... </author1>
• Names must not start with a number or a punctuation character. For example,
it is illegal to have tags like <856> ... </856> or <:856> ... </:856>
• Names must not start with the letters xml (or XML or Xml ..).
• Names cannot contain spaces. For example, it is illegal to have tags like
<first author> ... </first author>
• Any name can be used, no words are reserved, but the idea is to make
names descriptive. Names with an underscore separator are nice.
• Avoid "-" and "." in names. It could be a mess if your software tried to
subtract name from first (first-name) or thought that 'name' is a property of
the object 'first' (first.name).
• Element names can be as long as you like, but names should be short and
simple, for example, <book_title>
not like,
<the title of the book>
• Non-English letters like éòá are perfectly legal in XML element names, but
watch out for problems if your software vendor does not support them.
• The ":" should not be used in element names because it is reserved for
namespaces.
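The naming rules listed above can be approximated by a short checker. This is a simplified sketch of the rules as stated in this Unit, not the full name grammar of the XML specification; the function name `follows_naming_rules` is a hypothetical choice, and the colon is rejected here only because the text advises against it.

```python
import re

# First character: a letter or underscore (no digit, no punctuation).
# Remaining characters: letters, digits, underscore, "." or "-".
# Spaces and ":" are rejected, following the rules listed in the text.
_NAME_PATTERN = re.compile(r"[^\W\d][\w.\-]*")


def follows_naming_rules(name: str) -> bool:
    """Check a candidate element name against the simplified rules."""
    if name.lower().startswith("xml"):
        return False
    return _NAME_PATTERN.fullmatch(name) is not None


for candidate in ("author1", "book_title", "856", ":856", "first author", "xmlbook"):
    print(f"{candidate!r}: {follows_naming_rules(candidate)}")
```

Names such as first-name pass this check because they are legal, even though the text recommends avoiding them.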
d) XML Attributes
Attributes are used to provide additional information about elements. In HTML
we often use attributes to get an extra effect while formatting. For example,
will show the "Hello World" text in font size 12 and in red. The size and
colour used are nothing but pre-defined attributes of the <font> tag.
Similarly, in XML one can also define attributes. Attribute values must be
quoted; it is illegal to omit the quotation marks. XML elements can have
attributes in name/value pairs just like in HTML. It further extends file [Link] as:
So, it is always good to use child elements instead of attributes to describe
an object.
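The difference between an attribute and a child element, and the quoting rule, can be seen in a small sketch with Python's standard `xml.etree.ElementTree`. The two records are hypothetical variants of the book example written for this illustration.

```python
import xml.etree.ElementTree as ET

# The same fact recorded once as an attribute and once as a child element.
with_attribute = ET.fromstring(
    '<book edition="3rd reprint"><title>Prolegomena</title></book>'
)
with_child = ET.fromstring(
    "<book><edition>3rd reprint</edition><title>Prolegomena</title></book>"
)

edition_from_attribute = with_attribute.get("edition")
edition_from_child = with_child.findtext("edition")
print(edition_from_attribute, "|", edition_from_child)

# An unquoted attribute value is rejected by the parser.
try:
    ET.fromstring("<book edition=3rd><title>Prolegomena</title></book>")
    unquoted_accepted = True
except ET.ParseError:
    unquoted_accepted = False
print("Unquoted attribute accepted:", unquoted_accepted)
```

Both forms carry the same value, but a child element can later be given sub-children of its own, which an attribute cannot.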
Self Check Exercise
7) What are semantic tags?
Note: i) Write your answer in the space given below.
ii) Check your answer with the answers given at the end of the Unit.
A DTD can be defined for a valid document. The declaration of the DTD used for
validation is given in the processing tag of the XML file. You may refer to
Unit 12 for further discussion on DTD.
XML can also be used in digital libraries. It can be used as a document
surrogate, such as a catalogue. On the web a great amount of bibliographic data
exchange takes place in XML.
With XML one can define the tags. These tags have semantic value; for example,
the 'author' tag contains the name of the author. Once we define a set of tags
in a particular subject field, it becomes easy to transport data from one
machine to another. For example, NewsML <[Link] is a very good initiative in
this direction, as a lot of news information has to be transferred from one
place to another. The NewsML tag set provides a standard for data interchange
among the news agencies. Currently Reuters is taking care of NewsML.
Finally, it is sometimes felt that formatted display is a tedious job in XML.
This is because currently we are in the world of HTML, and the objective of XML
is not display in a browser but storing data in a more meaningful manner. But
the technology is so fluid that interface tools for writing formatted XML
documents may become available in the near future.
11.11 SUMMARY
In this Unit the concepts of WWW, Hypertext, Hypermedia and Markup Language,
which are the foundation of the Internet, are discussed. XML is another
derivative of SGML, which is also used to render information on the web. One
should know at least the basic HTML tags to put information on the Internet.
Though HTML has certain problems associated with it, for example its inability
to support efficient search, it is still widely used for web page design. XML
preserves the context of a term as well as its semantics. An XML file, like an
HTML file, is a plain text file, in which one can define one's own tags.
3) The word markup was originally used to describe annotation or other marks
within a text intended to instruct a compositor or typist how a particular passage
7) XML was designed to attach semantics to data, i.e., to add context to the
data. It does so by allowing one to define one's own tags. For example,
<?xml version="1.0" encoding="UTF-8"?>
-<book>
<title>Prolegomena to library classification</title>
- <author>
<f_name>Ranganathan</f_name>
<l_name>S.R.</l_name>
</author>
<edition>3rd reprint</edition>
<place>Bangalore</place>
<publisher>Sarada Ranganathan Endowment</publisher>
<physical_desc>640 p.</physical_desc>
</book>
8) XML can have implications in the library environment. The first and foremost
use of XML is in information exchange. As we know, we are sitting on a heap of
MARCs, and ironically this heap of standard MARCs has created a kind of
non-standardization. In such a condition XML can be used as a common platform
for information exchange, provided everyone at least accepts a common set of
tags.
XML can also be used in digital libraries. It can be used as a document
surrogate, such as a catalogue. It would still be an ambitious statement to
make that XML can beat DBMS (Database Management Systems) and can be a solution
for BDBMS (Bibliographic Database Management Systems), but on the web a great
amount of bibliographic data exchange takes place using XML.
11.13 KEYWORDS
Assistive Technologies : Devices used by people with disabilities to access
computers. Some assistive technologies include text-to-speech screen readers,
alternative keyboards and mice, head pointing devices, voice recognition
software, and screen magnification software.
Attribute : A setting for a tag that affects the way the tag is displayed.
Browser : A program used to access and display web pages. Graphical browsers
can display images and many different text fonts; non-graphical browsers
cannot.
CGI : Common Gateway Interface is a way to allow users to provide information
to scripts attached to web pages, usually through forms.
Cyberspace : The imaginary space users of the web move around in. A metaphor
that many people take almost literally.
Domain Name : The name of an Internet site, for example [Link] or [Link].
Font : A font, strictly speaking, is a set of characters that all belong to
the same size and style of a typeface. For example, Courier.
Forms : The mechanism by which web pages become interactive, allowing users to
supply input to CGI or other scripts.
FTP : File Transfer Protocol, a way to exchange files with other sites on the
Internet.
Gopher : A protocol that is older than HTTP and serves a similar purpose,
allowing users to tunnel through cyberspace in search of information.
Graphic : A picture or illustration, also called image.
HTTP : HyperText Transfer Protocol, the conventions used by web browsers and
servers to transfer web pages.
Blue Book (1988). Volume VIII - Fascicle VIII.8, Data communication networks
directory, recommendations X.500-X.521. CCITT.
Devika, P.M. (2003). Introduction to XML and HTML. In: PGDLAN Course material,
MLI-006, Unit 8. New Delhi: Indira Gandhi National Open University.
Gorman, Diane. Introduction to HTML: understanding HTML. <http://[Link]>.
Gorman, Dianne. SGML and HTML: a guide to resources. <[Link]sgml>.
Horton, M., and R. Adams (1987). Standard for interchange of USENET messages,
RFC 1036. AT&T Bell Laboratories, Center for Seismic Studies.
Kantor, B., and P. Lapsley (1986). Network News Transfer Protocol: a proposed
standard for the stream-based transmission of news, RFC 977. UC San Diego &
UC Berkeley.
Lang, R., and Wright, R. (1992). RFC 1292 - a catalog of available X.500
implementations. <[Link]>.
Weider, C., Reynolds, J., and Heker, S. (1992). RFC 1309 - technical overview
of directory services using the X.500 Protocol. <[Link][Link]>.
World Wide Web Consortium. (2001). About the World Wide Web. <http://[Link]/WWW>.
World Wide Web Consortium. (2005). Extensible Markup Language (XML). <[Link]>.