Introduction to XML
by Nikita Bais
Table of Contents
Markup Languages
What is XML?
The Difference Between XML and HTML
How Can XML Be Used?
XML Structure
XML Syntax
Valid vs Well Formed XML
Document Type Definition (DTD)
Markup Languages
Markup
The term refers to tagging electronic documents in order to:
  Modify the look and formatting of a document (e.g. bold and italic fonts, font sizes, text indents)
  Set up the structure of a document and define its semantic meaning
Examples of document formats that use markup: HTML, RTF, SGML, XML
Classification Of Markup Languages
Specific Markup Language
Generates code that is specific to a particular application. Ex: HTML, RTF
Generalized Markup Language
Describes only the structure of a document, not its formatting; the syntax is strictly enforced. Ex: SGML, XML
XML Basics
What is XML?
XML stands for eXtensible Markup Language
XML was designed to carry data, not to display data
XML tags are not predefined; users can define their own tags
XML is designed to be self-descriptive
XML is a W3C Recommendation
XML documents can be validated using a DTD
Difference Between XML and HTML
XML is not a replacement for HTML; XML is a complement to HTML.
XML and HTML were designed with different goals:
  XML was designed to transport and store data, with a focus on what the data is
  HTML was designed to display data, with a focus on how the data looks
HTML is about displaying information, while XML is about carrying information.
How can XML be used?
XML separates data from HTML
XML simplifies data sharing
XML simplifies data transport
XML is used to create new Internet languages
XML Structure
An XML document has both a logical and a physical structure
Logical Structure
  Indicates how the document is built, as opposed to what the document contains.
Physical Structure
  The content used in the document.
XML Logical Structure
Prolog
The first structural element in an XML document; it is optional.
The prolog consists of two basic components:
  The XML declaration (all in lower case): <?xml version="1.0"?>
  The Document Type Declaration: <!DOCTYPE root-element SYSTEM "filename">
Document Element
  Follows the prolog
  The heart of an XML document, where the actual content resides
XML Physical Structure
The physical structure of an XML document is composed of all the content used in the document.
The data is stored in the form of entities.
Ex: Predefined entities in XML
Entity Reference    Character
&lt;                <
&gt;                >
&amp;               &
&quot;              "
&apos;              '
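As a rough illustration of the predefined entities, the sketch below uses Python's standard library to escape reserved characters before placing text inside an element, and then parses the result back. The sample string and the element name note are made up for this example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Sample text (hypothetical) containing characters that are reserved in XML.
raw = "if a < b & b > c"

# escape() replaces &, < and > with &amp;, &lt; and &gt;.
escaped = escape(raw)

# Quotes are only escaped when asked for via the optional mapping.
escaped_quotes = escape('say "hi"', {'"': "&quot;", "'": "&apos;"})

# The parser turns the entity references back into the characters.
parsed_text = ET.fromstring("<note>" + escaped + "</note>").text

print(escaped)
print(escaped_quotes)
print(parsed_text)
```

The parser resolves the five predefined references without any DTD, which is why they can be used in every XML document.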
What is an Entity?
An entity is a storage unit
Each entity is identified by a unique name
Entities are declared in the DTD and can be referenced anywhere in the XML document
The processor retrieves the contents of an entity when it is referenced in the XML document
Entity declaration
Syntax: <!ENTITY entity-name "entity-value">
Example
  DTD: <!ENTITY writer "Donald Duck.">
  XML document: <author>&writer;</author>
(An entity reference has three parts: an ampersand (&), the entity name, and a semicolon (;).)
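The declaration and reference above can be tried out with a short sketch. It assumes Python's standard xml.etree.ElementTree module, which expands internal entities declared in the document's own DOCTYPE; the variable names are invented for this example.

```python
import xml.etree.ElementTree as ET

# The entity is declared in an internal DTD subset and referenced in the body.
DOCUMENT = """<?xml version="1.0"?>
<!DOCTYPE author [
  <!ENTITY writer "Donald Duck.">
]>
<author>&writer;</author>"""

author = ET.fromstring(DOCUMENT)

# The parser has already replaced &writer; with the entity value.
author_text = author.text
print(author_text)
```

Entity expansion can be abused with deliberately nested declarations, so this kind of parsing is best kept to documents from a trusted source.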
Internal and External Entities
Internal entities
  Require no separate storage
  Contents are provided in the declaration
Syntax: <!ENTITY entity-name "entity-value">
Example: <!ENTITY writer "Donald Duck.">
External Entities
Require separate storage
Refer to a storage unit in the declaration by using a SYSTEM or PUBLIC identifier
Syntax
  <!ENTITY entity-name SYSTEM "URI/URL">
Example
  <!ENTITY MyImage SYSTEM "[Link]" NDATA GIF>
In addition to a SYSTEM identifier, an entity declaration can include a PUBLIC identifier
A PUBLIC identifier provides an alternative way to retrieve the content of an entity
A PUBLIC identifier is useful when working with an entity that is publicly available
Ex: <!ENTITY MyImage PUBLIC "-//Images//Text Standard Images//EN" "[Link]" NDATA GIF>
Parsed Entity
An entity made up of parsable text (any text data)
The XML processor extracts the content of the entity
The content of the entity appears at the location of the entity reference in the XML document
Example:
  <!ENTITY writer "Donald Duck.">
  (Entity declaration: writer contains "Donald Duck.")
  <author>&writer;</author>
  (The reference to the writer entity is replaced with "Donald Duck.")
Unparsed Entity
An entity that is not parsed by the XML processor
Its content may or may not be text; if it is text, it is not treated as parsable XML
Sometimes referred to as a binary entity, because its content is often a binary file (e.g. an image)
Requires a notation that identifies the format, or type, of the resource the entity refers to
Example
  Entity declaration: <!ENTITY MyImage SYSTEM "[Link]" NDATA GIF>
  Notation declaration: <!NOTATION GIF SYSTEM "//Utils/[Link]">
  (This specifies that the XML processor should use [Link] to process entities of type GIF)
XML Syntax
Opening and Closing tags
XML requires that a closing tag be used for every element
Example:
<EMAIL>
  <TO>Ashish</TO>
  .
</EMAIL>
The EMPTY-ELEMENT tag
Shortcut for an empty element (an element containing no data)
Example:
If the CC element does not contain data, it can be written as:
<CC></CC> or <CC/>
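To check that the two spellings of an empty element are equivalent, here is a small sketch using Python's standard xml.etree.ElementTree module. The helper name describe is invented for this example.

```python
import xml.etree.ElementTree as ET

# The same empty CC element, written with a closing tag and with the shortcut.
long_form = ET.fromstring("<EMAIL><CC></CC></EMAIL>")
short_form = ET.fromstring("<EMAIL><CC/></EMAIL>")

def describe(email):
    """Return the tag, text and number of children of the CC element."""
    cc = email.find("CC")
    return (cc.tag, cc.text, len(cc))

print(describe(long_form))
print(describe(short_form))
```

Both calls report the same tag, no text and no children, so a parser sees no difference between the two forms.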
Attributes
Attributes provide a way of associating values with an element
XML elements can have attributes in name/value pairs, just like in HTML
Attribute values must always be quoted
Example:
<EMAIL DATE="14/02/2011"> </EMAIL>
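A minimal sketch of reading an attribute, assuming Python's standard xml.etree.ElementTree module. The PRIORITY attribute and its fallback value are hypothetical and only show what happens when an attribute is absent.

```python
import xml.etree.ElementTree as ET

email = ET.fromstring('<EMAIL DATE="14/02/2011"><TO>Ashish</TO></EMAIL>')

# Attributes are exposed as name/value pairs on the element.
date = email.get("DATE")

# get() takes a fallback for attributes that are not present (hypothetical name).
priority = email.get("PRIORITY", "normal")

print(date, priority, email.attrib)
```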
Valid Vs Well Formed XML
Valid XML
XML that has been validated against a DTD is "valid" XML
It obeys all the validity constraints identified in the XML specification
Example: Validity constraint: Required Attribute
If the default declaration is the keyword #REQUIRED, then the attribute must be specified for all elements of the type named in the attribute-list declaration.
  Syntax: <!ATTLIST element-name attribute-name attribute-type #REQUIRED>
  DTD: <!ATTLIST person number CDATA #REQUIRED>
  Valid XML: <person number="5677" />
  Invalid XML: <person />
Well formed XML
An XML document with correct XML syntax
XML syntax rules:
  XML documents must have a root element
  XML elements must have a closing tag
  XML tags are case sensitive
  XML elements must be properly nested
  XML attribute values must be quoted
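The syntax rules above can be exercised with a short sketch. It assumes Python's standard xml.etree.ElementTree module, which checks well-formedness only and does not validate against a DTD; the function name is_well_formed and the sample snippets are invented for this example.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts text as well-formed XML."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True

# One correct sample, then one violation of each rule listed above.
SAMPLES = {
    "ok": "<EMAIL><TO>Ashish</TO></EMAIL>",
    "two root elements": "<EMAIL></EMAIL><EMAIL></EMAIL>",
    "missing closing tag": "<EMAIL><TO>Ashish</EMAIL>",
    "case mismatch": "<EMAIL><TO>Ashish</to></EMAIL>",
    "bad nesting": "<EMAIL><TO><CC>Rahul</TO></CC></EMAIL>",
    "unquoted attribute": "<EMAIL DATE=14/02/2011></EMAIL>",
}

results = {name: is_well_formed(text) for name, text in SAMPLES.items()}
for name, ok in results.items():
    print(name, "->", ok)
```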
Well Formed XML Example
<?xml version="1.0" ?>
<EMAIL>
  <TO>Ashish</TO>
  <CC>Rahul</CC>
  <SUBJECT>Meeting Reminder</SUBJECT>
  <BODY>Group Meeting at 4.00 PM</BODY>
</EMAIL>
Benefits of well-formedness
For the client, it saves the time of downloading the DTD if the XML document has already been validated against the DTD by the server.
In cases where validation is not required, the focus is on the structure of the document.
(Note: valid document = well-formed document + all validity constraints satisfied)
Document Type Definition (DTD)
Document Classes
Background to the design of XML
Relates to OOP
Conceptual use of inheritance and polymorphism
Example: base class Book
Book
NumberOfChapters
CoverLetter
Inheritance (Book and its subclasses)
Book
NumberOfChapters CoverLetter
CookBook
NumberOfChapters(Value 10) CoverLetter(Value Red) Recipe
TextBook
NumberOfChapters(Value 21) CoverLetter(Value Blue) Recipe
Polymorphism
Book
ArtBook
NumberOfChapters
CoverLetter(Value Blue, Pattern pt)
Class ArtBook overloads the CoverLetter property of base class Book: it accepts color patterns in addition to color values.
DTD
Acts as a rule book that allows an author to create new documents of the same type and with the same characteristics as a base document
Defines the building blocks of an XML document
Defines the document structure with a list of elements and attributes
Example: DTD created for medical community.
Documents created with this DTD can contain Patient Name, Medical History, Medications and so on.
This information can be easily read by any medical institution that supports an XML-based document system.
DTD structure
Internal DTD (subset)
  A DTD that is declared inside the XML document
  <!DOCTYPE root-element [element-declarations]>
External DTD (subset)
  A DTD that is declared in an external file, which is then referenced from the XML document
  <!DOCTYPE root-element SYSTEM "filename">
(Note: If the document contains both types of DTD, the internal subset takes precedence over the external subset.)
Internal DTD
In this example, the EMAIL DTD is created in the XML document itself.
<?xml version="1.0"?>
<!DOCTYPE EMAIL [
  <!ELEMENT EMAIL (TO, FROM, CC, SUBJECT, BODY)>
  <!ELEMENT TO (#PCDATA)>
  <!ELEMENT FROM (#PCDATA)>
  <!ELEMENT CC (#PCDATA)>
  <!ELEMENT SUBJECT (#PCDATA)>
  <!ELEMENT BODY (#PCDATA)>
]>
<EMAIL>
  <TO>Ashish@[Link]</TO>
  <FROM>Rahul@[Link]</FROM>
  <CC>Bill@[Link]</CC>
  <SUBJECT>My First DTD</SUBJECT>
  <BODY>Hello World</BODY>
</EMAIL>
Interpretation of DTD
!DOCTYPE EMAIL defines that the root element of this document is EMAIL
!ELEMENT EMAIL defines that the EMAIL element contains five elements: "TO, FROM, CC, SUBJECT, BODY"
!ELEMENT TO defines the TO element to be of type "#PCDATA"
!ELEMENT FROM defines the FROM element to be of type "#PCDATA"
!ELEMENT CC defines the CC element to be of type "#PCDATA"
!ELEMENT SUBJECT defines the SUBJECT element to be of type "#PCDATA"
!ELEMENT BODY defines the BODY element to be of type "#PCDATA"
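The interpretation above can be observed programmatically. The sketch below assumes Python's standard xml.parsers.expat module, a non-validating parser that still reports the DOCTYPE and the element declarations as it reads them. The function name read_dtd, the DOC constant and the example.com addresses are placeholders invented for this illustration.

```python
import xml.parsers.expat

# An EMAIL document with an internal DTD; the addresses are placeholders.
DOC = """<?xml version="1.0"?>
<!DOCTYPE EMAIL [
  <!ELEMENT EMAIL (TO, FROM, CC, SUBJECT, BODY)>
  <!ELEMENT TO (#PCDATA)>
  <!ELEMENT FROM (#PCDATA)>
  <!ELEMENT CC (#PCDATA)>
  <!ELEMENT SUBJECT (#PCDATA)>
  <!ELEMENT BODY (#PCDATA)>
]>
<EMAIL>
  <TO>ashish@example.com</TO>
  <FROM>rahul@example.com</FROM>
  <CC>bill@example.com</CC>
  <SUBJECT>My First DTD</SUBJECT>
  <BODY>Hello World</BODY>
</EMAIL>"""

def read_dtd(document):
    """Collect the root element name and the declared element names."""
    info = {"root": None, "elements": []}
    parser = xml.parsers.expat.ParserCreate()

    def on_doctype(name, system_id, public_id, has_internal_subset):
        # Called for <!DOCTYPE ...>; name is the declared root element.
        info["root"] = name

    def on_element_decl(name, model):
        # Called once per <!ELEMENT ...>; the content model is ignored here.
        info["elements"].append(name)

    parser.StartDoctypeDeclHandler = on_doctype
    parser.ElementDeclHandler = on_element_decl
    parser.Parse(document, True)
    return info

info = read_dtd(DOC)
print(info["root"], info["elements"])
```

Because expat does not validate, this only shows what the DTD declares; it does not check that the EMAIL element actually follows those declarations.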
External DTD
In the following example, the [Link] file is created separately and referenced in the XML document as "[Link]".
<?xml version="1.0"?>
<!DOCTYPE EMAIL SYSTEM "[Link]">
<EMAIL>
  <TO>Ashish@[Link]</TO>
  <FROM>Rahul@[Link]</FROM>
  <CC>Bill@[Link]</CC>
  <SUBJECT>My First DTD</SUBJECT>
  <BODY>Hello World</BODY>
</EMAIL>
Here the file "[Link]" will contain the EMAIL DTD.
The Building Blocks of XML Documents
From a DTD point of view, all XML documents (and HTML documents) are made up of the following building blocks:
  Elements
  Attributes
  Entities
  PCDATA
  CDATA
Element Declarations
Syntax:
  <!ELEMENT element-name category>
  or
  <!ELEMENT element-name (element-content)>
Empty Elements
  Empty elements are declared with the category keyword EMPTY:
  <!ELEMENT element-name EMPTY>
  Example: <!ELEMENT br EMPTY>
  XML example: <br />
Elements with Parsed Character Data
  Elements with only parsed character data are declared with #PCDATA inside parentheses:
  <!ELEMENT element-name (#PCDATA)>
  Example: <!ELEMENT FROM (#PCDATA)>
Elements with any Contents
  Elements declared with the category keyword ANY can contain any combination of parsable data:
  <!ELEMENT element-name ANY>
  Example: <!ELEMENT EMAIL ANY>
Elements with Children (sequences)
  Elements with one or more children are declared with the names of the child elements inside parentheses:
  <!ELEMENT element-name (child1)>
  or
  <!ELEMENT element-name (child1,child2,...)>
  Example: <!ELEMENT EMAIL (TO, FROM, CC, SUBJECT, BODY)>
(NOTE: When children are declared in a sequence separated by commas, the children must appear in the same sequence in the document.)
Declaring Only One Occurrence of an Element
  <!ELEMENT element-name (child-name)>
  Example: <!ELEMENT EMAIL (BODY)>
  The example above declares that the child element "BODY" must occur once, and only once, inside the "EMAIL" element.
Declaring Minimum One Occurrence of an Element
  <!ELEMENT element-name (child-name+)>
  Example: <!ELEMENT EMAIL (BODY+)>
  The + sign in the example above declares that the child element "BODY" must occur one or more times inside the "EMAIL" element.
Declaring Zero or More Occurrences of an Element
  <!ELEMENT element-name (child-name*)>
  Example: <!ELEMENT EMAIL (BODY*)>
  The * sign in the example above declares that the child element "BODY" can occur zero or more times inside the "EMAIL" element.
Declaring Zero or One Occurrences of an Element
  <!ELEMENT element-name (child-name?)>
  Example: <!ELEMENT EMAIL (BODY?)>
  The ? sign in the example above declares that the child element "BODY" can occur zero or one time inside the "EMAIL" element.
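One way to get a feel for sequences and the occurrence indicators is to treat a content model as a regular expression over the child element names. The sketch below is a hand-rolled illustration in Python, not a DTD validator: the helpers model_to_regex and children_match are invented for this example, and they only handle a flat, comma-separated list of names with an optional ?, * or + suffix (no nesting, no | choices, no #PCDATA).

```python
import re

def model_to_regex(model):
    """Translate a flat content model such as "(TO, CC?, BODY+)" to a regex."""
    parts = []
    for item in model.strip("() ").split(","):
        item = item.strip()
        # A trailing ?, * or + is the occurrence indicator; none means "once".
        suffix = item[-1] if item[-1] in "?*+" else ""
        name = item[:-1] if suffix else item
        # Each child is encoded as its name followed by one space.
        parts.append("(?:" + re.escape(name) + " )" + suffix)
    return re.compile("".join(parts))

def children_match(model, child_names):
    """Check whether the child names, in order, satisfy the content model."""
    encoded = "".join(name + " " for name in child_names)
    return model_to_regex(model).fullmatch(encoded) is not None

sequence = "(TO, FROM, CC, SUBJECT, BODY)"
print(children_match(sequence, ["TO", "FROM", "CC", "SUBJECT", "BODY"]))
print(children_match(sequence, ["FROM", "TO", "CC", "SUBJECT", "BODY"]))
print(children_match("(BODY+)", ["BODY", "BODY"]))
print(children_match("(BODY?)", ["BODY", "BODY"]))
```

The DTD indicators ?, * and + carry the same meaning in regular expressions, which is why the translation is almost one to one.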
Declaring either/or Content
  Example: <!ELEMENT EMAIL (TO,FROM,CC,SUBJECT,(MESSAGE|BODY))>
  The example above declares that the "EMAIL" element must contain a "TO" element, a "FROM" element, a "CC" element, a "SUBJECT" element, and either a "MESSAGE" or a "BODY" element.
Declaring Mixed Content
  Example: <!ELEMENT EMAIL (#PCDATA|TO|FROM|CC|SUBJECT|BODY)*>
  The example above declares that the "EMAIL" element can contain zero or more occurrences of parsed character data, "TO", "FROM", "CC", "SUBJECT" or "BODY" elements.
Declaring Attributes
  An attribute declaration has the following syntax:
  <!ATTLIST element-name attribute-name attribute-type default-value>
  DTD example: <!ATTLIST person number CDATA "0000">
  XML example: <person number="5677" />
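The effect of a default value can be sketched with Python's standard xml.parsers.expat module, which, as far as this example relies on it, fills in attribute defaults declared in the internal DTD subset. The function name person_attributes and the DOCTYPE constant are invented for this illustration.

```python
import xml.parsers.expat

# Internal DTD subset giving the number attribute a default of "0000".
DOCTYPE = '<!DOCTYPE person [<!ATTLIST person number CDATA "0000">]>'

def person_attributes(document):
    """Return the attributes the parser reports for each person element."""
    seen = []
    parser = xml.parsers.expat.ParserCreate()

    def on_start(name, attrs):
        if name == "person":
            seen.append(dict(attrs))

    parser.StartElementHandler = on_start
    parser.Parse(document, True)
    return seen

# Attribute omitted: the declared default is expected to be supplied.
defaulted = person_attributes(DOCTYPE + "<person/>")

# Attribute given explicitly: the value in the document wins.
explicit = person_attributes(DOCTYPE + '<person number="5677"/>')

print(defaulted, explicit)
```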
THANK YOU!!!!!!!