Type Safe XML Generation in Scala
Introduction
The Extensible Markup Language (XML) is a language that specifies a set of rules for encoding
documents in a format that is both human-readable and machine-readable. It is defined by free, open standards. XML's design goals emphasize simplicity, generality and usability across the Internet.
Although the XML design focuses on documents, it is widely used to represent
arbitrary data structures, especially those used in web services.
XML can serve many purposes: as a more expressive mark-up language than HTML;
as an object-serialization format for distributed object applications; or as a data exchange
format. In this work, we focus on generation of type-safe XML documents from persistent
data that is sent over a network to an application. Numerous industry groups, including
health care and telecommunications, are working on document type definitions (DTDs) and
XML Schema (XSD) definitions that specify the format of the XML data to be exchanged between their
applications.
The existing XML generation mechanism in most modern programming languages is
based on string literals and is not type safe. We set aside the idea of treating XML tags as
string literals and instead implement an object-oriented model for generating XML documents.
This makes TSXML an efficient and generic library for type-safe XML generation in Scala.
Motivation
The Web has so far been incredibly successful at delivering information to human users.
The next step is to make that information equally usable by programs, and XML and its
various extensions (data models, query languages) are a primary step in this
direction. Unfortunately, the Web is not yet a well-organized repository of nicely structured
XML documents but rather a conglomerate of volatile HTML pages, from which structure has
to be extracted. A large amount of data still resides in legacy systems, and migrating it
is a primary task being taken up by many enterprises.
Most modern programming languages use XML as the standard markup language to
store documents. They treat XML tags as string literals, and almost no type
checking is done before the XML is generated. These languages may provide libraries that
check XML syntax, but they do not provide type checking. Type systems built directly into
compilers cannot easily be extended to keep track of run-time invariants of new abstractions.
This work presents library techniques for extending the type system of Scala to support
domain-specific abstractions.
For type safety, one has to depend on external APIs or parsers, because native language
support for type safety is missing in these languages. The XML is created by the program and
passed on to an external XML parser for parsing. As a result, if the generated XML has an
error, it has to be regenerated and re-parsed until a correct XML document is produced. We
feel that this method is time-consuming and that a mechanism for type checking at the time of
XML generation itself should be provided in the native language.
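The generate-then-parse cycle described above can be illustrated with a minimal sketch. It assumes only the XML parser that ships with the JDK; the object name StringXmlSketch and the patient example are hypothetical and are not part of TSXML.

```scala
import java.io.ByteArrayInputStream
import java.nio.charset.StandardCharsets
import javax.xml.parsers.DocumentBuilderFactory
import scala.util.Try

// Hypothetical illustration; not part of TSXML.
object StringXmlSketch {
  // String-based generation: the compiler sees only a String, so the
  // mismatched closing tag </agee> is not reported at compile time.
  def patientAsString(name: String, age: Int): String =
    "<patient><name>" + name + "</name><age>" + age + "</agee></patient>"

  // Run-time check with the JDK's DOM parser: true if the text is well-formed.
  def isWellFormed(xml: String): Boolean = Try {
    val builder = DocumentBuilderFactory.newInstance().newDocumentBuilder()
    builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
  }.isSuccess
}
```

The program compiles without complaint, and the mistake surfaces only when `isWellFormed` is called on the generated text at run time, which is the cycle this work aims to avoid.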
A previous paper [1] addressed this problem for C++. We base our work on their idea
and, using the Object-Relational Mapping (ORM) model, create XML objects instead of string
literals.
There are two important reasons Scala was chosen to build this solution. Firstly, the
interoperability of Scala with Java is extremely robust: Scala can use Java libraries and, in
return, build libraries that can be imported into Java. Secondly, both languages are popular.
The graph below depicts the popularity of programming languages on GitHub and Stack Overflow.
Java is at the top of the popularity graph, and Scala ranks highest among the functional
languages. By building a solution in Scala, we can reach Scala users together with Java users.
XML Validation
XML validation is the process of checking whether a document written in XML is
both well-formed and valid. A well-formed document follows the basic syntactic
rules of XML, which are the same for all XML documents. A valid document also respects
the rules dictated by a particular DTD or XML schema, according to application-specific
choices. In addition, extended tools are available, such as the OASIS CAM
standard specification, that provide contextual validation of content and structure that is
more flexible than basic schema validation. Tools exist for this; examples are
xmllint on the Linux command line and XLint in Java.
3.1
A document type definition (DTD) is a set of markup declarations that define a document
type for XML. It defines the legal building blocks of an XML document. It defines the
document structure with a list of legal elements and attributes. DTDs persist in applications
that need special publishing characters.
Parsers for DTDs: XML parsers, for example XLint.
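As a minimal sketch of DTD validation, the following uses the validating DOM parser that ships with the JDK. The note DTD and the object name DtdSketch are assumptions made for illustration; they do not come from TSXML.

```scala
import java.io.StringReader
import javax.xml.parsers.DocumentBuilderFactory
import org.xml.sax.{ErrorHandler, InputSource, SAXParseException}

// Hypothetical illustration; not part of TSXML.
object DtdSketch {
  // Assumed DTD: a <note> holds exactly one <to> followed by one <body>.
  val dtd: String =
    """<!DOCTYPE note [
      |  <!ELEMENT note (to, body)>
      |  <!ELEMENT to (#PCDATA)>
      |  <!ELEMENT body (#PCDATA)>
      |]>""".stripMargin

  // True if `content` is well-formed and valid against the DTD above.
  def isValid(content: String): Boolean = {
    val factory = DocumentBuilderFactory.newInstance()
    factory.setValidating(true) // switch on DTD validation
    val builder = factory.newDocumentBuilder()
    var ok = true
    builder.setErrorHandler(new ErrorHandler {
      def warning(e: SAXParseException): Unit = ()
      def error(e: SAXParseException): Unit = { ok = false } // validity error
      def fatalError(e: SAXParseException): Unit = throw e   // not well-formed
    })
    try {
      builder.parse(new InputSource(new StringReader(dtd + "\n" + content)))
      ok
    } catch {
      case _: SAXParseException => false
    }
  }
}
```

A document such as `<note><body>Hi</body></note>` is well-formed but omits the required `to` element, so `isValid` rejects it.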
3.2
An XML Schema describes the structure of an XML document. Developers can use it
to verify each item of content in an XML document. Unlike most other schema languages,
XSD was designed with the intent that determining a document's validity would
produce a collection of information adhering to specific data types. Such a post-validation
infoset is very useful in the development of XML document processing software.
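The data-type aspect of XSD can be sketched with the javax.xml.validation API from the JDK. The one-element schema and the object name XsdSketch are assumptions chosen for brevity.

```scala
import java.io.StringReader
import javax.xml.XMLConstants
import javax.xml.transform.stream.StreamSource
import javax.xml.validation.SchemaFactory
import scala.util.Try

// Hypothetical illustration; not part of TSXML.
object XsdSketch {
  // Assumed schema: the content of <age> must be an xs:int.
  val xsd: String =
    """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
      |  <xs:element name="age" type="xs:int"/>
      |</xs:schema>""".stripMargin

  private val schema = SchemaFactory
    .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
    .newSchema(new StreamSource(new StringReader(xsd)))

  // validate throws on the first violation, so success means the document is valid.
  def isValid(xml: String): Boolean =
    Try(schema.newValidator().validate(new StreamSource(new StringReader(xml)))).isSuccess
}
```

Here `<age>42</age>` is accepted, while `<age>forty-two</age>` is well-formed XML but fails the data-type check.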
Document Model
Document modeling deals with the structures and patterns of a written work: it breaks
the work into branches and labels them.
4.1
XML is a semi-structured data format, i.e. a form of structured data that does not conform
to the formal structure of the data models associated with relational databases or other forms
of data tables. However, it contains tags or other markers to separate semantic elements
and enforce hierarchies of records and fields within the data, and it is thus called a
self-describing structure. In semi-structured data, entities belonging to the same class may
have different attributes even though they are grouped together, and the order of the
attributes is not important. Semi-structured data is increasingly common, since full-text
documents and databases are no longer the only forms of data and different applications need
a medium for exchanging information.
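The two properties named above, differing attribute sets within one class and irrelevance of attribute order, can be made concrete with a small sketch. The Entity type and the sample values are hypothetical.

```scala
// Hypothetical illustration; not part of TSXML.
object SemiStructuredSketch {
  // An entity is a class label plus an unordered map of attributes.
  final case class Entity(kind: String, attributes: Map[String, String])

  // Two entities of the same class with different attribute sets.
  val ann: Entity = Entity("person", Map("name" -> "Ann", "email" -> "ann@example.org"))
  val bob: Entity = Entity("person", Map("name" -> "Bob", "phone" -> "555-0100"))

  // The same attributes as `ann`, listed in a different order.
  val annReordered: Entity =
    Entity("person", Map("email" -> "ann@example.org", "name" -> "Ann"))
}
```

Because a Map compares by content rather than by insertion order, `ann` and `annReordered` are equal, while `ann` and `bob` share a class but not an attribute set.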
4.2
TSXML changes the traditional way in which XML is created. It is a new scheme for
generating XML, namely using the ORM model. The user provides the objects to be put
in the XML, and the library takes care of everything else: nesting, type checking,
well-formedness and indentation. This relieves the user of the hassle of generating
XML via strings. If the data entered by the user has an error, the user is informed at
compile time, rather than having to run a parser and re-create the XML.
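The idea of passing objects instead of strings can be sketched as follows. This is not the TSXML API: the case classes, the render function and the patient example are assumptions that only show how ordinary Scala types move errors to compile time.

```scala
// Hypothetical illustration; not the TSXML API.
object TypedXmlSketch {
  // Assumed domain classes: the field types encode what the document may contain.
  final case class Name(value: String)
  final case class Age(value: Int)
  final case class Patient(name: Name, age: Age)

  // Escape the characters that would break well-formedness in text content.
  private def escape(s: String): String =
    s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")

  private def leaf(tag: String, text: String, indent: String): String =
    s"$indent<$tag>${escape(text)}</$tag>"

  // Nesting, tag names and indentation are fixed here, not typed by the user.
  def render(p: Patient): String =
    Seq(
      "<patient>",
      leaf("name", p.name.value, "  "),
      leaf("age", p.age.value.toString, "  "),
      "</patient>"
    ).mkString("\n")
}
```

With this design, `Patient(Name("Ann"), Age("forty-two"))` is rejected by the Scala compiler, and a mismatched closing tag cannot be written because the user never types a tag.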
5.1
5.2
There are six primary classes that implement the algorithm in the previous section:
customXmlTypeSafe, Lift, Liftable, ImplementLiftable, XMLNodes and MacroImplementLiftable.
customXmlTypeSafe is the main class, which uses the other classes in the background.
The Lift class generates the tree structure of the user classes. The Liftable class performs a
level-order traversal of the tree. ImplementLiftable reads the DTD/XSD and performs the
type checking. Once the type-safety check passes, the XMLNodes class creates the
XML tags and attributes. Lastly, MacroImplementLiftable uses string interpolation to nest
the XML tags with appropriate indentation.
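The stages described above (building a tree, level-order traversal, checking against schema information, and nested rendering) can be approximated in a short sketch. The signatures below are assumptions and do not reproduce the actual TSXML classes. In particular, the schema is replaced by a hand-written map and the check runs at run time, which keeps the sketch self-contained.

```scala
import scala.collection.immutable.Queue

// Hypothetical illustration of the stages; signatures are assumptions, not TSXML's.
object LiftPipelineSketch {
  // Tree of XML nodes built from a user object.
  final case class Node(tag: String, text: Option[String], children: List[Node])

  // Type class that turns a user type into a node tree (assumed signature).
  trait Liftable[A] { def lift(a: A): Node }

  // Level-order (breadth-first) traversal, returning tags in visiting order.
  def levelOrder(root: Node): List[String] = {
    @annotation.tailrec
    def loop(queue: Queue[Node], acc: List[String]): List[String] =
      queue.dequeueOption match {
        case None            => acc.reverse
        case Some((n, rest)) => loop(rest ++ n.children, n.tag :: acc)
      }
    loop(Queue(root), Nil)
  }

  // Stand-in for DTD/XSD information: allowed child tags per parent tag.
  val allowed: Map[String, Set[String]] = Map("patient" -> Set("name", "age"))

  def conforms(n: Node): Boolean =
    n.children.forall { c =>
      allowed.getOrElse(n.tag, Set.empty[String]).contains(c.tag) && conforms(c)
    }

  // Nest tags with two-space indentation (text escaping omitted for brevity).
  def render(n: Node, depth: Int = 0): String = {
    val pad = "  " * depth
    n.text match {
      case Some(t) => s"$pad<${n.tag}>$t</${n.tag}>"
      case None =>
        val inner = n.children.map(render(_, depth + 1)).mkString("\n")
        s"$pad<${n.tag}>\n$inner\n$pad</${n.tag}>"
    }
  }

  // Lift, check, then render; Left carries a message when the check fails.
  def generate[A](a: A)(implicit ev: Liftable[A]): Either[String, String] = {
    val tree = ev.lift(a)
    if (conforms(tree)) Right(render(tree)) else Left("schema violation")
  }

  // Example user class with its Liftable instance in the companion object.
  final case class Patient(name: String, age: Int)
  object Patient {
    implicit val liftable: Liftable[Patient] = new Liftable[Patient] {
      def lift(p: Patient): Node = Node("patient", None, List(
        Node("name", Some(p.name), Nil),
        Node("age", Some(p.age.toString), Nil)))
    }
  }
}
```

Calling `LiftPipelineSketch.generate(LiftPipelineSketch.Patient("Ann", 42))` yields a `Right` holding the indented document, and a tree with a child tag that the map does not allow yields a `Left`.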
Conclusion
The TSXML library redefines the way XML generation is viewed by programming languages.
It brings in an object-oriented approach using the Object-Relational Mapping model for XML
generation. The main idea is that if an XML document is created, it had better be correct;
otherwise it should not be created at all.
Future Work
The current mechanism for reporting errors when an XML violation is detected is weak.
Specifically, with deep tag nesting (more than 15 levels), detecting the error becomes difficult.
The logic that performs the checks needs to be modularized.
For reading an XSD/DTD file, we are using an external library. We are planning to write
our own XSD/DTD reader.
References
[1] Yuriy Solodkyy and Jaakko Järvi, Extending type systems in a library: Type-safe
XML processing in C++, Science of Computer Programming, Vol. 76, 2011,
290-306
[2] Mary Fernandez, Wang-Chiew Tan and Dan Suciu, SilkRoute: Trading between
Relations and XML, International World Wide Web Conference (WWW), 2000