Converting STM Content To XML
Introduction
Content is being read on mobile devices, ebook readers, PDAs, notepads, etc. The final electronic
format of the content can vary depending on the device that is being used to read it. The flexibility
to accommodate any such device is one of the main reasons why many content owners elect to convert
their content first to XML (Extensible Markup Language). XML is a set of rules for encoding content
electronically that is device, application, and product neutral. XML’s design goals emphasize application
independence, simplicity, generality, and usability. The XML specifications are managed by the World
Wide Web Consortium (W3C) and are widely adopted and recognized throughout most of the
information industry. Hundreds of tools and languages have been developed to support
XML, and the format is supported even by Microsoft® and Apple®, industry leaders in office productivity
tools. Besides interoperability, XML produces componentized content for reuse and repurposing. Once
the content is in XML, specific tools (also with their own open standards) can be leveraged to produce
the hot information products of today (EPUB, Mobipocket, XHTML, HTML5) as well as the hot products
of tomorrow.
This whitepaper is based on experience with a large conversion initiative for an STM publisher (a
common way to refer to a publisher of Scientific, Technical or Medical content). The project goal was to
convert 70,000+ technical papers to XML. This paper therefore focuses on the issues and challenges
when working with STM content. The paper focuses on the following points:
Where to begin?
Before we jump in, let’s review the essential project management items needed to plan, execute
and successfully complete such a project. Any project, whether a content conversion or a
kitchen remodeling, should follow a structured methodology that addresses:
1
The sponsor is the project champion and is typically responsible for the financial decisions while the stakeholders
can include a variety of roles including senior management, the customer, functional managers, etc. Stakeholders
can hold the keys to unlock the available resources within the company who can work on the project.
Specifically, when designing a project to address electronic publishing, take into consideration the
following:
A conversion of content from one format to another is generally not the main long-term business goal; it
is merely one step or activity within an overall initiative. So before beginning a conversion, step
back to understand the goals, objectives and requirements; doing so should help ensure success.
Converting to XML
The following is a list of items to take into consideration when doing the planning for converting content
to XML:
• Who will do the conversion? STM publishers typically have A LOT of content to convert. How
much content you decide to convert will depend on what you intend to do with the content
once it’s converted. There are many vendors available to perform content conversions. This
paper makes no recommendations except to follow a standard request for proposal (RFP)
process. Each company has its own unique business requirements for how it wishes to deal
with vendors. The best decisions come from asking questions, gathering information and talking to
others who have completed a conversion. The following list contains information
you’ll need to supply your vendor(s) in order to obtain an adequate estimate:
o What is the target level of accuracy for text, tags, tables, graphics, etc.? 98%, 99.95%,
99.995%, or 99.9995%?
o What timeframe/turnaround time is expected to meet your organization’s business
goals?
o How will complex content such as math equations and tables be captured?
o What is the estimated number of pages overall, and for each source
format?
o How will the content be sent to and from the vendor? If hardcopy, is it expected to be
sent back when finished? If yes, is it expected to be sent back exactly as
received (e.g., bound, paper clipped or stapled)?
o What is the source material? PDF Normal, PDF Image, Hardcopy, proprietary content
processing tool, some other source…2
2
Text for the XML file can also be extracted from proprietary formats like Xyvision, FrameMaker, etc. For image
files or hardcopy files the text usually will be processed using OCR technologies and is subject to further manual
review, thereby increasing the cost. In rare cases of particularly challenging source material, actual manual
keyboarding may be needed, which is also a potential cost factor. Also, if hardcopy, an additional step is needed to
scan the content, although those added costs are typically minor. You'll want to break down the number of pages by
type of document format if there is more than one format.
o Are the conversion specifications complete or is this a service you need from the
vendor? The project can be more accurately quoted when the upfront analysis and
conversion specifications are complete.
o What result products are expected? XML, PDF, EPUB, etc.
o If PDF Image is an output, will it include clean or dirty hidden text? In other words, will
the OCR text be reviewed and cleaned by the vendor to correct errors? 3
o Who pays shipping and travel costs (if any)?
o How is the product to be delivered? FTP? DVD?
o What language is the content? All English?
In addition to the items above, a sample set should be included with the request. This set
should be a good sampling of the content. If the content is in many formats, include all formats
in the sample. If content is old, include enough samples from each period of time, for example
each decade 4. If some documents are more complex than others, provide an estimate for the
varying degrees of complexity. Before making a final selection, ask the vendor to convert your
sample set to XML and verify the results.
3
Ideally the conversion specifications should outline the details of your output/deliverable.
4
Older documents that were typeset can be more difficult for an OCR engine to successfully process.
o Trust but Verify. A conversion project typically includes a contract contingency that the
output produced by the vendor is 99.95%, 99.995% or 99.9995% accurate; however, that
is no guarantee. While you are paying for the vendor to perform that level of quality
assurance, you will need to decide how much quality assurance your company will
perform internally to verify the deliverables are meeting the requirements. The QA
requirements may vary depending on the use of the document (a technical
paper versus, perhaps, the more rigorous quality goals of a technical standard). A few
other things related to quality to address:
Take the time to understand the vendor’s quality assurance technique and
process. This is important when selecting a vendor.
The QA process should be verifying that the XML specifications were followed,
and that only an acceptable level of character errors were introduced. If the
vendor is responsible for only submitting the resulting XML and associated files
(graphics) then consider how you will be reviewing the content for accuracy.
You will need to have some tool that enables a QA tester to review the text of
the document without needing to understand XML. Whatever tool you select,
share it with the vendor. That will help ensure that they are seeing the same
results as you.
Begin your QA process (with all necessary tools) as soon as you begin the XML
conversion. Capturing issues in real time will be less painful than finding a
problem after several files have been converted to XML.
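The accuracy targets discussed above translate directly into the number of character errors you can tolerate in a QA sample. The following Python sketch illustrates that arithmetic. It assumes accuracy is measured per character (your contract may define it differently), and the function names and the sample size are hypothetical, not part of any vendor tool.

```python
from decimal import Decimal


def allowed_errors(total_chars: int, target_accuracy_pct: str) -> int:
    """Most character errors a sample may contain and still meet the target.

    The target is passed as a string (e.g. "99.95") so that Decimal keeps it
    exact; binary floats can misround values such as 99.995.
    """
    error_rate = (Decimal(100) - Decimal(target_accuracy_pct)) / Decimal(100)
    return int(total_chars * error_rate)  # int() truncates toward zero


def meets_target(total_chars: int, errors_found: int, target_accuracy_pct: str) -> bool:
    """True when the errors found in the sample stay within the allowance."""
    return errors_found <= allowed_errors(total_chars, target_accuracy_pct)


if __name__ == "__main__":
    sample_chars = 200_000  # assumed size of the QA sample, in characters
    for target in ("98", "99.95", "99.995", "99.9995"):
        print(target, allowed_errors(sample_chars, target))
```

Under these assumptions, a 200,000-character sample may contain at most 100 errors at 99.95% and only one error at 99.9995%.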
• Normalizing content across an organization. Once the content is electronic, how does it
compare to the content stored in the organization’s business databases? It can be rather
interesting to find how many times, and in how many ways, a title or author name is stored, and to find
out how that information may relate to or be referenced by other corporate systems. It is possible
that the metadata on the content and the metadata captured by the business systems are
different. This is an opportunity to capture the data at the source, but that may or may not end
up being the best solution overall.
o Customizing the common (master) DTD/XSD 5. The NLM DTD has been adapted by
several STM implementations and has evolved through 3 major releases. The many
revisions and extensions to the core NLM DTD provide a basis for virtually all content
elements, and then some. It would be a challenge to make the case not to deploy the
NLM DTD framework for typical STM publisher purposes. It’s possible to modify the
DTD/XSD to meet specific business requirements; however, by keeping things simple
and keeping with the standard structure, further customizations and issues with
upgrades can be avoided. The NLM DTD comes with thorough online documentation
including tagging preferences - http://dtd.nlm.nih.gov/publishing/tag-library/n-qk32.html.
Also follow the preferences and industry standards recommended by the
conversion vendor.
5
A document type definition (DTD) or XML Schema Definition (XSD) is a set of declarations that define the
structure and requirements for a given document type. Either format (DTD or XSD) can be used. Its primary
purpose, during a conversion, is to define the document structure and provide a mechanism to validate the XML.
• Creating the Conversion Specifications. Although the NLM tagging preferences and the
conversion vendor are useful resources when defining the conversion specifications, there are
many conversion decisions to be made. With every XML element, try to include varying
examples of content which will help define “how to” properly capture the content for that
element. For example, consider the title of a Technical Paper. How consistent has your
organization been in defining the title format and presentation in the past, in your print/PDF
versions? Do the authors break the title down into subtitles? Are there characters used within
the title that would suggest a break between title and subtitle? Be sure your design
specification takes this into consideration and defines a consistent method for capturing this
information. If this level of detail is not covered in your design specification, the conversion
vendor may or may not consistently capture the title and subtitle. The more details you can
provide on how you want your content converted, the more consistent the results will be, and
the easier it will be to convert from XML to other formats for electronic publishing.
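A rule for breaking a title into title and subtitle can be written down precisely enough to test. The Python sketch below is one possible rule, not a recommended one: the separators it looks for (a colon, an em dash, or a hyphen with a space on each side) are assumptions, and the function name and sample titles are invented for illustration.

```python
import re
from typing import Optional, Tuple

# Assumed separators: a colon, an em dash (U+2014), or a hyphen with
# whitespace on both sides. A real specification would list whatever
# characters the content analysis turns up.
TITLE_BREAK = re.compile(r"\s*(?::|\u2014|\s-\s)\s*")


def split_title(raw_title: str) -> Tuple[str, Optional[str]]:
    """Split a raw title into (title, subtitle) at the first separator."""
    parts = TITLE_BREAK.split(raw_title.strip(), maxsplit=1)
    title = parts[0]
    # No separator, or nothing after it, means there is no subtitle.
    subtitle = parts[1] if len(parts) > 1 and parts[1] else None
    return title, subtitle


if __name__ == "__main__":
    print(split_title("Fatigue of Welded Joints: A Review"))
    print(split_title("Low-Cost Sensor Design"))
```

Running such a rule over the sample set shows quickly where it breaks down, for example on titles that legitimately contain a colon, and those cases are the ones worth spelling out in the specification.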
• Conversion specification decisions. It may not be financially beneficial to tag legacy content to
the same degree as current content. It depends on how the content will be used. Tagging
appropriately leaves opportunity to go back and further enhance legacy content when/if
necessary. What is the right balance? What is important to take into consideration when
creating conversion specifications? Consider these suggestions:
o For legacy content, consider capturing tables and equations as graphics. Capture the
content-type as “table” or “equation” for future conversion possibilities. Costs will rise
significantly when converting tables and equations to XML especially with large complex
tables. The amount of time dedicated by the vendor and the organization’s QA process
must increase. Keep in mind, however, the vendor will charge for the graphics. Some
charge a fixed price for each graphic.
o Capture reference citations as mixed-citation to preserve the variety of ways the
citations were written while tagging the individual items of the citation for future
reformatting of the content. By capturing the reference “as is” (mixed-citation) the
application code to deliver your final output for your customer may be simplified. Also
capture the individual parts of the citation so that it can be loaded into tools such as
CrossRef – Cited-by Linking®, a reference search database, etc. Also consider capturing
the type of publication as an attribute. That may add value later when needing to
display the references in a particular way depending on the type of publication. It is
challenging to determine publication types for some references when there is not
enough information or the reference does not follow a common format. Well defined
specifications will be important as well as including a default type.
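The following Python sketch illustrates what capturing a reference both "as is" and as individual items can look like. The citation itself is a made-up placeholder, the choice of "other" as the default publication type is an assumption, and the function name is hypothetical; the element names (mixed-citation, string-name, article-title, source, year, volume, issue, fpage, lpage) and the publication-type attribute come from the NLM tag set.

```python
import xml.etree.ElementTree as ET

# Made-up placeholder citation. Punctuation and order are kept exactly as
# written, and each recognizable part is wrapped in its own element.
SAMPLE_CITATION = (
    '<mixed-citation publication-type="journal">'
    "<string-name><surname>Doe</surname> <given-names>J.</given-names></string-name> "
    "<article-title>Example article title</article-title>. "
    "<source>Example Journal</source>. "
    "<year>1975</year>;<volume>12</volume>(<issue>3</issue>):"
    "<fpage>45</fpage>-<lpage>67</lpage>."
    "</mixed-citation>"
)


def summarize_citation(xml_text: str, default_type: str = "other") -> dict:
    citation = ET.fromstring(xml_text)
    return {
        # The reference exactly as the author wrote it, ready for display.
        "as_written": "".join(citation.itertext()),
        # Falls back to an assumed default when no type was captured.
        "publication_type": citation.get("publication-type", default_type),
        # Individual parts, e.g. for loading into a reference database.
        "surname": citation.findtext("string-name/surname"),
        "source": citation.findtext("source"),
        "year": citation.findtext("year"),
        "first_page": citation.findtext("fpage"),
    }


if __name__ == "__main__":
    for key, value in summarize_citation(SAMPLE_CITATION).items():
        print(key, "=", value)
```

Because the text between the elements is preserved, the display string needs no reformatting code, while the tagged parts remain available for linking or for a different citation style later.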
o For legacy content, capture table and figure titles, figure captions, and legends as part of
the graphic, not as text. This ensures that key data is not missed and is kept with the
applicable content. This is also beneficial when dealing with a large volume of content
where your QA resources are limited. Lastly, figures created with symbols or varying
hyphens will be captured accurately. If the intent is to provide an archive of all graphics
and to reference them by figure caption or title then this approach is not feasible.
Remember though that if you ever wanted to go back and make that feature a
possibility you can, since all of the information is captured within the graphic and XML
attributes.
o Be specific on the graphics that are needed. This topic could be a whitepaper in itself.
There are many types of graphics that are valuable in different ways. Some standard
formats to consider are TIFF, JPG, GIF and SVG. Review the pros and cons of each of
these. If you have multiple needs, for example if you need 1) JPG for normal
black/white, grayscale and color figures; 2) GIF for thumbnails; 3) SVG for engineering
drawings and vector images; and 4) TIFF to preserve the graphic in its original form; then ask
for all four! The effort to convert from TIFF to JPG and GIF can be automated and
therefore should be no additional cost from the conversion vendor. SVG however may
be a different story depending on your vendor’s capabilities. Also consider the size of
the graphics. For example, keep the TIFF at the original size of the image, down-sample the GIF to
100x100 pixels for a thumbnail, and resize the JPG if it is larger than the screen of most handheld
reader devices (600x800 pixels).
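The sizing rules for the graphic formats can be stated as simple arithmetic, which makes them easy to include in a specification. The Python sketch below computes target pixel dimensions only; it does not convert any image. It assumes that "100x100" and "600x800" are bounding boxes measured in pixels and that the aspect ratio should be preserved, neither of which the text above states outright. The function names are hypothetical.

```python
from typing import Dict, Tuple

Size = Tuple[int, int]  # (width, height) in pixels


def fit_within(size: Size, box: Size) -> Size:
    """Shrink size to fit inside box, keeping the aspect ratio; never enlarge."""
    width, height = size
    scale = min(box[0] / width, box[1] / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))


def plan_derivatives(original: Size) -> Dict[str, Size]:
    """Target dimensions for each deliverable derived from the original scan."""
    return {
        "tiff": original,                         # archival copy, untouched
        "jpg": fit_within(original, (600, 800)),  # assumed reader screen size
        "gif": fit_within(original, (100, 100)),  # thumbnail
    }


if __name__ == "__main__":
    print(plan_derivatives((2400, 3200)))
```

A 2400x3200 scan would, under these assumptions, yield a 600x800 JPG and a 75x100 GIF thumbnail, while a figure already smaller than the reader screen is left at its original size.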
o Consider requesting an exact replica of the hardcopy in PDF and having it delivered
immediately in order to get immediate returns. If the archive is in hardcopy form, the
first thing the vendor must do is scan it into an electronic form that can be converted to
characters (using optical character recognition software). Request the scanned images
back as a PDF Image with dirty hidden text (OCR text prior to cleanup). You can put this
online immediately for access by your customers!
o Capture section, list and other labels as labels. Especially with legacy content, there is
probably no consistent definition for how to label the content. One paper may have
numbered sections while another may include letters. By placing the label within its
own element this allows for flexibility in how the content is published going forward.
o STM papers are heavily peppered with a wide range of character codes/symbols. There
are several things to consider when capturing these items into your XML.
Standardize on Unicode for capturing your characters even when entity codes or
other options are available. Unicode is platform and language independent. It
is widely and strongly adopted. For more information, see Unicode.org.
While there is a Unicode code point for nearly every character, the fonts available to display
those characters in your reader or print format may be limited. While web browsers and
any print/publishing system where you control the fonts can produce great
results, some of the readers (Adobe Digital Editions, Kindle 2) are very limited in
the fonts they support, and more complex characters will appear as question
marks, blank boxes or another symbol. (This is similar to what you’d see in Word if
you open a document and do not have the appropriate font on
your desktop.) Some suggestions to address this situation include providing a
format of the document that captures all symbols accurately and/or capturing any
symbols/characters that are not within your standard fonts as a graphic instead
of the Unicode value.
The hyphen is a pretty common character and can be represented in many forms
(em dash, en dash, etc.). Capturing them all as a hyphen (-) ensures a very
common character is viewable in the limited readers mentioned earlier. Take
care, however, when the dash has meaning, for example when used in a figure.
This may not be an issue, however, if the captions/legends are captured as part
of the graphic.
Be aware also that DTDs include predefined character sets. Some have found
the NLM DTD to have outdated codes for the & and < characters. Sticking with
Unicode will resolve this issue.
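Both character checks described above, folding dash variants to a plain hyphen and flagging characters a limited reader may not display, are easy to automate during QA. The Python sketch below is illustrative only. The list of dash variants is an assumption, as is the stand-in rule that a reader supports only code points up to U+00FF; substitute the actual coverage of your target fonts. The function names are hypothetical.

```python
import unicodedata
from typing import List

# Dash-like characters folded to the plain hyphen-minus (U+002D): hyphen,
# non-breaking hyphen, figure dash, en dash, em dash, horizontal bar.
# U+2212 MINUS SIGN is deliberately left out because it may carry meaning.
DASH_VARIANTS = "\u2010\u2011\u2012\u2013\u2014\u2015"
_DASH_TABLE = str.maketrans({ch: "-" for ch in DASH_VARIANTS})


def normalize_dashes(text: str) -> str:
    """Replace every dash variant with a plain hyphen."""
    return text.translate(_DASH_TABLE)


def unsupported_characters(text: str, max_code_point: int = 0xFF) -> List[str]:
    """List characters a limited reader font may not display.

    Assumption: the reader covers only code points up to max_code_point.
    """
    found = sorted({ch for ch in text if ord(ch) > max_code_point})
    return ["U+%04X %s" % (ord(ch), unicodedata.name(ch, "UNKNOWN")) for ch in found]


if __name__ == "__main__":
    line = normalize_dashes("pages 10\u201320, \u03b1 = 5 \u00b5m")
    print(line)
    print(unsupported_characters(line))
```

Normalizing the dashes first keeps them out of the report, so the remaining list contains only the characters that are candidates for capture as a graphic.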
o Analyze a large sampling of documents including documents from every decade and
specify how the conversion vendor should deal with them. Look for anomalies and
document how they should be processed. Some anomalies which can be found include:
Multipart figures – This may just be a common occurrence for your documents.
A multipart figure is a set of figures that are defined as parts of one figure and share a
common figure caption.
Documents with responses or discussions – Older technical papers commonly
have responses or discussions appended to the end of the paper. Luckily, the
NLM accounts for this with the element <response>!
Documents with advertisements included – Yep, advertisements... You probably
will choose not to convert this content.
Documents with abstracts only
Journals with pages that unfold into large-size documents or diagrams
Sections of a paper with a title and no content
Nested reference lists – This occurs not just in older content, be clear in the
specifications how to address this.
Logos on the cover page
An article within a journal being continued on another page of the journal
(continued on page 493…)
Articles in a journal that start on the same page as where another article ends
o When dealing with older content it is possible for parts of text to be missing or rubbed
off from the page. Decide how to handle this. Is the paper still worth converting? If so,
indicate that the content is [ILLEGIBLE].
o Carefully study the authors and affiliations on papers. Multiple authors can be grouped
with multiple affiliations. Authors can represent this in a variety of ways including
placing numbers or letters next to the authors who are affiliated with the respective
organizations. This can get quite complex. The NLM DTD provides a great deal of
flexibility in how the authors and affiliations can be tagged within a <contrib> or
<contrib-group>. It’s very important to provide detailed instructions on how variations
should be tagged so that the content is tagged consistently for later reproduction.
o Be clear on what should and should not be a definition list. The basic rendering of a
definition list is as follows:
Wikipedia
Insert a definition here…..
If, however, your conversion vendor captures the following as a definition list, you may
not like how it’s rendered by default:
Y = the length in centimeters of a piece of gum.
Y
The length in centimeters of a piece of gum
There is an attribute for definition lists that helps you capture the different types of lists,
thereby permitting each list to be rendered in a way suited to its type.
o Be aware of directional text in figure captions. I’ve seen this more in books than papers.
A caption may refer to more than one figure and distinguish them by referencing as “the
figure on the right”, or a clock directional. Also, especially in photo figures, references
are made to the contents of the picture such as mentioning people’s names in a photo
from left to right.
o If an “id” attribute exists for a given element, it’s best to use it. It may be helpful when
rendering the XML to a new format. If an “id” attribute exists, it’s there for a reason.
Someone found a need for it. So go ahead and include it. For example, sections can be
numbered in a variety of ways but having a standard way to capture sections and
subsections will be helpful when rendering the content. For example, use “s_n.n.n”
where each “n” refers to a level deeper within the section. This will be valuable for
defining tables of contents, and for choosing different fonts and font sizes when rendering
section titles.
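The “s_n.n.n” scheme can be generated mechanically from the nesting of the sections, which also makes it easy to check in QA. The Python sketch below assumes sections are tagged as nested <sec> elements and that each “n” is the position of a section among its siblings; the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Tuple


def assign_section_ids(parent: ET.Element, prefix: str = "s_",
                       path: Tuple[int, ...] = ()) -> None:
    """Give every nested <sec> an id such as s_1, s_1.2 or s_1.2.1."""
    for number, section in enumerate(parent.findall("sec"), start=1):
        current = path + (number,)
        section.set("id", prefix + ".".join(str(n) for n in current))
        assign_section_ids(section, prefix, current)


def section_depth(section_id: str) -> int:
    """Nesting level encoded in an id: s_1 is level 1, s_1.2.1 is level 3."""
    return section_id.count(".") + 1


if __name__ == "__main__":
    body = ET.fromstring("<body><sec><sec/><sec><sec/></sec></sec><sec/></body>")
    assign_section_ids(body)
    for sec in body.iter("sec"):
        print(sec.get("id"), "level", section_depth(sec.get("id")))
```

A rendering step can then pick a heading font size from the level alone, and a table of contents can be built by listing the ids in document order.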
o Remember to “adjust as you go”. You certainly don’t want to make a decision to
capture the XML one way and then change midstream, however you may encounter an
anomaly that was missed during the analysis. The vendor and organization should have
a change control process that can be followed to perform analysis on the anomaly and
adjust necessary documentation and processes to address it.
• Paying the bills. An obvious cost of performing the conversion is the cost of the contracted
services (e.g., conversion vendor, expert consultants, QA analysts, etc.). But also plan for the
following, which may have a tangible or intangible cost to the organization:
o Account for disk space and method for transferring large amounts of data
o Plan for unexpected costs/buffers - overruns, rework, scope changes
o Include costs for shipping, copying, software licenses, readers
o Do you need to access an offsite facility/mine for your documents? Are there costs
associated with that?
Summary
Time to jump in! If you are not convinced that XML is the way to go, compare its costs and
benefits with those of converting to other formats. Also, learn about XML and why it is such an important
standard.6 Lastly, do not skip the upfront analysis and exercises (see “Where to begin?”) as these are
essential to successfully completing the project!
Acknowledgements
Sperling Martin, Information Industry Consultant
6
W3C Extensible Markup Language (XML) http://www.w3.org/XML/