XML

What does XML stand for?
XML stands for eXtensible Markup Language. It’s straightforward, legible to humans, and very powerful. Its main drawback is that it is verbose: XML files take up a great deal of space. However, because many of its textual elements are repeated very frequently, it can be compressed very effectively, which goes some way towards making up for that disadvantage. You can find a detailed description on Wikipedia; here, we will focus solely on what is important for EDI/EAI.

A simple example

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Order>
  <Header>
    <OrderNo>4711</OrderNo>
    <CustomerNo>K0815</CustomerNo>
    <OrderDate>2013-05-30T00:00:00+02:00</OrderDate>
  </Header>
  <LineItems>
    <LineItem no="1">
      <ItemNo>S123</ItemNo>
      <Quantity>5</Quantity>
      <PricePerUnit>9.99</PricePerUnit>
    </LineItem>
    <LineItem no="2">
      <ItemNo>H456</ItemNo>
      <Quantity>3</Quantity>
      <PricePerUnit>17.95</PricePerUnit>
    </LineItem>
  </LineItems>
</Order>
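To give a feel for how software reads such a file, here is a minimal sketch in Python using the standard library module xml.etree.ElementTree. The function name read_order and the dictionary keys are our own choices for illustration, and the sample is a copy of the order above with the line item attribute written as lower-case no.

```python
from decimal import Decimal
import xml.etree.ElementTree as ET

# A copy of the order example (bytes, so the encoding declaration is honoured).
ORDER_XML = b"""<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Order>
  <Header>
    <OrderNo>4711</OrderNo>
    <CustomerNo>K0815</CustomerNo>
    <OrderDate>2013-05-30T00:00:00+02:00</OrderDate>
  </Header>
  <LineItems>
    <LineItem no="1">
      <ItemNo>S123</ItemNo>
      <Quantity>5</Quantity>
      <PricePerUnit>9.99</PricePerUnit>
    </LineItem>
    <LineItem no="2">
      <ItemNo>H456</ItemNo>
      <Quantity>3</Quantity>
      <PricePerUnit>17.95</PricePerUnit>
    </LineItem>
  </LineItems>
</Order>
"""


def read_order(xml_bytes):
    """Return the order number and a list of line items (hypothetical helper)."""
    root = ET.fromstring(xml_bytes)
    order_no = root.findtext("Header/OrderNo")
    items = []
    for line in root.findall("LineItems/LineItem"):
        items.append({
            "no": int(line.get("no")),                 # attribute value
            "item_no": line.findtext("ItemNo"),        # child element values
            "quantity": int(line.findtext("Quantity")),
            "price": Decimal(line.findtext("PricePerUnit")),
        })
    return order_no, items


order_no, items = read_order(ORDER_XML)
total = sum(item["quantity"] * item["price"] for item in items)
print(order_no, len(items), total)
```

Prices are read as Decimal rather than float here so that the sum of the line items is not distorted by binary rounding.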

The structure

First of all, it has to be said that the line breaks and indentations in the example above are only there to make it easier to read. You will often find documents formatted in this way – however, in principle, all this could be written out in one line without making any difference to the meaning. Whitespace (line breaks, spaces, indentations) outside values has no effect whatsoever. Of course, if line breaks appear within certain values (e.g. item descriptions), they will still be included in the transfer. And now let’s take a look at the different elements.

XML header

The XML header comes at the start of the XML file. It specifies the version of the XML format (in practice, this is almost always 1.0) and the encoding of the file. The standalone="yes" here also indicates that this file does not depend on an external format definition. More on that later. Strictly speaking, this header should always appear – however, there are plenty of examples of XML files that do without it.

In general, statements bracketed inside <? and ?> are called “processing instructions”. This header takes that format, but it isn’t really a processing instruction, technically speaking. Processing instructions have no direct impact on the data content, so we won’t spend any more time on them here.
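As a small illustration of the XML header being optional in practice, the following sketch writes the same element once with and once without the header and shows that both variants parse to the same content. It uses Python’s standard xml.etree.ElementTree (the xml_declaration argument of tostring assumes Python 3.8 or later); the element names are simply borrowed from the order example.

```python
import xml.etree.ElementTree as ET

# Build a tiny document in memory.
root = ET.Element("Order")
ET.SubElement(root, "OrderNo").text = "4711"

# Serialise once with the XML header (bytes) and once without it (str).
with_header = ET.tostring(root, encoding="UTF-8", xml_declaration=True)
without_header = ET.tostring(root, encoding="unicode")

print(with_header.decode("utf-8"))
print(without_header)

# A parser accepts both variants and sees the same data.
same_content = (
    ET.fromstring(with_header).findtext("OrderNo")
    == ET.fromstring(without_header).findtext("OrderNo")
)
print(same_content)
```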

Tags and attributes

The identifiers bracketed by < and > are called tags. Every tag is opened and then closed again, unless it doesn’t have any content inside it. We therefore talk about opening and closing tags. The closing tag simply adds a slash (/) in front of the identifier.

  • At the top level – i.e. if the tag is not the “child” of a “parent” tag – there can only be exactly one tag per file. This is what is known as the “root element”. An XML file that has more than one set of opening and closing tags at the top level is not well-formed and will be rejected by a parser.
  • A tag can simply contain a value, such as here: <OrderNo>4711</OrderNo>
  • It’s also possible for a tag to enclose other tags, which can be nested as deeply as required. The example above provides a nice example of this.
  • A single tag can also appear multiple times. For example, we could represent the classification of an item in a catalogue structure as follows:

<Item no="4711" name="ABC Educational Game">
  <ProductGroup>Educational</ProductGroup>
  <ProductGroup>Games</ProductGroup>
  <ProductGroup>School</ProductGroup>
  <ProductGroup>Children</ProductGroup>
</Item>

After all, the item would fit well into all four of these categories. Similarly, the order in our initial example also has multiple line items.

  • As well as values and other tags, a tag can contain any number of attributes, as in this example: <LineItem no="2"> … </LineItem>

Unlike in HTML, the value of an attribute must always be enclosed in quotation marks (double quotes as shown here, or single quotes).

Attributes are always added to the opening tag, never to the closing tag.

The important point is that an attribute can only appear once per tag.

The following is therefore invalid: <LineItem no="2" no="3">

After all, this wouldn’t make sense – what number line item would we be talking about?

  • Best of all, a tag can contain all these things at once:

<ParentTag attr1="blah" attr2="stuff" attr3="gubbins">
  <ChildTag1 another_attr="Maria">
    <ChildChildTag_a>So long</ChildChildTag_a>
    <ChildChildTag_b>Farewell</ChildChildTag_b>
    <ChildChildTag_c>Auf Wiedersehen, goodbye</ChildChildTag_c>
    This is the value of child tag 1.
  </ChildTag1>
  <ChildTag2>Adieu, adieu, to you and you and you</ChildTag2>
  This is the value of the parent tag.
</ParentTag>

  • If a tag contains neither values nor child tags, it can also be written out in abbreviated form by simply merging the opening and closing tags into one.

<Tag_Without_Contents attr1="Attributes are allowed" />

Can you see the slash (/) at the end? That tells you that this is the end of the tag, and that there won’t be another closing tag.

  • Comments are opened with <!-- and closed again with --> at the end. They can span multiple lines, but they cannot be placed inside a tag itself (i.e. between its < and >), and they can never be nested.
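The structural rules in this list are exactly what a parser enforces. As a rough sketch, the hypothetical helper is_well_formed below feeds a few snippets to Python’s standard xml.etree.ElementTree parser and reports whether each one is accepted; the snippets are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


samples = {
    "one root element": "<Order><OrderNo>4711</OrderNo></Order>",
    "two root elements": "<Order/><Order/>",
    "attribute given twice": '<LineItem no="2" no="3"/>',
    "attribute without quotes": "<LineItem no=2/>",
    "single-quoted attribute": "<LineItem no='2'/>",
    "tags closed in the wrong order": "<a><b></a></b>",
    "abbreviated empty tag": '<Tag_Without_Contents attr1="ok" />',
    "comment between tags": "<a><!-- just a note --><b/></a>",
}

for label, text in samples.items():
    print(f"{label}: {is_well_formed(text)}")
```

Note that this only checks the basic structure described so far. Whether the content also matches an agreed format is a separate question.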

And with that, we have covered everything we need to know about the structure of XML.

So far, so simple. But there’s a little bit more to it than that.

Special characters and notations

What happens if a < or a > appears in a value? That would knock an XML parser off course. It would see a < and think it was the start of a new tag, when in truth it was merely part of a value stating “a < b”. To avoid confusing the parser, we mask these characters using things called “entities”. Every entity begins with an ampersand (&) and ends with a semicolon (;). Between those comes the code for the character in question. So < is replaced with &lt;, where “lt” stands for “less than”. As you might expect, we then code > as &gt; (“gt” standing for “greater than”). And because the ampersand (&) also has a special meaning in XML, it too needs to be coded, as &amp;. You’ve guessed it – “amp” stands for “ampersand”. The semicolon (;) doesn’t need to be coded, however. Because it is only meaningful when it is preceded by an & along with a valid code, a standalone semicolon won’t confuse the parser.

One thing that does cause major problems is a quotation mark (") inside an attribute value, because it would end the value prematurely. These need to be coded too, as &quot; (“quot” for “quotation mark”). Last but not least, single quotation marks (aka apostrophes) can be masked as &apos;. Incidentally, you can also write the numerical Unicode value of a character between &# and ; (e.g. &#60; for <) – however, these aren’t as easy to remember as lt, gt, amp, quot or apos.

Anyone who has any experience in creating HTML pages in a text editor (Vi rules!) will now be thinking of the many other entities they have already encountered. For example, when writing in German, you need &Auml; for capital Ä, or &szlig; for the letter ß. All this dates back to a time when the Internet was almost exclusively written in ASCII. Back then, everything that didn’t appear in ASCII code needed to be specially coded. Nowadays, we state the encoding at the top of HTML pages (if we are working cleanly), and modern browsers are all compatible with Latin1 and Unicode, which means that there is no longer any need to code umlauts and other special characters. Note that these additional names come from HTML: XML itself predefines only the five entities described above.
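In practice you rarely mask these characters by hand. As a minimal sketch, the following Python snippet uses escape from the standard module xml.sax.saxutils (which handles &, < and > by default) and passes the two quotation mark entities as an extra mapping. The helper name mask and the sample text are our own.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# escape() covers &, < and >; the two kinds of quotation mark are added here.
QUOTES = {'"': "&quot;", "'": "&apos;"}


def mask(text):
    """Replace the five special characters with their entities."""
    return escape(text, QUOTES)


raw = 'a < b & "c" > d'
masked = mask(raw)
print(masked)

# The masked text is safe both as an attribute value and as element content.
document = f'<Note text="{masked}">{masked}</Note>'
note = ET.fromstring(document)

# The parser turns the entities back into the original characters.
print(note.get("text") == raw, note.text == raw)
```

Libraries that build XML for you (such as ElementTree itself when writing a file) perform this masking automatically, so this step is only needed when XML is assembled as plain text.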

As far as notation goes, the first point to note is that XML is case sensitive, which means that we need to pay attention to our use of capital letters. In other words, The is not the same as the, which is different from THE. This is a critical point for you to bear in mind. After all, some systems encourage us to develop sloppy habits on this front. There are also a few additional rules for identifiers, by which we mean the names of tags and attributes.

Spaces are taboo, in any case. They are used solely to separate tag names and attributes within a tag. Entities are also forbidden. And special characters such as umlauts will interfere with some parsers too. In brief, it is safest to use only the following characters: upper- or lower-case letters (a-z, A-Z), digits, underscores (_), hyphens (-) and full stops. An identifier must not begin with a digit, a hyphen or a full stop. Colons have a special meaning that we will discuss below. In addition, no identifiers may begin with “xml” (irrespective of whether it is written in upper case, lower case or a mix of both). This is reserved for special purposes.
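The naming rules above can be condensed into a short check. The following Python sketch is deliberately conservative: it accepts only the “safe” character set recommended here, not everything the XML specification would technically permit, and the function name is_safe_name is our own invention.

```python
import re

# First character: letter or underscore. Then: letters, digits, _ - and .
SAFE_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_.\-]*")


def is_safe_name(name):
    """Check a tag or attribute name against the conservative rules above."""
    if name.lower().startswith("xml"):
        return False  # reserved, whatever the capitalisation
    return SAFE_NAME.fullmatch(name) is not None


for candidate in ["OrderNo", "item-no.1", "Line Item", "2ndItem", "XmlData", "Größe"]:
    print(candidate, is_safe_name(candidate))
```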

Rules in practice

The following discussion relates exclusively to data XML structures. These are a lot simpler than document structures. For example, documents from OpenOffice, NeoOffice, KOffice and so on are also saved in an XML format. After all, a file in OpenDocument format is nothing more than a ZIP archive containing a handful of XML files (alongside any embedded pictures and so on). One of those files – generally the biggest one by far – will contain the text from OpenOffice Writer (or whichever program is being used). Very special rules apply here – e.g. the order of all the elements within the file will be hugely important. By contrast, if you take another look at the examples above – either the order or the item and its product groups – you will quickly see that the sequence of the elements in these cases makes absolutely no difference. The line items carry their number in the attribute “no”, while the product groups for the item can also be rearranged into any order. OK, the “so long, farewell, auf Wiedersehen, goodbye” example admittedly wouldn’t be as funny if we changed the order …

At any rate, the following rules apply to XML files that transport solely non-document data (although some of them are equally applicable to documents too):

  • If no values or sub-tags are specified, the short form is always equivalent to the longer notation comprising opening and closing tags:

<Tag attr="abc"></Tag> is equivalent to <Tag attr="abc" />

  • A completely empty tag (i.e. one with no content and no attributes) can even be omitted altogether: <Empty /> can simply be deleted.
  • If an attribute has no value, it can also be omitted:

<Tag empty="">value</Tag> is equivalent to <Tag>value</Tag>

  • The order of the tags and attributes is irrelevant. All organisational criteria are represented through the data themselves, e.g. through the line item numbers in the initial example.

<ParentTag attr1="blah" attr2="stuff" attr3="gubbins">
  <ChildTag1>Text1</ChildTag1>
  <ChildTag2 attr1="a" attr2="b" attr3="c">Text2</ChildTag2>
  <ChildTag3>Text3</ChildTag3>
</ParentTag>

is equivalent to:

<ParentTag attr1="blah" attr3="gubbins" attr2="stuff">
  <ChildTag2 attr3="c" attr2="b" attr1="a">Text2</ChildTag2>
  <ChildTag3>Text3</ChildTag3>
  <ChildTag1>Text1</ChildTag1>
</ParentTag>

Unfortunately, at this point we should note that many XML parsers (especially custom-made ones) will see things differently. In those cases, empty tags will be required after all, and they will have to be provided in a very particular form (either <long></long> or <short />). In some data formats, the order of the tags can also be meaningful, instead of relying purely on attributes, as in the example above with the line item numbers.
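If you want to treat two data files as equal under these rules, you have to compare them in a way that ignores the order of attributes and child tags. The sketch below shows one possible approach in Python: each element is turned into a sorted, nested tuple before comparing. The helper names canonical and same_data are hypothetical, and text that follows a child tag (mixed content) is deliberately ignored, which is only acceptable for pure data structures.

```python
import xml.etree.ElementTree as ET


def canonical(elem):
    """Return a nested tuple that ignores attribute order and child order."""
    text = (elem.text or "").strip()
    attributes = tuple(sorted(elem.attrib.items()))
    children = tuple(sorted(canonical(child) for child in elem))
    return (elem.tag, attributes, text, children)


def same_data(xml_a, xml_b):
    """Compare two XML texts as data, not as character sequences."""
    return canonical(ET.fromstring(xml_a)) == canonical(ET.fromstring(xml_b))


FIRST = """<ParentTag attr1="blah" attr2="stuff" attr3="gubbins">
  <ChildTag1>Text1</ChildTag1>
  <ChildTag2 attr1="a" attr2="b" attr3="c">Text2</ChildTag2>
  <ChildTag3>Text3</ChildTag3>
</ParentTag>"""

SECOND = """<ParentTag attr1="blah" attr3="gubbins" attr2="stuff">
  <ChildTag2 attr3="c" attr2="b" attr1="a">Text2</ChildTag2>
  <ChildTag3>Text3</ChildTag3>
  <ChildTag1>Text1</ChildTag1>
</ParentTag>"""

print(same_data(FIRST, SECOND))
print(same_data('<Tag attr="abc"></Tag>', '<Tag attr="abc" />'))
```

A stricter receiving system, as described above, would of course not apply such a lenient comparison.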

Schemas

A further advantage of XML structures is that you can precisely define their structure in a globally applicable form. Whereas formats such as EDIFACT or VDA come with descriptions that are more-or-less comprehensible to humans, as well as a set of general rules, XML structures can be represented in an entirely machine-readable way. This used to be done with the help of a DTD (Document Type Definition), but today, the standard is the XSD (the XML Schema Definition). There are other approaches, but XSD is currently the most widely used. These schemas are themselves written in XML, which offers a major advantage over the tricky notation used in the old DTDs. XSDs are also more powerful than DTDs.

 

A schema for our opening example might appear as follows:

<?xml version="1.0" encoding="UTF-8"?>
<s:schema xmlns:s="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
  <!-- Top-level elements carry no minOccurs/maxOccurs; occurrence is set where they are referenced. -->
  <s:element name="Header">
    <s:complexType>
      <s:sequence>
        <s:element name="OrderNo" type="s:string" minOccurs="1" maxOccurs="1"/>
        <s:element name="CustomerNo" type="s:string" minOccurs="1" maxOccurs="1"/>
        <s:element name="OrderDate" type="s:dateTime" minOccurs="1" maxOccurs="1"/>
      </s:sequence>
    </s:complexType>
  </s:element>
  <s:element name="LineItem">
    <s:complexType>
      <s:sequence>
        <s:element name="ItemNo" type="s:string" minOccurs="1" maxOccurs="1"/>
        <s:element name="Quantity" type="s:integer" minOccurs="1" maxOccurs="1"/>
        <s:element name="PricePerUnit" type="s:float" minOccurs="1" maxOccurs="1"/>
        <s:element name="Name" type="s:string" minOccurs="0" maxOccurs="1"/>
      </s:sequence>
      <s:attribute name="no" type="s:integer"/>
    </s:complexType>
  </s:element>
  <s:element name="LineItems">
    <s:complexType>
      <s:sequence>
        <s:element ref="LineItem" minOccurs="1" maxOccurs="unbounded"/>
      </s:sequence>
    </s:complexType>
  </s:element>
  <s:element name="Order">
    <s:complexType>
      <s:sequence>
        <s:element ref="Header" minOccurs="1" maxOccurs="1"/>
        <s:element ref="LineItems" minOccurs="1" maxOccurs="1"/>
      </s:sequence>
    </s:complexType>
  </s:element>
</s:schema>

This is all fairly straightforward. If you compare the schema with the sample data, you should be able to quickly figure out the main principles. The purpose of these schemas is firstly to specify the maximum possible structure of an XML file.

We say “maximum” in the sense that the “Name” element, for example, appears in the XSD file, but not in our example XML file. In other words, the schema sometimes describes more than what is present in the actual data. Secondly, the schema serves to check the validity of existing XML data. This involves taking an order file and comparing it to the schema to see whether any elements or attributes appear that are forbidden by the schema; whether anything mandatory is missing (minOccurs="1"); or whether anything appears more often than it should.

A schema can also specify the number of values that an attribute or tag is allowed to have, as well as the type of certain values. You can also define custom types that are derived from other types (like in programming languages), as well as a host of other things.
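To make the idea of validation a little more tangible, here is a hand-written sketch in Python that performs the three occurrence checks just described for a single line item. It is emphatically not an XSD validator (Python’s standard library does not include one, and real projects would use a schema-aware tool): the rule table LINE_ITEM_RULES simply restates the minOccurs/maxOccurs values of the LineItem definition by hand, the function name check_children is our own, and neither the order of the elements nor their types are checked.

```python
import xml.etree.ElementTree as ET

# (minOccurs, maxOccurs) per child element; None would mean "unbounded".
LINE_ITEM_RULES = {
    "ItemNo": (1, 1),
    "Quantity": (1, 1),
    "PricePerUnit": (1, 1),
    "Name": (0, 1),
}


def check_children(elem, rules):
    """Return a list of human-readable problems (empty if none were found)."""
    counts = {}
    for child in elem:
        counts[child.tag] = counts.get(child.tag, 0) + 1

    problems = []
    for tag in counts:
        if tag not in rules:
            problems.append(f"<{tag}> is not allowed in <{elem.tag}>")
    for tag, (low, high) in rules.items():
        found = counts.get(tag, 0)
        if found < low:
            problems.append(f"<{tag}> is missing in <{elem.tag}>")
        elif high is not None and found > high:
            problems.append(f"<{tag}> appears {found} times in <{elem.tag}>")
    return problems


good = ET.fromstring(
    '<LineItem no="1"><ItemNo>S123</ItemNo><Quantity>5</Quantity>'
    "<PricePerUnit>9.99</PricePerUnit></LineItem>"
)
bad = ET.fromstring(
    '<LineItem no="2"><ItemNo>H456</ItemNo><ItemNo>H457</ItemNo>'
    "<Colour>red</Colour><PricePerUnit>17.95</PricePerUnit></LineItem>"
)

print(check_children(good, LINE_ITEM_RULES))
for problem in check_children(bad, LINE_ITEM_RULES):
    print(problem)
```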

 

We don’t have room for a comprehensive discussion of XSD here, so we’ll just add one more thing:

Complex XML structures are best defined using multiple schema files. This involves one central XSD that integrates other schemas by means of include or import instructions. That allows us to define certain basic types for things like pricing information, telephone numbers or web addresses in one file; specify everything relating to customer information in another file; and then define all item-related information in a third file; before using all of this in a central document that describes the structure of an order or a catalogue.

It was almost too easy - Namespaces

If only we could leave it there, XML would be wonderfully straightforward. And so readable too! Unfortunately, there is another element that makes everything more complicated: namespaces. As we have already mentioned, complex structures are sometimes described over multiple XSDs. And this is where namespaces often come into play. We could compare these to the packages used in Java. A class called Price might appear in a package called en.example.basictypes, while another completely separate and independent class called Price appears in the package en.example.iteminfo. Similarly, in XML, a price element can be defined as a basic type in the following way:

<s:simpleType name="Price">
  <s:restriction base="s:float">
    <s:minInclusive value="0" />
  </s:restriction>
</s:simpleType>

This specifies that a price is a decimal value that can be zero, but can never be negative. However, the price information for a particular item might simultaneously include the purchase price, the sale price and the VAT rate:

<s:complexType name="Price">
  <s:sequence>
    <s:element name="PurchPrice" type="Price" />
    <!-- minOccurs and maxOccurs are not specified, so a 1 is assumed for both -->
    <s:element name="SalePrice" type="Price" />
    <s:element name="VAT" type="s:float" />
  </s:sequence>
</s:complexType>

The complex type is called “Price”, and two of its elements are of a type also called “Price” – but this is the simple type. This is bound to lead to confusion between types; in fact, XSD does not allow two types with the same name within the same namespace at all. And this is where namespaces come into play. You can see what these look like in the schema itself. First of all, a namespace needs to be declared:

<s:schema xmlns:s="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">

Here, we have declared the namespace prefix “s” and bound it to a URL. This is the exact URL that identifies XML Schema itself. In normal XML files, any URL could appear here in principle, since it doesn’t actually need to link to anywhere – unless you actually want to test the structure. But we’ll come back to that at the end. In any case, every namespace gets its own URL.

And we use the declared namespace by placing its prefix in front of the tag or attribute names (or attribute values, at least in schemas):

<s:element name="ItemNo" type="s:string" minOccurs="1" maxOccurs="1"/>

Both the “element” tag and the “string” type belong to the namespace bound to the prefix “s”, which simply contains all the elements and types defined for schemas by the World Wide Web Consortium (W3C for short).

If we apply this to our price example, we should place the basic definition of a price (i.e. that it cannot be negative) in the “basic” namespace (for example), while the pricing information for an item including purchase price, sale price and VAT will appear in the “item” namespace. Once we have given the simple price its namespace, we can then define the item price, followed by a mini-item:

<s:schema xmlns:s="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified"
          targetNamespace="http://www.lobster.de/madeup/item.xsd"
          xmlns="http://www.lobster.de/madeup/item.xsd"
          xmlns:basic="http://www.lobster.de/madeup/basic.xsd">
  <!-- Here, we begin by specifying which schema and namespace the types and elements defined below should appear in. -->
  <!-- We have also already indicated that we will use elements from the "basic" namespace. -->
  <!-- And now we import the basic.xsd schema. -->
  <s:import namespace="http://www.lobster.de/madeup/basic.xsd"
            schemaLocation="basic.xsd" />
  <!-- Please note: The URL must exactly match the one provided in the schema tag above it. We have also assumed that the file basic.xsd is located next to the item.xsd file. -->
  <s:complexType name="Price">
    <s:sequence>
      <s:element name="PurchPrice" type="basic:Price" />
      <s:element name="SalePrice" type="basic:Price" />
      <s:element name="VAT" type="s:float" />
    </s:sequence>
  </s:complexType>
  <s:element name="Item">
    <s:complexType>
      <s:sequence>
        <s:element name="Name" type="s:string" />
        <s:element name="PriceInfo" type="Price" />
        <!-- Here we are dealing with the (complex) price in the same schema/namespace as for the item, which is why there is no prefix -->
        <s:element name="OfferPrice" type="basic:Price" />
        <!-- This is the other simple price, so it needs a prefix -->
      </s:sequence>
    </s:complexType>
  </s:element>
</s:schema>

The item itself is part of a catalogue that is assigned the namespace “cat” (let’s leave the schema completely to one side here). In the XML file, which uses both schemas, the end result looks like this:

<cat:Catalogue xmlns:cat="http://www.lobster.de/madeup/catalogue.xsd">
  <!-- Here, the namespace "cat" is declared in the first element that uses that namespace -->
  <item:Item xmlns:item="http://www.lobster.de/madeup/item.xsd">
    <!-- And now we also declare the "item" namespace -->
    <item:Name>USB Stick 32G</item:Name>
    <item:PriceInfo>
      <item:PurchPrice>5.50</item:PurchPrice>
      <!-- Here, the price from "basic:" is still just the number, which is why we don't see the namespace anymore -->
      <item:SalePrice>17.99</item:SalePrice>
      <item:VAT>19.0</item:VAT>
    </item:PriceInfo>
    <item:OfferPrice>13.49</item:OfferPrice>
  </item:Item>
</cat:Catalogue>

If we now want to validate this XML file – i.e. check that its structure is clean – we will need to include the schema files with the specified URLs. If we decide to skip the validation, then the URL could be anything we like, provided it looks like a URL.
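For software that reads such a file, the prefix itself is irrelevant; what counts is the URL behind it. The following sketch, again using Python’s standard xml.etree.ElementTree, reads a shortened copy of the catalogue and shows that the lookup still works after the prefix in the file has been renamed. The prefix “i” in the NS mapping and the renamed prefix “x” are arbitrary choices of ours.

```python
import xml.etree.ElementTree as ET

CATALOGUE = """<cat:Catalogue xmlns:cat="http://www.lobster.de/madeup/catalogue.xsd">
  <item:Item xmlns:item="http://www.lobster.de/madeup/item.xsd">
    <item:Name>USB Stick 32G</item:Name>
    <item:OfferPrice>13.49</item:OfferPrice>
  </item:Item>
</cat:Catalogue>"""

# Our own prefix for searching; only the URL has to match the file.
NS = {"i": "http://www.lobster.de/madeup/item.xsd"}

root = ET.fromstring(CATALOGUE)

# Internally, the parser replaces every prefix with the full URL in braces.
print(root.tag)

name = root.findtext("i:Item/i:Name", namespaces=NS)
print(name)

# Rename the prefix in the file from "item" to "x": the data stays the same.
renamed = CATALOGUE.replace("item:", "x:").replace("xmlns:item", "xmlns:x")
renamed_name = ET.fromstring(renamed).findtext("i:Item/i:Name", namespaces=NS)
print(renamed_name == name)
```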

Let’s leave things here. There are endless examples and detailed explanations of namespaces to be found online. All you need to know for now is that they exist, and what the prefixes stand for.

What is it?
Let’s start by quoting a paragraph from the openTRANS homepage:

“The openTRANS initiative brings leading German and international companies together under the leadership of the Fraunhofer IAO with the goal of standardising business documents (orders, delivery notes, invoices etc.) in order to build a foundation for electronic system-to-system communication. An expert working group has been formed to define these business documents using XML as a basis and to develop integration solutions for buyers, suppliers and marketplace operators.”

In other words, we can define openTRANS as a messaging standard that takes XML as its underlying data format. The choice of XML is a sensible one, as this means any software that understands XML will also be able to read or generate openTRANS documents. Incidentally, openTRANS and BMEcat are closely linked, with openTRANS schemas even using types and elements derived from the BMEcat world.

So what kinds of documents are there? As of September 2013, openTRANS is in version 2.1, which contains:

  • RFQ (Request For Quotation)
  • QUOTATION
  • ORDER
  • ORDERCHANGE
  • ORDERRESPONSE
  • DISPATCHNOTIFICATION
  • RECEIPTACKNOWLEDGEMENT
  • INVOICE
  • INVOICELIST
  • REMITTANCEADVICE

This list has been taken from the openTRANS homepage.

Can you use openTRANS, and are you allowed to?

The answer to the second question is yes. It’s even free of charge. Of course, we aren’t the final authority on this, so you can find more detailed terms and conditions in the FAQs on the openTRANS homepage.

As for the “can you” part: As we have already said, the choice of XML as the underlying data format for this messaging standard means that any EDI/EAI software that can handle XML will also be able to work with openTRANS documents. Once you have registered on the homepage (free of charge), you can download format descriptions in either human-legible (PDF) format or as machine-readable XML schemas. Beyond that, however, you will need to do a bit of extra work, depending on your software.

Example:

As a final example, here is the rough structure of the INVOICE in the openTRANS standard version 2.1:

This structure has over 800 elements. That represents an awful lot of work. So it’s handy that you don’t have to do everything yourself…

What is it?
To start with, let’s take a look at the homepage of the BMEcat project:

“The German Association for Supply Chain Management, Procurement and Logistics (BME), based in Frankfurt, has launched an initiative to develop an electronic data interchange standard for product catalogues that has been actively supported by major companies. […] BMEcat is currently open to public review in the version ‘2005 final draft’. During this phase, the new version will be tested in practice and any errors will be uncovered. BMEcat 2005 grants new sectors and product groups access to the electronic interchange of product information. […] BMEcat establishes a basis for simple transfers of catalogue data from a wide range of different formats, and in particular, it creates the right conditions to drive the online exchange of goods between companies in Germany. The XML-based standard BMEcat has been successfully deployed in many different projects.”

The 2005 version is still the latest one, and is also used as part of the openTRANS standard. That means that elements from BMEcat 2005 can also be found in openTRANS structures. Even their schemas are linked. The BMEcat messaging standard is based on XML.

Structure

The key points relating to this structure are:

  • The structure is fairly large. It has over 2500 elements in total (which are mostly numbered in sequence in the structure).
  • There are three types of BMEcat files, which you can see under the Choice-228 node. A choice (the term comes from XML Schema) offers a set of alternative elements, only one of which can be used. In this case, only one of the following types of content can be transferred:
    • An entire (new) catalogue (T_NEW_CATALOG)
    • A product update (T_UPDATE_PRODUCTS)
    • A price update (T_UPDATE_PRICES)
  • In other words, each individual file only uses part of the overall structure. Even an entire new catalogue will consist of only around 1500 elements in total. And only a small number of those will actually appear in the file. However, certain elements will appear very frequently.
  • An entire catalogue consists chiefly of three main parts:
    • The catalogue groups
    • The products
    • The classification (which products are assigned to which catalogue groups)
  • Updates contain only the most essential information. In principle, a product update should contain all the data relating to the items, including their classification into various catalogue groups, but it should not include any changes to existing groups. And price updates are limited to a small amount of basic product information, as well as their pricing data. If any changes are made to the catalogue groups, an entire new catalogue will be required.
  • Catalogue groups can be assigned to just one superordinate group in a structure diagram. Products, by contrast, can belong to any number of different catalogue groups.
  • Certain general information is contained in the header, which is separate from the three types outlined above. This can include the catalogue being referred to, the language, the parties involved, and so on.
  • If one of the elements available in the standard proves inadequate, it is possible to specify certain user-defined extensions. This allows for extensions to be made to the format in response to your needs. However, it requires the agreement of all the parties involved!
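Because of the choice described above, the first thing a receiving system usually has to find out is which of the three types of content a file carries. The following Python sketch shows one way of doing that. The sample document is a drastically simplified assumption (a root element with a header and one transaction element, with no namespace, attributes or content); the three transaction names are those listed above, and the helper names are our own.

```python
import xml.etree.ElementTree as ET

TRANSACTION_TYPES = ("T_NEW_CATALOG", "T_UPDATE_PRODUCTS", "T_UPDATE_PRICES")


def local_name(tag):
    """Strip a '{namespace-url}' part from a tag name, if there is one."""
    return tag.rsplit("}", 1)[-1]


def transaction_type(xml_text):
    """Return the one transaction element found directly below the root."""
    root = ET.fromstring(xml_text)
    found = [
        local_name(child.tag)
        for child in root
        if local_name(child.tag) in TRANSACTION_TYPES
    ]
    if len(found) != 1:
        raise ValueError(f"expected exactly one transaction element, found {found}")
    return found[0]


# Simplified, made-up sample; real files are far larger.
SAMPLE = "<BMECAT><HEADER/><T_UPDATE_PRICES/></BMECAT>"
print(transaction_type(SAMPLE))
```

A file that contained two of these elements, or none, would violate the choice, which is why the sketch raises an error in that case.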

Who uses this and what does it cost?

In fact, it’s used by everyone who wants to exchange catalogue data – especially the manufacturing and retail sectors. According to the BME’s own marketing, the format is now the “de facto standard for the interchange of electronic product catalogues” and is used primarily in the German-speaking world – though there are naturally plans to take it further afield.

There is no charge to use the format. Once you have registered on the homepage (also free of charge), you can gain access to the download zone, where this will be explained to you. You will also have access to format descriptions for the various versions in a range of different file formats, including PDF and XSD.