XML Data : Well Formed XML File
The following example is a three-line snippet demonstrating a well-formed XML data file in its simplest possible form. It can be entered using a simple text editor or any of the XML editors now available. When saving the document, the file name should end with .xml.
<box>
Content of box
</box>
Q. So, what do the three lines above mean?
A. The <box> … </box> pair is an opening tag and a closing tag, which together form an element. The element provides a container for some content; in this case, “Content of box”. The element name “box” is quite simply a term I’ve chosen to best describe this content. The name in the closing tag must always match the name in the opening tag, with one difference: in the closing tag the name is preceded by a forward slash, as shown. This is important.
Q. What is an XML data file for and what does it do?
A. An XML data file simply holds data, in this case “Content of box”, and it should hold this data in a self-describing manner. Our example does so using the <box> … </box> element.
Q. Why would you need to do that?
A. XML provides computer programs and different operating systems with a method by which they can exchange information in an unambiguous fashion. XML is a standard format which is widely accepted and adopted throughout the industry and has changed very little since its inception. So, any program could be written to read the above XML data file reliably and make use of its content.
Another program could be made to update the ‘box’ content of this file. XML files typically carry more complex data sets to be useful but the above three lines of code show just how simple such a file can be and that it can be read by humans as well as machines.
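As a rough illustration of a program reading and updating the box content, here is a minimal Python sketch using the standard library's xml.etree.ElementTree module. The function names are my own and the document is held in a string rather than a file; treat it as one possible approach rather than the definitive one.

```python
import xml.etree.ElementTree as ET

# The three-line document discussed above, held in a string for convenience.
BOX_XML = "<box>\nContent of box\n</box>"


def read_box(xml_text: str) -> str:
    """Parse the document and return the text held by the <box> element."""
    root = ET.fromstring(xml_text)
    return root.text.strip()


def update_box(xml_text: str, new_content: str) -> str:
    """Return a new document string with the <box> content replaced."""
    root = ET.fromstring(xml_text)
    root.text = new_content
    return ET.tostring(root, encoding="unicode")


def is_well_formed(xml_text: str) -> bool:
    """True if the parser accepts the text, False if it raises ParseError."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    print(read_box(BOX_XML))
    print(update_box(BOX_XML, "New content"))
    # The closing tag below does not match the opening tag.
    print(is_well_formed("<box>Content of box</bax>"))
```

The last call shows why the matching closing tag matters: a parser rejects the document outright when the names differ.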
XML Data : Well Formed XML Structure
The following example shows a simple well-formed XML structure. It can be entered using a simple text editor or any of the XML editors now available. When saving the document, the file name should end with .xml, for example “human.xml” or “yourFileName.xml”.
18 July 1990
In the example above, the <human> … </human> element contains three other elements, each of which contains textual content. Because the three elements are located between the opening and closing human tags, they are ‘child elements’ of human. These child elements provide specific details about the parent element. In this case they tell us that this human is “male”, was born “18 July 1990” and has “Blonde” hair. XML allows us to structure our data in a logical hierarchy that makes sense. The content provided by the child elements in the above example could arguably be supplied using attributes instead.
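The parent and child relationship can be sketched in Python as follows. The element names gender and born are used elsewhere in this tutorial; the name hair for the third child is my assumption, as is the helper function name.

```python
import xml.etree.ElementTree as ET

# gender and born are named in the tutorial text; "hair" is an assumed name.
HUMAN_XML = """
<human>
  <gender>male</gender>
  <born>18 July 1990</born>
  <hair>Blonde</hair>
</human>
"""


def child_details(xml_text: str) -> dict:
    """Map each child element's name to its text content."""
    root = ET.fromstring(xml_text)
    # Iterating over an element yields its direct children, in document order.
    return {child.tag: child.text for child in root}


if __name__ == "__main__":
    for name, value in child_details(HUMAN_XML).items():
        print(name, "=", value)
```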
XML Data : Attributes
The following example shows a simple well-formed XML file with attributes. It can be entered using a simple text editor or any of the XML editors now available. When saving the document, the file name should end with .xml.
<human gender="male" born="18 July 1990">
Unlike our previous example, the above <human> … </human> element now contains just one child element. This time, the information stored in gender and born is made available using attributes instead of ‘child elements’. The order of the attributes doesn’t matter, but each should take the form attr="stuff" and be placed inside the opening tag, after the element name.
Where more than one attribute is used, these should be separated by whitespace, and each attribute name should be unique within any one element. At this point you may well ask: when should you place content within an attribute as opposed to a child element? Well, the answer to this is not quite as clear as you might think.
Some groups maintain that attributes are metadata about the element, while elements are for the information itself but it’s not always obvious which is which. I’d advise you to consult additional resources for guidance, beyond which common sense should prevail.
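The attribute rules can be checked with a short Python sketch. The single remaining child element (hair) and the helper names are assumptions made for illustration; the point is only that attribute order is irrelevant and that a repeated attribute name is rejected by the parser.

```python
import xml.etree.ElementTree as ET

# The child element <hair> is an assumed name, used only for illustration.
ATTR_XML = '<human gender="male" born="18 July 1990"><hair>Blonde</hair></human>'
REORDERED_XML = '<human born="18 July 1990" gender="male"><hair>Blonde</hair></human>'
DUPLICATE_XML = '<human gender="male" gender="female"/>'


def attributes_of(xml_text: str) -> dict:
    """Return the root element's attributes as a plain dictionary."""
    return dict(ET.fromstring(xml_text).attrib)


def parses(xml_text: str) -> bool:
    """True if the text is accepted by the parser."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    print(attributes_of(ATTR_XML))
    # Same attributes in a different order: the parsed result is equal.
    print(attributes_of(ATTR_XML) == attributes_of(REORDERED_XML))
    # The same attribute name twice on one element is not well formed.
    print(parses(DUPLICATE_XML))
```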
XML Data : Names
For consistency, the rules for creating a valid attribute name are the same as those for creating valid element names and for the names of some lesser-known constructs. Collectively I refer to these as XML names. XML names may contain characters in the ranges A-Z, a-z and 0-9.
They may also include non-English letters, numbers and ideograms. XML names may also include an underscore, hyphen or period. A valid XML name may only start with a letter, ideogram or underscore. It may not start with a number, hyphen or period. There is no limit on the length of a name.
The following element names are all valid:
The following element names are all invalid:
<first name>Peter</first name>
In the first line, the element name contains an apostrophe. In the second line, the element name contains a forward slash. In the third line, the element name begins with a digit and in the fourth, the element name contains a space. All of these are illegal XML names.
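The naming rules above can be approximated with a regular expression, as in this Python sketch. It is only an approximation: the XML specification lists exact Unicode ranges, whereas this pattern leans on Python's \w class as a stand-in, and the function name and sample names are my own.

```python
import re

# Approximation of the naming rules described above:
#   first character: a letter (any script) or an underscore
#   later characters: letters, digits, underscore, hyphen or period
# The real XML grammar lists exact Unicode ranges; \w is a stand-in here.
NAME_PATTERN = re.compile(r"[^\W\d][\w.\-]*")


def is_plausible_xml_name(name: str) -> bool:
    """True if the whole string matches the approximate name pattern."""
    return NAME_PATTERN.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ["first_name", "first-name", "_id", "1st", "first name", "jim's", "a/b"]:
        print(repr(candidate), is_plausible_xml_name(candidate))
```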
XML Data : Namespaces
What constitutes a valid XML name was briefly covered in an earlier 5 minute tutorial. Such an XML name in itself is not always enough, however. Situations can arise where XML names become ambiguous. Suppose, for example, that two or more XML files are merged.
The files may have more than one author and at least one of the XML names could have been defined in more than one of the merged files. This creates a potential conflict. To illustrate the problem consider the following mark-up:
<name>Dining Room Table</name>
When there is only one ‘table’ element defined, we have clarity. As soon as we merge a second ‘table’ element definition alongside the first, however, we are no longer able to discern the meaning of one from the other.
The context in which the two element definitions operate now requires further qualification. We could, of course, rename one of the conflicting elements but in a big document and where there may be many such duplicate names, this is no longer a viable exercise.
The issue is resolved quite simply by using ‘namespaces’. Namespaces not only differentiate between duplicate XML names but also perform the important job of creating distinct groups to which elements and attributes belong.
<ct:name>Dining Room Table</ct:name>
Namespaces work by giving each element (or attribute) name a prefix. This is separated from what is termed the ‘local part’ of the name by a single colon (as shown above). Each prefix is associated with a URI (Uniform Resource Identifier). The most common form is a URL (Uniform Resource Locator).
The URI is not used for lookup over the internet. It is adopted purely to identify a set of data objects uniquely, and since URIs are used globally they are ideal for the job. URIs are bound to a namespace prefix using the xmlns attribute, in the form xmlns:prefix="URI".
The prefix and the full URI are not interchangeable, as URIs can contain characters which are illegal in an XML name. It is also illegal to use a namespace prefix beginning with the three letters XML, in any case combination, as these are reserved.
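To see how a parser tells two ‘table’ elements apart once they are in different namespaces, consider this Python sketch. The ct prefix comes from the example above; the ht prefix, both URIs, the root element and the second table's content are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Both URIs are hypothetical identifiers; nothing is fetched from them.
FURNITURE_URI = "http://www.example.com/furniture"
HTML_URI = "http://www.example.com/html"

MERGED_XML = f"""
<root xmlns:ct="{FURNITURE_URI}" xmlns:ht="{HTML_URI}">
  <ct:table>
    <ct:name>Dining Room Table</ct:name>
  </ct:table>
  <ht:table>
    <ht:tr><ht:td>Cell</ht:td></ht:tr>
  </ht:table>
</root>
"""


def table_tags(xml_text: str) -> list:
    """Return the expanded names of the root's children as '{uri}local'."""
    return [child.tag for child in ET.fromstring(xml_text)]


def furniture_names(xml_text: str, prefix: str = "ct") -> list:
    """Find furniture table names; the lookup prefix is only a local label."""
    root = ET.fromstring(xml_text)
    path = f"{prefix}:table/{prefix}:name"
    return [e.text for e in root.findall(path, {prefix: FURNITURE_URI})]


if __name__ == "__main__":
    print(table_tags(MERGED_XML))
    print(furniture_names(MERGED_XML))
    # A different prefix bound to the same URI finds the same element.
    print(furniture_names(MERGED_XML, prefix="f"))
```

ElementTree reports each name with its URI in braces, which shows that it is the URI, not the prefix, that identifies the namespace.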
XML Data : XML Data Types
Whether or not a programming language is strongly typed, it’s important that the data stored in an XML document is unambiguous. It should, therefore, be correctly specified. If an application reading data from an XML document expects to find, say, an integer value between a specific element’s opening and closing tags, then it is important that it finds one there. A whole raft of data-types exists for describing XML data objects:
The above XML data-type reference hierarchy illustrates the built-in data-types and shows logical derivation. It is also possible to create user derived data-types from these. For an in depth appreciation of the various types listed please visit http://www.w3.org/TR/xmlschema-2/.
Specifying data-types for elements and attributes within an XML file is done in a separate file called a schema. Without getting into schemas just yet, let’s just look at some everyday data-type examples:
<street>1 My Street</street>
Each of the elements above indicates the type of data that would typically be held at those data locations. It is important that applications which are required to read and/or write to this resource are able to do so effectively, without flagging I/O exceptions. The designation of appropriate data-types is a major part of why schemas are necessary.
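To make the point about expected data-types concrete, here is a small Python sketch in which a dictionary of converter functions stands in for the type information a schema would carry. The street element is taken from the example above; the floors and built elements, the address wrapper and the function name are hypothetical.

```python
import xml.etree.ElementTree as ET
from datetime import date

# street comes from the tutorial; floors and built are hypothetical elements.
ADDRESS_XML = """
<address>
  <street>1 My Street</street>
  <floors>2</floors>
  <built>1990-07-18</built>
</address>
"""

# Stand-in for schema type information: element name -> converter.
EXPECTED_TYPES = {
    "street": str,                 # plain text
    "floors": int,                 # whole number
    "built": date.fromisoformat,   # calendar date written as YYYY-MM-DD
}


def read_typed(xml_text: str) -> dict:
    """Convert each child's text using the converter registered for its name.

    A ValueError is raised if the text does not fit the expected type.
    """
    root = ET.fromstring(xml_text)
    return {child.tag: EXPECTED_TYPES[child.tag](child.text) for child in root}


if __name__ == "__main__":
    print(read_typed(ADDRESS_XML))
    try:
        read_typed(ADDRESS_XML.replace(">2<", ">two<"))
    except ValueError as error:
        print("rejected:", error)
```

A real schema validator reports this kind of mismatch before the application ever sees the data, which is the job described in the paragraph above.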
XML Data : XML Schema
An XML schema definition language is a mechanism for creating schemas. A schema is a document for defining the structure, content and semantics of an XML document. A number of schema definition languages are available for use. The DTD (or Document Type Definition) language was widely used by the XML community but has largely been superseded by XSD (or XML Schema Definition) language. XSD is recommended and maintained by the web standards body, W3C. Unlike DTD, XSD is itself written in XML (therefore, extensible), has support for data-types and namespaces and is generally more comprehensive.
An XML schema defines the elements and attributes that go into your XML document, their data-types and default values (if any). It defines which elements are child elements, the order and number of them. It also defines whether an element is empty or can include text. An XML file is not required to have a schema declaration but when one is included it will be used to validate the XML document against all the above criteria.
Schemas are designed by various institutions and professional bodies to represent a common protocol for data interchange within a given industry, profession or other specialist domain. Schemas are created with the intent that they will find widespread adoption by their community and, in so doing, improve marketplace cohesion. I have listed just a few examples from an ever-increasing number of schemas now available:
- RSS (Really Simple Syndication) for news syndication,
- FpML (Financial products Markup Language) and FIXML (Financial Information eXchange Markup Language) for the financial markets,
- XBRL (Extensible Business Reporting Language) for the business markets,
- SDMX-ML (Statistical Data and Metadata eXchange Markup Language) for sharing statistical data,
- RDF (Resource Description Framework) for metadata,
- MathML (Mathematical Markup Language) for mathematicians and
- SVG (Scalable Vector Graphics) for vector images.
A more comprehensive repository of schemas can be found in this XML Standards Library.
Understanding all the nuances of XSD is a challenge but a number of XML editors (of varying capability) are available to help you simplify the task of creating XML documents and schemas. An editor will typically provide code completion and help with syntax during the design process. It should also be able to generate a sample XML document from your finished schema.
Some will attempt to generate a schema from a sample XML document. Some will provide you with a graphical representation of your schemas and XML files and may generate other documentation for you, also. XML editors can help you learn XML technology as well as help you manage large, complex schemas and XML documents.
The basic syntax for including a schema namespace definition within your XSD file is as follows:
<xsd:schema targetNamespace="http://www.myschema.com" xmlns="http://www.myschema.com" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
The targetNamespace is an attribute of schema. In this example it defines the URI http://www.myschema.com. This URI identifies the current schema’s namespace. It is also defined as the default namespace by xmlns="http://www.myschema.com" (note the absence of a prefix).
This means that an element in the XML instance document(s) does not need to be prefixed to show which schema it belongs to. Unless specifically prefixed, all elements in a document that declares this default namespace belong to it (note that a default namespace does not apply to unprefixed attributes). Another URI is also defined within our schema header: xmlns:xsd="http://www.w3.org/2001/XMLSchema". Note that this one is bound to the prefix xsd.
This now means that if an element or attribute in our instance document has a prefix associated with the same URI, then this schema resource should be referenced instead of our default schema. Note that the prefix in itself is unimportant. What is important is that the XSD prefix and the XML instance document prefix are both bound to the same URI.
If we replaced our example schema document’s default declaration xmlns="http://www.myschema.com" with, say, xmlns:ms="http://www.myschema.com", the instance document would be required to prefix all its elements and attributes with a prefix associated with that same URI.
We could then remove the xsd prefix from our W3C declaration, changing xmlns:xsd="http://www.w3.org/2001/XMLSchema" to xmlns="http://www.w3.org/2001/XMLSchema", and this would instead become our default namespace. This arrangement is common and often makes good sense.
Namespace support in XSD allows an instance document to use any prefix, and allows a schema to accept unknown elements and attributes from known or unknown namespaces. This is not the case for DTDs.
To omit the ‘targetNamespace’ is to work without namespaces. The function of the ‘targetNamespace’ is to bind a namespace to a W3C XML schema document. In the above example we bound the URI http://www.myschema.com to represent our default namespace.
The only part of the schema namespace definition example I haven’t covered yet is the opening <xsd:schema…> part. The prefix here simply indicates that this line should be processed using the namespace URI bound to the xsd prefix (i.e. http://www.w3.org/2001/XMLSchema ).
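The namespace bindings in the schema header can be inspected with Python's standard library, as in this sketch. The header is the one shown earlier; the single xsd:element declaration inside it (box, of type xsd:string) is a hypothetical filler so that the document is complete, and the function names are my own.

```python
import io
import xml.etree.ElementTree as ET

XSD_NS = "http://www.w3.org/2001/XMLSchema"

# The <xsd:element> declaration is a hypothetical filler for illustration.
SCHEMA_XSD = """<xsd:schema targetNamespace="http://www.myschema.com"
            xmlns="http://www.myschema.com"
            xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="box" type="xsd:string"/>
</xsd:schema>
"""


def namespace_bindings(xml_text: str) -> dict:
    """Return {prefix: uri}; the default namespace has the empty prefix ''."""
    events = ET.iterparse(io.StringIO(xml_text), events=("start-ns",))
    return {prefix: uri for _, (prefix, uri) in events}


def describe_schema(xml_text: str) -> tuple:
    """Return the root's expanded name and its targetNamespace attribute."""
    root = ET.fromstring(xml_text)
    return root.tag, root.get("targetNamespace")


if __name__ == "__main__":
    print(namespace_bindings(SCHEMA_XSD))
    print(describe_schema(SCHEMA_XSD))
    # Renaming the prefix from xsd to xs leaves the expanded name unchanged.
    print(describe_schema(SCHEMA_XSD.replace("xsd", "xs")))
```

The final call illustrates that the prefix itself is unimportant: what identifies the schema element is the URI the prefix is bound to.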
XML Data : ElementFormDefault Schema Attribute
A schema itself may well be composed of components from multiple schemas, each in its own namespace. A schema designer has to decide whether to expose these namespaces to the instance document or hide them from it. The elementFormDefault schema attribute allows them to do just this.
Setting elementFormDefault="unqualified" (the default) will hide (or localise) the namespaces, whereas setting it to "qualified" will expose the namespaces defined within the schema to the instance document.
By way of example the schema below describes a car which sources components from three other schemas. The chassis, wheels and interior are all derived from separate manufacturers.
<xsd:element name="chassis" type="Ford:chassis"/>
<xsd:element name="wheels" type="Toyota:wheels"/>
<xsd:element name="interior" type="Audi:interior"/>
Note the import elements. These allow access to elements from the different manufacturers. Note also that the schema attribute elementFormDefault is set to unqualified. This hides the different manufacturers’ namespaces from any instance document. Such an instance document might look something like this:
<my:car xmlns:my="http://www.car.org" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.car.org Car.xsd">
Ford F-Series F-150 Regular Cab 2WD
Only the namespace qualifier of the car root element is exposed in the instance document above. The different car manufacturers providing the various components are now hidden or ‘localised’ to the schema definition. The instance document doesn’t concern itself with where the components are sourced from, only that they are available.
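The effect of hiding the namespaces can be seen by parsing such an instance document and listing the namespace of each element, as in this Python sketch. The root element and its attributes are taken from the example above; the child element contents are assumptions (I have guessed that the Ford text belongs to chassis and used placeholders for the others), and the helper names are my own.

```python
import xml.etree.ElementTree as ET

# Child contents are assumed or placeholder text, used only for illustration.
INSTANCE_XML = """<my:car xmlns:my="http://www.car.org"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://www.car.org Car.xsd">
  <chassis>Ford F-Series F-150 Regular Cab 2WD</chassis>
  <wheels>placeholder wheels description</wheels>
  <interior>placeholder interior description</interior>
</my:car>
"""


def split_tag(tag: str) -> tuple:
    """Split ElementTree's '{uri}local' form into (uri, local)."""
    if tag.startswith("{"):
        uri, local = tag[1:].split("}", 1)
        return uri, local
    return "", tag


def namespaces_in_use(xml_text: str) -> list:
    """List (namespace URI, local name) for every element in the document."""
    root = ET.fromstring(xml_text)
    return [split_tag(element.tag) for element in root.iter()]


if __name__ == "__main__":
    for uri, local in namespaces_in_use(INSTANCE_XML):
        print(local, "->", uri or "(no namespace)")
```

Only car carries a namespace URI here; chassis, wheels and interior appear unqualified, which is what an instance document looks like when the schema leaves elementFormDefault as unqualified.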