What does this tool do?

This tool helps create scripts that migrate XML data from one version of an XML schema to a later version of the same schema.

Usage

XMLSchemaEvolver SchemaVersion1.xsd SchemaVersion2.xsd

Output:

1. A schema diff showing which elements have changed

2. XSLT to translate XML data from SchemaVersion1 to SchemaVersion2

How does it work?

The basic idea is this:

1) Diff the two XML Schema (XSD) files.

2) Each change is classified as an INSERT, DELETE, MOVE or RENAME operation.

3) For each of these operations, emit simple XSLT to carry out the desired data change.

4) These data change operations are modeled after a set of standard XSLT operations suggested by Jesper Tverskov in XSLT Transformation Patterns. A full list of the transformations emitted by our code can be found in XSLT Transformations.txt in the documentation folder.
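
As a rough illustration of steps 2) and 3), the following C++ sketch maps a classified change to an XSLT template layered over an identity transform. It is not the tool's actual code: ChangeKind, AttributeChange, identity_template and emit_attribute_template are hypothetical names, only attribute renames and deletes are covered, and the element(*, type) match form assumes a schema-aware XSLT 2.0 processor.

```cpp
#include <string>

// Hypothetical change kinds; the names mirror the four operations listed above.
enum class ChangeKind { Insert, Delete, Move, Rename };

// Hypothetical record describing one change to an attribute of a schema type.
struct AttributeChange {
    ChangeKind kind;
    std::string type_name;  // prefixed type name, e.g. "emp:Locale"
    std::string old_name;   // attribute name in the old schema
    std::string new_name;   // attribute name in the new schema (Rename only)
};

// Identity template: copies every node and attribute that no more specific
// template overrides.
std::string identity_template() {
    return
        "<xsl:template match=\"@*|node()\">\n"
        "  <xsl:copy>\n"
        "    <xsl:apply-templates select=\"@*|node()\"/>\n"
        "  </xsl:copy>\n"
        "</xsl:template>\n";
}

// Returns an override template for one attribute change, or an empty string
// for the kinds this sketch does not cover (Insert and Move).
std::string emit_attribute_template(const AttributeChange& c) {
    const std::string match =
        "element(*, " + c.type_name + ")/@" + c.old_name;
    switch (c.kind) {
    case ChangeKind::Rename:
        // Re-create the attribute under its new name with the same value.
        return "<xsl:template match=\"" + match + "\">\n"
               "  <xsl:attribute name=\"" + c.new_name + "\" select=\".\"/>\n"
               "</xsl:template>\n";
    case ChangeKind::Delete:
        // An empty template drops the matched attribute from the output.
        return "<xsl:template match=\"" + match + "\"/>\n";
    default:
        return "";
    }
}
```

For example, a Rename change with type_name "emp:Locale", old_name "region" and new_name "country" yields a template that matches element(*, emp:Locale)/@region and writes a country attribute with the same value. Everything else in the document falls through to the identity template and is copied unchanged.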

Will this program do everything I will ever need?

Probably not. The point of this code is to automate simple changes that can be found by differencing. For large, complex changes, where you need to map multiple values into a single value or massively restructure the data, you will probably need a data mapping tool. However, this code should help automate many small, simple changes and/or give you starter code to build on when implementing more complex changes.

Example

Suppose we start with the following simple schema to represent an address (see employees.xsd in the Tests directory):

<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
    xmlns:emp="http://www.zephyrassociates.com/Zephyr/Employees"
    targetNamespace="http://www.zephyrassociates.com/Zephyr/Employees">

  <xsd:complexType name="Locale">
    <xsd:attribute name="region" type="xsd:string" default="US"/>
  </xsd:complexType>

  <xsd:complexType name="BaseAddress">
    <xsd:complexContent>
      <xsd:extension base="emp:Locale">
        <xsd:sequence>
          <xsd:element name="street" type="xsd:string"/>
          <xsd:element name="city" type="xsd:string"/>
          <xsd:element name="state" type="xsd:string"/>
          <xsd:element name="zip" type="xsd:string"/>
        </xsd:sequence>
      </xsd:extension>
    </xsd:complexContent>
  </xsd:complexType>

  <xsd:complexType name="Address">
    <xsd:complexContent>
      <xsd:extension base="emp:BaseAddress">
        <xsd:sequence>
          <xsd:element name="zip_plus_four" type="xsd:string"/>
        </xsd:sequence>
      </xsd:extension>
    </xsd:complexContent>
  </xsd:complexType>

</xsd:schema>

Now we want to rename the attribute in the base type “Locale” from “region” to “country”. The new schema (employeesAttributeRename.xsd) looks like this:

<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
    xmlns:emp="http://www.zephyrassociates.com/Zephyr/Employees"
    targetNamespace="http://www.zephyrassociates.com/Zephyr/Employees">

  <xsd:complexType name="Locale">
    <xsd:attribute name="country" type="xsd:string" default="US"/>
  </xsd:complexType>

  …

XMLSchemaEvolver detects the renamed attribute and emits XSLT that renames it in the base type and in any types derived from that base type by extension (Tests_Out_Expected\employeesAttributeRename_fwd.xslt):

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:emp="http://www.zephyrassociates.com/Zephyr/Employees">

  <xsl:import-schema namespace="http://www.zephyrassociates.com/Zephyr/Employees" schema-location="employees.xsd"/>

  <xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>

  <!--
  ________________________
  Final changelist after identifying renamed elements, moved elements, and removing rearrangements that are not part of a sequence.
  Note. Before we emit the XSLT we remove any inserted root elements or types that did not exist in the old schema since we have no place to put such elements.
  ________________________
  Types. Renamed:
  emp:Locale/@region ==> emp:Locale/@country
  ________________________
  -->

  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="@xsi:schemaLocation">
    <xsl:attribute name="xsi:schemaLocation">http://www.zephyrassociates.com/Zephyr/Employees employeesAttributeRename.xsd</xsl:attribute>
  </xsl:template>

  <xsl:template match="element(*, emp:Locale)/@region">
    <xsl:attribute name="country" select="."/>
  </xsl:template>

</xsl:stylesheet>

More Details on the Algorithm

1) Use Xerces to process the schema files. Xerces will return a collection of elements and types. Store these as a vector of Nodes where we define a node as:

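#include <string>

struct Node
{
    std::string ns;             // namespace; renamed here because "namespace" is a reserved word in C++
    std::string parent;         // parent in the (parent, member, type) triple
    std::string member;         // member name
    std::string type;           // type of the member
    compositor_type connector;  // connector (compositor) type, e.g. xsd:all
};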

2) Do a simple diff of the Nodes representing elements and types using the longest common subsequence algorithm. This is the traditional algorithm used by basic text diff utilities. (See http://en.wikipedia.org/wiki/Longest_common_subsequence_problem).

3) From the diff compute inserted_types, deleted_types, inserted_elements and deleted_elements as vectors of type Node.

4) Compute the set_intersection of the inserted and deleted types, ordered by (parent, member, type). Nodes in this intersection were reordered rather than added or removed. Ignore a reordering when the parent connector type is xsd:all, since element order is not significant there. (Within an XML schema document the (parent, member, type) triple should be unique, so this matchup should yield only reordered nodes.) Do the same matchup for inserted and deleted elements.

5) Identify renamed nodes by matching deleted nodes with inserted nodes of the same type at the same position in the schema. If we find a match, we assume this is a renamed element.

6) Identify moved nodes by matching deleted nodes with inserted nodes having the same member name and type. (We assume these are the same node, just at a different location.)

7) We now have a collection of inserted, deleted, renamed and moved nodes. For each node, emit XSLT to carry out the corresponding operation on the XML data. A list of XSLT data migration transforms can be found in XSLT Transformations.txt in the doc folder.

8) The XSLT that we generate should be standard and run on any compliant XSLT 2.0 processor. Because the generated stylesheets use xsl:import-schema, the processor must be schema-aware. We tested the generated XSLT with the Altova XSLT processor, available at: http://www.altova.com/altovaxml.html
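
Steps 2) through 6) can be sketched in C++ roughly as follows. This is a simplified illustration, not the tool's code. The Node here is a cut-down stand-in for the struct in step 1 (no connector, and "ns" instead of "namespace"). The names lcs_diff, reordered, looks_renamed and looks_moved are hypothetical. "Same position in the schema" is approximated as "same parent", which is an assumption.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <string>
#include <tuple>
#include <vector>

// Cut-down stand-in for the Node struct in step 1 (assumption: connector omitted).
struct Node {
    std::string ns, parent, member, type;
};

bool same_node(const Node& a, const Node& b) {
    return std::tie(a.ns, a.parent, a.member, a.type) ==
           std::tie(b.ns, b.parent, b.member, b.type);
}

// Strict weak ordering on the (parent, member, type) triple, as in step 4.
bool by_triple(const Node& a, const Node& b) {
    return std::tie(a.parent, a.member, a.type) <
           std::tie(b.parent, b.member, b.type);
}

struct Diff {
    std::vector<Node> deleted;   // nodes only in the old schema
    std::vector<Node> inserted;  // nodes only in the new schema
};

// Steps 2-3: textbook LCS table, then a forward walk that collects every
// node not on the common subsequence.
Diff lcs_diff(const std::vector<Node>& a, const std::vector<Node>& b) {
    const std::size_t n = a.size(), m = b.size();
    // len[i][j] = length of the LCS of a[i..] and b[j..]
    std::vector<std::vector<std::size_t>> len(
        n + 1, std::vector<std::size_t>(m + 1, 0));
    for (std::size_t i = n; i-- > 0;)
        for (std::size_t j = m; j-- > 0;)
            len[i][j] = same_node(a[i], b[j])
                            ? len[i + 1][j + 1] + 1
                            : std::max(len[i + 1][j], len[i][j + 1]);
    Diff d;
    std::size_t i = 0, j = 0;
    while (i < n && j < m) {
        if (same_node(a[i], b[j])) { ++i; ++j; }
        else if (len[i + 1][j] >= len[i][j + 1]) d.deleted.push_back(a[i++]);
        else d.inserted.push_back(b[j++]);
    }
    while (i < n) d.deleted.push_back(a[i++]);
    while (j < m) d.inserted.push_back(b[j++]);
    return d;
}

// Step 4: a triple present in both lists was reordered, not added or removed.
std::vector<Node> reordered(std::vector<Node> deleted,
                            std::vector<Node> inserted) {
    std::sort(deleted.begin(), deleted.end(), by_triple);
    std::sort(inserted.begin(), inserted.end(), by_triple);
    std::vector<Node> out;
    std::set_intersection(deleted.begin(), deleted.end(),
                          inserted.begin(), inserted.end(),
                          std::back_inserter(out), by_triple);
    return out;
}

// Step 5 (simplified): same parent and type, different member name.
bool looks_renamed(const Node& del, const Node& ins) {
    return del.parent == ins.parent && del.type == ins.type &&
           del.member != ins.member;
}

// Step 6: same member name and type, different parent.
bool looks_moved(const Node& del, const Node& ins) {
    return del.member == ins.member && del.type == ins.type &&
           del.parent != ins.parent;
}
```

A caller would run lcs_diff on the old and new node lists, set aside the nodes returned by reordered, and then test the remaining deleted/inserted pairs with looks_renamed and looks_moved. The sketch leaves out the xsd:all check from step 4, which needs the connector field.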

References

There are many academic papers on the topic of XML schema migration, but very few publicly available tools to help with this task. Here is a short list of references I found useful:

An Online Bibliography on Schema Evolution. An online collection of papers on schema evolution, many related to databases, but some to XML. http://se-pubs.dbs.uni-leipzig.de/

“Managing XML Data with Evolving Schema,” B V N Prashant and P Sreenivasa Kumar. This outlines an approach similar to the one we followed. They examine changes in a DTD schema, classify these changes into types, and generate XSLT data transformations for each type of change. A copy of this paper is in the Doc folder as “ManagingXMLData.pdf”.

“Conceptual XML Schema Evolution – the CODEX approach for Design and Redesign,” Meike Klettke. This shows a graphical schema modeling tool. Every time a change is made through the UI, the change operation is recorded. These changes are then “compressed” to get a minimal set of changes and a change script to translate XML data is created. The problem here is, you may not want to use their UI for all schema changes, or changes may come from outside the UI. That’s why I prefer a “diff” approach to detect changes. This paper can be found in the Doc folder as “Klettke.pdf”.

XSLT Data Transformation Patterns by Jesper Tverskov. http://www.xmlplease.com/xsltidentity Also see “XSLT transformations.txt” in the Doc directory for a list of the patterns emitted by the code.

DiffDog and MapForce by Altova. http://blog.altova.com/2009_12_01_archive.html Altova makes some nice tools but, as of May 2010, I found their diff and map tool to be inadequate. The problem with their diff is that it generates a huge amount of spam code just to map all the identical elements in the schema. This makes the script very hard to understand and maintain, and makes it hard to see what has changed. The identity mapping used above is much cleaner and more maintainable.