Code Generation using XML / XSLT


Automated code generation by wizards has long been fashionable in the industry, despite criticism from hard-core programmers. It has never been more so, given the amount of R&D and buzz going into MDA, DSL / DSM and the like. What we discuss here is the problem faced after modeling is complete. Once a model is developed, a translation procedure generates equivalent code in a specific language. In this article we will explore the possibility of generating structured language code using XSLT (eXtensible Stylesheet Language Transformations), especially when the model can easily be represented in XML (or in a near-equivalent form).

Template-based code generation has been widely adopted by automatic code generation tools. The model (the input to these tools) specifies the parameters that govern the variations in the output. If the model is itself highly structured and semantics driven, it is quite easy to represent it in XML. Combined with XPath expressions, this offers a powerful yet simple means of code generation. The most alluring aspect of this method is that it scales all the way to generating most programming constructs, however doubtful that may seem at first glance.

XSL Transformation has been widely applied to translate XML documents into other text formats: XML of a different schema, HTML and even PDF documents. So why not generate high level language code? Realizing the potential savings in development effort, I stumbled upon this practicable solution in an attempt to generate template-based C code.

In this article I hope to elucidate how XML can be used along with XSLT and XPath to generate text-based code. In the process I shall highlight the pros and cons of this method and introduce means of extending its potential. Although it is not a panacea or a replacement for conventional code generation, it can give leverage in certain situations. In the following sections I shall give hints, pointers and illustrations as to how XSL may be used to generate C code. The same idea can be extended to generating source in other languages. But before we get to those details, let's take a quick detour on representing source code in XML.

Can XML represent a high level language program?

Most high level languages in existence are structured or can be reduced to such a form. This is evident from the fact that compilers for these languages are able to represent programs as an AST (abstract syntax tree). XML, although a text format, is a widely adopted standard for representing structured information. In fact, the DOM represents an XML document as a tree. Thus persisting an AST in XML sounds propitious, and technically this makes representing a structured program in XML quite possible :). Henceforth we assume that the model is represented in XML.

The model XML must at the least be able to describe the language constructs that vary in the output, and it may also contain details not pertinent to code generation. The point to emphasize is that fabricating an XML Schema to represent every possible output in the language is rarely required. It is sufficient that the XML can represent the parameters that govern the variations in the output (directly or in derived form). Once these details are abstracted, writing the XSL to perform the conversion is a relatively simple task.

In fact, with a little forethought the XML can be designed to be extensible, so that it can later express constructs that are not currently supported. The XML Schema can then be enhanced and the XSL extended to support these new constructs when warranted. With the XSL written to skip unknown XML tags, the design would be backward compatible as well.

The following sections assume that the reader knows the basics of XML, XSL and XPath, and how to perform the transformation either from hand-written code or with a tool. It is advisable to get a basic idea of these in order to relate to what is explained.

Our Hypothetical System

For the purpose of illustration we will consider the following problem. Our system has various configuration parameters, each with a fixed structure / format of its own (i.e. each parameter can be represented by a data type definition in a high level language). A GUI takes input for these parameters and outputs an XML file containing the values. We are required to design an XML format and an XSL to generate the required C code with the values of the parameters. (Note: the actual system can be much more complex, comprising many components in different numbers, each governed by a set of parameters and so on. Further, instances of the same component type can share some parameters as well. Such details are not considered here, as they are extensions of the basic idea.) Restricting ourselves to the above statement, we will only discuss generating value definitions for the configuration parameters.

Enough ado! Let's get to business.

Programming language constructs in XML

The first step in the process is to define the XML Schema / format representative of the model. Since we target generating C code, we shall formulate XML fragments that represent various C Language constructs. We shall begin with the simplest of constructs. The following snippet illustrates various XML fragments that represent chosen basic types.

<Element Name="ENABLE" Type="byte" Value="1" Qualifier="#define"/>

<Element Name="today" Type="enum" EnumTag="DayOfWeek">Monday</Element>

<Element Name="wordMax" Type="word" Value="65535" Qualifier="const"/>

<Sequence Name="currentStock" Type="struct" StructTag="Stock">
        <Element Name="name" Type="string" Value="hiking boots"/>
        <Element Name="minStock"  Type="int" Value="10"/>
        <Element Name="maxStock" Type="int" Value="25"/>
</Sequence>

<Array Name="fibonacci" Type="int" Size="5" Qualifier="static">
        <Entry Value="1"/>
        <Entry Value="1"/>
        <Entry Value="2"/>
        <Entry Value="3"/>
        <Entry Value="5"/>
</Array>

The above XML fragments each demonstrate a plausible representation of a few language constructs in XML. The following explains each of the above in sequence. This is followed by the C code that we expect to generate for this input.

  1. ENABLE is a macro with the value 1. The Qualifier attribute of the Element tag distinguishes this case from a variable definition.
  2. today is a variable of enumerated type DayOfWeek assigned with the predefined value Monday.
  3. wordMax is a constant variable of type unsigned short with the value 65535.
  4. currentStock is an instance of structure Stock with 3 fields - name, minStock and maxStock.
  5. fibonacci is an array of five integers containing values of the Fibonacci sequence.

#define ENABLE 1

enum DayOfWeek today = Monday;

const unsigned short wordMax = 65535;

struct Stock currentStock =
{ "hiking boots", 10, 25 };

static int fibonacci[5] =
{ 1, 1, 2, 3, 5 };
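
For these definitions to compile, the surrounding source must already declare enum DayOfWeek and struct Stock. The article's model does not describe those declarations, so the sketch below assumes them: the enumerator list and the use of const char * for the string field are assumptions made only for illustration. The generated definitions themselves are repeated unchanged.

```c
/* Hypothetical supporting declarations (assumed, not produced by the XSL). */
enum DayOfWeek {
    Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday
};

struct Stock {
    const char *name;   /* Type="string" is assumed to map to const char * */
    int minStock;
    int maxStock;
};

/* Definitions as expected from the generator for the XML fragments above. */
#define ENABLE 1

enum DayOfWeek today = Monday;

const unsigned short wordMax = 65535;

struct Stock currentStock =
{ "hiking boots", 10, 25 };

static int fibonacci[5] =
{ 1, 1, 2, 3, 5 };
```

With these declarations in place, the generated text is ordinary C: currentStock.minStock reads as 10 and fibonacci[4] as 5.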

That doesn't seem like much, does it? But let's take a look at the amount of groundwork required to write the XSL that performs the transformation.

XSL that does the magic!

The following XSL snippet only depicts fragments that are essential for handling the above cases and not all possible variants of the above constructs.

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="text" encoding="utf-8" indent="no"/>

    <xsl:template match="Element">
        <xsl:choose><!-- IF-THEN-ELSE -->
            <xsl:when test="'#define'=@Qualifier">
                <!-- The element is output as a macro definition -->
                <xsl:text>#define </xsl:text>
                <xsl:value-of select="@Name"/><xsl:text>&#32;</xsl:text><xsl:value-of select="@Value"/>
                <xsl:text>&#13;&#10;</xsl:text>
            </xsl:when>
            <xsl:otherwise>
                <!-- The element is output as a variable definition -->
                <xsl:call-template name="outputCQualifier"/><!-- Function Call! -->
                <xsl:call-template name="outputType"/><xsl:text>&#32;</xsl:text>
                <xsl:value-of select="@Name"/><xsl:text> = </xsl:text>
                <!-- outputValue falls back to the element text when @Value is absent (e.g. today) -->
                <xsl:call-template name="outputValue"/>
                <xsl:text>;&#13;&#10;</xsl:text>
            </xsl:otherwise>
        </xsl:choose>
    </xsl:template>

    <xsl:template match="Sequence">
        <xsl:call-template name="outputCQualifier"/>
        <xsl:text>struct </xsl:text><xsl:value-of select="@StructTag"/><xsl:text>&#32;</xsl:text>
        <xsl:value-of select="@Name"/><xsl:text> =&#13;&#10;</xsl:text>
        <xsl:call-template name="outputValue"/>
        <xsl:text>;&#13;&#10;</xsl:text>
    </xsl:template>

    <xsl:template match="Array">
        <xsl:call-template name="outputCQualifier"/>
        <xsl:call-template name="outputType"/><xsl:text>&#32;</xsl:text>
        <xsl:value-of select="@Name"/>
        <xsl:text>[</xsl:text><xsl:value-of select="@Size"/><xsl:text>] =&#13;&#10;</xsl:text>
        <xsl:call-template name="outputValue"/>
        <xsl:text>;&#13;&#10;</xsl:text>
    </xsl:template>

    <xsl:template name="outputCQualifier">
        <xsl:if test="''!=@Qualifier"><!-- IF-THEN only -->
            <xsl:value-of select="@Qualifier"/><xsl:text>&#32;</xsl:text>
        </xsl:if>
    </xsl:template>

    <xsl:template name="outputType">
        <xsl:choose><!-- SWITCH-CASE using cascaded IF's -->
            <xsl:when test="@Type='byte'">unsigned char</xsl:when>
            <xsl:when test="@Type='word'">unsigned short</xsl:when>
            <xsl:when test="@Type='int'">int</xsl:when>
            <xsl:when test="@Type='enum'">
                <xsl:text>enum </xsl:text><xsl:value-of select="@EnumTag"/>
            </xsl:when>
            <xsl:otherwise><!-- DEFAULT: report the unknown type and carry on -->
                <xsl:message terminate="no">
                    <xsl:text>#missing type at node: </xsl:text>
                    <!-- Invoke XSL Code to formulate XPath of current node based on the Schema -->
                </xsl:message>
            </xsl:otherwise>
        </xsl:choose>
    </xsl:template>

    <xsl:template name="outputValue">
        <xsl:variable name="TagName" select="name()"/>
        <xsl:choose>
            <xsl:when test="$TagName='Element' or $TagName='Entry'">
                <xsl:choose>
                    <xsl:when test="boolean(@Value)"><!-- IF-THEN -->
                        <xsl:if test="@Type='string'"><xsl:text>"</xsl:text></xsl:if>
                        <xsl:value-of select="@Value"/>
                        <xsl:if test="@Type='string'"><xsl:text>"</xsl:text></xsl:if>
                    </xsl:when>
                    <xsl:otherwise><xsl:value-of select="."/></xsl:otherwise><!-- ELSE -->
                </xsl:choose>
            </xsl:when>
            <xsl:when test="$TagName='Sequence'">
                <xsl:text>{ </xsl:text>
                <xsl:for-each select="Element|Sequence|Array">
                    <xsl:call-template name="outputValue"/>
                    <xsl:if test="position() != last()"><xsl:text>, </xsl:text></xsl:if>
                </xsl:for-each>
                <xsl:text> }</xsl:text>
            </xsl:when>
            <xsl:when test="$TagName='Array'">
                <xsl:text>{ </xsl:text>
                <xsl:for-each select="Entry">
                    <xsl:call-template name="outputValue"/>
                    <xsl:if test="position() != last()"><xsl:text>, </xsl:text></xsl:if>
                </xsl:for-each>
                <xsl:text> }</xsl:text>
            </xsl:when>
        </xsl:choose>
    </xsl:template>
</xsl:stylesheet>


A walk-through of the above XSL should give you a fair idea of what can be done with it. The XSL embeds the template as well as the information about what is generated and when. In fact, if you have worked with ASTs, you will find that the above process parallels AST processing in execution. Although programming in XSL takes quite some getting used to, the learning curve can be shortened considerably by using tools that support debugging the XSL transformation.

By now you are probably convinced that XSL packs quite a lot of programming capability (functions, if-then-else, switch-case blocks, function polymorphism using the mode attribute, etc.). Being a functional language, it lacks mutable variables, inherent support for general looping, and global variables that can maintain state. The first two of these are usually circumvented with recursive XSL functions (<xsl:template name="<function name>">). The last, and most of the other deficiencies, can potentially be solved using Extension Objects. With support for invoking .NET / Java code from XSL becoming prevalent among XSLT processors, the potential is limited only by your imagination. XSL also facilitates reporting warnings and errors through <xsl:message> elements.

Although we have not demonstrated every capability of XSL in processing the model XML, the above exercise gives a head start and a good insight into what can be accomplished. The XSL sample above does seem quite big for what it handles. But remember that it encapsulates the code that performs the translation, the template, the output formatting and so on. With so much having gone into the XSL, there is hardly any non-trivial coding left to do!

The uncharted territory

Although the above illustration proves the point, there are capabilities not yet discussed that give XSLT greater leverage. One is XPath, the notation used to address portions of an XML document, which is widely used as a query language as well. It is very useful for extracting sections of the XML document based on certain parameters or on absolute / relative paths.

This proves extremely useful when dealing with constructs such as linked lists, pointers, profiles (a class of variables in our hypothetical system) and dependencies. By profiles we mean variables that allow values of a specific type to be reused (e.g. MIN_TEMP, OPTIMAL_TEMP and MAX_TEMP for a quantity that represents temperature). As an example of a dependency, consider a case in C where struct A holds a pointer to struct B and we need to initialize an instance of the former. Although only the variable of type struct A is of interest to us, we end up generating one for struct B as well.
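
As a rough sketch of what such output might look like in C, the fragment below combines a profile with a dependency. The field names, the temperature values and the names a and a_limits are all hypothetical; only MIN_TEMP, OPTIMAL_TEMP, MAX_TEMP, struct A and struct B come from the text.

```c
/* Profile: named values reused by every temperature-like parameter.
 * The numeric values are assumptions chosen only for illustration. */
#define MIN_TEMP      (-40)
#define OPTIMAL_TEMP  21
#define MAX_TEMP      85

/* Hypothetical layouts: struct A holds a pointer to struct B. */
struct B {
    int lowLimit;
    int highLimit;
};

struct A {
    int setPoint;
    const struct B *limits;
};

/* Dependency: only 'a' was requested, but the generator must first emit
 * the struct B instance that 'a' points to. */
static const struct B a_limits = { MIN_TEMP, MAX_TEMP };

static const struct A a = { OPTIMAL_TEMP, &a_limits };
```

The ordering matters: the struct B instance has to appear before the struct A instance that takes its address, so the generator has to resolve the reference before it emits the variable of interest.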

The following illustrates potential use of XPath in describing a type in XML.

<Sequence Type="struct" StructTag="stockChain">
        <Element Name="name" Type="string"/>
        <Element Name="minStock"  Type="int"/>
        <Element Name="maxStock" Type="int"/>
        <Ptr Name="pNextStockItem" Type="struct"><Ref StructTag="stockChain"/></Ptr>
</Sequence>
<Sequence Type="struct" StructTag="warehouseStock">
        <Element Name="location" Type="string"/>
        <Element Name="turnAroundTime"  Type="int"/>
        <Ptr Name="pCurrentStock" Type="struct"><Ref StructTag="stockChain"/></Ptr>
</Sequence>

Just as this notation is used to represent types, predefined values (and profiles) can also make use of the XPath notation. Unfortunately, XSL inherently cannot evaluate an XPath stored in a string variable and mandates that it be written inline in the XSL file. Provided the XML Schema is designed such that all queries follow preset XPath patterns, it is possible to perform the XPath query using a few parameters. For example, if it is guaranteed in the above case that type definitions are unique by name, then we can use the following XPath when processing Ptr elements of type struct: //Sequence[@StructTag=$StructName]. Thus XPath can prove handy in representing such cross references.
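
To make the cross reference concrete, here is a sketch of the C type definitions that the two Sequence descriptions above could translate to once each Ref has been resolved to its struct. The mapping of Type="string" to const char *, the instances boots, socks and depot, and the helper chain_length are assumptions added for illustration.

```c
#include <stddef.h>

/* Type definitions corresponding to the two Sequence descriptions. */
struct stockChain {
    const char *name;
    int minStock;
    int maxStock;
    struct stockChain *pNextStockItem;  /* Ptr resolved via Ref StructTag="stockChain" */
};

struct warehouseStock {
    const char *location;
    int turnAroundTime;
    struct stockChain *pCurrentStock;   /* Ptr resolved via Ref StructTag="stockChain" */
};

/* Hypothetical instances forming a two-item chain. */
static struct stockChain boots = { "hiking boots", 10, 25, NULL };
static struct stockChain socks = { "wool socks", 20, 60, &boots };
static struct warehouseStock depot = { "north depot", 3, &socks };

/* Hypothetical helper: counts the items reachable from a chain head. */
static int chain_length(const struct stockChain *item)
{
    int count = 0;
    while (item != NULL) {
        ++count;
        item = item->pNextStockItem;
    }
    return count;
}
```

Here chain_length(depot.pCurrentStock) walks socks and then boots, which shows that a self-referencing Ptr is enough to describe a linked list.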

Why & when to use XSL over other means for code generation?

The widely preferred approach is doing it the hard way: writing code generators by hand in the programmer's language of choice. Although programming this way can be a lot of fun, it is very susceptible to errors, not to mention that the development time is considerable. If the generation merely involves outputting a table of values and the like into a code template, then the conventional methods certainly have an edge. But when the output involves varying usage of high level language constructs, the use of some form of AST is inevitable. This not only entails writing code to traverse and process the ASTs, but also requires changing that code to handle any future enhancements to the AST.

Assuming that the model can be expressed in a neatly designed XML with reasonably little effort, there are respects in which XSL may prove to be a better choice for code generation. XSL provides an intuitive approach to processing the tree structure characteristic of ASTs. This avoids the effort of developing a framework to traverse and process the tree. Further, the ability to express structured programming constructs easily in XML enables incremental development. The most notable advantages are the short development time and the fact that testing is limited to verifying the functionality (rather than scenarios where the code would misbehave owing to flawed programming). Since the model is in XML, its structure can be verified by XML Schema validation against an XML Schema Definition (XSD) file, and the rest of the validation can be performed in the XSL itself. The output, being trivial in syntax, is also very easy to generate in XSL. Assuming a well-formed model XML, the output generation process using XSL is very robust and reliable.

There are some negatives to using XSL as well. The model XML can get excruciatingly complex when the output involves niche variations and idioms of the target language. For example, consider a scenario where struct A holds a pointer to struct B as a field. Instances of struct A can be populated in two ways: by creating an instance of struct B and assigning its address to the field in struct A, or by keeping a table of all instances of struct B and assigning the address of one of its entries to the field in struct A. Given that such variations are very rare in practical code generation problems, the XSL mechanism should prove more fitting than not.
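
The two population strategies can be sketched in C as follows. The struct layouts and every variable name here are hypothetical; the point is only that the same model could legitimately produce either shape of output, and the model XML would have to say which one is wanted.

```c
/* Hypothetical layouts: struct A holds a pointer to struct B as a field. */
struct B {
    int id;
};

struct A {
    const char *label;
    const struct B *pB;
};

/* Strategy 1: a dedicated struct B instance is generated for the
 * struct A instance that needs it. */
static const struct B first_b = { 7 };
static const struct A first = { "dedicated", &first_b };

/* Strategy 2: all struct B instances live in one table, and each
 * struct A instance points at one of the entries. */
static const struct B b_table[] = { { 1 }, { 2 }, { 3 } };
static const struct A second = { "shared", &b_table[1] };
static const struct A third = { "shared too", &b_table[1] };
```

With the table, second and third share the same struct B entry, which the first strategy cannot express without duplicating data.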

Another problem I encountered was the lack of flexibility in formatting the output code, which led me to choose an XSL-friendly output format. The XSLT processor in .NET did not support extension objects monitoring the output (in the TextWriter) during the transformation, i.e. the TextWriter did not reflect the output until the transformation was complete. Since XSL does not support mutable variables, it is too tedious to keep track of the current column and the like. Although I wouldn't deem it impossible, the effort required didn't seem worth the results. This is not a deficiency of XSL as such, since it was designed for translation into XML / HTML formats where such formatting requirements are insignificant.

Performance is usually not a constraint for code generators, since they run as batch processes. Nevertheless, extensive use of Extension Objects can affect performance. XSL does suffer from programming constraints too (for instance, the lack of mutable variables). But this need not be a great deterrent, considering that such details can often be pre-computed and embedded in the model XML itself. Output formatting too can be achieved this way, with pre-computed formatting information.

The main indicators that XSL is not the right choice are an overly complex model XML, heavy reliance on Extension Objects, and programming requirements that are hard to implement in XSL.

Conclusion & Future Direction

Using XML / XSLT is a lesser-known method of source code generation. When the system's model can easily be expressed in hierarchical, semantics-rich XML and the output is based on code templates, this method is very advantageous. With tools such as Altova XMLSpy, StylusStudio and Visual Studio 2005's integrated XSLT Debugger (and more), the development of XSL can be accelerated many times over.

An interesting extension to this would be the generation of documents using XSLT. Many systems require documents that are derived from the model. Updating such a document manually is usually a time-consuming and error-prone process. With the introduction of WordML (a.k.a. WordprocessingML), it is quite possible to use XSLT to generate documents whose content is derived from the model XML.

The improvement from using XSL is most marked when hand-written code generators are considered as the alternative. To what extent this method competes with others, such as the m4 macro processor or StringTemplate, is worth investigating. Nevertheless, the versatility, the low development / testing effort and the intuitiveness of code generation using XML / XSLT can far outweigh its weaknesses in many template-based source code generation scenarios.