How to make the machine write the boring code

Denes Botond
Software engineer
@Accenture Romania

PROGRAMMING

Have you ever had to write code for a lot of highly similar things, differing only in a few details? Have you ever wondered - while copy pasting code from one class to another, and another fifty more - why are you doing this repetitive, error prone and boring task instead of a machine? Machines are just fine doing this kind of work, why aren"t they doing it? In this article we will talk about how one can write a program which generates code based on an xml input file and flexible templates.

Code generation sounds like a topic for NASA engineers, not something for humble programmers in Cluj working in outsourcing. Imagine our surprise, when on a hot summer afternoon our client announced as our next task a code generator. A code generator which can generate code based on an xml input file and on flexible templates for multiple platforms and programming languages.

To understand the problem let"s look at an example. Imagine you have to develop a database application for a warehouse, which stores various kinds of electrical devices. This is a classical "skin-for-database" type of application, with a lot of forms, and of course a lot of DAL (Database Access Layer) classes. If this warehouse is small and they have only four different kinds of products than writing these DAL classes is no big deal. But if it is big, storing all kind of devices, ranging from smartphones to monitors and from routers to DVD players, and you have to create a table for each category, than this task may soon get very time consuming. In the case of these many similar classes that differ only in small details, writing them is not the only problem. A small modification or a new feature means the modification of all the DAL classes, which is very copy-paste error prone. In a scenario like this, having a code generator that generates all this code ensures that modifications and new features yield changes only to several template files and that refactoring the code is easy and fast. A well written code generator can be used in many projects, being a very useful tool in the future.

The architecture of the code generator

A code generator is composed of two major parts: the part which reads and parses the input and a part which parses the templates and generates the output. The input parser component passes the parsed data to the output writer in a well-defined format, thus any of the two parts may be changed as long as they adhere to the defined interfaces.

Reading the input

The input has to contain all the necessary information for the completion of the templates, in a structured, machine parsable format. In this article, an xml format will be discussed. The input has to contain all data that is intended to be used in the templates. Since in this case the templates will generate C++ classes with SQL snippets in them, the input has to contain all necessary information about the categories that will be mapped to classes (and tables) and all their fields. It is advisable to preserve the structure of the data as it is in the xml file, because it eases the writing of the templates.

An xml input file for this particular case could look like this:






...




...

...

This xml file in fact contains an array of categories, each category containing an array of its fields. Each field has a name, a type, and platform-specific types. Storing the platform specific type in each field tag may not be expedient even with just two platforms. It may be much better to store just a platform-independent general typename in each field tag like "string", or "number", and to store the type mapping in a different part of the xml, but for the sake of simplicity in this example it will be stored in the field tag. There is not much to say about parsing the xml file, there are several libraries and parsers readily available; there is no need to write one.

The intermediate format

The parsed input data has to be stored in an intermediate format, before and during being made available to the template engine. The details of this format are specific to one"s problem that needs to be solved, and it depends on how general the code generator is, and many others. It should however comply with some criteria:

it should make it easy to look up categories,
it should make it easy to look up fields of categories.

Generating the output

This is by far the trickiest part of the code generator. It is the most important and most complex part. If one wants to write a code generator which is flexible and reusable, than hardcoding parts of the result code or any knowledge of the target language or platform is a dead-end. The best solutions are templates.

Templates

Templates are in fact partly written - in this case - C++ classes with some sort of instruction on how to fill in the missing parts. The missing parts are the bits of code which depend on the input - the attributes of the categories and its fields. The most obvious example is the name of the class. This depends on the name of the category. In the template this will appear as a placeholder which will be replaced by the actual name of the category during the template processing. This will be called as "variable substitution" throughout the article. But in order to generate all the output, this variable substitution won"t be enough. For example, to generate the code to declare all the members corresponding to the category fields, some kind of loop is needed to iterate through the fields, and generate the appropriate member declaration. Also, if one has to make decisions based on some values in the input - which is very probable - then some sort of boolean expression evaluation and branching is necessary. This is starting to look like a programming language, with variables, if-branches and loops. Writing a robust parser which is easily modifiable in case a new construct is added is no easy task. But before we dive into the details of a possible template parser implementation, let us see an example for possible template syntax.

class <$category.name$>
{
public:
<$category.name$>();
~<$category.name$>();
<@foreach <$field$> in <$category.fields$> @>
<$field.cpp-type$> get<$field.name$>();
void set<$field.name$>( const <$field.cpp-type$> &value);
<@endforeach@>
private:
<@foreach <$field$> in <$category.fields$> @>
<$field.cpp-type$> <$field.name$>;
<@endforeach@>
}

This is a possible template for the class that represents a category. It contains variable substitutions and two loops. Variables are marked with the "<$" and "$>" tokens and the instructions with "<@" and "@>". These delimiter tokens make the template more human readable, although this might seem absurd at first. When this template gets parsed and completed, all variables get substituted by their value and everything that is in a loop gets evaluated in every iteration.

Parsing the Templates

Although at first it looks like a plausible option to write a parser by hand, it is not a good idea for several reasons. First, although this example is not very complicated, a real life template might have many more features and thus much complicated syntax with many more keywords and tokens. Writing a parser from scratch for a mini programming-language is a time and energy consuming task with a lot of pitfalls that are not obvious. Second, there is no need to reinvent the wheel. There are well tested, proven parser generators that generate a standalone parser from a set of rules. The most well-known parser generator is the lex - yacc pair. Lex is a lexical analyzer generator, and yacc is a parser generator. Both lex and yacc have open source variants; for our project we used the GNU variant of these tools, flex and bison.

Processing the templates and generating the output is done in three steps as it can be seen on the following flow chart:

I"m not going to go into details about every step, with examples, I will only talk briefly about each step. I"m going to try to give a good picture about what is done in each one of them.

Breaking the input into tokens

In the first step the input is broken into tokens. This is what the C preprocessor does (among others). What this means is that, in fact, every group of characters that has some importance gets tagged and unimportant characters are ignored. For example the character group "<$" gets the tag "VAR_B" and together - the tag and its value - they form a token. A line of C++ code with no template keywords in it gets tagged as "text". Unimportant character groups are for example comments or whitespace in some cases. Tokenizing the input is what the lexical analyzer - or in other words tokenizer - generated by flex does. Flex generates the tokenizer based on an input file. The input file contains the rules that describe what kind of tokens this tokenizer recognizes, and what they look like. The rules are composed of a regular expression and a piece of code that gets executed when the regular expression matches a piece of input. The tokenizer reads the input one character at a time and tries to match it against all the defined rules. As soon as a match is found the longest possible match is selected and the appropriate piece of code is executed. The input to the tokenizer is the template itself and the output is the tokens. Flex generates a function yylex() which for each call returns the next token in the template.

Validating the grammar and building the abstract syntax tree

In this step the sequence of input tokens are validated against the grammatical rules that describe the template language. This is done by the parser generated by Bison. Bison generates the parser from an input file which contains the grammatical rules that describe the language. The grammatical rules have to describe a LALR(1) context-free grammar. This grammar, although somewhat constrained in that it cannot handle ambiguities, is suitable for most languages, even for fairly complex ones as Java. LALR(1) languages are composed of terminal and nonterminal symbols. Terminal symbols are the ones that can be matched one-to-one to the input tokens. For example in this template language the token <$ is a terminal, an atomic building block of the language. Nonterminal symbols are composed of multiple terminal and nonterminal symbols. The definition of a nonterminal symbol is called a rule of the grammar. A rule is a list of all the possible combinations of terminal and nonterminal symbols that form this nonterminal. For example, if we like to define a nonterminal for the name of a variable, it would look something like this:

variable_name: word OR word.word

The variable_name nonterminal symbol is a composite building block of the language. It can be part of the definition list of other nonterminals to make up even more complex nonterminals, like a foreach loop. There is a special nonterminal that represents the whole language. This nonterminal is the topmost level nonterminal of the language. All the valid combinations of the other building block of the language must be part of its definition. If an input is valid, that means that gradually, substituting the smaller building blocks with the larger ones, this master nonterminal can be substituted with all the input at the end.

The generated parser is a finite stack machine. The parser puts the received input tokens (terminals) onto a stack. This operation is called shift. Whenever appropriate, it substitutes them with nonterminals. This is called reduce. This is the same procedure as the one described above. If all goes well, then all the input is reduced to the master nonterminal and the language is valid.

During the validation of the grammar, the abstract syntax tree is built also. This is the most crucial part of the parsing procedure. The abstract syntax tree represents the whole template file in a tree form, where the important parts of the template are represented as tree nodes. To have an idea how this tree looks like let"s take a look at an example syntax tree:

In this (overly simple) example tree there is some text, a foreach with some text in it, and a variable. How does this tree get built? Bison makes it possible to specify custom code that will be executed whenever a reduce happens. Thus it is possible to react to every event of the template parsing and create tree nodes when a new construct of the template is assembled. In our example the tree would have a node for text, foreach and variable.

Bison generates the function yyparse(). This function calls yylex - the function generated by flex - internally to obtain the input. On successful parsing, it will return 0, and another value on error. The user engineer can specify a custom input parameter, such as a pointer to the root node of the syntax tree. Thus, after the parsing is done, the built abstract syntax tree representing the whole template can be retrieved.

Executing the instructions in the syntax tree

The user program has to know exactly what to do for each node to generate the output required by the specifications. For example, a variable node gets substituted by the looked-up value of the variable. Each node has to contain enough information for the user program to generate the expected output. In the aforementioned example, the variable node has to contain the name of the variable. To generate the output for the entire template, the abstract syntax tree has to be walked with the depth-first method.

A possible result for the given template and just one category could be:

class Smartphone
{
public:
Smartphone();
~ Smartphone ();
std::string getvendor();
void setvendor(const std::string &value);
std::string getos();
void setos(const std::string &value);
private:
std::string vendor;
std::string os;
}

Conclusions

Writing a code generator is not a trivial task, it requires a large amount of effort, but in many cases this invested effort pays off with interest. Imagine that instead of writing and maintaining several dozens of files you have to write and maintain only several. I think it is not bold to say that a code generator can save you many workdays. I addressed all the main issues that we met while developing the code generator with as many examples as possible. In our case the decision to invest time in learning flex and bison proved to be very fruitful. The fact that they are well tested, proven tools meant that we didn"t have to test and debug a custom parser, and we could focus on the definition of the structure of our language on a much higher level. Flex and bison were flexible enough to suit all our needs and they are well documented too. Working with a parser generator has another advantage: in case of even large structural changes, or new features, only a few files and a minimum amount of code has to be modified. These advantages should sound familiar. Generating code for a code generator is awesome.