Managing growing data with relational databases

Alin Luncan - Software Engineer

I remember, about two years ago, one of our clients saying just that, as it was the main requirement for a new software application. Those words haunted me for a while; I could not understand why he would phrase it like that. An application crash was bad enough, but losing data was out of this world. I was trying to visualize the kind of complexity that made them ask for this particular requirement.

I was failing to see the big picture, but I eventually learned it. Over the past two and a half years I have learned that, for this client, data is everything. Nothing beats data: not the UI, not the way the data is stored, not even usability. That is because the application collects data used in scientific models from different devices (sensors) and software services (image processing). This data is passed to a mathematical model that evaluates it and produces even more data, which is re-processed, combined with other sensor readings and inputs, and re-evaluated in other models whose results are displayed as charts or grids. But it does not end there: the user can filter out bad inputs at any level, which makes the mathematical models re-evaluate. Yes! For this software, data is the entire application; everything else is meaningless if the data is corrupt or unreachable. It took me a while to understand the main requirement in its full form, but I finally got there, with a full overview of the entire data hierarchy.

And we did it. About a year ago, we reached a point in development where all the data the previous application had collected, and crashed on, was saved and stored in our relational database (SQL Server 2012). It was reusable, and you could export and import the data between different computers through a fast and smart service integrated into the application. Together with the client, we designed a recursive structure that could handle all future requests, sensors and mathematical models with only small changes. It was stable and scalable, and we did it while still using an Object-Relational Mapping (ORM) tool for the Data Layer.

For the data collecting process, we needed to design several tables for the configuration of the sensors and of the process itself: the number of sensors used, environmental constants and the units of measurement of the data. The configurations and the actual data values are evaluated through a mathematical model in order to obtain a first set of results. The mathematical model is represented by a simple table that is related to the initial sensor configurations, to its own configurations (which determine the units of measurement and the names of the results it displays) and to each result it produces (the data output); a logical layer in the code applies the actual mathematical formulas. The output of one mathematical model can serve as the input of another, creating a parent-child relationship between the models, and a collected data set can also be used by several models. All these relations between the models and the outputs are driven by the user's actions, as he can filter the data collected for one model or the evaluation results for another. For our database, this meant adding multiple connection tables to record which data are selected from a set, and adding columns in some tables to indicate the parent of a model, if one exists.
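To make the shape of these relations easier to picture, here is a minimal sketch of such a schema in Python. The table and column names (SensorConfiguration, MathematicalModel, ModelOutput, ModelInputData, ParentModelId) are invented for illustration and are not our actual schema; SQLite is used only to keep the example self-contained.

    import sqlite3

    connection = sqlite3.connect(":memory:")
    connection.executescript("""
        -- Sensor/process configuration: sensor count, constants, units.
        CREATE TABLE SensorConfiguration (
            Id INTEGER PRIMARY KEY,
            SensorName TEXT,
            UnitOfMeasurement TEXT);

        -- A mathematical model; ParentModelId expresses the parent-child relation.
        CREATE TABLE MathematicalModel (
            Id INTEGER PRIMARY KEY,
            Name TEXT,
            ParentModelId INTEGER REFERENCES MathematicalModel(Id));

        -- The results (data output) produced by a model.
        CREATE TABLE ModelOutput (
            Id INTEGER PRIMARY KEY,
            ModelId INTEGER REFERENCES MathematicalModel(Id),
            Name TEXT,
            Value REAL);

        -- Connection table: which collected readings are selected for which model.
        CREATE TABLE ModelInputData (
            ModelId INTEGER REFERENCES MathematicalModel(Id),
            SensorConfigurationId INTEGER REFERENCES SensorConfiguration(Id),
            ReadingId INTEGER,
            IsSelected INTEGER DEFAULT 1);
    """)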

The design was working and holding. We fitted every new requirement into our structures while preserving the originally planned relationships between the tables. The relational database design gave us the flexibility to perform complicated queries at every level, filter out bad sensor readings and re-evaluate our data at any point in the hierarchy, be it on the vertical or the horizontal scale. I, for one, was glad we had used a relational database and had not fallen into the trap of applying a NoSQL design to such structured data.

But something broke. At the beginning of this year we started to notice that a particular requirement was producing a big chunk of low-level evaluation data. As it turned out, one type of device had a requirement to collect and store multiple sensor readings at a rate of 20 readings per second. In one hour that means 72,000 readings from each of about 7 sensors, every record processed through the mathematical model and stored. After some math regarding the size of the object that ends up in the database for one sensor reading, we figured it amounts to about 2 kilobytes of data. So we did some more math:

2 kilobytes × 72,000 readings × 7 sensors = 1.008 gigabytes

That is 1 GB of disk space just to hold the data from 7 sensors for one hour. How much more memory would be needed to take the objects from the Data Layer to the Presentation Layer so the user could have a full representation of the collected data? Things were not looking good.

Handling the problem! The first thing was to double-check that the requirement was real (collecting data for one hour at a sampling rate of 20 readings per second). Unfortunately, this was a valid use case and, to make matters worse, the client was getting frustrated with the loading speed. Reading so much data and creating the Business Logic objects for it was taking much more time and memory than simply copying the data from disk into memory. We needed a new way to handle our relational data so we could achieve a performance as close to the disk read time as possible.

We checked how the ORM and other similar tools performed at loading all the data and assembling the hierarchy as described by the database and by the Business Logic. To sum things up, the short answer was as expected: badly. We did everything you could imagine with the ORM, to the point of using multiple instances to load the data separately per object type, and we stopped tracking the entities' states, but again and again it was far from the performance we were expected to deliver. The picture was clear: do not rely on ORM tools to build and load complex hierarchies.

That was the point when we realized we needed a way to express the complexity of the hierarchy, as it spreads both horizontally (siblings) and vertically (parents, children), in a single table, so we could get the entities and their relationships with one simple query; with a single pass over this table, the Business Logic objects and their representation could be correctly loaded. We created a table that holds the mapping for each type of object in the database tables and also stores its level of depth; that way we covered both the sibling and the parent relationships.

This table (Model Nodes) basically allows us to extract all entries from the database that have a connection to a mathematical model. Each configuration, each preceding model, each sensor reading and each output of the model has an entry in this table. Its columns are:

• Model Id - the Id of the mathematical model

• NodeType - the type of object we are relating to (configuration, result, sensor reading)

• Node Id - the Id of the object we are relating to

• Parent Id - the parent of the node, if one exists

By reading each row of this table, we can build our own in-memory index for a mathematical model and, based on it, load only the relevant data from each table indicated by the NodeType.
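As a rough illustration, here is a minimal Python sketch of how such an index could be assembled in a single pass over the Model Nodes rows. The row layout follows the columns listed above; the node type names, the example values and the convention used for Parent Id are assumptions made purely for illustration.

    from collections import defaultdict

    # One tuple per Model Nodes row: (model_id, node_type, node_id, parent_id).
    # The values are invented for this example.
    model_node_rows = [
        (1, "Configuration", 12,   None),
        (1, "SensorReading", 5031, None),
        (1, "SensorReading", 5032, None),
        (1, "Output",        901,  None),
        (2, "Output",        901,  1),     # output 901 reused as input; its parent is model 1
        (2, "Output",        902,  None),
    ]

    def build_index(rows):
        """Single pass over the Model Nodes rows: group node ids per model and per type."""
        index = defaultdict(lambda: defaultdict(list))
        parent_of = {}
        for model_id, node_type, node_id, parent_id in rows:
            index[model_id][node_type].append(node_id)
            if parent_id is not None:
                parent_of[(model_id, node_id)] = parent_id
        return index, parent_of

    index, parent_of = build_index(model_node_rows)
    # index[1]["SensorReading"] -> [5031, 5032]: the only ids we have to read
    # from the readings table in order to rebuild model 1, with no joins at all.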

Suppose we have two mathematical models, M1 and M2. The first model, M1, is based on the data read by sensors S1 and S2 and produces the output results M1R1, M1R2 and M1R3. The second model, M2, takes M1R1 and M1R3 as input and produces the result M2R1. The user wants to exclude from this computation only the data influenced by sensor S1, and for this it is enough to deselect it in the interface.

Normally, this operation would mean developing a complex query starting from the IDs of the mathematical models, of the sensors and of the outputs. All values from S1 that were used to determine M1R1 and M1R3 are extracted and M2R1 is re-evaluated. However, if we use the hierarchy table, we already have all the necessary IDs in memory, connected both to M1 and to M2:

All we have to do is read the values from the corresponding tables of the database, re-evaluate the models and update the hierarchy table if necessary.
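Here is the same walk sketched in Python over an already-built in-memory index; the shape of the index and the names are, again, only assumptions for the sake of the example.

    # The M1/M2 example expressed as an in-memory index built from Model Nodes.
    index = {
        "M1": {"SensorReading": ["S1", "S2"],
               "Output":        ["M1R1", "M1R2", "M1R3"]},
        "M2": {"Input":  ["M1R1", "M1R3"],
               "Output": ["M2R1"]},
    }
    parent_model = {"M2": "M1"}   # M2 consumes outputs of M1

    def models_to_reevaluate(deselected_sensor):
        """Models fed directly by the sensor, plus every model that depends on them."""
        affected = {model for model, nodes in index.items()
                    if deselected_sensor in nodes.get("SensorReading", [])}
        changed = True
        while changed:             # propagate down the parent -> child chain
            changed = False
            for child, parent in parent_model.items():
                if parent in affected and child not in affected:
                    affected.add(child)
                    changed = True
        return affected

    print(models_to_reevaluate("S1"))   # prints {'M1', 'M2'}

No query against the database is needed to discover that M1 and M2 must be re-evaluated; only the affected readings and outputs are then read from their tables.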

This table can be built in two ways: either by generating it through a script, or by automatically creating an entry at the moment a new logical entity is created in the application. The example given above is only one of the many scenarios where indexing these logical connections in memory can save running time on the database server.
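The second approach can be as simple as a small hook in the Data Layer; the sketch below assumes hypothetical names and uses SQLite just to stay self-contained.

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("""CREATE TABLE ModelNodes (
                      ModelId  INTEGER,
                      NodeType TEXT,
                      NodeId   INTEGER,
                      ParentId INTEGER)""")

    def register_node(model_id, node_type, node_id, parent_id=None):
        """Called whenever the application saves a new logical entity."""
        db.execute("INSERT INTO ModelNodes VALUES (?, ?, ?, ?)",
                   (model_id, node_type, node_id, parent_id))

    # e.g. a freshly stored sensor reading and the output it contributed to:
    register_node(1, "SensorReading", 5033)
    register_node(1, "Output", 903)
    db.commit()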

To understand why this table solves the complexity of our growing data in relational databases, we need to understand the objects behind it. Relational databases do not have a performance issue with reading flat, structured data, but when the Data Layer starts assembling business objects, it creates complex queries with high running times and a high memory impact. This table is a flat description of a hierarchy, and any information we need about the logical relationships between the objects is available inside it. Everything needed to describe the logical relationships between the Business Logic objects can be obtained with a single query and kept in memory. Just by adding this one table to our database, we were able to read and assemble objects as fast as the disk allows, at a cost of 40 bytes × 72,000 readings × 7 sensors (about 20 MB) of index for 1 GB of raw data.

Looking into the future. This got me thinking about how our data will look in the future and how much the application can actually handle this way. Is there something more we can do to optimize it? To answer this question we have to turn, once again, to our database. If the data permits it, we can store aggregated values that easily fit into the hierarchy table; we can load those values and use them in further filtering and calculations, and, if the math allows it, filtering out bad sensor readings will update the aggregated value, which in turn can trigger the re-evaluation of the mathematical models.
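As a final, purely hypothetical sketch, an aggregate of this kind could be maintained incrementally, so that dropping a bad reading never requires re-reading the raw rows (the names and the choice of a sum/count aggregate are assumptions):

    # A hypothetical aggregate node kept alongside the hierarchy index.
    aggregate = {"node_id": "AGG-1", "sum": 1523.4, "count": 72000}

    def remove_bad_reading(value):
        """Filtering out one bad reading updates the aggregate in place;
        the owning model is then flagged for re-evaluation."""
        aggregate["sum"] -= value
        aggregate["count"] -= 1
        return aggregate["sum"] / aggregate["count"]   # the new mean fed to the model

    new_mean = remove_bad_reading(12.7)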

A pattern has emerged, and we can draw our conclusion: a database contains two types of relations, schema relations and logical ones. Schema relations are the ones that make your life easy; they are described by the table schemas, and loading this type of data is straightforward and scalable. Logical relations, on the other hand, are the ones that cause problems: applying complex logic in queries leads to slow performance and huge memory consumption. This is where you have to optimize and find a way to map those logical relationships back into the database as schema relationships.

Let the database work for you.