Let’s be inventive with how we can process and collect data

Radu Vunvulea - Solution Architect

Let's discover together another approach to collecting and transforming the information sent by devices.

Context

In the world of smart devices, devices are more and more chatty. Let's assume that we have a smart device that needs to send a Location Heartbeat every 30 seconds. Worldwide, we have 1,000,000 devices that send this information to our backend. At global level, the backend runs in 4 different Azure Regions, with an equal distribution of devices. This means that each instance of our backend serves 250,000 devices that send heartbeats with their locations. From a load perspective, each Azure Region receives around 8,300 heartbeat requests every second. 8K messages per second might be an acceptable load or not; it depends on what actions we need to take for each request.
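The arithmetic behind these figures is simple; a quick sketch in Python, using the numbers above as assumptions:

# Quick estimate of the per-region load, using the figures stated above.
devices_total = 1_000_000
regions = 4
heartbeat_interval_s = 30

devices_per_region = devices_total // regions               # 250,000
heartbeats_per_second = devices_per_region / heartbeat_interval_s
print(devices_per_region, round(heartbeats_per_second))     # 250000 8333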

Requirements

From the end user's point of view, there are two clear requirements that need to be fulfilled:

The first requirement is pretty lax. From an implementation point of view, we need to persist all the location heartbeats in a repository. The repository can be a file-based repository.

The second requirement is a little more complex. It requires us to have the last location of the device available at all times.

Classical solution

The standard solution for this situation is to have an event messaging system that aggregates all the location heartbeats from devices. On Azure, we can use Azure Event Hub successfully, as it is able to ingest events at a very high throughput.
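As a sketch of the ingestion side, this is roughly how a device (or a gateway acting on its behalf) could publish a heartbeat to Event Hub using the Python SDK; the connection string, hub name and payload fields are placeholders, not values from the original system:

import json
import time
from azure.eventhub import EventHubProducerClient, EventData

CONN_STR = "<event-hub-namespace-connection-string>"
HUB_NAME = "location-heartbeats"

producer = EventHubProducerClient.from_connection_string(CONN_STR, eventhub_name=HUB_NAME)

heartbeat = {
    "deviceId": "device-000042",
    "latitude": 46.7712,
    "longitude": 23.6236,
    "timestamp": int(time.time()),
}

# Send a single heartbeat; in practice the device would do this every 30 seconds.
with producer:
    batch = producer.create_batch()
    batch.add(EventData(json.dumps(heartbeat)))
    producer.send_batch(batch)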

Behind it, we need a computation unit that can scale easily. The main purpose of this unit is to process each location heartbeat: dump the information in storage for audit and update the last location of the device.

For processing purposes, we can successfully use Worker Roles or Azure Service Fabric. In both situations, you should expect the processing of each heartbeat to take around 0.4-0.5s (dumping the data and updating the last location). This means that you will need at least 16 Worker Role instances to do this job.

The device location audit can easily be dumped into blobs. If you don't want to do this by hand, you can use Azure Event Hubs Archive. This new feature offered by Azure can do the job for you, dumping all the events directly into blobs. This solution is applicable as long as you don't need to transform the messages.

For storing the last location, there are many options. The fastest one is to keep the last location in an in-memory cache like NCache. The downside of this approach is that the information still needs to be exposed, so an API has to be made available on top of the cache. At a specific time interval, a dump to Azure Table or SQL can be done.
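To make the idea concrete, here is a minimal in-process sketch of the pattern: an in-memory map with the last location per device, plus a periodic dump to durable storage. A real deployment would use a distributed cache such as NCache behind an API, so everything below is illustrative only:

import threading
import time

last_location = {}          # deviceId -> (latitude, longitude, timestamp)
lock = threading.Lock()

def update_location(device_id, lat, lon, ts):
    # Called for every location heartbeat; only the latest value is kept.
    with lock:
        last_location[device_id] = (lat, lon, ts)

def persist_snapshot(snapshot):
    # Dump to Azure Table or SQL; see the Azure Table example below.
    pass

def flush_periodically(interval_s=60):
    while True:
        time.sleep(interval_s)
        with lock:
            snapshot = dict(last_location)
        persist_snapshot(snapshot)

threading.Thread(target=flush_periodically, daemon=True).start()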

You might ask why I did not propose Azure Redis Cache or another cache service. When you do a lot of writes, a cache system doesn't behave as you would expect. In the past, we had problems with such a solution, where read and write latency reached 2-5s because of the heavy writes (yes, there were a lot of write operations).

A better solution is Azure Tables. Each row stores the last location of a device and, each time a new location is received, we update that row. You can replace Azure Table with SQL if the reports you need are complex or have to be generated on the fly (even in these situations, Azure Table can be a good friend).
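A minimal sketch of this row-per-device model, assuming the Python azure-data-tables SDK, an already existing table and an illustrative key scheme (region as partition key, device id as row key):

from datetime import datetime, timezone
from azure.data.tables import TableClient, UpdateMode

CONN_STR = "<storage-account-connection-string>"

# The "LastLocation" table is assumed to exist already.
table = TableClient.from_connection_string(CONN_STR, table_name="LastLocation")

def store_last_location(region, device_id, lat, lon):
    entity = {
        "PartitionKey": region,
        "RowKey": device_id,
        "Latitude": lat,
        "Longitude": lon,
        "UpdatedAt": datetime.now(timezone.utc).isoformat(),
    }
    # Replace the whole row with the latest known location.
    table.upsert_entity(entity, mode=UpdateMode.REPLACE)

store_last_location("west-europe", "device-000042", 46.7712, 23.6236)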

Cost

Even if we try to optimize the code, writing the last location of devices is expensive. You need to identify the device, get the table and write the content. This means that, even if you reduce the processing time from 0.4-0.5s to 0.2-0.3s, it will still be expensive.

Consuming and processing 8,300 location heartbeats per second will still be costly. Let's assume that you manage to process 500 per second on a Worker Role. This would translate into at least 17-18 Worker Role instances.
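The instance count follows directly from these assumptions:

import math

# ~8,300 heartbeats/second per region, ~500 processed per second per Worker Role.
heartbeats_per_second = 8300
per_instance_throughput = 500
instances_needed = math.ceil(heartbeats_per_second / per_instance_throughput)
print(instances_needed)     # 17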

Identify the bottleneck

We need to see how we can use the Azure services that are available on the market to our advantage and to reduce our custom code. In the long run, this means a lower cost for maintenance and support, as well as a lower number of bugs.

Our assumption is that we cannot use Azure Event Hubs Archive, because some transformation needs to be done to store the data in the agreed audit format.

The first thing we should do is separate these two requirements into different modules: one module creates the audit data and the other one stores the last location of the device.

Remark: Whatever you do, you should use Event Processor Host to consume messages from Event Hub. This will ensure that you have:

- the Event Hub partitions automatically distributed (leased) across all your running consumer instances;
- checkpointing, so that after a restart or a failover, processing resumes from the last recorded position.
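Event Processor Host is the .NET building block for this pattern; as an illustration only, the sketch below shows the equivalent in the Python SDK (EventHubConsumerClient with a blob checkpoint store), with placeholder connection strings and names:

from azure.eventhub import EventHubConsumerClient
from azure.eventhub.extensions.checkpointstoreblob import BlobCheckpointStore

EH_CONN_STR = "<event-hub-namespace-connection-string>"
HUB_NAME = "location-heartbeats"
STORAGE_CONN_STR = "<storage-account-connection-string>"

# Checkpoints and partition leases are kept in blob storage, shared by all instances.
checkpoint_store = BlobCheckpointStore.from_connection_string(
    STORAGE_CONN_STR, container_name="eh-checkpoints"
)

client = EventHubConsumerClient.from_connection_string(
    EH_CONN_STR,
    consumer_group="$Default",
    eventhub_name=HUB_NAME,
    checkpoint_store=checkpoint_store,
)

def process_heartbeat(payload):
    # Transformation and dump for audit (see the buffering sketch below).
    pass

def on_event(partition_context, event):
    process_heartbeat(event.body_as_str())
    partition_context.update_checkpoint(event)   # record progress for this partition

with client:
    client.receive(on_event=on_event, starting_position="-1")   # "-1" = from the beginning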

If we run these two modules separately, we notice that processing the location heartbeats only for audit is extremely fast: on a single Worker Role, we are able to process 4,000-5,000 location heartbeats per second. By contrast, updating the last device location is an expensive action.

The problem does not stem from Azure Table, where latency is low. The difference is that, for audit purposes, you only transform the messages and dump them, so strategies like buffering can be used to optimize this kind of action. For the last location update, whatever we do, we cannot process more than 600 location heartbeats per second on each computation instance.
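As an illustration of the buffering strategy on the audit path, here is a sketch that accumulates heartbeats in memory and writes them to a blob in batches instead of one write per message; the batch size, container and blob names are assumptions:

import json
import uuid
from azure.storage.blob import BlobServiceClient

STORAGE_CONN_STR = "<storage-account-connection-string>"
BATCH_SIZE = 1000

container = BlobServiceClient.from_connection_string(
    STORAGE_CONN_STR
).get_container_client("location-audit")

buffer = []

def on_heartbeat(heartbeat):
    # The transformation into the agreed audit format would happen here.
    buffer.append(heartbeat)
    if len(buffer) >= BATCH_SIZE:
        flush()

def flush():
    if not buffer:
        return
    payload = "\n".join(json.dumps(h) for h in buffer)
    container.upload_blob(name="audit-{0}.jsonl".format(uuid.uuid4()), data=payload)
    buffer.clear()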

Let's be inventive

We observed that creating the audit data is a cheap action, while storing the last location of the device is expensive. If we look at the last device location again, we realize that this action could be done directly by the device.

Why couldn't we make the device update its own Azure Table Row directly with its last known location? In this way, we don't have to process the data and update the last location using backend resources.

From an access perspective, Azure Table Shared Access Signature (SAS) enables us to offer granular access at partition and row key level. Each device can have access only to its own row in that table.
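A sketch of how this could look with the Python azure-data-tables SDK, reusing the region/device key scheme from the earlier example. The exact generate_table_sas parameters, as well as all account names and keys, are assumptions and placeholders, not a definitive implementation:

from datetime import datetime, timedelta, timezone
from azure.core.credentials import AzureNamedKeyCredential, AzureSasCredential
from azure.data.tables import (
    TableClient, TableSasPermissions, UpdateMode, generate_table_sas,
)

ACCOUNT = "<storage-account-name>"
ACCOUNT_KEY = "<storage-account-key>"
ENDPOINT = "https://{0}.table.core.windows.net".format(ACCOUNT)
REGION = "west-europe"
DEVICE_ID = "device-000042"

# Backend side: issue a token restricted to this device's partition/row only.
sas_token = generate_table_sas(
    credential=AzureNamedKeyCredential(ACCOUNT, ACCOUNT_KEY),
    table_name="LastLocation",
    permission=TableSasPermissions(add=True, update=True),
    expiry=datetime.now(timezone.utc) + timedelta(days=30),
    start_pk=REGION, end_pk=REGION,
    start_rk=DEVICE_ID, end_rk=DEVICE_ID,
)

# Device side: update its own row using nothing but the scoped SAS token.
table = TableClient(ENDPOINT, "LastLocation", credential=AzureSasCredential(sas_token))
table.upsert_entity(
    {
        "PartitionKey": REGION,
        "RowKey": DEVICE_ID,
        "Latitude": 46.7712,
        "Longitude": 23.6236,
        "UpdatedAt": datetime.now(timezone.utc).isoformat(),
    },
    mode=UpdateMode.REPLACE,
)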

Finally, each device ends up sending one request to Azure Table, to update its last known location, and another request to Azure Event Hub, to push the location heartbeat for audit. At backend level, the load decreases drastically: we only need the Worker Role instances that create our audit data. In Azure Tables, we have the last location of each device, available for reporting and other types of actions.

At device level, things change a little bit. Now, for each location heartbeat, the device needs to make two different requests. This means that the device will consume more bandwidth and a little more CPU. This approach is valid for devices that have a good (cheap) internet connection and no strict CPU limitations.

In this way we move the load from one place to another, increasing the noise on the internet (smile).

Another approach is to have a dedicated Azure Table for each device, where the device keeps adding new rows. This solution might be interesting as long as we don't need complex reports, but we will not explore it here.

Conclusion

To solve a business problem, there are multiple solutions and approaches. There are times when it is a good idea to take a step back and see how we can use the services that are already available on the market to solve our problem.

Making direct calls to different services might be a better solution, but never forget about the cost of making those requests at device level.