On Counting and Counters: QoS – Quality of Service

Marius Span - Senior developer @Betfair

You have just launched an application on the market, the champagne starts popping, everybody is happy. The DevOps team comes to see you the next day. Nothing is rosy anymore. The application stopped several times during the night due to overload. To make things even more complicated, they cannot identify the reason why it crashed.

The only clue you have is that the operating system's resources were overloaded.

The previous day, before going home, everything seemed to work fine and within normal parameters. Because you do not have a lot of time until the next series of events, you recommend adding new instances. This is not very complicated. You create a couple of new virtual machines, install the application, add them to the existing group of instances and begin to accept traffic through the new virtual machines. Everything looks good. Things look good the next day too: the application managed to handle the traffic from the previous night. The fire is extinguished, but there is still some smoke left. You know that there will be more events soon, events that will bring more traffic. It is likely that the current capacity will not be sufficient and that the application will crash again. Moreover, you are not the only one who noticed this. The owners have already been informed and are asking you to take action.

All right, it is clear that you need to define the capacity explicitly, in relation to the applications that send requests to yours. You talk to the teams that maintain these applications and, in the end, you agree on the capacity provisioned for the requests of each client application. Put together, these capacities do not exceed the capacity of the initial group of instances, and yet the application still went down after the first evening. Something else must be wrong! You currently do not know what the traffic served by your application looks like, and you also know that, in the near future, the owners will request new functionality. The only solution you have right now is to add new virtual machines. However, this is not a long-term solution. You need to understand what triggers the capacity overload. You must implement a mechanism that allows you to measure, per operation, the number of requests per second coming from each client, since these are the terms that describe the contract you have with your application's clients. This mechanism needs to scale, to be flexible and to be performant.

Nothing new, right? What options do you have?

You could analyse the logs, but this is not a performant solution because it relies heavily on disk access, which is known to be slow, and, what is more, it does not allow you to make automated decisions within the application. The database is not a good solution either, because it would increase the load on the connection pool the application uses to serve its clients' requests. You could start an HTTP service that publishes values related to the application's capacity, but this seems to be too much, because such a service consumes resources of its own.

You need something simpler.

JMX looks like a good candidate. You need a singleton instance that abstracts a counter, one you increment on each request and later publish through the MBean server. That is settled, but this will not show traffic per operation; it will only show the total traffic at application level. This is why you must instantiate the abstraction for each of the operations requested by the clients. Things get a bit complicated, because a solution based on individual counters, invoked all over the code, will not cope well with the new operations you might add in the future. You must define an abstraction that encapsulates all the counters and offers an easy-to-use API based on identifiers. Such an abstraction allows you to isolate the application from newly added functionality. This sounds good! You implement it, make a couple of test requests, and then send the application to production. You begin to observe the reported values closely. Everything looks good.
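A minimal sketch of what such a registry might look like, assuming a standard MBean. The class and object names and the ConcurrentHashMap-backed storage are illustrative choices, not the article's actual code:

```java
// RequestCountersMBean.java -- the management interface that JMX introspects
// (standard MBean convention: implementation class name + "MBean").
public interface RequestCountersMBean {
    long getCount(String operation);
}
```

```java
// RequestCounters.java -- one counter per operation identifier, published via JMX.
import java.lang.management.ManagementFactory;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;
import javax.management.ObjectName;

public class RequestCounters implements RequestCountersMBean {

    private final ConcurrentHashMap<String, AtomicLong> counters = new ConcurrentHashMap<>();

    // Called on every request, keyed by the operation identifier.
    public void count(String operation) {
        counters.computeIfAbsent(operation, op -> new AtomicLong()).incrementAndGet();
    }

    // Readable from any JMX console once the MBean is registered.
    @Override
    public long getCount(String operation) {
        AtomicLong counter = counters.get(operation);
        return counter == null ? 0L : counter.get();
    }

    // Publish the counters through the platform MBean server.
    public void register() throws Exception {
        ManagementFactory.getPlatformMBeanServer()
                .registerMBean(this, new ObjectName("qos:type=RequestCounters"));
    }
}
```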

You can now tell how many requests your application serves, but your initial goal was to understand how many requests each client initiates. Your next step is to add to your abstraction the identity of the clients that perform the requests. Because you previously chose to abstract the counters, all you need to do now is add a new parameter that identifies the client for which the request is executed, as in the sketch below. Within the abstraction, you must find the counter associated with that client and increment it each time you serve a request. Of course, you would not like to write code that knows all the clients a priori. Such a solution is fragile and has a short shelf life.
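The change to the counting API can be as small as one extra parameter; a hypothetical shape of the interface (the names are illustrative, not from the article):

```java
// Hypothetical counting API: the caller passes the client's identity together
// with the operation, and the counter is looked up by the (client, operation) pair.
public interface QosCounters {
    void count(String client, String operation);
}
```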

The abstraction must be generic enough to accommodate any operation invoked by any client, without prior knowledge of the set of clients or of the operations they invoke. This seems complicated, because such an abstraction must be shared by all threads, which means shared state. As the base structure, you choose an index whose keys combine the identity of a client with the operation it invokes. The index value is the counter, and, to solve the sharing issue among threads, each time you need to add new entries to the index, instead of mutating the current index, you replicate it completely as a copy of the current one, to which you add the new key.

Yes, this approach has several disadvantages: it is possible that, until all index keys have been discovered, some counters are created several times. Another disadvantage is that, until the index stabilises, some requests may be lost from the count. On the other hand, this approach has benefits in terms of performance, as the index structure can rely on a plain HashMap, which gives you fast access to the data without any concurrency restrictions. The sketch below illustrates the idea.
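A possible sketch of this copy-on-write index, assuming the key is simply the client and operation concatenated and each counter is an AtomicLong (both illustrative choices):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

public class CopyOnWriteCounterIndex {

    // A plain HashMap that is never modified once published; on growth it is
    // replaced wholesale by a copy, so readers need no locks at all.
    private volatile Map<String, AtomicLong> index = new HashMap<>();

    public void count(String client, String operation) {
        String key = client + "|" + operation;
        AtomicLong counter = index.get(key);
        if (counter == null) {
            // Rare path: a (client, operation) pair seen for the first time.
            counter = addKey(key);
        }
        counter.incrementAndGet();
    }

    // Copy the current index, add the new key and swap the reference. Two threads
    // racing here may each create a counter for the same key and only one copy
    // wins -- the duplicate-counter / lost-request trade-off accepted above in
    // exchange for contention-free reads.
    private AtomicLong addKey(String key) {
        Map<String, AtomicLong> copy = new HashMap<>(index);
        AtomicLong counter = copy.computeIfAbsent(key, k -> new AtomicLong());
        index = copy;
        return counter;
    }
}
```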

Now that you have an idea of how the scaling problem can be solved, you would also like the counter index to become an index of transactions per second (TPS), because that is what you want to measure in the end. The current counter must be reset every second, which brings along a new problem. To reset the counter every second, you need to associate it with the time of the last reset, which means even more shared state. In addition, each time you count a request, you also need to check whether the counter must be reset before incrementing it. This does not sound good anymore! Instead of implementing this solution, you could let the thread that processes a request simply increment the counter, and let another thread make sure that the threads processing client requests see a freshly reset counter every second. This means that the counter index, the one you tried to keep in a semi-static state, can remain a simple HashMap collection. It also becomes clear that the index can no longer hold a plain counter as its value. The value must be a more complex object: a circular list of counters with a fixed number of elements, which lets you give up the control mechanisms around the shared state. The counter incremented during the current second sits at the head of the list; the rest of the elements are reusable counters in which the requests of the previous seconds were counted. The maintenance thread only ever moves the head of the list and, because you want to reuse the counters, the list itself can be immutable. If you follow this line of thought and expose the head of the list in JMX, you will be able to monitor the number of requests per second served for each client.
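One possible shape for such a per-second value object, assuming a fixed ring of ten one-second slots rotated by a single scheduled maintenance thread; the names, the slot count and the use of ScheduledExecutorService are illustrative:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class PerSecondCounter {

    private static final int SLOTS = 10;      // keep the last 10 seconds of history

    private final AtomicLong[] ring = new AtomicLong[SLOTS];
    private volatile int current = 0;         // slot used for the current second

    public PerSecondCounter() {
        for (int i = 0; i < SLOTS; i++) {
            ring[i] = new AtomicLong();
        }
    }

    // Called by the request-processing threads: increment the current second's slot.
    public void count() {
        ring[current].incrementAndGet();
    }

    // Value to publish through JMX: the requests counted in the last completed second.
    public long lastSecond() {
        return ring[(current + SLOTS - 1) % SLOTS].get();
    }

    // Called once per second by the maintenance thread: reset the slot about to be
    // reused, then advance the head of the ring to it.
    void tick() {
        int next = (current + 1) % SLOTS;
        ring[next].set(0);
        current = next;
    }

    // The single maintenance thread that keeps every counter aligned to second boundaries.
    public static ScheduledExecutorService startMaintenance(PerSecondCounter... counters) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            for (PerSecondCounter counter : counters) {
                counter.tick();
            }
        }, 1, 1, TimeUnit.SECONDS);
        return scheduler;
    }
}
```

Publishing lastSecond() through JMX, per client and operation, then gives you the requests-per-second view described above.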

After you implement this and deploy it to production, you start to monitor the values. Everything becomes clear after a while. One of the application's clients executes a number of requests far above the established contract. As soon as you notice this, you report the situation to the development team. Further investigation reveals that the client application causing this behaviour has a flaw that makes it exceed the contract.

Once the flaw is fixed, the application returns to the parameters of the contracts established with your clients. Everything is perfect. The problem is solved!

Is it really solved? What happens when the same flaw re-emerges? You cannot allow the application to go down in production again. When a client exceeds its contractual limits, you must refuse the processing. With the current implementation, after you increment a counter, you can check whether it exceeded a certain limit. When this happens, you raise an error that signals that the contractual limit was exceeded. This might sound harsh. Even though you do not want clients that exceed the established limits, you may find yourself in a situation where a client exceeds its limit, but the application still has the capacity to process the request without affecting the performance offered to the other clients. When you decide to refuse a request, you would like to do it only if you know that processing it would have unwanted effects. To the decision of refusing a request you can add a simple condition that checks the available capacity, or you could check how many requests are currently queued, waiting to be processed. Both rely on extra calculations and shared state, so, if you decide to use such an approach, you must check whether the benefits outweigh the costs.
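For illustration, the rejection decision with the optional capacity short-circuit could look like this; the half-full queue threshold is an arbitrary example, not a recommendation from the article:

```java
// Decide whether to serve a request, given how many requests the client has
// already made in the current second and its contractual limit per second.
public final class AdmissionPolicy {

    public static boolean admit(long requestsThisSecond, long contractLimit,
                                int queuedRequests, int queueCapacity) {
        if (requestsThisSecond <= contractLimit) {
            return true;                      // within the contract: always serve
        }
        // Over the contract: serve anyway only while there is comfortable headroom,
        // e.g. the processing queue is less than half full (illustrative threshold).
        return queuedRequests < queueCapacity / 2;
    }
}
```

A caller that receives false would then raise the error signalling that the contractual limit was exceeded.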

In addition to the possibility of short-circuiting the refusal, you could improve the current implementation so that requests are not refused when your clients occasionally exceed the agreed limit by a small percentage. This means that, when you make the decision to refuse a request, you should also consider both the value of the current counter and the average of the last few seconds. Of course, this entails more calculations but, for the sake of efficiency, they can be performed only once per second, and the ideal place to do so is the thread that maintains the current counters. That thread, in addition to its responsibility of resetting old counters, can also calculate the average number of requests per second.
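A sketch of that refinement, assuming a 10% tolerance and a rolling average recomputed once per second by the maintenance thread (the values and names are illustrative):

```java
public final class TolerantAdmissionPolicy {

    private static final double TOLERANCE = 1.10;    // allow a 10% occasional overshoot

    private volatile double averageLastSeconds;      // refreshed once per second

    // Called by the maintenance thread right after it rotates the per-second counters.
    void updateAverage(long[] lastSeconds) {
        long sum = 0;
        for (long value : lastSeconds) {
            sum += value;
        }
        averageLastSeconds = lastSeconds.length == 0 ? 0 : (double) sum / lastSeconds.length;
    }

    // Called on the request path: refuse only when the current second exceeds the
    // tolerated limit AND the recent average also sits above the contract.
    boolean admit(long requestsThisSecond, long contractLimit) {
        return requestsThisSecond <= contractLimit * TOLERANCE
                || averageLastSeconds <= contractLimit;
    }
}
```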

Voilà! After implementing these last aspects as well, you have finally managed to put together a solid QoS - Quality of Service implementation.