Computers started as a basic alternative to the human brain. Its power, and all that was mysterious and compelling about it, needed to be replicated. Humans were the pinnacle of rational evolution but, at some point in time, they seemed limited. Eventually they started creating controllable things that would help them overcome their drawbacks. Forgetting is a natural human defect, and this is why we live in an era in which “the internet never forgets”. But maybe it’s time it should.
Total recall can be a curse. Not only for the one who carries this ability, but also for those who come in contact with this flawless imprint in time. How do you behave in front of such a “being”? What do you say? How should you look? What can you think when everything that you are is recorded with immense precision and may stick around until the end of time? Time is always the cruelest judge and executioner, and this is one of the scariest things to bear in mind about something or someone that pushes you in front of such a jury.
Actually, beyond the philosophical standpoint, “computers” tend to store everything - from mechanical punch cards to floppy disks, DVDs, CDs, hard disks, external hard disks, datastores and cloud services - and these are just some of the solutions in use. There will be more of them as technology evolves. History, and every prediction, shows that storage capacity will keep increasing while inevitably becoming cheaper - the blunt translation being that we can store more with less money. There are also alternative ways of storing data (useful and important information). One example is the research on storing data in DNA sequences, which managed to save about 700 terabytes of data in one gram of DNA by translating the data into ones and zeros, encoding it into the bases T, G, A and C (T and G = 1, A and C = 0), and synthesizing sequences with specific addresses. So, if we have significant amounts of storage power now, and if we’re finding new ways and going extra miles to save as much data as we can in the smallest space possible, why would we need to forget?
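As a crude illustration of that bit-to-base mapping, a minimal sketch might look like the following. The choice between T/G for a 1 and A/C for a 0 is alternated here only to avoid long single-base runs; real schemes are far more involved and add addressing and error correction, so this is a toy, not the actual encoding used in the research.

```python
# Toy sketch of the bit-to-base encoding idea: 1 -> T or G, 0 -> A or C.
def encode_bits_to_dna(data: bytes) -> str:
    bases_one, bases_zero = "TG", "AC"
    bits = ((byte >> s) & 1 for byte in data for s in range(7, -1, -1))
    return "".join(bases_one[i % 2] if bit else bases_zero[i % 2]
                   for i, bit in enumerate(bits))

print(encode_bits_to_dna(b"Hi"))  # 2 bytes -> 16 bases
```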
Humans choose what to remember. Sometimes it isn’t even a conscious decision. This is also a self-sustaining action of the brain - when new information is stored, old information is released. This is how short-term memory works. With time, the needed and essential information moves into long-term memory and becomes available on request. But it is inevitable that a lot of information (maybe redundant data) is forgotten, by choice or unintentionally. Using mnemonic devices will increase the capacity to memorize and retrieve information, but then again this is done selectively. This means that we can filter data and information by importance. Neuroscientists say forgetting is crucial to the efficient functioning of the mind - to learning, adapting and recalling the more significant things. Researchers in the field of neural networks are trying to resolve catastrophic interference, also known as catastrophic forgetting. In confronting this problem they have found ways of changing a network's structure so that already learned skills are not lost when a new skill is acquired.
We’re basically reaching a point where we’ll have to “learn how to forget”. And this seems crucial. Forgetting is not a process of deletion, it’s a process of selection - figuring out through experience what is important and what can be left aside. If we apply this concept to any device (or alternative medium) that stores data, we can create a supervised learning mechanism aimed at sorting the important information out of the data already stored or newly arriving. This also means that irrelevant data will be forgotten (and eventually deleted) as time passes.
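One way to read “forgetting as selection” is to let every record carry an importance score that decays with time and is reinforced whenever the record is used; whatever falls below a threshold becomes a candidate for forgetting. The half-life, boost and threshold values in this sketch are illustrative assumptions, not part of any particular system:

```python
import math
import time

HALF_LIFE_DAYS = 30.0     # assumed decay rate
FORGET_THRESHOLD = 0.05   # assumed cut-off for "forgettable"

class Record:
    def __init__(self, name, importance=1.0):
        self.name = name
        self.importance = importance
        self.last_touched = time.time()

    def current_importance(self, now=None):
        # Importance decays exponentially since the last access.
        age_days = ((now or time.time()) - self.last_touched) / 86400
        return self.importance * math.exp(-math.log(2) * age_days / HALF_LIFE_DAYS)

    def touch(self):
        # Accessing a record restores and slightly boosts its importance.
        self.importance = min(1.0, self.current_importance() + 0.5)
        self.last_touched = time.time()

def forgettable(records, now=None):
    return [r for r in records if r.current_importance(now) < FORGET_THRESHOLD]
```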
One method is Moving Data. In a complex system with several levels of storage, we have the possibility to move information around, in any format, on any layer, based on its importance. This means the user always has quick access to the day-to-day important information, while irrelevant data is eventually archived and moved down through the layers (Fig. 1).
What happens to the data that reaches the furthest layer of storage? It ages, with a final dying gesture of self-deletion. Until then, the information can still be accessed if it is needed at some point in time. If a request comes in, the mechanism moves the data back from the furthest layer to the closest or fastest accessible point for the user.
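A minimal sketch of this layered movement, with hypothetical layer names and a deliberately simple demote-on-age / promote-on-access policy:

```python
LAYERS = ["fast", "slow", "archive"]   # closest to the user -> furthest

class TieredStore:
    def __init__(self):
        self.layer_of = {}             # record name -> layer index

    def add(self, name):
        self.layer_of[name] = 0        # new data starts closest to the user

    def age(self, name):
        # Demote an irrelevant record one layer; past the furthest layer
        # it performs its final gesture of self-deletion.
        idx = self.layer_of.get(name)
        if idx is None:
            return
        if idx == len(LAYERS) - 1:
            del self.layer_of[name]
        else:
            self.layer_of[name] = idx + 1

    def access(self, name):
        # A request pulls the record back to the fastest accessible layer.
        if name in self.layer_of:
            self.layer_of[name] = 0
            return LAYERS[0]
        return None                    # already forgotten
```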
The second method is Archiving Data. The same principle applies, but this time the system we are working on does not have several layers of storage. Theoretically we are referring to a single unit, a single machine that needs a performance improvement. Applying the same mechanism, data is archived when it becomes irrelevant (much like moving it through layers of storage) until it is finally deleted. If any of the once-irrelevant data becomes relevant again for some reason, the system makes it available for use by extracting it from the archives.
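On a single machine this can be as simple as compressing files in place and decompressing them on demand; the paths and the “.gz” naming below are assumptions for illustration:

```python
import gzip
import os
import shutil

def archive(path: str) -> str:
    # Compress an irrelevant file in place and remove the original.
    archived = path + ".gz"
    with open(path, "rb") as src, gzip.open(archived, "wb") as dst:
        shutil.copyfileobj(src, dst)
    os.remove(path)
    return archived

def restore(archived: str) -> str:
    # Bring a once-irrelevant file back when it becomes relevant again.
    path = archived[:-len(".gz")]
    with gzip.open(archived, "rb") as src, open(path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    os.remove(archived)
    return path
```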
But these methods, even though they partially solve the “how to” question regarding space, are not the full solution to the problem. We need an instrument with the capacity to sort the important and useful information out of the stack of needles the system faces. As already mentioned, this is a supervised learning procedure, always considering that the user’s needs matter more than a forced move or a forced archive just for the sake of freeing up space. The user imposes rules on the system by which the instrument functions. In addition, the mechanism learns what data is important by gathering constant statistics while monitoring the system - this provides faster access to important data and also lets it suggest rules to the user for a better overview of the system (Fig. 3).
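A sketch of such an instrument, where user-defined rules always win and observed access statistics fill in the rest; the rule shape, counters and threshold are assumptions rather than a prescribed design:

```python
from collections import Counter

class Forgetter:
    def __init__(self, keep_rules=None):
        self.keep_rules = keep_rules or []   # e.g. lambda name: name.endswith(".jpg")
        self.access_count = Counter()        # statistics gathered while monitoring

    def record_access(self, name):
        self.access_count[name] += 1

    def importance(self, name):
        if any(rule(name) for rule in self.keep_rules):
            return float("inf")              # the user's rules always win
        return self.access_count[name]       # otherwise, learn from usage

    def suggest_forgettable(self, names, threshold=1):
        # Suggest candidates instead of forcing a move or an archive,
        # so the user keeps the final say and can add new rules.
        return [n for n in names if self.importance(n) < threshold]
```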
This mechanism will learn to forget - it will adapt better and better to the user's requirements - which means the user will have constant access to clean, relevant and important information. Think of Big Data, where huge amounts of information are not accessed for long periods of time; think of the Internet, where harmful information (leaked photos, leaked emails or even bank accounts, dead websites, etc.) will always be available; think of build servers that store hundreds of intermediate builds before a release candidate; think of virtual centers where virtual machines are created for one-time use and then left there without anyone knowing what they were used for; think of your laptop and how frustrating it is when you don’t have enough space to store your holiday photos. Data can always carry metadata, which can help in handling information that has an impact on the user's day-to-day decisions. The possibility of adding metadata to existing data, so that we have a smooth decision-making process, will always exist, but it won't always be necessary.
There are several advantages to introducing such a mechanism. There are also disadvantages, but they all come down to how short the access time between the different layers of storage is. The system keeps itself as an independent capsule that only does its job when needed. Furthermore, the user can access important data faster than ever. Data isn’t lost until it is no longer used or no longer useful - time passes and the information ages. As it should.
We need to learn from history - information is always important and knowledge is the most powerful weapon, but there are millions of gigabytes of data sitting somewhere in a dark corner, used by nobody or simply forgotten, yet still occupying valuable space. Deleting data becomes an act of will. If computers were inspired by the human brain, there is a necessity to emulate its behaviour. We should use information in a smart and productive way that can cause and sustain evolution; otherwise everything will be overwhelmed by the staggering amounts of present and future terabytes. We should remember what counts, what is important not only for us but also for our society. We should forget what becomes white noise. Machines shouldn’t be so different.