Issue 26

Keeping Your Linux Servers Up To Date – Part II

Sorin Pânca
Senior Systems Administrator
@Yardi România

In this article I'll talk about the storage system that best fits the needs of a datacenter. To weigh the pros and cons of choosing a storage solution for a datacenter, I must first give a short classification and history of storage systems.

By their location, the storage systems are classified into two categories:

A. Local Data Storage: these filesystems reside on a local peripheral device:

  1. a hard disk drive;
  2. a NAND memory connected to the computing system by a SATA or SAS data bus (in which case we call it an SSD), or by a USB data bus (in which case we call it a "USB drive/stick" or a "pen drive");
  3. an optical disk drive (CD/DVD/Blu-ray);
  4. RAID - multiple identical storage media are part of a bigger storage medium that provides redundancy at data block level;
  5. rechargeable battery backed RAM connected to the host computer system by the PCIe data bus;
  6. memristor (not available yet on the market).

B. Networked Data Storage (I won't list them all here - there are too many, but I'll give a couple of examples to illustrate the architecture and concepts they have in common and that they rely on to function):

  1. single point of access, based on the client - server paradigm: CIFS (also known as "My Network Places" in Windows), NFS, FTP, rsync, IMAP, scp, WebDAV/HTTP, etc.;
  2. clustered storage systems, that replicate data across nodes at the data block level: DRBD, Microsoft DFS, GlusterFS;
  3. clustered storage systems, that replicate data at file level: Lustre, Ceph, MooseFS, XtreemFS.

Historically, computers were closed systems with an architecture comprising the following components: CPU, Memory and Peripherals. Among the Peripherals, the storage system evolved from a read-only medium to a writable one.

The evolution of the local storage systems was in short as follows:

  • punched cards (a read-only medium; the data was punched in manually);
  • magnetic tapes;
  • hard disk drives (with feet sized diameters);
  • 5.25" floppy disks;
  • 3.5" floppy disks (with a capacity of 720 kB, double density; 1.44 MB, high density; 2.88 MB, extended density);
  • 3.5" hard disks;
  • all the media already mentioned above.

The network revolution followed, when the need arose to move data between different computer systems without physically carrying a device from one computer to another, especially for real-time communication.

Thus, the idea of the "client - server" paradigm surfaced in which a system having a resource (say, a file) could share it over a communication channel (i.e. a network) using a standard communication language (a protocol); this system plays the role of the "server" (because it serves its resources). A computing system (known as a "client") that needs a resource that exists on another computing system, can contact the remote computer system (the server), using the communication channel (the data network) and a standard communication language (a protocol) that both systems understand (protocol examples: FTP, HTTP, IMAP mentioned above - "P" in these names stands for "protocol"). A really nice theory.

The issues that can come up in practice in the above scenario (in order of severity), and their resolution, labeled with "R" where one is possible:

  • multiple clients want to update the same resource at the same time (R: the network filesystems implement resource locking);
  • too many clients want the same resource, depleting the available memory or network capacity of the server system (R: RAID0, RAID10, SSD, rechargeable battery backed RAM, type B.2 or B.3 networked storage systems);
  • the server's storage fills up (R: type B.3 networked storage systems - they are extensible);
  • the server becomes unavailable due to a power surge (R: UPS);
  • the server's network adapter breaks during data transfers, leading to data corruption or truncation (R: link aggregation between multiple network adapters);
  • multiple servers need to keep their whole data in sync (data mirrors) continuously (R: type B.2 or B.3 networked storage systems);
  • each client wants to search through all the available data, and the search must complete in under a second (R: Full Text Search DB);
  • the data stored on the server are stale, because the server is part of an update chain that propagates the data from a source, but the synchronization only takes place periodically (R: type B.3 networked storage systems);
  • the data stored on the server become corrupted or get lost as a result of a failure of the server's local storage (R: RAID(1,5,6,10), type B.2 or B.3 networked storage systems);
  • the server becomes unavailable because a component other than the storage system breaks: the main board, the CPU, the memory, the PSU (R: type B.2 or B.3 networked storage systems);
  • some data needs to be replicated in more copies than other data (R: type B.3 networked storage systems);
  • human error (rm -rf /home /var, you know what (or whom) I mean).

In the above list we notice that the system that offers all the features when it comes to data protection is the type B.3 networked storage system.

Regarding data protection, the companies that sell storage solutions adopted a local storage strategy and created RAID (Redundant Array of Inexpensive Disks), with different "levels" of protection:

  1. RAID0 - offers no data protection at all, but groups multiple disks to obtain elevated speeds and increased storage capacity: the files are fragmented in data blocks (about 4 to 64 kB in size) that are spread across all the disks in the array, so that both when reading and when writing, the file fragments can be processed in parallel.
  2. RAID1 - offers data protection by writing all data to all media in the array; writing takes as long as it would on a single medium, but reading is slower because the data on all storage media are compared, so that the correct data are returned to the host system; when a quorum is not available, the correctness of the data can be inferred from the response speed of each medium: broken media return data more slowly or not at all; if the response times are short and comparable across all media, then all media are considered "healthy".
  3. RAID5 - offers data protection by computing a parity checksum; it runs at slower speeds than a single base medium.
  4. RAID6 - offers data protection by computing two checksums; it runs at slower speeds than RAID5.
  5. RAID10 - is a combination between RAID1 and RAID0: pairs of media in a RAID1 configuration are combined in a RAID0 to offer both speed and redundancy.
  6. RAID50 - see on the Net.
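The "checksum" behind RAID5 can be illustrated with a minimal sketch, assuming the common XOR parity scheme. The block contents and the helper name xor_blocks are invented for illustration; a real array works on fixed-size stripes and rotates the parity between disks.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equally sized byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data blocks of one stripe, each living on a different disk.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)  # stored on a fourth disk

# Simulate losing the second disk and rebuild its block from the survivors.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
print(rebuilt == data[1])
```

Because XOR is its own inverse, any single missing block can be recomputed from the remaining blocks and the parity, which is why the array survives the loss of one disk but not two.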

Sometimes these solutions offer a certain degree of data protection, other times they offer speed, but they fail to address problems that can render the whole server system unusable: mainboard, CPU, memory, PSU or NIC failures, the sysadmin's unstoppable desire to turn off the server, etc.

The first solution to offer high availability of storage services was to create a RAID1 setup between two file servers over the network. This is how DRBD was invented: a system that presents the clients with file storage that is replicated over the network to a peer server and that uses a local disk as its backend storage engine. These two peer servers keep their local storage in sync and can share the storage over the network with clients using standard protocols (CIFS, NFS, FTP, WebDAV/HTTP, rsync, etc.). In order to offer high availability, the system presents the clients with a "service IP" or "virtual IP" that is moved from one server to the other manually or automatically. Thus, one server is considered "active" or "live", while the other is "in stand-by". The problem with this setup is that when we need more space it is rather hard to extend the replicated storage: we shut down one server, increase its storage space and copy the data back from the other; then we turn off the other server, increase its capacity, copy the data back from the first server and restore replication. Another problem with this setup is the cost: to obtain a network storage service, we need to at least double the resources without obtaining a continuous benefit. We only benefit from this setup at the moment one server crashes irrecoverably, an event that takes up close to 0% of the total operation time. In some cases, these costs are justified (when the data are really important and we need continuous access to them no matter what - think life support systems).

The same can be said about all storage systems that replicate data at the storage block level. Also, because they replicate data at block level, the storage space needs to be preallocated, which leads to inefficient storage space management by the systems administration team: how much space should be preallocated, how many new identical hardware nodes are needed that offer enough storage space from the beginning (one can't use already existing hardware), and how much downtime is needed (until the storage is reconfigured and the systems are reinstalled). If the sysadmin decides to reconfigure an old server, the data need to be migrated elsewhere as well.

(In closing the intro, I'd like to say a word about capitalism and consumerism: many programmers and sysadmins complain that the company they work for makes their lives hard by insisting on reusing existing old hardware instead of investing in new, bleeding edge hardware. This denotes laziness, or an inability on the part of the developer or the sysadmin to plan for repurposing systems or to optimize code.

Of course, with money anyone can do anything. The real art is to make a profit out of the smallest investment. The bigger the profit, the more of it the employees can enjoy, rather than external companies that sell new hardware, software or support services. This is why Google, Facebook and other successful companies not only write their own software and use free and open source software, but also design their own hardware and datacenters and order components directly from original equipment manufacturers (OEMs).

The more easily an employee's job can be outsourced, the lower its value. The value of a company emerges from the quality of its employees' work, which can be calculated as profit minus spending, not from the goods that the company acquires. Otherwise, a company could outsource all its activities and become a hedge fund.)

The Network File System

Even though, following a team decision, we're using GlusterFS, I'll present a solution that we have tested for the last 3 years: MooseFS. In the previous article I said that we had started to use XtreemFS. That storage system's architecture looked interesting at first, with no single point of failure. In the meantime I learned that the system is not extensible enough with regard to its components - the metadata and replica catalog service (MRC) and the directory service (DIR) - and that its authors intend to distribute it only as a binary in the future, which goes against the principles I discussed at the end of the intro.

Initially, I researched all the network filesystem solutions available, reading about other people's experiences and getting to understand the theoretical principles behind each filesystem. As a result, I chose two network filesystems for testing: Lustre and MooseFS. When an infrastructure architect decides on a technology to be used in production, most of the research time is spent on documentation and on creating feature lists for each solution, from the technical, financial and time cost points of view.

As for the project metrics of age and popularity, these are critical topics that should be discussed in depth in order to avoid arriving at the wrong conclusions. An old project is almost certainly more popular than a new project in absolute terms, because it has had more time to establish a foothold on the market, during which the new system didn't exist yet. To compare two systems correctly, we must estimate where the new system would stand after the same number of years that the old system covers, assuming the adoption (growth) rate observed over the new system's lifetime.

For example, if we were to compare GlusterFS and MooseFS, we'd make the following calculation. Assuming that GlusterFS is 10 years old and MooseFS is 2 years old, we look at the growth rate of MooseFS in its first 2 years and use it to estimate the growth for the next 8 years. Then we compare the estimated 10-year growth of MooseFS with the actual 10-year growth of GlusterFS.
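The extrapolation described above can be sketched in a few lines. This is only one possible reading of it: it assumes that "growth rate" means a compound yearly rate that stays constant, and the installation counts are invented purely for illustration, not real figures for either project.

```python
def projected_adoption(current, start, years_observed, years_total):
    """Extrapolate adoption, assuming the observed compound yearly growth rate holds."""
    yearly_rate = (current / start) ** (1 / years_observed)
    return current * yearly_rate ** (years_total - years_observed)


# Hypothetical figures: 100 installations at launch, 400 after 2 years.
estimate = projected_adoption(current=400, start=100, years_observed=2, years_total=10)
print(round(estimate))  # 102400
```

The estimate is then set against the older project's actual figure after 10 years. A constant rate over 8 more years is a strong assumption, so the result is better read as an order of magnitude than as a forecast.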

A new open source project is not created because a developer decides they have nothing to do at night. That developer searched for an existing open source system to use in his setup and found nothing that was good enough. He probably also looked at commercial solutions and found nothing there either. The decision to write a brand new solution from scratch is not taken lightly, especially when you're about to write a new filesystem that targets the very organizations whose functioning relies on those data, including the organization the project's author is part of. Also, if an open source project attracts many contributors and the solution is adopted in academia and by businesses, replacing other solutions, we have proof that the new system is technically superior and will replace the old system where possible.

There's also the myth that old systems are more stable than new ones. Indeed, it can happen that old systems contain fewer bugs, but even an old system that is still under development keeps accumulating new bugs. There is no mathematical method to prove whether a software system still has bugs, or how many bugs it has compared with another (new or old) system. A software system that has been used in production by multiple organizations for some time (even if only for one year) can be considered as stable and reliable as an old system. On the other hand, an old system may be known to have unfixable architectural flaws that led developers to create new systems. In conclusion, the wisest decision for an organization's system architect is to adopt a solution that is as new as possible, is marked as stable and is used in production by other organizations. At the same time, the system administrator must assume the responsibility of contributing to the projects that produced the software he or she uses, by submitting patches or at least bug reports back to the project's author. Another responsibility the sysadmin must assume is protecting the organization's data against corruption or deletion. (I will not talk about protection against data theft here, because the topic of the article is data storage, not data security.) The sysadmin must assure the organization that responsibility for the data is his or hers and should not be outsourced to another company. Any software has bugs, and the sysadmin is the one responsible for dealing with those bugs.

Lustre is employed in scientific research and academia, where there's a need to handle large amounts of data using supercomputers. Lustre is tested and used on supercomputers in the world's Top 500 list. The sysadmins working with those supercomputers are at the top of their field, and sysadmins from commercial companies can rely on their decisions. When I decided to test Lustre (in 2011), I learned that it didn't support new Linux kernel versions (it needed a patched Linux version 2.6.18), which was not compatible with the drivers and performance enhancements we needed on our systems, as these were only available on newer Linux versions. The next year, Intel acquired Whamcloud, the company developing Lustre at the time, and promised to add support for Linux 3.0. They failed to keep their promise, but the community Lustre project evolved, and today the Lustre client is part of the Linux 3 kernel series.

Anyhow, we needed to use something back in 2011. The next project on our list, which offered the most features and had a fail-proof architecture, was MooseFS. MooseFS has two architectural problems:

  1. the client is implemented as a userspace program using FUSE, so it runs just like any other userspace process on the system, which suggests low performance because it competes on equal terms with other user processes; in our tests, though, we found that it performs quite well; and
  2. the master node was not redundant out of the box; this second problem could be solved by the sysadmin.

The architecture of MooseFS allows the extensibility of the file system on all hardware nodes in the datacenter. MooseFS creates on the storage nodes a file structure of "data files", resembling the data files used by the Oracle database.

Inside those data files, MooseFS stores the actual files using an internal algorithm: first it splits the files into "chunks" of variable size, up to 64 MB, and then it distributes and clones those chunks on multiple nodes, which it calls "chunkservers". Each file to be stored has a "goal" that represents the number of copies of that file to be kept on the MooseFS system. The copies are stored on different nodes, so that if one copy goes missing after a "chunkserver" becomes unavailable, the file is still accessible. Of course, a file can have any number of copies, distributed on any number of nodes, not only two.
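To make the chunk and goal concepts concrete, here is a minimal sketch. It is not MooseFS's actual algorithm: the round-robin placement, the server names and both function names are assumptions made for illustration; only the 64 MB upper bound and the rule that copies land on different nodes come from the description above.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # upper bound on chunk size, per the text


def split_into_chunks(file_size):
    """Return the sizes of the chunks a file of file_size bytes would occupy."""
    full, rest = divmod(file_size, CHUNK_SIZE)
    return [CHUNK_SIZE] * full + ([rest] if rest else [])


def place_copies(chunk_index, servers, goal):
    """Pick `goal` distinct servers for one chunk (simple round-robin, hypothetical)."""
    if goal > len(servers):
        raise ValueError("goal exceeds the number of chunkservers")
    return [servers[(chunk_index + i) % len(servers)] for i in range(goal)]


servers = ["cs1", "cs2", "cs3"]
chunks = split_into_chunks(150 * 1024 * 1024)  # a 150 MB file
layout = {i: place_copies(i, servers, goal=2) for i in range(len(chunks))}
print([size // (1024 * 1024) for size in chunks])  # [64, 64, 22]
print(layout)
```

With a goal of 2 and three chunkservers, every chunk still has one reachable copy if any single server goes down.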

The coordinates of the files are kept on the master node, whose role is to arbitrate the distribution and cloning of the chunks. The master holds the information about files in a file called metadata.mfs. Besides this file, it creates a number of transaction log files, so that the metadata.mfs file can be recreated from these logs if it becomes corrupted. Besides the local logs, to ensure the recovery of the stored files in case of a master node failure, the master node replicates all its data (the metadata.mfs and transaction log files) onto another node type in the MooseFS architecture: the mfsmetalogger node. At any moment, an mfsmetalogger node can be promoted to master using its copy of the metadata.mfs and transaction logs. The MooseFS architecture allows any number of chunkserver and metalogger nodes, but only one master node.
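The idea of rebuilding metadata from a snapshot plus transaction logs can be sketched as follows. The record layout and operation names are hypothetical and do not reflect the real metadata.mfs or changelog formats; the sketch only shows the principle of replaying logged operations in order.

```python
def replay(snapshot, changelog):
    """Apply logged operations, in order, to a copy of the last metadata snapshot."""
    metadata = dict(snapshot)
    for operation, path, value in changelog:
        if operation == "create":
            metadata[path] = value
        elif operation == "delete":
            metadata.pop(path, None)
    return metadata


# Hypothetical snapshot and log entries.
snapshot = {"/a.txt": {"goal": 2}}
changelog = [
    ("create", "/b.txt", {"goal": 3}),
    ("delete", "/a.txt", None),
]
print(replay(snapshot, changelog))  # {'/b.txt': {'goal': 3}}
```

A node that holds both the snapshot and the logs, as the metalogger does, has everything it needs to reconstruct the current state.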

The MooseFS clients access the file system by querying the master node, which in turn redirects them to the storage servers (chunkservers), where the data input/output operations take place. So, the file transfer operations don't take place between the master and the MooseFS clients, but directly between the clients and the chunkservers, which means that the transfers are distributed among all the servers in the datacenter, leading to higher transfer speeds.
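The separation between the metadata path and the data path can be modelled with a toy example. The classes and method names below are invented for illustration and have nothing to do with the real MooseFS protocol; the point is only that the master answers "where", while the bytes come from the chunkservers.

```python
class Master:
    """Holds only metadata: which chunkservers store each chunk of a file."""

    def __init__(self, locations):
        self.locations = locations

    def locate(self, path):
        return self.locations[path]


class ChunkServer:
    """Holds the actual chunk data."""

    def __init__(self, chunks):
        self.chunks = chunks

    def read(self, chunk_id):
        return self.chunks[chunk_id]


def read_file(path, master, chunkservers):
    """Ask the master where the chunks are, then fetch the data from the chunkservers."""
    data = b""
    for chunk_id, server_names in master.locate(path):
        server = chunkservers[server_names[0]]  # any server holding a copy would do
        data += server.read(chunk_id)
    return data


chunkservers = {
    "cs1": ChunkServer({1: b"hello "}),
    "cs2": ChunkServer({1: b"hello ", 2: b"world"}),
}
master = Master({"/greeting.txt": [(1, ["cs1", "cs2"]), (2, ["cs2"])]})
print(read_file("/greeting.txt", master, chunkservers))  # b'hello world'
```

Since different chunks can be served by different machines, many clients reading at once spread their load over the whole set of chunkservers instead of a single file server.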

From MooseFS's architecture we can infer an interesting use case: we can create a distributed, performant file system that tolerates chunkserver failures, using all the systems on the network. This is true not only for the datacenter, but also for a typical office, where there are hundreds of workstations with free and unused storage space. Thus, the employees in the office will store files literally "on the network", not on a dedicated, expensive and failure-prone file server that can fill up at any moment and that needs the care of the sysadmins.

A seasoned sysadmin will immediately ask: but aren't the disks going to fill up? No: MooseFS accounts for the fill percentage of its nodes (counting not only its own data files, but also the other files each node stores locally on the same storage as MooseFS). MooseFS tries to even out the percentage of storage used on all available chunkservers. Because the accounting is percentage-based, MooseFS does not require all disks to have the same size: all disks on all of MooseFS's chunkservers will be filled to the same percentage, regardless of size.
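Percentage-based balancing can be illustrated with a small simulation. The greedy rule used here (always write to the chunkserver with the lowest fill percentage) is an assumption chosen for simplicity, not MooseFS's documented placement policy, and the server names and sizes are made up.

```python
from fractions import Fraction


def pick_chunkserver(servers):
    """Choose the server with the lowest used/total ratio (exact arithmetic)."""
    return min(servers, key=lambda name: Fraction(servers[name]["used"], servers[name]["total"]))


def store_chunks(servers, chunk_size, count):
    """Store `count` chunks, each on the currently least-filled server."""
    for _ in range(count):
        target = pick_chunkserver(servers)
        servers[target]["used"] += chunk_size
    return servers


# Hypothetical chunkservers with different disk sizes (in GB).
servers = {
    "small": {"used": 0, "total": 100},
    "large": {"used": 0, "total": 400},
}
store_chunks(servers, chunk_size=1, count=250)
for name, disk in servers.items():
    print(name, disk["used"], f"{disk['used'] / disk['total']:.0%}")
```

After 250 GB have been written, the small disk holds 50 GB and the large one 200 GB: both are half full, although the large one stores four times as much data.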

Because all files are cloned on different nodes, we do not need RAID mirroring at node level, which is also inefficient in terms of speed, to ensure data protection. The data are protected by internode, not intranode, replication. The replication goal can be finely controlled by the sysadmin, even at file level, without downtime of the system or any of its components. Also, if no data is stored on MooseFS, its storage use on the chunkservers is almost zero. MooseFS uses only as much storage space as is needed, and there is no need to repartition the nodes - it just uses a directory; the systems administrator doesn't have to plan anything ahead of time, only install and start MooseFS's component services.

To increase the storage capacity, the sysadmin can at any future time add new chunkserver nodes or extend the capacity of old nodes without notifying anyone, since the unavailability of one storage node has no consequences for the availability of the network storage system. The only single point of failure in MooseFS's standard configuration is the master node, but with manual intervention from the sysadmin, any metalogger node can become a master node in less than 10 seconds.

Distributed Network File System Use Cases

In our company we process large amounts of data. The data come in periodically, in a denormalized form. Some of the data are free and some are not. Over time we receive more and more data, and more than once the sysadmin team has been faced with storage space running out. Sometimes that happened at 2:47 AM. It is not only the data we receive that need storage: the backup and archiving processes need storage as well. These needs only grow over time. Using standard RAID solutions we can scale until the storage enclosures are completely filled with disks. This type of scaling does not offer protection against the failure of the host node, though.

Also, the "virtual machines" need to be able to shut down on one node and start up on another in the shortest time possible, without losing data. Thus, we can't shut down a virtual machine, copy its filesystem over the network to another host and then start it back up, because of the long downtime required. If the virtual machine is stored on the network, we can simply stop it on one node and start it up on another, without copying anything between nodes.

In an office, the sysadmin can run virtual machines on the employees' workstations; these machines are stored on the network and can automatically be started on another node when the current host node is shut down. Thus, we can run nightly tests and builds or grid computing directly on the employees' workstations, without the need to buy expensive dedicated servers.

Choosing a performant distributed network file system is vital for fail-proofing the systems and for efficient resource usage. What can be more important than knowing that ALL the disks in the datacenter are filled to the same percentage and that, whenever there's a need for more storage space, one can add more chunkserver nodes without causing any downtime? MooseFS offers other interesting features for those who want to run nearly identical virtual machines: snapshotting and a configurable recycle bin.

MooseFS's recycle bin removes files once they reach a certain age, configurable by the sysadmin. So, the recycle bin does not need to be emptied manually and entirely; any file that ends up in the recycle bin is removed automatically after a certain time. All deleted files end up in the recycle bin, without exception, and the sysadmin cannot avoid this. One can configure the recycle bin to remove files after 0 seconds, though, which has the effect of instant and permanent removal.

Snapshotting is a method of storing identical chunks only once for all the files that contain them. Those files point to the same on-disk data, creating the illusion that the data is stored multiple times. This method is also known as "deduplication". When a change occurs in one file on a shared chunk, that chunk is stored separately from that moment on. The rest of the chunks, though, continue to be shared, so the total amount of storage needed for those files equals the size of the common chunks plus the size of the differing chunks. The practice of storing only differences is called "copy on write" (COW). Thus, if the sysadmin creates a virtual machine template, he or she can create multiple snapshots of the template and start up virtual machines that use them independently. For example, to scale up an overloaded webserver, one can spin up multiple virtual machines (VMs) that use the same on-disk data; differences only start to occur when each VM clone writes its own logs or other specific data to its disk. The mfsmakesnapshot command creates this type of file.
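The copy-on-write behaviour described above can be modelled in a few lines. This is a toy model under stated assumptions, not how MooseFS implements snapshots: the ChunkStore class, its reference counting and the chunk contents are all invented for illustration.

```python
class ChunkStore:
    """Keeps chunk payloads with reference counts so snapshots can share them."""

    def __init__(self):
        self.chunks = {}   # chunk id -> bytes
        self.refs = {}     # chunk id -> number of files pointing at it
        self.next_id = 0

    def add(self, payload):
        chunk_id = self.next_id
        self.next_id += 1
        self.chunks[chunk_id] = payload
        self.refs[chunk_id] = 1
        return chunk_id

    def snapshot(self, chunk_ids):
        """A snapshot is a new list of references to the same chunks."""
        for chunk_id in chunk_ids:
            self.refs[chunk_id] += 1
        return list(chunk_ids)

    def write(self, chunk_ids, index, payload):
        """Copy on write: a shared chunk is replaced by a private one when modified."""
        old = chunk_ids[index]
        if self.refs[old] > 1:
            self.refs[old] -= 1
            chunk_ids[index] = self.add(payload)
        else:
            self.chunks[old] = payload


store = ChunkStore()
template = [store.add(b"os-image"), store.add(b"empty-log")]
vm1 = store.snapshot(template)      # no data is copied here
store.write(vm1, 1, b"vm1-log")     # only the modified chunk gets its own copy
print(len(store.chunks))            # 3
```

The template and its clone together occupy three chunks instead of four: the operating system chunk stays shared, and only the chunk that the clone modified is stored separately.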

In the third part of the article I'll talk about LXC, a virtualization method for Linux systems that does not suffer from poor performance compared to a bare metal system, and that enables controlled updates without downtime.


Diagrams are taken from www.moosefs.org and they are the intellectual property of MooseFS's authors.



