Expert Speak

High availability in storage systems

In today’s connected world, even brief downtime can result in substantial losses, and prolonged downtime can cripple a business. Often, the impact of downtime cannot be predicted accurately. Beyond the obvious costs in lost revenue and productivity, there can be intangible effects, such as damage to brand image, whose consequences for the business are less visible but far-reaching, opines Narayanan B, Project Manager-Storage, American Megatrends India.

Classes of Availability
In today’s volatile and uncertain world, it is extremely important to plan for contingencies that protect against possible disasters. Disasters can be software related or hardware related, and different disasters need different Disaster Recovery (DR) strategies. The figure below shows some of the common data protection strategies, which include both DR and high availability strategies.

One of the primary goals of an effective DR strategy is to minimize the amount of downtime after a disaster. However, DR does not aim to keep the data available with no downtime at all. Availability is measured as the ratio of mean time between failures (MTBF) to the sum of MTBF and mean time to repair (MTTR), that is, MTBF / (MTBF + MTTR). Thus, availability indicates the percentage of time the system is available over its useful life. One way to achieve higher availability is to decrease the MTTR (downtime), which is also the goal of DR strategies. Thus, while DR strategies are not strictly availability strategies, they do meet availability requirements to an extent.
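The ratio described above can be checked with a few lines of Python. This is a minimal sketch: the function name and the MTBF/MTTR figures are hypothetical and chosen only to show how shortening the repair time raises availability.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Return availability as a fraction: MTBF / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical system that fails, on average, once every 10,000 hours.
slow_repair = availability(10_000, 10)  # each repair takes 10 hours
fast_repair = availability(10_000, 1)   # each repair takes 1 hour

print(f"10-hour repair: {slow_repair:.4%}")
print(f" 1-hour repair: {fast_repair:.4%}")
```

With the failure rate held constant, cutting the repair time is what moves the result closer to 100 percent, which is why DR strategies contribute to availability.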

Availability is often expressed as a percentage. An availability of about 90-95 percent is sufficient for most applications. However, for extremely critical business data, such levels of availability are simply not enough. Truly highly available solutions typically offer an availability of 99.999 percent (five nines) or 99.9999 percent (six nines). Such solutions have a downtime ranging from about half a minute per year (six nines) to about five minutes per year (five nines). There are different classes of data protection mechanisms based on the availability requirements. In the figure, as one goes up the pyramid, the downtime decreases and hence the availability increases. The top two levels of the pyramid constitute strategies that represent true high availability (five nines and six nines).
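To make the availability percentages concrete, the following sketch converts each one into downtime per year. It assumes a 365-day year and counts all unavailability as downtime; the function name is illustrative.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 seconds, ignoring leap years


def downtime_seconds_per_year(availability_percent: float) -> float:
    """Return the yearly downtime implied by an availability percentage."""
    return (1 - availability_percent / 100) * SECONDS_PER_YEAR


for label, percent in [
    ("90 percent", 90.0),
    ("95 percent", 95.0),
    ("five nines", 99.999),
    ("six nines", 99.9999),
]:
    seconds = downtime_seconds_per_year(percent)
    print(f"{label:>10}: {seconds:14.1f} s/year ({seconds / 60:10.2f} min/year)")
```

At 90 percent, the implied downtime is about 36.5 days per year, while five nines allows roughly 5.3 minutes and six nines roughly 32 seconds.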

Active/Active Dual Controllers: SBB
The fundamental way to make a storage system highly available is to make every component of the system redundant. This includes the processors, memory modules, drives (using RAID), network and other host connectivity ports, power supplies, fans and other components. Even then, the disk array controller (RAID controller) and the motherboard of the system remain single points of failure.

Storage Bridge Bay (SBB) is a specification created by a non-profit working group that defines a mechanical/electrical interface between a passive backplane drive array and the electronics packages that give the array its “personality”, thereby standardizing storage controller slots. One chassis can hold multiple controllers that can be hot-swapped. Having multiple controllers means that the system is protected against controller failures as well, giving it true high availability.

However, such a configuration is not without challenges. These include the need for a mutual exclusion policy on the shared drive array to prevent data corruption and inconsistencies, the need for a cluster-aware RAID module, and the need to maintain cache coherency across the two canisters to ensure smooth failovers. In addition, although the controllers are redundant, the mid-plane connecting the controllers to the drive back-plane is still shared, making it a single point of failure.
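The mutual exclusion requirement can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not how any SBB product implements it: the class, the volume names and the controller names are hypothetical, each volume has exactly one owning controller at a time, and cache coherency is not modelled.

```python
class SharedDriveArray:
    """Toy model of a drive array shared by redundant controllers."""

    def __init__(self, volumes, controllers):
        self.alive = set(controllers)
        # Assign each volume to exactly one owner, alternating across controllers.
        self.owner = {
            vol: controllers[i % len(controllers)] for i, vol in enumerate(volumes)
        }
        self.data = {vol: [] for vol in volumes}

    def write(self, controller, volume, block):
        """Accept a write only from the live controller that owns the volume."""
        if controller not in self.alive:
            raise RuntimeError(f"{controller} is not alive")
        if self.owner[volume] != controller:
            raise PermissionError(f"{controller} does not own {volume}")
        self.data[volume].append(block)

    def fail_over(self, failed):
        """Hand every volume owned by the failed controller to a survivor."""
        self.alive.discard(failed)
        if not self.alive:
            raise RuntimeError("no surviving controller")
        survivor = sorted(self.alive)[0]
        for vol, current in self.owner.items():
            if current == failed:
                self.owner[vol] = survivor
        return survivor


array = SharedDriveArray(["vol0", "vol1"], ["ctrl_a", "ctrl_b"])
array.write("ctrl_a", "vol0", "block-1")

try:
    array.write("ctrl_b", "vol0", "block-2")  # rejected: ctrl_b is not the owner
    rejected = False
except PermissionError:
    rejected = True

array.fail_over("ctrl_a")                 # ctrl_a fails; ctrl_b takes over vol0
array.write("ctrl_b", "vol0", "block-3")  # now accepted
```

The point of the single-owner rule is that two controllers never write to the same volume concurrently, so a failover changes who may write without risking interleaved, inconsistent updates.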

High Availability Cluster

The highest class of availability in the availability pyramid is achieved using High Availability (HA) Clusters. An HA Cluster is a cluster of redundant storage nodes that ensures continued data availability despite the failure of individual components or even of an entire storage node. This represents the highest form of availability that is possible (six nines). Unlike SBB-based dual controller nodes, HA Clusters do not suffer from any single point of failure. In addition, since the drive arrays are not shared by the two systems, each system has its own RAID configuration, making the HA Cluster resilient to more drive failures than an SBB setup. Finally, HA Clusters are also resilient to site failures, making them the best-in-class availability solution.
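The effect of removing a shared component can be sketched with the standard series/parallel availability formulas. These formulas are not from the article; they assume failures are independent and failover is instantaneous, and every availability figure below is hypothetical.

```python
def parallel(availability: float, copies: int) -> float:
    """Redundant copies: the group is up while at least one copy is up."""
    return 1 - (1 - availability) ** copies


def series(*availabilities: float) -> float:
    """Chained components: the group is up only if every component is up."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


# Hypothetical component availabilities, as fractions.
controller = 0.999   # one controller canister
midplane = 0.9999    # shared mid-plane in a dual-controller chassis
node = 0.999         # one complete, independent storage node

# Dual controllers behind a shared mid-plane: the mid-plane caps the result.
dual_controller = series(midplane, parallel(controller, 2))

# Two independent nodes with no shared component.
ha_cluster = parallel(node, 2)

print(f"dual controller + shared mid-plane: {dual_controller:.6%}")
print(f"two-node HA cluster:                {ha_cluster:.6%}")
```

Under these assumptions, the dual-controller result can never exceed the availability of the shared mid-plane, whereas the cluster of independent nodes has no such ceiling.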

Conclusion
While storage can be made highly available by using one or more of the above classes of data protection, true storage high availability can be ensured only when redundancy is built into every component of the storage sub-system. This means that not only the storage and its components must be redundant, but also the paths to the storage and the switches connecting it to the application servers. Today, high availability is no longer a luxury reserved for businesses with huge budgets; it has become essential for the uninterrupted and productive operation of any business.
