Monday 12 January 2009

Clustering, Failover and RAID.

Clustering.

In some ways, a server cluster is a little like a redundant array of independent disks (RAID). With RAID, multiple physical drives work together to hold a file system. In a RAID 5 array, for example, data is striped across multiple drives, with parity information distributed among them. If a drive in the array fails, you can replace the failed drive and the array rebuilds itself. If the array contains a hot spare, it can bring the spare into service and rebuild without any intervention. The result is that the logical volume and its data remain available.

In a server cluster, an active server can fail or be taken offline without affecting the service being provided by the cluster. In a SQL Server cluster, for example, you might take the active server offline for maintenance for several hours, but because the other server in the cluster remains online and takes over the task of serving SQL requests, users never know the server is offline. A cluster therefore provides fault tolerance by allowing other servers in the cluster to take over the workload of a failed server.

In stateful clustering, the cluster maintains the user and application state during a failover, with that state failing over to the other server. This means that users who access an Exchange server cluster will not lose access to their mailboxes or other Exchange features if their active server in the cluster goes down, even if they have an open connection to the server when the failure occurs.

Failover.

Failover is the ability of the cluster to move application processing from one server in the cluster to another when a hardware or application failure occurs. For example, if one of the servers in our fictitious SQL Server cluster fails, the transactions being handled by the failed server can migrate to a healthy server in the cluster. When the failed server comes back online in the cluster, the application can fail back to it.
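
The failover and failback behaviour described above can be sketched in a few lines of Python. This is only a toy model under simple assumptions: the Cluster class, its method names, and the node names "sql-a" and "sql-b" are all made up for illustration and do not correspond to any real clustering product.

```python
class Cluster:
    """Toy active/passive cluster: exactly one node serves requests at a time."""

    def __init__(self, preferred, standby):
        self.preferred = preferred                 # node that should be active when healthy
        self.healthy = {preferred: True, standby: True}
        self.active = preferred

    def mark_failed(self, node):
        """Record a failure; if the active node failed, fail over to a healthy one."""
        self.healthy[node] = False
        if node == self.active:
            survivors = [name for name, ok in self.healthy.items() if ok]
            if not survivors:
                raise RuntimeError("no healthy node left in the cluster")
            self.active = survivors[0]             # failover

    def mark_recovered(self, node):
        """Record a recovery; fail back if the preferred node has returned."""
        self.healthy[node] = True
        if node == self.preferred:
            self.active = node                     # failback

    def handle(self, request):
        """Route a request to whichever node is currently active."""
        return f"{self.active} handled {request}"


cluster = Cluster("sql-a", "sql-b")
print(cluster.handle("query 1"))   # served by the preferred node
cluster.mark_failed("sql-a")
print(cluster.handle("query 2"))   # served by the standby after failover
cluster.mark_recovered("sql-a")
print(cluster.handle("query 3"))   # served by the preferred node after failback
```

A real cluster would also have to detect failures (for example with heartbeats) and move shared state between nodes, which this sketch leaves out.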

RAID.

RAID (Redundant Array of Independent Disks) is a technology that uses two or more hard disk drives simultaneously to achieve better performance, greater reliability, and larger data volumes.

RAID combines two or more physical hard disks into a single logical unit by using either special hardware or software.

There are three key concepts in RAID: mirroring, the copying of data to more than one disk; striping, the splitting of data across more than one disk; and error correction, where redundant data is stored to allow problems to be detected and possibly fixed (known as fault tolerance).

When several physical disks are set up to use RAID technology, they are said to be in a RAID array. This array distributes data across several disks, but the array is seen by the computer user and operating system as one single disk.

Redundancy means that extra data is written across the array, organized so that the failure of one disk (sometimes more) in the array will not result in loss of data. A failed disk may be replaced by a new one, and the data on it reconstructed from the remaining data and the extra data. The cost is that a redundant array stores less data than its disks could hold independently. For instance, a 2-disk RAID 1 array loses half of the total capacity that would otherwise have been available using both disks independently, and a RAID 5 array with several disks loses the capacity of one disk. Other RAID levels are arranged for speed, so that they are faster to write to and read from than a single disk.
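
The capacity cost of redundancy can be worked out with simple arithmetic. The sketch below assumes all disks in the array are the same size; the function name usable_capacity_gb and the 500 GB disk size are just illustrative choices.

```python
def usable_capacity_gb(level, disk_count, disk_size_gb):
    """Usable capacity of an array of equally sized disks.

    level is "independent" (no RAID), 1 (mirroring) or 5 (striping with parity).
    """
    if level == "independent":
        usable_disks = disk_count          # every disk holds unique data
    elif level == 1:
        if disk_count < 2:
            raise ValueError("RAID 1 needs at least two disks")
        usable_disks = 1                   # every disk holds the same data
    elif level == 5:
        if disk_count < 3:
            raise ValueError("RAID 5 needs at least three disks")
        usable_disks = disk_count - 1      # one disk's worth of space holds parity
    else:
        raise ValueError(f"level not covered by this sketch: {level!r}")
    return usable_disks * disk_size_gb


print(usable_capacity_gb("independent", 2, 500))  # both disks used separately
print(usable_capacity_gb(1, 2, 500))              # 2-disk mirror: half of the above
print(usable_capacity_gb(5, 4, 500))              # 4-disk RAID 5: three disks' worth
```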

There are various combinations of these approaches giving different trade-offs of protection against data loss, capacity, and speed. RAID levels 0, 1, and 5 are the most commonly found, and cover most requirements.

  • RAID 0 (striped disks) distributes data across several disks in a way that gives improved speed and full capacity, but all data on all disks will be lost if any one disk fails.
  • RAID 1 (mirrored disks) could be described as a real-time copy of the data, although it does not replace backups, because deletions and corruption are mirrored too. Two (or more) disks each store exactly the same data at all times, so data is not lost as long as one disk survives. Total capacity of the array is simply the capacity of one disk.
  • RAID 5 (striped disks with parity) combines three or more disks in a way that protects data against loss of any one disk; the storage capacity of the array is reduced by one disk.
  • RAID 6 (striped disks with dual parity) is less common and can recover from the loss of two disks.
  • RAID 10 (or 1+0) uses both striping and mirroring. "01" (or "0+1") is sometimes distinguished from "10" (or "1+0"): 10 is a striped set of mirrored subsets, while 01 is a mirrored set of striped subsets. Both are valid, but distinct, configurations.
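
To see how parity lets an array survive the loss of one disk, here is a minimal sketch using XOR, the operation single-parity schemes such as RAID 5 are built on. It is a simplification: it handles one stripe of equal-length blocks and ignores how a real controller rotates parity between disks. The function names and sample data are made up for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(surviving_blocks):
    """Rebuild the one missing block of a stripe from all the others.

    surviving_blocks must contain every block of the stripe except the lost
    one, including the parity block if it survived.
    """
    return xor_blocks(surviving_blocks)


# One stripe: three data blocks plus one parity block, one block per disk.
data_blocks = [b"AAAA", b"BBBB", b"CCCC"]
parity_block = xor_blocks(data_blocks)

# Simulate losing the disk that held the second data block.
survivors = [data_blocks[0], data_blocks[2], parity_block]
print(rebuild_missing(survivors))
```

Because XOR is its own inverse, the same function rebuilds either a lost data block or a lost parity block.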

The configuration affects reliability and performance in different ways. The problem with using more disks is that it is more likely that one of them will fail, but with error checking the system as a whole can be made more reliable, because it can survive and repair the failure.
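
A rough calculation shows both effects. The sketch below assumes that disks fail independently and that each has the same probability of failing over some period; the 5% figure is a made-up example, and real failures are often correlated, so treat the numbers as illustrative only.

```python
def p_any_disk_fails(p_disk, disk_count):
    """Probability that at least one of disk_count disks fails.

    This is also the chance of data loss for a striped array with no redundancy.
    """
    return 1 - (1 - p_disk) ** disk_count


def p_all_disks_fail(p_disk, disk_count):
    """Probability that every disk fails.

    This is the chance of data loss for a mirror that is never repaired.
    """
    return p_disk ** disk_count


p = 0.05  # hypothetical per-disk failure probability over the period
print(p_any_disk_fails(p, 1))   # a single disk
print(p_any_disk_fails(p, 4))   # four disks: some failure is much more likely
print(p_all_disks_fail(p, 2))   # two-disk mirror: losing both is much less likely
```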

Basic mirroring can speed up reads, because the system can read different data from each disk at the same time, but it may slow down writes if the configuration requires both disks to confirm that the data has been written correctly. Striping is often used for performance, because it allows sequences of data to be read from multiple disks at the same time.
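
Striping itself is easy to picture as dealing chunks of data out to the disks in turn. The sketch below models each disk as a Python list of chunks; the function names, the chunk size, and the sample data are illustrative assumptions, and a real array would read the chunks from all disks in parallel rather than in a loop.

```python
def stripe(data, disk_count, chunk_size):
    """Split data into chunks and deal them out to the disks round-robin."""
    disks = [[] for _ in range(disk_count)]
    for offset in range(0, len(data), chunk_size):
        chunk_number = offset // chunk_size
        disks[chunk_number % disk_count].append(data[offset:offset + chunk_size])
    return disks


def unstripe(disks):
    """Reassemble the original data by reading one chunk from each disk in turn."""
    chunks = []
    rows = max(len(disk) for disk in disks)
    for row in range(rows):
        for disk in disks:
            if row < len(disk):        # the last row may not reach every disk
                chunks.append(disk[row])
    return b"".join(chunks)


disks = stripe(b"abcdefghij", disk_count=3, chunk_size=2)
print(disks)             # consecutive chunks land on different disks
print(unstripe(disks))   # the original data, reassembled
```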
