Hello Itisaboat,
Let me first refer you to a similar question / answer:
http://answers.google.com/answers/threadview?id=390140
which describes the reliability calculations for three situations:
o serial reliability (if two items, the system fails if either fails)
o parallel reliability (if two items, the system fails only if both fail)
o composite reliability (a combination of parallel / serial reliability)
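As a minimal sketch of these three cases (the function names and the
0.9 / 0.99 survival probabilities are my own illustrative assumptions,
not values from the referenced answer), the calculations look roughly
like this:

```python
def serial_reliability(reliabilities):
    """System works only if every item works: multiply the reliabilities."""
    result = 1.0
    for r in reliabilities:
        result *= r
    return result


def parallel_reliability(reliabilities):
    """System fails only if every item fails: multiply the failure chances."""
    all_fail = 1.0
    for r in reliabilities:
        all_fail *= (1.0 - r)
    return 1.0 - all_fail


# Hypothetical values: two items, each 90% likely to survive the period.
items = [0.9, 0.9]
print(serial_reliability(items))    # about 0.81
print(parallel_reliability(items))  # about 0.99

# Composite: a hypothetical 99% reliable item in series with the parallel pair.
print(serial_reliability([0.99, parallel_reliability(items)]))  # about 0.98
```

The composite case is just the two functions nested, which is the
pattern used for a controller in series with a redundant array.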
I note that several of the links are "broken" and I'll add a few more
new ones to this answer to provide the "clear and concise" tutorials
and other references.
I will also assume when you refer to MTBF, you are referring to Mean
Time Between [Critical] Failure where the calculation is relative to
the performance of the whole RAID system (controller & disks) and not
to the failure of components within the system that do not cause the
RAID system to fail. In some references, this is also referred to as
Mean Time To Data Loss (MTTDL). I also assume the failure rates of the
disks are not correlated (e.g., failures not due to a "common cause"
such as overheating of the RAID system or a manufacturing defect). If
this is not true, please make a clarification request so I can fix the
answer.
The RAID 5 situation you describe has composite reliability and can be
illustrated as:
Controller -> RAID Array
(in series) where the RAID array is treated as a complex system with
three major states:
1 - normal operation (all drives are OK)
2 - degraded operation (one [or two] drives failed, replacement /
repair in progress)
3 - failed (another failure occurred when the system was degraded)
A complex set of calculations is required to compute this precisely,
but it can be approximated in the following way.
1 - all drives OK, for ten 100,000 hour drives, a single drive
failure roughly once per 10,000 hours
2 - one drive failed, assume it takes 24 hours to replace / rebuild
the failed drive; among the nine remaining drives, another failure
occurs roughly once per 11,111 hours of degraded operation (that is,
hours accumulated across the 24 hour repair periods)
3 - RAID system failed, the remaining period
From Hannu Kari's PhD thesis at
http://www.tcs.hut.fi/~hhk/phd/phd_Hannu_H_Kari.pdf
(page 6) the MTTDL of a 10+1 disk RAID 5 system is roughly 3.8 million
hours (for 100,000 hour drives).
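The three-state approximation above can be sketched in a few lines.
This is my own simplified version, assuming independent drives with
constant failure rates and a fixed 24 hour repair window; the function
and parameter names are hypothetical, and the thesis uses more detailed
models.

```python
def raid5_mttdl_hours(n_drives, drive_mttf_hours, repair_hours):
    """Rough MTTDL for an array that survives exactly one drive failure."""
    # State 1: mean time until the first of n drives fails.
    first_failure = drive_mttf_hours / n_drives
    # State 2: mean time until one of the remaining n-1 drives fails.
    second_failure = drive_mttf_hours / (n_drives - 1)
    # Approximate chance that a second failure lands inside one repair window.
    p_loss_per_repair = repair_hours / second_failure
    # On average 1/p degraded periods pass before one of them ends in data loss.
    return first_failure / p_loss_per_repair


print(round(raid5_mttdl_hours(10, 100_000, 24)))  # roughly 4.6 million hours
print(round(raid5_mttdl_hours(11, 100_000, 24)))  # roughly 3.8 million hours
```

With 11 drives (10+1) and a 24 hour repair this lands close to the 3.8
million hour figure quoted from the thesis, although the thesis's own
assumptions may differ from mine.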
Taking the MTTDL above in combination with a 1,000,000 hour MTBF of a
controller results in the composite reliability calculated as follows
(lambda is the failure rate, here in failures per million hours):
lambda of RAID = 1,000,000/3,800,000 = 0.263
lambda of controller = 1,000,000/1,000,000 = 1.000
lambda of both = 1.000 + 0.263 = 1.263
MTBF of both = 1,000,000/1.263 = about 792 thousand hours.
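The same series calculation can be checked with a short sketch (the
function name is my own; it assumes constant, independent failure
rates, so that the rates of items in series simply add):

```python
def series_mtbf_hours(*mtbfs_hours):
    """Combined MTBF of items in series: failure rates (1/MTBF) add up."""
    total_rate = sum(1.0 / m for m in mtbfs_hours)
    return 1.0 / total_rate


raid_mttdl = 3_800_000       # hours, RAID 5 figure used above
controller_mtbf = 1_000_000  # hours, controller figure used above
print(round(series_mtbf_hours(raid_mttdl, controller_mtbf)))  # about 791,667
```

Working in raw hours avoids the rounding of 0.263, which is why the
result is 791,667 rather than exactly 792 thousand.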
The RAID 6 example is similar with four major states as:
1 - all drives OK,
2 - one drive failed
3 - two drives failed
4 - RAID system failed
I could not find the result for this directly on the internet but it
is at least several hundred times more reliable than the RAID 5
example (approximated by 10,000/24).
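To show where a factor of that size comes from, here is a rough sketch
that extends the same state-by-state reasoning to two tolerated
failures. It is my own approximation, not a result from the thesis: it
assumes independent drives, constant failure rates and a fixed repair
window, and the function names are hypothetical.

```python
def raid6_mttdl_hours(n_drives, drive_mttf_hours, repair_hours):
    """Rough MTTDL for an array that survives two concurrent drive failures."""
    first = drive_mttf_hours / n_drives         # all drives OK
    second = drive_mttf_hours / (n_drives - 1)  # one drive failed
    third = drive_mttf_hours / (n_drives - 2)   # two drives failed
    # Each degraded state only rarely sees a further failure during a repair.
    return first * (second / repair_hours) * (third / repair_hours)


def raid6_improvement_factor(n_drives, drive_mttf_hours, repair_hours):
    """How many times longer the RAID 6 MTTDL is than the RAID 5 estimate."""
    raid5 = (drive_mttf_hours / n_drives) * (
        drive_mttf_hours / (n_drives - 1) / repair_hours
    )
    return raid6_mttdl_hours(n_drives, drive_mttf_hours, repair_hours) / raid5


print(round(raid6_improvement_factor(10, 100_000, 24)))  # about 521
```

Under these assumptions the factor comes out in the same range as the
10,000/24 estimate (about 417), so "several hundred times" holds either
way.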
Chapter 6 of the thesis (same location) has some nice illustrations of
the states and explanations of the math involved to do the analytical
solution for RAID-5 and several approximations. I suggest skipping the
analytic solution and reviewing the approximations. I also suggest you
stop reading this section where it goes into Disk Array Fault models
unless you are interested in the variety of ways that disks operate
and fail.
For other references, I suggest viewing:
http://www.softwareresearch.net/site/teaching/SS2003/PDFdocs.EmbC/16_fault_tolerance.pdf
a general tutorial on reliability, describing serial and parallel
reliability as well as how real equipment works and fails.
http://www.rism.com/Trig/binomial.htm
An explanation of combinations (along with related calculations). Can
be used in reliability calculations that do not take into account
repair. If you don't repair / replace a failed drive, that MTTDL of
3.8 million in a RAID 5 system goes down to roughly 110,000. If you
use Excel, =COMBIN does the combination calculation.
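If you prefer a script to a spreadsheet, Python's math.comb (available
from Python 3.8) does the same job as =COMBIN. The sketch below uses it
for a no-repair binomial calculation; the function name and the 5%
per-drive failure probability are made-up illustrative assumptions.

```python
import math


def prob_at_least_k_failures(n_drives, k, p_fail):
    """Chance that k or more of n independent drives fail (no repair)."""
    # Sum the binomial terms C(n, i) * p^i * (1 - p)^(n - i) for i = k..n.
    return sum(
        math.comb(n_drives, i) * p_fail**i * (1 - p_fail) ** (n_drives - i)
        for i in range(k, n_drives + 1)
    )


print(math.comb(10, 2))  # 45 ways to pick which 2 of 10 drives fail

# Hypothetical: each drive has a 5% chance of failing during the period.
# A RAID 5 array loses data once two or more drives have failed.
print(prob_at_least_k_failures(10, 2, 0.05))
```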
Using the search phrase
reliability tutorial
in Google also brings up a number of good tutorials on reliability in
general. I added RAID to the search phrase for some of my searches,
but that generally did NOT bring up helpful information. Other phrases
such as
MTTDL
"RAID 6" reliability MTTDL
can help you find other resources as well.
There are also several good books referred to in the PhD thesis. You
may want to review one of them as well (check your local university
library).
If any part of this answer is unclear or incomplete, please make a
clarification request before rating the answer so I can make the
appropriate corrections.
Good luck with your work.
--Maniac