Q: MTBF with redundancy ( Answered,   1 Comment )
Question  
Subject: MTBF with redundancy
Category: Science > Math
Asked by: isitaboat-ga
List Price: $25.00
Posted: 18 May 2006 13:28 PDT
Expires: 17 Jun 2006 13:28 PDT
Question ID: 730165
I'm trying to find a simple way of calculating MTBF for a RAID 5 or
RAID 6 array combined with a controller card.

Basically the redundant part of the system has N items. I'd like the
math to calculate the probability of failure of M drives at any one
point. The N drives may have different MTBFs.

I.e.

a RAID 5 array fails when 2 of N drives fail. The array is repairable
if 1 drive fails.

a RAID 6 array fails when 3 of N drives fail. The array is repairable
if up to 2 drives fail.


(not sure if the repairability bit is useful to you)


Please provide examples or links to tutorials explaining this clearly
and concisely.


Also, I would like to know how to properly combine the failure
probability of the above redundant system with that of a controller
card (which is not redundant).

For example;

10 x discs @ 10,000 hours MTBF - redundant. Will fail if M drives fail.
1 x card @ 10,000 hours MTBF - not redundant
Answer  
Subject: Re: MTBF with redundancy
Answered By: maniac-ga on 18 May 2006 20:44 PDT
 
Hello Isitaboat,

Let me first refer you to a similar question / answer:
  http://answers.google.com/answers/threadview?id=390140
which describes the reliability calculations for three situations:
 o serial reliability (if two items, the system fails if either fails)
 o parallel reliability (if two items, the system fails if both fail)
 o composite reliability (a combination of parallel / serial reliability)
I note that several of those links are now broken, so I'll add a few
new ones to this answer to provide the "clear and concise" tutorials
and other references.
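
As a quick illustration of those three cases, here is a minimal sketch
in Python (assuming exponential, i.e. constant-rate, failures; the MTBF
values are made-up placeholders):

  import math

  def reliability(mtbf_hours, t_hours):
      # Survival probability at time t under an exponential (constant-rate) model
      return math.exp(-t_hours / mtbf_hours)

  def serial(parts):
      # Series system: fails if ANY item fails -> multiply reliabilities
      r = 1.0
      for ri in parts:
          r *= ri
      return r

  def parallel(parts):
      # Parallel system: fails only if ALL items fail -> multiply unreliabilities
      f = 1.0
      for ri in parts:
          f *= (1.0 - ri)
      return 1.0 - f

  # Two items with placeholder 10,000 and 20,000 hour MTBFs, evaluated at 1,000 hours
  r1 = reliability(10_000, 1_000)
  r2 = reliability(20_000, 1_000)
  print(serial([r1, r2]))    # lower than either item alone
  print(parallel([r1, r2]))  # higher than either item alone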

I will also assume when you refer to MTBF, you are referring to Mean
Time Between [Critical] Failure where the calculation is relative to
the performance of the whole RAID system (controller & disks) and not
to the failure of components within the system that do not cause the
RAID system to fail. In some references, this is also referred to as
Mean Time To Data Loss (MTTDL). I also assume the failure rates of the
disks are not correlated (e.g., failures not due to a "common cause"
such as overheating of the RAID system or a manufacturing defect). If
this is not true, please make a clarification request so I can fix the
answer.

The RAID 5 situation you describe has composite reliability and can be
illustrated as:
  Controller -> RAID Array
(in series) where the RAID array is treated as a complex system with
three major states:
 1 - normal operation (all drives are OK)
 2 - degraded operation (one [or two] drives failed, replacement /
repair in progress)
 3 - failed (another failure occurred when the system was degraded)
There is a complex set of calculations required to compute this
precisely, but it can be approximated in the following way.
 1 - all drives OK: for ten 100,000 hour drives, a single drive
failure occurs roughly once per 10,000 hours
 2 - one drive failed: assume 24 hours to replace / rebuild the
failed drive; with nine drives remaining, another failure occurs
roughly once per 11,111 hours (measured over those 24 hour windows)
 3 - RAID system failed: the remaining period
From Hannu Kari's PhD thesis at
  http://www.tcs.hut.fi/~hhk/phd/phd_Hannu_H_Kari.pdf
(page 6) the MTTDL of a 10+1 disk RAID 5 system is roughly 3.8 million
hours (for 100,000 hour drives).
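
As a sanity check on that figure, here is a minimal Python sketch using
the widely quoted RAID 5 approximation MTTDL = MTBF^2 / (N * (N-1) * MTTR),
together with the exact mean time to absorption of the three-state model
sketched above (my own illustration, not code from the thesis):

  def mttdl_raid5_approx(n_disks, mtbf, mttr):
      # Common RAID 5 approximation: MTBF^2 / (N * (N-1) * MTTR)
      return mtbf ** 2 / (n_disks * (n_disks - 1) * mttr)

  def mttdl_raid5_markov(n_disks, mtbf, mttr):
      # Exact mean time to absorption for the 3-state chain:
      # all OK -> one failed (repairing) -> data loss
      lam = 1.0 / mtbf               # per-disk failure rate
      lam0 = n_disks * lam           # rate of the first failure
      lam1 = (n_disks - 1) * lam     # rate of a second failure while degraded
      mu = 1.0 / mttr                # repair rate
      return (lam0 + lam1 + mu) / (lam0 * lam1)

  # 10+1 disk RAID 5, 100,000 hour disks, 24 hour repair
  print(mttdl_raid5_approx(11, 100_000, 24))  # ~3.79 million hours
  print(mttdl_raid5_markov(11, 100_000, 24))  # ~3.81 million hours

Both land within a couple of percent of the 3.8 million hour figure
quoted in the thesis.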

Taking the MTTDL above in combination with a 1,000,000 hour MTBF for
the controller, the composite reliability is calculated as follows
(rates expressed as failures per million hours):
  lambda of RAID = 1,000,000/3,800,000 = 0.263
  lambda of controller = 1,000,000/1,000,000 = 1.000
  lambda of both = 1.000 + 0.263 = 1.263
  MTBF of both = 1,000,000/1.263 = about 792 thousand hours.
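
The same arithmetic written out in Python (rates expressed as failures
per million hours):

  MILLION = 1_000_000

  lam_raid = MILLION / 3_800_000         # ~0.263 failures per million hours
  lam_controller = MILLION / 1_000_000   # 1.000 failures per million hours
  lam_total = lam_raid + lam_controller  # series components: rates add
  print(MILLION / lam_total)             # ~792,000 hours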

The RAID 6 example is similar with four major states as:
 1 - all drives OK,
 2 - one drive failed
 3 - two drives failed
 4 - RAID system failed
I could not find the result for this directly on the internet, but it
is at least several hundred times more reliable than the RAID 5
example (the improvement factor is approximated by 10,000/24).
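
For a rough feel, the commonly quoted RAID 6 extension of the
approximation used above is MTTDL = MTBF^3 / (N * (N-1) * (N-2) * MTTR^2);
treat the choice of formula as my assumption rather than a result from
the thesis:

  def mttdl_raid6_approx(n_disks, mtbf, mttr):
      # Commonly quoted RAID 6 approximation (assumed here, not taken from the thesis):
      # the array survives any two failures and dies on a third during overlapping rebuilds
      return mtbf ** 3 / (n_disks * (n_disks - 1) * (n_disks - 2) * mttr ** 2)

  raid6 = mttdl_raid6_approx(12, 100_000, 24)   # 10+2 disk RAID 6
  raid5 = 100_000 ** 2 / (11 * 10 * 24)         # RAID 5 figure from above
  print(raid6)                                  # ~1.3e9 hours
  print(raid6 / raid5)                          # a few hundred, consistent with ~10,000/24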

Chapter 6 of the thesis (same location) has some nice illustrations of
the states and explanations of the math involved to do the analytical
solution for RAID-5 and several approximations. I suggest skipping the
analytic solution and reviewing the approximations. I also suggest you
stop reading this section where it goes into Disk Array Fault models
unless you are interested in the variety of ways that disks operate
and fail.

For other references, I suggest viewing:

http://www.softwareresearch.net/site/teaching/SS2003/PDFdocs.EmbC/16_fault_tolerance.pdf
a general tutorial on reliability, describing serial and parallel
reliability as well as how real equipment works and fails.

http://www.rism.com/Trig/binomial.htm
An explanation of combinations (along with related calculations). Can
be used in reliability calculations that do not take into account
repair. If you don't repair / replace a failed drive, that MTTDL of
3.8 million hours in a RAID 5 system goes down to roughly 110,000 hours. If you
use Excel, =COMBIN does the combination calculation.
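
For the no-repair case, the combination-based calculation looks like
this in Python (a sketch of the =COMBIN-style arithmetic, not a
reproduction of the 110,000 hour figure):

  from math import comb, exp

  def prob_exactly_k_failed(n_drives, k, mtbf, t_hours):
      # P(exactly k of n drives have failed by time t), exponential model, no repair
      p = 1.0 - exp(-t_hours / mtbf)   # single-drive failure probability by time t
      return comb(n_drives, k) * p ** k * (1.0 - p) ** (n_drives - k)

  def prob_array_failed(n_drives, tolerated, mtbf, t_hours):
      # P(more than `tolerated` drives failed by t); RAID 5: tolerated=1, RAID 6: tolerated=2
      survives = sum(prob_exactly_k_failed(n_drives, k, mtbf, t_hours)
                     for k in range(tolerated + 1))
      return 1.0 - survives

  # 11 drives, 100,000 hour MTBF, one year (8,760 hours), no repair or replacement
  print(prob_array_failed(11, 1, 100_000, 8_760))  # RAID 5
  print(prob_array_failed(11, 2, 100_000, 8_760))  # RAID 6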

Using the search phrase
  reliability tutorial
in Google also brings up a number of good tutorials on reliability in
general. I added RAID to the search phrase for some of my searches,
but that generally did NOT bring up helpful information. Other phrases
such as
  MTTDL
  "RAID 6" reliability MTTDL
can help you find other resources as well.

There are also several good books referred to in the PhD thesis. You
may want to review one of them as well (check your local university
library).

If any part of this answer is unclear or incomplete, please make a
clarification request before rating the answer so I can make the
appropriate corrections.

Good luck with your work.
  --Maniac
Comments  
Subject: Re: MTBF with redundancy
From: kottekoe-ga on 19 May 2006 13:01 PDT
 
I suppose this is all covered in the thesis referred to above, but you
cannot answer the question unless you know the probability of failure
as a function of life. The simplest assumption is that the failure
probability is independent of time, which I would expect to be a poor
assumption for a mechanical device subject to wear-out mechanisms.
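
To make that concrete, here is a small Python sketch contrasting a
constant failure rate (exponential model) with an age-increasing,
wear-out style rate (Weibull with shape > 1); the parameter values are
arbitrary:

  def exponential_hazard(t, mtbf):
      # Constant hazard: the failure rate does not depend on age
      return 1.0 / mtbf

  def weibull_hazard(t, scale, shape):
      # Weibull hazard: increases with age when shape > 1 (wear-out behaviour)
      return (shape / scale) * (t / scale) ** (shape - 1)

  for t in (1_000, 10_000, 50_000, 100_000):
      print(t, exponential_hazard(t, mtbf=100_000),
               weibull_hazard(t, scale=100_000, shape=2.0))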
