Q: MTBF with redundancy ( Answered,   1 Comment )
 Question
 Subject: MTBF with redundancy Category: Science > Math Asked by: isitaboat-ga List Price: \$25.00 Posted: 18 May 2006 13:28 PDT Expires: 17 Jun 2006 13:28 PDT Question ID: 730165
 ```I'm trying to find a simple way of calculating MTBF for a RAID 5 array, a RAID 6 array, and a controller card. Basically the redundant part of the system has N items. I'd like the math to calculate the probability of failure of M drives at any one point. The N drives may have different MTBFs.

I.e. a RAID 5 array fails when 2 of N drives fail. The array is repairable if 1 drive fails. A RAID 6 array fails when 3 of N drives fail. The array is repairable if up to 2 drives fail. (Not sure if the repairability bit is useful to you.)

Please provide examples or links to tutorials explaining this clearly and concisely.

Also, I would like to know how to properly combine the results of the above redundant system's probability with that of a controller card (which is not redundant). For example:

10 x discs @ 10,000 hours MTBF - redundant. Will fail if M drives fail.
1 x card @ 10,000 hours MTBF - not redundant```
 Subject: Re: MTBF with redundancy Answered By: maniac-ga on 18 May 2006 20:44 PDT
 ```Hello isitaboat,

Let me first refer you to a similar question / answer:
  http://answers.google.com/answers/threadview?id=390140
which describes the reliability calculations for three situations:
 o serial reliability (with two items, the system fails if either fails)
 o parallel reliability (with two items, the system fails only if both fail)
 o composite reliability (a combination of parallel / serial reliability)
I note that several of the links there are "broken", so I'll add a few new ones to this answer to provide the "clear and concise" tutorials and other references.

I will also assume that when you refer to MTBF, you mean Mean Time Between [Critical] Failures, where the calculation is relative to the performance of the whole RAID system (controller & disks) and not to the failure of components within the system that do not cause the RAID system to fail. In some references this is also called Mean Time To Data Loss (MTTDL). I also assume the failure rates of the disks are not correlated (e.g., failures are not due to a "common cause" such as overheating of the RAID system or a manufacturing defect). If this is not true, please make a clarification request so I can fix the answer.

The RAID 5 situation you describe has composite reliability and can be illustrated as:
  Controller -> RAID Array (in series)
where the RAID array is treated as a complex system with three major states:
 1 - normal operation (all drives are OK)
 2 - degraded operation (one drive failed, replacement / repair in progress)
 3 - failed (another failure occurred while the system was degraded)
There is a complex set of calculations required to compute this precisely, but it can be approximated in the following way.
 1 - all drives OK: for ten 100,000 hour drives, a single drive failure occurs roughly once per 10,000 hours
 2 - one drive failed: assume a 24 hour period to replace / rebuild the failed drive; during that window, another drive failure occurs roughly once per 11,111 hours (nine remaining 100,000 hour drives)
 3 - RAID system failed: the remaining period

From Hannu Kari's PhD thesis at
  http://www.tcs.hut.fi/~hhk/phd/phd_Hannu_H_Kari.pdf
(page 6), the MTTDL of a 10+1 disk RAID 5 system is roughly 3.8 million hours (for 100,000 hour drives).

Taking the MTTDL above in combination with a 1,000,000 hour MTBF controller, the composite reliability is calculated as:
  lambda of RAID       = 1,000,000/3,800,000 = 0.263
  lambda of controller = 1,000,000/1,000,000 = 1.000
  lambda of both       = 1.000 + 0.263       = 1.263
  MTBF of both         = 1,000,000/1.263     = about 792 thousand hours

The RAID 6 example is similar, with four major states:
 1 - all drives OK
 2 - one drive failed
 3 - two drives failed
 4 - RAID system failed
I could not find the result for this directly on the internet, but it is at least several hundred times more reliable than the RAID 5 example (approximated by 10,000/24).

Chapter 6 of the thesis (same location) has some nice illustrations of the states and explanations of the math involved to do the analytical solution for RAID 5, along with several approximations. I suggest skipping the analytic solution and reviewing the approximations. I also suggest you stop reading that section where it goes into disk array fault models, unless you are interested in the variety of ways that disks operate and fail.

For other references, I suggest viewing:
  http://www.softwareresearch.net/site/teaching/SS2003/PDFdocs.EmbC/16_fault_tolerance.pdf
a general tutorial on reliability, describing serial and parallel reliability as well as how real equipment works and fails.
  http://www.rism.com/Trig/binomial.htm
An explanation of combinations (along with related calculations). These can be used in reliability calculations that do not take repair into account. If you don't repair / replace a failed drive, the MTTDL of 3.8 million hours in a RAID 5 system drops to roughly 110,000 hours. If you use Excel, =COMBIN does the combination calculation.

Using the search phrase
  reliability tutorial
in Google also brings up a number of good tutorials on reliability in general. I added RAID to the search phrase for some of my searches, but that generally did NOT bring up helpful information. Other phrases such as
  MTTDL "RAID 6"
  reliability MTTDL
can help you find other resources as well. There are also several good books referred to in the PhD thesis. You may want to review one of them as well (check your local university library).

If any part of this answer is unclear or incomplete, please make a clarification request before rating the answer so I can make the appropriate corrections.

Good luck with your work.
  --Maniac```
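The back-of-the-envelope calculation in the answer can be reproduced in Python. The RAID 5 MTTDL below uses the standard independent-failure approximation MTBF² / (N × (N−1) × MTTR), which reproduces the roughly 3.8 million hour figure quoted for the 10+1 disk array; failure rates then add for the array and controller in series. The function names here are illustrative, not from any library:

```python
# Approximate MTTDL of a RAID 5 array under independent, constant
# failure rates, then combine it in series with the controller.

def raid5_mttdl(drive_mtbf_h, n_drives, repair_h):
    """Standard approximation: MTBF^2 / (N * (N - 1) * MTTR)."""
    return drive_mtbf_h ** 2 / (n_drives * (n_drives - 1) * repair_h)

def serial_mtbf(*mtbfs):
    """Failure rates (1/MTBF) add for components in series."""
    return 1.0 / sum(1.0 / m for m in mtbfs)

# 10+1 disk RAID 5, 100,000 h drives, 24 h replace/rebuild window:
mttdl = raid5_mttdl(100_000, 11, 24)
print(round(mttdl))                       # roughly 3.8 million hours

# Controller at 1,000,000 h MTBF, in series with the array:
print(round(serial_mtbf(mttdl, 1_000_000)))  # roughly 792 thousand hours
```

The exact serial result (about 791,000 hours) differs slightly from the 792 thousand in the answer only because the lambdas there were rounded to three digits.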
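For the no-repair case the answer mentions, the probability that at least M of N drives have failed by time t can be computed directly, even when the drives have different MTBFs (as the original question asks). A minimal sketch, assuming independent drives with constant failure rates, so each drive's failure probability by time t is 1 − exp(−t/MTBF); the helper name is made up for illustration:

```python
from math import exp

def prob_at_least_m_failed(mtbfs_h, t_h, m):
    """P(at least m drives failed by time t_h), drives independent with
    constant failure rates; MTBFs may differ (Poisson-binomial via DP)."""
    # dist[k] = probability that exactly k of the drives seen so far failed
    dist = [1.0]
    for mtbf in mtbfs_h:
        p = 1.0 - exp(-t_h / mtbf)
        new = [0.0] * (len(dist) + 1)
        for k, prob in enumerate(dist):
            new[k] += prob * (1.0 - p)   # this drive survived
            new[k + 1] += prob * p       # this drive failed
        dist = new
    return sum(dist[m:])

# RAID 5 with ten identical 10,000 h drives fails once 2 have failed;
# probability of array loss within 1,000 hours with no repair:
print(prob_at_least_m_failed([10_000] * 10, 1_000, 2))
```

With identical drives this reduces to the ordinary binomial sum (Excel's =COMBIN route); the dynamic program is only needed when the per-drive MTBFs differ.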
 Subject: Re: MTBF with redundancy From: kottekoe-ga on 19 May 2006 13:01 PDT
 ```I suppose this is all covered in the thesis referred to above, but you cannot answer the question unless you know the probability of failure as a function of life. The simplest assumption is that the failure probability is independent of time, which I would expect to be a poor assumption for a mechanical device subject to wear-out mechanisms.```
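The comment's point can be illustrated with a Weibull life distribution, a common way to model wear-out: shape = 1 gives exactly the constant-failure-rate (exponential) assumption used throughout the answer, while shape > 1 gives an increasing failure rate, concentrating failures later in life. A small sketch (function name illustrative):

```python
from math import exp

def weibull_reliability(t, scale, shape):
    """R(t) = exp(-(t/scale)^shape); shape == 1 is the exponential
    (constant-rate) case, shape > 1 models wear-out."""
    return exp(-((t / scale) ** shape))

# Same characteristic life, halfway through it:
print(weibull_reliability(5_000, 10_000, 1.0))  # constant-rate assumption
print(weibull_reliability(5_000, 10_000, 3.0))  # wear-out: fewer early failures
```

Under wear-out, drives fail less often early on but much more often near end of life, so a single MTBF number understates early reliability and overstates late reliability.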