If the principles and statistical bases of reliability are well understood, they can be applied to many different fields. My work history shows experience as an individual contributor, a reliability manager, and a hands-on tester. My reliability management and analysis experience is derived from mechanical, electromechanical, and electronic hardware, in fields as diverse as solar power, hard disk drives and data storage systems, fault-tolerant computers, plasma-etch equipment, and nuclear safety systems.

High-Concentration Photovoltaic (HCPV) Solar Power
In an HCPV system, all the solar cells must follow the sun throughout the day. This is achieved using a tracker to move the panels. Trackers are large mechanical assemblies consisting of structural beams, motors, gears, shafts, and bearings. To be economically viable, trackers must survive 25 years. At SolFocus I developed and implemented a Mechanical Reliability Test Program. The tests use accelerometers and measurements of voltage, current, and temperature to monitor changes throughout an accelerated-life test. Changes in vibration and motor current often indicate increased wear or friction. Reliability is estimated by following the trends in currents and vibrations over the accelerated 25-year life.

Implementing this program required selecting National Instruments equipment, writing LabVIEW and Python scripts to control the tests, and using NI's DIAdem with Visual Basic to assess the data, plot it, and create reports. The data analyses used fast Fourier transforms (FFTs) to create power spectral density functions and determine the total power induced by vibrations.
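As an illustration of that last analysis step, a power spectral density can be computed directly from the FFT. Below is a minimal periodogram sketch in Python with NumPy (the production analyses were done in DIAdem); the 50 Hz test signal and sample rate are invented for illustration.

```python
import numpy as np

def power_spectral_density(signal, fs):
    """One-sided power spectral density via the FFT (periodogram).
    Integrating psd over frequency recovers the signal's mean-square
    value, i.e., the total power induced by the vibrations."""
    n = len(signal)
    spectrum = np.fft.rfft(signal)
    psd = (np.abs(spectrum) ** 2) / (fs * n)
    psd[1:-1] *= 2  # fold negative frequencies into the one-sided estimate
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    return freqs, psd

# Example: a pure 50 Hz vibration signature sampled at 1 kHz for 1 second
fs = 1000.0
t = np.arange(0, 1.0, 1.0 / fs)
accel = np.sin(2 * np.pi * 50 * t)

freqs, psd = power_spectral_density(accel, fs)
total_power = np.sum(psd) * (freqs[1] - freqs[0])  # ~ mean-square of signal
```

For a unit-amplitude sine the mean-square value is 0.5, so the integrated spectrum provides a quick sanity check (Parseval's theorem) before trending band power over the life test.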

Data Storage Systems
Data storage systems are often configured as a redundant array of inexpensive disks (RAID). RAID systems can have any number of hard disk drives (HDDs), but the more common configurations have one or two spares built into the system and use parity to reconstruct any data lost to corrupted bits, sectors, or tracks, or to failed HDDs.

Data corrupted by bit errors and media defects can often be corrected quickly using parity: the affected sector is reconstructed and rewritten to a new, defect-free location on the HDD. Data lost to an entire HDD failure can also be reconstructed, but this may take 5 to 100 hours because the reconstruction is performed as a background activity while the system continues to store and serve data. The amount of foreground activity, the priority given to the rebuild, and the capacity of the HDD are critical to the reconstruction time.
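The parity mechanism itself is simple to sketch. Assuming byte-wise XOR parity (as in RAID 4/5-style layouts), any single lost block is the XOR of the surviving blocks and the parity block; the block contents and disk labels below are purely illustrative.

```python
def parity_block(blocks):
    """Compute the parity of equal-length data blocks (byte-wise XOR)."""
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            parity[i] ^= b
    return bytes(parity)

def rebuild(surviving_blocks, parity):
    """Reconstruct the single missing block from survivors plus parity.
    XOR-ing everything that remains cancels the known data, leaving
    exactly the lost block."""
    return parity_block(list(surviving_blocks) + [parity])

data = [b"disk0data", b"disk1data", b"disk2data"]
p = parity_block(data)

# Simulate losing disk 1 and rebuilding its block onto a spare:
restored = rebuild([data[0], data[2]], p)
```

A real array applies this stripe by stripe across terabytes, which is why a full-drive rebuild takes hours rather than the microseconds of a single-sector repair.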

A system with one redundant HDD, termed "N+1" redundant, can tolerate one complete HDD failure without losing data; a system with two redundant HDDs, termed "N+2", can tolerate the complete failure of two HDDs. The reliability models for estimating the expected number of "data loss events" in an N+1 RAID group are the subject of several papers I have written (see "Storage Papers" in Papers/References). These reliability analyses are based on four different probability distributions and require a sequential Monte Carlo simulation to correctly model the unique conditions of the system. I wrote the Monte Carlo simulation in C, using a novel and fast, publicly available random number generator.
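A much-simplified version of such a simulation can be sketched in Python (the original was in C and combined four distributions; here a single exponential drive lifetime and a fixed rebuild time are assumed purely for brevity). The group is exposed to data loss whenever a second drive fails while an earlier failure's rebuild is still in progress.

```python
import random

def simulate_raid_group(n_drives, mttf_hours, rebuild_hours,
                        mission_hours, trials, seed=1):
    """Illustrative sequential Monte Carlo for an N+1 RAID group:
    estimate the probability that, during the mission, a second drive
    fails while a rebuild is still running (a data-loss event)."""
    rng = random.Random(seed)
    losses = 0
    for _ in range(trials):
        t, exposed_until = 0.0, -1.0
        lost = False
        while t < mission_hours:
            # Time to the next failure among n_drives independent drives
            t += rng.expovariate(n_drives / mttf_hours)
            if t >= mission_hours:
                break
            if t < exposed_until:          # failure during an active rebuild
                lost = True
                break
            exposed_until = t + rebuild_hours  # rebuild onto a spare begins
        losses += lost
    return losses / trials

# Hypothetical numbers: 8 drives, 1e6-hour MTTF, 24 h rebuilds, 5-year mission
p_loss = simulate_raid_group(8, 1.0e6, 24.0, 5 * 8760, 200_000)
```

The sequential structure matters: each trial replays the group's failure history in time order, so state-dependent conditions (here, whether a rebuild is in flight) are modeled correctly rather than approximated with a closed-form rate.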

Hard Disk Drives
In my 20 years at Tandem Computers, IBM, and NetApp, I provided HDD reliability assessments and participated in HDD manufacturing process reviews and product qualifications. This meant reviewing HDD designs, manufacturers' tests and qualification data, and the manufacturers' qualification processes. To verify HDD reliability, I created reliability predictions based on supplier test data and in-house test results. I also created a Reliability Audit Process to determine whether the HDDs were designed, tested, and manufactured in a way that would yield the most reliable products.
A critical aspect of understanding HDD reliability is the analysis and interpretation of field data. It is often necessary to go beyond the first-level "distribution fit" and find out why the data do not fit a plot line well. Reasons I have found for a poor fit include data with the following attributes:
  • Mixed vintages with inherently different failure mechanisms and different rates
  • Failure mechanisms changed as the HDD aged (young HDDs had one failure mechanism and old HDDs had a different mechanism)
  • Ship-receive delay assumptions can change the best fit from a decreasing failure rate to an increasing failure rate
Don't assume; have your data carefully analyzed!
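To illustrate the first of those attributes, a quick Weibull median-rank regression shows how a mixed-vintage population betrays itself: the mixture fits a single plot line noticeably worse than a homogeneous sample. The distribution parameters below are invented for illustration only.

```python
import math
import random

def weibull_fit(times):
    """Median-rank regression on a Weibull plot: ln(-ln(1-F)) vs ln(t)
    should be a straight line with slope equal to the shape parameter.
    Returns (shape, scale, r_squared); a low r_squared hints that the
    data are not a single homogeneous Weibull population."""
    times = sorted(times)
    n = len(times)
    xs, ys = [], []
    for i, t in enumerate(times, start=1):
        f = (i - 0.3) / (n + 0.4)            # Bernard's median-rank estimate
        xs.append(math.log(t))
        ys.append(math.log(-math.log(1.0 - f)))
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    syy = sum((y - my) ** 2 for y in ys)
    shape = sxy / sxx
    scale = math.exp(mx - my / shape)
    r2 = sxy * sxy / (sxx * syy)
    return shape, scale, r2

rng = random.Random(7)
# One homogeneous vintage vs. a 50/50 mix of an infant-mortality vintage
# (shape 0.6) and a wear-out vintage (shape 3.0)
clean = [rng.weibullvariate(1000, 2.5) for _ in range(300)]
mixed = ([rng.weibullvariate(100, 0.6) for _ in range(150)] +
         [rng.weibullvariate(2000, 3.0) for _ in range(150)])

shape_clean, _, r2_clean = weibull_fit(clean)
shape_mixed, _, r2_mixed = weibull_fit(mixed)
```

The fitted shape for the mixture is a meaningless blend of the two vintages, and the lower r² is the first clue to split the population before drawing any failure-rate conclusions.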

Plasma Etch and Vacuum Technology
Plasma-etching systems require knowledge of a diverse set of disciplines. In addition to understanding reliability principles, one must also know the following:
  • Robotics and mechanical motion
  • Effects of chemicals on reliability and life
  • Effects of vacuum on the lubricants used, if any
  • Microprocessor controllers and data acquisition systems to control and move the robotic elements
This work began my hands-on testing experience more than 20 years ago. I developed and implemented a showcase reliability laboratory with distributed controllers for the various assemblies and sub-assemblies. All the local controllers were managed by a master controller over RS-232 and RS-422 interfaces.

Nuclear Safety Systems
Nuclear safety systems require sophisticated reliability analyses. One of the critical safety functions is removing heat from the reactor core after the core has been shut down. The Shutdown Heat Removal System (SHRS) of the Liquid Metal Fast Breeder Reactor (LMFBR), a research project sponsored by the US Department of Energy (DOE), required developing and applying new modeling techniques. Led by a mentor, Dr. Gerry Ingram, and a gifted statistician, Dr. Gary Crellin, I developed the models and the input data required for the analyses. The analyses and codes are the subject of several papers presented at international conferences [?????].