PPPL-4672

Ensuring High Availability and Recoverability of Acquired Data

Authors: C. Pugh, T. Carrol and P. Henderson

Abstract:
Every time a shot or simulation is run, large amounts of data are collected and sent to storage. This data is important to our livelihood as a scientific research community and to our mission of sustainable energy, so it behooves everyone to ensure its integrity. Many mechanisms are available to store this data and keep it available, including hardware RAID, software RAID, and backups. Is the right amount of data redundancy being used to keep the data safe? What are the scenarios in which these redundancies could fail? How can each type of failure be accounted for with the least overhead? With hardware RAID on the storage networks, each RAID group tolerates a certain number of disk failures before the whole group fails beyond recovery. Software RAID, specifically ZFS RAID-Z or mirroring, can check for "soft errors" and provide a way to recover, even if a hard disk fails or a device is prematurely removed. Finally, backups are only as good as the policy and resources provided to the system. As with many engineering decisions, it is often not clear what the best solution is. Alone, each of these mechanisms provides a certain level of data redundancy or availability. Combined, however, they are intended to keep data available and recoverable in every failure scenario considered.
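
The "soft error" check mentioned above can be illustrated with a minimal sketch. It is not ZFS's actual implementation: the function names, the choice of SHA-256, and the in-memory "mirror" are all assumptions made for illustration. The idea shown is that a checksum stored separately from the data lets a reader tell which mirrored copy is intact and repair the damaged one.

```python
import hashlib


def checksum(data: bytes) -> str:
    # Hypothetical helper: SHA-256 is an assumption chosen for illustration.
    return hashlib.sha256(data).hexdigest()


def read_with_self_heal(copies: list[bytearray], expected: str) -> bytes:
    """Return the first copy whose checksum matches, repairing bad copies."""
    good = next(
        (bytes(c) for c in copies if checksum(bytes(c)) == expected), None
    )
    if good is None:
        # No redundant copy survived; only a backup can help at this point.
        raise OSError("all copies fail checksum; restore from backup")
    for c in copies:
        if bytes(c) != good:
            c[:] = good  # overwrite the corrupted copy with the good data
    return good


# Illustrative data: one block stored on a two-way mirror.
block = b"shot 12345 diagnostics"
expected = checksum(block)
mirror = [bytearray(block), bytearray(block)]

mirror[0][0] ^= 0x01  # simulate a silent single-bit error on one disk
recovered = read_with_self_heal(mirror, expected)
```

If every copy fails its checksum, the sketch raises an error instead of returning bad data, which is the case where redundancy alone is insufficient and a backup is needed.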
__________________________________________________

Submitted to: 38th International Conference on Plasma Science & 24th Symposium on Fusion Engineering/ICOPS 2011 SOFE, Chicago, IL, June 26-30, 2011

__________________________________________________

Download PPPL-4672 (pdf 667 KB 5 pp)
__________________________________________________