PPPL-4672
Ensuring High Availability And Recoverability of Acquired Data
Authors: C. Pugh, T. Carrol and P. Henderson
Abstract:
Every shot or simulation run generates large amounts of
data, which are then sent to long-term storage.
This data is important to our work as a scientific research
community and to our mission of sustainable energy.
It therefore behooves everyone to ensure the integrity of this data.
Many mechanisms are available to store this data and keep it
available, from hardware RAID to software RAID and backups.
Is the right amount of data redundancy being used to keep the
data safe? In which scenarios could these redundancies fail?
How can one ensure that each type of failure is accounted for
with the least overhead?
With hardware RAID on the storage networks, each RAID group
tolerates a certain number of disk failures before the
whole group fails beyond recovery. Software RAID, specifically
ZFS RAID-Z or mirroring, can check for "soft errors" and provide
a way to recover, even if a hard disk fails or a device is
prematurely removed. Finally, backups are only as good as the
policy and resources provided to the system.
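The two ideas in this paragraph, a fixed failure budget per RAID group and checksum-based detection of soft errors with repair from a redundant copy, can be illustrated with a minimal Python sketch. It is not taken from the paper and does not use any ZFS interface: the names `group_survives` and `self_healing_read` are hypothetical, SHA-256 is chosen only for illustration, and the mirror is modeled as in-memory byte arrays.

```python
import hashlib
from typing import List

# Disk failures each parity layout survives before data loss
# (standard definitions; mirrors are omitted for brevity).
PARITY_TOLERANCE = {"raid5": 1, "raid6": 2, "raidz1": 1, "raidz2": 2, "raidz3": 3}


def group_survives(layout: str, failed_disks: int) -> bool:
    """Return True if a single RAID group is still recoverable."""
    return failed_disks <= PARITY_TOLERANCE[layout]


def checksum(block: bytes) -> str:
    """Checksum stored separately from the data block (illustrative choice)."""
    return hashlib.sha256(block).hexdigest()


def self_healing_read(copies: List[bytearray], expected: str) -> bytes:
    """Read a mirrored block, verify it, and repair any corrupted copy.

    `copies` stands in for the same block on each side of a mirror;
    `expected` is the checksum recorded when the block was written.
    """
    # Find the first copy whose contents still match the recorded checksum.
    good = next((bytes(c) for c in copies if checksum(bytes(c)) == expected), None)
    if good is None:
        # Redundancy is exhausted; only a backup can help at this point.
        raise IOError("all copies fail checksum; restore from backup")
    # Overwrite every silently corrupted copy with the verified data.
    for c in copies:
        if checksum(bytes(c)) != expected:
            c[:] = good
    return good
```

In this sketch a flipped byte in one copy is detected and overwritten from the intact copy, while corruption of every copy raises an error, which is the case that backups exist to cover.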
As with many engineering decisions, it is often not clear what
the best solution is. Alone, each of these mechanisms provides
a certain level of data redundancy or availability. When these
resources are combined, however, they ensure that data remains
available and recoverable in any scenario.
__________________________________________________
Submitted to: 38th International Conference on Plasma Science & 24th Symposium on Fusion Engineering/ICOPS 2011 SOFE, Chicago, IL, June 26-30, 2011
__________________________________________________