Is Persistent Memory Persistent? | SNIA

Abstract

Preserving application data integrity is a paramount duty of computing systems. Failures such as power outages are major perils: A sudden crash during an update may corrupt data or effectively destroy it by corrupting metadata. Applications protect data integrity by using update mechanisms that are atomic with respect to failure; such mechanisms promise to restore data to a consistent state following a crash. Unfortunately, the checkered history of failure-atomic update mechanisms precludes blind trust. Widely used relational databases and key-value stores often fail to uphold their transactionality guarantees [Zheng et al., OSDI '14]. Lower on the stack, durable storage devices may corrupt or destroy data when power is lost [Zheng et al., FAST '13]. Emerging non-volatile memory (NVM) hardware and corresponding failure-atomic update mechanisms strive to avoid repeating the mistakes of earlier technologies, as do software abstractions of persistent memory for conventional hardware [the topic of my SDC 2019 talk]. Healthy skepticism, however, demands firsthand evidence that such systems deliver on their integrity promises. Prudent developers and operators follow the maxim, "train as you would fight." Software that must tolerate abrupt power failures should demonstrably survive such failures in pre-production tests or "Game Day" failure-injection testing on production systems. In the past, my colleagues and I extensively tested our crash-tolerance mechanisms against power failures, but we did not document the tribal knowledge required to practice this art. This talk describes the design and implementation of a simple and cost-effective testbed for subjecting applications running on a complete hardware/software stack to repeated sudden whole-system power interruptions. The testbed is affordable, runs unattended indefinitely, and performs a full power-off/on test cycle in one minute. 
The talk will also present my findings from using such a testbed to evaluate a crash-tolerance mechanism for persistent memory by subjecting it to over 50,000 power failures. Any software developer can use this type of testbed to evaluate crash-tolerance software before releasing it for production use. Application operators can learn principles and techniques from this talk that they can apply to power-fail testing of their production hardware and software. A peer-reviewed companion paper that covers all of the material in the talk and that provides additional detail will be published prior to the talk; attendees are invited but not required to read the paper before the talk.

Learning Objectives

Attendees will learn why it is necessary to subject complete production hardware/software stacks to realistic, sudden, whole-system power interruptions.

Attendees will review in detail the design of simple yet robust auxiliary circuitry that allows a test computer to cut its own power abruptly. The auxiliary circuitry restores power after a few seconds, triggering a reboot that starts the next test cycle. Attendees will learn to design and build such circuitry for themselves safely and at low cost.

Attendees will learn how to design test software to detect data corruption caused by crashes, and how to configure test computers to run tests unattended indefinitely.
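As a rough illustration of the last objective, the following Python sketch shows one way test software might detect corruption after a power cut. It is not the mechanism evaluated in the talk: the record layout, the file-based atomic update (write to a temporary file, fsync, rename), and all names such as `atomic_write` and `check_after_reboot` are assumptions made for illustration, and the directory fsync assumes a POSIX system.

```python
import os
import struct
import zlib

# Hypothetical record layout: sequence number (u64) and CRC32 of payload (u32).
HEADER = struct.Struct("<QI")


def encode_record(seq: int, payload: bytes) -> bytes:
    """Serialize a record as header (seq, crc) followed by the payload."""
    return HEADER.pack(seq, zlib.crc32(payload)) + payload


def decode_record(blob: bytes):
    """Return (seq, payload); raise ValueError if the record is damaged."""
    if len(blob) < HEADER.size:
        raise ValueError("record truncated")
    seq, crc = HEADER.unpack_from(blob)
    payload = blob[HEADER.size:]
    if zlib.crc32(payload) != crc:
        raise ValueError("checksum mismatch")
    return seq, payload


def atomic_write(path: str, blob: bytes) -> None:
    """Replace path with blob: write a temp file, fsync, rename, fsync the dir."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(blob)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)
    # Persist the rename itself (POSIX assumption: directories can be fsynced).
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)


def check_after_reboot(path: str):
    """Run at boot: return the stored sequence number, or None if no file exists.

    Raises ValueError if the file exists but fails validation.
    """
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        seq, _ = decode_record(f.read())
    return seq
```

In a test cycle of the kind described above, a boot-time script would call `check_after_reboot`, log the result, and then loop calling `atomic_write` with increasing sequence numbers until power is cut. A checksum only detects damaged records; confirming that no acknowledged update was lost would additionally require recording acknowledged sequence numbers somewhere that survives the crash.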