Abstract
HDFS (Hadoop Distributed File System) provides a centralized cache mechanism that lets end users simply specify a path whose HDFS data should be cached. HDFS caching can deliver significant performance benefits for queries and other workloads that repeatedly access large volumes of data. However, because DRAM is used as the cache medium, HDFS caching may cause performance regressions for memory-intensive workloads, so its use is limited, especially where memory capacity is insufficient. To overcome the limitations of the HDFS DRAM cache, we introduced persistent memory as the cache medium. Persistent memory is a new class of memory storage technology that offers high performance, high capacity, and data persistence at lower cost, which makes it well suited to big data workloads. In this session, attendees will gain technical knowledge of HDFS caching and learn how to accelerate workloads by leveraging the HDFS persistent memory cache. We will first introduce the architecture of the HDFS persistent memory cache feature, then present performance numbers for micro-benchmarks such as DFSIO and industry-standard workloads such as TPC-DS. We will show that the HDFS persistent memory cache delivers up to a 14x speedup over the no-cache case and a 6x speedup over the HDFS DRAM cache. Thanks to its data persistence, the HDFS persistent memory cache also helps users reduce cache warm-up time when a cluster restarts, which we will demonstrate as well. Finally, we will discuss future work, such as potential optimizations and HDFS lazy-persist write cache support with persistent memory.
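For readers unfamiliar with the feature, centralized caching is driven by the standard `hdfs cacheadmin` CLI (for example, `hdfs cacheadmin -addPool <pool>` followed by `hdfs cacheadmin -addDirective -path <path> -pool <pool>`), and the persistent-memory backend is selected through DataNode configuration. Below is a minimal hdfs-site.xml sketch; the property name reflects the persistent memory cache support added in Hadoop 3.3 (verify against your release), and the mount points are hypothetical:

```xml
<!-- hdfs-site.xml: direct the DataNode cache to persistent memory
     instead of DRAM (property introduced with HDFS pmem cache support;
     /mnt/pmem0 and /mnt/pmem1 are hypothetical pmem mount points) -->
<property>
  <name>dfs.datanode.pmem.cache.dirs</name>
  <value>/mnt/pmem0,/mnt/pmem1</value>
</property>
```

With this configuration in place, existing cache directives created via `hdfs cacheadmin` are served from persistent memory transparently; no application-side changes are required.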
Learning Objectives
- Hadoop Distributed File System (HDFS) with persistent memory cache
- Architecture of the HDFS persistent memory cache
- How to accelerate big data workloads with a persistent memory cache