Abstract
The Apache Hadoop FileSystem layer integrates with many popular file systems, including cloud stores such as S3 and Azure Data Lake Storage, alongside Hadoop's own Hadoop Distributed File System (HDFS). When users migrate between file systems, updating their metadata stores is very difficult if those stores persist file system paths together with their schemes. For example, Apache Hive persists URI paths in its metastore. In Apache Hadoop, we addressed this problem with the View FileSystem Overload Scheme (HDFS-15289). In this talk, we will cover in detail how users can enable it and how they can migrate data between file systems without modifying their metastores. The migration is completely transparent to users with respect to file paths. We will present a use case with Apache Hive partitioning: the user moves one or more partitions to a remote file system and simply adds a mount point on the default file system (e.g., HDFS) they were already working with. Hive queries continue to work transparently from the user's point of view even though the data resides in a remote storage cluster such as Apache Hadoop Ozone or S3. This is very useful when users want to move certain kinds of data, for example cold partitions or small files, from a primary HDFS cluster to remote clusters without affecting applications. The mount tables are maintained on a central server; all clients load the tables when they initialize the file system and can refresh them when mount points are modified, so that all initializing clients stay in sync. This makes it easier for users to migrate data between cloud and on-premises storage in a much more flexible way.
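As a rough illustration of the client-side mount idea described above, the following Python sketch models how a path that an application (or the Hive metastore) still records with an hdfs:// scheme could be routed to wherever the data now lives. It is a simplified model and not Hadoop's implementation: the function name resolve, the cluster name ns1, the bucket name cold-bucket, and the partition path are all hypothetical, and the single fallback target is an assumption standing in for "everything that was not moved stays on the primary HDFS cluster".

```python
from urllib.parse import urlparse

# Hypothetical mount table: one cold Hive partition has been moved to S3.
# Keys are mount points on the default file system, values are target URIs.
MOUNT_LINKS = {
    "/warehouse/sales/year=2015": "s3a://cold-bucket/sales/year=2015",
}

# Assumed fallback: paths without a mount point stay on the primary HDFS cluster.
FALLBACK = "hdfs://ns1"


def resolve(path_uri, links=MOUNT_LINKS, fallback=FALLBACK):
    """Map the path an application uses to the location that holds the data."""
    # Only the path part is used for the lookup; the scheme the caller wrote
    # (for example hdfs://) is kept in the metastore and never rewritten.
    path = urlparse(path_uri).path
    for mount_point, target in links.items():
        # Match the mount point itself or anything below it.
        if path == mount_point or path.startswith(mount_point + "/"):
            return target + path[len(mount_point):]
    return fallback + path


moved = resolve("hdfs://ns1/warehouse/sales/year=2015/part-0000.orc")
kept = resolve("hdfs://ns1/warehouse/sales/year=2020/part-0000.orc")
```

In this model the application keeps using the same hdfs:// path for both partitions; only the mount table decides that the 2015 partition is read from the remote store.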
Learning Objectives
The audience will learn how client-side mount file systems work.
The audience will learn which Hadoop-compatible file systems are available.
The audience will learn how easily data can be migrated between file systems without changing application code.
The audience will learn which APIs are exposed by Hadoop-compatible file systems.