Abstract
The PFFS is a POSIX-compliant parallel file system designed for high resiliency and scalability. User data is dispersed across a cluster of peers with no replication. This is an analysis of the performance metrics obtained as we increased the number of peers in the cluster. For each cluster configuration we adjusted the allowable number of peer failures (via Forward Error Correction, FEC) from 14% to 77% of the cluster and measured the I/O performance of the cluster. Write operations consistently exceeded 700 MB/s even with 77% of the peers faulted. Read operations always exceeded 400 MB/s with no peer failures, and we observed graceful performance degradation as faults were injected: even with 77% of peers failed, throughput never dropped below 300 MB/s. As expected, large I/Os were faster when the cluster was mounted in direct_io mode, whereas smaller I/Os were faster when direct_io was disabled and the kernel was allowed to optimize I/O requests. We discuss how Ethernet latency-versus-throughput trade-offs affect PFFS performance. We explore how the PFFS can be made faster in the future. We draw conclusions on the scalability of the cluster's performance based upon our measurements.
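The relationship between FEC parity and tolerable peer failures can be sketched as follows. This is an illustrative sketch only: the abstract does not state PFFS's actual stripe parameters or cluster sizes, so the function name and the example cluster configurations are assumptions chosen to land near the 14%-77% range studied.

```python
# Hypothetical sketch of (k + m) erasure-coded fault tolerance.
# Each stripe is split into k data shards plus m parity shards and
# dispersed across k + m peers; any m peer failures are survivable.
# PFFS's real parameters are not given in the abstract.

def fault_tolerance(total_peers: int, parity_peers: int) -> float:
    """Fraction of peers that may fail before data is unrecoverable."""
    if not 0 <= parity_peers < total_peers:
        raise ValueError("parity shard count must be in [0, total_peers)")
    return parity_peers / total_peers

# Illustrative configurations (assumed, not from the paper):
# a 7-peer cluster with 1 parity shard tolerates ~14% peer failures,
# and a 13-peer cluster with 10 parity shards tolerates ~77%.
low = fault_tolerance(7, 1)    # ~0.14
high = fault_tolerance(13, 10)  # ~0.77
```

Raising the parity fraction raises fault tolerance but also raises the write amplification per stripe, which is one reason resiliency trades against performance as the failure budget grows.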
Learning Objectives
Design for performance
Resiliency vs. performance
Scaling performance
Issues with Ethernet