Everyone is looking to squeeze more efficiency from storage.
That’s why the SNIA Networking Storage Forum hosted a live webcast last month, “Compression: Putting the Squeeze on Storage.” The audience asked many great questions on compression techniques. Here are answers from our expert presenters, John Kim and Brian Will:
Q. When multiple unrelated entities are likely to compress the data, how do they know that the data is already compressed and skip the compression?
A. Often they can tell from the file extension or header that the file has already been compressed. Otherwise, each entity that wants to compress the data will try to compress it and then discard the result if it makes the file larger (because it was already compressed).
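A minimal sketch of that try-and-discard approach, using Python’s zlib purely for illustration (the helper name is hypothetical):

```python
import zlib

def maybe_compress(data: bytes) -> tuple[bytes, bool]:
    """Compress data, but keep the original if compression doesn't help."""
    compressed = zlib.compress(data, 6)
    if len(compressed) >= len(data):
        # Already compressed or incompressible: discard the attempt.
        return data, False
    return compressed, True

# Repetitive text compresses well; already-deflated bytes do not.
text = b"squeeze " * 1000
once, did1 = maybe_compress(text)    # did1 is True, output much smaller
twice, did2 = maybe_compress(once)   # did2 is False, second pass discarded
```

The second call illustrates the point in the answer: re-compressing compressed data only adds header overhead, so the result is thrown away.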
Q. I’m curious about the storage efficiency of data reduction techniques (compression, thin provisioning, etc.) on certain database/server workloads where they end up being more of a hindrance, for example Oracle ASM, which does not perform very well under any form of storage efficiency method. In such scenarios, what would be the recommendation to ensure storage is judiciously utilized?
A. Compression works well for some databases but not others, depending both on how much data repetition occurs within the database and on how the database tables are structured. Database compression can be done at the row, column, or page level, depending on the method and the database structure. Thin provisioning generally works best if multiple applications using the storage system (such as the database application) want to reserve or allocate more space than they actually need. If your database system does not like the use of external (storage-based, OS-based, or file system-based) space efficiency techniques, you should check whether it supports its own internal compression options.
Q. What is a DPU?
A. A DPU is a data processing unit that specializes in moving, analyzing and processing data as it moves in and out of servers, storage, or other devices. DPUs usually combine network interface card (NIC) functionality with programmable CPU and/or FPGA cores. Some possible DPU functions include packet forwarding, encryption/decryption, data compression/decompression, storage virtualization/acceleration, executing SDN policies, running a firewall agent, etc.
Q. What’s the difference between compression and compaction?
A. Compression replaces repeated data with either shorter symbols or pointers that represent the original data but take up less space. Compaction eliminates empty space between blocks or inside files, often by moving real data closer together. For example, if you store multiple 4KB chunks of data in a storage system that uses 32KB blocks, the default storage solution might consume one 32KB storage block for each 4KB of data. Compaction could put 5 to 8 of those 4KB data chunks into one 32KB storage block to recover wasted free space.
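The arithmetic behind that example can be sketched in a few lines (a hypothetical system with 32KB blocks and 4KB chunks):

```python
BLOCK = 32 * 1024   # storage block size
CHUNK = 4 * 1024    # size of each data chunk
chunks = 8

# Without compaction: one whole 32KB block is consumed per 4KB chunk.
naive = chunks * BLOCK                         # 262144 bytes consumed

# With compaction: pack chunks into as few blocks as possible.
packed_blocks = -(-chunks * CHUNK // BLOCK)    # ceiling division -> 1 block
compacted = packed_blocks * BLOCK              # 32768 bytes consumed
```

Here eight 4KB chunks fit exactly into one 32KB block, an 8x reduction in consumed capacity; real systems land at "5 to 8" because of metadata overhead.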
Q. Is data encryption at odds with data compression? That is, is data encryption a problem for data compression?
A. If you encrypt data first, it usually makes compression of the encrypted data difficult or impossible, depending on the encryption algorithm. (A simple substitution cipher would still allow compression but wouldn’t be very secure.) In most cases, the answer is to first compress the data and then encrypt it. To reverse the process, first decrypt the data and then decompress it.
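The effect of ordering can be demonstrated with a toy XOR keystream standing in for a real cipher (deliberately insecure, chosen only because it produces high-entropy output like real encryption does):

```python
import random
import zlib

def toy_encrypt(data: bytes, seed: int = 42) -> bytes:
    # Toy XOR keystream "cipher" -- NOT secure, for illustration only.
    rng = random.Random(seed)
    return bytes(b ^ rng.randrange(256) for b in data)

plain = b"compressible " * 1000

# Compress-then-encrypt: compression sees the redundancy first.
good = toy_encrypt(zlib.compress(plain))

# Encrypt-then-compress: the ciphertext looks random, so zlib finds
# almost nothing to compress.
bad = zlib.compress(toy_encrypt(plain))
```

With the compress-first order, `good` stays tiny; `bad` ends up roughly as large as the original data, illustrating why encryption should come last.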
Q. How do we choose the binary code form (00, 01, 101, 110, etc.)?
A. These will be used as the final symbol representations written into the output data stream. The table shown in the presentation is only illustrative; the algorithm documented in the DEFLATE RFC (RFC 1951) is a complete algorithm for representing symbols in a compacted binary form.
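For the curious, the DEFLATE spec (RFC 1951, section 3.2.2) assigns codes canonically from the code lengths alone; a minimal sketch of that procedure:

```python
def canonical_codes(lengths):
    """Assign canonical Huffman codes from bit lengths, per RFC 1951 3.2.2."""
    max_bits = max(lengths)
    # Count how many symbols use each code length.
    bl_count = [0] * (max_bits + 1)
    for l in lengths:
        bl_count[l] += 1
    # Compute the first code value for each length.
    next_code = [0] * (max_bits + 1)
    code = 0
    for bits in range(1, max_bits + 1):
        code = (code + bl_count[bits - 1]) << 1
        next_code[bits] = code
    # Hand out consecutive codes to symbols in alphabet order.
    codes = []
    for l in lengths:
        codes.append(format(next_code[l], f"0{l}b"))
        next_code[l] += 1
    return codes

# The worked example from RFC 1951: lengths (3,3,3,3,3,2,4,4) for A..H
print(canonical_codes([3, 3, 3, 3, 3, 2, 4, 4]))
# → ['010', '011', '100', '101', '110', '00', '1110', '1111']
```

Because the codes are fully determined by the lengths, the compressor only has to transmit the lengths, not the codes themselves.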
Q. Is there a resource for comparing different algorithms vs. CPU requirements vs. compression ratios?
A. A good resource to see the cost versus ratio trade-offs of different algorithms is on GitHub here. This utility covers a wide range of compression algorithms, implementations and levels. The data shown at their GitHub location is benchmarked against the Silesia corpus, which represents a number of different data sets.
Q. Do these operations occur on individual data blocks, or across the entire compression job?
A. Assuming you mean the compression operations, they typically occur across multiple data blocks in the compression window. The compression window almost always spans more than one data block but usually does not span the entire file or disk/SSD, unless it’s a small file.
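Python’s zlib exposes the window size directly through its `wbits` parameter (window = 2^wbits bytes), which makes the effect easy to see: a repeat that falls outside the window cannot be exploited. A sketch under that assumption:

```python
import random
import zlib

rng = random.Random(0)
chunk = bytes(rng.randrange(256) for _ in range(200))    # incompressible block
filler = bytes(rng.randrange(256) for _ in range(2000))  # >512 bytes of noise
# Second copy of chunk sits ~2200 bytes after the first.
data = chunk + filler + chunk

def deflate(raw: bytes, wbits: int) -> bytes:
    c = zlib.compressobj(9, zlib.DEFLATED, wbits)
    return c.compress(raw) + c.flush()

small = deflate(data, 9)    # 512-byte window: repeat is out of reach
large = deflate(data, 15)   # 32KB window: second chunk becomes a back-reference
```

With the 32KB window, `large` is noticeably smaller than `small`, because only the wide window can "see" the first copy of the chunk when it meets the second.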
Q. How do we guarantee that important information is not lost during lossy compression?
A. Lossy compression is not my current area of expertise, but there is a significant area of information theory called rate-distortion theory, used for the quantization of images for compression, that may be of interest. In addition, lossy compression is typically only used for files/data where it’s known that the users of that data can tolerate the loss, such as images or video. The user or application can typically adjust the compression ratio to ensure an acceptable level of data loss.
Q. Do you see any advantage in performing the compression on the same CPU controller that is managing the flash (running the FTL, etc.)?
A. There may be cache benefits from running compression and flash management on the same CPU, depending on the size of transactions. If the CPU is on the SSD controller itself, running compression there could offload the work from the main system CPU, allowing it to spend more cycles running applications instead of doing compression/decompression.
Q. Before compressing data, is there a method to check whether the data is a good candidate for compression?
A. Some compression systems can run a quick scan of a file to estimate the likely compression ratio. Other systems look at the extension and/or header of the file and skip attempts to compress it if it looks like it’s already compressed, as with most image and video files. Another solution is to actually attempt to compress the file and then discard the compressed version if it’s larger than the original file.
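The header-based check can be sketched by matching a file’s leading bytes against the published magic numbers of common already-compressed formats (the list here is illustrative, not exhaustive):

```python
# Published magic bytes of some common already-compressed formats.
COMPRESSED_MAGIC = {
    b"\x1f\x8b": "gzip",
    b"\x89PNG": "png",
    b"\xff\xd8\xff": "jpeg",
    b"PK\x03\x04": "zip",
    b"\xfd7zXZ\x00": "xz",
}

def looks_compressed(header: bytes) -> bool:
    """Return True if the leading bytes match a known compressed format."""
    return any(header.startswith(magic) for magic in COMPRESSED_MAGIC)

looks_compressed(b"\x1f\x8b\x08\x00")   # True  (gzip header)
looks_compressed(b"hello world")        # False (plain text)
```

A real system would combine this with the extension check and fall back to the compress-and-discard approach for unrecognized data.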
Q. If we were to compress on a storage device (SSD), what do you think are the top challenges? Error propagation? Latency/QoS or other?
A. Compressing on a storage device could mean higher latency for the storage device, both when writing files (if compression is inline) and when reading files back (as they are decompressed). But it’s likely this latency would otherwise exist somewhere else in the system if the files were being compressed and decompressed somewhere other than on the storage device. Compressing (and decompressing) on the storage device means the data will be transmitted to (and from) the storage uncompressed, which could consume more bandwidth. If an SSD is doing post-compression (i.e. compression after the file is stored, not inline as the file is being stored), it would likely cause more wear on the SSD because each file is written twice.
Q. Are all these CPU-based compression analyses?
A. Yes, these are CPU-based compression analyses.
Q. Can you please characterize the performance difference between, say, LZ4 and Deflate in terms of microseconds or nanoseconds?
A. Extrapolating from the data available here, an 8KB request using LZ4 fast level 3 (lz4fast 1.9.2 -3) would take 9.78 usec for compression and 1.85 usec for decompression, while using zlib level 1, an 8KB request takes 68.8 usec for compression and 21.39 usec for decompression. Another aspect to note is that while LZ4 fast level 3 takes significantly less time, its compression ratio is 50.52% while zlib level 1’s is 36.45%, showing that better compression ratios can have a significant cost.
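LZ4 bindings are not in the Python standard library, so as a rough illustration of the same speed-versus-ratio trade-off, here is a sketch comparing zlib levels on an 8KB buffer (timings will vary by machine and are not the figures quoted above):

```python
import random
import timeit
import zlib

rng = random.Random(1)
# ~8KB of semi-compressible data: repeated tokens mixed with noise bytes.
data = b"".join(
    rng.choice([b"header", b"payload", bytes([rng.randrange(256)])])
    for _ in range(2000)
)[:8192]

for level in (1, 6, 9):
    t = timeit.timeit(lambda: zlib.compress(data, level), number=200) / 200
    ratio = len(zlib.compress(data, level)) / len(data)
    print(f"zlib level {level}: {t * 1e6:6.1f} us/op, ratio {ratio:.2%}")
```

The pattern matches the answer above: higher levels spend more CPU time per operation to shave the compressed size down further.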
Q. How important is the compression ratio when you are using specialty products?
A. The compression ratio is a very important result for any compression algorithm or implementation.
Q. In slide #15, how do we choose the binary code form for the characters?
A. The binary code form in this example is entirely controlled by the frequency of occurrence of the symbol within the data stream: the higher the symbol frequency, the shorter the binary code assigned. The algorithm used here is just for illustrative purposes and would not be used (at least in this manner) in a standard. Huffman encoding as used in DEFLATE is a good example of a defined encoding algorithm.
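A minimal sketch of frequency-driven code assignment, using classic heap-based Huffman construction (illustrative only; it is not the canonical procedure DEFLATE itself specifies):

```python
import heapq

def huffman_codes(freqs: dict[str, int]) -> dict[str, str]:
    """Build Huffman codes: the more frequent the symbol, the shorter its code."""
    # Heap entries: (total frequency, tie-break counter, {symbol: code-so-far}).
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        # Merge the two lowest-frequency subtrees, prefixing their codes.
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (f1 + f2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

codes = huffman_codes({"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5})
# 'a' (most frequent) gets a 1-bit code; rare 'e' and 'f' get 4-bit codes.
```

This shows exactly the property described in the answer: code length is driven by symbol frequency, with the most common symbol getting the shortest code.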
This webcast was part of a SNIA NSF series on data reduction. Please check out the other two sessions:
- Everything You Wanted to Know About Storage But Were Too Proud to Ask: Data Reduction – Available on-demand
- Not Again! Data Deduplication for Storage Systems – Live on November 10, 2020, on-demand after.