Data Deduplication FAQ

Alex McDonald

Jan 5, 2021

The SNIA Networking Storage Forum (NSF) recently took on the topics surrounding data reduction with a 3-part webcast series that covered Data Reduction Basics, Data Compression and Data Deduplication. If you missed any of them, they are all available on-demand.

In “Not Again! Data Deduplication for Storage Systems,” our SNIA experts discussed how to reduce the number of copies of data that get stored, mirrored, or backed up. Attendees asked some interesting questions during the live event, and here are answers to them all.

Q. Why do we use the term rehydration for deduplication?  I believe the use of the term rehydration when associated with deduplication is misleading. Rehydration is the activity of bringing something back to its original content/size as in compression. With deduplication the action is more aligned with a scatter/gather I/O profile and this does not require rehydration.

A. "Rehydration" is used to cover the reversal of both compression and deduplication. It is used more often to cover the reversal of compression, though there isn't a popularly-used term to specifically cover the reversal of deduplication (such as "re-duplication").  When reading compressed data, if the application can perform the decompression then the storage system does not need to decompress the data, but if the compression was transparent to the application then the storage (or backup) system will decompress the data prior to letting the application read it. You are correct that deduplicated files usually remain in a deduplicated state on the storage when read, but the storage (or backup) system recreates the data for the user or application by presenting the correct blocks or files in the correct order.

Q. What is the impact of doing variable vs fixed block on primary storage Inline?

A. Deduplication is a resource-intensive process. Sifting the data inline by anchoring, fingerprinting and then filtering for duplicates not only requires significant computational resources, but also adds latency on writes. For primary storage systems that require high performance and low latencies, it is best to keep these impacts of dedupe low. Doing dedupe with variable-sized blocks or extents (e.g., with Rabin fingerprinting) is more intensive than using simple fixed-sized blocks. However, variable-sized segmentation is likely to give higher storage efficiency in many cases. Most often this tradeoff between latency/performance and storage efficiency tips in favor of applying simpler fixed-size dedupe in primary storage systems.
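
The sketch below illustrates the two chunking styles. It is a simplified illustration rather than production code: the content-defined chunker uses a toy rolling checksum as a stand-in for a true sliding-window Rabin fingerprint, and real systems also enforce minimum and maximum chunk sizes.

```python
def fixed_chunks(data: bytes, size: int = 4096) -> list[bytes]:
    """Fixed-size chunking: cheap to compute, but an insertion early in the
    data shifts every later boundary and defeats duplicate detection."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def variable_chunks(data: bytes, min_size: int = 2048, mask: int = 0x1FFF) -> list[bytes]:
    """Content-defined chunking: boundaries are cut where a rolling value
    matches a pattern, so they follow the content and an insertion only
    disturbs nearby chunks, at a higher CPU cost per byte."""
    chunks, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling = ((rolling * 31) + byte) & 0xFFFFFFFF   # toy rolling value
        if i - start + 1 >= min_size and (rolling & mask) == 0:
            chunks.append(data[start:i + 1])
            start, rolling = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks
```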

Q. Are there special considerations for cloud storage services like OneDrive?

A. As far as we know, Microsoft OneDrive avoids uploading duplicate files that have the same filename, but does not scan file contents to deduplicate identical files that have different names or different extensions. As with many remote/cloud backup or replication services, local deduplication space savings do not automatically carry over to the remote site unless the entire volume/disk/drive is replicated to the remote site at the block level. Please contact Microsoft or your cloud storage provider for more details about any space savings technology they might use.

Q. Do we have an error rate calculation system to decide which type of deduplication we use?

A. The choice of deduplication technology largely depends on the characteristics of the dataset and the environment in which deduplication is done. For example, if the customer is running a performance- and latency-sensitive system for primary storage purposes, then the cost of deduplication in terms of the resources and latencies incurred may be too high, and the system may use very simple fixed-size block-based dedupe. However, if the system/environment allows for spending extra resources for the sake of storage efficiency, then a more complicated variable-sized extent-based dedupe may be used. As for error rates themselves, a dedupe storage system should always be built with strong cryptographic hash-based fingerprinting so that the rate of collisions is extremely low. Collisions in a dedupe system may lead to data loss or corruption, but as mentioned earlier these can be avoided by using strong cryptographic hash functions.
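
As a simple illustration of fingerprint-based dedupe with a strong cryptographic hash, here is a hypothetical sketch using SHA-256; the store and function names are invented for illustration only:

```python
import hashlib

# Hypothetical chunk store keyed by a strong cryptographic fingerprint.
# With SHA-256, the probability of two different chunks colliding is
# negligible, which keeps the dedupe "error rate" effectively zero.
store: dict[str, bytes] = {}

def dedupe_write(chunk: bytes) -> str:
    """Store a chunk only if its fingerprint has not been seen before."""
    fp = hashlib.sha256(chunk).hexdigest()
    if fp not in store:          # only new, unique chunks consume space
        store[fp] = chunk
    return fp                    # callers keep the fingerprint as a reference
```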

Q. Considering the current SSD QLC limitations and endurance... can we say that it is the right choice for deduped storage?

A. In-line deduplication either has no effect or reduces the wear on NAND storage because less data is written. Post-process deduplication usually increases wear on NAND storage because blocks are written and then erased when duplicates are removed, and the freed space later fills with new data. If the system uses post-process deduplication, then the storage software or storage administrator needs to weigh the space savings benefits against the increased wear on NAND flash. Since QLC NAND is usually less expensive and has lower write endurance than SLC/MLC/TLC NAND, one might be less likely to use post-process deduplication on QLC NAND than on more expensive NAND with higher endurance levels.
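
The difference in write load can be shown with a small, hypothetical comparison (the function names and counters are invented for illustration): under in-line dedupe a duplicate chunk is never written to flash at all, while under post-process dedupe every chunk is written first and duplicates are reclaimed later.

```python
import hashlib

def inline_flash_bytes(chunks: list[bytes]) -> int:
    """In-line dedupe: a duplicate chunk is dropped before it reaches flash."""
    seen, written = set(), 0
    for chunk in chunks:
        fp = hashlib.sha256(chunk).digest()
        if fp not in seen:
            seen.add(fp)
            written += len(chunk)
    return written

def post_process_flash_bytes(chunks: list[bytes]) -> int:
    """Post-process dedupe: every chunk is written first; duplicates are
    erased later, so the media absorbs the full write load plus erase cycles."""
    return sum(len(chunk) for chunk in chunks)
```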

Q. On slides 11/12 - why not add compaction as well - "fitting" the data onto respective blocks and "if 1k file, not leaving the rest 3k of 4k block empty"?

A. We covered compaction in our webcast on data reduction basics “Everything You Wanted to Know About Storage But Were Too Proud to Ask: Data Reduction.” See slide #18 below.
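
As a rough illustration of the idea behind compaction, the hypothetical sketch below greedily packs small file payloads into shared 4 KiB blocks instead of giving each small file its own block; it assumes every payload fits within a single block.

```python
BLOCK_SIZE = 4096   # assumed block size for illustration

def compact(payloads: list[bytes]) -> list[bytes]:
    """Greedily pack small file payloads into shared fixed-size blocks,
    so a 1 KiB file does not waste the remaining 3 KiB of a 4 KiB block.
    Assumes each payload is no larger than one block."""
    blocks, current = [], b""
    for data in payloads:
        if len(current) + len(data) > BLOCK_SIZE:
            blocks.append(current)
            current = b""
        current += data
    if current:
        blocks.append(current)
    return blocks
```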

Again, I encourage you to check out this Data Reduction series and follow us on Twitter @SNIANSF for dates and topics of more SNIA NSF webcasts.

Not Again! Data Deduplication for Storage Systems

Alex McDonald

Oct 7, 2020

As explained in our webcast on Data Reduction, “Everything You Wanted to Know About Storage But Were Too Proud to Ask: Data Reduction,” organizations inevitably store many copies of the same data. Intentionally or inadvertently, users and applications copy and store the same files over and over; with developers, testers and analysts keeping many more copies. And backup programs copy the same or only slightly modified files daily, often to multiple locations and storage devices.  It’s not unusual to end up with some data replicated thousands of times, enough to drive storage administrators and managers of IT budgets crazy. 

So how do we stop the duplication madness? Join us on November 10, 2020 for a live SNIA Networking Storage Forum (NSF) webcast, “Not Again! Data Deduplication for Storage Systems”  where our SNIA experts will discuss how to reduce the number of copies of data that get stored, mirrored, and backed up.

Attend this sanity-saving webcast to learn more about: 

  • Eliminating duplicates at the desktop, server, storage or backup device
  • Dedupe technology, including local vs global deduplication
  • Avoiding or reducing the creation of data copies (non-duplication)
  • Block-level vs. file- or object-level deduplication
  • In-line vs. post-process deduplication
  • More efficient backup techniques

Register today (but only once please) for this webcast so you can start saving space and end the extra data replication.

Data Reduction: Don’t Be Too Proud to Ask

John Kim

Jul 29, 2020

It’s back! Our SNIA Networking Storage Forum (NSF) webcast series “Everything You Wanted to Know About Storage but Were Too Proud to Ask” will return on August 18, 2020. After a little hiatus, we are going to tackle the topic of data reduction.

Everyone knows data volumes are growing rapidly (25-35% per year according to many analysts), far faster than IT budgets, which are constrained to flat or minimal annual growth rates. One of the drivers of such rapid data growth is storing multiple copies of the same data. Developers copy data for testing and analysis. Users email and store multiple copies of the same files. Administrators typically back up the same data over and over, often with minimal to no changes.

To avoid a budget crisis and paying more than once to store the same data, storage vendors and customers can use data reduction techniques such as deduplication, compression, thin provisioning, clones, and snapshots. 

On August 18th, our live webcast “Everything You Wanted to Know about Storage but Were Too Proud to Ask – Part Onyx” will focus on the fundamentals of data reduction, which can be performed in different places and at different stages of the data lifecycle. As with most technologies, there are several related ways to do this, with enough differences between them to cause confusion. For that reason, we’re going to be looking at:

  • How companies end up with so many copies of the same data
  • Difference between deduplication and compression – when should you use one vs. the other?
  • Where and when to reduce data: application-level, networked storage, backups, and during data movement. Is it best done at the client, the server, the storage, the network, or the backup?
  • What are snapshots, clones, and thin provisioning, and how can they help?
  • When to collapse the copies: real-time vs. post-process deduplication
  • Performance considerations

Register today for this efficient and educational webcast, which will cover valuable concepts with minimal repetition or redundancy!
