Deduplication Methods Compared 

Data deduplication is the elimination of redundant (non-unique) data.

In today's computing environment, multiple copies of the same information exist throughout most organizations. Take, for example, a simple document that is copied to the staff of a 100-person company. This document will exist in 100 locations and, if it's important, may be retained for a significant period. Now suppose someone edits the document or attaches some more information: 100 more copies are sent and probably saved. This is one very simple, but common, example of duplicated data.

You can go further and delve into the actual content of the document, where many common elements exist. Examples are a company logo or a signature block: these elements are used over and over, and each instance is identical. Clearly this is a significant waste of resources, and it's not just a storage challenge. This unnecessary duplication of data puts significantly more pressure on all parts of an organization's IT infrastructure. Network traffic constraints and problems with large backup jobs are at the forefront, not to mention the effort of managing this ever-increasing problem. This is the issue that deduplication addresses.

Why is deduplication causing so much interest in IT circles today?
It's game-changing technology, and it won't be long before every storage vendor offers a solution that includes deduplication. Let's examine how it works.

In the deduplication process, duplicate data is identified and then deleted, leaving only one copy of the data to be stored. However, an index of all data is still retained should that data ever be required. Deduplication is able to reduce the required storage capacity since only the unique data is stored. For example, a typical email system might contain 100 instances of the same one megabyte (1 MB) file attachment. If the email platform is backed up or archived, all 100 instances are saved, requiring 100 MB of storage space. With a form of data deduplication called File Level Deduplication or Single Instance Storage (SIS), only one instance of the file is actually stored; each subsequent instance is just referenced back to the one saved copy. In this example, a 100 MB storage demand could be reduced to only 1 MB.
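The email attachment example can be sketched in a few lines of Python. This is only an illustrative toy, not how any particular product works: the class name SingleInstanceStore, its methods, and the mailbox paths are hypothetical, and SHA-256 is assumed as the content fingerprint.

```python
import hashlib


class SingleInstanceStore:
    """Toy file-level (single-instance) store: one copy per unique content."""

    def __init__(self):
        self.blobs = {}  # content hash -> file bytes (stored once)
        self.index = {}  # logical file name -> content hash (the reference)

    def put(self, name, data):
        digest = hashlib.sha256(data).hexdigest()
        # Keep the bytes only if this exact content has not been seen before.
        self.blobs.setdefault(digest, data)
        self.index[name] = digest

    def get(self, name):
        # Follow the reference back to the single saved copy.
        return self.blobs[self.index[name]]

    def logical_bytes(self):
        # Size the data would occupy if every instance were stored in full.
        return sum(len(self.blobs[d]) for d in self.index.values())

    def stored_bytes(self):
        # Size actually held on disk.
        return sum(len(b) for b in self.blobs.values())


MB = 1024 * 1024
attachment = b"x" * MB  # stand-in for a 1 MB attachment

store = SingleInstanceStore()
for i in range(100):
    store.put(f"mailbox_{i}/attachment.bin", attachment)

print(store.logical_bytes() // MB, "MB referenced")      # 100 MB referenced
print(store.stored_bytes() // MB, "MB actually stored")  # 1 MB actually stored
```

Every one of the 100 mailbox entries can still be read back in full, because the index maps each name to the one saved copy.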

This is a simplified explanation of the process, but it serves well to illustrate the great potential savings possible even with the simplest deduplication methods. Good examples of common SIS implementations are Microsoft's 'Windows Storage Server' and 'Windows Home Server', both of which use this technique to substantially increase the effective use of storage and reduce backup times. The home computer backup solution in Windows Home Server has a single-instance store at the cluster level. Clusters are units of data stored on the hard drive, typically 4 kilobytes (KB) in size. Every backup is a full backup, but the home server only stores each unique cluster once. This creates the restore-time convenience of full backups (you do not have to repeat history) with the backup-time performance of incremental backups. For more information on Windows Home Server SIS, a 'Technical Brief' is available from Microsoft.

Deduplication, however, is not restricted to the SIS, or file-level, model. It can also be applied at the block level to achieve further space savings on data storage systems. Block-level deduplication analyses smaller chunks of data, looking for identical strings of binary code. A hash, or digital fingerprint, is created to identify each chunk; once a duplicate is identified, an entry is logged in the index and only one copy of the chunk is stored. Because it's more granular than SIS, it results in enormous space savings on the storage devices, in most cases significantly more than can be achieved with SIS.
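A minimal sketch of the block-level idea follows. It assumes fixed 4 KB blocks and SHA-256 fingerprints; real products may use variable-size chunks and different hashes. The names dedupe_blocks, rebuild and chunk_store are hypothetical.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size for this sketch


def dedupe_blocks(data, chunk_store):
    """Split data into fixed-size blocks and store each unique block once.

    Returns the ordered list of fingerprints needed to rebuild the data.
    """
    recipe = []
    for offset in range(0, len(data), BLOCK_SIZE):
        block = data[offset:offset + BLOCK_SIZE]
        fingerprint = hashlib.sha256(block).hexdigest()
        if fingerprint not in chunk_store:
            chunk_store[fingerprint] = block  # first time this block is seen
        recipe.append(fingerprint)            # index entry for this position
    return recipe


def rebuild(recipe, chunk_store):
    """Reassemble the original data from its list of fingerprints."""
    return b"".join(chunk_store[fp] for fp in recipe)


# Two versions of a file that differ only in the middle block.
version1 = b"A" * BLOCK_SIZE + b"B" * BLOCK_SIZE + b"C" * BLOCK_SIZE
version2 = b"A" * BLOCK_SIZE + b"X" * BLOCK_SIZE + b"C" * BLOCK_SIZE

chunk_store = {}
recipe1 = dedupe_blocks(version1, chunk_store)
recipe2 = dedupe_blocks(version2, chunk_store)

print(len(chunk_store), "unique blocks stored")          # 4 unique blocks stored
print(len(recipe1) + len(recipe2), "logical blocks")     # 6 logical blocks
```

File-level deduplication would treat the two versions as different files and keep all six blocks; at the block level only the one changed block is added.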


As vendors raced to introduce their de-dupe offerings, new methods and enhancements were developed. Some new methods introduced significant improvements, while others differed solely because of limitations in the vendors' pre-existing technology. Let's examine some of the principal variants.

Deduplication for Backup: Disk-to-disk backup initially brought many significant advantages to backup methods, the most significant being much faster recovery times than are possible from tape. This technique became feasible with the advent of large, lower-cost SATA drives together with improved RAID technology. Today, this secondary storage is ideally suited to benefit from deduplication. Removing redundant data uses less space, which in turn allows a greater depth of retention: more versions of your backup data can be stored near-line.

There are two distinct methods of deduplication used for backup: Target-Based and Source-Based.

Target-Based: Target-based deduplication employs a disk storage device as the data repository or target. The data is driven to the target using standard backup software. Once it reaches the device, the deduplication is processed as it enters the target (In-Line Processing), or it is received by the device in its raw data state and is processed after the entire backup job has arrived (Post-Process). There are pros and cons with each of these methods and picking the correct technology for your specific environment is important.

A good example of in-line technology is EMC's Data Domain product line. These appliances have extremely fast and capable processing power and are specifically designed to deduplicate data as fast as it can be supplied to them. In fact, network performance, rather than de-dupe processing, is often found to be the limiting factor in backup speed. A key benefit of in-line systems such as these is the ability to replicate to disaster recovery sites immediately, because the data is deduplicated as soon as it is received.

Post-process products ingest data to local storage, and then process the stored data. In some products, the de-dupe process can start at the same time as the backup starts, but in most cases the process lags the incoming data and can take a considerable time to complete. This method avoids the need for high-performance processing power, which reduces cost. However, there are trade-offs. First, you can't replicate data until the whole backup is deduplicated. Second, the solution has to have more disk storage capacity than an in-line method, as it needs to be able to store a complete backup session in raw, not-yet-deduplicated form.
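The capacity trade-off between the two target-based approaches can be illustrated with a rough sketch. It assumes fixed-size blocks, SHA-256 fingerprints, and a post-process device that lands the whole backup before deduplicating it; the function names are hypothetical and no vendor's implementation is implied.

```python
import hashlib


def inline_ingest(blocks):
    """Deduplicate each block as it arrives. Returns (stored, peak) in bytes."""
    store = {}
    used = peak = 0
    for block in blocks:
        fingerprint = hashlib.sha256(block).digest()
        if fingerprint not in store:
            store[fingerprint] = block
            used += len(block)
            peak = max(peak, used)
    return used, peak


def post_process_ingest(blocks):
    """Land the raw backup first, then deduplicate. Returns (stored, peak)."""
    staging = list(blocks)                # the complete raw backup session
    peak = sum(len(b) for b in staging)   # raw landing area must hold it all
    store = {}
    for block in staging:
        store.setdefault(hashlib.sha256(block).digest(), block)
    used = sum(len(b) for b in store.values())
    return used, peak


# A backup of 100 blocks containing only 10 distinct blocks.
backup = [bytes([i % 10]) * 4096 for i in range(100)]

print("in-line      (stored, peak):", inline_ingest(backup))        # (40960, 40960)
print("post-process (stored, peak):", post_process_ingest(backup))  # (40960, 409600)
```

Both approaches end up storing the same unique data; the difference in this simplified model is the peak capacity needed while the backup is being received.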

Source-Based: With source-based deduplication, the data is most often deduplicated by software agents installed on the source servers, working with the central deduplication appliance. Only unique data is sent across the network. In addition to the capacity benefits, there are significant advantages to be gained from the reduction in network traffic. This can be very beneficial for organizations with large campuses, or with remote offices, that back up to a central location.
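The exchange between a source agent and the central appliance might look roughly like the sketch below. The DedupeServer class and backup_from_source function are hypothetical, the "network" is just a function call, and the small amount of traffic needed to send the fingerprints themselves is ignored.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size for this sketch


class DedupeServer:
    """Stand-in for the central deduplication appliance."""

    def __init__(self):
        self.chunks = {}  # fingerprint -> block

    def missing(self, fingerprints):
        # Tell the agent which fingerprints the appliance does not hold yet.
        return {fp for fp in fingerprints if fp not in self.chunks}

    def upload(self, chunks):
        self.chunks.update(chunks)


def backup_from_source(data, server):
    """Agent side: send only blocks the server lacks. Returns bytes sent."""
    blocks = {}
    for offset in range(0, len(data), BLOCK_SIZE):
        block = data[offset:offset + BLOCK_SIZE]
        blocks[hashlib.sha256(block).hexdigest()] = block
    needed = server.missing(blocks.keys())
    payload = {fp: blocks[fp] for fp in needed}
    server.upload(payload)
    return sum(len(block) for block in payload.values())


server = DedupeServer()
monday = b"".join(bytes([i]) * BLOCK_SIZE for i in range(8))  # 8 distinct blocks
tuesday = monday[:-BLOCK_SIZE] + b"\xff" * BLOCK_SIZE         # last block changed

first_sent = backup_from_source(monday, server)
second_sent = backup_from_source(tuesday, server)

print("first backup sent:", first_sent, "bytes")    # first backup sent: 32768 bytes
print("second backup sent:", second_sent, "bytes")  # second backup sent: 4096 bytes
```

The second backup is the same size as the first, but only the one changed block crosses the network.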

Examples of source-based deduplication are EMC's Avamar and CommVault's Simpana 9. Avamar is based on a storage grid appliance, and priced on capacity of this appliance with unlimited, no additional cost, software agents for the source servers. Simpana 9 is a pure software solution, and will run on a wide range of hardware. However, the performance capabilities of the equipment must be up to the task, and CommVault makes some recommendations in this regard. Software-based solutions provide for a great deal of flexibility but often at the cost of increased complexity, so it is important that your chosen supplier is experienced with not only the chosen product, but also its suitability to the specific application.

Primary Storage Deduplication: The same deduplication processes can be applied to primary storage systems, removing non-unique data and significantly increasing effective capacity. Primary storage is usually higher performing and therefore more costly, so this application can provide a significant improvement to the overall ROI. Current offerings in primary storage deduplication are based on post-processing and typically run this process overnight, or at periods of lower demand, so as to reduce any processing impact during peak times. NetApp were leaders in this area and have offered primary de-dupe on select products for several years.

Compression Ratios: How much data will you actually be able to store? This is hard to answer without experience in specifying these solutions. Two questions must be answered to determine the effective compression ratio: how much does your data change, and how many backups do you run per day or per hour? It follows that the more changed data and the more backups done, the lower the effective deduplication will be. However, because the data is examined at the block level, these ratios will still be very significant.

File types: One of the most important factors that affects the ability to deduplicate data is the actual type of data. Data that has a random nature is, in general, less likely to be effectively reduced in size. Examples of this are any encrypted data, video files and other graphics files. On the other hand, general business data, operating systems, email, and documents all produce very significant reductions. Deduplication excels in virtualized environments where multiple clones of an operating environment are common.
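The effect of data type can be seen in a small experiment. This sketch assumes fixed 4 KB blocks and uses seeded pseudo-random bytes as a stand-in for encrypted or already-compressed content; the dedupe_ratio function and the sample data are hypothetical.

```python
import hashlib
import random

BLOCK_SIZE = 4096  # assumed fixed block size for this sketch


def dedupe_ratio(data):
    """Return logical blocks divided by unique blocks for the given data."""
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    unique = {hashlib.sha256(block).digest() for block in blocks}
    return len(blocks) / len(unique)


rng = random.Random(0)  # seeded so the example is repeatable

# Stand-in for encrypted or compressed data: 64 blocks of random-looking bytes.
random_like = bytes(rng.getrandbits(8) for _ in range(64 * BLOCK_SIZE))

# Stand-in for four identical clones of a 16-block operating system image.
os_image = bytes(rng.getrandbits(8) for _ in range(16 * BLOCK_SIZE))
cloned_images = os_image * 4

print("random-like data:", dedupe_ratio(random_like))   # random-like data: 1.0
print("cloned images:  ", dedupe_ratio(cloned_images))  # cloned images:   4.0
```

A ratio of 1.0 means no block repeats, so nothing is saved; the four clones reduce to a single copy of the image.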

Virtualized Environments: Data deduplication is a particularly valuable tool within the VMware environment, where there is often much duplication between the various virtual server instances. The ability to deduplicate the VMDK files needed for the deployment of virtual environments, and snapshot files such as VMSN and VMSD, will result in considerable cost savings compared to conventional disk backup methods and allow more recovery points.

Making the right choice: While it is reasonably clear that deduplication will bring substantial benefits by way of the tremendous efficiencies it allows, it is much less certain which is the best method to utilize. In most cases, the choice will be influenced by the existing infrastructure and most recent investments. If, for example, you have recently upgraded your suite of backup software, it is unlikely you would want to make a change that involved replacing it. A target-based solution would, in most cases, be a good non-disruptive addition.

If a recent investment was in storage for D2D backup, adding a software product that provided deduplication would be the best choice. File type and network constraints also play a large part in designing the optimum configuration. The good news is that there is certainly a deduplication solution that is right for you.

Open Storage Solutions knows deduplication: With the wide choice in deduplication technology available today, it is easy to make a mistake and select less than optimal technology. It makes extremely good sense to talk to us about how deduplication would best be implemented in your environment.

We will do a simple assessment of your actual data and show you what you should expect to see in capacity enhancements. We will show you how this technology can reduce your costs and we will suggest the best technology for your specific data and IT topology.

In conclusion, data deduplication improves data protection, increases the speed of service, and reduces costs. The business benefits from data deduplication start with increasing overall data integrity and end with reducing overall data protection costs. Data deduplication lets users reduce the amount of disk they need for backup by a substantial amount. With reduced acquisition costs and reduced power, space, and cooling requirements, disk becomes suitable for first stage backup and restore, and for retention that can easily extend to months. With data on disk, restore service levels are higher, media handling errors are reduced, and more recovery points are available on fast recovery media. Data deduplication can also reduce the data that must be sent across a WAN for remote backups, replication, and disaster recovery.

Next step: try our online calculator for instant feedback of your anticipated results and then call us for a free assessment.

Call us today: 1 800 387 3419