Data Domain Deduplication Calculator FAQ

Is there a basic underlying assumption of this calculator?

Yes. The more full copies of your data set that are processed, and the longer they are retained, the higher your deduplication results should be. The implications of deduplication become clear when you compare the alternatives: the Data Domain solution, disk/VTL, or tape.

What is the difference between weekly full and daily full backups?

In a typical data center, some data sets (e.g., general file systems) are backed up in full on a weekly basis, with daily incremental backups performed between full runs. Other data sets are backed up in their entirety every day (e.g., Exchange, databases, etc.). The total backup set is a combination of these two policies. The impact on deduplication is modeled from the number of full and incremental data sets that will be written to each solution (Data Domain, disk/VTL, tape), as sketched below.
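
As a rough illustration of how the two policies combine into a weekly workload, consider the sketch below. The function name, sample sizes, and the 5% incremental rate (from the next question) are illustrative assumptions, not the calculator's actual internals.

    # Hypothetical sketch: weekly backup volume under the combined policies.
    def weekly_workload_gb(weekly_full_gb, daily_full_gb, incr_rate=0.05):
        """Total data written per week (GB) under both backup policies."""
        weekly_policy = weekly_full_gb * (1 + 5 * incr_rate)  # 1 full + 5 daily incrementals
        daily_policy = daily_full_gb * 7                      # a full copy every day
        return weekly_policy + daily_policy

    print(weekly_workload_gb(weekly_full_gb=5000, daily_full_gb=1000))  # 13250.0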

Why can’t I specify the incremental changes to my data set?

Reported incremental change rates vary greatly from site to site. Typical rates are around 5% of the total backup data set per day, which results in about 25% net new data per week to include in the sizing estimates and deduplication considerations. The reporting problem stems from the fact that the measurable change rate is affected by file system organization: a small change to a file usually marks the entire file as different, so it is picked up on the next incremental backup. Whether that file is 10 KB or 10 MB, the file system and backup application measure the same small edit very differently, even though the underlying content change that affects deduplication processing is identical. For this reason, this estimate uses a standard 5% daily change rate on the data set that is processed in full on a weekly basis. Contact your local sales team if you would like to try this analysis using other sizing factors.
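
The arithmetic behind that weekly figure is simply the daily rate accumulated across the incrementals between fulls:

    daily_change = 0.05        # assumed 5% daily change rate
    incrementals_per_week = 5  # daily incrementals between weekly fulls
    print(daily_change * incrementals_per_week)  # 0.25 -> ~25% net new data per week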

How does the backup window impact my results?

The backup window is used to estimate the tape drive infrastructure required at the primary site (see tape technology question below). No library systems or tape drives are considered for the DR site. A simple throughput computation is used where the total weekly backup workload is fit into the total backup window hours allotted. Considerations for specific job sizes, retry needs and incremental backup accumulation are not addressed in this model. The intent is to get a rough sizing model for the infrastructure that might be required.
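
A minimal sketch of that throughput computation, assuming the 25 MB/s per-drive rate from the tape question below (the function name, sample workload, and ceiling rounding are illustrative):

    import math

    def tape_drives_needed(weekly_workload_gb, window_hours, drive_mb_s=25):
        """Drives required to fit the weekly workload into the backup window."""
        workload_mb = weekly_workload_gb * 1024          # workload in MB
        mb_per_drive = drive_mb_s * 3600 * window_hours  # MB one drive moves in the window
        return math.ceil(workload_mb / mb_per_drive)

    # e.g., 13,250 GB of weekly backups into 8-hour nightly windows (56 h/week):
    print(tape_drives_needed(weekly_workload_gb=13250, window_hours=56))  # 3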

What tape technology is being used for comparison?

For the tape solution, an LTO-2 configuration is used. The average throughput rate in this model is 25 MB/s, reflecting an industry-observed average rather than a best-case or worst-case rate. The capacity of LTO-2 is taken as 250 GB per tape: 200 GB of base (uncompressed) capacity combined with an approximate 1.6x data compression ratio and an 80% utilization rate (200 GB x 1.6 x 0.8 = 256 GB, rounded to 250 GB). The exact number of tape drives and media can vary based on several other infrastructure and data type considerations.
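
The 250 GB/tape figure follows directly from those approximations:

    base_gb = 200        # LTO-2 native (uncompressed) capacity
    compression = 1.6    # assumed data compression ratio
    utilization = 0.80   # assumed media utilization rate
    print(base_gb * compression * utilization)  # 256.0 GB, rounded to ~250 GB/tape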

How is replication bandwidth determined?

Replication bandwidth is an average value determined by taking the total weekly backup workload and spreading it over 100 hours. This equates to running replication to the DR site continuously for between 14 and 15 hours per day at the estimated rate. If more bandwidth is available, the actual replication time decreases; if more time per day is available, the size of the required replication link can be reduced.
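
A minimal sketch of that bandwidth computation (the function name and sample workload are illustrative assumptions):

    def replication_mbps(weekly_workload_gb, hours=100):
        """Average link rate (megabits/s) to replicate a week's backups."""
        bits = weekly_workload_gb * 1024**3 * 8  # weekly workload in bits
        return bits / (hours * 3600) / 1e6       # spread over the replication hours

    print(round(replication_mbps(13250)))  # ~316 Mb/s; 100 h/week is ~14.3 h/day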

What retention assumptions are built into the model?

This model addresses two aspects of retention requirements. The first is specified by the user and can be set to 4, 8, or 12 weeks (roughly 30, 60, or 90 days), during which all produced backup images (daily fulls, plus weekly fulls with daily incrementals) are kept. The second addresses retention after this period: the assumption is that some number of additional full backup images will be retained at monthly, quarterly, and yearly intervals. The current assumption is that over a three-year period, monthly backups are retained for year 1, quarterly backups for year 2, and year-end backups for year 3. This adds about 10-12 additional full backups to the retention sizing estimates. Contact your local sales team if you would like to try this analysis using other sizing factors.

What assumptions are made for off-site DR configurations?

For this model, the assumption is that all data backed up at the primary data center needs to be duplicated to the DR site: a local copy is kept at the primary data center and a duplicate copy is made for the DR site. While this is a slight oversimplification of the actual solution, it provides a realistic comparison of the various alternatives. Adjustments for tape rotation/reuse, multiple-copy needs, or retention period management are not included in this model.

How does the type of data impact my deduplication results?

While certain types of data can exhibit different deduplication results, this model uses an average deduplication rate observed across customers over several years of handling and deduplicating their data. Actual deduplication rates may vary, but the trends demonstrated in the model estimations should still hold. Contact your local sales team if you would like to try this analysis using other sizing factors.

What is meant by Usable Disk Capacity?

For RAID storage systems, there are measures of raw and usable capacity. For this exercise, we use the Usable Disk Capacity measure as the amount of disk that is needed to store the backup data. Typically, in a RAID-5 or RAID-6 configuration, the usable capacity is about 70-80% of the raw capacity, depending on factors such as the number of hot spare drives, the RAID configuration used, and formatting of the disks for OS access.
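
For example, taking the midpoint of that range:

    raw_tb = 100
    usable_fraction = 0.75  # midpoint of the 70-80% rule of thumb above
    print(raw_tb * usable_fraction)  # 75.0 TB usable from 100 TB raw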

What is meant by Logical Disk Capacity?

For systems that deduplicate data inline as it is stored to disk (Data Domain, for example), the Logical Disk Capacity is the projected amount of storage available for holding backup data, computed as the Usable Disk Capacity multiplied by the deduplication factor.
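
For example, with an illustrative (not model-assumed) deduplication factor:

    usable_tb = 10
    dedup_factor = 20  # illustrative deduplication factor, not the model's rate
    print(usable_tb * dedup_factor)  # 200 TB of logical capacity for backup data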