By Adrian Moir.
Deduplication is one of the hottest technologies in the current market because of its ability to reduce costs. But it comes in many flavours, and organizations need to understand each of them in order to choose the one that best fits their requirements.
Deduplication is the process of examining a data set or byte stream and storing and/or sending only unique data; duplicate data is replaced with a pointer to the first occurrence of that data. Some IT professionals assume that deduplication and Single Instance Store (SIS) are the same thing, but they are not. The key difference is that SIS evaluates the data stream at the file level: if a user renames a file, SIS sees it as new and stores it again, whereas deduplication recognizes the entire internal contents of the file as duplicate. As a result, SIS delivers smaller space savings.
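To make the idea concrete, here is a minimal sketch of block-level deduplication in Python, assuming fixed 4 KB blocks and SHA-256 fingerprints (illustrative choices rather than any particular vendor's design). Duplicate blocks are replaced by pointers (fingerprints) into a store of unique blocks.

```python
# Minimal sketch of block-level deduplication, assuming fixed 4 KB blocks
# and SHA-256 fingerprints (illustrative choices, not any vendor's design).
import hashlib

BLOCK_SIZE = 4096

def deduplicate(data: bytes):
    store = {}      # fingerprint -> unique block contents
    recipe = []     # ordered fingerprints: the "pointers" used to rebuild the stream
    for offset in range(0, len(data), BLOCK_SIZE):
        block = data[offset:offset + BLOCK_SIZE]
        fp = hashlib.sha256(block).hexdigest()
        if fp not in store:          # first occurrence: keep the block
            store[fp] = block
        recipe.append(fp)            # duplicates become pointers only
    return store, recipe

def rehydrate(store, recipe) -> bytes:
    # Restoring means reassembling blocks from the store in recipe order.
    return b"".join(store[fp] for fp in recipe)
```

The `rehydrate` step also hints at why restores from deduplicated storage involve reassembly work, a point that comes up again below when comparing inline and post-process approaches.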
All deduplication journeys end with a significantly reduced amount of data on disk, but the ways they get there can differ greatly. The two prevailing methods are fixed-block length and variable-block length; with the latter, the deduplication engine can adjust the block size and recognize more duplicate patterns, thereby decreasing the amount of data stored and increasing the space savings.
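The difference can be sketched as follows. The boundary rule used here (a hash over a small sliding window, with assumed minimum and maximum chunk sizes) is a simplified stand-in for the rolling-hash schemes real variable-block engines use, but it shows why variable boundaries can re-synchronize after data is inserted, while fixed blocks shift and stop matching.

```python
# Sketch contrasting fixed-block and variable-block (content-defined) chunking.
# Window, mask and size limits are illustrative assumptions.
import hashlib

def fixed_chunks(data: bytes, size: int = 4096):
    # Cut at every multiple of the block size, regardless of content.
    return [data[i:i + size] for i in range(0, len(data), size)]

def variable_chunks(data: bytes, window: int = 48, mask: int = 0x0FFF,
                    min_size: int = 2048, max_size: int = 8192):
    # Cut wherever a hash of the trailing window matches a pattern,
    # so chunk boundaries depend on content rather than position.
    chunks, start = [], 0
    for i in range(len(data)):
        length = i - start + 1
        if length < min_size:
            continue
        digest = hashlib.sha1(data[max(start, i - window + 1):i + 1]).digest()
        boundary = (int.from_bytes(digest[:4], "big") & mask) == 0
        if boundary or length >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks
```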
Inline and post process deduplication also offer different advantages and tradeoffs. With inline deduplication, data is deduplicated before being stored on disk; this approach does not require any additional disk space to store the data prior to deduplication but has the following tradeoffs:
- It lengthens the time to complete the backup, leading to longer backup windows and degraded performance during business hours, as well as the risk that the next backup cannot start because the previous job is still running;
- It does not allow the flexibility to leave data that deduplicates poorly in a non-deduplicated state;
- It often forces users to ‘rehydrate’ the whole backup to recover a single file, making restores slower.
With post-process deduplication, the backup is briefly placed on disk-based staging storage prior to being deduplicated. Some technologies allow deduplication to start after a set amount of the data stream has been staged, reducing the sizing requirements for the staging storage while allowing the backups to complete as fast as possible. So although post-process deduplication requires additional disk space for the staging area, it enables faster backups and a shorter backup window, it allows data that does not deduplicate well to be left non-deduplicated, and it offers faster restores.
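As a rough illustration of the staging flow, the sketch below lands the backup on a hypothetical staging path at full speed and deduplicates it in a separate, later pass; the paths, block size and in-memory store are assumptions for the example, not a product design.

```python
# Sketch of post-process deduplication: the backup is written to a staging
# area at full speed, and a separate pass deduplicates it later (for example
# outside the backup window). Paths and block size are illustrative.
import hashlib, os, shutil

STAGING_DIR = "/backup/staging"      # hypothetical staging area
BLOCK_SIZE = 4096

def stage_backup(source_path: str, job_name: str) -> str:
    """Land the backup on staging storage as fast as possible."""
    staged = os.path.join(STAGING_DIR, job_name)
    shutil.copyfile(source_path, staged)
    return staged

def post_process(staged_path: str, store: dict) -> list:
    """Later pass: replace staged blocks with pointers into the block store."""
    recipe = []
    with open(staged_path, "rb") as f:
        while block := f.read(BLOCK_SIZE):
            fp = hashlib.sha256(block).hexdigest()
            store.setdefault(fp, block)   # keep only the first occurrence
            recipe.append(fp)
    os.remove(staged_path)                # staging space is reclaimed
    return recipe
```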
Source-side deduplication typically uses a client-located deduplication engine that will check for duplicates against a centrally-located deduplication index, typically located on the backup server or media server. Only unique blocks will be transmitted to the disk. The advantage of source-side deduplication is that it reduces network contention because less data is sent over it.
However, source-side deduplication adds hashing, a processor-intensive operation, to the client. Clients that are already heavily loaded will become even more stretched, potentially slowing the backups and lengthening the backup window.
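The exchange can be sketched roughly as follows, with an in-process object standing in for the central index on the backup or media server; the block size, class and function names are illustrative assumptions, and the hashing work in the client function is the load that falls on the client.

```python
# Sketch of source-side deduplication: the client hashes blocks locally,
# asks the central index which fingerprints are new, and transmits only
# those blocks. The "index" object stands in for the backup/media server.
import hashlib

BLOCK_SIZE = 4096

def transmit(fp, block):
    pass  # placeholder for the network send in this sketch

class DedupIndex:                      # lives on the backup/media server
    def __init__(self):
        self.known = set()
    def filter_unknown(self, fingerprints):
        return [fp for fp in fingerprints if fp not in self.known]
    def commit(self, fp):
        self.known.add(fp)

def client_backup(data: bytes, index: DedupIndex):
    # Hashing happens here, on the client -- this is the CPU cost noted above.
    blocks = {hashlib.sha256(data[i:i + BLOCK_SIZE]).hexdigest():
              data[i:i + BLOCK_SIZE]
              for i in range(0, len(data), BLOCK_SIZE)}
    to_send = index.filter_unknown(blocks.keys())   # small fingerprints cross the wire
    for fp in to_send:                              # only unique blocks are transmitted
        transmit(fp, blocks[fp])
        index.commit(fp)
```

The bandwidth saving comes from the fact that the index lookup exchanges small fingerprints rather than whole blocks, so only genuinely new data travels over the network.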
Target-side deduplication is generally better suited for data-intensive environments and runs the deduplication at the storage level, removing the need for clients with plenty of 'horsepower', because the hashing occurs at the target. The trade-off is that more data is sent over the network. Different vendors offer solutions that mix and match the 'when' and the 'where': for example, one solution may do inline deduplication starting at the source, while another may do post-process deduplication at the target.
A final criterion to review when evaluating deduplication technologies is how long to retain data: the more data that is examined, the greater the likelihood of finding duplicates and hence the greater the space savings. For example, an initial full backup will only be deduplicated against itself, but when the next full backup is performed, only the unique data that has been updated or added since the first full backup will be stored. When deduplicating backups, each additional week of retention consumes a diminishing amount of additional disk space, allowing organizations to store more backups on the existing storage for a longer period and virtually eliminating the need to restore from offsite storage unless there is a complete site failure.
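As a back-of-the-envelope illustration (the 1 TB full backup and 5% weekly unique-change rate are assumed figures, not benchmarks), eight weekly full backups that would occupy 8 TB raw can fit in roughly 1.35 TB once only the weekly unique changes are stored:

```python
# Illustrative retention arithmetic: assumed 1 TB full backup and a 5%
# weekly unique-change rate (example figures only).
full_backup_tb = 1.0
weekly_change_rate = 0.05
weeks_retained = 8

without_dedup = full_backup_tb * weeks_retained
with_dedup = full_backup_tb + full_backup_tb * weekly_change_rate * (weeks_retained - 1)

print(f"Raw storage for {weeks_retained} weekly fulls: {without_dedup:.2f} TB")
print(f"Deduplicated storage:                        {with_dedup:.2f} TB")
# -> 8.00 TB versus 1.35 TB under these assumptions
```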
So, in summary, what should users consider when planning a deduplication strategy? Their goal(s) will influence the deduplication technologies they should evaluate. The following are some typical deduplication goals and considerations:
Maximum disk space savings
- Deduplication offers more disk space savings than SIS;
- Variable-block deduplication saves more space than fixed-block;
- Inline deduplication reduces disk space requirements;
- Source-side deduplication can increase the disk space savings;
- Retaining deduplicated data longer will allow users to store even more backups on the same amount of disk storage for a longer period.
Maximum flexibility
- Post-process deduplication offers the ability to leave data that does not deduplicate well in a non-deduplicated state, ensuring that valuable time and processing power are not wasted on data that will not benefit from deduplication;
- With post-process deduplication, restores are faster;
- Post-process deduplication allows users to provision data on existing storage, which can be up to 1/10 the cost of appliance storage.
Shorter backup windows
- Post-process deduplication can be scheduled to occur outside the backup window;
- Target-side deduplication does not unnecessarily elongate backup windows.
Deduplication can lead to significant savings in terms of time, human resources and, of course, budget. Although the technology continues to develop, there are several proven solutions already on the market today, and organizations that choose the right products to meet their requirements will find that few storage technologies have made such a difference to their data centres.
Author: Adrian Moir is technical director EMEA, BakBone Software.
