File Level Deduplication

The main reason to use file-level deduplication (available in both Community and Enterprise) is having exactly the same files on several different machines. This may be the case when the operating systems or applications are identical and upgraded to the same version. It is also only recommended when, for some reason, you cannot use another type of block-level deduplication (for example, Bacula Enterprise Global Deduplication, which also covers magnetic tapes and cloud storage).

This deduplication works in the following way: you first perform a special milestone backup with a specific Bacula level, the Base Job (always a kind of Full), which should also be written to a different Pool. Every client backup job that is configured to compare its contents with the Base Job will then skip copying the files that have already been copied by that Base Job.

If deduplication is set up correctly, the job log shows the proportion of files that were not backed up because they had already been copied by the corresponding Base Job.

In theory, you could set up a Base backup of a single machine and compare only its future backups with that Base. The practical effect is that even when you run a Full backup job, Bacula will only copy the files that have changed since the last successful Base-level job. This is essentially the same behavior as performing a Full backup followed by a differential/incremental, so there is no clear advantage in using this deduplication for a single machine.

Configuring Bacula Deduplication:

a) Add the following directives in bacula-dir.conf to each Job that may contain files similar to the Base Job and that should be deduplicated:

Job {
  Name = BackupHeitor
  Base = BackupHeitor      # Base Job(s) whose contents this Job is compared against
  Accurate = yes           # mandatory for Base Job deduplication
  Schedule = base_schedule
  FileSet = debian_7_set
  ...
}

The Base directive tells Bacula which regular and Base backup jobs will be compared so that similar files are not copied again. In the example above, the BackupHeitor job is compared against the runs of itself that are executed at the Base level. The Accurate = yes directive is also mandatory.
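
The Base directive can also list more than one job, which is how deduplication across several machines is achieved: a client job may reference a Base Job taken from another, similarly installed machine. A minimal sketch, assuming the hypothetical job names BackupOtherHost and BackupTemplate:

Job {
  Name = BackupOtherHost
  # Compare against this job's own Base runs and against the shared template machine's Base Job
  Base = BackupOtherHost, BackupTemplate
  Accurate = yes
  ...
}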

b) Don’t leave bacula-dir.conf yet. You also need to make some changes to the FileSet used by the original regular backup:

FileSet {
  Name = debian_7_set
  Include {
    Options {
      BaseJob = pmugcs5
      Accurate = mcs5
      Verify = pin5
    }
    File = /etc
    File = /var
    File = /opt
  }
  ...
}

Each of these options controls how Bacula searches for and compares files between the Base and regular backup jobs; the attribute codes are the same as those supported by the Bacula Verify job (see the annotated example below).
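
For reference, here is the same Options block annotated with the meaning of each attribute code; the letter meanings follow the standard Bacula Verify option codes and are given as an aid only, not as additional configuration:

Options {
  BaseJob = pmugcs5   # permission bits, mtime, uid, gid, ctime, size, MD5 checksum
  Accurate = mcs5     # mtime, ctime, size, MD5 checksum
  Verify = pin5       # permission bits, inode number, number of links, MD5 checksum
}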

c) Add at least one new Pool and Schedule for the Base backup jobs. Use a VolumeRetention that is not shorter than that of your regular backups; otherwise you could lose the ability to perform a complete restore of some Full jobs if the Base Job they refer to has already been recycled.

Pool {
  Name = Base-Pool
  Pool Type = Backup
  Volume Use Duration = 18 hours
  Volume Retention = 364 days
  ...
}
Schedule {
  Name = base_schedule
  # Force the Base level and direct the job to the dedicated pool
  Run = Base Pool=Base-Pool 1st sunday at 12:00
}

d) Run the Base Job first and then the regular backup job (e.g. Full) that should be deduplicated; a manual bconsole sketch follows the log excerpt below. Information similar to the following will appear in the job log summary:

...
Rate: 2425.4 KB/s 
Software Compression: 39.7 % 
Base files/Used files: 39336/39114 (99.44%) 
VSS: yes 
Encryption: no
...
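
To trigger both jobs manually from bconsole instead of waiting for the schedule, commands along these lines should work (a sketch reusing the job and pool names from the examples above):

run job=BackupHeitor level=Base pool=Base-Pool yes
run job=BackupHeitor level=Full yes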
