Bacula Services’ High Availability, Clustering and Replication

With failover and load balancing capabilities, Bacula Enterprise allows organizations to minimize downtime, ensuring operational continuity even in the event of hardware, software, or other unforeseen failures. Additionally, its distributed and scalable architecture provides redundancy and resilience, enabling quick recovery of critical systems and data. With the ability to create highly available environments, Bacula Enterprise stands out as a reliable solution for companies seeking effective and consistent data asset protection.

To implement a high-availability architecture for Bacula, you must consider every service that could become a single point of failure:

  1. Stored backup jobs
  2. Storage Daemon
  3. Director and Catalog

The following sections evaluate features that make each of these components more resilient to disasters.

Backup Jobs

Bacula backup jobs are stored on disk, tape, or cloud storage, and they are always written by a Storage Daemon. Protecting them against loss or failure therefore involves one or more of the following methods.

Physical Protections

For disks, RAID6 is the most recommended and most common protection for backup volumes, tolerating up to two simultaneous disk failures. For tapes, some tape libraries offer a “mirroring” feature that writes two tapes with identical content at the same time. For the cloud, each provider's own protections and object storage replication, including multi-region replication, can be used. These options do not exclude other techniques.
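As a quick worked example of the RAID6 trade-off: two disks' worth of capacity holds parity, so an array of N equal disks keeps (N - 2) disks of usable space. The hypothetical helper below only does that arithmetic; real controllers and file systems may reserve additional space.

```shell
# Hypothetical helper: usable capacity of a RAID6 array of equal disks.
# Two disks' worth of space holds parity, which is what lets the array
# survive any two simultaneous disk failures.
raid6_usable_tb() {
  disks=$1     # number of disks in the array (RAID6 needs at least 4)
  size_tb=$2   # capacity of each disk, in whole terabytes
  if [ "$disks" -lt 4 ]; then
    echo "RAID6 needs at least 4 disks" >&2
    return 1
  fi
  echo $(( (disks - 2) * size_tb ))
}

raid6_usable_tb 8 16   # prints 96 (TB usable out of 128 TB raw)
```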

Copy and Clone Features

In Bacula, you can configure Copy jobs, which run after the backup job completes, or Clone jobs, which run at the same time as the backup (Job Run directive). Copies between devices of the same Storage Daemon (e.g., disk to tape) or between devices of different Storage Daemons make it possible to restore data if one of the backup media is lost.
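To make the two mechanisms concrete, here is a minimal sketch of the Director resources involved, wrapped in a shell function so the text can be redirected into a file and pulled into bacula-dir.conf with an @ include line. All resource names (Disk-Pool, Tape-Pool, Copy-To-Tape, Backup-Server1, Tape-Storage) are hypothetical, and the other required directives (Client, FileSet, Storage, Messages, and so on) are left out.

```shell
# Hypothetical example: print Director resources for a Copy job and a
# Clone job. Resource names are placeholders; other required directives
# (Client, FileSet, Storage, Messages, ...) are omitted.
copy_and_clone_resources() {
  cat <<'EOF'
# Source pool: a Copy job reads jobs from here and writes them to Next Pool.
Pool {
  Name = Disk-Pool
  Pool Type = Backup
  Next Pool = Tape-Pool
}

# Copy job: runs after the backups and copies every job in Disk-Pool
# that has not been copied yet.
Job {
  Name = Copy-To-Tape
  Type = Copy
  Pool = Disk-Pool
  Selection Type = PoolUncopiedJobs
}

# Clone: the Run directive starts a second instance of the same job at the
# same time, with the same level and "since" time, on another Storage.
Job {
  Name = Backup-Server1
  Type = Backup
  Run = "Backup-Server1 level=%l since=\"%s\" storage=Tape-Storage"
}
EOF
}
```

For example, copy_and_clone_resources > /opt/bacula/etc/copy-clone.conf followed by an @/opt/bacula/etc/copy-clone.conf line in bacula-dir.conf (both paths are illustrative).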

Storage Daemon

A Bacula Director can manage a virtually unlimited number of Storage Daemons as backup storage backends. These Storages can be used independently or grouped into a Bacula Storage Group for load balancing and failover. In that case, multiple Storages are listed, separated by commas, in the Pool or backup Job resource, and different load balancing policies can be selected, as shown in the configuration example below and in Bweb (Figure 1).

Pool {
   ...
   Storage = File4, File5, File6
   StorageGroupPolicy = LeastUsed
   ...
}


Figure 1. Storage Group Configuration in Bacula’s Pool Resource (Bweb).
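The following is a conceptual illustration only, not Bacula's implementation, of how a group policy might choose among the listed Storages. It assumes that ListedOrder picks the first reachable Storage in the order given and that LeastUsed picks the reachable Storage with the fewest running jobs; the function name and the name:running_jobs input format are hypothetical.

```shell
# Conceptual sketch (not Bacula code) of how a Storage Group picks a Storage.
# Usage: pick_storage <ListedOrder|LeastUsed> name:running_jobs ...
# A running_jobs value of "down" marks an unreachable Storage Daemon,
# which is skipped, so the job fails over to the remaining Storages.
pick_storage() {
  policy=$1
  shift
  best=""
  best_jobs=""
  for entry in "$@"; do
    name=${entry%%:*}
    jobs=${entry#*:}
    if [ "$jobs" = "down" ]; then
      continue
    fi
    case $policy in
      ListedOrder)
        echo "$name"   # first reachable Storage in the listed order
        return 0
        ;;
      LeastUsed)
        if [ -z "$best" ] || [ "$jobs" -lt "$best_jobs" ]; then
          best=$name
          best_jobs=$jobs
        fi
        ;;
      *)
        echo "unknown policy: $policy" >&2
        return 1
        ;;
    esac
  done
  if [ -z "$best" ]; then
    return 1           # no reachable Storage in the group
  fi
  echo "$best"
}

pick_storage LeastUsed File4:3 File5:1 File6:2      # prints File5
pick_storage ListedOrder File4:down File5:1 File6:2 # prints File5
```

Skipping "down" entries is what gives the group its failover behavior: losing one Storage Daemon only removes it from the candidates.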

Director and Catalog

Two architecture models can be used for the Director and Catalog: multi-master (multi-Director) and Director master-standby.

Multi-Director

In this architecture, multiple Directors are deployed, typically in different locations. If desired, Storage and File Daemons can be managed by more than one Director. This architecture is typical of organizations such as banks, whose data centers are fully replicated. Each Director has its own Catalog, which can, if needed, be shared among administrators.

However, administration can be more labor-intensive, as more than one instance of the backup system needs to be managed.
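For a File Daemon (or Storage Daemon) to be managed by more than one Director, its configuration declares one Director resource per Director. The sketch below prints such resources for a list of Director names; the helper, the Director names, and the placeholder passwords are hypothetical, and each password must match the Password in the corresponding Director's Client resource.

```shell
# Hypothetical helper: print one Director resource per Director name, to be
# placed in bacula-fd.conf so that each listed Director may contact this
# File Daemon. Passwords are placeholders to replace with real ones.
fd_director_resources() {
  for dir in "$@"; do
    cat <<EOF
Director {
  Name = $dir
  Password = "CHANGE-ME-$dir"   # must match this client's Password in $dir
}
EOF
  done
}

fd_director_resources dir-site-a dir-site-b
```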

Director Master-Standby

In this mode, only one Director instance should be active at any given time. Standby Director instances are installed in advance and periodically receive copies of the primary's configuration. If the primary Director suffers a disaster, one of the standby Directors takes over and resumes backup and restore jobs.
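A standby-side check could look like the sketch below, run periodically on a standby Director. It assumes a netcat that supports -z, a systemd unit named bacula-dir, and the Director's default port 9101; the function names are hypothetical. Because only one Director should be active at a time, a real deployment should also confirm that the primary is truly down (fencing) before taking over, rather than acting on a single failed probe.

```shell
# Hypothetical standby-side check. Assumes netcat with -z support, a
# systemd unit called bacula-dir, and the default Director port 9101.
director_is_up() {
  # -z: only test whether the port accepts connections; -w 5: 5 s timeout.
  nc -z -w 5 "$1" "${2:-9101}" >/dev/null 2>&1
}

take_over_if_primary_down() {
  primary=$1
  if director_is_up "$primary"; then
    echo "primary $primary is up: staying on standby"
    return 0
  fi
  echo "primary $primary is down: starting the local Director"
  systemctl start bacula-dir
}
```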

As shown in Figure 2, Bacula’s PostgreSQL Catalog can follow the same master-standby logic as the Director machines, or it can run on a separate set of machines accessed over the network.

Figure 2. PostgreSQL Master-Standby Replication

Regardless of the model adopted, the following examples show an Admin Job for Director configuration replication and the PostgreSQL configuration for Bacula’s Catalog replication.

Replicating Director Configurations
Setting up the bacula user’s SSH key on Linux – Primary Director:
mkdir -p /opt/bacula/working/.ssh/
chown -R bacula /opt/bacula
chmod 700 /opt/bacula/working/.ssh/
# Empty passphrase (-N "") so the Admin Job can use the key unattended.
sudo -u bacula ssh-keygen -t rsa -N "" -f /opt/bacula/working/.ssh/id_rsa
cat /opt/bacula/working/.ssh/id_rsa.pub # save contents for later.
Setting up the bacula user’s SSH key on Linux – Secondary Director:
mkdir -p /opt/bacula/working/.ssh/
touch /opt/bacula/working/.ssh/authorized_keys
chown -R bacula /opt/bacula
chmod 700 /opt/bacula/working/.ssh/
chmod 600 /opt/bacula/working/.ssh/authorized_keys
vi /opt/bacula/working/.ssh/authorized_keys

Copy the contents of id_rsa.pub from the primary machine (/opt/bacula/working/.ssh/id_rsa.pub), paste them into /opt/bacula/working/.ssh/authorized_keys on the secondary machine, then save and exit.

Ref.: https://www.hostinger.com.br/tutoriais/como-configurar-chaves-ssh
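Before relying on the key in an unattended Admin Job, it helps to confirm that the bacula user can log in without any prompt. A minimal check, run on the primary, assuming OpenSSH 7.6 or later for StrictHostKeyChecking=accept-new (which records the secondary's host key on first contact instead of asking); the function name is hypothetical.

```shell
# Hypothetical check, run on the primary Director. BatchMode=yes makes ssh
# fail instead of prompting, which is how it will behave inside an Admin Job.
ssh_to_standby_ok() {
  sudo -u bacula ssh -i /opt/bacula/working/.ssh/id_rsa \
    -o BatchMode=yes -o StrictHostKeyChecking=accept-new \
    "bacula@$1" true
}

# Example: ssh_to_standby_ok <ip> && echo "key-based login works"
```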

Create an Admin Job whose RunScript (RunsWhen = Before) copies the configuration to the secondary environment. Note that scp -i takes the private key, not the .pub file:
Job {
  Type = Admin
  ...
  RunScript {
    RunsWhen = Before
    RunsOnClient = No
    Command = "scp -i /opt/bacula/working/.ssh/id_rsa -r /opt/bacula/etc/conf.d/Director/ bacula@<ip>:/opt/bacula/etc/conf.d/"
    Command = "scp -i /opt/bacula/working/.ssh/id_rsa /opt/bacula/etc/bacula-dir.conf bacula@<ip>:/opt/bacula/etc/"
  }
}
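Instead of inline scp commands, the RunScript Command can point to a small script. The sketch below is a hypothetical replicate_dir_conf function that such a script could call; it first checks the configuration with bacula-dir -t so a broken file is never pushed to the standby. The /opt/bacula/bin/bacula-dir location and the function name are assumptions.

```shell
# Hypothetical helper for the Admin Job: validate, then copy the Director
# configuration to a standby host. Paths follow the layout used above.
BACULA_ETC=/opt/bacula/etc
BACULA_KEY=/opt/bacula/working/.ssh/id_rsa
BACULA_DIR_BIN=/opt/bacula/bin/bacula-dir   # assumed binary location

replicate_dir_conf() {
  standby=$1
  if [ -z "$standby" ]; then
    echo "usage: replicate_dir_conf <standby-host>" >&2
    return 2
  fi
  # -t only parses the configuration and reports errors; nothing is started.
  "$BACULA_DIR_BIN" -t -c "$BACULA_ETC/bacula-dir.conf" || return 1
  # BatchMode=yes: fail instead of prompting if key-based login breaks.
  scp -i "$BACULA_KEY" -o BatchMode=yes -r "$BACULA_ETC/conf.d/Director" \
    "bacula@$standby:$BACULA_ETC/conf.d/" || return 1
  scp -i "$BACULA_KEY" -o BatchMode=yes "$BACULA_ETC/bacula-dir.conf" \
    "bacula@$standby:$BACULA_ETC/"
}
```

A script saved at a path of your choice, containing this function and a final replicate_dir_conf "$1" line, could then be referenced as Command = "<script path> <ip>"; by default a failing Before script makes the Admin Job end in error, so a rejected configuration is visible in the job log.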
PostgreSQL Master-Standby Catalog Replication
On the Primary Node:
# Create a user with replication permissions and set its password.
sudo -u postgres createuser -U postgres --replication -P -c 5 repuser

mkdir /var/lib/pgsql/data/archive
chown -R postgres /var/lib/pgsql/data/archive/
# Allow replication connections from the secondary node (the cluster's passive IP).
echo "host replication all 10.146.19.65/24 md5" >> /var/lib/pgsql/data/pg_hba.conf

vi /var/lib/pgsql/data/postgresql.conf
# Add or adjust:
listen_addresses = '*' # it might already be configured
wal_level = replica # 'hot_standby' is an older alias for 'replica'
archive_mode = on
archive_command = 'test ! -f /var/lib/pgsql/data/archive/%f && cp %p /var/lib/pgsql/data/archive/%f'
# Save and exit.
# listen_addresses, wal_level and archive_mode only take effect after a restart (brief downtime).
service postgresql restart

firewall-cmd --permanent --zone=public --add-port=5432/tcp
service firewalld restart
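wal_level and archive_mode only change after a restart, so it is worth confirming on the primary that the new values are active. A minimal sketch, assuming the postgres system user and local psql access; the function names are hypothetical.

```shell
# Hypothetical check run on the primary after the restart.
pg_setting() {
  # -A: unaligned, -t: tuples only, so psql prints just the value.
  sudo -u postgres psql -Atc "SHOW $1;"
}

check_primary_ready() {
  wal=$(pg_setting wal_level) || return 1
  arch=$(pg_setting archive_mode) || return 1
  case $wal in
    replica|logical) ;;   # both support streaming; 'hot_standby' shows as 'replica'
    *) echo "wal_level is '$wal' (was PostgreSQL restarted?)" >&2; return 1 ;;
  esac
  if [ "$arch" != "on" ]; then
    echo "archive_mode is '$arch'" >&2
    return 1
  fi
  echo "primary ready: wal_level=$wal archive_mode=$arch"
}
```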
On the Secondary Node:
sudo service postgresql stop
mv /var/lib/pgsql/data /var/lib/pgsql/data.old
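# Copy the primary's database (primary IP) to the secondary. -R (PostgreSQL 12+) creates
# standby.signal and writes primary_conninfo to postgresql.auto.conf so the standby streams from the primary.
sudo -u postgres pg_basebackup -h 10.145.19.35 -D /var/lib/pgsql/data -U repuser -v -P -X stream -R
# When prompted, enter the repuser password set on the primary.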
vi /var/lib/pgsql/data/postgresql.conf
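# Add:
hot_standby = on
# promote_trigger_file exists in PostgreSQL 12 to 15; PostgreSQL 16 removed it (use pg_ctl promote or pg_promote()).
promote_trigger_file = '/tmp/postgresql.trigger.5432'
restore_command = 'cp /var/lib/pgsql/data/archive/%f %p'
recovery_target_timeline = 'latest'
archive_cleanup_command = 'pg_archivecleanup /var/lib/pgsql/data/archive %r'
# Save and exit.

# standby.signal makes PostgreSQL 12+ start as a standby (harmless if it already exists).
sudo -u postgres touch /var/lib/pgsql/data/standby.signal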
service postgresql start
# Check the postgresql logs in /var/lib/pgsql/data/log to verify the replication status.
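Besides the logs, replication can be checked from SQL on both sides. A minimal sketch, assuming local psql access as the postgres user; the function names are hypothetical, and 10.146.19.65 is used only as an example standby address. On the primary, pg_stat_replication shows one row per connected standby, and its state moves from catchup to streaming once the standby has caught up.

```shell
# Hypothetical checks; both assume local access as the postgres user.
# On the standby: pg_is_in_recovery() returns 't' while it is a standby.
standby_in_recovery() {
  [ "$(sudo -u postgres psql -Atc 'SELECT pg_is_in_recovery();')" = "t" ]
}

# On the primary: pg_stat_replication has one row per connected standby,
# whose state should be 'streaming' once it has caught up.
primary_streams_to() {
  sudo -u postgres psql -Atc \
    "SELECT state FROM pg_stat_replication WHERE client_addr = '$1';" |
    grep -qx streaming
}

# Example, on the primary: primary_streams_to 10.146.19.65 && echo "standby streaming"
```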

Ref.: https://cloud.google.com/community/tutorials/setting-up-postgres-hot-standby
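If the primary Catalog is lost, the standby must be promoted before the Director taking over can write to it. With the configuration above, creating the trigger file promotes the standby on PostgreSQL 12 to 15; the SQL function pg_promote() does the same on PostgreSQL 12 and later, including 16+, where promote_trigger_file no longer exists. The sketch below uses pg_promote(); the function name is hypothetical, and afterwards the Catalog resource of the active Director must point at the promoted server.

```shell
# Hypothetical promotion helper, run on the standby database server.
# pg_promote() (PostgreSQL 12+) waits up to 60 s by default and returns
# 't' once promotion has completed.
promote_standby() {
  [ "$(sudo -u postgres psql -Atc 'SELECT pg_promote();')" = "t" ]
}

# Alternative for PostgreSQL 12 to 15 with the promote_trigger_file set above:
#   sudo -u postgres touch /tmp/postgresql.trigger.5432
```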
