Backing Up your Data

You should always have an up-to-date backup copy of any important data you keep on Nectar infrastructure, as per the Nectar Terms and Conditions. Nectar does not normally make or keep backups of your instances, volumes or object storage, so it is important that you do this to protect your data.  Things that you need to protect against include:

  • File system corruption or loss due to hardware, operating system and other infrastructure failures.

  • Data loss due to human error; e.g. your error, a co-worker’s error, or an operator error.

  • Data loss due to someone hacking your instance and interfering with the data.

Note that if an instance is destroyed or compromised, you don’t just lose the research data stored there.  You may also lose code or scripts that you have developed, software builds, installations and configurations, and much of the history of what you were doing.

Instance snapshots are a way to save a copy of the state of an instance, but they do not include data on the secondary ephemeral storage drive or on any attached volumes. Likewise, volume snapshots are a way to record the disk state of a volume, but they are not suitable for backups because the snapshot is held on the same storage platform as the volume.

There are a number of other ways to implement backups as described below.

Develop a Backup Strategy

Given the importance of protecting your research data holdings on Nectar, we recommend that you develop and document a Backup Strategy.  The things to consider include:

  • Deciding what it is you are protecting against.  Is it simply data loss, or are you also concerned about your data being unavailable?

  • Do you need a single disaster recovery backup, or do you need to be able to retrieve old versions of files?

  • Do you actually need file archiving and data preservation rather than backup?  (Bear in mind that Nectar storage resources are not intended to be used for the former.)

  • How much data is involved?

A data backup strategy should include transferring data to a secondary data storage location to guard against the complete loss of a storage device, and/or short or medium term unavailability.  Possible locations include your desktop or other (non-Nectar) research data storage, but it is important to pick a location that is sufficiently.

Next, you need to decide how often you need to back up your data.  Bear in mind that any data that has changed since the last successful backup is at potential risk.

Next, you need to decide which mechanism or mechanisms are most appropriate for your needs.  If you are going to automate the backups, you need to set up appropriate scripts that log what they are doing.  You also need to ensure that the logs are checked regularly.  (Many of us have heard horror stories about people who go to restore some lost files from the backups only to discover that the backups have not been working for the past N months.)
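As a rough illustration of a script that logs what it is doing, here is a minimal sketch.  It assumes the backup is a compressed tar archive (as described later in this article); the function name "run_backup" and all of the paths are hypothetical and would need to be adapted.

```shell
#!/bin/sh
# Sketch of a logged backup job.  The function name and all paths are
# illustrative assumptions, not Nectar conventions.

# run_backup SRC_DIR DEST_DIR LOG_FILE
# Archives SRC_DIR into DEST_DIR and appends timestamped entries to LOG_FILE.
# Returns non-zero (and logs FAILED) if tar reports an error.
run_backup() {
    src=$1
    dest=$2
    log=$3
    archive="$dest/$(basename "$src")-$(date +%Y%m%d-%H%M%S).tar.gz"

    echo "$(date '+%Y-%m-%d %H:%M:%S') START  $src -> $archive" >> "$log"
    # -C keeps the paths in the archive relative to the parent of SRC_DIR.
    # Any tar error messages are captured in the log as well.
    if tar -czpf "$archive" -C "$(dirname "$src")" "$(basename "$src")" >> "$log" 2>&1
    then
        echo "$(date '+%Y-%m-%d %H:%M:%S') OK     $archive" >> "$log"
    else
        status=$?
        echo "$(date '+%Y-%m-%d %H:%M:%S') FAILED tar exit status $status" >> "$log"
        # Do not leave a partial archive behind that could be mistaken for a backup.
        rm -f "$archive"
        return "$status"
    fi
}
```

A call such as "run_backup /mnt/data /mnt/staging /var/log/backup.log" could be run from a scheduled job.  A failed run leaves a FAILED line in the log and a non-zero exit status, both of which are easy to check for.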

Next, you need to document your procedures so that you (or possibly someone else) know where the backups are stored, how to run them, and how to restore data.  It is unwise to just rely on your memory.

Finally, you need to test your ability to restore an entire lost file system or tree from the backups, or (if this is in scope) individual lost files or directories.  Indeed, it is advisable to do these tests periodically.
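One possible way to run such a test is sketched below.  It assumes the backup is a compressed tar archive whose top-level entry has the same name as the original directory (for example, an archive created with "tar -czpf data.tar.gz -C /mnt data").  The function name "verify_restore" is hypothetical.

```shell
#!/bin/sh
# Sketch of a restore test.  Function name and arguments are illustrative.

# verify_restore ARCHIVE ORIGINAL_DIR
# Extracts ARCHIVE into a scratch directory and compares the restored tree
# with ORIGINAL_DIR.  Assumes the archive's top-level entry has the same
# name as the last path component of ORIGINAL_DIR (e.g. "data" for /mnt/data).
verify_restore() {
    archive=$1
    original=$2
    scratch=$(mktemp -d) || return 1

    if tar -xzpf "$archive" -C "$scratch" &&
       diff -r "$original" "$scratch/$(basename "$original")"
    then
        result=0
        echo "restore test passed: $archive"
    else
        result=1
        echo "restore test FAILED: $archive" >&2
    fi
    # Clean up the scratch copy whether or not the comparison succeeded.
    rm -rf "$scratch"
    return "$result"
}
```

Note that "diff -r" compares file contents only, not ownership or permissions, and that any file modified since the backup was taken will be reported as a difference.  The scratch directory also needs enough free space to hold the restored tree.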

Backup using Compressed Archives

The simplest way to back up files is to create a compressed archive file for your directories and then transfer the archive from your VM to your backup destination.

The general command syntax for creating a compressed tar archive is:

tar -cvpzf <NameOfArchive>.tar.gz <pathname> ...

For example, to create an archive for /mnt/data as data.tar.gz:

tar -cvpzf data.tar.gz /mnt/data

(If the files are owned by a different user or by multiple users, you may need to use “sudo” to run the command.)

The compressed archive should now be transferred to the backup destination.  You can transfer it from your VM to your desktop using FileZilla, or any other GUI-based file transfer tool that understands SCP.  Alternatively, you can use the SCP command line tool on your VM to copy the file to a remote server.  Please refer to the article on Transferring Data to your VM for various file transfer options.

The advantages of this approach are that it is simple and easy to understand.  The disadvantages are:

  • This approach copies everything.  That takes time and potentially uses a lot of disk space; e.g. if you want to keep multiple backups.

  • You need enough spare disk space on your VM to hold the compressed archive before transferring it to the backup location.
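If you do keep multiple archives, the disk space cost can be limited by deleting the oldest ones.  The following is a minimal sketch of one way to do that.  It assumes archives are named "<prefix>-YYYYMMDD-HHMMSS.tar.gz" (a naming convention chosen for this example, not one the article prescribes), and the function name "prune_archives" is hypothetical.

```shell
#!/bin/sh
# Sketch of a simple retention policy for dated archives.

# prune_archives DIR PREFIX KEEP
# Deletes all but the KEEP most recent archives named
# PREFIX-<timestamp>.tar.gz in DIR.  Relies on timestamps of the form
# YYYYMMDD-HHMMSS so that alphabetical order matches chronological order.
# Assumes that file names do not contain newlines.
prune_archives() {
    dir=$1
    prefix=$2
    keep=$3

    # Newest first; skip the first KEEP entries and remove the rest.
    ls -1 "$dir"/"$prefix"-*.tar.gz 2>/dev/null |
        sort -r |
        tail -n +"$((keep + 1))" |
        while IFS= read -r old; do
            echo "removing $old"
            rm -f -- "$old"
        done
}
```

For example, "prune_archives /mnt/staging data 7" would keep the seven most recent "data-*.tar.gz" archives.  It is safer to prune only after the newest archive has been transferred and checked.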

Backup using RSYNC

RSYNC is a utility for mirroring file trees; i.e. creating or updating an identical copy of a tree of files and directories in a different place.  It can be used as a backup mechanism; i.e. by making a copy of the tree that you want on a different computer.

RSYNC has several important advantages over creating and copying archives:

  • No temporary space is needed at the source location.

  • Only files that have changed will be copied to the destination.

On the flip-side, an RSYNC copy is analogous to a single backup archive.  It only contains the version of any file or directory at the last time you completed a backup.  Furthermore, if something goes wrong while you are running RSYNC, the mirror copy will contain a mix of the current version and earlier versions.

To use RSYNC, it has to be installed on the source computer (the VM) and the destination computer:

  • RSYNC is pre-installed on MacOSX

  • On Linux systems, RSYNC can be installed using your distro’s package manager.  For example on Ubuntu you can install it with:

sudo apt-get install rsync

  • On Windows, installation involves using a Cygwin package.  RSYNC usage is more complicated.

The general command syntax for creating or syncing a back-up copy is as follows:

rsync -av <source> <destination>

where the <source> and <destination> are either simple directory paths or take the form <account>@<host>:<directory>, where <account> is a (remote) account name and <host> is the hostname or IP address of the (remote) source or destination.

For example:

rsync -av /mnt/data/ ubuntu@remote.host.edu.au:data/directory/

will mirror from “/mnt/data/” to a machine called “remote.host.edu.au”.  The transfer will use the “ubuntu” account on the destination machine.  Since “data/directory/” doesn’t have a leading “/”, it will be a directory within the ubuntu user’s home directory.

Likewise:

rsync -av ubuntu@remote.host.edu.au:data/directory/ /mnt/data/ 

mirrors files in the opposite direction; i.e. from “remote” to the current machine.

Caution!  Beware! The RSYNC command is unforgiving.  It blindly does what you tell it to do … without knowing if what you are telling it makes sense.  So, for example, if you RSYNC files in the wrong direction, you could end up clobbering your current file tree with the (older) backup, and lose all changes since the backup.

If you have any doubts about what you are doing, we recommend that you use the “--dry-run” option to see what your RSYNC command would do.  Dry run tells you what files would be copied, etcetera without doing it.
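A dry run can be wrapped in a small helper, sketched here.  The function name "preview_sync" is hypothetical, and the “--itemize-changes” option is an addition that this article does not otherwise cover; it asks RSYNC to print one line per item that would be updated.

```shell
#!/bin/sh
# Sketch of a dry-run helper for checking an rsync command before running it.

# preview_sync SOURCE DESTINATION
# Lists what rsync would transfer, without changing anything at the
# destination.  SOURCE and DESTINATION take the same forms as for a
# normal rsync command.
preview_sync() {
    rsync -av --dry-run --itemize-changes "$1" "$2"
}
```

For example, "preview_sync /mnt/data/ ubuntu@remote.host.edu.au:data/directory/" lists the files that the corresponding real command would copy.  If the list looks wrong (for instance, it shows files flowing in the wrong direction), nothing has been changed yet.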

Advanced RSYNC usage

The previous section describes the basics of using RSYNC for backups, assuming that your SSH keys “just work” using the default keys in your “~/.ssh” directory.

If you need to use a different key (or a specific key if you have lots of them), you can do it by passing the relevant arguments through to the (local) “ssh” command that establishes the communication channel that RSYNC uses:

rsync -av -e "ssh -i <path-to-private-key>" <source> \
    <destination>

If the files that you are backing up have multiple (Linux) file owners and/or groups and you wish to preserve this, then you need to run “rsync” as “root” (on both ends) and include options to preserve the user and group information.  Here is an example:

sudo rsync -aP -e "ssh -i /path/to/private_key_file" \
    --rsync-path="sudo rsync" \
    /mnt/myvolume ubuntu@new:/mnt

Explanation:

  • Using “sudo” locally allows RSYNC to access all files, irrespective of their permissions.

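  • The “-a” (archive) flag tells RSYNC to preserve ownership and group information, along with permissions, modification times and symbolic links.  The “-P” flag shows progress and keeps partially transferred files so that an interrupted transfer can be resumed.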

  • The “-e” option tells RSYNC to use a specific SSH key rather than the local “root” user’s default key.

  • The “--rsync-path” gives the RSYNC command to be run on the remote system.  Using “sudo rsync” means that the remote RSYNC has permission to create the files with the correct owners and groups.

If you are using RSYNC to create a backup on an HSM (hierarchical storage management) file system, then you need to take some precautions.  There are two main things to be concerned with:

  1. By default, RSYNC uses a “delta-transfer algorithm” to optimize transfers.  Essentially, it works out the differences between the source and destination copy of a file, and transfers those rather than the entire file.  That typically makes the transfers faster.  The problem is that RSYNC needs to read the file at the destination to compute the deltas. If the destination is on an HSM it will trigger a recall of offline files.
    The solution to this problem is to provide the “-W” or “--whole-file” option to turn off the delta-transfer algorithm.

  2. By default, RSYNC uses a “quick and dirty” method to determine if a file has changed; i.e. it compares the file sizes and modification times.  However, there is a “--checksum” option which tells RSYNC to compute and compare file checksums for the source and destination copies to decide if the file needs to be copied.  This is more reliable in some circumstances, but computing a checksum requires reading the entire file.  If the destination is on an HSM it will trigger a recall of offline files.

The reason that these issues are a problem is that the HSM will typically stall the RSYNC command while it recalls each offline file.  Since each recall can take seconds or minutes, this can make a backup take an impractically long time.  The unnecessary HSM recall traffic is also liable to hurt performance for other HSM users.  However, if you do need to recall a lot of files, it is better for HSM performance if you retrieve them in bulk rather than one at a time.  The HSM can schedule bulk retrievals to minimize tape load / unload operations.

In summary:

  • When RSYNCing to an HSM file system, use “-W” or “--whole-file” and do not use “--checksum”.

  • When RSYNCing from an HSM file system, perform a bulk recall of any offline files first.
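The first recommendation can be captured in a small wrapper, sketched here under the assumption that the destination path is on an HSM file system.  The function name "sync_to_hsm" is hypothetical.

```shell
#!/bin/sh
# Sketch of an rsync wrapper for destinations on an HSM file system.

# sync_to_hsm SOURCE DESTINATION
# Mirrors SOURCE to DESTINATION.  --whole-file disables the delta-transfer
# algorithm, and --checksum is deliberately not used, so rsync decides what
# to copy from file sizes and modification times rather than by reading the
# contents of the destination files.
sync_to_hsm() {
    rsync -a --whole-file "$1" "$2"
}
```

Keeping the options in one place like this makes it harder to add “--checksum” by accident when the command is reused in other scripts.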

Incremental Backups

Incremental backup describes any backup mechanism where you make copies of only files that have changed since a previous backup.  If properly implemented, an incremental backup scheme gives you the same security as saving multiple full backups while using a lot less storage space.  The downside is that incremental backups are more complicated to implement, and retrieval of a particular version of a particular file can be particularly complicated.
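To make the idea concrete, here is a minimal sketch using the “--listed-incremental” option of GNU tar, which is one of several ways to implement incremental backups.  It assumes GNU tar (the default on most Linux distros; other tar implementations may not support this option).  The function name "incremental_backup" and the file names are hypothetical.

```shell
#!/bin/sh
# Sketch of incremental backups with GNU tar.

# incremental_backup SRC_DIR ARCHIVE SNAPSHOT_FILE
# Uses GNU tar's --listed-incremental mode.  If SNAPSHOT_FILE does not exist
# yet, the archive is a full (level 0) backup; otherwise it contains only
# what has changed since the run that last updated SNAPSHOT_FILE.
incremental_backup() {
    src=$1
    archive=$2
    snapshot=$3
    # -C keeps the paths in the archive relative to the parent of SRC_DIR.
    tar -czpf "$archive" --listed-incremental="$snapshot" \
        -C "$(dirname "$src")" "$(basename "$src")"
}
```

The first call with a new snapshot file produces a full backup, and each later call with the same snapshot file produces an archive of the changes only.  This also illustrates the complexity noted above: to restore, you extract the full archive and then each incremental archive in order, passing “--listed-incremental=/dev/null” to tar on each extraction.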

We will not go any further into this topic here, except to note that Duplicity is a commonly used open-source backup package for Linux.  Among other things, Duplicity can be configured to save backups into OpenStack Swift object storage containers, as provided by Nectar.

It is also worth noting that you may be able to get incremental backups with minimal effort if you can RSYNC your data to a secure location where someone else is looking after the problem of backups.

Volume Storage Backups

Volume backups can be created using the Volume Backup service.  To create a backup, open the Nectar Dashboard’s “Project > Volume > Volumes” panel, find the volume and use the “Create Backup” action.  Once successfully created, a Volume Backup can be restored using the “Project > Volume > Backups” panel, either onto an existing volume or as a new one.

Caveats:

  • Volume Backups tend not to work for large volumes; e.g. larger than 1TB.  (The problem is that the backup must be staged to local disk storage on the backup server.  If there is not enough staging space, the backup fails.)

  • Do not confuse Volume Backups with Volume Snapshots. A snapshot is just a “point in time” view of a volume that is held on the volume storage media.  If there is a major service failure, volumes and snapshots will both be lost.

  • There is a default quota of 10 volume backups per project.  If you need more backup quota, please raise a Nectar Support ticket.

  • The backups are held in Object Storage and your project needs quota to hold them.

  • When you restore a Volume Backup onto an existing Volume, the existing volume content will be overwritten.  It would be highly inadvisable to try this while the existing volume is mounted as a file system.

The alternative to Volume Backups is to attach and mount the volume and use one of the file-based mechanisms described earlier, such as compressed archives or RSYNC.

Database Backups

If you are using the Nectar Database service to host your database, you can back up the database via the Nectar Dashboard.  The procedure for taking a backup is explained in the Database Service Tutorial.  The only caveat is that since the Database service saves the backups in Object Storage, your project will need Object Storage quota sufficient to include space for the backups.

By contrast, if you have installed a database product (e.g. MySQL, PostgreSQL, MongoDB, etc.) on a Nectar instance, you will need to research and implement an appropriate vendor-specific database backup scheme.  Check the vendor documentation for guidance on the best way to implement backups for your databases.

Note: You cannot rely on normal file system backup techniques to safely back up a database.  The state of a database is typically spread across a number of files, and most file system backup techniques won’t give you an instantaneous snapshot of a number of files.  A file system level backup taken for an active database is liable to be inconsistent.