I just read a post on Slashdot asking how to keep personal data safe. The questioner had just returned from Mexico with lots of photographs (16GB worth, he said). Other people's comments about offsite backup and about keeping too much data made me realise that some of my experiences from the last couple of weeks were worth writing up.
I have been looking at both offsite backup and reviewing my archived data.
Firstly, I was thinking about the offline storage of the live backup data I keep on my server. I automatically back up all the key datastores on the computers around our home to my own server, so I have one place where I can examine and determine which data is crucial. After that analysis, I have determined that I have about 200GB of data that I would regard as really important. It consists of:
- Imap store for all mails (including sent mail archive)
- Data from my “home” directory (a Linux machine)
- Data from my wife’s “home” directory (a Windows machine)
- Data from my work laptop (a Windows machine)
- Key media from Melinda’s Backups
- A special archive store – explained below
The archive store is a place where I keep multiple versions of the same file over time. It is an ideal place for items such as:
- database backups (for example the backup of the database behind this site running the drupal content management system)
- snapshots of my online filestores (these often, but not exclusively, match the database backups, which refer to files in these stores)
- old copies of files that have changed in a special “mydocs” sub-directory in my home directory
- snapshots of the /etc directories of my Linux machines.
It consists of a series of directories: snap (the most recent snapshot), daily.1 (the snapshot from a day ago), daily.2 to daily.6 (snapshots from two to six days ago), weekly.1 to weekly.5 (weekly snapshots), monthly.1 to monthly.6 (monthly snapshots), and finally a DVD archive named dvd-Hh-yyyy (where “h” and “yyyy” are the half – 1 or 2 – and the year respectively). The DVD archive is arranged so that six months' worth of archive data can be permanently written to DVD, meaning the snapshots can be kept forever.
The real subtlety of this approach is how each snapshot is aged and, when it becomes the oldest of its set, is merged into the slower-moving archive. As an example, every day daily.6 is merged into weekly.1 such that if both contain a file with the same name, the daily.6 file replaces the one already there, while files only in weekly.1 are left untouched and files in daily.6 that do not yet exist in weekly.1 are added (i.e. the directories are merged). It turns out (assuming daily.6 does not hold any hidden “dot” files at its top level) that the following commands achieve this extremely quickly, as they only move metadata:
cp -alf daily.6/* weekly.1/
rm -rf daily.6
What the cp command is doing is copying all the files recursively, but the -l flag creates hard links in the destination rather than duplicating any file data, and -f forces replacement where a file of the same name already exists. So the two directories are merged without any data being copied. Once the daily.6 directory has been removed, all the other daily directories can be shuffled down with:
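Since the -l trick is the heart of the scheme, here is a tiny self-contained illustration (the src and dst directory names are made up for the demonstration) showing that the merge shares inodes rather than copying data:

```shell
# Create a file and "copy" it with cp -al: the result is a hard link
mkdir -p src dst
echo "some data" > src/file.txt
cp -alf src/* dst/

# Both names now point at the same inode, so no data was duplicated
[ src/file.txt -ef dst/file.txt ] && echo "same inode - nothing copied"

# Removing the source leaves the data intact under the other name
rm -rf src
cat dst/file.txt    # prints "some data"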
mv daily.5 daily.6
mv daily.4 daily.5
...
mv daily.1 daily.2
mv snap daily.1
Similar approaches are taken to merge weekly into monthly and monthly into the dvd archive.
The net result of all this is an archive of aging copies of important files – daily versions for the most recent, but as they get further away they become weekly copies, then monthly copies and finally six-monthly copies – at which point they are ultimately archived to DVD.
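Putting the merge and the shuffle together, the whole daily rotation can be sketched roughly as follows. This is not my production script – it runs against a throwaway temporary directory so the sketch is self-contained, and a real version would point at the actual archive store instead:

```shell
#!/bin/sh
# Sketch of the daily rotation: merge the oldest daily snapshot into
# weekly.1, then shuffle the remaining dailies down one place.
# ARCHIVE stands in for the real snapshot directory.
ARCHIVE=$(mktemp -d)
mkdir -p "$ARCHIVE/snap" "$ARCHIVE/daily.1" "$ARCHIVE/daily.6" "$ARCHIVE/weekly.1"
echo "aged file" > "$ARCHIVE/daily.6/old.txt"

cd "$ARCHIVE" || exit 1

# Age daily.6 into the weekly series (hard links only - no data copied)
cp -alf daily.6/* weekly.1/
rm -rf daily.6

# Shuffle the remaining daily snapshots down one place
for n in 5 4 3 2 1; do
    [ -d "daily.$n" ] && mv "daily.$n" "daily.$((n + 1))"
done

# Today's snapshot becomes daily.1
mv snap daily.1

ls    # daily.1  daily.2  weekly.1
```

The weekly-to-monthly and monthly-to-DVD merges follow the same pattern with longer periods.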
Making the “Off Site” Backup
It turns out that I have one spare 500GB disk and two 160GB SATA disks that have come out of upgraded computers. So I bought myself a SATA/USB docking station, and I can now copy the key datastores listed above to one or more of these disks.
I had to decide what filesystem format to use for these disks. I have used a mixture of reiserfs and ext4 filesystems in my main data areas. I liked reiserfs for its efficient packing of small files, but I also found it a bit slow. Ext4 has served well as the general-purpose filesystem for my current systems, but has no special facilities for managing data in the snapshot format; in fact I was handling that type of operation with LVM layered under the ext4 filesystem itself. After checking other possible solutions, I decided to use btrfs.
Btrfs is an interesting choice. At the time of writing this blog post, I do not believe it is sufficiently stable to use in an online production environment. The version available on my Debian Squeeze system is well behind the cutting edge, and even the latest versions still suffer occasional problems. However, these disks will be rewritten every couple of months and then stored off site and not touched; provided basic use works (and a few quick tests proved it does), I do not think stability is an issue. What is much more exciting is the set of features btrfs offers. Particularly interesting are:
- Almost instantaneous creation of the filesystem, even on a 500GB disk
- Sub-volumes and the ability to snapshot, which offer the possibility of keeping multiple versions of the off site backup set on the same disk (since it is essentially the same data; with rsync --inplace copying and the btrfs copy-on-write ability, I expect only a minimal increase in disk usage on each off site backup cycle)
- Efficient packing of small files
For these reasons I have proceeded with btrfs as the basis for my off site storage sets. The disks are initialised with the following commands:
mkfs.btrfs /dev/sdd
mount /dev/sdd /mnt/offsite
btrfs subvolume create /mnt/offsite/archive
I have made myself two off site backup sets (one consisting of the 500GB disk, and one consisting of the two 160GB disks). Once a month I perform the following steps
- Retrieve the older of the two sets from off site (my daughter’s house), leaving the newer one still there
- Make a new snapshot (see below) of the archive stores (and other key data)
- Take this new backup back to its off site storage location
This means that at least one backup storage set is off site at any time.
I have written a separate script for each of the different disk sets, but I include below the script for the simple case, the single 500GB disk:
#!/bin/sh
DATE=$(date +%d-%b-%Y)
MOUNT=/mnt/offsite
set -v
rsync -a --delete --inplace /bak/archive/ $MOUNT/archive/archive/
rsync -a --delete --inplace /bak/alan/backup/ $MOUNT/archive/alan/
rsync -a --delete --inplace /bak/imap/ $MOUNT/archive/imap/
rsync -a --delete --inplace /bak/pooh/backup/ $MOUNT/archive/pooh/
rsync -a --delete --inplace /bak/rabbit/backup/ $MOUNT/archive/rabbit/
rsync -a --delete --inplace /bak/mb.com/public_html/media_store/ $MOUNT/archive/mb.com-media_store/
# Make a snapshot, so that next time we update it we can do so and not lose the original
btrfs subvolume snapshot $MOUNT/archive $MOUNT/snapshot-$DATE
set +v
I hope others out there can use these scripts as the starting point for your own off site storage programme.
Slimming down my archive data
I have been taking backups for as long as I can remember, and since 1998 I have been archiving snapshots of all my important data to CD (and latterly to DVD). I have a small CD wallet that holds approximately 120 discs, and as it was getting full, I decided the time was ripe to go through all my CDs from 1998 onwards and transfer them onto DVD – thus making much more efficient use of the holder.
I started by copying the oldest CD to an area on my hard drive, and then following that with the second one.
I soon remembered that the structure of each CD was similar. There was normally a top-level directory for the particular area I was backing up at the time (my work computer, my personal computer before and after I moved to Linux, our family computer, other general backups), followed by a standard set of directories and files beneath those top-level areas.
I then asked myself a question: if I had to go back and choose a file, would the version of that file make any difference? Would I know which version I wanted, or would the newest version be what I was looking for? I soon realised that for files now over ten years old, I was more likely to be looking for any version of a file than worrying about a particular one.
So, using the technique described above (cp -alf newer/* older/) to merge newer versions of directories into older ones, I realised I could compress my archives by removing files that were just repeats of the same thing.
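As a rough sketch of the consolidation (the cd-* directory names below are made up), each CD image can be merged into an accumulated tree oldest-first, so that later versions of a file win while files that only ever appeared on one CD survive:

```shell
# Each CD was copied to its own directory; merging oldest-first means
# newer versions of a file replace older ones, while one-off files
# remain.  Directory names are illustrative.
mkdir -p cd-1998-1/docs cd-1998-2/docs merged
echo "draft"  > cd-1998-1/docs/report.txt
echo "final"  > cd-1998-2/docs/report.txt
echo "unique" > cd-1998-1/docs/notes.txt

for cd in cd-1998-1 cd-1998-2; do   # oldest first, so later CDs win
    cp -alf "$cd"/* merged/
done

cat merged/docs/report.txt   # prints "final"  (newest version kept)
cat merged/docs/notes.txt    # prints "unique" (never repeated, still there)
```

The merged tree is then what gets written to DVD.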
In the end, I merged 51 CDs (with an average of about 500MB of data on each one) into just 4 DVDs, without losing anything important. Even so, I can still locate any important file I have had since 1998.
I call that a good day’s work.