I’ve been trying to sort out the contents of my old file server. It’s been sitting powered off for a few years. I had been dumping data onto it for years on the assumption that one day I would sort it out. Today’s the day!

Backups

Before starting, I took a backup of everything to an external HDD with rsync -avz -P. This was itself a slow and messy process, because the server locks up after a few hours of copying. Many restarts later, with the help of a polling loop to tell me when the server had stopped responding:

while true; do
    # mail an alert if the server ("sewer") stops answering ssh
    if ! ssh -o ConnectTimeout=1 sewer true; then
        echo 'server failed' | mail
    fi
    sleep 30
done

Eventually the backup was done. I then reinstalled the server, both because it was two Debian releases behind and in the hope that whatever caused the lockups had since been fixed.

Dupes

I wanted to find all identical files. First I wrote a program that used Go’s filepath.Walk to list every file and take a hash of each one. There are filenames with spaces and other messiness, so I put the results into a CSV file.
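Roughly, that first pass looks like this (a minimal sketch, assuming SHA-256 and plain path,hash rows rather than whatever columns the real program writes):

package main

import (
    "crypto/sha256"
    "encoding/csv"
    "encoding/hex"
    "io"
    "log"
    "os"
    "path/filepath"
)

func main() {
    w := csv.NewWriter(os.Stdout)
    defer w.Flush()

    err := filepath.Walk(os.Args[1], func(path string, info os.FileInfo, err error) error {
        if err != nil {
            return nil // skip unreadable entries rather than aborting the walk
        }
        if !info.Mode().IsRegular() {
            return nil // ignore directories, symlinks, devices, etc.
        }
        f, err := os.Open(path)
        if err != nil {
            return nil
        }
        defer f.Close()

        h := sha256.New()
        if _, err := io.Copy(h, f); err != nil {
            return err
        }
        // the csv writer quotes spaces and commas in filenames for us
        return w.Write([]string{path, hex.EncodeToString(h.Sum(nil))})
    })
    if err != nil {
        log.Fatal(err)
    }
}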

The first version of this program had poor performance, so I ended up writing a version with a worker pool to do the hashing. That version sits at 100% disk and 100% CPU, so that’s good. I think there are still optimisations to make; for example, I would like to hash only the first 1 MB of each file, then hash the full file later on finding a possible dupe.
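The shape of the worker pool is roughly this (again a sketch, not the real program: paths arriving on stdin and runtime.NumCPU() workers are assumptions):

package main

import (
    "bufio"
    "crypto/sha256"
    "encoding/csv"
    "encoding/hex"
    "io"
    "os"
    "runtime"
    "sync"
)

type result struct {
    path, hash string
    err        error
}

// hashFile hashes one file with SHA-256, same digest as the walk step.
func hashFile(path string) (string, error) {
    f, err := os.Open(path)
    if err != nil {
        return "", err
    }
    defer f.Close()
    h := sha256.New()
    if _, err := io.Copy(h, f); err != nil {
        return "", err
    }
    return hex.EncodeToString(h.Sum(nil)), nil
}

// hashAll fans paths out to a fixed pool of workers and streams results back.
func hashAll(paths <-chan string, workers int) <-chan result {
    out := make(chan result)
    var wg sync.WaitGroup
    for i := 0; i < workers; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for p := range paths {
                h, err := hashFile(p)
                out <- result{path: p, hash: h, err: err}
            }
        }()
    }
    go func() {
        wg.Wait()
        close(out)
    }()
    return out
}

func main() {
    paths := make(chan string)
    go func() {
        defer close(paths)
        sc := bufio.NewScanner(os.Stdin)
        for sc.Scan() {
            paths <- sc.Text()
        }
    }()

    w := csv.NewWriter(os.Stdout)
    defer w.Flush()
    for r := range hashAll(paths, runtime.NumCPU()) {
        if r.err != nil {
            continue // skip unreadable files
        }
        w.Write([]string{r.path, r.hash})
    }
}

The first-1 MB idea would mostly be a matter of swapping io.Copy(h, f) for io.Copy(h, io.LimitReader(f, 1<<20)) on the first pass, and re-hashing in full only when two partial hashes collide.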

Given the CSV file, I wrote a program to read it and group the files by hash. This is suggesting about 9% savings so far.
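Something along these lines, again assuming path,hash rows, and stat-ing one file per duplicate group to estimate the reclaimable bytes:

package main

import (
    "encoding/csv"
    "fmt"
    "io"
    "os"
)

func main() {
    f, err := os.Open(os.Args[1])
    if err != nil {
        panic(err)
    }
    defer f.Close()

    groups := map[string][]string{} // content hash -> paths with that content
    r := csv.NewReader(f)
    for {
        rec, err := r.Read()
        if err == io.EOF {
            break
        }
        if err != nil {
            continue // tolerate the odd bad row
        }
        path, hash := rec[0], rec[1]
        groups[hash] = append(groups[hash], path)
    }

    var reclaimable int64
    for _, paths := range groups {
        if len(paths) < 2 {
            continue
        }
        info, err := os.Stat(paths[0])
        if err != nil {
            continue
        }
        // every copy beyond the first could be deduplicated away
        reclaimable += int64(len(paths)-1) * info.Size()
    }
    fmt.Println("reclaimable bytes:", reclaimable)
}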

The next idea was to find whole directories that contain the same files. I implemented this by sorting the CSV file and processing it one directory at a time, folding each file’s name (without the directory part) and its content hash into a running hash for the directory. If two directories end up with the same digest, they have the same files (same names and contents, hashed in the same order) and are a match. This is again about a 9% improvement, but likely not cumulative with the single-file matches.
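A sketch of the per-directory signature, with the same path,hash assumption; sorting each directory’s entries here stands in for processing a pre-sorted CSV:

package main

import (
    "crypto/sha256"
    "encoding/csv"
    "encoding/hex"
    "fmt"
    "io"
    "os"
    "path/filepath"
    "sort"
)

func main() {
    f, err := os.Open(os.Args[1])
    if err != nil {
        panic(err)
    }
    defer f.Close()

    // dir -> "basename\x00contenthash" for each of its immediate files
    entries := map[string][]string{}
    r := csv.NewReader(f)
    for {
        rec, err := r.Read()
        if err == io.EOF {
            break
        }
        if err != nil {
            continue
        }
        path, hash := rec[0], rec[1]
        dir := filepath.Dir(path)
        entries[dir] = append(entries[dir], filepath.Base(path)+"\x00"+hash)
    }

    // signature -> directories whose immediate file sets are identical
    same := map[string][]string{}
    for dir, files := range entries {
        sort.Strings(files)
        h := sha256.New()
        for _, e := range files {
            io.WriteString(h, e+"\n")
        }
        sig := hex.EncodeToString(h.Sum(nil))
        same[sig] = append(same[sig], dir)
    }

    for _, dirs := range same {
        if len(dirs) > 1 {
            fmt.Println(dirs)
        }
    }
}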

The next idea is to find whole sub-trees of directories that are the same. This is pretty promising: it is finding 24% savings.
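One way to get there (again a sketch, not the real program) is a Merkle-style rollup: a directory’s signature covers its own files plus the signatures of its child directories, so identical sub-trees end up with identical signatures:

package main

import (
    "crypto/sha256"
    "encoding/csv"
    "encoding/hex"
    "fmt"
    "io"
    "os"
    "path/filepath"
    "sort"
)

// sigOf computes a signature over a directory's own (name, content-hash)
// entries plus the (name, signature) of each child directory.
func sigOf(dir string, files, children map[string][]string, memo map[string]string) string {
    if s, ok := memo[dir]; ok {
        return s
    }
    parts := append([]string{}, files[dir]...)
    for _, child := range children[dir] {
        parts = append(parts, filepath.Base(child)+"\x00"+sigOf(child, files, children, memo))
    }
    sort.Strings(parts)
    h := sha256.New()
    for _, p := range parts {
        io.WriteString(h, p+"\n")
    }
    s := hex.EncodeToString(h.Sum(nil))
    memo[dir] = s
    return s
}

func main() {
    f, err := os.Open(os.Args[1])
    if err != nil {
        panic(err)
    }
    defer f.Close()

    files := map[string][]string{}    // dir -> its own files' name+hash entries
    children := map[string][]string{} // dir -> immediate child directories
    seen := map[string]bool{}

    r := csv.NewReader(f)
    for {
        rec, err := r.Read()
        if err == io.EOF {
            break
        }
        if err != nil {
            continue
        }
        path, hash := rec[0], rec[1]
        dir := filepath.Dir(path)
        files[dir] = append(files[dir], filepath.Base(path)+"\x00"+hash)
        // link each newly seen directory to its parent
        for d := dir; !seen[d]; d = filepath.Dir(d) {
            seen[d] = true
            if parent := filepath.Dir(d); parent != d {
                children[parent] = append(children[parent], d)
            }
        }
    }

    // group directories whose entire sub-trees are identical
    memo := map[string]string{}
    same := map[string][]string{}
    for dir := range seen {
        sig := sigOf(dir, files, children, memo)
        same[sig] = append(same[sig], dir)
    }
    for _, dirs := range same {
        if len(dirs) > 1 {
            fmt.Println(dirs)
        }
    }
}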