New project idea: Data Diaper
My latest project is called Data Diaper, a backup program that I've been meaning to write. I wanted a project that yields visible results quickly, so that progress is easy to see as it is made. I've also been looking for a project to use for learning Python. My first idea was to write some utilities for handling my music collection in Python, but I rather enjoy the utilities I've already made and couldn't think of any others that I needed.
My criteria for writing Data Diaper are these:
- Modular design to allow backups to be made to new destinations (e.g. CD-R/RW, HTTP, FTP, SFTP, NFS, SMB, etc.)
- Modular design to allow different sources for backups so that one could back up a collection of CDs, a directory on an FTP server, their home directory on a server they have SSH access to, etc.
- Support for incremental backups
- Written in Python
- Simplicity
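To make the modular source and destination criteria a little more concrete, here is a minimal sketch of how the two kinds of modules might plug together. Nothing here is settled design: the names Source, Destination, MemorySource, MemoryDestination and run_backup are all hypothetical, and the in-memory classes only stand in for real modules such as FTP or CD-R.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterator


class Source(ABC):
    """Hypothetical interface for anything that can be backed up."""

    @abstractmethod
    def list_files(self) -> Iterator[str]:
        """Yield the paths this source can provide."""

    @abstractmethod
    def read(self, path: str) -> bytes:
        """Return the contents of one file."""


class Destination(ABC):
    """Hypothetical interface for anywhere a backup can be written."""

    @abstractmethod
    def write(self, path: str, data: bytes) -> None:
        """Store one file in the backup."""


class MemorySource(Source):
    """Stand-in source that serves files from a dict (illustration only)."""

    def __init__(self, files: Dict[str, bytes]) -> None:
        self.files = dict(files)

    def list_files(self) -> Iterator[str]:
        return iter(sorted(self.files))

    def read(self, path: str) -> bytes:
        return self.files[path]


class MemoryDestination(Destination):
    """Stand-in destination that keeps files in a dict (illustration only)."""

    def __init__(self) -> None:
        self.files: Dict[str, bytes] = {}

    def write(self, path: str, data: bytes) -> None:
        self.files[path] = data


def run_backup(source: Source, destination: Destination) -> int:
    """Copy every file from source to destination; return the file count."""
    count = 0
    for path in source.list_files():
        destination.write(path, source.read(path))
        count += 1
    return count
```

With this shape, the core only ever talks to the two interfaces, so adding a new source or destination should not require touching the backup loop itself.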
These criteria raise a few questions.
Should there exist a daemon to help in downloading remote files to be backed up?
My original idea was to use SHA-1 hashes in conjunction with Tigertrees (such as are used by some Gnutella clients to verify the integrity of multi-source downloads) so that only the parts of a file that have changed would need to be downloaded. This would save time when transferring over slow connections. However, the remote end would also have to compute the SHA-1 and Tigertree hashes, and that does not seem possible unless a companion daemon runs on the remote end to compute them.
Aside from using a remote daemon, one could set up rsync for the remote connection and use an rsync source module in Data Diaper. So my solution is to go ahead with an rsync source module, and to still create the SHA-1 and Tigertree hashes, both to verify the backup's integrity and to have them available for future modules.
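The block-level idea can be sketched roughly as follows. This is a simplified illustration, not a real Tigertree: Python's hashlib does not ship the Tiger hash, so SHA-1 is used for both the blocks and the tree nodes, and the result is not compatible with what Gnutella clients compute. The block size and the function names (block_hashes, merkle_root, changed_blocks) are assumptions made up for the example.

```python
import hashlib
from typing import List

BLOCK_SIZE = 1024  # assumed block size, chosen only for illustration


def block_hashes(data: bytes, block_size: int = BLOCK_SIZE) -> List[bytes]:
    """Hash each fixed-size block of the data separately."""
    return [
        hashlib.sha1(data[i:i + block_size]).digest()
        for i in range(0, len(data), block_size)
    ]


def merkle_root(leaves: List[bytes]) -> bytes:
    """Combine block hashes pairwise until a single root hash remains."""
    if not leaves:
        return hashlib.sha1(b"").digest()
    level = list(leaves)
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), 2):
            if i + 1 < len(level):
                next_level.append(hashlib.sha1(level[i] + level[i + 1]).digest())
            else:
                # An unpaired node is carried up to the next level unchanged.
                next_level.append(level[i])
        level = next_level
    return level[0]


def changed_blocks(old: List[bytes], new: List[bytes]) -> List[int]:
    """Return the indexes of blocks in `new` that differ from, or are absent in, `old`."""
    return [
        i for i, digest in enumerate(new)
        if i >= len(old) or old[i] != digest
    ]
```

Comparing the two root hashes tells you cheaply whether anything changed at all, and changed_blocks then names the blocks that would have to be transferred. Both sides need their own list of block hashes for this to work, which is exactly why the remote end would need some helper.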
Should incremental backups be made as patches to existing backups or should existing backups be modified, or should this be an option?
If one is backing up data to a CD-RW or a set of CD-RWs, it might be desirable to simply update the discs so that they contain the latest copies of the files being backed up. This would save discs when making backups. It would also spare one from having to restore an old copy of the backup and then patch their way up to a newer version. Updating in place is not possible on write-once media such as CD-Rs, since the files on the discs cannot be changed. However, the choice between updating the backup (by replacing the CD-Rs with new versions) and creating patches to the backup (creating CD-Rs that bring the previous full backup up to the current state) still exists.
My current inclination is to just make this a user-selectable option.
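As a rough sketch of what that option might look like, the example below compares two manifests (a mapping of path to content hash, which is an assumed format) and plans the work for either mode. The function names diff_manifests and plan_backup, the mode strings "update" and "patch", and the shape of the returned plan are all hypothetical.

```python
from typing import Dict, List, Tuple


def diff_manifests(
    old: Dict[str, str], new: Dict[str, str]
) -> Tuple[List[str], List[str]]:
    """Return (changed_or_added_paths, removed_paths) between two manifests."""
    changed = sorted(p for p in new if old.get(p) != new[p])
    removed = sorted(p for p in old if p not in new)
    return changed, removed


def plan_backup(old: Dict[str, str], new: Dict[str, str], mode: str) -> dict:
    """Plan an incremental backup in the user-selected mode."""
    changed, removed = diff_manifests(old, new)
    if mode == "update":
        # Modify the existing backup in place (e.g. rewrite a CD-RW).
        return {"target": "existing", "write": changed, "remove": removed}
    if mode == "patch":
        # Leave the existing backup alone and emit a separate patch set
        # that records what to write and what to remove on restore.
        return {"target": "new patch set", "write": changed, "remove": removed}
    raise ValueError(f"unknown mode: {mode!r}")
```

In both modes the same files get written. The difference is only where they land: on top of the existing backup, or in a new set that has to be applied after restoring the full backup.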
That is about as far as I have come with Data Diaper. I'm working over the design in my head and I have a general idea of how things will be done. I'll soon create a TODO list that will give me a more concrete idea of what needs to be done and then arrange the items according to dependencies and interest.
