Brainstorming help needed (Data copy job - ideas needed)

Madison Kelly linux-5ZoueyuiTZhBDgjK7y7TUQ at public.gmane.org
Tue Jun 8 04:29:00 UTC 2004


Hi all,

I am at the literal last step of my backup program and have only one 
hurdle left to overcome. I am not sure, though, how best (logically) to 
go about it, so I am hoping that I can pick your brains or have my idea 
vetted.

Just before that, though: I think the licensing has been hammered out 
(a lawyer needs to okay it, I guess). It looks like we'll release the 
source code, and it will be free for residential use and free for 
not-for-profits and non-governmental organizations. It isn't GPL (which 
I was hoping for) but it is pretty close. Once we land a few clients 
and development costs are covered, I think I can prod my boss (who is a 
great guy) into GPL'ing it.

Anywho, back to the problem:

The backup software is designed to work on top of very flexible 
configurations. This means that any number of source and destination 
partitions may exist at one time, that they can be mounted anywhere (or 
not mounted at all), and the program will still work. There is a switch 
in the scheduler that can require a certain group of sources or 
destinations to exist, but you get the idea. Now my current (and last) 
challenge is to figure out how to run the backup itself.

The trick is that one, two or more backup partitions may exist at one 
time. When a backup job starts, I want to take the source data, split 
it into X tasks (X being the number of destinations online), and then 
have each task run in parallel, pushing the data over.
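
In Python-ish terms, the parallel part might look something like this 
rough sketch (the destination list, source and exclude-file paths are 
all made-up examples):

    # Sketch: one rsync per online destination, all run in parallel.
    import subprocess

    destinations = ["/mnt/backup1", "/mnt/backup2"]  # X online dests

    procs = []
    for i, dest in enumerate(destinations):
        # Each job gets its own exclude file (written beforehand).
        cmd = ["rsync", "-a",
               "--exclude-from=/tmp/job%d.exclude" % i,
               "/", dest]                 # "/" is the example source
        procs.append(subprocess.Popen(cmd))

    for p in procs:                       # wait for every copy to end
        p.wait()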

Now, I am using 'rsync' to actually move the data, and I am using its 
include/exclude file to determine which directories (and mount points) 
and files are copied where. So once I figure out how to parcel out the 
jobs, I simply need to write the 'rsync' include/exclude file for each 
job to exclude the directories and files that are assigned to the other 
jobs, and then run.
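
Writing those per-job exclude files could look something like this 
(again a sketch; the assignment table is a made-up example):

    # Sketch: each job's exclude file lists everything assigned to
    # the *other* jobs, so its rsync only copies this job's share.
    assignments = {                # hypothetical: dest -> its paths
        "/mnt/backup1": ["/home/alice", "/var/www"],
        "/mnt/backup2": ["/home/bob", "/srv/mail"],
    }

    for i, dest in enumerate(assignments):
        others = [p for d, paths in assignments.items()
                  if d != dest for p in paths]
        with open("/tmp/job%d.exclude" % i, "w") as f:
            for path in others:
                f.write(path + "\n")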

Is my lack of sleep showing? I hope this makes sense...

Hokay, so (as my friend Lexy would say): when the job starts, it checks 
to see which source partitions have been selected for this job, mounts 
each partition (if needed), and then reads in that partition's current 
data. Once the partitions are mounted and up to date, each partition's 
contents are re-read and the current status of every file and directory 
is updated or recorded. At this point I know, via the database, 
everything on the source that I am about to move (yes, deltas can occur 
in that short window... I haven't worried about that yet). Oh, and any 
online destinations are updated/mounted similarly, but their contents 
are not scanned.
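
The mount check itself can be done by reading /proc/mounts; something 
like this sketch (the device name and fallback mount point are 
hypothetical):

    # Sketch: find where a device is mounted, if anywhere, and mount
    # it at a fallback point when it isn't.
    import subprocess

    def mount_point(device):
        """Return the mount point of 'device', or None if unmounted."""
        with open("/proc/mounts") as f:
            for line in f:
                dev, mnt = line.split()[:2]
                if dev == device:
                    return mnt
        return None

    def ensure_mounted(device, fallback="/mnt/backup-tmp"):
        mnt = mount_point(device)
        if mnt is None:
            subprocess.check_call(["mount", device, fallback])
            mnt = fallback
        return mnt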

So my plan currently is to search the database for what I need to back 
up (based on a flag set earlier by the user) and then say: "okay, I 
need to back up 50GB of data across 2 destinations; start writing out 
the first job's 'rsync' file and keep a running total of the file sizes 
until I either come up on the size of the first (and smallest) 
destination partition or hit roughly 25GB, then switch over and start 
writing out the 'rsync' file for the other destination."
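
That greedy split might look roughly like this (the sizes and 
capacities are made-up numbers, and the file list would really come 
from the database):

    # Sketch: fill the first destination until the next file won't
    # fit, then spill over to the next one for good.
    files = [("/home/alice/a.iso", 700000000),  # (path, bytes)
             ("/home/bob/mail.tar", 200000000),
             ("/var/www/site.tgz", 50000000)]
    capacities = {"/mnt/backup1": 25000000000,
                  "/mnt/backup2": 25000000000}

    assignments = {dest: [] for dest in capacities}
    used = {dest: 0 for dest in capacities}
    dests = list(capacities)
    d = 0
    for path, size in files:
        # Once a destination is full, move on to the next one.
        while (d < len(dests) and
               used[dests[d]] + size > capacities[dests[d]]):
            d += 1
        if d == len(dests):
            raise RuntimeError("not enough backup space for this job")
        assignments[dests[d]].append(path)
        used[dests[d]] += size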

Now, for knowing what went where (so users can later recover), they 
will be able to search by file name (or etc.) and each match will be 
found, because when the backup finishes ('rsync' exits) a list of all 
the files copied from a source to a destination is copied from the 
source partition's data into a backup table. That table remembers when 
the file was backed up, where it came from, where it went, and what the 
file's info was at the time the backup ran. What is on a backup drive 
can be updated, too.
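
A minimal version of that backup table might look like this (sqlite is 
used here purely for illustration; the real schema is whatever the 
program's database already uses):

    # Sketch of a minimal backup-history table (illustrative only).
    import sqlite3

    db = sqlite3.connect("backup.db")
    db.execute("""
        CREATE TABLE IF NOT EXISTS backup_history (
            file_path   TEXT,      -- where the file lived on the source
            source_part TEXT,      -- source partition it came from
            dest_part   TEXT,      -- destination partition it went to
            backed_up   TIMESTAMP, -- when the backup ran
            size_bytes  INTEGER,   -- file info at backup time
            mtime       TIMESTAMP
        )""")
    db.commit()

    # Recovery can then search by name:
    rows = db.execute("SELECT dest_part, backed_up FROM backup_history"
                      " WHERE file_path LIKE ?", ("%report%",)).fetchall()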

The part that throws me is that 'rsync' shines when I can copy the same 
chunk of files to the same destination over and over, so that only 
deltas are copied. The program is already ridiculously database-crazy, 
so I am hoping someone here might have a genius idea on how to go about 
this that I may have missed.
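
One half-formed idea (maybe someone can poke holes in it) is to make 
the split sticky: prefer whatever destination a file went to on the 
last run, looked up from the backup table, and only re-assign it when 
that destination is full. That way rsync keeps seeing the same files on 
the same drive and ships only deltas. A sketch, reusing the 'used' and 
'capacities' tables from above:

    # Sketch: "sticky" destination choice. 'previous' maps each path
    # to the dest_part recorded for it on the last run.
    def pick_dest(path, size, previous, used, capacities):
        old = previous.get(path)       # dest from the previous run
        if old in used and used[old] + size <= capacities[old]:
            return old
        for dest in capacities:        # otherwise, first that fits
            if used[dest] + size <= capacities[dest]:
                return dest
        raise RuntimeError("not enough backup space for this job")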

Anyway, thanks for, if nothing else, reading my ramblings (yes, I am 
gawd-awful tired right now... I've been at this program non-stop for 
over five weeks now).

Madison
--
The Toronto Linux Users Group.      Meetings: http://tlug.ss.org
TLUG requests: Linux topics, No HTML, wrap text below 80 columns
How to UNSUBSCRIBE: http://tlug.ss.org/subscribe.shtml




