Brainstorming help needed (Data copy job - ideas needed)

Lennart Sorensen lsorense-1wCw9BSqJbv44Nm34jS7GywD8/FfD2ys at public.gmane.org
Tue Jun 8 14:59:20 UTC 2004


On Tue, Jun 08, 2004 at 12:29:00AM -0400, Madison Kelly wrote:
> Hi all,
> 
> I am at the literal last step of my backup program and have only one 
> hurdle left to overcome. I am not sure, though, how best (logically) to 
> go about it, so I am hoping that I can pick your brains or have my 
> idea vetted.
> 
> Just before that, though: I think the licensing has been hammered out 
> (the lawyer needs to okay it, I guess). It looks like we'll release the 
> source code and it will be free for residential use and free for 
> not-for-profits and non-governmental organizations. It isn't GPL (which 
> I was hoping for) but it is pretty close. Once we land a few clients and 
> development is covered I think I can prod my boss (who is a great guy) 
> into GPL'ing it.
> 
> Anywho, back to the problem:
> 
> The backup software is designed to work on top of very flexible 
> configurations. This means that any number of source and destination 
> partitions may exist at one time, they can be mounted anywhere (or not 
> mounted at all), and the program will still work. There is a switch in 
> the scheduler that can require a certain group of sources or 
> destinations to exist, but you get the idea. My current (and last) 
> challenge is to figure out how to run the backup.
> 
> The trick is that one, two or more backup partitions may exist at one 
> time. When a backup job starts, I want to take the source data, split 
> it into X tasks (X being the number of destinations online) and then 
> have each task run in parallel, pushing the data over.
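> 
> A rough sketch of what I mean by the parallel part (Python just for 
> illustration; the job list, paths and flags here are made up):
> 
>   import subprocess
> 
>   # One (exclude_file, destination) pair per online backup partition;
>   # these paths are purely illustrative.
>   jobs = [
>       ("/tmp/job1.exclude", "/mnt/backup1/"),
>       ("/tmp/job2.exclude", "/mnt/backup2/"),
>   ]
> 
>   # Start one rsync per destination so the copies run in parallel.
>   procs = [
>       subprocess.Popen(
>           ["rsync", "-a", "--exclude-from=" + exc, "/mnt/source/", dest]
>       )
>       for exc, dest in jobs
>   ]
> 
>   # Wait for every copy to finish before recording the results.
>   for p in procs:
>       p.wait()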
> 
> Now, I am using 'rsync' to actually move the data, and I am using the 
> include/exclude file to determine which directories (and mount points) 
> and files are copied where. So once I figure out how to parcel out the 
> jobs, I simply need to write the 'rsync' include/exclude file for each 
> job to exclude the directories and files that are assigned to other 
> jobs, and then run.
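> 
> For example, the exclude files could be written out with something like 
> this (the job assignments are hard-coded here just for illustration; 
> the real ones would come out of the database):
> 
>   # Directories assigned to each job. A leading '/' anchors the
>   # pattern at the root of the rsync transfer.
>   assignments = {
>       "job1": ["/home/", "/etc/"],
>       "job2": ["/var/", "/srv/"],
>   }
> 
>   # Each job's exclude file lists everything assigned to the *other*
>   # jobs, so rsync (via --exclude-from) copies only that job's share.
>   for job, paths in assignments.items():
>       with open("/tmp/%s.exclude" % job, "w") as f:
>           for other, other_paths in assignments.items():
>               if other == job:
>                   continue
>               for p in other_paths:
>                   f.write(p + "\n")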
> 
> Is my lack of sleep showing? I hope this makes sense...
> 
> Hokay, so (as my friend Lexy would say): when the job starts it checks 
> to see what source partitions have been selected for this job, then (if 
> needed) mounts each partition and reads in its current data. Once the 
> partitions are mounted and up to date, each partition's contents are 
> re-read and the current status of each file and directory is updated or 
> recorded. At this point I know, via the database, everything on the 
> source that I am about to move (yes, in that short time deltas can 
> occur... I haven't worried about that yet). Oh, and any online 
> destinations are updated/mounted similarly, but their contents are not 
> scanned.
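> 
> (Roughly the kind of re-read I mean, sketched in Python with a made-up 
> mount point; the real program records the results in its tables rather 
> than printing them:)
> 
>   import os
> 
>   # Walk one source partition and note the current state of each file.
>   for root, dirs, files in os.walk("/mnt/source"):
>       for name in files:
>           path = os.path.join(root, name)
>           st = os.lstat(path)
>           print(path, st.st_size, int(st.st_mtime))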
> 
> So my plan currently is to search the database for what I need to back 
> up (based on a flag set earlier by the user) and then say "okay, I need 
> to back up 50GB of data across 2 destinations; start writing out the 
> first job's 'rsync' file and keep track of the size of each file until 
> I either come up on the size of the first (and smallest) destination 
> partition or until I hit roughly 25GB, then switch over and start 
> writing out the 'rsync' file for the other destination."
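> 
> In other words, something like this greedy split (the sizes, paths and 
> destination capacities are invented for the example; the real file list 
> would come from the database):
> 
>   # (path, size-in-bytes) pairs in a consistent, deterministic order.
>   files = [
>       ("home/alice/movie.iso", 700 * 2**20),
>       ("home/bob/notes.txt", 4096),
>   ]
>   # One (list file, free bytes) pair per online destination.
>   destinations = [
>       ("/tmp/job1.files", 25 * 2**30),
>       ("/tmp/job2.files", 25 * 2**30),
>   ]
> 
>   dest = 0                        # destination currently being filled
>   used = 0                        # bytes assigned to it so far
>   lists = [[] for _ in destinations]
> 
>   for path, size in files:
>       # When the current destination is (roughly) full, move on to the
>       # next one; otherwise keep filling it.
>       full = used + size > destinations[dest][1]
>       if full and dest + 1 < len(destinations):
>           dest += 1
>           used = 0
>       lists[dest].append(path)
>       used += size
> 
>   # Write one file list per destination (these could feed rsync's
>   # --files-from, or be turned into the per-job exclude files).
>   for (list_path, _), names in zip(destinations, lists):
>       with open(list_path, "w") as f:
>           f.write("\n".join(names) + "\n")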
> 
> As for knowing what went where (so the user can later recover), they 
> will be able to search by file name (or similar) and each match will be 
> found, because when the backup is finished ('rsync' exits) a list of 
> all the files copied from a source to a destination will be copied from 
> the source partition's data into a backup table, which remembers when 
> the file was backed up, where it came from, where it went and what the 
> file info was at the time the backup ran. What is on a backup drive can 
> be updated, too.
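> 
> Something along these lines is what I mean by the backup table (sqlite 
> is used here only to keep the example self-contained; the real schema 
> is more involved):
> 
>   import sqlite3
>   import time
> 
>   db = sqlite3.connect("backup.db")
>   db.execute("""CREATE TABLE IF NOT EXISTS backed_up (
>       path        TEXT,
>       source      TEXT,
>       destination TEXT,
>       size        INTEGER,
>       mtime       INTEGER,
>       run_at      INTEGER
>   )""")
> 
>   def record(path, source, destination, size, mtime):
>       # Called once per copied file after 'rsync' exits successfully.
>       db.execute(
>           "INSERT INTO backed_up VALUES (?, ?, ?, ?, ?, ?)",
>           (path, source, destination, size, mtime, int(time.time())),
>       )
>       db.commit()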
> 
> The part that throws me is that 'rsync' shines when I can copy the same 
> chunk of files to the same destination over and over, so that only 
> deltas are copied. The program is already ridiculously database-heavy, 
> so I am hoping someone here might have a genius idea on how to go about 
> this that I may have missed.
> 
> Anyway, thanks for, if nothing else, reading my ramblings (yes, I am 
> gawd-awful tired right now... I've been at this program non-stop for 
> over five weeks now).

Assuming your file list is created in a deterministic (and consistent)
manner, I wouldn't worry about it, since the only files that would move
from slice1 to slice2 would be the last files that went to slice1, in
the case where new files have been inserted in the list of files before
them, hence shifting the end of slice1 into slice2.  Unless you expect
many insertions of new files or potentially many deletes (which would
shift data back from slice2 to slice1), very few of your files would
fail to end up in the same location as in the last backup.  For the few
that do move, oh well.  It's probably not worth building anything
complicated to deal with it.
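
A toy example of what I mean, in Python (splitting on file count rather
than size just to keep it short):

    # A deterministic (sorted) file list split into two equal slices.
    files = sorted(["a", "b", "c", "d", "e", "f"])
    half = len(files) // 2
    slice1, slice2 = files[:half], files[half:]
    print(slice1, slice2)   # ['a', 'b', 'c'] ['d', 'e', 'f']

    # Add one new file near the front and re-split: only 'c', the last
    # file in slice1, moves over to slice2; everything else stays put.
    files = sorted(files + ["aa"])
    half = len(files) // 2
    slice1, slice2 = files[:half], files[half:]
    print(slice1, slice2)   # ['a', 'aa', 'b'] ['c', 'd', 'e', 'f']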

If the file list comes out random each time it is generated, perhaps you
ought to create a better file list generator. :)

Any good reason to have the backup going to multiple places?

Lennart Sorensen
--
The Toronto Linux Users Group.      Meetings: http://tlug.ss.org
TLUG requests: Linux topics, No HTML, wrap text below 80 columns
How to UNSUBSCRIBE: http://tlug.ss.org/subscribe.shtml




