finding same files across hard drives

Lennart Sorensen lsorense-1wCw9BSqJbv44Nm34jS7GywD8/FfD2ys at public.gmane.org
Tue Dec 2 14:28:17 UTC 2008


On Mon, Dec 01, 2008 at 04:53:21PM -0500, Jose wrote:
> Antonio T. Sun wrote:
> >[sorry for my previous test mail -- didn't think it could pass through]
> >
> >On Sun, Nov 30, 2008 at 12:55 PM, Andrei <andreilitvin-bJEeYj9oJeDQT0dZR+AlfA at public.gmane.org> wrote:
> >
> >>>On Sat, 2008-11-29 at 12:07 -0500, Jose wrote:
> >>>>I've been trying to find files with the same name; basically I made
> >>>>multiple copies when I had these workstations.  I got a machine
> >>>>capable of holding more disk space and data, but I need to get a list
> >>>>so I can safely delete the data from one drive(s) and keep the other.
> >>>>I tried using a combination of find and du, but the output is not
> >>>>helpful.
> >>>>
> >>>>Is there any Linux rpm or source-to-compile utility that may help to
> >>>>do this?
> >>>
> >>On Sun, 2008-11-30 at 12:48 -0500, Andrei wrote:
> >>>How about something like:
> >>>
> >>>find /dir1 -type f | xargs md5sum | sort >data1.txt
> >>>find /dir2 -type f | xargs md5sum | sort >data2.txt
> >>>
> >>>join ./data1.txt ./data2.txt
> >>>
> >>>I think this should give you all the files with the same content (not
> >>>sure how it would handle duplicates though, but I guess it should work)
> >>>
> >>And if you are looking for duplicate names, you could use
> >>
> >>find /dir -type f | xargs -n 1 sh -c 'echo `basename $0` $0' ...
> >>
> >>However, that probably breaks when you have spaces in your names, so
> >>you can try a more "evil" version:
> >>
> >>find /dir -type f | xargs -n 1 sh -c 'echo `basename $0 | md5sum` $0'
> >>| ...
> >>
> >>to join by the MD5 of the name.  It will run faster, but like someone
> >>else said, searching by name alone for deduplication is fairly
> >>dangerous.  An MD5 search (while getting one or two good nights'
> >>sleep :-) ) is probably better.
> >
> >Personally, I'm against the idea of finding duplicates using MD5
> >checksums, because creating MD5s is sloooooow.  Moreover, if you
> >don't have a lot of duplicates, then probably 99% of your time
> >and CPU power is wasted on creating useless checksums.  Furthermore,
> >if there is any remote possibility of the following cases, then
> >the method is not complete:
> >
> >- Variation in file names.  I used to make backups using distinct
> >names, e.g., file.ver1, file.ver2, etc.  If you ever back up your
> >files this way, then finding duplicates by name won't help much.
> >
> >- Variation in content.  Since backups are made over time, any
> >slight change in file content will break the MD5 checksum
> >method entirely.
> >
> >If the above cases apply to you, even remotely, you still need
> >something more suitable for finding duplicates, and much faster
> >than creating MD5 checksums.
> >
> >Antonio
> >
> I tried both scripts (thanks guys), but it keeps breaking when it finds
> paths with blanks, like /sdb1/backup/C folder/etc...
> 
> basename breaks, and I haven't been able to find a solution to this
> problem yet.
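
As a side note on Antonio's point about MD5 being slow: he doesn't say
what the faster alternative is, but one common trick -- just my guess at
what he means, not his recipe -- is to compare file sizes first and only
checksum the files whose sizes collide; packaged tools like fdupes work
roughly this way.  A rough, untested sketch with GNU find/awk:

# size<TAB>path for every file
find /dir1 /dir2 -type f -printf '%s\t%p\n' > sizes.txt
# keep only the paths whose size occurs more than once
awk -F'\t' 'NR==FNR {n[$1]++; next} n[$1] > 1 {print $2}' \
    sizes.txt sizes.txt > candidates.txt
# checksum just the candidates (assumes no newlines in the names)
tr '\n' '\0' < candidates.txt | xargs -0 md5sum | sort | uniq -w32 -D

That skips the expensive hashing for every file whose size is unique.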

As for the spaces: add -print0 to find and -0 to xargs, and then spaces
should be much less of a problem.  The use of basename within backticks
and such still makes them awkward to deal with, though; quotes are now
needed around the arguments.

Perhaps:

find /dir -type f -print0 | xargs -0 -n 1 sh -c 'echo `basename "$0"` "$0"'
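
The same -print0/-0 trick applies to the md5sum pipeline suggested
earlier in the thread.  Roughly (untested):

find /dir1 -type f -print0 | xargs -0 md5sum | sort > data1.txt
find /dir2 -type f -print0 | xargs -0 md5sum | sort > data2.txt
join data1.txt data2.txt

join still splits on whitespace, so names containing spaces come out a
bit mangled in the listing, but the matching itself is done on the
checksum in the first column.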

I am not sure, though, since I am not certain what $0 refers to in the
basename one-liner above.  Usually $0 means the application name, but
that doesn't quite make sense here.
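
That said, as far as I can tell the trick is that with sh -c, anything
given after the script string becomes a positional parameter starting
at $0, so in the basename one-liner $0 is simply the filename that
xargs hands to each sh.  A quick way to see it:

sh -c 'echo "0=$0 1=$1"' foo bar
# prints: 0=foo 1=bar

The slightly more conventional spelling passes a throwaway name for $0
and uses $1 instead:

find /dir -type f -print0 \
  | xargs -0 -n 1 sh -c 'echo "$(basename "$1")" "$1"' sh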

-- 
Len Sorensen
--
The Toronto Linux Users Group.      Meetings: http://gtalug.org/
TLUG requests: Linux topics, No HTML, wrap text below 80 columns
How to UNSUBSCRIBE: http://gtalug.org/wiki/Mailing_lists





More information about the Legacy mailing list