finding the same files across hard drives

Antonio T. Sun mlist.ats-w1QkCcy0X+BxKfgMtfWJuA at public.gmane.org
Mon Dec 1 19:29:55 UTC 2008


[Sorry for my previous test mail -- I didn't think it would get through]

On Sun, Nov 30, 2008 at 12:55 PM, Andrei <andreilitvin-bJEeYj9oJeDQT0dZR+AlfA at public.gmane.org> wrote:

>> On Sat, 2008-11-29 at 12:07 -0500, Jose wrote:
>> >
>> > I've been trying to find files with the same name. Basically, I made
>> > multiple copies when I had these workstations. I got a machine capable
>> > of holding more disks and data, but I need to get a list so I can safely
>> > delete the data from one drive(s) and keep the other. I tried using a
>> > combination of find and du, but the output is not helpful.
>> >
>> > Is there any Linux RPM, or source to compile, utility that may help to
>> > do this?
>
> On Sun, 2008-11-30 at 12:48 -0500, Andrei wrote:
>>
>> How about something like:
>>
>> find /dir1 -type f | xargs md5sum | sort >data1.txt
>> find /dir2 -type f | xargs md5sum | sort >data2.txt
>>
>> join ./data1.txt ./data2.txt
>>
>> I think this should give you all the files with the same content (not
>> sure how it would handle duplicates though, but I guess it should work)
>>
> And if you are looking for duplicate names, you could use
>
> find /dir -type f | xargs -n 1 sh -c 'echo `basename $0` $0' ...
>
> however, that probably breaks when you have spaces in your names, so you
> can try a more "evil" variant:
>
> find /dir -type f | xargs -n 1 sh -c 'echo `basename $0 | md5sum` $0'
> | ...
>
> to join by the MD5 of the name. It will run faster, but like someone else
> said, searching by name alone for deduplication is fairly dangerous. An
> MD5 search (while getting one or two good nights' sleep :-) ) is probably
> better.

Personally, I'm against the idea of finding duplications using MD5
checksums, because computing MD5 sums is sloooooow. Moreover, if you
don't have a lot of duplications, then probably 99% of your time
and CPU power is wasted on creating useless checksums. Furthermore,
if there is any remote possibility of the following cases, the
method is incomplete:

- Variation in file names. I used to make backups using distinct
names, e.g., file.ver1, file.ver2, etc. If you ever back up your
files this way, then finding duplications by name won't help much.

- Variation in content. Since backups are made over time, any
slight change in file content will defeat the MD5 checksum
method entirely.

If the above cases apply to you, even remotely, you still need
something more suitable for finding duplications -- and it can be
much faster than computing MD5 checksums for everything.

Antonio

--
The Toronto Linux Users Group.      Meetings: http://gtalug.org/
TLUG requests: Linux topics, No HTML, wrap text below 80 columns
How to UNSUBSCRIBE: http://gtalug.org/wiki/Mailing_lists
