finding same files across hard drives

Jose jtc-vS8X3Ji+8Wg6e3DpGhMbh2oLBQzVVOGK at public.gmane.org
Mon Dec 1 21:53:21 UTC 2008


Antonio T. Sun wrote:
> [sorry for my previous test mail -- didn't think it could pass through]
> 
> On Sun, Nov 30, 2008 at 12:55 PM, Andrei <andreilitvin-bJEeYj9oJeDQT0dZR+AlfA at public.gmane.org> wrote:
> 
>>> On Sat, 2008-11-29 at 12:07 -0500, Jose wrote:
>>>> I've been trying to find files with the same name. Basically, I made
>>>> multiple copies when I had these workstations. I got a machine capable
>>>> of holding more disks and data, but I need to get a list so I can
>>>> safely delete the data from one drive (or drives) and keep the other.
>>>> I tried using a combination of find and du, but the output is not
>>>> helpful.
>>>>
>>>> Is there any Linux RPM or source-to-compile utility that may help to
>>>> do this?
>>
>> On Sun, 2008-11-30 at 12:48 -0500, Andrei wrote:
>>> How about something like:
>>>
>>> find /dir1 -type f | xargs md5sum | sort >data1.txt
>>> find /dir2 -type f | xargs md5sum | sort >data2.txt
>>>
>>> join ./data1.txt ./data2.txt
>>>
>>> I think this should give you all the files with the same content (not
>>> sure how it would handle duplicates though, but I guess it should work)
>>>
>> And if you are looking for duplicate names, you could use
>>
>> find /dir -type f | xargs -n 1 sh -c 'echo `basename $0` $0' ...
>>
>> however that probably breaks when you have spaces in your names, so you
>> can try a more "evil":
>>
>> find /dir -type f | xargs -n 1 sh -c 'echo `basename $0 | md5sum` $0'
>> | ...
>>
>> to join by the MD5 of the name. Will run faster, but like someone else
>> said, searching for same name for deduplication is fairly dangerous. MD5
>> search (while getting one or two good nights' sleep :-) ) is probably
>> better.
> 
> Personally, I'm against the idea of finding duplicates using MD5
> checksums, because creating MD5 sums is sloooooow. Moreover, if you
> don't have a lot of duplicates, then probably 99% of your time
> and CPU power is wasted on creating useless checksums. Furthermore,
> if there is any remote possibility of the following cases, then
> the method is not complete:
> 
> - Variation in file names. I used to make backups using distinct
> names, e.g., file.ver1, file.ver2, etc. If you ever back up your
> files this way, then finding duplicates by name won't help much.
> 
> - Variation in content. Since backups are made over time, any
> slight change in file content will break the MD5 checksum
> method entirely.
> 
> If the above cases (even remotely) apply to you, you still need
> something that is more suitable for finding duplicates and is
> much faster than creating MD5 checksums.
> 
> Antonio
> 
I tried both scripts (thanks guys), but it keeps breaking when it finds
paths with blanks, like /sdb1/backup/C folder/etc...

The basename step breaks, and I haven't been able to find a solution to
this problem yet.
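
One thing I have not tried yet is making everything null-delimited so
xargs stops splitting on the blanks. Here is a rough sketch of both
suggestions reworked that way (untested on my drives, and it assumes
GNU find, xargs and coreutils):

find /dir1 -type f -print0 | xargs -0 md5sum | sort > data1.txt
find /dir2 -type f -print0 | xargs -0 md5sum | sort > data2.txt
join data1.txt data2.txt

find /dir -type f -print0 \
  | xargs -0 -n 1 sh -c 'echo "$(basename "$0")" "$0"' | sort

The -print0/-0 pair hands each path to md5sum and sh as a single
argument, and quoting "$0" inside the sh -c snippet should keep
basename from choking on the spaces as well.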
--
The Toronto Linux Users Group.      Meetings: http://gtalug.org/
TLUG requests: Linux topics, No HTML, wrap text below 80 columns
How to UNSUBSCRIBE: http://gtalug.org/wiki/Mailing_lists




