Article 2690 of comp.lang.perl:
Xref: feenix.metronet.com alt.sources.d:397 comp.lang.perl:2690
Newsgroups: alt.sources.d,comp.lang.perl
Path: feenix.metronet.com!news.ecn.bgu.edu!wupost!howland.reston.ans.net!zaphod.mps.ohio-state.edu!cs.utexas.edu!uunet!pipex!unipalm!ian
From: ian@unipalm.co.uk (Ian Phillipps)
Subject: Re: checking for duplicate files
Message-ID: <1993May9.221923.28594@unipalm.co.uk>
Organization: Unipalm Ltd., 216 Cambridge Science Park, Cambridge CB4 4WA, UK
References: <C5LC4E.L5t@news.iastate.edu> <C5nC7K.D9F@rahul.net> <4669@teslab.lab.oz.au> <JSC.93May4125103@monolith.mit.edu>
Date: Sun, 9 May 1993 22:19:23 GMT
Lines: 155

jsc@monolith.mit.edu (Jin S Choi) writes:

>In article <4669@teslab.lab.oz.au> andrew@teslab.lab.oz.au (Andrew Phillips) writes:

>   In <C5nC7K.D9F@rahul.net> Rahul Dhesi <dhesi@rahul.net> writes:
>   :>Anyone have a good script to traverse a directory tree to
>   :>look for duplicate files.  Not just duplicate names, but duplicate
>   :>contents.
>   :
>   :The attached script looks for duplicate files based on 32-bit CRCs.
>   :It's a fairly quick hack: consider it beta software.

Well, here's a script I hacked up which goes a step farther. It doesn't
mind what the files are called, and works resonably efficiently on large
quantities of files. It will try to make links even when one of the
parties is multiply-linked. Takes list of filename on STDIN - i.e. it's
a "find" back-end, or it will take a list as arguments. I've just
noticed that I put in a "-I" switch as well.

Posted to comp.lang.perl as well, in case anyone over there's
interested.

---- hack here ----
#!/usr/local/bin/perl
# This relinks files which have duplicated information.
$sep="\n"; 	# Must not occur in a filename. Must be one character.

$0 =~ m#[^/]+$#; $progname = $&;
$usage = "$progname [ -I include-file | <files> ]\n";
die $usage if ( @ARGV <= 0 && -t STDIN );

# Sort out the files

$debug=(shift,$1) if $ARGV[0] =~ /^-d(.*)/;
$verbose=(shift,$1) if $ARGV[0] =~ /^-v(.*)/;

if( $ARGV[0] eq '-I' )
{
    shift; open( STDIN, ( $_ = shift ) ) || die "Cannot open $_: $!\n";
}

# addname takes file name in $_, and adds it to the list (duplicates
# silently dropped).
sub addname
{
    return if $ident{ $_ };
    ($dev,$ino,$mode,$nlink,$uid,$gid,$rdev,$size,
	$atime,$mtime,$ctime,$blksize,$blocks) = stat( $_ );

    $id = "$dev:$ino";
    $ofsize{ $size } .= $_ . $sep;
    $ident{ $_ } = $id;
    return if $nlinks{ $id };
    $nlinks{ $id } = $nlink;
    $sizes{ $id } = $size;
}

# dolink( name1, name2 )
# links two files, choosing one at random. Permission problems cause abort!
# Returns null string if OK, or error message if not.

sub dolink
{
    local ($n1, $n2 ) = @_;
    $id1 = $ident{ $n1 };
    $id2 = $ident{ $n2 };
    return '' if $id1 eq $id2;
    return $error = "Too many links $n1 and $n2" if $nlinks{ $id1 } > 1 && $nlinks{ $id2 } > 1;
    ($id1, $id2, $n1, $n2 ) = ($id2, $id1, $n2, $n1 ) if $nlinks{ $id2 } > 1;
    # We remove n2 and link n1 to it.
    # Try to move n2 to start with...ignore 14-char limits...
    $tempname = $n2.".tmp";
    $tempname++ while -f $tempname;
    # This will test write-access to directory with file 2:
    print "Linking $n1 to $n2\n" if $verbose > 0;
    rename( $n2, $tempname ) || return $error = "Cannot rename $n2: $!";
    if ( link( $n1, $n2 ) )
    {
	unlink( $tempname ) || return $error = "Cannot remove $tempfile: $!";
    }
    else
    {   rename( $tempname, $n2 ) ||
	    die "$0: panic: can't rename $tempname to $n2\n";
    }
    '';
}

# Test whether link is rquired and do it.
sub linkif
{
    # Skip if already linked
    print "Should I link @names?\n" if $verbose > 1;
    return 0 if $ident{ $names[0] } eq $ident{ $names[1] };

    $return = &docmp;
    return 0 unless $return == 0;
    # Files are the same, so link them
    warn $error if &dolink( $names[0], $names[1] );
}

# Run a comparison on two files. Easy for now...
# Return 0 iff files are the same.
sub docmp
{
    system( 'cmp', '-s', @_[0,1] );
}

unless( @ARGV ) { while( <STDIN> ) { chop; &addname; } }
for( @ARGV ) { &addname; }

# Now sort out the duplicates.
# Address files in size order.

size:
while( ($sz, $files) = each( %ofsize ) )
{
    next size unless $sz > 0;	# Ignore zero-length files
    chop $files;		# remove $sep from end
    @names = split( $sep, $files );
    next size unless @names > 1;

    # Each file is of the same size, so compare them.

    # We only consider merging files with a link count of 1.
    # Case 1: two files only
    if ( @names == 2 )
    {
	&linkif(@names);
    }
    else # more than two files: Permute
    {
	# Now do some cmps...
	@names = sort { $nlinks{$b} <=> $nlinks{$a}; } @names;
	# @tonames holds list of names we consider linking TO
	@tonames = shift @names;
	@tonames = shift @names while $nlinks{$names[0]} > 1;
	# .. and names holds the rest
	name:
	foreach $name (@names)
	{
	    foreach $toname ( @tonames )
	    {
		if( ! &docmp($name, $toname) )
		    { warn $error if &dolink( $name, $toname); next name; }
		# Didn't match, so put this on the list of unique files
	    }
	    push( @tonames, $_ );
	}
    }
}
--- hack here ---
-- 
Ian Phillipps, Unipalm Ltd, 216 Science Park,		Phone +44 223 420002
Milton Road, Cambridge, CB4 4WA, England.		Phax  +44 223 426868
PIPEX Ltd. is a wholly-owned subsidiary of Unipalm Ltd. - phone 0223 424616.