Google Answers: Removing duplicate records from csv in Perl

View Question

Q: Removing duplicate records from csv in Perl ( Answered, 8 Comments )

Question

Subject: Removing duplicate records from csv in Perl
Category: Computers
Asked by: corn_jan-ga
List Price: $15.00

Posted: 13 Jun 2002 06:55 PDT
Expires: 20 Jun 2002 06:55 PDT
Question ID: 25196

Hello,

We have a webserver runnin Perl 5.6 and Linux. A script generates a
csv file (flatfile database file) which is then read by a website.
However, this generated csv file often contains duplicate records.
How can we automatically remove duplicate records? The solution needs
to be able to run through a Telnet command prompt. The script or
program needs to generate a _new_ csv file not containing the
duplicate recors.
I think this problem is quite simple so the answer needs to be
complete and needs to fully work.

An alternative automatic solution is also possible as long as it runs
on our server and can be configured to run at set times.

Thanks,
Corne

Answer

Subject: Re: Removing duplicate records from csv in Perl
Answered By: iaint-ga on 13 Jun 2002 07:17 PDT

Hi

Here's a solution that does what you require, with a side-effect of
sorting the data into (pseudo-)alphabetical order. For most database
applications this shouldn't be a problem, however if the ordering of
the data is critical and needs to be preserved as in the original
file, please let me know and I'll work on a slightly more complex
version.

To use: simply cut and paste the following code into a text file (eg
dedupe.pl). It can then simply be run from the UNIX command-line with:
   perl dedupe.pl

Or alternatively if you make it executable by doing chmod u+x
dedupe.pl it can be run simply by typing
   ./dedupe.pl

There are three variables you will need to modify to make this work on
your specific data, they are commented in the code but in brief you
will need to change:
   $csvfile: this is your existing CSV file
   $newfile: this is the name of the deduped file the program will
write out
   $fieldnames: if the first line of your CSV file contains field
names (rather than data), leave this set at 1. If your CSV file is
just pure data, set this to 0.

Please don't hesitate to use the clarification request if you require
anything further regarding this request.

Thank you!


--- code begins below ---

#!/usr/bin/perl -w

# I place the following code 100% in the public domain.

use strict;

# Set to filename of CSV file
my $csvfile = 'test.csv';

# Set to filename of de-duped file (new file)
my $newfile = 'new.csv';

# Set to 1 if first line of CSV file contains field names, 0 otherwise
my $fieldnames = 1;

### Shouldn't need to change stuff below here ###

open (IN, "<$csvfile")  or die "Couldn't open input CSV file: $!";
open (OUT, ">$newfile") or die "Couldn't open output file: $!";

# Read header lines if they exist
my $header;
$header = <IN> if $fieldnames;

# Slurp in & sort everything else
my @data = sort <IN>;

# If we read in a header line, throw it back out again
print OUT $header;

my $n = 0;
# Now go through the data line by line, writing it to output unless
it's identical
# to the previous line (in which case it's a dupe)
my $lastline = '';
foreach my $currentline (@data) {
  next if $currentline eq $lastline;
  print OUT $currentline;
  $lastline = $currentline;
  $n++;
}

close IN; close OUT;

print "Processing complete. In = " . scalar @data . " records, Out =
$n records\n";

Request for Answer Clarification by corn_jan-ga on 13 Jun 2002 07:43 PDT

Hello

The results are good. But there is one problem: the records can't be
sorted and need to stay in original order.

So please make the slightly more complex version as you already
mentioned as possible.

Thanks in advance,
Corné Janmaat

Clarification of Answer by iaint-ga on 13 Jun 2002 08:14 PDT

For a non-sorted dedupe I would suggest you use xabi-ga's code from
the third comment below. It uses the same technique I would use to
check for duplicates (using the line as a hash key) and should meet
your needs perfectly. However if you'd still like me to produce
something similar I'm more than happy to do so.

Regards
iaint-ga

Comments

Subject: Re: Removing duplicate records from csv in Perl
From: xabi-ga on 13 Jun 2002 07:22 PDT

cat original_csv.txt | sort -u > new_csv.txt

that's all

xabi

PS: More help, just ask.

Subject: Re: Removing duplicate records from csv in Perl
From: xabi-ga on 13 Jun 2002 07:30 PDT

How big is the original file?

foo,12345,bar
foo,56789,more trash

Are that records equal records? or a record is equal to other if both are exact?

None perl solution:

% cat original_csv.txt | sort -u > new_csv.txt

Note that the output are a sorted file.

If you only need to check for diferent fields just do this:

% cat original_csv.txt | sort -u -t, -k2,2

where 2,2 means "field 2"

xabi

Subject: Re: Removing duplicate records from csv in Perl
From: xabi-ga on 13 Jun 2002 07:36 PDT

And here goes the perl version if you need to do it but you need an
unsorted out file (As de original but without duplicates)

#!/usr/bin/perl -w 
use strict; 
my $origfile = 'xabi.txt'; 
my $outfile  = 'xabi2.txt'; 
my %hTmp;
 
open (IN, "<$origfile")  or die "Couldn't open input file: $!"; 
open (OUT, ">$outfile") or die "Couldn't open output file: $!"; 
 
while (my $sLine = <IN>) {
  print OUT $sLine unless ($hTmp{$sLine}++);
}
close OUT;
close IN;

Subject: Re: Removing duplicate records from csv in Perl
From: dwightrau-ga on 13 Jun 2002 08:43 PDT

isnt it just simpler to do
uniq file > newfile
seems too easy?

Subject: Re: Removing duplicate records from csv in Perl
From: xabi-ga on 13 Jun 2002 09:40 PDT

dwightrau-ga:

You are not right with

uniq file > newfile 

cause file must be sorted before ussing uniq. You can do it as you want this way:

or

cat original_csv.txt | sort -u > new_csv.txt

or 

sort -u original_csv.txt > new_csv.txt

xabi

Subject: Re: Removing duplicate records from csv in Perl
From: billmichaelson-ga on 14 Jun 2002 06:40 PDT

This satisfies all your requirements.  Run from shell command line as a filter...

perl -e \
'while ($l = <STDIN>) { $seen{$l} += 1;  next if $seen{$l} > 1; print $l; }'

Subject: Re: Removing duplicate records from csv in Perl
From: billmichaelson-ga on 14 Jun 2002 07:22 PDT

Hmm. My code was clunky.  This is more concise, saves typing...

perl -e 'while (<>) { print $_ unless $s{$_}++; }' <infile >outfile

Subject: Re: Removing duplicate records from csv in Perl
From: gautamg-ga on 18 Jun 2002 03:59 PDT

Hi there:

Assuming you're still looking for better answer.  I'm pasting the
script below.
This script assumes you have Perl installed in /usr/bin/perl. Change
it if perl resides somewhere else.
Rest of the script should need no change. Script usage is explained
within comments.

Sample input:
Header
1,2,3
2,1,3
1,2,3
2,1,3
3,2,1
2,3,1

Sample output:
Header
1,2,3
2,1,3
3,2,1
2,3,1

Let me know if you need any fixes.

#!/usr/bin/perl -w

# Author: Gautam Gopalakrishnan.
# Feel free to reuse.
#
# Usage: dupe.pl filename.csv [file2.csv] [file3.csv] ...
# Ofcourse, this script does not care if the input file is a csv.
# Unique lines are placed in the order they appear
# WARNING: Original file is overwritten. 

use strict;
my (@final, %hash, $file) = ((), (), "");

if (not defined $ARGV[0]) {
	print "Usage: dupe.pl filename.csv [file2.csv] [file3.csv] ...\n";
	exit -1;
}
foreach $file (@ARGV) {
	if (!open FILE, "+<$file") {
		print "Unable to open input csv file for read-write, '$file' $!\n";
		next;
	}
	while (<FILE>) {
		if (not exists $hash{$_}) {
			push @final, $_;
			$hash{$_} = 1;
		}
	}
	truncate FILE, 0;
	seek FILE, 0, 0;
	print FILE @final;
	close FILE;
	%hash = @final = ();
}

Important Disclaimer: Answers and comments provided on Google Answers are general information, and are not intended to substitute for informed professional medical, psychiatric, psychological, tax, legal, investment, accounting, or other professional advice. Google does not endorse, and expressly disclaims liability for any product, manufacturer, distributor, service or service provider mentioned or any opinion expressed in answers or comments. Please read carefully the Google Answers Terms of Service.

If you feel that you have found inappropriate content, please let us know by emailing us at answers-support@google.com with the question ID listed above. Thank you.

Search Google Answers for

Google Home - Answers FAQ - Terms of Service - Privacy Policy