Subject: Removing duplicate records from csv in Perl
Category: Computers
Asked by: corn_jan-ga
List Price: $15.00
Posted: 13 Jun 2002 06:55 PDT
Expires: 20 Jun 2002 06:55 PDT
Question ID: 25196
Hello,

We have a web server running Perl 5.6 and Linux. A script generates a csv file (a flat-file database) which is then read by a website. However, this generated csv file often contains duplicate records. How can we automatically remove duplicate records?

The solution needs to be able to run from a Telnet command prompt. The script or program needs to generate a _new_ csv file not containing the duplicate records. I think this problem is quite simple, so the answer needs to be complete and needs to fully work. An alternative automatic solution is also possible, as long as it runs on our server and can be configured to run at set times.

Thanks,
Corne
Subject: Re: Removing duplicate records from csv in Perl
Answered By: iaint-ga on 13 Jun 2002 07:17 PDT
Hi

Here's a solution that does what you require, with a side-effect of sorting the data into (pseudo-)alphabetical order. For most database applications this shouldn't be a problem; however, if the ordering of the data is critical and needs to be preserved as in the original file, please let me know and I'll work on a slightly more complex version.

To use: simply cut and paste the following code into a text file (eg dedupe.pl). It can then be run from the UNIX command line with:

  perl dedupe.pl

Or alternatively, if you make it executable by doing

  chmod u+x dedupe.pl

it can be run simply by typing

  ./dedupe.pl

There are three variables you will need to modify to make this work on your specific data. They are commented in the code, but in brief you will need to change:

$csvfile: this is your existing CSV file
$newfile: this is the name of the deduped file the program will write out
$fieldnames: if the first line of your CSV file contains field names (rather than data), leave this set at 1. If your CSV file is just pure data, set this to 0.

Please don't hesitate to post a clarification request if you require anything further. Thank you!

--- code begins below ---

#!/usr/bin/perl -w
# I place the following code 100% in the public domain.

use strict;

# Set to filename of CSV file
my $csvfile = 'test.csv';

# Set to filename of de-duped file (new file)
my $newfile = 'new.csv';

# Set to 1 if first line of CSV file contains field names, 0 otherwise
my $fieldnames = 1;

### Shouldn't need to change stuff below here ###

open (IN, "<$csvfile") or die "Couldn't open input CSV file: $!";
open (OUT, ">$newfile") or die "Couldn't open output file: $!";

# Read header line if it exists
my $header;
$header = <IN> if $fieldnames;

# Slurp in & sort everything else
my @data = sort <IN>;

# If we read in a header line, write it back out again
print OUT $header if defined $header;

my $n = 0;

# Now go through the data line by line, writing each line to output
# unless it's identical to the previous line (in which case it's a dupe)
my $lastline = '';
foreach my $currentline (@data) {
    next if $currentline eq $lastline;
    print OUT $currentline;
    $lastline = $currentline;
    $n++;
}

close IN;
close OUT;

print "Processing complete. In = " . scalar @data . " records, Out = $n records\n";
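One note on the question's requirement that the job "can be configured to run at set times": on Linux this is normally handled by cron rather than by the script itself. A minimal sketch, assuming the script above is saved as /home/corne/dedupe.pl (a hypothetical path); add the entry with crontab -e:

# hypothetical crontab entry: run the de-dupe script every night at 02:30
# fields are: minute hour day-of-month month day-of-week command
30 2 * * * /usr/bin/perl /home/corne/dedupe.pl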
Subject: Re: Removing duplicate records from csv in Perl
From: xabi-ga on 13 Jun 2002 07:22 PDT
cat original_csv.txt | sort -u > new_csv.txt

That's all.

xabi

PS: More help, just ask.
Subject: Re: Removing duplicate records from csv in Perl
From: xabi-ga on 13 Jun 2002 07:30 PDT
How big is the original file?

  foo,12345,bar
  foo,56789,more trash

Are records like those considered equal, or are two records equal only if they match exactly?

A non-Perl solution:

% cat original_csv.txt | sort -u > new_csv.txt

Note that the output is a sorted file. If you only need to compare a particular field, do this:

% cat original_csv.txt | sort -u -t, -k2,2

where 2,2 means "field 2".

xabi
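If per-field uniqueness like that is needed but without the sorting side-effect, a hash-based Perl one-liner along these lines should also work (a sketch: the file names are placeholders, -F',' tells autosplit to break each line on commas, and $F[1] is the second field because @F is numbered from 0):

perl -F',' -ane 'print unless $seen{$F[1]}++' original_csv.txt > new_csv.txt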
Subject: Re: Removing duplicate records from csv in Perl
From: xabi-ga on 13 Jun 2002 07:36 PDT
And here goes the Perl version, if you need the output file unsorted (the same order as the original, but without duplicates):

#!/usr/bin/perl -w

use strict;

my $origfile = 'xabi.txt';
my $outfile  = 'xabi2.txt';
my %hTmp;

open (IN, "<$origfile") or die "Couldn't open input file: $!";
open (OUT, ">$outfile") or die "Couldn't open output file: $!";

# Print each line only the first time it is seen;
# %hTmp remembers every line encountered so far
while (my $sLine = <IN>) {
    print OUT $sLine unless ($hTmp{$sLine}++);
}

close OUT;
close IN;
Subject: Re: Removing duplicate records from csv in Perl
From: dwightrau-ga on 13 Jun 2002 08:43 PDT
Isn't it simpler to just do

uniq file > newfile

Seems too easy?
Subject: Re: Removing duplicate records from csv in Perl
From: xabi-ga on 13 Jun 2002 09:40 PDT
dwightrau-ga: uniq file > newfile won't do it on its own, because the file must be sorted before uniq can remove the duplicates. You can do it either of these ways:

cat original_csv.txt | sort -u > new_csv.txt

or

sort -u original_csv.txt > new_csv.txt

xabi
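The reason is that uniq only collapses consecutive identical lines, so duplicates that aren't adjacent survive. A hypothetical session (file name made up) where the uniq output still contains both copies of foo,1:

% cat file.csv
foo,1
bar,2
foo,1
% uniq file.csv
foo,1
bar,2
foo,1
% sort -u file.csv
bar,2
foo,1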
Subject: Re: Removing duplicate records from csv in Perl
From: billmichaelson-ga on 14 Jun 2002 06:40 PDT
This satisfies all your requirements. Run from the shell command line as a filter...

perl -e \
'while ($l = <STDIN>) { $seen{$l} += 1; next if $seen{$l} > 1; print $l; }'
Subject: Re: Removing duplicate records from csv in Perl
From: billmichaelson-ga on 14 Jun 2002 07:22 PDT
Hmm. My code was clunky. This is more concise, saves typing...

perl -e 'while (<>) { print $_ unless $s{$_}++; }' <infile >outfile
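It can be trimmed a little further still: Perl's -n switch supplies that while (<>) { ... } loop itself, so the same filter should behave identically as:

perl -ne 'print unless $s{$_}++' <infile >outfile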
Subject: Re: Removing duplicate records from csv in Perl
From: gautamg-ga on 18 Jun 2002 03:59 PDT
Hi there,

Assuming you're still looking for a better answer, I'm pasting the script below. This script assumes you have Perl installed at /usr/bin/perl; change the first line if perl resides somewhere else. The rest of the script should need no change. Script usage is explained within the comments.

Sample input:

Header
1,2,3
2,1,3
1,2,3
2,1,3
3,2,1
2,3,1

Sample output:

Header
1,2,3
2,1,3
3,2,1
2,3,1

Let me know if you need any fixes.

#!/usr/bin/perl -w
# Author: Gautam Gopalakrishnan.
# Feel free to reuse.
#
# Usage: dupe.pl filename.csv [file2.csv] [file3.csv] ...
# Of course, this script does not care if the input file is a csv.
# Unique lines are kept in the order they first appear.
# WARNING: Original file is overwritten.

use strict;

my (@final, %hash, $file);

if (not defined $ARGV[0]) {
    print "Usage: dupe.pl filename.csv [file2.csv] [file3.csv] ...\n";
    exit -1;
}

foreach $file (@ARGV) {
    # Open read-write so the same file can be truncated and rewritten
    if (!open FILE, "+<$file") {
        print "Unable to open input csv file for read-write, '$file': $!\n";
        next;
    }
    # Remember each line the first time it appears
    while (<FILE>) {
        if (not exists $hash{$_}) {
            push @final, $_;
            $hash{$_} = 1;
        }
    }
    # Rewind and overwrite the file with only the unique lines
    truncate FILE, 0;
    seek FILE, 0, 0;
    print FILE @final;
    close FILE;
    # Reset state before processing the next file
    %hash = @final = ();
}
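Since this script rewrites each file in place (per the WARNING above), it may be worth keeping a copy until the output has been verified; for example, with a hypothetical file name:

% cp data.csv data.csv.bak
% ./dupe.pl data.csv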