Q: Perl script required (Answered, 5 out of 5 stars, 0 Comments)
Question  
Subject: Perl script required
Category: Computers > Programming
Asked by: mickr-ga
List Price: $20.00
Posted: 19 Jul 2004 02:16 PDT
Expires: 18 Aug 2004 02:16 PDT
Question ID: 376032
Hi,

I would like a Perl script that removes duplicate entities from a file.

The file will look like the following:

<many lines of text>
 Startpoint: a/b/c
  <one line of text>
 Endpoint: d/e/f/g
  <many lines of text>
 slack <plus some more text on the same line>
  <some blank lines>
 Startpoint: e/r/k
  <one line of text>
 Endpoint: d/e/f/g
  <many lines of text>
 slack <plus some more text on the same line>
  <some blank lines>
 Startpoint: a/b/c
  <one line of text>
 Endpoint: d/e/f/g
  <many lines of text>
 slack <plus some more text on the same line>
  <some blank lines>

I would like all the lines between Startpoint and slack to be printed
only for the first occurrence of each unique combination of Startpoint
and Endpoint.

For example, let's say we had:

a) Startpoint: a/b/c Endpoint: d/e/f/g - not seen before so print it
up to and including slack followed by a \n

b) Startpoint: e/r/k Endpoint: d/e/f/g - Startpoint not seen before so print it

c) Startpoint: e/r/k Endpoint: d/e/f/g/z - Endpoint not seen before so print it

d) Startpoint: a/b/c Endpoint: d/e/f/g/z - this combination of
Startpoint and Endpoint has not been seen before so print it (they
have both occurred, but not together)

e) Startpoint: e/r/k Endpoint: d/e/f/g/z - both Startpoint and
Endpoint seen before so do not print it.
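
The five cases above amount to remembering every Startpoint/Endpoint pair that has already been printed. A minimal sketch of that rule, assuming the pair is used as a hash key; the helper name is_first_occurrence and the %seen hash are illustrative only and do not come from the question:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns 1 the first time a Startpoint/Endpoint
# pair is offered, and 0 on every later occurrence of the same pair.
my %seen;
sub is_first_occurrence {
    my ($start, $end) = @_;
    # a newline cannot occur inside a single-line path, so it is a safe separator
    my $key = $start . "\n" . $end;
    return 0 if exists $seen{$key};
    $seen{$key} = 1;
    return 1;
}

# The five cases a) to e) from the question, in order
my @cases = (
    [ "a/b/c", "d/e/f/g"   ],
    [ "e/r/k", "d/e/f/g"   ],
    [ "e/r/k", "d/e/f/g/z" ],
    [ "a/b/c", "d/e/f/g/z" ],
    [ "e/r/k", "d/e/f/g/z" ],
);

# Decide for each case whether its block would be printed or skipped
my @decisions = map { is_first_occurrence(@$_) ? "print" : "skip" } @cases;
print join(" ", @decisions), "\n";
```

Only case e) is skipped, because its pair already occurred in case c).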

Please make the code user friendly and easy to understand with comments.
Answer  
Subject: Re: Perl script required
Answered By: palitoy-ga on 19 Jul 2004 05:10 PDT
Rated: 5 out of 5 stars
 
Hello Mickr

If I understand you correctly the following script should give you the
output you require.  I have tried to make the script as readable and
friendly as possible by not using too many shortcuts and commenting it
throughout.

At the beginning, the script reads in a text file called
"filename.txt".  This name should be altered to fit your needs,
depending on where your input file is held.

If you have any questions or queries regarding the script please ask
for clarification and I will do my best to help.

###BEGIN###

#!/usr/bin/perl
use strict;
use warnings;

# open and read the file containing the data into an array called "lines"
open (TXTFILE, "<", "filename.txt") or die "Cannot open filename.txt: $!";
my @lines = <TXTFILE>;
close(TXTFILE);

# variable to hold the information to print
my $print_this = "";

# variable to check whether we should be printing
my $printing = 0;

# a hash holding the start/end point pairs already seen, so that we do
# not duplicate things (a hash lookup stays fast however many pairs there are)
my %points = ();

# a variable to say whether we should output the information or not
my $should_output = 0;

# variables to hold the starting and ending points
my $start_point = "";
my $end_point = "";

# loop through the text file and process the data
foreach my $newline (@lines) {
  # if the line is a starting point then...
  if ($newline =~ m/^Startpoint/) {
    # remember what is found in the text file until $printing is changed
    $printing = 1;
    # the start point is
    $start_point = $newline;
  }

  # if the line is an ending point then...
  $end_point = $newline if ($newline =~ m/^Endpoint/);

  # if we have seen the start and end point then process them
  if ( ( $start_point ne "" ) && ( $end_point ne "" ) ) {
    # join the start and end points into a separate variable, so that
    # $start_point itself does not grow on every following line
    my $pair = $start_point . " " . $end_point;
    # if we have not seen this start/end point before...
    if ( ! exists $points{$pair} ) {
      # remember it by adding it to our points hash
      $points{$pair} = 1;
      # indicate that the information should be printed out
      $should_output = 1;
    }
  }

  # if the line begins with "slack" and the info should be printed...
  if ( ($newline =~ m/^slack/) && ($should_output == 1) ) {
    # print it out!
    print $print_this . $newline . "\n";
  }

  # if the line begins with "slack" then reset the variables
  if ( $newline =~ m/^slack/ ) {
    $printing = 0;
    $print_this = "";
    $should_output = 0;
    $start_point = "";
    $end_point = "";
  }

  # if the "slack" line has not been reached but we are printing
  if ( $printing == 1 ) {
    # add the new line to the data that could be printed out
    $print_this .= $newline;
  }

} # end loop through text file

# end the script
exit(0);

### END ###

Request for Answer Clarification by mickr-ga on 19 Jul 2004 07:11 PDT
Hi palitoy,

It works with some minor modifications.  Startpoint etc. are not at
the beginning of the line; there is a space before them, so no ^ is
needed.  There are also empty lines, so I have changed the startpoint
ne "" etc. to startpoint ne "dummys".

It appears to work but it is very, very slow.  I have 127 entities and
it has only processed 19 in 13 minutes.  The input file is 3.6M with
25k lines.
 
Any idea why it is so slow? Is it possible for me to send you the
input file to try it?

Regards,

Mick

Clarification of Answer by palitoy-ga on 19 Jul 2004 09:16 PDT
Hi Mick

Thanks for the clarification.  I will deal with your points one at a
time if I may...

1) If the startpoint always begins with a [SPACE]Startpoint then I
would suggest these changes:

  if ($newline =~ m/^Startpoint/) {

to:

  if ($newline =~ m/^\sStartpoint/) {

This would look for a line starting with a space followed by the
"Startpoint".  Is this standard throughout the file or do the number
of spaces change?  Is there ever anything else before the
"Startpoint"?

Similarly:

  $end_point = $newline if ($newline =~ m/^Endpoint/);

to:

  $end_point = $newline if ($newline =~ m/^\sEndpoint/);

And:

  if ( ($newline =~ m/^slack/) && ($should_output == 1) ) {

to:

  if ( ($newline =~ m/^\sslack/) && ($should_output == 1) ) {

And:

  if ( $newline =~ m/^slack/ ) {

to:

  if ( $newline =~ m/^\sslack/ ) {


2) Re: changing startpoint ne "dummys" - I am unclear why you have
done this... is it because there are empty lines between each example?

The [if ( ( $start_point ne "" ) && ( $end_point ne "" ) ) {] section
is used to determine whether the script has found the start and end
points of a section; this happens when these two variables are no
longer blank.  Unless the variables are also initialised and reset to
"dummys", comparing them against "dummys" would make the test true on
every line, even before a start or end point has been found.

If you changed this because of the empty lines between each example
the proper solution would be to change this:

  print $print_this . $newline . "\n";

to this:

  print $print_this . $newline ;

3) The speed of the script is dependent on a number of factors but I
would not have thought it would take as long as you are describing.  I
am guessing that the problem is caused by the changes described in
part 2) here.

4) Unfortunately we are not allowed to give out any personal details
for people to be able to contact us.  Any information given out would
be removed as soon as the Google Answers Editors saw the information. 
Some people who ask the questions put their email addresses in the
questions or clarifications but this is frowned upon by the Google
Answers editors...
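
Points 1) and 3) above can be illustrated together. The following is only a sketch under stated assumptions, not the script from the answer: it assumes each marker may be preceded by any amount of whitespace (hence \s* rather than \s), that the path after "Startpoint:" or "Endpoint:" contains no spaces, and that a hash is used to remember pairs. The helper name unique_blocks and the sample lines are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: takes the lines of a report and returns the text of
# every block whose Startpoint/Endpoint pair has not been seen before.
sub unique_blocks {
    my @lines = @_;
    my %seen;    # pairs already kept
    my @kept;    # blocks to output
    my $block;   # block being collected (undef while outside a block)
    my ($start, $end) = ("", "");

    foreach my $line (@lines) {
        # \s* allows any amount of leading whitespace, including none
        if ($line =~ m/^\s*Startpoint:\s*(\S+)/) {
            $block = "";
            ($start, $end) = ($1, "");
        }
        next unless defined $block;    # ignore text outside a block
        $block .= $line;
        if ($line =~ m/^\s*Endpoint:\s*(\S+)/) {
            $end = $1;
        }
        elsif ($line =~ m/^\s*slack\b/) {
            # end of block: keep it only the first time the pair occurs
            my $key = $start . "\n" . $end;
            push @kept, $block unless $seen{$key}++;
            undef $block;
        }
    }
    return @kept;
}

# Made-up sample: the third block repeats the pair of the first block
my @sample = (
    "header text\n",
    " Startpoint: a/b/c\n", "  one line\n", " Endpoint: d/e/f/g\n",
    "  many lines\n", " slack plus some more text\n", "\n",
    " Startpoint: e/r/k\n", "  one line\n", " Endpoint: d/e/f/g\n",
    "  many lines\n", " slack plus some more text\n", "\n",
    " Startpoint: a/b/c\n", "  one line\n", " Endpoint: d/e/f/g\n",
    "  many lines\n", " slack plus some more text\n",
);

my @blocks = unique_blocks(@sample);
print join("\n", @blocks);
```

Each line is examined once and each pair is looked up in a hash, so the running time grows roughly in proportion to the size of the file. For a large file the same loop could read from a file handle with while (my $line = <$fh>) instead of holding every line in an array.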

Clarification of Answer by palitoy-ga on 20 Jul 2004 01:34 PDT
Glad I could sort it out for you.  Thanks for the 5-star rating and
tip - they are both appreciated.

Request for Answer Clarification by mickr-ga on 23 Jul 2004 05:07 PDT
Hi Palitoy,

Thanks for all your help previously. I am stuck on Perl modules now,
new question posted if you are interested.

Thanks,

Mick

Clarification of Answer by palitoy-ga on 23 Jul 2004 05:35 PDT
Hello Mick

I am just looking at that question now for you... what operating
system are you running and which version of perl?

mickr-ga rated this answer: 5 out of 5 stars and gave an additional tip of: $10.00
Thanks, it worked; my debug didn't!

Comments  
There are no comments at this time.
