Q: PHP Code Needed ( Answered, 1 Comment )
Question  
Subject: PHP Code Needed
Category: Computers > Programming
Asked by: neutral-ga
List Price: $40.00
Posted: 21 Aug 2005 04:39 PDT
Expires: 20 Sep 2005 04:39 PDT
Question ID: 558300
BACKGROUND:

Technorati.com has a search system which tells the user how many
incoming links a site has.

When you enter an address in the following format in your browser's
address bar, Technorati tells you how many incoming links that site has:

The format is: http://technorati.com/search/www.SiteName.com

To see a working example, see: http://technorati.com/search/www.socioeconomics.com


WHAT I NEED:

I have about 400 sites, and I want to get the latest number of incoming links for each.

To achieve this goal, I need someone to write a PHP script which does
this automatically, to save me the trouble of checking them manually
every once in a while.

A script which can take good care of the following would do just fine:

1- Will extract the necessary information from Technorati either
continuously OR only when I run it. (This means the script does not
necessarily have to perform the same task each time the page loads;
that would take a lot of time. I can run it and see the results
myself, and update the page manually every week.)

2- Will not require a MySQL database to work, and will be self-sufficient.

3- Will order the results top-down.

Request for Question Clarification by palitoy-ga on 21 Aug 2005 04:59 PDT
Hello neutral-ga

I will be able to write this script for you but I am a little unsure
as to what you require in one section.

You state "the script does not necessarily have to perform the same
task each time the page loads", can you explain what you mean by this?

As I understand it you require a script that searches Technorati for
your 400 websites and then lists your sites in order of which ones are
the most popular (have the most links).  These results will be stored
in a text file rather than a MySQL database.

Does the script have to be written in PHP?  This kind of checking
might be easier done in Perl/CGI.

Please let me know what you mean by the above phrase, and whether a
Perl script would also be acceptable, so I can begin work on this.

Clarification of Question by neutral-ga on 21 Aug 2005 05:56 PDT
I believe you have two questions. Below are my answers for each:

1- As for clarification about what I meant with the phrase "the script
does not necessarily have to perform the same task each time the page
loads":

I will publish the results on a web page, and I thought it would not
be the best approach for the script to extract the latest results from
Technorati each time that page loads.

Therefore I thought maybe I should run the script myself offline
every, shall we say, 10 days, and manually publish the top 20 results
on a static page so that the page loads faster.

I wanted to let you know about this concern of mine beforehand - just
considering the possibility that it might affect the way you write the
script. (if it doesn't, simply ignore it!)


2- PHP would be great. But if you think that you can do it with Perl
in a better way, and that it will work with no problems, then I can
say yes to Perl too.

I hope these replies have answered your questions. Feel free to let me
know if you need any further details.
Answer  
Subject: Re: PHP Code Needed
Answered By: palitoy-ga on 21 Aug 2005 08:27 PDT
 
Hello neutral-ga

Thank you for your question.

Whilst I was working on this, vladimir-ga produced an answer that is
nearly complete but fails to sort the list once it has been obtained.
I will build on his solution to provide you with a complete answer.

First of all you will require a text file that contains each of your
website URLs, with each URL on a separate line.  This should be
placed in the same location as the Perl script we will write below.
As vladimir-ga stated, it should take the form:

www.site1.com
www.site2.com
www.site3.com
...

The ideal solution would make use of the Technorati API, but that
would require more time than a $40 question permits and would also
require you to register with the Technorati site for an API key.  You
did not mention in your question whether you had one of these, so my
solution, like vladimir-ga's, does not use the API method.

The API method may be the better solution in the long run, as it would
probably be quicker.  Submitting 400 URLs at once will take some
time... in my solution I have built in a 2-second delay before each
URL is submitted to Technorati.  Without this, their server would be
hit unfairly hard, which could degrade their service and result in
your IP address being banned in the future.  Out of courtesy, it is
always recommended not to "hit" a server too often or too quickly when
performing tasks such as these.
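
For reference only, should you register for an API key later, a
minimal sketch of the API approach might look like the one below.  I
must stress this is an untested sketch: the query format and the XML
element names are assumptions based on my recollection of Technorati's
developer documentation, so please verify them against that
documentation before relying on it.

#!/usr/bin/perl

use strict;
use warnings;
use LWP::Simple;

my $key  = 'YOUR_API_KEY';     # placeholder - your own Technorati key
my $site = 'www.example.com';  # placeholder URL

# fetch the "cosmos" XML for this URL (query format assumed)
my $xml = get "http://api.technorati.com/cosmos?key=$key&url=$site";
defined($xml) or die 'cannot fetch cosmos for '.$site;

# pull out the inbound links count (element name assumed)
if ( $xml =~ m#<inboundlinks>(\d+)</inboundlinks># ) {
        print "$1|$site\n";
}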

I was unsure whether you wanted the number of sites that link to your
URL, or what vladimir-ga gave you, which was the latest number of
links.  It is not too difficult to switch back to vladimir-ga's
solution if you prefer that.

My script will output a SORTED list with the most-linked site in first
position.  The output will be in the form number|url.  This can easily
be changed to another format; just let me know what you require and I
can alter it for you.
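
For example, to produce lines like "132 links | www.site1.com" instead
of "132|www.site1.com", the only change needed is to the push line in
the script below:

push @urls, "$1 links | $site";

(If you do change the separator, remember that anything which later
parses the number back out of these lines must be adjusted to match.)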

Finally to run the script on a Windows machine you should use
something like this at the command prompt:

perl nameofthescript.pl nameofthetextfilecontainingtheurls.txt

If you have any further questions or queries on this subject please do
not hesitate to ask and I will do my best to respond swiftly.

Finally the finished script (with comments and sorting):

#!/usr/bin/perl

# set up required modules and stuff
use strict;
use warnings;
use LWP::Simple;

# if a parameter is not passed to the script then stop
die "parameter missing" if @ARGV != 1;

# read the urls to check
open (URLS, $ARGV[0]) or die "cannot open ".$ARGV[0];

# set up some holding variables
my $site;
my @urls;

# loop through the urls and process them
while ($site = <URLS>) {
        # remove line feeds
        chomp($site);
        # get the search page
        my $page = get 'http://technorati.com/search/'.$site;
        # check something is found
        defined($page) or die 'cannot fetch '.$site;
        # match the number of sites linking to your url
        $page =~ m/\<strong\>(.*) sites link/ or next;
        # add it to the list in the format:
        # number of links | url
        push @urls, "$1|$site";
        # wait for 2 seconds before querying Technorati again
        sleep(2);
}

# sort the urls
@urls = sort(@urls);
# print out the sorted list
foreach my $link ( @urls ) { print $link."\n"; };

# quit the script
exit(0);

Request for Answer Clarification by neutral-ga on 21 Aug 2005 09:20 PDT
All I needed was the sorted list, as in:

132 links | www.sitenumber1.com
127 links | www.sitenumber2.com
117 links | www.sitenumber3.com
...and so on.

So it looks like you got me right.

However, I do not know anything about command lines or how to
run/execute this script.

I have a site.txt file with a list of the URLs.
I also have the technorati.pl file with your script in it.

I do not know what to do next.

Please clarify.

Thanks.

Clarification of Answer by palitoy-ga on 21 Aug 2005 09:58 PDT
Can you please let me know what operating system you will be running the script on?

You should be able to run the script on any operating system but you
will probably need to install ActivePerl if you intend to run the
script on a Windows machine.  This is free software and can be
downloaded from here:
http://www.activestate.com/Products/ActivePerl/?mp=1

If you are running Windows and do not have ActivePerl installed (you
probably will not, if you have never done any Perl programming
before), please start downloading it while you are waiting for my
response to your answer about which operating system you are running
the script on.

Request for Answer Clarification by neutral-ga on 21 Aug 2005 10:29 PDT
My server is a Linux machine which can run Perl scripts.

On my computer, I have Windows.

Clarification of Answer by palitoy-ga on 21 Aug 2005 11:22 PDT
If you do not have a preference I would find it easier to talk you
through a Windows solution as this is the operating system I am
working on at the present moment.

Here are the steps you require:

1) Double-click on My Computer and then C: (your main hard disk).
2) Create a folder in C: with a name of your choice (I will call it urlupdate)
3) Copy the Perl script and the text file containing your URLs to
C:\urlupdate (the folder you made in step 2 above).
4) Go to the ActivePerl website and download the free software:
http://www.activestate.com/Products/ActivePerl/?mp=1
5) Install the software by clicking on the program you download and
follow the instructions.
6) This will install Perl on your system.
7) Go to Start->All Programs->Accessories and choose Command Prompt
(alternatively Start->Run and type "command" in the window that
appears).
8) A new mainly black window should appear.  This is the command prompt, type:
cd C:\urlupdate
9) You are now using DOS commands, and this has moved you to the
urlupdate folder you made in step 2.
10) Now type:
perl nameofperlscript.pl nameoftextfile.txt >nameofyourchoicefortheoutput.txt
11) This should start running the script (I would initially ensure you
only have a few URLs in the text file, just to make sure it is
working!).  It will take some time before the process completes (it
took about 20 seconds on my PC to do 3 URLs earlier when I was
writing the program).  Completion is indicated by the prompt
returning, so that you can type something else into the command prompt.

I know this must seem quite daunting but I am here to help you through
each step.  If you get any error messages, please ask for
clarification stating the step you got to and the error message, and I
will do my best to respond swiftly (I should be here for another 2
hours today and all day tomorrow).

Clarification of Answer by palitoy-ga on 21 Aug 2005 11:32 PDT
FYI: If you are confused as to which ActivePerl program to download,
the simplest installation package for ActivePerl is the one located
here:
http://downloads.activestate.com/ActivePerl/Windows/5.6/ActivePerl-5.6.1.638-MSWin32-x86.msi

Request for Answer Clarification by neutral-ga on 21 Aug 2005 13:04 PDT
9 down 2 to go!

I was at the 10th step.

I installed perl under c:\perl\perl

so I typed: 

c:\perl\perl\perl technorati.pl site2.txt >output.txt

However, it returned the following error:

"perl is not recognized as an internal or external command, operable
program or batch file."

Clarification of Answer by palitoy-ga on 21 Aug 2005 13:25 PDT
Good work!

Step 10 should probably be this:

c:\perl\perl\bin\perl.exe technorati.pl site2.txt >output.txt

You need to locate where the perl.exe file is in your perl
installation.  As you have installed Perl at C:\perl\perl it should be
in the bin folder mentioned above.

You may wish to add Perl to your system path environment variable;
there are easy-to-follow instructions here:
http://www.peacefire.org/circumventor/adding-perl-to-path-variable.html

Again you will need to alter c:\perl to c:\perl\perl

This means in future you would only need to type:

perl technorati.pl site2.txt >output.txt
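
Alternatively, if you only want the shortcut for your current Command
Prompt session rather than permanently, you can extend the path by
hand each time you open the window:

set PATH=%PATH%;c:\perl\perl\bin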

Let me know how you get on with this...

Request for Answer Clarification by neutral-ga on 21 Aug 2005 14:40 PDT
I don't know what a 'system path environment variable' is, so I discarded that. :)

I did a trial with only 4 sites, as you recommended - so it would not
take too long.

It worked, but returned only one result.

Does that mean that the other 3 had the value 0?

Clarification of Answer by palitoy-ga on 21 Aug 2005 15:12 PDT
Great! At least we know you can now run perl scripts and everything is
working fine there :)

You are correct in your assumption that the other 3 had no links; this
is because of this line in the script:
$page =~ m/\<strong\>(.*) sites link/ or next;

This line says: check the contents of the Technorati page and search
for a certain pattern.  If the pattern is matched, continue with the
rest of the while loop; otherwise, go on to the next URL.
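
To illustrate (the exact Technorati markup here is inferred from the
pattern rather than copied from their page), the line matches page
text of roughly this shape:

my $page = '<strong>1,234 sites link to www.example.com</strong>';
$page =~ m/\<strong\>(.*) sites link/ and print "$1\n";   # prints "1,234"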

If you wish to double-check the script, I would recommend using URLs
that you know will bring up a result (I used www.google.com,
www.ebay.com and www.yahoo.com).

If you wish me to alter the script slightly so that it outputs 0|url
when no sites are linking, please let me know.

[This will be my last opportunity to respond to any clarifications
tonight.  I will respond to any further ones you have in the morning.]

Request for Answer Clarification by neutral-ga on 22 Aug 2005 01:20 PDT
1- Yes, I would like those with no links to be listed with 0, because
the Technorati site usually has a high volume of requests, and I want
to know whether a site really has 0 links OR Technorati was down and
the data could not be retrieved. (By the way, the script doesn't show
0 links when it cannot retrieve data, or does it?)

2- If you can add the sorting function, it looks like we will be done.

3- Is it possible for you to write a similar PHP script too? I can tip $20 if it is.

Clarification of Answer by palitoy-ga on 22 Aug 2005 01:32 PDT
1- Yes, I would like those with no links to be listed with 0.
No problem, I will add this to the script for you and post the new
script here once I have completed this addition.

2- If you can add the sorting function, it looks like we will be done.
This is already included :)  Or was there a problem when you were
testing it?  It appeared to work correctly when I tested it...

3- Is it possible for you to write a similar PHP script too?
I will work on this for you this morning and should hopefully have a
working solution in a few hours (as it will take this amount of time
to write and test).

Clarification of Answer by palitoy-ga on 22 Aug 2005 02:09 PDT
Just a quick update.  Part 1 is completed, but I have noticed an error
in the sorting method I have used, so I will be devoting my time to
fixing this.

The PHP rewrite will therefore involve more time than your offer of
the $20 tip would allow.  I had already estimated this would take a
couple of hours for the $20 :(  I hope you understand.

Clarification of Answer by palitoy-ga on 22 Aug 2005 03:14 PDT
Please replace the old perl script with this one.

#!/usr/bin/perl

# set up required modules and stuff
use strict;
use warnings;
use LWP::Simple;

# if a parameter is not passed to the script then stop
die "parameter missing" if @ARGV != 1;

# read the urls to check
open (URLS, $ARGV[0]) or die "cannot open ".$ARGV[0];

# set up some holding variables
my $site;
my @urls;

# loop through the urls and process them
while ($site = <URLS>) {
        # remove line feeds
        chomp($site);
        # get the search page
        my $page = get 'http://technorati.com/search/'.$site;
        # check something is found
        defined($page) or die 'cannot fetch '.$site;
        # match the number of sites linking to your url
        if ( $page =~ m/\<strong\>(.*) sites link/ ) {
          # add it to the list in the format:
          # number of links | url
          my $links = $1;
          $links =~ s/\,//g;
          push @urls, "$links|$site";
        }
        else { push @urls, "0|$site"; };
        # wait for 2 seconds before querying Technorati again
        sleep(2);
}

# sort the urls
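# (this sorts numerically on the link count, highest first, and breaks
# ties with a case-insensitive alphabetical comparison)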
@urls = sort {
        ($b =~ /(\d+)\|/)[0] <=> ($a =~ /(\d+)\|/)[0]
                            ||
                    uc($a)  cmp  uc($b)
    } @urls;
# print out the sorted list
foreach my $link ( @urls ) { print $link."\n"; };

Request for Answer Clarification by neutral-ga on 22 Aug 2005 03:39 PDT
I will also need a PHP code which lists the Google Page Ranks of the listed URLs.

Can you do both for a tip of $40?

Clarification of Answer by palitoy-ga on 22 Aug 2005 03:49 PDT
Can you please post the page rank question as a separate Google
Answers question as it is significantly different to the original
question asked?  I would have to investigate this further as
Technorati does not seem to provide this information.

If you specifically want me to answer this question also you can put
"For palitoy-ga" as the question title.

Clarification of Answer by palitoy-ga on 23 Aug 2005 03:36 PDT
Please let me know if the final Perl script I provided you with is
working as per your needs.  If it is I will archive this script in my
records.

Request for Answer Clarification by neutral-ga on 23 Aug 2005 18:10 PDT
Yes, the script works fine.

The only problem can be that the Technorati site is a rather busy one,
and when their server is busy, the script marks the URL in question
'0' instead of n/a.

Is there a way to overcome this problem?

Clarification of Answer by palitoy-ga on 24 Aug 2005 00:19 PDT
Unfortunately there is not a way to overcome this that I can think
of.  The script relies on being able to query the Technorati site, and
if the site is busy it will not be able to get the result.

Request for Answer Clarification by neutral-ga on 24 Aug 2005 14:12 PDT
This new script has a problem:

If it cannot fetch one URL, it stops querying the rest and leaves the
user with an empty output file - regardless of when the problem
occurred.

Here is the error message:

cannot fetch name.blogspot.com at technorati.pl line 26, <URLS> line 3.

I hope it is something simple, because it is almost impossible not to
have a fetch problem when I have almost 400 URLs in the list.

Clarification of Answer by palitoy-ga on 25 Aug 2005 00:12 PDT
You need to find this line in the code:

defined($page) or die 'cannot fetch '.$site;

You can either change it to:

# defined($page) or die 'cannot fetch '.$site;

or:

defined($page) or warn 'cannot fetch '.$site;

The first option will stop the script from checking whether the site
could be fetched at all; the second option will warn you that the site
could not be fetched, but will carry on.
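
One caveat with the warn option: the script will then carry on with an
undefined $page, so the pattern match on the following line will raise
its own "uninitialized value" warnings.  A variation that warns and
then skips straight to the next URL avoids this:

if ( !defined($page) ) { warn 'cannot fetch '.$site; next; }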

Request for Answer Clarification by neutral-ga on 03 Sep 2005 10:33 PDT
I am still getting errors.

Could you please try the script yourself first, and send me a final version?

Thanks!

Clarification of Answer by palitoy-ga on 03 Sep 2005 10:40 PDT
What errors are you getting?  Did you make the alterations I
suggested?  I have tried the script and am not coming up with any
errors.  Can you send me a list of the sites that are producing the
errors?

Request for Answer Clarification by neutral-ga on 04 Sep 2005 03:21 PDT
When I change 'die' to 'warn' - as you advised - the error messages are as follows:

cannot fetch sitename1.com at technorati.pl line 25, <URLS> line 1.
Use of uninitialized value in pattern match (m//) at technorati.pl
line 27, <URLS> line 1.
cannot fetch sitename2.com at technorati.pl line 25, <URLS> line 1.
Use of uninitialized value in pattern match (m//) at technorati.pl
line 27, <URLS> line 1.
...


(Now, if this is how it should be when the script cannot retrieve the
information from the Technorati site, then fine, we have no problems
at all. But the error message implies that it is something else. Am I
wrong?)

Clarification of Answer by palitoy-ga on 04 Sep 2005 03:41 PDT
The script is correct; what it is telling you is that there was a
problem matching the number on the Technorati page that indicates the
number of links.  It is just being a bit more verbose than normal and
reporting its exact error.  I have added an extra check to ensure that
data is received from the Technorati site, but I have been unable to
verify this fully as I do not have a long list of URLs.  The one I am
using is only 15 long and it always seems to get the data correctly
when I run it.

The following script alters things so that the "Use of uninitialized
value in pattern match" warning is not displayed and only an error
stating "cannot fetch xyz.com" is printed.

#!/usr/bin/perl

# set up required modules and stuff
use strict;
use warnings;
use LWP::Simple;

# if a parameter is not passed to the script then stop
die "parameter missing" if @ARGV != 1;

# read the urls to check
open (URLS, $ARGV[0]) or die "cannot open ".$ARGV[0];

# set up some holding variables
my $site;
my @urls;

# loop through the urls and process them
while ($site = <URLS>) {
        # remove line feeds
        chomp($site);
        # get the search page
        my $page = get 'http://technorati.com/search/'.$site;
        # check something is found
        if ( defined($page) ) {
          # match the number of sites linking to your url
          if ( $page =~ m/\<strong\>(.*) sites link/ ) {
            # add it to the list in the format:
            # number of links | url
            my $links = $1;
            $links =~ s/\,//g;
            push @urls, "$links|$site";
          }
          else { push @urls, "0|$site"; };
        }
        else { print 'cannot fetch '.$site."\n"; };
        # wait for 2 seconds before querying Technorati again
        sleep(2);
}

# sort the urls
@urls = sort {
        ($b =~ /(\d+)\|/)[0] <=> ($a =~ /(\d+)\|/)[0]
                            ||
                    uc($a)  cmp  uc($b)
    } @urls;
# print out the sorted list
foreach my $link ( @urls ) { print $link."\n"; };
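
One side note: the "cannot fetch" messages above are printed to
standard output, so they will appear in your output file alongside the
results.  If you would rather keep them out of that file, a one-line
variation is to send them to standard error instead:

        else { print STDERR 'cannot fetch '.$site."\n"; };

Then a run such as "perl technorati.pl sites.txt >output.txt" leaves
the error messages on screen and only the results in the file.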

Request for Answer Clarification by neutral-ga on 04 Sep 2005 11:43 PDT
OK. I am glad that it was not as big of a problem as I thought.

Clarification of Answer by palitoy-ga on 04 Sep 2005 11:46 PDT
Hopefully this version of the script will produce the results you
require.  Let me know if you need anything else.
Comments  
Subject: Re: PHP Code Needed
From: vladimir-ga on 21 Aug 2005 06:35 PDT
 
Consider this simple script:

#!/usr/bin/perl

use strict;
use warnings;
use LWP::Simple;

die "parameter missing" if @ARGV != 1;

open (URLS, $ARGV[0]) or die "cannot open ".$ARGV[0];

print "<html>\n";
print "<table border>\n";

my $site;
while ($site = <URLS>) {
        chomp($site);
        my $page = get 'http://technorati.com/search/'.$site;
        defined($page) or die 'cannot fetch '.$site;
        $page =~ m#<h2><em>(\d+) Posts</em> linking to# or next;
        print "<tr><td>$site</td><td>$1</td></tr>\n";
}

print "</table>\n";
print "</html>\n";


You run it with a single parameter: the name of a text file that
lists the sites you want to check (one address per line). The file
could look like this:

www.site1.com
www.site2.com
www.site3.com
...

The script fetches the information from Technorati and prints simple
HTML on standard output. You could use it like so (assuming you saved
the script in a file called technolinks.pl, the list of sites is in a
file called sites.txt, and you're on some kind of Linux/Unix):

./technolinks.pl sites.txt > output.html

You get your report in the file output.html, ready to be served via a
web server. (Of course, it could use some nicer formatting.) There is
no dynamic script running every time someone wants to view the report;
you update the report manually (or automatically via cron etc.) by
running the Perl script.
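
For example, a crontab entry along these lines (the paths are
placeholders for your own) would rebuild the report at 3am on the 1st,
11th and 21st of each month, i.e. roughly every 10 days:

0 3 1,11,21 * * /path/to/technolinks.pl /path/to/sites.txt > /path/to/output.html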
