Q: PHP Code Needed ( Answered, 1 Comment )
Question  
Subject: PHP Code Needed
Category: Computers > Programming
Asked by: neutral-ga
List Price: $40.00
Posted: 21 Aug 2005 04:39 PDT
Expires: 20 Sep 2005 04:39 PDT
Question ID: 558300
BACKGROUND:

Technorati.com has a search system which tells the user how many
incoming links a site has.

When you enter an address in the following format in your browser's
address bar, Technorati tells you how many incoming links that site has:

The format is: http://technorati.com/search/www.SiteName.com

To see a working example, see: http://technorati.com/search/www.socioeconomics.com


WHAT I NEED:

I have about 400 sites, and I want to get the latest number of incoming links for each.

To achieve this goal, I need someone to write a PHP script which does
this automatically, to save me the trouble of checking them manually
every once in a while.

A script which can take good care of the following would do just fine:

1- Will extract the necessary information from Technorati either
continuously OR only when I run it. (This means the script does not
necessarily have to perform the same task each time the page loads;
that would take a lot of time. I can run it and see the results
myself, and update the page manually every week.)

2- Will not require a MySQL database to work, and will be self-sufficient.

3- Will order the results top-down.

Request for Question Clarification by palitoy-ga on 21 Aug 2005 04:59 PDT
Hello neutral-ga

I will be able to write this script for you but I am a little unsure
as to what you require in one section.

You state "the script does not necessarily have to perform the same
task each time the page loads", can you explain what you mean by this?

As I understand it you require a script that searches Technorati for
your 400 websites and then lists your sites in order of which ones are
the most popular (have the most links).  These results will be stored
in a text file rather than a MySQL database.

Does the script have to be written in PHP?  This kind of checking
might be easier done in Perl/CGI.

Please let me know what you mean by the above phrase, and whether a
Perl script would also be acceptable, so I can begin work on this.

Clarification of Question by neutral-ga on 21 Aug 2005 05:56 PDT
I believe you have two questions. Below are my answers for each:

1- As for clarification about what I meant with the phrase "the script
does not necessarily have to perform the same task each time the page
loads":

I will publish the results on a web page, and I thought it would not
be the best approach for the script to extract the latest results from
Technorati each time that page loads.

Therefore I thought maybe I should run the script myself offline
every, shall we say, 10 days, and manually publish the top 20 results
on a static page so that the page loads faster.

I wanted to let you know about this concern of mine beforehand - just
considering the possibility that it might affect the way you write the
script. (if it doesn't, simply ignore it!)


2- PHP would be great. But if you think that you can do it with Perl
in a better way, and that it will work with no problems, then I can
say yes to Perl too.

I hope these replies have answered your questions. Feel free to let me
know if you need any further details.
Answer  
Subject: Re: PHP Code Needed
Answered By: palitoy-ga on 21 Aug 2005 08:27 PDT
 
Hello neutral-ga

Thank you for your question.

Whilst I was working on this, vladimir-ga produced an answer that is
nearly complete but fails to sort the list once it has been obtained.
I will build on his solution to provide you with a complete answer.

First of all you will require a text file that contains each of your
website URLs, with each URL on a separate line.  This should be
placed in the same location as the Perl script we will write below.
As vladimir-ga stated, it should take the form:

www.site1.com
www.site2.com
www.site3.com
...

The ideal solution would make use of the Technorati API, but that
would require more time than a $40 question permits and would also
require you to register with the Technorati site for an API key.  You
did not mention in your question whether you had one of these, so my
solution, like vladimir-ga's, does not use the API method.

The API method may be the better solution in the long run, as it would
probably be quicker.  Submitting 400 URLs at once will take some
time... in my solution I have built in a 2-second delay before each
URL is submitted to Technorati.  Without this, their server would be
hit unfairly hard, which could degrade their service and result in
your IP address being banned in the future.  Out of courtesy, it is
always recommended not to "hit" a server too often or too quickly when
performing tasks such as these.
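
For reference only, should you register for an API key later, a
minimal sketch of the API approach might look like the one below.  I
must stress this is an untested sketch: the query format and the XML
element names are assumptions based on my recollection of Technorati's
developer documentation, so please verify them against that
documentation before relying on it.

#!/usr/bin/perl

use strict;
use warnings;
use LWP::Simple;

my $key  = 'YOUR_API_KEY';     # placeholder - your own Technorati key
my $site = 'www.example.com';  # placeholder URL

# fetch the "cosmos" XML for this URL (query format assumed)
my $xml = get "http://api.technorati.com/cosmos?key=$key&url=$site";
defined($xml) or die 'cannot fetch cosmos for '.$site;

# pull out the inbound links count (element name assumed)
if ( $xml =~ m#<inboundlinks>(\d+)</inboundlinks># ) {
        print "$1|$site\n";
}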

I was unsure whether you wanted the number of sites that link to your
URL, or what vladimir-ga gave you, which was the latest number of
links.  It is not too difficult to switch back to vladimir-ga's
solution if you prefer that.

My script will output a SORTED list with the most-linked site in first
position.  The output will be in the form number|url.  This can easily
be changed to another format; just let me know what you require and I
can alter it for you.
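
For example, to produce lines like "132 links | www.site1.com" instead
of "132|www.site1.com", the only change needed is to the push line in
the script below:

push @urls, "$1 links | $site";

(If you do change the separator, remember that anything which later
parses the number back out of these lines must be adjusted to match.)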

Finally to run the script on a Windows machine you should use
something like this at the command prompt:

perl nameofthescript.pl nameofthetextfilecontainingtheurls.txt

If you have any further questions or queries on this subject please do
not hesitate to ask and I will do my best to respond swiftly.

Finally the finished script (with comments and sorting):

#!/usr/bin/perl

# set up required modules and stuff
use strict;
use warnings;
use LWP::Simple;

# if a parameter is not passed to the script then stop
die "parameter missing" if @ARGV != 1;

# read the urls to check
open (URLS, $ARGV[0]) or die "cannot open ".$ARGV[0];

# set up some holding variables
my $site;
my @urls;

# loop through the urls and process them
while ($site = <URLS>) {
        # remove line feeds
        chomp($site);
        # get the search page
        my $page = get 'http://technorati.com/search/'.$site;
        # check something is found
        defined($page) or die 'cannot fetch '.$site;
        # match the number of sites linking to your url
        $page =~ m/\<strong\>(.*) sites link/ or next;
        # add it to the list in the format:
        # number of links | url
        push @urls, "$1|$site";
        # wait for 2 seconds before querying Technorati again
        sleep(2);
}

# sort the urls
@urls = sort(@urls);
# print out the sorted list
foreach my $link ( @urls ) { print $link."\n"; };

# quit the script
exit(0);

Request for Answer Clarification by neutral-ga on 21 Aug 2005 09:20 PDT
All I needed was the sorted list, as in:

132 links | www.sitenumber1.com
127 links | www.sitenumber2.com
117 links | www.sitenumber3.com
...and so on.

So it looks like you got me right.

However, I do not know anything about command lines or how to
run/execute this script.

I have a site.txt file with a list of the URLs.
I also have the technorati.pl file with your script in it.

I do not know what to do next.

Please clarify.

Thanks.

Clarification of Answer by palitoy-ga on 21 Aug 2005 09:58 PDT
Can you please let me know what operating system you will be running the script on?

You should be able to run the script on any operating system but you
will probably need to install ActivePerl if you intend to run the
script on a Windows machine.  This is free software and can be
downloaded from here:
http://www.activestate.com/Products/ActivePerl/?mp=1

If you are running Windows and do not have ActivePerl installed (you
probably will not, if you have never done any Perl programming
before), please start downloading it while you are waiting for my
response to your answer about which operating system you are running
the script on.

Request for Answer Clarification by neutral-ga on 21 Aug 2005 10:29 PDT
My server is a Linux machine which can run Perl scripts.

On my computer, I have Windows.

Clarification of Answer by palitoy-ga on 21 Aug 2005 11:22 PDT
If you do not have a preference I would find it easier to talk you
through a Windows solution as this is the operating system I am
working on at the present moment.

Here are the steps you require:

1) Double-click on My Computer and then C: (your main hard disk).
2) Create a folder in C: with a name of your choice (I will call it urlupdate)
3) Copy the Perl script and the text file containing your URLs to
C:\urlupdate (the folder you made in step 2 above).
4) Go to the ActivePerl website and download the free software:
http://www.activestate.com/Products/ActivePerl/?mp=1
5) Install the software by clicking on the program you download and
follow the instructions.
6) This will install Perl on your system.
7) Go to Start->All Programs->Accessories and choose Command Prompt
(alternatively Start->Run and type "command" in the window that
appears).
8) A new mainly black window should appear.  This is the command prompt, type:
cd C:\urlupdate
9) You are now using DOS commands, and this has moved you to the
urlupdate folder you made in step 2.
10) Now type:
perl nameofperlscript.pl nameoftextfile.txt >nameofyourchoicefortheoutput.txt
11) This should start running the script (I would initially ensure you
only have a few URLs in the text file, just to make sure it is
working!).  It will take some time before the process completes (it
took about 20 seconds on my PC to do 3 URLs earlier when I was
writing the program).  Completion is indicated by the prompt
returning, so that you can type something else into the command prompt.

I know this must seem quite daunting but I am here to help you through
each step.  If you get any error messages, please ask for
clarification stating the step you got to and the error message, and I
will do my best to respond swiftly (I should be here for another 2
hours today and all day tomorrow).

Clarification of Answer by palitoy-ga on 21 Aug 2005 11:32 PDT
FYI: If you are confused as to which ActivePerl program to download,
the simplest installation package for ActivePerl is the one located
here:
http://downloads.activestate.com/ActivePerl/Windows/5.6/ActivePerl-5.6.1.638-MSWin32-x86.msi

Request for Answer Clarification by neutral-ga on 21 Aug 2005 13:04 PDT
9 down 2 to go!

I was at the 10th step.

I installed perl under c:\perl\perl

so I typed: 

c:\perl\perl\perl technorati.pl site2.txt >output.txt

However, it returned the following error:

"perl is not recognized as an internal or external command, operable
program or batch file."

Clarification of Answer by palitoy-ga on 21 Aug 2005 13:25 PDT
Good work!

Step 10 should probably be this:

c:\perl\perl\bin\perl.exe technorati.pl site2.txt >output.txt

You need to locate where the perl.exe file is in your perl
installation.  As you have installed Perl at C:\perl\perl it should be
in the bin folder mentioned above.

You may wish to add Perl to your system path environment variable;
there are easy-to-follow instructions here:
http://www.peacefire.org/circumventor/adding-perl-to-path-variable.html

Again you will need to alter c:\perl to c:\perl\perl

This means in future you would only need to type:

perl technorati.pl site2.txt >output.txt
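
Alternatively, if you only want the shortcut for your current Command
Prompt session rather than permanently, you can extend the path by
hand each time you open the window:

set PATH=%PATH%;c:\perl\perl\bin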

Let me know how you get on with this...

Request for Answer Clarification by neutral-ga on 21 Aug 2005 14:40 PDT
I don't know what a 'system path environment variable' is, so I discarded that. :)

I did a trial with only 4 sites, as you recommended - so it would not
take too long.

It worked, but returned only one result.

Does that mean that the other 3 had the value 0?

Clarification of Answer by palitoy-ga on 21 Aug 2005 15:12 PDT
Great! At least we know you can now run perl scripts and everything is
working fine there :)

You are correct in your assumption that the other 3 had no links; this
is because of this line in the script:
$page =~ m/\<strong\>(.*) sites link/ or next;

This line says: check the contents of the Technorati page and search
for a certain pattern.  If the pattern is matched, continue with the
rest of the while loop; otherwise, go on to the next URL.
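
To illustrate (the exact Technorati markup here is inferred from the
pattern rather than copied from their page), the line matches page
text of roughly this shape:

my $page = '<strong>1,234 sites link to www.example.com</strong>';
$page =~ m/\<strong\>(.*) sites link/ and print "$1\n";   # prints "1,234"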

If you wish to double-check the script, I would recommend using URLs
that you know will bring up a result (I used www.google.com,
www.ebay.com and www.yahoo.com).

If you wish me to alter the script slightly so that it outputs 0|url
when no sites are linking, please let me know.

[This will be my last opportunity to respond to any clarifications
tonight.  I will respond to any further ones you have in the morning.]

Request for Answer Clarification by neutral-ga on 22 Aug 2005 01:20 PDT
1- Yes, I would like those with no links to be listed with 0, because
the Technorati site usually has a high volume of requests, and I want
to know whether a site really has 0 links OR Technorati was down and
the data could not be retrieved. (By the way, the script doesn't show
0 links when it cannot retrieve data, or does it?)

2- If you can add the sorting function, it looks like we will be done.

3- Is it possible for you to write a similar PHP script too? I can tip $20 if it is.

Clarification of Answer by palitoy-ga on 22 Aug 2005 01:32 PDT
1- Yes, I would like those with no links to be listed with 0.
No problem, I will add this to the script for you and post the new
script here once I have completed this addition.

2- If you can add the sorting function, it looks like we will be done.
This is already included :)  Or was there a problem when you were
testing it?  It appeared to work correctly when I tested it...

3- Is it possible for you to write a similar PHP script too?
I will work on this for you this morning and should hopefully have a
working solution in a few hours (as it will take this amount of time
to write and test).

Clarification of Answer by palitoy-ga on 22 Aug 2005 02:09 PDT
Just a quick update.  Part 1 is completed, but I have noticed an error
in the sorting method I have used, so I will be devoting my time to
fixing this.

The PHP rewrite will therefore involve more time than your offer of
the $20 tip would allow.  I had already estimated this would take a
couple of hours for the $20 :(  I hope you understand.

Clarification of Answer by palitoy-ga on 22 Aug 2005 03:14 PDT
Please replace the old perl script with this one.

#!/usr/bin/perl

# set up required modules and stuff
use strict;
use warnings;
use LWP::Simple;

# if a parameter is not passed to the script then stop
die "parameter missing" if @ARGV != 1;

# read the urls to check
open (URLS, $ARGV[0]) or die "cannot open ".$ARGV[0];

# set up some holding variables
my $site;
my @urls;

# loop through the urls and process them
while ($site = <URLS>) {
        # remove line feeds
        chomp($site);
        # get the search page
        my $page = get 'http://technorati.com/search/'.$site;
        # check something is found
        defined($page) or die 'cannot fetch '.$site;
        # match the number of sites linking to your url
        if ( $page =~ m/\<strong\>(.*) sites link/ ) {
          # add it to the list in the format:
          # number of links | url
          my $links = $1;
          $links =~ s/\,//g;
          push @urls, "$links|$site";
        }
        else { push @urls, "0|$site"; };
        # wait for 2 seconds before querying Technorati again
        sleep(2);
}

# sort the urls
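# (this sorts numerically on the link count, highest first, and breaks
# ties with a case-insensitive alphabetical comparison)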
@urls = sort {
        ($b =~ /(\d+)\|/)[0] <=> ($a =~ /(\d+)\|/)[0]
                            ||
                    uc($a)  cmp  uc($b)
    } @urls;
# print out the sorted list
foreach my $link ( @urls ) { print $link."\n"; };

Request for Answer Clarification by neutral-ga on 22 Aug 2005 03:39 PDT
I will also need a PHP code which lists the Google Page Ranks of the listed URLs.

Can you do both for a tip of $40?

Clarification of Answer by palitoy-ga on 22 Aug 2005 03:49 PDT
Can you please post the page rank question as a separate Google
Answers question as it is significantly different to the original
question asked?  I would have to investigate this further as
Technorati does not seem to provide this information.

If you specifically want me to answer this question also you can put
"For palitoy-ga" as the question title.

Clarification of Answer by palitoy-ga on 23 Aug 2005 03:36 PDT
Please let me know if the final Perl script I provided you with is
working as per your needs.  If it is I will archive this script in my
records.

Request for Answer Clarification by neutral-ga on 23 Aug 2005 18:10 PDT
Yes, the script works fine.

The only problem can be that the Technorati site is a rather busy one,
and when their server is busy, the script marks the URL in question
'0' instead of n/a.

Is there a way to overcome this problem?

Clarification of Answer by palitoy-ga on 24 Aug 2005 00:19 PDT
Unfortunately there is not a way to overcome this that I can think
of.  The script relies on being able to query the Technorati site, and
if the site is busy it will not be able to get the result.

Request for Answer Clarification by neutral-ga on 24 Aug 2005 14:12 PDT
This new script has a problem:

If it cannot fetch one URL, it stops querying the rest and leaves the
user with an empty output file - regardless of when the problem
occurred.

Here is the error message:

cannot fetch name.blogspot.com at technorati.pl line 26, <URLS> line 3.

I hope it is something simple, because it is almost impossible not to
have a fetch problem when I have almost 400 URLs in the list.

Clarification of Answer by palitoy-ga on 25 Aug 2005 00:12 PDT
You need to find this line in the code:

defined($page) or die 'cannot fetch '.$site;

You can either change it to:

# defined($page) or die 'cannot fetch '.$site;

or:

defined($page) or warn 'cannot fetch '.$site;

The first option will stop the script from checking whether the site
could be fetched at all; the second option will warn you that the site
could not be fetched, but will carry on.
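
One caveat with the warn option: the script will then carry on with an
undefined $page, so the pattern match on the following line will raise
its own "uninitialized value" warnings.  A variation that warns and
then skips straight to the next URL avoids this:

if ( !defined($page) ) { warn 'cannot fetch '.$site; next; }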

Request for Answer Clarification by neutral-ga on 03 Sep 2005 10:33 PDT
I am still getting errors.

Could you please try the script yourself first, and send me a final version?

Thanks!

Clarification of Answer by palitoy-ga on 03 Sep 2005 10:40 PDT
What errors are you getting?  Did you make the alterations I
suggested?  I have tried the script and am not coming up with any
errors.  Can you send me a list of the sites that are producing the
errors?

Request for Answer Clarification by neutral-ga on 04 Sep 2005 03:21 PDT
When I change 'die' to 'warn' - as you advised - the error messages are as follows:

cannot fetch sitename1.com at technorati.pl line 25, <URLS> line 1.
Use of uninitialized value in pattern match (m//) at technorati.pl
line 27, <URLS> line 1.
cannot fetch sitename2.com at technorati.pl line 25, <URLS> line 1.
Use of uninitialized value in pattern match (m//) at technorati.pl
line 27, <URLS> line 1.
...


(Now, if this is how it should be when the script cannot retrieve the
information from the Technorati site, then fine, we have no problems
at all. But the error message implies that it is something else. Am I
wrong?)

Clarification of Answer by palitoy-ga on 04 Sep 2005 03:41 PDT
The script is correct; what it is telling you is that there was a
problem matching the number on the Technorati page that indicates the
number of links.  It is just being a bit more verbose than normal and
reporting its exact error.  I have added an extra check to ensure that
data is received from the Technorati site, but I have been unable to
verify this fully as I do not have a long list of URLs.  The one I am
using is only 15 long and it always seems to get the data correctly
when I run it.

The following script alters things so that the "Use of uninitialized
value in pattern match" warning is not displayed and only an error
stating "cannot fetch xyz.com" is printed.

#!/usr/bin/perl

# set up required modules and stuff
use strict;
use warnings;
use LWP::Simple;

# if a parameter is not passed to the script then stop
die "parameter missing" if @ARGV != 1;

# read the urls to check
open (URLS, $ARGV[0]) or die "cannot open ".$ARGV[0];

# set up some holding variables
my $site;
my @urls;

# loop through the urls and process them
while ($site = <URLS>) {
        # remove line feeds
        chomp($site);
        # get the search page
        my $page = get 'http://technorati.com/search/'.$site;
        # check something is found
        if ( defined($page) ) {
          # match the number of sites linking to your url
          if ( $page =~ m/\<strong\>(.*) sites link/ ) {
            # add it to the list in the format:
            # number of links | url
            my $links = $1;
            $links =~ s/\,//g;
            push @urls, "$links|$site";
          }
          else { push @urls, "0|$site"; };
        }
        else { print 'cannot fetch '.$site."\n"; };
        # wait for 2 seconds before querying Technorati again
        sleep(2);
}

# sort the urls
@urls = sort {
        ($b =~ /(\d+)\|/)[0] <=> ($a =~ /(\d+)\|/)[0]
                            ||
                    uc($a)  cmp  uc($b)
    } @urls;
# print out the sorted list
foreach my $link ( @urls ) { print $link."\n"; };
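
One side note: the "cannot fetch" messages above are printed to
standard output, so they will appear in your output file alongside the
results.  If you would rather keep them out of that file, a one-line
variation is to send them to standard error instead:

        else { print STDERR 'cannot fetch '.$site."\n"; };

Then a run such as "perl technorati.pl sites.txt >output.txt" leaves
the error messages on screen and only the results in the file.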

Request for Answer Clarification by neutral-ga on 04 Sep 2005 11:43 PDT
OK. I am glad that it was not as big of a problem as I thought.

Clarification of Answer by palitoy-ga on 04 Sep 2005 11:46 PDT
Hopefully this version of the script will produce the results you
require.  Let me know if you need anything else.
Comments  
Subject: Re: PHP Code Needed
From: vladimir-ga on 21 Aug 2005 06:35 PDT
 
Consider this simple script:

#!/usr/bin/perl

use strict;
use warnings;
use LWP::Simple;

die "parameter missing" if @ARGV != 1;

open (URLS, $ARGV[0]) or die "cannot open ".$ARGV[0];

print "<html>\n";
print "<table border>\n";

my $site;
while ($site = <URLS>) {
        chomp($site);
        my $page = get 'http://technorati.com/search/'.$site;
        defined($page) or die 'cannot fetch '.$site;
        $page =~ m#<h2><em>(\d+) Posts</em> linking to# or next;
        print "<tr><td>$site</td><td>$1</td></tr>\n";
}

print "</table>\n";
print "</html>\n";


You run it with a single parameter: the name of a text file that
lists the sites you want to check (one address per line). The file
could look like this:

www.site1.com
www.site2.com
www.site3.com
...

The script fetches the information from Technorati and prints simple
HTML on standard output. You could use it like so (assuming you saved
the script in a file called technolinks.pl, the list of sites is in a
file called sites.txt, and you're on some kind of Linux/Unix):

./technolinks.pl sites.txt > output.html

You get your report in the file output.html, ready to be served via a
web server. (Of course, it could use some nicer formatting.) There is
no dynamic script running every time someone wants to view the report;
you update the report manually (or automatically via cron etc.) by
running the Perl script.
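
For example, a crontab entry along these lines (the paths are
placeholders for your own) would rebuild the report at 3am on the 1st,
11th and 21st of each month, i.e. roughly every 10 days:

0 3 1,11,21 * * /path/to/technolinks.pl /path/to/sites.txt > /path/to/output.html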
