Q: Need a re-write of a Java / Python Spider (Answered, 5 out of 5 stars, 2 Comments)
Question  
Subject: Need a re-write of a Java / Python Spider
Category: Miscellaneous
Asked by: coolguy90210-ga
List Price: $25.00
Posted: 17 Sep 2004 02:52 PDT
Expires: 17 Oct 2004 02:52 PDT
Question ID: 402414
This question is specifically for efn-ga; however, as he may not want to do it,
anyone with Java and Python skills is welcome to give it a go.

I need a rewrite of a Java / Python spider that I wrote myself.  I've
been tweaking it off and on as time permits; however, I now have
something else that is more pressing, hence I need some help in making
this one perfect.

1)  I've only tested it on up to 500 or so URLs, and I'm confident it
works at that scale.  I recently ran a test of 60,000 URLs, and it
stalled at 3684, so that might be an issue, or it might be CPU
related.  It never taxes my server, so I don't think that is the issue.

2)  I'm sure the code is inefficient, as it was a first draft.  Please correct that.

3)  The code passes the actual URL visit off to a Python script I
wrote.  The reason for this is that the following Python code:

socket.setdefaulttimeout(30)
remotefile = urllib._urlopener.open(x)
content = remotefile.read()

can't be reproduced (easily) with the Java URLConnection class.  No
method within the URLConnection class exists to set a timeout.  Of
course I could get into sockets, but I don't have the time (pardon the
pun).

I'd like this issue addressed.  It can be addressed in one of two ways:
a)  Use the Java Socket class or another such class to set a timeout,
and rewrite the Python portion as a method within my Spider class (a
rough sketch of this approach appears right after this list).
b)  Just rewrite the whole thing in Python.
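
For approach (a), something along these lines is what I have in mind.  This is
only a rough, untested sketch; the class name, host, and the 30-second value are
placeholders, and it speaks plain HTTP/1.0 through a socket instead of using
URLConnection.

// Rough, untested sketch of option (a): fetch a page with an explicit read
// timeout using java.net.Socket.  Class name, host, and timeout are placeholders.
import java.io.*;
import java.net.*;

public class TimeoutFetch {

    // Returns the raw HTTP response (headers plus body) or throws on timeout/error.
    public static String fetch(String host, String path, int timeoutMillis)
            throws IOException {
        Socket socket = new Socket(host, 80);
        socket.setSoTimeout(timeoutMillis); // read() throws SocketTimeoutException after this
        try {
            Writer out = new OutputStreamWriter(socket.getOutputStream(), "ISO-8859-1");
            out.write("GET " + path + " HTTP/1.0\r\n");
            out.write("Host: " + host + "\r\n");
            out.write("User-Agent: Bot Information\r\n\r\n");
            out.flush();

            BufferedReader in = new BufferedReader(
                    new InputStreamReader(socket.getInputStream()));
            StringBuffer buf = new StringBuffer();
            String line;
            while ((line = in.readLine()) != null)
                buf.append(line).append("\n");
            return buf.toString();
        } finally {
            socket.close(); // always release the connection
        }
    }

    public static void main(String[] args) throws IOException {
        // 30000 ms mirrors the socket.setdefaulttimeout(30) call in the Python script.
        System.out.println(fetch("www.example.com", "/", 30000));
    }
}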

4)  And finally, my last planned feature was to make this a threaded
application.  Please add threading to it.  I think up to 4 threads
running at once would be OK; however, assuming memory is not an issue
(and it isn't for me), I could see up to 10 or 20 threads running at
the same time.

FYI, the spider hits only one page of a site.  It will not be used to
retrieve multiple pages from a site, hence I haven't incorporated a
reading of the robots.txt file.

I don't see editing and improving my code as taking more than 1 hour. 
Please advise.  I'll start with 25 USD.

Here is the Python code:

import urllib
import re
import sys
import string
import socket

class AppURLopener(urllib.FancyURLopener):
    def __init__(self, *args):
        self.version = "Bot Information"
        urllib.FancyURLopener.__init__(self, *args)

urllib._urlopener = AppURLopener()

x = sys.argv[1]
socket.setdefaulttimeout(30)

try:
	#remotefile = urllib.urlopen(x)
	remotefile = urllib._urlopener.open(x)
	content = remotefile.read()

	p = re.compile('<TITLE>.*?</TITLE>',re.DOTALL|re.IGNORECASE)
	q = re.compile('<meta.*?>',re.DOTALL|re.IGNORECASE)
	r = re.compile('<meta.*?description.*?content="',re.DOTALL|re.IGNORECASE)
	s = re.compile('<TITLE>|</TITLE>',re.DOTALL|re.IGNORECASE)

	plist = p.findall(content)
	qlist = q.findall(content)

	remotefile.close()

	for x in plist:
		m = p.match(x)	
		if m:
			#print 'Match found: ', m.group()
			x = re.sub(s, '', x)
			x = string.strip(x)
			print '<TITLE>',x,'<TITLE>'		
			break

	for x in qlist:
		m = r.match(x)	
		if m:
			#print 'Match found: ', m.group()
			x = re.sub(r, '', x)
			x = string.replace(x, '">', '')
			x = string.strip(x)
			print '<DESC>',x,'<DESC>'
			break  #need to break out for those sites where the designer made a mistake and has multiple descriptions
	
except Exception, e:
  print '<TITLE>Timeout<TITLE>'
  print '<DESC>Timeout<DESC>'

###End Python Code###

Here is the Java Code:

import java.io.*;
import java.sql.*;
import java.util.List;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.ListIterator;
import java.util.Collections;

public class Spider {

	private ArrayList urlList;
	private ArrayList selectList;
	private String urlString;

	//SelectData variables
	private Statement stmt;
	private Connection conn;
	private String url = "jdbc:mysql://domainname.com/dbName";
	private String username = "username";
	private String password = "password";
	private Statement selectData;
	private ResultSet rs;
	private int total=0;
	private int initial=0;
	private int num_of_rows = 1000;

	private class ResultSetData {

		protected int linkID;
		protected String linkURL;
		protected int linkFatherID;

		public ResultSetData (int lid, String lur, int fid) {

			linkID = lid;
			linkURL = lur;
			linkFatherID = fid;

		}//end ResultSetData constructor

	}//end ResultSetData class

	private Spider() {

		urlList = new ArrayList();

		total = countData();

		System.out.println("The total is:  " + total);
		System.out.println("The inital is:  " + initial);
		System.out.println("The num_of_rows is:  " + num_of_rows);

		while(initial < total) {

			if(total - initial > num_of_rows) {
				num_of_rows = num_of_rows;
			}//end if

			else {
				num_of_rows = total - initial;
			}//end else

			System.out.println("Initial = " + initial + "and Total = " + total);
			urlList = SelectData(initial,num_of_rows);

			Iterator it = urlList.iterator();
			int y = urlList.size();

			System.out.println("Size is:  " + y);

			//for(int i=0;i<100;i++) {
			while (it.hasNext()) {

				System.out.println("In while loop....and size is " + y);

				//You need to cast the it.next elements to the appropriate Object.
				ResultSetData rsd = (ResultSetData) it.next();

				//urlString = (String) it.next();
				urlString = rsd.linkURL;

				//Test effect of bogus URL
				//urlString = "http://www.unknownunknowndoesnotexit.com";

				String rp = RunPython(urlString);
				//System.out.println("Now back in constructor");
				//System.out.println(rp);
				String [] sp = SplitText(rp);

				String title	 = "Not available.";
				String desc		 = "Not available.";
				String strUrl	 = "Not available.";
				int link_id	     = rsd.linkID;
				int father_id    = rsd.linkFatherID;

				//System.out.println("Now in for loop...");
				for(int h=0; h<sp.length; h++) {

					//System.out.println(sp[h]);

					if(sp[h].startsWith("<TITLE>")) {
						title = sp[h].replaceAll("<TITLE>","").trim();
					}//end if
					else if(sp[h].startsWith("<DESC>")) {
						desc = sp[h].replaceAll("<DESC>","").trim();
					}//end if
					else if(sp[h].startsWith("<URL>")) {
						strUrl = sp[h].replaceAll("<URL>","").trim();
					}//end if

				}//end for

				ShowToday st = new ShowToday();
				//System.out.println(st.demo());
				//System.out.println("Data retrieved...");
				//System.out.println("URL = " + strUrl);
				//System.out.println("TITLE = " + title);
				//System.out.println("DESC = " + desc + "\n");

				InsertData(title,desc,strUrl,link_id,father_id);

			}//end while

			initial = initial + num_of_rows;
			System.out.println("The inital is now:  " + initial);

	}//end while

		System.exit(0);

	}//end constructor

	public String RunPython(String u) {

		urlString = u;
		String s;
		String text = "<URL>" + u + "<URL>";

		try {

			// run the python application

			Process p = Runtime.getRuntime().exec("python2.3 urlcontent.py " + urlString);

			BufferedReader stdInput = new BufferedReader(new
				 InputStreamReader(p.getInputStream()));

			BufferedReader stdError = new BufferedReader(new
				 InputStreamReader(p.getErrorStream()));

			while ((s = stdInput.readLine()) != null) {
				//System.out.println(s);
				text += "_:,:_" + s;
			}//end while

			//System.exit(0);

		}//end try

		catch (IOException e) {
			System.out.println("IOException");
			e.printStackTrace();
			System.exit(-1);
		}//end catch

		return text;

	}//end method

	public String [] SplitText(String i) {

		String text = i;
		String [] split_text = text.split("_:,:_");
		return split_text;

	}//end method

	private void InsertData(String t, String d, String u, int l, int f) {

		PreparedStatement stmt;
		Connection conn;

		try {

			Class.forName("com.mysql.jdbc.Driver").newInstance();

		}//end try

		catch (Exception ex) {
			ex.printStackTrace();
		}//end catch

		try {

			String insert = "INSERT INTO spider (title,description,url,net_Links_ID," +
				"net_Category_FatherID) values (?,?,?,?,?)";

			String url = "jdbc:mysql://domainname.com/dbname";
			String username = "username";
			String password = "password";

			conn = DriverManager.getConnection(url, username, password);

			PreparedStatement insertData = conn.prepareStatement(insert);
			insertData.setString(1, t);
			insertData.setString(2, d);
			insertData.setString(3, u);
			insertData.setInt(4, l);
			insertData.setInt(5, f);
			insertData.executeUpdate();

			insertData.close();
			conn.close();

		}//end try

		catch (java.sql.SQLException ex) {
			ex.printStackTrace();
		}//end catch

	}//end method

	private ArrayList SelectData(int in, int nmr) {

			/* This try block queries 3 tables in my database and retrieves a slice of the URLs that I
			want to spider.  I can provide a CREATE statement and some data for troubleshooting.
			*/
			try {
				selectList = new ArrayList();
				String select = "SELECT net_Links.ID, net_Links.URL,
net_Links.TITLE, net_Category.FatherID " +
				"FROM net_Links,net_Category, net_CatLinks " +
				"WHERE net_Category.Full_Name LIKE 'Health%' " +
				//"WHERE net_Category.FatherID = '72658' " +
				//"AND net_Category.ID = CategoryID AND net_Links.ID = LinkID";
				"AND net_Category.ID = CategoryID AND net_Links.ID = LinkID LIMIT
" + initial + "," + num_of_rows;

				System.out.println(select);

				conn = DriverManager.getConnection(url, username, password);
				selectData = conn.createStatement();
				rs = selectData.executeQuery(select);

				while (rs.next()) {
					int	   id = rs.getInt("net_Links.ID");
					String ul = rs.getString("net_Links.URL");
					int    fd = rs.getInt("net_Category.FatherID");
					ResultSetData rsd = new ResultSetData(id, ul, fd);
					selectList.add(rsd);
				}//end while

				selectData.close();
				conn.close();
				System.out.println("The select statement is:\n\n" + select);
			}//end try

			catch (java.sql.SQLException ex) {
				ex.printStackTrace();
			}//end catch

			return selectList;

	}//end method

	private int countData() {

		try {
			/* This try block counts the number of rows that could be returned if all rows were to be selected.
			The idea is to use this number to determine the final loop's SQL statement.
			*/
			Class.forName("com.mysql.jdbc.Driver").newInstance();
			String select_count = "SELECT count(net_Links.ID) AS total FROM net_Links,net_Category, net_CatLinks " +
				"WHERE net_Category.Full_Name like 'Health%' " +
				//"WHERE net_Category.FatherID = '72658' " +
				"AND net_Category.ID = CategoryID AND net_Links.ID = LinkID";
			conn = DriverManager.getConnection(url, username, password);
			selectData = conn.createStatement();
			rs = selectData.executeQuery(select_count);

			while (rs.next()) {
				total = rs.getInt("total");
			}//end while

			System.out.println(total);

		}//end try

		catch (Exception ex) {
			System.out.println("countData exception");
			ex.printStackTrace();
		}//end catch

		return total;

	}//end method

	public static void main(String args[]) {
		Spider sp = new Spider();
	}//end main

}//end class

Request for Question Clarification by leapinglizard-ga on 17 Sep 2004 04:01 PDT
I consider myself a proficient Python programmer, and I estimate it
would take four hours to rewrite your code as a multithreaded Python
script. Then again, these things always take longer than one thinks.
In the event of overtime, you could tip accordingly. See previous
examples of my work below.

http://answers.google.com/answers/threadview?id=121280

http://answers.google.com/answers/threadview?id=399628

Also, do you have a publicly available version of your SQL database?
Perhaps a dummy version that you take down once the work is done? It
would be an immense help in testing the script.

Let me know if you want me to tackle the job.

leapinglizard

Clarification of Question by coolguy90210-ga on 17 Sep 2004 08:05 PDT
I'm posting this in response to leapinglizard-ga's post on 17 Sep 2004 04:01 PDT.

leapinglizard-ga:

I've reviewed the two examples of your work that you gave, as well as
all of your remaining examples in your profile.  I must say that I am
impressed with the level of detail you put into your answers.

Based on your record I will accept your offer to tackle the job. 
Everyone else, feel free to briefly describe your interest and
approach.  If leapinglizard-ga's is unable to answer then I'll have to
quickly move on to another expert.  However, I'm assuming
leapinglizard-ga's answer will be on par with his past answers, and if
so, I will accept his work as the answer.

RE:  public database.  Let me consider the alternatives.  For now, to
get started, you really only need an array list of URLs.  I'll have
some type of DB data for you within 8 hours, as it is 12 midnight for
me.

Clarification of Question by coolguy90210-ga on 17 Sep 2004 08:07 PDT
leapinglizard-ga,

Also, re:  number of hours.  I'm willing to accept the 4-hour
timetable, and as I want to have this done ASAP, I'll go ahead and
accept an upper limit of 100 USD, i.e. a 75-dollar tip, assuming it
works as requested.

Request for Question Clarification by leapinglizard-ga on 17 Sep 2004 16:52 PDT
I've been busy with another scripting project, but I'll soon be done
and then I'll get started on yours. If I run into any serious
obstacles, I'll notify you via a further Clarification Request.
Otherwise, I expect to have a first draft, subject to modification and
debugging, ready tonight. It is currently 7:47 pm in my time zone. I
will also notify you if I need more details on the database, although
most of the work can be done independently of it, as you point out. If
I can't post a first draft tonight, I'll abandon the project and
invite other code-savvy Researchers to take a crack at it.

leapinglizard

Clarification of Question by coolguy90210-ga on 17 Sep 2004 18:40 PDT
leapinglizard-ga,

RE:  4 hours.  I did not mean that it had to be done in 4 hours.  I
meant that if it took up to 4 hours of your time, that would be OK. 
If you could have it done by the end of this weekend, that would be
great.

Also, I know more about how Java cleans up memory problems than I do
about Python.  Please ensure that the Python code you are writing will
clean up after itself, i.e. close connections, close running processes,
etc.

Request for Question Clarification by leapinglizard-ga on 17 Sep 2004 18:49 PDT
Aha! When you said ASAP, I thought you really meant it. I may leave
the script until tomorrow morning, then. I'll be sure to keep track of
the amount of time I spend on it.

As for exiting cleanly, I am well aware of the various synchronization
problems pertaining to Python and of the garbage collector's behavior.
I always take care to open and close my files and sockets in the right
order. Never fear, I don't leave references hanging unless I know from
experience that it's safe to do so.

leapinglizard

Clarification of Question by coolguy90210-ga on 18 Sep 2004 20:26 PDT
leapinglizard,

Just responding to Google's request for clarification.  Apparently, if
you post a statement as a clarification, they still want me to respond
with a clarification post, even though you were making a statement
rather than asking a question.

Request for Question Clarification by leapinglizard-ga on 18 Sep 2004 21:31 PDT
Yes, the clarification-request notices can get tiresome. I'm familiar
with them from the other direction, when customers make a statement in
the form of a Request for Answer Clarification.

I started work on your project later than I wanted to, but I've made
good progress over the past hour and a half. I've cleaned up and
rewritten almost everything. The one part of your Java code that I
can't translate for the time being, and it's an indispensable part, is
the dialogue with your mySQL database. I had assumed that there would
be a Python module for accessing SQL databases, but it turns out that
there's no such thing in the standard library.

At this point, several courses of action are possible. I could ask you
to download and install a third-party Python module that provides
mySQL access, but this sort of thing can get messy. Alternatively, I
could write my own functions to talk with your database. I don't know
exactly what the protocol is, but my past experience with running
manual queries on a PostgreSQL database tells me that it could be done
fairly easily by talking through a socket.

Now, in order to suss out the protocol, there are again two
possibilities. One is that you put up a (temporarily) publicly
available version of your database containing some sample content. The
other is that I set up a mySQL database server locally and make a
sample database with a CREATE statement and data that you provide, as
you suggested in your Java comments. This would take a little time, of
course, but it's the only remaining obstacle. I'm confident I can
overcome it within a few more hours. Your time zone is UTC+8, correct?

Let me know how you'd like to proceed, or if you want to fire me altogether.

leapinglizard

Request for Question Clarification by leapinglizard-ga on 18 Sep 2004 22:04 PDT
I've made a decision. I won't attempt to kludge together a mySQL
access function after all. In fact, I've abandoned the attempt to
rewrite your script in Python. I'm sorry.

I could still rewrite it in Java, but I'd have to start over and I
can't guarantee it'll be done by the end of the weekend. It's Saturday
night where I am, but I believe it's Sunday afternoon where you are.
I've been working all day. Must go home and sleep now. I'm getting up
in five hours, at 6am, as I do every morning. If at that point I
haven't heard from you yet, or if you respond that you do want me to
go on, I'll get to work on the Java rewrite. I believe I would be able
to finish that around noon on Sunday, or 1am Monday morning in the
GMT+8 time zone.

leapinglizard

Clarification of Question by coolguy90210-ga on 18 Sep 2004 22:45 PDT
leapinglizard,

Go ahead and write it in Java.  I'm more familiar with Java anyway; I
just don't have the knowledge of threads or sockets, or the time, to
complete the project myself.

Yes, you are correct about my time zone.  It is 14:35 for me in South Korea.

Regarding data.  Let me know what you need.  I'm a little concerned
about putting my site db info here even if it is a test db.

I have no problem getting you the CREATE SQL statements.  To be frank,
the SelectData method simply obtains a list of URLs and puts them into
an array list.  Granted, the select statement is a rather large 3-table
join; however, for your purposes, simply obtaining a list of URLs from
a table should suffice to mirror the function of SelectData().  Bottom
line, we need to count and set a limit on how many URLs are grabbed
into the result set at one time, for obvious reasons, i.e. the result
set returned could contain 300,000 or more URLs.

The table that receives the spidered data for the InsertData method is simply:

CREATE TABLE `spider` (
  `ID` int(10) unsigned NOT NULL auto_increment,
  `title` varchar(100) NOT NULL default '',
  `url` varchar(255) NOT NULL default 'http://',
  `description` text,
  `net_Links_ID` int(10) unsigned default NULL,
  `net_Category_FatherID` int(10) unsigned default NULL,
  PRIMARY KEY  (`ID`)
) TYPE=MyISAM;

Clarification of Question by coolguy90210-ga on 19 Sep 2004 00:58 PDT
leapinglizard,

I'm not married to the InserData method.  MySQL's import utility is
phenomenally fast.  Much faster than anything I've ever written to
import data into MySQL.  Would having the application write to a CSV
file be more efficient?  I hadn't explored that avenue as my test
result sets were no more than a few thousand URL's so it didn't matter
if I inserted the data as I collected it with the InsertData method.

Request for Question Clarification by leapinglizard-ga on 19 Sep 2004 04:19 PDT
CSV is a very simple protocol, and i/o methods for it are
correspondingly lightweight. Nevertheless, if you're happy with
mySQL's performance, you should stick with a database solution. The
nice thing about a database is the data security it offers, what with
journaling and transaction rollbacks. Also consider that when you're
dealing with a CSV file, it's your job to manage insertions, data
queries, and so on. I don't see a good reason to use CSV as anything
more than an intermediary format between compatible spreadsheet
applications and the like, which is in fact its raison d'etre. I
certainly wouldn't use CSV as a basis for structuring and manipulating
data. If I wanted raw speed, I would construct a simple data structure
such as a string array in memory, then serialize it and write it to a
binary file. That way, the data gets compressed while it's in memory,
thereby minimizing disk access times.

leapinglizard

Clarification of Question by coolguy90210-ga on 19 Sep 2004 06:29 PDT
leapinglizard,

The idea behind the comma-separated file suggestion was to have the Java
application write to a file rather than to the database.  At the end of
a run of thousands (probably hundreds of thousands) of URLs, I could
then do a mysqlimport of the data.

In any event, let's stick to the InsertData method of inserting the
data into the mysql table.  Obviously, if you are able to improve upon
the InsertData method, please do so.

Request for Question Clarification by leapinglizard-ga on 19 Sep 2004 07:06 PDT
coolguy90210,

It looks like I'm done with everything but the threading. While I work
on that, please try out the code as it stands. I haven't had time to
set up a database locally, but perhaps you could run the program
yourself and report any problems.

In the event of a crash, please post the output. Furthermore, if you
have a chance to read the code, let me know if you disagree with any
of my optimizations. I've been going too fast to add comments. Do you
need any?

Let me also make a few observations. First, the reason there's no need
to set a socket timeout is that Java's URL class manages timeouts on
its own, as defined in the HTTP standard. Second, it seems to me that
you may have calculated the total number of rows incorrectly in your
countData() method. In the line that reads

                total = rs.getInt("total");

you're setting the total in each iteration of the loop, rather than
incrementing the total. Could this be why your program was halting
after only 3684 pages?

Finally, I'm not sure how you want to handle malformed and
inaccessible URLs. At present, the Scraper constructor makes a dummy
page with the title "_malformed_URL_" or "_inaccessible_URL_", as
appropriate, but this could be handled differently.

leapinglizard



//----------begin Spider.java

import java.io.*;
import java.sql.*;
import java.util.*;
import java.util.regex.*;
import java.net.*;

class PageData {
    String title, desc;
    public PageData (String title, String desc) {
        this.title = title;
        this.desc = desc;
    }
    void setTitle(String title) {
        this.title = title;
    }
    void setDesc(String desc) {
        this.desc = desc;
    }
}

class Scraper {
    String text;
    int pFlags = Pattern.CASE_INSENSITIVE;
    Pattern titlePattern = Pattern.compile(
            "<title>(.*?)</title>", pFlags);
    Pattern descPattern = Pattern.compile(
            "<meta\\s+name=\"description\"\\s+content=\"(.*?)\"", pFlags);

    public Scraper(String address) {
        URL url;
        BufferedReader reader;
        String line;
        StringBuffer buf = new StringBuffer();
        try {
            url = new URL(address);
            InputStreamReader stream = new InputStreamReader(url.openStream());
            reader = new BufferedReader(stream);
            while ((line = reader.readLine()) != null)
                buf.append(line.trim()+" ");
        } catch (MalformedURLException e) {
            text = "<title>_malformed_URL</title>";
            return;
        } catch (IOException e) {
            text = "<title>_inaccessible_URL_</title>";
            return;
        }
        text = buf.toString();
    }

    public PageData getPageData() {
        PageData pageData = new PageData("_no_title_", "_no_description_");
        Matcher matcher;
        if ((matcher = titlePattern.matcher(text)).find())
            pageData.setTitle(matcher.group(1).trim());
        if ((matcher = descPattern.matcher(text)).find())
            pageData.setDesc(matcher.group(1).trim());
        return pageData;
    }
}

class LinkData {
    int id, father_id;
    String url;
    public LinkData (String url, int id, int father_id) {
        this.url = url;
        this.id = id;
        this.father_id = father_id;
    }
}

public class Spider {
    String urlString;
    String dbURL = "jdbc:mysql://domainname.com/dbName";
    String username = "username", password = "password";
    Connection conn;
    ResultSet rs;
    int total, count, row_num;

    public Spider() {
        count = 0;
        row_num = 1000;
        total = countData();
        System.out.println("count = "+count+", total = "+total);
        while(count < total) {
            row_num = Math.min(row_num, total-count);
            System.out.println("row_num = "+row_num);
            LinkData links[] = selectData(count, row_num);
            System.out.println("found "+links.length+" links");
            for (int i = 0; i < links.length; i++) {
                LinkData link = links[i];
                String url = link.url;
                //url = "http://www.unknownunknowndoesnotexit.com";
                PageData page = new Scraper(url).getPageData();
                insertData(page.title, page.desc, url, link.id, link.father_id);
            }
            count += row_num;
            System.out.println("count = "+count);
        }
        System.exit(0);
    }

    private void insertData(String t, String d, String u, int l, int f) {
        try {
            Class.forName("com.mysql.jdbc.Driver").newInstance();
        } catch (Exception ex) {
            ex.printStackTrace();
        }
        try {
            String insert = "INSERT INTO spider (title,description,url,"
                + "net_Links_ID,net_Category_FatherID) values (?,?,?,?,?)";
            String url = "jdbc:mysql://domainname.com/dbname";
            conn = DriverManager.getConnection(dbURL, username, password);
            PreparedStatement insertData = conn.prepareStatement(insert);
            insertData.setString(1, t);
            insertData.setString(2, d);
            insertData.setString(3, u);
            insertData.setInt(4, l);
            insertData.setInt(5, f);
            insertData.executeUpdate();
            insertData.close();
            conn.close();
        } catch (java.sql.SQLException ex) {
            ex.printStackTrace();
        }
    }

    private LinkData[] selectData(int in, int nmr) {
            // queries 3 tables in database to retrieve URLs
            Vector vector = new Vector();
            try {
                String select = "SELECT net_Links.ID, net_Links.URL,"
                    + " net_Links.TITLE, net_Category.FatherID"
                    + " FROM net_Links,net_Category, net_CatLinks"
                    + " WHERE net_Category.Full_Name LIKE 'Health%'"
                    + " AND net_Category.ID = CategoryID AND"
                    + " net_Links.ID = LinkID LIMIT "+count+","+row_num;
                System.out.println("select = \""+select+"\"");
                conn = DriverManager.getConnection(dbURL, username, password);
                Statement statement = conn.createStatement();
                rs = statement.executeQuery(select);
                while (rs.next()) {
                    String url = rs.getString("net_Links.URL");
                    int id = rs.getInt("net_Links.ID");
                    int father_id = rs.getInt("net_Category.FatherID");
                    vector.add(new LinkData(url, id, father_id));
                }
                statement.close();
                conn.close();
            } catch (java.sql.SQLException ex) {
                ex.printStackTrace();
            }
            LinkData links[] = new LinkData[vector.size()];
            for (int i = vector.size()-1; i >= 0; i--)
                links[i] = (LinkData) vector.get(i);
            return links;
    }

    private int countData() { 
        int ct = 0;
        try {
            // counts number of rows; will use to construct final SQL statement
            Class.forName("com.mysql.jdbc.Driver").newInstance();
            String select_count = "SELECT count(net_Links.ID) AS total"
                + " FROM net_Links,net_Category, net_CatLinks WHERE"
                + " net_Category.Full_Name like 'Health%'"
                + " AND net_Category.ID = CategoryID AND net_Links.ID = LinkID";
            conn = DriverManager.getConnection(dbURL, username, password);
            Statement statement = conn.createStatement();
            rs = statement.executeQuery(select_count);
            while (rs.next())
                ct += rs.getInt("total");
            System.out.println("count = "+ct);
        } catch (Exception ex) {
            System.out.println("countData exception");
            ex.printStackTrace();
        }
        return ct; 
    }

    public static void main(String args[]) {
        System.out.println(new Scraper(args[0]).getPageData().title);
        if (true) return;
        new Spider();
    }
}

//----------end Spider.java

Request for Question Clarification by leapinglizard-ga on 19 Sep 2004 07:09 PDT
Whoops! I forget to remove my debugging output from the main() method.
Please change it to the following.

    public static void main(String args[]) {
        new Spider();
    }

leapinglizard

Clarification of Question by coolguy90210-ga on 19 Sep 2004 07:56 PDT
leapinglizard,

I'm running the program now as we speak.  For a moment it looked like
it had stalled on only 13 URLs, but apparently that 14th URL was a
timeout, which was handled by the URLConnection class.

1)  Yes, I know that the URL class will handle the timeout; however, I
could not find what the default timeout duration is, or how to change
it.  Since this will be a threaded application, the timeout doesn't
matter anymore.

2)  You are right, now that I review it, it appears that I made a
mistake with the total variable.

3)  Malformed and missing URLs will be handled by being marked as you
have done, and then at some point I'll just delete them.  I have so
many URLs to go through that I don't have time to mess with problem
URLs.  The designation of "malformed" or "inaccessible" that you have
given and entered into the database is correct.

OK, in any event, I have 256 URLs visited so far, and data is being
entered correctly.  If you can get the thread portion finished, then
we'll be done.  I'll continue to review the changes you made, and will
comment on them later.

Finally, I notice that while the spider is running, it doesn't print
anything other than the counts and the select statement, at least not
for every 1000 URLs collected.  It will print a select statement every
1000 URLs, correct?  That is not a big deal.  If you can think of a
good place to put a progress indicator, that would be great; if not, no
big deal.  I can always open another terminal window and do an SQL
count of the spider table.

Request for Question Clarification by leapinglizard-ga on 19 Sep 2004 09:13 PDT
coolguy90210,

I think I've implemented threaded page downloading. I can't be sure it
works, though. I wrote the code blind, having no database for tests.
All I can say is that the program compiles. Please take it for a spin.

If it works as intended, you'll be able to adjust the value of
maxThreadCount to control the number of simultaneous downloads.

You'll see that I added a line in the main loop to print a message
after every 100 downloads. I don't increment the count variable
directly but a copy of it, since I don't know exactly what's going on
in the database query that sets the total.

I can't think of anything else at the moment, save that a further
argument against CSV occurred to me. The cost of executing insertData
is negligible compared to the cost of the downloads, so you wouldn't
really be saving time by importing the data in one shot.

Again, please send output in the event of a crash, and feel free to
ask any questions that come to mind.

leapinglizard





//----------begin Spider.java

import java.io.*;
import java.sql.*;
import java.util.*;
import java.util.regex.*;
import java.net.*;

class PageData {
    String title, desc;
    public PageData (String title, String desc) {
        this.title = title;
        this.desc = desc;
    }
    void setTitle(String title) {
        this.title = title;
    }
    void setDesc(String desc) {
        this.desc = desc;
    }
}

class Scraper {
    String text;
    int pFlags = Pattern.CASE_INSENSITIVE;
    Pattern titlePattern = Pattern.compile(
            "<title>(.*?)</title>", pFlags);
    Pattern descPattern = Pattern.compile(
            "<meta\\s+name=\"description\"\\s+content=\"(.*?)\"", pFlags);

    public Scraper(String address) {
        URL url;
        BufferedReader reader;
        String line;
        StringBuffer buf = new StringBuffer();
        try {
            url = new URL(address);
            InputStreamReader stream = new InputStreamReader(url.openStream());
            reader = new BufferedReader(stream);
            while ((line = reader.readLine()) != null)
                buf.append(line.trim()+" ");
        } catch (MalformedURLException e) {
            text = "<title>_malformed_URL</title>";
            return;
        } catch (IOException e) {
            text = "<title>_inaccessible_URL_</title>";
            return;
        }
        text = buf.toString();
    }

    public PageData getPageData() {
        PageData pageData = new PageData("_no_title_", "_no_description_");
        Matcher matcher;
        if ((matcher = titlePattern.matcher(text)).find())
            pageData.setTitle(matcher.group(1).trim());
        if ((matcher = descPattern.matcher(text)).find())
            pageData.setDesc(matcher.group(1).trim());
        return pageData;
    }
}

class LinkData {
    int id, father_id;
    String url;
    public LinkData (String url, int id, int father_id) {
        this.url = url;
        this.id = id;
        this.father_id = father_id;
    }
}

class PageCrawler extends Thread {
    Spider spider;
    LinkData link;

    PageCrawler(Spider spider, LinkData link) {
        this.spider = spider;
        this.link = link;
    }

    public void run() {
        PageData page = new Scraper(link.url).getPageData();
        insertData(page.title, page.desc, link.url, link.id, link.father_id);
        spider.unlock();
    }

    void insertData(String t, String d, String u, int l, int f) {
        Connection conn;
        String dbURL = spider.dbURL;
        String username = spider.username, password = spider.password;
        try {
            Class.forName("com.mysql.jdbc.Driver").newInstance();
        } catch (Exception ex) {
            ex.printStackTrace();
        }
        try {
            String insert = "INSERT INTO spider (title,description,url,"
                + "net_Links_ID,net_Category_FatherID) values (?,?,?,?,?)";
            conn = DriverManager.getConnection(dbURL, username, password);
            PreparedStatement insertData = conn.prepareStatement(insert);
            insertData.setString(1, t);
            insertData.setString(2, d);
            insertData.setString(3, u);
            insertData.setInt(4, l);
            insertData.setInt(5, f);
            insertData.executeUpdate();
            insertData.close();
            conn.close();
        } catch (java.sql.SQLException ex) {
            ex.printStackTrace();
        }
    }
}

public class Spider {
    int maxThreadCount = 20;
    String dbURL = "jdbc:mysql://domainname.com/dbName";
    String username = "username", password = "password";
    int threadCount, total, count, row_num, ct;

    public Spider() {
        threadCount = 0;
        count = 0;
        row_num = 1000;
        total = countData();
        System.out.println("count = "+count+", total = "+total);
        while(count < total) {
            ct = count;
            row_num = Math.min(row_num, total-count);
            System.out.println("row_num = "+row_num);
            LinkData links[] = selectData(count, row_num);
            System.out.println("found "+links.length+" links");
            for (int i = 0; i < links.length; i++) {
                lock();
                PageCrawler crawler = new PageCrawler(this, links[i]);
                crawler.start();
                if (++ct % 100 == 0)
                    System.out.println("launched "+ct+" crawlers so far\n");
            }
            count += row_num;
            System.out.println("count = "+count);
        }
        System.exit(0);
    }

    public synchronized void lock() {
        while (threadCount == maxThreadCount)
            try {
                wait();
            } catch (InterruptedException e) {}
        threadCount++;
    }
    
    public synchronized void unlock() {
        threadCount--;
        notify();
    }

    private LinkData[] selectData(int in, int nmr) {
            // queries 3 tables in database to retrieve URLs
            Vector vector = new Vector();
            Connection conn;
            try {
                String select = "SELECT net_Links.ID, net_Links.URL,"
                    + " net_Links.TITLE, net_Category.FatherID"
                    + " FROM net_Links,net_Category, net_CatLinks"
                    + " WHERE net_Category.Full_Name LIKE 'Health%'"
                    + " AND net_Category.ID = CategoryID AND"
                    + " net_Links.ID = LinkID LIMIT "+count+","+row_num;
                System.out.println("select = \""+select+"\"");
                conn = DriverManager.getConnection(dbURL, username, password);
                Statement statement = conn.createStatement();
                ResultSet rs = statement.executeQuery(select);
                while (rs.next()) {
                    String url = rs.getString("net_Links.URL");
                    int id = rs.getInt("net_Links.ID");
                    int father_id = rs.getInt("net_Category.FatherID");
                    vector.add(new LinkData(url, id, father_id));
                }
                statement.close();
                conn.close();
            } catch (java.sql.SQLException ex) {
                ex.printStackTrace();
            }
            LinkData links[] = new LinkData[vector.size()];
            for (int i = vector.size()-1; i >= 0; i--)
                links[i] = (LinkData) vector.get(i);
            return links;
    }

    private int countData() {
        int ct = 0;
        Connection conn;
        try {
            // counts number of rows; will use to construct final SQL statement
            Class.forName("com.mysql.jdbc.Driver").newInstance();
            String select_count = "SELECT count(net_Links.ID) AS total"
                + " FROM net_Links,net_Category, net_CatLinks WHERE"
                + " net_Category.Full_Name like 'Health%'"
                + " AND net_Category.ID = CategoryID AND net_Links.ID = LinkID";
            conn = DriverManager.getConnection(dbURL, username, password);
            Statement statement = conn.createStatement();
            ResultSet rs = statement.executeQuery(select_count);
            while (rs.next())
                ct += rs.getInt("total");
            System.out.println("count = "+ct);
        } catch (Exception ex) {
            System.out.println("countData exception");
            ex.printStackTrace();
        }
        return ct;
    }

    public static void main(String args[]) {
        System.out.println(new Scraper(args[0]).getPageData().title);
        //if (true) return;
        new Spider();
    }
}

//----------end Spider.java

Request for Question Clarification by leapinglizard-ga on 19 Sep 2004 09:17 PDT
Once again, I mistakenly left a couple of spurious lines in the main()
method. Go ahead and delete them.

leapinglizard

Request for Question Clarification by leapinglizard-ga on 19 Sep 2004 19:09 PDT
So does the program work properly? Is my work in line with your
expectations? If not, let me know what more I can do for you.

At your service,

leapinglizard

Clarification of Question by coolguy90210-ga on 19 Sep 2004 21:00 PDT
leapinglizard,

Approximately 7 hours after my last clarification post, I went to my PC
to check on the spidering.  It was at over 8000.  I had to go to work,
so I could not check it further.  In approximately 7 hours, after work
and dinner, I will try the threaded version and report the results back
here.

The only possible issue is that the count was the same when I got up
and 20 minutes later when I was ready to leave for work.  I didn't have
time to check the processes on my Linux server, so I'll have to get
back to you on that.

All in all things are looking good.  After a final discussion as
mentioned above, I'll go ahead and mark the question answered and make
payment plus tip as detailed previously.

Clarification of Question by coolguy90210-ga on 20 Sep 2004 02:55 PDT
leapinglizard,

I got back from work and compiled the threaded version.  I'm running
it now.  It is very fast:  1000 URLs in 1 min 47 seconds, and that is
just with a maximum of 20 threads.  I haven't played with increasing
that, as it is currently fast enough.  I have it running on one of my
three dedicated servers, a test platform (P4 2 GHz with 1.5 GB of RAM),
and it occupies anywhere from 0 to 5.0% of my CPU per the Linux top
command.  The load average is approximately 0.47.  I've seen occasional
spikes up to a load average of 1.0 and 12.0% CPU, but they are very
sporadic.  It is a test platform anyway, so I'm not concerned.  I'd
love to see a thread count so I can see how many threads, up to the
maximum of 20, affect the load.

OK, back to the non-threaded version.  I mentioned that it appeared to
hang at 8000 or so.  It is true.  When I returned from work, 9 hours
later, the count was the same as when I had left.  The number was
8698.  Not sure if that has any significance for the loop.  It has no
significance relative to my total rows.  As I write this, the threaded
version has completed 18550 URLs.  To be honest, I suspect that the
hang has to do with a timeout on one of the URLs.  I wonder if an
exception needs to be written for this?  However, with the threaded
version it would seem unnecessary, as from what I can tell, even if one
thread times out, there are up to 19 more to take its place.

So, in conclusion:

"So does the program work properly?"
Answer:  The threaded version is excellent, and appears to be working
as I envisioned.  I'm running a test of 60,000 urls which should
finish up shortly.

"Is my work in line with your expectations?"
Answer:  Yes, of course.  Excellent work.

No need to work on the above-mentioned possible timeout issue of the
non-threaded version.  Assuming the threaded version completes the
60,000 URLs, there is no need to work on this project anymore.  I
would consider it complete.

RE:  the number of threads running.  If that is easy to write, you
could add it, but I know you've spent a lot of time on this, and I
could probably just do that myself.

Upon completion of the above mentioned run, we will conclude our project.

Clarification of Question by coolguy90210-ga on 20 Sep 2004 07:01 PDT
leapinglizard,

I noticed that the "user-agent" info that was in the python script was
not carried over into the Java code.  I usually use something like
this:

URLConnection conn = url.openConnection();
conn.setRequestProperty("User-Agent","Place User-Agent Info Here");
InputStream urlStream = conn.getInputStream();
BufferedReader in = new BufferedReader(new InputStreamReader(urlStream));

I should be able to simply replace your connection method, quoted
below, with the above, correct?  Do you know of an easier way?  I
couldn't find setRequestProperty() in the URL class.

URL url;
BufferedReader reader;
String line;
StringBuffer buf = new StringBuffer();
try {
  url = new URL(address);
  InputStreamReader stream = new InputStreamReader(url.openStream());
  reader = new BufferedReader(stream);


I'm ready to conclude this project.

The 67,000-URL retrieval was successful.  It even finished up the
remaining 8 URLs at the end of the loop, as it should.

I'm working on 37,000 now.  I upped the maxThreadCount to 50 and
changed the if statement to if (++ct % 250 == 0) in the Spider
constructor.  As I read the code, I'm not convinced that the above
makes a difference.  Would you be able to tell me how many threads run
at one time, and what controls that process?  To be honest, I'm a
little confused by your thread code.  I understand everything else 100%
except for the thread portion.  If you don't have time, bottom line,
I'd like to know how many threads I can run at one time.  I'd like to
have a minimum of 25 PageCrawlers going at one time.

Specifically, I'm confused by this line here:

PageCrawler crawler = new PageCrawler(this, links[i]);

I know the above passes a link to PageCrawler, and that "this" is a
self-reference to the Spider object, right?  "this" enables the
PageCrawler class to use the variables in the Spider class,
correct?

and then these lines here:

PageCrawler(Spider spider, LinkData link) {
        this.spider = spider;
        this.link = link;
    }

Having this.spider, and this.link means that PageCrawler can now
reference the variables of the Spider class, correct?

Clarification of Question by coolguy90210-ga on 20 Sep 2004 08:10 PDT
leapinglizard,

I finished the 60,000-plus and 37,000-plus URL runs with no problem.  I
switched over to a 2nd machine with 1 GB of RAM and moved the databases
over to this new machine.  The new total of URLs to grab was 310,000.
I changed the maxThreadCount back to 20.  I changed this line:
(++ct % 250 == 0), back to (++ct % 100 == 0).  This time, I received a
java.lang.OutOfMemoryError at this point:

launched 4600 crawlers so far

java.lang.OutOfMemoryError
java.lang.OutOfMemoryError

The total number of URLs to retrieve should have no bearing, as we are
taking URLs out in slices of 1000, right?  What exactly is getting
threaded, the scraper only?  That is, does only a single slice of 1000
URLs rest in memory, along with multiple scraper threads retrieving URL
content from this single slice of 1000 URLs, OR are there multiple
slices of 1000 URLs residing in memory, along with multiple scraper
threads for these multiple slices of 1000?

Clarification of Question by coolguy90210-ga on 20 Sep 2004 16:45 PDT
leapinglizard,

I have performed the same retrieval of the same 310,000-URL category 3
times now on two different dedicated servers with no other processes
or applications running.  All runs fail at 4600 URLs retrieved.

In addition to the above, I performed a retrieval of a different
category of 47,000 URLs with no problems.  Each of these URL data sets
(4 so far) is a different category (out of 12 total):  3 successful
/ 1 failure.

Here is what I suspect is happening.  Somewhere between URL number
4000 and 5000 of the 310,000 category, an error occurs.  It could be a
malformed URL, or it could be a timeout.  If I'm not mistaken, the
default URL class timeout is at least 60 seconds, perhaps even more.
I don't know for sure, but I recall reading that somewhere.

I wrote a System.out.println for each of the try / catch statements
and ran the Spider again.  I obtained the same error as mentioned
above; however, I could not pinpoint where it occurred, as none
of the print statements I wrote were displayed.

Finally, bottom line, your spider works.  Let's conclude the project.
No need to update or write any more code; however, if you would comment
on my posts, in particular this one, that would be great.

I believe the only thing left to do, on my end, is to have it keep
track of where it fails, i.e. print to a file a line like "Failed at
url 4653:  http://blahblahblah.com", and then simply skip ahead by
1000.  At my leisure I can go back, examine the URL, and troubleshoot.
The only problem I foresee is that I couldn't catch the error in an
exception.  java.lang.OutOfMemoryError can be caught, right?

Request for Question Clarification by leapinglizard-ga on 20 Sep 2004 17:06 PDT
I read your earlier posts with great interest, and I've been thinking
about the concerns you've raised, but I'm heavily occupied at the
moment with another project. Can you hang on for two or three hours?

Once I've wrapped up my current work, I'll finally post an official
answer to this question. I plan to address each of your follow-up
questions in order, and I'll include a slightly modified version of
the Spider code that uses much less memory to do the job.

leapinglizard

Clarification of Question by coolguy90210-ga on 20 Sep 2004 18:56 PDT
leapinglizard,

No problem.  Take your time.  While I wait, I will continue to spider.
I'm at work now, so I can't kick off any more runs until I return home
in 6 hours.

Not sure which issues/questions you will be able to address.  I've
given the matter considerable thought, and it would seem to me that
being able to catch the error, make note of it somewhere (a file), and
continue on would be ideal.  That way, even if there are problems, I
can go back and manually review the problem URLs.  I'm making the
assumption that it is a problem URL causing some type of
timeout/pause, leading to more threads, and finally resulting in the
memory problem.  As I mentioned before, I'm weak on the threading
portion, so the above assumption is probably full of holes.
Answer  
Subject: Re: Need a re-write of a Java / Python Spider
Answered By: leapinglizard-ga on 20 Sep 2004 19:44 PDT
Rated: 5 out of 5 stars
 
Dear coolguy90210,

Among the posts you made above after I had posted my multi-threaded
Spider code, the first few mention a problem you observed with the
single-threaded version. In particular, the program was going into a
trance around the 8698th URL in your database. If I were in your place
and were inclined to investigate this problem, my first step would be
to run the single-threaded code on just the five or ten URLs around
that point. Once I had definitively identified the one URL that was
causing all the trouble, I would visit it manually with a web browser
to investigate the characteristics of the web page. I would also insert
many println() statements into the program and rerun it just for that
one URL, carefully observing its behavior. Does it get confused while
opening the connection? Or while reading the contents of the web site?

I can't say for certain, but I suspect the problem is that a web site
begins feeding information to the crawler, then abruptly stops. The
crawler keeps waiting, since it has not yet received an End Of File (EOF)
character and believes there's more text to come. I have adopted this
as my hypothesis because the business of requesting and establishing
an HTTP connection is conducted according to a fairly well defined and
widely respected protocol. I would be surprised if Java's URL class
didn't have a reasonable timeout to deal with web servers that can't
make an HTTP connection according to spec. I might be wrong. The only
way to find out for sure is to carry out close debugging.

A heavy-duty solution to the problem of unpredictable timeouts is to
use supervisor threads that act something like killer droids. You can
either have a bunch of these supervisor threads polling the crawler
threads and terminating any that are taking too long to do their job,
or you can launch a supervisor together with each crawler. Under a buddy
system of this kind, the only job of the supervisor is to terminate its
associated crawler as well as itself if the crawling isn't done after,
say, 30 seconds. Otherwise, if the crawler finishes on time, it terminates
its supervisor and itself.
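
For concreteness, here is a rough, untested sketch of that buddy system; the
Watchdog name and the 30-second figure are only illustrations, and note that
interrupt() will not always unblock a thread stuck in a socket read, so a real
implementation might also close the crawler's stream from the watchdog.

// Untested sketch of a supervisor ("buddy") thread.  It waits up to a fixed
// interval for its crawler to finish; if the crawler is still alive after the
// deadline, the watchdog tries to cancel it.  Names are illustrative only.
class Watchdog extends Thread {
    private Thread crawler;
    private long timeoutMillis;

    Watchdog(Thread crawler, long timeoutMillis) {
        this.crawler = crawler;
        this.timeoutMillis = timeoutMillis;
    }

    public void run() {
        try {
            crawler.join(timeoutMillis);   // returns when the crawler dies or the deadline passes
        } catch (InterruptedException e) {
            return;                        // crawler finished early and dismissed us
        }
        if (crawler.isAlive())
            crawler.interrupt();           // overdue: ask the crawler to give up
    }
}

// Usage inside the Spider loop would look roughly like:
//     Thread crawler = new PageCrawler(this, links[i]);
//     Thread watchdog = new Watchdog(crawler, 30 * 1000);
//     crawler.start();
//     watchdog.start();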

The next item you mention is that you'd like to know how many crawler
threads are running at once in the multi-threaded Spider. In the code
below, I'll modify the periodic output so that it shows the number of
threads currently running, but this should almost always be equal to
maxThreadCount. This is because as soon as one crawler is finished, the
next one starts up. The transition is practically instantaneous. To see
only 19 crawlers running instead of the maximum 20 would be a rare treat.

You also express an interest in the upper limit on the number of threads
that could run at one time. The answer is complicated because the crawler
threads are not merely using CPU bandwidth, they're also using Internet
bandwidth. If you were running threads that didn't access the disk or
the network, you could comfortably run thousands of them, even tens of
thousands at a time, on a fast machine equipped with a robust operating
system such as Linux. When it comes to web crawling, however, you will
probably congest your Internet connection once you've got several dozen
threads downloading webpages at the same time. Divide the incoming
bandwidth of your home computer with the outgoing bandwidth of your
average website, add something like a twofold factor to account for
latency, and that's roughly the number of crawlers you can reasonably
operate at the same time. I know that Korea has the fastest consumer
broadband in the world, so you might be able to run a couple of hundred
crawlers on a good hookup. I'm quite willing to bet that 25 crawlers is
a safe number on a DSL modem anywhere in the world, and I'd even feel
safe with 50.
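
To put illustrative numbers on that rule of thumb (the figures are hypothetical,
not measurements): an 8 Mbit/s downstream line divided by an average site serving
roughly 250 kbit/s works out to about 32 simultaneous downloads at saturation,
and the twofold latency allowance raises the comfortable ceiling to somewhere
around 60 crawlers.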

You asked about setting the HTTP user agent for a Java crawler. It is
indeed a nice habit to inform web sites of your crawler's identity, but
I'm afraid I can't help you on this score. I'm more familiar with the
Python web-access methods myself. I looked at the HttpURLConnection class,
but it doesn't seem to offer anything relevant. You could certainly send
the user-agent information if you wrote your own URL access methods that
talk directly to the socket.
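
For what it's worth, the setRequestProperty() call in your snippet belongs to
URLConnection (the object returned by url.openConnection()) rather than to URL
itself, so a substitution along the lines of your own snippet would presumably
drop into the Scraper constructor's try block roughly as follows. I have not
tested this myself, and the user-agent string is just a placeholder.

// Untested sketch: replace the url.openStream() lines in the Scraper
// constructor's try block with a URLConnection so a User-Agent header can be
// sent.  java.net.* is already imported by the posted code.
url = new URL(address);
URLConnection connection = url.openConnection();
connection.setRequestProperty("User-Agent", "Bot Information"); // identify the crawler
stream = new InputStreamReader(connection.getInputStream());
reader = new BufferedReader(stream);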

At one point, you inquire about the line

                PageCrawler crawler = new PageCrawler(this, links[i]);

and essentially proceed to answer your own question. Yes, "this" refers
to the Spider object inside which this line is being executed. The Spider
object is passing a reference to itself, so that the PageCrawler object
can subsequently call a method of the Spider object. In particular,
the PageCrawler says

                        spider.unlock();

when it is about to terminate, thereby informing the Spider object that it
may proceed to launch a new PageCrawler. The practice of objects retaining
references to each other is an integral part of the object-oriented
programming (OOP) style. If you hear OOP practitioners talking about
objects passing messages to each other, what they actually mean is that
object A, having a reference to object B, calls some method B.foo()
to notify B of something. In reply, B might call A.bar(), and then we
have an OOP dialogue going.

The OutOfMemoryError exception is a difficult matter. I have no clearcut
answer for you, but I can mention some possibilities. First, you may
want to upgrade to a newer version of the Java Runtime Environment if
you haven't done so lately. Memory management is a known problem in
Java, and while Sun engineers keep improving the heap allocation and
garbage collection, it's still far from perfect. Your hypothesis of
excessive thread propagation is off the mark, I hope, or else it would
mean that I'd coded the threading incorrectly. The lock() and unlock()
methods are meant to prevent the thread count from ever exceeding the
value of maxThreadCount.

Upon reviewing the Spider.java code, what strikes me as the greatest
source of memory usage is that the Scraper constructor reads an entire
web page into a String before proceeding to search for the <title>
and <meta...> patterns. One quick fix, which I've implemented below,
is to search for matching patterns on a line-by-line basis. Thus,
only a single line is stored at a time. This does mean that you can't
extract your information when the pattern is spread over several lines,
but I believe such cases are rare. Note that I've also added a check
for the </head> tag, since the title and description should only be
found in the HTML header. The most general approach would be to read
the web page one character at a time or in slightly larger chunks,
accumulating a sequence of them only when they begin to match one of
your patterns. I don't know whether you have the patience or the need
for such a solution. An intermediate solution would be to read the entire
header into one String, without the body, and pattern-match in that.

Finally, you are right to presume that the OutOfMemoryError exception,
like all other exceptions, can be caught in your program. If there's
a particular segment of code that does something bad to the memory,
you might be able to narrow it down by enclosing smaller and smaller
blocks of code with an appropriate try...catch statement. Alternatively,
and this is something I often do, you can just sprinkle println()
statements throughout the program to display the current value of the
most interesting variables. Then, at the point of failure, you'll have
a customized debugging trace to review.
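
As a sketch only, the narrowing-down might look like this around the
Scraper call in PageCrawler.run(), with the println() lines doubling as
the debugging trace:

        try {
            System.out.println("about to scrape " + link.url);
            PageData page = new Scraper(link.url).getPageData();
            insertData(page.title, page.desc, link.url, link.id, link.father_id);
            System.out.println("scraped and stored " + link.url);
        } catch (OutOfMemoryError e) {
            System.out.println("ran out of memory on " + link.url);
        }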

Was there anything else? I can't recall more at the moment. In the
event of further trouble, I'll be glad to respond to your Clarification
Requests.

Regards,

leapinglizard



//----------begin Spider.java

import java.io.*;
import java.sql.*;
import java.util.*;
import java.util.regex.*;
import java.net.*; 

class PageData {
    String title, desc;
    public PageData (String title, String desc) {
        this.title = title;
        this.desc = desc;
    }
    void setTitle(String title) {
        this.title = title;
    }
    void setDesc(String desc) {
        this.desc = desc;
    }
}

class Scraper { 
    int pFlags = Pattern.CASE_INSENSITIVE;
    Pattern titlePattern = Pattern.compile(
            "<title>(.*?)</title>", pFlags);
    Pattern descPattern = Pattern.compile(
            "<meta\\s+name=\"description\"\\s+content=\"(.*?)\"", pFlags);
    Pattern endPattern = Pattern.compile(
            "</head>", pFlags);
    PageData pageData;

    public Scraper(String address) {
        URL url;
        InputStreamReader stream = null;
        BufferedReader reader = null;
        String line;
        pageData = new PageData("_no_title_", "_no_description_");
        boolean gotTitle = false, gotDesc = false;
        Matcher matcher;
        try {
            url = new URL(address);
            stream = new InputStreamReader(url.openStream());
            reader = new BufferedReader(stream);
            while ((line = reader.readLine()) != null) {
                if ((matcher = titlePattern.matcher(line)).find()) {
                    pageData.setTitle(matcher.group(1).trim());
                    gotTitle = true;
                }
                if ((matcher = descPattern.matcher(line)).find()) {
                    pageData.setDesc(matcher.group(1).trim());
                    gotDesc = true;
                }
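                // stop reading once the header ends or both items have been found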
                if (endPattern.matcher(line).find() || (gotTitle && gotDesc))
                    return;
            }
        } catch (MalformedURLException e) {
            pageData.setTitle("_malformed_URL_");
        } catch (IOException e) {
            pageData.setTitle("_inaccessible_URL_");
        }
    }

    public PageData getPageData() {
        return pageData;
    }
}

class LinkData {
    int id, father_id;
    String url;
    public LinkData (String url, int id, int father_id) {
        this.url = url;
        this.id = id;
        this.father_id = father_id;
    }
}

class PageCrawler extends Thread {
    Spider spider;
    LinkData link;

    PageCrawler(Spider spider, LinkData link) {
        this.spider = spider;
        this.link = link;
    }

    public void run() {
        PageData page = new Scraper(link.url).getPageData();
        insertData(page.title, page.desc, link.url, link.id, link.father_id);
        spider.unlock();
    }

    void insertData(String t, String d, String u, int l, int f) {
        Connection conn;
        String dbURL = spider.dbURL;
        String username = spider.username, password = spider.password;
        try {
            Class.forName("com.mysql.jdbc.Driver").newInstance();
        } catch (Exception ex) {
            ex.printStackTrace();
        }
        try {
            String insert = "INSERT INTO spider (title,description,url,"
                + "net_Links_ID,net_Category_FatherID) values (?,?,?,?,?)";
            conn = DriverManager.getConnection(dbURL, username, password);
            PreparedStatement insertData = conn.prepareStatement(insert);
            insertData.setString(1, t);
            insertData.setString(2, d);
            insertData.setString(3, u);
            insertData.setInt(4, l);
            insertData.setInt(5, f);
            insertData.executeUpdate();
            insertData.close();
            conn.close();
        } catch (java.sql.SQLException ex) {
            ex.printStackTrace();
        }
    }
}

public class Spider {
    int maxThreadCount = 50;
    String dbURL = "jdbc:mysql://domainname.com/dbName";
    String username = "username", password = "password";
    int threadCount, total, count, row_num, ct;

    public Spider() {
        threadCount = 0;
        count = 0;
        row_num = 1000;
        total = countData();
        System.out.println("count = "+count+", total = "+total);
        while(count < total) {
            ct = count;
            row_num = Math.min(row_num, total-count);
            System.out.println("row_num = "+row_num);
            LinkData links[] = selectData(count, row_num);
            System.out.println("found "+links.length+" links");
            for (int i = 0; i < links.length; i++) {
                lock();
                PageCrawler crawler = new PageCrawler(this, links[i]);
                crawler.start();
                if (++ct % 100 == 0) {
                    System.out.println("launched "+ct+" crawlers so far");
                    System.out.println("    ("+threadCount+" simultaneous)");
                }
            }
            count += row_num;
            System.out.println("count = "+count);
        }
        System.exit(0);
    }

    public synchronized void lock() {
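        // block until a crawler slot is free, then claim one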
        while (threadCount == maxThreadCount)
            try {
                wait(); 
            } catch (InterruptedException e) {}
        threadCount++;
    }

    public synchronized void unlock() { 
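        // release a crawler slot and wake the Spider if it is waiting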
        threadCount--;
        notify();
    }

    private LinkData[] selectData(int in, int nmr) {
            // queries 3 tables in database to retrieve URLs
            Vector vector = new Vector();
            Connection conn;
            try {
                String select = "SELECT net_Links.ID, net_Links.URL,"
                    + " net_Links.TITLE, net_Category.FatherID"
                    + " FROM net_Links,net_Category, net_CatLinks"
                    + " WHERE net_Category.Full_Name LIKE 'Health%'"
                    + " AND net_Category.ID = CategoryID AND"
                    + " net_Links.ID = LinkID LIMIT "+count+","+row_num;
                System.out.println("select = \""+select+"\"");
                conn = DriverManager.getConnection(dbURL, username, password);
                Statement statement = conn.createStatement();
                ResultSet rs = statement.executeQuery(select);
                while (rs.next()) {
                    String url = rs.getString("net_Links.URL");
                    int id = rs.getInt("net_Links.ID");
                    int father_id = rs.getInt("net_Category.FatherID");
                    vector.add(new LinkData(url, id, father_id));
                }
                statement.close();
                conn.close();
            } catch (java.sql.SQLException ex) {
                ex.printStackTrace();
            }
            LinkData links[] = new LinkData[vector.size()];
            for (int i = vector.size()-1; i >= 0; i--)
                links[i] = (LinkData) vector.get(i);
            return links;
    }

    private int countData() {
        int ct = 0;
        Connection conn;
        try {
            // counts number of rows; will use to construct final SQL statement
            Class.forName("com.mysql.jdbc.Driver").newInstance();
            String select_count = "SELECT count(net_Links.ID) AS total"
                + " FROM net_Links,net_Category, net_CatLinks WHERE"
                + " net_Category.Full_Name like 'Health%'"
                + " AND net_Category.ID = CategoryID AND net_Links.ID = LinkID";
            conn = DriverManager.getConnection(dbURL, username, password);
            Statement statement = conn.createStatement();
            ResultSet rs = statement.executeQuery(select_count);
            while (rs.next())
                ct += rs.getInt("total");
            System.out.println("count = "+ct);
            statement.close();
            conn.close();
        } catch (Exception ex) {
            System.out.println("countData exception");
            ex.printStackTrace();
        }
        return ct;
    }

    public static void main(String args[]) {
        new Spider();
    }
}

//----------end Spider.java

Clarification of Answer by leapinglizard-ga on 21 Sep 2004 08:47 PDT
I notice that I forgot the Pattern.DOTALL flag in earlier versions of
the program. It is unnecessary in the latest version, which searches
for patterns in one line at a time.

leapinglizard

Request for Answer Clarification by coolguy90210-ga on 21 Sep 2004 16:35 PDT
leapinglizard,

Program works flawlessly after 1 modification.

I had to add a BufferedReader close statement, outside of the while
loop in the Scraper class.  I did this by adding the line
reader.close().  The reason for this is that I saw the following
paraphrased error:  "Unable to open a connection due to too many files
open".  I monitored this with a couple of linux commands to count the
number of open files/connections associated with the PID of the Spider
process.  The number constantly grew until it hit 1024, and then the
error would occur.

I increased the maximum number of open files via the standard
addition to the /etc/security/Can't Remember Name.file.  However, I
would still get the same error.  I reviewed the error message and
determined it was occurring at 4 possible locations in the program:
countData(), selectData(), insertData() or Scraper().  My immediate
suspicion was the Scraper class, because the above-mentioned linux
commands showed that the open files were actually open socket
connections to different urls.  I added the reader.close() statement
and observed the number of open connections again.  This time the
number of open connections would vary between 100 and 400, always
coming back down again if they approached 400, usually staying around
180 or so.
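
For anyone making the same change, one way to be sure the close runs
even when the loop returns early is a finally clause at the end of the
Scraper constructor; this is only a sketch of the idea:

        } catch (IOException e) {
            pageData.setTitle("_inaccessible_URL_");
        } finally {
            try {
                if (reader != null)
                    reader.close();  // also closes the underlying stream
            } catch (IOException e) {}
        }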

The next addition I made was to create a String category variable to
take the place of the portion of the WHERE clause of the SELECT
statement that reads LIKE 'Health%'.  This same WHERE clause occurs
twice, so as I go through my categories it is better to edit one
place.

The final addition was to add a second dbURL variable and to use the
appropriate one in each of the three methods that make database calls.
With that addition, I can run the Spider from my various dedicated
servers; all of them will get their count and urls from the master
server, but insert their data locally.
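
Concretely, with placeholder host and variable names, the new fields
look roughly like this:

    String category = "Health%";                                    // used by both WHERE clauses
    String selectDbURL = "jdbc:mysql://master.example.com/dbName";  // countData() and selectData()
    String insertDbURL = "jdbc:mysql://localhost/dbName";           // insertData()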

Last night I completed the 310,000 category.  I completed a 228,000
category, and I'm working on a 900,000 category.  The 900,000 category
stalled at around 400,000 this morning.   I suspect it is related to
our previous discussions.  In any event, upon my return from home,
I'll simply skip ahead by 1000 and continue on.  At some point, though,
I will carefully examine all of the urls within the failed range of
1000, looking for the troublesome url, and then hopefully I will be
able to write a catch statement for it.

leapinglizard, you are one cool guy!  I'm going to rate this 5 stars. 
I'm going to pay you the agreed 100 USD; however, I'd like to add some
more to that, so I'm going to review a few answers to see what is
appropriate.  I'd ask, but you wouldn't have time to reply, and it is
a bit of a faux pas.  Back in a moment.

Clarification of Answer by leapinglizard-ga on 21 Sep 2004 18:10 PDT
Thank you for the kind words and the handsome tip. I'm glad I was able
to assist you in this matter. I was nervous about handing you untested
code, but you seem to have quite a knack for debugging. If any serious
problems should crop up in connection with this program, I'll be glad
to offer further assistance through the Answer Clarification
mechanism.

leapinglizard
coolguy90210-ga rated this answer: 5 out of 5 stars and gave an additional tip of: $100.00
Leapinglizard knows his Java!  To anyone looking for a Java coder,
Leapinglizard is the one!  He spent quite a bit of time working on my
original code, and improving it to the point of it being a complete
re-write.  He gets 5 stars for his excellent comments, 5 stars for his
Java coding, and 5 stars for his overall expertise.

The spider works, and works exceptionally well.  I've reviewed
probably close to 20 different spider applications, paid and open
source, and the one I originally developed, which he took and
re-wrote, is in fact the best.  It is easy to see exactly what it is
doing, and it is easy to adjust it for personal circumstances.  And,
most importantly, it is FAST: 9 urls per second with the original
threaded version, and much faster with the final revision.

This is an industrial-strength spider application.  It is begging to
be made into a parallel / RMI application.

Comments  
Subject: Re: Need a re-write of a Java / Python Spider
From: efn-ga on 18 Sep 2004 12:45 PDT
 
leapinglizard has both more time and more expertise for this question
than I, but thanks for remembering me!

--efn
Subject: Re: Need a re-write of a Java / Python Spider
From: coolguy90210-ga on 18 Sep 2004 20:31 PDT
 
efn,

Good to see you are still around.  I have other ideas/questions coming
up so keep your eyes open.  May not be about coding.  May be about
advice for setting up a web site that has a specific theme or goal.
