Q: Need a re-write of a Java / Python Spider (Answered, 5 out of 5 stars, 2 Comments)
Question  
Subject: Need a re-write of a Java / Python Spider
Category: Miscellaneous
Asked by: coolguy90210-ga
List Price: $25.00
Posted: 17 Sep 2004 02:52 PDT
Expires: 17 Oct 2004 02:52 PDT
Question ID: 402414
This question is specifically for efn-ga; however, as he may not want to do it,
anyone with Java and Python skills is welcome to give it a go.

I need a rewrite of a Java / Python spider that I wrote myself.  I've
been tweaking it off and on as time permits; however, I now have
something else that is more pressing, hence I need some help in making
this one perfect.

1)  I've only tested it on up to 500 or so URLs, and I'm confident it
works at that scale.  I recently ran a test of 60,000 URLs, and it
stalled at 3684, so that might be an issue, or it might be CPU
related.  It never taxes my server, so I don't think that is the issue.

2)  I'm sure the code is inefficient, as it was a first draft.  Please correct that.

3)  The code passes the actual URL visit off to a Python script I
wrote.  The reason for this is that the following Python code:

socket.setdefaulttimeout(30)
remotefile = urllib._urlopener.open(x)
content = remotefile.read()

can't be reproduced (easily) with the Java URLConnection class.  No
method within the URLConnection class exists to set a timeout.  Of
course I could get into sockets, but I don't have the time (pardon the
pun).

I'd like this issue addressed.  It can be addressed in one of two ways:
a)  Use the Java Socket class or another such class to set a timeout,
and rewrite the Python portion as a method within my Spider class (a
rough sketch of this approach appears right after this list).
b)  Just rewrite the whole thing in Python.
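
For approach (a), something along these lines is what I have in mind.  This is
only a rough, untested sketch; the class name, host, and the 30-second value are
placeholders, and it speaks plain HTTP/1.0 through a socket instead of using
URLConnection.

// Rough, untested sketch of option (a): fetch a page with an explicit read
// timeout using java.net.Socket.  Class name, host, and timeout are placeholders.
import java.io.*;
import java.net.*;

public class TimeoutFetch {

    // Returns the raw HTTP response (headers plus body) or throws on timeout/error.
    public static String fetch(String host, String path, int timeoutMillis)
            throws IOException {
        Socket socket = new Socket(host, 80);
        socket.setSoTimeout(timeoutMillis); // read() throws SocketTimeoutException after this
        try {
            Writer out = new OutputStreamWriter(socket.getOutputStream(), "ISO-8859-1");
            out.write("GET " + path + " HTTP/1.0\r\n");
            out.write("Host: " + host + "\r\n");
            out.write("User-Agent: Bot Information\r\n\r\n");
            out.flush();

            BufferedReader in = new BufferedReader(
                    new InputStreamReader(socket.getInputStream()));
            StringBuffer buf = new StringBuffer();
            String line;
            while ((line = in.readLine()) != null)
                buf.append(line).append("\n");
            return buf.toString();
        } finally {
            socket.close(); // always release the connection
        }
    }

    public static void main(String[] args) throws IOException {
        // 30000 ms mirrors the socket.setdefaulttimeout(30) call in the Python script.
        System.out.println(fetch("www.example.com", "/", 30000));
    }
}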

4)  And finally, my last planned feature was to make this a threaded
application.  Please add threading to it.  I think up to 4 threads
running at once would be OK; however, assuming memory is not an issue
(and it isn't for me), I could see up to 10 or 20 threads running at
the same time.

FYI, the spider hits only one page of a site.  It will not be used to
retrieve multiple pages from a site, hence I haven't incorporated a
reading of the robots.txt file.

I don't see editing and improving my code as taking more than 1 hour. 
Please advise.  I'll start with 25 USD.

Here is the Python code:

import urllib
import re
import sys
import string
import socket

class AppURLopener(urllib.FancyURLopener):
    def __init__(self, *args):
        self.version = "Bot Information"
        urllib.FancyURLopener.__init__(self, *args)

urllib._urlopener = AppURLopener()

x = sys.argv[1]
socket.setdefaulttimeout(30)

try:
	#remotefile = urllib.urlopen(x)
	remotefile = urllib._urlopener.open(x)
	content = remotefile.read()

	p = re.compile('<TITLE>.*?</TITLE>',re.DOTALL|re.IGNORECASE)
	q = re.compile('<meta.*?>',re.DOTALL|re.IGNORECASE)
	r = re.compile('<meta.*?description.*?content="',re.DOTALL|re.IGNORECASE)
	s = re.compile('<TITLE>|</TITLE>',re.DOTALL|re.IGNORECASE)

	plist = p.findall(content)
	qlist = q.findall(content)

	remotefile.close()

	for x in plist:
		m = p.match(x)	
		if m:
			#print 'Match found: ', m.group()
			x = re.sub(s, '', x)
			x = string.strip(x)
			print '<TITLE>',x,'<TITLE>'		
			break

	for x in qlist:
		m = r.match(x)	
		if m:
			#print 'Match found: ', m.group()
			x = re.sub(r, '', x)
			x = string.replace(x, '">', '')
			x = string.strip(x)
			print '<DESC>',x,'<DESC>'
			break  #need to break out for those sites where the designer made a mistake and has multiple descriptions
	
except Exception, e:
  print '<TITLE>Timeout<TITLE>'
  print '<DESC>Timeout<DESC>'

###End Python Code###

Here is the Java Code:

import java.io.*;
import java.sql.*;
import java.util.List;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.ListIterator;
import java.util.Collections;

public class Spider {

	private ArrayList urlList;
	private ArrayList selectList;
	private String urlString;

	//SelectData variables
	private Statement stmt;
	private Connection conn;
	private String url = "jdbc:mysql://domainname.com/dbName";
	private String username = "username";
	private String password = "password";
	private Statement selectData;
	private ResultSet rs;
	private int total=0;
	private int initial=0;
	private int num_of_rows = 1000;

	private class ResultSetData {

		protected int linkID;
		protected String linkURL;
		protected int linkFatherID;

		public ResultSetData (int lid, String lur, int fid) {

			linkID = lid;
			linkURL = lur;
			linkFatherID = fid;

		}//end ResultSetData constructor

	}//end ResultSetData class

	private Spider() {

		urlList = new ArrayList();

		total = countData();

		System.out.println("The total is:  " + total);
		System.out.println("The inital is:  " + initial);
		System.out.println("The num_of_rows is:  " + num_of_rows);

		while(initial < total) {

			if(total - initial > num_of_rows) {
				num_of_rows = num_of_rows;
			}//end if

			else {
				num_of_rows = total - initial;
			}//end else

			System.out.println("Initial = " + initial + "and Total = " + total);
			urlList = SelectData(initial,num_of_rows);

			Iterator it = urlList.iterator();
			int y = urlList.size();

			System.out.println("Size is:  " + y);

			//for(int i=0;i<100;i++) {
			while (it.hasNext()) {

				System.out.println("In while loop....and size is " + y);

				//You need to cast the it.next elements to the appropriate Object.
				ResultSetData rsd = (ResultSetData) it.next();

				//urlString = (String) it.next();
				urlString = rsd.linkURL;

				//Test effect of bogus URL
				//urlString = "http://www.unknownunknowndoesnotexit.com";

				String rp = RunPython(urlString);
				//System.out.println("Now back in constructor");
				//System.out.println(rp);
				String [] sp = SplitText(rp);

				String title	 = "Not available.";
				String desc		 = "Not available.";
				String strUrl	 = "Not available.";
				int link_id	     = rsd.linkID;
				int father_id    = rsd.linkFatherID;

				//System.out.println("Now in for loop...");
				for(int h=0; h<sp.length; h++) {

					//System.out.println(sp[h]);

					if(sp[h].startsWith("<TITLE>")) {
						title = sp[h].replaceAll("<TITLE>","").trim();
					}//end if
					else if(sp[h].startsWith("<DESC>")) {
						desc = sp[h].replaceAll("<DESC>","").trim();
					}//end if
					else if(sp[h].startsWith("<URL>")) {
						strUrl = sp[h].replaceAll("<URL>","").trim();
					}//end if

				}//end for

				ShowToday st = new ShowToday();
				//System.out.println(st.demo());
				//System.out.println("Data retrieved...");
				//System.out.println("URL = " + strUrl);
				//System.out.println("TITLE = " + title);
				//System.out.println("DESC = " + desc + "\n");

				InsertData(title,desc,strUrl,link_id,father_id);

			}//end while

			initial = initial + num_of_rows;
			System.out.println("The inital is now:  " + initial);

	}//end while

		System.exit(0);

	}//end constructor

	public String RunPython(String u) {

		urlString = u;
		String s;
		String text = "<URL>" + u + "<URL>";

		try {

			// run the python application

			Process p = Runtime.getRuntime().exec("python2.3 urlcontent.py " + urlString);

			BufferedReader stdInput = new BufferedReader(new
				 InputStreamReader(p.getInputStream()));

			BufferedReader stdError = new BufferedReader(new
				 InputStreamReader(p.getErrorStream()));

			while ((s = stdInput.readLine()) != null) {
				//System.out.println(s);
				text += "_:,:_" + s;
			}//end while

			//System.exit(0);

		}//end try

		catch (IOException e) {
			System.out.println("IOException");
			e.printStackTrace();
			System.exit(-1);
		}//end catch

		return text;

	}//end method

	public String [] SplitText(String i) {

		String text = i;
		String [] split_text = text.split("_:,:_");
		return split_text;

	}//end method

	private void InsertData(String t, String d, String u, int l, int f) {

		PreparedStatement stmt;
		Connection conn;

		try {

			Class.forName("com.mysql.jdbc.Driver").newInstance();

		}//end try

		catch (Exception ex) {
			ex.printStackTrace();
		}//end catch

		try {

			String insert = "INSERT INTO spider (title,description,url,net_Links_ID," +
				"net_Category_FatherID) values (?,?,?,?,?)";

			String url = "jdbc:mysql://domainname.com/dbname";
			String username = "username";
			String password = "password";

			conn = DriverManager.getConnection(url, username, password);

			PreparedStatement insertData = conn.prepareStatement(insert);
			insertData.setString(1, t);
			insertData.setString(2, d);
			insertData.setString(3, u);
			insertData.setInt(4, l);
			insertData.setInt(5, f);
			insertData.executeUpdate();

			insertData.close();
			conn.close();

		}//end try

		catch (java.sql.SQLException ex) {
			ex.printStackTrace();
		}//end catch

	}//end method

	private ArrayList SelectData(int in, int nmr) {

			/* This try block queries 3 tables in my database and retrieves a slice of the URLs that I
			want to spider.  I can provide a CREATE statement and some data for troubleshooting.
			*/
			try {
				selectList = new ArrayList();
				String select = "SELECT net_Links.ID, net_Links.URL,
net_Links.TITLE, net_Category.FatherID " +
				"FROM net_Links,net_Category, net_CatLinks " +
				"WHERE net_Category.Full_Name LIKE 'Health%' " +
				//"WHERE net_Category.FatherID = '72658' " +
				//"AND net_Category.ID = CategoryID AND net_Links.ID = LinkID";
				"AND net_Category.ID = CategoryID AND net_Links.ID = LinkID LIMIT
" + initial + "," + num_of_rows;

				System.out.println(select);

				conn = DriverManager.getConnection(url, username, password);
				selectData = conn.createStatement();
				rs = selectData.executeQuery(select);

				while (rs.next()) {
					int	   id = rs.getInt("net_Links.ID");
					String ul = rs.getString("net_Links.URL");
					int    fd = rs.getInt("net_Category.FatherID");
					ResultSetData rsd = new ResultSetData(id, ul, fd);
					selectList.add(rsd);
				}//end while

				selectData.close();
				conn.close();
				System.out.println("The select statement is:\n\n" + select);
			}//end try

			catch (java.sql.SQLException ex) {
				ex.printStackTrace();
			}//end catch

			return selectList;

	}//end method

	private int countData() {

		try {
			/* This try block counts the number of rows that could be returned if all rows were to be selected.
			The idea is to use this number to determine the final loop's SQL statement.
			*/
			Class.forName("com.mysql.jdbc.Driver").newInstance();
			String select_count = "SELECT count(net_Links.ID) AS total FROM net_Links,net_Category, net_CatLinks " +
				"WHERE net_Category.Full_Name like 'Health%' " +
				//"WHERE net_Category.FatherID = '72658' " +
				"AND net_Category.ID = CategoryID AND net_Links.ID = LinkID";
			conn = DriverManager.getConnection(url, username, password);
			selectData = conn.createStatement();
			rs = selectData.executeQuery(select_count);

			while (rs.next()) {
				total = rs.getInt("total");
			}//end while

			System.out.println(total);

		}//end try

		catch (Exception ex) {
			System.out.println("countData exception");
			ex.printStackTrace();
		}//end catch

		return total;

	}//end method

	public static void main(String args[]) {
		Spider sp = new Spider();
	}//end main

}//end class

Request for Question Clarification by leapinglizard-ga on 17 Sep 2004 04:01 PDT
I consider myself a proficient Python programmer, and I estimate it
would take four hours to rewrite your code as a multithreaded Python
script. Then again, these things always take longer than one thinks.
In the event of overtime, you could tip accordingly. See previous
examples of my work below.

http://answers.google.com/answers/threadview?id=121280

http://answers.google.com/answers/threadview?id=399628

Also, do you have a publicly available version of your SQL database?
Perhaps a dummy version that you take down once the work is done? It
would be an immense help in testing the script.

Let me know if you want me to tackle the job.

leapinglizard

Clarification of Question by coolguy90210-ga on 17 Sep 2004 08:05 PDT
I'm posting this in response to leapinglizard-ga's post on 17 Sep 2004 04:01 PDT.

leapinglizard-ga:

I've reviewed the two examples of your work that you gave, as well as
all of your remaining examples in your profile.  I must say that I am
impressed with the level of detail you put into your answers.

Based on your record I will accept your offer to tackle the job. 
Everyone else, feel free to briefly describe your interest and
approach.  If leapinglizard-ga's is unable to answer then I'll have to
quickly move on to another expert.  However, I'm assuming
leapinglizard-ga's answer will be on par with his past answers, and if
so, I will accept his work as the answer.

RE:  public database.  Let me consider the alternatives.  For now, to
get started, you really only need an array list of URLs.  I'll have
some type of DB data for you within 8 hours, as it is 12 midnight for
me.

Clarification of Question by coolguy90210-ga on 17 Sep 2004 08:07 PDT
leapinglizard-ga,

Also, re:  number of hours.  I'm willing to accept the 4-hour
timetable, and as I want to have this done ASAP, I'll go ahead and
accept an upper limit of 100 USD, i.e. a 75-dollar tip, assuming it
works as requested.

Request for Question Clarification by leapinglizard-ga on 17 Sep 2004 16:52 PDT
I've been busy with another scripting project, but I'll soon be done
and then I'll get started on yours. If I run into any serious
obstacles, I'll notify you via a further Clarification Request.
Otherwise, I expect to have a first draft, subject to modification and
debugging, ready tonight. It is currently 7:47 pm in my time zone. I
will also notify you if I need more details on the database, although
most of the work can be done independently of it, as you point out. If
I can't post a first draft tonight, I'll abandon the project and
invite other code-savvy Researchers to take a crack at it.

leapinglizard

Clarification of Question by coolguy90210-ga on 17 Sep 2004 18:40 PDT
leapinglizard-ga,

RE:  4 hours.  I did not mean that it had to be done in 4 hours.  I
meant that if it took up to 4 hours of your time, that would be OK. 
If you could have it done by the end of this weekend, that would be
great.

Also, I know more about how Java cleans up memory problems than I do
about Python.  Please ensure that the Python code you are writing will
clean up after itself, i.e. close connections, close running processes,
etc.

Request for Question Clarification by leapinglizard-ga on 17 Sep 2004 18:49 PDT
Aha! When you said ASAP, I thought you really meant it. I may leave
the script until tomorrow morning, then. I'll be sure to keep track of
the amount of time I spend on it.

As for exiting cleanly, I am well aware of the various synchronization
problems pertaining to Python and of the garbage collector's behavior.
I always take care to open and close my files and sockets in the right
order. Never fear, I don't leave references hanging unless I know from
experience that it's safe to do so.

leapinglizard

Clarification of Question by coolguy90210-ga on 18 Sep 2004 20:26 PDT
leapinglizard,

Just responding to Google's request for clarification.  Apparently, if
you post a statement as a clarification, they still want me to respond
with a clarification post, even though you were making a statement
rather than asking a question.

Request for Question Clarification by leapinglizard-ga on 18 Sep 2004 21:31 PDT
Yes, the clarification-request notices can get tiresome. I'm familiar
with them from the other direction, when customers make a statement in
the form of a Request for Answer Clarification.

I started work on your project later than I wanted to, but I've made
good progress over the past hour and a half. I've cleaned up and
rewritten almost everything. The one part of your Java code that I
can't translate for the time being, and it's an indispensable part, is
the dialogue with your mySQL database. I had assumed that there would
be a Python module for accessing SQL databases, but it turns out that
there's no such thing in the standard library.

At this point, several courses of action are possible. I could ask you
to download and install a third-party Python module that provides
mySQL access, but this sort of thing can get messy. Alternatively, I
could write my own functions to talk with your database. I don't know
exactly what the protocol is, but my past experience with running
manual queries on a PostgreSQL database tells me that it could be done
fairly easily by talking through a socket.

Now, in order to suss out the protocol, there are again two
possibilities. One is that you put up a (temporarily) publicly
available version of your database containing some sample content. The
other is that I set up a mySQL database server locally and make a
sample database with a CREATE statement and data that you provide, as
you suggested in your Java comments. This would take a little time, of
course, but it's the only remaining obstacle. I'm confident I can
overcome it within a few more hours. Your time zone is UTC+8, correct?

Let me know how you'd like to proceed, or if you want to fire me altogether.

leapinglizard

Request for Question Clarification by leapinglizard-ga on 18 Sep 2004 22:04 PDT
I've made a decision. I won't attempt to kludge together a mySQL
access function after all. In fact, I've abandoned the attempt to
rewrite your script in Python. I'm sorry.

I could still rewrite it in Java, but I'd have to start over and I
can't guarantee it'll be done by the end of the weekend. It's Saturday
night where I am, but I believe it's Sunday afternoon where you are.
I've been working all day. Must go home and sleep now. I'm getting up
in five hours, at 6am, as I do every morning. If at that point I
haven't heard from you yet, or if you respond that you do want me to
go on, I'll get to work on the Java rewrite. I believe I would be able
to finish that around noon on Sunday, or 1am Monday morning in the
GMT+8 time zone.

leapinglizard

Clarification of Question by coolguy90210-ga on 18 Sep 2004 22:45 PDT
leapinglizard,

Go ahead and write it in Java.  I'm more familiar with Java anyway; I
just don't have the knowledge of threads or sockets, or the time, to
complete the project myself.

Yes, you are correct about my time zone.  It is 14:35 for me in South Korea.

Regarding data.  Let me know what you need.  I'm a little concerned
about putting my site db info here even if it is a test db.

I have no problem getting you the CREATE SQL statements.  To be frank,
the SelectData method simply obtains a list of URLs and puts them into
an array list.  Granted, the select statement is a rather large 3-table
join; however, for your purposes, simply obtaining a list of URLs from
a table should suffice to mirror the function of SelectData().  Bottom
line, we need to count and set a limit on how many URLs are grabbed
into the result set at one time, for obvious reasons, i.e. the result
set returned could contain 300,000 or more URLs.

The table that receives the spidered data for the InsertData method is simply:

CREATE TABLE `spider` (
  `ID` int(10) unsigned NOT NULL auto_increment,
  `title` varchar(100) NOT NULL default '',
  `url` varchar(255) NOT NULL default 'http://',
  `description` text,
  `net_Links_ID` int(10) unsigned default NULL,
  `net_Category_FatherID` int(10) unsigned default NULL,
  PRIMARY KEY  (`ID`)
) TYPE=MyISAM;

Clarification of Question by coolguy90210-ga on 19 Sep 2004 00:58 PDT
leapinglizard,

I'm not married to the InserData method.  MySQL's import utility is
phenomenally fast.  Much faster than anything I've ever written to
import data into MySQL.  Would having the application write to a CSV
file be more efficient?  I hadn't explored that avenue as my test
result sets were no more than a few thousand URL's so it didn't matter
if I inserted the data as I collected it with the InsertData method.

Request for Question Clarification by leapinglizard-ga on 19 Sep 2004 04:19 PDT
CSV is a very simple protocol, and i/o methods for it are
correspondingly lightweight. Nevertheless, if you're happy with
mySQL's performance, you should stick with a database solution. The
nice thing about a database is the data security it offers, what with
journaling and transaction rollbacks. Also consider that when you're
dealing with a CSV file, it's your job to manage insertions, data
queries, and so on. I don't see a good reason to use CSV as anything
more than an intermediary format between compatible spreadsheet
applications and the like, which is in fact its raison d'etre. I
certainly wouldn't use CSV as a basis for structuring and manipulating
data. If I wanted raw speed, I would construct a simple data structure
such as a string array in memory, then serialize it and write it to a
binary file. That way, the data gets compressed while it's in memory,
thereby minimizing disk access times.

leapinglizard

Clarification of Question by coolguy90210-ga on 19 Sep 2004 06:29 PDT
leapinglizard,

The idea behind the comma-separated file suggestion was to have the Java
application write to a file rather than to the database.  At the end of
a run of thousands (probably hundreds of thousands) of URLs, I could
then do a mysqlimport of the data.

In any event, let's stick to the InsertData method of inserting the
data into the mysql table.  Obviously, if you are able to improve upon
the InsertData method, please do so.

Request for Question Clarification by leapinglizard-ga on 19 Sep 2004 07:06 PDT
coolguy90210,

It looks like I'm done with everything but the threading. While I work
on that, please try out the code as it stands. I haven't had time to
set up a database locally, but perhaps you could run the program
yourself and report any problems.

In the event of a crash, please post the output. Furthermore, if you
have a chance to read the code, let me know if you disagree with any
of my optimizations. I've been going too fast to add comments. Do you
need any?

Let me also make a few observations. First, the reason there's no need
to set a socket timeout is that Java's URL class manages timeouts on
its own, as defined in the HTTP standard. Second, it seems to me that
you may have calculated the total number of rows incorrectly in your
countData() method. In the line that reads

                total = rs.getInt("total");

you're setting the total in each iteration of the loop, rather than
incrementing the total. Could this be why your program was halting
after only 3684 pages?

Finally, I'm not sure how you want to handle malformed and
inaccessible URLs. At present, the Scraper constructor makes a dummy
page with the title "_malformed_URL_" or "_inaccessible_URL_", as
appropriate, but this could be handled differently.

leapinglizard



//----------begin Spider.java

import java.io.*;
import java.sql.*;
import java.util.*;
import java.util.regex.*;
import java.net.*;

class PageData {
    String title, desc;
    public PageData (String title, String desc) {
        this.title = title;
        this.desc = desc;
    }
    void setTitle(String title) {
        this.title = title;
    }
    void setDesc(String desc) {
        this.desc = desc;
    }
}

class Scraper {
    String text;
    int pFlags = Pattern.CASE_INSENSITIVE;
    Pattern titlePattern = Pattern.compile(
            "<title>(.*?)</title>", pFlags);
    Pattern descPattern = Pattern.compile(
            "<meta\\s+name=\"description\"\\s+content=\"(.*?)\"", pFlags);

    public Scraper(String address) {
        URL url;
        BufferedReader reader;
        String line;
        StringBuffer buf = new StringBuffer();
        try {
            url = new URL(address);
            InputStreamReader stream = new InputStreamReader(url.openStream());
            reader = new BufferedReader(stream);
            while ((line = reader.readLine()) != null)
                buf.append(line.trim()+" ");
        } catch (MalformedURLException e) {
            text = "<title>_malformed_URL</title>";
            return;
        } catch (IOException e) {
            text = "<title>_inaccessible_URL_</title>";
            return;
        }
        text = buf.toString();
    }

    public PageData getPageData() {
        PageData pageData = new PageData("_no_title_", "_no_description_");
        Matcher matcher;
        if ((matcher = titlePattern.matcher(text)).find())
            pageData.setTitle(matcher.group(1).trim());
        if ((matcher = descPattern.matcher(text)).find())
            pageData.setDesc(matcher.group(1).trim());
        return pageData;
    }
}

class LinkData {
    int id, father_id;
    String url;
    public LinkData (String url, int id, int father_id) {
        this.url = url;
        this.id = id;
        this.father_id = father_id;
    }
}

public class Spider {
    String urlString;
    String dbURL = "jdbc:mysql://domainname.com/dbName";
    String username = "username", password = "password";
    Connection conn;
    ResultSet rs;
    int total, count, row_num;

    public Spider() {
        count = 0;
        row_num = 1000;
        total = countData();
        System.out.println("count = "+count+", total = "+total);
        while(count < total) {
            row_num = Math.min(row_num, total-count);
            System.out.println("row_num = "+row_num);
            LinkData links[] = selectData(count, row_num);
            System.out.println("found "+links.length+" links");
            for (int i = 0; i < links.length; i++) {
                LinkData link = links[i];
                String url = link.url;
                //url = "http://www.unknownunknowndoesnotexit.com";
                PageData page = new Scraper(url).getPageData();
                insertData(page.title, page.desc, url, link.id, link.father_id);
            }
            count += row_num;
            System.out.println("count = "+count);
        }
        System.exit(0);
    }

    private void insertData(String t, String d, String u, int l, int f) {
        try {
            Class.forName("com.mysql.jdbc.Driver").newInstance();
        } catch (Exception ex) {
            ex.printStackTrace();
        }
        try {
            String insert = "INSERT INTO spider (title,description,url,"
                + "net_Links_ID,net_Category_FatherID) values (?,?,?,?,?)";
            String url = "jdbc:mysql://domainname.com/dbname";
            conn = DriverManager.getConnection(dbURL, username, password);
            PreparedStatement insertData = conn.prepareStatement(insert);
            insertData.setString(1, t);
            insertData.setString(2, d);
            insertData.setString(3, u);
            insertData.setInt(4, l);
            insertData.setInt(5, f);
            insertData.executeUpdate();
            insertData.close();
            conn.close();
        } catch (java.sql.SQLException ex) {
            ex.printStackTrace();
        }
    }

    private LinkData[] selectData(int in, int nmr) {
            // queries 3 tables in database to retrieve URLs
            Vector vector = new Vector();
            try {
                String select = "SELECT net_Links.ID, net_Links.URL,"
                    + " net_Links.TITLE, net_Category.FatherID"
                    + " FROM net_Links,net_Category, net_CatLinks"
                    + " WHERE net_Category.Full_Name LIKE 'Health%'"
                    + " AND net_Category.ID = CategoryID AND"
                    + " net_Links.ID = LinkID LIMIT "+count+","+row_num;
                System.out.println("select = \""+select+"\"");
                conn = DriverManager.getConnection(dbURL, username, password);
                Statement statement = conn.createStatement();
                rs = statement.executeQuery(select);
                while (rs.next()) {
                    String url = rs.getString("net_Links.URL");
                    int id = rs.getInt("net_Links.ID");
                    int father_id = rs.getInt("net_Category.FatherID");
                    vector.add(new LinkData(url, id, father_id));
                }
                statement.close();
                conn.close();
            } catch (java.sql.SQLException ex) {
                ex.printStackTrace();
            }
            LinkData links[] = new LinkData[vector.size()];
            for (int i = vector.size()-1; i >= 0; i--)
                links[i] = (LinkData) vector.get(i);
            return links;
    }

    private int countData() { 
        int ct = 0;
        try {
            // counts number of rows; will use to construct final SQL statement
            Class.forName("com.mysql.jdbc.Driver").newInstance();
            String select_count = "SELECT count(net_Links.ID) AS total"
                + " FROM net_Links,net_Category, net_CatLinks WHERE"
                + " net_Category.Full_Name like 'Health%'"
                + " AND net_Category.ID = CategoryID AND net_Links.ID = LinkID";
            conn = DriverManager.getConnection(dbURL, username, password);
            Statement statement = conn.createStatement();
            rs = statement.executeQuery(select_count);
            while (rs.next())
                ct += rs.getInt("total");
            System.out.println("count = "+ct);
        } catch (Exception ex) {
            System.out.println("countData exception");
            ex.printStackTrace();
        }
        return ct; 
    }

    public static void main(String args[]) {
        System.out.println(new Scraper(args[0]).getPageData().title);
        if (true) return;
        new Spider();
    }
}

//----------end Spider.java

Request for Question Clarification by leapinglizard-ga on 19 Sep 2004 07:09 PDT
Whoops! I forget to remove my debugging output from the main() method.
Please change it to the following.

    public static void main(String args[]) {
        new Spider();
    }

leapinglizard

Clarification of Question by coolguy90210-ga on 19 Sep 2004 07:56 PDT
leapinglizard,

I'm running the program now as we speak.  For a moment it looked like
it had stalled on only 13 URLs, but apparently that 14th URL was a
timeout, which was handled by the URLConnection class.

1)  Yes, I know that the URL class will handle the timeout; however, I
could not find what the default timeout duration is, or how to change
it.  Since this will be a threaded application, the timeout doesn't
matter anymore.

2)  You are right, now that I review it, it appears that I made a
mistake with the total variable.

3)  Malformed and missing URLs will be handled by being marked as you
have done, and then at some point I'll just delete them.  I have so
many URLs to go through that I don't have time to mess with problem
URLs.  The designation of "malformed" or "inaccessible" that you have
given and entered into the database is correct.

OK, in any event, I have 256 URLs visited so far, and data is being
entered correctly.  If you can get the thread portion finished, then
we'll be done.  I'll continue to review the changes you made, and will
comment on them later.

Finally, I notice that while the spider is running, it doesn't print
anything other than the counts and the select statement, at least not
for every 1000 URLs collected.  It will print a select statement every
1000 URLs, correct?  That is not a big deal.  If you can think of a
good place to put a progress indicator, that would be great; if not, no
big deal.  I can always open another terminal window and do an SQL
count of the spider table.

Request for Question Clarification by leapinglizard-ga on 19 Sep 2004 09:13 PDT
coolguy90210,

I think I've implemented threaded page downloading. I can't be sure it
works, though. I wrote the code blind, having no database for tests.
All I can say is that the program compiles. Please take it for a spin.

If it works as intended, you'll be able to adjust the value of
maxThreadCount to control the number of simultaneous downloads.

You'll see that I added a line in the main loop to print a message
after every 100 downloads. I don't increment the count variable
directly but a copy of it, since I don't know exactly what's going on
in the database query that sets the total.

I can't think of anything else at the moment, save that a further
argument against CSV occurred to me. The cost of executing insertData
is negligible compared to the cost of the downloads, so you wouldn't
really be saving time by importing the data in one shot.

Again, please send output in the event of a crash, and feel free to
ask any questions that come to mind.

leapinglizard





//----------begin Spider.java

import java.io.*;
import java.sql.*;
import java.util.*;
import java.util.regex.*;
import java.net.*;

class PageData {
    String title, desc;
    public PageData (String title, String desc) {
        this.title = title;
        this.desc = desc;
    }
    void setTitle(String title) {
        this.title = title;
    }
    void setDesc(String desc) {
        this.desc = desc;
    }
}

class Scraper {
    String text;
    int pFlags = Pattern.CASE_INSENSITIVE;
    Pattern titlePattern = Pattern.compile(
            "<title>(.*?)</title>", pFlags);
    Pattern descPattern = Pattern.compile(
            "<meta\\s+name=\"description\"\\s+content=\"(.*?)\"", pFlags);

    public Scraper(String address) {
        URL url;
        BufferedReader reader;
        String line;
        StringBuffer buf = new StringBuffer();
        try {
            url = new URL(address);
            InputStreamReader stream = new InputStreamReader(url.openStream());
            reader = new BufferedReader(stream);
            while ((line = reader.readLine()) != null)
                buf.append(line.trim()+" ");
        } catch (MalformedURLException e) {
            text = "<title>_malformed_URL</title>";
            return;
        } catch (IOException e) {
            text = "<title>_inaccessible_URL_</title>";
            return;
        }
        text = buf.toString();
    }

    public PageData getPageData() {
        PageData pageData = new PageData("_no_title_", "_no_description_");
        Matcher matcher;
        if ((matcher = titlePattern.matcher(text)).find())
            pageData.setTitle(matcher.group(1).trim());
        if ((matcher = descPattern.matcher(text)).find())
            pageData.setDesc(matcher.group(1).trim());
        return pageData;
    }
}

class LinkData {
    int id, father_id;
    String url;
    public LinkData (String url, int id, int father_id) {
        this.url = url;
        this.id = id;
        this.father_id = father_id;
    }
}

class PageCrawler extends Thread {
    Spider spider;
    LinkData link;

    PageCrawler(Spider spider, LinkData link) {
        this.spider = spider;
        this.link = link;
    }

    public void run() {
        PageData page = new Scraper(link.url).getPageData();
        insertData(page.title, page.desc, link.url, link.id, link.father_id);
        spider.unlock();
    }

    void insertData(String t, String d, String u, int l, int f) {
        Connection conn;
        String dbURL = spider.dbURL;
        String username = spider.username, password = spider.password;
        try {
            Class.forName("com.mysql.jdbc.Driver").newInstance();
        } catch (Exception ex) {
            ex.printStackTrace();
        }
        try {
            String insert = "INSERT INTO spider (title,description,url,"
                + "net_Links_ID,net_Category_FatherID) values (?,?,?,?,?)";
            conn = DriverManager.getConnection(dbURL, username, password);
            PreparedStatement insertData = conn.prepareStatement(insert);
            insertData.setString(1, t);
            insertData.setString(2, d);
            insertData.setString(3, u);
            insertData.setInt(4, l);
            insertData.setInt(5, f);
            insertData.executeUpdate();
            insertData.close();
            conn.close();
        } catch (java.sql.SQLException ex) {
            ex.printStackTrace();
        }
    }
}

public class Spider {
    int maxThreadCount = 20;
    String dbURL = "jdbc:mysql://domainname.com/dbName";
    String username = "username", password = "password";
    int threadCount, total, count, row_num, ct;

    public Spider() {
        threadCount = 0;
        count = 0;
        row_num = 1000;
        total = countData();
        System.out.println("count = "+count+", total = "+total);
        while(count < total) {
            ct = count;
            row_num = Math.min(row_num, total-count);
            System.out.println("row_num = "+row_num);
            LinkData links[] = selectData(count, row_num);
            System.out.println("found "+links.length+" links");
            for (int i = 0; i < links.length; i++) {
                lock();
                PageCrawler crawler = new PageCrawler(this, links[i]);
                crawler.start();
                if (++ct % 100 == 0)
                    System.out.println("launched "+ct+" crawlers so far\n");
            }
            count += row_num;
            System.out.println("count = "+count);
        }
        System.exit(0);
    }

    public synchronized void lock() {
        while (threadCount == maxThreadCount)
            try {
                wait();
            } catch (InterruptedException e) {}
        threadCount++;
    }
    
    public synchronized void unlock() {
        threadCount--;
        notify();
    }

    private LinkData[] selectData(int in, int nmr) {
            // queries 3 tables in database to retrieve URLs
            Vector vector = new Vector();
            Connection conn;
            try {
                String select = "SELECT net_Links.ID, net_Links.URL,"
                    + " net_Links.TITLE, net_Category.FatherID"
                    + " FROM net_Links,net_Category, net_CatLinks"
                    + " WHERE net_Category.Full_Name LIKE 'Health%'"
                    + " AND net_Category.ID = CategoryID AND"
                    + " net_Links.ID = LinkID LIMIT "+count+","+row_num;
                System.out.println("select = \""+select+"\"");
                conn = DriverManager.getConnection(dbURL, username, password);
                Statement statement = conn.createStatement();
                ResultSet rs = statement.executeQuery(select);
                while (rs.next()) {
                    String url = rs.getString("net_Links.URL");
                    int id = rs.getInt("net_Links.ID");
                    int father_id = rs.getInt("net_Category.FatherID");
                    vector.add(new LinkData(url, id, father_id));
                }
                statement.close();
                conn.close();
            } catch (java.sql.SQLException ex) {
                ex.printStackTrace();
            }
            LinkData links[] = new LinkData[vector.size()];
            for (int i = vector.size()-1; i >= 0; i--)
                links[i] = (LinkData) vector.get(i);
            return links;
    }

    private int countData() {
        int ct = 0;
        Connection conn;
        try {
            // counts number of rows; will use to construct final SQL statement
            Class.forName("com.mysql.jdbc.Driver").newInstance();
            String select_count = "SELECT count(net_Links.ID) AS total"
                + " FROM net_Links,net_Category, net_CatLinks WHERE"
                + " net_Category.Full_Name like 'Health%'"
                + " AND net_Category.ID = CategoryID AND net_Links.ID = LinkID";
            conn = DriverManager.getConnection(dbURL, username, password);
            Statement statement = conn.createStatement();
            ResultSet rs = statement.executeQuery(select_count);
            while (rs.next())
                ct += rs.getInt("total");
            System.out.println("count = "+ct);
        } catch (Exception ex) {
            System.out.println("countData exception");
            ex.printStackTrace();
        }
        return ct;
    }

    public static void main(String args[]) {
        System.out.println(new Scraper(args[0]).getPageData().title);
        //if (true) return;
        new Spider();
    }
}

//----------end Spider.java

Request for Question Clarification by leapinglizard-ga on 19 Sep 2004 09:17 PDT
Once again, I mistakenly left a couple of spurious lines in the main()
method. Go ahead and delete them.

leapinglizard

Request for Question Clarification by leapinglizard-ga on 19 Sep 2004 19:09 PDT
So does the program work properly? Is my work in line with your
expectations? If not, let me know what more I can do for you.

At your service,

leapinglizard

Clarification of Question by coolguy90210-ga on 19 Sep 2004 21:00 PDT
leapinglizard,

Approximately 7 hours after my last clarification post, I went to my PC
to check on the spidering.  It was at over 8000.  I had to go to work,
so I could not check it further.  In approximately 7 hours, after work
and dinner, I will try the threaded version and report the results back
here.

The only possible issue is that the count was the same when I got up
and 20 minutes later when I was ready to leave for work.  I didn't have
time to check the processes on my Linux server, so I'll have to get
back to you on that.

All in all things are looking good.  After a final discussion as
mentioned above, I'll go ahead and mark the question answered and make
payment plus tip as detailed previously.

Clarification of Question by coolguy90210-ga on 20 Sep 2004 02:55 PDT
leapinglizard,

I got back from work and compiled the threaded version.  I'm running
it now.  It is very fast:  1000 URLs in 1 min 47 seconds, and that is
just with a maximum of 20 threads.  I haven't played with increasing
that, as it is currently fast enough.  I have it running on one of my
three dedicated servers, a test platform (P4 2 GHz with 1.5 GB of RAM),
and it occupies anywhere from 0 to 5.0% of my CPU per the Linux top
command.  The load average is approximately 0.47.  I've seen occasional
spikes up to a load average of 1.0 and 12.0% CPU, but they are very
sporadic.  It is a test platform anyway, so I'm not concerned.  I'd
love to see a thread count so I can see how many threads, up to the
maximum of 20, affect the load.

OK, back to the non-threaded version.  I mentioned that it appeared to
hang at 8000 or so.  It is true.  When I returned from work, 9 hours
later, the count was the same as when I had left.  The number was
8698.  Not sure if that has any significance for the loop.  It has no
significance relative to my total rows.  As I write this, the threaded
version has completed 18550 URLs.  To be honest, I suspect that the
hang has to do with a timeout on one of the URLs.  I wonder if an
exception needs to be written for this?  However, with the threaded
version it would seem unnecessary, as from what I can tell, even if one
thread times out, there are up to 19 more to take its place.

So, in conclusion:

"So does the program work properly?"
Answer:  The threaded version is excellent, and appears to be working
as I envisioned.  I'm running a test of 60,000 urls which should
finish up shortly.

"Is my work in line with your expectations?"
Answer:  Yes, of course.  Excellent work.

No need to work on the above-mentioned possible timeout issue of the
non-threaded version.  Assuming the threaded version completes the
60,000 URLs, there is no need to work on this project anymore.  I
would consider it complete.

RE:  the number of threads running.  If that is easy to write, you
could add it, but I know you've spent a lot of time on this, and I
could probably just do that myself.

Upon completion of the above mentioned run, we will conclude our project.

Clarification of Question by coolguy90210-ga on 20 Sep 2004 07:01 PDT
leapinglizard,

I noticed that the "user-agent" info that was in the python script was
not carried over into the Java code.  I usually use something like
this:

URLConnection conn = url.openConnection();
conn.setRequestProperty("User-Agent","Place User-Agent Info Here");
InputStream urlStream = conn.getInputStream();
BufferedReader in = new BufferedReader(new InputStreamReader(urlStream));

I should be able to simply replace your connection method, quoted
below, with the above, correct?  Do you know of an easier way?  I
couldn't find setRequestProperty() in the URL class.

URL url;
BufferedReader reader;
String line;
StringBuffer buf = new StringBuffer();
try {
  url = new URL(address);
  InputStreamReader stream = new InputStreamReader(url.openStream());
  reader = new BufferedReader(stream);


I'm ready to conclude this project.

The 67,000-URL retrieval was successful.  It even finished up the
remaining 8 URLs at the end of the loop, as it should.

I'm working on 37,000 now.  I upped the maxThreadCount to 50 and
changed the if statement to if (++ct % 250 == 0) in the Spider
constructor.  As I read the code, I'm not convinced that the above
makes a difference.  Would you be able to tell me how many threads run
at one time, and what controls that process?  To be honest, I'm a
little confused by your thread code.  I understand everything else 100%
except for the thread portion.  If you don't have time, bottom line,
I'd like to know how many threads I can run at one time.  I'd like to
have a minimum of 25 PageCrawlers going at one time.

Specifically, I'm confused by this line here:

PageCrawler crawler = new PageCrawler(this, links[i]);

I know the above passes a link to PageCrawler, and that "this" is a
self-reference to the Spider object, right?  "this" enables the
PageCrawler class to use the variables in the Spider class,
correct?

and then these lines here:

PageCrawler(Spider spider, LinkData link) {
        this.spider = spider;
        this.link = link;
    }

Having this.spider, and this.link means that PageCrawler can now
reference the variables of the Spider class, correct?

Clarification of Question by coolguy90210-ga on 20 Sep 2004 08:10 PDT
leapinglizard,

I finished the 60,000-plus and 37,000-plus URL runs with no problem.  I
switched over to a 2nd machine with 1 GB of RAM and moved the databases
over to this new machine.  The new total of URLs to grab was 310,000.
I changed the maxThreadCount back to 20.  I changed this line:
(++ct % 250 == 0), back to (++ct % 100 == 0).  This time, I received a
java.lang.OutOfMemoryError at this point:

launched 4600 crawlers so far

java.lang.OutOfMemoryError
java.lang.OutOfMemoryError

The total number of URLs to retrieve should have no bearing, as we are
taking URLs out in slices of 1000, right?  What exactly is getting
threaded, the scraper only?  That is, does only a single slice of 1000
URLs rest in memory, along with multiple scraper threads retrieving URL
content from this single slice of 1000 URLs, OR are there multiple
slices of 1000 URLs residing in memory, along with multiple scraper
threads for these multiple slices of 1000?

Clarification of Question by coolguy90210-ga on 20 Sep 2004 16:45 PDT
leapinglizard,

I have performed the same retrieval of the same 310,000-URL category 3
times now on two different dedicated servers with no other processes
or applications running.  All runs fail at 4600 URLs retrieved.

In addition to the above, I performed a retrieval of a different
category of 47,000 URLs with no problems.  Each of these URL data sets
(4 so far) is a different category (out of 12 total):  3 successful
/ 1 failure.

Here is what I suspect is happening.  Somewhere between URL number
4000 and 5000 of the 310,000 category, an error occurs.  It could be a
malformed URL, or it could be a timeout.  If I'm not mistaken, the
default URL class timeout is at least 60 seconds, perhaps even more.
I don't know for sure, but I recall reading that somewhere.

I wrote a System.out.println for each of the try / catch statements
and ran the Spider again.  I obtained the same error as mentioned
above; however, I could not pinpoint where it occurred, as none
of the print statements I wrote were displayed.

Finally, bottom line, your spider works.  Let's conclude the project.
No need to update or write any more code; however, if you would comment
on my posts, in particular this one, that would be great.

I believe the only thing left to do, on my end, is to have it keep
track of where it fails, i.e. print to a file a line like "Failed at
url 4653:  http://blahblahblah.com", and then simply skip ahead by
1000.  At my leisure I can go back, examine the URL, and troubleshoot.
The only problem I foresee is that I couldn't catch the error in an
exception.  java.lang.OutOfMemoryError can be caught, right?

Request for Question Clarification by leapinglizard-ga on 20 Sep 2004 17:06 PDT
I read your earlier posts with great interest, and I've been thinking
about the concerns you've raised, but I'm heavily occupied at the
moment with another project. Can you hang on for two or three hours?

Once I've wrapped up my current work, I'll finally post an official
answer to this question. I plan to address each of your follow-up
questions in order, and I'll include a slightly modified version of
the Spider code that uses much less memory to do the job.

leapinglizard

Clarification of Question by coolguy90210-ga on 20 Sep 2004 18:56 PDT
leapinglizard,

No problem.  Take your time.  While I wait, I will continue to spider.
I'm at work now, so I can't kick off any more runs until I return home
in 6 hours.

Not sure which issues/questions you will be able to address.  I've
given the matter considerable thought, and it would seem to me that
being able to catch the error, make note of it somewhere (a file), and
continue on would be ideal.  That way, even if there are problems, I
can go back and manually review the problem URLs.  I'm making the
assumption that it is a problem URL causing some type of
timeout/pause, leading to more threads, and finally resulting in the
memory problem.  As I mentioned before, I'm weak on the threading
portion, so the above assumption is probably full of holes.
Answer  
Subject: Re: Need a re-write of a Java / Python Spider
Answered By: leapinglizard-ga on 20 Sep 2004 19:44 PDT
Rated: 5 out of 5 stars
 
Dear coolguy90210,

Among the posts you made above after I had posted my multi-threaded
Spider code, the first few mention a problem you observed with the
single-threaded version. In particular, the program was going into a
trance around the 8698th URL in your database. If I were in your place
and were inclined to investigate this problem, my first step would be
to run the single-threaded code on just the five or ten URLs around
that point. Once I had definitively identified the one URL that was
causing all the trouble, I would visit it manually with a web browser
to investigate the characteristics of the web page. I would also insert
many println() statements into the program and rerun it just for that
one URL, carefully observing its behavior. Does it get confused while
opening the connection? Or while reading the contents of the web site?

I can't say for certain, but I suspect the problem is that a web site
begins feeding information to the crawler, then abruptly stops. The
crawler keeps waiting, since it has not yet received an End Of File (EOF)
character and believes there's more text to come. I have adopted this
as my hypothesis because the business of requesting and establishing
an HTTP connection is conducted according to a fairly well defined and
widely respected protocol. I would be surprised if Java's URL class
didn't have a reasonable timeout to deal with web servers that can't
make an HTTP connection according to spec. I might be wrong. The only
way to find out for sure is to carry out close debugging.

A heavy-duty solution to the problem of unpredictable timeouts is to
use supervisor threads that act something like killer droids. You can
either have a bunch of these supervisor threads polling the crawler
threads and terminating any that are taking too long to do their job,
or you can launch a supervisor together with each crawler. Under a buddy
system of this kind, the only job of the supervisor is to terminate its
associated crawler as well as itself if the crawling isn't done after,
say, 30 seconds. Otherwise, if the crawler finishes on time, it terminates
its supervisor and itself.
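
For concreteness, here is a rough, untested sketch of that buddy system; the
Watchdog name and the 30-second figure are only illustrations, and note that
interrupt() will not always unblock a thread stuck in a socket read, so a real
implementation might also close the crawler's stream from the watchdog.

// Untested sketch of a supervisor ("buddy") thread.  It waits up to a fixed
// interval for its crawler to finish; if the crawler is still alive after the
// deadline, the watchdog tries to cancel it.  Names are illustrative only.
class Watchdog extends Thread {
    private Thread crawler;
    private long timeoutMillis;

    Watchdog(Thread crawler, long timeoutMillis) {
        this.crawler = crawler;
        this.timeoutMillis = timeoutMillis;
    }

    public void run() {
        try {
            crawler.join(timeoutMillis);   // returns when the crawler dies or the deadline passes
        } catch (InterruptedException e) {
            return;                        // crawler finished early and dismissed us
        }
        if (crawler.isAlive())
            crawler.interrupt();           // overdue: ask the crawler to give up
    }
}

// Usage inside the Spider loop would look roughly like:
//     Thread crawler = new PageCrawler(this, links[i]);
//     Thread watchdog = new Watchdog(crawler, 30 * 1000);
//     crawler.start();
//     watchdog.start();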

The next item you mention is that you'd like to know how many crawler
threads are running at once in the multi-threaded Spider. In the code
below, I'll modify the periodic output so that it shows the number of
threads currently running, but this should almost always be equal to
maxThreadCount. This is because as soon as one crawler is finished, the
next one starts up. The transition is practically instantaneous. To see
only 19 crawlers running instead of the maximum 20 would be a rare treat.

You also express an interest in the upper limit on the number of threads
that could run at one time. The answer is complicated because the crawler
threads are not merely using CPU bandwidth, they're also using Internet
bandwidth. If you were running threads that didn't access the disk or
the network, you could comfortably run thousands of them, even tens of
thousands at a time, on a fast machine equipped with a robust operating
system such as Linux. When it comes to web crawling, however, you will
probably congest your Internet connection once you've got several dozen
threads downloading webpages at the same time. Divide the incoming
bandwidth of your home computer with the outgoing bandwidth of your
average website, add something like a twofold factor to account for
latency, and that's roughly the number of crawlers you can reasonably
operate at the same time. I know that Korea has the fastest consumer
broadband in the world, so you might be able to run a couple of hundred
crawlers on a good hookup. I'm quite willing to bet that 25 crawlers is
a safe number on a DSL modem anywhere in the world, and I'd even feel
safe with 50.
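
To put illustrative numbers on that rule of thumb (the figures are hypothetical,
not measurements): an 8 Mbit/s downstream line divided by an average site serving
roughly 250 kbit/s works out to about 32 simultaneous downloads at saturation,
and the twofold latency allowance raises the comfortable ceiling to somewhere
around 60 crawlers.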

You asked about setting the HTTP user agent for a Java crawler. It is
indeed a nice habit to inform web sites of your crawler's identity, but
I'm afraid I can't help you on this score. I'm more familiar with the
Python web-access methods myself. I looked at the HttpURLConnection class,
but it doesn't seem to offer anything relevant. You could certainly send
the user-agent information if you wrote your own URL access methods that
talk directly to the socket.
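
For what it's worth, the setRequestProperty() call in your snippet belongs to
URLConnection (the object returned by url.openConnection()) rather than to URL
itself, so a substitution along the lines of your own snippet would presumably
drop into the Scraper constructor's try block roughly as follows. I have not
tested this myself, and the user-agent string is just a placeholder.

// Untested sketch: replace the url.openStream() lines in the Scraper
// constructor's try block with a URLConnection so a User-Agent header can be
// sent.  java.net.* is already imported by the posted code.
url = new URL(address);
URLConnection connection = url.openConnection();
connection.setRequestProperty("User-Agent", "Bot Information"); // identify the crawler
stream = new InputStreamReader(connection.getInputStream());
reader = new BufferedReader(stream);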

At one point, you inquire about the line

                PageCrawler crawler = new PageCrawler(this, links[i]);

and essentially proceed to answer your own question. Yes, "this" refers
to the Spider object inside which this line is being executed. The Spider
object is passing a reference to itself, so that the PageCrawler object
can subsequently call a method of the Spider object. In particular,
the PageCrawler says

                        spider.unlock();

when it is about to terminate, thereby informing the Spider object that it
may proceed to launch a new PageCrawler. The practice of objects retaining
references to each other is an integral part of the object-oriented
programming (OOP) style. If you hear OOP practitioners talking about
objects passing messages to each other, what they actually mean is that
object A, having a reference to object B, calls some method B.foo()
to notify B of something. In reply, B might call A.bar(), and then we
have an OOP dialogue going.

The OutOfMemoryError exception is a difficult matter. I have no clearcut
answer for you, but I can mention some possibilities. First, you may
want to upgrade to a newer version of the Java Runtime Environment if
you haven't done so lately. Memory management is a known problem in
Java, and while Sun engineers keep improving the heap allocation and
garbage collection, it's still far from perfect. Your hypothesis of
excessive thread propagation is off the mark, I hope, or else it would
mean that I'd coded the threading incorrectly. The lock() and unlock()
methods are meant to prevent the thread count from ever exceeding the
value of maxThreadCount.

Upon reviewing the Spider.java code, what strikes me as the greatest
source of memory usage is that the Scraper constructor reads an entire
web page into a String before proceeding to search for the <title>
and <meta...> patterns. One quick fix, which I've implemented below,
is to search for matching patterns on a line-by-line basis. Thus,
only a single line is stored at a time. This does mean that you can't
extract your information when the pattern is spread over several lines,
but I believe such cases are rare. Note that I've also added a check
for the </head> tag, since the title and description should only be
found in the HTML header. The most general approach would be to read
the web page one character at a time or in slightly larger chunks,
accumulating a sequence of them only when they begin to match one of
your patterns. I don't know whether you have the patience or the need
for such a solution. An intermediate solution would be to read the entire
header into one String, without the body, and pattern-match in that.

Finally, you are right to presume that the OutOfMemoryError exception,
like all other exceptions, can be caught in your program. If there's
a particular segment of code that does something bad to the memory,
you might be able to narrow it down by enclosing smaller and smaller
blocks of code with an appropriate try...catch statement. Alternatively,
and this is something I often do, you can just sprinkle println()
statements throughout the program to display the current value of the
most interesting variables. Then, at the point of failure, you'll have
a customized debugging trace to review.
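
As a sketch only, the narrowing-down might look like this around the
Scraper call in PageCrawler.run(), with the println() lines doubling as
the debugging trace:

        try {
            System.out.println("about to scrape " + link.url);
            PageData page = new Scraper(link.url).getPageData();
            insertData(page.title, page.desc, link.url, link.id, link.father_id);
            System.out.println("scraped and stored " + link.url);
        } catch (OutOfMemoryError e) {
            System.out.println("ran out of memory on " + link.url);
        }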

Was there anything else? I can't recall more at the moment. In the
event of further trouble, I'll be glad to respond to your Clarification
Requests.

Regards,

leapinglizard



//----------begin Spider.java

import java.io.*;
import java.sql.*;
import java.util.*;
import java.util.regex.*;
import java.net.*; 

class PageData {
    String title, desc;
    public PageData (String title, String desc) {
        this.title = title;
        this.desc = desc;
    }
    void setTitle(String title) {
        this.title = title;
    }
    void setDesc(String desc) {
        this.desc = desc;
    }
}

class Scraper { 
    int pFlags = Pattern.CASE_INSENSITIVE;
    Pattern titlePattern = Pattern.compile(
            "<title>(.*?)</title>", pFlags);
    Pattern descPattern = Pattern.compile(
            "<meta\\s+name=\"description\"\\s+content=\"(.*?)\"", pFlags);
    Pattern endPattern = Pattern.compile(
            "</head>", pFlags);
    PageData pageData;

    public Scraper(String address) {
        URL url;
        InputStreamReader stream = null;
        BufferedReader reader = null;
        String line;
        pageData = new PageData("_no_title_", "_no_description_");
        boolean gotTitle = false, gotDesc = false;
        Matcher matcher;
        try {
            url = new URL(address);
            stream = new InputStreamReader(url.openStream());
            reader = new BufferedReader(stream);
            while ((line = reader.readLine()) != null) {
                if ((matcher = titlePattern.matcher(line)).find()) {
                    pageData.setTitle(matcher.group(1).trim());
                    gotTitle = true;
                }
                if ((matcher = descPattern.matcher(line)).find()) {
                    pageData.setDesc(matcher.group(1).trim());
                    gotDesc = true;
                }
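                // stop reading once the header ends or both items have been found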
                if (endPattern.matcher(line).find() || (gotTitle && gotDesc))
                    return;
            }
        } catch (MalformedURLException e) {
            pageData.setTitle("_malformed_URL_");
        } catch (IOException e) {
            pageData.setTitle("_inaccessible_URL_");
        }
    }

    public PageData getPageData() {
        return pageData;
    }
}

class LinkData {
    int id, father_id;
    String url;
    public LinkData (String url, int id, int father_id) {
        this.url = url;
        this.id = id;
        this.father_id = father_id;
    }
}

class PageCrawler extends Thread {
    Spider spider;
    LinkData link;

    PageCrawler(Spider spider, LinkData link) {
        this.spider = spider;
        this.link = link;
    }

    public void run() {
        PageData page = new Scraper(link.url).getPageData();
        insertData(page.title, page.desc, link.url, link.id, link.father_id);
        spider.unlock();
    }

    void insertData(String t, String d, String u, int l, int f) {
        Connection conn;
        String dbURL = spider.dbURL;
        String username = spider.username, password = spider.password;
        try {
            Class.forName("com.mysql.jdbc.Driver").newInstance();
        } catch (Exception ex) {
            ex.printStackTrace();
        }
        try {
            String insert = "INSERT INTO spider (title,description,url,"
                + "net_Links_ID,net_Category_FatherID) values (?,?,?,?,?)";
            conn = DriverManager.getConnection(dbURL, username, password);
            PreparedStatement insertData = conn.prepareStatement(insert);
            insertData.setString(1, t);
            insertData.setString(2, d);
            insertData.setString(3, u);
            insertData.setInt(4, l);
            insertData.setInt(5, f);
            insertData.executeUpdate();
            insertData.close();
            conn.close();
        } catch (java.sql.SQLException ex) {
            ex.printStackTrace();
        }
    }
}

public class Spider {
    int maxThreadCount = 50;
    String dbURL = "jdbc:mysql://domainname.com/dbName";
    String username = "username", password = "password";
    int threadCount, total, count, row_num, ct;

    public Spider() {
        threadCount = 0;
        count = 0;
        row_num = 1000;
        total = countData();
        System.out.println("count = "+count+", total = "+total);
        while(count < total) {
            ct = count;
            row_num = Math.min(row_num, total-count);
            System.out.println("row_num = "+row_num);
            LinkData links[] = selectData(count, row_num);
            System.out.println("found "+links.length+" links");
            for (int i = 0; i < links.length; i++) {
                lock();
                PageCrawler crawler = new PageCrawler(this, links[i]);
                crawler.start();
                if (++ct % 100 == 0) {
                    System.out.println("launched "+ct+" crawlers so far");
                    System.out.println("    ("+threadCount+" simultaneous)");
                }
            }
            count += row_num;
            System.out.println("count = "+count);
        }
        System.exit(0);
    }

    public synchronized void lock() {
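        // block until a crawler slot is free, then claim one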
        while (threadCount == maxThreadCount)
            try {
                wait(); 
            } catch (InterruptedException e) {}
        threadCount++;
    }

    public synchronized void unlock() { 
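        // release a crawler slot and wake the Spider if it is waiting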
        threadCount--;
        notify();
    }

    private LinkData[] selectData(int in, int nmr) {
            // queries 3 tables in database to retrieve URLs
            Vector vector = new Vector();
            Connection conn;
            try {
                String select = "SELECT net_Links.ID, net_Links.URL,"
                    + " net_Links.TITLE, net_Category.FatherID"
                    + " FROM net_Links,net_Category, net_CatLinks"
                    + " WHERE net_Category.Full_Name LIKE 'Health%'"
                    + " AND net_Category.ID = CategoryID AND"
                    + " net_Links.ID = LinkID LIMIT "+count+","+row_num;
                System.out.println("select = \""+select+"\"");
                conn = DriverManager.getConnection(dbURL, username, password);
                Statement statement = conn.createStatement();
                ResultSet rs = statement.executeQuery(select);
                while (rs.next()) {
                    String url = rs.getString("net_Links.URL");
                    int id = rs.getInt("net_Links.ID");
                    int father_id = rs.getInt("net_Category.FatherID");
                    vector.add(new LinkData(url, id, father_id));
                }
                statement.close();
                conn.close();
            } catch (java.sql.SQLException ex) {
                ex.printStackTrace();
            }
            LinkData links[] = new LinkData[vector.size()];
            for (int i = vector.size()-1; i >= 0; i--)
                links[i] = (LinkData) vector.get(i);
            return links;
    }

    private int countData() {
        int ct = 0;
        Connection conn;
        try {
            // counts number of rows; will use to construct final SQL statement
            Class.forName("com.mysql.jdbc.Driver").newInstance();
            String select_count = "SELECT count(net_Links.ID) AS total"
                + " FROM net_Links,net_Category, net_CatLinks WHERE"
                + " net_Category.Full_Name like 'Health%'"
                + " AND net_Category.ID = CategoryID AND net_Links.ID = LinkID";
            conn = DriverManager.getConnection(dbURL, username, password);
            Statement statement = conn.createStatement();
            ResultSet rs = statement.executeQuery(select_count);
            while (rs.next())
                ct += rs.getInt("total");
            System.out.println("count = "+ct);
            statement.close();
            conn.close();
        } catch (Exception ex) {
            System.out.println("countData exception");
            ex.printStackTrace();
        }
        return ct;
    }

    public static void main(String args[]) {
        new Spider();
    }
}

//----------end Spider.java

Clarification of Answer by leapinglizard-ga on 21 Sep 2004 08:47 PDT
I notice that I forgot the Pattern.DOTALL flag in earlier versions of
the program. It is unnecessary in the latest version, which searches
for patterns in one line at a time.

leapinglizard

Request for Answer Clarification by coolguy90210-ga on 21 Sep 2004 16:35 PDT
leapinglizard,

Program works flawlessly after 1 modification.

I had to add a BufferedReader close statement, outside of the while
loop in the Scraper class.  I did this by adding the line
reader.close().  The reason for this is that I saw the following
paraphrased error:  "Unable to open a connection due to too many files
open".  I monitored this with a couple of linux commands to count the
number of open files/connections associated with the PID of the Spider
process.  The number constantly grew until it hit 1024, and then the
error would occur.

I increased the maximum number of open files via the standard
addition to the /etc/security/Can't Remember Name.file.  However, I
would still get the same error.  I reviewed the error message and
determined it was occurring at 4 possible locations in the program:
countData(), selectData(), insertData() or Scraper().  My immediate
suspicion was the Scraper class, because the above-mentioned linux
commands showed that the open files were actually open socket
connections to different urls.  I added the reader.close() statement
and observed the number of open connections again.  This time the
number of open connections would vary between 100 and 400, always
coming back down again if they approached 400, usually staying around
180 or so.
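
For anyone making the same change, one way to be sure the close runs
even when the loop returns early is a finally clause at the end of the
Scraper constructor; this is only a sketch of the idea:

        } catch (IOException e) {
            pageData.setTitle("_inaccessible_URL_");
        } finally {
            try {
                if (reader != null)
                    reader.close();  // also closes the underlying stream
            } catch (IOException e) {}
        }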

The next addition I made was to create a String category variable to
take the place of the portion of the WHERE clause of the SELECT
statement that reads LIKE 'Health%'.  This same WHERE clause occurs
twice, so as I go through my categories it is better to edit one
place.

The final addition was to add a second dbURL variable and to use the
appropriate one in each of the three methods that make database calls.
With that addition, I can run the Spider from my various dedicated
servers; all of them will get their count and urls from the master
server, but insert their data locally.
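
Concretely, with placeholder host and variable names, the new fields
look roughly like this:

    String category = "Health%";                                    // used by both WHERE clauses
    String selectDbURL = "jdbc:mysql://master.example.com/dbName";  // countData() and selectData()
    String insertDbURL = "jdbc:mysql://localhost/dbName";           // insertData()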

Last night I completed the 310,000 category.  I completed a 228,000
category, and I'm working on a 900,000 category.  The 900,000 category
stalled at around 400,000 this morning.   I suspect it is related to
our previous discussions.  In any event, upon my return from home,
I'll simply skip ahead by 1000 and continue on.  At some point, though,
I will carefully examine all of the urls within the failed range of
1000, looking for the troublesome url, and then hopefully I will be
able to write a catch statement for it.

leapinglizard, you are one cool guy!  I'm going to rate this 5 stars. 
I'm going to pay you the agreed 100 USD; however, I'd like to add some
more to that, so I'm going to review a few answers to see what is
appropriate.  I'd ask, but you wouldn't have time to reply, and it is
a bit of a faux pas.  Back in a moment.

Clarification of Answer by leapinglizard-ga on 21 Sep 2004 18:10 PDT
Thank you for the kind words and the handsome tip. I'm glad I was able
to assist you in this matter. I was nervous about handing you untested
code, but you seem to have quite a knack for debugging. If any serious
problems should crop up in connection with this program, I'll be glad
to offer further assistance through the Answer Clarification
mechanism.

leapinglizard
coolguy90210-ga rated this answer: 5 out of 5 stars and gave an additional tip of: $100.00
Leapinglizard knows his Java!  To anyone looking for a Java coder,
Leapinglizard is the one!  He spent quite a bit of time working on my
original code, and improving it to the point of it being a complete
re-write.  He gets 5 stars for his excellent comments, 5 stars for his
Java coding, and 5 stars for his overall expertise.

The spider works, and works exceptionally well.  I've reviewed
probably close to 20 different spider applications, paid and open
source, and the one I originally developed, which he took and
re-wrote, is in fact the best.  It is easy to see exactly what it is
doing, and it is easy to adjust it for personal circumstances.  And,
most importantly, it is FAST: 9 urls per second with the original
threaded version, and much faster with the final revision.

This is an industrial-strength spider application.  It is begging to
be made into a parallel / RMI application.

Comments  
Subject: Re: Need a re-write of a Java / Python Spider
From: efn-ga on 18 Sep 2004 12:45 PDT
 
leapinglizard has both more time and more expertise for this question
than I, but thanks for remembering me!

--efn
Subject: Re: Need a re-write of a Java / Python Spider
From: coolguy90210-ga on 18 Sep 2004 20:31 PDT
 
efn,

Good to see you are still around.  I have other ideas/questions coming
up so keep your eyes open.  May not be about coding.  May be about
advice for setting up a web site that has a specific theme or goal.
