Google Answers: Retrieving patent data in XML format

Q: Retrieving patent data in XML format (Answered, 5 out of 5 stars, 0 Comments)

Question

Subject: Retrieving patent data in XML format
Category: Computers > Programming
Asked by: chesketh-ga
List Price: $20.00

Posted: 24 Jul 2004 07:48 PDT
Expires: 23 Aug 2004 07:48 PDT
Question ID: 378510

I would like to find a way to retrieve patent information from the USPTO (United States Patent and Trademark Office) in XML format rather than HTML format.

Clarification of Question by chesketh-ga on 24 Jul 2004 07:57 PDT

That is to say, I mean the HTML patent information available from http://uspto.gov. I want to be able to extract specific field elements either from XML output or, as clean text with no tags, from the HTML output. I would prefer to work with XML for obvious reasons.

Request for Question Clarification by palitoy-ga on 24 Jul 2004 11:09 PDT

Hello Chesketh,

This sounds like the kind of thing I would enjoy doing. Can you clarify your question slightly?

1) How do you want to extract this information? Via a Perl or PHP script?
2) Can you please give a link to an example page of an item you wish to extract?
3) Is it correct that you wish to parse the HTML page of the result and produce an XML file? If so, which elements do you require in the XML file?

Clarification of Question by chesketh-ga on 24 Jul 2004 12:32 PDT

Hi, thanks for the response.

1) I would like to extract the information with a PHP script.
2) http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=/netahtml/srchnum.htm&r=1&f=G&l=50&s1=6,670,149.WKU.&OS=PN/6,670,149&RS=PN/6,670,149
   This HTML output contains particular "fields" such as inventor, assignee, patent number, abstract, etc., which I would like to cleanly extract.
3) I would rather not have to deal with HTML if it is possible to get XML directly from the USPTO, but I think this is only possible with European patent services. That said, yes, I would like to parse the HTML, extracting several fields. Five examples are:
   a) Patent #
   b) Title
   c) Inventor(s)
   d) Assignee(s)
   e) Abstract

Best regards,
Christian

Answer

Subject: Re: Retrieving patent data in XML format
Answered By: palitoy-ga on 24 Jul 2004 13:18 PDT
Rated: 5 out of 5 stars

Hello again,

I have written this small script for you to parse the pages on the USPTO website into an "XML-style" page that can be copied and pasted into Notepad or similar software. In the print statements I have escaped the < and > characters (as &lt; and &gt;) so that they appear in your web browser when the program is run. Should you not require this, simply alter the &lt; to < and the &gt; to >.

I have not had much time to make the script pretty, but please take a look over it and post any clarifications or changes that you require here, and I will look at them in the morning when I am fresher.

<?php
$siteurl = "http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=/netahtml/srchnum.htm&r=1&f=G&l=50&s1=6,670,149.WKU.&OS=PN/6,670,149&RS=PN/6,670,149";
$siteinfo = get_file($siteurl);

// patent number, taken from the page title
preg_match("|<TITLE>United\sStates\sPatent:\s(.*)<\/TITLE>|imU", $siteinfo, $out);
print "&lt;item&gt;<br />&lt;patent&gt;" . $out[1] . "&lt;/patent&gt;<br />";

// patent title
preg_match("|<font size=\"\+1\">(.*)<\/font>|imU", $siteinfo, $out);
print "&lt;title&gt;" . $out[1] . "&lt;/title&gt;<br />";

// abstract
preg_match("|Abstract</B></CENTER><P>(.*)<\/P>|imU", $siteinfo, $out);
print "&lt;abstract&gt;" . $out[1] . "&lt;/abstract&gt;<br />";

// inventors and assignee: the first and second matching table cells
preg_match_all("|<TD ALIGN=\"LEFT\" WIDTH=\"90%\">(.*)<\/TD>|imU", $siteinfo, $out, PREG_PATTERN_ORDER);
print "&lt;inventors&gt;" . strip_tags($out[0][0]) . "&lt;/inventors&gt;<br />";
print "&lt;assignee&gt;" . strip_tags($out[0][1]) . "&lt;/assignee&gt;<br />&lt;/item&gt;";

// fetch the page and flatten it to a single line of text
function get_file($filename) {
    $file = file($filename);
    $lines = preg_replace("|[\r\t\n]|", "", join("", $file));
    $lines = preg_replace("|\s{1,}|", " ", $lines);
    return $lines;
}
?>
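To see how patterns of this kind behave without fetching a live page, here is a minimal sketch run against a made-up HTML fragment. The fragment and the helper name first_match() are assumptions for illustration only; they are not the USPTO's actual markup or part of the script above.

```php
<?php
// Hypothetical helper: return the first capture group of $pattern in $html,
// with tags stripped and whitespace trimmed, or null when nothing matches.
function first_match(string $pattern, string $html): ?string
{
    if (preg_match($pattern, $html, $out) !== 1) {
        return null;
    }
    return trim(strip_tags($out[1]));
}

// Made-up sample laid out like the fields the script looks for.
$sample = '<HTML><HEAD><TITLE>United States Patent: 6,670,149</TITLE></HEAD>'
        . '<BODY><font size="+1">Example title</font>'
        . '<CENTER><B>Abstract</B></CENTER><P>Example abstract text.</P>'
        . '</BODY></HTML>';

// The U modifier makes (.*) stop at the first closing tag instead of the last.
$number   = first_match('|<TITLE>United\sStates\sPatent:\s(.*)</TITLE>|imU', $sample);
$title    = first_match('|<font size="\+1">(.*)</font>|imU', $sample);
$abstract = first_match('|Abstract</B></CENTER><P>(.*)</P>|imU', $sample);

// A pattern that matches nothing yields null rather than an undefined index.
$missing  = first_match('|<NOSUCHTAG>(.*)</NOSUCHTAG>|imU', $sample);
```

Returning null for a missing field makes it easy to tell "the page layout changed" apart from "the field is genuinely empty".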

Request for Answer Clarification by chesketh-ga on 24 Jul 2004 14:28 PDT

Thanks Palitoy. The only other requirements I have are that the script is commented so I can understand and expand it, that each element is written to a variable so that it can be written to MySQL, and that the XML file is saved with the filename uspto-[patent #].

Thanks again,
Christian

Clarification of Answer by palitoy-ga on 25 Jul 2004 03:02 PDT

Hi Christian,

I have updated the script with the other items you requested. I have not added the MySQL database section, as this would require knowing the structure, username/password and elements of your databases. This should be fairly easy to add on, as all the patent elements are now stored in variables for your use.

To run the script you now also need to pass it a URL in the browser. This is so you can use the same script for every patent you wish to look at (rather than saving multiple copies of the same script with just the URL changed). The syntax of this is:

http://nameofyourscript.php?url=http://whichpatent

If you have any further questions please ask and I will do my best to help again.

<?php
// retrieve the URL for the patent from the browser address line
$siteurl = $_GET['url'];

// if no URL is given...
if ( $siteurl == "" ) {
    // print out a message and quit the program here
    print "<p>No URL was given.</p><p>To give a URL add '?url=http://site.com' to the end of your URL (substituting the URL of your choice for site.com).</p>";
    exit(0);
}

// get the patent information and store it in a variable called $siteinfo
$siteinfo = get_file($siteurl);

// find the patent number - it is always in the title of the patent
preg_match("|<TITLE>United\sStates\sPatent:\s(.*)<\/TITLE>|imU", $siteinfo, $out);
// store the patent number in a variable and clean it up
$patent_number = clean($out[1]);

// find the title of the patent - it is always the first line with these html tags
preg_match("|<font size=\"\+1\">(.*)<\/font>|imU", $siteinfo, $out);
$patent_title = clean($out[1]);

// find the abstract of the patent - it is always the first paragraph after the word abstract
preg_match("|Abstract(.*)<\/P>|imU", $siteinfo, $out);
$patent_abstract = clean($out[1]);

/* find the inventor and assignee data - these are stored in a table with
   this html cell information, the first match is the inventor data and the
   second the assignee data */
preg_match_all("|<TD ALIGN=\"LEFT\" WIDTH=\"90%\">(.*)<\/TD>|imU", $siteinfo, $out, PREG_PATTERN_ORDER);
$patent_inventors = clean($out[0][0]);
$patent_assignees = clean($out[0][1]);

// create a variable for the XML file including the data found
$xmlData = "<xml><item><patent>$patent_number</patent><title>$patent_title</title><abstract>$patent_abstract</abstract><inventors>$patent_inventors</inventors><assignees>$patent_assignees</assignees></item></xml>";

// save the file to a location using the patent number (with the commas removed)
if ($file = fopen("uspto-" . str_replace(",", "", $patent_number) . ".xml", "w")) { // open file for writing
    fwrite($file, $xmlData); // write to file
}
fclose($file);

/* add mysql data storage here using the following variables in the SQL INSERT statement
   patent number    = $patent_number
   patent title     = $patent_title
   patent abstract  = $patent_abstract
   patent inventors = $patent_inventors
   patent assignees = $patent_assignees
*/

// print something out on the screen so you know the process has finished.
print "<p>Process completed.</p><p>" . htmlentities($xmlData) . "</p>";

// a function to retrieve the patent from the internet
function get_file($filename) {
    // get the info and store it in $file
    $file = file($filename);
    // tidy up the information by removing unwanted line feeds, multiple spaces etc
    $lines = preg_replace("|[\r\t\n]|", "", join("", $file));
    $lines = preg_replace("|\s{1,}|", " ", $lines);
    // give the cleaned up information back to the variable calling this function
    return $lines;
}

// a function to clean the information passed to it
function clean($str) {
    // remove any html tags from the data given
    $str = strip_tags($str);
    // remove any whitespace at the beginning and end of the data given
    $str = ltrim($str);
    $str = rtrim($str);
    // give the cleaned up info back to the variable calling this function
    return $str;
}
?>
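One thing to watch when building XML by string interpolation: a title or abstract that contains & or < would produce a file that XML parsers reject. A minimal sketch of one way around this, assuming the values are escaped with htmlspecialchars() and the ENT_XML1 flag before being wrapped in tags, is shown below. The helper name build_item_xml() and the sample title are made up for illustration.

```php
<?php
// Hypothetical helper: build one <item> element from field name => value pairs,
// escaping the characters that are special in XML (&, <, >, quotes).
function build_item_xml(array $fields): string
{
    $xml = '<item>';
    foreach ($fields as $name => $value) {
        $escaped = htmlspecialchars($value, ENT_XML1 | ENT_QUOTES, 'UTF-8');
        $xml .= "<$name>$escaped</$name>";
    }
    return $xml . '</item>';
}

$xmlData = build_item_xml([
    'patent' => '6,670,149',
    'title'  => 'Salts & esters of <example> compounds', // made-up title
]);
```

Here the & becomes &amp; and the angle brackets become &lt; and &gt;, so the stored file stays parseable while the original text can still be recovered by any XML reader.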

Request for Answer Clarification by chesketh-ga on 25 Jul 2004 05:46 PDT

Palitoy, I'm having a few problems with the script. When I run it, I get the following output:

Warning: file(http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1): failed to open stream: HTTP request failed! in /home/chesketh/public_html/getpatent.php on line 64

Warning: join(): Bad arguments. in /home/chesketh/public_html/getpatent.php on line 66

Process completed.

<xml><item><patent></patent><title></title><abstract></abstract><inventors></inventors><assignees></assignees></item></xml>

I have chmodded the directory to 777, which cleared up some issues, but I'm still left with these puzzling statements. I'm not sure if the '&' character in the URL is causing problems...

Thanks in advance
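The second warning follows from the first: when the fetch fails, the result is false rather than an array of lines, and join() cannot work with that. A minimal sketch of a fetch helper that checks for failure before tidying the text is given below. The name fetch_page() is hypothetical, and file_get_contents() is swapped in for file() plus join(); the whitespace handling mirrors the script's get_file().

```php
<?php
// Hypothetical helper: fetch a page (or local file) and return its text with
// line breaks and tabs removed and whitespace runs collapsed, or null on failure.
function fetch_page(string $source): ?string
{
    // @ suppresses the PHP warning; the return value is checked instead.
    $raw = @file_get_contents($source);
    if ($raw === false) {
        return null;
    }
    $flat = preg_replace('|[\r\t\n]|', '', $raw);
    return preg_replace('|\s{1,}|', ' ', $flat);
}

// Example with a path that does not exist, standing in for a failed request.
$siteinfo = fetch_page('/no/such/dir/patent.html');
if ($siteinfo === null) {
    print "Could not retrieve the page.\n";
}
```

With a check like this the script can stop early with a clear message instead of writing an XML file full of empty elements.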

Clarification of Answer by palitoy-ga on 25 Jul 2004 06:23 PDT

Thank you for the 5-star rating and tip!

With regard to your problem, I think you are right: the & is causing a problem. I forgot those were there. There is an easy remedy to this...

Change:

// retrieve the URL for the patent from the browser address line
$siteurl = $_GET['url'];

To:

// retrieve the URL for the patent from the browser address line
$siteurl = "";
// get each part of the URL and then stitch it back together
foreach ($_GET as $key => $val) {
    if ( $siteurl == "" ) {
        $siteurl = $val;
    } else {
        $siteurl = $siteurl . "&" . $key . "=" . $val;
    }
}

This should fix the problem, sorry about that...
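The stitching is needed because every unencoded & in the patent URL is read as the start of a new parameter for the script itself. An alternative sketch, assuming you control the link that calls the script, is to percent-encode the patent URL with urlencode() so that it arrives as a single url parameter. The example below uses parse_str() to imitate how PHP splits a query string, and a shortened version of the patent URL.

```php
<?php
// Shortened form of the patent URL from the question, for illustration.
$patentUrl = 'http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL';

// Unencoded: everything after the first & is read as parameters of the script.
parse_str('url=' . $patentUrl, $unencoded);

// Encoded: the whole patent URL survives as the single "url" parameter.
parse_str('url=' . urlencode($patentUrl), $encoded);
```

In the unencoded case the url value is cut off after Sect1=PTO1, which matches the truncated address shown in the file() warning.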

Request for Answer Clarification by chesketh-ga on 25 Jul 2004 06:31 PDT

Thanks again, just one more problem. I get the following output (also note that the inventor and assignee fields are empty):

Warning: fopen(uspto-6727353.xml): failed to open stream: Permission denied in /home/chesketh/public_html/getpatent.php on line 53

Warning: fclose(): supplied argument is not a valid stream resource in /home/chesketh/public_html/getpatent.php on line 56

Process completed.

<xml><item><patent>6,727,353</patent><title>Nucleic acid encoding Kv10.1, a voltage-gated potassium channel from human brain</title><abstract>The invention provides isolated nucleic acid and amino acid sequences of Slo potassium family members such as, antibodies to Kv10 subfamily members such as Kv10.1, methods of detecting Kv10, subfamily members such as Kv10.1, methods of screening for potassium channel activators and inhibitors using biologically active Kv10 subfamily members such as Kv10.1, and kits for screening for activators and inhibitors of voltage-gated potassium channels comprising Kv10 subfamily members such as Kv10.1.</abstract><inventors></inventors><assignees></assignees></item></xml>

Request for Answer Clarification by chesketh-ga on 25 Jul 2004 06:36 PDT

Sorry for the trouble. I have solved the file writing problem (just a chmod oversight), but the inventor and assignee fields remain empty. It doesn't seem to be pulling the information properly from the table...
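The fopen() and fclose() warnings reported above come as a pair: when the file cannot be opened, there is no valid handle left to close. A minimal sketch of a save step that reports the failure instead, assuming file_put_contents() is used in place of fopen(), fwrite() and fclose(), might look like this. The helper name save_patent_xml() and its directory argument are hypothetical.

```php
<?php
// Hypothetical helper: write $xmlData to uspto-<number>.xml inside $dir.
// Returns the path written, or null if the file could not be written.
function save_patent_xml(string $dir, string $patentNumber, string $xmlData): ?string
{
    // Same naming rule as the script: the patent number with commas removed.
    $path = rtrim($dir, '/') . '/uspto-' . str_replace(',', '', $patentNumber) . '.xml';
    // @ suppresses the warning; the return value reports the failure instead.
    if (@file_put_contents($path, $xmlData) === false) {
        return null;
    }
    return $path;
}

// Example: write a placeholder document into the system temp directory.
$saved = save_patent_xml(sys_get_temp_dir(), '6,727,353', '<item/>');
```

A null result then points directly at a permissions or path problem, which is what the chmod fix addressed.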

Clarification of Answer by palitoy-ga on 25 Jul 2004 07:08 PDT

I have just tried to pull that patent myself here and it seems to work... I used this URL:

http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=/netahtml/srchnum.htm&r=1&f=G&l=50&s1=6,727,353.WKU.&OS=PN/6,727,353&RS=PN/6,727,353

In your script, is this line all on one line?

preg_match_all ("|<TD ALIGN=\"LEFT\" WIDTH=\"90%\">(.*)<\/TD>|imU",$siteinfo,$out, PREG_PATTERN_ORDER);

If not, it should be :-)

Clarification of Answer by palitoy-ga on 25 Jul 2004 07:11 PDT

I forgot to mention that there should also be a single space between the ALIGN=\"LEFT\" and the WIDTH=\"90%\" parts.
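Because a literal space inside a pattern is easy to lose or wrap when a script is copied and pasted, one option is to write the gap as \s+ so that it is visible and matches any run of whitespace. The sketch below is a variant of the table-cell pattern under that assumption; the sample cells and their contents are made up for illustration.

```php
<?php
// Variant of the table-cell pattern that accepts any run of whitespace
// (spaces, tabs, line breaks) between the two attributes.
$pattern = '|<TD ALIGN="LEFT"\s+WIDTH="90%">(.*)</TD>|imsU';

// Made-up cells: the second has a line break between the attributes.
$html = '<TD ALIGN="LEFT" WIDTH="90%"><B>Doe; Jane</B> (Example City)</TD>'
      . "<TD ALIGN=\"LEFT\"\nWIDTH=\"90%\">Example Corp.</TD>";

preg_match_all($pattern, $html, $out, PREG_PATTERN_ORDER);

// $out[1] holds the captured cell contents in page order.
$inventors = trim(strip_tags($out[1][0]));
$assignees = trim(strip_tags($out[1][1]));
```

Reading from $out[1] (the captured group) rather than $out[0] (the full match) gives the cell contents without the surrounding TD tags, although strip_tags() makes either choice work.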

Request for Answer Clarification by chesketh-ga on 25 Jul 2004 08:02 PDT

Thanks palitoy, that did it. The script works perfectly. Thanks again, I'll keep you in mind for my next project. All the best...

Clarification of Answer by palitoy-ga on 25 Jul 2004 08:52 PDT

I'm glad I could help! If you do want to hire me again through Google Answers, simply ask for palitoy-ga in the question title and I will probably see it (I usually check the site at least a couple of times a day).

chesketh-ga rated this answer 5 out of 5 stars and gave an additional tip of $5.00.

Thanks so much Palitoy, I appreciate the hard work you put into this,
it was well worth the money. Best regards...

Comments

There are no comments at this time.

