I would like to find a way to retrieve patent information from the
USPTO (United States Patent and Trademark Office) in XML format rather
than HTML format.
|
Clarification of Question by chesketh-ga on 24 Jul 2004 07:57 PDT
That is to say, the HTML patent information available from http://uspto.gov.
I want to be able to extract specific field elements either from XML
output or from the HTML output (as clean text with no tags). I would
prefer to work with XML for obvious reasons.
|
Request for Question Clarification by palitoy-ga on 24 Jul 2004 11:09 PDT
Hello Chesketh
This sounds like the kind of thing I would enjoy doing. Can you
clarify your question slightly?
1) How do you want to extract this information? Via a Perl or PHP script?
2) Can you please give a link to an example page of an item you wish to extract?
3) Is it correct that you wish to parse the HTML page of the result
and produce an XML file? If so which elements do you require in the
XML file?
|
Clarification of Question by chesketh-ga on 24 Jul 2004 12:32 PDT
Hi, thanks for the response,
1) I would like to extract the information with a PHP script.
2) http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=/netahtml/srchnum.htm&r=1&f=G&l=50&s1=6,670,149.WKU.&OS=PN/6,670,149&RS=PN/6,670,149
This HTML output contains particular "fields" such as inventor,
assignee, patent number, abstract etc. which I would like to extract
cleanly.
3) I would rather not have to deal with HTML if it is possible to get
XML directly from the USPTO, but I think this is only possible with
European patent services. That said, yes, I would like to parse the
HTML, extracting several fields. Five examples are:
a) Patent #
b) Title
c) Inventor(s)
d) Assignee(s)
e) Abstract
Best regards,
Christian
|
Hello again
I have written this small script for you to parse the pages on the
USPTO website to an "XML-style" page that can be copied and pasted
into Notepad or similar software.
In the print statements I have escaped the < and > characters (as &lt;
and &gt;) so that the tags appear in your web browser when the program
is run. Should you not require this, simply change each &lt; to < and
each &gt; to >.
I have not had too much time to make the script pretty but please take
a look over it and post any clarifications or changes that you require
to be done here and I will look at them in the morning when I am
fresher.
<?php
// URL of the patent page to parse
$siteurl = "http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=/netahtml/srchnum.htm&r=1&f=G&l=50&s1=6,670,149.WKU.&OS=PN/6,670,149&RS=PN/6,670,149";
$siteinfo = get_file($siteurl);
// patent number (taken from the page title)
preg_match ("|<TITLE>United\sStates\sPatent:\s(.*)<\/TITLE>|imU",$siteinfo,$out);
print "&lt;item&gt;<br />&lt;patent&gt;" . $out[1] . "&lt;/patent&gt;<br />";
// patent title
preg_match ("|<font size=\"\+1\">(.*)<\/font>|imU",$siteinfo,$out);
print "&lt;title&gt;" . $out[1] . "&lt;/title&gt;<br />";
// abstract
preg_match ("|Abstract</B></CENTER><P>(.*)<\/P>|imU",$siteinfo,$out);
print "&lt;abstract&gt;" . $out[1] . "&lt;/abstract&gt;<br />";
// inventors (first matching table cell) and assignee (second matching cell)
preg_match_all ("|<TD ALIGN=\"LEFT\" WIDTH=\"90%\">(.*)<\/TD>|imU",$siteinfo,$out, PREG_PATTERN_ORDER);
print "&lt;inventors&gt;" . strip_tags($out[0][0]) . "&lt;/inventors&gt;<br />";
print "&lt;assignee&gt;" . strip_tags($out[0][1]) . "&lt;/assignee&gt;<br />&lt;/item&gt;";
// fetch the page and collapse line breaks, tabs and repeated spaces
function get_file($filename) {
    $file = file ($filename);
    // preg_replace is used here because ereg_replace no longer exists in current PHP
    $lines = preg_replace("/[\r\t\n]/","",join("",$file));
    $lines = preg_replace("|\s{1,}|"," ",$lines);
    return $lines;
}
?>
|
Request for Answer Clarification by chesketh-ga on 24 Jul 2004 14:28 PDT
Thanks Palitoy. My only other requirements are that the script is
commented so that I can understand and expand it, that each element is
written to a variable so that it can be stored in MySQL, and that the
XML file is saved with the filename uspto-[patent #].
Thanks again,
Christian
|
Clarification of Answer by palitoy-ga on 25 Jul 2004 03:02 PDT
Hi Christian
I have updated the script with the other items you requested. I have
not added the MySQL database section as this would require knowing the
structure, username/password and elements of your databases. This
should be fairly easy to add on, as all the patent elements are now
stored in variables for your use.
To run the script you now also need to pass it a URL in the browser -
this is so you can use the same script for every patent you wish to
look at (rather than saving multiple copies of the same script with
just the URL changed). The syntax of this is:
http://nameofyourscript.php?url=http://whichpatent
If you have any further questions please ask and I will do my best to help again.
<?php
// retrieve the URL for the patent from the browser address line
$siteurl = $_GET['url'];
// if no URL is given...
if ( $siteurl == "" ) {
// print out a message and quit the program here
print "<p>No URL was given.</p><p>To give a URL add
'?url=http://site.com' to the end of your URL (substituting the URL of
your choice for site.com).</p>";
exit(0);
};
// get the patent information and store it in a variable called $siteinfo
$siteinfo = get_file($siteurl);
// find the patent number - it is always in the title of the patent
preg_match ("|<TITLE>United\sStates\sPatent:\s(.*)<\/TITLE>|imU",$siteinfo,$out);
// store the patent number in a variable and clean it up
$patent_number = clean($out[1]);
// find the title of the patent - it is always the first line with these html tags
preg_match ("|<font size=\"\+1\">(.*)<\/font>|imU",$siteinfo,$out);
$patent_title = clean($out[1]);
// find the abstract of the patent - it is always the first paragraph after the word abstract
preg_match ("|Abstract(.*)<\/P>|imU",$siteinfo,$out);
$patent_abstract = clean($out[1]);
/* find the inventor and assignee data - these are stored in a table
with this html cell information,
the first match is the inventor data and the second the assignee data */
preg_match_all ("|<TD ALIGN=\"LEFT\" WIDTH=\"90%\">(.*)<\/TD>|imU",$siteinfo,$out, PREG_PATTERN_ORDER);
$patent_inventors = clean($out[0][0]);
$patent_assignees = clean($out[0][1]);
// create a variable for the XML file including the data found
$xmlData = "<xml><item><patent>$patent_number</patent><title>$patent_title</title><abstract>$patent_abstract</abstract><inventors>$patent_inventors</inventors><assignees>$patent_assignees</assignees></item></xml>";
// save the file to a location using the patent number (with the commas removed)
if($file=fopen("uspto-" . str_replace(",","",$patent_number) . ".xml",
"w")) { // open file for writing
fwrite($file, $xmlData); // write to file
};
fclose($file);
/* add mysql data storage here using the following variables in the
SQL INSERT statement
patent number = $patent_number
patent title = $patent_title
patent abstract = $patent_abstract
patent inventors = $patent_inventors
patent assignees = $patent_assignees */
// print something out on the screen so you know the process has finished.
print "<p>Process completed.</p><p>" . htmlentities($xmlData) . "</p>";
// a function to retrieve the patent from the internet
function get_file($filename) {
    // get the info and store it in $file
    $file = file ($filename);
    // tidy up the information by removing unwanted line feeds, multiple spaces etc
    // (preg_replace is used because ereg_replace no longer exists in current PHP)
    $lines = preg_replace("/[\r\t\n]/","",join("",$file));
    $lines = preg_replace("|\s{1,}|"," ",$lines);
    // give the cleaned up information back to the variable calling this function
    return $lines;
}
// a function to clean the information passed to it
function clean($str) {
// remove any html tags from the data given
$str = strip_tags($str);
// remove any whitespace at the beginning and end of the data given
$str = ltrim($str);
$str = rtrim($str);
// give the cleaned up info back to the variable calling this function
return $str;
}
?>
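One caveat if you extend the script: the values are placed into $xmlData exactly as found, so a title or abstract containing & or < would give a file that is not well-formed XML. A minimal sketch of escaping each value first is below. The helper name xml_element and the sample title are made up for illustration and are not part of the script above.

```php
<?php
// Hypothetical helper (not part of the script above): wraps a value in an
// element and escapes &, < and > so the result stays well-formed XML.
function xml_element(string $name, string $value): string
{
    return "<$name>" . htmlspecialchars($value, ENT_XML1 | ENT_NOQUOTES, 'UTF-8') . "</$name>";
}

// Made-up sample values for illustration only.
$patent_number = '6,670,149';
$patent_title  = 'Apparatus for A & B where x < y';

// Build the XML from escaped pieces rather than raw interpolation.
$xmlData = '<item>'
    . xml_element('patent', $patent_number)
    . xml_element('title', $patent_title)
    . '</item>';

print $xmlData . "\n";
```

The same helper could be applied to the abstract, inventors and assignees before the file is written.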
|
Request for Answer Clarification by chesketh-ga on 25 Jul 2004 05:46 PDT
Palitoy, I'm having a few problems with the script. When I run it, I
get the following output:
Warning: file(http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1):
failed to open stream: HTTP request failed! in
/home/chesketh/public_html/getpatent.php on line 64
Warning: join(): Bad arguments. in
/home/chesketh/public_html/getpatent.php on line 66
I have chmodded the directory to 777 which cleared up some issues, but
I'm still left with these puzzling statements. I'm not sure if the
'&' character in the URL is causing problems...
Thanks in advance
Process completed.
<xml><item><patent></patent><title></title><abstract></abstract><inventors></inventors><assignees></assignees></item></xml>
|
Clarification of Answer by palitoy-ga on 25 Jul 2004 06:23 PDT
Thank you for the 5-star rating and tip!
With regard to your problem, I think you are right: the & is causing a
problem. I forgot those were there. There is an easy remedy for
this...
Change:
// retrieve the URL for the patent from the browser address line
$siteurl = $_GET['url'];
To:
// retrieve the URL for the patent from the browser address line
$siteurl = "";
// get each part of the URL and then stitch it back together
// ($_GET with foreach is used because each() and $HTTP_GET_VARS no longer exist in current PHP)
foreach ($_GET as $key => $val) {
    if ( $siteurl == "" ) { $siteurl = $val; }
    else { $siteurl = $siteurl . "&" . $key . "=" . $val; }
}
This should fix the problem, sorry about that...
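A possible alternative, assuming you build the link to the script yourself rather than typing it by hand, is to URL-encode the patent address before adding it after ?url=. The & characters then travel as %26 and the whole address arrives in $_GET['url'] in one piece. This is only a sketch: the function name build_script_link and the example.com address are hypothetical.

```php
<?php
// Hypothetical helper: builds the link used to call the parsing script.
function build_script_link(string $scriptUrl, string $patentUrl): string
{
    // urlencode() turns & into %26, = into %3D and so on, so the patent
    // URL is carried as one single query-string value.
    return $scriptUrl . '?url=' . urlencode($patentUrl);
}

$patentUrl = 'http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL';
$link = build_script_link('http://example.com/getpatent.php', $patentUrl);

// PHP decodes $_GET values automatically; parse_str() mimics that step here
// so the round trip can be checked without a web server.
parse_str(parse_url($link, PHP_URL_QUERY), $params);
print $params['url'] . "\n";
```

With a link built this way, the original line $siteurl = $_GET['url']; should receive the complete address without any stitching.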
|
Request for Answer Clarification by chesketh-ga on 25 Jul 2004 06:31 PDT
Thanks again. Just one more problem: I get the following output (also
note that the inventor and assignee fields are empty):
Warning: fopen(uspto-6727353.xml): failed to open stream: Permission
denied in /home/chesketh/public_html/getpatent.php on line 53
Warning: fclose(): supplied argument is not a valid stream resource in
/home/chesketh/public_html/getpatent.php on line 56
Process completed.
<xml><item><patent>6,727,353</patent><title>Nucleic acid encoding
Kv10.1, a voltage-gated potassium channel from human
brain</title><abstract>The invention provides isolated nucleic acid
and amino acid sequences of Slo potassium family members such as,
antibodies to Kv10 subfamily members such as Kv10.1, methods of
detecting Kv10, subfamily members such as Kv10.1, methods of screening
for potassium channel activators and inhibitors using biologically
active Kv10 subfamily members such as Kv10.1, and kits for screening
for activators and inhibitors of voltage-gated potassium channels
comprising Kv10 subfamily members such as
Kv10.1.</abstract><inventors></inventors><assignees></assignees></item></xml>
|
Request for Answer Clarification by chesketh-ga on 25 Jul 2004 06:36 PDT
Sorry for the trouble. I have solved the file writing problem (just a
chmod oversight), but the inventor and assignee fields remain empty. It
doesn't seem to be pulling the information properly from the table...
|
Clarification of Answer by palitoy-ga on 25 Jul 2004 07:08 PDT
I have just tried to pull that patent myself here and it seems to
work... I used this URL:
http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=/netahtml/srchnum.htm&r=1&f=G&l=50&s1=6,727,353.WKU.&OS=PN/6,727,353&RS=PN/6,727,353
In your script, is this line all on one line?
preg_match_all ("|<TD ALIGN=\"LEFT\" WIDTH=\"90%\">(.*)<\/TD>|imU",$siteinfo,$out, PREG_PATTERN_ORDER);
If not, it should be :-)
|
Clarification of Answer by palitoy-ga on 25 Jul 2004 07:11 PDT
I forgot to mention there should also be a single space between the
ALIGN=\"LEFT\" and the WIDTH=\"90%\" parts.
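If the exact spacing keeps getting lost when the script is copied and pasted, one option is to write the gap between the two attributes as \s+ (one or more whitespace characters) so the pattern no longer depends on a literal single space. The sketch below uses made-up sample markup, not the real USPTO page, and reads the captured text from $out[1] rather than the full match.

```php
<?php
// Made-up sample markup for illustration; the real page may differ.
$html = "<TD ALIGN=\"LEFT\"\n   WIDTH=\"90%\"><B>Doe; Jane</B> (Somewhere, CA)</TD>"
      . "<TD ALIGN=\"LEFT\" WIDTH=\"90%\">Example Corp.</TD>";

// \s+ accepts a space, several spaces or a line break between attributes.
// i = case-insensitive, s = dot also matches newlines, U = ungreedy.
$pattern = '|<TD\s+ALIGN="LEFT"\s+WIDTH="90%">(.*)</TD>|isU';
preg_match_all($pattern, $html, $out, PREG_PATTERN_ORDER);

// $out[1] holds the captured cell contents, in document order.
$patent_inventors = trim(strip_tags($out[1][0]));
$patent_assignees = trim(strip_tags($out[1][1]));

print $patent_inventors . "\n" . $patent_assignees . "\n";
```

Because the pattern is in single quotes here, the double quotes inside it do not need backslashes, which also removes one more thing that can go wrong when copying.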
|
Request for Answer Clarification by chesketh-ga on 25 Jul 2004 08:02 PDT
Thanks palitoy, that did it. The script works perfectly. Thanks again,
I'll keep you in mind for my next project. All the best...
|
Clarification of Answer by palitoy-ga on 25 Jul 2004 08:52 PDT
I'm glad I could help! If you do want to hire me again through Google
Answers simply ask for palitoy-ga in the question title and I will
probably see it (I usually check the site at least a couple of times a
day).
|