From your question I presume that you don't care so much how Google
does it as you want to be able to do it yourself. Below are several
options, ranging from a free e-mail conversion service run by Adobe
(which will accept batch submissions), to some plug-ins for Adobe
Acrobat, to source code for a conversion program which will run under
*nix. Google-specific questions are answered by Google staff, but I can
provide a simple way to convert files or pages from .pdf to .html or
.txt if that's what you want.
The simplest way is to just use Adobe's accessibility tools. Start
here for the complete set of free online tools.
Or, simply send Web sites or files directly to Adobe's conversion
sites as explained next.
E-mail the URL (Web address) of a PDF document in the body of an
e-mail message to email@example.com and you'll get back a .TXT (plain
text) translation. Send the e-mail to firstname.lastname@example.org and you'll
get back an HTML file.
If you have a .PDF document on local media, send it as a MIME
attachment in an e-mail to email@example.com for plain text or to
firstname.lastname@example.org for HTML format.
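
If you want to script the batch submissions rather than send each
message by hand, something along these lines should work. This is only
a rough sketch in Python using the standard library. The two service
addresses are the placeholders shown above (not the real Adobe
addresses), and the function names, the subject line and the body text
are my own assumptions, not anything the service requires.

```python
from email.message import EmailMessage

# Placeholder addresses copied from the text above; substitute the real
# plain-text and HTML conversion addresses before using this.
TEXT_SERVICE = "email@example.com"
HTML_SERVICE = "firstname.lastname@example.org"


def build_url_request(pdf_url, sender, want_html=False):
    """Build a message whose body is just the URL of the PDF."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = HTML_SERVICE if want_html else TEXT_SERVICE
    msg["Subject"] = "PDF conversion request"  # assumed; any subject
    msg.set_content(pdf_url + "\n")
    return msg


def build_file_request(pdf_bytes, filename, sender, want_html=False):
    """Build a message carrying a local PDF as a MIME attachment."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = HTML_SERVICE if want_html else TEXT_SERVICE
    msg["Subject"] = "PDF conversion request"  # assumed; any subject
    msg.set_content("PDF attached for conversion.\n")
    msg.add_attachment(
        pdf_bytes, maintype="application", subtype="pdf", filename=filename
    )
    return msg
```

To actually send a message you would hand it to your own mail server,
for example with smtplib.SMTP("localhost").send_message(msg), assuming
you have an SMTP server listening there.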
Response time is usually a few minutes. I have been posting these
links on my accessibility Web site for years, so no research was
needed.
For more options, including the ability to convert from .TIFF to .PDF
or vice versa, or to convert .PDF to a number of different formats see
The major problem with this site is that the plug-ins don't work with
the free, downloadable Acrobat Reader; you must have the full
commercial Adobe Acrobat program.
PDFTOHTML is a program which may answer the second part of your
question about a GPL program that does the conversion.
This utility converts .PDF files to HTML or XML formats.
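
If you would rather drive it from a script than type the command each
time, here is a minimal sketch in Python. It assumes the pdftohtml
binary is installed and on your PATH; the helper names and the file
names in the usage note are made up for illustration. The only option
used is -xml, which switches the output from HTML to XML.

```python
import shutil
import subprocess


def pdftohtml_command(pdf_path, out_base, xml=False):
    """Return the argument list for one pdftohtml run."""
    cmd = ["pdftohtml"]
    if xml:
        cmd.append("-xml")  # produce XML instead of HTML
    cmd += [pdf_path, out_base]
    return cmd


def convert(pdf_path, out_base, xml=False):
    """Run pdftohtml, raising an error if it is missing or fails."""
    if shutil.which("pdftohtml") is None:
        raise FileNotFoundError("pdftohtml not found on PATH")
    subprocess.run(pdftohtml_command(pdf_path, out_base, xml), check=True)
```

For example, convert("report.pdf", "report") should leave the HTML
output next to the original, and convert("report.pdf", "report",
xml=True) should produce the XML version instead.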
There's also XPDF, an open source .PDF viewer released under the GPL.
This might be the best option if you want to get into the nitty-gritty
of creating your own software or modifying someone else's programs.
If you really wanted to know how Google does this, you'll have to
contact them directly, but I doubt you'd be able to use the same
technology, so I hope this provides what you really wanted.
pdf conversion gpl