This is somewhat of an open-ended question, but I hope to get insight
from experienced web developers out there. I'm hoping to use the Google API
to query 3-4 lists of query terms, each list with somewhere between
2,000 and 5,000 terms. (I realize this might take many days given Google's
API licensing limits.) For each query term, I want to retrieve the
top 10 results in each of 5 different (non-English) languages. For
each resulting page, I just want to keep the sentence or table row
that has the query term.
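To make the extraction step concrete, here is a rough sketch of what I have in mind. It is in Python only for illustration, using just the standard library; the names TextExtractor and sentences_with_term are my own placeholders, and the sentence splitting is deliberately naive (it breaks on ., ! or ? followed by whitespace, which will not suit every language, and it does not handle table rows at all).

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def sentences_with_term(html, term):
    """Return the sentences in an HTML page that contain the query term."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse all runs of whitespace so the split below sees clean text.
    text = " ".join(" ".join(parser.chunks).split())
    # Naive assumption: a sentence ends at . ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    needle = term.casefold()  # case-insensitive match, also for non-ASCII text
    return [s for s in sentences if needle in s.casefold()]
```

Calling sentences_with_term(page_html, "katze") on a downloaded page would then give back only the sentences mentioning the term, which is all I need to keep.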
Then, I want to store these sentences in flat files, an in-memory data
structure, or a database, and do some pretty major string manipulation on them.
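For the storage side, one option I could imagine is a single SQLite file, sketched below with Python's built-in sqlite3 module. The table layout (term, lang, url, sentence) and the function names open_store, save_hits and sentences_for are purely my own assumptions about what I would need, not a settled design.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the store; ':memory:' keeps everything in RAM."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS hits (
               term     TEXT NOT NULL,
               lang     TEXT NOT NULL,
               url      TEXT NOT NULL,
               sentence TEXT NOT NULL
           )"""
    )
    return conn


def save_hits(conn, term, lang, url, sentences):
    """Store every matching sentence from one result page."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO hits (term, lang, url, sentence) VALUES (?, ?, ?, ?)",
            [(term, lang, url, s) for s in sentences],
        )


def sentences_for(conn, term, lang):
    """Fetch the stored sentences for one term in one language."""
    rows = conn.execute(
        "SELECT sentence FROM hits WHERE term = ? AND lang = ? ORDER BY rowid",
        (term, lang),
    )
    return [row[0] for row in rows]
```

Passing a file name instead of ":memory:" would keep the collected sentences on disk between runs, which matters if the querying really does take many days.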
I'm trying to figure out what platforms and tools out there will best
handle these tasks. I don't mind learning entirely new environments /
languages. It's an academic project, and I prefer to stick to
freely/cheaply available tools under Windows because that's what I
have in front of me. However, if there's a great idea under the
Unix/Linux umbrella, I'll consider it.
An answer to this question will outline an end-to-end solution,
mentioning all languages, development tools and libraries needed to
best accomplish this project as quickly as possible. It should
include some non-mainstream, non-obvious information (perhaps
specialized string manipulation or web retrieval tools) that will make
my job easier.