Greetings: I want suggestions and specific help in developing a
Google search strategy that will allow me to identify and find a
historical document that contains the letters "x", "y", and "v", at
various locations within the contents of a document file.
As a "test document", the Declaration of Independence can be used to
determine the accuracy and functionality of the search strategy. The
search strategy should be able to identify the Declaration of
Independence as a document that contains the word "sexes" at word
position 994. Because of the reality of different editions of the
Declaration of Independence, the word position may be at position 884.
The search strategy should be able to identify the Declaration of
Independence as a document that contains the word "fundamentally" at
word position 822, or possibly at position 812 (depending on the
edition of the document). The search strategy should be able to find
the word "valuable" at word position 818, or at position 808 (again
depending on the edition of the declaration of independence).
If the search strategy will function using the Declaration of
Independence as a test document, then, I would like to use the search
strategy to find documents other than the Declaration of Independence
that were printed before 1820 and that contain a word at position 2906
containing the letter "x". Such a document would also contain the
letter "y" at word position 2160; and, the letter "v" would be found
at word position 2018.
I am also interested in locating a document printed before 1820 that
contains a word at position 975 that contains the letter "x". This
document would also contain the letter "y" at word position 952. The
letter "v" would be located at position 951.
In counting words, the following rules should apply: Words divided by
a space, plus sign, hyphen, ampersand, or slash will count as two
words. For example, the phrase "self-evident" would count as two
words. The phrase "cruelty & perfidy" would count as three words. I
only want to find documents that are in the English language and
documents that were published before 1820. I want documents that are
less than 4,000 words in total length. I do not want documents that
are less than 900 words in total length.
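The counting rules above can be sketched in Python. This is only a minimal sketch, not a tested tool: the names `split_words` and `word_at` are hypothetical, and it assumes that a free-standing ampersand counts as a word of its own, which is the only reading under which "cruelty & perfidy" comes to three words. Punctuation stays attached to the word it follows.

```python
import re

# Separators per the stated rules: whitespace, plus sign, hyphen, slash.
# Assumption: an ampersand is kept as a word of its own, so that
# "cruelty & perfidy" counts as three words, as in the example.
TOKEN = re.compile(r"&|[^\s+\-/&]+")

def split_words(text):
    """Return the list of counted words in text."""
    return TOKEN.findall(text)

def word_at(text, position):
    """Return the word at a 1-based position, or None if out of range."""
    words = split_words(text)
    if 1 <= position <= len(words):
        return words[position - 1]
    return None

print(len(split_words("self-evident")))       # 2
print(len(split_words("cruelty & perfidy")))  # 3
```

The length limits (at least 900 and under 4,000 words) could then be checked with `900 <= len(split_words(text)) < 4000`.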
I am not sure that the Google search engines can provide this type of
searching but I am interested in trying and would appreciate your
help.
Many thanks
Waldo |
Request for Question Clarification by
andyt-ga
on
07 Jul 2004 20:28 PDT
Hi waldo555-ga,
As far as I know Google does not provide a built-in, publicly
accessible interface for searching at this level of specificity.
However, with the help of the Google API (://www.google.com/apis/)
there may be a way to program a script to help with this. Together
with manually reviewing any matches or close matches, this may turn up
the results you're looking for. The success of this endeavor depends a
lot on the combined uniqueness of the query terms "x", "y", "v".
For instance, searching for sexes+fundamentally+valuable turns up the
Declaration in the first 10 results, which means it would be feasible
to search Google using the API for these terms, go through the first X
number of results programmatically and return a match if the words
appear in the order specified. If the search terms are "a", "the",
and "is", it would be near impossible to search all results and get a
good match. If there is a match, it would be necessary to manually go
through the matching document to verify that it was the correct one,
such as being dated before 1820.
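The "words appear in the order specified" check could look roughly like the sketch below. It assumes the text of each candidate result has already been downloaded; fetching results through the Google API is not shown. The function name `appear_in_order` and the sample sentence are made up for illustration.

```python
import re

def appear_in_order(text, terms):
    """True if every term occurs in text as a whole word, in the given order."""
    words = [w.lower() for w in re.findall(r"[A-Za-z']+", text)]
    start = 0
    for term in terms:
        try:
            # Look for the next term only after the previous match.
            start = words.index(term.lower(), start) + 1
        except ValueError:
            return False
    return True

# Made-up sample sentence, not a quotation from any document.
sample = "a valuable law was changed fundamentally for both sexes"
print(appear_in_order(sample, ["valuable", "fundamentally", "sexes"]))  # True
print(appear_in_order(sample, ["sexes", "valuable"]))                   # False
```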
Also, are the files you're looking for a specific filetype, such as
only txt? If they're html, it still might work, but it would be
necessary to strip the tags out (which can be done programmatically),
as well as stripping any additional text that is not part of the
original document (which is much harder).
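Stripping the tags can be sketched with Python's standard `html.parser` module. This is a rough sketch under assumptions: the class name `TextOnly`, the function `strip_tags`, and the list of block-level tags are choices made here for illustration. It does nothing about the harder problem of removing text that is not part of the original document, so the page title survives in the output.

```python
from html.parser import HTMLParser

# Tags that should act as word boundaries (an illustrative, incomplete list).
BLOCK = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

class TextOnly(HTMLParser):
    """Collect text content, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1
        elif tag in BLOCK:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip:
            self.skip -= 1
        elif tag in BLOCK:
            self.parts.append(" ")

    def handle_data(self, data):
        if not self.skip:
            self.parts.append(data)

def strip_tags(html):
    """Return the visible text of html with runs of whitespace collapsed."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())

page = "<html><body><h1>Title</h1><p>We hold <b>these</b> truths</p></body></html>"
print(strip_tags(page))  # Title We hold these truths
```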
I'm not completely sure I can accomplish this, but I'd like to give it
a shot if this is the type of answer you're looking for.
Regards,
Andyt-ga
|
Clarification of Question by
waldo5555-ga
on
07 Jul 2004 21:14 PDT
Hi Andyt-ga: I didn't know about Google API and I also lack the programming skills
to take advantage of it. But, I hope that you pursue the
identification of the documents that I'm looking for. About the query
words: my hypothesis is that the documents can be identified by finding
an "x" at word positions 994, 884, 2906, 951 and 975. There should be
three documents. There are relatively few words that
contain the letter "x". Words such as: experience, exercise,
exposed, taxes, example, executioners, excite, sexes, and extend. I'm
sure that there are many more that contain "x" but I don't know how to
structure the search query.
If my guess is correct, then the letter "y" would be located at
positions 822, 812, 2160 and 952 in the three documents. There are
many more words that contain "Y" than "x-words".
The logical extension of this thinking would locate "v" at word
positions 818, 808, 2018 and 951.
I would be interested in documents that met the "x" test. I would be really
interested in documents that met the "x" and "y" test, and I would be ecstatic
if a document met the "x", "y", and "v" test. I would also be interested in
documents that came "close" to the identified word positions, e.g., plus or
minus three word positions.
The file type would be .txt and not HTML. The time of publication is
not of great importance but I do believe that the documents were
published sometime in the 18th century, or before.
Please proceed and make any comments or ask questions if they should arise.
Many Thanks
Waldo
|
Request for Question Clarification by
andyt-ga
on
07 Jul 2004 21:53 PDT
It looks like there are about 10,000 words with the letter x in them,
according to the Moby Words project
(http://onlinebooks.library.upenn.edu/webbin/gutbook/lookup?num=3201).
So the strategy of searching on each individual 'x-word' is out. If
any other researcher wants to try this, I'm all out of suggestions for
now.
Andyt-ga
|
Request for Question Clarification by
webadept-ga
on
07 Jul 2004 23:45 PDT
Hi, Andy is correct: it is not really a matter of finding the x in
space 523 or whatever in the document; that is the easy part. I could
do that in my sleep, as I'm sure Andy could as well. What we are
dealing with here is the huge number of words with an X in them, and
the massive number of documents that use those words. Take "experience",
for instance. Can you think of a more common word used in a document of
any size, and any seriousness? Other than perhaps 'I'? Well, you get
my point.
In order to create such a search, the search engine would have to
have some type of filter to start with, and the best one I can think
of at this time is the document title. Or, at least, a search that
could, in a reasonable way, describe the document by title.
Such an engine would not really be able to start there; it would have
to have a rather large knowledge base behind it as well, but with
Google's search engine and the API together this could be overcome.
Just searching for any document with X in the right spot is not
feasible at this time in the game. There are TeraBytes of documents
out there on the Internet. That is TeraBytes, not GigaBytes. Such an
Engine would require Multiple TeraBytes of data tables for every
Terabyte of data on the Internet to search for key letters in key
positions in every document. Again, not feasible. And even if it were,
even if I possessed TeraBytes of data space and GigaHertz of bandwidth
to send my bots forth to index the world and its letters, I certainly
wouldn't spend them on this type of search engine. So, not only, not
feasible, but not practical in the hopes that a greater tool than
Google will soon come forth to meet this challenge.
However, with a place to start, we cut down that huge area to a
manageable size, and a manageable scale.
Another method would be to limit the search area, by space rather than
description. For example, if you wanted a search engine that could
find, in a certain repository, a document where X marks the spot, then we
are back in the ball game as well. Such a search could be done even
without the use of Google's API, but could be done much better and
faster if Google has indexed the repository as well. Without it, the
search could take quite a long time, and would have to email you
results, rather than showing them to you directly on a page. The
searches would get faster the more the engine was used, and it is
possible that at some point results could be shown on the page, without
Google's API there to help out, but that would be up to the owner of
the engine.
If you can consent to either starting with a document description in
the search along with X's position in the document sought, or can use
the repository method, where only a specific, preset (as in not
changing) repository is searched, I see methods to solve your need.
Otherwise, I too would need to bow out of this question.
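A rough sketch of the repository method, assuming the repository is simply a local folder of plain-text files: the names `has_letter_near` and `scan_repository`, the UTF-8 encoding, and the reuse of the questioner's word-splitting rules (with a free-standing ampersand counted as a word) are all assumptions made for illustration.

```python
import re
from pathlib import Path

# Words are split on whitespace, plus sign, hyphen and slash;
# a free-standing ampersand is assumed to count as a word.
TOKEN = re.compile(r"&|[^\s+\-/&]+")

def has_letter_near(words, letter, position, tolerance=0):
    """True if a word within `tolerance` of the 1-based position contains letter."""
    lo = max(1, position - tolerance)
    hi = min(len(words), position + tolerance)
    return any(letter in words[i - 1].lower() for i in range(lo, hi + 1))

def scan_repository(folder, tests, tolerance=0):
    """Yield the .txt files in folder that pass every (letter, position) test."""
    for path in sorted(Path(folder).glob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="replace")
        words = TOKEN.findall(text)
        if all(has_letter_near(words, l, p, tolerance) for l, p in tests):
            yield path

# Hypothetical usage, with "texts" standing in for the repository folder:
# for hit in scan_repository("texts", [("x", 2906), ("y", 2160), ("v", 2018)], 3):
#     print(hit)
```

Each file is read once and all the letter tests are applied to it in the same pass, so the cost grows with the size of the repository rather than with the number of tests.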
webadept-ga
|
Clarification of Question by
waldo5555-ga
on
08 Jul 2004 10:54 PDT
Dear Andyt-ga and webadept-ga, and other silent observers:
Thank you for your interest and suggestions. Andy, you raised some real hopes
by demonstrating that searching for "sexes+fundamentally+valuable" you
could identify the declaration of independence within 10 choices. Not
bad. My thinking would then lead me to try and see if the search
engine could search for *x* plus *y*, or *y, plus *v" or *v.
Webadept-ga, you suggested that using the document file would provide
a way of limiting the search; however, I don't know the name of the
"document title", or perhaps I don't understand what you mean by
"document title". That leaves the repository method to explore.
Sounds ok to me. These documents could be considered historical, or
literary, or political or governmental or even legal. I would be
interested in pursuing the repository method further.
Please help me in my thinking. I envisioned a spider scanning the
terabytes of literature for documents that contained the letter "v" at
position 2018 and thus identifying 45 million documents; then scanning
that database of documents that contain "y" at position 2160
containing a much smaller number of documents and lastly scanning the
remaining documents for "x" at position 2906. Is this process done in
sequence as I have suggested or is this done simultaneously? Please
help me understand.
In summary: I need a search scheme that will identify documents by
the appearance of the letters "x", "y" and "v". The First document
that I need contains
the letter "x" at the word position of 2906; "y" at 2160 and "v" at 2018.
The Second document that I need contains the letter "x" at word position 975 and
the letter "y" at position 952 and the letter "v" at position 951
The accuracy and function of the search scheme can be tested by using
the Declaration of Independence by testing for "x" at word position
994 or 884; "y" at position 822 or 812 and
"v" at word position 818 or 808.
I look forward to hearing from you.
Best wishes,
Waldo
|
Request for Question Clarification by
pafalafa-ga
on
08 Jul 2004 11:18 PDT
Hello Waldo,
I'm one of the "silent observers" here, but I'm ready to jump in with a thought.
Can you tell us WHY you want to conduct such an unusual search?
It sounds as if you're trying to establish a lexical "fingerprint" for
certain documents, and if that's the case, then there may be other
strategies that would meet your needs, and that are easier to put into
practice.
I'm not saying your original request is impossible. But it would help
me (and probably webadept, and andyt, and anyone else out there) to
understand why you want to construct a word-894-has-an-x type of
query, and to think about the variety of lexical/search tools that
might best meet your needs.
Thanks.
pafalafa-ga
|
Clarification of Question by
waldo5555-ga
on
09 Jul 2004 06:10 PDT
Welcome Pafalafa-ga and thanks for your interest.
I didn't recognize that my search inquiry was unusual. There are two reasons
for my search. One is to try and solve a book cipher that was written
200 years ago and has not been solved. The area of cryptography is a
very minor interest for me. The other reason is to try and get an
intellectual "handle" on the incredible power of search engines and to
determine the nature of their limitations. I'm a retired ophthalmologist
and I'm looking for intellectual "fodder". The search for documents that
can be identified by the position of "x", "y", and "v" won't stop until I'm
convinced that Google and spiders don't have that capability. But my
experience in learning Google has only caused me to have greater
appreciation and intrigue about the considerable power in searching.
I hope that you will continue to offer suggestions.
Waldo
|
Request for Question Clarification by
pafalafa-ga
on
09 Jul 2004 07:56 PDT
As I understand your question, you are asking for something QUITE
unusual in terms of search strategies. In particular, there are two
things that stand out:
--You are asking for information on word "position" -- is the word at
position 885, etc. Search engines are not generally designed to
provide this function.
--Secondly, you are asking for words containing specific letters.
Again, this is not something search engines ordinarily do. Google --
and all other search engines -- look for complete words. There is no
search-engine capacity that I am aware of to conduct a search such as
"find words that have an "X" in them". And there is certainly no
capacity to "find words at position 885 that have an "X" in them".
Of course, not all text searching tools are search engines. There is
an entire field of text analysis that has developed a suite of tools
for parsing and exploring the subtle details of a text. Once again,
most of the emphasis is on words-in-context (i.e. is the word used as
a noun, verb, etc.), rather than on word position, or letters within
words. However, a good programmer (which I am not) could probably
create the search tool that you needed.
You might want to have a look at one of these offbeat text search
tools to get a bit familiar with it (they are not easy to use). You
can find a tool known as a "regular expressions" search tool at the
National Puzzler's League website:
http://www.puzzlers.org/wordlists/grepdict.php
It really takes a few days of poking around here to begin to get a
feel for how the search tool works (their onsite instructions are just
awful!).
For instance, searching on:
^..x..$
[go ahead...copy the above line and paste it into the search box]
will give you a set of 281 words that are 5-characters in length, and
have an "X" in the 3rd position.
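The same pattern can be tried in Python with the standard `re` module. The word list below is a small made-up sample, not the puzzlers.org dictionary, so the count will not match the 281 quoted above.

```python
import re

# Made-up sample list for illustration only.
words = ["taxes", "sexes", "boxer", "extra", "exist", "fix", "Mexico"]

# ^ and $ anchor the whole word; each dot stands for exactly one character.
pattern = re.compile(r"^..x..$")

print([w for w in words if pattern.match(w)])  # ['taxes', 'sexes', 'boxer']
```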
Is this of interest at all as "fodder" for your explorations? Let me
know what you think.
|
Clarification of Question by
waldo5555-ga
on
10 Jul 2004 12:51 PDT
July 10, 2004: Dear Andyt-ga, Webadept-ga, and all other silent
observer searchers: I'm much more aware of the problems associated
with the type of search that I've requested, and I still want to
continue. The filter that is necessary to limit the sea of documents
will have to consist of a depository that can be identified by
1. language (English)
2. date of publication (prior to 1820)
3. subjects (American revolutionary) (legal and political) and literature.
My intuition suggests to me that Webadept-ga sees the necessity of
having the spiders do all of the work. My ignorance on this subject
would lead me into a room with a very large mountain of documents that
only partially met my requirements.
Andyt: You demonstrated the feasibility of my thinking by showing that
a search for "sexes+fundamentally+valuable" could identify the
declaration of independence. That means that the "test" search for a
document with x at 994 can be eliminated. That means there are only
two documents that I want to identify. Document "A" will have an "x"
at position 2906. Document "A" will probably have a "y" at 2160 and
may well have a "v" at 2018.
Document "B" will have an "x" at position 975 and probably have a "y"
at position 952 and may well have a "v" at 951. I would like to
purchase a search scheme
that would identify documents that pass the (1) "x" test only and (2) documents
that passed the "x", "y", and "v" tests. I would be concerned about
the counting of the words, and would like to be sure that my
previous requirements about counting words that are hyphenated,
spaced, or associated with ampersands are strictly followed. Is there
some way that I can find out which repositories of documents have been
indexed by Google? An additional "filter" might be the date of
publication (prior to 1820). I believe that it will be possible to
identify these two documents with your help along with some good
fortune. I have been unable to download Google's API developers
"package" but I will continue to try and hope that success will
follow.
I look forward to hearing from you.
Best wishes,
Waldo.
|
Clarification of Question by
waldo5555-ga
on
24 Jul 2004 16:33 PDT
Question: This is a modification of question 370562, posted July 6,
in which I requested a search strategy for identifying a historical
document by identifying the location of the letters "x", "y", and "v"
within the document. I do not know the title, or name, or author of
the two documents.
After reviewing the comments and suggestions, I've attempted to modify
and scale down the requirements of this search strategy. The search
strategy is essentially a search for the letter "x" at word position
2906 and word position 975.
The search strategy must do the following:
A. Count the words in the document. In counting the words of the document,
the following rules should apply. Counting starts with the first word of the
first paragraph. Counting does not start with the title of the document. Words
that are divided by a space, plus sign, hyphen, ampersand, or slash will count
as two words. For example, the phrase "self-evident" would count as two
words. The phrase "cruelty & perfidy" would count as three words. The accuracy
of the count should be plus or minus two words. Word position 2906 should
cover words from position 2904 to 2908. I only want to find documents that
are in the English language and documents that were published before 1820.
B. Identify all documents that have the letter "x" at the 2906 (2904-2908)
word position. The search strategy should be able to find "x" at any letter
position in the word. For example, a first position would be a word such as
"xylophone"; a second position word would be a word such as "exist", and a
third position word would be a word such as "Mexico".
The website www.puzzlers.org/wordlists/ has a search engine that will identify
letters so that a key-letter search can be performed. The puzzlers website
identifies 20,844 words with "x" in the second position (using a search
request of (.x.*)). Words with "x" in the third position totaled 15,914, using
a search request of (..x.*); and 11,778 words with "x" in the fourth position
(using a search request of (...x.*)). I wonder if this strategy could be used
as part of a Google search? If possible, I would like the Google search
strategy to identify "x" in any letter position from 1-12; however, I would
accept a strategy that identifies "x" in letter positions second and third,
because I believe that this would cover the most likely possibilities.
C. When a document is found with "x" at word position, 2906 (2904-2908)
then print or email the document with the words at position 2906, and
position 2160 (2158-2162) and position 2018 (2016-2020) "highlighted" by
an asterisk, or by underlining, or by capitalization.
D. When a document is found with "x" at word position 975 (973-977) then
print or email the document with the words at position
975 (973-977) "highlighted". The words at position 952 (950-954) and
position 951 (949-953) should also be highlighted in a similar manner, by
an asterisk, by underlining, or by capitalization.
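Requirements A through D could be sketched in Python roughly as follows. This is only a sketch under stated assumptions: the function name `check_and_highlight` is hypothetical, the text passed in is assumed to begin at the first word of the first paragraph (the title having been removed by hand), a free-standing ampersand is assumed to count as a word, and every word inside each plus-or-minus window is highlighted by capitalization. Finding candidate documents and checking their language and date are not covered.

```python
import re

# Words are split on whitespace, plus sign, hyphen and slash;
# a free-standing ampersand is assumed to count as a word.
TOKEN = re.compile(r"&|[^\s+\-/&]+")

def window(position, tolerance):
    """1-based positions from position - tolerance to position + tolerance."""
    return range(position - tolerance, position + tolerance + 1)

def check_and_highlight(body, x_position, other_positions, tolerance=2):
    """Return body with the target words capitalized if a word near
    x_position contains "x"; return None if the document fails the test.
    The result is rejoined with single spaces, so original hyphens,
    slashes and line breaks are not preserved."""
    words = TOKEN.findall(body)

    def valid(i):
        return 1 <= i <= len(words)

    if not any(valid(i) and "x" in words[i - 1].lower()
               for i in window(x_position, tolerance)):
        return None

    marked = set()
    for pos in [x_position, *other_positions]:
        marked.update(i for i in window(pos, tolerance) if valid(i))
    return " ".join(w.upper() if i in marked else w
                    for i, w in enumerate(words, 1))

# Made-up sample text; positions are 1-based.
sample = "we have taxes on every very good thing"
print(check_and_highlight(sample, 3, [6], tolerance=0))
# we have TAXES on every VERY good thing
```

For the two documents described above, the calls would be `check_and_highlight(body, 2906, [2160, 2018])` and `check_and_highlight(body, 975, [952, 951])`, where `body` is the text of a candidate document.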
I hope that these modifications to my original question will allow someone to
answer my question.
I've been unable to download the Google API. The zipped file reads
"corrupted or invalid" whenever I attempt to unzip the file. I hope that the
Google Search strategy would be simple enough that I can perform a
search without the necessity of getting a doctorate degree in computer
programming.
Your thoughts and ideas are welcomed.
Waldo.
|
Request for Question Clarification by
pafalafa-ga
on
24 Jul 2004 17:15 PDT
For a while I found myself wondering "Where's Waldo?" (sorry...!).
Nice to hear back from you.
I'm glad to see you had a chance to play around with the puzzler's
website -- they have some nifty features there. And though their
letter-in-a-word tool is useful, it's still not going to get you
closer to your search strategy...especially if you're looking for
something quick and easy.
For starters, there's the counting problem.
Imagine a search tool coming across this article in today's Washington Post:
==========
http://www.washingtonpost.com/wp-dyn/articles/A10570-2004Jul23.html
Beauty and the Bicycle:
The Art of Going the Distance
By Sarah Kaufman
Washington Post Staff Writer
Saturday, July 24, 2004
Page C01
LE GRAND BORNAND, France -- Follow the Tour de France for any amount
of time and it becomes clear that this bicycle race is not just a
sport and a science -- it is also an art....
====
Your eye -- attached to your wonderful human brain -- has no trouble
at all picking out "Follow" as the first word of the first paragraph.
It is there that you would start your counting.
For a computer search program, this is an incredibly sophisticated
challenge, and one that -- even in the hands of the best programmers
-- would be very prone to error.
Is the right place to start counting at "Beauty"? At "The"? At "By"?
At "Washington"? At "Page"? At "Le"? Your brain can do it.
Computers can't. It's just that painfully simple, I'm afraid.
I'm only pointing out one (probably the largest) challenge to getting
you the tool you want...there are quite a number of others as well.
Bottom line...there's no simple way to do what you want. There may
not even be a reliable complicated way to do it. Some word challenges
just aren't handled very well by computers, which is why there is not
yet a reliable translation tool available (a task some early
programmers thought would be relatively easy....Ha!!).
I'm still not clear just WHY you want to look for words containing an
X at position 2906 plus or minus two words. If we knew more about the
WHY of your quest, perhaps an alternative search strategy would
suggest itself.
There are many ways to find or identify particular
documents/passages/excerpts (anti-plagiarism software programs do it
all the time). If this is your goal, one of these programs may be
able to help.
Just some thoughts...I'd be interested to hear your reaction.
pafalafa-ga
|
Clarification of Question by
waldo5555-ga
on
24 Jul 2004 21:38 PDT
Dear Pafalafa-ga: Thanks for your continuing interest. The
why of this search scheme is a very lengthy explanation which I will
provide, if my hypothesis is correct. I will be happy to try and
explain why I believe that word position 2906 is a very special and
unique position for the letter "x"; but for now all I can do is to
ask for your indulgence. The capability of a search engine to search
for "key-words" is a tool that I would like to push to its limits.
However, I believe that in principle the search engines can be
programmed to provide key-letter searches that will have very useful
results in very specific instances.
The counting of the words is obviously one of the many limiting factors that
have to be considered. There are many other limitations. Slight changes
in various editions of the same document are another. But I still want
to try the approach that I've outlined. To answer your question about
where to start: the answer is that counting should start with the
first word of the first paragraph of the actual text. In the example
that you gave, the first word would be "Follow".
I'm a little anxious to put my ideas and hunches to the test and
appreciate your reference to puzzlers.org because it has confirmed
that the technology already exists to do "key-letter" searches and
that if the counting problem can be solved then I have an opportunity
to identify these two old, but very special documents. I appreciate
your patience and interest.
Best wishes,
Waldo
|
Clarification of Question by
waldo5555-ga
on
04 Aug 2004 07:56 PDT
Greetings to Andyt-ga: Hi Andy: My question ID 370562 is about to expire.
I'm only a little further down the road to finding an answer. Because you have
previously expressed a willingness to approach this problem, I wonder if you will
try to offer me some help along the lines of my last "clarification".
I look forward to hearing from you.
Waldo
|