Google Answers: Develop a Google Search Strategy


Q: Develop a Google Search Strategy ( No Answer, 0 Comments )

Question

Subject: Develop a Google Search Strategy
Category: Computers > Programming
Asked by: waldo5555-ga
List Price: $75.00

Posted: 06 Jul 2004 18:00 PDT
Expires: 05 Aug 2004 18:00 PDT
Question ID: 370562

Greetings: I want suggestions and specific help in developing a Google search strategy that will allow me to identify and find a historical document that contains the letters "x", "y", and "v" at various locations within the contents of a document file.
As a "test document", the Declaration of Independence can be used to determine the accuracy and functionality of the search strategy. The search strategy should be able to identify the Declaration of Independence as a document that contains the word "sexes" at word position 994. Because of the reality of different editions of the Declaration of Independence, the word position may be at position 884. The search strategy should be able to identify the Declaration of Independence as a document that contains the word "fundamentally" at word position 822, or possibly at position 812 (depending on the edition of the document). The search strategy should be able to find the word "valuable" at word position 818, or at position 808 (again depending on the edition of the Declaration of Independence).
If the search strategy will function using the Declaration of Independence as a test document, then I would like to use the search strategy to find documents other than the Declaration of Independence that were printed before 1820 and that contain a word at position 2906 that contains the letter "x". This document would also contain the letter "y" at word position 2160, and the letter "v" would be found at word position 2018. I am also interested in locating a document printed before 1820 that contains a word at position 975 that contains the letter "x". This document would also contain the letter "y" at word position 952. The letter "v" would be located at position 951.
In counting words, the following rules should apply: Words divided by a space, plus sign, hyphen, ampersand, or slash will count as two words. For example, the phrase "self-evident" would count as two words. The phrase "cruelty & perfidy" would count as three words.
I only want to find documents that are in the English language and documents that were published before 1820. I want documents that are less than 4,000 words in total length. I do not want documents that are less than 900 words in total length.
I am not sure that the Google search engines can provide this type of searching, but I am interested in trying and would appreciate your help. Many thanks, Waldo
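The counting rules in the question can be sketched in Python. This is only an illustration and was not posted in the thread: the name `split_words` is hypothetical, and treating a free-standing "&" as a word of its own is an assumption, made so that "cruelty & perfidy" comes out as three words as the question requires.

```python
import re

# Separators named in the question: plus sign, hyphen, ampersand, slash.
# Spaces are handled by str.split() below.
SEPARATORS = re.compile(r"[+\-&/]")


def split_words(text):
    """Split text into words under the question's counting rules."""
    words = []
    for chunk in text.split():
        if chunk == "&":
            # Assumption: a free-standing ampersand counts as one word.
            words.append(chunk)
        else:
            # Drop empty pieces left by leading, trailing, or doubled separators.
            words.extend(piece for piece in SEPARATORS.split(chunk) if piece)
    return words
```

Under these assumptions `split_words("self-evident")` yields two words and `split_words("cruelty & perfidy")` yields three. Punctuation is left attached to words, which does not change the count.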
Request for Question Clarification by andyt-ga on 07 Jul 2004 20:28 PDT
Hi waldo5555-ga, As far as I know, Google does not provide a built-in, publicly accessible interface to search down to this level of specificity. However, with the help of the Google API (http://www.google.com/apis/) there may be a way to program a script to help with this. Together with manually reviewing any matches or close matches, this may turn up the results you're looking for.
The success of this endeavor depends a lot on the combined uniqueness of the query terms "x", "y", "v". For instance, searching for sexes+fundamentally+valuable turns up the Declaration in the first 10 results, which means it would be feasible to search Google using the API for these terms, go through the first X number of results programmatically, and return a match if the words appear in the order specified. If the search terms are "a", "the", and "is", it would be nearly impossible to search all results and get a good match. If there is a match, it would be necessary to manually go through the matching document to verify that it was the correct one, such as being dated before 1820.
Also, are the files you're looking for of a specific filetype, such as only txt? If they're html, it still might work, but it would be necessary to strip the tags out (which can be done programmatically), as well as stripping any additional text that is not part of the original document (which is much harder).
I'm not completely sure I can accomplish this, but I'd like to give it a shot if this is the type of answer you're looking for. Regards, Andyt-ga
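The two programmatic steps mentioned in the post above, stripping HTML tags and checking that the query words appear in the order specified, might look roughly like the Python sketch below. The search-API call itself is left out, since the thread gives no details of it. The names `strip_tags` and `appear_in_order` are hypothetical, and the sample HTML in the usage note is made up for illustration.

```python
import re
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collects text between tags (script and style text is not filtered out)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html):
    """Return the text of an HTML page with the tags removed."""
    parser = _TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def appear_in_order(text, terms):
    """True if every term occurs as a whole word, in the given order."""
    words = re.findall(r"[a-z]+", text.lower())
    start = 0
    for term in terms:
        try:
            # Look for the term only after the previous match.
            start = words.index(term, start) + 1
        except ValueError:
            return False
    return True
```

For example, `appear_in_order(strip_tags(page), ["valuable", "fundamentally", "sexes"])` tests the three Declaration words in the order of their stated positions (818, 822, 994). Removing navigation text or other material that is not part of the original document is the harder problem and is not attempted here.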
Clarification of Question by waldo5555-ga on 07 Jul 2004 21:14 PDT
Hi Andyt-ga: I didn't know about the Google API, and I also lack the programming skills to take advantage of it. But I hope that you pursue the identification of the documents that I'm looking for.
About the query words: my hypothesis is that the documents can be identified by finding an "x" at word positions 994, 884, 2906, 951 and 975. There should be three documents. There are relatively few words that contain the letter "x", words such as: experience, exercise, exposed, taxes, example, executioners, excite, sexes, and extend. I'm sure that there are many more that contain "x", but I don't know how to structure the search query. If my guess is correct, then the letter "y" would be located at positions 822, 812, 2160 and 952 in the three documents. There are many more words that contain "y" than "x-words". The logical extension of this thinking would locate "v" at word positions 818, 808, 2018 and 951.
I would be interested in documents that met the "x" test. I would be really interested in documents that met the "x" and "y" test, and I would be ecstatic if a document met the "x", "y", and "v" test. I would be interested in documents that came "close" to the identified word positions, e.g., plus or minus three word positions.
The file type would be .txt and not HTML. The time of publication is not of great importance, but I do believe that the documents were published sometime in the 18th century, or before. Please proceed and make any comments or ask questions if they should arise. Many thanks, Waldo
Request for Question Clarification by andyt-ga on 07 Jul 2004 21:53 PDT
It looks like there are about 10,000 words with the letter x in them, according to the Moby Words project (http://onlinebooks.library.upenn.edu/webbin/gutbook/lookup?num=3201). So the strategy of searching on each individual "x-word" is out. If any other researcher wants to try this, I'm all out of suggestions for now. Andyt-ga
Request for Question Clarification by webadept-ga on 07 Jul 2004 23:45 PDT
Hi, Andy is correct. It is not really a matter of finding the x in space 523 or whatever in the document; that is the easy part. I could do that in my sleep, as I'm sure Andy could as well. What we are dealing with here is the huge number of words with an X in them, and the massive number of documents that use those words. "Experience", for instance: can you think of a more common word used in a document of any size and any seriousness? Other than perhaps "I"? Well, you get my point.
In order to create such a search, the search engine would have to have some type of filter to start with, and the best one I can think of at this time is the document title, or at least a search that could, in a reasonable way, describe the document by title. Such an engine would not really be able to start there; it would have to have a rather large knowledge base behind it as well, but with Google's search engine and the API together this could be overcome.
Just searching for any document with X in the right spot is not feasible at this time in the game. There are TeraBytes of documents out there on the Internet. That is TeraBytes, not GigaBytes. Such an engine would require multiple TeraBytes of data tables for every TeraByte of data on the Internet to search for key letters in key positions in every document. Again, not feasible. And even if it were, even if I possessed TeraBytes of data space and GigaHertz of bandwidth to send my bots forth to index the world and its letters, I certainly wouldn't spend them on this type of search engine. So: not only not feasible, but not practical, in the hopes that a greater tool than Google will soon come forth to meet this challenge.
However, with a place to start, we cut down that huge area to a manageable size and a manageable scale. Another method would be to limit the search area by space rather than by description. For example, if you wanted a search engine that could find, in a certain repository, a document where X marks the spot, then we are back in the ball game as well. Such a search could be done even without the use of Google's API, but could be done much better and faster if Google has indexed the repository as well. Without it, the search could take quite a long time, and would have to email you results rather than showing them to you directly on a page. The searches would get faster the more the engine was used, and it is possible that at some point results could be shown on the page without Google's API there to help out, but that would be up to the owner of the engine.
If you can consent either to starting with a document description in the search, along with X's position in the document sought, or to using the repository method, where only a specific, preset (as in not changing) repository is searched, I see methods to solve your need. Otherwise, I too would need to bow out of this question. webadept-ga
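The "repository method" described above can be pictured with a small Python sketch. It assumes the repository is simply a local folder of .txt files, which the thread does not specify. The names `letter_near_position` and `scan_repository` are hypothetical, and words are split on whitespace only, so the asker's separator rules are not applied here.

```python
from pathlib import Path


def letter_near_position(words, letter, position, tolerance=0):
    """True if a word within +/- tolerance of the 1-based position contains letter."""
    lo = max(position - tolerance, 1)
    hi = min(position + tolerance, len(words))
    # If the document is shorter than the window, the range is empty.
    return any(letter in words[i - 1].lower() for i in range(lo, hi + 1))


def scan_repository(folder, letter, position, tolerance=0):
    """Return the names of .txt files in folder that pass the letter test."""
    matches = []
    for path in sorted(Path(folder).glob("*.txt")):
        words = path.read_text(encoding="utf-8", errors="replace").split()
        if letter_near_position(words, letter, position, tolerance):
            matches.append(path.name)
    return matches
```

Because every file is read in full, the cost grows with the size of the repository, which is in line with the point made above that a fixed, limited repository makes the problem manageable while the open Internet does not.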
Clarification of Question by waldo5555-ga on 08 Jul 2004 10:54 PDT
Dear Andyt-ga and webadept-ga, and other silent observers: Thank you for your interest and suggestions.
Andy, you raised some real hopes by demonstrating that by searching for "sexes+fundamentally+valuable" you could identify the Declaration of Independence within 10 choices. Not bad. My thinking would then lead me to try and see if the search engine could search for "x plus y", or "y plus v", or "*v".
Webadept-ga, you suggested that using the document file would provide a way of limiting the search; however, I don't know the name of the "document title", or perhaps I don't understand what you mean by "document title". That leaves the repository method to explore. Sounds ok to me. These documents could be considered historical, or literary, or political, or governmental, or even legal. I would be interested in pursuing the repository method further.
Please help me in my thinking. I envisioned a spider scanning the terabytes of literature for documents that contained the letter "v" at position 2018 and thus identifying 45 million documents; then scanning that set for documents that contain "y" at position 2160, leaving a much smaller number of documents; and lastly scanning the remaining documents for "x" at position 2906. Is this process done in sequence as I have suggested, or is it done simultaneously? Please help me understand.
In summary: I need a search scheme that will identify documents by the appearance of the letters "x", "y" and "v". The first document that I need contains the letter "x" at word position 2906, "y" at 2160 and "v" at 2018. The second document that I need contains the letter "x" at word position 975, the letter "y" at position 952, and the letter "v" at position 951. The accuracy and function of the search scheme can be tested by using the Declaration of Independence, testing for "x" at word position 994 or 884, "y" at position 822 or 812, and "v" at word position 818 or 808.
I look forward to hearing from you. Best wishes, Waldo
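On the question of whether the three tests run in sequence or simultaneously: for a program that already has the documents in hand, the two approaches select the same documents, and only the amount of work differs. The Python sketch below illustrates this under that assumption. All names are hypothetical, and each document is represented as a list of words that has already been split.

```python
def has_letter_at(words, letter, position):
    """True if the word at the 1-based position exists and contains letter."""
    return 1 <= position <= len(words) and letter in words[position - 1].lower()


def filter_in_sequence(documents, tests):
    """Apply one test at a time, shrinking the set of candidates each round."""
    remaining = dict(documents)
    for letter, position in tests:
        remaining = {
            name: words
            for name, words in remaining.items()
            if has_letter_at(words, letter, position)
        }
    return sorted(remaining)


def filter_in_one_pass(documents, tests):
    """Check every test on each document during a single pass."""
    return sorted(
        name
        for name, words in documents.items()
        if all(has_letter_at(words, letter, position) for letter, position in tests)
    )
```

With tests such as `[("v", 2018), ("y", 2160), ("x", 2906)]`, both functions return the same names. The single pass touches each document once, which matters most when reading the documents is the slow part.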
Request for Question Clarification by pafalafa-ga on 08 Jul 2004 11:18 PDT
Hello Waldo, I'm one of the "silent observers" here, but I'm ready to jump in with a thought. Can you tell us WHY you want to conduct such an unusual search? It sounds as if you're trying to establish a lexical "fingerprint" for certain documents, and if that's the case, then there may be other strategies that would meet your needs, and that are easier to put into practice.
I'm not saying your original request is impossible. But it would help me (and probably webadept, and andyt, and anyone else out there) to understand why you want to construct a word-894-has-an-x type of query, and to think about the variety of lexical/search tools that might best meet your needs. Thanks. pafalafa-ga
Clarification of Question by waldo5555-ga on 09 Jul 2004 06:10 PDT
Welcome Pafalafa-ga, and thanks for your interest. I didn't recognize that my search inquiry was unusual. There are two reasons for my search. One is to try and solve a book cipher that was written 200 years ago and has not been solved. The area of cryptography is a very minor interest for me. The other reason is to try and get an intellectual "handle" on the incredible power of search engines and to determine the nature of their limitations. I'm a retired ophthalmologist and I'm looking for intellectual "fodder".
The search for documents that can be identified by the position of "x", "y", and "v" won't stop until I'm convinced that Google and spiders don't have that capability. But my experience in learning Google has only caused me to have greater appreciation of, and intrigue about, the considerable power in searching. I hope that you will continue to offer suggestions. Waldo
Request for Question Clarification by pafalafa-ga on 09 Jul 2004 07:56 PDT
As I understand your question, you are asking for something QUITE unusual in terms of search strategies. In particular, there are two things that stand out:
--You are asking for information on word "position" -- is the word at position 885, etc. Search engines are not generally designed to provide this function.
--Secondly, you are asking for words containing specific letters. Again, this is not something search engines ordinarily do. Google -- and all other search engines -- look for complete words. There is no search-engine capacity that I am aware of to conduct a search such as "find words that have an "X" in them". And there is certainly no capacity to "find words at position 885 that have an "X" in them".
Of course, not all text searching tools are search engines. There is an entire field of text analysis that has developed a suite of tools for parsing and exploring the subtle details of a text. Once again, most of the emphasis is on words-in-context (i.e., is the word used as a noun, verb, etc.), rather than on word position or letters within words. However, a good programmer (which I am not) could probably create the search tool that you need.
You might want to have a look at one of these offbeat text search tools to get a bit familiar with it (they are not easy to use). You can find a tool known as a "regular expressions" search tool at the National Puzzlers' League website:
http://www.puzzlers.org/wordlists/grepdict.php
It really takes a few days of poking around here to begin to get a feel for how the search tool works (their onsite instructions are just awful!). For instance, searching on:
^..x..$
[go ahead...copy the above line and paste it into the search box]
will give you a set of 281 words that are 5 characters in length and have an "X" in the 3rd position.
Is this of interest at all as "fodder" for your explorations? Let me know what you think.
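The same kind of regular expression can be tried in Python's standard `re` module. The sketch below is an illustration only: the short word list stands in for the Puzzlers' League dictionary, so it cannot reproduce the count of 281, and the helper names `words_matching` and `x_at_letter_position` are hypothetical.

```python
import re

# A tiny stand-in for a real dictionary file.
SAMPLE_WORDS = ["taxes", "sexes", "boxer", "exist", "mexico", "valuable", "xylophone"]


def words_matching(pattern, word_list):
    """Return the words in word_list that match the regular expression."""
    regex = re.compile(pattern)
    return [word for word in word_list if regex.search(word)]


def x_at_letter_position(n):
    """Pattern for words of any length whose n-th letter (1-based) is 'x'."""
    return "^" + "." * (n - 1) + "x"


# Five-letter words with "x" as the third letter, as in the post above.
five_letter = words_matching(r"^..x..$", SAMPLE_WORDS)

# Words of any length with "x" as the second letter.
second_letter = words_matching(x_at_letter_position(2), SAMPLE_WORDS)
```

In the pattern, `^` anchors the start of the word, each `.` stands for any one character, and `$` anchors the end, so dropping the `$` allows words of any length.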
Clarification of Question by waldo5555-ga on 10 Jul 2004 12:51 PDT
July 10, 2004: Dear Andyt-ga, Webadept-ga, and all other silent observer searchers: I'm much more aware of the problems associated with the type of search that I've requested, and I still want to continue.
The filter that is necessary to limit the sea of documents will have to consist of a repository that can be identified by 1. language (English), 2. date of publication (prior to 1820), 3. subjects (American revolutionary) (legal and political) and literature. My intuition suggests to me that Webadept-ga sees the necessity of having the spiders do all of the work. My ignorance on this subject would lead me into a room with a very large mountain of documents that only partially met my requirements.
Andyt: You demonstrated the feasibility of my thinking by showing that a search for "sexes+fundamentally+valuable" could identify the Declaration of Independence. That means that the "test" search for a document with x at 994 can be eliminated. That means there are only two documents that I want to identify. Document "A" will have an "x" at position 2906. Document "A" will probably have a "y" at 2160 and may well have a "v" at 2018. Document "B" will have an "x" at position 975, will probably have a "y" at position 952, and may well have a "v" at 951.
I would like to purchase a search scheme that would identify (1) documents that pass the "x" test only and (2) documents that pass the "x", "y", and "v" tests. I would be concerned about the counting of the words, and would like to be sure that my previous requirements about counting words that are hyphenated, spaced, or joined by ampersands are strictly followed.
Is there some way that I can find out which repositories of documents have been indexed by Google? An additional "filter" might be the date of publication (prior to 1820). I believe that it will be possible to identify these two documents with your help, along with some good fortune.
I have been unable to download Google's API developers' "package", but I will continue to try and hope that success will follow. I look forward to hearing from you. Best wishes, Waldo.
Clarification of Question by waldo5555-ga on 24 Jul 2004 16:33 PDT
Question: This is a modification of question 370562, posted July 6, in which I requested a search strategy for identifying a historical document by identifying the location of the letters "x", "y", and "v" within the document. I do not know the title, or name, or author of the two documents. After reviewing the comments and suggestions, I've attempted to modify and scale down the requirements of this search strategy. The search strategy is essentially a search for the letter "x" at word position 2906 and word position 975. The search strategy must do the following:
A. Count the words in the document. In counting the words of the document, the following rules should apply. Counting starts with the first word of the first paragraph. Counting does not start with the title of the document. Words that are divided by a space, plus sign, hyphen, ampersand, or slash will count as two words. For example, the phrase "self-evident" would count as two words. The phrase "cruelty & perfidy" would count as three words. The accuracy of the count should be plus or minus two words. Word position 2906 should cover words from position 2904 to 2908. I only want to find documents that are in the English language and documents that were published before 1820.
B. Identify all documents that have the letter "x" at the 2906 (2904-2908) word position. The search strategy should be able to find "x" at any letter position in the word. For example, a first position word would be a word such as "xylophone"; a second position word would be a word such as "exist"; and a third position word would be a word such as "Mexico". The website www.puzzlers.org/wordlists/ has a search engine that will identify letters so that a key-letter search can be performed. The puzzlers website identifies 20,844 words with "x" in the second position (using a search request of ( .x.). Words with "x" in the third position totaled 15,914, using a search request of (..x.); and 11,778 words with "x" in the fourth position (using a search request of (?x.*). I wonder if this strategy could be used as part of a Google search? If possible, I would like the Google search strategy to identify "x" in any letter position from 1-12; however, I would accept a strategy that identifies "x" in the second and third letter positions, because I believe that this would cover the most likely possibilities.
C. When a document is found with "x" at word position 2906 (2904-2908), then print or email the document with the words at position 2906, position 2160 (2158-2162) and position 2018 (2016-2020) "highlighted" by an asterisk, or by underlining, or by capitalization.
D. When a document is found with "x" at word position 975 (973-977), then print or email the document with the words at position 975 (973-977) "highlighted". The words at position 952 (950-954) and position 951 (949-953) should also be highlighted in a similar manner, by an asterisk, by underlining, or by capitalization.
I hope that these modifications to my original question will allow someone to answer my question. I've been unable to download the Google API. The zipped file reads "corrupted or invalid" whenever I attempt to unzip the file. I hope that the Google search strategy would be simple enough that I can perform a search without the necessity of getting a doctorate degree in computer programming. Your thoughts and ideas are welcomed. Waldo.
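Steps B through D of the post above (test a window of word positions for "x", then highlight the words of interest by capitalization) might be sketched in Python as follows. This is an illustration under stated assumptions, not a tool offered in the thread: the function names are hypothetical, the input is a list of words that has already been split and counted according to step A, and finding where the "first paragraph" begins is not attempted.

```python
def window(position, tolerance, total):
    """1-based positions within +/- tolerance of position, clipped to the text."""
    lo = max(position - tolerance, 1)
    hi = min(position + tolerance, total)
    return range(lo, hi + 1)


def passes_letter_test(words, position, tolerance=2, letter="x"):
    """True if any word in the window contains the letter, at any letter position."""
    return any(
        letter in words[i - 1].lower()
        for i in window(position, tolerance, len(words))
    )


def highlight(words, positions, tolerance=2):
    """Return the text with every word in each window capitalized."""
    marked = list(words)
    for position in positions:
        for i in window(position, tolerance, len(words)):
            marked[i - 1] = marked[i - 1].upper()
    return " ".join(marked)
```

For the first document described above, the check would be `passes_letter_test(words, 2906)`, followed by `highlight(words, [2906, 2160, 2018])` on any document that passes. Using `letter in word` covers "x" in every letter position at once, so no per-position pattern is needed for this step.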
Request for Question Clarification by pafalafa-ga on 24 Jul 2004 17:15 PDT
For a while I found myself wondering "Where's Waldo?" (sorry...!). Nice to hear back from you. I'm glad to see you had a chance to play around with the puzzler's website -- they have some nifty features there. And though their letter-in-a-word tool is useful, it's still not going to get you closer to your search strategy...especially if you're looking for something quick and easy.
For starters, there's the counting problem. Imagine a search tool coming across this article in today's Washington Post:
==========
http://www.washingtonpost.com/wp-dyn/articles/A10570-2004Jul23.html
Beauty and the Bicycle: The Art of Going the Distance
By Sarah Kaufman
Washington Post Staff Writer
Saturday, July 24, 2004
Page C01
LE GRAND BORNAND, France -- Follow the Tour de France for any amount of time and it becomes clear that this bicycle race is not just a sport and a science -- it is also an art....
====
Your eye -- attached to your wonderful human brain -- has no trouble at all picking out "Follow" as the first word of the first paragraph. It is there that you would start your counting. For a computer search program, this is an incredibly sophisticated challenge, and one that -- even in the hands of the best programmers -- would be very prone to error. Is the right place to start counting at "Beauty"? At "The"? At "By"? At "Washington"? At "Page"? At "Le"? Your brain can do it. Computers can't. It's just that painfully simple, I'm afraid.
I'm only pointing out one (probably the largest) challenge to getting you the tool you want...there are quite a number of others as well. Bottom line...there's no simple way to do what you want. There may not even be a reliable complicated way to do it. Some word challenges just aren't handled very well by computers, which is why there is not yet a reliable translation tool available (a task some early programmers thought would be relatively easy....Ha!!).
I'm still not clear just WHY you want to look for words containing an X at position 2906 plus or minus two words. If we knew more about the WHY of your quest, perhaps an alternative search strategy would suggest itself. There are many ways to find or identify particular documents/passages/excerpts (anti-plagiarism software programs do it all the time). If this is your goal, one of these programs may be able to help. Just some thoughts...I'd be interested to hear your reaction. pafalafa-ga
Clarification of Question by waldo5555-ga on 24 Jul 2004 21:38 PDT
Dear Pafalafa-ga: Thanks for your continuing interest. The why of this search scheme is a very lengthy explanation, which I will provide -- if my hypothesis is correct. I will be happy to try and explain why I believe that word position 2906 is a very special and unique position for the letter "x"; but for now all I can do is to ask for your indulgence.
The capability of a search engine to search for "key-words" is a tool that I would like to push to its limits. However, I believe that in principle the search engines can be programmed to provide key-letter searches that will have very useful results in very specific instances. The counting of the words is obviously one of the many limiting factors that have to be considered. There are many other limitations. Slight changes in various editions of the same document are another. But I still want to try the approach that I've outlined.
To answer your question about where to start: the answer is that counting should start with the first word of the first paragraph of the actual text. In the example that you gave, the first word would be "Follow".
I'm a little anxious to put my ideas and hunches to the test, and I appreciate your reference to puzzlers.org because it has confirmed that the technology already exists to do "key-letter" searches, and that if the counting problem can be solved then I have an opportunity to identify these two old, but very special, documents. I appreciate your patience and interest. Best wishes, Waldo
Clarification of Question by waldo5555-ga on 04 Aug 2004 07:56 PDT
Greetings to Andyt-ga: Hi Andy: My question ID 370562 is about to expire. I'm only a little further down the road to finding an answer. Because you have previously expressed a willingness to approach this problem, I wonder if you will try to offer me some help along the lines of my last "clarification". I look forward to hearing from you. Waldo

Answer

There is no answer at this time.

Comments

There are no comments at this time.

Important Disclaimer: Answers and comments provided on Google Answers are general information, and are not intended to substitute for informed professional medical, psychiatric, psychological, tax, legal, investment, accounting, or other professional advice. Google does not endorse, and expressly disclaims liability for any product, manufacturer, distributor, service or service provider mentioned or any opinion expressed in answers or comments. Please read carefully the Google Answers Terms of Service.

If you feel that you have found inappropriate content, please let us know by emailing us at answers-support@google.com with the question ID listed above. Thank you.
