Q: Keyword Power Search ( Answered 4 out of 5 stars,   0 Comments )
Question  
Subject: Keyword Power Search
Category: Computers > Internet
Asked by: abrandt-ga
List Price: $10.00
Posted: 13 Nov 2002 17:16 PST
Expires: 13 Dec 2002 17:16 PST
Question ID: 107287
How many corporate websites have a web page named management.htm?
What search methodology did you use to derive this number?

What SEARCH ENGINES and KEYWORD TECHNIQUES can be utilized that will
not limit a search like “corporate management” to only 1,000 results
like Google does?

How can one limit a keyword search like “corporate management” to U.S.
and Canadian corporations or to San Jose, CA?

Some of the filters I have used so far to prevent the following words
have been:  -"resume" -"press" -"news" -"release" -"employment"
-"jobs".
Answer  
Subject: Re: Keyword Power Search
Answered By: robertskelton-ga on 13 Nov 2002 20:26 PST
Rated:4 out of 5 stars
 
Hi there,

The best search engines for your purpose are Google and AlltheWeb.
They are the two largest indexes with 3 billion and 2,112,188,990
documents indexed respectively. And each has the ability to set
specific criteria using their advanced searches.

Google Advanced Search
http://www.google.com/advanced_search?hl=en

AlltheWeb Advanced Search
http://www.alltheweb.com/advanced

Now, here are my answers to your questions...


1. Although it isn't possible for search engines to tell whether a
website is "corporate" or not, the two biggest indexes have the
ability to search for a phrase within the URL.

Google search for "management.htm" in the URL finds 52,400
http://www.google.com/search?num=30&hl=en&lr=&ie=UTF-8&oe=UTF-8&newwindow=1&safe=off&as_qdr=all&q=allinurl%3A+management.htm+

AlltheWeb search for "management.htm" in the URL finds 12,378
http://www.alltheweb.com/search?advanced=1&cat=web&type=all&q=&jsact=&l=any&ics=iso-8859-1&cs=iso-8859-1&wf%5Bn%5D=3&wf%5B0%5D%5Br%5D=%2B&wf%5B0%5D%5Bq%5D=management.htm&wf%5B0%5D%5Bw%5D=url.all%3A&wf%5B1%5D%5Br%5D=-&wf%5B1%5D%5Bq%5D=&wf%5B1%5D%5Bw%5D=&wf%5B2%5D%5Br%5D=&wf%5B2%5D%5Bq%5D=&wf%5B2%5D%5Bw%5D=&dincl=&dexcl=&geo=&limip=&doctype=&dfr%5Bd%5D=1&dfr%5Bm%5D=1&dfr%5By%5D=1980&dto%5Bd%5D=14&dt
%5Bm%5D=11&dto%5By%5D=2002&size%5Bp%5D=%3D&size%5Bv%5D=&size%5Bx%5D=0&depth%5Bp%5D=%3E&depth%5Bv%5D=&hits=10&nooc=on

(The selections I made in the AlltheWeb advanced search form can be
seen below the search results.)

The results do not include "management.html". For some reason that
requires a second search.
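
As a rough illustration of running both searches, the sketch below builds
the two Google query URLs with Python's standard library. The allinurl:
operator is the one used in the Google link above; the helper name
allinurl_search_url is made up for this example, and only the q parameter
is set, so the other parameters from my link are left out.

```python
from urllib.parse import urlencode


def allinurl_search_url(filename):
    """Build a Google search URL that looks for `filename` in page URLs.

    Hypothetical helper for illustration; only the `q` parameter is set.
    """
    query = "allinurl: " + filename
    return "http://www.google.com/search?" + urlencode({"q": query})


# One search per file extension, because .htm results do not include .html
urls = [allinurl_search_url(name) for name in ("management.htm", "management.html")]

for url in urls:
    print(url)
```

The two result counts would then need to be added together by hand.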


2. Most of the search engines limit displayed results to 1,000. The
reason given by the now defunct Excite is quoted at Search Engine
Watch:

"As a highly trafficked site, Excite has a responsibility to provide
the best possible service to our customer base. Our Excite Search
engine is designed to float the most relevant data from our index to
the forefront of the results list returned for each query submitted.
We have found that nearly 100% of users never have a need to drill
down beyond the 1,000th result for a given query. For these reasons,
we no longer provide more than 1,000 results per query submitted. We
hope that 1,000 results more than meet your needs."
http://searchenginewatch.com/sereport/00/07-sizetest.html

The above link also provides a table of maximum displayed results, and
the chart is still quite accurate, despite being over two years old.
Newer engines also have limits below 1,000:

Wisenut
Limit of 300
http://www.wisenut.com/search/query.dll?q=management&p=80&c=10

Walhello
Limit of 260
http://80.60.35.143/cgi-bin/search?key=management&taal=a&nummer=26&no&no&no&vert=0&no

The only major search engine which provides over 1000 is AlltheWeb
(aka FAST), which provides 4010 results. There are no keyword
techniques that can allow you to see more results than the limits.
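
To put those limits in perspective, here is a small calculation using the
counts from answer 1 and the display limits above (1,000 for Google, as
noted in your question, and 4,010 for AlltheWeb). It is only arithmetic on
the numbers already quoted, not a measurement of anything new.

```python
# Result counts reported in answer 1 and the display limits from answer 2
reported = {"Google": 52400, "AlltheWeb": 12378}
display_limit = {"Google": 1000, "AlltheWeb": 4010}

# Share of the reported results that can actually be paged through
coverage = {
    engine: min(display_limit[engine], reported[engine]) / reported[engine]
    for engine in reported
}

for engine, share in coverage.items():
    print("{}: {:.1%} of reported results can be viewed".format(engine, share))
```

In other words, roughly 2% of the Google results and about a third of the
AlltheWeb results are reachable for this particular search.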


3. How can one limit a keyword search like "corporate management" to
U.S. and Canadian corporations or to San Jose, CA?

Advanced searches at Google and AlltheWeb allow you to search only
domains with a certain extension, such as .ca for Canada, and .au for
Australia. Most countries only let residents use such domain names, so
you can be sure that the results are from that country. But many
businesses will use the regular .com rather than their country
extension, so the results will only find a subset of each country. It
would theoretically be possible for a search engine to determine where
a website is hosted, but that is a flawed approach, because it is
quite common to host your website in a different region or country to
where the content relates.

American websites are unusual in that although a .us extension exists,
almost no-one uses it. This is seen by some to be the same kind of
arrogance (or ignorance) that gives us the "World Series".
Consequently trying to have only American results is possible but will
only include a small subset of all American sites, and typically they
are mostly government sites.

Google: "corporate management", .us sites only - 155,000 results
http://www.google.com/search?num=30&hl=en&lr=&ie=UTF-8&oe=UTF-8&newwindow=1&safe=off&as_qdr=all&q=%E2%80%9Ccorporate+management%E2%80%9D+site%3A.us

For Canadian sites, a Google search for "corporate management" finds
288,000 results:
http://www.google.com/search?num=30&hl=en&lr=&ie=UTF-8&oe=UTF-8&newwindow=1&safe=off&as_qdr=all&q=%E2%80%9Ccorporate+management%E2%80%9D+site%3A.ca

Searching for the phrase in regular .com sites yields 4,020,000
results.
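
The three domain-restricted searches above differ only in the extension
given to the site: operator, so they can be generated in one go. This is a
minimal sketch; the function names are invented for the example, and the
URL carries only the q parameter rather than everything in my links.

```python
from urllib.parse import urlencode


def site_restricted_query(phrase, tld):
    """Return an exact-phrase query limited to one domain extension."""
    return '"{}" site:.{}'.format(phrase, tld)


def google_url(query):
    """Hypothetical helper: wrap a query string in a Google search URL."""
    return "http://www.google.com/search?" + urlencode({"q": query})


# One query per extension discussed above
queries = {
    tld: site_restricted_query("corporate management", tld)
    for tld in ("us", "ca", "com")
}

for tld, query in queries.items():
    print(tld, google_url(query))
```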

I know of no easy way to search for websites located in a particular
city, using major search engines. It might be possible if the
particular city has a local search engine. The best I can suggest is
searching for the name of the city within pages that have the phrase
about.htm or contact.htm in the URL. Google doesn't let you do both at
once, but AlltheWeb can:

AlltheWeb: "San Jose" in the text, "contact.htm" in the URL
http://www.alltheweb.com/search?cat=web&cs=iso-8859-1&advanced=1&type=all&q=&jsact=&l=any&ics=iso-8859-1&cs=iso-8859-1&wf%5Bn%5D=3&wf%5B0%5D%5Br%5D=%2B&wf%5B0%5D%5Bq%5D=contact.htm&wf%5B0%5D%5Bw%5D=url.all%3A&wf%5B1%5D%5Br%5D=%2B&wf%5B1%5D%5Bq%5D=%22San+Jose%22&wf%5B1%5D%5Bw%5D=&wf%5B2%5D%5Br%5D=&wf%5B2%5D%5Bq%5D=&wf%5B2%5D%5Bw%5D=&dincl=&dexcl=&geo=&limip=&doctype=&dfr%5Bd%5D=1&dfr%5Bm%5D=1&dfr%5
y%5D=1980&dto%5Bd%5D=14&dto%5Bm%5D=11&dto%5By%5D=2002&size%5Bp%5D=%3D&size%5Bv%5D=&size%5Bx%5D=0&depth%5Bp%5D=%3E&depth%5Bv%5D=&hits=10&nooc=on


It's worth taking the time to play with the advanced search facilities
of the major search engines. As you have seen, for your purposes
AlltheWeb has a couple of tricks which Google doesn't have.


Search strategy:
google "more than 1000 results"
http://www.google.com/search?q=google+%22more+than+1000+results%22&num=30


Best wishes,
robertskelton-ga

Clarification of Answer by robertskelton-ga on 13 Nov 2002 20:28 PST
Apologies for how the extremely long AlltheWeb URLs are appearing -
you'll need to cut and paste the two halves into the Address box of
your browser.

Request for Answer Clarification by abrandt-ga on 13 Nov 2002 21:34 PST
Hello robertskelton-ga,

Here's the dilemma. From a marketing perspective, it doesn't matter
that the Google search for "management.htm" in the URL finds 52,400...
because only 1,000 are accessible... or AlltheWeb search for
"management.htm" in the URL finds 12,378 ... since only 4,010 are
accessible.

If I take the AlltheWeb search for "management.htm" in the URL finds
12,378 at the LINK you've provided and insert the keyword "CA"
(generally found in the corporate address) as MUST INCLUDE in the TEXT
... the results fall to 466. Am I to believe this number?  I don't
think so.

In order for this question or exercise to have practical value... let
me ask you this:  How can a search using Google and Alltheweb be
narrowed by CITY... so that 1,000 or 4,010 results becomes a viable
number?

Thank you.

Request for Answer Clarification by abrandt-ga on 13 Nov 2002 21:48 PST
In regards to:
Search strategy: 
google "more than 1000 results"

Why is google NOT IN QUOTES or in advanced: WITH ALL THE WORDS?
and "more than 1000 results" in QUOTES or in advanced: EXACT PHRASE?

What is the rationale for this keyword STRATEGY?

Thank you.

Clarification of Answer by robertskelton-ga on 13 Nov 2002 23:51 PST
Oops, hope I didn't cause you any confusion. The "Search Strategy" is
what I usually include at the end of my answer, to help demonstrate
how I found it.

In this case, it was how I found the article at Search Engine Watch. 

The advanced search form is not necessary if you know the correct
operators to use.

Google Advanced Search Made Easy
http://www.google.com/help/refinesearch.html

Google Advanced Search Operators
http://www.google.com/help/operators.html

The advanced "WITH ALL THE WORDS?" is the default for all Google
searches, so it's normal not to use it.

Google is not in quotes because it is a single word. Only phrases need
to be placed between quotes, with the exception being stop words. Stop
words are words like "the" which Google has determined are ordinarily
not worth searching for. To force Google to search for them, placing
them between quotes works. This is why when searching for about.htm I
placed it between quotes, because "about" is a stop word.

"more than 1000 results" is between quotes because I was looking for a
page with that exact phrase in it. Without the quotes I might not have
found the information I was after. As it happens I would've still
found the Search Engine Watch page, although it was further down the
results:
http://www.google.com/search?q=google+more+than+1000+results&sa=Search&num=30
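
If it helps to see the difference between the two forms of the query, the
snippet below splits each one into search terms. It uses Python's shlex
module purely as an illustration of how quotes group words into a single
phrase; it is not a description of how Google parses queries internally.

```python
import shlex

quoted = 'google "more than 1000 results"'
unquoted = "google more than 1000 results"

# With quotes: one single word plus one exact phrase
quoted_terms = shlex.split(quoted)

# Without quotes: five independent words that may appear anywhere on the page
unquoted_terms = shlex.split(unquoted)

print(quoted_terms)
print(unquoted_terms)
```

The quoted version yields two terms (the word and the phrase), while the
unquoted version yields five separate words.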

Request for Answer Clarification by abrandt-ga on 14 Nov 2002 00:01 PST
Appreciate the good explanations, robertskelton-ga

ACCIDENTALLY MISSED:

Request for Answer Clarification by abrandt-ga  on 13 Nov 2002 21:34 PST
Here's the dilemma. From a marketing perspective, it doesn't matter...

Clarification of Answer by robertskelton-ga on 14 Nov 2002 00:38 PST
Regarding your other question...

A search for management.htm in the URL will return websites from
countries all over the world. This would explain why such a low
percentage have CA in the text. If you restricted the search to just
.com websites, there are only 6383 results. As I explained, searching
for US websites is very difficult because so few businesses use the
.us extension.

I agree that searching for a page called management.htm is a good way
of ensuring that mostly corporate websites are returned. However
looking through the first 10 results of such a search...

AlltheWeb
management.htm in the URL
only include .com
http://www.alltheweb.com/search?advanced=1&cat=web&type=all&q=&jsact=&l=any&ics=iso-8859-1&cs=iso-8859-1&wf%5Bn%5D=3&wf%5B0%5D%5Br%5D=%2B&wf%5B0%5D%5Bq%5D=management.htm&wf%5B0%5D%5Bw%5D=url.all%3A&wf%5B1%5D%5Br%5D=-&wf%5B1%5D%5Bq%5D=&wf%5B1%5D%5Bw%5D=&wf%5B2%5D%5Br%5D=&wf%5B2%5D%5Bq%5D=&wf%5B2%5D%5Bw%5D=&dincl=.com&dexcl=&geo=&limip=&doctype=&dfr%5Bd%5D=1&dfr%5Bm%5D=1&dfr%5By%5D=1980&dto%5Bd%5D=1
&dto%5Bm%5D=11&dto%5By%5D=2002&size%5Bp%5D=%3D&size%5Bv%5D=&size%5Bx%5D=0&depth%5Bp%5D=%3E&depth%5Bv%5D=&hits=10&nooc=on

...only 1 had an address on the page given in the results. The search
is good for finding corporate websites, but poor for searching for
locations. Typically they list a number of company employees, and each
has their past experience detailed, including their previous employers
and where they studied - and the place names given could be anywhere
they previously studied or worked.

Unless the address is expected on the page, California would be more
appropriate than CA.

A better tactic might be:

AlltheWeb
contact.htm in the URL
CA in the text
Corporation in the text
http://www.alltheweb.com/search?cat=web&cs=iso-8859-1&advanced=1&type=all&q=&jsact=&l=any&ics=iso-8859-1&cs=iso-8859-1&wf%5Bn%5D=3&wf%5B0%5D%5Br%5D=%2B&wf%5B0%5D%5Bq%5D=contact.htm&wf%5B0%5D%5Bw%5D=url.all%3A&wf%5B1%5D%5Br%5D=%2B&wf%5B1%5D%5Bq%5D=CA&wf%5B1%5D%5Bw%5D=&wf%5B2%5D%5Br%5D=%2B&wf%5B2%5D%5Bq%5D=Corporation&wf%5B2%5D%5Bw%5D=&dincl=.com&dexcl=&geo=&limip=&doctype=&dfr%5Bd%5D=1&dfr%5Bm%5D=1
dfr%5By%5D=1980&dto%5Bd%5D=14&dto%5Bm%5D=11&dto%5By%5D=2002&size%5Bp%5D=%3D&size%5Bv%5D=&size%5Bx%5D=0&depth%5Bp%5D=%3E&depth%5Bv%5D=&hits=10&nooc=on

Seven of the first ten results are corporations which are either
located in California or have an office there. Two of the failures
were due to having dropdown boxes for selecting a state (so every
state appears on the page).

Searching for California instead of CA gets even better results.
Because the page file name could also be contactus.html, contact.asp
and many others, you can also receive good results by searching for
"contact us" in the title.
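
Once you have saved the text of the result pages, one way to weed out the
dropdown-box failures is to look for "CA" only where it appears as part of
a postal address. The sketch below assumes addresses are written in the
form "City, CA 12345"; that pattern, the function name, and the two sample
strings are my own assumptions for illustration, and real pages will vary.

```python
import re

# Assumed address form: "<City>, CA <5-digit ZIP>", e.g. "San Jose, CA 95110"
CA_ADDRESS = re.compile(
    r"\b([A-Z][a-zA-Z]+(?: [A-Z][a-zA-Z]+)*),\s+CA\s+(\d{5})\b"
)


def california_cities(page_text):
    """Return city names found in a 'City, CA ZIP' style address."""
    return [match.group(1) for match in CA_ADDRESS.finditer(page_text)]


# Made-up sample text: a contact page and a state dropdown list
address_page = "Acme Corporation, 100 Main Street, San Jose, CA 95110"
dropdown_page = "Select a state: AL AK AZ AR CA CO CT"

print(california_cities(address_page))
print(california_cities(dropdown_page))
```

The address page yields the city name, while the dropdown list yields
nothing because "CA" is not followed by a ZIP code there.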

Request for Answer Clarification by abrandt-ga on 14 Nov 2002 08:21 PST
robertskelton-ga

In my opinion, you are one of the tier-1 "Researchers" onboard Google
Answers. I have reviewed numerous answers of yours... some qualify as
excellent essays.

Your responses to this question are all sound... but they have not
proactively driven to the heart of the challenge and what I'm trying
to achieve. For some reason... in this case, my question seems to NOT
have brought out your best... "pulling teeth" versus "value-added."

SUMMARY: I want to locate corporations large enough to have a web page
that lists CORPORATE OFFICERS. Since we have a limitation of 1000/4010
- Google/Alltheweb respectively, it appears that targeted searches
need be reduced to the smallest common denominator: CITIES.
Simultaneously, it would make sense to remove the NOISE: -"resume"
-"press" -"news" -"release" -"employment" -"jobs"... the goal here is
clean or at least cleaner results.
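
The kind of query described in this summary, one per city with the noise
words excluded, could be assembled along these lines. This is only a
sketch of the query strings; the function name and the list of cities are
placeholders chosen for illustration.

```python
# Noise words taken from the filters listed in the question
NOISE_WORDS = ["resume", "press", "news", "release", "employment", "jobs"]


def city_query(phrase, city, noise_words=NOISE_WORDS):
    """Combine an exact phrase, a city name, and -"word" exclusions."""
    exclusions = " ".join('-"{}"'.format(word) for word in noise_words)
    return '"{}" "{}" {}'.format(phrase, city, exclusions)


# Placeholder city list; one query per city keeps each result set small
cities = ["San Jose", "Sacramento"]
queries = [city_query("corporate management", city) for city in cities]

for query in queries:
    print(query)
```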

I understand that in most cases, 1,000 results is more than most
people will ever read. In this case, I am compiling corporate data
instead of paying $100 to $250 per 1000 for a list. BESIDES, I refuse
to accept that it can't be done because I have verified that most of
these list companies are currently extracting this web contact data
and calling it proprietary databases.

NOTE: If you can get me over the HUMP... I'm willing to double the
ante via tip. Perseverance must prevail.

COMMENTS from experienced researchers would also be appreciated.

Clarification of Answer by robertskelton-ga on 16 Nov 2002 18:52 PST
Hi again,

I believe that what you are attempting to achieve is not possible
using regular search engines. Any businesses extracting this type of
information from the web, and then selling it in list form, would not
be using regular search engines - they would either write their own
applications or use specialized software.

I have found a couple of products which are representative of the type
of software required for your task.

Teleport Pro
http://www.tenmax.com/teleport/pro/home.htm

WebQL
http://www.caesius.com/
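
To give a feel for what a custom application of this kind involves, here is
a very small sketch that pulls the text out of a page and keeps the lines
mentioning an officer title. It is not how Teleport Pro or WebQL work; the
list of titles, the class and function names, and the sample page are all
assumptions made up for the example, and a real tool would also need to
fetch pages and follow links.

```python
from html.parser import HTMLParser

# Assumed officer titles to look for; a real list would be much longer
OFFICER_TITLES = ("Chief Executive Officer", "Chief Financial Officer", "President")


class TextExtractor(HTMLParser):
    """Collect the non-empty text chunks found between HTML tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def officer_lines(html):
    """Return the text chunks that mention one of the officer titles."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return [c for c in parser.chunks if any(t in c for t in OFFICER_TITLES)]


# Made-up sample page standing in for a downloaded management.htm
sample_html = (
    "<html><body><h1>Management</h1>"
    "<p>Jane Doe, Chief Executive Officer</p>"
    "<p>Our products</p></body></html>"
)

print(officer_lines(sample_html))
```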

Request for Answer Clarification by abrandt-ga on 16 Nov 2002 23:44 PST
Thank you for getting back. I will further review your references on Monday.

Have a good weekend, robertskelton-ga.

Request for Answer Clarification by abrandt-ga on 20 Nov 2002 08:22 PST
Hello robertskelton-ga,

1. Per your reference to Tenmax.com, I contacted them and received
the following reply.

2. Per your reference to Caesius.com, I am expecting to hear back
from them towards the end of this week.

Any additional comments and insights would be greatly appreciated.

Looking forward to a resolution of this question,

abrandt

..................................


-------- Original Message --------
Subject: INQUIRY: Teleport Pro v1.29
Date: 18 Nov 2002 21:48:26 +0000 GMT
From: Support Team


> Is your software capable of extracting the following?
> 1. Corporate websites that list their CORPORATE OFFICERS on a web page

Hello,

Ordinarily we would attempt this with our Dataplex service, in which 
we run projects for clients on our Dataplex spider, and return the 
data directly to the customer.  The jobs are typically very large 
scans, similar to the scope of a search engine, except that because we
scan the actual sites, and don't go through a search engine, we have 
the ability to return comprehensive results, as well as extract data 
from the sites that match your criteria.

Unfortunately, at this time we are not taking on additional Dataplex 
jobs.  The service has reached capacity.  We are restructuring our 
system to accommodate larger numbers of small jobs, but this change is
also some months away from completion.

____________________________________
Support Team
Tennyson Maxwell Information Systems, Inc.

Request for Answer Clarification by abrandt-ga on 25 Nov 2002 17:06 PST
Per your reference to Caesius.com, I am expecting to hear back
from them in the next 1-2 days.
abrandt-ga rated this answer:4 out of 5 stars
The final answer provided value.

Comments  
There are no comments at this time.
