Hello tagger-ga,
Thank you for your question. After searching the Web high and low on
various search terms without success (see search terms below), I found
the perfect site that has the data you are looking for :
World Gazetteer : population figures for cities, towns and places
http://www.gazetteer.de/
"This site provides statistics about current population of countries,
their administrative divisions, cities and towns as well as images of
the current national flags."
The data you seek can be found in this downloadable file :
Largest towns (all places with a population above 100 000, as a
tab-separated text file compressed to zip 241k)
http://www.gazetteer.de/st/cities.zip
The fields available include country, administrative district (state),
city, and other details like population, latitude, longitude. HOWEVER,
region was not included.
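For anyone who prefers to script the import rather than use Excel, here is a minimal sketch of reading such a tab-separated file in Python. The column order in COLUMNS is an assumption (check it against the real file first), and the sample row uses placeholder numbers, not values taken from the download.

```python
import csv
import io

# Assumed column order; the real file's layout must be checked before use.
COLUMNS = ["country", "district", "city", "population", "latitude", "longitude"]

def read_places(handle):
    """Yield one dict per tab-separated row, with population as an int."""
    reader = csv.DictReader(handle, fieldnames=COLUMNS, delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    for row in reader:
        row["population"] = int(row["population"])
        yield row

# Placeholder sample row (illustrative numbers only).
sample = "Israel\tJerusalem\tYerushalayim\t700000\t31.78\t35.22\n"
places = list(read_places(io.StringIO(sample)))
```

With the real file you would pass an open file handle instead of the io.StringIO sample.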
I proceeded to import this tab-separated file into Excel, trimmed out
unnecessary columns, and then inserted a new column (field) for region
(making sure to double-check for correctness and spelling). Where
there was any doubt as to which region a country belonged to, I
consulted the CIA World Factbook 2001 at [
http://www.odci.gov/cia/publications/factbook/index.html ].
I then rearranged the columns to match the relevance to your question.
There are 5,400+ records in this file, containing all the information
you asked for except for alternate names.
From my experience as a web developer and database
designer/programmer, the tab-separated format (TSV) is the best format
to use because of its portability: nearly every spreadsheet and
database can import it. I presume you will be using this data in a
database-enabled application (not using a database is a bad idea for
this volume of data). With a database you will be able to generate the
listing in any form you wish, including a hierarchical listing in
which no name is repeated (currently the region and country are
repeated on every row of the file). This "redundant" format can easily
be "normalized" into database tables. For example, you would replace
every instance of "Africa" with the code "1" and keep a separate table
that maps each region code to its name. Another table would hold the
reference codes for all countries. This saves table space and lets you
join the tables in relational queries.
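As a rough illustration of that normalization, here is a small sketch using Python's built-in sqlite3 module. The table and column names are my own invention, and the three sample rows stand in for the real file.

```python
import sqlite3

# Sample rows in the flat (region, country, city) form of the spreadsheet.
rows = [
    ("Africa", "Egypt", "Cairo"),
    ("Africa", "Egypt", "Alexandria"),
    ("Africa", "Kenya", "Nairobi"),
]

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE region  (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE country (id INTEGER PRIMARY KEY, name TEXT UNIQUE,
                      region_id INTEGER REFERENCES region(id));
CREATE TABLE city    (id INTEGER PRIMARY KEY, name TEXT,
                      country_id INTEGER REFERENCES country(id));
""")

for region, country, city in rows:
    # Each region and country is stored once; repeats are ignored.
    conn.execute("INSERT OR IGNORE INTO region (name) VALUES (?)", (region,))
    region_id = conn.execute(
        "SELECT id FROM region WHERE name = ?", (region,)).fetchone()[0]
    conn.execute("INSERT OR IGNORE INTO country (name, region_id) VALUES (?, ?)",
                 (country, region_id))
    country_id = conn.execute(
        "SELECT id FROM country WHERE name = ?", (country,)).fetchone()[0]
    conn.execute("INSERT INTO city (name, country_id) VALUES (?, ?)",
                 (city, country_id))

# Rebuild the hierarchical listing by joining the three tables.
listing = conn.execute("""
SELECT region.name, country.name, city.name
FROM city
JOIN country ON country.id = city.country_id
JOIN region  ON region.id  = country.region_id
ORDER BY region.name, country.name, city.name
""").fetchall()
```

Here "Africa" ends up stored once with id 1, and each city row carries only a country code.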
If this list of 5400+ cities is too large for your needs, you can use
the Auto Filter function (in Excel) to limit the listing to cities
above a specified population. This will give you a smaller list with
larger cities. Feel free to experiment to find the group you are
looking for.
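If you would rather do the same filtering in code than with Auto Filter, something along these lines should work. The record layout (a dict with "city" and "population" keys) and the sample cities are hypothetical.

```python
def cities_above(records, min_population):
    """Return records whose population exceeds min_population, largest first."""
    selected = [r for r in records if r["population"] > min_population]
    return sorted(selected, key=lambda r: r["population"], reverse=True)

# Hypothetical records; real ones would come from the imported file.
records = [
    {"city": "Alpha", "population": 150_000},
    {"city": "Beta", "population": 2_400_000},
    {"city": "Gamma", "population": 900_000},
]
big = cities_above(records, 500_000)
```

Raising or lowering the threshold is the equivalent of changing the Auto Filter condition.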
I have made this file available on my Web server at :
http://www.quikpublish.com/cities_x.zip (195 KB)
This is a Zip file that contains an Excel file (.xls). You can export
a TSV (tab-separated values) or CSV (comma-separated values) file from
this, for importing into a database.
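Once you have a TSV export, converting it to CSV (or back) is simple to script. This is only a sketch using Python's csv module; tsv_to_csv is a name I made up, and it assumes the TSV fields contain no embedded tabs or newlines.

```python
import csv
import io

def tsv_to_csv(tsv_text):
    """Convert tab-separated text to comma-separated text."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t",
                        quoting=csv.QUOTE_NONE)
    out = io.StringIO()
    # Fields containing commas are quoted automatically by the writer.
    writer = csv.writer(out, lineterminator="\n")
    writer.writerows(reader)
    return out.getvalue()

converted = tsv_to_csv("city\tnote\nAlpha\tlarge, coastal\n")
```

The quoting matters for place names that themselves contain commas.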
Google Search Terms :
+-----------------------+
Unsuccessful :
geographical place name list
xml geography
gml data
gml cities data
gml hierarchy OR hierarchical data
xml city OR cities list
xml country OR countries city OR cities list OR data
hierarchical city OR cities
hierarchical region city OR cities list OR listing
hierarchical world region city OR cities list OR listing
major cities by country by region
major cities hierarchical list
geographic place names
world OR global geographic place names
download free world OR global geographic place names
+-----------------------+
Successful :
political place name list
http://www.google.com/search?q=political+place+name+list&hl=en&lr=&ie=ISO-8859-1&safe=off
+-----------------------+
I hope this is what you are looking for. If you need further
assistance, for instance with data format conversion, please do not
hesitate to request clarification and I will do my best to help
you. Thank you for using Google Answers!
Best regards,
kyrie26-ga
Request for Answer Clarification by tagger-ga on 22 Sep 2002 12:30 PDT
Thanks for the answer.
Indeed I've also tried seeking high and low with no success (many of
my queries were similar to your unsuccessful ones...).
With regards to the development tips: I'm a web developer myself, and
seasoned with DBs, so I'm covered there :)
There are two caveats in your answer, however:
1) Some of the names are in their native language (phonetically). For
example, Jerusalem appears as Yerushalayim. This is not precisely what
I need, since I need to cross this with another DB, where the names
are in English. (Note that in the example I've provided, the name was
Jerusalem).
2) As you've pointed out, there is no list of alternate names, which is
also crucial for me (and may even solve caveat number 1). While the
file you have supplied me with is indeed a long way from where I was,
it's still a short distance from where I want to be, as described in
my original question.
So, I'll be happy if you can try to find the additional list of
alternative names (it can be from another source, and I'll be happy to
cross-reference it with this one as long as there is a deterministic
way of doing that).
I'll also try to find this, and if I do find it before you I'll let
you know and I'll regard your answer as a complete one.
Best regards,
Uri.
Clarification of Answer by kyrie26-ga on 23 Sep 2002 09:28 PDT
Hello again tagger-ga,
I've taken a look at the file I downloaded from here :
ADL Gazetteer Development Page
http://alexandria.sdc.ucsb.edu/~lhill/adlgaz/
[Scroll to the middle where it says "List of 5.9 million geographic
names available for download"]
Excerpt : "The ADL Project has created a list of all of the names,
both primary and alternative names, from its ADL Gazetteer and is
making it available for download and local use within the limits of
our copyright statement. We anticipate that the list will be useful
for geoparsing applications where geographic names need to be
identified in natural language text. Each entry in the list, one line
per entry, consists of (1) the ADL Gazetteer Identifier for the entry
associated with the name; (2) the name; (3) the date of entry into the
database."
This is a proprietary database that includes geographical feature
names such as "spring well" in addition to political place names.
I've taken a look at it; however, it is too large for me to view in
Excel (fields are delimited by the "|" character) or any text viewer.
The first 65,000+ records or so look promising. It looks like there
are phonetic names (natural language?) in there as well.
As I mentioned in my earlier comment, it may be possible to match one
place name to its other variants through the file's own primary key
(the ADL Gazetteer Identifier). You would run a one-time job that
looks up each of your place names in the file and copies every name
sharing its identifier into a "variant place names" table in your own
database, keyed by your own primary key (as the foreign key). The end
result is a table that is a subset of the above file, relevant to
your records, and using your own ID key scheme. An application could
then look up variant names in this new subset table using the key of
the given place name.
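To make the idea concrete, here is a rough two-pass sketch of such a subset build job. It assumes each line has the form identifier|name|date as described in the excerpt; the identifiers, dates and the key 42 in the sample are made up, and build_variants is a hypothetical helper, not part of any ADL tool.

```python
from collections import defaultdict

def parse(line):
    """Return (adl_id, name) from one 'adl_id|name|date' line."""
    adl_id, name = line.rstrip("\n").split("|")[:2]
    return adl_id, name

def build_variants(lines, own_ids):
    """Map each of our own keys to every name sharing an ADL identifier.

    lines   -- list of 'adl_id|name|date' strings (scanned twice)
    own_ids -- dict mapping a place name we already hold to our own key
    """
    # Pass 1: find the ADL identifiers whose name matches one of ours.
    adl_to_own = {}
    for line in lines:
        adl_id, name = parse(line)
        if name in own_ids:
            adl_to_own[adl_id] = own_ids[name]
    # Pass 2: collect every name recorded under those identifiers.
    variants = defaultdict(set)
    for line in lines:
        adl_id, name = parse(line)
        if adl_id in adl_to_own:
            variants[adl_to_own[adl_id]].add(name)
    return dict(variants)

# Made-up sample lines, not copied from the real file.
sample = [
    "id-1|Yerushalayim|2002-01-01",
    "id-1|Jerusalem|2002-01-01",
    "id-2|Spring Well|2002-01-01",
]
variants = build_variants(sample, {"Yerushalayim": 42})
```

One caution: a common name may appear under several identifiers, so a match on name alone is not guaranteed to be deterministic. You may need country or coordinates to disambiguate.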
Possible problems at this stage :
1) The file may be too large for your computing resources to handle.
2) The phonetic scheme from our first World Gazetteer file may not be
consistent with what's in the file, resulting in very few matches. We
won't know until we actually run the subset build job. My hunch is
that this is not a problem because it is such a huge file and looks
comprehensive.
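On problem 1, one way around the size is to stream the file into a database in batches, so it never has to fit in memory or in a viewer. Below is a sketch using SQLite. The table name adl_name, the batch size and the sample lines are my own assumptions, and I have not verified the real file's character encoding.

```python
import sqlite3
from itertools import islice

def load_names(conn, lines, batch_size=10000):
    """Insert 'adl_id|name|date' lines into SQLite in batches; return the count."""
    conn.execute("CREATE TABLE IF NOT EXISTS adl_name "
                 "(adl_id TEXT, name TEXT, entered TEXT)")
    split = (line.rstrip("\n").split("|") for line in lines)
    rows = (parts for parts in split if len(parts) == 3)  # skip malformed lines
    total = 0
    while True:
        batch = list(islice(rows, batch_size))
        if not batch:
            break
        conn.executemany("INSERT INTO adl_name VALUES (?, ?, ?)", batch)
        total += len(batch)
    # An index on name keeps the later name lookups fast.
    conn.execute("CREATE INDEX IF NOT EXISTS adl_name_idx ON adl_name (name)")
    conn.commit()
    return total

# Made-up sample lines; with the real file, pass an open file handle instead.
conn = sqlite3.connect(":memory:")
sample = [
    "id-1|Jerusalem|2002-01-01\n",
    "bad line\n",
    "id-1|Yerushalayim|2002-01-01\n",
    "id-2|Spring Well|2002-01-01\n",
]
loaded = load_names(conn, sample, batch_size=2)
```

Once loaded, counting how many of your own place names appear in the table would give an early read on problem 2 before running the full build.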
At this point I would encourage you to have a look at this file, and
see if you can use it. Again, you may have further questions, so don't
hesitate to request clarification again. Let me know what you need.
Good luck,
kyrie26-ga