Hi mikeu-ga,
Since I didn't find a central list containing all the Dow Jones stocks
on one page, I wrote a small program that retrieves the info from
the CBS site you listed. It was definitely a challenge to come up with
a program that parsed out all the required data correctly and it took
quite some time. I won't provide the code behind it here, as it was
programmed in a quick-and-dirty fashion. It did the job, however, and
you can find an XML file containing all the Dow Jones stocks, in
hierarchical order, at these locations:
Zipped (129k):
http://users.pandora.be/rami/dow/dow_jones_stocks.zip
Unzipped (740k):
http://users.pandora.be/rami/dow/dow_jones_stocks.xml
The structure of the XML will be obvious to you as soon as you take a
look at the file, since it's a very simple hierarchical structure,
so I won't bore you with a DTD or XML Schema.
If you know a little about parsing XML (and from reading your question
I believe you do), it's quite simple to, for instance, store the data
in a database, or use it in an application, directly from the XML.
Remember, however, that in line with the XML spec, every & is
written as &amp; in the XML file.
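
As a rough illustration of reading such a file, here is a minimal Java sketch using the standard DOM parser. It makes no claim about the real element names in dow_jones_stocks.xml: the sample string and its tags (industry, name, stock) are made up for the example, and the class name XmlOutline is hypothetical. It walks whatever hierarchy it finds and shows that the parser turns &amp; back into & for you.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class XmlOutline {

    // Parses an XML string into a DOM document (checked exceptions are wrapped).
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Appends one line per element, indented two spaces per nesting level.
    // Elements without child elements also get their (unescaped) text.
    static void outline(Element element, int depth, StringBuilder out) {
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
        out.append(element.getTagName());

        NodeList children = element.getChildNodes();
        boolean hasChildElements = false;
        for (int i = 0; i < children.getLength(); i++) {
            if (children.item(i) instanceof Element) {
                hasChildElements = true;
            }
        }
        if (!hasChildElements) {
            out.append(": ").append(element.getTextContent().trim());
        }
        out.append('\n');

        for (int i = 0; i < children.getLength(); i++) {
            if (children.item(i) instanceof Element) {
                outline((Element) children.item(i), depth + 1, out);
            }
        }
    }

    public static void main(String[] args) {
        // Made-up sample; the tag names are assumptions, not the real file's.
        String sample = "<industry><name>Telecom</name><stock>AT&amp;T</stock></industry>";
        StringBuilder out = new StringBuilder();
        outline(parse(sample).getDocumentElement(), 0, out);
        System.out.print(out);
        // Prints:
        // industry
        //   name: Telecom
        //   stock: AT&T
    }
}
```

For the real file you would hand the downloaded XML to the same DocumentBuilder instead of the sample string, and replace the printing with database inserts or whatever your application needs.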
If you have any questions about the files, please ask for a
clarification!
Hope this helps you along,
Kind regards,
Rhansenne-ga.
Clarification of Answer by rhansenne-ga on 02 Jul 2002 07:57 PDT
Hi again,
I wrote the program in Java, my programming language of choice.
Unfortunately I removed the code a day or two ago. Anyway, it was
coded quite 'dirty' in the sense that a change of layout to the site
might have rendered the parser useless. But since the program was only
meant to run a single time, I didn't want to spend too much time on
making it more generic.
The technique is quite similar to those bots that scan webpages to
extract e-mail addresses.
The parser starts at the CBS page with the hierarchy of groups. The
first task is to parse out all the industries and recognize their
level in the hierarchy. This can be achieved by looking at the
indentation (for instance WIDTH="20") and at whether or not there are
boldface tags around the name. Once the names and codes are retrieved,
the program uses the codes as parameters to get the page with all the
stocks in the current industry (retrieving the page source of a URL as
a String is quite easy in Java). Here again the stock names and codes
are parsed out, based on the format of the surrounding tags.
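
The original code is gone, so the following is only a sketch of the idea, not a reconstruction. It assumes a made-up row layout: one industry per line, an optional spacer cell whose WIDTH gives the indentation, an optional <B> around the link, and a link carrying a hypothetical code= parameter. The class name IndustryRows, the 20-pixel indent step, and the sample HTML are all assumptions; the real CBS markup was certainly different in its details.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class IndustryRows {

    // One parsed industry: nesting level, code from the link, display name, bold flag.
    static final class Industry {
        final int level;
        final String code;
        final String name;
        final boolean bold;

        Industry(int level, String code, String name, boolean bold) {
            this.level = level;
            this.code = code;
            this.name = name;
            this.bold = bold;
        }
    }

    // Assumed markup: WIDTH="n" on a spacer cell, code=XYZ inside the link target.
    private static final Pattern WIDTH =
            Pattern.compile("WIDTH=\"(\\d+)\"", Pattern.CASE_INSENSITIVE);
    private static final Pattern LINK =
            Pattern.compile("<A HREF=\"[^\"]*[?&]code=([^\"&]+)\"[^>]*>([^<]+)</A>",
                    Pattern.CASE_INSENSITIVE);
    private static final Pattern BOLD_LINK =
            Pattern.compile("<B>\\s*<A ", Pattern.CASE_INSENSITIVE);

    // Scans the page line by line; lines without a matching link are skipped.
    static List<Industry> parse(String html, int indentStep) {
        List<Industry> result = new ArrayList<>();
        for (String line : html.split("\n")) {
            Matcher link = LINK.matcher(line);
            if (!link.find()) {
                continue;
            }
            Matcher width = WIDTH.matcher(line);
            // No spacer cell means top level; otherwise one level per indent step.
            int level = width.find() ? Integer.parseInt(width.group(1)) / indentStep : 0;
            boolean bold = BOLD_LINK.matcher(line).find();
            result.add(new Industry(level, link.group(1), link.group(2).trim(), bold));
        }
        return result;
    }

    public static void main(String[] args) {
        String html =
                "<TR><TD><B><A HREF=\"list?code=TEC\">Technology</A></B></TD></TR>\n"
              + "<TR><TD WIDTH=\"20\"></TD><TD><A HREF=\"list?code=SFT\">Software</A></TD></TR>\n"
              + "<TR><TD WIDTH=\"40\"></TD><TD><A HREF=\"list?code=GAM\">Games</A></TD></TR>";
        for (Industry industry : parse(html, 20)) {
            System.out.println(industry.level + " " + industry.code + " " + industry.name
                    + (industry.bold ? " (bold)" : ""));
        }
    }
}
```

Matching on surrounding tags like this is exactly what makes such a parser fragile: any change to the page layout breaks the patterns, which is fine for a one-off run but not for anything long-lived.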
For each of these pages the parser then has to detect whether or not
there are links to follow-up pages present, and if so, recursively
follow them. The problem I faced here was avoiding an endless loop
(since previous pages shouldn't be retrieved again). Some extra checks
on the params corrected this.
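
To make the endless-loop problem concrete, here is a small Java sketch. Note that it uses a different guard than the parameter checks described above: it simply remembers every URL already retrieved in a set, which is the usual way to do this. The class name FollowUpCrawler is hypothetical, every HREF is treated as a follow-up link (a real parser would be pickier), and the page fetcher is passed in so the example can run against in-memory pages instead of a live site.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URI;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Scanner;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FollowUpCrawler {

    private static final Pattern HREF =
            Pattern.compile("HREF=\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    // Reads the page source of an absolute URL into a String.
    static String fetch(String url) {
        try (InputStream in = URI.create(url).toURL().openStream();
             Scanner scanner = new Scanner(in, "UTF-8").useDelimiter("\\A")) {
            return scanner.hasNext() ? scanner.next() : "";
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Visits url and, recursively, every page it links to, each at most once.
    // 'order' records the pages in the order they were actually retrieved.
    static void crawl(String url, Function<String, String> fetcher,
                      Set<String> visited, List<String> order) {
        if (!visited.add(url)) {
            return; // already retrieved: this is what prevents the endless loop
        }
        order.add(url);
        Matcher links = HREF.matcher(fetcher.apply(url));
        while (links.find()) {
            crawl(links.group(1), fetcher, visited, order);
        }
    }

    public static void main(String[] args) {
        // In-memory stand-in for the site: page2 links back to page1.
        Map<String, String> pages = new HashMap<>();
        pages.put("page1", "<A HREF=\"page2\">next</A>");
        pages.put("page2", "<A HREF=\"page1\">previous</A> <A HREF=\"page3\">next</A>");
        pages.put("page3", "last page, no links");

        List<String> order = new ArrayList<>();
        crawl("page1", u -> pages.getOrDefault(u, ""), new HashSet<>(), order);
        System.out.println(order); // prints [page1, page2, page3]
    }
}
```

Against a live site you would pass FollowUpCrawler::fetch as the fetcher and resolve relative links to absolute URLs before following them.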
Anyway, I had a lot of fun developing the parser (I learned quite a
bit myself ;-) )
Kind regards,
rhansenne-ga.