Google Answers: Parse an Internet Explorer Bookmark file in Python or Java

View Question

Q: Parse an Internet Explorer Bookmark file in Python or Java ( Answered 5 out of 5 stars

Question

Subject: Parse an Internet Explorer Bookmark file in Python or Java
Category: Computers > Programming
Asked by: coolguy90210-ga
List Price: $20.00

Posted: 31 May 2004 07:52 PDT
Expires: 30 Jun 2004 07:52 PDT
Question ID: 354247

I'm quite embarrassed to ask for help on this as I've already banged
my head against the wall trying to figure this one out for 8 hours
over this weekend and I think this should be easily solvable!

My question is:  How do I parse an IE saved bookmark.html file such
that no matter how many subfolders are contained by a top level
folder, the correct parent/child folder path will be maintained.

Here is what I mean.  For the following list of bookmark folders and
subfolders, insert the path into a database or write to a file:

Let's say we have 3 top level folders (all starting at an index of 4
by the way in IE).

Arts
Business
Computers

In Arts we have:  Animation.

In Animation we have:  Education.

In Education we have:  On-line

And so on.  Each top level folder might have no sub folders, or it
might have 1000's.  Each sub folder in turn might have no sub folders,
or it might have 1000's.

Help me get past my stuck point.  I can get a list of the top level
folders and every sub folder within, but not the actual parent / child
relationship as detailed above.

The end result of the above example would look like this inserted as a
single row in a MySQL database, or written to a text file:

Arts
Arts/Animation
Arts/Animation/Education
Arts/Animation/Education/On-line
Business   (assuming no sub folders)
Computers  (assuming no sub folders)

Here is my existing Python code.  Your code may be in Python or Java. 
Preferably in Python.  You may modify the code as you see fit. 
Simplify it, or start over from scratch.  You are ONLY reading the
bookmark.html file!!!  You are not traversing the Windows Favorites
directory.  (Actually, I've already written something to do that, and
that was easy.  The OS tells you what the path is.)  No other language
is acceptable as I only have experience with the above 2.  Also, if
you could explain what was difficult about this?  I mean it seems so
straightforward, and I was able to write it out in pseudocode no
problem, but when it came to implementing it....I just can't get my
mind to go beyond the second level of folders.  Perhaps a suggestion
on how to do these problems in the future?

import os
import string
import re

f=open('all_bookmark.html', 'r+')

line_list = f.readlines()
xy = 4
top_folder = ''
sub_folder = []
folder = []
filename = 'out.txt'        	
localfile = open(filename, 'wb')
p = re.compile('<DT><H3 FOLDED ADD_DATE=.*?>')
o = re.compile(' ADD_DATE.*?.>')

for x in line_list:	
	
	y = string.find(x,'<');

	if(y == xy):
		strip_version = x.lstrip()
		if(strip_version.startswith('<DT><H3 FOLDED ADD_DATE')):
			sub_folder = []
			strip_version = x.lstrip()
			strip_version = p.sub('', strip_version)
			strip_version = strip_version.replace('</H3>','')
			strip_version = strip_version.replace('\n','')
			top_folder = strip_version
			list.append(folder, top_folder)
	
	if(y > xy):		
		strip_version = x.lstrip()
		if(strip_version.startswith('<DT><H3 FOLDED ADD_DATE')):
			strip_version = p.sub('', strip_version)
			strip_version = strip_version.replace('</H3>','')
			strip_version = strip_version.replace('\n','')
			sub_folder = [top_folder + '/' + strip_version]	
			list.append(folder, sub_folder)
				
	if(y < xy):		
		strip_version = x.lstrip()
		if(strip_version.startswith('<DT><H3 FOLDED ADD_DATE')):
			strip_version = p.sub('', strip_version)
			strip_version = strip_version.replace('</H3>','')
			strip_version = strip_version.replace('\n','')
			sub_folder = [sub_folder + '/' + strip_version]
			folder[0:-1]
			list.append(folder, sub_folder)
			
'''Now that I have all of the folder paths, I iterate through the
folder list and write each line
to a file.  Problem is, in Python, you can't write the element of a
list to a file.....must be
a way, but I haven't found it.'''
for x in folder:
	
	localfile.write(string.join(x) + '\n')
	
localfile.close()

Answer

Subject: Re: Parse an Internet Explorer Bookmark file in Python or Java
Answered By: efn-ga on 31 May 2004 23:45 PDT
Rated: 5 out of 5 stars

Hi coolguy90210,

In a Netscape-format bookmark file exported from Internet Explorer,
each folder is listed in a DT element, which is followed by a DL
element, which may be empty, or may contain folders or links.  That
means if you see a folder inside a DL element, you know it's inside
the folder named in the preceding DT element, and if you see the end
of a DL element, you know you have reached the end of the containing
folder.

So when you see a folder name, you know you want to put out a line,
and the problem is what previously seen folder names to prefix to the
one just read.  To get the prefix, every time you see a folder name,
you start a scope and append the name to the current prefix, and every
time you see the end of a DL element, you know you have left a folder
scope, so you remove the last folder name from the prefix.

So I wrote the program (in Python) to look just for DT elements for
folder names and the ends of DL elements (coded as "</DL>") that mark
the ends of folder scopes.  Here's the program:


#   Extract a list of folder pathnames from a Netscape-style bookmark file
#   as exported by Microsoft Internet Explorer.

#   Constants
ENDFOLDER = "</DL><p>"
PREFIX = "<DT><H3 FOLDED ADD_DATE="

#   Input file
f=open('c:/Doc/bookmark.htm', 'r+')

#   Output file
filename = 'out.txt'        
localfile = open(filename, 'w')

currentFolder = ""

#   Read in the whole input file.
line_list = f.readlines()

for x in line_list:
    #   Remove all leading whitespace.
    line = x.lstrip()
    if line.startswith(PREFIX):
        #   This line tells us about a folder.
        #   Find the next '>' after the prefix.
        lpos = line.find('>', len(PREFIX))
        if lpos > 0:
            #   Find the next '<' after the '>'.
            rpos = line.find('<', lpos)
            #   Extract the folder name.
            name = line[lpos + 1:rpos]
            #   If at the top level, we don't want a leading slash.
            if len(currentFolder) == 0:
                currentFolder = name
            else:
                #   This folder must be within the last one we saw,
                #   so append its name to the currentFolder name.
                currentFolder = currentFolder + '/' + name
            #   Put out the constructed pathname.
            localfile.write(currentFolder + '\n')
            
    elif line.startswith(ENDFOLDER):
        #   The other kind of interesting input marks the end of a
        #   containing folder's scope.
        #   Trim the last folder name from the current path.
        trimpos = currentFolder.rfind('/')
        if trimpos < 0:
            #   Must be top level, so trash it all.
            currentFolder = ""
        else:
            currentFolder = currentFolder[:trimpos]

localfile.close()


I expect it is also possible to solve the problem using the standard
Python htmllib module, which would actually parse all the HTML.

http://www.python.org/doc/current/lib/module-htmllib.html

It's hard to say why this was hard for you.  It looks like the
difficulty was in finding the algorithm rather than implementing it.

I'm also not sure how to advise you about tackling such problems in
the future.  In this case, it might have helped to review basic
computer science material on parsing and tree data structures, as well
as the structure of HTML.  And with Python, unless you are dealing
with something very specialized, it's often a good idea to look for
existing modules that do some of the work for you.

I hope this answer is helpful.  If anything is unclear or you need
more information about any part of it, please ask for a clarification.

--efn

coolguy90210-ga rated this answer: 5 out of 5 stars

and gave an additional tip of: $5.00

efn-ga answered my question about parsing IE bookmarks and going
beyond the first 2 levels of bookmarks.  He actually re-wrote my
Python code from scratch, and the result is that the program is now
more elegant, clear and concise.

Comments

Subject: Re: Parse an Internet Explorer Bookmark file in Python or Java
From: yahmed-ga on 31 May 2004 14:49 PDT

How about maintaining an array of "LinkedList" (
http://java.sun.com/j2se/1.3/docs/api/java/util/LinkedList.html ) for
these folders?

Subject: Re: Parse an Internet Explorer Bookmark file in Python or Java
From: coolguy90210-ga on 01 Jun 2004 03:54 PDT

yahmed-ga,

Thanks for the reference.  I looked it over, but it wasn't quite what
I had in mind.  I appreciate the effort, and have enjoyed reading your
posts.

Subject: Re: Parse an Internet Explorer Bookmark file in Python or Java
From: coolguy90210-ga on 01 Jun 2004 04:36 PDT

efn-ga,

Your solution worked flawlessly.  As I went over your code, I asked
myself why hadn't I thought of that.  Seems pretty obvious now.

Look for more Python related work from me.  Sporadic though.  I try to
solve the problem myself first.  But after my experience here, I'll
never spend hours of my own time getting stuck and frustrated, when I
could have the answer here and continue on with the more important
aspects of the web app I'm developing.

Curious to know, how long did solving it take you?  I got caught up in
thinking of the solution in a way that would involve taking into
account the indexOf() the starting regex, going forward by +4 as more
subfolders are discovered, and then backwards, and I completely
overlooked the fact that each bookmark category has a closing tag.  As
I look at it now, what threw me was the lack of a </DT> closing tag.

Subject: Re: Parse an Internet Explorer Bookmark file in Python or Java
From: efn-ga on 01 Jun 2004 08:19 PDT

I would guess it took me about an hour to get the program working and
another hour to polish it and write the answer.  This is just a guess,
though, because I was concurrently doing other things and didn't track
my time on the question.

Thanks for the rating and the tip.  I'll be happy to help you with
other questions if I can.

--efn

Important Disclaimer: Answers and comments provided on Google Answers are general information, and are not intended to substitute for informed professional medical, psychiatric, psychological, tax, legal, investment, accounting, or other professional advice. Google does not endorse, and expressly disclaims liability for any product, manufacturer, distributor, service or service provider mentioned or any opinion expressed in answers or comments. Please read carefully the Google Answers Terms of Service.

If you feel that you have found inappropriate content, please let us know by emailing us at answers-support@google.com with the question ID listed above. Thank you.

Search Google Answers for

Google Home - Answers FAQ - Terms of Service - Privacy Policy