Subject: Parse an Internet Explorer Bookmark file in Python or Java
Category: Computers > Programming
Asked by: coolguy90210-ga
List Price: $20.00
Posted: 31 May 2004 07:52 PDT
Expires: 30 Jun 2004 07:52 PDT
Question ID: 354247
I'm quite embarrassed to ask for help on this as I've already banged my head against the wall trying to figure this one out for 8 hours over this weekend and I think this should be easily solvable!

My question is: How do I parse an IE saved bookmark.html file such that no matter how many subfolders are contained by a top level folder, the correct parent/child folder path will be maintained.

Here is what I mean. For the following list of bookmark folders and subfolders, insert the path into a database or write to a file:

Let's say we have 3 top level folders (all starting at an index of 4 by the way in IE).

Arts
Business
Computers

In Arts we have: Animation.
In Animation we have: Education.
In Education we have: On-line

And so on. Each top level folder might have no sub folders, or it might have 1000's. Each sub folder in turn might have no sub folders, or it might have 1000's.

Help me get past my stuck point. I can get a list of the top level folders and every sub folder within, but not the actual parent / child relationship as detailed above.

The end result of the above example would look like this inserted as a single row in a MySQL database, or written to a text file:

Arts
Arts/Animation
Arts/Animation/Education
Arts/Animation/Education/On-line
Business (assuming no sub folders)
Computers (assuming no sub folders)

Here is my existing Python code. Your code may be in Python or Java. Preferably in Python. You may modify the code as you see fit. Simplify it, or start over from scratch.

You are ONLY reading the bookmark.html file!!! You are not traversing the Windows Favorites directory. (Actually, I've already written something to do that, and that was easy. The OS tells you what the path is.)

No other language is acceptable as I only have experience with the above 2.

Also, if you could explain what was difficult about this? I mean it seems so straightforward, and I was able to write it out in pseudocode no problem, but when it came to implementing it....I just can't get my mind to go beyond the second level of folders. Perhaps a suggestion on how to do these problems in the future?

    import os
    import string
    import re

    f=open('all_bookmark.html', 'r+')
    line_list = f.readlines()
    xy = 4
    top_folder = ''
    sub_folder = []
    folder = []
    filename = 'out.txt'
    localfile = open(filename, 'wb')
    p = re.compile('<DT><H3 FOLDED ADD_DATE=.*?>')
    o = re.compile(' ADD_DATE.*?.>')

    for x in line_list:
        y = string.find(x,'<');
        if(y == xy):
            strip_version = x.lstrip()
            if(strip_version.startswith('<DT><H3 FOLDED ADD_DATE')):
                sub_folder = []
                strip_version = x.lstrip()
                strip_version = p.sub('', strip_version)
                strip_version = strip_version.replace('</H3>','')
                strip_version = strip_version.replace('\n','')
                top_folder = strip_version
                list.append(folder, top_folder)
        if(y > xy):
            strip_version = x.lstrip()
            if(strip_version.startswith('<DT><H3 FOLDED ADD_DATE')):
                strip_version = p.sub('', strip_version)
                strip_version = strip_version.replace('</H3>','')
                strip_version = strip_version.replace('\n','')
                sub_folder = [top_folder + '/' + strip_version]
                list.append(folder, sub_folder)
        if(y < xy):
            strip_version = x.lstrip()
            if(strip_version.startswith('<DT><H3 FOLDED ADD_DATE')):
                strip_version = p.sub('', strip_version)
                strip_version = strip_version.replace('</H3>','')
                strip_version = strip_version.replace('\n','')
                sub_folder = [sub_folder + '/' + strip_version]
                folder[0:-1]
                list.append(folder, sub_folder)

    '''Now that I have all of the folder paths, I iterate through the
    folder list and write each line to a file. Problem is, in Python,
    you can't write the element of a list to a file.....must be a way,
    but I haven't found it.'''
    for x in folder:
        localfile.write(string.join(x) + '\n')
    localfile.close()
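On the side question raised in the code comment above (writing the elements of a list to a file), here is a minimal sketch. It assumes every element of the list is already a plain string holding one folder path; the function name write_paths, the sample list, and the file name folders_demo.txt are made up for illustration and are not part of the original post.

```python
import os
import tempfile


def write_paths(paths, filename):
    """Write each folder path in `paths` on its own line of `filename`."""
    # Text mode ("w") lets us write ordinary strings directly.
    with open(filename, "w") as out:
        for path in paths:
            out.write(path + "\n")


# Hypothetical sample data mirroring the example in the question.
folders = ["Arts", "Arts/Animation", "Arts/Animation/Education"]
demo_file = os.path.join(tempfile.gettempdir(), "folders_demo.txt")
write_paths(folders, demo_file)
```

The point is only that each list element must be a string before it is written. A list that mixes strings and nested lists, as the folder list in the question appears to, would need flattening first.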
Subject: Re: Parse an Internet Explorer Bookmark file in Python or Java
Answered By: efn-ga on 31 May 2004 23:45 PDT
Hi coolguy90210,

In a Netscape-format bookmark file exported from Internet Explorer, each folder is listed in a DT element, which is followed by a DL element, which may be empty, or may contain folders or links. That means if you see a folder inside a DL element, you know it's inside the folder named in the preceding DT element, and if you see the end of a DL element, you know you have reached the end of the containing folder.

So when you see a folder name, you know you want to put out a line, and the problem is what previously seen folder names to prefix to the one just read. To get the prefix, every time you see a folder name, you start a scope and append the name to the current prefix, and every time you see the end of a DL element, you know you have left a folder scope, so you remove the last folder name from the prefix.

So I wrote the program (in Python) to look just for DT elements for folder names and the ends of DL elements (coded as "</DL>") that mark the ends of folder scopes.

Here's the program:

    # Extract a list of folder pathnames from a Netscape-style bookmark file
    # as exported by Microsoft Internet Explorer.

    # Constants
    ENDFOLDER = "</DL><p>"
    PREFIX = "<DT><H3 FOLDED ADD_DATE="

    # Input file
    f=open('c:/Doc/bookmark.htm', 'r+')

    # Output file
    filename = 'out.txt'
    localfile = open(filename, 'w')

    currentFolder = ""

    # Read in the whole input file.
    line_list = f.readlines()

    for x in line_list:
        # Remove all leading whitespace.
        line = x.lstrip()
        if line.startswith(PREFIX):
            # This line tells us about a folder.
            # Find the next '>' after the prefix.
            lpos = line.find('>', len(PREFIX))
            if lpos > 0:
                # Find the next '<' after the '>'.
                rpos = line.find('<', lpos)
                # Extract the folder name.
                name = line[lpos + 1:rpos]
                # If at the top level, we don't want a leading slash.
                if len(currentFolder) == 0:
                    currentFolder = name
                else:
                    # This folder must be within the last one we saw,
                    # so append its name to the currentFolder name.
                    currentFolder = currentFolder + '/' + name
                # Put out the constructed pathname.
                localfile.write(currentFolder + '\n')
        elif line.startswith(ENDFOLDER):
            # The other kind of interesting input marks the end of a
            # containing folder's scope.
            # Trim the last folder name from the current path.
            trimpos = currentFolder.rfind('/')
            if trimpos < 0:
                # Must be top level, so trash it all.
                currentFolder = ""
            else:
                currentFolder = currentFolder[:trimpos]

    localfile.close()

I expect it is also possible to solve the problem using the standard Python htmllib module, which would actually parse all the HTML.

http://www.python.org/doc/current/lib/module-htmllib.html

It's hard to say why this was hard for you. It looks like the difficulty was in finding the algorithm rather than implementing it. I'm also not sure how to advise you about tackling such problems in the future. In this case, it might have helped to review basic computer science material on parsing and tree data structures, as well as the structure of HTML. And with Python, unless you are dealing with something very specialized, it's often a good idea to look for existing modules that do some of the work for you.

I hope this answer is helpful. If anything is unclear or you need more information about any part of it, please ask for a clarification.

--efn
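As a rough illustration of the parsing-module route mentioned in the answer: htmllib no longer exists in Python 3, so this sketch assumes its standard-library successor, html.parser. The class name FolderPathParser, the helper folder_paths, and the SAMPLE text are hypothetical, made up for this example. It also assumes that every folder heading (H3) is followed by its own DL element, as the answer describes.

```python
from html.parser import HTMLParser


class FolderPathParser(HTMLParser):
    """Collect folder paths from a Netscape-format bookmark file."""

    def __init__(self):
        super().__init__()
        self.stack = []              # names of the folders currently open
        self.paths = []              # finished "A/B/C" path strings
        self.scopes = []             # one flag per open <DL>: True if it belongs to a folder
        self.in_h3 = False           # True while reading a folder name
        self.name_parts = []         # text chunks of the current folder name
        self.folder_waiting = False  # a folder was named, its <DL> not yet seen

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lower case.
        if tag == "h3":
            self.in_h3 = True
            self.name_parts = []
        elif tag == "dl":
            # A <DL> right after a folder heading opens that folder's scope.
            self.scopes.append(self.folder_waiting)
            self.folder_waiting = False

    def handle_data(self, data):
        if self.in_h3:
            self.name_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "h3" and self.in_h3:
            self.in_h3 = False
            self.stack.append("".join(self.name_parts).strip())
            self.paths.append("/".join(self.stack))
            self.folder_waiting = True
        elif tag == "dl" and self.scopes:
            # Leaving a folder's scope: drop its name from the stack.
            if self.scopes.pop():
                self.stack.pop()


def folder_paths(html_text):
    parser = FolderPathParser()
    parser.feed(html_text)
    parser.close()
    return parser.paths


# Hypothetical sample in the exported bookmark layout.
SAMPLE = """<!DOCTYPE NETSCAPE-Bookmark-file-1>
<H1>Bookmarks</H1>
<DL><p>
    <DT><H3 FOLDED ADD_DATE="1086000000">Arts</H3>
    <DL><p>
        <DT><H3 FOLDED ADD_DATE="1086000000">Animation</H3>
        <DL><p>
            <DT><A HREF="http://example.com/">Example</A>
        </DL><p>
    </DL><p>
    <DT><H3 FOLDED ADD_DATE="1086000000">Business</H3>
    <DL><p>
    </DL><p>
</DL><p>
"""

for path in folder_paths(SAMPLE):
    print(path)
```

Compared with matching line prefixes, this version does not depend on each tag sitting on its own line or on the exact attributes of the H3 tag, at the cost of a little more bookkeeping.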
coolguy90210-ga rated this answer and gave an additional tip of: $5.00

efn-ga answered my question about parsing IE bookmarks and going beyond the first 2 levels of bookmarks. He actually re-wrote my Python code from scratch, and the result is that the program is now more elegant, clear and concise.
Subject: Re: Parse an Internet Explorer Bookmark file in Python or Java
From: yahmed-ga on 31 May 2004 14:49 PDT

How about maintaining an array of "LinkedList" ( http://java.sun.com/j2se/1.3/docs/api/java/util/LinkedList.html ) for these folders?
Subject: Re: Parse an Internet Explorer Bookmark file in Python or Java
From: coolguy90210-ga on 01 Jun 2004 03:54 PDT

yahmed-ga, Thanks for the reference. I looked it over, but it wasn't quite what I had in mind. I appreciate the effort, and have enjoyed reading your posts.
Subject: Re: Parse an Internet Explorer Bookmark file in Python or Java
From: coolguy90210-ga on 01 Jun 2004 04:36 PDT

efn-ga, Your solution worked flawlessly. As I went over your code, I asked myself why hadn't I thought of that. Seems pretty obvious now. Look for more Python related work from me. Sporadic though. I try to solve the problem myself first. But after my experience here, I'll never spend hours of my own time getting stuck and frustrated, when I could have the answer here and continue on with the more important aspects of the web app I'm developing.

Curious to know, how long did solving it take you? I got caught up in thinking of the solution in a way that would involve taking into account the indexOf() the starting regex, going forward by +4 as more subfolders are discovered, and then backwards, and I completely overlooked the fact that each bookmark category has a closing tag. As I look at it now, what threw me was the lack of a </DT> closing tag.
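The observation in the comment above, that every folder's scope ends with a closing tag, can also be sketched with a plain Python list used as a stack instead of a slash-joined string. This is only an illustration under stated assumptions: the function name folder_paths_from_lines and the sample lines are hypothetical, folder lines are assumed to start with <DT><H3 once leading whitespace is removed, and scopes are assumed to end on lines starting with </DL>.

```python
def folder_paths_from_lines(lines):
    """Return one 'Parent/Child' path per folder found in `lines`."""
    stack = []   # names of the folders we are currently inside
    paths = []
    for raw in lines:
        line = raw.lstrip()
        if line.startswith("<DT><H3"):
            # The folder name sits between the end of the H3 tag and </H3>.
            start = line.find(">", len("<DT><H3")) + 1
            end = line.find("</H3>", start)
            if start > 0 and end >= 0:
                stack.append(line[start:end])
                paths.append("/".join(stack))
        elif line.startswith("</DL>") and stack:
            # End of a folder's scope: forget the innermost folder.
            stack.pop()
    return paths


# Hypothetical sample lines in the exported bookmark layout.
sample_lines = [
    '<DL><p>',
    '    <DT><H3 FOLDED ADD_DATE="1086000000">Arts</H3>',
    '    <DL><p>',
    '        <DT><H3 FOLDED ADD_DATE="1086000000">Animation</H3>',
    '        <DL><p>',
    '            <DT><H3 FOLDED ADD_DATE="1086000000">Education</H3>',
    '            <DL><p>',
    '            </DL><p>',
    '        </DL><p>',
    '    </DL><p>',
    '    <DT><H3 FOLDED ADD_DATE="1086000000">Business</H3>',
    '    <DL><p>',
    '    </DL><p>',
    '</DL><p>',
]

for path in folder_paths_from_lines(sample_lines):
    print(path)
```

Keeping the names in a list means a folder whose own name contains a slash is still removed as a whole when its scope ends, and the indentation of each line never has to be counted.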
Subject: Re: Parse an Internet Explorer Bookmark file in Python or Java
From: efn-ga on 01 Jun 2004 08:19 PDT

I would guess it took me about an hour to get the program working and another hour to polish it and write the answer. This is just a guess, though, because I was concurrently doing other things and didn't track my time on the question.

Thanks for the rating and the tip. I'll be happy to help you with other questions if I can.

--efn