Hello, bmcompany:
I have just placed a preliminary version at this location:
http://www.xpde.com/HTMLParser.zip
It's just to check whether I'm on the right path: you can open any
report and press the Play button to get some results parsed. It still
doesn't ask for a list of optimized pages; this will be done once you
press the Play button. Please tell me if I'm on the right path and if
you like the interface. I will finish it on Monday so we can talk
about it and tune it to get a final version on Tuesday.
Regards. |
Request for Answer Clarification by
bmcompany-ga
on
07 Sep 2003 03:16 PDT
Hi there.
Absolutely brilliant - works great - and very speedy too.
I've tested it with the biggest report I can find - 4 MB and 2,500
positions - and it worked perfectly.
Once you have the positions analysed with only the correct pages,
you're there.
It may be a bit late to add this to the spec, but is there any
chance that you could produce an extra set of figures - positions with
all pages and positions with just the pages in the dir?
If that is a quick job then that would make for a very useful
additional feature.
And for your next trick (obviously this will be posted as a separate
question and priced accordingly): can you think of a way to analyse a
directory with, say, 50 reports and produce a summary for each report?
This will be tricky when you have to select the pages. Just something
to think about if you're interested in taking this further.
But it's looking very good indeed.
Thanks again for your time
|
Clarification of Answer by
joseleon-ga
on
08 Sep 2003 00:38 PDT
Hello, bmcompany:
I have some questions, just to finish the next version:
-Can you send me your biggest file, so I can test the performance in
the worst case?
-Can you send me a report and a list of optimized pages that match
that report?
-Do you want the counts to be exclusive or inclusive? That is:
-If I find a #1 position, right now that position counts toward all of:
Number ones
Top 10
Top 20
-Do you want it to work this way, or should number 1 positions
count only as number 1 positions and NOT as Top 10 and Top 20?
-Can you send me an HTML template (just a basic design) to put the
results inside? If not, don't worry; I will design a basic one
-What should the default filename for the results page be? E.g., if
the source input file is example1.htm, the results could be
results_example1.htm
-Regarding the last requested feature, do you want to add an extra
set? That is:
First set (optimized pages)
-Total number of positions
-Total number 1 positions
-Total Top 10 positions
-Total Top 20 positions
Second set (all pages)
-Total number of positions
-Total number 1 positions
-Total Top 10 positions
-Total Top 20 positions
Is this what you want?
-Regarding the summary feature you want: no problem. Once the process
works with a single file, it's just a matter of automating it for
several files.
Regards.
|
Request for Answer Clarification by
bmcompany-ga
on
08 Sep 2003 02:36 PDT
-Can you send me your biggest file, so I can test the performance in
the worst case?
http://www.sesuk.net/big.zip (link will work in half an hour)
-Can you send me a report and a list of optimized pages that match
that report?
http://www.sesuk.net/pages0001.zip (link will work in half an hour)
-Do you want the counts to be exclusive or inclusive? That is:
-If I find a #1 position, right now that position counts toward all of:
Number ones
Top 10
Top 20
-Do you want it to work this way, or should number 1 positions
count only as number 1 positions and NOT as Top 10 and Top 20?
For a #1 position, it needs to be counted as a #1, a top 10 and a top
20.
For a #7 position, it needs to be counted as a top 10 and top 20. Hope
that’s clear.
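The inclusive counting rule agreed on here could be sketched like this (Python used purely for illustration; the actual tool is written in Delphi, and the function name is hypothetical):

```python
def tally_positions(positions):
    """Inclusive tally: a #1 also counts toward Top 10 and Top 20,
    and e.g. a #7 counts toward both Top 10 and Top 20."""
    counts = {"total": 0, "number_1": 0, "top_10": 0, "top_20": 0}
    for pos in positions:
        counts["total"] += 1
        if pos == 1:
            counts["number_1"] += 1
        if pos <= 10:
            counts["top_10"] += 1
        if pos <= 20:
            counts["top_20"] += 1
    return counts

# A #1 and a #7 both fall in the Top 10 and Top 20 buckets:
# tally_positions([1, 7]) -> {"total": 2, "number_1": 1, "top_10": 2, "top_20": 2}
```

Note that the buckets are cumulative rather than mutually exclusive, which is exactly the behaviour requested above.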
-Can you send me an HTML template (just a basic design) to put
results inside? If not, don't worry, I will design a basic one
Basic one is fine. This report is only for us to pull positions out
of, not to send to clients.
-What should the default filename for the results page be? E.g., if
the source input file is example1.htm, the results could be
results_example1.htm
Is there any chance that the report could be named based on the first
URL in the report? The report header includes the URL under "Performed
for: www.domain.com". If there is more than one domain, just the first
one is fine.
Additionally, could you insert either a random or an incremental figure
at the END of the filename?
Hope all that's ok.
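The naming scheme requested here (first "Performed for:" domain as the base name, plus a figure at the end) might be sketched as follows; this is an illustrative guess in Python, and the helper name, regex, and timestamp format are assumptions, not the tool's actual implementation:

```python
import re
import time

def results_filename(report_html, stamp=None):
    # Take the first "Performed for: www.domain.com" header in the
    # report as the base name; fall back to "results" if absent.
    m = re.search(r"Performed for:\s*([\w.-]+)", report_html)
    base = m.group(1) if m else "results"
    # Append a figure at the END of the filename; a timestamp is one
    # possible incremental figure (format is an assumption).
    stamp = stamp or time.strftime("%Y%m%d%H%M%S")
    return f"{base}_{stamp}.htm"
```

For example, a report headed "Performed for: www.domain.com" would be saved under a name like `www.domain.com_20030908.htm`.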
-Regarding the last requested feature, do you want to add an extra
set? That is:
First set (optimized pages)
-Total number of positions
-Total number 1 positions
-Total Top 10 positions
-Total Top 20 positions
Second set (all pages)
-Total number of positions
-Total number 1 positions
-Total Top 10 positions
-Total Top 20 positions
Is this what you want?
That's exactly correct!
-Regarding the summary feature you want: no problem. Once the process
works with a single file, it's just a matter of automating it for
several files.
Excellent, we'll chat about this once this question is closed.
Thank you again for your time and attention to this project. We all
look forward to working further with you in the very near future.
|
Request for Answer Clarification by
bmcompany-ga
on
08 Sep 2003 02:44 PDT
Regarding my previous post: the biggest report I can find is at
www.sesuk.net/big.zip. I don't have the pages for that one, but the
other archive, http://www.sesuk.net/pages0001.zip, has a 1.5 MB report
with the pages.
The report is called report_here.htm.
Thanks again
|
Clarification of Answer by
joseleon-ga
on
08 Sep 2003 02:48 PDT
Hello bmcompany,
I uploaded an updated version at the same time as you were writing
your clarification; here it is:
http://www.xpde.com/HTMLParser2.zip
I'm going to answer your last clarification:
"http://www.sesuk.net/big.zip (link will work in half an hour)"
Great.
"http://www.sesuk.net/pages0001.zip (link will work in half an hour)"
Great too.
"For a #1 position, it needs to be counted as a #1, a top 10 and a top
20. For a #7 position, it needs to be counted as a top 10 and top 20.
Hope that's clear."
Perfectly clear and this is the way it works now.
"Basic one is fine. This report is only for us to pull positions out
of, not to send to clients."
OK, this is already included in the last update; please check it out.
In any case, I'm going to externalize the template so you can change it
at any time if needed. Don't worry, it doesn't take much time.
"Is there any chance that the report could be named based on the first
URL in the report? The report header includes the URL under "Performed
for: www.domain.com". If there is more than one domain, just the first
one is fine."
I will look into it; in the last update it is named results_XXXXX.htm.
I will change it.
"Additionally, could you insert either a random or an incremental
figure at the END of the filename?"
Is a timestamp OK for you? Visible or not visible? Human-readable or a
Unix-style timestamp?
"That's exactly correct!"
OK, this is already included in the last update; please check it out.
"Excellent, we'll chat about this once this question is closed.
Thank you again for your time and attention to this project. We all
look forward to working further with you in the very near future."
Thanks, you are a very kind customer ;-)
Regards.
|
Request for Answer Clarification by
bmcompany-ga
on
08 Sep 2003 03:19 PDT
A visible timestamp is fine.
I've tested the latest version and it's 100% perfect! All we wanted and
more.
If you could sort out naming the report after the domain, that would be
a great bonus, but if it's going to take a while, I understand; you
have worked very hard on this already and we already have more than we
asked for.
A million thanks again.
|
Request for Answer Clarification by
bmcompany-ga
on
08 Sep 2003 03:38 PDT
We've just spotted something.
Can you investigate this for me?
http://www.sesuk.net/possible_bug.zip
The total results read LESS than the optimised results. I thought at
first that the results were just the wrong way round, i.e.
optimised/total, but it works fine with the other reports.
Let me know what you think
|
Clarification of Answer by
joseleon-ga
on
08 Sep 2003 04:06 PDT
Hello, bmcompany:
The problem was that optimized pages are named like this:
London_Connaught.htm
The_London_Connaught.htm
So when I found a URL that pointed to The_London_Connaught.htm, it was
counted twice because it also matched London_Connaught.htm. It's fixed
now; download it from:
http://www.xpde.com/HTMLParser3.zip
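The fix boils down to matching the URL's exact filename against the optimized-page list instead of doing a substring match; a minimal Python sketch of the idea (the real tool is Delphi, and the function name is hypothetical):

```python
import posixpath
from urllib.parse import urlparse

def is_optimized(url, optimized_pages):
    # Compare the URL's final path component against the list of
    # optimized pages exactly; a substring match would wrongly count
    # The_London_Connaught.htm as a hit for London_Connaught.htm too.
    filename = posixpath.basename(urlparse(url).path)
    return filename in set(optimized_pages)
```

With exact matching, a URL ending in The_London_Connaught.htm matches only that entry in the list, so it is counted once.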
I will fix the remaining items this afternoon. Also, I would like you
to do the following test:
-Get a small report that features most of the variations a report
can have
-Then, calculate *by hand* what the results should be
-Then, use the software and see whether the results match
If they don't, send the report to me to check; this is the only way you
can be sure the software works as it should, since maybe I'm forgetting
to count some kind of link.
Regards.
|
Request for Answer Clarification by
bmcompany-ga
on
08 Sep 2003 05:06 PDT
http://www.sesuk.net/checker1.zip
It is slightly out with this one. Some positions aren't being counted.
I think it's because a keyword has 2 positions, e.g. listed at #1 AND
#7. Have a look and let me know what you think.
|
Clarification of Answer by
joseleon-ga
on
08 Sep 2003 06:40 PDT
Hello, bmcompany:
I have fixed the problems; it was just a matter of adding more types of
lines to parse. Feel free to download it from here; I have also
included all the features we were talking about (templates, timestamp,
report name, etc.):
http://www.xpde.com/HTMLParser4.zip
As soon as you tell me everything is OK, I will post the source code.
Regards.
|
Request for Answer Clarification by
bmcompany-ga
on
08 Sep 2003 06:55 PDT
Ok, all is good!
Please post the source code and I'll accept the answer.
I think this answer easily deserves 2 stars!
|
Request for Answer Clarification by
bmcompany-ga
on
08 Sep 2003 07:09 PDT
I'm only joking about the 2 stars. :)
2 small problems to report:
1. The Save As doesn't work. The button does nothing.
2. Copying from the results preview doesn't work. The Copy option
appears on right-click, but the data isn't copied to the clipboard.
|
Clarification of Answer by
joseleon-ga
on
08 Sep 2003 07:11 PDT
Hello, bmcompany:
You can download the source code from here:
http://www.xpde.com/HTML_Parser.zip
I have used Delphi 7, but it will compile easily with Delphi 5 or 6. If
you have any problems, just tell me. Also, if you need more comments in
the source code (I don't think you will, because the code is very easy
to read), just tell me.
Regarding the poor rating you mention, I expect something more than 2
stars!!!!! ;-)
Regarding any other features you want, you are free to post a question
for any researcher, but if you want me specifically, you can put "For
joseleon only" in the subject.
Also, if you find a bug, please don't hesitate to contact me.
Regards.
|
Clarification of Answer by
joseleon-ga
on
08 Sep 2003 07:28 PDT
Hello, bmcompany:
"I'm only joking about the 2 stars. :)"
I hope so ;-))
"1. The Save As doesn't work. The button does nothing."
Please make sure you are using the latest version I sent you; that
option was disabled in previous versions. Also, make sure the button
is enabled; that happens when you load a report into the tool.
"2. Copying from the results preview doesn't work. The Copy option
appears on right-click, but the data isn't copied to the clipboard."
That's really strange... Do you have a clipboard management utility or
something similar installed? I have just tested it and it works OK.
Please tell me the size (in bytes) of the HTMLParser.exe file you are
using.
Regards.
|