Q: A Program to Analyse a HTML File (Answered, 5 out of 5 stars, 1 Comment)
Question  
Subject: A Program to Analyse a HTML File
Category: Computers > Programming
Asked by: bmcompany-ga
List Price: $200.00
Posted: 05 Sep 2003 03:00 PDT
Expires: 05 Oct 2003 03:00 PDT
Question ID: 252523
We require a program to analyse the content of an HTML file and
perform simple calculations based on figures within the file.

Once the program is written, we require the uncompiled source code for
future use.

One of our company's tasks is to generate 'Position Reports' for
clients' websites on the major search engines. We use TopDog
(http://www.topdogsoftware.net) to generate these reports.

Once the report has been run, the program outputs a report detailing
the position at which the site was found for certain keywords on certain
search engines.

This report is then sent to the client. Unfortunately, the program only
generates a list of positions and doesn't give any further details, such
as the total number of positions, the number of top 10 positions, top 20
positions, number 1 positions, etc.

Some of these reports can be 5MB in size and contain 5,000 positions or
more. Counting these positions by hand can take a while :)

We require a program that will analyse the HTML file, extract the
relevant figures and calculate:

Total number of positions
Total number 1 positions
Total number of top 10 positions
Total number of top 20 positions

That’s the easy bit.

The optimisation company generates optimised pages that use the
following file naming format.

Conveyor_Belt_Parts.htm
Conveyor_Belt_Repairs.htm
Conveyor_Belt_Machines.htm
Conveyor_Belt_Importers.htm

When calculating the number of positions achieved, we can only count
positions attained by these optimised pages.

In the position reports, the 'Status' column includes the page that
was found.

For example:

Keyword				Position			Status

Conveyor Belt Parts		10				http://www.domain.com/Conveyor_Belt_Parts.htm
Conveyor Belt Repairs		12				http://www.domain.com

Only positions whose status includes the name of an optimised page can
be counted. NOTE: the name of the page won't necessarily correspond with
the position.

Ideally, when entering the path of the file to be analysed, you would
browse to a directory containing the optimised pages; the program would
create a list of all the file names in that directory and then use it
to ascertain whether or not each position should be counted.
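
Purely as an illustration of the rule just described, here is a minimal Python sketch (hypothetical names; the tool discussed later in this thread was actually written in Delphi) of building the page list from a directory and deciding whether a position counts. A later exchange in the thread refines the check to an exact file-name match.

import os

def load_optimised_pages(pages_dir):
    """List the optimised page file names found in the chosen directory."""
    return [name for name in os.listdir(pages_dir)
            if name.lower().endswith((".htm", ".html"))]

def is_countable(status_url, optimised_pages):
    """Direct reading of the spec: count a position only if its status URL
    contains the name of one of the optimised pages."""
    return any(page.lower() in status_url.lower() for page in optimised_pages)

# Hypothetical usage with the naming scheme from the question:
pages = ["Conveyor_Belt_Parts.htm", "Conveyor_Belt_Repairs.htm"]
print(is_countable("http://www.domain.com/Conveyor_Belt_Parts.htm", pages))  # True
print(is_countable("http://www.domain.com", pages))                          # False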

I have uploaded a few reports here: http://www.sesuk.net/examples.zip.
The structure of the reports remains fairly constant; however, there are
one or two changes you should be aware of. For example, when running a
report for the second or third time, the 'Position' column shows any
change from the last report with a + or -.

Keyword				Position			Status

Conveyor Belt Parts		10(+3)				http://www.domain.com/Conveyor_Belt_Parts.htm

Secondly, the Overture search engine returns a status of
http://www.domain.com when in fact it should read domain/page_name.htm.
We are not concerned with this. Please only count the positions where
the status contains the page name; ignoring Overture positions is fine.
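
As a small illustrative sketch (Python, hypothetical helper name), the '(+3)'/'(-2)' change suffix can simply be stripped when reading the rank:

import re

def parse_rank(position_cell):
    """Extract the numeric rank from a Position cell such as '10' or '10(+3)'.
    The '(+3)'/'(-2)' suffix only records the change since the previous report,
    so only the leading number matters. Returns None if no rank is present."""
    match = re.match(r"\s*(\d+)", position_cell)
    return int(match.group(1)) if match else None

print(parse_rank("10"))      # 10
print(parse_rank("10(+3)"))  # 10
print(parse_rank("-"))       # None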

We would like to add more functionality in the future, such as the
ability to analyse several reports at once; however, I will post a
separate question once we have tested this first version.

I would like a clarification of the question before the program is
attempted, as I want to make sure that this spec has been fully
understood first.

Request for Question Clarification by joseleon-ga on 05 Sep 2003 12:12 PDT
Hello, bmcompany:

  I have read your question and I think it's fairly easy to develop such
software. This is the list of features I gather from what you say:
  -You need to generate a new report based on an input HTML source and
add some information to make it easy to read and more meaningful
  -The final report will show this additional information:
       Total number of positions 
       Total number 1 positions 
       Total number of top 10 positions 
       Total number of top 20 positions 
  -The positions to be counted are those that point to specific pages
  -The list of pages to be counted will be extracted from a directory
the user can select
  -Some positions can be relative to positions on previous reports
  -I have checked the examples you provide and I think it won't be a
problem
  
I will use Delphi 7 to develop the software, and I will test it on
every Windows version. If you want, I can develop a cross-platform
version that will work on both Windows and Linux; that's not a problem.

You will own the copyrights of the source code.

Don't hesitate to request more features; I will add everything you need
until you get the software you want.

I can start developing right now, and I will provide a daily progress
report.

Regards.

Clarification of Question by bmcompany-ga on 05 Sep 2003 12:51 PDT
joseleon-ga, thanks for the quick response.

You seem to have understood the brief perfectly, and I don't think
anything has been missed out.

Just to let you know that in our company, we use Windows 2000. There
is no need for it to be cross platform.

Just one thing I forgot to mention: the columns in the report read
rank, page, position.

The only figure we are interested in is rank. The second and third
columns refer to the results page number and the position relative to
that page.

The overall position is the 'rank'. Sorry if I'm stating the obvious,
but I just wanted to be as clear as possible.

I'll be on hand if you need any further clarification.

Best of luck. Oh, and if this could be done, say, before Tuesday, we'll
award a $50 tip for your extra effort.

Clarification of Question by bmcompany-ga on 05 Sep 2003 13:28 PDT
Hi again, I've just re-read your question and spotted something.

The report generated by this software should be a separate report, not
added to the original file. The position report is sent to the client
in its original form, i.e. with no summary figures.

Thanks again

Request for Question Clarification by joseleon-ga on 05 Sep 2003 15:23 PDT
Hello, 

  Sure, no problem, that's what I thought. I expect to have something
for you to see on Sunday.

Regards.
Answer  
Subject: Re: A Program to Analyse a HTML File
Answered By: joseleon-ga on 07 Sep 2003 01:40 PDT
Rated: 5 out of 5 stars
 
Hello, bmcompany:

  I have just placed a preliminary version on this location:
  
  http://www.xpde.com/HTMLParser.zip
  
  It's just to check whether I'm on the right path: you can open any
report and press the Play button to get some results parsed. It still
doesn't ask for a list of optimized pages; this will be done once you
press the Play button. Please tell me if I'm on the right path and
whether you like the interface. I will finish it on Monday so we can
talk about it and tune it to get a final version on Tuesday.
  
Regards.

Request for Answer Clarification by bmcompany-ga on 07 Sep 2003 03:16 PDT
Hi there.

Absolutely brilliant - works great - and very speedy too.

I've tested it with the biggest report I can find, 4 MB and 2,500
positions, and it worked perfectly.

Once you have the positions analysed with only the correct pages
counted, you are there.

This may be a bit late to add to the spec, but is there any chance that
you could produce an extra set of figures: positions for all pages and
positions for just the pages in the directory?

If that is a quick job then that would make for a very useful
additional feature.

And for your next trick (obviously this will be posted as a separate
question and priced accordingly): can you think of a way to analyse a
directory with, say, 50 reports and produce a summary for each report?
This will be tricky when you have to select the pages. Just something
to think about if you're interested in taking this further.

But it's looking very good indeed.

Thanks again for your time

Clarification of Answer by joseleon-ga on 08 Sep 2003 00:38 PDT
Hello, bmcompany:
 I have some questions, just to finish the next version:
 
 -Can you send me your biggest file, so I can test the performance in
the worst case?
 
 -Can you send me a report and a list of optimized pages that match
that report?
 
 -Do you want exclusive or inclusive counting? That is:
   -If I find a #1 position, right now, this position counts towards all of:
     Number ones
     Top 10
     Top 20
   
   -Do you want it to work this way, or do you want number 1 positions to
count only as number 1 positions and NOT as Top 10 and Top 20?
   
 -Can you send me an HTML template (just a basic design) to put
results inside? If not, don't worry, I will design a basic one
 
 -What would be the default filename to save the results page? i.e. If
the source input file is example1.htm, then, the results can be
results_example1.htm
 
 -Regarding the last requested feature, do you want to add an extra
set? That is:
  First set (optimized pages)
  -Total number of positions
  -Total number 1 positions
  -Total Top 10 positions
  -Total Top 20 positions
  
  Second set (all pages)
  -Total number of positions
  -Total number 1 positions
  -Total Top 10 positions
  -Total Top 20 positions  
  
  Is this what you want?
  
 -Regarding the summary feature you want, no problem; once I finish the
process for a single file, it's just a matter of automating it for
several files.
 
Regards.

Request for Answer Clarification by bmcompany-ga on 08 Sep 2003 02:36 PDT
-Can you send me your biggest file, so I can test the performance in
the worst case?

http://www.sesuk.net/big.zip  (link will work in half an hour)
  
 -Can you send me a report and a list of optimized pages that match
that report?

http://www.sesuk.net/pages0001.zip  (link will work in half an hour)
  
 -Do you want exclusive or inclusive counting? That is:
   -If I find a #1 position, right now, this position counts towards all of:
     Number ones
     Top 10
     Top 20

   -Do you want it to work this way, or do you want number 1 positions to
count only as number 1 positions and NOT as Top 10 and Top 20?

For a #1 position, it needs to be counted as a #1, a top 10 and a top
20.
For a #7 position, it needs to be counted as a top 10 and top 20. Hope
that’s clear.
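
That is, the counting is inclusive: every qualifying position adds to the total, ranks 1-20 also add to the top-20 figure, ranks 1-10 to the top-10 figure, and rank 1 additionally to the number-1 figure. A minimal sketch of that tallying rule (Python, hypothetical names, for illustration only):

def tally(ranks):
    """Inclusive counting: a #1 also counts towards top 10 and top 20."""
    counts = {"total": 0, "number_1": 0, "top_10": 0, "top_20": 0}
    for rank in ranks:
        counts["total"] += 1
        if rank == 1:
            counts["number_1"] += 1
        if rank <= 10:
            counts["top_10"] += 1
        if rank <= 20:
            counts["top_20"] += 1
    return counts

print(tally([1, 7, 15, 42]))
# {'total': 4, 'number_1': 1, 'top_10': 2, 'top_20': 3}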
    
 -Can you send me an HTML template (just a basic design) to put
results inside? If not, don't worry, I will design a basic one

Basic one is fine. This report is only for us to pull positions out
of, not to send to clients.
  
 -What would be the default filename to save the results page? i.e. If
the source input file is example1.htm, then, the results can be
results_example1.htm
 
Is there any chance that the report could be named based on the first
URL in the report? The report header includes the URL under "Performed
for: www.domain.com". If there is more than one domain, just the first
one is fine.

Additionally, could you insert either a random or incremental figure at
the END of the filename?

Hope all that's ok.
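
One possible way to build such a name (an illustrative Python sketch with hypothetical names; it uses a readable timestamp as the suffix, which is what the thread later settles on):

import re
from datetime import datetime

def results_filename(report_html):
    """Name the results file after the first 'Performed for:' domain in the
    report header, plus a readable timestamp so repeated runs never collide."""
    match = re.search(r"Performed for:\s*([\w.-]+)", report_html)
    domain = match.group(1) if match else "report"
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    return "results_{}_{}.htm".format(domain, stamp)

print(results_filename("<p>Performed for: www.domain.com</p>"))
# e.g. results_www.domain.com_20030908_103000.htm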

 -Regarding the last requested feature, do you want to add an extra
set? That is:
  First set (optimized pages) 
  -Total number of positions 
  -Total number 1 positions 
  -Total Top 10 positions 
  -Total Top 20 positions 
   
  Second set (all pages) 
  -Total number of positions 
  -Total number 1 positions 
  -Total Top 10 positions 
  -Total Top 20 positions   
   
  Is this what you want? 

That's exactly correct!
   
 -Regarding the summary feature you want, no problem; once I finish the
process for a single file, it's just a matter of automating it for
several files.

Excellent, we'll chat about this once this question is closed.

Thank you again for your time and attention to this project. We all
look forward to working further with you in the very near future.

Request for Answer Clarification by bmcompany-ga on 08 Sep 2003 02:44 PDT
Regarding my previous post: the biggest report I can find is at

www.sesuk.net/big.zip

I don't have the pages for that one, but the other file,
http://www.sesuk.net/pages0001.zip, has a 1.5MB report with the pages.
The report is called report_here.htm.

Thanks again

Clarification of Answer by joseleon-ga on 08 Sep 2003 02:48 PDT
Hello bmcompany, 

I have just uploaded an updated version at the same time you were
answering my clarification; here it is:
   
   http://www.xpde.com/HTMLParser2.zip
   
I'm going to answer your last clarification:   

"http://www.sesuk.net/big.zip  (link will work in half an hour)"
Great.
   
"http://www.sesuk.net/pages0001.zip  (link will work in half an hour)"
Great too.
   
"For a #1 position, it needs to be counted as a #1, a top 10 and a top
20. For a #7 position, it needs to be counted as a top 10 and top 20.
Hope
that’s clear."
Perfectly clear and this is the way it works now.

"Basic one is fine. This report is only for us to pull positions out
of, not to send to clients."
OK, this is already included in the last update; please check it out.
In any case, I'm going to externalize the template so you can change it
at any time if needed. Don't worry, it doesn't take much time.
   
"Is there any chance that the report could be named based on the first
URL in the report? The report header includes the URL under "Performed
for: www.domain.com" If there are more than one domain, just the first
one is fine."
I will check it out; in the last update it is named results_XXXXX.htm,
and I will change it.
 
"Additionally could you insert either a random or incremental figure
at
the END of the filename?"
Would a timestamp be OK for you? Visible or not visible? Readable, or a
Unix-style timestamp?
 
"Thats exactly correct!"
OK, already included in the last update; please check it out.
    
"Excellent, we'll chat about this once this question is closed. 
Thank you again for your time and attention to this project. We all
look forward to working further with you in the very near future."
Thanks, you are a very kind customer ;-)

Regards.

Request for Answer Clarification by bmcompany-ga on 08 Sep 2003 03:19 PDT
A visible timestamp is fine.

I've tested the latest version and it's 100% perfect! All we wanted and
more.

If you could sort out naming the report after the domain, that would be
a great bonus, but if it's going to take a while, I understand; you
have worked very hard on this already, and we already have more than we
asked for.

A million thanks again.

Request for Answer Clarification by bmcompany-ga on 08 Sep 2003 03:38 PDT
We've just spotted something.

Can you investigate this for me?

http://www.sesuk.net/possible_bug.zip

The total results read LESS than the optimised results. I thought at
first that the results were just the wrong way round, i.e.
optimised/total, but it works fine with the other reports.

Let me know what you think

Clarification of Answer by joseleon-ga on 08 Sep 2003 04:06 PDT
Hello, bmcompany:

  The problem was that optimized pages are named like this:

  London_Connaught.htm
  The_London_Connaught.htm
  
  So when I was looking at a URL that pointed to
The_London_Connaught.htm, it was counted twice because it also matched
London_Connaught.htm. It's fixed now; download it from:
  
  http://www.xpde.com/HTMLParser3.zip
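
In other words, the double count came from substring matching: 'London_Connaught.htm' is contained in 'The_London_Connaught.htm'. A minimal sketch of the safer check (Python, hypothetical names; the real fix was of course made in the Delphi source) is to compare the whole file name taken from the URL:

from urllib.parse import urlparse
import posixpath

def matched_page(status_url, optimised_pages):
    """Return the optimised page the status URL points to, or None.
    Comparing the complete file name (rather than looking for a substring)
    means a URL ending in The_London_Connaught.htm matches only that page,
    never London_Connaught.htm as well, so nothing is counted twice."""
    name = posixpath.basename(urlparse(status_url).path).lower()
    pages = {page.lower() for page in optimised_pages}
    return name if name in pages else None

pages = ["London_Connaught.htm", "The_London_Connaught.htm"]
print(matched_page("http://www.domain.com/The_London_Connaught.htm", pages))
# the_london_connaught.htm -- exactly one match, counted once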
  
  I will fix the rest of the things this afternoon. Also, I would like
you to do the following test:
  
  -Get a small report, but one that features most of the forms a report
can take
  -Then calculate *by hand* what the results should be
  -Then use the software and see if the results match
  
  If they don't match, send the report to me to check; this is the only
way you can be sure the software works as it should. Maybe I'm
forgetting to count some kind of link.
  
Regards.

Request for Answer Clarification by bmcompany-ga on 08 Sep 2003 05:06 PDT
http://www.sesuk.net/checker1.zip

It is slightly out with this one: some positions aren't being counted.
I think it's because a keyword has 2 positions, e.g. listed at #1 AND
#7. Have a look and let me know what you think.

Clarification of Answer by joseleon-ga on 08 Sep 2003 06:40 PDT
Hello, bmcompany:

  I have fixed the problems; it was just a matter of adding more types
of lines to parse. Feel free to download it from here. I have also
included all the features we were talking about (templates, timestamp,
report name, etc.):

  http://www.xpde.com/HTMLParser4.zip

As soon as you tell me everything is OK, I will post the source code.

Regards.

Request for Answer Clarification by bmcompany-ga on 08 Sep 2003 06:55 PDT
Ok, all is good!

Please post the source code and I'll accept the answer.

I think this answer easily deserves 2 stars!

Request for Answer Clarification by bmcompany-ga on 08 Sep 2003 07:09 PDT
I'm only joking about the 2 stars :)

2 small problems to report.

1. The Save As doesn't work; the button does nothing.
2. Copying from the results preview doesn't work. The option to copy
appears on right-click, but the data isn't copied to the clipboard.

Clarification of Answer by joseleon-ga on 08 Sep 2003 07:11 PDT
Hello, bmcompany:

You can download the source code from here:

http://www.xpde.com/HTML_Parser.zip

I have used Delphi 7, but it will compile easily with Delphi 5 or 6. If
you have any problems, just tell me. Also, if you need more comments in
the source code (I don't think you will, because the code is very easy
to read), just tell me.

Regarding the poor rating you mention, I expect something more than 2
stars!!!!! ;-)

Regarding any other feature you want, you are free to post a question
for any researcher, but if you are interested in me, you can put 'For
joseleon only' in the subject.

Also, if you find a bug, please don't hesitate to contact me.

Regards.

Clarification of Answer by joseleon-ga on 08 Sep 2003 07:28 PDT
Hello, bmcompany:

"Im only joking about the 2 stars:)"
I hope so ;-))
 
"1. the save as doesnt work. The button does nothing."
Please make sure you are using the last version I sent you; that option
was disabled in previous versions. Also, make sure the button graphic
is enabled, which happens when you load a report into the tool.

"2. Copying from the results preview doesnt work. The option to copy
appears from the right-click, but the data isnt copied to the
clipboard."
That's really strange... do you have a clipboard management utility or
something similar? I have just tested it and it works OK. Please tell
me the size (in bytes) of the HTMLParser.exe file you are using.

Regards.
bmcompany-ga rated this answer: 5 out of 5 stars and gave an additional tip of $60.00
joseleon-ga has done a fantastic job with this program. Replies to
clarification requests were speedy, friendly and helpful. He knows his
subject well, and the program was finished in a VERY short amount of
time. We are very pleased and would happily use him/her again. In fact,
joseleon-ga, please head over to Question ID: 253458.

Comments  
Subject: Re: A Program to Analyse a HTML File
From: dungga-ga on 06 Sep 2003 11:45 PDT
 
Another alternative is to create an HTML parser using JavaCC (it takes
a few minutes to generate a parser).

Here is the grammar for HTML:
http://www.cobase.cs.ucla.edu/pub/javacc/html-3.2.jjt

Once the code is generated, you can further customize it for your
particular problem.

Thanks
