Q: Need a Regular Expression Pattern for .Net ( Answered 5 out of 5 stars,   0 Comments )
Question  
Subject: Need a Regular Expression Pattern for .Net
Category: Computers > Programming
Asked by: lookingforegex-ga
List Price: $10.00
Posted: 30 Sep 2004 08:41 PDT
Expires: 30 Oct 2004 08:41 PDT
Question ID: 408383
Hi everyone,
in my C# .Net application, I'm using Regular Expressions (RegEx) to
scrape parts of the HTML of certain pages.

This is what the RegEx currently looks like:
[^>]*>(\n|.)*? 

Using this pattern, I can scrape a part of my HTML, including the
characters that I use as the start string and end string.

<table>[^>]*>(\n|.)*?</table> returns me all content including the table tags.

Now my question:

Can you find me a pattern which only retrieves the part between
<table> and </table>? My aim is to use a RegEx that returns only the
part BETWEEN two positions, regardless of how long my strings might be.
So, the pattern shall start retrieving right AFTER my start position
and end retrieving right BEFORE my ending position.

Thank you!
Answer  
Subject: Re: Need a Regular Expression Pattern for .Net
Answered By: palitoy-ga on 30 Sep 2004 10:22 PDT
Rated:5 out of 5 stars
 
Hello lookingforegex-ga

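I believe the regex you are looking for is the following one:

<table[^>]*>((?:\n|.)*?)</table>

The section enclosed by the outer parentheses contains the part you are
looking for.  I am a little rusty in C# so I *think* the code should
look something like this:

using System.Text.RegularExpressions;

string table = "<table height=100%><tr><td>hello</td></tr></table>";
// The outer group captures everything between the tags; the inner
// (?:\n|.) only repeats, since a repeated capturing group would keep
// just its last character.
Regex r = new Regex( @"<table[^>]*>((?:\n|.)*?)</table>" );

Match m = r.Match(table);

if (m.Success) {
  // Groups[0] is the whole match; Groups[1] is the text between the tags.
  string thestuffinthetable = m.Groups[1].Value;
}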

If you need any further information on this subject please ask for
clarification and I will do my best to respond swiftly.

Clarification of Answer by palitoy-ga on 30 Sep 2004 10:51 PDT
I forgot to mention that from your question it sounds like you are
looking at m.Groups[0].Value rather than m.Groups[1].Value. [0]
contains the text of the entire match, whereas the higher indices [x]
contain the text captured by each set of parentheses, in order.
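
The difference between the two indices can be seen in a small console program. This is only a sketch, reusing the sample string from the answer; the class name is made up for illustration. One thing worth checking in your own code: in .NET a capturing group that is itself repeated, such as (\n|.)*?, reports only its last iteration in Groups[1].Value, so the sketch wraps the repetition in an outer group and makes the inner one non-capturing.

```csharp
using System;
using System.Text.RegularExpressions;

class GroupsDemo
{
    static void Main()
    {
        string html = "<table height=100%><tr><td>hello</td></tr></table>";

        // Outer parentheses form group 1; (?:...) repeats without capturing.
        Match m = Regex.Match(html, @"<table[^>]*>((?:\n|.)*?)</table>");

        if (m.Success)
        {
            Console.WriteLine(m.Groups[0].Value); // whole match, tags included
            Console.WriteLine(m.Groups[1].Value); // <tr><td>hello</td></tr>
        }
    }
}
```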

Request for Answer Clarification by lookingforegex-ga on 30 Sep 2004 12:36 PDT
Hi palitoy-ga,

since I'm using a config file in Visual Studio .Net, the pattern is
encoded and looks like this:

<?xml version="1.0" encoding="utf-8" ?> 
<configuration>	
<GoogleNews>
<add key="URL" value="http://news.google.de/news?ned=us&amp;topic=n" />
<add key="IdentifyPattern" value="&lt;table border=0 width=75%
valign=top[^&gt;]*&gt;(\n|.)*?&lt;/table&gt;" />
</GoogleNews>
</configuration>
	

Looking at the Google News site
http://news.google.de/news?ned=us&topic=n, you will see that this
pattern retrieves the entire table that contains a news item.

Trying your suggestion  @"<table[^>]*>(\n|.)*?</table>" , I noticed
that I can't use parentheses at all.
It would look like this in the xml file:
@&quot;&lt;table[^&gt;]*&gt;(\n|.)*?&lt;/table&gt;&quot;

Is there any other option, say if I wanted to retrieve only a headline,
but not an entire table? I would define a start string and an end
string, and the text between them would be returned.

Great thanks

Clarification of Answer by palitoy-ga on 30 Sep 2004 13:03 PDT
I am puzzled by your statement that you "cannot use parentheses at all".
The parentheses are the part you require to be able to identify the
match.  Can you please let me know what you mean by this?  The @""
syntax is a C# verbatim string literal: it lets you write the
backslashes in a regular expression without doubling them, and it is
part of the C# source rather than of the pattern itself.
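
Because the pattern is stored in an XML config file, only the pattern text belongs in the attribute value; the @"" wrapper stays in the C# source. The following is a minimal sketch, assuming the value is read with XmlDocument (the thread does not show how the application actually loads its config). The class name and the config fragment are hypothetical. The XML parser decodes &lt; and &gt; before the regex engine ever sees the pattern, and parentheses need no XML escaping.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml;

class ConfigPatternDemo
{
    static void Main()
    {
        // Hypothetical config fragment: no @"" around the value, and the
        // parentheses are written literally.
        string xml =
            @"<add key=""IdentifyPattern"" " +
            @"value=""&lt;table[^&gt;]*&gt;((?:\n|.)*?)&lt;/table&gt;"" />";

        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xml);

        // The parser has already turned &lt; and &gt; back into < and >.
        string pattern = doc.DocumentElement.GetAttribute("value");
        Console.WriteLine(pattern); // <table[^>]*>((?:\n|.)*?)</table>

        Match m = Regex.Match("<table border=0><tr><td>hi</td></tr></table>", pattern);
        if (m.Success)
        {
            Console.WriteLine(m.Groups[1].Value); // <tr><td>hi</td></tr>
        }
    }
}
```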

The actual regular expression that should be used is:
<table[^>]*>((?:\n|.)*?)</table>
The outer parentheses capture everything between the tags.

If you wish to retrieve only the headlines, the easiest way to
achieve this would be to search for <b>...</b> tags, as the headline
is always between these.

What is the output of the script at the moment?  Your question just
asked for everything between the <table>...</table> tags.

I am also uncertain as to the legality of you scraping the Google News
site.  This may be something that Google would frown upon.

Request for Answer Clarification by lookingforegex-ga on 30 Sep 2004 13:44 PDT
Sorry, if I didn't explain correctly. Google News was just an example
to make sure it works across any pages.

Another example:
http://finance.yahoo.com/

There you can see the DOW index.

I only want the actual DOW value, e.g. 10,080.27, retrieved.

In source code you find before the DOW value:
<td class="yfnc_mktsumtxt" nowrap>

and after the DOW value:
</td><td class="yfnc_mktsumtxt" nowrap>

Now what I need is:
class="yfnc_mktsumtxt" nowrap>MYREGEX</td><td class="yfnc_mktsumtxt" nowrap>

where MYREGEX returns only the number between these two strings.

Thank you!

Clarification of Answer by palitoy-ga on 01 Oct 2004 01:15 PDT
I have been advised by the Google Editors that the scraping of the
Google News site would "violate the News Terms of Service" and
therefore I should not answer questions relating directly to this
subject.

As to your further clarification I have retested the regular
expression and it also works on the Yahoo page.  Can you please post
the section of your C# code that you are using to retrieve the match?

When you create a match object, several properties are available,
one of which is 'Groups', a 'GroupCollection' object.  This is
essentially an array of the groups captured by the match.
m.Groups[0].Value is the entire regular expression match (which it
sounds like you are looking at), m.Groups[1].Value is the first group
in parentheses (which is what you require).
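
Applied to the Yahoo example from the question, the idea might look like the sketch below. The helper name Between and the class name are made up for illustration, and the HTML string is just the fragment quoted earlier in the thread, not the live page. Regex.Escape is used so that the start and end strings are matched literally, whatever characters they contain.

```csharp
using System;
using System.Text.RegularExpressions;

class BetweenDemo
{
    // Hypothetical helper: returns the text between two literal markers, or null.
    static string Between(string input, string start, string end)
    {
        // Regex.Escape makes the markers match literally;
        // Singleline lets '.' match newlines as well.
        string pattern = Regex.Escape(start) + "(.*?)" + Regex.Escape(end);
        Match m = Regex.Match(input, pattern, RegexOptions.Singleline);
        return m.Success ? m.Groups[1].Value : null;
    }

    static void Main()
    {
        string html = "<td class=\"yfnc_mktsumtxt\" nowrap>10,080.27"
                    + "</td><td class=\"yfnc_mktsumtxt\" nowrap>";

        string value = Between(
            html,
            "class=\"yfnc_mktsumtxt\" nowrap>",
            "</td><td class=\"yfnc_mktsumtxt\" nowrap>");

        Console.WriteLine(value); // 10,080.27
    }
}
```

Only Groups[1] is returned, so the start and end strings never appear in the result, however long they are.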

I am also still unsure as to what you meant when you said "I can't use
parentheses at all".  If this is the case then you will NEVER be able
to get the exact match you require, as the parentheses direct the
program to the sub-match that you need.  Can you please clarify what
you meant by this statement?
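
If the difficulty is that the calling code can only read the whole match (m.Value, i.e. Groups[0]), one possible alternative is to put the start and end strings into lookbehind and lookahead assertions, which .NET supports. They still use parentheses in the pattern, but they check the surrounding text without including it in the match. A sketch under that assumption, with an illustrative class name:

```csharp
using System;
using System.Text.RegularExpressions;

class LookaroundDemo
{
    static void Main()
    {
        string html = "<table height=100%><tr><td>hello</td></tr></table>";

        // (?<=...) must match just before the result, (?=...) just after it;
        // neither becomes part of the match itself.
        Match m = Regex.Match(html, @"(?<=<table[^>]*>)(?:\n|.)*?(?=</table>)");

        if (m.Success)
        {
            Console.WriteLine(m.Value); // <tr><td>hello</td></tr>
        }
    }
}
```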

I look forward to hearing the answers to my two questions so that I
can help you further.
lookingforegex-ga rated this answer:5 out of 5 stars
Thank you! It works fine for me now.

Comments  
There are no comments at this time.
