Google Answers: Need a Regular Expression Pattern for .Net

Q: Need a Regular Expression Pattern for .Net (Answered, 5 out of 5 stars, 0 Comments)

Question

Subject: Need a Regular Expression Pattern for .Net
Category: Computers > Programming
Asked by: lookingforegex-ga
List Price: $10.00

Posted: 30 Sep 2004 08:41 PDT
Expires: 30 Oct 2004 08:41 PDT
Question ID: 408383

Hi everyone,
in my C# .Net application, I'm using Regular Expressions (RegEx) to
scrape parts of the HTML of certain pages.

This is what the RegEx currently looks like:
[^>]*>(\n|.)*? 

Using this pattern, I can scrape a part of my HTML, but the result
includes the start string and end string that I put around the pattern.

For example, <table>[^>]*>(\n|.)*?</table> returns all the content, including the table tags.

Now my question:

Can you find me a pattern which only retrieves the part between
<table> and </table>? My aim is to use a RegEx that returns only the
part BETWEEN two positions, regardless of how long my strings might be.
So, the pattern should start retrieving right AFTER my start position
and stop retrieving right BEFORE my end position.

Thank you!

Answer

Subject: Re: Need a Regular Expression Pattern for .Net
Answered By: palitoy-ga on 30 Sep 2004 10:22 PDT
Rated: 5 out of 5 stars

Hello lookingforegex-ga,

I believe the regex you are looking for is the following one:

    <table[^>]*>((\n|.)*?)</table>

The section enclosed by the outer parentheses contains the part you are looking for.

I am a little rusty in C#, so I think the code should look something like this:

    using System.Text.RegularExpressions;

    string table = "<table height=100%><tr><td>hello</td></tr></table>";
    // Group 1 (the outer parentheses) captures everything between the tags.
    Regex r = new Regex(@"<table[^>]*>((\n|.)*?)</table>");
    Match m = r.Match(table);
    if (m.Success)
    {
        string thestuffinthetable = m.Groups[1].Value;
    }

If you need any further information on this subject, please ask for clarification and I will do my best to respond swiftly.
Clarification of Answer by palitoy-ga on 30 Sep 2004 10:51 PDT

I forgot to mention that, from your question, it sounds like you are looking at m.Groups[0].Value rather than m.Groups[1].Value. [0] contains the text of the entire match, whereas the higher indexes [1], [2], ... contain the sub-matches captured by each pair of parentheses, in order.
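The difference between the two indexes can be seen in a short sketch. The class and method names (GroupDemo, WholeAndInner, RepeatedGroupValue) are made up for illustration. The sketch also shows one .NET detail worth knowing: a capturing group that itself repeats, such as (\n|.)*?, keeps only its last iteration in .Value, so the first method wraps the repetition in an outer group.

```csharp
using System.Text.RegularExpressions;

public static class GroupDemo
{
    // Returns { whole match, contents of group 1 }, or null when nothing matches.
    public static string[] WholeAndInner(string html)
    {
        // The outer parentheses are group 1; the inner (?:...) does not capture.
        Match m = Regex.Match(html, @"<table[^>]*>((?:\n|.)*?)</table>");
        if (!m.Success) return null;
        return new string[] { m.Groups[0].Value, m.Groups[1].Value };
    }

    // Here group 1 is the repeated group itself, so .Value holds only
    // the single character matched by its final iteration.
    public static string RepeatedGroupValue(string html)
    {
        Match m = Regex.Match(html, @"<table[^>]*>(\n|.)*?</table>");
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

For the sample string from the answer, WholeAndInner returns the full <table ...>...</table> text at index 0 and <tr><td>hello</td></tr> at index 1.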
Request for Answer Clarification by lookingforegex-ga on 30 Sep 2004 12:36 PDT

Hi palitoy-ga,

since I'm using a config file in Visual Studio .Net, the pattern is encoded and looks like this:

    <?xml version="1.0" encoding="utf-8" ?>
    <configuration>
      <GoogleNews>
        <add key="URL" value="http://news.google.de/news?ned=us&amp;topic=n" />
        <add key="IdentifyPattern" value="&lt;table border=0 width=75% valign=top[^&gt;]*&gt;(\n|.)*?&lt;/table&gt;" />
      </GoogleNews>
    </configuration>

Looking at the Google News site http://news.google.de/news?ned=us&topic=n, you will see that this pattern retrieves the entire table that contains a news item.

Trying your suggestion @"<table[^>]*>((\n|.)*?)</table>", I noticed that I can't use parentheses at all. It would look like this in the XML file:

    @"&lt;table[^&gt;]*&gt;((\n|.)*?)&lt;/table&gt;"

Is there any other option, say, if I only wanted to retrieve a headline rather than an entire table? I would define a start string and an end string, and the text in between would be returned.

Many thanks
Clarification of Answer by palitoy-ga on 30 Sep 2004 13:03 PDT

I am puzzled by your statement that you "cannot use parentheses at all"; the parentheses are the part you need in order to identify the sub-match. Can you please let me know what you mean by this?

The @"" syntax is a C# verbatim string literal: it lets you write the regular expression without doubling its backslashes for the compiler. The actual regular expression that should be used is:

    <table[^>]*>((\n|.)*?)</table>

If you only wish to retrieve the headlines, the easiest way to achieve this would be to search for <b>...</b> tags, as the headline is always between these.

What is the output of the script at the moment? Your question just asked for everything between the <table>...</table> tags.

I am also uncertain as to the legality of you scraping the Google News site. This may be something that Google would frown upon.
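On the config-file concern raised in the thread, the following is a minimal sketch of how a pattern stored in XML reaches the regex engine. It assumes the file is read with an XML parser (System.Xml here). The config fragment, the key lookup and the names ConfigPatternDemo, LoadPattern and FirstGroup are hypothetical, not taken from the asker's application. In XML only <, > and & need entity encoding; parentheses, *, ? and | are ordinary characters, and the parser hands back the decoded text.

```csharp
using System.Text.RegularExpressions;
using System.Xml;

public static class ConfigPatternDemo
{
    // Hypothetical config fragment. The regex is stored entity-encoded,
    // as it would have to be inside an XML attribute value.
    public const string SampleConfig =
        "<configuration><Patterns>" +
        "<add key=\"IdentifyPattern\" " +
        "value=\"&lt;b[^&gt;]*&gt;((?:\\n|.)*?)&lt;/b&gt;\" />" +
        "</Patterns></configuration>";

    // Returns the pattern as the regex engine should see it, or null if absent.
    public static string LoadPattern(string configXml)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(configXml);
        XmlNode node = doc.SelectSingleNode("//add[@key='IdentifyPattern']");
        if (node == null) return null;
        // The XML parser has already turned &lt; and &gt; back into < and >.
        return node.Attributes["value"].Value;
    }

    // Applies the configured pattern and returns capture group 1, or null.
    public static string FirstGroup(string configXml, string html)
    {
        Match m = Regex.Match(html, LoadPattern(configXml));
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

With the sample fragment, LoadPattern yields <b[^>]*>((?:\n|.)*?)</b>, so the parentheses survive the round trip and group 1 is available as usual.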
Request for Answer Clarification by lookingforegex-ga on 30 Sep 2004 13:44 PDT

Sorry if I didn't explain it correctly. Google News was just an example; I want to make sure it works across any pages.

Another example: http://finance.yahoo.com/

There you can see the DOW index. I only want the actual DOW value, e.g. 10,080.27, retrieved. In the source code you find before the DOW value:

    <td class="yfnc_mktsumtxt" nowrap>

and after the DOW value:

    </td><td class="yfnc_mktsumtxt" nowrap>

Now what I need is:

    class="yfnc_mktsumtxt" nowrap>MYREGEX</td><td class="yfnc_mktsumtxt" nowrap>

where MYREGEX returns only the number between these two strings.

Thank you!
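The start-string/end-string requirement described above can be sketched as a small helper. This is only an illustration: the names BetweenDemo and Between are invented here, and the approach assumes both delimiters are meant as literal text, which is why they are passed through Regex.Escape before being joined around a single capturing group.

```csharp
using System.Text.RegularExpressions;

public static class BetweenDemo
{
    // Returns the text between the first 'start' and the next 'end',
    // or null when the delimiters are not found in that order.
    public static string Between(string input, string start, string end)
    {
        // Regex.Escape makes the delimiters literal;
        // RegexOptions.Singleline lets '.' match '\n' as well.
        string pattern = Regex.Escape(start) + "(.*?)" + Regex.Escape(end);
        Match m = Regex.Match(input, pattern, RegexOptions.Singleline);
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

Called with the two delimiter strings quoted in the question and a fragment containing 10,080.27 between them, Between returns just the number.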
Clarification of Answer by palitoy-ga on 01 Oct 2004 01:15 PDT

I have been advised by the Google Editors that the scraping of the Google News site would "violate the News Terms of Service" and therefore I should not answer questions relating directly to this subject.

As to your further clarification, I have retested the regular expression and it also works on the Yahoo page. Can you please post the section of your C# code that you are using to retrieve the match?

When you create a Match object, several properties are available, one of which is Groups, a GroupCollection. This is essentially an array of all the groups that matched. m.Groups[0].Value is the entire regular expression match (which it sounds like you are looking at); m.Groups[1].Value is the first match in parentheses (which is what you require).

I am also still unsure as to what you meant when you said "I can't use parentheses at all". If this is the case, then you will NEVER be able to get the exact match you require, as the parentheses direct the program to the sub-match that you need. Can you please clarify what you meant by this statement?

I look forward to hearing the answers to my two questions so that I can help you further.
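If the calling code can only read the whole match (m.Groups[0].Value or m.Value) and cannot be changed to read group 1, one alternative that is not proposed in the thread is to put the delimiters into lookaround assertions, so that they are checked but not included in the match. The sketch below assumes literal delimiters; the names LookaroundDemo, BuildPattern and WholeMatch are hypothetical.

```csharp
using System.Text.RegularExpressions;

public static class LookaroundDemo
{
    // Builds a pattern whose entire match is the text between the delimiters.
    public static string BuildPattern(string start, string end)
    {
        // (?s) is the inline form of RegexOptions.Singleline.
        // (?<=...) must be preceded by 'start'; (?=...) must be followed by 'end'.
        return "(?s)(?<=" + Regex.Escape(start) + ").*?(?=" + Regex.Escape(end) + ")";
    }

    // Returns the whole match (group 0), or null when there is none.
    public static string WholeMatch(string input, string start, string end)
    {
        Match m = Regex.Match(input, BuildPattern(start, end));
        return m.Success ? m.Value : null;
    }
}
```

For <table> and </table> this builds (?s)(?<=<table>).*?(?=</table>), a single string that could be stored in a config file (entity-encoded if the file is XML) without any change to the code that reads the match.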

lookingforegex-ga rated this answer: 5 out of 5 stars

Thank you! It works fine for me now.

Comments

There are no comments at this time.
