Hello lookingforegex-ga,
I believe the regex you are looking for is the following one:
<table[^>]*>((?:\n|.)*?)</table>
The outer pair of parentheses captures the part you are looking for,
that is, everything between the opening and closing table tags. I am a
little rusty in C# so I *think* the code should look something like
this:
using System.Text.RegularExpressions;

string table = "<table height=100%><tr><td>hello</td></tr></table>";
Regex r = new Regex( @"<table[^>]*>((?:\n|.)*?)</table>" );
Match m = r.Match(table);
if (m.Success) {
    // Groups[1] holds the text captured by the outer parentheses
    string thestuffinthetable = m.Groups[1].Value;
}
If you need any further information on this subject please ask for
clarification and I will do my best to respond swiftly.
|
Clarification of Answer by palitoy-ga on 30 Sep 2004 10:51 PDT
I forgot to mention that from your question it sounds like you are
looking at m.Groups[0].Value rather than m.Groups[1].Value. [0]
contains the text of the entire match, whereas the higher indexes [x]
contain the text captured by each successive pair of parentheses.
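To make the difference between the groups concrete, here is a minimal sketch. The class and method names (GroupDemo, GroupText) are made up for illustration. It also shows one caveat: when a parenthesised group is itself repeated, as in (\n|.)*?, .NET reports only the last repetition in Groups[1].Value, so the repeated part is wrapped in an outer group here to capture all of it.

```csharp
using System;
using System.Text.RegularExpressions;

static class GroupDemo
{
    // Returns the text of the requested group, or null when nothing matches.
    // (Hypothetical helper, named for this example only.)
    public static string GroupText(string html, string pattern, int index)
    {
        Match m = Regex.Match(html, pattern);
        return m.Success ? m.Groups[index].Value : null;
    }

    static void Main()
    {
        string table = "<table height=100%><tr><td>hello</td></tr></table>";

        // Outer group wraps the repeated part, so group 1 is the full content.
        string wrapped = @"<table[^>]*>((?:\n|.)*?)</table>";
        Console.WriteLine(GroupText(table, wrapped, 0)); // the entire table
        Console.WriteLine(GroupText(table, wrapped, 1)); // <tr><td>hello</td></tr>

        // The group itself is repeated, so group 1 keeps only the last
        // character it matched before </table>.
        string repeated = @"<table[^>]*>(\n|.)*?</table>";
        Console.WriteLine(GroupText(table, repeated, 1)); // >
    }
}
```

An alternative with the same effect would be to pass RegexOptions.Singleline and use (.*?) as the group, since that option lets the dot match newlines as well.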
|
Request for Answer Clarification by lookingforegex-ga on 30 Sep 2004 12:36 PDT
Hi palitoy-ga,
since I'm using a config file in Visual Studio .Net, the pattern is
encoded and looks like this:
<?xml version="1.0" encoding="utf-8" ?>
<configuration>
  <GoogleNews>
    <add key="URL" value="http://news.google.de/news?ned=us&amp;topic=n" />
    <add key="IdentifyPattern" value="&lt;table border=0 width=75% valign=top[^&gt;]*&gt;(\n|.)*?&lt;/table&gt;" />
  </GoogleNews>
</configuration>
Looking at the Google News site
http://news.google.de/news?ned=us&topic=n, you will see that this
pattern retrieves the entire table that contains a news item.
Trying your suggestion @"<table[^>]*>(\n|.)*?</table>" , I noticed
that I can't use parentheses at all.
It would look like this in the xml file:
@"<table[^>]*>(\n|.)*?</table>"
Is there any other option, say if I wanted to retrieve only a
headline, but not an entire table? I would define a start string and
an end string, and the text in between would be returned.
Many thanks
|
Clarification of Answer by palitoy-ga on 30 Sep 2004 13:03 PDT
I am puzzled by your phrase that you "cannot use parentheses at all".
The parentheses are the part you require to be able to identify the
match. Can you please let me know what you mean by this? The @""
syntax (a verbatim string literal) lets you write the regular
expression in C# source without having to escape the backslashes. The
@ and the quotation marks are C# syntax and are not part of the
pattern itself.
The actual regular expression that should be used, with the outer
parentheses capturing everything between the tags, is:
<table[^>]*>((?:\n|.)*?)</table>
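In case the config file is the obstacle, here is a rough sketch of how the pattern could be stored in XML and read back. It assumes the value is XML-escaped by hand (&lt; and &gt; in place of the angle brackets) and parsed with XDocument. The config fragment, the class name and the ReadPattern helper are invented for this example, and your application may well read its settings through a different mechanism.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml.Linq;

static class ConfigPatternDemo
{
    // Hypothetical config fragment: the pattern is XML-escaped, and the
    // parentheses need no special treatment inside the attribute value.
    public const string ConfigXml =
        @"<configuration>
            <Sample>
              <add key=""IdentifyPattern""
                   value=""&lt;table[^&gt;]*&gt;((?:\n|.)*?)&lt;/table&gt;"" />
            </Sample>
          </configuration>";

    // Returns the value of the <add> element with the given key, or null.
    public static string ReadPattern(string xml, string key)
    {
        foreach (XElement add in XDocument.Parse(xml).Descendants("add"))
        {
            if ((string)add.Attribute("key") == key)
                return (string)add.Attribute("value");
        }
        return null;
    }

    static void Main()
    {
        // The XML parser turns &lt; and &gt; back into < and >.
        string pattern = ReadPattern(ConfigXml, "IdentifyPattern");
        Console.WriteLine(pattern); // <table[^>]*>((?:\n|.)*?)</table>

        Match m = Regex.Match("<table><tr><td>hello</td></tr></table>", pattern);
        Console.WriteLine(m.Groups[1].Value); // <tr><td>hello</td></tr>
    }
}
```

Note that the value in the file carries no @ prefix and no surrounding C# quotes; those only matter when the pattern is typed directly into source code.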
If you wish to retrieve only the headlines, the easiest way would be
to search for the <b>...</b> tags, as the headline is always between
these.
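As a sketch of that idea, the following collects the text of every <b>...</b> pair in a string. The sample HTML is made up, the names BoldTextDemo and BoldTexts are only for illustration, and it assumes the tags carry no attributes and are not nested.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class BoldTextDemo
{
    // Returns the text found between each <b> and </b> pair, in order.
    public static List<string> BoldTexts(string html)
    {
        var results = new List<string>();
        // Singleline lets the dot cross line breaks; IgnoreCase accepts <B>.
        RegexOptions options = RegexOptions.Singleline | RegexOptions.IgnoreCase;
        foreach (Match m in Regex.Matches(html, @"<b>(.*?)</b>", options))
        {
            results.Add(m.Groups[1].Value);
        }
        return results;
    }

    static void Main()
    {
        // Made-up sample markup, not taken from any real page.
        string html = "<p><b>First headline</b> text <B>Second headline</B></p>";
        foreach (string headline in BoldTexts(html))
        {
            Console.WriteLine(headline);
        }
    }
}
```

Regex.Matches returns every non-overlapping match, so one call is enough to walk through all the headlines on a page.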
What is the output of the script at the moment? Your question just
asked for everything between the <table>...</table> tags.
I am also uncertain as to the legality of your scraping the Google
News site. This may be something that Google would frown upon.
|
Request for Answer Clarification by lookingforegex-ga on 30 Sep 2004 13:44 PDT
Sorry if I didn't explain this clearly. Google News was just an
example to make sure it works across any pages.
Another example:
http://finance.yahoo.com/
There you can see the DOW index.
I only want the actual DOW value, e.g. 10,080.27, retrieved.
In the source code you find before the DOW value:
<td class="yfnc_mktsumtxt" nowrap>
and after the DOW value:
</td><td class="yfnc_mktsumtxt" nowrap>
Now what I need is:
class="yfnc_mktsumtxt" nowrap>MYREGEX</td><td class="yfnc_mktsumtxt" nowrap>
where MYREGEX returns only the number between these two strings.
Thank you!
|
Clarification of Answer by palitoy-ga on 01 Oct 2004 01:15 PDT
I have been advised by the Google Editors that the scraping of the
Google News site would "violate the News Terms of Service" and
therefore I should not answer questions relating directly to this
subject.
As to your further clarification, I have retested the regular
expression and it also works on the Yahoo page. Can you please post
the section of your C# code that you are using to retrieve the match?
When you create a Match object, several properties are available, one
of which is Groups, a GroupCollection object. This is essentially an
array of everything captured by the match. m.Groups[0].Value is the
entire regular expression match (which it sounds like you are looking
at); m.Groups[1].Value is the text captured by the first pair of
parentheses (which is what you require).
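Applied to a start string and an end string, the same idea might look like the sketch below. The Between helper and the class name are invented for this example, and the sample line is made up from the two delimiters and the value quoted in your message. Regex.Escape is used so that any special characters in the delimiters are matched literally.

```csharp
using System;
using System.Text.RegularExpressions;

static class BetweenDemo
{
    // Returns the text between the first occurrence of start and the next
    // occurrence of end, or null when the pair is not found.
    public static string Between(string text, string start, string end)
    {
        // Escape both delimiters, then capture lazily between them.
        string pattern = Regex.Escape(start) + "(.*?)" + Regex.Escape(end);
        Match m = Regex.Match(text, pattern, RegexOptions.Singleline);
        return m.Success ? m.Groups[1].Value : null;
    }

    static void Main()
    {
        string start = "<td class=\"yfnc_mktsumtxt\" nowrap>";
        string end = "</td><td class=\"yfnc_mktsumtxt\" nowrap>";

        // Made-up sample line built from the delimiters above.
        string html = start + "10,080.27" + end + "next cell</td>";

        Console.WriteLine(Between(html, start, end)); // 10,080.27
    }
}
```

Here m.Groups[0].Value would still contain both delimiters plus the number, which is why the helper returns group 1 instead.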
I am also still unsure as to what you meant when you said "I can't use
parentheses at all". If this is the case then you will NEVER be able
to get the exact match you require, as the parentheses direct the
program to the sub-match that you need. Can you please clarify what
you meant by this statement?
I look forward to hearing the answers to my two questions so that I
can help you further.
|