Thread: PHP preg_match not working like test environments

PHP preg_match not working like test environments | United States | 2007-03-07 16:19:03
I'm using the preg_match function in PHP and I want to grab a link tag
from a page, plus the preceding and following text, to establish a context.
If the link is inside a set of <div> tags I only want to grab the text
within that div; if there are no <div>s I want to grab the text all the
way to the <body> tags.

To give an example, if the html is:
<html>
<head>
</head>
<body>
aaa aaa aaa
<div>
bbb bbb bbb
<a href='http://www.domain.com/index.html'>link text</a>
ccc ccc ccc
</div>
ddd ddd ddd
</body>
</html>
I want to match three groups
1: bbb bbb bbb
2: <a href='http://www.domain.com/index.html'>link text</a>
3: ccc ccc ccc

But on the other hand, if the divs weren't there and the html was
<html>
<head>
</head>
<body>
aaa aaa aaa
bbb bbb bbb
<a href='http://www.domain.com/index.html'>link text</a>
ccc ccc ccc
ddd ddd ddd
</body>
</html>
I'd want to match
1: aaa aaa aaa bbb bbb bbb
2: <a href='http://www.domain.com/index.html'>link text</a>
3: ccc ccc ccc ddd ddd ddd

The expression I'm working with is

#.*<(?:div|body).*?>(.*?)(<a\s[^>]*?href\s*?=\s*?["']{0,1}http://www.domain.com/index.html['"]{0,1}.*?>.*?</a>)(.*?)</(?:div|body)#i

Which is nearly there, because it works as expected in the Rad Software
Regular Expression Designer (http://www.radsoftware.com.au/regexdesigner/)
and in the similar Expresso tool (http://www.ultrapico.com/ExpressoBeta.htm),
but it returns no matches when I use it in PHP.

I guess this means something is not implemented the same way in PHP,
but what? Does anyone have a workaround to get the expression working
in PHP?
--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the
Google Groups "Regex" group.
To post to this group, send email to regex googlegroups.com
To unsubscribe from this group, send email to
regex-unsubscribe googlegroups.com
For more options, visit this group at http://groups.google.com/group/regex
-~----------~----~----~----~------~----~------~--~---

Re: PHP preg_match not working like test environments | United States | 2007-03-08 17:34:06
On Mar 8, 6:19 am, "Neff" <fgpsm... gmail.com> wrote:
> I guess this means something is not implemented the same way in PHP
> but what? Does anyone have a workaround to get the expression working
> in PHP?

Yes, you missed the 's' modifier, which allows the dot in your pattern
to match newlines, so the complete pattern/modifier should be:

#your_pattern#is

Regards,
Xicheng
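
For readers unfamiliar with the 's' modifier (PCRE_DOTALL), here is a minimal sketch of the difference it makes. The sample string and variable names are made up for illustration and do not come from the thread.

```php
<?php
// Hypothetical input: a div whose content spans several lines.
$html = "<div>\nbbb\n<a href='x'>link</a>\nccc\n</div>";

// Without the s modifier, "." does not match a newline,
// so "(.*?)" cannot get from <div> to </div>.
$withoutS = preg_match('#<div>(.*?)</div>#i', $html, $m1);

// With the s modifier, "." matches newlines as well.
$withS = preg_match('#<div>(.*?)</div>#is', $html, $m2);

var_dump($withoutS); // int(0)
var_dump($withS);    // int(1)
```

If the input really contains no line breaks, as in the case described later in the thread, adding 's' changes nothing, which is consistent with what the original poster reports.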

Re: PHP preg_match not working like test environments | United States | 2007-03-08 19:07:48
On Mar 8, 7:09 pm, "Neff" <fgpsm... gmail.com> wrote:
> Sorry, I'm afraid that's not it. I only wrote out the html like that
> for clarity in this posting. In reality the html I am processing has
> had all carriage returns and line feeds removed and all multiple
> spaces collapsed to single spaces.
>
> Just in case, I tried it anyway, and it didn't make any difference.
>

OK, I did not check your pattern carefully the first time, but it
actually works in my PHP file. So what error message do you get on
screen? What does your preg_match() line look like? Have you quoted the
pattern and escaped the corresponding quotation marks it contains?
For example:

preg_match("#...[\"']{0,1}...#i", $html, $match);

or try the following preg_match() expression:

preg_match("#.*<(?:div|body).*?>(.*?)(<a\s[^>]*?href\s*=\s*[\"']?http://www.domain.com/index.html['\"]?.*?</a>)(.*?)</(?:div|body)#i",$html,$match);

Regards,
Xicheng
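
The quoting advice above can be sketched as follows. This is only an illustration under assumptions: the shortened pattern, the sample HTML and the variable names are invented here, and the error check uses the standard preg_last_error() function rather than anything shown in the thread.

```php
<?php
// Three ways to write the same pattern as a PHP string literal.

// 1. Double-quoted: the inner " must be escaped as \";
//    backslashes are doubled here to be explicit.
$patternA = "#href\\s*=\\s*[\"']?([^\"'>\\s]+)#i";

// 2. Single-quoted: only the inner ' needs escaping.
$patternB = '#href\s*=\s*["\']?([^"\'>\s]+)#i';

// 3. Nowdoc (PHP 5.3+): no escaping at the PHP level.
$patternC = <<<'RE'
#href\s*=\s*["']?([^"'>\s]+)#i
RE;

// Hypothetical input, already collapsed to a single line.
$html = '<div> bbb <a href="http://www.domain.com/index.html">link text</a> ccc </div>';

$result = preg_match($patternB, $html, $match);

if ($result === false) {
    // false means the regex engine failed; 0 would mean "no match".
    echo 'PCRE error code: ', preg_last_error(), "\n";
} elseif ($result === 0) {
    echo "no match\n";
} else {
    echo $match[1], "\n"; // http://www.domain.com/index.html
}
```

Checking for false separately from 0 matters because preg_match() returns 0 for a clean "no match" and false when the engine gives up, for example on a backtracking limit, and the two are easy to confuse in a loose comparison.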

Re: PHP preg_match not working like test environments | United States | 2007-03-08 18:09:27
Sorry, I'm afraid that's not it. I only wrote out the html like that
for clarity in this posting. In reality the html I am processing has
had all carriage returns and line feeds removed and all multiple
spaces collapsed to single spaces.

Just in case, I tried it anyway, and it didn't make any difference.

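
The post does not show how the page was cleaned up, so the following is only a guess at what that preprocessing might look like. The function name normalizeWhitespace is hypothetical.

```php
<?php
// Hypothetical helper: the actual cleanup code is not shown in the thread.
function normalizeWhitespace($html)
{
    // Remove carriage returns and line feeds outright.
    $html = str_replace(array("\r", "\n"), '', $html);
    // Collapse every run of two or more spaces to a single space.
    return preg_replace('/ {2,}/', ' ', $html);
}

$clean = normalizeWhitespace("<div>\r\n  bbb   bbb\n</div>");
echo $clean, "\n"; // <div> bbb bbb</div>
```

With input prepared this way there are no newlines left for the dot to trip over, which fits the report that the 's' modifier made no difference.
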
On Mar 8, 11:34 pm, "Xicheng Jia" <xich... gmail.com> wrote:
>
> Yes, you missed the 's' modifier, which allows the dot in your pattern
> to match newlines, so the complete pattern/modifier should be:
>
> #your_pattern#is
>
> Regards,
> Xicheng

Re: PHP preg_match not working like test environments | United States | 2007-03-09 18:28:50
I did a little reading, a little bit of rethinking and re-did my
expression and I've got one that's working. The PHP code to create the
expression is as follows...

$hrefpattern = "#(?:<(?:body|div|td|p)[^>]*>)((?:.(?!</?(?:body|div|td|p)))*)"
    . "(<a\s[^>]*?href\s*?=\s*?[\"']{0,1}"
    . preg_quote($targetURL)
    . "[\"']{0,1}.*?>.*?</a>)"
    . "(.*?)</(?:body|div|td|p)>#is";

It finds me the link tag, plus the preceding and following text inside
whichever is the innermost of the body, div, table cell or paragraph tags.
Well, it does on my tests and on the half dozen "in the wild" pages I've
tried it on; I'm always prepared to be proved wrong.
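
As a rough check of the expression above, here is a sketch that wraps it in a function and runs it against the two sample pages from the first post, collapsed to single lines. Several things are assumptions: the function name buildHrefPattern is hypothetical, the \s and \" escapes are written out as they presumably appeared before the archive stripped the backslashes, and the sample link uses index.html so that it agrees with the URL in the pattern. One deliberate change is that '#' is passed as the second argument to preg_quote(), so that a '#' inside the URL cannot end the pattern early on older PHP versions.

```php
<?php
// Hypothetical wrapper around the pattern from the post.
function buildHrefPattern($targetURL)
{
    $block = '(?:body|div|td|p)';
    return '#(?:<' . $block . '[^>]*>)'
        // Text before the link, never stepping over another block tag.
        . '((?:.(?!</?' . $block . '))*)'
        // The link itself, with the URL quoted for use inside #...#.
        . '(<a\s[^>]*?href\s*?=\s*?["\']{0,1}'
        . preg_quote($targetURL, '#')
        . '["\']{0,1}.*?>.*?</a>)'
        // Text after the link, up to the first closing block tag.
        . '(.*?)</' . $block . '>#is';
}

$url = 'http://www.domain.com/index.html';
$link = "<a href='http://www.domain.com/index.html'>link text</a>";

$withDiv = "<html> <head> </head> <body> aaa aaa aaa <div> bbb bbb bbb "
    . $link . " ccc ccc ccc </div> ddd ddd ddd </body> </html>";
$noDiv = "<html> <head> </head> <body> aaa aaa aaa bbb bbb bbb "
    . $link . " ccc ccc ccc ddd ddd ddd </body> </html>";

preg_match(buildHrefPattern($url), $withDiv, $a);
preg_match(buildHrefPattern($url), $noDiv, $b);

echo trim($a[1]), ' | ', trim($a[3]), "\n"; // bbb bbb bbb | ccc ccc ccc
echo trim($b[1]), ' | ', trim($b[3]), "\n"; // aaa aaa aaa bbb bbb bbb | ccc ccc ccc ddd ddd ddd
```

The negative lookahead in the first group is what keeps the match from starting at <body> when a <div> sits between it and the link. Note that the third group has no such guard, so nested block tags after the link may still be captured up to the first closing tag.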