
Thread: Capturing last two matches




2006-09-26 14:38:49
donohoegmail.com wrote:
> Hello,
>
> I'm new, so please be patient. I'm using PHP to read in HTML from a DB.
> The HTML itself starts and ends with the BODY tag. There is a mix of
> HTML tags within that.
>
> What I'm trying to do is find a way to only grab the last two
> paragraphs within the BODY tags. The text itself may have line breaks.
> A (basic) sample looks like this:
>
> <BODY><p>This is paragraph one. The quick brown fox jumped over the
> burning fence</p>
> <p>This is the second<p><p>This is the third paragraph. Talk about
> repetition</p><p>This is the second last paragraph</p>
> <p>This is the last paragraph</p>
> </BODY>
>
> So ideally I want to get (with or without a line-break):
>
> <p>This is the second last paragraph</p>
> <p>This is the last paragraph</p>
>
> Whether other tags (apart from the <p> tags) within the results are
> preserved or stripped is not a concern for me. I just need to have the
> text.
>
> This should be easy (I hope) but I can't figure it out.

You can use a regex like the following, which keeps the last two <p>
elements, plus any other content between them and before the
</body> tag:

    /(?:<p>(?:(?!<\/?p>).)*?<\/p>(?:(?!<\/?p>).)*?){2}(?=<\/body>)/si

(Two modifiers are on: 's' for dot-matches-newline and 'i' for
case-insensitive matching.)

The basic structure is:

    (?:<p>.*?<\/p>.*?){2}(?=<\/body>)

Add a negative lookahead in front of the dot:

    (?!<\/?p>).

Then the new construct

    (?:(?!<\/?p>).)*?       (originally .*?)

makes sure that no other <p> or </p> appears within a <p> element, and
that the two matched <p> elements are separated from each other and
from the </body> tag only by content containing no <p> or </p> tag.
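
As a rough illustration of the construct described above, here is a minimal PHP sketch. It assumes the group is repeated with a {2} quantifier to pick up two paragraphs, uses '#' as the pattern delimiter so that '/' needs no escaping, and runs on a shortened version of the sample HTML. The variable names are made up for the example.

```php
<?php
// Shortened version of the sample from the thread, including the
// unclosed <p> tags in the middle.
$html = "<BODY><p>This is paragraph one.</p>\n"
      . "<p>This is the second<p><p>This is the third paragraph.</p>"
      . "<p>This is the second last paragraph</p>\n"
      . "<p>This is the last paragraph</p>\n</BODY>";

// (?:(?!</?p>).)*? consumes one character at a time, but only while the
// next characters are not <p> or </p>. {2} repeats the whole
// "<p>...</p> plus trailing non-paragraph text" group twice, and the
// final lookahead requires </body> to follow immediately.
$pattern = '#(?:<p>(?:(?!</?p>).)*?</p>(?:(?!</?p>).)*?){2}(?=</body>)#si';

// preg_match() is enough here: at most one position can satisfy the pattern.
$lastTwo = preg_match($pattern, $html, $m) ? trim($m[0]) : null;
echo $lastTwo, "\n";
```

The lookahead runs at every character position, which is why this form can become slow on large inputs.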

BTW, the (?!<\/?p>). construct is a rather inefficient way to do this,
as you may find. I guess you can find cleaner/faster ways to do the
same with some PHP functions. I am not very familiar with those,
though.

Good luck,
Xicheng


--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the
Google Groups "Regex" group.
To post to this group, send email to regexgooglegroups.com
To unsubscribe from this group, send email to
regex-unsubscribegooglegroups.com
For more options, visit this group at http://groups.google.com/group/regex
-~----------~----~----~----~------~----~------~--~---

2006-09-26 17:38:24
Thanks very much! However, it doesn't seem to work in PHP. In fact, it
chokes the web server.

I'm using this:

    preg_match_all("/(?:<p>(?:(?!<\/?p>).)*?<\/p>(?:(?!<\/?p>).)*?){2}(?=<\/body>)/si",
        $file, $match);

I guess if I could get all unique occurrences of each opening and
closing paragraph (<p>) into an array, I could just look at the last
two results.

-Michael


2006-09-27 00:01:57
michael wrote:
> Thanks very much! However it doesn't seem to work in PHP. In fact it
> chokes the web server.
>
> I'm using this:
> preg_match_all("/(?:<p>(?:(?!<\/?p>).)*?<\/p>(?:(?!<\/?p>).)*?){2}(?=<\/body>)/si",
> $file, $match);
>
>

This seems to work fine on your test string:

php -r '
    $str = "<BODY><p>This is paragraph one. The quick brown fox jumped "
         . "over the burning fence</p><p>This is the second<p><p>This is "
         . "the third paragraph. Talk about repetition</p><p>This is the "
         . "second last paragraph</p> <p>This is the last paragraph</p> "
         . "</BODY>";
    preg_match_all("/(?:<p>(?:(?!<\/?p>).)*?<\/p>(?:(?!<\/?p>).)*?){2}(?=<\/body>)/si",
        $str, $match);
    print_r($match);
'
Array
(
    [0] => Array
        (
            [0] => <p>This is the second last paragraph</p> <p>This is the last paragraph</p>
        )

)

But I guess it's just too slow for a large test string, where there
are too many negative-lookahead tests.

> I guess if I could get all unique occurrences of each opening and
> closing paragraph (<p>) into an array I could just look at the last two
> results.

Yes, I think that would be a better way. If your HTML is well-formed,
with no nested <p> elements, and you only want to fetch the contents
of these elements, then the approach you describe should be much
easier and faster:

    $n = preg_match_all("/<p>.*?<\/p>/si", $str, $match);

Then print out $match[0][$n-1] and $match[0][$n-2].
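
A minimal sketch of that approach, wrapped in a helper so that inputs with fewer than two paragraphs do not produce an undefined-index notice. The function name last_paragraphs and the $count parameter are made up for this example, and '#' is used as the delimiter to avoid escaping '/'.

```php
<?php
// Hypothetical helper: returns up to the last $count <p>...</p> blocks,
// in document order. Assumes paragraphs are not nested.
function last_paragraphs(string $html, int $count = 2): array
{
    // Lazy .*? stops at the first </p> after each <p>; the 's' modifier
    // lets the dot cross line breaks, 'i' ignores tag case.
    preg_match_all('#<p>.*?</p>#si', $html, $matches);

    // A negative offset takes items from the end; if there are fewer
    // than $count matches, array_slice() simply returns all of them.
    return array_slice($matches[0], -$count);
}

$html = "<BODY><p>one</p><p>two<p><p>three</p><p>second last</p>\n"
      . "<p>last</p>\n</BODY>";
print_r(last_paragraphs($html));
```

Note that an unclosed <p> is swallowed into the following match (here "<p>two<p><p>three</p>"), which is harmless when only the last two well-formed paragraphs are needed.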

Also, I guess, some modules might do your job more robustly.
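
One such option, as a sketch only: PHP's bundled DOM extension can parse the HTML and hand back the paragraph elements directly. This assumes the DOM extension is enabled; the helper name last_paragraph_texts is made up for the example, and it returns the text without the <p> tags.

```php
<?php
// Hypothetical helper: returns the text of up to the last $count <p>
// elements, using the DOM extension instead of a regex.
function last_paragraph_texts(string $html, int $count = 2): array
{
    $doc = new DOMDocument();

    // Collect parser warnings internally instead of emitting them;
    // real-world HTML is rarely valid.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $texts = [];
    foreach ($doc->getElementsByTagName('p') as $p) {
        $texts[] = trim($p->textContent);
    }
    return array_slice($texts, -$count);
}

print_r(last_paragraph_texts('<body><p>one</p><p>two</p><p>three</p></body>'));
```

With malformed input such as unclosed <p> tags, the parser repairs the markup in its own way, so the resulting paragraph boundaries may differ from what the regex approach finds.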

Regards,
Xicheng


2006-09-27 16:17:57
That works perfectly the way you have it!

Thank you for taking the time out to respond with helpful suggestions
each time. You've saved me a lot of time and effort.

Regards,
Michael

