Thread: Matching html tags




2005-12-30 11:15:11
Hi Darsin,
If we try it this way
<([^>]+)>([^<]+)</\1>
this will only work for tags without parameters, like <ul>something</ul>, and will not work for tags with parameters, like <a href=something>anything</a>, because the first group then captures "a href=something", which never appears in the closing tag.

The idea is to repeat, with the backreference \1, what was matched as the opening tag ($1).
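
A minimal sketch of that behaviour, written in Python only so it can be run quickly (the original question targets Visual Basic, so the names `SIMPLE_TAG` and `match_simple` are illustrative assumptions, not part of either message):

```python
import re

# Simple pattern from the message; \1 repeats whatever group 1 captured.
SIMPLE_TAG = re.compile(r"<([^>]+)>([^<]+)</\1>")


def match_simple(html):
    """Return (tag, text) for the first simple tag pair, or None."""
    m = SIMPLE_TAG.search(html)
    return (m.group(1), m.group(2)) if m else None


# A tag without parameters matches, because the closing tag repeats "ul".
print(match_simple("<ul>something</ul>"))

# A tag with parameters does not: group 1 holds "a href=something",
# and the closing tag is only "</a>".
print(match_simple("<a href=something>anything</a>"))
```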

So let's enhance it so it matches tags with parameters:
<(\w+)(\s+[^>]+)*>([^<]+)</\1>
  \_/  \______/    \___/   |
   |      |          |     `-- first
  $1      $2        $3         match
 base   params     text        repeated
part of of the    between
opening   tag      tags
  tag
 
I have not tested it live, so there might be some flaws/typos.
But I hope you get the idea.
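
A quick way to try the enhanced pattern, sketched in Python rather than Visual Basic (an assumption made for easy testing). `TAG_WITH_PARAMS` is the pattern from this message. `TAG_WITH_NESTING` is a hypothetical variation that is not part of the message: it swaps `[^<]+` for a lazy `.*?` so that nested tags are allowed between the opening and closing tag.

```python
import re

# Enhanced pattern from the message: $1 = tag name, $2 = parameters, $3 = text.
TAG_WITH_PARAMS = re.compile(r"<(\w+)(\s+[^>]+)*>([^<]+)</\1>")

# Assumed variation (not from the message): allow nested tags in the body
# by matching lazily up to the first closing tag with the same name.
TAG_WITH_NESTING = re.compile(r"<(\w+)(\s+[^>]+)*>(.*?)</\1>", re.DOTALL)


def find_tags(pattern, html):
    """Return the full text of every non-overlapping match."""
    return [m.group(0) for m in pattern.finditer(html)]


sample = "<P>this is a test html</P><P class=left>saumitra</P>"
nested = "<ul><li>saumitra</li><li>chaturvedi</li></ul>"

print(find_tags(TAG_WITH_PARAMS, sample))   # one match per <P> element
print(find_tags(TAG_WITH_PARAMS, nested))   # only the two inner <li> elements
print(find_tags(TAG_WITH_NESTING, nested))  # the whole <ul>...</ul> as one match
```

The lazy variation stops at the first closing tag with the same name, so it would still cut short when a tag is nested inside another tag of the same name (a <ul> inside a <ul>, for example). A regex alone cannot count nesting depth, so that case probably needs an HTML parser.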


On 12/23/05, Darsin wrote:
>
> Hi all
> I am trying to build a regex which can get an html tag from given html
> text, e.g. if given:
>
> <P>this is a test html</P><P
> class=left>saumitra</P><P class=center>chaturvedi</P><P
> class=center>darpan sinha</P>
>
> then I can get the above 4 tags as four separate matches as shown below:
> <P>this is a test html</P>
> <P class=left>saumitra</P>
> <P class=center>chaturvedi</P>
> <P class=center>darpan sinha</P>
>
> I have built the below string to be used in Visual Basic:
> <[a-z][a-z]*\d?\s*(class=[a-z]*)*>[^<]+</[a-z][a-z]*\d?>
> Here \d? is used to allow an optional digit (e.g. in
> case an H1, H2, etc. is used)
> The pattern works fine if the first closing tag after an opening tag is
> the same, i.e. <p>some text</p>. But it fails if the tag is something like:
> <ul> <li>saumitra</li><li>chaturvedi</li></ul>
> The results I get are two matches, <li>saumitra</li> and
> <li>chaturvedi</li>, whereas I want one match.
> I would like to match the starting tag with the ending tag and pick up
> all the text (html or non-html) in between, including the tags as well.
> Thus I should get all the text between <ul> and </ul> as one match. In
> case I get all its children as the second and third match then I won't
> mind at all.
> Any help in this regard would be appreciated.
>
>


--
best regards, Eugeny