|
List Info
Thread: Finding the XPATH of a given node
|
|
| Finding the XPATH of a given node |

|
2006-12-13 21:21:41 |
I am trying to find the xpath to a given node. For example,
if I find a node by some text:
@nodes = findnodes("//*[contains(.,'Argentina')]");
$node = $nodes[0];
How can I get the full xpath from the root to $node?
My initial thought is... not good...
Walk the entire tree recursively from the root on down,
looking for a node whose text matches the search string. As
I walk the tree, keep track of the node names at each step.
When I find a matching node, cascade back up the tree
appending the node names. BUT this won't work when faced
with more than one child of the same type, i.e.
/html/body/a[1] and /html/body/a[2] will both be named
/html/body/a. And I can't even use the index of the child,
because [i] notation refers to the i-th child OF THE SAME
TYPE.
There's GOT to be an easier way to get the xpath to a node!?
__--=Peter Theobald=--__
www.PeterTheobald.com
_______________________________________________
Perl-XML mailing list
Perl-XML listserv.ActiveState.com
To unsubscribe: http://listserv.ActiveState.com/mailman/mysubs
|
|
| Finding the XPATH of a given node |

|
2006-12-13 22:51:47 |
On Wednesday 13 December 2006 22:21, Peter Theobald wrote:
> I am trying to find the xpath to a given node. For example, if I find a
> node by some text:
>
> @nodes = findnodes("//*[contains(.,'Argentina')]");
> $node = $nodes[0];
>
> How can I get the full xpath from the root to $node?
If you use XML::LibXML, then $node->nodePath should return just that.
Internally, it indeed crawls the tree from $node up to the root, at each
level counting the number of preceding siblings of the same type.
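A minimal sketch of the nodePath call described above, assuming XML::LibXML is installed. The sample document is hypothetical and only there for illustration. The search expression is also an assumption: it tests each element's own text nodes, because `//*[contains(.,'Argentina')]` also matches every ancestor of the element that holds the text, so its first match would be the root element.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample document, for illustration only.
my $xml = <<'XML';
<html>
  <body>
    <a>Brazil</a>
    <a>Argentina</a>
  </body>
</html>
XML

my $doc = XML::LibXML->load_xml(string => $xml);

# Select the element whose own text node contains the string,
# not every ancestor whose string value contains it.
my ($node) = $doc->findnodes(q{//*[text()[contains(., 'Argentina')]]});

# nodePath walks from the node up to the root and numbers
# same-named siblings where needed.
print $node->nodePath, "\n";    # prints /html/body/a[2]
```

The sample has no namespaces, so the path uses element names.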
> My initial thought is.... not good....
> Walk the entire tree recursively from the root on down looking for a node
> whose text matches the search string.
Why would you walk the tree down when you already have the node and can walk
upwards from it (and thus spare yourself recursing into subtrees of all
preceding nodes)?
-- Petr
|
|
| Finding the XPATH of a given node |

|
2006-12-13 22:57:13 |
* Peter Theobald <peter PeterTheobald.com> [2006-12-13 22:25]:
> I am trying to find the xpath to a given node. For example, if
> I find a node by some text:
>
> @nodes = findnodes("//*[contains(.,'Argentina')]");
> $node = $nodes[0];
>
> How can I get the full xpath from the root to $node?
1. Use XML::LibXML instead of whatever you’re using.
2. Call the `nodePath` method on the node you found.
> My initial thought is.... not good.... Walk the entire tree
> recursively from the root on down looking for a node whose text
> matches the search string.
That’s what the XPath matcher does.
Once you have the node, you can just do the opposite: walk from
the node up to the root, taking note of sibling positions as you
go up.
That’s what LibXML’s nodePath does.
Regards,
--
Aristotle Pagaltzis // <http://plasmasturm.org/>
|
|
| Finding the XPATH of a given node |

|
2006-12-14 00:14:23 |
I want to THANK everyone who led me to XML::LibXML's
$node->nodePath.
I was using XML::XPath, which doesn't have a method like
nodePath.
-Peter
|
|
| Finding the XPATH of a given node |

|
2006-12-14 00:36:56 |
>>>>> "Peter" == Peter Theobald <peter PeterTheobald.com> writes:
Peter> I want to THANK everyone who led me to XML::LibXML's $node->nodePath.
Peter> I was using XML::XPath, which doesn't have a method like nodePath.
That'll teach you to move from that old slow lib to
something modern.
--
Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095
<merlyn stonehenge.com> <URL:http://www.stonehenge.com/merlyn/>
Perl/Unix/security consulting, Technical writing, Comedy, etc. etc.
See PerlTraining.Stonehenge.com for onsite and open-enrollment Perl training!
|
|
| Finding the XPATH of a given node |

|
2006-12-14 06:29:54 |
Hmm... Apparently $node->nodePath does NOT count the
number of preceding siblings of the same type. It 'cheats'
by returning less-than-useful paths like this:
/*/*/*[3]/*/*/*/*/*/*/*[2]/*[2]/*/*/*/*[2]
At least it would have been nice (and easy) to use the node
name if it's the first/only child and 'cheat' when it wasn't,
like so:
/html/body/*[3]/table/tbody/tr/td/div/div/*[2]/*[2]/p/div/div/*[2]
And MUCH better if it actually counted the preceding
siblings of the same type in order to say things like td[3]
instead of *[3].
I am using this in a page-scraping application, so it does
make a difference. /*/*/*[2]/*/*[3]/*/*/* is VERY brittle
and susceptible to breaking with ANY change to the
underlying XML (HTML).
*sigh*
-Peter
|
|
| Finding the XPATH of a given node |

|
2006-12-14 08:29:00 |
Peter Theobald wrote:
> Hmm... Apparently $node->nodePath does NOT count the number of
> preceding siblings of the same type. It 'cheats' by returning
> less than useful paths like this:
> /*/*/*[3]/*/*/*/*/*/*/*[2]/*[2]/*/*/*/*[2]
>
> At least it would have been nice (and easy) to use the node name if
> it's the first/only child and 'cheat' when it wasn't, like so:
> /html/body/*[3]/table/tbody/tr/td/div/div/*[2]/*[2]/p/div/div/*[2]
Of course that's what XML::Twig's xpath method does, but if you want to
stick to a barebones module like XML::LibXML ;--) maybe you can adapt
the code:
sub xpath
  { my $elt= shift;
    my $xpath;
    # walk from the root down to the element itself
    foreach my $ancestor (reverse $elt->ancestors_or_self)
      { my $gi= $ancestor->gi;    # tag name
        $xpath.= "/$gi";
        # 1-based position among siblings with the same tag name
        my $index= $ancestor->prev_siblings( $gi) + 1;
        # omit the index when no other sibling has that tag name
        unless( ($index == 1) && !$ancestor->next_sibling( $gi))
          { $xpath.= "[$index]"; }
      }
    return $xpath;
  }
You might need to replace the gi, prev_siblings and next_sibling
methods with XML::LibXML equivalents though.
gi is the tag name, so it would be nodeName in XML::LibXML.
prev_siblings and next_sibling can be done using the
preceding-sibling and following-sibling axes in XPath.
You could also use getChildrenByTagName on the parent of the
ancestor to see if the element has siblings with the same tag name
(of course Robin would contend that you should use
getChildrenByTagNameNS, as "no one in their right mind should use
XML without namespaces" (his words, not mine)).
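The adaptation suggested above could look roughly like the following sketch. It is not code from the thread: `xpath_of` is a hypothetical name, the sample document is made up, and the sketch assumes a document without namespaces. It uses nodeName in place of gi and the preceding-sibling and following-sibling axes in place of prev_siblings and next_sibling.

```perl
use strict;
use warnings;
use XML::LibXML;

# Build an XPath for an element by walking from the node up to the root.
# A position predicate is added only when a same-named sibling exists.
# Assumes the document does not use namespaces.
sub xpath_of {
    my ($elt) = @_;
    my @steps;
    for (my $n = $elt;
         defined $n && $n->nodeType == XML_ELEMENT_NODE;
         $n = $n->parentNode) {
        my $name   = $n->nodeName;
        # Number of same-named siblings before and after this element.
        my $before = $n->findnodes("preceding-sibling::*[name() = '$name']")->size;
        my $after  = $n->findnodes("following-sibling::*[name() = '$name']")->size;
        my $step   = "/$name";
        $step .= '[' . ($before + 1) . ']' if $before || $after;
        unshift @steps, $step;    # ancestors end up in front
    }
    return join '', @steps;
}

# Hypothetical sample document, for illustration only.
my $doc = XML::LibXML->load_xml(string =>
    '<html><body><p>one</p><a>x</a><p>two <b>Argentina</b></p></body></html>');
my ($target) = $doc->findnodes(q{//b});

print xpath_of($target), "\n";    # prints /html/body/p[2]/b
```

The loop stops at the document node, so the root element becomes the first step of the path.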
I hope that helps.
--
mirod
|
|
| Finding the XPATH of a given node |

|
2006-12-14 09:10:36 |
On Thursday 14 December 2006 07:29, Peter Theobald wrote:
> Hmm... Apparently $node->nodePath does NOT count the number of preceding
> siblings of the same type. It 'cheats' by returning less than useful paths
> like this: /*/*/*[3]/*/*/*/*/*/*/*[2]/*[2]/*/*/*/*[2]
>
> At least it would have been nice (and easy) to use the node name if it's
> the first/only child and 'cheat' when it wasn't, like so:
> /html/body/*[3]/table/tbody/tr/td/div/div/*[2]/*[2]/p/div/div/*[2]
>
> And MUCH better if it actually counted the preceding siblings of the same
> type in order to say things like td[3] instead of *[3]
>
> I am using this in a page scraping application, so it does make a
> difference. /*/*/*[2]/*/*[3]/*/*/* is VERY brittle and susceptible to
> breaking with ANY change to the underlying XML (HTML).
Must be XHTML then. If you see /*/ in the output of nodePath, then
namespaces must be in use (and libxml2 doesn't know how to associate
them with namespace prefixes, since the association depends on the
XPath evaluation context; in particular this applies to a default
namespace).
I can offer you the (best-on-the-market) implementation I use in XSH2
(xsh.sf.net). See the attached demo script. If you provide an
association of namespace URIs and prefixes, then you get an xpath like
/x:html/x:body/x:p[3]/x:i[8].
If you don't, you get
/*[name()="html"]/*[name()="body"]/*[name()="p"][3]/*[name()="i"][8]
as a fallback. Of course, this would not be as fast as nodePath, which
is pure C, but it works reasonably well for me.
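The script attached to the original message is not part of this archive. The sketch below is an independent illustration of the same idea, not Petr's code: `ns_path` and `count_same` are hypothetical names and the sample document is made up. It assumes XML::LibXML with XML::LibXML::XPathContext. As a variation on the fallback form above, it tests local-name() and namespace-uri() instead of name(), so each predicate selects exactly the siblings that were counted.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# Count siblings in one direction that share the element's expanded name
# (local name plus namespace URI). $step is 'previousSibling' or 'nextSibling'.
sub count_same {
    my ($elt, $step) = @_;
    my $uri = $elt->namespaceURI;
    $uri = '' unless defined $uri;
    my $count = 0;
    for (my $s = $elt->$step; defined $s; $s = $s->$step) {
        next unless $s->nodeType == XML_ELEMENT_NODE;
        my $s_uri = $s->namespaceURI;
        $s_uri = '' unless defined $s_uri;
        $count++ if $s->localname eq $elt->localname && $s_uri eq $uri;
    }
    return $count;
}

# Build a path using the caller's URI-to-prefix map, with a
# prefix-free fallback for namespaces that have no mapping.
sub ns_path {
    my ($elt, $prefix_for) = @_;
    my @steps;
    for (my $n = $elt;
         defined $n && $n->nodeType == XML_ELEMENT_NODE;
         $n = $n->parentNode) {
        my $uri   = $n->namespaceURI;
        my $local = $n->localname;
        my $name;
        if (!defined $uri) {
            $name = $local;                                  # no namespace
        }
        elsif (defined $prefix_for->{$uri}) {
            $name = $prefix_for->{$uri} . ":$local";         # mapped prefix
        }
        else {
            $name = qq{*[local-name()="$local" and namespace-uri()="$uri"]};
        }
        my $before = count_same($n, 'previousSibling');
        my $after  = count_same($n, 'nextSibling');
        $name .= '[' . ($before + 1) . ']' if $before || $after;
        unshift @steps, "/$name";
    }
    return join '', @steps;
}

# Hypothetical XHTML sample, for illustration only.
my $doc = XML::LibXML->load_xml(string => <<'XML');
<html xmlns="http://www.w3.org/1999/xhtml">
  <body><p>Chile</p><p>Visit <i>Argentina</i></p></body>
</html>
XML

my $xhtml = 'http://www.w3.org/1999/xhtml';
my $xpc   = XML::LibXML::XPathContext->new($doc);
$xpc->registerNs(x => $xhtml);
my ($node) = $xpc->findnodes(q{//x:i[contains(., 'Argentina')]});

print $node->nodePath, "\n";                     # libxml2's own path
print ns_path($node, { $xhtml => 'x' }), "\n";   # /x:html/x:body/x:p[2]/x:i
print ns_path($node, {}), "\n";                  # fallback steps, no prefixes
```

A prefixed path only evaluates in a context where the same prefix is registered, as with `$xpc` here. The fallback form carries the namespace URI in each step, so it can be evaluated without any registration.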
The usage of the script is like this:
./node_path.pl index.html '//x:i[contains(.,"Argentina")]' x 'http://www.w3.org/1999/xhtml'
Further arguments may provide further prefix-to-URI mappings.
The code is divided into 3 functions: node_path computes the actual
path, _node_address is used to compute one step, and cannon_name
computes the name part of each step.
HTH,
-- Petr
|
|
|
|