Thread: ufXtract - new microformats parser

ufXtract - new microformats parser | United Kingdom | 2007-11-25 12:09:42
Hi All
I have been working hard on a new microformats parser (ufXtract) to help
explore the real-world issues of creating portable social networks.
Although I have previously designed a number of spiders that can find the
most common hCard and XFN structures, this is my first full-blown parser.
It has been built from the ground up to take configuration objects which
allow the parsing of different microformats or POSH patterns. It was
important that I could parse more general patterns, such as the joint
hCard-XFN pattern being promoted for use with friends lists.
http://lab.backnetwork.com/ufXtract/
After some further testing I am going to start producing a number of
portable social network demos and posts. This should also provide others
with experimental APIs. By sharing this early work I hope in some way to
add to the important technical and architectural discussions that are
taking place.
I have already added hCard-XFN, rel="me", rel="next" and hAtom to the
parser. These are the four cornerstone microformats/patterns required to
gather profile and content from other social networks. For technical/speed
reasons, however, ufXtract currently parses only the hEntry sub-element of
hAtom.
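
As a rough illustration of the rel="me" and rel="next" patterns mentioned above, here is a minimal Python sketch that collects link targets by rel token. It is not ufXtract's code (ufXtract is a C# component); the names RelLinkCollector and collect_rel_links are hypothetical, and it relies only on Python's standard html.parser module.

```python
from html.parser import HTMLParser


class RelLinkCollector(HTMLParser):
    """Collect href values of <a> and <link> elements, grouped by rel token."""

    def __init__(self, wanted=("me", "next")):
        super().__init__()
        self.wanted = set(wanted)
        self.found = {rel: [] for rel in wanted}

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attr = dict(attrs)
        href = attr.get("href")
        if not href:
            return
        # rel holds a space-separated list of tokens; compare them
        # case-insensitively and as whole tokens ("met" is not "me").
        tokens = (attr.get("rel") or "").lower().split()
        for token in tokens:
            if token in self.wanted:
                self.found[token].append(href)


def collect_rel_links(html):
    """Return a dict mapping each wanted rel token to the hrefs that carry it."""
    collector = RelLinkCollector()
    collector.feed(html)
    return collector.found
```

Matching whole tokens matters here because XFN values such as "met" or "friend" often share a rel attribute with "me".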
The component also has extensible output options. So far I have built a
simple text format for debugging, plus JSON and XML for building services.
For the more technically minded, ufXtract is a .NET component written in
C#. It uses a combination of DOM structures and XPath expressions. It can
typically parse a page in 50-200 ms.
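
One way to picture pluggable output options like the text, JSON and XML formats described above is a small registry of formatter functions. The Python sketch below is only an illustration under that assumption, not ufXtract's implementation; FORMATTERS, render and the item layout are hypothetical, and the XML formatter assumes every key is a valid XML element name.

```python
import json
import xml.etree.ElementTree as ET


def to_text(items):
    """Plain, human-readable dump intended for debugging."""
    lines = []
    for index, item in enumerate(items, 1):
        lines.append(f"[{index}]")
        for key, value in item.items():
            lines.append(f"  {key}: {value}")
    return "\n".join(lines)


def to_json(items):
    """JSON document with the parsed items under an 'items' key."""
    return json.dumps({"items": items}, ensure_ascii=False, indent=2)


def to_xml(items):
    """XML document with one <item> per parsed item, one child per property."""
    root = ET.Element("items")
    for item in items:
        node = ET.SubElement(root, "item")
        for key, value in item.items():
            ET.SubElement(node, key).text = str(value)
    return ET.tostring(root, encoding="unicode")


# Adding an output format means adding one entry to this registry.
FORMATTERS = {"text": to_text, "json": to_json, "xml": to_xml}


def render(items, output="json"):
    """Serialise parsed items with the formatter registered under `output`."""
    try:
        formatter = FORMATTERS[output]
    except KeyError:
        raise ValueError(f"unknown output format: {output!r}") from None
    return formatter(items)
```

For example, render([{"fn": "Glenn Jones"}], "xml") serialises a single parsed item as XML, and a new format can be registered without touching the parsing code.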
At the moment, I am building a test suite to fine-tune the component's
compliance. It still has some small issues with most of the compound
microformats, which I am trying to address.
If you have any comments or want to point out any issues, please give me
as much feedback as possible.
Thanks
Glenn Jones
www.glennjones.net
_______________________________________________
microformats-discuss mailing list
microformats-discuss microformats.org
http://microformats.org/mailman/listinfo/microformats-discuss

Re: ufXtract - new microformats parser | United Kingdom | 2007-11-25 13:20:47
On 25 Nov 2007, at 18:09, Glenn Jones wrote:
> http://lab.backnetwork.com/ufXtract/
Great work, Glenn!
drew.

Re: ufXtract - new microformats parser | United States | 2007-11-25 18:58:40
Glenn Jones wrote:
> It has been built from the ground up to take configuration objects which
> allow the parsing of different microformats or POSH patterns.
Great job Glenn,
I was wondering what the configuration objects look like. Do you use a
grammar for each uf expressed?
Thank you,
Guillaume

Re: ufXtract - new microformats parser | 2007-11-25 21:32:34
Awesome work, Glenn.
I like the way results are returned (especially in XML, for server-side
use).
BUT... I just tested it on my blog and I think there might be an issue
with charsets different than utf-8. I'm using iso-8859-1, by the way.
http://lab.backnetwork.com/ufXtract/?url=http%3A%2F%2Fandr3.net%2Fblog&format=hcard&output=xml&callback=
Cheers,
André Luís
PS: it might be my problem, but still, I thought I'd let you know.
Parsers should be as robust as possible, right? hehe
On Nov 26, 2007 12:58 AM, Guillaume Lebleu <gl brixlogic.com> wrote:
> Glenn Jones wrote:
> > It has been built from the ground up to take configuration objects
> > which allow the parsing of different microformats or POSH patterns.
> Great job Glenn,
> I was wondering what the configuration objects look like. Do you use a
> grammar for each uf expressed?
> Thank you,
> Guillaume

RE: ufXtract - new microformats parser | United Kingdom | 2007-11-26 06:58:13
Guillaume Lebleu wrote:
> I was wondering what the configuration objects look like. Do you use a
> grammar for each uf expressed?
They are C# collections. The plan is that once I have tuned the
component's compliance, I will add XML serialisation. This will mean that
anyone will be able to define their own POSH pattern or test new uf ideas.
I believe this is similar to how Michael Kaply used JavaScript objects to
define microformats in Operator. Take a look at hAtom.js at
http://www.kaply.com/weblog/operator-user-scripts/
The XML from a ufXtract configuration object should look like:
<ufformatdescriber>
  <name>geo</name>
  <description>Location constructed of latitude and longitude</description>
  <type>geo</type>
  <ufelementdescriber name="geo" attribute="class"
      mandatory="false" multiples="true"
      concatenatevalues="false" type="text">
    <ufelementdescriber name="latitude" attribute="class"
        mandatory="false" multiples="false"
        concatenatevalues="false" type="text" />
    <ufelementdescriber name="longitude" attribute="class"
        mandatory="false" multiples="false"
        concatenatevalues="false" type="text" />
  </ufelementdescriber>
</ufformatdescriber>
These are more complex in real life, but this should give you an idea. You
cannot define everything this way; there are some rules, like the hCard
implied 'n' optimization, which cannot be described with this type of
schema. That said, it covers most cases without having to add new
hardcoded rules to the parser.
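
To make the idea of a describer-driven parser concrete, here is a minimal Python sketch that reads a geo configuration like the one above and applies it to a well-formed XHTML fragment. It is an illustration only, not how ufXtract works internally: the function names are hypothetical, the reading of multiples (a list when "true", otherwise the first match) is an assumption, and mandatory and concatenatevalues are ignored.

```python
import xml.etree.ElementTree as ET

# Assumed configuration, modelled on the geo example in this thread.
CONFIG = """
<ufformatdescriber>
  <name>geo</name>
  <type>geo</type>
  <ufelementdescriber name="geo" attribute="class" multiples="true">
    <ufelementdescriber name="latitude" attribute="class" multiples="false" />
    <ufelementdescriber name="longitude" attribute="class" multiples="false" />
  </ufelementdescriber>
</ufformatdescriber>
"""


def matches(element, describer):
    """True if the describer's name is a token of the configured attribute."""
    tokens = (element.get(describer.get("attribute")) or "").split()
    return describer.get("name") in tokens


def text_of(element):
    """Text content of an element with whitespace runs collapsed."""
    return " ".join("".join(element.itertext()).split())


def extract(element, describer):
    """Return text for a leaf describer, or a dict of child properties."""
    children = describer.findall("ufelementdescriber")
    if not children:
        return text_of(element)
    result = {}
    for child in children:
        hits = [extract(node, child)
                for node in element.iter()
                if node is not element and matches(node, child)]
        if hits:
            many = child.get("multiples") == "true"
            result[child.get("name")] = hits if many else hits[0]
    return result


def parse(xhtml, config_xml=CONFIG):
    """Apply the top-level describer to every matching element in the page."""
    top = ET.fromstring(config_xml).find("ufelementdescriber")
    root = ET.fromstring(xhtml)
    return [extract(node, top) for node in root.iter() if matches(node, top)]
```

With this layout, supporting a different flat pattern is a matter of swapping the configuration string; rules such as the implied 'n' optimization would still need dedicated code.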
Glenn Jones
www.glennjones.net

RE: ufXtract - new microformats parser | United Kingdom | 2007-11-26 07:08:33
André Luís wrote:
> I just tested it on my blog and I think there might be an issue with
> charsets different than utf-8. I'm using iso-8859-1, by the way.
Thanks for the feedback. This is a classic .NET text-handling issue: it
hates anything that's not in UTF-8. I will try to fix it ASAP.
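
Charset problems of this kind usually come from decoding the fetched bytes with the wrong encoding. The Python sketch below shows one common approach, offered as an illustration and not as the fix applied to ufXtract: take the charset from the HTTP Content-Type header, fall back to a meta declaration near the top of the page, and only then assume UTF-8. The names detect_charset and decode_page are hypothetical.

```python
import re

CHARSET_RE = r"charset=[\"']?([\w.:-]+)"


def detect_charset(content_type, body, default="utf-8"):
    """Pick a charset label from the HTTP header, then a <meta> tag, then a default."""
    # 1. charset parameter of the Content-Type response header
    match = re.search(CHARSET_RE, content_type or "", re.I)
    if match:
        return match.group(1)
    # 2. <meta ... charset=...> declared in the first 1024 bytes of the page
    head = body[:1024].decode("ascii", errors="ignore")
    match = re.search(r"<meta[^>]+" + CHARSET_RE, head, re.I)
    if match:
        return match.group(1)
    return default


def decode_page(content_type, body):
    """Decode raw page bytes to text using the detected charset."""
    charset = detect_charset(content_type, body)
    try:
        return body.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown label or a wrong declaration: iso-8859-1 maps every byte
        # to a character, so this fallback cannot fail.
        return body.decode("iso-8859-1")
```

A page served as iso-8859-1, such as the blog in the report above, would then keep its accented characters instead of being mangled by a hard-coded UTF-8 decode.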
Glenn Jones
www.glennjones.net