Thread: License Metadata Extraction and Search, Summer of Code
|
|
| License Metadata Extraction and Search, Summer of Code |
  United States |
2007-03-21 13:18:25 |
Hi,

I'm looking into adding license searching/indexing support to a
service such as Tracker, Beagle, or Strigi as a Google SoC project. My
first hurdle, though, is picking an indexer. The ideal service would be
cross-desktop, to avoid implementing extraction filters over and over
again for different indexers. It also needs to be widely adopted.

Tracker is looking like a good candidate, given that it is a
Freedesktop.org project, is desktop-neutral, and appears to intend both
to follow standards and to create standards for other search services
to use. I get the impression GNOME will be including it soon.

Strigi is also desktop-neutral, though it is favored by KDE and is
going to be used by KDE 4. It doesn't rely on KDE, though. In fact,
Strigi's only requirements are the stdc++ libraries, while Tracker is
glib-based.

As for Beagle, Mono is one significant reason I'm shying away from it.
Tracker and Strigi appear more interoperable and look to be getting
wider adoption.

Formats I plan to include are:
HTML, SVG, SMIL, XML in general (RDF)
PDF, JPEG, other images (XMP)
MP3, OGG, other audio/video
RSS
From what I've seen, most license data is either in RDF or XMP form.
MP3, OGG, and RSS are exceptions. For all these formats, I would follow
the embedding specification on the Creative Commons website, at
http://creativecommons.org/technology/usingmarkup

Since most licenses are placed in RDF or XMP, that code can be separated
and reused from various extraction modules.
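
To make the "separate and reuse" idea concrete, here is a minimal sketch of what the shared RDF piece might look like. It assumes the license is expressed as a cc:license element carrying an rdf:resource attribute, and it checks both the older and the newer Creative Commons vocabulary namespaces. The function name find_license_urls and the SAMPLE document are made up for illustration; none of this is taken from ccLookup or any indexer.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
# Older and newer Creative Commons vocabulary namespaces.
CC_NAMESPACES = (
    "http://web.resource.org/cc/",
    "http://creativecommons.org/ns#",
)


def find_license_urls(rdf_xml):
    """Return the license URLs named by cc:license elements in an RDF/XML string."""
    root = ET.fromstring(rdf_xml)
    urls = []
    for ns in CC_NAMESPACES:
        # ElementTree spells qualified names as "{namespace}localname".
        for elem in root.iter("{%s}license" % ns):
            url = elem.get("{%s}resource" % RDF_NS)
            if url and url not in urls:
                urls.append(url)
    return urls


# Hypothetical sample document, for illustration only.
SAMPLE = """<rdf:RDF xmlns="http://web.resource.org/cc/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <Work rdf:about="">
    <license rdf:resource="http://creativecommons.org/licenses/by/2.5/"/>
  </Work>
</rdf:RDF>"""

print(find_license_urls(SAMPLE))
```

Each format-specific module would then only need to locate the RDF block (inside an SVG metadata element, an HTML comment, an XMP packet, and so on) and hand the string to this one function.
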
So enough rambling... thoughts?
-Jason Kivlighn
_______________________________________________
cc-devel mailing list
cc-devel lists.ibiblio.org
http://lists.ibiblio.org/mailman/listinfo/cc-devel
|
|
| Re: License Metadata Extraction and Search, Summer of Code |
  United States |
2007-03-21 13:28:07 |
On Wed, 2007-03-21 at 11:18 -0700, Jason K wrote:
> Since most licenses are placed in RDF or XMP, that code can be separated
> and reused from various extraction modules.
>
> So enough rambling... thoughts?
Sounds perfect to me, though there are two RDF serializations to be
aware of: RDF/XML (and XMP is actually a constrained RDF/XML) and RDFa,
see http://wiki.creativecommons.org/RDFa
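
For the RDFa side, a very rough sketch of the simplest case: collecting the href of any a or link element whose rel attribute includes "license". This is not a full RDFa processor (it ignores about, property, and namespace handling entirely), and the names LicenseLinkParser, find_rel_license, and PAGE are hypothetical.

```python
from html.parser import HTMLParser


class LicenseLinkParser(HTMLParser):
    """Collect href values of <a>/<link> elements whose rel includes 'license'."""

    def __init__(self):
        super().__init__()
        self.licenses = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attr = dict(attrs)
        # rel may hold several space-separated tokens, or be missing entirely.
        rels = (attr.get("rel") or "").lower().split()
        href = attr.get("href")
        if "license" in rels and href:
            self.licenses.append(href)


def find_rel_license(html_text):
    """Return license link targets found in an HTML string, in document order."""
    parser = LicenseLinkParser()
    parser.feed(html_text)
    parser.close()
    return parser.licenses


# Hypothetical page fragment, for illustration only.
PAGE = (
    '<p>Licensed under <a rel="license" '
    'href="http://creativecommons.org/licenses/by-sa/3.0/">CC BY-SA</a>.</p>'
    '<a href="http://example.com/">unrelated link</a>'
)

print(find_rel_license(PAGE))
```
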
MP3 and OGG are the other obviously interesting targets, as you realize.
There is code for extracting all of these used in ccLookup (Python, so
presumably you'd have to rewrite, but the tests Nathan has been building
may be reusable).
--
http://wiki.creativecommons.org/User:Mike_Linksvayer
|
|
| Re: License Metadata Extraction and Search, Summer of Code |

|
2007-03-21 18:26:33 |
Jason,
I did something similar to this last year for SoC and it resulted in a
new CC library called cc-sharp:
http://code.google.com/p/cc-sharp/

So your project could have two parts: 1) the license handling and then
2) integrating that data with the desktop search application. If you
wanted to use C# (Beagle), I'd help flesh out cc-sharp with you and
you could work on the integration.

The other C# CC lib around is CCLicenseLib, which hasn't been developed
in four years.
http://workspaces.gotdotnet.com/cclib

It contains object representations of the older CC licenses. It would
be nice to make one condensed lib for CC stuff in C# so developers on
other projects could easily integrate it with their software. I see it
being laid out as such:
- Attaching licenses to media
- Reading licenses from media
- Verifying licenses
This desktop search idea would primarily use reading and verifying.
Right now all cc-sharp does is verify, because I was originally working
on Banshee. Banshee already read the metadata from the MP3 via my
patch, so all my lib really amounted to was an abstraction of the
verification. Since verification is done over the Internet, that's not
really something you want to include by default in core application
code.
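
To illustrate keeping the offline part apart from the network part, here is a sketch that only parses a license claim string of the kind stored in an MP3 copyright tag and stops short of any network access. It assumes the claim reads "<year> <holder>. Licensed to the public under <license URL> verify at <verification URL>"; that layout is an assumption on my part and is not spelled out in this thread. The names CLAIM_RE and parse_license_claim are hypothetical and are not cc-sharp's API.

```python
import re

# Assumed claim layout (not confirmed by this thread):
#   "<year> <holder>. Licensed to the public under <license URL> verify at <URL>"
CLAIM_RE = re.compile(
    r"^(?P<holder>.*?)\s*Licensed to the public under\s+(?P<license>\S+)"
    r"(?:\s+verify at\s+(?P<verify>\S+))?\s*$",
    re.IGNORECASE,
)


def parse_license_claim(text):
    """Split a claim string into holder, license URL, and verification URL.

    Returns None if the text does not look like a license claim. No network
    access happens here; fetching the verification URL is a separate step.
    """
    match = CLAIM_RE.match(text.strip())
    if match is None:
        return None
    return {
        "holder": match.group("holder").strip(),
        "license": match.group("license"),
        "verify": match.group("verify"),  # None when no "verify at" part exists
    }


# Hypothetical tag value, for illustration only.
claim = parse_license_claim(
    "2007 Example Artist. Licensed to the public under "
    "http://creativecommons.org/licenses/by/2.5/ "
    "verify at http://example.com/licenses.html"
)
print(claim)
```

An indexer could store the parsed license URL right away and leave the fetch of the verification page to an optional, separately enabled step.
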
I'd like to abstract license reading so we can just "plug" in support
for different file types to be read, whether they are images, audio,
etc. Kind of like vfs.
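
One way the "plug" idea could look, sketched in Python rather than C# just to keep it short: readers register themselves for file extensions, and a single entry point dispatches on the extension. Every name here (register_reader, read_license, the ".txt" reader) is hypothetical, and the plain-text reader is only a stand-in for real MP3, OGG, or image readers.

```python
import os

# Maps a lowercase file extension (".mp3", ".svg", ...) to a reader function.
_READERS = {}


def register_reader(*extensions):
    """Decorator that registers a reader function for one or more extensions."""
    def decorator(func):
        for ext in extensions:
            _READERS[ext.lower()] = func
        return func
    return decorator


def read_license(path):
    """Dispatch to the reader registered for the file's extension.

    Returns the license URL, or None when no reader is registered or the
    reader finds nothing.
    """
    ext = os.path.splitext(path)[1].lower()
    reader = _READERS.get(ext)
    if reader is None:
        return None
    return reader(path)


@register_reader(".txt")
def _read_plain_text(path):
    # Stand-in reader: return the first whitespace-separated token that
    # looks like a Creative Commons license URL.
    with open(path, encoding="utf-8", errors="replace") as handle:
        for token in handle.read().split():
            if token.startswith("http://creativecommons.org/licenses/"):
                return token
    return None
```

Adding a new file type then means writing one function and decorating it; the indexer integration only ever calls read_license(path).
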
What are your thoughts?
-Luke
--
Luke Hoersten
http://www.cs.purdue.edu/homes/lhoerste/
http://openradix.org/
|
|
| Re: License Metadata Extraction and Search, Summer of Code |

|
2007-03-21 19:07:46 |
One more thing. In response to your hesitation about using Mono and C#:
Mono is a great platform and C# is a great high-level language. Keep
in mind that SoC only lasts for three months, so you want to use
something that will:
1. Have a fast development time and
2. Be easy to integrate with a large code base.
Learning pre-existing code bases (especially in C) is not an easy or
fast process. I think this is one of the huge benefits of having a lot
of the CC stuff written in Python: it's much nicer to get your hands
dirty when just starting out.

Also, Beagle is already included in Ubuntu and is, if I recall
correctly, the oldest and most mature desktop search engine. Jason,
you mentioned that you are looking for a cross-desktop solution; well,
C# has the potential to be cross-platform (though in most cases it's
not used that way).

I'd think about ease of implementation and integration before thinking
as far ahead as being cross-desktop, just because of the nature of the
situation.

Luke
|
|
| Re: License Metadata Extraction and Search, Summer of Code |
  United States |
2007-03-21 19:25:19 |
I think I've settled on Tracker. I got an okay from them, as well as
someone who volunteered to mentor me with Tracker code while working
under Creative Commons.

I like the idea of separating it into two parts. Since there are so many
indexers out there, separating the parser means we have an
application/library that any indexer can use. Looking at Tracker's
infrastructure, it should work nicely. Even using Tracker, cc-sharp may
come in handy, since Tracker can call external processes to extract the
search data. Here's the list of formats I was hoping to support: MP3,
OGG, RSS, SVG, HTML, XML, JPEG, PDF, SMIL. The big problem I see with
cc-sharp is working with C#. I'd consider myself fairly fluent in
C, C++, Java, and Python.

I notice that ccPublisher already attaches licenses, and ccLookup reads
licenses in anything with RDF metadata as well as in MP3s. In response
to your second email, Luke, it might work to extend ccLookup to support
more formats and then have the Tracker extractor call this program.
Then I'm sticking with a high-level language I'm familiar with.
However, I'm not sure that will bode well for performance. The
extraction process needs to be fast, so a C library might be a better
option. Given the scope of formats, our extractor would be run quite
often on the typical desktop.
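
To show the shape of the "extractor calls this program" option, here is a sketch of a tiny command-line extractor. It assumes the indexer runs an external program with a file path and reads its standard output; the "license=<url>" output line and the exit codes are made up for illustration and are not Tracker's actual extractor protocol, which I still need to look at. The scan is a crude byte search for a license URL, standing in for real per-format parsing.

```python
import re
import sys

# Crude heuristic: any Creative Commons license URL appearing in the raw bytes.
LICENSE_URL_RE = re.compile(
    rb"https?://creativecommons\.org/licenses/[A-Za-z0-9./+-]+"
)


def extract_license(path, max_bytes=1 << 20):
    """Scan at most max_bytes of the file and return the first license URL found."""
    with open(path, "rb") as handle:
        data = handle.read(max_bytes)
    match = LICENSE_URL_RE.search(data)
    return match.group(0).decode("ascii") if match else None


def main(argv):
    """Print 'license=<url>' for the given file; return a process exit code."""
    if len(argv) != 2:
        print("usage: cc_extract FILE", file=sys.stderr)
        return 2
    url = extract_license(argv[1])
    if url is None:
        return 1  # nothing found; the indexer can simply skip the field
    print("license=%s" % url)
    return 0
```

A wrapper script would end with sys.exit(main(sys.argv)). The read is capped at max_bytes to bound the per-file cost, but the interpreter start-up paid on every file is exactly the performance worry above, and is the main argument for a C library or a long-running helper instead.
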

The Tracker code base, from what I've seen, looks very manageable, but I
hope to get more feedback from the Tracker folks soon.

Cheers,
Jason
|
|
| Re: License Metadata Extraction and Search, Summer of Code |

|
2007-03-21 21:02:31 |
That sounds like a good plan. Calling external libraries will
definitely make programming faster (which right now is more important
than execution speed).
Luke