Thread: Extracting Embedded Licenses

Extracting Embedded Licenses | United States | 2007-06-18 14:49:22

Hi,

imagemagick: Uses 'convert filename xmp:-' to output an image's embedded
XMP. This works for at least JPEG and TIFF files. For JPEGs, however,
ImageMagick outputs the namespace and the XMP, separated by a null byte.
I'm not sure how to handle this without simply assuming that 'convert'
returned two null-terminated strings. Nevertheless, this extracts the
XMP from TIFF files.
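
One possible way to handle the JPEG case, as a minimal sketch in plain C and not part of the patch: if the extractor knows the full byte count of what 'convert' wrote (strlen() would stop at the first null byte), it can skip past the namespace. The name find_xmp_packet is hypothetical, and so is the assumption that exactly one null byte separates the namespace from the packet.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: locate the XMP packet inside the raw bytes
 * captured from `convert file xmp:-`.
 *
 * Assumed layouts (not confirmed against ImageMagick itself):
 *   JPEG: <namespace> NUL <xmp packet> [NUL]
 *   TIFF: <xmp packet> [NUL]
 *
 * Returns a pointer into buf and stores the packet length in *xmp_len,
 * or returns NULL for an empty buffer.
 */
static const char *
find_xmp_packet (const char *buf, size_t len, size_t *xmp_len)
{
	const char *nul;
	const char *xmp;
	const char *end;
	size_t offset;

	*xmp_len = 0;
	if (buf == NULL || len == 0)
		return NULL;

	nul = memchr (buf, '\0', len);
	if (nul == NULL) {
		/* No separator at all: the whole buffer is the packet. */
		*xmp_len = len;
		return buf;
	}

	offset = (size_t) (nul - buf) + 1;
	if (offset >= len) {
		/* The only NUL is a trailing terminator. */
		*xmp_len = (size_t) (nul - buf);
		return buf;
	}

	/* Data follows the first NUL: treat the prefix as the namespace. */
	xmp = buf + offset;
	end = memchr (xmp, '\0', len - offset);
	*xmp_len = (end != NULL) ? (size_t) (end - xmp) : len - offset;
	return xmp;
}
```

For TIFF output, which has no namespace prefix, the function returns the buffer unchanged, so the same call could serve both formats.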

msoffice: Extends the msoffice extractor to also parse the
DocumentSummaryInformation infile, which contains user-defined metadata,
along with license metadata embedded by the MSOffice Creative Commons
Add-in.

pdf: Extends the pdf extractor to read a PDF's metadata stream and parse
it as XMP. I'm still waiting for poppler to extend its glib bindings to
allow reading the metadata stream. Until then, the extractor will simply
never find the metadata stream and will carry on without error.

png: Adds a check for the XML:com.adobe.xmp iTXt field, and parses it as
XMP.
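
As a rough sketch of the PNG check described above, written in plain C without libpng: walk the PNG chunks and look for an uncompressed iTXt chunk whose keyword is XML:com.adobe.xmp. The function names are hypothetical, CRCs are not verified, and compressed iTXt chunks are skipped; this is not the code from the patch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define XMP_KEYWORD "XML:com.adobe.xmp"

/* Read a 4-byte big-endian integer, as used for PNG chunk lengths. */
static uint32_t
read_be32 (const unsigned char *p)
{
	return ((uint32_t) p[0] << 24) | ((uint32_t) p[1] << 16) |
	       ((uint32_t) p[2] << 8) | (uint32_t) p[3];
}

/*
 * Hypothetical helper: return a pointer to the XMP text stored in an
 * uncompressed iTXt chunk, with its length in *xmp_len, or NULL if no
 * such chunk is found or the file is truncated.
 */
static const unsigned char *
find_png_xmp (const unsigned char *png, size_t len, size_t *xmp_len)
{
	static const unsigned char png_sig[8] =
		{ 0x89, 'P', 'N', 'G', '\r', '\n', 0x1a, '\n' };
	const size_t kw_size = sizeof (XMP_KEYWORD); /* keyword plus its NUL */
	size_t pos = 8;

	*xmp_len = 0;
	if (png == NULL || len < 8 || memcmp (png, png_sig, 8) != 0)
		return NULL;

	/* Each chunk: 4-byte length, 4-byte type, data, 4-byte CRC. */
	while (len - pos >= 12) {
		uint32_t data_len = read_be32 (png + pos);
		const unsigned char *type = png + pos + 4;
		const unsigned char *data = png + pos + 8;

		if (data_len > len - pos - 12)
			return NULL; /* chunk runs past the end of the buffer */

		if (memcmp (type, "iTXt", 4) == 0 &&
		    data_len >= kw_size + 2 &&
		    memcmp (data, XMP_KEYWORD, kw_size) == 0 &&
		    data[kw_size] == 0 /* compression flag: uncompressed */) {
			const unsigned char *end = data + data_len;
			const unsigned char *p = data + kw_size + 2; /* flag + method */
			const unsigned char *q;

			/* Skip the NUL-terminated language tag. */
			q = memchr (p, 0, (size_t) (end - p));
			if (q == NULL)
				return NULL;
			p = q + 1;

			/* Skip the NUL-terminated translated keyword. */
			q = memchr (p, 0, (size_t) (end - p));
			if (q == NULL)
				return NULL;
			p = q + 1;

			*xmp_len = (size_t) (end - p);
			return p;
		}

		pos += 12 + (size_t) data_len;
	}

	return NULL;
}
```

The returned pointer and length could then be handed to whatever XMP parser the extractor already uses.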

html: Adds a new html parser using libxml2. Parses the document,
checking for RDFa licenses. It also checks for other basic HTML
properties like title and author.

There are also several XML formats I'd like to parse for license data,
particularly SVG and SMIL. Would this be doable, and if so, how should
I go about it? Write new extractors for each format, or is this too much
overhead? These could use GMarkupParser, rather than bringing in libxml2
like the HTML parser.

Cheers,
Jason
_______________________________________________
cc-devel mailing list
cc-devel@lists.ibiblio.org
http://lists.ibiblio.org/mailman/listinfo/cc-devel

Re: Extracting Embedded Licenses | United States | 2007-06-18 15:20:55

Whoops, I forgot the intro on this.

This is my progress thus far with extracting licenses from various
formats. Jamie, I'm curious about your thoughts on adding new extractors
(besides the ones mentioned below, GIF is another I have in mind. I'm
not sure whether or not it's worthwhile, however). I don't want to be
adding bloat...

Cheers,
Jason

Jason Kivlighn wrote:
> Hi,
>
> imagemagick: Uses 'convert filename xmp:-' to output an image's
> embedded XMP. This works for at least JPEG and TIFF files. For JPEGs,
> however, ImageMagick outputs the namespace and the XMP, separated by a
> null byte. I'm not sure how to handle this without simply assuming
> that 'convert' returned two null-terminated strings. Nevertheless,
> this extracts the XMP from TIFF files.
>
> msoffice: Extends the msoffice extractor to also parse the
> DocumentSummaryInformation infile, which contains user-defined
> metadata, along with license metadata embedded by the MSOffice
> Creative Commons Add-in.
>
> pdf: Extends the pdf extractor to read a PDF's metadata stream and
> parse it as XMP. I'm still waiting for poppler to extend its glib
> bindings to allow reading the metadata stream. Until then, the
> extractor will simply never find the metadata stream and will carry on
> without error.
>
> png: Adds a check for the XML:com.adobe.xmp iTXt field, and parses it
> as XMP.
>
> html: Adds a new html parser using libxml2. Parses the document,
> checking for RDFa licenses. It also checks for other basic HTML
> properties like title and author.
>
> There are also several XML formats I'd like to parse for license data,
> particularly SVG and SMIL. Would this be doable, and if so, how should
> I go about it? Write new extractors for each format, or is this too
> much overhead? These could use GMarkupParser, rather than bringing in
> libxml2 like the HTML parser.
>
> Cheers,
> Jason

Re: Extracting Embedded Licenses | Germany | 2007-06-18 16:33:46

On Mon, 2007-06-18 at 12:49 -0700, Jason Kivlighn wrote:
> Hi,
>
> imagemagick: Uses 'convert filename xmp:-' to output an image's
> embedded XMP. This works for at least JPEG and TIFF files. For JPEGs,
> however, ImageMagick outputs the namespace and the XMP, separated by a
> null byte. I'm not sure how to handle this without simply assuming
> that 'convert' returned two null-terminated strings. Nevertheless,
> this extracts the XMP from TIFF files.
>
> msoffice: Extends the msoffice extractor to also parse the
> DocumentSummaryInformation infile, which contains user-defined
> metadata, along with license metadata embedded by the MSOffice
> Creative Commons Add-in.
>
> pdf: Extends the pdf extractor to read a PDF's metadata stream and
> parse it as XMP. I'm still waiting for poppler to extend its glib
> bindings to allow reading the metadata stream. Until then, the
> extractor will simply never find the metadata stream and will carry on
> without error.
>
> png: Adds a check for the XML:com.adobe.xmp iTXt field, and parses it
> as XMP.
>
> html: Adds a new html parser using libxml2. Parses the document,
> checking for RDFa licenses. It also checks for other basic HTML
> properties like title and author.
>
> There are also several XML formats I'd like to parse for license data,
> particularly SVG and SMIL. Would this be doable, and if so, how should
> I go about it? Write new extractors for each format, or is this too
> much overhead? These could use GMarkupParser, rather than bringing in
> libxml2 like the HTML parser.
>
> Cheers,
> Jason

Nathan, what do you think about these as well?
Jon

> plain text document attachment (tracker-imagemagick-extract-xmp.patch)
> Index: src/tracker-extract/tracker-extract-imagemagick.c
> ===================================================================
> --- src/tracker-extract/tracker-extract-imagemagick.c	(revision 598)
> +++ src/tracker-extract/tracker-extract-imagemagick.c	(working copy)
> @@ -35,7 +35,7 @@
>  	gint exit_status;
>
>  	/* imagemagick crashes trying to extract from xcf files */
> -	if (g_str_has_suffix (filename, '.xcf')) {
> +	if (g_str_has_suffix (filename, ".xcf")) {
>  		return;
>  	}
>
> @@ -60,5 +60,16 @@
>  			g_hash_table_insert (metadata, g_strdup ("Image:Comments"), g_strdup (g_strescape (lines[2], "")));
>  		}
>  	}
> +
> +	gchar *xmp;
> +	argv[0] = g_strdup ("convert");
> +	argv[1] = g_strdup (filename);
> +	argv[2] = g_strdup ("xmp:-");
> +	argv[3] = NULL;
> +
> +	if (tracker_spawn (argv, 10, &xmp, &exit_status)) {
> +		if (exit_status == EXIT_SUCCESS) {
> +			tracker_read_xmp (xmp, strlen (xmp), metadata);
> +		}
> +	}
>  }
> -
> plain text document attachment (tracker-msoffice-extract-license.patch)
> Index: src/tracker-extract/tracker-extract-msoffice.c
> ===================================================================
> --- src/tracker-extract/tracker-extract-msoffice.c	(revision 598)
> +++ src/tracker-extract/tracker-extract-msoffice.c	(working copy)
> @@ -118,7 +118,26 @@
>  	}
>  }
>
> +static void
> +doc_metadata_cb (gpointer key, gpointer value, gpointer user_data)
> +{
> +	gchar *name;
> +	GsfDocProp *property;
> +	GHashTable *metadata;
> +	GValue const *val;
>
> +	name = (gchar *) key;
> +	property = (GsfDocProp *) value;
> +	metadata = (GHashTable *) user_data;
> +
> +	val = gsf_doc_prop_get_val (property);
> +
> +	if (strcmp (name, "CreativeCommons_LicenseURL") == 0) {
> +		add_gvalue_in_hash_table (metadata, "File:License", val);
> +	}
> +}
> +
> +
>  void
>  tracker_extract_msoffice (gchar *filename, GHashTable *metadata)
>  {
> @@ -145,25 +164,37 @@
>  	}
>
>  	stream = gsf_infile_child_by_name (infile, "