
Thread: Re: Closed: (NUTCH-562) Port mime type framework to use Tika mime detection framework




Re: Closed: (NUTCH-562) Port mime type framework to use Tika mime detection framework
Poland
2007-10-09 15:57:21
Chris A. Mattmann (JIRA) wrote:
>      [ https://issues.apache.org/jira/browse/NUTCH-562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
> 
> Chris A. Mattmann closed NUTCH-562.
> -----------------------------------
> 
> 
> - Patch applied to trunk in r583016

I think this issue didn't get enough attention before it was committed.
I agree with the direction of this patch - functionality-wise the mime
type detector in Tika is clearly superior to the one that we have now in
Nutch - but I feel that the use of an external framework, which is not
yet released, should be discussed first, and the proper working of the
patch should be confirmed by other users. There was too little time to
do this before the commit.

I vote for reverting this patch, unless there is an overall consensus
among Nutch developers that it's ok to keep it as it is - on one hand
considering the added functionality and simplification of Nutch code,
and on the other hand considering the (lack of) maturity of Tika.

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||/|  Information Retrieval, Semantic Web
___|||__||  |  ||  |  Embedded Unix, System Integration
http://www.sigram.com 
Contact: info at sigram dot com


Re: Closed: (NUTCH-562) Port mime type framework to use Tika mime detection framework
2007-10-09 16:22:36
What bothers me here is not the time to commit, although I agree it
probably should have been longer than 1 day, but that AFAIK there is
very little documentation about Tika. That being said, both Chris and
Sami are committers for Tika. So if they both feel that Tika is mature
enough to use, and can help answer the inevitable questions on the Nutch
list about it, then I feel it would be okay to keep the changes.

Dennis Kubes




Re: Closed: (NUTCH-562) Port mime type framework to use Tika mime detection framework
United States
2007-10-09 16:55:27
Folks,

 Either way is fine with me. I committed the patch for the following
reasons:

 1. Though the patch sat for around 36 hrs, the JIRA issue has been
around nearly 2 weeks, without any comment at all. I used this as a
baseline for relative interest in the patch. Though a patch file is
ultimately the means by which contributions are to be judged, I had
pretty much laid out the plan in the JIRA issue: port Nutch to use Tika
mime system. Tika mime system provides X, Y, Z that Nutch doesn't, etc.
This described the ultimate intent of the code that was soon to be
reified.

 2. Similarity of Tika mime API to existing Nutch mime API. The core
classes of the API in both Tika and the old mime system in Nutch are 90%
the same (in some cases, like MimeTypes.java, the file is nearly
identical). This fact is not incidental: it's because Jerome wrote the
majority of both code bases. This made it easier for me to swallow that
the API would work as expected.

 3. My experience testing the patch in the case of small crawls against
subsets of the apache.org sites. I was primarily looking for 2 things:

  a. performance -- there wasn't a significant hit that I could notice
while observing crawl time anecdotally.

  b. effectiveness -- were mime types still being set in the metadata,
were the right parsers getting called, etc.? The answer here was "yes".

 I'm sure that this is more of a procedural issue than anything else.
Because of this I'm happy to revert the patch. My +1 for it in fact. Then
I'll happily wait for other folks to test it and provide feedback. I
can't promise I'll get to updating it and committing revised versions of
it back to the sources right away though: the rest of my week is actually
very busy (another reason for my desire to contribute the patch and
commit it over the past weekend -- it was the only time in the next week
or so that I would have to get it into the sources and to solve some
issues that have been plaguing Nutch for a while, e.g., reliable content
type detection in the case of XML/RDF/RSS files, etc.).

 In any case, let me know what you decide.

Chris


  






Re: Closed: (NUTCH-562) Port mime type framework to use Tika mime detection framework
Finland
2007-10-10 13:05:17
Andrzej Bialecki wrote:
> Chris A. Mattmann (JIRA) wrote:
>>      [ https://issues.apache.org/jira/browse/NUTCH-562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>>
>> Chris A. Mattmann closed NUTCH-562.
>> -----------------------------------
>>
>>
>> - Patch applied to trunk in r583016
>
> I think this issue didn't get enough attention before it was committed.
> I agree with the direction of this patch - functionality-wise the mime
> type detector in Tika is clearly superior to the one that we have now in
> Nutch - but I feel that the use of an external framework, which is not
> yet released, should be discussed first, and the proper working of the
> patch should be confirmed by other users. There was too little time to
> do this before the commit.
>
> I vote for reverting this patch, unless there is an overall consensus
> among Nutch developers that it's ok to keep it as it is - on one hand
> considering the added functionality and simplification of Nutch code,
> and on the other hand considering the (lack of) maturity of Tika.

I agree with Andrzej here. I would have waited a bit longer rather than
rushing into this, because at this point (where no Tika releases have
been made) it might even be possible (even though it does not look like
it right now) that the project will be retired without any releases at
all.

-- 
 Sami Siren
