Thread: Created: (NUTCH-565) Arc File to Nutch Segments Converter




Created: (NUTCH-565) Arc File to Nutch Segments Converter
2007-10-09 00:08:50
Arc File to Nutch Segments Converter
------------------------------------

                 Key: NUTCH-565
                 URL: https://issues.apache.org/jira/browse/NUTCH-565
             Project: Nutch
          Issue Type: Improvement
         Environment: all
            Reporter: Dennis Kubes
            Assignee: Dennis Kubes
             Fix For: 1.0.0


Functionality that allows arc files, such as those produced
by the Internet Archive project or by the Grub distributed
crawler, to be parsed into Nutch segments.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue
online.


Updated: (NUTCH-565) Arc File to Nutch Segments Converter
2007-10-09 00:11:50
     [ https://issues.apache.org/jira/browse/NUTCH-565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-565:
-------------------------------

    Attachment: archive-commons-1.11.0-200612262257.jar

Archive commons jar needed for reading arc files.



Updated: (NUTCH-565) Arc File to Nutch Segments Converter
2007-10-09 00:18:50
     [ https://issues.apache.org/jira/browse/NUTCH-565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-565:
-------------------------------

    Attachment: nutch-565-1-20071009.patch

An arc file input format, record reader, and utility to
convert arc files to Nutch segments.  The conversion utility
acts in place of the fetcher to convert compressed web pages
in arc files into the standard Nutch segments format.  All
current fetcher rules for url filtering and normalization as
well as content parsing still apply.  Currently only
text/html content types are supported within the arc files.
This functionality is meant to be used with Hadoop 0.14 or
higher.
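
As an illustration only (this is not code from the attached
patch), the content-type restriction described above could
look roughly like the Python sketch below.  It assumes the
ARC v1 record header layout of five space-separated fields
(URL, IP address, archive date, content type, length).  The
names ArcHeader, parse_arc_header, and accept are
hypothetical, and the url_filter callback merely stands in
for Nutch's url filtering.

```python
from typing import Callable, NamedTuple


class ArcHeader(NamedTuple):
    """Fields of an ARC v1 URL-record header line (assumed layout)."""
    url: str
    ip: str
    archive_date: str
    content_type: str
    length: int


def parse_arc_header(line: str) -> ArcHeader:
    # Split from the right so a URL containing spaces stays in one piece.
    url, ip, date, ctype, length = line.strip().rsplit(" ", 4)
    return ArcHeader(url, ip, date, ctype, int(length))


def accept(header: ArcHeader,
           url_filter: Callable[[str], bool] = lambda url: True) -> bool:
    """Keep only text/html records whose URL passes the filter."""
    mime = header.content_type.split(";")[0].strip().lower()
    return mime == "text/html" and url_filter(header.url)


if __name__ == "__main__":
    hdr = parse_arc_header(
        "http://example.com/ 192.0.2.1 20071009000850 text/html 1234")
    print(hdr.url, accept(hdr))
```

Records rejected by accept would simply be skipped before
parsing, so only HTML pages end up in the segment.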



Updated: (NUTCH-565) Arc File to Nutch Segments Converter
2007-10-09 13:59:50
     [ https://issues.apache.org/jira/browse/NUTCH-565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-565:
-------------------------------

    Attachment: fastutil-5.0.3-heritrix-subset-1.0.jar

The patch also requires some fastutil classes.



Commented: (NUTCH-565) Arc File to Nutch Segments Converter
2007-10-09 14:07:50
    [ https://issues.apache.org/jira/browse/NUTCH-565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12533458 ]

Sami Siren commented on NUTCH-565:
----------------------------------

What are the licenses for those jars? 



Commented: (NUTCH-565) Arc File to Nutch Segments Converter
2007-10-09 15:45:51
    [ https://issues.apache.org/jira/browse/NUTCH-565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12533499 ]

Dennis Kubes commented on NUTCH-565:
------------------------------------

Currently the input format uses 1 map task per arc file. 
This could be improved in the future by breaking a file into
multiple map tasks.
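
To make the difference concrete, here is a rough Python
sketch, not taken from the patch, of the two split
strategies.  Split, whole_file_splits, and
record_aligned_splits are hypothetical names.  The second
function assumes the byte offsets of the record starts are
already known, since a split can only begin where a
compressed record begins.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Split:
    """A byte range of one file handed to a single map task."""
    path: str
    start: int
    length: int


def whole_file_splits(file_sizes):
    """Current behaviour: one split (one map task) per arc file."""
    return [Split(path, 0, size) for path, size in sorted(file_sizes.items())]


def record_aligned_splits(path, record_offsets, file_size, records_per_split):
    """Possible refinement: several splits per file, cut at record starts."""
    splits = []
    for i in range(0, len(record_offsets), records_per_split):
        start = record_offsets[i]
        j = i + records_per_split
        # The split ends where the next group starts, or at end of file.
        end = record_offsets[j] if j < len(record_offsets) else file_size
        splits.append(Split(path, start, end - start))
    return splits


if __name__ == "__main__":
    print(whole_file_splits({"a.arc.gz": 500, "b.arc.gz": 300}))
    print(record_aligned_splits("a.arc.gz", [0, 100, 250, 400], 500, 2))
```

With whole-file splits, parallelism is limited by the number
of arc files.  Record-aligned splits would allow more map
tasks per file, at the cost of first locating the record
boundaries.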



Commented: (NUTCH-565) Arc File to Nutch Segments Converter
2007-10-09 15:45:50
    [ https://issues.apache.org/jira/browse/NUTCH-565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12533498 ]

Dennis Kubes commented on NUTCH-565:
------------------------------------

Both jars are LGPL.  The archive-commons is from archive.org
and is currently used in NutchWax.  The fastutil jar is a
subset of fastutil classes used by archive.org.



Commented: (NUTCH-565) Arc File to Nutch Segments Converter
2007-10-10 11:32:50
    [ https://issues.apache.org/jira/browse/NUTCH-565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12533787 ]

Sami Siren commented on NUTCH-565:
----------------------------------

> Both jars are LGPL.

I think that prohibits direct inclusion then. Take a look at
http://people.apache.org/~rubys/3party.html for available
options.



Updated: (NUTCH-565) Arc File to Nutch Segments Converter
2007-10-11 16:30:50
     [ https://issues.apache.org/jira/browse/NUTCH-565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-565:
-------------------------------

    Attachment: arcsegments2.patch

Here is the updated patch.  It works without any archive.org
or other LGPL code, so it can be included in Nutch.  Since
arcs are simply tars of gzips, it scans through the arc file
for the gzip header, starts input there when found, and
unzips each record in turn.  It takes about 40 min to process
a single file, which outputs ~1G in segments.  Multiple files
can be run at once on a Hadoop cluster.
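
The scan-and-unzip idea can be sketched in a few lines of
Python.  This is only an illustration of the technique under
stated assumptions, not the code in arcsegments2.patch.  It
assumes each record is an independent gzip member, reads the
whole input into memory, and uses the hypothetical name
iter_gzip_records.

```python
import gzip
import zlib

# First three bytes of a gzip member: ID1, ID2, CM=deflate (RFC 1952).
GZIP_MAGIC = b"\x1f\x8b\x08"


def iter_gzip_records(data: bytes):
    """Yield the decompressed payload of each gzip member found in data."""
    pos = data.find(GZIP_MAGIC)
    while pos != -1:
        # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
        decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
        try:
            payload = decomp.decompress(data[pos:])
        except zlib.error:
            # The magic bytes were a false positive; keep scanning.
            pos = data.find(GZIP_MAGIC, pos + 1)
            continue
        if not decomp.eof:
            break  # truncated final record: stop rather than guess
        yield payload
        # Bytes after the member's trailer are left in unused_data.
        consumed = len(data) - pos - len(decomp.unused_data)
        pos = data.find(GZIP_MAGIC, pos + consumed)


if __name__ == "__main__":
    records = [b"<html>first page</html>", b"<html>second page</html>"]
    arc_like = b"".join(gzip.compress(r) for r in records)
    for payload in iter_gzip_records(arc_like):
        print(payload)
```

Because the scan resumes after the end of each decompressed
member, byte sequences inside compressed data that happen to
look like a gzip header are never mistaken for a record
start.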



Updated: (NUTCH-565) Arc File to Nutch Segments Converter
2007-10-11 16:32:50
     [ https://issues.apache.org/jira/browse/NUTCH-565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-565:
-------------------------------

    Attachment:     (was: archive-commons-1.11.0-200612262257.jar)

> Arc File to Nutch Segments Converter
> ------------------------------------
>
>                 Key: NUTCH-565
>                 URL: https://issues.apache.org/jira/browse/NUTCH-565
>             Project: Nutch
>          Issue Type: Improvement
>         Environment: all
>            Reporter: Dennis Kubes
>            Assignee: Dennis Kubes
>             Fix For: 1.0.0
>
>         Attachments: arcsegments2.patch, nutch-565-1-20071009.patch
>
>
> Functionality that allows arc files, such as those produced
> by the internet archive project or by the Grub distributed
> crawler to be parsed into Nutch segments.


