|
|
| Created: (NUTCH-565) Arc File to Nutch
Segments Converter |
  United States |
2007-10-09 00:08:50 |
Arc File to Nutch Segments Converter
------------------------------------
Key: NUTCH-565
URL: https://issues.apache.org/jira/browse/NUTCH-565
Project: Nutch
Issue Type: Improvement
Environment: all
Reporter: Dennis Kubes
Assignee: Dennis Kubes
Fix For: 1.0.0
Functionality that allows arc files, such as those produced
by the Internet Archive project or by the Grub distributed
crawler, to be parsed into Nutch segments.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue
online.
|
|
| Updated: (NUTCH-565) Arc File to Nutch
Segments Converter |
  United States |
2007-10-09 00:11:50 |
[ https://issues.apache.org/jira/browse/NUTCH-565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dennis Kubes updated NUTCH-565:
-------------------------------
Attachment: archive-commons-1.11.0-200612262257.jar
Archive commons jar needed for reading arc files.
|
|
| Updated: (NUTCH-565) Arc File to Nutch
Segments Converter |
  United States |
2007-10-09 00:18:50 |
[ https://issues.apache.org/jira/browse/NUTCH-565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dennis Kubes updated NUTCH-565:
-------------------------------
Attachment: nutch-565-1-20071009.patch
An arc file input format, record reader, and utility to
convert arc files to nutch segments. The conversion utility
acts in place of the fetcher to convert compressed web pages
in arc files into the standard nutch segments format. All
current fetcher rules for url filtering and normalization as
well as content parsing still apply. Currently only
text/html content types are supported within the arc files.
This functionality is meant to be used with hadoop-0.14 or
higher.
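
The content-type restriction described above can be sketched as follows. This is only an illustration, not code from the attached patch: it assumes the version 1 ARC record header layout (URL, IP address, archive date, content type, length, separated by spaces), and the class and method names (ArcHeaderFilter, ArcHeader, parse, isHtml) are hypothetical.

```java
import java.util.Locale;
import java.util.Optional;

// Hypothetical helper, not part of the Nutch patch: parses one ARC record
// header line and decides whether the record would be kept (text/html only).
class ArcHeaderFilter {

    // Fields of an assumed ARC v1 record header line.
    static final class ArcHeader {
        final String url;
        final String ip;
        final String date;
        final String contentType;
        final long length;

        ArcHeader(String url, String ip, String date, String contentType, long length) {
            this.url = url;
            this.ip = ip;
            this.date = date;
            this.contentType = contentType;
            this.length = length;
        }
    }

    // Returns the parsed header, or empty if the line does not have five
    // whitespace-separated fields with a numeric length in the last one.
    static Optional<ArcHeader> parse(String line) {
        String[] parts = line.trim().split("\\s+");
        if (parts.length != 5) {
            return Optional.empty();
        }
        try {
            long length = Long.parseLong(parts[4]);
            return Optional.of(new ArcHeader(parts[0], parts[1], parts[2], parts[3], length));
        } catch (NumberFormatException e) {
            return Optional.empty();
        }
    }

    // True when the declared content type starts with text/html (case-insensitive).
    static boolean isHtml(ArcHeader header) {
        return header.contentType.toLowerCase(Locale.ROOT).startsWith("text/html");
    }
}
```

A caller would skip any record whose header fails to parse or for which isHtml returns false, and hand the rest to the usual URL filtering, normalization, and parsing steps.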
|
|
| Updated: (NUTCH-565) Arc File to Nutch
Segments Converter |
  United States |
2007-10-09 13:59:50 |
[ https://issues.apache.org/jira/browse/NUTCH-565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dennis Kubes updated NUTCH-565:
-------------------------------
Attachment: fastutil-5.0.3-heritrix-subset-1.0.jar
Also requires some fastutil classes.
|
|
| Commented: (NUTCH-565) Arc File to
Nutch Segments Converter |
  United States |
2007-10-09 14:07:50 |
[ https://issues.apache.org/jira/browse/NUTCH-565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12533458 ]
Sami Siren commented on NUTCH-565:
----------------------------------
What are the licenses for those jars?
|
|
| Commented: (NUTCH-565) Arc File to
Nutch Segments Converter |
  United States |
2007-10-09 15:45:51 |
[ https://issues.apache.org/jira/browse/NUTCH-565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12533499 ]
Dennis Kubes commented on NUTCH-565:
------------------------------------
Currently the input format uses 1 map task per arc file.
This could be improved in the future by breaking a file into
multiple map tasks.
|
|
| Commented: (NUTCH-565) Arc File to
Nutch Segments Converter |
  United States |
2007-10-09 15:45:50 |
[ https://issues.apache.org/jira/browse/NUTCH-565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12533498 ]
Dennis Kubes commented on NUTCH-565:
------------------------------------
Both jars are LGPL. The archive-commons is from archive.org
and is currently used in NutchWax. The fastutil jar is a
subset of fastutil classes used by archive.org.
|
|
| Commented: (NUTCH-565) Arc File to
Nutch Segments Converter |
  United States |
2007-10-10 11:32:50 |
[ https://issues.apache.org/jira/browse/NUTCH-565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12533787 ]
Sami Siren commented on NUTCH-565:
----------------------------------
> Both jars are LGPL.
I think that prohibits direct inclusion then. Take a look at
http://people.apache.org/~rubys/3party.html for available
options.
|
|
| Updated: (NUTCH-565) Arc File to Nutch
Segments Converter |
  United States |
2007-10-11 16:30:50 |
[ https://issues.apache.org/jira/browse/NUTCH-565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dennis Kubes updated NUTCH-565:
-------------------------------
Attachment: arcsegments2.patch
Here is the updated patch. It works without any archive.org or
other LGPL code, so it can be included in Nutch. Since arcs
are simply tars of gzips, it scans through the arc file for the
gzip header, starts input there once found, and unzips
each record in turn. It takes about 40 min to process a
single file, which outputs ~1G in segments. Multiple files
can be run at once on a Hadoop cluster.
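
The scan-and-unzip approach described above can be illustrated with a minimal sketch. It is not the code in arcsegments2.patch: it assumes the arc file is a sequence of concatenated gzip members (one per record), it loads the whole file into memory for brevity, and the names GzipMemberScanner, findHeader, headerLength, readRecords, and gzip are hypothetical. Only java.util.zip from the JDK is used.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.DataFormatException;
import java.util.zip.GZIPOutputStream;
import java.util.zip.Inflater;

// Hypothetical sketch: find each gzip member in a byte array and inflate it.
class GzipMemberScanner {

    // Index of the next gzip magic (1f 8b) followed by the deflate method byte (08), or -1.
    static int findHeader(byte[] data, int from) {
        for (int i = from; i + 2 < data.length; i++) {
            if ((data[i] & 0xff) == 0x1f && (data[i + 1] & 0xff) == 0x8b && data[i + 2] == 8) {
                return i;
            }
        }
        return -1;
    }

    // Length of the gzip member header at off, including optional fields (RFC 1952).
    static int headerLength(byte[] data, int off) {
        int flags = data[off + 3] & 0xff;
        int pos = off + 10;                       // fixed part of the header
        if ((flags & 0x04) != 0) {                // FEXTRA: 2-byte length + payload
            int xlen = (data[pos] & 0xff) | ((data[pos + 1] & 0xff) << 8);
            pos += 2 + xlen;
        }
        if ((flags & 0x08) != 0) {                // FNAME: zero-terminated
            while (data[pos++] != 0) { }
        }
        if ((flags & 0x10) != 0) {                // FCOMMENT: zero-terminated
            while (data[pos++] != 0) { }
        }
        if ((flags & 0x02) != 0) {                // FHCRC: 2 bytes
            pos += 2;
        }
        return pos - off;
    }

    // Inflates every gzip member found in data, in order; false matches are skipped.
    static List<byte[]> readRecords(byte[] data) {
        List<byte[]> records = new ArrayList<>();
        int pos = findHeader(data, 0);
        while (pos >= 0) {
            Inflater inflater = new Inflater(true); // raw deflate, header handled above
            try {
                int bodyStart = pos + headerLength(data, pos);
                inflater.setInput(data, bodyStart, data.length - bodyStart);
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buf = new byte[8192];
                while (!inflater.finished()) {
                    int n = inflater.inflate(buf);
                    if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                        throw new DataFormatException("truncated gzip member");
                    }
                    out.write(buf, 0, n);
                }
                records.add(out.toByteArray());
                // Skip the consumed deflate data plus the 8-byte trailer (CRC32 + ISIZE).
                int consumed = data.length - bodyStart - inflater.getRemaining();
                pos = findHeader(data, bodyStart + consumed + 8);
            } catch (DataFormatException | IndexOutOfBoundsException e) {
                pos = findHeader(data, pos + 1);  // not a real member; keep scanning
            } finally {
                inflater.end();
            }
        }
        return records;
    }

    // Demo helper: gzip a string into a single member.
    static byte[] gzip(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }
}
```

Inflating each member separately with Inflater, rather than wrapping the whole file in one GZIPInputStream, keeps record boundaries visible. A streaming reader would be needed for real arc files, which are far too large to hold in memory.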
|
|
| Updated: (NUTCH-565) Arc File to Nutch
Segments Converter |
  United States |
2007-10-11 16:32:50 |
[ https://issues.apache.org/jira/browse/NUTCH-565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dennis Kubes updated NUTCH-565:
-------------------------------
Attachment: (was: archive-commons-1.11.0-200612262257.jar)
> Attachments: arcsegments2.patch,
nutch-565-1-20071009.patch
|
|