Thread: Created: (NUTCH-447) Dmoz Structure Parser Tool




Created: (NUTCH-447) Dmoz Structure Parser Tool
2007-02-20 15:03:06
Dmoz Structure Parser Tool
--------------------------

                 Key: NUTCH-447
                 URL: https://issues.apache.org/jira/browse/NUTCH-447
             Project: Nutch
          Issue Type: New Feature
    Affects Versions: 0.9.0
         Environment: all platforms
            Reporter: Dennis Kubes
         Assigned To: Dennis Kubes
            Priority: Minor


This is a tool that takes the dmoz structure RDF file
and returns a listing of the categories.  The categories
returned can be limited by depth or by a regular expression
pattern.  This tool borrows heavily from the DmozParser.
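
As a rough illustration of the depth and pattern limits described
above (this is not the code in the patch), here is a minimal sketch.
It assumes the categories have already been read from the RDF file
as path strings such as "Top/Arts/Music", and that depth means the
number of '/'-separated segments.  The class and method names are
hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class CategoryFilterSketch {

    // Assumption: depth is the number of '/'-separated segments,
    // so "Top/Arts/Music" has depth 3.
    static int depth(String category) {
        return category.split("/").length;
    }

    // Keeps categories that are no deeper than maxDepth (a value <= 0
    // means no depth limit) and that fully match the pattern (null
    // means no pattern filter).
    static List<String> filter(List<String> categories, int maxDepth, Pattern pattern) {
        List<String> kept = new ArrayList<>();
        for (String category : categories) {
            boolean depthOk = maxDepth <= 0 || depth(category) <= maxDepth;
            boolean patternOk = pattern == null || pattern.matcher(category).matches();
            if (depthOk && patternOk) {
                kept.add(category);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> categories = Arrays.asList(
            "Top", "Top/Arts", "Top/Arts/Music", "Top/Science/Physics");

        // Limit by depth only.
        System.out.println(filter(categories, 2, null));
        // prints [Top, Top/Arts]

        // Limit by regular expression only.
        System.out.println(filter(categories, 0, Pattern.compile("Top/Arts.*")));
        // prints [Top/Arts, Top/Arts/Music]
    }
}
```

The two limits are independent here, so either one can be switched
off; whether the patch combines them the same way is not stated in
this issue.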

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue
online.


Updated: (NUTCH-447) Dmoz Structure Parser Tool
2007-02-20 15:05:05
     [ https://issues.apache.org/jira/browse/NUTCH-447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-447:
-------------------------------

    Attachment: dmoz-structure.patch

Patch that contains the DmozStructureParser class.



Commented: (NUTCH-447) Dmoz Structure Parser Tool
2007-02-21 03:26:05
    [ https://issues.apache.org/jira/browse/NUTCH-447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12474663 ]

Otis Gospodnetic commented on NUTCH-447:
----------------------------------------

Is the idea to limit crawling to links under a certain
category, as opposed to crawling all links in Dmoz?




Commented: (NUTCH-447) Dmoz Structure Parser Tool
2007-02-21 08:29:19
    [ https://issues.apache.org/jira/browse/NUTCH-447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12474713 ]

Dennis Kubes commented on NUTCH-447:
------------------------------------

This tool is for people who need a defined category
structure or want to grab all or part of the dmoz category
structure without URLs.  You could certainly then use this
list as the topic list in the DmozParserTool to crawl only
under a certain category.


