
Thread: De-Weighting Outbound Anchor Text




De-Weighting Outbound Anchor Text
2007-10-21 22:57:13
I would like to reduce (or eliminate altogether) Nutch's use of a
page's outbound anchor text when determining similarity to the user's
query. For instance, if a page has an outbound link to another site with
the anchor text "new jersey", and "new jersey" isn't mentioned anywhere
else on the page, I don't want it to be considered a valid response to a
query for "new jersey". I know you can adjust the weightings of other
properties, but I did not see anything regarding outbound anchors.

On a separate but similar note, I'd like to consider INCLUDING meta
keyword and description tags. Has anyone done that before?
-- 
View this message in context:
http://www.nabble.com/De-Weighting-Outbound-Anchor-Text-tf4668566.html#a13336341
Sent from the Nutch - User mailing list archive at Nabble.com.


Re: De-Weighting Outbound Anchor Text
2007-10-22 02:05:11
Hey,

HtmlParseFilter plugins decide what gets parsed out of the HTML
(for example, the languageidentifier plugin).
Similarly, IndexingFilter plugins decide what gets indexed
(for example, the index-basic plugin).

So, to ignore the anchor text of outbound links, you can implement a
custom HtmlParseFilter plugin.
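
As a rough illustration of what such a plugin might do, here is a minimal sketch of the DOM walk only. It is not Nutch code: the class OutboundAnchorSketch and all of its methods are hypothetical names, the "different host means outbound" rule is an assumption, and the demo parses a well-formed XHTML string with the JDK parser instead of receiving the DOM from Nutch. A real plugin would run similar logic inside its HtmlParseFilter implementation. The metaContent helper shows one possible way to read the meta keywords and description tags asked about in the original post.

```java
import java.io.ByteArrayInputStream;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/** Hypothetical helper, NOT part of Nutch: DOM logic a custom parse filter might reuse. */
final class OutboundAnchorSketch {

    /** Appends the text under node to out, skipping <a> elements that link off-site. */
    static void collectText(Node node, String pageHost, StringBuilder out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                if (out.length() > 0) out.append(' ');
                out.append(text);
            }
            return;
        }
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "a".equalsIgnoreCase(node.getNodeName())
                && isOutbound(((Element) node).getAttribute("href"), pageHost)) {
            return; // drop the anchor text of off-site links
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collectText(children.item(i), pageHost, out);
        }
    }

    /** Assumption: a link is outbound when it is absolute and names a different host. */
    static boolean isOutbound(String href, String pageHost) {
        try {
            String host = URI.create(href).getHost();
            return host != null && !host.equalsIgnoreCase(pageHost);
        } catch (IllegalArgumentException e) {
            return false; // unparsable href: treat as on-site and keep its text
        }
    }

    /** Returns the content attribute of <meta name="...">, or "" if there is none. */
    static String metaContent(Document doc, String name) {
        NodeList metas = doc.getElementsByTagName("meta");
        for (int i = 0; i < metas.getLength(); i++) {
            Element meta = (Element) metas.item(i);
            if (name.equalsIgnoreCase(meta.getAttribute("name"))) {
                return meta.getAttribute("content");
            }
        }
        return "";
    }

    /** Demo-only parser for well-formed XHTML strings. */
    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse demo page", e);
        }
    }

    static String textWithoutOutboundAnchors(String xhtml, String pageHost) {
        StringBuilder out = new StringBuilder();
        collectText(parse(xhtml), pageHost, out);
        return out.toString();
    }

    public static void main(String[] args) {
        String page = "<html><head><meta name=\"keywords\" content=\"travel, hotels\"/></head>"
                + "<body><p>Cheap hotels.</p>"
                + "<a href=\"http://other.example/nj\">new jersey</a>"
                + "<a href=\"/about\">about us</a></body></html>";
        System.out.println(textWithoutOutboundAnchors(page, "www.example.com"));
        System.out.println(metaContent(parse(page), "keywords"));
    }
}
```

With the sample page in main, the "new jersey" anchor text is left out of the extracted text, while the on-site "about us" link text is kept. Note that this host comparison treats subdomains as outbound, which may or may not be what you want.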






grif wrote:
> I would like to be able to reduce (or eliminate altogether) Nutch
> using a page's outbound anchor text when determining similarity to
> the user's query. For instance, if a page has an outbound link to
> another site with the anchor text "new jersey", and "new jersey"
> isn't mentioned anywhere else on the page, I don't want it to be
> considered a valid response to a query for "new jersey". I know you
> can adjust the weightings of other different properties, but I did
> not see anything regarding outbound anchors.
>
> On a separate but similar note, I'd like to consider INCLUDING meta
> keyword and description tags. Has anyone done that before?
>   

