Thread: Nearly everything is either 0.500000 or 1.000000




Nearly everything is either 0.500000 or 1.000000
user name
2007-08-15 07:48:33
Hi,

I've been using Bogofilter for a few years now and I'm quite happy with
it. I receive lots of spam, have had only two or three false positives
and not too many false negatives.

But right now, after batch training Bogofilter with about 15000 spams
filtered by other means, I've observed a strange thing: *All* mails get
a bogosity of either 0.000000 (very rare, always ham mails), 0.500000
(nearly all good mails and all false negatives) or 1.000000 (all spam
mails). This was certainly not the case before the last training batch:
I had good mails always at or very near 0, spam at or near 1 and false
negatives somewhere in between. As it should be.

Baffling. I decided to have a look at some mails with "bogofilter -vvv":
It looks as if headers found identically in all mails (spam and ham)
drag the bogosity heavily towards "spammy", so that even otherwise
clearly good mails go to 0.500000. These headers are inserted by the
last two SMTP servers on the way in, so they are present in all mails.
This in turn led me to look at the message counts:

$ bogoutil -p .bogofilter/wordlist.db .MSG_COUNT
                                  spam    good    Fisher
.MSG_COUNT                     126749    5761  0.500000

Eek! This explains it somewhat: there is more than twenty times as much
spam as ham, so identical tokens found in all mails will drag the
bogosity up. But wait: I have received *many* more good mails than just
5761! I have about 25000 mails lying around right now (and I don't keep
everything).

Hmm. Then I remembered that I had set thresh_update=0.01 in
~/.bogofilter.cf, which causes clearly good or clearly spammy mails not
to be registered at all. Since good mails were nearly always at 0, they
didn't register anymore. Together with my yearly cleanup of old tokens
not used for a while *and* the recent batch training with 15000 spams,
I now have a database full of spam tokens and quite void of ham
tokens...
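
A minimal sketch of how such a threshold could starve the wordlist of
ham, assuming a message is skipped whenever its score lies within
thresh_update of 0.0 or 1.0. The helper name is_registered, the exact
boundary handling and the sample scores are all illustrative
assumptions, not bogofilter's own code or data:

```python
def is_registered(score, thresh_update):
    """Return True if an auto-updated message would be added to the wordlist.

    Assumption: messages scoring within thresh_update of 0.0 or 1.0 are
    skipped; whether the boundaries are inclusive is a guess.
    """
    return thresh_update < score < 1.0 - thresh_update


# Hypothetical scores: confidently classified ham and spam plus two unsure mails.
scores = [0.000001, 0.000300, 0.004, 0.52, 0.97, 0.999999]
registered = [s for s in scores if is_registered(s, thresh_update=0.01)]
print(registered)  # only the two unsure messages are kept
```

With ham scoring almost always near 0, nearly none of it would pass this
check, which matches the stalled "good" count described above.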

What to do now? Just wait for Bogofilter to catch up on tokens from good
mail (since they are far away from 0 now, they will be registered
again)? Toss away the database and completely retrain with all good mail
and an equal amount of spam (is 1:1 a good idea anyway?)? Manually
remove all the common header tokens from the database to make the
actually meaningful tokens stand out more?

I have to say that while good mails are now at 0.500000, there's still a
good distance to the spam cutoff of 0.99, so I don't really fear false
positives. But seeing false negatives (which are clearly spammy to the
eye) *and* perfectly good mail registering exactly the same bogosity
makes me somewhat uneasy.


	Jochem

-- 
When the revolution comes, I will be shot by both sides.



_______________________________________________
Bogofilter mailing list
Bogofilter@bogofilter.org
http://www.bogofilter.org/mailman/listinfo/bogofilter

Re: Nearly everything is either 0.500000 or 1.000000
user name
United States
2007-08-15 10:58:48
At first glance, it seems to me that one or two headers should not have
that kind of effect.  Moving from 0 to 0.5 would require something else
than all of a sudden having a few tokens slightly more spammy than
before.  Are you classifying on the headers only?  Run a ham through
with -vvv and see what all of the body tokens are contributing.
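
To illustrate why a header that really is in every registered message
should stay neutral, here is a rough sketch of a Robinson-style
per-token estimate that normalizes counts by the message totals. The
function name token_prob, the prior values robs=0.0178 and robx=0.52,
and the lopsided counts in the second example are assumptions made for
illustration; only the two .MSG_COUNT totals come from this thread:

```python
def token_prob(bad, good, nbad, ngood, robs=0.0178, robx=0.52):
    """Robinson-style spam probability estimate for a single token.

    bad/good: number of spam/ham messages containing the token.
    nbad/ngood: total number of registered spam/ham messages.
    robs/robx: assumed prior strength and prior value.
    """
    n = bad + good
    if n == 0:
        return robx  # unseen token: fall back to the prior
    spam_rate = bad / nbad
    ham_rate = good / ngood
    p = spam_rate / (spam_rate + ham_rate)
    return (robs * robx + n * p) / (robs + n)


NBAD, NGOOD = 126749, 5761  # .MSG_COUNT totals quoted in this thread

# A header token present in every registered message, spam and ham alike.
everywhere = token_prob(NBAD, NGOOD, NBAD, NGOOD)

# Made-up counts: a token registered in 15000 spams but only 50 hams.
lopsided = token_prob(15000, 50, NBAD, NGOOD)

print(f"everywhere={everywhere:.4f} lopsided={lopsided:.4f}")
```

Under these assumptions the 22:1 message ratio alone does not move a
token that appears everywhere; only tokens whose registered counts are
themselves lopsided lean spammy, which is what the -vvv output should
reveal.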

As a quick solution, if it were me, I would just grab my entire archive
of hams and run it through training once.

BTW, this is why I never do batch training in the first place.  Just
train on error and you should never have problems like this.

Tom



Re: Nearly everything is either 0.500000 or 1.000000
user name
2007-08-18 03:42:11
On 2007-08-15, at 17:58, Tom Anderson wrote:

> At first glance, it seems to me that one or two headers should not
> have that kind of effect.  Moving from 0 to 0.5 would require
> something else than all of a sudden having a few tokens slightly more
> spammy than before.  Are you classifying on the headers only?

No, I'm classifying on headers and body.

> Run a ham through with -vvv and see what all of the body tokens are
> contributing.

Already did that, but it didn't help me much... looks quite OK actually.

> As a quick solution, if it were me, I would just grab my entire
> archive of hams and run it through training once.

Meanwhile I have created a fresh database with all my ham and 450000
spam mails -- bogofilter behaves more normally now, with good mail near
0, most spam near 1 and false negatives in between. I still have spam
getting through with a bogosity of about 0.5, but these contain large
amounts of (hidden) text which drags the bogosity down, so bogofilter
seems to be doing its best.
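
A small sketch of how padding can pull a spam to the middle, assuming
scores are combined with the Robinson-Fisher chi-square method. This is
a simplified illustration, not bogofilter's actual code; the function
names and the token probabilities 0.99 and 0.01 are made up:

```python
import math


def chi2_q(x, dof):
    """Upper-tail chi-square probability for an even number of degrees of freedom.

    Fine for small examples; exp() underflows to 0.0 for very large x.
    """
    half = x / 2.0
    term = math.exp(-half)
    total = term
    for i in range(1, dof // 2):
        term *= half / i
        total += term
    return min(total, 1.0)


def fisher_score(probs):
    """Combine per-token spam probabilities into one score between 0 and 1."""
    n = len(probs)
    # Close to 1 when the tokens are consistently spammy, close to 0 otherwise.
    spammy = chi2_q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    # Close to 1 when the tokens are consistently hammy, close to 0 otherwise.
    hammy = chi2_q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    return (1.0 + spammy - hammy) / 2.0


spam_tokens = [0.99] * 20    # hypothetical strongly spammy tokens
hidden_text = [0.01] * 20    # hypothetical hammy padding hidden in the mail

print(round(fisher_score(spam_tokens), 6))
print(round(fisher_score(hidden_text), 6))
print(round(fisher_score(spam_tokens + hidden_text), 6))
```

When strong evidence points both ways, both tail probabilities collapse
towards 0 and the combined score settles at about 0.5, so a spam stuffed
with enough innocent-looking text ends up in the unsure range rather
than near 1.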

> BTW, this is why I never do batch training in the first place.  Just
> train on error and you should never have problems like this.

This batch *was* training on error: it was all false negatives caught by
other filters running after bogofilter (in my MUA).

Thanks for replying,


	Jochem

-- 
When the revolution comes, I will be shot by both sides.



