Thread: Nearly everything is either 0.500000 or 1.000000

Nearly everything is either 0.500000 or 1.000000
2007-08-15 07:48:33

Hi,

I've been using Bogofilter for a few years now and I'm quite happy with
it. I receive lots of spam, have had only two or three false positives
and not too many false negatives.

But right now, after batch training Bogofilter with about 15000 spams
filtered by other means, I've observed a strange thing: *All* mails get
a bogosity of either 0.000000 (very rare, always ham mails), 0.500000
(nearly all good mails and all false negatives) or 1.000000 (all spam
mails). This was certainly not the case before the last training batch:
I had good mails always at or very near 0, spam at or near 1 and false
negatives somewhere in between. As it should be.

Baffling. I decided to have a look at some mails with "bogofilter -vvv":
It looks as if headers found identically in all mails (spam and ham)
drag the bogosity heavily towards "spammy", so that even otherwise
clearly good mails go to 0.500000. These headers are inserted by the
last two SMTP servers on the way in, so they are present in all mails.
This in turn led me to look at the message counts:

$ bogoutil -p .bogofilter/wordlist.db .MSG_COUNT
                 spam    good    Fisher
.MSG_COUNT     126749    5761  0.500000

Eek! This explains it somehow: there is more than twenty times as much
spam as ham, so identical tokens found in all mails will drag the
bogosity up. But wait: I have received *many* more good mails than just
5761! I have about 25000 mails lying around right now (and I don't keep
everything).
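
As a quick check of the arithmetic above, here is a small sketch that
works out the spam-to-ham ratio from the .MSG_COUNT output and the
fraction of the stored good mail that can have been registered. Only
the three numbers come from the post (25000 is the approximate figure
given there); the variable names are made up for illustration.

```python
# Numbers quoted in the post; everything else is illustrative.
spam_msgs = 126749     # .MSG_COUNT, spam column
good_msgs = 5761       # .MSG_COUNT, good column
ham_on_disk = 25000    # "about 25000 mails", an approximate figure

spam_to_ham = spam_msgs / good_msgs
ham_share = good_msgs / (spam_msgs + good_msgs)
registered_fraction = good_msgs / ham_on_disk

print(f"spam:ham ratio           {spam_to_ham:.1f} : 1")
print(f"ham share of all counts  {ham_share:.1%}")
print(f"stored ham registered    at most {registered_fraction:.0%}")
```

This gives a ratio of roughly 22:1, with good mail making up a bit over
4% of the registered messages and less than a quarter of the stored
good mail accounted for in the database.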

Hmm. Then I remembered that I had set thresh_update=0.01 in
~/.bogofilter.cf, which led to clearly good or spammy mails not being
registered at all, and since good mails were nearly always at 0, they
didn't register anymore. Together with me cleaning old, long-unused
tokens out of the database once a year *and* the recent batch training
with 15000 spams, I now have a database full of spam tokens and quite
void of ham tokens...
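
To make the effect of that setting concrete, here is a minimal sketch of
the skip rule as described above: a message is only registered when its
score is further than thresh_update from both 0 and 1. The function name
is hypothetical and this is not bogofilter's code; in particular, how
the real option treats a score lying exactly on the boundary is an
assumption.

```python
def should_register(score: float, thresh_update: float = 0.01) -> bool:
    """Return True if an auto-updating filter would register the message.

    Hypothetical helper modelling the behaviour described in the post:
    scores within thresh_update of 0.0 or 1.0 are skipped. Treating the
    boundary itself as "skipped" is an assumption.
    """
    near_ham = score <= thresh_update
    near_spam = score >= 1.0 - thresh_update
    return not (near_ham or near_spam)


# Scores like the ones in the post: clear ham, unsure, clear spam.
for score in (0.000000, 0.500000, 1.000000):
    print(f"{score:.6f} -> registered: {should_register(score)}")
```

Under this rule a mailbox whose good mail always scores at or very near
0 contributes nothing to the good count, while anything scoring in the
middle is still registered.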

What to do now? Just wait for Bogofilter to catch up on tokens from good
mail (since they are far away from 0 now, they will be registered
again)? Toss the database and completely retrain with all good mail and
an equal amount of spam (is 1:1 a good idea anyway?)? Manually remove
all the common header tokens from the database to make the actually
meaningful tokens stand out more?

I have to say that while good mails are now at 0.500000, there's still a
good distance to the spam cutoff of 0.99, so I don't really fear false
positives. But seeing false negatives (which are clearly spammy to the
eye) *and* perfectly good mail both registering exactly the same
bogosity makes me somewhat uneasy.

Jochem
--
When the revolution comes, I will be shot by both sides.
_______________________________________________
Bogofilter mailing list
Bogofilter bogofilter.org
http://www.bogofilter.org/mailman/listinfo/bogofilter

Re: Nearly everything is either 0.500000 or 1.000000
United States
2007-08-15 10:58:48

At first glance, it seems to me that one or two headers should not have
that kind of effect. Moving from 0 to 0.5 would require something more
than a few tokens suddenly being slightly more spammy than before. Are
you classifying on the headers only? Run a ham through with -vvv and
see what all of the body tokens are contributing.

As a quick solution, if it were me, I would just grab my entire archive
of hams and run it through training once.

BTW, this is why I never do batch training in the first place. Just
train on error and you should never have problems like this.

Tom

Jochem Huhmann wrote:
> [...]

Re: Nearly everything is either 0.500000 or 1.000000
2007-08-18 03:42:11

On 2007-08-15, at 17:58, Tom Anderson wrote:
> At first glance, it seems to me that one or two headers should not
> have that kind of effect. Moving from 0 to 0.5 would require
> something else than all of a sudden having a few tokens slightly more
> spammy than before. Are you classifying on the headers only?

No, I'm classifying on headers and body.

> Run a ham through with -vvv and see what all of the body tokens are
> contributing.

Already did that, but it didn't help me much... looks quite OK actually.

> As a quick solution, if it were me, I would just grab my entire
> archive of hams and run it through training once.

In the meantime I created a fresh database with all my ham and 450000
spam mails. Bogofilter behaves more normally now, with good mail near 0,
most spam near 1 and false negatives in between. I still have spam
getting through with a bogosity of about 0.5, but these contain large
amounts of (hidden) text which drags the bogosity down, so bogofilter
seems to do its best.
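
The effect of hidden "good-looking" text can be illustrated with a
sketch of the Robinson-Fisher chi-square combination, which the "Fisher"
column in the bogoutil output suggests is in use. The code follows the
commonly published form of that method rather than bogofilter's actual
source, and the token probabilities (0.99 and 0.01) and token counts are
invented for the example.

```python
import math


def chi2q(x2: float, dof: int) -> float:
    """Chi-square survival function, valid for even degrees of freedom."""
    m = x2 / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def fisher_score(token_probs):
    """Combine per-token spam probabilities into one score in [0, 1]."""
    n = len(token_probs)
    ln_spam = sum(math.log(p) for p in token_probs)
    ln_ham = sum(math.log(1.0 - p) for p in token_probs)
    spam_evidence = 1.0 - chi2q(-2.0 * ln_ham, 2 * n)
    ham_evidence = 1.0 - chi2q(-2.0 * ln_spam, 2 * n)
    return (1.0 + spam_evidence - ham_evidence) / 2.0


spammy = [0.99] * 20   # invented: tokens from the visible spam text
hammy = [0.01] * 20    # invented: tokens from hidden filler text

print(f"spam tokens only:   {fisher_score(spammy):.6f}")
print(f"spam + hidden text: {fisher_score(spammy + hammy):.6f}")
```

With only the spammy tokens the score comes out at about 1; once an
equal number of strongly hammy tokens is added, both kinds of evidence
are close to certain, they cancel, and the score lands at about 0.5. The
same cancellation would apply to any message that carries strong tokens
of both kinds.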

> BTW, this is why I never do batch training in the first place. Just
> train on error and you should never have problems like this.

This batch *was* training on error: it was all false negatives caught
by other filters running after bogofilter (in my MUA).

Thanks for replying,
Jochem
--
When the revolution comes, I will be shot by both sides.