
Thread: Bayes_00 pain




Bayes_00 pain
2006-08-23 13:41:47
Robert LeBlanc wrote:

> While I would still call it "experimental" at this stage, that's mostly
> because it's being developed very rapidly.  The version I'm using in
> production is the one I describe in the wiki (2.1c), but there are
> already beta versions in the 2.2 series, and alphas in the 2.3 series,
> with new experimental releases becoming available at a rate of one or
> two a day.  Clearly this is an area receiving a lot of attention at the
> moment, and there's a mailing list called "Devel-Spam"
> <http://lists.own-hero.net/mailman/listinfo/devel-spam> you can
> subscribe to if you want to keep up with the bleeding edge of its
> development.
> 
> The 2.1 series is quite stable and works quite well for most purposes.

What does "quite" mean in this context? False negatives? Crashing
binaries? Stopped mail-delivery? Need for manual intervention?

> In terms of the extra load and resource usage, it's minor because of the
> fact that the OCR plugin only gets invoked on mail that contains inline
> images.  For those particular emails, it adds 2-4 seconds of processing
> time, but since those emails represent a very small fraction of the
> total mail volume, the average increase in processing time works out to a
> few milliseconds per item, or a few (i.e. < 10) extra processor-minutes
> per day.
> 
> The decision to implement OCR in a production environment at this stage
> is obviously your call, but with the 2.1 stable series I don't see the
> harm in it, unless perhaps your server is very close to its resource
> limits as it is.

Ok, this means I have no problems with CPU/RAM ...
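
The load estimate quoted above can be checked with a little arithmetic. This is only a rough sketch: the daily mail volume and the share of mail carrying inline images are made-up illustrative numbers, not figures from anyone's server in this thread. Only the 2-4 seconds per image mail comes from the quoted text.

```python
# Back-of-the-envelope check of the OCR overhead described above.
# ASSUMPTIONS (not from the thread): daily_mails and image_fraction.
# From the thread: OCR adds roughly 2-4 seconds per mail with inline images.

def ocr_overhead(daily_mails, image_fraction, seconds_per_image_mail):
    """Return (extra CPU minutes per day, average extra ms per mail)."""
    image_mails = daily_mails * image_fraction
    extra_seconds = image_mails * seconds_per_image_mail
    minutes_per_day = extra_seconds / 60
    avg_ms_per_mail = extra_seconds * 1000 / daily_mails
    return minutes_per_day, avg_ms_per_mail

minutes, avg_ms = ocr_overhead(
    daily_mails=100_000,        # hypothetical volume
    image_fraction=0.0015,      # hypothetical: 0.15% of mail has inline images
    seconds_per_image_mail=3,   # midpoint of the 2-4 s quoted above
)
print(f"{minutes:.1f} CPU-minutes/day, {avg_ms:.1f} ms per mail on average")
```

With these invented inputs the result is about 7.5 CPU-minutes per day and about 4.5 ms per mail, which is in line with the "few milliseconds per item" and "< 10 processor-minutes" quoted above. Plug in your own volume and image share to see where your server lands.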

> You must also weigh this against the prevalence of
> image-spam, of course; if you haven't been receiving much of it yet, you
> probably won't feel much pressure to implement OCR.  Once you /do/ start
> receiving it in larger volumes, however, the pressure may reach a
> tipping point, and you may be willing to accept a bit more risk and a
> bit more resource consumption in order to stem the tide of the image-spam.

Correct. Exactly my point of view, although I wasn't able to verbalize it
in the first mail ... y'know, English isn't my first language.

> As image-spam becomes more pervasive, however, we're eventually /all/
> going to need to implement OCR or something equivalent.  When the spam
> content is entirely within the images, and the text portion of the mail
> contains just non-spammy words and phrases, there's really very little
> else left for us to do but try to extract the spam content from the images.

Yup. I think I am gonna head over to your HOWTO and give it a try. From
what I have seen, this OCR functionality is switched on by that
loadplugin line in SA, so I can still decide to keep it turned off by
default until I get familiar with it.
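
Switching a plugin on or off via its loadplugin line comes down to commenting or uncommenting that one line in the SpamAssassin configuration. Here is a minimal sketch of that idea. The plugin name `ExampleOcr` and the sample config text are placeholders invented for illustration, not the actual name or file used in the HOWTO.

```python
import re

def set_plugin_enabled(cf_text, plugin, enabled):
    """Comment or uncomment the `loadplugin <plugin>` line in config text.

    Note: the result is joined with newlines, so a trailing newline in the
    input is not preserved.
    """
    pattern = re.compile(
        r"^(\s*)#?\s*(loadplugin\s+" + re.escape(plugin) + r"\b.*)$"
    )
    out = []
    for line in cf_text.splitlines():
        m = pattern.match(line)
        if m:
            # Keep the indentation, add or drop the leading "# ".
            line = m.group(1) + ("" if enabled else "# ") + m.group(2)
        out.append(line)
    return "\n".join(out)

# Hypothetical config fragment; ExampleOcr is a placeholder plugin name.
sample = "loadplugin ExampleOcr ExampleOcr.pm\nrequired_score 5.0"
disabled = set_plugin_enabled(sample, "ExampleOcr", enabled=False)
print(disabled)
```

Run it against a copy of the config first, and check the result with `spamassassin --lint` before restarting anything.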

Thanks so far, greetings to you, Robert, and Maia 

Stefan



_______________________________________________
Maia-users mailing list
Maia-usersrenaissoft.com
http://www.renaissoft.com/mailman/listinfo/maia-users
Bayes_00 pain
2006-08-23 13:58:02
Stefan G. Weichinger wrote:
> Robert LeBlanc wrote:
> 
>>The 2.1 series is quite stable and works quite well for most purposes.
> 
> What does "quite" mean in this context? False negatives? Crashing
> binaries? Stopped mail-delivery? Need for manual intervention?

No false positives that I've seen.  No crashing binaries (as long as you
apply the patches mentioned in the wiki document).  No mail delivery
interruptions.  No need for manual intervention at any time.  It's been
very solid.

A few false negatives though, mostly due to spammers experimenting with
strategically malformed images, interlaced GIFs, and more recently
animated GIFs.  That's what motivates the alpha and beta versions of the
OCR plugin.  Both sides are still "feeling each other out", trading
measure for countermeasure.  These are exciting times! ;)

Basically, without the OCR plugin, 100% of these image spams would end
up as false negatives, hitting BAYES_50 or worse.  With the OCR plugin
(2.1c stable), about 5% of them slip through as false negatives.  The
number becomes smaller still with the 2.2 betas and 2.3 alphas of
course, but with more risk of crashes and slowdowns (e.g. the 2.3 alpha
calls the OCR routine up to three times per image at different scan
resolutions, so while it provides a more accurate result and sees
through more kinds of "noise", it also increases processing time and load).
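
The accuracy-versus-load trade-off of scanning at several resolutions can be sketched roughly as follows. This is not the plugin's actual code: the early-exit rule, the `min_chars` threshold, the resolution values, and the `ocr` callback are all assumptions made for illustration. The thread only says the 2.3 alpha calls the OCR routine up to three times per image.

```python
def scan_image(image, ocr, resolutions=(1, 2, 3), min_chars=10):
    """Try OCR at each resolution in turn; stop at the first usable result.

    `ocr` is a caller-supplied function (image, resolution) -> str.
    Returns (text, passes), where passes is the number of OCR calls made.
    The early exit is an assumed policy, not documented plugin behaviour.
    """
    best = ""
    passes = 0
    for res in resolutions:
        passes += 1
        text = ocr(image, res)
        if len(text) > len(best):
            best = text
        if len(best) >= min_chars:
            break  # good enough: skip the remaining, costlier passes
    return best, passes

# Toy stand-in for an OCR engine: only the highest "resolution" sees text.
def fake_ocr(image, resolution):
    return image if resolution >= 3 else ""

print(scan_image("cheap pills here", fake_ocr))
```

In this toy run the "noisy" image needs all three passes, so it costs three times the OCR work of an image that is readable on the first pass. That is the extra load mentioned above.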

-- 
Robert LeBlanc <rjlrenaissoft.com>
Renaissoft, Inc.
Maia Mailguard <http://www.maiamailguard.com/>
