Thread: FuzzyOCR plugin 2.3b released

FuzzyOCR plugin 2.3b released
user name
2006-08-28 13:31:29
The latest release of the FuzzyOCR plugin (2.3b) is out, and I've
updated the wiki accordingly with new installation and configuration
instructions: <https://secure.renaissoft.com/maia/wiki/FuzzyOCR23>.

There are a number of key improvements in this version of the plugin
that make it worth the upgrade, including:

* Handling of interlaced and animated GIFs

As the anti-spam community has come to embrace OCR technologies,
spammers have been working on ways to confuse OCR engines, using
interlaced images and animated GIFs.  With interlaced images, the pixel
rows are stored out of order, in several passes rather than top to
bottom (GIF uses four passes), so tools that can't detect interlaced
images and reconstruct them properly fail to load them.  Animated GIFs
are also becoming more common, since tools that don't know how to
handle them see only the first frame of the animation--so spammers
simply include a blank first frame that lasts a fraction of a second,
followed by a long second frame that contains the spam message.  This
version of the FuzzyOCR plugin uses tools that can properly detect and
handle interlaced and animated GIFs, unpacking the individual frames as
necessary.
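
To make the two tricks concrete, here is a minimal Python sketch that
walks the block structure of a GIF and reports how many frames it holds
and whether any frame has the interlace flag set.  It is not how the
plugin does it (the plugin relies on external tools); the function name
inspect_gif is made up for this illustration, and the parser skips the
compressed pixel data rather than decoding it.

```python
import struct


def inspect_gif(data: bytes):
    """Return (frame_count, any_interlaced) for a GIF held in memory.

    Hypothetical helper for illustration only: it reads the block
    layout and never decodes the LZW-compressed pixel data.
    """
    if data[:6] not in (b"GIF87a", b"GIF89a"):
        raise ValueError("not a GIF")
    # Logical screen descriptor: width, height, packed flags, bg, aspect.
    _w, _h, packed, _bg, _aspect = struct.unpack("<HHBBB", data[6:13])
    pos = 13
    if packed & 0x80:  # a global colour table follows the descriptor
        pos += 3 * (2 ** ((packed & 0x07) + 1))

    def skip_sub_blocks(p: int) -> int:
        # Sub-blocks are a length byte plus that many bytes; 0 ends the run.
        while data[p] != 0:
            p += 1 + data[p]
        return p + 1

    frames, interlaced = 0, False
    while pos < len(data):
        marker = data[pos]
        if marker == 0x3B:  # trailer: end of the GIF stream
            break
        if marker == 0x21:  # extension: marker, label, then sub-blocks
            pos = skip_sub_blocks(pos + 2)
        elif marker == 0x2C:  # image descriptor: one per frame
            flags = data[pos + 9]
            frames += 1
            interlaced = interlaced or bool(flags & 0x40)
            pos += 10
            if flags & 0x80:  # a local colour table follows
                pos += 3 * (2 ** ((flags & 0x07) + 1))
            pos = skip_sub_blocks(pos + 1)  # +1 skips the LZW code size
        else:
            raise ValueError("unexpected block type")
    return frames, interlaced
```

A scanner that treats a result such as (2, True) as "one ordinary
image" would look only at the first, possibly blank, frame.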


* Word list in a separate file

In previous versions of the plugin, the list of target words was stored
in the FuzzyOcr.cf file.  Now it is stored in a separate file
(FuzzyOcr.words) that won't be overwritten during plugin upgrades.


* Hashing database cache for previously-scanned images

On the theory that if you see one instance of a given spam image, you're
likely to see multiple copies of it, this version of the plugin
maintains a local database to cache scan information about the images it
has seen.  The key is not an MD5 hash; it's a collection of image
property data that aims to be an invariant "signature" for a given
image, even if other copies aren't exact pixel-perfect matches.  The
image's score is cached as well, so that if it is seen again in the
future, the plugin won't need to run the OCR engine on it again.
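
The idea can be sketched in a few lines of Python.  The properties that
actually go into the plugin's signature are not listed in this post, so
the ones used here (dimensions plus a coarsely quantized mean colour)
are an assumption chosen only to show why near-identical copies can
share a cache entry.  The names image_signature and cached_score are
hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Pixel = Tuple[int, int, int]
Signature = Tuple[int, int, Tuple[int, ...]]


def image_signature(width: int, height: int, pixels: List[Pixel],
                    bucket: int = 32) -> Signature:
    """Coarse signature: size plus the mean colour rounded into buckets.

    Assumed properties for illustration; small pixel-level noise rarely
    moves the mean colour into a different bucket.
    """
    n = len(pixels)
    coarse_mean = tuple(
        (sum(p[i] for p in pixels) // n) // bucket for i in range(3)
    )
    return (width, height, coarse_mean)


def cached_score(width: int, height: int, pixels: List[Pixel],
                 cache: Dict[Signature, float],
                 run_ocr: Callable[[List[Pixel]], float]) -> float:
    """Return the cached score for a known signature, else scan and store."""
    sig = image_signature(width, height, pixels)
    if sig not in cache:
        cache[sig] = run_ocr(pixels)  # the expensive step, done once
    return cache[sig]
```

Bucketing has an obvious weakness: two copies whose mean colour falls
on either side of a bucket boundary get different signatures, so a real
signature would need several properties and some tolerance.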


* Ability to use multiple, more configurable scan sets

This is perhaps the most powerful addition, as it's a feature that lets
you configure the plugin to run the OCR scanner on the image multiple
times with different resolution and tolerance settings, in order to
catch a wider range of image spam (at the cost of extra processing time,
naturally; you can still configure the plugin to do just one pass, if
you prefer).

The trouble with OCR is that a single pass over the image only tells you
what's visible at a single, fixed resolution, and if the image is
crafted with a very different resolution the OCR scan may see a bunch of
dots instead of letters, or misinterpret "noise" dots as parts of
letters.  If you're only going to do one pass over the image, you've got
to choose a "compromise" resolution--one that will read text in most
cases, but will fail for edge cases.

By making a second pass over the image at a different resolution though,
you can get the best of both worlds by comparing the results from both
scans and choosing the one with the best result.  You could even make a
third pass with yet another set of scanner options if you wanted to test
for even more challenging conditions (e.g. white text on a dark
background, text in multiple colours, etc.).
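
A rough Python sketch of "scan several times, keep the best result"
follows.  Everything in it is an assumption made for illustration: the
scan sets are plain functions returning OCR text, "best" means the most
word-list entries matched, and difflib stands in for whatever
approximate matching the plugin really uses.

```python
from difflib import SequenceMatcher
from typing import Callable, List, Tuple


def count_hits(text: str, words: List[str], threshold: float = 0.75) -> int:
    """Count word-list entries that approximately match some OCR token."""
    tokens = text.lower().split()
    hits = 0
    for word in words:
        if any(SequenceMatcher(None, word, tok).ratio() >= threshold
               for tok in tokens):
            hits += 1
    return hits


def best_scan(image: bytes,
              scan_sets: List[Tuple[str, Callable[[bytes], str]]],
              words: List[str]) -> Tuple[int, str]:
    """Run every scan set on the image and return (hits, name) of the best."""
    results = [(count_hits(scan(image), words), name)
               for name, scan in scan_sets]
    return max(results)


# Stand-in scanners: one setting reads the obfuscated text, one sees noise.
scan_sets = [
    ("coarse", lambda img: "v1agra n0w ch3ap"),
    ("fine", lambda img: ". : ' .,"),
]
print(best_scan(b"", scan_sets, ["viagra", "cheap"]))  # (2, 'coarse')
```

Each extra scan set costs one more OCR run per image, which is the
processing-time trade-off mentioned above.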

These "scan sets" are also highly configurable--you can construct your
own tool-chains by piping the image through a series of utilities, as
long as the image begins as a PNM and ends the chain as input to GOCR.
Thus you can do things like normalize, resize, greyscale, rotate, etc.
as you see fit, using the Netpbm, Libungif, or ImageMagick tools to
prepare the image in whatever way you want before it gets OCR'ed.  I
expect that various "recipes" will be shared eventually, as people
experiment with tool-chains and scanner settings that catch particular
image strains.
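
The piping itself can be sketched in Python as below.  This is not the
plugin's configuration syntax (see the wiki for that); run_chain is a
hypothetical helper, and the demonstration stages are small Python
one-liners so that the example runs without any image tools installed.

```python
import subprocess
import sys
from typing import List


def run_chain(stages: List[List[str]], data: bytes) -> bytes:
    """Feed data to the first command and pass each output to the next."""
    for argv in stages:
        done = subprocess.run(argv, input=data,
                              stdout=subprocess.PIPE, check=True)
        data = done.stdout
    return data


# Stand-in stages so the sketch is runnable anywhere Python is.  A real
# chain might look like [["ppmtopgm"], ["pnmnorm"], ["gocr", "-"]];
# treat those tool names and arguments as assumptions and check each
# tool's manual before relying on them.
upper = [sys.executable, "-c",
         "import sys; sys.stdout.write(sys.stdin.read().upper())"]
reverse = [sys.executable, "-c",
           "import sys; sys.stdout.write(sys.stdin.read()[::-1])"]

print(run_chain([upper, reverse], b"pnm"))  # b'MNP'
```

Because check=True is set, a stage that exits with an error stops the
chain instead of handing garbage to the next tool.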


Upgrade notes:

(1) SpamAssassin 3.1.4 is preferred, due to some optimizations that make
handling animated GIFs somewhat easier.  The plugin will still work with
versions as early as 3.1.0 using some (less-efficient) internal
workarounds, but you should really be using the latest SpamAssassin in
any case, if only for the newer rules and bug fixes, so consider this
your excuse to upgrade.

(2) This version of the plugin requires the ImageMagick suite,
specifically for the "convert" and "identify" utilities, used to unpack
the animated GIFs.

(3) There's a small patch for the libungif utility "giftext" which
hardens it against a segfault exploit.  This means you'll need to get
the libungif sources, patch them, and build them, rather than just
using the binary packages from your favourite repository.  Sad, but
necessary.

-- 
Robert LeBlanc <rjlrenaissoft.com>
Renaissoft, Inc.
Maia Mailguard <http://www.maiamailguard.com/>

_______________________________________________
Maia-users mailing list
Maia-usersrenaissoft.com
http://www.renaissoft.com/mailman/listinfo/maia-users
FuzzyOCR plugin 2.3b released
user name
2006-08-28 15:44:30
Hi!

I installed this on ~amd64 Gentoo and it works fine (it scores my test
spam picture at only 1.000, but it works).

My problem is that when I check that mail in the web interface, the
list shows it with a score of 21.8:

21.8	2006-08-28 17:04:47.488234	vargyasverafreem...	root.tsabi.humar...	(no subject)

But when I open the message, I see only these rules applied:

1.816 	MISSING_SUBJECT 	Missing Subject: header
1.000 	FUZZY_OCR 	Mail contains an image with common spam text inside

That sums to 2.816! Where do the other 19 or so points come from?

Thanks for helping,
tsabi


FuzzyOCR plugin 2.3b released
user name
2006-08-28 16:30:37
Tóth Csaba wrote:

> But when I open the message, I see only these rules applied:
> 1.816 	MISSING_SUBJECT 	Missing Subject: header
> 1.000 	FUZZY_OCR 	Mail contains an image with common spam text inside
>
> That sums to 2.816! Where do the other 19 or so points come from?

As Robert explained earlier, the displayed score of 1.0 is not accurate:
there is currently no way to see what SpamAssassin actually assigns,
because it is a dynamic score, like AWL.  I've updated the FAQ to
include this case.

https://secure.renaissoft.com/maia/wiki/LoadRulesConfig
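
To make the arithmetic concrete, here is a small Python sketch using
the numbers from the question.  It assumes that FUZZY_OCR is the only
dynamically scored rule on that message and that no other rules fired;
under that assumption the real contribution can be estimated by
subtracting the static scores from the listed total.  The function name
implied_dynamic_score is made up for the example.

```python
from decimal import Decimal
from typing import Iterable


def implied_dynamic_score(listed_total: str,
                          static_scores: Iterable[str]) -> Decimal:
    """Listed total minus the static rule scores (assumes one dynamic rule)."""
    return Decimal(listed_total) - sum(
        (Decimal(s) for s in static_scores), Decimal(0)
    )


# What the rule list displays, including the 1.000 placeholder.
displayed = {"MISSING_SUBJECT": "1.816", "FUZZY_OCR": "1.000"}
displayed_sum = sum((Decimal(s) for s in displayed.values()), Decimal(0))
print(displayed_sum)  # 2.816

# What FUZZY_OCR would have had to contribute to reach the listed 21.8.
print(implied_dynamic_score("21.8", ["1.816"]))  # 19.984
```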

-- 
David Morton
Maia Mailguard                        - http://www.maiamailguard.com
Morton Software Design and Consulting - http://www.dgrmm.net
FuzzyOCR plugin 2.3b released
user name
2006-08-28 17:58:13
> The latest release of the FuzzyOCR plugin (2.3b) is out, and I've
> updated the wiki accordingly with new installation and configuration
> instructions: <https://secure.renaissoft.com/maia/wiki/FuzzyOCR23>.

I needed to install it on 3 Debian machines, so I made a Debian HOWTO
for myself:

http://www200.pair.com/mecham/spam/image_spam.html

Gary V

FuzzyOCR plugin 2.3b released
user name
2006-08-29 01:02:11
As a footnote to my earlier post about the 2.3b release, a small bug has
been discovered in the FuzzyOcr.pm file that prevents the hashing
database from working as advertised.  A small patch to fix this glitch
has been posted to the wiki instructions (Step 15):
<https://secure.renaissoft.com/maia/wiki/FuzzyOCR23>.

Note that the plugin author may not issue a whole new release when he
fixes this himself; he may just replace the 2.3b tarball, so the version
you download as of tomorrow may already have this correction
incorporated.  If the patch refuses to apply properly, that will be your
clue.

-- 
Robert LeBlanc <rjlrenaissoft.com>
Renaissoft, Inc.
Maia Mailguard <http://www.maiamailguard.com/>

FuzzyOCR plugin 2.3b released
user name
2006-08-29 06:35:18
In addition to the above: I've updated to 2.3b and have now started
receiving the following during process-quarantine.pl runs.  It repeats
every cycle.  Does anyone have suggestions?

reporter: SpamCop message older than 2 days, not reporting
reporter: SpamCop message older than 2 days, not reporting
Subroutine new redefined at /etc/mail/spamassassin/FuzzyOcr.pm line 116.
Subroutine parse_config redefined at /etc/mail/spamassassin/FuzzyOcr.pm line 126.
Subroutine dummy_check redefined at /etc/mail/spamassassin/FuzzyOcr.pm line 223.
Subroutine fuzzyocr_check redefined at /etc/mail/spamassassin/FuzzyOcr.pm line 227.
Subroutine load_global_words redefined at /etc/mail/spamassassin/FuzzyOcr.pm line 237.
Subroutine load_personal_words redefined at /etc/mail/spamassassin/FuzzyOcr.pm line 255.
Subroutine parse_scansets redefined at /etc/mail/spamassassin/FuzzyOcr.pm line 278.
Subroutine max redefined at /etc/mail/spamassassin/FuzzyOcr.pm line 285.
Subroutine reorder redefined at /etc/mail/spamassassin/FuzzyOcr.pm line 293.
Subroutine pipe_io redefined at /etc/mail/spamassassin/FuzzyOcr.pm line 298.
Subroutine handle_error redefined at /etc/mail/spamassassin/FuzzyOcr.pm line 410.
Subroutine logfile redefined at /etc/mail/spamassassin/FuzzyOcr.pm line 416.
Subroutine check_image_hash_db redefined at /etc/mail/spamassassin/FuzzyOcr.pm line 435.
Subroutine add_image_hash_db redefined at /etc/mail/spamassassin/FuzzyOcr.pm line 475.
Subroutine calc_image_hash redefined at /etc/mail/spamassassin/FuzzyOcr.pm line 497.
Subroutine debuglog redefined at /etc/mail/spamassassin/FuzzyOcr.pm line 537.
Subroutine wrong_ctype redefined at /etc/mail/spamassassin/FuzzyOcr.pm line 543.
Subroutine corrupt_img redefined at /etc/mail/spamassassin/FuzzyOcr.pm line 562.
Subroutine known_img_hash redefined at /etc/mail/spamassassin/FuzzyOcr.pm line 587.
Subroutine check_fuzzy_ocr redefined at /etc/mail/spamassassin/FuzzyOcr.pm line 602.
2006-08-29 01:06:07 Maia: [process-quarantine-sub] Learned mail item 3863652 as spam and reported it
2006-08-29 01:06:08 Maia: [process-quarantine-sub] Learned mail item 3876791 as spam and reported it
2006-08-29 01:06:12 Maia: [process-quarantine-sub] Learned mail item 3881984 as spam and reported it
2006-08-29 01:06:15 Maia: [process-quarantine-sub] Learned mail item 3884237 as spam and reported it

FuzzyOCR plugin 2.3b released
user name
2006-08-29 13:15:15
Hi!

I made ebuilds to make installing this easier:

app-text/gocr
 * added the segfault patch
perl-gcpan/String-Approx
 * generated the ebuild with g-cpan
media-libs/giflib
 * added the segfault patch
mail-filter/spamassassin-fuzzyocr
 * new ebuild
 * hashdb USE flag to enable hashdb support in config
 * hashdb-fix is included

Download this package:
http://dev.davidnet.hu/gentoo-portage/fuzzyocr-gentoo-2.tar.bz2

Unpack it into /usr/local, and enable the overlay in /etc/make.conf:
PORTDIR_OVERLAY="/usr/local/portage"

Then you can install mail-filter/spamassassin-fuzzyocr.

If the install fails with a wrong digest (you cannot download the
FuzzyOcr tarball), that means a new tarball has been released without a
version bump.  Run these commands, and then you should be able to
install without any errors:
cd /usr/local/portage/mail-filter/spamassassin-fuzzyocr
ebuild spamassassin-fuzzyocr-2.3b.ebuild digest

Have fun,
tsabi

