Thread: RE Spam

RE Spam
2006-05-23 06:57:11
Hi,
Sorry that I don't answer your question, but did you know that your
disclaimer is longer than the actual message?
You should read this page:
http://www.goldmark.org/jeff/stupid-disclaimers/fun.html
*************************************************************************************************
> This message and any attachments, or any part
> of it is intended solely for the named addressee.
>
> Reading, printing, distribution, storing, commercialising
> or acting on this transmission or any information it contains, by anyone
> other than the addressee, is prohibited. If you have received this message
> in error, please destroy all copies and notify
> Qld Police Credit Union Ltd on +61 7 3008 4444 or by replying to the
> sender.
>
> This message may contain legally privileged and
> confidential information, and/or copyright material
> of QPCU or third parties.
>
> QPCU is not responsible for any changes made
> to a document other than those made by QPCU,
> or for the effect of the changes on the document's meaning.
> You should only re-transmit, distribute or commercialise
> the material if you are authorised to do so.
>
> Any views expressed in this message are
> those of the individual sender. You may not rely on this message as
> advice unless subsequently confirmed by fax or letter signed by an Officer
> or Director of QPCU, or
> an Authorised Representative QPCU.
>
> QPCU advises that this e-mail and any attached files should be scanned to
> detect viruses. QPCU accepts no liability for loss or damage (whether
> caused by negligence or not) resulting from the use of any attached files.
>
> Information regarding Privacy can be found at the QPCU web site.
> ( www.qpcu.org.au )
>
> General Advice Warning
>
> Any advice has been prepared without taking into account your particular
> objectives, financial situation or needs. For that reason, before acting
> on the advice you should consider the appropriateness of the advice having
> regard to your own objectives, financial situation and needs. Where the
> advice relates to the acquisition, or possible acquisition, of a
> particular financial product, you should obtain a Product Disclosure
> Statement relating to the product and consider the Product Disclosure
> Statement before making any decision about whether to acquire the product.
>
*************************************************************************************************
>
> _______________________________________________
> SpamBayes@python.org
> http://mail.python.org/mailman/listinfo/spambayes
> Check the FAQ before asking: http://spambayes.sf.net/faq.html
--
Amedee
_______________________________________________
SpamBayes@python.org
http://mail.python.org/mailman/listinfo/spambayes
Check the FAQ before asking: http://spambayes.sf.net/faq.html

RE Spam
2006-05-23 13:20:02
amedee> Sorry that I don't answer your question, but did you know that your
amedee> disclaimer is longer than the actual message?

Most such disclaimers are added by the company's intellectual property
police as the email message heads out the virtual door, not by the person
sending the message. I suspect you probably just shot the messenger.
Skip

RE Spam
2006-05-23 14:45:07
On Tue, May 23, 2006 15:20, skip pobox.com said:
>
> amedee> Sorry that I don't answer your question, but did you know that your
> amedee> disclaimer is longer than the actual message?
>
> Most such disclaimers are added by the company's intellectual property
> police as the email message heads out the virtual door, not by the person
> sending the message. I suspect you probably just shot the messenger.
>
> Skip
Skip,
It was not my intention to shoot the messenger.
The url you didn't quote should have made it obvious:
http://www.goldmark.org/jeff/stupid-disclaimers/fun.html
My intention was to give some amusement to the sender, *not* to make fun
of him.

To get back on the topic of this mailing list:
I have noticed that a lot of spam contains disclaimer-ish text.
If I train spambayes with "disclaimed" ham, I fear this will "pollute" the
sb database.
The result might be that any email with a disclaimer-ish text will get a
relatively high ham score.
At the moment, I don't see a solution for this possible problem.
I *could* not train on disclaimed ham, but if most of my correspondents
have such boilerplates, training spambayes won't be very efficient.
- Amedee
--
Disclaimer:
By sending an email to ANY of my addresses you are agreeing that:
1. I am by definition, "the intended recipient"
2. All information in the email is mine to do with as I see fit and
   make such financial profit, political mileage, or good joke as it lends
   itself to. In particular, I may quote it on usenet.
3. I may take the contents as representing the views of your company.
4. This overrides any disclaimer or statement of confidentiality that
   may be included on your message.

RE Spam
2006-05-23 17:04:16
Amedee> I have noticed that a lot of spam contains disclaimer-ish text.
Amedee> If I train spambayes with "disclaimed" ham, I fear this will
Amedee> "pollute" the sb database. The result might be that any email
Amedee> with a disclaimer-ish text will get a relatively high ham score.
Amedee> At the moment, I don't see a solution for this possible problem.
Amedee> I *could* not train on disclaimed ham, but if most of my
Amedee> correspondents have such boilerplates, training spambayes won't
Amedee> be very efficient.

That depends. Most common English words (most of the words in disclaimers
are probably pretty common) should probably score around 0.5 and thus not be
used in ranking messages, e.g.:

    spamcounts the only which that disclaimer property
    token,nspam,nham,spam prob
    the,3591,844,0.5
    only,782,267,0.5
    which,893,232,0.5
    that,2111,424,0.5
    disclaimer,2,1,0.352062362221
    property,184,50,0.5

After you subtract all the common words, it depends on what's left worth
using. The approach SpamBayes uses is purely probabilistic (is
"statistical" more accurate?). The score of any given message is based on the
"preponderance of evidence" contained in the non-trivial tokens the message
contains (or which SB synthesizes).
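
How a token ends up near 0.5 can be sketched roughly as follows. This is an illustration only, not the SpamBayes source: the function name `token_spam_prob` and the message totals are made up, and the smoothing constants (0.5 and 0.45) are assumed to correspond to SpamBayes' `unknown_word_prob` and `unknown_word_strength` settings. It does not try to reproduce the spamcounts figures above, since the totals behind them are not given.

```python
def token_spam_prob(spam_count, ham_count, total_spam, total_ham,
                    unknown_word_prob=0.5, unknown_word_strength=0.45):
    """Smoothed estimate that a token points to spam (0.0 = ham, 1.0 = spam).

    spam_count / ham_count: trained spam / ham messages containing the token.
    total_spam / total_ham: number of spam / ham messages trained on.
    The two keyword defaults are assumptions, not values from the thread.
    """
    # Compare per-message frequencies so an unbalanced corpus does not
    # automatically push every token towards the larger class.
    spam_ratio = spam_count / total_spam if total_spam else 0.0
    ham_ratio = ham_count / total_ham if total_ham else 0.0
    if spam_ratio + ham_ratio == 0:
        return unknown_word_prob  # never-seen token: no evidence either way

    raw = spam_ratio / (spam_ratio + ham_ratio)

    # Pull the raw estimate towards the neutral prior; the more often the
    # token has been seen, the less the prior matters.
    n = spam_count + ham_count
    return (unknown_word_strength * unknown_word_prob + n * raw) / (
        unknown_word_strength + n)


if __name__ == "__main__":
    # A word equally frequent in ham and spam stays neutral.
    print(token_spam_prob(400, 400, 1000, 1000))
    # A word seen mostly in spam (e.g. English text in a mostly Dutch ham
    # corpus) drifts upwards.
    print(token_spam_prob(300, 20, 1000, 1000))
```

Under these assumptions, a token that is equally frequent in trained ham and spam stays at 0.5 however often it occurs, while a token seen mostly in one class moves towards that class.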
Skip

RE Spam
2006-05-24 09:21:23
On Tue, May 23, 2006 19:04, skip pobox.com said:
>
> Amedee> I have noticed that a lot of spam contains disclaimer-ish text.
> Amedee> If I train spambayes with "disclaimed" ham, I fear this will
> Amedee> "pollute" the sb database. The result might be that any email
> Amedee> with a disclaimer-ish text will get a relatively high ham score.
> Amedee> At the moment, I don't see a solution for this possible problem.
> Amedee> I *could* not train on disclaimed ham, but if most of my
> Amedee> correspondents have such boilerplates, training spambayes won't
> Amedee> be very efficient.
>
> That depends. Most common English words (most of the words in disclaimers
> are probably pretty common) should probably score around 0.5 and thus not
> be used in ranking messages, e.g.:

Interesting.
However, English is not my mother language and most of my correspondence
is in Dutch.
As a consequence, most common English words are quite uncommon for me. The
result is that common English words will score a bit above 0.5. Perhaps
not much, but enough to be significant after a while.

RE Spam
2006-05-24 10:26:35
>>> I have noticed that a lot of spam contains disclaimer-ish text.
>>> If I train spambayes with "disclaimed" ham, I fear this will
>>> "pollute" the sb database. The result might be that any email
>>> with a disclaimer-ish text will get a relatively high ham
>>> score.
>>
>> That depends. Most common English words (most of the words in
>> disclaimers are probably pretty common) should probably score around
>> 0.5 and thus not be used in ranking messages, e.g.:
>
> However, English is not my mother language and most of my
> correspondence is in Dutch.
> As a consequence, most common English words are quite uncommon for
> me. The result is that common English words will score a bit above 0.5.
> Perhaps not much, but enough to be significant after a while.

Note that they have to be above 0.6 before they are used, and even
then only the 150 strongest tokens are used, so in a longer message
tokens have to be fairly strong to count.

In any case, if you train on both ham and spam with disclaimers, then
their score will remain around 0.5, and so they will have no effect. If
the only messages you received with disclaimers were spam, then they
would be spammy clues, but that would be good (and vice-versa).
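
The cut-off described above can be sketched like this. It is a rough illustration, not the SpamBayes code: `select_clues` is a made-up name, the token scores are invented, and the parameter names `minimum_prob_strength` (0.1) and `max_discriminators` (150) are assumed to correspond to the SpamBayes options behind the "0.6" and "150" figures. The sketch treats strength symmetrically, so tokens below 0.4 count as ham clues in the same way that tokens above 0.6 count as spam clues.

```python
def select_clues(token_probs, minimum_prob_strength=0.1, max_discriminators=150):
    """Return the (token, prob) pairs that would take part in scoring.

    token_probs: dict mapping token -> spam probability in [0.0, 1.0].
    Tokens closer to 0.5 than minimum_prob_strength are ignored; of the
    rest, only the max_discriminators strongest are kept.
    """
    strong = [
        (token, prob)
        for token, prob in token_probs.items()
        if abs(prob - 0.5) >= minimum_prob_strength
    ]
    # Strongest evidence first, whether it points to ham or to spam.
    strong.sort(key=lambda item: abs(item[1] - 0.5), reverse=True)
    return strong[:max_discriminators]


if __name__ == "__main__":
    # Invented scores: neutral disclaimer words drop out, strong tokens stay.
    probs = {"the": 0.5, "only": 0.52, "viagra": 0.99,
             "spambayes": 0.02, "korting": 0.7}
    for token, prob in select_clues(probs):
        print(token, prob)
```

With a filter like this, disclaimer words that hover slightly above 0.5 never reach the scoring step at all, which is the point being made above.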
=Tony.Meyer
--
Please always include the list (spambayes at python.org) in your replies
(reply-all), and please don't send me personal mail about SpamBayes.
http://www.massey.ac.nz/~tameyer/writing/reply_all.html explains this.

RE Spam
2006-05-24 10:48:41
Amedee> However, English is not my mother language and most of my
Amedee> correspondence is in Dutch. As a consequence, most common
Amedee> English words are quite uncommon for me. The result is that
Amedee> common English words will score a bit above 0.5. Perhaps not
Amedee> much, but enough to be significant after a while.

Thanks, I didn't realize that. Do you have an example in your training
database you can share with us (both message and word scores) where you
think the English disclaimer text has tipped the scales and caused a ham
message to later be scored as spam? If you simply train on one or two of
those misclassified hams does the problem go away? How skewed is your
training database (number of spams vs number of hams)? Have you considered
throwing out your current training database and starting fresh?

One thing that might help is to further break messages which score as spam
into "low" and "high" spam. Based on my current settings that gives me
these four categories:

    ham        0.00-0.14
    unsure     0.15-0.59
    low spam   0.60-0.74
    high spam  0.75-1.00

High spam is tossed without further consideration. Ham is sorted in the
appropriate mailbox by procmail. Unsure and low spam messages each wind up
in their own mailboxes for further consideration. I train on most unsure
messages but only train on lospams which are actually ham.
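
The four-way split can be written down as a small function. This is only a sketch of the bucketing: the cut-offs are the ones listed above, but the function and parameter names are hypothetical and do not correspond to any SpamBayes or procmail setting.

```python
def classify(score, ham_cutoff=0.15, low_spam_cutoff=0.60, high_spam_cutoff=0.75):
    """Map a message score in [0.0, 1.0] to one of four categories.

    The default cut-offs mirror the table above; the names are made up
    for this example.
    """
    if score < ham_cutoff:
        return "ham"
    if score < low_spam_cutoff:
        return "unsure"
    if score < high_spam_cutoff:
        return "low spam"
    return "high spam"


if __name__ == "__main__":
    for score in (0.03, 0.40, 0.62, 0.98):
        print(score, classify(score))
```

Raising `low_spam_cutoff` in this sketch widens the unsure band, so borderline ham is more likely to be reviewed by hand than filed with the spam.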

My suspicion is that if you have ham messages which are erroneously winding
up as spam they are at the very low end of the spam scale. It might be
sufficient to move your spam threshold up a bit so they are more likely to
land in the unsure category.
Skip

RE Spam
2006-05-24 12:01:52
On Wed, May 24, 2006 12:48, skip pobox.com said:
>
> Amedee> However, English is not my mother language and most of my
> Amedee> correspondence is in Dutch. As a consequence, most common
> Amedee> English words are quite uncommon for me. The result is that
> Amedee> common English words will score a bit above 0.5. Perhaps not
> Amedee> much, but enough to be significant after a while.
>
> Thanks, I didn't realize that. Do you have an example in your training
> database you can share with us (both message and word scores) where you
> think the English disclaimer text has tipped the scales and caused a ham
> message to later be scored as spam? If you simply train on one or two of
> those misclassified hams does the problem go away? How skewed is your
> training database (number of spams vs number of hams)? Have you considered
> throwing out your current training database and starting fresh?
>
> One thing that might help is to further break messages which score as spam
> into "low" and "high" spam. Based on my current settings that gives me
> these four categories:
>
>     ham        0.00-0.14
>     unsure     0.15-0.59
>     low spam   0.60-0.74
>     high spam  0.75-1.00
>
> High spam is tossed without further consideration. Ham is sorted in the
> appropriate mailbox by procmail. Unsure and low spam messages each wind up
> in their own mailboxes for further consideration. I train on most unsure
> messages but only train on lospams which are actually ham.
>
> My suspicion is that if you have ham messages which are erroneously winding
> up as spam they are at the very low end of the spam scale. It might be
> sufficient to move your spam threshold up a bit so they are more likely to
> land in the unsure category.
>
> Skip

Skip,
I think you have hit the mark there.
I already use something like your lospam/hiham.
I have 5 categories: high ham, low ham, unsure, low spam, high spam
The high ham/spam respectively go to procmail or /dev/null.
And indeed, the misclassified hams all wind up in unsure or low spam.
--
Amedee