I write at the end of June 2008, having just completed a quarterly
spam analysis and adjustment. What follows is a brief description of
the mail community, the incoming mail stream, how I process it, and
the results.
The Mail Community
- 150 people, mostly engineers in a software company
- old addresses, average age over 5 years
- many “first name” addresses
The Incoming Mail Stream
- We receive about 500 to 1000 incoming emails per hour.
- 95% of the incoming email is spam, 5% is real.
The Process
- No mail is destroyed or rejected for spaminess. It is marked with a
header, and the mail clients shuttle it off to a junk folder, just in
case.
- All mail first passes through bogofilter. This can definitively
mark a message as real mail or spam, or it may be unsure and pass the
message on to more expensive filters. 90% of the real mail is
discovered at this point, as is 85% of the spam. I have a broad
‘unsure’ area to reduce false positives.
- Only the 15% or so of the mail that bogofilter was not sure about
will proceed to the following filters.
- The second filter is dcc, the distributed checksum clearinghouse.
This sends a fuzzy checksum to a central server and checks how many
copies of the message have been seen so far. If it has been seen too
many times then I consider it spam. This successfully discovers about
50% of the remaining spam with a quick round trip of a UDP packet.
- clamav is used to detect viruses and mark them as spam so the mail
clients will sequester them. This only marks a couple of messages out
of every 1000 incoming, but dcc also marks many viruses, so I don’t
see the total size of my virus stream.
- If a message is still uncharacterized it goes on to spamassassin.
This discovers
90% of the remaining spam. That leaves about 0.3% of the total spam
sneaking past my filters to offend the users. Spamassassin is
configured to do the network checks, but not to use its Bayesian
filter, since bogofilter already does something similar.
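
The cascade described above, with cheap filters first and expensive ones reserved for mail that is still undecided, can be sketched roughly as follows. This is only an illustration, not the actual configuration: the stage functions are stand-ins for calls out to bogofilter, dcc, clamav, and spamassassin, and the header name `X-Spam-Verdict` is invented for the example.

```python
from email.message import EmailMessage
from typing import Callable, Optional, Sequence, Tuple

# A stage returns "spam", "ham", or None when it cannot decide.
Verdict = Optional[str]
Stage = Tuple[str, Callable[[EmailMessage], Verdict]]


def classify(msg: EmailMessage, stages: Sequence[Stage]) -> Tuple[str, str]:
    """Run the stages in order and stop at the first definite verdict."""
    for name, check in stages:
        verdict = check(msg)
        if verdict is not None:
            return verdict, name
    # No stage objected, so the message is treated as real mail.
    return "ham", "default"


def tag(msg: EmailMessage, stages: Sequence[Stage],
        header: str = "X-Spam-Verdict") -> EmailMessage:
    """Mark the message with a header; nothing is dropped or rejected."""
    verdict, decided_by = classify(msg, stages)
    msg[header] = f"{verdict}; decided-by={decided_by}"
    return msg


# Hypothetical stand-in stages; real ones would invoke the external tools.
def fake_bogofilter(msg: EmailMessage) -> Verdict:
    return None  # pretend bogofilter is unsure about everything


def fake_dcc(msg: EmailMessage) -> Verdict:
    # Pretend this subject has been seen too many times.
    return "spam" if msg["Subject"] == "Cheap pills" else None


STAGES = [("bogofilter", fake_bogofilter), ("dcc", fake_dcc)]
```

Because each stage can answer "not sure" by returning `None`, the slow stages only ever see the mail the fast ones could not settle, which is the point of ordering them by cost.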
The Results
- 99.7% of the spam is detected and tagged.
- 0.0% false positives, as far as I can tell. (I haven’t found one.)
- CPU consumption small enough to be unmeasurable.
- Mail which gets as far as spamassassin is delayed 4 to 10 seconds
while it is processed. The other tests are fast enough not to be
noticed.
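
To get a feel for how little mail actually pays the spamassassin delay, here is a back-of-the-envelope sketch using the rounded percentages quoted above. The function name and its parameters are made up for the illustration, the inputs are approximations, and the handful of messages clamav marks is ignored.

```python
def spamassassin_load(msgs_per_hour: float,
                      spam_fraction: float = 0.95,
                      bogo_spam_caught: float = 0.85,
                      bogo_ham_caught: float = 0.90,
                      dcc_caught: float = 0.50) -> float:
    """Estimate how many messages per hour reach spamassassin."""
    spam = msgs_per_hour * spam_fraction
    ham = msgs_per_hour - spam
    # Mail that bogofilter leaves in its 'unsure' area.
    unsure_spam = spam * (1 - bogo_spam_caught)
    unsure_ham = ham * (1 - bogo_ham_caught)
    # dcc removes about half of the remaining spam; real mail passes through.
    return unsure_spam * (1 - dcc_caught) + unsure_ham


if __name__ == "__main__":
    for rate in (500, 1000):
        print(rate, "msgs/hour ->", round(spamassassin_load(rate), 1),
              "reach spamassassin")
```

At 1000 messages per hour this works out to roughly 76 messages per hour reaching spamassassin. At 4 to 10 seconds apiece, that is about 5 to 13 minutes of spamassassin time per hour if they are handled one at a time.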
Maintenance
Bogofilter works best if it is trained regularly to follow spam
trends. I have in the past manually sorted thousands of messages into
good and bad piles for training, but that is mind-numbing. For ongoing
training I do the following:
- Anything that just barely got tagged as spam by bogofilter (scored
above 85% but below 90%) is used as spam to train bogofilter. This
tracks spam techniques as they drift out of my target sights without
warping my spam stats by reporting 10,000 copies of the same
message.
- Anything that gets past bogofilter, but is subsequently caught by
dcc or spamassassin is trained into bogofilter as spam. This catches
new trends in spam.
- Periodically I spot check the real mail, pick out any spam that
squeaked through, and train what remains into bogofilter as real mail
to keep up with trends in our real mail.
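
The two automatic rules above can be written down as a small decision function. This is a sketch under assumptions: the function name and its arguments are hypothetical, scores are expressed as fractions between 0 and 1, and the 0.85 and 0.90 cutoffs are simply the percentages mentioned above. The periodic spot check is manual and is not modelled.

```python
from typing import Optional


def training_label(bogo_score: float,
                   later_filter_said_spam: bool,
                   spam_cutoff: float = 0.85,
                   confident_cutoff: float = 0.90) -> Optional[str]:
    """Return "spam" if the message should be used for training, else None."""
    # Rule 1: barely tagged as spam by bogofilter. Messages scored well
    # above the cutoff are skipped so bulk repeats do not skew the stats.
    if spam_cutoff < bogo_score < confident_cutoff:
        return "spam"
    # Rule 2: got past bogofilter, but dcc or spamassassin caught it.
    if bogo_score <= spam_cutoff and later_filter_said_spam:
        return "spam"
    return None
```

A message selected this way would then be registered with bogofilter as spam, for example by piping it to `bogofilter -s`.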
Results
The end result is that I spend dozens of man-hours per year to stop
250,000 spam messages. I’d just hire Google to front-end filter our
mail for $3/address/year, but the security policy won’t allow that.