Jim's Depository

I write from the end of June 2008, having just completed a quarterly spam analysis and adjustment. What follows is a brief description of the mail community, the incoming mail stream, how I process it, and the results.

The Mail Community

  • 150 people, mostly engineers in a software company
  • old addresses, more than 5 years old on average
  • many “first name” addresses

The Incoming Mail Stream

  • We receive about 500 to 1,000 incoming emails per hour.
  • 95% of the incoming email is spam, 5% is real.

The Process

  • No mail is destroyed or rejected for spaminess. It is marked with a header, and the mail clients shuttle it off to a junk folder, just in case.
  • All mail first passes through bogofilter. This can definitively mark a message as real mail or as spam, or it may be unsure and pass the message on to more expensive filters. 90% of the real mail is identified at this point, as is 85% of the spam. I keep a broad ‘unsure’ area to reduce false positives.
  • Only the 15% or so of the mail that bogofilter was not sure about will proceed to the following filters.
  • The second filter is dcc, the Distributed Checksum Clearinghouse. This sends a fuzzy checksum to a central server and checks how many copies of the message have been seen so far. If it has been seen too many times, I consider it spam. This catches about 50% of the remaining spam for the cost of a quick UDP round trip.
  • clamav is used to detect viruses and mark them as spam so the mail clients will sequester them. This only marks a couple of messages out of 1,000 incoming, but dcc marks many viruses first, so I don’t see the total size of my virus stream.
  • If a message is still uncharacterized it goes on to spamassassin. This catches 90% of the remaining spam. That leaves about 0.3% of the total spam sneaking past my filters to offend the users. Spamassassin is configured to do the network checks, but not to use its Bayesian filter, since bogofilter already does something similar.
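
The cascade above can be sketched in a few lines. This is only an illustration of the cheapest-first ordering and the tag-instead-of-reject policy, not the actual setup: the stage callables, the `classify` and `tag` helpers, and the `X-Spam-Verdict` header name are all hypothetical stand-ins, and no real bogofilter, dcc, clamav, or spamassassin interface is called.

```python
from typing import Callable, Optional, Sequence, Tuple

# A stage returns "spam", "ham", or None when it cannot decide.
Verdict = Optional[str]
Stage = Tuple[str, Callable[[bytes], Verdict]]


def classify(message: bytes, stages: Sequence[Stage]) -> Tuple[str, str]:
    """Run the stages in order and stop at the first definite verdict."""
    for name, stage in stages:
        verdict = stage(message)
        if verdict is not None:
            return verdict, name
    # No stage flagged the message, so it is delivered as real mail.
    return "ham", "default"


def tag(message: bytes, verdict: str, decided_by: str) -> bytes:
    """Prepend a marker header; the message itself is never dropped."""
    # Hypothetical header name; mail clients would file on it.
    header = f"X-Spam-Verdict: {verdict}; stage={decided_by}\r\n"
    return header.encode("ascii") + message
```

Ordering the stages from cheapest to most expensive means the slow filter only ever sees the small residue that the fast ones could not decide.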

The Results

  • 99.7% of the spam is detected and tagged.
  • Effectively 0% false positives. (I haven’t found one.)
  • CPU consumption small enough to be unmeasurable.
  • Mail that gets as far as spamassassin incurs a 4 to 10 second delay while it is processed. The other tests are fast enough not to be noticed.
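
As a rough cross-check, the per-stage catch rates quoted in The Process can be compounded. The sketch below uses only those rounded figures (85%, 50%, 90%) and leaves clamav out because it marks so few messages; the `missed_fraction` helper is a hypothetical name. With round inputs the result should only be expected to land in the same range as the measured figure, not to match it exactly.

```python
# Fraction of the spam reaching each stage that the stage catches,
# taken from the rounded figures quoted in The Process.
CAUGHT = {"bogofilter": 0.85, "dcc": 0.50, "spamassassin": 0.90}


def missed_fraction(caught_by_stage):
    """Multiply together the share of spam each stage lets through."""
    remaining = 1.0
    for rate in caught_by_stage.values():
        remaining *= 1.0 - rate
    return remaining


print(f"{missed_fraction(CAUGHT):.2%} of spam slips through")
# prints "0.75% of spam slips through"
```

That is under 1%, the same order of magnitude as the measured 0.3%.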

Maintenance

Bogofilter works best if it is trained regularly to follow spam trends. In the past I have manually sorted thousands of messages into good and bad piles for training, but that is mind-numbing. For ongoing training I do the following:

  • Anything that bogofilter only just tagged as spam (scoring above 85% but below 90%) is used to train bogofilter as spam. This tracks spam techniques as they drift out of my sights, without warping my spam stats by training on 10,000 copies of the same message.
  • Anything that gets past bogofilter but is subsequently caught by dcc or spamassassin is trained into bogofilter as spam. This catches new trends in spam.
  • Periodically I spot-check the real mail, pick out any spam that squeaked through, and train the rest into bogofilter as real mail to keep up with trends in our legitimate traffic.
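
The two automatic spam-training rules above amount to a small predicate. The sketch below assumes the bogofilter score is available as a number between 0 and 1 and that 85% is the cutoff above which bogofilter tags a message as spam; the constant and function names are hypothetical, and actually feeding messages to bogofilter is left out.

```python
SPAM_CUTOFF = 0.85     # assumed: scores above this are tagged spam
CONFIDENT_SPAM = 0.90  # at or above this, training adds little

def train_as_spam(bogo_score: float, caught_downstream: bool) -> bool:
    """Decide whether a message should be fed back as a spam example."""
    # Rule 1: barely tagged as spam, so the technique is drifting away.
    barely_spam = SPAM_CUTOFF < bogo_score < CONFIDENT_SPAM
    # Rule 2: bogofilter let it through, but dcc or spamassassin caught it.
    slipped_past = bogo_score <= SPAM_CUTOFF and caught_downstream
    return barely_spam or slipped_past
```

For example, `train_as_spam(0.87, False)` and `train_as_spam(0.50, True)` are both true, while a message scoring 0.99 is skipped, which is what keeps thousands of copies of one obvious spam from dominating the training data.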

Results

The end result is that I spend dozens of man-hours per year to stop 250,000 spam messages. I’d just hire Google to front-end filter our mail for $3/address/year, but the security policy won’t allow that.

An extra note on bogofilter:

Bogofilter is built with a single user in mind. I'm sure it works better when it has a single user's mail to think about and can rely on the human to tag the false positives and negatives.

In a 150 user common filter you can rely on exactly 0 of them to report their miscategorized spam. If you try to force them to comply you will find that 10% of them do it backwards and pollute your statistics so badly you have to erase everything and start again.

That said, it works quite well, is speedy, and doesn't rely on external network servers, so it makes a good first line of defense.

Going forward:

I will have to drop dcc. Its licensing is no longer free enough for Debian to distribute it. That will push more messages through the slower spamassassin stage, but in practice anything dcc catches is also caught by spamassassin.

I'd like to add an adaptive whitelist out front to prevent false positives and give me a stream of known good messages for training the bogofilter. I haven't found one I like yet, but I keep looking. Maybe I'll have to write it.
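
One common shape for such a whitelist, offered here purely as an assumption about what "adaptive" could mean, is to remember every address the users send mail to and treat later mail from those addresses as known good. The `AdaptiveWhitelist` class and its method names below are hypothetical, and the sketch ignores expiry, persistence, and forged senders.

```python
from email.utils import parseaddr


class AdaptiveWhitelist:
    """Learn correspondents from outgoing mail; recognise them on the way in."""

    def __init__(self):
        self._known = set()

    @staticmethod
    def _normalize(address):
        # Strip any display name and compare case-insensitively.
        return parseaddr(address)[1].lower()

    def record_outgoing(self, recipient):
        """Call for each recipient of a message sent by a local user."""
        addr = self._normalize(recipient)
        if addr:
            self._known.add(addr)

    def is_known(self, sender):
        """True if a local user has previously written to this sender."""
        return self._normalize(sender) in self._known
```

Mail from a known sender could then skip the spam filters and also be fed to bogofilter as a real-mail training example.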