Excogitate Consultancy

Coping with 50,000 spam a day

The volume of spam arriving at the web domains I own has grown exponentially, from a few hundred a day six months ago to well over 50,000 a day. The trend looks worryingly as though the Net (or at least my ISP's server) is heading for meltdown in just a few months.

[Figure: graph of daily spam count]

Most of the recent growth has been created by viruses, trojans and spamming software manufacturing e-mail addresses at my domains: some plausible-sounding, such as alvarez@, snyder@, socialjustice@, snowwbaby@ and soccertime4us@, and others just random strings of numbers and letters. I'm currently running a count of how much mail is addressed to non-existent addresses, and it appears to amount to between 97% and 98% of all the spam I receive.

So how do I cope with the onslaught?

On an average day I download fewer than 10 spam e-mails. That's because I filter all my mail through three filters: SpamAssassin, procmail and SpamCop. Let me explain how (the boxed sections are asides):

All my e-mail arrives at my web server hosted at pair Networks, which uses qmail as the mail processor. Although I could in theory filter my mail using qmail recipe files, they are not flexible enough for some of the automated mail processing I require, and are inconvenient to configure for many different e-mail addresses.

Like many people I use many different e-mail addresses, e.g. [client_name]@ for all correspondence with a particular client. This makes it easy to filter incoming mail automatically (by 'To' address) for automatic action (e.g. 'unsubscribe' requests and database log files to be fed into a local backup/mirror) or to be sorted into an appropriate folder, saving time and helping me prioritise my work.

So I use procmail instead. But before the mail is passed to procmail it is filtered through SpamAssassin.

The .qmail-default file in my home directory looks like this (on one line):

/usr/local/bin/sa_client.pl -rh -c1
     '/var/qmail/bin/preline /usr/local/bin/procmail -d "$USER"'


sa_client.pl: Pass e-mail to SpamAssassin.
-rh: Write the spam report to the header rather than the message body
-c1: Pass processed e-mail to the command that follows (in quotes)
preline: Prepend a 'From ' line and the Return-Path and Delivered-To headers to the message
procmail: Pass processed e-mail to procmail
-d "$USER": Deliver mail locally as 'me'

SpamAssassin scans through the mail looking for patterns commonly found in spam: words and phrases, presentation, forged lines in the header, and common errors in the formatting. It also compares the e-mail to a distillation of e-mails that I have told it are OK or spam (more about that later). I have tweaked the spam scoring system and set a high threshold score so that I can be virtually 100% confident that anything scoring higher than the threshold is indeed spam.

My .spamassassin/user_prefs file looks like this:

required_hits 9.5
score BAYES_60 4
score BAYES_70 5
score BAYES_80 6
score BAYES_90 8.5
score BAYES_99 9.5

score FORGED_HOTMAIL_RCVD 2
score FORGED_HOTMAIL_RCVD2 5
score FORGED_JUNO_RCVD 3
score FORGED_MUA_AOL_FROM 5
score FORGED_MUA_EUDORA 5
score FORGED_MUA_MOZILLA 3
score FORGED_MUA_OUTLOOK 6
score FORGED_YAHOO_RCVD 3
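
To make the effect of these settings concrete, here is a minimal Python sketch of threshold scoring. It is not SpamAssassin's own code: the names `total_score` and `is_spam` are hypothetical, only a subset of the scores above is included, and it assumes a message counts as spam once its total reaches the threshold.

```python
# Illustrative sketch only; not SpamAssassin's implementation.
REQUIRED_HITS = 9.5

# A subset of the scores overridden in the user_prefs file above.
SCORES = {
    "BAYES_60": 4.0,
    "BAYES_90": 8.5,
    "BAYES_99": 9.5,
    "FORGED_HOTMAIL_RCVD": 2.0,
    "FORGED_MUA_OUTLOOK": 6.0,
}


def total_score(rules_hit):
    """Sum the scores of the rules a message triggered (unknown rules count 0)."""
    return sum(SCORES.get(rule, 0.0) for rule in rules_hit)


def is_spam(rules_hit, threshold=REQUIRED_HITS):
    """Assumption: spam when the total reaches or exceeds the threshold."""
    return total_score(rules_hit) >= threshold
```

With these values a BAYES_99 hit alone reaches the threshold, whereas BAYES_90 (8.5) needs at least one other rule to fire before the message is treated as spam.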

So, procmail receives each e-mail with a spam score assigned by SpamAssassin inserted into the header. Amongst many other tasks, my procmail recipe checks the delivery address against a list of valid addresses, and keeps a running daily total of those it doesn't recognise.
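
The address check and the running daily total can be sketched as follows. This is Python rather than my actual procmail recipe, and the addresses, the `VALID_ADDRESSES` set and the function name are all hypothetical placeholders.

```python
from collections import Counter
from datetime import date

# Hypothetical list of addresses that are actually in use.
VALID_ADDRESSES = {"personal@example.com", "clientname@example.com"}

# Running total of mail to unrecognised addresses, keyed by day.
unrecognised_per_day = Counter()


def is_recognised(address, today=None):
    """Return True for a known address; otherwise count it and return False."""
    today = today or date.today()
    if address.strip().lower() in VALID_ADDRESSES:
        return True
    unrecognised_per_day[today] += 1
    return False
```

Comparing addresses in lower case is a design choice of the sketch, so that 'Personal@' and 'personal@' are treated as the same mailbox.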

A curious subset of spam is e-mail that has no body at all. I assume that these are the result of someone firing up their spam sender before they've loaded it with their make-me-rich message. Anyone know different?

Then anything that has been identified by SpamAssassin as spam is counted and deleted, as is anything addressed to a 'hijacked' e-mail address (i.e. one that has been used as the 'From' address in a spam or viral e-mail). No 'bounce' message is sent to the sender because their identity is almost always unconnected with the 'From' address. That disposes of over 95% of the spam I receive.

procmail processes mail after it has been delivered in its entirety, so the only option available for dealing with spam and e-mail viruses at this stage is to delete them. It is, however, possible to refuse delivery of an e-mail if it is addressed to an unrecognised address. This has the advantage of reducing the volume of data on the local network and the load on the server. However, it is currently complex to set up on pair Networks' servers if one has a long list of valid addresses. There's also some debate over whether the 'delivery failed' messages, which may end up in an innocent party's in-box, exacerbate the whole spam problem.

The remaining mail is sorted into three folders in my 'master' pair Networks mailbox: anything to an unrecognised address (which accounts for most of the remaining spam) goes into 'Unrecognised'. Mail to my personal address goes into 'Personal'. The rest goes into 'General'. There is one other folder called 'Junk', which I'll come to in a moment.
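
Put together, the delete-or-file decision described above amounts to something like the following Python sketch. The function name `route` and its parameters are hypothetical, and the real logic lives in a procmail recipe rather than in Python.

```python
def route(address, flagged_as_spam, valid, hijacked, personal):
    """Return the folder for a message, or None if it should be deleted."""
    if flagged_as_spam or address in hijacked:
        return None  # counted and deleted; no bounce is sent
    if address not in valid:
        return "Unrecognised"
    if address == personal:
        return "Personal"
    return "General"
```

The order of the tests matters: spam and mail to hijacked addresses are discarded first, so only the remaining mail is ever filed into a folder.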

My 'master' mailbox is separate from the default mailbox that comes with a pair Networks account. This is for security reasons: just in case somebody 'sniffs' the (unencrypted) password sent whenever I access the mailbox, it wouldn't allow them to access my main web server account. I took this risk seriously after reading a news post from another pair Networks user saying his account had been compromised in this way.

A copy of each personal e-mail is forwarded to a separate personal mailbox with pair Networks (I'll explain why in a moment), and a copy of all other mail to recognised addresses is forwarded to an account with SpamCop.

SpamCop filters mail using 'blacklists', 'whitelists' and also SpamAssassin. It automatically deletes e-mails bearing viruses and it intercepts most of the remaining spam (around 30 a day), moving it into a folder called 'Held' in my SpamCop mailbox. That typically leaves fewer than ten spam a day in my Inbox. In some ways it is more useful as a reserve filter should pair Networks find it necessary to temporarily disable spam filtering on my web server.

I have a Linux mail server in my office which downloads mail (using fetchmail) from the Inbox of my personal (pair Networks) mailbox into one local mailbox, and from the Inbox of my general (SpamCop) mailbox into another local mailbox. I use a procmail recipe to organise my general mail into subfolders, mostly based on the 'To' address.
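
Sorting by 'To' address boils down to mapping the part before the '@' to a folder name. Here is a minimal sketch of that idea using Python's standard email module instead of procmail; the mapping and the folder names are hypothetical.

```python
from email import message_from_string
from email.utils import getaddresses

# Hypothetical mapping from the local part of the 'To' address to a subfolder.
FOLDER_FOR_LOCAL_PART = {"clientname": "Clients/clientname", "backups": "Logs"}


def subfolder_for(raw_message, default="Inbox"):
    """Pick a subfolder from the first 'To' address that has a mapping."""
    msg = message_from_string(raw_message)
    for _name, address in getaddresses(msg.get_all("To", [])):
        local_part = address.partition("@")[0].lower()
        if local_part in FOLDER_FOR_LOCAL_PART:
            return FOLDER_FOR_LOCAL_PART[local_part]
    return default
```

Mail to an address with no mapping simply stays in the default folder, so a new address never causes mail to be lost.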

So, I have succeeded in filtering out nearly all of the spam and my mail is neatly organised in folders on my office mail server, without any manual intervention. Great! But what about the mail building up in all those other mailboxes?

The SpamCop mailbox has an excellent webmail interface, so I leave mail in there for when I'm away from the office. I also examine the mailbox via IMAP most days, firstly to check for legitimate e-mail falsely identified as spam (about one a fortnight), which I move from the 'Held' folder to the Inbox (from where it gets downloaded automatically to my local mailbox); and secondly to move spam from the Inbox into the 'Held' subfolder. Sometimes I will go onto the SpamCop web site and report the spam in the 'Held' subfolder; otherwise I just delete it. About once a month I clear down the Inbox, when I am certain I have another offsite backup of my mail.

IMAP is a protocol for accessing mailboxes. Most people use POP, which is designed to download the entire contents of a mailbox in one go. IMAP is designed to let you interact with the mailbox message by message. Without downloading the message to your computer, you can view headline details (to, from, subject, size), delete it, or move it to another folder in the same mailbox. When you download a message a copy is stored locally, which you can view offline. Sounds great? However, not all mailboxes can be accessed by IMAP, and because your mail stays in the mailbox you may need to hone your housekeeping skills to keep it from filling up.
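
For the curious, moving a message between folders over IMAP is usually a copy followed by a delete. The Python sketch below shows the idea using the method names of the standard imaplib module. The helper `move_message`, the host name and the login details are hypothetical, and a mail client normally does all of this for you.

```python
def move_message(conn, message_num, dest_folder):
    """Move one message to another folder on the server.

    `conn` is expected to behave like an imaplib.IMAP4 connection on which
    a folder has already been selected.
    """
    status, _ = conn.copy(message_num, dest_folder)
    if status != "OK":
        return False  # leave the original untouched if the copy failed
    conn.store(message_num, "+FLAGS", "\\Deleted")
    # Note: expunge removes every message flagged \Deleted in this folder.
    conn.expunge()
    return True


# Possible usage (hypothetical host and credentials):
#   import imaplib
#   conn = imaplib.IMAP4_SSL("mail.example.com")
#   conn.login("user", "password")
#   conn.select("INBOX")
#   move_message(conn, "1", "Held")
```

Nothing is downloaded at any point: the message is copied and removed entirely on the server.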

My personal (pair Networks) mailbox is automatically emptied whenever my office mail server downloads mail from it, so there's nothing for me to do here. In fact I only use this mailbox because fetchmail cannot retrieve mail directly from the 'Personal' subfolder of the 'master' mailbox.

I examine the 'master' mailbox via IMAP most days: firstly, I check the 'Unrecognised' folder for e-mails sent to an address I'd forgotten I use or where the sender has mistyped my e-mail address, and move these into the 'General' folder. If there are several 'Bounce' messages to the same address (indicating that it has been 'hijacked' as the 'From' address for a spam or virus mailing), I add the address to my procmail recipe for automatic deletion. The rest of the 'Unrecognised' mail I move into the 'Junk' folder. When pair makes it easier to configure, I will refuse delivery of all mail to unrecognised addresses; then this entire step will become unnecessary.
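
I spot hijacked addresses by eye, but the test is simple enough to sketch in Python. The function name is hypothetical, and the cut-off of three bounces is an assumption standing in for 'several'.

```python
from collections import Counter


def hijacked_candidates(bounce_recipients, min_bounces=3):
    """Return addresses that received at least `min_bounces` bounce messages."""
    counts = Counter(addr.lower() for addr in bounce_recipients)
    return sorted(addr for addr, n in counts.items() if n >= min_bounces)
```

Given the list of addresses that the day's bounce messages were sent to, the result is the set of candidates to add to the automatic-deletion list.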

If I'm being thorough I also move any spam in the 'Personal' and 'General' folders into the 'Junk' folder. This is because I have a cron job that runs each day to 'train' SpamAssassin to recognise spam (using its Bayesian classifier) by having it examine the spam in the 'Junk' folder. The script empties the 'Junk' folder when all the e-mails have been scanned.

The script to 'train' SpamAssassin looks like this:

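#!/bin/csh
# Train SpamAssassin on the 'Junk' IMAP folder, then empty the folder.

set junk = '/usr/boxes/{account}/{domain}/general^/.imap/Junk'

# Learn every message in the mbox file as spam, at reduced CPU priority
nice /usr/local/bin/sa-learn --spam --mbox "$junk"

# Now replace 'Junk' message file with the IMAP header message
# (which is 13 lines long)

head -n 13 "$junk" > ~/tmp/temp
mv ~/tmp/temp "$junk"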

Finally, about once a month I clear down the 'master' mailbox.

I hope you have found this article useful or informative. I offer to configure a similar mail filtering system for any web hosting account I manage at no extra cost. Or I can offer consultancy at a negotiable rate to assist with setting up or configuring your own mail filtering system.

 

consultancy@excogitate.co.uk Phone: +44 (0)1223 312 377 Fax: +44 (0)1223 561 042
42 Devonshire Road · Cambridge · CB1 2BL · United Kingdom