Coping with 50,000 spam a day
The volume of spam arriving at the web domains I own has grown exponentially,
from a few hundred a day six months ago to well over 50,000 a day. Worryingly,
the trend suggests that the Net (or at least my ISP's server) is heading for
meltdown in just a few months.

Most of the recent growth has been created by viruses, trojans and spamming
software manufacturing e-mail addresses at my domains. Some are
plausible-sounding, such as alvarez@, snyder@, socialjustice@, snowwbaby@ and
soccertime4us@; others are just random strings of numbers and letters.
I'm currently running a count of how much mail is addressed to non-existent
addresses, and it appears to amount to between 97% and 98% of all the spam I
receive.
So how do I cope with the onslaught?
On an average day I download fewer than 10 spam e-mails. That's
because I filter all my mail through three filters: SpamAssassin, procmail and SpamCop.
Let me explain how (the boxed sections are asides):
All my e-mail arrives at my web server hosted at pair
Networks, which uses qmail as
the mail processor. Although I could in theory filter my mail using
qmail recipe files, they are not flexible
enough for some of the automated mail processing I require, and are
inconvenient to configure for many different e-mail addresses.
Like many people, I use a number of different e-mail addresses, e.g.
[client_name]@ for all correspondence with a particular client. This makes
it easy to filter incoming mail automatically by 'To' address, either for
automatic action (e.g. 'unsubscribe' requests, or database log files to be
fed into a local backup/mirror) or to be sorted into an appropriate folder,
saving time and helping me prioritise my work.
So I use procmail instead. But before the mail is passed to procmail
it is filtered through SpamAssassin.
The .qmail-default file in my home directory looks
like this (on one line):
/usr/local/bin/sa_client.pl -rh -c1 '/var/qmail/bin/preline /usr/local/bin/procmail -d "$USER"'
sa_client.pl: Pass e-mail to SpamAssassin
-rh: Write the spam report to the header rather than the message body
-c1: Pass processed e-mail to the command that follows (in quotes)
preline: Prepend standard message headers
procmail: Pass processed e-mail to procmail
-d "$USER": Deliver mail locally as 'me'
SpamAssassin scans through the mail looking for patterns commonly found in
spam: words and phrases, presentation, forged lines in the header, and common
errors in the formatting. It also compares the e-mail to a distillation of
e-mails that I have told it are OK or spam (more about that later). I have
tweaked the spam scoring system and set a high threshold score so that I can
be virtually 100% confident that anything scoring higher than the threshold
is indeed spam.
My .spamassassin/user_prefs file looks
like this:
required_hits 9.5
score BAYES_60 4
score BAYES_70 5
score BAYES_80 6
score BAYES_90 8.5
score BAYES_99 9.5
score FORGED_HOTMAIL_RCVD 2
score FORGED_HOTMAIL_RCVD2 5
score FORGED_JUNO_RCVD 3
score FORGED_MUA_AOL_FROM 5
score FORGED_MUA_EUDORA 5
score FORGED_MUA_MOZILLA 3
score FORGED_MUA_OUTLOOK 6
score FORGED_YAHOO_RCVD 3
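
To make the effect of these settings concrete, here is a minimal Python
sketch of how a threshold-based scorer behaves: the scores of all the rules
that match a message are added up and the total is compared with
required_hits. It is an illustration only, not SpamAssassin's own code. The
function names parse_prefs and is_spam are made up for this example, rules
not listed are treated as scoring zero (the real program gives every rule a
default score), and the fallback threshold of 5.0 is an assumption.

```python
def parse_prefs(text):
    """Read 'required_hits N' and 'score RULE N' lines from user_prefs-style text."""
    threshold = 5.0  # assumed fallback if no required_hits line is present
    scores = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] == "required_hits":
            threshold = float(parts[1])
        elif len(parts) == 3 and parts[0] == "score":
            scores[parts[1]] = float(parts[2])
    return threshold, scores


def is_spam(matched_rules, threshold, scores):
    """Sum the scores of the rules that matched and compare with the threshold."""
    # Simplification: rules missing from 'scores' count as 0.0 here.
    total = sum(scores.get(rule, 0.0) for rule in matched_rules)
    return total >= threshold, total


# A few lines taken from the user_prefs listing above.
PREFS = """\
required_hits 9.5
score BAYES_90 8.5
score BAYES_99 9.5
score FORGED_MUA_OUTLOOK 6
score FORGED_YAHOO_RCVD 3
"""

threshold, scores = parse_prefs(PREFS)
print(is_spam(["BAYES_99"], threshold, scores))
print(is_spam(["BAYES_90"], threshold, scores))
print(is_spam(["FORGED_MUA_OUTLOOK", "FORGED_YAHOO_RCVD"], threshold, scores))
```

In this sketch a BAYES_99 match alone reaches the threshold, a BAYES_90 match
alone (8.5) does not, and the two forged-header rules together (6 + 3 = 9)
fall just short, which is the kind of caution a high threshold is meant to
give.
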
So, procmail receives each e-mail with a spam score assigned by SpamAssassin
inserted into the header. Amongst many other tasks, my procmail recipe
checks the delivery address against a list of valid addresses, and keeps
a running daily total of those it doesn't recognise.
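
The address check is easy to picture in code. The real work is done by a
procmail recipe that is not reproduced here; the following Python sketch only
illustrates the logic. The address list, the example.com domain, the date and
the function name check_recipient are all invented for the example, and
comparing addresses without regard to case is an assumption.

```python
from collections import Counter
from datetime import date

# Hypothetical list of valid addresses; the real list lives in the procmail recipe.
VALID_ADDRESSES = {"me@example.com", "clientname@example.com"}

# Running total of deliveries to unrecognised addresses, keyed by day.
unrecognised_per_day = Counter()


def check_recipient(address, today=None):
    """Return True for a known address; otherwise add one to today's total."""
    today = today or date.today()
    # Assumption: addresses are matched case-insensitively.
    if address.strip().lower() in VALID_ADDRESSES:
        return True
    unrecognised_per_day[today] += 1
    return False


day = date(2004, 1, 1)  # arbitrary date chosen for the example
for rcpt in ["me@example.com", "snyder@example.com", "x7q2@example.com"]:
    check_recipient(rcpt, day)

print(unrecognised_per_day[day])
```

Keeping the count per day is what makes it possible to put a figure on how
much of the incoming mail is addressed to non-existent addresses.
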
A curious subset of spam is those e-mails that have no body at all. I assume
these are the result of someone firing up their spam sender before they've
loaded it with their make-me-rich message. Anyone know different?
Next, anything that SpamAssassin has identified as spam is counted and
deleted, as is anything addressed to a 'hijacked' e-mail address (i.e. one
that has been used as the 'From' address in a spam or viral e-mail). No
'bounce' message is sent to the sender, because their identity is almost
always unconnected with the 'From' address. That disposes of over 95% of the
spam I receive.
procmail processes mail after it has been delivered in its entirety, so the
only option available for dealing with spam and e-mail viruses at this stage
is to delete them. It is, however, possible to refuse delivery of an e-mail
if it is addressed to an unrecognised address. This has the advantage of
reducing the volume of data on the local network and the load on the server.
However, it is currently complex to set up on pair Networks' servers if one
has a long list of valid addresses. There's also some debate over whether the
'delivery failed' messages, which may end up in an innocent party's in-box,
exacerbate the whole spam problem.
The remaining mail is sorted into three folders in my 'master' pair Networks
mailbox: anything to an unrecognised address (which accounts for most of the
remaining spam) goes into 'Unrecognised'. Mail to my personal address goes
into 'Personal'. The rest goes into 'General'. There is one other folder
called 'Junk', which I'll come to in a moment.
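
Taken together, the deletion and sorting rules amount to a short decision
chain. The sketch below restates it in Python purely as an illustration; the
actual implementation is a procmail recipe, and the addresses and the route
function are hypothetical.

```python
# Hypothetical addresses used only for this illustration.
PERSONAL = "me@example.com"
VALID = {PERSONAL, "clientname@example.com", "lists@example.com"}
HIJACKED = {"sales@example.com"}  # addresses forged as 'From' in spam runs


def route(recipient, flagged_as_spam):
    """Return 'delete' or the name of the folder the message is filed in."""
    if flagged_as_spam or recipient in HIJACKED:
        return "delete"        # counted and discarded, no bounce is sent
    if recipient not in VALID:
        return "Unrecognised"  # checked by hand later
    if recipient == PERSONAL:
        return "Personal"
    return "General"


for rcpt, spam in [
    ("me@example.com", False),
    ("clientname@example.com", False),
    ("snowwbaby@example.com", False),
    ("sales@example.com", False),
    ("clientname@example.com", True),
]:
    print(rcpt, "->", route(rcpt, spam))
```

The order of the tests matters: spam and hijacked addresses are dropped
first, so the three folders only ever receive mail that has survived both
checks.
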
My 'master' mailbox is separate from the default
mailbox that comes with a pair Networks account. This is for security
reasons: just in case somebody 'sniffs' the (unencrypted) password sent
whenever I access the mailbox, it wouldn't allow them to access my main
web server account. I took this risk seriously after reading a news
post from another pair Networks user saying his account had been compromised
in this way.
A copy of each personal e-mail is forwarded to a separate personal mailbox
with pair Networks (I'll explain why in a moment), and a copy of all other
mail to recognised addresses is forwarded to an account with SpamCop.
SpamCop filters mail using 'blacklists', 'whitelists' and also SpamAssassin.
It automatically deletes e-mails bearing viruses and it intercepts most of
the remaining spam (around 30 a day), moving it into a folder called 'Held'
in my SpamCop mailbox. That typically leaves fewer than ten spam a day in my
Inbox. In some ways it is more useful as a reserve filter, should pair
Networks find it necessary to temporarily disable spam filtering on my web
server.
I have a Linux mail server in my office which downloads mail (using
fetchmail) from the Inbox of my personal (pair Networks) mailbox into one
local mailbox, and from the Inbox of my general (SpamCop) mailbox into
another local mailbox. I use a procmail recipe to organise my general mail
into subfolders, mostly based on the 'To' address.
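
As a rough picture of sorting by 'To' address, the Python sketch below
derives a folder name from the part of the address before the '@'. This is
not the procmail recipe itself, which is not shown in this article; the
function name subfolder_for, the 'Misc' fallback folder and the sample
message are assumptions made for the example.

```python
from email import message_from_string
from email.utils import parseaddr


def subfolder_for(raw_message, default="Misc"):
    """Pick a subfolder name from the local part of the 'To' address."""
    msg = message_from_string(raw_message)
    _, addr = parseaddr(msg.get("To", ""))
    local_part = addr.partition("@")[0]
    # Fall back to the default folder when there is no usable 'To' address.
    return local_part.lower() or default


SAMPLE = "To: Acme Ltd <acme@example.com>\nSubject: Invoice\n\nPlease find attached.\n"
print(subfolder_for(SAMPLE))
```

Because each client writes to their own address, a rule of this kind files
their mail together without any per-sender configuration.
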
So, I have succeeded in filtering out nearly all of the spam and my
mail is neatly organised in folders on my office mail server, without
any manual intervention. Great! But what about the mail building
up in all those other mailboxes?
The SpamCop mailbox has an excellent webmail interface, so I leave mail in
there for when I'm away from the office. I also examine the mailbox via IMAP
most days, firstly to check for legitimate e-mail falsely identified as spam
(about one a fortnight), which I move from the 'Held' folder to the Inbox
(from where it gets downloaded automatically to my local mailbox); and
secondly to move spam from the Inbox into the 'Held' subfolder. Sometimes I
will go onto the SpamCop web site and report the spam in the 'Held'
subfolder; otherwise I just delete it. About once a month I clear down the
Inbox, once I am certain I have another offsite backup of my mail.
IMAP is a protocol for accessing mailboxes. Most people use POP, which is
designed to download the entire contents of a mailbox in one go. IMAP is
designed to let you interact with the mailbox message by message. Without
downloading the message to your computer, you can view headline details (to,
from, subject, size), delete it, or move it to another folder in the same
mailbox. When you download a message a copy is stored locally, which you can
view offline. Sounds great? However, not all mailboxes can be accessed by
IMAP, and because your mail stays in the mailbox you may need to hone your
housekeeping skills to keep it from filling up.
My personal (pair Networks) mailbox is automatically emptied whenever
my office mailserver downloads mail from it, so there's nothing for
me to do here. In fact I only use this mailbox because fetchmail cannot
retrieve mail directly from the 'Personal' subfolder of the 'master'
mailbox.
I examine the 'master' mailbox via IMAP most days. Firstly, I check the
'Unrecognised' folder for e-mails sent to an address I'd forgotten I use, or
where the sender has mistyped my e-mail address, and move these into the
'General' folder. If there are several 'Bounce' messages to the same address
(indicating that it has been 'hijacked' as the 'From' address for a spam or
virus mailing), I add the address to my procmail recipe for automatic
deletion. The rest of the 'Unrecognised' mail I move into the 'Junk' folder.
When pair makes it easier to configure, I will refuse delivery of all mail
to unrecognised addresses; then this entire step will become unnecessary.
If I'm being thorough I also move any spam in the 'Personal' and 'General'
folders into the 'Junk' folder. This is because I have a cron job that runs
each day to 'train' SpamAssassin to recognise spam (using its Bayesian
classifier) by having it examine the spam in the 'Junk' folder. The script
empties the 'Junk' folder when all the e-mails have been scanned.
The script to 'train' SpamAssassin looks like this:
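#!/bin/csh
# Mailbox file for the 'Junk' IMAP folder ({account} and {domain} are placeholders)
set junk = '/usr/boxes/{account}/{domain}/general^/.imap/Junk'
# Feed every message in the folder to SpamAssassin's learner as spam
nice /usr/local/bin/sa-learn --spam --mbox "$junk"
# Stop here if learning failed, so unscanned spam is not thrown away
if ($status != 0) exit 1
# Now replace 'Junk' message file with the IMAP header message
# (which is 13 lines long)
head -n 13 "$junk" > ~/tmp/temp
mv ~/tmp/temp "$junk"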
Finally, about once a month I clear down the 'master' mailbox.
I hope you have found this article
useful or informative. I offer to configure a similar mail filtering
system for any web hosting account
I manage at no extra cost. Or I can offer consultancy at
a negotiable rate to assist with setting up or configuring your own mail
filtering system.