Wednesday, August 15, 2007

I finally got around to restoring from the last backup of my home desktop (its hard drive died in late 2004), and began the task of importing about 2.5 years of email into gnus.

During this period, all my mail was piling up on my mailhost, in both maildir and mbox formats (I'm paranoid about losing email). I'd been reading email using mutt -R because I figured that I'd import my email into gnus at some point in the future. Unfortunately, the amount of spam I received during these 2.5 years had grown massively:


-rw-------  1 me       luser   977K Jan  1  2004 Mailbox-20040101.bz2
-rw-------  1 me       luser   6.3M Jan 25  2004 Mailbox-20040126.bz2
-rw-------  1 me       luser   4.4M Jan 18  2005 Mailbox-20050118.bz2
-rw-------  1 me       luser    12M Oct  7  2006 Mailbox-20060107.bz2
-rw-------  1 me       luser   2.8M Jan 18  2006 Mailbox-20060118.bz2
-rw-------  1 me       luser   8.6M Jan  1  2007 Mailbox-20070101.bz2
-rw-------  1 me       luser    17M Jan 15  2007 Mailbox-20070115.bz2
-rw-------  1 me       luser    14M Jan 29  2007 Mailbox-20070129.bz2

Each file has about 2-4 weeks of email.

It was only late last year that I began taking anti-spam measures at the SMTP level. I added something like sendmail's greet pause (with a five-second delay), began using the bl.spamcop.net RBL, and blocked certain email addresses that received nothing but spam (old addresses used for mailing lists that I no longer subscribe to). Yesterday, the checks caught:

check                      messages caught
greet pause                            479
rbl                                  1,222
blocked email addresses                251
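
For anyone curious how the RBL check works: a DNS blocklist lookup reverses the octets of the connecting IP address, appends the blocklist zone, and asks DNS for an A record; an answer means the address is listed. The following is only a minimal Python sketch of that idea, not what my mailer actually runs. The function names are made up for illustration, and it only handles IPv4.

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str = "bl.spamcop.net") -> str:
    """Build the DNS name to query for an IPv4 address (hypothetical helper)."""
    # IPv4Address() validates the input; the octets are then reversed,
    # so 192.0.2.1 becomes 1.2.0.192.<zone>.
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ip: str, zone: str = "bl.spamcop.net") -> bool:
    """Return True if the zone returns an A record for the address."""
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
    except OSError:
        # No such name (or any other lookup failure) is treated as "not listed".
        return False
    return True
```

A real mailer would do this during the SMTP conversation and reject the connection when the lookup succeeds. Treating every lookup failure as "not listed" is a deliberate choice here, so that a DNS outage doesn't block legitimate mail.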

I also began verifying SPF records and DomainKeys headers, but either the checks aren't rejecting any email or my mailer isn't logging the rejections.

So. A lot of spam. Fortunately, gnus now has the ability to filter using external apps, like bogofilter, crm114, ifile, and a few others. I set up gnus to filter all imported email through bogofilter first, and I set up group parameters on almost all my groups to automatically train bogofilter with ham/spam messages when I exited the groups.

This worked well for all the email I imported from August 2004 to late 2005, and then the training process began slowing down massively. The other night, it took about two hours to train bogofilter on some 800 spam messages. Vacuuming the wordlist database (it's sqlite) helped some, but it's still pretty slow. It takes less than a second to classify email as spam, ham, or "unsure", but I need to train it with more ham messages, and training bogofilter takes over a second per email.

Right now, it's trained with almost three times as many spam messages as ham (12,159 vs 4,313). Given the amount of spam I have in the unimported mailboxes, it's only going to get worse. I can throw about 5,000 or so known "good" email messages at it, but it's going to take all evening.
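
To put rough numbers on that, here is a back-of-the-envelope sketch in Python using only the figures quoted in this post. The two per-message rates are assumptions: one second per message is the optimistic floor ("over a second per email"), and nine seconds is what the two-hour, 800-message run works out to.

```python
# Current training counts from the bogofilter wordlist.
spam_trained = 12_159
ham_trained = 4_313
ham_to_add = 5_000

# Assumed per-message training cost, in seconds.
fast_rate = 1.0                # "over a second per email" (lower bound)
slow_rate = 2 * 3600 / 800     # two hours for 800 messages


def training_hours(messages: int, seconds_per_message: float) -> float:
    """Estimated wall-clock hours to train on the given number of messages."""
    return messages * seconds_per_message / 3600


ratio_now = spam_trained / ham_trained
ratio_after = spam_trained / (ham_trained + ham_to_add)

print(f"spam:ham now        {ratio_now:.2f}")
print(f"spam:ham after ham  {ratio_after:.2f}")
print(f"best case   {training_hours(ham_to_add, fast_rate):.1f} hours")
print(f"worst case  {training_hours(ham_to_add, slow_rate):.1f} hours")
```

The best case is a bit under an hour and a half; if the slow rate from the other night holds, "all evening" is optimistic.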

After I'm done importing all my old email, I plan to upload the trained bogofilter wordlist to the mailhost, and begin rejecting email that it thinks is spam at the SMTP level.
