Intro
One of my main duties at my current job is administering email. We process mail for about 18,000 users across 1,300 domains. What this means for me is that for every legitimate piece of mail, I get 50-100 pieces of spam. It's quite nasty. To combat the spam, we've got a cluster of MTAs running a Postfix, Amavis, SpamAssassin, and ClamAV combo. This setup works fairly well, except that Amavis and SpamAssassin are written in Perl.
What do recipient verification (RV) and the Amavis suite have to do with each other? It makes sense to send a message to Amavis for scanning only if it has a valid recipient. There's no sense in scanning a message that will get discarded later anyway. It's far less costly to do RV and reject all messages to non-existent addresses during the SMTP session, before they ever reach the scanner. My log analysis showed that, without recipient verification, each box was scanning an extra 30~60k messages a day. That's a lot of extra work for Amavis/SA/ClamAV to handle, not to mention the delayed scan times for legitimate email and the extra processing overhead.
Dynamically Typed Languages
I love Perl. It's uber-useful and good at what it's designed to do. Its drawback, though, is that it's an interpreted, dynamically typed language. Neither point is a very big deal for a small application. It's when the application gets larger that problems like to introduce themselves. Here's why interpreted and dynamically typed is a big no-no at scale.
Interpreted languages are very useful for small scripts and general-purpose plumbing around the OS. But the interpreter is a layer of abstraction between the script and the hardware: quick and easy scripting is paid for with more hardware resources at run time.
Dynamically typed languages don't declare types like int, string, long, or char. Variables are all introduced with var, my, $var, or whatever the language uses. Good old Programming Languages 400-something taught me that these languages can be around 10x slower. This is because a variable could hold any type, so the interpreter and/or runtime environment has to check the value each time it accesses it. Is it a string? Is it a number: int, long, float? Is it an object? And so on and so forth. The ease of declaring my $var; is paid for later at run time.
Recipient Verification (RV)
A section in Postfix's main.cf is dedicated to recipient verification:
address_verify_map = btree:/var/spool/postfix/etc/verify_db
address_verify_negative_expire_time = 3h
address_verify_negative_refresh_time = 1h
address_verify_poll_count = 1
address_verify_positive_expire_time = 31d
address_verify_positive_refresh_time = 7d
address_verify_poll_delay = 1s
For those unfamiliar with recipient verification (RV), the above configuration directives tell Postfix how to maintain its recipient verification database, which it consults to see if a user exists. When an address isn't in the database yet, Postfix sends a probe from postmaster@mail.hostname.tld to wherever the email is supposed to go. It's almost like greylisting: Postfix rejects an unknown recipient with a 4xx (temporary) error, verifies the address by sending the probe out, and on the next connection it's ready to accept the message, provided that the address is valid.
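The address_verify_* settings above only tune the verification cache. Something in main.cf also has to ask Postfix to verify recipients in the first place. My actual restriction list isn't shown here, so treat the following as a minimal sketch of what that part could look like, assuming the standard reject_unverified_recipient restriction is what triggers the probes:

```
# Hypothetical excerpt, not the production restriction list
smtpd_recipient_restrictions =
    permit_mynetworks,
    reject_unauth_destination,
    reject_unverified_recipient
# Reply code used while an address is still unverified or failed verification
unverified_recipient_reject_code = 450
```

Order matters in a list like this: reject_unauth_destination comes first so that probes are only ever sent for domains the box actually handles.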
The problem that I have is that I don't have uniform hardware across the cluster of receiving MTAs. I was noticing email delays and dropped connections on certain boxes in the cluster. The strongest box has a dedicated hardware RAID 5 controller with 3x 15k rpm SCSI disks behind it. The weakest box has a single 7200rpm IDE drive. It's the weakest boxes that had a problem with mail delay.
Receiving a high volume of mail is very IO intensive, especially when recipient verification is turned on. Typically, when running iostat 1 on a strong box, only about 1~5% of CPU time is reported as iowait, meaning processes are blocked waiting for an IO request to finish. On the weak box, during peak times, that figure is about 20~60%. That's completely unacceptable.
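For boxes without iostat installed, roughly the same number can be derived by hand. The following is a minimal sketch, assuming a Linux /proc/stat whose aggregate cpu line lists user, nice, system, idle, iowait, irq, softirq, and steal in that order. The function name iowait_pct is my own invention for illustration.

```shell
#!/bin/sh
# iowait_pct "<first cpu line>" "<second cpu line>"
# Each argument is the aggregate "cpu ..." line from /proc/stat.
# Prints the whole-number percentage of CPU time spent in iowait
# between the two samples.
iowait_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            total = 0
            # Sum user..steal (fields 2-9); guest counters are skipped
            # because they are already included in user time.
            for (i = 2; i <= NF && i <= 9; i++) total += $i
            iowait = $6    # fifth counter after the "cpu" label
            if (NR == 1) { t0 = total; w0 = iowait }
            else         { t1 = total; w1 = iowait }
        }
        END {
            dt = t1 - t0
            if (dt <= 0) { print 0; exit }
            printf "%d\n", (w1 - w0) * 100 / dt
        }'
}

# Usage on a live box, sampling one second apart:
#   a=$(head -n 1 /proc/stat); sleep 1; b=$(head -n 1 /proc/stat)
#   iowait_pct "$a" "$b"
```

Taking two samples matters because the counters in /proc/stat are cumulative since boot. Only the difference says anything about the last second.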
The problem is that there may be between 200~800 smtpd processes running and accepting new incoming mail, and most of those connections are dictionary/random-user spam attacks. smtpd accepts a connection, and the verify daemon must then check the database for a match. If there's a match, smtpd accepts or rejects accordingly. If there's no match, verify writes to the db, verifies the address by sending a probe out to the next email hop, gets the result, and rewrites the entry with its status. However, when there are 800 smtpd processes and most of them are triggering a verification probe, all that verification work ends up contending for locks and updates on this btree or hash.
At worst, when I telnet into port 25 and manually send an email, helo and mail from: are fast, but rcpt to: will hang for 10~15 seconds while waiting on recipient verification. At peak, when there are over 5000 simultaneous connections to the cluster, each connection is prolonged by an extra 10~15 seconds waiting for RV to answer with a 250, 450, or 550. This is unacceptable. Longer connection times mean an increased number of smtpd processes receiving mail, idle processing time while processes are IO blocked, and increased memory use by all the smtpd processes. If it gets bad, swap is hit, then connections time out, people start complaining about delayed mail, and so on and so forth.
So at this point, why not just get faster hard drives? Budgets, budgets, budgets. *sigh* With no built-in RAID controller and IDE/SATA drives, I'd have to get new boxes for a RAID setup, or get Raptors. I'm working on that =) In the meantime, here's the solution.
Place the recipient verification database on a ramdisk.
I got 4GiB ram, why not spare a couple extra megs for RV?
The RV database gets fairly large. All the valid addresses take up around 40MiB of space. The RV db has grown to about 100MiB in a matter of a week. In about a month, it will hang out around 400MiB with all the "invalid" addresses being stored in it.
See how big your ramdisk is:
dmesg | grep RAMDISK
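Change the boot-time kernel options to create a bigger ramdisk by editing /boot/grub/menu.lst. See the end of the kernel line (line 3 of the entry below), where I've added ramdisk_size. The value is in KiB, so 256000 gives a ramdisk of roughly 256MiB (250MiB to be exact).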
title Debian GNU/Linux, kernel 2.6.18-5-amd64
root (hd0,0)
kernel /vmlinuz-2.6.18-5-amd64 root=/dev/md1 ro ramdisk_size=256000
initrd /initrd.img-2.6.18-5-amd64
savedefault
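Since ramdisk_size is given in KiB, it's worth doing the arithmetic before rebooting. Here's a minimal sketch. The helper name mib_to_ramdisk_kib is made up for illustration.

```shell
#!/bin/sh
# mib_to_ramdisk_kib MIB
# Converts a size in MiB to the KiB figure that ramdisk_size= expects.
mib_to_ramdisk_kib() {
    echo $(( $1 * 1024 ))
}

mib_to_ramdisk_kib 256       # prints 262144, an exact 256MiB
echo $(( 256000 / 1024 ))    # prints 250, the MiB that 256000 really gives
```

By the same arithmetic, a ramdisk big enough for the ~400MiB the database is expected to reach would need ramdisk_size=409600, plus some headroom for filesystem overhead.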
Reboot.
Stop Postfix
/etc/init.d/postfix stop
Format RamDisk, Mount Disk & link DB.
mke2fs -q -m 0 /dev/ram0                       # ext2, no blocks reserved for root
mkdir -p /mnt/ram0
mount /dev/ram0 /mnt/ram0
cp -p /path/to/verify.db /mnt/ram0             # why reverify all the addresses?
mv /path/to/verify.db /path/to/verify.db~      # make backup copy
ln -s /mnt/ram0/verify.db /path/to/verify.db
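Before starting Postfix again, it doesn't hurt to confirm that the symlink really lands on the ramdisk. This is a small sketch, assuming a readlink that supports -f (GNU coreutils does). The function name resolves_under is hypothetical.

```shell
#!/bin/sh
# resolves_under PATH DIR
# Succeeds if PATH, after following symlinks, points at something inside DIR.
resolves_under() {
    target=$(readlink -f "$1") || return 1
    case $target in
        "$2"/*) return 0 ;;
        *)      return 1 ;;
    esac
}

# Usage with the paths from the steps above:
#   resolves_under /path/to/verify.db /mnt/ram0 && echo "verify.db is on the ramdisk"
```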
Start Postfix!
/etc/init.d/postfix start
Remember that each time you reboot, you'll lose all the contents of the ramdisk, so make sure to have the contents of your ramdisk stored on non-volatile storage.
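One way to handle that is a pair of small helpers: one to save the database to disk periodically, and one to put it back after the ramdisk has been formatted and mounted at boot. This is only a sketch. The function names and the backup path in the usage comments are placeholders of mine.

```shell
#!/bin/sh
# save_verify_db SRC DST
# Copies SRC to a temporary file next to DST, then renames it into place,
# so DST is never left half-written if the copy is interrupted.
save_verify_db() {
    src=$1
    dst=$2
    tmp="$dst.tmp.$$"
    cp -p "$src" "$tmp" && mv -f "$tmp" "$dst"
}

# restore_verify_db BACKUP RAMDISK_COPY
# Copies the saved database back onto the ramdisk, if a backup exists.
restore_verify_db() {
    [ -f "$1" ] || return 1
    cp -p "$1" "$2"
}

# Hypothetical usage (the backup path is a placeholder):
#   save_verify_db /mnt/ram0/verify.db /var/backups/verify.db      # e.g. from cron
#   restore_verify_db /var/backups/verify.db /mnt/ram0/verify.db   # at boot, after mount
```

Copying while Postfix is running may catch the file mid-update, so the cautious version of the save step runs with Postfix stopped. Since the database is only a cache of probe results, a missing or stale backup means re-verification rather than lost mail.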
I have a quick command to parse the mail log and show the mail delay. If you're running Debian, you'll need to install the "num-utils" package to use it.
tail -n 10000 /var/log/mail.log|grep 'status=sent'|grep 'relay=127.0.0.1'|grep -v devnull|awk '{print $9} '|grep delay|sed "s/delay=//g"|sed "s/,//g"|/usr/bin/numaverage -M|cut -d"." -f1
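If num-utils isn't available, something similar can be done with awk alone. This is a rough sketch, not a drop-in replacement. It computes a plain arithmetic mean, which may differ from what numaverage -M reports. It also looks for the delay= field anywhere on the line rather than assuming it's the ninth column. The function name avg_delay is my own.

```shell
#!/bin/sh
# avg_delay: reads Postfix log lines on stdin and prints the mean of the
# delay= values, truncated to whole seconds, for lines that were sent
# through the 127.0.0.1 relay. Lines mentioning devnull are skipped.
avg_delay() {
    awk '
        /status=sent/ && /relay=127\.0\.0\.1/ && !/devnull/ {
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^delay=/) {
                    v = $i
                    sub(/^delay=/, "", v)   # strip the field name
                    sub(/,$/, "", v)        # strip the trailing comma
                    sum += v
                    n++
                }
            }
        }
        END { if (n) printf "%d\n", sum / n; else print 0 }'
}

# Usage:
#   tail -n 10000 /var/log/mail.log | avg_delay
```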
maildelay is at an unwavering 1 second. Testing has shown a significant speedup of mail processing on a single 7200rpm IDE drive system. Now this box can actually keep up with, if not outperform, the 3x 15k SCSI RAID 5 box. Most of the disk IO is now dedicated to the amavis/spamassassin/clamav combo doing its job killing spam. With RV on the ramdisk, I am seeing far fewer processes waiting for IO (iostat showing around 2~7%), and lower load averages all around.
For more information on mounting ramdisks: http://www.vanemery.com/Linux/Ramdisk/ramdisk.html
Btw, I would love to move to dspam as it's far more accurate and faster, but when there is an infrastructure already in place, changing the system wholesale has its own unintended ramifications. So for the moment, it's about optimizing the existing system to lessen the pain.