
There’s no fighting an email DDoS

Earlier today we emerged largely unscathed from a three-day DDoS (Distributed Denial of Service) attack. I say “largely” deliberately, because our pride, our customers and our sleep have all suffered quite a bit.

The difficult thing about an email DDoS is that it doesn’t look like one. It just looks like lots and lots of email. In the anti-spam business, that’s business as usual.  We have seen this before and we tend to call it a “spam-wave”.  They happen, they do cause some slow-down, they finish and we catch up. End of story.

Well, when it started on Monday morning, we thought nothing of it until the queues on the filtering backends started growing very large. That is generally not a real problem, except that the queue for the quarantine system got so large that it started to affect delivery of clean emails to our customers. Around midday the wave seemed to decrease in intensity, and during the afternoon the queues slowly drained. So far nothing really unusual, except for the incredible amounts of email, targeted at just two customer domains and using randomly generated addresses. Normally we reject unknown addresses right away (avoiding backscatter), but these mails were being trapped for our honeypot and were therefore diverted to the quarantine for analysis and feedback.
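The recipient handling described above can be sketched roughly as follows. This is only an illustration, not our actual filter code: the names `Action`, `classify_recipient`, `known_addresses` and `honeypot_domains` are made up for the example, and it assumes the honeypot is switched on per domain.

```python
from enum import Enum


class Action(Enum):
    ACCEPT = "accept"          # known recipient: pass on to normal filtering
    REJECT = "reject"          # unknown recipient: refuse at SMTP time, no backscatter
    QUARANTINE = "quarantine"  # unknown recipient on a honeypot domain: trap for analysis


def classify_recipient(address, known_addresses, honeypot_domains):
    """Decide what to do with a message for one recipient address.

    known_addresses and honeypot_domains are hypothetical lowercase sets.
    """
    address = address.lower()
    # Everything after the last "@" is treated as the domain.
    _, _, domain = address.rpartition("@")
    if address in known_addresses:
        return Action.ACCEPT
    if domain in honeypot_domains:
        # This is the path the random addresses took during the attack.
        return Action.QUARANTINE
    return Action.REJECT
```

With a policy like this, a flood of random addresses on a honeypot domain never hits the cheap reject path. Every message lands in the quarantine queue, which is why that queue was the one that grew.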

Tuesday morning everything was in the green, up until 12 o’clock or thereabouts. Then the scenario from Monday started repeating: incredible amounts of email for a handful of domains, all being honeypotted. Delivery of clean emails to customers started slowing down, and frankly panic started to spread a little. We added two extra backend servers (COTS stuff, 8 cores, 24 GB), but they got equally swamped in no time.

By now, our phones were red-hot: customers and IT helpdesks wanted to know why emails weren’t arriving seconds after they’d been sent. (We have long ago given up trying to explain that email is not intended as an instant messaging system.)

Tuesday toward 19:00, the spam-wave seemed to be diminishing. The queues began to shrink and it was looking good. I’m not sure how far we got, but towards 01:00 Wednesday morning, the spam-wave began intensifying again.

Wednesday between 7 and 8 in the morning, the spam-wave seemed to be coming to an end. The queues were still incredibly long, with more than 2 million mails waiting. Many emails from Monday and Tuesday were still being delivered with more than 24 hours’ delay. We decided to add another two servers and redirect the current traffic to them, hoping to get the regular traffic through without much delay. It took a little longer than expected (note to self: need to practice more). By 12 o’clock we had diverted current traffic to the two new backends, which brought immediate relief. This took significant load off the existing backends, allowing them to spend more time processing their queues. Wednesday around 16:00 we were down to about 250,000 mails, and the backlog was cleared by 22:00.
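The idea behind the diversion, in a minimal sketch: backends with a backlog stop accepting new mail and only drain, while new mail is spread across the fresh backends. The `Backend` class, the `draining` flag and the round-robin distribution are assumptions made for illustration. This is not how our frontends are actually configured.

```python
from itertools import cycle


class Backend:
    """Hypothetical stand-in for one filtering backend."""

    def __init__(self, name, draining=False):
        self.name = name
        self.draining = draining  # True: work off the backlog, accept nothing new
        self.queue = []


def make_router(backends):
    """Return a function that hands new mail round-robin to non-draining backends."""
    active = [b for b in backends if not b.draining]
    if not active:
        raise ValueError("no backend available for new mail")
    rotation = cycle(active)

    def route(message):
        backend = next(rotation)
        backend.queue.append(message)
        return backend.name

    return route
```

Marking the swamped backends as draining means their queues can only shrink, while regular traffic on the new backends never waits behind the backlog.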

Thursday morning we spent a good deal of time tweaking the backend settings to optimize for large queues.  The spam-wave had resumed, although with much reduced intensity. At the time of writing, we’re handling the flood just fine, no queues building.

Test-system upgrade

Our spam-filter test-system cluster is made up of a bunch of ancient commodity PCs, Dell OptiPlexes and HP Vectras. It is a functional mirror of our production cluster, but with much less capacity: two frontends and a number of backends, mostly single Pentium II 400 MHz machines with 384 MB RAM. Generally there is not much to write about, except that it has just celebrated its tenth anniversary and replacement is really way overdue.

Performance- and capacity-wise it’s still perfectly adequate, and would likely remain so for another few years. However, it needs a new hard drive every now and then, and parallel ATA/IDE drives are getting quite scarce. It is also simply using far too much electrical power compared to, e.g., a virtual setup.

Internally, we’ve discussed moving to a virtual setup more than once, usually every time a hard disk needs replacing, but it’s neither a priority nor critical, at least not at the moment.