adrift on a cosmic ocean

Writings on various topics (mostly technical) from Oliver Hookins and Angela Collins. We currently reside in Sydney after almost a decade in Berlin, have three kids, and have far too little time to really justify having a blog.

Don't use AWS SES

Posted by Oliver on the 14th of August, 2022 in category Tech
Tagged with: aws, cloud, devops, email, security

I'm continuing my winning streak of publishing a blog post once every 5 months, and what better reason than to rant about something? I feel slightly conflicted about this one but at the same time I'm annoyed enough to write this... anyway, enjoy.

Email is an annoying domain of the internet. It features heavily on my list of things I don't want to be responsible for anymore, in addition to DNS, BGP, VPNs, physical infrastructure and most things non-cloudy. I spent enough time doing those things many years ago and don't have much inclination to go back to managing them myself. Thus, it's very nice that for many years it has been possible to pay other people to run email servers for you - known now as Email Service Providers (ESPs).

It's a thankless job, and a very delicate balance. People and companies want to send email, which can take numerous forms from personal 1:1 communications to broad marketing campaigns. It is up to the recipient to determine whether those marketing emails are spam or not. As an ESP, you have to somehow provide a reliable and valuable service, but also prevent your customers from contributing to the large spam problem that surrounds us on the internet. It's mostly thanks to advanced spam detection and techniques like DKIM, SPF and DMARC that the problem isn't larger. Unfortunately the email standards (in RFCs) still leave very fertile ground for sending a lot of illegitimate email, and a provider can't simply opt out of the underlying standards of the internet. As a mail provider, you need to run the protocols, and suffer their problems, but still uphold a high standard.

I'm not going to recommend any alternatives here - there are numerous and I haven't found one that I think does a great job of balancing all of these needs - only a number of unsatisfying options. That being said, I've been using AWS SES since last year due to "reasons" and found a need to write about it. To their credit, AWS does a good job of verifying sender identity and ensuring that you are not attempting to impersonate anyone else. They also start you off with very low sending rates and limits that are designed to limit the blast radius of anyone intending to sign up and fire out as many spam/scam emails as they can before they are detected and stopped. Similarly, requesting limit increases is not automatic and requires some justification (which I assume a human reads and judges). I appreciate these measures a lot.

Using AWS SES means your behaviour is being continually monitored. The two main metrics are bounce rate and complaint rate. Bounces are self-explanatory - due to a full mailbox, or a non-existent account (or really any reason a receiving server decides) an email is sent back to the sender because it couldn't be delivered. This can be the fault of the sender or the recipient. The bounce rate really shouldn't be high, but it would be surprising if it were zero for any reasonable number of recipients and emails. Complaints are where a recipient has marked the email as spam, or an automated system has decided the email is illegitimate and sent it back, sometimes using the feedback loop system but also sometimes using proprietary complaint systems.

SES requires your complaint rate to remain below 0.5% (with a review period starting at 0.1%). It's low, but perhaps a fair number (and one I assume they arrived at with plenty of thought). So far so good, I'm not expecting any of the email to generate complaints as I'm not working with marketing emails. And yet, it does. There are numerous reasons for this to happen - one popular reason I've seen mentioned in Stack Overflow and forum posts is that SES's shared IP pool has a poor reputation and this triggers automatic complaints. Given that there are all kinds of users of SES, it's quite possible there are some bad apples in there, and their behaviour drags down the reputation of the shared IP addresses for everyone else. You can work around this by requesting dedicated IPs from AWS, in which case you are solely responsible for your own sender reputation.

Other reasons could be that you have some user interaction in your process for acquiring new email addresses, and someone is maliciously signing up email addresses that don't belong to them, or mistypes the address. Or perhaps someone simply doesn't want to bother with unsubscribing or disabling notifications properly, and hits the spam button instead; after all, it's a lot simpler. This might end up being treated as a complaint. AWS has a lot more detail around these scenarios in a helpful FAQ.

So we have a few options for when this count of complaints might go up. How might the rate be calculated? The simplest (and perhaps only?) answer is to take how many deliveries you made, and see how many resulted in complaints. If you send 100 emails, and one of them is returned as a complaint, you've got a 1% complaint rate. But that's only good if you are looking at a point-in-time, single delivery. Over time you need to consider all of the emails in that period. If you look at the above FAQ, you'll see that SES doesn't use a fixed period of time but a "representative volume", and also that the complaint rate isn't based on every email. A bit of head-scratching ensues.
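To make the arithmetic concrete, here's a minimal sketch of the naive calculation and of how much the answer depends on the window you pick. The daily figures are made up for illustration, and this is emphatically not how SES computes its number (which isn't disclosed) - it's just the obvious complaints-over-deliveries approach.

```python
from typing import Sequence, Tuple


def complaint_rate(deliveries: int, complaints: int) -> float:
    """Complaints as a fraction of delivered emails (0.0 if nothing was delivered)."""
    if deliveries <= 0:
        return 0.0
    return complaints / deliveries


def windowed_rate(daily: Sequence[Tuple[int, int]], days: int) -> float:
    """Naive rate over the most recent `days` entries of (deliveries, complaints)."""
    if days <= 0:
        raise ValueError("days must be positive")
    recent = daily[-days:]
    total_deliveries = sum(d for d, _ in recent)
    total_complaints = sum(c for _, c in recent)
    return complaint_rate(total_deliveries, total_complaints)


# Hypothetical week: six busy, clean days followed by a quiet day with one complaint.
week = [(1000, 0)] * 6 + [(100, 1)]

print(f"single send:   {complaint_rate(100, 1):.2%}")   # the 100-email example above
print(f"last day only: {windowed_rate(week, 1):.4%}")
print(f"whole week:    {windowed_rate(week, 7):.4%}")
```

Same complaint, wildly different rate depending on the denominator, which is exactly why not knowing the window (or the "representative volume") makes the reported figure so hard to reason about.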

OK, let's continue. We can see our complaint, bounce and delivery rates at any point in time using the CloudWatch metrics and also in the SES console dashboard. However, we can't really get any more detail than that. If our complaint rate starts increasing, why is that happening? Who is complaining and why? You might expect (like most other ESPs) that this detail is provided - perhaps as a CloudWatch Logs event stream. No such luck - you must solve this problem yourself. Indeed I did this, using a Lambda to capture the events and publish them to CloudWatch Logs as I'd hoped to find in the first place. So we have a complete stream of delivery, bounce and complaint events that we can use to dig into the statistics and problems in detail.
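For the curious, a capture Lambda along these lines can be very small. This is a sketch under assumptions rather than the exact function I run: it assumes SES notifications are delivered to the Lambda through an SNS topic, it relies on the fact that anything a Lambda prints ends up in the function's CloudWatch Logs group, and the flattened record shape (`type`, `messageId`, `recipients` and so on) is my own invention for illustration.

```python
import json
from typing import Any, Dict, List


def flatten_ses_event(message: Dict[str, Any]) -> Dict[str, Any]:
    """Reduce one SES notification to a few fields worth querying later."""
    # SNS notifications carry "notificationType"; configuration-set event
    # publishing carries "eventType". Accept either.
    kind = message.get("notificationType") or message.get("eventType") or "Unknown"
    mail = message.get("mail", {})
    record: Dict[str, Any] = {
        "type": kind,
        "messageId": mail.get("messageId"),
        "timestamp": mail.get("timestamp"),
        "recipients": mail.get("destination", []),
    }
    if kind == "Complaint":
        complaint = message.get("complaint", {})
        record["feedbackType"] = complaint.get("complaintFeedbackType")
        record["recipients"] = [
            r.get("emailAddress") for r in complaint.get("complainedRecipients", [])
        ]
    elif kind == "Bounce":
        bounce = message.get("bounce", {})
        record["bounceType"] = bounce.get("bounceType")
        record["recipients"] = [
            r.get("emailAddress") for r in bounce.get("bouncedRecipients", [])
        ]
    return record


def handler(event: Dict[str, Any], context: Any) -> List[Dict[str, Any]]:
    """Lambda entry point: one JSON log line per SES event in the SNS batch."""
    records = []
    for sns_record in event.get("Records", []):
        # The SNS envelope delivers the SES notification as a JSON string.
        message = json.loads(sns_record["Sns"]["Message"])
        record = flatten_ses_event(message)
        print(json.dumps(record))  # stdout is shipped to CloudWatch Logs
        records.append(record)
    return records
```

Logging one JSON object per event keeps the function free of any SDK calls, and leaves you with a stream you can search by type, recipient or message ID later.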

Since last year I've had the displeasure of having to address several complaint review periods on the account I'm managing. The background is identical - our complaint rate went above the review threshold (0.1%) and we have to tell them what's going on. They provide a number of generic suggestions for how to help keep your complaint rate low - mostly obvious things that seem tailored to people sending marketing content or with very insecure systems. However, the complaint rate has always been the key problem in these scenarios. Using a variety of timeframes, I've calculated (using our captured events) the actual complaint rate across all deliveries and found that it would typically work out at one or two orders of magnitude lower than the number AWS was alleging us to have.
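As a rough illustration of that cross-check, here's a sketch that tallies captured events and compares the resulting rate against the two thresholds mentioned earlier. It assumes each log line is a JSON object with a `type` field of `Delivery`, `Bounce` or `Complaint` (a hypothetical shape, adjust to whatever you capture), and the sample volumes are invented. The status labels are mine, not AWS terminology.

```python
import json
from collections import Counter
from typing import Any, Dict, Iterable

REVIEW_THRESHOLD = 0.001  # 0.1%: where the review period starts
LIMIT_THRESHOLD = 0.005   # 0.5%: the rate SES requires you to stay below


def summarise(lines: Iterable[str]) -> Dict[str, Any]:
    """Count events by type and compute the naive complaints/deliveries rate."""
    counts = Counter(
        json.loads(line).get("type", "Unknown") for line in lines if line.strip()
    )
    deliveries = counts["Delivery"]
    complaints = counts["Complaint"]
    rate = complaints / deliveries if deliveries else 0.0
    if rate >= LIMIT_THRESHOLD:
        status = "over limit"
    elif rate >= REVIEW_THRESHOLD:
        status = "under review"
    else:
        status = "healthy"
    return {
        "deliveries": deliveries,
        "complaints": complaints,
        "rate": rate,
        "status": status,
    }


def sample_lines(deliveries: int, complaints: int) -> Iterable[str]:
    """Generate made-up log lines standing in for a real export."""
    for _ in range(deliveries):
        yield json.dumps({"type": "Delivery"})
    for _ in range(complaints):
        yield json.dumps({"type": "Complaint"})


result = summarise(sample_lines(200_000, 5))
print(f"{result['complaints']} complaints / {result['deliveries']} deliveries "
      f"= {result['rate']:.4%} ({result['status']})")
```

With volumes like these the naive rate sits well under the review threshold, which is the kind of gap I kept pointing out to AWS.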

I'd reply with my calculations and the specific CloudWatch Logs Insights queries I was using, and cross my fingers. There were two responses I would typically receive:

  1. The review period was closed without further comment, with a templated response generally indicating we were off the hook for now, but to keep out of trouble. They're watching us!
  2. AWS can't disclose how they calculate the complaint ratio, but they are right and we are wrong. Tell us how you are fixing the problem.

In all but the most recent case, I've followed up with more calculations and questions, and I presume that due to my constant badgering they lost the will to live and closed the case. Most recently they insisted that we implement something like CAPTCHAs to protect against illegitimate traffic, and they'd not end the review period until we had dealt with the spam bot problem. I should point out here that the delivery counts were typically (in some time periods) in the hundreds of thousands, and the complaint count was a single digit number. Usually less than five. I don't know about you, but I would consider any automated spam bot attack capable of only sending five emails a day to not be very successful. Turning off the sarcasm for a moment, it was clear that these were the result of human activity - and hence completely impervious to CAPTCHA protection.

I wish I could give a happy ending to this but ultimately it ended in a meaningless gesture of improved security and no real answers to the questions I still have. We still don't know how they are calculating this complaint ratio, and so it is impossible to know why the numbers don't add up to what we are seeing - hence, difficult to address whatever the perceived problem is. The one tool that has been the biggest help to me around all of this is having a full event stream of all deliveries/bounces/complaints - and that's not even a built-in feature of SES! If you've ever worked with the CloudWatch Logs API you'll know that it's not straightforward to use, and forcing every customer to build the exact same solution feels like a big oversight.

Again, generally I really, really appreciate the hard-line approach to security that SES takes, especially given the nuclear wasteland that is the state of the world's email infrastructure and message content. It's a hard domain to work in and succeed at. But the utter lack of transparency into how customers can understand the metrics they are being evaluated on effectively holds their cloud infrastructure to ransom. I worry about the day that we don't escape the review period and find our outbound email shut off unexpectedly. It doesn't fill me with the confidence that I usually have around AWS services, and as a result I will never recommend SES to anybody.

© 2010-2022 Oliver Hookins and Angela Collins