adrift on a cosmic ocean

Writings on various topics (mostly technical) from Oliver Hookins and Angela Collins. We have lived in Berlin since 2009, have two kids, and have far too little time to really justify having a blog.

The current state of feature flags

Posted by Oliver on the 11th of February, 2014 in category Tech
Tagged with: etsy, feature flags, flickr, flippers, github, percentages

Feature flags, flippers, toggles, selective rollouts and various other terms have been used to describe systems that allow you to deploy code without necessarily forcing it on all of your users. I won't describe the concept here as it has already been covered in depth by a variety of companies and people, notably (and not necessarily in date order) Flickr, GitHub, Forrst, Martin Fowler and many others. The concept seems to have sprung into the general consciousness around 2009-2010, and many derivative articles, blog posts and screencasts appeared in 2011-2012. You can easily find many of the original and forked gems and other libraries between Google and GitHub.

The basic feature flag concept is simple - a feature is either on or off for all of your users. Enabling a feature for groups of users is a small extension, but it raises architectural concerns when the groups become large: you don't want to search linearly through an array of user identifiers, yet arranging the identifiers into a data structure that can be searched in less than O(N) time complicates what was previously a fairly simple system.
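As a minimal sketch of the group case, assuming the enabled identifiers fit in memory, a hash-backed Set avoids the linear scan. The GroupFlag class and its method names are hypothetical and not taken from any of the libraries mentioned here:

```ruby
require 'set'

# Hypothetical flag with an explicit allow-list of user identifiers.
class GroupFlag
  def initialize(user_ids)
    # A Set is hash-backed, so a membership check does not scan every element.
    @enabled_ids = Set.new(user_ids)
  end

  def enabled_for?(user_id)
    @enabled_ids.include?(user_id)
  end
end

flag = GroupFlag.new([3, 17, 42])
flag.enabled_for?(17)  # => true
flag.enabled_for?(18)  # => false
```

This keeps lookups cheap, but the whole group still has to be loaded and kept in sync somewhere, which is the added complexity described above.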

What is most interesting to me is the percentage selection mechanism for flags. Almost exclusively, the mechanism by which rollout libraries perform this is something like

user_id % 100 < desired_percentage

or some simple variation on this theme. The flipper gem actually takes a CRC of the user identifier so that the raw identifier is not used in the comparison (or perhaps to cope with non-numeric identifiers):

Zlib.crc32(key) % 100 < percentage

which is an interesting twist on the original idea. That and other libraries (such as etsy/feature) occasionally add the ability to enable a feature not just for a percentage of users browsing the site, but by postings, post creators, etc. - basically any item or entity that can be consumed on, or act on, your site and that has some kind of unique identifier that can be reduced modulo 100 and turned into a percentage.
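The CRC variant can be wrapped up so that it works for any kind of identifier. This is only a sketch: the helper name enabled_for_percentage? and the "user:" / "post:" key formats are assumptions for illustration, not the API of flipper or etsy/feature.

```ruby
require 'zlib'

# Hypothetical helper: true when `key` falls inside the rollout percentage.
# Converting to a string first lets numeric and non-numeric identifiers
# go through the same CRC.
def enabled_for_percentage?(key, percentage)
  Zlib.crc32(key.to_s) % 100 < percentage
end

enabled_for_percentage?("user:1234", 25)  # same answer on every call
enabled_for_percentage?("post:987", 25)   # works for non-user entities too
```

Because the bucket depends only on the key, raising the percentage only ever adds entities to the enabled set and never removes any.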

However, all these mechanisms suffer from the same limitation - whatever identifier you select to do this comparison will have the simultaneous advantage and disadvantage of behaving in the same way every time you do the comparison. The first 1% of users will always be the same, every time they visit the site (ignoring the additional users from growth through new signups). Sure, you could use one of the mechanisms that selects a random 1%, but for reasons of response and object representation cacheability, having responses change every time something is retrieved can have a performance impact. Even more importantly, if you are trying to perform A/B testing and want to know the impact or user perception of your new feature by collecting metrics around its use, the selection of users for that feature cannot change.

Oddly, I've only seen one comment anywhere that asks about this problem - the first 1% of users in that percentage will always be the same. Thus every feature flag you roll out to a percentage of users will hit that 1% first, every time. If you have a lot of features controlled in this way, and your features tend to be very experimental in nature, you could end up with a lot of users in this percentage who experience a very odd, buggy, inconsistent view of your website. Alternatively, they may get a completely awesome experience with the most up-to-date and useful features!
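The overlap is easy to demonstrate. In the sketch below the user identifiers are simply the integers 1 to 10,000 and the two features are hypothetical; the point is only that a less-than comparison on the same bucket makes the smaller rollout a subset of the larger one.

```ruby
require 'zlib'

# Bucket number in 0..99, derived the same way for every feature.
def bucket(user_id)
  Zlib.crc32(user_id.to_s) % 100
end

user_ids = (1..10_000).to_a

feature_a = user_ids.select { |id| bucket(id) < 1 }  # 1% rollout
feature_b = user_ids.select { |id| bucket(id) < 5 }  # 5% rollout

# Everyone who sees feature A also sees feature B.
(feature_a - feature_b).empty?  # => true
```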

Clearly what is lacking here is the ability to segregate experiments and features by different sections of your userbase. A basic less-than operator on the modulo of some number is clearly insufficient. My own thoughts on this haven't produced anything significantly different from the status quo. At a minimum, we cannot use a less-than comparison but must segregate individual experimental features into non-overlapping ranges of users. This is starting to sound like a system of multiple A/B tests rather than feature flags, and in fact this principle of non-colliding userbases involved in different tests is present in some A/B testing frameworks.
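One way the non-overlapping ranges idea might look, as a rough sketch rather than a finished design: each feature is handed its own slice of the 100 buckets, and the comparison becomes a range check instead of a less-than. The BucketAllocator class, its methods and the feature names are all hypothetical.

```ruby
require 'zlib'

# Hypothetical registry that gives each experiment its own slice of the
# 100 buckets, so no user lands in two experiments at once.
class BucketAllocator
  def initialize
    @ranges = {}    # feature name => Range of bucket numbers
    @next_free = 0  # first bucket not yet assigned to any feature
  end

  def allocate(feature, percentage)
    raise ArgumentError, "not enough free buckets" if @next_free + percentage > 100
    @ranges[feature] = (@next_free...(@next_free + percentage))
    @next_free += percentage
  end

  def enabled?(feature, user_id)
    range = @ranges[feature]
    return false if range.nil?
    range.cover?(Zlib.crc32(user_id.to_s) % 100)
  end
end

allocator = BucketAllocator.new
allocator.allocate(:new_search, 5)    # buckets 0...5
allocator.allocate(:new_checkout, 5)  # buckets 5...10
allocator.enabled?(:new_search, 1234)
```

Each user still gets a stable answer per feature, which keeps caching and metrics intact. The obvious cost is that the buckets run out after 100% has been handed out, and a contiguous range cannot simply be widened once its neighbour has been allocated.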

That being said, Google searches on the topic usually bring up content relating to multivariate testing and multi-armed bandit testing - both extremely useful concepts in their own right but not quite my desired mechanism. The few resources I can find on the subject are fairly lacking in technical implementation details. Perhaps this is just not a requirement many people have. If you know better, please leave a comment below!
