Wednesday, October 05, 2016

RubyConf 2015 - A Tale of Two Feature Flags by Rebecca Sliter

My name is Rebecca Sliter,
and I'm an engineering manager at Kickstarter.
Before joining Kickstarter I was a consultant,
which meant that I traveled to clients
of all shapes and sizes
to talk about software development practices.
Because of that,
I've grown to care a lot about tooling
and practices that help teams
deliver better code more efficiently.
Over the last few years in particular,
I've come to appreciate simple code that can be understood
and extended by my teammates.
In a nutshell, to me,
development should be as simple as possible.
I try to stay away from writing clever code,
and a lot of my time is actually spent thinking
about the best possible way to communicate an idea,
domain, or tool across a development team.
Which brings me to the point of my talk.
This is "A Tale of Two Feature Flags."
Specifically, what happens when a team implements
a bunch of features,
puts them behind flags, and ships 'em.
Today we'll be thinking about what makes a flag successful
and comparing it to other feature flagging strategies
that haven't worked out so well.
So let's talk feature flags.
This is the opening of the book "A Tale of Two Cities"
and I think it really nicely frames
what I'm gonna talk about today.
Software development, in many senses,
has never been so good.
Over the last few years in particular
we've developed languages, patterns, and release strategies
that abstract away a ton of the grunt work
and let us focus on the good stuff.
It really is the best of times.
On the other hand though, all of these conventions,
patterns, languages, what have you,
they can be misused.
In the wrong hands, the best of times really just means
that we can release terrible code faster than ever.
And feature flags really exemplify both the best and worst
of software development.
They're designed to make our lives easier
but often they do anything but.
But, I'm getting ahead of myself.
Let's focus on the good things for a moment.
What exactly is a feature flag, and why are they so powerful?
Features start off with an idea, right?
Everyone's ready to go, jazzed up to build something great
and deliver value to real live users.
But then they're hit with a couple of really hard truths.
First of all, writing software is hard work.
Developing a feature from beginning to end
takes a ton of development time.
Not to mention, orchestration across the entire team.
Ideas take work, turns out.
And once all this hard development work has been done,
typically you're sitting there
with this big fat diff's worth of work
just sitting there ready to go out
and it's really risky to release that.
The code's been tested, sure
but there are a ton of moving pieces
and some small bugs are bound to slip through.
And in thinking critically
about this sort of big bang release,
you're probably familiar with the type
where all of a sudden you ship a ton of code,
flip the lights on and everything looks totally different.
You realize that you're not sure
how this feature is gonna perform
with those real live users.
What would be ideal would be to test your idea
on a small subset of users and see how they react to it.
This is difficult to do though
with a gigantic monolithic release.
And finally, going back to an earlier point,
there are a ton of moving pieces in what you've built
beyond the things that you yourself or your teammates
have implemented.
Perhaps you're relying on a totally new data store
or a third party API.
The point is releasing all this at once is risky,
especially when you're not sure how your dependencies
are gonna scale.
And you can communicate this to the business
all you want, right?
They can have all the grand ideas they want
and you get to be the killjoy.
It's really hard, it's a ton of work,
I'm not sure how the system is gonna perform with users
or with our current infrastructure.
And they'll say, "Fine. That's nice."
So, let's go back to my really nice big feature idea.
How can we minimize the risk of releasing this grand idea
that I have?
And we all know
that the answer is to put it behind a feature flag.
At a high level, flags are configured in their own files,
typically YAML files, where you set the features
to on or off.
And these files usually start out pretty naively,
just a flat file with Boolean values, right?
Big feature off, maybe some other features are on,
and that's how they're configured.
And then throughout your code base,
rather than just implementing the feature,
you query that Boolean config value
and decide whether the user should see the feature
or just the other existing functionality.
Depending on the state of the flag,
the user will see one of two things.
Either the existing UI plus the brand new feature
that's behind the flag turned on
or just the plain old UI.
It's as if the feature doesn't exist to them.
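
A minimal Ruby sketch of the flat file and Boolean check just described. The flag names (`big_feature`, `other_feature`) and the `feature_enabled?` and `render_page` helpers are made up for illustration; they are not from the talk's slides or from any particular library.

```ruby
require "yaml"

# Hypothetical flat flag file: feature names mapped to Booleans.
FLAGS_YAML = <<~YAML
  big_feature: false
  other_feature: true
YAML

FLAGS = YAML.safe_load(FLAGS_YAML).freeze

# Unknown flags are treated as off, so a typo never turns a feature on.
def feature_enabled?(name)
  FLAGS.fetch(name.to_s, false) == true
end

# The calling code queries the flag and picks one of two code paths.
def render_page
  sections = ["existing UI"]
  sections << "big feature" if feature_enabled?(:big_feature)
  sections
end
```

With `big_feature` off, `render_page` returns only the existing UI, as if the feature didn't exist.
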
And when you think about it
flags solve a lot of the concerns that I mentioned earlier.
Well, if something is a lot of work
then you can release it iteratively.
You can send your code out behind an off flag
as you develop it.
You don't have to worry about big bang deploys
or speccing out months of work up front.
You can just release as you go.
Once the feature is ready to be shown to users
you could extend your flag to only be accessible
to a small group of those people.
For example, maybe in the beginning
you only want to show the feature
to internal employees for testing.
That's possible with feature flags.
And, as you gain confidence in the system,
you can slowly increase who sees the feature
until finally you've turned the flag on for everyone
and you get to go back through and remove the flag
and the conditional logic
and the feature's just on for everyone.
Which is pretty exciting, right?
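
One way the gradual rollout above might look in Ruby, as a sketch only. The `User` and `Flag` structs, the `internal_only` and `percentage` fields, and the CRC-based bucketing are all assumptions made for the example, not something the talk specifies.

```ruby
require "zlib"

# Hypothetical user record for the example.
User = Struct.new(:id, :internal, keyword_init: true)

# Hypothetical flag with an internal-only rule and a rollout percentage.
Flag = Struct.new(:name, :internal_only, :percentage, keyword_init: true) do
  def enabled_for?(user)
    # Early in the rollout, only internal employees see the feature.
    return user.internal if internal_only
    # Later, dial the percentage up from 0 to 100.
    bucket(user) < percentage
  end

  private

  # Stable bucket from 0 to 99, so a user keeps the same experience
  # from one request to the next.
  def bucket(user)
    Zlib.crc32("#{name}:#{user.id}") % 100
  end
end
```

Once `percentage` reaches 100 the check is true for everyone, which is the point at which the flag and its conditional can be removed.
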
And, as you might know,
it really just gets better from there.
I mentioned earlier this idea of toggling
or finely grained toggling between groups of users, right?
So beyond even just doing internal only
you can target maybe only French speakers
or only people within a certain mileage of San Antonio.
A really cool thing about feature flags
is that they can be really easily extended
to A/B testing, right?
With a flag you're either showing one thing to some users
or another thing to another group of users.
And then, once you've got that,
all you need to do to make it an A/B test
is just measure what happens afterwards.
So it's a pretty cool natural extension of a feature flag.
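
To make "just measure what happens afterwards" concrete, here is a small sketch in Ruby. The `Experiment` class, its in-memory counters, and the even/odd CRC split are hypothetical choices for the example; a real setup would persist events and likely use a proper analytics tool.

```ruby
require "zlib"

# Hypothetical in-memory experiment; a real one would persist its events.
class Experiment
  def initialize(name)
    @name = name
    @exposures = Hash.new(0)
    @conversions = Hash.new(0)
  end

  # The flag decision doubles as the variant assignment.
  def variant_for(user_id)
    Zlib.crc32("#{@name}:#{user_id}").even? ? :control : :treatment
  end

  def record_exposure(user_id)
    @exposures[variant_for(user_id)] += 1
  end

  def record_conversion(user_id)
    @conversions[variant_for(user_id)] += 1
  end

  # Conversions per exposure for one variant, or nil before any exposure.
  def conversion_rate(variant)
    return nil if @exposures[variant].zero?
    @conversions[variant].fdiv(@exposures[variant])
  end
end
```
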
And finally, a lot of teams use feature flags
as kill switches so when a part of their system fails
or a feature isn't performing as they expected
they're able to just turn off that part of the system
without taking the rest of the application down.
A good example of this is a page
that might be experiencing really high load
because people are pinging it and adding a ton of comments
to a blog post or something.
And rather than just taking the whole page down, right,
or the site breaking somehow,
you can just turn off the comments feature
and make it read only.
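
A sketch of the read-only comments example in Ruby. The `KillSwitches` and `PostPage` classes are invented for illustration; in practice the switch state would live somewhere that can be changed without a deploy.

```ruby
# Hypothetical runtime switches, kept in memory for the example.
class KillSwitches
  def initialize
    @disabled = {}
  end

  def disable(feature)
    @disabled[feature] = true
  end

  def enable(feature)
    @disabled.delete(feature)
  end

  def disabled?(feature)
    @disabled.key?(feature)
  end
end

# Hypothetical blog post page that degrades to read-only.
class PostPage
  attr_reader :comments

  def initialize(switches)
    @switches = switches
    @comments = []
  end

  # Existing comments still render; only new writes are refused.
  def add_comment(text)
    return :read_only if @switches.disabled?(:comments)
    @comments << text
    :created
  end
end
```
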
So, these are all some pretty cool natural extensions
of feature flags,
which sure sounds like the best of times.
What could go wrong, right?
Let's talk about what happens in practice
when a team decides to ship code behind a feature flag.
We always start out with the best of intentions, right?
I worked on a project last fall
where we were shipping something called faceted search.
It was for a retail client
and the search results would come from different sources.
Each source was considered a facet.
These sources were independent services
and the client didn't have a ton of confidence
in the performance of these services.
So we decided to put our feature behind a feature flag
and that's how we started.
A feature with enough ambiguity and dependencies
to warrant putting it behind a flag for testing purposes.
All was well.
Or at least all was well
until the inevitable scope change came up.
The external dependency wasn't ready
and we needed to move on to the next facet.
And, unfortunately, that facet needed to be behind
a totally separate feature flag
so that we could deploy the features
completely independently of each other.
And that's how our tidy, well-defined feature flag
became two feature flags.
In theory, this doesn't sound so bad.
Two facets, one flag each.
Toggling one facet or feature on or off
would be totally independent of the other,
until we started writing the code.
We had no idea what the user should see
if one feature was toggled on but not the other,
and vice versa.
These flags, which were seemingly independent,
were actually really tightly coupled to each other
just by virtue of them being associated
with the same broader feature, search.
My pair and I spent so much time
speaking to various stakeholders
that we ended up drawing a truth table
to make sure that everyone was on the same page
in terms of business logic.
And the truth table itself, it started out easy enough.
Both features on meant that both features would be shown
in the UI. Pretty simple.
And only one feature on meant that only that one feature
would be displayed. Fine.
Then it started to get confusing.
Where on the page would each feature be shown?
And what if the placement of even just one of these features
was dependent on the rest of the UI being fixed?
This was ridiculous.
I mean, sure it gave us a spec
of exactly what the page was supposed to look like
under every single possible condition.
And it was helpful when we completed the work
and we were ready for a QA on our team to look at it.
But it was ridiculous.
The feature was so complex and testing it was so challenging
that we were bound to introduce bugs.
And so we did.
There was a ton of back and forth with this QA
on our piece of work simply because no one was exactly sure
what the user was supposed to see.
Which brings me to my first point,
that treating independent flags
as a way to release a single feature iteratively
is really challenging.
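
The truth table described above can be generated rather than drawn, which also shows how quickly it grows. This is a Ruby sketch with hypothetical flag names (`facet_a`, `facet_b`); it covers only which facets are visible, not where they sit on the page, and placement was the part that got confusing.

```ruby
# Hypothetical facet flag names for the example.
FACET_FLAGS = [:facet_a, :facet_b].freeze

# Every on/off combination of the flags: 2**n rows for n flags.
def truth_table(flags)
  [true, false].repeated_permutation(flags.size).map do |states|
    flags.zip(states).to_h
  end
end

# The "easy" column of the table: which facets are shown at all.
def visible_facets(row)
  row.select { |_flag, on| on }.keys
end
```

Two flags give four rows to specify and test; a third flag on the same feature would double that to eight.
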
And at the time, about a year ago, our team realized
that we had a couple of options, right?
We could either scrap the project, push back on PM
and just say we needed to wait until the service was ready.
Which, at the time, didn't seem very realistic.
So we went with option two, which was to go ahead
but be really careful
about testing this feature.
And that sounds fine except for the fact
that you can't really automate carefulness, right?
Especially in manual testing.
Something's always gonna slip through.
And all of this time spent clicking through,
manually testing things, it got tiring.
And we realized that all of this time spent testing
was actually time that we weren't getting feedback
from our client or from real users.
Which we thought was the whole point
of putting something behind a flag
and getting it out there.
So, this was kind of a messy situation
and what we realized is that a key thing about feature flags
is that they ought to be simple.
They shouldn't be adding a ton of complexity to your app.
In fact, they should be just like very simple
Boolean values wrapping the complexity.
You shouldn't have to think a ton about implementing them
or working with them.
And if you're spending a lot of time implementing the flag
rather than the feature it wraps,
feature flags probably aren't going to solve the problem
that you have.
In fact, you might even be using these flags
where they shouldn't be used.
The moral of my faceted search results story
is that we quickly learned that we should stick to one flag
at a time for any given feature
because even when you manually QA the hell out of something
bugs are bound to happen,
so we might as well go with something that's simple enough
to allow us to focus on implementing our feature
as opposed to the flag.
So now let's go a little further with one of those flags
and talk about how it was actually implemented.
The search feature that I mentioned earlier
was delivering different kinds of results
that we call facets, right?
And we ended up placing this flag for this facet
in the back end, which at the time seemed okay.
We deliver search results to the UI without having to toggle
based on the flag's value there.
The UI would be feature flag agnostic,
which we thought was pretty cool.
I have some pseudocode up of what our approach looked like.
In order for us not to tell the frontend
that we were making backend changes
we ended up appending logic to the existing backend method
that performed the query.
And that was fine in the short term.
It was temporary code anyway, right?
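
The slide's pseudocode is not part of this transcript. The Ruby below is not that code, only a guess at the general shape of "appending logic to the existing backend method"; every name in it (`SEARCH_FLAGS`, `existing_results`, `new_facet_results`, `search`) is hypothetical.

```ruby
# Hypothetical flag state; tests of `search` have to set this up too.
SEARCH_FLAGS = { new_facet: false }

# Stand-ins for the existing query and the new facet's service.
def existing_results(query)
  ["#{query} from catalog"]
end

def new_facet_results(query)
  ["#{query} from new facet"]
end

# The flag check is appended to the existing backend method, so the
# frontend never knows a flag exists, but this method and every test
# of it now depend on flag state.
def search(query)
  results = existing_results(query)
  results += new_facet_results(query) if SEARCH_FLAGS[:new_facet]
  results
end
```
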
Over time, however, these flags started spreading
throughout the code base.
As other teams implemented other features
we realized that each team
had a totally different implementation strategy.
And for us, with our backend strategy,
unit testing was particularly difficult.
We had to tell each test about the state of our flag.
These flags were temporary,
which meant that we spent a ton of time
writing what were supposed to be small isolated tests.
In the end, these tests ended up knowing a lot more
about the system than what a unit test traditionally should.
My gif doesn't work, which is sad.
So removing toggles is really hard with this strategy.
We had to go back and update backend code once again
after the toggle
was dialed up to 100%.
This meant that we spent a lot more time
cleaning up the code
because both the logic and those unit tests that I mentioned
were coupled to the flag concept.
And the worst part about this
was that we were breaking tests as we went back through
and removed the flag, right?
Because all of our unit tests had to know
about the flag's state.
So this wasn't actually just a very simple refactoring
because we weren't in a state where all of our tests
were giving us any confidence.
In fact, what we were doing in removing that flag
was more like feature development.
Which just meant more opportunity for bugs and errors
and more necessity for additional testing.
Removing a flag had become really time intensive
for our team.
And this was the absolute best case
wherein a team takes the time afterwards
to clean up the flag once it's ready.
Often, engineers move on to a new feature
before having the opportunity to remove the flag.
It's saved for later.
The team becomes scared to remove it
as it probably flags some critical piece of functionality.
There's the gif.
This is how flags that were intended to be temporary
become immortalized as kill switch flags.
The team doesn't really know how to clean it up
or maybe just figures that such an important part
of the code base that's placed behind a flag of this gravity
deserves this kind of finely grained control.
What's terrible about this
was that the code wasn't necessarily written
in such a way that it should stick around forever.
It's just needless complexity.
Conditional statements, config values
that exist in your code base forever.
It's additional mental cycles
that a developer needs to spend
every time they pull up a file
touching the flag.
This isn't to say though
that all kill switches are always bad.
Sometimes they work really well.
Stack Overflow, for example, uses a kill switch
to disable posting questions and answers
when they're undergoing maintenance.
These kill switches can come in really handy
which we'll go into a little more later,
but what's important to note is that this kill switch
was architected to stick around permanently.
It's a whole different beast
from a more temporary feature flag.
In thinking about flagging strategies
for these temporary feature flags,
I have a couple recommendations.
My first is that your team should come up with an approach
for isolating the flag.
It shouldn't be used across the code base.
In fact, ideally it should only be referenced
in a single place.
Doing so will lessen confusion and room for error
when you or your teammates go to remove the flag.
And if we take this idea of feature flag isolation
a little further, I'd recommend that the team
adopt the habit of treating flagged code as new code
that is completely separate from the existing product.
Again, this encourages ease of removal.
Beyond that, it clarifies
exactly what a feature flag is supposed to do
and precisely what the state of the product will be
once the flag is ripped out.
In the chance your team decides not to release the feature
to everyone, it will be obvious
exactly what will need to be removed.
There will be less wrestling with the code
in that case and less wrestling means
a decreased chance of cruft or bugs creeping in.
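
A sketch of what isolating the flag might look like in Ruby, assuming the flagged work can live in its own class. `BasicSearch`, `FacetedSearch`, `FLAG_STATE`, and `search_for` are names invented for the example.

```ruby
# Hypothetical flag lookup for the example.
FLAG_STATE = { faceted_search: false }

# Existing behavior, untouched by the flag work.
class BasicSearch
  def results(query)
    ["#{query} from catalog"]
  end
end

# Flagged code lives in its own class and knows nothing about the flag.
class FacetedSearch
  def results(query)
    ["#{query} from catalog", "#{query} from new facet"]
  end
end

# The single place that references the flag. Removing the flag means
# deleting this method and keeping exactly one of the two classes.
def search_for(flags = FLAG_STATE)
  flags[:faceted_search] ? FacetedSearch.new : BasicSearch.new
end
```

Because neither class checks the flag, each can be unit tested without any flag setup, and those tests survive the flag's removal.
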
So at this point, we've talked about the genesis
of a feature flag and how it can warp
before it makes it into production,
and the kinds of code we write when shipping flags.
Usually this is the end of the line for a flag.
We ship a feature, turn the flag on, and we're done.
Oftentimes, we forget about the flag
or the person who wrote it changes projects
or maybe leaves the company entirely.
Sometimes we tell ourselves
that we'll just leave it in there for some period of time
just to ensure
that the feature it's wrapping performs correctly.
We'll come back to it later.
It's not a big deal.
Most code bases end up with a config file
thousands of lines long denoting flags
that have long outlived their utility.
This is gross.
These flags left unattended represent mental energy
that the team has to spend every time they touch a file
dealing with the flag,
they're maintenance overhead.
So, sure these flags are technical debt.
But beyond technical debt, these flags are risky.
They're a configuration and potentially code
that was engineered to be a temporary thing.
They probably haven't been tested
in the way that other code has,
and they probably aren't understood
by the current team members.
A while back a financial services organization
called Knight Capital developed a tool called Power Peg
that was built to mimic changes in stock prices
in order to test their trading algorithms
so that this code didn't actually run
on the real stock market.
They placed it behind a feature flag
and eventually turned it off in all environments.
It was turned off for eight years,
which kind of makes sense, right?
You write something risky, the maintainers leave,
it gets turned off.
No one wants to remove it years later,
so instead they just end up building stuff around it.
They don't want to touch it, it's not theirs.
They don't know how it works.
It makes sense.
Years later another team in a whole different part of Knight
wrote some code, threw a feature flag on it,
and shipped the code.
What they didn't realize was that their flag
had the same name as the flag
that kept Power Peg turned off.
When the stock market opened the next morning
the defective Power Peg code
caused Knight to send millions of orders
into the stock market.
The system didn't have a kill switch,
and they did manual deployments,
so over the course of 45 minutes Knight Capital
sunk 460 million dollars
into incorrectly priced stocks.
Eventually, the New York Stock Exchange stepped in
and triggered a circuit breaker
to halt all trading on these stocks.
This debacle caused Knight Capital's stock price to collapse
and they never really bounced back from this.
They were acquired a couple months later.
Which is a pretty hellish story.
Obviously, this is something that we'd all like to prevent
from happening in our own production environments,
and in thinking about how exactly this had happened
what it comes down to is that there was a long-lived flag
that didn't need to be in production anymore.
No one was maintaining it.
It was just legacy code that no one wanted to touch.
And long-lived feature flags
are a relatively common problem, right?
I mentioned earlier that tons of organizations
have code bases with config files thousands of lines long.
Conventional wisdom with this issue
is that you can just make the problem more visible.
The thinking is that visibility equates to empathy
which eventually means more buy-in across the organization
for time or people to solve the problem.
In this case, go in and remove the flags.
Typically folks will try to slap this kind of thing
on a dashboard.
And sometimes it'll even be put on a screen
around the office for the whole team to look at.
And this is fine.
It's a good first step, but to me it lacks agency.
Everyone can see that you have an old flag,
but if it's been sitting around in the code for a while
chances are that that's not gonna come as a surprise
to anyone on your team, right?
They're pulling up the code base every day
and looking at those flags.
It's more likely that most folks
just probably don't want to remove it
for whatever reason.
So what you really need is a forcing function
or a way to delegate the problem to a specific person.
Let's go back to the flag configuration file
I mentioned earlier,
which was just key value pairs, right?
This flag on, that other flag off,
just that whole flat file of Boolean values.
The solution that we came to at Kickstarter
was to implement additional fields for each feature flag.
In addition to on or off, we added a deadline field,
which specified exactly when that flag should be removed.
And we also added a maintainer field,
which delegated exactly who was responsible
for the flag's eventual removal.
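
A sketch of a flag file with the extra fields, and of finding flags past their deadline, in Ruby. The key names (`enabled`, `deadline`, `maintainer`), the dates, and the maintainer names are placeholders, not Kickstarter's actual format.

```ruby
require "yaml"
require "date"

# Hypothetical flag file: each flag carries its state, a removal
# deadline, and the person responsible for removing it.
FLAGS_YAML = <<~YAML
  big_feature:
    enabled: true
    deadline: "2015-12-01"
    maintainer: "alice"
  other_feature:
    enabled: false
    deadline: "2016-03-15"
    maintainer: "bob"
YAML

FlagEntry = Struct.new(:name, :enabled, :deadline, :maintainer, keyword_init: true)

def load_flags(yaml)
  YAML.safe_load(yaml).map do |name, attrs|
    FlagEntry.new(
      name: name,
      enabled: attrs.fetch("enabled"),
      deadline: Date.iso8601(attrs.fetch("deadline")),
      maintainer: attrs.fetch("maintainer")
    )
  end
end

# Flags whose deadline has passed as of the given date.
def expired_flags(flags, today)
  flags.select { |flag| flag.deadline < today }
end
```

Using `fetch` means a flag added without a deadline or maintainer fails loudly when the file is loaded.
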
On our CI server every time tests are run
we then check if each flag actually is past its deadline.
If it is, we create a new branch, amend the config file
to contain a message to the maintainer,
and commit the branch cc'ing the maintainer
in the commit message.
We use the GitHub flow really heavily at Kickstarter
which means that the maintainer soon gets an email
that she was cc'd on a commit.
And, ya know, mistakes happen.
Maybe the maintainer's on vacation
or maybe she's left the project or the organization.
Ya know, maybe this branching thing doesn't always work.
So in that case, our CI server, if the branch exists
and has been around for a while, which in our case
is more than two days, we commit again.
And instead of cc'ing the maintainer we escalate it
and cc the entire team.
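
The decision the CI run makes for each flag could be sketched in Ruby like this. Only the decision is modeled; creating the branch and commit is left out, and `reminder_action`, its parameters, and the returned hashes are hypothetical.

```ruby
require "date"

# The talk's threshold: escalate once the reminder branch has been
# around for more than two days.
ESCALATE_AFTER_DAYS = 2

# Decide what the CI run should do for one flag.
#   deadline           - Date the flag should be removed by
#   reminder_opened_on - Date the reminder branch was created, or nil
def reminder_action(deadline:, reminder_opened_on:, maintainer:, today:)
  return { action: :none } unless today > deadline

  if reminder_opened_on.nil?
    # First reminder goes to the flag's maintainer only.
    { action: :open_reminder, cc: [maintainer] }
  elsif (today - reminder_opened_on) > ESCALATE_AFTER_DAYS
    # Nobody acted on the reminder, so the whole team gets cc'd.
    { action: :escalate, cc: [:whole_team] }
  else
    { action: :wait }
  end
end
```
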
And this works for us, but this strategy is admittedly
pretty Kickstarter specific, right?
I'm talking about our CI server,
our GitHub workflow all that, but I think that the approach
of identifying a maintainer and a deadline
can take a lot of different shapes,
depending on your individual code base
and your team culture.
So beyond a dashboard or a visualization,
my recommendation for teams trying to avoid
long-lived feature flags
is to define a lifespan for each flag.
The flag and its cleanup should be delegated
to an individual or a team in order to ensure its removal.
Most importantly, teams should be ruthless
about deleting these flags
that have just outlived their utility.
I think we can all agree that feature flags
introduce complexity and overhead into a code base
with the end goal of getting something out faster.
There are obvious trade-offs here.
Feature flags, at the end of the day,
they create technical debt.
But they also help us achieve some really great things,
primarily releasing better code faster and more iteratively.
But without careful consideration
the damage that feature flags can wreak on a code base
or even on a production environment
can quickly outweigh the benefits.
So, a team has to really think about what it means
to do them right.
Get the worst stuff out of the way
and focus on the best of times,
which to me is leveraging these flags
to ship great code.
Thank you.