Ship Confidently with Progressive Delivery and Experimentation
by Asa Schachar

A book of best practices to enable your engineering team to adopt feature flags, phased rollouts, A/B testing, and other proven techniques to deliver the right features faster and with confidence.

Contents
Intro
01 Get started with progressive delivery and experimentation
02 Scale from tens to hundreds of feature flags and experiments
03 Enable company-wide experimentation and safe feature delivery
04 Use progressive delivery and experimentation to innovate faster

Intro

Today, we're living through the third major change in the way companies deliver better software products. First came the Agile Manifesto, encouraging teams to iterate quickly based on customer feedback, resulting in teams building and releasing features in small pieces. Second, as software development moved to the cloud, many teams also adopted DevOps practices like continuous integration and continuous delivery to push code to production more frequently, further reducing risk and increasing speed to market for new features.

However, today's most successful software companies, like Google, Facebook, Amazon, Netflix, Lyft, Uber, and Airbnb, have gone one step further, ushering in a third major change. They move fast by releasing code to small portions of their traffic, and they improve confidence in product decisions by testing their hypotheses with real customers. Instead of guessing at the best user experience and deploying it to everyone, these companies progressively release new features to live traffic in the form of gradual rollouts, targeted feature flags, and A/B tests. This process helps product and engineering teams reduce uncertainty, make data-driven decisions, and deliver the right experience to end users faster. When customers are more engaged with features and products, it ultimately drives retention and increased revenue for these businesses.

The name for this third major shift is progressive delivery and experimentation. Progressive delivery with feature flags gives teams new ways to test in production and validate changes before releasing them to everyone, instead of scrambling to roll back changes or deploy hotfixes. With experimentation, development teams gain the confidence of knowing they're building the most impactful products and features because they can validate product decisions well before expensive investments are lost. At the core of this practice is a platform that gives teams the ability to control the release of new features to production, decouple deployment from feature enablement, and measure the impact of those changes with real users in production.

It's not enough to continuously integrate and continuously deliver new code. Progressive delivery and experimentation enable you to test and learn, so you can move quickly with the confidence that you're delivering the right features.

In today's fast-moving world, it's no longer enough to just ship small changes quickly. Teams must master these new best practices to test and learn in order to deliver better software, products, and growth faster. As with any new process or platform, incorporating progressive delivery and experimentation into your software development and delivery process can bring up many questions for engineering teams:

How can we get started with feature flagging and A/B testing without creating technical debt down the line?
How can we scale to thousands of flags and still retain good governance and QA processes?

How can we adopt these new practices organization-wide without slowing down development and delivery?

By going from a foundational understanding to learning how to avoid pitfalls in process, technology, and strategy, this book will enable any engineering team to successfully incorporate feature flags, phased rollouts, and data-driven A/B tests as core development practices. By following this journey, you and your team will unlock the power of progressive delivery and experimentation like the software giants already have.

This book is intended for software engineers and software engineering leaders, but adjacent disciplines will also find it useful. If your job involves software engineering, product, or quality assurance, then you're in the right place.

01 Get started with progressive delivery and experimentation

In building for experimentation, you'll first want to understand the different ways that software products can enable progressive delivery and experimentation. In this first chapter, we'll cover the basics of feature flags, phased rollouts, and A/B tests, along with best practices for implementing them. You'll discover how these techniques fit together, and how an effective feature flag process can evolve into product experimentation and statistically rigorous A/B testing. Once you've implemented these techniques, you'll be able to ship the right features safely.

Feature flags: Enable feature releases without deploying code

A feature flag (aka feature toggle), in its most basic form, is a switch that allows product teams to enable or disable functionality of their product without deploying new code changes. For example, let's say you're building a new front-end dashboard. You could wait until the entire dashboard code is complete before merging and releasing. Alternatively, you could put unfinished dashboard code behind a feature flag that is currently disabled and only enable the flag for your users once the dashboard works as expected. In code, a feature flag might look like the following:

if (isFeatureEnabled('new_dashboard')) { // true or false
  showNewDashboard();
} else {
  showExistingDashboard();
}

Feature flags toggle features on and off, giving you another layer of control over what your users experience.

When the function isFeatureEnabled is connected to a database or remote connection to control whether it returns true or false, we can dynamically control the visibility of the new dashboard without deploying new code. Even with simple feature flags like this one, you get the following benefits:

Seamless releases: Instead of worrying about the feature code merging and releasing at the right time, you can use a feature flag for more control over when features are released. Use feature flags for a fully controllable product launch, allowing you to decide whether the code is ready to show your users.

Feature kill switches: If a release goes wrong, a feature flag can act as a kill switch that allows you to quickly turn off the feature and mitigate the impact of a bad release.

Trunk-based development: Instead of having engineers work in long-lived, hard-to-merge, conflict-ridden feature branches, the team can merge code faster and more frequently in a trunk-based development workflow. [1]

A platform for experimentation: A simple feature flag is the start of a platform that enables experimentation, as we will see later in this chapter.
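To make this concrete, here is a minimal sketch of one way isFeatureEnabled might be implemented. The in-memory flagConfig object and the stubbed render functions are placeholders for illustration; in practice the configuration would come from a remote flag service or database, as described above.

// Minimal sketch: an in-memory flag store standing in for a remote flag service.
const flagConfig = {
  new_dashboard: { enabled: false },
};

function isFeatureEnabled(flagKey) {
  const flag = flagConfig[flagKey];
  return Boolean(flag && flag.enabled);
}

// Placeholder render functions from the book's dashboard example.
function showNewDashboard() { console.log('new dashboard'); }
function showExistingDashboard() { console.log('existing dashboard'); }

// Once flagConfig is loaded from a remote source instead of hard coded,
// flipping the stored value changes behavior without a code deploy.
if (isFeatureEnabled('new_dashboard')) {
  showNewDashboard();
} else {
  showExistingDashboard();
}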
Feature rollouts: Enable safe, feedback-driven releases

Basic feature flags are either enabled or disabled for everyone, but feature flags become more powerful when you can control whether a feature flag is exposed to a portion of your traffic. A feature rollout is the idea that you only enable a feature for a subset of your users at a time rather than all at once. In our dashboard example, let's say you have many different users for your application. By providing a user identifier to isFeatureEnabled, the method has just enough information to return different values for different users.

// true or false depending on the user
if (isFeatureEnabled('new_dashboard', 'user123')) {
  showNewDashboard();
} else {
  showExistingDashboard();
}

Rollouts allow you to control which subset of users can see your flagged feature.

There are two general ways to perform a feature rollout, targeted and random, and each suits different use cases.

01 Targeted rollout: For specific users first

A targeted rollout enables features for specific users at a time, allowing for different types of experimentation.

Experiment with beta software: If you have power users or early adopters who are excited to use your features as soon as they are developed, you can release your features directly to these users first, getting valuable feedback early on while the feature is still subject to change.

Experiment across regions: If your users behave differently based on their attributes, like their country of origin, you can use targeted rollouts to expose features to specific subsets of users or configure the features differently for each group.

Experiment with prototype designs: Similarly, for new designs that dramatically change the experience for users, it's useful to target specific test users to see how these changes will realistically be used when put in the hands of real users.

02 Random rollout: For small samples of users first

Another way of performing a feature rollout is by random percentage. Perhaps at first you show the new feature to only 10% of your users; then a week later, you enable the new feature for 50% of your users; and finally, you enable the feature for all 100% of your users in the third week. With a phased rollout that introduces a feature to parts of your traffic, you unlock several different types of experimentation.

Experiment with gradual launches: Instead of focusing on a big launch day for your feature that has the risk of going horribly wrong for all your users, you can slowly roll out your new feature to fractions of your traffic as an experiment, making sure you catch bugs early and mitigate the risk of losing user trust. Gradually rolling out your features limits your blast radius to only the affected customers rather than your entire user base, which limits negative user sentiment and potential revenue loss.

Experiment with risky changes: A random rollout can give you confidence that particularly risky changes, like data migrations, large refactors, or infrastructure changes, are not going to negatively impact your product or business.

Experiment with scale: Performance and scale issues are challenging. Unlike other software problems, they are hard to predict and hard to simulate. Often, you only find out that your feature has a scale problem when it's too late. Instead of waiting for your performance dashboards to show ugly spikes in response times or error rates, a phased rollout can help you foresee real-world scale problems beforehand by revealing the scaling characteristics on part of your traffic first.

Experiment with painted-door tests: If you're not sure whether you should build a feature, you may consider running a painted-door test [2] where you build only the suggestion of the feature and analyze how different random users react to the appearance of the new feature. For instance, adding a new button or navigation element in your UI for a feature you're considering can show you how many people interact with it.
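As an illustrative sketch, one way a rollout-aware isFeatureEnabled might look is shown below. The audience and rolloutPercent fields are a hypothetical configuration shape, not any particular vendor's format, and the character-code checksum is only a stand-in for the deterministic hashing covered later in this chapter.

// Sketch: a flag definition combining a targeted audience with a percentage rollout.
const flagConfig = {
  new_dashboard: {
    enabled: true,
    audience: { country: 'DE' },   // targeted rollout: only users with matching attributes
    rolloutPercent: 10,            // random rollout: 10% of the matching traffic
  },
};

function isFeatureEnabled(flagKey, userId, attributes = {}) {
  const flag = flagConfig[flagKey];
  if (!flag || !flag.enabled) return false;

  // Targeted rollout: every audience attribute must match the user's attributes.
  const audience = flag.audience || {};
  const inAudience = Object.keys(audience).every((key) => attributes[key] === audience[key]);
  if (!inAudience) return false;

  // Random rollout: map the user deterministically into the 0-99 range.
  const bucket = [...`${userId}:${flagKey}`]
    .reduce((sum, ch) => sum + ch.charCodeAt(0), 0) % 100;
  return bucket < flag.rolloutPercent;
}

console.log(isFeatureEnabled('new_dashboard', 'user123', { country: 'DE' }));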
Best practice: Balancing low-latency decisions with dynamism

When the logic of your codebase depends on the return value of a function like isFeatureEnabled, you'll want to make sure that isFeatureEnabled returns its decision as fast as possible. In the worst case, if you rely on an external database to store whether the feature is enabled, you risk increasing the latency of your application by requiring a round-trip network request across the internet, even when the new feature is not enabled. In the best case for performance, the isFeatureEnabled function is hard coded to true or false as a variable or environment variable, but then you lose the ability to change the value of the feature flag without code deploys or reconfiguring your application.

So, one of the first challenges of feature flags is striking this balance between a low-latency decision and the ability to change that decision dynamically and quickly. There are multiple methods for achieving this balance to suit the needs and capabilities of different applications. An architecture that strikes this balance well, and is suitable on many platforms, will:

1. Fetch the feature flag configuration when the application starts up
2. Cache the feature flag configuration in-memory so that decisions can be made with low latency
3. Listen for updates to the feature flag configuration so that updates are pushed to the application in as close to real time as possible
4. Poll for updates to the feature flag configuration at regular intervals, so that if a push fails, the application is still guaranteed to have the latest feature configuration within some well-defined interval

As an example, a mobile application may initially fetch the feature flag configuration when the app is started, then cache the feature flag configuration in-memory on the phone. The mobile app can use push notifications to listen for updates to the feature flag configuration and also poll on regular, 10-minute intervals to ensure that the feature flags stay up to date in case a push notification fails.

Diagram: the feature flag configuration is managed from an admin panel and consumed by client apps and app servers.
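A rough sketch of that fetch, cache, listen, and poll pattern might look like the following. The fetchFlagConfigFromService and subscribeToFlagUpdates callbacks are hypothetical stand-ins for whatever transport your flag service provides; the 10-minute default mirrors the mobile example above.

// Sketch of a flag client implementing the four steps described above.
class FlagClient {
  constructor({ fetchFlagConfigFromService, subscribeToFlagUpdates, pollIntervalMs = 10 * 60 * 1000 }) {
    this.fetchConfig = fetchFlagConfigFromService;
    this.config = {};            // 2. in-memory cache so decisions are low latency

    this.refresh();              // 1. fetch the configuration when the application starts up

    // 3. listen for pushed updates (e.g. a push notification or websocket)
    if (subscribeToFlagUpdates) {
      subscribeToFlagUpdates((config) => { this.config = config; });
    }

    // 4. poll as a fallback so a missed push still converges within the interval
    this.timer = setInterval(() => this.refresh(), pollIntervalMs);
  }

  async refresh() {
    try {
      this.config = await this.fetchConfig();
    } catch (err) {
      // Keep serving the cached configuration if the network request fails.
    }
  }

  isFeatureEnabled(flagKey) {
    const flag = this.config[flagKey];
    return Boolean(flag && flag.enabled);  // answered from memory, no network round trip
  }
}

At startup you would construct one client, optionally await its first refresh(), and then answer every isFeatureEnabled call from the in-memory cache.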
A/B tests: Make data-driven product decisions

With phased rollouts, your application has the ability to simultaneously deliver two different experiences: one with the feature on and another with the feature off. But how do you know which one is better? And how can you use data to determine which is best for your users and the metrics you care about? An A/B test can point you in the right direction.

By shipping different versions of your product simultaneously to different portions of your traffic, you can use the usage data to determine which version is better. If you're resource constrained, you can simply test the presence or absence of a new feature to validate whether it has a positive, negative, or neutral impact on application performance and user metrics. By being precise about how, when, and who is exposed to these different feature configurations, you can run a controlled product experiment, get statistically significant data, and be scientific about developing the features that are right for your users, rather than relying on educated guesses. If you want to use objective data to resolve differing opinions within your organization, then running an A/B test is right for you.

An A/B test splits traffic for the new feature so that some users get variation A and others get variation B.

01 Best practice: Deterministic experiment bucketing with hashing instead of Math.random()

If you're building a progressive delivery and experimentation platform, you may be tempted to rely on a built-in function like Math.random() to randomly bucket users into variations. Once bucketed, a user should only see their assigned variation for the lifetime of the experiment. However, introducing Math.random() adds indeterminism to your codebase, which will be hard to reason about and hard to test later, and storing the result of the bucketing decision forces your platform to be stateful. A better approach is to use hashing as a random but deterministic and stateless way of bucketing users.

To visualize how hashing can be used for bucketing, represent the traffic to your application as a number line from 0 to 10,000. For an experiment with two variations of 50% each, the first 5,000 numbers of the line correspond to the 50% of your traffic that will get variation A, and the second 5,000 numbers correspond to the 50% of your traffic that will receive variation B. The bucketing process is then simplified to assigning a number between 0 and 10,000 to each user.

Using a hashing function that takes as input the user id (ex: user123) and the experiment id (ex: homepage_experiment) or feature key, and outputs a number between 0 and 10,000, you achieve that numbered assignment of variations:

hash('user123', 'homepage_experiment') -> 6756 // variation B

For example, hash('user123') might map to 1812, which falls in variation A's half of the range, while hash('user456') maps to 5934 and hash('user981') maps to 8981, both of which fall in variation B's half.

Because of the properties of a good hash function, you are always guaranteed a deterministic but effectively random output given the same inputs, which gives you several benefits:

Your application runs predictably for a given user.
Automated tests run predictably because the inputs can be controlled.
Your progressive delivery and experimentation platform stays stateless by re-evaluating the hashing function at any time rather than storing the result.
A large hashing range like 0 to 10,000 allows assigning traffic at granular increments of 0.01%.
The same pseudo-random bucketing can be used for random phased rollouts.
You can exclude a percentage of traffic from an experiment by excluding a portion of the hashing range.
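To make the hashing approach concrete, here is a minimal sketch. It assumes a Node.js environment and uses the built-in crypto module's MD5 hash purely for illustration; a real platform would likely choose a faster non-cryptographic hash such as MurmurHash. The identifiers and the 50/50 split mirror the example above.

const crypto = require('crypto');

// Deterministically bucket a user by hashing the user id and experiment id
// into the 0-9,999 range described above.
function bucket(userId, experimentId) {
  const digest = crypto.createHash('md5').update(`${userId}:${experimentId}`).digest();
  const point = digest.readUInt32BE(0) % 10000;  // same inputs always land on the same point
  return point < 5000 ? 'variation_a' : 'variation_b';  // 50/50 split
}

console.log(bucket('user123', 'homepage_experiment'));  // stable across calls, no stored state

Because the function is pure, the same user always lands on the same point in the range, and holding back a slice of the 0 to 9,999 range is enough to exclude a portion of traffic from the experiment.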
02 Best practice: Use A/B tests for insight on specific metrics

A/B tests make the most sense when you want to test different hypotheses for improving a specific metric. The following are examples of the types of metrics that allow you to run statistically significant A/B tests.

Product metrics: For product improvements

Landing page signups: Ever wonder which landing page would lead to the most signups for your product? During the 2008 presidential campaign, Barack Obama's optimization team ran several A/B tests to determine the best image of Obama and corresponding button text to put on the landing page of the campaign website. These A/B tested adjustments increased signups and led to $60 million of additional donations from the website.

Referral signups through shares: Want to know which referral program would increase the virality of your product most cost-effectively through sharing? Ride-sharing services like Lyft and Uber often experiment on the amount of money to reward users for bringing other users to their platforms (ex: give $20 to a friend and get $20 yourself). It's important to get this amount right so the cost of growth doesn't negatively impact your business in the long term.

Operational metrics: For infrastructure improvements

Latency & throughput: If engineers are debating which implementation will perform best under real-world conditions, you can gather statistical significance on which solution is more performant with metrics like throughput and latency.

Error rates: If your team is working on a platform or language shift and has a theory that your application will result in fewer errors after the change, then error rates can serve as a metric to determine which platform is more stable.
03 Best practice: Capture event information for analyzing an A/B test

When instrumenting for A/B testing and tracking metrics, it's important to track both impression and conversion events because each type includes key information about the experiment.

Impression event: An impression event occurs when a user is assigned to a variation of an A/B test. For these events, the following information is useful to send as a payload to an analytics system: an identifier of the user, an identifier of the variation the user was exposed to, and a timestamp of when the user was exposed to the variation. With this information as an event in an analytics system, you can attribute all subsequent actions (or conversion events) that the user takes to being exposed to that specific variation.

Conversion event: A conversion event corresponds to the desired outcome of the experiment. Looking at the example metrics above, you could have conversion events for when a user signs up, when a user shares a product, the time it takes a dashboard to load, or when an error occurs while using the product. For conversion events, the following information is useful to send as a payload to an analytics system: an identifier of the user, an identifier of the type of event that happened (ex: signup, share, etc.), and a timestamp.

Example impression events

USER ID    VARIATION ID    TIMESTAMP
Caroline   original        2019-10-08T02:13:01
Dennis     free-shipping   2019-10-08T05:30:46
Flynn      free-shipping   2019-10-09T01:15:51

Example conversion events

USER ID    EVENT ID      VALUE    TIMESTAMP
Caroline   purchase      50       2019-10-08T00:05:32
Dennis     purchase      30       2019-10-08T00:07:19
Flynn      add_to_cart   10       2019-10-09T01:11:20
Flynn      add_to_cart   5        2019-10-09T01:14:23
Erin       signed_up     -        2019-11-09T12:02:36

Note: The identifiers used in the tables above are just for illustration purposes. Typically, identifiers are globally unique, non-identifiable strings of digits and characters. Also note that a value is included for conversion events that are non-binary (ex: how much money was associated with a purchase event). Binary events, like someone signing up, do not have a value associated with them: they either happened or did not.
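As a sketch of what this instrumentation might look like, the helpers below build impression and conversion payloads with the fields described above. The analytics object and its track method are hypothetical stand-ins for whatever analytics system you send events to.

// Hypothetical helpers; analytics.track stands in for your analytics system's API.
function trackImpression(analytics, userId, experimentId, variationId) {
  analytics.track({
    type: 'impression',
    userId,                                // identifier of the user
    experimentId,
    variationId,                           // identifier of the variation the user saw
    timestamp: new Date().toISOString(),   // when the user was exposed
  });
}

function trackConversion(analytics, userId, eventId, value = null) {
  analytics.track({
    type: 'conversion',
    userId,
    eventId,                               // ex: 'signup', 'purchase', 'add_to_cart'
    value,                                 // null for binary events like a signup
    timestamp: new Date().toISOString(),
  });
}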
04 Best practice: Avoid common experiment analysis pitfalls

Once you have the above events, you can run an experiment analysis to compare the number of conversion events in each variation and determine which one is statistically stronger. However, experiment analysis is not always straightforward. It's best to consult data scientists or trained statisticians to help ensure your experiment analysis is done correctly. Although this book does not dive deep into statistics, you should keep an eye out for these common pitfalls.

Multiple comparisons: Creating too many variations or evaluating too many metrics will increase the likelihood of seeing a false positive just by chance. To avoid that outcome, make sure the variations of your experiment are backed by a meaningful hypothesis, or use a statistical platform that provides false discovery rate control.

Small sample size: If you calculate the results of an A/B test when only a small number of users have been exposed to the experiment, the results may be due to random chance rather than the difference between variations. Make sure your sample size is big enough for the statistical confidence you want.

Peeking at results: A classical experiment should be set up and run to completion before any statistical analysis is done to determine which variation is a winner. This is referred to as fixed-horizon testing. Allowing experimenters to peek at the results before the experiment has reached its sample size increases the likelihood of seeing false positives and making the wrong decisions based on the experiment. However, in the modern digital world, employing solutions like sequential testing can allow analysis to be done in real time during the course of the experiment.

A/B/n tests go beyond two variations to test feature configurations

A simple feature flag is just an on-and-off switch that corresponds to the A and B variations of an A/B test. However, feature flags become more powerful when they expose not only whether the feature is enabled, but also how the feature is configured. For example, if you were building a new dashboard feature for an email application, you could expose an integer variable that controls the number of emails to show in the dashboard at a time, a boolean variable that determines whether you show a preview of each email in the dashboard list, or a string variable that controls the button text to remove emails from the dashboard list. By exposing feature configurations as remote variables, you can enable A/B tests beyond just two variations.

In the dashboard example, you can experiment not only with turning the dashboard on or off, but also with different versions of the dashboard itself. You can see whether email previews and fewer emails on screen enable users to go through their email faster. In code, a feature configuration might look like the following:

{
  title: "Latest Updates",
  color: "#36d4FF",
  num_items: 3,
}
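As an illustrative sketch, the function below consumes such a configuration when rendering the dashboard. getFeatureConfig is a hypothetical stand-in for however your platform returns the remote variables assigned to a user's variation, and the defaults mirror the configuration above.

// getFeatureConfig is a hypothetical stand-in for your platform's API for
// fetching the remote variables assigned to this user's variation.
function renderDashboardHeader(getFeatureConfig, userId) {
  const config = getFeatureConfig('new_dashboard', userId) || {
    title: 'Latest Updates',   // fall back to defaults when the user is not in the test
    color: '#36d4FF',
    num_items: 3,
  };
  return `<h1 style="color: ${config.color}">${config.title}</h1>` +
         `<p>Showing ${config.num_items} emails</p>`;
}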
Feature flag driven development

One challenge besides knowing what to feature flag for an experiment is knowing how to integrate this new process into your team's existing software development cycle. Taking a look at each of your in-progress initiatives and asking questions upfront can help you build feature flags into your standard development process. For example, by asking "How can we de-risk this project with feature flags?" you highlight how the benefits of feature flags outweigh the cost of an expensive bug in production or a disruptive, off-hours hotfix. Similarly, by asking "How can we run an experiment to validate or invalidate our hypothesis for why we should build this feature?" you will find that spending engineering time building the wrong feature is much more costly than investing in a well-designed experiment. These questions should increase your overall net productivity by enabling your team to move toward a progressive delivery and experiment-driven process.

Best practice: Ask feature-flag questions in technical design docs

You can start to incorporate feature flags into your development cycle by asking a simple question in your technical design document template. For instance, by asking "What feature flag or experiment are you using to roll out or validate your feature?" you insert feature flags into discussions early in the development process. Because technical design docs are used as a standard process for large features, the document template is a natural place for feature flags to help de-risk complex or big launches.

02 Scale from tens to hundreds of feature flags and experiments

After completing your first few experiments, you will probably want to take a step back and start thinking about improvements to help scale your experimentation program. The following best practices are things you will need to consider when scaling from tens to hundreds of feature flags and experiments.

Decide when to feature flag, roll out, or A/B test

One challenge with experiments, feature flags, and rollouts is that you may be tempted to use them for every possible change. It's a good idea to recognize that even in an advanced experimentation organization, you likely won't be feature flagging every single change or A/B testing every feature. A high-level decision tree can be useful when determining when to run an experiment or set up a feature behind a flag, starting from what you're working on: documentation, a refactor, a bug, or a feature.

Best practice: Don't feature flag or A/B test every change.

Reduce complexity by centralizing constants

When dealing with experiments or feature flags, it's best practice to use a human-readable string identifier or key to refer to the experiment so that the code describes itself. For example, you might use the key 'promotional_discount' to refer to the feature flag powering a promotional discount feature that is enabled for certain users. It's easy to define a constant for this feature flag or experiment key exactly where it's going to be used in your codebase. However, as you start using a lot of feature flags and experiments, your codebase will soon be riddled with keys. Centralizing the constants into one place can help.

01 Centralize constants to visualize complexity

Centralizing all feature keys gives you a better sense of how many features exist in a codebase. This enables an engineering team to develop processes around revisiting the list of feature flags for removal. Having a sense of how many feature flags exist also gives a sense of the codebase and product complexity. Some organizations may decide to limit the number of active feature flags to reduce this complexity.

Best practice: Compile all the feature flags currently in use in an application into a single constants file.

02 Share constants to keep applications in sync

Some feature flags and experiments are cross-product or cross-codebase. By having a centralized list of feature key constants, you are less likely to have typos that prevent this type of coordination across your product. For example, let's say you have a feature flag file that is stored in your application backend but passed to your frontend. This way, the frontend and backend not only reference the same feature keys, but you can also deliver a consistent experience across the frontend and backend.

Best practice: Use the same feature key constants across the backend and frontend of your application.
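A sketch of such a constants file might look like the following; the file name and keys are illustrative and echo examples used earlier in this book.

// featureFlags.js - one place for every feature flag and experiment key,
// importable from both the backend and the frontend build.
const FeatureFlags = {
  NEW_DASHBOARD: 'new_dashboard',
  PROMOTIONAL_DISCOUNT: 'promotional_discount',
  HOMEPAGE_EXPERIMENT: 'homepage_experiment',
};

module.exports = FeatureFlags;

// Usage elsewhere in the codebase:
// const FeatureFlags = require('./featureFlags');
// if (isFeatureEnabled(FeatureFlags.PROMOTIONAL_DISCOUNT, userId)) { ... }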
Document your feature flags to understand original context

As a company implements more feature flags and experiments, the codebase accumulates more identifiers or keys referencing these items (like site_redesign_phase_1 or homepage_hero_experiment). However, the keys used in code will inevitably lack the full context of what the rollout or experiment actually does. For example, if an engineer saw site_redesign_phase_1, it's unclear what the redesign includes or what is part of phase 1. Although you could just increase the verbosity of these keys so that they are self-explanatory, it's a better practice to have a process by which anyone can understand the original context or documentation behind a given feature rollout or experiment.

Best practice: Make sure your team can easily find out:

What the rollout or experiment changes
The owner of the rollout or experiment
Whether the rollout or experiment can be safely paused or rolled back
Whether the lifetime of the experiment or rollout is temporary or permanent

Oftentimes, engineering teams rely on an integration between their task tracking system and their feature flag and experiment service to add this context to their feature flags and experiments.

Ensure quality with the right automated testing & QA strategy

Ensuring your software works before your users try it out is paramount to building trustworthy solutions. A common way for engineering teams to ensure their software is running as expected is to write automated unit, integration, and end-to-end tests. However, as you scale experimentation, you'll need a strategy to ensure you still deliver high-quality software without an explosion of automated tests. Trying to test every combination of feature flags is not sustainable.

As an example, let's say you have an application with 10 features and 10 corresponding automated tests. If you add just 8 on/off feature flags, you theoretically now have 2^8 = 256 possible flag combinations, which is more than 25 times as many states as you had tests to begin with. Because testing every possible combination is practically impossible, you'll want to get the most value out of the automated tests you do write. Make sure you understand the different levels of automated testing, which include: