Ship Confidently with Progressive Delivery and Experimentation

A book of best practices to enable your engineering team to adopt feature flags, phased rollouts, A/B testing, and other proven techniques to deliver the right features faster and with confidence.

by Asa Schachar

Intro

Today, we're living through the third major change in the way companies deliver better software products. First came the Agile Manifesto, encouraging teams to iterate quickly based on customer feedback, resulting in teams building and releasing features in small pieces. Second, as software development moved to the cloud, many teams also adopted DevOps practices like continuous integration and continuous delivery to push code to production more frequently, further reducing risk and increasing speed to market for new features.

However, today's most successful software companies like Google, Facebook, Amazon, Netflix, Lyft, Uber, and Airbnb have gone one step further, ushering in a third major change. They move fast by releasing code to small portions of their traffic and improve confidence in product decisions by testing their hypotheses with real customers. Instead of guessing at the best user experience and deploying it to everyone, these companies progressively release new features to live traffic in the form of gradual rollouts, targeted feature flags, and A/B tests. This process helps product and engineering teams reduce uncertainty, make data-driven decisions, and deliver the right experience to end users faster. When customers are more engaged with features and products, it ultimately drives retention and increased revenue for these businesses.

The name for this third major shift is progressive delivery and experimentation. Progressive delivery with feature flags gives teams new ways to test in production and validate changes before releasing them to everyone, instead of scrambling to roll back changes or deploy hotfixes. With experimentation, development teams gain the confidence of knowing they're building the most impactful products and features because they can validate product decisions well before expensive investments are lost.

It's not enough to continuously integrate and continuously deliver new code. Progressive delivery and experimentation enable you to test and learn, so you can move quickly with the confidence you're delivering the right features.

At the core of this practice is a platform that gives teams the ability to control the release of new features to production, decouple deployment from feature enablement, and measure the impact of those changes with real users in production.

As with any new process or platform, incorporating progressive delivery and experimentation into your software development and delivery process can bring up many questions for engineering teams:

How can we get started with feature flagging and A/B testing without creating technical debt down the line?
How can we scale to thousands of flags and still retain good governance and QA processes?
How can we adopt these new practices organization-wide without slowing down development and delivery?

In today's fast-moving world, it's no longer enough to just ship small changes quickly. Teams must master these new best practices to test and learn in order to deliver better software, products, and growth faster. By going from a foundational understanding to learning how to avoid pitfalls in process, technology, and strategy, this book will enable any engineering team to successfully incorporate feature flags, phased rollouts, and data-driven A/B tests as core development practices. By following this journey, you and your team will unlock the power of progressive delivery and experimentation like the software giants already have.
This book is intended for software engineers and software engineering leaders, but adjacent disciplines will also find it useful. If your job involves software engineering, product, or quality assurance, then you're in the right place.

Contents

01 Get started with progressive delivery and experimentation (p4)
02 Scale from tens to hundreds of feature flags and experiments (p17)
03 Enable company-wide experimentation and safe feature delivery (p26)
04 Use progressive delivery and experimentation to innovate faster (p38)

01 Get started with progressive delivery and experimentation

In building for experimentation, you'll first want to understand the different ways that software products can enable progressive delivery and experimentation. In this first chapter, we'll cover the basics of feature flags, phased rollouts, and A/B tests, and best practices for implementing them. You'll discover how these techniques all fit together, and how an effective feature flag process can easily evolve into product experimentation and statistically rigorous A/B testing. Once you've implemented these techniques, you'll be able to ship the right features safely.

Feature flags: Enable feature releases without deploying code

A feature flag (aka feature toggle), in its most basic form, is a switch that allows product teams to enable or disable functionality of their product without deploying new code changes.

For example, let's say you're building a new front-end dashboard. You could wait until the entire dashboard code is complete before merging and releasing. Alternatively, you could put unfinished dashboard code behind a feature flag that is currently disabled and only enable the flag for your users once the dashboard works as expected.

Feature flags toggle features on and off, giving you another layer of control over what your users experience.
In code, a feature flag might look like the following:

if (isFeatureEnabled('new_dashboard')) { // true or false
  showNewDashboard();
} else {
  showExistingDashboard();
}

When the function isFeatureEnabled is connected to a database or remote connection to control whether it returns true or false, we can dynamically control the visibility of the new dashboard without deploying new code. Even with simple feature flags like this one, you get the benefits of:

Seamless releases
Instead of worrying about the feature code merging and releasing at the right time, you can use a feature flag for more control over when features are released. Use feature flags for a fully controllable product launch, allowing you to decide whether the code is ready to show your users.

Trunk-based development
Instead of having engineers work in long-lived, hard-to-merge, conflict-ridden feature branches, the team can merge code faster and more frequently in a trunk-based development workflow.[1]

Feature kill switches
If a release goes wrong, a feature flag can act as a kill switch that allows you to quickly turn off the feature and mitigate the impact of a bad release.

Platform for experimentation
A simple feature flag is the start of a platform that enables experimentation, as we will see later in this chapter.

[Figure: a feature flag, or toggle, turns a new feature on or off for consumers.]

Feature rollouts: Enable safe, feedback-driven releases

Basic feature flags are either enabled or disabled for everyone, but feature flags become more powerful when you can control whether a feature flag is exposed to a portion of your traffic. A feature rollout is the idea that you only enable a feature for a subset of your users at a time rather than all at once.

In our dashboard example, let's say you have many different users for your application. By providing a user identifier to isFeatureEnabled, the method will have just enough information to return different values for different users.
// true or false depending on the user
if (isFeatureEnabled('new_dashboard', 'user123')) {
  showNewDashboard();
} else {
  showExistingDashboard();
}

Rollouts allow you to control which subset of users can see your flagged feature.

There are two general ways to perform a feature rollout, targeted and random, which both suit different use cases.

01 Targeted rollout: For specific users first

A targeted rollout enables features for specific users at a time, allowing for different types of experimentation.

Experiment with beta software
If you have power users or early adopters who are excited to use your features as soon as they are developed, then you can release your features directly to these users first, getting valuable feedback early on while the feature is still subject to change.

Experiment with prototype designs
Similarly, for new designs that dramatically change the experience for users, it's useful to target specific test users to see how these changes will realistically be used when put in the hands of real users.

Experiment across regions
If your users behave differently based on their attributes, like their country of origin, you can use targeted rollouts to expose features to specific subsets of users or configure the features differently for each group.

02 Random rollout: For small samples of users first

Another way of performing a feature rollout is by random percentage. Perhaps at first you show the new feature to only 10% of your users; then a week later, you enable the new feature for 50% of your users; and finally, you enable the feature for all 100% of your users in the third week. With a phased rollout that introduces a feature to parts of your traffic, you unlock several different types of experimentation.
Experiment with gradual launches: Instead of focusing on a big launch day for your feature that has the risk of going horribly wrong for all your users, you can slowly roll out your new feature to fractions of your traffic as an experiment to make sure you catch bugs early and mitigate the risk of losing user trust. Gradually rolling out your features limits your blast radius to only the affected customers versus your entire user base. This process limits negative user sentiment and potential revenue loss.

Experiment with risky changes: A random rollout can give you confidence that particularly risky changes, like data migrations, large refactors, or infrastructure changes, are not going to negatively impact your product or business.

Experiment with scale: Performance and scale issues are challenging. Unlike other software problems, they are hard to predict and hard to simulate. Often, you only find out that your feature has a scale problem when it's too late. Instead of waiting for your performance dashboards to show ugly spikes in response times or error rates, using a phased rollout can help you foresee any real-world scale problems beforehand by understanding the scaling characteristics of parts of your traffic first.

Experiment with painted-door tests: If you're not sure whether you should build a feature, you may consider running a painted-door test[2] where you build only the suggestion of the feature and analyze how different random users are affected by the appearance of the new feature. For instance, adding a new button or navigation element in your UI for a feature you're considering can show you how many people interact with it.

[Figure: in a feature rollout, only some users get the new feature.]

Best practice: Balancing low-latency decisions with dynamism

When the logic of your codebase depends on the return value of a function like isFeatureEnabled, you'll want to make sure that isFeatureEnabled returns its decision as fast as possible.
In the worst case, if you rely on an external database to store whether the feature is enabled, you risk increasing the latency of your application by requiring a roundtrip external network request across the internet, even when the new feature is not enabled. In the best-case performance, the isFeatureEnabled function is hard coded to true or false either as a variable or environment variable, but then you lose the ability to dynamically change the value of the feature flag without code deploys or reconfiguring your application.

So, one of the first challenges of feature flags is striking this balance between a low-latency decision and the ability to change that decision dynamically and quickly.

Relying on an external database to store whether your feature is enabled can increase latency, while hard coding the variable diminishes your ability to dynamically change it. Find an architecture that strikes a balance between the two.

There are multiple methods for achieving this balance to suit the needs and capabilities of different applications. An architecture that strikes this balance well and is suitable on many platforms will:

1 Fetch feature flag configuration when the application starts up
2 Cache feature flag configuration in-memory so that decisions can be made with low latency
3 Listen for updates to the feature flag configuration so that updates are pushed to the application in as real time as possible
4 Poll for updates to the feature flag configuration at regular intervals, so if a push fails, the application is still guaranteed to have the latest feature configuration within some well-defined interval

As an example, a mobile application may initially fetch the feature flag configuration when the app is started, then cache the feature flag configuration in-memory on the phone.
The mobile app can use push notifications to listen for updates to the feature flag configuration as well as poll on regular, 10-minute intervals to ensure that the feature flags are always up to date in case the push notification fails.

[Figure: an admin panel distributes feature flag configuration to client apps and app servers.]

A/B tests: Make data-driven product decisions

With phased rollouts, your application has the ability to simultaneously deliver two different experiences: one with the feature on and another with the feature off. But how do you know which one is better? And how can you use data to determine which is best for your users and the metrics you care about? An A/B test can point you in the right direction. By shipping different versions of your product simultaneously to different portions of your traffic, you can use the usage data to determine which version is better. If you're resource constrained, you can simply test the presence or absence of a new feature to validate whether it has a positive, negative, or neutral impact on application performance and user metrics.

By being precise with how, when, and who is exposed to these different feature configurations, you can run a controlled product experiment, get statistically significant data, and be scientific about developing the features that are right for your users, rather than relying on educated guesses. If you want to use objective-truth data to resolve differing opinions within your organization, then running an A/B test is right for you.
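In code, serving an A/B test looks much like checking a feature flag, except the decision function returns which variation the user should see, and the application records that exposure. The sketch below is purely illustrative: getVariation is a hypothetical helper backed by a hard-coded assignment table that stands in for real variation assignment, and trackImpression is a hypothetical stand-in for sending an event to an analytics system.

```javascript
// Hypothetical sketch of serving an A/B test. The hard-coded
// assignment table stands in for real variation assignment.
const assignments = { user123: 'A', user456: 'B' };

function getVariation(experimentId, userId) {
  return assignments[userId] || 'A';
}

// Record which variation each user was exposed to (an "impression").
const impressions = [];
function trackImpression(userId, experimentId, variation) {
  impressions.push({ userId, experimentId, variation, ts: Date.now() });
}

function renderDashboard(userId) {
  const variation = getVariation('new_dashboard_experiment', userId);
  trackImpression(userId, 'new_dashboard_experiment', variation);
  return variation === 'B' ? 'new dashboard' : 'existing dashboard';
}
```

With both variations live simultaneously, the recorded impressions can later be joined with conversion events to compare the two experiences.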
[Figure: in an A/B test, some users get variation A and others get variation B.]

01 Best practice: Deterministic experiment bucketing — hashing over Math.random()

If you're building a progressive delivery and experimentation platform, you may be tempted to rely on a built-in function like Math.random() to randomly bucket users into variations. Once bucketed, a user should only see their assigned variation for the lifetime of the experiment. However, introducing Math.random() adds indeterminism to your codebase, which will be hard to reason about and hard to test later. Storing the result of the bucketing decision also forces your platform to be stateful.

A better approach is to use hashing as a random but deterministic and stateless way of bucketing users. To visualize how hashing can be used for bucketing, let's represent the traffic to your application as a number line from 0 to 10,000. For an experiment with two variations of 50% each, the first 5,000 numbers of your number line can correspond to the 50% of your traffic that will get variation A, and the second 5,000 numbers can correspond to the 50% of your traffic that will receive variation B. The bucketing process is simplified to assigning a number between 0 and 10,000 for each user.
Using a hashing function that takes as input the user id (ex: user123) and experiment id (ex: homepage_experiment) or feature key and outputs a number between 0 and 10,000, you achieve that numbered assignment for assigning variations:

hash('user123', 'homepage_experiment') -> 6756 // variation B

Because of the properties of a good hash function, you are always guaranteed a deterministic but random output given the same inputs, which gives you several benefits:

Your application runs predictably for a given user
Automated tests run predictably because the inputs can be controlled
Your progressive delivery and experimentation platform is stateless by re-evaluating the hashing function at any time rather than storing the result
A large hashing range like 0 to 10,000 allows assigning traffic at increments as fine as 0.01%
The same pseudo-random bucketing can be used for random phased rollouts
You can exclude a percentage of traffic from an experiment by excluding a portion of the hashing range

[Figure: hash() maps user123 to 1812, so it gets bucketed into variation A (0-5000), while user981 maps to 5934 and user456 to 8981, so they get bucketed into variation B (5000-10000).]

02 Best practice: Use A/B tests for insight on specific metrics

A/B tests make the most sense when you want to test different hypotheses for improving a specific metric. The following are examples of the types of metrics that allow you to run statistically significant A/B tests.

Product metrics: For product improvements

Landing page signups
Ever wonder which landing page would lead to the most signups for your product? During the 2008 presidential campaign, Barack Obama's optimization team ran several A/B tests to determine the best image of Obama and corresponding button text to put on the landing page of the campaign website. These A/B tested adjustments increased signups and led to $60 million of additional donations from their website.
Referral signups through shares
Want to know which referral program would most cost-effectively increase the virality of your product through sharing? Ride-sharing services like Lyft and Uber often experiment on the amount of money to reward users for bringing other users to their platforms (ex: give $20 to a friend and get $20 yourself). It's important to get this amount right so the cost of growth doesn't negatively impact your business in the long term.

Operational metrics: For infrastructure improvements

Latency & throughput
If engineers are debating over which implementation will perform best under real-world conditions, you can gather statistical significance on which solution is more performant with metrics like throughput and latency.

Error rates
If your team is working on a platform or language shift and has a theory that your application will result in fewer errors after the change, then error rates can serve as a metric to determine which platform is more stable.

03 Best practice: Capture event information for analyzing an A/B test

When instrumenting for A/B testing and tracking metrics, it's important to track both impression and conversion events because each type includes key information about the experiment.

Impression event
An impression event occurs when a user is assigned to a variation of an A/B test. For these events, the following information is useful to send as a payload to an analytics system: an identifier of the user, an identifier of the variation the user was exposed to, and a timestamp of when the user was exposed to the variation. With this information as an event in an analytics system, you can attribute all subsequent actions (or conversion events) that the user takes to being exposed to that specific variation.

Conversion event
A conversion event corresponds to the desired outcome of the experiment.
Looking at the example metrics above, you could have conversion events for when a user signs up, when a user shares a product, the time it takes a dashboard to load, or when an error occurs while using the product. With conversion events, the following information is useful to send as a payload to an analytics system: an identifier of the user, an identifier of the type of event that happened (ex: signup, share, etc.), and a timestamp.

Example impression events

USER ID    VARIATION ID    TIMESTAMP
Caroline   original        2019-10-08T02:13:01
Dennis     free-shipping   2019-10-08T05:30:46
Flynn      free-shipping   2019-10-09T01:11:20

Example conversion events

USER ID    EVENT ID      VALUE   TIMESTAMP
Caroline   purchase      50      2019-10-08T00:05:32
Dennis     purchase      30      2019-10-08T00:07:19
Flynn      add_to_cart   5       2019-10-09T01:14:23
Flynn      add_to_cart   10      2019-10-09T01:15:51
Erin       signed_up     -       2019-11-09T12:02:36

Note: The identifiers used in the tables above are just for illustration purposes. Typically, identifiers are globally unique, non-identifiable strings of digits and characters. Also note that a value is included for conversion events that are non-binary (ex: how much money was associated with a purchase event). Binary events, like someone signing up, do not have an associated value: they either happened or did not.

04 Best practice: Avoid common experiment analysis pitfalls

Once you have the above events, you can run an experiment analysis to compare the number of conversion events in each variation and determine which one is statistically stronger. However, experiment analysis is not always straightforward. It's best to consult data scientists or trained statisticians to help ensure your experiment analysis is done correctly. Although this book does not dive deep into statistics, you should keep an eye out for these common pitfalls.
Multiple comparisons
Creating too many variations or evaluating too many metrics will increase the likelihood of seeing a false positive just by chance. To avoid that outcome, make sure the variations of your experiment are backed by a meaningful hypothesis or use a statistical platform that provides false discovery rate control.

Small sample size
If you calculate the results of an A/B test when only a small number of users have been exposed to the experiment, the results may be due to random chance rather than the difference between variations. Make sure your sample size is big enough for the statistical confidence you want.

Peeking at results
A classical experiment should be set up and run to completion before any statistical analysis is done to determine which variation is a winner. This is referred to as fixed-horizon testing. Allowing experimenters to peek at the results before the experiment has reached its sample size increases the likelihood of seeing false positives and making the wrong decisions based on the experiment. However, in the modern digital world, employing solutions like sequential testing can allow analysis to be done in real time during the course of the experiment.

A/B/n tests go beyond two variations to test feature configurations

A simple feature flag is just an on-and-off switch that corresponds to the A and B variations of an A/B test. However, feature flags can become more powerful when they expose not only whether the feature is enabled, but also how the feature is configured. For example, if you were building a new dashboard feature for an email application, you could expose an integer variable that controls the number of emails to show in the dashboard at a time, a boolean variable that determines whether you show a preview of each email in the dashboard list, or a string variable that controls the button text to remove emails from the dashboard list.
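A sketch of what such a configurable flag might look like in code. The shape of the configuration object and the getFeatureVariable helper are illustrative assumptions, not a specific vendor's API:

```javascript
// Illustrative sketch: a flag that exposes configuration variables,
// not just on/off. All names here are hypothetical.
const featureConfigs = {
  new_dashboard: {
    enabled: true,
    variables: {
      num_emails: 25,               // integer: emails shown at a time
      show_preview: true,           // boolean: preview each email?
      remove_button_text: 'Remove', // string: button copy
    },
  },
};

function getFeatureVariable(featureKey, variableName, defaultValue) {
  const feature = featureConfigs[featureKey];
  // Fall back to a safe default when the feature is off or unknown.
  if (!feature || !feature.enabled) return defaultValue;
  const value = feature.variables[variableName];
  return value === undefined ? defaultValue : value;
}
```

Each distinct combination of variable values can then serve as one variation of an A/B/n test.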
By exposing feature configurations as remote variables, you can enable A/B tests beyond just two variations. In the dashboard example, you can experiment not only with turning the dashboard on or off, but also with different versions of the dashboard itself. You can see whether email previews and fewer emails on screen will enable users to go through their email faster.

Feature configuration

{
  title: "Latest Updates",
  color: "#36d4FF",
  num_items: 3,
}

Feature flag driven development

One challenge besides knowing what to feature flag for an experiment is knowing how to integrate this new process into your team's existing software development cycle. Taking a look at each of your in-progress initiatives and asking questions upfront can help you build feature flags into your standard development process. For example, by asking "How can we de-risk this project with feature flags?" you highlight how the benefits of feature flags outweigh the cost of an expensive bug in production or a disruptive, off-hours hotfix. Similarly, by asking "How can we run an experiment to validate or invalidate our hypothesis for why we should build this feature?" you will find that spending engineering time building the wrong feature is much more costly than investing in a well-designed experiment. These questions should speed your overall net productivity by enabling your team to move toward a progressive delivery and experiment-driven process.

Best practice: Ask feature-flag questions in technical design docs: "How can we de-risk this project with feature flags?" and "How can we run an experiment to validate or invalidate our hypothesis for why we should build this feature?"

You can start to incorporate feature flags into your development cycle by asking a simple question in your technical design document template. For instance, by asking "What feature flag or experiment are you using to rollout/validate your feature?" you insert feature flags into discussions early in the development process. Because technical design docs are used as a standard process for large features, the document template is a natural place for feature flags to help de-risk complex or big launches.

02 Scale from tens to hundreds of feature flags and experiments

After completing your first few experiments, you will probably want to take a step back and start thinking about improvements to help scale your experimentation program. The following best practices are things you will need to consider when scaling from tens to hundreds of feature flags and experiments.

Decide when to feature flag, rollout, or A/B test

One challenge with experiments, feature flags, and rollouts is that you may be tempted to use them for every possible change. It's a good idea to recognize that even in an advanced experimentation organization, you likely won't be feature flagging every single change or A/B testing every feature. This high-level decision tree can be useful when determining when to run an experiment or set up a feature behind a flag.

Best practice: Don't feature flag or A/B test every change.

[Figure: decision tree for "Should I run an experiment or a rollout?" branching on whether you're working on a feature, a bug, a refactor, or docs.]

Reduce complexity by centralizing constants

When dealing with experiments or feature flags, it's best practice to use a human-readable string identifier or key to refer to the experiment so that the code describes itself. For example, you might use the key 'promotional_discount' to refer to the feature flag powering a promotional discount feature that is enabled for certain users. It's easy to define a constant for this feature flag or experiment key exactly where it's going to be used in your codebase.
However, as you start using a lot of feature flags and experiments, your codebase will soon be riddled with keys. Centralizing the constants into one place can help.

01 Centralize constants to visualize complexity

Best practice: Compile all your feature flags currently in use in an application into a single constants file.

Centralizing all feature keys gives you a better sense of how many features exist in a codebase. This enables an engineering team to develop processes around revisiting the list of feature flags for removal. Having a sense of how many feature flags exist also gives a sense of the codebase and product complexity. Some organizations may decide to have a limit on the number of active feature flags to reduce this complexity.

02 Share constants to keep applications in sync

Best practice: Use the same feature key constants across the backend and frontend of your application.

Some feature flags and experiments are cross-product or cross-codebase. By having a centralized list of feature key constants, you are less likely to have typos that prevent this type of coordination across your product. For example, let's say you have a feature flag file that is stored in your application backend but passed to your frontend. This way the frontend and backend not only reference the same feature keys but you can also deliver a consistent experience across the frontend and backend.

Document your feature flags to understand original context

As a company implements more feature flags and experiments, the codebase gets more identifiers or keys referencing these items (like: site_redesign_phase_1 or homepage_hero_experiment). However, the keys that are used in-code will inevitably lack the full context of what the rollout or experiment actually does. For example, if an engineer saw site_redesign_phase_1, it's unclear what the redesign includes or what is included in phase 1.
Although you could just increase the verbosity of these keys so that they are self-explanatory, it's a better practice to have a process by which anyone can understand the original context or documentation behind a given feature rollout or experiment.

Best practice: Make sure your team can easily find out:
What the rollout or experiment changes were
The owner for the rollout or experiment
If the rollout or experiment can be safely paused or rolled back
Whether the lifetime of this experiment or rollout is temporary or permanent

Oftentimes, engineering teams will rely on an integration between their task tracking system and their feature flag and experiment service to add context to their feature flags and experiments.

Ensure quality with the right automated testing & QA strategy

Ensuring your software works before your users try it out is paramount to building trustworthy solutions. A common way for engineering teams to ensure their software is running as expected is to write automated unit, integration, and end-to-end tests. However, because you're scaling experimentation, you'll need a strategy to ensure you still deliver high-quality software without an explosion of automated tests. Trying to test every combination is not going to be sustainable.

As an example, let's say you have an application with 10 features and 10 corresponding automated tests. If you add just 8 on/off feature flags, you theoretically now have 2^8 = 256 possible additional states, which is more than 25 times as many tests as you started with. Because testing every possible combination is nearly impossible, you'll want to get the most value out of writing automated tests. Make sure you understand the different levels of automated testing, which include:

01 Unit tests—test frequently for solid building blocks

Unit tests are the smallest pieces of testable code.
It's best practice that these units are so small that they are not aware of or affected by experiments or feature flags. As an example, if a feature flag forks into two separate code paths, each code path should have its own set of independent unit tests. You should frequently test these small units of code to ensure high code coverage, just as you would if you didn't have any feature flags or experiments in your codebase. If the code you are unit testing does need to contain code that is affected by a feature flag or experiment, take a look at the techniques of mocking and stubbing described in the integration tests section below.

[Figure: testing pyramid with unit tests at the base, then integration tests, end-to-end tests, and manual QA at the top.]

Best practice: Ensure the building blocks of your application are well tested with lots of unit tests. The smaller units are often unaware of experiment or feature state. For those units that are aware, use mocks and stubs to control this white-box testing environment.

02 Integration tests—force states to test code

For integration tests, you are combining units into higher-level business logic. This is where experiments and feature flags will likely affect the logical flow of the code, and you'll have to force a particular variation or a state of a feature flag in order to test the code. In some integration tests, you'll still have complete access to the code's executing environment where you can mock out the function calls to external systems or internal SDKs that power your experiments to force particular code paths to execute during your integration tests. For example, you can mock an isFeatureEnabled SDK call to always return true in an integration test. This removes any unpredictability, allowing your tests to run deterministically. In other integration tests, you may not have access to individual function calls, but you can still stub out API calls to external systems.
For instance, you can stub the data powering the feature flag or experimentation platform to return an experiment in a specific state and force a given code path. Although you can mock out the indeterminism coming from experiments or feature flags at this stage of testing, it's still best practice for your code and tests to have as little awareness of experiments or feature flags as possible, and to focus on whether the code paths of the variations execute as expected.

Best practice: Use mocks and stubs to control feature and experiment states. Focus on individual code paths to ensure proper integration and business logic.

03 End-to-end tests—focus testing on critical variations

End-to-end tests are the most expensive tests to write and maintain because they're often black-box tests that don't provide good control over their running environment, and you may have to rely on external systems. For this reason, avoid relying on end-to-end or fully black-box tests to verify every branch of every experiment or feature flag. This combinatorial explosion of end-to-end tests will slow down your product development. Instead, reserve end-to-end tests for the most business-critical paths of an experiment or feature flag, or use them to test the state of your application when most or all of your feature flags are in a given state. For example, you may want one end-to-end test for when all your feature flags are on, and another for when all your feature flags are off. The latter test can simulate what happens if the system powering your feature flags goes down and must degrade gracefully. When you do require end-to-end tests, make sure you can still control the experiment or feature-flag state to remove indeterminism. For example, in a web application, you may want a special test user, a special test cookie, or a special test query parameter that can be used to force a particular variation of an experiment or feature flag.
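One way such an override might look, as a hypothetical sketch rather than any specific framework's API, is a request handler that lets internal traffic pin a variation via a query parameter:

```python
def resolve_variation(params, is_internal_user, default_variation):
    """Pick the variation for a request, honoring a test-only override.

    params: dict of query parameters, e.g. {"force_variation": "treatment_A"}
    is_internal_user: True only for authenticated internal/test traffic
    default_variation: the normal (e.g. bucketed) assignment
    """
    forced = params.get("force_variation")
    # Only internal users may force a variation; everyone else gets
    # the normal assignment no matter what they put in the URL.
    if forced and is_internal_user:
        return forced
    return default_variation

# An internal tester pins the treatment; an external user cannot.
assert resolve_variation({"force_variation": "treatment_A"}, True, "control") == "treatment_A"
assert resolve_variation({"force_variation": "treatment_A"}, False, "control") == "control"
```

The internal-user check is the important part of the sketch: the override exists for tests and QA, not for end users.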
Note that when implementing these special overrides, be sure to make them internal-only so that your users don't have the same control over their own feature or experiment states.

Best practice: Do not test every possible combination of experiments or features with end-to-end tests. Instead, focus on important variations, or on tests that ensure your application still works when all features are on or off.

04 Manual verification (QA)—reserve for business-critical functions

Similar to end-to-end tests, manual verification of different variations can be difficult and time-consuming, which is why organizations typically keep only a few manual QA tests. Reserve manual verification for business-critical functions. If you implemented special parameters to control the states of experiments or feature flags for end-to-end tests, those same parameters can be used by a QA team to force a variation and verify a particular experience.

Best practice: Save time and resources by reserving manual QA for the most critical variations. Make sure you provide tools that let QA force feature and experiment states.

Increase safety with the right permissions

As more individuals contribute to your progressive delivery process, it becomes imperative to have safe and transparent practices. Permissioning, exposing user states, and emulation enable your team to keep the process secure and viable as you scale.

01 Establish permissions based on your roles

With rollouts and experiments, your team will typically have a dashboard where you can edit production configurations without making changes to the core development repository. With this setup, you'll want to consider which parts of your organization should be able to make changes to your rollouts and experiments.
Ideally, your permissions should match the permissioning you would typically use for feature development, which includes:

Read-level access
Almost everyone at your company should likely have at least read-level access, allowing them to see the rollouts and experiments actively running at any given time.

Edit-level access
Anyone who can do standard feature development should have standard edit-level access. However, it's best practice to require a higher level of edit access for important system-wide infrastructure or billing configuration.

Administrative access
Individuals with the ability to provision or change permissions for standard feature development should have administrative access to the feature flag and experimentation configuration setup. This allows an IT team or super user to provision and secure the above roles for individual developers on the team.

02 Expose user state for observability

As your company uses more feature flags and experiments, the different possible states that a given customer could experience begin to combinatorially explode. If one of those customers encounters an issue, you'll want to know which states of rollouts or experiments may be active in order to understand the full state of the world for that particular individual. It's best practice to have an interface (a UI, a command-line tool, or a dashboard) for querying which feature flags or experiments a given customer, user, or request receives, for increased observability of your system.

Best practice: Consider a centralized tool that lets anyone input a customerId and see what combination of features that particular customer has enabled.

If your bucketing tool is deterministic (it will always give the same bucketing decision given the same inputs), then you can easily provide a tool that takes the customer information as inputs and returns the state that customer should be experiencing as outputs.
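Deterministic bucketing is commonly implemented by hashing a stable user ID together with the experiment key, so the same inputs always map to the same variation. A minimal sketch (the hashing scheme here is illustrative, not any particular vendor's):

```python
import hashlib

def bucket(user_id, experiment_key, variations):
    """Deterministically assign a user to a variation.

    Hashing user_id together with experiment_key gives a stable
    pseudo-random number, so the same user always lands in the same
    variation and different experiments bucket independently.
    """
    digest = hashlib.sha256(f"{experiment_key}:{user_id}".encode()).hexdigest()
    index = int(digest, 16) % len(variations)
    return variations[index]

# Because the assignment is a pure function of its inputs, a debugging
# tool can recompute any customer's state on demand.
first = bucket("customer-42", "new_checkout", ["control", "treatment"])
assert bucket("customer-42", "new_checkout", ["control", "treatment"]) == first
```

This purity is what makes the "enter a customerId, see their feature states" tool cheap to build: there is no per-user assignment to store or look up.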
03 Allow emulation for faster debugging

Even if you know the particular combination of features and experiments a given customer has access to, it can sometimes be difficult to reason about why they are seeing a particular experience. This is a specific case of the more general problem of debugging complex customer issues. Many engineering organizations build the ability to emulate a user's view of the product, making it easier to see what the customer sees when debugging. If you use this technique, ensure you apply appropriate access controls, permissions, and restrictions. Some engineering organizations also let their production site load a development version of their code so engineers can test fixes on hard-to-replicate setups. Both techniques are extremely helpful in minimizing the time to debug issues that are only relevant to a certain combination of feature flag or experiment states.

03 Enable company-wide software experimentation

Scaling your experimentation program across your entire organization can become complicated quickly. With these best practices, you'll be able to minimize the complexity of an advanced system running hundreds of experiments or rollouts simultaneously.

Prevent emergencies with smart defaults

When you opt for a separate experimentation or rollout service to control your configuration, you must be prepared for when that service goes down. This is where smart defaults help, by answering the following questions:

If the feature flag service went down:
- Would you prefer that all users have the experience of the feature being on, or off?
- Would you prefer that all users get the version most recently delivered by the feature flag or experimentation service?
- Which configuration or feature variable values would you prefer your users to get?

Some organizations save a snapshot of the feature flag and experiment state in the codebase at a regular fixed interval.
This process provides a smart, local fallback that is fairly recent in case the remote system goes down.

Best practice: Think through all possible failure scenarios and prepare for them with smart defaults for when your feature flagging or experimentation services go down.

Avoid broken states with simple feature flag dependency graphs

As a company builds more features and experiments, an engineering team will likely end up building a feature flag or experiment on top of an existing one. Take, for example, a feature flag used to roll out a new dashboard UI. As you roll out the new dashboard, your team may want to experiment on a component of the new UI. Although you could manage both the rollout and the experiment with the same feature flag, there are reasons you might want to separate the two so the rollout can proceed independently of the experiment. In that setup, to see the different variations of the dashboard component, a user must both have the feature flag enabled and be in a particular variation of the experiment. You now have a dependency graph of your feature flags and experiments.

Naturally, your systems will develop more and more of these dependencies, where one feature depends on another. It's best practice to minimize these dependencies as much as possible and strive for feature flag combinations that won't break the application state. If feature flags A and B, but not C, result in a broken application setup, then it's likely your team lacks that contextual knowledge and could accidentally put your application in a bad state just by changing feature flag configurations. One option is to keep your feature flag hierarchy extremely shallow, for instance a simple 1:1 mapping between feature flags and customer-facing features, ensuring there are few dependencies between flags.

Best practice: Keep dependencies between flags simple to avoid broken states.
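One way to keep a dependency like the dashboard example explicit and safe is to evaluate the child experiment only when its parent flag is enabled, so no combination of flag values can show the component without the dashboard it lives in. A hypothetical sketch:

```python
def dashboard_component_variation(flags):
    """Resolve the component experiment only when the parent dashboard
    flag is on.

    flags: mapping of flag/experiment key to its current value, e.g.
    as returned by a hypothetical flag service.
    """
    # Parent flag off: old UI, so the child experiment never runs.
    if not flags.get("new_dashboard", False):
        return None
    # Parent on: the component experiment may run; default to control.
    return flags.get("dashboard_component_experiment", "control")

# No combination of child-flag values can show the component without
# the dashboard it belongs to.
assert dashboard_component_variation({"dashboard_component_experiment": "treatment"}) is None
assert dashboard_component_variation(
    {"new_dashboard": True, "dashboard_component_experiment": "treatment"}
) == "treatment"
```

Encoding the dependency in one gate function keeps the graph shallow and makes the "broken combination" case impossible by construction.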
Balance low latency, consistency, and compatibility in a microservice architecture

When you deploy multiple versions of features and experiments across multiple separate applications or services, you'll want to ensure the services are consistent in how they evaluate feature flags and experiments. If they aren't, with some services evaluating a feature flag as on and others evaluating it as off, you could run into backward- or forward-compatibility issues between services, and your application might break. Below are two options for developing in a microservice architecture.

01 Services independently evaluate feature state

The benefit of each service independently evaluating feature flag state on its own is that it minimizes that service's dependencies. The downside is that it requires updating every service. Also, if the services are truly independent, they will be less consistent about the state of a feature flag. For example, when you toggle a feature flag on, there will be a period when some services have received the update and evaluate the flag as on while others are still receiving the update and evaluate it as off. Eventually, all services will get the update and evaluate the flag as on; in other words, the independent services are eventually consistent. In this case, it's best practice to put in the extra work to make sure the different feature flag and experiment states are forward and backward compatible with the other services, to prevent unexpected states across services.

Best practice: Put in the extra work to make sure your different feature flag states are forward and backward compatible with the other services.

[Figure: Services are independent; Store.com native mobile user and Store.com browser user each reach services that evaluate flags on their own]

02 Services depend on a feature state service

In this architecture, all services depend on a centralized hub, ensuring they're consistent in the way they evaluate a feature flag or experiment.
Although this architecture is consistent and doesn't have to worry about backward and forward compatibility, it comes at the cost of latency. Because each service has to communicate with this separate feature flag or experimentation service, you add the latency necessary to achieve a consistent state across services.

Best practice: Expect some latency in exchange for consistent evaluation of feature flags or experiments across services.

A separate feature and experiment service does have added benefits. It is:
- Easily implemented in a microservice architecture
- Compatible with services in different languages, by exposing APIs in language-agnostic protocols like HTTP or gRPC
- Centralized, for easier maintenance, monitoring, and upgrading

[Figure: Services depend on a central service; Store.com native mobile user and Store.com browser user each reach services that consult the central feature state service]

Prevent technical debt by understanding feature flag and experiment lifetimes

As your organization uses more feature flags and experiments, it's paramount to understand that some of these changes are ephemeral and should be removed from your codebase before they become outdated and add technical debt and complexity.

One heuristic you can track is how long a feature flag or experiment has been in your system and how many different states it's in. If a feature flag has been in your system for a long time and all of your users have the same state for it, it should likely be removed. However, it's smart to always evaluate the purpose of a flag or experiment before removing it. The real lifetime of an experiment or feature flag depends heavily on its use case.

Best practice: To avoid technical debt, regularly review flags in case they're obsolete or should be deprecated, even if they're meant to be permanent.
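That staleness heuristic is easy to automate. A hypothetical sketch that surfaces candidates for removal when a flag is old and every user sees the same state (the data shape is illustrative):

```python
from datetime import datetime, timedelta

def stale_flags(flags, now, max_age_days=30):
    """Return flag keys worth reviewing for removal.

    flags: list of dicts like
      {"key": "new_checkout", "created": datetime, "states_in_use": {"on"}}
    A flag is a removal candidate when it has been live longer than
    max_age_days and every user receives the same state.
    """
    cutoff = now - timedelta(days=max_age_days)
    return [
        f["key"]
        for f in flags
        if f["created"] < cutoff and len(f["states_in_use"]) == 1
    ]

now = datetime(2024, 6, 1)
inventory = [
    {"key": "old_banner", "created": datetime(2024, 1, 1), "states_in_use": {"on"}},
    {"key": "live_test", "created": datetime(2024, 5, 25), "states_in_use": {"on", "off"}},
]
assert stale_flags(inventory, now) == ["old_banner"]
```

A report like this feeds naturally into the ticket-tracking integration and the recurring "flag removal day" described later; the final call on removal should still be a human review of the flag's purpose.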
01 Remove temporary flags and experiments

If a feature is designed to be rolled out to everyone and you don't expect to experiment on it once it's launched, ensure you have a ticket tracking the flag's removal as soon as the feature has been fully launched to your users. These temporary flags may last weeks, months, or quarters. Examples include:

Painted-door experiments
These experiments are intended for the early phases of the software development lifecycle and aren't meant to be kept in the product after they have validated or invalidated the experiment hypothesis.

Performance experiments
These experiments pit two different implementations against each other in a live, real-world performance comparison. Once enough data has been gathered to determine the more performant solution, it's usually best to move all traffic to the higher-performing variation.

Large-scale refactors
When moving between frameworks, languages, or implementation details, it's useful to deploy these rather risky changes behind a feature flag so that you have confidence they will not negatively impact your users or business. Once the refactor is done, however, you hopefully won't go back in the other direction.

Product re-brands
If your business decides to change the look and feel of your product for brand purposes, it's useful to have a rollout to gracefully move to the new branding. After the new branding is established, it's a good idea to remove the feature flag powering the switch.

02 Review permanent flags and experiments

If a feature is designed to have different states for different customers, or you want to control its configuration for operational purposes, the flag will likely stay in your product for a longer period of time.
Examples of these flags and experiments include:

Permission flags
These flags are useful if you have different permission levels in your product, like a read-only level that doesn't allow edit access to a feature. They are also useful if you have modular pricing, like an inexpensive "starter plan" that doesn't include a feature and a more costly "enterprise plan" that does.

Operational flags
These flags control the operational knobs of your application. For example, they can control whether you batch events sent from your application to minimize the number of outbound requests, or the number of machines used to horizontally scale your application. They can also be used to disable a computationally expensive, non-essential service, or to allow a graceful switchover from one third-party service to another during an outage.

Configuration-based software
For any software or product powered by a config file, that file is a great place to seamlessly insert experimentation that has a low maintenance cost and still allows endless product experimentation. For example, some companies have their product layout powered by a config file that describes, in abstract terms, which modules are included and how they are positioned on the screen. With this architectural setup, even if you aren't running an experiment right now, you can still enable future product experimentation.

Note that even if a flag is meant to be permanent, it's still paramount to review these flags regularly in case they are obsolete or should be deprecated. Otherwise, keeping these permanent flags may add technical debt to your codebase. Some organizations use an integration between a task-tracking system and their feature flag and experiment service to manage this cleanup process seamlessly and quickly.
If the state of feature flags and experiments can be synced with a ticket-tracking tool, an engineering manager can query for all feature flags and experiments whose state has not changed in the past 30 days and track down their owners to review them. Other organizations hold a recurring feature flag and experiment removal day, in which engineers review the oldest items in the list at a regular cadence.

Make your organization resilient with a code-ownership strategy

As with any feature you build, the individuals and teams that originally engineered a feature or experiment are not going to be around forever. As a best practice, your engineering organization should agree on who is responsible for owning and maintaining each feature or experiment. This is particularly important when you need to remove or clean up old feature flags or experiments. Options for ownership include:

01 Individual ownership

Individual developers are labeled as owners of a feature or an experiment. At a regular cadence, for example every two quarters, ownership is re-evaluated and transferred if necessary.

Pros: Simple and understandable.
Cons: Hard to maintain if engineers frequently move between projects or code areas.

02 Feature team ownership

The team responsible for building the feature takes ownership of the feature and experiment.

Pros: Resilient to individual contributor changes.
Cons: Hard to maintain if teams are constantly changing, or are unbalanced with an uneven distribution of ownership.

03 Centralized ownership

Ownership falls to a dedicated experimentation or growth team with experts who set up, run, and remove experiments. The downside is that this severely limits the scale of your experimentation program to the size of the central team. The upside is that this centralized team can be the experimentation experts at your company and help ensure experiments are always run with high quality.
This method can be especially helpful when getting started, and it's useful to have one team prove out the best practices before fanning them out to other parts of the organization.

Pros: Resilient to many changes and simplest to reason about.
Cons: A central team won't be expert in the areas where experiments are actually implemented and may require a lot of help from other teams. The size of the team will eventually limit the number of experiments and feature flags your company can handle.

Minimize on-call surprises by treating changes like deploys

Companies often have core engineering hours for a given product or feature, for example a core team working Monday through Friday in similar time zones. Even companies that practice continuous integration and continuous delivery recognize that deploying production code changes outside of these core working hours, either late at night or on weekends, is usually a bad idea. If something goes wrong with a deploy outside working hours, teams risk shipping issues when the full team is not around, and with fewer teammates it's slower to fix the issue and mitigate its impact. Because experiments and rollouts give individuals the ability to easily make production changes, it's best practice to treat changes to an experiment or a rollout with the same level of care as standard software deploys. Avoid making changes when no one is around or aware of the change. If it is advantageous to make changes during off-hours, do so transparently, with proper communication, so no one is caught by surprise.

Best practice: Save rollout or experiment changes for core working hours, and avoid making these changes on Fridays or before a weekend or holiday. If you must, do it responsibly.
Expect changes and recover from mistakes faster with audit logs

When someone at your organization deploys code changes to your product, it's best practice to have a change log that lets everyone know what the change was, who made it, and why. With this information, you won't be surprised by changes to your product or by a user seeing something new. This practice of increasing visibility into changes to your application is no different for feature flags and experiments. You'll want to be able to quickly answer questions like:

- A user's experience changed recently; did we start an experiment?
- Did anyone recently change the traffic allocation or targeting of this feature to include this set of users?
- An unexpected bug occurred for a customer, but we didn't deploy recently; did anyone change an experiment or feature-flag state?

Having a change history or audit log for your feature management and experimentation platform is key to scaling your experiments while retaining the ability to quickly diagnose changes to your users' experience. An audit log can also speed time to recovery by pinpointing the root cause of an undesirable change and making its implications quicker to understand.

Best practice: Build broader visibility with a change history or audit logs so you can quickly diagnose issues and pinpoint root causes.

Understand your systems with intelligent alerting

With many possible states, you'll want visibility into what's actually happening in production for your customers, and intelligent alerting for when things are not acting as expected. For example, you may have thought you released a feature to all of your customers, only to realize that a targeting condition prevented the feature from being visible to a portion of them. An alert for when a feature flag has been accessed by X number of users can be a useful way to ensure your feature flags are acting as expected in the wild.
Some organizations even set up systems to auto-stop a rollout if errors can be attributed to the new feature-flag state in production.

Code smarter systems by making them feature flag and experiment agnostic

Not all of your code has to know about all the experiments and feature flags in your product. In fact, the less your code has to worry about the experiments being run on your product, the better. By striving for code that's experiment-agnostic or feature-flag-unaware, you can focus on the particular product or feature you are building without having to worry about the different states. The techniques below are just two examples of the design patterns available[3]:

Move the fork
If you have a feature flag with two different experiences, moving the point where you fork the experience can change whether individual code paths depend on the feature-flag state. For example, you could either have a frontend component fork the experience inside the component, or you could have the frontend page swap out different components entirely, so the components don't have to be aware of the feature-flag state.

Avoid conditionals
Instead of deploying a feature flag or experiment using if/else branches in an imperative style, consider a declarative style where your feature is controlled by a variable configuration that could be managed not only by a feature flag service but by any remote service.
For example, if you were experimenting on an email subject, you could code the variation as:

    email = new Email();
    email.subject = "Welcome to Optimizely";
    variation = activateExperiment("email_feature");
    if (variation == "treatment_A") {
      email.subject = "Hello from Optimizely";
    }

Or you could remove the conditional and have the subject provided as a variable by the experimentation service:

    email = new Email();
    email.subject = getVariable("email_feature", "subject");

Or you could recognize that you can declaratively code the subject as a variable property on your email class, with defaults that a feature or experimentation service can override:

    @feature("email_feature")
    class Email {
      @variable("subject")
      subject = "Welcome to Optimizely"
    }

In the latter two implementations, you reduce how experiment-aware or feature-flag-aware your email code paths are.

Increase developer speed with framework-specific engineering

Integrating feature flags and experiments into the same tools and systems you already use will make scaling your experimentation program easier. To some, this means making experiments and feature flags ergonomic; to others, it means working where you work and how you work. For React, it's much easier to develop with the mindset of components. For Express, it's much easier to develop with the mindset of middleware. Each framework and platform has its own idiomatic ways of developing. The closer feature flags and experiments match those idiomatic patterns, the more likely you are to easily integrate them into your development, testing, and deployment processes.

Best practice: Match your feature flags and experiments to the idiomatic patterns of your development framework to easily integrate them into your development, testing, and deployment processes.
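As an illustration of matching a framework's idiom, in Python-based services a decorator often feels most natural. A hypothetical sketch, where the flag lookup is a stand-in for a real flag service:

```python
import functools

FLAGS = {"new_pricing": True}  # stand-in for a real flag service lookup

def feature_flag(key, fallback):
    """Route calls to the decorated function only when the flag is on;
    otherwise call the fallback implementation."""
    def decorator(new_impl):
        @functools.wraps(new_impl)
        def wrapper(*args, **kwargs):
            if FLAGS.get(key, False):
                return new_impl(*args, **kwargs)
            return fallback(*args, **kwargs)
        return wrapper
    return decorator

def old_price(amount):
    return amount

@feature_flag("new_pricing", fallback=old_price)
def price(amount):
    return amount * 0.8  # new discounted pricing path

assert price(100) == 80      # flag on: new implementation runs
FLAGS["new_pricing"] = False
assert price(100) == 100     # flag off: fallback runs
```

Callers simply call price(); the flag routing lives in one idiomatic place instead of being scattered through if/else branches.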
Leverage configuration-based design to foster a culture of experimentation

To truly achieve a culture of experimentation, you have to enable non-technical users to experiment across your business and product. This starts with architecting a system that does not require engineering involvement, yet has enough safeguards to prevent individuals from breaking your product.

Configuration-based development is an architectural pattern commonly used to enable this type of large-scale experimentation because it uses configuration files to power your product: for instance, a configuration file that powers the layout of content in a mobile application, or one that controls the scaling characteristics of a distributed system. By centralizing the different possible product states in a configuration file that can be validated programmatically, you can let experiments power different configuration setups while maintaining confidence that your application can't be put into a broken state.

Best practice: Architect a system that's accessible to non-technical users, and use configuration files to power your products and features.

Evaluating feature delivery and experimentation platforms

Now that you've learned about progressive delivery and experimentation, you might be considering whether to build your own testing framework, integrate an open source solution, or extend a commercial progressive delivery and experimentation platform like Optimizely. When deciding which option is right for your organization, consider the following:

Total cost of developing and maintaining your system

Building in-house or adopting an open source framework typically comes with a relatively small upfront investment. Over time, additional features and customizations become necessary as more teams use the platform, and maintenance burdens like bug fixes and UI improvements begin to distract engineers from a core product focus.
Committing to building a platform yourself is a commitment to continually innovating on experimentation and developing new functionality to support your teams. Companies that successfully scale experimentation with a home-built system have dedicated engineers on staff to enable others and support the system with ongoing maintenance.

Ease of use for developers and non-technical stakeholders

Usability for both technical and non-technical users can be the difference between running a few experiments a year and running thousands. An enterprise system often includes remote configuration capabilities: the ability to start and stop a rollout or experiment, change traffic allocation, or update targeting conditions in real time from a central dashboard without a code deploy. When a progressive delivery and experimentation system is easy for your engineering organization to adopt, more teams will be able to deploy quickly and safely. Developers will spend less time figuring out how to manage the release or experimentation of their code, and more time on customer-facing feature work. Look for systems with robust documentation, multiple implementation options, and open APIs.

Statistical rigor and system accuracy

To learn quickly through experimentation, teams need to trust that tests are run reliably and that the results are accurate. You'll need a tool that can track events, provide or connect to a data pipeline that can filter and aggregate those events, and integrate correctly into your system. Vet the statistical models used to calculate significance to ensure your team can make decisions quickly, backed by accurate data.

04 Use progressive delivery and experimentation to innovate faster

In this book, we've gone from building a basic foundation of progressive delivery and experimentation to more advanced best practices on how to do it well.
Many of the most successful software companies have gone on this journey not only to deploy features safely and effectively, but also to make sure they're building the right features to begin with. In today's age of rapid change, having the tools and techniques to adapt and experiment quickly is the most crucial part of staying ahead of the curve.

Appendix

01 Trunk-based development workflow

What is trunk-based development?
Trunk-based development is a software development strategy in which engineers merge smaller changes more frequently into the main codebase and work off the trunk copy, rather than working on long-lived feature branches.

Why trunk-based development?
With many engineers working in the same codebase, it's important to have a strategy for how individuals work together. To avoid overriding each other's changes, engineers create their own copies of the codebase, called branches. Following the analogy of a tree, the master copy is sometimes called the mainline or trunk. The process of incorporating the changes from an individual's copy into the main trunk is called merging.

To understand how trunk-based development works, it's useful to first look at the alternative strategy: feature branch development. In feature branch development, individual software developers or teams of engineers do not merge their branch until a feature is complete, sometimes working for weeks or months at a time on a separate copy.

[Figure: Feature-branch development, showing long-lived feature branches off the master (or trunk)]

This long stretch of time can make merging difficult, because the trunk or master has likely changed as other engineers merged their code. This can result in a lengthy code review process in which the changes in different pull requests are analyzed to resolve merge conflicts.
Benefits of trunk-based development
Trunk-based development takes a more continuous-delivery approach to software development: branches are short-lived and merged as frequently as possible. The branches are smaller because they often contain only part of a feature. These short-lived branches make merging easier because there is less time for divergence between the main trunk and the branch copies.

[Figure: Trunk-based development, showing short-lived branches off the master (or trunk); merging is done more frequently and more easily with shorter branches]

Thus, trunk-based development is a methodology for releasing new features and small changes quickly while helping to avoid lengthy bug fixes and "merge hell." It is an increasingly popular DevOps practice among agile development teams, and is often paired with feature flags to ensure that any new features can be rolled back quickly and easily if bugs are discovered.

02 Painted-door experiment in-depth example

When you aren't sure whether to build a feature, a painted-door experiment is a high-value option. In a painted-door experiment, instead of putting in all the time to build a feature, you first verify that the feature is worth building by adding just the suggestion of the feature to your product and measuring engagement.