
Starting an experimentation program: Best practices from Yousician

 

This article explores how Yousician built a successful experimentation program, covering platform selection, analytics infrastructure, feature flagging practices, and maintaining the balance between optimization and innovation.

Selecting your initial experimentation platform

When starting an experimentation program, the temptation to build a custom solution can be strong, particularly for engineering-focused organizations. Yousician initially considered this path but quickly recognized that experimentation is a complicated domain requiring deep expertise. Instead, they adopted Leanplum, a platform offering AB testing capabilities along with messaging, user profiling, and analytics features.

This choice allowed the team to start with basic experiments and gradually increase complexity. 

Early tests focused on simple UI variations such as button colors and text content. As confidence grew, experiments expanded to include entire screen redesigns and value proposition messaging. The platform enabled the team to test whether a conversion screen with more detailed benefit descriptions would outperform a simpler version. 

Eventually, Yousician progressed to full-stack experimentation, where new features with both frontend and backend components would be hidden behind feature flags and activated for a portion of users to measure impact on key product metrics.
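
As a rough sketch of the mechanics involved (not Yousician's or Leanplum's actual implementation; the flag name and rollout percentage are invented), a full-stack feature can stay dark behind a flag until a deterministic bucketing function switches it on for a slice of users:

```python
import hashlib

def in_rollout(flag_name: str, user_id: str, rollout_percent: int) -> bool:
    """Deterministically place the user in a bucket from 0 to 99 and enable
    the flag for the configured percentage. The same inputs always produce
    the same answer, wherever the check runs."""
    digest = hashlib.md5(f"{flag_name}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < rollout_percent

# Both the mobile client and the backend can gate the new code path on the
# same flag name and user id, so a given user sees a consistent experience.
show_new_feature = in_rollout("new-practice-mode", "user-123", rollout_percent=20)
print("variant:", "treatment" if show_new_feature else "control")
```

Because the assignment is a pure function of the flag name and user id, any service that performs the same check arrives at the same answer.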

The analytics challenge in full-stack experimentation

As experimentation programs mature, analytics complexity increases dramatically. Simple setups where data flows from a single application to a single database work well for basic tests. 

However, full-stack experimentation introduces multiple data sources that must be reconciled. Backend services generate their own events, third-party providers supply payment and subscription data, and different platforms contribute usage information. For a subscription-based business, understanding how long users maintain subscriptions, when they renew, and their predicted lifetime value becomes essential for proper experiment evaluation.

Standard analytics platforms typically offer retention metrics based on simple criteria such as whether users opened an application on day seven. However, more sophisticated measurement requires custom retention definitions. For a music education application, the relevant question might be whether users actually played songs or used specific features like a tuner. These nuanced engagement metrics provide far better signals about feature success than basic session counts.
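
As a toy illustration of what such a custom definition can look like in an analytics pipeline (the event names, dates, and engagement set are invented for the example), day-7 retention here counts only meaningful engagement events rather than mere app opens:

```python
from datetime import date, timedelta

# Each event: (user_id, event_name, event_date). Names and data are illustrative.
events = [
    ("u1", "song_played", date(2024, 1, 8)),
    ("u2", "app_opened",  date(2024, 1, 8)),
    ("u3", "tuner_used",  date(2024, 1, 8)),
]
signup_dates = {"u1": date(2024, 1, 1), "u2": date(2024, 1, 1), "u3": date(2024, 1, 1)}

ENGAGEMENT_EVENTS = {"song_played", "tuner_used"}  # custom definition of "active"

def retained_on_day(user_id: str, day: int) -> bool:
    """Custom day-N retention: the user triggered a meaningful engagement
    event exactly N days after signup, not merely opened the app."""
    target = signup_dates[user_id] + timedelta(days=day)
    return any(uid == user_id and name in ENGAGEMENT_EVENTS and d == target
               for uid, name, d in events)

day7_retention = sum(retained_on_day(u, 7) for u in signup_dates) / len(signup_dates)
print(f"day-7 engaged retention: {day7_retention:.0%}")  # 2 of 3 users -> 67%
```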

Yousician ultimately built their own analytics system to address these limitations. While this required significant investment, it provided the flexibility to define custom metrics, rename events without losing historical data, and implement complex lifetime value calculations tailored to their subscription model. Organizations beginning their experimentation journey should carefully evaluate whether off-the-shelf analytics will suffice for their needs or whether custom solutions will eventually become necessary.

Managing targeting complexity and consistency

Modern applications rarely target all users with every experiment. Different markets have different regulations and content availability. A music streaming feature might include certain songs in the United States but not in Japan due to licensing restrictions. Similarly, different platforms and app versions have varying capabilities. An experiment requiring new API endpoints cannot run on older application versions that lack the necessary integration points.

This creates substantial targeting complexity. An experiment might need to target non-paying users on Android devices running version 4.64 or later, located outside specific countries where content restrictions apply, using a particular instrument within the application. These targeting rules can quickly become convoluted, with each criterion containing multiple subconditions.
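
To make the shape of such a rule concrete, here is a hypothetical targeting rule expressed as a list of constraints that must all hold (the field names, version numbers, and country codes are illustrative, not Yousician's actual configuration):

```python
from typing import Callable

RESTRICTED_COUNTRIES = {"JP"}  # markets excluded for licensing reasons (example)

def version_at_least(version: str, minimum: str) -> bool:
    return tuple(map(int, version.split("."))) >= tuple(map(int, minimum.split(".")))

# Non-paying Android users on 4.64+, outside restricted countries, playing guitar.
constraints: list[Callable[[dict], bool]] = [
    lambda ctx: not ctx["is_paying"],
    lambda ctx: ctx["platform"] == "android",
    lambda ctx: version_at_least(ctx["app_version"], "4.64"),
    lambda ctx: ctx["country"] not in RESTRICTED_COUNTRIES,
    lambda ctx: ctx["instrument"] == "guitar",
]

def eligible(ctx: dict) -> bool:
    """A user enters the experiment only if every constraint holds."""
    return all(check(ctx) for check in constraints)

ctx = {"is_paying": False, "platform": "android", "app_version": "4.70",
       "country": "FI", "instrument": "guitar"}
print(eligible(ctx))  # True
```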

The consistency challenge becomes even more acute in full-stack experiments. When a mobile application communicates with an AB testing service to determine which variant a user should see, but the backend also queries the same service for its own logic, both systems must receive identical answers. If the mobile app places a user in the treatment group but the backend assigns them to control, the experiment breaks down. Users might see a new interface but interact with old backend logic, or vice versa, producing meaningless results.

Many platforms address this through sticky targeting, where users remain in their assigned variant even if their circumstances change. However, this introduces its own complications. If an experiment targets users outside Japan and uses sticky targeting, a user traveling to Japan will remain in the experiment despite now falling outside the targeting criteria. Similarly, VPN usage or device switching can create scenarios where users appear to violate experiment inclusion rules, leading to confusion during analysis.

The platform switch: Moving to Unleash

By 2021, the limitations of Yousician's experimentation platform had become impossible to work around. The combination of sticky targeting complications, insufficient control over experiment bucketing, and reliance on third-party availability for real-time decisions created too many pain points. The team needed a solution that offered deterministic, consistent bucketing while running on their own infrastructure.

The key requirement was that applications should inform the experimentation system about their capabilities rather than having the system dictate participation. Consider two devices owned by the same user: an Android phone running application version 1.0 and an iOS tablet running version 1.1. An experiment requiring features only available in version 1.1 should be enabled on the tablet but disabled on the phone. However, the user’s assignment should remain consistent. If they upgrade their phone to version 1.1, they should see the same variant they were already seeing on their tablet.
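
One way to sketch that behavior (a simplified illustration, not Unleash's internal algorithm) is to treat the app version as a capability constraint that can disable the experiment on a device, while deriving the variant purely from the user id so it never changes between devices or after an upgrade:

```python
import hashlib

def variant_for(experiment: str, user_id: str) -> str:
    """Variant assignment depends only on the user, never on the device,
    so every device owned by the same user resolves to the same variant."""
    bucket = int(hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest(), 16) % 2
    return "treatment" if bucket == 0 else "control"

def resolve(experiment: str, user_id: str, app_version: str) -> str:
    """The app reports its capabilities; the experiment is simply disabled
    on versions that cannot support it, without re-bucketing the user."""
    if tuple(map(int, app_version.split("."))) < (1, 1):
        return "disabled"                 # old phone: feature stays off
    return variant_for(experiment, user_id)

print(resolve("new-search", "user-42", "1.0"))  # disabled on the version 1.0 phone
print(resolve("new-search", "user-42", "1.1"))  # same variant as the tablet
```

The phone on version 1.0 simply has the feature turned off; as soon as it upgrades, it falls into the same variant the tablet has shown all along.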

Meeting that requirement called for real-time evaluation on backend servers rather than dependence on external services that might experience outages. During third-party service disruptions, users might suddenly see different prices or features, creating negative experiences and invalidating experiment data. Self-hosted evaluation eliminates this dependency.
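
With Unleash, this pattern maps to a server-side SDK that fetches flag definitions in the background and evaluates them in-process. The sketch below uses the Unleash Python SDK; the URL, API token, flag name, and context fields are placeholders, and the exact constructor options may differ between SDK versions, so treat it as an outline rather than copy-paste configuration.

```python
from UnleashClient import UnleashClient

# The SDK polls flag definitions from the Unleash server and caches them
# locally, so each is_enabled() call is evaluated in-process. A brief outage
# of the flag server does not suddenly flip users between variants.
client = UnleashClient(
    url="https://unleash.example.com/api",               # placeholder URL
    app_name="backend-service",
    custom_headers={"Authorization": "<server-side API token>"},
)
client.initialize_client()

context = {"userId": "user-42", "properties": {"appVersion": "1.1.0"}}
if client.is_enabled("new-search", context):
    variant = client.get_variant("new-search", context)  # e.g. {'name': ..., 'enabled': ...}
```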

Additionally, Yousician wanted to maintain their custom data pipeline rather than routing all analytics through the experimentation platform. This allowed their data science team to continue analyzing results using the infrastructure and tools they had already optimized. Small details like geolocation databases also mattered. Different IP geolocation services can assign the same address to different countries, causing apparent targeting violations when users appear in experiments they should have been excluded from. Using consistent geolocation across experiment assignment and analysis prevents these discrepancies.

Key practice one: Feature flags for development

One of the most valuable applications of feature flagging extends beyond experimentation to the development process itself. When engineers can integrate their code frequently rather than maintaining long-lived feature branches, development velocity increases substantially. The practice is straightforward: create a feature flag that defaults to off, then begin merging code to the main branch even before the feature is complete. As long as the flag remains disabled for production users, partially implemented features can be deployed without impact.
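
A bare-bones sketch of the pattern (the flag name and lookup are hypothetical; a real project would ask its flag service): the half-finished code path ships to production but stays unreachable while the flag defaults to off.

```python
# Hypothetical flag lookup; a real project would ask its flag service instead.
FLAG_DEFAULTS = {"redesigned-search": False}   # new flags default to off

def is_enabled(flag: str) -> bool:
    return FLAG_DEFAULTS.get(flag, False)

def search(query: str) -> str:
    if is_enabled("redesigned-search"):
        return f"new search results for {query!r}"      # half-built, merged anyway
    return f"classic search results for {query!r}"      # what production users see

print(search("chords"))   # classic path: the flag stays off in production
```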

This approach prevents the code conflicts and duplicated work that plague long-lived branches. When multiple developers work on separate branches for weeks, merging becomes painful and error-prone. Frequently integrated code surfaces conflicts early, when they are easiest to resolve.

Feature flags also enable effective dogfooding, where internal users test features before public release. A new search interface might be enabled for employees weeks before external users see it, allowing the team to identify issues and gather feedback in a controlled environment. Once the feature reaches acceptable quality, a different targeting strategy activates it for a percentage of public users, perhaps starting with five percent and gradually increasing. This gradual rollout strategy minimizes risk while still gathering statistically significant data.
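
A simplified sketch of such a staged rollout (the employee check and percentages are assumptions for illustration): internal users are always enrolled, while the public share is controlled by a single number that can be raised over time.

```python
import hashlib

def percentage_bucket(flag: str, user_id: str) -> int:
    """Stable 0-99 bucket, so raising the percentage only adds users."""
    return int(hashlib.md5(f"{flag}:{user_id}".encode()).hexdigest(), 16) % 100

def enabled(flag: str, user: dict, public_percent: int) -> bool:
    # Stage 1: dogfooding -- internal users always get the feature.
    if user.get("is_employee"):
        return True
    # Stage 2: gradual rollout -- start at 5% of the public, then raise it.
    return percentage_bucket(flag, user["id"]) < public_percent

print(enabled("new-search", {"id": "user-42", "is_employee": False}, public_percent=5))
```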

This approach also eliminates much of the complex targeting logic that complicated earlier experimentation efforts: when only application versions with the necessary code and capabilities can be enrolled in an experiment, the consistency problems described above disappear.

Key practice two: Establish solid methodology

As experimentation programs scale, maintaining rigorous methodology becomes essential. Without discipline, even well-intentioned teams can fall prey to analytical pitfalls. The phenomenon of p-hacking, where researchers test multiple hypotheses until finding a statistically significant result, affects industry as much as academia. When evaluating experiments across dozens of metrics, random chance ensures some will show positive results at the conventional 95 percent confidence level, even if the treatment had no real effect.

Imagine testing a new onboarding flow and examining twenty different metrics: signup completion rate, time to first action, seven-day retention, thirty-day retention, feature adoption for eight different features, average session length, sessions per week, subscription conversion, and several others. With a 95 percent confidence threshold, testing twenty independent metrics means roughly one will show a false positive by pure chance. Teams eager for wins might focus on that single positive metric while ignoring neutral or negative results elsewhere.
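
The arithmetic behind that warning is easy to verify; the short calculation below assumes twenty independent metrics, each tested at a 5 percent false-positive rate.

```python
# With a 5% false-positive rate per metric, checking many metrics almost
# guarantees at least one spurious "win" even when the treatment does nothing.
alpha, metrics = 0.05, 20

expected_false_positives = alpha * metrics          # 1.0 -> about one by chance
prob_at_least_one = 1 - (1 - alpha) ** metrics      # ~0.64 -> 64% chance of a false "win"

print(expected_false_positives, round(prob_at_least_one, 2))
```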

Preventing this requires establishing clear success criteria before launching experiments. Define the primary metrics that determine success and limit secondary metrics to those providing actionable insights. While monitoring many metrics for unexpected effects makes sense, only predetermined primary metrics should drive decisions. This discipline requires involving data scientists or analysts early in experiment design to identify potential pitfalls and ensure statistical rigor.
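
One lightweight way to enforce this is to record the experiment design before launch, for example as a simple spec checked into version control (the field names and values below are illustrative):

```python
# A pre-registered experiment definition written down before launch.
# Decisions are driven only by primary_metrics; guardrail_metrics are
# monitored for unexpected effects but cannot "rescue" a failed test.
experiment = {
    "name": "onboarding-flow-v2",
    "hypothesis": "Shorter onboarding increases signup completion",
    "primary_metrics": ["signup_completion_rate"],
    "guardrail_metrics": ["d7_engaged_retention", "subscription_conversion"],
    "minimum_runtime_days": 14,
    "decision_rule": "ship only if the primary metric improves at 95% confidence",
}
```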

The challenge of consistent user identity across devices and login states presents another methodological consideration. Users might interact with an application while logged out, then log in later. They might use multiple devices, some running older application versions. Ensuring consistent experience requires thoughtful approaches to stickiness. Experiments targeting logged-in features should use account identifiers for bucketing, ensuring users see the same variant across devices. Experiments targeting pre-login experiences should use device identifiers, ensuring consistency across app reinstalls on the same device but accepting that the same person might see different variants on different devices.
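
A small sketch of how that identity choice might be encoded (the scope names and fields are invented): logged-in experiments bucket on the account id, pre-login experiments fall back to the device id.

```python
def bucketing_key(user: dict, experiment_scope: str) -> str:
    """Pick the identity used for stickiness: account id for logged-in
    experiences (consistent across devices), device id for pre-login
    experiences (consistent across reinstalls on the same device)."""
    if experiment_scope == "logged_in" and user.get("account_id"):
        return user["account_id"]
    return user["device_id"]

user = {"account_id": "acct-7", "device_id": "dev-ios-123"}
print(bucketing_key(user, "logged_in"))   # acct-7 -> same variant on every device
print(bucketing_key(user, "pre_login"))   # dev-ios-123 -> per-device consistency
```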

Key practice three: Optimize versus innovate

Perhaps the most critical lesson from extensive experimentation is understanding its limitations. AB testing excels at optimization, making good products incrementally better through numerous small improvements. Testing button colors, adjusting copy, refining layouts, and tweaking flows produces measurable gains that compound over time. However, experimentation can create a local maximum trap where products become stuck in incremental improvement cycles without breakthrough innovation.

The mathematics of experimentation encourages small tests. Smaller changes are more likely to show clear positive or neutral results. Larger changes introduce more variables and risk, making negative results more likely. Teams naturally gravitate toward safer, smaller experiments that reliably produce wins. Over time, this can lead to a product that is optimized but stagnant, having exhausted opportunities for incremental improvement without taking bigger swings.

Consider a scenario where a product team has spent two years optimizing their subscription flow through hundreds of small experiments. They have tested every button color, tried dozens of copy variations, experimented with different screen sequences, and refined their pricing presentation. Conversion rates have increased by twenty percent through these accumulated improvements. However, the fundamental subscription model remains unchanged, and growth is plateauing. What the product actually needs is a completely new approach, such as a different pricing tier, a new trial structure, or a reframed value proposition, and that kind of change cannot be discovered through incremental testing.

Innovation requires different approaches. Sometimes teams must build substantial new features or fundamentally different experiences based on vision and user research rather than experimentation. These initiatives should still eventually be validated through testing, but the initial development requires trusting intuition and accepting higher risk. Once a major new direction is established and shows promise, experimentation can then optimize it, testing variations and refinements to maximize its effectiveness.

The key is maintaining balance. Use experimentation extensively for optimization, but regularly step back to consider whether the product needs bigger changes that cannot be discovered through testing alone. Reserve time and resources for innovation projects that might fail but could also unlock entirely new levels of product-market fit.

The practical reality: Cleaning up technical debt

A practical consideration that often gets overlooked in discussions of experimentation culture is technical hygiene. Feature flags accumulate rapidly in active experimentation programs. Yousician created over 500 feature flags while running 600+ experiments. The vast majority of these flags should be temporary, serving their purpose during experiments and then being removed once decisions are made.

However, cleanup often gets deprioritized. Engineers finish experiments, analyze results, and decide on winning variants, but then move immediately to new projects without removing the flags and conditional logic they introduced. Over time, codebases become littered with defunct flags, making code harder to understand and maintain. New engineers encounter conditionals controlling behavior that has been the default for months or years, wondering whether they can safely remove the code or whether it serves some purpose they do not understand.

Establishing cleanup processes prevents this technical debt accumulation. Some teams make flag removal part of experiment completion criteria, requiring engineers to remove flags before starting new work. Others conduct regular cleanup sprints, reviewing all flags and removing those no longer serving active purposes. For example, Unleash’s health dashboards highlight unused or stale flags to help identify cleanup opportunities. Regardless of the specific approach, treating flag cleanup as essential rather than optional prevents experimentation infrastructure from becoming a maintenance burden.
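
A minimal sketch of what such a cleanup check could look like (the flag registry, statuses, and grace period are invented; in practice the data might come from your flag platform's API or a config file):

```python
from datetime import datetime, timedelta

flags = [
    {"name": "new-search",   "status": "decided", "decided_at": datetime(2024, 1, 10)},
    {"name": "pricing-test", "status": "running", "decided_at": None},
]

GRACE_PERIOD = timedelta(days=30)

def stale(flag: dict, now: datetime) -> bool:
    """A flag is stale once its experiment is decided and the grace period
    for removing the conditional code has passed."""
    return (flag["status"] == "decided"
            and flag["decided_at"] is not None
            and now - flag["decided_at"] > GRACE_PERIOD)

now = datetime(2024, 6, 1)
print([f["name"] for f in flags if stale(f, now)])  # ['new-search']
```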

Building the culture

The technical infrastructure of experimentation matters, but culture ultimately determines success. Organizations cannot simply install a feature flagging platform and expect data-driven decision making to emerge. The transition requires fundamental shifts in how teams think about product development.

The most difficult cultural element is ego management. Product managers must accept that their intuitions about user behavior might be wrong. Engineers must acknowledge that features they spent weeks building might not improve key metrics. Leadership must embrace uncertainty rather than insisting on commitment to specific outcomes. This can be particularly challenging in organizations where strong individuals have historically driven product direction through vision and conviction.

Success requires creating psychological safety around experiment failures. If teams feel punished when experiments show negative or neutral results, they will design ever-smaller, safer tests that minimize the risk of failure. This defeats the purpose of experimentation. Instead, teams should celebrate learning regardless of outcomes, recognizing that discovering what does not work is as valuable as discovering what does. Negative results prevent shipping features that would have harmed metrics, saving the engineering effort that would have gone into building them out more fully.

Organizations should also consider hiring practices. While experimentation skills can be taught, candidates with backgrounds in data-driven consumer companies often adapt more quickly to experimentation-heavy cultures. Experience running AB tests, analyzing results, and making decisions based on data rather than intuition provides a foundation that accelerates onboarding.

 
