Why Do We Perform Statistical Analysis? (Advanced)

Advanced

Our advanced articles describe how our features/systems work in detail and will sometimes require knowledge of certain scientific or technical terms.

A/B testing is a data-driven way of optimizing a game. An experiment therefore starts by collecting data from each variant. Each variant has a latent property that we want to compare. For example, to test binary goal metrics, we assume that the user set of each variant has a latent true conversion rate. To test playtime-per-user data, we assume that there is a latent session length distribution that generates our measurements.

For example, consider the following binary A/B test:

Variant	True Conversion Rate	Number of users	Converted users – measurements
“Control”	0.11	10000	1110
“A”	0.13	10000	1290

As we see in the table above, data collection is probabilistic in nature: for a true conversion rate of 0.11, we will not always end up with exactly 11% of users converting. This is because our sample is a subset of the overall population. In our case, the population is every user who would ever play the game, and the sample is the users who happen to be acquired for the experiment. For this reason, the mapping from measurements back to latent factors is not one-to-one.
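To make the sampling noise concrete, here is a minimal sketch that repeats the same experiment several times with a fixed true conversion rate. The function name `simulate_conversions`, the seed and the number of repetitions are illustrative assumptions, not part of any product API.

```python
import random

def simulate_conversions(true_rate, n_users, rng):
    """Count converted users when each user converts independently with probability true_rate."""
    return sum(rng.random() < true_rate for _ in range(n_users))

rng = random.Random(42)  # fixed seed, only so the sketch is reproducible

# Five hypothetical repetitions of the "Control" experiment from the table above.
counts = [simulate_conversions(0.11, 10_000, rng) for _ in range(5)]
print(counts)  # the counts scatter around 1100 rather than hitting it exactly
```

Each run uses the same latent rate, yet the number of converted users differs from run to run, which is why a single measurement cannot pin down the true rate exactly.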

For example, let us try to infer the true conversion rate of the control variant in the table above. We may argue that it is highly unlikely that a true conversion rate of 0.01 generated such results. Similarly, it is quite apparent that the true conversion rate is most probably not 0.5. The likely conversion rates are centered around the value 0.111. This relationship is usually represented in models by probability densities.
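The same argument can be checked numerically. The sketch below compares how well a few candidate conversion rates explain the control measurements (1110 converted users out of 10000), using the binomial log-likelihood. The candidate list and the helper name `binomial_log_likelihood` are assumptions chosen for illustration.

```python
import math

def binomial_log_likelihood(rate, n_users, n_converted):
    """Log-probability of observing n_converted conversions among n_users at the given rate."""
    log_choose = (math.lgamma(n_users + 1)
                  - math.lgamma(n_converted + 1)
                  - math.lgamma(n_users - n_converted + 1))
    return (log_choose
            + n_converted * math.log(rate)
            + (n_users - n_converted) * math.log(1 - rate))

candidates = [0.01, 0.10, 0.111, 0.12, 0.5]
log_likelihoods = {r: binomial_log_likelihood(r, 10_000, 1110) for r in candidates}

for rate, ll in log_likelihoods.items():
    print(f"rate={rate:<6} log-likelihood={ll:.1f}")
```

Rates near 0.111 score far higher than 0.01 or 0.5, matching the intuition described above.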

To determine the conversion rate densities given the measured data, we may perform Bayesian inference. The outcome of Bayesian inference may be the following:

On the vertical axis, you can see the probability density of a conversion rate value given the measurements of the variant. To calculate the probability of a variant being the winner, we first sample conversion rate values from these probability densities. We then compare these samples one-to-one: the fraction of comparisons that a variant wins is its probability of being the winner.

You can see an example of this in the following table:

Sample	Index	Conversion rate “Control”	Conversion rate “A”	Winner
315	1	0.11472557	0.12875037	“A”
256	2	0.10708776	0.12445933	“A”
127	3	0.11878974	0.12853538	“A”
538	4	0.11921174	0.11852222	“Control”
…	...	…	…	…
N	N	0.1101358	0.13813024	“A”

After collecting a sufficient number of samples, we find that the conclusion we reached through simple common-sense arguments is mathematically justified: the probability that variant “A” performs better than “Control” is 0.998.
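The sampling-and-comparison procedure can be sketched as follows. This sketch assumes a uniform Beta(1, 1) prior, so that the posterior of each conversion rate is a Beta distribution; the article does not state which prior or model is actually used, so the exact value printed here need not reproduce the figure quoted above. The function name `posterior_samples` is hypothetical.

```python
import random

def posterior_samples(n_users, n_converted, n_samples, rng):
    """Draw conversion-rate samples from Beta(1 + successes, 1 + failures).

    The uniform Beta(1, 1) prior is an assumption of this sketch.
    """
    alpha = 1 + n_converted
    beta = 1 + n_users - n_converted
    return [rng.betavariate(alpha, beta) for _ in range(n_samples)]

rng = random.Random(0)  # fixed seed for reproducibility
n_samples = 100_000

control = posterior_samples(10_000, 1110, n_samples, rng)
variant_a = posterior_samples(10_000, 1290, n_samples, rng)

# Compare the samples one-to-one and count how often "A" comes out ahead.
wins_a = sum(a > c for a, c in zip(variant_a, control))
prob_a_better = wins_a / n_samples
print(f"Estimated P(A performs better than Control): {prob_a_better:.4f}")
```

The estimate becomes more stable as `n_samples` grows, which is what "a sufficient number of samples" refers to.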

For some other experiment, we may end up with the following probability distributions:

Here, we can see that there is no significant difference between the variants.

The analysis outlined above also applies to experiments with several variants. Imagine the following conversion data was measured in an A/B testing experiment with 4 variants:

Variant	Number of users	Converted Users
“Control”	2000	110
“A”	2001	100
“B”	1999	105
“C”	2000	104

After performing Bayesian inference, the probability densities of the conversion rates have the following form:

Bayesian Inference

Among the probability densities above, none is clearly separated from the others. In fact, if we sample a sufficiently large number of conversion rate values from the variants’ probability distributions and compare them, we see that no variant’s winning percentage dominates the others, indicating that the measurements are not significantly different:

Variant	Winning Percentage
“Control”	44.5%
“A”	11.3%
“B”	24.1%
“C”	20.1%
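A winning percentage of this kind can be estimated by drawing one conversion rate per variant and recording which variant has the highest draw, repeated many times. The sketch below again assumes a uniform Beta(1, 1) prior; the sample count and seed are arbitrary, and the resulting shares should be close to, but not necessarily identical with, the values in the table above.

```python
import random

# (number of users, converted users) per variant, taken from the 4-variant table.
data = {"Control": (2000, 110), "A": (2001, 100), "B": (1999, 105), "C": (2000, 104)}

rng = random.Random(1)  # fixed seed for reproducibility
n_samples = 50_000
wins = {name: 0 for name in data}

for _ in range(n_samples):
    # One posterior draw per variant, assuming a uniform Beta(1, 1) prior.
    draw = {name: rng.betavariate(1 + k, 1 + n - k) for name, (n, k) in data.items()}
    wins[max(draw, key=draw.get)] += 1

win_share = {name: count / n_samples for name, count in wins.items()}
for name, share in win_share.items():
    print(f"{name}: {share:.1%}")
```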

We can also compute the pairwise relations between individual variants:

Comparison	Probability
P(“Control” performs better than “A”)	76.5%
P(“Control” performs better than “B”)	63.0%
P(“Control” performs better than “C”)	66.5%
P(“A” performs better than “B”)	34.9%
P(“A” performs better than “C”)	38.4%
P(“B” performs better than “C”)	53.2%
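Pairwise probabilities like these can be estimated from the same kind of posterior samples by comparing two variants at a time. As before, the uniform Beta(1, 1) prior, the seed and the sample count are assumptions of this sketch, so the output will only approximate the table.

```python
import random
from itertools import combinations

# (number of users, converted users) per variant, taken from the 4-variant table.
data = {"Control": (2000, 110), "A": (2001, 100), "B": (1999, 105), "C": (2000, 104)}

rng = random.Random(2)  # fixed seed for reproducibility
n_samples = 50_000

# Posterior samples per variant, assuming a uniform Beta(1, 1) prior.
samples = {
    name: [rng.betavariate(1 + k, 1 + n - k) for _ in range(n_samples)]
    for name, (n, k) in data.items()
}

pairwise = {}
for first, second in combinations(data, 2):
    better = sum(x > y for x, y in zip(samples[first], samples[second]))
    pairwise[(first, second)] = better / n_samples

for (first, second), prob in pairwise.items():
    print(f"P({first} performs better than {second}) = {prob:.1%}")
```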

Different variants will naturally produce different measurement results. For example, let’s say that an experiment starts running, and after two days, the collected data shows that the mean playtime of a variant has increased by 10%. This does not necessarily mean that this variant is significantly better than the others. It may be that the number of collected samples is low and that an outlier, such as a test device, skewed that variant’s data. So, in order to draw reliable conclusions from these differences, an A/B test concludes with a thorough statistical analysis that assesses whether the measurement differences simply occurred by chance or whether there is a statistically significant difference between the variants.

For example, imagine we are trying to find the best difficulty level of a game in order to keep as many players as possible after a week. Assume the control variant corresponds to a medium level of difficulty, and variant “A” is slightly easier. Consider the following day 7 retention measurements:

Variant	Number of users	Returning users
“Control”	2000	110
“A”	2001	112

The measurements above are very close, so it is highly probable that there is no significant difference between the retention of the two variants. The following measurements, however, differ to a greater extent.

Variant	Number of users	Returning users
“Control”	2000	110
“A”	2001	220

These measurements suggest that variant “A” is significantly more likely to retain users than the control variant. In the second example, it seems evident that users of variant “A” are significantly more likely to return after 7 days than users of the control variant, whereas the first example shows no such difference. With A/B testing, game developers can perform similar common-sense reasoning in a data-driven, mathematically rigorous manner.
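As a rough illustration of how the two retention scenarios differ once they are analyzed rather than eyeballed, the sketch below estimates the probability that variant “A” retains better than “Control” in each case. It assumes a uniform Beta(1, 1) prior on each retention rate; the function name `prob_a_beats_control` and its default arguments are hypothetical.

```python
import random

def prob_a_beats_control(control, variant_a, n_samples=50_000, seed=3):
    """Estimate P(rate of A > rate of Control) from (users, returning users) pairs.

    Assumes a uniform Beta(1, 1) prior on each retention rate.
    """
    rng = random.Random(seed)
    (n_c, k_c), (n_a, k_a) = control, variant_a
    wins = 0
    for _ in range(n_samples):
        rate_c = rng.betavariate(1 + k_c, 1 + n_c - k_c)
        rate_a = rng.betavariate(1 + k_a, 1 + n_a - k_a)
        wins += rate_a > rate_c
    return wins / n_samples

close = prob_a_beats_control((2000, 110), (2001, 112))  # first scenario
far = prob_a_beats_control((2000, 110), (2001, 220))    # second scenario
print(f"First scenario:  {close:.3f}")
print(f"Second scenario: {far:.3f}")
```

In the first scenario the probability stays near one half, meaning neither variant is clearly ahead, while in the second it is close to 1.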