Best Practices
To help you optimize your games, we would like to share what we consider the best practices for creating A/B testing experiments, along with some important notes about the limitations of A/B testing.
Context is important. Don’t jump to conclusions.
All computations are based solely on the data collected during the experiment. The result of an A/B test is therefore always a reflection of user behaviour under those specific circumstances. If major changes are applied to the game after an experiment, or if industry-wide behaviour shifts drastically, the results may no longer be relevant. The same experiment repeated under different circumstances may yield different results.
Another important consequence is that an experiment run in a specific region does not have the same predictive power in other locations. The same reasoning applies if other special configurations – build, operating system, etc. – were in place during the experiment. An A/B testing result for a given build or operating system may not be informative for other versions of the same game.
Don’t make too many or too drastic changes to the game.
A/B testing is most effective when only a few factors of interest are varied and everything else is kept controlled. When an experiment is run this way, we can reason that any significant difference in the measurements was caused by the modified factors. For this reason, it is important not to roll out any major changes to the game during an experiment. Similarly, it is not advisable to run experiments during periods when rapid changes in user behaviour can be expected – e.g. public, religious or school holidays.
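As a purely hypothetical illustration (the parameter names below are invented and are not part of any GameAnalytics configuration format), a well-designed variant changes exactly one value while everything else stays identical, so any observed difference can be attributed to that value:

```python
# Hypothetical variant definitions for a single-factor experiment.
# Only "starting_coins" differs between the two variants; every other
# parameter is identical, so a significant difference in the measured
# KPIs can be attributed to that one change.
BASE_CONFIG = {
    "starting_coins": 100,
    "tutorial_enabled": True,
    "ad_frequency_minutes": 5,
}

VARIANT_A = dict(BASE_CONFIG)                       # control: unchanged
VARIANT_B = {**BASE_CONFIG, "starting_coins": 150}  # treatment: one factor varied

# Changing several parameters at once (or shipping an unrelated update
# mid-experiment) would make it impossible to say which change caused
# any observed difference.
```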
A/B experiments should run for a reasonable time interval. Don’t run tests for a very short period.
Experiments using GameAnalytics A/B testing must run for at least a fixed number of days. This eliminates the early transitional effects of the changes introduced in the variants and covers some of the possible seasonality in the data. Our experiments also have a minimum user count requirement per variant. This is needed because the certainty of any inference grows with the size of the data it is based on. With too little data, the uncertainty stays high and experiments are unlikely to be conclusive – A/B tests would almost surely end without declaring a winner.
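To see why a minimum number of users per variant matters, the sketch below shows how the uncertainty around a measured conversion rate shrinks as the sample grows. It is a standalone illustration using a normal-approximation confidence interval, not GameAnalytics’ actual evaluation method:

```python
import math

def conversion_ci(conversions: int, users: int, z: float = 1.96) -> tuple[float, float]:
    """95% normal-approximation confidence interval for a conversion rate."""
    p = conversions / users
    half_width = z * math.sqrt(p * (1 - p) / users)
    return p - half_width, p + half_width

# The same 4% conversion rate measured on samples of different sizes:
for users in (500, 5_000, 50_000):
    low, high = conversion_ci(round(0.04 * users), users)
    print(f"{users:>6} users: 4.0% measured, CI ~ [{low:.1%}, {high:.1%}]")

# With 500 users the interval is roughly [2.3%, 5.7%] – wide enough that a
# real difference between variants is easily drowned out. With 50,000 users
# it narrows to about [3.8%, 4.2%], which can separate variants whose true
# rates differ by only a fraction of a percent.
```

The same intuition applies regardless of the specific statistical method used: the smaller the sample per variant, the less likely the experiment is to end with a clear winner.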