Introduction to BigQuery

We currently use Google's BigQuery to host our Player Warehouse. BigQuery allows you to use SQL to explore the tables and answer questions about your games without worrying about infrastructure management.

If you're new to BigQuery and want to get a feel for it before testing Player Warehouse, you can follow Google's Get started with BigQuery guide and explore a sandbox environment with public datasets.

note

To follow along with this introduction, you should be familiar with SQL.

The main concepts you’ll need to understand while using BigQuery are:

  • Project: your organisation's base-level GCP entity

  • Dataset: top-level container used to organise and control access to your tables and views. By default, you’ll get access to a GameAnalytics dataset named “checkpoints” with your players' data. You are free to create other datasets if you wish to store the results of common queries.

  • Table: each table contains individual records organised in rows; each record is composed of columns (also called fields). By default, within the “checkpoints” dataset, you’ll find tables as per the above documentation. You may not create new tables within this dataset, but you are welcome to create tables in a new dataset.

  • View: a virtual table defined by a SQL query. Instead of storing data, a view queries other tables and presents the results of its defining query, avoiding unnecessary data duplication. In fact, the provided checkpoint tables are actually views (a sketch of creating your own follows this list).
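For example, assuming you've created a dataset of your own to hold common queries (the dataset name my_reports below is hypothetical), a view could look like this:

-- my_reports is a hypothetical dataset you created yourself;
-- views cannot be created inside the checkpoints dataset.
CREATE VIEW my_reports.us_players AS
SELECT
  player_id,
  game_id,
  checkpoint
FROM
  `<project_id>_checkpoints.player_checkpoint`
WHERE
  country_code = "US"

Querying my_reports.us_players then behaves just like querying a table, but no data is duplicated.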

Getting started with BigQuery (in Player Warehouse)

Once you get access to Player Warehouse, you can reach it via the link provided or through BigQuery's project selector. From there, make sure you're in the BigQuery section of Google Cloud by clicking the menu on the top left and selecting it.

You should be able to see the project and datasets listed:

[Screenshot: the checkpoints dataset listed under your project]

Player Warehouse consists of multiple tables, split by event type. The only exception is the player_checkpoint table, which has all the players' data like country_code, install_ts, build and much more. Check out the schemas for a full description.
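If you'd rather list the available tables with SQL instead of browsing the side panel, BigQuery's INFORMATION_SCHEMA views offer one way to do it; a minimal sketch (substitute your project id as usual):

-- List the tables and views available in the checkpoints dataset.
SELECT
  table_name,
  table_type
FROM
  `<project_id>_checkpoints`.INFORMATION_SCHEMA.TABLES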

Writing queries

To write a query in BigQuery, we can use the console or access it programmatically. In this introduction, we'll focus on the console side of things.

On the same page, there is an open area to write our query. If it's not showing for you, click on the + sign.

[Screenshot: the query editor]

First query - Daily Active Users

Let's start with a simple example to find the DAU.

SELECT
  COUNT(DISTINCT player_id)
FROM
  `<project_id>_checkpoints.player_checkpoint`
WHERE
  checkpoint = "2022-01-01"

Remember to replace <project_id> with your project id!

If you press the Run button at the top, the query will run and the result will appear at the bottom of the screen. Something like this:

[Screenshot: result of the first query]

Let's break down what's going on here:

  • Count the unique player_ids with SELECT COUNT(DISTINCT player_id).
  • Pick the player_checkpoint table with FROM `<project_id>_checkpoints.player_checkpoint`.
    • Note that the dataset name is <project_id>_checkpoints; queries that reference it as just checkpoints, without the <project_id>_ prefix, will fail.
  • Specify the date with WHERE checkpoint = "2022-01-01".
important

The tables in Player Warehouse are partitioned by checkpoint, so it's crucial to add the WHERE clause for optimisation purposes. Learn more about partitioned tables here.
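The partition filter doesn't have to target a single day. As a variation on the first query, here is a sketch that computes DAU per day for a week; the range on checkpoint still limits the scan to the matching partitions:

-- DAU per day over a one-week range of partitions.
SELECT
  checkpoint,
  COUNT(DISTINCT player_id) AS dau
FROM
  `<project_id>_checkpoints.player_checkpoint`
WHERE
  checkpoint BETWEEN "2022-01-01" AND "2022-01-07"
GROUP BY
  checkpoint
ORDER BY
  checkpoint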

Breakdown by game

If you have just one game, the first query works fine, but you might have more than one. For example, maybe you have a test/beta game and a live game, or perhaps dozens or hundreds of games you don't want to mix up!

Let's adapt the previous query to allow us to see the DAU per game.

SELECT
  game_id,
  COUNT(DISTINCT player_id) AS dau
FROM
  `<project_id>_checkpoints.player_checkpoint`
WHERE
  checkpoint = "2022-01-01"
GROUP BY
  1

[Screenshot: DAU per game results]

Now the results show one game per row (in my case, one Android and one iOS) and the DAU for each. We also gave the count column a name, dau, since the default f0_ is not very descriptive. So what changed?

  • Added game_id to the SELECT clause.
  • Named the aggregate column with AS dau.
  • Added a GROUP BY 1 clause to group the results by the first selected column, game_id.

Joining tables - adding the game name

No one will care for our tables or charts if we just list game ids. After all, people recognise games by their titles, not by a number.

To find the title, we just need to join our previous query with the game_metadata table and pick the game's title.

SELECT
  m.title,
  COUNT(DISTINCT c.player_id) AS dau
FROM
  `<project_id>_checkpoints.player_checkpoint` c
LEFT JOIN
  `<project_id>_checkpoints.game_metadata` m ON c.game_id = m.id
WHERE
  c.checkpoint = "2022-01-01"
GROUP BY
  1

The results are the same as the previous query, with the title instead of the id.

In this section, we:

  • Added a LEFT JOIN with the game_metadata table on the game's id.
  • Replaced game_id in the SELECT with the title column from game_metadata.
  • Provided an alias for each of the tables.

Conclusion

Over the previous sections we covered the basics of Player Warehouse on BigQuery. There is a lot more to explore, and we have some more advanced examples in our Example Queries section.

Before moving on to the next page, we recommend reading about optimising your queries.

Query optimisation

The Player Warehouse tables are large, holding up to one year of data. Queries can therefore scan large chunks of data, which directly affects query cost/billing.

BigQuery is excellent for working with large amounts of data but has some differences from traditional SQL databases. Here are some things to do/avoid, at least when using BigQuery with Player Warehouse.

Do:

  • Filter by the partition field; in our case, that means adding a WHERE checkpoint = ... clause.
  • When possible, filter by the clustered field; in our case game_id.
  • Query only the columns that you need.
    • BigQuery uses a columnar storage format, scanning only the columns necessary.

Avoid:

  • Similarly to the last DO, avoid using SELECT *, as it scans unnecessary columns.
  • Using LIMIT to limit the data scanned. In BigQuery, the LIMIT clause is applied after all the data is read, so it doesn't reduce the amount of data scanned.
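Putting these guidelines together, here is a sketch of a well-behaved Player Warehouse query: it selects only the columns it needs and filters on both the partition field and the clustered field (the game_id value is a placeholder; use one of your own):

-- Scans only two columns, a single partition, and one game's cluster.
SELECT
  player_id,
  country_code
FROM
  `<project_id>_checkpoints.player_checkpoint`
WHERE
  checkpoint = "2022-01-01"
  AND game_id = 12345 -- placeholder; replace with your game's id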

For more on query optimisation, check the BigQuery docs page on best practices.