Organizing data in BigQuery

Ziyad Muhammed
travel audience
Apr 16, 2021


At travel audience, we provide integrated data-driven solutions for travel advertising. Since we deal with terabytes of data per day, selecting the right tools for our data workloads is essential. Being on the Google Cloud Platform (GCP), we rely heavily on one of its most prominent technologies: BigQuery. Though we use various other GCP components in our data teams, if I had to pick a single component to recommend, it would be BigQuery without a doubt. BigQuery is fast, powerful, and in most cases cost-effective.

This article from Google provides a good overview of BigQuery for a data warehouse practitioner. However, it does not cover organizing data in much detail. This post focuses solely on that part: how to organize data in BigQuery for effective and compliant management across multiple teams in your organization. Every organization is different, and Conway’s law definitely applies to data modeling. Still, we think this could be a starting point for anyone to extend on.

Why is organizing your data important?


When we started with BigQuery, we focused mostly on “getting work done” and not much on effective data management. Pretty soon, we ran into the following problems.

  1. The number of datasets grew very quickly, which eventually led to data duplication. Datasets and tables were created both manually and from different applications (such as Dataflow or Cloud Composer jobs). Applications also had permission to delete tables. In short, service account permissions for BigQuery were not managed in the same place where BigQuery resources were created.
  2. Data started piling up in both user-specific and generic datasets. Since we didn’t have automated cleanup processes and users were able to override default expiry values, it became impossible to clean up data.
  3. The lack of proper schema versioning was another big issue. When a table was migrated to a new version, all consumer applications referring to that table also needed to be updated.
  4. Data ownership was yet another issue. Since we didn’t have rules specifying which datasets belonged to which teams, and some datasets were created based on a product or business entity, it became hard to track ownership. We experimented with using labels for governance, but this didn’t work out, as the labels themselves were created manually.
  5. We didn’t have an efficient way of managing lookup and static data (data/metadata created in the data warehouse itself). We also found that commonly used lookup data was duplicated across datasets owned by different teams and was often not in sync.

Now let’s talk about how we tackled these issues. To set the context, I will first present the general structure of our data organization, and then dive into BigQuery-specific practices.

Organizational Structure

The data team at travel audience consists of three operational entities.

Data Engineering: Responsible for making all upstream data (from message queues, RDBMS, files from GCS or SFTP, etc.) available in the data warehouse. In almost all cases, the output (for other teams) is produced in the data warehouse (i.e., BigQuery).

Analytics and Reporting: Works with the data available in the data warehouse, performs analytics, and produces reports and dashboards. The output here stays mostly in BigQuery itself, from where data is exported to Tableau.

Data Science: Works with the data available in the data warehouse, plus some of the outputs from the Analytics team (see above), and builds machine learning models. The resulting models are stored either as files in GCS or in message queues, and are then picked up by the model-serving framework.

BigQuery data management

On GCP, we have three dedicated projects (and hence three dedicated BigQuery environments): production, staging, and development. We use Terraform to manage all our GCP resources, and the BigQuery Terraform module offers all the features we need. So our first decision was to manage all BigQuery resources with Terraform. Each team mentioned above got a dedicated GitHub repository, where it manages its BigQuery resources, mainly datasets, dataset_access, tables, and views (we don’t use table_access yet). Service account access to a team’s BigQuery datasets is managed only in this repository, giving the team full control over sharing its data with others.
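As an illustration, a minimal sketch of such an access grant, assuming a hypothetical project id, dataset name, and service account (not our actual names), might look like this:

```hcl
# Read access on the (hypothetical) det_external dataset for another
# team's service account. This grant lives only in the owning team's repo.
resource "google_bigquery_dataset_access" "analytics_reader" {
  project       = "ta-data-production" # hypothetical project id
  dataset_id    = "det_external"
  role          = "READER"
  user_by_email = "analytics-pipeline@ta-data-production.iam.gserviceaccount.com"
}
```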

The main decisions taken towards organizing data are:

Minimal number of datasets

Each team gets two datasets: one for internal use (let’s refer to it as the internal dataset) and one for exposing output data to client applications (hereafter referred to as the external dataset). In other words, the external dataset serves as the data API between teams. A team can get additional datasets if really required (for example, to manage temporary data used by the Dataflow BigQuery API). The datasets owned by a team are created across all three GCP projects (development, staging, and production) using Terraform and CI/CD.
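A hedged sketch of how the per-environment datasets could be declared, assuming a hypothetical project variable that is supplied per environment by CI/CD:

```hcl
# The GCP project is supplied per environment (development, staging, production),
# e.g. via a tfvars file or a CI/CD variable. All names are hypothetical.
variable "project" {
  type = string
}

locals {
  team_datasets = ["det_internal", "det_external"]
}

resource "google_bigquery_dataset" "team" {
  for_each = toset(local.team_datasets)

  project    = var.project
  dataset_id = each.key
  location   = "EU"

  labels = {
    owner = "data-engineering"
  }
}
```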

Centralized table and view creation

Previously, it was possible to create BigQuery tables and views from a Dataflow job or a Cloud Composer workflow. With this change, table and view creation is limited to Terraform deployments. The permission for Dataflow and Cloud Composer service accounts to create tables and views in BigQuery datasets was revoked. In the staging and production projects, tables and views are created with Terraform from a CI/CD pipeline. In the development project, the team’s user accounts get CREATE access to the datasets owned by the team. This makes prototyping faster, and once a schema is satisfactory, it is deployed to staging and production with Terraform.
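For example, an internal table might be declared roughly like this; the table name, schema, and partitioning field are hypothetical:

```hcl
resource "google_bigquery_table" "clicks" {
  project    = "ta-data-production" # hypothetical project id
  dataset_id = "det_internal"
  table_id   = "clicks_v1"

  # Prevent accidental 'terraform destroy' of a production table.
  deletion_protection = true

  time_partitioning {
    type  = "DAY"
    field = "event_time"
  }

  schema = jsonencode([
    { name = "event_time", type = "TIMESTAMP", mode = "REQUIRED" },
    { name = "campaign_id", type = "INTEGER", mode = "REQUIRED" },
    { name = "user_token", type = "STRING", mode = "NULLABLE" },
  ])
}
```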

Version control for BigQuery resources

The internal dataset contains the tables produced by the team, and the external dataset (meant for other teams) contains views on top of them. (The views can be used by applications owned by the same team as well.) A view definition always lists the selected fields explicitly and never uses SELECT *. The advantages of this approach (a Terraform sketch follows the list below) are:

  1. The team knows which data is exposed to client applications. Note that not all tables in the internal dataset need a view to be defined, only the ones used by another team (or by an application within the team other than the one responsible for producing the data).
  2. Backward-compatible changes (such as adding a new field) can be introduced in the underlying tables at any time. Once the data in the new fields is verified, the fields can also be added to the views. None of this should break client applications (since we never use SELECT * in any team’s application). We can also migrate the underlying tables to a different version and modify the view to point to the new table (or a union of tables) without breaking client applications.
  3. For breaking changes (such as removing a field or a completely new schema), we deploy a new version of the view (pointing to the new table with the updated schema) and ask the clients to migrate their applications.
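As a hedged sketch, an external view over the hypothetical internal table shown earlier could look like this, with the version encoded in the view name:

```hcl
# Versioned view in the external dataset, exposing only selected fields
# of the internal table. All names are hypothetical.
resource "google_bigquery_table" "clicks_view_v1" {
  project    = "ta-data-production"
  dataset_id = "det_external"
  table_id   = "clicks_v1"

  view {
    use_legacy_sql = false
    query          = <<-SQL
      SELECT
        event_time,
        campaign_id
      FROM `ta-data-production.det_internal.clicks_v1`
    SQL
  }
}
```

A breaking change would then be rolled out as a new clicks_v2 view pointing at the new table, while client applications migrate at their own pace.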

Clear data ownership

The above setup gives us a clear understanding of data ownership, since schemas are never defined anywhere other than the dedicated GitHub repositories. It also makes collaboration easier: people from other teams can (and are encouraged to!) raise a pull request against another team’s repository to request access for one of their service accounts, or even to create a new schema for them.

Guarantees and Open data

One of the decisions we made is that all data team members get READ access to all datasets, both internal and external, from all teams. This “open data” approach helps the data teams come up with innovative data products. At the same time, we also value compliance and control. We get the best of both worlds with the following approach (a sketch of the access setup follows the list):

  • Service accounts (used by applications) get access only to the external datasets. User accounts get access to the internal datasets as well. This way, data team members can see “work in progress” data sources, which encourages collaboration between teams. A team provides guarantees (backward compatibility and announcement of breaking changes) only for its external dataset.
  • All PII data is kept in secure storage with restricted access. Our data warehouse does not contain any PII data as such, only tokens provided by this secure storage.
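A minimal sketch of this access split, assuming a hypothetical Google group for all data team members and a hypothetical application service account:

```hcl
# All data team members (hypothetical group) can read the internal dataset.
resource "google_bigquery_dataset_access" "internal_data_team" {
  dataset_id     = "det_internal"
  role           = "READER"
  group_by_email = "data-team@example.com"
}

# Client applications' service accounts are granted access
# only on the external dataset.
resource "google_bigquery_dataset_access" "external_app_reader" {
  dataset_id    = "det_external"
  role          = "READER"
  user_by_email = "reporting-app@ta-data-production.iam.gserviceaccount.com"
}
```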

To make this clearer, let’s summarise the above decisions with an example. The table below shows the setup used by our data engineering team (abbreviated as DET in the table).

example for team-specific BigQuery datasets

Note that members of our data organization do not get user-specific development datasets in this setup. Each team has about 5 to 10 members, and user-specific datasets would be overkill in our case.

Metadata management

Most of our microservices use Cloud SQL as their backend storage. Previously, data from these microservices was loaded into BigQuery via Cloud Composer jobs using gRPC. Since BigQuery now supports federated queries against Cloud SQL, we can expose Cloud SQL data directly in the external datasets, just like a normal view on a table. This approach also eliminates the lag between an update in Cloud SQL and its visibility in BigQuery.
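A hedged sketch of how such a federated view could be wired up, with hypothetical connection, instance, database, and column names:

```hcl
# BigQuery connection to a (hypothetical) Cloud SQL PostgreSQL instance.
# var.db_username / var.db_password are assumed to be declared elsewhere,
# e.g. supplied as CI/CD secrets.
resource "google_bigquery_connection" "campaigns_db" {
  connection_id = "campaigns-db"
  project       = "ta-data-production"
  location      = "EU"

  cloud_sql {
    instance_id = "ta-data-production:europe-west1:campaigns"
    database    = "campaigns"
    type        = "POSTGRES"

    credential {
      username = var.db_username
      password = var.db_password
    }
  }
}

# Expose the Cloud SQL table in the external dataset as a view
# that runs a federated EXTERNAL_QUERY.
resource "google_bigquery_table" "campaigns_view" {
  project    = "ta-data-production"
  dataset_id = "det_external"
  table_id   = "campaigns_v1"

  view {
    use_legacy_sql = false
    query          = <<-SQL
      SELECT id, name, status
      FROM EXTERNAL_QUERY(
        "ta-data-production.EU.campaigns-db",
        "SELECT id, name, status FROM campaigns"
      )
    SQL
  }
}
```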

For static metadata (mostly small lookup data) that is defined in the data warehouse itself, we decided to keep the data in the GitHub repository. On any change to the data file in GitHub, our custom Terraform setup either replaces the whole lookup or updates the rows as required.
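As one possible, simplified way to achieve this (not necessarily the custom setup described above), a small lookup kept as a CSV file in the repository could be rendered into a view; the file name and columns are hypothetical:

```hcl
locals {
  # Hypothetical lookup file versioned alongside the Terraform code,
  # e.g. data/country_codes.csv with columns country_code,country_name.
  country_rows = csvdecode(file("${path.module}/data/country_codes.csv"))

  # One SELECT per row, combined with UNION ALL into a single view query.
  country_query = join("\nUNION ALL\n", [
    for row in local.country_rows :
    "SELECT '${row.country_code}' AS country_code, '${row.country_name}' AS country_name"
  ])
}

resource "google_bigquery_table" "country_codes" {
  project    = "ta-data-production"
  dataset_id = "det_external"
  table_id   = "country_codes_v1"

  view {
    use_legacy_sql = false
    query          = local.country_query
  }
}
```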

Our key takeaways

  • Automate the creation of BigQuery resources using Terraform and CI/CD.
  • Forbid creation of BigQuery resources from client applications. They should only be able to read or write data, but not perform DDL operations.
  • Use a few datasets per team.
  • Have clear segregation of internal and external datasets.
  • Support backward-compatible changes and versioning of data.
  • Use federated queries for Cloud SQL resources.

I hope you found this article useful. Questions and suggestions for improvement are warmly welcome.
