25 July 2023
15 minutes reading time
Discover the benefits of building a data warehouse for analytics. Explore cloud-based solutions, partitioned and clustered tables, normalization vs. denormalization, analytics-ready data products, and the data mesh approach.
Building a data warehouse can be daunting, especially when choosing the right technologies and architectures to support your data needs. As organizations seek to harness the power of their data, understanding the purpose of a data warehouse becomes crucial in designing an effective solution. In this introductory article, we aim to demystify the process and provide answers to common questions about building and managing a data warehouse.
There are plenty of good reasons to create and maintain a warehouse solution in your business. Here are our top three:
Data warehouses are powerful systems that organize and process large amounts of business data for complex analytical queries. They are typically built on OLAP (Online Analytical Processing) principles. It’s important to distinguish them from Online Transaction Processing (OLTP) systems, which handle real-time transactions and are optimized for quick, small reads and writes. In most cases, the choice of OLAP is already made for you, but since the distinction can be confusing, let’s spend a moment making it clear. In the zoo of OLAP-based data warehouses, you’ll find creatures such as Google BigQuery, Amazon Redshift, Azure Analysis Services, IBM Cognos, and quite a few more, all built for analytical data. This article focuses on OLAP systems because they serve analytics and machine learning purposes better than traditional transactional databases. Here’s why:
It’s worth noting that this article doesn’t cover NoSQL databases, which are more suitable for unstructured data and application runtime data handling. Now that we understand the importance and benefits of data warehousing, let’s explore how cloud-based solutions have revolutionized the landscape. Cloud data warehouses, such as Google BigQuery and Amazon Redshift, have emerged as powerful tools specifically designed for handling analytics workloads. With their optimized architectures, flexible pricing models, and self-management capabilities, these cloud-based solutions offer organizations an efficient and scalable way to unlock the full potential of their data.
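To make the OLTP-versus-OLAP contrast concrete, here is a minimal toy sketch in Python. It is not a real database engine, and all the data and names are made up for illustration; it only shows why the column-oriented storage used by analytical warehouses lets an aggregate query touch far less data than a row-oriented transactional layout.

```python
# Row-oriented layout (typical for OLTP): each record stored as one unit.
rows = [
    {"order_id": 1, "customer": "a", "amount": 10.0, "country": "DE"},
    {"order_id": 2, "customer": "b", "amount": 25.5, "country": "NL"},
    {"order_id": 3, "customer": "a", "amount": 7.25, "country": "DE"},
]

# Column-oriented layout (typical for OLAP): each column stored contiguously.
columns = {
    "order_id": [1, 2, 3],
    "customer": ["a", "b", "a"],
    "amount": [10.0, 25.5, 7.25],
    "country": ["DE", "NL", "DE"],
}

def total_amount_row_store(rows):
    # Must read every full row, even though only one field is needed.
    return sum(row["amount"] for row in rows)

def total_amount_column_store(columns):
    # Reads only the single column involved in the aggregate.
    return sum(columns["amount"])

# Both layouts give the same answer; the columnar one simply reads less data.
assert total_amount_row_store(rows) == total_amount_column_store(columns)
```

In a real columnar warehouse the effect is the same but at scale: an aggregate over one column of a wide, billion-row table skips every other column entirely.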
In recent years, analytics data warehouses have made significant strides in their development. With our ability to collect data growing, historical data has become abundant. It’s no surprise that systems like Google BigQuery and Amazon Redshift have gained prominence in the analytics scene since they were purpose-built for this. Other benefits include:
However, you might wonder about compliance issues when storing your analytics data in the cloud. Here are a few points to consider:
With these factors in mind, you can be confident that cloud-based data warehouse projects offer significant benefits in terms of operational efficiency.
Now that we’ve got a grasp of why data warehouses benefit your organization and which cloud solutions are available, let’s dive into how to set up your own. At the end of the day, your analytics capabilities depend heavily on how you design your data warehouse. The rest of this article goes over some important topics to guide you when deciding how to organize your data. This advice is valid whether you are supporting analytics teams with their dashboarding efforts or data science teams with their training and prediction workloads, and is based on the following two metrics:
Let’s talk tables…
Partitioned tables divide your data into smaller segments, making it easier for your queries to quickly scan and retrieve the relevant information. For example, if you have a table with a time column, you can partition it based on time intervals. This reduces the number of bytes scanned, especially when you only need to analyze a specific date range. Partitioning can also help with data lifecycle management by automatically deleting old partitions after a certain period without new data. Additionally, in systems like BigQuery, you can enforce that queries targeting a table must filter on the partition column, so you can be sure that analysts are at least aware of this optimization strategy before they start using the underlying data.

Clustered tables are organized into storage blocks based on the values of the clustering columns. This arrangement reduces the amount of data scanned when filtering or grouping by those columns. Since the storage blocks are already sorted, it also improves query performance when ordering the data by the clustered columns. When choosing clustering fields, focus on those frequently used for filtering or grouping in your queries.

Both partitioned and clustered tables optimize query performance and reduce the amount of data scanned. Partitioning is beneficial when you have time-based data or want to manage data lifecycles easily. Clustering is advantageous when your queries frequently involve filtering or grouping by specific fields, though its benefits are more pronounced with larger tables and a higher number of entries per storage block.
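Partition pruning can be illustrated with a small Python sketch. This is a toy model, not how a warehouse engine is actually implemented, and the table and event names are invented; it only shows why a date filter lets whole segments of data be skipped instead of scanned.

```python
from datetime import date

# Toy model of a date-partitioned table: one bucket of rows per day.
partitions = {
    date(2023, 7, 1): [{"event": "view"}, {"event": "click"}],
    date(2023, 7, 2): [{"event": "view"}],
    date(2023, 7, 3): [{"event": "purchase"}, {"event": "view"}],
}

def scan(partitions, start, end):
    """Return all rows whose partition falls within [start, end].

    Partition pruning: buckets outside the date filter are skipped
    entirely, so their rows never contribute to the bytes scanned.
    """
    matched = []
    for day, rows in partitions.items():
        if start <= day <= end:
            matched.extend(rows)
    return matched

# Only the 2023-07-02 and 2023-07-03 buckets are read: 3 of 5 stored rows.
rows = scan(partitions, date(2023, 7, 2), date(2023, 7, 3))
```

In BigQuery the same idea applies per partition rather than per Python dict entry, and the pruning happens automatically whenever the query filters on the partition column.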
Normalization is a method used to organize data into multiple tables, reducing redundancy and ensuring data consistency and integrity. On the other hand, denormalization involves merging data from multiple tables into a single table, which speeds up data retrieval.

When your data is normalized, your queries often involve joins. However, this can lead to data being shuffled across different computing units during the query process, resulting in increased complexity and slower query performance. In some cases, it may even exceed the computation-to-scanned-data ratio limit of the BigQuery free plan, especially when using cross joins.

To overcome these limitations, denormalization is a useful approach. Denormalizing your data involves keeping nested and repeated fields within your dataset, eliminating the need for joins. For example, in a normalized data model, the products sold in each transaction would be stored in a separate table linked by a transaction ID. In a denormalized scenario, a single table would include a repeated (array) field containing the products sold within each transaction. This simplifies the execution process for your queries and improves performance.
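The transaction example can be sketched in plain Python. This is a simplified illustration with made-up table and field names, not warehouse code: it contrasts answering "which products were sold per transaction" via a join over two normalized tables with reading the answer directly from a denormalized table that carries a repeated (array) field.

```python
# Normalized model: transactions and line items in two tables, linked by ID.
transactions = [
    {"tx_id": 1, "customer": "a"},
    {"tx_id": 2, "customer": "b"},
]
line_items = [
    {"tx_id": 1, "product": "shoes"},
    {"tx_id": 1, "product": "socks"},
    {"tx_id": 2, "product": "hat"},
]

def products_per_tx_joined(transactions, line_items):
    # The join must match every line item to its transaction first;
    # in a distributed engine this is where data gets shuffled around.
    out = {}
    for tx in transactions:
        out[tx["tx_id"]] = [
            li["product"] for li in line_items if li["tx_id"] == tx["tx_id"]
        ]
    return out

# Denormalized model: one table with a repeated (array) field, no join needed.
transactions_denormalized = [
    {"tx_id": 1, "customer": "a", "products": ["shoes", "socks"]},
    {"tx_id": 2, "customer": "b", "products": ["hat"]},
]

def products_per_tx_nested(rows):
    # Each row already carries its products; a single pass suffices.
    return {row["tx_id"]: row["products"] for row in rows}

# Both models answer the question identically; only the work differs.
assert products_per_tx_joined(transactions, line_items) == \
    products_per_tx_nested(transactions_denormalized)
```

The trade-off is the usual one: the denormalized table duplicates customer data across rows and is harder to update consistently, which is why this pattern fits analytical rather than transactional workloads.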
Analytics-ready tables are a crucial component in data analytics. They are created by transforming raw data from transactional systems like SAP HANA into tables that contain dimensions (such as user, department, and brand) and aggregated metrics (such as conversions). When designing analytics-ready tables, there are important considerations to keep in mind:
By considering these factors, you can design analytics-ready tables that meet the needs of your organization while providing flexibility for various use cases.
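The transformation behind an analytics-ready table can be sketched in a few lines of Python. The event source, field names, and the "conversions" metric here are hypothetical, and a real pipeline would run as SQL or an ETL job rather than in-memory code; the sketch only shows the shape of the step: raw transactional events in, one row per dimension value with pre-aggregated metrics out.

```python
from collections import defaultdict

# Hypothetical raw events as they might arrive from a transactional system.
raw_events = [
    {"user": "u1", "department": "sales", "action": "conversion"},
    {"user": "u2", "department": "sales", "action": "visit"},
    {"user": "u1", "department": "support", "action": "conversion"},
    {"user": "u3", "department": "sales", "action": "conversion"},
]

def build_analytics_table(events):
    """Aggregate raw events into an analytics-ready table: one row per
    dimension value (here: department) with a pre-computed metric."""
    conversions = defaultdict(int)
    for event in events:
        if event["action"] == "conversion":
            conversions[event["department"]] += 1
    return [
        {"department": dept, "conversions": count}
        for dept, count in sorted(conversions.items())
    ]

table = build_analytics_table(raw_events)
```

Dashboards and models then query this small, pre-aggregated table instead of re-scanning the raw events on every request.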
In the previous section, we covered the practical aspects of designing analytical data tables. But what about insights from other departments? Who designs those tables? Do we rely solely on a centralized data team? Now we need to start thinking about working in a cross-team environment.

Teams often rely on a centralized group of engineers, such as a data platform or business intelligence team, to handle data flows and analysis. However, this can create a bottleneck, as these engineers need to be knowledgeable about every domain within the organization.

The concept of data mesh, introduced by Zhamak Dehghani, offers a solution for achieving data independence within teams. In this approach, teams take responsibility for ingesting, processing, and analyzing their own data, and can publish data products within a shared data platform. A central data platform team manages the underlying technology of the platform, including query engines, databases, automated policies, and access management, and provides essential services and guidelines to ensure interoperability and compliance. With this support, domain-driven teams can create their own data products within their designated areas in the data platform, such as AWS accounts or Google Cloud projects, while the data platform and governance teams oversee interoperability, security, compliance, and more.
Implementing a data mesh comes with technical challenges: ETL pipelines need adjustment, data product tables must be designed to serve various use cases, and user management must accommodate access divisions between public and in-process data. However, the biggest hurdle lies in the organizational aspect.

Let’s consider an example. Your customer care center holds valuable data on customer satisfaction and pain points, but its employees lack the skills to take ownership of that data, since their focus is on operational tasks. Integrating data capabilities into the team is not a quick decision, and knowledge sharing with the central data team may be limited. Additionally, data-oriented employees may not possess the domain-specific knowledge needed for context.

Separating business ownership from technical ownership of data can be beneficial. However, adopting a data mesh architecture requires careful consideration. Ask yourself:
These factors should be carefully evaluated before adopting a data mesh architecture.
Fortunately, cloud providers offer tools that simplify the technical side of implementing a data mesh. For example, the Analytics Hub platform on Google Cloud, built on BigQuery, enables federated data exchange. The data platform and governance teams can jointly manage the platform, allowing domain projects to create data exchanges with shared datasets. Through data contracts, known as data exchanges in Analytics Hub, domain-driven teams can collaborate by publishing and subscribing to data. In AWS, the Glue product helps catalog and document data, while Lake Formation assists in governing data sources and enhancing security. These tools are particularly useful for implementing a data mesh within a data lake environment, although they are less focused on structuring the architecture within a data warehouse (Redshift) environment.
In the upcoming articles, we’ll explore the tools that can help you manage and run ETL pipelines within your cloud data warehouse environment. We’ll discuss user-friendly options that require no coding experience, as well as more technical tools that offer advanced capabilities. Whether you prefer a simple, intuitive solution or have a deeper technical understanding, we’ve got you covered. Stay tuned to learn about the various options available for orchestrating and executing your ETL pipelines effectively. If you want to leave nothing to chance, our data experts will be glad to organize a one-to-one call and help you set up a valuable data warehouse solution for your organization.
Building a data warehouse provides complete control over data management, scalability, and advanced analytics for data-driven decision-making. Here are our five key takeaways from this article: