Guide to Enterprise Data Architecture Part 3
If you look at where companies keep their analytical data, you’ll quickly see that this space has split into two major architectural and technology stacks: data warehouses and data lakes.
What are their defining characteristics? What factors should you take into account when choosing a data warehouse vs. a data lake for your company?
Keep reading to get the data warehouse and data lake essentials – together with a handful of best practices to help you succeed, no matter which option you end up picking.
Note: This is the third part of our series that dives into enterprise data architecture – you can find the previous parts here:
- OLTP: Guide to Enterprise Data Architecture Part 1
- Analytical Data: Guide to Enterprise Data Architecture Part 2
What is a data warehouse?
Data warehouses represent an approach for storing structured data. A data warehouse serves as a central repository where companies store critical data for analytics and reporting.
Data warehouses use a relational schema that is optimized for read operations and designed to help with online analytical processing (OLAP). Most of the time, it’s SQL queries that enable functionalities such as business intelligence (BI), reporting, and visualization.
To that end, data warehouses often contain a mix of current and historical data that has been extracted, transformed, and loaded (ETL) from various sources, including internal and external databases.
For decades, data warehouses served as the foundation for descriptive analytics, allowing users to easily query and combine vast amounts of historical data.
Benefits of data warehouses
Teams stand to gain a lot from implementing data warehouses:
- Data warehouses provide a stable, centralized storage location for vast volumes of historical data.
- Teams can use insights derived from data stored in a data warehouse to improve company processes and decision-making.
- They boost data quality and improve BI performance and capabilities by incorporating data from many sources.
- They give everyone in the company access to past data.
Data warehouse architecture
A data warehouse’s design differs depending on many factors. Data warehouses can have one, two, or three tiers; the three-tier architecture is perhaps the most common one and looks as follows:
- Bottom tier (data layer) – this is where data is delivered to the warehouse.
- Middle tier (application layer) – where an OLAP server handles data.
- Top tier (presentation layer) – intended for end-users, includes specific tools and application programming interfaces (APIs) for data extraction and analysis.
Traditionally, data warehouses were hosted on-premises, but today an increasing number of data warehouses employ cloud storage to host and analyze massive amounts of data. Some of the most popular cloud data warehouse applications are:
- Microsoft Azure – Azure Synapse Analytics and Azure SQL Database
- AWS – Amazon Redshift
- Google Cloud Platform – BigQuery
- Snowflake
4 key characteristics of a data warehouse
- Subject-oriented – You can quickly perform data analysis by restricting it to a certain topic area of the data warehouse. Removing unnecessary information on a subject that isn’t required for decision-making makes the whole analysis process easier.
- Integrated – Data warehouses are collections of data from several sources combined into a single platform. This data is retrieved and converted uniformly without regard for the source from which it was collected; this makes a data warehouse integrated.
- Time-variant – A data warehouse offers a historical perspective: it stores massive amounts of data from all source databases, maintained in line with time factors. Each record carries a temporal component, and the warehouse covers a broad time range.
- Non-volatile – Data enters the warehouse only through controlled loads, and once loaded it is not modified or deleted by day-to-day operations. This protection from transient changes is what makes a data warehouse non-volatile.
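The subject-oriented and time-variant traits above translate directly into the kind of query a warehouse is built for: aggregating historical data along a time dimension. A minimal sketch, using Python's sqlite3 as a stand-in for a real warehouse engine (table and column names are illustrative only):

```python
import sqlite3

# Hypothetical warehouse fact table; sqlite3 stands in for a real
# warehouse engine, and the table/column names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_fact (sale_date TEXT, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales_fact VALUES (?, ?, ?)",
    [("2023-01-15", "EU", 120.0), ("2023-01-20", "US", 80.0),
     ("2023-02-02", "EU", 200.0)],
)

# A typical OLAP-style query: roll historical facts up by month and region.
rows = conn.execute(
    "SELECT substr(sale_date, 1, 7) AS month, region, SUM(amount) "
    "FROM sales_fact GROUP BY month, region ORDER BY month, region"
).fetchall()
print(rows)
```

The same GROUP BY pattern is what BI tools generate under the hood when a user slices a dashboard by time period.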
What is a data lake?
Data lakes are the industry standard for storing massive amounts of file-based data to facilitate data science and large-scale analytical data processing scenarios.
Data lakes became popular in businesses all around the world in the mid-2010s. The rise of smartphones, the Internet of Things (IoT), digital and social media, and e-commerce resulted in the rapid growth of big data. With it came an urgent need for organizations to store massive amounts of unstructured data and mine it for insights using data analytics and machine learning.
Unlike data warehouses, data lakes provide a scalable and flexible solution that doesn’t require pre-defined schemas. Learn more about data lakes here: Data Lakes: The Definitive Guide
Benefits of data lakes
The following are some of the advantages of a data lake for businesses:
- Expanded data types for storage – Since data lakes can store all data types, including ones that are critical for advanced analytics, companies can use them to identify opportunities for improving operational efficiency, increasing revenue, cost efficiency, and risk management.
- New forms of analytics – Data lakes open the door to new forms of analytics on previously inaccessible and compartmentalized data – think log files, data from click-streams, social media, and IoT devices — that are now centrally housed in the data lake.
- No data silos – When linked to a broader data architecture, data lakes can consolidate information from separate departmental silos, mainframes, and legacy systems, reducing data duplication. This helps teams build a 360-degree customer picture, which improves consumer targeting and marketing campaign orchestration.
Characteristics of data lakes
- Distributed computation and storage – Data lakes analyze and store massive amounts of possibly unstructured data using distributed compute and storage. Data is stored and processed across possibly thousands of servers (big data cluster). This approach took off in the 2010s with Apache Hadoop, an open-source big data software collection that enabled organizations to distribute massive amounts of data across many machines (HDFS distributed storage) and run SQL-like queries on tables stored across them (Hive and Spark distributed compute).
- Schema-on-Read – Data lakes employ the Schema-on-Read paradigm, in which a schema is constructed only when the data is read. This implies you can place data in the lake in bulk without the costly requirement to establish schemas immediately, allowing for the creation of schemas for specific use cases later on.
- Types of files – Unstructured data is housed in data lakes, which contain text files like txt and doc, audio files like MP3 and WAV, graphics like JPEG and PNG, videos like MP4, and even whole PDFs, social network posts, emails, webpages, and sensor data. Data lakes (and NoSQL databases) also allow you to store semi-structured data, such as JSON and XML files, in their native format.
- Cloud computing options – Public cloud providers such as Amazon Web Services, Microsoft Azure, and Google Cloud are increasingly hosting data lakes. This highly scalable infrastructure allows the organization to quickly react to changing demands for computing and storage resources.
- Near real-time processing – data processing technologies like Apache Kafka allow data to be processed in near real-time as a continuous stream of data, helping teams build systems that deliver instant insights and patterns or take rapid responsive action to events as they occur.
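The Schema-on-Read paradigm from the list above can be sketched in a few lines of Python. The record fields and defaults here are hypothetical; the point is that no schema exists at write time, and one is applied only when a specific use case reads the data:

```python
import json

# Raw events were dumped into the lake as-is (e.g. JSON lines); no schema
# was declared at write time. The file contents are illustrative.
raw_lines = [
    '{"user": "a", "clicks": 3, "device": "mobile"}',
    '{"user": "b", "clicks": 7}',
]

def read_with_schema(lines):
    # Schema-on-read: the schema for one use case is applied only now.
    # Missing fields get defaults instead of having failed ingestion earlier.
    for line in lines:
        record = json.loads(line)
        yield {"user": record["user"],
               "clicks": int(record.get("clicks", 0)),
               "device": record.get("device", "unknown")}

print(list(read_with_schema(raw_lines)))
```

A different use case could read the very same raw lines with a different schema, which is exactly the flexibility the lake is optimized for.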
Comparing data warehouse and data lake
Data lakes and data warehouses are both used in businesses to combine data from many sources, but their consumers and optimizations differ.
A data lake works like a collection of data streams from many sources. Both structured and unstructured data are permitted, and no processing is performed on the data until it reaches the data lake. This is very appealing to data scientists and apps that use data for AI/ML since it opens the doors to using new methods.
A data warehouse, on the other hand, is a centralized location where structured data can be processed for particular business insight goals. The main difference is that you need to determine your reporting needs during the planning and design of a data warehouse and the ETL process.
Another way to look at these two technologies is that data lakes are schema-free and can contain relational data from business applications as well as non-relational logs from servers and sites like social media. Data warehouses, on the other hand, rely on a schema and only accept relational data.
Here are some more differences to consider when choosing between these two solutions:
Data types and variety
Raw data is data that hasn’t been processed yet for a specific purpose. It’s typically unstructured (for example, a video file) or semi-structured (for example, photos with metadata attached).
Data lakes usually store raw, unprocessed data, which frequently includes multimedia files, log files, and other large files, whereas data warehouses primarily store organized, processed, and refined data, which typically consists of text and numbers.
Data lakes and warehouses can both have an infinite number of data sources. However, before you can save the data in a data warehouse, you must first create your schema since only structured data may be loaded into the system. Data lakes, on the other hand, have no such requirements.
Preprocessing is usually required before storing data in a data warehouse. Data practitioners use ETL tools to clean, filter, and organize data collections before loading them. Data lakes, on the other hand, may contain any type of data.
You have the freedom to select whether or not to do preprocessing. Extract, load, and transform (ELT) technologies are often used in organizations: they initially feed the data into the lake and then transform it only when necessary.
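The ELT pattern can be sketched as follows — a minimal illustration in Python, with a plain list standing in for the lake and illustrative field names:

```python
# Hedged sketch of the ELT pattern: load first, transform later.
# A plain list stands in for the lake; field names are illustrative.
lake = []

def extract_and_load(source_rows):
    # Load raw records untouched; no schema or cleaning at this stage.
    lake.extend(source_rows)

def transform_for_report(raw):
    # Transformation happens only when a specific use case needs it:
    # here, normalizing email addresses and dropping records without one.
    return [{"email": r["email"].lower().strip()} for r in raw if r.get("email")]

extract_and_load([{"email": " Alice@Example.com "}, {"name": "bob"}])
print(transform_for_report(lake))
```

Note that the record without an email stays in the lake untouched — another report with different requirements could still use it.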
Data processing and analytics
Data warehouses are optimized for structured queries and predefined analytics. Data lakes, on the other hand, enable exploratory analysis and advanced analytics with tools like Hadoop and Spark.
A data warehouse can be considered more reliable since you can perform processing ahead of time. To ensure data accuracy, certain operations such as de-duplication, sorting, summarizing, and verification can be performed in advance. If no checks are performed ahead of time, duplicates, erroneous, and unverified data may wind up in a data lake.
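The ahead-of-time operations mentioned above — de-duplication, verification, and sorting — can be sketched like this (record fields and the validity rule are illustrative):

```python
# Sketch of warehouse-style preprocessing done before load:
# de-duplication, simple verification, and sorting. Fields are illustrative.
raw = [
    {"id": 2, "amount": 50.0},
    {"id": 1, "amount": 30.0},
    {"id": 2, "amount": 50.0},   # duplicate key -> dropped
    {"id": 3, "amount": -5.0},   # fails verification -> dropped
]

seen = set()
clean = []
for rec in raw:
    if rec["id"] in seen:        # de-duplicate on the primary key
        continue
    if rec["amount"] < 0:        # verify before load
        continue
    seen.add(rec["id"])
    clean.append(rec)

clean.sort(key=lambda r: r["id"])  # sort for predictable, ordered loads
print(clean)
```

In a data lake, all four raw records would typically be stored as-is, and any such checks would be deferred to read time.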
A data warehouse is built to provide the quickest query performance possible. Business customers love data warehouses because they allow for faster reporting. Data lake architecture, on the other hand, favors storage volume and cost over performance. You get far more storage space for much less money, and you can still access data at respectable rates.
Data warehouse or data lake? Choosing the right approach for your company
Here are a few factors to consider when selecting between a data warehouse and a data lake:
What makes sense for the company will depend on who the end user is: a business analyst, a data scientist, or a business operations manager. If the major use case is business insights and reporting, a data warehouse is a good pick – though at a higher cost for setup and storage.
The data scientist might prefer to work with data lakes – they often want to try out new artificial intelligence and machine learning techniques and value access to both structured and unstructured data. A business analyst may be knowledgeable in SQL and just need to write a trend report on one aspect of the business, in which case a relational database works well.
Data itself: sources, formats, and structure
Other important issues include the number of data sources, the format of the data, and how predictable, consistent, or understandable the structure is ahead of time. Data lakes are open to unstructured data coming from a variety of sources, whereas data warehouses only allow structured data from multiple sources.
Data storage and budget limits
Big data provides businesses with commercial value, which should be represented in budgets for data management plans. Storage prices rise in lockstep with increasing data volume and velocity.
Data lakes are more cost-efficient since they store data in its raw form, but they may use more storage when processing and preparing data for analysis.
Requirements for data processing
Determining what the data model is, and when it has to be defined, is part of your data management strategy. Data lakes allow for the storage of raw data, including all metadata, and a schema can be applied when data is extracted for analysis.
Data warehouses require ETL operations that transform raw data into a predefined structure, often known as schema-on-write.
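A minimal sketch of schema-on-write, where records that don't fit the predefined structure are rejected (or routed aside) during the ETL step. Field names and types are illustrative:

```python
# Schema-on-write sketch: the target structure is fixed before load,
# and raw records are coerced to it or rejected during ETL.
# The expected fields ("order_id" as int, "total" as float) are illustrative.
def etl(raw_records):
    loaded, rejected = [], []
    for rec in raw_records:
        try:
            loaded.append({"order_id": int(rec["order_id"]),
                           "total": float(rec["total"])})
        except (KeyError, ValueError):
            rejected.append(rec)   # does not fit the schema -> not loaded
    return loaded, rejected

loaded, rejected = etl([{"order_id": "7", "total": "19.99"},
                        {"note": "free-form text"}])
print(loaded, rejected)
```

Contrast this with the schema-on-read sketch earlier: there, the free-form record would have been stored anyway and dealt with at query time.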
Data and technology ecosystem
Organizations differ in their attitudes toward open-source vs. proprietary software and the communities that support it. Data lakes are popular because of the extensive usage of Hadoop and the increase in unstructured data from multiple systems utilized across the company, as well as real-time data streams.
Another component of technology to examine is the system’s accessibility and integrity as data sources and structures change. It’s more expensive to update a data warehouse, but modifications to a data lake are straightforward.
Lakehouses: a hybrid approach and modern data architecture option
The increased quantity of data prompted many leaders to reevaluate their data storage and processing practices. Cloud data warehouses offer flexible data storage, but they require that you employ a certain format that is interoperable with existing business intelligence tools. This means that not all data types can be analyzed. Furthermore, the toolset at your disposal is limited.
For these reasons, many teams are considering data warehouse augmentation. This means adding a secondary data lake to the data management setup to service a larger range of analytical use cases.
This type of data warehouse modernization scenario doesn’t necessarily entail the retirement of existing technological assets. On the contrary, you can continue to use a data warehouse for the same set of processes while expanding your data sources and analytical capabilities with a data lake deployment.
Here’s an example of how a lakehouse architecture works
A data lake receives all unstructured and untransformed data. SQL and NoSQL data can be ingested from a variety of sources, including ERP or CRM systems, external marketing analytics tools, customer support systems, social media, public data sources, GIS platforms, and others.
You can then use the acquired data to explore the lake for relevant datasets and sources to deploy bespoke algorithms for anomaly prediction, predictive maintenance, or fraud detection.
Using an extract-transform-load (ETL) approach, some types of data can be processed and sent to a connected cloud-based data warehouse. After converting the data to a suitable format, you can use it via self-service BI tools, data visualization software, and other analytics tools, including legacy ones.
As a result, data is kept in two places:
- Data lake – a sandbox environment holding raw data that can be stored cheaply and accessed rapidly for exploration.
- Data warehouse – a serving environment that lets a larger number of users query refined data using a range of methods.
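The two-tier flow described above can be sketched end to end in Python, with sqlite3 standing in for the cloud warehouse and illustrative record shapes:

```python
import sqlite3

# End-to-end sketch of the lakehouse flow: raw records land in the lake,
# an ETL step shapes the subset that fits a reporting schema, and the
# result is served from a warehouse table. sqlite3 is a stand-in for a
# cloud warehouse; sources and fields are illustrative.
lake = [
    {"source": "crm", "customer": "a", "spend": "120.5"},
    {"source": "iot", "payload": "<binary sensor blob>"},  # stays raw in the lake
    {"source": "crm", "customer": "b", "spend": "80"},
]

warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE customer_spend (customer TEXT, spend REAL)")

# ETL: only records matching the warehouse schema are transformed and loaded.
for rec in lake:
    if rec["source"] == "crm":
        warehouse.execute("INSERT INTO customer_spend VALUES (?, ?)",
                          (rec["customer"], float(rec["spend"])))

served = warehouse.execute(
    "SELECT customer, spend FROM customer_spend ORDER BY customer").fetchall()
print(served)
```

The IoT blob never reaches the warehouse, but it remains queryable in the lake for data science work — which is the point of keeping both tiers.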
Data warehouse and data lakes: Implementation best practices
5 best practices for implementing a data warehouse
1. Identify your data warehousing requirements
Understand the potential of a data warehouse for your unique use case. Examine accessible data sources as well as the structure your data sets may require. If you need to collect data from many sources and consolidate and analyze past data, a data warehouse may fit this requirement perfectly.
2. Choose the data warehouse architecture
Depending on the size and breadth of your operations, you will be going for either an on-premises or cloud-based data warehouse. According to the TDWI 2021 Survey, 53% of companies have an on-premises data warehouse, while 36% use a cloud-based one.
A cloud data warehouse saves money and time and scales easily. It does not require hardware, server rooms, dedicated IT staff, or the associated operational fees. An on-premises data warehouse, on the other hand, gives you greater control and lets you apply stringent governance and regulatory compliance.
Many alternatives are available here – for example, you can combine on-premises and the cloud or use a managed data warehouse as a service (DWaaS) rather than setting up your own.
3. Pick the best data warehouse platform and tools
You also need to decide whether to use an ETL (extract, transform, and load) or an ELT (extract, load, and transform) tool to feed data into the data warehouse.
ELT solutions are adaptable and can quickly store fresh, unstructured data. They take data from the source systems, put it into the warehouse, and then transform it into the best-structured format possible. This eliminates redundant data and inconsistencies to make analysis easier.
The business intelligence (BI) tool is another significant choice. It’s key that you select a no-code BI tool to eliminate complications and difficulties. A quality BI solution will come with an easy-to-use interface and seamlessly integrate with other systems.
4. Implement a Master Data Management (MDM) Methodology
Data precision is critical for your engineering operations. Implementing an MDM methodology involves creating a regulated process that streams and warehouses only correct, consistent, and verified master data.
5. Implement Change Data Capture (CDC)
An ETL/ELT pipeline includes change data capture (CDC), which simplifies change tracking in the warehouse and guarantees that the changes are up to date. The history of changes is maintained in change tables, which show how the data has changed over time.
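The change-table idea can be sketched as follows. This is a hand-rolled illustration, not any particular CDC tool's API; the operation names and record structure are hypothetical:

```python
from datetime import datetime, timezone

# Minimal CDC sketch: every change to a tracked table is appended to a
# change table, preserving the history of how the data evolved over time.
current = {}        # current state, keyed by primary key
change_table = []   # append-only log of changes

def apply_change(op, key, value=None):
    # Record the change first, then apply it to the current state.
    change_table.append({
        "op": op, "key": key, "value": value,
        "ts": datetime.now(timezone.utc).isoformat(),
    })
    if op == "delete":
        current.pop(key, None)
    else:                      # "insert" or "update"
        current[key] = value

apply_change("insert", 1, {"status": "new"})
apply_change("update", 1, {"status": "shipped"})
apply_change("delete", 1)

print(current)                          # state after all changes applied
print([c["op"] for c in change_table])  # full change history survives
```

Even though the row no longer exists in the current state, the change table still shows how it was created, modified, and removed — which is what keeps the warehouse's change tracking up to date.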
Layer your CDC tools with data backup and encryption technologies to protect critical data from unauthorized users like warehouse testers and external trainers.
5 best practices for implementing a data lake
1. Define your data objectives
To keep your data lake from becoming a data swamp, determine your organization’s data objectives – the business results you’re looking to achieve – and choose an internal or external data curator who can evaluate new sources/datasets and manage what enters the lake based on those goals.
Clarity on the sort of data that must be collected can help your team to avoid the problem of data redundancy, which frequently skews statistics.
2. Record incoming data
As data is ingested into the lake, it should be documented. Technical metadata and business metadata are the most common types of documentation, but new kinds are continually emerging.
A data lake degrades into a data swamp without sufficient documentation, making it impossible to utilize, govern, optimize, and trust. This prevents users from locating the essential information.
3. Maintain a rapid ingestion time
The ingestion procedure should be completed as soon as possible. How do you boost ingestion performance? By adopting innovative data integration technologies for pipelining and orchestration, as well as eliminating earlier data enhancements and transformations.
This helps in making data available as quickly as possible once it is generated or modified, allowing some types of reporting and analytics to work on it.
4. Handle data with care
A data lake’s primary purpose is to offer precise source data for data exploration, discovery, and analytics. Much of the information acquired with the original data will be lost if a business processes the imported data with excessive aggregation, standardization, and transformation, negating the very purpose of the data lake. Use data quality remediation in moderation when processing.
5. Enable data versioning
Data version control is a key capability that allows teams to easily make and track changes, experiment in isolation, achieve reproducibility, and quickly recover from errors. There are few solutions for this on the market, and – luckily – some of them are open source.
lakeFS is an open-source solution that manages the data lake’s versions and employs Git-like semantics to generate and retrieve them. If you’re familiar with Git, you’ll feel right at home with lakeFS.
You can use concepts like a branch to establish an isolated version of the data, commit to create a repeatable point in time, and merge to incorporate your changes in a single atomic operation.
lakeFS assists you in maintaining a clean data lake in numerous ways, including:
- Isolated dev/test environments with copy-on-write – you can create isolated dev/test environments for ETL testing in an instant and at a low cost thanks to the usage of copy-on-write. This allows you to test and validate code changes without affecting production data, as well as execute analysis and experimentation on production data in an isolated clone.
- Data state at a certain point in time – this comes in handy for ML experiment reproducibility. With lakeFS, you can version all components of an ML experiment, including the data, and utilize copy-on-write to reduce the footprint of data versions. Another use case is troubleshooting issues.
- Data error recovery and rollback of data changes – When errors occur, erroneous data may enter production, or essential data assets may be mistakenly deleted. Backups are an ineffective tool for recovering from such disasters: they run on a routine schedule unrelated to the erroneous operation, so they may be out of date, and restoring from them requires filtering data at the object level – an inefficient procedure that can take hours, days, or even weeks. With lakeFS, rescuing data from deletion or corruption events becomes an immediate one-line operation: simply locate a suitable historical commit, and then restore to or copy from it.
Ready to dive into the lake? Start here.