If you have chosen to click on this blog, I bet you have heard the term 'BigData'. Even if you are not a data professional (Data Scientist, Data Engineer or similar), BigData isn't too difficult to understand: it refers to extremely 'big' sets of 'data' that are treated differently from normal sets of data due to their sheer size. For many mature businesses, making data-driven decisions means implementing BigData. Quite often the term 'BigData' is accompanied by the terms 'Data Warehouse' and 'Data Lake'. These are not quite as transparent, so today I am going to dive deep into them and give you an understanding of the key differences between the two.
Humans tend to use things in ways they were not designed for - this is what makes us human. By design, the primary purpose of a computer was to compute (or process), not to store. Storage came as a secondary, temporary or even optional component in many designs. This is how mathematicians and engineers used computers bigger than your bedroom to solve their tasks. But just a decade later we started to realise that computers can locate a necessary piece of information much faster than a human browsing through copious folders of paper. Nowadays, we process mountains of information which need to be collected, stored and made available for analysis. Both the Data Warehouse and the Data Lake address this task. So what actually are they?
A Data Warehouse is similar to a traditional warehouse - perfectly organised, with full control over what is inside and where every item in the inventory is located. Data Warehousing has a long history in the enterprise sector as a way to store, manage and analyse structured datasets. Usually, the data stored in a Data Warehouse is cleaned, pre-aggregated and organised for specific business purposes. This data is then made directly accessible to different BI tools. As in a physical warehouse, the inventory (the data) is organised and mapped in a way that makes complete sense and serves a specific, predefined purpose. In many cases, a Data Warehouse is used in "write once, read many times" mode.
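To make the "organised inventory" idea concrete, here is a minimal sketch of a table whose schema is enforced at write time. It uses Python's built-in `sqlite3` module purely as a stand-in for a real warehouse; the `sales` table, its columns and the `load_row` helper are hypothetical names chosen for illustration.

```python
import sqlite3

# An in-memory SQLite database stands in for the warehouse (assumption:
# a real warehouse would be a dedicated system, not SQLite).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE sales (
        order_id INTEGER PRIMARY KEY,
        region   TEXT NOT NULL,
        amount   REAL NOT NULL CHECK (amount >= 0)
    )
    """
)


def load_row(row):
    """Try to write one row; return False if it violates the schema."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO sales (order_id, region, amount) VALUES (?, ?, ?)",
                row,
            )
        return True
    except sqlite3.IntegrityError:
        return False


# Well-formed rows are accepted.
load_row((1, "EMEA", 120.5))
load_row((2, "EMEA", 79.5))
load_row((3, "APAC", 40.0))

# Rows that break the rules are rejected before they reach the table.
load_row((4, None, 10.0))    # missing region
load_row((5, "APAC", -5.0))  # negative amount

# Readers (for example a BI tool) can rely on every stored row being clean.
totals = dict(conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region"))
```

Because bad rows never get in, every later read can skip validation, which is what makes the "write once, read many times" pattern cheap on the reading side.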
A Data Lake includes multiple streams of data that all flow together to produce a 'lake' of different data types.
Data Lakes are a newer technology, usually built on an open-source ecosystem such as Hadoop. Data Lakes allow structured, unstructured or even raw data sets to be collected together without any pre-processing. Schema definition, processing and filtering are usually done if, and when, the data is read.
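The "schema on read" idea can be sketched in a few lines of Python. In this hedged example the raw lines are kept exactly as they arrived, and a reader decides what structure it needs only when it runs. The sample records and the `read_clicks` function are hypothetical and are not tied to Hadoop or any specific lake product.

```python
import json

# Raw records are stored untouched, whatever their shape (hypothetical sample data).
RAW_LINES = [
    '{"user": "ana", "event": "click", "ms": 120}',
    '{"user": "ben", "event": "view"}',
    "2024-01-01 GET /index.html 200",
    '{"user": "ana", "event": "click", "ms": "95"}',
]


def read_clicks(lines):
    """Apply a schema at read time: keep click events, coerce `ms` to int."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # this reader cannot interpret the line, so it skips it
        if not isinstance(record, dict) or record.get("event") != "click":
            continue
        try:
            yield {"user": str(record["user"]), "ms": int(record["ms"])}
        except (KeyError, TypeError, ValueError):
            continue  # record does not fit this reader's schema


clicks = list(read_clicks(RAW_LINES))
```

Nothing was rejected when the data was stored. A different analysis, say one that parses the plain web server log line, could read the same raw lines with a completely different schema.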
You should already be able to tell that there are some key differences between 'Data Lake' and 'Data Warehouse'. Let's break these down in a bit more detail:
|DATA WAREHOUSE||DATA LAKE|
|Data is structured and processed.||Data is structured, unstructured or raw (like log files).|
|Schema is enforced when data is written (schema on write).||Schema is applied when data is read (schema on read).|
|More expensive for large data volumes.||Designed for low-cost storage.|
|Less agile - fixed configuration, changing the schema can be difficult.||Highly agile - configure and reconfigure as needed.|
|More "traditional" approach with mature security.||Still maturing overall, with less granular security control.|
|Best suited for business professionals.||Best suited for data scientists and data engineers.|
|Typical sources: CRM, financial transactions, ERP.||Typical sources: social media, web server logs, sensor data, documents, media files.|
Data Warehousing is perfect for mature but evolving businesses that have a fixed set of data sources, each providing prepared and structured data. A traditional Data Warehouse is an expensive resource but is highly beneficial for enterprise organisations. Processing in a Data Warehouse should be completed before, or as, data is written. Some common cloud Data Warehouses are Google BigQuery and Amazon Redshift.
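As a rough illustration of "processing before data is written", the sketch below cleans and pre-aggregates a handful of raw orders into the rows that would then be loaded. The field names, the sample values and the `transform` function are assumptions made for the example. Real pipelines would normally use a dedicated ETL tool rather than plain Python.

```python
from collections import defaultdict

# Hypothetical raw orders as they might arrive from a source system.
raw_orders = [
    {"day": "2024-01-01", "region": " emea ", "amount": "100.0"},
    {"day": "2024-01-01", "region": "EMEA", "amount": "50.5"},
    {"day": "2024-01-02", "region": "apac", "amount": "20"},
]


def transform(orders):
    """Clean region names, convert amounts and aggregate to daily totals."""
    totals = defaultdict(float)
    for order in orders:
        region = order["region"].strip().upper()
        totals[(order["day"], region)] += float(order["amount"])
    return [
        {"day": day, "region": region, "total": total}
        for (day, region), total in sorted(totals.items())
    ]


# These cleaned, pre-aggregated rows are what would be written to the warehouse.
warehouse_rows = transform(raw_orders)
```

The cost of this step is paid once, at write time, which is why adding a new data source or changing the shape of the output tends to be slower in a warehouse than in a lake.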
A Data Lake is an agile data system that supports innovation and insights and is prepared for what the future has to offer. Data storage and retention are much easier and cheaper than in a Data Warehouse. Processing in a Data Lake is completed when the data is read, so Data Lakes can dynamically adapt to the analysis at hand. Data Lakes are increasingly included in enterprise data strategies because of the insights and flexibility they offer. Cloud storage services commonly used by enterprises to build Data Lakes include Azure Blob Storage and Amazon S3.