A data lake is a way of storing a vast amount of data in its natural format.
While data warehouses store data in hierarchies, data lakes have a flat architecture. In some ways, data lakes can replace data warehouses, but they are most often used alongside governed data storage.
This 2016 trend has left many experts conflicted, while others are excited to explore its potential.
The initial challenge is that data lakes run the risk of becoming “data swamps,” a term coined by 2014 Turing Award winner Michael Stonebraker. Data lakes combine unstructured and structured data, and when this diverse and voluminous data has not been properly curated, it’s nearly impossible to derive insights.
When the data is curated, commonly through the use of metadata, professionals are able to apply schema on read: the structure is defined when the data is queried rather than when it is stored. In the past, a requirement of data warehousing was to define how the data would be read before the warehouse was built.
Instead, with schema on read, data experts don’t have the same limitations for deriving insights because they can easily run tests and play with the data in imaginative ways. Data lakes are also described as “sandboxes” that can spur innovation.
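To make schema on read concrete, here is a minimal sketch in Python. It assumes raw records land in the lake as JSON lines with inconsistent fields; the sample records, the two schemas and the `read_with_schema` helper are all hypothetical and stand in for whatever storage and query tools a real lake would use.

```python
import json

# Hypothetical raw records, stored as-is: no schema was enforced on write.
RAW_LINES = [
    '{"user": "ana", "amount": "19.90", "ts": "2016-03-01"}',
    '{"user": "ben", "amount": 5, "channel": "web"}',
    '{"user": "cho", "note": "no purchase"}',
]

# Two hypothetical schemas: field name -> (type caster, default value).
SALES_SCHEMA = {"user": (str, ""), "amount": (float, 0.0)}
CHANNEL_SCHEMA = {"user": (str, ""), "channel": (str, "unknown")}


def read_with_schema(lines, schema):
    """Parse each raw line and project it onto the schema at read time."""
    rows = []
    for line in lines:
        record = json.loads(line)
        row = {}
        for field, (cast, default) in schema.items():
            value = record.get(field)
            # Missing fields fall back to the default instead of failing.
            row[field] = cast(value) if value is not None else default
        rows.append(row)
    return rows


# The same raw data answers two different questions without being reloaded.
sales = read_with_schema(RAW_LINES, SALES_SCHEMA)
channels = read_with_schema(RAW_LINES, CHANNEL_SCHEMA)
```

The point of the sketch is that nothing about the stored records had to change between the two reads. Each analysis brings its own schema, which is what lets data experts experiment freely.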
Data warehousing automation tools have adapted to schema-on-read practices through late-binding and machine learning techniques. Modern tools used to curate and consolidate data have advanced quickly, and it’s important to remember these practices are entirely new.
So, What’s Damming Data Lakes?
While data lakes may sound like a no-brainer and the natural next step for data warehousing, they also face many obstacles.
Here, we explain the key differences between data lakes and data warehouses through a fishing metaphor, and include details about the professionals in charge of data storage practices, such as the languages they must know and their key responsibilities.
Experts from Tableau, Qlik and Logi Analytics offered us their advice and predictions. We’ve also included a comment from Gartner on this trend.
Feel free to share this page and the image below to help others understand the next generation of data storage.