What’s Damming Data Lakes? The State of Data Warehousing and Top Challenges

A data lake is a way of storing a vast amount of data in its natural format.

While data warehouses store data in hierarchies, data lakes have a flat architecture. In some ways, data lakes replace data warehouses but are most often used in addition to governed data storage.

This 2016 trend has left many experts conflicted, while others are excited to explore its potential.


The initial challenge is that data lakes run the risk of becoming “data swamps,” a term coined by 2014 Turing Award winner Michael Stonebraker. Data lakes combine unstructured and structured data, and when this diverse and voluminous data has not been properly curated, it’s nearly impossible to derive insights.

When the data is curated, commonly through the use of metadata, professionals are able to apply schema on read. In the past, a requirement of data warehousing was to define how the data would be read before the warehouse was built, an approach known as schema on write.

Instead, with schema on read, structure is applied only when the data is queried, so data experts face fewer limitations in deriving insights: they can easily run tests and explore the data in imaginative ways. For this reason, data lakes are also described as “sandboxes” that can spur innovation.
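
To make schema on read more concrete, here is a minimal sketch in Python. It is only an illustration, not how any particular data lake product works: the sample records, the field names and the read_with_schema helper are all hypothetical. The raw records are stored untouched, and each question brings its own schema when the data is read.

```python
import json

# Hypothetical raw records, stored exactly as they arrived.
# Fields and types vary from record to record, as they would in a lake.
RAW_EVENTS = [
    '{"user": "ana", "amount": "19.99", "channel": "web"}',
    '{"user": "ben", "amount": 5, "coupon": "SPRING"}',
    '{"user": "cy"}',
]


def read_with_schema(raw_lines, schema):
    """Apply a schema at read time: select fields and cast their types.

    `schema` maps a field name to a casting function. Fields missing
    from a record come back as None instead of failing the read.
    """
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        row = {}
        for field, cast in schema.items():
            value = record.get(field)
            row[field] = cast(value) if value is not None else None
        rows.append(row)
    return rows


# Two different questions, two different schemas, same raw data.
revenue_view = read_with_schema(RAW_EVENTS, {"user": str, "amount": float})
channel_view = read_with_schema(RAW_EVENTS, {"user": str, "channel": str})
```

With schema on write, by contrast, the set of fields and their types would have to be fixed before loading, and a record such as the third one would need to be rejected or reshaped up front.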

Data warehousing automation tools have adapted to schema-on-read practices through late binding and machine learning techniques. The tools used to curate and consolidate data have advanced quickly, but it’s important to remember that these practices are still very new.

So, What’s Damming Data Lakes?

While data lakes may sound like a no-brainer and the natural next step for data warehousing, they also face many obstacles.

Here, we explain key differences between data lakes and data warehouses through a fishing metaphor, and include details about the professionals in charge of data storage practices, such as the languages they must know and their key responsibilities.

Experts from Tableau, Qlik and Logi Analytics offered us their advice and predictions. We’ve also included a comment from Gartner on this trend.

Feel free to share this page and the image below to help others understand the next generation of data storage.

[Infographic: data lake vs. data warehouse, Better Buys]


Comments

  1. Very helpful post with interesting infographic 🙂

  2. No, no, no!!! Data lakes DO NOT replace data warehouses!

    A data lake is just that – a place where data is pooled together, usually in its original form.

    A data warehouse is an ARCHITECTURE formulated using one of several methodologies – Third Normal Form, Data Vault or Star Schema.

    You still need to cleanse and structure the data in order to perform any meaningful analysis from it. You can’t just chuck your data in a data lake and hope for the best.

    Building a proper data warehouse is not a trivial task, and ETL tools are to be avoided. Now there are Data Warehouse Automation tools, and these should be considered before embarking on any Business Intelligence project.

    • Julia Scavicchio says:

      Thanks for your comment, Ian! I completely agree. The best way I’ve had this described is as a “sandbox” to explore relations in addition to having a structured warehouse.

  3. It’s a serious asset that data lakes enable both structured and unstructured data to be stored in the same place. While this is looked at as a positive thing, the technology is still fairly new, which means there are some kinks to iron out. It might be best to stick with what has been known to work until the technology is more reliable.

    • Julia Scavicchio says:

      On that thought, I think Brinkman said it best: “A good analogy is the state of data warehousing and business intelligence circa 1996.”

  4. Vincent Lassauw says:

    For a Big Data platform, I can see the logic in choosing a Data Lake architecture, but most businesses still run on a collection of structured, albeit heterogeneous, data where analysis needs to have predictable results.
    A Data Lake with no data quality, no metadata and many security issues is not a viable option in that scenario.
    For the user, fishing in this lake for meaningful results will require in-depth technical knowledge of the structure of all this unstructured data, both to make sense of it and to know how to combine data sources that by definition will have compatibility issues.
    Tools like Microsoft Power Query, used to model your data before visualizing it, are way too simplistic to create the complex connections that lie underneath all that unstructured data, so you will have a lot of issues with duplicates, data getting lost when linked between sources, and incorrect results due to a lack of data quality.
    Yes, a Data Lake is a nice sandbox, but a sandbox is for our kids to play in and, yes, maybe once in a while to get an interesting new idea. Once we grow up, we find that a client waiting for the relevant data analysis will not take kindly to being told to be patient, that somewhere in the sandbox lie the answers to his questions and, moreover, answers to questions he didn’t even think of. “It’s a big lake, sir” will not be an answer any business will feel inclined to accept in a world where business-critical decisions are very much time and value driven, and neither of those is likely to find a Data Lake to be the answer to its needs.
