Though most companies have plenty of data, a lot of work needs to be done to get it ready.
The data must be collected, cleaned and formatted properly, and stored in one place for analysis.
This process is known as data warehousing.
What is a Data Warehouse?
As the name implies, a data warehouse takes a lot of different data from various sources and stores it in one place. That data can come from transactional systems, such as marketing, sales, CRM and ERP systems, as well as external sources such as the web. The warehouse then holds the data so it can be used for analytical processing and generating reports.
Storing data this way is critical for running business intelligence.
While the data held in a data warehouse is typically available in other locations, it’s not really usable for analysis until warehousing is done.
The data sources are usually transactional systems, meaning they’re designed and built to use data when performing specific functions. Running reports and analysis from data held in those transactional systems would likely interrupt their normal operations, making it impossible to perform business intelligence and actually run the business at the same time.
Data warehouses, on the other hand, are designed specifically for analysis, making it possible to use all records from all sources at the same time to answer questions.
Types of Databases
Another reason databases are necessary is that data must often be stored in different ways for analytics than for processing transactions.
One key distinction in the ways data can be stored is the difference between a relational and multidimensional database:
- A relational database can be thought of as using a two-dimensional structure. Imagine a simple database such as an Excel spreadsheet. The data is organized using rows and columns. One key is that the data must be normalized so each attribute can be put in the proper place and the entries can be sorted.
- A multidimensional database, as the name implies, stores data based on more than two dimensions. Rather than an entry being defined by its row and column like in a spreadsheet, each entity in a multidimensional database can contain a number of different attributes and can exist independently of the other entries.
One example of a relational database could be a table listing different products and the number of units sold in each state in a given calendar year. In this table, the rows could be labeled with the products, with the columns assigned to the states and the numerical values filled in accordingly. The table could then be sorted for each state based on the number of sales.
In this example, if you want to look at any other attributes – for example, the number of sales in each month – that would require a separate relational database. That could be a table for each product, with the columns representing each state and the rows representing each month.
In comparison, a multidimensional database could hold all of that information in the same database. Each product could be thought of as its own entity in the database, with values stored for sales in each state, sales per month, etc.
Analytical vs Transactional
Data warehouses are most often built around multidimensional databases.
The reason is that relational databases are typical optimal for transactional systems – i.e., the systems involved in processing day-to-day business operations – whereas multidimensional databases better support analytical systems – i.e., those that are used for business intelligence.
Transactional and analytical systems have very different requirements. When processing transactions, the goal is to do so as quickly as possible, so the simple structure of a relational database is beneficial.
In analytics however, the important thing is to have all relevant information stored together, so multidimensional databases come in handy.
History of the Data Warehouse
In fact, the concept of the data warehouse was developed largely in response to the limitations of using databases optimized for transaction processing in order to perform analytics. The term “data warehouse” was coined in a 1988 IBM Systems Journal article, an architecture for a business information system, which explained the problem:
“The transaction-processing environment in which companies maintain their operational databases was the original target for computerization and is now well understood. On the other hand, access to company information on a large scale by an end user for reporting and data analysis is relatively new.”
This idea was developed further by Bill Inmon, who created the concept of a “Corporate Information Factory” – essentially a central repository creating an enterprise-level view of all an organization’s data.
Database vs. Data Warehouse vs. Data Mart
Using the concepts laid out by Inmon, IBM and others, a data warehouse can be defined as a top-down of the entire collection of data that can be used for analysis, whereas a database is a smaller set of data used for a specific purpose, often for transaction processing.
Another concept relevant to business intelligence is the data mart.
Essentially, a data mart is a smaller, more focused version of a data warehouse. Whereas the warehouse holds all of the data from the entire organization necessary to perform business intelligence, a data mart holds all of the data about one particular area.
For example, data marts may be broken up according to different operational areas, in which case the company would have a dedicated sales data mart, a dedicated finance data mart, etc. As with a data warehouse, the information would still come from multiple sources and be transformed into a common format, but the data mart would strictly contain sales or finance or another group of data.
This concept comes in part from the work of Ralph Kimball, one of the “fathers of data warehousing,” along with Inmon. Rather than Inmon’s top-down approach, Kimball advocated for building individual data marts for each section of an enterprise and then integrating them.
Data Marts Within a Data Warehouse
In some cases, data marts are used as a subset of a data warehouse.
The purpose of using data marts is typically to give different groups access to the data they need to make their own particular decisions without being able to see or change the rest of the data in the organization. Otherwise, those groups would need to work with all of the data in the warehouse, most of which isn’t relevant to them.
Data marts can be specifically tailored to different business units, which makes it much easier for business users to access only the data they need in the warehouse.
Data marts that exist as part of a larger data warehouse are called dependent data marts. It’s also possible to use standalone data marts, in which the different data marts are not connected.
However, for companies that want to do enterprise-wide business intelligence, dependent data marts are usually the way to go because the complete data warehouse can hold all the necessary information, while the individual data marts can still be used to address specific needs within the organization.
In addition, using dependent data marts makes it easier to manage the data and keep it secure, since it’s all being held in the same place.
Extract, Transform and Load
Populating a data warehouse isn’t as simple as just copying out of other systems. Data normally isn’t ready for analysis when it’s taken straight out of those transactional systems. Different applications may create data in different formats, or the data may be organized based on different criteria. In order to get a full picture of the entire business, it’s necessary to do some work to the data so different sets can be analyzed in relation to each other.
The quality of the data must also be checked before performing any business intelligence. Errors and other problems can lead to inaccurate data that ultimately impairs decision making. Data warehousing allows records to be verified before they’re used for analysis, making sure the organization has one “single version of the truth” to use for business intelligence.
Getting data ready for analysis
Gathering, preparing and storing data in the data warehouse is done through a process called Extract, Transform and Load (ETL). ETL tools:
- Extract data from internal and external sources
- Transform it into a standard format – for example, converting dates to the same format, splitting customer names into first and last, etc.
- Load data into the data warehouse
The transform stage is especially important when it comes to business intelligence. In order to get complete, accurate information about the organization as a whole, businesses must look at data from a variety of sources at the same time.
However, those sources usually include systems that come from different vendors, running on different types of hardware, and managed by different employees. That means a lot of work likely must be done in order for all of the data to make sense together.
Extract, Load, Transform
In addition to ETL, ELT (Extract, Load and Transform) is also becoming a more viable option for businesses. In this process, data is pulled from the sources and then transferred to a Staging Database, where integrity and business rule checks are performed. The data is then moved into the warehouse, where it’s transformed into the necessary formats.
Previously, ETL was the only option because earlier data warehouses didn’t have the capacity to perform the transformation. Therefore, other tools were needed in between the data sources and the warehouse.
However, as technology advances and data volumes grow, ELT is becoming more attractive due to how long ETL can take. If sets of data don’t make it to the warehouse until after they are transformed, that means there could be a significant delay before it becomes usable for the business. Transforming the data within the warehouse provides faster access to that data.
In addition, ETL is less flexible. If data is transformed before it gets to the warehouse, that usually means only the transformed version of the data is available for analysis. On the other hand, if data is loaded and then transformed, the raw data will still be available. That means if the company’s needs change while analysis is performed, it will be more likely to still have access to the data it needs.
Success in business intelligence depends in large part on what data the company has access to and how it’s used. Therefore, proper implementation of the data warehouse is critical.
Effectively using a data warehouse requires not just installing the right software and hardware, but also plenty of planning, research and change management in the organization. Most importantly, businesses need to make sure the right data is going into the warehouse. That means ensuring that all data that will be relevant for business intelligence is captured and stored, as well as using tools to check the quality of that data and remove irrelevant data.
Before implementing a data warehouse, companies should have a data governance framework in place. Data governance refers to the rules, policies and procedures organizations use for deciding which data should be kept, for how long, in what format, etc.
The strategy should cover not just BI and analytics, but the organization’s overall approach to data. Several attributes of the organization’s data must be properly checked, including, most importantly:
- Privacy and compliance – Certain data created or stored by the business may be subject to very strict regulations and standards. It’s important that those are upheld when the data is moved from an operational database to a data warehouse.
- Data quality – Data must be trusted before it’s relied upon for decision making. Therefore, processes and tools must be in place to check for data consistency and integrity.
- Business relevance – With the large amounts of data already required for business intelligence, it’s important to avoid complexity by storing and hanging on to irrelevant data.
- Metadata – Metadata is what helps data be identified, so properly managing metadata is critical for good data governance long term.
For effective data governance, business leaders should work closely with IT to create policies and standards defining the company’s data strategy and determine what tools may exist to help enforce those policies. To ensure consistency, policies must also be clearly communicated to all stakeholders throughout the organization.
Big Data and the Data Warehouse
As we mentioned in our previous section, Big Data and BI are becoming increasingly intertwined. Many data warehouses are now being complemented by technology capable of storing and processing Big Data.
Perhaps the most important Big Data tools today is an open source technology called Hadoop.
Hadoop, according to the Smart Data Collective, is a set of procedures and programs companies can use as the backbone for their Big Data operations. In order to hold massive amounts of Big Data in a way that is useful for analytics, Hadoop relies on a distributed file-systems that stores data in an accessible across a large number of linked storage devices.
Then, in order for data to be used, a model called MapReduce performs two important functions:
- Pulling data from the storage and putting it into a useable format (map)
- Performing mathematical summary operations for the user (reduce)
When Hadoop first appeared, some observers wondered if the technology may replace traditional data warehousing. However, most experts agree that Hadoop must be used in conjunction with a traditional warehouse, such as by preparing unstructured data so that it can be loaded into the warehouse, or for working with data the traditional warehouse can’t handle.
Reasons for that include the fact that Hadoop does not perform real-time analytics as well as a traditional warehouse, and the technology is still too complicated to rely on it when it’s not necessary.
In addition to establishing data governance, here are some other steps companies can take to make data warehousing a success:
1Decide whether to build or buy
There are plenty of data warehousing and ETL/ELT tools on the market; however, some companies choose to build their own customized tools. The disadvantage of that approach, of course, is that it takes a lot of internal resources and expertise that many organizations don’t have. However, some find that building in-house is the best way to make sure the organization’s unique needs are fully met.
2Consider whether you need Big Data capabilities
As we discussed before, Big Data is becoming increasingly intertwined with business intelligence. When businesses make decisions based on information, they want to look at all relevant data that is available. That may include Big Data from sources such as social media, websites and others. If you need to analyze Big Data, it may be worth investing in systems that can handle both Big Data and large amounts of traditional, structured data.
3Decide how many warehouses you need
Since a data warehouse is supposed to contain all of the data necessary to perform business intelligence, typically a company will use one data warehouse for the entire organization. However, in some organizations, different groups might only need access to certain data. Data marts may be a solution, but in some cases, it may be worth it to have multiple data warehouses.
IBM recommends some companies use a strategy involving an enterprise data warehouse, containing all data from the entire organization, along with several customized data warehouses designed for specific business units. This multi-tiered structure can allow companies to best meet the individual needs of different stakeholders while still offering one complete source of all enterprise data.