VOLUME 1, NUMBER 1 | SPRING 1998

Data Warehousing

By Peter Fiore

"Data warehousing" is one of the hottest terms in business today. It is nearly impossible to open an industry magazine without laying eyes on an article covering the latest and greatest data warehouse solution. Despite the hype, many business leaders still have a vital question left unanswered: Why does my company need a data warehouse? The fact is there are many reasons to incorporate a data warehouse solution:
  • Modern sales, production and warehousing systems generate an overwhelming amount of data, but turning that data into useful business information can be a daunting task.
  • An organization may be using several dissimilar business information systems left over from earlier integration efforts or brought on board through acquisitions. Data from heterogeneous systems can make accurate analysis and reporting nearly impossible.
  • Departments within a company may be using their own methods for generating reports and analysis, leaving executive managers in the difficult position of having to reconcile each department's "version of the truth."
  • Organizations that fail to initiate an adequate data management solution will soon find themselves trailing behind competitors that do. Without timely and accurate decision support information, corporations cannot expect to thrive in today's markets.

The bottom line: Inadequate management of massive volumes of data makes it impossible to convert sales and business data into useful information for decision makers.

The Data Warehouse Response
The high-tech industry's response to this crisis is the data warehouse, a specialized database that provides integrated, relevant and consistent information about the business, modeled for use by decision support and other business analysis software products. A data warehouse can be many things -- big, small, central, distributed. Small or distributed data warehouses are sometimes known as data marts, but all fall under the larger heading of data warehouses.

If you are building a data warehouse, you are probably doing so as part of a strategic IT initiative to improve the flow of business information to users of decision support software within your company. Business users, who access data for analysis, reporting and projections, require data that is consistent and organized. The challenge is to meet these requirements, starting with "dirty" and inconsistent data from multiple data sources without impacting production system performance.

Your Business Information Infrastructure
Implementing a successful data warehouse involves more than buying decision support software and a data warehouse database. These products rely on a framework for gathering data, organizing it into meaningful information on an on-going basis, and delivering it for evaluation in the context of the business. This framework is called a business information infrastructure (BII) and is sometimes referred to as the data warehousing process. The data warehousing process has six essential stages:
  • Modeling data for the data warehouse.
  • Extracting data from source databases.
  • Cleansing data.
  • Transforming data into the warehouse model.
  • Integrating data with data from other sources.
  • Loading the data into the data warehouse database.
With the exception of data modeling, these activities must occur on an on-going basis. Data modeling is a relatively infrequent activity, but will occur when changes to the warehouse data model or supporting BII are required.
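
To make the process concrete, the sketch below strings the stages together as a nightly batch job. It is an illustration of the flow only, written in Python with invented function names and record layouts, not a description of any particular product.

    def extract(source):
        """Stage 2: collect raw records from an operational system."""
        return list(source)

    def cleanse(records):
        """Stage 3: filter out records that fail basic validity checks."""
        return [r for r in records if r.get("qty", 0) >= 0 and r.get("product_id")]

    def transform(records):
        """Stage 4: reshape operational records into the warehouse's dimensional model."""
        return [{"product_key": r["product_id"], "units": r["qty"], "day": r["date"]}
                for r in records]

    def integrate(batches):
        """Stage 5: combine facts drawn from several sources into one consistent set."""
        return [fact for batch in batches for fact in batch]

    def load(facts, warehouse):
        """Stage 6: append the new facts to the warehouse fact table."""
        warehouse.extend(facts)

    # Stage 1, data modeling, happens up front and recurs only when the model changes.
    warehouse = []
    sources = [[{"product_id": "A1", "qty": 5, "date": "1998-03-02"}]]
    load(integrate([transform(cleanse(extract(s))) for s in sources]), warehouse)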

Data Extraction and Cleansing
Data extraction is the process of collecting data for the warehouse from operational databases, archives and external data sources. Whatever methods are employed, this step must run on a tight schedule to minimize system down time, because operational databases must be taken off-line from production during extraction to avoid data inconsistencies.

In practical systems, a variety of extraction techniques are employed:
  • Live queries on operational or mirrored data.
  • Reading database table dump tapes.
  • Interpreting transaction logs.
  • Writing custom programs to extract data from legacy application files.
Because of the large size of many operational databases and the relatively short periods of time available for data extraction, little transformation of the data occurs during this process. A feature known as changed data capture can dramatically reduce the time needed for data extraction by moving only data that has changed since the last run. In this case, a program is written to compare the current data with an image of the data taken at the previous extraction.
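
The sketch below, again with invented record layouts, shows the snapshot-comparison idea behind changed data capture: only rows that are new or different from the previous image are passed on.

    def changed_rows(previous, current, key="order_id"):
        """Keep only rows that are new or modified relative to the prior image."""
        prior = {row[key]: row for row in previous}
        return [row for row in current
                if prior.get(row[key]) is None or prior[row[key]] != row]

    yesterday = [{"order_id": 1, "qty": 3}, {"order_id": 2, "qty": 7}]
    today = [{"order_id": 1, "qty": 3}, {"order_id": 2, "qty": 9}, {"order_id": 3, "qty": 1}]
    print(changed_rows(yesterday, today))   # only orders 2 and 3 need to move to the warehouse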

Data cleansing is the process by which invalid records are filtered out or repaired on their way to the data warehouse. Operational data is notoriously dirty for many reasons -- negative inventory values, misspelled names and missing fields are common. Businesses that purchase commercial data, such as competitive performance information or market surveys, will need to extract data from whatever format the external source prescribes.

Another interesting, if subtle, source of data "dirtiness" is organizational change over time. Remember, the data warehouse builds up a historical view of the business. This can be difficult when reorganization changes geographical districts and department names, businesses are acquired and divested, product codes are reused or changed, and so on. The extraction and cleansing process must deal with this dirty data, either automatically or through human intervention.
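
A cleansing pass might look something like the sketch below: records that can be repaired (a known misspelling, a recoded district) are fixed in place, while records that cannot (negative stock, missing fields) are set aside for review. The lookup tables and field names are invented examples.

    NAME_FIXES = {"Jonh Smith": "John Smith"}                  # known misspellings
    DISTRICT_RECODES = {"South-East (pre-1997)": "Southern"}   # reorganizations over time

    def cleanse(record):
        """Return a repaired record, or None to route it to manual review."""
        if record.get("qty_in_stock") is None or record["qty_in_stock"] < 0:
            return None                                        # invalid level data
        if not record.get("customer"):
            return None                                        # required field missing
        record["customer"] = NAME_FIXES.get(record["customer"], record["customer"])
        record["district"] = DISTRICT_RECODES.get(record.get("district"), record.get("district"))
        return record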

Data Transformation and Integration
For analysis and reporting, data must be in the form of a dimensional schema. Transforming data from a fully normalized operational schema into a dimensional schema for users of business analysis tools is essential to the data warehousing process. This process is made more difficult when data comes from different sources because it is less likely to be stored or even named in a comparable fashion.

Coding, decoding, adding descriptive information and adjusting representations (for example, adding or removing separator characters and check digits in account numbers) all need to be done. To complicate matters, data from different sources may be stored at different levels of granularity. For example, unit sales in one data source may be stored by day, while in another they may be stored by week.
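
Two of these chores are sketched below with invented formats: stripping separator characters from account numbers so they compare consistently, and rolling daily unit sales up to weekly so both sources land in the warehouse at the same grain.

    from collections import defaultdict
    from datetime import date

    def normalize_account(acct):
        """'12-3456-789' and '123456789' should identify the same account."""
        return acct.replace("-", "").replace(" ", "")

    def daily_to_weekly(daily_sales):
        """daily_sales: list of (date, units) pairs -> units keyed by ISO (year, week)."""
        weekly = defaultdict(int)
        for day, units in daily_sales:
            year, week, _ = day.isocalendar()
            weekly[(year, week)] += units
        return dict(weekly)

    print(normalize_account("12-3456-789"))                                   # '123456789'
    print(daily_to_weekly([(date(1998, 3, 2), 10), (date(1998, 3, 4), 4)]))   # {(1998, 10): 14}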

Sometimes data for analysis must be synthesized from data that has been stored for operational purposes. For example, operational data stores often provide level information (e.g. quantity in stock) when analysis requires event information (e.g. units shipped or received). This information can be synthesized by comparing current operational data with previous snapshots.
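
For instance, the net units shipped or received per product can be derived by differencing two consecutive quantity-in-stock snapshots, roughly as in the sketch below (product codes are invented):

    def stock_movements(previous_snapshot, current_snapshot):
        """Net change per product: positive means received, negative means shipped."""
        return {product: qty - previous_snapshot.get(product, 0)
                for product, qty in current_snapshot.items()}

    yesterday = {"A1": 120, "B7": 40}
    today = {"A1": 95, "B7": 55}
    print(stock_movements(yesterday, today))   # {'A1': -25, 'B7': 15}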

In addition to these logical constraints, the transformation process is time-bound: extraction and transformation must finish within the window in which the source database can be taken off-line from production. The speed of analysis queries matters as well. If front-end tool users frequently require data aggregated along certain dimensions, such as quarters rolled up from weeks, it may be worthwhile to pre-compute and store some of these aggregations during the data transformation process.
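
A pre-aggregation step during transformation might look like this sketch, which rolls the weekly figures from the earlier example up into quarters. The simple 13-week quarter boundaries are an assumption for illustration.

    from collections import defaultdict

    def weekly_to_quarterly(weekly_units):
        """weekly_units: {(year, week): units} -> {(year, quarter): units}."""
        quarterly = defaultdict(int)
        for (year, week), units in weekly_units.items():
            quarter = min((week - 1) // 13 + 1, 4)   # crude 13-week quarters
            quarterly[(year, quarter)] += units
        return dict(quarterly)

    print(weekly_to_quarterly({(1998, 10): 14, (1998, 15): 20}))   # {(1998, 1): 14, (1998, 2): 20}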

Data Loading
Analysis and research involve long-running and repeated queries, and the validity of conclusions depends on the underlying data not changing mid-analysis. For this reason, loading the data warehouse must take place while the warehouse database is off-line. During loading, new facts are added to fact tables, old facts may need to be archived, and new pre-computed aggregations replace previous values. The speed of analysis queries depends critically on fast indexing and on query optimization that uses those indices.
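
In outline, a load step might look like the sketch below, run while the warehouse is closed to queries: age out old facts, append the new batch, swap in fresh aggregates and rebuild a simple index. The data structures and retention cutoff are invented for illustration.

    def load_warehouse(warehouse, new_facts, new_aggregates, cutoff):
        """One off-line load cycle for a toy warehouse held in a dictionary."""
        facts = warehouse["facts"]
        warehouse["archive"].extend(f for f in facts if f["day"] < cutoff)   # archive old facts
        facts[:] = [f for f in facts if f["day"] >= cutoff]
        facts.extend(new_facts)                                              # append tonight's batch
        warehouse["aggregates"] = new_aggregates                             # replace prior roll-ups
        index = {}                                                           # rebuild a simple index by day
        for position, fact in enumerate(facts):
            index.setdefault(fact["day"], []).append(position)
        warehouse["index"] = index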

Case Study:
EuroTunnel Back on Fast Track
EuroTunnel, a company that transports thousands of vehicles and passengers through the Channel Tunnel each day, was recently faced with an information crisis. The conflicting needs of its marketing and sales departments strained its inadequate data management tools, and its IS department had to deploy a solution quickly.

Initially, EuroTunnel, based in the United Kingdom and France, employed only the Travellog booking system for its Le Shuttle ticketing and marketing operations. As tunnel usage grew, Le Shuttle's sales staff was soon processing more than 5,000 sales daily. At the same time, EuroTunnel's marketing department was accessing the Travellog system to generate analysis reports and forecast sales.

The conflict between demand for marketing analysis data and online transactions like ticket sales caused an information gridlock. As large marketing inquiries were running, ticket sales and other online transactions were delayed. The bottom line: EuroTunnel needed to separate operational data from analysis data.

EuroTunnel's IS team decided to build a data warehouse, migrating more than half a million records from the Travellog system and its supporting databases into the new warehouse. The Le Shuttle data warehouse was up and running in 12 weeks.

Today, the EuroTunnel data warehouse delivers transaction information from the OLTP system to the data warehouse via unattended, overnight operation. When the marketing analysts arrive in the morning, all the data is waiting for them. With this solution, they can perform all of the analysis queries they want without impacting the performance of the booking system.

EuroTunnel's vision is to establish a number of diverse data marts over the next two years, eventually uniting them as a single large data warehouse -- with extraction and transformation tools continuing to play a pivotal role.

Peter Fiore is executive vice president/general manager, Data Warehouse Division, of VMARK Software.


