VOLUME 1, NUMBER 1 | SPRING 1998

By Peter Fiore

"Data warehousing" is one of the hottest terms in business today. It is nearly impossible to open an industry magazine without laying eyes on an article covering the latest and greatest data warehouse solution. Despite the hype, many business leaders still have a vital question left unanswered: Why does my company need a data warehouse? The fact is, there are many reasons to incorporate a data warehouse solution:
The bottom line: Inadequate management of massive volumes of data makes it impossible to convert sales and business data into useful information for decision makers.

The Data Warehouse Response

The high-tech industry's response to this crisis is the data warehouse, a specialized database that provides integrated, relevant and consistent information about the business, modeled for use by decision support and other business analysis software products. A data warehouse can be many things -- big, small, central, distributed. Small or distributed data warehouses are sometimes known as data marts, but all fall under the larger heading of data warehouses.

If you are building a data warehouse, you are probably doing so as part of a strategic IT initiative to improve the flow of business information to users of decision support software within your company. Business users, who access data for analysis, reporting and projections, require data that is consistent and organized. The challenge is to meet these requirements, starting with "dirty" and inconsistent data from multiple data sources, without impacting production system performance.

Your Business Information Infrastructure

Implementing a successful data warehouse takes more than buying decision support software and a data warehouse database. These products rely on a framework for gathering data, organizing it into meaningful information on an ongoing basis and delivering it for evaluation in the context of the business. This framework is called a business information infrastructure (BII) and is sometimes referred to as the data warehousing process. The data warehousing process has six essential stages:
Data Extraction and Cleansing

Data extraction is the process of collecting data for the warehouse from operational databases, archives and external data sources. Whatever method is employed, this step must run on a tight schedule in order to minimize system down time: operational databases must be taken off-line during extraction to avoid data inconsistencies, yet they are needed on-line for production. In practical systems, a variety of extraction techniques are employed:
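One common extraction technique is incremental extraction: rather than copying an entire operational table during the off-line window, pull only the rows changed since the last run. The sketch below illustrates the idea with an in-memory database; the table and column names are assumptions for illustration, not details from the article.

```python
import sqlite3

# Hypothetical operational table; schema and names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, qty INTEGER, updated_at TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, 5, "1998-03-01"), (2, 3, "1998-03-02"), (3, 7, "1998-03-03")])

def extract_incremental(conn, last_extract_date):
    """Pull only rows changed since the previous extraction window."""
    cur = conn.execute(
        "SELECT id, qty, updated_at FROM orders WHERE updated_at > ?",
        (last_extract_date,))
    return cur.fetchall()

# Only the two orders updated after the last extraction are copied.
rows = extract_incremental(conn, "1998-03-01")
```

Because only changed rows cross the wire, the operational database's off-line window stays short, which is exactly the scheduling constraint described above.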
Data cleansing is the process by which invalid records are filtered out or repaired on their way to the data warehouse. Operational data is notoriously dirty for many reasons -- negative inventory values, misspelled names and missing fields are common. Businesses that purchase commercial data, such as competitive performance information or market surveys, will need to extract data from whatever format the external source prescribes. Another interesting, if subtle, source of data "dirtiness" is organizational change over time. Remember, the data warehouse builds up a historical view of the business. This is difficult when reorganization changes geographical districts and department names, businesses are acquired and divested, product codes are reused or changed, and so on. The extraction and cleansing process must deal with this dirty data, either automatically or through human intervention.

Data Transformation and Integration

For analysis and reporting, data must be in the form of a dimensional schema. Transforming data from a fully normalized operational schema into a dimensional schema for users of business analysis tools is essential to the data warehousing process. This process is made more difficult when data comes from different sources, because it is less likely to be stored or even named in a comparable fashion. Coding, decoding, adding descriptive information and adjusting representations (for example, adding or removing separator characters in account numbers and check digits) all need to be done. To complicate matters, data from different sources may be stored at different levels of granularity. For example, unit sales in one data source may be stored by day, while in another they may be stored by week. Sometimes data for analysis must be synthesized from data that has been stored for operational purposes. For example, operational data stores often provide level information (e.g.
quantity in stock) when analysis requires event information (e.g. units shipped or received). This information can be synthesized by comparing current operational data with previous snapshots.

In addition to these logical constraints, the transformation process is time-bound. Because data extraction must occur while the source database is off-line for production, the speed of query execution is important. If front-end tool users frequently require data aggregated along certain dimensions, such as into quarters from weeks, it may be worthwhile to pre-compute and store some of these aggregations during the data transformation process.

Data Loading

Analysis and research involve long-running and repeated queries, and the validity of conclusions depends on the underlying data not changing. For this reason, loading the data warehouse must take place while the warehouse database is off-line. During loading, new facts are added to fact tables, old facts may need to be archived and new pre-computed aggregations need to replace previous values. The speed of analysis queries depends critically on fast indexing and on query optimization that uses the indices.
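The cleansing step described earlier -- filtering out invalid records or repairing them automatically, with the rest routed to human review -- can be sketched as a simple rule-based pass. The rules and field names below are assumptions for illustration, not details from the article.

```python
# Illustrative dirty records: negative inventory, a misspelled region
# name, and a missing key field (all examples of "dirty" operational data).
RECORDS = [
    {"sku": "A100", "qty": -4, "region": "Nort"},
    {"sku": "A101", "qty": 12, "region": "North"},
    {"sku": None,   "qty": 3,  "region": "South"},
]

REGION_FIXES = {"Nort": "North"}  # lookup-based automatic repair

def cleanse(records):
    """Repair what can be repaired automatically; reject the rest."""
    clean, rejected = [], []
    for rec in records:
        if rec["sku"] is None:
            rejected.append(rec)          # route to human intervention
            continue
        rec = dict(rec)                   # avoid mutating the source record
        rec["qty"] = max(rec["qty"], 0)   # repair negative inventory values
        rec["region"] = REGION_FIXES.get(rec["region"], rec["region"])
        clean.append(rec)
    return clean, rejected

clean, rejected = cleanse(RECORDS)
```

In a production pipeline the rejected records would land in a suspense file for review rather than silently disappearing, so that no business data is lost on the way to the warehouse.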
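The level-to-event synthesis described above -- deriving units shipped or received by comparing a current stock level with a previous snapshot -- reduces to a per-item difference. The snapshot values and SKU names here are invented for illustration.

```python
# Hypothetical stock-level snapshots taken during two extraction windows.
previous = {"A100": 50, "A101": 20}   # quantity in stock at last snapshot
current  = {"A100": 35, "A101": 28}   # quantity in stock now

def level_to_events(prev, curr):
    """Negative delta = net units shipped; positive = net units received."""
    return {sku: curr[sku] - prev.get(sku, 0) for sku in curr}

events = level_to_events(previous, current)
```

Note that the difference captures only the *net* movement between snapshots; if finer event detail is required, the snapshots must be taken more frequently.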
Peter Fiore is executive vice president/general manager, Data Warehouse Division, of VMARK Software.

© Copyright 2020 by Lionheart Publishing, Inc. All rights reserved. Lionheart Publishing, Inc., 2555 Cumberland Parkway, Suite 299, Atlanta, GA 30339 USA. Web: www.lionheartpub.com