June 1999
By Syam Menon & Ramesh Sharda

SPSS Inc., developer of the world's top-selling software for desktop analysis, has staked out a major claim in the growing field of data mining. Earlier this year SPSS acquired U.K.-based Integral Solutions Ltd. (ISL), a full-service data mining company, further cementing its commitment to data mining [5]. Meanwhile, SAS Institute launched the SAS/ACCESS interface for SAP AG's enterprise resource planning (ERP) package, R/3. This will make ERP data accessible to analysis by a variety of tools, including data mining. Commenting on the release of SAS/ACCESS, Allen Watson of the SAS Institute notes that business professionals are "looking for an easy way to turn data into usable information" [15].

Business professionals, it turns out, aren't the only ones digging around for useful nuggets of information in the largely unexplored mountains of data that dot today's computer-driven corporate landscape. Dozens of state revenue departments, for example, report substantial increases in voluntary filing compliance and collected revenue, along with improved relationships with taxpayers, thanks to their deployment of data mining and other advanced processes [6].

What is data mining? How does it work? Where has it worked? How can we, as operations researchers, benefit in ways other than avoiding the tax collector's proverbial knock on our doors? The rest of this brief article will try to answer some of these questions.

What is Data Mining?

The cost of storing and processing data has decreased dramatically in the recent past and, as a result, the amount of data stored in electronic form has grown at an explosive rate. A case in point: Wal-Mart. The retail giant recently installed an NCR WorldMark 5100M "massively parallel processing server" and upgraded a second NCR 5100M [2]. Together, they take Wal-Mart's data warehouse from 7.5 terabytes to more than 24 terabytes.
The system collects and analyzes item information from approximately 2,900 stores to track buying trends department-by-department, shelf-by-shelf, item-by-item. It handles more than 30 applications and some 50,000 queries per week.

With the creation of large databases came the possibility of analyzing the data stored in them. The term "data mining" was originally used to describe the process through which previously undiscovered patterns in data were identified. This definition has since been stretched beyond these limits to include most forms of data analysis. As a consequence, the data mining label has often been used to add sales value to almost any type of data analysis tool. To some extent, this emphasizes the increasing need to discern useful information from the large quantities of data stored in these gigantic databases.

Though the term data mining is relatively new, the ideas behind it are not. Many of the techniques used in data mining have their roots in traditional statistical analysis and artificial intelligence work from the 1980s. Why, then, has it suddenly gained the attention of the business world? IBM has identified six factors behind this sudden rise in popularity [13]:
On the commercial side, perhaps the most common usage of data mining has been in the finance, retail and health care sectors. Data mining is used to reduce fraudulent behavior, especially in insurance claims and credit card use [3]; to identify buying patterns of customers [11]; to reclaim profitable customers [12]; to identify trading rules from historical data; and to aid in market basket analysis. Data mining is already widely used to better target clients, and with the development of electronic commerce, this can only become more important with time. Dragon Systems Inc. recently demonstrated technology that combines speech recognition and search capabilities to pull information from audiotapes of customer service calls [16].

How Does Data Mining Work?

Most of the general ideas applicable to modeling of any kind hold true for data mining as well. In order to work effectively, data mining requires clearly stated objectives and evaluation criteria. Data might need to be collected, consolidated and cleaned. Once a model has been built, it might be worth testing it on a smaller scale before putting it to full use. Even after all this, it is essential to monitor the model, and to adapt it to unforeseen changes over time.

Data mining algorithms traditionally fall into one of four broad categories: classification, clustering, association and sequence discovery. Other data analysis tools such as regression and time series also find their way into practice, as does visualization.

Classification. Classification, or supervised induction, is perhaps the most common of all data mining activities. The objective of classification is to analyze the historical data stored in a database and to automatically generate a model that can predict future behavior. This induced model consists of generalizations over the records of a training data set, which help distinguish predefined classes.
The hope is that this model can then be used to predict the classes of other unclassified records. Common tools used for classification are neural networks, decision trees and if-then-else rules that need not have a tree structure.

Neural networks involve the development of mathematical structures with the ability to learn. They tend to be most effective where the number of variables involved is very large, and the relationships between them are complex and imprecise. They can easily be implemented in a parallel environment, with each node of the network doing its calculations on a different processor. There are disadvantages as well. It is usually very difficult to provide a good rationale for the predictions made by a neural network. Also, neural networks tend to need considerable training. Unfortunately, the time needed for training tends to increase as the volume of data increases, and in general, they cannot be trained on very large databases. These and other factors have limited their acceptability.

Decision trees (DTs) classify data into a finite number of classes, based on the values of the variables. DTs consist essentially of a hierarchy of if-then statements and are thus significantly faster than neural nets. They are most appropriate for categorical and interval data; incorporating continuous variables into a decision tree framework can be difficult. A related classification tool is rule induction; unlike in a decision tree, the if-then statements used here need not be hierarchical.

Clustering. Clustering partitions the database into segments in which each segment member shares similar qualities. Some of the ideas used for classification, such as neural networks, pertain in part to situations involving clustering. However, unlike in classification, the clusters are unknown when the algorithm starts.
Consequently, before the results of clustering techniques are put to actual use, it might be necessary for an expert to interpret, and potentially modify, the suggested clusters. Once reasonable clusters have been identified, they could be used to classify new data. Not surprisingly, clustering techniques include optimization; we want to create groups that have maximum similarity among members within each group, and minimum similarity among members across the groups.

Association. Associations establish relationships among items that occur together in a given record [17]. Gathering of data has been drastically simplified as a result of scanners, and determining associations among items that sell together can be of substantial benefit to the retailer. This is often termed market basket analysis, because one of the primary applications of this technique is in the analysis of sales transactions.

Sequence Discovery. Sequence discovery can be looked at as the identification of associations over time. When appropriate information is available (for instance, the identity of a customer in a retail shop), a temporal analysis can be conducted to identify behavior over time. Some sequence discovery techniques keep track of elapsed time between associated events and the frequency of occurring sequences. This provides considerably more information, which could be used to increase sales or to detect fraud.

Visualization. The insights to be gained from visualizing the data cannot be over-emphasized. This holds true for most data analysis techniques, but is of special relevance to data mining. Given the sheer volume of data in the databases being considered, visualization in general is a difficult endeavor. However, it can be used in conjunction with data mining to gain a clearer understanding of many underlying relationships.
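
As a minimal sketch of the market basket analysis described under Association above, the Python fragment below counts how often pairs of items appear in the same transaction and reports simple "if A then B" rules. Everything in it is an assumption made for illustration: the transactions are invented, the function name pair_rules and its thresholds are hypothetical, and the two measures it uses (support, the share of baskets containing both items, and confidence, the share of baskets with A that also contain B) are common in the association rule literature but are not named in this article. Commercial packages rely on far more scalable algorithms than this brute-force count.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sales transactions, invented for illustration only.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]


def pair_rules(transactions, min_support=0.4, min_confidence=0.7):
    """Return (antecedent, consequent, support, confidence) for item pairs.

    support    = fraction of baskets that contain both items
    confidence = fraction of baskets with the antecedent that also
                 contain the consequent
    """
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in transactions:
        item_counts.update(basket)
        # Sort so that each unordered pair is always counted under one key.
        pair_counts.update(combinations(sorted(basket), 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        # Each frequent pair can yield a rule in either direction.
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    # Strongest rules first; ties broken alphabetically for a stable order.
    return sorted(rules, key=lambda r: (-r[3], r[0], r[1]))


if __name__ == "__main__":
    for lhs, rhs, support, confidence in pair_rules(transactions):
        print(f"if {lhs} then {rhs}: support={support:.2f}, confidence={confidence:.2f}")
```

On this toy data, beer appears in three of the five baskets and diapers appear in all three of them, so the rule "if beer then diapers" has support 0.6 and confidence 1.0. Whether such a rule is useful to a retailer is a separate judgment that the counts alone cannot make.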
Software Tools for Data Mining

Techniques for data mining come from different analytical areas, and to a large extent, data mining is a medley of dissimilar, independent solution techniques. For instance, in addition to the techniques already mentioned, data mining uses logistic regression, genetic algorithms, fuzzy logic and other tools to analyze data when appropriate. Most data mining packages have only a select few of these techniques incorporated in them. As a result, packages vary considerably from each other in what they have to offer. Some companies specialize in a subset of the four main categories mentioned in the previous section. Others provide some tools belonging to each group.

When selecting a data mining package, it is important to make sure that the tools most appropriate for tackling your specific situation (both present and future) are included in it. The more complicated the problem being tackled, the wider the array of tools you should have access to. There are various ways for data mining packages to tackle complex problems [17]. For instance, they could provide some way to use a combination of the general model types presented earlier. They could also provide a variety of heuristics that tackle the same problem, as the solutions recommended by each could be quite different. More sophisticated validation methods could improve accuracy. A good data mining package should also provide the user the ability to suggest columns to use for analysis when redundancy in the data could hamper the ability of the algorithm to find a good model. Another convenient feature that can help reduce the complexity of the problem is a good visualization tool.

It is important to understand the effectiveness of the tool in dealing with large amounts of data. Does the package have any tools that work better in a multi-processor environment? If so, what kind of parallel processors does the tool support?
Can the product be linked with other tools such as an online analytical processing (OLAP) or an ERP package? Does it provide good application programming interface (API) functions? How good is the graphical user interface (GUI)? Can it tackle the difficulties presented by legacy systems?

A recent development which could have a profound impact on the users (and, as a result, the developers) of data mining software concerns privacy. Privacy concerning Internet activity has been a serious issue for some time now. Data mining is beginning to play a role online, especially with regard to online advertising. Jason Catlett, a former data mining researcher at Bell Labs, believes that data miners will have to live with legislation about their information, and that a lot of the industry is very unfamiliar with that issue [4]. One company, a major player in the electronic distribution of travel-related products and services, found this out the hard way last year when an article published by PC Week Online suggested that the firm was about to start selling the names and destinations of individual travelers [9].

Given the nature of data mining, this can be very slippery ground. The whole objective of data mining is to uncover concealed information, and even the user may not know precisely how the data will ultimately be used [4]. That notwithstanding, data mining packages that provide reliable analysis after accounting for privacy concerns are the ones that will do well in the marketplace.

Where Do We Fit In?

Data mining algorithms are a heterogeneous group, loosely tied together by the common goal of generating better information. Operations research is concerned with making the best use of available information. By selecting the appropriate definition of "information," it is clear that operations research can play a significant role on both sides of the data mining engine. Indeed, some data mining problems themselves can be posed as large-scale optimization problems [10].
Many of the objectives of data mining algorithms can be stated in a mathematical programming context [8, 1]. For instance, while classification problems can be looked at as density estimation problems, clustering problems have been dealt with in OR for quite a while [14]. Nonlinear programming solution techniques have been adapted for faster training in neural network applications. Scalability, the ability to deal with large amounts of data, is a difficult and important issue in data mining, and one where OR could play a significant role.

The lack of reliable data (or of the data itself) is a common problem faced by operations researchers trying to get a good model to work in the real world. This problem becomes more acute when data needs to be deciphered from terabytes of stored information. Data mining tools make accessing and processing the data easier, and may provide more reliable data to the OR modeler.

The new patterns discovered could allow operations researchers to alter existing models in such a way as to improve the bottom line. For instance, newly discovered associations between the sales of different products could be used as part of an improved marketing strategy. This would then enable a store to come up with a more efficient inventory holding policy as well. Data mining could also be used to detect behavioral patterns under different states of nature, and this information can then be used to create scenario-based optimization models. Based on the results of data mining, better procedures to schedule personnel, delivery, production, etc., could be developed.

It has been argued that ERP will enable OR specialists to have better access to reliable data. Linking data mining with ERP could unearth relationships among factors most pertinent to a problem, thereby increasing the scope for building powerful, yet efficient models that could potentially enable real-time decision-making.
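
To make the link between clustering and optimization noted in this section more concrete, here is a rough Python sketch that treats clustering as the minimization of total squared distance between each record and the center of its group. It uses Lloyd's heuristic for the k-means problem, a technique chosen here purely for illustration and not one the article names; the function names, the six two-dimensional points, and the naive choice of the first k points as starting centers are likewise assumptions. Like most such heuristics, it finds a local optimum rather than a guaranteed global one.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, max_iter=100):
    """Lloyd's heuristic: alternate assignment and center update.

    Returns (assignment, centers, cost), where cost is the total squared
    distance from each point to its assigned center, i.e. the objective
    being (locally) minimized.
    """
    # Assumption: start from the first k points; real tools use better seeds.
    centers = [tuple(p) for p in points[:k]]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        new_assignment = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed group, so the heuristic has converged
        assignment = new_assignment
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old center if a group becomes empty
                centers[j] = centroid(members)
    cost = sum(squared_distance(p, centers[a]) for p, a in zip(points, assignment))
    return assignment, centers, cost


# Hypothetical records with two numeric attributes, invented for illustration.
points = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 9.5), (1.0, 0.5), (8.5, 9.0)]

if __name__ == "__main__":
    assignment, centers, cost = k_means(points, k=2)
    print("assignment:", assignment)
    print("centers:", centers)
    print("within-group squared distance:", round(cost, 4))
```

The returned cost is the quantity an exact mathematical programming formulation would minimize over all possible partitions; the heuristic merely improves it step by step, which is one place where OR solution techniques could contribute.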
Conclusions

By detecting patterns hitherto unknown, data mining techniques could suggest new modes to pursue old objectives. They could even allow the formulation of better, more sophisticated models in the wake of new information. In general, the gains to be made from exploiting newly discovered information are significantly higher than the marginal improvements that can be made by improving existing solution procedures. Alliances between data mining and ERP companies should reduce the extent of data issues to be tackled, and the quality of the data mining results should improve considerably, enabling an OR/MS professional to provide better decision support.

References
Syam Menon is an assistant professor of Management Science and Information Systems at Oklahoma State University. Ramesh Sharda is Regents Professor of Management Science and Information Systems and Conoco/DuPont Chair of Management of Technology at Oklahoma State University. They can be reached via e-mail at smenon@mstm.okstate.edu and sharda@okstate.edu, respectively.

OR/MS Today copyright © 1999 by the Institute for Operations Research and the Management Sciences. All rights reserved.