
February 1996 Volume 23 No. 1
Number Crunching: 1996 Statistics Survey
Explosive growth in computing power fuels increased use of methodology
By James J. Swain
Editor's note: The following article updates a similar
statistics software survey published in the October 1994 issue of OR/MS
Today. Data for the current survey was collected and compiled by OR/MS
Today Managing Editor David Greenfield from December 1995 to January
1996.
Ours is a quantitative, problem solving field in which statistical methods
are an important part of many projects. Statistics is used to summarize
data, detect and estimate relations among variables, and test hypotheses.
From the very beginning, operations researchers collected and analyzed data
using statistics to understand processes, to build and to validate their
models, and to develop appropriate inputs for use in optimization and simulation
models. Periodic surveys have consistently shown that data analysis is a
perennial activity among OR/MS professionals, and statistics has long been
an integral part of the curriculum. As a recent instance, the Committee
for the Review of the OR/MS Master's Degree Curriculum (OR/MS Today,
Feb 1993) included two semesters of probability and statistics in the proposed
curriculum. In fact, service courses for statistics are often offered by
operations research, management science, quantitative methods and industrial
engineering departments in which our other courses are taught.
The explosive growth of our field is largely paralleled by the increase
in use of statistical methodology, and both were substantially assisted
by the growth in computing power and the availability of software to perform
routine computation. Computers not only made computations easier, so that
such commonplace techniques as regression and analysis of variance (ANOVA)
could be conveniently performed, but computers have taken a major role in
the generation, collection and management of the data itself. In the earliest
days, data was collected and processed by hand or via punched cards. Data
can now be obtained by the computer from other stored sources: via sensing
equipment connected to the computer (e.g., hand-held bar code readers),
or collated from remote computers at a central source. Commercial operations
in transportation, telecommunications, marketing and retail may generate
tens of thousands of observations to draw upon. Availability of computing
power has also led to the increased use of simulation models and process
improvement tools, such as SPC, TQM, Taguchi methods, and the design of
experiments, and these have, in turn, further increased the need for statistical
analysis.
Statistical software to aid the OR/MS professional is widely available.
As this OR/MS Today survey of statistical software demonstrates,
there are many products for the PC, Macintosh and workstations in a range
of prices and capabilities. Product information for the survey has been
supplied by vendors from a list compiled from reader suggestions, advertisers
and prior surveys. While not an exhaustive list, these products are representative
of the wide range of choices available today. These programs permit analysts
to visually examine the data and pursue different approaches in analysis
through the use of on-screen graphics and interactive analysis. Typically,
these programs can construct histograms and other descriptive plots, and
perform basic statistical tasks such as tests on means, one- and two-way
tables, analysis of variance (ANOVA), and linear regression.
Given the large number of general purpose programs now available, is a statistics
package needed for basic analysis? Most OR/MS users will already have a
word processor, spreadsheet and communications program, and very likely
presentation and database software, plus special purpose software for simulation,
math programming and so on. For basic statistical analysis, spreadsheet
users may not even require a separate statistical analysis program. Spreadsheets
increasingly include graphical and statistical features, such as confidence intervals,
statistical tests, ANOVA, and linear regression. When more detailed
or specialized analysis is required, many statistical programs can import
data directly from the spreadsheet or copy the data through the "clipboard."
Improvements in programs and operating systems also mean that graphics from
statistical programs can readily be imported into word processing and presentation
software. Likewise, symbolic algebraic programs (such as Mathematica) and
numerical processing programs (such as Gauss and Matlab) have extensive
statistical capabilities, sometimes in the form of application modules.
While not always as easy to use as a statistical program, they generally
have more flexibility and often more versatile graphics available.
Features
In this article we take the view that a statistics program will be used
by analysts or students to supplement other activities. Like all software,
statistical software should be easy to install and use. Because data arises
from a variety of sources and is stored in varying formats, statistics programs
should be able to import data from as many formats as possible, and edit
and transform data once it is acquired. Good products should provide: a
variety of ways to display or view data, classical and some nonparametric
procedures, linear regression and ANOVA. Other features that are often useful
include sampling from distributions (for Monte Carlo sampling), statistical
process control, multivariate statistics, and design of experiments. The
vendor should have a development staff that includes statisticians as well
as programmers.
Statistical software is a mature field; for instance, the SPSS and BMDP
programs have roots that date back several decades. Improvements in operating
systems have meant that command line or batch processing has been replaced
by interactive interfaces. Using and learning these programs has never been
easier. Not only do many products offer on-line help and tutorials, but
many of the programs have readable documentation, and a number of the programs
are widely featured as illustrations in statistics texts (e.g., SAS and
Minitab) and in third party texts devoted to these products. Many of the
latter feature student versions of the software and data sets on diskettes
included with the text. The data sets are discussed in the text, and can
be accessed by the student as they use the software.
Formatting Woes
These days, data arises from many sources, such as databases, spreadsheets,
or CD-ROMs, and experimental data may be monitored and collected directly
by computer. Data sets can be very large and stored in various formats.
Having data that cannot be readily analyzed is extremely frustrating, so
it is important that these data can be imported into the analysis package
with a minimum of additional effort. Most statistical packages support ASCII
(or plain text) input and import from spreadsheet and database formats.
In addition, programs such as DBMS/COPY from Conceptual Software and Stat/Transfer
from Circle Consulting provide data conversion between formats used by various
statistical software programs, spreadsheets and databases. Once the data
is captured, it should be easy to edit and manipulate as part of the analysis.
The program should also be able to handle missing data.
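As a minimal sketch of the import step, the Python fragment below reads
plain-text (comma-separated) data and marks blank or "NA" fields as missing
so that later summaries can skip them. The column names, the sample values and
the missing-value codes are assumptions made for illustration; they are not
taken from any product in the survey.

```python
import csv
import io
import statistics

# Hypothetical ASCII data; blank and "NA" fields stand for missing values.
RAW = """batch,yield,temp
1,41.2,180
2,NA,185
3,39.8,
4,42.5,190
"""

MISSING = {"", "NA"}  # assumed missing-value codes


def read_columns(text):
    """Return {column name: list of floats, with None marking missing}."""
    reader = csv.DictReader(io.StringIO(text))
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name, field in row.items():
            field = field.strip()
            columns[name].append(None if field in MISSING else float(field))
    return columns


def mean_ignoring_missing(values):
    """Average only the observed (non-missing) values."""
    observed = [v for v in values if v is not None]
    return statistics.mean(observed)


data = read_columns(RAW)
print(data["yield"])                         # missing entry shown as None
print(mean_ignoring_missing(data["yield"]))  # mean of the three observed yields
```

Dropping missing values is only one possible policy; whether it is
appropriate depends on why the values are missing.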
While many people associate statistics with the procedures they learned
in their introductory courses, statistics is the search for meaning within
observations over time or condition or for relations between variables.
To aid in this search, the modern trend (as typified by Exploratory Data
Analysis or EDA) is to graphically examine data from a variety of quick
perspectives. The stem-and-leaf plot (a quick version of a histogram) and
box plot (or box-and-whisker plot) are suitable for a quick summary of
one or several variables. For instance, several different treatments
can be compared using side-by-side box plots. These plots would provide,
at a glance, the likely outcome of an ANOVA and allow a rough confirmation
that assumptions are being met.
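The numbers behind a side-by-side box plot can be sketched in a few lines of
Python. The example below computes the five-number summary (minimum, lower
quartile, median, upper quartile, maximum) for two treatments. The treatment
data are invented for illustration, and the quartile convention is the default
of Python's statistics.quantiles; statistical packages differ in how they
interpolate quartiles, so other software may report slightly different values.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the values a box plot displays."""
    ordered = sorted(values)
    q1, median, q3 = statistics.quantiles(ordered, n=4)
    return (ordered[0], q1, median, q3, ordered[-1])


# Hypothetical responses under two treatments.
treatments = {
    "A": [12, 15, 14, 10, 13, 16, 11],
    "B": [18, 21, 17, 20, 22, 19, 23],
}

for name, observations in treatments.items():
    print(name, five_number_summary(observations))
```

Here the two boxes would not overlap at all, which suggests what an ANOVA on
these invented data would conclude, while the similar spreads give a rough
check on the equal-variance assumption.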
For multivariate data, scatter plots and their higher dimensional equivalents
can aid the search for likely relationships. The famous Anscombe data sets
("Graphics in statistical analysis," American Statistician
27, 1973: pp. 17-21) illustrate the danger of relying on summary statistics
alone. These four data sets have nearly identical summary statistics, but widely differing
interpretations that are readily apparent from simple scatter plots. Likewise,
Tufte's "The Visual Display of Quantitative Information" has further
reinforced the power of graphical representations of data.
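The point can be checked numerically. The sketch below computes the means,
the least-squares line and the correlation for each of the four Anscombe data
sets, using the values as they are commonly reproduced from the 1973 paper
(verify them against the original before relying on them). The function name
summarize is an illustrative choice.

```python
import math

# Anscombe's four data sets, as commonly reproduced from the 1973 paper.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
ANSCOMBE = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                  6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                   6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
            5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Means, least-squares slope and intercept, and correlation of (x, y)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return {
        "mean_x": mean_x,
        "mean_y": mean_y,
        "slope": slope,
        "intercept": mean_y - slope * mean_x,
        "r": sxy / math.sqrt(sxx * syy),
    }


for label, (x, y) in ANSCOMBE.items():
    s = summarize(x, y)
    print(label, {key: round(value, 2) for key, value in s.items()})
```

To two decimal places the four summaries agree, yet a scatter plot shows a
linear cloud, a smooth curve, a line with one outlier, and a vertical stack of
points with a single influential observation.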
Good statistical software should include nonparametric procedures to augment
the standard techniques (t-tests and ANOVA, for instance) that are based
on parametric families such as the normal distribution. Nonparametric procedures
are often based upon variations of the sign test or on ranks. They generally
require fewer assumptions than parametric procedures and are often less
sensitive to misspecification of assumptions. Parametric assumptions can
be evaluated through the use of probability plots, which are useful diagnostic
tools.
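As an illustration of how little a nonparametric procedure assumes, here is a
minimal sketch of the two-sided sign test for paired differences. It uses only
the signs of the differences and an exact binomial tail probability. The
function name, the sample differences and the choice to drop zero differences
are assumptions for this example, not a description of any surveyed package.

```python
from math import comb


def sign_test(differences):
    """Exact two-sided sign test of H0: the median difference is zero.

    Zero differences carry no sign and are dropped (one common convention).
    Returns (number positive, number negative, p-value).
    """
    positives = sum(1 for d in differences if d > 0)
    negatives = sum(1 for d in differences if d < 0)
    n = positives + negatives
    if n == 0:
        raise ValueError("all differences are zero; the test is undefined")
    k = min(positives, negatives)
    # Under H0 each sign is + or - with probability 1/2, so the count of
    # the rarer sign follows a Binomial(n, 1/2) distribution.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return positives, negatives, min(1.0, 2 * tail)


# Hypothetical after-minus-before differences for eleven subjects.
diffs = [1.2, 0.4, -0.3, 2.1, 0.9, 0.0, 1.5, 0.7, -0.1, 1.1, 0.6]
print(sign_test(diffs))
```

With eight positive and two negative differences, the p-value is
2 x (1 + 10 + 45) / 1024, or about 0.11. No assumption about the shape of the
distribution of the differences is needed.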
One of the most common activities of data analysis is curve fitting by linear
regression. This is best known as the fitting of straight lines between
two variables (so-called simple linear regression), but through transformations,
polynomial and transcendental functions can be used, and more than one predictor
used. This is a tool of immense power, which can be used for empirical summarization,
as an approach for determining relations between variables in a process,
or as a method of eliminating trends which obscure an analysis, as Fisher
did when he regressed out fertility trends in the multiyear data taken at
Rothamsted to sharpen comparisons between methods of crop treatments. These
procedures were long limited by the burden of hand computation, and statistical
software has made regression available for common use. Software
not only makes it possible to perform these analyses, but graphical and
diagnostic statistics can be used to quickly and interactively guide model
building.
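The role of transformations can be illustrated with a short sketch. To fit an
exponential curve y = a * exp(b * x), take logarithms so that
log y = log a + b * x, fit a straight line by ordinary least squares, and
transform the intercept back. The function names and the synthetic,
noise-free data below are assumptions made for illustration.

```python
import math


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_exponential(x, y):
    """Fit y = a * exp(b * x) by regressing log(y) on x; y must be positive."""
    intercept, slope = fit_line(x, [math.log(v) for v in y])
    return math.exp(intercept), slope


# Synthetic data generated from a = 2, b = 0.5 with no noise.
xs = [0, 1, 2, 3, 4, 5]
ys = [2.0 * math.exp(0.5 * v) for v in xs]

a, b = fit_exponential(xs, ys)
print(round(a, 6), round(b, 6))
```

Fitting on the log scale minimizes squared errors in log y rather than in y,
so with noisy data the result can differ from a direct nonlinear fit.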
Other features
Statistical programs vary in the number of statistical procedures they contain.
For many users, the basic features described already should be sufficient
for most applications. Additional features that are useful include Monte
Carlo sampling, statistical process control (SPC), forecasting, and multivariate
statistics.
With the increasing interest in SPC, having those features broadens the
usefulness of the software, and students can certainly use the software
in more than one course. Likewise, software for performing time series (forecasting)
analysis is often useful.
Statistical software is increasingly including options for multivariate
statistics -- the study of relations among several variables or among different
attributes within records. For instance, a common marketing and demography
problem is to characterize subgroups from the general population. To target
advertising, the marketer needs to determine attributes, such as age, income,
geographic region, or interests associated with a particular product or
service. Multivariate clustering and discrimination methods are used for
this purpose. Graphics for making it possible to visualize multivariate
relations are also helpful.
Monte Carlo sampling is particularly valuable when software is being used
in conjunction with a statistics course, since it provides a way for students
to observe the range of variability that they will encounter in practice.
Monte Carlo sampling is also useful to test the sensitivity of a procedure
to assumptions about distributions or to build insight into how various
statistics might perform under different assumptions. The Resampling Statistics
program is particularly suited to this kind of analysis, either by resampling
from a particular set of data (also called bootstrapping) or from standard
statistical distributions.
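Bootstrapping itself takes only a few lines in a general-purpose language.
The sketch below is written in Python and is not the syntax of the Resampling
Statistics program. It resamples a data set with replacement and reads an
approximate percentile interval for the mean from the sorted resampled means.
The data, the number of resamples, the 90 percent level and the seed are
arbitrary choices for illustration.

```python
import random
import statistics


def bootstrap_mean_interval(data, n_resamples=2000, level=0.90, seed=1996):
    """Approximate percentile bootstrap interval for the mean of data."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    n = len(data)
    # Each resample draws n observations from the data with replacement.
    means = sorted(
        statistics.fmean(rng.choices(data, k=n)) for _ in range(n_resamples)
    )
    alpha = (1.0 - level) / 2.0
    lower = means[int(alpha * n_resamples)]
    upper = means[int((1.0 - alpha) * n_resamples) - 1]
    return lower, upper


# Hypothetical service times in minutes.
service_times = [3.1, 4.7, 2.2, 8.9, 5.4, 3.8, 6.1, 2.9, 7.3, 4.0, 5.0, 3.5]

low, high = bootstrap_mean_interval(service_times)
print(statistics.fmean(service_times), (low, high))
```

Replacing rng.choices(data, k=n) with draws from a standard distribution
(for example, rng.expovariate) turns the same loop into a Monte Carlo study of
how the sample mean behaves under that distribution.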
NOTE: A detailed listing of numerous statistical software packages
is printed in an easily cross-referenced table in the February issue of
OR/MS Today. If you are interested in obtaining a copy, contact Nora
Craver at Lionheart Publishing Inc. -- Phone: (770) 431-0867 ext. 201; Fax:
(770) 432-6969; E-mail: nora@lionhrtpub.com.
James J. Swain is associate professor of ISE at the University of Alabama
in Huntsville. His technical interests include applied statistics and simulation.

OR/MS Today copyright © 1997, 1998 by the Institute for Operations Research and the Management Sciences. All rights reserved.