OR/MS Today - October 2005
Software Review
WordStat 5.0
Content analysis software offers operations researchers a valuable tool for exploring vast, untapped amounts of textual data.
By Byung-Gak Son
Content analysis, according to Holsti (1969), is "any technique for making inferences by objectively and systematically identifying specified characteristics of message." Is content analysis relevant to O.R. professionals, who are more familiar with traditional analytical methods such as simulation or linear programming? The answer is yes, since most of the information a company has is textual data in the form of e-mails, documents, reports, etc. Typically such textual information is unstructured. Therefore, extracting meaningful information for decision-making from data of such nature can be quite time-consuming and difficult. Exploring such untapped textual data could complement existing O.R. tools for operational improvement. Many well-known companies use text analysis tools such as WordStat to assess how their products are perceived by the public or by clients. WordStat analyzes databases of customer feedback and e-mail messages sent to customers or technical support by looking at words that are closely associated with their products. Companies also try to identify different types of customers, their consumption habits, their needs, their complaints, etc. Another example of using content analysis appeared in an article by Sodhi and Son (2005) in the August 2005 issue of OR/MS Today. The authors did a basic content analysis to explore what kind of skills employers want from O.R. graduates. The analysis provides useful insights about the key skills employers want from O.R. graduates and provides the kind of quantitative output O.R. people are used to producing. Content analysis is new territory for O.R. professionals, but we can get help with a tool like WordStat 5.0 from Provalis Research. WordStat is an add-on module for the statistical analysis package SimStat that provides the statistical backend O.R. professionals would be quite comfortable with. 
According to Provalis, WordStat is specially designed to study textual information such as responses to open-ended questions, interviews, titles, journal articles, electronic communications, etc. In this review, I will focus on exploring the basic features and the potential of this software.
WordStat can perform univariate frequency analysis (keyword count and occurrence) and presents the results in matrix form (Figure 7). The phrase finder helps users identify recurring phrases and their counts. WordStat can also perform bivariate comparisons between any textual field (for example, the personal ads in the tutorial in the next section) and any nominal or ordinal variable (such as the gender or age group of the respondents). WordStat offers many association measures for assessing the relationship between keyword occurrence and nominal/ordinal variables, e.g., the difference in keyword occurrence between the personal ads placed by men and those placed by women.
Keyword-in-context (KWIC) is a useful feature that shows every occurrence of either a specific word or all words related to a category, within the actual text, arranged in a table format. It is handy when one needs to assess the consistency (or lack of consistency) of the meanings associated with a word (Figure 1).
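For readers who have not seen a keyword-in-context listing, the idea can be sketched in a few lines of Python. This is only an illustration of the general KWIC technique, not WordStat's implementation; the function name `kwic`, the `width` parameter and the sample ads are all made up for this example.

```python
import re

def kwic(texts, keyword, width=30):
    """Return (record index, left context, match, right context) tuples."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    rows = []
    for i, text in enumerate(texts):
        for m in pattern.finditer(text):
            left = text[max(0, m.start() - width):m.start()]
            right = text[m.end():m.end() + width]
            rows.append((i, left, m.group(0), right))
    return rows

# Invented sample ads, for illustration only.
ads = [
    "Active man seeks kind, active partner for hiking.",
    "Looking for someone who enjoys quiet evenings.",
]

# Print one row per occurrence, with the keyword aligned in a column.
for i, left, word, right in kwic(ads, "active"):
    print(f"{i:>3} | {left:>30} | {word} | {right}")
```

Lining up the keyword in one column is what makes it easy to scan the surrounding text and judge whether the word is being used in a consistent sense.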
In addition to the above features, WordStat provides various other features such as automated text classification, analysis of case or document similarity, etc. For details on these and other features, visit www.provalisresearch.com/wordstat/WordstatFeatures.html.
Step 1: Creating a data file. To create a data file, you can use the base program SimStat and enter data as in other statistics packages. I found SimStat a bit cumbersome for data entry and manipulation because of its rather different data entry interface. However, WordStat (via SimStat) can directly and smoothly import several types of data files, such as MS Access, MS Excel and dBase. It also has a number of tools that assist in importing data from plain text or word-processed files. For the textual field, you can simply copy and paste the text into the spreadsheet or database of your choice and import it into SimStat. Categorical and other variables related to the textual information, such as gender and age group, need to be coded by the user. For example, for the data file of our job ads analysis [2], we used MS Access to create our data set by copying and pasting the job ads from Monster.com and from the HTML files provided by OR/MS Today. We manually coded the industry and other fields for further analysis.
In this tutorial, I use the sample data file (SEEKING.DBF), which comes with the software. Once you open the file, you can see three variables: the nominal variable GENDER (1 = Men, 2 = Women), the ordinal variable AGEGROUP (1 = 18-24, 2 = 25-29, 3 = 30-39, 4 = 40+) and the text variable AD_TEXT (Figure 2). The variable AD_TEXT contains the text of 68 actual personal ads copied and pasted from newspapers; this variable is the focus of our analysis. The other two variables, GENDER and AGEGROUP, have been manually coded by browsing the personal ads.
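The layout of such a data file, one free-text field plus manually coded categorical fields, can be pictured with a small Python sketch. The code values below follow the coding described above; the class name `Ad` and the two sample rows are invented for illustration and are not taken from SEEKING.DBF.

```python
from dataclasses import dataclass

# Code labels as described for the sample file.
GENDER_LABELS = {1: "Men", 2: "Women"}
AGEGROUP_LABELS = {1: "18-24", 2: "25-29", 3: "30-39", 4: "40+"}

@dataclass
class Ad:
    gender: int    # nominal code, as in GENDER
    agegroup: int  # ordinal code, as in AGEGROUP
    ad_text: str   # free text, as in AD_TEXT

# Invented rows for illustration; the real file holds 68 newspaper ads.
ads = [
    Ad(1, 3, "Easygoing man seeks partner who loves the outdoors."),
    Ad(2, 2, "Professional woman looking for honest conversation."),
]

for ad in ads:
    print(GENDER_LABELS[ad.gender], AGEGROUP_LABELS[ad.agegroup], "|", ad.ad_text)
```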
Step 2: Select variables. Once you open the SEEKING.DBF file in SimStat, go to the STATISTICS menu and execute the CHOOSE X-Y command. Here, we need to move the variables to the appropriate locations. Let's first move the AD_TEXT variable to the DEPENDENT list box. Then the two categorical variables (GENDER and AGEGROUP) need to be placed in the INDEPENDENT list box (Figure 2). Note that this is similar to what one might do for an analysis of variance with quantitative data. Note also that so far we have been using SimStat, which is a statistical package.
Step 3: Run WordStat. Go to the STATISTICS menu and execute the CONTENT ANALYSIS command. A new window with six tabs pops up, and now we are ready to do the content analysis.
Step 4: Choose the proper dictionaries. The backbone of WordStat usage is a "dictionary." A dictionary is a specification of words and phrases under various named categories that allows WordStat to either exclude certain words from the analysis or, more to the point, create counts under each "category" when a word or phrase under that category is found in a record. WordStat allows users to choose, view and edit the dictionaries used for a specific content analysis. In this tutorial, we leave out two of the settings: 1) pre-processing for the custom transformation of text, and 2) "lemmatization," a process by which various forms of words are reduced to a more limited number of canonical forms, for example, transforming plural into singular. The third setting, "exclusion," is a dictionary that contains words to be removed during the analysis. For example, words with little semantic value such as pronouns, articles and conjunctions are automatically removed by the rules set by the exclusion dictionary. On the other hand, "categorization" allows one to specify words, word patterns and phrases to be included in the analysis (Figure 3).
All of these dictionaries can be edited in the program or with any text-editing tool (e.g., Notepad). For this tutorial, we select the default exclusion dictionary (DEFAULT.EXC) and a tailor-made categorization dictionary (SEEKING.CAT) that contains words and phrases that frequently appear in personal ads. Keywords can be arranged in a hierarchical manner so users can have different levels of analysis (Figure 4). The level-one category includes the major attributes partners may be looking for. Under the category "appearance," for example, one would find various words describing physical appearance.
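The interplay of the exclusion and categorization dictionaries can be sketched as follows. This is a minimal illustration of the idea, assuming simple single-word entries; the word lists are hypothetical and are not the contents of DEFAULT.EXC or SEEKING.CAT, and the function names are my own.

```python
import re
from collections import Counter

# Hypothetical miniature dictionaries, for illustration only.
EXCLUSION = {"a", "an", "the", "and", "or", "i", "who", "for", "is", "with"}
CATEGORIES = {
    "APPEARANCE": {"beautiful", "muscular", "tall"},
    "FINANCE": {"wealthy", "secure"},
}

def tokenize(text):
    """Lowercase the text and split it into words, dropping punctuation."""
    return re.findall(r"[a-z']+", text.lower())

def categorize(text):
    """Count category hits in one record; uncategorized words keep their own label."""
    word_to_cat = {w: cat for cat, words in CATEGORIES.items() for w in words}
    counts = Counter()
    for tok in tokenize(text):
        if tok in EXCLUSION:
            continue  # removed by the exclusion dictionary
        counts[word_to_cat.get(tok, tok)] += 1
    return counts

print(categorize("Tall and muscular man seeks a beautiful, wealthy partner"))
```

Here "tall," "muscular" and "beautiful" are all counted under APPEARANCE, while "and" and "a" are dropped before counting. A hierarchical dictionary could be approximated by nesting the category mapping one level deeper.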
You can download a large number of pre-made dictionaries from the Web page (http://www.provalisresearch.com/wordstat/RID.html), depending on the subject of interest. Most O.R. users will want to construct their own dictionaries from the raw data using WordStat. For this tutorial, the category dictionary SEEKING.CAT was given. Let us assume that we did not have it, so that we would have to create our own dictionary for this analysis. In this case, we could construct a categorization dictionary by running the frequency analysis of words and the phrase finder in WordStat to identify the ones most commonly used. On the basis of the results, we can construct our own category dictionary by selecting the most frequently occurring words and phrases (Figures 5 and 6). However, these two functions also extract irrelevant words and phrases such as "LEAVE A MESSAGE"; we need to go through the lists to single out and discard these words and phrases.
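The two functions used to build a dictionary from raw data, word frequency and phrase finding, can be approximated with simple counting. The sketch below uses plain n-gram counts as a stand-in for the phrase finder; WordStat's actual algorithm is not documented here, and the function names, parameters and sample ads are assumptions made for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into words, dropping punctuation."""
    return re.findall(r"[a-z']+", text.lower())

def word_frequencies(texts, stopwords=frozenset()):
    """Count every word across all records, skipping stopwords."""
    counts = Counter()
    for text in texts:
        counts.update(t for t in tokenize(text) if t not in stopwords)
    return counts

def phrase_frequencies(texts, n=3, min_count=2):
    """Count n-word sequences that occur at least min_count times.

    Punctuation is dropped by tokenize, so sequences may span sentence breaks.
    """
    counts = Counter()
    for text in texts:
        toks = tokenize(text)
        counts.update(" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1))
    return Counter({p: c for p, c in counts.items() if c >= min_count})

# Invented sample ads, for illustration only.
ads = [
    "Enjoys long walks. Leave a message.",
    "Loves long walks and movies. Leave a message.",
]
print(word_frequencies(ads, stopwords={"a", "and"}).most_common(3))
print(phrase_frequencies(ads, n=3))
```

In this toy data the only recurring three-word phrase is "leave a message," which is exactly the kind of irrelevant entry that still has to be weeded out by hand.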
Once you have chosen or created an appropriate dictionary, you can select advanced options. For this tutorial, we disabled all options.
Step 5: Perform a frequency analysis of the personal ads. Finally, we are ready to analyze the most important attributes of Mr. or Miss "Perfect" according to the personal ads. We click the third tab (Frequencies) to run the frequency analysis, that is, to count the occurrences of each word category. We found that words under the "appearance" category are the most frequently mentioned criteria in the personal ads. Indeed, 41 out of 68 ads contain words related to appearance (Figure 7). Note that the "appearance" category contains various words such as "beautiful" and "muscular" (Figure 4). The "finance" category, on the other hand, appeared the least. You can display other words that are not included in the category dictionary by changing the display option.
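A figure such as "41 out of 68 ads" (about 60 percent) is a case occurrence, meaning the number of records with at least one hit, which is different from the total number of keyword hits. The distinction can be sketched as below; the word list is hypothetical rather than the actual SEEKING.CAT entries, and the sample ads are invented.

```python
import re

# Hypothetical category word list, for illustration only.
APPEARANCE = {"beautiful", "muscular", "tall"}

def keyword_and_case_counts(texts, words):
    """Return (total keyword hits, number of records with at least one hit)."""
    total_hits = 0
    cases = 0
    for text in texts:
        hits = sum(1 for t in re.findall(r"[a-z']+", text.lower()) if t in words)
        total_hits += hits
        if hits > 0:
            cases += 1
    return total_hits, cases

# Invented sample ads, for illustration only.
ads = ["Tall, muscular man", "Kind woman who loves books", "Beautiful and tall"]
hits, cases = keyword_and_case_counts(ads, APPEARANCE)
print(f"{hits} keyword hits in {cases} of {len(ads)} ads ({cases / len(ads):.0%})")
```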
Step 6: Examining the relationship between included categories and the gender of the author. The frequency analysis we have just done shows the frequency of words regardless of gender. It would also be interesting to see whether men and women differ in what they look for in an ideal partner. We go to the fourth tab, "Crosstab," and WordStat runs two separate frequency analyses for men and women and provides a nice table (Figure 8). The results suggest that the most important criterion for men appears to be "appearance," while women value "communication" and "family" the most. From the same tab we can also estimate the strength of these relationships by selecting different association measures such as the chi-square or Pearson's R statistic.
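To make the chi-square measure concrete, here is a sketch of the standard Pearson chi-square computation on a small gender-by-category table. The counts are invented for illustration and are NOT the figures from the SEEKING.DBF sample; the function name is my own, and the code assumes every row and column total is nonzero.

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

# Invented counts, for illustration only.
# Columns: ads that mention the category, ads that do not.
table = [
    [20, 10],  # ads placed by men
    [10, 20],  # ads placed by women
]
print(round(chi_square(table), 3))  # prints 6.667
```

With one degree of freedom, the 5 percent critical value is 3.841, so a statistic of 6.667 would indicate a significant association between gender and mention of the category in this made-up table.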
You can do various other tasks on the "Crosstab" page, such as correspondence analysis. You can also create "heatmaps" that help clarify the relationship between words and categories (Figure 9).
We downloaded the demo version and became familiar with the software without the benefit of the printed manual. Thanks to its straightforward interface and easy-to-follow online manual, WordStat was relatively easy to use. In addition, WordStat proved quite versatile in importing data from popular applications and exporting results to various formats. We were able to import an MS Access file containing 650+ job ads in a few seconds without any difficulty.
Two features we particularly liked were "frequency analysis for a single word" and "phrase extractor." As we did not have any category dictionary for "discipline," "degree," "skill" and "nature of work," we had to create our own. Although we had to go through more than a thousand key words and phrases automatically identified by these features to weed out the irrelevant ones, the two features helped us identify relevant key words and phrases more quickly and more accurately. While writing the article, we repeated the above process as we added more ads over time. We therefore had to update our category dictionary a number of times, and editing the category dictionary in WordStat was not complicated.
We found "keyword-in-context" (KWIC) useful when we were trying to determine the relevance of certain terms. For example, we discovered that Monster's search engine, given our phrase "operations research" (within quotes), also returned ads in which the words "operations" and "research" were separated by a punctuation mark. So, we used the KWIC feature to go through individual ads to spot "operations, research." WordStat automatically found all the ads containing "operations, research" and highlighted the matches in different colors, so we could easily spot and remove those ads.
Processing speed was adequate. The computer I use has a 2.4 GHz Intel Celeron processor with 512 MB of RAM.
It took approximately three seconds to do the frequency analysis of words and one minute for the key phrase extraction over the 650-plus ads. WordStat helped us analyze the vast textual information quickly and in a rigorous manner. However, manipulating the data in the base statistical program, SimStat, was rather difficult and cumbersome relative to MS Excel and MS Access, from which data can be imported directly.
While writing this review, I found various academic and industry articles reporting results obtained using WordStat. I was amazed at how creative the users of WordStat are in applying this software to various situations. For example, Péladeau and Stovall analyzed a database of pilot reports on collision risks, commonly known as TCAS Reports (Traffic Collision Avoidance System Reports). Using WordStat, they were able to identify the specific risks at different airports, the hour of day when those errors occurred, the flight phase in which those collision risks occurred, as well as some properties of those collision incidents (timing of events, multiplicity of events, pilot actions, etc.) [1]. Virtually any sort of textual information can be analyzed with WordStat using dictionaries of your own choice. Imagine being able to analyze vast amounts of field operations documents, reports, e-mails, databases and other text fields that were untapped because they were simply too cumbersome or time-consuming to analyze manually. In O.R. classrooms, introducing a content analysis tool like WordStat can help students become aware of extending number-based statistics to text in order to mine information.
Overall, WordStat is an easy-to-use, affordable and feature-rich software package that provides O.R. professionals with yet another analytical technique.
Postscript: For the record, the Ph.D. student's conclusion was that a large part of Blair's campaign was benchmarked from the Clinton campaign, and the evidence presented by the content analysis was quite convincing.
Byung-Gak Son (b.g.son@city.ac.uk) is a Research Fellow at the Cass Business School, City University London, who recently finished his Ph.D. in supply chain management.
References
OR/MS Today copyright © 2005 by the Institute for Operations Research and the Management Sciences. All rights reserved.