Text mining and item response theory for psychiatric and psychological assessment
By Qiwei He
The information age has made it easy to store and process large amounts of data, including both structured data (e.g., responses to questionnaires) and unstructured data (e.g., natural language or prose). As an additional source of information in assessments, textual data has been increasingly used by cognitive, personality, clinical and social psychologists. How to handle these textual data and combine them with structured data in psychiatric and psychological assessments are the major themes in my dissertation.
The dissertation focuses on exploring two main research questions:
- How can we apply text mining to narratives collected in the framework of psychiatric and psychological assessment to make classification decisions?
- How can we simultaneously model the outcome of text mining and the item response theory (IRT)-based outcomes of responses to questionnaires to validate the text mining procedure and enhance the quality of the measurement and classification procedure?
Applying text mining techniques to analyze unstructured data
The impetus for this research project originates from the construction of a screening test for posttraumatic stress disorder (PTSD) that uses the lexical features of patients’ self-narratives. Trauma victims are asked to write down their traumatic events and symptoms online rather than undergo time-consuming face-to-face interviews with item-based questionnaires. Based on their textual input, the respondents can be classified into PTSD (i.e., high risk of developing PTSD) and NONPTSD (i.e., low risk of developing PTSD) groups. Those identified as PTSD are later invited to take a more extensive test for a more precise diagnosis. Therefore, the goal is to maximize the accuracy of the textual screening method in finding potential PTSD patients or excluding NONPTSD individuals from follow-up testing. The focus of this dissertation is to assess to what extent text mining techniques can be applied in the screening phase, and to establish the extent to which they produce better estimates and better prediction of the true diagnosis than a questionnaire alone. Further, we propose to combine text mining data and questionnaire data in a Bayesian framework (see Figure 1), where a score based on text mining serves as input for a prior distribution of a latent trait associated with PTSD that is measured by a number of questionnaire items using an IRT model (Lord, 1980).
Figure 1. A Bayesian framework that combines textual analysis and IRT scale estimates.
This dissertation introduces the chi-square feature selection model (Oakes, Gaizauskas, Fowkes, Jonsson & Beaulieu, 2001) and presents an alternative machine learning algorithm for binary text classification named the product score model (PSM; He & Veldkamp, 2012; He, Veldkamp, & de Vries, 2012).
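As a rough illustration of chi-square keyword selection, the sketch below ranks words by a generic 2×2 chi-square statistic computed from word counts in two corpora. It is not necessarily the exact formulation of Oakes et al. (2001); the function name, the whitespace tokenization and the "positive"/"negative" labeling rule are assumptions made for the example.

```python
from collections import Counter


def chi_square_keywords(docs_pos, docs_neg, top_k=10):
    """Rank words by a 2x2 chi-square statistic over two corpora (illustrative only)."""
    # Assumption: lowercased whitespace tokens stand in for real preprocessing.
    pos = Counter(w for d in docs_pos for w in d.lower().split())
    neg = Counter(w for d in docs_neg for w in d.lower().split())
    n_pos, n_neg = sum(pos.values()), sum(neg.values())
    n = n_pos + n_neg
    scored = []
    for w in set(pos) | set(neg):
        a, b = pos[w], neg[w]          # occurrences of w in each corpus
        c, d = n_pos - a, n_neg - b    # occurrences of all other words
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        chi2 = n * (a * d - b * c) ** 2 / denom if denom else 0.0
        # A word is a "positive" indicator if it is relatively more frequent
        # in the positive corpus, and a "negative" indicator otherwise.
        label = "positive" if a * n_neg > b * n_pos else "negative"
        scored.append((chi2, w, label))
    # Highest statistic first; ties broken alphabetically for reproducibility.
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [(w, label, chi2) for chi2, w, label in scored[:top_k]]
```

The returned keyword list, with its positive and negative labels, is the kind of input a downstream classifier would then weight.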
The PSM (He et al., 2012) assigns two weights to each keyword in binary classification. These weights, Ui and Vi, are the probabilities of the word occurring in each of the two separate corpora, and they indicate the degree to which the word represents each class. The weights are calculated by
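One form consistent with this description is sketched below. The symbols O(w_i | C_j), for the number of occurrences of keyword w_i in corpus C_j, and len(C_j), for the length of corpus C_j, are assumptions introduced here for illustration.

```latex
U_i = \frac{O(w_i \mid C_1) + a}{\operatorname{len}(C_1)},
\qquad
V_i = \frac{O(w_i \mid C_2) + a}{\operatorname{len}(C_2)}
```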
Note that a smoothing constant a (we use a = 0.5 in this study) is added to the word occurrence in the formula to account for words that do not occur in the training set but might occur in new texts (for more on smoothing rules, see Manning & Schütze, 1999; Jurafsky & Martin, 2009).
The name product score comes from a product operation to compute scores for each class, that is, S1 and S2, for each input text based on the term weights. The formula is
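A form consistent with this description is sketched below, assuming k keywords are found in the input text, O(w_i | C_j) is the number of occurrences of keyword w_i in corpus C_j, and len(C_j) is the length of corpus C_j (these symbols are assumptions introduced for illustration).

```latex
S_1 = P(C_1) \prod_{i=1}^{k} U_i
    = P(C_1) \prod_{i=1}^{k} \frac{O(w_i \mid C_1) + a}{\operatorname{len}(C_1)},
\qquad
S_2 = P(C_2) \prod_{i=1}^{k} V_i
    = P(C_2) \prod_{i=1}^{k} \frac{O(w_i \mid C_2) + a}{\operatorname{len}(C_2)}
```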
where a is a constant, and P(C) is the prior probability for each category given the total corpora. The classification rule is defined as:
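One way to write such a rule, assuming the decision is based on the log ratio of the two class scores S1 and S2, is:

```latex
\text{class} =
\begin{cases}
C_1 & \text{if } \log\dfrac{S_1}{S_2} > b, \\
C_2 & \text{otherwise}
\end{cases}
```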
where b is a constant.1
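The weighting, scoring and thresholding steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes whitespace tokenization, corpus length measured as the total token count, priors taken from the proportion of documents in each class, and a decision on log(S1/S2) > b. The function names are hypothetical.

```python
import math
from collections import Counter


def train_psm(docs_c1, docs_c2, keywords, a=0.5):
    """Compute smoothed keyword weights (U_i, V_i) and class priors."""
    c1 = Counter(w for d in docs_c1 for w in d.lower().split())
    c2 = Counter(w for d in docs_c2 for w in d.lower().split())
    len1, len2 = sum(c1.values()), sum(c2.values())
    # Smoothing constant a keeps weights positive for unseen keywords.
    weights = {w: ((c1[w] + a) / len1, (c2[w] + a) / len2) for w in keywords}
    n1, n2 = len(docs_c1), len(docs_c2)
    priors = (n1 / (n1 + n2), n2 / (n1 + n2))
    return weights, priors


def classify_psm(text, weights, priors, b=0.0):
    """Return the predicted class and log(S1 / S2) for one text."""
    # Summing logs is equivalent to multiplying weights and avoids underflow.
    log_ratio = math.log(priors[0]) - math.log(priors[1])
    for w in text.lower().split():
        if w in weights:  # each keyword occurrence contributes one factor
            u, v = weights[w]
            log_ratio += math.log(u) - math.log(v)
    return ("C1" if log_ratio > b else "C2"), log_ratio
```

Raising b makes the classifier more conservative about assigning a text to the first class, which is one way to trade sensitivity against specificity in a screening setting.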
To avoid mismatches caused by randomness, unclassification rules are also taken into account. As mentioned above, based on the chi-square selection algorithm, the keywords are assigned to one of two categories: positive or negative indicator. Thus, we define a text as “unclassified” when any of the following conditions are met: (a) no keywords are found in the text; (b) only one keyword is found in the text; (c) only two keywords are found in the text, but one is labeled as a positive indicator and the other as a negative.
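The three conditions can be expressed as a small check, sketched here under the assumption that "keywords found" means distinct keywords and that each keyword carries a "positive" or "negative" label from the selection step. The function name and the dictionary layout are hypothetical.

```python
def is_unclassified(text, keyword_labels):
    """Apply the three 'unclassified' conditions to one text."""
    # Distinct keywords present in the text (assumed whitespace tokenization).
    found = {w for w in text.lower().split() if w in keyword_labels}
    if len(found) <= 1:
        # Conditions (a) and (b): no keyword, or only one keyword.
        return True
    if len(found) == 2:
        # Condition (c): exactly two keywords with opposite labels.
        return {keyword_labels[w] for w in found} == {"positive", "negative"}
    return False
```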
Developing a hybrid model combining structured and unstructured data to screen PTSD
A new intake procedure was developed for the detection of PTSD in this dissertation, which combines advanced text mining techniques and item response modeling in one framework. The research mainly consists of three parts: (a) computerized text classification of patients’ self-narratives to screen for PTSD; (b) exploring the generalizability of DSM-IV diagnostic criteria (American Psychiatric Association, 2000) for PTSD using item response modeling; and (c) combining textual assessment of patients’ self-narratives and structured interviews in the PTSD identification process.
Using 300 self-narratives collected online, a textual assessment method based on the PSM was used primarily to distinguish individuals at high or low risk of developing PTSD. The text mining approach resulted in a high level of agreement (82 percent) with the psychiatrists’ diagnoses and revealed some expressive characteristics in the writings of PTSD patients. Although the results of text analysis are not completely analogous to the results of structured interviews in PTSD diagnosis, it can be concluded that the application of text mining is a promising addition to the assessment of PTSD in clinical and research settings.
An extension of the data representation model from unigrams (i.e., single words) to n-grams, where the occurrences of sets of n consecutive words are counted, was explored. Based on the same sample used in the previous procedure, the PSM, decision trees and naïve Bayes were applied in conjunction with five representation models (unigrams, bigrams, trigrams, a combination of unigrams and bigrams, and a mixture of n-grams) to identify PTSD patients. Although the PSM with unigrams attained the highest prediction accuracy when compared with psychiatrists’ diagnoses in structured interviews, the addition of n-grams contributed most to enhancing the reliability of prediction and balancing the performance metrics; that is, it yielded a fairly high sensitivity with the least sacrifice in specificity (He, Veldkamp, Glas & de Vries, 2017).
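To make the representation models concrete, the sketch below counts n-grams of several sizes in one text. It is an illustration only: the function name and the whitespace tokenization are assumptions, and the study's actual preprocessing may differ.

```python
from collections import Counter


def ngram_counts(text, n_values=(1, 2)):
    """Count n-grams (runs of n consecutive tokens) for each n in n_values."""
    tokens = text.lower().split()
    counts = Counter()
    for n in n_values:
        # A text with fewer than n tokens contributes no n-grams of that size.
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts
```

Passing `n_values=(1,)` gives the unigram model, `(2,)` the bigram model, and `(1, 2)` the combination of unigrams and bigrams.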
The generalizability of DSM-IV diagnostic criteria for PTSD to various subpopulations was also examined in this dissertation by using IRT techniques. Besides identifying differential item functioning (DIF; Camilli & Shepard, 1994) related to various background variables such as gender, marital status and educational level, this study also emphasized the importance of evaluating the impact of DIF on population inferences as made in health surveys and clinical trials and on the diagnosis of individual patients. The study found that the DSM-IV diagnostic criteria for PTSD did not produce substantially biased results in the investigated subpopulations and that there should be few reservations regarding their use (He, Glas & Veldkamp, 2014).
Given the benefits of text mining and IRT discussed earlier, a combination of the two methods is proposed to draw on the strengths of both. Text mining and item response modeling are used to analyze patients’ writings and responses to standardized questionnaires, respectively. The whole procedure is combined in a Bayesian framework in which the textual assessment functions as an informative prior for the estimation of the PTSD latent trait. Results show that by adding textual prior information, detection accuracy is increased by 50 percent and test length can be efficiently shortened.
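The idea of an informative prior can be illustrated with a small sketch. It assumes a two-parameter logistic (2PL) IRT model, a normal prior on the latent trait, and a grid approximation of the posterior mean; how the text mining score is mapped to the prior mean is not specified here, so `prior_mean` is simply supplied by the caller. None of these choices is confirmed as the dissertation's actual model.

```python
import math


def eap_theta(responses, items, prior_mean=0.0, prior_sd=1.0, grid_size=161):
    """Posterior mean of the latent trait under a 2PL model with a normal prior.

    responses: list of 0/1 item scores
    items: list of (discrimination, difficulty) pairs, one per response
    """
    # Evenly spaced grid on [-4, 4] used to approximate the posterior.
    grid = [-4 + 8 * g / (grid_size - 1) for g in range(grid_size)]
    post = []
    for theta in grid:
        # Log of the (unnormalized) normal prior density.
        log_p = -0.5 * ((theta - prior_mean) / prior_sd) ** 2
        for x, (disc, diff) in zip(responses, items):
            p = 1.0 / (1.0 + math.exp(-disc * (theta - diff)))
            log_p += math.log(p if x == 1 else 1.0 - p)
        post.append(math.exp(log_p))
    total = sum(post)
    return sum(t * w for t, w in zip(grid, post)) / total
```

With the same item responses, a prior centered higher (for example, because the narrative was scored as high risk) pulls the trait estimate upward, which is why fewer questionnaire items may be needed to reach a decision.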
Extending the textual analysis model to psychological assessment
This dissertation also extends the application of the model from psychiatric datasets to an internet dataset, which consists of both textual posts and responses to scales related to self-monitoring skills (Snyder, 1974) on Facebook. This subproject emphasizes the importance of validating data collected from the internet and explores the relationship between self-monitoring skills and textual posts on a Facebook wall. The conclusion was that textual posts on a Facebook wall could partially predict users’ self-monitoring skills. The variable “family” and typical internet language, such as emoticons and internet slang expressions, were found to be the most robust classifiers in this textual analysis (He, Glas, Kosinski, Stillwell & Veldkamp, 2014).
Statement of significance
The textual assessment method developed in this thesis provides patients with opportunities to express themselves freely and can be easily applied to research of similar construction. For instance, the text classification model developed for PTSD screening can also be used for initial detection of individuals at risk for depression, which has been predicted to be the second-leading cause of world disability by 2020 (World Health Organization, 2001) and is expected to be the largest contributor to disease burden by 2030 (World Health Organization, 2008). It is an ideal approach toward improving the cost-effectiveness of a diagnostic procedure, reducing both patients’ burden and clinicians’ workload, and allowing for earlier detection of potential patients. Early detection, either by a general practitioner or through an online screening test, can result in more effective, shorter, and less costly treatment.
Text mining, in combination with IRT, is expected to be a promising tool in psychological and psychiatric assessments in future decades. The involvement of text mining provides a new perspective, allowing for handling structured and unstructured data in a common framework. With the unprecedented popularity of social networks such as Facebook, numerous research ideas related to psychology, such as personality, self-presentation, and online honesty, are springing up. The techniques applied and discussed in this thesis might play an important role in exploring these exciting fields.
American Psychiatric Association. (2000). Diagnostic and statistical manual of mental disorders: DSM-IV (4th ed.). Washington, D.C.: American Psychiatric Association.
Camilli, G., & Shepard, L.A. (1994). Methods for identifying biased test items. Newbury Park, California: Sage.
He, Q., Glas, C.A.W., Kosinski, M., Stillwell, D.J., & Veldkamp, B.P. (2014). Predicting self-monitoring skills using textual posts on Facebook. Computers in Human Behavior, 33, 69–78.
He, Q., Glas, C.A.W., & Veldkamp, B.P. (2014). Assessing the impact of differential symptom endorsement on posttraumatic stress disorder (PTSD) diagnosis. International Journal of Methods in Psychiatric Research, 23(2), 131–141.
He, Q., & Veldkamp, B.P. (2012). Classifying unstructured textual data using the Product Score Model: An alternative text mining algorithm. In T.J.H.M. Eggen & B.P. Veldkamp (Eds.), Psychometrics in practice at RCEC (pp. 47–62). Enschede, Netherlands: RCEC.
He, Q., Veldkamp, B.P., Glas, C.A.W., & de Vries, T. (2017). Automated assessment of patients’ self-narratives for posttraumatic stress disorder screening using natural language processing and text mining. Assessment, 24(2), 157–172.
He, Q., Veldkamp, B.P., & de Vries, T. (2012). Screening for posttraumatic stress disorder using verbal features in self narratives: A text mining approach. Psychiatry Research, 198(3), 441–447.
Jurafsky, D., & Martin, J.H. (2009). Speech and language processing: An introduction to natural language processing, computational linguistics, and speech recognition. Upper Saddle River, New Jersey: Pearson Prentice Hall.
Lord, F.M. (1980). Applications of item response theory to practical testing problems. Hillsdale, New Jersey: Erlbaum.
Manning, C.D., & Schütze, H. (1999). Foundations of statistical natural language processing. Cambridge, Massachusetts: MIT Press.
Oakes, M., Gaizauskas, R., Fowkes, H., Jonsson, W.A.V., & Beaulieu, M. (2001). A method based on chi-square test for document classification. In Proceedings of the 24th annual international ACM SIGIR conference on research and development in information retrieval (pp. 440–441). New York: ACM.
Snyder, M. (1974). Self-monitoring of expressive behavior. Journal of Personality and Social Psychology, 30(4), 526–537.
World Health Organization. (2001). The world health report 2001: Mental health: New understanding, new hope. Geneva, Switzerland: Author.
World Health Organization. (2008). WHO global burden of disease. Retrieved from http://www.who.int/healthinfo/global_burden_disease/GBD_report_2004update_full.pdf
1 In principle, the range of the threshold b is unbounded. In practice, however, a range of (–5, +5) is recommended a priori for b.