A look back over ten years of international assessment research

An early career retrospective by the 2016 Anne Anastasi Early Career Award winner.

By Leslie Rutkowski

In the blink of an eye, nearly ten years have passed since I finished my PhD in educational psychology under the guidance of Carolyn Anderson at the University of Illinois. In that span I've spent about half of my time in the U.S. at Indiana University and the other half split between a research position in Hamburg, Germany, and my current post as a professor at the University of Oslo in the new Centre for Educational Measurement. Over the course of my postdoctoral experiences, my career has been marked by wonderful and exciting collaborative research with colleagues, friends, and my husband of nearly 20 years. The common thread in all of this is the international nature of my research, which focuses on a fun niche: methodological issues in comparative, large-scale studies of educational achievement. In addition to offering a lively area of research, this field has allowed me to fill up nearly two passports.

The first such study of international educational achievement, the Pilot Twelve-Country Study (Foshay, Thorndike, Hotyat, Pidgeon, & Walker, 1962), included participants from mostly European countries (excepting Israel and the U.S.) with reasonably well-developed economies¹. In contrast, the 2015 cycle of the Programme for International Student Assessment (PISA) featured 72 system-level participants from all continents, excluding Antarctica. Among these participants, a range of sub-national systems took part, including Singapore, four Chinese provinces and municipalities, Massachusetts, and the Autonomous Region of the Azores (Organisation for Economic Co-operation and Development; OECD, n.d.b). National-level participants spanned a range of income brackets, 90 languages or dialects, and a variety of cultures. Further, 15 countries plus Puerto Rico completed a paper-based assessment, while the remaining participants were administered a computer-based version of PISA.

Such a heterogeneous collection of participating educational systems poses challenges in terms of deciding what should be measured and how to measure it in a comparable way. To that end, carefully developed assessment frameworks guide this process. From developing an internationally recognized definition of the content domains (International Association for the Evaluation of Educational Achievement, 2013; OECD, 2016) to the complex process of instrument translation (OECD, 2014), cross-cultural considerations loom large. And regardless of the care with which international assessments are designed, a cross-cultural measurement paradox exists: “The larger the cross-cultural distance between groups, the more likely cross-cultural differences will be observed, but the more likely these differences may be influenced by uncontrolled variables” (van de Vijver & Matsumoto, 2011, p. 3). Accordingly, my research has largely emphasized understanding the degree to which countries differ psychometrically and developing ways to explicitly account for these differences in the models and methods that are used to develop and report scale scores. My interest focuses on methodological issues in measuring and comparing achievement domains (e.g., math, science, and reading) and non-achievement constructs, such as learning attitudes, motivation, and self-reported experiences in and outside of school.

A primary means for investigating measurement differences in a cross-cultural setting is multiple-groups analysis, including confirmatory factor analysis and item response theory-based differential item functioning. A key tension in this area is that many commonly used methods were developed and validated with a small number of groups (often just two) and with smallish sample sizes compared to the thousands that are more typical of international assessment settings. In response, one strand of my work, in collaboration with my close colleague Dubravka Svetina at Indiana University, tests the contours and limits of traditional multiple-groups criteria, including formal tests of model-data consistency and fit indices. Many of our research results point to clear deficiencies with these measures in such a setting. We are currently engaged in further research with colleagues and graduate students to develop better approaches for evaluating model fit when many highly diverse countries and educational systems are thrown into the same analytic hopper.

With my co-PI and husband, David Rutkowski, I am also working with my research group at the University of Oslo on a nationally funded project to develop model- and design-based solutions that will make international assessment results more comparable internationally, yet more useful locally. We just concluded the first phase of work and are very excited to report that our newly developed R package, lsasim (Matta, Rutkowski, Rutkowski, & Liaw, 2017), is available on the Comprehensive R Archive Network. This freely available package, primarily developed under our guidance by postdoc Tyler Matta and extensively tested by our postdoc Yuan-Ling (Linda) Liaw, will allow researchers to simulate data according to the idiosyncrasies of large-scale assessments. We are already using this software to explore a number of open research topics in the field and hope that others will find it useful in their methodological and applied measurement work. It feels like we hit the ground running, and so far it has been great fun working with this group of scholars. I am looking forward to the next steps in our collaborative research. Anyone who is interested in updates on this project can find us online.
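For readers unfamiliar with what such "idiosyncrasies" look like in practice, the following is a minimal Python sketch of one of them: a rotated booklet design, in which each student answers only a subset of the item pool and the rest is missing by design. It does not use lsasim or reproduce its interface. The function name, the Rasch response model, the standard normal distributions, and the pairing of adjacent blocks into booklets are all assumptions chosen for illustration.

```python
import math
import random


def simulate_rotated_design(n_students=1000, n_blocks=3, items_per_block=5, seed=1):
    """Simulate Rasch responses under a simple rotated booklet design.

    Hypothetical illustration only; not the lsasim API.
    Assumes n_blocks >= 2 so that each booklet holds two distinct blocks.
    """
    rng = random.Random(seed)
    n_items = n_blocks * items_per_block

    # Assumed item difficulties, drawn from a standard normal distribution.
    difficulty = [rng.gauss(0.0, 1.0) for _ in range(n_items)]

    # Booklet b contains block b and the next block (wrapping around),
    # so every block appears in exactly two booklets.
    booklets = [(b, (b + 1) % n_blocks) for b in range(n_blocks)]

    data = []
    for s in range(n_students):
        theta = rng.gauss(0.0, 1.0)          # assumed student proficiency
        assigned = booklets[s % n_blocks]    # rotate booklets across students
        row = [None] * n_items               # None marks "not administered"
        for b in assigned:
            start = b * items_per_block
            for i in range(start, start + items_per_block):
                # Rasch model: P(correct) = logistic(theta - difficulty)
                p = 1.0 / (1.0 + math.exp(-(theta - difficulty[i])))
                row[i] = 1 if rng.random() < p else 0
        data.append(row)
    return difficulty, data


if __name__ == "__main__":
    difficulty, data = simulate_rotated_design(n_students=6)
    for row in data:
        print(row)
```

With the default three blocks of five items, each simulated student answers ten of the fifteen items, and every item is seen by two thirds of the sample. That planned missingness is one reason methods validated on complete data can behave differently in large-scale assessment settings.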

Over the next several years, my group and I will be working on developing tractable designs and methodological solutions that allow for locally or regionally customized study components, while also ensuring that the constructs of interest remain internationally comparable. Related to this, we are also working on innovative matching applications that will offer the possibility of creating quasi-experimental conditions, with pre- and post-measures of achievement. Of course, with 80 or more heterogeneous systems, this is no mean feat. But thanks to a great group of collaborators, the challenging work ahead will be as possible as it is exciting.


1 Participants included Belgium, England, Finland, France, Federal Republic of Germany, Israel, Poland, Scotland, Sweden, Switzerland, the United States, and Yugoslavia.


Foshay, A. W., Thorndike, R. L., Hotyat, F., Pidgeon, D. A., & Walker, D. A. (1962). Educational achievement of thirteen-year-olds in twelve countries. Hamburg, Germany: UNESCO Institute for Education. Retrieved from

International Association for the Evaluation of Educational Achievement. (2013). TIMSS 2015 assessment frameworks. Boston: TIMSS & PIRLS International Study Center, Lynch School of Education, Boston College.

Matta, T., Rutkowski, L. A., Rutkowski, D. J., & Liaw, Y.-L. (2017). lsasim: Simulate large-scale assessment data (Version 1.0.0) [R (>=3.3.1)]. Oslo, Norway: Centre for Educational Measurement at University of Oslo. Retrieved from

OECD. (n.d.). Compare your country. Retrieved February 21, 2017, from

OECD. (2014). PISA 2012 technical report. Paris: OECD Publishing.

OECD. (2016). PISA 2015 assessment and analytical framework. Paris: Organisation for Economic Co-operation and Development. Retrieved from

van de Vijver, F. J. R., & Matsumoto, D. (2011). Introduction to the methodological issues associated with cross-cultural research. In D. Matsumoto & F. J. R. van de Vijver (Eds.), Cross-cultural research methods in psychology. New York, NY: Cambridge University Press.