Methodological Reform 101 for developmental researchers
By M. Brent Donnellan, Glenn I. Roisman, R. Chris Fraley, and Richard E. Lucas
Over the past few years, there has been growing concern about methodological practices in psychological science. These concerns were spurred, in part, by the publication in the Journal of Personality and Social Psychology of an article that supposedly provided evidence in support of psychic phenomena, by failures to replicate highly cited and newsworthy findings, and by reports of data fabrication by a few accomplished researchers. In some quarters, there is a sense that the typical ways of conducting, reporting, and reviewing psychological research have limitations that need to be addressed. These discussions about methodological reform often generate disagreement (and perhaps even a few eye rolls). The issues, however, are important to consider for any field dedicated to the pursuit of scientific knowledge.
Like many other areas of psychological science, developmental research involves three ingredients that, when combined in specific ways, can create problems for a field: (a) sample sizes that are often constrained; (b) large numbers of data analytic choices that do not have obvious answers (e.g., how to combine data collected across multiple assessment waves, which covariates to include in a statistical model, whether to allow residuals to covary in ways that are not specified in advance); and (c) considerable pressure on both new and established investigators to publish as much as possible. These factors are by no means unique to developmental research, but we believe there is an important opportunity to use ongoing methodological controversies to engage in a constructive dialogue about the field and to consider the adoption of new research standards, both at the level of individual research labs and in the editorial process. Doing so has the potential to improve research integrity in developmental psychology and to better protect the sub-discipline from some of the concerns that have emerged with respect to other areas of psychology.
This brief essay offers an introduction to the recent methodological discussions and provides general advice for researchers. The focus is practical and aimed at helping researchers approach these issues for themselves. We also provide an annotated guide to important recent publications about methodological reforms.
Seven principles and practices to consider adopting in your own work
Commit to Total Scientific Honesty
Lykken (1991) wrote a prescient essay foreshadowing current discussions about methodological reform. He also offered useful advice for improving the field. His most overarching and challenging recommendation boiled down to a sustained commitment to scientific honesty and integrity. His advice followed from the principles espoused by the Nobel Prize winning physicist Richard Feynman (see e.g., Feynman, 1985, p. 341). The basic idea is to avoid fooling yourself and your scholarly peers when conducting scientific research. Scientific integrity involves full disclosure and maximum transparency. The mandate is to provide all of the details of a given study and to resist the temptation to ignore evidence inconsistent with your preferred perspective.
This is no easy task. A sustained commitment to scientific integrity requires self-awareness and equanimity. Self-awareness is needed to guard against confirmation bias and post hoc rationalizations. Equanimity is needed to prevent oneself from getting emotionally invested in the outcome of a research study. The focus and energy should be directed at the process of designing the best study possible without undue investment in obtaining any particular result.
Be Aware of the Impact of Researcher Degrees of Freedom
Researchers should acknowledge the myriad decisions they face when designing studies, analyzing data, and reporting research. In an ideal world, confirmatory studies (those testing specific hypotheses) would be planned in great detail and any deviations from the plan (including departures from the planned data analytic strategy) could be evaluated by objective readers. In reality, however, this type of pre-registration rarely occurs in psychological studies (although pre-registration of replication attempts is being implemented at Perspectives on Psychological Science). In addition, sticking to an inflexible plan might hinder discovery at early stages of research. However, when multiple options are available, confirmation bias can lead researchers to selectively attend to results that support their pre-existing beliefs and ignore results that are inconsistent with these ideas. These biases often occur outside of conscious awareness. In addition, because positive findings tend to be valued more than negative findings (and hence are more likely to be publishable), flexible analyses, combined with HARKing (hypothesizing after results are known), can lead researchers to prefer analytic choices that “work,” even if these were not the analyses that had been planned ahead of time. Equally troubling, the preferred and reported results may not even be representative of the results of the full range of analyses that were conducted. Thus, researcher degrees of freedom increase the likelihood of introducing false positives to the literature.
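A rough calculation illustrates why this flexibility matters. The sketch below assumes, purely for illustration, that a researcher tries k independent analyses of data in which no true effect exists, tests each at alpha = .05, and reports whichever one reaches significance. Real analytic choices are usually correlated, so the actual inflation would typically be smaller than these figures, but the direction of the problem is the same. The function name is hypothetical.

```python
def chance_of_any_false_positive(k: int, alpha: float = 0.05) -> float:
    """Probability that at least one of k independent tests of a true
    null hypothesis comes out significant at the given alpha level."""
    return 1 - (1 - alpha) ** k


# Illustrative numbers of analyses tried on the same null data.
for k in (1, 3, 5, 10):
    print(f"{k:>2} analyses: {chance_of_any_false_positive(k):.3f}")
```

Under these simplifying assumptions, five analyses already push the chance of at least one spurious "finding" above 20%.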
One way to address these issues is to approach your own research as if pre-registration were necessary. Think carefully about which analyses are most appropriate and create a detailed plan before seeing the results, and keep (for example) syntax-based documentation of the full set of analyses actually conducted. If the planned analyses result in publishable findings, you can be more confident that your findings are not the chance result of examining the data from multiple angles. However, if deviations from the plan are required in the process of conducting analyses, you should consider shifting to a more exploratory mode. Alerting readers to the analyses that did not work may give a more complete picture of the robustness of the effect, and replication in additional datasets might be a wise choice.
Although there are some methodological questions that have clear right or wrong answers, many decisions fall on a continuum from more to less reasonable. Increased transparency allows others to judge the appropriateness of your decisions and helps provide some context regarding the number of alternatives that were tried before the final analyses were chosen. Researchers should therefore document their procedures and materials and make those available to peers. More and more journals allow for online supplements that make the dissemination of such material quite easy. In addition, although the sharing of de-identified data can involve a set of thorny issues, it is one of the best ways to combat problems associated with researcher degrees of freedom. Developmental scientists can take pride in having been at the leading edge of this practice by way of large, publicly available datasets like the NICHD Study of Early Child Care and Youth Development. We strongly encourage the leading journals in developmental psychology to consider ways to facilitate the process of open reporting, data archiving, and pre-registration.
Focus on Effect Size Estimation Rather Than Statistical Significance
A call to focus on effect size estimation is inherently tied to criticisms of conventional null hypothesis significance testing (NHST). Unfortunately, discussions about the limitations of NHST often either fall on deaf ears or amount to preaching to the choir. Hopefully, this is changing as a result of the increasing recognition of the importance of estimating effect sizes with precision (see e.g., Cumming, 2012; Fraley & Marks, 2007; Kline, 2013).
To be clear, most of the interesting questions in psychological science concern the strength of the association between two variables or the magnitude of the difference between groups (either experimentally induced or otherwise naturally occurring groups). Questions of this sort are not answered by the p-values that accompany NHST. Instead, they require effect size estimates. It is perhaps easiest to see how a concern with effect sizes is critical when evaluating the results from some of the large scale national studies common in developmental journals or in epidemiological research. Given a large enough sample size, virtually any association or mean difference, trivial or substantial, can attain statistical significance. The task is to evaluate the theoretical and practical importance of the statistical association or difference. There is no reason this perspective on interpretation should be limited to large scale studies; indeed, we believe this approach should be the focus of virtually all research conducted in psychology.
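The point about large samples can be made concrete with a small sketch. It uses the Fisher z approximation to obtain a two-sided p-value for an observed correlation; the function name and the illustrative value r = .02 are assumptions chosen for this example, not figures from any study.

```python
from math import atanh, sqrt
from statistics import NormalDist


def two_sided_p_for_r(r: float, n: int) -> float:
    """Approximate two-sided p-value for H0: rho = 0, using the
    Fisher z transformation of an observed correlation r."""
    z = atanh(r) * sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))


# The same trivial correlation evaluated at increasing sample sizes.
for n in (100, 1_000, 10_000, 100_000):
    print(f"n = {n:>7}: p = {two_sided_p_for_r(0.02, n):.4g}")
```

With r = .02, the result is nowhere near significance at n = 100 but falls far below .05 at n = 100,000, even though the association is equally trivial in both cases.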
A focus on effect size estimation has other benefits. Once researchers routinely attend to effect sizes, there is likely to be a desire for effect size estimates that are more precise (i.e., have narrower confidence intervals) and to test theoretical models that generate risky hypotheses, using a falsificationist approach to science under which larger samples actually represent a greater risk of refutation for the proposed theory. These are good things for a field.
Understand That Sample Size and Statistical Power Matter
This recommendation is closely related to the issues surrounding parameter estimation and concerns with significance testing. For instance, a desire for a more precise understanding of effect sizes will likely motivate researchers to use larger sample sizes. This is critical because concerns over sample sizes and statistical power have a long history of being ignored in psychological research. Unfortunately, many of the methodological issues that have come to light in recent years can be traced to an inattention to statistical power and the problems with research based on small sample sizes.
There are at least three problems that arise when researchers conduct research using small sample sizes. First, such studies are likely to be underpowered and, consequently, unlikely to detect critical effects of interest. Second, small sample sizes are likely to provide biased estimates of population effect sizes if results are screened through a significance test filter (i.e., the estimate from a given study is likely to be larger than the true association because only relatively large effects can attain statistical significance in this case). If one is earnestly interested in a parameter estimate (e.g., the expected difference in academic performance between rejected and non-rejected children), this problem is of obvious importance. But even if one is not interested in the parameter estimates per se, this second problem with small samples has other unintended negative consequences. The most notable one is that other investigators seeking to build upon and extend the original findings might base their sample size decisions on inflated effect size estimates. This starting point will lead researchers to design future studies with inadequate power. Such a practice will increase the possibility of null findings and contribute to confusion in the literature. It may also perpetuate bad design practices if other researchers follow norms created by small sample size research.
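The significance-filter problem can be seen in a small simulation. The sketch below is a minimal illustration under assumed values (a true correlation of .20, samples of 30, 2,000 simulated studies, and a two-sided Fisher z test at alpha = .05); all names and numbers are hypothetical choices for this example.

```python
import random
from math import atanh, sqrt
from statistics import mean


def sample_r(rho: float, n: int, rng: random.Random) -> float:
    """Draw n bivariate-normal pairs with correlation rho and return Pearson r."""
    xs = [rng.gauss(0, 1) for _ in range(n)]
    ys = [rho * x + sqrt(1 - rho ** 2) * rng.gauss(0, 1) for x in xs]
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def is_significant(r: float, n: int, z_crit: float = 1.96) -> bool:
    """Two-sided test of rho = 0 via the Fisher z approximation."""
    return abs(atanh(r)) * sqrt(n - 3) > z_crit


rng = random.Random(2013)          # fixed seed so the run is repeatable
RHO, N, STUDIES = 0.2, 30, 2000    # assumed values for illustration only

all_r = [sample_r(RHO, N, rng) for _ in range(STUDIES)]
published_r = [r for r in all_r if is_significant(r, N)]  # the "filter"

mean_all = mean(all_r)
mean_published = mean(published_r)
print(f"mean r, all studies:        {mean_all:.2f}")
print(f"mean r, significant only:   {mean_published:.2f}")
print(f"share reaching significance: {len(published_r) / STUDIES:.2f}")
```

In this setup, the average across all simulated studies stays close to the true value of .20, whereas the average among the studies that pass the significance filter is considerably larger. A researcher who planned a follow-up study around the filtered estimate would select too small a sample.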
Third, and perhaps most importantly for the scientific literature as a whole, the rate of false positives for published findings is higher in literatures based on underpowered studies than in literatures based on high powered studies. This claim often strikes people as counter-intuitive. Researchers often assume that, although low powered studies might make it more challenging to detect real effects, effects detected by lower powered studies should be real. Ioannidis (2005) and others have shown why this assumption is incorrect. Low powered studies both decrease the likelihood that researchers will detect the effects they hypothesize and increase the likelihood of Type I errors (false positives) in the research literature, often by creating perverse incentives to hunt for statistically significant differences around which a story can be constructed post hoc. Low powered studies hurt not only the investigator in question, but the integrity of the research literature as a whole.
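A back-of-the-envelope version of this argument, in the spirit of Ioannidis (2005) but ignoring bias and selective analysis, can be sketched as follows. The prior probability that a tested hypothesis is true (here 20%) is an assumption chosen only for illustration, and the function name is hypothetical.

```python
def false_positive_share(power: float, prior: float, alpha: float = 0.05) -> float:
    """Share of statistically significant findings that are false positives,
    given the power of the studies, the prior probability that a tested
    hypothesis is true, and the alpha level."""
    true_hits = power * prior          # true effects that reach significance
    false_hits = alpha * (1 - prior)   # null effects that reach significance
    return false_hits / (true_hits + false_hits)


PRIOR = 0.20  # assumed: one in five tested hypotheses is true
for power in (0.80, 0.50, 0.20):
    share = false_positive_share(power, PRIOR)
    print(f"power = {power:.2f}: {share:.0%} of significant results are false")
```

Under these assumptions, 20% of significant findings are false positives when power is .80, but the share rises to 50% when power is .20. The alpha level is identical in both cases; what changes is how many true effects are detected alongside the false ones.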
What can be done about these issues? The first step is to gain a better understanding of the concerns over small sample sizes by developing an intuition for how illusory findings emerge when sample sizes are low (Cumming, 2012). Second, researchers can begin to consider what kinds of effects one might expect in a research area and select sample sizes that will allow those effects to be estimated with reasonable precision (Fraley & Marks, 2007). There are costs and benefits involved with sample size selection, but it is imperative to appreciate the hidden costs of conducting underpowered research. Finally, editors and reviewers can demand that investigators justify the sample sizes they have selected (e.g., Simmons et al., 2011). One solution is for editors to adopt minimal thresholds for the sample sizes, statistical power, or confidence bands of research published in their journals. For example, if there is agreement among researchers in a given field that trivial associations (e.g., |r| < .09) are of relatively little theoretical interest, then studies might be expected to include at least enough participants to have adequate power to determine whether relevant associations exceed that cut-off (it may surprise some scholars that 80% power to reliably differentiate an r = .10 or larger from zero requires 617 participants, assuming a directional prediction). Regardless of how this issue is handled, researchers, reviewers, and editors should appreciate that sample size is one of the most important factors in research design, even if, historically, it has been the most often neglected.
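The figure of 617 participants can be approximated with a few lines of code. The sketch below uses the standard Fisher z approximation for the sample size needed to detect a correlation; it is one common way to arrive at such a number and is not necessarily the exact procedure used for the figure above. The function name is hypothetical.

```python
from math import atanh
from statistics import NormalDist


def n_for_correlation(r: float, power: float = 0.80, alpha: float = 0.05,
                      one_sided: bool = True) -> float:
    """Approximate sample size needed to detect a population correlation r
    with the given power, using the Fisher z approximation."""
    tail = alpha if one_sided else alpha / 2
    z_alpha = NormalDist().inv_cdf(1 - tail)   # critical value of the test
    z_power = NormalDist().inv_cdf(power)      # quantile for desired power
    return ((z_alpha + z_power) / atanh(r)) ** 2 + 3


print(f"directional test:     n is about {n_for_correlation(0.10):.0f}")
print(f"non-directional test: n is about "
      f"{n_for_correlation(0.10, one_sided=False):.0f}")
```

For r = .10 with a directional test, the approximation lands at roughly 617 participants, in line with the figure quoted above. Dropping the directional assumption raises the requirement to well over 750.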
Review Papers for Methodological Completeness, Accuracy, and Plausibility
One common theme in the report issued by the committee investigating the fraud committed by social psychologist Diederik Stapel was the presence of many implausibly large effect size estimates. It is possible that greater attention to effect size estimates and the methodological details of these papers by the research community might have raised red flags earlier in the process. Even strong proponents of NHST should be able to acknowledge that routinely computing and reporting effect size estimates can serve as a useful check on the plausibility of a result. Sometimes effects are too big to be believed and simply reflect mistakes in reporting, such as substituting standard errors for standard deviations.
In light of concerns about the methodological rigor of psychological research, researchers should commit to reviewing papers for methodological accuracy in addition to theoretical coherence and plausibility. There is perhaps a natural tendency to want journal articles to tell a coherent and engaging story. The material in Method and Results sections is critical to the credibility of that story. Attending to these issues in the work of others may help researchers attend to these issues in their own studies (and vice versa).
Focus on Reproducibility and Replication
The term replication is becoming saddled with baggage and even controversy. This is unfortunate because testing whether results can be duplicated (i.e., producing effect sizes of roughly the same magnitude) using the same procedures or very similar procedures is a core scientific value. It is often quite illuminating to run the exact same experiment or simple correlational study multiple times to observe how results fluctuate across trials. Likewise, testing how well results generalize to different kinds of samples and different operational definitions is critical for scientific progress. In short, researchers should strive to make sure their own results are sturdy. Researchers may also consider dedicating a fraction of their time each year to replication studies of outside findings. Increasing the frequency and visibility of these efforts at meaningful duplication will ultimately improve the rigor of scientific psychology.
Think Critically about Sampling Strategies and Generalizability
Scholars have raised concerns about the kinds of samples frequently used in psychological research and how these samples constrain the inferences drawn from contemporary studies (e.g., Arnett, 2008; Henrich, Heine, & Norenzayan, 2010). A thorough discussion of these issues deserves its own column. We simply suggest that developmental researchers continue to be cautious and forthright about the limitations imposed by common methods of drawing samples. This suggestion fits well with the themes of transparency and self-criticism emphasized in this column. For example, consider the complex set of issues that arises when developmental researchers use college students enrolled in social science courses as their stand-in for “adults” when making comparisons with samples of children, adolescents, and individuals past retirement age. In these comparisons, college student samples often differ from the other groups in more than age: they also differ in socioeconomic background, education level (naturally!), and experience with social science research.
The goal of research in developmental psychology is to contribute new knowledge to the field. Researchers want to base their discipline on findings that replicate and stand up to reasonable critical scrutiny. Researchers also want their findings to generalize beyond the particulars of a given sample. These desires lie at the heart of the recent methodological discussions in psychological science. The suggested readings and recommendations in this essay are designed to help researchers approach these discussions in an informed fashion. Ultimately, each researcher must make informed decisions about how to approach his or her craft and some readers may disagree with many of these suggestions and recommendations. However, we hope these ideas serve as a useful entry point and cause for reflection on the critical issues that face the entire field of psychological science. The six recent readings in the sidebar might be of additional interest for developmental scientists who wish to learn more about the issues we have summarized above.
Additional Works Cited
Arnett, J. J. (2008). The neglected 95%: Why American psychology needs to become less American. American Psychologist, 63, 602-614. DOI: 10.1037/0003-066X.63.7.602.
Bakker, M., van Dijk, A., & Wicherts, J. M. (2012). The rules of the game called psychological science. Perspectives on Psychological Science, 7, 543-554. DOI: 10.1177/1745691612459060.
Cumming, G. (2012). Understanding the new statistics: Effect sizes, confidence intervals, and meta-analysis. New York: Routledge.
Ferguson, C. J., & Heene, M. (2012). A vast graveyard of undead theories: Publication bias and psychological science's aversion to the null. Perspectives on Psychological Science, 7, 555-561. DOI: 10.1177/1745691612459059.
Feynman, R. P. (1985). “Surely you're joking, Mr. Feynman!”: Adventures of a curious character. New York: W. W. Norton.
Francis, G. (2012). The psychology of replication and replication in psychology. Perspectives on Psychological Science, 7, 585-594. DOI: 10.1177/1745691612459520.
Francis, G. (2013). Replication, statistical consistency, and publication bias. Journal of Mathematical Psychology.
Fraley, R. C., & Marks, M. J. (2007). The null hypothesis significance testing debate and its implications for personality research. In R. W. Robins, R. C. Fraley, & R. F. Krueger (Eds.), Handbook of research methods in personality psychology (pp. 149-169). New York: Guilford.
Henrich, J., Heine, S. J., & Norenzayan, A. (2010). The weirdest people in the world? Behavioral and Brain Sciences, 33, 61-135. DOI: 10.1017/S0140525X0999152X.
Ioannidis, J. P. A. (2005). Why most published research findings are false. PLoS Medicine, 2(8), e124.
Ioannidis, J. P. A., & Trikalinos, T. A. (2007). An exploratory test for an excess of significant findings. Clinical Trials, 4, 245–253. DOI: 10.1177/1740774507079441.
Kline, R. B. (2013). Beyond significance testing: Statistics reform in the behavioral sciences (2nd ed.). Washington DC: American Psychological Association.
Lykken, D. T. (1991). What's wrong with psychology anyway? In D. Cicchetti & W. M. Grove (Eds.), Thinking clearly about psychology: Vol. 1. Matters of public interest (pp. 3–39). Minneapolis: University of Minnesota Press.
Nosek, B., Spies, J. R., & Motyl, M. (2012). Scientific utopia: II. Restructuring incentives and practices to promote truth over publishability. Perspectives on Psychological Science, 7, 615-631. DOI: 10.1177/1745691612459058.