Crowd sourcing in aging research: Some cautions and best practices

How to use Amazon Mechanical Turk for research in aging.

By Julie Hicks Patrick, PhD, Abigail M. Nehrkorn, and Amy Knepple Carney

Currently, 84 percent of all Americans and 58 percent of those ages 65 and older use the internet for a wide variety of tasks (Pew, 2015). With the rise of the internet and other technological advances, the marketplace for labor is rapidly becoming global. As such, microtask websites that capitalize on distributed human intelligence are growing in popularity. These sites link businesses with people willing to do time-limited tasks, such as proofreading or categorization.

Although several microtask websites exist, the most well-known is Amazon’s Mechanical Turk (MTurk, AMT), launched in November 2005 (Barr, 2005). On this platform, the “requestor” who is looking for workers sets the pay schedule and determines whether the task has been completed to satisfaction before releasing the pay. The “worker” chooses which tasks to complete. High-paying and interesting tasks are likely to be completed more quickly than others. Workers who perform many tasks and perform them well are granted higher status within the system, opening the opportunity for more interesting work and/or higher pay rates.

Noting its advantages for quickly reaching a larger and more demographically diverse sample than convenience sampling in one’s own community of undergraduates provides (Buhrmester, Kwang, & Gosling, 2011), researchers soon began using MTurk to pilot measures and instructions, collect survey data, and conduct online experiments. Concomitantly, empirical evaluations of the characteristics of samples and the quality of data resulting from MTurk have emerged. In this brief overview, we describe this emerging literature and offer some cautions and some “best practices” related to using MTurk for aging-related research.

MTurk in Aging Research

Amazon has not released data regarding the demographic characteristics of its workers, and reports in the published literature do not always include full demographics, such as age (Hauser & Schwarz, 2015). Although one might assume that MTurk and other microtask sites attract primarily young college students, when researchers have asked, “Who are these MTurkers?,” the results suggest that a majority of workers are in their late 20s and 30s, are slightly more likely to be male than female, have completed a college degree, tend to be Caucasian, and may be underemployed; although globally dispersed, the majority are from the USA and India (Huff & Tingley, 2015; Paolacci & Chandler, 2014).

Relatively few research studies using MTurk samples have been reported in the aging literature, and several likely reasons exist. Sometimes age categories are poorly formed. For example, one recent study reported age quintiles, with the oldest quintile including adults ages 37 to 75 years (Downs, Holbrook, Sheng, & Cranor, 2010). Survey research focused on age differences in technology use has reported ages up to 69 years (Ferraro, Wunderlich, Wyrobek, & Weivoda, 2014), but it is unclear how many middle-aged and older adults were in the sample. High-quality experimental research has been reported (Stothart, Boot, & Simons, 2015), but the authors acknowledged that even with a mean age of 36 years, only 22 of the adults in their sample were age 60 or older. Admittedly, the demands of the Stothart et al. study might have dissuaded some middle-aged and older MTurkers from participating. A recent study focusing on sleep disruptions among middle-aged and older adults (Golding, Nadorff, Winer, & Ward, 2014) reported a sample of N = 167, with a mean age of 60 (range 55 to 75 years). Local colleagues (Lemaster, Pichayayothin, & Strough, 2015) have recruited as many as 179 adults over age 60 years. In our own lab, we have fared about as well: in a study targeting adults over age 45 years, we recruited more than 400 adults (Graf & Patrick, 2014), with about 180 being over age 60 years.

One of the more useful features of MTurk is that research requestors can build a database of their own participants and invite them specifically to future studies. Thus, once a research lab identifies potential participants via MTurk, the lab may be able to continue contacting those workers within the platform.

Maximizing Data Quality

First, as with any research method, MTurk researchers need to be attentive to the four cornerstones of survey research (Dillman, Smyth, & Christian, 2014): coverage error, sampling error, nonresponse error, and measurement error. Each of these issues poses special challenges for any online study of age-related phenomena. If one is recruiting adults over age 18 years living in the USA, MTurk is likely to be a good resource. If, however, one is recruiting older widowed men with type 2 diabetes, recruiting through MTurk is unlikely to be effective. From what can be gleaned from the current literature, MTurk workers do include middle-aged and older adults. These adults, however, may not be representative of “older adults” on many factors: older MTurk workers are likely to be college educated, are clearly more engaged with technology than many of their peers, and may differ on other important characteristics.

In addition to good survey practices, MTurk-specific decisions can affect the data. For example, what to title the human intelligence task (HIT) matters. We clearly indicate that our task is research, and we include a standard online informed consent. Because researchers tend to pay well relative to businesses, and because people often find social science surveys interesting, research HITs fill quickly. Decisions about what time of day to open the survey also need to account for the varying time zones in the U.S.: if a survey opens at 7 a.m. EST, it may fill long before 7 a.m. on the West Coast (we thank the Strough lab at West Virginia University for this tip).

How much to pay is also important. Researchers sometimes comment on how inexpensive MTurk data are compared to in-person lab studies. Although the response burdens of in-person studies are probably greater than those of online surveys, we need to be mindful to award honoraria appropriately and to remember that MTurk is not a participant pool. Because MTurk is a microtask labor platform, workers are justified in raising the issue of low pay (Horton, 2011). The issue of honoraria, however, differs from the fair-wage argument in important ways. APA guidelines remind us not to offer honoraria so large that they would appear overly attractive or coercive. Our casual observation of the average honoraria for MTurk psychology studies suggests that the mean is around $1. If the survey requires fewer than eight minutes, that $1 works out to roughly the U.S. minimum wage. In our lab, we have posted surveys that require 20 to 30 minutes’ time. We usually pay between $2 and $3, which is on the high side of typical MTurk rates. This amount is not overly attractive or coercive from the viewpoint of the IRB, but it falls short of a minimum-wage standard. Other researchers pay at rates closer to those of businesses on MTurk, offering between 25 and 50 cents for 10- to 20-minute studies.
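To make these tradeoffs concrete, the short Python sketch below converts a flat honorarium into an hourly-equivalent rate. It is a minimal illustration; the durations and payments plugged in are only the rough figures mentioned above, not data from any particular study.

```python
def effective_hourly_rate(honorarium_usd: float, minutes: float) -> float:
    """Convert a flat honorarium into an hourly-equivalent pay rate."""
    return honorarium_usd / (minutes / 60.0)

# Illustrative figures from the text (survey durations are assumed, not measured):
print(effective_hourly_rate(1.00, 8))    # ~$7.50/hour, near the U.S. federal minimum wage
print(effective_hourly_rate(2.50, 25))   # ~$6.00/hour for a 20- to 30-minute survey paid $2-$3
print(effective_hourly_rate(0.50, 15))   # ~$2.00/hour at business-typical MTurk rates
```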

Validity of Responses

Dillman, Smyth and Christian (2014) advise researchers to (dis)qualify participants early, out of respect for their time. We agree, of course, but also note that in the MTurk environment, time is money, and workers may feel varying degrees of pressure to continue a survey for which they are not eligible. In our lab, we functionally accept everyone but funnel ineligible respondents to a parallel survey. Others (e.g., Downs et al., 2010) have collected demographic information early but do not indicate that it will be used to disqualify respondents. Still others have used specific age ranges in the titles of various surveys (e.g., Graf & Patrick, 2014; Lemaster et al., 2015), but doing so may create the need for additional data cleaning and response integrity checks.

Researchers are always concerned with the accuracy of the measurements they take, and moving to survey methods on microtask platforms such as MTurk adds new causes for concern. Researchers have adopted tried-and-true approaches, including redundancy checks (e.g., asking age in years on page one and year of birth on page 20), instructional manipulation checks (e.g., “for this item, choose response b”), and asking respondents to verify that they have answered honestly (Lemaster et al., 2015; Paolacci & Chandler, 2014; Rouse, 2015). In our own lab, we use a combination of these approaches and apply the decision rule that, to be retained in a data set, a respondent must pass at least 80 percent of the screening items. We use this more liberal approach because, although we do not require an answer to every question, we count a nonresponse as a “miss.” Stothart et al. (2015) also remind researchers to specify these decision rules a priori.
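For readers who want to script this kind of cleaning, the sketch below shows one way the 80-percent retention rule could be applied, with skipped screening items counted as misses. It is a minimal illustration under assumed names and data layout, not the code our lab actually uses.

```python
from typing import Optional, Sequence

def passes_screening(item_results: Sequence[Optional[bool]], threshold: float = 0.80) -> bool:
    """Return True if at least `threshold` of the screening items were passed.

    Each entry is True (passed), False (failed), or None (skipped);
    skipped items count as misses because answers are not required.
    """
    if not item_results:
        return False
    passed = sum(1 for result in item_results if result is True)
    return passed / len(item_results) >= threshold

# Example: four of five screening items passed, one skipped -> 0.80, so the case is retained.
print(passes_screening([True, True, True, True, None]))  # True
```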

A different opinion is offered by Downs et al. (2010). They note that because instructional manipulation checks (“choose answer 3”) are so transparent, only the least conscientious responders fail these items. Moreover, they note that such items alter the relationship between requestor and worker, highlighting a mistrust that is unseemly for business and for research. They do, however, offer interesting solutions that employ fact-based screening and skill assessments. We encourage other researchers to read and implement Downs et al.’s suggestions.

MTurk is aware that unscrupulous behavior from both requestors and workers is possible. For example, requestors determine whether a task has been completed with sufficient accuracy to merit paying the worker; withholding payment, of course, decreases the likelihood that a requestor will continue to attract the most careful and skilled workers. For workers who rely on MTurk to supplement their living expenses (Mason & Suri, 2012), there is sufficient incentive to complete tasks at a minimal level or to “double dip” by holding more than one worker account. Prior to the summer of 2015, all that was needed to establish a worker account was a unique email address. Changes have since been implemented, and there are now tax implications for both requestors and workers, adding a layer of accountability and reducing the potential for a single individual to hold multiple accounts. In addition, a requestor can restrict future participation by a worker whose work is substandard. Our lab has typically paid all honoraria but reserves the right to disallow random “button-mashers” from participating in our future studies.

Finally, Rouse (2015) reminds researchers to report reliability coefficients obtained from MTurk samples and to compare them with coefficients obtained from the norming sample. If lower reliabilities are obtained with MTurk samples, power analyses need to incorporate this information. We are excited by the few replication studies that have been reported, including those from Wally Boot’s lab (Stothart et al., 2015) and from Crump, McDonnell, and Gureckis (2013). We are cautiously optimistic that MTurk-based studies have the potential to expand the reach of aging research and may yield important insights in the years to come.
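As a concrete starting point for Rouse’s recommendation, the snippet below computes Cronbach’s alpha from an item-level response matrix so that an MTurk sample’s reliability can be set alongside the published norming value. It is a generic sketch using the standard formula; the data layout, file name, and norming value are assumptions, not details from any study cited here.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for a (respondents x items) matrix of scale scores."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)      # variance of each item
    total_variance = items.sum(axis=1).var(ddof=1)  # variance of the summed scale score
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical usage: load item-level responses from the MTurk sample and compare
# the resulting alpha with the coefficient published for the norming sample.
# mturk_items = np.loadtxt("mturk_scale_items.csv", delimiter=",")
# print(cronbach_alpha(mturk_items))
```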

References

Barr, J. (2005). Amazon’s Mechanical Turk: The First Three Weeks, AWS blog post. Retrieved from https://aws.amazon.com/blogs/aws/amazons_mechani/

Buhrmester, M., Kwang, T., & Gosling, S.D. (2011). Amazon's Mechanical Turk: A new source of inexpensive, yet high-quality, data?. Perspectives on Psychological Science, 6(1), 3-5. doi: 10.1177/1745691610393980.

Crump, M.J., McDonnell, J.V., & Gureckis, T.M. (2013). Evaluating Amazon's Mechanical Turk as a tool for experimental behavioral research. PLoS ONE, 8(3), e57410.

Dillman, D.A., Smyth, J.D., & Christian, L.M. (2014). Internet, phone, mail, and mixed-mode surveys: The tailored design method. Hoboken, New Jersey: John Wiley & Sons.

Downs, J.S., Holbrook, M.B., Sheng, S., & Cranor, L.F. (2010, April). Are your participants gaming the system? Screening Mechanical Turk workers. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 2399-2402). ACM.

Golding, S., Nadorff, M.R., Winer, E.S., & Ward, K.C. (2014). Unpacking sleep and suicide in older adults in a combined online sample. Journal of Clinical Sleep Medicine, 11(12), 1385-1392.

Graf, A.S., & Patrick, J.H. (2014). The influence of sexual attitudes on mid- to late-life sexual well-being: Age, not gender, as a salient factor. The International Journal of Aging and Human Development, 79(1), 55-79.

Hauser, D.J., & Schwarz, N. (2015). Attentive Turkers: MTurk participants perform better on online attention checks than do subject pool participants. Behavior Research Methods, 1-8.

Horton, J.J. (2011). The condition of the Turking class: Are online employers fair and honest?. Economics Letters, 111(1), 10-12.

Huff, C., & Tingley, D. (2015). “Who are these people?” Evaluating the demographic characteristics and political preferences of MTurk survey respondents. Research & Politics, 2(3), 2053168015604648.

Lemaster, P., Pichayayothin, N.B., & Strough, J. (2015). Using Amazon's Mechanical Turk to recruit older adults: Easy and cheap, but is it valid? Poster presented at the Annual Scientific Meeting of the Gerontological Society of America, Orlando, Florida.

Mason, W., & Suri, S. (2012). Conducting behavioral research on Amazon's Mechanical Turk. Behavior Research Methods, 44, 1–23. doi: 10.3758/s13428-011-0124-6.

Paolacci, G., & Chandler, J. (2014). Inside the Turk: Understanding Mechanical Turk as a participant pool. Current Directions in Psychological Science, 23(3), 184-188.

Rouse, S. V. (2015). A reliability analysis of Mechanical Turk data. Computers in Human Behavior, 43, 304-307.
