\subsection{Mechanical Turk and the Promise—and Limits—of Crowdsourced Annotation}

The emergence of crowdsourcing platforms such as Amazon Mechanical Turk (MTurk) marked a transformative moment in how the NLP community approached data annotation. Initially seen as a solution to the bottlenecks of expert annotation, especially in scaling to large datasets, MTurk enabled rapid, low-cost collection of labeled data from a distributed pool of non-expert workers. Early work highlighted both the promise and the pitfalls of this strategy, and over time a more nuanced understanding of its limitations has emerged.

\paragraph{The Promise.}
Amazon introduced Mechanical Turk in 2005 as a platform for ``artificial artificial intelligence'' \citep{borthwick2005mechanical}, designed to crowdsource tasks that were hard for machines but trivial for humans. The platform was quickly adopted by NLP researchers. \citet{snow2008cheap} conducted one of the first systematic evaluations of MTurk for linguistic annotation, demonstrating that aggregating non-expert judgments could rival expert-level annotation across tasks such as word sense disambiguation and textual entailment. Crucially, this study provided early evidence that annotation quality could be recovered through redundancy and statistical modeling, opening the door to broader adoption of crowdsourced labeling.

In parallel, \citet{callison2009fast} applied MTurk to the evaluation of machine translation output, showing that inexpensive crowd-based assessments correlated well with expert judgments. These findings established MTurk not only as a tool for dataset construction but also as a viable component of evaluation pipelines. The Stanford NLI dataset \citep{bowman2015large}, a cornerstone of modern semantic inference research, was built with a hybrid design: premises were drawn from existing image captions, hypotheses were written by MTurk workers, and a subset of the resulting pairs was then validated by additional annotators.
This demonstrated how crowdsourcing could scale the creation of complex semantic datasets when combined with automation and careful task design.

\paragraph{The Limits.}
Despite early optimism, several studies began to identify important caveats. \citet{fort2011amazon} critically examined MTurk's role in NLP, warning of its ethical blind spots: low pay rates, the lack of labor protections, and the cognitive toll of repetitive or emotionally taxing tasks. They also pointed to inconsistencies in annotation quality, particularly in subjective or ambiguous tasks, where worker motivation and task understanding were difficult to control.

Subsequent work has emphasized that the assumptions behind crowdsourcing often mask deeper structural problems. \citet{sabou2014corpus} argued for best-practice guidelines in crowdsourced annotation, highlighting the need for clear instructions, quality-control mechanisms, and fair treatment of annotators. \citet{paullada2021data} provided a broader critique, noting that reliance on large-scale crowd annotation has often produced datasets that are poorly documented, unrepresentative, or ethically problematic. They argue for a data-centric reevaluation of machine learning pipelines, in which the origin, curation, and social implications of datasets are treated as core scientific concerns.

\paragraph{Reflection.}
Taken together, these works illustrate the dual nature of crowdsourced annotation. On one hand, platforms like MTurk have made it possible to scale up dataset construction rapidly and affordably, which has been vital to the development of large neural models. On the other hand, the limitations, both practical and ethical, have become increasingly visible, especially as annotation tasks grow more complex and value-laden.
In contemporary NLP, annotation practices are shifting toward more curated, expert-driven, or hybrid workflows (e.g., combining LLM-generated suggestions with human verification) as researchers grapple with how to balance annotation quality, fairness, and sustainability.
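The redundancy-based quality recovery discussed above can be illustrated with a minimal sketch: collect several labels per item from different workers and aggregate them by majority vote, using the agreement ratio as a rough confidence signal. This is a simplified stand-in for the bias-corrected statistical aggregation used by \citet{snow2008cheap}; the item identifiers and labels below are hypothetical.

\begin{verbatim}
from collections import Counter

def aggregate_majority(annotations):
    """Aggregate redundant crowd labels per item by majority vote.

    annotations: dict mapping item id -> list of labels from
    different workers. Returns dict mapping item id ->
    (winning label, agreement ratio among that item's workers).
    """
    consensus = {}
    for item, labels in annotations.items():
        counts = Counter(labels)
        label, votes = counts.most_common(1)[0]
        consensus[item] = (label, votes / len(labels))
    return consensus

# Hypothetical example: five workers label each sentence pair.
crowd = {
    "pair-1": ["entailment", "entailment", "neutral",
               "entailment", "entailment"],
    "pair-2": ["contradiction", "neutral", "contradiction",
               "contradiction", "neutral"],
}
result = aggregate_majority(crowd)
# "pair-1" -> ("entailment", 0.8); "pair-2" -> ("contradiction", 0.6)
\end{verbatim}

Low agreement ratios flag items for expert adjudication or additional redundancy, which is the practical mechanism by which non-expert pools can approach expert-level quality.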