An analysis of Wikipedia’s special tags and their implications for the spectrum between human and bot edits
Abstract
Wikipedia articles are a key training source for Natural Language Processing (NLP) research. The platform’s metadata records edit histories, including whether edits were made by humans or by bots. A lesser-studied part of this metadata is the set of “special tags,” which can mark edits made with semi-automated tools such as Twinkle or by fully automated bots such as Cewbot. These tools span a spectrum of automation, from light assistance, such as spell checking, to heavy automation, such as AI-generated text later edited by humans. This paper examines how Wikipedia’s special tags are used and explores their potential to trace the continuum between human and bot edits, a question made increasingly relevant by the rise of generative AI tools. We employ a structured SQL-based analysis through Quarry, Wikimedia’s public query service, complemented by manual inspection of revision histories to capture patterns that metadata alone does not record. This hybrid approach moves beyond surface-level records to uncover gaps between the official tag data and actual editing behavior. The study demonstrates the value of combining SQL-based analysis with human verification to improve the historical accuracy of Wikipedia’s tag data. The resulting dataset supports machine-learning training, knowledge-graph construction, and analyses of cooperative online behavior.
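As a minimal sketch of the kind of tag-frequency query the SQL-based analysis relies on: the table and column names below follow the public MediaWiki schema (`change_tag_def` stores each tag’s name and total usage count), but the rows are invented toy data, and an in-memory SQLite database stands in for the replica databases that Quarry queries.

```python
import sqlite3

# In-memory stand-in for the MediaWiki change_tag_def table,
# which records each edit tag's name and cumulative usage count.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE change_tag_def (
        ctd_id    INTEGER PRIMARY KEY,
        ctd_name  TEXT,
        ctd_count INTEGER
    )
    """
)

# Toy rows: the tag names are real Wikipedia edit tags
# (Twinkle edits, Huggle edits, software-applied undo tags),
# but the counts here are invented for illustration.
conn.executemany(
    "INSERT INTO change_tag_def (ctd_name, ctd_count) VALUES (?, ?)",
    [("twinkle", 120), ("huggle", 75), ("mw-undo", 10)],
)

# The shape of query one would run on Quarry:
# rank special tags by how often they have been applied.
rows = conn.execute(
    "SELECT ctd_name, ctd_count FROM change_tag_def "
    "ORDER BY ctd_count DESC"
).fetchall()

for name, count in rows:
    print(name, count)
```

In the actual study, queries like this run against the Wikimedia database replicas, and their output is what the manual inspection of revision histories is checked against.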