This repo is part of the OA Signalling project that aims to build a system to signal whether references cited on Wikipedia are free to reuse.
Cited sources form an integral part of both scholarly communication and Wikipedia. They are meant to support statements made in the citing articles and invite readers to dive deeper into the subject at hand.
Enhancing the accessibility of cited sources thus contributes to the educational mission of the Wikimedia community. Many sources, however, are not accessible to the average Wikipedia reader because they sit behind paywalls, and many of those that are free to read cannot be freely reused.
For scholarly articles, a system that provides article-level licensing information is currently being developed by DOAJ and CrossRef. This resource could be tapped for signalling the openness of references cited on Wikipedia.
It is the aim of this project to provide the technical infrastructure that would enable that, and to engage the Wikimedia and Open Access communities towards implementing it.
Here is a short version of the envisaged workflow (components central to the project are marked in bold):
- listen to RecentChanges feed across all Wikimedia wikis (cf. event-data-wikipedia-agent)
- filter by bibliographic identifier for papers (currently only DOI, long-term also PubMed ID, PMC ID, arXiv ID, JSTOR ID and perhaps others)
- check whether paper was cited or uncited (all steps until here are included in CrossRef’s live stream of DOI citations in Wikipedia)
- handle potential vandalism/spam, e.g. via Revision scoring
- pull paper metadata from a suitable source (e.g. from CrossRef/DataCite for DOIs); Recitation bot does that, and so does Source, M.D.
- check whether that paper is available on Wikisource (initially only English, long-term other languages too)
  - if so, check proper representation of the paper and its metadata on Wikisource (as well as on Commons, Wikidata and Wikipedia) and, in case of inconsistencies, notify someone (e.g. the original citer and/or a relevant WikiProject, or simply a tracking page)
  - if not, check whether that paper is available in JATS (currently only via PubMed Central, but long-term from anywhere); Recitation bot does that
    - if so, check licensing of the paper
      - if license is open, convert paper’s JATS XML to MediaWiki XML
        - upload full text to Wikisource (Recitation bot does that; see contribution history, on-wiki page list and tracking categories)
        - check for consistency with original (perhaps via fuzzy anchoring?)
        - upload images and media to Wikimedia Commons (requires duplicate detection, since many images and videos are already there; Recitation bot does that too, see contribution history and tracking categories; there is an unresolved issue with high-res images); for video or audio files (covered by the Open Access Media Importer), put a copy of the original file onto the Schnittserver
      - if license is not open, notify OA Button (perhaps via OABOT?)
- start or update the Wikidata items for paper and/or authors as necessary, perhaps even for references cited in the paper (bib2wikidata can upload CSL)
- check whether the initial citation that was identified through the RecentChanges stream is pulling bibliographic metadata from Wikidata
  - if so, purge page to refresh display of citation information
  - if not, update original citation with licensing/OA Button info and links to Wikisource, Commons, Wikidata, as necessary
- keep track of revisions of cited references via CrossMark and notify someone of retractions etc.
- keep track of further citations (of the same cited reference) from within and beyond Wikimedia, e.g. via the DOI Event Tracker, and notify someone (including the Cite-o-Meter)
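
The DOI filtering and cited/uncited steps near the top of the workflow can be illustrated with a small sketch. It is not the code used by CrossRef's live stream or by any existing agent; the regular expression is a common heuristic for DOIs rather than a full validator, and the names `DOI_PATTERN`, `extract_dois` and `diff_citations` are hypothetical. It assumes the old and new wikitext of a revision are already at hand.

```python
import re

# Heuristic DOI pattern (assumption): "10.", a 4-9 digit registrant code, "/",
# then a suffix that stops at whitespace or common wikitext delimiters.
DOI_PATTERN = re.compile(r'\b10\.\d{4,9}/[^\s|}\]<>"]+')


def extract_dois(wikitext):
    """Return the set of lower-cased DOIs found in a piece of wikitext."""
    # Trailing sentence punctuation is stripped because it is rarely part of a DOI.
    return {m.group(0).rstrip(".,;").lower() for m in DOI_PATTERN.finditer(wikitext)}


def diff_citations(old_text, new_text):
    """Compare two revisions and report which DOIs were added or removed."""
    old, new = extract_dois(old_text), extract_dois(new_text)
    return {"cited": new - old, "uncited": old - new}


if __name__ == "__main__":
    before = "{{cite journal |doi=10.1371/journal.pone.0000001 |title=A}}"
    after = "{{cite journal |doi=10.1000/xyz123 |title=B}}"
    print(diff_citations(before, after))
```

A DOI that appears in both revisions shows up in neither set, so only genuine additions and removals would be passed on to the later steps.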
Most of the components of this workflow already exist but need some tweaking or polishing to fit our purposes better or to be joined into a pipeline.
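
To show how the branching part of the workflow (Wikisource check, JATS check, licence check) could be joined into a single decision, here is a minimal sketch. Everything in it is hypothetical: the `Paper` record, the action names and the `OPEN_LICENSES` set are placeholders, since the workflow does not say which licences count as open. What happens when no JATS version exists is also not spelled out above, so the sketch simply assumes that full-text import is skipped.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: an illustrative set only; which licences count as "open"
# is a project decision that is not defined in the workflow.
OPEN_LICENSES = {"cc-by", "cc-by-sa", "cc0"}


@dataclass
class Paper:
    doi: str
    on_wikisource: bool
    has_jats: bool
    license: Optional[str] = None  # None means the licence is unknown


def next_action(paper):
    """Map the state of a cited paper to the next workflow step (hypothetical names)."""
    if paper.on_wikisource:
        # Already imported: only check the representation and metadata.
        return "verify_wikisource_metadata"
    if not paper.has_jats:
        # Assumption: without JATS XML there is nothing to convert.
        return "skip_full_text_import"
    if paper.license is not None and paper.license.lower() in OPEN_LICENSES:
        return "convert_jats_and_upload"
    # Closed or unknown licence: signal it instead of importing.
    return "notify_oa_button"


if __name__ == "__main__":
    paper = Paper(doi="10.1000/xyz123", on_wikisource=False, has_jats=True, license="CC-BY")
    print(next_action(paper))
```

Keeping the decision in one pure function would let existing components such as Recitation bot be attached to the returned action without changing the branching logic.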