PyBossa demo project PDF Transcribe
PDF Transcribe is a demo project for PyBossa that shows how you can crowdsource a PDF transcription problem.
Using PDF.JS, we can render almost any PDF that is hosted on an HTTP server and then use a customized form to collect the data that we want to extract from it.
In this simple demo project, we load a PDF file on one side of the page and, on the other, a form where the volunteer can transcribe the PDF page by typing the text into the input field. While this example is very simple, adapting the template to extract specific pieces of information from the PDF is easy: you only need to add more HTML input fields with instructions about what you want to extract. For example, you could extract specific items from the documents, such as captions, tabular data, authorship, or institutions.
The provided script for creating the tasks is very simple: you only need to tell it the URL where the PDF file is hosted and which pages you want to convert into tasks. By default, this demo uses the 14 pages of the example PDF file.
Re-using the project as the basis for your own
You need to install the pybossa-pbs library first. Use of a virtual environment is recommended:
$ virtualenv env
$ source env/bin/activate
$ pip install -r requirements.txt
Creating an account in a PyBossa server
Now that you have all the requirements installed on your system, you need a PyBossa account:
- Create an account in your PyBossa server (use Crowdcrafting if you want).
- Copy your API-KEY (you can find it in your profile page).
Configure pybossa-pbs command line
The pybossa-pbs command line tool can be configured with a config file so that you do not have to type the API-KEY and the server every time you take an action on your project. We therefore recommend creating the config file. To create it, follow these steps:
$ cd ~
$ editorofyourchoice .pybossa.cfg
That will create the file. Now paste the following into it:
[default]
server: http://yourpybossaserver.com
apikey: yourapikey
Save the file, and you are done! From now on, pybossa-pbs will always use the default section to run your commands.
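As a quick sanity check, the config file can be parsed with Python's standard configparser module, which accepts the `key: value` syntax used in the file. This is only a sketch for verifying the file by hand: the helper name `read_pbs_config` is hypothetical, and it is not necessarily how pbs itself reads the file.

```python
import configparser
import os


def read_pbs_config(path, section="default"):
    """Return (server, apikey) from a .pybossa.cfg-style file.

    Hypothetical helper for illustration; not part of pybossa-pbs.
    """
    # Disable interpolation so a '%' in a value is read literally.
    parser = configparser.ConfigParser(interpolation=None)
    # parser.read() silently skips missing files, so check its result.
    if not parser.read(path):
        raise FileNotFoundError(path)
    return parser[section]["server"], parser[section]["apikey"]


if __name__ == "__main__":
    cfg_path = os.path.expanduser("~/.pybossa.cfg")
    if os.path.exists(cfg_path):
        server, apikey = read_pbs_config(cfg_path)
        print("server:", server)
        # Avoid echoing the API key itself.
        print("apikey set:", bool(apikey))
    else:
        print("No config file found at", cfg_path)
```

If the script raises a KeyError, the section name or one of the two keys is probably misspelled in the file.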
Create the project
Now that we have everything in place, creating the project is as simple as running this command:
$ pbs create_project
Using a CSV or JSON file for adding tasks
This is also very simple. A sample tasks CSV file named ‘pdf_tasks.csv’ is included here. You can adapt it to the URLs of your own PDF files, and then let pbs do the job:
$ pbs add_tasks --tasks-file pdf_tasks.csv
Note that the file has three columns (or keys, if you work with an equivalent JSON file) that this template expects:
- pdf_url: the URL where the PDF file will be loaded from.
- question: some text you want to display, giving instructions on what the user needs to do.
- page: an optional field that restricts the display to the specified page if the PDF document has multiple pages. If omitted, the whole document will be shown (with pagination).
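A tasks file with these columns can be generated rather than written by hand. The sketch below builds one row per page with Python's csv module and prints the result. The URL, the question text, and the function names `build_tasks` and `write_tasks_csv` are placeholders chosen for illustration, and the page range of 1 to 14 simply mirrors the example PDF mentioned earlier.

```python
import csv
import sys


def build_tasks(pdf_url, question, pages):
    """Return one task dict per page, using the three columns listed above."""
    return [{"pdf_url": pdf_url, "question": question, "page": page}
            for page in pages]


def write_tasks_csv(fh, tasks):
    """Write the tasks as CSV, with a header row, to an open file object."""
    writer = csv.DictWriter(fh, fieldnames=["pdf_url", "question", "page"])
    writer.writeheader()
    writer.writerows(tasks)


if __name__ == "__main__":
    tasks = build_tasks(
        "http://example.com/document.pdf",  # placeholder URL
        "Please transcribe this page",      # placeholder instructions
        range(1, 15),                       # pages 1 to 14
    )
    write_tasks_csv(sys.stdout, tasks)
```

Redirecting the output to a file gives something that can be passed to `pbs add_tasks --tasks-file`.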
Using the Dropbox importer (via web)
You can also use the built-in Dropbox importer that comes with PyBossa servers (if configured by the admin). For more details, please visit the PyBossa documentation.
Finally, add the task presenter, tutorial and long description
Now that we have some data to process, let’s add the required templates to our project: a better description of the project, the task presenter shown to our users, and a small tutorial for the volunteers:
$ pbs update_project
NOTE: we also provide templates for Bootstrap v2 in case your PyBossa server uses Bootstrap 2 instead of Bootstrap 3. See the rest of the files.
Please check the full documentation on how to create a project from the command line with pbs.
Setting up your Apache web server for hosting the PDF files
Usually you will have a set of PDF files that you are currently serving via a web server.
If you use the project as it is, you will see that loading the PDFs does not work, even though the URLs are fine and the PDF pages are correct in the Google Spreadsheet that you have created. The problem is that the browser loads the PDF files from a different domain than the PyBossa server, so you need to enable CORS on the server that hosts them.
The Enable CORS website explains how to configure most web servers properly, so that this project can load the PDF files from other domains without problems. For example, for an Apache web server, all you have to do is enable the mod_headers module:
$ sudo a2enmod headers
Then open the site config file, e.g. /etc/apache2/sites-enabled/000-default, and add the following to the VirtualHost section:
Header set Access-Control-Allow-Origin "*"
Finally, restart the web server and you are done! The PDFs should now load without problems. Note: to avoid enabling CORS for your whole site, you can use .htaccess files or, if you prefer, place the previous directive in a Directory or Location section instead of at the level of the VirtualHost section.
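To confirm that the header is actually being sent, the response headers for one of your PDFs can be inspected. The following is a minimal sketch using only the Python standard library. The function names and the URLs in the usage comment are hypothetical, and the check covers only the simple `Access-Control-Allow-Origin` header described above, not preflight requests.

```python
import urllib.request


def allows_origin(headers, origin):
    """Return True if the response headers permit cross-origin reads from `origin`."""
    # Header names are case-insensitive, so normalise them first.
    normalised = {name.lower(): value.strip() for name, value in headers.items()}
    allowed = normalised.get("access-control-allow-origin")
    return allowed is not None and allowed in ("*", origin)


def fetch_headers(url):
    """Issue a HEAD request and return the response headers as a dict."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=10) as response:
        return dict(response.headers.items())


# Example usage (placeholder URLs; requires network access):
#   headers = fetch_headers("http://example.com/document.pdf")
#   print(allows_origin(headers, "http://yourpybossaserver.com"))
```

If the check fails after the configuration change, verify that mod_headers is enabled and that the web server has been restarted.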
Using Dropbox to host your PDF files
Alternatively, if you are using a PyBossa server configured to be integrated with Dropbox (like Crowdcrafting) you can use the built-in Dropbox importer to serve the PDF files directly from a Dropbox account. Check the PyBossa docs for more details.
Please see the COPYING file.
The thumbnail has been created using a photo from TempusVolat (license CC BY-NC-SA 2.0).
Special thanks to Miquel Herrera for his JS libraries for the canvas scrolling, and Mozilla Foundation for their PDF.JS library.