
The DPLA Ingestion System


Documentation

Please see the release notes regarding changes and upgrade steps.

Setting up the ingestion server:

Install Python 2.7 if not already installed (http://www.python.org/download/releases/2.7/);

Install PIP (http://pip.readthedocs.org/en/latest/installing.html);

Install the ingestion subsystem;

$ pip install --no-deps --ignore-installed -r requirements.txt
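Optionally, you can keep these pinned dependencies isolated from your system packages by installing inside a Python 2.7 virtualenv first. A minimal sketch (the environment name here is arbitrary); run the pip command above from inside the activated environment:

$ pip install virtualenv
$ virtualenv -p python2.7 ingestion-env
$ source ingestion-env/bin/activate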

Configure an akara.ini file appropriately for your environment;

[Akara]
Port=<port for Akara to run on>
; Recommended LogLevel is one of DEBUG or INFO
LogLevel=<priority>

[Bing]
ApiKey=<your Bing Maps API key>

[CouchDb]
Url=<URL to CouchDB instance, including trailing forward-slash>
Username=<CouchDB username>
Password=<CouchDB password>
SyncQAViews=<True or False; consider False on production>
; Recommended LogLevel is INFO for production; defaults to INFO if not set
LogLevel=<priority>

[Geonames]
Username=<Geonames username>
Token=<Geonames token>

[Rackspace]
Username=<Rackspace username>
ApiKey=<Rackspace API key>
DPLAContainer=<Rackspace container for bulk download data>
SitemapContainer=<Rackspace container for sitemap files>

[APITokens]
NYPL=<Your NYPL API token>

[Sitemap]
SitemapURI=<Sitemap URI>
SitemapPath=<Path to local directory for sitemap files>

[Alert]
To=<Comma-separated email addresses to receive alert email>
From=<Email address to send alert email>

[Enrichment]
QueueSize=4
ThreadCount=4
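For reference, a minimal development configuration might look like the following. All values are illustrative; substitute your own hosts and credentials, and fill in the remaining sections ([Bing], [Geonames], [Rackspace], [APITokens], [Sitemap], [Alert]) as your environment requires. The port matches the curl examples later in this document, and the CouchDB URL assumes a local instance on its default port:

[Akara]
Port=8889
LogLevel=INFO

[CouchDb]
Url=http://localhost:5984/
Username=admin
Password=secret
SyncQAViews=False
LogLevel=INFO

[Enrichment]
QueueSize=4
ThreadCount=4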

Merge the akara.conf.template and akara.ini files to create the akara.conf file;

$ python setup.py install 

Set up and start the Akara server;

$ akara -f akara.conf setup
$ akara -f akara.conf start
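To verify that Akara came up, you can issue a request against the port you configured (8889 in the examples below); a connection refusal means the server is not running:

$ curl http://localhost:8889/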

Build the database views;

$ python scripts/sync_couch_views.py dpla
$ python scripts/sync_couch_views.py dashboard
$ python scripts/sync_couch_views.py bulk_download
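You can confirm the databases exist by listing them through CouchDB's standard _all_dbs endpoint (assuming CouchDB runs locally on its default port 5984):

$ curl http://localhost:5984/_all_dbs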

Testing the ingestion server:

You can test it by harvesting this sample set from Clemson;

$ curl "http://localhost:8889/oai.listrecords.json?endpoint=http://repository.clemson.edu/cgi-bin/oai.exe&oaiset=jfb&limit=10" 

If you have the endpoint URL but not a set id, there's a separate service for listing the sets;

$ curl "http://localhost:8889/oai.listsets.json?endpoint=http://repository.clemson.edu/cgi-bin/oai.exe&limit=10"

To run the ingest process, run the setup.py script (if you have not already done so), initialize the database and database views, then feed the ingest script a source profile (found in the profiles directory);

$ python setup.py install
$ python scripts/sync_couch_views.py dpla
$ python scripts/sync_couch_views.py dashboard
$ python scripts/ingest_provider.py profiles/clemson.pjs
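After the ingest finishes, one quick sanity check is CouchDB's database-info endpoint, whose doc_count field should reflect the harvested records (again assuming a local CouchDB on its default port 5984):

$ curl http://localhost:5984/dpla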

License

This application is released under the AGPLv3 license.

  • Copyright Digital Public Library of America, 2015

