
Pachyderm: A Containerized, Version-Controlled Data Lake

Pachyderm is:

For more details, see what’s new about Pachyderm.

Getting Started

Install Pachyderm locally or deploy on AWS/GCE/Azure in about 5 minutes.

You can also refer to our complete developer docs to see tutorials, check out example projects, and learn about advanced features of Pachyderm.

If you’d like to see some examples and learn about core use cases for Pachyderm:

  • Examples
  • Use Cases
  • Case Studies: Learn how General Fusion uses Pachyderm to power commercial fusion research.

Documentation

Official Documentation

What’s new about Pachyderm? (How is it different from Hadoop?)

There are two bold new ideas in Pachyderm:

  • Containers as the core processing primitive
  • Version Control for data

These ideas lead directly to a system that’s much more powerful and flexible, and much easier to use.

To process data, you simply create a containerized program which reads and writes to the local filesystem. You can use any tools you want because it’s all just going in a container! Pachyderm will take your container and inject data into it. We’ll then automatically replicate your container, showing each copy a different chunk of data. With this technique, Pachyderm can scale any code you write to process up to petabytes of data (Example: distributed image processing).
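As a rough illustration of "a program which reads and writes to the local filesystem", here is a minimal Python sketch of the kind of code you might put in a container. The function name `process_datum`, the line-counting workload, and the directory arguments are all hypothetical; this README does not say where Pachyderm mounts data, so the `/pfs/...` paths mentioned in the comments are an assumption, not something confirmed here.

```python
import tempfile
from pathlib import Path


def process_datum(input_dir: Path, output_dir: Path) -> int:
    """Count the lines of every file under input_dir and write one
    result file per input file under output_dir.

    Returns the number of files processed.
    """
    output_dir.mkdir(parents=True, exist_ok=True)
    processed = 0
    for path in sorted(input_dir.rglob("*")):
        if not path.is_file():
            continue
        line_count = len(path.read_text().splitlines())
        # Mirror the input's relative path so that copies of this program
        # working on different chunks never write to the same output file.
        target = output_dir / path.relative_to(input_dir)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(f"{line_count}\n")
        processed += 1
    return processed


if __name__ == "__main__":
    # Stand-in directories for a local run. Inside a container these would
    # be the mount points for input and output data (assumed to look like
    # /pfs/<repo> and /pfs/out; that detail is not taken from this README).
    with tempfile.TemporaryDirectory() as tmp:
        in_dir, out_dir = Path(tmp) / "in", Path(tmp) / "out"
        in_dir.mkdir()
        (in_dir / "a.txt").write_text("one\ntwo\n")
        print(process_datum(in_dir, out_dir))
```

The program knows nothing about distribution: it only sees whatever files appear in its input directory, which is what lets each replicated copy be handed a different chunk of data.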

Pachyderm also version-controls all data using a commit-based distributed filesystem (PFS), similar to what Git does with code. Version control for data has far-reaching consequences in a distributed filesystem. You get the full history of your data, can track changes and collaborate with teammates, and if anything goes wrong you can revert the entire cluster with one click!
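To make the idea of commit-based versioning concrete, here is a toy, in-memory Python sketch. It is not PFS and does not use any Pachyderm API; `ToyRepo`, `Commit`, and every method on them are hypothetical names invented for illustration. Each commit stores a full snapshot plus a pointer to its parent, which is enough to show history and revert.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Commit:
    id: str
    parent: Optional["Commit"]
    files: Dict[str, bytes]  # full snapshot: path -> content


class ToyRepo:
    """Single-branch, in-memory illustration of versioned data."""

    def __init__(self) -> None:
        self.head: Optional[Commit] = None

    def commit(self, changes: Dict[str, Optional[bytes]]) -> Commit:
        """Apply changes on top of head (None deletes a path) and
        record the result as a new commit."""
        files = dict(self.head.files) if self.head else {}
        for path, content in changes.items():
            if content is None:
                files.pop(path, None)
            else:
                files[path] = content
        # Derive the id from the parent id and the snapshot contents.
        parent_id = self.head.id if self.head else ""
        digest = hashlib.sha256(parent_id.encode())
        for path in sorted(files):
            digest.update(path.encode() + b"\0" + files[path] + b"\0")
        self.head = Commit(digest.hexdigest()[:12], self.head, files)
        return self.head

    def history(self) -> List[str]:
        """Commit ids from newest to oldest."""
        ids, commit = [], self.head
        while commit is not None:
            ids.append(commit.id)
            commit = commit.parent
        return ids

    def revert(self, commit_id: str) -> Commit:
        """Move head back to an earlier commit in the history."""
        commit = self.head
        while commit is not None and commit.id != commit_id:
            commit = commit.parent
        if commit is None:
            raise KeyError(commit_id)
        self.head = commit
        return commit
```

A real distributed filesystem would store data far more efficiently than full snapshots; the sketch only shows why keeping commits immutable and linked makes "full history" and "revert" cheap operations.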

Version control also works hand in hand with our containerized processing engine. Pachyderm understands how your data changes and so, as new data is ingested, can run your workload on the diff of the data rather than on the whole thing. This means that there’s no difference between a batched job and a streaming job: the same code will work for both!
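The "run on the diff" idea can be sketched in a few lines of Python. This is an illustration under simplifying assumptions (snapshots are plain dictionaries, and each file is processed independently), not a description of how Pachyderm implements it; `changed_paths` and `run_incremental` are hypothetical names.

```python
from typing import Callable, Dict, Set, TypeVar

R = TypeVar("R")


def changed_paths(old: Dict[str, bytes], new: Dict[str, bytes]) -> Set[str]:
    """Paths that were added or whose content differs between snapshots.

    Contents are compared directly here; a real system would more likely
    compare stored content hashes.
    """
    return {path for path, data in new.items() if old.get(path) != data}


def run_incremental(
    old: Dict[str, bytes],
    new: Dict[str, bytes],
    previous_results: Dict[str, R],
    process: Callable[[bytes], R],
) -> Dict[str, R]:
    """Reuse earlier results and call process() only on changed files."""
    # Keep results for files that still exist; drop results for deleted files.
    results = {p: r for p, r in previous_results.items() if p in new}
    for path in sorted(changed_paths(old, new)):
        results[path] = process(new[path])
    return results
```

Called with an empty `old` snapshot, `run_incremental` processes everything, which is the batch case. Called again after each new commit, it touches only what changed, which is the streaming case. The `process` function is the same in both.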

Community

Keep up to date and get Pachyderm support via:

  • Twitter
  • Slack: Join our community Slack channel to get help from the Pachyderm team and other users.

Contributing

To get started, sign the Contributor License Agreement.

You should also check out our contributing guide.

Send us PRs; we would love to see what you do! You can also check our GitHub issues for things labeled “noob-friendly” as a good place to start. We’re sometimes bad about keeping that label up to date, so if you don’t see any, just let us know.

Join Us

WE’RE HIRING! Love Docker, Go and distributed systems? Learn more about our team and email us at [email protected]

Usage Metrics

Pachyderm automatically reports anonymized usage metrics. These metrics help us understand how people are using Pachyderm and make it better. They can be disabled by setting the environment variable METRICS to false in the pachd container.

Related Repositories

pachyderm

Containerized Data Analytics ...

neon-workshop

A Pachyderm deep learning tutorial for conference workshops ...

pachyderm-mleap-demo

Pachyderm/MLeap team up to provide versioned datasets + models ...

pachyderm-client

Pachyderm gRPC client with Scala niceties ...

pach-neon

An example Pachyderm ML pipeline using Nervana Neon ...


Top Contributors

jdoliner sjezewski peter-edge derekchiang JoeyZwicker sr msteffen sambooo rw-pachyderm tv42 phacops ShengjieLuo mattnenterprise JonathanFraser tadhunt erikreppel SoheilSalehian teodor-pripoae bitwiseman elsonrodriguez fjukstad anchal-agrawal angl aberoham brendanashworth dwhitena fsouza jkingsman JustinTRoss rw

Releases

-   vv1.0.1-dirty zip tar
-   vv1.0.1-RCX3 zip tar
-   vv1.0.1-RCX2 zip tar
-   vv1.0.1-RCX1 zip tar
-   vv1.0.1-RCX zip tar
-   vV1.0.1-RCX zip tar
-   v1.2.1 zip tar
-   v1.2.0 zip tar
-   v1.2.0-RC2 zip tar
-   v1.2.0-RC1 zip tar
-   v1.1.0 zip tar
-   v1.1.0(RC) zip tar
-   v1.1.0-RC1 zip tar
-   v1.0.578 zip tar
-   v1.0.576 zip tar
-   v1.0.566 zip tar
-   v1.0.565 zip tar
-   v1.0.563 zip tar
-   v1.0.558 zip tar
-   v1.0.554 zip tar
-   v1.0.543 zip tar
-   v1.0.524 zip tar
-   v1.0.522 zip tar
-   v1.0.521 zip tar
-   v1.0.520 zip tar
-   v1.0.508 zip tar
-   v1.0.506 zip tar
-   v1.0.505 zip tar
-   v1.0.504 zip tar
-   v1.0.503 zip tar