A curated list of awesome Apache Spark packages and resources.
Notebooks and IDEs
- Apache Zeppelin - Web-based notebook that enables interactive data analytics with pluggable backends, integrated plotting, and extensive Spark support out of the box.
- Spark Notebook - Scalable and stable Scala and Spark focused notebook bridging the gap between JVM and Data Scientists (incl. extendable, typesafe and reactive charts).
- sparkmagic - Jupyter magics and kernels for working interactively with remote Spark clusters through Livy.
General Purpose Libraries
- Succinct - Support for efficient queries on compressed data.
SQL Data Sources
- Spark CSV - CSV reader and writer.
- Spark Avro - Apache Avro reader and writer.
- Spark XML - XML parser and writer.
- Spark-Mongodb - MongoDB reader and writer.
- Spark Cassandra Connector - Cassandra support, including a data source, an API, and support for arbitrary queries.
- Spark Riak Connector - Riak TS & Riak KV connector.
- Mongo-Spark - Official MongoDB connector.
- Magellan - Geospatial analytics using Spark.
- GeoSpark - A cluster computing system for processing large-scale spatial data.
Time Series Analytics
- Spark-Timeseries - A Scala / Java / Python library for interacting with time series data on Apache Spark.
- Mazerunner - Graph analytics platform on top of Neo4j and GraphX.
- GraphFrames - DataFrame-based graph API.
- neo4j-spark-connector - Bolt protocol-based Neo4j connector with RDD, DataFrame and GraphX / GraphFrames support.
Machine Learning Extension
- dbscan-on-spark - An implementation of the DBSCAN clustering algorithm on top of Apache Spark by irvingc, based on the paper by He, Yaobin, et al.: MR-DBSCAN: a scalable MapReduce-based DBSCAN algorithm for heavily skewed data.
- Spark DBSCAN - Another implementation of the DBSCAN clustering algorithm by alitouka.
- Apache SystemML - Declarative machine learning framework on top of Spark.
- Mahout Spark Bindings - Linear algebra DSL and optimizer with R-like syntax.
- spark-sklearn - Scikit-learn integration with distributed model training.
- KeystoneML - Type safe machine learning pipelines with RDDs.
- Livy - REST server with extensive language support (Python, R, Scala) and the ability to maintain interactive sessions and share objects.
- spark-jobserver - A simple Spark as a Service which supports object sharing using so-called named objects. JVM only.
- Mist - HTTP and MQTT API intended to expose Spark to external services.
- Apache Toree - IPython protocol based middleware for interactive applications.
- silex - A bunch of tools varying from ML extensions to additional RDD methods.
Natural Language Processing
- Apache Bahir - A collection of the streaming connectors excluded from Spark 2.0 (Akka, MQTT, Twitter, ZeroMQ).
- Learning Spark, Lightning-Fast Big Data Analysis - A slightly outdated (Spark 1.3) introduction to the Spark API. Good source of knowledge about basic concepts.
- Advanced Analytics with Spark - A useful collection of Spark processing patterns. Accompanying GitHub repository: sryza/aas.
- Mastering Apache Spark - An interesting compilation of notes by Jacek Laskowski. Focused on different aspects of Spark internals.
- Spark Gotchas - A subjective compilation of tips, tricks and common programming mistakes.
- Spark in Action - A new book in Manning’s “in action” family with 400+ pages. Starts gently, proceeds step by step, and covers a large number of topics. Free excerpt on how to set up Eclipse for Spark application development and how to bootstrap a new application using the provided Maven Archetype. You can find the accompanying GitHub repo here.
- Data Science and Engineering with Apache Spark (edX XSeries) - A series of five courses (Introduction to Apache Spark, Distributed Machine Learning with Apache Spark, Big Data Analysis with Apache Spark, Advanced Apache Spark for Data Science and Data Engineering, Advanced Distributed Machine Learning with Apache Spark) covering different aspects of software engineering and data science. Python oriented.
- Big Data Analysis with Scala and Spark (Coursera) - Scala-oriented introductory course. Part of the Functional Programming in Scala Specialization.
- AMP Camp - A periodic training event organized by the UC Berkeley AMPLab. A source of useful exercises and recorded workshops covering different tools from the Berkeley Data Analytics Stack.
Projects Using Spark
- Oryx 2 - A lambda architecture built on Apache Spark and Apache Kafka with specialization for real-time, large-scale machine learning.
- Photon ML - A machine learning library supporting classical Generalized Mixed Model and Generalized Additive Mixed Effect Model.
- PredictionIO - Machine Learning server for developers and data scientists to build and deploy predictive applications in a fraction of the time.
- Crossdata - Data integration platform with extended DataSource API and multi-user environment.
- Spark Technology Center - A great source of highly diverse posts related to the Spark ecosystem, ranging from practical advice to Spark committer profiles.
- jupyter/docker-stacks/pyspark-notebook - PySpark with Jupyter Notebook and Mesos client.
- sequenceiq/docker-spark - YARN images from SequenceIQ.