FMA: A Dataset For Music Analysis

Michaël Defferrard, Kirell Benzi, Pierre Vandergheynst, Xavier Bresson, EPFL LTS2.

Note that this is a beta release and that this repository as well as the paper and data are subject to change. Stay tuned!

Data

The dataset is a dump of the Free Music Archive (FMA), an interactive library of high-quality, legal audio downloads directed by WFMU, the longest-running freeform radio station in the United States [Wikipedia]. Please see the paper for a description of how the data was collected and cleaned as well as an analysis and some baselines.

The MP3-encoded audio data is available in several sizes:

  1. fma_small.zip: 4,000 tracks of 30 seconds, 10 balanced genres (GTZAN-like) (~3.4 GiB)
  2. fma_medium.zip: 14,511 tracks of 30 seconds, 20 unbalanced genres (~12.2 GiB)
  3. [fma_large.zip]: 77,643 tracks of 30 seconds, 68 unbalanced genres (~90 GiB) (available soon)
  4. [fma_full.zip]: 77,643 untrimmed tracks, 164 unbalanced genres (~900 GiB) (subject to distribution constraints)

The following meta-data is available in this repository:

  • tracks.json: a table (to be imported as a pandas dataframe) containing meta-data about each track, such as the ID, the title, the artist and the genres. See the usage notebook for an exhaustive list.
  • genres.json: all the 164 available genres, used to infer the genre hierarchy and top-level genres.
  • features.json: common features extracted with librosa.
  • spotify.json: audio features provided by Spotify, formerly Echonest. Covers all tracks distributed in fma_small.zip and fma_medium.zip as well as some others.
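
To illustrate what "inferring the genre hierarchy and top-level genres" can look like, here is a minimal sketch. It assumes that each genre carries a link to its parent genre and that root genres use 0 as their parent; the actual field names and conventions in genres.json may differ, so check the usage notebook. The function name `top_level_genre` and the toy IDs are hypothetical and not taken from the dataset.

```python
from typing import Dict


def top_level_genre(genre_id: int, parents: Dict[int, int]) -> int:
    """Follow parent links until reaching a root genre (parent == 0)."""
    seen = set()
    current = genre_id
    while parents[current] != 0:
        if current in seen:
            # Guard against malformed data that contains a loop.
            raise ValueError(f"cycle detected at genre {current}")
        seen.add(current)
        current = parents[current]
    return current


# Toy hierarchy with hypothetical IDs (assumption: 0 means "no parent"):
# 1 is a root with descendants 2 and 3; 4 is a root with child 5.
toy_parents = {1: 0, 2: 1, 3: 2, 4: 0, 5: 4}
roots = {gid: top_level_genre(gid, toy_parents) for gid in toy_parents}
```

With real data, `parents` would be built from whatever parent column the genre table exposes once loaded.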

Code

As a user of the dataset, you’re probably most interested in these notebooks:

  1. usage: how to load the datasets and develop, train and test your own models with it.
  2. webapi: query the web API of the FMA to update the dataset or gather further information about tracks, albums or artists.

If you’re curious, you may check these notebooks, most of whose results appear in the paper:

  1. analysis: some exploration of the data.
  2. baselines: baseline models for genre recognition.

For the most curious, these were used to create the dataset:

  1. creation: creation of the dataset, i.e. tracks.json and genres.json.
  2. features: feature extraction from the raw audio, i.e. features.json.

Installation

  1. Download some data and verify its integrity.

    echo "e731a5d56a5625f7b7f770923ee32922374e2cbf  fma_small.zip" | sha1sum -c -
    echo "fe23d6f2a400821ed1271ded6bcd530b7a8ea551  fma_medium.zip" | sha1sum -c -
    
  2. Optionally, use pyenv to install Python 3.6 and create a virtual environment.

    pyenv install 3.6.0
    pyenv virtualenv 3.6.0 fma
    pyenv activate fma
    
  3. Clone the repository.

    git clone https://github.com/mdeff/fma.git
    cd fma
    
  4. Install the Python dependencies from requirements.txt. Depending on your usage, you may need to install ffmpeg or graphviz.

    make install
    
  5. Optionally, install CUDA to train neural networks on GPUs. See TensorFlow’s instructions.

  6. Fill in the configuration.

    cat .env
    DATA_DIR=/path/to/fma_small
    
  7. Open Jupyter or run a notebook.

    jupyter-notebook
    make fma_baselines.ipynb
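
If sha1sum is not available on your platform, the integrity check from step 1 can be done in Python instead. The following is a minimal sketch using only the standard library; the helper names `sha1_of_file` and `verify` are illustrative and not part of this repository. The expected digests are the ones listed in step 1.

```python
import hashlib


def sha1_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-1 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Expected digests, copied from the installation step.
EXPECTED_SHA1 = {
    "fma_small.zip": "e731a5d56a5625f7b7f770923ee32922374e2cbf",
    "fma_medium.zip": "fe23d6f2a400821ed1271ded6bcd530b7a8ea551",
}


def verify(path: str, expected: str) -> bool:
    """Compare the file's SHA-1 digest against the expected hex string."""
    return sha1_of_file(path) == expected.lower()
```

For example, `verify("fma_small.zip", EXPECTED_SHA1["fma_small.zip"])` should return True for an intact download, assuming the archive is in the current directory.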
    

History

  • 2016-12-06 beta release
    • paper: arXiv:1612.01840v1
    • code: git tag beta
    • fma_small.zip sha1: e731a5d56a5625f7b7f770923ee32922374e2cbf
    • fma_medium.zip sha1: fe23d6f2a400821ed1271ded6bcd530b7a8ea551

License & co

  • Please cite our paper if you use our code or data.
  • The code in this repository is released under the terms of the MIT license.
  • The meta-data, i.e. all the .json files, is released under the terms of the Creative Commons Attribution 4.0 International License (CC BY 4.0).
  • We do not hold the copyright on the audio data, i.e. all .mp3 in the .zip archives, and distribute it under the terms of the license chosen by the artist.
  • The dataset is meant for research purposes.
  • We are grateful to SWITCH and EPFL for hosting the dataset within the context of the SCALE-UP project, funded in part by the swissuniversities SUC P-2 program.
