
Dataset Card for NSynth

The NSynth dataset is an audio dataset containing more than 300,000 musical notes from more than 1,000 commercially sampled instruments, distinguished by pitch, timbre, and envelope. Each recording was made by playing and holding a musical note for three seconds and letting it decay for one second. The collection of four-second recordings ranges over every pitch on a standard MIDI piano (or as many as possible for the given instrument), played at five different velocities. The dataset was created to establish a high-quality entry point into audio machine learning, in response to the surge of breakthroughs in generative modeling of images that followed from the abundance of approachable image datasets (MNIST, CIFAR, ImageNet). NSynth is meant to be both a benchmark for audio ML and a foundation to be expanded on with future datasets.

Dataset Description

Since some instruments are not capable of producing all 88 pitches in the MIDI piano's range, there is an average of 65.4 pitches per instrument. Furthermore, the commercial sample packs occasionally contain duplicate sounds across multiple velocities, leaving an average of 4.75 unique velocities per pitch.

Each of the notes is annotated with three additional pieces of information based on a combination of human evaluation and heuristic algorithms:

  1. Source: The method of sound production for the note’s instrument. This is acoustic or electronic for notes recorded from acoustic or electronic instruments, respectively, or synthetic for synthesized instruments.

    Index ID
    0 acoustic
    1 electronic
    2 synthetic
  2. Family: The high-level family of which the note’s instrument is a member. Each instrument is a member of exactly one family. See the complete list of families and their frequencies by source below.

    Index ID
    0 bass
    1 brass
    2 flute
    3 guitar
    4 keyboard
    5 mallet
    6 organ
    7 reed
    8 string
    9 synth_lead
    10 vocal
Family      Acoustic  Electronic  Synthetic   Total
Bass             200        8387      60368   68955
Brass          13760          70          0   13830
Flute           6572          35       2816    9423
Guitar         13343       16805       5275   35423
Keyboard        8508       42645       3838   54991
Mallet         27722        5581       1763   35066
Organ            176       36401          0   36577
Reed           14262          76        528   14866
String         20510          84          0   20594
Synth Lead         0           0       5501    5501
Vocal           3925         140       6688   10753
Total         108978      110224      86777  305979
  3. Qualities: Sonic qualities of the note. See below for descriptions of the qualities, and here for information on co-occurrences between qualities.
Index ID Description
0 bright A large amount of high frequency content and strong upper harmonics.
1 dark A distinct lack of high frequency content, giving a muted and bassy sound. Also sometimes described as ‘Warm’.
2 distortion Waveshaping that produces a distinctive crunchy sound and presence of many harmonics. Sometimes paired with non-harmonic noise.
3 fast_decay Amplitude envelope of all harmonics decays substantially before the ‘note-off’ point at 3 seconds.
4 long_release Amplitude envelope decays slowly after the ‘note-off’ point, sometimes still present at the end of the sample at 4 seconds.
5 multiphonic Presence of overtone frequencies related to more than one fundamental frequency.
6 nonlinear_env Modulation of the sound with a distinct envelope behavior different than the monotonic decrease of the note. Can also include filter envelopes as well as dynamic envelopes.
7 percussive A loud non-harmonic sound at note onset.
8 reverb Room acoustics that were not able to be removed from the original sample.
9 tempo-synced Rhythmic modulation of the sound to a fixed tempo.
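
The index-to-ID mapping above is what ties a binary `qualities` vector to its readable IDs. As a minimal sketch (the helper names `decode_qualities` and `encode_qualities` are hypothetical and not part of the dataset or its loading script), the conversion in both directions could look like this:

```python
# Quality IDs in index order, copied from the qualities table above.
QUALITY_IDS = [
    "bright", "dark", "distortion", "fast_decay", "long_release",
    "multiphonic", "nonlinear_env", "percussive", "reverb", "tempo-synced",
]


def decode_qualities(vector):
    """Return the IDs of all qualities flagged with 1 in a binary vector."""
    if len(vector) != len(QUALITY_IDS):
        raise ValueError(f"expected {len(QUALITY_IDS)} flags, got {len(vector)}")
    return [name for name, flag in zip(QUALITY_IDS, vector) if flag]


def encode_qualities(ids):
    """Return the binary vector for a collection of quality IDs."""
    unknown = set(ids) - set(QUALITY_IDS)
    if unknown:
        raise ValueError(f"unknown quality IDs: {sorted(unknown)}")
    return [1 if name in ids else 0 for name in QUALITY_IDS]


print(decode_qualities([0, 1, 0, 0, 0, 0, 0, 0, 0, 0]))  # ['dark']
```

The same pattern applies to the source and family indices, which are single integers rather than vectors.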

Dataset Sources

Uses

This dataset has seen much use in models for generating audio, and some of these models have even been used by high-profile artists. Another obvious application is classification: identifying instruments, or perhaps even qualities of music, which could be useful in areas such as music recommendation. See here for one such example (a work in progress).

Dataset Structure

The dataset has three splits:

  • Train: A training set with 289,205 examples. Instruments do not overlap with valid or test.
  • Valid: A validation set with 12,678 examples. Instruments do not overlap with train.
  • Test: A test set with 4,096 examples. Instruments do not overlap with train.
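
As a quick sanity check on the split sizes listed above, the following sketch sums them and prints each split's share of the total. The dictionary name `SPLIT_SIZES` is illustrative only; the counts are copied from the list.

```python
# Example counts per split, copied from the list above.
SPLIT_SIZES = {"train": 289_205, "valid": 12_678, "test": 4_096}

total = sum(SPLIT_SIZES.values())

for name, size in SPLIT_SIZES.items():
    share = 100 * size / total
    print(f"{name:<5} {size:>7,d} examples ({share:.2f}% of the dataset)")

print(f"total {total:>7,d} examples")
```

The total matches the 305,979 notes in the family-by-source table, so the three splits together cover the whole dataset.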

See below for descriptions of the features.

Feature Type Description
note int64 A unique integer identifier for the note.
note_str str A unique string identifier for the note in the format <instrument_str>-<pitch>-<velocity>.
instrument int64 A unique, sequential identifier for the instrument the note was synthesized from.
instrument_str str A unique string identifier for the instrument this note was synthesized from in the format <instrument_family_str>-<instrument_production_str>-<instrument_name>.
pitch int64 The 0-based MIDI pitch in the range [0, 127].
velocity int64 The 0-based MIDI velocity in the range [0, 127].
sample_rate int64 The samples per second for the audio feature.
qualities [int64] A binary vector representing which sonic qualities are present in this note.
qualities_str [str] A list of IDs of the qualities present in this note, selected from the sonic qualities list.
instrument_family int64 The index of the instrument family this instrument is a member of.
instrument_family_str str The ID of the instrument family this instrument is a member of.
instrument_source int64 The index of the sonic source for this instrument.
instrument_source_str str The ID of the sonic source for this instrument.
audio {'path': str, 'array': [float], 'sampling_rate': int64} A dictionary containing a path to the corresponding audio file, a list of audio samples represented as floating point values in the range [-1,1], and the sampling rate.
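
The `pitch` feature is a MIDI note number, not a frequency. The sketch below converts it to Hz under an assumption that this card does not state: standard 12-tone equal temperament with A4 (MIDI note 69) tuned to 440 Hz. The function name `midi_to_hz` is hypothetical.

```python
def midi_to_hz(pitch, a4_hz=440.0):
    """Convert a MIDI pitch in [0, 127] to a frequency in Hz.

    Assumes 12-tone equal temperament with MIDI note 69 (A4) at `a4_hz`.
    This tuning is a common convention, not something the dataset specifies.
    """
    if not 0 <= pitch <= 127:
        raise ValueError(f"MIDI pitch must be in [0, 127], got {pitch}")
    # Each semitone multiplies the frequency by 2 ** (1 / 12).
    return a4_hz * 2.0 ** ((pitch - 69) / 12)


print(midi_to_hz(69))  # 440.0
```

A frequency estimate like this can be useful for checking that the fundamental in a spectrogram lines up with the labeled pitch.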

An example instance generated with the loading script (note that this differs from the example instance on the homepage, as the script integrates the audio into the respective JSON files):

{'note': 84147,
 'note_str': 'bass_synthetic_033-035-050',
 'instrument': 417,
 'instrument_str': 'bass_synthetic_033',
 'pitch': 35,
 'velocity': 50,
 'sample_rate': 16000,
 'qualities': [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
 'qualities_str': ['dark'],
 'instrument_family': 0,
 'instrument_family_str': 'bass',
 'instrument_source': 2,
 'instrument_source_str': 'synthetic',
 'audio': {'path': '/root/.cache/huggingface/datasets/downloads/extracted/335ef507846fb65b0b87154c22cefd1fe87ea83e8253ef1f72648a3fdfac9a5f/nsynth-test/audio/bass_synthetic_033-035-050.wav',
  'array': array([0., 0., 0., ..., 0., 0., 0.]),
  'sampling_rate': 16000}
}
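
Several fields in an instance are redundant with one another, which makes it easy to check a record for consistency. The sketch below is not part of the loading script: `parse_note_str` and `summarize_record` are hypothetical helpers, and the audio array is a silent stand-in whose length (4 seconds at 16,000 samples per second) is an assumption based on the recording description.

```python
def parse_note_str(note_str):
    """Split '<instrument_str>-<pitch>-<velocity>' into its three parts."""
    # Split from the right so the result is correct whether the instrument
    # string itself contains hyphens or underscores.
    instrument_str, pitch, velocity = note_str.rsplit("-", 2)
    return instrument_str, int(pitch), int(velocity)


def summarize_record(record):
    """Cross-check redundant fields and compute the audio duration."""
    parsed = parse_note_str(record["note_str"])
    expected = (record["instrument_str"], record["pitch"], record["velocity"])
    audio = record["audio"]
    return {
        "ids_match": parsed == expected,
        "rates_match": record["sample_rate"] == audio["sampling_rate"],
        "duration_s": len(audio["array"]) / audio["sampling_rate"],
    }


# Fields copied from the example instance above; the array is a placeholder.
example = {
    "note_str": "bass_synthetic_033-035-050",
    "instrument_str": "bass_synthetic_033",
    "pitch": 35,
    "velocity": 50,
    "sample_rate": 16000,
    "audio": {"array": [0.0] * 64000, "sampling_rate": 16000},
}

summary = summarize_record(example)
print(summary)
```

Note that the example instance joins the parts of `instrument_str` with underscores, while the feature description shows hyphens; splitting from the right handles either form.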

Potential Shortcomings

There are quite a few family-source pairings with little or no representation. While this is understandable in some cases (there is no acoustic Synth Lead, for instance), it may be problematic in others: there are no synthetic brass, string, or organ samples, and fewer than 100 electronic samples each for brass, flute, reed, and string. This can be particularly troublesome in classification problems, as there may not be sufficient data for a model to correctly distinguish between sources for a particular family of instruments. In music generation, on the other hand, these disparities may yield a bias toward the use of one source over others for a given family.
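
The under-represented pairings can be listed directly from the family-by-source counts. The following is a minimal sketch: the counts are copied from the table in the Dataset Description, the threshold of 100 mirrors the figure used above, and the names `COUNTS`, `SOURCES`, and `sparse_pairs` are illustrative.

```python
SOURCES = ("acoustic", "electronic", "synthetic")

# Note counts per family, in the order given by SOURCES.
COUNTS = {
    "bass": (200, 8387, 60368),
    "brass": (13760, 70, 0),
    "flute": (6572, 35, 2816),
    "guitar": (13343, 16805, 5275),
    "keyboard": (8508, 42645, 3838),
    "mallet": (27722, 5581, 1763),
    "organ": (176, 36401, 0),
    "reed": (14262, 76, 528),
    "string": (20510, 84, 0),
    "synth_lead": (0, 0, 5501),
    "vocal": (3925, 140, 6688),
}


def sparse_pairs(counts, threshold=100):
    """Return (family, source, count) for pairings below the threshold."""
    return [
        (family, source, n)
        for family, row in counts.items()
        for source, n in zip(SOURCES, row)
        if n < threshold
    ]


for family, source, n in sparse_pairs(COUNTS):
    print(f"{family:<10} {source:<10} {n:>3}")
```

A list like this could inform stratified sampling or class weighting, though whether either is appropriate depends on the task.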

Citation

Jesse Engel, Cinjon Resnick, Adam Roberts, Sander Dieleman, Douglas Eck,
  Karen Simonyan, and Mohammad Norouzi. "Neural Audio Synthesis of Musical Notes
  with WaveNet Autoencoders." 2017.

BibTeX:

@misc{nsynth2017,
    Author = {Jesse Engel and Cinjon Resnick and Adam Roberts and
              Sander Dieleman and Douglas Eck and Karen Simonyan and
              Mohammad Norouzi},
    Title = {Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders},
    Year = {2017},
    Eprint = {arXiv:1704.01279},
}

Dataset Card Authors

John Gillen
