The GigaMIDI Dataset with Features for Expressive Music Performance Detection
DATASET
Abstract
The Musical Instrument Digital Interface (MIDI), introduced in 1983, revolutionized music production by allowing computers and instruments to communicate efficiently. MIDI files encode musical instructions compactly, facilitating convenient music sharing. They benefit Music Information Retrieval (MIR), aiding in research on music understanding, computational musicology, and generative music. The GigaMIDI dataset contains over 1.4 million unique MIDI files, encompassing 1.8 billion MIDI note events and over 5.3 million MIDI tracks. GigaMIDI is currently the largest collection of symbolic music in MIDI format available for research purposes under fair dealing. Distinguishing between non-expressive and expressive MIDI tracks is challenging, as MIDI files do not inherently make this distinction. To address this issue, we introduce a set of innovative heuristics for detecting expressive music performance. These include the Distinctive Note Velocity Ratio (DNVR) heuristic, which analyzes MIDI note velocity; the Distinctive Note Onset Deviation Ratio (DNODR) heuristic, which examines deviations in note onset times; and the Note Onset Median Metric Level (NOMML) heuristic, which evaluates onset positions relative to metric levels. Our evaluation demonstrates these heuristics effectively differentiate between non-expressive and expressive MIDI tracks. Furthermore, after evaluation, we create the most substantial expressive MIDI dataset, employing our heuristic, NOMML. This curated iteration of GigaMIDI encompasses expressively-performed instrument tracks detected by NOMML, containing all General MIDI instruments, constituting 31% of the GigaMIDI dataset, totalling 1,655,649 tracks.
tion as one of the main resources for training these deep learning models. Within automatic music generation via deep learning, end-to-end models use digital audio waveform representations of musical signals as input (Zukowski and Carr, 2017; Manzelli et al., 2018; Dieleman et al., 2018). Automatic music generation based on symbolic representations (Raffel and Ellis, 2016b; Zhang, 2020) uses digital notations to represent musical events from a composition or performance; these can be contained, e.g. in a digital score, a tablature (Sarmento et al., 2023a,b), or a piano-roll. Moreover, symbolic music data can be leveraged in computational musicology to analyze the vast corpus of music using MIR and music data mining techniques (Li et al., 2012).

In computational creativity and musicology, a critical aspect is distinguishing between non-expressive performances, which are mechanical renditions of a score, and expressive performances, which reflect variations that convey the performer's personality and style. MIDI files are commonly produced through score editors or by recording human performances using MIDI instruments, which allow for adjustments in parameters, such as velocity or pressure, to create expressively performed tracks.

However, MIDI files typically do not contain metadata distinguishing between non-expressive and expressive performances, and most MIR research has focused on file-level rather than track-level analysis. File-level analysis examines global attributes like duration, tempo, and metadata, aiding structural studies, while track-level analysis explores instrumentation and arrangement details. The note-level analysis provides the most granular insights, focusing on pitch, velocity, and microtiming to reveal expressive characteristics. Together, these hierarchical levels form a comprehensive framework for studying MIDI data and understanding expressive elements of musical performances.

Our work categorizes MIDI tracks into two types: non-expressive tracks, defined by fixed velocities and quantized rhythms (though expressive performances may also exhibit some degree of quantization), and expressive tracks, which feature microtiming variations compared to the nominal duration indicated on the score, as well as dynamics variations, translating into velocity changes across and within notes. To address this, we introduce novel heuristics in Section 4 for detecting expressive music performances by analyzing microtimings and velocity levels to differentiate between expressive and non-expressive MIDI tracks.

The main contributions of this work can be summarized as follows: (1) the GigaMIDI dataset, which encompasses over 1.4 million MIDI files and over five million instrument tracks. This data collection is the largest open-source MIDI dataset for research purposes to date. (2) We have developed novel heuristics (Heuristic 1 and 2) tailored explicitly for detecting expressive music performance in MIDI tracks. Our novel heuristics were applied to each instrument track in the GigaMIDI dataset, and the resulting values were used to evaluate the expressiveness of tracks in GigaMIDI. (3) We provide details of the evaluation results (Section 5.2) of each heuristic to facilitate expressive music performance research. (4) Through the application of our optimally performing heuristic, as determined through our evaluation, we create the largest MIDI dataset of expressive performances, specifically incorporating instrument tracks beyond those associated with piano and drums (which constitute 31% of the GigaMIDI dataset), totalling over 1.6 million expressively-performed MIDI tracks.

2. Background
Before exploring the GigaMIDI dataset, we examine symbolic music datasets in existing literature. This sets the stage for our discussion on MIDI's musical expression and performance aspects, laying the groundwork for understanding our heuristics in detecting expressive music performance from MIDI data.

2.1 Symbolic Music Data
Symbolic formats refer to the representation of music through symbolic data, such as MIDI files, rather than audio recordings (Zeng et al., 2021). Symbolic music understanding involves analyzing and interpreting music based on its symbolic data, namely information about musical notation, music theory and formalized music concepts (Simonetta et al., 2018).

Dataset | Format | Files | Hours | Instruments
GigaMIDI | MIDI | >1.43M | >40,000 | Misc.
MetaMIDI | MIDI | 436,631 | >20,000 | Misc.
Lakh MIDI | MIDI | 174,533 | >9,000 | Misc.
DadaGP | Guitar Pro | 22,677 | >1,200 | Misc.
ATEPP | MIDI | 11,677 | 1,000 | Piano
Essen Folk Song | ABC | 9,034 | 56.62 | Piano
NES Music | MIDI | 5,278 | 46.1 | Misc.
MID-FiLD | MIDI | 4,422 | >40 | Misc.
MAESTRO | MIDI | 1,282 | 201.21 | Piano
Groove MIDI | MIDI | 1,150 | 13.6 | Drums
JSB Chorales | MusicXML | 382 | >4 | Misc.

Table 1: Sample of symbolic datasets in multiple formats, including MIDI, ABC, MusicXML and Guitar Pro formats.

Symbolic formats have practical applications in music information processing and analysis. Symbolic music processing involves manipulating and analyzing symbolic music data, which can be more efficient and easier to interpret than lower-level representations of music, such as audio files (Cancino-Chacón et al., 2022).

Musical Instrument Digital Interface (MIDI) is a technical standard that enables electronic musical instruments and computers to communicate by transmitting event messages that encode information such as pitch, velocity, and timing. This protocol has
become integral to music production, allowing for the efficient representation and manipulation of musical data (Meroño-Peñuela et al., 2017). MIDI datasets, which consist of collections of MIDI files, serve as valuable resources for musicological research, enabling large-scale analyses of musical trends and styles. For instance, studies utilizing MIDI datasets have explored the evolution of popular music (Mauch et al., 2015) and facilitated advancements in music transcription technologies through machine learning techniques (Qiu et al., 2021). The application of MIDI in various domains underscores its significance in both the creative and analytical aspects of contemporary music.

Symbolic music processing has gained attention in the MIR community, and several music datasets are available in symbolic formats (Cancino-Chacón et al., 2022). Symbolic representations of music can be used for style classification, emotion classification, and music piece matching (Zeng et al., 2021). Symbolic formats also play a role in the automatic formatting of music sheets. XML-compliant formats, such as the WEDEL format, include constructs describing integrated music objects, including symbolic music scores (Bellini et al., 2005). Besides that, the Music Encoding Initiative (MEI) is an open, flexible format for encoding music scores in a machine-readable way. It allows for detailed representation of musical notation and metadata, making it ideal for digital archiving, critical editions, and musicological research (Crawford and Lewis, 2016).

ABC notation is a text format used to represent music symbolically, particularly favoured in folk music (Cros Vila and Sturm, 2023). It offers a human-readable method for notating music, with elements represented using letters, numbers, and symbols. This format is easily learned, written, and converted into standard notation or MIDI files using software, enabling convenient sharing and playback of musical compositions.

Csound notation, part of the Csound software, symbolically represents electroacoustic music (Licata, 2002). It controls sonic parameters precisely, fostering complex compositions blending traditional and electronic elements. This enables innovative experimentation in contemporary music. Max Mathews' MUSIC 4, developed in 1962, laid the groundwork for Csound, introducing key musical concepts to computing programs.

With the proliferation of deep learning approaches, often driven by the need for vast amounts of data, the creation and curation of symbolic datasets have been active in this research area. The MIDI format can be considered the most common music format for symbolic music datasets, despite alternatives such as the Essen folk music database in ABC format (Schaffrath, 1995), the JSB chorales dataset available via MusicXML format and Music21 (Boulanger-Lewandowski et al., 2012; Cuthbert and Ariza, 2010), and Guitar Pro tablature format (Sarmento et al., 2021).

Focusing on MIDI, Table 1 showcases symbolic music datasets. MetaMIDI (Ens and Pasquier, 2021) is a collection of 436,631 MIDI files. MetaMIDI comprises a substantial collection of multi-track MIDI files primarily derived from an extensive music corpus characterized by longer duration. Approximately 57.9% of MetaMIDI files include a drum track.

The Lakh MIDI dataset (LMD) encompasses a collection of 174,533 MIDI files (Raffel, 2016), and an audio-to-MIDI alignment matching technique (Raffel and Ellis, 2016a) is introduced, which is also utilized in MetaMIDI for matching musical styles if scraped style metadata is unavailable.

2.2 Music Expression and Performance Representations of MIDI
We use the terms expressive MIDI, human-performed MIDI, and expressive machine-generated MIDI interchangeably to describe MIDI files that capture expressively-performed (EP) tracks, as illustrated in Figure 1. EP-class MIDI tracks capture performances by human musicians or producers, emulate the nuances of live performance, or are generated by machines trained with deep learning algorithms. These tracks incorporate variations of features, such as timing, dynamics, and articulation, to convey musical expression.

From the perspective of music psychology, analyzing expressive music performance involves understanding how variations of, e.g. timing, dynamics and timbre (Barthet et al., 2010) relate to performers' intentions and influence listeners' perceptions. Repp's research demonstrates that expressive timing deviations, like rubato, enhance listeners' perception of naturalness and musical quality by aligning with their cognitive expectations of flow and structure (Repp, 1997b). Palmer's work further reveals that expressive timing and dynamics are not random but result from skilled motor planning, as musicians use mental representations of music to execute nuanced timing and dynamic changes that reflect their interpretive intentions (Palmer, 1997).

Our focus lies on two main types of MIDI tracks: non-expressive and expressive. Non-expressive MIDI tracks exhibit relatively fixed velocity levels and onset deviations, resulting in metronomic and mechanical rhythms. In contrast, expressive MIDI tracks feature subtle temporal deviations (non-quantized but humanized or human-performed) and greater variations in velocity levels associated with dynamics.

2.2.1 Non-expressive and expressively-performed MIDI tracks
MIDI files are typically produced in two ways (excluding synthetic data from generative music systems): using a score/piano roll editor or recording a human performance. MIDI controllers and instruments, such as a keyboard and pads, can be utilized to adjust the
parameters of each note played, such as velocity and pressure, to produce expressively-performed MIDI. Being able to distinguish non-expressive and expressive MIDI tracks is useful in MIR applications. However, MIDI files do not accommodate such distinctions within their metadata. MIDI track-level analysis for music expression has received less attention from MIR researchers than MIDI file-level analysis. Previous research regarding interpreting MIDI velocity levels (Dannenberg, 2006) and modelling dynamics/expression (Berndt and Hähnel, 2010; Ortega et al., 2019) was conducted, and a comprehensive review of computational models of expressive music performance is available in (Cancino-Chacón et al., 2018). Generation of expressive musical performances using a case-based reasoning system (Arcos et al., 1998) has been studied in the context of tenor saxophone interpretation and the modelling of virtuosic bass guitar performances (Goddard et al., 2018). Velocity prediction/estimation using deep learning was introduced at the MIDI note-level (Kuo et al., 2021; Kim et al., 2022; Collins and Barthet, 2023; Tang et al., 2023).

Figure 1: Four classes (NE = non-expressive, EO = expressive-onset, EV = expressive-velocity, and EP = expressively-performed) using heuristics in Section 4.2 for the expressive performance detection of MIDI tracks in GigaMIDI.

2.2.2 Music expression and performance datasets
The aligned scores and performances (ASAP) dataset has been developed specifically for annotating non-expressive and expressively-performed MIDI tracks (Foscarin et al., 2020). Comprising 222 digital musical scores synchronized with 1068 performances, ASAP encompasses over 92 hours of Western classical piano music. This dataset provides paired MusicXML and quantized MIDI files for scores, along with paired MIDI files and partial audio recordings for performances. The alignment of ASAP includes annotations for downbeat, beat, time signature, and key signature, making it notable for its incorporation of music scores aligned with MIDI and audio performance data. The MID-FiLD (Ryu et al., 2024) dataset is the sole dataset offering detailed dynamics for Western orchestral instruments. However, it primarily focuses on creating expressive dynamics via MIDI Control Change #1 (modulation wheel) and lacks velocity variations, featuring predominantly constant velocities as verified by our manual inspection. In contrast, the GigaMIDI dataset focuses on expressive performance detection through variations of micro-timings and velocity levels.

MAESTRO (Hawthorne et al., 2019) and Groove MIDI (Gillick et al., 2019) datasets focus on singular instruments, specifically piano and drums, respectively. Despite their narrower scope, these datasets are noteworthy for including MIDI files exclusively performed by human musicians. Saarland music data (SMD) contains piano performance MIDI files and audio recordings, but SMD only contains 50 files (Müller et al., 2011). The Vienna 4x22 piano corpus (Goebl, 1999) and the Batik-plays-Mozart MIDI dataset (Hu and Widmer, 2023) both provide valuable resources for studying classical piano performance. The Vienna 4x22 Piano Corpus features high-resolution recordings of 22 pianists performing four classical pieces aimed at analyzing expressive elements like timing and dynamics across performances. Meanwhile, the Batik-plays-Mozart dataset offers MIDI recordings of Mozart pieces performed by the pianist Batik, capturing detailed performance data such as note timing and velocity. Together, these datasets support research in performance analysis and machine learning applications in music.

The Automatically Transcribed Expressive Piano Performances (ATEPP) dataset (Zhang et al., 2022) was devised for capturing performer-induced expressiveness by transcribing audio piano performances into MIDI format. ATEPP addresses inaccuracies inherent in the automatic music transcription process. Similarly, the GiantMIDI piano dataset (Kong et al., 2022), akin to ATEPP, comprises AI-transcribed piano tracks that encapsulate expressive performance nuances. However, we excluded the ATEPP and GiantMIDI piano datasets from our expressive music performance detection task. State-of-the-art transcription models are known to overfit the MAESTRO dataset (Edwards et al., 2024) due to its recordings originating from a controlled piano competition setting. These performances, all played on similar Yamaha Disklavier pianos under concert hall conditions, result in consistent acoustic and timbral characteristics. This uniformity restricts the models' ability to generalize to out-of-distribution data, contributing to the observed overfitting.
3. GigaMIDI Data Collection
We present the GigaMIDI dataset in this section and its descriptive statistics, such as the MIDI instrument group, the number of MIDI notes, ticks per quarter note, and musical style. Additional descriptive statistics are in Supplementary file 1: Appendix A.1.

3.1 Overview of GigaMIDI Dataset
The GigaMIDI dataset is a superset of the MetaMIDI dataset (Ens and Pasquier, 2021), and it contains 1,437,304 unique MIDI files with 5,334,388 MIDI instrument tracks, and 1,824,536,824 (over 10^9; hence, the prefix "Giga") MIDI note events. The GigaMIDI dataset includes 56.8% single-track and 43.2% multi-track MIDI files. It contains 996,164 drum tracks and 4,338,224 non-drum tracks. The initial version of the dataset consisted of 1,773,996 MIDI files. Approximately 20% of the dataset was subjected to a cleaning process, which included deduplication achieved by verifying and comparing the MD5 checksums of the files. While we integrated certain publicly accessible MIDI datasets from previous research endeavours, it is noteworthy that over 50% of the GigaMIDI dataset was acquired through web-scraping and organized by the authors.

The GigaMIDI dataset includes per-track loop detection, adapting the loop detection and extraction algorithm presented in (Adkins et al., 2023) to MIDI files. In total, 7,108,181 loops with lengths ranging from 1 to 8 bars were extracted from GigaMIDI tracks, covering all types of MIDI instruments. Details and analysis of the extracted loops from the GigaMIDI dataset will be shared in a companion paper via our GitHub page.

3.2 Collection and Preprocessing of GigaMIDI Dataset
The authors manually collected and aggregated the GigaMIDI dataset, applying our heuristics for MIDI-based expressive music performance detection. This aggregation process was designed to make large-scale symbolic music data more accessible to music researchers.

Regarding data collection, we manually gathered freely available MIDI files from online sources like Zenodo1, GitHub2, and public MIDI repositories by web scraping. The source links for each subset are provided via our GitHub webpage3. During aggregation, files were organized and deduplicated by comparing MD5 hash values. We also standardized each subset to the General MIDI (GM) specification, ensuring coherence; for example, non-GM drum tracks were remapped to GM. Manual curation was employed to assess the suitability of the files for expressive music performance detection, with particular attention to defining ground truth tracks for expressive and non-expressive categories. This process involved systematically identifying the characteristics of expressive and non-expressive MIDI track subsets by manually checking the characteristics of MIDI tracks in each subset. The curated subsets were subsequently analyzed and incorporated into the GigaMIDI dataset to facilitate the detection of expressive music performance.

To improve accessibility, the GigaMIDI dataset has been made available on the Hugging Face Hub. Early feedback from researchers in music computing and MIR indicates that this platform offers better usability and convenience compared to alternatives such as GitHub and Zenodo. This platform enhances data preprocessing efficiency and supports seamless integration with workflows, such as MIDI parsing and tokenization using Python libraries like Symusic4 and MidiTok5 (Fradet et al., 2021), as well as deep learning model training using Hugging Face. Additionally, the raw metadata of the GigaMIDI dataset is hosted on the Hugging Face Hub6; see Section 8.

As part of preprocessing GigaMIDI, single-track drum files allocated to MIDI channel 1 are subjected to re-encoding. This serves the dual purpose of ensuring their accurate representation on MIDI channel 10, the drum channel, while mitigating the risk of misidentification as a piano track, denoted as channel 1. Details of MIDI channels are explained in Section 3.3.1.

Furthermore, all drum tracks in the GigaMIDI dataset were standardized through remapping based on the General MIDI (GM) drum mapping guidelines (MIDI Association, 1996b) to ensure consistency. Detailed information about the drum remapping process can be accessed via GitHub. In addition, the distribution of drum instruments, categorized and visualized by their relative frequencies, is presented in Appendix A.1 (Gómez-Marín et al., 2020).
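To make the MD5-based deduplication step concrete, the snippet below is a minimal sketch of the idea described above, not the released GigaMIDI pipeline; the directory layout and the ".mid" extension are assumptions made for illustration.

```python
import hashlib
from pathlib import Path

def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the MD5 checksum of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def deduplicate_midi(root: str) -> dict[str, Path]:
    """Keep the first file seen for each MD5 hash; later byte-identical copies are skipped."""
    unique: dict[str, Path] = {}
    for path in sorted(Path(root).rglob("*.mid")):  # assumed file extension
        checksum = md5_of_file(path)
        if checksum not in unique:
            unique[checksum] = path  # the checksum can also serve as an anonymized file name
    return unique
```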
3.3 Descriptive Statistics of the GigaMIDI Dataset
3.3.1 MIDI Instrument Group

Figure 2: Distribution of the duration in bars of the files from each subset of the GigaMIDI dataset. The X-axis is clipped to 300 for better readability.

The GigaMIDI dataset is divided into three primary subsets: "all-instrument-with-drums", "drums-only", and "no-drums". The "all-instrument-with-drums" subset comprises 22.78% of the dataset and includes multi-track MIDI files with drum tracks. The "drums-only" subset makes up 56.85% of the dataset, containing only drum tracks, while the "no-drums" subset (20.37%) consists of both multi-track and single-track MIDI files without drum tracks. As shown in Figure 2, drums-only files typically have a high-density distribution and are mostly under 50 bars, reflecting their classification as drum loops. Conversely, multi-track and single-track piano files exhibit a broader range of durations, spanning 10 to 300 bars, with greater diversity in musical structure.

Figure 3: Distribution of files in GigaMIDI according to (a) MIDI notes, and (b) ticks per quarter note (TPQN).

MIDI instrument groups, organized by program numbers, categorize instrument sounds. Each group corresponds to a specific program number range, representing unique instrument sounds. For instance, program numbers 1 to 8 on MIDI Channel 1 are associated with the piano instrument group (acoustic piano, electric piano, harpsichord, etc). The analysis in Table 2 focuses on the occurrence of MIDI note events across the 16 MIDI instrument groups (MIDI Association, 1996b). Channel 10 is typically reserved for the drum instrument group.

IGN: 1-8 | Events | IGN: 9-16 | Events
Piano | 60.2% | Reed/Pipe | 1.1%
CP | 2.4% | Drums | 17.4%
Organ | 1.8% | Synth Lead | 0.5%
Guitar | 6.7% | Synth Pad | 0.6%
Bass | 4.2% | Synth FX | 0.3%
String | 1.1% | Ethnic | 0.3%
Ensemble | 2.1% | Percussive FX | 0.3%
Brass | 0.7% | Sound FX | 0.3%

Table 2: Number of MIDI note events by instrument group in percentage (IGN = instrument group number, CP = chromatic percussion, and FX = effect).

Although MIDI groups/channels often align with specific instrument types in the General MIDI specification (MIDI Association, 1996a), composers and producers can customize instrument number allocations based on their preferences.

The GigaMIDI dataset analysis reveals that most MIDI note events (77.6%) are found in two instrument groups: piano and drums. The piano instrument group has more MIDI note events (60.2%) because most piano-based tracks are longer. The higher number of MIDI notes in piano tracks compared to other instrumental tracks can be attributed to several factors. The inherent nature of piano playing, which involves ten fingers and frequently includes simultaneous chords due to its dual-staff layout, naturally increases note density. Additionally, the piano's wide pitch range, polyphonic capabilities, and versatility in musical roles allow it to handle melodies, harmonies, and accompaniments simultaneously. Piano tracks are often used as placeholders or sketches during composition, and MIDI input is typically performed using a keyboard defaulting to a piano timbre. These characteristics, combined with the cultural prominence of the piano and the practice of condensing multiple parts into a single piano track for convenience, result in a higher density of notes in MIDI datasets.

The GigaMIDI dataset includes a significant proportion of drum tracks (17.4%), which are generally shorter and contain fewer note events compared to piano tracks. This is primarily because many drum tracks are designed for drum loops and grooves rather than for full-length musical compositions. The supplementary file provides a detailed distribution of note events for drum sub-tracks, including each drum MIDI instrument in the GigaMIDI dataset. Sound effects, including breath noise, bird tweets, telephone rings, applause, and gunshot sounds, exhibit minimal usage, accounting for only 0.249% of the dataset. Chromatic percussion (2.4%) stands for pitched percussions, such as glockenspiel, vibraphone, marimba, and xylophone.
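As a concrete illustration of the grouping used in Table 2, the sketch below maps a General MIDI program number to its standard block of eight programs; it follows the published GM grouping rather than the table's exact labels (which merge Reed/Pipe and add Drums as a separate, channel-based group), and it is an illustrative helper rather than code from the GigaMIDI tooling.

```python
# Standard General MIDI program groups: programs 1-128 fall into 16 blocks of 8.
GM_PROGRAM_GROUPS = [
    "Piano", "Chromatic Percussion", "Organ", "Guitar",
    "Bass", "Strings", "Ensemble", "Brass",
    "Reed", "Pipe", "Synth Lead", "Synth Pad",
    "Synth Effects", "Ethnic", "Percussive", "Sound Effects",
]

def gm_instrument_group(program: int, channel: int) -> str:
    """Map a 1-based GM program number to its group; channel 10 is reserved for drums."""
    if channel == 10:  # GM convention: channel 10 carries the drum kit, regardless of program
        return "Drums"
    if not 1 <= program <= 128:
        raise ValueError("GM program numbers run from 1 to 128")
    return GM_PROGRAM_GROUPS[(program - 1) // 8]
```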
3.3.2 Number of MIDI Notes and Ticks Per Quarter Note
Figure 3 (a) shows the distribution for the number of MIDI notes in GigaMIDI. According to our data analysis, the span from the 5th to the 95th percentile covers 13 to 931 notes, indicating a significant presence of short-length drum tracks or loops.

Figure 3 (b) illustrates the distribution of ticks per quarter note (TPQN). TPQN is a unit that measures the resolution or granularity of timing information. Ticks are the smallest indivisible units of time within a MIDI sequence. A higher TPQN value means more precise timing information can be stored in a MIDI sequence. The most common TPQN values are 480 and 960. According to our data analysis of GigaMIDI, common TPQN values range from 96 to 960 between the 5th and 95th percentiles.

3.3.3 Musical Style

Figure 4: Musicmap style topology (Crauwels, 2016).

data, encompassing audio-text match style metadata sourced from the MetaMIDI subset (Ens and Pasquier, 2021), is conducted. Subsequently, all gathered musical style metadata undergoes conversion, adhering to the Musicmap topology for consistency.

The distribution of musical style metadata in the GigaMIDI dataset, illustrated in Figure 5, is based on the Musicmap topology and encompasses 195,737 files annotated with musical style metadata. Notably, prevalent styles include classical, pop, rock, and folk music. These 195,737 style annotations mostly originate from a combination of scraped metadata acquired online, style data present in our subsets, and manual inspection conducted by the authors.

A major challenge in utilizing scraped style metadata from the MetaMIDI subset is ensuring the accuracy of that metadata. To address this, a subset of the GigaMIDI dataset, consisting of 29,713 MIDI files, was carefully reviewed through music listening and manually annotated with style metadata by a doctoral-level music researcher.

MetaMIDI integrates scraped style metadata and associated labels obtained through an audio-MIDI matching process7. However, our empirical assessment, based on manual auditory analysis of musical styles, identified inconsistencies and unreliability in the scraped metadata from the MetaMIDI subset (Ens and Pasquier, 2021). To address this, we manually remapped 9,980 audio-text-matched musical style metadata entries within the MetaMIDI subset, ensuring consistent and accurate musical style classifications. Finally, these remapped musical styles were aligned with the Musicmap topology to provide more uniform and reliable information on musical style.

We provide audio-text-matched musical style metadata using three sources of style metadata: Discogs8, Last.fm9, and Tagtraum10, collected using the MusicBrainz11 database.
ings and their expected positions on a quantized metric grid, with the grid's resolution determined by the TPQN (ticks per quarter note) of the MIDI file. These deviations, often introduced through human performance, play a crucial role in conveying musical expressiveness.

The primary objective of our proposed heuristics for expressive performance detection is to differentiate between expressive and non-expressive MIDI tracks by analyzing velocity and onset time deviations. This analysis is applied at the MIDI track level, with each instrument track undergoing expressive performance detection. Our heuristics, introduced in the following sections, assess expressiveness by examining velocity variations and microtimings, offering a versatile framework suitable for various GM instruments.

Other related approaches for this task are more specific to acoustic piano performance rather than being tailored to MIDI tracks. Key Overlap Time (Repp, 1997a) and Melody Lead (Goebl, 2001) focus on acoustic piano performances, analyzing legato articulation and melodic timing anticipation, which limits their application to piano contexts. Similarly, Linear Basis Models (Grachten and Widmer, 2012) focus on Western classical instruments, particularly the acoustic piano, and rely on score-based dynamics (e.g., crescendo, fortissimo), making them less applicable to non-classical or non-Western music. Such dynamics can be interpreted in MIDI velocity levels, and our heuristics consider this aspect. Compared to these methods, our heuristics offer broader applicability, addressing dynamic variations and microtiming deviations across a wide range of MIDI instruments, making them suitable for detecting expressiveness in diverse musical contexts.

our heuristic, we first store only the unique values in each list: for v, the distinct velocity levels are {64, 72, 80, 88}, and for o, the distinct onset time deviations are {-5, 0, 5, 10}. By counting these unique values, we identify four distinct velocity levels and four distinct onset time deviations for this MIDI track, with no deviation being treated as a specific occurrence.

4.2 Distinctive Note Velocity/Onset Deviation Ratio (DNVR/DNODR)
Distinctive note velocity and onset deviation ratios measure the proportion (in %) of unique MIDI note velocities and onset time deviations in each MIDI track. These metrics form a set of heuristics for detecting expressive performances, classified into four categories: Non-Expressive (NE), Expressive-Onset (EO), Expressive-Velocity (EV), and Expressively-Performed (EP), as shown in Figure 1. The DNVR metric counts unique velocity levels to differentiate between tracks with consistent velocity and those with expressive velocity variation, while the DNODR calculation helps identify MIDI tracks that are either perfectly quantized or have minimal microtiming deviations.

Heuristic 1: Calculation of Distinctive Note Velocity/Onset Deviation Ratio (DNVR/DNODR)
1: x ← [x_1, ..., x_n]  ▷ list of velocity or onset deviation values
2: c_velocity ← 0  ▷ number of distinctive velocity levels
3: c_onset ← 0  ▷ number of distinctive onset deviations
4: for i ← 2 to n do  ▷ n = number of notes in a track
5:   if x_i ∉ {x_1, ..., x_{i-1}} then
6:     c ← c + 1  ▷ add 1 to c (c_velocity or c_onset) if the value is new
7: return c_velocity or c_onset
8: c_velocity-ratio = c_velocity ÷ 127 × 100
9: c_onset-ratio = c_onset ÷ TPQN × 100
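For readers who prefer runnable code to pseudocode, the following is a small Python sketch in the spirit of Heuristic 1: it counts distinct velocity levels and distinct onset deviations from a quantized grid, then normalizes them by 127 and by the TPQN, respectively. The note representation (plain lists of velocities and onset ticks) and the choice of a sixteenth-note grid and nearest grid line as the reference position are assumptions made for illustration, not the paper's exact settings.

```python
def onset_deviations(onset_ticks: list[int], tpqn: int, grid_division: int = 4) -> list[int]:
    """Deviation of each onset (in ticks) from the nearest line of a quantized grid.

    grid_division = 4 spaces the grid at sixteenth notes (TPQN / 4); this resolution
    is an assumption for the sketch.
    """
    step = tpqn // grid_division
    return [onset - round(onset / step) * step for onset in onset_ticks]

def dnvr_dnodr(velocities: list[int], onset_ticks: list[int], tpqn: int) -> tuple[float, float]:
    """Distinctive Note Velocity Ratio and Distinctive Note Onset Deviation Ratio (in %)."""
    c_velocity = len(set(velocities))                         # distinct velocity levels
    c_onset = len(set(onset_deviations(onset_ticks, tpqn)))   # distinct onset deviations
    dnvr = c_velocity / 127 * 100   # 127 possible non-zero MIDI velocity values
    dnodr = c_onset / tpqn * 100    # normalized by ticks per quarter note
    return dnvr, dnodr

# Toy track mirroring the worked example above:
# four distinct velocities and four distinct onset deviations ({-5, 0, 5, 10}).
print(dnvr_dnodr([64, 72, 80, 88, 64], [0, 475, 965, 1450, 1920], tpqn=480))
```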
able for non-expressive and expressive tracks in our binary classification task.

The curated set for threshold selection and evaluation is split into 80% training for the threshold selection (Section 5.1) and 20% testing for the evaluation (Section 5.2) to prevent data leakage. The heuristics for expressive music performance detection, described in Section 4, are assessed for classification accuracy on this testing set.

5.1 Threshold Selection of Heuristics for Expressive Music Performance Detection
The threshold denotes the optimal value delineating the boundary between NE and EP tracks. A significant challenge in identifying the threshold stems from the limited availability of dependable ground-truth instances for NE and EP tracks.

The curation process involves manually inspecting tracks for velocity and microtiming variations to achieve a 100% confidence level in ground truths. Subsets failing to meet this level are strictly excluded from consideration. We selected 361 NE and 361 EP tracks and assigned binary labels: 0 for NE and 1 for EP tracks. Our curated set consists of:
1. Non-expressive (361 instances): ASAP (Foscarin et al., 2020) score tracks.
2. Expressively-performed (361 instances): ASAP performance tracks, Vienna 4x22 Piano Corpus (Goebl, 1999), Saarland music data (Müller et al., 2011), Groove MIDI (Gillick et al., 2019), and Batik-plays-Mozart Corpus (Hu and Widmer, 2023).
For the curated set, we intentionally balanced the number of instances across classes to avoid bias. In imbalanced datasets, classification accuracy can be misleadingly high, especially in a two-class setup, where a classifier could achieve high accuracy by predominantly predicting the majority class if one class has significantly more instances (e.g., 10 times more). This bias reduces the model's ability to generalize and perform well on unseen data, especially if both classes are important. As a result, the classification accuracy, precision and recall metrics can become unreliable, making it difficult to assess the true effectiveness of the heuristics, particularly in detecting or distinguishing the minority class.

To tackle this, balancing the dataset enables a more reliable option for evaluating the classification task, even for baseline heuristics. We partially excluded the Groove MIDI and ASAP subsets from the curated set: had we included them entirely, the curated set would initially have contained roughly 10 times more expressively-performed instances than non-expressive ones. A total of 361 instances were selected, as this was the maximum number of non-expressive instances with available ground truth data.

We employ logistic regression (LR, Kleinbaum et al., 2002) alongside leave-one-out cross-validation (LOOCV, Wong, 2015) to determine thresholds using ground truths of NE and EP classes. LR estimates each class probability for binary classification between NE and EP class tracks. LOOCV assesses model performance iteratively by training on all but one data point and testing on the excluded point, ensuring comprehensive evaluation. This is particularly beneficial for small datasets to avoid reliance on specific train-test splits. During this task, the ML regressor is solely used for threshold identification rather than classification. The high accuracy of the ML regressor facilitates optimal threshold identification without arbitrary threshold selection.

Heuristic | Threshold | P4
Distinct Velocity | 52 | 0.7727
Distinct Onset | 42 | 0.7225
DNVR | 40.965% | 0.7727
DNODR | 4.175% | 0.9529
NOMML | Level 12 | 0.9952

Table 3: Optimal threshold selection results based on the 80% training set, showing the optimal threshold value for each heuristic where the P4 value is maximized.

After completing the machine learning classifier's training phase, efforts are directed toward identifying the classifier's optimal boundary point to maximize the P4 metric. However, relying solely on the P4 metric for threshold selection proves inadequate, as it may not comprehensively capture all pertinent aspects of the underlying scenarios.

We manually examine the training set to establish percentile boundaries for distinguishing NE and EP classes based on ground truth data. Specifically, we identify the maximum P4 metric within the 80% training set. Using this boundary range, we determine the optimal threshold index in a feature array that maximizes the P4 metric, which is then used to extract the corresponding threshold for our heuristic. This feature array contains all feature values for each heuristic. The optimal threshold index, selected based on our ML regression model and P4 score, identifies the optimal threshold from the feature array. For example, the optimal threshold for the NOMML heuristic is found at level 12, corresponding to the 63.85th percentile, yielding a P4 score of 0.9952, with similar information available for other heuristics in Table 3. Detailed steps for selecting optimal thresholds for each heuristic are provided in the Supplementary File: Appendix B.

It is important to note that the analysis in this section is speculative, relying on observations from Tables 4 and 5 without direct supporting evidence at this stage. Later, in the evaluation in Section 5.2, we provide corresponding results that substantiate these preliminary insights.
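The threshold-selection procedure described above can be sketched as follows with scikit-learn. The P4 metric is computed here as the harmonic mean of precision, recall, specificity, and negative predictive value, and the scan over candidate thresholds drawn from the observed feature values is our own simplification of the paper's procedure, not its exact implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict

def p4_score(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Harmonic mean of precision, recall, specificity and NPV (one formulation of P4)."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 4 * tp * tn + (tp + tn) * (fp + fn)
    return float(4 * tp * tn / denom) if denom else 0.0

def select_threshold(feature: np.ndarray, labels: np.ndarray) -> float:
    """Pick the feature value that maximizes P4 on the training set.

    feature: one heuristic value per track (e.g. a NOMML level); labels: 0 = NE, 1 = EP.
    """
    x = feature.reshape(-1, 1)
    # LOOCV check that a one-feature logistic regression separates the two classes.
    loocv_pred = cross_val_predict(LogisticRegression(), x, labels, cv=LeaveOneOut())
    print("LOOCV P4 of the logistic-regression model:", p4_score(labels, loocv_pred))
    # Scan candidate thresholds taken from the observed feature values.
    best_value, best_p4 = None, -1.0
    for candidate in np.unique(feature):
        pred = (feature >= candidate).astype(int)
        score = p4_score(labels, pred)
        if score > best_p4:
            best_value, best_p4 = candidate, score
    return best_value
```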
Class | Distinct-Onset (D-O) & Distinct-Velocity (D-V)
NE (62.5%) | D-O < 42 & D-V < 52
EO (7.2%) | D-O >= 42 & D-V < 52
EV (27.4%) | D-O < 42 & D-V >= 52
EP (2.9%) | D-O >= 42 & D-V >= 52

Table 4: Detection results (%) for expressive performance in each MIDI track class within the GigaMIDI dataset. The analysis is based on the number of distinct velocity levels (Distinct-Velocity: D-V) and onset time deviations (Distinct-Onset: D-O). Categories include non-expressive (NE), expressive-onset (EO), expressive-velocity (EV), and expressively-performed (EP).

Class | c_onset-ratio (O-R) & c_velocity-ratio (V-R)
NE (52.3%) | O-R < 4.175% & V-R < 40.965%
EO (9.1%) | O-R >= 4.175% & V-R < 40.965%
EV (24.2%) | O-R < 4.175% & V-R >= 40.965%
EP (14.4%) | O-R >= 4.175% & V-R >= 40.965%

Table 5: Results (%) of expressive performance detection for each MIDI track class in GigaMIDI based on the calculation of c_onset-ratio (DNODR) and c_velocity-ratio (DNVR).

Tables 4 and 5 display the distribution of the GigaMIDI dataset across four distinct classes (Figure 1), using optimal thresholds derived from our baseline heuristics (distinct velocity levels and onset time deviations) and DNVR/DNODR heuristics. With the baseline heuristics (Table 4), class distribution accuracy is limited due to the prevalence of short-length drum and melody loop tracks in GigaMIDI, which the baseline heuristics do not account for. In contrast, results using the DNVR/DNODR heuristics (Table 5) show improved class identification, especially for EP and NE tracks, as these heuristics consider MIDI track length, accommodating short loops with around 100 notes. Although the DNVR/DNODR heuristics provide more accurate distributions, both are less robust than the distribution of the NOMML heuristic, as shown in Figure 7 (a).

Figure 7 (a) illustrates the distribution of NOMML for MIDI tracks in the GigaMIDI dataset. The analysis reveals that the majority of MIDI tracks fall within three distinct bins (bins: 0, 2, and 12), encompassing a cumulative percentage of 86.1%. This discernible pattern resembles a bimodal distribution, distinguishing between NE and EP class tracks.

Figure 7 (a) shows 69% of MIDI tracks in GigaMIDI are NE class, and 31% of GigaMIDI are EP class tracks (NOMML: 12). Our curated version of GigaMIDI utilizing NOMML level 12 as a threshold is provided. This curated version consists of 869,513 files (81.59% single-track and 18.41% multi-track files) or 1,655,649 tracks (28.18% drum and 71.82% non-drum tracks). The distribution of MIDI instruments in the curated version is displayed in Figure 7 (b), indicating that piano and drum tracks are the predominant components.

5.2 Evaluation of Heuristics for Expressive Performance Detection

Detection Heuristics | Classification Accuracy | Ranking
Distinct Velocity | 77.9% | 4
Distinct Onset | 77.9% | 4
DNVR | 83.4% | 3
DNODR | 98.2% | 2
NOMML | 100% | 1

Table 6: Classification accuracy of each heuristic for expressive performance detection.

In our evaluation results (Table 6), the NOMML heuristic clearly outperforms the other heuristics, achieving the highest accuracy at 100%. Additionally, onset-based heuristics generally show better accuracy than velocity-based ones. This suggests that distinguishing velocity levels poses a greater challenge. For instance, in the ASAP subset, non-expressive score tracks, which encode traditional dynamics through velocity, display fluctuations rather than a fixed velocity level, whereas these tracks are aligned to a quantized grid, making onset-based detection more straightforward. However, we recognize that accuracy alone does not provide a complete understanding, prompting further investigation.

Heuristic (%) | TP | TN | FP | FN | CN
Distinct Vel. | 35.4 | 42.5 | 21.2 | 0.9 | 98.0
Distinct On. | 24.8 | 53.1 | 10.6 | 11.5 | 82.2
DNVR | 35.4 | 48.0 | 21.2 | 0.9 | 98.2
DNODR | 34.5 | 63.7 | 0 | 1.77 | 97.3
NOMML | 36.3 | 63.7 | 0 | 0 | 100

Table 7: True-Positives (TP), True-Negatives (TN), False-Positives (FP), False-Negatives (FN), and Correct-Negatives (CN), in percentage, based on the threshold set by P4 for each heuristic.

To further investigate, we also report TP, TN, FP, FN and CN as metrics (shown in Table 7) for assessing the reliability of our heuristics using the optimal thresholds in expressive performance detection, where "True" denotes expressive instances and "False" signifies non-expressive instances. Thus, investigating the capacity to achieve a higher correct-negative rate, CN = TN / (TN + FN), holds significance in this context, as it assesses the reliable discriminatory power against NE instances, as well as EP instances. As a result, NOMML achieves a 100% CN rate, and the other heuristics perform reasonably well.
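To make the class assignments in Tables 4 and 5 and the correct-negative rate in Table 7 concrete, here is a brief sketch using the DNODR/DNVR thresholds reported above; the function and variable names are ours, not part of the released code.

```python
def expressive_class(dnodr: float, dnvr: float,
                     onset_thr: float = 4.175, vel_thr: float = 40.965) -> str:
    """Four-way class from the DNODR/DNVR ratios, using the thresholds of Table 5 (in %)."""
    onset_expressive = dnodr >= onset_thr
    velocity_expressive = dnvr >= vel_thr
    if onset_expressive and velocity_expressive:
        return "EP"   # expressively performed: both microtiming and velocity vary
    if onset_expressive:
        return "EO"   # expressive onset only
    if velocity_expressive:
        return "EV"   # expressive velocity only
    return "NE"       # non-expressive

def correct_negative_rate(tn: int, fn: int) -> float:
    """CN = TN / (TN + FN): how reliably non-expressive tracks are recognized."""
    return tn / (tn + fn) if (tn + fn) else 0.0
```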
Figure 7: Distribution of MIDI tracks according to (a) NOMML (level between 0 and 12, where k = 6) for MIDI
tracks in GigaMIDI. NOMML heuristic investigates duple and triplet onsets, including onsets that cannot be
categorized as duple or triplet-based MIDI grids, and (b) instruments for expressively-performed tracks in
the GigaMIDI dataset.
data for non-Western music, such as Asian classical or Latin/African styles, would reduce Western bias and offer a more inclusive resource for global music research, supporting cross-cultural studies.

8. Data Accessibility and Ethical Statements
The GigaMIDI dataset consists of MIDI files acquired via the aggregation of previously available datasets and web scraping from publicly available online sources. Each subset is accompanied by source links, copyright information when available, and acknowledgments. File names are anonymized using MD5 hash encryption. We acknowledge the work from the previous dataset papers (Goebl, 1999; Müller et al., 2011; Raffel, 2016; Bosch et al., 2016; Miron et al., 2016; Donahue et al., 2018; Crestel et al., 2018; Li et al., 2018; Hawthorne et al., 2019; Gillick et al., 2019; Wang et al., 2020; Foscarin et al., 2020; Callender et al., 2020; Ens and Pasquier, 2021; Hung et al., 2021; Sarmento et al., 2021; Zhang et al., 2022; Szelogowski et al., 2022; Liu et al., 2022; Ma et al., 2022; Kong et al., 2022; Hyun et al., 2022; Choi et al., 2022; Plut et al., 2022; Hu and Widmer, 2023; Ryu et al., 2024) that we aggregate and analyze as part of the GigaMIDI subsets.

This dataset has been collected, utilized, and distributed under the Fair Dealing provisions for research and private study outlined in the Canadian Copyright Act (Government of Canada, 2024). Fair Dealing permits the limited use of copyright-protected material without the risk of infringement and without having to seek the permission of copyright owners. It is intended to provide a balance between the rights of creators and the rights of users. As per instructions of the Copyright Office of Simon Fraser University12, two protective measures have been put in place that are deemed sufficient given the nature of the data (accessible online):
1. We explicitly state that this dataset has been collected, used, and distributed under the Fair Dealing provisions for research and private study outlined in the Canadian Copyright Act.
2. On the Hugging Face hub, we advertise that the data is available for research purposes only and collect the user's legal name and email as proof of agreement before granting access.
We thus decline any responsibility for misuse.

The FAIR (Findable, Accessible, Interoperable, Reusable) principles (Jacobsen et al., 2020) serve as a framework to ensure that data is well-managed, easily discoverable, and usable for a broad range of purposes in research. These principles are particularly important in the context of data management to facilitate open science, collaboration, and reproducibility.
• Findable: Data should be easily discoverable by both humans and machines. This is typically achieved through proper metadata, traceable source links and searchable resources. Applying this to MIDI data, each subset of MIDI files collected from public domain sources is accompanied by clear and consistent metadata via our GitHub and Hugging Face hub webpages. For example, organizing the source links of each data subset, as done with the GigaMIDI dataset, ensures that each source can be easily traced and referenced, improving discoverability.
• Accessible: Once found, data should be easily retrievable using standard protocols. Accessibility does not necessarily imply open access, but it does mean that data should be available under well-defined conditions. For the GigaMIDI dataset, hosting the data on platforms like Hugging Face Hub improves accessibility, as these platforms provide efficient data retrieval mechanisms, especially for large-scale datasets. Ensuring that MIDI data is accessible for public use while respecting any applicable licenses supports wider research and analysis in music computing.
• Interoperable: Data should be structured in such a way that it can be integrated with other datasets and used by various applications. MIDI data, being a widely accepted format in music research, is inherently interoperable, especially when standardized metadata and file formats are used. By ensuring that the GigaMIDI dataset complies with widely adopted standards and supports integration with state-of-the-art libraries in symbolic music processing, such as Symusic and MidiTok, the dataset enhances its utility for music researchers and practitioners working across different platforms and systems.
• Reusable: Data should be well-documented and licensed to be reused in future research. Reusability is ensured through proper metadata, clear licenses, and documentation of provenance. In the case of GigaMIDI, aggregating all subsets from public domain sources and linking them to the original sources strengthens the reproducibility and traceability of the data. This practice allows future researchers to not only use the dataset but also verify and expand upon it by referring to the original data sources.

Developing ethical and responsible AI systems for music requires adherence to core principles of fairness, transparency, and accountability. The creation of the GigaMIDI dataset reflects a commitment to these values, emphasizing the promotion of ethical practices in data usage and accessibility. Our work aligns with prominent initiatives promoting ethical approaches to AI in music, such as AI for Music Initiatives13, which advocates for principles guiding the ethical creation of music with AI, supported by the Metacreation Lab for Creative AI14 and the Centre for Digital Music15, which provide critical guidelines for the responsible develop-
Bellini, P., Bruno, I., and Nesi, P. (2005). Automatic formatting of music sheets through MILLA rule-based language and engine. Journal of New Music Research, 34(3):237–257. [Link] 09298210500236051.

Berndt, A. and Hähnel, T. (September, 2010). Modelling musical dynamics. In Proceedings of the Audio Mostly Conference on Interaction with Sound, Piteå, Sweden.

Born, G. (2020). Diversifying MIR: Knowledge and real-world challenges, and new interdisciplinary futures. Transactions of the International Society for Music Information Retrieval, 3(1). [Link] 5334/tismir.58.

Bosch, J. J., Marxer, R., and Gómez, E. (2016). Evaluation and combination of pitch estimation methods for melody extraction in symphonic classical music. Journal of New Music Research, 45(2):101–117. [Link]

Boulanger-Lewandowski, N., Bengio, Y., and Vincent, P. (June, 2012). Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription. In Proceedings of the International Conference on Machine Learning, Edinburgh, Scotland.

Briot, J.-P. (2021). From artificial neural networks to deep learning for music generation: history, concepts and trends. Journal of Neural Computing and Applications, pages 39–65. [Link] s00521-020-05399-0.

Briot, J.-P. and Pachet, F. (2020). Deep learning for music generation: challenges and directions. Journal of Neural Computing and Applications, pages 981–993. [Link] 6.

Brunner, G., Konrad, A., Wang, Y., and Wattenhofer, R. (September, 2018). MIDI-VAE: Modeling dynamics and instrumentation of music with applications to style transfer. In Proceedings of the International Society for Music Information Retrieval Conference, Paris, France.

Callender, L., Hawthorne, C., and Engel, J. (2020). Improving perceptual quality of drum transcription with the expanded groove MIDI dataset. [Link] (Last accessed: 27th of October 2023).

Cancino-Chacón, C., Peter, S. D., Karystinaios, E., Foscarin, F., Grachten, M., and Widmer, G. (May, 2022). Partitura: a Python package for symbolic music processing. In Proceedings of the Music Encoding Conference, Halifax, Canada.

Cancino-Chacón, C. E., Grachten, M., Goebl, W., and Widmer, G. (2018). Computational models of expressive music performance: A comprehensive and critical review. Journal of Frontiers in Digital Humanities, 5. [Link]

Choi, E., Chung, Y., Lee, S., Jeon, J., Kwon, T., and Nam, J. (December, 2022). YM2413-MDB: A multi-instrumental FM video game music dataset with emotion annotations. In Proceedings of the International Society for Music Information Retrieval Conference, Bengaluru, India.

Collins, T. and Barthet, M. (November, 2023). Expressor: A Transformer Model for Expressive MIDI Performance. In Proceedings of the 16th International Symposium on Computer Music Multidisciplinary Research (CMMR), Tokyo, Japan.

Crauwels, K. (2016). Musicmap. [Link] (Last accessed: January 4th, 2024).

Crawford, T. and Lewis, R. (2016). Review: Music Encoding Initiative. Journal of the American Musicological Society, 69(1):273–285. [Link] 1525/jams.2016.69.1.273.

Crestel, L., Esling, P., Heng, L., and McAdams, S. (September, 2018). A database linking piano and orchestral MIDI scores with application to automatic projective orchestration. In Proceedings of the International Society for Music Information Retrieval Conference, Paris, France.

Cros Vila, L. and Sturm, B. L. T. (August, 2023). Statistical evaluation of ABC-formatted music at the levels of items and corpora. In Proceedings of AI Music Creativity (AIMC), Brighton, UK.

Cuthbert, M. S. and Ariza, C. (August, 2010). music21: A toolkit for computer-aided musicology and symbolic music data. In Proceedings of the International Society for Music Information Retrieval Conference, Utrecht, Netherlands.

Dannenberg, R. B. (November, 2006). The interpretation of MIDI velocity. In Proceedings of the International Computer Music Conference, New Orleans, United States.

Dieleman, S., Van Den Oord, A., and Simonyan, K. (December, 2018). The Challenge of Realistic Music Generation: Modelling Raw Audio at Scale. In Proceedings of the Conference on Neural Information Processing Systems, Montreal, Canada.

Donahue, C., Mao, H. H., and McAuley, J. (September, 2018). The NES music database: A multi-instrumental dataset with expressive performance attributes. In Proceedings of the International Society for Music Information Retrieval Conference, Paris, France.
Edwards, D., Dixon, S., Benetos, E., Maezawa, A., and Kusaka, Y. (2024). A data-driven analysis of robust automatic piano transcription. IEEE Signal Processing Letters. [Link] 2024.3363646.

Ens, J. and Pasquier, P. (November, 2021). Building the MetaMIDI dataset: Linking symbolic and audio musical data. In Proceedings of the International Society for Music Information Retrieval Conference, Online.

Ens, J. and Pasquier, P. (October, 2020). MMM: Exploring conditional multi-track music generation with the transformer. In Proceedings of the International Society for Music Information Retrieval Conference, Montreal, Canada.

Foscarin, F., McLeod, A., Rigaux, P., Jacquemard, F., and Sakai, M. (October, 2020). ASAP: a dataset of aligned scores and performances for piano transcription. In Proceedings of the International Society for Music Information Retrieval Conference, Montreal, Canada.

Fradet, N., Briot, J.-P., Chhel, F., Seghrouchni, A. E. F., and Gutowski, N. (November, 2021). MidiTok: A Python package for MIDI file tokenization. In Proceedings of the International Society for Music Information Retrieval Conference, Online.

Gillick, J., Roberts, A., Engel, J., Eck, D., and Bamman, D. (June, 2019). Learning to groove with inverse sequence transformations. In Proceedings of the International Conference on Machine Learning, Long Beach, United States.

Goddard, C., Barthet, M., and Wiggins, G. (2018). Assessing musical similarity for computational music creativity. Journal of the Audio Engineering Society, 66(4):267–276. [Link] 2018.0012.

Goebl, W. (1999). The Vienna 4x22 piano corpus. [Link] (Last accessed: 24th of October 2024).

Goebl, W. (2001). Melody lead in piano performance: Expressive device or artifact? The Journal of the Acoustical Society of America, 110(1):563–572. [Link]

Gómez-Marín, D., Jordà, S., and Herrera, P. (2020). Drum rhythm spaces: From polyphonic similarity to generative maps. Journal of New Music Research, 49(5):438–456. [Link] 09298215.2020.1806887.

Government of Canada (2024). The Canadian Copyright Act, RSC 1985, c. C-42, s. 29 (fair dealing for research and private study). [Link] 42/[Link] (Last accessed: 20th of November 2024).

Grachten, M. and Widmer, G. (2012). Linear basis models for prediction and analysis of musical expression. Journal of New Music Research, 41(4):311–322. [Link]

Hawthorne, C., Stasyuk, A., Roberts, A., Simon, I., Huang, C.-Z. A., Dieleman, S., Elsen, E., Engel, J., and Eck, D. (May, 2019). Enabling factorized piano music modeling and generation with the MAESTRO dataset. In Proceedings of the International Conference on Learning Representations (ICLR), New Orleans, United States.

Hernandez-Olivan, C. and Beltran, J. R. (2022). Music composition with deep learning: a review. Journal of Advances in Speech and Music Technology: Computational Aspects and Applications, pages 25–50. https://[Link]/10.48550/arXiv.2108.12290.

Hu, P. and Widmer, G. (November, 2023). The Batik-plays-Mozart Corpus: Linking Performance to Score to Musicological Annotations. In Proceedings of the International Society for Music Information Retrieval Conference, Milan, Italy.

Huang, C.-Z. A., Vaswani, A., Uszkoreit, J., Shazeer, N., Simon, I., Hawthorne, C., Dai, A. M., Hoffman, M. D., Dinculescu, M., and Eck, D. (May, 2019). Music Transformer. In Proceedings of the International Conference on Learning Representations (ICLR), New Orleans, United States.

Hung, H.-T., Ching, J., Doh, S., Kim, N., Nam, J., and Yang, Y.-H. (November, 2021). EMOPIA: A multi-modal pop piano dataset for emotion recognition and emotion-based music generation. In Proceedings of the International Society for Music Information Retrieval Conference, Online.

Hyun, L., Kim, T., Kang, H., Ki, M., Hwang, H., Park, K., Han, S., and Kim, S. J. (December, 2022). ComMU: Dataset for combinatorial music generation. In Proceedings of the Conference on Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, New Orleans, United States.

Jacobsen, A., de Miranda Azevedo, R., Juty, N., Batista, D., Coles, S., Cornet, R., Courtot, M., Crosas, M., Dumontier, M., Evelo, C. T., Goble, C., Guizzardi, G., Hansen, K. K., Hasnain, A., Hettne, K., Heringa, J., Hooft, R. W., Imming, M., Jeffery, K. G., Kaliyaperumal, R., Kersloot, M. G., Kirkpatrick, C. R., Kuhn, T., Labastida, I., Magagna, B., McQuilton, P., Meyers, N., Montesanti, A., van Reisen, M., Rocca-Serra, P., Pergl, R., Sansone, S.-A., da Silva Santos, L. O. B., Schneider, J., Strawn, G., Thompson, M., Waagmeester, A., Weigel, T., Wilkinson, M. D., Willighagen, E. L., Wittenburg, P., Roos, M., Mons, B., and
Schultes, E. (2020). FAIR Principles: Interpreta- Mauch, M., MacCallum, R. M., Levy, M., and Leroi,
tions and Implementation Considerations. Data In- A. M. (2015). The evolution of popular music: USA
telligence, 2(1-2):10–29. [Link] 1960–2010. Royal Society open science, 2(5):150081.
dint_r_00024. [Link]
Kim, H., Miron, M., and Serra, X. (December, 2022). Meroño-Peñuela, A., Hoekstra, R., Gangemi, A., Bloem,
Note level MIDI velocity estimation for piano per- P., de Valk, R., Stringer, B., Janssen, B., de Boer,
formance. In Proceedings of the International Soci- V., Allik, A., Schlobach, S., et al. (2017). The MIDI
ety for Music Information Retrieval Conference, Ben- linked data cloud. In The Semantic Web–ISWC 2017:
galuru, India. 16th International Semantic Web Conference, Vienna,
Austria, October 21-25, 2017, Proceedings, Part II
Kleinbaum, D. G., Dietz, K., Gail, M., Klein, M., and 16, pages 156–164. Springer. [Link]
Klein, M. (2002). Logistic regression. Springer. 1007/978-3-319-68204-4_16.
[Link]
MIDI Association (1996a). The complete MIDI 1.0 de-
Kong, Q., Li, B., Chen, J., and Wang, Y. (2022). tailed specification.
GiantMIDI-Piano: A large-scale MIDI dataset for clas- [Link]
sical piano music. Transactions of the International (Last accessed: 6th of April 2024).
Society for Music Information Retrieval. [Link]
org/10.5334/tismir.80. MIDI Association (1996b). General MIDI instrument
group and mapping.
Kuo, C.-S., Chen, W.-K., Liu, C.-H., and You, S. D. [Link]
(September, 2021). Velocity prediction for MIDI (Last accessed: 6th of April 2024).
notes with deep learning. In Proceedings of IEEE Inter-
national Conference on Consumer Electronics-Taiwan, Miron, M., Carabias-Orti, J. J., Bosch, J. J., Gómez,
Penghu, Taiwan. E., Janer, J., et al. (2016). Score-informed source
separation for multichannel orchestral recordings.
Li, B., Liu, X., Dinesh, K., Duan, Z., and Sharma, G. Journal of Electrical and Computer Engineering, 2016.
(2018). Creating a multitrack classical music perfor- [Link]
mance dataset for multimodal music analysis: Chal-
lenges, insights, and applications. IEEE Transactions Müller, M., Konz, V., Bogler, W., and Arifi-Müller, V.
on Multimedia, 21(2):522–535. [Link] (October, 2011). Saarland music data (SMD).
10.48550/arXiv.1612.08727.
Ortega, F. J., Giraldo, S. I., Perez, A., and Ramírez, R.
Li, T., Ogihara, M., and Tzanetakis, G. (2012). Music (2019). Phrase-level modelling of expression in vio-
data mining. CRC Press Boca Raton. [Link] lin performances. Journal of Frontiers in Psychology,
org/10.1201/b11041. page 776. [Link]
00776.
Licata, T. (2002). Electroacoustic music: analytical per-
spectives. Bloomsbury Publishing USA. ISBN-10: Palmer, C. (1997). Music performance. Annual review
0313314209. of psychology, 48(1):115–138. [Link]
1146/[Link].48.1.115.
Liu, J., Dong, Y., Cheng, Z., Zhang, X., Li, X., Yu, F.,
and Sun, M. (December, 2022). Symphony genera- Payne, C. (2019). MuseNet (OpenAI).
tion with permutation invariant language model. In [Link]
Proceedings of the International Society for Music In- (Last accessed: 27th of October 2023).
formation Retrieval Conference, Bengaluru, India.
Plut, C., Pasquier, P., Ens, J., and Tchemeube, R.
Ma, X., Liu, X., Zhang, B., and Wang, Y. (December, (2022). The IsoVAT Corpus: Parameterization of Mu-
2022). Robust melody track identification in sym- sical Features for Affective Composition. Transactions
bolic music. In Proceedings of International Society for of the International Society for Music Information Re-
Music Information Retrieval Conference, Bengaluru, trieval (TISMIR), 5(1). [Link]
India. tismir.120.
Manzelli, R., Thakkar, V., Siahkamari, A., and Kulis, Qiu, L., Li, S., and Sung, Y. (2021). DBTMPE:
B. (September, 2018). Conditioning deep generative Deep bidirectional transformers-based masked pre-
raw audio models for structured automatic music. In dictive encoder approach for music genre classifica-
Proceedings of International Society for Music Informa- tion. Mathematics, 9(5):530. [Link]
tion Retrieval Conference, Paris, France. 3390/math9050530.
18 Lee, K. et al: The GigaMIDI Dataset with Features for Expressive Music Performance Detection
Raffel, C. (2016). The Lakh MIDI dataset v0.1. Simonetta, F., Carnovalini, F., Orio, N., and Rodà,
[Link] A. (September, 2018). Symbolic music similarity
(Last accessed: 27th of November 2023). through a graph-based representation. In Proceed-
ings of the Audio Mostly Conference, Wrexham, United
Raffel, C. and Ellis, D. P. (March, 2016a). Optimizing Kingdom.
DTW-based audio-to-MIDI alignment and matching.
In Proceedings of International Conference on Acous- Sitarz, M. (2022). Extending F1 metric, probabilistic
tics, Speech and Signal Processing, Shanghai, China. approach. Journal of Advances in Artificial Intelligence
and Machine Learning. [Link]
Raffel, C. and Ellis, D. P. W. (August, 2016b). Extract- arXiv.2210.11997.
ing Ground Truth Information from MIDI Files: A
MIDIfesto. In Proceedings of of the International So- Szelogowski, D., Mukherjee, L., and Whitcomb, B. (De-
ciety for Music Information Retrieval Conference, New cember, 2022). A novel dataset and deep learning
York, United States. benchmark for classical music form recognition and
analysis. In Proceedings of International Society for
Repp, B. H. (1997a). Acoustics, perception, and
Music Information Retrieval Conference, Bengaluru,
production of legato articulation on a computer-
India.
controlled grand piano. The Journal of the Acousti-
cal Society of America, 102(3):1878–1890. https: Tang, J., Wiggins, G., and Fazekas, G. (November,
//[Link]/10.1121/1.420110. 2023). Reconstructing human expressiveness in pi-
ano performances with a Transformer network. In
Repp, B. H. (1997b). The aesthetic quality of a quan-
Proceedings of International Symposium on Computer
titatively average music performance: Two prelimi-
Music Multidisciplinary Research (CMMR), Tokyo,
nary experiments. Music Perception, 14(4):419–444.
Japan.
[Link]
Ryu, J., Rhyu, S., Yoon, H.-G., Kim, E., Yang, J. Y., von Rütte, D., Biggio, L., Kilcher, Y., and Hofmann, T.
and Kim, T. (February, 2024). MID-FiLD: MIDI (May, 2023). FIGARO: generating symbolic music
dataset for fine-level dynamics. In Proceedings of the with fine-grained artistic control. In Proceedings of
AAAI Conference on Artificial Intelligence, Vancouver, International Conference on Learning Representations,
Canada. Kigali, Rwanda.
Sarmento, P., Kumar, A., Carr, C., Zukowski, Z., Wang, Z., Chen, K., Jiang, J., Zhang, Y., Xu, M., Dai,
Barthet, M., and Yang, Y.-H. (November, 2021). S., Gu, X., and Xia, G. (October, 2020). POP909: A
DadaGP: A dataset of tokenized GuitarPro songs for pop-song dataset for music arrangement generation.
sequence models. In Proceedings of International So- In Proceedings of International Society for Music Infor-
ciety for Music Information Retrieval Conference, On- mation Retrieval Conference, Montreal, Canada.
line.
Wong, T.-T. (2015). Performance evaluation of classifi-
Sarmento, P., Kumar, A., Chen, Y.-H., Carr, C., cation algorithms by k-fold and leave-one-out cross
Zukowski, Z., and Barthet, M. (April, 2023a). GTR- validation. Pattern recognition, 48(9):2839–2846.
CTRL: Instrument and genre conditioning for guitar- [Link]
focused music generation with transformers. In Pro-
Zeng, M., Tan, X., Wang, R., Ju, Z., Qin, T., and Liu,
ceedings of the EvoMUSART Conference, Brno, Czech
T.-Y. (August, 2021). Musicbert: Symbolic music un-
Republic.
derstanding with large-scale pre-training. In Proceed-
Sarmento, P., Kumar, A., Xie, D., Carr, C., Zukowski, ings of the Joint Conference of the Annual Meeting of
Z., and Barthet, M. (November, 2023b). ShredGP: the Association for Computational Linguistics and the
Guitarist Style-Conditioned Tablature Generation. In International Joint Conference on Natural Language
The 16th International Symposium on Computer Mu- Processing, Bangkok, Thailand.
sic Multidisciplinary Research, Tokyo, Japan.
Zhang, H., Tang, J., Rafee, S. R. M., and Fazekas, S.
Schaffrath, H. (1995). The Essen folksong collection. D. G. (December, 2022). ATEPP: A dataset of auto-
[Link] matically transcribed expressive piano performance.
(Last accessed: 15th of September 2023). In Proceedings of International Society for Music Infor-
mation Retrieval Conference, Bengaluru, India.
Shih, Y.-J., Wu, S.-L., Zalkow, F., Muller, M., and Yang,
Y.-H. (2022). Theme Transformer: symbolic mu- Zhang, N. (2020). Learning Adversarial Transformer
sic generation with theme-conditioned transformer. for Symbolic Music Generation. IEEE Transactions on
IEEE Transactions on Multimedia. [Link] Neural Networks and Learning Systems. [Link]
10.48550/arXiv.2111.04093. org/10.1109/TNNLS.2020.2990746.
19 Lee, K. et al: The GigaMIDI Dataset with Features for Expressive Music Performance Detection
A. Additional figures
A.1 Descriptive statistics of the GigaMIDI Dataset
Figure 11: Distribution of each drum MIDI instrument event in GigaMIDI. The legend groups the drum instruments into three relative-frequency levels, indicated by colour hue (blue hues: low-range frequency, green hues: mid-range frequency, red hues: high-range frequency).
A.2 Distribution of the number of distinct MIDI note velocity levels and of onset time deviations
B. Model Selection and Hyperparameter Settings for Optimal Threshold Selection of Heuristics
for Expressive Music Performance Detection
B.1 Machine Learning (ML) Model Selection
Following a series of comparative experiments with logistic regression, decision trees, and random forests, each implemented using the scikit-learn library, logistic regression was chosen as the most suitable machine learning algorithm for determining the optimal thresholds that differentiate non-expressive from expressive MIDI tracks. This selection was based on the manually collected ground-truth data, which were used to evaluate the candidate models and to make the final choice.
The choice of a machine learning model for identifying optimal thresholds between two classes, such as
non-expressive and expressively-performed MIDI tracks, requires careful consideration of the data’s specific char-
acteristics and the analysis goals. Logistic regression is often favoured when the relationship between the input
features and the target class is approximately linear. This model provides a clear, interpretable framework for
classification by modelling the probability that a given input belongs to one of the two classes. The output of
logistic regression is a continuous probability score between 0 and 1, which allows for straightforward determina-
tion and adjustment of the decision threshold. This simplicity and directness make logistic regression particularly
appealing when the primary objective is to identify a reliable and easily interpretable threshold.
However, logistic regression has limitations, particularly when the true relationship between the features and
the outcome is non-linear or complex. In such cases, decision trees and random forests offer more flexibility.
Decision trees can capture non-linear interactions between features by partitioning the feature space into distinct
regions associated with a specific class. Random forests, as ensembles of decision trees, enhance this flexibility
by averaging the predictions of multiple trees, thereby reducing variance and improving generalization. These
models can model complex relationships that logistic regression might miss, making them more suitable for
datasets where the linear assumption of logistic regression does not hold.
Regarding threshold determination, logistic regression has a distinct advantage due to its probabilistic output.
The model naturally provides a probability estimate for each instance, and a threshold can be easily applied to
classify instances into one of the two classes. This straightforward approach to threshold selection is one of the key reasons logistic regression is often chosen for tasks requiring clear and interpretable decision boundaries. In contrast, decision trees and random forests do not produce probability scores in the same direct way. Although they can be adapted to output probabilities, using the class distribution within the leaf nodes of a decision tree or averaged across the trees of a random forest, this process is more involved and can make threshold selection less intuitive.
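As a concrete illustration of this thresholding step, the following sketch (ours, for illustration only; the toy feature values and the 0.5 threshold are assumptions, not values from our pipeline) shows how a custom decision threshold can be applied to scikit-learn logistic regression probabilities:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: one heuristic value per track (e.g., NOMML), labelled
# 0 = non-expressive and 1 = expressive.
X = np.array([[0.02], [0.05], [0.10], [0.15], [0.80], [0.85], [0.90], [0.97]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression(random_state=0).fit(X, y)
proba = model.predict_proba(X)[:, 1]       # probability of the "expressive" class

threshold = 0.5                            # assumed value; raising it favours precision over recall
labels = (proba >= threshold).astype(int)  # final class decisions at the chosen threshold

Moving the threshold away from 0.5 directly trades precision against recall without retraining the model.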
In our computational experiment, the logistic regression machine learning model, combined with manual
threshold inspection for validation, was found to be sufficient for identifying the optimal threshold for each
heuristic. This approach was particularly effective given the simplicity of the task, which involved a single feature
for each of the three key metrics—Distinctive Note Velocity Ratio (DNVR), Distinctive Note Onset Deviation Ratio
(DNODR), and Note Onset Median Metric Level (NOMML)—and the classification of data into two categories:
non-expressive and expressive tracks. The problem at hand, being a straightforward binary classification task
using a supervised learning algorithm, aligned well with the capabilities of logistic regression, thereby rendering
it an appropriate choice for our optimal threshold selection.
The logistic regression model for the Distinctive Note Onset Deviation Ratio (DNODR) heuristic trained within 10 minutes, and the model for the Note Onset Median Metric Level (NOMML) heuristic trained in 3 minutes on our MacBook Pro.
For hyperparameter tuning, we employed the scikit-learn implementation of logistic regression, a library widely recognized in the machine learning community for its efficiency and versatility. We used its GridSearchCV utility, which performs an exhaustive search over a specified parameter grid to identify the most effective hyperparameters for the logistic regression model. GridSearchCV systematically explores the combinations of specified hyperparameter values and evaluates model performance with cross-validation scores, in this case derived from the LOOCV process.
The hyperparameters tuned during this process were the regularization parameter C (in scikit-learn, the inverse of the regularization strength), which controls the trade-off between fitting the training data closely and generalizing to unseen data, and the choice of regularization method (L1 or L2). By conducting an exhaustive search over these parameters, we aimed to identify the configuration that minimizes the validation error across all iterations of the LOOCV. This tuning step is important because these hyperparameters can significantly affect logistic regression's performance, particularly in the presence of imbalanced data or correlated features. The result is a logistic regression model tuned to perform well under the specific conditions of our dataset and evaluation framework.
The following parameters and model configuration were determined for the logistic regression model through hyperparameter tuning with leave-one-out cross-validation and GridSearchCV in the scikit-learn library. Notably, the same optimal hyperparameters were identified for all three models, one per heuristic.
• Hyperparameter for the logistic regression models: C=0.046415888336127774
• Logistic regression setting details using the scikit-learn Python ML library:
LogisticRegression(random_state=0, C=0.046415888336127774, max_iter=10000, tol=0.1)
This configuration represents the optimal hyperparameters identified through comprehensive parameter ex-
ploration using GridSearchCV and LOOCV, thereby ensuring the logistic regression model’s robust performance.
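For concreteness, the listing below sketches the tuning set-up described above. It is our illustration rather than the released pipeline code: the candidate grid for C (np.logspace(-4, 4, 13), which contains the reported value of approximately 0.0464), the liblinear solver (chosen so that both L1 and L2 penalties can be searched), the accuracy scoring, and the toy ground-truth arrays are assumptions.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, LeaveOneOut

# Toy stand-in for the manually curated ground truth: one heuristic value per
# track, labelled 0 = non-expressive and 1 = expressive.
X = np.array([[0.02], [0.05], [0.10], [0.15], [0.80], [0.85], [0.90], [0.97]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

param_grid = {
    "C": np.logspace(-4, 4, 13),  # assumed grid; it contains C = 0.046415888336127774
    "penalty": ["l1", "l2"],      # the two regularization methods compared in the text
}
search = GridSearchCV(
    LogisticRegression(random_state=0, max_iter=10000, tol=0.1, solver="liblinear"),
    param_grid,
    cv=LeaveOneOut(),             # leave-one-out cross-validation (LOOCV)
    scoring="accuracy",           # single-sample test folds make accuracy the natural per-fold score
)
search.fit(X, y)
print(search.best_params_)        # best configuration found on the (toy) curated set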
In the precision and recall analysis used for threshold selection, higher thresholds increase precision but reduce recall, whereas lower thresholds have the opposite effect.
The precision and recall analysis is related to the P4 metric in that both are used to evaluate model performance, especially on imbalanced datasets with small sample sizes. Precision and recall measure, respectively, the accuracy of positive predictions and the model's ability to identify all positive cases. The P4 metric builds on these by also rewarding correct classification of the minority class, which makes it particularly useful when the dataset is imbalanced and the sample size is small. While precision and recall help select candidate thresholds, the P4 metric provides a more tailored validation criterion for scenarios where the minority class is of primary concern.
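For reference, the P4 metric is the harmonic mean of precision, recall, specificity, and negative predictive value (NPV); in terms of confusion-matrix counts it can be written as

\[
P_4 = \frac{4}{\frac{1}{\mathrm{precision}} + \frac{1}{\mathrm{recall}} + \frac{1}{\mathrm{specificity}} + \frac{1}{\mathrm{NPV}}}
    = \frac{4\,\mathrm{TP}\cdot\mathrm{TN}}{4\,\mathrm{TP}\cdot\mathrm{TN} + (\mathrm{TP}+\mathrm{TN})(\mathrm{FP}+\mathrm{FN})},
\]

so a value of 1 is reached only when there are no false positives and no false negatives.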
Following the precision and recall analysis, we convert the identified threshold value into the corresponding
feature value. For instance, to translate a P4 metric threshold value (0.9952) into the corresponding Note Onset
Median Metric Level (NOMML), the index of the threshold value is determined within the threshold array derived
from the precision-recall curve analysis, ensuring that the P4 metric is maximized. This index is then used to
extract the corresponding feature value from the NOMML list. As a result, the threshold is set at the corresponding
percentile within our curated set used during the optimal threshold selection, establishing the boundary between
non-expressive and expressively-performed ground truth data. Finally, we perform a manual review to verify that
the selected thresholds align with the distribution of feature values within the dataset. Following this, we identify
the optimal percentiles for these thresholds by analyzing the distribution of the feature values.
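A minimal sketch of this conversion is given below. It is our illustration rather than the released code: the toy NOMML values, predicted probabilities, and labels, the p4_score helper, and the final mapping from the selected probability threshold back to a NOMML boundary value are all assumptions made for the example.

import numpy as np
from sklearn.metrics import precision_recall_curve

# Toy stand-ins: NOMML value per track, the model's predicted probability of
# "expressive" for that track, and the ground-truth label.
nomml_values  = np.array([0.02, 0.05, 0.10, 0.15, 0.80, 0.85, 0.90, 0.97])
probabilities = np.array([0.05, 0.10, 0.20, 0.30, 0.70, 0.80, 0.90, 0.95])
y_true        = np.array([0, 0, 0, 0, 1, 1, 1, 1])

def p4_score(y_true, y_pred):
    # P4 from the confusion-matrix counts (see the formula above).
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    denom = 4 * tp * tn + (tp + tn) * (fp + fn)
    return 4 * tp * tn / denom if denom else 0.0

# Candidate probability thresholds from the precision-recall curve analysis.
precision, recall, thresholds = precision_recall_curve(y_true, probabilities)

# Pick the threshold that maximizes P4 ...
p4_values = [p4_score(y_true, (probabilities >= t).astype(int)) for t in thresholds]
best_threshold = thresholds[int(np.argmax(p4_values))]

# ... and translate it into a NOMML feature value: the smallest NOMML value among
# tracks whose probability clears the threshold marks the class boundary.
boundary_nomml = nomml_values[probabilities >= best_threshold].min()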