
Keon Ju Maverick Lee, Jeff Ens, Sara Adkins, Pedro Sarmento, Mathieu Barthet, Philippe Pasquier (2025). The GigaMIDI Dataset with Features for Expressive Music Performance Detection. Transactions of the International Society for Music Information Retrieval, V(N), pp. xx–xx, DOI: [Link]

DATASET

The GigaMIDI Dataset with Features for Expressive Music Performance Detection

Keon Ju Maverick Lee, Jeff Ens, Sara Adkins, Pedro Sarmento, Mathieu Barthet, Philippe Pasquier

Abstract
The Musical Instrument Digital Interface (MIDI), introduced in 1983, revolutionized music production by allowing computers and instruments to communicate efficiently. MIDI files encode musical instructions compactly, facilitating convenient music sharing. They benefit Music Information Retrieval (MIR), aiding in research on music understanding, computational musicology, and generative music. The GigaMIDI dataset contains over 1.4 million unique MIDI files, encompassing 1.8 billion MIDI note events and over 5.3 million MIDI tracks. GigaMIDI is currently the largest collection of symbolic music in MIDI format available for research purposes under fair dealing. Distinguishing between non-expressive and expressive MIDI tracks is challenging, as MIDI files do not inherently make this distinction. To address this issue, we introduce a set of innovative heuristics for detecting expressive music performance: the Distinctive Note Velocity Ratio (DNVR) heuristic, which analyzes MIDI note velocity; the Distinctive Note Onset Deviation Ratio (DNODR) heuristic, which examines deviations in note onset times; and the Note Onset Median Metric Level (NOMML) heuristic, which evaluates onset positions relative to metric levels. Our evaluation demonstrates that these heuristics effectively differentiate between non-expressive and expressive MIDI tracks. Using the best-performing heuristic, NOMML, we then curate the largest expressive MIDI dataset to date. This curated iteration of GigaMIDI comprises expressively-performed instrument tracks detected by NOMML, covers all General MIDI instruments, and constitutes 31% of the GigaMIDI dataset, totalling 1,655,649 tracks.

Keywords: MIDI Dataset, Computational Musicology, Expressive Music Performance Detection

1. Introduction
The representation of digital music can be categorized into two main forms: the audio and symbolic domains. Audio representations of musical signals characterize sounds produced by acoustic or digital sources (e.g. acoustic musical instruments, vocals, found sounds, virtual instruments, etc.) in an uncompressed or compressed way. In contrast, symbolic representation of music relies on a notation system to characterize the musical structures created by a composer or resulting from a performance (e.g., scores, tablatures, MIDI performances). While audio representations intrinsically encode signal aspects correlated to timbre, this is not the case for symbolic representations; however, symbolic representations may refer to timbral identity (e.g. a cello staff) and expressive features correlated with timbre (e.g. pianissimo or forte dynamics) through notations.

Multiple encoding formats are employed for the representation of music. WAV is frequently utilized to store uncompressed audio, thereby retaining nuanced timbral attributes. In contrast, MIDI serves as a prevalent format for the symbolic storage of music data. MIDI embraces a multitrack architecture to represent musical information, enabling the generation of a score representation through score editor software. This process encompasses diverse onset timings and velocity levels, facilitating quantification and encoding of these musical events (MIDI Association, 1996a).

The choice of training dataset significantly influences deep learning models, as is particularly highlighted in the development of symbolic music generation models (Brunner et al., 2018; Huang et al., 2019; Payne, 2019; Ens and Pasquier, 2020; Briot and Pachet, 2020; Briot, 2021; Hernandez-Olivan and Beltran, 2022; Shih et al., 2022; von Rütte et al., 2023; Adkins et al., 2023). Consequently, MIDI datasets have gained increased

attention as one of the main resources for training these deep learning models. Within automatic music generation via deep learning, end-to-end models use digital audio waveform representations of musical signals as input (Zukowski and Carr, 2017; Manzelli et al., 2018; Dieleman et al., 2018). Automatic music generation based on symbolic representations (Raffel and Ellis, 2016b; Zhang, 2020) uses digital notations to represent musical events from a composition or performance; these can be contained, e.g., in a digital score, a tablature (Sarmento et al., 2023a,b), or a piano-roll. Moreover, symbolic music data can be leveraged in computational musicology to analyze the vast corpus of music using MIR and music data mining techniques (Li et al., 2012).

In computational creativity and musicology, a critical aspect is distinguishing between non-expressive performances, which are mechanical renditions of a score, and expressive performances, which reflect variations that convey the performer's personality and style. MIDI files are commonly produced through score editors or by recording human performances using MIDI instruments, which allow for adjustments in parameters, such as velocity or pressure, to create expressively performed tracks.

However, MIDI files typically do not contain metadata distinguishing between non-expressive and expressive performances, and most MIR research has focused on file-level rather than track-level analysis. File-level analysis examines global attributes like duration, tempo, and metadata, aiding structural studies, while track-level analysis explores instrumentation and arrangement details. Note-level analysis provides the most granular insights, focusing on pitch, velocity, and microtiming to reveal expressive characteristics. Together, these hierarchical levels form a comprehensive framework for studying MIDI data and understanding expressive elements of musical performances.

Our work categorizes MIDI tracks into two types: non-expressive tracks, defined by fixed velocities and quantized rhythms (though expressive performances may also exhibit some degree of quantization), and expressive tracks, which feature microtiming variations compared to the nominal durations indicated on the score, as well as dynamics variations, translating into velocity changes across and within notes. To address this, we introduce novel heuristics in Section 4 for detecting expressive music performances by analyzing microtimings and velocity levels to differentiate between expressive and non-expressive MIDI tracks.

The main contributions of this work can be summarized as follows: (1) the GigaMIDI dataset, which encompasses over 1.4 million MIDI files and over five million instrument tracks and is the largest open-source MIDI dataset for research purposes to date; (2) novel heuristics (Heuristics 1 and 2) tailored explicitly for detecting expressive music performance in MIDI tracks, applied to each instrument track in the GigaMIDI dataset, with the resulting values used to evaluate the expressiveness of tracks in GigaMIDI; (3) the evaluation results (Section 5.2) of each heuristic, provided to facilitate expressive music performance research; and (4) through the application of our best-performing heuristic, as determined by our evaluation, the largest MIDI dataset of expressive performances, specifically incorporating instrument tracks beyond those associated with piano and drums (constituting 31% of the GigaMIDI dataset), totalling over 1.6 million expressively-performed MIDI tracks.

2. Background
Before exploring the GigaMIDI dataset, we examine symbolic music datasets in the existing literature. This sets the stage for our discussion of MIDI's musical expression and performance aspects, laying the groundwork for understanding our heuristics for detecting expressive music performance from MIDI data.

2.1 Symbolic Music Data
Symbolic formats refer to the representation of music through symbolic data, such as MIDI files, rather than audio recordings (Zeng et al., 2021). Symbolic music understanding involves analyzing and interpreting music based on its symbolic data, namely information about musical notation, music theory and formalized music concepts (Simonetta et al., 2018).

Dataset          Format      Files     Hours    Instruments
GigaMIDI         MIDI        >1.43M    >40,000  Misc.
MetaMIDI         MIDI        436,631   >20,000  Misc.
Lakh MIDI        MIDI        174,533   >9,000   Misc.
DadaGP           Guitar Pro  22,677    >1,200   Misc.
ATEPP            MIDI        11,677    1,000    Piano
Essen Folk Song  ABC         9,034     56.62    Piano
NES Music        MIDI        5,278     46.1     Misc.
MID-FiLD         MIDI        4,422     >40      Misc.
MAESTRO          MIDI        1,282     201.21   Piano
Groove MIDI      MIDI        1,150     13.6     Drums
JSB Chorales     MusicXML    382       >4       Misc.

Table 1: Sample of symbolic datasets in multiple formats, including the MIDI, ABC, MusicXML and Guitar Pro formats.

Symbolic formats have practical applications in music information processing and analysis. Symbolic music processing involves manipulating and analyzing symbolic music data, which can be more efficient and easier to interpret than lower-level representations of music, such as audio files (Cancino-Chacón et al., 2022).

The Musical Instrument Digital Interface (MIDI) is a technical standard that enables electronic musical instruments and computers to communicate by transmitting event messages that encode information such as pitch, velocity, and timing. This protocol has

become integral to music production, allowing for the efficient representation and manipulation of musical data (Meroño-Peñuela et al., 2017). MIDI datasets, which consist of collections of MIDI files, serve as valuable resources for musicological research, enabling large-scale analyses of musical trends and styles. For instance, studies utilizing MIDI datasets have explored the evolution of popular music (Mauch et al., 2015) and facilitated advancements in music transcription technologies through machine learning techniques (Qiu et al., 2021). The application of MIDI in various domains underscores its significance in both the creative and analytical aspects of contemporary music.

Symbolic music processing has gained attention in the MIR community, and several music datasets are available in symbolic formats (Cancino-Chacón et al., 2022). Symbolic representations of music can be used for style classification, emotion classification, and music piece matching (Zeng et al., 2021). Symbolic formats also play a role in the automatic formatting of music sheets. XML-compliant formats, such as the WEDEL format, include constructs describing integrated music objects, including symbolic music scores (Bellini et al., 2005). Besides that, the Music Encoding Initiative (MEI) is an open, flexible format for encoding music scores in a machine-readable way. It allows for detailed representation of musical notation and metadata, making it ideal for digital archiving, critical editions, and musicological research (Crawford and Lewis, 2016).

ABC notation is a text format used to represent music symbolically, particularly favoured in folk music (Cros Vila and Sturm, 2023). It offers a human-readable method for notating music, with elements represented using letters, numbers, and symbols. This format is easily learned, written, and converted into standard notation or MIDI files using software, enabling convenient sharing and playback of musical compositions.

Csound notation, part of the Csound software, symbolically represents electroacoustic music (Licata, 2002). It controls sonic parameters precisely, fostering complex compositions blending traditional and electronic elements. This enables innovative experimentation in contemporary music. Max Mathews' MUSIC 4, developed in 1962, laid the groundwork for Csound, introducing key musical concepts to computing programs.

With the proliferation of deep learning approaches, often driven by the need for vast amounts of data, the creation and curation of symbolic datasets have been active in this research area. The MIDI format can be considered the most common format for symbolic music datasets, despite alternatives such as the Essen folk music database in ABC format (Schaffrath, 1995), the JSB chorales dataset available via the MusicXML format and Music21 (Boulanger-Lewandowski et al., 2012; Cuthbert and Ariza, 2010), and the Guitar Pro tablature format (Sarmento et al., 2021).

Focusing on MIDI, Table 1 showcases symbolic music datasets. MetaMIDI (Ens and Pasquier, 2021) is a collection of 436,631 MIDI files. MetaMIDI comprises a substantial collection of multi-track MIDI files primarily derived from an extensive music corpus characterized by longer durations. Approximately 57.9% of MetaMIDI files include a drum track.

The Lakh MIDI dataset (LMD) encompasses a collection of 174,533 MIDI files (Raffel, 2016) and introduces an audio-to-MIDI alignment matching technique (Raffel and Ellis, 2016a), which is also utilized in MetaMIDI for matching musical styles when scraped style metadata is unavailable.

2.2 Music Expression and Performance Representations of MIDI
We use the terms expressive MIDI, human-performed MIDI, and expressive machine-generated MIDI interchangeably to describe MIDI files that capture expressively-performed (EP) tracks, as illustrated in Figure 1. EP-class MIDI tracks capture performances by human musicians or producers, emulate the nuances of live performance, or are generated by machines trained with deep learning algorithms. These tracks incorporate variations of features, such as timing, dynamics, and articulation, to convey musical expression.

From the perspective of music psychology, analyzing expressive music performance involves understanding how variations of, e.g., timing, dynamics and timbre (Barthet et al., 2010) relate to performers' intentions and influence listeners' perceptions. Repp's research demonstrates that expressive timing deviations, like rubato, enhance listeners' perception of naturalness and musical quality by aligning with their cognitive expectations of flow and structure (Repp, 1997b). Palmer's work further reveals that expressive timing and dynamics are not random but result from skilled motor planning, as musicians use mental representations of music to execute nuanced timing and dynamic changes that reflect their interpretive intentions (Palmer, 1997).

Our focus lies on two main types of MIDI tracks: non-expressive and expressive. Non-expressive MIDI tracks exhibit relatively fixed velocity levels and onset deviations, resulting in metronomic and mechanical rhythms. In contrast, expressive MIDI tracks feature subtle temporal deviations (non-quantized but humanized or human-performed) and greater variations in velocity levels associated with dynamics.

2.2.1 Non-expressive and expressively-performed MIDI tracks
MIDI files are typically produced in two ways (excluding synthetic data from generative music systems): using a score/piano-roll editor or recording a human performance. MIDI controllers and instruments, such as a keyboard and pads, can be utilized to adjust the

Figure 1: Four classes (NE = non-expressive, EO = expressive-onset, EV = expressive-velocity, and EP = expressively-performed) using the heuristics in Section 4.2 for the expressive performance detection of MIDI tracks in GigaMIDI.

parameters of each note played, such as velocity and pressure, to produce expressively-performed MIDI. Being able to distinguish non-expressive and expressive MIDI tracks is useful in MIR applications. However, MIDI files do not accommodate such distinctions within their metadata. MIDI track-level analysis for music expression has received less attention from MIR researchers than MIDI file-level analysis. Previous research on interpreting MIDI velocity levels (Dannenberg, 2006) and modelling dynamics/expression (Berndt and Hähnel, 2010; Ortega et al., 2019) has been conducted, and a comprehensive review of computational models of expressive music performance is available in (Cancino-Chacón et al., 2018). The generation of expressive musical performances using a case-based reasoning system (Arcos et al., 1998) has been studied in the context of tenor saxophone interpretation, as has the modelling of virtuosic bass guitar performances (Goddard et al., 2018). Velocity prediction/estimation using deep learning was introduced at the MIDI note level (Kuo et al., 2021; Kim et al., 2022; Collins and Barthet, 2023; Tang et al., 2023).

2.2.2 Music expression and performance datasets
The aligned scores and performances (ASAP) dataset has been developed specifically for annotating non-expressive and expressively-performed MIDI tracks (Foscarin et al., 2020). Comprising 222 digital musical scores synchronized with 1068 performances, ASAP encompasses over 92 hours of Western classical piano music. This dataset provides paired MusicXML and quantized MIDI files for scores, along with paired MIDI files and partial audio recordings for performances. The alignment of ASAP includes annotations for downbeat, beat, time signature, and key signature, making it notable for its incorporation of music scores aligned with MIDI and audio performance data. The MID-FiLD dataset (Ryu et al., 2024) is the sole dataset offering detailed dynamics for Western orchestral instruments. However, it primarily focuses on creating expressive dynamics via MIDI Control Change #1 (modulation wheel) and lacks velocity variations, featuring predominantly constant velocities, as verified by our manual inspection. In contrast, the GigaMIDI dataset focuses on expressive performance detection through variations of microtimings and velocity levels.

The MAESTRO (Hawthorne et al., 2019) and Groove MIDI (Gillick et al., 2019) datasets focus on single instruments, specifically piano and drums, respectively. Despite their narrower scope, these datasets are noteworthy for including MIDI files exclusively performed by human musicians. The Saarland music data (SMD) contains piano performance MIDI files and audio recordings, but SMD only contains 50 files (Müller et al., 2011). The Vienna 4x22 Piano Corpus (Goebl, 1999) and the Batik-Plays-Mozart MIDI dataset (Hu and Widmer, 2023) both provide valuable resources for studying classical piano performance. The Vienna 4x22 Piano Corpus features high-resolution recordings of 22 pianists performing four classical pieces, aimed at analyzing expressive elements like timing and dynamics across performances. Meanwhile, the Batik-Plays-Mozart dataset offers MIDI recordings of Mozart pieces performed by the pianist Batik, capturing detailed performance data such as note timing and velocity. Together, these datasets support research in performance analysis and machine learning applications in music.

The Automatically Transcribed Expressive Piano Performances (ATEPP) dataset (Zhang et al., 2022) was devised for capturing performer-induced expressiveness by transcribing audio piano performances into MIDI format. ATEPP addresses inaccuracies inherent in the automatic music transcription process. Similarly, the GiantMIDI piano dataset (Kong et al., 2022), akin to ATEPP, comprises AI-transcribed piano tracks that encapsulate expressive performance nuances. However, we excluded the ATEPP and GiantMIDI piano datasets from our expressive music performance detection task. State-of-the-art transcription models are known to overfit the MAESTRO dataset (Edwards et al., 2024) due to its recordings originating from a controlled piano competition setting. These performances, all played on similar Yamaha Disklavier pianos under concert hall conditions, result in consistent acoustic and timbral characteristics. This uniformity restricts the models' ability to generalize to out-of-distribution data, contributing to the observed overfitting.

3. GigaMIDI Data Collection
We present the GigaMIDI dataset in this section along with its descriptive statistics, such as the MIDI instrument groups, the number of MIDI notes, ticks per quarter note, and musical style. Additional descriptive statistics are in Supplementary File 1: Appendix A.1.

3.1 Overview of GigaMIDI Dataset
The GigaMIDI dataset is a superset of the MetaMIDI dataset (Ens and Pasquier, 2021), and it contains 1,437,304 unique MIDI files with 5,334,388 MIDI instrument tracks and 1,824,536,824 (over 10^9; hence the prefix "Giga") MIDI note events. The GigaMIDI dataset includes 56.8% single-track and 43.2% multi-track MIDI files. It contains 996,164 drum tracks and 4,338,224 non-drum tracks. The initial version of the dataset consisted of 1,773,996 MIDI files. Approximately 20% of the dataset was subjected to a cleaning process, which included deduplication achieved by verifying and comparing the MD5 checksums of the files. While we integrated certain publicly accessible MIDI datasets from previous research endeavours, it is noteworthy that over 50% of the GigaMIDI dataset was acquired through web scraping and organized by the authors.
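The checksum-based deduplication mentioned above relies only on standard-library functionality. The following is a minimal sketch rather than the authors' actual pipeline; the folder name is a placeholder:

```python
import hashlib
from pathlib import Path

def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 checksum of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def deduplicate_midi_files(root: str) -> dict[str, Path]:
    """Keep one representative path per unique MD5 checksum."""
    unique: dict[str, Path] = {}
    for path in Path(root).rglob("*.mid"):
        checksum = md5_of_file(path)
        unique.setdefault(checksum, path)  # first occurrence wins
    return unique

if __name__ == "__main__":
    kept = deduplicate_midi_files("midi_corpus")  # placeholder folder name
    print(f"{len(kept)} unique MIDI files")
```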
The GigaMIDI dataset includes per-track loop detection, adapting the loop detection and extraction algorithm presented in (Adkins et al., 2023) to MIDI files. In total, 7,108,181 loops with lengths ranging from 1 to 8 bars were extracted from GigaMIDI tracks, covering all types of MIDI instruments. Details and analysis of the loops extracted from the GigaMIDI dataset will be shared in a companion paper via our GitHub page.

3.2 Collection and Preprocessing of GigaMIDI Dataset
The authors manually collected and aggregated the GigaMIDI dataset, applying our heuristics for MIDI-based expressive music performance detection. This aggregation process was designed to make large-scale symbolic music data more accessible to music researchers.

Regarding data collection, we manually gathered freely available MIDI files from online sources such as Zenodo, GitHub, and public MIDI repositories by web scraping. The source links for each subset are provided via our GitHub webpage. During aggregation, files were organized and deduplicated by comparing MD5 hash values. We also standardized each subset to the General MIDI (GM) specification, ensuring coherence; for example, non-GM drum tracks were remapped to GM. Manual curation was employed to assess the suitability of the files for expressive music performance detection, with particular attention to defining ground-truth tracks for the expressive and non-expressive categories. This process involved systematically identifying the characteristics of expressive and non-expressive MIDI track subsets by manually checking the characteristics of MIDI tracks in each subset. The curated subsets were subsequently analyzed and incorporated into the GigaMIDI dataset to facilitate the detection of expressive music performance.

To improve accessibility, the GigaMIDI dataset has been made available on the Hugging Face Hub. Early feedback from researchers in music computing and MIR indicates that this platform offers better usability and convenience compared to alternatives such as GitHub and Zenodo. This platform enhances data preprocessing efficiency and supports seamless integration with workflows, such as MIDI parsing and tokenization using Python libraries like Symusic and MidiTok (Fradet et al., 2021), as well as deep learning model training using Hugging Face. Additionally, the raw metadata of the GigaMIDI dataset is hosted on the Hugging Face Hub; see Section 8.
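As an illustration of the Hugging Face workflow just described, the sketch below downloads a local snapshot of a dataset repository and enumerates its MIDI files. The repository id is a placeholder rather than the actual GigaMIDI identifier; parsing or tokenization with Symusic or MidiTok would follow from here.

```python
from pathlib import Path

from huggingface_hub import snapshot_download

# Download a local copy of a dataset repository from the Hugging Face Hub.
# The repository id below is a placeholder; substitute the real GigaMIDI
# repository name given in Section 8 / the project's GitHub page.
local_dir = snapshot_download(
    repo_id="<gigamidi-repo-id>",  # placeholder, not the real id
    repo_type="dataset",
)

# Enumerate the MIDI files in the snapshot; tokenization or parsing with
# Symusic/MidiTok can then be applied to these paths.
midi_paths = sorted(Path(local_dir).rglob("*.mid"))
print(f"Found {len(midi_paths)} MIDI files")
```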
As part of preprocessing GigaMIDI, single-track drum files allocated to MIDI channel 1 are re-encoded. This serves the dual purpose of ensuring their accurate representation on MIDI channel 10, the drum channel, while mitigating the risk of misidentification as a piano track, denoted by channel 1. Details of MIDI channels are explained in Section 3.3.1.

Furthermore, all drum tracks in the GigaMIDI dataset were standardized through remapping based on the General MIDI (GM) drum mapping guidelines (MIDI Association, 1996b) to ensure consistency. Detailed information about the drum remapping process can be accessed via GitHub. In addition, the distribution of drum instruments, categorized and visualized by their relative frequencies, is presented in Appendix A.1 (Gómez-Marín et al., 2020).
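A minimal sketch of the channel re-encoding step using the mido library (mido numbers channels 0-15, so MIDI channel 10 corresponds to index 9). This is an illustration under those assumptions, not the authors' preprocessing code, and the file paths are placeholders.

```python
import mido

def move_drums_to_channel_10(in_path: str, out_path: str) -> None:
    """Re-encode all channel messages of a single-track drum file onto the
    GM drum channel (MIDI channel 10, i.e. index 9 in mido's 0-15 range)."""
    mid = mido.MidiFile(in_path)
    for track in mid.tracks:
        for i, msg in enumerate(track):
            if hasattr(msg, "channel"):          # note, CC, program change, ...
                track[i] = msg.copy(channel=9)   # channel 10 in 1-based terms
    mid.save(out_path)

move_drums_to_channel_10("drum_loop.mid", "drum_loop_ch10.mid")  # placeholder paths
```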
3.3 Descriptive Statistics of the GigaMIDI Dataset
3.3.1 MIDI Instrument Group

Figure 2: Distribution of the duration in bars of the files from each subset of the GigaMIDI dataset. The X-axis is clipped to 300 for better readability.

The GigaMIDI dataset is divided into three primary subsets: "all-instrument-with-drums", "drums-only", and "no-drums". The "all-instrument-with-drums" subset comprises 22.78% of the dataset and includes multi-track MIDI files with drum tracks. The "drums-only" subset makes up 56.85% of the dataset, containing only drum tracks, while the "no-drums" subset (20.37%) consists of both multi-track and single-track MIDI files without drum tracks. As shown in Figure 2, drums-only files typically have a high-density distribution and are mostly under 50 bars, reflecting their classification as drum loops. Conversely, multi-track and single-track piano files exhibit a broader range of durations, spanning 10 to 300 bars, with greater diversity in musical structure.

Figure 3: Distribution of files in GigaMIDI according to (a) MIDI notes, and (b) ticks per quarter note (TPQN).

MIDI instrument groups, organized by program numbers, categorize instrument sounds. Each group corresponds to a specific program number range, representing unique instrument sounds. For instance, program numbers 1 to 8 on MIDI channel 1 are associated with the piano instrument group (acoustic piano, electric piano, harpsichord, etc.). The analysis in Table 2 focuses on the occurrence of MIDI note events across the 16 MIDI instrument groups (MIDI Association, 1996b). Channel 10 is typically reserved for the drum instrument group. Although MIDI groups/channels often align with specific instrument types in the General MIDI specification (MIDI Association, 1996a), composers and producers can customize instrument number allocations based on their preferences.

IGN: 1-8   Events   IGN: 9-16       Events
Piano      60.2%    Reed/Pipe       1.1%
CP         2.4%     Drums           17.4%
Organ      1.8%     Synth Lead      0.5%
Guitar     6.7%     Synth Pad       0.6%
Bass       4.2%     Synth FX        0.3%
String     1.1%     Ethnic          0.3%
Ensemble   2.1%     Percussive FX   0.3%
Brass      0.7%     Sound FX        0.3%

Table 2: Number of MIDI note events by instrument group, in percentage (IGN = instrument group number, CP = chromatic percussion, and FX = effect).

The GigaMIDI dataset analysis reveals that most MIDI note events (77.6%) are found in two instrument groups: piano and drums. The piano instrument group has more MIDI note events (60.2%) because most piano-based tracks are longer. The higher number of MIDI notes in piano tracks compared to other instrumental tracks can be attributed to several factors. The inherent nature of piano playing, which involves ten fingers and frequently includes simultaneous chords due to its dual-staff layout, naturally increases note density. Additionally, the piano's wide pitch range, polyphonic capabilities, and versatility in musical roles allow it to handle melodies, harmonies, and accompaniments simultaneously. Piano tracks are often used as placeholders or sketches during composition, and MIDI input is typically performed using a keyboard defaulting to a piano timbre. These characteristics, combined with the cultural prominence of the piano and the practice of condensing multiple parts into a single piano track for convenience, result in a higher density of notes in MIDI datasets.

The GigaMIDI dataset includes a significant proportion of drum tracks (17.4% of note events), which are generally shorter and contain fewer note events compared to piano tracks. This is primarily because many drum tracks are designed for drum loops and grooves rather than for full-length musical compositions. The supplementary file provides a detailed distribution of note events for drum sub-tracks, including each drum MIDI instrument in the GigaMIDI dataset. Sound effects, including breath noise, bird tweets, telephone rings, applause, and gunshot sounds, exhibit minimal usage, accounting for only 0.249% of the dataset. Chromatic percussion (2.4%) stands for pitched percussion, such as glockenspiel, vibraphone, marimba, and xylophone.

3.3.2 Number of MIDI Notes and Ticks Per Quarter Note
Figure 3 (a) shows the distribution of the number of MIDI notes in GigaMIDI. According to our data analysis, the span from the 5th to the 95th percentile covers 13 to 931 notes, indicating a significant presence of

short-length drum tracks or loops.

Figure 3 (b) illustrates the distribution of ticks per quarter note (TPQN). TPQN is a unit that measures the resolution or granularity of timing information. Ticks are the smallest indivisible units of time within a MIDI sequence. A higher TPQN value means more precise timing information can be stored in a MIDI sequence. The most common TPQN values are 480 and 960. According to our data analysis of GigaMIDI, common TPQN values range from 96 to 960 between the 5th and 95th percentiles.

3.3.3 Musical Style

Figure 4: Musicmap style topology (Crauwels, 2016).

Figure 5: Distribution of musical style in GigaMIDI.

We provide the GigaMIDI dataset with metadata regarding musical styles. This includes style metadata manually curated by listening to and annotating MIDI files based on the Musicmap style topology (Crauwels, 2016), displayed in Figure 4. We organized all the musical style metadata from our subsets, including remapping drumming styles (Gillick et al., 2019) and DadaGP styles (Sarmento et al., 2021) to the Musicmap topology. The acquisition of scraped style metadata, encompassing audio-text-matched style metadata sourced from the MetaMIDI subset (Ens and Pasquier, 2021), is also conducted. Subsequently, all gathered musical style metadata undergoes conversion to the Musicmap topology for consistency.

The distribution of musical style metadata in the GigaMIDI dataset, illustrated in Figure 5, is based on the Musicmap topology and encompasses 195,737 files annotated with musical style metadata. Notably, prevalent styles include classical, pop, rock, and folk music. These 195,737 style annotations mostly originate from a combination of scraped metadata acquired online, style data present in our subsets, and manual inspection conducted by the authors.

A major challenge in utilizing scraped style metadata from the MetaMIDI subset is ensuring its accuracy. To address this, a subset of the GigaMIDI dataset, consisting of 29,713 MIDI files, was carefully reviewed through music listening and manually annotated with style metadata by a doctoral-level music researcher.

MetaMIDI integrates scraped style metadata and associated labels obtained through an audio-MIDI matching process. However, our empirical assessment, based on manual auditory analysis of musical styles, identified inconsistencies and unreliability in the scraped metadata from the MetaMIDI subset (Ens and Pasquier, 2021). To address this, we manually remapped 9,980 audio-text-matched musical style metadata entries within the MetaMIDI subset, ensuring consistent and accurate musical style classifications. Finally, these remapped musical styles were aligned with the Musicmap topology to provide more uniform and reliable information on musical style.

We provide audio-text-matched musical style metadata from three sources: Discogs, Last.fm, and Tagtraum, collected using the MusicBrainz database.

4. Heuristics for MIDI-based Expressive Music Performance Detection
Our heuristic design centers on analyzing variations in velocity levels and onset time deviations from a metric grid. MIDI velocity replicates the hammer velocity in acoustic pianos, where the force applied to the keys determines the speed of the hammers, subsequently affecting the energy transferred to the strings and, consequently, the amplitude of the resulting vibrations. This concept is integrated into MIDI keyboards, which replicate hammer velocity by using MIDI velocity levels to control the dynamics of the sound. A velocity value of 0 produces no sound, while 127 indicates maximum intensity. Higher velocity values yield louder notes, while lower ones result in softer tones, analogous to dynamics markings like pianissimo or fortissimo in traditional performance. Onset time deviations in MIDI represent the difference between the actual note

timings and their expected positions on a quantized metric grid, with the grid's resolution determined by the TPQN (ticks per quarter note) of the MIDI file. These deviations, often introduced through human performance, play a crucial role in conveying musical expressiveness.
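As a concrete illustration of onset time deviation, the small helper below measures how far a note onset falls from the nearest line of a quantization grid. The grid resolution used here (a 16th-note grid) is an assumption for the example, not a parameter fixed by the paper.

```python
def onset_deviation(onset_tick: int, tpqn: int, subdivisions_per_quarter: int = 4) -> int:
    """Signed deviation (in ticks) of a note onset from the nearest line of a
    quantization grid with `subdivisions_per_quarter` steps per quarter note."""
    step = tpqn // subdivisions_per_quarter        # e.g. 480 TPQN, 16th-note grid -> 120 ticks
    nearest = round(onset_tick / step) * step
    return onset_tick - nearest

# With TPQN = 480 and a 16th-note grid, an onset at tick 487 deviates by +7 ticks.
print(onset_deviation(487, 480))   # -> 7
```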
The primary objective of our proposed heuristics for expressive performance detection is to differentiate between expressive and non-expressive MIDI tracks by analyzing velocity and onset time deviations. This analysis is applied at the MIDI track level, with each instrument track undergoing expressive performance detection. Our heuristics, introduced in the following sections, assess expressiveness by examining velocity variations and microtimings, offering a versatile framework suitable for various GM instruments.

Other related approaches to this task are more specific to acoustic piano performance rather than being tailored to MIDI tracks. Key Overlap Time (Repp, 1997a) and Melody Lead (Goebl, 2001) focus on acoustic piano performances, analyzing legato articulation and melodic timing anticipation, which limits their application to piano contexts. Similarly, Linear Basis Models (Grachten and Widmer, 2012) focus on Western classical instruments, particularly the acoustic piano, and rely on score-based dynamics (e.g., crescendo, fortissimo), making them less applicable to non-classical or non-Western music. Such dynamics can be interpreted in MIDI velocity levels, and our heuristics consider this aspect. Compared to these methods, our heuristics offer broader applicability, addressing dynamic variations and microtiming deviations across a wide range of MIDI instruments, making them suitable for detecting expressiveness in diverse musical contexts.

4.1 Baseline Heuristic: Distinct Number of Velocity Levels and Onset Time Deviations
This baseline heuristic focuses solely on analyzing the count of distinct velocity levels ("distinct velocity") and unique onset time deviations ("distinct onset") without considering the MIDI track length. Generally, longer MIDI tracks show more distinct velocities and onset deviations than shorter ones. Designed as a simpler alternative to the more sophisticated Heuristics 1 and 2, this baseline has limited accuracy for MIDI tracks of varying lengths, as it does not adjust for track duration. However, this was not a significant issue during the heuristic evaluation in Section 5.2, as most tracks in the evaluation set are longer and have limited variance in length.

Our baseline heuristic design counts the number of unique velocity levels and onset time deviations present in a MIDI track. For example, consider a MIDI track where v = [64, 72, 72, 80, 64, 88] represents the MIDI velocity values and o = [-5, 0, 5, -5, 10, 0] represents the onset time deviations in MIDI ticks. Applying our heuristic, we first store only the unique values in each list: for v, the distinct velocity levels are {64, 72, 80, 88}, and for o, the distinct onset time deviations are {-5, 0, 5, 10}. By counting these unique values, we identify four distinct velocity levels and four distinct onset time deviations for this MIDI track, with no deviation (0) being treated as a specific occurrence.
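The baseline computation amounts to counting set sizes. A minimal sketch reproducing the worked example above (function and variable names are ours):

```python
def distinct_counts(velocities: list[int], onset_deviations: list[int]) -> tuple[int, int]:
    """Count distinct velocity levels and distinct onset time deviations in a track."""
    return len(set(velocities)), len(set(onset_deviations))

v = [64, 72, 72, 80, 64, 88]       # velocities from the example above
o = [-5, 0, 5, -5, 10, 0]          # onset deviations (in ticks) from the example above
print(distinct_counts(v, o))       # -> (4, 4)
```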
4.2 Distinctive Note Velocity/Onset Deviation Ratio (DNVR/DNODR)
Distinctive note velocity and onset deviation ratios measure the proportion (in %) of unique MIDI note velocities and onset time deviations in each MIDI track. These metrics form a set of heuristics for detecting expressive performances, classified into four categories: Non-Expressive (NE), Expressive-Onset (EO), Expressive-Velocity (EV), and Expressively-Performed (EP), as shown in Figure 1. The DNVR metric counts unique velocity levels to differentiate between tracks with consistent velocity and those with expressive velocity variation, while the DNODR calculation helps identify MIDI tracks that are either perfectly quantized or have minimal microtiming deviations.

Heuristic 1: Calculation of Distinctive Note Velocity/Onset Deviation Ratio (DNVR/DNODR)
1: x ← [x_1, ..., x_n]                  ▷ list of velocity or onset deviation values
2: c_velocity ← 0                       ▷ number of distinctive velocity levels
3: c_onset ← 0                          ▷ number of distinctive onset deviations
4: for i ← 2 to n do                    ▷ n = number of notes in a track
5:     if x_i ∉ {x_1, ..., x_{i-1}} then
6:         c ← c + 1                    ▷ add 1 to c if there is a new value
7: return c_velocity or c_onset
8: c_velocity-ratio = c_velocity ÷ 127 × 100
9: c_onset-ratio = c_onset ÷ TPQN × 100

Heuristic 1 is proposed to analyze the variation in velocity levels and onset time deviations within a MIDI track. Here, x_velocity holds each track's velocity values, while x_onset contains onset deviations from a quantized MIDI grid based on the track's TPQN. For example, a possible set of values could be x_velocity = {88, 102, ...} and x_onset = {-3, 2, 5, ...}, the latter being represented in ticks. The functions c_velocity and c_onset return the counts of unique velocity levels and onset time deviations per track, respectively. Next, c_onset-ratio is obtained by dividing c_onset by the track's TPQN to represent the proportion of microtiming positions within each quarter note. Similarly, c_velocity-ratio is obtained by dividing c_velocity by 127 (the range of possible velocity levels). Finally, each ratio is converted to a percentage by multiplying by 100.
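In code, Heuristic 1 reduces to two set sizes and two normalizations. The sketch below follows the definitions above (velocity range of 127, onset deviations normalized by TPQN); it is illustrative rather than the authors' implementation.

```python
def dnvr_dnodr(velocities: list[int], onset_deviations: list[int], tpqn: int) -> tuple[float, float]:
    """Distinctive Note Velocity Ratio and Distinctive Note Onset Deviation Ratio (in %)."""
    c_velocity = len(set(velocities))          # distinct velocity levels
    c_onset = len(set(onset_deviations))       # distinct onset deviations (ticks)
    dnvr = c_velocity / 127 * 100              # normalized by the 127 possible velocity levels
    dnodr = c_onset / tpqn * 100               # proportion of microtiming positions per quarter note
    return dnvr, dnodr

# Example with the velocities/deviations used in Section 4.1 and TPQN = 480.
dnvr, dnodr = dnvr_dnodr([64, 72, 72, 80, 64, 88], [-5, 0, 5, -5, 10, 0], tpqn=480)
print(f"DNVR = {dnvr:.2f}%, DNODR = {dnodr:.2f}%")   # -> DNVR = 3.15%, DNODR = 0.83%
```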
4.3 MIDI Note Onset Median Metric Level (NOMML)
Figure 6 displays the classification of various note onsets into duple metric levels 0-5. Let us define k as the parameter that controls the metric level's depth. The duple onset metric level (dup) grid divides the beat
into even subdivisions, such as halves or quarters, capturing rhythms in duple meter. The triplet onset metric level (trip) grid divides the beat into three equal parts, aligning with triplet rhythms commonly found in swing and compound meters. Notably, since the grey-coloured note onset (finer than the 1/128-note metric level) does not belong to any dup_i for 0 ≤ i ≤ 5, it is assigned to the extra category shown in the bottom row because it is finer than the maximum metric level where k = 6. For example, Figure 6 displays the metric level depth. The duple metric level dup_k divides each quarter note into 2^k equal pulses, while the triplet metric level trip_k divides it into (3/2) × 2^k pulses. For our experiments, we choose k = 6. Consequently, the maximum metric levels we consider are dup_5 and trip_5, corresponding to 128th notes. Based on our observation of the data in MIDI tracks, this provides a sufficient level of granularity, given the note durations frequently found in most forms of music.

Figure 6: Example of each duple onset metric level grid in different colours, using circles and dotted lines for the positions of onsets, where k = 6.

Heuristic 2: Calculation of Note Onset Median Metric Level (NOMML)
 1: c ← [ ]                                  ▷ list of metric levels
 2: o ← [o_1, ..., o_n]                      ▷ list of note onsets (in ticks)
 3: TPQN                                     ▷ ticks per quarter note of the MIDI file
 4: for i ← 1 to n do                        ▷ lines 4-9: handle duple onsets
 5:     for j ← 0 to k − 1 do
 6:         p ← TPQN / 2^j                   ▷ periodicity of the duple grid
 7:         if o_i mod p ≡ 0 then
 8:             append 2j to c               ▷ o_i is a multiple of the periodicity
 9:             break
10:     if ||c|| < i then                    ▷ lines 10-15: handle triplet onsets
11:         for j ← 0 to k − 1 do
12:             p ← (2 × TPQN) / (3 × 2^j)   ▷ periodicity of the triplet grid
13:             if o_i mod p ≡ 0 then
14:                 append 2j + 1 to c       ▷ o_i is a multiple of p
15:                 break
16:     if ||c|| < i then                    ▷ handle onsets beyond the grid
17:         append 2k to c                   ▷ k = metric level depth
18: return median(c)

In Heuristic 2, we propose the MIDI note onset median metric level (NOMML), another heuristic for detecting non-expressive and expressively-performed MIDI tracks. This heuristic computes the median metric level of note onsets. The metric level ml(x) for a note onset x is the lowest duple or triplet level that aligns with the onset. Since some pulses overlap between duple and triplet levels, we prioritize duple levels before considering triplets. For instance, with 120 ticks per quarter note, a note onset a at tick 60 aligns with pulses on all metric levels dup_i for i ≥ 1 and trip_j for j ≥ 2. Here, the lowest matching levels are dup_1 and trip_2, so, by prioritizing duple levels, ml(a) = dup_1. Conversely, a note onset b at tick 40 aligns only with triplet levels, resulting in ml(b) = trip_1.

Given a list of note onset times (o), Heuristic 2 calculates the median metric level. The list c is used to store the metric levels for each note onset, so after executing lines 4-17, we have c = [ml(o_1), ..., ml(o_n)]. For example, suppose we have a list of metric levels for note onsets: c = [2, 3, 4, 6, 3, 7, 8, 3, 4]. To calculate the median, we first sort c as follows: c = [2, 3, 3, 3, 4, 4, 6, 7, 8]. Since the list contains 9 values, the median is the middle element, which is the 5th value in the sorted list. Thus, the median metric level for c is 4.

In lines 4-9, the lowest duple metric level is determined for each note onset o_i. The condition in line 10 is met only when o_i does not belong to any duple metric level. Here, ||c|| denotes the current length of c. If o_i does not match a duple level, lines 11-15 determine the lowest triplet metric level. When o_i does not belong to any duple or triplet level, it is assigned to an extra category containing both dup_i and trip_i for any i ≥ k (lines 16-17).

To calculate the median metric level, each level is assigned a unique numerical value. Duple and triplet metric levels are interleaved to ensure a meaningful median: duple levels are represented by even numbers (dup_i = 2i) and triplet levels by odd numbers (trip_i = 2i + 1).
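For readers who prefer running code to pseudocode, the following Python sketch mirrors Heuristic 2 under the level numbering just described (duple level j maps to 2j, triplet level j to 2j + 1, the extra category to 2k). It is our reading of the algorithm rather than the authors' implementation; the integer divisibility tests replace the fractional grid periodicities.

```python
from statistics import median

def note_onset_median_metric_level(onsets: list[int], tpqn: int, k: int = 6) -> float:
    """Median metric level of note onsets: duple level j -> 2j, triplet level j -> 2j + 1,
    onsets finer than both grids -> 2k (the extra category)."""
    levels = []
    for onset in onsets:
        level = 2 * k                                     # default: beyond both grids
        for j in range(k):                                # duple grids, checked first
            if (onset * 2**j) % tpqn == 0:                # multiple of TPQN / 2^j
                level = 2 * j
                break
        else:
            for j in range(k):                            # then triplet grids
                if (onset * 3 * 2**j) % (2 * tpqn) == 0:  # multiple of (2*TPQN) / (3*2^j)
                    level = 2 * j + 1
                    break
        levels.append(level)
    return median(levels)

# With 120 ticks per quarter note: tick 60 -> dup_1 (level 2), tick 40 -> trip_1 (level 3).
print(note_onset_median_metric_level([60, 40, 0], tpqn=120))   # -> 2
```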
5. Threshold and Evaluation of Heuristics for Expressive Music Performance Detection
Optimal threshold selection involves a structured approach to determine the best threshold for distinguishing between non-expressive (NE) and expressively-performed (EP) tracks. A machine learning regressor aids in identifying this threshold, evaluated using metrics such as classification accuracy and the P4 metric (Sitarz, 2022).

P4 = (4 · TP · TN) / (4 · TP · TN + (TP + TN) · (FP + FN))     (1)

The selection of the P4 metric (Equation 1; TP = True-Positives, TN = True-Negatives, FP = False-Positives, and FN = False-Negatives) over the F1 metric is motivated by the small sample size of ground truths available for non-expressive and expressive tracks in our binary classification task.
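Equation (1) translates directly into code. The sketch below uses toy confusion-matrix counts, not values from the paper.

```python
def p4_metric(tp: int, tn: int, fp: int, fn: int) -> float:
    """P4 metric from Equation (1), computed from confusion-matrix counts."""
    numerator = 4 * tp * tn
    return numerator / (numerator + (tp + tn) * (fp + fn))

# Toy counts only (not figures from the paper): 90 TP, 85 TN, 5 FP, 10 FN.
print(round(p4_metric(90, 85, 5, 10), 3))   # -> 0.921
```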

The curated set for threshold selection and evaluation is split into 80% training for threshold selection (Section 5.1) and 20% testing for evaluation (Section 5.2) to prevent data leakage. The heuristics for expressive music performance detection described in Section 4 are assessed for classification accuracy on this testing set.

5.1 Threshold Selection of Heuristics for Expressive Music Performance Detection
The threshold denotes the optimal value delineating the boundary between NE and EP tracks. A significant challenge in identifying the threshold stems from the limited availability of dependable ground-truth instances for NE and EP tracks.

The curation process involves manually inspecting tracks for velocity and microtiming variations to achieve a 100% confidence level in the ground truths. Subsets failing to meet this level are strictly excluded from consideration. We selected 361 NE and 361 EP tracks and assigned binary labels: 0 for NE and 1 for EP tracks. Our curated set consists of:
1. Non-expressive (361 instances): ASAP (Foscarin et al., 2020) score tracks.
2. Expressively-performed (361 instances): ASAP performance tracks, Vienna 4x22 Piano Corpus (Goebl, 1999), Saarland music data (Müller et al., 2011), Groove MIDI (Gillick et al., 2019), and the Batik-plays-Mozart Corpus (Hu and Widmer, 2023).

For the curated set, we intentionally balanced the number of instances across classes to avoid bias. In imbalanced datasets, classification accuracy can be misleadingly high (especially in a two-class setup), because a classifier could achieve high accuracy by predominantly predicting the majority class if one class has significantly more instances (e.g., 10 times more). This bias reduces the model's ability to generalize and perform well on unseen data, especially if both classes are important. As a result, the classification accuracy, precision and recall metrics can become unreliable, making it difficult to assess the true effectiveness of the heuristics, particularly in detecting or distinguishing the minority class.

To tackle this, balancing the dataset enables a more reliable evaluation of the classification task, even for the baseline heuristics. We partially excluded the Groove MIDI and ASAP subsets from the curated set: had we included them entirely, the curated set would initially have contained roughly 10 times more expressively-performed instances than non-expressive ones. A total of 361 instances per class was selected, as this was the maximum number of non-expressive instances with available ground-truth data.

We employ logistic regression (LR, Kleinbaum et al., 2002) alongside leave-one-out cross-validation (LOOCV, Wong, 2015) to determine thresholds using ground truths of the NE and EP classes. LR estimates each class probability for binary classification between NE and EP class tracks. LOOCV assesses model performance iteratively by training on all but one data point and testing on the excluded point, ensuring comprehensive evaluation. This is particularly beneficial for small datasets to avoid reliance on specific train-test splits. During this task, the ML regressor is solely used for threshold identification rather than classification. The high accuracy of the ML regressor facilitates optimal threshold identification without arbitrary threshold selection.

Heuristic           Threshold   P4
Distinct Velocity   52          0.7727
Distinct Onset      42          0.7225
DNVR                40.965%     0.7727
DNODR               4.175%      0.9529
NOMML               Level 12    0.9952

Table 3: Optimal threshold selection results based on the 80% training set, showing the optimal threshold value for each heuristic where the P4 value is maximized.

After completing the machine learning classifier's training phase, efforts are directed toward identifying the classifier's optimal boundary point to maximize the P4 metric. However, relying solely on the P4 metric for threshold selection proves inadequate, as it may not comprehensively capture all pertinent aspects of the underlying scenarios.

We manually examine the training set to establish percentile boundaries for distinguishing the NE and EP classes based on ground-truth data. Specifically, we identify the maximum P4 metric within the 80% training set. Using this boundary range, we determine the optimal threshold index in a feature array that maximizes the P4 metric, which is then used to extract the corresponding threshold for our heuristic. This feature array contains all feature values for each heuristic. The optimal threshold index, selected based on our ML regression model and P4 score, identifies the optimal threshold from the feature array. For example, the optimal threshold for the NOMML heuristic is found at level 12, corresponding to the 63.85th percentile, yielding a P4 score of 0.9952, with similar information available for the other heuristics in Table 3. Detailed steps for selecting optimal thresholds for each heuristic are provided in the Supplementary File: Appendix B.

It is important to note that the analysis in this section is speculative, relying on observations from Tables 4 and 5 without direct supporting evidence at this stage. Later, in the evaluation in Section 5.2, we provide corresponding results that substantiate these preliminary insights.
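To make the threshold-selection procedure of Section 5.1 more concrete, here is a minimal sketch of logistic regression with leave-one-out cross-validation over a single heuristic feature, using scikit-learn and synthetic values. It illustrates the kind of procedure described rather than the authors' code, and reading the decision boundary off the fitted model is only one possible way to derive a candidate threshold.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Synthetic 1-D feature (e.g. one heuristic value per track) and labels: 0 = NE, 1 = EP.
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(4, 2, 50), rng.normal(11, 1, 50)]).reshape(-1, 1)
y = np.array([0] * 50 + [1] * 50)

# Leave-one-out cross-validation of a logistic-regression classifier on the feature.
model = LogisticRegression()
loocv_acc = cross_val_score(model, X, y, cv=LeaveOneOut()).mean()

# Fit on all data and read off the decision boundary (P(EP) = 0.5), i.e. the
# feature value that could serve as a candidate threshold between NE and EP.
model.fit(X, y)
threshold = -model.intercept_[0] / model.coef_[0][0]
print(f"LOOCV accuracy: {loocv_acc:.3f}, candidate threshold: {threshold:.2f}")
```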
Class         Distinct-Onset (D-O) & Distinct-Velocity (D-V)
NE (62.5%)    D-O < 42   &  D-V < 52
EO (7.2%)     D-O >= 42  &  D-V < 52
EV (27.4%)    D-O < 42   &  D-V >= 52
EP (2.9%)     D-O >= 42  &  D-V >= 52

Table 4: Detection results (%) for expressive performance in each MIDI track class within the GigaMIDI dataset. The analysis is based on the number of distinct velocity levels (Distinct-Velocity: D-V) and onset time deviations (Distinct-Onset: D-O). Categories include non-expressive (NE), expressive-onset (EO), expressive-velocity (EV), and expressively-performed (EP).

Class         c_onset-ratio (O-R) & c_velocity-ratio (V-R)
NE (52.3%)    c_O-R < 4.175%   &  c_V-R < 40.965%
EO (9.1%)     c_O-R >= 4.175%  &  c_V-R < 40.965%
EV (24.2%)    c_O-R < 4.175%   &  c_V-R >= 40.965%
EP (14.4%)    c_O-R >= 4.175%  &  c_V-R >= 40.965%

Table 5: Results (%) of expressive performance detection for each MIDI track class in GigaMIDI based on the calculation of c_onset-ratio (DNODR) and c_velocity-ratio (DNVR).

Tables 4 and 5 display the distribution of the GigaMIDI dataset across the four distinct classes (Figure 1), using optimal thresholds derived from our baseline heuristics (distinct velocity levels and onset time deviations) and the DNVR/DNODR heuristics. With the baseline heuristics (Table 4), class distribution accuracy is limited due to the prevalence of short-length drum and melody loop tracks in GigaMIDI, which the baseline heuristics do not account for. In contrast, results using the DNVR/DNODR heuristics (Table 5) show improved class identification, especially for EP and NE tracks, as these heuristics consider MIDI track length, accommodating short loops with around 100 notes. Although the DNVR/DNODR heuristics provide more accurate distributions, both are less robust than the distribution of the NOMML heuristic, as shown in Figure 7 (a).

Figure 7 (a) illustrates the distribution of NOMML for MIDI tracks in the GigaMIDI dataset. The analysis reveals that the majority of MIDI tracks fall within three distinct bins (bins 0, 2, and 12), encompassing a cumulative percentage of 86.1%. This discernible pattern resembles a bimodal distribution, distinguishing between NE and EP class tracks.

Figure 7 (a) shows that 69% of MIDI tracks in GigaMIDI are NE class, and 31% of GigaMIDI are EP class tracks (NOMML: 12). Our curated version of GigaMIDI, using NOMML level 12 as a threshold, is provided. This curated version consists of 869,513 files (81.59% single-track and 18.41% multi-track files), or 1,655,649 tracks (28.18% drum and 71.82% non-drum tracks). The distribution of MIDI instruments in the curated version is displayed in Figure 7 (b), indicating that piano and drum tracks are the predominant components.

5.2 Evaluation of Heuristics for Expressive Performance Detection

Detection Heuristics   Class. Accuracy   Ranking
Distinct Velocity      77.9%             4
Distinct Onset         77.9%             4
DNVR                   83.4%             3
DNODR                  98.2%             2
NOMML                  100%              1

Table 6: Classification accuracy of each heuristic for expressive performance detection.

In our evaluation results (Table 6), the NOMML heuristic clearly outperforms the other heuristics, achieving the highest accuracy at 100%. Additionally, onset-based heuristics generally show better accuracy than velocity-based ones. This suggests that distinguishing velocity levels poses a greater challenge. For instance, in the ASAP subset, non-expressive score tracks, which encode traditional dynamics through velocity, display fluctuations rather than a fixed velocity level, whereas these tracks are aligned to a quantized grid, making onset-based detection more straightforward. However, we recognize that accuracy alone does not provide a complete understanding, prompting further investigation.

Heuristic (%)   TP     TN     FP     FN     CN
Distinct Vel.   35.4   42.5   21.2   0.9    98.0
Distinct On.    24.8   53.1   10.6   11.5   82.2
DNVR            35.4   48.0   21.2   0.9    98.2
DNODR           34.5   63.7   0      1.77   97.3
NOMML           36.3   63.7   0      0      100

Table 7: True-Positives (TP), True-Negatives (TN), False-Positives (FP), and False-Negatives (FN) based on the threshold set by P4 for each heuristic, including Correct-Negatives (CN), tabled in percentage.

To investigate further, we also report TP, TN, FP, FN and CN as metrics (shown in Table 7) for assessing the reliability of our heuristics using the optimal thresholds in expressive performance detection, where "True" denotes expressive instances and "False" signifies non-expressive instances. Thus, investigating the capacity to achieve a higher correct-negative rate, CN = TN / (TN + FN), holds significance in this context, as it assesses the reliable discriminatory power against NE instances, as well as EP instances. As a result, NOMML achieves a 100% CN rate, and the other heuristics perform reasonably well.
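The four-class assignment underlying Figure 1 and Tables 4-5 reduces to two threshold comparisons. A small illustrative helper follows; the function and argument names are ours, and the default thresholds are those reported in Table 3 for DNODR and DNVR.

```python
def expressive_class(dnodr: float, dnvr: float,
                     onset_threshold: float = 4.175,
                     velocity_threshold: float = 40.965) -> str:
    """Assign a track to NE / EO / EV / EP from its DNODR and DNVR values (in %)."""
    onset_expressive = dnodr >= onset_threshold
    velocity_expressive = dnvr >= velocity_threshold
    if onset_expressive and velocity_expressive:
        return "EP"   # expressively-performed
    if onset_expressive:
        return "EO"   # expressive-onset only
    if velocity_expressive:
        return "EV"   # expressive-velocity only
    return "NE"       # non-expressive

print(expressive_class(dnodr=6.0, dnvr=55.0))   # -> "EP"
```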
Figure 7: Distribution of MIDI tracks according to (a) NOMML (level between 0 and 12, where k = 6) for MIDI tracks in GigaMIDI, where the NOMML heuristic investigates duple and triplet onsets, including onsets that cannot be categorized on duple or triplet-based MIDI grids; and (b) instruments for expressively-performed tracks in the GigaMIDI dataset.

6. Limitations
In navigating the use of MIDI datasets for research and creative explorations, it is imperative to consider the ethical implications inherent in dataset bias (Born, 2020). Bias in MIDI datasets often mirrors prevailing practices in Western digital music production, where certain instruments, particularly the piano and drums, as illustrated in Figure 7 (b), dominate. This predominance is largely influenced by the widespread availability and use of MIDI-compatible instruments and controllers for these instruments. The piano is a primary compositional tool and a ubiquitous MIDI controller and keyboard, facilitating input for a wide range of virtual instruments and synthesizers. Similarly, drums, whether through drum machines or MIDI drum pads, enjoy widespread use for rhythm programming and beat production. This prevalence arises from their intuitive interface and versatility within digital audio workstations. This may explain why the distribution of MIDI instruments in MIDI datasets is often skewed toward piano and drums, with limited representation of other instruments, particularly those requiring more nuanced interpretation or less commonly played via MIDI controllers or instruments.

Moreover, the MIDI standard, while effective for encoding basic musical information, is limited in representing the complexities of Western music's time signatures and meters. It lacks an inherent framework to encode hierarchical metric structures, such as strong and weak beats, and struggles with the dynamic flexibility of metric changes. Additionally, its reliance on fixed temporal grids often oversimplifies expressive rhythmic nuances like rubato, leading to a loss of critical musical details. These constraints necessitate supplementary metadata or advanced techniques to accurately capture the temporal intricacies of Western music.

Furthermore, a constraint emerges from the inadequate accessibility of ground-truth data that clearly demarcates the differentiation between non-expressive and expressive MIDI tracks across all MIDI instruments for expressive performance detection. Presently, such data predominantly originates from piano and drum instruments in the GigaMIDI dataset.

7. Conclusion and Future Work
Analyzing MIDI data may benefit symbolic music generation, computational musicology, and music data mining. The GigaMIDI dataset may contribute to MIR research by providing consolidated access to extensive MIDI data for analysis. Metadata analyses, data source references, and findings on expressive music performance detection may enhance nuanced inquiries and foster progress in expressive music performance analysis and generation.

Our novel heuristics for discerning between non-expressive and expressively-performed MIDI tracks exhibit notable efficacy on the presented dataset. The NOMML (Note Onset Median Metric Level) heuristic demonstrates a classification accuracy of 100%, underscoring its discriminative capacity for expressive music performance detection.

Future work on the GigaMIDI dataset could significantly advance symbolic music research by using MIR techniques to identify and categorize musical styles systematically across all MIDI files. Currently, only about one-fifth of the dataset includes style metadata; expanding this would improve its comprehensiveness. Track-level style categorization, rather than file-level, would better capture the mix of styles in genres like rock, jazz, and pop. Additionally, adding meta-
data for non-Western music, such as Asian classical or able source links and searchable resources. Ap-
Latin/African styles, would reduce Western bias and of- plying this to MIDI data, each subset of MIDI files
fer a more inclusive resource for global music research, collected from public domain sources is accom-
supporting cross-cultural studies. panied by clear and consistent metadata via our
GitHub and Hugging Face hub webpages. For ex-
8. Data Accessibility and Ethical Statements ample, organizing the source links of each data
The GigaMIDI dataset consists of MIDI files acquired subset, as done with the GigaMIDI dataset, en-
via the aggregation of previously available datasets and sures that each source can be easily traced and
web scraping from publicly available online sources. referenced, improving discoverability.
Each subset is accompanied by source links, copyright • Accessible: Once found, data should be easily
information when available, and acknowledgments. retrievable using standard protocols. Accessibil-
File names are anonymized using MD5 hash encryp- ity does not necessarily imply open access, but
tion. We acknowledge the work from the previous it does mean that data should be available un-
dataset papers (Goebl, 1999; Müller et al., 2011; Raf- der well-defined conditions. For the GigaMIDI
fel, 2016; Bosch et al., 2016; Miron et al., 2016; Don- dataset, hosting the data on platforms like Hug-
ahue et al., 2018; Crestel et al., 2018; Li et al., 2018; ging Face Hub improves accessibility, as these
Hawthorne et al., 2019; Gillick et al., 2019; Wang platforms provide efficient data retrieval mech-
et al., 2020; Foscarin et al., 2020; Callender et al., anisms, especially for large-scale datasets. En-
2020; Ens and Pasquier, 2021; Hung et al., 2021; Sar- suring that MIDI data is accessible for public use
mento et al., 2021; Zhang et al., 2022; Szelogowski while respecting any applicable licenses supports
et al., 2022; Liu et al., 2022; Ma et al., 2022; Kong wider research and analysis in music computing.
et al., 2022; Hyun et al., 2022; Choi et al., 2022; Plut • Interoperable: Data should be structured in
et al., 2022; Hu and Widmer, 2023; Ryu et al., 2024) such a way that it can be integrated with other
that we aggregate and analyze as part of the GigaMIDI datasets and used by various applications. MIDI
subsets. data, being a widely accepted format in music
This dataset has been collected, utilized, and dis- research, is inherently interoperable, especially
tributed under the Fair Dealing provisions for research when standardized metadata and file formats are
and private study outlined in the Canadian Copyright used. By ensuring that the GigaMIDI dataset
Act (Government of Canada, 2024). Fair Dealing per- complies with widely adopted standards and sup-
mits the limited use of copyright-protected material ports integration with state-of-the-art libraries in
without the risk of infringement and without having symbolic music processing, such as Symusic and
to seek the permission of copyright owners. It is in- MidiTok, the dataset enhances its utility for mu-
tended to provide a balance between the rights of cre- sic researchers and practitioners working across
ators and the rights of users. As per instructions of different platforms and systems.
the Copyright Office of Simon Fraser University12 , two • Reusable: Data should be well-documented
protective measures have been put in place that are and licensed to be reused in future research.
deemed sufficient given the nature of the data (acces- Reusability is ensured through proper metadata,
sible online): clear licenses, and documentation of provenance.
In the case of GigaMIDI, aggregating all subsets
1. We explicitly state that this dataset has been col-
from public domain sources and linking them to
lected, used, and distributed under the Fair Deal-
the original sources strengthens the reproducibil-
ing provisions for research and private study out-
ity and traceability of the data. This practice
lined in the Canadian Copyright Act.
allows future researchers to not only use the
2. On the Hugging Face hub, we advertise that the
dataset but also verify and expand upon it by re-
data is available for research purposes only and
ferring to the original data sources.
collect the user’s legal name and email as proof
of agreement before granting access. Developing ethical and responsible AI systems for
We thus decline any responsibility for misuse. music requires adherence to core principles of fairness,
The FAIR (Findable, Accessible, Interoperable, transparency, and accountability. The creation of the
Reusable) principles (Jacobsen et al., 2020) serve as a GigaMIDI dataset reflects a commitment to these val-
framework to ensure that data is well-managed, easily ues, emphasizing the promotion of ethical practices
discoverable, and usable for a broad range of purposes in data usage and accessibility. Our work aligns with
in research. These principles are particularly important prominent initiatives promoting ethical approaches to
in the context of data management to facilitate open AI in music, such as AI for Music Initiatives13 , which
science, collaboration, and reproducibility. advocates for principles guiding the ethical creation of
• Findable: Data should be easily discoverable music with AI, supported by the Metacreation Lab for
by both humans and machines. This is typi- Creative AI14 and the Centre for Digital Music15 , which
cally achieved through proper metadata, trace- provide critical guidelines for the responsible develop-
14 Lee, K. et al: The GigaMIDI Dataset with Features for Expressive Music Performance Detection

ment and deployment of AI systems in music. Similarly, kins.


the Fairly Trained initiative16 highlights the impor- • Expressive Music Performance Heuristic De-
tance of ethical standards in data curation and model sign and Experimentation: Keon Ju Maverick
training, principles that are integral to the design of the Lee and Jeff Ens.
GigaMIDI dataset. These frameworks have shaped the • Analysis and Interpretation of Results: Keon
methodologies used in this study, from dataset creation Ju Maverick Lee, Jeff Ens, Pedro Sarmento,
and validation to algorithmic design and system evalu- Mathieu Barthet and Philippe Pasquier
ation. By engaging with these initiatives, this research • Manuscript Draft Preparation: Keon Ju Mav-
not only contributes to advancing AI in music but also erick Lee, Jeff Ens, Sara Adkins, and Pedro Sar-
reinforces the ethical use of data for the benefit of the mento.
broader music computing and MIR communities. • Research Guidance and Advisement: Philippe
Pasquier and Mathieu Barthet.
9. Acknowledgements All authors have reviewed the results and approved
We gratefully acknowledge the support and contribu- the final version of the manuscript.
tions that have directly or indirectly aided this re-
search. This work was supported in part by funding Notes
from the Natural Sciences and Engineering Research 1 [Link]
Council of Canada (NSERC) and the Social Sciences 2 [Link]
and Humanities Research Council of Canada (SSHRC). 3 [Link]
We also extend our gratitude to the School of Interac- GigaMIDI-Dataset/tree/main
tive Arts and Technology (SIAT) at Simon Fraser Uni- 4 [Link]
versity (SFU) for providing resources and an enriching 5 [Link]
research environment. Additionally, we thank the Cen- 6 [Link]
tre for Digital Music (C4DM) at Queen Mary University Metacreation/GigaMIDI
of London (QMUL) for fostering collaborative opportu- 7 [Link]
nities and supporting our engagement with interdisci- MetaMIDI-Dataset
plinary research initiatives. We also acknowledge the 8 [Link]
support of EPSRC UKRI Centre for Doctoral Training in 9 [Link]
AI and Music (Grant EP/S022694/1) and UKRI - Inno- 10 [Link]
vate UK (Project number 10102804). 11 [Link]
Special thanks are extended to Dr. Cale Plut for
server
his meticulous manual curation of musical styles and 12 [Link]
to Dr. Nathan Fradet for his invaluable assistance
integrity/copyright#
in developing the HuggingFace Hub website for the 13 [Link]
GigaMIDI dataset, ensuring it is accessible and user- 14 [Link]
friendly for music computing and MIR researchers. We 15 [Link]
also sincerely thank our research interns, Paul Triana 16 [Link]
and Davide Rizzotti, for their thorough proofreading of
the manuscript, as well as the TISMIR reviewers who
helped us improve our manuscript. References
Finally, we express our heartfelt appreciation to the Adkins, S., Sarmento, P., and Barthet, M. (April, 2023).
individuals and communities who generously shared LooperGP: A loopable sequence model for live cod-
their MIDI files for research purposes. Their contribu- ing performance using GuitarPro tablature. In Pro-
tions have been instrumental in advancing this work ceedings of International Conference on Computational
and fostering collaborative knowledge in the field. Intelligence in Music, Sound, Art and Design (Part of
EvoStar), Brno, Czech Republic.
10. Competing Interests Arcos, J. L., De Mántaras, R. L., and Serra, X. (1998).
The authors have no competing interests to declare. Saxex: A case-based reasoning system for gener-
ating expressive musical performances. Journal of
11. Authors’ Contributions New Music Research, 27(3):page 194–210. https:
The authors confirm their contributions to the //[Link]/10.1080/09298219808570746.
manuscript as follows:
• Study Conception and Design: Keon Ju Maver- Barthet, M., Depalle, P., Kronland-Martinet, R., and Ys-
ick Lee, Jeff Ens, Pedro Sarmento, Mathieu Bar- tad, S. (2010). Acoustical correlates of timbre and
thet, and Philippe Pasquier. expressiveness in clarinet performance. Music Per-
• Data Collection and Metadata: Keon Ju Maver- ception, 28(2):135–154. [Link]
ick Lee, Jeff Ens, Pedro Sarmento, and Sara Ad- mp.2010.28.2.135.

Bellini, P., Bruno, I., and Nesi, P. (2005). Automatic formatting of music sheets through MILLA rule-based language and engine. Journal of New Music Research, 34(3):237–257.
Berndt, A. and Hähnel, T. (September, 2010). Modelling musical dynamics. In Proceedings of the Audio Mostly Conference on Interaction with Sound, Piteå, Sweden.
Born, G. (2020). Diversifying MIR: Knowledge and real-world challenges, and new interdisciplinary futures. Transactions of the International Society for Music Information Retrieval, 3(1).
Bosch, J. J., Marxer, R., and Gómez, E. (2016). Evaluation and combination of pitch estimation methods for melody extraction in symphonic classical music. Journal of New Music Research, 45(2):101–117.
Boulanger-Lewandowski, N., Bengio, Y., and Vincent, P. (June, 2012). Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription. In Proceedings of the International Conference on Machine Learning, Edinburgh, Scotland.
Briot, J.-P. (2021). From artificial neural networks to deep learning for music generation: history, concepts and trends. Journal of Neural Computing and Applications, pages 39–65.
Briot, J.-P. and Pachet, F. (2020). Deep learning for music generation: challenges and directions. Journal of Neural Computing and Applications, pages 981–993.
Brunner, G., Konrad, A., Wang, Y., and Wattenhofer, R. (September, 2018). MIDI-VAE: Modeling dynamics and instrumentation of music with applications to style transfer. In Proceedings of the International Society for Music Information Retrieval Conference, Paris, France.
Callender, L., Hawthorne, C., and Engel, J. (2020). Improving perceptual quality of drum transcription with the expanded groove MIDI dataset. [Link] (Last accessed: 27th of October 2023).
Cancino-Chacón, C., Peter, S. D., Karystinaios, E., Foscarin, F., Grachten, M., and Widmer, G. (May, 2022). Partitura: a Python package for symbolic music processing. In Proceedings of the Music Encoding Conference, Halifax, Canada.
Cancino-Chacón, C. E., Grachten, M., Goebl, W., and Widmer, G. (2018). Computational models of expressive music performance: A comprehensive and critical review. Journal of Frontiers in Digital Humanities, 5.
Choi, E., Chung, Y., Lee, S., Jeon, J., Kwon, T., and Nam, J. (December, 2022). YM2413-MDB: A multi-instrumental FM video game music dataset with emotion annotations. In Proceedings of the International Society for Music Information Retrieval Conference, Bengaluru, India.
Collins, T. and Barthet, M. (November, 2023). Expressor: A Transformer Model for Expressive MIDI Performance. In Proceedings of the 16th International Symposium on Computer Music Multidisciplinary Research (CMMR), Tokyo, Japan.
Crauwels, K. (2016). Musicmap. [Link] (Last accessed: January 4th, 2024).
Crawford, T. and Lewis, R. (2016). Review: Music Encoding Initiative. Journal of the American Musicological Society, 69(1):273–285.
Crestel, L., Esling, P., Heng, L., and McAdams, S. (September, 2018). A database linking piano and orchestral MIDI scores with application to automatic projective orchestration. In Proceedings of the International Society for Music Information Retrieval Conference, Paris, France.
Cros Vila, L. and Sturm, B. L. T. (August, 2023). Statistical evaluation of ABC-formatted music at the levels of items and corpora. In Proceedings of AI Music Creativity (AIMC), Brighton, UK.
Cuthbert, M. S. and Ariza, C. (August, 2010). music21: A toolkit for computer-aided musicology and symbolic music data. In Proceedings of the International Society for Music Information Retrieval Conference, Utrecht, Netherlands.
Dannenberg, R. B. (November, 2006). The interpretation of MIDI velocity. In Proceedings of the International Computer Music Conference, New Orleans, United States.
Dieleman, S., Van Den Oord, A., and Simonyan, K. (December, 2018). The Challenge of Realistic Music Generation: Modelling Raw Audio at Scale. In Proceedings of the Conference on Neural Information Processing Systems, Montreal, Canada.
Donahue, C., Mao, H. H., and McAuley, J. (September, 2018). The NES music database: A multi-instrumental dataset with expressive performance attributes. In Proceedings of the International Society for Music Information Retrieval Conference, Paris, France.
Edwards, D., Dixon, S., Benetos, E., Maezawa, A., and Kusaka, Y. (2024). A data-driven analysis of robust automatic piano transcription. IEEE Signal Processing Letters.
Ens, J. and Pasquier, P. (November, 2021). Building the MetaMIDI dataset: Linking symbolic and audio musical data. In Proceedings of the International Society for Music Information Retrieval Conference, Online.
Ens, J. and Pasquier, P. (October, 2020). MMM: Exploring conditional multi-track music generation with the transformer. In Proceedings of the International Society for Music Information Retrieval Conference, Montreal, Canada.
Foscarin, F., McLeod, A., Rigaux, P., Jacquemard, F., and Sakai, M. (October, 2020). ASAP: a dataset of aligned scores and performances for piano transcription. In Proceedings of the International Society for Music Information Retrieval Conference, Montreal, Canada.
Fradet, N., Briot, J.-P., Chhel, F., Seghrouchni, A. E. F., and Gutowski, N. (November, 2021). MidiTok: A Python package for MIDI file tokenization. In Proceedings of the International Society for Music Information Retrieval Conference, Online.
Gillick, J., Roberts, A., Engel, J., Eck, D., and Bamman, D. (June, 2019). Learning to groove with inverse sequence transformations. In Proceedings of the International Conference on Machine Learning, Long Beach, United States.
Goddard, C., Barthet, M., and Wiggins, G. (2018). Assessing musical similarity for computational music creativity. Journal of the Audio Engineering Society, 66(4):267–276.
Goebl, W. (1999). The Vienna 4x22 piano corpus. [Link] (Last accessed: 24th of October 2024).
Goebl, W. (2001). Melody lead in piano performance: Expressive device or artifact? The Journal of the Acoustical Society of America, 110(1):563–572.
Gómez-Marín, D., Jordà, S., and Herrera, P. (2020). Drum rhythm spaces: From polyphonic similarity to generative maps. Journal of New Music Research, 49(5):438–456.
Government of Canada (2024). The Canadian Copyright Act, RSC 1985, c. C-42, s. 29 (fair dealing for research and private study). [Link] (Last accessed: 20th of November 2024).
Grachten, M. and Widmer, G. (2012). Linear basis models for prediction and analysis of musical expression. Journal of New Music Research, 41(4):311–322.
Hawthorne, C., Stasyuk, A., Roberts, A., Simon, I., Huang, C.-Z. A., Dieleman, S., Elsen, E., Engel, J., and Eck, D. (May, 2019). Enabling factorized piano music modeling and generation with the MAESTRO dataset. In Proceedings of the International Conference on Learning Representations (ICLR), New Orleans, United States.
Hernandez-Olivan, C. and Beltran, J. R. (2022). Music composition with deep learning: a review. Journal of Advances in Speech and Music Technology: Computational Aspects and Applications, pages 25–50. doi:10.48550/arXiv.2108.12290.
Hu, P. and Widmer, G. (November, 2023). The Batik-plays-Mozart Corpus: Linking Performance to Score to Musicological Annotations. In Proceedings of the International Society for Music Information Retrieval Conference, Milan, Italy.
Huang, C.-Z. A., Vaswani, A., Uszkoreit, J., Shazeer, N., Simon, I., Hawthorne, C., Dai, A. M., Hoffman, M. D., Dinculescu, M., and Eck, D. (May, 2019). Music Transformer. In Proceedings of the International Conference on Learning Representations (ICLR), New Orleans, United States.
Hung, H.-T., Ching, J., Doh, S., Kim, N., Nam, J., and Yang, Y.-H. (November, 2021). EMOPIA: A multi-modal pop piano dataset for emotion recognition and emotion-based music generation. In Proceedings of the International Society for Music Information Retrieval Conference, Online.
Hyun, L., Kim, T., Kang, H., Ki, M., Hwang, H., Park, K., Han, S., and Kim, S. J. (December, 2022). ComMU: Dataset for combinatorial music generation. In Proceedings of the Conference on Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, New Orleans, United States.
Jacobsen, A., de Miranda Azevedo, R., Juty, N., Batista, D., Coles, S., Cornet, R., Courtot, M., Crosas, M., Dumontier, M., Evelo, C. T., Goble, C., Guizzardi, G., Hansen, K. K., Hasnain, A., Hettne, K., Heringa, J., Hooft, R. W., Imming, M., Jeffery, K. G., Kaliyaperumal, R., Kersloot, M. G., Kirkpatrick, C. R., Kuhn, T., Labastida, I., Magagna, B., McQuilton, P., Meyers, N., Montesanti, A., van Reisen, M., Rocca-Serra, P., Pergl, R., Sansone, S.-A., da Silva Santos, L. O. B., Schneider, J., Strawn, G., Thompson, M., Waagmeester, A., Weigel, T., Wilkinson, M. D., Willighagen, E. L., Wittenburg, P., Roos, M., Mons, B., and Schultes, E. (2020). FAIR Principles: Interpretations and Implementation Considerations. Data Intelligence, 2(1-2):10–29.
Kim, H., Miron, M., and Serra, X. (December, 2022). Note level MIDI velocity estimation for piano performance. In Proceedings of the International Society for Music Information Retrieval Conference, Bengaluru, India.
Kleinbaum, D. G., Dietz, K., Gail, M., Klein, M., and Klein, M. (2002). Logistic regression. Springer.
Kong, Q., Li, B., Chen, J., and Wang, Y. (2022). GiantMIDI-Piano: A large-scale MIDI dataset for classical piano music. Transactions of the International Society for Music Information Retrieval. doi:10.5334/tismir.80.
Kuo, C.-S., Chen, W.-K., Liu, C.-H., and You, S. D. (September, 2021). Velocity prediction for MIDI notes with deep learning. In Proceedings of the IEEE International Conference on Consumer Electronics-Taiwan, Penghu, Taiwan.
Li, B., Liu, X., Dinesh, K., Duan, Z., and Sharma, G. (2018). Creating a multitrack classical music performance dataset for multimodal music analysis: Challenges, insights, and applications. IEEE Transactions on Multimedia, 21(2):522–535. doi:10.48550/arXiv.1612.08727.
Li, T., Ogihara, M., and Tzanetakis, G. (2012). Music data mining. CRC Press, Boca Raton. doi:10.1201/b11041.
Licata, T. (2002). Electroacoustic music: analytical perspectives. Bloomsbury Publishing USA. ISBN-10: 0313314209.
Liu, J., Dong, Y., Cheng, Z., Zhang, X., Li, X., Yu, F., and Sun, M. (December, 2022). Symphony generation with permutation invariant language model. In Proceedings of the International Society for Music Information Retrieval Conference, Bengaluru, India.
Ma, X., Liu, X., Zhang, B., and Wang, Y. (December, 2022). Robust melody track identification in symbolic music. In Proceedings of the International Society for Music Information Retrieval Conference, Bengaluru, India.
Manzelli, R., Thakkar, V., Siahkamari, A., and Kulis, B. (September, 2018). Conditioning deep generative raw audio models for structured automatic music. In Proceedings of the International Society for Music Information Retrieval Conference, Paris, France.
Mauch, M., MacCallum, R. M., Levy, M., and Leroi, A. M. (2015). The evolution of popular music: USA 1960–2010. Royal Society Open Science, 2(5):150081.
Meroño-Peñuela, A., Hoekstra, R., Gangemi, A., Bloem, P., de Valk, R., Stringer, B., Janssen, B., de Boer, V., Allik, A., Schlobach, S., et al. (2017). The MIDI linked data cloud. In The Semantic Web – ISWC 2017: 16th International Semantic Web Conference, Vienna, Austria, October 21-25, 2017, Proceedings, Part II, pages 156–164. Springer.
MIDI Association (1996a). The complete MIDI 1.0 detailed specification. [Link] (Last accessed: 6th of April 2024).
MIDI Association (1996b). General MIDI instrument group and mapping. [Link] (Last accessed: 6th of April 2024).
Miron, M., Carabias-Orti, J. J., Bosch, J. J., Gómez, E., Janer, J., et al. (2016). Score-informed source separation for multichannel orchestral recordings. Journal of Electrical and Computer Engineering, 2016.
Müller, M., Konz, V., Bogler, W., and Arifi-Müller, V. (October, 2011). Saarland music data (SMD).
Ortega, F. J., Giraldo, S. I., Perez, A., and Ramírez, R. (2019). Phrase-level modelling of expression in violin performances. Journal of Frontiers in Psychology, page 776.
Palmer, C. (1997). Music performance. Annual Review of Psychology, 48(1):115–138.
Payne, C. (2019). MuseNet (OpenAI). [Link] (Last accessed: 27th of October 2023).
Plut, C., Pasquier, P., Ens, J., and Tchemeube, R. (2022). The IsoVAT Corpus: Parameterization of Musical Features for Affective Composition. Transactions of the International Society for Music Information Retrieval (TISMIR), 5(1).
Qiu, L., Li, S., and Sung, Y. (2021). DBTMPE: Deep bidirectional transformers-based masked predictive encoder approach for music genre classification. Mathematics, 9(5):530.
Raffel, C. (2016). The Lakh MIDI dataset v0.1. [Link] (Last accessed: 27th of November 2023).
Raffel, C. and Ellis, D. P. (March, 2016a). Optimizing DTW-based audio-to-MIDI alignment and matching. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, Shanghai, China.
Raffel, C. and Ellis, D. P. W. (August, 2016b). Extracting Ground Truth Information from MIDI Files: A MIDIfesto. In Proceedings of the International Society for Music Information Retrieval Conference, New York, United States.
Repp, B. H. (1997a). Acoustics, perception, and production of legato articulation on a computer-controlled grand piano. The Journal of the Acoustical Society of America, 102(3):1878–1890. doi:10.1121/1.420110.
Repp, B. H. (1997b). The aesthetic quality of a quantitatively average music performance: Two preliminary experiments. Music Perception, 14(4):419–444.
Ryu, J., Rhyu, S., Yoon, H.-G., Kim, E., Yang, J. Y., and Kim, T. (February, 2024). MID-FiLD: MIDI dataset for fine-level dynamics. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, Canada.
Sarmento, P., Kumar, A., Carr, C., Zukowski, Z., Barthet, M., and Yang, Y.-H. (November, 2021). DadaGP: A dataset of tokenized GuitarPro songs for sequence models. In Proceedings of the International Society for Music Information Retrieval Conference, Online.
Sarmento, P., Kumar, A., Chen, Y.-H., Carr, C., Zukowski, Z., and Barthet, M. (April, 2023a). GTR-CTRL: Instrument and genre conditioning for guitar-focused music generation with transformers. In Proceedings of the EvoMUSART Conference, Brno, Czech Republic.
Sarmento, P., Kumar, A., Xie, D., Carr, C., Zukowski, Z., and Barthet, M. (November, 2023b). ShredGP: Guitarist Style-Conditioned Tablature Generation. In The 16th International Symposium on Computer Music Multidisciplinary Research, Tokyo, Japan.
Schaffrath, H. (1995). The Essen folksong collection. [Link] (Last accessed: 15th of September 2023).
Shih, Y.-J., Wu, S.-L., Zalkow, F., Muller, M., and Yang, Y.-H. (2022). Theme Transformer: symbolic music generation with theme-conditioned transformer. IEEE Transactions on Multimedia. doi:10.48550/arXiv.2111.04093.
Simonetta, F., Carnovalini, F., Orio, N., and Rodà, A. (September, 2018). Symbolic music similarity through a graph-based representation. In Proceedings of the Audio Mostly Conference, Wrexham, United Kingdom.
Sitarz, M. (2022). Extending F1 metric, probabilistic approach. Journal of Advances in Artificial Intelligence and Machine Learning.
Szelogowski, D., Mukherjee, L., and Whitcomb, B. (December, 2022). A novel dataset and deep learning benchmark for classical music form recognition and analysis. In Proceedings of the International Society for Music Information Retrieval Conference, Bengaluru, India.
Tang, J., Wiggins, G., and Fazekas, G. (November, 2023). Reconstructing human expressiveness in piano performances with a Transformer network. In Proceedings of the International Symposium on Computer Music Multidisciplinary Research (CMMR), Tokyo, Japan.
von Rütte, D., Biggio, L., Kilcher, Y., and Hofmann, T. (May, 2023). FIGARO: generating symbolic music with fine-grained artistic control. In Proceedings of the International Conference on Learning Representations, Kigali, Rwanda.
Wang, Z., Chen, K., Jiang, J., Zhang, Y., Xu, M., Dai, S., Gu, X., and Xia, G. (October, 2020). POP909: A pop-song dataset for music arrangement generation. In Proceedings of the International Society for Music Information Retrieval Conference, Montreal, Canada.
Wong, T.-T. (2015). Performance evaluation of classification algorithms by k-fold and leave-one-out cross validation. Pattern Recognition, 48(9):2839–2846.
Zeng, M., Tan, X., Wang, R., Ju, Z., Qin, T., and Liu, T.-Y. (August, 2021). MusicBERT: Symbolic music understanding with large-scale pre-training. In Proceedings of the Joint Conference of the Annual Meeting of the Association for Computational Linguistics and the International Joint Conference on Natural Language Processing, Bangkok, Thailand.
Zhang, H., Tang, J., Rafee, S. R. M., and Fazekas, S. D. G. (December, 2022). ATEPP: A dataset of automatically transcribed expressive piano performance. In Proceedings of the International Society for Music Information Retrieval Conference, Bengaluru, India.
Zhang, N. (2020). Learning Adversarial Transformer for Symbolic Music Generation. IEEE Transactions on Neural Networks and Learning Systems. doi:10.1109/TNNLS.2020.2990746.
Zukowski, Z. and Carr, C. (December, 2017). Generating Black Metal and Math Rock: Beyond Bach, Beethoven, and Beatles. In NIPS Workshop on Machine Learning for Creativity and Design, Long Beach, United States.

A. Additional figures
A.1 Descriptive statistics of the GigaMIDI Dataset

Figure 8: Distribution of the number of instrument tracks in GigaMIDI.

Figure 9: Distribution of tempo (BPM, beats per minute) in GigaMIDI.



Figure 10: Distribution of time signature in GigaMIDI.



Figure 11: Distribution of each drum MIDI instrument event in GigaMIDI. The legend in the graph displays drum
instruments based on three relative frequency levels depending on the colour hues (blue hue: low-range
frequency, green hue: mid-range frequency, and red hue: high-range frequency).

A.2 Distributions of the number of distinct MIDI note velocity levels and onset time deviations.

Figure 12: Distribution of distinct MIDI note velocity.

Figure 13: Distribution of distinct MIDI note onset time deviation.



B. Model Selection and Hyperparameter Settings for Optimal Threshold Selection of Heuristics
for Expressive Music Performance Detection
B.1 Machine Learning (ML) Model Selection
Following a series of comparative experiments involving logistic regression, decision trees, and random
forests—each implemented using the scikit-learn library—logistic regression was chosen as the most suitable
machine learning algorithm for determining optimal thresholds to differentiate between non-expressive and ex-
pressive MIDI tracks. This selection was made based on the ground truth data we manually collected, which
informed the model’s performance evaluation and final decision.
The choice of a machine learning model for identifying optimal thresholds between two classes, such as
non-expressive and expressively-performed MIDI tracks, requires careful consideration of the data’s specific char-
acteristics and the analysis goals. Logistic regression is often favoured when the relationship between the input
features and the target class is approximately linear. This model provides a clear, interpretable framework for
classification by modelling the probability that a given input belongs to one of the two classes. The output of
logistic regression is a continuous probability score between 0 and 1, which allows for straightforward determina-
tion and adjustment of the decision threshold. This simplicity and directness make logistic regression particularly
appealing when the primary objective is to identify a reliable and easily interpretable threshold.
However, logistic regression has limitations, particularly when the true relationship between the features and
the outcome is non-linear or complex. In such cases, decision trees and random forests offer more flexibility.
Decision trees can capture non-linear interactions between features by partitioning the feature space into distinct
regions associated with a specific class. Random forests, as ensembles of decision trees, enhance this flexibility
by averaging the predictions of multiple trees, thereby reducing variance and improving generalization. These
models can model complex relationships that logistic regression might miss, making them more suitable for
datasets where the linear assumption of logistic regression does not hold.
Regarding threshold determination, logistic regression has a distinct advantage due to its probabilistic output.
The model naturally provides a probability estimate for each instance, and a threshold can be easily applied to
classify instances into one of the two classes. This straightforward approach to threshold selection is one of the
key reasons logistic regression is often chosen for tasks requiring clear and interpretable decision boundaries. In
contrast, decision trees and random forests do not inherently produce probability scores in the same way. While they
can be adapted to generate probabilities, by considering the class distribution within the leaf nodes of a decision
tree or across the trees of a random forest, this process is more involved and can make threshold selection less
intuitive.
In our computational experiment, the logistic regression machine learning model, combined with manual
threshold inspection for validation, was found to be sufficient for identifying the optimal threshold for each
heuristic. This approach was particularly effective given the simplicity of the task, which involved a single feature
for each of the three key metrics—Distinctive Note Velocity Ratio (DNVR), Distinctive Note Onset Deviation Ratio
(DNODR), and Note Onset Median Metric Level (NOMML)—and the classification of data into two categories:
non-expressive and expressive tracks. The problem at hand, being a straightforward binary classification task
using a supervised learning algorithm, aligned well with the capabilities of logistic regression, thereby rendering
it an appropriate choice for our optimal threshold selection.
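To make the point about straightforward threshold determination concrete, the following minimal sketch (synthetic data, illustrative variable names) shows that for a single-feature logistic regression a probability threshold p corresponds to a unique feature-value cutoff via the inverse of the logistic function, x = (logit(p) - b) / w, assuming a positive coefficient w.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for one heuristic feature on NE (label 0) and EP (label 1) tracks.
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(2.0, 1.0, 200), rng.normal(6.0, 1.0, 200)]).reshape(-1, 1)
y = np.concatenate([np.zeros(200), np.ones(200)])

clf = LogisticRegression().fit(X, y)
p = 0.5  # any chosen probability threshold (e.g., one maximizing the P4 metric)
w, b = clf.coef_[0, 0], clf.intercept_[0]
# P(EP | x) >= p  <=>  w * x + b >= logit(p)  <=>  x >= (logit(p) - b) / w   (for w > 0)
cutoff = (np.log(p / (1.0 - p)) - b) / w
print(f"Feature-value cutoff equivalent to probability threshold {p}: {cutoff:.3f}")
```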

B.2 Hyperparameter Settings and Training Details


The process of training a logistic regression model using the leave-one-out cross-validation (LOOCV) method
requires a methodical approach to ensure robust model performance. Leave-one-out cross-validation is a special
case of k-fold cross-validation where the number of folds equals the number of instances in the dataset. In
this method, the model is trained on all data points except one, which is used as the validation set, and this
process is repeated for each data point. The advantage of LOOCV lies in its ability to maximize the use of
available data for training while providing a nearly unbiased estimate of model performance. However, due to its
computational intensity, especially with large datasets, careful consideration is given to the selection and tuning
of hyperparameters to optimize the model’s performance. In our case, we trained our models with 722 instances
using LOOCV; this is a relatively small amount of labelled data, reflecting the scarcity of ground truth that
distinguishes non-expressive from expressive tracks for expressive music performance detection.
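As a minimal illustration of this protocol, leave-one-out cross-validation can be run directly with scikit-learn; the data below are synthetic stand-ins rather than the actual 722 labelled tracks, and each fold fits on 721 instances while validating on the single held-out one.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict

# Synthetic stand-in: 361 NE (label 0) and 361 EP (label 1) values of one heuristic feature.
rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(2.0, 1.0, 361), rng.normal(6.0, 1.0, 361)]).reshape(-1, 1)
y = np.concatenate([np.zeros(361), np.ones(361)])

# One fit per instance: trained on 721 points, validated on the held-out point.
proba = cross_val_predict(LogisticRegression(max_iter=10000), X, y,
                          cv=LeaveOneOut(), method="predict_proba")[:, 1]
print("LOOCV accuracy at a 0.5 threshold:", np.mean((proba >= 0.5) == y))
```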
The training environment for our experiments was configured on a MacBook Pro, equipped with an Apple M2
CPU and 16GB of RAM, without the use of external GPUs. Our analysis, which included evaluation using the P4
metric alongside basic metrics such as classification accuracy, precision, and recall, did not indicate any significant
impact on performance attributable to the computational setup. Furthermore, we share three logistic regression
models in .pkl format, each trained on a specific heuristic, accessible via GitHub. These models correspond to the
following heuristics: baseline heuristics, Distinctive Note Velocity Ratio (DNVR), trained in less than 10 minutes;
Distinctive Note Onset Deviation Ratio (DNODR), trained within 10 minutes; and Note Onset Median Metric
Level (NOMML), trained in 3 minutes with our MacBook Pro.
For hyperparameter tuning, we employed the scikit-learn library for logistic regression, a widely recognized
tool in the machine learning community for its efficiency and versatility. We utilized the GridSearchCV function
within this framework, which facilitates an exhaustive search over a specified parameter grid. This approach
identifies the most effective hyperparameters for the logistic regression model. GridSearchCV systematically
explores combinations of specified hyperparameter values and evaluates model performance based on cross-
validation scores, in this case, derived from the LOOCV process.
The hyperparameters tuned during this process include the regularization strength (denoted as C), which
controls the trade-off between achieving a low training error and a low testing error, as well as the choice of
regularization method (L1 or L2). By conducting an exhaustive search over these parameters, we aimed to
identify the configuration that minimizes the validation error across all iterations of the LOOCV. This rigorous
tuning process is crucial, as these hyperparameters can significantly affect logistic regression’s performance,
particularly in the presence of imbalanced data or feature correlations. The result is a logistic regression model
that is finely tuned to perform optimally under the specific conditions of our dataset and evaluation framework.
The following parameters and model configuration were determined through hyperparameter tuning with GridSearchCV
and leave-one-out cross-validation in the scikit-learn library for the logistic regression model. Notably, the same
optimal hyperparameters were identified for all three models, one per heuristic.
• Hyperparameter for the logistic regression models: C=0.046415888336127774
• Logistic regression setting details using the scikit-learn Python ML library:
LogisticRegression(random_state=0, C=0.046415888336127774, max_iter=10000, tol=0.1)
This configuration represents the optimal hyperparameters identified through comprehensive parameter ex-
ploration using GridSearchCV and LOOCV, thereby ensuring the logistic regression model’s robust performance.
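A sketch of this tuning setup is given below. The search grid and scoring function are our assumptions (the exact grid is not listed above); the reported optimum C = 0.046415888336127774 equals 10**(-4/3), which is consistent with a logarithmic grid such as numpy.logspace(-4, 4, 25). The liblinear solver is likewise an assumption, chosen only because it accepts both L1 and L2 penalties.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, LeaveOneOut

# Hypothetical grid over regularization strength and penalty type.
param_grid = {"C": np.logspace(-4, 4, 25), "penalty": ["l1", "l2"]}
base = LogisticRegression(random_state=0, max_iter=10000, tol=0.1, solver="liblinear")
search = GridSearchCV(base, param_grid, cv=LeaveOneOut(), scoring="accuracy")
# search.fit(X_train, y_train)   # X_train: one heuristic feature per row; y_train: 0/1 labels
# print(search.best_params_)

# Final configuration reported above, shared by all three heuristic-specific models:
model = LogisticRegression(random_state=0, C=0.046415888336127774, max_iter=10000, tol=0.1)
```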

B.3 Procedure of Optimal Threshold Selection


Our curated evaluation set comprises 361 non-expressive (NE) tracks labelled 0 and 361 expressively-performed
(EP) tracks labelled 1. Five feature values are available for training: the two baseline heuristics (the number of
distinct velocity levels and the number of onset time deviations), and the more sophisticated DNVR, DNODR, and
NOMML heuristics. To train the logistic regression models for selecting optimal thresholds for our heuristics, 80%
of this curated evaluation set was allocated as the training set. The remaining 20% was reserved as the testing set,
used only to validate the model’s performance during the evaluation phase; the testing set is therefore not involved
in the optimal threshold selection process, which prevents potential data leakage.
To determine the optimal threshold for expressive music performance detection using logistic regression with
a focus on the P4 metric, the following steps were undertaken:
• Step (1): Prepare the logistic regression algorithm using GridSearchCV to identify optimal hyperparameter
settings, followed by leave-one-out cross-validation to maximize the P4 metric. This ensures that the model
is fine-tuned for the specific task of classifying non-expressive and expressively-performed MIDI tracks.
• Step (2): Train the logistic regression model on the training set, incorporating the relevant features and
ground truth labels, using the pre-determined optimal hyperparameters.
• Step (3): Apply leave-one-out cross-validation on the validation set (within the training set) to obtain
predicted probabilities for the positive class, i.e., expressively-performed MIDI tracks.
• Step (4): Validate the performance of the classifier at various threshold values, focusing on optimizing the
P4 metric, which is particularly suited for imbalanced and small sample size datasets.
• Step (5): Identify the index of the optimal threshold value within the threshold array that maximizes the
P4 metric, ensuring that the model effectively distinguishes between the two classes.
• Step (6): Use this index to extract the corresponding optimal value from the feature array, translating the
identified threshold into actionable feature values.
• Step (7): Lastly, we conduct a manual inspection to ensure that the selected thresholds are consistent with
the distribution of feature values within the dataset. We then determine the optimal percentiles for these
thresholds based on the feature value distribution.
Details of Steps (4), (5), and (6): Initially, predicted probabilities for the positive class are obtained using
the predict_proba method of the logistic regression model. Next, the precision-recall curve is computed using
the precision_recall_curve function, and this curve is plotted as a function of different threshold values. The
P4 metric is then maximized to identify the optimal threshold, given its effectiveness in handling imbalanced
and small sample size datasets by prioritizing the accurate classification of the minority class. By adjusting
the threshold value, the trade-off between precision and recall can be controlled—higher thresholds increase
precision but reduce recall, whereas lower thresholds have the opposite effect.
The precision and recall analysis is related to the P4 metric in that both are used to evaluate model perfor-
mance, especially on imbalanced and small-sample-size datasets. Precision and recall measure the accuracy of
positive predictions and the model’s ability to identify all positive cases, respectively. The P4 metric builds on this
by rewarding the correct classification of the minority class, making it particularly useful when the dataset is
imbalanced and the sample size is small. While precision and recall help select optimal thresholds, the
P4 metric provides a more tailored validation for scenarios where the minority class is of primary concern.
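For reference, the P4 metric of Sitarz (2022) can be written as the harmonic mean of precision, recall, specificity, and negative predictive value; the closed form below is our restatement of that definition in the notation of Table 7:

\[
\mathrm{P4} = \frac{4}{\frac{1}{\mathrm{Precision}} + \frac{1}{\mathrm{Recall}} + \frac{1}{\mathrm{Specificity}} + \frac{1}{\mathrm{NPV}}}
= \frac{4 \cdot \mathrm{TP} \cdot \mathrm{TN}}{4 \cdot \mathrm{TP} \cdot \mathrm{TN} + (\mathrm{TP} + \mathrm{TN}) \cdot (\mathrm{FP} + \mathrm{FN})}
\]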
Following the precision and recall analysis, we convert the identified threshold value into the corresponding
feature value. For instance, to translate a P4 metric threshold value (0.9952) into the corresponding Note Onset
Median Metric Level (NOMML), the index of the threshold value is determined within the threshold array derived
from the precision-recall curve analysis, ensuring that the P4 metric is maximized. This index is then used to
extract the corresponding feature value from the NOMML list. As a result, the threshold is set at the corresponding
percentile within our curated set used during the optimal threshold selection, establishing the boundary between
non-expressive and expressively-performed ground truth data. Finally, we perform a manual review to verify that
the selected thresholds align with the distribution of feature values within the dataset. Following this, we identify
the optimal percentiles for these thresholds by analyzing the distribution of the feature values.
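The sketch below strings Steps (3) to (6) together for a single heuristic feature. It is an illustrative reimplementation rather than the released code: the p4_score helper restates the definition of Sitarz (2022), and the final mapping takes the smallest feature value whose out-of-sample probability reaches the selected threshold, which approximates the index-based lookup described above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import LeaveOneOut, cross_val_predict

def p4_score(tp, tn, fp, fn):
    # Harmonic mean of precision, recall, specificity, and NPV (Sitarz, 2022).
    denom = 4 * tp * tn + (tp + tn) * (fp + fn)
    return 4 * tp * tn / denom if denom else 0.0

def optimal_threshold(feature, labels):
    """feature: 1-D array of heuristic values; labels: 0 (NE) or 1 (EP)."""
    X = np.asarray(feature, dtype=float).reshape(-1, 1)
    y = np.asarray(labels, dtype=int)
    model = LogisticRegression(random_state=0, C=0.046415888336127774,
                               max_iter=10000, tol=0.1)
    # Step (3): out-of-sample EP probabilities via leave-one-out cross-validation.
    proba = cross_val_predict(model, X, y, cv=LeaveOneOut(),
                              method="predict_proba")[:, 1]
    # Step (4): evaluate P4 over the candidate thresholds of the precision-recall curve.
    _, _, thresholds = precision_recall_curve(y, proba)
    p4_values = []
    for t in thresholds:
        pred = (proba >= t).astype(int)
        tp = int(np.sum((pred == 1) & (y == 1)))
        tn = int(np.sum((pred == 0) & (y == 0)))
        fp = int(np.sum((pred == 1) & (y == 0)))
        fn = int(np.sum((pred == 0) & (y == 1)))
        p4_values.append(p4_score(tp, tn, fp, fn))
    # Step (5): index of the probability threshold that maximizes P4.
    best = int(np.argmax(p4_values))
    # Step (6): translate the probability threshold into a feature-value threshold.
    above = X[proba >= thresholds[best], 0]
    feature_cutoff = float(above.min()) if above.size else None
    return thresholds[best], feature_cutoff
```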
