Musical Gesture Recognition Using Machine Learning and Audio Descriptors

Paul Best, Jean Bresson, Diemo Schwarz

To cite this version:
Paul Best, Jean Bresson, Diemo Schwarz. Musical Gesture Recognition Using Machine Learning and Audio Descriptors. International Conference on Content-Based Multimedia Indexing (CBMI'18), 2018, La Rochelle, France. hal-01839050

HAL Id: hal-01839050
[Link]
Submitted on 13 Jul 2018

Musical Gesture Recognition Using Machine Learning and Audio Descriptors

Paul Best, Jean Bresson, Diemo Schwarz
STMS Lab: Ircam, CNRS, Sorbonne Université, Paris, France

Abstract—We report preliminary results of an ongoing project on automatic recognition and classification of musical "gestures" from audio extracts. We use a machine learning tool designed for motion tracking and recognition, applied to labeled vectors of audio descriptors in order to recognize hypothetical gestures formed by these descriptors. A hypothesis is that the classes detected in audio descriptors can be used to identify higher-level/abstract musical structures which might not be described easily using standard/symbolic representations.

Index Terms—Machine learning, gesture, hidden Markov chains, audio descriptors.

I. INTRODUCTION

Musical structures carry abstract perceptual features and characteristics that can be straightforward to identify by composers and/or listeners, but difficult or impossible to formally describe using the elements of standard score representations (e.g., identifying harmonic/melodic patterns, etc.). Composers and authors often use the term gesture to characterize these dynamic elements constituting musical forms, as an analogy with the idea of gesture in physical movements [1], [2].

Physical movements can be analyzed using machine learning techniques: Ircam's XMM library [3], for instance, uses a combination of Gaussian mixture models (GMM) and hidden Markov models (HMM) in order to recognize performed gestures from parallel streams of descriptors extracted from motion capture (Cartesian coordinates, speed, etc.). HMMs have shown good performance processing movements in such time series [4], [5], and require much smaller training sets than neural networks to build reliable models. We propose here that similar models could be used by composers working with audio descriptors to recognize and classify musical gestures.

This hypothesis is tested on a real, specific example, in the context of a composer's musical research residency project at Ircam (Paris). We developed a binding of the XMM library in the OpenMusic composition and visual programming software [6], associated with audio description and analysis tools, and used this framework to run preliminary experiments.

A general objective of this work is to provide music composers with tools and interfaces to work with machine learning in a computer-aided composition environment [7]. HMM techniques fit well with this situation, where algorithms will be trained on data input by end-users, rather than on databases or other large datasets distributed for instance over the web. A crucial part of the work, however, resides in the strategy employed to efficiently test and select the relevant features and parameters to use in order to carry out a recognition task, given some input data type and training set at hand.
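The gesture-modelling idea can be summarized outside of the XMM/OpenMusic framework used in this work. The following Python sketch is an assumption-based analogue built with the hmmlearn package rather than XMM, with illustrative class and parameter names: it trains one Gaussian-observation HMM per gesture class on descriptor sequences and classifies a new segment by maximum log-likelihood.

```python
# Minimal sketch (not the XMM/OpenMusic pipeline): one Gaussian HMM per
# gesture class, trained on descriptor sequences; classification picks the
# class whose HMM assigns the highest log-likelihood. Names are illustrative.
import numpy as np
from hmmlearn import hmm

class GestureHMMClassifier:
    def __init__(self, n_states=10, min_covar=0.1):
        self.n_states = n_states
        self.min_covar = min_covar    # covariance floor, a rough analogue of regularization
        self.models = {}              # one HMM per gesture label

    def fit(self, segments, labels):
        """segments: list of (n_frames, n_descriptors) arrays; labels: list of class names."""
        for label in set(labels):
            seqs = [s for s, l in zip(segments, labels) if l == label]
            X = np.concatenate(seqs)
            lengths = [len(s) for s in seqs]
            model = hmm.GaussianHMM(n_components=self.n_states,
                                    covariance_type="diag",
                                    min_covar=self.min_covar,
                                    n_iter=50)
            model.fit(X, lengths)
            self.models[label] = model
        return self

    def predict(self, segment):
        """Return the label whose HMM best explains the descriptor sequence."""
        scores = {label: m.score(segment) for label, m in self.models.items()}
        return max(scores, key=scores.get)
```

XMM additionally supports hierarchical HMMs, GMM observation models and continuous gesture following, all of which this sketch leaves out.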
II. PROTOCOL

The dataset preparation, model testing and validation are carried out and controlled in OpenMusic visual programs such as the one visible in Fig. 1, using the OM-XMM library.

A. Dataset

The dataset used in this experiment is a 9-minute-long extract of the piece Zāmyād for cello by contemporary music composer Alireza Farhang.¹ This piece is largely built around the concept of musical gestures. Indeed, some easily noticeable patterns occur – with variations – throughout the piece. The definition of a gesture in such a musical context can be relatively subjective, and based both on perceptual and more abstract compositional considerations (gestures are for instance defined by the composer in terms like "an unstable glissando sound with reversed attack"). The composer identified 20 different gesture classes in the score, and annotated the audio extract accordingly, cutting it into 187 labeled segments (the approximate duration of a segment is between 1 and 3 seconds). The number of elements per class varies between 1 (a gesture that appears only once in the whole training set) and 22.

¹ [Link]

B. Audio descriptors analysis

Each audio sample is analyzed and turned into a vector of signal descriptors using the IAE/pipo libraries [8], [9]. The main audio descriptors considered in this study are the Mel-frequency cepstral coefficients (MFCCs), the Fundamental Frequency, Energy, Periodicity, First-order autocorrelation coefficient (AC1), Loudness, Centroid, Spread, Skewness, and Kurtosis. We use the first 12 MFCCs and therefore work with a total of 21 descriptors, sampled according to a given analysis window size and overlap factor. Not all of these descriptors are relevant: further in this paper, we discuss strategies to select and combine them efficiently.
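The analysis above relies on the IAE/pipo modules inside OpenMusic. As a rough stand-in for readers who want to build a comparable frame-wise descriptor matrix outside that environment, the sketch below uses the librosa Python library (an assumption, not the authors' toolchain); spectral bandwidth, RMS and YIN respectively approximate the Spread, Energy/Loudness and Fundamental Frequency descriptors, and several descriptors (Periodicity, AC1, Skewness, Kurtosis) are omitted for brevity.

```python
# Rough stand-in for the IAE/pipo analysis (assumption: librosa instead of the
# authors' toolchain). Produces a (n_frames, n_descriptors) matrix per audio
# segment, analogous to the descriptor vectors fed to the model.
import numpy as np
import librosa

def describe_segment(path, n_mfcc=12, frame_length=2048, hop_length=512):
    y, sr = librosa.load(path, sr=None, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=frame_length, hop_length=hop_length)
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr, n_fft=frame_length,
                                                 hop_length=hop_length)
    spread = librosa.feature.spectral_bandwidth(y=y, sr=sr, n_fft=frame_length,
                                                hop_length=hop_length)
    rms = librosa.feature.rms(y=y, frame_length=frame_length, hop_length=hop_length)
    f0 = librosa.yin(y, fmin=60, fmax=2000, sr=sr,
                     frame_length=frame_length, hop_length=hop_length)
    # Stack descriptors frame by frame: shape (n_frames, n_mfcc + 4)
    n = min(mfcc.shape[1], centroid.shape[1], spread.shape[1], rms.shape[1], len(f0))
    return np.vstack([mfcc[:, :n], centroid[:, :n], spread[:, :n],
                      rms[:, :n], f0[:n][np.newaxis, :]]).T
```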
Fig. 1. Example of an OpenMusic patch and graphical interface (v. o7) including XMM model testing and validation tools applied to audio signal descriptors.

C. Model validation

We use a 10-fold cross-validation to measure our model's performance: for 10 folds, we successively train the model with 90% of the data, and compute the accuracy of recognition for the training set and for the remaining data. Finally, the average of each accuracy is computed. Our measure of the model's performance thus consists of two average accuracy values: the training-set accuracy and the test-set accuracy. Since the final purpose of this project is to recognize gestures in new audio pieces, we aim first and foremost at improving the test-set accuracy.

Confusion matrices allow us to identify, among the different classes of segments, the ones that are problematic to recognize. Table I reproduces part of a matrix such as the one output by the OpenMusic visual program in Fig. 1. We can observe that some classes (such as C) are easily recognized, while some others are not (e.g. K, confused with L 45% of the time).

TABLE I
A CONFUSION MATRIX EXCERPT OUTPUT AFTER THE TEST PROCEDURE OF Zāmyād'S GESTURE RECOGNITION MODEL.

Ground-truth class (rows) vs. predicted class (columns A, C, G, H, K, L, ...):
    A    0.89  0.10
    C    1
    G    0.09  0.81  0.05  0.05
    H    0.10  0.16  0.74
    K    0.54  0.45
    L    0.05  0.19  0.05  0.71
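The validation loop described above can be scripted as follows. This is a schematic sketch: scikit-learn utilities and a generic fit/predict classifier (passed in as model_factory) stand in for the OM-XMM model, and fixed-length feature vectors stand in for the descriptor sequences; names are illustrative.

```python
# Schematic of the validation protocol: 10-fold cross-validation computing the
# average training-set and test-set accuracies, plus an accumulated,
# row-normalized confusion matrix. `model_factory` returns any estimator with
# fit/predict (a stand-in for the gesture recognition model).
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score, confusion_matrix

def cross_validate(model_factory, X, y, n_folds=10):
    """X: (n_samples, n_features) array; y: (n_samples,) array of labels."""
    classes = np.unique(y)
    train_accs, test_accs = [], []
    cm = np.zeros((len(classes), len(classes)))
    for train_idx, test_idx in KFold(n_splits=n_folds, shuffle=True).split(X):
        model = model_factory().fit(X[train_idx], y[train_idx])
        train_accs.append(accuracy_score(y[train_idx], model.predict(X[train_idx])))
        y_pred = model.predict(X[test_idx])
        test_accs.append(accuracy_score(y[test_idx], y_pred))
        cm += confusion_matrix(y[test_idx], y_pred, labels=classes)
    cm = cm / cm.sum(axis=1, keepdims=True)   # normalize each ground-truth row
    return np.mean(train_accs), np.mean(test_accs), cm
```

For sequence data such as the labeled segments used here, X would instead be a list of descriptor sequences and the fold indexing would operate on that list.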
D. Hyperparameters optimization

Feature selection and the tuning of hyperparameters are key elements for the recognition model's performance. In addition to the choice of audio descriptors, three main parameters can vary in the model:

• The number of hidden states of the Markov model.
• The two regularization coefficients (offsets added to the covariance matrices of the Gaussian distributions at each re-estimation in the algorithm):
  – Relative: offset relative to the data variance.
  – Absolute: minimum offset value.

A simple format has been defined to describe the model features and hyperparameters in the validation process. For instance, <mfcc,descr> (0 2 15) 15 (0.1 0.05) means (a small parsing sketch is given after the list):

• Audio descriptors are extracted using the mfcc and descr analysis modules (which respectively generate the 12 MFCC coefficients and the vector of 9 other descriptors).
• The subset of selected descriptors consists of descriptors number 0, 2 and 15 (among 21 in total).²
• 15 hidden states are used in the Markov model.
• The relative and absolute regularization values in the model are 0.1 and 0.05.

² These numbers correspond to the 1st and 3rd MFCCs, and to the first-order autocorrelation coefficient descriptor (AC1).
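The exact syntax of this format beyond the single example above is not specified here, so the following parser is only an illustration of the structure it encodes (analysis modules, selected descriptor indices, number of hidden states, and the two regularization values); the function and field names are ours.

```python
# Illustrative parser for the configuration format shown above (assumption:
# the format follows exactly the pattern of the example; field names are ours).
import re
from dataclasses import dataclass

@dataclass
class ModelConfig:
    modules: list          # analysis modules, e.g. ["mfcc", "descr"]
    descriptors: list      # indices of the selected descriptors
    n_states: int          # number of hidden states in the HMM
    rel_reg: float         # relative regularization
    abs_reg: float         # absolute regularization

def parse_config(text):
    m = re.match(r"<(.+?)>\s*\(([\d\s]+)\)\s*(\d+)\s*\(([\d.\s]+)\)", text)
    modules = [s.strip() for s in m.group(1).split(",")]
    descriptors = [int(i) for i in m.group(2).split()]
    rel_reg, abs_reg = (float(x) for x in m.group(4).split())
    return ModelConfig(modules, descriptors, int(m.group(3)), rel_reg, abs_reg)

# parse_config("<mfcc,descr> (0 2 15) 15 (0.1 0.05)")
# -> ModelConfig(modules=['mfcc', 'descr'], descriptors=[0, 2, 15],
#                n_states=15, rel_reg=0.1, abs_reg=0.05)
```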
Because results can be highly dependent on the dataset used for validation, it is important to have an easy and automated way for users to optimize their model and fine-tune these hyperparameters. Different techniques have been applied to choose optimal values for these different parameters.

1) Brute force algorithm: This algorithm consists of an exhaustive test (10-fold cross-validation) of a whole list of hyperparameter sets. It gives a clear view of which parameter sets work and which do not, but requires all hyperparameter configurations to be defined beforehand (testing all possible combinations would take too much time).
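In other words, the brute-force procedure amounts to ranking a pre-defined grid of configurations by cross-validated test-set accuracy. A minimal sketch, assuming an evaluate callable that runs the 10-fold cross-validation of one configuration and returns its test-set accuracy (names are illustrative):

```python
# Sketch of the brute-force search: exhaustively cross-validate a pre-defined
# grid of hyperparameter configurations and rank them by test-set accuracy.
# `evaluate` stands in for the 10-fold cross-validation of one configuration.
from itertools import product

def brute_force_search(descriptor_subsets, state_counts, reg_pairs, evaluate):
    results = []
    for descriptors, n_states, (rel_reg, abs_reg) in product(
            descriptor_subsets, state_counts, reg_pairs):
        test_acc = evaluate(descriptors, n_states, rel_reg, abs_reg)
        results.append((test_acc, descriptors, n_states, rel_reg, abs_reg))
    # Best configurations first
    return sorted(results, key=lambda r: r[0], reverse=True)

# Example grid (kept small on purpose: the full combinatorial space is intractable):
# brute_force_search([(0, 2), (0, 2, 15)], [10, 15, 20],
#                    [(0.1, 0.05), (0.2, 0.01)], evaluate)
```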
2) Genetic algorithm: To avoid testing an exhaustive set of hyperparameter configurations, a genetic approach has been developed: it starts from a small number of hyperparameter configurations (in principle, already identified as performing fairly well) and applies small variations to them at each iteration.

The algorithm searches for the optimal number of states and regularization values, and selects the audio descriptors to use in the model. The range of random variations for each hyperparameter is set as follows (but is adjustable for different cases and datasets):

• Number of states: [-4, +4];
• Relative regularization: [-0.05, +0.05];
• Absolute regularization: [-0.005, +0.005].

At each iteration, a new audio descriptor is also randomly chosen and added to or removed from the descriptor list.

After applying these random variations to the hyperparameter configurations, the algorithm evaluates the newly formed configurations and keeps the 4 best-performing ones (in terms of test-set accuracy) among the old and new ones.

The main limitation of this algorithm is that the search is likely to get stuck in local optima: it is efficient for fine-tuning hyperparameter configurations already known to perform well, but is not reliable in a global search context.
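A simplified reading of this procedure is sketched below; it is not the actual OM-XMM implementation, and evaluate again stands in for the 10-fold cross-validation of one configuration. Each configuration is a (descriptors, n_states, rel_reg, abs_reg) tuple, mutated within the ranges listed above; the 4 best configurations among old and new ones are kept at each iteration.

```python
# Simplified reading of the described genetic search (not the actual OM-XMM
# implementation). `evaluate(config)` returns the test-set accuracy of a
# configuration; names and bounds handling are illustrative.
import random

N_DESCRIPTORS = 21  # total number of available descriptors

def mutate(config):
    descriptors, n_states, rel_reg, abs_reg = config
    # Perturb hyperparameters within the ranges given above
    n_states = max(1, n_states + random.randint(-4, 4))
    rel_reg = max(1e-6, rel_reg + random.uniform(-0.05, 0.05))
    abs_reg = max(1e-6, abs_reg + random.uniform(-0.005, 0.005))
    # Randomly toggle one descriptor in or out of the selection
    d = random.randrange(N_DESCRIPTORS)
    descriptors = tuple(sorted(set(descriptors) ^ {d})) or descriptors
    return (descriptors, n_states, rel_reg, abs_reg)

def genetic_search(initial_configs, evaluate, n_iterations=20, keep=4):
    population = list(initial_configs)
    for _ in range(n_iterations):
        candidates = population + [mutate(c) for c in population]
        # Keep the best-performing configurations among old and new ones
        candidates.sort(key=evaluate, reverse=True)
        population = candidates[:keep]
    return population
```

Caching evaluation results would avoid re-scoring unchanged configurations at each iteration.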
E. Computation time

These search and evaluation procedures for hyperparameters take time to compute. Although efforts were made to optimize them, on average it takes approximately 5 minutes on a standard personal computer to train and cross-validate a model with a given configuration of hyperparameters on the 187 elements of our dataset.

III. RESULTS

In this section, we report elements of the model performance depending on the selected hyperparameters, focusing on the selection of audio descriptors used for training and classifying gestures. Additional results and data can be found at [Link]

Our experiments have shown that the more descriptors are used in the model, the more hidden states are required for the HMM to perform well. We used first-order Hierarchical HMMs, with only one Gaussian per state. In the following sections III-A, III-B, and III-C, the number of hidden states has been fixed manually. In section III-D, this number is searched using the genetic algorithm. In all reported results, relative and absolute regularization values were set to 0.1 and 0.05 respectively.

A. One descriptor

Table II reports the model performance when trained and run with a single audio descriptor. The model is tested with 10 hidden states.

TABLE II
ACCURACY OF THE MODEL WITH SINGLE DESCRIPTORS.

Descriptor     Test set accuracy   Training set accuracy
MFCC 1         0.21                0.30
MFCC 2         0.24                0.29
MFCC 3         0.24                0.31
MFCC 4         0.17                0.27
MFCC 5         0.11                0.24
MFCC 6         0.12                0.23
MFCC 7         0.19                0.30
MFCC 8         0.12                0.12
[...]          [...]               [...]
Frequency      0.23                0.29
Energy         0.04                0.05
Periodicity    0.06                0.12
AC1            0.04                0.06
Loudness       0.24                0.30
Centroid       0.21                0.35
Spread         0.22                0.34
Skewness       0.14                0.22
Kurtosis       0.11                0.18

These initial results show that none of the descriptors is reliable enough by itself to accurately analyze and recognize gestures in the audio samples. We study the model performance using combined descriptors in the following sections.

B. Combination of two descriptors

Table III presents the performance of models trained with selected combinations of two descriptors. Those models were tested with 15 hidden states.

Performance is visibly improved. We also note that the best combinations of two do not necessarily correspond to the combination of the best-performing descriptors when tested alone (section III-A), which emphasizes emergent, non-predictable aspects of the descriptor selection task. For example, Periodicity alone gave a 0.06 test-set accuracy, but reaches a 0.44 test-set accuracy when combined with Spread.
TABLE III
ACCURACY OF THE MODEL WITH COMBINATIONS OF 2 DESCRIPTORS.

Descriptors            Test set accuracy   Training set accuracy
MFCC 1, 4              0.44                0.66
MFCC 2, Frequency      0.47                0.64
MFCC 2, Loudness       0.47                0.63
MFCC 2, Spread         0.46                0.65
MFCC 3, Spread         0.45                0.65
Frequency, Loudness    0.46                0.66
Frequency, Spread      0.45                0.63
Periodicity, Spread    0.44                0.56
Centroid, Spread       0.52                0.69

C. Combination of three descriptors

Table IV presents a selection of performance results from models trained and run with combinations of 3 descriptors. Those models were tested with 18 hidden states. As we can see, results are again noticeably better, and relevant combinations of descriptors become more salient (e.g. Spread, Centroid, and the first few MFCC coefficients).

TABLE IV
ACCURACY OF THE MODEL WITH COMBINATIONS OF 3 DESCRIPTORS.

Descriptors                      Test set accuracy   Training set accuracy
MFCC 2, Centroid, Spread         0.57                0.79
Periodicity, Centroid, Spread    0.57                0.77
MFCC 3, 4, Spread                0.56                0.79
MFCC 2, Frequency, Spread        0.56                0.75
MFCC 2, 4, Centroid              0.56                0.76

D. Genetic algorithm results

Using the directed search of the genetic algorithm allowed us to fine-tune relevant hyperparameter configurations found previously. Table V shows the four best hyperparameter sets obtained with the algorithm. For all of them, the relative and absolute regularization values converged to 0.1 and 0.05 respectively. The resulting selected sets of descriptors are far more complex than the previous combinations, and the recognition performance is slightly improved.

TABLE V
MODEL ACCURACIES WITH HYPERPARAMETER SETS AND BEST COMBINATIONS OF DESCRIPTORS AS FOUND BY THE GENETIC ALGORITHM.

Descriptors                                                  Hidden states   Test set accuracy   Training set accuracy
MFCC 2, 3, 4, 5, 6, 7, 12, Frequency, Energy,
  Periodicity, AC1                                           29              0.61                0.91
MFCC 2, 3, 4, 5, 12, Frequency, Energy,
  Periodicity, AC1, Loudness                                 21              0.60                0.87
MFCC 2, 3, 4, 5, 12, Frequency, Energy,
  Periodicity, AC1                                           22              0.60                0.86
MFCC 2, 3, 4, 5, 12, Frequency, Energy, AC1                  22              0.60                0.86

IV. CONCLUSION

We tested machine learning techniques designed for motion processing on audio signal descriptors, in order to recognize abstract musical gestures in audio extracts: elements of a temporal form which carry a perceptual identity, but cannot be easily described using standard score descriptions.

We also developed an integrated framework including the IAE and XMM libraries in the OpenMusic computer-aided music composition and visual programming environment, which allows users (composers) to prepare datasets, train and test HMM models, and select relevant hyperparameters.³

³ The library is open-source and available for download at: [Link]

The choice of hyperparameters still requires significant involvement from the user, and is highly dependent on the datasets and the types of gestures to be recognized. The work accomplished so far facilitates a workflow for such hyperparameter optimization, thanks to built-in tools such as cross-validation and genetic algorithms.

We cannot be fully satisfied yet with our model's performance on the tested dataset (test-set accuracy of 0.65 at maximum), but no conclusion can be drawn from one single dataset: further work will focus on applying and improving these tools and this workflow on other examples and datasets.

ACKNOWLEDGMENT

This work is carried out within the PEPS I3A support framework of the French National Center for Scientific Research (CNRS). The authors thank composer Alireza Farhang for providing the annotated gesture dataset.

REFERENCES

[1] A. Farhang, "Modelling a gesture: Tak-Sīm for string quartet and live electronics," in The OM Composer's Book, vol. 3, J. Bresson, C. Agon, and G. Assayag, Eds., Editions Delatour / Ircam-Centre Pompidou, 2016.
[2] R. I. Godøy and M. Leman, Eds., Musical Gestures: Sound, Movement, and Meaning. Routledge, 2010.
[3] J. Françoise, N. Schnell, and F. Bevilacqua, "A Multimodal Probabilistic Model for Gesture-based Control of Sound Synthesis," in MM'13: Proceedings of the ACM International Conference on Multimedia, Barcelona, Spain, 2013.
[4] H.-K. Lee and J. H. Kim, "An HMM-based threshold model approach for gesture recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 21, no. 10, 1999.
[5] F. Bevilacqua, B. Zamborlin, A. Sypniewski, N. Schnell, F. Guédy, and N. Rasamimanana, "Continuous realtime gesture following and recognition," in Gesture in Embodied Communication and Human-Computer Interaction, 8th International Gesture Workshop, Bielefeld, Germany, 2010.
[6] J. Bresson, D. Bouche, T. Carpentier, D. Schwarz, and J. Garcia, "Next-generation Computer-aided Composition Environment: A New Implementation of OpenMusic," in Proceedings of the International Computer Music Conference, Shanghai, China, 2017.
[7] J. Bresson, P. Best, D. Schwarz, and A. Farhang, "From Motion to Musical Gesture: Experiments with Machine Learning in Computer-Aided Composition," in MUME 2018: International Workshop on Musical Metacreation, Salamanca, Spain, 2018.
[8] N. Schnell, D. Schwarz, R. Cahen, and V. Zappi, "IAE & IAEOU: The IMTR Audio Engine," in Topophonie research project: Audiographic cluster navigation (2009-2012), R. Cahen, Ed., ENSCI – Les Ateliers / Paris Design Lab, pp. 50-51, 2012.
[9] N. Schnell, D. Schwarz, J. Larralde, and R. Borghesi, "PiPo, A Plugin Interface for Afferent Data Stream Processing Modules," in International Symposium on Music Information Retrieval, Suzhou, China, 2017.
