Volume 15 Number 1
Advances in Digital Music Iconography: Benchmarking the detection of musical instruments in unrestricted, non-photorealistic images from the artistic domain
DOI: https://doi.org/10.63744/nw7qvsb43mzn
Abstract
In this paper, we present MINERVA, the first benchmark dataset for the detection of musical instruments in non-photorealistic, unrestricted image collections from the realm of the visual arts. This effort is situated against the scholarly background of music iconography, an interdisciplinary field at the intersection of musicology and art history. We benchmark a number of state-of-the-art systems for image classification and object detection. Our results demonstrate the feasibility of the task but also highlight the significant challenges that this artistic material poses to computer vision. We apply the system to out-of-sample collections and offer an interpretive discussion of the false positives detected. The error analysis yields a number of unexpected insights into the contextual cues that trigger the detector. The iconography surrounding children and musical instruments, for instance, shares some core properties, such as an intimacy in body language.
Introduction: the era of the pixel
Motivation
Music iconography
Computer vision
Photo-realism
Data scarcity
Irrelevant training categories
Robustness of the models
MINERVA: dataset description
Data Sources
- RIDIM: We harvested a collection of high-quality images from the RIDIM database, in those cases where the database entries provided an unambiguous hyperlink to a publicly accessible image. These records were already assigned MIMO codes by a community of domain experts, which provided important support to our in-house annotators (especially during the first experimental rounds of annotations).
- RMFAB/RMAH: We expanded on the core RIDIM data by including (midrange resolution) images from the digital collections of two federal museums in Brussels: the RMFAB (Royal Museums of Fine Arts of Belgium, Brussels) and the RMAH (Royal Museums of Art and History, Brussels). These images were selected on the basis of previous annotations that suggested they included depictions of musical instruments, although no more specific labels (e.g. MIMO codes) were available for these records at this stage. Copyrighted artworks could not be included (copyright lasts for 70 years from the death of the creator under Belgian intellectual property law).
- Flickr: Finally, to scale up our annotation efforts, we collected a larger dataset of images from the well-known image hosting service 'Flickr' (www.flickr.com). We harvested all images from a community-curated collection of depictions of musical instruments in the visual arts pre-dating 1800.[3] This third campaign yielded much more data than the former two, but the data were noisier and contained a variety of false positives that had to be manually deleted during the annotation phase.
Vocabulary
| Instrument hypernym | Stringed instruments | Wind instruments | Percussion instruments | Keyboard instruments | Electronic instruments |
| Example instruments | Lute, psaltery, fiddle, viola da gamba, cittern | Transverse flute, end-blown trumpet, horn, shawm, bagpipe | Tambourine, cylindrical drum, frame drum, friction drum, bell | Pianoforte, virginal, portative organ, harpsichord, clavichord | Electric guitar, synthesizer, theremin, vocoder, mellotron |
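The two-level vocabulary in the table above can be sketched as a simple lookup structure. This is a minimal illustration that assumes a flat mapping from instrument label to hypernym and contains only the example instruments listed in the table; the names `HYPERNYMS` and `to_hypernym` are hypothetical and are not part of the MINERVA release.

```python
# Illustrative only: populated with the example instruments from the table,
# not with the full MINERVA vocabulary.
HYPERNYMS = {
    "Stringed instruments": ["Lute", "Psaltery", "Fiddle", "Viola da gamba", "Cittern"],
    "Wind instruments": ["Transverse flute", "End-blown trumpet", "Horn", "Shawm", "Bagpipe"],
    "Percussion instruments": ["Tambourine", "Cylindrical drum", "Frame drum", "Friction drum", "Bell"],
    "Keyboard instruments": ["Pianoforte", "Virginal", "Portative organ", "Harpsichord", "Clavichord"],
    "Electronic instruments": ["Electric guitar", "Synthesizer", "Theremin", "Vocoder", "Mellotron"],
}

# Invert the table so that each instrument label points to its hypernym.
INSTRUMENT_TO_HYPERNYM = {
    instrument.lower(): hypernym
    for hypernym, instruments in HYPERNYMS.items()
    for instrument in instruments
}


def to_hypernym(label):
    """Return the hypernym for an instrument label, or None if it is unknown."""
    return INSTRUMENT_TO_HYPERNYM.get(label.strip().lower())


if __name__ == "__main__":
    for name in ["Lute", "shawm", "Theremin"]:
        print(name, "->", to_hypernym(name))
```

Collapsing fine-grained labels in this way is one plausible route to hypernym-level classification, where each image instance is scored against five broad categories rather than individual instruments.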
Annotation process
- Representation: A challenging aspect was the variety of artistic depiction modes represented in the dataset, ranging from photo-realistic renderings to heavily stylized depictions from specific art-historical movements (e.g. impressionism, pointillism, fauvism, cubism, ...) (Figure 3a). Additionally, visibility could be low due to a proportionally small instrument depiction or the profusion of details (Figure 3b). In some instances, the state of the depicted object and its medium made the detection of the instrument difficult, e.g. a damaged medieval tympanon (Figure 3b).
- Quality: Other, more pragmatic issues arose from the images themselves. Occasionally, the quality of the images was too low to be able to detect the instruments (e.g. low resolution or compression defects) (Figure 3c). Many of the images did not meet international quality standards for heritage reproduction photography (uniform and neutral environment and lighting, frontal point of view), which made the instruments even more difficult to detect.
- Boxes: The rectangular shape of the bounding boxes sometimes had limitations and implied a certain lack of precision, e.g. in the case of a diagonally positioned flute, or in the case of overlapping instruments (Figure 3d). For some instruments which consist of several parts, e.g. a violin and its bow, only the main part (the violin) was annotated.
Characteristics
Versions and splits
| Version | Training-set Imag | Training-set Inst | Dev-set Imag | Dev-set Inst | Test-set Imag | Test-set Inst | Total Imag | Total Inst |
| Single inst | 1857 | 4243 | 1137 | 2288 | 1189 | 2102 | 4183 | 8633 |
| Top-5 inst | 952 | 1589 | 540 | 852 | 724 | 1173 | 2216 | 3614 |
| Top-10 inst | 1227 | 2147 | 680 | 1127 | 898 | 1506 | 2805 | 4780 |
| Top-20 inst | 1471 | 2915 | 860 | 1543 | 1047 | 1838 | 3378 | 6296 |
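As a sanity check on the split sizes in the table above, the following sketch verifies that the training, dev and test counts add up to the reported totals and derives the average number of annotations per image. It assumes that "Imag" and "Inst" denote image counts and annotated instrument instances respectively; the names `SPLITS`, `totals_are_consistent` and `instances_per_image` are hypothetical.

```python
# (images, instances) per split, copied from the table above.
SPLITS = {
    "Single inst": {"train": (1857, 4243), "dev": (1137, 2288), "test": (1189, 2102), "total": (4183, 8633)},
    "Top-5 inst": {"train": (952, 1589), "dev": (540, 852), "test": (724, 1173), "total": (2216, 3614)},
    "Top-10 inst": {"train": (1227, 2147), "dev": (680, 1127), "test": (898, 1506), "total": (2805, 4780)},
    "Top-20 inst": {"train": (1471, 2915), "dev": (860, 1543), "test": (1047, 1838), "total": (3378, 6296)},
}


def totals_are_consistent(splits):
    """Check that train + dev + test equals the reported total for every version."""
    for counts in splits.values():
        for column in (0, 1):  # 0 = images, 1 = instances
            summed = sum(counts[part][column] for part in ("train", "dev", "test"))
            if summed != counts["total"][column]:
                return False
    return True


def instances_per_image(version, part="total"):
    """Average number of annotated instances per image for one version and split."""
    images, instances = SPLITS[version][part]
    return instances / images


if __name__ == "__main__":
    print("Totals consistent:", totals_are_consistent(SPLITS))
    for version in SPLITS:
        print(version, round(instances_per_image(version), 2))
```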
Benchmark experiments
Classification
| CNN | Top-5 inst Acc. | Top-5 inst F1 | Top-10 inst Acc. | Top-10 inst F1 | Top-20 inst Acc. | Top-20 inst F1 | All inst Acc. | All inst F1 | Hypernyms Acc. | Hypernyms F1 |
| R-Net | 68.71 | 64.10 | 52.85 | 41.55 | 30.73 | 8.45 | 26.36 | 2.08 | 72.26 | 52.66 |
| V3 | 73.66 | 70.29 | 55.51 | 44.77 | 36.51 | 19.06 | 27.02 | 6.67 | 75.80 | 57.03 |
| V19 | 48.33 | 35.92 | 37.52 | 15.22 | 33.41 | 9.87 | 20.17 | 1.72 | 66.41 | 40.35 |
| Predicted label / Gold label | Bagpipe | E-b trumpet | Harp | Horn | Lute | Lyre | Por. organ | Rebec | Shawm | Violin |
| Bagpipe | 31 | 0 | 10 | 6 | 8 | 1 | 2 | 0 | 7 | 17 |
| E-b trumpet | 4 | 72 | 19 | 2 | 14 | 1 | 3 | 1 | 38 | 21 |
| Harp | 8 | 2 | 227 | 1 | 10 | 3 | 11 | 0 | 10 | 19 |
| Horn | 7 | 5 | 14 | 9 | 16 | 9 | 1 | 2 | 5 | 14 |
| Lute | 6 | 10 | 17 | 6 | 199 | 6 | 5 | 1 | 5 | 42 |
| Lyre | 3 | 0 | 19 | 1 | 13 | 5 | 2 | 0 | 3 | 11 |
| Por. organ | 3 | 0 | 10 | 1 | 0 | 0 | 57 | 0 | 1 | 4 |
| Rebec | 5 | 2 | 14 | 0 | 9 | 0 | 4 | 7 | 1 | 23 |
| Shawm | 4 | 11 | 25 | 2 | 11 | 2 | 4 | 6 | 40 | 13 |
| Violin | 6 | 12 | 29 | 4 | 35 | 4 | 7 | 11 | 6 | 202 |
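To make the relation between a confusion matrix and the two reported metrics concrete, the sketch below computes accuracy and macro-averaged F1 from the matrix above. It is an illustration under stated assumptions, not the authors' evaluation code: the function names are hypothetical, macro-averaging is assumed for F1, and no claim is made that the output reproduces any particular row of the classification table. Both metrics are unaffected by whether rows hold gold or predicted labels, because per-class F1 depends only on the sum of false positives and false negatives.

```python
LABELS = ["Bagpipe", "E-b trumpet", "Harp", "Horn", "Lute",
          "Lyre", "Por. organ", "Rebec", "Shawm", "Violin"]

# Confusion matrix copied row by row from the table above.
MATRIX = [
    [31, 0, 10, 6, 8, 1, 2, 0, 7, 17],
    [4, 72, 19, 2, 14, 1, 3, 1, 38, 21],
    [8, 2, 227, 1, 10, 3, 11, 0, 10, 19],
    [7, 5, 14, 9, 16, 9, 1, 2, 5, 14],
    [6, 10, 17, 6, 199, 6, 5, 1, 5, 42],
    [3, 0, 19, 1, 13, 5, 2, 0, 3, 11],
    [3, 0, 10, 1, 0, 0, 57, 0, 1, 4],
    [5, 2, 14, 0, 9, 0, 4, 7, 1, 23],
    [4, 11, 25, 2, 11, 2, 4, 6, 40, 13],
    [6, 12, 29, 4, 35, 4, 7, 11, 6, 202],
]


def accuracy(matrix):
    """Share of instances on the diagonal (correctly classified)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def per_class_f1(matrix):
    """F1 per class, using F1 = 2*TP / (row sum + column sum)."""
    size = len(matrix)
    scores = []
    for i in range(size):
        true_positives = matrix[i][i]
        row_sum = sum(matrix[i])
        column_sum = sum(matrix[r][i] for r in range(size))
        denominator = row_sum + column_sum  # equals 2*TP + FP + FN
        scores.append(2 * true_positives / denominator if denominator else 0.0)
    return scores


def macro_f1(matrix):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(matrix)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    print(f"Accuracy: {accuracy(MATRIX):.4f}")
    print(f"Macro F1: {macro_f1(MATRIX):.4f}")
    for label, score in zip(LABELS, per_class_f1(MATRIX)):
        print(f"{label:12s} F1 = {score:.3f}")
```

The per-class scores make the skew visible: frequent classes such as harp, lute and violin dominate the diagonal, while rare classes such as lyre and rebec are mostly absorbed by their more frequent neighbours.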
Detection
| Instrument | IoU ≥ | Precision | Recall | AP |
| Single-instrument | 10 | 0.63 | 0.42 | 0.35 |
| Single-instrument | 50 | 0.47 | 0.31 | 0.22 |
| Stringed-Instruments | 10 | 0.65 | 0.36 | 0.28 |
| Stringed-Instruments | 50 | 0.53 | 0.29 | 0.20 |
| Wind-Instruments | 10 | 0.43 | 0.07 | 0.04 |
| Wind-Instruments | 50 | 0.32 | 0.05 | 0.02 |
| Percussion-Instruments | 10 | 0.32 | 0.04 | 0.02 |
| Percussion-Instruments | 50 | 0.21 | 0.03 | 0.01 |
| Keyboard-Instruments | 10 | 0.61 | 0.11 | 0.07 |
| Keyboard-Instruments | 50 | 0.45 | 0.08 | 0.04 |
| Electronic-Instruments | 10 | - | - | - |
| Electronic-Instruments | 50 | - | - | - |
| Harp | 10 | 0.68 | 0.62 | 0.55 |
| Harp | 50 | 0.60 | 0.54 | 0.46 |
| Lute | 10 | 0.57 | 0.43 | 0.36 |
| Lute | 50 | 0.47 | 0.35 | 0.26 |
| Violin | 10 | 0.37 | 0.22 | 0.12 |
| Violin | 50 | 0.26 | 0.16 | 0.07 |
| Shawm | 10 | 0.13 | 0.04 | 0.01 |
| Shawm | 50 | 0.08 | 0.02 | 0.00 |
| End-blown trumpet | 10 | 0.28 | 0.04 | 0.01 |
| End-blown trumpet | 50 | 0.24 | 0.03 | 0.01 |
| Harp | 10 | 0.62 | 0.56 | 0.46 |
| Harp | 50 | 0.56 | 0.51 | 0.39 |
| Lute | 10 | 0.55 | 0.42 | 0.33 |
| Lute | 50 | 0.47 | 0.36 | 0.25 |
| Violin | 10 | 0.26 | 0.19 | 0.06 |
| Violin | 50 | 0.20 | 0.14 | 0.04 |
| Shawm | 10 | 0.17 | 0.03 | 0.00 |
| Shawm | 50 | 0.17 | 0.01 | 0.00 |
| End-blown trumpet | 10 | 0.67 | 0.02 | 0.01 |
| End-blown trumpet | 50 | 0.17 | 0.03 | 0.00 |
| Bagpipe | 10 | 0 | 0 | 0 |
| Bagpipe | 50 | 0 | 0 | 0 |
| Portative-Organ | 10 | 0.24 | 0.13 | 0.06 |
| Portative-Organ | 50 | 0.24 | 0.13 | 0.06 |
| Horn | 10 | 0 | 0 | 0 |
| Horn | 50 | 0 | 0 | 0 |
| Rebec | 10 | - | - | - |
| Rebec | 50 | - | - | - |
| Lyre | 10 | - | - | - |
| Lyre | 50 | - | - | - |
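The detection scores above depend on an intersection-over-union (IoU) threshold. The sketch below shows how IoU between two boxes can be computed and how precision and recall change with the threshold. It assumes that the values 10 and 50 denote IoU thresholds of 10% and 50%, that boxes are given as `(x_min, y_min, x_max, y_max)`, and that predictions are matched greedily to gold boxes in order of confidence. This matching rule is a common convention chosen for illustration and is not necessarily the protocol used for the reported numbers; average precision (AP) is not computed here.

```python
def box_area(box):
    """Area of a box given as (x_min, y_min, x_max, y_max)."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes."""
    left = max(box_a[0], box_b[0])
    top = max(box_a[1], box_b[1])
    right = min(box_a[2], box_b[2])
    bottom = min(box_a[3], box_b[3])
    intersection = max(0.0, right - left) * max(0.0, bottom - top)
    union = box_area(box_a) + box_area(box_b) - intersection
    return intersection / union if union > 0 else 0.0


def precision_recall(predictions, gold_boxes, threshold):
    """Greedy one-to-one matching of (box, score) predictions to gold boxes."""
    matched = set()
    true_positives = 0
    # Highest-confidence predictions get the first pick of gold boxes.
    for box, _score in sorted(predictions, key=lambda p: p[1], reverse=True):
        best_index, best_iou = None, 0.0
        for index, gold in enumerate(gold_boxes):
            if index in matched:
                continue
            overlap = iou(box, gold)
            if overlap > best_iou:
                best_index, best_iou = index, overlap
        if best_index is not None and best_iou >= threshold:
            matched.add(best_index)
            true_positives += 1
    precision = true_positives / len(predictions) if predictions else 0.0
    recall = true_positives / len(gold_boxes) if gold_boxes else 0.0
    return precision, recall


if __name__ == "__main__":
    gold = [(0, 0, 10, 10)]
    predicted = [((5, 0, 15, 10), 0.9)]  # overlaps half of the gold box
    print("IoU:", iou(predicted[0][0], gold[0]))
    print("IoU >= 0.10:", precision_recall(predicted, gold, 0.10))
    print("IoU >= 0.50:", precision_recall(predicted, gold, 0.50))
```

In this toy case the same prediction counts as a hit at the lenient threshold and as a miss at the strict one, which is the mechanism behind the consistently lower scores at the higher threshold in the table.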
Additional experiments
- RMFAB/RMAH: 428 out-of-sample images from the digital assets of both museum collections. These images are not part of the annotated material (and thus not part of the training and validation material of the applied detector), because the available metadata did not explicitly specify that they contained depictions of musical instruments. (This collection cannot be shared due to copyright restrictions.)
- IconArt: a generic collection of 6,528 artistic images, collected from the community-curated platform WikiArt: Visual Art Encyclopedia (https://www.wikiart.org/). The IconArt subcollection was previously redistributed by [Gonthier et al. 2018]: https://wsoda.telecom-paristech.fr/downloads/dataset/.
| Collection | Total images | Detections | True positives |
| RMFAB/RMAH | 428 | 162 | 6 |
| IconArt | 6528 | 118 | 42 |
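The counts in the table above translate directly into the share of detections that turned out to be correct on each out-of-sample collection. The short calculation below makes that arithmetic explicit; the names `RESULTS` and `detection_precision` are hypothetical, and recall cannot be derived because the number of instruments actually present in these collections is not reported in the table.

```python
# Counts copied from the table above.
RESULTS = {
    "RMFAB/RMAH": {"images": 428, "detections": 162, "true_positives": 6},
    "IconArt": {"images": 6528, "detections": 118, "true_positives": 42},
}


def detection_precision(collection):
    """Fraction of detections in a collection that were true positives."""
    counts = RESULTS[collection]
    return counts["true_positives"] / counts["detections"]


def detections_per_image(collection):
    """Average number of detections raised per image in a collection."""
    counts = RESULTS[collection]
    return counts["detections"] / counts["images"]


if __name__ == "__main__":
    for name in RESULTS:
        print(f"{name}: precision = {detection_precision(name):.1%}, "
              f"detections per image = {detections_per_image(name):.3f}")
```

Under these counts, roughly one in 27 detections was correct on RMFAB/RMAH and roughly one in three on IconArt.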
Discussion
Skewed results
Saliency maps
- (a) Focus on the neck of the stringed instrument, as well as the characteristic presence of tuning pins at the end of the neck;
- (b) Sensitive to the presence of stretched fingers in an unnatural position;
- (c) Typical conic shape of a lyre, with outward pointing ends connected by a bridge;
- (d) Symmetric presence of tone holes in the aerophone;
- (e) Elongated, cylindrical shape of the main body of the aerophone with wider end;
- (f) Mirrored placement of fingers and hands (close to one another).
Error analysis: false positives
Conclusions and future research
Acknowledgements
Notes
Works Cited