Explainable Neural-Symbolic Learning
Natalia Dı́az-Rodrı́guez*a,f,c , Alberto Lamas*f,c , Jules Sanchez*a , Gianni Franchia , Ivan Donadellob ,
Siham Tabikf,c , David Filliata , Policarpo Cruze , Rosana Montesc,g , Francisco Herreraf,c,d
a U2IS, ENSTA, Institut Polytechnique Paris and Inria Flowers, 91762, Palaiseau, France
b Free University of Bozen-Bolzano, 39100, Italy
c DaSCI Andalusian Institute in Data Science and Computational Intelligence, University of Granada, 18071 Granada, Spain
d Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, 21589, Saudi Arabia
e Art History Department, University of Granada, 18071 Granada, Spain
f Department of Computer Science and Artificial Intelligence, University of Granada, 18071 Granada, Spain
g Department of Software Engineering, University of Granada, 18071 Granada, Spain
Abstract
The latest Deep Learning (DL) models for detection and classification have achieved an unprecedented
performance over classical machine learning algorithms. However, DL models are black-box methods
hard to debug, interpret, and certify. DL alone cannot provide explanations that can be validated by a
non-technical audience such as end-users or domain experts. In contrast, symbolic AI systems that convert
concepts into rules or symbols –such as knowledge graphs– are easier to explain. However, they present
lower generalisation and scaling capabilities. A very important challenge is to fuse DL representations
with expert knowledge. One way to address this challenge, as well as the performance-explainability
trade-off, is to leverage the best of both streams without obviating domain expert knowledge. In this
paper, we tackle this problem by considering that the symbolic knowledge is expressed in the form of a domain
expert knowledge graph. We present the eXplainable Neural-symbolic learning (X-NeSyL) methodology,
designed to learn both symbolic and deep representations, together with an explainability metric to assess
the level of alignment of machine and human expert explanations. The ultimate objective is to fuse
DL representations with expert domain knowledge during the learning process so it serves as a sound
basis for explainability. In particular, X-NeSyL methodology involves the concrete use of two notions of
explanation, both at inference and training time respectively: 1) EXPLANet: Expert-aligned eXplainable
Part-based cLAssifier NETwork Architecture, a compositional convolutional neural network that makes
use of symbolic representations, and 2) SHAP-Backprop, an explainable AI-informed training procedure
that corrects and guides the DL process to align with such symbolic representations in the form of knowledge
graphs. We showcase the X-NeSyL methodology using the MonuMAI dataset for monument facade image
classification, and demonstrate that our approach improves not only the explainability of DL models but also
their performance.
Keywords: Explainable Artificial Intelligence, Deep Learning, Neural-symbolic learning, Expert
Knowledge Graphs, Compositionality, Part-based Object Detection and Classification
1. Introduction
Currently, Deep Learning (DL) models constitute the state of the art in many problems [56, 41, 97,
92, 48]. These models are opaque, complex and hard to debug, which makes their use unsafe in critical
applications such as healthcare and other high-risk scenarios. Furthermore, DL often requires a large amount of
training data with over-simplified annotations that obviate an important part of centuries-long knowledge
from domain experts. At the same time, DL models generally rely on correlation shortcuts to produce their outputs,
which makes them brittle and difficult to correct. In contrast, most classical symbolic AI approaches
are interpretable but reach neither similar levels of performance nor similar scalability.
Among the potential solutions to clarify the decision process of a DL model, the topic of eXplainable
AI (XAI) emerges. Given an audience, an XAI system produces details or reasons to make its functioning
clear or easy to understand [5, 38]. To make black-box Deep Learning methods more interpretable, a
large body of work has exposed their vulnerabilities and sensitivity and proposed visual interpretation
techniques, such as attribution or saliency maps [98, 80, 69]. However, the explanations provided by these
methods, often in the form of heatmaps, are not always sufficient, i.e., they are not easy to quantify, correct,
or convey to non-technical audiences [47, 95, 93, 61, 40, 3, 49].
Having both specific and broad audiences of AI models contributes towards inclusiveness and accessi-
bility, both part of the principles for responsible [5] and human-centric AI [70]. Furthermore, as advocated
in [26], broadening the inclusion of different minorities and audiences can facilitate the evaluation of AI
models when the objective is deploying human-centred AI systems.
A very critical challenge is thus to blend DL representations with domain expert knowledge. This leads
us to draw inspiration from Neural-Symbolic (NeSy) learning [20, 9], a learning paradigm composed of
both neural (or sub-symbolic) and symbolic AI components. An interesting challenge consists of bringing
explainability into this fusion through the alignment of such learned and symbolic representations [7]. In
order to pursue this idea further, we approach this quest by considering the expert knowledge to be in the form
of a knowledge graph (KG).
Since our ultimate objective is fusing DL representations and domain expert representations, we propose
the eXplainable Neural-symbolic Learning (X-NeSyL) methodology to fill this gap and bring explainability
into the process. The X-NeSyL methodology aims to make neural-symbolic models explainable, while
providing more universal explanations for both end-users and domain experts. The X-NeSyL methodology
is designed to enhance both the performance and the explainability of DL, in particular, of a convolutional neural
network (CNN) classification model. The X-NeSyL methodology consists of three main components:
1. A symbolic processing component to handle symbolic representations; in our case, we model explicit
knowledge from domain experts with knowledge graphs.
2. A neural processing component to learn neural representations, EXPLANet: eXplainable Part-based
cLAssifying NETwork architecture. EXPLANet is a compositional deep architecture that classifies
an object through its detected parts.
3. An XAI-informed training procedure, able to guide the model to align its outputs with the symbolic
explanation and to penalize it accordingly when this is not the case. We propose SHAP-Backprop to
align the representations of a deep CNN with the symbolic ones from a knowledge graph, thanks to a
SHAP Attribution Graph (SAG) and a misattribution function.
The selection of these components is designed to enhance a DL model by endowing its output with
explanations at two levels:
• Enhancement of the explanation at inference time: We extend the classifier inference procedure to
not only classify, but also detect what will serve as the basis for the explanation. It should be possible
to specify these components through the symbolic component, e.g., a knowledge graph that
acts as the gold standard explanation from the expert. EXPLANet is proposed here to classify an object
based on the detected object-parts, and thus has the role of facilitating the mapping of neural
representations to symbols.
• Enhancement of the explanation at training time: We penalize the original model in a second
training phase, aimed at improving the original classifier, thanks to an XAI technique called
Shapley analysis [60] that assesses the contribution of each feature to a model output. The SHAP-
Backprop training procedure is presented to adjust the model using a misattribution function that
quantifies the error made when the contribution of features (object-parts) attributed to the output
(expressed in a SHAP Attribution Graph, SAG) does not agree with the theoretical contribution
expressed by the expert knowledge graph.
Together with the X-NeSyL methodology, this paper contributes an explainability metric to evaluate
the interpretability of the model, SHAP GED (SHAP Graph Edit Distance), that measures the degree
of alignment between the symbolic (expert) and neural (machine) representations. The objective of this
metric is to gauge the alignment between the explanation from the model and the explanation from the
human target audience that validates it.
We illustrate the use of X-NeSyL methodology through a guiding use case on monument architectural
style classification and its dataset named MonuMAI [53]. We selected this dataset because it includes
object-part-based annotations which make it suitable for assessing our proposal.
The pipeline components of the X-NeSyL methodology are summarized in Fig. 1. They are meant to
complete a versatile template architecture with pluggable modular components to make possible the fusion
of representations of different nature. X-NeSyL methodology can be adapted to the needs of the use case,
and allows the model to train in a continual learning [57] setting.
The experiments to validate the X-NeSyL methodology make evident the well-known interpretability-
performance trade-off with respect to traditional training, while achieving an improvement of 3.6% over
the state of the art (MonuNet [53]) on the MonuMAI dataset. In terms of explainability, our contributed inter-
pretability metric, SHAP GED, reports a gain of up to 0.38 –from 0.93 to 0.55–. The experimental study
shows that the X-NeSyL methodology makes it possible for CNNs to gain both explainability and performance.
The rest of this paper is organized as follows: First we present the literature around XAI, and
compositional, part-based classifiers in Section 2. We present a set of frameworks on Neural-Symbolic
integration as a basis and promising body of research to attain XAI in Section 3. We describe X-NeSyL
methodology in Section 4. Its core components are presented therein: Section 4.1.1 presents the symbolic
component, i.e., how KGs can be used to represent symbolic expert knowledge to be leveraged by a DL
model; Section 4.2 presents the neural representation component, describing the EXPLANet architecture; and
Section 4.3 the XAI-guided training method SHAP-Backprop. The X-NeSyL methodology is evaluated through
the proposed explainability metric SHAP GED, presented and illustrated in Section 5. The complete
methodology pipeline is illustrated through a driving use case on the MonuMAI cultural heritage application
in Section 6. In Section 7 we discuss results, alternative perspectives, and open research avenues for the
future. Finally, the Appendix includes additional experiments with an extra dataset, PASCAL-Part.
The Explainable AI literature is blooming in parallel with the advances of DL models, and so is the
set of surveys classifying the various methods [5, 14, 38]. We particularly focus on
attribution methods, i.e., XAI methods that relate a particular output of a DL model to its input variables.
These methods can be model agnostic, and often also aim at improving the quality of visualizations such as
heatmaps, saliency maps or class activation maps. In the latter case, attribution studies what part of an
input example is responsible for the network activating in a particular way [69, 5].
Figure 1: Proposed X-NeSyL methodology for eXplainable Neural-Symbolic learning. We illustrate the components of the
methodology with examples of MonuMAI use case processed with EXPLANet part-based object detection and classification model,
knowledge graphs, and SHAP-Backprop training procedure.
This section reviews three types of XAI attribution methods, 1) local explanations, 2) saliency maps
and 3) compositional part-based classification models.
One approach to merge deep representations with symbolic knowledge representation and/or adding
explainability to deep neural networks such as CNNs is through Neural-Symbolic (NeSy) integration.
NeSy integration aims at joining standard symbolic reasoning with neural networks in order to achieve the
best of both fields and soften their limitations. A complete survey of this field is provided in [20, 21].
Indeed, symbolic reasoning is able to work in the presence of scarce data, as it constrains entities through relations.
However, it has limited computational properties and needs background knowledge. On the other hand,
neural networks are fast and able to infer knowledge. However, they require a lot of data and have limited
reasoning properties. Their integration overcomes these limitations and, as stated by [11, 90], improves
the explainability of the learned models. In the following, we present the main NeSy frameworks.
Many NeSy frameworks treat logical rules as constraints to be embedded in a vector space. In most
of the cases these constraints are encoded into the regularization term of the loss function in order to
maximize their own satisfiability. Logic Tensor Networks [81] and Semantic Based Regularization [27]
perform the embedding of First-Order Fuzzy logic constraints. The idea is to jointly maximize both
training data and constraints. Both methods are able to learn in presence of constraints and perform logical
reasoning over data. In Semantic Based Regularization the representations of logical predicates are learnt
by kernel machines, whereas Logic Tensor Networks learn the predicates with tensor networks. Other
differences regard the handling of the existential quantifier: skolemization for Logic Tensor Networks,
or the conjunction of all possible groundings for Semantic Based Regularization. Logic Tensor Networks
have been applied to semantic image interpretation by [30] and to zero-shot learning by [29]. Semantic
Based Regularization has been applied, for example, to the prediction of protein interactions by [76]
and to image classification [27]. Both Logic Tensor Networks and Semantic Based Regularization show
how background knowledge is able to i) improve the results and ii) counterbalance the effect of noisy or
scarce training data. [66] proposed a regularization method for the loss function that leverages adversarial
examples: the method first generates samples that maximize the violation of the constraints, and the
neural network is then optimized to increase their satisfaction. [91] proposed another regularization technique
applied to Semi-Supervised Learning where the regularization term is calculated from the unlabelled
data. In the work of [96], propositional knowledge is injected into a neural network by maximizing the
probability of the knowledge to be true.
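As a concrete illustration of this regularization idea, the following minimal sketch (our own, not taken from any of the frameworks above) shows how a rule of the form "an architectural element is typical of some style" could be turned into a differentiable penalty added to the task loss; the fuzzy semantics (Kleene-Dienes implication with a max s-norm) and all names are assumptions for illustration only.

import torch

def rule_regularizer(style_probs, element_probs, kg_matrix):
    # style_probs:   (batch, n_styles)    softmax output of the style classifier
    # element_probs: (batch, n_elements)  detection confidences in [0, 1]
    # kg_matrix:     (n_elements, n_styles) float tensor, 1.0 where "element isTypicalOf style"
    # Truth of the disjunction of the typical styles, per element (max s-norm).
    masked = style_probs.unsqueeze(1) * kg_matrix.unsqueeze(0)   # (batch, n_elem, n_styles)
    typical_mass = masked.max(dim=2).values                      # (batch, n_elem)
    # Fuzzy implication "element present -> some typical style predicted" (Kleene-Dienes).
    implication = torch.maximum(1.0 - element_probs, typical_mass)
    # Penalty is zero when every rule is fully satisfied.
    return (1.0 - implication).mean()

# Usage: total_loss = task_loss + lambda_kb * rule_regularizer(style_probs, element_probs, kg_matrix)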
Other works use different techniques but keep the idea of defining logical operators in terms of
differentiable functions (e.g., [89, 22]). Relational Neural Machines is a framework developed by [64] that
integrates neural networks with a First-Order Logic reasoner. In the first stage, a neural network computes
the initial predictions for the atomic formulas, whereas, in the second stage, a graphical model represents a
probability distribution over the set of atomic formulas. Another strategy is to directly inject background
knowledge into the neural network structure as done by [19]. Here, the knowledge is injected in the model
by adding new layers to the neural network that encode the fuzzy-logic operator in a differentiable way.
Then, the background knowledge is enforced both at inference and training time. In addition, weights
are assigned to rules as learnable parameters. This allows for dealing with situations where the given
knowledge contains errors or is only softly satisfied by the data, without a priori knowledge about the degree
of satisfaction.
The combination of logic programming with neural networks is another exploited NeSy technique.
Neural Theorem Prover [75] is an extension of the logic programming language Prolog where the crisp
atom unification is softened by using a similarity function over the atoms projected in an embedding space.
Neural Theorem Prover defines a differentiable version of the backward chaining method (used by Prolog)
with the result of learning a latent predicate representation through an optimisation of their distributed
representations. DeepProbLog [62] integrates probabilistic logic programming (ProbLog by [71]) with
(deep) neural networks. In this manner, the explicit expressiveness of logical reasoning is combined with
the abilities of deep nets.
Regarding the latest graph attention mechanisms, an application of NeSy to scene detection using graph
neural networks is presented in [82]. The authors show how triple-based schema representations of relational
(rather than expert) knowledge can be used as an inductive bias to learn better representations for scene
graph, predicate and object classification. The inductive bias of encoding relational prior
knowledge enables its propagation and model fine-tuning with external triple data.
When it comes to learning from both experts and data, theoretical and empirical studies show that it is
more efficient to learn from both than to use the better of the two alone. [10] confirm this by combining
expert knowledge –in the form of marginal probabilities and rules– with empirical data and apply it to
learning the probability distribution of the different combinations of symptoms of a given disease. This
approach is useful in cases when there is not enough data to learn without experts, but enough to correct
them if needed. In the X-NeSyL methodology we will assume this expert ground truth to be the
predominant one, even if not the only one, and thus the one considered as the gold standard. We will
assume it matches the domain experts' knowledge represented in another modality, a symbolic
representation, as an alternative to the traditional use of DL dataset labels.
Finally, knowledge distillation, as used by [43], is another NeSy technique. Here, symbolic knowledge
is extracted from a trained “teacher” network. This knowledge is used as a regularization term for training a
“student” network. The latter emulates the teacher network, whereas the teacher is trained by reducing the
KL-divergence with the student network.
Other systems use external knowledge in the form of linked data or ontologies to link inputs and
outputs to background knowledge by using a symbolic learning system to generate an explanatory theory
[88, 79, 31].
One challenge of the latest DL models today is producing not only accurate but also reliable outputs,
i.e., outputs whose explanations agree with the ground truth, and even better, agree with a human expert
on the subject. The X-NeSyL methodology aims at filling this gap and getting model outputs and expert
explanations to coincide. In order to tackle the concrete problem of fusing DL representations with domain
expert knowledge in the form of knowledge graphs, in this section we present the three main ingredients that
compose the X-NeSyL methodology: 1) the symbolic knowledge representation component, 2) the neural
representation learning component, and 3) the alignment mechanism that aligns both representations, i.e.,
corrects the model during training or penalizes it when it disagrees with the expert knowledge.
First, in Section 4.1 we present the symbolic component that serves to endow the model with inter-
pretability –which will be in the form of knowledge graphs–, then in Section 4.2 the neural representation
learning component –that will serve to reach the best performance– and finally, in Section 4.3 the XAI-
guided training procedure that makes both components align with SHAP-Backprop during training of the
DL model.
4.1. Symbolic knowledge representation for including human experts in the loop
Symbolic AI methods are interpretable and intuitive (e.g. they use rules, language, ontologies, fuzzy
logics, etc.). They are normally used for knowledge representation. Since we advocate leveraging
the best of both the symbolic and the neural representation learning currents, in order to make the latter more
explainable, here we choose a simple form of representing expert knowledge: knowledge graphs.
Right after, in order to demonstrate the practical usage of the X-NeSyL methodology, we present the running
use case that will demonstrate the usage of this methodology throughout the paper.
4.1.2. A driving use case on cultural heritage: MonuMAI architectural style facade image classification
The latest deep learning models have focused on (whole) object classification. We choose part-based
datasets as a straightforward way to leverage extra label information and produce explanations that are
compositional and very close to human reasoning, i.e., explaining a concept or object based on its parts.
In this work, we focus on the MonuMAI (Monument with Mathematics and Artificial
Intelligence) [53] citizen science application and the corresponding dataset collected through it,
because it provides the required compositional labels for an object detection task based on object parts.
At the same time, facade classification by pointing out relevant architectonic elements is an interesting use
case application of XAI. We use this example throughout the article as a guiding application use case that
perfectly serves to demonstrate the usage of our part-based model and pipeline for explainability.
The MonuMAI project has been developed at the University of Granada (Spain) and has involved
citizens in creating and increasing the size of the training dataset through a smartphone app1 .
The MonuMAI dataset
The MonuMAI dataset allows architectural style classification from facade images; it includes
1,092 high-quality photographs, where the monument facade is centered and fills most of the image.
Most images were taken with smartphone cameras thanks to the MonuMAI app. The rest of the images were
selected from the Internet. The dataset was annotated by art experts for two tasks, image classification and
object detection, as shown in Figure 6. All images belong to facades of historical buildings that are labelled
as one out of four different styles (detailed in Table 2 and Table 1): Renaissance, Gothic, Baroque and
Hispanic-Muslim. Besides this image-level label, every image is labeled with key architectural
elements belonging to one of fourteen categories, with a total of 4,583 annotated elements (detailed in
Table 1). Each element is supposed to be typical of one or two styles, and should rarely appear
in facades of the other styles. Examples for each style and each element are in Figs. 2 and 3, while the
MonuMAI dataset labels used are shown in Figs. 4 and 5.
Table 1: Characteristics of the architectural styles dataset, where count is the number of occurrences of an element in the dataset, and
element rate is the ratio between the number of occurrences of an element and the total number of all elements.
Table 2: Characteristics of MonuMAI architectural style classification dataset (#images represents the number of images).
Apart from the MonuMAI dataset, and in order to draw more general conclusions on our work, we used a
dataset with a hierarchy similar to MonuMAI. Additional results for the PASCAL-Part [17] dataset are in the
Appendix.
Figure 6: Illustration of the two annotation levels of architectural style and elements on MonuMAI dataset. This image is labeled as
Hispanic-Muslim and includes different annotated elements, e.g., two lobed arches (Source: [53]).
MonuMAI’s Knowledge Graph
The original design of the MonuMAI dataset and the MonuNet baseline architecture [53] uses the KG exclu-
sively as a design tool to visualize the architectural style of a monument facade based on the identified
parts; the KG is not explicitly used in the model. In contrast, we go further in order to
guarantee a reproducible and explainable decision process that aligns with the expert knowledge. We will
see in Section 4.3.3 how KGs can be used in a detection + classification architecture during training, since
EXPLANet is designed to incorporate the knowledge in the KG. Besides the trust gain, we aim at easing
the understanding of flaws and limitations of the model, along with failure cases. This way, requesting
new data from experts would be backed up by proper explanations and it would be easier to target new
and relevant data collection.
The KG corresponding to MonuMAI dataset has only fourteen object classes and four architectural
styles. Each architectural element is linked to at least one style. Each link between the two sets symbolizes
that an element is typical and expected in the style it is linked to.
• Renaissance: rounded arch, triangular pediment, segmental pediment, porthole, lintelled doorway,
serliana.
• Baroque: rounded arch, lintelled doorway, porthole, broken pediment, solomonic column.
MonuMAI’s KG is depicted in Figure 7, where the root is the Architectural Style class (which inherits
from the Thing top-most class in OWL). Note there is one more dimension in the KG, the leaf level of the
original MonuMAI graph in [53] that represents some characteristics of the architectural elements, but it is
not used in the current work.
We also explored the possibility of rewriting the looser structure captured in the KG as an ontology,
using the OWL2 format. We did not limit ourselves to copying the hierarchy of the original KG, but rather
added some categories to keep the ontology flexible and allow further expansions in the future. Three main
classes are modelled in this ontology: a Facade represents an input image as a concept. A facade is
linked to one and only one2 ArchitecturalStyle through the relation exhibitsArchStyle, for which four styles
can be used (others could be added by defining new classes). A facade can be linked to any number of
ArchitecturalElement instances identified on it through the relation (i.e., OWL object property) hasArchElement.
ArchitecturalElement represents the class of architectural elements identified before, and is divided
into subcategories based on the type of element, such as "Arch" or "Window". This subcategorization,
which does not exist in the original KG, was designed with the possibility of adding constraints between
subcategories, such as an "arch" probably being higher in space than a "column", or at least having its lowest point
higher than a column's lowest point. Such geometrical or spatial constraints were not explored further, as
they would require extra modelling effort from architecture experts, but could easily be added in future work.
Finally, the concept ArchitecturalElement is linked to an ArchitecturalStyle object through the object
property isTypicalOf.
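To make the ontology sketch above more tangible, the snippet below shows how a fragment of it could be serialized with the rdflib library; the namespace IRI, the exact class and property spellings, and the serialization format are our own assumptions and may differ from the actual OWL file.

from rdflib import Graph, Namespace, RDF, RDFS

MAI = Namespace("http://example.org/monumai#")   # hypothetical namespace IRI

g = Graph()
g.bind("mai", MAI)

# The three main classes described above.
for cls in ("Facade", "ArchitecturalStyle", "ArchitecturalElement"):
    g.add((MAI[cls], RDF.type, RDFS.Class))

# A facade exhibiting a style and containing one element typical of two styles.
g.add((MAI.Renaissance, RDF.type, MAI.ArchitecturalStyle))
g.add((MAI.Baroque, RDF.type, MAI.ArchitecturalStyle))
g.add((MAI.RoundedArch, RDF.type, MAI.ArchitecturalElement))
g.add((MAI.facade_001, RDF.type, MAI.Facade))
g.add((MAI.facade_001, MAI.exhibitsArchStyle, MAI.Renaissance))
g.add((MAI.facade_001, MAI.hasArchElement, MAI.RoundedArch))
g.add((MAI.RoundedArch, MAI.isTypicalOf, MAI.Renaissance))
g.add((MAI.RoundedArch, MAI.isTypicalOf, MAI.Baroque))

print(g.serialize(format="turtle"))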
This ontology formulation allows us to see the problem of style classification as a problem of KG edge
detection between a facade instance and a style instance. This approach was unsuccessful (discussed in
Section 6.4).
2 In this study, as in MonuMAI, we represent the predominant one. Future work could consider the blend of more than one present
style.
Figure 7: Simplified MonuMAI knowledge graph constructed based on art historians' expert knowledge [53].
The KG formulation presented in Section 4.1.1 can be seen as a semantic restriction of the ontology we
propose, where we kept only the triples including the isTypicalOf relation and expanded the KG with a virtual
relation isNotTypicalOf, to link together all elements with all the styles. This way the KG is a directed
graph with edges going from the architectural element toward the architectural style. Because we restrict
ourselves to only one relational object property and its inverse, the edges bear either positive or negative
information, which motivates our modeling choice of having value ±1 for the formulated $K_{i,j,k}$ edges.
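The signed-edge reading described above can be materialized as a matrix; the following sketch (our own notation, which drops one index of the $K_{i,j,k}$ formulation) builds it for a subset of elements, using the isTypicalOf lists given earlier for Renaissance and Baroque:

import numpy as np

ELEMENTS = ["rounded arch", "lintelled doorway", "porthole",
            "broken pediment", "solomonic column"]               # subset, for brevity
STYLES = ["Renaissance", "Baroque"]

# isTypicalOf edges taken from the element lists above.
TYPICAL = {
    "Renaissance": {"rounded arch", "lintelled doorway", "porthole"},
    "Baroque": {"rounded arch", "lintelled doorway", "porthole",
                "broken pediment", "solomonic column"},
}

# K[j, k] = +1 for an isTypicalOf edge, -1 for the virtual isNotTypicalOf edge.
K = np.full((len(ELEMENTS), len(STYLES)), -1, dtype=int)
for k, style in enumerate(STYLES):
    for j, element in enumerate(ELEMENTS):
        if element in TYPICAL[style]:
            K[j, k] = 1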
EXPLANet is a two-stage classification model, as depicted in Figure 8. The first stage detects the
object-parts present in the input image and outputs an embedding vector that encodes the importance,
quantity and combinations of the detected object-parts. This information is used by the second stage to
predict the class of the whole object present in the input image. More precisely:
1. The first stage is a detection module, which can be a detector such as Faster R-CNN [72] or
RetinaNet [58]. Let us consider that there are $n$ object-part classes. This module is trained to detect
the key object-part classes present in the input image, and outputs $M$ predicted regions. Each one is
represented by its bounding box coordinates and a vector of size $n$ representing the probability of the
$n$ object-part classes. Let us denote by $p_m \in \mathbb{R}^n$ (with $m \in [1, M]$) the probability vector of detected
region $m$. First, we process each $p_m$ by setting its non-maximal probabilities to zero, denoting
this new score $p'_m$, also a vector of size $n$. Let us denote by $v$ the final descriptor of the
image. We build $v$ by accumulating the probabilities of the $p'_m$ such that:

$$v = \sum_{m=1}^{M} p'_m, \qquad v \in \mathbb{R}^n \quad (1)$$

Vector $v$ aggregates the confidence of each predicted object-part. A large value in $v$ means that the
input image contains a large number of instances of object-part $i$ with high-confidence predictions, whereas a
low value means that predictions had low confidence. Intermediate values are harder to interpret, as
they could come from a small number of high-confidence predictions or a large number of low-confidence
predictions, but the idea is that there are probably some objects of these kinds in the image. This
object-parts vector can be seen as tabular data where each object part can be considered a feature (to
be explained later by an XAI method). We will see in the next section how a SHAP analysis can study
the contribution of each actual object part present in the image to the final object classification.
Note that this aggregation scheme is for Faster R-CNN. For RetinaNet we aggregate by summing all
probabilities (and do not just take the one of the detected object represented by the maximal probability;
we found this to be more stable for training the RetinaNet framework). A minimal sketch of both
aggregation variants is given after this list.
2. The second stage of EXPLANet is a classification network, which is actually a two-layer multi-layer
perceptron (MLP), that uses the embedding information (i.e., takes the previous detector output
as input) to perform the final classification. This stage outputs the final object class based on the
importance of the present key object parts detected in the input image.
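The aggregation sketch referred to in item 1 is given below; the function name and tensor layout are ours, and only the behaviour described above (max-filtered sum for Faster R-CNN, plain sum for RetinaNet) is implemented:

import torch

def aggregate_detections(part_probs, backbone="faster_rcnn"):
    # part_probs: (M, n) tensor with one row of object-part class probabilities per predicted region.
    if backbone == "faster_rcnn":
        # Keep only the maximal probability of each region (p'_m in Eq. 1), zeroing the rest.
        keep = torch.zeros_like(part_probs)
        rows = torch.arange(part_probs.size(0))
        keep[rows, part_probs.argmax(dim=1)] = part_probs.max(dim=1).values
        part_probs = keep
    # RetinaNet variant: sum all probabilities without the max filtering.
    return part_probs.sum(dim=0)      # v, a vector of size n

# Usage: v = aggregate_detections(torch.rand(12, 14))   # 12 regions, 14 part classes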
The goal of this design is to facilitate the reproduction of the thought process of an expert, which is to
first localize and identify key elements (e.g., in the case of architectural style classification of a facade,
various types of arches or columns) and then use this information to deduce the final class (e.g., its overall
style). However, the EXPLANet architecture alone does not enforce alignment with expert knowledge. The next
section introduces the next step of the pipeline, an XAI-based training procedure and loss function to
actually verify that this alignment happens and, when this is not the case, correct the learning.
4.3. SHAP-Backprop: An XAI-informed training procedure and XAI loss based on the SHAP attribution
graph (SAG)
After having presented the symbolic and neural knowledge processing components of X-NeSyL,
we proceed to detail the XAI-informed training procedure that gets the best of both worlds:
interpretable representations and deep representations.
More concretely, this section presents how to use a model-agnostic XAI technique to make a DL
(CNN-based) model more explainable by aligning the test-set feature attribution with the expert theoretical
attribution. Both knowledge bases will be encoded in KGs.
4.3.2. SAG: SHAP Attribution Graph to compute an XAI loss and explainability metric
By measuring how interpretable our model is, in the form of a KG, we want to be able to tell whether the
decision process of our model is similar to how an expert mentally organizes their knowledge. As highlighted
in the previous section, thanks to SHAP we can see how each feature value impacts the predicted macro
label and thus how each part of an object class impacts the predicted label. Based on this, we can create a
SHAP attribution graph (SAG). In this graph, the nodes are the object (macro) labels and the parts, and a
part is linked to a macro label if, according to the SHAP algorithm, it contributed toward predicting
this label.
Building the SAG is a two-step process. First, we extract the feature vector representing the detected
attributes (float values); feature vectors are the output of the aggregation function and are fed to the
classification module, from which we get the predicted label. Then, edges are added according to the SHAP values:
• Given a trained classifier and an image, a positive SHAP value for a detected feature means that it
contributes to predicting this label. We thus add to the SAG an edge representing a present-feature
contribution.
• A negative SHAP value together with a feature value below the threshold $s$ means that this element is
considered typical of this label and its absence is detrimental to the prediction. As such, we link
the object label and the part label in the SAG, as a lacking-feature contribution.
An example of a SAG for the architectural style classification problem is in Fig. 12, where M, R, G and B stand for
Hispanic-Muslim, Renaissance, Gothic and Baroque, respectively. The pseudo-code to generate the SAG
can be found in Algorithm 1.
Algorithm 1 Computes the SHAP attribution graph (SAG) for a given inference sample.
Require: feature_vector, shap_values, Classes, Parts, part detection threshold s
SAG ← {}
for class in Classes do
    local_shap ← shap_values[class]
    for part in Parts do
        feature_val ← feature_vector[part]
        shap_val ← local_shap[part]
        if feature_val > s then
            if shap_val > 0 then
                ADD (part, class) edge to SAG
            end if
        else
            if shap_val < 0 then
                ADD (part, class) edge to SAG
            end if
        end if
    end for
end for
return SAG
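For convenience, the following is a direct Python transcription of Algorithm 1; the dictionary-based data layout and variable names are ours, not the authors':

def build_sag(feature_vector, shap_values, classes, parts, s=0.05):
    # feature_vector: dict part -> aggregated detection score for one image
    # shap_values:    dict class -> dict part -> SHAP value for that image and class
    # Returns the SAG as a set of (part, class) edges.
    sag = set()
    for cls in classes:
        local_shap = shap_values[cls]
        for part in parts:
            feature_val = feature_vector[part]
            shap_val = local_shap[part]
            if feature_val > s:
                if shap_val > 0:          # detected part that supports the class
                    sag.add((part, cls))
            elif shap_val < 0:            # missing part whose absence hurts the class
                sag.add((part, cls))
    return sag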
In practice this allows us to have an empirical attribution graph, the SAG (built at inference time), and a
theoretical attribution graph, the KG (representing prior knowledge). We can then compare the two.
4 Default thresholds used in our case for detection were s = 0.05 for both Faster R-CNN and RetinaNet, as they were found to work well.
4.3.4. $L_{SHAP}$: A new loss function based on SHAP values and a misattribution function $\beta$
Let $N$ be the number of training examples and let $I = (I_1, I_2, \ldots, I_N)$ be the training image examples.
Let $D$ be the detector function such that:
where $BB_i = (BB_1, BB_2, \ldots, BB_{N_i})$ are the bounding boxes detected by $D$, $Conf_i$ the confidence
associated with each predicted box, and $Class_i$ the predicted class of each box. The associated ground-truth
label is used for standard backpropagation, but we will not need it for the weighting process.
Faster R-CNN [72] uses a two-term loss:
5 Anchors are a set of predefined bounding boxes of a certain height and width. These boxes are defined to capture the scale and
aspect ratio of specific objects. Height and width of anchors are a hyperparameter chosen when initializing the network.
Figure 11: Diagram of proposed X-NeSyL methodology focused on the new SHAP-Backprop training procedure (in purple). In
yellow the input data, in green the ground truth elements, in red predicted values (including classes and bounding boxes), and in blue
trainable modules.
where $i$ is the index of the considered image and $k$ the index of a BB predicted within that image.
We now introduce the SHAP values, which are used as a constraining mechanism of our classifier
model to be aligned with expert knowledge provided in the KG. SHAP values are computed after training
the classification model.
Let $S = (S_1, S_2, \ldots, S_N)$ with $S_i = (S_{i,1}, \ldots, S_{i,m})$, where $m$ is the number of different macro labels, and
let $S_{i,k} = (S_{i,k,1}, S_{i,k,2}, \ldots, S_{i,k,l})$ be the SHAP values for training example $i$ and macro object
label $k$, where the last index runs over the detected parts. Each $S_{i,k}$ is thus of size $l$, with $l$ being the number of
parts in the model. Furthermore, due to the nature of the output of the classification model, which consists of
probabilities, and the way SHAP values are computed, they are bounded to be real numbers in $[-1, 1]$.
The KG was already modeled as an attribution graph and corresponding matrix (in order to compute
the embedding out of the KG) in Section 4.3.2 and we will be using the same notation.
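As a hedged sketch of how such SHAP values can be obtained in practice with the shap library (feature_vectors and classifier are placeholder names for the aggregated part descriptors and the trained second-stage MLP; the explainer settings are not necessarily those used by the authors):

import shap

# Reference set used by the model-agnostic Kernel SHAP approximation.
background = shap.sample(feature_vectors, 100)
explainer = shap.KernelExplainer(classifier.predict, background)

# One (N, l) array of SHAP values per macro label, i.e. the S_{i,k} vectors above.
shap_values = explainer.shap_values(feature_vectors)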
Introducing the misattribution function β
To introduce SHAP-Backprop into our training, we first need to be able to compare the SHAP values
with a ground truth, which here is represented by the expert KG. We thus introduce the misattribution
function to assess the level of alignment of the feature attribution SHAP values with the expert KG.
The goal of the misattribution function is to quantitatively compare the SHAP values computed for the
training examples with the KG. For that we assume the SHAP values are computed for all feature vectors.
A misattribution value is then computed for each feature value of each feature vector. Before considering
the definition of the misattribution function, we can distinguish two cases when comparing these two elements,
depending on the feature value observed:
A) The feature value considered is higher than a given hyperparameter $v$, i.e., the positive case. $v$
symbolizes the value above which we consider a part to be detected in our sample image. In our case $v = 0$.
B) The feature value is lower than or equal to $v$: in this case we assume there is no detected part, i.e., the
negative case.
Case A: In the first case, given the KG, for a SHAP attribution to be coherent it should have the
same sign as the KG. If this is the case, the misattribution is 0, i.e., there is no correction to be made and
backpropagated. Otherwise, if it has the opposite sign, the misattribution will depend on the SHAP value. In
particular, it will be proportional to the absolute value of the SHAP attribution. We thus propose the
following misattribution function:
where $N_i$ is the number of BBs predicted in image $I_i$, and $C = (C_1, \ldots, C_N)$ are the ground truth (GT)
labels for the instance images $I = (I_1, I_2, \ldots, I_N)$. We propose two possible loss weighting options, depending
on $h$, a balancing hyperparameter (equal to 1 in our experiments), which can be linear:
or exponential:
with $i$ the index of the considered image, $C_i$ its associated class, $Class_k$ the considered part class,
$KG$ the knowledge graph and $S$ the SHAP values. Either way, $\alpha$ is equal to 1 when the misattribution is 0, in order to
maintain the value of the original loss function. Thus, $\alpha \in [1, \infty)$: the larger the misattribution, the larger
the penalization.
- Instance-level weighting of the $L_{SHAP}$ loss
This second weighted loss is at the instance level, meaning we are weighting all the BBs for a given
dataset instance with the same value:

$$L_{SHAP} = \sum_{i=1}^{N} \alpha_{instance}(S, KG, i, C_i) \sum_{k=1}^{N_i} L\big(I_i, (BB_k, Conf_k, Class_k)\big) \quad (9)$$

where

$$\alpha_{instance}(S, KG, i, k) = \max_{j \in [1, l]} \big(\alpha_{BBox}(S, KG, i, k, j)\big) \quad (10)$$
i.e., the instance-level weighting of the loss function takes the maximum BBox misattribution value over the parts.
Just as for the BB-level weighting, the aggregation of terms in the misattribution function can be either linear
or exponential; a hedged sketch of this weighting is given below.
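Since the misattribution and weighting equations are not reproduced above, the sketch below encodes one consistent reading of the prose: zero misattribution when the SHAP sign agrees with the KG, $|SHAP|$ otherwise (with the present/absent cases split by $v$), and a linear weight $\alpha = 1 + h\,\beta$ so that $\alpha \in [1, \infty)$; the authors' exact formulas may differ.

import numpy as np

def misattribution(shap_ik, features_i, kg_col_k, v=0.0):
    # shap_ik:    (l,) SHAP values S_{i,k,.} for image i and macro label k
    # features_i: (l,) aggregated part descriptor of image i
    # kg_col_k:   (l,) KG column for label k, +1 (typical) or -1 (not typical)
    beta = np.zeros_like(shap_ik)
    for j in range(len(shap_ik)):
        expected_sign = kg_col_k[j] if features_i[j] > v else -kg_col_k[j]
        if np.sign(shap_ik[j]) not in (0.0, expected_sign):
            beta[j] = abs(shap_ik[j])     # incoherent attribution: proportional to |SHAP|
    return beta

def alpha_instance(shap_ik, features_i, kg_col_k, h=1.0):
    # Linear instance-level weight, Eq. (10)-style max over parts; equals 1 when coherent.
    return 1.0 + h * misattribution(shap_ik, features_i, kg_col_k).max()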
5. X-NeSyL methodology Evaluation: SHAP GED metric to report model explainability for end-
user and domain expert audiences
The detection and classification modules of EXPLANet use mAP and accuracy, respectively, as standard
evaluation metrics. In order to evaluate the explainability of the model in terms of alignment with the KG,
we propose the use of the SHAP Graph Edit Distance (SHAP GED) at test time. This metric has a well-
defined target audience: the end-user (in our case, of a citizen science application) and domain experts (art
historians), i.e., users who do not necessarily have a technical background.
Even if the SAG above can be computed for any pair of theoretical and empirical feature attribution
sets, we are interested in using the GT KG in order to compute an explainability score on a test set.
The simplest way to compare two graphs is to apply the GED [77]. Using the GED directly between
the KG and the SAG does not work very well, since the number of detected object parts (architectural elements in
our case) varies too much from one image to another. What we do instead is to compare the SAG to the
projection of the KG given the nodes present in the SAG. More precisely, given a SAG, we compute a new
graph from the KG by taking the subgraph of the KG that only contains the nodes in the SAG. As such,
the two graphs will have the same nodes, but the projection may add new edges.
An example of such projection can be seen in Figure 12 (right). This way, the projection serves to only
compute the relevant information given a specific image.
Once the SHAP-Backprop procedure has penalized the misalignment of object parts with those of the KG
(detailed in Section 4.3), we use the SAG to compute the SHAP GED between the SAG and its
projection in the KG. This procedure basically translates into counting the number of "wrong" edges in the
SAG given the reference KG, i.e., the object parts that should not be present in this data point given the
predicted object label.
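One possible implementation of this comparison with networkx is sketched below (edge lists and names are ours; the exact graph edit distance is only practical here because both graphs are very small):

import networkx as nx

def shap_ged(sag_edges, kg_edges):
    # sag_edges / kg_edges: lists of (part, style) tuples.
    sag = nx.DiGraph(sag_edges)
    kg = nx.DiGraph(kg_edges)
    # Projection of the KG onto the nodes present in the SAG.
    projection = nx.DiGraph(kg.subgraph(sag.nodes))
    projection.add_nodes_from(sag.nodes)   # keep nodes missing from the KG as isolated nodes
    return nx.graph_edit_distance(sag, projection)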
After detailing all necessary components to run the full pipeline of the X-NeSyL methodology, together
with an evaluation metric that facilitates the full process evaluation, we are in a position to set up an experimental
study to validate each component. It is worth noting that each component can be adapted to each use
case. The next section's experiments will demonstrate, with a real-life dataset, how the X-NeSyL methodology
can facilitate the learning of explainable features by fusing the information from deep and symbolic
representations.

Table 3: Feature vector of a sample image and its SHAP analysis used for the construction of the SAG in Fig. 12, according to
Algorithm 1.

Element | Hisp.-Muslim | Gothic | Renaissance | Baroque | Feature vector
Horseshoe arch | -0.16 | 0 | 0.08 | 0.03 | 0
Lobed arch | 0 | 0 | 0 | 0 | 0
Flat arch | 0 | 0 | 0 | 0 | 0
Pointed arch | 0 | -0.15 | 0.07 | 0.04 | 0
Ogee arch | 0 | -0.08 | 0.01 | 0 | 0
Trefoil arch | 0 | 0 | 0.04 | 0 | 0.2
Triangular pediment | 0 | 0 | 0 | 0.06 | 0
Segmental pediment | 0 | 0 | 0 | 0 | 0
Serliana | 0 | 0 | 0 | 0 | 0
Rounded arch | 0 | 0 | 0 | 0.03 | 1.35
Lintelled doorway | 0 | 0 | 0 | 0 | 0
Porthole | 0 | 0 | 0 | 0 | 0
Broken pediment | 0 | 0 | 0.14 | -0.16 | 0
Solomonic column | 0 | 0 | 0.04 | 0 | 0

Figure 12: SHAP attribution graph (SAG, left) and the projection of the KG given the nodes present in the SAG (right).
$$\text{accuracy} = \frac{\#\,\text{correct predictions}}{\#\,\text{total predictions}} \quad (11)$$

where $\#$ represents the number of correct and total predictions, respectively. To evaluate the detection performance,
we use the standard mean average precision metric, mAP (Eq. 12):

$$\text{mAP} = \frac{\sum_{i=1}^{K} AP_i}{K}, \qquad AP_i = \frac{1}{10} \sum_{r \in [0.5, \ldots, 0.95]} \int_0^1 p(r)\, dr \quad (12)$$

where, given $K$ categories of elements, precision $p$ and recall $r$ define $p(r)$, the interpolated
precision-recall curve, whose area under the curve gives the average precision $AP_i$ for class $i$.
We initialized Faster R-CNN [72] and RetinaNet with weights pre-trained on MS-COCO [59] and
then fine-tuned both detection architectures on the target datasets, i.e., MonuMAI or PASCAL-Part. The
last two layers of Faster R-CNN were fine-tuned on the target dataset. As optimization method, we used
Stochastic Gradient Descent (SGD) with a learning rate of 0.0003 and a momentum of 0.1. We use the Faster
R-CNN implementation provided by PyTorch.
For the classification module, we also fine-tuned the two-layer MLP with 11 intermediate neurons. We
used the Adam [51] optimizer provided by Keras.
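A minimal sketch of that classification module is shown below; the input size (14 part classes), the 11 intermediate neurons, the 4 output styles and the Adam optimizer follow the text, while the activation functions and the loss are our own assumptions:

from tensorflow import keras

classifier = keras.Sequential([
    keras.layers.Dense(11, activation="relu", input_shape=(14,)),   # 14-dim part descriptor v
    keras.layers.Dense(4, activation="softmax"),                    # 4 architectural styles
])
classifier.compile(optimizer=keras.optimizers.Adam(),
                   loss="categorical_crossentropy",
                   metrics=["accuracy"])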
To perform an ablation study on the element or part-based detector, the original dataset is split into
three categories (train, validation and test), following a 60/20/20 split. Reported results are computed on
the test set.
The compositional part-based object classification with RetinaNet is trained in two phases. First, the
detector is trained by fine-tuning a RetinaNet-50 pretrained on MS COCO. We use the Adam optimizer with a
starting learning rate (LR) of 0.00001 and a reduce-on-plateau LR scheduler with a patience
of 4 (footnote 6). We train this way for 50 epochs. Then we freeze all the detection weights and train only the
classification module. We use the Adam optimizer with a starting LR of 0.001 and a reduce-on-plateau LR scheduler
with a patience of 4. We train this way for 25 epochs.
Even if our objective was a fully end-to-end training, the need for quite different LRs between
the detection and classification modules led us, for the moment, to train the two modules separately for convenience.
6 Patience is the number of epochs taken into account for the scheduler to decide the network converged. Here, the last four.
Table 4: Results of two backbone variants of EXPLANet (using object detector Faster R-CNN and RetinaNet) on MonuMAI dataset,
and comparison with embedded version of the baseline model MonuNet, and a vanilla classifier baseline with ResNet. (EXPLANet
versions use the standard procedure, no SHAP-Backprop).
MonuNet [53], the baseline provided by MonuMAI dataset authors, is an architecture designed for
being used in mobile devices in real time. Because of its compressed design targeting embedded systems,
its performance is not fully comparable with EXPLANet. However, we report it for reference, as it is the
only previous model trained on the novel MonuMAI dataset to date, to the best of our knowledge.
The result of the ablation study assessing the impact of the object detector on EXPLANet is in Table
4. We obtain basically the same accuracy. Even if, on its own, the RetinaNet model is slightly superior
to Faster R-CNN, the explanation for the worse default results when using EXPLANet
with RetinaNet instead of with Faster R-CNN seems to be 1) the hyperparameter choice, since Faster R-CNN
uses pretraining on MS-COCO while RetinaNet uses pretraining on ImageNet, and 2) the fact that the coarse-grained
MonuMAI dataset and the fine-grained PASCAL-Part are of different nature in terms of the overlap among
part classes.
Due to its simpler nature, RetinaNet is faster to train than Faster R-CNN8 .
We can see the confusion matrix computed on MonuMAI for EXPLANet, using both Faster R-CNN
and RetinaNet object detectors as backbones, in Fig. 13.
Overall, both part-based models outperform the regular classification for MonuMAI, which means
that the more accurate the classification model is, the more interpretable (lower SHAP GED) it becomes.
Although it may seem intuitive that a better detector (better mAP) should encourage a better GED, this is not
to be expected, because the mAP evaluates the spatial location and the presence or absence of a descriptor
(object part), while the GED evaluates only the presence. Moreover, the lack of correlation between mAP and
GED is reasonable, especially because mAP evaluates only one part of the model (the detector), and thus it
makes more sense that accuracy correlates with SHAP GED, as our results show.
7 Since MonuMaiKET detector and MonuNet classifier are not connected, MonuNet does not provide object detection.
8 We use the RetinaNet implementation from https://github.com/yhenon/pytorch-retinanet. Ease of use stems from the fact that if
we wanted to modify the aggregation function, whether its analytical form or the point at the end of the detector at which we should attach the
classifier, it would be much simpler.
Figure 13: Confusion matrix of EXPLANet using Faster R-CNN (left) and EXPLANet using RetinaNet (right) on MonuMAI dataset.
The object-part detector module of the MonuNet baseline, i.e., the MonuMAIKET detector (based on a Faster
R-CNN detector with a ResNet101 backbone), reaches slightly higher performance. We attribute this
minor difference to the different TensorFlow and PyTorch default implementations of Faster R-CNN's
inherent ResNet module versions in MonuNet and EXPLANet, respectively. Furthermore, EXPLANet
with RetinaNet outperforms EXPLANet with Faster R-CNN interpretability-wise. This probably
stems from the slightly different object-part aggregation functions.
In the Faster R-CNN version of EXPLANet, only the probability for the highest-scoring label is kept,
whereas when RetinaNet is used as part detector, all the scores for each example are aggregated,
here with the sum function. This way RetinaNet is probably more robust to low-score features, as
it always observes several of them for each example.
Table 5: Impact of the new SHAP-Backprop training procedure on the mAP (detection), accuracy (classification) and SHAP GED
(interpretability). Results on MonuMAI dataset with EXPLANet architecture. SHAP GED is computed between the SAG and
its projection in the expert KG. (Standard procedure: sequential typical pipeline of 1) train detector, 2) train classifier with no
SHAP-Backprop).
stochastic nature of the training process, since the SHAP value computation is approximated using a
random subset of reference examples to be more efficient in computation time.
On the other side, in terms of interpretability, we do obtain a sensible improvement (reducing the SHAP GED
from 0.93 to 0.55) in the case of linear instance weighting. The gain obtained in both dimensions is large
enough to conclude that the X-NeSyL methodology helped improve interpretability in terms of SHAP
GED with respect to the expert KG.
With the presented work we open up different research horizons and future avenues of work that we
detail in this section.
We extensively considered what is one of the most crucial points to be addressed while developing
XAI methods. Within the general needs for producing more trustworthy outputs, we tackled the challenge
of fusion and alignment of deep learning representations with domain expert knowledge. To achieve
this we proposed a new methodology, X-NeSyL, to fuse deep and symbolic representations thanks to an
explainability feedback mechanism that facilitates the alignment of both deep and symbolic features. The
part-based detection and classification architecture, EXPLANet, and the XAI-informed training procedure, SHAP-Backprop,
leverage expert information in the form of a knowledge graph. X-NeSyL could be seen as one way to attain
explainable and theory-driven data science [5].
We demonstrated the full pipeline of X-NeSyL methodology on MonuMAI and PASCAL-Part datasets,
and the EXPLANet model with two variants of object detectors. The fusion of learned representations of
different nature, through the addition of an XAI component, enables the model to learn with a
human expert in the loop.
X-NeSyL methodology was also validated through a contributed audience-specific explainability metric,
SHAP GED, that quantifies the alignment of the X-NeSyL methodology neural model (EXPLANet) with
the symbolic representation of the expert knowledge. All models, datasets, training pipeline and metric of
the showcased X-NeSyL methodology are available online9 . This approach targeted compositional object
recognition based on explaining the whole through its object-parts in deep architectures. However, other
non-compositional semantic properties of description logics could be further modelled in order to assess,
and further constrain, the level of alignment of a DL model with symbolic knowledge representing the
expert.
Given the diverse contributions of this work, there is a broad set of options that can follow up to
improve X-NeSyL methodology. In terms of evaluation of our work, the assessment was limited by the
number of available datasets that contain part-based data, which is not large, since they must include a
corresponding KG as well.
The explainability metric may be refined, since the proposed vanilla version of SHAP GED might
not take into account all explainable factors an expert would like to see reflected in a black box model
explanation. Future work includes assessing whether the SHAP GED metric itself is the most suitable graph
comparison metric, and including more elaborate datasets with finer-grained object-part labels.
Acknowledgement(s)
This research was funded by the French ANRT (Association Nationale Recherche Technologie - ANRT)
industrial Cifre PhD contract with SEGULA Technologies. The paper has been partially supported by the
Andalusian Excellence project P18-FR-4961. S. Tabik was supported by the Ramon y Cajal Programme
(RYC-2015-18136).
References
[1] The Description Logic Handbook: Theory, Implementation and Applications. Cambridge University
Press, 2 edition, 2007.
[2] Radhakrishna Achanta, Appu Shaji, Kevin Smith, Aurelien Lucchi, Pascal Fua, and Sabine
Süsstrunk. SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Transac-
tions on Pattern Analysis and Machine Intelligence, 34(11):2274–2282, 2012.
[3] Julius Adebayo, Justin Gilmer, Michael Muelly, Ian Goodfellow, Moritz Hardt, and Been Kim.
Sanity checks for saliency maps. In Proceedings of the International Conference on Neural
Information Processing Systems, pages 9505–9515, 2018.
[4] Jacob Andreas. Measuring compositionality in representation learning. arXiv preprint
arXiv:1902.07181, 2019.
[5] Alejandro Barredo Arrieta, Natalia Dı́az-Rodrı́guez, Javier Del Ser, Adrien Bennetot, Siham Tabik,
Alberto Barbado, Salvador Garcı́a, Sergio Gil-López, Daniel Molina, Richard Benjamins, et al.
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges
toward responsible AI. Information Fusion, 58:82–115, 2020.
[6] Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller,
and Wojciech Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise
relevance propagation. PloS one, 10(7), 2015.
[7] Adrien Bennetot, Jean-Luc Laurent, Raja Chatila, and Natalia Dı́az-Rodrı́guez. Towards explainable
neural-symbolic visual reasoning. In Proceedings of the Neural-symbolic learning and Reasoning
Workshop, NeSy-2019 at International Joint Conference on Artificial Intelligence (IJCAI), Macau,
China, 2019.
[8] Elliot Joel Bernstein and Yali Amit. Part-based statistical models for object classification and
detection. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition
(CVPR’05), volume 2, pages 734–740. IEEE, 2005.
[9] Tarek R Besold, Artur d’Avila Garcez, Sebastian Bader, Howard Bowman, Pedro Domingos,
Pascal Hitzler, Kai-Uwe Kühnberger, Luis C Lamb, Daniel Lowd, Priscila Machado Vieira Lima,
et al. Neural-symbolic learning and reasoning: A survey and interpretation. arXiv preprint
arXiv:1711.03902, 2017.
[10] Rémi Besson, Erwan Le Pennec, and Stéphanie Allassonnière. Learning from both experts and data.
Entropy, 21(12):1208, 2019.
[11] Federico Bianchi, Matteo Palmonari, Pascal Hitzler, and Luciano Serafini. Complementing logical
reasoning with sub-symbolic commonsense. In RuleML+RR, volume 11784 of Lecture Notes in
Computer Science, pages 161–170. Springer, 2019.
[12] Alexander Binder, Sebastian Bach, Gregoire Montavon, Klaus-Robert Müller, and Wojciech Samek.
Layer-wise relevance propagation for deep neural network architectures. In Information Science
and Applications (ICISA) 2016, pages 913–922. Springer, 2016.
[13] Kurt Bollacker, Natalia Dı́az-Rodrı́guez, and Xian Li. Extending knowledge graphs with subjective
influence networks for personalized fashion. In Designing Cognitive Cities, pages 203–233. Springer,
2019.
[14] Vanessa Buhrmester, David Münch, and Michael Arens. Analysis of explainers of black box deep
neural networks for computer vision: A survey. arXiv preprint arXiv:1911.12116, 2019.
[15] Valentina Anita Carriero, Aldo Gangemi, Maria Letizia Mancinelli, Ludovica Marinucci, An-
drea Giovanni Nuzzolese, Valentina Presutti, and Chiara Veninata. ArCo: The Italian cultural
heritage knowledge graph. In International Semantic Web Conference, pages 36–52. Springer, 2019.
[16] Aditya Chattopadhay, Anirban Sarkar, Prantik Howlader, and Vineeth N Balasubramanian. Grad-
CAM++: Improved Visual Explanations for Deep Convolutional Networks. In 2018 IEEE Winter
Conference on Applications of Computer Vision (WACV), pages 839–847. IEEE, 2018.
[17] Xianjie Chen, Roozbeh Mottaghi, Xiaobai Liu, Sanja Fidler, Raquel Urtasun, and Alan Yuille.
Detect what you can: Detecting and representing objects using holistic models and body parts.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages
1971–1978, 2014.
[18] Roberto Confalonieri, Tillman Weyde, Tarek R Besold, and Fermı́n Moscoso del Prado Martı́n.
Using ontologies to enhance human understandability of global post-hoc explanations of black-box
models. Artificial Intelligence, page 103471, 2021.
[19] Alessandro Daniele and Luciano Serafini. Neural networks enhancement through prior logical
knowledge. CoRR, abs/2009.06087, 2020.
[20] Artur S. d’Avila Garcez, Marco Gori, Luı́s C. Lamb, Luciano Serafini, Michael Spranger, and
Son N. Tran. Neural-symbolic computing: An effective methodology for principled integration of
machine learning and reasoning. Journal of Applied Logics - IfCoLog Journal of Logics and their
Applications (FLAP), 6(4):611–632, 2019.
[21] Artur S. d’Avila Garcez, Luı́s C. Lamb, and Dov M. Gabbay. Neural-Symbolic Cognitive Reasoning.
Cognitive Technologies. Springer, 2009.
[22] Artur S. d’Avila Garcez and Gerson Zaverucha. The connectionist inductive learning and logic
programming system. Applied Intelligence, 11(1):59–77, 1999.
[23] R De Kok, T Schneider, and U Ammer. Object-based classification and applications in the alpine
forest environment. International Archives of Photogrammetry and Remote Sensing, 32(Part 7):4–3,
1999.
[24] Tim Dettmers, Pasquale Minervini, Pontus Stenetorp, and Sebastian Riedel. Convolutional 2D
knowledge graph embeddings. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence,
pages 1811–1818, February 2018.
[25] Natalia Dı́az-Rodrı́guez, Aki Härmä, Rim Helaoui, Ignacio Huitzil, Fernando Bobillo, and Umberto
Straccia. Couch potato or gym addict? semantic lifestyle profiling with wearables and fuzzy
knowledge graphs. In 6th Workshop on Automated Knowledge Base Construction, AKBC@NIPS
2017, Long Beach, California, 2017.
[26] Natalia Dı́az-Rodrı́guez and Galena Pisoni. Accessible cultural heritage through explainable
artificial intelligence. In Adjunct Publication of the 28th ACM Conference on User Modeling,
Adaptation and Personalization, pages 317–324, 2020.
[27] Michelangelo Diligenti, Marco Gori, and Vincenzo Scoca. Learning efficiently in semantic based
regularization. In ECML/PKDD (2), volume 9852 of Lecture Notes in Computer Science, pages
33–46. Springer, 2016.
[28] Ivan Donadello and Luciano Serafini. Integration of numeric and symbolic information for semantic
image interpretation. Intelligenza Artificiale, 10(1):33–47, 2016.
[29] Ivan Donadello and Luciano Serafini. Compensating supervision incompleteness with prior knowl-
edge in semantic image interpretation. In Proceedings of the International Joint Conference on
Neural Networks (IJCNN), pages 1–8. IEEE, 2019.
[30] Ivan Donadello, Luciano Serafini, and Artur D’Avila Garcez. Logic tensor networks for semantic
image interpretation. Proceedings of the Twenty-Sixth International Joint Conference on Artificial
Intelligence, IJCAI, pages 1596–1602, 2017.
[31] Monireh Ebrahimi, Aaron Eberhart, Federico Bianchi, and Pascal Hitzler. Towards bridging the
neuro-symbolic gap: deep deductive reasoners. Applied Intelligence, pages 1–23, 2021.
[32] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman.
The PASCAL Visual Object Classes (VOC) Challenge. International Journal of Computer Vision,
88(2):303–338, 2010.
[33] Pedro F Felzenszwalb, Ross B Girshick, David McAllester, and Deva Ramanan. Object detection
with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 32(9):1627–1645, 2009.
[34] Jerry A Fodor and Ernest Lepore. The compositionality papers. Oxford University Press, 2002.
[35] Ruth C Fong and Andrea Vedaldi. Interpretable explanations of black boxes by meaningful
perturbation. In IEEE International Conference on Computer Vision, pages 3429–3437, 2017.
[36] Weifeng Ge, Xiangru Lin, and Yizhou Yu. Weakly supervised complementary parts models for
fine-grained image classification from the bottom up. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pages 3034–3043, 2019.
[37] Lise Getoor and Christopher P Diehl. Link mining: a survey. ACM SIGKDD Explorations
Newsletter, 7(2):3–12, 2005.
[38] Riccardo Guidotti, Anna Monreale, Salvatore Ruggieri, Franco Turini, Fosca Giannotti, and Dino
Pedreschi. A survey of methods for explaining black box models. ACM computing surveys (CSUR),
51(5):1–42, 2018.
[39] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image
recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
pages 770–778, 2016.
[40] Xuehai He, Xingyi Yang, Shanghang Zhang, Jinyu Zhao, Yichen Zhang, Eric Xing, and Pengtao
Xie. Sample-Efficient Deep Learning for COVID-19 Diagnosis Based on CT Scans. medRxiv, 2020.
[41] Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the dimensionality of data with neural
networks. Science, 313(5786):504–507, 2006.
[42] Andreas Holzinger, Markus Plass, Michael Kickmeier-Rust, Katharina Holzinger, Gloria Cerasela
Crişan, Camelia-M Pintea, and Vasile Palade. Interactive machine learning: experimental evidence
for the human in the algorithmic loop. Applied Intelligence, 49(7):2401–2414, 2019.
[43] Zhiting Hu, Xuezhe Ma, Zhengzhong Liu, Eduard H. Hovy, and Eric P. Xing. Harnessing deep
neural networks with logic rules. In ACL (1). The Association for Computer Linguistics, 2016.
[44] Daniel Huber, Anuj Kapuria, Raghavendra Donamukkala, and Martial Hebert. Parts-based 3d object
classification. In Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision
and Pattern Recognition. CVPR 2004., volume 2, pages II–II. IEEE, 2004.
[45] Ignacio Huitzil, Umberto Straccia, Natalia Dı́az-Rodrı́guez, and Fernando Bobillo. Datil: learning
fuzzy ontology datatypes. In International Conference on Information Processing and Management
of Uncertainty in Knowledge-Based Systems, pages 100–112. Springer, 2018.
[46] Dieuwke Hupkes, Verna Dankers, Mathijs Mul, and Elia Bruni. The compositionality of neural
networks: integrating symbolism and connectionism. arXiv preprint arXiv:1908.08351, 2019.
[47] Sarthak Jain and Byron C. Wallace. Attention is not explanation. In Annual Conference of the North
American Chapter of the Association for Computational Linguistics (NAACL-HLT), 2019.
[48] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-
Fei. Large-scale video classification with convolutional neural networks. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, pages 1725–1732, 2014.
[49] Pieter-Jan Kindermans, Sara Hooker, Julius Adebayo, Maximilian Alber, Kristof T Schütt, Sven
Dähne, Dumitru Erhan, and Been Kim. The (un) reliability of saliency methods. In Explainable AI:
Interpreting, Explaining and Visualizing Deep Learning, pages 267–280. Springer, 2019.
[50] Pieter-Jan Kindermans, Kristof T Schütt, Maximilian Alber, Klaus-Robert Müller, Dumitru Er-
han, Been Kim, and Sven Dähne. Learning how to explain neural networks: PatternNet and
PatternAttribution. In 6th International Conference on Learning Representations, ICLR, 2018.
[51] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International
Conference on Learning Representations, ICLR, 2015.
[52] Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. Human-level concept learning
through probabilistic program induction. Science, 350(6266):1332–1338, 2015.
[53] Alberto Lamas, Siham Tabik, Policarpo Cruz, Rosana Montes, Álvaro Martı́nez-Sevilla, Teresa
Cruz, and Francisco Herrera. MonuMAI: Dataset, deep learning pipeline and citizen science based
app for monumental heritage taxonomy and classification. Neurocomputing, 420:266–280, 2020.
[54] Yann Le Cun, Bernhard Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne
Hubbard, and Lawrence D Jackel. Handwritten digit recognition with a back-propagation network.
In Proceedings of the 2nd International Conference on Neural Information Processing Systems,
pages 396–404, 1989.
[55] Freddy Lecue. On the role of knowledge graphs in explainable AI. Semantic Web, 11(1):41–51,
2020.
[56] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444,
2015.
[57] Timothée Lesort, Vincenzo Lomonaco, Andrei Stoian, Davide Maltoni, David Filliat, and Natalia
Dı́az-Rodrı́guez. Continual learning for robotics: Definition, framework, learning strategies,
opportunities and challenges. Information fusion, 58:52–68, 2020.
[58] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense
object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages
2980–2988, 2017.
[59] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr
Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In European
Conference on Computer Vision, pages 740–755. Springer, 2014.
[60] Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In
Proceedings of the International Conference on Neural Information Processing Systems, pages
4765–4774, 2017.
[61] Gianluca Maguolo and Loris Nanni. A critic evaluation of methods for COVID-19 automatic
detection from x-ray images. arXiv preprint arXiv:2004.12823, 2020.
[62] Robin Manhaeve, Sebastijan Dumancic, Angelika Kimmig, Thomas Demeester, and Luc De Raedt.
DeepProbLog: Neural probabilistic logic programming. In Proceedings of the International
Conference on Neural Information Processing Systems, volume 31, pages 3749–3759, 2018.
[63] Jiayuan Mao, Chuang Gan, Pushmeet Kohli, Joshua B Tenenbaum, and Jiajun Wu. The neuro-
symbolic concept learner: Interpreting scenes, words, and sentences from natural supervision. arXiv
preprint arXiv:1904.12584, 2019.
[64] Giuseppe Marra, Michelangelo Diligenti, Francesco Giannini, Marco Gori, and Marco Maggini.
Relational neural machines. In ECAI, volume 325 of Frontiers in Artificial Intelligence and
Applications, pages 1340–1347. IOS Press, 2020.
[65] Koji Maruhashi, Masaru Todoriki, Takuya Ohwa, Keisuke Goto, Yu Hasegawa, Hiroya Inakoshi,
and Hirokazu Anai. Learning multi-way relations via tensor decomposition with neural networks.
In Proceedings of the AAAI Conference on Artificial Intelligence, 2018.
[66] Pasquale Minervini and Sebastian Riedel. Adversarially regularising neural NLI models to integrate
logical background knowledge. In Proceedings of the SIGNLL Conference on Computational
Natural Language Learning (CoNLL 2018), pages 65–74. Association for Computational Linguistics,
2018.
[67] Christoph Molnar. Interpretable Machine Learning. Lulu.com, 2020.
[68] Grégoire Montavon, Sebastian Lapuschkin, Alexander Binder, Wojciech Samek, and Klaus-Robert
Müller. Explaining nonlinear classification decisions with deep taylor decomposition. Pattern
Recognition, 65:211–222, 2017.
[69] Chris Olah, Alexander Mordvintsev, and Ludwig Schubert. Feature visualization. Distill, 2(11):e7,
2017. https://distill.pub/2017/feature-visualization.
[70] Galena Pisoni, Natalia Dı́az-Rodrı́guez, Hannie Gijlers, and Linda Tonolli. Human-centred artificial
intelligence for designing accessible cultural heritage. Applied Sciences, 11(2):870, 2021.
[71] Luc De Raedt, Angelika Kimmig, and Hannu Toivonen. ProbLog: A Probabilistic Prolog and
Its Application in Link Discovery. In Proceedings of the 20th International Joint Conference on
Artificial Intelligence, pages 2462–2467, 2007.
[72] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object
detection with region proposal networks. In Proceedings of the International Conference on Neural
Information Processing Systems, pages 91–99, 2015.
[73] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why should I trust you?" Explaining the
predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, pages 1135–1144, 2016.
[74] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. Anchors: High-precision model-agnostic
explanations. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 1527–1535,
2018.
[75] Tim Rocktäschel and Sebastian Riedel. Learning knowledge base inference with neural theorem
provers. In Proceedings of the Automatic Knowledge Base Construction AKBC Workshop @ The
Annual Conference of the North American Chapter of the Association for Computational Linguistics
(NAACL-HLT), pages 45–50. The Association for Computer Linguistics, 2016.
[76] Claudio Saccà, Stefano Teso, Michelangelo Diligenti, and Andrea Passerini. Improved multi-level
protein-protein interaction prediction with semantic-based regularization. BMC Bioinformatics,
15:103, 2014.
[77] Alberto Sanfeliu and King-Sun Fu. A distance measure between attributed relational graphs for
pattern recognition. IEEE Transactions on Systems, Man, and Cybernetics, pages 353–362, 1983.
[78] Md Kamruzzaman Sarker and Pascal Hitzler. Efficient concept induction for description logics. In
Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3036–3043, 2019.
[79] Md Kamruzzaman Sarker, Joshua Schwartz, Pascal Hitzler, Lu Zhou, Srikanth Nadella, Brandon
Minnery, Ion Juvina, Michael L Raymer, and William R Aue. Wikipedia Knowledge Graph for
Explainable AI. In Iberoamerican Knowledge Graphs and Semantic Web Conference, pages 72–87.
Springer, 2020.
[80] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh,
and Dhruv Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localiza-
tion. In Proceedings of the IEEE International Conference on Computer Vision, pages 618–626,
2017.
[81] Luciano Serafini and Artur S. d’Avila Garcez. Learning and reasoning with logic tensor networks.
In AI*IA, volume 10037 of Lecture Notes in Computer Science, pages 334–348. Springer, 2016.
[82] Sahand Sharifzadeh, Sina Moayed Baharlou, and Volker Tresp. Classification by attention: Scene
graph classification with prior knowledge. In Proceedings of the 35th AAAI Conference on Artificial
Intelligence, 2020.
[83] Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Learning important features through
propagating activation differences. In Proceedings of the 34th International Conference on Machine
Learning - Volume 70, pages 3145–3153. JMLR.org, 2017.
[84] Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. Striving for
simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806, 2015.
[85] Austin Stone, Huayan Wang, Michael Stark, Yi Liu, D Scott Phoenix, and Dileep George. Teaching
compositionality to CNNs. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pages 5058–5067, 2017.
[86] Fabian M Suchanek, Jonathan Lajus, Armand Boschin, and Gerhard Weikum. Knowledge represen-
tation and rule mining in entity-centric knowledge bases. In Reasoning Web. Explainable Artificial
Intelligence, pages 110–152. Springer, 2019.
[87] Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. In
Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 3319–3328.
JMLR.org, 2017.
[88] Ilaria Tiddi, Freddy Lécué, and Pascal Hitzler. Knowledge Graphs for Explainable Artificial
Intelligence: Foundations, Applications and Challenges, volume 47. IOS Press, 2020.
[89] Geoffrey G. Towell and Jude W. Shavlik. Knowledge-based artificial neural networks. Artificial
Intelligence, 70(1-2):119–165, 1994.
[90] Joseph Townsend, Thomas Chaton, and João M. Monteiro. Extracting relational explanations from
deep neural networks: A survey from a neural-symbolic perspective. IEEE Trans. Neural Networks
Learn. Syst., 31(9):3456–3470, 2020.
[91] Emile van Krieken, Erman Acar, and Frank van Harmelen. Semi-supervised learning using differ-
entiable reasoning. Journal of Applied Logics - IfCoLog Journal of Logics and their Applications
(FLAP), 6(4):633–652, 2019.
[92] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Proceedings of the 31st
International Conference on Neural Information Processing Systems, 2017.
[93] Joseph D Viviano, Becks Simpson, Francis Dutil, Yoshua Bengio, and Joseph Paul Cohen. Saliency
is a possible red herring when diagnosing poor generalization. In International Conference on
Learning Representations, 2021.
[94] Haofan Wang, Mengnan Du, Fan Yang, and Zijian Zhang. Score-cam: Improved visual explanations
via score-weighted class activation mapping. In CVPR 2020 Workshop on Fair, Data Efficient and
Trusted Computer Vision, 2020.
[95] Sarah Wiegreffe and Yuval Pinter. Attention is not not explanation. In Proceedings of the 2019
Conference on Empirical Methods in Natural Language Processing and the 9th International Joint
Conference on Natural Language Processing (EMNLP-IJCNLP), pages 11–20, Hong Kong, China,
November 2019. Association for Computational Linguistics.
[96] Jingyi Xu, Zilu Zhang, Tal Friedman, Yitao Liang, and Guy Van den Broeck. A semantic loss
function for deep learning with symbolic knowledge. In ICML, volume 80 of Proceedings of
Machine Learning Research, pages 5498–5507. PMLR, 2018.
[97] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich
Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual
attention. In Proceedings of the International Conference on Machine Learning, pages 2048–2057.
PMLR, 2015.
[98] Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In
European Conference on Computer Vision, pages 818–833. Springer, 2014.
[99] Matthew D Zeiler, Graham W Taylor, and Rob Fergus. Adaptive deconvolutional networks for
mid and high level feature learning. In 2011 International Conference on Computer Vision, pages
2018–2025. IEEE, 2011.
[100] X. Zhao, W. Zeng, J. Tang, W. Wang, and F. Suchanek. An experimental study of state-of-the-art
entity alignment approaches. IEEE Transactions on Knowledge and Data Engineering, pages 1–1,
2020.
[101] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep
features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pages 2921–2929, 2016.
8. Appendix
In this appendix we extend the results obtained with the MonuMAI dataset to a second dataset, PASCAL-Part, to further validate our approach. We detail the dataset and the results obtained in the following sections. The macro categories considered and their constituent object parts are the following:
• Bird: Torso, Tail, Neck, Eye, Leg, Beak, Animal Wing, Head
• Aeroplane: Stern, Engine, Wheel, Artifact Wing, Body
• Person: Ebrow, Foot, Arm, Torso, Nose, Hair, Hand, Neck, Eye, Leg, Ear, Head, Mouth
• Car: License plate, Door, Wheel, Headlight, Bodywork, Mirror, Window
• DiningTable: DiningTable
• Motorbike: Wheel, Headlight, Saddle, Handlebar
• Sofa: Sofa
• Boat: Boat
• Cow: Torso, Muzzle, Tail, Horn, Eye, Neck, Leg, Ear, Head
• Chair: Chair
• Bus: License plate, Door, Wheel, Headlight, Bodywork, Mirror, Window
• TvMonitor: Screen, TvMonitor
10 Available online at github.com/ivanDonadello/semantic-PASCAL-Part/.
11 Therefore, we discarded all images where there was more than one macro object.
12 In OWL language, the latter would be placed in their hasPart object property range.
The labels used for the PASCAL-Part dataset are shown in Figs. 14 and 15, and image samples in Figs. 16–19.
Since explicitly using the ontology built for MonuMAI yielded no significant advantage for object classification, we did not pursue this direction further for the PASCAL-Part dataset. A thorough conversion of PASCAL-Part into an ontology could be undertaken, but the variety of elements it contains would make it difficult to 1) group them into meaningful categories, and 2) extend the additional data and object properties of such a richer ontology into a KG that can be compared with an attribution graph. Since such an extension would also complicate the ontology-misattribution matching process, we leave it to future work.
PASCAL-Part’s Knowledge Graph
In analogy to the previous MonuMAI application, where architectural elements play the role of object parts and macro labels correspond to the architectural styles, we also used the KG provided by the PASCAL-Part dataset [28].
Figure 15: Distribution of object-parts in the PASCAL-Part dataset.
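For illustration only, the whole-part relations listed above can be written down as a small Python mapping from macro labels to part labels. This is a simplified, hand-copied sketch of the KG of [28]; the dictionary name and structure are ours, not the dataset's own serialisation:

    # Minimal illustrative sketch of the PASCAL-Part whole-part relations.
    # Keys are macro (object) labels, values are sets of object-part labels;
    # this is a simplified dictionary view, not the full ontology of [28].
    PASCAL_PART_KG = {
        "Bird": {"Torso", "Tail", "Neck", "Eye", "Leg", "Beak", "Animal Wing", "Head"},
        "Aeroplane": {"Stern", "Engine", "Wheel", "Artifact Wing", "Body"},
        "Person": {"Ebrow", "Foot", "Arm", "Torso", "Nose", "Hair", "Hand",
                   "Neck", "Eye", "Leg", "Ear", "Head", "Mouth"},
        "Car": {"License plate", "Door", "Wheel", "Headlight", "Bodywork", "Mirror", "Window"},
        "DiningTable": {"DiningTable"},
        "Motorbike": {"Wheel", "Headlight", "Saddle", "Handlebar"},
        "Sofa": {"Sofa"},
        "Boat": {"Boat"},
        "Cow": {"Torso", "Muzzle", "Tail", "Horn", "Eye", "Neck", "Leg", "Ear", "Head"},
        "Chair": {"Chair"},
        "Bus": {"License plate", "Door", "Wheel", "Headlight", "Bodywork", "Mirror", "Window"},
        "TvMonitor": {"Screen", "TvMonitor"},
    }

    # Example query: which macro labels can a detected "Wheel" belong to?
    print(sorted(label for label, parts in PASCAL_PART_KG.items() if "Wheel" in parts))
    # -> ['Aeroplane', 'Bus', 'Car', 'Motorbike']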
Table 6: Results of two backbone variants of EXPLANet (using Faster R-CNN and RetinaNet) on PASCAL-Part dataset.
Figure 17: Extract from PASCAL-Part dataset (II).
Figure 18: Extract from PASCAL-Part dataset (III).
Figure 19: Extract from PASCAL-Part dataset (IV).
Several factors may explain this, but the predominant one is probably that images contain valuable information that the part-label model does not. This becomes quite clear when studying the PASCAL-Part KG, as we do in Section 8.1: several labels are composed of the same part names but represent distinct things, i.e., parts of different object provenance (e.g. the leg of a person and the leg of an animal; both car and bus have the same object-parts). The part-based model therefore has trouble differentiating such categories, whereas a purely image-based (rather than attribute-based) model has no issue with them.
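As a concrete check of this ambiguity, the short snippet below (using an illustrative subset of the part sets listed in this appendix; the helper function and names are ours) confirms that Car and Bus expose exactly the same part vocabulary, so a classifier that only sees which parts were detected cannot separate them:

    from itertools import combinations

    # Illustrative subset of the whole-part relations listed above.
    PART_SETS = {
        "Car": {"License plate", "Door", "Wheel", "Headlight", "Bodywork", "Mirror", "Window"},
        "Bus": {"License plate", "Door", "Wheel", "Headlight", "Bodywork", "Mirror", "Window"},
        "Motorbike": {"Wheel", "Headlight", "Saddle", "Handlebar"},
    }

    def indistinguishable_pairs(part_sets):
        """Macro labels whose part sets coincide: indistinguishable from parts alone."""
        return [(a, b) for a, b in combinations(part_sets, 2) if part_sets[a] == part_sets[b]]

    print(indistinguishable_pairs(PART_SETS))  # -> [('Car', 'Bus')]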
Table 7: Impact of the new SHAP-Backprop training procedure on mAP (detection), accuracy (classification) and SHAP GED (interpretability). Results on the PASCAL-Part dataset with the EXPLANet architecture. GED is computed between the SAG and its projection into the expert KG. (Standard procedure: the typical sequential pipeline of 1) training the detector and 2) training the classifier, without SHAP-Backprop.)
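To give an intuition of how the SHAP GED column of Table 7 can be read, the toy sketch below computes a graph edit distance between a small SHAP attribution graph (SAG) and its projection into the expert KG using networkx. The graphs, edge choices and default edit costs are purely illustrative and do not reproduce the exact matching used in our metric:

    import networkx as nx

    # Toy SHAP attribution graph (SAG): edges link parts credited by SHAP
    # to the predicted macro label; "Leg" -> "Car" is a misattribution.
    sag = nx.DiGraph()
    sag.add_edges_from([("Wheel", "Car"), ("Door", "Car"), ("Leg", "Car")])

    # Projection of the SAG into the expert KG: only whole-part relations
    # that the KG allows for the predicted class are kept.
    kg_projection = nx.DiGraph()
    kg_projection.add_edges_from([("Wheel", "Car"), ("Door", "Car")])

    # Graph edit distance with default (unit) costs: the lower the value,
    # the closer the model's explanation is to the expert knowledge.
    print(nx.graph_edit_distance(sag, kg_projection))  # -> 2.0 (drop "Leg" and its edge)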
While the linear weighting appears to have a more positive effect on the explainability of the model, the improvement may not be significant, given that it does not always improve interpretability when applied to the more specific PASCAL-Part dataset.
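As a rough intuition of what a linear weighting scheme amounts to, one can picture the misattribution score of each training detection linearly scaling its contribution to the corrected loss. The sketch below uses made-up numbers and a simplified formula; it is not the exact SHAP-Backprop formulation:

    import numpy as np

    # Hypothetical per-detection losses and misattribution scores in [0, 1]
    # (e.g. SHAP credit assigned to parts the expert KG forbids for the
    # predicted class); both arrays are illustrative, not measured values.
    losses = np.array([0.9, 0.4, 1.2])
    misattribution = np.array([0.0, 0.5, 1.0])

    # Linear weighting: the more an explanation disagrees with the expert
    # KG, the more that detection contributes to the corrected signal.
    alpha = 1.0  # illustrative weighting strength
    weighted_loss = ((1.0 + alpha * misattribution) * losses).mean()
    print(weighted_loss)  # -> 1.3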
The discrepancy in performance (mAP for detection and accuracy for classification) between Tables 5 and 7, where RetinaNet is superior to Faster R-CNN only on MonuMAI but not on PASCAL-Part, can be explained by the PASCAL-Part labelling procedure, which joins elements that are not unifiable, i.e., parts with the same identifier, such as leg or wheel, that belong to very different types of objects (sheep and cows have the same parts but different object labels). It is therefore worth highlighting the differences in the labelling process of the two datasets: classification based on parts that share a name but have very different semantics and visual appearance, as in PASCAL-Part, is not designed for a neural network that takes only attributes as input to learn to classify objects from their parts. Thus, since the PASCAL-Part KG lacks highly discriminative features, accuracy and SHAP GED are not obviously nor directly connected, especially for RetinaNet, whose aggregation function uses all part probabilities to perform a prediction (it sums them) rather than only the highest one. As the micro and macro labels are not as appropriate as those designed for the MonuMAI dataset, the interpretability metric fails to reflect reality, independently of the quality of the detector.
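To make the role of the aggregation function concrete, the snippet below contrasts a sum-style aggregation, in which every detected part probability contributes to the class prediction, with a max-style one that keeps only the most confident detection per part class. Shapes and values are illustrative and do not correspond to the actual EXPLANet tensors:

    import numpy as np

    # Illustrative detection scores: rows are detected part instances,
    # columns are part classes (3 detections, 4 part classes).
    part_scores = np.array([
        [0.70, 0.10, 0.15, 0.05],
        [0.60, 0.20, 0.10, 0.10],
        [0.05, 0.05, 0.80, 0.10],
    ])

    # Sum aggregation: every detection contributes to the part descriptor,
    # so low-confidence detections still influence the final class scores.
    print(part_scores.sum(axis=0))  # -> [1.35 0.35 1.05 0.25]

    # Max aggregation: only the most confident detection per part class is kept.
    print(part_scores.max(axis=0))  # -> [0.7 0.2 0.8 0.1]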
In conclusion, the X-NeSyL methodology showed slightly different results on datasets designed with different purposes in mind, mainly due to the lack of discriminative dataset labels for EXPLANet to leverage. In particular, the PASCAL-Part dataset, whose design does not allow a full evaluation of SHAP-Backprop's effect on interpretability, showed decreased mAP and accuracy with respect to our baseline. This can be explained by its non-discriminative labelling, which was not designed for a part-based approach such as the EXPLANet architecture.