Automatic Image Captioning Combining Natural Language Processing and Deep Neural Networks
Results in Engineering
journal homepage: www.sciencedirect.com/journal/results-in-engineering
Research paper
ARTICLE INFO

Keywords: Object detection, Image captioning, Deep neural networks, Semantic-instance segmentation

ABSTRACT

An image contains a lot of information that humans can detect in a very short time. Image captioning aims to detect this information by describing the image content through image and text processing techniques. One of the peculiarities of the proposed approach is the combination of multiple networks to catch as many distinct features as possible from a semantic point of view. In this work, our goal is to prove that a combination strategy of existing methods can efficiently improve performance on object detection tasks with respect to the performance achieved by each method tested individually. This approach involves using different deep neural networks that perform two levels of hierarchical object detection in an image. The results are combined and used by a captioning module that generates image captions through natural language processing techniques. Several experimental results are reported and discussed to show the effectiveness of our framework. The combination strategy also shows a gain in precision over the single models.
1. Introduction

The image captioning topic has recently received great attention in the computer vision community. The goal of image captioning is to describe the content of an analyzed image. This issue is interesting for its important practical applications and because understanding image contents is a great challenge for computer vision.

Computer Vision (CV) and Natural Language Processing (NLP) were considered and treated as separate research areas in the past. Nowadays, with the huge amount of available data in multiple formats produced by several sources, such as mobile devices, IoT, and multimedia social networks [19], researchers can apply both techniques in conjunction to better understand multimedia contents and to extract knowledge, rather than mere information, from a given visual content. Indeed, a fundamental task in image retrieval and computer vision is the automatic annotation of images, whose goal is to find proper, hence meaningful, words or phrases for images. The more such annotations are semantically rich, the more the mapping between the annotated text and the visual content is consistent. In this context, approaches based on artificial intelligence, which often exploit pre-trained models, are used to learn the mapping between low-level and semantic features and then generate annotations or captions for a given image. A common valid approach includes classification, which assigns semantic classes to an image given the visual features extracted from it as input.

Generating a meaningful image content description in a natural language requires a high level of understanding of the recognized objects. An important task in this context is recognizing the detected objects' semantic meanings. For these reasons, classification approaches have been used to assign recognized objects to semantic classes based on their visual features. The use of natural language processing techniques attempts to give a possible solution to this problem by trying to reduce the semantic gap [2,4,30,32], defined in [34] as "the lack of coincidence between the information that can be extracted from visual data and the interpretation that these same data have for a user in a given situation".

There are two general paradigms in image captioning: top-down and bottom-up. The top-down paradigm starts from the information given by an image and converts it into words; on the other hand, the bottom-up approach considers words that describe various aspects of an image and combines them to generate a sentence. In both approaches, linguistic models are used to generate coherent sentences. Recent top-down approaches follow an encoder-decoder framework to generate image captions: convolutional neural networks are used to encode visual images, and recurrent neural networks to decode this information. This paper presents a framework based on a top-down approach with a set
* Corresponding author.
E-mail addresses: antoniomaria.rinaldi@unina.it (A.M. Rinaldi), cristiano.russo@unina.it (C. Russo), cristian.tommasino@unina.it (C. Tommasino).
https://doi.org/10.1016/j.rineng.2023.101107
Received 5 September 2022; Received in revised form 6 April 2023; Accepted 16 April 2023
Available online 19 April 2023
2590-1230/© 2023 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
dimensions expressed as width and height. If the cell is offset from the top left corner of the image by $(c_x, c_y)$ and the prior bounding box has width and height $p_w$, $p_h$, then the prediction is:
$b_x = \sigma(t_x) + c_x$
$b_y = \sigma(t_y) + c_y$
$b_w = p_w e^{t_w}$
$b_h = p_h e^{t_h}$
The width and height are predicted as offsets from the cluster centroids, while the center coordinates relative to the location of the filter application are predicted using a sigmoid function.
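As a minimal sketch of how these predictions can be decoded (plain NumPy, illustrative only; the function and variable names are ours, not taken from the YOLOv3 code):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_box(t_x, t_y, t_w, t_h, c_x, c_y, p_w, p_h):
    """Decode raw network outputs (t_*) into a box, given the grid-cell
    offset (c_x, c_y) and the prior (anchor) size (p_w, p_h)."""
    b_x = sigmoid(t_x) + c_x      # b_x = sigma(t_x) + c_x
    b_y = sigmoid(t_y) + c_y      # b_y = sigma(t_y) + c_y
    b_w = p_w * np.exp(t_w)       # b_w = p_w * e^{t_w}
    b_h = p_h * np.exp(t_h)       # b_h = p_h * e^{t_h}
    return b_x, b_y, b_w, b_h

# Example: a cell at offset (3, 5) with a 1.5 x 2.0 anchor.
print(decode_box(0.2, -0.1, 0.3, 0.1, c_x=3, c_y=5, p_w=1.5, p_h=2.0))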
The sum of squared errors is the loss function used for the training, as shown in Equation (3):

$SSE = \sum_{i=1}^{n} (x_i - \bar{x})^2$   (3)
The combination step is performed by comparing all the elements of one vector with all the elements of the other one, looking for intersections between the bounding boxes. If one or more intersections are found, the maximum is chosen, and if it exceeds a threshold of 0.8 (i.e., overlap), we evaluate which of the two elements generating this intersection has the highest score of assignment to a predicted class, and we choose it as the best candidate. The elements of each vector that do not have a maximum intersection value are also added to the final vector.

The metric we use for the evaluation is the well-established Intersection over Union, shown in Equation (4):

$IoU = \frac{A \cap B}{A \cup B}$   (4)

Let us clarify the combiner's job with an example. Suppose we have a vector with two elements, A and B, and a second vector with only one element, C, with the following IoUs:

$IoU_{A,C} = 0.95$
$IoU_{B,C} = 0.45$

This means that elements A and C refer to the same object in the image, having a very high IoU. We then compare the scores of assignment to the predicted class. Assuming that:

$score_A = 0.85$
$score_C = 0.95$

our combiner will choose C. Since element B has an IoU lower than the IoU between the other two elements and lower than the threshold, it is not considered superimposed on the object defined in element C. However, it is added to the final vector because it identifies another object in the image. In this way, no information is lost during the different combination steps.

4. Experimental results

In this section, we propose and discuss several experimental results obtained from various executions of the presented system to highlight the performance of the proposed framework for image captioning. The neural networks used for image object segmentation have been trained by means of a transfer learning approach, starting from already trained weights. In the training phase, an NVIDIA GeForce 980 4 GB GPU and the NVIDIA CUDA and cuDNN tools [24] have been used with the Python PyCharm development environment. The analysis of the various models obtained with the different configurations defined in the training phase has been validated in two steps: (i) Graphic Analysis: using the Tensorboard tool [20], the loss function trends have been analyzed. The aim is to have a loss function related to the training data decreasing as much as possible at each epoch, and a validation loss that does not grow significantly, to avoid overfitting; (ii) Evaluation Metrics: when reasonable values for the loss functions have been obtained, we proceed with the calculation of the related metrics. The used metrics are IoU, Average Precision, and Average Recall. In the remainder of this section, we present the used data set and our strategy for tuning an efficient model, so that it correctly learns the instances in the training set while keeping a good capacity for generalization of the learned knowledge. The results obtained from our framework and some examples of image captions are presented at the end of the section.

4.1. Dataset description

The entire COCO dataset [18] has been used. This dataset consists of 118287 images for the training set, 5000 for the validation set, and 40671 for the test set. It contains 80 object classes divided into 12 superclasses, as shown in Table 1.
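Both the evaluation and the combination step rely on the IoU of Equation (4). The following minimal sketch (hypothetical list-of-detections structures, not the authors' released code) illustrates the IoU computation and the threshold/score rule described above:

def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def combine(dets_a, dets_b, threshold=0.8):
    """Merge two detection vectors; each detection is (box, label, score).
    Superimposed detections keep the higher-scoring one; the others are kept as-is."""
    merged, used_b = [], set()
    for box_a, label_a, score_a in dets_a:
        # Maximum intersection of this element with the other vector.
        best_j, best_iou = None, 0.0
        for j, (box_b, _, _) in enumerate(dets_b):
            v = iou(box_a, box_b)
            if v > best_iou:
                best_j, best_iou = j, v
        if best_j is not None and best_iou > threshold:
            # Same object detected twice: keep the element with the higher class score.
            box_b, label_b, score_b = dets_b[best_j]
            merged.append((box_a, label_a, score_a) if score_a >= score_b
                          else (box_b, label_b, score_b))
            used_b.add(best_j)
        else:
            merged.append((box_a, label_a, score_a))
    # Elements of the second vector without a superimposition are added as well.
    merged += [d for j, d in enumerate(dets_b) if j not in used_b]
    return merged

# A (overlaps C above the threshold, lower score), B (no overlap), C (higher score).
dets_a = [((0, 0, 10, 10), "person", 0.85), ((20, 20, 30, 30), "chair", 0.70)]
dets_b = [((0, 1, 10, 11), "person", 0.95)]
print(combine(dets_a, dets_b))

With inputs arranged as in the worked example (A superimposed on C with an IoU above 0.8 and a lower score, B overlapping neither), the function returns C and B, mirroring the behaviour described in the text.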
Table 1
COCO dataset classes.

Person: person
Vehicle: bicycle, car, motorcycle, airplane, bus, train, truck, boat
Outdoor: traffic light, fire hydrant, stop sign, parking meter, bench
Animal: bird, cat, dog, horse, sheep, cow, elephant, bear, zebra, giraffe
Accessory: backpack, umbrella, handbag, tie, suitcase
Sports: frisbee, skis, snowboard, sports ball, kite, baseball bat, baseball glove, skateboard, surfboard, tennis racket
Kitchen: bottle, wine glass, cup, fork, knife, spoon, bowl
Food: banana, apple, sandwich, orange, broccoli, carrot, hot dog, pizza, donut, cake
Furniture: chair, couch, potted plant, bed, dining table, toilet
Electronic: tv, laptop, mouse, remote, keyboard, cell phone
Appliance: microwave, oven, toaster, sink, refrigerator
Indoor: book, clock, vase, scissors, teddy bear, hair drier, toothbrush

For our purposes, we have chosen to analyze the four most populated classes of images in the dataset. The statistics about the used classes for the training set and the validation set are shown in Tables 2 and 3, respectively.

Table 2
Statistics of the COCO classes used in the training set.

Class          Number of images
Person         64115
Chair          12774
Car            12251
Dining table   11837

Table 3
Statistics of the COCO classes used in the validation set.

Class          Number of images
Person         2693
Chair          580
Car            535
Dining table   501

4.2. Mask-RCNN

The implementation of Mask R-CNN used for the creation and evaluation of the model [1] is written in Python and uses the TensorFlow and Keras libraries. It has been re-engineered for integration into our segmentation problem. Mask R-CNN has numerous parameters whose variation has a considerable impact on the final result. We have manually tested several combinations of parameters for each model; in this paper, we show the results for the ones which gave us the best accuracy. These parameters are shown in Table 4.

Table 4
Mask-RCNN parameters.

BACKBONE: ResNet101 or ResNet50, where 50 and 101 are the number of layers of the network
GPU_COUNT: number of used GPUs
IMAGES_PER_GPU: number of images to analyze on each GPU
BATCH_SIZE: not modifiable, equal to the product of the two previous ones
DETECTION_MIN_CONFIDENCE: minimum score threshold to consider a classification
IMAGE_MIN_DIM: minimum size of the scaled image
IMAGE_SHAPE: image shape in [W H D] format
LEARNING_RATE: step size at each iteration
MAX_GT_INSTANCES: maximum number of ground-truth instances to use in an image
NAME: configuration name
NUM_CLASSES: number of classes to be classified
NUMBER_OF_EPOCHS: number of epochs
STEPS_PER_EPOCH: steps for each epoch
TRAIN_ROIS_PER_IMAGE: RoIs trained for each image
VALIDATION_STEPS: number of steps for validation
WEIGHT_DECAY: term in the weight update rule that causes the weights to exponentially decay to zero

The parameters listed above are related to the model configuration, together with the number of training epochs and the used data augmentation pipeline. From a general point of view, the loss functions related to the training steps decrease as the number of epochs increases. Moreover, it is also necessary that the loss function related to the validation set (validation loss) does not increase as the epochs increase; if this increment occurs, the model is overfitting.

In our implementation, the Mask-RCNN basic parameters have been set to the following maximum values:

• NUMBER_OF_EPOCHS=40
• STEPS_PER_EPOCH=100
• VALIDATION_STEPS=80
• TRAIN_ROIS_PER_IMAGE=80
• IMAGE_SHAPE=[512 512 3]
• IMAGES_PER_GPU=1 (hence BATCH_SIZE=1)

Different values have been used for the tuning of the other parameters. In particular:

• LEARNING_RATE: between 0.001 and 0.01 in steps of 0.001
• WEIGHT_DECAY: between 0.0001 and 0.01 in steps of 0.0001
• DETECTION_MIN_CONFIDENCE: between 0.0 and 0.9 in steps of 0.1
• TRAIN_ROIS_PER_IMAGE: between 10 and 80 in steps of 10
• STEPS_PER_EPOCH: between 10 and 100 in steps of 10
• VALIDATION_STEPS: between 10 and 80 in steps of 10
• DETECTION_MAX_INSTANCES: between 10 and 100 in steps of 10
• MAX_GT_INSTANCES: between 10 and 100 in steps of 10
• Number of training images: 1000, 10000, and all the images in the dataset

Moreover, we have to consider which network layers to train: (i) Heads: weights related only to the fully-connected layers; (ii) 3+, 4+, 5+: weights for the first 3/4/5 layers of ResNet; (iii) All: weights for all layers. In this regard, several hybrid configurations have been tested with different strategies, such as a learning rate variation every $x$ steps, a variation of the layers involved in the weight training process, and a combination of the two strategies above. The following pipelines have been used for image augmentation:

• Horizontal rotation
• Vertical rotation
• One of the two previous ones, chosen randomly
• One of the two rotations chosen randomly, followed by a 60, 120 or 180 degree rotation, also chosen randomly
• One of the two rotations chosen randomly, followed by a Gaussian or an average blur
• Linear contrast
• From 0 to 2 operations randomly chosen among one of the two rotations, one of the three rotations of 60, 120 or 180 degrees, and one of the two blurs

In general, with the same parameters, the ResNet101 backbone shows a better performance in terms of loss than ResNet50, as shown in Fig. 4. We decided to use a ResNet101 backbone for the further experiments on image captioning.
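To make the configuration above concrete, the following is a minimal sketch written against the Mask R-CNN implementation we build on [1] (attribute and class names follow the matterport interface as commonly distributed and may differ across versions; the augmentation library is assumed to be imgaug, which is not named in the paper):

import imgaug.augmenters as iaa
from mrcnn.config import Config

class CocoSubsetConfig(Config):
    NAME = "coco_subset"
    # Basic parameters set to the maximum values listed above.
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1              # hence BATCH_SIZE = 1
    NUM_CLASSES = 1 + 80            # background + COCO classes
    BACKBONE = "resnet101"
    IMAGE_MIN_DIM = 512
    IMAGE_MAX_DIM = 512             # gives IMAGE_SHAPE = [512 512 3]
    STEPS_PER_EPOCH = 100
    VALIDATION_STEPS = 80
    TRAIN_ROIS_PER_IMAGE = 80
    # Example values taken from within the tuned ranges above.
    LEARNING_RATE = 0.001
    WEIGHT_DECAY = 0.0001
    DETECTION_MIN_CONFIDENCE = 0.8
    MAX_GT_INSTANCES = 100

# Last augmentation pipeline: 0 to 2 operations among a flip, a 60/120/180
# degree rotation, and a Gaussian or average blur.
augmentation = iaa.SomeOf((0, 2), [
    iaa.OneOf([iaa.Fliplr(1.0), iaa.Flipud(1.0)]),
    iaa.Affine(rotate=[60, 120, 180]),
    iaa.OneOf([iaa.GaussianBlur(sigma=(0.0, 3.0)), iaa.AverageBlur(k=(2, 7))]),
])

CocoSubsetConfig().display()        # print the resulting configuration

Such a configuration object and augmenter would then be passed to the library's training entry point, selecting the layers to train ("heads", "4+", "all"), which is how the epoch-division strategies discussed in the following paragraphs are expressed.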
Fig. 4. Comparison between the ResNet50 backbone (in red) and the ResNet101 backbone (in blue).
The best performance was obtained with the following parameters:

• DETECTION_MAX_INSTANCES 100
• DETECTION_MIN_CONFIDENCE 0.8
• LEARNING_RATE 0.001
• MAX_GT_INSTANCES 100
• STEPS_PER_EPOCH 100
• TRAIN_ROIS_PER_IMAGE 80
• VALIDATION_STEPS 50
• WEIGHT_DECAY 0.0001
• All the images of the dataset
• Pipeline 1

We will show some examples of models resulting from different parameter settings. We analyze only the two best configurations to explain our evaluation process; this process has been used with all the other DNN models. In Fig. 5 we show a comparison of models 1 and 2. Model 1 is the one in blue, while model 2 is the one in red. We notice how they have a similar loss function at the end of the training epochs, while the validation loss in the case of model 1 decreases more quickly, and for model 2 it remains more or less constant for all the training epochs. The differences in configuration between these two models are: (i) Epoch division: model 1 has been trained for ten epochs on heads, ten on 4+, and 20 on all; model 2 has been trained for five epochs on heads, five on 4+, and 30 on all; (ii) Learning rate: it is set to 0.0001 after 20 epochs in model 1, and to 0.0001 after ten epochs for model 2. Due to the similarities in the loss trends, we calculate the reference metrics to choose the best model. The results are shown in Fig. 6. The best model is number 1, and some examples of object segmentation using Mask-RCNN are in Fig. 7, where the three images shown are named "img16", "img345" and "img1", respectively.

4.3. RetinaNet

The RetinaNet implementation used for the creation and evaluation of the model is the version presented in [17]. It has been re-engineered to be integrated into our proposed instance segmentation pipeline. This network has a simpler implementation compared with Mask-RCNN, and the tuning parameters for the training step are:

• Network depth
• Learning rate
• Number of epochs
• Batch size
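These four knobs map directly onto a standard PyTorch training setup. The sketch below uses torchvision's RetinaNet as a stand-in for the implementation in [17] (an assumption made only for illustration; construction details differ, but the role of depth, learning rate, epochs, and batch size is the same):

import torch
import torchvision

# Network-depth knob: choose the backbone variant (a ResNet-50 FPN RetinaNet here).
# torchvision >= 0.13; older releases use pretrained=False instead of weights=None.
model = torchvision.models.detection.retinanet_resnet50_fpn(weights=None)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # learning-rate knob

# A single dummy batch standing in for a COCO DataLoader, whose batch_size
# argument would be the batch-size knob.
images = [torch.rand(3, 512, 512)]
targets = [{"boxes": torch.tensor([[20.0, 30.0, 200.0, 300.0]]),
            "labels": torch.tensor([1])}]

model.train()
for epoch in range(2):                       # number-of-epochs knob
    loss_dict = model(images, targets)       # dict of classification/regression losses
    loss = sum(loss_dict.values())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()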
4.4. YOLOv3

The tuning parameters for the YOLOv3 training step are:
• Number of epochs
• Learning rate
• Learning momentum
• Weight decay
• Image size
• Multiscale (changing the image size during training)
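As an illustration of the multiscale option, the usual YOLOv3-style strategy is to re-sample the training resolution from a small set of multiples of 32 every few batches. A generic sketch of this idea (not the exact logic of the implementation in [36]) is:

import random

BASE, STEP, SPAN, EVERY = 416, 32, 3, 10   # resolutions 320..512, new size every 10 batches

img_size = BASE
for batch_idx in range(100):               # stand-in for iterating over a data loader
    if batch_idx % EVERY == 0:
        # Pick a new square resolution that stays a multiple of 32.
        img_size = BASE + STEP * random.randint(-SPAN, SPAN)
    # Here the current batch would be resized to (img_size, img_size)
    # before running the training step.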
DNNs. The performance evaluation of the combined results obtained by the best-trained model on the classes and superclasses for Mask-RCNN is shown in Fig. 14. As we can see, better results have been obtained in terms of mAP. Fig. 15 shows the improvement of the Mask-RCNN classification using superclasses. In fact, in addition to the two people classified in the first image, the baseball bat has been classified as the "sports" superclass, improving the overall detection. In the second image, instead, we notice that the chairs, previously classified as dining table, are now identified as furniture. In the first object classification of the third image, the truck had been recognized as a car, while it is now classified as the vehicle superclass with a higher score than the previous classification.

Fig. 16 shows the performance of the combined results obtained by the best-trained model on classes and superclasses for RetinaNet. Also in this case, we obtained better results in terms of mAP. The improvements of the RetinaNet classification using superclasses are shown in Fig. 17.
In the first image, in addition to the baseball player identified as a person, the ball has been detected and classified as sports. In the second one, the detections of three other persons, a book on the right side identified as indoor, and the armchair on the top left side identified as furniture have been added to the previous detections. In the third image, the detection of the yellow car in the background has been kept, while the one in the foreground has been replaced by a detection of the vehicle superclass, more generic than the first classification but with a higher score and a much more precise bounding box. In addition, the detections of the truck, identified as a vehicle, and of the parking meter, identified as the outdoor superclass, have been added.

The YOLOv3 performance for the best model trained on classes and superclasses is in Fig. 18. Also in this case we have obtained better results in terms of mAP. The classification results of this combination are shown in Fig. 19. We observe that in the first image, another person is added to the detection results, together with the baseball identified as "sports". In the second image, two more persons, two chairs classified as furniture, the notebook classified as electronic, and the book classified as indoor have been added to the objects previously detected. In the third image, the detected cars have been replaced by the superclass vehicle.
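The class-to-superclass promotion used in these combined results follows Table 1. A minimal sketch of the mapping (abridged here to the superclasses mentioned above; the remaining entries of Table 1 follow the same pattern) is:

# Subset of the COCO class -> superclass mapping of Table 1.
CLASS_TO_SUPERCLASS = {
    "person": "person",
    "car": "vehicle", "truck": "vehicle", "bus": "vehicle",
    "baseball bat": "sports", "sports ball": "sports",
    "chair": "furniture", "couch": "furniture", "dining table": "furniture",
    "book": "indoor",
    "laptop": "electronic",
    "parking meter": "outdoor",
}

def to_superclass(label):
    """Promote a fine-grained COCO label to its superclass (fall back to the label itself)."""
    return CLASS_TO_SUPERCLASS.get(label, label)

print(to_superclass("truck"))   # -> "vehicle"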
Fig. 21. Combined strategy comparison.

ing both the spatial and lexical information provided by the networks. The combination of the results gives us an impressive improvement regarding the classification and detection steps, as shown by the experiments, which indicates the goodness of our proposed approach and the

CRediT authorship contribution statement

All authors who contributed substantially to the study's conception and design were involved in the preparation and review of the manuscript until the approval of the final version. Antonio M. Rinaldi, Cristiano Russo and Cristian Tommasino were responsible for the literature search, manuscript development, and testing. Furthermore, Antonio M. Rinaldi, Cristiano Russo and Cristian Tommasino actively contributed to all parts of the article, including the interpretation of results,
review and approval. In addition, all authors contributed to the development of the system for the performance of the system tests. All authors have read and agreed to the published version of the manuscript.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

The data set used is public, and its reference is cited in the manuscript.

References

[1] W. Abdulla, Mask R-CNN for object detection and instance segmentation on Keras and TensorFlow, https://github.com/matterport/Mask_RCNN.
[2] M.W. Akram, M. Salman, M.F. Bashir, S.M.S. Salman, T.R. Gadekallu, A.R. Javed, A novel deep auto-encoder based linguistics clustering model for social text, Trans. Asian Low-Resource Lang. Inf. Process. (2022).
[3] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, L. Zhang, Bottom-up and top-down attention for image captioning and visual question answering, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[4] M.F. Bashir, H. Arshad, A.R. Javed, N. Kryvinska, S.S. Band, Subjective answers evaluation using machine learning and natural language processing, IEEE Access 9 (2021) 158972–158983.
[5] M. Buric, M. Pobar, M. Ivasic-Kos, Ball detection using YOLO and Mask R-CNN, in: 2018 International Conference on Computational Science and Computational Intelligence (CSCI), IEEE, 2018.
[6] A. Capuano, A.M. Rinaldi, C. Russo, An ontology-driven multimedia focused crawler based on linked open data and deep learning techniques, in: Multimedia Tools and Applications, Springer, 2019, pp. 1–22.
[7] H. Fang, S. Gupta, F. Iandola, R.K. Srivastava, L. Deng, P. Dollár, J. Gao, X. He, M. Mitchell, J.C. Platt, et al., From captions to visual concepts and back, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[8] K. He, G. Gkioxari, P. Dollár, R. Girshick, Mask R-CNN, in: Proceedings of the IEEE International Conference on Computer Vision, 2017.
[9] M.Z. Hossain, F. Sohel, M.F. Shiratuddin, H. Laga, A comprehensive survey of deep learning for image captioning, ACM Comput. Surv. 51 (6) (2019) 1–36.
[10] P. Hurtik, V. Molek, J. Hula, M. Vajgl, P. Vlasanek, T. Nejezchleba, Poly-YOLO: higher speed, more precise detection and instance segmentation for YOLOv3, arXiv preprint, arXiv:2005.13243.
[11] J. Ji, Z. Du, X. Zhang, Divergent-convergent attention for image captioning, Pattern Recognit. 115 (2021) 107928.
[12] A. Karpathy, L. Fei-Fei, Deep visual-semantic alignments for generating image descriptions, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[13] G. Kulkarni, V. Premraj, S. Dhar, S. Li, Y. Choi, A.C. Berg, T.L. Berg, Baby talk: understanding and generating image descriptions, in: Proceedings of the 24th CVPR, 2011.
[14] A. Kumar, S. Goel, A survey of evolution of image captioning techniques, Int. J. Hybrid Intell. Syst. 14 (3) (2017) 123–139.
[15] C.-W. Kuo, Z. Kira, Beyond a pre-trained object detector: cross-modal textual and visual context for image captioning, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
[16] Y. Li, F. Ren, Light-weight RetinaNet for object detection, arXiv preprint, arXiv:1905.10011.
[17] T.-Y. Lin, P. Goyal, R. Girshick, K. He, P. Dollár, PyTorch implementation of RetinaNet object detection, https://github.com/yhenon/pytorch-retinanet, 2020.
[18] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C.L. Zitnick, Microsoft COCO: common objects in context, in: European Conference on Computer Vision, Springer, 2014.
[19] K. Madani, A.M. Rinaldi, C. Russo, A semantic-based strategy to model multimedia social networks, in: Transactions on Large-Scale Data- and Knowledge-Centered Systems XLVII, Springer, 2021, pp. 29–50.
[20] D. Mané, et al., TensorBoard: TensorFlow's visualization toolkit, 2015.
[21] J. Mao, W. Xu, Y. Yang, J. Wang, Z. Huang, A. Yuille, Deep captioning with multimodal recurrent neural networks (m-RNN), arXiv preprint, arXiv:1412.6632.
[22] E. Mohamed, A. Shaker, H. Rashed, A. El-Sallab, M. Hadhoud, Insta-YOLO: real-time instance segmentation, arXiv preprint, arXiv:2102.06777.
[23] V.-Q. Nguyen, M. Suganuma, T. Okatani, GRIT: faster and better image captioning transformer using dual visual features, in: Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXVI, Springer, 2022.
[24] NVIDIA, P. Vingelmann, F.H. Fitzek, CUDA, release 10.2.89, https://developer.nvidia.com/cuda-toolkit, 2020.
[25] L. Qi, Y. Wang, Y. Chen, Y.-C. Chen, X. Zhang, J. Sun, J. Jia, PointINS: point-based instance segmentation, IEEE Trans. Pattern Anal. Mach. Intell. (2021).
[26] J. Redmon, A. Farhadi, YOLOv3: an incremental improvement, arXiv preprint, arXiv:1804.02767.
[27] S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: towards real-time object detection with region proposal networks, in: Advances in Neural Information Processing Systems, 2015.
[28] A.M. Rinaldi, C. Russo, K. Madani, A semantic matching strategy for very large knowledge bases integration, Int. J. Inf. Technol. Web Eng. 15 (2) (2020) 1–29, publisher: IGI Global.
[29] A.M. Rinaldi, C. Russo, C. Tommasino, A knowledge-driven multimedia retrieval system based on semantics and deep features, Future Internet 12 (11) (2020) 183, publisher: Multidisciplinary Digital Publishing Institute.
[30] C. Russo, K. Madani, A.M. Rinaldi, Knowledge construction through semantic interpretation of visual information, in: International Work-Conference on Artificial Neural Networks, Springer, 2019.
[31] C. Russo, K. Madani, A.M. Rinaldi, Knowledge acquisition and design using semantics and perception: a case study for autonomous robots, Neural Process. Lett. (2020) 1–16.
[32] C. Russo, K. Madani, A.M. Rinaldi, An unsupervised approach for knowledge construction applied to personal robots, IEEE Trans. Cogn. Dev. Syst. 13 (1) (2020) 6–15.
[33] T. Serre, L. Wolf, S. Bileschi, M. Riesenhuber, T. Poggio, Robust object recognition with cortex-like mechanisms, IEEE Trans. Pattern Anal. Mach. Intell. 29 (3) (2007) 411–426.
[34] A.W. Smeulders, M. Worring, S. Santini, A. Gupta, R. Jain, Content-based image retrieval at the end of the early years, IEEE Trans. Pattern Anal. Mach. Intell. 22 (12) (2000) 1349–1380.
[35] G. Srivastava, R. Srivastava, A survey on automatic image captioning, in: International Conference on Mathematics and Computing, Springer, 2018.
[36] Ultralytics, PyTorch implementation of YOLOv3 ("YOLOv3 in PyTorch"), https://github.com/ultralytics/yolov3, 2020.
[37] O. Vinyals, A. Toshev, S. Bengio, D. Erhan, Show and tell: a neural image caption generator, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[38] C. Wang, K. Huang, How to use bag-of-words model better for image classification, Image Vis. Comput. 38 (2015) 65–74.
[39] Y. Wang, B. Xiao, A. Bouferguene, M. Al-Hussein, H. Li, Vision-based method for semantic information extraction in construction by integrating deep learning object detection and image captioning, Adv. Eng. Inform. 53 (2022) 101699.
[40] Q. You, H. Jin, Z. Wang, C. Fang, J. Luo, Image captioning with semantic attention, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[41] J. Yu, J. Yao, J. Zhang, Z. Yu, D. Tao, SPRNet: single-pixel reconstruction for one-stage instance segmentation, IEEE Trans. Cybern. 51 (4) (2020) 1731–1742.
[42] Y. Zhang, X. Shi, S. Mi, X. Yang, Image captioning with transformer and knowledge graph, Pattern Recognit. Lett. 143 (2021) 43–49.