Image Segmentation For Object Detection Using Mask R-CNN in Colab
ISSN: 2455-5703
Abstract
Image segmentation is a critical process in computer vision. It involves dividing a visual input into segments to simplify image
analysis. Segments represent objects or parts of objects and comprise sets of pixels, or “super-pixels”. Recently, due to the success
of deep learning models in a wide range of vision applications, a substantial amount of work has aimed at developing
image segmentation approaches using deep learning models. Image segmentation with a CNN involves feeding segments of an
image as input to a convolutional neural network, which labels the pixels. Fully convolutional networks (FCNs) use convolutional
layers to process varying input sizes and can work faster. The final output layer has a large receptive field and corresponds to the
height and width of the image, while the number of channels corresponds to the number of classes. Another architecture, based on deep
encoders and decoders, is also known as semantic pixel-wise segmentation. It involves encoding the input image into low dimensions
and then recovering it with orientation invariance capabilities in the decoder. Most notable are R-CNN, or Region-Based
Convolutional Neural Networks, and the more recent technique called Mask R-CNN, which is capable of achieving state-of-the-art
results on a range of object detection tasks.
Keywords: Machine Learning, Deep Learning, Image Segmentation, Video Analytics
I. INTRODUCTION
architectures are studied for computational efficiency using classification and detection problems. The authors used multiple
databases [1].
Recent developments in deep learning have driven further progress in digital image processing research.
The authors analysed the pros and cons of each approach. The focus is to promote knowledge of classical computer vision
techniques and to explore how other computer vision options can be combined. Many hybrid methodologies are studied and
shown to improve computer vision performance [2].
Big data applications in industry require large amounts of storage. More storage is also required
for video streams from CCTV cameras, social media data, sensor data, agriculture data, medical data and data from
space research. The authors conducted a survey that starts from object recognition and proceeds through action recognition and
crowd analysis to violence detection in a crowd environment. The various problems in existing methods were identified and summarized [3].
In [4], the authors describe how to use deep learning for well-known applications such as
speech recognition, image processing and NLP. In deep learning, a pre-trained neural network is used to identify and remove noise
from images. Deep learning is applied to image pre-processing and image augmentation for
various applications, with better results.
In [5], the authors used the latest image classification techniques based on deep neural network architectures to improve
the identification of highly boosted electroweak particles with respect to existing methods. They also introduced new methods to
visualize and interpret the high-level features learned by deep neural networks that provide discrimination beyond physics-derived
variables, adding a new capability to understand physics and to design more powerful classification methods at the LHC.
In [6], the authors present an analysis of a tracking-by-detection approach that combines detection by YOLO with tracking
by the SORT algorithm. The paper describes a custom image dataset trained for 6 specific classes using YOLO, and
the resulting model is used in videos for tracking by the SORT algorithm. Recognizing a vehicle or pedestrian in an ongoing video is
helpful for traffic analysis. The goal of the paper is analysis and knowledge of the domain.
C. Ensemble learning
Ensemble learning synthesizes the results of two or more related analytical models into a single spread. It can improve prediction
accuracy and reduce generalization error, which enables accurate classification and segmentation of images. Segmentation via
ensemble learning attempts to generate a set of weak base-learners that classify parts of the image, and combines their output,
instead of trying to create one single optimal learner.
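As a minimal sketch of how the outputs of several base-learners might be combined, the example below applies a per-pixel majority vote to label maps. The function name `majority_vote` and the three small prediction maps are hypothetical and chosen only for illustration; the text above does not specify which combination rule is used, so majority voting is an assumption.

```python
from collections import Counter


def majority_vote(label_maps):
    """Combine per-pixel class labels from several base-learners by majority vote.

    label_maps: list of 2-D lists (rows of integer class labels), all the same shape.
    """
    if not label_maps:
        raise ValueError("need at least one label map")
    height, width = len(label_maps[0]), len(label_maps[0][0])
    combined = []
    for r in range(height):
        row = []
        for c in range(width):
            # Count the label each base-learner assigned to this pixel.
            votes = Counter(m[r][c] for m in label_maps)
            # Keep the most frequent label for this pixel.
            row.append(votes.most_common(1)[0][0])
        combined.append(row)
    return combined


# Hypothetical predictions from three weak base-learners on a 2x3 image.
pred_a = [[0, 1, 1], [0, 0, 1]]
pred_b = [[0, 1, 0], [0, 1, 1]]
pred_c = [[1, 1, 1], [0, 0, 0]]

ensemble = majority_vote([pred_a, pred_b, pred_c])
print(ensemble)  # [[0, 1, 1], [0, 0, 1]]
```

Each base-learner disagrees with the others on at least one pixel, yet the combined map follows the label chosen by two of the three learners at every position.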
D. DeepLab
One main motivation for DeepLab is to perform image segmentation while helping control signal decimation, that is, reducing the number
of samples and the amount of data that the network must process. Another motivation is to enable multi-scale contextual feature
learning, that is, aggregating features from images at different scales. DeepLab uses an ImageNet pre-trained residual neural network
(ResNet) for feature extraction. DeepLab uses atrous (dilated) convolutions instead of regular convolutions. The varying dilation
rates of each convolution enable the ResNet block to capture multi-scale contextual information. DeepLab comprises three
components:
– Atrous convolutions: convolutions with a factor that expands or contracts the convolutional filter’s field of view.
– ResNet: a deep convolutional neural network (DCNN) from Microsoft. It provides a framework that enables training thousands of
layers while maintaining performance. The powerful representational ability of ResNet boosts computer vision applications
like object detection and face recognition.
– Atrous spatial pyramid pooling (ASPP): provides multi-scale information. It uses a set of atrous convolutions with varying
dilation rates to capture long-range context. ASPP also uses global average pooling (GAP) to incorporate image-level features
and add global context information.
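To illustrate how the dilation rate expands a filter's field of view, the sketch below implements a one-dimensional atrous convolution in plain Python. It is a simplified illustration under stated assumptions, not DeepLab's implementation: the helper names `effective_kernel_size` and `atrous_conv1d`, the toy signal and the kernel are all hypothetical, and real networks apply the same idea to two-dimensional feature maps.

```python
def effective_kernel_size(kernel_size, rate):
    """Span covered by a kernel whose taps are spaced `rate` samples apart."""
    return kernel_size + (kernel_size - 1) * (rate - 1)


def atrous_conv1d(signal, kernel, rate):
    """'Valid' 1-D atrous (dilated) convolution.

    Computed as cross-correlation (no kernel flip), the convention used by
    deep learning frameworks. With rate=1 this is a regular convolution.
    """
    span = effective_kernel_size(len(kernel), rate)
    out = []
    for start in range(len(signal) - span + 1):
        # Each kernel tap reads the input `rate` samples after the previous tap.
        out.append(sum(w * signal[start + i * rate] for i, w in enumerate(kernel)))
    return out


signal = [1, 2, 3, 4, 5, 6, 7]   # toy input (assumption)
kernel = [1, 0, -1]              # toy 3-tap filter (assumption)

dense = atrous_conv1d(signal, kernel, rate=1)    # field of view: 3 samples
dilated = atrous_conv1d(signal, kernel, rate=2)  # field of view: 5 samples

print(effective_kernel_size(3, 2))  # 5
print(dense)                        # [-2, -2, -2, -2, -2]
print(dilated)                      # [-4, -4, -4]
```

The same three weights cover five input samples at rate 2, so the field of view grows without adding parameters. ASPP, as described above, runs several such convolutions with different rates in parallel and merges their outputs.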
V. EXPERIMENTAL STUDY
[Figure: Input Image]
[Figure: Output]
REFERENCES
[1] Mihai-Sorin Badea, Iulian-Ionuț Felea, Laura Maria Florea, Constantin Vertan, The Image Processing and Analysis Lab (LAPI), Politehnica University of
Bucharest, Romania.
[2] Niall O’Mahony, Sean Campbell, Anderson Carvalho, Suman Harapanahalli, Gustavo Velasco Hernandez, Lenka Krpalkova, Daniel Riordan, Joseph Walsh,
IMaR Technology Gateway, Institute of Technology Tralee, Tralee, Ireland.
[3] G. Sreenu and M. A. Saleem Durai, Intelligent video surveillance: a review through deep learning techniques for crowd analysis.
[4] Pratik Kanani, Mamta Padole, Deep Learning to Detect Skin Cancer using Google Colab, International Journal of Engineering and Advanced Technology
(IJEAT), ISSN: 2249-8958, Volume-8 Issue-6, August 2019.
[5] Ganesh B, Kumar C, Department of Computer Application, Dr. M.G.R. Chockalingam Arts College, Arni, Tiruvannamalai, Deep
Learning Techniques in Image Processing, National Conference on Emerging Trends in Computing Technologies (NCETCT-18), 2018.
[6] A. Schwartzman, M. Kagan, L. Mackey, B. Nachman and L. De Oliveira, SLAC National Accelerator Laboratory, Stanford University, 2575 Sand
Hill Road, Menlo Park, CA 94025, USA, Image Processing, Computer Vision, and Deep Learning: new approaches to the analysis and physics interpretation
of LHC events, IOP Publishing, 2016.
[7] Akansha Bathija, M.Tech Student, Dept of Computer Engineering, K J Somaiya College of Engineering, Mumbai, Maharashtra, India, Visual Object Detection
and Tracking using YOLO and SORT, International Journal of Engineering Research & Technology (IJERT), http://www.ijert.org, ISSN: 2278-0181,
IJERTV8IS110343, Vol. 8 Issue 11, November 2019.