Article

Cross-Field Road Markings Detection Based on Inverse Perspective Mapping
Eric Hsueh-Chan Lu * and Yi-Chun Hsieh

Department of Geomatics, National Cheng Kung University, No. 1, University Rd., Tainan 701, Taiwan;
[email protected]
* Correspondence: [email protected]

Abstract: With the rapid development of the autonomous vehicles industry, there has been a dramatic
proliferation of research concerned with related tasks, among which road markings detection is an important
issue. When there is no public open dataset for a field, road markings data must be collected and labeled
manually, which requires an enormous amount of labor and time. Moreover, object detection
often encounters the problem of small object detection: detection accuracy typically decreases
as the detection distance increases. This is primarily because distant objects on the road occupy
few pixels in the image and object scales vary with distance and perspective.
To address these issues, this paper utilizes a virtual dataset and an open
dataset to train the object detection model and performs cross-field testing on Taiwan roads. In
order to make the model more robust and stable, data augmentation is employed to
generate more data; the data are thus increased through data augmentation and
homography transformation of images in the limited dataset. Additionally, Inverse Perspective
Mapping is performed on the input images to transform them into the bird's eye view, which alleviates
the "small objects at far distance" problem and the "perspective distortion of objects" problem so that
the model can clearly recognize the objects on the road. Testing the model on the front-view images
and bird's eye view images also shows a remarkable improvement in accuracy of 18.62%.

Keywords: road markings; object detection; cross-field; inverse perspective mapping; deep learning

Citation: Lu, E.H.-C.; Hsieh, Y.-C. Cross-Field Road Markings Detection Based on Inverse Perspective Mapping. Sensors 2024, 24, 8080. https://doi.org/10.3390/s24248080

Academic Editor: Tomás Mateo Sanguino

Received: 14 October 2024; Revised: 16 December 2024; Accepted: 17 December 2024; Published: 18 December 2024

Copyright: © 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

1. Introduction

With the rapid development of the autonomous vehicles industry, there has been a dramatic
proliferation of research concerned with related works. For the purpose of driving the
car automatically, autonomous cars must be able to perceive the surrounding environment
at any time and handle emergency conditions, so it is essential that detecting and classifying
objects accurately in the road environment will ensure the stability of driving. Not only
have autonomous vehicles gained considerable attention but High-Definition Maps have
also aroused wide concern. High-Definition Maps are maps for autonomous vehicles that
contain a great deal of information about the road environment, such as road boundaries,
road lanes, traffic signs, road markings, etc., so that the computing platform inside the
vehicle can construct spatial awareness and provide assistance for the autonomous vehicles'
decision-making systems. The High-Definition Maps are composed of point cloud layers
and vector layers that conform to the attribute content defined by the High-Definition
Maps standard. In order to ensure the quality and consistency of High-Definition Maps,
many procedures are often labor-intensive and time-consuming, especially the production
of semantic feature maps (vector layers). To improve the performance of semantic feature
extraction for High-Definition Maps, object detection based on Convolutional Neural Networks
(CNNs) for automatic extraction of semantic features has become very common
over the years. The purpose is not only to recognize and classify the objects but also to
indicate the relative position of the objects in the image. According to the format standards
of High-Definition Maps, the vector layers consist of the actual shape of the road markings
as a polygon. This research is based on establishing the geometric outline objects of the
High-Definition Map to support pixel-level road markings detection.
In order to recognize the objects precisely, the Artificial Intelligence (AI) decision model
needs to learn a large amount of image data that include different weather conditions to
improve the robustness of the model. However, data collection and labeling data cost
human resources and time; thus, an efficient data collection method should be determined.
Previous studies on the detection of objects mostly focus on objects on the upper part of
the road such as road signs and traffic signals, and less on road markings on the road
surface. Since the detection targets are located on the road, only the area of the road
environment is required. Generally, this kind of research will carry out image processing
in the preprocessing stage. Common practices for removing irrelevant backgrounds include
extracting regions of interest, converting the front view to a bird's eye view, and so on, which can reduce
the influence of environmental conditions and unnecessary background feature learning so
that the deep-learning model can achieve better performance.
Image data with ground truth is an important key for determining the accuracy of
object detection. If the cost of data labeling can be reduced, high-quality road image
recognition and application can be improved. However, data collection takes a long
time and costs a large amount of money. For instance, an intelligent vehicle equipped
with industrial cameras in the field where the route has been surveyed and planned
acquires the data for model training. After data collection, the self-collected data need to be
labeled manually, which requires substantial additional labor and a large amount of time.
Furthermore, there is no public road markings dataset in Taiwan, which causes difficulties in the
use of data for related research. Additionally, one of the most crucial tasks for autonomous
vehicles is to detect the lane and road markings accurately. In current methods, when the
detection distance increases, the detection accuracy of objects frequently declines. This is
primarily because distant objects on the road take up few pixels in the image and because
object scales vary depending on different distances and perspectives. In light of these
concerns, this paper expects to solve the problem of reduced detection accuracy resulting
from small objects and the difficulties of data collection.
To solve the problems mentioned above, the research utilizes a virtual dataset and an
open dataset to train the object detection model and cross-field testing in the field of Taiwan
roads without spending a lot of money on collecting data and extra labor to annotate the
target objects in the image. The mixed dataset will be augmented to increase the amount of
data in the limited dataset by the method of data augmentation, such as flip, contrast, and
brightness adjustment. Furthermore, this paper utilizes Inverse Perspective Mapping (IPM)
to produce bird’s eye view images of the scene from the front-view image plane. Compared
to the front-view images for testing, the transformed images eliminate the perspective
distortion so that the accuracy of object detection can be improved.
The contributions of the paper are listed below:
1. A virtual dataset mixed with an open dataset is viable for cross-field detection when
training them together and testing them on the real dataset.
2. The method of data augmentation is employed to increase the amount of data in the
limited dataset without extra labor to annotate the target objects in the image.
3. The integration of Inverse Perspective Mapping (IPM) to transform input images
into a bird’s eye view is a key innovation of this work, significantly improving road
marking detection accuracy. This approach addresses perspective distortion, resulting
in a remarkable mAP improvement from 60.04% to 78.66%.
4. The Mask R-CNN tests the front-view images and the bird’s eye view images, and
the mAP increases from 60.04% to 78.66%, which is a significant improvement in
model accuracy.
The rest of this paper is organized as follows. In Section 2, literature reviews
of object detection algorithms and road markings detection are introduced. In Section 3,
the methodology of the proposed approach is elaborated in detail. Section 4 evaluates the
proposed method by the experiments and analyzes the results. In Section 5, conclusions
and future works of the paper will be mentioned.

2. Related Work
Some previous research works are reviewed in this section, which consists of
two major sections: object detection algorithms and research on road markings detec-
tion. The following will elaborate on the development of the object detection algorithm,
including the one-stage model and two-stage model. In addition, some existing research
on road markings will also briefly be reviewed.

2.1. Object Detection Algorithms


Convolutional Neural Network (CNN) algorithms are extensively applied in tasks
such as object recognition, classification, semantic segmentation, etc. Object detection is
the task of identifying an object with a certain class and localizing the object’s position in
the image. There are two major categories of object detection models: one-stage models
and two-stage models. One-stage models emphasize speed performance, e.g., YOLO [1].
Two-stage models emphasize detection accuracy, e.g., R-CNN [2], Faster R-CNN [3], and
Mask R-CNN [4]. The one-stage and two-stage models are described in detail below.

2.1.1. Two-Stage Model


CNN-based image recognition finds applications across various domains. Extracting
details about an object’s location and size within the image is crucial. The sliding window
technique, a straightforward approach, involves scanning the image with a fixed-sized
box and employing CNNs for object identification. In order to obtain better results, it
is necessary to use different sizes of the box; however, this process is extremely slow.
Accordingly, an increasing number of methods were proposed. R-CNN, proposed by
Ross Girshick et al., generates about 2000 candidate Region Proposals for the whole image,
and each candidate region is fed into the CNN for feature extraction. Since it takes a great
deal of time to retrieve features from more than 2000 candidate regions that are overlapped,
Ross Girshick modified the network by adding ROI pooling, calling it Fast R-CNN [5],
which calculates the feature values of the whole image at once and corresponds to the actual
location of the candidate regions to obtain the feature values of each region. The training
process is simplified and saves quite a lot of computing time. Due to the fact that generating
multiple candidate regions is quite time-consuming, Shaoqing Ren et al. presented Faster
R-CNN to shorten the overall computation time. Faster R-CNN takes advantage of the
Region Proposal Network (RPN) to generate the candidate regions effectively. The anchor
box and probability of the box will be the output to classify each candidate region, which
corresponds to the actual location of the candidate region. Previous approaches aim to
frame the object, while Mask R-CNN, proposed by Kaiming He et al., can produce pixel-
level masks. In the research, Mask R-CNN is implemented to train an object detection
model. The detailed information will be introduced in the methodology.

2.1.2. One-Stage Model


Through the evolution of the two-stage models described above, the overall processing
achieved the ability to generate pixel-level masks at a decent speed. Joseph Redmon et al.
proposed YOLO (You Only Look Once), focusing on prediction speed rather than accurate
masks. YOLO is designed for end-to-end training. Not only does it make training easier
but also faster. With the rapid development of YOLO architecture, YOLOv1, v2 [6], v3 [7],
and v4 [8] have been released. YOLOv2 introduced several enhancements over YOLOv1,
including improved accuracy, faster processing, and the ability to detect a greater variety
of objects. In 2018, Joseph Redmon et al. further advanced the model with the release
of YOLOv3. YOLOv3 introduced Residual Network (ResNet) [9] and Feature Pyramid
Network (FPN) [10] to solve the gradient vanishing problem and to optimize small object
detection by combining different scales of feature maps resulting in more accurate results
than the previous versions. YOLOv4 was released by Alexey Bochkovskiy et al., enhancing
various parts of YOLOv3. In addition to maintaining the speed, the detection accuracy was
significantly strengthened.
Mask R-CNN, which improves upon Faster R-CNN, established its legacy as an
instance segmentation model. Influenced by Mask R-CNN, Daniel Bolya et al. hoped to
design a one-stage instance segmentation model that integrated the merits of Mask R-CNN
and YOLO, namely YOLACT [11]. YOLACT is a real-time instance segmentation model,
which performs the object detection task accurately while maintaining rapid computing speed. YOLACT
splits the complex instance segmentation process into two simple parallel tasks, generating
prototype masks and predicting mask coefficients per instance. For each instance, the
corresponding predicted mask coefficient is simply multiplied and added to the prototype
mask. Subsequently, the instances are filtered according to the bounding box and the
threshold value to obtain the corresponding mask for each instance, which is a high-quality
mask. SOLO [12] directly segments the instance mask, which is a box-free approach. SOLO
considers a method that introduces the concept of instance categories to predict the class
of object instances. To distinguish the object instances based on their center locations and
object sizes, the approach transforms the instance segmentation issue into a classification
issue, imitating the idea of semantic segmentation to predict the class of each pixel. The
accuracy of the experiment testing on the COCO dataset has surpassed Mask R-CNN.
SOLO version 2 [13], published by the same author, improves the mask learning and NMS
(Non-Maximum Suppression) approach, which not only enhances the accuracy but also
realizes the real-time requirements.

2.2. Research on Road Markings Detection


Road markings detection has been a popular research topic in the field of autonomous
driving for decades. Most of the previous research has concentrated on the detection of
lane lines instead of the road markings, such as pedestrian and road speed limit markings.
In this section, handcrafted and deep-learning methods of road markings works will be
introduced first. Next, road markings detection based on Inverse Perspective Mapping will
be explained in the following part.

2.2.1. Research on Road Markings Detection Based on Handcrafted and Deep-Learning Methods
The prior research on road markings detection is roughly classified into two categories,
one is handcrafted features methods, and the other is deep-learning-based object detection
methods. Conventional methods for road markings detection tasks mostly extract the basic
feature of the target object manually, e.g., color, edge, and texture, which vastly rely on
the method that the author designed. For instance, Tang et al. [14] utilized a Histogram of
Oriented Gradient (HOG) [15] and Support Vector Machine (SVM) [16] with the Region
Of Interest (ROI) restrictions, which demonstrated good performance on the dataset. In
contrast to the handcrafted methods, the deep-learning-based approach indicates better
results and stability in the feature extraction of road markings. Object detection based on
CNN has apparently improved the performance under various situations. VPGNet [17] is
an end-to-end model that detects vanishing points and road markings on the road surface.
Furthermore, the author released a new dataset that is publicly available collecting data
under various weather conditions in Korea. Hoang et al. [18] detected and classified
the arrows and bike markings on the road based on the adaptive ROI and RetinaNet.
The results show that the adaptive ROI outperforms other methods. To pursue real-time
detection, Zhang et al. [19] proposed a method consisting of three modules: preprocessing,
road markings detection, and segmentation. A lightweight network combined with the
Siamese Attention module is adopted to improve the accuracy and enhance the sensitivity
to road markings in the second stage. For the segmentation module, the segmented objects
can reach pixel-level accuracy and cost less computation. Ye et al. [20] proposed a two-stage
model, YOLOv2 combined with a Spatial Transformer Network (STN) [21], to tackle the
distortion of road markings. The presented method can obtain good performance with less
computation even though it is a two-stage model. In summary, the deep-learning-based
approach is more robust and more stable than the traditional feature extraction approach
and can be applied to different scenarios with higher accuracy.

2.2.2. Road Markings Detection Based on Inverse Perspective Mapping


In order to drive the car automatically, autonomous cars must be able to perceive
the surrounding environment at any time and handle emergency conditions. In existing
methods, detection accuracy usually decreases with increasing distance because the objects
appear relatively smaller. Moreover, scales of elements on the road are inconsistent at different
distances and perspectives, while distorted elements occur in the distance. The perspective
distortion can be eliminated by Inverse Perspective Mapping (IPM), which transforms the
image into bird’s eye view (BEV) and can solve the problem stated above. The existing
research on road marking detection often adopts IPM to obtain BEV images during the
preprocessing stage so as to reduce the complexity of the original image and remove
the unwanted parts of the image that focus on the ROI. Li et al. [22] performed the IPM
transformation to eliminate the impact of the perspective effect. ROIs extracted from
IPM images are exploited to detect the road markings. Greenhalgh et al. [23] presented a
method that can detect the road markings text and symbols automatically. Before detecting
the targets, the images were transformed into an IPM image to remove the perspective
distortion. MSER (Maximally Stable Extremal Regions) [24] could subsequently be applied
to generate candidate regions. Symbol-based road markings were recognized by HOG and
SVM, while text-based road signs were identified by the optical character recognition (OCR)
package. Bailo et al. [25] applied the MSER to obtain candidate regions under different
illumination and weather conditions. The proposed method is proven to detect the object
on the image robustly. Kang et al. [26] considered a method to reach the real-time detection
of road markings based on the YOLOv2 model. The synthetic dataset constructed by the
MSER algorithm contains classes and orientations, trained by the detector to predict the
class labels and position. A review of the common characteristic of IPM-related references
reveals the transformation of the front-view image to a bird’s eye view. The purpose of
the transformation to a bird’s eye view is to extract the ROI more easily and then perform
manual feature extraction. In addition, the current literature uses front-view images to
train deep-learning models and converts front-view images to top-view images without
considering the problem of disparity in viewing angles. Therefore, this study considers
the relationship between different view angles and normalizes the view angle to the top
view, which significantly improves the model performance and helps to improve the object
detection accuracy.

3. Methodology
In this section, the overall proposed method will be introduced first. Figure 1 provides
an apparent visualization of the overall framework of the proposed method. The procedure
is divided into two phases, the training phase and the testing phase. Owing to the
inconvenience of collecting data, the training data are composed of the two open datasets
for cross-field detection. Additionally, in order to focus on the objects of the road surface,
the mixed dataset will be processed in the training phase. The images will be cropped to
remove the irrelevant background, and afterward the Inverse Perspective Mapping (IPM)
will be applied to project to the perspective of the bird’s eye view, which is favorable for
the object detection model. For the testing phase, IPM will be performed on the testing
data, projecting the images to the bird’s eye view.

Figure 1. Illustration of the proposed method.

3.1. The Proposed Method


The pipeline of the proposed method is presented in this section. Figure 2 illustrates
the workflow of the research, including input, output, and method. Three datasets
are utilized in the research: the Surrounding Vehicles Awareness (SVA) dataset,
the Ceymo dataset [27], and the Taiwan road scene data. The SVA and Ceymo datasets
will be mixed into one dataset for the training phase, while the Taiwan road scene dataset
extracted from YouTube videos is the testing data. Before training the model, these two
datasets will be preprocessed, comprising data augmentation, homography transformation,
and ground-truth labeling. After data preprocessing, the mixed data will be fed into the
segmentation model for training and then tested on the real dataset. The proposed method
can be implemented in any instance segmentation model and even in semantic
segmentation models to reach the demand of pixel-level output. Three different types of
models will be trained in the instance segmentation model and compared with each other.
The first one is the front-view model, in which, as in most previous studies, front-view
images are used as input. The second one trains on the bird's eye view images,
which are transformed from the front-view images; unlike previous studies, this paper uses
images from different viewing angles as training data. The third one takes both the front-view
images and bird's eye view images as input, so that the model can detect images from different
perspectives, which makes the model more robust, general, and stable. In terms of
user convenience, users may not necessarily convert the images to the top view, so this paper
also proposes the third method to balance cross-field detection at different perspectives.
For the testing phase, in order to solve the issue of missed recognition caused by different
perspectives and far distances, before testing the Taiwan road data, the images will be
transformed into the top-view perspective. Thus, the distorted objects can be presented in
their entirety to improve the accuracy of the prediction.

Figure 2. The overall pipeline of the proposed method.

3.2. Homography Transformation Based on Inverse Perspective Mapping


In object detection tasks based on the deep-learning method, the problem of “small
objects at far distance” is hard to solve. Therefore, IPM is applied to remove the perspective
effect and reduce the disparity between the front-view and bird’s eye view. IPM relies on
two assumptions: (1) the camera is in a fixed position relative to the road, and (2) the road
surface is flat. The main concept of IPM is to project the world coordinate system to the 2D
(two-dimensional) coordinate system. The pinhole camera model (Figure 3) plays a crucial
role in the concept of perspective transformation. In the pinhole camera model, parameters
are described as follows, where P(X_w, Y_w, Z_w) is an object in the world coordinate system,
and p(x', y') is a 2D point in the image coordinate system when P is being projected onto
the image. The distance between F and O is called the focal length, where O represents the
center of the camera, and F represents the center of the image. It describes the coordinates
of a real-world object transform as stated in the following: the coordinates of the object
in the 3D (three-dimensional) world are transformed to the coordinates of the camera
first, and then transformed to the coordinates of the 2D images on the 2D plane. In brief,
the pinhole camera model depicts how an ideal pinhole camera with an infinitely small
aperture projects an object in the 3D world onto a 2D plane.

Figure 3. Pinhole camera model.
The transformation matrix is formulated by the camera parameters, the intrinsic parameter
matrix and the extrinsic parameter matrix. For the intrinsic parameter matrix K (1), f_x and
f_y are the focal lengths of the camera, and c_x, c_y are the optical centers of the camera.
It gives the geometric description of light inside the camera and converts the camera
coordinates to the image coordinates. First, it scales the image plane from a unit of object
space coordinates into pixels. Second, it shifts the original point from the middle of the
image into the top-left corner.

K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}   (1)

The extrinsic matrix C_{ext} (2) is composed of rotation matrix R and translation matrix T,
which describe the orientation and location of the camera in the world system. In brief, it
indicates the transformation from world coordinates to camera coordinates.

C_{ext} = \begin{bmatrix} R \mid T \end{bmatrix} = \begin{bmatrix} r_{11} & r_{12} & r_{13} & t_x \\ r_{21} & r_{22} & r_{23} & t_y \\ r_{31} & r_{32} & r_{33} & t_z \\ 0 & 0 & 0 & 1 \end{bmatrix}   (2)
The points in the world are (X, Y, Z) coordinates that are turned into homogeneous
form. After multiplying by the two matrices, the homogeneous coordinates of the pixel in
the image are calculated. Hence, combining the two matrices by matrix multiplication gives the
camera matrix shown in (3), which is a 3 × 4 matrix with 12 numbers, denoted as C_{ij}.
The elements C_{ij} represent the intrinsic and extrinsic parameters of the camera, where i
is the row index and j is the column index. λ is an arbitrary scalar scale factor that scales
the (x', y', z') coordinates. The (x', y') are divided by z'; therefore, the λ disappears. The
coordinates multiplied by the matrix with any arbitrary scale factor will obtain exactly
the same result. Since the scale factor is arbitrary, the element C(3, 4) will be set to 1.

\begin{bmatrix} x' \\ y' \\ z' \end{bmatrix} = \lambda \begin{bmatrix} C_{11} & C_{12} & C_{13} & C_{14} \\ C_{21} & C_{22} & C_{23} & C_{24} \\ C_{31} & C_{32} & C_{33} & C_{34} \end{bmatrix} \begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{bmatrix}   (3)
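As a quick illustration of Equation (3), the short NumPy sketch below projects a single world point through an assumed intrinsic matrix K and extrinsic matrix [R | T]; all numeric values are invented for illustration and are not calibration values from the paper.

    # Illustrative pinhole projection: world point -> camera coordinates -> pixel coordinates.
    import numpy as np

    K = np.array([[800.0,   0.0, 640.0],    # f_x, c_x
                  [  0.0, 800.0, 360.0],    # f_y, c_y
                  [  0.0,   0.0,   1.0]])

    R = np.eye(3)                            # camera aligned with the world axes (assumed)
    T = np.array([[0.0], [1.5], [0.0]])      # illustrative translation of the world origin
    Rt = np.hstack([R, T])                   # 3 x 4 extrinsic matrix [R | T]

    P_world = np.array([2.0, 0.0, 10.0, 1.0])    # homogeneous world point

    p = K @ Rt @ P_world                     # homogeneous image coordinates (x', y', z')
    u, v = p[0] / p[2], p[1] / p[2]          # divide by z' to obtain pixel coordinates
    print(u, v)                              # -> 800.0 480.0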

The matrix (4) can be simplified to consider the camera projection on a point on a
plane of the world space; therefore, the Z value of the points will be set to 0 and the
third column of the matrix will be multiplied to be 0. After removing a row of the vector, it
is a 3 × 3 system called planar Homography H. It maps the coordinates of the points in a
plane to the points in the image. Two-dimensional coordinates of a point in the world can
be converted by using the simple matrix into the coordinates of a point in the image.

\begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} = \begin{bmatrix} C_{11} & C_{12} & C_{14} \\ C_{21} & C_{22} & C_{24} \\ C_{31} & C_{32} & 1 \end{bmatrix} \begin{bmatrix} X_w \\ Y_w \\ 1 \end{bmatrix} = \begin{bmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & 1 \end{bmatrix} \begin{bmatrix} X_w \\ Y_w \\ 1 \end{bmatrix}   (4)

Homography transformation is described by the following equations, where H is the
3 × 3 matrix with eight degrees of freedom (DoF), so H can be estimated from at least
four world points and their corresponding image points. Among these four points, any
three of them should not be collinear. x'(x', y', 1) represents the transformed coordinates, and
x(X_w, Y_w, 1) represents the homogeneous coordinates. The Homography matrix can warp
image pixel coordinates to corresponding pixel coordinates on the target image. The target
coordinates are calculated by the collinear equations using least-squares estimation.

x' = Hx, \quad H = K[R \mid T] = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} r_{11} & r_{12} & t_x \\ r_{21} & r_{22} & t_y \\ r_{31} & r_{32} & t_z \end{bmatrix} = \begin{bmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{bmatrix}   (5)

The eight equations are used to estimate the eight degrees of freedom in the homography
matrix, while h_33 is set to 1. The h_ij terms map the coordinates of the 2D image to
another reference frame according to (5) and are calculated by the least-squares method.
After applying the homography transformation to the front-view images, the images are
projected onto the bird's eye view. The four yellow points are chosen along the roadside, so
the ROI focuses on the road area.
In this research, the homography transformation is also applied as part of data augmentation
to train the model. For the testing phase, the testing data are likewise transformed by the
homography matrix from their different viewing angles to the perspective of the bird's eye
view. In conclusion, IPM transforms 2D images into other 2D images of the same planar
surface through homography matrix multiplication, so that front-facing images from the
monocular camera are projected onto top-view images.
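As a concrete illustration of this step, the following minimal Python sketch estimates a homography from four manually chosen road-plane points and warps a front-view frame into a bird's eye view with OpenCV. The point coordinates, file names, and output size are illustrative assumptions, not the values used in the paper.

    # Minimal IPM sketch: four road-side points in the front view are mapped to a
    # rectangle in the bird's eye view, giving the 3x3 homography of Equation (5).
    import cv2
    import numpy as np

    front_view = cv2.imread("front_view.jpg")        # hypothetical input frame

    src = np.float32([[550, 450], [730, 450], [1180, 700], [100, 700]])   # road plane in front view
    dst = np.float32([[0, 0], [400, 0], [400, 600], [0, 600]])            # target bird's eye rectangle

    H = cv2.getPerspectiveTransform(src, dst)        # estimate the homography from 4 correspondences
    bird_eye_view = cv2.warpPerspective(front_view, H, (400, 600))
    cv2.imwrite("bird_eye_view.jpg", bird_eye_view)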

3.3. Data Augmentation


Data augmentation is a common method that is used in the deep-learning model for
the purpose of increasing the number of images from small datasets to avoid overfitting
the specific scene. To begin, the data augmentation on images will be explained. Owing to
the change of pixel coordinates, the data augmentation on the label will also be expounded
in the next part.

3.3.1. Data Augmentation on Images


Deep-learning networks generally require a great deal of training data to obtain
better results. With the limitation of obtaining data, data augmentation is utilized to
produce more data from existing datasets so that the diversity of the original images is
increased to make up for the lack of data. Common techniques of data augmentation include
(1) geometric transformation: randomly flip, crop, rotate, shear or translate images;
(2) color space transformation: change the color channel space or try to map the RGB
to other color spaces; (3) noise injection: a matrix of random values sampled from the
Gaussian distribution is added to the RGB pixels of the image; and (4) kernel filters: kernel
filters on images for convolution operations, such as sharpen and blur. After data augmentation,
the human eye still easily recognizes the augmented images as the same image; however, for
the deep-learning model, these processed images are completely new samples.
Considering the simple features and monotonous color of the road markings, they
do not contain diverse structural features for the object detection model. Accordingly, this
paper adopts the flip and color space adjustment (brightness and contrast) to the training
dataset. Flip is one of the effective methods and has proven to be useful to improve the
performance of the deep-learning model. Furthermore, color space adjustment is the easiest
and most common technique to change the luminance of images. On the road surface
environment, the diversity of lighting conditions and weather conditions have an impact
on the accuracy of the model, and hence, data augmentation is a vital technique to vary the
images by the color space adjustment. The images are flipped using horizontal flipping
and vertical flipping, which are common methods in geometric transformation. Moreover,
brightness adjustment is implemented to the training data to convert the brightness-related
channel depending on the value setting. Therefore, it can make the images slightly brighter
or darker to enhance the lighting conditions of images. Contrast adjustment is also one of
the data augmentation techniques that rescales the range of intensity value in the images.
The contrast is the ratio between the brightest and darkest areas of the images. The larger
the ratio, the more gradations from black to white, which makes the object or the boundary
in the images more distinguishable. Therefore, the contrast of the white road markings
on the black asphalt road should be enhanced, which will improve the visual perception.
Ultimately, four sets of image copies were generated by the data augmentation techniques,
which increased the amount of data from the original images without extra time costs.

3.3.2. Data Augmentation on Label


Labeling the ground truth of target objects is necessary for the supervised learning
network. Before training the model, target objects of the dataset need to be labeled as
ground truth. Nevertheless, deriving more images is difficult and they also need to be
labeled. It is time-consuming and requires a considerable amount of commitment. As a
consequence, the study increases the amount of data using data augmentation and the
homography transformation of images in the limited dataset. The annotations do not re-
quire additional manual labeling after data augmentation and homography transformation.
In view of the pixel coordinates change, some labels of the augmented data need to be
modified. After data augmentation such as brightness and contrast adjustment, the label is
the same as the original annotation files, whereas the pixel coordinates of the flip images
need to be horizontally flipped. If the label originally corresponds to a “right arrow”, it
will be changed to a “left arrow” after the flip. The pixel coordinate of flipped images is
converted from the left part to the right part. Furthermore, the original images from the
mixed dataset will be transformed into the perspective of the bird’s eye view. Consequently,
the pixel coordinates of the bird’s eye view label are transformed from the front-view label
using the inverse of the homography matrix.
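The label handling described above can be sketched as follows, assuming OpenCV and polygon annotations stored as lists of pixel coordinates. The helper functions, the class-name mapping, and the example values are hypothetical; they only illustrate the flip and homography adjustments discussed in this subsection.

    # Sketch of moving annotation polygons along with the images.
    import cv2
    import numpy as np

    def warp_polygon(polygon_xy, H):
        """Map a front-view polygon (N x 2 pixel coordinates) to the bird's eye view."""
        pts = np.float32(polygon_xy).reshape(-1, 1, 2)
        return cv2.perspectiveTransform(pts, H).reshape(-1, 2)

    def flip_polygon(polygon_xy, image_width):
        """Mirror a polygon horizontally to follow a left-right image flip."""
        pts = np.float32(polygon_xy).copy()
        pts[:, 0] = image_width - 1 - pts[:, 0]
        return pts

    # Flipping also swaps direction-dependent classes, e.g. right arrow <-> left arrow.
    FLIP_CLASS_MAP = {"right_arrow": "left_arrow", "left_arrow": "right_arrow"}

    label = {"class": "right_arrow", "points": [[600, 500], [640, 500], [620, 560]]}
    H = np.eye(3, dtype=np.float32)                  # placeholder homography

    bev_points = warp_polygon(label["points"], H)
    flipped = {"class": FLIP_CLASS_MAP.get(label["class"], label["class"]),
               "points": flip_polygon(label["points"], image_width=1280)}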

3.4. The Architecture of Mask R-CNN for Road Markings Detection


Mask R-CNN is a classic instance segmentation model. The merit of Mask R-CNN
is its pursuit of prediction accuracy, which can reach the pixel level. Compared to
other semantic models, Mask R-CNN is more robust and adaptive to different datasets.
The one-stage models mainly pursue the real-time requirements but often achieve bad
prediction results. Compared to the one-stage models, the region-based two-stage models
focus on prediction accuracy. In light of achieving a better result, Mask R-CNN is noted
as an influential instance segmentation model that can be used to detect the road markings
on the road surface. Mask R-CNN has roughly the same framework as Faster R-CNN.
The difference between Faster R-CNN and Mask R-CNN is that the original two tasks
are changed to three tasks: classification, regression, and segmentation. ROI pooling
is replaced by the ROI align. ResNet with 101 layers in combination with the neck of
the Feature Pyramid Network (FPN) is used as the feature extraction backbone in the
implementation. FPN is a top-down architecture with skip connection that solves problems
at different scales, which is beneficial to small object detection. Mask R-CNN is divided
Sensors 2024, 24, 8080 11 of 21

into two stages: finding the region proposals first and then identifying them. The first
stage has the same first layer as Faster R-CNN, called Region Proposal Network (RPN).
RPN consists of two branches. One of the branches determines the probability of whether
the anchor contains an object or not. The other branch is responsible for calculating the
offsets ( x, y, w, h) between the anchors and ground truth. The output of RPN will be the
input into the ROI align. A modification is proposed to improve the location accuracy
known as ROI align. Using bilinear interpolation of the virtual pixel at the point with the
nearest pixel—instead of using the quantization output for obtaining a fixed size from ROI
pooling—solves the problem of misalignment, which is caused by twice quantization in
the ROI pooling operation. Therefore, the accuracy of the bounding box location makes
obvious progress. In the second stage, apart from predicting classes and bounding box
locations, a branch of the fully convolutional network is added. A corresponding binary
mask is predicted for each ROI to indicate whether a given pixel is part of the target or not.
The overall architecture of Mask R-CNN is depicted in Figure 4.
The overall architecture of Mask R-CNN is depicted in Figure 4.

Figure 4. The architecture of Mask R-CNN.

The loss of Mask R-CNN consists of three losses (6), including classification loss,
bounding box loss, and mask loss. The classification loss and bounding box loss originate
from Faster R-CNN. The loss of Faster R-CNN defined in (7) is divided into two parts,
classification loss and regression loss. Classification loss is calculated as the average binary
cross-entropy, comparing each predicted probability (p_i) to the actual class output, which
is either 0 or 1. For regression loss, it is the value of the bounding box offset, where
t_i = (t_x, t_y, t_w, t_h) are the four coordinates of the predicted bounding box while t_i^* is the
ground-truth box linked to a positive anchor. It adopts the Smooth L1 loss to train the model.
The mask loss of Mask R-CNN allows every ROI to generate the masks for every class;
however, not every mask output will contribute to the loss. Instead, the mask loss of class
k will be counted according to the prediction result of the classification branch in class k.
The perspective transformation is not limited to using Mask R-CNN as the training model.
Given the maturity and flexibility of Mask R-CNN, as well as the characteristics of basic
features and the monotonous color of the road markings, it is suitable for deployment.

L = L_{cls} + L_{box} + L_{mask}   (6)

L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*)   (7)
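For readers who want a concrete view of Equation (7), the following self-contained NumPy sketch evaluates the two RPN loss terms, binary cross-entropy for the objectness score and Smooth L1 for the box offsets gated by p_i^*. It is a simplified didactic example with illustrative normalization constants, not the training code used in this work.

    # Didactic sketch of the classification + regression terms in Equation (7).
    import numpy as np

    def binary_cross_entropy(p, p_star, eps=1e-7):
        p = np.clip(p, eps, 1.0 - eps)
        return -(p_star * np.log(p) + (1.0 - p_star) * np.log(1.0 - p))

    def smooth_l1(t, t_star):
        diff = np.abs(t - t_star)
        return np.where(diff < 1.0, 0.5 * diff ** 2, diff - 0.5).sum(axis=-1)

    def rpn_loss(p, p_star, t, t_star, lam=1.0):
        n_cls = len(p)                         # simplified normalization constants
        n_reg = max(p_star.sum(), 1.0)
        cls_term = binary_cross_entropy(p, p_star).sum() / n_cls
        reg_term = (p_star * smooth_l1(t, t_star)).sum() / n_reg   # only positive anchors count
        return cls_term + lam * reg_term

    # Example: two anchors, the first one positive.
    p      = np.array([0.9, 0.2])              # predicted objectness
    p_star = np.array([1.0, 0.0])              # ground-truth anchor labels
    t      = np.array([[0.1, 0.0, 0.2, 0.1], [0.0, 0.0, 0.0, 0.0]])   # predicted offsets
    t_star = np.zeros_like(t)                  # target offsets
    print(rpn_loss(p, p_star, t, t_star))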

4. Experimental Evaluations
In the first place, the experimental settings and evaluation metrics for the proposed
method will be described. Secondly, this section elucidates the detailed information of the
datasets in the experiment. Finally, some experiments and results of the proposed method
are presented at the end of the section.

4.1. Experimental Settings


The study is implemented with the Linux Ubuntu 18.04 platform using the Tensorflow-
gpu-1.12 deep-learning framework, and an NVIDIA GeForce RTX 2080-Ti graphics processing
unit (GPU). The hardware specifications are displayed in Table 1.

Table 1. Detailed specifications of the experimental environment.

Items Specification
CPU Intel i9-9900 3.5 GHz 10 cores
Memory DDR4 2400 MHz 16 GB × 4
GPU NVIDIA GeForce RTX2080 Ti
Operating System Linux Ubuntu 18.04
Libraries Python3.6, Tensorflow-gpu-1.12, CUDA 9.1

The following is an empirical setup for the model's hyper-parameters: the number
of steps per epoch is the total training samples divided by the batch size, where the total
epochs are 100, and the learning rate is 0.001. Other training settings are shown in Table 2.

Table 2. Training parameters.

Items Specification
Number of classes 8 (including background)
Steps per epoch Training samples/batch size
Epochs 100
Learning rate 0.001
Detection minimum confidence 0.9
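For illustration, the settings in Table 2 could be expressed as a configuration object. The sketch below assumes the widely used open-source Matterport Mask R-CNN implementation, which this paper does not explicitly name, so the attribute mapping and the batch size are assumptions.

    # Hypothetical mapping of Table 2 onto a Matterport-style Mask R-CNN config.
    from mrcnn.config import Config

    TRAIN_SAMPLES = 13925      # augmented training images reported in Section 4.3.1
    BATCH_SIZE = 2             # assumed; the paper only states "training samples / batch size"

    class RoadMarkingConfig(Config):
        NAME = "road_markings"
        IMAGES_PER_GPU = BATCH_SIZE
        NUM_CLASSES = 8                                # seven road marking classes + background
        STEPS_PER_EPOCH = TRAIN_SAMPLES // BATCH_SIZE
        LEARNING_RATE = 0.001
        DETECTION_MIN_CONFIDENCE = 0.9

    config = RoadMarkingConfig()
    config.display()           # the 100 epochs in Table 2 are passed at training time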

To evaluate the performance of the proposed method, the widely used evaluation
metric Mean Average Precision (mAP) was employed to evaluate the model. The mAP is
related to the four metrics of IoU, Precision, Recall, and AP. The following describes each
metric and its formula:
1. IoU (Intersection over Union): Equation (8) displays the overlap ratio between the
ground-truth bounding boxes and predicted bounding boxes. The higher the overlap
ratio, the higher the accuracy of the predicted target object position. Essentially, the
predefined threshold is 0.5.

IoU = \frac{\text{Area of Overlap}}{\text{Area of Union}}   (8)

2. Precision: Precision (9) is the number of predicted objects that have been predicted as
positive, where true positive (TP) is the predicted object that matches the ground-truth
objects, and false positive (FP) is the positively predicted object that is actually false.

\text{Precision} = \frac{TP}{TP + FP}   (9)

3. Recall: Recall (10) is the number of actual objects that the model predicts correctly,
where false negative (FN) represents when the model predicts a negative object that is
actually positive.

\text{Recall} = \frac{TP}{TP + FN}   (10)

4. AP (Average Precision): the area under the precision–recall curve (PR curve).
According to PASCAL VOC [28] competitions after 2010, the calculation of AP (11)
has a modification that selects the maximum precision value at unique recall values. In
this case, the AP is computed by interpolating the precision across all points n, and the
interpolated precision at recall r_{n+1} takes the maximum precision measured at any recall
greater than or equal to r_{n+1}, as shown in (12); \rho(\tilde{r}) represents the measured precision at recall \tilde{r}.

AP = \sum_{n} (r_{n+1} - r_n)\, \rho_{interp}(r_{n+1})   (11)

\rho_{interp}(r_{n+1}) = \max_{\tilde{r} \ge r_{n+1}} \rho(\tilde{r})   (12)

5. mAP (Mean Average Precision): the average of the AP for every class. The mAP as
shown in (13) is a principal quantitative measurement for object detection.

mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i   (13)

4.2. Dataset

In this section, two datasets (the Ceymo and SVA datasets) for training and the Taiwan
dataset for testing will be introduced. The two open datasets were collected in Sri Lanka and
in the virtual world, Grand Theft Auto V (GTAV), respectively. The Ceymo dataset consists
of 2099 images for training and 788 for testing belonging to eleven classes, providing
polygon and bounding box annotations and pixel-level segmentation masks. In order to
satisfy our research demand, we only retained seven classes and selected 2172 images with
annotations. Additionally, the Surrounding Vehicles Awareness (SVA) dataset was collected
from the virtual world, GTAV, simulating real-world scenarios under abundant
weather conditions and different illuminations. We selected 1771 images from the SVA
dataset and labeled them into six classes. For the SVA dataset, the whole labeling procedure
was performed manually with the image annotation tool, VGG Image Annotator (VIA).
As for the Ceymo dataset, it provides label files annotated by Labelme, which is a different
format from VIA; therefore, the label files from the Ceymo dataset need to be transformed
into the format of VIA in view of uniformity. The total number of images within the two
datasets is 3943. Figure 5a,b depict some examples from each dataset. Subsequently, the
datasets were mixed into one dataset and were randomly divided into training sets and
validation sets of each dataset with a proportion of 7:3, containing, in total, 2785 training
data and 1158 validation data, respectively.

Figure 5. Example of images from different datasets: (a) Ceymo dataset; (b) SVA dataset; (c) Taiwan dataset.

The testing data were derived from YouTube videos in the field of the Taiwan road
scene. Figure 5c shows some examples of images from the Taiwan dataset. The data
contain diverse scenarios at different times during the day, including sunny, rainy, and
cloudy conditions. The testing data consist of 582 images in total. Figure 6 presents the
seven classes in the Taiwan data for model prediction, including straight arrow, left arrow,
right arrow, straight left arrow, straight right arrow, special lane, and pedestrian crossing,
which were labeled manually via the VIA tool.

Figure 6. Classes of Taiwan road images (seven classes): (a) straight arrow; (b) left arrow; (c) right arrow; (d) special lane; (e) straight left arrow; (f) straight right arrow; (g) pedestrian crossing.

4.3. Data Preprocessing

Before training the model and testing the images, the images and label files will be
preprocessed first. Four approaches of data augmentation will augment the total quantity
of images. In addition, the homography transformation based on Inverse Perspective
Mapping (IPM) is applied to the training data and testing data.
4.3.1. Data Augmentation
4.3.1. Data Augmentation
The experiment used Augmentation
4.3.1. Data the data augmentation method only on the training set, and the
The experiment used the data augmentation method only on the training set, and the
testing set consists ofThe theexperiment
original images. used the Data augmentation
data augmentation wasmethod
performed onlyby onchanging
thechanging
training set, and th
testing set consists of the original images. Data augmentation was performed by
the contrast and brightness
testing of
set consists the images in the experiment. The experiment was realized
the contrast and brightness of theofimages
the original
in theimages. Data augmentation
experiment. The experiment waswas performed
realizedby changin
by the Imgaug package. Theand Imgaug package is aimages
pythoninlibrary for image augmentation
by the Imgaug the contrast
package. The Imgaug brightness
packageof the is a python the experiment.
library for image Theaugmentation
experiment was realize
providing the
providing the bykeypoint
the Imgaug
keypoint and bounding
and package.
bounding Thebox
box transformation.
Imgaug package isThere
transformation. a python
There areare three
library threefunctions
for functions
image augmentatio
adopted in the experiment,
adopted in theproviding
experiment, inclusive
the inclusive
keypoint ofof “AddToBrightness”
and “AddToBrightness” function,
bounding box transformation. “LinearContrast”
function, “LinearContrast”
There are three functio
function, and
function, and “Fliplr”
“Fliplr” function.
adoptedfunction. The
The LinearContrast
in the experiment, inclusivefunction
LinearContrast function sets
setsthe
of “AddToBrightness” thealpha
alphavalue value totosam-
function, sample
“LinearContras
ple uniformly
uniformly within within the
the specific
function, specific interval
intervalfunction.
and “Fliplr” [0.4,
[0.4, 1.6].The 1.6].
In the In the experiment,
experiment,function
LinearContrast we
we set 1.6 set 1.6
asthe
sets theasalpha
the value to sam
contrast
contrast
value value ple
to adjust totheadjust the intensity
intensity
uniformly of images.
within of specific
the images. Figure
Figureinterval 7b
7b illustrates illustrates
[0.4, the experiment,
theInoutcome
1.6]. the outcome after
after contrast
we set 1.6 as th
contrast adjustment.
adjustment. The The AddToBrightness
AddToBrightness function function
converts converts
each
contrast value to adjust the intensity of images. Figure 7b illustrates theeach
image image
to a to a
color color
space space
with a
outcome aft
with a brightness-related
brightness-related channel, channel,
extracts extracts
the the
channel, channel,
and and
then then
adds
contrast adjustment. The AddToBrightness function converts each image to a color spa adds
or or
subtracts subtracts
the the
channel
value
channelbetween − 30 and -30
value between 20 to and convert
with a brightness-related
it back
20 to convert to theextracts
it back
channel,
original color space.
to the original
the channel, colorandFigure
space.then
7cadds
illustrates
Figure 7cor subtracts th
the dark image
illustrates reducing
the dark image the lighting,
reducing the and Figure
lighting, and 7d illustrates
Figure 7d the bright
illustrates the image.
bright The flip
image.
channel value between -30 and 20 to convert it back to the original color space. Figure
function can flip can
The flip function the images
flip the horizontally
images or vertically.
horizontally Fliplr means to reverse the images
illustrates the dark image reducingor thevertically.
lighting, Fliplr means
and Figure 7dtoillustrates
reverse the
the bright imag
from left to
images from leftright, towhich
right, horizontally
which flipped
horizontally the
flipped images.
the Figure
images. 7e
Figure shows
7e the
shows example
the ex- of
The flip function can flip the images horizontally or vertically. Fliplr means to reverse th
the horizontally flipped
ample of the horizontally image. Data augmentation is helpful to increase the amount of
images fromflipped image.
left to right, Datahorizontally
which augmentation is helpful
flipped to increase
the images. Figurethe 7e shows the e
data in the
amount limited
of data in dataset
the limited without
dataset extra
without labor to annotate
extra labor to the target
annotate theobjects
target in the image.
objects into increase th
ample of the horizontally flipped
After data augmentation, the number of images increased from 2785 to 13,925. image. Data augmentation is helpful
the image. After data augmentation,
amount of data in the the number
limited of images
dataset without increased
extra labor from to 2785
annotateto 13,925.
the target objects
the image. After data augmentation, the number of images increased from 2785 to 13,92
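To make this step concrete, the following Python sketch shows how the three Imgaug augmenters named above could be applied so that the bounding boxes stay aligned with the augmented images; the image size, box coordinates, and class label are hypothetical placeholders, and the exact configuration used in the experiment may differ.

# Minimal sketch of the augmentation step, assuming the imgaug package is
# installed; the sample image and bounding box below are placeholders.
import numpy as np
import imgaug.augmenters as iaa
from imgaug.augmentables.bbs import BoundingBox, BoundingBoxesOnImage

# The three augmenters named in the text.
augmenters = [
    iaa.LinearContrast(alpha=1.6),        # fixed contrast factor of 1.6
    iaa.AddToBrightness(add=(-30, 20)),   # darken or brighten within [-30, 20]
    iaa.Fliplr(1.0),                      # horizontal (left-right) flip
]

image = np.zeros((720, 1280, 3), dtype=np.uint8)   # placeholder road image
boxes = BoundingBoxesOnImage(
    [BoundingBox(x1=600, y1=520, x2=720, y2=580, label="straight_arrow")],
    shape=image.shape,
)

# Each augmenter returns the transformed image together with consistently
# transformed bounding boxes, so the augmented copies need no manual relabeling.
augmented_pairs = [aug(image=image, bounding_boxes=boxes) for aug in augmenters]

Generating several augmented copies of every original image in this way is what grows the training set from 2785 to 13,925 images.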

(a) (b) (c)
(d) (e)
Figure 7. Data augmentation. (a) Original image. (b) Contrast. (c) Dark. (d) Bright. (e) Flip.
4.3.2. Inverse Perspective Mapping
The homography transformation based on IPM is applied to the Ceymo dataset and the SVA dataset to transform the images into the bird's eye view and augment the dataset. The two datasets do not provide the camera parameters, so the perspective transformation adopts a planar projective transformation, choosing four points on the input image and the corresponding points on the output image to estimate the homography matrix. The input image size affects the learning of the object detection model. Compared to small images, large images not only require more training time and more memory to extract the internal features of the image but also contain more background noise, which has a negative impact on detection. Consequently, the images are cropped first in order to remove the irrelevant background so that the model can focus on the road surface, which reduces the effect of environmental conditions and unnecessary feature learning. Furthermore, the reserved ROI makes it easy to choose a source point and map it to a corresponding point in the target image through the perspective transformation. OpenCV provides perspective transformation functions that calculate the homography matrix for the images given the source and destination points. The "getPerspectiveTransform" function computes the projection matrix. Afterward, the top-view perspective transformation is performed using the "warpPerspective" function. Figure 8 provides an example of source points (xi, yi) in the cropped image and reference points (xi′, yi′) in the bird's eye view image. The pixel coordinates of the points are transferred from one plane to another through homography matrix multiplication.
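The following Python sketch illustrates this OpenCV workflow; the file name, crop region, and the four source/destination points are illustrative placeholders rather than the values used for the datasets in this paper.

# Minimal sketch of the IPM step with OpenCV; the file name, crop region, and
# point coordinates are placeholders, not the values used in the paper.
import cv2
import numpy as np

frame = cv2.imread("front_view.jpg")                 # hypothetical front-view image
cropped = frame[300:720, :]                          # keep only the road-surface ROI

# Four source points in the cropped image and the corresponding destination
# points on the bird's-eye-view plane.
src = np.float32([[200, 40], [1080, 40], [1280, 420], [0, 420]])
dst = np.float32([[320, 0], [960, 0], [960, 420], [320, 420]])

H = cv2.getPerspectiveTransform(src, dst)            # 3x3 homography matrix
# warpPerspective maps every pixel p to p' ~ H * p in homogeneous coordinates.
bev = cv2.warpPerspective(cropped, H, (1280, 420))   # bird's eye view image
cv2.imwrite("bird_eye_view.jpg", bev)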

Figure 8. Source points (xi, yi) and reference points (xi′, yi′) within the different perspectives of the images.
As for the testing phase, the proposed method also transforms the data into the bird's eye view using IPM. Figure 9 schematizes the location of source points (xi, yi) and destination points (xi′, yi′). The four yellow points are decided along the roadside so that the ROI will focus on the road area. To ensure the fairness of the experiment, the number of label samples after the transformation remains the same.

Figure 9. Source points (xi, yi) and reference points (xi′, yi′) in the Taiwan dataset.
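Keeping the number of label samples unchanged only requires warping the annotation coordinates with the same homography; the Python sketch below illustrates this, where the matrix H and the polygon coordinates are hypothetical placeholders.

# Minimal sketch: mapping annotation points with the same homography so the
# number of label samples stays unchanged; H and the polygon are placeholders.
import cv2
import numpy as np

H = np.array([[1.2, 0.1, -30.0],
              [0.0, 1.5, -80.0],
              [0.0, 0.002, 1.0]])               # hypothetical homography matrix

# Polygon of one road-marking annotation in the front-view image, shape (N, 1, 2).
polygon = np.float32([[[620, 380]], [[700, 380]], [[700, 460]], [[620, 460]]])

# perspectiveTransform multiplies each point by H in homogeneous coordinates and
# divides by the third component, returning exactly one point per input point.
warped = cv2.perspectiveTransform(polygon, H)
assert warped.shape == polygon.shape            # one-to-one: no samples gained or lost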


4.4. Experimental Results


This section describes the experimental results, including a quantitative analysis of the proposed method, a comparison with other models, and examples of the detection results. For training on an NVIDIA GeForce RTX 2080 Ti with a dataset of 2785 training
images and 1158 validation images, the training time for 100 epochs is approximately
1 h 20 min. The inference time for the validation set is about 1–2 min, with each image
taking approximately 70–120 ms to process.

4.4.1. The Cross-Field Detection


The evaluation of the proposed method for cross-field detection is measured in this section. As explained in Section 4.2, all training images derived from the two open datasets were mixed into one dataset. Tables 3 and 4 display the comparison of same-field and cross-field detection results. In the experiment, three kinds of images were used as training data. The first model (Table 3) was the front-view model, which primarily used front-view images as input, as in previous methods. Testing in the same field, the mAP was 98.22%. For cross-field detection, testing on the Taiwan road surface, the model still had detection ability, with the mAP reaching 57.90%. The datasets were captured with different camera angles, so data tested in a different field deviate in viewing angle, resulting in poor detection. Therefore, the second model (Table 4), which fed bird's eye view images into the deep-learning model, occupies an important place. Testing on same-field data, the model performance was 97.02%, while testing on the Taiwan data, the model performance was 85.57%, which is a good detection capability. The mAP was improved by 27.67% compared with the detection ability of the first model at the same viewing angle. This also proves that after normalizing the perspective into a bird's eye view, the results improve significantly even when tested on a different dataset. Although the mAP of cross-field detection is not as high as that of same-field detection, the model has the detection ability to be utilized in a different field. Using open data from different fields not only saves time in data collection but also proves the feasibility of cross-field detection.

Table 3. The comparison of the same-field and cross-field detection in the front view model.

Training Data    Same-Field (Front View Testing)    Cross-Field, Taiwan (Front View Testing)
Front view       98.22%                             57.90%

Table 4. The comparison of the same-field and cross-field detection in the bird’s eye view model.

Training Data      Same-Field (BEV Testing)    Cross-Field, Taiwan (BEV Testing)
Bird's eye view    97.02%                      85.57%

Although the model has the ability to detect objects on the Taiwan road, it still has room for improvement. Due to the different perspectives and the small objects at a far distance, some objects are difficult to predict; thus, IPM is a vital method to improve this situation. The experimental results are presented in the following sections.

4.4.2. The Comparative Results of Different Perspectives


Table 5 shows the mean average precision (mAP) of the proposed method and of the different models using the proposed method. The differences in accuracy after perspective transformation, tested on the Taiwan data, are listed in the table. First of all, for the first model, the mAP on the front-view images is 57.90% and the mAP on the bird's eye view images is 28.60%, which is quite a poor result. For the second model, the mAP on the front-view images is only 10.13%, a very poor outcome. After transforming the images to the bird's eye view, the mAP reaches 85.57%, an improvement of 27.67% (57.90–85.57%), which shows that the proposed
approach can be successful. The mAP of the third model testing the front-view image is
60.04%, while the mAP of the bird’s eye view image is 78.66%. From the previous results, it
can be noticed that the same perspective model testing on the same perspective images can
obtain a better performance, while the different perspective model testing on the different
view achieves a much lower mAP. Therefore, the third model, which takes the front-view
images and bird’s eye view images as input, plays an essential role. The model can detect
the different perspectives of images, which makes the model more robust, general, and
stable. The experiment has demonstrated that the perspective transformation is effective
for object detection on the road surface.
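As a reminder of how the reported numbers are aggregated, the short Python sketch below computes mAP as the unweighted mean of the per-class average precision values; the class names and AP values are placeholders, not results from this paper.

# Minimal sketch of the mAP aggregation; class names and AP values are placeholders.
def mean_average_precision(ap_per_class):
    # mAP is the unweighted mean of the average precision (AP) of every class.
    return sum(ap_per_class.values()) / len(ap_per_class)

example_ap = {"pedestrian_crossing": 0.91, "straight_arrow": 0.83, "special_lane": 0.79}
print(f"mAP = {mean_average_precision(example_ap):.4f}")   # prints mAP = 0.8433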

Table 5. Comparative results of different models and our method.

Models        Training Data         Testing: Front View    Testing: Bird's Eye View
Mask R-CNN    Front view            57.90%                 28.60%
              BEV                   10.13%                 85.57%
              Front view and BEV    60.04%                 78.66%
SOLO v2       Front view            15.32%                 9.9%
              BEV                   5.4%                   42.60%
              Front view and BEV    23.60%                 39.70%
YOLACT++      Front view            30.28%                 9.64%
              BEV                   28.84%                 67.56%
              Front view and BEV    31.48%                 64.76%

Moreover, the experiment also compared the proposed method with other instance
segmentation models, SOLO version 2 and YOLACT++ [29]. SOLOv2 is a box-free instance
segmentation model using ResNet-101 as the backbone and FPN for multi-scale prediction.
As for YOLACT++, it uses the RetinaNet architecture combined with ResNet-101 and FPN. The results in Table 5 indicate that Mask R-CNN with bird's eye view images significantly outperforms the other methods. In general, the performance when using bird's eye view images as input is better than that of the other two kinds of input. On
the other hand, it is believed that the proposed method can work well after perspective
transformation no matter which model is used.

4.4.3. The Experiment Results of Data Augmentation


Data augmentation is a method that increases the amount of data in a limited dataset.
Table 6 shows the comparative results of the model with data augmentation and without
data augmentation. The mAPs of the first model tested on the front view and the bird's eye view are 58.27% and 33.96%, respectively, increases of 0.37% and 5.36%. After data augmentation is performed on the second model, there is no significant increase on the front view, while the bird's eye view result decreases by about 5%. The mAP of the third model tested on the front-view images decreases by 2.33%, while that on the bird's eye view increases by less than 1%. It may be that there is no significant difference in the diversity of the data, so
the detection does not improve enough. To summarize, in comparison with the model
without data augmentation, the results can increase by 1–5% mAP, but there is no significant
improvement. However, there is no denying that data augmentation is a good approach to
augmenting the amount of data under limited resources.

Table 6. Comparative results of the model with data augmentation and without data augmentation.

Training Data         w/o Data Augmentation           w/ Data Augmentation
                      Front View      BEV             Front View      BEV
Front view            57.90%          28.60%          58.27%          33.96%
BEV                   10.13%          85.57%          10.85%          80.34%
Front view and BEV    60.04%          78.66%          57.71%          79.02%

4.4.4. Examples of Object Detection Results
Figure 10 shows the results of the front view and the bird's eye view for the same image, using the mixed bird's eye view images as training input data. Some cases in the first column of Figure 10a–c are the detection results of the front-view images, in which background is detected as a marking (false positive, FP), while the bird's eye view images are detected correctly. The object on the right-hand side in the first column of Figure 10d is incorrectly recognized as a special lane, while the bird's eye view image is predicted correctly.

(Figure 10 panels, left to right: Front View, Bird's Eye View, Ground Truth; rows (a)–(d).)
Figure 10. Examples of the bird's eye view model testing on different cases.

Figure 11 illustrates that some cases in the first row of the front-view images failed to detect the road markings at a far distance, such as the pedestrian crossing (a), the straight arrow (b,c), and the special lane (d). However, after projecting to the bird's eye view, the objects were accurately recognized even on a low-light rainy day (c). As displayed in Figures 10 and 11, it is demonstrated that the proposed method can successfully detect small road markings at a great distance by converting the images into a bird's eye view. In particular, the number of label samples does not change after the transformation, so the comparison between the two types of images is based on the same foundation.

(Figure 11 panels, left to right: Front View, Bird's Eye View, Ground Truth; rows (a)–(d).)
Figure 11. Examples of the front view with bird's eye view model testing on different cases.

5. Conclusions
In this paper, cross-field road markings have been successfully detected based on Inverse Perspective Mapping (IPM). After the perspective transformation, the distant objects on the road surface were detected, which solves the small object detection problem. First of all, the two open datasets derived from the virtual world and the real world were mixed for the training data, which reduces the data preprocessing time and cost. The research trained three kinds of models according to the different perspectives of the training images, which presented different results. The testing phase, compared with the preliminary study, used front-view images to test on the road environment. IPM was performed on the input images to transform them into the bird's eye view, which solves the "small objects at far distance" problem and the "perspective distortion of objects" problem, so that the model can clearly recognize objects on the road. Testing the front-view images and bird's eye view images with the three kinds of models demonstrates the effect clearly. The second model, tested on the images after the IPM approach, could reach an 85.57% mAP, an immense improvement of 27.67% (57.90–85.57%). The third model, tested on the front-view images and bird's eye view images, also showed a remarkable improvement of accuracy by 18.62% (60.04–78.66%). Moreover, for the sake of making the model more robust and stable, the data augmentation method was employed to generate more data in the limited dataset. In comparison with the model without data augmentation, the result could increase by 1–2% mAP. We utilized Mask R-CNN as the implemented model and compared it with other models, SOLO and YOLACT, to ensure the proposed method could be realized successfully.

Much remains to be done for future work; it is also anticipated that the work can add more classes of road markings detection, such as "stop", "slow", "speed limit", "bicycle sign", "lane lines", and so on, so it can produce a more reliable and stable detection model. In addition, the perspective transformation of the images is fulfilled by choosing four reliable points and then warping them onto the 2D plane. The four points are decided depending on the different datasets, either to find suitable points after removing the irrelevant background or to find points along the roadside, so the points follow the properties of the different datasets and lack uniformity. If a dataset contains the camera parameters, the homography matrix is easily computed, but it is hard to ensure that all datasets come with the camera information. Considering these limitations, perspective transformation based on a deep-learning method can be examined and undertaken, which
automatically produces the bird’s eye view images without manual selection and camera
parameters. Moreover, other data augmentation techniques can be tried in the study to
prove that the method is beneficial to augment the data. In spite of some limitations of the
conclusion, the contributions of the study are seen to be compelling enough to encourage
future investigation into both this and other road marking-related topics. We will also
explore YOLO-based methods and state-of-the-art instance segmentation methods, such as
YOLOv8, TSD-YOLO, and CBNetV2, to compare their performance with the approaches
used in this paper. This will help evaluate the strengths of bounding box-level detection
versus mask-level segmentation in road marking tasks. Additionally, an ablation study on
data augmentation techniques, including brightness adjustment, contrast enhancement, and
flipping, will be conducted to assess their individual contributions to model performance.
Furthermore, we plan to design data augmentation strategies specifically tailored to road
marking detection in diverse and complex environments such as wear and tear of road
markings, and extreme lighting scenarios like glare or low-light conditions. These directions
aim to build upon the current findings and further enhance the robustness and applicability
of road marking detection methods.

Author Contributions: Conceptualization, E.H.-C.L. and Y.-C.H.; Methodology, E.H.-C.L. and Y.-C.H.;
Software, Y.-C.H.; Validation, Y.-C.H.; Formal analysis, E.H.-C.L.; Investigation, Y.-C.H.; Resources,
E.H.-C.L.; Data curation, Y.-C.H.; Writing—original draft, E.H.-C.L. and Y.-C.H.; Writing—review &
editing, E.H.-C.L.; Visualization, Y.-C.H.; Supervision, E.H.-C.L.; Project administration, E.H.-C.L.; Funding
acquisition, E.H.-C.L. All authors have read and agreed to the published version of the manuscript.
Funding: This research was funded by National Science and Technology Council grant number NSTC
112-2628-M-006-008-MY2.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: Data are contained within the article.
Conflicts of Interest: The authors declare no conflict of interest.

References
1. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
2. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587.
3. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. arXiv
2015, arXiv:1506.01497. [CrossRef] [PubMed]
4. He, K.; Gkioxari, G.; Dollar, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE Conference on Computer Vision, Honolulu,
HI, USA, 21–27 July 2017; pp. 2961–2969.
5. Girshick, R. Fast R-CNN. In Proceedings of the IEEE Conference on Computer Vision, Santiago, Chile, 7–13 December 2015;
pp. 1440–1448.
6. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271.
7. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767.
8. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020,
arXiv:2004.10934.
9. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
10. Lin, T.Y.; Dollar, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
11. Bolya, D.; Zhou, C.; Xiao, F.; Lee, Y.J. YOLACT: Real-Time Instance Segmentation. In Proceedings of the IEEE/CVF International
Conference on Computer Vision, Seoul, Republic of Korea, 27–28 October 2019; pp. 9157–9191.
12. Wang, X.; Kong, T.; Shen, C.; Jiang, Y.; Li, L. SOLO: Segmenting Objects by Locations. In European Conference on Computer Vision;
Springer: Cham, Switzerland, 2020; pp. 649–665.
13. Wang, X.; Zhang, R.; Kong, T.; Li, L.; Shen, C. SOLOv2: Dynamic and Fast Instance Segmentation. Adv. Neural Inf. Process. Syst.
2020, 33, 17721–17732.
14. Tang, Z.; Boukerche, A. An Improved Algorithm for Road Markings Detection with SVM and ROI Restriction: Comparison with
a Rule-Based Model. In Proceedings of the 2018 IEEE International Conference on Communications (ICC), Kansas City, MO, USA,
20–24 May 2018; pp. 1–6.
15. Dalal, N.; Triggs, B. Histograms of Oriented Gradients for Human Detection. In Proceedings of the IEEE Computer Society Conference
on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–22 June 2005; Volume 1, pp. 886–893.
16. Hearst, M.; Dumais, S.; Osuna, E.; Platt, J.; Scholkopf, B. Support Vector Machines. IEEE Intell. Syst. Their Appl. 1998, 13, 18–28.
[CrossRef]
17. Lee, S.; Kim, J.; Yoon, J.S.; Shin, S.; Bailo, O.; Kim, N.; Lee, T.H.; Hong, H.; Han, S.H.; Kweon, I.S. VPGNet: Vanishing Point
Guided Network for Lane and Road Marking Detection and Recognition. In Proceedings of the IEEE Conference on Computer
Vision, Venice, Italy, 22–29 October 2017; pp. 1947–1955.
18. Hoang, T.M.; Nam, S.H.; Park, K.R. Enhanced Detection and Recognition of Road Markings based on Adaptive Region of Interest
and Deep Learning. IEEE Access 2019, 7, 109817–109832. [CrossRef]
19. Zhang, W.; Mi, Z.; Zheng, Y.; Gao, Q.; Li, W. Road Marking Segmentation based on Siamese Attention Module and Maximum
Stable External Region. IEEE Access 2019, 7, 143710–143720. [CrossRef]
20. Ye, X.Y.; Hong, D.S.; Chen, H.H.; Hsiao, P.Y.; Fu, L.C. A Two-Stage Real-Time YOLOv2-based Road Marking Detector with
Lightweight Spatial Transformation-Invariant Classification. Image Vis. Comput. 2020, 102, 103978. [CrossRef]
21. Jaderberg, M.; Simonyan, K.; Zisserman, A.; Kavukcuoglu, K. Spatial Transformer Networks. Adv. Neural Inf. Process. Syst. 2015,
28, 2017–2025.
22. Li, H.; Feng, M.; Wang, X. Inverse Perspective Mapping based Urban Road Markings Detection. In Proceedings of the 2012 IEEE
2nd International Conference on Cloud Computing and Intelligence Systems, Hangzhou, China, 30 October–1 November 2012;
Volume 3, pp. 1178–1182.
23. Greenhalgh, J.; Mirmehdi, M. Detection and Recognition of Painted Road Surface Markings. In Proceedings of the International
Conference on Pattern Recognition Applications and Methods, Lisbon, Portugal, 10–12 January 2015; pp. 130–138.
24. Matas, J.; Chum, O.; Urban, M.; Pajdla, T. Robust Wide-Baseline Stereo from Maximally Stable Extremal Regions. Image Vis.
Comput. 2004, 22, 761–767.
25. Bailo, O.; Lee, S.; Rameau, F.; Yoon, J.S.; Kweon, I.S. Robust Road Marking Detection and Recognition Using Density-Based
Grouping and Machine Learning Techniques. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision
(WACV), Santa Rosa, CA, USA, 24–31 March 2017; pp. 760–768.
26. Kang, J.; Jo, Y.; Lee, D.; Han, S.J.; Min, K.; Choi, J. Real-Time Road Surface Marking Detection from a Bird’s-Eye View Image
Using Convolutional Neural Networks. In Proceedings of the Twelfth International Conference on Machine Vision (ICMV 2019),
Amsterdam, The Netherlands, 16–18 November 2020; Volume 11433, pp. 599–604.
27. Jayasinghe, O.; Hemachandra, S.; Anhettigama, D.; Kariyawasam, S.; Rodrigo, R.; Jayasekara, P. CeyMo: See More on Roads-A
Novel Benchmark Dataset for Road Marking Detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of
Computer Vision, Waikoloa, HI, USA, 3–8 January 2022; pp. 3104–3113.
28. Everingham, M. The PASCAL Visual Object Classes Challenge 2010 (VOC2010) Development Kit. In Evaluation (chap. 3.4). 2010.
Available online: http://host.robots.ox.ac.uk/pascal/VOC/voc2010/htmldoc/devkit_doc.html#SECTION00044000000000000
000 (accessed on 8 May 2010).
29. Bolya, D.; Zhou, C.; Xiao, F.; Lee, Y.J. YOLACT++: Better Real-Time Instance Segmentation. IEEE Trans. Pattern Anal. Mach. Intell.
2020, 44, 1108–1121. [CrossRef] [PubMed]

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.
