Development and Investigation of Vision System for a Small-Sized Mobile Humanoid Robot in a Smart Environment
ABSTRACT
The conducted research aims to develop a computer vision system for a small-sized mobile humanoid robot. The decentralization of the servomotor control and computer vision systems is investigated from the hardware point of view, and the software level required to achieve an efficient, matched design is determined. A computer vision system based on an upgraded tiny-You Only Look Once (YOLO) network model is developed that allows objects to be recognized and identified and decisions to be made on interacting with them, which is recommended for crowded environments. During the research, a concept of the computer vision system was developed that describes the interaction between its main elements; on its basis, hardware modules were selected to implement the task. A structure of information interaction between the hardware modules is proposed, and a connection scheme is developed, on the basis of which a model of the computer vision system is assembled for the research, together with the algorithms and software required for solving the problem. To ensure a high speed of the computer vision system based on the ESP32-CAM module, the neural network was improved by replacing the Visual Geometry Group 16 (VGG-16) network, used as the base feature-extraction network of the Single Shot Detector (SSD) model, with the tiny-YOLO lightweight network model. This made it possible to preserve the multidimensional structure of the network model feature graph, which increases the detection accuracy, while significantly reducing the amount of computation generated by the network operation and thereby significantly increasing the detection speed for a limited set of objects. Finally, a number of experiments were carried out, in both static and dynamic environments, which showed a high identification accuracy.
KEYWORDS
social system; assistant robotics; autonomous robots; humanoid robot; computer vision system; neural networks; decision
making; crowd environment
The evolution of modern robotics is based on the introduction of new technologies in the field of artificial intelligence, decision making, and the processing of large amounts of data and visual information[1–6]. One of the basic requirements for mobile robots is the availability of a computer vision system, which makes it possible to transmit the state of the environment to the operator in real time[7–9]. This allows the operator to assess the situation and complete the assigned tasks. However, some solutions in this area limit the autonomy of the mobile robot and make it dependent on the presence of the operator. To ensure the autonomy of such robots, it is necessary not just to implement a system for broadcasting a video stream of the environment of a mobile robot, but to develop a system for identifying objects in real time, with the ability to use the obtained data to develop a strategy for behavior and decision making.
In Ref. [10], Bavelos et al. proposed a computer vision system for an industrial mobile robot that can autonomously move around the workshop. The peculiarity of the proposed solution lies in the fusion of data from the sensors of the mobile robot with 2D and 3D sensors located in the workshop. This, according to the authors, allows the mobile robot to perceive the ongoing processes around it in real time[10]. This solution makes it possible to more accurately describe the events that occur in the working area of a mobile robot, but the proposed method cannot be used in open areas, which ties it to a specific location.
Al-Obaidi et al.[11] conducted research to develop a mobile robot for remote security monitoring in factories, offices, and airports. The main goal of that study, according to the authors, is the development of an energy-saving mobile robot with a computer vision system and environmental monitoring based on an array of sensors (temperature and obstacle presence). To implement the computer vision system, Al-Obaidi et al.[11] used a Raspberry Pi with a connected Camera Board, which made it possible to broadcast a video stream from the mobile robot to a control system, while an ATMEGA 328P microcontroller based on Arduino was used as the motor control system. Considering the proposed solution, it can be seen that the use of a Raspberry Pi as a computer vision system and an Arduino as a motor control system is rational.
1 Department of Computer Science, College of Information Technology, Amman Arab University, Amman 11937, Jordan
2 Faculty of Engineering, Zarqa University, Zarqa 2000, Jordan
3 College of Engineering, University of Business and Technology, Jeddah 21448, Saudi Arabia
4 Department of Media Systems and Technology, Kharkiv National University of Radio Electronics, Kharkiv 61166, Ukraine
5 Department of Electrical and Electronics Engineering, Faculty of Engineering and Architecture, Nişantaşı University, Istanbul 34398, Türkiye
Address correspondence to Hani Attar, Hattar@zu.edu.jo
© The author(s) 2025. The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution
4.0 International License (http://creativecommons.org/licenses/by/4.0/).
However, the Raspberry Pi has specific power requirements (DC 5 V, 1 A), and failure to comply with them can lead to unstable operation of the mobile robot. Therefore, if Al-Obaidi et al.[11] expanded the functionality of the computer vision system by introducing neural networks for object recognition and decision-making systems, it would entail either an unstable robot or the need to completely redesign the power supply system of the mobile robot.
In Ref. [12], an example of the development of a multi-sensor recognition system based on Light Detection and Ranging (LiDAR) for the mobile robot iRobotic Packbot 501 was given. The proposed solution allows, in synthesis with a computer vision system using LiDAR, a 3D map of the environment to be built[12]. However, the implementation of such a system requires the Robot Operating System (ROS), which in turn requires high-performance hardware based on expensive microcomputers; this correspondingly increases the cost and makes it impossible to use the solution on small-sized robots.
Table 1 compares some of the existing methods for obtaining visual information about the mobile robot environment, on the basis of which an intelligent decision-making system can be developed. As can be seen from Table 1, the following methods were chosen for analysis:
2D imaging (camera) can track moving objects in real time and locate them in the direct line of sight of the mobile robot. This allows objects to be detected in dynamic space and makes it possible to determine their location and identity using Artificial Intelligence (AI).
3D vision is a complex method based on the synthesis of two cameras or a laser scanner located at different angles. As a result, there are high requirements for the hardware and software used to visualize the surrounding space of a mobile robot. Such a system detects the presence of objects with high accuracy and with a small real-time delay, but without the possibility of identifying them.
The ultrasonic method uses a device that measures the time interval between the emission and detection of a reflected sound wave, which allows the presence of objects and their distance from the sensor to be determined, so that a decision can be made based on the results. The ultrasonic device works in real time and is easy to implement and program. However, it does not allow visualization of the environment and therefore cannot identify objects; it only detects the presence of objects near the sensor.
The infrared method detects Infrared (IR) rays emitted by an object. It can also project IR light onto a target and receive the reflected light to determine its distance or proximity. Infrared sensors are economical, can track infrared light over a large area, and work in real time. The method determines the presence of an object but cannot identify it, and it is simple enough to implement in both software and hardware.
Thus, to solve the problem of developing a computer vision system for small-sized robots with the possibility of implementing an AI-based identification system, we choose to use the 2D imaging method. To conduct experimental research, a prototype of a humanoid robot will be used as the actuator for interaction with the outside world; this will expand the possibility of implementing manipulation capabilities for interacting with objects and make it possible in the future to implement a group control and decision-making system to achieve the set tasks.
Thus, in this study, the authors set themselves the task of developing a computer vision system using AI technologies to recognize and identify objects in the robot's work area. At the same time, strict requirements are put forward for the overall dimensions and computing power of the hardware. This makes it possible to use the system inside small-sized intelligent humanoid robots, as well as in robots for rescue missions in areas of man-made disasters. At this point in time, existing similar solutions use expensive hardware such as the Nvidia Jetson Orin Nano or the Raspberry Pi 4 Model B 4GB. In addition, the listed hardware has increased requirements for the power circuit, as well as large overall dimensions, which do not meet the requirements for a small-sized humanoid robot.

1 General Concept of Mobile Robot Computer Vision Control System
Analyzing the specifics and complexity of the developed mobile humanoid robot, the authors decided to apply a decentralized approach to the control system by dividing the humanoid robot motion control system and the computer vision system into separate modules. The grounds for this decision were as follows: the bandwidth of the video channel increases, which reduces the signal delay time in the computer vision system; and the computational load on the microcontroller module of the servomotor control system is reduced, which speeds up the execution of commands. All actions for recognition and identification using AI, as well as the algorithms for making decisions and interacting with the outside world and objects, are implemented on a personal computer (notebook).
At the first step in solving the problem, it is necessary to develop a general concept for the control system implementation, which should include a computer vision module and will be used to transfer the video stream to the laptop and return commands for interacting with objects (a ball). Figure 1 shows the general concept of the control system implementation for a mobile humanoid robot.
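To make this concept more concrete, the following minimal Python sketch shows one possible shape of the notebook-side loop implied by Fig. 1: frames arrive from the robot over the wireless network, are analyzed, and a command is returned. All names and the command set here are illustrative assumptions, not the authors' implementation.

def receive_frame():
    """Placeholder: obtain the next video frame broadcast by the robot's camera module."""
    ...

def analyze(frame):
    """Placeholder: recognize and identify objects in the frame (e.g., a ball)."""
    return []

def choose_action(objects):
    """Placeholder: decide how to interact with the identified objects."""
    return "idle"

def send_to_robot(action):
    """Placeholder: return the command to the robot's control module over the LAN."""
    ...

def control_loop():
    # The notebook repeats this cycle for as long as the robot streams video.
    while True:
        frame = receive_frame()
        objects = analyze(frame)
        send_to_robot(choose_action(objects))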
Table 1 Methods for obtaining visual information about mobile robot environment comparison.
Parameter 2D imaging (camera)[13] 3D vision (LiDAR)[14] Ultrasonic method (sensor)[15] Infrared method (sensor)[16]
Recognition accuracy ++ +++ + −
Resolution ++ +++ − −
“Dead” zone + − ++ ++
Field of View (FOV) ++ +++ + +
Dimension + +++ + +
Price + +++ + +
Hardware requirements ++ +++ + +
Software implementation complexity ++ +++ + +
Object identification +++ ++ − −
Note: +++: high rates; ++: middle rates; +: low rates; −: impossible to implement.
The basic idea of this concept is the development of a decentralized control system, that is, its division into two subsystems. The very strong restrictions on the overall dimensions of the humanoid robot became the grounds for this decision. As a result, four main modules will be placed on board the humanoid robot: a wireless communication module, a camera module, a control module, and a power module.
The wireless communication module allows data to be exchanged between the humanoid robot and the control and decision-making system using wireless Local Area Network (LAN) technologies[17].
The camera module performs the function of broadcasting the mobile robot environment using wireless network technologies.
The control module is designed to execute commands that it receives from the control and decision-making system located on a personal computer (notebook). An additional task of the control module is to provide the ability to connect servomotors, which give the humanoid robot the freedom of action needed to perform its tasks.
The power module is a small-sized battery assembly, which is used to power the humanoid robot.
On the basis of a personal computer (notebook), it is proposed to create the control and decision-making system; this decision is justified by the fact that the development of an object recognition and identification system, as well as a decision-making system, requires "serious" computing power that cannot be placed on board a humanoid robot. This subsystem consists of the following modules: wireless communication module, streaming video processing module, image recognition module, object identification module, and decision module.
The streaming video processing module is designed for primary processing of the streaming video received from the humanoid robot and preparing it for the recognition module.
The image recognition module processes the received streaming video, divides it into frames, and recognizes the presence of objects in each frame.
The object identification module uses neural networks to identify the objects near the robot, i.e., the objects located in the robot's working zone. The object identification module outputs the suggested name of the identified object and the confidence of the suggestion.
The decision module analyzes the results obtained from the object identification module and, depending on the underlying algorithm, performs interaction actions with the object.

2 Selection of Hardware Modules for the Robot's Computer Vision Control System
The humanoid mobile robot "ViVi" by Doctors of Intelligence and Technology LTD is used as a model for conducting the research. Its overall dimensions are: height 200 mm; width 72.2 mm; arm length (from shoulder to hand) 105 mm; movement is provided by 18 servomotors of the EMax ES08MA II model.
Based on the concept proposed above for a computer vision system for a humanoid robot (Fig. 1), at the beginning of development it is necessary to define the restrictions that are imposed when choosing the hardware modules on the basis of which the camera module and control module will be implemented.
Dimensions: the camera module is no more than 30 mm × 45 mm × 15 mm (width × length × height); the control module is no more than 54 mm × 51 mm × 7 mm (width × length × height).
Power: the camera module working voltage is 5 V with an average current of 180 mA (max); the control module working voltage is 4.5–6.4 V with an average current of 80 mA (min).
Wireless communication: WiFi 802.11 b/g/n support for both the camera module and the control module.
The required camera module features are as follows:
(1) Connection of a camera with the ability to transmit streaming video with a minimum resolution of 800 pixel × 600 pixel at 30 frames per second (fps);
(2) The ability to create a wireless network access point based on the module.
The required control module functions are as follows:
(1) Power supply support for up to 18 Emax ES08MA II servomotors[18];
(2) The maximum average current (when all 18 servomotors are working) can reach 2800 mA.
The rationale for the selected restrictions and their values is the overall dimensions of the humanoid robot: 72 mm × 200 mm × 80 mm in total (width × length × height) and 40 mm × 55 mm × 80 mm for the body (width × length × height), which should accommodate the control module, the camera module, and the power module. As a result, the developed solution has a rigid framework for the analysis and selection of hardware modules.
First, we analyze and select a hardware module for the implementation of the camera module. Based on the above limitations, the following modules were selected; they are shown in Fig. 2, and their parameters are presented in Table 2.
Having carried out a comparative analysis of the parameters of the hardware modules (Table 2) for the implementation of a computer vision system for a humanoid robot, we can identify the following advantages and disadvantages for solving the problem.
The OV7670 VGA hardware module fits the overall dimensions and has an attractive price (5 US dollars). However, it has a number of disadvantages, such as the absence of a wireless network module and of a microcontroller-based module for primary processing of the video stream; to use this module, a Wi-Fi module and a control module (Arduino Mini, STM32) have to be purchased separately, which accordingly increases not only the cost but also the complexity of the design, as well as the overall dimensions. At the same time, the speed of such a microcontroller will be significantly lower (16–32 MHz) than that of the pyAI-K210 and ESP32 modules. Comparing the pyAI-K210 and ESP32 hardware modules, it can be seen that these two modules have similar parameters in terms of microcontroller speed and camera characteristics. The main differences between them are the overall dimensions and weight of the hardware module, but the price of the ESP32-Cam is about 4.5–5.0 times lower than that of the pyAI-K210, and it has a wider distribution and better library support for working with modules and sensors. Based on this comparison, an ESP32-Cam will be used to implement the camera module onboard the humanoid robot.
The next step is to select a hardware module for the implementation of the control module of the humanoid robot. Based on the above restrictions, the following hardware modules fell into the analysis area; they are presented in Fig. 3, and their parameters are given in Table 3.
Analyzing the main parameters of the control system hardware modules, it can be seen that the 16 road steering gear control board servo and the neue version 32 Kanal roboter servo control board modules are based on a 32-bit STM microcontroller. The neue version 32 Kanal roboter servo control board also has a module for wireless data transmission via Wi-Fi and Bluetooth, which fits certain restrictions, but these two modules do not fit in terms of overall dimensions or the number of servomotors that can be connected to the board.
(a) Neue version 32 Kanal roboter servo control board[25]; (b) 16 road steering gear control board servo[26]; (c) Humanoiden roboter control board[27]
Fig. 3 Hardware modules for the humanoid robot motion control system implementation.
Table 3 Comparison of the hardware modules’ main parameters for the control system development.
Parameter | Neue version 32 Kanal roboter servo control board[25] | 16 road steering gear control board servo[26] | Humanoiden roboter control board[27]
CPU | STM 32 bit | STM 32 bit | ESP8266MOD
Flash capacity (Mbit) | 16 | 16 | 4
Wireless communication | Wi-Fi and bluetooth | − | Wi-Fi
Number of simultaneously controlled servomotors | 32 | 16 | 18
Servomotor input voltage (V) | 4.2–7.2 | 4.2–7.2 | 4.2–7.2
Operation voltage (V) | 5 | 5 | 3.3
Communication protocol | UART | USB-TTL | TTL programmer
Board dimension (mm×mm) | 45×63 | 36×43.5 | 24×16
Weight (g) | 8 | 12 | 12
Price (US dollar) | 20 | 5 | 15
On the other hand, the control system hardware based on the neue version 32 Kanal roboter servo control board (32 Emax ES08MA II servomotor channels) or on the 16 road steering gear control board servo will not allow all degrees of freedom of the humanoid robot to be realized. Considering the parameters of the humanoiden roboter control board, it can be seen that it is not only suitable in terms of overall dimensions, but is also distinguished by the presence of a Wi-Fi module onboard a control system based on the ESP8266MOD microcontroller. This makes it possible to create an access point on its basis and simplifies the process of transmitting commands to the humanoid robot from the control and decision-making system. As a result, to implement the control system of the humanoid robot, the authors suggest using the humanoiden roboter control board, priced at 15 US dollars.

3 Structural Diagram of Interaction Between Main Elements of Mobile Robot Computer Vision Control System
After choosing the hardware modules, it is necessary to develop a general structure for the information interaction of the camera module and the control module with the system of management and decision making that controls the humanoid robot. This structure is shown in Fig. 4.
Fig. 4 Structure of the information interaction of the camera module and the control module with the system of management and decision making to control a humanoid robot.
Based on the developed structure of information interaction between the software and hardware modules, and taking into account the specifics of developing programs for a microcontroller, it is proposed to use the following set of libraries supported in the Arduino IDE development environment[28] to implement the computer vision system of the mobile humanoid robot based on the ESP32-Cam and a neural network for object recognition and identification:
WebServer makes it possible to implement a simple web server with support for only one client at a time, which allows requests to be processed using the Get and Post methods. The web server built into the IoT Development Framework from Espressif Systems (ESP-IDF) will be used. This web server is based on the Transmission Control Protocol/Internet Protocol (TCP/IP) stack and allows developers to create and configure web interfaces to manage and interact with the ESP32-CAM[29].
Wi-Fi allows connection to an existing local network, using authentication via SSID and password.
ESP32-Cam is a driver library for working with the OV2640 camera on the ESP32-Cam hardware module.
For control based on the humanoiden roboter control board hardware module, it is proposed to use the following libraries:
WebServer and Wi-Fi perform functions similar to those in the computer vision system.
HTTP server node JS is a library that makes it possible to receive and process commands to control the humanoid robot from a personal computer.
Servo is a library that allows the rotation angles of the Emax ES08MA II servomotors to be controlled.
Battery is a modular assembly of batteries with minimum parameters of 5 V and 3000 mA.
HC-SR04 is a hardware module, an ultrasonic sensor for calculating the distance to an object in the working area of the computer vision system[30].
On the side of the personal computer (notebook), the humanoid robot control and decision-making system is implemented in the Python 3.10 language[31] in the PyCharm development environment[32] using the following libraries:
OpenCV is a computer vision library designed to analyze, classify, and process the image received from the ESP32-Cam.
NumPy is a library for processing large multidimensional arrays and matrices with the ability to structure them.
Urllib, an HTTP client support library, makes it possible to obtain the URL address used for transmitting the video stream from the ESP32-Cam.
TensorFlow is a machine learning library designed to solve the problems of building and training a neural network in order to automatically find and classify patterns found in the working environment of the humanoid robot.
YOLO v.3 is a neural network library (script) for object recognition and identification.
CUDA 10.1.105 is Nvidia's technology library that uses the GPU to accelerate parallel computing in object recognition and identification. In the developed computer vision system, the choice of the CUDA 10.1.105 library is due to the following reasons: parallel computing on the GPU, and convenience and compatibility with machine learning and computer vision frameworks such as TensorFlow, PyTorch, and OpenCV. The use of CUDA allows the execution of algorithms and data processing to be parallelized, which can lead to a significant acceleration of the process.
A feature of the ESP32 microcontroller is the use of the special ESPAsyncWebServer library, which provides the functionality of a web server and supports one access point, that is, one operator. Based on this limitation, one access point transmits the streaming video to the computer vision system, and the second one, implemented on the humanoiden roboter control board, controls the robot.
It is also necessary to implement the computational function "Calculating the Distance to an Object", which calculates the distance to the detected object based on the ultrasonic method, and the "Decision Selection Algorithm" block, which contains the behavioral actions that depend on the name of the identified object, the distance, and the objective functions and tasks of interaction with it. The results of the selected interaction with the object are transferred to the "Sending commands by protocol HTTP" block, in which a sequence of commands is formed and transmitted via HTTP using the Post and Get methods to the humanoid robot control board hardware module.
For the convenience of visualization and understanding of the type of signal transmitted in the circuit, it was decided to designate the signals with the following colors: green, the digital signal for data transmission; black, the power circuit "ground" or "gnd"; red, the +5 V power supply or "vcc". The electrical circuit diagram for connecting the ESP32-Cam module to the HC-SR04 ultrasonic sensor is shown in Fig. 5.
[Fig. 5: wiring of the HC-SR04 sensor lines (VCC, GND, Trig, Echo) to the ESP32-CAM pin header.]
Fig. 5 Wiring diagram for HC-SR04 sensor to ESP32-Cam.
The assembled test layout of the computer vision system with the HC-SR04 ultrasonic sensor connected is shown in Fig. 6a (front view) and Fig. 6b (back view). Figure 6c is a front view of the humanoid robot body, and Fig. 6d shows the location of the modules inside the case.
(a) Front view; (b) Back view; (c) Front view of humanoid robot body; (d) Location of the modules inside the case.
Fig. 6 Placement of a computer vision system in a humanoid robot.
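As an illustration of how the Urllib, NumPy, and OpenCV libraries listed above cooperate on the notebook side, the following sketch fetches a single JPEG frame from the ESP32-Cam web server and decodes it into an OpenCV image. The IP address is an assumption that depends on the address allocated by the local router, and the endpoint name corresponds to one of the handlers registered in the firmware described below.

import urllib.request
import numpy as np
import cv2

ESP32_CAM_URL = "http://192.168.1.50/cam-mid.jpg"  # assumed local IP of the ESP32-Cam

def read_frame(url: str = ESP32_CAM_URL) -> np.ndarray:
    # Urllib obtains the JPEG bytes served by the ESP32-Cam web server
    jpeg_bytes = urllib.request.urlopen(url, timeout=2).read()
    # NumPy wraps the raw bytes so that OpenCV can decode them
    buffer = np.frombuffer(jpeg_bytes, dtype=np.uint8)
    # OpenCV decodes the JPEG into a BGR image ready for recognition
    frame = cv2.imdecode(buffer, cv2.IMREAD_COLOR)
    if frame is None:
        raise RuntimeError("Frame could not be decoded; check the camera URL")
    return frame

if __name__ == "__main__":
    img = read_frame()
    print("Received frame with shape", img.shape)  # height, width, channels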
4 Development of Computer Vision System Algorithm for Mobile Humanoid Robot and Its Software Implementation
The next step in the development of a computer vision system for a humanoid robot is the development of an algorithm and a set of programs for the interaction between the ESP32-Cam microcontroller and the HC-SR04 sensor on the one hand, and for the system of management and decision making (notebook) on the other hand. An enlarged algorithm for the operation of the computer vision system is shown in Fig. 7.
Let us briefly describe the purpose of the main blocks of the algorithm (Fig. 7) that are implemented on the basis of the ESP32-Cam hardware module with the HC-SR04 sensor (Fig. 6). When power is connected to the ESP32-Cam module, the libraries are initialized (Fig. 4), after which the local wireless network is searched using the authentication data (SSID and password). If the connection fails, an error message is displayed in the port monitor (of the Arduino IDE during setup); upon successful connection, the local network access point (router) allocates a local dynamic IP address to the ESP32-Cam. The video stream and the data from the sensor will be transmitted to the system of management and decision making (notebook) at this address. After receiving the local IP address, the 2Mp OV2640 camera is activated. Taking into account the peculiarities of the operation of microcontrollers, until the power is turned off on the ESP32-Cam, the video stream from the OV2640 camera, as well as the results of calculating the distance values from the HC-SR04 ultrasonic sensor, will be continuously transmitted through the received IP address in a cycle. It is worth noting that all calculations of the distance to the object using the sensor are carried out on the basis of the ESP32-Cam module, and the calculation results in centimeters are transmitted to the system of management and decision making.
The operation algorithm of the system of management and decision making (notebook) is built according to the following principle. After starting the program, at the first stage the libraries are initialized (Fig. 4). After that, the program checks for the presence of an IP address assigned to the ESP32-Cam in the local network. If the given IP address is not found, the user is given an error message. If the connection is successful, the program starts receiving the video stream and the distance calculation results from the HC-SR04 sensor. The next step of the program is to solve the problem of recognizing the presence of an object in the working area of the humanoid robot. If an object is in the working area, it is identified, and the name of the object, the value of the matching probability and the recognition time, as well as the distance to it, are displayed. Based on the received data (object name and distance), a decision-making mechanism for interacting with the object in accordance with the tasks can be implemented; as a result, the humanoid robot control board module receives commands for execution. At the same time, all the necessary information (name of the object, probability of coincidence, and distance to it) is visualized to the user in real time in the Human-Machine Interface (HMI) window. The program functions until the user finishes working with it or the power is turned off on the ESP32-Cam module.
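A minimal sketch of the decision-making step just described is given below; the object labels, distance thresholds, command names, and the control board URL are illustrative assumptions, since the source only states that commands are formed and sent to the control board using the Get and Post methods.

import urllib.parse
import urllib.request

CONTROL_BOARD_URL = "http://192.168.1.60/command"  # assumed address of the control board

def choose_command(label: str, distance_cm: float) -> str:
    # Simplified "Decision Selection Algorithm": act only on the tracked ball
    if label != "robo ball":
        return "search"            # keep scanning the working area
    if distance_cm > 50:
        return "walk_forward"      # approach the detected ball
    return "kick"                  # ball is close enough to interact with

def send_command(command: str) -> int:
    # "Sending commands by protocol HTTP": a POST request to the control board
    data = urllib.parse.urlencode({"cmd": command}).encode()
    with urllib.request.urlopen(CONTROL_BOARD_URL, data=data, timeout=2) as resp:
        return resp.status

# Example: an identified ball 80 cm away leads to a "walk_forward" command
status = send_command(choose_command("robo ball", 80.0))
print("Control board answered with HTTP status", status)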
[Fig. 7 presents two flowcharts of the enlarged algorithm: on the ESP32-Cam side (connect to the wireless network, get a local IP address, activate the OV2640 camera, and broadcast the streaming video and the HC-SR04 distance readings in a closed loop) and on the notebook side (check for the video stream, recognize objects in the frame, identify objects in the frame, and make decisions on interaction with the object).]
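Before turning to the firmware, the pulse-width-to-distance conversion that the ESP32-Cam performs can be illustrated with a short calculation. It uses the standard HC-SR04 relationship in which sound travels roughly 29.1 µs per centimeter and the echo covers the distance twice; the same formula appears in the loop() code below.

SOUND_US_PER_CM = 29.1  # approximate time for sound to travel 1 cm, in microseconds

def pulse_to_distance_cm(echo_pulse_us: float) -> float:
    # The echo pulse covers the robot-object distance twice (out and back),
    # so half of the pulse width corresponds to the one-way distance.
    return (echo_pulse_us / 2) / SOUND_US_PER_CM

# Example: a 2910 µs echo pulse corresponds to about 50 cm to the object
print(round(pulse_to_distance_cm(2910), 1))  # -> 50.0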
When creating the program for the ESP32-Cam module, we will use the Arduino IDE 1.8.19 development environment with the obligatory installation of the following packages in the boards manager: Arduino AVR Boards ver. 1.8.5 and ESP32 ver. 2.0.4 by Espressif Systems.
At the first step, we write a program for broadcasting a video stream from the ESP32-Cam and connect the libraries:
#include <WebServer.h>
#include <WiFi.h>
#include <esp32cam.h>
Let us define the constants that contain the login and password for ESP32-Cam authentication in the local network (set by the developer):
const char* WIFI_SSID = "Crow";
const char* WIFI_PASS = "notebookpuk100";
Let us define the port for connecting to the web server:
WebServer server(80);
We indicate the pins through which the HC-SR04 sensor is connected to the ESP32-Cam:
#define trigPin 13
#define echoPin 15
For the convenience of testing the computer vision system with different resolutions obtained from the 2Mp OV2640 camera, we define the following working resolutions:
static auto loRes = esp32cam::Resolution::find(320, 240);
static auto midRes = esp32cam::Resolution::find(350, 530);
static auto ords = esp32cam::Resolution::find(800, 600);
That is, when testing the computer vision system, it will be possible to change the image quality in the program on the side of the system of management and decision making. In the future, it will be possible to select only one streaming video quality for the tasks, which will make it possible to reduce the load on the microcontroller.
Let us create a serveJpg() function to send the streaming video as a sequence of jpg frames, with an error handler in the form of a 503 code for the server[33]:
{auto frame = esp32cam::capture();
if (frame == nullptr) {
Serial.println("CAPTURE FAIL");
server.send(503, "", ""); // the server is not ready to process the request right now, because it is overloaded or under maintenance
return;}
Serial.printf("CAPTURE OK %dx%d %db\n", frame->getWidth(), frame->getHeight(), static_cast<int>(frame->size()));
server.setContentLength(frame->size());
server.send(200, "image/jpeg"); // successful request, the server is running and the information is being transmitted
WiFiClient client = server.client();
frame->writeTo(client);}
Let us create the functions void handleJpgLo(), void handleJpgHi(), and void handleJpgMid() to check the possibility of streaming video at the three resolutions loRes, midRes, and the highest resolution (ords). An example of the implementation of the void handleJpgLo() function is presented below:
{if (!esp32cam::Camera.changeResolution(loRes)) {
Serial.println("SET-LO-RES FAIL");}
serveJpg();}
The functions void handleJpgHi() and void handleJpgMid() are implemented similarly.
The next step is to describe the necessary data in the void setup() function; we skip the description of the standard port settings and present the most interesting points.
Set the signal levels on the pins to which the HC-SR04 sensor is connected:
pinMode(trigPin, OUTPUT);
pinMode(echoPin, INPUT);
Set up the OV2640 camera parameters and check its operability:
{using namespace esp32cam;
Config cfg;
cfg.setPins(pins::AiThinker);
cfg.setResolution(ords);
cfg.setBufferCount(2);
cfg.setJpeg(80);
bool ok = Camera.begin(cfg);
Serial.println(ok ? "CAMERA OK" : "CAMERA FAIL"); }
It is worth paying attention to the last line, which allows technical information to be displayed in the port monitor of the Arduino IDE development environment when it is configured.
Connect to the local wireless network using the data entered in WIFI_SSID and WIFI_PASS:
WiFi.persistent(false);
WiFi.mode(WIFI_STA);
WiFi.begin(WIFI_SSID, WIFI_PASS);
while (WiFi.status() != WL_CONNECTED);
Upon successful connection, the received local IP address of the ESP32-Cam module is displayed in the Arduino IDE port monitor; it will be needed to receive the video stream in the system of management and decision making:
Serial.print("http://");
Serial.println(WiFi.localIP());
Let us register the three handlers, with the help of which the streaming video is transmitted in the form of a sequence of jpg files with different resolutions, and turn on the server:
server.on("/cam-lo.jpg", handleJpgLo);
server.on("/cam-hi.jpg", handleJpgHi);
server.on("/cam-mid.jpg", handleJpgMid);
server.begin();
The last step is to write the void loop() function, which is executed in a cyclic mode and is a feature of working with microcontrollers. Here we prescribe the program code for the operation of the HC-SR04 sensor and the calculation of the distance:
{digitalWrite(trigPin, LOW); // apply a low level signal to pin 13
delay(200); // make a delay of 200 ms
digitalWrite(trigPin, HIGH); // apply a high level signal to pin 13
// this creates a sequence of ultrasonic pulses on trigPin, which will then be captured as a signal reflected from objects on echoPin
delay(200);
digitalWrite(trigPin, LOW);
duration = pulseIn(echoPin, HIGH);
distance = (duration/2)/29.1; // distance calculation (in centimeters)
Serial.print(distance);
Serial.println("cm"); // information output to the port monitor (for testing)
server.handleClient();} // launch the web server
Now it is necessary to develop the software for recognition and identification of objects and its visualization in the system of management and decision making block. To do this, we will use the PyCharm[33] development environment and the Python 3.10 language.
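The notebook-side recognition and identification step can be sketched with OpenCV's DNN module as shown below; the configuration, weight, and class-name files for the tiny-YOLO model, as well as the thresholds, are assumptions used for illustration, not the exact files from the authors' dataset.

import cv2

# Assumed file names for the lightweight detector and its labels
net = cv2.dnn.readNetFromDarknet("tiny-yolo.cfg", "tiny-yolo.weights")
model = cv2.dnn_DetectionModel(net)
model.setInputParams(size=(416, 416), scale=1 / 255.0, swapRB=True)

with open("robo.names") as f:          # e.g., a file containing the "robo ball" label
    class_names = [line.strip() for line in f]

def identify(frame):
    """Return (label, confidence, box) for each detected object in the frame."""
    class_ids, scores, boxes = model.detect(frame, confThreshold=0.5, nmsThreshold=0.4)
    return [(class_names[int(i)], float(s), b) for i, s, b in zip(class_ids, scores, boxes)]

def show_in_hmi(frame, detections):
    """Draw the object name and matching probability, as in the HMI window."""
    for label, score, (x, y, w, h) in detections:
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
        cv2.putText(frame, f"{label} {score:.2f}", (x, y - 5),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)
    cv2.imshow("HMI", frame)
    cv2.waitKey(1)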
[Figure panels: (a) Calibrating the location of the computer vision system for the test; (b) General view of the computer vision system.]
[Figs. 12 and 13 plot the identification time (s) and the identification accuracy versus distance (cm).]
Fig. 12 Graph of the dependence of the "robo ball" object identification accuracy and time on distance for the classic coco.names library.
Fig. 13 Graph of the dependence of the "robo ball" object identification accuracy and time on distance for the tiny-YOLO lightweight network model.
The identification time lies in the interval of 0.373–0.410 s, and the graph has an unpredictable shape that does not depend on the distance to the object. In contrast, from the graph for the tiny-YOLO lightweight network model (Fig. 13) it can be seen that the identification accuracy depends on the distance: the farther the "robo ball" object is from the humanoid robot, the lower the probability of belonging to this label. It is worth noting that, for the problem solved in this study, the distance of 200 cm is large enough and will not be used.
The created dataset for the task of detecting and recognizing objects by the mobile humanoid robot, built on the basis of the tiny-YOLO lightweight network model, surpasses the SSD network model in terms of detection accuracy and detection speed, and can simultaneously meet the requirements of detection accuracy and real-time performance (in dynamics) when performing indoor tasks. These conclusions are confirmed by comparative experiments on the accuracy and speed of object detection, as well as on the detection and recognition of objects in real time by a mobile robot in a real indoor environment.
The next experiment is the recognition and identification of an object with the label "robo ball" in dynamics; the essence of the experiment is to identify a dynamically receding object, which in fact simulates a humanoid robot playing football. The results of this experiment are shown in Fig. 14.
(a) Distance=50 cm, recognition accuracy 0.90; (b) Distance=100 cm, recognition accuracy 0.86; (c) Distance=150 cm, recognition accuracy 0.86; (d) Distance=200 cm, recognition accuracy 0.97.
Fig. 14 Identification of the "robo ball" object in dynamics (distance increasing).
Analyzing the obtained results of identifying the "robo ball" object in dynamics, it can be seen that the recognition accuracy of the improved neural network at distances of 100 cm and 150 cm (Figs. 14b and 14c) was affected by the presence of the object in direct sunlight, which lightened the contours of the object and, as a result, led to a decrease in the recognition accuracy, in contrast to the result obtained at a distance of 200 cm (Fig. 14d), where the object left the sunlit zone, which accordingly increased the recognition accuracy to 0.97. Despite the obtained results, it can be concluded that the neural network improvements based on the tiny-YOLO lightweight network model provide high speed and accuracy of object recognition with a low camera resolution on the ESP32-Cam (2Mp) module for small-sized mobile humanoid robots.
Let us compare the developed computer vision system with a control system for a humanoid robot with similar solutions. The comparison results are presented in Table 6.
Let us build Table 7 to compare the identification time of the "ball" object versus distance for the developed computer vision system based on the ESP32-Cam with a computer vision system based on the NVIDIA Jetson Nano[38].
Let us present the comparison results obtained from Table 7 in the form of a graph, which is shown in Fig. 15.

6 Conclusion
The developed computer vision system for small-sized mobile humanoid robots was implemented on the basis of the ESP32-Cam with an HC-SR04 ultrasonic sensor. To speed up the process of recognition and identification of the "robo ball" object, the authors proposed a tiny-YOLO lightweight network model. This solution allowed us to preserve the multidimensional structure of the feature graph of the network model and therefore increase the detection accuracy, while significantly reducing the amount of calculations generated by the network operation, thereby significantly increasing the detection speed when using microcontroller systems. As can be seen from the results of the experiments, the improved neural network made it possible to obtain an average time of 0.028 s for identifying the "robo ball" object in natural conditions, with a high probability from 0.99 at a distance of 50 cm to 0.80 at 200 cm; it is worth noting that an OV2640 2MP/FOV70 camera is used. In the future, it is planned to develop a target detection system and implement mechanisms for interacting with "objects", as well as to consider the possibility of implementing movement trajectory prediction to predict actions.
Table 6 Comparison of the developed computer vision system with a control system for a humanoid robot with similar solutions.
Parameter | Baihaqi et al.[37] (similar solution) | Nugraha et al.[38] (similar solution) | Computer vision system developed
Microcontroller | ESP32-CAM | NVIDIA Jetson Nano | ESP32-CAM
Operating frequency | 160 MHz | 1.43 GHz | 160 MHz
Camera | OV2640 2Mp/FOV70 | Logitech C525 8 MP | OV2640 2Mp/FOV70
Camera resolution in the process of identification (pixel) | GIF (320×240) | HD (1280×720) | SVGA (350×530)
Type of underlying neural network | − (photo storage) | YOLOv3 | Improved neural network
Object type | Human face | Ball (ideal conditions*) | Ball (not ideal conditions**)
Average recognition time (s) | Without mask: 1.26; with mask: 7.24 | 0.033 | 0.028
Success probability (%) | 80 | Distance=50 cm: 99; Distance=900 cm: 46 | Distance=50 cm: 99; Distance=200 cm: 80
Dimension (mm×mm×mm) | 27×39×4.5 | 100×80×29 | 27×39×4.5
System price (US dollar) | 8 | 180 | 8
Note: *: The studies were carried out under ideal conditions in terms of the contrast between the color of the identified "ball" object ("yellow", RGB (223, 97, 10)) and the background ("green", RGB (55, 183, 168)). **: The studies were carried out in a natural environment; the color of the identified "ball" object was "yellow", RGB (173, 145, 97), against a linoleum background (RGB (197, 192, 189)).
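The comparison summarized in Table 7 and Fig. 15 below can be reproduced with a few lines of matplotlib using the values reported in Table 7 (the 300 cm point for the developed system is absent there and is therefore omitted); this is only a plotting sketch, not the authors' original script.

import matplotlib.pyplot as plt

distance_cm = [50, 100, 200, 300]
jetson_time_s = [0.033, 0.032, 0.033, 0.033]   # NVIDIA Jetson Nano based system [38]
esp32_time_s = [0.028, 0.026, 0.031]           # developed ESP32-Cam based system

plt.plot(distance_cm, jetson_time_s, "o-", label="Computer vision system based on NVIDIA Jetson Nano")
plt.plot(distance_cm[:3], esp32_time_s, "s-", label="Developed computer vision system")
plt.xlabel("Distance (cm)")
plt.ylabel("Time (s)")
plt.ylim(0.020, 0.040)
plt.legend()
plt.show()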
Table 7 Comparison of the ball object identification time versus distance for the developed computer vision system based on ESP32-Cam with the computer vision system based on NVIDIA Jetson Nano[38].
Compared system | Identification time (s) at Distance=50 cm | Distance=100 cm | Distance=200 cm | Distance=300 cm
Computer vision system based on NVIDIA Jetson Nano[38] | 0.033 | 0.032 | 0.033 | 0.033
Developed computer vision system | 0.028 | 0.026 | 0.031 | −

[Fig. 15 plots the identification time (s) versus distance (cm) for the computer vision system based on NVIDIA Jetson Nano and for the developed computer vision system.]
Fig. 15 Comparison of the ball object identification time versus distance for the developed computer vision system with a similar solution based on NVIDIA Jetson Nano[38].

The experimental results obtained show that the developed computer vision system has a number of significant advantages, such as a high recognition speed comparable to that of the expensive NVIDIA Jetson Nano and an improved AI YOLO v.3 algorithm implemented for the ESP32-Cam, which gives a minimal identification error for a dynamic object and has minimal requirements for the power module; moreover, the price of the developed computer vision module does not exceed 8 US dollars, compared with 180 US dollars for the analogue. However, the developed solution is not without drawbacks. The restrictions on overall dimensions have significantly reduced the computing power of the module, as a result of which some recognition and identification problems must be solved on a separate device. Thus, this problem and its solution are the next direction of possible research.

Dates
Received: 1 May 2023; Revised: 15 September 2023; Accepted: 18 September 2023

References
[1] S. Khlamov, V. Savanevych, I. Tabakova, and T. Trunova, Statistical modeling for the near-zero apparent motion detection of objects in series of images from data stream, in Proc. 12th Int. Conf. Advanced Computer Information Technologies (ACIT), Ruzomberok, Slovakia, 2022, pp. 126–129.
[2] S. Khlamov, I. Tabakova, and T. Trunova, Recognition of the astronomical images using the Sobel filter, in Proc. 29th Int. Conf. Systems, Signals and Image Processing (IWSSIP), Sofia, Bulgaria, 2022, pp. 1–4.
[3] M. A. Ahmad, I. Tvoroshenko, J. H. Baker, and V. Lyashenko, Modeling the structure of intellectual means of decision-making using a system-oriented NFO approach, Int. J. Emerg. Trends Eng. Res., vol. 7, no. 11, pp. 460–465, 2019.
[4] V. Lyashenko, O. Kobylin, and M. Minenko, Tools for investigating the phishing attacks dynamics, in Proc. Int. Scientific-Practical Conf. Problems of Infocommunications. Science and Technology (PIC S&T), Kharkiv, Ukraine, 2018, pp. 43–46.
[5] H. Attar, A. T. Abu-Jassar, V. Yevsieiev, V. Lyashenko, I. Nevliudov, and A. K. Luhach, Zoomorphic mobile robot development for vertical movement based on the geometrical family caterpillar, Comput. Intell. Neurosci., vol. 2022, p. 3046116, 2022.
[6] H. Attar, A. T. Abu-Jassar, A. Amer, V. Lyashenko, V. Yevsieiev, and M. R. Khosravi, Control system development and implementation of a CNC laser engraver for environmental use with remote imaging, Comput. Intell. Neurosci., vol. 2022, p. 9140156, 2022.
[7] A. Rabotiahov, O. Kobylin, Z. Dudar, and V. Lyashenko, Bionic image segmentation of cytology samples method, in Proc. 14th Int. Conf. Advanced Trends in Radioelectronics, Telecommunications and Computer Engineering (TCSET), Lviv-Slavske, Ukraine, 2018, pp. 665–670.
[8] S. M. H. Mousavi, V. Lyashenko, and V. B. S. Prasath, Analysis of a robust edge detection system in different color spaces using color and depth images, Comput. Opt., vol. 43, no. 4, pp. 632–646, 2019.
[9] A. T. Abu-Jassar, Y. M. Al-Sharo, V. Lyashenko, and S. Sotnik, Some features of classifiers implementation for object recognition in specialized computer systems, TEM J., vol. 10, no. 4, pp. 1645–1654, 2021.
[10] A. C. Bavelos, N. Kousi, C. Gkournelos, K. Lotsaris, S. Aivaliotis, G. Michalos, and S. Makris, Enabling flexibility in manufacturing by integrating shopfloor and process perception for mobile robot workers, Appl. Sci., vol. 11, no. 9, p. 3985, 2021.
[11] A. S. M. Al-Obaidi, A. Al-Qassar, A. R. Nasser, A. Alkhayyat, A. J. Humaidi, and I. K. Ibraheem, Embedded design and implementation of mobile robot for surveillance applications, Indonesian J. Sci. Technol., vol. 6, no. 2, pp. 427–440, 2021.
[12] Y. H. Jung, D. H. Cho, J. W. Hong, S. H. Han, S. B. Cho, D. Y. Shin, E. T. Lim, and S. S. Kim, Development of multi-sensor module mounted mobile robot for disaster field investigation, Int. Arch. Photogramm. Remote Sens. Spatial Inf. Sci., vol. XLIII-B3-2022, pp. 1103–1108, 2022.
[13] G. Lajkó, R. N. Elek, and T. Haidegger, Endoscopic image-based skill assessment in robot-assisted minimally invasive surgery, Sensors, vol. 21, no. 16, p. 5412, 2021.
[14] M. M. Rahman, T. Rahman, D. Kim, and M. A. U. Alam, Knowledge transfer across imaging modalities via simultaneous learning of adaptive autoencoders for high-fidelity mobile robot vision, in Proc. 2021 IEEE/RSJ Int. Conf. Intelligent Robots and Systems (IROS), Prague, Czech Republic, 2021, pp. 1267–1273.
[15] Y. J. Mon, Vision robot path control based on artificial intelligence image classification and sustainable ultrasonic signal transformation technology, Sustainability, vol. 14, no. 9, p. 5335, 2022.
[16] D. Zhang and Z. Guo, Mobile sentry robot for laboratory safety inspection based on machine vision and infrared thermal imaging detection, Secur. Commun. Netw., vol. 2021, p. 6612438, 2021.
[17] A. Stateczny, K. Gierlowski, and M. Hoeft, Wireless local area network technologies as communication solutions for unmanned surface vehicles, Sensors, vol. 22, no. 2, p. 655, 2022.
[18] EMAX ES08MA II 12g mini metal gear analog servo for RC model, https://www.amazon.com/ES08MA-Metal-Analog-Servo-Model/dp/B07KYK9N1G, 2023.
[19] OV7670 camera module supports VGA CIF auto exposure control display active size 640X480, https://www.amazon.com/Supports-Exposure-Control-Display-640X480/dp/B09X59J8N9/ref=sr_1_6?crid=1OSVL1M09YPB5&keywords=VGA+OV7670&qid=1669977970&sprefix=vga+ov7670+%2Caps%2C600&sr=8-6, 2023.
[20] 01Studio pyAI-K210 kit development board Python AI artificial intelligence machine vision deep learning MicroPython, https://de.aliexpress.com/item/1005001459205624.html?gatewayAdapt=glo2deu, 2023.
[21] ESP32-CAM ESP32 Cam serial-to-WiFi development board, micro USB, Bluetooth + OV2640 camera module, https://de.aliexpress.com/item/32947577882.html?gatewayAdapt=glo2deu, 2023.
[22] Preliminary datasheet, OV7670/OV7171 CMOS VGA (640×480), http://web.mit.edu/6.111/www/f2016/tools/OV7670_2006.pdf, 2023.
[23] M5Stack K027 M5StickV K210 ai camera, https://eu.mouser.com/