Development and Investigation of Vision System for a Small-Sized Mobile Humanoid Robot in a Smart Environment
ABSTRACT
The conducted research aims to develop a computer vision system for a small-sized mobile humanoid robot. The decentralization of the servomotor control and computer vision systems is investigated from the hardware point of view, and the software level required to achieve an efficient, matched design is determined. A computer vision system based on an upgraded tiny-You Only Look Once (YOLO) network model is developed that allows objects to be recognized and identified and decisions to be made on interacting with them, which is recommended for crowded environments. During the research, a concept of the computer vision system was developed that describes the interaction between its main elements; on its basis, hardware modules were selected to implement the task. A structure of information interaction between the hardware modules is proposed, and a connection scheme is developed, on the basis of which a model of the computer vision system is assembled for the research, together with the algorithms and software required for solving the problem. To ensure a high speed of the computer vision system based on the ESP32-CAM module, the neural network was improved by replacing the Visual Geometry Group 16 (VGG-16) network, used as the base feature-extraction network of the Single Shot Detector (SSD) model, with the tiny-YOLO lightweight network model. This made it possible to preserve the multidimensional structure of the network model feature graph, which increases the detection accuracy, while significantly reducing the amount of computation generated by the network operation and thereby significantly increasing the detection speed for a limited set of objects. Finally, a number of experiments were carried out, in both static and dynamic environments, which showed a high identification accuracy.
KEYWORDS
social system; assistant robotics; autonomous robots; humanoid robot; computer vision system; neural networks; decision
making; crowd environment
The evolution of modern robotics is based on the introduction of new technologies in the field of artificial intelligence, decision making, and the processing of large amounts of data and visual information[1–6]. One of the basic requirements for mobile robots is the availability of a computer vision system, which makes it possible to transmit the state of the environment to the operator in real time[7–9]. This allows the operator to assess the situation and complete the assigned tasks. However, some solutions in this area limit the autonomy of the mobile robot and make it dependent on the presence of the operator. To ensure the autonomy of such robots, it is necessary not just to implement a system for broadcasting a video stream of the environment of a mobile robot, but to develop a system for identifying objects in real time, with the ability to use the obtained data to develop a strategy for behavior and decision making.
In Ref. [10], Bavelos et al. proposed a computer vision system for an industrial mobile robot that can autonomously move around the workshop. The peculiarity of the proposed solution lies in the fusion of data from the sensors of the mobile robot with 2D and 3D sensors located in the workshop. This, according to the authors, allows the mobile robot to perceive the ongoing processes around it in real time[10]. This solution makes it possible to more accurately describe the events that occur in the working area of a mobile robot, but the proposed method cannot be used in open areas, which ties it to a specific location.
Al-Obaidi et al.[11] conducted research to develop a mobile robot for remote security monitoring in factories, offices, and airports. The main goal of that study, according to the authors, is the development of an energy-saving mobile robot with a computer vision system and environmental monitoring based on an array of sensors (temperature and obstacle presence). To implement the computer vision system, Al-Obaidi et al.[11] used a Raspberry Pi with a connected Camera Board, which made it possible to broadcast a video stream from the mobile robot to a control system, while an ATMEGA 328P microcontroller based on Arduino was used as the motor control system. Considering the proposed solution, it can be seen that the use of a Raspberry Pi as a computer vision system and an Arduino as a motor control system is rational.
1 Department of Computer Science, College of Information Technology, Amman Arab University, Amman 11937, Jordan
2 Faculty of Engineering, Zarqa University, Zarqa 2000, Jordan
3 College of Engineering, University of Business and Technology, Jeddah 21448, Saudi Arabia
4 Department of Media Systems and Technology, Kharkiv National University of Radio Electronics, Kharkiv 61166, Ukraine
5 Department of Electrical and Electronics Engineering, Faculty of Engineering and Architecture, Nişantaşı University, Istanbul 34398, Türkiye
Address correspondence to Hani Attar, Hattar@zu.edu.jo
© The author(s) 2025. The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution
4.0 International License (http://creativecommons.org/licenses/by/4.0/).
However, the Raspberry Pi has specific power requirements (DC 5 V, 1 A), and failure to comply with them can lead to unstable operation of the mobile robot. Therefore, if Al-Obaidi et al.[11] expanded the functionality of the computer vision system by introducing neural networks for object recognition and decision-making systems, it would entail either an unstable robot or the need to completely redesign the power supply system of the mobile robot.
In Ref. [12], an example of the development of a multi-sensor recognition system based on Light Detection and Ranging (LiDAR) for the mobile robot iRobotic Packbot 501 was given. The proposed solution allows, in synthesis with a computer vision system using LiDAR, a 3D map of the environment to be built[12]. However, the implementation of such a system requires the Robot Operating System (ROS), which in turn requires high-performance hardware based on expensive microcomputers; this correspondingly increases the cost and makes it impossible to use the solution on small-sized robots.
Table 1 compares some of the existing methods for obtaining visual information about the mobile robot environment, on the basis of which an intelligent decision-making system can be developed. As can be seen from Table 1, the following methods were chosen for analysis:
2D imaging (camera) can track moving objects in real time and locate them in the direct line of sight of the mobile robot. This allows objects to be detected in dynamic space and makes it possible to determine their location and identity using Artificial Intelligence (AI).
3D vision is a complex method based on the synthesis of two cameras or a laser scanner located at different angles. As a result, there are high requirements for the hardware and software used to visualize the surrounding space of a mobile robot. Such a system detects the presence of objects with high accuracy and with a small real-time delay, but without the possibility of identifying them.
The ultrasonic method uses a device that measures the time interval between the emission and detection of a reflected sound wave, which allows the presence of objects and their distance from the sensor to be determined, so that a decision can be made based on the results. The ultrasonic device works in real time and is easy to implement and program. However, it does not allow visualization of the environment and therefore cannot identify objects; it only detects the presence of objects near the sensor.
The infrared method detects Infrared (IR) rays emitted by an object. It can also project IR light onto a target and receive the reflected light to determine its distance or proximity. Infrared sensors are economical, can track infrared light over a large area, and work in real time. The method determines the presence of an object but cannot identify it, and it is simple enough to implement in both software and hardware.
Thus, to solve the problem of developing a computer vision system for small-sized robots with the possibility of implementing an AI-based identification system, we choose to use the 2D imaging method. To conduct experimental research, a prototype of a humanoid robot will be used as the actuator for interaction with the outside world; this will expand the possibility of implementing manipulation capabilities for interacting with objects and make it possible in the future to implement a group control and decision-making system to achieve the set tasks.
Thus, in this study, the authors set themselves the task of developing a computer vision system using AI technologies to recognize and identify objects in the robot's work area. At the same time, strict requirements are put forward for the overall dimensions and computing power of the hardware. This makes it possible to use the system inside small-sized intelligent humanoid robots, as well as in robots for rescue missions in areas of man-made disasters. At this point in time, existing similar solutions use expensive hardware such as the Nvidia Jetson Orin Nano or the Raspberry Pi 4 Model B 4GB. In addition, the listed hardware has increased requirements for the power circuit, as well as large overall dimensions, which do not meet the requirements for a small-sized humanoid robot.

1 General Concept of Mobile Robot Computer Vision Control System
Analyzing the specifics and complexity of the developed mobile humanoid robot, the authors decided to apply a decentralized approach to the control system by dividing the humanoid robot motion control system and the computer vision system into separate modules. The grounds for this decision were as follows: the bandwidth of the video channel increases, which reduces the signal delay time in the computer vision system; and the computational load on the microcontroller module of the servomotor control system is reduced, which speeds up the execution of commands. All actions for recognition and identification using AI, as well as the algorithms for making decisions and interacting with the outside world and objects, are implemented on a personal computer (notebook).
At the first step in solving the problem, it is necessary to develop a general concept for the control system implementation, which should include a computer vision module and will be used to transfer the video stream to the laptop and return commands for interacting with objects (a ball). Figure 1 shows the general concept of the control system implementation for a mobile humanoid robot.
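To make this concept more concrete, the following minimal Python sketch shows one possible shape of the notebook-side loop implied by Fig. 1: frames arrive from the robot over the wireless network, are analyzed, and a command is returned. All names and the command set here are illustrative assumptions, not the authors' implementation.

def receive_frame():
    """Placeholder: obtain the next video frame broadcast by the robot's camera module."""
    ...

def analyze(frame):
    """Placeholder: recognize and identify objects in the frame (e.g., a ball)."""
    return []

def choose_action(objects):
    """Placeholder: decide how to interact with the identified objects."""
    return "idle"

def send_to_robot(action):
    """Placeholder: return the command to the robot's control module over the LAN."""
    ...

def control_loop():
    # The notebook repeats this cycle for as long as the robot streams video.
    while True:
        frame = receive_frame()
        objects = analyze(frame)
        send_to_robot(choose_action(objects))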
Table 1 Methods for obtaining visual information about mobile robot environment comparison.
Parameter 2D imaging (camera)[13] 3D vision (LiDAR)[14] Ultrasonic method (sensor)[15] Infrared method (sensor)[16]
Recognition accuracy ++ +++ + −
Resolution ++ +++ − −
“Dead” zone + − ++ ++
Field of View (FOV) ++ +++ + +
Dimension + +++ + +
Price + +++ + +
Hardware requirements ++ +++ + +
Software implementation complexity ++ +++ + +
Object identification +++ ++ − −
Note: +++: high rates; ++: middle rates; +: low rates; −: impossible to implement.
The basic idea of this concept is the development of a decentralized control system, that is, its division into two subsystems. The very strong restrictions on the overall dimensions of the humanoid robot became the grounds for this decision. As a result, four main modules will be placed on board the humanoid robot: a wireless communication module, a camera module, a control module, and a power module.
The wireless communication module allows data to be exchanged between the humanoid robot and the control and decision-making system using wireless Local Area Network (LAN) technologies[17].
The camera module performs the function of broadcasting the mobile robot environment using wireless network technologies.
The control module is designed to execute commands that it receives from the control and decision-making system located on a personal computer (notebook). An additional task of the control module is to provide the ability to connect servomotors, which give the humanoid robot the freedom of action needed to perform its tasks.
The power module is a small-sized battery assembly, which is used to power the humanoid robot.
On the basis of a personal computer (notebook), it is proposed to create the control and decision-making system; this decision is justified by the fact that the development of an object recognition and identification system, as well as a decision-making system, requires "serious" computing power that cannot be placed on board a humanoid robot. This subsystem consists of the following modules: wireless communication module, streaming video processing module, image recognition module, object identification module, and decision module.
The streaming video processing module is designed for primary processing of the streaming video received from the humanoid robot and preparing it for the recognition module.
The image recognition module processes the received streaming video, divides it into frames, and recognizes the presence of objects in each frame.
The object identification module uses neural networks to identify the objects near the robot, i.e., the objects located in the robot's working zone. The object identification module outputs the suggested name of the identified object and the confidence of the suggestion.
The decision module analyzes the results obtained from the object identification module and, depending on the underlying algorithm, performs interaction actions with the object.

2 Selection of Hardware Modules for the Robot's Computer Vision Control System
The humanoid mobile robot "ViVi" by Doctors of Intelligence and Technology LTD is used as a model for conducting the research. Its overall dimensions are: height 200 mm; width 72.2 mm; arm length (from shoulder to hand) 105 mm; movement is provided by 18 servomotors of the EMax ES08MA II model.
Based on the concept proposed above for a computer vision system for a humanoid robot (Fig. 1), at the beginning of development it is necessary to define the restrictions that are imposed when choosing the hardware modules on the basis of which the camera module and control module will be implemented.
Dimensions: the camera module is no more than 30 mm × 45 mm × 15 mm (width × length × height); the control module is no more than 54 mm × 51 mm × 7 mm (width × length × height).
Power: the camera module working voltage is 5 V with an average current of 180 mA (max); the control module working voltage is 4.5–6.4 V with an average current of 80 mA (min).
Wireless communication: WiFi 802.11 b/g/n support for both the camera module and the control module.
The required camera module features are as follows:
(1) Connection of a camera with the ability to transmit streaming video with a minimum resolution of 800 pixel × 600 pixel at 30 frames per second (fps);
(2) The ability to create a wireless network access point based on the module.
The required control module functions are as follows:
(1) Power supply support for up to 18 Emax ES08MA II servomotors[18];
(2) The maximum average current (when all 18 servomotors are working) can reach 2800 mA.
The rationale for the selected restrictions and their values is the overall dimensions of the humanoid robot: 72 mm × 200 mm × 80 mm in total (width × length × height) and 40 mm × 55 mm × 80 mm for the body (width × length × height), which should accommodate the control module, the camera module, and the power module. As a result, the developed solution has a rigid framework for the analysis and selection of hardware modules.
First, we analyze and select a hardware module for the implementation of the camera module. Based on the above limitations, the following modules were selected; they are shown in Fig. 2, and their parameters are presented in Table 2.
Having carried out a comparative analysis of the parameters of the hardware modules (Table 2) for the implementation of a computer vision system for a humanoid robot, we can identify the following advantages and disadvantages for solving the problem.
The OV7670 VGA hardware module fits the overall dimensions and has an attractive price (5 US dollars). However, it has a number of disadvantages, such as the absence of a wireless network module and of a microcontroller-based module for primary processing of the video stream; to use this module, a Wi-Fi module and a control module (Arduino Mini, STM32) have to be purchased separately, which accordingly increases not only the cost but also the complexity of the design, as well as the overall dimensions. At the same time, the speed of such a microcontroller will be significantly lower (16–32 MHz) than that of the pyAI-K210 and ESP32 modules. Comparing the pyAI-K210 and ESP32 hardware modules, it can be seen that these two modules have similar parameters in terms of microcontroller speed and camera characteristics. The main differences between them are the overall dimensions and weight of the hardware module, but the price of the ESP32-Cam is about 4.5–5.0 times lower than that of the pyAI-K210, and it has a wider distribution and better library support for working with modules and sensors. Based on this comparison, an ESP32-Cam will be used to implement the camera module onboard the humanoid robot.
The next step is to select a hardware module for the implementation of the control module of the humanoid robot. Based on the above restrictions, the following hardware modules fell into the analysis area; they are presented in Fig. 3, and their parameters are given in Table 3.
Analyzing the main parameters of the control system hardware modules, it can be seen that the 16 road steering gear control board servo and the neue version 32 Kanal roboter servo control board modules are based on a 32-bit STM microcontroller. The neue version 32 Kanal roboter servo control board also has a module for wireless data transmission via Wi-Fi and Bluetooth, which fits certain restrictions, but these two modules do not fit in terms of overall dimensions or the number of servomotors that can be connected to the board.
(a) Neue version 32 Kanal roboter servo control board[25]; (b) 16 road steering gear control board servo[26]; (c) Humanoiden roboter control board[27]
Fig. 3 Hardware modules for the humanoid robot motion control system implementation.
Table 3 Comparison of the hardware modules’ main parameters for the control system development.
Parameter | Neue version 32 Kanal roboter servo control board[25] | 16 road steering gear control board servo[26] | Humanoiden roboter control board[27]
CPU | STM 32 bit | STM 32 bit | ESP8266MOD
Flash capacity (Mbit) | 16 | 16 | 4
Wireless communication | Wi-Fi and bluetooth | − | Wi-Fi
Number of simultaneously controlled servomotors | 32 | 16 | 18
Servomotor input voltage (V) | 4.2–7.2 | 4.2–7.2 | 4.2–7.2
Operation voltage (V) | 5 | 5 | 3.3
Communication protocol | UART | USB-TTL | TTL programmer
Board dimension (mm×mm) | 45×63 | 36×43.5 | 24×16
Weight (g) | 8 | 12 | 12
Price (US dollar) | 20 | 5 | 15
On the other hand, the control system hardware based on the neue version 32 Kanal roboter servo control board (32 Emax ES08MA II servomotor channels) or on the 16 road steering gear control board servo will not allow all degrees of freedom of the humanoid robot to be realized. Considering the parameters of the humanoiden roboter control board, it can be seen that it is not only suitable in terms of overall dimensions, but is also distinguished by the presence of a Wi-Fi module onboard a control system based on the ESP8266MOD microcontroller. This makes it possible to create an access point on its basis and simplifies the process of transmitting commands to the humanoid robot from the control and decision-making system. As a result, to implement the control system of the humanoid robot, the authors suggest using the humanoiden roboter control board, priced at 15 US dollars.

3 Structural Diagram of Interaction Between Main Elements of Mobile Robot Computer Vision Control System
After choosing the hardware modules, it is necessary to develop a general structure for the information interaction of the camera module and the control module with the system of management and decision making that controls the humanoid robot. This structure is shown in Fig. 4.
Fig. 4 Structure of the information interaction of the camera module and the control module with the system of management and decision making to control a humanoid robot.
Based on the developed structure of information interaction between the software and hardware modules, and taking into account the specifics of developing programs for a microcontroller, it is proposed to use the following set of libraries supported in the Arduino IDE development environment[28] to implement the computer vision system of the mobile humanoid robot based on the ESP32-Cam and a neural network for object recognition and identification:
WebServer makes it possible to implement a simple web server with support for only one client at a time, which allows requests to be processed using the Get and Post methods. The web server built into the IoT Development Framework from Espressif Systems (ESP-IDF) will be used. This web server is based on the Transmission Control Protocol/Internet Protocol (TCP/IP) stack and allows developers to create and configure web interfaces to manage and interact with the ESP32-CAM[29].
Wi-Fi allows connection to an existing local network, using authentication via SSID and password.
ESP32-Cam is a driver library for working with the OV2640 camera on the ESP32-Cam hardware module.
For control based on the humanoiden roboter control board hardware module, it is proposed to use the following libraries:
WebServer and Wi-Fi perform functions similar to those in the computer vision system.
HTTP server node JS is a library that makes it possible to receive and process commands to control the humanoid robot from a personal computer.
Servo is a library that allows the rotation angles of the Emax ES08MA II servomotors to be controlled.
Battery is a modular assembly of batteries with minimum parameters of 5 V and 3000 mA.
HC-SR04 is a hardware module, an ultrasonic sensor for calculating the distance to an object in the working area of the computer vision system[30].
On the side of the personal computer (notebook), the humanoid robot control and decision-making system is implemented in the Python 3.10 language[31] in the PyCharm development environment[32] using the following libraries:
OpenCV is a computer vision library designed to analyze, classify, and process the image received from the ESP32-Cam.
NumPy is a library for processing large multidimensional arrays and matrices with the ability to structure them.
Urllib, an HTTP client support library, makes it possible to obtain the URL address used for transmitting the video stream from the ESP32-Cam.
TensorFlow is a machine learning library designed to solve the problems of building and training a neural network in order to automatically find and classify patterns found in the working environment of the humanoid robot.
YOLO v.3 is a neural network library (script) for object recognition and identification.
CUDA 10.1.105 is Nvidia's technology library that uses the GPU to accelerate parallel computing in object recognition and identification. In the developed computer vision system, the choice of the CUDA 10.1.105 library is due to the following reasons: parallel computing on the GPU, and convenience and compatibility with machine learning and computer vision frameworks such as TensorFlow, PyTorch, and OpenCV. The use of CUDA allows the execution of algorithms and data processing to be parallelized, which can lead to a significant acceleration of the process.
A feature of the ESP32 microcontroller is the use of the special ESPAsyncWebServer library, which provides the functionality of a web server and supports one access point, that is, one operator. Based on this limitation, one access point transmits the streaming video to the computer vision system, and the second one, implemented on the humanoiden roboter control board, controls the robot.
It is also necessary to implement the computational function "Calculating the Distance to an Object", which calculates the distance to the detected object based on the ultrasonic method, and the "Decision Selection Algorithm" block, which contains the behavioral actions that depend on the name of the identified object, the distance, and the objective functions and tasks of interaction with it. The results of the selected interaction with the object are transferred to the "Sending commands by protocol HTTP" block, in which a sequence of commands is formed and transmitted via HTTP using the Post and Get methods to the humanoid robot control board hardware module.
For the convenience of visualization and understanding of the type of signal transmitted in the circuit, it was decided to designate the signals with the following colors: green, the digital signal for data transmission; black, the power circuit "ground" or "gnd"; red, the +5 V power supply or "vcc". The electrical circuit diagram for connecting the ESP32-Cam module to the HC-SR04 ultrasonic sensor is shown in Fig. 5.
[Fig. 5: wiring of the HC-SR04 sensor lines (VCC, GND, Trig, Echo) to the ESP32-CAM pin header.]
Fig. 5 Wiring diagram for HC-SR04 sensor to ESP32-Cam.
The assembled test layout of the computer vision system with the HC-SR04 ultrasonic sensor connected is shown in Fig. 6a (front view) and Fig. 6b (back view). Figure 6c is a front view of the humanoid robot body, and Fig. 6d shows the location of the modules inside the case.
(a) Front view; (b) Back view; (c) Front view of humanoid robot body; (d) Location of the modules inside the case.
Fig. 6 Placement of a computer vision system in a humanoid robot.
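As an illustration of how the Urllib, NumPy, and OpenCV libraries listed above cooperate on the notebook side, the following sketch fetches a single JPEG frame from the ESP32-Cam web server and decodes it into an OpenCV image. The IP address is an assumption that depends on the address allocated by the local router, and the endpoint name corresponds to one of the handlers registered in the firmware described below.

import urllib.request
import numpy as np
import cv2

ESP32_CAM_URL = "http://192.168.1.50/cam-mid.jpg"  # assumed local IP of the ESP32-Cam

def read_frame(url: str = ESP32_CAM_URL) -> np.ndarray:
    # Urllib obtains the JPEG bytes served by the ESP32-Cam web server
    jpeg_bytes = urllib.request.urlopen(url, timeout=2).read()
    # NumPy wraps the raw bytes so that OpenCV can decode them
    buffer = np.frombuffer(jpeg_bytes, dtype=np.uint8)
    # OpenCV decodes the JPEG into a BGR image ready for recognition
    frame = cv2.imdecode(buffer, cv2.IMREAD_COLOR)
    if frame is None:
        raise RuntimeError("Frame could not be decoded; check the camera URL")
    return frame

if __name__ == "__main__":
    img = read_frame()
    print("Received frame with shape", img.shape)  # height, width, channels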
4 Development of Computer Vision System Algorithm for Mobile Humanoid Robot and Its Software Implementation
The next step in the development of a computer vision system for a humanoid robot is the development of an algorithm and a set of programs for the interaction between the ESP32-Cam microcontroller and the HC-SR04 sensor on the one hand, and for the system of management and decision making (notebook) on the other hand. An enlarged algorithm for the operation of the computer vision system is shown in Fig. 7.
Let us briefly describe the purpose of the main blocks of the algorithm (Fig. 7) that are implemented on the basis of the ESP32-Cam hardware module with the HC-SR04 sensor (Fig. 6). When power is connected to the ESP32-Cam module, the libraries are initialized (Fig. 4), after which the local wireless network is searched using the authentication data (SSID and password). If the connection fails, an error message is displayed in the port monitor (of the Arduino IDE during setup); upon successful connection, the local network access point (router) allocates a local dynamic IP address to the ESP32-Cam. The video stream and the data from the sensor will be transmitted to the system of management and decision making (notebook) at this address. After receiving the local IP address, the 2Mp OV2640 camera is activated. Taking into account the peculiarities of the operation of microcontrollers, until the power is turned off on the ESP32-Cam, the video stream from the OV2640 camera, as well as the results of calculating the distance values from the HC-SR04 ultrasonic sensor, will be continuously transmitted through the received IP address in a cycle. It is worth noting that all calculations of the distance to the object using the sensor are carried out on the basis of the ESP32-Cam module, and the calculation results in centimeters are transmitted to the system of management and decision making.
The operation algorithm of the system of management and decision making (notebook) is built according to the following principle. After starting the program, at the first stage the libraries are initialized (Fig. 4). After that, the program checks for the presence of an IP address assigned to the ESP32-Cam in the local network. If the given IP address is not found, the user is given an error message. If the connection is successful, the program starts receiving the video stream and the distance calculation results from the HC-SR04 sensor. The next step of the program is to solve the problem of recognizing the presence of an object in the working area of the humanoid robot. If an object is in the working area, it is identified, and the name of the object, the value of the matching probability and the recognition time, as well as the distance to it, are displayed. Based on the received data (object name and distance), a decision-making mechanism for interacting with the object in accordance with the tasks can be implemented; as a result, the humanoid robot control board module receives commands for execution. At the same time, all the necessary information (name of the object, probability of coincidence, and distance to it) is visualized to the user in real time in the Human-Machine Interface (HMI) window. The program functions until the user finishes working with it or the power is turned off on the ESP32-Cam module.
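A minimal sketch of the decision-making step just described is given below; the object labels, distance thresholds, command names, and the control board URL are illustrative assumptions, since the source only states that commands are formed and sent to the control board using the Get and Post methods.

import urllib.parse
import urllib.request

CONTROL_BOARD_URL = "http://192.168.1.60/command"  # assumed address of the control board

def choose_command(label: str, distance_cm: float) -> str:
    # Simplified "Decision Selection Algorithm": act only on the tracked ball
    if label != "robo ball":
        return "search"            # keep scanning the working area
    if distance_cm > 50:
        return "walk_forward"      # approach the detected ball
    return "kick"                  # ball is close enough to interact with

def send_command(command: str) -> int:
    # "Sending commands by protocol HTTP": a POST request to the control board
    data = urllib.parse.urlencode({"cmd": command}).encode()
    with urllib.request.urlopen(CONTROL_BOARD_URL, data=data, timeout=2) as resp:
        return resp.status

# Example: an identified ball 80 cm away leads to a "walk_forward" command
status = send_command(choose_command("robo ball", 80.0))
print("Control board answered with HTTP status", status)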
[Fig. 7 presents two flowcharts of the enlarged algorithm: on the ESP32-Cam side (connect to the wireless network, get a local IP address, activate the OV2640 camera, and broadcast the streaming video and the HC-SR04 distance readings in a closed loop) and on the notebook side (check for the video stream, recognize objects in the frame, identify objects in the frame, and make decisions on interaction with the object).]
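Before turning to the firmware, the pulse-width-to-distance conversion that the ESP32-Cam performs can be illustrated with a short calculation. It uses the standard HC-SR04 relationship in which sound travels roughly 29.1 µs per centimeter and the echo covers the distance twice; the same formula appears in the loop() code below.

SOUND_US_PER_CM = 29.1  # approximate time for sound to travel 1 cm, in microseconds

def pulse_to_distance_cm(echo_pulse_us: float) -> float:
    # The echo pulse covers the robot-object distance twice (out and back),
    # so half of the pulse width corresponds to the one-way distance.
    return (echo_pulse_us / 2) / SOUND_US_PER_CM

# Example: a 2910 µs echo pulse corresponds to about 50 cm to the object
print(round(pulse_to_distance_cm(2910), 1))  # -> 50.0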
When creating the program for the ESP32-Cam module, we will use the Arduino IDE 1.8.19 development environment with the obligatory installation of the following packages in the boards manager: Arduino AVR Boards ver. 1.8.5 and ESP32 ver. 2.0.4 by Espressif Systems.
At the first step, we write a program for broadcasting a video stream from the ESP32-Cam and connect the libraries:
#include <WebServer.h>
#include <WiFi.h>
#include <esp32cam.h>
Let us define the constants that contain the login and password for ESP32-Cam authentication in the local network (set by the developer):
const char* WIFI_SSID = "Crow";
const char* WIFI_PASS = "notebookpuk100";
Let us define the port for connecting to the web server:
WebServer server(80);
We indicate the pins through which the HC-SR04 sensor is connected to the ESP32-Cam:
#define trigPin 13
#define echoPin 15
For the convenience of testing the computer vision system with different resolutions obtained from the 2Mp OV2640 camera, we define the following working resolutions:
static auto loRes = esp32cam::Resolution::find(320, 240);
static auto midRes = esp32cam::Resolution::find(350, 530);
static auto ords = esp32cam::Resolution::find(800, 600);
That is, when testing the computer vision system, it will be possible to change the image quality in the program on the side of the system of management and decision making. In the future, it will be possible to select only one streaming video quality for the tasks, which will make it possible to reduce the load on the microcontroller.
Let us create a serveJpg() function to send the streaming video as a sequence of jpg frames, with an error handler in the form of a 503 code for the server[33]:
{auto frame = esp32cam::capture();
if (frame == nullptr) {
Serial.println("CAPTURE FAIL");
server.send(503, "", ""); // the server is not ready to process the request right now, because it is overloaded or under maintenance
return;}
Serial.printf("CAPTURE OK %dx%d %db\n", frame->getWidth(), frame->getHeight(), static_cast<int>(frame->size()));
server.setContentLength(frame->size());
server.send(200, "image/jpeg"); // successful request, the server is running and the information is being transmitted
WiFiClient client = server.client();
frame->writeTo(client);}
Let us create the functions void handleJpgLo(), void handleJpgHi(), and void handleJpgMid() to check the possibility of streaming video at the three resolutions loRes, midRes, and the highest resolution (ords). An example of the implementation of the void handleJpgLo() function is presented below:
{if (!esp32cam::Camera.changeResolution(loRes)) {
Serial.println("SET-LO-RES FAIL");}
serveJpg();}
The functions void handleJpgHi() and void handleJpgMid() are implemented similarly.
The next step is to describe the necessary data in the void setup() function; we skip the description of the standard port settings and present the most interesting points.
Set the signal levels on the pins to which the HC-SR04 sensor is connected:
pinMode(trigPin, OUTPUT);
pinMode(echoPin, INPUT);
Set up the OV2640 camera parameters and check its operability:
{using namespace esp32cam;
Config cfg;
cfg.setPins(pins::AiThinker);
cfg.setResolution(ords);
cfg.setBufferCount(2);
cfg.setJpeg(80);
bool ok = Camera.begin(cfg);
Serial.println(ok ? "CAMERA OK" : "CAMERA FAIL"); }
It is worth paying attention to the last line, which allows technical information to be displayed in the port monitor of the Arduino IDE development environment when it is configured.
Connect to the local wireless network using the data entered in WIFI_SSID and WIFI_PASS:
WiFi.persistent(false);
WiFi.mode(WIFI_STA);
WiFi.begin(WIFI_SSID, WIFI_PASS);
while (WiFi.status() != WL_CONNECTED);
Upon successful connection, the received local IP address of the ESP32-Cam module is displayed in the Arduino IDE port monitor; it will be needed to receive the video stream in the system of management and decision making:
Serial.print("http://");
Serial.println(WiFi.localIP());
Let us register the three handlers, with the help of which the streaming video is transmitted in the form of a sequence of jpg files with different resolutions, and turn on the server:
server.on("/cam-lo.jpg", handleJpgLo);
server.on("/cam-hi.jpg", handleJpgHi);
server.on("/cam-mid.jpg", handleJpgMid);
server.begin();
The last step is to write the void loop() function, which is executed in a cyclic mode and is a feature of working with microcontrollers. Here we prescribe the program code for the operation of the HC-SR04 sensor and the calculation of the distance:
{digitalWrite(trigPin, LOW); // apply a low level signal to pin 13
delay(200); // make a delay of 200 ms
digitalWrite(trigPin, HIGH); // apply a high level signal to pin 13
// this creates a sequence of ultrasonic pulses on trigPin, which will then be captured as a signal reflected from objects on echoPin
delay(200);
digitalWrite(trigPin, LOW);
duration = pulseIn(echoPin, HIGH);
distance = (duration/2)/29.1; // distance calculation (in centimeters)
Serial.print(distance);
Serial.println("cm"); // information output to the port monitor (for testing)
server.handleClient();} // launch the web server
Now it is necessary to develop the software for recognition and identification of objects and its visualization in the system of management and decision making block. To do this, we will use the PyCharm[33] development environment and the Python 3.10 language.
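The notebook-side recognition and identification step can be sketched with OpenCV's DNN module as shown below; the configuration, weight, and class-name files for the tiny-YOLO model, as well as the thresholds, are assumptions used for illustration, not the exact files from the authors' dataset.

import cv2

# Assumed file names for the lightweight detector and its labels
net = cv2.dnn.readNetFromDarknet("tiny-yolo.cfg", "tiny-yolo.weights")
model = cv2.dnn_DetectionModel(net)
model.setInputParams(size=(416, 416), scale=1 / 255.0, swapRB=True)

with open("robo.names") as f:          # e.g., a file containing the "robo ball" label
    class_names = [line.strip() for line in f]

def identify(frame):
    """Return (label, confidence, box) for each detected object in the frame."""
    class_ids, scores, boxes = model.detect(frame, confThreshold=0.5, nmsThreshold=0.4)
    return [(class_names[int(i)], float(s), b) for i, s, b in zip(class_ids, scores, boxes)]

def show_in_hmi(frame, detections):
    """Draw the object name and matching probability, as in the HMI window."""
    for label, score, (x, y, w, h) in detections:
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
        cv2.putText(frame, f"{label} {score:.2f}", (x, y - 5),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)
    cv2.imshow("HMI", frame)
    cv2.waitKey(1)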
[Figure panels: (a) Calibrating the location of the computer vision system for the test; (b) General view of the computer vision system.]
[Figs. 12 and 13 plot the identification time (s) and the identification accuracy versus distance (cm).]
Fig. 12 Graph of the dependence of the "robo ball" object identification accuracy and time on distance for the classic coco.names library.
Fig. 13 Graph of the dependence of the "robo ball" object identification accuracy and time on distance for the tiny-YOLO lightweight network model.
The identification time lies in the interval of 0.373–0.410 s, and the graph has an unpredictable shape that does not depend on the distance to the object. In contrast, from the graph for the tiny-YOLO lightweight network model (Fig. 13) it can be seen that the identification accuracy depends on the distance: the farther the "robo ball" object is from the humanoid robot, the lower the probability of belonging to this label. It is worth noting that, for the problem solved in this study, the distance of 200 cm is large enough and will not be used.
The created dataset for the task of detecting and recognizing objects by the mobile humanoid robot, built on the basis of the tiny-YOLO lightweight network model, surpasses the SSD network model in terms of detection accuracy and detection speed, and can simultaneously meet the requirements of detection accuracy and real-time performance (in dynamics) when performing indoor tasks. These conclusions are confirmed by comparative experiments on the accuracy and speed of object detection, as well as on the detection and recognition of objects in real time by a mobile robot in a real indoor environment.
The next experiment is the recognition and identification of an object with the label "robo ball" in dynamics; the essence of the experiment is to identify a dynamically receding object, which in fact simulates a humanoid robot playing football. The results of this experiment are shown in Fig. 14.
(a) Distance=50 cm, recognition accuracy 0.90; (b) Distance=100 cm, recognition accuracy 0.86; (c) Distance=150 cm, recognition accuracy 0.86; (d) Distance=200 cm, recognition accuracy 0.97.
Fig. 14 Identification of the "robo ball" object in dynamics (distance increasing).
Analyzing the obtained results of identifying the "robo ball" object in dynamics, it can be seen that the recognition accuracy of the improved neural network at distances of 100 cm and 150 cm (Figs. 14b and 14c) was affected by the presence of the object in direct sunlight, which lightened the contours of the object and, as a result, led to a decrease in the recognition accuracy, in contrast to the result obtained at a distance of 200 cm (Fig. 14d), where the object left the sunlit zone, which accordingly increased the recognition accuracy to 0.97. Despite the obtained results, it can be concluded that the neural network improvements based on the tiny-YOLO lightweight network model provide high speed and accuracy of object recognition with a low camera resolution on the ESP32-Cam (2Mp) module for small-sized mobile humanoid robots.
Let us compare the developed computer vision system with a control system for a humanoid robot with similar solutions. The comparison results are presented in Table 6.
Let us build Table 7 to compare the identification time of the "ball" object versus distance for the developed computer vision system based on the ESP32-Cam with a computer vision system based on the NVIDIA Jetson Nano[38].
Let us present the comparison results obtained from Table 7 in the form of a graph, which is shown in Fig. 15.

6 Conclusion
The developed computer vision system for small-sized mobile humanoid robots was implemented on the basis of the ESP32-Cam with an HC-SR04 ultrasonic sensor. To speed up the process of recognition and identification of the "robo ball" object, the authors proposed a tiny-YOLO lightweight network model. This solution allowed us to preserve the multidimensional structure of the feature graph of the network model and therefore increase the detection accuracy, while significantly reducing the amount of calculations generated by the network operation, thereby significantly increasing the detection speed when using microcontroller systems. As can be seen from the results of the experiments, the improved neural network made it possible to obtain an average time of 0.028 s for identifying the "robo ball" object in natural conditions, with a high probability from 0.99 at a distance of 50 cm to 0.80 at 200 cm; it is worth noting that an OV2640 2MP/FOV70 camera is used. In the future, it is planned to develop a target detection system and implement mechanisms for interacting with "objects", as well as to consider the possibility of implementing movement trajectory prediction to predict actions.
Table 6 Comparison of the developed computer vision system with a control system for a humanoid robot with similar solutions.
Parameter | Baihaqi et al.[37] (similar solution) | Nugraha et al.[38] (similar solution) | Computer vision system developed
Microcontroller | ESP32-CAM | NVIDIA Jetson Nano | ESP32-CAM
Operating frequency | 160 MHz | 1.43 GHz | 160 MHz
Camera | OV2640 2Mp/FOV70 | Logitech C525 8 MP | OV2640 2Mp/FOV70
Camera resolution in the process of identification (pixel) | GIF (320×240) | HD (1280×720) | SVGA (350×530)
Type of underlying neural network | − (photo storage) | YOLOv3 | Improved neural network
Object type | Human face | Ball (ideal conditions*) | Ball (not ideal conditions**)
Average recognition time (s) | Without mask: 1.26; with mask: 7.24 | 0.033 | 0.028
Success probability (%) | 80 | Distance=50 cm: 99; Distance=900 cm: 46 | Distance=50 cm: 99; Distance=200 cm: 80
Dimension (mm×mm×mm) | 27×39×4.5 | 100×80×29 | 27×39×4.5
System price (US dollar) | 8 | 180 | 8
Note: *: The studies were carried out under ideal conditions in terms of the contrast between the color of the identified "ball" object ("yellow", RGB (223, 97, 10)) and the background ("green", RGB (55, 183, 168)). **: The studies were carried out in a natural environment; the color of the identified "ball" object was "yellow", RGB (173, 145, 97), against a linoleum background (RGB (197, 192, 189)).
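The comparison summarized in Table 7 and Fig. 15 below can be reproduced with a few lines of matplotlib using the values reported in Table 7 (the 300 cm point for the developed system is absent there and is therefore omitted); this is only a plotting sketch, not the authors' original script.

import matplotlib.pyplot as plt

distance_cm = [50, 100, 200, 300]
jetson_time_s = [0.033, 0.032, 0.033, 0.033]   # NVIDIA Jetson Nano based system [38]
esp32_time_s = [0.028, 0.026, 0.031]           # developed ESP32-Cam based system

plt.plot(distance_cm, jetson_time_s, "o-", label="Computer vision system based on NVIDIA Jetson Nano")
plt.plot(distance_cm[:3], esp32_time_s, "s-", label="Developed computer vision system")
plt.xlabel("Distance (cm)")
plt.ylabel("Time (s)")
plt.ylim(0.020, 0.040)
plt.legend()
plt.show()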
Table 7 Comparison of the ball object identification time versus distance for the developed computer vision system based on ESP32-Cam with the computer vision system based on NVIDIA Jetson Nano[38].
Compared system | Identification time (s) at Distance=50 cm | Distance=100 cm | Distance=200 cm | Distance=300 cm
Computer vision system based on NVIDIA Jetson Nano[38] | 0.033 | 0.032 | 0.033 | 0.033
Developed computer vision system | 0.028 | 0.026 | 0.031 | −

[Fig. 15 plots the identification time (s) versus distance (cm) for the computer vision system based on NVIDIA Jetson Nano and for the developed computer vision system.]
Fig. 15 Comparison of the ball object identification time versus distance for the developed computer vision system with a similar solution based on NVIDIA Jetson Nano[38].

The experimental results obtained show that the developed computer vision system has a number of significant advantages, such as a high recognition speed comparable to that of the expensive NVIDIA Jetson Nano and an improved AI YOLO v.3 algorithm implemented for the ESP32-Cam, which gives a minimal identification error for a dynamic object and has minimal requirements for the power module; moreover, the price of the developed computer vision module does not exceed 8 US dollars, compared with 180 US dollars for the analogue. However, the developed solution is not without drawbacks. The restrictions on overall dimensions have significantly reduced the computing power of the module, as a result of which some recognition and identification problems must be solved on a separate device. Thus, this problem and its solution are the next direction of possible research.

Dates
Received: 1 May 2023; Revised: 15 September 2023; Accepted: 18 September 2023

References
[1] S. Khlamov, V. Savanevych, I. Tabakova, and T. Trunova, Statistical modeling for the near-zero apparent motion detection of objects in series of images from data stream, in Proc. 12th Int. Conf. Advanced Computer Information Technologies (ACIT), Ruzomberok, Slovakia, 2022, pp. 126–129.
[2] S. Khlamov, I. Tabakova, and T. Trunova, Recognition of the astronomical images using the Sobel filter, in Proc. 29th Int. Conf. Systems, Signals and Image Processing (IWSSIP), Sofia, Bulgaria, 2022, pp. 1–4.
[3] M. A. Ahmad, I. Tvoroshenko, J. H. Baker, and V. Lyashenko, Modeling the structure of intellectual means of decision-making using a system-oriented NFO approach, Int. J. Emerg. Trends Eng. Res., vol. 7, no. 11, pp. 460–465, 2019.
[4] V. Lyashenko, O. Kobylin, and M. Minenko, Tools for investigating the phishing attacks dynamics, in Proc. Int. Scientific-Practical Conf. Problems of Infocommunications. Science and Technology (PIC S&T), Kharkiv, Ukraine, 2018, pp. 43–46.
[5] H. Attar, A. T. Abu-Jassar, V. Yevsieiev, V. Lyashenko, I. Nevliudov, and A. K. Luhach, Zoomorphic mobile robot development for vertical movement based on the geometrical family caterpillar, Comput. Intell. Neurosci., vol. 2022, p. 3046116, 2022.
[6] H. Attar, A. T. Abu-Jassar, A. Amer, V. Lyashenko, V. Yevsieiev, and M. R. Khosravi, Control system development and implementation of a CNC laser engraver for environmental use with remote imaging, Comput. Intell. Neurosci., vol. 2022, p. 9140156, 2022.
[7] A. Rabotiahov, O. Kobylin, Z. Dudar, and V. Lyashenko, Bionic image segmentation of cytology samples method, in Proc. 14th Int. Conf. Advanced Trends in Radioelectronics, Telecommunications and Computer Engineering (TCSET), Lviv-Slavske, Ukraine, 2018, pp. 665–670.
[8] S. M. H. Mousavi, V. Lyashenko, and V. B. S. Prasath, Analysis of a robust edge detection system in different color spaces using color and depth images, Comput. Opt., vol. 43, no. 4, pp. 632–646, 2019.
[9] A. T. Abu-Jassar, Y. M. Al-Sharo, V. Lyashenko, and S. Sotnik, Some features of classifiers implementation for object recognition in specialized computer systems, TEM J., vol. 10, no. 4, pp. 1645–1654, 2021.
[10] A. C. Bavelos, N. Kousi, C. Gkournelos, K. Lotsaris, S. Aivaliotis, G. Michalos, and S. Makris, Enabling flexibility in manufacturing by integrating shopfloor and process perception for mobile robot workers, Appl. Sci., vol. 11, no. 9, p. 3985, 2021.
[11] A. S. M. Al-Obaidi, A. Al-Qassar, A. R. Nasser, A. Alkhayyat, A. J. Humaidi, and I. K. Ibraheem, Embedded design and implementation of mobile robot for surveillance applications, Indonesian J. Sci. Technol., vol. 6, no. 2, pp. 427–440, 2021.
[12] Y. H. Jung, D. H. Cho, J. W. Hong, S. H. Han, S. B. Cho, D. Y. Shin, E. T. Lim, and S. S. Kim, Development of multi-sensor module mounted mobile robot for disaster field investigation, Int. Arch. Photogramm. Remote Sens. Spatial Inf. Sci., vol. XLIII-B3-2022, pp. 1103–1108, 2022.
[13] G. Lajkó, R. N. Elek, and T. Haidegger, Endoscopic image-based skill assessment in robot-assisted minimally invasive surgery, Sensors, vol. 21, no. 16, p. 5412, 2021.
[14] M. M. Rahman, T. Rahman, D. Kim, and M. A. U. Alam, Knowledge transfer across imaging modalities via simultaneous learning of adaptive autoencoders for high-fidelity mobile robot vision, in Proc. 2021 IEEE/RSJ Int. Conf. Intelligent Robots and Systems (IROS), Prague, Czech Republic, 2021, pp. 1267–1273.
[15] Y. J. Mon, Vision robot path control based on artificial intelligence image classification and sustainable ultrasonic signal transformation technology, Sustainability, vol. 14, no. 9, p. 5335, 2022.
[16] D. Zhang and Z. Guo, Mobile sentry robot for laboratory safety inspection based on machine vision and infrared thermal imaging detection, Secur. Commun. Netw., vol. 2021, p. 6612438, 2021.
[17] A. Stateczny, K. Gierlowski, and M. Hoeft, Wireless local area network technologies as communication solutions for unmanned surface vehicles, Sensors, vol. 22, no. 2, p. 655, 2022.
[18] EMAX ES08MA II 12g mini metal gear analog servo for RC model, https://www.amazon.com/ES08MA-Metal-Analog-Servo-Model/dp/B07KYK9N1G, 2023.
[19] OV7670 camera module supports VGA CIF auto exposure control display active size 640X480, https://www.amazon.com/Supports-Exposure-Control-Display-640X480/dp/B09X59J8N9/ref=sr_1_6?crid=1OSVL1M09YPB5&keywords=VGA+OV7670&qid=1669977970&sprefix=vga+ov7670+%2Caps%2C600&sr=8-6, 2023.
[20] 01Studio pyAI-K210 kit development board Python AI artificial intelligence machine vision deep learning MicroPython, https://de.aliexpress.com/item/1005001459205624.html?gatewayAdapt=glo2deu, 2023.
[21] ESP32-CAM ESP32 Cam serial-to-WiFi development board, micro USB, Bluetooth + OV2640 camera module, https://de.aliexpress.com/item/32947577882.html?gatewayAdapt=glo2deu, 2023.
[22] Preliminary datasheet, OV7670/OV7171 CMOS VGA (640×480), http://web.mit.edu/6.111/www/f2016/tools/OV7670_2006.pdf, 2023.
[23] M5Stack K027 M5StickV K210 ai camera, https://eu.mouser.com/