Research Paper On Sign Language To Text Conversion Using CNN
(IJIRSET)
e-ISSN: 2319-8753, p-ISSN: 2320-6710 | www.ijirset.com | Impact Factor: 8.118 | A Monthly Peer Reviewed & Referred Journal
Prof. Kopal Gangrade, Associate Professor, Department of Computer Engineering, Pune Institute of Computer Technology, Pune, India
ABSTRACT: Sign languages utilize a visual-manual modality to convey meaning and are commonly used by individuals who are deaf or hard of hearing. These communication systems involve hand gestures that are specific to each language and can be difficult for those unfamiliar with them to interpret. To address this issue, our goal is to develop an interface for Sign Language Recognition that can translate sign language into text and audio, making it accessible to a wider audience. However, current techniques for sign language translation have limitations, such as inaccuracies, difficulties with detecting skin tones, excessive motion in gestures, clutter, and variability. Despite these challenges, we aim to create an interface using Convolutional Neural Networks that applies techniques such as Edge Detection and Hand Gesture Recognition to convert sign language to text and audio while mitigating these drawbacks to the best of our abilities.
KEYWORDS: Sign Language Recognition, Convolutional Neural Network, Image Processing, Edge Detection, Hand
Gesture Recognition.
I. INTRODUCTION
Sign language recognition refers to the process of converting a user's signs and gestures into text, allowing for communication between individuals who are unable to speak and the general public. Image processing algorithms and neural networks are utilized to map gestures to the appropriate text found within the training data, thereby converting raw images and videos into text that can be easily read and understood. Communication can be challenging for individuals who are deaf or hard of hearing, as they often rely on visual communication. Sign language serves as the primary means of communication within the deaf and hard of hearing community, utilizing the visual modality to exchange information. However, many people are unaware of the grammar associated with sign language, limiting communication opportunities for individuals who rely on it. This has led to a growing demand for a computer-based system that can accurately recognize and translate sign language. Researchers have been working to address this problem by developing technologies that can recognize speech, facial expressions, and human gestures.
II. RELATED WORK
The concept of image inpainting was first introduced by Bertalmio et al. [1]. The method was inspired by the real inpainting process of artists. The image smoothness information interpolated by the image Laplacian is propagated along the isophote directions, which are estimated by the gradient of the image rotated by 90 degrees. The exemplar-based method proposed by Criminisi et al. [2] used a best exemplar patch to fill the target patch containing missing pixels. This technique combines structure propagation with texture synthesis and hence produces very good results. In [3], the authors decompose the image into the sum of two functions and then reconstruct each function separately with structure and texture filling-in algorithms. A morphological technique is used to extract text from the images presented in [4]. In [5], the inpainting technique is combined with techniques for finding text in images and a simple algorithm that links them. The technique is insensitive to noise, skew and text orientation. The authors in [6] applied connected component labelling (CCL) to detect the text, and a fast marching algorithm is used for inpainting.
The work in this paper is divided into two stages: 1) text detection and 2) inpainting. Text detection is done by applying morphological open-close and close-open filters and combining the resulting images. Thereafter, a gradient operator is applied to detect the edges, followed by thresholding and morphological dilation and erosion operations. Then, connected component labelling is performed to label each object separately. Finally, a set of selection criteria is applied to filter out non-text regions. After text detection, text inpainting is accomplished by using an exemplar-based inpainting algorithm.
III. METHODOLOGY
Our system employs convolutional neural networks to recognize hand gestures in sign language. It does so by capturing video and converting it into frames, which are then used to segment hand pixels. The resulting image is then compared to a trained model to obtain accurate text labels of letters. By using this approach, our system achieves robustness in accurately recognizing a variety of hand gestures. The system is a vision-based approach: all signs are represented with bare hands, which eliminates the need for any artificial devices for interaction.
A. DATASET GENERATION
In our project, we encountered a challenge in finding pre-existing datasets that met our requirements in the form of raw images, so we decided to create our own dataset using the OpenCV library. We captured approximately 800 images of each symbol in ASL for training purposes, and around 200 images per symbol for testing purposes. To create our dataset, we used the webcam on our machine to capture each frame. Within each frame, we defined a region of interest (ROI) using a blue bounded square, as shown in the image below. We then extracted the ROI, which was in RGB format, and converted it into a grayscale image, as shown below. Finally, we applied a Gaussian blur filter, which helped us extract various features of the image; the resulting image after applying the Gaussian blur is shown below.
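The capture-and-preprocessing step described above can be sketched as follows, assuming the OpenCV Python bindings; the ROI coordinates, blur kernel size, and window handling are illustrative assumptions rather than values reported in the paper.

import cv2

# Sketch of the dataset-capture pipeline: webcam frame -> blue ROI box ->
# grayscale -> Gaussian blur. Coordinates and kernel size are assumptions.
cap = cv2.VideoCapture(0)  # default webcam

while True:
    ret, frame = cap.read()
    if not ret:
        break

    # Region of interest marked by a blue bounded square (coordinates assumed)
    x1, y1, x2, y2 = 100, 100, 300, 300
    cv2.rectangle(frame, (x1, y1), (x2, y2), (255, 0, 0), 2)
    roi = frame[y1:y2, x1:x2]

    # RGB ROI -> grayscale -> Gaussian blur, as described in the text
    gray = cv2.cvtColor(roi, cv2.COLOR_BGR2GRAY)
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)

    cv2.imshow("frame", frame)
    cv2.imshow("processed", blurred)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()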
B. GESTURE CLASSIFICATION
• Algorithm Layer 1:
1) Apply a Gaussian blur filter and threshold to the frame captured with OpenCV to obtain the processed image after feature extraction.
2) This processed image is passed to the CNN model for prediction; if a letter is detected for more than 50 frames, the letter is printed and taken into consideration for forming the word (a sketch of this logic is given after this list).
3) Space between words is inserted using the blank symbol.
• Algorithm Layer 2:
1) We identify sets of symbols that produce similar results when detected.
2) We then distinguish between the symbols in each set using classifiers trained on those sets only.
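As referenced in step 2 of Layer 1 above, the frame-stability and blank-symbol logic can be sketched as follows. The 50-frame threshold comes from the text; the counter structure and the on_prediction helper are illustrative assumptions.

# Sketch of the Layer-1 decision logic: a letter is accepted only after it has
# been predicted for 50 consecutive frames; the blank symbol ends the word.
STABLE_FRAMES = 50  # threshold stated in the text

current_word = ""
sentence = ""
last_letter, streak = None, 0

def on_prediction(letter):
    """letter: the CNN's top prediction for the current frame ('blank' for no sign)."""
    global current_word, sentence, last_letter, streak
    if letter == last_letter:
        streak += 1
    else:
        last_letter, streak = letter, 1

    if streak == STABLE_FRAMES:
        if letter == "blank":          # blank symbol separates words
            if current_word:
                sentence += current_word + " "
                current_word = ""
        else:
            current_word += letter     # stable letter is appended to the word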
Layer 1:
• CNN Model
1) The first step in our convolutional neural network involves processing an input picture with a resolution of 128x128 pixels through the first convolutional layer, which uses 32 filter weights (3x3 pixels each). This produces a 126x126 pixel image, one for each filter weight.
2) Next, we downsample the image using max pooling of 2x2, keeping only the highest value in each 2x2 square
of the array. This results in a picture that is downsampled to 63x63 pixels.
3) We then process this 63x63 image through the second convolutional layer, which uses 32 filter weights (3x3
pixels each) and produces a 60x60 pixel image.
4) The resulting images are downsampled again using max pooling of 2x2, reducing the resolution to 30x30.
5) Next, we use the resulting images as input to a fully connected layer with 128 neurons. The output from the
second convolutional layer is reshaped into an array of 30x30x32=28,800 values.
6) The output of the first densely connected layer is then fed to a second fully connected layer with 96 neurons.
We also incorporate a dropout layer with a value of 0.5 to avoid overfitting.
7) Finally, the output of the second densely connected layer serves as input to the final layer, which has a
number of neurons equal to the number of classes we are classifying (i.e., alphabets + blank symbol).
• Activation Function:
In our convolutional neural network, we have incorporated the Rectified Linear Unit (ReLU) function in each
of the layers, including the convolutional and fully connected neurons. The ReLU function calculates max(x,0)
for each input pixel, which adds nonlinearity to the formula and helps to learn more complicated features. This
function helps to eliminate the vanishing gradient problem and speed up the training by reducing the
computation time.
• Pooling Layer:
We apply max pooling with a pool size of (2, 2) after the ReLU activation. This reduces the number of parameters, thus lessening the computation cost, and also reduces overfitting.
• Dropout Layers:
The dropout layer is added to our model to prevent overfitting, which occurs when the weights of the network
are overly adjusted to the training examples and the network does not generalize well to new examples. The
dropout layer randomly sets a subset of the activations in the previous layer to zero during training, which
helps prevent the model from relying too much on any specific activation. This allows the model to learn
more robust and generalizable features and reduces the chances of overfitting to the training data. During
testing, the dropout layer is disabled and all activations are used for making predictions.
• Optimizer:
To update our model in response to the output of the loss function, we have used the Adam
optimizer. Adam combines the advantages of two stochastic gradient descent algorithms: adaptive gradient
algorithm (AdaGrad) and root mean square propagation (RMSProp). AdaGrad adapts the learning rate of each
parameter based on its historical gradient information, allowing the learning rate to decrease for parameters
that are frequently updated and increase for infrequently updated ones. On the other hand, RMSProp divides
the learning rate by a running average of the magnitudes of recent gradients, which helps to avoid
exploding gradients and converge faster. Adam optimizer combines these two methods and adapts the learning
rate of each parameter based on the first and second moments of the gradients. This results in faster
convergence and better performance on a variety of deep learning tasks. A model-definition sketch combining the layers and settings described above is given below.
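Putting together the layer sizes, ReLU activations, 2x2 max pooling, 0.5 dropout, and Adam optimizer described above, a minimal model-definition sketch could look as follows. It assumes TensorFlow/Keras, a categorical cross-entropy loss, 27 output classes (26 alphabets plus the blank symbol), and a particular placement of the dropout layer; these choices beyond the stated layer sizes are assumptions, not details from the text.

from tensorflow.keras import layers, models

NUM_CLASSES = 27  # 26 alphabets + blank symbol (assumed count)

model = models.Sequential([
    # 128x128 grayscale input -> 32 filters of 3x3 -> 126x126 feature maps
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(128, 128, 1)),
    layers.MaxPooling2D((2, 2)),           # 2x2 max pooling downsamples the maps
    layers.Conv2D(32, (3, 3), activation='relu'),  # second convolutional layer
    layers.MaxPooling2D((2, 2)),           # downsample again
    layers.Flatten(),                      # reshape feature maps into one vector
    layers.Dense(128, activation='relu'),  # first fully connected layer
    layers.Dense(96, activation='relu'),   # second fully connected layer
    layers.Dropout(0.5),                   # dropout to reduce overfitting
    layers.Dense(NUM_CLASSES, activation='softmax'),  # one neuron per class
])

# Adam optimizer, as described above; loss choice is an assumption
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

A model of this shape can then be trained on the captured grayscale images with model.fit.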
Layer 2:
To improve the accuracy of our symbol detection and prediction, we have implemented two layers of
algorithms that can differentiate between symbols that are similar to each other. However, during our testing
phase, we encountered some issues where certain symbols were not being detected accurately, and were being
confused with other symbols. Specifically, we found that the following symbols were not being detected properly:
1) For D: R and U
2) For U: D and R
3) For I: T, D, K and I
4) For S: M and N
To handle the above cases, we made three different classifiers for classifying these sets (a routing sketch is given after this list):
1) D,R,U
2) T,K,D,I
3) S,M,N
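One possible way to wire this Layer-2 step is to route the Layer-1 prediction to a dedicated sub-classifier whenever it falls into one of the confusable sets. The sketch below assumes Keras models saved to the hypothetical files model_layer1.h5, model_dru.h5, model_tkdi.h5 and model_smn.h5, and an assumed label ordering; neither the file names nor the label order come from the paper.

from tensorflow.keras.models import load_model

# Layer-1 model plus one sub-classifier per confusable set (file names assumed)
layer1_model = load_model("model_layer1.h5")
sub_models = {
    frozenset("DRU"):  load_model("model_dru.h5"),
    frozenset("TKDI"): load_model("model_tkdi.h5"),
    frozenset("SMN"):  load_model("model_smn.h5"),
}

LABELS = ["blank"] + [chr(ord('A') + i) for i in range(26)]  # assumed label order

def predict(image):
    """image: preprocessed array of shape (1, 128, 128, 1); returns the predicted letter."""
    letter = LABELS[layer1_model.predict(image).argmax()]
    for letters, sub_model in sub_models.items():
        if letter in letters:
            # Re-classify the same frame with the set-specific classifier
            sub_index = sub_model.predict(image).argmax()
            letter = sorted(letters)[sub_index]  # assumed sub-model label order
            break
    return letter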
C. AUTOCORRECT FEATURE
The system incorporates the Hunspell Python library, whose suggest feature provides users with alternative words for any incorrectly spelled input word. These suggestions are displayed to the user, allowing them to select the most appropriate option to append to the current sentence. This feature not only reduces the number of spelling errors but also aids in the prediction of complex words.
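A minimal sketch of this autocorrect step, assuming the hunspell Python bindings and typical Linux dictionary paths (both assumptions, not details from the paper):

import hunspell  # pyhunspell bindings

# Dictionary paths are system dependent (assumed here for a typical Linux install)
checker = hunspell.HunSpell('/usr/share/hunspell/en_US.dic',
                            '/usr/share/hunspell/en_US.aff')

def suggestions(word):
    """Return the word itself if spelled correctly, else Hunspell's suggestions."""
    if checker.spell(word):
        return [word]
    return checker.suggest(word)

# The recognized word is shown to the user along with alternatives, and the
# chosen option is appended to the current sentence.
print(suggestions("helo"))  # e.g. ['hello', 'help', ...]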
IV. RESULTS
of 2.5%. In [8], a recognition model was developed using a hidden Markov model classifier and a vocabulary of 30 words, resulting in an error rate of 10.90%.
In [9], they achieved an average accuracy of 86% for 41 static gestures in Japanese Sign Language. Using depth sensors, [10] achieved an accuracy of 99.99% for observed signers and 83.58% and 85.49% for new signers, also using a CNN for their recognition system. It is worth noting that our model does not use any background subtraction algorithm, unlike some of the models mentioned above, so the accuracy may vary once we implement background subtraction in our project.
Additionally, most of the projects mentioned above use Kinect devices, but our goal was to create a project that could be used with readily available resources. Since a sensor like Kinect is neither readily available nor affordable for most people, our model uses a normal webcam, which is a great plus point. The confusion matrices for our results are shown below.
V. CONCLUSION
We have developed a functional, real-time American Sign Language (ASL) recognition system for the deaf and hard of hearing community, specifically for ASL alphabets. Our model achieved a final accuracy of 98.0% on our dataset. To improve our prediction accuracy, we implemented two layers of algorithms that verify and predict symbols that are more similar to each other. As a result, our system is now able to detect almost all symbols accurately, provided they are shown properly, there is no background noise, and the lighting conditions are adequate.
REFERENCES
[1] T. Yang and Y. Xu, "Hidden Markov Model for Gesture Recognition", CMU-RI-TR-94-10, Robotics Institute, Carnegie Mellon University, Pittsburgh, PA, May 1994.
[2] P. Ziaie, T. Müller, M. E. Foster, and A. Knoll, "A Naive Bayes Classifier with Distance Weighting for Hand-Gesture Recognition", Dept. of Informatics VI, Robotics and Embedded Systems, Technische Universität München, Boltzmannstr. 3, DE-85748 Garching, Germany.
[3] A. Deshpande, "A Beginner's Guide To Understanding Convolutional Neural Networks (Part 2)", adeshpande3.github.io.
[4] M. W. Kadous, "Machine Recognition of Auslan Signs Using PowerGloves: Towards Large-Lexicon Recognition of Sign Language".
[5] Pigou L., Dieleman S., Kindermans PJ., Schrauwen B. (2015) Sign Language Recognition Using Convolutional
Neural Networks. In: Agapito L., Bronstein M., Rother C. (eds) Computer Vision – ECCV 2014 Workshops.
ECCV 2014. Lecture Notes in Computer Science, vol 8925. Springer, Cham
[6] M. M. Zaki and S. I. Shaheen, "Sign language recognition using a combination of new vision based features", Pattern Recognition Letters 32(4), 572-577 (2011).
[7] N. Mukai, N. Harada and Y. Chang, "Japanese Fingerspelling Recognition Based on Classification Tree and Machine Learning", 2017 Nicograph International (NicoInt), Kyoto, Japan, 2017, pp. 19-24. doi:10.1109/NICOInt.2017.9
[8] https://opencv.org/
[9] https://en.wikipedia.org/wiki/TensorFlow
[10] https://en.wikipedia.org/wiki/Convolutional_neural_network