Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence (IJCAI-22)

SVTR: Scene Text Recognition with a Single Visual Model


Yongkun Du1, Zhineng Chen2*, Caiyan Jia1, Xiaoting Yin3, Tianlun Zheng2, Chenxia Li3, Yuning Du3, Yu-Gang Jiang2
1 School of Computer and Information Technology and Beijing Key Lab of Traffic Data Analysis and Mining, Beijing Jiaotong University, China
2 Shanghai Collaborative Innovation Center of Intelligent Visual Computing, School of Computer Science, Fudan University, China
3 Baidu Inc., China
{yongkundu, cyjia}@[Link], {zhinchen, ygj}@[Link], tlzheng21@[Link], {yinxiaoting, lichenxia, duyuning}@[Link]
* Corresponding Author

Abstract

Dominant scene text recognition models commonly contain two building blocks, a visual model for feature extraction and a sequence model for text transcription. This hybrid architecture, although accurate, is complex and less efficient. In this study, we propose a Single Visual model for Scene Text recognition within the patch-wise image tokenization framework, which dispenses with sequential modeling entirely. The method, termed SVTR, first decomposes an image text into small patches named character components. Afterward, hierarchical stages are recurrently carried out by component-level mixing, merging and/or combining. Global and local mixing blocks are devised to perceive the inter-character and intra-character patterns, leading to a multi-grained character component perception. Thus, characters are recognized by a simple linear prediction. Experimental results on both English and Chinese scene text recognition tasks demonstrate the effectiveness of SVTR. SVTR-L (Large) achieves highly competitive accuracy in English and outperforms existing methods by a large margin in Chinese, while running faster. In addition, SVTR-T (Tiny) is an effective and much smaller model, which shows appealing speed at inference. The code is publicly available at [Link]

Figure 1: (a) CNN-RNN based models. (b) Encoder-Decoder models. MHSA and MHA denote multi-head self-attention and multi-head attention, respectively. (c) Vision-Language models. (d) Our SVTR, which recognizes scene text with a single visual model and is efficient, accurate and cross-lingually versatile.

1 Introduction
Scene text recognition aims to transcribe text in a natural image into a digital character sequence, which conveys high-level semantics vital for scene understanding. The task is challenging due to variations in text deformation, fonts, occlusion, cluttered background, etc. In past years, many efforts have been made to improve recognition accuracy. Modern text recognizers, besides accuracy, also take factors like inference speed into account because of practical requirements.

Methodologically, scene text recognition can be viewed as a cross-modal mapping from image to character sequence. Typically, the recognizer consists of two building blocks, a visual model for feature extraction and a sequence model for text transcription. For example, CNN-RNN based models [Zhai et al., 2016; Shi et al., 2017] first employed a CNN for feature extraction. The feature was then reshaped as a sequence and modeled by a BiLSTM with CTC loss to get the prediction (Figure 1(a)). They are featured by efficiency and remain the choice for some commercial recognizers. However, the reshaping is sensitive to text disturbances such as deformation, occlusion, etc., limiting their effectiveness.

Later, encoder-decoder based auto-regressive methods became popular [Sheng et al., 2019; Li et al., 2019; Zheng et al., 2021]; these methods transform the recognition into an iterative decoding procedure (Figure 1(b)). As a result, improved accuracy was obtained, as context information was considered. However, the inference speed is slow due to the character-by-character transcription. The pipeline was further extended to the vision-language based framework [Yu et al., 2020; Fang et al., 2021], where language knowledge is incorporated (Figure 1(c)) and parallel prediction is conducted. However, this pipeline often requires a large-capacity model or a complex recognition paradigm to ensure the recognition accuracy, restricting its efficiency.
[Figure 2 diagram: an image text of size H×W×3 is embedded into character components CC0 of size H/4 × W/4 × D0 by the patch embedding; Stage 1 (mixing blocks, then merging) yields CC1 of size H/8 × W/4 × D1, Stage 2 yields CC2 of size H/16 × W/4 × D2, and Stage 3 (mixing blocks, then combining and an FC layer) yields C of size 1 × W/4 × D3, from which the text (e.g., "qbhouse") is predicted. Each mixing block consists of layer norm, local/global mixing, layer norm and an MLP with shortcut connections.]
Figure 2: Overall architecture of the proposed SVTR. It is a three-stage height progressively decreased network. In each stage, a series of
mixing blocks are carried out and followed by a merging or combining operation. At last, the recognition is conducted by a linear prediction.

Recently, efforts have been made to develop simplified architectures to accelerate inference, for example, by using a complex training paradigm but a simple model at inference. The CNN-RNN based solution was revisited in [Hu et al., 2020]. It utilized an attention mechanism and a graph neural network to aggregate sequential features corresponding to the same character; at inference, the attention modeling branch was discarded to balance accuracy and speed. PREN2D [Yan et al., 2021] further simplified the recognition by aggregating and decoding the 1D sub-character features simultaneously. [Wang et al., 2021] proposed VisionLAN, which introduced character-wise occluded learning to endow the visual model with language capability, while at inference only the visual model is applied for speedup. In view of the simplicity of a single visual model based architecture, some recognizers were proposed by employing an off-the-shelf CNN [Borisyuk et al., 2018] or ViT [Atienza, 2021] as the feature extractor. Despite being efficient, their accuracy is less competitive compared to state-of-the-art methods.

We argue that the single visual model based scheme is effective only if discriminative character features can be extracted. Specifically, the model should successfully capture both intra-character local patterns and inter-character long-term dependence. The former encodes stroke-like features that describe fine-grained characteristics of a character, a critical source for distinguishing characters, while the latter records language-analogous knowledge that describes the characters from a complementary aspect. However, the two properties are not well modeled by previous feature extractors. For example, CNN backbones are good at modeling local correlation rather than global dependence, while current transformer-based general-purpose backbones do not give privilege to local character patterns.

Motivated by the issues mentioned above, this work aims to enhance the recognition capability by reinforcing the visual model. To this end, we propose SVTR, a visual model based recognizer for accurate, fast and cross-lingually versatile scene text recognition. Inspired by the recent success of vision transformers [Dosovitskiy et al., 2021; Liu et al., 2021], SVTR first decomposes an image text into small 2D patches termed character components, each of which may contain only a part of a character. Patch-wise image tokenization followed by self-attention is then applied to capture recognition clues among character components. Specifically, a text-customized architecture is developed for this purpose. It is a three-stage, height-progressively-decreased backbone with mixing, merging and/or combining operations. Local and global mixing blocks are devised and recurrently employed at each stage, together with the merging or combining operation, acquiring both the local component-level affinities that represent the stroke-like features of a character and the long-term dependence among different characters. Therefore, the backbone extracts component features of different distances and at multiple scales, forming a multi-grained character feature perception. As a result, the recognition is reached by a simple linear prediction. In the whole process only one visual model is employed. We construct four architecture variants with varying capacities. Extensive experiments are carried out on both English and Chinese scene text recognition tasks. It is shown that SVTR-L (Large) achieves highly competitive results in English and outperforms state-of-the-art models by a large margin in Chinese, while running faster. Meanwhile, SVTR-T (Tiny) is also an effective and much smaller model with appealing inference speed. The main contributions of this work are summarized as follows.

• We demonstrate, for the first time, that a single visual model can achieve accuracy competitive with, or even higher than, advanced vision-language models in scene text recognition. It is promising for practical applications due to its efficiency and cross-lingual versatility.

• We propose SVTR, a text-customized model for recognition. It introduces local and global mixing blocks for extracting stroke-like features and inter-character dependence, respectively, which, together with the multi-scale backbone, form a multi-grained feature description.
• Empirical studies on public benchmarks demonstrate the superiority of SVTR. SVTR-L achieves state-of-the-art performance in recognizing both English and Chinese scene text, while SVTR-T is effective yet efficient, with 6.03M parameters and consuming 4.5ms per image text on average on one NVIDIA 1080Ti GPU.

2 Method

2.1 Overall Architecture
An overview of the proposed SVTR is illustrated in Figure 2. It is a three-stage, height-progressively-decreased network dedicated to text recognition. For an image text of size H × W × 3, it is first transformed into H/4 × W/4 patches of dimension D0 via a progressive overlapping patch embedding. The patches are termed character components, each associated with a fraction of a text character in the image. Then, three stages, each composed of a series of mixing blocks followed by a merging or combining operation, are carried out at different scales for feature extraction. Local and global mixing blocks are devised for stroke-like local pattern extraction and inter-component dependence capturing, respectively. With this backbone, component features and dependence of different distances and at multiple scales are characterized, generating a representation referred to as C, of size 1 × W/4 × D3, which perceives multi-grained character features. Finally, a parallel linear prediction with de-duplication is conducted to get the character sequence.
2.2 Progressive Overlapping Patch Embedding
Given an image text, the first task is to obtain feature patches that represent character components, i.e., to map X ∈ R^(H×W×3) to CC0 ∈ R^(H/4 × W/4 × D0). There exist two common one-step projections for this purpose, i.e., a 4 × 4 disjoint linear projection (see Figure 3(a)) and a 7 × 7 convolution with stride 4. Alternatively, we implement the patch embedding by using two consecutive 3 × 3 convolutions with stride 2 and batch normalization, as shown in Figure 3(b). The scheme, despite increasing the computational cost a little, adds the feature dimension progressively, which is in favor of feature fusion. The ablation study in Section 3.3 shows its effectiveness.

Figure 3: (a) The linear projection in ViT [Dosovitskiy et al., 2021]. (b) Our progressive overlapping patch embedding.
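To make the patch embedding concrete, below is a minimal PyTorch-style sketch of the progressive overlapping scheme described above (two 3 × 3 convolutions with stride 2 and batch normalization). The module name, the GELU activation and the example input size are our own illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

class ProgressiveOverlappingPatchEmbed(nn.Module):
    """Embed an H x W x 3 image text into (H/4 * W/4) character components
    of dimension embed_dim via two consecutive stride-2 3x3 convolutions."""
    def __init__(self, in_channels: int = 3, embed_dim: int = 64):
        super().__init__()
        self.proj = nn.Sequential(
            # H x W -> H/2 x W/2, with half the target channels
            nn.Conv2d(in_channels, embed_dim // 2, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(embed_dim // 2),
            nn.GELU(),  # activation choice is an assumption
            # H/2 x W/2 -> H/4 x W/4, full embedding dimension
            nn.Conv2d(embed_dim // 2, embed_dim, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(embed_dim),
            nn.GELU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.proj(x)                      # B x D0 x H/4 x W/4
        return x.flatten(2).transpose(1, 2)   # B x (H/4 * W/4) x D0

# Example: a 32 x 100 image text becomes 8 x 25 = 200 character components.
tokens = ProgressiveOverlappingPatchEmbed(embed_dim=64)(torch.randn(1, 3, 32, 100))
print(tokens.shape)  # torch.Size([1, 200, 64])
```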
2.3 Mixing Block
Since two characters may differ only slightly, text recognition heavily relies on features at the character component level. However, existing studies mostly employ a feature sequence to represent the image text. Each feature corresponds to a thin-slice image region, which is often noisy, especially for irregular text, and is therefore not optimal for describing a character. The recent advancement of vision transformers introduces a 2D feature representation, but how to leverage it in the context of text recognition is still worth investigating.

More specifically, with the embedded components, we argue that text recognition requires two kinds of features. The first is local component patterns such as the stroke-like feature, which encode the morphology of a character and the correlation between its different parts. The second is inter-character dependence, such as the correlation between different characters or between text and non-text components. Therefore, we devise two mixing blocks to perceive these correlations by using self-attention with different receptive fields.

Figure 4: Illustration of (a) global mixing and (b) local mixing.

Global Mixing. As seen in Figure 4(a), global mixing evaluates the dependence among all character components. Since text and non-text are the two major elements in an image, such a general-purpose mixing can establish long-term dependence among components from different characters. Besides, it is also capable of weakening the influence of non-text components while enhancing the importance of text components. Mathematically, the character components CCi−1 from the previous stage are first reshaped into a feature sequence. When fed into the mixing block, a layer norm is applied, followed by multi-head self-attention for dependence modeling. Then a layer norm and an MLP are sequentially applied for feature fusion. Together with shortcut connections, these form the global mixing block.

Local Mixing. As seen in Figure 4(b), local mixing evaluates the correlation among components within a predefined window. Its objective is to encode the character morphology and establish the associations between components within a character, which simulates the stroke-like features vital for character identification. Different from global mixing, local mixing considers a neighborhood for each component. Similar to convolution, the mixing is carried out in a sliding-window manner, with the window size empirically set to 7 × 11. Compared with global mixing, it thus implements the self-attention mechanism within a local window to capture local patterns.
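Below is a minimal sketch of a mixing block following the description above (layer norm, multi-head self-attention, layer norm and MLP, each with a shortcut connection); local mixing is approximated here by a boolean mask that restricts attention to a 7 × 11 neighborhood on the h × w component grid. The tensor layout, class names and MLP ratio are assumptions for illustration rather than the authors' implementation.

```python
import torch
import torch.nn as nn

def local_mask(h: int, w: int, win_h: int = 7, win_w: int = 11) -> torch.Tensor:
    """Boolean mask (hw x hw): True where attention is blocked, i.e. outside
    the win_h x win_w neighborhood of each component on the h x w grid."""
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    ys, xs = ys.flatten(), xs.flatten()
    dy = (ys[:, None] - ys[None, :]).abs()
    dx = (xs[:, None] - xs[None, :]).abs()
    return (dy > win_h // 2) | (dx > win_w // 2)

class MixingBlock(nn.Module):
    """LN -> MHSA -> LN -> MLP with residual shortcuts; global or local mixing."""
    def __init__(self, dim: int, heads: int, h: int, w: int, local: bool = False):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # For local mixing, block attention outside the 7 x 11 window.
        self.register_buffer("mask", local_mask(h, w) if local else None)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: B x (h*w) x dim
        y = self.norm1(x)
        x = x + self.attn(y, y, y, attn_mask=self.mask, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))

# Example: an 8 x 25 grid of 64-dim components, as in stage 1 of a tiny model.
blk = MixingBlock(dim=64, heads=2, h=8, w=25, local=True)
out = blk(torch.randn(2, 8 * 25, 64))
print(out.shape)  # torch.Size([2, 200, 64])
```

In this sketch, global mixing is the same block with the mask disabled, so the two variants differ only in the receptive field of the attention.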

As aforementioned, the two mixing blocks aim to extract different features that are complementary. In SVTR, the blocks are recurrently applied many times in each stage for comprehensive feature extraction. The permutation of the two kinds of blocks is ablated later.

2.4 Merging
It is computationally expensive to maintain a constant spatial resolution across stages, which also leads to redundant representation. As a consequence, we devise a merging operation following the mixing blocks at each stage (except the last one). The features outputted from the last mixing block are first reshaped into an embedding of size h × w × di−1, denoting the current height, width and channels, respectively. Then, we employ a 3 × 3 convolution with stride 2 in the height dimension and stride 1 in the width dimension, followed by a layer norm, generating an embedding of size h/2 × w × di.

The merging operation halves the height while keeping a constant width. It not only reduces the computational cost, but also builds a text-customized hierarchical structure. Typically, most image text appears horizontally or near horizontally. Compressing the height dimension establishes a multi-scale representation for each character while not affecting the patch layout in the width dimension. Therefore, it does not increase the chance of encoding adjacent characters into the same component across stages. We also increase the channel dimension di to compensate for the information loss.
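A minimal sketch of the merging operation under the description above, assuming the components arrive as a B × (h·w) × d sequence: reshape to a 2D map, apply a 3 × 3 convolution with stride 2 in height and 1 in width that also raises the channel count, then a layer norm.

```python
import torch
import torch.nn as nn

class Merging(nn.Module):
    """Halve the height of the component map while keeping the width,
    increasing the channel dimension from dim_in to dim_out."""
    def __init__(self, dim_in: int, dim_out: int):
        super().__init__()
        self.conv = nn.Conv2d(dim_in, dim_out, kernel_size=3, stride=(2, 1), padding=1)
        self.norm = nn.LayerNorm(dim_out)

    def forward(self, x: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # x: B x (h*w) x dim_in  ->  B x dim_in x h x w
        x = x.transpose(1, 2).reshape(x.size(0), -1, h, w)
        x = self.conv(x)                       # B x dim_out x h/2 x w
        x = x.flatten(2).transpose(1, 2)       # B x (h/2 * w) x dim_out
        return self.norm(x)

# Example: stage-1 output of a tiny model, 8 x 25 components of dim 64 -> 4 x 25 of dim 128.
merged = Merging(64, 128)(torch.randn(2, 8 * 25, 64), h=8, w=25)
print(merged.shape)  # torch.Size([2, 100, 128])
```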
2.5 Combining and Prediction
In the last stage, the merging operation is replaced by a combining operation. It first pools the height dimension to 1, followed by a fully-connected layer, a non-linear activation and dropout. By doing this, character components are further compressed to a feature sequence in which each element is represented by a feature of length D3. Compared to the merging operation, the combining operation avoids applying a convolution to an embedding whose size is very small in one dimension, e.g., a height of 2.

With the combined feature, we implement the recognition by a simple parallel linear prediction. Concretely, a linear classifier with N nodes is employed. It generates a transcript sequence of size W/4 in which, ideally, components of the same character are transcribed to duplicate characters and components of non-text are transcribed to a blank symbol. The sequence is automatically condensed to the final result. In the implementation, N is set to 37 for English and 6625 for Chinese.
mentation, N is set to 37 for English and 6625 for Chinese.
2.6 Architecture Variants
There are several hyper-parameters in SVTR, including the channel depth and the number of heads at each stage, as well as the number of mixing blocks and their permutation. By varying them, SVTR architectures with different capacities can be obtained, and we construct four typical ones, i.e., SVTR-T (Tiny), SVTR-S (Small), SVTR-B (Base) and SVTR-L (Large). Their detailed configurations are shown in Table 1.

3 Experiments

3.1 Datasets
For the English recognition task, our models are trained on two commonly used synthetic scene text datasets, i.e., MJSynth (MJ) [Jaderberg et al., 2014; Jaderberg et al., 2015] and SynthText (ST) [Gupta et al., 2016]. The models are then tested on six public benchmarks as follows. ICDAR 2013 (IC13) [Karatzas et al., 2013] contains 1095 testing images; we discard images that contain non-alphanumeric characters or fewer than three characters, leaving 857 images. Street View Text (SVT) [Wang et al., 2011] has 647 testing images cropped from Google Street View; many images are severely corrupted by noise, blur and low resolution. IIIT5K-Words (IIIT) [Mishra et al., 2012] is collected from the web and contains 3000 testing images. ICDAR 2015 (IC15) [Karatzas et al., 2015] is taken with Google Glasses without careful positioning and focusing; IC15 has two versions, with 1,811 and 2,077 images, and we use the former. Street View Text-Perspective (SVTP) [Phan et al., 2013] is also cropped from Google Street View; there are 639 test images in this set and many of them are perspectively distorted. CUTE80 (CUTE) is proposed in [Anhar et al., 2014] for curved text recognition; its 288 testing images are cropped from full images by using annotated words.

For the Chinese recognition task, we use the Chinese Scene Dataset [Chen et al., 2021]. It is a public dataset containing 509,164, 63,645 and 63,646 training, validation and test images, respectively. The validation set is utilized to determine the best model, which is then assessed on the test set.

3.2 Implementation Details
SVTR uses the rectification module [Shi et al., 2019], where the image text is resized to 32 × 64 for distortion correction. We use the AdamW optimizer with a weight decay of 0.05 for training. For English models, the initial learning rate is set to 5 × 10⁻⁴ × batchsize/2048, and a cosine learning rate scheduler with a 2-epoch linear warm-up is used over all 21 epochs. Data augmentation such as rotation, perspective distortion, motion blur and Gaussian noise is randomly applied during training. The alphabet includes all case-insensitive alphanumerics. The maximum prediction length is set to 25, which exceeds the length of the vast majority of English words. For Chinese models, the initial learning rate is set to 3 × 10⁻⁴ × batchsize/512, and a cosine learning rate scheduler with a 5-epoch linear warm-up is used over all 100 epochs. Data augmentation is not used for training, and the maximum prediction length is set to 40. Word accuracy is used as the evaluation metric. All models are trained with 4 Tesla V100 GPUs on PaddlePaddle.
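As a worked example of the learning-rate rule and schedule, the following PyTorch-style sketch reproduces the rule for English models; the paper trains with PaddlePaddle, and the exact warm-up implementation here is an assumption.

```python
import math
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

batch_size, epochs, warmup_epochs = 512, 21, 2
base_lr = 5e-4 * batch_size / 2048            # English rule: 5e-4 x batchsize/2048 -> 1.25e-4

model = torch.nn.Linear(192, 37)               # stand-in for the recognizer
optimizer = AdamW(model.parameters(), lr=base_lr, weight_decay=0.05)

def lr_factor(epoch: int) -> float:
    # Linear warm-up for the first epochs, then cosine decay to zero.
    if epoch < warmup_epochs:
        return (epoch + 1) / warmup_epochs
    t = (epoch - warmup_epochs) / max(1, epochs - warmup_epochs)
    return 0.5 * (1.0 + math.cos(math.pi * t))

scheduler = LambdaLR(optimizer, lr_lambda=lr_factor)  # call scheduler.step() once per epoch
```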
3.3 Ablation Study
To better understand SVTR, we perform controlled experiments on both IC13 (regular text) and IC15 (irregular text) under different settings. For efficiency, all the experiments are carried out using SVTR-T without the rectification module and data augmentation.


Models [D0 , D1 , D2 ] [L1 , L2 , L3 ] Heads D3 Permutation Params (M) FLOPs (G)


SVTR-T [64,128,256] [3,6,3] [2,4,8] 192 [L]6 [G]6 4.15 0.29
SVTR-S [96,192,256] [3,6,6] [3,6,8] 192 [L]8 [G]7 8.45 0.63
SVTR-B [128,256,384] [3,6,9] [4,8,12] 256 [L]8 [G]10 22.66 3.55
SVTR-L [192,256,512] [3,9,9] [6,8,16] 384 [L]10 [G]11 38.81 6.07

Table 1: Architecture specifications of SVTR variants (w/o counting the rectification module and linear classifier).
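To make Table 1 concrete, the SVTR-T row can be read off as a configuration such as the following; the field names are ours, chosen for illustration.

```python
# SVTR-T configuration read off Table 1 (field names are illustrative).
svtr_tiny = dict(
    embed_dims=[64, 128, 256],              # [D0, D1, D2]: channel depth of the three stages
    depths=[3, 6, 3],                       # [L1, L2, L3]: number of mixing blocks per stage
    num_heads=[2, 4, 8],                    # attention heads per stage
    out_dim=192,                            # D3: dimension after the combining operation
    mixer=["Local"] * 6 + ["Global"] * 6,   # the [L]6[G]6 permutation over the 12 blocks
    num_classes=37,                         # English setting (N = 37)
)
```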

Embedding   IC13   IC15          Merging   IC13   IC15   FLOPs (G)
Linear      92.5   72.0          None      92.4   71.8   1.10
Overlap     93.0   73.9          Prog      93.5   74.8   0.29
Ours        93.5   74.8

Table 2: Ablation study on patch embedding (the left half) and merging (the right half).

CP           IC13   IC15         CP       IC13   IC15
None         91.6   68.2         [G]12    92.7   73.7
[G]6[L]6     92.2   71.9         [L]12    91.3   70.2
[L]6[G]6     93.5   74.8         [GL]6    91.9   72.4
[LG]6        93.5   73.5

Table 3: Ablation study on mixing block permutation.

Figure 5: Accuracy-parameter (M) and accuracy-speed (ms) plots of different models on IC15.


The Effectiveness of Patch Embedding
As seen in Table 2 (the left half), different embedding strategies behave slightly differently in recognition accuracy. Our progressive embedding scheme outperforms the two default ones by 0.75% and 2.8% on average on the two datasets, indicating that it is effective especially for irregular text.

The Effectiveness of Merging
There are also two choices, i.e., applying the merging operation to build a progressively decreased network (prog), or keeping a constant spatial resolution across stages (none). As indicated in the right half of Table 2, the merging not only reduces the computational cost, but also increases the accuracy on both datasets. It verifies that multi-scale sampling along the height dimension is effective for text recognition.

The Permutation of Mixing Blocks
There are various ways to group the global and local mixing blocks in each stage. Several of them are assessed in Table 3, where none means no mixing block is taken into account. [G]6[L]6 means that, for each stage, six global mixing blocks are carried out first, followed by six local mixing blocks; the others are defined similarly. As can be seen, almost every scheme gives a certain accuracy improvement. We believe the improvements are attributed to perceiving comprehensive character component features. The relatively large gains on irregular text further indicate that the mixing blocks are helpful to feature learning in complex scenarios. It is observed that [L]6[G]6 reports the best accuracy, giving gains of 1.9% and 6.6% on IC13 and IC15 when compared with none. Placing the local mixing blocks in front of the global ones helps guide the global mixing blocks to focus on long-term dependence capturing. On the contrary, switching their permutation is likely to confuse the role of the global mixing blocks, which may repetitively attend to local characteristics.

3.4 Comparison with State-of-the-Art
We compare SVTR with previous works on six English benchmarks covering regular and irregular text and on one Chinese scene dataset in Table 4. The methods are grouped by whether they utilize language information, i.e., lan-free and lan-aware. We first look at the results on the English datasets. Even SVTR-T, the smallest of the SVTR family, achieves highly competitive accuracy among the lan-free methods, while the other SVTRs set a new state of the art on most datasets. When further compared with lan-aware methods, SVTR-L attains the best accuracy on IIIT, IC15 and CUTE among the six English benchmarks. Its overall accuracy is on par with recent studies [Fang et al., 2021; Tang et al., 2022] that use extra language models. In contrast, SVTR enjoys its simplicity and runs faster. The results imply that a single visual model can also perform the recognition well when discriminative character features are successfully extracted.

We then analyze the results on the Chinese Scene Dataset. [Chen et al., 2021] gives the accuracy of six existing methods, as shown in the table. Encouragingly, SVTR performs considerably well. Compared with SAR, the best one among the listed methods, accuracy gains ranging from 5.4% to 9.6% are observed, which are noticeable improvements. The result is explained by SVTR comprehensively perceiving multi-grained character component features, which are well suited to characterizing Chinese words with their rich stroke patterns.

In Figure 5, we also depict the accuracy, parameters and inference speed of different models on IC15. Owing to their simplicity, SVTRs consistently rank top-tier in both the accuracy-parameter and accuracy-speed plots, further demonstrating their superiority compared to existing methods.


Method                            IC13   SVT    IIIT5k  IC15   SVTP   CUTE   Chinese Scene  Params (M)  Speed (ms)

Lan-free
CRNN [Shi et al., 2017]           91.1   81.6   82.9    69.4   70.0   65.5   53.4           8.3         6.3
Rosetta [Borisyuk et al., 2018]   90.9   84.7   84.3    71.2   73.8   69.2   -              44.3        10.5
SRN* [Yu et al., 2020]            93.2   88.1   92.3    77.5   79.4   84.7   -              -           -
PREN* [Yan et al., 2021]          94.7   92.0   92.1    79.2   83.9   81.3   -              29.1        40.0
ViTSTR [Atienza, 2021]            93.2   87.7   88.4    78.5   81.8   81.3   -              85.5        11.2
ABINet* [Fang et al., 2021]       94.9   90.4   94.6    81.7   84.2   86.5   -              23.5        50.6
VST* [Tang et al., 2022]          95.6   91.9   95.6    82.3   87.0   91.8   -              -           -

Lan-aware
ASTER [Shi et al., 2019]          -      89.5   93.4    76.1   78.5   79.5   54.5           27.2        -
MORAN [Luo et al., 2019]          -      88.3   91.2    -      76.1   77.4   51.8           28.5        -
NRTR [Sheng et al., 2019]         94.7   88.3   86.5    -      -      -      -              31.7        160
SAR [Li et al., 2019]             91.0   84.5   91.5    69.2   76.4   83.5   62.5           57.5        120
AutoSTR [Zhang et al., 2020]      -      90.9   94.7    81.8   81.7   84.0   -              10.4        207
SRN [Yu et al., 2020]             95.5   91.5   94.8    82.7   85.1   87.8   60.1           54.7        25.4
PREN2D [Yan et al., 2021]         96.4   94.0   95.6    83.0   87.6   91.7   -              -           -
VisionLAN [Wang et al., 2021]     95.7   91.7   95.8    83.7   86.0   88.5   -              32.8        28.0
ABINet [Fang et al., 2021]        97.4   93.5   96.2    86.0   89.3   89.2   -              36.7        51.3
VST [Tang et al., 2022]           96.4   93.8   96.3    85.4   88.7   95.1   -              64.0        -

Ours
SVTR-T (Tiny)                     96.3   91.6   94.4    84.1   85.4   88.2   67.9           6.03        4.5
SVTR-S (Small)                    95.7   93.0   95.0    84.7   87.9   92.0   69.0           10.3        8.0
SVTR-B (Base)                     97.1   91.5   96.0    85.2   89.9   91.7   71.4           24.6        8.5
SVTR-L (Large)                    97.2   91.7   96.3    86.6   88.4   95.1   72.1           40.8        18.0

Table 4: Results on six English benchmarks (regular: IC13, SVT, IIIT5k; irregular: IC15, SVTP, CUTE) and one Chinese benchmark (Scene), tested against existing methods. CRNN and Rosetta are from the reproduction of CombBest [Baek et al., 2019]. Lan means language and * denotes the language-free version of the corresponding method. The speed is the inference time on one NVIDIA 1080Ti GPU averaged over 3000 English image texts.

3.5 Visualization Analysis
We visualize the attention maps of SVTR-T when decoding different character components. Each map can be explained as serving a different role in the whole recognition. Nine maps are selected for illustration, as shown in Figure 6. The first row shows three maps gazing into a fraction of the character "B", with emphasis on its left side, bottom and middle parts, respectively, indicating that different character regions contribute to the recognition. The second row exhibits three maps gazing into different characters, i.e., "B", "L" and "S", showing that SVTR-T is also able to learn features by viewing a character as a whole. The third row exhibits three maps that simultaneously activate multiple characters, implying that dependencies among different characters are successfully captured. The three rows together reveal that sub-character, character-level and cross-character clues are all captured by the recognizer, in accordance with the claim that SVTR perceives multi-grained character component features. It again explains the effectiveness of SVTR.

Figure 6: Visualization of SVTR-T attention maps.

With the results and discussion above, we conclude that SVTR-L enjoys the merits of accuracy, speed and versatility, making it a highly competitive choice for accuracy-oriented applications, while SVTR-T is an effective and much smaller model that is also quite fast, making it appealing in resource-limited scenarios.

4 Conclusion
We have presented SVTR, a customized visual model for scene text recognition. It extracts multi-grained character features that describe both stroke-like local patterns and inter-component dependence of varying distance at multiple height scales. Therefore, the recognition task can be conducted by a single visual model, enjoying the merits of accuracy, efficiency and cross-lingual versatility. SVTR variants with different capacities are also devised to meet diverse application needs. Experiments on both English and Chinese benchmarks basically verify the proposed SVTR. Highly competitive or even better accuracy is observed compared to state-of-the-art methods, while running faster. We hope that SVTR will foster further research in scene text recognition.

Acknowledgments
The work is supported by the National Natural Science Foundation of China (No. 62172103, 61876016) and the CCF-Baidu Open Fund (No. 2021PP15002000).


References

[Anhar et al., 2014] R. Anhar, S. Palaiahnakote, C. S. Chan, and C. L. Tan. A robust arbitrary text detection system for natural scene images. In Expert Systems with Applications, pages 8027–8048, 2014.
[Atienza, 2021] R. Atienza. Vision transformer for fast and efficient scene text recognition. arXiv:2105.08582, 2021.
[Baek et al., 2019] J. Baek, G. Kim, J. Lee, S. Park, D. Han, S. Yun, S. J. Oh, and H. Lee. What is wrong with scene text recognition model comparisons? Dataset and model analysis. In ICCV, pages 4714–4722, 2019.
[Borisyuk et al., 2018] F. Borisyuk, A. Gordo, and V. Sivakumar. Rosetta: Large scale system for text detection and recognition in images. In ACM SIGKDD, pages 71–79, 2018.
[Chen et al., 2021] J. Chen, H. Yu, J. Ma, M. Guan, X. Xu, X. Wang, S. Qu, B. Li, and X. Xue. Benchmarking chinese text recognition: Datasets, baselines, and an empirical study. arXiv:2112.15093, 2021.
[Dosovitskiy et al., 2021] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
[Fang et al., 2021] S. Fang, H. Xie, Y. Wang, Z. Mao, and Y. Zhang. Read like humans: Autonomous, bidirectional and iterative language modeling for scene text recognition. In CVPR, 2021.
[Gupta et al., 2016] A. Gupta, A. Vedaldi, and A. Zisserman. Synthetic data for text localisation in natural images. In CVPR, pages 2315–2324, 2016.
[Hu et al., 2020] W. Hu, X. Cai, J. Hou, S. Yi, and Z. Lin. GTC: Guided training of CTC towards efficient and accurate scene text recognition. In AAAI, pages 11005–11012, 2020.
[Jaderberg et al., 2014] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman. Synthetic data and artificial neural networks for natural scene text recognition. In NeurIPS Deep Learning Workshop, 2014.
[Jaderberg et al., 2015] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman. Reading text in the wild with convolutional neural networks. In IJCV, pages 1–20, 2015.
[Karatzas et al., 2013] D. Karatzas, F. Shafait, S. Uchida, M. Iwamura, L. G. i. Bigorda, S. R. Mestre, J. Mas, D. F. Mota, J. A. Almazàn, and L. P. de las Heras. ICDAR 2013 robust reading competition. In ICDAR, pages 1484–1493, 2013.
[Karatzas et al., 2015] D. Karatzas, L. Gomez-Bigorda, A. Nicolaou, S. Ghosh, A. Bagdanov, M. Iwamura, J. Matas, L. Neumann, V. R. Chandrasekhar, S. Lu, F. Shafait, S. Uchida, and E. Valveny. ICDAR 2015 competition on robust reading. In ICDAR, pages 1156–1160, 2015.
[Li et al., 2019] H. Li, P. Wang, C. Shen, and G. Zhang. Show, attend and read: A simple and strong baseline for irregular text recognition. In AAAI, pages 8610–8617, 2019.
[Liu et al., 2021] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV, 2021.
[Luo et al., 2019] C. Luo, L. Jin, and Z. Sun. A multi-object rectified attention network for scene text recognition. In Pattern Recognit., pages 109–118, 2019.
[Mishra et al., 2012] A. Mishra, A. Karteek, and C. V. Jawahar. Scene text recognition using higher order language priors. In BMVC, 2012.
[Phan et al., 2013] T. Q. Phan, P. Shivakumara, S. Tian, and C. L. Tan. Recognizing text with perspective distortion in natural scenes. In CVPR, pages 569–576, 2013.
[Sheng et al., 2019] F. Sheng, Z. Chen, and B. Xu. NRTR: A no-recurrence sequence-to-sequence model for scene text recognition. In ICDAR, pages 781–786, 2019.
[Shi et al., 2017] B. Shi, X. Bai, and C. Yao. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. In TPAMI, pages 2298–2304, 2017.
[Shi et al., 2019] B. Shi, M. Yang, X. Wang, P. Lyu, C. Yao, and X. Bai. ASTER: An attentional scene text recognizer with flexible rectification. In TPAMI, pages 2035–2048, 2019.
[Tang et al., 2022] X. Tang, Y. Lai, Y. Liu, Y. Fu, and R. Fang. Visual-semantic transformer for scene text recognition. In AAAI, 2022.
[Wang et al., 2011] K. Wang, B. Babenko, and S. Belongie. End-to-end scene text recognition. In ICCV, pages 1457–1464, 2011.
[Wang et al., 2021] Y. Wang, H. Xie, S. Fang, J. Wang, S. Zhu, and Y. Zhang. From two to one: A new scene text recognizer with visual language modeling network. In ICCV, pages 14194–14203, 2021.
[Yan et al., 2021] R. Yan, L. Peng, S. Xiao, and G. Yao. Primitive representation learning for scene text recognition. In CVPR, pages 284–293, 2021.
[Yu et al., 2020] D. Yu, X. Li, C. Zhang, T. Liu, J. Han, J. Liu, and E. Ding. Towards accurate scene text recognition with semantic reasoning networks. In CVPR, pages 12113–12122, 2020.
[Zhai et al., 2016] C. Zhai, Z. Chen, J. Li, and B. Xu. Chinese image text recognition with BLSTM-CTC: A segmentation-free method. In CCPR, pages 525–536, 2016.
[Zhang et al., 2020] H. Zhang, Q. Yao, M. Yang, Y. Xu, and X. Bai. AutoSTR: Efficient backbone search for scene text recognition. In ECCV, pages 751–767. Springer, 2020.
[Zheng et al., 2021] T. Zheng, Z. Chen, S. Fang, H. Xie, and Y. Jiang. CDistNet: Perceiving multi-domain character distance for robust text recognition. arXiv:2111.11011, 2021.
