Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence (IJCAI-22)

SVTR: Scene Text Recognition with a Single Visual Model


Yongkun Du1, Zhineng Chen2*, Caiyan Jia1, Xiaoting Yin3, Tianlun Zheng2, Chenxia Li3, Yuning Du3, Yu-Gang Jiang2
1 School of Computer and Information Technology and Beijing Key Lab of Traffic Data Analysis and Mining, Beijing Jiaotong University, China
2 Shanghai Collaborative Innovation Center of Intelligent Visual Computing, School of Computer Science, Fudan University, China
3 Baidu Inc., China
{yongkundu, cyjia}@[Link], {zhinchen, ygj}@[Link], tlzheng21@[Link], {yinxiaoting, lichenxia, duyuning}@[Link]
* Corresponding Author

Abstract

Dominant scene text recognition models commonly contain two building blocks, a visual model for feature extraction and a sequence model for text transcription. This hybrid architecture, although accurate, is complex and less efficient. In this study, we propose a Single Visual model for Scene Text recognition within the patch-wise image tokenization framework, which dispenses with sequential modeling entirely. The method, termed SVTR, first decomposes an image text into small patches named character components. Afterward, hierarchical stages are recurrently carried out by component-level mixing, merging and/or combining. Global and local mixing blocks are devised to perceive the inter-character and intra-character patterns, leading to a multi-grained character component perception. Thus, characters are recognized by a simple linear prediction. Experimental results on both English and Chinese scene text recognition tasks demonstrate the effectiveness of SVTR. SVTR-L (Large) achieves highly competitive accuracy in English and outperforms existing methods by a large margin in Chinese, while running faster. In addition, SVTR-T (Tiny) is an effective and much smaller model, which shows appealing speed at inference. The code is publicly available at [Link]

Figure 1: (a) CNN-RNN based models. (b) Encoder-Decoder models. MHSA and MHA denote multi-head self-attention and multi-head attention, respectively. (c) Vision-Language models. (d) Our SVTR, which recognizes scene text with a single visual model and is efficient, accurate and cross-lingually versatile.

1 Introduction
Scene text recognition aims to transcribe text in a natural image into a digital character sequence, which conveys high-level semantics vital for scene understanding. The task is challenging due to variations in text deformation, fonts, occlusion, cluttered background, etc. In past years, many efforts have been made to improve recognition accuracy. Modern text recognizers, besides accuracy, also take factors like inference speed into account because of practical requirements.

Methodologically, scene text recognition can be viewed as a cross-modal mapping from image to character sequence. Typically, the recognizer consists of two building blocks, a visual model for feature extraction and a sequence model for text transcription. For example, CNN-RNN based models [Zhai et al., 2016; Shi et al., 2017] first employed a CNN for feature extraction. The feature was then reshaped as a sequence and modeled by a BiLSTM with CTC loss to get the prediction (Figure 1(a)). They are featured by efficiency and remain the choice for some commercial recognizers. However, the reshaping is sensitive to text disturbances such as deformation, occlusion, etc., limiting their effectiveness.

Later, encoder-decoder based auto-regressive methods became popular [Sheng et al., 2019; Li et al., 2019; Zheng et al., 2021]; these methods transform the recognition into an iterative decoding procedure (Figure 1(b)). As a result, improved accuracy was obtained, as context information was considered. However, the inference speed is slow due to the character-by-character transcription. The pipeline was further extended to the vision-language based framework [Yu et al., 2020; Fang et al., 2021], where language knowledge is incorporated (Figure 1(c)) and parallel prediction is conducted. However, this pipeline often requires a large-capacity model or a complex recognition paradigm to ensure the recognition accuracy, restricting its efficiency.
[Figure 2 diagram: an image text of size H×W×3 is embedded into character components CC0 of size H/4 × W/4 × D0 by the patch embedding; Stage 1 (mixing blocks, then merging) yields CC1 of size H/8 × W/4 × D1, Stage 2 yields CC2 of size H/16 × W/4 × D2, and Stage 3 (mixing blocks, then combining and an FC layer) yields C of size 1 × W/4 × D3, from which the text (e.g., "qbhouse") is predicted. Each mixing block consists of layer norm, local/global mixing, layer norm and an MLP with shortcut connections.]
Figure 2: Overall architecture of the proposed SVTR. It is a three-stage height progressively decreased network. In each stage, a series of
mixing blocks are carried out and followed by a merging or combining operation. At last, the recognition is conducted by a linear prediction.

Recently, efforts have been made to develop simplified architectures to accelerate inference, for example, by using a complex training paradigm but a simple model at inference. The CNN-RNN based solution was revisited in [Hu et al., 2020]. It utilized an attention mechanism and a graph neural network to aggregate sequential features corresponding to the same character; at inference, the attention modeling branch was discarded to balance accuracy and speed. PREN2D [Yan et al., 2021] further simplified the recognition by aggregating and decoding the 1D sub-character features simultaneously. [Wang et al., 2021] proposed VisionLAN, which introduced character-wise occluded learning to endow the visual model with language capability, while at inference only the visual model is applied for speedup. In view of the simplicity of a single visual model based architecture, some recognizers were proposed by employing an off-the-shelf CNN [Borisyuk et al., 2018] or ViT [Atienza, 2021] as the feature extractor. Despite being efficient, their accuracy is less competitive compared to state-of-the-art methods.

We argue that the single visual model based scheme is effective only if discriminative character features can be extracted. Specifically, the model should successfully capture both intra-character local patterns and inter-character long-term dependence. The former encodes stroke-like features that describe fine-grained characteristics of a character, a critical source for distinguishing characters, while the latter records language-analogous knowledge that describes the characters from a complementary aspect. However, the two properties are not well modeled by previous feature extractors. For example, CNN backbones are good at modeling local correlation rather than global dependence, while current transformer-based general-purpose backbones do not give privilege to local character patterns.

Motivated by the issues mentioned above, this work aims to enhance the recognition capability by reinforcing the visual model. To this end, we propose SVTR, a visual model based recognizer for accurate, fast and cross-lingually versatile scene text recognition. Inspired by the recent success of vision transformers [Dosovitskiy et al., 2021; Liu et al., 2021], SVTR first decomposes an image text into small 2D patches termed character components, each of which may contain only a part of a character. Patch-wise image tokenization followed by self-attention is then applied to capture recognition clues among character components. Specifically, a text-customized architecture is developed for this purpose. It is a three-stage, height-progressively-decreased backbone with mixing, merging and/or combining operations. Local and global mixing blocks are devised and recurrently employed at each stage, together with the merging or combining operation, acquiring both the local component-level affinities that represent the stroke-like features of a character and the long-term dependence among different characters. Therefore, the backbone extracts component features of different distances and at multiple scales, forming a multi-grained character feature perception. As a result, the recognition is reached by a simple linear prediction. In the whole process only one visual model is employed. We construct four architecture variants with varying capacities. Extensive experiments are carried out on both English and Chinese scene text recognition tasks. It is shown that SVTR-L (Large) achieves highly competitive results in English and outperforms state-of-the-art models by a large margin in Chinese, while running faster. Meanwhile, SVTR-T (Tiny) is also an effective and much smaller model with appealing inference speed. The main contributions of this work are summarized as follows.

• We demonstrate, for the first time, that a single visual model can achieve accuracy competitive with, or even higher than, advanced vision-language models in scene text recognition. It is promising for practical applications due to its efficiency and cross-lingual versatility.

• We propose SVTR, a text-customized model for recognition. It introduces local and global mixing blocks for extracting stroke-like features and inter-character dependence, respectively, which, together with the multi-scale backbone, form a multi-grained feature description.
• Empirical studies on public benchmarks demonstrate the superiority of SVTR. SVTR-L achieves state-of-the-art performance in recognizing both English and Chinese scene text, while SVTR-T is effective yet efficient, with 6.03M parameters and consuming 4.5ms per image text on average on one NVIDIA 1080Ti GPU.

2 Method

2.1 Overall Architecture
An overview of the proposed SVTR is illustrated in Figure 2. It is a three-stage, height-progressively-decreased network dedicated to text recognition. For an image text of size H × W × 3, it is first transformed into H/4 × W/4 patches of dimension D0 via a progressive overlapping patch embedding. The patches are termed character components, each associated with a fraction of a text character in the image. Then, three stages, each composed of a series of mixing blocks followed by a merging or combining operation, are carried out at different scales for feature extraction. Local and global mixing blocks are devised for stroke-like local pattern extraction and inter-component dependence capturing, respectively. With this backbone, component features and dependence of different distances and at multiple scales are characterized, generating a representation referred to as C, of size 1 × W/4 × D3, which perceives multi-grained character features. Finally, a parallel linear prediction with de-duplication is conducted to get the character sequence.
2.2 Progressive Overlapping Patch Embedding
Given an image text, the first task is to obtain feature patches that represent character components, i.e., to map X ∈ R^(H×W×3) to CC0 ∈ R^(H/4 × W/4 × D0). There exist two common one-step projections for this purpose, i.e., a 4 × 4 disjoint linear projection (see Figure 3(a)) and a 7 × 7 convolution with stride 4. Alternatively, we implement the patch embedding by using two consecutive 3 × 3 convolutions with stride 2 and batch normalization, as shown in Figure 3(b). The scheme, despite increasing the computational cost a little, adds the feature dimension progressively, which is in favor of feature fusion. The ablation study in Section 3.3 shows its effectiveness.

Figure 3: (a) The linear projection in ViT [Dosovitskiy et al., 2021]. (b) Our progressive overlapping patch embedding.
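To make the patch embedding concrete, below is a minimal PyTorch-style sketch of the progressive overlapping scheme described above (two 3 × 3 convolutions with stride 2 and batch normalization). The module name, the GELU activation and the example input size are our own illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

class ProgressiveOverlappingPatchEmbed(nn.Module):
    """Embed an H x W x 3 image text into (H/4 * W/4) character components
    of dimension embed_dim via two consecutive stride-2 3x3 convolutions."""
    def __init__(self, in_channels: int = 3, embed_dim: int = 64):
        super().__init__()
        self.proj = nn.Sequential(
            # H x W -> H/2 x W/2, with half the target channels
            nn.Conv2d(in_channels, embed_dim // 2, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(embed_dim // 2),
            nn.GELU(),  # activation choice is an assumption
            # H/2 x W/2 -> H/4 x W/4, full embedding dimension
            nn.Conv2d(embed_dim // 2, embed_dim, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(embed_dim),
            nn.GELU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.proj(x)                      # B x D0 x H/4 x W/4
        return x.flatten(2).transpose(1, 2)   # B x (H/4 * W/4) x D0

# Example: a 32 x 100 image text becomes 8 x 25 = 200 character components.
tokens = ProgressiveOverlappingPatchEmbed(embed_dim=64)(torch.randn(1, 3, 32, 100))
print(tokens.shape)  # torch.Size([1, 200, 64])
```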
2.3 Mixing Block
Since two characters may differ only slightly, text recognition heavily relies on features at the character component level. However, existing studies mostly employ a feature sequence to represent the image text. Each feature corresponds to a thin-slice image region, which is often noisy, especially for irregular text, and is therefore not optimal for describing a character. The recent advancement of vision transformers introduces a 2D feature representation, but how to leverage it in the context of text recognition is still worth investigating.

More specifically, with the embedded components, we argue that text recognition requires two kinds of features. The first is local component patterns such as the stroke-like feature, which encode the morphology of a character and the correlation between its different parts. The second is inter-character dependence, such as the correlation between different characters or between text and non-text components. Therefore, we devise two mixing blocks to perceive these correlations by using self-attention with different receptive fields.

Figure 4: Illustration of (a) global mixing and (b) local mixing.

Global Mixing. As seen in Figure 4(a), global mixing evaluates the dependence among all character components. Since text and non-text are the two major elements in an image, such a general-purpose mixing can establish long-term dependence among components from different characters. Besides, it is also capable of weakening the influence of non-text components while enhancing the importance of text components. Mathematically, the character components CCi−1 from the previous stage are first reshaped into a feature sequence. When fed into the mixing block, a layer norm is applied, followed by multi-head self-attention for dependence modeling. Then a layer norm and an MLP are sequentially applied for feature fusion. Together with shortcut connections, these form the global mixing block.

Local Mixing. As seen in Figure 4(b), local mixing evaluates the correlation among components within a predefined window. Its objective is to encode the character morphology and establish the associations between components within a character, which simulates the stroke-like features vital for character identification. Different from global mixing, local mixing considers a neighborhood for each component. Similar to convolution, the mixing is carried out in a sliding-window manner, with the window size empirically set to 7 × 11. Compared with global mixing, it thus implements the self-attention mechanism within a local window to capture local patterns.
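Below is a minimal sketch of a mixing block following the description above (layer norm, multi-head self-attention, layer norm and MLP, each with a shortcut connection); local mixing is approximated here by a boolean mask that restricts attention to a 7 × 11 neighborhood on the h × w component grid. The tensor layout, class names and MLP ratio are assumptions for illustration rather than the authors' implementation.

```python
import torch
import torch.nn as nn

def local_mask(h: int, w: int, win_h: int = 7, win_w: int = 11) -> torch.Tensor:
    """Boolean mask (hw x hw): True where attention is blocked, i.e. outside
    the win_h x win_w neighborhood of each component on the h x w grid."""
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    ys, xs = ys.flatten(), xs.flatten()
    dy = (ys[:, None] - ys[None, :]).abs()
    dx = (xs[:, None] - xs[None, :]).abs()
    return (dy > win_h // 2) | (dx > win_w // 2)

class MixingBlock(nn.Module):
    """LN -> MHSA -> LN -> MLP with residual shortcuts; global or local mixing."""
    def __init__(self, dim: int, heads: int, h: int, w: int, local: bool = False):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # For local mixing, block attention outside the 7 x 11 window.
        self.register_buffer("mask", local_mask(h, w) if local else None)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: B x (h*w) x dim
        y = self.norm1(x)
        x = x + self.attn(y, y, y, attn_mask=self.mask, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))

# Example: an 8 x 25 grid of 64-dim components, as in stage 1 of a tiny model.
blk = MixingBlock(dim=64, heads=2, h=8, w=25, local=True)
out = blk(torch.randn(2, 8 * 25, 64))
print(out.shape)  # torch.Size([2, 200, 64])
```

In this sketch, global mixing is the same block with the mask disabled, so the two variants differ only in the receptive field of the attention.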

As aforementioned, the two mixing blocks aim to extract different features that are complementary. In SVTR, the blocks are recurrently applied many times in each stage for comprehensive feature extraction. The permutation of the two kinds of blocks is ablated later.

2.4 Merging
It is computationally expensive to maintain a constant spatial resolution across stages, which also leads to redundant representation. As a consequence, we devise a merging operation following the mixing blocks at each stage (except the last one). The features outputted from the last mixing block are first reshaped into an embedding of size h × w × di−1, denoting the current height, width and channels, respectively. Then, we employ a 3 × 3 convolution with stride 2 in the height dimension and stride 1 in the width dimension, followed by a layer norm, generating an embedding of size h/2 × w × di.

The merging operation halves the height while keeping a constant width. It not only reduces the computational cost, but also builds a text-customized hierarchical structure. Typically, most image text appears horizontally or near horizontally. Compressing the height dimension establishes a multi-scale representation for each character while not affecting the patch layout in the width dimension. Therefore, it does not increase the chance of encoding adjacent characters into the same component across stages. We also increase the channel dimension di to compensate for the information loss.
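A minimal sketch of the merging operation under the description above, assuming the components arrive as a B × (h·w) × d sequence: reshape to a 2D map, apply a 3 × 3 convolution with stride 2 in height and 1 in width that also raises the channel count, then a layer norm.

```python
import torch
import torch.nn as nn

class Merging(nn.Module):
    """Halve the height of the component map while keeping the width,
    increasing the channel dimension from dim_in to dim_out."""
    def __init__(self, dim_in: int, dim_out: int):
        super().__init__()
        self.conv = nn.Conv2d(dim_in, dim_out, kernel_size=3, stride=(2, 1), padding=1)
        self.norm = nn.LayerNorm(dim_out)

    def forward(self, x: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # x: B x (h*w) x dim_in  ->  B x dim_in x h x w
        x = x.transpose(1, 2).reshape(x.size(0), -1, h, w)
        x = self.conv(x)                       # B x dim_out x h/2 x w
        x = x.flatten(2).transpose(1, 2)       # B x (h/2 * w) x dim_out
        return self.norm(x)

# Example: stage-1 output of a tiny model, 8 x 25 components of dim 64 -> 4 x 25 of dim 128.
merged = Merging(64, 128)(torch.randn(2, 8 * 25, 64), h=8, w=25)
print(merged.shape)  # torch.Size([2, 100, 128])
```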
2.5 Combining and Prediction
In the last stage, the merging operation is replaced by a combining operation. It first pools the height dimension to 1, followed by a fully-connected layer, a non-linear activation and dropout. By doing this, character components are further compressed to a feature sequence in which each element is represented by a feature of length D3. Compared to the merging operation, the combining operation avoids applying a convolution to an embedding whose size is very small in one dimension, e.g., a height of 2.

With the combined feature, we implement the recognition by a simple parallel linear prediction. Concretely, a linear classifier with N nodes is employed. It generates a transcript sequence of size W/4 in which, ideally, components of the same character are transcribed to duplicate characters and components of non-text are transcribed to a blank symbol. The sequence is automatically condensed to the final result. In the implementation, N is set to 37 for English and 6625 for Chinese.
mentation, N is set to 37 for English and 6625 for Chinese.
2.6 Architecture Variants
There are several hyper-parameters in SVTR, including the channel depth and the number of heads at each stage, as well as the number of mixing blocks and their permutation. By varying them, SVTR architectures with different capacities can be obtained, and we construct four typical ones, i.e., SVTR-T (Tiny), SVTR-S (Small), SVTR-B (Base) and SVTR-L (Large). Their detailed configurations are shown in Table 1.

3 Experiments

3.1 Datasets
For the English recognition task, our models are trained on two commonly used synthetic scene text datasets, i.e., MJSynth (MJ) [Jaderberg et al., 2014; Jaderberg et al., 2015] and SynthText (ST) [Gupta et al., 2016]. The models are then tested on six public benchmarks as follows. ICDAR 2013 (IC13) [Karatzas et al., 2013] contains 1095 testing images; we discard images that contain non-alphanumeric characters or fewer than three characters, leaving 857 images. Street View Text (SVT) [Wang et al., 2011] has 647 testing images cropped from Google Street View; many images are severely corrupted by noise, blur and low resolution. IIIT5K-Words (IIIT) [Mishra et al., 2012] is collected from the web and contains 3000 testing images. ICDAR 2015 (IC15) [Karatzas et al., 2015] is taken with Google Glasses without careful positioning and focusing; IC15 has two versions, with 1,811 and 2,077 images, and we use the former. Street View Text-Perspective (SVTP) [Phan et al., 2013] is also cropped from Google Street View; there are 639 test images in this set and many of them are perspectively distorted. CUTE80 (CUTE) is proposed in [Anhar et al., 2014] for curved text recognition; its 288 testing images are cropped from full images by using annotated words.

For the Chinese recognition task, we use the Chinese Scene Dataset [Chen et al., 2021]. It is a public dataset containing 509,164, 63,645 and 63,646 training, validation and test images, respectively. The validation set is utilized to determine the best model, which is then assessed on the test set.

3.2 Implementation Details
SVTR uses the rectification module [Shi et al., 2019], where the image text is resized to 32 × 64 for distortion correction. We use the AdamW optimizer with a weight decay of 0.05 for training. For English models, the initial learning rate is set to 5 × 10⁻⁴ × batchsize/2048, and a cosine learning rate scheduler with a 2-epoch linear warm-up is used over all 21 epochs. Data augmentation such as rotation, perspective distortion, motion blur and Gaussian noise is randomly applied during training. The alphabet includes all case-insensitive alphanumerics. The maximum prediction length is set to 25, which exceeds the length of the vast majority of English words. For Chinese models, the initial learning rate is set to 3 × 10⁻⁴ × batchsize/512, and a cosine learning rate scheduler with a 5-epoch linear warm-up is used over all 100 epochs. Data augmentation is not used for training, and the maximum prediction length is set to 40. Word accuracy is used as the evaluation metric. All models are trained with 4 Tesla V100 GPUs on PaddlePaddle.
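As a worked example of the learning-rate rule and schedule, the following PyTorch-style sketch reproduces the rule for English models; the paper trains with PaddlePaddle, and the exact warm-up implementation here is an assumption.

```python
import math
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

batch_size, epochs, warmup_epochs = 512, 21, 2
base_lr = 5e-4 * batch_size / 2048            # English rule: 5e-4 x batchsize/2048 -> 1.25e-4

model = torch.nn.Linear(192, 37)               # stand-in for the recognizer
optimizer = AdamW(model.parameters(), lr=base_lr, weight_decay=0.05)

def lr_factor(epoch: int) -> float:
    # Linear warm-up for the first epochs, then cosine decay to zero.
    if epoch < warmup_epochs:
        return (epoch + 1) / warmup_epochs
    t = (epoch - warmup_epochs) / max(1, epochs - warmup_epochs)
    return 0.5 * (1.0 + math.cos(math.pi * t))

scheduler = LambdaLR(optimizer, lr_lambda=lr_factor)  # call scheduler.step() once per epoch
```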
3.3 Ablation Study
To better understand SVTR, we perform controlled experiments on both IC13 (regular text) and IC15 (irregular text) under different settings. For efficiency, all the experiments are carried out using SVTR-T without the rectification module and data augmentation.


Models [D0 , D1 , D2 ] [L1 , L2 , L3 ] Heads D3 Permutation Params (M) FLOPs (G)


SVTR-T [64,128,256] [3,6,3] [2,4,8] 192 [L]6 [G]6 4.15 0.29
SVTR-S [96,192,256] [3,6,6] [3,6,8] 192 [L]8 [G]7 8.45 0.63
SVTR-B [128,256,384] [3,6,9] [4,8,12] 256 [L]8 [G]10 22.66 3.55
SVTR-L [192,256,512] [3,9,9] [6,8,16] 384 [L]10 [G]11 38.81 6.07

Table 1: Architecture specifications of SVTR variants (w/o counting the rectification module and linear classifier).
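To make Table 1 concrete, the SVTR-T row can be read off as a configuration such as the following; the field names are ours, chosen for illustration.

```python
# SVTR-T configuration read off Table 1 (field names are illustrative).
svtr_tiny = dict(
    embed_dims=[64, 128, 256],              # [D0, D1, D2]: channel depth of the three stages
    depths=[3, 6, 3],                       # [L1, L2, L3]: number of mixing blocks per stage
    num_heads=[2, 4, 8],                    # attention heads per stage
    out_dim=192,                            # D3: dimension after the combining operation
    mixer=["Local"] * 6 + ["Global"] * 6,   # the [L]6[G]6 permutation over the 12 blocks
    num_classes=37,                         # English setting (N = 37)
)
```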

Embedding   IC13   IC15          Merging   IC13   IC15   FLOPs (G)
Linear      92.5   72.0          None      92.4   71.8   1.10
Overlap     93.0   73.9          Prog      93.5   74.8   0.29
Ours        93.5   74.8

Table 2: Ablation study on patch embedding (the left half) and merging (the right half).

CP           IC13   IC15         CP       IC13   IC15
None         91.6   68.2         [G]12    92.7   73.7
[G]6[L]6     92.2   71.9         [L]12    91.3   70.2
[L]6[G]6     93.5   74.8         [GL]6    91.9   72.4
[LG]6        93.5   73.5

Table 3: Ablation study on mixing block permutation.

Figure 5: Accuracy-parameter (M) and accuracy-speed (ms) plots of different models on IC15.


The Effectiveness of Patch Embedding
As seen in Table 2 (the left half), different embedding strategies behave slightly differently in recognition accuracy. Our progressive embedding scheme outperforms the two default ones by 0.75% and 2.8% on average on the two datasets, indicating that it is effective especially for irregular text.

The Effectiveness of Merging
There are also two choices, i.e., applying the merging operation to build a progressively decreased network (prog), or keeping a constant spatial resolution across stages (none). As indicated in the right half of Table 2, the merging not only reduces the computational cost, but also increases the accuracy on both datasets. It verifies that multi-scale sampling along the height dimension is effective for text recognition.

The Permutation of Mixing Blocks
There are various ways to group the global and local mixing blocks in each stage. Several of them are assessed in Table 3, where none means no mixing block is taken into account. [G]6[L]6 means that, for each stage, six global mixing blocks are carried out first, followed by six local mixing blocks; the others are defined similarly. As can be seen, almost every scheme gives a certain accuracy improvement. We believe the improvements are attributed to perceiving comprehensive character component features. The relatively large gains on irregular text further indicate that the mixing blocks are helpful to feature learning in complex scenarios. It is observed that [L]6[G]6 reports the best accuracy, giving gains of 1.9% and 6.6% on IC13 and IC15 when compared with none. Placing the local mixing blocks in front of the global ones helps guide the global mixing blocks to focus on long-term dependence capturing. On the contrary, switching their permutation is likely to confuse the role of the global mixing blocks, which may repetitively attend to local characteristics.

3.4 Comparison with State-of-the-Art
We compare SVTR with previous works on six English benchmarks covering regular and irregular text and on one Chinese scene dataset in Table 4. The methods are grouped by whether they utilize language information, i.e., lan-free and lan-aware. We first look at the results on the English datasets. Even SVTR-T, the smallest of the SVTR family, achieves highly competitive accuracy among the lan-free methods, while the other SVTRs set a new state of the art on most datasets. When further compared with lan-aware methods, SVTR-L attains the best accuracy on IIIT, IC15 and CUTE among the six English benchmarks. Its overall accuracy is on par with recent studies [Fang et al., 2021; Tang et al., 2022] that use extra language models. In contrast, SVTR enjoys its simplicity and runs faster. The results imply that a single visual model can also perform the recognition well when discriminative character features are successfully extracted.

We then analyze the results on the Chinese Scene Dataset. [Chen et al., 2021] gives the accuracy of six existing methods, as shown in the table. Encouragingly, SVTR performs considerably well. Compared with SAR, the best one among the listed methods, accuracy gains ranging from 5.4% to 9.6% are observed, which are noticeable improvements. The result is explained by SVTR comprehensively perceiving multi-grained character component features, which are well suited to characterizing Chinese words with their rich stroke patterns.

In Figure 5, we also depict the accuracy, parameters and inference speed of different models on IC15. Owing to their simplicity, SVTRs consistently rank top-tier in both the accuracy-parameter and accuracy-speed plots, further demonstrating their superiority compared to existing methods.


Method                            IC13   SVT    IIIT5k  IC15   SVTP   CUTE   Chinese Scene  Params (M)  Speed (ms)

Lan-free
CRNN [Shi et al., 2017]           91.1   81.6   82.9    69.4   70.0   65.5   53.4           8.3         6.3
Rosetta [Borisyuk et al., 2018]   90.9   84.7   84.3    71.2   73.8   69.2   -              44.3        10.5
SRN* [Yu et al., 2020]            93.2   88.1   92.3    77.5   79.4   84.7   -              -           -
PREN* [Yan et al., 2021]          94.7   92.0   92.1    79.2   83.9   81.3   -              29.1        40.0
ViTSTR [Atienza, 2021]            93.2   87.7   88.4    78.5   81.8   81.3   -              85.5        11.2
ABINet* [Fang et al., 2021]       94.9   90.4   94.6    81.7   84.2   86.5   -              23.5        50.6
VST* [Tang et al., 2022]          95.6   91.9   95.6    82.3   87.0   91.8   -              -           -

Lan-aware
ASTER [Shi et al., 2019]          -      89.5   93.4    76.1   78.5   79.5   54.5           27.2        -
MORAN [Luo et al., 2019]          -      88.3   91.2    -      76.1   77.4   51.8           28.5        -
NRTR [Sheng et al., 2019]         94.7   88.3   86.5    -      -      -      -              31.7        160
SAR [Li et al., 2019]             91.0   84.5   91.5    69.2   76.4   83.5   62.5           57.5        120
AutoSTR [Zhang et al., 2020]      -      90.9   94.7    81.8   81.7   84.0   -              10.4        207
SRN [Yu et al., 2020]             95.5   91.5   94.8    82.7   85.1   87.8   60.1           54.7        25.4
PREN2D [Yan et al., 2021]         96.4   94.0   95.6    83.0   87.6   91.7   -              -           -
VisionLAN [Wang et al., 2021]     95.7   91.7   95.8    83.7   86.0   88.5   -              32.8        28.0
ABINet [Fang et al., 2021]        97.4   93.5   96.2    86.0   89.3   89.2   -              36.7        51.3
VST [Tang et al., 2022]           96.4   93.8   96.3    85.4   88.7   95.1   -              64.0        -

Ours
SVTR-T (Tiny)                     96.3   91.6   94.4    84.1   85.4   88.2   67.9           6.03        4.5
SVTR-S (Small)                    95.7   93.0   95.0    84.7   87.9   92.0   69.0           10.3        8.0
SVTR-B (Base)                     97.1   91.5   96.0    85.2   89.9   91.7   71.4           24.6        8.5
SVTR-L (Large)                    97.2   91.7   96.3    86.6   88.4   95.1   72.1           40.8        18.0

Table 4: Results on six English benchmarks (regular: IC13, SVT, IIIT5k; irregular: IC15, SVTP, CUTE) and one Chinese benchmark (Scene), tested against existing methods. CRNN and Rosetta are from the reproduction of CombBest [Baek et al., 2019]. Lan means language and * denotes the language-free version of the corresponding method. The speed is the inference time on one NVIDIA 1080Ti GPU averaged over 3000 English image texts.

3.5 Visualization Analysis
We visualize the attention maps of SVTR-T when decoding different character components. Each map can be explained as serving a different role in the whole recognition. Nine maps are selected for illustration, as shown in Figure 6. The first row shows three maps gazing into a fraction of the character "B", with emphasis on its left side, bottom and middle parts, respectively, indicating that different character regions contribute to the recognition. The second row exhibits three maps gazing into different characters, i.e., "B", "L" and "S", showing that SVTR-T is also able to learn features by viewing a character as a whole. The third row exhibits three maps that simultaneously activate multiple characters, implying that dependencies among different characters are successfully captured. The three rows together reveal that sub-character, character-level and cross-character clues are all captured by the recognizer, in accordance with the claim that SVTR perceives multi-grained character component features. It again explains the effectiveness of SVTR.

Figure 6: Visualization of SVTR-T attention maps.

With the results and discussion above, we conclude that SVTR-L enjoys the merits of accuracy, speed and versatility, making it a highly competitive choice for accuracy-oriented applications, while SVTR-T is an effective and much smaller model that is also quite fast, making it appealing in resource-limited scenarios.

4 Conclusion
We have presented SVTR, a customized visual model for scene text recognition. It extracts multi-grained character features that describe both stroke-like local patterns and inter-component dependence of varying distance at multiple height scales. Therefore, the recognition task can be conducted by a single visual model, enjoying the merits of accuracy, efficiency and cross-lingual versatility. SVTR variants with different capacities are also devised to meet diverse application needs. Experiments on both English and Chinese benchmarks basically verify the proposed SVTR. Highly competitive or even better accuracy is observed compared to state-of-the-art methods, while running faster. We hope that SVTR will foster further research in scene text recognition.

Acknowledgments
The work is supported by the National Natural Science Foundation of China (No. 62172103, 61876016) and the CCF-Baidu Open Fund (No. 2021PP15002000).


References

[Anhar et al., 2014] R. Anhar, S. Palaiahnakote, C. S. Chan, and C. L. Tan. A robust arbitrary text detection system for natural scene images. In Expert Systems with Applications, pages 8027–8048, 2014.
[Atienza, 2021] R. Atienza. Vision transformer for fast and efficient scene text recognition. arXiv:2105.08582, 2021.
[Baek et al., 2019] J. Baek, G. Kim, J. Lee, S. Park, D. Han, S. Yun, S. J. Oh, and H. Lee. What is wrong with scene text recognition model comparisons? Dataset and model analysis. In ICCV, pages 4714–4722, 2019.
[Borisyuk et al., 2018] F. Borisyuk, A. Gordo, and V. Sivakumar. Rosetta: Large scale system for text detection and recognition in images. In ACM SIGKDD, pages 71–79, 2018.
[Chen et al., 2021] J. Chen, H. Yu, J. Ma, M. Guan, X. Xu, X. Wang, S. Qu, B. Li, and X. Xue. Benchmarking chinese text recognition: Datasets, baselines, and an empirical study. arXiv:2112.15093, 2021.
[Dosovitskiy et al., 2021] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
[Fang et al., 2021] S. Fang, H. Xie, Y. Wang, Z. Mao, and Y. Zhang. Read like humans: Autonomous, bidirectional and iterative language modeling for scene text recognition. In CVPR, 2021.
[Gupta et al., 2016] A. Gupta, A. Vedaldi, and A. Zisserman. Synthetic data for text localisation in natural images. In CVPR, pages 2315–2324, 2016.
[Hu et al., 2020] W. Hu, X. Cai, J. Hou, S. Yi, and Z. Lin. GTC: Guided training of CTC towards efficient and accurate scene text recognition. In AAAI, pages 11005–11012, 2020.
[Jaderberg et al., 2014] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman. Synthetic data and artificial neural networks for natural scene text recognition. In NeurIPS Deep Learning Workshop, 2014.
[Jaderberg et al., 2015] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman. Reading text in the wild with convolutional neural networks. In IJCV, pages 1–20, 2015.
[Karatzas et al., 2013] D. Karatzas, F. Shafait, S. Uchida, M. Iwamura, L. G. i. Bigorda, S. R. Mestre, J. Mas, D. F. Mota, J. A. Almazàn, and L. P. de las Heras. ICDAR 2013 robust reading competition. In ICDAR, pages 1484–1493, 2013.
[Karatzas et al., 2015] D. Karatzas, L. Gomez-Bigorda, A. Nicolaou, S. Ghosh, A. Bagdanov, M. Iwamura, J. Matas, L. Neumann, V. R. Chandrasekhar, S. Lu, F. Shafait, S. Uchida, and E. Valveny. ICDAR 2015 competition on robust reading. In ICDAR, pages 1156–1160, 2015.
[Li et al., 2019] H. Li, P. Wang, C. Shen, and G. Zhang. Show, attend and read: A simple and strong baseline for irregular text recognition. In AAAI, pages 8610–8617, 2019.
[Liu et al., 2021] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV, 2021.
[Luo et al., 2019] C. Luo, L. Jin, and Z. Sun. A multi-object rectified attention network for scene text recognition. In Pattern Recognit., pages 109–118, 2019.
[Mishra et al., 2012] A. Mishra, A. Karteek, and C. V. Jawahar. Scene text recognition using higher order language priors. In BMVC, 2012.
[Phan et al., 2013] T. Q. Phan, P. Shivakumara, S. Tian, and C. L. Tan. Recognizing text with perspective distortion in natural scenes. In CVPR, pages 569–576, 2013.
[Sheng et al., 2019] F. Sheng, Z. Chen, and B. Xu. NRTR: A no-recurrence sequence-to-sequence model for scene text recognition. In ICDAR, pages 781–786, 2019.
[Shi et al., 2017] B. Shi, X. Bai, and C. Yao. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. In TPAMI, pages 2298–2304, 2017.
[Shi et al., 2019] B. Shi, M. Yang, X. Wang, P. Lyu, C. Yao, and X. Bai. ASTER: An attentional scene text recognizer with flexible rectification. In TPAMI, pages 2035–2048, 2019.
[Tang et al., 2022] X. Tang, Y. Lai, Y. Liu, Y. Fu, and R. Fang. Visual-semantic transformer for scene text recognition. In AAAI, 2022.
[Wang et al., 2011] K. Wang, B. Babenko, and S. Belongie. End-to-end scene text recognition. In ICCV, pages 1457–1464, 2011.
[Wang et al., 2021] Y. Wang, H. Xie, S. Fang, J. Wang, S. Zhu, and Y. Zhang. From two to one: A new scene text recognizer with visual language modeling network. In ICCV, pages 14194–14203, 2021.
[Yan et al., 2021] R. Yan, L. Peng, S. Xiao, and G. Yao. Primitive representation learning for scene text recognition. In CVPR, pages 284–293, 2021.
[Yu et al., 2020] D. Yu, X. Li, C. Zhang, T. Liu, J. Han, J. Liu, and E. Ding. Towards accurate scene text recognition with semantic reasoning networks. In CVPR, pages 12113–12122, 2020.
[Zhai et al., 2016] C. Zhai, Z. Chen, J. Li, and B. Xu. Chinese image text recognition with BLSTM-CTC: A segmentation-free method. In CCPR, pages 525–536, 2016.
[Zhang et al., 2020] H. Zhang, Q. Yao, M. Yang, Y. Xu, and X. Bai. AutoSTR: Efficient backbone search for scene text recognition. In ECCV, pages 751–767. Springer, 2020.
[Zheng et al., 2021] T. Zheng, Z. Chen, S. Fang, H. Xie, and Y. Jiang. CDistNet: Perceiving multi-domain character distance for robust text recognition. arXiv:2111.11011, 2021.
