This page lists datasets commonly used for text detection, text recognition, key information extraction, and named entity recognition, together with their download links.
The text detection dataset directory is organized as follows.
├── ctw1500
│ ├── imgs
│ ├── instances_test.json
│ └── instances_training.json
├── icdar2015
│ ├── imgs
│ ├── instances_test.json
│ └── instances_training.json
├── icdar2017
│ ├── imgs
│ ├── instances_training.json
│ └── instances_val.json
├── synthtext
│ ├── imgs
│ └── instances_training.lmdb
Dataset | Images | Annotation file (training) | Annotation file (validation) | Annotation file (testing)
---|---|---|---|---
CTW1500 | homepage | instances_training.json | - | instances_test.json
ICDAR2015 | homepage | instances_training.json | - | instances_test.json
ICDAR2017 | homepage, renamed_imgs | instances_training.json | instances_val.json | -
Synthtext | homepage | instances_training.lmdb | - | -
- For `icdar2015`:
  - Step1: Download `ch4_training_images.zip` and `ch4_test_images.zip` from homepage
  - Step2: Download instances_training.json and instances_test.json
  - Step3:

        mkdir icdar2015 && cd icdar2015
        mv /path/to/instances_training.json .
        mv /path/to/instances_test.json .
        mkdir imgs && cd imgs
        ln -s /path/to/ch4_training_images training
        ln -s /path/to/ch4_test_images test
- For `icdar2017`:
  - To avoid rotation effects when loading `jpg` images with OpenCV, we provide re-saved `png` images in renamed_images. You can copy these images to `imgs`.
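Once the files are in place, the layout can be sanity-checked before training. The following is a minimal sketch, not part of MMOCR: `find_missing` and `DET_LAYOUT` are hypothetical names, the root directory `data` is an assumption, and the expected entries are copied from the text detection directory tree shown earlier.

```python
from pathlib import Path

# Expected entries per dataset, copied from the text detection directory tree.
DET_LAYOUT = {
    "ctw1500": ["imgs", "instances_test.json", "instances_training.json"],
    "icdar2015": ["imgs", "instances_test.json", "instances_training.json"],
    "icdar2017": ["imgs", "instances_training.json", "instances_val.json"],
    "synthtext": ["imgs", "instances_training.lmdb"],
}


def find_missing(root, layout=DET_LAYOUT):
    """Return the relative paths from `layout` that do not exist under `root`."""
    root = Path(root)
    missing = []
    for dataset, entries in layout.items():
        for entry in entries:
            rel = Path(dataset) / entry
            if not (root / rel).exists():
                missing.append(rel.as_posix())
    return missing


if __name__ == "__main__":
    # "data" is an assumed location; point this at your own dataset root.
    for path in find_missing("data"):
        print("missing:", path)
```

An empty result means every entry in the tree was found; symlinked directories such as `imgs/training` count as present as long as the link target exists.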
The text recognition dataset directory is organized as follows.
├── mixture
│ ├── coco_text
│ │ ├── train_label.txt
│ │ ├── train_words
│ ├── icdar_2011
│ │ ├── training_label.txt
│ │ ├── Challenge1_Training_Task3_Images_GT
│ ├── icdar_2013
│ │ ├── train_label.txt
│ │ ├── test_label_1015.txt
│ │ ├── test_label_1095.txt
│ │ ├── Challenge2_Training_Task3_Images_GT
│ │ ├── Challenge2_Test_Task3_Images
│ ├── icdar_2015
│ │ ├── train_label.txt
│ │ ├── test_label.txt
│ │ ├── ch4_training_word_images_gt
│ │ ├── ch4_test_word_images_gt
│ ├── IIIT5K
│ │ ├── train_label.txt
│ │ ├── test_label.txt
│ │ ├── train
│ │ ├── test
│ ├── ct80
│ │ ├── test_label.txt
│ │ ├── image
│ ├── svt
│ │ ├── test_label.txt
│ │ ├── image
│ ├── svtp
│ │ ├── test_label.txt
│ │ ├── image
│ ├── Syn90k
│ │ ├── shuffle_labels.txt
│ │ ├── label.txt
│ │ ├── label.lmdb
│ │ ├── mnt
│ ├── SynthText
│ │ ├── shuffle_labels.txt
│ │ ├── instances_train.txt
│ │ ├── label.txt
│ │ ├── label.lmdb
│ │ ├── synthtext
│ ├── SynthAdd
│ │ ├── label.txt
│ │ ├── label.lmdb
│ │ ├── SynthText_Add
Dataset | Images | Annotation file (training) | Annotation file (test)
---|---|---|---
coco_text | homepage | train_label.txt | -
icdar_2011 | homepage | train_label.txt | -
icdar_2013 | homepage | train_label.txt | test_label_1015.txt
icdar_2015 | homepage | train_label.txt | test_label.txt
IIIT5K | homepage | train_label.txt | test_label.txt
ct80 | - | - | test_label.txt
svt | homepage | - | test_label.txt
svtp | - | - | test_label.txt
Syn90k | homepage | shuffle_labels.txt, label.txt | -
SynthText | homepage | shuffle_labels.txt, instances_train.txt, label.txt | -
SynthAdd | SynthText_Add.zip (code:627x) | label.txt | -
- For `icdar_2013`:
  - Step1: Download `Challenge2_Test_Task3_Images.zip` and `Challenge2_Training_Task3_Images_GT.zip` from homepage
  - Step2: Download test_label_1015.txt and train_label.txt
- For `icdar_2015`:
  - Step1: Download `ch4_training_word_images_gt.zip` and `ch4_test_word_images_gt.zip` from homepage
  - Step2: Download train_label.txt and test_label.txt
- For `IIIT5K`:
  - Step1: Download `IIIT5K-Word_V3.0.tar.gz` from homepage
  - Step2: Download train_label.txt and test_label.txt
- For `svt`:
  - Step1: Download `svt.zip` from homepage
  - Step2: Download test_label.txt
  - Step3:

        python tools/data/textrecog/svt_converter.py <download_svt_dir_path>
- For `ct80`:
  - Step1: Download test_label.txt
- For `svtp`:
  - Step1: Download test_label.txt
- For `coco_text`:
  - Step1: Download from homepage
  - Step2: Download train_label.txt
- For `Syn90k`:
  - Step1: Download `mjsynth.tar.gz` from homepage
  - Step2: Download shuffle_labels.txt
  - Step3:

        mkdir Syn90k && cd Syn90k
        mv /path/to/mjsynth.tar.gz .
        tar -xzf mjsynth.tar.gz
        mv /path/to/shuffle_labels.txt .
        # create soft link
        cd /path/to/mmocr/data/mixture
        ln -s /path/to/Syn90k Syn90k
- For `SynthText`:
  - Step1: Download `SynthText.zip` from homepage
  - Step2: Download shuffle_labels.txt
  - Step3: Download instances_train.txt
  - Step4:

        unzip SynthText.zip
        cd SynthText
        mv /path/to/shuffle_labels.txt .
        # create soft link
        cd /path/to/mmocr/data/mixture
        ln -s /path/to/SynthText SynthText
- For `SynthAdd`:

      mkdir SynthAdd && cd SynthAdd
      mv /path/to/SynthText_Add.zip .
      unzip SynthText_Add.zip
      mv /path/to/label.txt .
      # create soft link
      cd /path/to/mmocr/data/mixture
      ln -s /path/to/SynthAdd SynthAdd
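The Syn90k, SynthText, and SynthAdd instructions all end with the same step: a soft link from the MMOCR `data/mixture` directory to wherever the dataset was unpacked. A minimal Python sketch of that `ln -s` step follows. `link_dataset` is a hypothetical helper, not an MMOCR utility, and the paths in the usage comment are the same placeholders used in the commands.

```python
import os
from pathlib import Path


def link_dataset(source_dir, mixture_dir):
    """Create `mixture_dir/<name>` as a symlink to `source_dir`.

    Mirrors `cd mixture_dir && ln -s source_dir <name>`, but leaves an
    existing correct link untouched so the call can be repeated safely.
    """
    source = Path(source_dir).resolve()
    if not source.is_dir():
        raise FileNotFoundError(f"dataset directory not found: {source}")
    mixture = Path(mixture_dir)
    mixture.mkdir(parents=True, exist_ok=True)
    link = mixture / source.name
    if link.is_symlink() and link.resolve() == source:
        return link  # already linked to the right place
    if link.exists() or link.is_symlink():
        raise FileExistsError(f"refusing to overwrite: {link}")
    os.symlink(source, link, target_is_directory=True)
    return link


# Example (placeholder paths, as in the shell commands):
# link_dataset("/path/to/Syn90k", "/path/to/mmocr/data/mixture")
```

Refusing to overwrite an existing entry is a deliberate choice here, so that a real dataset directory under `data/mixture` is never replaced by accident.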
Note:
To convert a label file from `txt` format to `lmdb` format, run:
python tools/data/utils/txt2lmdb.py -i <txt_label_path> -o <lmdb_label_path>
For example,
python tools/data/utils/txt2lmdb.py -i data/mixture/Syn90k/label.txt -o data/mixture/Syn90k/label.lmdb
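When several label files need converting, the same command can be generated once per file. The sketch below only builds the argument lists from the `-i` and `-o` options of `txt2lmdb.py`. `txt2lmdb_command` and `convert_all` are hypothetical helpers, deriving the output path by swapping the `.txt` suffix for `.lmdb` is an assumption based on the Syn90k example, and the script is assumed to be run from the MMOCR repository root.

```python
import subprocess
import sys
from pathlib import Path

SCRIPT = "tools/data/utils/txt2lmdb.py"  # script path as given in this page


def txt2lmdb_command(txt_label_path):
    """Build the argv list that converts one txt label file to lmdb."""
    txt_path = Path(txt_label_path)
    if txt_path.suffix != ".txt":
        raise ValueError(f"expected a .txt label file, got: {txt_path}")
    # Assumption: the lmdb file sits next to the txt file, same stem.
    lmdb_path = txt_path.with_suffix(".lmdb")
    return [sys.executable, SCRIPT,
            "-i", txt_path.as_posix(),
            "-o", lmdb_path.as_posix()]


def convert_all(txt_label_paths, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for path in txt_label_paths:
        command = txt2lmdb_command(path)
        print(" ".join(command))
        if not dry_run:
            subprocess.run(command, check=True)


if __name__ == "__main__":
    # Label files that have a label.lmdb counterpart in the directory tree.
    convert_all([
        "data/mixture/Syn90k/label.txt",
        "data/mixture/SynthText/label.txt",
        "data/mixture/SynthAdd/label.txt",
    ])
```

The default `dry_run=True` only prints the commands, which makes it easy to review the paths before running any conversion.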
The key information extraction dataset directory is organized as follows.
└── wildreceipt
    ├── class_list.txt
    ├── dict.txt
    ├── image_files
    ├── test.txt
    └── train.txt
- Download wildreceipt.tar
The named entity recognition dataset directory is organized as follows.
└── cluener2020
    ├── cluener_predict.json
    ├── dev.json
    ├── README.md
    ├── test.json
    ├── train.json
    └── vocab.txt
- Download cluener_public.zip
- Download vocab.txt and move `vocab.txt` to `cluener2020`