Pattern Recognition
journal homepage: www.elsevier.com/locate/patcog
Article info
Article history: Received 18 May 2022; Revised 29 August 2022; Accepted 4 September 2022; Available online 6 September 2022
Keywords: Image manipulation detection; Transformer; Edge artifact enhancement; Edge supervision

Abstract
Image manipulation detection has attracted considerable attention owing to the increasing security risks posed by fake images. Previous studies have proven that tampering traces hidden in images are essential for detecting manipulated regions. However, existing methods have limitations in generalization and the ability to tackle post-processing methods. This paper presents a novel Network to learn and Enhance Multiple tampering Traces (EMT-Net), including noise distribution and visual artifacts. For better generalization, EMT-Net extracts global and local noise features from noise maps using transformers and captures local visual artifacts from original RGB images using convolutional neural networks. Moreover, we enhance fused tampering traces using the proposed edge artifacts enhancement modules and edge supervision strategy to discover subtle edge artifacts hidden in images. Thus, EMT-Net can prevent the risks of losing slight visual clues against well-designed post-processing methods. Experimental results indicate that the proposed method can detect manipulated regions and outperform state-of-the-art approaches under comprehensive quantitative metrics and visual qualities. In addition, EMT-Net shows robustness when various post-processing methods further manipulate images.
© 2022 Published by Elsevier Ltd.
https://doi.org/10.1016/j.patcog.2022.109026
0031-3203/© 2022 Published by Elsevier Ltd.
X. Lin, S. Wang, J. Deng et al. Pattern Recognition 133 (2023) 109026
Fig. 1. Examples of manipulated images with different manipulation techniques, such as copy-move, splicing, and removal. The goal is to detect pixel-level binary masks of
manipulated regions.
Fig. 2. Framework of our proposed EMT-Net for pixel-level manipulation detection. EMT-Net consists of a noise encoding branch, an RGB encoding branch, four edge artifact
enhancement modules, an edge decoding branch, and a region decoding branch.
More recently, some CNN-based methods, such as ManTra [13], SPAN [14], or MVSS [15], offer generalized solutions. These methods extract tampering traces hidden in images to achieve superior performance and can be divided into two categories based on the type of tampering trace captured. The first category is based on noise maps [9,14] generated by applying special convolutional kernels to RGB images. These methods employ CNNs to extract abnormal local noise features from noise maps to distinguish heterologous regions but ignore the importance of global correlations. Unfortunately, homologous manipulation techniques cause no local abnormality in noise maps, so local noise features alone are insufficient to detect them. Therefore, local noise feature-based approaches have deficiencies in comprehensive detection. The second category captures edge artifacts of manipulated areas [15,16] using edge detection modules or edge supervision branches to improve edge feature extraction ability. However, it is challenging to distinguish boundary artifacts from edges of natural objects when visual artifacts are hidden by well-designed post-processing methods, such as local smoothing, image compression, and filtering.

We present a novel image manipulation detection Network by learning and Enhancing Multiple tampering Traces (EMT-Net), as shown in Fig. 2, to improve the generalization abilities and tackle post-processing approaches. EMT-Net focuses on detecting the three most common manipulation techniques: splicing, removal, and copy-move. First, a transformer-based Noise Encoding Branch (NEB) and a CNN-based RGB Encoding Branch (RGB-EB) effectively extract and fuse multiple tampering traces, such as global noise, local noise, and artifact features. Second, the proposed Edge Artifact Enhancement (EAE) modules and an edge supervision strategy in the Edge Decoding Branch (EDB) enhance subtle boundary artifacts of fused features for locating the edges of manipulated regions. Finally, a Region Decoding Branch (RDB) upsamples enhanced features for pixel-level prediction. Comprehensive experimental results on six benchmark datasets, i.e., CASIA, NIST, Columbia, COVER, CoMoFoD, and DEFACTO, verify that our proposed method outperforms current state-of-the-art (SoTA) approaches even without pre-training.

Our main contributions can be summarized as follows:

• We simultaneously extract global and local noise features using transformers from noise maps in the proposed NEB to detect tampering traces produced by homologous and heterologous manipulation;
• We design EDB to reinforce the tampering traces at different scales. EDB combines proposed EAE modules and an edge supervision strategy to find edge artifacts of manipulated regions effectively even after applying post-processing methods;
• We present the EMT-Net, which can learn and enhance multiple tampering traces, including local noise inconsistency, global noise correlations, and subtle boundary artifacts from different scales. The fusion and enhancement of tampering traces enable precise detection of multiple content-changing manipulation techniques.

2. Related works

This section reviews the most relevant studies on CNN-based manipulation detection and localization methods. Then, we briefly introduce the transformer, one of the core components of the proposed network.

Current CNN-based approaches can be classified into those employing local noise and edge artifact features. The two categories are reviewed in the following subsections.

2.1. Local noise-based methods

Local noise features are extracted from noise maps modeled by residuals between original RGB images and RGB images estimated by interpolating methods [17]. Noise maps, which are always generated by applying special filters to original RGB images, can enhance tampering clues and suppress semantic information. RGB-N [17] employed steganalysis rich model (SRM) filters [18] to acquire noise maps and utilized CNNs to capture local noise abnormalities. Some approaches [15,19] used constrained Bayar convolutional blocks to learn image manipulation fingerprints and explore local inconsistencies from noise maps. ManTra [13] and SPAN [14] concatenated regular convolutional, SRM, and Bayar filters to extract local features from noise maps and RGB images. Li and Huang [9] designed a trainable pre-filtering module initialized with high-pass filters for enhancing tampered traces in detecting inpainted regions.

However, these noise map-based methods only used CNNs to extract local noise features without exploring global noise features. Local noise features cannot reveal manipulated regions forged by homologous manipulation techniques, weakening generalization performance. The proposed network extracts global and local noise features from noise maps using Swin transformers [20], thereby addressing heterologous and homologous manipulation.

2.2. Edge artifact-based methods

Edge artifacts are vital clues for image manipulation detection as most manipulated region boundaries are surrounded by unnatural artifacts. Salloum et al. [16] proposed multi-task FCN (MFCN) to predict tampered areas and their boundaries concurrently. Motivated by MFCN, Zhou et al. [21] designed a three-stage GSR-Net architecture including edge prediction, refinement, and segmentation branches. Supervised edge prediction and refinement branches improved detection results. More recently, Chen et al. [15] designed an edge-supervised branch using edge residual blocks in a shallow-to-deep manner to capture subtle boundary details. However, distinguishing the natural edges of objects and boundary artifacts becomes difficult when well-designed post-processing approaches (e.g., local smoothing and noise adding) reduce visual clues. Therefore, we propose EDB, in which EAE modules and an edge supervision strategy can prevent the loss of subtle artifacts and learn more robust features for distinguishing edge artifacts and natural boundaries.

2.3. Transformer

The transformer [22] was first proposed for natural language processing tasks. Dosovitskiy et al. [23] introduced a vision transformer (ViT) to explicitly model global relationships between pixels and achieved impressive accuracy on several image processing tasks. However, ViT fails to capture subtle details of images owing to the limited local perception of patch-based self-attention modules. Liu et al. [20] proposed a more efficient and effective hierarchical Swin transformer using shifted windows-based self-attention. Unlike ViT, the Swin transformer can extract local and global features and achieve SoTA performance on various vision tasks, including object detection [24] and semantic segmentation [25]. Therefore, this study applies Swin transformers to capture global and local noise features from noise maps.

3. Method

This section briefly introduces the proposed method for detecting manipulation techniques and explains the motivation behind the study. Details of the four main network components are then discussed.

3.1. Overview and motivation

This work aims to predict pixel-level binary masks of manipulated regions in images using a novel model EMT-Net, which can extract and enhance sufficient tampering traces. We mainly detect three common forgery types, i.e., copy-move, splicing, and removal. EMT-Net extracts and fuses multiple manipulation features, including global noise consistencies, local noise inconsistencies, and visual artifacts. The EMT-Net framework, illustrated in Fig. 2, consists of four main components: NEB, RGB-EB, EDB, and RDB. The EDB module enhances multiple features extracted using NEB and RGB-EB. Moreover, to prevent post-processing methods (e.g., blurring, local smoothing, and compression) from decreasing visual clues, we propose an edge supervision strategy and EAE module to reinforce boundary artifacts of fused multiple features. After enhancement, RDB predicts manipulated regions. Moreover, edge details of features are gradually refined under the guidance of EDB during the RDB decoding process. Finally, RDB generates the binary masks of manipulated regions.

In the following sections, each main component is explicitly introduced.

3.2. Noise encoding branch

NEB aims to provide sufficient evidence for homologous and heterologous manipulation detection by extracting noise distribution features, including global noise consistencies and local noise inconsistencies. The NEB structure is shown in Fig. 2. We adopt a combined convolutional block [13,14] containing SRM filters [18], Bayar filters [19], and normal convolutional blocks to convert RGB images to noise maps. We then apply a patch partition layer to feed noise maps into the first Swin transformer block. As shown in Fig. 3(a), noise maps are partitioned into several windows of size N_w × N_w and each window is divided into patches of size N_p × N_p. After partitioning, a linear embedding layer flattens the patches in each window. Finally, we apply Swin transformer blocks in series to extract global and local noise features from the flattened patches at different scales. The following subsections discuss details of the Swin transformer blocks.

3.2.1. Swin transformer block

As illustrated in Fig. 4, the Swin transformer block is made up of four layer-norm layers, two Multi-Layer Perceptrons (MLPs) [26], a window-based multi-head self-attention (W-MSA) module, a shifted window-based MSA (SW-MSA) module, and a patch merging layer. After each MLP and MSA module, a residual connection is established to prevent gradient vanishing. The l-th Swin
Fig. 3. Window partition. (a) Window and patch partition; (b) Shifted-window partition. All local windows in (a) are shifted two pixels in the right and bottom direction.
After the shifting operation, nine new windows are generated.
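The regular and shifted window partitions of Fig. 3 can be illustrated with a minimal sketch. It assumes a hypothetical 8 × 8 noise map and a window size of 4 (neither value is stated in the text), for which a two-pixel shift turns four windows into nine, as the caption describes. The helper names are illustrative only, and practical Swin implementations commonly realize the shift with a cyclic roll plus attention masks rather than with irregular windows.

```python
def window_bounds(size, win, shift=0):
    """Return (start, stop) intervals that tile [0, size) after the window
    grid of step `win` has been moved by `shift` pixels."""
    # Grid lines of the shifted partition, plus the two image borders.
    cuts = sorted({0, size} | set(range(shift % win, size, win)))
    return list(zip(cuts[:-1], cuts[1:]))


def partition(height, width, win, shift=0):
    """List every window as ((row_start, row_stop), (col_start, col_stop))."""
    rows = window_bounds(height, win, shift)
    cols = window_bounds(width, win, shift)
    return [(r, c) for r in rows for c in cols]


# Assumed sizes: 8 x 8 map, 4 x 4 windows, shift of two pixels.
regular = partition(8, 8, win=4)
shifted = partition(8, 8, win=4, shift=2)
print(len(regular))            # 4
print(len(shifted))            # 9
print(window_bounds(8, 4, 2))  # [(0, 2), (2, 6), (6, 8)]
```

Attention computed inside the shifted windows mixes patches that belonged to different regular windows, which is how the shifted step lets information cross window borders.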
Fig. 4. Details of the Swin transformer block. (a) Framework of the Swin transformer block; (b) Architecture of the multi-layer perceptron in the Swin transformer block.

transformer block is formulated as follows:

$$Y_{s1}^{l} = \mathrm{LN}(\text{W-MSA}(Y_{s}^{l})) + Y_{s}^{l}, \quad (1)$$

where $Y_{s1}^{l}$ and $Y_{s4}^{l}$ are the input and output of the l-th Swin transformer block, respectively, $Y_{s2}^{l}$ and $Y_{s3}^{l}$ are the intermediate results, W-MSA(·) denotes W-MSA, SW-MSA(·) represents SW-MSA, MLP(·) denotes MLP, and PM(·) denotes the patch merging layer. These sub-components are discussed in the following paragraphs.

Multi-head self-attention. The W-MSA and SW-MSA modules are designed to extract global and local features from noise maps. The standard MSA module [23,27] only learns the global relationship of an image. To extract both local noise abnormalities and global noise consistencies, we use W-MSA to limit the global awareness across windows by restricting the MSA calculation to each local image window (see Fig. 3). Moreover, SW-MSA helps maintain the long-distance relationship learning capability by enlarging the receptive fields of W-MSA. W-MSA is computed as

$$\text{W-MSA}(Q, K, V) = V \cdot \mathrm{SoftMax}\left(B_{rp} + \frac{QK^{T}}{\sqrt{d}}\right) \quad (6)$$

where $B_{rp} \in \mathbb{R}^{N_p^2 \times N_p^2}$ is the relative position bias of each head of the self-attention module in the computing process, $Q, K, V \in \mathbb{R}^{d \times N_p^2}$ are the query, key and value of patches, respectively, $N_p^2$ is the number of patches in the window to perform MSA, and $d$ is the dimension of $Q$ and $K$. An example of window shifting is shown in Fig. 3(b).

Patch merging. After performing the last MLP, the patch merging layer downsamples the feature maps. Thus, each group of 2 × 2 adjacent patches is separated into four new feature maps. The resolution of the output feature maps is downsampled by 2 times, and the number of channels is increased by 4 times. Finally, we resize the Swin transformer block outputs $Y_{s4}^{l}$ to match that of ResNet blocks in RGB-EB at corresponding scales.

3.4. Edge decoding branch

To better capture the edge details and assist model predictions, we design an EDB for edge artifacts learning. As shown in Fig. 2, EDB consists of four EAE modules, four Edge-upsample (E-up) blocks, and an edge supervision strategy. Supervised by edges of manipulated regions, E-up blocks recover the resolution using enhanced tampering feature maps from EAE modules. After channel compression by a 1 × 1 convolutional layer, outputs of the last E-up block represent binary boundary masks of manipulated regions. Four E-up blocks are cascaded to reach the full resolution from H/16 × W/16 to H × W. Proposed EAE modules, E-up blocks, and edge supervision strategy are detailed in the following subsections.

3.4.1. Edge artifact enhancement module

Post-processing methods can weaken the edge artifacts hidden in images, making it difficult to find subtle artifacts directly. Consequently, as shown in Fig. 6(a), we propose CNN-based EAE modules to indirectly enhance artifacts of fused feature maps. In each EAE module, the convolution layers are forced to learn to eliminate
where RG is the residual group layer (computed using Eq. (12)), and DL denotes the dense layer shown in Fig. 6(b).

$$\mathrm{RG}(Y) = \mathrm{Conv}_{3\times3}(\mathrm{ReLU}(\mathrm{Conv}_{3\times3}(Y))) + Y \quad (12)$$

3.4.2. Edge-upsample block

The framework of the E-up block is shown in Fig. 7(a). In each E-up block in EDB, hidden feature maps $Y_{ue}^{l-1}$ from the previous block are concatenated with edge outputs $Y_{e}^{l}$ of the corresponding EAE module (the hidden feature maps are not available in E-up block 1). Next, a bilinear upsample doubles the resolution of concatenated features. Next, two successive 3 × 3 convolutional layer groups joined with a residual 1 × 1 convolutional layer group recover subtle edge artifacts and aggregate feature maps.

same scale. The enhanced edge features guide the precise detection of the boundary and region details of manipulated areas.

We apply binary cross entropy as a loss function in this branch. The region loss function considering manipulated and authentic pixel losses is as follows:

$$\mathrm{loss}_r(x) = -\frac{\sum_{i=1}^{H \times W} \left[ y_i \ln R(x_i) + (1 - y_i) \ln(1 - R(x_i)) \right]}{H \times W}, \quad (14)$$

The total loss of the proposed EMT-Net consists of the edge and region losses and is formulated as

$$\mathrm{loss}_t(x) = \gamma_e \cdot \mathrm{loss}_e(x) + \gamma_r \cdot \mathrm{loss}_r(x), \quad (15)$$

where $\gamma_e$ and $\gamma_r$ are the weights of the edge and region losses, respectively. Function R(·) denotes pixel-level results of manipulated regions predicted by RDB.
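Eqs. (14) and (15) can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: predictions and labels are flattened lists, the clipping constant eps and the default weights gamma_e = gamma_r = 1 are assumptions (the text does not give the weights), and the edge loss is passed in as a precomputed number.

```python
import math


def region_bce(pred, target, eps=1e-7):
    """Mean binary cross entropy over all H*W pixels, as in Eq. (14).

    pred   -- predicted manipulation probabilities R(x_i), flattened
    target -- ground-truth labels y_i (1 = manipulated, 0 = authentic)
    eps    -- assumed clipping constant that keeps log() finite
    """
    total = 0.0
    for p, y in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(pred)


def total_loss(edge_loss, region_loss, gamma_e=1.0, gamma_r=1.0):
    """Weighted sum of edge and region losses, as in Eq. (15).
    The default weights are placeholders, not values from the paper."""
    return gamma_e * edge_loss + gamma_r * region_loss


# Two pixels predicted at 0.5: the loss equals ln 2.
print(round(region_bce([0.5, 0.5], [1, 0]), 4))  # 0.6931
```

A confident correct prediction drives the per-pixel term toward zero, while a confident wrong one is penalized heavily, which is why the clipping constant is needed in practice.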
Table 1
Description of the datasets.
Table 2
Quantitative comparison of EMT-Net with the six SoTA methods on CASIA, NIST, Columbia, COVER, CoMoFoD, and DEFACTO.
Method      CASIA          NIST           Columbia       COVER          CoMoFoD        DEFACTO
            AUC    F1      AUC    F1      AUC    F1      AUC    F1      AUC    F1      AUC    F1
ManTra      0.796  0.267   0.959  0.638   0.736  0.243   0.777  0.283   0.900  0.545   0.638  0.045
SPAN        0.709  0.213   0.779  0.252   0.741  0.463   0.791  0.325   0.854  0.267   0.869  0.217
MVSS        0.847  0.318   0.981  0.768   0.808  0.417   0.808  0.284   0.889  0.476   0.932  0.418
GSRNet      0.836  0.340   0.967  0.640   0.900  0.433   0.788  0.218   0.867  0.492   0.880  0.250
DenseFCN    0.809  0.203   0.979  0.812   0.761  0.257   0.754  0.185   0.889  0.331   0.910  0.404
LocateNet   0.754  0.273   0.986  0.738   0.718  0.411   0.813  0.282   0.897  0.590   0.941  0.457
EMT-Net     0.856  0.459   0.987  0.825   0.832  0.561   0.812  0.353   0.906  0.594   0.942  0.481
Fig. 8. Sample qualitative results of the proposed EMT-Net compared with six SoTA methods on six datasets. From left to right, we show manipulated images, ground-truth
binary masks of manipulation, predictions of the proposed EMT-Net, the EDB of our EMT-Net, MVSS, ManTra, SPAN, GSRNet, DenseFCN, and LocateNet.
indicating that the proposed supervision and enhancement strategies successfully extract subtle artifacts hidden in manipulated images. Edge artifact features guide EMT-Net in locating manipulated regions for improved pixel-level detection.

4.3. Ablation study

We conduct an ablation study on NEB, RGB-EB, and EDB components to investigate their contribution to improving the EMT-Net's performance using the same metrics as in the previous experiment, i.e., AUC and F1. For a comprehensive evaluation, we design 10 setups, and the core branches of each setup are shown in Table 3. The results of all setups are provided on the most challenging dataset, CASIA [29].

4.3.1. Effect of NEB

Three setups are considered to evaluate the contribution of NEB. N-A only uses ResNet blocks to extract local noise features, N-B merely adopts ViTs to explore global noise features, and N-C makes predictions without using any noise features from NEB. As shown in Table 3 and Fig. 9(a), the effectiveness of local noise features can be proved by quantitative and qualitative comparisons between setups N-A and N-C. Specifically, AUC increases from 0.791 to 0.815, and F1 increases from 0.304 to 0.325. The performance improvements demonstrate that noise anomalies can help locate tampered areas. In addition, when comparing N-B and setup N-C, the global noise feature effectively improves the model performance due to patch-level pixel self-attention. Unlike N-B and N-C, EMT-Net extracts global and local noise features using Swin transformer blocks, taking advantage of the complementarity between tampering traces. Moreover, the visualization results in Fig. 9(a) verify that the predictions of full EMT-Net are more accurate, again proving the importance of extracting global and local features in NEB.

4.3.2. Effect of RGB-EB

Although the structure of ResNet blocks in RGB-EB is simple, Table 3 and Fig. 9(b) show that it considerably improves the overall results. Compared with our full EMT-Net, setups R-D (employing Swin transformer blocks in RGB-EB), R-E (adopting ViTs in RGB-EB), and R-F (removing the whole RGB-EB branch) show reduced AUC and F1. The visualization results of R-D, R-E, and R-F are far from ground truths. Hence, relying only on noise features extracted by NEB is insufficient for precise detection, and extracting local visual artifacts from RGB images using CNNs is important. ResNet blocks in RGB-EB can extract local information of manipulated regions more finely. Swin transformers and ViTs fail to extract subtle local features, which leads to insufficient ability to locate manipulated regions. The results indicate that combining the features from RGB-EB and NEB can maximize the detection performance.
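For reference, the two pixel-level metrics used in the ablation, F1 and AUC, can be computed as in the sketch below. The 0.5 binarization threshold and the function names are assumptions made for illustration; the text does not state how predictions are thresholded. AUC is computed here as the fraction of (manipulated, authentic) pixel pairs that the scores rank correctly, with ties counted as one half.

```python
def pixel_f1(scores, mask, threshold=0.5):
    """F1 over pixels after binarizing scores at an assumed threshold."""
    tp = fp = fn = 0
    for s, m in zip(scores, mask):
        predicted = s >= threshold
        if predicted and m:
            tp += 1
        elif predicted:
            fp += 1
        elif m:
            fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def pixel_auc(scores, mask):
    """Area under the ROC curve via pairwise ranking (quadratic time,
    adequate for a small illustration only)."""
    pos = [s for s, m in zip(scores, mask) if m]
    neg = [s for s, m in zip(scores, mask) if not m]
    if not pos or not neg:
        raise ValueError("AUC needs both manipulated and authentic pixels")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


scores = [0.9, 0.8, 0.2, 0.1]   # toy per-pixel predictions
mask = [1, 0, 1, 0]             # toy ground truth
print(pixel_f1(scores, mask))   # 0.5
print(pixel_auc(scores, mask))  # 0.75
```

AUC is threshold-free whereas F1 depends on the chosen cut-off, which is one reason the two metrics can rank methods differently.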
Table 3
Performance comparison of EMT-Net setups used in the ablation study. Each setup is trained on CASIA v2 and tested on CASIA v1. A
benchmark training strategy is adopted for all setups. global + local denotes global and local features extracted by Swin transformer blocks,
global represents global features extracted by ViTs, local denotes local feature captured by ResNet blocks, ES is the edge supervision, EAE is
the edge artifact enhancement module, and ESB is the edge supervise branch proposed in [15].
Fig. 9. Examples of qualitative manipulation detection results in the ablation study on the effectiveness of two encoding branches. The comparison between different settings
of (a) NEB, (b) RGB-EB, and (c) EDB.
4.3.3. Effect of EDB

Setup E-G utilizes the Edge Supervise Branch (ESB) from MVSS [15] instead of our proposed EDB to extract edge features by applying handcrafted feature-based CNN blocks directly. E-H eliminates EDB, whereas E-I retains the edge supervision strategy without EAE modules. However, as shown in Table 3 and Fig. 9(c), E-G, E-H, and E-I yield lower performance than the full EMT-Net. EMT-Net can predict almost all details of edges around manipulated regions, whereas E-G only detects part of the boundaries (compare columns 4 and 6 in Fig. 9(c)). Hence, when detecting post-processed manipulated images, EAE modules can better distinguish natural boundaries and subtle edge artifacts using residual features compared to directly extracting edge features like the ESB. Without the assistance of EAE modules (column 9 in Fig. 8), edge supervision can hardly detect edge artifacts (column 9 in Fig. 9(c)). Combining the edge supervision and EAE modules, EMT-Net can precisely find visual artifacts around manipulated regions, even with post-processed images. EDB gradually uses the enhanced edge artifacts
Table 4
Robustness comparison of the proposed method with SoTA methods over NIST dataset. The evaluation metric is AUC.
Table 5
Robustness comparison of the proposed method with SoTA methods over NIST dataset. The evaluation metric is F1.
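The robustness experiments perturb NIST images with Gaussian blur (kernel size 3 to 15), JPEG compression, and Gaussian noise (variance 3 to 15). The sketch below shows how the blur and noise perturbations might be generated for a grayscale image stored as nested lists. Several details are assumptions not given in the text: the blur sigma is derived from the kernel size with the default rule documented for OpenCV's `getGaussianKernel`, borders are handled by edge replication, and the noise variance is taken on a 0 to 255 intensity scale. JPEG compression is omitted because it requires an image codec.

```python
import math
import random


def gaussian_kernel(ksize):
    """Normalized 1-D Gaussian kernel of odd length `ksize`.
    Assumed sigma rule: 0.3 * ((ksize - 1) * 0.5 - 1) + 0.8."""
    sigma = 0.3 * ((ksize - 1) * 0.5 - 1) + 0.8
    half = ksize // 2
    weights = [math.exp(-(i * i) / (2 * sigma * sigma))
               for i in range(-half, half + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_rows(image, kernel):
    """Convolve every row with the kernel, replicating border pixels."""
    half = len(kernel) // 2
    out = []
    for row in image:
        n = len(row)
        out.append([
            sum(k * row[min(max(x + d, 0), n - 1)]
                for d, k in zip(range(-half, half + 1), kernel))
            for x in range(n)
        ])
    return out


def gaussian_blur(image, ksize):
    """Separable blur: filter rows, transpose, filter again, transpose back."""
    kernel = gaussian_kernel(ksize)
    horizontal = blur_rows(image, kernel)
    transposed = [list(col) for col in zip(*horizontal)]
    vertical = blur_rows(transposed, kernel)
    return [list(col) for col in zip(*vertical)]


def add_gaussian_noise(image, variance, seed=0):
    """Add zero-mean Gaussian noise and clip to the assumed [0, 255] range."""
    rng = random.Random(seed)
    std = math.sqrt(variance)
    return [[min(max(v + rng.gauss(0.0, std), 0.0), 255.0) for v in row]
            for row in image]


image = [[100.0] * 5 for _ in range(4)]      # toy 4 x 5 image
blurred = gaussian_blur(image, ksize=5)
noisy = add_gaussian_noise(image, variance=9.0)
```

Sweeping `ksize` over odd values from 3 to 15 and `variance` from 3 to 15 would reproduce the ranges quoted in the robustness analysis under the assumptions above.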
features to guide the better detection of the tampered area during the upsampling process. The results validate that the proposed EDB with EAE modules can improve performance remarkably.

4.4. Robustness analysis

We separately apply post-processing methods, Gaussian blur, JPEG compression, and Gaussian noise on NIST [30] to verify the robustness of EMT-Net. For each post-processing method, we vary the kernel size in Gaussian blur (from 3 to 15), quality in JPEG compression (from 50% to 100%), and variance of Gaussian noise (from 3 to 15) for comprehensive verification. Results of robustness analysis are shown in Tables 4 and 5. Compared with SoTA CNN-based methods, EMT-Net achieves the most general robust performance using multiple enhanced tampering traces. The performance of SoTA methods drops when dealing with post-processed images, because of the failure to extract weakened edge artifacts. Moreover, the combination of EDB and EAE modules can recover boundary details hidden by post-processing methods. In summary, our strategy of fusing different tampering features (by NEB and RGB-EB) and reinforcing subtle artifacts (by edge supervision and EAE modules) can effectively tackle the challenges brought by post-processing methods.

4.5. Limitation analysis

We discuss the limitations of our EMT-Net in two special failure cases shown in Fig. 10.

In the first case, in Fig. 10(a), EMT-Net fails to make precise predictions on images forged by advanced Generative Adversarial Network (GAN) techniques, i.e., GAN inversion. GAN inversion helps generate close-to-reality images by inverting images back to the latent space
Fig. 10. Two failure cases of the proposed EMT-Net. (a) Images forged by GAN inversion-based techniques. (b) Authentic images without manipulated pixels.
of pretrained GANs and reconstructing them based on generators with inverted codes [37]. Bau et al. [38] removed artifact-causing modules in GANs by quantifying the causal relationship between modules. Recent GAN inversion-based methods can produce high-quality manipulated images. Zhu et al. [39] developed an in-domain GAN inversion approach and a domain-regularized optimization method to better support close-to-reality editing by varying the inverted codes. Bau et al. [40] proposed a method to rewrite pretrained GANs and generate realistic images by modifying the learned rules of GANs. We examine the proposed EMT-Net on images forged by GAN inversion-based removal [38], splicing [39], and copy-paste [40]. When the artifacts in tampered regions are strongly reduced, the proposed method finds incomplete tampered areas owing to insufficient tampering traces. Moreover, GAN-based manipulation techniques can harmonize heterogeneous regions, making detection difficult.

The second case is the performance drop of EMT-Net when dealing with authentic images without manipulated regions. In Fig. 10(b), though the proposed method can distinguish authentic images (first row), false alarms are present in the second and third rows when semantic objects are locally inconsistent because they are out of focus.

In summary, the proposed method shows good performance only when locating non-GAN-based manipulated areas in tampered images.

5. Conclusion

This paper presents a novel EMT-Net that learns multiple tampering traces from noise maps and RGB images for image manipulation detection. EMT-Net fuses tampering features, i.e., both global and local noise features learned by Swin transformers, and detail artifact features captured by CNN to detect the image manipulation techniques. The EDB can strengthen subtle boundary artifacts of fused feature maps and prevent the loss of artifacts weakened by post-processing methods. Extensive experiments demonstrate that EMT-Net outperforms SoTA approaches. Although our approach achieves superiority and robustness against various post-processing methods, it has some limitations, including low efficiency and high video memory usage. We also notice the drawbacks of the EMT-Net in dealing with advanced GAN and authentic images, which may be solved by more discriminative prior knowledge and appropriate loss functions, respectively. Furthermore, our method cannot accurately detect images forged by unlearned manipulation or post-processing techniques. A global feature extractor more efficient than Swin transformers is desirable in the future. Moreover, exploring uncertainty estimation methods to help detect out-of-distribution manipulated images will be an exciting new direction.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the National Natural Science Foundation of China no.61901016. The support funding was also from the National Key Research and Development Program of China, Grant no.2020YFB2103600.

References

[1] X. Song, X. Zhao, L. Fang, T. Lin, Discriminative representation combinations for accurate face spoofing detection, Pattern Recognit. 85 (2019) 220–231.
[2] L. Liu, L. Huang, F. Yin, Y. Chen, Offline signature verification using a region based deep metric learning network, Pattern Recognit. 118 (2021) 108009.
[3] Y. Luo, J. Ma, C.K. Yeo, BCMM: A novel post-based augmentation representation for early rumour detection on social media, Pattern Recognit. 113 (2021) 107818.
[4] A. Popescu, H. Farid, Exposing digital forgeries in color filter array interpolated images, IEEE Trans. Signal Process. 53 (10) (2005) 3948–3959.
[5] B. Mahdian, S. Saic, Using noise inconsistencies for blind image forensics, Image Vis. Comput. 27 (10) (2009) 1497–1503.
[6] Z. Lin, J. He, X. Tang, C. Tang, Fast, automatic and fine-grained tampered JPEG image detection via DCT coefficient analysis, Pattern Recognit. 42 (11) (2009) 2492–2501.
[7] X. Bi, C. Pun, Fast copy-move forgery detection using local bidirectional coherency error refinement, Pattern Recognit. 81 (2018) 161–175.
[8] N. Krawetz, A picture’s worth, Hacker Factor Solution. 6 (2) (2007) 2.
[9] H. Li, J. Huang, Localization of deep inpainting using high-pass fully convolutional network, in: IEEE International Conference on Computer Vision (ICCV), 2019, pp. 8300–8309.
[10] X. Wang, Y. Wang, J. Lei, B. Li, Q. Wang, J. Xue, Coarse-to-fine-grained method for image splicing region detection, Pattern Recognit. 122 (2022) 108347.
[11] Y. Wu, W. Abd-Almageed, P. Natarajan, Busternet: Detecting copy-move image forgery with source/target localization, in: European Conference on Computer Vision (ECCV), 2018, pp. 170–186.
[12] J. Zhong, Y. Gan, C. Vong, J. Yang, J. Zhao, J. Luo, Effective and efficient pixel-level detection for diverse video copy-move forgery types, Pattern Recognit. 122 (2022) 108286.
[13] Y. Wu, W. AbdAlmageed, P. Natarajan, Mantra-net: Manipulation tracing network for detection and localization of image forgeries with anomalous features, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 9543–9552.
[14] X. Hu, Z. Zhang, Z. Jiang, S. Chaudhuri, Z. Yang, R. Nevatia, SPAN: spatial pyramid attention network for image manipulation localization, in: European Conference on Computer Vision (ECCV), 2020, pp. 312–328.
[15] X. Chen, C. Dong, J. Ji, J. Cao, X. Li, Image manipulation detection by multi-view multi-scale supervision, in: IEEE International Conference on Computer Vision (ICCV), 2021, pp. 14165–14173.
[16] R. Salloum, Y. Ren, C.J. Kuo, Image splicing localization using a multi-task fully convolutional network (MFCN), J. Vis. Commun. Image Represent. 51 (2018) 201–209.
[17] P. Zhou, X. Han, V.I. Morariu, L.S. Davis, Learning rich features for image manipulation detection, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 1053–1061.
[18] J.J. Fridrich, J. Kodovský, Rich models for steganalysis of digital images, IEEE Trans. Inf. Forensics Secur. 7 (3) (2012) 868–882.
[19] C. Yang, H. Li, F. Lin, B. Jiang, H. Zhao, Constrained r-cnn: A general image manipulation detection model, in: IEEE International Conference on Multimedia and Expo (ICME), 2020, pp. 1–6.
[20] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, B. Guo, Swin transformer: Hierarchical vision transformer using shifted windows, in: IEEE International Conference on Computer Vision (ICCV), 2021, pp. 9992–10002.
[21] P. Zhou, B. Chen, X. Han, M. Najibi, A. Shrivastava, S. Lim, L. Davis, Generate, segment, and refine: Towards generic manipulation segmentation, in: AAAI, 2020, pp. 13058–13065.
[22] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A.N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, in: Conference on Neural Information Processing Systems (NeurIPS), 2017, pp. 5998–6008.
[23] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, N. Houlsby, An image is worth 16x16 words: Transformers for image recognition at scale, in: International Conference on Learning Representations (ICLR), 2021.
[24] Z. Liu, Y. Tan, Q. He, Y. Xiao, Swinnet: swin transformer drives edge-aware rgb-d and rgb-t salient object detection, IEEE Trans. Circuits Syst. Video Technol. 32 (7) (2022) 4486–4497.
[25] X. He, Y. Zhou, J. Zhao, D. Zhang, R. Yao, Y. Xue, Swin transformer embedding unet for remote sensing image semantic segmentation, IEEE Trans. Geosci. Remote Sens. 60 (2022) 1–15.
[26] J. Moh, F.Y. Shih, A general purpose model for image operations based on multilayer perceptrons, Pattern Recognit. 28 (7) (1995) 1083–1090.
[27] H. Touvron, M. Cord, A. Sablayrolles, G. Synnaeve, H. Jégou, Going deeper with image transformers, in: IEEE International Conference on Computer Vision (ICCV), 2021, pp. 32–42.
[28] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
[29] J. Dong, W. Wang, T. Tan, CASIA image tampering detection evaluation database, in: IEEE China Summit and International Conference on Signal and Information Processing (ChinaSIP), 2013, pp. 422–426.
[30] H. Guan, M. Kozak, E. Robertson, Y. Lee, A.N. Yates, A. Delgado, D. Zhou, T. Kheyrkhah, J. Smith, J.G. Fiscus, MFC datasets: Large-scale benchmark datasets for media forensic challenge evaluation, in: IEEE Winter Conference on Applications of Computer Vision Workshop, 2019, pp. 63–72.
[31] Y.-F. Hsu, S.-F. Chang, Detecting image splicing using geometry invariants and camera characteristics consistency, in: IEEE International Conference on Multimedia and Expo (ICME), 2006.
Xun Lin is a Ph.D. candidate in the School of Computer Science and Engineering, Beihang University. His research interests include image manipulation detection and medical image analysis.

Shuai Wang received the B.E. degree from Jilin University in 2007 and the Ph.D. degree from the School of Instrumentation and Optoelectronic Engineering, Beihang University, in 2012. Since 2022, she has been an associate professor with the School of Computer Science and Engineering, Beihang University. Her research interests include computer vision, intelligent perception, and AIoT.

Jiahao Deng received the B.E. degree in computer science and technology from Dalian University of Technology. He is currently pursuing the M.S. degree with the School of Computer Science and Engineering, Beihang University. He is interested in deep learning methods for image manipulation detection.

Ying Fu received the B.S. degree in electronic engineering from Xidian University, Xi’an, China, in 2009, the M.S. degree in automation from Tsinghua University, Beijing, China, in 2012, and the Ph.D. degree in information science and technology from the University of Tokyo, Tokyo, Japan, in 2015. She is currently a Professor with the School of Computer Science and Technology, Beijing Institute of Technology. Her research interests include computer vision, image and video processing, and computational photography.

Xiao Bai received the B.Eng. degree in computer science from Beihang University, Beijing, China, in 2001, and the Ph.D. degree in computer science from the University of York, York, U.K., in 2006. He was a Research Officer (Fellow and Scientist) with the Computer Science Department, University of Bath, Bath, U.K., until 2008. He is currently a Full Professor with the School of Computer Science and Engineering, Beihang University. He has authored or coauthored more than 100 papers in journals and refereed conferences. His current research interests include pattern recognition, image processing, and remote sensing image analysis. He is an Associate Editor of Pattern Recognition and Signal Processing.

Xinlei Chen received the B.E. and M.S. degrees in electronic engineering from Tsinghua University, Beijing, China, in 2009 and 2012, respectively, and the Ph.D. degree in electrical engineering from Carnegie Mellon University, Pittsburgh, PA, USA, in 2018. He was a Postdoctoral Research Associate with the Department of Electrical Engineering, Carnegie Mellon University, from 2018 to 2020. Since 2021, he has been an Assistant Professor with Tsinghua Shenzhen International Graduate School, Shenzhen, China. His research interests include AIoT, pervasive computing, and cyber-physical systems.

Xiaolei Qu received the B.E. degree in software engineering from Xi’an Jiaotong University, Xi’an, China, in 2007, the M.S. degree in pattern recognition from Huazhong University of Science and Technology, Wuhan, China, in 2009, and the Ph.D. degree in bioengineering from the University of Tokyo, Japan, in 2012. Since 2017, he has been an associate professor with the School of Instrumentation and Optoelectronic Engineering, Beihang University, Beijing, China. His research interests include medical ultrasound imaging, image processing and recognition.

Wenzhong Tang received the Ph.D. degree in computer science from Beihang University of China, Beijing, China, in 2008. He is a professor in the School of Computer Science and Engineering, Beihang University. His research interests include artificial intelligence, smart cities and big data.