0% found this document useful (0 votes)
87 views11 pages

DreamActor M1

The document presents DreamActor-M1, a novel human image animation framework utilizing a diffusion transformer (DiT) with hybrid guidance to enhance fine-grained control, multi-scale adaptability, and long-term temporal coherence in animations. It addresses limitations in existing methods by integrating implicit facial representations, 3D head spheres, and body skeletons for expressive motion synthesis, while also employing a progressive training strategy for diverse input scales. Experimental results indicate that DreamActor-M1 outperforms state-of-the-art techniques, providing robust and consistent animations across various scenarios.

Uploaded by

shahkhan23431
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
87 views11 pages

DreamActor M1

The document presents DreamActor-M1, a novel human image animation framework utilizing a diffusion transformer (DiT) with hybrid guidance to enhance fine-grained control, multi-scale adaptability, and long-term temporal coherence in animations. It addresses limitations in existing methods by integrating implicit facial representations, 3D head spheres, and body skeletons for expressive motion synthesis, while also employing a progressive training strategy for diverse input scales. Experimental results indicate that DreamActor-M1 outperforms state-of-the-art techniques, providing robust and consistent animations across various scenarios.

Uploaded by

shahkhan23431
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

DreamActor-M1: Holistic, Expressive and Robust Human Image Animation

with Hybrid Guidance

Yuxuan Luo* Zhengkun Rong* Lizhen Wang* Longhao Zhang* Tianshu Hu*† Yongming Zhu
Bytedance Intelligent Creation
{luoyuxuan, rongzhengkun, wanglizhen.2024, zhanglonghao.zlh, tianshu.hu,
zhuyongming}@bytedance.com
arXiv:2504.01724v2 [cs.CV] 3 Apr 2025

Figure 1. We introduce DreamActor-M1, a DiT-based human animation framework, with hybrid guidance to achieve fine-grained holistic
controllability, multi-scale adaptability, and long-term temporal coherence.

Abstract come these limitations. For motion guidance, our hybrid


control signals that integrate implicit facial representa-
While recent image-based human animation methods tions, 3D head spheres, and 3D body skeletons achieve
achieve realistic body and facial motion synthesis, crit- robust control of facial expressions and body movements,
ical gaps remain in fine-grained holistic controllability, while producing expressive and identity-preserving ani-
multi-scale adaptability, and long-term temporal coher- mations. For scale adaptation, to handle various body
ence, which leads to their lower expressiveness and ro- poses and image scales ranging from portraits to full-
bustness. We propose a diffusion transformer (DiT) based body views, we employ a progressive training strategy us-
framework, DreamActor-M1, with hybrid guidance to over- ing data with varying resolutions and scales. For appear-
ance guidance, we integrate motion patterns from sequen-
* Equal contribution.
† Corresponding author. tial frames with complementary visual references, ensur-

1
ing long-term temporal coherence for unseen regions dur- movements using a single control signal, especially for the
ing complex movements. Experiments demonstrate that our precise control over subtle facial expressions. Second, due
method outperforms the state-of-the-art works, delivering to incomplete input data and the inability to generate ex-
expressive results for portraits, upper-body, and full-body tended video sequences in a single pass, the model in-
generation with robust long-term consistency. Project Page: evitably loses information about unseen regions (such as
https://grisoon.github.io/DreamActor-M1/. clothing textures on the back) during the continuation pro-
cess that relies solely on the reference image and the fi-
nal frame of prior clips. This gradual information decay
1. Introduction leads to inconsistencies in unseen areas across sequentially
generated segments. Third, under multi-scale inputs, the
Human image animation has become a hot research di- varying information density and focus priorities make it
rection in video generation and provides potential applica- hard to achieve holistic and expressive animation within
tions for film production, the advertising industry, and video a single framework. To tackle these issues, we introduce
games. While recent advances in image and video diffusion stronger hybrid control signals, and complementary appear-
have enabled basic human motion synthesis from a single ance guidance that can fill missing information gaps, and a
image, previous works like [14, 20, 30, 40, 53, 58, 62] have progressive training strategy with a training dataset includ-
made great progress in this area. However, existing meth- ing diverse samples at different scales (e.g. portrait talking
ods remain constrained to coarse-grained animation. There and full-body dancing).
are still some critical challenges in achieving fine-grained Specifically, for motion guidance, we design a hybrid
holistic control (e.g., subtle eye blinks and lip tremors), control signals including implicit face latent for fine-grained
generalization ability to multi-scale inputs (portrait/upper- control of facial expressions, explicit head spheres for head
body/full-body), and long-term temporal coherence (e.g. scale and rotation and 3D body skeletons for torso move-
long-term consistency for unseen garment areas). In order ments and bone length adjustment which achieves robust
to handle these complex scenarios, we propose a DiT-based adaptation across substantial shape variations. For scenar-
framework, DreamActor-M1, to achieve holistic, expressive ios with limited information (e.g., multi-turn rotations or
and robust human image animation with hybrid guidance. partial-body references), we introduce complementary ap-
Recent developments in image-based animation have ex- pearance guidance by first sampling distinct poses from tar-
plored various directions, yet critical shortcomings persist get movements, then generating multi-frame references to
in attaining photorealistic, expressive, and adaptable gener- provide unseen area textures, and finally propagating these
ation for practical applications. While single-image facial references across video segments to maintain consistent de-
animation approaches employ landmark-driven or 3DMM- tails throughout long-term synthesis. To enable multi-scale
driven methodologies through GAN [3, 37, 39, 46, 48, 59] adaptation, we train the model with a progressive strategy
or NeRF [57] for expression manipulation, they frequently on a diverse dataset that includes different types of scenes
face limitations in image resolution and expression accu- like portrait acting, upper-body talking, and full-body danc-
racy. Although certain studies attain robust expression pre- ing. In summary, our key contributions are as follows.
cision via latent face representations [5, 6, 44, 51, 63] or
• We propose a holistic DiT-based framework and a pro-
achieve enhanced visual quality with diffusion-based archi-
gressive training strategy for human image animation that
tectures [41, 60], their applicability is typically restricted
supports flexible multi-scale synthesis.
to portrait regions, failing to meet the broader demands of
• We design hybrid control signals combining implicit fa-
real-world scenarios. Diffusion-based human image anima-
cial representations, explicit 3D head spheres, and body
tion [1, 14, 43, 53, 58] are able to generate basic limb ar-
skeletons to enable the expressive body and facial motion
ticulation, plausible garment deformations and hair dynam-
synthesis while supporting diverse character styles.
ics, but neglect the fine-grained facial expressions reenact-
• We develop complementary appearance guidance to miti-
ment. Current solutions remain under-explored in address-
gate information gaps of unseen areas between video seg-
ing holistic control of facial expressions and body move-
ments, enabling consistent video generation over long du-
ments while failing to accommodate real-world multi-scale
rations.
deployment requirements. Furthermore, existing methods
fail to maintain temporal coherence in long-form video syn-
2. Related Works
thesis, especially for unseen areas in the reference image.
In this work, we focus on addressing multi-scale driven Recent advancements in human image animation can be
synthesis, fine-grained face and body control, and long-term broadly categorized into single-image facial animation and
temporal consistency for unseen areas. Solving these chal- body animation, each addressing distinct technical chal-
lenges is non-trivial. First, it is very challenging to ac- lenges in photorealistic and expressive human motion syn-
curately manage both detailed facial expressions and body thesis.

2
Figure 2. Overview of DreamActor-M1. During the training stage, we first extract body skeletons and head spheres from driving frames
and then encode them to the pose latent using the pose encoder. The resultant pose latent is combined with the noised video latent along
the channel dimension. The video latent is obtained by encoding a clip from the input full video using 3D VAE. Facial expression is
additionally encoded by the face motion encoder, to generate implicit facial representations. Note that the reference image can be one
or multiple frames sampled from the input video to provide additional appearance details during training and the reference token branch
shares weights of our DiT model with the noise token branch. Finally, the denoised video latent is supervised by the encoded video latent.
Within each DiT block, the face motion token is integrated into the noise token branch via cross-attention (Face Attn), while appearance
information of ref token is injected to noise token through concatenated self-attention (Self Attn) and subsequent cross-attention (Ref Attn).

2.1. Single-Image Facial Animation Portrait [50] combined ControlNet for head pose and ex-
pression with patch-based local control and cross-identity
Early work primarily relied on GAN to create portrait an-
training. X-Nemo [60] developed 1-D latent descriptors
imation through warping and rendering. These studies
to disentangle identity-motion entanglement. SkyReels-
typically focused on improving driving expressiveness by
A1 [31] proposed a portrait animation framework leverag-
exploring various motion representations including neu-
ing the diffusion transformer [29].
ral keypoints [12, 37, 48, 59] or 3D face model parame-
ters [4, 32]. These methods can effectively decouple iden-
2.2. Single-Image Body Animation
tity and expression features, but the reproduction of ex-
pressions, especially subtle and exaggerated ones, is quite As a pioneering work in body animation, MRAA [38] pro-
limited. Another class of methods learned latent repre- posed distinct motion representations for animating articu-
sentation [5, 6, 44] directly from the driving face, en- lated objects in an unsupervised manner. Recent advance-
abling higher-quality expression reproduction. However, ments in latent diffusion models [33] have significantly
limited by GAN, these methods faced challenges in gen- boosted the development of body animation [1, 18, 25, 42,
erating high-quality results and adapting to different por- 47, 49]. Animate Anyone [14] introduced an additional Ref-
trait styles. In recent years, diffusion-based methods have erenceNet to extract appearance features from reference im-
demonstrated strong generative capabilities, leading to sig- ages. MimicMotion [58] designed a confidence-aware pose
nificant advancements in subsequent research. EMO [41] guidance mechanism and used facial landmarks to reen-
first introduced the ReferenceNet into the diffusion-based act facial expressions. Animate-X [40] employed implicit
portrait video generation task. Follow-your-emoji [25] uti- and explicit pose indicators for motion and pose features,
lized expression-aware landmarks with facial fine-grained respectively, achieving better generalization across anthro-
loss for precise motion alignment and micro-expression de- pomorphic characters. Instead of using skeleton maps,
tails. Megactor-Σ [54] designed a diffusion transformer in- Champ [62] and Make-Your-Anchor [16] leveraged a 3D
tegrating audio-visual signals for multi-modal control. X- human parametric model, while MagicAnimate [53] uti-

3
lized DensePose [11] to establish dense correspondences. 3.2. Hybrid Motion Guidance
TALK-Act [10] enhanced the textural awareness with ex-
To achieve expressive and robust human animation, in this
plicit motion guidance to improve the generation quality.
paper, we intricately craft the motion guidance and propose
StableAnimator [43] and HIA [52] addressed the identity
hybrid control signals comprising implicit facial represen-
preservation and motion blur modeling, respectively. Addi-
tations, 3D head spheres, and 3D body skeletons.
tionally, some previous studies [20, 30, 55] followed a plug-
Implicit Facial Representations. In contrast to conven-
in paradigm, enhancing effectiveness without requiring ad-
tional approaches that rely on facial landmarks for expres-
ditional training of existing model parameters. To address
sion generation, our method introduces implicit facial repre-
missing appearances under significant viewpoint changes,
sentations. This innovative approach not only enhances the
MSTed [13] introduced multiple reference images as in-
preservation of intricate facial expression details but also
put. MIMO [26] and Animate Anyone 2 [15] separately
facilitates the effective decoupling of facial expressions,
modeled humans, objects, and backgrounds, aiming to ani-
identity and head pose, enabling more flexible and realis-
mate characters with environmental affordance. Apart from
tic animation. Specifically, our pipeline begins by detect-
UNet-based diffusion, recent works [8, 35] have also begun
ing and cropping the faces in the driving video, which are
adapting diffusion transformers [29] for human animation.
then resized to a standardized format F ∈ Rt×3×224×224 .
A pre-trained face motion encoder Ef and an MLP layer
3. Method
are employed to encode faces to face motion tokens M ∈
Given one or multiple reference images IR along with a Rt×c . The M is fed into the DiT block through a cross-
driven video VD , our objective is to generate a realistic attention layer. The face motion encoder is initialized using
video that depicts the reference character mimicking the an off-the-shelf facial representation learning method [44],
motions present in the driving video. In this section, we which has been pre-trained on large-scale datasets to extract
begin with a concise introduction to our Diffusion Trans- identity-independent expression features. This initializa-
former (DiT) backbone in Sec. 3.1. Following that, we of- tion not only accelerates convergence but also ensures that
fer an in-depth explanation of our carefully-designed hybrid the encoded motion tokens are robust to variations in iden-
control signals in Sec. 3.2. Subsequently, we introduce the tity, focusing solely on the nuances of facial expressions.
complementary appearance guidance in Sec. 3.3. Finally, By leveraging implicit facial representations, our method
we present the progressive training processes in Sec. 3.4. achieves superior performance in capturing subtle expres-
sion dynamics while maintaining a high degree of flexibility
3.1. Preliminaries for downstream tasks such as expression transfer and reen-
As shown in Fig. 2, our overall framework adheres to actment. It is worth noting that we additionally train an
the Latent Diffusion Model (LDM) [33] for training the audio-driven encoder capable of mapping speech signals to
model within the latent space of a pre-trained 3D Varia- face motion token. This encoder enables facial expression
tional Autoencoder (VAE) [56]. We utilize the MMDiT [7] editing, particularly lip-syncing, without the need for a driv-
as the backbone network, which has been pre-trained on ing video.
text-to-video and image-to-video tasks, Seaweed [23], and 3D Head Spheres. Since the implicit facial representations
follow the pose condition training scheme proposed by are designed to exclusively control facial expressions, we
OmniHuman-1 [22]. Note that we employ Flow Match- introduce an additional 3D head sphere to independently
ing [24] as the training objective. manage head pose. This dual-control strategy ensures that
Unlike the prevailing ReferenceNet-based human anima- facial expressions and head movements are decoupled, en-
tion approaches [14, 15], we refrain from employing a copy abling more precise and flexible animation. Specifically,
of the DiT as the ReferenceNet to inject the reference fea- we utilize an off-the-shelf face tracking method [45] to ex-
ture into the DenoisingNet following [21]. Instead, we flat- tract 3D facial parameters from the driving video, including
ten the latent feature IeR and VeD extracted through the VAE, camera parameters and rotation angles. These parameters
patchify, concatenate them together, and then feed them into are then used to render the head as a color sphere projected
the DiT. It facilitates the information interaction between onto the 2D image plane. The sphere’s position is care-
the reference and video frames through 3D self-attention fully aligned with the position of the driving head in the
layers and spatial cross-attention layers integrated through- video frame, ensuring spatial consistency. Additionally, the
out the entire model. Specifically, in each DiT block, given size of the sphere is scaled to match the reference head’s
the concatenated token T ∈ R(t×h×w)×c , we perform self- size, while its color is dynamically determined by the driv-
attention along the first dimension. Then, we split it into ing head’s orientation, providing a visual cue for head rota-
TR ∈ RtR ×(h×w)×c and TD ∈ RtD ×(h×w)×c and reshape tion. This 3D sphere representation offers a highly flexible
them to TR ∈ R1×(h×w×tR )×c and TD ∈ RtD ×(h×w)×c to and intuitive way to control head pose, significantly reduc-
perform cross-attention along the second dimension. ing the model’s learning complexity by abstracting complex

4
Figure 3. Overview of our inference pipeline. First, we (optionally) generate multiple pseudo-references to provide complementary
appearance information. Next, we extract hybrid control signals comprising implicit facial motion and explicit poses (head sphere and
body skeleton) from the driving video. Finally, these signals are injected into a DiT model to synthesize animated human videos. Our
framework decouples facial motion from body poses, with facial motion signals being alternatively derivable from speech inputs.

3D head movements into a simple yet effective 2D represen- tended timeframes. During training, we compute the ro-
tation. This approach is particularly advantageous for pre- tation angles for all frames in the input video and sort them
serving the unique head structures of reference characters, based on their z-axis rotation values (yaw). From this sorted
especially those from anime and cartoon domains. set, we strategically select three key frames corresponding
3D Body Skeletons. For body control, we introduce 3D to the maximum, minimum, and median z-axis rotation an-
body skeletons with bone length adjustment. In particu- gles. These frames serve as representative viewpoints, en-
lar, we first use 4DHumans [9] and HaMeR [28] to esti- suring comprehensive coverage of the object’s orientation.
mate body and hand parameters of the SMPL-X [27] model. Furthermore, for videos featuring full-body compositions,
Then we select the body joints, project them onto the 2D we introduce an additional step: a single frame is randomly
image plane, and connect them with lines to construct the selected and cropped to a half-body portrait format, which
skeleton maps. We opt to use skeletons instead of render- is then included as an auxiliary reference frame. This step
ing the full body, as done in Champ [62], to avoid providing enriches the model’s understanding of both global and local
the model with strong guidance on body shape. By lever- structural details.
aging skeletons, we encourage the model to learn the shape During inference, our protocol offers an optional two-
and appearance of the character directly from the reference stage generation mode to handle challenging scenarios,
images. This approach not only reduces bias introduced by such as cases where the reference image is a single frontal
predefined body shapes but also enhances the model’s abil- half-body portrait while the driving video features full-body
ity to generalize across diverse body types and poses, lead- frames with complex motions like turning or side views.
ing to more flexible and realistic results. The body skele- First, the model is utilized to synthesize a multi-view video
tons and the head spheres are concatenated in the channel sequence from the single reference image. This initial out-
dimension, and fed into a pose encoder Ep to obtain the put captures a range of plausible viewpoints and serves as a
pose feature. The pose feature and noised video feature are foundation for further refinement.We apply the same frame
then concatenated and processed by an MLP layer to obtain selection strategy used during training to select the most
the noise token. informative frames. These selected frames are then rein-
During inference, to address variations in skeletal pro- tegrated into the model as the complementary appearance
portions across subjects, we adopt a normalization process guidance, enabling the generation of a final output that ex-
to adjust the bone length. First, we use a pre-trained im- hibits enhanced spatial and temporal coherence. This itera-
age editing model [36] to transform reference and driving tive approach not only improves the robustness of the model
images into a standardized A-pose configuration. Next, we but also ensures high-quality results even under constrained
leverage RTMPose [17] to calculate the skeletal proportions input conditions.
of both the driving subject and the reference subject. Fi-
nally, we perform anatomical alignment by proportionally 3.4. Progressive Training Process
adjusting the bone lengths of the driving subject to match Our training process is divided into three distinct stages
the skeletal measurements of the reference subject. to ensure a gradual and effective adaptation of the model.
In the first stage, we utilize only two control signals: 3D
3.3. Complementary Appearance Guidance
body skeletons and 3D head spheres, deliberately exclud-
We propose a novel multi-reference injection protocol to en- ing the implicit facial representations. This initial stage is
hance the model’s capability for robust multi-scale, multi- designed to facilitate the transition of the base video gener-
view, and long-term video generation. This approach ad- ation model to the task of human animation. By avoiding
dresses the challenges of maintaining temporal consistency overly complex control signals that may hinder the model’s
and visual fidelity across diverse viewing angles and ex- learning process, we allow the model to establish a strong

5
Figure 4. The comparisons with human image animation works, Animate Anyone [14], Champ [62], MimicMotion [58] and DisPose [20].
Our method demonstrates results with better fine-grained motions, identity preservation, temporal consistency and high fidelity.

foundational understanding of the task. In the second stage, 20,000 steps, and the third stage with 30,000 steps. To en-
we introduce the implicit facial representations while keep- hance the model’s generalization capability for arbitrary du-
ing all other model parameters frozen. During this stage, rations and resolutions, during training, the length of the
only the face motion encoder and face attention layers are sampled video clips is randomly selected from 25 to 121
trained, enabling the model to focus on learning the intri- frames, while the spatial resolution is resized to an area of
cate details of facial expressions without the interference 960 × 640, maintaining the original aspect ratio. All stages
of other variables. Finally, in the third stage, we unfreeze are trained with 8 H20 GPUs using the AdamW optimizer
all model parameters and conduct a comprehensive train- with a learning rate of 5e−6 . During inference, each video
ing session, allowing the model to fine-tune its performance segment contains 73 frames. To ensure full-video consis-
by jointly optimizing all components. This staged approach tency, we use the last latent from the current segment as the
ensures a robust and stable training process, ultimately lead- initial latent for the next segment, modeling the next seg-
ing to a more effective and adaptable model. ment generation as an image-to-video generation task. The
classifier-free guidance (cfg) parameters for both the refer-
4. Experiments ences and motion control signals are set to 2.5.
Datasets. For training, we construct a comprehensive
4.1. Experimental Setups
dataset by collecting video data from various sources, to-
Implementation Details. Our training weights are initial- taling 500 hours of footage. This dataset encompasses a
ized from a pretrained image-to-video DiT model [23] and diverse range of scenarios, including dancing, sports, film
warm up with a condition training strategy [22]. Then, we scenes, and speeches, ensuring broad coverage of human
train the first stage with 20,000 steps, the second stage with motion and expressions. The dataset is balanced in terms

6
Figure 5. Our comparisons with portrait image animation works, LivePortrait [12], X-Portrait [50], Skyreels-A1 [31] and Runway Act-
One [34]. Our method demonstrates more accurate and expressive portrait animation capabilities.

of framing, with full-body shots and half-body shots each FID↓ SSIM↑ PSNR↑ LPIPS↓ FVD↓
accounting for approximately 50% of the data. In addition, AA [14] 36.72 0.791 21.74 0.266 158.3
we leverage Nersemble [19] to further improve the synthe- Champ [62] 40.21 0.732 20.18 0.281 171.2
sis quality of faces. For evaluation, we utilize our collected MimicMotion [58] 35.90 0.799 22.25 0.253 149.9
DisPose [20] 33.01 0.804 21.99 0.248 144.7
dataset, which provides a varied and challenging bench-
mark, enabling a robust assessment of the model’s gener- Ours 27.27 0.821 23.93 0.206 122.0
alization capabilities across different scenarios.
Evaluation Metrics. We adhere to established evaluation Table 1. Quantitative comparisons with body animation methods
on our collected dataset.
metrics employed in prior research, including FID, SSIM,
LPIPS, PSNR, and FVD. The first four are used to evaluate
the generation quality of each frame, while the last one is Comparisons with Portrait Animation Methods. We
used to assess the video fidelity. also compare the DreamActor-M1 with state-of-the-art por-
trait animation methods, including LivePortrait [12], X-
4.2. Comparisons with Existing Methods Portrait [50], SkyReels-A1 [31], and Act-One [34], as
To comprehensively demonstrate the effectiveness of our shown in Tab. 2 and Fig. 5. As shown in Tab. 2, the video-
work, we conducted experiments on both body animation driven results consistently outperform all competing meth-
and portrait animation tasks. Note that our method demon- ods across all metrics on our collected dataset.
strates strong performance with just a single reference im- While facial expressions and head pose are decoupled in
age in most cases. To ensure fairness in comparison with our framework, our method can also be extended to audio-
other methods, we only used multiple reference images in driven facial animation. Specifically, we train a face motion
the ablation study, while a single reference image was em- encoder to map speech signals to face motion tokens, lead-
ployed in the comparative analysis. We strongly recom- ing to realistic and lip-sync animations. As an extended
mend readers refer to the supplementary video. application, we omit quantitative comparisons. Please refer
Comparisons with Body Animation Methods. We to our supplementary video for more results.
perform the qualitative and quantitative evaluation of
4.3. Ablation Study
DreamActor-M1 with our collected dataset and compare
with state-of-the-art body animation methods, including We conducted comprehensive ablation studies to evaluate
Animate Anyone [14], Champ [62], MimicMotion [58], and the impact of several core components of our method.
DisPose [20], as shown in Tab. 1 and Fig. 4. We can see Multi-Reference Protocol. We compare two settings: (a)
that our proposed DreamActor-M1 outperforms the current inference with a single reference image, (b) a two-stage in-
state-of-the-art results. ference approach as described in Sec. 3.3, where pseudo

7
FID↓ SSIM↑ PSNR↑ LPIPS↓ FVD↓ mentary reference images provide additional visual infor-
LivePortrait [12] 31.72 0.809 24.25 0.270 147.1 mation about unseen areas, enabling the video generation
X-Portrait [50] 30.09 0.774 22.98 0.281 150.9 process to leverage reference details. This helps avoid in-
SkyReels-A1 [31] 30.66 0.811 24.11 0.262 133.8 formation loss and thereby maintains consistency through-
Act-One [34] 29.84 0.817 25.07 0.259 135.2
out the video. Nevertheless, the performance achieved by a
Ours 25.70 0.823 28.44 0.238 110.3 single reference image remains competitive, demonstrating
its sufficiency for most scenarios.
Table 2. Quantitative comparisons with portrait animation meth- Hybrid Control Signals. We further investigate the contri-
ods on our collected dataset. bution of our hybrid control signals by ablating key com-
FID↓ SSIM↑ PSNR↑ LPIPS↓ FVD↓ ponents: (a) substituting 3D head sphere and skeleton with
3D mesh (b) substituting implicit facial representations with
Single-R 28.22 0.798 25.86 0.223 120.5
Multi-R (pseudo) 26.53 0.812 26.22 0.219 116.6 3d facial landmarks. Results are shown in Fig. 6. It reveal
a significant performance degradation under these settings,
Table 3. Ablation study on multi-reference. emphasizing the importance of each component in our hy-
brid control framework. Specifically, the 3D skeletons with
bone length adjustment provide more accurate spatial guid-
ance, and the implicit facial representations capture sub-
tle expression details more effectively than traditional land-
marks. These findings demonstrate the effectiveness and su-
periority of our proposed hybrid control signals in achieving
high-quality and realistic human image animation.

5. Conclusion
In this paper, we present a holistic human image animation
framework DreamActor-M1 addressing multi-scale adap-
tation, fine-grained facial expression and body movement
control, and long-term consistency in unseen regions. We
employ a progressive training strategy using data with vary-
ing resolutions and scales to handle various image scales
ranging from portraits to full-body views. By decoupling
identity, body pose, and facial expression through hybrid
control signals, our method achieves precise facial dynam-
ics and vivid body movements while preserving the charac-
ter identity. The proposed complementary appearance guid-
ance resolves information gaps in cross-scale animation and
unseen region synthesis. We believe these innovations pro-
vide potential insights for future research in complex mo-
tion modeling and real-world deployment of expressive hu-
man animation.
Limitation. Our framework faces inherent difficulties in
controlling dynamic camera movements, and fails to gener-
ate physical interactions with environmental objects. In ad-
dition, the bone length adjustment of our method using [36]
exhibits instability in edge cases, requiring multiple itera-
Figure 6. Ablation study of 3D skeletons with bone length adjust- tions with manual selections for optimal cases. These chal-
ment (BLA) and implicit face features. lenges still need to be addressed in future research.
Ethics considerations. Human image animation has possi-
reference images are generated first, followed by multi- ble social risks, like being misused to make fake videos.
reference inference. Results are shown in Tab. 3. It demon- The proposed technology could be used to create fake
strates that pseudo multi-reference inference outperforms videos of people, but existing detection tools [2, 61] can
single-reference inference in terms of long-term video gen- spot these fakes. To reduce these risks, clear ethical rules
eration quality and temporal consistency. This is because and responsible usage guidelines are necessary. We will
during the extended video generation process, the supple- strictly restrict access to our core models and codes to pre-

8
vent misuse. Images and videos are all from publicly avail- [9] Shubham Goel, Georgios Pavlakos, Jathushan Rajasegaran,
able sources. If there are any concerns, please contact us Angjoo Kanazawa, and Jitendra Malik. Humans in 4d: Re-
and we will delete it in time. constructing and tracking humans with transformers. In
Proceedings of the IEEE/CVF International Conference on
Computer Vision, pages 14783–14794, 2023. 5
Acknowledgment [10] Jiazhi Guan, Quanwei Yang, Kaisiyuan Wang, Hang Zhou,
We extend our sincere gratitude to Shanchuan Lin, Lu Jiang, Shengyi He, Zhiliang Xu, Haocheng Feng, Errui Ding, Jing-
dong Wang, Hongtao Xie, et al. Talk-act: Enhance textural-
Zhurong Xia, Jianwen Jiang, Zerong Zheng, Chao Liang,
awareness for 2d speaking avatar reenactment with diffusion
Youjiang Xu, Ming Zhou, Siyu Liu, Xin Dong and Yanbo model. In SIGGRAPH Asia 2024 Conference Papers, pages
Zheng for their invaluable contributions and supports to this 1–11, 2024. 4
research work. [11] Rıza Alp Güler, Natalia Neverova, and Iasonas Kokkinos.
Densepose: Dense human pose estimation in the wild. In
References Proceedings of the IEEE conference on computer vision and
pattern recognition, pages 7297–7306, 2018. 4
[1] Di Chang, Yichun Shi, Quankai Gao, Jessica Fu, Hongyi Xu, [12] Jianzhu Guo, Dingyun Zhang, Xiaoqiang Liu, Zhizhou
Guoxian Song, Qing Yan, Yizhe Zhu, Xiao Yang, and Mo- Zhong, Yuan Zhang, Pengfei Wan, and Di Zhang. Livepor-
hammad Soleymani. Magicpose: Realistic human poses and trait: Efficient portrait animation with stitching and retarget-
facial expressions retargeting with identity-aware diffusion. ing control. arXiv preprint arXiv:2407.03168, 2024. 3, 7,
arXiv preprint arXiv:2311.12052, 2023. 2, 3 8
[2] Haoxing Chen, Yan Hong, Zizheng Huang, Zhuoer Xu, [13] Fa-Ting Hong, Zhan Xu, Haiyang Liu, Qinjie Lin, Luchuan
Zhangxuan Gu, Yaohui Li, Jun Lan, Huijia Zhu, Jianfu Song, Zhixin Shu, Yang Zhou, Duygu Ceylan, and Dan Xu.
Zhang, Weiqiang Wang, et al. Demamba: Ai-generated Free-viewpoint human animation with pose-correlated refer-
video detection on million-scale genvideo benchmark. arXiv ence selection. arXiv preprint arXiv:2412.17290, 2024. 4
preprint arXiv:2405.19707, 2024. 8 [14] Li Hu. Animate anyone: Consistent and controllable image-
[3] Yu Deng, Duomin Wang, Xiaohang Ren, Xingyu Chen, and to-video synthesis for character animation. In Proceedings of
Baoyuan Wang. Portrait4d: Learning one-shot 4d head the IEEE/CVF Conference on Computer Vision and Pattern
avatar synthesis using synthetic data. In Proceedings of Recognition, pages 8153–8163, 2024. 2, 3, 4, 6, 7
the IEEE/CVF Conference on Computer Vision and Pattern [15] Li Hu, Guangyuan Wang, Zhen Shen, Xin Gao, Dechao
Recognition, pages 7119–7130, 2024. 2 Meng, Lian Zhuo, Peng Zhang, Bang Zhang, and Liefeng
[4] Michail Christos Doukas, Stefanos Zafeiriou, and Viktoriia Bo. Animate anyone 2: High-fidelity character image
Sharmanska. Headgan: One-shot neural head synthesis and animation with environment affordance. arXiv preprint
editing. In Proceedings of the IEEE/CVF International Con- arXiv:2502.06145, 2025. 4
ference on Computer Vision (ICCV), pages 14398–14407, [16] Ziyao Huang, Fan Tang, Yong Zhang, Xiaodong Cun, Juan
2021. 3 Cao, Jintao Li, and Tong-Yee Lee. Make-your-anchor: A
[5] Nikita Drobyshev, Jenya Chelishev, Taras Khakhulin, Alek- diffusion-based 2d avatar generation framework. In Proceed-
sei Ivakhnenko, Victor Lempitsky, and Egor Zakharov. ings of the IEEE/CVF Conference on Computer Vision and
Megaportraits: One-shot megapixel neural head avatars. In Pattern Recognition, pages 6997–7006, 2024. 3
Proceedings of the 30th ACM International Conference on [17] Tao Jiang, Peng Lu, Li Zhang, Ningsheng Ma, Rui Han,
Multimedia, pages 2663–2671, 2022. 2, 3 Chengqi Lyu, Yining Li, and Kai Chen. Rtmpose: Real-
[6] Nikita Drobyshev, Antoni Bigata Casademunt, Konstantinos time multi-person pose estimation based on mmpose. arXiv
Vougioukas, Zoe Landgraf, Stavros Petridis, and Maja Pan- preprint arXiv:2303.07399, 2023. 5
tic. Emoportraits: Emotion-enhanced multimodal one-shot [18] Johanna Karras, Aleksander Holynski, Ting-Chun Wang,
head avatars. In Proceedings of the IEEE/CVF Conference and Ira Kemelmacher-Shlizerman. Dreampose: Fashion
on Computer Vision and Pattern Recognition, pages 8498– image-to-video synthesis via stable diffusion. In 2023
8507, 2024. 2, 3 IEEE/CVF International Conference on Computer Vision
[7] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim (ICCV), pages 22623–22633. IEEE, 2023. 3
Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik [19] Tobias Kirschstein, Shenhan Qian, Simon Giebenhain, Tim
Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling recti- Walter, and Matthias Nießner. Nersemble: Multi-view radi-
fied flow transformers for high-resolution image synthesis. ance field reconstruction of human heads. ACM Transactions
In Forty-first international conference on machine learning, on Graphics (TOG), 42(4):1–14, 2023. 7
2024. 4 [20] Hongxiang Li, Yaowei Li, Yuhang Yang, Junjie Cao, Zhi-
[8] Qijun Gan, Yi Ren, Chen Zhang, Zhenhui Ye, Pan Xie, hong Zhu, Xuxin Cheng, and Long Chen. Dispose: Disen-
Xiang Yin, Zehuan Yuan, Bingyue Peng, and Jianke Zhu. tangling pose guidance for controllable human image anima-
Humandit: Pose-guided diffusion transformer for long- tion. arXiv preprint arXiv:2412.09349, 2024. 2, 4, 6, 7
form human motion video generation. arXiv preprint [21] Gaojie Lin, Jianwen Jiang, Jiaqi Yang, Zerong Zheng, and
arXiv:2502.04847, 2025. 4 Chao Liang. Omnihuman-1: Rethinking the scaling-up

9
of one-stage conditioned human animation models. arXiv [36] Yichun Shi, Peng Wang, and Weilin Huang. Seededit:
preprint arXiv:2502.01061, 2025. 4 Align image re-generation to image editing. arXiv preprint
[22] Gaojie Lin, Jianwen Jiang, Jiaqi Yang, Zerong Zheng, and arXiv:2411.06686, 2024. 5, 8
Chao Liang. Omnihuman-1: Rethinking the scaling-up [37] Aliaksandr Siarohin, Stéphane Lathuilière, Sergey Tulyakov,
of one-stage conditioned human animation models. arXiv Elisa Ricci, and Nicu Sebe. First order motion model for
preprint arXiv:2502.01061, 2025. 4, 6 image animation. Advances in neural information processing
[23] Shanchuan Lin, Xin Xia, Yuxi Ren, Ceyuan Yang, Xuefeng systems, 32, 2019. 2, 3
Xiao, and Lu Jiang. Diffusion adversarial post-training for [38] Aliaksandr Siarohin, Oliver J Woodford, Jian Ren, Menglei
one-step video generation. arXiv preprint arXiv:2501.08316, Chai, and Sergey Tulyakov. Motion representations for ar-
2025. 4, 6 ticulated animation. In Proceedings of the IEEE/CVF con-
[24] Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximil- ference on computer vision and pattern recognition, pages
ian Nickel, and Matt Le. Flow matching for generative mod- 13653–13662, 2021. 3
eling. arXiv preprint arXiv:2210.02747, 2022. 4 [39] Jingxiang Sun, Xuan Wang, Lizhen Wang, Xiaoyu Li, Yong
[25] Yue Ma, Hongyu Liu, Hongfa Wang, Heng Pan, Yingqing Zhang, Hongwen Zhang, and Yebin Liu. Next3d: Genera-
He, Junkun Yuan, Ailing Zeng, Chengfei Cai, Heung-Yeung tive neural texture rasterization for 3d-aware head avatars. In
Shum, Wei Liu, et al. Follow-your-emoji: Fine-controllable CVPR, 2023. 2
and expressive freestyle portrait animation. In SIGGRAPH [40] Shuai Tan, Biao Gong, Xiang Wang, Shiwei Zhang, Dandan
Asia 2024 Conference Papers, pages 1–12, 2024. 3 Zheng, Ruobing Zheng, Kecheng Zheng, Jingdong Chen,
[26] Yifang Men, Yuan Yao, Miaomiao Cui, and Liefeng Bo. and Ming Yang. Animate-x: Universal character image an-
Mimo: Controllable character video synthesis with spatial imation with enhanced motion representation. ICLR 2025,
decomposed modeling. arXiv preprint arXiv:2409.16160, 2025. 2, 3
2024. 4 [41] Linrui Tian, Qi Wang, Bang Zhang, and Liefeng Bo. Emo:
[27] Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Emote portrait alive - generating expressive portrait videos
Timo Bolkart, Ahmed A. A. Osman, Dimitrios Tzionas, and with audio2video diffusion model under weak conditions,
Michael J. Black. Expressive body capture: 3d hands, face, 2024. 2, 3
and body from a single image. In Proceedings IEEE Conf. [42] Zhengyan Tong, Chao Li, Zhaokang Chen, Bin Wu, and
on Computer Vision and Pattern Recognition (CVPR), 2019. Wenjiang Zhou. Musepose: a pose-driven image-to-video
5 framework for virtual human generation. arxiv, 2024. 3
[28] Georgios Pavlakos, Dandan Shan, Ilija Radosavovic, Angjoo [43] Shuyuan Tu, Zhen Xing, Xintong Han, Zhi-Qi Cheng, Qi
Kanazawa, David Fouhey, and Jitendra Malik. Reconstruct- Dai, Chong Luo, and Zuxuan Wu. Stableanimator: High-
ing hands in 3D with transformers. In CVPR, 2024. 5 quality identity-preserving human image animation. arXiv
[29] William Peebles and Saining Xie. Scalable diffusion models preprint arXiv:2411.17697, 2024. 2, 4
with transformers. In Proceedings of the IEEE/CVF inter- [44] Duomin Wang, Yu Deng, Zixin Yin, Heung-Yeung Shum,
national conference on computer vision, pages 4195–4205, and Baoyuan Wang. Progressive disentangled representation
2023. 3, 4 learning for fine-grained controllable talking head synthesis.
[30] Bohao Peng, Jian Wang, Yuechen Zhang, Wenbo Li, Ming- In Proceedings of the IEEE/CVF Conference on Computer
Chang Yang, and Jiaya Jia. Controlnext: Powerful and effi- Vision and Pattern Recognition (CVPR), 2023. 2, 3, 4
cient control for image and video generation. arXiv preprint [45] Lizhen Wang, Zhiyuan Chen, Tao Yu, Chenguang Ma, Liang
arXiv:2408.06070, 2024. 2, 4 Li, and Yebin Liu. Faceverse: a fine-grained and detail-
[31] Di Qiu, Zhengcong Fei, Rui Wang, Jialin Bai, Changqian Yu, controllable 3d face morphable model from a hybrid dataset.
Mingyuan Fan, Guibin Chen, and Xiang Wen. Skyreels-a1: In Proceedings of the IEEE/CVF conference on computer vi-
Expressive portrait animation in video diffusion transform- sion and pattern recognition, pages 20333–20342, 2022. 4
ers. arXiv preprint arXiv:2502.10841, 2025. 3, 7, 8 [46] Tengfei Wang, Bo Zhang, Ting Zhang, Shuyang Gu, Jianmin
[32] Yurui Ren, Ge Li, Yuanqi Chen, Thomas H. Li, and Shan Bao, Tadas Baltrusaitis, Jingjing Shen, Dong Chen, Fang
Liu. Pirenderer: Controllable portrait image generation via Wen, Qifeng Chen, et al. Rodin: A generative model for
semantic neural rendering, 2021. 3 sculpting 3d digital avatars using diffusion. In Proceedings
[33] Robin Rombach, Andreas Blattmann, Dominik Lorenz, of the IEEE/CVF conference on computer vision and pattern
Patrick Esser, and Björn Ommer. High-resolution image recognition, pages 4563–4573, 2023. 2
synthesis with latent diffusion models. In Proceedings of [47] Tan Wang, Linjie Li, Kevin Lin, Yuanhao Zhai, Chung-
the IEEE/CVF conference on computer vision and pattern Ching Lin, Zhengyuan Yang, Hanwang Zhang, Zicheng Liu,
recognition, pages 10684–10695, 2022. 3, 4 and Lijuan Wang. Disco: Disentangled control for realistic
[34] Runway. https : / / runwayml . com / research / human dance generation. In Proceedings of the IEEE/CVF
introducing-act-one, 2024. 7, 8 Conference on Computer Vision and Pattern Recognition,
[35] Ruizhi Shao, Youxin Pang, Zerong Zheng, Jingxiang Sun, pages 9326–9336, 2024. 3
and Yebin Liu. Human4dit: 360-degree human video [48] Ting-Chun Wang, Arun Mallya, and Ming-Yu Liu. One-shot
generation with 4d diffusion transformer. arXiv preprint free-view neural talking-head synthesis for video conferenc-
arXiv:2405.17405, 2024. 4 ing. In Proceedings of the IEEE/CVF conference on com-

10
puter vision and pattern recognition, pages 10039–10049, X-nemo: Expressive neural motion reenactment via disen-
2021. 2, 3 tangled latent attention. ICLR 2025, 2025. 2, 3
[49] Xiang Wang, Shiwei Zhang, Changxin Gao, Jiayu Wang, [61] Jiachen Zhou, Mingsi Wang, Tianlin Li, Guozhu Meng, and
Xiaoqiang Zhou, Yingya Zhang, Luxin Yan, and Nong Kai Chen. Dormant: Defending against pose-driven human
Sang. Unianimate: Taming unified video diffusion mod- image animation. arXiv preprint arXiv:2409.14424, 2024. 8
els for consistent human image animation. arXiv preprint [62] Shenhao Zhu, Junming Leo Chen, Zuozhuo Dai, Zilong
arXiv:2406.01188, 2024. 3 Dong, Yinghui Xu, Xun Cao, Yao Yao, Hao Zhu, and Siyu
[50] You Xie, Hongyi Xu, Guoxian Song, Chao Wang, Yichun Zhu. Champ: Controllable and consistent human image an-
Shi, and Linjie Luo. X-portrait: Expressive portrait anima- imation with 3d parametric guidance. In European Confer-
tion with hierarchical motion attention. In ACM SIGGRAPH ence on Computer Vision, pages 145–162. Springer, 2024. 2,
2024 Conference Papers, pages 1–11, 2024. 3, 7, 8 3, 5, 6, 7
[51] Sicheng Xu, Guojun Chen, Yu-Xiao Guo, Jiaolong Yang, [63] Yongming Zhu, Longhao Zhang, Zhengkun Rong, Tianshu
Chong Li, Zhenyu Zang, Yizhong Zhang, Xin Tong, and Hu, Shuang Liang, and Zhipeng Ge. Infp: Audio-driven
Baining Guo. Vasa-1: Lifelike audio-driven talking faces interactive head generation in dyadic conversations. arXiv
generated in real time. Advances in Neural Information Pro- preprint arXiv:2412.04037, 2024. 2
cessing Systems, 37:660–684, 2025. 2
[52] Zhongcong Xu, Chaoyue Song, Guoxian Song, Jianfeng
Zhang, Jun Hao Liew, Hongyi Xu, You Xie, Linjie Luo, Gu-
osheng Lin, Jiashi Feng, et al. High quality human image
animation using regional supervision and motion blur condi-
tion. arXiv preprint arXiv:2409.19580, 2024. 4
[53] Zhongcong Xu, Jianfeng Zhang, Jun Hao Liew, Hanshu Yan,
Jia-Wei Liu, Chenxu Zhang, Jiashi Feng, and Mike Zheng
Shou. Magicanimate: Temporally consistent human im-
age animation using diffusion model. In Proceedings of
the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pages 1481–1490, 2024. 2, 3
[54] Shurong Yang, Huadong Li, Juhao Wu, Minhao Jing, Linze
Li, Renhe Ji, Jiajun Liang, Haoqiang Fan, and Jin Wang.
Megactor-Σ: Unlocking flexible mixed-modal control in
portrait animation with diffusion transformer. arXiv preprint
arXiv:2408.14975, 2024. 3
[55] Sunjae Yoon, Gwanhyeong Koo, Younghwan Lee, and
Chang Yoo. Tpc: Test-time procrustes calibration for
diffusion-based human image animation. Advances in Neu-
ral Information Processing Systems, 37:118654–118677,
2024. 4
[56] Lijun Yu, José Lezama, Nitesh B Gundavarapu, Luca Ver-
sari, Kihyuk Sohn, David Minnen, Yong Cheng, Vighnesh
Birodkar, Agrim Gupta, Xiuye Gu, et al. Language model
beats diffusion–tokenizer is key to visual generation. arXiv
preprint arXiv:2310.05737, 2023. 4
[57] Wangbo Yu, Yanbo Fan, Yong Zhang, Xuan Wang, Fei Yin,
Yunpeng Bai, Yan-Pei Cao, Ying Shan, Yang Wu, Zhongqian
Sun, et al. Nofa: Nerf-based one-shot facial avatar recon-
struction. In ACM SIGGRAPH 2023 conference proceedings,
pages 1–12, 2023. 2
[58] Yuang Zhang, Jiaxi Gu, Li-Wen Wang, Han Wang,
Junqi Cheng, Yuefeng Zhu, and Fangyuan Zou. Mim-
icmotion: High-quality human motion video generation
with confidence-aware pose guidance. arXiv preprint
arXiv:2406.19680, 2024. 2, 3, 6, 7
[59] Jian Zhao and Hui Zhang. Thin-plate spline motion model
for image animation. In Proceedings of the IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition, pages
3657–3666, 2022. 2, 3
[60] Xiaochen Zhao, Hongyi Xu, Guoxian Song, You Xie,
Chenxu Zhang, Xiu Li, Linjie Luo, Jinli Suo, and Yebin Liu.

11

You might also like