
t2v inference #31

Closed
xjxu21 opened this issue Dec 19, 2023 · 8 comments


xjxu21 commented Dec 19, 2023

Hi, thanks for sharing the code and model.

I am trying to do some t2v inference with this codebase. I downloaded the t2v model text2video_pytorch_model.pth from ModelScope and modified the yaml config. Then I ran python inference.py --cfg configs/t2v_infer.yaml, but the results look abnormal.

Is this model incompatible with the current codebase? If so, could you please give me a link to the right t2v model?

Thank you.
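
For anyone retracing this step: below is a minimal sketch of pulling that checkpoint through the ModelScope hub API. The model id 'damo/text-to-video-synthesis' (the ModelScope release that ships text2video_pytorch_model.pth) and the destination directory are assumptions, not taken from this repo's docs.

# Hedged sketch: download the ModelScope t2v release, then copy
# text2video_pytorch_model.pth to wherever t2v_infer.yaml expects it.
from modelscope.hub.snapshot_download import snapshot_download

model_dir = snapshot_download('damo/text-to-video-synthesis')  # assumed model id
print(model_dir)  # local directory that contains text2video_pytorch_model.pth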

Steven-SWZhang (Collaborator) commented Dec 19, 2023

There are some differences. You may need to modify the settings in t2v_train.yaml.

Diffusion: {
    'type': 'DiffusionDDIM',
    'schedule': 'linear_sd', # cosine
    'schedule_param': {
        'num_timesteps': 1000,
        'init_beta': 0.00085,
        'last_beta': 0.0120,
        'zero_terminal_snr': False,
    },
    'mean_type': 'eps',
    'loss_type': 'mse',
    'var_type': 'fixed_small',
    'rescale_timesteps': False,
    'noise_strength': 0.0
}

Just replace the Diffusion settings with the block above; I haven't verified it yet, but you can give it a try. Thanks.
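
For context on the schedule parameters above: 'linear_sd' usually denotes the Stable-Diffusion-style scaled-linear beta schedule. A minimal sketch of what init_beta / last_beta expand to under that assumption (check the DiffusionDDIM implementation in this repo if in doubt):

import torch

# Hedged sketch: scaled-linear ("linear_sd") betas with the suggested values.
# With zero_terminal_snr False the terminal alphas_cumprod stays above zero,
# unlike the zero-terminal-SNR schedule used in the default training config.
num_timesteps, init_beta, last_beta = 1000, 0.00085, 0.0120
betas = torch.linspace(init_beta ** 0.5, last_beta ** 0.5, num_timesteps) ** 2
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
print(betas[0].item(), betas[-1].item(), alphas_cumprod[-1].item())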

xjxu21 (Author) commented Dec 19, 2023

It works, thank you!

@justinday123

It doesn't work for me.


khansharkhamnida commented Jan 27, 2024

Hey xjxu21, I have created my own workspace folder, following what is in t2v_train.yaml:

[screenshot: workspace folder]

I have even added the models that were previously not inside:

[screenshot: models directory]

And I have changed the settings according to Steven's suggestion:

[screenshot: modified Diffusion config]

But no luck, it still does not output anything. Do you mind helping me? Thank you in advance.


LuthandoMaqondo commented Feb 23, 2024

Running Inference

How do I resolve this? I've downloaded text2video_pytorch_model.pth and open_clip_pytorch_model:
Exception: Failed to invoke function <function inference_text2video_entrance at 0x7f96796578b0>, with Failed to init class <class 'tools.modules.autoencoder.AutoencoderKL'>, with [Errno 2] No such file or directory: 'models/v2-1_512-ema-pruned.ckpt'
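
In case it helps: the missing file is the Stable Diffusion 2.1 (512-base) checkpoint used by the AutoencoderKL. A minimal sketch of fetching it into models/, assuming the stabilityai release on Hugging Face is the intended source (the repo id is an assumption, not taken from this repo's README):

# Hedged sketch: download v2-1_512-ema-pruned.ckpt into models/ so the
# auto_encoder 'pretrained' path in the config resolves.
from huggingface_hub import hf_hub_download

hf_hub_download(
    repo_id='stabilityai/stable-diffusion-2-1-base',  # assumed source repo
    filename='v2-1_512-ema-pruned.ckpt',
    local_dir='models',
)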

@MHRosenberg

I have what seems like a similar issue with open_clip_pytorch_model. Is there an updated fix? Where do I find these weights? Which yaml files are currently supported and expected to run vs which are for models that have not been released?

/usr/local/lib/python3.10/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: libtorch_cuda_cu.so: cannot open shared object file: No such file or directory
warn(f"Failed to load image Python extension: {e}")
[2024-05-16 23:35:08,863] INFO: {'name': 'Config: VideoLDM Decoder', 'mean': [0.5, 0.5, 0.5], 'std': [0.5, 0.5, 0.5], 'max_words': 1000, 'num_workers': 6, 'prefetch_factor': 2, 'resolution': [448, 256], 'vit_out_dim': 1024, 'vit_resolution': [224, 224], 'depth_clamp': 10.0, 'misc_size': 384, 'depth_std': 20.0, 'frame_lens': [1, 16, 16, 16, 16, 32, 32, 32], 'sample_fps': [1, 8, 16, 16, 16, 8, 16, 16], 'vid_dataset': {'type': 'VideoDataset', 'data_list': ['data/vid_list.txt'], 'max_words': 1000, 'resolution': [448, 256], 'data_dir_list': ['data/videos/'], 'vit_resolution': [224, 224], 'get_first_frame': True}, 'img_dataset': {'type': 'ImageDataset', 'data_list': ['data/img_list.txt'], 'max_words': 1000, 'resolution': [448, 256], 'data_dir_list': ['data/images'], 'vit_resolution': [224, 224]}, 'batch_sizes': {'1': 32, '4': 8, '8': 4, '16': 4, '32': 2}, 'Diffusion': {'type': 'DiffusionDDIM', 'schedule': 'linear_sd', 'schedule_param': {'num_timesteps': 1000, 'init_beta': 0.00085, 'last_beta': 0.012, 'zero_terminal_snr': True}, 'mean_type': 'v', 'loss_type': 'mse', 'var_type': 'fixed_small', 'rescale_timesteps': False, 'noise_strength': 0.1, 'ddim_timesteps': 50}, 'ddim_timesteps': 50, 'use_div_loss': False, 'p_zero': 0.1, 'guide_scale': 9.0, 'vit_mean': [0.48145466, 0.4578275, 0.40821073], 'vit_std': [0.26862954, 0.26130258, 0.27577711], 'sketch_mean': [0.485, 0.456, 0.406], 'sketch_std': [0.229, 0.224, 0.225], 'hist_sigma': 10.0, 'scale_factor': 0.18215, 'use_checkpoint': True, 'use_sharded_ddp': False, 'use_fsdp': False, 'use_fp16': True, 'temporal_attention': True, 'UNet': {'type': 'UNetSD_TFT2V', 'in_dim': 4, 'dim': 320, 'y_dim': 1024, 'context_dim': 1024, 'out_dim': 4, 'dim_mult': [1, 2, 4, 4], 'num_heads': 8, 'head_dim': 64, 'num_res_blocks': 2, 'attn_scales': [1.0, 0.5, 0.25], 'dropout': 0.1, 'temporal_attention': True, 'temporal_attn_times': 1, 'use_checkpoint': True, 'use_fps_condition': False, 'use_sim_mask': False, 'config': 'None', 'num_tokens': 4, 'upper_len': 128, 'default_fps': 8, 'misc_dropout': 0.4}, 'guidances': [], 'auto_encoder': {'type': 'AutoencoderKL', 'ddconfig': {'double_z': True, 'z_channels': 4, 'resolution': 256, 'in_channels': 3, 'out_ch': 3, 'ch': 128, 'ch_mult': [1, 2, 4, 4], 'num_res_blocks': 2, 'attn_resolutions': [], 'dropout': 0.0, 'video_kernel_size': [3, 1, 1]}, 'embed_dim': 4, 'pretrained': 'models/v2-1_512-ema-pruned.ckpt'}, 'embedder': {'type': 'FrozenOpenCLIPTextVisualEmbedder', 'layer': 'penultimate', 'pretrained': 'models/open_clip_pytorch_model.bin', 'vit_resolution': [224, 224]}, 'ema_decay': 0.9999, 'num_steps': 1000000, 'lr': 3e-05, 'weight_decay': 0.0, 'betas': [0.9, 0.999], 'eps': 1e-08, 'chunk_size': 2, 'decoder_bs': 2, 'alpha': 0.7, 'save_ckp_interval': 50, 'warmup_steps': 10, 'decay_mode': 'cosine', 'use_ema': False, 'load_from': None, 'Pretrain': {'type': 'pretrain_specific_strategies', 'fix_weight': False, 'grad_scale': 0.5, 'resume_checkpoint': 'workspace/model_bk/model_scope_0267000.pth', 'sd_keys_path': 'data/stable_diffusion_image_key_temporal_attention_x1.json'}, 'viz_interval': 5, 'visual_train': {'type': 'VisualTrainTextImageToVideo', 'partial_keys': [['y', 'fps']], 'use_offset_noise': False, 'guide_scale': 9.0}, 'visual_inference': {'type': 'VisualGeneratedVideos'}, 'inference_list_path': '', 'log_interval': 1, 'log_dir': 'workspace/experiments/text_list_for_tft2v', 'seed': 888, 'negative_prompt': 'Distorted, discontinuous, Ugly, blurry, low resolution, motionless, static, disfigured, disconnected limbs, Ugly faces, incomplete arms', 
'ENABLE': True, 'DATASET': 'webvid10m', 'TASK_TYPE': 'inference_tft2v_entrance', 'max_frames': 16, 'target_fps': 16, 'scale': 8, 'batch_size': 1, 'use_zero_infer': True, 'round': 1, 'test_list_path': 'data/text_list_for_tft2v.txt', 'vldm_cfg': 'configs/t2v_train.yaml', 'positive_prompt': ', cinematic, High Contrast, highly detailed, no blur, 4k render', 'test_model': 'models/tft2v_t2v_non_ema_512000.pth', 'video_compositions': ['text', 'image'], 'cfg_file': 'configs/tft2v_t2v_infer.yaml', 'init_method': 'tcp://localhost:9999', 'debug': False, 'opts': [], 'pmi_rank': 0, 'pmi_world_size': 1, 'gpus_per_machine': 1, 'world_size': 1, 'noise_strength': 0.1, 'gpu': 0, 'rank': 0, 'log_file': 'workspace/experiments/text_list_for_tft2v/log_00.txt'}
[2024-05-16 23:35:09,826] INFO: Going into inference_text2video_entrance inference on 0 gpu
[2024-05-16 23:35:09,847] INFO: Loading ViT-H-14 model config.
[2024-05-16 23:35:22,084] WARNING: Pretrained weights (models/open_clip_pytorch_model.bin) not found for model ViT-H-14.
[rank0]: Traceback (most recent call last):
[rank0]: File "/content/drive/MyDrive/creative/VGen/utils/registry.py", line 62, in build_from_config
[rank0]: return req_type_entry(**cfg)
[rank0]: File "/content/drive/MyDrive/creative/VGen/tools/modules/clip_embedder.py", line 158, in init
[rank0]: model, _, _ = open_clip.create_model_and_transforms(arch, device=torch.device('cpu'), pretrained=pretrained)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/open_clip/factory.py", line 151, in create_model_and_transforms
[rank0]: model = create_model(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/open_clip/factory.py", line 122, in create_model
[rank0]: raise RuntimeError(f'Pretrained weights ({pretrained}) not found for model {model_name}.')
[rank0]: RuntimeError: Pretrained weights (models/open_clip_pytorch_model.bin) not found for model ViT-H-14.

[rank0]: During handling of the above exception, another exception occurred:

[rank0]: Traceback (most recent call last):
[rank0]: File "/content/drive/MyDrive/creative/VGen/utils/registry.py", line 67, in build_from_config
[rank0]: return req_type_entry(**cfg)
[rank0]: File "/content/drive/MyDrive/creative/VGen/tools/inferences/inference_tft2v_entrance.py", line 74, in inference_tft2v_entrance
[rank0]: worker(0, cfg, cfg_update)
[rank0]: File "/content/drive/MyDrive/creative/VGen/tools/inferences/inference_tft2v_entrance.py", line 139, in worker
[rank0]: clip_encoder = EMBEDDER.build(cfg.embedder)
[rank0]: File "/content/drive/MyDrive/creative/VGen/utils/registry.py", line 107, in build
[rank0]: return self.build_func(*args, **kwargs, registry=self)
[rank0]: File "/content/drive/MyDrive/creative/VGen/utils/registry_class.py", line 7, in build_func
[rank0]: return build_from_config(cfg, registry, **kwargs)
[rank0]: File "/content/drive/MyDrive/creative/VGen/utils/registry.py", line 64, in build_from_config
[rank0]: raise Exception(f"Failed to init class {req_type_entry}, with {e}")
[rank0]: Exception: Failed to init class <class 'tools.modules.clip_embedder.FrozenOpenCLIPTextVisualEmbedder'>, with Pretrained weights (models/open_clip_pytorch_model.bin) not found for model ViT-H-14.

[rank0]: During handling of the above exception, another exception occurred:

[rank0]: Traceback (most recent call last):
[rank0]: File "/content/drive/MyDrive/creative/VGen/inference.py", line 18, in <module>
[rank0]: INFER_ENGINE.build(dict(type=cfg_update.TASK_TYPE), cfg_update=cfg_update.cfg_dict)
[rank0]: File "/content/drive/MyDrive/creative/VGen/utils/registry.py", line 107, in build
[rank0]: return self.build_func(*args, **kwargs, registry=self)
[rank0]: File "/content/drive/MyDrive/creative/VGen/utils/registry_class.py", line 7, in build_func
[rank0]: return build_from_config(cfg, registry, **kwargs)
[rank0]: File "/content/drive/MyDrive/creative/VGen/utils/registry.py", line 69, in build_from_config
[rank0]: raise Exception(f"Failed to invoke function {req_type_entry}, with {e}")
[rank0]: Exception: Failed to invoke function <function inference_tft2v_entrance at 0x7b6d423d5ea0>, with Failed to init class <class 'tools.modules.clip_embedder.FrozenOpenCLIPTextVisualEmbedder'>, with Pretrained weights (models/open_clip_pytorch_model.bin) not found for model ViT-H-14.
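
The failing piece here is the OpenCLIP ViT-H-14 checkpoint that the embedder config points at. A minimal sketch of fetching it into models/, assuming the LAION-2B OpenCLIP release on Hugging Face is the intended source (the repo id is an assumption, not taken from this repo's docs):

# Hedged sketch: download open_clip_pytorch_model.bin into models/ so the
# FrozenOpenCLIPTextVisualEmbedder 'pretrained' path resolves.
from huggingface_hub import hf_hub_download

hf_hub_download(
    repo_id='laion/CLIP-ViT-H-14-laion2B-s32B-b79K',  # assumed source repo
    filename='open_clip_pytorch_model.bin',
    local_dir='models',
)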


zshyang commented Jul 8, 2024

(quoting @MHRosenberg's comment and traceback above)

Have you solved this issue?


lky-ang commented Aug 19, 2024

(quoting @khansharkhamnida's comment above)

Diffusion: {
    'type': 'DiffusionDDIM',
    'schedule': 'linear_sd', # cosine
    'schedule_param': {
        'num_timesteps': 1000,
        'init_beta': 0.00085,
        'last_beta': 0.0120,
        'zero_terminal_snr': True,
    },
    'mean_type': 'v',
    'loss_type': 'mse',
    'var_type': 'fixed_small',
    'rescale_timesteps': False,
    'noise_strength': 0.1
}
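
For what it's worth, this block still differs from the settings Steven-SWZhang suggested earlier in the thread; a minimal sketch that diffs the two (values copied from the comments above), in case that is what keeps the ModelScope checkpoint from working:

# Hedged sketch: compare the fields that differ between the two Diffusion
# configs posted in this issue.
suggested = {'zero_terminal_snr': False, 'mean_type': 'eps', 'noise_strength': 0.0}
posted = {'zero_terminal_snr': True, 'mean_type': 'v', 'noise_strength': 0.1}
for key in suggested:
    if suggested[key] != posted[key]:
        print(f'{key}: suggested={suggested[key]!r}, posted={posted[key]!r}')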
