
video2scenario

First, the input video is split into a tree-like folder dataset with L parts at the lowest level, as sketched below.
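A minimal sketch of how such a tree could be built with ffmpeg, assuming a fixed branching factor and a `part_i` folder layout; the helper name `split_video` and the exact layout are illustrative, not the repository's actual code:

```python
import subprocess
from pathlib import Path

def split_video(src: Path, out_dir: Path, parts: int, depth: int) -> None:
    """Recursively split `src` into `parts` equal clips, `depth` levels deep."""
    if depth == 0:
        return
    # Probe the clip duration in seconds with ffprobe.
    duration = float(subprocess.check_output(
        ["ffprobe", "-v", "error", "-show_entries", "format=duration",
         "-of", "default=noprint_wrappers=1:nokey=1", str(src)]))
    chunk = duration / parts
    for i in range(parts):
        part_dir = out_dir / f"part_{i}"
        part_dir.mkdir(parents=True, exist_ok=True)
        clip = part_dir / "clip.mp4"
        # Stream-copy one chunk; -ss before -i seeks to the nearest keyframe.
        subprocess.run(
            ["ffmpeg", "-y", "-ss", str(i * chunk), "-t", str(chunk),
             "-i", str(src), "-c", "copy", str(clip)], check=True)
        split_video(clip, part_dir, parts, depth - 1)
```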

Then the tool recursively writes descriptions of the scenes with Large Language Models and Image Captioning Models.

The lowest-level clips are captioned with VideoLLaVA.
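A captioning sketch following the Hugging Face `transformers` Video-LLaVA integration (the repo may load and prompt the model differently); the 8-frame sampling and the prompt wording are assumptions:

```python
import av
import numpy as np
from transformers import VideoLlavaProcessor, VideoLlavaForConditionalGeneration

MODEL_ID = "LanguageBind/Video-LLaVA-7B-hf"
processor = VideoLlavaProcessor.from_pretrained(MODEL_ID)
# device_map="auto" requires the accelerate package.
model = VideoLlavaForConditionalGeneration.from_pretrained(MODEL_ID, device_map="auto")

def sample_frames(path: str, n: int = 8) -> np.ndarray:
    """Uniformly sample n RGB frames from a video file."""
    container = av.open(path)
    total = container.streams.video[0].frames
    wanted = set(np.linspace(0, total - 1, n).astype(int).tolist())
    frames = [f.to_ndarray(format="rgb24")
              for i, f in enumerate(container.decode(video=0)) if i in wanted]
    return np.stack(frames)

prompt = "USER: <video>\nDescribe what happens in this clip. ASSISTANT:"
inputs = processor(text=prompt, videos=sample_frames("clip.mp4"),
                   return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=120)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```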

The descriptions are gathered into a list, and the LLM is asked to describe the overall scene. The process then continues until the top level is reached.

Any OpenAI-like text completion model can be used for this step. In my tests, Oobabooga's text-generation-webui serves as the API endpoint.
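One summarization step might look like the sketch below, posting to an OpenAI-compatible completion endpoint (text-generation-webui exposes one when started with its API enabled); the URL, port, and prompt wording here are assumptions:

```python
import requests

ENDPOINT = "http://127.0.0.1:5000/v1/completions"  # assumed local webui address

def summarize(child_descriptions: list[str]) -> str:
    """Ask the LLM to merge the children's descriptions into one scene description."""
    prompt = ("The following are descriptions of consecutive sub-clips of one scene:\n"
              + "\n".join(f"{i + 1}. {d}" for i, d in enumerate(child_descriptions))
              + "\nDescribe the overall scene in one paragraph:")
    resp = requests.post(ENDPOINT, json={"prompt": prompt,
                                         "max_tokens": 200,
                                         "temperature": 0.7},
                         timeout=120)
    resp.raise_for_status()
    return resp.json()["choices"][0]["text"].strip()
```

Applied bottom-up over the folder tree, this yields one description per node until the root is reached.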

Users can also provide a master prompt to guide the model and edit the resulting descriptions in a Gradio demo interface.

There is also an option to store the corrected output for further fine-tuning of the models, for example with a LoRA.
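For example, each human correction could be appended to a JSONL file of (prompt, raw output, corrected output) records; the file name and field names below are illustrative:

```python
import json

def log_correction(prompt: str, raw: str, corrected: str,
                   path: str = "corrections.jsonl") -> None:
    """Append one training record for later LoRA fine-tuning."""
    record = {"prompt": prompt, "raw": raw, "corrected": corrected}
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```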

The Gradio interface has a dropdown to select each description/clip pair at each level; a minimal sketch follows.
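This sketch assumes a flat dict that maps "level/part" keys to (clip path, description) pairs; the real demo likely exposes more controls:

```python
import gradio as gr

# Hypothetical in-memory store: "level/part" -> (clip path, description).
pairs = {"level1/part_0": ("dataset/part_0/clip.mp4", "A man walks into a room.")}

def load(key):
    clip, text = pairs[key]
    return clip, text

def save(key, text):
    clip, _ = pairs[key]
    pairs[key] = (clip, text)
    return f"Saved {key}"

with gr.Blocks() as demo:
    key = gr.Dropdown(choices=list(pairs), label="Description/clip pair")
    clip = gr.Video()
    text = gr.Textbox(label="Description", lines=4)
    status = gr.Markdown()
    key.change(load, key, [clip, text])
    gr.Button("Save correction").click(save, [key, text], status)

demo.launch()
```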


The goal of this subproject is to build a DiffusionOverDiffusion dataset to train InfiNet and future complex script-based text2video models with minimal human labeling effort.
