This repo combines the work done in two papers:
An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion
Rinon Gal (1,2), Yuval Alaluf (1), Yuval Atzmon (2), Or Patashnik (1), Amit H. Bermano (1), Gal Chechik (2), Daniel Cohen-Or (1)
(1) Tel Aviv University, (2) NVIDIA
This work uses textual inversion to perform personalized generation, as shown on the [Project Website]; this repo is forked from that work's codebase.
Abstract:
Text-to-image models offer unprecedented freedom to guide creation through natural language. Yet, it is unclear how such freedom can be exercised to generate images of specific unique concepts, modify their appearance, or compose them in new roles and novel scenes. In other words, we ask: how can we use language-guided models to turn our cat into a painting, or imagine a new product based on our favorite toy? Here we present a simple approach that allows such creative freedom. Using only 3-5 images of a user-provided concept, like an object or a style, we learn to represent it through new "words" in the embedding space of a frozen text-to-image model. These "words" can be composed into natural language sentences, guiding personalized creation in an intuitive way. Notably, we find evidence that a single word embedding is sufficient for capturing unique and varied concepts. We compare our approach to a wide range of baselines, and demonstrate that it can more faithfully portray the concepts across a range of applications and tasks.
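As a rough illustration of the textual-inversion idea described above (a minimal sketch, not this repo's actual training code; `model`, its helper methods, and `placeholder_embedding` are hypothetical names), the method optimizes a single new embedding vector with the standard denoising loss while the rest of the model stays frozen:

```python
import torch
import torch.nn.functional as F

def textual_inversion_step(model, placeholder_embedding, images, prompts, optimizer):
    """One optimization step for the learned pseudo-word embedding.

    `model` and its helpers (encode_to_latent, q_sample, encode_text, unet,
    num_timesteps) are hypothetical stand-ins for the LDM wrapper in this repo;
    only `placeholder_embedding` is trainable, everything else stays frozen.
    """
    optimizer.zero_grad()
    # Encode the 3-5 concept images to latents and add noise at a random timestep.
    latents = model.encode_to_latent(images)
    noise = torch.randn_like(latents)
    t = torch.randint(0, model.num_timesteps, (latents.shape[0],), device=latents.device)
    noisy_latents = model.q_sample(latents, t, noise)
    # The prompts contain the placeholder token (e.g. "a photo of S*"); its
    # embedding row is the single vector being learned.
    text_cond = model.encode_text(prompts, placeholder_embedding)
    noise_pred = model.unet(noisy_latents, t, text_cond)
    loss = F.mse_loss(noise_pred, noise)
    loss.backward()   # gradients reach only the placeholder embedding
    optimizer.step()
    return loss.item()
```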
Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models
Hila Chefer*, Yuval Alaluf*, Yael Vinker, Lior Wolf, Daniel Cohen-Or
Tel Aviv University | Paper Link
This work tackles the problem of catastrophic neglect (missing subjects or wrongly attributed traits in generated images) by guiding the generation process with the cross-attention maps at inference time. The approach encourages a state where, at each denoising step, the subject token that currently receives the least attention (as determined by the attention layers) is pushed to receive the highest attention weight possible somewhere in the output. The code for this work is available at the [Official Repo].
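As a rough sketch of that objective (hypothetical helper names and simplified shapes, not the exact implementation in either repo), the idea is to look at the cross-attention map of each subject token, find the token whose strongest activation is weakest, and penalize how far that activation is from 1:

```python
import torch

def attend_and_excite_loss(cross_attention_maps, subject_token_indices):
    """Sketch of the Attend-and-Excite objective.

    cross_attention_maps: tensor of shape (num_patches, num_tokens) holding the
        averaged cross-attention weights at the current denoising step.
    subject_token_indices: indices of the prompt tokens that must be attended to.
    """
    max_attention_per_token = []
    for idx in subject_token_indices:
        token_map = cross_attention_maps[:, idx]
        max_attention_per_token.append(token_map.max())
    # The most neglected subject token determines the loss; pushing its maximum
    # attention toward 1 "excites" it during generation.
    weakest = torch.min(torch.stack(max_attention_per_token))
    return torch.clamp(1.0 - weakest, min=0.0)
```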
Abstract:
Recent text-to-image generative models have demonstrated an unparalleled ability to generate diverse and creative imagery guided by a target text prompt. While revolutionary, current state-of-the-art diffusion models may still fail in generating images that fully convey the semantics in the given text prompt. We analyze the publicly available Stable Diffusion model and assess the existence of catastrophic neglect, where the model fails to generate one or more of the subjects from the input prompt. Moreover, we find that in some cases the model also fails to correctly bind attributes (e.g., colors) to their corresponding subjects. To help mitigate these failure cases, we introduce the concept of Generative Semantic Nursing (GSN), where we seek to intervene in the generative process on the fly during inference time to improve the faithfulness of the generated images. Using an attention-based formulation of GSN, dubbed Attend-and-Excite, we guide the model to refine the cross-attention units to attend to all subject tokens in the text prompt and strengthen — or excite — their activations, encouraging the model to generate all subjects described in the text prompt. We compare our approach to alternative approaches and demonstrate that it conveys the desired concepts more faithfully across a range of text prompts.
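The on-the-fly intervention described in the abstract amounts to a gradient step on the noised latent at selected denoising steps. A minimal sketch, assuming the loss sketched above and a hypothetical `get_cross_attention_maps` helper:

```python
import torch

def gsn_update_latent(latent, loss, step_size):
    """One Generative Semantic Nursing update: shift the noised latent so the
    neglected subject tokens receive more attention on the next forward pass."""
    grad = torch.autograd.grad(loss, latent)[0]
    return latent - step_size * grad

# Rough shape of the denoising loop (hypothetical names):
#   latent = latent.detach().requires_grad_(True)
#   noise_pred = unet(latent, t, text_embeddings)   # forward pass records attention
#   maps = get_cross_attention_maps(unet)           # hypothetical helper
#   loss = attend_and_excite_loss(maps, subject_token_indices)
#   latent = gsn_update_latent(latent, loss, step_size)
```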
The Attend-and-Excite code is primarily designed to guide the generation process of latent diffusion models from the diffusers library. The latent diffusion model used in "An Image Is Worth One Word" is structured differently (class attributes, naming, architecture, and code layout) and therefore requires adaptations not only to the code but also to the hyperparameters used in both the sampling process and the Attend-and-Excite pipeline. Compared to the original Textual Inversion repo, the main changes in this repo are in the DDIM and DDPM samplers, the attention-layer implementations in attention.py, the diffusion model implementations in openaimodel.py, and the BERT text encoder (ldm/modules/encoders/modules.py).
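To give a feel for the kind of adaptation this involves (a sketch under assumptions, not the actual diff in this repo), the modified attention layers in ldm/modules/attention.py need to expose their cross-attention probabilities so the Attend-and-Excite guidance can read them after each U-Net forward pass. A simple store that such modified layers write into might look like this:

```python
import torch

class AttentionStore:
    """Sketch of a container that modified attention layers write into so the
    guidance step can read cross-attention maps after each U-Net forward pass
    (the real repo's bookkeeping may differ)."""

    def __init__(self):
        self.step_maps = []

    def __call__(self, attention_probs, is_cross, layer_name):
        # Called from inside each (modified) attention forward pass.
        if is_cross:
            self.step_maps.append((layer_name, attention_probs.detach()))

    def aggregate(self, num_patches):
        """Average the stored maps that match the requested spatial resolution
        (real implementations group per resolution and per head)."""
        maps = [m for _, m in self.step_maps if m.shape[-2] == num_patches]
        return torch.stack(maps).mean(dim=0) if maps else None

    def reset(self):
        self.step_maps = []
```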