Image-Dev: An Advanced Text-to-Image AI Model

International Institute of Information Technology (I²IT), Pune, India. Dec 15-17, 2022
Abstract—In recent years, with the rapid growth of Artificial Intelligence, there has been increasing interest in text-to-image models. High-quality images can be generated with state-of-the-art text-to-image AI models such as Imagen and DALL.E-2, evaluated on benchmarks such as DrawBench. However, these models struggle to generate well-aligned images for conflict categories and small databases. Image-dev is therefore a text-to-image model that blends a TF-IDF (Term Frequency – Inverse Document Frequency) model with a preposition model to evaluate the relations between the data objects. The proposed model outputs images with a high level of artistic finish, and the added level of language understanding and interpretation further enables the model to produce conflict-category images. Image-dev helps users generate high-quality, photorealistic images without any pre-context, building on GANs, VAEs and diffusion models. Image-dev is based on a diffusion model, which is the most relevant choice because of its high-quality and realistic output generation capacity.

Index Terms—DALL.E-2, Diffusion, Imagen, Preposition model, Photorealism, Text-to-Image, TF-IDF

I. INTRODUCTION

Art has always been an integral part of human culture since the first rock paintings. It is how humans express ourselves and tell stories. In recent years, advances have been made in the field of artificial intelligence (AI), and people are exploring the potential of artificial intelligence in various fields, including art. However, understanding and appreciating art is widely regarded as a purely human ability. It is fascinating to watch how bringing AI into the loop not only drives the evolution of digital art and art history, but also inspires our vision for the future of art itself. Recently, with the rapid development of artificial intelligence, interest in models that convert text into an image has grown quickly. High-quality images can be generated using modern text-to-image AI models such as Imagen and DALL.E-2, evaluated on benchmarks such as DrawBench. However, these models struggle to generate well-ordered images for conflicting categories and small databases.

A. Text-To-Image Artificial Intelligence

Transforming natural language text descriptions into images is an amazing demonstration of deep learning. Text classification tasks such as sentiment analysis have been successfully performed with deep recurrent neural networks that can learn discriminant vector representations from text. In other domains, deep convolutional GANs can synthesize images such as bedroom interiors from random noise vectors sampled from a normal distribution. Following Reed et al., conditional GANs work by introducing a one-hot class label vector as input to the generator and discriminator along with a randomly selected noise vector. The result is higher training stability, more visually appealing results and controllable output. The difference between the traditional conditional GAN and the presented text-to-image model lies in the conditional input: instead of generating sparse visual attribute descriptors for GAN processing, the GANs are conditioned on text embeddings obtained with deep neural networks. The most recent artistic approaches created by artificial intelligence have experimented with image data such as photorealistic images and drawings. Currently, some of the most popular AI models are available for image creation.

B. Generative Models

In recent years, generative models based on deep learning have gained more attention due to great improvements in the field. Relying on large amounts of data, well-designed network architectures and smart training methods, deep generative models have shown an incredible ability to generate highly realistic art of various kinds, such as images, texts and sounds. Among these deep generative models, three families stand out and deserve attention: Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs) and diffusion models.

1) Generative Adversarial Network (GAN): Generative Adversarial Networks (GANs) are deep generative models and a practical framework in the class of generative models, based upon zero-sum game theory, and they have become known for their ability to generate photorealistic images. The model aims to capture the data distribution by means of unsupervised learning and to generate more realistic data. Currently, GANs are widely studied and widely used.
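To make the conditioning mechanism described above concrete, the following is a minimal sketch of a GAN generator whose input concatenates a noise vector with a condition vector (a one-hot class label or, for text-to-image models, a text embedding). The layer sizes, dimensions and names are illustrative assumptions, not the architecture used by Image-dev or any of the cited models.

```python
import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    """Toy conditional generator: noise + condition vector -> 64x64 RGB image.

    The condition can be a one-hot class label or a text embedding;
    all sizes below are illustrative only.
    """

    def __init__(self, noise_dim=100, cond_dim=10, img_size=64):
        super().__init__()
        self.img_size = img_size
        self.net = nn.Sequential(
            nn.Linear(noise_dim + cond_dim, 512),
            nn.ReLU(inplace=True),
            nn.Linear(512, 1024),
            nn.ReLU(inplace=True),
            nn.Linear(1024, 3 * img_size * img_size),
            nn.Tanh(),  # pixel values in [-1, 1]
        )

    def forward(self, noise, cond):
        # Conditioning is done by simple concatenation of the two vectors.
        x = torch.cat([noise, cond], dim=1)
        img = self.net(x)
        return img.view(-1, 3, self.img_size, self.img_size)

# Example: one batch of 4 images conditioned on one-hot class labels.
noise = torch.randn(4, 100)
labels = torch.eye(10)[torch.tensor([0, 3, 7, 9])]  # one-hot condition vectors
fake_images = ConditionalGenerator()(noise, labels)
print(fake_images.shape)  # torch.Size([4, 3, 64, 64])
```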
The traditional approach orders terms together based on their ASCII value and fails to consider synonyms which may convey the same meaning. Liu et al. solve this by using the word2vec model to train word vectors on the corpus and obtain similar features. Higher accuracy is observed with the model from this paper than with the TF-IDF algorithm [4].

In the paper 'Generating Diverse Structure for Image In-painting with Hierarchical VQ-VAE', various image in-painting models produce distorted structure or blurry texture in the final image. Peng et al. aim to improve the results of image in-painting by proposing a two-stage model for diverse in-painting: the first stage generates multiple coarse results, whereas the second stage refines the coarse results separately by augmenting texture [1]. 'Modular Generative Adversarial Networks' is another approach, applied in paper [5]. The paper proposes a multi-modular GAN layer which enhances the main generating process by giving inputs at specific times; there is also simultaneous processing of the modular layers. The paper 'Improving Picture Quality with Photo-Realistic Style Transfer' presents style transfer applications for photorealistic image processing tasks, along with a detailed view of how different factors and categories affect the output [6].

Another paper, 'Stacking VAE and GAN for Context-aware Text-to-Image Generation', observes that existing text-to-image generators create images as a whole and fail to look at the foreground and background of the images. Zhang et al. solve this by using a contextual VAE and a conditional GAN in conjunction: CVAEs are employed to divide the picture into foreground and background, whereas CGANs are used for refinement of the results provided in the earlier stage [2]. In [7] the drawbacks of different image-generating models are identified and a new model is proposed which uses a CVAE and a CGAN; on the basis of the text input, a bifurcation is done to identify the optimal model for a particular text input.

In 'Diffusion Models Beat GANs on Image Synthesis', Dhariwal et al. show an improvement over traditional GANs for image synthesis by employing likelihood-based diffusion models with a stationary training objective. Unconditional image synthesis is improved by the creation of a better architecture, and classifier guidance is used to improve conditional image synthesis [3]. In [8] there is a similar approach using a Multi-conditional Fusion GAN to overcome the drawback of current approaches, which are heavyweight models; the paper also reports the implemented model with both multi-stream and single-stream methods.

The paper titled 'TRGAN: Text to Image Generation Through Optimizing Initial Image' addresses the photo-degrading problem with the proposed TRGAN model, which focuses on implementing a joint-attention stacked generation module and a text-generation-in-the-opposite-direction and correction module. The proposed TRGAN model outperforms the GAN model [9].

The DALL.E-2 paper presents representations of images that capture both semantics and style, and it is a step up from the previous model, GLIDE. DALL.E-2 uses a two-stage approach: 1. a CLIP model embeds the image and its text caption; 2. a decoder generates an image conditioned on the image embedding. The paper states the proposed use of a diffusion model for the decoder and experiments with both autoregressive and diffusion priors. Moreover, DALL.E-2 uses a two-step approach in which the CLIP text-to-image embedding model is used to obtain a set of optimal images, which are then trained on along with the original text prompt, producing realistic images. The paper also cites that the CLIP embedding is projected into four extra tokens of context that are concatenated to the output sequence of the GLIDE model. At last, the low-resolution image is passed into an upscaling model to bring it to the standard 1024x1024 resolution [10].

Imagen (Saharia et al.) uses text transformers to encode a language prompt. Imagen uses T5-XXL, which has seen more varied text than just image captions. The paper also compares various models: VQ-GAN+CLIP, latent diffusion, GLIDE and DALL.E-2. In a nutshell, a diffusion model works like this: to prepare training data, a forward diffusion process is run in which an image is taken and more and more noise is added to it, until it looks like pure noise. Then a single model, such as a U-Net, is used to reverse each of these steps, which is the backward diffusion process. Unlike GLIDE, Imagen uses a smaller version of the U-Net to learn the backward diffusion steps, and the paper proposes an improvement to this design. The limitations of the Imagen model are also mentioned: in a comparison between DALL.E-2 and Imagen, both struggle to generate well-aligned images for the conflict category of DrawBench. The present paper aims to propose a solution by using a TF-IDF and preposition layer [11].
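As an illustration of the forward diffusion process described above (progressively adding Gaussian noise until the image is indistinguishable from noise), here is a minimal sketch. The noise schedule, step count and image source are assumptions for demonstration only, not the training setup of Imagen, GLIDE or Image-dev.

```python
import numpy as np

def forward_diffusion(x0, num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Progressively add Gaussian noise to an image x0 (values in [0, 1]).

    Uses the closed form x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps
    with a linear beta schedule, and returns the noisy image at every step.
    """
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)

    noisy = []
    for t in range(num_steps):
        eps = np.random.randn(*x0.shape)           # fresh Gaussian noise
        x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
        noisy.append(x_t)
    return noisy

# Example: a random 64x64 RGB "image"; by the last step it is essentially pure noise.
x0 = np.random.rand(64, 64, 3)
steps = forward_diffusion(x0, num_steps=200)
print(len(steps), steps[-1].mean(), steps[-1].std())
```

The backward diffusion process trains a network (typically a U-Net) to undo these steps one at a time; that part is omitted here.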
III. METHODOLOGY

This paper uses a TF-IDF model along with a preposition model to find the relational data between the object inputs. It is therefore very effective for text phrases where there is a major emphasis on the relations as well. Fig. 1 shows the Image-dev flowchart.

A. Algorithm

To build a text-to-image model, three kinds of models are generally used: GANs, VAEs and diffusion models. Nowadays diffusion models are preferred because of the realistic output they produce. We use a layer of the TF-IDF model to increase the relational interpretation of the word input.

The TF-IDF algorithm calculates how relevant a word is to the content of the document as a whole. Traditionally, TF-IDF comprises two parts, tf and idf:

    tf(t, d) = (count of t in d) / (number of words in document d)    (1)

    idf(t) = log(N / df(t))    (2)

    tfidf(t, d) = tf(t, d) * idf(t)    (3)

Term frequency (tf) in equation (1) represents the number of occurrences of a word in a particular sentence or document, normalized by the document length; it describes the bag of words as a vector so that relevant terms can be ranked. In equations (2) and (3), df(t) and N stand for the document frequency of a particular term and the number of documents in the corpus. Inverse document frequency (idf) prioritizes words that are more specific to a document, i.e. words that appear in fewer documents of the corpus.
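A minimal sketch of how equations (1)-(3) could be computed over a small corpus follows; the toy documents and function names are illustrative assumptions, not the exact implementation used in Image-dev.

```python
import math

def tf(term, doc_tokens):
    # Equation (1): count of the term in the document / number of words in the document.
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus):
    # Equation (2): log(N / df(t)), where df(t) is the number of documents containing the term.
    n_docs = len(corpus)
    df = sum(1 for doc in corpus if term in doc)
    return math.log(n_docs / df) if df else 0.0

def tfidf(term, doc_tokens, corpus):
    # Equation (3): tf * idf.
    return tf(term, doc_tokens) * idf(term, corpus)

# Toy corpus of tokenized prompts (illustrative only).
corpus = [
    "a astronaut looking at a lighthouse on mars".split(),
    "a horse riding on astronaut".split(),
    "a realistic yellow lighthouse by the sea".split(),
]

doc = corpus[0]
for term in sorted(set(doc)):
    print(f"{term:12s} tf={tf(term, doc):.3f} "
          f"idf={idf(term, corpus):.3f} tfidf={tfidf(term, doc, corpus):.3f}")
```

Words shared by every document (such as 'a') get an idf of zero, while prompt-specific words such as 'mars' keep a high weight, which is the behaviour the relational layer relies on.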
B. Model Implementation

Image-dev is based on the earlier VQGAN+CLIP open-source model. Image-dev uses a Python 3 environment along with CUDA acceleration to make use of the GPU. Two base models are involved:

1) Diffusion model
2) VQGAN+CLIP model

The model uses a matrix to determine whether the denoising is moving in the actually desired direction. The TF-IDF and preposition model layer sits between the pre-processing of the input text and the selection of the image, where it vectorizes the words that are relationally useful for the original text. Fig. 2 shows the flowchart of the TF-IDF and prepositional layer.
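A minimal sketch of the kind of vectorization this layer performs is given below: counting prompt terms and pulling out preposition-linked word pairs. It assumes scikit-learn's CountVectorizer for the bag-of-words step; the preposition list and the pairing heuristic are simplified assumptions for illustration and are not the published Image-dev layer.

```python
from sklearn.feature_extraction.text import CountVectorizer

PREPOSITIONS = {"on", "at", "in", "by", "across", "of", "over", "under"}

def prompt_vector(prompt):
    """Bag-of-words count vector of the prompt over a sorted vocabulary."""
    vectorizer = CountVectorizer()  # default tokenizer drops single-character tokens
    counts = vectorizer.fit_transform([prompt])
    return vectorizer.get_feature_names_out(), counts.toarray()[0]

def preposition_relations(prompt):
    """Naive relation extraction: (word before preposition, preposition, word after)."""
    tokens = prompt.lower().split()
    relations = []
    for i, tok in enumerate(tokens):
        if tok in PREPOSITIONS and 0 < i < len(tokens) - 1:
            relations.append((tokens[i - 1], tok, tokens[i + 1]))
    return relations

prompt = "A horse riding on astronaut"
vocab, vec = prompt_vector(prompt)
print(list(vocab), vec.tolist())      # ['astronaut', 'horse', 'on', 'riding'] [1, 1, 1, 1]
print(preposition_relations(prompt))  # [('riding', 'on', 'astronaut')]
```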
Not only are the images for a particular word indexed, but also the image which relates the most to the original input text. As an example of the vectorization layer's output for the first survey prompt, part of the indexed vocabulary is 'color', 'greg', 'its', 'kinkade', 'light', 'lighthouse', 'looking', 'mars', 'of', 'on', 'realistic', 'rutkowski', 'scheme', 'sea', 'shining', 'singular', 'thomas', 'trending', 'tumultuous', 'yellow', with the corresponding count vector:

0,0,1,0,0,0,0,0,1,1,0,0,0,0,1,0,0,0,0,1,0,0,1,0,1,1,1,0,0,1,0,1,1,0,1,0,1,0,2,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,1

• At this stage the diffusion model is ready for implementation.
• All the images are passed through the segmentation of the noise layer; this is a recursive process until all the images are just noise. Afterwards, all the images obtained are combined into a single image and enhanced by adding noise. This step is crucial, as the noise obtained will ultimately impact the overall image the most.
• The diffusion model processes the image further.
• After the implementation of the denoising sub-layer, the desired result which is obtained needs to be related to the input text. The denoising model cannot figure out the parameters of the denoising layer on its own.
• The commonly incorporated technique uses a matrix and a plane, where the final noise image is passed a couple of times through the denoising layer. Traditionally only two points are obtained which determine the output, but in Image-dev a hash map obtained from TF-IDF has been incorporated, which acts as a third stationary parameter to push the denoising process in a more accurate direction.
• After all the points are plotted on the plane, three points are observed: one is the unguided denoising coordinate, the second is the point obtained with the original input text as reference, and lastly the hash-map reference point is obtained.
• The difference of both points from the unguided point is calculated, and the centroid of the two data-reference points is considered; the desired result lies in the direction of this new point but farther away in the plane, indicating the recursive denoising needed (see the sketch after this list).
• Thus the reference is used as a guiding force to denoise the final noise image obtained earlier.
• The photo obtained is in very low resolution, so a further upscaling model is used to increase the resolution of the image. The Zyro open-source model is used for this purpose.
• Results obtained in the survey are used to further tune the parameters to get more realistic images.
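The geometric step described above can be sketched as follows: take the unguided denoising point, the text-guided point and the TF-IDF hash-map point, form the centroid of the two reference points, and step the denoising in that direction. This is only a schematic illustration of the description given in the list; the vectors, step size and names are assumptions, not Image-dev's actual implementation.

```python
import numpy as np

def guided_direction(unguided_pt, text_pt, hashmap_pt, step=1.5):
    """Compute the guidance target described in the list above.

    The two reference points (text-guided and TF-IDF hash-map guided) are averaged
    into a centroid; the desired result lies in the direction from the unguided
    point towards that centroid, but farther along the plane.
    """
    unguided_pt = np.asarray(unguided_pt, dtype=float)
    centroid = (np.asarray(text_pt, dtype=float) + np.asarray(hashmap_pt, dtype=float)) / 2.0
    direction = centroid - unguided_pt
    if np.linalg.norm(direction) == 0:
        return unguided_pt  # nothing to guide towards
    # Move past the centroid ("farther in the plane"), signalling more denoising passes.
    return unguided_pt + step * direction

# Toy 2-D example: the target lies beyond the centroid of the two reference points.
target = guided_direction(unguided_pt=[0.0, 0.0], text_pt=[1.0, 0.5], hashmap_pt=[0.5, 1.0])
print(target)  # [1.125 1.125]
```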
IV. RESULTS

A survey was conducted to gather the results of our implementation. We could not use a conventional algorithm to obtain the results, as that would raise other problems, such as the accuracy of a testing algorithm in mimicking human judgement, so a survey of human test subjects was conducted to avoid these issues. The survey batch consisted of 263 individuals who rated images on the parameters Photorealistic, Artistic and Related context. The constraint was to rate the images out of 100 on every parameter. The subjects voted on two text input prompts. The final results are presented in tabular form, reporting the average percentage obtained from the 263 individuals in the 3 categories.

The two text prompts used were 'A astronaut looking at singular lighthouse on mars, shining its light across a tumultuous sea of blood by greg rutkowski and Thomas kinkade, realistic, Trending on artstation, yellow color scheme' and 'A horse riding on astronaut'. One prompt focuses on the ability to produce accurate results on complex input, and the other is the famous conflict-category prompt which our proposed model tries to solve.

Fig. 3. Output from DALL.E-2 and Image-dev for the first prompt.
TABLE II
SURVEY RESULTS FOR PROMPT 2

Prompt 2            Photorealistic   Artistic   Related context
Image-dev                67%            79%            93%
DALL.E-2 & Imagen        78%            62%            32%
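For reference, a minimal sketch of how the per-category averages reported in the results tables could be computed from the raw ratings; the in-memory layout and the placeholder values are assumptions, since the paper does not describe how the survey responses were stored.

```python
# Each respondent rates one image out of 100 on three parameters.
# The ratings below are made-up placeholders, not the actual survey data.
ratings = [
    {"photorealistic": 70, "artistic": 80, "related_context": 95},
    {"photorealistic": 64, "artistic": 78, "related_context": 91},
    # ... one dict per respondent (263 in the actual survey)
]

def average_scores(responses):
    """Average each category over all respondents, as reported in the results tables."""
    categories = responses[0].keys()
    return {c: sum(r[c] for r in responses) / len(responses) for c in categories}

print(average_scores(ratings))
# e.g. {'photorealistic': 67.0, 'artistic': 79.0, 'related_context': 93.0}
```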
Fig. 4. Output from DALL.E-2 and Imagen for the second prompt, from [11].

Fig. 5. Output from Image-dev for the second prompt.

For the second prompt, Fig. 4 and Fig. 5 show that DALL.E-2 and Imagen clearly struggle to produce images which are close to the given prompt. Referring to TABLE II, Image-dev obtained 93 percent on the third parameter, which showcases its ability to produce images in the conflict category.

With the TF-IDF (Term Frequency – Inverse Document Frequency) model combined with the preposition model, our proposed approach acts as a layer which enhances the output. Our model produces better images in the conflict category, where the latest state-of-the-art models such as DALL.E-2 and Imagen struggle to produce images relevant to the given text input.
V. CONCLUSION

Image-dev is a text-to-image diffusion model with the ability to produce images which are photogenic and have a painterly effect. The proposed layer sits between the text pre-processing layer and the diffusion base model. The proposed model requires substantial computational power to process and traverse the diffusion process. Moreover, the desired image had to be selected from a batch of 150 outputs of the model, so the base models used, Disco Diffusion and VQ-GAN+CLIP, show their limits.

In comparison with DALL.E-2 and Imagen, the suggested model is well suited for the conflict category but lacks in photorealism. Most of the libraries used are based on currently available models; a new iteration using a newer model can produce better results once an advanced model such as DALL.E-2 is adopted.

Further work would be to use threads and sockets, which will produce lower-resolution images but make the model universally accessible through an application. Another approach to the data objects and the relations between them could be implemented using graphs and neural networks. The survey batch size should also be increased for more accurate results.

REFERENCES

[1] J. Peng, D. Liu, S. Xu, and H. Li, "Generating diverse structure for image inpainting with hierarchical VQ-VAE," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10775–10784, 2021.
[2] C. Zhang and Y. Peng, "Stacking VAE and GAN for context-aware text-to-image generation," in 2018 IEEE Fourth International Conference on Multimedia Big Data (BigMM), pp. 1–5, IEEE, 2018.
[3] X. Liu, A. Gherbi, Z. Wei, W. Li, and M. Cheriet, "Multispectral image reconstruction from color images using enhanced variational autoencoder and generative adversarial network," IEEE Access, vol. 9, pp. 1666–1679, 2020.
[4] Q. Liu, J. Wang, D. Zhang, Y. Yang, and N. Wang, "Text features extraction based on TF-IDF associating semantic," in 2018 IEEE 4th International Conference on Computer and Communications (ICCC), pp. 2338–2343, IEEE, 2018.
[5] B. Zhao, B. Chang, Z. Jie, and L. Sigal, "Modular generative adversarial networks," in Proceedings of the European Conference on Computer Vision (ECCV), pp. 150–165, 2018.
[6] I. Makarov, D. Polonskaya, and A. Feygina, "Improving picture quality with photo-realistic style transfer," in International Conference Image Analysis and Recognition, pp. 47–55, Springer, 2018.
[7] H. Tibebu, A. Malik, and V. De Silva, "Text to image synthesis using stacked conditional variational autoencoders and conditional generative adversarial networks," in Science and Information Conference, pp. 560–580, Springer, 2022.
[8] Y. Yang, X. Ni, Y. Hao, C. Liu, W. Wang, Y. Liu, and H. Xie, "MF-GAN: Multi-conditional fusion generative adversarial network for text-to-image synthesis," in International Conference on Multimedia Modeling, pp. 41–53, Springer, 2022.
[9] L. Zhao, X. Li, P. Huang, Z. Chen, Y. Dai, and T. Li, "TRGAN: Text to image generation through optimizing initial image," in International Conference on Neural Information Processing, pp. 651–658, Springer, 2021.
[10] A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen, "Hierarchical text-conditional image generation with CLIP latents," arXiv preprint arXiv:2204.06125, 2022.
[11] C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. Denton, S. K. S. Ghasemipour, B. K. Ayan, S. S. Mahdavi, R. G. Lopes, et al., "Photorealistic text-to-image diffusion models with deep language understanding," arXiv preprint arXiv:2205.11487, 2022.