
2022 IEEE Pune Section International Conference (PuneCon)

International Institute of Information Technology (I²IT), Pune, India. Dec 15-17, 2022

Image-dev: An Advanced Text-to-Image AI Model


1st Manavkumar Patel 2nd Prof. Sonal Fatangare 3rd Aryaman Nasare
Dept. of Computer Science (RMDSSOE) Dept. of Computer Science (RMDSSOE) Dept. of Computer Science (RMDSSOE)
Savitribai Phule Pune University Savitribai Phule Pune University Savitribai Phule Pune University
Pune, India Pune, India Pune, India
manavp347@gmail.com sonalfatangare.rmdssoe@sinhgad.edu aryaman1307@gmail.com

4th Abhijeet Pachpute


Dept. of Computer Science (RMDSSOE)
Savitribai Phule Pune University
Pune, India
abhishek.ebapy19@sinhgad.edu
978-1-6654-9897-5/22/$31.00 ©2022 IEEE | DOI: 10.1109/PUNECON55413.2022.10014718

Abstract—In recent years, with the rapid growth of Artificial Intelligence, there has been increasing interest in Text-to-Image models. High-quality images can be generated with state-of-the-art text-to-image AI models such as Imagen, DALL.E-2 and DrawBench. However, these models struggle to generate well-aligned images for conflict categories and low databases. Therefore, Image-dev is a Text-To-Image model that blends a TF-IDF (Term Frequency – Inverse Document Frequency) model with a preposition model to evaluate the relations between the data objects. The proposed model's output images have an unparalleled level of artistic finish, and an added level of language understanding and interpretation further enables the model to produce conflict-category images. Image-dev helps users generate high-quality, photorealistic images without any pre-context, building on GANs, VAEs and diffusion models. Image-dev is based on the diffusion model, which is the most relevant choice because of its high-quality, realistic output generation capacity.

Index Terms—DALL.E-2, Diffusion, Imagen, Preposition model, Photorealism, Text-to-Image, TF-IDF

I. INTRODUCTION

Art has always been an integral part of human culture since the first rock paintings. It is how humans express themselves and tell stories. In recent years, advances have been made in the field of artificial intelligence (AI), and people are exploring the potential of artificial intelligence in various fields, including art. However, understanding and appreciating art is widely regarded as a purely human ability. It is fascinating to watch how bringing AI into the loop not only drives the evolution of digital art and art history, but also inspires our vision for the future of art itself. Recently, with the rapid development of artificial intelligence, interest in models that convert text into an image has grown rapidly. High-quality images can be generated using modern text-to-image AI models such as Imagen, DALL.E-2 and DrawBench. However, these models struggle to generate well-ordered images for conflicting categories and low databases.

A. Text-To-Image Artificial Intelligence

Transforming natural language text descriptions into images is an amazing demonstration of deep learning. Text classification tasks such as sentiment analysis have been successfully performed with deep recurrent neural networks that can learn discriminant vector representations from text. In other domains, deep convolutional GANs can synthesize images, such as bedroom interiors, from random noise vectors sampled from a normal distribution. The conditional GANs of Reed et al. work by introducing a one-hot class label vector as input to the generator and discriminator along with a randomly selected noise vector. The result is higher stability, more visually appealing results and more controllable output. The difference between the traditional conditional GAN and the presented text-to-image model lies in the conditional input: instead of generating sparse visual attribute descriptors for GAN processing, the GANs are conditioned on text embeddings obtained with deep neural networks. The most recent artistic approaches created by artificial intelligence have experimented with image data such as photorealistic images and drawings. Currently, some of the most popular AI models are available for image creation.

B. Generative Models

In recent years, generative models based on deep learning have gained more attention due to great improvements in the field. Building on large amounts of data, well-designed network architectures and smart training methods, deep generative models have shown an incredible ability to generate highly realistic art of various kinds, such as images, texts and sounds. Among these deep generative models, three major families stand out and deserve attention: Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs) and Diffusion Models.

1) Generative Adversarial Network (GAN): A Generative Adversarial Network (GAN) is a deep generative model and a practical framework in the class of generative models, based on zero-sum game theory, and has become known for its ability to generate photorealistic images. The purpose of the model's diversity is to capture the data distribution by means of unsupervised learning and to generate more realistic data. Currently, GANs are widely studied and used due to their outstanding data generation capacity, with enormous application prospects in image and vision computing as well as language processing. GANs use the adversarial network to make realistic images from a source text. The goal of Image-dev is to minimize the distances between the generated images and the original ones and to produce images that are closer to reality. These models can be trained on many different types of training data [1].


2) Variational Autoencoder (VAE): A Variational Autoencoder can be described as an autoencoder whose training is regularized to avoid overfitting and to ensure that the latent space has good properties enabling a generative process. Dimensionality reduction is the procedure used to minimize the number of attributes that describe a piece of data, by selecting only a subset of the initial attributes or combining them into a reduced number of new attributes; thus, it can be seen as an encoding process. Because of overfitting, the latent space of a plain autoencoder can be extremely irregular: close points in latent space can produce very different decoded data, and some points of the latent space can give incoherent content after decoding. Therefore, it cannot really define a generative process that simply consists of sampling a point from the latent space and carrying it through the decoder to obtain new data. Variational Autoencoders (VAEs) are autoencoders that address the problem of latent-space irregularity by making the encoder return a distribution over the latent space instead of a single point, and by including in the loss function a regularization term over that returned distribution, so as to ensure a better-organized latent space [2].

3) Diffusion Model: A diffusion model is a generative model; in other words, it is used to generate data similar to the data it was trained on. Basically, diffusion models work by destroying the training data through the successive addition of Gaussian noise, and then learning to recover the data by reversing this noising process. Once trained, data can be generated by running randomly sampled noise through the learned denoising process [3].

In addition to advanced image quality, diffusion models have many other advantages, including that they do not require adversarial training. If a consistent alternative with similar performance and learning efficiency exists, it is usually a good idea to use it.
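The base models used later in this paper implement the denoiser with trained U-Nets; purely as a toy illustration of the destroy-then-recover principle (everything here, including the stand-in denoiser, is our own sketch, not the paper's code), the two processes can be written in a few lines of NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = np.linspace(-1.0, 1.0, 8)  # a toy 1-D "image"

# Forward diffusion: destroy the data by repeatedly mixing in Gaussian noise.
T, beta = 50, 0.05
x = x0.copy()
for _ in range(T):
    x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * rng.normal(size=x.shape)
# After T steps, x is close to pure N(0, 1) noise.

# Reverse process: a trained network would predict what to remove at each
# step from (x_t, t) alone; as a stand-in we nudge the sample toward x0.
def toy_denoiser(x_t, t):
    return x0

for t in reversed(range(T)):
    x = x + beta * (toy_denoiser(x, t) - x)  # small step toward the estimate

print(np.round(np.abs(x - x0).max(), 3))  # residual error of the toy recovery
```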
C. TF-IDF (Term Frequency – Inverse Document Frequency)

TF-IDF is an abbreviation of "Term Frequency – Inverse Document Frequency", a method of quantifying words in a set of documents. Usually, a score is calculated for each word to indicate its importance in the documents and the corpus. This method is widely used for information retrieval and text analysis. Take an example sentence like "This building is too tall". Sentences are easy for us to understand because the meanings of the words and sentences are known. But how can a program written in Python interpret this sentence? Any programming language finds it easier to process textual data as numeric values. Therefore, all text should be vectorized so that it can be better represented [4].

1) Term Frequency: The counts of all the vocabulary terms and the length of the document are needed to compute TF. In case a term does not exist in a specific document, the final TF value for that document will be 0. In the extreme case where all the terms in the document are the same, TF will be 1. The final normalized TF value thus lies in the range [0, 1], 0 and 1 inclusive. TF is individual to each document and word; therefore, as in equation (1):

tf(t, d) = (count of t in d) / (number of words in document)    (1)

2) Document Frequency: DF measures the importance of a document in the entire corpus. It is very similar to TF, except that TF is the frequency count of term t in document d, whereas DF is the number of occurrences of term t across the document set N. In other words, DF is the number of documents in which the word occurs. If the term appears at least once in a document, it counts as one occurrence; the number of times the term appears within that document need not be known:

df(t) = occurrence of t in N documents

To maintain the range, normalization is done by dividing by the total number of documents. The main goal is to find the informational content of a term, and DF is the exact reciprocal of that; this is why DF is inverted.

3) Inverse Document Frequency: IDF is the inverse document frequency, which measures the informativeness of a term t. When calculating IDF, the value will be very low for words that occur most frequently, such as stopwords (because they exist in almost all documents, and N/df gives such a word a very low value). This finally provides the required relative weights.

IDF has another problem: IDF values explode when the corpus size is large (e.g., N = 10000). Therefore, to dampen the effect, the logarithm of the IDF is used. If a word is not in the vocabulary when queried, it is simply ignored. However, in some cases a fixed dictionary is used and some dictionary words may be missing from the document; in this case, DF will be 0. Since division by zero is not possible, 1 is added to the denominator to smooth the values:

idf(t) = log(N / (df + 1))

Finally, taking the multiplicative product of TF and IDF, the TF-IDF score is obtained.

D. Preposition Model

The preposition model is a directional relation map used to highlight the relations between the data objects in the model. It allows checking whether an object is inside or outside the main object. This is especially useful when the primary model has a one-to-one relationship and the data model cannot; for example, in a table the relationship is a many-to-many relationship, so it can be represented as an entity table to show the related data.
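A minimal from-scratch sketch of the quantities defined above (the function names and the toy corpus are ours; the +1 smoothing in the IDF denominator follows the text):

```python
import math

def tf(t: str, d: list[str]) -> float:
    # Equation (1): count of t in d over the number of words in the document.
    return d.count(t) / len(d)

def df(t: str, docs: list[list[str]]) -> int:
    # Number of documents in which the term occurs at least once.
    return sum(1 for d in docs if t in d)

def idf(t: str, docs: list[list[str]]) -> float:
    # Log-dampened IDF with the +1 smoothing described above.
    return math.log(len(docs) / (df(t, docs) + 1))

def tf_idf(t: str, d: list[str], docs: list[list[str]]) -> float:
    # The TF-IDF score is the product of the two quantities.
    return tf(t, d) * idf(t, docs)

docs = [["this", "building", "is", "too", "tall"],
        ["this", "tower", "is", "tall"],
        ["the", "sea", "is", "calm"]]
print(round(tf_idf("building", docs[0], docs), 4))  # 0.0811
```

Here "building" appears in only one of the three documents, so it keeps a non-zero weight, while a word shared by every document is pushed toward zero.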
II. RELATED WORK

In 'Text Features Extraction based on TF-IDF Associating Semantic', the TF-IDF algorithm is based on the statistics received by the system from the input. The algorithm groups words together based on their ASCII value and fails to consider synonyms which may convey the same meaning. Liu et al. solve this with the word2vec model, training the word vectors on the corpus to obtain similar features. Higher accuracy is observed with the model from this paper than with the plain TF-IDF algorithm [4].

In the paper 'Generating Diverse Structure for Image In-painting with Hierarchical VQ-VAE', various image in-painting models are shown to produce distorted structure or blurry texture in the final image. Peng et al. aim to improve the results of image in-painting by proposing a two-stage model for diverse in-painting: the first stage generates multiple coarse results, whereas the second stage refines each coarse result separately by augmenting texture [1]. 'Modular Generative Adversarial Networks' is another approach, applied in paper [5]. The paper proposes a multi-modular GAN layer which enhances the main generating process by giving inputs at specific times; there is also a simultaneous processing index of the modular layers.

The paper 'Improving Picture Quality with Photo-Realistic Style Transfer' presents style transfer applications for photorealistic image processing tasks, along with a detailed view of how different factors and categories affect the output [6].

In another paper, 'Stacking VAE and GAN for Context-aware Text-to-Image Generation', existing text-to-image generators are shown to create images as a whole and fail to consider the foreground and background of the images separately. Zhang et al. solve this by using a Contextual VAE and a Conditional GAN in conjunction: CVAEs are employed to divide the picture into foreground and background, whereas CGANs are used for refinement of the results provided by the earlier stage [2]. In [7], the drawbacks of different image-generating models are identified and a new model is proposed which uses a CVAE and a CGAN; on the basis of the text input, a bifurcation is done to identify the optimal model for a particular text input.

In 'Diffusion Models Beat GANs on Image Synthesis', Prafulla et al. show an improvement over traditional GANs for image synthesis by employing likelihood-based diffusion models with a stationary training objective. Unconditional image synthesis is improved by the creation of a better architecture, and classifier guidance is used to improve conditional image synthesis [3]. In [8] there is a similar approach using a Multi-conditional Fusion GAN to overcome the drawback of current approaches, which are heavyweight models; the paper also evaluates the implemented model with both multi-stream and single-stream methods.

The paper titled 'TRGAN: Text to Image Generation Through Optimizing Initial Image' solves the photo-degrading problem with the proposed TRGAN model, which focuses on implementing a joint-attention stacked generation module and a reverse-direction text generation and correction module. The proposed TRGAN model outperforms the GAN model [9].

The DALL.E-2 paper cites a representation of images that captures both semantics and style, and is a step up from the previous model, GLIDE. DALL.E-2 uses a two-stage approach: 1. a CLIP model embeds the image-text caption; 2. a decoder generates an image conditioned on the image embedding. The paper describes the proposed implementation of a diffusion model for the decoder and experiments with both autoregressive and diffusion models. Moreover, DALL.E-2 uses a two-step approach which uses the CLIP image-embedding text-to-image model to obtain a set of optimal images and then trains on this dataset along with the original text prompt, producing realistic images. The paper also notes that the CLIP embedding is projected into four extra tokens of context that are concatenated to the output sequence of the GLIDE model. At last, the low-resolution image is passed into an upscaling model to bring the resolution to the standard 1024x1024 [10].

Imagen (Saharia et al.) uses text transformers to encode a language prompt. Imagen uses T5-XXL, which has seen more variation than just image captions. This paper also compares various models: VQ-GAN+CLIP, latent diffusion, GLIDE and DALL.E-2. In a nutshell, a diffusion model works like this: to prepare training data, a forward diffusion process is run, where an image is taken and more and more noise is added to it until it looks like pure noise; then a single model, like a U-Net, is used to reverse each of these steps, which is the backward diffusion process. Unlike GLIDE, Imagen uses a smaller version of the U-Net to learn the backward diffusion steps, and the paper proposes an improvement to this design. The limitations of the Imagen model are also mentioned: in a comparison between DALL.E-2 and Imagen, both struggle to generate well-aligned images for the conflict category of DrawBench. Our paper aims to propose a solution by using a TF-IDF and preposition layer [11].

III. METHODOLOGY

This paper uses a TF-IDF model along with a prepositions model to find the relational data between the object inputs. Hence it is very efficient for text phrases where there is major emphasis on the relations as well. Fig. 1 shows the Image-dev flowchart.
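The paper does not spell out how the prepositions model extracts these relations; as a rough, rule-based illustration (the function, the small preposition list and the output format are all assumptions of ours), one pass over a prompt might look like:

```python
# Illustrative only: a toy pass that scans a prompt for prepositions and
# records (left word, preposition, right word) triples as a relation map.
PREPOSITIONS = {"on", "in", "at", "across", "over", "under", "by"}  # assumed subset

def extract_relations(prompt: str) -> list[tuple[str, str, str]]:
    words = prompt.lower().replace(",", " ").split()
    triples = []
    for i, w in enumerate(words):
        if w in PREPOSITIONS and 0 < i < len(words) - 1:
            # crude heads: the nearest words on either side of the preposition
            triples.append((words[i - 1], w, words[i + 1]))
    return triples

print(extract_relations("An astronaut looking at a singular lighthouse on mars"))
# [('looking', 'at', 'a'), ('lighthouse', 'on', 'mars')]
```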
A. Algorithm

To make a text-to-image model, there are generally three families of models used: GANs, VAEs and diffusion models. Nowadays diffusion models are preferred because of the realistic output obtained. We use a layer of the TF-IDF model to increase the relational interpretation of the word input.

The TF-IDF algorithm calculates how relevant a word is to the content of the document as a whole. Traditionally, TF-IDF again comprises two parts, (tf) and (idf):

tf(t, d) = (count of t in d) / (number of words in document)    (1)

Term frequency (tf) represents the total number of occurrences of a word in a particular sentence or document. This ensures that relevant words rationalize the ordering of terms, using a vector to describe the bag of words. In equations (2) and (3), the terms df(t) and N stand for the document frequency of a particular term and the number of documents in the corpus.

In equations (2) and (3), inverse document frequency (idf) mainly prioritizes locally more appropriate words which are not that frequent and would otherwise be discarded by the term frequency alone. This ensures that the raw frequency of words is not the deciding factor; rather, the relational part between the words is also preserved. Thus, the more frequent a word is across the corpus, the less importance it preserves for the context.

idf(t) = N / df(t) = N / N(t)    (2)

idf(t) = log(N / df(t))    (3)

Together, as equation (4), term frequency-inverse document frequency is one of the best metrics to identify the significance of words in a particular context: it not only highlights a particular subject but also preserves the relational value it brings.

tf-idf(t, d) = tf(t, d) * idf(t)    (4)
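As a quick numeric check of equations (1)-(4): in a corpus of N = 3 documents, a term that appears in every document has idf(t) = log(3/3) = 0, so its tf-idf score is 0 no matter how large tf is, while a term appearing in only one document has idf(t) = log(3/1) ≈ 1.10; with tf(t, d) = 0.2, this yields a tf-idf score of about 0.22.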
B. Model implementation

Image-dev is based on the previous VQGAN+CLIP source model. Image-dev uses a Python 3 environment along with CUDA acceleration to make use of the GPU. Its two building blocks are:

1) Diffusion Model
2) VQGAN+CLIP Model

The model uses a matrix which is used to determine whether the denoising is proceeding in the actually desired direction. The TF-IDF and prepositions model layer sits between the pre-processing of the input text and the selection of images, where it vectorizes the words which are relationally useful for the original text. Fig. 2 shows the flowchart of the TF-IDF and prepositional layer.

Fig. 1. Flowchart of Image-dev.

Fig. 2. Flowchart of TF-IDF and prepositional layer.

At first the model takes input from the user.
• At this point the input is passed to the TF-IDF layer, which vectorizes it and finds the relational indexing between the words, putting it into a hash map for further reference. The basic approach of splitting the original sentence input into different documents, enabling the model to find the value vector of each word, is used.
• Then, with the use of the vector values of every word, Image-dev finds relevant images from the database.
• Some images have a tendency to reside outside the database. This limitation is solved with the help of hash maps: using the vectorization of every word previously obtained, the images are procured in different formats of the input text. The main point to be considered is that not only the images of a particular word are indexed, but also the image which relates the most to the original text input.
• At this stage the diffusion model is up for implementation.
• All the images are passed through the segmentation of the noise layer; this is a recursive process until all the images are just noise. Afterwards, all the images obtained are combined into a single image and enhanced by adding noise. This step is crucial, as the noise obtained will finally impact the overall image the most.
• The diffusion model processes the image further.
• After the implementation of the denoising sub-layer, the desired result which is obtained needs to be related to the input text. The denoising model cannot figure out the parameters of the denoising layer on its own.
• The commonly incorporated technique uses a matrix and plane where the final noise image is passed a couple of times through the denoising layer. Traditionally only two points are obtained which determine the output, but in Image-dev a hash map obtained by TF-IDF has been incorporated, which acts as a third stationary parameter to push the denoising process in the more accurate direction.
• After all the points are plotted on the plane, three points are observed: one is the unguided denoising coordinate, the second is the point obtained with the original input text as reference and, lastly, the hash-map reference point.
• The difference of both points from the unguided point is calculated, and the centroid of the two data-reference points is considered; the desired result lies in the direction of this new point, but farther away in the plane, thus indicating the recursive denoising still needed.
• Thus the reference is used as a guiding force to denoise the final noise image obtained earlier.
• The photo obtained is in very low resolution, so a further upscaling model is used to increase the resolution of the image. The Zyro open-source model is used for this purpose.
• Results obtained in the survey are used to further enhance the parameters to get more realistic images.
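The last few bullets describe the guidance geometrically but give no formulas; the following NumPy sketch is one plausible reading (all values and names are ours, not the paper's code), treating each reference as a point in the denoiser's plane:

```python
import numpy as np

unguided = np.array([0.10, 0.40])    # coordinate of an unguided denoise step
text_pt = np.array([0.80, 0.55])     # point conditioned on the original prompt
hashmap_pt = np.array([0.70, 0.90])  # third point from the TF-IDF hash map

# Differences of both data-driven points from the unguided coordinate; their
# mean points toward the centroid of the two references.
offset_text = text_pt - unguided
offset_hash = hashmap_pt - unguided
direction = (offset_text + offset_hash) / 2.0

step = 0.5  # assumed step size per denoising pass
guided = unguided + step * direction
# A large remaining distance signals that more recursive denoising is needed.
print(guided, round(float(np.linalg.norm(direction)), 3))
```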

Output of the TF-IDF and prepositional layer:

E.g.: 'A astronaut looking at singular lighthouse on mars, shining its light across a tumultuous sea of blood by greg rutkowski and Thomas kinkade, realistic, Trending on artstation, yellow color scheme'.

[0.0, 0.0, 0.0, 0.0, 0.0, 0.0682, 0.0682, 0.0252, 0.0, 0.0252, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0682, 0.0, 0.0, 0.0682, 0.0682, 0.0, 0.0, 0.0318, 0.0318, 0.0, 0.0318, 0.0, 0.0, 0.0117, 0.0318, 0.0117, 0.0, 0.0318, 0.0, 0.0318, 0.0318, 0.0318, 0.0318, 0.0318, 0.0, 0.0318, 0.0, 0.0, 0.0318, 0.0318, 0.0, 0.0, 0.0, 0.0682, 0.0, 0.0, 0.0682, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0682, 0.0, 0.0682, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0682, 0.0, 0.0682, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0682]

'across', 'and', 'artstation', 'astronaut', 'at', 'blood', 'by', 'color', 'greg', 'its', 'kinkade', 'light', 'lighthouse', 'looking', 'mars', 'of', 'on', 'realistic', 'rutkowski', 'scheme', 'sea', 'shining', 'singular', 'thomas', 'trending', 'tumultuous', 'yellow'

0,0,1,0,0,0,0,0,1,1,0,0,0,0,1,0,0,0,0,1,0,0,1,0,1,1,1,0,0,1,0,1,1,0,1,0,1,0,2,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,1
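For reference, the sorted vocabulary and a count vector of this kind can be reproduced with scikit-learn's CountVectorizer (a sketch, not the paper's code; the paper's printed vectors are longer, suggesting its vocabulary is built over more sub-documents, and the exact TF-IDF weights depend on the normalization used):

```python
from sklearn.feature_extraction.text import CountVectorizer

prompt = ("A astronaut looking at singular lighthouse on mars, shining its "
          "light across a tumultuous sea of blood by greg rutkowski and "
          "Thomas kinkade, realistic, Trending on artstation, "
          "yellow color scheme")

vec = CountVectorizer()  # default tokenizer lowercases and drops 1-letter words
counts = vec.fit_transform([prompt])
print(list(vec.get_feature_names_out()))  # 'across', 'and', 'artstation', ...
print(counts.toarray()[0])                # per-word counts; 'on' occurs twice
```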
IV. RESULTS

A survey was conducted to gather the results of our implementation. We could not use a conventional algorithm to obtain the results, as that would raise further problems such as the accuracy of a testing algorithm that has to mimic humans; a survey of human test subjects was therefore conducted to avoid these issues. The survey batch consisted of 263 individuals who rated the images on parameters like Photorealistic, Artistic and Related context. The constraint was to rate the images out of 100 on every parameter. The subjects voted on two text input prompts. The final results are in tabular form, representing the average percentage obtained from the 263 individuals in the 3 categories.

The two text prompts used were 'A astronaut looking at singular lighthouse on mars, shining its light across a tumultuous sea of blood by greg rutkowski and Thomas kinkade, realistic, Trending on artstation, yellow color scheme' and 'A horse riding on astronaut'. One prompt focuses on the ability to produce accurate results on complex input, and the other is the famous conflict-category prompt which our proposed model tries to solve.

Fig. 3. Output from Dalle and Image-dev for first prompt.

TABLE I
SURVEY RESULTS OF PROMPT 1

Prompt 1     Photorealistic   Artistic   Related context
Image-dev    63%              87%        97%
Dalle        94%              92%        84%

Fig. 3 contains the output images obtained from the first text prompt. Referring to TABLE I, it has been observed that the Image-dev model struggles with photorealism but is more artistic in the context of the text input than Dalle; this was expected, as Dalle focuses more on photorealism rather than on artistic images.
TABLE II
SURVEY RESULTS OF PROMPT 2

Prompt 2          Photorealistic   Artistic   Related context
Image-dev         67%              79%        93%
Dalle & Imagen    78%              62%        32%

Fig. 4. Output from Dalle and Imagen for second prompt, from [11].

Fig. 5. Output from Image-dev for second prompt.

Also, for the second prompt in Fig. 4 and Fig. 5, Dalle and Imagen clearly struggle to produce images which are close to the given prompt. Referring to TABLE II, Image-dev obtained 93 percent in the third parameter, which showcases its ability to produce images in the conflict category.

With the TF-IDF (Term Frequency – Inverse Document Frequency) model along with the preposition model, our proposed model acts as a layer which enhances the output. Our model produces better images in the conflict category, where the latest state-of-the-art models like Dalle and Imagen struggle to produce images relevant to the given text input.

V. CONCLUSION

Image-dev is a text-to-image diffusion model with the ability to produce images which are photogenic and have a painting effect to them. The proposed layer sits between the text pre-processing layer and the diffusion base model. The proposed model requires huge computational power to process and traverse through the diffusion process; moreover, the desired image was obtained from a batch of 150 outputs by the model, so the base models used, Disco Diffusion and VQ-GAN+CLIP, show their limits.

In comparison with Dalle-2 and Imagen, the suggested model is suited for the conflict category but lacks in photorealism. Most of the libraries used are based on currently available models; a new iteration using a newer model can produce better results once an advanced model like Dalle-2 becomes available.

A further task would be to use threads and sockets, which will produce low-resolution images but would be accessible universally through the means of an application. Also, another approach for the data objects and the relations between them can be implemented by using graphs and neural networks. The survey batch size should be increased for more accurate results.

REFERENCES

[1] J. Peng, D. Liu, S. Xu, and H. Li, "Generating diverse structure for image inpainting with hierarchical VQ-VAE," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10775-10784, 2021.
[2] C. Zhang and Y. Peng, "Stacking VAE and GAN for context-aware text-to-image generation," in 2018 IEEE Fourth International Conference on Multimedia Big Data (BigMM), pp. 1-5, IEEE, 2018.
[3] X. Liu, A. Gherbi, Z. Wei, W. Li, and M. Cheriet, "Multispectral image reconstruction from color images using enhanced variational autoencoder and generative adversarial network," IEEE Access, vol. 9, pp. 1666-1679, 2020.
[4] Q. Liu, J. Wang, D. Zhang, Y. Yang, and N. Wang, "Text features extraction based on TF-IDF associating semantic," in 2018 IEEE 4th International Conference on Computer and Communications (ICCC), pp. 2338-2343, IEEE, 2018.
[5] B. Zhao, B. Chang, Z. Jie, and L. Sigal, "Modular generative adversarial networks," in Proceedings of the European Conference on Computer Vision (ECCV), pp. 150-165, 2018.
[6] I. Makarov, D. Polonskaya, and A. Feygina, "Improving picture quality with photo-realistic style transfer," in International Conference on Image Analysis and Recognition, pp. 47-55, Springer, 2018.
[7] H. Tibebu, A. Malik, and V. De Silva, "Text to image synthesis using stacked conditional variational autoencoders and conditional generative adversarial networks," in Science and Information Conference, pp. 560-580, Springer, 2022.
[8] Y. Yang, X. Ni, Y. Hao, C. Liu, W. Wang, Y. Liu, and H. Xie, "MF-GAN: Multi-conditional fusion generative adversarial network for text-to-image synthesis," in International Conference on Multimedia Modeling, pp. 41-53, Springer, 2022.
[9] L. Zhao, X. Li, P. Huang, Z. Chen, Y. Dai, and T. Li, "TRGAN: Text to image generation through optimizing initial image," in International Conference on Neural Information Processing, pp. 651-658, Springer, 2021.
[10] A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen, "Hierarchical text-conditional image generation with CLIP latents," arXiv preprint arXiv:2204.06125, 2022.
[11] C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. Denton, S. K. S. Ghasemipour, B. K. Ayan, S. S. Mahdavi, R. G. Lopes, et al., "Photorealistic text-to-image diffusion models with deep language understanding," arXiv preprint arXiv:2205.11487, 2022.
