ImageEval-prompt

ImageEval-prompt is a set of prompts for evaluating text-to-image (T2I) models at a fine-grained level, covering entity, style, and detail. Fine-grained evaluation helps researchers understand the strengths and limitations of T2I models and, in turn, improve their performance.

Construction Method

We constructed two datasets, one in English and one in Chinese. The English dataset consists of 1,624 prompts from PartiPrompts, while the Chinese dataset comprises 339 prompts generated by an automated tool. For each prompt, we annotated three dimensions: entity, style, and detail. The entity dimension has five sub-dimensions: object, state (status), color, quantity (number), and position (location). The style dimension has two sub-dimensions: painting style (paint) and cultural style (culture). The detail dimension has four sub-dimensions: hands (hand), facial features (face), gender, and illogical knowledge (illogical). The names in parentheses are the labels used in the tables. The manual annotation results show that the entity and style dimensions are relatively simple, while the detail dimension, except for the gender sub-dimension, is complex. The table below lists each sub-dimension and the number of prompts that cover it.

Dimension	Sub-dimension	Prompts (English)	Prompts (Chinese)
Entity	object	1624	339
Entity	status	524	52
Entity	color	337	72
Entity	number	1423	134
Entity	location	803	131
Style	paint	181	113
Style	culture	104	98
Detail	hand	46	20
Detail	face	198	68
Detail	gender	238	49
Detail	illogical	109	36
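The three-level taxonomy above can be written down as a small lookup structure. The following is a minimal sketch, not part of the released dataset: the dictionary layout and the names `TAXONOMY`, `ALL_SUB_DIMENSIONS`, and `dimension_of` are assumptions made for illustration, while the sub-dimension labels are taken from the table.

```python
# Hypothetical lookup structure; only the labels come from the table above.
TAXONOMY = {
    "entity": ["object", "status", "color", "number", "location"],
    "style": ["paint", "culture"],
    "detail": ["hand", "face", "gender", "illogical"],
}

# Flat list of all eleven sub-dimension labels, in table order.
ALL_SUB_DIMENSIONS = [sub for subs in TAXONOMY.values() for sub in subs]


def dimension_of(sub_dimension: str) -> str:
    """Return the top-level dimension that a sub-dimension belongs to."""
    for dimension, subs in TAXONOMY.items():
        if sub_dimension in subs:
            return dimension
    raise KeyError(f"unknown sub-dimension: {sub_dimension}")


print(dimension_of("hand"))        # detail
print(len(ALL_SUB_DIMENSIONS))     # 11
```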

Annotation followed a "double-blind annotation & third-party arbitration" approach. The table below shows example annotation results from the English and Chinese datasets. The object column lists the subject objects contained in the prompt (one or more). In the other columns, 0 means the sub-dimension does not appear in the prompt, 1 means the prompt tests the sub-dimension at a simple level, and 2 means it tests the sub-dimension at a complex level.

Prompt	object	status	color	number	location	paint	culture	hand	face	gender	illogical
a basketball game between a team of four cats and a team of three dogs	'basketball game', 'cats', 'dogs'	0	0	1	0	0	0	0	0	0	2
穿着华丽的衣服的女士坐在椅子上，素描 (English: "a lady in ornate clothes sitting on a chair, sketch"; objects: lady, chair, clothes)	'女士', '椅子', '衣服'	0	0	0	1	1	0	0	2	1	0
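To make the row layout concrete, here is a minimal sketch of how one annotated row could be loaded. It assumes the annotations are stored as tab-separated text in the column order shown above, with the object cell written as a comma-separated list of single-quoted strings; the actual file format is not stated here, and `Annotation`, `SCORED_COLUMNS`, and `parse_row` are hypothetical names.

```python
import ast
from dataclasses import dataclass
from typing import Dict, List

# Columns that carry a 0/1/2 level, in the order used by the table above.
SCORED_COLUMNS = [
    "status", "color", "number", "location",
    "paint", "culture", "hand", "face", "gender", "illogical",
]


@dataclass
class Annotation:
    prompt: str
    objects: List[str]
    levels: Dict[str, int]  # 0 = absent, 1 = simple, 2 = complex

    def complex_sub_dimensions(self) -> List[str]:
        """Sub-dimensions that this prompt tests at the complex level."""
        return [name for name, level in self.levels.items() if level == 2]


def parse_row(line: str) -> Annotation:
    """Parse one tab-separated row (assumed layout, see the lead-in)."""
    prompt, object_field, *level_fields = line.rstrip("\n").split("\t")
    if len(level_fields) != len(SCORED_COLUMNS):
        raise ValueError("unexpected number of columns")
    # "'cats', 'dogs'" -> ['cats', 'dogs']; assumes plain quoted strings.
    objects = list(ast.literal_eval(f"[{object_field}]"))
    levels = {name: int(v) for name, v in zip(SCORED_COLUMNS, level_fields)}
    if any(level not in (0, 1, 2) for level in levels.values()):
        raise ValueError("levels must be 0, 1, or 2")
    return Annotation(prompt, objects, levels)


row = "\t".join([
    "a basketball game between a team of four cats and a team of three dogs",
    "'basketball game', 'cats', 'dogs'",
    "0", "0", "1", "0", "0", "0", "0", "0", "0", "2",
])
annotation = parse_row(row)
print(annotation.objects)                   # ['basketball game', 'cats', 'dogs']
print(annotation.complex_sub_dimensions())  # ['illogical']
```

Keeping the levels in a dictionary keyed by sub-dimension makes it easy to select, for example, every prompt that tests a given sub-dimension at the complex level.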

We verified the annotation results and found that the consistency rate exceeded 95% for every sub-dimension except entity objects. Because a prompt can contain several entity objects, we used intersection-over-union (IoU) to calculate the consistency of this sub-dimension. The consistency rate for entity objects is lower (though still above 80%), mainly because of unavoidable subjectivity. For the entity "basketball", for example, some annotators label "basketball" as the entity, while others label "ball".

Dimension	Sub-dimension	Consistent Amount (English)	Consistent Ratio (English)	Consistent Amount (Chinese)	Consistent Ratio (Chinese)
Entity	object	1377	0.841	308	0.894
Entity	status	1609	0.991	337	0.994
Entity	color	1617	0.996	336	0.991
Entity	number	1614	0.994	339	1.000
Entity	location	1606	0.989	335	0.988
Style	paint	1616	0.995	336	0.991
Style	culture	1610	0.991	334	0.985
Detail	hand	1617	0.996	338	0.997
Detail	face	1611	0.992	338	0.997
Detail	gender	1622	0.999	338	0.997
Detail	illogical	1604	0.988	338	0.997
All		1313	0.808	296	0.873
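The two kinds of agreement measure mentioned above can be sketched as follows. This is an illustration under assumptions, not the project's verification script: it treats each annotator's object list as a set for IoU and uses exact match for the 0/1/2 sub-dimensions. How per-prompt IoU values were aggregated into the reported amounts and ratios is not specified, so that step is left out. The names `object_iou` and `exact_match_rate` are hypothetical.

```python
from typing import Iterable, Sequence


def object_iou(objects_a: Iterable[str], objects_b: Iterable[str]) -> float:
    """Intersection-over-union of two annotators' object lists for one prompt."""
    set_a, set_b = set(objects_a), set(objects_b)
    union = set_a | set_b
    if not union:
        # Assumption: two empty lists count as full agreement.
        return 1.0
    return len(set_a & set_b) / len(union)


def exact_match_rate(labels_a: Sequence[int], labels_b: Sequence[int]) -> float:
    """Fraction of prompts on which two annotators gave the same 0/1/2 level."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equally long")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


# The "basketball" vs. "ball" disagreement: 2 shared objects out of 4 distinct ones.
print(object_iou(["basketball", "cats", "dogs"], ["ball", "cats", "dogs"]))  # 0.5
print(exact_match_rate([0, 1, 2, 0], [0, 1, 1, 0]))                          # 0.75
```

With set-based IoU, a single naming difference such as "basketball" versus "ball" penalizes the score twice (once for each unmatched label), which is consistent with object agreement being lower than agreement on the other sub-dimensions.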

Annotation Dimension Description

Dimension	Sub-dimension	Description	Example
Entity	object	Subject object, which may be one or more.	The objects in "a bookshelf with ten books stacked vertically" are book and bookshelf.
Entity	status	Action description.	The status described in "three violins lying on the floor" is lying.
Entity	color	Color description.	The colors described in "a white background with a large blue square" are white and blue.
Entity	number	Quantity description.	The quantity described in "300 movie titles" is 300.
Entity	location	Spatial relation description.	The spatial relation described in "a large yellow triangle above a green square and red rectangle" is "above".
Style	paint	Description of the painting style, such as sketch or traditional Chinese painting, the style of a particular painter or studio, such as Monet or Van Gogh, or a style popular on Pixiv, ArtStation, etc.	The painting style described in "The Oriental Pearl in sketch style" is sketch style.
Style	culture	Description of cultural style, such as Chinese style, Baroque, etc.	The cultural style described in "an Egyptian statue" is Egyptian culture.
Detail	hand	Description of the hands of humanoid entities.	The hand action described in "a man banging on a door" is banging.
Detail	face	Description of the facial features of humanoid entities.	The facial feature described in "a painting of the Mona Lisa with a frown" is a frown.
Detail	gender	Description of the gender of humanoid entities.	The gender described in "a man standing under a tree" is male.
Detail	illogical	Description of prompts that are clearly illogical.	The description "a cat reading a book" is clearly illogical.

Acknowledgement

We are very grateful for the contributions of existing works such as the PartiPrompts benchmark to this project.

License

ImageEval-prompt is licensed under the Apache 2.0 license.
