ImageEval-prompt is a set of prompts for evaluating text-to-image (T2I) models at a fine-grained level along three dimensions: entity, style, and detail. Fine-grained evaluation helps researchers understand the strengths and limitations of T2I models and improve their performance.
We constructed two datasets, one in English and one in Chinese. The English dataset consists of 1,624 prompts from PartiPrompts, and the Chinese dataset comprises 339 prompts generated by an automated tool. For each prompt, we annotated three dimensions: entity, style, and detail. The entity dimension has five sub-dimensions: object, status (state), color, number (quantity), and location (position). The style dimension has two sub-dimensions: paint (painting style) and culture (cultural style). The detail dimension has four sub-dimensions: hand, face (facial features), gender, and illogical (illogical knowledge). The manual annotation results show that the entity and style dimensions are relatively simple, while the detail dimension, except for the gender sub-dimension, is complex. For the content of each sub-dimension and the number of prompts it covers, refer to the tables below.
Dimension | Sub-dimension | Number of prompts (English) | Number of prompts (Chinese) |
---|---|---|---|
Entity | object | 1624 | 339 |
Entity | status | 524 | 52 |
Entity | color | 337 | 72 |
Entity | number | 1423 | 134 |
Entity | location | 803 | 131 |
Style | paint | 181 | 113 |
Style | culture | 104 | 98 |
Detail | hand | 46 | 20 |
Detail | face | 198 | 68 |
Detail | gender | 238 | 49 |
Detail | illogical | 109 | 36 |
Annotation followed a "double-blind annotation and third-party arbitration" procedure. The table below shows sample annotation results from the English and Chinese datasets. The object column lists the subject objects contained in the prompt (one or more). In the other columns, 0 indicates that the sub-dimension does not appear in the prompt, 1 indicates that the prompt tests the sub-dimension in a simple way, and 2 indicates that it tests the sub-dimension in a complex way.
Prompt | object | status | color | number | location | paint | culture | hand | face | gender | illogical |
---|---|---|---|---|---|---|---|---|---|---|---|
a basketball game between a team of four cats and a team of three dogs | 'basketball game', 'cats', 'dogs' | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
穿着华丽的衣服的女士坐在椅子上,素描 (a lady in gorgeous clothes sitting on a chair, sketch) | '女士' (lady), '椅子' (chair), '衣服' (clothes) | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 2 | 1 | 0 |
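The annotation scheme above can be illustrated with a minimal Python sketch. The `Annotation` class, the `make_annotation` and `prompts_testing` helpers, and the in-memory layout are all hypothetical and chosen for illustration only; the dataset's actual file format is not specified here. The two rows are copied from the table above.

```python
from dataclasses import dataclass
from typing import Dict, List

# Column order of the graded sub-dimensions, as in the annotation table.
SUB_DIMENSIONS = [
    "status", "color", "number", "location",
    "paint", "culture", "hand", "face", "gender", "illogical",
]


@dataclass
class Annotation:
    """Hypothetical container for one annotated prompt (not an official schema)."""
    prompt: str
    objects: List[str]       # subject objects, one or more
    levels: Dict[str, int]   # sub-dimension -> 0 (absent), 1 (simple), 2 (complex)


def make_annotation(prompt: str, objects: List[str], values: List[int]) -> Annotation:
    """Build a record from one table row, validating the 0/1/2 scale."""
    if len(values) != len(SUB_DIMENSIONS):
        raise ValueError("expected one value per sub-dimension")
    if any(v not in (0, 1, 2) for v in values):
        raise ValueError("levels must be 0, 1, or 2")
    return Annotation(prompt, objects, dict(zip(SUB_DIMENSIONS, values)))


def prompts_testing(rows: List[Annotation], sub_dimension: str, min_level: int = 1) -> List[str]:
    """Return the prompts whose level for a sub-dimension is at least min_level."""
    return [r.prompt for r in rows if r.levels[sub_dimension] >= min_level]


rows = [
    make_annotation(
        "a basketball game between a team of four cats and a team of three dogs",
        ["basketball game", "cats", "dogs"],
        [0, 0, 1, 0, 0, 0, 0, 0, 0, 2],
    ),
    make_annotation(
        "穿着华丽的衣服的女士坐在椅子上,素描",
        ["女士", "椅子", "衣服"],
        [0, 0, 0, 1, 1, 0, 0, 2, 1, 0],
    ),
]

# Prompts that test the "illogical" sub-dimension in a complex way (level 2).
complex_illogical = prompts_testing(rows, "illogical", min_level=2)
```

Filtering by `min_level` makes it straightforward to build evaluation subsets, for example only the prompts that test a sub-dimension at level 2.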
We verified the annotation results and found that the consistency rate of every sub-dimension except entity objects exceeded 95%. Because the object sub-dimension can contain multiple entities, we used the intersection-over-union (IoU) method to calculate its consistency. The consistency rate for entity objects is lower (though still above 80%), mainly because of unavoidable subjectivity. For a prompt mentioning "basketball", for example, some annotators record "basketball" as the entity, while others record "ball".
Dimension | Sub-dimension | Consistent Amount (English) | Consistent Ratio (English) | Consistent Amount (Chinese) | Consistent Ratio (Chinese) |
---|---|---|---|---|---|
Entity | object | 1377 | 0.841 | 308 | 0.894 |
Entity | status | 1609 | 0.991 | 337 | 0.994 |
Entity | color | 1617 | 0.996 | 336 | 0.991 |
Entity | number | 1614 | 0.994 | 339 | 1.000 |
Entity | location | 1606 | 0.989 | 335 | 0.988 |
Style | paint | 1616 | 0.995 | 336 | 0.991 |
Style | culture | 1610 | 0.991 | 334 | 0.985 |
Detail | hand | 1617 | 0.996 | 338 | 0.997 |
Detail | face | 1611 | 0.992 | 338 | 0.997 |
Detail | gender | 1622 | 0.999 | 338 | 0.997 |
Detail | illogical | 1604 | 0.988 | 338 | 0.997 |
All | | 1313 | 0.808 | 296 | 0.873 |
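The intersection-over-union consistency check for the object sub-dimension can be sketched as follows. This is an illustration under stated assumptions rather than the exact procedure used for the figures above: the function names are hypothetical, entities are compared by exact string match, two empty lists are treated as full agreement, and per-prompt scores are aggregated by a simple mean.

```python
from typing import Iterable, List, Tuple


def object_iou(a: Iterable[str], b: Iterable[str]) -> float:
    """Intersection-over-union of two annotators' entity lists for one prompt."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Assumption: if neither annotator listed an entity, they agree.
        return 1.0
    return len(set_a & set_b) / len(union)


def mean_object_iou(pairs: List[Tuple[List[str], List[str]]]) -> float:
    """Average IoU over prompts (assumed aggregation, for illustration only)."""
    if not pairs:
        raise ValueError("need at least one annotated prompt")
    scores = [object_iou(a, b) for a, b in pairs]
    return sum(scores) / len(scores)


# One annotator records "basketball", the other "ball": 2 shared / 4 distinct.
disagreement = object_iou(["basketball", "cats", "dogs"], ["ball", "cats", "dogs"])

# Order does not matter: identical sets give full agreement.
agreement = object_iou(["bookshelf", "book"], ["book", "bookshelf"])

average = mean_object_iou([
    (["basketball", "cats", "dogs"], ["ball", "cats", "dogs"]),
    (["bookshelf", "book"], ["book", "bookshelf"]),
])
```

Because matching is by exact string, near-synonyms such as "basketball" and "ball" count as a disagreement, which is the kind of subjective difference described above.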
Dimension | Sub-dimension | Description | Example |
---|---|---|---|
Entity | object | Subject object, which may be one or more. | The objects in "a bookshelf with ten books stacked vertically" are book and bookshelf. |
Entity | status | Action description. | The status described in "three violins lying on the floor" is lying. |
Entity | color | Color description. | The colors described in "a white background with a large blue square" are white and blue. |
Entity | number | Quantity description. | The quantity described in "300 movie titles" is 300. |
Entity | location | Spatial relation description. | The spatial relation described in "a large yellow triangle above a green square and red rectangle" is "above". |
Style | paint | Description of the painting style, such as sketch or traditional Chinese painting, or of a particular painter, studio, or platform, such as Monet, Van Gogh, or styles popular on Pixiv, ArtStation, etc. | The paint style described in "The Oriental Pearl in sketch style" is sketch style. |
Style | culture | Description of cultural style, such as Chinese style, Baroque, etc. | The culture style described in "an Egyptian statue" is Egyptian culture. |
Detail | hand | Description of the hands of humanoid entities. | The hand action described in "a man banging on a door" is banging. |
Detail | face | Description of the facial features of humanoid entities. | The facial feature described in "a painting of the Mona Lisa with a frown" is a frown. |
Detail | gender | Description of the gender of humanoid entities. | The gender described in "a man standing under a tree" is male. |
Detail | illogical | Description of prompts that are clearly illogical. | The description "a cat reading a book" is clearly illogical. |
We are very grateful for the contributions of existing works such as the PartiPrompts benchmark to this project.
ImageEval-prompt is licensed under the Apache 2.0 license.