Overview
You are given up to three images (one original RGB image and its two
augmented versions), sidebar information, and a pre-generated conversation
between a human and an AI assistant. Your task is to analyze the image(s)
and the sidebar information to improve the quality of the pre-generated
conversation, wherever possible, so that it follows the guidelines below, by
fixing factual errors and spelling and grammar mistakes, and/or by providing
additional information in responses.
NOTE: As with all LLM (Large Language Model) workflows, refer to the 3 Hs,
which expect responses that are Helpful, Honest, and Harmless.
Glossary
Caption
If it is not empty, it contains one or more sentences describing the
image you are observing.
Chain-of-Thought
Chain-of-Thought (CoT) is a technique that guides LLMs to follow a
reasoning process when dealing with hard problems. This is done by
showing the model a few examples where the step-by-step reasoning
is clearly laid out.
LLM
A Large Language Model is an AI language model that has the ability to
achieve general-purpose language generation.
Objects
They are spread across multiple lines, each describing an object of the
same image you are observing. Every line is made up of an object
name and its bounding box. The bounding box is displayed in the form
of coordinates [x1, y1, x2, y2]. The values are numbers normalized
from 0 to 1, corresponding to the top-left on X axis, top-left on Y axis,
then bottom-right on X axis, and bottom-right on Y axis.
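As an illustration of the normalized box format described above, the following Python sketch converts a box to pixel coordinates. The function name `to_pixel_box` and the 640x480 image size are hypothetical and are not part of the workflow tooling.

```python
def to_pixel_box(box, image_width, image_height):
    """Convert a normalized [x1, y1, x2, y2] box to integer pixel coordinates."""
    x1, y1, x2, y2 = box
    for value in box:
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"coordinate {value} is outside the range 0 to 1")
    if x1 > x2 or y1 > y2:
        raise ValueError("top-left corner must not exceed bottom-right corner")
    return (
        round(x1 * image_width),   # top-left X
        round(y1 * image_height),  # top-left Y
        round(x2 * image_width),   # bottom-right X
        round(y2 * image_height),  # bottom-right Y
    )


# Hypothetical object line: a box covering the right half of a 640x480 image.
print(to_pixel_box([0.5, 0.0, 1.0, 1.0], 640, 480))  # (320, 0, 640, 480)
```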
Region captions
They are made up of multiple lines, each describing a region of the
same image you are observing.
Tags
They provide keywords that describe the content of the image and
individual regions.
Segmentation masks
They are delimited regions in the image that describe the same
thing. That means that an image can be segmented (disassembled)
into several portions, where each portion represents a given object (for
example a car, a person, a tree) or distinct part of the background (for
example, the sky, the ground, a forest, and so on). This will be
presented as a semi-transparent color-coded mask over the original
image.
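One way to picture a semi-transparent color-coded mask is as an alpha blend of a segment color over each labeled pixel. The sketch below is only an illustration under that assumption; the names `blend_pixel` and `overlay_mask`, the palette, and the 50% opacity are hypothetical and do not describe how the workflow renders its masks.

```python
def blend_pixel(base_rgb, mask_rgb, alpha=0.5):
    """Alpha-blend one mask color over one image pixel (channels 0-255)."""
    return tuple(
        round((1 - alpha) * base + alpha * mask)
        for base, mask in zip(base_rgb, mask_rgb)
    )


def overlay_mask(image, labels, palette, alpha=0.5):
    """Return a copy of `image` with each labeled pixel tinted by its segment color.

    image:   rows of (r, g, b) tuples
    labels:  rows of segment ids, or None where no segment applies
    palette: dict mapping segment id -> (r, g, b) color
    """
    return [
        [
            pixel if label is None else blend_pixel(pixel, palette[label], alpha)
            for pixel, label in zip(image_row, label_row)
        ]
        for image_row, label_row in zip(image, labels)
    ]


# Hypothetical 1x2 image: the first pixel belongs to a "car" segment.
image = [[(100, 100, 100), (200, 200, 200)]]
labels = [["car", None]]
palette = {"car": (200, 0, 0)}
print(overlay_mask(image, labels, palette))  # [[(150, 50, 50), (200, 200, 200)]]
```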
Simplified Steps
To complete a task in this workflow, follow the simplified steps below:
1. Analyze the provided data. Inspect the images, side-information
and conversation provided.
2. Conversation correction. Ensure the provided conversation is in
agreement with the project's requirements.
3. Task submission.
End-to-End Process
Task processing in this workflow is divided into the following steps:
Step 1: Data Analysis and Conversation Design
Analyze the provided images and Side-Info.
Based on all the provided data, decide the conversation topics that you find
interesting and would have enough material to write on.
NOTE: Some of the information provided in the Side-Info may be incorrect.
Use your best judgement when deciding how much of it to rely on.
Step 2: Conversation Writing
Write a human-chatbot conversation according to the following
requirements:
The conversation must have a minimum of four and a maximum of five
question-answer pairs.
The questions must have a maximum of 40 words, with lengths equally
distributed across the ranges [1–5], [6–10], [11–20], and [20–40].
The answers must have a maximum of 100 words. At least 60% of them
should include an explanation of 20 or more words.
In the conversation, you are called Human and the chatbot is called Model.
The Model's answers should be written as if a visual AI assistant is
seeing the images and answering the questions.
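The hard limits above (four to five question-answer pairs, 40 words per question, 100 words per answer) can be checked mechanically. The following is a minimal sketch, assuming words are counted by splitting on whitespace; `check_conversation` is a hypothetical helper, not a tool supplied with this workflow.

```python
def word_count(text):
    """Count words by splitting on whitespace (an assumed counting rule)."""
    return len(text.split())


def check_conversation(pairs):
    """Return a list of problems. `pairs` is a list of (question, answer) strings."""
    problems = []
    if not 4 <= len(pairs) <= 5:
        problems.append(f"expected 4 to 5 question-answer pairs, found {len(pairs)}")
    for index, (question, answer) in enumerate(pairs, start=1):
        if word_count(question) > 40:
            problems.append(f"question {index} exceeds 40 words")
        if word_count(answer) > 100:
            problems.append(f"answer {index} exceeds 100 words")
    return problems


# Hypothetical conversation with only three pairs.
pairs = [("What is this?", "It is a red car parked on the street.")] * 3
print(check_conversation(pairs))
```

An empty list means the conversation satisfies these three limits; it says nothing about content quality or the length-distribution targets.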
Questions Requirements
Length: Vary short and long questions, ranging from two words (or even one
word for follow-up questions, like “And?” “Elaborate.” “Why?” “Next.”) to 40
words. Aim for diverse lengths, with a roughly equal number of questions for
these ranges: one to five words (averaging to three words), 6-10 words
(averaging to eight words), 11-20 words (averaging to 15 words), and 20-40
words (averaging to 30 words).
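To see whether a set of questions is spread roughly evenly over the four ranges, the lengths can be tallied per range. This is a sketch under two assumptions: words are counted by whitespace, and a question of exactly 20 words is placed in the 11-20 range, since the stated ranges overlap at 20. The names `length_bucket` and `bucket_histogram` are hypothetical.

```python
from collections import Counter

# (label, lowest word count, highest word count); 20 is assigned to "11-20".
BUCKETS = [("1-5", 1, 5), ("6-10", 6, 10), ("11-20", 11, 20), ("20-40", 21, 40)]


def length_bucket(question):
    """Return the label of the length range a question falls into."""
    n = len(question.split())
    for label, low, high in BUCKETS:
        if low <= n <= high:
            return label
    return "out of range"


def bucket_histogram(questions):
    """Count how many questions fall into each length range."""
    return Counter(length_bucket(q) for q in questions)


questions = ["Why?", "What is the color of the shirt the man is wearing?"]
print(bucket_histogram(questions))
```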
Continuity: Use expressions that refer to the conversation context (for
example, it, those, hers, him, at that time, since then, and so on) rather
than spelling out the name of the person, thing, or scenario, or use acronyms
with meaning implied in the context (for example, “what do I do?” instead of
“what do I do if <repeat last question’s scenario>?”, “what do you mean?”
instead of “what did you mean by <repeat content of last answer>?”).
Style: Use any style, like:
Very formal language.
Commanding language (for example, “Tell me...”, “Find the...”, instead
of “Why does...” “Where is...”).
Very informal language (for example, “why tho”, “yesterday sucked”).
Common slang (for example, "never mind", "no biggie", "it sucked",
"I'm in", "hit the books", "my bad", "wrap up", "you're telling me",
"time out", "what's up?").
Light grammar errors and spelling mistakes.
Texting acronyms (for example, “btw”, “lol”).
Answers Requirements
3Hs: The answers should follow the 3Hs (HHH) principles; that is, they
should be honest, helpful, and harmless. Thus, when writing answers, start
by writing facts that are directly observable in the image and related to the
questions.
Length: If the Human specifies the length or asks for short or long answers,
comply with the requirement. Otherwise, write answers that have:
4-15 words (flexible), averaging 7.5 words for questions that do not
need explanations.
20-60 words, averaging roughly 40 words, for questions that need
explanations.
Aim for diverse lengths, but ensure that at least three answers per
conversation have an explanation (10–40 words).
When writing detailed answers, it may sometimes be helpful to write a
chain-of-thought explanation, followed by the answer. Refer to the
examples below:
Human: Is the crowd getting entertained?
Agent: There is a man in the foreground performing a trick with a
spatula and a knife. The crowd behind him are looking towards him
and many of them are clapping and smiling. So, yes, they seem to be
getting entertained.
Human: Is this picture taken during the night?
Agent: The image is taken outdoors. The lighting appears to be
natural and the sky is bright and has clouds. So, the image does not
appear to be taken at night.
Style: Remain formal or semi-formal, unless the Human’s question instructs
otherwise. It is okay to answer partly and ask clarifying questions.
Don'ts:
Do not answer only with Yes or No, unless it is asked in the question.
Do not use slang. Do not use “Yeah”, “Not sure”. Instead, use "Yes", "I
am not sure".
Do not make the Agent ask questions, except when they are clarifying
questions or requesting more information, and the answer will help
the Model respond to the previous question being asked. Formality
questions are okay if the Human did not ask a question in the end (for
example, different rephrasing of "Is there anything else I can do for
you?").
Do not provide links, since:
They are not useful for LLMs that do not have access to the
internet.
They encourage LLMs to hallucinate URLs that do not exist.
Links change all the time and may no longer work in a few years.
Instead, state the search query that leads to a given link (for
example, Search "<search query>" on your favorite search
engine).
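Some of the Don'ts above lend themselves to a rough automated check. The sketch below flags answers that are only Yes or No, contain a link, or use the informal wording named above. It is a heuristic with hypothetical names (`lint_answer`, `URL_PATTERN`); it cannot tell whether the Human asked for a Yes or No answer, so any flag still needs human review.

```python
import re

# Assumed heuristic: anything starting with http://, https://, or www. is a link.
URL_PATTERN = re.compile(r"(https?://|www\.)\S+", re.IGNORECASE)


def lint_answer(answer):
    """Return a list of guideline issues found in a single answer."""
    issues = []
    stripped = answer.strip()
    if stripped.rstrip(".!").lower() in {"yes", "no"}:
        issues.append("answer is only Yes or No")
    if URL_PATTERN.search(stripped):
        issues.append("answer contains a link")
    # "I am not sure" is acceptable; only a bare leading "Not sure" is flagged.
    if re.search(r"\byeah\b", stripped, re.IGNORECASE) or stripped.lower().startswith("not sure"):
        issues.append("answer uses informal wording")
    return issues


print(lint_answer("Yeah, see www.example.com"))
```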
Do
The following are expected practices in this workflow:
1. Only include questions that have definite answers and are not
ambiguous.
2. Ask diverse questions and give corresponding answers.
3. The content within the conversation should be logically connected. You
can think about the topic first and then generate the conversation
according to the topic.
4. The topic and questions should be related to the visual content of the
image and not require extensive reasoning or outside knowledge.
Therefore, in a conversation, aim to cover at least three of the below
areas:
a. Image or region description (for example,
“Can you provide a comprehensive
description of the image?”, “Elaborate on the details of the left
half of the image.”, “What are the specifics visible in the
image?”, “Could you offer an in-depth analysis of the top-left part
of image?”, “Can you depict the image with precise detail?”,
“Give a detailed account of the image.”, “Explain the image in
meticulous detail.”, “Can you portray the image in words?”,
“Give a thorough narrative of the image.”). See the different ways
to ask for image or region description listed below for more examples.
b. Image topic, quality, emotion, scene, or style
(for example, “Write a caption
or brief description for this image.”, “Is the man’s face blurred or
in darkness or over-exposed?”, “What feeling is the image
conveying?”, “Is this a restaurant?”, “Which art style is evident in
this image? Oil Paint, Watercolor, Photo, Renaissance?”).
c. Object types, locations, quantities, density, or relative
position (for example, “What is that object on the left?”,
“Where is the cell phone?”, “How many horses are there?”, “Is
this a busy street?”, “How many directions do the branching
roads from the tallest main road in the image lead to in total?”).
d. Attribute recognition or comparison (for example,
“What is the color of the shirt
the man is wearing?”, “Which apple is bigger?”, “What is the
person holding?”, “Which boat is ahead?”, “Are all buses going in
the same direction?”).
e. Actions (for example, “What are the kids
doing?”, “Is the man hammering a nail?”, “What is the girl
jumping over?”).
f. Landmark recognition (for
example, “Is this the Eiffel Tower?”, “What type of building is this?”).
g. Text in the image (for example,
“Where is the player wearing jersey number 35?”, “What is the
phone number in the image?”).
Don't
The following practices should be avoided in this workflow:
1. Do not ask questions that require complex reasoning, such as future
prediction, event planning, social relationship, questions that require
information not in the picture, questions that require annotators to use
internet search, and so on. A few examples that require complex
reasoning are:
o “Given the time on the clock, when can I reach my destination if I
take the next train?”
o “What is the best location to put my printer on the table?”
2. Do not ask questions that require background knowledge of the
objects, such as information about their physical properties or natural
relationships. An example of a question that requires background knowledge
is one about a prey-predator relationship.
3. Do not ask more than two questions that are unhelpful for the Human,
such as counting objects that are few in number. A few examples
of unhelpful questions are as follows:
o “How many chairs are there?” when there are fewer than five.
o “Is the curtain red?” when the color is easily recognized as red.
o It is okay to ask trick questions, like “count number of chairs”
when there are no chairs in the scene.
4. Do not use general social fillers like “Hello, how are you” and “Thanks
for the help!”
Ideas for building conversation
If you are out of ideas, consider the following perspectives based on
relevancy to the image:
What if the Human is doing a professional job related to many images
including this one, and asking the Agent for help or to play a certain
professional role? Refer to the following examples:
Example professions that the Agent could have: artist, analyst,
designer, inspector, architect, banker, lawyer, accountant,
consultant, security, detective, engineer, and so on.
Examples of work could be: condition inspection, describing
structure, security concerns, estimating counts of object piles,
checking the existence of things that may or may not appear in the
scene, checking how many items have a property (for example,
brand, style), checking traces of usage or detailed imperfections.
What if the Human provides extra context to the situation and asks for
help or opinion?
Example situations could be: lost, emergencies, product
comparisons, shopping ideas, design changes, social relationship
ideas, what to wear, and so on.
What if the Human wants some specific style for the output, or has
specific constraints?
Example requests could be: “How many people are in this
image? Only answer with Arabic numerals without punctuation.”
“Talk about this scene in the tone of an Instagram influencer, use
hashtags but do not use emojis or emoticons.” “List the colors in
this painting, separated by semicolons. Do not add a period at
the end.” “Write an ad of this <blah> in the tongue of a small
child.”
What if the Human uses the Agent for automation and asks for a
certain format response?
Example requests could be: “Count the animals and answer in the
following format: sheep xN, cows xN, dogs xN”, “Report any
words in the image and group them by where they appear, and
answer in JSON format, such as `{"container": ["clorox",
"milano"], "signage": ["$9.99", "SALE", "EXIT"], "t-shirt": ["Just do
it."]}`.”
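When a question asks for a machine-readable format such as the JSON example above, it helps to confirm that the written answer actually parses. The sketch below uses Python's standard `json` module; `parse_grouped_words` is a hypothetical helper, and the assumption that every value must be a list of strings is taken only from the shape of the example above.

```python
import json


def parse_grouped_words(answer):
    """Parse a JSON answer mapping location -> list of words.

    Raises ValueError if the text is not valid JSON or has a different shape.
    """
    data = json.loads(answer)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for location, words in data.items():
        if not isinstance(words, list) or not all(isinstance(w, str) for w in words):
            raise ValueError(f"{location!r} must map to a list of strings")
    return data


answer = '{"container": ["clorox", "milano"], "signage": ["$9.99", "SALE", "EXIT"], "t-shirt": ["Just do it."]}'
print(sorted(parse_grouped_words(answer)))  # ['container', 'signage', 't-shirt']
```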
Examples of Image or Region related questions:
The following are different ways to ask for image or region description:
1. Can you provide a comprehensive description of the image?
2. Elaborate on the details of the image.
3. What are the specifics visible in the image?
4. Could you offer an in-depth analysis of the image?
5. Can you depict the image with precise detail?
6. Give a detailed account of the image.
7. Explain the image in meticulous detail.
8. Can you portray the image in words?
9. Give a thorough narrative of the image.
10. Please provide an intricate breakdown of the image.
11. Offer a complete interpretation of the image.
12. Delve into the particulars of the image.
13. Explain all the nuances you observe in the image.
14. Provide a detailed commentary on the image.
15. Illustrate the image in depth using your words.
16. Could you give a blow-by-blow description of the image?
17. Go into detail about the different elements of the image.
18. Can you dissect the image and describe each element in detail?
19. Detail the contents of the image extensively.
20. Can you provide an in-depth explanation of the image?
21. Provide a comprehensive overview of the image.
22. Break down the elements of the image in detail.
23. Can you expound upon the features of the image?
24. Offer an exhaustive description of the image.
25. How would you illustrate the image in words?
26. Please convey the image’s details verbally.
27. Can you detail the contents of the image?
28. Narrate what you see in the image in depth.
29. Kindly provide a meticulous commentary on the image.
30. Share an extensive description of the image.
31. Could you interpret the image in a detailed manner?
32. Present a detailed report of the image’s features.
33. Can you provide an intricate depiction of the image?
34. Disclose every detail you see in the image.
Step 3: Task Submission
To submit your task, select Submit.