2. LITERATURE REVIEW
2.1 Bare Hand Computer Interaction
Hardenberg and Berard, in their work “Bare-Hand Human-Computer
Interaction” [5], published in the Proceedings of the Workshop on Perceptive
User Interfaces, describe techniques for bare-handed computer interaction.
Techniques for hand segmentation, finger finding, and hand posture
classification are discussed. They applied their work to controlling an on-
screen mouse pointer for applications such as a browser and a presentation
tool. They also developed a multi-user application intended as a
brainstorming tool that lets different users arrange text across the screen.
Figure 2-1. Application examples of Hardenberg and Berard’s system.
Finger-controlled a) browser, b) paint, c) presentation,
d) multi-user object organization
Hand segmentation techniques such as stereo image segmentation,
color, contour, connected-components algorithms, and image differencing
are briefly discussed as an overview of existing algorithms. The authors point
out that the weaknesses of the different techniques can be compensated for by
combining them, at the cost of additional computation. For their work,
they chose a modified image-differencing algorithm. Image
differencing tries to segment a moving foreground (i.e., the hand) from a
static background by comparing successive frames. Additionally, when
compared against a reference image, the algorithm can detect resting hands.
A further modification to image differencing was maximizing the contrast
between foreground and background.
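A rough sketch of this kind of image differencing is given below, in Python using OpenCV. It is not the authors’ implementation: the thresholds, the median filter, and the function names are illustrative assumptions.

import cv2

def segment_hand(prev_gray, curr_gray, reference_gray,
                 motion_thresh=25, rest_thresh=35):
    """Binary mask of pixels that are either moving between successive
    frames or differ from a static reference (background) image, so that
    a resting hand remains segmented."""
    moving = cv2.absdiff(curr_gray, prev_gray) > motion_thresh      # frame-to-frame difference
    resting = cv2.absdiff(curr_gray, reference_gray) > rest_thresh  # difference to reference image
    mask = (moving | resting).astype("uint8") * 255
    return cv2.medianBlur(mask, 5)  # suppress single-pixel noise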
After segmentation, the authors discuss the techniques used for
detecting fingers and hands. They describe a simple and reliable
algorithm that finds fingertips first, from which fingers and eventually
the whole hand can be identified. The algorithm is based on a simple model
of a fingertip as a circle mounted on a long protrusion. Once the fingertips
have been found, a model of the fingers and eventually the hand can be
generated, and this information can be used for hand posture classification.
Figure 2-2. Finger model used by Hardenberg and Berard
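The fingertip model lends itself to a simple per-pixel test. The sketch below is a loose interpretation of the circle-plus-protrusion idea, not the authors’ algorithm; the radii and run-length thresholds are assumptions.

import numpy as np

def is_fingertip(mask, y, x, r_inner=3, r_search=9, min_run=3, max_run=12):
    """Rough circle-plus-protrusion test: the small circle around (y, x)
    must be filled with foreground, while only a short run of foreground
    pixels (the finger 'stem') may cross the border of the search square."""
    h, w = mask.shape
    if y < r_search or x < r_search or y + r_search >= h or x + r_search >= w:
        return False
    # 1) the inner circle must be (almost) completely foreground
    yy, xx = np.ogrid[-r_inner:r_inner + 1, -r_inner:r_inner + 1]
    circle = (yy ** 2 + xx ** 2) <= r_inner ** 2
    patch = mask[y - r_inner:y + r_inner + 1, x - r_inner:x + r_inner + 1] > 0
    if patch[circle].mean() < 0.95:
        return False
    # 2) only the finger stem should cross the border of the search square
    border = np.concatenate([
        mask[y - r_search, x - r_search:x + r_search + 1],
        mask[y + r_search, x - r_search:x + r_search + 1],
        mask[y - r_search:y + r_search + 1, x - r_search],
        mask[y - r_search:y + r_search + 1, x + r_search],
    ])
    filled = int(np.count_nonzero(border))
    return min_run <= filled <= max_run

Scanning the foreground pixels of the segmented mask with such a test and merging nearby positives yields fingertip candidates from which fingers and the whole hand can then be assembled.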
The end system had a real-time capacity of around 20-25 Hz. Data from
their evaluation show that about 6 frames out of 25 are misclassified when
the foreground is moving quickly, and positional accuracy was between
0.5 and 1.9 pixels. They conclude that the system they developed is simple
and effective, and capable of working in various lighting conditions.
2.2 Using Marking Menus to Develop Command Sets
for Computer Vision Based Gesture Interfaces
The authors Lenman, Bretzner and Thuresson present “Using Marking
Menus to Develop Command Sets for Computer Vision Based Gesture
Interfaces” [6], published in the Proceedings of the Second Nordic
Conference on Human-Computer Interaction. The proposed gesture-based
interaction is intended as a partial replacement for present interaction tools
such as remote controls and mice. Perceptual and multimodal user
interfaces are the two main scenarios
discussed for gestural interfaces. The perceptual user interface scenario aims
at automatic recognition of human gestures, integrated with other human
expressions such as facial expressions or body movements, while the
multimodal user interface scenario focuses more on hand poses and specific
gestures that can be used as commands in a command language.
The authors identify three dimensions to be considered in designing
gestural command sets. The first is the cognitive aspect, which refers to how
easy commands are to learn and remember; command sets should therefore
be practical for the human user. The second, the articulatory aspect, concerns
how easy the gestures are to perform and how tiring they are for the user.
The last dimension covers the technical aspects: the command set must
match the state of the art and the expectations of upcoming technology.
The authors concentrate on the cognitive side. They consider a menu
structure to be a great advantage because commands can then be easily
recognized. Pie menus and marking menus are the two types of menu
structure that the authors discuss. Pie menus are pop-up menus whose
alternatives are arranged radially. Marking menus, specifically hierarchic
marking menus, are a development of pie menus that allow more complex
choices by introducing sub-menus.
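A hierarchic marking menu is essentially a small tree of commands. The sketch below shows one way such a hierarchy could be represented, using made-up command names loosely based on the TV/CD/lamp prototype described in the next paragraph.

# Hypothetical three-level command hierarchy; the names are illustrative only.
MARKING_MENU = {
    "TV":   {"power": "tv_power", "channel": {"up": "tv_channel_up", "down": "tv_channel_down"}},
    "CD":   {"play": "cd_play", "stop": "cd_stop", "track": {"next": "cd_next", "prev": "cd_prev"}},
    "Lamp": {"on": "lamp_on", "off": "lamp_off"},
}

def select(menu, choices):
    """Follow one gesture-selected choice per hierarchy level and return
    either a sub-menu or a terminal command identifier."""
    node = menu
    for choice in choices:
        node = node[choice]
    return node

print(select(MARKING_MENU, ["TV", "channel", "up"]))  # -> "tv_channel_up"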
As a test, a prototype for hand-gesture interaction was built.
Lenman, Bretzner and Thuresson chose a hierarchic menu system for
controlling the functions of a TV, a CD player and a lamp. Their computer-
vision system detects and then recognizes hand poses based on a combination
of multi-scale color feature detection and particle filtering. Hand poses are
represented as hierarchies of color image features with qualitative
interrelations in position, orientation and scale. Their menu system has three
hierarchical levels and four choices. At present the menus are shown on a
computer screen, which is inconvenient; in the future an overlay on the
TV screen will be presented.
For future work, they aim to increase the speed and tracking stability
of the system in order to achieve greater position independence for gesture
recognition, increase the tolerance for varying lighting conditions, and
improve recognition performance.
2.3 Computer Vision-Based Gesture Recognition
for an Augmented Reality Interface
Granum et al. presented “Computer Vision-Based Gesture
Recognition for an Augmented Reality Interface” [7], published
in the Proceedings of the 4th International Conference on Visualization,
Imaging and Image Processing. The paper covers the different areas, such as
gesture recognition and segmentation, that are needed to complete the
research, along with the techniques used for each. There has already been a
lot of research on vision-based hand gesture recognition and finger-tracking
applications. As technology advances, researchers are finding
ways to make computer interfaces behave more naturally; capabilities such as
sensing the environment through sight and hearing must be imitated by
the computer.
This research was done for one application: a computer-vision
interface for an augmented reality system. The computer vision is centered
on gesture recognition and finger tracking used as an interface to the PC.
Their setup projects a display onto a Place Holder Object (PHO), and with
the bare hand the user can control what is displayed; movements and
gestures of the hand are detected by a head-mounted camera, which serves
as the input to the system.
There are two main problem areas, and the presentation of their
solutions forms the main bulk of the paper. The first is segmentation, used
to detect the PHO and the hands in the 2D images captured by the camera.
Detecting the hands is difficult because the hand changes form as it moves
and changes apparent size between gestures. To solve this problem the study
used color pixel-based segmentation, which provides an extra dimension
compared to gray-tone methods. Color pixel-based segmentation creates a
new problem with
illumination, which depends on both intensity changes and color changes.
This problem is resolved by using normalized RGB, also called
chromaticities, although implementing this method raises several issues, one
of which is that cameras normally have a limited dynamic intensity range.

After the hand pixels have been segmented from the image, the next
task is to recognize the gesture. This is subdivided into two approaches: the
first detects the number of outstretched fingers, and the second handles the
point-and-click gesture. Counting fingers is done with a simple approach: a
polar transformation around the center of the hand, counting the number of
fingers (roughly rectangular in shape) present at each radius; to speed up the
algorithm, the segmented image is sampled along concentric circles (a rough
sketch follows Figure 2-3). The second area of concern is the detection of
point-and-click gestures. The same gesture-recognition algorithm is used:
when it detects only one finger, this represents a pointing gesture, and the tip
of that finger defines the actual pointing position. The center of the finger is
found for each radius and the values are fitted to a straight line; this line is
then searched until the final point, the fingertip, is reached.
Figure 2-3. Polar transformation on a gesture image.
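The sketch below illustrates the circle-sampling idea; it is not the paper’s implementation, and the radii and the handling of the wrist are assumptions.

import numpy as np

def count_fingers(mask, center, radii=range(20, 60, 5)):
    """Sample the segmented hand mask along concentric circles around the
    palm center and count contiguous runs of foreground samples; each
    outstretched finger crossing a circle contributes one run."""
    cy, cx = center
    h, w = mask.shape
    run_counts = []
    for r in radii:
        angles = np.linspace(0.0, 2.0 * np.pi, 360, endpoint=False)
        ys = np.clip((cy + r * np.sin(angles)).astype(int), 0, h - 1)
        xs = np.clip((cx + r * np.cos(angles)).astype(int), 0, w - 1)
        samples = mask[ys, xs] > 0
        # a run starts wherever a foreground sample follows a background one
        # (computed circularly so a run crossing the 0-degree mark counts once)
        runs = int(np.count_nonzero(samples & ~np.roll(samples, 1)))
        run_counts.append(runs)
    # take the most frequent run count across the radii as the finger count
    return int(np.bincount(run_counts).argmax()) if run_counts else 0

In practice the forearm also crosses the sampled circles and has to be excluded (for example by restricting the sampled angles), and taking the most frequent count across radii is only one of several ways to combine the per-radius counts.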
The paper is a research step toward gesture recognition, implemented
as part of a computer-vision system for augmented reality. The research
shows qualitatively that the approach can be a useful alternative interface for
augmented reality, and that it is robust enough for the augmented reality
system.
2.4 Creating Touch-Screens Anywhere with
Interactive Projected Displays
Claudio Pinhanez et al., the researchers behind “Creating Touch-Screens
Anywhere with Interactive Projected Displays” [8], published in the
Proceedings of the Eleventh ACM International Conference on Multimedia,
started a few years ago to develop systems that can transform an
available physical space into an interactive, touch-screen-style projected
display. In this paper, the authors demonstrate a technology named the
Everywhere Display (ED) that can be used for human-computer
interaction (HCI). The technology was implemented using an
LCD projector with motorized focus and zoom and a computer-controlled
pan-tilt zoom camera. They also came up with a low-end version, which
they call ED-lite, that functions the same as the high-end version and differs
only in the devices used: the low-end version uses a portable projector and
an ordinary camera.
Several groups of researchers have been working on new methods of
improving present HCI. The most common methods of HCI are the mouse,
the keyboard, and touch-screens, but these require an external device for
humans to communicate with computers. The goal of the researchers was to
develop a system that eliminates such external devices linking human and
computer. The most popular approach under research is computer vision,
which is widely used since it offers a methodology similar to human-human
interaction, which is exactly what advances in HCI aim for. IBM researchers
used computer vision to implement ED and ED-lite. With the aid of computer
vision, the system is able to steer the projected display from one surface to
another, and a touch-screen-like interaction is made possible by machine-
vision techniques and algorithms.
Figure 2-4. Configuration of ED (left), ED-lite (upper right), and sample
interactive projected display (bottom right).
The particular application used by IBM for demonstration is a slide
presentation in Microsoft PowerPoint, given a touch-screen-like function
using the devices mentioned earlier. The ED unit was installed at ceiling
height on a tripod to cover a greater space. A computer controls the ED unit
and performs all other functions, such as the vision processing for interaction
and running the application software. The specific test conducted was a
slide-presentation application in Microsoft PowerPoint controlled via hand
gestures. There are designated locations in the projected image that the user
can use to navigate the slides or to move the content of the projected display
from one surface area to another. The user controls the slides by touching
buttons superimposed on the specified projected surface area. With this
technology the user interacts with the computer bare-handed, without input
devices attached directly to either the user or the computer.
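The button-touching interaction can be pictured as a simple hit test between the detected fingertip and the regions of the projected interface. The sketch below uses made-up button positions and assumes the fingertip has already been mapped into display coordinates; it is an illustration, not IBM’s implementation.

from dataclasses import dataclass

@dataclass
class Button:
    name: str
    x: int  # top-left corner in display (projector) coordinates
    y: int
    w: int
    h: int

    def contains(self, px, py):
        return self.x <= px < self.x + self.w and self.y <= py < self.y + self.h

# Hypothetical button layout for a slide-presentation overlay.
BUTTONS = [Button("next_slide", 900, 620, 80, 80),
           Button("prev_slide", 40, 620, 80, 80)]

def hit_test(fingertip_xy, buttons=BUTTONS):
    """Return the name of the touched button, or None."""
    px, py = fingertip_xy
    for button in buttons:
        if button.contains(px, py):
            return button.name
    return None

print(hit_test((930, 650)))  # -> "next_slide"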
2.5 Interactive Projection
Projector designs are shrinking and are now just at the threshold
of being compact enough for handheld use. For this reason, Beardsley and his
colleagues at Mitsubishi Electric Research Labs propose “Interactive
Projection” [9], published in IEEE Computer Graphics and Applications.
Their work is an investigation of mobile, opportunistic projection
that can turn every surface into a display, a vision of making the world
its desktop.
The prototype has buttons that serve as the I/O of the device, as well as
a built-in camera to detect the user's input. The paper discusses three
broad classes of application for interactive projection. The first class uses a
clean display surface for the projected display. The second class creates a
projection on a physical surface; this, typically, is what we call augmented
reality: the first stage is object recognition, and the next is to project an
overlay that gives some information about the object. The last class is to
designate a physical region of interest, which can be used as input to
computer-vision processing. This is similar to using a mouse to draw a box
selecting the region of interest, but with the pointing finger in place of the
mouse.
Figure 2-5. Handheld Projector Prototype
There are two main issues when using a handheld device to create
projections. The first is keystone correction, needed to produce an
undistorted projection with the correct aspect ratio. Keystoning occurs when
the projector is not perpendicular to the screen, producing a trapezoidal shape
instead of a rectangle; keystone correction fixes this kind of distortion.
The second issue is removing the effects of hand motion. The paper describes
a technique for keeping a projection static on a surface even while the
projector is moving. Distinctive visual markers called fiducials are used to
define a coordinate frame on the display surface: a camera senses
the markers and infers the target area in camera image coordinates, these
coordinates are transformed to projector image coordinates, and the
projection data is mapped into them, giving the correct placement
of the projection.
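The camera-to-projector mapping can be expressed as a planar homography estimated from the detected fiducials. The sketch below (using OpenCV) is an illustration of that idea, not the authors’ system; the coordinate values and the 1024x768 projector resolution are assumptions.

import cv2
import numpy as np

# Fiducial positions detected in the camera image (pixels) ...
camera_pts = np.float32([[102, 87], [518, 94], [530, 402], [95, 390]])
# ... and the positions they correspond to in the projector frame buffer.
projector_pts = np.float32([[0, 0], [1024, 0], [1024, 768], [0, 768]])

# Homography taking camera coordinates to projector coordinates.
H, _ = cv2.findHomography(camera_pts, projector_pts)

def place_content(content, target_corners_cam, proj_size=(1024, 768)):
    """Warp `content` so that, once projected, it fills the quadrilateral
    whose four corners were detected in camera coordinates."""
    h, w = content.shape[:2]
    src = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    # Express the target corners in projector coordinates via H.
    dst = cv2.perspectiveTransform(
        np.float32(target_corners_cam).reshape(-1, 1, 2), H).reshape(-1, 2)
    M = cv2.getPerspectiveTransform(src, np.float32(dst))
    return cv2.warpPerspective(content, M, proj_size)

Re-estimating the homography every frame from the sensed fiducials is what keeps the projection static on the surface even while the handheld projector moves.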
Examples of applications for each of the main classes given above are also
discussed. An example of the first class is a projected web browser: basically
a desktop Windows environment modified so that the display goes to the
projector and the input is taken from the buttons of the device. An example
application of the second class is projected augmented reality. The third
application is a mouse-button-style hold-and-drag that defines a region of
interest (ROI), just as on a desktop but without the use of a mouse.
2.6 Ubiquitous Interactive Displays in a Retail
Environment
Pinhanez et al., in their work “Ubiquitous Interactive Displays in a
Retail Environment” [10], published in the Proceedings of the ACM Special
Interest Group on Graphics (SIGGRAPH) Sketches program, propose an
interactive display set in a retail environment. It uses a pan/tilt/mirror/zoom
camera with a projector and computer-vision methods to detect interaction
with the projected image. They call this technology the Everywhere Display
projector (ED projector). They propose using it in a retail environment to
help customers find a certain product, give them information about it, and
tell them where it is located. The ED projector is installed on the ceiling and
can project images onto boards hung in every aisle of the store. At the
entrance of the store there is a table onto which a larger version of the
product finder is projected: a list of products is shown on the table, and the
user moves a wooden red slider to find a product. The camera detects this
motion and the list scrolls up and down, following the motion of the slider.
Figure 2-6. Setup of the Product Finder
2.7 Real-Time Fingertip Tracking and Gesture
Recognition
Professor Kenji Oka and Yoichi Sato of the University of Tokyo, together
with Professor Hideki Koike of the University of Electro-Communications,
Tokyo, worked on “Real-Time Fingertip Tracking and Gesture Recognition”
[11], published in IEEE Computer Graphics and Applications, Volume 22,
Issue 6. The paper introduces a method for determining fingertip locations in
an image frame and for measuring fingertip trajectories across image frames.
The authors also propose a mechanism for combining direct manipulation
and symbolic gestures based on the motion of multiple fingertips. Several
augmented desk interfaces have been developed recently. DigitalDesk is one
of the earliest attempts at an augmented desk interface: using only a charge-
coupled device (CCD) camera and a video projector, users can operate
projected desktop applications with a fingertip. Inspired by DigitalDesk, the
group developed an augmented desk interface called EnhancedDesk that lets
users perform tasks by manipulating both physical and electronically
displayed objects simultaneously with their own hands and fingers. An
example application demonstrated in the paper is EnhancedDesk’s two-
handed drawing system. The application uses the proposed tracking and
gesture-recognition methods and assigns different roles to each hand: gesture
recognition lets users draw objects of different shapes and directly
manipulate those objects with the right hand and fingers. Figure 2-7 shows
the setup used by the
group, which includes an infrared camera, a color camera, an LCD projector
and a plasma display.
Figure 2-7. EnhancedDesk’s set-up
The detection of multiple fingertips in an image frame involves
extracting hand regions, finding fingertips, and finding the palm’s center. For
the extraction of hand regions, an infrared camera is used to measure
temperature, compensating for the complicated background and dynamic
lighting by raising the pixel values corresponding to human skin above
other pixels. For finding fingertips, a search window is defined rather than
performing arm extraction, since that searching process is more
computationally expensive. Based on the geometrical features of a finger,
the fingertip-finding method uses normalized correlation with a properly
sized template corresponding to the user’s fingertip size (a rough sketch
follows Figure 2-8).
Figure 2-8. Fingertip detection
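The sketch below illustrates the normalized-correlation step with a generic disk-shaped template; the template size, threshold, and maximum number of fingertips are assumptions rather than the paper’s values.

import cv2
import numpy as np

def fingertip_candidates(hand_mask, radius=8, thresh=0.8, max_tips=5):
    """Return up to `max_tips` (x, y) locations whose neighborhood in the
    binary hand mask correlates strongly with a disk-shaped template of
    roughly fingertip size."""
    size = 2 * radius + 1
    template = np.zeros((size, size), np.uint8)
    cv2.circle(template, (radius, radius), radius, 255, -1)

    # Normalized cross-correlation between the mask and the template.
    score = cv2.matchTemplate(hand_mask, template, cv2.TM_CCORR_NORMED)

    tips = []
    for _ in range(max_tips):
        _, max_val, _, max_loc = cv2.minMaxLoc(score)
        if max_val < thresh:
            break
        tips.append((max_loc[0] + radius, max_loc[1] + radius))  # match center
        # Suppress this neighborhood so the next maximum is a different fingertip.
        cv2.circle(score, max_loc, 2 * radius, 0, -1)
    return tips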
Measuring fingertip trajectories involves determining trajectories,
predicting fingertip locations, and examining fingertip correspondences
between successive frames. To determine possible trajectories, the fingertip
locations in the next frame are predicted and then compared with the
fingertips actually detected there; finding the best combination between
these two sets of fingertips determines multiple fingertip trajectories in real
time. A Kalman filter is used to predict the fingertip locations in one image
frame based on their locations detected in the previous frame (a rough sketch
follows Figure 2-9).
Figure 2-9. (a) Detecting fingertips. (b) Comparing detected and
predicted fingertips to determine trajectories
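A constant-velocity Kalman filter is one standard way to realize such a predictor. The sketch below tracks a single fingertip with the state [x, y, vx, vy]; the noise parameters are assumptions, not the paper’s values.

import numpy as np

class FingertipKalman:
    """Constant-velocity Kalman filter for one fingertip."""

    def __init__(self, x, y, dt=1.0):
        self.state = np.array([x, y, 0.0, 0.0])   # position and velocity
        self.P = np.eye(4) * 10.0                 # state covariance
        self.F = np.array([[1, 0, dt, 0],         # constant-velocity motion model
                           [0, 1, 0, dt],
                           [0, 0, 1, 0],
                           [0, 0, 0, 1]], float)
        self.H = np.array([[1, 0, 0, 0],          # only position is observed
                           [0, 1, 0, 0]], float)
        self.Q = np.eye(4) * 0.1                  # process noise
        self.R = np.eye(2) * 2.0                  # measurement noise

    def predict(self):
        """Predicted (x, y) for the next frame."""
        self.state = self.F @ self.state
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.state[:2]

    def update(self, measured_xy):
        """Correct the state with the fingertip detected in this frame."""
        z = np.asarray(measured_xy, float)
        innovation = z - self.H @ self.state
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.state = self.state + K @ innovation
        self.P = (np.eye(4) - K @ self.H) @ self.P

Matching each detected fingertip to the tracker whose predicted position is nearest (for example with a small assignment search) gives the frame-to-frame correspondences from which the trajectories are built.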
To evaluate the tracking method, the group used a Linux-based
PC with an Intel Pentium III 500-MHz processor, a Hitachi IP5000 image-
processing board, and a Nikon Laird-S270 infrared camera. The tests
involved seven subjects and experimentally evaluated the reliability
improvement gained by considering fingertip correspondences between
successive image frames. The method reliably tracks multiple fingertips and
could prove useful in real-time human-computer interaction applications.
Gesture recognition works well with the tracking method and enables the
user to interact through symbolic gestures while performing direct
manipulation with hands and fingers. Interaction based on direct
manipulation and symbolic gestures works by first determining, from the
measured fingertip trajectories, whether the user’s hand motion represents
direct manipulation or a symbolic gesture. If direct manipulation is detected,
the system selects an operating mode such as rotate, move, or resize, along
with other control-mode parameters. If a symbolic gesture is detected, the
system recognizes the gesture type using a symbolic gesture recognizer, in
addition to recognizing the gesture’s location and size from the trajectories.
The group plans to improve the tracking method’s reliability by
incorporating additional sensors; these are needed because the infrared
camera did not work well on cold hands. One solution is to use a color
camera in addition to the infrared camera. The group
also plans to extend the system to 3D tracking, since the current system
is limited to 2D motion on a desktop.
2.8 Occlusion Detection for Front-Projected
Interactive Displays
Hilario and Cooperstock present an occlusion detection system for
front-projected displays in “Occlusion Detection for Front-Projected
Interactive Displays” [12], published by the Austrian Computer Society.
Occlusion happens in interactive display systems when a user interacts with
the display or inadvertently blocks the projection; it can lead to distortions
in the projected image and to a loss of information in the occluded region.
Detecting occlusion is therefore essential to prevent unwanted effects, and
occlusion detection can also be used for hand and object tracking. Hilario
and Cooperstock detect occlusions with a camera-projector color-calibration
algorithm that estimates the RGB camera response to projected colors,
which allows predicted camera images to be generated for the projected
scene. The occlusion detection algorithm consists of offline camera-projector
calibration followed by online occlusion detection for each video frame.
Calibration is used to construct predicted images of the projected scene; this
is needed because their occlusion detection works by pixel-wise differencing
of predicted and observed camera images. Their system uses a single
camera and projector; it also assumes a planar Lambertian surface with
constant lighting conditions and negligible intra-projector color variation.
Calibration is done in two steps. The first is offline geometric registration,
which computes the transformation between the projector and camera
frames of reference, centering the projected image and aligning it to the
specified world coordinate frame. For geometric registration the paper adopts
the approach of Sukthankar et al., in which the projector prewarping
transformation is obtained by detecting the corners of a projected and a
printed grid in the camera view. The second step of the calibration process is
offline color calibration. Because of device characteristics, a projected
display is unlikely to produce an image whose colors exactly match those of
the source image, so to construct predicted camera images correctly the
color transfer function from projector to camera must be determined. This is
done by iterating through projections of the primary colors at varying
intensities, measuring the RGB camera response, and storing it in a color
lookup table; each response is the average RGB color over the corresponding
patch pixels, measured over multiple camera images. The predicted camera
response to an arbitrary projected color can then be computed by summing
the predicted camera responses to each of its projected color components.
The camera-projector calibration results are then used in the online occlusion
detection, a sketch of which is given below. They state in their preliminary
results that it is critical to perform general occlusion detection for front-
projected display systems.
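As an illustration of the online stage, the sketch below thresholds the pixel-wise color difference between the predicted and observed camera images; the threshold and the morphological clean-up are assumptions, not the authors’ parameters.

import cv2
import numpy as np

def occlusion_mask(predicted_bgr, observed_bgr, thresh=30.0):
    """Binary mask of pixels whose observed color differs substantially
    from the color predicted from the calibrated projected content."""
    diff = cv2.absdiff(predicted_bgr, observed_bgr)
    dist = np.linalg.norm(diff.astype(np.float32), axis=2)  # per-pixel color distance
    mask = (dist > thresh).astype(np.uint8) * 255
    # Remove isolated noise with a small morphological opening.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    return cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)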