Gesture Recognition
George Panotopoulos, Dinkar Gupta, Demetri Psaltis, Pietro Perona
Abstract. Though your personal computer has a processing capacity orders
of magnitude larger than it did some ten years ago, you still use the
same means to interface with it, namely a keyboard and pointing device.
In this project we investigate the design of an interface based on human
gestures. The system we envision is not limited to a particular
user and should be able to learn new gestures.
Motivation and Aims. The improvement of computers' memory and
processing capacity makes it possible to implement new interfaces
that let us interact with computers in a more user-friendly, intuitive
way. We want to implement a system that will allow users to input information
by performing gestures in front of a camera, an accessory that is becoming
increasingly popular in personal computers. This system should have
several important characteristics. First, it should be user-independent,
meaning it should be able to recognize the same gesture as such even
when it is performed by different users. Second, it should be expandable,
meaning that it should have the capability to learn any new gesture
that is presented to it.
Research. A gesture collected by a camera is encoded as a sequence
of frames, each frame containing a number of pixels. This representation
allows us to consider the gesture as a sequence of arrays. By stacking
these arrays we can create a 3D space, where the third dimension is
essentially time. Thus we can incorporate the motion characteristics
of a gesture in this third dimension, and then apply pattern classification
techniques to this 3D space. To achieve user independence,
we should extract features that are common to all users and that remain
fairly constant over repetitions. To achieve expandability,
we should implement a system that selects such features automatically.
In our approach each gesture is encoded as a sequence of 30 frames.
A simple segmentation algorithm is applied to each frame so that the data
is binary, with 1 signaling the presence of the hand and 0 its absence.
The frames are then stacked, producing a binary 3D space.
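
The encoding step above can be sketched as follows. This is a minimal illustration, not the project's code: the source only says the segmentation is "simple", so the fixed intensity threshold, the function names `segment_frame` and `stack_frames`, and the `[t][y][x]` indexing are all assumptions.

```python
from typing import List

Frame = List[List[float]]        # one grayscale frame: rows of pixel intensities
Volume = List[List[List[int]]]   # binary volume indexed as [t][y][x]


def segment_frame(frame: Frame, threshold: float) -> List[List[int]]:
    """Mark a pixel 1 (hand) if it is brighter than the threshold, else 0.

    Thresholding on intensity is an assumed stand-in for the unspecified
    'simple segmentation algorithm'.
    """
    return [[1 if pixel > threshold else 0 for pixel in row] for row in frame]


def stack_frames(frames: List[Frame], threshold: float = 0.5,
                 n_frames: int = 30) -> Volume:
    """Segment every frame and stack the results along the time axis."""
    if len(frames) != n_frames:
        raise ValueError(f"expected {n_frames} frames, got {len(frames)}")
    return [segment_frame(frame, threshold) for frame in frames]
```

Indexing the result as `volume[t][y][x]` makes time the third dimension, so a 3D operator can look at spatial and temporal neighbors of a voxel in the same way.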

For feature selection we used the Förstner
corner and circle operators, appropriately modified to operate in 3D
space. Applying these operators to the training part of our database
produces the data that will be used by our classification algorithm.
Since the operators are very general, they return a very high number
of hits. To make sure these hits are neither duplicates of the same feature
nor noise, we employ a clustering algorithm and accept as features only
those occurrences that cluster well over our training database.
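
The filtering idea above can be illustrated with a small sketch. The source does not name the clustering algorithm, so the greedy radius-based clustering below is an assumed stand-in, and `stable_features`, `radius`, and `min_support` are hypothetical names. A cluster is kept only if hits from a large enough fraction of training samples fall into it.

```python
from math import dist
from typing import List, Tuple

Point = Tuple[float, float, float]  # operator hit at (x, y, t)


def stable_features(hits_per_sample: List[List[Point]], radius: float,
                    min_support: float) -> List[Point]:
    """Return centers of clusters supported by enough training samples.

    hits_per_sample[i] holds the operator hits found in training sample i.
    min_support is the fraction of samples that must contribute to a cluster.
    """
    clusters = []  # each cluster: center, member points, contributing samples
    for sample_idx, hits in enumerate(hits_per_sample):
        for p in hits:
            for c in clusters:
                if dist(p, c["center"]) <= radius:
                    # Join the first cluster within range and update its mean.
                    c["members"].append(p)
                    c["samples"].add(sample_idx)
                    n = len(c["members"])
                    c["center"] = tuple(
                        sum(m[i] for m in c["members"]) / n for i in range(3)
                    )
                    break
            else:
                # No cluster is close enough: start a new one at this hit.
                clusters.append(
                    {"center": p, "members": [p], "samples": {sample_idx}}
                )
    needed = min_support * len(hits_per_sample)
    return [c["center"] for c in clusters if len(c["samples"]) >= needed]
```

Counting distinct samples rather than raw hits means that a burst of hits inside a single recording cannot promote noise to a feature.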
Once the features have been extracted we proceed to the classification
algorithm. Two different approaches are proposed for this step. The
first is similar to the Constellation Model (see Learning Object
Class Models). The other is a Divide and Conquer implementation of Neural
Networks (NN).
For the Constellation Model approach we assume that the location of
the features follows a 3D Gaussian distribution. During training we
estimate the probability of detection of these features, as well as
the means and standard deviations of their distributions. During classification
we extract the possible constellations given a certain gesture and estimate
which one has the highest probability. In our particular case the number
of features makes an exhaustive search over combinations too time-consuming.
Therefore each feature is assigned only to the distribution it most likely
originated from, and then the overall probabilistic score is computed.
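
A rough sketch of this greedy scoring is given below. Several details are assumptions not stated in the source: each feature distribution is treated as an axis-aligned Gaussian (independent x, y, t), an undetected feature contributes log(1 - p_detect), scores are summed in the log domain, and the names `Part`, `gesture_score`, and `classify_constellation` are hypothetical. Detection probabilities must lie strictly between 0 and 1.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence, Tuple

Point = Tuple[float, float, float]  # feature location (x, y, t)


@dataclass
class Part:
    p_detect: float  # probability that this feature is detected
    mean: Point      # mean location estimated during training
    std: Point       # per-axis standard deviation


def log_density(x: Point, part: Part) -> float:
    """Log of an axis-aligned 3D Gaussian density at x."""
    total = 0.0
    for xi, mu, sigma in zip(x, part.mean, part.std):
        z = (xi - mu) / sigma
        total += -0.5 * z * z - math.log(sigma * math.sqrt(2.0 * math.pi))
    return total


def gesture_score(features: Sequence[Point], parts: Sequence[Part]) -> float:
    """Greedy score: each observed feature goes to its most likely part."""
    best: List[Optional[float]] = [None] * len(parts)
    for x in features:
        densities = [log_density(x, part) for part in parts]
        k = max(range(len(parts)), key=densities.__getitem__)
        if best[k] is None or densities[k] > best[k]:
            best[k] = densities[k]  # keep the best match for each part
    score = 0.0
    for part, b in zip(parts, best):
        if b is None:
            score += math.log(1.0 - part.p_detect)  # part was not observed
        else:
            score += math.log(part.p_detect) + b
    return score


def classify_constellation(features: Sequence[Point],
                           models: Dict[str, Sequence[Part]]) -> str:
    """Return the gesture whose model gives the highest greedy score."""
    return max(models, key=lambda name: gesture_score(features, models[name]))
```

The greedy assignment costs one density evaluation per feature and part, instead of a search over all ways of matching features to parts.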
For the Divide and Conquer approach we use Neural Networks to perform
the classification. Since the dimensionality of the input space is large
and we do not want to limit the number of possible networks we would
need a fairly sizeable NN to perform the task with acceptable performance.
Our idea is, instead of asking the general question "which gesture is this?",
to break it down into several simpler questions. Each simple question can
be answered by a simple NN with reasonable performance, and once this
is done we can proceed to the next, more specific question, which of
course depends on the previous answer. Note that the answer to each question
is of probabilistic nature, meaning that the NN indicates how probable
each possible answer is. This procedure can be visualized using a tree
structure. At each node of the tree we ask a question, and depending
on the answer we proceed to one of the children of that node. Each question
is simple enough so that it can be answered by our elementary NN with
an acceptably low probability of error. When we reach the leaves of the
tree we ask the simplest questions that can be asked, namely "is
this gesture X?". If the answer confirms our hypothesis, it determines
the output of the system. If not, we go back to the parent node and follow
the next most probable path. An added advantage of this approach is
that it can be easily matched to reconfigurable processors, such as
the OPGA.
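
The tree traversal with backtracking described above might look like the sketch below. The neural networks are replaced by plain callables that return probabilities, and the `Node` structure, the `accept` threshold, and the name `classify_tree` are illustrative assumptions rather than the project's implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Sequence

Sample = Sequence[float]


@dataclass
class Node:
    # Internal node: `question` returns one probability per child.
    # Leaf: `question` returns a single probability that the sample is `label`.
    question: Callable[[Sample], List[float]]
    children: List["Node"] = field(default_factory=list)
    label: Optional[str] = None


def classify_tree(node: Node, sample: Sample,
                  accept: float = 0.5) -> Optional[str]:
    """Follow the most probable path first; backtrack when a leaf rejects."""
    probs = node.question(sample)
    if not node.children:
        # Leaf: "is this gesture X?" Accept only if the answer is confident.
        return node.label if probs[0] >= accept else None
    # Visit children from most to least probable.
    order = sorted(range(len(node.children)),
                   key=lambda i: probs[i], reverse=True)
    for i in order:
        result = classify_tree(node.children[i], sample, accept)
        if result is not None:
            return result
    return None  # no leaf under this node confirmed its hypothesis
```

Returning `None` from a subtree is what sends the search back to the parent node, which then tries its next most probable child.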
Achievements. We have fully implemented the Constellation Model
approach and tested it using a gesture database composed of 4 subjects
performing 2 gestures each. The resulting correct-classification rate
is 60%, which we attribute mainly to the simplicity of the features we have extracted.
The Divide and Conquer NN was tested on digit classification and was
found to outperform comparable "vanilla flavor" Neural Networks. An
analytical model of the probabilistic behavior of the classification
system was derived and was found to be in good agreement with simulation
results.
Future Research. Now that we have a complete gesture classification system,
we intend to identify its weakest elements and improve them in order
to raise the overall performance of the system. Our first improvement
will be the selection of more complex features that will reduce the
number of hits per sample. Once this is done we want to compare the
merits of the two classification approaches.
Publications/References
Dinkar Gupta, "Computer Gesture Recognition: Using the Constellation Method," Caltech Undergraduate Research Journal, Vol. 1, April 2001.