Abstract.
Using tools from dynamical systems and system identification, we develop
a framework for the study of primitives for human motion, which we refer
to as movemes. The objective is to understand human motion by decomposing
it into a sequence of elementary building blocks that belong to a known
alphabet of dynamical systems. We develop a segmentation and classification
algorithm in order to reduce a complex activity into the sequence of
movemes that have generated it. We test our ideas on data sampled from
five human subjects who were drawing figures using a computer mouse.
Our experiments show that we are able to distinguish between movemes
and recognize them even when they take place in activities containing
an unspecified number of movemes.
Introduction.
Building systems that can detect and recognize human actions and activities
is an important goal of modern engineering. Applications range from
human-machine interfaces to security to entertainment. With the development
of information technology we can expect that computer systems will be
increasingly embedded in our environment, so that human-machine interaction
will need interfaces that are easier to use and more natural. As humans
use their visual system and auditory system to communicate, several
works (see for example [10, 20] and the earlier work on building human-machine
interfaces using vision [7, 14, 23, 24, 21]) ask the question of whether
it is possible to develop computerized equipment able to communicate
with humans in a similar way. As described extensively in [4], there is
also an immediate need for automated surveillance systems in commercial,
law enforcement, and military applications.
A fundamental problem in detecting and recognizing human action is one
of representation. Our point of view is that human activity should be
decomposed into building blocks which belong to an “alphabet”
of elementary actions; for example the activity “answering the
phone” could be decomposed into the sequence “step-step-step-reach-lift”,
where “step”, “reach” and “lift”
may not be further decomposed. We refer to these primitives of motion
as movemes. Our aim is then to build an alphabet of movemes, which one
can compose to represent and describe human motion similar to the way
phonemes are used in speech. The word “moveme”, intended
as a primitive of motion, was coined by the authors of [3], who studied periodic or
stereotypical motions such as walking or running, where the motion is
always the same; their movemes, like phonemes, were therefore
repeatable segments of trajectory. Goncalves et al. [6] studied motions
that were parametrized by an initial condition and a target, such as
“reach” that requires the specification of a target location.
They proposed that movemes ought to be parametrized by goal and style
parameters. Their moveme models are phenomenological and non-causal.
In this paper we attempt to define movemes in terms of causal dynamical
systems. This approach opens the possibility of dealing with problems
like prediction, and leads to more compact models described by a
small number of parameters. Moreover, the
dynamical systems framework allows us to use a set of mathematical tools
for determining analytically the performance of the algorithms proposed.
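As a minimal illustration of what a causal, compactly parameterized motion model can look like, the sketch below fits a scalar first-order model x[k+1] = a*x[k] + b to a trajectory by least squares and recovers the implied target. The model class, the function names, and the numerical values are assumptions made for illustration only; they are not the moveme models defined in this paper.

```python
def fit_first_order(xs):
    """Least-squares fit of x[k+1] = a*x[k] + b to a scalar trajectory."""
    prev, nxt = xs[:-1], xs[1:]
    n = len(prev)
    mean_p = sum(prev) / n
    mean_n = sum(nxt) / n
    var_p = sum((p - mean_p) ** 2 for p in prev)
    cov = sum((p - mean_p) * (q - mean_n) for p, q in zip(prev, nxt))
    a = cov / var_p
    b = mean_n - a * mean_p
    return a, b


def simulate(a, b, x0, steps):
    """Run the causal model forward from the initial condition x0."""
    out = [x0]
    for _ in range(steps):
        out.append(a * out[-1] + b)
    return out


# Hypothetical "reach" along one axis: start at 0, converge toward 10.
trajectory = simulate(0.8, 2.0, 0.0, 20)
a_hat, b_hat = fit_first_order(trajectory)
target_hat = b_hat / (1.0 - a_hat)  # fixed point of the fitted model
```

Two parameters describe the whole trajectory here, and the same fitted model can be run forward with `simulate` to predict future samples, which is the kind of use a non-causal description does not directly support.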
The idea of dynamical primitives of motion has also appeared in neurobiology
studies. Bizzi and Mussa-Ivaldi [2] pose the question whether the motor
behavior of vertebrates is based on simple units (motor primitives)
that can be combined flexibly to accomplish a variety of motor tasks,
and experiments have provided evidence for a modular organization of
the spinal cord in frogs and rats. Mussa-Ivaldi et al. [15] ran experiments
which showed that the fields induced by the focal activation of the
spinal cord follow a principle of vectorial summation, so that a variety
of motor control policies can be obtained from a simple linear combination
of a few control modules. Experimental results in [9] and [5] support
the idea that kinematic and dynamic internal models are utilized in
movement planning and control. The “internal model” hypothesis
proposes that the brain acquires an inverse dynamic model of the object
to be controlled through motor learning after which motor control can
be executed mostly in a feed-forward manner. Thus, the role of dynamics
in the description of human motion seems to be an important one.
What is the alphabet of movemes? Which are the dynamical models that
we should use to represent them? Can a continuous trajectory of a human
body be decomposed automatically into its component movemes? To answer
these questions we take a relatively abstract point of view so as to find
a representation framework that may apply to situations where dynamical
evolution and switching between different dynamical modes come into
play. We introduce a formal definition of a moveme and set up the classification
and segmentation problem that can be appropriately formalized in a dynamical
systems framework. Standard system identification tools and stability
arguments can then be applied to derive analytical error analysis for
the proposed algorithm so as to obtain performance estimates in the
presence of noise and modeling uncertainties. Finally we present some
experimental results on human drawing data. Even though the particular
example considered can be solved in other ways, it is meant to show how
the developed techniques can be used in a practical and simple application
characterized by modeling uncertainty, noise, and subject variability.
The problem
of segmenting data streams originating from different unknown or partially
known processes which alternate in time is a general problem of interest
to various areas, see for example [8, 11, 22]. We propose a solution
to the problem in our particular scenario in which each one of the segments
has been generated from the perturbed version of a linear dynamical
system belonging to a finite known set of possible linear models. By
using system identification techniques [12, 18] and pattern recognition
techniques [1, 19] we develop an off-line joint segmentation and classification
algorithm and provide analytical error analysis. The dynamical systems
representation for describing human motion is not a novel idea; some
sample citations include [17, 13, 16]. Our contribution lies mainly
in the development of a joint classification-segmentation algorithm,
based on a priori given classes of motion (the moveme alphabet), and
characterized by a detailed error analysis.
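To make the joint segmentation and classification problem concrete, the following sketch labels each transition of a scalar trajectory with one model from a known finite set and then merges equal consecutive labels into segments. It uses a generic dynamic-programming formulation with a fixed penalty per switch, which is an assumption chosen for illustration and not the algorithm developed in this paper; the model names, parameters, and penalty value are likewise hypothetical.

```python
def segment_and_classify(xs, models, switch_penalty):
    """Label each transition xs[k] -> xs[k+1] with one known model.

    models maps a name to (a, b) for x[k+1] = a*x[k] + b. Returns a list
    of (name, start, end) with half-open transition index ranges.
    """
    names = list(models)
    n_steps = len(xs) - 1

    def err(k, name):
        a, b = models[name]
        return (xs[k + 1] - (a * xs[k] + b)) ** 2

    def penalty(j, m):
        return 0.0 if j == m else switch_penalty

    cost = {m: err(0, m) for m in names}
    back = []  # back[i][m] = best predecessor label for m at step i + 1
    for k in range(1, n_steps):
        new_cost, choice = {}, {}
        for m in names:
            prev = min(names, key=lambda j: cost[j] + penalty(j, m))
            new_cost[m] = cost[prev] + penalty(prev, m) + err(k, m)
            choice[m] = prev
        back.append(choice)
        cost = new_cost

    # Backtrack from the cheapest final label.
    labels = [min(names, key=lambda m: cost[m])]
    for choice in reversed(back):
        labels.append(choice[labels[-1]])
    labels.reverse()

    # Merge runs of equal labels into segments.
    segments = []
    for k, name in enumerate(labels):
        if segments and segments[-1][0] == name:
            segments[-1] = (name, segments[-1][1], k + 1)
        else:
            segments.append((name, k, k + 1))
    return segments


# Hypothetical two-letter alphabet and a noise-free test trajectory.
models = {"reach_right": (0.8, 2.0), "reach_left": (0.8, -2.0)}
xs = [0.0]
for name in ["reach_right"] * 15 + ["reach_left"] * 15:
    a, b = models[name]
    xs.append(a * xs[-1] + b)

segments = segment_and_classify(xs, models, switch_penalty=1.0)
```

On this noise-free input the result is two segments, `reach_right` over transitions 0 to 15 and `reach_left` over transitions 15 to 30. With noisy data the switch penalty trades off spurious switches against delayed detection, and its value would need to be tuned.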
The experimental results show that the performance of the proposed algorithm
is about 90% on our data set when training and testing are performed
on data coming from distinct subjects. This is evidence
that the movemes considered are subject-invariant on our data set. Subject-invariance
is not a property that we can prove formally and requires an experimental
verification. The results we obtain on 2D motions are encouraging in
this respect.
The formalism that we introduced is directly applicable to the higher-dimensional
case of full-body motion. If one compares it with previous work (e.g.
the linear/quadratic input-output maps of [6]) one notices that our
causal dynamical systems approach requires far fewer parameters for
describing a moveme; hence it promises to require fewer training examples
and allow for better generalization. Challenges in extending our results
to three-dimensional (3D) motion, which the current paper does not address,
include the scalability of the approach, how to segment involuntary
actions, and how to link moveme chains into meaningful activities. Additional
work is also required to address issues like dependency on the number
of training examples, and user-dependence of the movemes in a more complex
and three-dimensional experimental setting.
Furthermore, it would be interesting to generalize the current segmentation
and classification algorithm to the on-line case. In the on-line setting
it would be useful to consider the prediction problem, that is, predicting
the next action (or actions) on the basis of what has already happened.
Moreover, exploring different classes of dynamical systems may help model
human motion with greater accuracy. The extent to which models are
user-independent, and the extent to which we need to train on different
individuals, should also be addressed.
At a higher
level of abstraction the idea of finding a “language” in
which to specify what is possible and what is not seems to be promising.
For example we know that in the sequence “step-step-reach-lift”
for answering the phone, it is not possible to lift the phone before
having reached it. These kinds of conditions could determine a model
which gives a structure to the way in which movemes can be composed.
A clear advantage of having such a model is that it could give feedback
to the segmentation and classification algorithm so as to increase its
robustness.
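One simple way to encode such composition constraints is a table of allowed successors for each moveme, against which a candidate sequence can be checked. The sketch below is only an illustration of that idea; the table contents and the names `ALLOWED_NEXT` and `is_valid_sequence` are hypothetical and are not part of the framework presented here.

```python
# Hypothetical successor table: which moveme may follow which.
ALLOWED_NEXT = {
    "step": {"step", "reach"},
    "reach": {"lift"},
    "lift": set(),
}


def is_valid_sequence(movemes, allowed_next=ALLOWED_NEXT):
    """Return True if every consecutive pair of movemes is allowed."""
    return all(
        nxt in allowed_next.get(cur, set())
        for cur, nxt in zip(movemes, movemes[1:])
    )


ok = is_valid_sequence(["step", "step", "reach", "lift"])
bad = is_valid_sequence(["step", "lift", "reach"])
```

A segmentation and classification stage could use such a check to reject or down-weight label sequences that violate the table, for example a "lift" that is not preceded by a "reach".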
References
[1] C.M. Bishop. Neural Networks for Pattern Recognition. Clarendon,
Oxford, 1995.
[2] E. Bizzi and F.A. Mussa-Ivaldi. Toward a neurobiology of coordinate
transformations. New Cog. Neuroscience, MIT Press, Cambridge, MA:489-500,
1999.
[3] C. Bregler and J. Malik. Learning and recognizing human dynamics
in video sequences. In Proc. IEEE Conference on Computer Vision and
Pattern Recognition, pages 568-574, Puerto Rico, 1997.
[4] R.T. Collins, A. J. Lipton, and T. Kanade. Introduction to the special
section on video surveillance. IEEE Trans. on Pattern Analysis and Machine
Intelligence, 22:745-746, August 2000.
[5] J.R. Flanagan and A.M. Wing. The role of internal models in motion
planning and control: evidence from grip force adjustments during movements
of hand-held loads. The Journal of Neuroscience, 17:1519-1528, 1997.
[6] L. Goncalves, E. Di Bernardo, and P. Perona. Reach out and touch
space (motion learning). In Proc. of the Third International Conference
on Automatic Face and Gesture Recognition, pages 234-239, Nara, Japan,
April 14-16 1998.
[7] L. Goncalves, E. Di Bernardo, E. Ursella, and P. Perona. Monocular
tracking for human arm in 3d. In Proc. of the 7th Int. Conference on
Computer Vision, ICCV, pages 764-770, 1995.
[8] F. Gustafsson. Adaptive Filtering and Change Detection. John Wiley
& Sons, 2000.
[9] M. Kawato. Internal models for motor control and trajectory planning.
Current Opinion in Neurobiology, 9:718-727, 1999.
[10] I. Laptev and T. Lindeberg. Tracking of multi-state hand models
using particle filtering and a hierarchy of multi-scale image features.
In IEEE Workshop on Scale-Space and Morphology, pages 63-74, Vancouver,
Canada, July 2001.
[11] M. Lavielle. Optimal segmentation of random processes. IEEE Trans.
on Signal Processing, 46:1365-1373, May 1998.
[12] L. Ljung. System Identification. Prentice Hall, New Jersey, 1999.
[13] C. Lu, H. Liu, and N.J. Ferrier. Multidimensional motion segmentation
and identification. In Proc. of the IEEE Conference on Computer Vision
and Pattern Recognition, pages 629-636, Hilton Head Island, South Carolina,
2000.
[14] M.E. Munich and P. Perona. Visual input for pen-based computers.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 24:313-328,
March 2002.
[15] F.A. Mussa-Ivaldi, S.F. Giszter, and E. Bizzi. Linear combinations
of primitives in vertebrate motor control. Proc. of the National Academy
of Science, 91:7534-7538, 1994.
[16] D. Ormoneit, T. Hastie, and M.J. Black. Functional analysis of
human motion data. In Proc. 5th World Congress of the Bernoulli Society
for Probability and Mathematical Statistics and 63rd Annual Meeting
of the Institute of Mathematical Statistics, Guanajuato, Mexico, 2000.
[17] V. Pavlovic and James M. Rehg. Impact of dynamic model learning
on classification of human motion. In IEEE Conf. Computer Vision
and Pattern Recognition, Hilton Head Island, 2000.
[18] T. Söderström and P. Stoica. System Identification. Prentice
Hall. Hemel Hempstead, 1989.
[19] V. Vapnik. The Nature of Statistical Learning Theory. Springer
Verlag, 1995.
[20] S. Waldherr, S. Thrun, R. Romero, and D. Margaritis. Template-based
recognition of pose and motion gestures on a mobile robot. In Proc.
of the AAAI 15th National Conference on Artificial Intelligence, pages
977-982, 1998.
[21] P. Wellner. The digital desk calculator: Tactile manipulator on
a desk top display. In Proc. of the ACM Symposium on User Interface
and Technology, pages 27-33, Hilton Head, November 1991.
[22] A.S. Willsky and H.L. Jones. A generalized likelihood ratio approach
to the detection and estimation of jumps in linear systems. IEEE Trans.
on Automatic Control, 21:108-112, February 1976.
[23] A. Wilson and A. Bobick. Learning visual behavior for gestures
analysis. In Proc. of IEEE Symposium on Computer Vision, pages 229-234,
Coral Gables, FL, November 1995.
[24] Y. Yacoob and L. Davis. Recognizing human facial expressions from
long image sequences using optical flow. IEEE Trans. on Pattern Analysis
and Machine Intelligence 18(6), pages 636-642, 1996.