Caltech
Center for Neuromorphic Systems Engineering

Object Categorization: Unsupervised One-Shot Learning
Fei-Fei Li, Rob Fergus, Pietro Perona

Abstract. Learning visual models of object categories notoriously requires thousands of training examples; this is due to the diversity and richness of object appearance, which requires models containing hundreds of parameters. We present a method for learning object categories from just a few images (1 to 5). It is based on incorporating “generic” knowledge which may be obtained from previously learnt models of unrelated categories. We operate in a variational Bayesian framework: object categories are represented by probabilistic models, and “prior” knowledge is represented as a probability density function on the parameters of these models. The “posterior” model for an object category is obtained by updating the prior in the light of one or more observations. Our ideas are demonstrated on four diverse categories (human faces, airplanes, motorcycles, spotted cats). Initially three categories are learnt from hundreds of training examples, and a “prior” is estimated from these. Then the model of the fourth category is learnt from 1 to 5 training examples, and is used for detecting new exemplars in a set of test images.

Motivation. It is believed that humans can recognize between 5,000 and 30,000 object categories. Informal observation tells us that learning a new category is both fast and easy, sometimes requiring very few training examples: given 2 or 3 images of an animal you have never seen before, you can usually recognize it reliably later on. This is to be contrasted with the state of the art in computer vision, where learning a new category typically requires thousands, if not tens of thousands, of training images. These have to be collected, and sometimes manually segmented and aligned -- a tedious and expensive task.

Computer vision researchers are being neither lazy nor unreasonable. The appearance of objects is diverse and complex. Models that are able to represent categories as diverse as frogs, skateboards, cell-phones, shoes and mushrooms need to incorporate hundreds, if not thousands, of parameters. A well-known rule of thumb says that the number of training examples has to be 5 to 10 times the number of object parameters, hence the large training sets. The penalty for using small training sets is overfitting: while in-sample performance may be excellent, generalization to new examples is terrible. As a consequence, current systems are impractical where real-time user interaction is required, e.g. searching an image database. By contrast, humans clearly demonstrate the ability to learn from very few examples. Does the human visual system violate what would appear to be a fundamental limit of learning? Could computer vision algorithms be similarly efficient? One possible explanation of human efficiency is that when learning a new category we take advantage of prior experience. While we may not have seen ocelots before, we have seen cats, dogs and chairs and, more importantly, the variability in their appearance, which gives us important information on what to expect in a new category. This may allow us to learn new categories from few(er) training examples.

We explore this hypothesis in a Bayesian framework. Bayesian methods allow us to incorporate prior information about objects into a “prior” probability density function which is updated, when observations become available, into a “posterior” to be used for recognition. Bayesian methods are not new to computer vision; however, they have not been applied to the task of learning models of object categories. Here we use “constellation” probabilistic models of object categories, as developed by Burl et al. and improved by Weber et al. and Fergus et al. While they maximized model likelihood to learn new categories, we use variational Bayesian methods that incorporate “general” knowledge of object categories. We show that our algorithm is able to learn a new, unrelated category using one or a few training examples.
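The prior-to-posterior update described above can be illustrated with a deliberately simplified sketch. It is not the constellation model or the variational Bayesian procedure used in this work: it assumes a single one-dimensional Gaussian feature with a conjugate Normal-Inverse-Gamma prior, and every name and hyperparameter value (`NIG`, `posterior`, `ml_estimate`, the numbers in the example) is hypothetical. It only shows why a prior keeps learning from one example well-posed, whereas the maximum-likelihood variance estimate collapses to zero.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class NIG:
    """Normal-Inverse-Gamma density over the (mean, variance) of a 1-D Gaussian.

    Hypothetical stand-in for a prior over model parameters; the real
    model in the text is far richer (shape and appearance of several parts).
    """
    mu: float     # expected value of the mean
    kappa: float  # pseudo-count supporting that mean
    alpha: float  # shape of the Inverse-Gamma on the variance
    beta: float   # scale of the Inverse-Gamma on the variance

    def expected_variance(self) -> float:
        # E[sigma^2] = beta / (alpha - 1), defined only for alpha > 1.
        if self.alpha <= 1:
            raise ValueError("expected variance requires alpha > 1")
        return self.beta / (self.alpha - 1)


def posterior(prior: NIG, xs: Sequence[float]) -> NIG:
    """Standard conjugate update of `prior` by the observations `xs`."""
    n = len(xs)
    if n == 0:
        return prior  # no data: the posterior is the prior
    xbar = sum(xs) / n
    scatter = sum((x - xbar) ** 2 for x in xs)
    kappa_n = prior.kappa + n
    return NIG(
        mu=(prior.kappa * prior.mu + n * xbar) / kappa_n,
        kappa=kappa_n,
        alpha=prior.alpha + n / 2,
        beta=prior.beta + 0.5 * scatter
        + prior.kappa * n * (xbar - prior.mu) ** 2 / (2 * kappa_n),
    )


def ml_estimate(xs: Sequence[float]) -> Tuple[float, float]:
    """Maximum-likelihood mean and variance; the variance is 0 when len(xs) == 1."""
    n = len(xs)
    mean = sum(xs) / n
    return mean, sum((x - mean) ** 2 for x in xs) / n


if __name__ == "__main__":
    general_prior = NIG(mu=0.0, kappa=1.0, alpha=2.0, beta=1.0)  # assumed values
    one_shot = posterior(general_prior, [2.0])  # a single training example
    print("Bayesian mean / expected variance:", one_shot.mu, one_shot.expected_variance())
    print("ML mean / variance:", ml_estimate([2.0]))
```

With one observation the posterior mean lies between the prior mean and the data point and the expected variance stays positive, while the maximum-likelihood variance is exactly zero, which is the degenerate case a pure likelihood fit cannot handle.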

Results. Our experiments demonstrate the benefit of using prior information in learning new object categories. The following figure shows models learnt by the Bayesian One-Shot algorithm on one of the four datasets. It is important to notice that the “priors” alone are not sufficient for object categorization (Panel (a)). But by combining this general knowledge with the training data, the algorithm is capable of learning a sensible model from even 1 training example. For instance, in Panel (c), we see that the 4-part model has captured the essence of a face (e.g. eyes and nose). In this case it achieves a recognition rate as high as 82%, given only 1 training example. Our algorithm also learns significantly faster, because it uses a much smaller number of training examples.

Figure 1. Summary of face model. (a) Test performance of the algorithm given 0 to 6 training images (red). Zero training images is the case of using the prior model only. Note that this “general” information by itself is not sufficient for categorization. Each performance value is obtained from 10 repeated runs with different randomly drawn training and testing images. Error bars show one standard deviation from the mean performance. This result is compared with the maximum-likelihood (ML) method (green). Note that ML cannot learn the degenerate case of a single training image. (b) Sample ROC curves for the algorithm (red) compared with the ML algorithm (green). The curves shown here use typical models drawn from the repeated runs summarized in (a). Details are shown in (c)-(f). (c) A typical model learnt with 1 training example. The left panel shows the shape component of the model. The four +'s and ellipses indicate the mean and variance in position of each part. The covariance terms are not shown. The top right panel shows the detected feature patches in the training image closest to the mean of the appearance densities for each of the four parts. The bottom right panel shows the mean appearance distributions for the first 3 PCA dimensions. Each color indicates one of the four parts. Note that the shape and appearance distributions are much more “model specific” compared to the “general” prior model. (e) Some sample foreground test images for the model learnt in (c), with a mix of correct and incorrect classifications. The pink dots are features found on each image and the colored circles indicate the best hypothesis in the image. The size of the circles indicates the score of the hypothesis (the bigger the better). (d) and (f) are similar to (c) and (e), but the model is learnt from 5 training images.
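For readers unfamiliar with ROC curves such as those in panel (b), the following is a generic sketch of how one can be traced from per-image detection scores. It is not the evaluation code behind the figure: the function names, the thresholding rule (an image is labelled "object present" when its score is at or above the threshold) and the scores in the example are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def roc_points(fg_scores: Sequence[float],
               bg_scores: Sequence[float]) -> List[Tuple[float, float]]:
    """Return (false-alarm rate, detection rate) pairs, one per threshold.

    fg_scores: scores of test images that contain the object (hypothetical).
    bg_scores: scores of test images that do not contain it (hypothetical).
    """
    # Sweep the threshold from the highest score down to the lowest.
    thresholds = sorted(set(fg_scores) | set(bg_scores), reverse=True)
    points = [(0.0, 0.0)]  # threshold above every score: nothing detected
    for t in thresholds:
        detection_rate = sum(s >= t for s in fg_scores) / len(fg_scores)
        false_alarm_rate = sum(s >= t for s in bg_scores) / len(bg_scores)
        points.append((false_alarm_rate, detection_rate))
    return points


def area_under_curve(points: Sequence[Tuple[float, float]]) -> float:
    """Trapezoidal area under the ROC curve (1.0 means perfect separation)."""
    pts = sorted(points)
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))


if __name__ == "__main__":
    foreground = [0.9, 0.8, 0.4]  # made-up scores, not data from the figure
    background = [0.7, 0.3, 0.2]
    curve = roc_points(foreground, background)
    print(curve)
    print("area under curve:", area_under_curve(curve))
```

Repeating such an evaluation over several random train/test splits and reporting the mean and standard deviation of a summary statistic gives error bars of the kind shown in panel (a).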

Conclusion. We have demonstrated that given a single example (or just a few), we can learn a new object category. This is beyond the capability of existing algorithms. In order to explore this idea we have developed a Bayesian learning framework based on representing object categories with probabilistic models. “General” information coming from previously learnt categories is represented with a suitable prior probability density function on the parameters of such models. Our experiments, conducted on realistic images of four categories, are encouraging in that they show that very few (1 to 5) training examples produce models that are already able to discriminate images containing the desired objects from images not containing them with error rates around 8-22%.

A number of issues are still unexplored. First and foremost, more comprehensive experiments need to be carried out on a larger number of categories, in order to understand how prior knowledge improves with the number of known categories, and how categorical similarity affects the process. Second, in order to make our experiments practical we have simplified the probabilistic models that are used for representing objects. For example, a probabilistic model for occlusion is not implemented in our experiments. Third, it would be highly valuable for practical applications (e.g. a vehicle roving in an unknown environment) to develop an incremental version of our algorithm, where each training example incrementally updates the probability density function defined on the parameters of each object category. In addition, the minimal training set and learning time that appear to be required by our algorithm make it possible to conceive of visual learning applications where real-time training and user interaction are important.
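The incremental scheme mentioned as future work can be pictured with a toy sketch in which each posterior becomes the prior for the next example. This is an assumption-laden illustration, not a proposed implementation: it treats a single Gaussian mean with known observation noise, and the names `GaussianBelief`, `update` and `batch_update` are hypothetical. For this conjugate case, processing examples one at a time gives the same density as processing them together.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class GaussianBelief:
    """Gaussian density over an unknown mean parameter (toy example)."""
    mean: float
    variance: float


def update(belief: GaussianBelief, x: float, noise_variance: float) -> GaussianBelief:
    """Fold one observation into the belief; the result is the next prior."""
    precision = 1.0 / belief.variance + 1.0 / noise_variance
    mean = (belief.mean / belief.variance + x / noise_variance) / precision
    return GaussianBelief(mean, 1.0 / precision)


def batch_update(belief: GaussianBelief, xs: Sequence[float],
                 noise_variance: float) -> GaussianBelief:
    """Fold all observations in at once, for comparison with `update`."""
    precision = 1.0 / belief.variance + len(xs) / noise_variance
    mean = (belief.mean / belief.variance + sum(xs) / noise_variance) / precision
    return GaussianBelief(mean, 1.0 / precision)


if __name__ == "__main__":
    belief = GaussianBelief(mean=0.0, variance=1.0)  # assumed "general" prior
    for example in [2.0, 4.0]:                        # examples arrive one by one
        belief = update(belief, example, noise_variance=1.0)
        print(belief)
    print(batch_update(GaussianBelief(0.0, 1.0), [2.0, 4.0], noise_variance=1.0))
```

Whether an equally simple recursion exists for the full object model is exactly the open question raised above; the sketch only shows the pattern such an update would follow.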

References

  1. M.C. Burl, M. Weber, and P. Perona, “A probabilistic approach to object recognition using local photometry and global geometry”, Proc. ECCV, pp. 628-641, 1998.
  2. H. Attias, “Inferring parameters and structure of latent variable models by variational Bayes”, Proc. 15th Conference on Uncertainty in Artificial Intelligence, pp. 21-30, 1999.
  3. R. Fergus, P. Perona and A. Zisserman, “Object class recognition by unsupervised scale-invariant learning”, Proc. CVPR, vol. 2, pp. 264-271, 2003.
  4. L. Fei-Fei, R. Fergus, and P. Perona, “A Bayesian Approach to Unsupervised One-Shot Learning of Object Categories”, to appear in Proc. ICCV, 2003.
