Notes on Computer Vision by Richard Szeliski


A little over a year ago I embarked on a project that involved computer vision. Not having much experience in the field, I read as many books as I could find on the subject. The most useful one was Richard Szeliski’s Computer Vision (2010), which is free to download.

Below are notes that I wrote by hand at the time. They may be skewed towards my goal of extracting measurements from a human body in certain poses. That goal suggested many tangential ways of tackling the problem, but the first difficulty was merely segmenting the body from the background. What I discovered was that the problems that appear simple in computer vision are actually extremely difficult.

We initially explored reconstructing a 3D model of the human body, but none of the approaches we tried, across a variety of scanners, was accurate enough compared to hand measurements. It also made no business sense to pursue: an outfitter had to be physically present anyway, so the process was redundant as well as less accurate. I’ll say more about 3D modeling and reconstruction in another post.

Not all computer vision techniques utilize machine learning, but I soon discovered that for our purposes, it was necessary. However, in order to understand a lot of the concepts of machine learning as it relates to computer vision, one needs to be familiar with the probabilistic models of computer vision first. As Szeliski says:

Beginners to computer vision often fail because they never define the objective of the algorithm. We write down not the steps to solve the problem but the problem itself. We write down the formulae for all the probability distributions that define the problem and then perform operations on those distributions in order to provide answers. One always has at the back of one’s mind a collection of models known to be soluble, and one always tries to find a model for one’s problem because the space of possible models is so much vaster than the space of ones in which the solution is tractable.

Problem Definition

So in my case, the problem definition was as follows:

Image segmentation needs to take into account all possible environmental variables, i.e. light, shadows, objects, obstructions, occlusions, etc. Since this is not possible, an intermediary step that involves guiding the user into the desired pose and only extracting relevant pixels can be explored. Since the expected poses are standard, i.e. T-pose, A-pose, side pose, muscle pose, etc., a step-by-step process of matching shapes for specified poses can be achieved, in which the user is guided into an active contour snake.

Proposed Problem Solutions

It turned out that snakes or active contour models cannot model articulated objects like the human body, so I explored an extended list as follows:

  1. Parametric Contour Models (Active Contour Models or Snakes)
    • provides weak a priori geometric information
    • assumes that we know the topology of the contour (open or closed), that it is smooth, etc.
    • one approach is to apply the Canny edge detector and then compute a distance transform
    • apply a shape prior that favors smooth contours with low curvature
    • inference – observe a new image x and try to fit points {wi} on the contour so that they describe the image as well as possible using a maximum a posteriori criterion
  2. Shape Template Models
    • impose the strongest possible form of geometric information; we assume that we know the shape exactly and try to identify its position, scale, and orientation in the image
    • Canny edge detector, distance-transformed image, fitting the shape template
    • try to initialize position based on face detector first
    • ICP (iterative closest point) – associate each landmark point with the nearest edge point along the normal direction of the contour
  3. Statistical Shape Models (Point Distribution Models)
    • generalized Procrustes analysis – if we know the mean shape, then it is easy to estimate the parameters of the transformations that best map the observed points to the mean.
  4. Subspace Shape Models
    • PPCA – probabilistic principal component analysis
    • active shape models
    • active appearance models – describes the intensity of the pixels and the shape of the object simultaneously
  5. Gaussian Process Latent Variable Model
    • GPLVM (Gaussian process latent variable model) builds on the Gaussian process regression model
    • approach to describe more complicated densities in articulated models
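
The alignment step behind the Procrustes item above can be sketched in a few lines. This is a minimal illustration, not the method from the book: it assumes 2D landmarks given in corresponding order, treats each point as a complex number so that scale and rotation collapse into one complex factor, and ignores reflections. The function name `align_to_mean` and the toy shapes are my own placeholders.

```python
def align_to_mean(shape, mean_shape):
    """Least-squares similarity transform (scale, rotation, translation)
    mapping `shape` onto `mean_shape`. Both are lists of (x, y) landmarks
    in corresponding order."""
    z = [complex(x, y) for x, y in shape]
    m = [complex(x, y) for x, y in mean_shape]
    z_center = sum(z) / len(z)
    m_center = sum(m) / len(m)
    z0 = [p - z_center for p in z]  # remove translation
    m0 = [p - m_center for p in m]
    # Closed-form minimizer of sum |a * z0 - m0|^2 over complex a:
    # abs(a) is the scale, the angle of a is the rotation.
    a = sum(p.conjugate() * q for p, q in zip(z0, m0)) / sum(abs(p) ** 2 for p in z0)
    aligned = [a * p + m_center for p in z0]
    return [(p.real, p.imag) for p in aligned], a


# Toy data: a unit square as the mean shape, and the same square
# rotated by 90 degrees, scaled by 2, and shifted by (3, 4).
mean_shape = [(0, 0), (1, 0), (1, 1), (0, 1)]
observed = [(3, 4), (3, 6), (1, 6), (1, 4)]

aligned, a = align_to_mean(observed, mean_shape)
```

Generalized Procrustes analysis repeats this step for every training shape and then re-estimates the mean, alternating until the mean stops changing.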

Computer Vision Notes

So let’s go ahead with the notes so we can make sense of the list above.



  1. Segmentation – the task of finding groups of pixels that “go together”. In statistics, this is known as cluster analysis. Algorithms include active contours, level sets, region splitting and merging, mean shift (mode finding), normalized cuts (splitting based on pixel similarity metrics), and binary Markov random fields solved using graph cuts. (p. 269)
  2. Edge detection – see Arbeláez, Maire, Fowlkes et al. (p. 244)
  3. Feature detectors – repeatability tests that apply rotations, scale changes, illumination changes, viewpoint changes, and added noise find that the improved (Gaussian derivative) version of the Harris operator with σd = 1 (scale of the derivative Gaussian) and σi = 2 (scale of the integration Gaussian) works best. (p. 215)
  4. Matting – the process of extracting an object from an original image; inserting it into another image is called compositing. The intermediate representation used for the foreground object between these two stages is called an alpha-matted color image. In addition to the three RGB color channels, it has an alpha channel α (or A) that describes the relative amount of opacity or fractional coverage at each pixel. Opacity is the opposite of transparency: pixels within the object are fully opaque (α = 1), while pixels fully outside the object are transparent (α = 0). (p. 105)
  5. Edges are features that can be matched based on their orientation and local appearance (edge profiles) and can be good indicators of object boundaries and occlusion events in image sequences. (p. 207)
  6. Localized features are often called keypoint features or interest points (or even corners) and are often described by the appearance of patches of pixels surrounding the point location. (p. 207)
  7. Think back from the problem at hand to suitable techniques, rather than to grab the first technique you may have heard of. (p. 9)
    • Come up with a detailed problem definition and decide on the constraints and specifications for the problem
    • Find out which techniques are known to work, implement a few, evaluate performance, make a selection
    • Active Contours
      1. Snakes – an energy-minimizing, two-dimensional spline curve that evolves (moves) towards image features such as strong edges.
      2. Intelligent Scissors – allows the user to sketch in real time a curve that clings to object boundaries.
      3. Level Sets – evolve the curve as the zero-set of a characteristic function, which allows them to easily change topology and incorporate region-based statistics.
    • Split and Merge
      1. Watershed – segments an image into several catchment basins, which are the regions of an image.
      2. Region Splitting (divisive clustering) – computes a histogram for the whole image and then finds a threshold that best separates the large peaks in the histogram.
      3. Region Merging (agglomerative clustering) – links clusters together based on the distance between their closest points (single-link clustering), their farthest points (complete-link clustering), or something in between.
      4. Graph-based segmentation – a merging algorithm that is based on two cues, namely gray-level similarity and texture similarity.
    • Mean shift and mode finding – models the feature vectors associated with each pixel (e.g. color & position) as samples from an unknown probability density function and then tries to find clusters (modes) in the distribution.
      1. K-means and mixtures of Gaussians – iteratively update the centers of k clusters, where k (the number of clusters) is chosen in advance
      2. Mean shift – models the probability density distribution using a smooth continuous non-parametric model
    • Normalized cuts – examines the affinities (similarities) between nearby pixels and tries to separate groups that are connected by weak affinities.
    • Graph cuts and energy-based methods – by restricting the boundary measurements to be between immediate neighbors and computing region membership statistics by summing over pixels, we can formulate segmentation as a classic pixel-based energy function.
  8. Optical Flow – compute an independent estimate of motion at each pixel
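
To make the mean shift item above concrete, here is a minimal sketch of mode finding on one-dimensional samples. It assumes a Gaussian kernel with a fixed bandwidth and a fixed number of iterations; the function name and the toy samples are illustrative only. For real images the samples would be per-pixel feature vectors (e.g. color and position) rather than single numbers.

```python
import math


def mean_shift_mode(samples, start, bandwidth=1.0, iters=50):
    """Climb from `start` to a nearby mode of the kernel density estimate
    built from `samples`, using a Gaussian kernel."""
    x = start
    for _ in range(iters):
        # Weight every sample by its kernel response at the current position.
        weights = [math.exp(-0.5 * ((s - x) / bandwidth) ** 2) for s in samples]
        # Move to the weighted mean of the samples (the mean shift step).
        x = sum(w * s for w, s in zip(weights, samples)) / sum(weights)
    return x


# Two well-separated groups of samples, around 1.0 and around 5.0.
samples = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]

mode_a = mean_shift_mode(samples, start=0.0)  # climbs to the lower mode
mode_b = mean_shift_mode(samples, start=6.0)  # climbs to the upper mode
```

Unlike k-means, nothing here fixes the number of clusters in advance: every starting point that climbs to the same mode belongs to the same segment, and the bandwidth decides how many modes survive.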

Computer Vision and Machine Learning

  1. Instead of pattern recognition we interpret new image data based on prior experience of images in which contents were known
    • In learning we model the relationship between the image data and the scene content.
    • In inference we exploit this relationship to predict the contents of new images.
  2. Visual data x is used to infer the state of the world w. When the state of w is continuous, we call the inference process regression. When the state is discrete, we call it classification.
  3. The models that relate the data x to the world w come in two kinds:
    • model the contingency of the world state on the data, Pr(w|x) (discriminative)
    • model the contingency of the data on the world state, Pr(x|w) (generative)
  4. Building a model that relates every pixel to all possible world states involves too many parameters to be practical.
  5. The solution is to build local models that are connected to one another so that nearby elements can help to disambiguate one another.
  6. Imposing shape priors constrains the possible contour shapes and reduces the search space (top-down), as opposed to a bottom-up approach such as trying to find edges, which often yields extraneous edge fragments or misses relevant edges.
    • A shape can be defined using a set of landmark points. The connectivity of landmark points varies according to the model.
  7. Pre-processing (see prev. segmentation section)
    • Per-pixel transformations – whitening, histogram equalization, linear filtering, local binary patterns, texton maps
    • Edges, corners, and interest points – Canny edge detector, Harris corner detector, bag of words descriptor, shape context descriptor
    • Dimensionality reduction – approximation with a single number, principal component analysis, the K-means algorithm.
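
As one example of the per-pixel transformations listed above, histogram equalization can be sketched as follows. This is a simplified version that assumes a flat list of integer gray levels and uses the common cumulative-histogram mapping; the function name and the toy pixel values are placeholders.

```python
def equalize_histogram(pixels, levels=256):
    """Spread gray levels over the full range by mapping each value
    through the normalized cumulative histogram."""
    counts = [0] * levels
    for v in pixels:
        counts[v] += 1

    # Cumulative histogram: cdf[v] = number of pixels with value <= v.
    cdf = []
    running = 0
    for c in counts:
        running += c
        cdf.append(running)

    cdf_min = next(c for c in cdf if c > 0)  # count at the darkest level present
    span = len(pixels) - cdf_min
    if span == 0:  # constant image: nothing to spread out
        return list(pixels)
    return [round((cdf[v] - cdf_min) / span * (levels - 1)) for v in pixels]


# A low-contrast toy "image" that only uses levels 50 to 150.
pixels = [50, 50, 100, 150, 150, 150]
equalized = equalize_histogram(pixels)  # [0, 0, 64, 255, 255, 255]
```

The darkest level present maps to 0 and the brightest to 255, which reduces the effect of overall lighting differences before edges or descriptors are computed.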


Now that was a lot of acronyms and terms that I didn’t take the time to define properly, but the content is much too dense to whittle down into one blog post, so I’ll leave it to you to look up the details. However, as it relates to our problem, we can view the posterior probability distribution Pr(w|x) over possible states w as a way to handle the ambiguity in the measurements x. To compute it, we need three things: a model with parameters θ that mathematically relates the visual data x and the world state w; a learning algorithm that fits the parameters θ using paired training examples {xi, wi}; and an inference algorithm that takes a new observation x and uses the model to return the posterior Pr(w|x,θ) over the world state w.

The preprocessing stage seems to call for a generative model, in which the likelihood Pr(x|w) of the observations is modeled for each world state w in a discrete set; here, the states are the poses to be classified before segmentation.
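
A generative pose classifier along these lines could be sketched as follows. Everything specific here is an assumption made for illustration: the observation x is a single number (say, a silhouette width-to-height ratio), each pose gets a one-dimensional Gaussian likelihood Pr(x|w), and the training values are made up.

```python
import math


def fit_gaussian(values):
    """Maximum-likelihood mean and variance of a list of numbers."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var


def gaussian_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)


def learn(training):
    """Learning: fit the prior Pr(w) and the likelihood Pr(x|w) per pose.
    `training` maps a pose label to a list of observed feature values."""
    total = sum(len(values) for values in training.values())
    return {label: (len(values) / total, *fit_gaussian(values))
            for label, values in training.items()}


def infer(x, model):
    """Inference: Bayes' rule turns Pr(x|w) and Pr(w) into the posterior Pr(w|x)."""
    joint = {label: prior * gaussian_pdf(x, mean, var)
             for label, (prior, mean, var) in model.items()}
    evidence = sum(joint.values())
    return {label: p / evidence for label, p in joint.items()}


# Made-up training features for two poses.
training = {
    "T-pose": [1.9, 2.0, 2.1, 2.0],
    "A-pose": [1.0, 1.1, 0.9, 1.0],
}
model = learn(training)
posterior = infer(1.9, model)  # strongly favors "T-pose"
```

The point of the generative route is that the training set is described pose by pose, and the classification falls out of Bayes' rule at inference time.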

Suppose a low-dimensional vector x that represents the shape of the contour has been extracted. Our goal is to use this data vector to predict a second vector that contours the features of the body for measurement extraction. In the simplest case this is regression: estimating a univariate quantity w from continuous observed data x.
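
The simplest version of this regression can be sketched as a straight-line fit. The sketch assumes a single scalar feature x (for example a contour width in pixels) rather than a full shape vector, a linear relationship with Gaussian noise, and made-up training pairs; none of these come from the actual project.

```python
def learn_linear(xs, ws):
    """Learning: least-squares fit of w = intercept + slope * x, plus the
    maximum-likelihood noise variance around that line."""
    n = len(xs)
    x_mean = sum(xs) / n
    w_mean = sum(ws) / n
    slope = (sum((x - x_mean) * (w - w_mean) for x, w in zip(xs, ws))
             / sum((x - x_mean) ** 2 for x in xs))
    intercept = w_mean - slope * x_mean
    noise_var = sum((w - (intercept + slope * x)) ** 2
                    for x, w in zip(xs, ws)) / n
    return intercept, slope, noise_var


def infer(x, theta):
    """Inference: Pr(w|x, theta) is a normal distribution; return its mean
    and variance."""
    intercept, slope, noise_var = theta
    return intercept + slope * x, noise_var


# Made-up pairs: contour width in pixels, hand measurement in cm.
xs = [100, 120, 140, 160]
ws = [81.0, 95.0, 113.0, 127.0]

theta = learn_linear(xs, ws)
mean, var = infer(150, theta)  # predicted measurement and its uncertainty
```

Returning a variance alongside the prediction matters here: it says how far the estimate can be trusted compared to a hand measurement.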

Fast Forward

All of the above sounds nice in theory and it was a good exercise in getting to know the limitations of applying computer vision to my specific problem. Looking back, it was easy for me to get so wrapped up in finding an elegant solution to my problem that I didn’t realize the amount of manual work that would be involved in initially coming up with the training set.

Prior to getting started on any computer vision project, if there is going to be a machine learning component involved, start with the manual process of creating the training set. The great thing about the manual process is that you can do it with simple tools and if the outcome is within the desired range, work is being done both for the present needs and the future needs of creating a model for the training set.