My piece of vision

3D Head Tracking in Video: How a simple model can do big things.

Written By: urbeller - May 14, 2013

In this post, I will describe my own implementation of a head tracker. 3D Head Tracking (HT) consists of inferring the 3D orientation and displacement of the head, often from a single video source. Here, the video source will be a Logitech C910 webcam. Of course, any webcam will do. Video grabbing and image processing will be done using the OpenCV library.

The outline of the algorithm is as follows:

  1. Grab a frame and detect 2D features.
  2. Initialize the head pose.
  3. Compute 3D features→FTold.
  4. Grab a frame and detect 2D features.
  5. Compute 3D features →FTnew.
  6. Compute motion that registers FTnew→FTold.
  7. Update head pose.
  8. FTold = FTnew and go to 4.

At first glance, the toughest step in this outline seems to be the 2D→3D feature conversion. It turns out to be among the easiest tasks, thanks to a simple idea: a cylindrical head model. In a nutshell, 2D features are unprojected from the camera reference frame onto a virtual cylinder: a ray is cast through each feature and intersected with the cylinder. This intersection provides the sought 3D positions of the image features. But first things first…

Grabbing an image is easy using OpenCV. Boilerplate code for that is a loop that looks like:

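#include <iostream>
#include <opencv2/opencv.hpp>
using namespace cv;
using namespace std;

int main() {
    Mat frame;
    VideoCapture capture;
    int dev_id = 1; // Device number.
    capture.open(dev_id);
    if (!capture.isOpened()) {
        cerr << "Failed to open video device " << dev_id << endl;
        return 1;
    }
    const string window_name = "Head tracking";

    for (;;) {
        capture >> frame; // Grab the next frame.
        if (frame.empty())
            break;

        imshow(window_name, frame);
        char key = (char) waitKey(5);

        if (key == ' ') {
            // The action bound to the space bar is not shown in this listing.
        }
    }
}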

In each input frame, 2D features are detected. Among the myriad of feature types, KLT features are probably the best suited to our real-time needs. Indeed, they are easy and fast to compute because there is no descriptor computation and no scale-space analysis involved (at least not as in SIFT). Using OpenCV, KLT features are retrieved as follows:

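int MAX_COUNT = 100;
// Stop after 20 iterations or once the update falls below 0.3.
TermCriteria termcrit(TermCriteria::COUNT | TermCriteria::EPS,
                      20, 0.3);
// We use two sets of points in order to swap
// pointers. cornerSubPix and calcOpticalFlowPyrLK
// expect single-precision points, hence Point2f.
vector<Point2f> points[2];
Size subPixWinSize(10,10), winSize(21,21);

//Convert image to gray scale.
Mat gray;
cvtColor(frame, gray, COLOR_BGR2GRAY);

//Feature detection is performed here...
goodFeaturesToTrack(gray, points[1], MAX_COUNT,
                    0.01, 10, Mat(), 3, false, 0.04);
cornerSubPix(gray, points[1], subPixWinSize,
             Size(-1,-1), termcrit);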

Now that features are detected, they are unprojected and intersected with the virtual cylinder. An exact solution to this ray-cylinder intersection can easily be found online. Now that we have the 3D positions of the features at time t-1, the same features are tracked in the upcoming frame using the optical flow routine from OpenCV:

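// status[i] is 1 when feature i was tracked successfully.
vector<uchar> status;
vector<float> err;
calcOpticalFlowPyrLK(prev_gray, gray,
                     points[0], points[1],
                     status, err);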
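As an aside on the unprojection step mentioned above: the ray-cylinder intersection itself needs no OpenCV. Below is a minimal sketch in plain C++, not the code used in the tracker. It assumes a pinhole camera sitting at the origin and looking along +Z, and an infinite cylinder described by a point on its axis, a unit axis direction and a radius. The names Intrinsics, Cylinder, pixelToRay and intersectRayCylinder are made up for this illustration.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { double x, y, z; };

static double dot(const Vec3& a, const Vec3& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// Returns a + s * b.
static Vec3 addScaled(const Vec3& a, double s, const Vec3& b) {
    return Vec3{a.x + s * b.x, a.y + s * b.y, a.z + s * b.z};
}

// Assumed pinhole intrinsics: focal lengths and principal point, in pixels.
struct Intrinsics { double fx, fy, cx, cy; };

// Assumed head model: an infinite cylinder given by a point on its axis,
// a unit-length axis direction and a radius, all in camera coordinates.
struct Cylinder { Vec3 center; Vec3 axis; double radius; };

// Unproject pixel (u, v) to a viewing-ray direction through the camera origin.
Vec3 pixelToRay(const Intrinsics& K, double u, double v) {
    return Vec3{(u - K.cx) / K.fx, (v - K.cy) / K.fy, 1.0};
}

// Intersect the ray origin + t * dir (t > 0) with the cylinder.
// On success, writes the nearest hit (the side facing the camera) to `hit`.
bool intersectRayCylinder(const Vec3& origin, const Vec3& dir,
                          const Cylinder& cyl, Vec3& hit) {
    assert(cyl.radius > 0.0);
    // Drop the components along the axis: what remains is a 2D
    // ray-circle intersection in the plane orthogonal to the axis.
    const Vec3 w = addScaled(origin, -1.0, cyl.center);
    const Vec3 dPerp = addScaled(dir, -dot(dir, cyl.axis), cyl.axis);
    const Vec3 wPerp = addScaled(w, -dot(w, cyl.axis), cyl.axis);

    // Solve |wPerp + t * dPerp|^2 = radius^2, i.e. a*t^2 + b*t + c = 0.
    const double a = dot(dPerp, dPerp);
    const double b = 2.0 * dot(dPerp, wPerp);
    const double c = dot(wPerp, wPerp) - cyl.radius * cyl.radius;
    if (a < 1e-12) return false;            // Ray is parallel to the axis.
    const double disc = b * b - 4.0 * a * c;
    if (disc < 0.0) return false;           // Ray misses the cylinder.
    const double t = (-b - std::sqrt(disc)) / (2.0 * a);
    if (t <= 0.0) return false;             // Nearest hit is behind the origin.
    hit = addScaled(origin, t, dir);
    return true;
}
```

For example, with a cylinder of radius 1 whose axis passes through (0, 0, 10) parallel to the camera's Y axis, the ray through the principal point hits the surface at (0, 0, 9). Taking the smaller root keeps the side of the cylinder that faces the camera, which is the visible side of the head.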

The result of this tracking is a set of features at time t. To get the change in head pose, we register the 3D features at time t-1 with the 2D features at time t. This is performed using a PnP (Perspective-n-Point) algorithm. Because the virtual cylinder represents the head (a rough estimate!), it must be updated with the incremental pose just computed. In a sense, the cylinder is a state object of the tracked head.
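The pose update can be pictured as a composition of rigid transforms. The following is only an illustrative sketch in plain C++, not the tracker's code. It assumes that the 3D features handed to PnP are expressed in camera coordinates, so that the PnP result (dR, dt) maps a point at time t-1 to its position at time t, and that the head pose (R, t) maps cylinder coordinates to camera coordinates. The names Pose, compose and rotY are hypothetical.

```cpp
#include <cmath>

struct Vec3 { double x, y, z; };
struct Mat3 { double m[3][3]; };

// Rigid transform: X_out = R * X_in + t.
struct Pose { Mat3 R; Vec3 t; };

Mat3 multiply(const Mat3& A, const Mat3& B) {
    Mat3 C{};  // Zero-initialized accumulator.
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j)
            for (int k = 0; k < 3; ++k)
                C.m[i][j] += A.m[i][k] * B.m[k][j];
    return C;
}

Vec3 rotate(const Mat3& A, const Vec3& v) {
    return Vec3{A.m[0][0] * v.x + A.m[0][1] * v.y + A.m[0][2] * v.z,
                A.m[1][0] * v.x + A.m[1][1] * v.y + A.m[1][2] * v.z,
                A.m[2][0] * v.x + A.m[2][1] * v.y + A.m[2][2] * v.z};
}

// Apply the incremental motion `delta` on top of the previous head pose:
// X_t = dR * (R * X_head + t) + dt, so R' = dR * R and t' = dR * t + dt.
Pose compose(const Pose& delta, const Pose& previous) {
    Pose updated;
    updated.R = multiply(delta.R, previous.R);
    const Vec3 rt = rotate(delta.R, previous.t);
    updated.t = Vec3{rt.x + delta.t.x, rt.y + delta.t.y, rt.z + delta.t.z};
    return updated;
}

// Rotation by `angle` radians about the Y axis.
Mat3 rotY(double angle) {
    const double c = std::cos(angle), s = std::sin(angle);
    return Mat3{{{c, 0.0, s}, {0.0, 1.0, 0.0}, {-s, 0.0, c}}};
}
```

Starting from an identity rotation and applying two increments of rotY(0.1) yields the same rotation as rotY(0.2). The cylinder accumulates the per-frame increments in this way, which is what makes it a state object.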

The head pose algorithm runs comfortably on a 2.4 GHz laptop using a Logitech C910 webcam, as the following video depicts:



  1. urbeller says:

    good !

  2. Vince says:

    Nice job! Thanks for sharing. I think I will try this technique one of these days.

  3. KD says:

    Really interesting.

  4. Graham Fogarty says:

    Nice demonstration Jamil. Have you tried other tracking methods? With LK tracking, points drift over time and it can't handle occlusion, as you are no doubt aware. I would be interested in trying other tracking techniques based on your code if you are willing to share.

    Regards, Graham

    • urbeller says:

      Hey there ! thank you for stopping by 🙂
      Actually, my strategy was to redetect points “on-demand”. When a face
      turns left for example, I redetect points on the opposite side (the one
      that is not occluded). The beauty of the cylindrical model is the
      fact that it holds the “state” of the head at any time. A drift is
      less likely to happen in this case.

  5. Kiran says:

    Jamil, Thanks. Not sure how many times I watched your video on youtube :). Have some doubts.

    1. I use POSIT instead of solvePnP, as in the ehci project, which uses a sinusoidal head model. If the head is not placed centrally in the video, will I need to translate image points to the origin? How do I do that? Ehci subtracts 160 (x) and 120 (y) from every image point for a 320×240 resolution. I am getting a wrong rotation whenever the head is not placed centrally in the video. Kindly suggest.

    2. Your code shows 100 corner points to detect, but your YouTube video didn’t contain that many points. Surprisingly, there are no tracking points on the mouth corners. However many times I run cvGoodFeaturesToTrack, with whatever parameters, I get features at the mouth corners. You don’t have them!!

    3. Is cornerSubPix providing any improvement for face ?

    4. Can you elaborate a little more on detecting points on non occluded side of face 🙂 ? I didn’t see that trick in the youtube video.


    • urbeller says:

      Kiran…ray of light ? 🙂

      Thank you for your interest in my little project. Before going any further, I should say that I am working on
      a second version that will include 2 major additions.

      1/ I am not familiar with ehci, though I saw a video demo of it. In my case, I assume
      that the initial face position is fronto-parallel (basically rotation is identity).
      Also, for efficiency, the face must be in a region of interest (it could be a rectangle).
      Then, the result of tracking will determine the rotation and translation of the face at the
      same time. I noticed that PnP gave better results. Posit assumes an orthographic or affine
      projection…I think !

      2/ The 100 points in my code is the maximum number of feature points. In practice, fewer than that are
      found. Of course, I am only interested in reliable features. I do get some features at mouth
      corners when I pretend speaking 🙂
      3/ Haven’t tested the improvement. Since computation time didn’t suffer from cornerSubPix, I kept it.
      4/ This is a work in progress. Once the points are detected and tracked, their normals can be estimated
      (they lie on a cylinder). I use the normal direction to weigh the feature’s contribution. I haven’t
      talked about it in my blog because it’s not finished yet. Stay tuned !!!
