Analyzing the dynamics of the visible articulatory region of the human face is important in many fields of study, including automatic lipreading. We consider input of the type shown in the video below:
Once you have the context (here, the grammar of digits spoken in order), you can easily segment the video in time using only visual information, saying for example that the digit 'one' starts here, and so forth. The second part of the video is more difficult, as seemingly random digits are spoken, yet knowing that only digits are spoken will still allow you to succeed (especially if you are adept at lipreading).
This work, which is funded by the Swedish national research foundation, aims to develop robust and fast methods for analysing such visual speech. To this end, we are investigating optical flow: the point-wise estimated motion of all positions in an image. There is some confusion in the community about what is meant by optical flow, and especially about how it relates to the closely related topics of motion estimation (in video encoding), tracking and texture matching. Here, by optical flow we mean a flow field that arises from video, and that can be illustrated, for example, as in the video below.
The video was recorded slowly, for the sake of visualization only. For videos of the size used here (256 by 256 pixels), our algorithm in its current implementation can handle 200 frames/sec (in MATLAB, on a laptop). We expect low-level optimized code to perform faster (especially if we use the GPU), but the point should be clear:
Optical flow of this kind is not computationally expensive.
In terms of sub-pixel accuracy, the algorithm is not the best-performing one. However, it outputs consistent flow fields (the same motion events yield the same flow), and the flow is naturally smooth over time. It is a so-called local method, in contrast to global variational methods, but we will not dwell on those details here. It handles the motion of linear structures well, which is an important detail because of the so-called aperture problem: in general (and for humans too), motion estimation tends to degrade when linear structures are present, as in the barber pole illusion.
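The aperture problem can be made concrete with a generic local, gradient-based estimator. The sketch below is a textbook least-squares (Lucas-Kanade-style) estimate over a single patch, and is not the algorithm used in this work; the function names, the synthetic patterns and the threshold `eps` are assumptions made for illustration. For a textured patch both flow components are recovered, while for vertical bars only the component across the bars (the normal flow) can be estimated.

```python
import math

def local_flow(img0, img1, eps=1e-6):
    """Estimate one flow vector (u, v) for a small patch by least squares.

    Returns (u, v, kind), where kind is "full", "normal" (aperture problem:
    only the component along the image gradient is recoverable) or "none".
    """
    a = b = c = p = q = 0.0
    h, w = len(img0), len(img0[0])
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            ix = (img0[y][x + 1] - img0[y][x - 1]) / 2.0  # spatial gradient in x
            iy = (img0[y + 1][x] - img0[y - 1][x]) / 2.0  # spatial gradient in y
            it = img1[y][x] - img0[y][x]                  # temporal difference
            a += ix * ix
            b += ix * iy
            c += iy * iy
            p -= ix * it
            q -= iy * it
    # Eigenvalues of the 2x2 structure tensor [[a, b], [b, c]].
    half_trace = (a + c) / 2.0
    det = a * c - b * b
    root = math.sqrt(max(half_trace * half_trace - det, 0.0))
    lam_max, lam_min = half_trace + root, half_trace - root
    if lam_max < eps:
        return 0.0, 0.0, "none"    # no gradients at all, nothing to estimate
    if lam_min < eps * lam_max:
        # Rank-one tensor: return the minimum-norm solution (normal flow).
        return p / lam_max, q / lam_max, "normal"
    return (c * p - b * q) / det, (a * q - b * p) / det, "full"

def render(pattern, shift=(0.0, 0.0), size=16):
    """Sample pattern(x, y) on a size-by-size grid, displaced by shift."""
    dx, dy = shift
    return [[pattern(x - dx, y - dy) for x in range(size)] for y in range(size)]

def textured(x, y):
    return math.sin(0.5 * x) + math.cos(0.4 * y)

def stripes(x, y):
    return math.sin(0.5 * x)  # vertical bars: a purely linear structure

motion = (0.2, -0.1)  # true displacement in pixels per frame (assumed)
print(local_flow(render(textured), render(textured, motion)))
print(local_flow(render(stripes), render(stripes, motion)))
```

In the striped case the vertical component of the motion leaves no trace in the image data, which is why a method has to treat linear structures explicitly rather than invert an ill-conditioned system.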
For our semi-automatic lip-segmentation application, the flow is used to track points around the lips over time. This requires the points to be initialized by a human, and for this application we use only the flow vectors. The points can be updated over time, as shown in the video below.
This segmentation takes virtually no time at all, and it will improve many times over once static features of the neighborhoods of the points are included.
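Tracking with flow vectors alone amounts to moving each point by the flow found at its current position, once per frame. The following is a minimal sketch of that update, assuming the flow is stored as two dense grids `flow_u` and `flow_v` (lists of rows) and sampled bilinearly at sub-pixel positions; these names and the storage layout are illustrative assumptions, not the actual implementation.

```python
def sample(field, x, y):
    """Bilinearly interpolate a 2-D grid (list of rows) at real-valued (x, y)."""
    h, w = len(field), len(field[0])
    x = min(max(x, 0.0), w - 1.0)  # clamp to the image
    y = min(max(y, 0.0), h - 1.0)
    x0 = min(int(x), w - 2)
    y0 = min(int(y), h - 2)
    fx, fy = x - x0, y - y0
    top = (1 - fx) * field[y0][x0] + fx * field[y0][x0 + 1]
    bottom = (1 - fx) * field[y0 + 1][x0] + fx * field[y0 + 1][x0 + 1]
    return (1 - fy) * top + fy * bottom

def advect(points, flow_u, flow_v):
    """Move every tracked point by the flow vector at its current position."""
    return [(x + sample(flow_u, x, y), y + sample(flow_v, x, y))
            for x, y in points]

# Hypothetical example: two hand-initialized lip corners in a uniform flow field.
size = 32
flow_u = [[1.5] * size for _ in range(size)]
flow_v = [[-0.5] * size for _ in range(size)]
lip_points = [(10.0, 12.0), (20.0, 12.0)]
for _ in range(3):  # three consecutive frames
    lip_points = advect(lip_points, flow_u, flow_v)
print(lip_points)
```

Because the points are never re-detected, errors accumulate over time in such a scheme, which is one reason to add static features of the point neighborhoods.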
But we are not content with being semi-automatic. We of course want to be fully automatic, requiring no initialization or other interaction from the user.
From the flow, we are able to pick up some relevant and very interesting information about the dynamics of the visual speech. We can do this in a rotation- and translation-invariant way; that is to say, the mouth may move around in the video, as well as rotate (as in head wobbling or camera rolling), without affecting the analysis. This is especially important for implementations on hand-held devices. Below we see our prototype for real-time avatar lip synchronization. We remind the reader that this can be done at 200 frames/second in a MATLAB implementation.
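To illustrate what rotation and translation invariance can mean for a flow-derived quantity, here is a small sketch. The feature it computes, a normalized "expansion rate" of the flow around the centroid of a set of points, is an assumption chosen for illustration and is not necessarily what our prototype measures; `expansion_rate` and `move_head` are hypothetical names. The mean flow is subtracted, which removes global translation, and only the radial component is kept, which is unaffected by in-plane rotation.

```python
import math

def expansion_rate(points, flows):
    """Mean radial flow about the centroid, normalized by the squared radii.

    Positive when the point set spreads out, negative when it contracts.
    """
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    mu = sum(u for u, _ in flows) / n
    mv = sum(v for _, v in flows) / n
    num = den = 0.0
    for (x, y), (u, v) in zip(points, flows):
        rx, ry = x - cx, y - cy
        num += rx * (u - mu) + ry * (v - mv)
        den += rx * rx + ry * ry
    return num / den

def rotate(vec, angle):
    c, s = math.cos(angle), math.sin(angle)
    return (c * vec[0] - s * vec[1], s * vec[0] + c * vec[1])

def move_head(points, flows, angle, shift, drift, spin):
    """Simulate head or camera motion: rotate and shift the frame, then add
    a global translation (drift) and an in-plane rotation (spin) to the flow."""
    new_points, new_flows = [], []
    for p, f in zip(points, flows):
        rp, rf = rotate(p, angle), rotate(f, angle)
        new_points.append((rp[0] + shift[0], rp[1] + shift[1]))
        new_flows.append((rf[0] + drift[0] - spin * rp[1],
                          rf[1] + drift[1] + spin * rp[0]))
    return new_points, new_flows

# Hypothetical mouth contour: 12 points on an ellipse, opening vertically.
center = (100.0, 80.0)
contour = [(center[0] + 30.0 * math.cos(2 * math.pi * k / 12),
            center[1] + 12.0 * math.sin(2 * math.pi * k / 12)) for k in range(12)]
opening = [(0.0, 0.2 * (y - center[1])) for _, y in contour]

still = expansion_rate(contour, opening)
moved = expansion_rate(*move_head(contour, opening, angle=math.radians(30),
                                  shift=(40.0, -15.0), drift=(2.0, 1.0), spin=0.05))
print(still, moved)
```

The two printed values agree up to rounding, even though the second measurement is taken from a rotated, shifted and drifting view of the same mouth movement.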