WO2008132741A2 - Apparatus and method for tracking human objects and determining attention metrics

Apparatus and method for tracking human objects and determining attention metrics

Info

Publication number
WO2008132741A2
Authority
WO
WIPO (PCT)
Prior art keywords
camera
face
display
processing
tracking
Prior art date
Application number
PCT/IL2008/000569
Other languages
French (fr)
Other versions
WO2008132741A3 (en)
Inventor
Itzhak Wilf
Oded Har-Tal
Adam Spiro
Original Assignee
Trumedia Technologies Inc.
Priority date
Filing date
Publication date
Application filed by Trumedia Technologies Inc. filed Critical Trumedia Technologies Inc.
Publication of WO2008132741A2 publication Critical patent/WO2008132741A2/en
Publication of WO2008132741A3 publication Critical patent/WO2008132741A3/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/52 Surveillance or monitoring of activities, e.g. for recognising suspicious objects

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

A method for determining attention metrics of human objects towards a display is disclosed. The method comprises acquiring image data by at least one imaging device whose field of view substantially covers a sector of interest relating to the display; processing the image data using an image data processing algorithm. The algorithm includes: detecting presence of the human objects, determining at least one attention metrics parameter from a group of parameters which includes: face towards the display and eyes on the display. Other embodiments are also disclosed.

Description

APPARATUS AND METHOD FOR TRACKING HUMAN OBJECTS AND DETERMINING ATTENTION METRICS
FIELD OF THE INVENTION [0001] The present invention relates to object recognition and detection. More specifically, the present invention relates to an apparatus and method for tracking human objects and determining attention metrics.
BACKGROUND OF THE INVENTION [0002] In the world of directed advertising, which aims at providing personalized information (in particular commercial information) to targeted populations, there is a growing need for identifying persons who have interest in particular items.
[0003] Large display screens can be found in the public areas of many shopping centers, displaying various advertisements and other commercial information messages that the advertisers think would raise interest in the viewers and encourage them to purchase the items advertised in these messages.
[0004] However, it would be advantageous for the advertisers to know if their displayed messages are indeed catching the attention of people who are exposed to these messages.
[0005] It is therefore an aim of the present invention to have an imaging system that can view a predetermined sector (for example, the area in front of a display screen) and detect and track objects (in particular human persons) in that sector.
[0006] As large display screens can be viewed from a wide angular range, another aim of the present invention is to have the imaging system monitor a sector across a wide-angle range.
[0007] Covering an area of interest with a single image capturing device (hereinafter referred to as "camera") is limited due to the trade-off between the field-of-view (FOV) of the camera and the distance of the objects from the camera. For a given spatial resolution of the image capture device, the wider the FOV, the fewer pixels are used to represent an object at a given distance, up to a distance limit above which objects cannot be reliably detected and tracked. In addition, using a wide angle lens causes a significant "barrel distortion" effect which distorts the objects, making it harder to detect and track such objects, particularly near the edges of the image.
[0008] High-resolution video cameras that typically offer megapixels of resolution can increase the detection distance. Most off-the-shelf cameras are designed with the same aspect ratio, which is the ratio between the horizontal and vertical picture dimensions. This aspect ratio in many cameras is typically 4:3 (horizontal:vertical). For the task of detecting and tracking persons in a given viewed sector, the required field of view is typically wider than it is high. Hence, a high-resolution camera with a horizontal field of view of 120 degrees or more will have large unused portions of the field of view (in the vertical aspect) and will suffer from lens distortion.
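By way of illustration only, the following sketch quantifies this trade-off under a simple pinhole-camera model; the sensor width, face width and distances are assumed example values and not part of the disclosed apparatus.

```python
import math

def face_width_in_pixels(image_width_px, hfov_deg, face_width_m, distance_m):
    """Approximate number of pixels spanned by a face under a pinhole-camera model."""
    focal_px = image_width_px / (2.0 * math.tan(math.radians(hfov_deg) / 2.0))
    return focal_px * face_width_m / distance_m

# A 640-pixel-wide sensor imaging a 0.16 m wide face at 4 m:
for hfov in (70, 120):
    px = face_width_in_pixels(640, hfov, 0.16, 4.0)
    print(f"HFOV {hfov} deg -> ~{px:.0f} px across the face")
# Wider FOV -> fewer pixels per face, hence a shorter reliable detection range.
```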
[0009] It is therefore another aim of the present invention to provide an imaging system for detecting and tracking objects with a field of view whose width is substantially larger than its height.
[0010] For wide angle viewing two or more cameras can be used. Each camera covers a part of the entire viewed sector. The image data received from the cameras has to be processed while taking into account the fact that there exists some overlapping of viewed sectors between adjacent cameras which should be compensated for.
[0011] One straightforward approach to detecting, tracking and counting objects in images captured by two or more cameras is based on stitching the images acquired from the various cameras into one large image and then performing image processing to detect, track and count the objects of interest in the large stitched image.
[0012] Several methods have been proposed in the art to "stitch together" images captured from different viewpoints or viewing directions. When images are mosaiced, the possibility of localized double images of objects (or "ghosts") exists. Essentially, this double imaging or ghosting will occur if an object in the scene is close to the cameras capturing the images. For such close ranges, the amount of ghosting depends on the object distance; hence it is impossible to find a fixed mapping function that will mosaic the pictures into a seamless panorama for all objects of interest.
[0013] Uyttendaele, et al. (US Patent Number 6,701,030) describe how this localized ghosting can be compensated for by estimating the amount of local mis-registration and then locally warping each image in the mosaiced image to reduce any ghosting.
[0014] The stitching or mosaicing approach suffers from several drawbacks, some of which are listed hereinafter, especially when the objects to be detected, tracked and counted are human faces.
[0015] Errors in estimating the amount of local mis-registration may introduce artifacts that will reduce face image quality at the overlap region. Such reduced quality may seriously affect the detection probability for said face and the tracking probability.
[0016] Wide angle cameras (with a typical FOV of some 70 degrees) that need to be combined to image a sector of a much greater angle, may suffer from some image distortion (e.g. the "barrel effect") that further complicates the image stitching task.
[0017] Contrast enhancement techniques, which are typically used in wide dynamic range cameras, depend on global and local image content to establish correction parameters. Since global image content differs between the cameras, it alters local image appearance, making seamless stitching more difficult.
[0018] The capture of both images must be time synchronized; otherwise the overlapped image area may contain image parts captured at different times, further reducing face image quality.
[0019] Processing a large picture as obtained by the stitching process may require vast computational resources. It may be beneficial to divide the processed image in order to allow parallel computing.
[0020] By pruning half of the pictorial information in the overlap zone, as is typically done in the stitching process, the probability of object detection may be reduced.
[0021] Multiple camera tracking is known to be used in video surveillance to cover larger areas with multiple cameras. Camera handoff, or the capability to contiguously track an object as it moves between cameras, is desired.
[0022] The principle of camera handoff in object tracking is known in prior art and is used in low-density environments, where object location and trajectory are used to predict the time and location at which an object will appear in a different camera. Sengupta, et al. (US Patent Number 6,359,647) teaches the automation of a multiple camera system based upon the location of a target object in a displayed camera image. By assessing the movement of the figure, the system selects and adjusts the next camera based upon the predicted subsequent location of the figure.
[0023] It is desired to deghost objects in the overlap zones of two cameras in order to obtain an accurate count of the instantaneous number of objects. Furthermore, it is desired to track objects between the two cameras, in the presence of occlusions and multiple objects, in order to obtain an accurate count of the number of different objects, which, in the specific application of viewing measurement in which an object is a human viewer, provides an accurate estimate of the viewing times for each of said viewers.
[0024] Dual cameras placed at a predetermined distance are known in stereoscopic vision, where detecting an object or part thereof simultaneously in both cameras allows computing the 3D position of the object. However, stereoscopic vision applies to the intersection of the cameras' fields of view, while for the problem at hand it is desirable to extend the audience coverage to the union of the respective fields of view.
[0025] Other aims and advantages of the present invention will become clear after reading the present specification and considering the accompanying figures.
SUMMARY OF THE INVENTION
[0026] There is thus provided, in accordance with some embodiments of the present invention, a method for determining attention metrics of human objects towards a display, the method comprising: [0027] acquiring image data by at least one imaging device whose field of view substantially covers a sector of interest relating to the display;
[0028] processing the image data using an image data processing algorithm comprising: [0029] detecting presence of the human objects,
[0030] determining at least one attention metrics parameter from a group of parameters which includes: face towards the display and eyes on the display.
[0031] Furthermore, in accordance with some embodiments of the present invention, the group of parameters also includes proximity to the display and face expression.
[0032] Furthermore, in accordance with some embodiments of the present invention the method further comprises determining attention time span. [0033] Furthermore, in accordance with some embodiments of the present invention, said at least one imaging device comprises two or more imaging devices, the field of view of each camera only covering a portion of the sector of interest, with an overlap zone between two adjacent imaging devices, so as to substantially cover the sector of interest.
[0034] Furthermore, in accordance with some embodiments of the present invention, at least one human object can be fully captured in the overlap zone. [0035] Furthermore, in accordance with some embodiments of the present invention the step of processing the image data comprises using two processors, one processor for each camera for single camera processing, and using a processor for extended processing to determine said at least one attention parameter. [0036] Furthermore, in accordance with some embodiments of the present invention the step of processing the image data further comprises separately detecting the presence of one or more of the human objects by each of said two or more imaging devices and predicting respective presence and location of said one or more human objects in the field of view of other of said two or more imaging devices. [0037] Furthermore, in accordance with some embodiments of the present invention the method comprises using a transformation function in the predicting of the respective presence of the human objects in the field of view of the other of said two or more imaging devices.
[0038] Furthermore, in accordance with some embodiments of the present invention there is provided an apparatus for detecting and tracking objects in a given sector of interest, the apparatus comprising:
[0039] an imaging assembly comprising two or more imaging devices, the field of view of each imaging device only covering a portion of the sector of interest, with an overlap zone between fields of view of two adjacent imaging devices, so as to substantially cover and acquire image data from the given sector of interest; [0040] a processing unit for detecting and tracking of the human objects in the sector of interest.
[0041] Furthermore, in accordance with some embodiments of the present invention the processing unit comprises two processors, one processor for each imaging device for single imaging device processing of image data to obtain object detecting and tracking information; and
[0042] a processor for extended processing of the object detecting and tracking information.
[0043] Furthermore, in accordance with some embodiments of the present invention the imaging assembly comprises a housing with windows and a rack for fixedly positioning the imaging device within the housing. [0044] Furthermore, in accordance with some embodiments of the present invention the processor for each imaging device is an integral processor of the imaging device. [0045] Furthermore, in accordance with some embodiments of the present invention the processor for extended processing is one of the two processors.
[0046] Furthermore, in accordance with some embodiments of the present invention the imaging devices are video cameras. [0047] Furthermore, in accordance with some embodiments of the present invention the processing unit includes an algorithm comprising:
[0048] detecting presence of the human objects,
[0049] determining at least one attention metrics parameter from a group of parameters which includes: face towards the display and eyes on the display. [0050] Furthermore, in accordance with some embodiments of the present invention the group of parameters also includes proximity to the display and face expression.
[0051] Furthermore, in accordance with some embodiments of the present invention the algorithm further includes determining attention time span.
[0052] Furthermore, in accordance with some embodiments of the present invention the overlap zone between fields of view of two adjacent imaging devices can fully capture one of the human objects.
BRIEF DESCRIPTION OF THE DRAWINGS
[0053] In order to better understand the present invention, and appreciate its practical applications, the following Figures are provided and referenced hereafter. It should be noted that the Figures are given as examples only and in no way limit the scope of the invention. Like components are denoted by like reference numerals.
[0054] Fig. 1 illustrates a top view of a system for recognizing objects in a given viewed sector according to embodiments of the present invention, mounted atop a display viewed by an audience.
[0055] Fig. 2 illustrates a camera assembly according to an embodiment of the present invention.
[0056] Fig. 3A illustrates housing of a camera assembly according to embodiments of the present invention. [0057] Fig. 3B illustrates a rack which may be incorporated in a camera assembly according to embodiments of the present invention. [0058] Fig. 3C depicts a processing unit which may be incorporated in a camera assembly according to an embodiment of the present invention.
[0059] Fig. 4 shows a chart which presents identified viewers and the time duration of their presence in the FOV for each camera (each chart represents information relating to one of the cameras).
[0060] Fig. 5 describes a method for audience measurement metrics according to embodiments of the present invention.
[0061] Fig. 6 illustrates a method for extended audience measurement according to embodiments of the present invention. [0062] Fig. 7 illustrates a method according to embodiments of the present invention, for establishing equivalence between a left track and a right track, based on at least a single time instance in which a face is detectable in both the left and the right side of the overlap zone.
[0063] Fig. 8 illustrates a method for wide angle audience measurement according to embodiments of the present invention.
DETAILED DESCRIPTION OF EMBODIMENTS
[0064] Embodiments of the present invention may include image-based or video-based object detection imaging systems, some of which may include counting or tracking. More specifically, "objects" in the context of the present specification typically refer to human objects (although other objects can be considered too), and said detecting, counting and tracking may be based on body images or facial images. Said detection, tracking and counting may be used, for example, for viewing measurement relating to out-of-home screens and other displays, where it is desired to count viewers that satisfy one or more conditions relating to the attention of said viewers towards the display, and can also include measurements of the time duration in which such one or more conditions are satisfied. According to some embodiments of the present invention "human objects" may refer to human faces.
[0065] According to embodiments of the present invention the viewed area of interest is expanded by capturing object images with two or more imaging devices, for example, cameras, where each camera covers a part of the area of interest. More particularly, the present invention relates to a method of tracking objects in images captured by each of said cameras and combining the tracked objects to yield a compound object count for the area of interest.
[0066] According to embodiments of the present invention an apparatus and related methods for wide angle detection and tracking of objects are described herein. By way of example embodiments of the invention described herein are used to detect and track members of an audience watching a display. Fig. 1 shows a top view of the scene with a display 12 - which may be a poster, a notice-board, a TV monitor / LCD screen / Plasma display or other information or product display. A Wide-Angle Camera assembly 10 is mounted on top of the display or nearby to ensure that the audience 14 is indeed facing the display. The Camera assembly 10 field of view covers a sector of interest 16 of predetermined angle, which relates to the display (in other words, a sector in which members of audience can pay attention to the display), and a range of audience distances, in order to allow detecting, monitoring, counting or other tracking of the audience attention towards the display.
[0067] While the objects viewed by the camera assembly in this example are human persons, it is understood that other objects may be viewed and monitored by a camera assembly according to embodiments of the present invention. Embodiments of the invention may be concerned with monitoring only parts of the objects (e.g. faces, eyes) or the whole object (e.g. entire body).
[0068] Embodiments of the present invention may rely on a number of different techniques for object detection and tracking and in particular may use passive imaging techniques, active illumination, or both.
[0069] According to embodiments of the present invention a camera assembly may include two or more cameras to capture a wider angle sector. The description herein will relate to two cameras only, but can be generalized to three cameras or more when it is necessary to expand the coverage angle or extend the detection range of the apparatus. A typical value for the single camera field of view is 70 degrees, which can be achieved with a low-cost off-the-shelf lens with small distortion. The present example teaches how to extend the coverage horizontally, but may also be used to extend it vertically, or with three or more cameras.
[0070] Embodiments of the present invention may include analog cameras as well as digital cameras with arbitrary resolution and interfaces. A camera assembly according to embodiments of the present invention may use the ambient light, artificial illumination or active lighting, such as, for example, Light Emitting Diode (LED) illuminators. [0071] According to embodiments of the present invention, it is desired that each object is captured in whole by at least one camera. For the specific application of human audience measurement by analyzing face images, the fields of view of the cameras of the camera assembly preferably overlap by at least a head width at each applicable audience distance, so that a face will always be captured in full by at least one camera. With a typical display size of 20" - 42" (diagonal) such distance may typically range from 0.3m to 6m.
[0072] A camera assembly according to an embodiment of the present invention is depicted in Fig. 2. The two cameras, 18a and 18b, are arranged in a formation so as to maximize the apparatus' field of view angle, while maintaining an overlap area between the cameras that is at least a head wide, across the entire range of applicable object distance. These design criteria are met by using the arrangement in Fig. 2, where the cameras' lines of sight are intersecting, rather than diverging. Cameras 18a and 18b are arranged in a tilted configuration (angles α and α' respectively) so that while each camera covers a relatively narrow sector (angles β and β' respectively), together the cameras' combined field of view spans a much wider sector (angle γ). For example, if angles α and α' are each 30 degrees, and the field of view of each camera (angles β and β') is 70 degrees, then the overall coverage (FOV, angle γ) of the camera assembly is some 130 degrees.
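A minimal sketch of the planar geometry behind this example follows; it assumes the camera baseline is negligible relative to the audience distance, so coverage can be reasoned about in angles alone.

```python
def combined_coverage_deg(camera_fov_deg, tilt_left_deg, tilt_right_deg):
    """Combined horizontal coverage of two cameras whose optical axes are rotated
    away from the assembly's central axis by the given tilt angles."""
    return camera_fov_deg + tilt_left_deg + tilt_right_deg

def overlap_angle_deg(camera_fov_deg, tilt_left_deg, tilt_right_deg):
    """Angular width of the central zone covered by both cameras."""
    return camera_fov_deg - (tilt_left_deg + tilt_right_deg)

print(combined_coverage_deg(70, 30, 30))  # 130 degrees, as in the example above
print(overlap_angle_deg(70, 30, 30))      # 10 degrees of shared coverage
```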
[0073] The cameras may comprise analog or digital cameras and interface accordingly to a video processor that processes image data from the camera assembly as described hereinafter. [0074] A camera assembly according to an embodiment of the present invention may comprise a housing, parts of which are shown in Figures 3A, 3B and 3C, for example two corresponding shells, one of which 20 (see Fig. 3A) is shown in the figure, with two openings 19 serving as windows through which cameras 18a and 18b may view the scene. An angular positioning of the cameras inside the camera assembly housing is depicted in Figure 3A. [0075] For the sake of simple assembly, the manufacturing of the camera assembly includes the use of a rigid rack 22, which may be made, for example, from metal, and may include holes 24 for corresponding power sockets, communication sockets 28, or video connections 32. Rack 22 (see Fig. 3B) may also include framed windows 19a and 19b and pegs 21a and 21b, which are designed to aid in the correct alignment of the cameras when placing the cameras in the housing. Attaching the cameras to the appropriate window (19a or 19b) and holding the camera against the adjacent peg (21a or 21b) can ensure that the relative position between the cameras is fixed and constant for each manufactured camera assembly. Thus, calibration values used to transform an object detected in the left camera, for example, to a predicted object location in the second camera, are fixed.
[0076] A camera assembly may contain a video processing unit. Alternatively, the output video signals of the two cameras, whether analog or digital may be streamed to a remote processor, such as for example a PC device or to a dedicated video processing device 26 (see Fig. 3C).
[0077] A method of processing and analyzing video images captured by the camera assembly, according to embodiments of the present invention, is described herein, aimed at accurately obtaining audience measurements for the viewers in the combined field of view of the two cameras. The video processor device can include a single processor processing image data from both cameras or a dual-processor, where each processor processes image data of one of the two cameras. The description hereinafter also refers to deghosting in the common area and how to perform extended tracking across the common FOV.
[0078] According to embodiments of the present invention, a method for object tracking is aimed at detecting and in some embodiments also monitoring and tracking objects of interest (for example, human faces). In a method according to an embodiment of the present invention, image data from each camera is processed separately and then the processed data from both cameras is combined using a suggested intra-camera logic in order to generate accurate object counts (allowing neither undercounting nor overcounting), and to robustly track members of the audience as they move across the FOV, from one camera coverage zone to the second camera coverage zone, thereby generating a unique ID for each person captured by the apparatus. Such a unique ID enables accurate counting of different people viewing the display and can help determine accurate exposure time, or attention span (as a measure of the viewer's interest level), for each viewer. [0079] Most of the computational work load can be done at the individual camera level, with little data transferred and processed as part of the intra-camera logic. Therefore, the present invention may lend itself to a parallel implementation: assigning a dedicated processor to each camera for the time consuming tasks of object detection and tracking, and then merging the object counts and attention spans into combined results with no over-counting, using another processor, or assigning the task of combining the information from both cameras to one of the individual camera processors (so as to act as a "Master" processor). [0080] It is noted that according to some embodiments of the present invention, full imaging of the viewed sector is not necessary, and according to embodiments of the present invention it is sufficient to gather object information which does not necessarily involve reconstructing images. According to embodiments of the present invention the processing required may be handled by camera processors (of one of the cameras of the camera assembly, or of both).
[0081] Detecting and tracking objects in images captured by a single camera is known. For the specific case of objects that are human faces, known techniques may be used in an object detecting and tracking system according to embodiments of the present invention to build a basic face tracker, capable of detecting and tracking one or more faces in a single camera. [0082] The field of face detection focuses on identifying face-like image regions based on pictorial characteristics. Specifically, a prior art method by Viola and Jones (Paul Viola and Michael Jones, "Rapid object detection using a boosted cascade of simple features", in CVPR - IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2001) introduced a machine learning approach for visual object detection which is capable of processing images extremely rapidly and achieving high detection rates. Using a new image representation called the "Integral Image", the features that may be used by the face detector can be computed very quickly. A learning algorithm, for example based on AdaBoost, selects a small number of critical visual features from a larger set and yields extremely efficient classifiers. Finally, a method for combining increasingly more complex classifiers in a "cascade" allows background regions of the image to be quickly discarded while spending more computation on promising object-like regions.
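For illustration, a minimal cascade-based face detector in this spirit can be put together with OpenCV's bundled Haar cascade; the parameter values are assumptions chosen for readability, not values taught by the described apparatus.

```python
import cv2

# OpenCV's stock frontal-face Haar cascade stands in for the classifier cascade
# described above (assumes the opencv-python package, which bundles the XML file).
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_faces(frame_bgr):
    """Return a list of (x, y, w, h) face rectangles for one video frame."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    # scaleFactor and minNeighbors trade detection rate against false positives.
    return cascade.detectMultiScale(gray, scaleFactor=1.1,
                                    minNeighbors=5, minSize=(40, 40))
```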
[0083] Face tracking is a process of following face images from one video image to a subsequent image, captured by the same camera. The process of following helps us count the number of different viewers, estimate exposure time and optionally filter spurious exposures that may be associated with random glances, or false face detections. Face tracking also helps analyze viewer motion and average audience measurements such as demographics estimates by using multiple such estimates.
[0084] The concept of object tracking in video images is widely known in prior art. As one example, Froba and Kublbeck (Bernhard Froba and Christian Kublbeck, "Face tracking by means of continuous detection," in Proc. of CVPR Workshop on Face Processing in Video (FPIV'04), Washington DC, 2004) describe a method of face tracking by means of continuous detection, relying on fast methods such as the one described by Viola and Jones and using a Kalman filter to estimate the optimal state of a tracked face from the detection results.
[0085] The concept of continuous detection may still require the object (face) to be detected in all or most of the detection frames. When the detection probability (for example, passing the cascade of classifiers as described above) is low - such tracking may fail.
[0086] As face images are highly structured and rich in image information content (for example, the eyes area, the mouth area and the nose area), face tracking can also be achieved successfully without continuous detection. For example, one may use the method of template matching to search for a detected face in subsequent frames. The initial template may be selected to be the image region encompassing the detected face. Later in the tracking process, the template may be updated to be an image region encompassing the tracked face position in subsequent frames, thereby allowing face tracking even as the face is rotated away from the camera and can no longer be detected by face detection methods as described above. A prior art method of face tracking by template matching is described by L. Wang et al (L. Wang et al, Face Tracking Using Motion-Guided Dynamic Template Matching, in 5th Asian Conference on Computer Vision, 23-25 January 2002, Melbourne, Australia).
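A minimal sketch of one template-matching tracking step is given below; it illustrates the general idea rather than the specific motion-guided method of Wang et al.

```python
import cv2

def track_by_template(frame_gray, template_gray):
    """Locate the previous face patch in the current frame; returns the new
    top-left corner and a normalized correlation score."""
    result = cv2.matchTemplate(frame_gray, template_gray, cv2.TM_CCOEFF_NORMED)
    _, max_val, _, max_loc = cv2.minMaxLoc(result)
    return max_loc, max_val

# In a tracking loop the template would be re-cut around the returned location,
# so the tracker can keep following the face even after frontal detection fails.
```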
[0087] A facial features localization process may find the exact image position (in pixels) of several facial features (e.g. eyes, eyebrows, nose tip, lip corners, chin and ears). Multiple techniques are known for facial feature localization. Some of these techniques are based on programming a specific detector for each feature (e.g. eye detector, mouth detector). Wiskott et al (Wiskott, L., Fellous, J.M., Kruger, N., and von der Malsburg, C. (1999). Face recognition by elastic bunch graph matching. In Intelligent Biometric Techniques in Fingerprint and Face Recognition, eds. L.C. Jain et al., CRC Press, ISBN 0-8493-2055-0, Chapter 11, pp. 355-396, and also Laurenz Wiskott, Jean-Marc Fellous, Norbert Kruger, et al. Face Recognition by Elastic Bunch Graph Matching, Proc. 7th Intern. Conf. on Computer Analysis of Images and Patterns, CAIP'97, Kiel) teach how to build a generic detector for a specific feature from a set of sample images of the same feature (for example multiple images of a left eye) and then locate the left eye in a facial image by searching for a maximum response of the detector. For the specific case of eye detection, the bright pupil technique described below locates eyes with great precision.
[0088] Audience measurement may require one or more conditions to be satisfied in order for the specific object instance and the time duration associated with that instance to be considered in the count and duration metrics, respectively. One such condition is "Face Towards": having the viewer's face turned towards the display, in order to discard time instances in which the viewer looks away from the screen. Several ways of estimating the face pose are known. In particular, the decision whether or not the face is oriented towards the screen can be carried out by a number of methods including symmetry analysis, as the left side of the face image will exhibit a large degree of symmetry with the right side of the face. The medial line of the face image can be computed from the facial features located as described above, and used to divide the face into a left and right portion, prior to testing for symmetry. Alternatively, a pose classifier can be trained in a supervised manner, using multiple face images for each pre-defined orientation: left, middle and right. Multiple left or right pose classes can be selected for more accurate pose estimation and better testing of the "Face Towards" condition.
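By way of illustration, the symmetry test can be sketched as a normalized correlation between the mirrored left half and the right half of the face image about the medial line; the score threshold used to declare "Face Towards" would be an assumed, tunable value.

```python
import numpy as np

def frontal_symmetry_score(face_gray, medial_col):
    """Correlate the mirrored left half of a face image with its right half about
    the medial column; near-frontal (display-facing) faces score close to 1."""
    w = min(medial_col, face_gray.shape[1] - medial_col)
    if w == 0:
        return 0.0
    left = face_gray[:, medial_col - w:medial_col].astype(np.float64)
    right = np.fliplr(face_gray[:, medial_col:medial_col + w]).astype(np.float64)
    left -= left.mean()
    right -= right.mean()
    denom = np.linalg.norm(left) * np.linalg.norm(right)
    return float((left * right).sum() / denom) if denom else 0.0
```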
[0089] Another condition of interest in the field of audience measurement is "Eyes-On" (or "gaze estimation"): having the viewer's eyes open and turned towards the display. In some audience measurement applications, "Eyes-On" is considered a stronger or more desired condition than "Face Towards". Eyes-On detection may use prior art techniques of eye pupil detection in high resolution images and tracking of the pupil position within the eye image. Eyes-On detection may also use prior art techniques of active illumination, also known as the "retro-reflection" effect, where a light-source is placed on-axis with the camera optical axis. The camera is then able to detect the light reflected from the interior of the eye, giving rise to a "bright pupil" effect, as described by Morimoto et al (C. Morimoto et al, Real-Time Detection of Eyes and Faces, Proc. Workshop. Perceptual User Interfaces, San Francisco, Nov. 1998).
[0090] Camera placement, field of view and processing resolution are designed for a specific audience distance and thus implicitly act as a distance threshold. However, it may be required to provide explicit range estimates in order to limit counting to the audience distance range of interest, or to weigh audience measurements accordingly.
[0091] Camera parameters, together with the facial features locations, are used to calculate the range of the face from the camera. This is done using prior information about the distribution of the size of the facial feature in the entire population. One such measure can be the distance between the eyes. For that specific measure, the method for bright pupil detection provides a very accurate measure for such distance. [0092] When the audience consists of different populations, for example children and adults, the distribution of face feature distances within these populations has to be accounted for when estimating the face distance, as the average eyes distance of an adult is different from that of a young child. Face classification information as described above can be useful in deciding if the face under consideration belongs to an adult or to a child.
[0093] Camera parameters are required to translate, for example, the horizontal image plane (or pixel) distance between two image points to the angle between the points and the camera, which is then used to compute the distance, given the object-space distance between the points (eyes for example). [0094] When optical distortions are negligible, the camera sensor resolution and the horizontal field of view angle are sufficient. In the presence of distortion, one must correct for these distortions, based on vendor-supplied calibration tables or on-site calibration, using one of the methods known in prior art.
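Under a distortion-free pinhole model, the range calculation reduces to the short function below; the 63 mm inter-pupillary distance is an assumed population average and would be replaced by the population-dependent priors discussed above.

```python
import math

def face_distance_m(image_width_px, hfov_deg, eye_dist_px, eye_dist_m=0.063):
    """Estimate the face range from the pixel distance between the eyes,
    assuming a distortion-free pinhole camera of known horizontal FOV."""
    focal_px = image_width_px / (2.0 * math.tan(math.radians(hfov_deg) / 2.0))
    return focal_px * eye_dist_m / eye_dist_px

# Example: 640-pixel-wide image, 70 degree HFOV, eyes 14 px apart -> roughly 2 m
print(round(face_distance_m(640, 70, 14), 2))
```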
[0095] Industry-accepted metrics for audience measurement include: [0096] Impression time: contiguous viewing interval by the same person;
[0097] Viewer Engagements: how many people viewed the display almost contiguously within a time interval;
[0098] Attention span: total net viewing time for the same person.
[0099] Fig. 4 shows a chart which presents identified viewers and the time duration of their presence in the FOV for each camera (each chart represents information relating to one of the cameras). "Count" indicates how many objects are counted for a given time duration for each camera and "total count" indicates how many objects are detected for a given time duration by the camera assembly. Viewer A has a single engagement with the display (on which the camera assembly is mounted) that comprises two impressions of durations 6 and 5 sec respectively, for a total attention span of 11 seconds.
[00100] Face tracks, each comprising a series of face detections and face locations obtained from tracking, are passed through an Audience Measurement Logic (which is incorporated in an algorithm of the processor). Pose and distance may be used as measures of attention directed by an individual member of the audience towards the display and can be used as one or more additional conditions. Together with the display size these parameters are used to determine whether a specific gaze (or detection) should be considered a countable exposure. [00101] According to the present invention, audience measurement data may be extracted from each face track: every image in the track is tested against the specific conditions. According to one embodiment, the conditions comprise face pose ("Face Towards") and face distance from the display. According to a second embodiment, the conditions comprise eyes directed towards the display ("Eyes On").
[00102] In a real-world situation, a single face track consists of one or more contiguous segments that satisfy the conditions. Each such segment constitutes an impression. The total track corresponds to one viewer engagement (consisting of one or more viewer impressions), and the net viewing time (cumulative duration of all impressions) is the viewer attention span for that engagement. For any given point in time, the number of active (currently satisfying all conditions) impressions is the current View Count (as a function of time). Fig. 5 describes a method for audience measurement metrics according to embodiments of the present invention. A single camera object tracker 40 comprises a single camera of the camera assembly according to embodiments of the present invention and a corresponding processor (possibly the processor of the camera itself).
[00103] A single camera object tracker 40 acquires images of the scene and the image data undergoes face detection 42. After face detection, face tracking 43 may be performed, while concurrently or alternatively face feature location 44 may be performed, as well as face pose estimation 46 and face distance estimation 48. The processed information (which may include ID, location, pose and distance) is passed through audience measurement logic, which may output viewer count, viewer engagement and viewer attention span. Other parameters and measurements may be deduced from the image data as well.
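A minimal sketch of how these per-track metrics can be derived from a sampled condition signal is shown below; the boolean-per-frame representation and the frame period are assumptions made for illustration.

```python
def engagement_metrics(track_frames, frame_period_s=0.2):
    """Derive impressions and attention span from one face track, where
    track_frames is a list of booleans: True whenever all attention conditions
    (e.g. face towards the display, within viewing range) hold for that frame."""
    impressions, run = [], 0
    for ok in track_frames + [False]:      # sentinel closes the final run
        if ok:
            run += 1
        elif run:
            impressions.append(run * frame_period_s)
            run = 0
    return {
        "impressions_s": impressions,              # durations of contiguous segments
        "attention_span_s": sum(impressions),      # net viewing time for the track
        "engagements": 1 if impressions else 0,    # whole track = one engagement
    }
```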
[00104] Face tracking as described above can assign and maintain a unique face ID for a dense time-series of face detections. Additionally, when the face rotates from a frontal pose, template matching can guarantee continuous tracking, even when face detection criteria are not satisfied.
[00105] However, if the viewer moves behind other members of the audience, or temporarily goes out of the detection range or angle of the camera, then such viewer behavior may create two or more different tracks. [00106] It is desired to establish whether two or more distinct tracks belong to the same viewer, in order to provide an accurate count of viewer engagements and, correspondingly, accurate viewer attention spans. [00107] Extended Viewer Tracking examines all pairs of non-overlapping tracks, first applying a time difference criterion: for a pair of tracks represented by the time spans [t1, t2] and [Ts, Te] (Ts indicating the start time and Te the end time of viewer detection in the second track), require: [00108] 0 < (Ts - t2) < Tmax
[00109] in order to constitute a single engagement.
[00110] Then, using the time gap (Ts - t2) and the spatial distance between the 1st track end and the 2nd track start, normalized by the face size, improbable matches (in terms of viewer speed) are rejected. [00111] Given two such tracks, a process for deciding whether the objects depicted in these tracks are similar enough to be considered as the same object is described herein, thereby updating the number of different viewers and the corresponding attention spans. The process is based on prior art techniques in the field of face recognition.
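A sketch of this gating step follows; the track fields, the Tmax value and the speed limit (expressed in face widths per second) are assumed placeholders rather than values taught by the invention.

```python
def plausible_same_viewer(track1, track2, t_max_s=5.0, max_speed=3.0):
    """Reject implausible track pairs before the face-similarity comparison:
    the second track must start shortly after the first ends, and the implied
    motion, normalized by face size, must correspond to a plausible speed."""
    gap_s = track2["start_time"] - track1["end_time"]
    if not (0 < gap_s < t_max_s):
        return False
    dx = track2["start_x"] - track1["end_x"]
    dy = track2["start_y"] - track1["end_y"]
    dist_in_face_widths = (dx * dx + dy * dy) ** 0.5 / track1["face_size_px"]
    return dist_in_face_widths / gap_s <= max_speed   # face widths per second
```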
[00112] Each track consists of one or more facial images. A straightforward solution may be to compare two tracks by computing the facial similarity between each pair of facial images taken from the two tracks. The tracks will be decided as belonging to the same viewer based on the maximum similarity score. Again, similarity may be computed by Elastic Bunch Graph Matching or EBGM as described by Wiskott et al.
[00113] Such a method is robust, but may be very time consuming, especially as a track may consist of dozens of images. A faster approach according to embodiments of the present invention may be to select a characteristic set of facial images (not more than 5 images per track) and then proceed as described above. The characteristic images can be selected arbitrarily (equally spaced in time), based on a facial measure (frontal pose, sharpness, etc.), or so as to maximize inter-image distance between the selected images in order to maximize the variability (and hence the information content) within the set.
[00114] Another method according to embodiments of the present invention may be hierarchical, using a fast comparison method to find the most similar image pair and then using a slower, robust method to obtain the similarity score for the decision making. The faster process may use a lower-resolution image, fewer facial features or a faster representation and distance metric such as the one based on Eigen-Faces. [00115] Once the pairing is established, corresponding face tracks are merged, the number of viewer engagements is decremented and the attention span for the resulting viewer engagement is set to the sum of the respective spans.
[00116] The same object can initially give rise to more than two tracks. Thus the track-pairs similarity process may produce pairs like (g, h), (g, s) and (s, r). Using prior art union-find algorithms, all tracks will be assigned an identical object ID (g, h, s and r assigned to g in the example above) and the number of engagements and respective attention spans are modified accordingly.
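The union-find step can be sketched as follows; the three pairs reproduce the (g, h), (g, s), (s, r) example above.

```python
class UnionFind:
    """Assigns one representative object ID to all tracks judged to be the same viewer."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]   # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(b)] = self.find(a)

uf = UnionFind()
for a, b in [("g", "h"), ("g", "s"), ("s", "r")]:
    uf.union(a, b)
print({t: uf.find(t) for t in "ghsr"})   # every track maps to object ID 'g'
```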
[00117] Fig. 6 illustrates a method for extended audience measurement according to embodiments of the present invention. In parallel to performing audience measurement logic 50, extraction of face characteristics data 52 may be performed. The processed data is combined by the extended tracker 54 (the processor in charge of merging the separately processed information from each camera), which produces total count, engagement and attention span for each identified viewer. [00118] Methods according to the present invention for combining processed data from two or more such cameras and processing the combined information are described hereinafter.
[00119] In a broad sense, accurate multi-camera counting is achieved by: independently detecting images of objects in each of said two or more overlapping views, detecting duplicate images of the same object in said two or more overlapping views and then reporting the number of unique objects as the number of unduplicated object images. Also, to ensure that the correct number of unique objects is maintained at all times and that metrics associated with such unique objects are correctly extracted, objects are tracked as they pass from one camera to the other through the overlap zone. [00120] According to embodiments of the present invention, multiple camera audience measurement may be executed in the following steps:
[00121] performing geometric image correction (to overcome distortion);
[00122] computing image-image transformation or calibration;
[00123] image-image object tracking; [00124] deriving the audience measurement metrics.
[00125] According to embodiments of the present invention, a wide angle audience measurement camera assembly (see Fig. 2 and Figs. 3A through 3C) includes two or more wide angle cameras. Barrel distortion has limited effect on the Single Camera Processing as locally (within the image area occupied by a single face) the distortion is small. However, in order to facilitate audience measurement according to embodiments of the present invention, and in particular the steps of calibration and object-based deghosting, geometric correction of barrel distortion is applied in real-time to the images captured by the camera, as the first step in the processing of the digitized images. The barrel distortion and methods of modeling it and correcting it are known.
[00126] Parameters of the distortion and corresponding parameters of the correcting geometric transformation can be computed from the lens parameters as provided by the lens manufacturer. Alternatively, one may construct a calibration setup, capturing an image of a calibration target and computing the essential lens parameters from that image. Alternatively, as a preparatory stage, one may iterate with different values in the correction function until straight-line edges in the scene indeed appear as straight lines in the corrected image.
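As an illustration, once the lens parameters are known the per-frame correction can be a single call to a standard undistortion routine; the intrinsic matrix and distortion coefficients below are hypothetical placeholders, not values for any particular lens.

```python
import cv2
import numpy as np

# Hypothetical intrinsics and radial/tangential distortion coefficients; in
# practice they come from the lens vendor or an on-site calibration procedure.
camera_matrix = np.array([[460.0, 0.0, 320.0],
                          [0.0, 460.0, 240.0],
                          [0.0, 0.0, 1.0]])
dist_coeffs = np.array([-0.30, 0.10, 0.0, 0.0, 0.0])   # k1, k2, p1, p2, k3

def correct_barrel_distortion(frame):
    """Apply the geometric correction before any calibration or deghosting step."""
    return cv2.undistort(frame, camera_matrix, dist_coeffs)
```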
[00127] The image-image transformation function is now considered. The transformation function, which maps the object location in the left image to that in the right image, or vice versa, depends on the object distance from the camera. The extent of the overlap zone, where such a transformation holds, depends also on that distance.
[00128] Given an object in one camera that is located in the overlap zone, one can predict its location in the second camera if one knows its distance from the camera. By calibrating in advance the relationship between the face size and its distance from the camera, we can calculate the function F(X1,Y1,S1) -> (X2,Y2,S2), where X1,Y1 denotes the location of the face center in the left image, S1 is the face size, and X2,Y2,S2 are the corresponding values in the right camera. Since measured face size varies (between different people but also for the same person), and due to geometric distortions, we cannot compute the person's distance from the camera in an exact manner. Therefore, the face position needs to be refined, by using some correlation searching inside an uncertainty zone, or by another method. Face image size may be derived from the image distance between the eyes or from another size-related measurement.
[00129] A transformation function F can be computed from the apparatus design parameters or automatically computed by moving a single calibration object from one camera view to the other camera view, at multiple object distance values. As the mechanical arrangement between the cameras is fixed, F is computed once for that specific arrangement. [00130] Tracking of an object from one camera to the other, across the overlap zone is hereinafter explained, by way of example.
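One possible realization of F is a small calibration table indexed by face size (a proxy for distance), interpolated at run time; the table values below are invented placeholders standing in for measurements collected with the fixed camera arrangement.

```python
import numpy as np

# Assumed offline calibration: rows of (face_size_left_px, dx_px, dy_px, size_ratio)
# gathered by moving a calibration target across the overlap zone at several distances.
CALIB = np.array([
    [20.0, -410.0, 2.0, 0.98],
    [40.0, -380.0, 1.0, 1.00],
    [80.0, -340.0, 0.0, 1.02],
])

def predict_right_location(x_left, y_left, size_left):
    """Predict (x, y, size) in the right camera from a left-camera detection,
    interpolating the calibration table by face size."""
    dx = np.interp(size_left, CALIB[:, 0], CALIB[:, 1])
    dy = np.interp(size_left, CALIB[:, 0], CALIB[:, 2])
    ratio = np.interp(size_left, CALIB[:, 0], CALIB[:, 3])
    return x_left + dx, y_left + dy, size_left * ratio
```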
[00131] When a face is detectable by both cameras simultaneously, the face equivalence is readily established from the transformation function F. In the case that F is not tight enough to establish correspondence, a face similarity metric as described hereinabove, may be used to uniquely establish correspondence.
[00132] Fig. 7 illustrates a method according to embodiments of the present invention, for establishing equivalence between a left track and a right track, based on at least a single time instance in which a face is detectable in both the left and the right side of the overlap zone.
[00133] The more difficult case, in which the faces are not simultaneously detectable, is now addressed. A person may initially be detectable by only one camera, for example, on the left side. While in motion, the person does not look at the camera and enters the overlap zone. Then, the person is no longer visible to the first camera but is again looking towards the second camera, and is therefore detectable. To establish object equivalence (in the specific case of left-right tracking):
[00134] Executing Face Detection in one camera by the camera object tracker (left one in this example, 40b);
[00135] Using image tracking (for example by template matching), to obtain a current estimate of object location (see Fig. 5).
[00136] Determining if the person enters the overlap zone 60b, and if so, using the calibration table 62, performing an image-image transformation to predict its location in the other (right) camera 64.
[00137] Continuing tracking in the other (right) camera concurrently with a face detection process. The match may be stored 66.
[00138] Whenever the tracking coincides with face detection, equivalence may need to be established between the face initially detected in the left camera and the recent right-camera detection. Again, in the case that the transformation function F is not tight enough to uniquely establish correspondence, a face similarity metric such as, for example, the one described hereinabove, may be used for verification.
[00139] It is now possible to use object equivalences, as derived by the image-image tracking, to derive correct audience measurement metrics.
[00140] The concept of a timeline is now explained and an audience measurement method according to embodiments of the present invention is discussed based on that concept. The method is described in a multi-pass, offline manner which is suitable for generating audience measurement reports. Also, for some applications, audience measurement is done per specific time segment, such as a promotional / advertising video clip, in which case the multi-pass method can be applied after the end of each clip.
[00141] Whenever the information is required in near real-time, using known concepts of data processing, the method can be converted to a pipelined process in order to provide wide area audience counts and attention spans after a small processing delay. Such real-time information can be used for proactive advertising — modifying the content on display based on current audience statistics / demographics.
[00142] Fig. 4 depicts a timeline representation showing viewer tracks in the left and right cameras. The representation is shown for a specific time duration of 35 seconds, which may be aligned with the play schedule of a specific program or advertising segment.
[00143] Each bar denotes a track as generated by the face tracking process described above. A filled bar portion 68 denotes the time duration in which one or more conditions are satisfied. In this specific example, one condition is face towards the display as decided by a face pose detection algorithm. Another condition is the face being within the effective viewing range for the display. A void bar portion 70 denotes the time duration in which the face is tracked but said one or more conditions are not satisfied and thus does not contribute to the audience count. For this specific example, the face may be rotated away from the screen or not in the effective viewing range.
[00144] A darkly filled bar 72 denotes a face position inside the overlap zone. The span value denotes the attention span of an individual member of the audience. Thus for the left camera 5 viewers (a - e) were measured initially, with attention spans of (11, 12, 6, 14 and 5) seconds correspondingly. The viewers count as a function of time can be readily derived from the timeline graph by counting the number of active tracks at each time measure (second). Similarly, the right camera captures 4 viewers (A - D) with attention spans of (12, 18, 0 and 10) seconds correspondingly. The viewers count is derived in a similar manner.
[00145] As the method of image-image tracking enables establishing correspondence between members of the audience detected in the left camera and those detected in the right camera, once such correspondence is established, the audience measurement metrics derived as described above reflect that correspondence:
[00146] The Total Count metric is decremented for the entire overlap span of the tracked members of the audience, detected both by the left and right camera and paired. The corresponding engagements are merged into a single engagement:
[00147] The number of engagements is decremented;
[00148] The attention span for the single engagement is set to the sum of the respective spans minus the overlap span(s).
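The merge arithmetic can be sketched as follows, assuming the per-second counts and the double-counted overlap seconds have already been derived from the timeline.

```python
def merge_paired_engagements(span_left_s, span_right_s, overlap_s):
    """Two paired tracks become one engagement whose attention span is the sum
    of the two spans minus the time counted twice inside the overlap zone."""
    return {"engagements": 1,
            "attention_span_s": span_left_s + span_right_s - overlap_s}

def merged_view_count(count_left, count_right, double_counted):
    """Per-second viewer count for the whole sector: add both cameras' counts,
    then subtract the number of paired viewers momentarily seen by both."""
    return [l + r - d for l, r, d in zip(count_left, count_right, double_counted)]
```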
[00149] In the timeline example shown in Fig. 4, 9 engagements were obtained (a-e, A-D) with attention spans of (11, 12, 6, 14, 5, 12, 18, 0 and 10) correspondingly. The viewers count as a function of time is obtained by adding the counts and then decrementing the count during the respective overlap durations.
[00150] Here too, extended face tracking may be used to detect a situation where an object moves from one image to another without being tracked at all in the overlap area. Tracks c and D in Fig. 4 may represent such a case. In order to effect extended face tracking between the cameras, pairs of tracks from both cameras (one from each) are considered as well.
[00151] Object detection and tracking to be used in audience measurement systems are computationally demanding. As it is desired to capture all members of the audience within the entire effective viewing range of the display and the entire effective viewing angle of the display, high-resolution image capture is recommended. It may be difficult or uneconomical for a single processor to execute face detection and tracking in this case, due to bandwidth or processing power limitations. Methods according to embodiments of the present invention lend themselves to parallel implementation, in which each processor receives the images captured by one of the cameras, conducting the computationally exhaustive processes of face detection and tracking, as well as the tasks of detecting if one or more conditions are satisfied, such as face towards the display, eyes on the display and face within the effective viewing distance. Then, a low-bandwidth, low-computation process addresses object deghosting and extended tracking as described above.
[00152] Fig. 8 illustrates a method for wide angle audience measurement according to embodiments of the present invention. In an embodiment of the present invention, each processor may be a media Digital Signal Processor (DSP) such as the C6446 by Texas Instruments, where the inter-processor communication is done via the serial audio port. The left camera image sequence is processed by the left camera object tracker 40b, which uses a combination of object detection and tracking to locate and follow objects of interest for the audience measurement application (such as human faces) in the images captured by the left camera.
[00153] The right camera image sequence is similarly processed by the right camera object tracker 40a. The object trackers may also be responsible for testing the specific measurement conditions at a pre-specified time resolution and recording the time intervals in which each of these conditions is satisfied. The measurement conditions may comprise object distance, Face Towards, Eyes On or similar conditions. The object tracker may also gather object descriptors (for example face signatures) as may be required by the extended object tracker (54a, 54b) to establish that two distinct tracks in the same camera assembly belong to the same object.
[00154] The multi-camera process comprises predicting the left image location (if any) corresponding to the current object location in the right camera, and vice versa. These locations are fed into the respective trackers and matched with the current object being tracked in that specific camera. Any match, for example a right prediction of a left track coinciding with a right track, is recorded using the left and right track indices (for example B, a in Fig. 4).
[00155] Additionally, a Camera to Camera extended tracker 55 may detect matches between left and right tracks that were not detected while passing through the overlap zone. The set of matched tracks, whether found by the left or right camera extended trackers, left or right object predictions, or the camera to camera extended tracker, is resolved via equivalence processing (union-find) by the wide area audience measurement logic 57, to produce the resulting counting functions, Total Viewer Engagements and the respective attention spans, taking into consideration one or more Audience Measurement conditions as described above. [00156] It should be clear that the description of the embodiments and attached Figures set forth in this specification serves only for a better understanding of the invention, without limiting its scope.
[00157] It should also be clear that a person skilled in the art, after reading the present specification, could make adjustments or amendments to the attached Figures and above-described embodiments that would still be covered by the present invention.

Claims

1. A method for determining attention metrics of human objects towards a display, the method comprising: acquiring image data by at least one imaging device whose field of view substantially covers a sector of interest relating to the display; processing the image data using an image data processing algorithm comprising: detecting presence of the human objects, determining at least one attention metrics parameter from a group of parameters which includes: face towards the display and eyes on the display.
2. The method as claimed in claim 1, wherein the group of parameters also includes proximity to the display and face expression.
3. The method as claimed in claim 1, further comprising determining attention time span.
4. The method as claimed in claim 1, wherein said at least one imaging device comprises two or more imaging devices, the field of view of each camera only covering a portion of the sector of interest, with an overlap zone between two adjacent imaging devices, so as to substantially cover the sector of interest.
5. The method as claimed in claim 4, wherein at least one human object can be fully captured in the overlap zone.
6. The method as claimed in claim 4, wherein the step of processing the image data comprises using two processors, one processor for each camera for single camera processing, and using a processor for extended processing to determine said at least one attention parameter.
7. The method as claimed in claim 4, wherein the step of processing the image data further comprises separately detecting the presence of one or more of the human objects by each of said two or more imaging devices and predicting respective presence and location of said one or more human objects in the field of view of other of said two or more imaging devices.
8. The method as claimed in claim 6, comprising using a transformation function in the predicting of the respective presence of the human objects in the field of view of the other of said two or more imaging devices.
9. An apparatus for detecting and tracking objects in a given sector of interest, the apparatus comprising: an imaging assembly comprising two or more imaging devices, the field of view of each imaging device only covering a portion of the sector of interest, with an overlap zone between fields of view of two adjacent imaging devices, so as to substantially cover and acquire image data from the given sector of interest; a processing unit for detecting and tracking of the human objects in the sector of interest.
10. The apparatus as claimed in claim 9, wherein the processing unit comprises two processors, one processor for each imaging device for single imaging device processing of image data to obtain object detecting and tracking information; and a processor for extended processing of the object detecting and tracking information.
11. The apparatus as claimed in claim 1, wherein the imaging assembly comprises a housing with windows and a rack for fixedly positioning the imaging device within the housing.
12. The apparatus as claimed in claim I5, wherein the processor for each imaging device is an integral processor of the imaging device.
13. The apparatus as claimed in claim 12, wherein the processor for extended processing is one of the two processors.
14. The apparatus as claimed in claim 9, wherein the imaging devices are video cameras.
15. The device as claimed in claim 9, wherein the processing unit includes an algorithm comprising: detecting presence of the human objects, determining at least one attention metrics parameter from a group of parameters which includes: face towards the display and eyes on the display.
16. The device as claimed in claim 15, wherein the group of parameters also includes proximity to the display and face expression.
17. The device as claimed in claim 15, wherein the algorithm further includes determining attention time span.
18. The device as claimed in claim 9, wherein the overlap zone between fields of view of two adjacent imaging devices can fully capture one of the human objects.
PCT/IL2008/000569 2007-04-30 2008-04-29 Apparatus and method for tracking human objects and determining attention metrics WO2008132741A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US92407107P 2007-04-30 2007-04-30
US60/924,071 2007-04-30

Publications (2)

Publication Number Publication Date
WO2008132741A2 true WO2008132741A2 (en) 2008-11-06
WO2008132741A3 WO2008132741A3 (en) 2010-02-25

Family

ID=39926191

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IL2008/000569 WO2008132741A2 (en) 2007-04-30 2008-04-29 Apparatus and method for tracking human objects and determining attention metrics

Country Status (1)

Country Link
WO (1) WO2008132741A2 (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030060897A1 (en) * 2000-03-24 2003-03-27 Keisuke Matsuyama Commercial effect measuring system, commercial system, and appealing power sensor
US7133908B1 (en) * 2000-11-24 2006-11-07 Xerox Corporation Metrics and status presentation system and method using persistent template-driven web objects
US20030040815A1 (en) * 2001-04-19 2003-02-27 Honeywell International Inc. Cooperative camera network
US20050268228A1 (en) * 2004-05-26 2005-12-01 Cimarron Buser System and method for converting a page description file to a composite representation thereof for fast Web viewing

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2136329A2 (en) 2008-06-05 2009-12-23 Deutsche Telekom AG Comprehensive computer implemented system and method for adapting the content of digital signage displays
GB2476500A (en) * 2009-12-24 2011-06-29 Infrared Integrated Syst Ltd System for monitoring the activity of people
GB2476500B (en) * 2009-12-24 2012-06-20 Infrared Integrated Syst Ltd Activity mapping system
US8731241B2 (en) 2009-12-24 2014-05-20 Infrared Integrated Systems Limited Activity mapping system
WO2012074359A1 (en) * 2010-12-04 2012-06-07 Mimos Berhad A method of detecting viewer attention
WO2017035025A1 (en) * 2015-08-21 2017-03-02 T1V, Inc. Engagement analytic system and display system responsive to user's interaction and/or position
WO2018018022A1 (en) * 2016-07-21 2018-01-25 Aivia, Inc. Interactive display system with eye tracking to display content according to subject's interest
US10528817B2 (en) 2017-12-12 2020-01-07 International Business Machines Corporation Smart display apparatus and control system
US11113533B2 (en) 2017-12-12 2021-09-07 International Business Machines Corporation Smart display apparatus and control system
CN110728173A (en) * 2019-08-26 2020-01-24 华北石油通信有限公司 Video transmission method and device based on target of interest significance detection
CN112070670A (en) * 2020-09-03 2020-12-11 武汉工程大学 Face super-resolution method and system of global-local separation attention mechanism
CN112070670B (en) * 2020-09-03 2022-05-10 武汉工程大学 Face super-resolution method and system of global-local separation attention mechanism

Also Published As

Publication number Publication date
WO2008132741A3 (en) 2010-02-25

Similar Documents

Publication Publication Date Title
US11042994B2 (en) Systems and methods for gaze tracking from arbitrary viewpoints
US9443144B2 (en) Methods and systems for measuring group behavior
Zelnik-Manor et al. Statistical analysis of dynamic actions
US9892316B2 (en) Method and apparatus for pattern tracking
US9124778B1 (en) Apparatuses and methods for disparity-based tracking and analysis of objects in a region of interest
Shreve et al. Macro-and micro-expression spotting in long videos using spatio-temporal strain
US9122931B2 (en) Object identification system and method
WO2008132741A2 (en) Apparatus and method for tracking human objects and determining attention metrics
JP6077655B2 (en) Shooting system
Dai et al. Towards privacy-preserving recognition of human activities
CN103501688A (en) Method and apparatus for gaze point mapping
US20140333775A1 (en) System And Method For Object And Event Identification Using Multiple Cameras
CN109690624A (en) Automatic scene calibration method for video analysis
US11527000B2 (en) System and method for re-identifying target object based on location information of CCTV and movement information of object
KR102001950B1 (en) Gaze Tracking Apparatus and Method
US9361705B2 (en) Methods and systems for measuring group behavior
CN111488775A (en) Device and method for judging degree of fixation
US20220004748A1 (en) Video display method, device and system, and video camera
Reno et al. Tennis player segmentation for semantic behavior analysis
JP6939065B2 (en) Image recognition computer program, image recognition device and image recognition method
Jha et al. FI-CAP: Robust framework to benchmark head pose estimation in challenging environments
Chen Capturing fast motion with consumer grade unsynchronized rolling-shutter cameras
Bashir et al. Video surveillance for biometrics: long-range multi-biometric system
CN111447403B (en) Video display method, device and system
KR101587533B1 (en) An image processing system that moves an image according to the line of sight of a subject

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 08738271

Country of ref document: EP

Kind code of ref document: A2

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 08738271

Country of ref document: EP

Kind code of ref document: A2