SYSTEMS AND METHODS FOR SOFTWARE CONTROL
THROUGH ANALYSIS AND INTERPRETATION OF
VIDEO INFORMATION
CROSS-REFERENCE TO RELATED APPLICATION
This application claims the benefit of U.S. Provisional Application No. 60/054,498 ("System and Methods for Software Control Through Analysis and Interpretation of Video Signals") filed July 31, 1997, which is hereby incorporated by reference.
BACKGROUND
1. Field of the Invention
The present invention relates to data processing by digital computers, and more particularly to a video image based user interface for computer software.
2. Background of the Prior Art
The development of improved user interfaces is a primary concern of the computing industry. The conventional method to provide interaction with a computer requires user manipulation of an input device. Examples of commonly used input devices include absolute positioning devices such as light pens, touch screens and digitizing tablets, and relative positioning devices such as mouse devices, joy sticks, touch pads and track balls. In a typical operation, the user manipulates the input device (which may involve translation or rotation of one or more active elements) such that an associated cursor or pointer is aligned with an object or area of interest displayed on a computer monitor. The user may then engage other active elements of the input device (such as a mouse button) to perform a function associated with the object or area of
interest. In this manner, the input device is utilized to perform functions such as selecting and editing text, selecting items from a menu, or firing at targets in a game or simulation software application.
User interfaces requiring manipulation of an input device have a number of problems and limitations associated therewith. First, input devices may become partially or fully inoperative over time due to mechanical wear, breakage, drifting of optical or electronic elements, and contamination of working parts due to exposure to dirt and particulate matter. Second, the method of operation of input devices to effect various functions is not intuitive, and users may need substantial training and experience before they are comfortable with and proficient in the operation of a particular input device. This problem may be especially acute with respect to unsophisticated users, such as young children, or users who have little or no prior experience with personal computers. Further, certain users, such as elderly or physically disabled persons, may not possess the requisite dexterity and motor skills to precisely manipulate an input device so as to perform a desired function.
Another problem associated with conventional input devices is their tendency to limit the ability of a user of game or simulation software to conceptually immerse himself in the game environment displayed by the computer monitor. It has been recognized that the commercial success of computer game software depends in large part on the ability of the user to mentally project himself into the virtual world which the game presents. To this end, computer game software and hardware developers have expended extensive efforts toward generating more detailed and realistic graphics, accelerating processing of input, creating more complex game characters and scenarios, and adding sound effects and other audio cues. However, the necessity of using a conventional input device, such as a mouse, to interact with the game continually reminds
the user of the artificiality of the game environment, and thereby hinders the conceptual immersion process.
In view of the foregoing and other deficiencies associated with conventional input devices, there has been a substantial amount of activity directed toward the development of more intuitive interfaces, wherein natural actions or movements of the user are utilized to provide software control. One such interface employs a video camera or similar device to capture an image of the user or portion thereof. The movement and/or positioning of the user is then analyzed and used to generate input to a software application. The patent prior art contains numerous references disclosing video image based computer interfaces. Illustrative examples of such prior art interfaces are disclosed in U.S. Pat. No. 5,528,263 to Platzker et al.; U.S. Pat. No. 5,534,917 to MacDougall ("Video Image Based Control System"); U.S. Pat. No. 5,616,078 to Oh ("Motion-Controlled Video Entertainment System"); U.S. Pat. No.
4,843,568 to Krueger et al. ("Real Time Perception of and Response to the Actions of an Unencumbered Participant/User"); U.S. Pat. No. 5,297,061 to Dementhon et al. ("Three Dimensional Pointing Device Monitored by Computer Vision"); and, U.S. Pat. No. 5,423,554 to Davis ("Virtual Reality Game Method and Apparatus").
However, prior art video image based interfaces have a number of associated disadvantages, and consequently have not possessed significant commercial appeal. Many of the prior art interfaces require that the user wear a distinctively colored or shaped article of clothing (such as a glove) in order to enable tracking of the user's movement by the interface software. Other interfaces similarly require the user to hold a specially shaped article (see Dementhon et al.) or to place markers about his or her body (see Oh). Additionally, many prior art interfaces are
extremely limited in their functionality, are computationally expensive, and/or must be implemented in a specialized environment.
What is lacking in the prior art is a video image based interface system and method which enables software control by natural user motions and gestures, operates without specialized clothing, markers, or equipment, does not require excessive computational resources, and may be utilized to effect numerous functions in connection with a variety of application software and operating environments.
SUMMARY OF THE INVENTION
Briefly described, the invention comprises a video image based user interface for a computer system wherein a user, by making the appropriate movements, may interact with a computer-generated object or area of interest displayed on a monitor. A video camera, coupled to a computer system, generates signals representative of the image of the user. The images are stored as successive video frames. A video processing engine is configured to process each frame to detect a motion of the user, and to determine a user action with respect to the computer-generated object or area of interest. Preferably, the processing engine includes a set of software devices which sequentially apply various transforms to the video frame so as to achieve the desired analysis. The computer-generated object or area of interest (as modified by the user action) is displayed on the monitor together with at least a portion of the user image to provide visual feedback to the user.
The concepts of the invention can be employed in a variety of modes. In a first mode of the invention, a computer-generated object representative of a sport ball describes a trajectory on the monitor. By making the appropriate movements, the user may simulate striking the ball to change its trajectory. In this manner, the first mode of the invention facilitates playing of simulated sports or games such as volleyball. The first mode of the invention may also be adapted to enable a user to engage or adjust the position of a conventional user interface control object, such as a button, icon, or slider. According to a second mode of the invention, the processing engine is configured to recognize and track an article, typically representing a weapon, held by the user. The user, by adjusting the position and orientation of the article, may perform certain actions with respect to a computer-generated object. For example, the user may aim
and fire at an object representing a target. The article may be visually displayed on the monitor in a visually altered or idealized form.
A third mode of the invention provides an interactive kiosk for selectively presenting information to a user. The kiosk includes a display on which is presented a set of user interface control objects. For example, an interactive kiosk located in a tourist information center may display icons representative of lodging, food, entertainment, and transportation. A video camera disposed on or proximal to the display records an image of the user, at least a portion of which is presented on the display. Video frames captured by the video camera are then processed in the manner described above to detect a motion of the user and determine an action with respect to the user interface controls, namely, an engagement of a certain icon. When a determination is made that a certain icon has been selected and engaged by a user, information related to the selected icon (e.g., a list of hotels, restaurant menus, train schedules) is responsively presented on the display. The kiosk may be provided with a modem or network interface so as to enable remote updating of informational content.
Further encompassed by the invention is a system and method for determining whether a pixel lies in the foreground (dynamic) or background (static) portion of a video frame. For each pixel in a video frame, a characteristic value is calculated based on its color and luminosity. The characteristic value may comprise, for example, the sum of the pixel's red, green, and blue color values. The characteristic value for the pixel is then compared with the values calculated for spatially corresponding pixels in previously captured video frames. If it is determined that the characteristic value is substantially equal to a value which has recurred a predetermined number of times in previous video frames, the pixel is considered to be in the background of the captured image, and is
processed accordingly. If the calculated characteristic value is different from previously calculated values, then the pixel is considered to be in the foreground of the image.
BRIEF DESCRIPTION OF THE FIGURES
In the accompanying figures:
FIG. 1 is a schematic diagram showing the operating environment of the video image based user interface of the present invention;
FIG. 2 is a block diagram showing various software components of the invention stored in a computer memory;
FIG. 3 is a block diagram of a video frame;
FIG. 4 is a block diagram showing components of a video processing engine;
FIG. 5 is a block diagram showing components of a generic software device;
FIG. 6 is a block diagram of an exemplary library of software devices;
FIG. 7 is a flowchart depicting the steps of a method for determining whether a pixel lies in the foreground or background portion of a video frame;
FIG. 8 is a graph of pixel value versus time for an exemplary pixel;
FIG. 9 is a schematic of a software device train for processing of a video frame;
FIG. 10 is a schematic of a software device train for an embodiment of the invention wherein two video cameras are employed;
FIG. 11 is a perspective view of a user and computer system according to a first mode of the video image user interface system of the invention;
FIG. 12 is a schematic of a preferred software device train for implementing the first mode of the invention;
FIG. 13 is a perspective view of a user and computer system according to a second mode of the invention;
FIG. 14 is a schematic of a preferred software device train for implementing the second mode of the invention; and
FIG. 15 is a perspective view of an interactive kiosk according to a third mode of the invention.
DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS
The following detailed description illustrates the invention by way of example, not by way of limitation of the principles of the invention. This description is intended to enable one skilled in the art to make and use the invention, and describes several embodiments, adaptations, variations, alternatives and uses of the invention.
Reference is initially directed to FIG. 1, which depicts in schematic form a typical computer system 100 for implementation of the video image user interface system and method of the present invention. The computer system 100 includes a monitor 102 for displaying images to a user; a memory 104 for storage of data and instructions therein; a video camera 108 for generating image data; and a processor 106 for executing instructions and for controlling and coordinating the operation of the various components. The computer system 100 may further include a set of conventional input devices 110, such as a keyboard and a mouse device, for receiving positional and alphanumeric input from the user. The monitor may comprise a conventional Cathode Ray Tube (CRT) type RGB monitor, but may alternatively comprise a Liquid Crystal Display (LCD) monitor or any other suitable substitute. The memory may consist of various storage device configurations, including random access memory (RAM), read-only memory (ROM), and non-volatile storage devices such as CD-ROMs, floppy disks, and hard disk drives. The processor may be, for example, an Intel® Pentium® microprocessor. The components are coupled in communication by a bus 112, which enables transfer of data and commands between and among the individual components. The bus 112 may optionally be connected to a communications interface (not shown), such as a modem or Ethernet® card, to permit the computer system 100 to receive and transmit information over a private or public network.
The video camera 108 may comprise any device capable of generating electronic signals representative of an image. Computer-specific video cameras are generally configured to transmit digitized electronic signals via an established communications port, such as a parallel or serial port. Examples of video cameras of this type include the Connectix® Quick Cam® (which connects to a computer via a parallel port) and the Intel® Create and Share™ camera (which connects to a computer via a serial port). Standard camcorders may also be utilized, but require a video frame grabber board to digitize and process the video signals prior to sending the image data to other components of the computer system. The video camera 108 preferably generates color video signals; however, certain features and aspects of the invention may be implemented using a monochromatic video camera. As is depicted in FIG. 1, a second video camera 114 coupled to the system bus 112 may be provided to achieve certain objectives of the invention, namely those involving depth analysis, as described hereinbelow.
FIG. 2 is a map of computer system memory 104 having software and data components of the video image interface system contained therein. The memory 104 preferably holds an operating system 202 for scheduling tasks, allocating storage, and performing various low-level functions; device drivers 204 for controlling the operation of various hardware devices; a device module layer 206 for providing an interface with the video camera 108; a video processing engine 208, including a set of software devices for analyzing and transforming the video frames so as to detect a motion of the user and infer an action relative to a computer generated object displayed on the monitor; and, a class library 210, which includes one or more applications classes.
The device module layer 206 includes a set of device modules configured to receive image data generated by the video camera 108, and
to store the image data as successive video frames. The device module layer 206 may include one or more certified device modules designed to support a specified model of a video camera from a particular vendor, and a generic device module designed to work with a range of video camera types. The generic device module may cooperate with one or more device drivers 204 to achieve the function of storing video frames in the frame buffer 212.
Video frames captured by the relevant device module are held in a video frame buffer 212 until the video processing engine 208 is ready to initiate processing thereof. To minimize latency concerns, the video frame buffer 212 stores a small number of frames, preferably two. The video frame buffer 212 may be of the ring type, wherein a storage pointer points to the oldest frame for overwriting thereof by an incoming video frame, or have suitable alternative structure. FIG. 3 is a block diagram of a video frame 300 configured in accordance with a preferred implementation of the invention. The video frame 300 has a two-part structure. The first portion of the video frame 300 is a variable format pixel array 302. The color and luminosity of each pixel within the array may be encoded in a variety of formats used in the art, such as 8-bit or 16-bit RGB. The second portion of the video frame 300 comprises a key buffer 304 having array dimensions corresponding to the variable format pixel array 302. The key buffer 304, which begins in an uninitialized state, is employed to store results of processing operations effected by the software devices. Processing of the video frame 300 is performed by the video processing engine 208. As may be seen from FIG. 4, the video processing engine 208 comprises a kernel 402 and a set of software devices 404. The kernel 402 operates to route the video frame 300 through a train of software devices 404, which sequentially process the video frame 300 in
order to achieve a desired result. The kernel 402 performs the routing by executing a simple polling loop. The kernel queries each software device 404 to determine if device output is available. If the kernel 402 determines that output is available from a first software device, the kernel then queries a second software device known to receive input from the first software device (i.e., the next software device in the train) whether it is ready to receive input. If the second software device is ready, then the video frame 300 is passed thereto. This polling loop is continually executed to enable routing, on a one-at-a-time basis, of video frames 300 through the linked train of software devices 404.
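The polling loop described above may be sketched as follows. This is an illustrative sketch only; the Device interface, its flag handling, and the run_kernel function are assumptions for purposes of explanation, not the patent's actual implementation:

```python
class Device:
    """A software device with input-ready and output-ready states.

    The device is ready for input when it holds no frame, and its
    output is ready when a transformed frame is waiting to be taken.
    """
    def __init__(self, transform):
        self.transform = transform
        self.frame = None  # held output frame, if any

    def ready_for_input(self):
        return self.frame is None

    def output_ready(self):
        return self.frame is not None

    def put(self, frame):
        self.frame = self.transform(frame)

    def take(self):
        frame, self.frame = self.frame, None
        return frame


def run_kernel(train, frames):
    """Route frames one at a time through a linked train of devices.

    The kernel repeatedly polls each device; when a device's output is
    available and the next device in the train is ready for input, the
    frame is passed downstream.
    """
    results = []
    pending = list(frames)
    while pending or any(d.output_ready() for d in train):
        # Feed the first device when it can accept a new frame.
        if pending and train[0].ready_for_input():
            train[0].put(pending.pop(0))
        # Poll each device and pass output downstream when possible.
        for i, dev in enumerate(train):
            if dev.output_ready():
                if i + 1 < len(train):
                    if train[i + 1].ready_for_input():
                        train[i + 1].put(dev.take())
                else:
                    results.append(dev.take())
    return results
```

For instance, a two-device train of an increment transform followed by a doubling transform routes each frame through both in order.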
A generic software device 404 is depicted in block form in FIG. 5. The software device 404 includes a software routine 502 for evaluating and transforming an input video frame 504 to produce an output video frame 506. The software device 404 further includes input ready and output ready bits 508 and 510 which are set to signify to the kernel 402 whether the software device 404 is ready to (respectively) receive an input video frame 504 or transmit an output video frame 506. Software device memory 512 may be provided to store data such as previously captured video frames. As is alluded to hereinabove, individual software devices, each performing a specific transform on a video frame, are linked together to implement a desired overall result.
FIG. 6 depicts in block form a library 600 of exemplary software devices. It is to be noted that the collection of software devices depicted in the figure and discussed below is intended to be illustrative rather than limiting. The software devices generally embody techniques and algorithms known in the art, and hence the specific processing techniques and algorithms associated with each of the various software devices will not be discussed in detail herein, except where significant departures
from such prior art techniques and algorithms are practiced by the invention.
The capture device 602 comprises the first software device in any train of software devices. The capture device is configured to examine the frame buffer 212 to detect if an unprocessed video frame 300 is present, and signify the result to the kernel 402.
The color detector device 604 is configured to examine the video frame 300 to determine if any regions of the frame 300 are of a predetermined color. Each pixel is assigned a value (typically in the range of 0-255) corresponding to how closely it matches the predetermined color, and the assigned values are written to the key buffer 304.
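The color matching performed by the color detector device may be sketched as follows, assuming a Manhattan (sum of absolute differences) distance metric; the patent does not specify the actual metric, so this choice is illustrative:

```python
def color_match(pixel, target):
    """Score how closely an RGB pixel matches a target color.

    Returns 255 for an exact match and 0 for the maximum possible
    distance (sketch; the distance metric is an assumption).
    """
    dist = sum(abs(p - t) for p, t in zip(pixel, target))  # max 765
    return 255 - (dist * 255) // 765


def color_detector(frame, target):
    """Write a 0-255 match value for each pixel into a key buffer."""
    return [[color_match(px, target) for px in row] for row in frame]
```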
The key generator device 606 is configured to determine, for each pixel, if a value associated with the pixel exceeds or is less than a threshold value, or if the value falls within a specified range. The key generator 606 then assigns a value to the pixel depending on whether or not the pixel satisfies the test applied by the key generator device 606. The assigned values are written to the key buffer 304.
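The thresholding test applied by the key generator device may be sketched as follows; the threshold and the output values written to the key buffer are illustrative assumptions:

```python
def key_generator(key_buffer, threshold=128, hit=255, miss=0):
    """Binarize a key buffer: pixels whose value meets or exceeds the
    threshold are keyed as hits; all others as misses."""
    return [[hit if v >= threshold else miss for v in row]
            for row in key_buffer]
```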
The smooth filter device 608 is configured to filter stray pixels or noise from the video frame 300. The smooth filter device 608 typically performs this operation by comparing each pixel to neighboring pixels to determine if values associated with the pixel are consistent with those of the neighboring pixels. If an inconsistency is found, the pixel values are reset as appropriate.
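One conventional way to realize the neighbor-consistency test described above is a median filter over a small neighborhood; this particular filter is an assumption, as the patent leaves the exact reset rule open:

```python
def smooth_filter(key, window=1):
    """Replace each pixel's key value with the median of its
    neighborhood, suppressing stray pixels that are inconsistent
    with their neighbors."""
    h, w = len(key), len(key[0])
    out = [row[:] for row in key]
    for y in range(h):
        for x in range(w):
            vals = [key[ny][nx]
                    for ny in range(max(0, y - window), min(h, y + window + 1))
                    for nx in range(max(0, x - window), min(w, x + window + 1))]
            vals.sort()
            out[y][x] = vals[len(vals) // 2]  # median of the neighborhood
    return out
```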
The edge detector device 610 is configured to examine objects in the video frame 300 and determine the edges thereof. As is known in the art, edge determination may be performed by comparing the color and luminosity of each pixel to pixels disposed adjacent thereto. Alternatively, edge determination of dynamic objects may be performed by examining differences between successive video frames. The edge
detector device is usually employed in connection with the edge filler device 614, which examines the output of the edge detector device and fills in any discontinuities, and the edge filter device 612, which filters out pixels erroneously designated as edge pixels by the edge detector device 610.
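The neighbor-comparison approach to edge determination mentioned above may be sketched as follows, operating on grayscale values; the 4-connected neighborhood and the threshold are illustrative assumptions:

```python
def edge_detector(gray, threshold=32):
    """Mark a pixel as an edge (255) when its value differs from any
    4-connected neighbor by more than the threshold."""
    h, w = len(gray), len(gray[0])
    edges = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ny, nx = y + dy, x + dx
                if (0 <= ny < h and 0 <= nx < w
                        and abs(gray[y][x] - gray[ny][nx]) > threshold):
                    edges[y][x] = 255
                    break
    return edges
```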
The foreground detector device 616 is configured to examine each pixel to determine whether the pixel lies in the foreground portion or background portion of the video frame 300. The operation of the foreground detector is described with reference to FIGS. 7 and 8. FIG. 7 depicts the steps of a preferred method for foreground detection. In the first step 702, the foreground detector device selects a pixel in the video frame 300 for testing. Next, a value representative of the pixel's color and luminosity is calculated, step 704. According to the preferred method, the pixel value comprises the sum of the pixel's red, green, and blue color values. The calculated pixel value is then stored, step 706.
Next, the pixel value is compared, step 708, to the values of spatially corresponding pixels of previously captured video frames (stored in the memory of the foreground detector device 616). FIG. 8 presents a graph showing a typical variation of pixel value (for spatially corresponding pixels) in successively captured video frames. It is appreciated that the graph shows certain plateaus or recurrent pixel values 802 indicative of the pixel lying in the background portion of the video frame 300. If in step 708 it is determined that the pixel's current value is substantially equal to the recurrent or plateau value 802, then the pixel is assigned a parameter indicating that the pixel is in the background portion of the video frame, step 710. Conversely, if the pixel's present value is substantially different from the recurrent value 802, the pixel is assigned a parameter indicating that the pixel is in the foreground portion of the video frame 300, step 712. The parameter may either have two states
(background or foreground), or may alternatively have a range of values corresponding to a confidence level as to whether the pixel lies in the background or foreground portion of the video frame 300.
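The recurrent-value test of FIGS. 7 and 8 may be sketched for a single pixel as follows; the tolerance and recurrence-count parameters are illustrative assumptions, since the patent specifies only that a recurrent (plateau) value indicates background:

```python
from collections import Counter

def classify_pixel(current, history, tolerance=8, min_recurrence=3):
    """Classify a pixel as background when its current characteristic
    value is close to a value that has recurred a sufficient number of
    times in previous frames; otherwise classify it as foreground."""
    counts = Counter(history)
    for value, n in counts.items():
        if n >= min_recurrence and abs(current - value) <= tolerance:
            return "background"
    return "foreground"
```

The history here would be drawn from the foreground detector device's memory of spatially corresponding pixels in previously captured frames.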
The grayscale device 620 converts the pixel color values (typically in RGB format) in the video frame 300 to grayscale values. The calculated grayscale values are then placed in the key buffer 304.
The difference detector device 622 examines the video frame 300 to determine differences from a previously stored video frame. More specifically, the difference detector device 622 subtracts pixel values (which may comprise color or grayscale values) of the stored video frame from the pixel values of the current video frame 300, and stores the results in the key buffer 304. In unchanged regions of the video frame 300, the spatially corresponding pixels in the current and stored video frames will have identical values, and hence the subtraction will yield a value of zero. In regions where change has occurred, the subtraction process will yield non-zero values. The positive displacement device 618 performs an analysis similar to the difference detector device, but is additionally configured to determine a leading edge of the movement represented by the changed regions of the video frame 300. Finally, the full screen device 624 is configured to cause the video frame 300, or specific portions thereof, to be displayed on the computer monitor 102.
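The pixel-by-pixel subtraction performed by the difference detector may be sketched as:

```python
def difference_detector(current, stored):
    """Subtract the stored frame from the current frame pixel by pixel.

    The result is zero wherever nothing has changed and non-zero in
    regions where motion has occurred.
    """
    return [[c - s for c, s in zip(crow, srow)]
            for crow, srow in zip(current, stored)]
```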
Referring to FIG. 2, the class library 210 controls interactions between computer generated objects and the user image, as transformed by the software devices. The class library 210 includes a variety of objects having a set of attributes. Typical objects include trackers, bouncing or "sticky" objects, and user interface controls (such as buttons, sliders, and icons). The class objects all include in their associated attributes a position with respect to the video frame 300. The class library
210 further includes the software application, which is preferably configured as a software device. The application device receives the video frame 300, as transformed by software devices disposed upstream in the processing train, and examines the frame 300 to evaluate user actions with respect to one or more class objects.
FIG. 9 depicts in block form a generic software device train for processing a video frame 300 in accordance with the present invention. The capture device 602 gets the video frame 300 from the frame buffer 212. The video frame 300 is then sequentially transformed by the set of software devices (labeled herein as software device 1 through N and collectively numbered 902) selected from the software device library 600. The transformed video frame 300 is then passed to the application device 904 which examines the video frame 300 to determine interactions with one or more class objects. Finally, the video frame 300 is passed to the full screen device 624, which causes the video frame 300 to be displayed by the monitor 102.
FIG. 10 depicts a variation of the user interface system of the invention wherein two spaced apart video cameras 108 and 114 are utilized to simultaneously record the user's image. This configuration is useful in applications requiring three-dimensional tracking of the user's position. In accordance with this configuration, two software device trains are provided for simultaneous processing of the video frames generated by the video cameras 108 and 114. Each of the trains includes a capture device 602, a set of additional software devices 1002 and 1004, and an application device 1006 and 1008. The processed video frames are then passed to a triangulation device 1010, which compares the two frames and uses known triangulation algorithms to derive depth measurements.
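The depth derivation may be sketched using the standard stereo triangulation formula; the patent does not disclose its specific algorithm, and rectified cameras with a pixel-unit focal length are assumptions here:

```python
def depth_from_disparity(x_left, x_right, focal_length, baseline):
    """Estimate depth by triangulation from the horizontal positions of
    the same feature as seen by two rectified, spaced-apart cameras.

    depth = focal_length * baseline / disparity
    """
    disparity = x_left - x_right
    if disparity == 0:
        return float("inf")  # zero disparity: effectively infinite distance
    return focal_length * baseline / disparity
```

With a focal length of 500 pixels and a 0.1 m baseline, a 10-pixel disparity corresponds to a depth of 5 m.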
The invention will now be described in the context of various exemplary modes of implementation of the video interface system and method of the foregoing description. FIGS. 11 and 12 relate to a first mode of the invention wherein a user interacts with a dynamic computer-generated object representative of a sport ball or the like. As may be seen from FIG. 11, the user can, by making an appropriate motion with his arm or other body part, simulate "hitting" the ball to thereby change its trajectory. In this manner, various sports or games, such as volleyball or handball, can be simulated. FIG. 11 shows a perspective view of a user 1102 situated in front of a computer system 100. Initially, the user's arm is lowered, and the computer-generated object 1104 has a first location and direction of motion indicated in solid lines. To "hit" the ball, the user 1102 moves his arm to a second position (indicated in phantom) wherein the locations occupied by the computer-generated object 1104 and the user's image in the video frame 300 are coincident, causing the trajectory of the object 1104 to be changed.
FIG. 12 depicts the preferred sequential arrangement or train of software devices for applying the required image transforms to the video frame 300 to achieve the objectives of the first mode of the invention. The capture device 602 initializes the process by obtaining a video frame 300 from the frame buffer 212. The capture device 602 then passes the video frame 300 to the grayscale device 620, which calculates a grayscale value for each pixel in the video frame 300. The video frame 300, after processing by the grayscale device 620, is then routed to the difference detector device 622. The difference detector device 622 detects a user motion by comparing the current video frame 300 with a previously captured video frame. More specifically, the difference detector subtracts, on a pixel-by-pixel basis, the grayscale values of the previously captured video frame from those of the current video frame 300, and places the results in the key buffer 304. The difference detector device will yield a value of zero for pixels in regions where no motion has occurred, and a non-zero value for regions in which motion is detected. The video frame 300 is then passed to the smooth filter device 608 for filtering of stray pixels.
The video frame 300 is then routed to the application device 1202. The application device 1202 examines the video frame 300 to determine whether there has been a collision between the computer-generated object 1104 and the region of the video frame 300 in which motion has been found. If a collision is detected, the application device 1202 accordingly adjusts the position and trajectory of the computer-generated object 1104. The video frame 300 is then routed to the full screen device 624 for display by the monitor 102.
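The collision test performed by the application device may be sketched as follows; the rectangle representation of the object's screen position and the pixel-count threshold are illustrative assumptions:

```python
def collision(object_rect, motion_key, threshold=1):
    """Detect a 'hit': true when at least `threshold` moving pixels
    (non-zero key values) lie inside the object's bounding rectangle.

    object_rect is (x, y, width, height) in frame coordinates.
    """
    x, y, w, h = object_rect
    count = 0
    for row in motion_key[y:y + h]:
        count += sum(1 for v in row[x:x + w] if v != 0)
    return count >= threshold
```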
One skilled in the art will recognize that the technique described above in connection with a volleyball-type game may also be employed to determine engagement or adjustment of conventional user interface controls, such as buttons, menu items, or sliders. This technique may be employed in the context of providing a user interface to conventional software applications, such as word processors or spreadsheets. FIGS. 13 and 14 relate to a second mode of the invention wherein the video processing engine 208 is configured to recognize and track an article held by the user, and to infer a user action from the movement and orientation of the article. FIG. 13 is a perspective view of a user 1302 situated in front of a computer system 100. The user 1302 grasps and positions an article 1304 (which may comprise, for example, a toy weapon) which possesses a visual characteristic, such as a distinctive color, which can be recognized by the user interface system. In a typical implementation of this mode of the invention, the image of the user 1302 is processed to determine the direction in which the article is pointing
(simulating the aiming of a weapon), to enable the user 1302 to fire at one or more computer generated target objects 1306 displayed on the monitor 102.
FIG. 14 depicts the preferred train of software devices for applying the required image transforms to the video frame 300 to achieve the objectives of the second mode of the invention. The capture device 602 initializes the process by obtaining a video frame 300 from the frame buffer 212. The capture device 602 then passes the video frame 300 to the color detector device 604, which ranks each pixel in the frame 300 as to how closely it matches a specified color (which is set to the characteristic color of the article 1304). The outputted frame 300 from the color detector device 604 is routed to the key generator device 606, which compares the color match values to a threshold value and creates a key based on the thresholding results for each pixel. The frame 300 is then routed to the smooth filter device 608 to filter out noise and stray pixels. The video frame 300 is then passed to the edge detector 610, which is configured to examine the video frame 300 to detect the edges of the article 1304. The video frame 300 is thereafter routed to the edge filter device 612 and edge filler device 614 to, respectively, filter out stray edges and fill in edge discontinuities. Next, the video frame 300 is routed to the application device 1402 configured to examine the frame 300 to determine the orientation of the article 1304 and to infer a user action (i.e., firing the weapon at a computer-generated object 1306). Finally, the frame is passed to the full screen device 624 for display on the monitor 102.
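One way to infer the article's orientation from its keyed pixels is to compute the principal axis of the keyed region from image moments; this particular technique is an assumption for purposes of illustration, as the patent leaves the exact method open:

```python
import math

def article_orientation(key):
    """Infer the pointing direction of the tracked article as the
    principal axis of its keyed (non-zero) pixels, computed from the
    second central moments of the region. Returns angle in radians.
    """
    pts = [(x, y) for y, row in enumerate(key)
           for x, v in enumerate(row) if v != 0]
    n = len(pts)
    cx = sum(p[0] for p in pts) / n  # centroid x
    cy = sum(p[1] for p in pts) / n  # centroid y
    # Second central moments of the keyed region.
    mxx = sum((x - cx) ** 2 for x, y in pts)
    myy = sum((y - cy) ** 2 for x, y in pts)
    mxy = sum((x - cx) * (y - cy) for x, y in pts)
    return 0.5 * math.atan2(2 * mxy, mxx - myy)
```

A horizontal run of keyed pixels yields an angle of zero; a vertical run yields an angle of pi/2.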
FIG. 15 relates to a third mode of the invention wherein the video image user interface is implemented in the form of an interactive kiosk. Interactive kiosks are commonly used in airports, hotels, tourist offices, and the like to interactively present information concerning
accommodations, attractions, restaurants, transportation, etc. Such kiosks generally utilize menus to allow the user to selectively request certain information to be displayed (such as hotels within a given price range or sports events taking place on a certain day). Informational kiosks may also be employed to enable shoppers to identify and locate specific items, or to selectively present product information or advertising. Prior art informational kiosks are commonly provided with touch screens to receive user input. Touch screens, while generally facilitating an intuitive user interface, are prone to malfunction, particularly when used in a heavy-traffic environment.
With reference to FIG. 15, an interactive kiosk utilizes the user interface system of the present invention to display information selected by the user. A set of user interface controls 1502, each denoting a certain type of information, is displayed on a monitor 102. The user interface controls 1502 may comprise, for example, icons or textual menu choices. A video camera 108 captures an image of a user 1504, and the user's image is processed according to methods and techniques described above to determine a motion of the user 1504 and infer a user action with respect to the interface controls 1502. For example, as is depicted in FIG. 15, the user 1504 may select one of a plurality of icons 1502 by raising his hand such that it coincides with the area occupied by the selected icon. This causes information relevant to the icon to be displayed on the monitor.
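The selection step described above (the user's raised hand coinciding with the area occupied by an icon) can be sketched as a simple overlap test between a motion mask and each control's screen region. The icon layout, region representation, and overlap threshold below are all illustrative assumptions; the patent does not prescribe them.

```python
import numpy as np

# Hypothetical icon layout: name -> (top, left, bottom, right)
# in frame coordinates.
ICONS = {"hotels": (10, 10, 40, 40), "dining": (10, 60, 40, 90)}

def select_icon(motion_mask, icons=ICONS, min_fraction=0.25):
    """Infer a user action: return the icon whose screen region most
    strongly overlaps the detected motion (e.g., a raised hand),
    provided the overlap covers at least min_fraction of the region;
    return None if no region is sufficiently covered."""
    best, best_frac = None, min_fraction
    for name, (t, l, b, r) in icons.items():
        region = motion_mask[t:b, l:r]
        frac = region.mean() if region.size else 0.0
        if frac >= best_frac:
            best, best_frac = name, frac
    return best
```

Requiring a minimum coverage fraction, rather than a single coinciding pixel, makes the kiosk interface tolerant of the noise and stray pixels that the smoothing stages are designed to suppress.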
The interactive kiosk may be advantageously equipped with a modem or similar communications device to allow remote modification of the informational content displayed to the user. In this respect, informational content can be updated and/or changed on a periodic basis without requiring a visit to the physical site of the kiosk.
It is appreciated that the above-described modes are presented as illustrative examples of the video image based interface of the invention and are not intended to limit the invention to a particular embodiment or combination of embodiments. The video image based interface of the foregoing description may also be used, for example, for recognition and interpretation of gestures, immersion of a user's image in a static or dynamic video image, etc. Moreover, the user interface system may be combined with other interface technologies, such as voice recognition, to enhance functionality. It is to be further appreciated that while exemplary modes have been described in the context of a personal computer or interactive kiosk environment, features of the invention may be advantageously implemented in any number of environments. For example, the user interface system may be implemented in connection with an appliance such as a television or audio system to control various aspects of its operation. In another implementation, the user interface system may be utilized to control operation of a toy or game. Further, the user interface system may be implemented in a networked computing environment. For example, the user interface system may be advantageously implemented in connection with a multi-player game application in which users located remotely from each other engage in game play.
The invention has been described with reference to specific embodiments. Having described the invention in detail, it will be appreciated by those skilled in the art that modifications may be made to the invention without departing from its spirit. Therefore, it is not intended that the scope of the invention be limited to the specific embodiments illustrated and described. Rather, it is intended that the appended claims and their equivalents determine the scope of this invention.