WO2007045001A1

WO2007045001A1 - Preprocessing of game video sequences for transmission over mobile networks

Info

Publication number: WO2007045001A1
Application number: PCT/AT2005/000421
Authority: WO
Inventors: Olivia Nemethova; Martin Wrulich; Markus Rupp
Original assignee: Mobilkom Austria Aktiengesellschaft
Priority date: 2005-10-21
Filing date: 2005-10-21
Publication date: 2007-04-26
Also published as: AT508595B1; AT508595A4

Abstract

A method and a system for preprocessing game video sequences comprising frames and including a ball or puck as movable game object, for transmission of the video sequences in compressed form; in an initial search (12), frames are searched for the game object on the basis of comparisons of the frames with stored game object features; then, respective frames are compared with preceding frames, to decide on the basis of differences between consecutive frames whether a scene change (14b) has occurred or not, and in the case of a scene change, an initial search is started again; otherwise, tracking of the game object (18) is carried out by determining the positions of the game object in respective frames; at least for one frame, a dominant game playfield color is detected and is replaced by a unitary replacement color so that a playfield representation essentially consists of points of the same color; and the presence, size and/or shape of the detected game object is determined, to possibly replace the game object by an enlarged replacement game object (26).

Description

Preprocessing of Game Video Sequences for Transmission over Mobile Networks

Field of the invention

This invention concerns a system and a method for preprocessing game video sequences for the transmission in compressed form, preferably over wireless mobile (cellular phone) networks.

Due to the lossy nature of the wireless channel and the high compression rates necessary to match the given bandwidth, it is difficult to transmit such programs over mobile networks in real time. As the important information is carried by a single small object - a ball or similar game object - it is necessary to ensure its correct reconstruction at the receiving mobile terminal. Therefore, the aim of such preprocessing is to sharpen or enlarge specific game objects, e.g. a ball, a puck or a similar game element of given shape in a sports game, in the original video sequence, to avoid their blurring or disappearing after the down-sampling of the video resolution and compression.

Background of the invention

The emerging 3rd generation of mobile communication systems, or cellular phone systems, has led to new multimedia services. One of the most interesting applications is video streaming, already provided by many operators all over the world. Here, sports programs are of particular interest, whether as part of news or as stand-alone and possibly live broadcast transmissions. Beyond doubt, ball games such as soccer, basketball, baseball or tennis, but also hockey, in particular ice hockey, are amongst the most popular sports programs. However, the transmission of live streaming game video sequences over mobile networks introduces several challenges. The spatial and temporal smoothness of video sequences allows for high compression performed at the sender before the transmission. This compression results in a certain quality degradation. Streaming services are delay sensitive and are therefore usually transported via the unreliable User Datagram Protocol (UDP) rather than via the Transmission Control Protocol (TCP), the latter providing the possibility of transport layer retransmissions. UDP usage leads to possible packet losses at the receiver, further degrading the quality at the end user. To match the screen of common mobile terminals, the QCIF resolution (144 x 176 in PAL) is used. For PDAs (PDA - Personal Digital Assistant) and laptops, the CIF resolution (288 x 352 in PAL) is of relevance (CIF - Common Intermediate Format; QCIF - Quarter CIF). The most important object, i.e. image element, in a ball game is understandably the ball, generally speaking the game object. Ball games are usually recorded using a slightly moving wide-angle camera. This leads to situations in which the ball is represented by three or four pixels only, and such representations are thus very susceptible to any kind of degradation, which also has a considerable effect on the perceptual quality experienced by the user [1].

In the case of video streaming over wireless networks, the receiver is typically a power- and size-limited mobile terminal. Therefore, it is not feasible to implement complex post-processing methods that would cope with the given problem. Instead, efficient and robust preprocessing of the video sequences is to be used, which selectively improves the representation of the critical image elements, namely the ball or the puck, in view of a robust transmission.

To treat a ball or the like game object separately, and to ensure its display at the right place at the receiver, automatic recognition and tracking of the ball is required. This introduces several challenges because:

- a ball or puck game video-sequence usually contains cuts or slow-motion replay parts;

- a ball or puck is small, especially for the relevant QCIF or CIF resolutions;

- there can be more than one object resembling a ball or a puck;

- the ball or puck does not have to appear in every frame: it can be covered by the players or there can be parts of video without it (e.g. when the audience or details on players are shown);

- the appearance of the ball or puck changes over time (zooming, shadow).

Previous work has been performed on soccer video analysis; a state-of-the-art description and proposal for automatic soccer video analysis can be found in [2]. The purpose of that work was event detection (goal, penalty or red/yellow card detection), and methods for detecting a scene change or slow-motion replay can be found there; however, shot classification is mainly based on the detection of the players and playground lines but not on ball recognition.

Another simple method for event detection in a soccer game is proposed in [3]. Here, the detection is based on the tracking of the trajectory of candidate objects. On the basis of the most consistent trajectories, corresponding objects are then detected as a ball. Several known object recognition methods were used to detect the ball. For instance, in [4] a component analysis was used, and in [5] recognition based on circle detection.

The purpose of the above-mentioned methods was event detection. Contrary to this, in the case of the present invention, the aim is to protect the smallest and most important game object, the ball or puck. It is necessary to avoid wrong detection. The most critical situation occurs in frames where a ball (or puck) is visible within the playground and not surrounded by any other objects. After high compression, a ball often seems to disappear by blurring smoothly into the grass. To overcome this situation, correct ball detection is required. Since real-time transmission is required, the method needs to be simple with low complexity. The circular shape of the ball or puck can be used for the decision, but it is not possible to rely on this single piece of information only. Especially in the wide-angle camera parts of the sequence and at QCIF resolution, the ball or puck often consists of only 3 or 4 pixels, as already mentioned above.

Accordingly, it is an object of this invention to provide a novel, simple and robust technique to cope with the above-mentioned problems, and to provide a technique of preprocessing of the video sequences at the sender (transmitter) so that good quality receipt at the mobile side is achieved.

A further object of the invention is to provide for a reliable technique to protect the ball and to ensure its display at the right place at the receiver, this in spite of data compression used at the sender side, and of the fact that the game object (ball or puck) often comprises only a few pixels.

To cope with these problems, the invention provides for a method and a system as defined in the attached independent claims. Preferred and advantageous embodiments are characterized in the dependent claims.

According to the invention, a method and a system are provided where frames of video sequences including a ball, puck or similar game object may be preprocessed on the sender side in a fast and reliable manner so that at the receiver side, video images of good quality, in particular with respect to the game object, may be displayed, and this also in the case when data compression is applied for the transmission of the video sequences. For the comparison of the frames with stored game object features, stored shape and/or color data may be used, and in particular, game object templates may be defined and stored for the comparison.

Further objects, features and advantages of the invention will become apparent from the following description in connection with the enclosed drawings referring to exemplary preferred embodiments of the invention to which, however, the invention should not be limited. In the drawings,

Fig. 1 shows a schematic diagram of four different examples of the ball appearance in a soccer video sequence, with the pixel of different intensities shown in squares;

Fig. 2 illustrates a schematic block diagram of a system for producing, (pre)processing and transmitting video sequences according to the invention;

Fig. 3 shows a general diagram of the main operation modules of the preprocessing system of the present invention;

Fig. 4 shows a system of flow charts illustrating the main operation steps of the preprocessing system according to the invention;

Fig. 5 illustrates a more detailed schematic block diagram of the preprocessing system according to the invention;

Fig. 6 illustrates a more detailed block scheme of an "initial search" part of the system according to the invention;

Fig. 7 illustrates a more detailed block system of a "scene detection" part of the system of the invention;

Fig. 8 illustrates a schematic diagram showing the extrapolation method for tracking a ball on a frame-by-frame basis; and

Fig. 9 shows a representation of a generated replacement ball on a pixel basis, before (a) and after (b) a Gaussian filtering.

In Fig. 1, a system for recording, processing and transmitting video to mobiles is shown. More particularly, the system includes at least one camera 2 for recording a game, a (pre)processing system 4 for processing the video sequences before compression and transmission, and a module 6 for compressing and transmitting the video sequences; the transmission is done in the usual manner via MSC (mobile switching center) and BSC (base station controller) units and via base transceiver stations (BTS) 8 to a plurality of mobile terminals, in particular mobile phones 10. Such a system may be used to record and transmit games such as soccer, football, rugby, baseball, basketball, tennis, or even ice hockey. In the following, for simplicity, reference is made to soccer and to a game object in the form of a ball, but it should be understood that the present invention may also be applied to other ball games as well as to similar games using a similar game object, in particular a puck in the case of ice hockey.

Soccer games or similar games in which a ball or ball-like game object is used represent very popular content not only for analog and digital television, but also for streaming over mobile networks. Typical mobile terminals usually work with resolutions as small as 144x176 (QCIF); PDAs can display 288x352 (CIF) pixels. The limited bandwidth of 3rd generation mobile systems supports data rates up to 2 Mbit/s, shared by all users in a cell. Therefore, for unicast transmission of streaming video, data rates up to 128 kbit/s are feasible. The video codecs supported by the 3GPP standards at the moment are H.263 and MPEG-4, with their basic profiles. (3GPP - 3rd Generation Partnership Project; the scope of 3GPP was to make a globally applicable third generation (3G) mobile phone system; 3GPP specifications are based on evolved GSM (Global System for Mobile Communications) specifications, now generally known as the UMTS (Universal Mobile Telecommunications System) system. H.263 - a video codec designed by the ITU-T (International Telecommunication Union Telecommunication Standardization Sector) as a low-bitrate encoding solution for videoconferencing. MPEG-4 - Moving Picture Experts Group 4; the primary uses for the MPEG-4 standard are web (streaming media) and CD distribution, conversational (videophone) and broadcast television.) The lossy compression used by these codecs leads to visual quality degradation. Frame reduction causes overall jerkiness of the video; further compression results in a loss of spatial details accompanied by blockiness and blurriness.

Soccer or the like game videos usually encompass scenes with different character. Most common are wide-angle moving camera shots which are particularly critical for the compression, as the ball or puck as well as players are represented by several pixels only, thus being susceptible to any quality degradation. Due to the compression, the ball can even disappear from the playground.

Since the ball carries the most important information, watching a game with a blurred small ball (or even without it) becomes rather annoying. To overcome this, a simple technique is proposed that detects and tracks the ball and replaces it by its enlarged or sharpened version. Thus, a replaced ball can still be visible well after the compression. In this connection, furthermore, it is aimed at robustness of the run-time ball tracking and replacement technique, preferably including occlusion handling and reliable scene change detection. In the following, the whole system is disclosed, containing also a preferred initial ball search approach, and it is shown that good results with respect to visual quality, reliability and overall complexity are achieved.

One of the most challenging problems is the handling of the low resolution of the ball (typically under 5 pixels) and of the image in which the ball is to be searched. Schematic representations of screenshots of typical balls taken from different sequences can be seen in Fig. 2 (a)-(d), where different examples a-d of ball appearances are shown on an enlarged scale, all of which occur in the same video, but in different frames. From this, it will be apparent that it is useful to focus on techniques which neither require information about the object shape nor are based on edge fitting, because it is nearly impossible to identify the shape of objects only 5 - 7 pixels large correctly. Besides this, the aim was to develop a technique with very low complexity and reasonable computing time. The preferred technique comprises several main parts in addition to an initial search part 12, namely in particular an appropriate image prefiltering 14, a similarity search 16, and a subsequent tracking 18 of the searched object.

The structure of the proposed system can be seen in Fig. 3. Here, it is shown that in the present embodiment, the similarity search module 16 comprises a SAD/SSD/2-dimensional filter 16a, as will be discussed later; the image prefiltering and segmentation part or module 14 comprises a dominant color detection and replacement module 14a, a scene detection module 14b and an image filtering module 14c; the tracking part 18 comprises a trajectory tracking module 18a, an algorithm learning (model ball updating) module 18b, and an ROI prediction/occlusion handling module 18c. The initial search module 12 comprises a template set generator 12a, a minimum distance polygon collection module 12b, and a fitting and optimum determination module 12c.

The basic operation of the system can be seen in Fig. 4. Simplified, a video source 2' (compare camera 2 in Fig. 1; however, the video source 2' could also be a video file or a video stream supplied by other devices) provides a series of frames which is monitored by the scene change detector 14b, which activates (see control output 20) the initial search module 12 or the "normal" recognition process 22, as is shown by a "switch" 24. The "normal" recognition process 22 consists of the image prefiltering 14, the trajectory tracking 18 and the similarity search module 16. The recognized ball position, together with information about the ball size, is fed to a ball replacement module 26. After a possible ball replacement (or maintaining of the ball), the next frame is processed.

In the case that a scene change is detected at 20, or at the beginning of a video, an initial search is carried out at initial search module 12.

During soccer video transmissions, light conditions and contrast can change. Also the playfield color is not exactly the same over the whole image. Therefore, the present technique has been developed to smooth out this kind of "noise" by prefiltering, to ensure that the similarity measure yields a maximum score in as many frames as possible.

The image prefiltering in module 14 is a particularly advantageous measure to increase the robustness of the algorithm. Besides the reduction of the "noise" in the image, the advantages are that the algorithm becomes more invariant against lighting changes and fluctuations of the ball appearance (i.e. if the small resolution ball is not symmetric, compare Fig. 2); such fluctuations usually occur because of clutter caused by grass and playfield lines or due to occlusion by players. Especially the strong fluctuations in the shape of the ball caused by grass clutter make the recognition process very unreliable if no adequate prefiltering is carried out.

According to Fig. 3, the prefiltering is performed in three steps: (1) dominant color detection and replacement 14a; (2) scene change detection 14b; and (3) image filtering 14c; compare also Fig. 5, modules 14a (28/30); 14b; 14c. (The modules shown in Figs. 3 to 7 may be realized by separate electronic (computing) modules as well as by module parts of a software system; therefore, the same reference numerals are used in the drawings notwithstanding the respective dominant character - flow chart or block diagram - of the individual figures.)

At first, to smooth out the above-mentioned noise in the playfield color, where the ball will be found most of the time, the first image of a scene is analyzed to find the dominant color values for all color channels. This analysis is done on a frame basis (see frame extractor 32 in Fig. 5) by using color histogram information. It is not necessary to perform an image segmentation to separate the playfield in the image. Because the main goal is to smooth out the clutter of the playfield (grass), it is sufficient that the initial dominant color detection is performed in a representative image, in which the playfield occupies the majority of the frame.

The histograms of all color channels in the RGB (R - red, G - green, B - blue) color space are analyzed and the dominant color regions are found by means of thresholds. The RGB color space is used because the video sequences are usually supplied in this color space. In an example, the thresholds have been optimized empirically to 25% for the red channel, 28% for the green channel and 32% for the blue channel.

After the dominant color bounds have been found for all color channels, the mean index i_mean(c) is computed by a weighted average

where $\mathbf{h}(c) = [h_0(c), h_1(c), \ldots, h_{255}(c)]^T$ is a vector containing the number of points at each color value, $\mathbf{i} = [0, 1, \ldots, 255]^T$ is a vector of all indices belonging respectively to a color channel denoted by c, j is 1, 2, 3, ..., and T denotes the transposed matrix. In an example, j had been limited to 255 because the video sequences used had 16.7 million colors (24 bits per pixel).

The weighted average is used as replacement color, compare color replacement module 28 in Fig. 5, and due to this color replacement at 28 in Fig. 5, the playfield mainly consists of points of the same color. Thus, the frames are formed in a way that they are unified and the noise of the playfield color is suppressed.
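The dominant color detection and replacement described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes that each percentage threshold is taken relative to the histogram peak of its channel and that the dominant region is the contiguous run of bins around that peak, neither of which is stated in the text. The function names are hypothetical, and frames are given as flat lists of (R, G, B) tuples to keep the example self-contained.

```python
def channel_histogram(pixels, channel):
    """Count how many pixels take each of the 256 values of one channel."""
    hist = [0] * 256
    for px in pixels:
        hist[px[channel]] += 1
    return hist


def dominant_bounds_and_mean(hist, rel_threshold):
    """Return (low, high, mean_index) of the dominant region of a histogram.

    Assumption: the region is the contiguous run of bins around the peak
    whose counts stay at or above rel_threshold * peak count.
    """
    peak = max(range(256), key=lambda i: hist[i])
    limit = rel_threshold * hist[peak]
    low = high = peak
    while low > 0 and hist[low - 1] >= limit:
        low -= 1
    while high < 255 and hist[high + 1] >= limit:
        high += 1
    # Weighted average of the indices, weighted by the bin counts.
    weight = sum(hist[low:high + 1])
    mean_index = sum(i * hist[i] for i in range(low, high + 1)) / weight
    return low, high, round(mean_index)


def replace_dominant_color(pixels, thresholds=(0.25, 0.28, 0.32)):
    """Replace pixels lying inside the dominant bounds of all three channels."""
    stats = [dominant_bounds_and_mean(channel_histogram(pixels, c), thresholds[c])
             for c in range(3)]
    replacement = tuple(s[2] for s in stats)
    result = []
    for px in pixels:
        inside = all(stats[c][0] <= px[c] <= stats[c][1] for c in range(3))
        result.append(replacement if inside else px)
    return result, replacement
```

Applied to a frame in which slightly varying grass tones dominate, all grass pixels collapse to a single color while a bright ball pixel is left untouched.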

Secondly, scene (change) detection by detector 14b is carried out in addition to the actions concerning the robustness of the similarity measure. Scene changes are a challenging field in real video transmissions because after such a scene change, the tracking of the ball can be lost completely. Other scenes than wide angle shots may be close-ups in which no ball is in the picture at all, or other perspectives in which the size of the ball is much larger than in the preceding frame.

To avoid a false detection of the ball and to reduce the complexity of the algorithm, the knowledge that a scene change has occurred can be used to decide whether a detailed search should be performed. In addition to this knowledge, one can make use of the dominant color information as described above. If the dominant color at the beginning of a new frame is too far away from an "average" color (green in the case of soccer), the scene does not show the playfield and therefore it is very likely that there is no ball in the scene at all. In such a case, the dominant color information is monitored to decide whether a detailed search should be started, or whether the algorithm can pass the current frame. If the dominant color information falls within the specified range, an initial search is started at module 12 (Fig. 4).

In this "initial search" (see also the initial search module 12 in Figs. 3 and 5), an algorithm different from the one for tracking and recognizing the ball "on-the-fly" is applied. If no sufficient result of this search is obtained, it is concluded that there is no ball in the picture or that it exceeds a certain size. In this case, no replacement of the ball is caused, because a large ball does not need to be improved for the coding and a wrong replacement would result in a degradation of the subjective video quality.

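For the scene detection itself, a technique of "dynamic thresholds" may be used where the SAD (SAD - sum of absolute differences) between the instantaneous frame and its preceding frame is calculated for each frame:

$$\mathrm{SAD}(n) = \sum_{x} \sum_{y} \left| F_n(x,y) - F_{n-1}(x,y) \right| \qquad (2)$$

where F_n is the n-th frame of the video sequence and the sums run over all points of the frame.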

The instantaneous value of the threshold for a scene cut is then given by a linear combination of the instantaneous SAD value, the mean and the variance of SAD calculated over e.g. the last 20 frames. This method performs well as it adapts to the amount of movement among the frames and thus allows detection of finer scene changes than usually used fixed threshold.
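A minimal sketch of such a dynamic-threshold scene cut detector is given below. The weights `w_mean` and `w_var`, the class name and the handling of the first frames are assumptions made for illustration; the text only states that the threshold is a linear combination of the instantaneous SAD, its mean and its variance over roughly the last 20 frames. Frames are single-channel nested lists.

```python
from collections import deque
from statistics import mean, pvariance


def frame_sad(frame, previous):
    """Sum of absolute differences between two equally sized frames."""
    return sum(abs(a - b)
               for row, prev_row in zip(frame, previous)
               for a, b in zip(row, prev_row))


class SceneChangeDetector:
    """Flags a cut when the frame SAD exceeds a history-dependent threshold."""

    def __init__(self, window=20, w_mean=2.0, w_var=1.0):
        # w_mean and w_var are hypothetical weights, not values from the text.
        self.history = deque(maxlen=window)
        self.w_mean = w_mean
        self.w_var = w_var
        self.previous = None

    def update(self, frame):
        if self.previous is None:
            # Beginning of the video: an initial search is needed anyway.
            self.previous = frame
            return True
        sad = frame_sad(frame, self.previous)
        self.previous = frame
        if len(self.history) < 2:
            # Not enough statistics yet to form a threshold.
            self.history.append(sad)
            return False
        threshold = (self.w_mean * mean(self.history)
                     + self.w_var * pvariance(self.history))
        if sad > threshold:
            self.history.clear()  # statistics of the old scene no longer apply
            return True
        self.history.append(sad)
        return False
```

Because the threshold follows the recent SAD statistics, a sequence with much camera motion tolerates larger frame differences than a nearly static one before a cut is reported.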

Thirdly, image filtering (cf. image filter module 14c in Figs. 3 and 5) is done where - besides the dominant color detection and replacement as described above - a two-dimensional Gaussian filter is used to smooth the resulting images. The filtering is performed by means of a two-dimensional convolution

where h is the N_x x N_y filter, F_n is the n-th frame of the video sequence, i, j are the filter coordinates and x, y denote the coordinates of the two dimensional convolution.

An M x M isotropic (i.e. circularly symmetric) Gaussian filter is used:

Different sizes M have been tested in several video sequences. Thereafter, in an example, it was concluded that for the low resolutions of the ball, a filter size of M = 5 is sufficient to smooth the image, without influencing the appearance of the ball significantly.
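The smoothing step can be illustrated with a small pure-Python sketch. The standard deviation `sigma`, the normalization of the kernel to unit sum and the replication of border pixels are assumptions, since the text only fixes the filter size M = 5 for the example; the function names are hypothetical.

```python
import math


def gaussian_kernel(m=5, sigma=1.0):
    """Build an m x m isotropic Gaussian kernel normalized to unit sum."""
    c = m // 2
    raw = [[math.exp(-((i - c) ** 2 + (j - c) ** 2) / (2.0 * sigma ** 2))
            for j in range(m)]
           for i in range(m)]
    total = sum(map(sum, raw))
    return [[v / total for v in row] for row in raw]


def smooth(frame, kernel):
    """Two-dimensional convolution of a single-channel frame with the kernel.

    Border pixels are replicated (an assumption). Because the kernel is
    symmetric, convolution and correlation give the same result.
    """
    rows, cols = len(frame), len(frame[0])
    c = len(kernel) // 2
    out = [[0.0] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            acc = 0.0
            for i, kernel_row in enumerate(kernel):
                for j, w in enumerate(kernel_row):
                    yy = min(max(y + i - c, 0), rows - 1)
                    xx = min(max(x + j - c, 0), cols - 1)
                    acc += w * frame[yy][xx]
            out[y][x] = acc
    return out
```

With a unit-sum kernel, uniformly colored regions such as the unified playfield keep their value, so only local clutter is attenuated.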

A main part of the present technique is the similarity measure which is used to search for a template ball in each frame of the soccer video. Based on the requirements of low complexity and reasonable computing time, the method of the sum of absolute differences (SAD) has been chosen as an example of a possible metric:

$$\mathrm{SAD}(n,x,y) = \sum_{i=1}^{N_x} \sum_{j=1}^{N_y} \left| F_n(x+i, y+j) - T(i,j) \right| \qquad (5)$$

where F_n is the n-th frame of the video sequence; x, y are the SAD coordinates (within the searched region of the frame) and i, j are the coordinates within the N_x x N_y template T.

In Fig. 5, a corresponding similarity metric calculation and recognition module 34 is shown for this purpose.

Other similarity metrics, like SSD (sum of squared differences), have been tested, too:

$$\mathrm{SSD}(n,x,y) = \sum_{i=1}^{N_x} \sum_{j=1}^{N_y} \left( F_n(x+i, y+j) - T(i,j) \right)^2 \qquad (6)$$

where F_n, T and x, y have the same meaning as in the SAD calculation.

Additionally, two-dimensional filtering has been tested, like that given above in Equation (3), where instead of a Gaussian filter h, filtering by the N_x x N_y ball template has been carried out. The results and the significance of the found minimum (or maximum in the case of the two-dimensional filtering) for the three similarity measures have been compared, and it could be established that the SAD values show a much more significant minimum than the filter values do. Similar comparisons with the SSD calculations showed that the SAD method is a good tradeoff between accuracy and complexity for the given problem, so SAD was chosen as the preferred method. In the algorithm, the similarity metric is used to sequentially calculate the distance between the frame and the ball template within the region of interest and to select the position with the minimum distance, leading to a candidate position of the ball:

$$p(n) = \arg\min_{(x,y) \in A} \mathrm{metric}(n, x, y) \qquad (7)$$

where p(n) denotes the candidate ball position in frame n, metric(n,x,y) stands for one of the proposed metrics and A specifies the region of interest (ROI), i.e. the part of the frame where the ball position is assumed.
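The search for the candidate position can be sketched as an exhaustive SAD evaluation over the ROI. The sketch assumes single-channel 8-bit frames as nested lists, 0-based indices (the equations count from 1) and an ROI given as inclusive bounds of the top-left template corner; the function names are hypothetical. The acceptance limit follows the 6.8% example value given in the text, under the assumption that the maximum SAD is 255 times the number of template points.

```python
def sad(frame, template, x, y):
    """SAD between the template and the frame patch whose corner is (x, y)."""
    return sum(abs(frame[y + j][x + i] - template[j][i])
               for i in range(len(template[0]))
               for j in range(len(template)))


def find_ball(frame, template, roi, rel_threshold=0.068, max_value=255):
    """Return the position with minimum SAD inside the ROI, or None.

    roi = (x_min, x_max, y_min, y_max), inclusive. None is returned when
    even the best SAD exceeds rel_threshold times the maximum possible SAD.
    """
    x_min, x_max, y_min, y_max = roi
    score, position = min((sad(frame, template, x, y), (x, y))
                          for y in range(y_min, y_max + 1)
                          for x in range(x_min, x_max + 1))
    limit = rel_threshold * max_value * len(template) * len(template[0])
    return position if score <= limit else None
```

Returning `None` for a poor best match corresponds to the case in which no ball is recognized and the frame is passed on without replacement.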

It has been found that the decision whether a ball is recognized or not can be based on a fixed threshold, which in a practical example has been optimized by hand to 6.8% of the maximum SAD value for a given template size. Fortunately, this threshold can also be used to speed up the detection process.

One may write the SAD at a given point (x, y) of the current frame n as a partial sum

$$\mathrm{SAD}_k(n,x,y) = \sum_{i=1}^{N_x} \sum_{\substack{j=1 \\ (i-1) N_y + j \le k}}^{N_y} \left| F_n(x+i, y+j) - T(i,j) \right| \qquad (8)$$

where N_x and N_y again denote the size of the template ball and k is the number of absolute differences calculated so far. For the condition $(i-1) N_y + j \le k$ to be correct, the inner index j has to run through all possible values after choosing i. Certainly, if $k = N_x N_y$, then $\mathrm{SAD}_{N_x N_y}(n,x,y)$ is equal to SAD(n,x,y) as defined in Equation (5) above. The absolute value |·| in equations (2), (5) and (8) makes the SAD an additive metric whose partial sums are non-negative and non-decreasing in k. Therefore, one can stop the evaluation of the SAD value at (x, y) as soon as the current SAD_k(n,x,y) exceeds the given threshold, because the algorithm would discard the overall SAD(n,x,y) result anyway.

The SAD_k(n,x,y) calculates the sum of the differences of the points 1...k. If the current partial sum at step k exceeds the given threshold, the evaluation of the remaining points can be discarded.

This speeds up the recognition process considerably since only a fraction of the differences has to be calculated. Another way to accelerate the search process even more is explained in the next section.
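The early termination can be sketched as follows. The function name and the returned term count are illustrative additions; frames are single-channel nested lists with 0-based indices, and the loop order follows the partial sum above (outer index i, inner index j).

```python
def sad_with_early_exit(frame, template, x, y, threshold):
    """Accumulate the SAD term by term and stop once it exceeds threshold.

    Returns (sad, terms) when the full sum stays within the threshold and
    (None, terms) when the evaluation was abandoned after `terms` terms.
    """
    partial = 0
    terms = 0
    for i in range(len(template[0])):      # outer index i
        for j in range(len(template)):     # inner index j runs fully for each i
            partial += abs(frame[y + j][x + i] - template[j][i])
            terms += 1
            if partial > threshold:
                # Partial sums never decrease, so the full SAD would also
                # exceed the threshold and the position is rejected anyway.
                return None, terms
    return partial, terms
```

At positions far from the ball, a single large difference is often enough to reject the position, so most of the template points are never visited.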

The tracking and occlusion handling part of the algorithm (compare part 18 in Fig. 3) again performs an important task for the reliability and speed of the algorithm. The basic idea behind trajectory tracking is that within one scene, the ball cannot move more than a certain distance between two frames. If the position of the ball is known in the preceding frame and no scene change has occurred, one can restrict the search to a part of the respective frame, the region of interest (ROI), in which to "look" for the ball. This has two effects:

- the computation time is reduced since the ROI is considerably smaller than the whole image; and

- the robustness of the recognition process is improved because almost all "false" hits, where the SAD results would exceed the given threshold, can be rejected.

The prediction of the ROI for ball trajectory tracking (see tracking module 18 in Figs. 3, 4 and 5) can be implemented as a simple linear prediction by using a fixed size of the ROI and the new position estimated from the difference between the ball positions in the last two preceding frames:

$$\hat{p}(n) = p(n-1) + k(n) \qquad (9)$$

where $\hat{p}(n)$ is a vector containing the predicted x and y coordinates of the ball in the n-th frame and k(n) is the predicted movement of the ball during the last frame. In Fig. 5, a ball position prediction module 36 is shown which predicts the ball position in the current frame in the manner described below, using position data from a position memory 38. It should be noted that in principle different approaches could also be used here to predict the ball positions, e.g. the MMSE method (MMSE - minimum mean square error estimator) or the WLSE method (WLSE - weighted least squares estimator). On the basis of the information delivered by the ball position prediction module 36, and of information, e.g. as to the number of frames per second (fps), supplied by a rate, resolution and fps extractor 40 (see Fig. 5), the tracking (ROI) generator 18 calculates the ROI and furthermore specifies the region in which the dominant color replacement (module 30) and the image filter (module 14c) are applied. The extractor 40 is a module having the task of extracting the bit rate, the resolution and the fps number of the video source 2'.

Referring again to the specific ball position prediction to be described here, and under the above-mentioned assumption that the movement of the ball between two consecutive frames is not too large (which is usually true in practice if no scene change has occurred - which would be detected by module 14b), one can predict the movement k(n) by the difference of the ball positions in the two preceding frames, p(n-1) - p(n-2). This leads to the following prediction:

$$\hat{p}(n) = 2\,p(n-1) - p(n-2) \qquad (10)$$

The starting values p(-1) and p(-2) of this prediction are fixed to 0, so the algorithm has to evaluate the first two frames before it can perform a prediction.
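The linear prediction amounts to a constant-velocity extrapolation and can be sketched in a few lines. The function name is hypothetical, and returning `None` while fewer than two positions are known is an illustrative alternative to the zero start values described above.

```python
def predict_position(history):
    """Extrapolate the next ball position from the last two recognized ones.

    history is a list of recognized (x, y) positions, most recent last.
    """
    if len(history) < 2:
        return None  # two evaluated frames are needed before predicting
    (x1, y1), (x2, y2) = history[-1], history[-2]
    # Predicted position = last position + last observed movement.
    return (2 * x1 - x2, 2 * y1 - y2)
```

For a ball seen at (10, 8) and then at (12, 7), the movement is (+2, -1), so the next position is predicted at (14, 6).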

Another possibility is to adapt the weights of the predictor adaptively. This can be done by solving the Yule-Walker equation system for linear prediction or by using an MMSE (Minimum Mean Square Error) estimator. The latter is preferable, as it can be calculated numerically more easily.

In tests, the size of the ROI was empirically optimized to twice the size of the template ball, such that in all tests, the ROI was large enough to contain the ball, even if the prediction was slightly off. This size of the ROI was optimized for a frame rate of 25 fps (frames per second). The result can differ significantly for reduced frame rates, however, because then the movement of the ball between two consecutive frames may be much larger and the ROI could be too small to contain the ball in the case of a false prediction.

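The ROI A evaluated in this way defines the range of x and y values to which the similarity metric is applied, for example

$$A_x = \hat{p}_x(n) \pm \lfloor N_x / 2 \rfloor, \qquad A_y = \hat{p}_y(n) \pm \lfloor N_y / 2 \rfloor \qquad (11)$$

where $\hat{p}_x(n)$ and $\hat{p}_y(n)$ are the x and y components of the predicted position, N_x and N_y denote the size of the template ball as in (5) and $\lfloor \cdot \rfloor$ denotes the so-called floor operation.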
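The ROI bounds around a predicted position can be computed as sketched below. Clamping the bounds so that the template always stays inside the frame is an added assumption not discussed in the text; the function name is hypothetical, and positions refer to the 0-based top-left corner of the template.

```python
def search_region(predicted, template_size, frame_size):
    """Return inclusive ROI bounds (x_min, x_max, y_min, y_max).

    The bounds are the predicted position plus or minus floor(N / 2) in
    each direction, clamped so that the template fits into the frame.
    """
    (px, py), (nx, ny), (width, height) = predicted, template_size, frame_size

    def clamp(value, upper):
        return max(0, min(value, upper))

    return (clamp(px - nx // 2, width - nx), clamp(px + nx // 2, width - nx),
            clamp(py - ny // 2, height - ny), clamp(py + ny // 2, height - ny))
```

For a 5 x 5 template predicted at (14, 6) in a QCIF frame 176 points wide and 144 points high, the search positions span x = 12...16 and y = 4...8.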

To verify the reliability of the proposed prediction, the predicted position and the recognized position delivered by the similarity metric were tracked. It was found that the difference between the predicted and the recognized position, $\varepsilon(n) = \hat{p}(n) - p(n)$, in x and y direction (components $\varepsilon_x$ and $\varepsilon_y$), that is the prediction error, is sufficiently small and shows no systematic bias. The prediction was tested with several video sequences of different lengths. The average variances of the prediction error in x and y direction (averaged over all tested sample videos), $E\{\mathrm{var}\,\varepsilon_x\}$ and $E\{\mathrm{var}\,\varepsilon_y\}$, were about 1.628 and 1.389. The average means of the errors, $E\{\mathrm{mean}\,\varepsilon_x\}$ and $E\{\mathrm{mean}\,\varepsilon_y\}$, were about -0.092 and 0.146.

Besides the prediction of the ball position, it was also tried to make the algorithm as robust as possible against slowly changing light conditions and small appearance changes because in small resolutions this happens quite often. Therefore, the replacement ball used as template for the SAD search is updated by the ball which has been newly found. The template is updated in an averaging manner as follows

where T_new is the new template which will be used in the further recognition process, T_old is the template used until the current frame, B is the actual recognized ball and α is the learning factor. Empirical testing showed that, together with the chosen detection threshold, a learning factor α of 0.2 is appropriate.
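The averaging update can be sketched as follows. The update formula itself is not reproduced in this text, so the sketch assumes the common exponential-averaging form T_new = (1 - α)·T_old + α·B; this form, the function name `update_template` and the representation of templates as nested lists of intensity values are assumptions made for illustration only.

```python
def update_template(t_old, ball, alpha=0.2):
    """Blend the newly recognized ball into the template, pixel by pixel.

    Assumed form: T_new = (1 - alpha) * T_old + alpha * B.
    Both inputs are 2-D lists of equal size holding intensity values.
    """
    if len(t_old) != len(ball) or any(
        len(row_t) != len(row_b) for row_t, row_b in zip(t_old, ball)
    ):
        raise ValueError("template and ball must have the same size")
    return [
        [(1.0 - alpha) * t + alpha * b for t, b in zip(row_t, row_b)]
        for row_t, row_b in zip(t_old, ball)
    ]
```

With α = 0.2, the template follows slow lighting changes while a single poor detection only contributes one fifth of its pixel values.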

Of course, different options for the template generation are possible; the used number of past templates or the weighting factor of these templates defines the used algorithm for the generator.

In addition to Equation (12), it is favourable to monitor whether the size of the currently found ball has increased, in order to decide whether the size of the template and of the replacement should be increased. To decide this, the histogram information of the currently recognized ball (or merely the section of the image in which the ball was found) may be used. Normally, the ball is nearly white and surrounded by a considerably darker environment (i.e. the playfield). Thus, it is straightforward to binarize the template ball by using the histogram information and to compare the counts of the two resulting values. Depending on the percentage of "dark" against "bright" pixels, it can be decided whether the ball has increased in size or not.

The algorithm used converts the RGB coordinates of the actual recognized ball into intensity values (B_I = intensity(B)), e.g. according to the following equation:

where vec(B) returns an (N_x*N_y)x3 matrix, and N_x, N_y denote the size of the actual recognized ball in x and y direction.

After the conversion, a hard threshold (e.g. 150) is applied to binarize the image, so that every pixel B_I(i,j) ≥ threshold is set to one ("1") and all others to zero ("0"). The decision whether the template size should be increased therefore simplifies to an evaluation of the ratio of "bright" pixels to the number of all pixels, compared with an empirically determined threshold

If the above inequality is fulfilled, the template size is increased in x and y direction to N_x+1 and N_y+1, and the extraction and analysis described above are repeated until the threshold is no longer exceeded.
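The size check described above can be sketched as follows. Since neither the intensity equation nor the ratio threshold is reproduced in this text, the sketch assumes the widely used luma weights 0.299, 0.587 and 0.114 and a hypothetical ratio limit of 0.5; the function names are likewise illustrative.

```python
def intensity(pixel):
    """Convert one (R, G, B) pixel to an intensity value.

    The weights are an assumption (common luma weights), not taken
    from the description.
    """
    r, g, b = pixel
    return 0.299 * r + 0.587 * g + 0.114 * b


def bright_ratio(patch, threshold=150):
    """Fraction of pixels whose intensity reaches the hard threshold."""
    values = [intensity(p) for row in patch for p in row]
    bright = sum(1 for v in values if v >= threshold)
    return bright / len(values)


def should_grow_template(patch, threshold=150, ratio_limit=0.5):
    """True if the share of bright pixels exceeds the (assumed) limit."""
    return bright_ratio(patch, threshold) > ratio_limit
```

A caller would extract the patch again with size N_x+1, N_y+1 and repeat the check for as long as `should_grow_template` returns True.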

If no ball is found within the ROI, the algorithm decides that an occlusion has happened. No ball replacement (see module 26 in Fig. 5) is performed in this case, and no template ball update takes place. Since in most cases an occlusion occurs because the ball passes a player, or is held by him or her, a good prediction of the next ball position is nearly impossible. Thus, the algorithm proceeds in a simple way by increasing the size of the ROI from frame to frame (but maintaining its center at the same position) until the ball has been found again; compare also Fig. 8, where a ROI of original size and an enlarged ROI' are shown in connection with a trajectory line 42 defined by ball positions.

In most cases this is sufficient to keep track of the ball, because the ball is normally occluded for only a short time in wide angle shots. These shots are of main interest, since it is in those frames that a replacement should be done to improve the visual quality for the mobile user.
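The occlusion handling can be sketched as a small state update. The growth step of 2 pixels per frame and the function name `roi_for_next_frame` are hypothetical; the description only states that the ROI is enlarged from frame to frame while its center is kept.

```python
def roi_for_next_frame(center, half_size, base_half_size, ball_found, growth=2):
    """Return (center, half_size) of the ROI to use in the next frame.

    growth is an assumed per-frame enlargement in pixels.
    """
    if ball_found:
        # Ball visible: return to the original ROI size. The caller is
        # expected to re-center the ROI on the next predicted position.
        return center, base_half_size
    # Occlusion assumed: keep the center fixed and enlarge the ROI.
    return center, half_size + growth
```

For example, three consecutive frames without a detection enlarge a half-size of 7 to 13, and the first successful detection resets it to 7.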

To further enhance the robustness of the algorithm and to avoid most wrong template learning decisions, it is preferred to remove the clutter caused by the playfield lines. This clutter is especially disadvantageous for the algorithm in QCIF resolutions because in a wide angle shot, the ball is very similar to a point of a playfield line. Therefore, to exclude these points from decisions, and to avoid wrong template learning, each frame may be processed by means of a simple edge detection to obtain a binarized version of the frame with detected playfield lines. Here, good results may be achieved with the known per se Canny algorithm for edge detection, where e.g. a threshold of 0.03 in x-direction and of 0.08 in y-direction may be defined, as has been found to be adequate in practical tests. The conversion of the RGB frame to an intensity image may be performed according to the above Equation (13).

The binarized frame serves two measures:

- if the detected ball position approaches a detected line, the template learning is stopped and

- if the ball is not detected for a number of frames because of occlusion, a recognition is marked as valid only if it does not lie on a line detected by the edge detection.

These measures provide two notable benefits. First, the template is not updated by a ball detected on a playfield line, which would disturb the desired template and probably lead to the decision that the template size has increased. Second, a still more reliable recognition of the ball after occlusion becomes possible, because a ball recognition is not marked as valid if it is recognized on a playfield line.
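The two measures can be sketched as a simple gate on each detection. The sketch assumes that a binary edge mask (for instance the output of an edge detector) is already available as a 2-D list indexed as mask[y][x]; the one-pixel margin used to decide that a position "approaches" a line, and both function names, are assumptions.

```python
def near_line(edge_mask, x, y, margin=1):
    """True if any edge pixel lies within `margin` pixels of (x, y)."""
    height = len(edge_mask)
    width = len(edge_mask[0])
    for yy in range(max(0, y - margin), min(height, y + margin + 1)):
        for xx in range(max(0, x - margin), min(width, x + margin + 1)):
            if edge_mask[yy][xx]:
                return True
    return False


def gate_detection(edge_mask, x, y, frames_since_seen):
    """Return (valid, allow_template_update) for a detection at (x, y)."""
    on_line = near_line(edge_mask, x, y)
    # Measure 1: no template learning close to a playfield line.
    allow_template_update = not on_line
    # Measure 2: after an occlusion, a recognition on a line is rejected.
    valid = not (frames_since_seen > 0 and on_line)
    return valid, allow_template_update
```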

The overall diagram of Fig. 5 further depicts the above-mentioned ball replacement module 26, which obtains from the position memory 38 the ball position information updated by the similarity metric calculation and recognition module 34 (which calculates the similarity metric of the given template ball and the current frame in the evaluated ROI, the metric values serving as a basis for the threshold decision whether a ball is recognized or not; see above). This ball position information is also supplied to a ball extractor 44 which determines the size of the currently recognized ball in order to extract it and to save it in a template memory 46. In this template memory 46, the extracted balls of the video source 2' are saved to build a basis for a template generator 48. The template generator 48 calculates the ball template used by the similarity metric calculation from the memory of past templates, as described above in connection with Equation (12).

Finally, in Fig. 5, a substitution ball generator 50 is shown which evaluates the optimal substitution ball for the current frame by taking rate, resolution and fps into account. The substitution ball is generated sufficiently large and with high contrast, such that the ball replacement provides the desired quality enhancement, as will be described. Further, a codec 52 is used to compress the collection of processed frames and lower its resolution to obtain the desired video output 54 which may again be a video file or a video stream adjusted to the needs of the mobile channel.

The initial search (see Fig. 6) is a crucial part of the present technique because the robustness depends significantly on the obtained results. For the purposes of the present processing, a non-causal multiresolution algorithm may be implemented. Unlike the implemented algorithm for running scenes, which uses only the information of the template and a consequent tracking for the recognition process (and therefore is very fast and computationally inexpensive), the initial ball search uses extracted trajectory information to find the correct ball at the beginning of each scene. This method is much more reliable and robust because it uses knowledge about the physical behaviour of the ball in addition to the information about the shape and color attributes. Accordingly, the initial ball recognition module 12 estimates the positions of the ball in a sequence of e.g. 5 to 7 frames after a scene change has occurred, or a new video source 2' for processing is chosen.

The origin of the initial recognition process is a set of characteristic templates in different resolutions, as stored in memory 56 of Fig. 6, which have been found empirically; these initial templates represent a wide range of possible balls in videos with similar resolutions. Starting from these templates, a bigger set X is generated in template generator module 58 by means of simple Gaussian filtering (i.e. different filter sizes and variances) and deblurring (i.e. deblurring using a Wiener filter or the Lucy-Richardson algorithm). Each recognition run starts by applying the dominant color detection and replacement, cf. modules 28, 30 in Fig. 5, and uses the SAD similarity measure as described above; compare also the analogous similarity metric calculation and recognition module 34 in Fig. 6.

At the beginning of a video (n₀ = 1) or at a scene change in frame n₀, the SAD(n,x,y) values (compare Equation (5)) between the frame and the ball template t are computed in each frame n₀ ≤ n < n₀ + N_f; thereafter, the values are sorted in ascending order and numbered in their order by k. Now, the first γ points (x,y) with minimum SAD(n,x,y) value are collected, and for each template t a set Ω^(t)(n) is formed:

$$\Omega^{(t)}(n) = \{ P_k^{(t)}(n) : k = 1, \ldots, \gamma \}$$

where P_k^(t)(n) denotes the position (x,y) of the corresponding SAD(n,x,y) value at position k in the sorted list of frame n. The set Ω^(t)(n) is generated for all templates t ∈ X and all N_f frames, therefore resulting in a total number of

$$N_f \cdot |X|$$

sets with N_f·|X|·γ points.

The recognized positions of the first frame for each template, Ω^(t)(1) (compare also the position memory 64 in Fig. 6), form the starting points of so-called "minimum distance" polygons. Starting from an arbitrary recognized position k = 1...γ (using a fixed template t) in frame n₀, a minimum distance search module 62 calculates the Euclidean distance to all other recognized positions l = 1...γ in the consecutive frame

$$d^{(t)}(k, l, n) = \left\| P_k^{(t)}(n) - P_l^{(t)}(n+1) \right\|_2 \qquad (16)$$

where P_k^(t)(n) denotes the k-th recognized position in frame n found by the similarity search with template t. After calculating these distances, the minimum distance is chosen to specify the next position of the "minimum distance" polygon in frame n+1

This procedure is repeated for all N_f frames, all positions k in the first frame n₀ and all templates, thus resulting in a collection of γ·|X| polygons with γ·|X|·N_f points. The "minimum distance" polygons now represent possible trajectories of candidate balls. The remaining question is which polygon describes the real ball. The algorithm is based here upon the assumption that the trajectory of a "real" ball is sufficiently smooth. To evaluate the "smoothness" of each polygon, a curve fitting may be applied using a polynomial of order two:

$$f(x) = ax^2 + bx + c \qquad (18)$$
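The construction of one "minimum distance" polygon, as described above, can be sketched as follows. The sketch assumes that the candidate positions of one template are given as a list per frame, `candidates[n]`, each holding the γ best (x, y) positions; the function name `build_polygon` is illustrative.

```python
import math


def build_polygon(candidates, start_index):
    """Chain candidate positions frame by frame by minimum distance.

    candidates[n] is the list of (x, y) candidates found in frame n.
    The polygon starts at candidates[0][start_index] and, in every
    following frame, continues with the candidate closest (Euclidean
    distance) to the previously chosen position.
    """
    polygon = [candidates[0][start_index]]
    for frame_positions in candidates[1:]:
        last = polygon[-1]
        nearest = min(frame_positions, key=lambda p: math.dist(last, p))
        polygon.append(nearest)
    return polygon
```

Calling `build_polygon` for every start index and every template yields the collection of candidate trajectories that is evaluated afterwards.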

The goodness of the fit is measured by the sum of squared errors

where p̂_y(i) now stands for the y coordinate of the fitted curve in frame i.
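A least-squares fit of a second-order polynomial and the resulting sum of squared errors can be sketched in plain Python as follows. The centering of the x values and the fallback to a constant fit when the points do not allow a quadratic fit (for example a candidate that does not move at all) are assumptions added for numerical robustness; they are not taken from the description.

```python
def quadratic_fit_sse(points):
    """SSE of a least-squares fit y = a*u^2 + b*u + c, with u = x - mean(x)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    us = [x - mean_x for x, _ in points]  # centered x values
    ys = [float(y) for _, y in points]
    s = [sum(u ** k for u in us) for k in range(5)]  # s[k] = sum of u^k
    # Augmented normal equations for the unknowns (a, b, c).
    m = [
        [s[4], s[3], s[2], sum(u * u * y for u, y in zip(us, ys))],
        [s[3], s[2], s[1], sum(u * y for u, y in zip(us, ys))],
        [s[2], s[1], s[0], sum(ys)],
    ]
    eps = 1e-9 * max(1.0, s[4])
    coeffs = None
    for col in range(3):  # Gauss-Jordan elimination with partial pivoting
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < eps:
            break  # singular system: too few distinct x values
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * p for v, p in zip(m[r], m[col])]
    else:
        coeffs = tuple(m[i][3] / m[i][i] for i in range(3))
    if coeffs is None:
        coeffs = (0.0, 0.0, sum(ys) / n)  # assumed fallback: constant fit
    a, b, c = coeffs
    return sum((y - (a * u * u + b * u + c)) ** 2 for u, y in zip(us, ys))
```

Points lying exactly on a parabola give an SSE close to zero, while an erratic zigzag gives a clearly larger value.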

After curve fitting and calculating the SSE values for each polygon, an attempt is made to find the "optimal" polygon, which represents the trajectory of the desired ball. This is performed in two steps: first, the starting indices k of the N_SSE best-fitting polygons of template t are determined

where all k form the set T. Within this set of "well" fittable polygons, the one with the largest sum of Euclidean distances is chosen. Empirical testing led to this step because in some cases, noise in the playfield leads to a "ball-like" appearance which does not vanish within the chosen number of frames N_f to compute. A set of positions found as described which do not differ significantly from frame to frame is easy to fit and therefore generates a very low SSE value. On the other hand, such a polygon will have a small length, thus verifying the assumption that within the set of polygons with a small SSE value, it is the best strategy to choose the polygon with the largest sum of distances

$$l_{\mathrm{opt}}(n) = \arg\min_{l = 1 \ldots \gamma} d^{(t)}(k, l, n).$$
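The two-step selection described above (a shortlist of the best-fitting polygons, then the longest polygon among them) can be sketched as follows. The shortlist size `n_sse=3` is a hypothetical value, and the function names are illustrative.

```python
import math


def polygon_length(polygon):
    """Sum of Euclidean distances between consecutive polygon points."""
    return sum(math.dist(p, q) for p, q in zip(polygon, polygon[1:]))


def select_polygon(polygons, sse_values, n_sse=3):
    """Return the index of the polygon assumed to describe the real ball.

    Step 1: keep the n_sse polygons with the lowest SSE.
    Step 2: among those, choose the one with the largest total length.
    """
    ranked = sorted(range(len(polygons)), key=lambda k: sse_values[k])
    shortlist = ranked[:n_sse]
    return max(shortlist, key=lambda k: polygon_length(polygons[k]))
```

A static "ball-like" noise point has the lowest SSE but zero length, so it loses against a smoothly moving candidate in the second step.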

For the case that this procedure does not lead to an optimal polygon, the algorithm may contain a possibility to choose whether the same frame sequence should be processed in a higher resolution, or whether it should be concluded that the actual scene does not contain a ball.

Fig. 6 further depicts a corresponding optimum evaluation module 68 which determines the optimal polygon representing the real ball, as described, by processing the minimum distance polygons together with the information about the goodness of fit and the total length. Different algorithms are conceivable; for example, the optimum polygon may be defined by choosing the one with the best goodness of fit and the largest total length.

With block 70, an initial ball is shown which represents the optimal polygon chosen by the optimum evaluation module 68 together with a template which forms the basis for the determined polygon. The initial ball position 72 is then extracted to be saved in the position memory 38 (Fig. 5). An extracted template 74 is the template which forms the basis of the optimal minimum distance polygon. This template will be used for the start of the further recognition process, as has been described in connection with Fig. 5.

Then, the frame extractor 32 of Fig. 5 provides the respective video frames, as is illustrated in Fig. 6 with reference numeral 32'. A ball size prediction module 76 receives the video frames and may be used to estimate the ball size in the current frame by analyzing the frame content. The playfield size may serve as a basis for this estimation, but other methods may also be used.

With respect to Fig. 7, which refers to the scene detection (compare also module 14b in Fig. 5), video frames 32' are again supplied, namely to a block SAD evaluation module 78 which computes the block SAD values of the current frame if no codec SAD information (see module 80 in Fig. 7) is available. Furthermore, dominant color information 28' is extracted by the dominant color detection module 28 in Fig. 5. This dominant color information is saved in the color information memory 82 to serve as a basis for the threshold comparison, which is carried out by a threshold comparison module 84 that also receives fixed threshold information, see block 86 in Fig. 7. In more detail, the threshold comparison module 84 decides whether a scene change has occurred by comparing the dominant color information of the past frames with that of the current frame. If the threshold is exceeded, the dominant color has changed significantly enough to announce a scene change, see block 88 in Fig. 7.

In the other branch of Fig. 7 used to check whether a scene change has occurred, the results of the block SAD evaluation (in module 78) are saved in SAD memory 90, to provide the block SAD information for evaluating the dynamic threshold used in the lower branch of Fig. 7 for detecting a scene change. Accordingly, a threshold generator 92 computes the dynamic threshold by making use of the variance and mean of the past block SAD values, and a threshold comparison module 94 is provided to perform the same task as in the other branch, compare module 84, with the result that if the dynamic threshold is exceeded, it is decided that a scene change has happened. The output of both threshold comparisons is the scene change detection event 88, which is then used to decide whether the initial ball search or the continued processing should be used to process the video source, see switch 24 in Figs. 4 and 5.
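A dynamic threshold built from the mean and variance of past block SAD values can be sketched as follows. The description does not state how mean and variance are combined, so the form "mean plus k standard deviations", the factor k = 3 and the function names are assumptions for illustration.

```python
from statistics import mean, pvariance


def dynamic_threshold(past_sads, k=3.0):
    """Assumed form: mean + k * sqrt(variance) of the past block SAD values."""
    return mean(past_sads) + k * pvariance(past_sads) ** 0.5


def is_scene_change(current_sad, past_sads, k=3.0):
    """True if the current block SAD exceeds the dynamic threshold."""
    if len(past_sads) < 2:
        return False  # too little history to estimate a variance
    return current_sad > dynamic_threshold(past_sads, k)
```

With a SAD history of 100, 110, 90, 105 and 95, the threshold is about 121, so a value of 115 is treated as ordinary motion and a value of 400 as a scene change.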

Finally, the ball replacement to enhance the visual quality after compression will be described now, compare also Fig. 5, modules 26 and 50. The replacement ball used has to be large enough and must have sufficient contrast to the surrounding pixels in the frame.

The replacement ball used is computed from two input parameters: the desired compression to be applied and the size of the actual recognized ball. For practical reasons, the replacement ball may be chosen to be fully symmetric of size

$$M_x = M_y = \max\{N_x, \lambda\} : N_x \le \Lambda, \qquad (22)$$

with N_x denoting the size of the current recognized ball, and λ, Λ respectively denoting the lower and upper bound for the replacement ball size.

Thus, the replacement ball size is equal to the current recognized ball size if N_x lies between the lower bound λ and the upper bound Λ. If N_x falls below the lower bound λ, the size is clipped and the replacement ball size is kept at λ to ensure that the ball is large enough not to vanish after compression. If the current ball size exceeds the upper bound Λ, the ball is large enough to be visible after compression and a replacement is not necessary at all. Therefore, in this case, no replacement takes place. To ensure an improvement of the visual quality after replacement, the factor λ depends on the desired compression. For QCIF resolution, for instance, Λ = 10 and λ = 5 have been chosen. Equation (23) is equal to the size in y-direction, M_y = M_x.
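The size rule can be written as a short function. The default bounds are the QCIF values mentioned above (λ = 5, Λ = 10); the function name and the use of `None` to signal "no replacement" are illustrative choices.

```python
def replacement_size(n_x, lower=5, upper=10):
    """Return the replacement ball size M_x = M_y, or None if no
    replacement is needed because the ball is already large enough."""
    if n_x > upper:
        return None  # ball remains visible after compression
    return max(n_x, lower)  # never smaller than the lower bound
```

A recognized ball of size 3 is therefore replaced by a ball of size 5, a ball of size 7 keeps its size, and a ball of size 12 is not replaced.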

After determining the size of the replacement ball, the latter has to be created in form of an image. Therefore, one may initialize the replacement ball by coloring all available pixels by the dominant color. The replacement ball has to be symmetric, so e.g. all pixels with Euclidean distance

are colored white. Position (x₀, y₀) denotes the center of the replacement ball.
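The creation of the unsmoothed replacement ball can be sketched as follows. The distance criterion is not reproduced in this text, so the sketch assumes that all pixels whose center lies within M/2 of the ball center are colored white; this criterion, the function name and the RGB tuple representation are assumptions.

```python
def make_replacement_ball(size, dominant_color, ball_color=(255, 255, 255)):
    """Return a size x size image (list of rows of RGB tuples).

    All pixels start in the dominant playfield color; pixels within the
    assumed radius size / 2 around the image center become ball_color.
    """
    center = (size - 1) / 2.0  # x0 = y0, center of the pixel grid
    radius = size / 2.0
    image = []
    for y in range(size):
        row = []
        for x in range(size):
            inside = (x - center) ** 2 + (y - center) ** 2 <= radius ** 2
            row.append(ball_color if inside else dominant_color)
        image.append(row)
    return image
```

The result is symmetric in both directions and can afterwards be smoothed by a Gaussian filter before insertion into the frame.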

To let the replacement ball look more "natural", the ball generated so far may be smoothed by a Gaussian filter of size M = ⌈M_x/2⌉ (⌈·⌉ denotes the so-called ceiling operation) and variance σ² = 0.55. The Gaussian filter may be generated by Equation (4). The filtering may again be performed by means of a two-dimensional convolution as in Equation (3). A sample generated replacement ball can be seen in Fig. 9, where square-shaped pixels are shown in white for a ball before Gaussian filtering (picture (a)) and with different brightness (colors) after Gaussian filtering (picture (b)).

References

[1] O. Nemethova, M. Ries, E. Siffel, M. Rupp, "Quality Assessment for H.264 Coded Low-Rate and Low-Resolution Video Sequences", accepted to IASTED Internat. Conf. on Communications, Internet and Inf. Technology (CIIT) 2004.

[2] A. Ekin, A.M. Tekalp, R. Mehrotra, "Automatic Soccer Video Analysis and Summarization", IEEE Transactions on Image Processing, vol. 12, no. 7, pp. 796-807, July 2003

[3] X. Yu, C. Xu, H. W. Leong, Q. Tian, Q. Tang, K. W. Wah, "Trajectory-Based Ball Detection and Tracking with Applications to Semantic Analysis of Broadcast Soccer Video", Proc. of ACM Multimedia Conference, Berkeley, USA, Nov. 2-8, 2003

[4] M. Leo, T. D. D'Orazzio, A. Distante, "Independent Component Analysis for Ball recognition in Soccer Images", Proc. of the IASTED International Conference on Intelligent Systems & Control, Salzburg, Austria, pp. 351-355, June 25-27, 2003

[5] T. D. D'Orazzio, M. Leo, M. Nitti, G. Cicirelli, "A real time ball recognition system for sequences of soccer images", Proc. of the IASTED International Conference On Signal Processing, Pattern Recognition, and Applications, Crete, Greece, pp. 207-212, 25-28 June, 2002

Claims

1. A method for preprocessing game video sequences comprising frames and including a ball or puck as movable game object, for transmission of the video sequences in compressed form, characterized in that in an initial search frames are searched for the game object on the basis of comparisons of the frames with stored game object features, in that respective frames are compared with preceding frames, to decide on the basis of differences between consecutive frames whether a scene change has occurred or not, wherein in the case of a scene change, an initial search is started again, whereas in the case that no scene change is detected, tracking of the game object is carried out by determining the positions of the game object in respective frames, in that at least for one frame, a dominant game playfield color is detected and is replaced by a unitary replacement color so that a playfield representation essentially consists of points of the same color, and in that the presence, size and/or shape of the detected game object is determined, to possibly replace the game object by an enlarged replacement game object.

2. The method according to claim 1, characterized in that for determination of the difference between frames, and/or between a part of a respective frame and game object templates, the method of the sum of absolute differences (SAD) is applied.

3. The method according to claim 1 or 2, characterized in that after the game object has once been recognized in two subsequent frames, the following frames are searched for the game object only within a region of interest (ROI) which is determined on the basis of an extrapolation in conformity with an estimated trajectory of the game object.

4. The method according to claim 3, characterized in that a region of interest of a fixed size is used, and in that in the case that the game object is not detected within this region of interest, an occlusion of the game object is assumed, whereafter from frame to frame, the region of interest is increased by a predetermined amount to detect the game object again after its occlusion.

5. The method according to any one of claims 1 to 4, characterized in that the size of the replacement game object is chosen on the basis of the compression to be applied and of the size of the actual recognized game object in the respective frames.

6. The method according to any one of claims 1 to 5, characterized in that the replacement game object is subject to a Gaussian filter operation before its insertion into the respective frame.

7. The method according to any one of claims 1 to 6, characterized in that after the step of dominant color detection and replacement, image filtering using a Gaussian filter, preferably a two-dimensional Gaussian filter, is performed to smooth the respective images.

8. The method according to any one of claims 1 to 7, characterized in that updated object features are defined on the basis of the old replacement game object and of the actual game object size as detected in frames preceding the respective current frame.

9. The method according to any one of claims 1 to 8, characterized in that for the initial search, a set of game object templates is generated by a template generator on the basis of empirically predetermined and stored templates by means of Gaussian filtering and/or deblurring.

10. The method according to any one of claims 1 to 9, characterized in that playfield lines are detected by an edge detection step during tracking the game object, to exclude such playfield lines from decisions concerning the tracked game object.

11. The method according to any one of claims 1 to 10, characterized in that the dominant game playfield color is replaced by a weighted average replacement color.

12. A system for preprocessing game video sequences comprising frames and including a ball or a puck as movable game object, for transmission of the video sequences in compressed form, characterized by an initial search module (12) which is arranged to search for the game object in frames on the basis of comparisons of the frames with stored game object features, by a scene change detector (14b) which is arranged to compare a respective frame with a respective preceding frame, and to decide on the basis of differences between the consecutive frames whether a scene change has occurred or not, to activate the initial search module (12) in the case of a scene change, or to activate a game object tracking unit (18) in the case that no scene change is detected, said game object tracking unit being arranged to determine the positions of the game object in respective frames, by a dominant color detection and replacement module (14a) arranged for detecting a dominant game playfield color and replacing it by a unitary replacement color, and by a game object replacement module (26) arranged to replace a too small game object by an enlarged replacement game object.

13. The system according to claim 12, characterized in that the game object tracking unit (18) is arranged to search for the game object in consecutive frames, once recognized, only within a part of the frames, the region of interest (ROI) , the latter being defined on the basis of extrapolation of the game object positions in two preceding frames, in conformity with an estimated trajectory of the game object.

14. The system according to claim 13, characterized in that a region of interest (ROI) of a fixed size is used, and in that in the case that the game object is not detected within this region of interest, an occlusion of the game object is assumed, whereafter from frame to frame, the region of interest is increased by a predetermined amount to detect the game object again after its occlusion.

15. The system according to any one of claims 12 to 14, characterized in that the size of the replacement game object is chosen on the basis of the compression to be applied and of the size of the actual recognized game object in the respective frames.

16. The system according to any one of claims 12 to 15, characterized in that the replacement game object is subject to a Gaussian filter operation before its insertion into the respective frame.

17. The system according to any one of claims 12 to 16, characterized by an image filtering module (14c) which is arranged to perform image filtering using a Gaussian filter, preferably a two-dimensional Gaussian filter, for smoothing the respective images after dominant color detection and replacement.

18. The system according to any one of claims 12 to 17, characterized in that updated object features are defined on the basis of the old replacement game object and of the actual game object size as detected in frames preceding the respective current frame.

19. The system according to any one of claims 12 to 18, characterized by a template generator (48) arranged for generating, in the case of an initial search, new game object templates on the basis of empirically predetermined and stored templates by means of Gaussian filtering and/or deblurring.

20. The system according to any one of claims 12 to 19, characterized in that playfield lines are detected by an edge detection step during tracking the game object, to exclude such playfield lines from decisions concerning the tracked game object.