WO2011162589A1 - Method and apparatus for adaptive data clustering - Google Patents

Method and apparatus for adaptive data clustering

Info

Publication number
WO2011162589A1
Authority
WO
WIPO (PCT)
Prior art keywords
principle axis
curved
principle
axis
gaussian
Prior art date
Application number
PCT/MY2010/000238
Other languages
French (fr)
Inventor
Chee Seng Chan
Original Assignee
Mimos Berhad
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Mimos Berhad filed Critical Mimos Berhad
Publication of WO2011162589A1 publication Critical patent/WO2011162589A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/2137 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on criteria of topology preservation, e.g. multidimensional scaling or self-organising maps
    • G06F18/21375 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on criteria of topology preservation, e.g. multidimensional scaling or self-organising maps involving differential geometry, e.g. embedding of pattern manifold
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering


Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a method of clustering data points in a multi-dimensional feature space. The method comprises forming a principle axis bending Gaussian model to group the data points along a curved principle axis on an orthogonal plane; carrying out a Fuzzy C-means for locating cluster centers of the principle axis bending Gaussian model; and iteratively performing an Expectation-Maximization (EM) algorithm to update and optimize the cluster centers.

Description

Method and Apparatus for Adaptive Data Clustering
Field of the Invention
[0001] The present invention relates to data management. In particular, the invention relates to a method and system for adaptive data clustering. Background
[0002] Advancements in computing technology have led to increasing demand for data storage in light of the dramatic growth in usage of digital data, such as digital multimedia data. Sharing and distributing digital data have further boosted the demand for data storage. It was estimated that the digital universe consumed over 280 EB (exabytes) in 2007, and it was projected to grow to ten times that size by 2011.
[0003] With the large amount of digital data also comes a large variety of file formats.
Inexpensive digital imaging devices have produced huge archives of images and videos. With the huge amount of data accumulated over time, even on a personal computer, managing the files can become a hassle. Further, some data streams are unstructured, adding to the difficulty of managing them. There is therefore a need for methods of automatic data analysis, classification, and retrieval.
[0004] Data clustering, also known as cluster analysis, aims to discover the natural grouping(s) of a set of patterns, points, or objects. The Merriam-Webster Online Dictionary defines cluster analysis as "a statistical classification technique for discovering whether the individuals of a population fall into different groups by making quantitative comparisons of multiple characteristics." An example of clustering is shown in FIG. 1. The 2-dimensional (2D) vector on the left consists of data items arranged in a generally spiral distribution. The 2D vector on the right of FIG. 1 shows the clustered results of this Milky Way-like distribution using 20 clusters.
[0005] Amongst the clustering techniques, Gaussian Mixture Models (GMM) can accommodate data of varied structure, i.e., the component distributions can concentrate around surfaces of lower dimension (or even in lines), which correspond to the first several (or just the first) principal components. Sometimes clusters may concentrate around lower dimensional manifolds which are not linear. Since one non-Gaussian component can often be approximated by several Gaussian ones, these clusters can be represented by introducing more Gaussian components. For example, when a component concentrates along a nonlinear curve, it may have a piecewise linear approximation, i.e. it may be represented by several Gaussian models, each one concentrating in a linear subspace.
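The effect described above can be reproduced with off-the-shelf tools. The sketch below is a minimal illustration and not part of the patent; the spiral generator, the noise level, and the component counts are assumptions. It fits scikit-learn's GaussianMixture to spiral-shaped 2D data and shows how the fit (average log-likelihood) only becomes acceptable as the number of linear components grows.

```python
# Minimal sketch: fitting a standard (linear-component) GMM to a spiral-shaped data cloud.
# The spiral generator and the component counts are illustrative assumptions.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
t = rng.uniform(0.5, 3.0 * np.pi, size=2000)
X = np.column_stack([t * np.cos(t), t * np.sin(t)])
X += rng.normal(scale=0.15, size=X.shape)          # noisy spiral manifold

for k in (1, 2, 4, 8):                             # more components -> finer piecewise-linear fit
    gmm = GaussianMixture(n_components=k, covariance_type="full", random_state=0).fit(X)
    print(f"{k} components: average log-likelihood = {gmm.score(X):.3f}")
```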
[0006] However, due to the intrinsic linearity of the Gaussian model, when there are nonlinear manifolds in the data cloud it is natural that many components are required and the fitting error is large. To form an intuitive illustration of this fact, data from a sample image of the letter "b" (denoted by dots in each subgraph) are fitted by GMM as shown in FIG. 2A, and eight components are required to achieve an acceptable solution. FIG. 2B shows the corresponding principal axes of the GMM. [0007] US patent publication no. US2002/129038A1, by Scott Woodroofe
Cunningham, discloses a computer-implemented data mining system that analyzes data using Gaussian Mixture Models (GMM). The GMM is created based on an Expectation-Maximization (EM) algorithm which outputs data clusters based on a mixture of probability distributions fitted to the accessed data.
[0008] EP 1758097, issued to Microsoft Corporation, discloses a method for compressing multiple dimensional Gaussian distributions with diagonal covariance matrices. The method includes clustering a plurality of Gaussian distributions in a multiplicity of clusters for each dimension.
Summary
[0009] In one aspect of the present invention, there is provided a method of clustering data points in a multi-dimensional feature space. The method comprises forming a principle axis bending Gaussian model along the curved principle axis on an orthogonal plane to group the data points; carrying out a Fuzzy C-means for locating cluster centers of the principle axis bending Gaussian model; and iteratively performing an Expectation-Maximization (EM) algorithm to update and optimize the cluster centers. The probability of each data point is the sum of the probabilities of its projection points along the curved principle axis.
[0010] In one embodiment, forming the principle axis bending Gaussian model comprises carrying out a principle component analysis to form the curved principle axis through the first two eigenvectors; and carrying out a least-squares fitting on the curved principle axis to shape it. [0011] In another embodiment, the Gaussian distribution is a probability density of the points along the curved principle axis. The EM algorithm may be carried out with factors including distances between points and centers in a dissimilarity function, distances in the principle plane, and distances in the orthogonal space. Further, the distances in the principle plane include arc length and normal length.
[0012] Yet further, the dissimilarity is a function of the multiplicative inverse of the probability density. The principle axis bending Gaussian model may comprise forming a curved principal axis and performing a Gaussian distribution along the curved principle axis.
Brief Description of the Drawings
[0013] This invention will be described by way of non-limiting embodiments of the present invention, with reference to the accompanying drawings, in which:
[0014] FIG. 1 shows a known general data clustering on a 2-dimensional vector;
[0015] FIG. 2A shows a sample image fitted by Gaussian Mixture Model through a known technique;
[0016] FIG. 2B shows the corresponding principal axes of the GMM of FIG. 2A;
[0017] FIG. 3 shows a flow chart of a data clustering method in accordance with one embodiment of the present invention;
[0018] FIG. 4 shows a flow chart of forming a principle axis bending Gaussian model in accordance with one embodiment of the present invention; and
[0019] FIG. 5 shows that the probability distribution along a curved principle axis, or along a normal direction of the curved axis, is a Gaussian distribution. Detailed Description
[0020] In line with the above summary, the following description of a number of specific and alternative embodiments is provided for understanding the inventive features of the present invention. It shall be apparent to one skilled in the art, however, that this invention may be practised without such specific details. Some of the details may not be described at length so as not to obscure the invention. For ease of reference, common reference numerals will be used throughout the figures when referring to the same or similar features common to the figures.
[0021] FIG. 3 illustrates a flow chart of a method for clustering a data cloud in accordance with one embodiment of the present invention. The method comprises forming a principal axis bending Gaussian model to group data at step 301; applying Fuzzy C-means to introduce a degree of fuzziness on the distance-based dissimilarity function at step 302; and defining dissimilarity based on a modified Expectation-Maximization (EM) algorithm at step 303. The dissimilarity is defined as the multiplicative inverse of the probability density function, as shown in detail below.
[0022] In accordance with one embodiment, there is provided a method of data clustering based on a principle axis bending Gaussian model (herein also referred to as "the model"). In short, the model forms a curved principal axis and performs a Gaussian distribution along the curved principle axis. The model can also be viewed as a Gaussian model bending along the first principal axis. The Gaussian distribution is a conditional probability density to any point along the curved principle axis on the orthogonal plane. The model is capable of expressing curve manifolds directly and simply. When the corresponding mixture models are introduced to fit curve manifolds in the data cloud, the number of components could be reduced and the fitting error could be made smaller.
[0023] As mentioned, the principle axis curve Gaussian model is equivalent to a Gaussian model bent at the first principle axis, which can be effectively used to fit a data cloud with curve manifolds. As shown in FIG. 4, the method comprises carrying out Principle Component Analysis (PCA) at step 401 and Least Squares Fitting (LSF) at step 402. The principle axis curve (i.e. the bent principle axis) on the orthogonal plane is determined from the first two eigenvectors obtained by PCA. The LSF is then used to shape the axis curve. The probability distribution along the curved principle axis, or along the normal direction of the curved axis, is a Gaussian distribution, as shown in FIG. 5.
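A minimal sketch of steps 401-402 follows. It is an illustration only, assuming the curved principle axis is modeled as a low-order polynomial in the plane spanned by the first two eigenvectors; the quadratic degree, the function name, and the use of numpy.polyfit are assumptions, not details taken from the patent.

```python
# Sketch of forming a curved principle axis: PCA (first two eigenvectors) + least-squares fitting.
# The quadratic polynomial used for the least-squares fit is an assumed, illustrative choice.
import numpy as np

def curved_principle_axis(X, degree=2):
    """Return the data mean, the first two eigenvectors, and the polynomial
    coefficients of the curve fitted in the principal plane."""
    mean = X.mean(axis=0)
    Xc = X - mean
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]
    E = eigvecs[:, order[:2]]                       # first two eigenvectors span the principal plane
    plane = Xc @ E                                  # coordinates (s, r) in the principal plane
    s, r = plane[:, 0], plane[:, 1]
    coeffs = np.polyfit(s, r, degree)               # least-squares fit shapes the curved axis r = f(s)
    return mean, E, coeffs
```

The polynomial r = f(s) stands in for the "shaped" axis curve; any other least-squares curve family could be substituted at step 402.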
[0024] On the principle plane, one point p1 may have several projection points (p, p′, p″) along the curved principle axis, because bending the principle axis may cause one point to lie in several different normal directions of the curved axis, as shown on the right of FIG. 5. Thus the probability of each point (p1, p2) is the sum of the probabilities of its projection points. This makes the principle axis curve Gaussian most different from the normal Gaussian model, where each point's probability is accounted for by itself alone.
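To make the multiple-projection statement concrete, the sketch below (an assumed illustration reusing the polynomial axis r = f(s) from the previous sketch; the distribution parameters sigma_arc and sigma_norm are assumptions) finds every stationary point of the squared distance from a point to the curve and sums a Gaussian contribution over all of them.

```python
# Sketch: a point in the principal plane can have several projection points on the curved axis,
# and its probability is taken as the sum over those projections. Distribution parameters are assumed.
import numpy as np

def projection_points(coeffs, p):
    """All stationary points s* of the squared distance from p = (s0, r0) to the curve r = f(s)."""
    s0, r0 = p
    f = np.poly1d(coeffs)
    # d/ds [ (s - s0)^2 + (f(s) - r0)^2 ] = 2(s - s0) + 2 f'(s) (f(s) - r0) = 0
    g = np.poly1d([1.0, -s0]) + f.deriv() * (f - r0)
    return [s.real for s in g.roots if abs(s.imag) < 1e-9]

def point_probability(coeffs, p, sigma_arc=1.0, sigma_norm=0.5):
    """Sum of Gaussian contributions over all projection points (illustrative densities)."""
    f = np.poly1d(coeffs)
    total = 0.0
    for s in projection_points(coeffs, p):
        arc = s                                   # position along the axis (arc-length proxy)
        norm = np.hypot(p[0] - s, p[1] - f(s))    # normal distance to the curve
        total += (np.exp(-0.5 * (arc / sigma_arc) ** 2) / (sigma_arc * np.sqrt(2 * np.pi))
                  * np.exp(-0.5 * (norm / sigma_norm) ** 2) / (sigma_norm * np.sqrt(2 * np.pi)))
    return total
```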
[0025] Following the above, Fuzzy C-means (FCM) is employed. The aim of FCM is to find cluster centers (centroids) that minimize a dissimilarity function, i.e. the weighted within-group sum-of-squared-error objective function shown in Eq. (1).

[0026] J_m = \sum_{i=1}^{k} \sum_{t=1}^{n} u_{it}^{m} d_{it}^{2}    Eq. (1)

[0027] where X = {x_1, x_2, ..., x_n} \subset R^d, n is the number of data items, k is the number of clusters with 2 \le k \le n, m is a weighting exponent on each fuzzy membership, \mu_i is the prototype of the center of cluster i, U = {u_{it}}, and u_{it} is the degree of membership of x_t in the i-th cluster, wherein

[0028] 0 \le u_{it} \le 1, \quad \sum_{i=1}^{k} u_{it} = 1    Eq. (2)

[0029] for 1 \le i \le k and 1 \le t \le n, and d_{it} is a norm distance measure between object x_t and cluster center \mu_i, wherein

[0030] d_{it}^{2} = \lVert x_t - \mu_i \rVert_A^{2} = (x_t - \mu_i)^T A (x_t - \mu_i)    Eq. (3)
[0031] The process of minimizing the objective function J_m depends on how the centers find their way to the best positions, as the fuzzy memberships u_{it} and norm distances d_{it} change along with the centers' positions. Given the previous positions of the centers, Eq. (3) produces the distance-based dissimilarities. To minimize the objective function, the new positions of the centers are determined by the following equations:
[0032] u_{it} = \frac{1}{\sum_{j=1}^{k} \left( d_{it} / d_{jt} \right)^{2/(m-1)}}    Eq. (4)

[0033] \mu_i = \frac{\sum_{t=1}^{n} u_{it}^{m} x_t}{\sum_{t=1}^{n} u_{it}^{m}}    Eq. (5)
[0034] After a number of iterations with Eqs. (3)-(5), the centers of the clusters are optimized to minimize the objective function. The iteration stops when the difference between the current value and the previous value of the objective function is less than a preset threshold. The preset threshold is generally determined based on the desired application. In one embodiment, the preset threshold is any value between 0 and 1.
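A compact sketch of the FCM iteration described by Eqs. (1)-(5) is given below. The Euclidean norm (A = I), the fuzzifier m = 2, the threshold value, and the random initialization are illustrative assumptions.

```python
# Sketch of Fuzzy C-means: alternate the membership update (Eq. 4) and the center update (Eq. 5)
# until the change in the objective function (Eq. 1) falls below a preset threshold.
import numpy as np

def fuzzy_c_means(X, k, m=2.0, threshold=1e-4, max_iter=300, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    U = rng.random((k, n))
    U /= U.sum(axis=0)                                        # memberships sum to 1 per point
    prev_obj = np.inf
    for _ in range(max_iter):
        Um = U ** m
        centers = (Um @ X) / Um.sum(axis=1, keepdims=True)    # Eq. (5)
        d = np.linalg.norm(X[None, :, :] - centers[:, None, :], axis=2)   # d_it, shape (k, n)
        d = np.fmax(d, 1e-12)
        obj = np.sum(Um * d ** 2)                             # Eq. (1)
        if abs(prev_obj - obj) < threshold:                   # stop when improvement is small
            break
        prev_obj = obj
        U = 1.0 / np.sum((d[:, None, :] / d[None, :, :]) ** (2.0 / (m - 1.0)), axis=1)  # Eq. (4)
    return centers, U
```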
[0035] As FCM introduces a weighting exponent m on each fuzzy membership in Eq. (1) and Eq. (5), the points which are nearer to one cluster than to the others are made more important for that cluster and, at the same time, less significant for the other clusters. [0036] The dissimilarity function is defined as:

[0037] d_{it}^{2} = \frac{1}{p_i\, P(x_t \mid \theta_i)}    Eq. (6)
[0038] where p_i P(x_t | \theta_i) is obtained from both Eq. (7) and Eq. (8).
[0041] The fuzzy membership u_{it} is obtained from Eq. (4), and the components' weights p_i are defined by Eq. (9).
[0043] Further, estimations of the principle axis curve Gaussian centers, covariances, and translation and rotation matrices are performed by Eqs. (10)-(13) below, in which a least-squares fitting model (LSFM) is applied to the principal components of the data:

[0044] (C_f, T_f, U_f, u_f) = LSFM(PCA(X))    Eq. (10)

[0045]-[0047] Eqs. (11)-(13)
[0048] As shown in Eqs. (11)-(13), the modified EM includes the distances between the respective points and centers in the dissimilarity function, the distances in the principle plane, as well as the distances in the orthogonal space.
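Of the equations above, only Eq. (6) survives legibly in this text, so the sketch below is limited to that step: it converts a mixture weight p_i and a component density P(x_t | theta_i) into the dissimilarity used by the modified EM. The multivariate Gaussian stand-in for P and the small floor value are assumptions; the arc-length and normal-length terms of Eqs. (11)-(13) are not reproduced.

```python
# Sketch of Eq. (6): dissimilarity as the multiplicative inverse of the weighted density.
# The multivariate Gaussian stand-in for P(x_t | theta_i) is an illustrative assumption.
import numpy as np

def gaussian_pdf(x, mean, cov):
    diff = x - mean
    d = len(mean)
    norm_const = np.sqrt(((2.0 * np.pi) ** d) * np.linalg.det(cov))
    return np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)) / norm_const

def dissimilarity(x, weight, mean, cov, floor=1e-300):
    """d_it^2 = 1 / (p_i * P(x_t | theta_i)), Eq. (6)."""
    return 1.0 / max(weight * gaussian_pdf(x, mean, cov), floor)   # floor avoids division by zero
```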
[0049] It is well known in the art that conventional algorithms are not adaptive due to the linearity of the Gaussian distribution. The method provided herewith is adaptive owing to the combination of a curved principal axis and a Gaussian distribution along that curved axis, as described above. [0050] While specific embodiments have been described and illustrated, it is understood that many changes, modifications, variations and combinations thereof could be made to the present invention without departing from the scope of the invention.

Claims

1. A method of clustering data points in a multi-dimensional feature space, the method comprising: forming a principle axis bending Gaussian model that forms a curved principle axis and performs a Gaussian distribution along the curved principle axis on an orthogonal plane to group the data points; carrying out a Fuzzy C-means for locating cluster centers of the principle axis bending Gaussian model; and iteratively performing an Expectation-Maximization (EM) algorithm to update and optimize the cluster centers, wherein the probability of each data point is a sum of the probabilities of its projection points along the curved principle axis.
2. The method according to claim 1, wherein forming the principle axis bending Gaussian model comprises: carrying out a principle component analysis to form the curved principle axis through the first two eigenvectors; and carrying out a least-squares fitting on the curved principle axis to shape the curved principle axis.
3. The method according to claim 1, wherein the Gaussian distribution is a probability density of the points along the curved principle axis.
4. The method according to claim 1, wherein the EM algorithm is carried out with factors including distances between points and centers in a dissimilarity function, distances in the principle plane and distances in the orthogonal space.
5. The method according to claim 4, wherein the distances in the principle plane include arc length and normal length.
6. The method according to claim 4, wherein the dissimilarity function is:

d_{it}^{2} = \frac{1}{p_i\, P(x_t \mid \theta_i)}    Eq. (6)

where p_i P(x_t | \theta_i) is derived from both Eq. (7) and Eq. (8).
7. The method according to claim 1, wherein the dissimilarity is a function of the multiplicative inverse of the probability density.
8. The method according to claim 1, wherein the principle axis bending Gaussian model comprises forming a curved principal axis and performing a Gaussian distribution along the curved principle axis.
PCT/MY2010/000238 2010-06-22 2010-10-29 Method and apparatus for adaptive data clustering WO2011162589A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
MYPI2010002979 MY152935A (en) 2010-06-22 2010-06-22 Method and apparatus for adaptive data clustering
MYPI2010002979 2010-06-22

Publications (1)

Publication Number Publication Date
WO2011162589A1 true WO2011162589A1 (en) 2011-12-29

Family

ID=45371615

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/MY2010/000238 WO2011162589A1 (en) 2010-06-22 2010-10-29 Method and apparatus for adaptive data clustering

Country Status (2)

Country Link
MY (1) MY152935A (en)
WO (1) WO2011162589A1 (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020129038A1 (en) * 2000-12-18 2002-09-12 Cunningham Scott Woodroofe Gaussian mixture models in a data mining system
US20030154181A1 (en) * 2002-01-25 2003-08-14 Nec Usa, Inc. Document clustering with cluster refinement and model selection capabilities
US20070172105A1 (en) * 2006-01-25 2007-07-26 Claus Bahlmann System and method for local pulmonary structure classification for computer-aided nodule detection

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104809098A (en) * 2014-01-27 2015-07-29 华为技术有限公司 Method and device for determining statistical model parameter based on expectation-maximization algorithm
CN105913074A (en) * 2016-04-05 2016-08-31 西安电子科技大学 Combined SAR image moving target clustering method based on amplitude and radial speed
CN106408569A (en) * 2016-08-29 2017-02-15 北京航空航天大学 Brain MRI (magnetic resonance image) segmentation method based on improved fuzzy C-means clustering algorithm
CN106408569B (en) * 2016-08-29 2018-12-04 北京航空航天大学 Based on the brain MRI image dividing method for improving Fuzzy C-Means Cluster Algorithm
US10451544B2 (en) 2016-10-11 2019-10-22 Genotox Laboratories Methods of characterizing a urine sample
US11079320B2 (en) 2016-10-11 2021-08-03 Genotox Laboratories Methods of characterizing a urine sample
US11946861B2 (en) 2016-10-11 2024-04-02 Genotox Laboratories Methods of characterizing a urine sample
US11789024B2 (en) 2017-05-09 2023-10-17 Massachusetts Eye And Ear Infirmary Biomarkers for age-related macular degeneration
CN108828583A (en) * 2018-06-15 2018-11-16 西安电子科技大学 One kind being based on fuzzy C-mean algorithm point mark cluster-dividing method
CN108828583B (en) * 2018-06-15 2022-06-28 西安电子科技大学 Point trace clustering method based on fuzzy C mean value

Also Published As

Publication number Publication date
MY152935A (en) 2014-12-15

Similar Documents

Publication Publication Date Title
Muandet et al. Learning from distributions via support measure machines
Kapoor et al. Active learning with gaussian processes for object categorization
WO2011162589A1 (en) Method and apparatus for adaptive data clustering
CN113850281B (en) MEANSHIFT optimization-based data processing method and device
CN109033978B (en) Error correction strategy-based CNN-SVM hybrid model gesture recognition method
CN109871855B (en) Self-adaptive deep multi-core learning method
Georgogiannis Robust k-means: a theoretical revisit
Lee et al. Flood fill mean shift: A robust segmentation algorithm
Sudholt et al. A modified isomap approach to manifold learning in word spotting
CN111125469B (en) User clustering method and device of social network and computer equipment
CN105809113A (en) Three-dimensional human face identification method and data processing apparatus using the same
Amid et al. A more globally accurate dimensionality reduction method using triplets
Fan et al. Proportional data modeling via entropy-based variational bayes learning of mixture models
CN109840558B (en) Self-adaptive clustering method based on density peak value-core fusion
CN111325276A (en) Image classification method and device, electronic equipment and computer-readable storage medium
Osadchy et al. Recognition using hybrid classifiers
Ibba et al. A study on combining sets of differently measured dissimilarities
Campos et al. Global localization with non-quantized local image features
Straka et al. A survey of sample size adaptation techniques for particle filters
Sajama et al. Semi-parametric exponential family PCA
CN108665479A (en) Infrared object tracking method based on compression domain Analysis On Multi-scale Features TLD
Li et al. Kernel hierarchical agglomerative clustering-comparison of different gap statistics to estimate the number of clusters
Shen et al. Training support vector machine through redundant data reduction
Aksoy Bayesian decision theory
Cheng et al. Image segmentation using histogram fitting and spatial information

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 10853747

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 10853747

Country of ref document: EP

Kind code of ref document: A1