US20130297607A1 - Identification of pattern similarities by unsupervised cluster analysis - Google Patents

Identification of pattern similarities by unsupervised cluster analysis Download PDF

Info

Publication number
US20130297607A1
US20130297607A1 US13/934,166 US201313934166A US2013297607A1 US 20130297607 A1 US20130297607 A1 US 20130297607A1 US 201313934166 A US201313934166 A US 201313934166A US 2013297607 A1 US2013297607 A1 US 2013297607A1
Authority
US
United States
Prior art keywords
data
dataset
readable medium
transitory machine
clustering
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US13/934,166
Inventor
Asa Ben-Hur
Andre Elisseeff
Isabelle Guyon
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Health Discovery Corp
Original Assignee
Health Discovery Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Health Discovery Corp filed Critical Health Discovery Corp
Priority to US13/934,166 priority Critical patent/US20130297607A1/en
Publication of US20130297607A1 publication Critical patent/US20130297607A1/en
Abandoned legal-status Critical Current

Links

Images

Classifications

    • G06F17/30705
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G16B25/10Gene or protein expression profiling; Expression-ratio estimation or normalisation
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20Supervised data analysis
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/30Unsupervised data analysis
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Definitions

  • the present invention relates to the use of learning machines to identify relevant patterns in datasets containing large quantities of diverse data, and more particularly to a method and system for unsupervised learning for determining an optimal number of data clusters into which data can be divided to best enable identification of relevant patterns.
  • Knowledge discovery is the most desirable end product of data collection. Recent advancements in database technology have lead to an explosive growth in systems and methods for generating, collecting and storing vast amounts of data. While database technology enables efficient collection and storage of large data sets, the challenge of facilitating human comprehension of the information in this data is growing ever more difficult. With many existing techniques the problem has become unapproachable. Thus, there remains a need for a new generation of automated knowledge discovery tools.
  • human Genome Project has completed sequencing of the human genome.
  • the complete sequence contains a staggering amount of data, with approximately 31,500 genes in the whole genome.
  • the amount of data relevant to the genome must then be multiplied when considering comparative and other analyses that are needed in order to make use of the sequence data.
  • human chromosome 20 alone comprises nearly 60 million base pairs.
  • Several disease-causing genes have been mapped to chromosome 20 including various autoimmune diseases, certain neurological diseases, type 2 diabetes, several forms of cancer, and more, such that considerable information can be associated with this sequence alone.
  • Proteomics is the study of the group of proteins encoded and regulated by a genome. Genomic activation or expression does not always mean direct changes in protein production levels or activity. Alternative processing of mRNA or post-transcriptional or post-translational regulatory mechanisms may cause the activity of one gene to result in multiple proteins, all of which are slightly different with different migration patterns and biological activities.
  • the human proteome is believed to be 50 to 100 times larger than the human genome. Currently, there are no methods, systems or devices for adequately analyzing the data generated by such biological investigations into the genome and proteome.
  • Clustering is a widely used approach for exploratory data analysis. Clustering analysis is unsupervised learning—it is done without suggestion from an external supervisor; classes and training examples are not given a priori. The objective of clustering is to group data points into “meaningful” subsets. There is no agreed upon definition of the clustering problem, and various definitions appear in the literature. For example, clustering has been defined as a search for some “natural” or “inherent” grouping of the data. However, most clustering algorithms do not address this problem. The vast majority of clustering algorithms produce as their output either a dendogram or a partition into a number of clusters, where the number of clusters is either the input, or there is some other parameter(s) that controls the number of clusters. In either case, a model selection technique is required in order to choose the model parameter, or in the case of hierarchical algorithms, to determine which level of the dendogram represents the “inherent” structure of the data.
  • a few examples of applications of clustering include (1) analysis of microarray data, where co-expressed genes are found, and the assumption is that co-expression might be a sign of co-regulation; (2) in medical datasets (gene expression data, clinical data etc.), where patients are divided into categories; (3) in any set or set of measurements to detect trends or artifacts in the measurement protocol; and (4) in information retrieval to partition text according to categories.
  • Clustering algorithms are unsophisticated in that they provide no insight into the level of granularity at which the “meaningful” clusters might be found. Occasionally, there may be prior knowledge about the domain that facilitates making such a choice. However, even in such cases, a method for determining the granularity at which to look at the data is required. This is seen as the problem of finding the optimal number of clusters in the data, relative to some clustering algorithm.
  • a system and method are provided for analysis of data by grouping the input data points into meaningful subsets of data.
  • the exemplary system comprises a storage device for storing at least one data set, and a processor for executing the clustering algorithm.
  • the clustering algorithm is a k-means or hierarchical clustering algorithm.
  • a “good granularity level”, i.e., an optimal number of clusters, is defined as one at which a clustering solution is stable with respect to some perturbation of the data, such as noise or sub-sampling.
  • the algorithm chooses a number of pairs of sub-samples or other perturbations, clusters each sub-sample with the chosen level of granularity, then computes a similarity between pairs of clustering solutions.
  • the distribution of the similarity between pairs of clustering solutions is computed, then the highest level of granularity for which highly similar solutions are obtained under perturbation is chosen.
  • the probability distribution of the similarity measure is evaluated in order to find the “true” number of clusters based on a similarity measure between clustering solutions.
  • a dot product between pairs of clustering solutions is normalized to provide a similarity measure.
  • Other similarity measures proposed in the literature can also be expressed in terms of this dot product.
  • the inventive method provides means for extraction of information from gene expression profiles, in particular, extraction of the most relevant clusters of temporal expression profiles from tens of thousands of measured profiles.
  • the clustering algorithm is the k-means algorithm.
  • Other embodiments include all pairwise clustering methods (e.g., hierarchical clustering) and support vector clustering (i.e., a support vector machine (SVM)).
  • the fit involves affine transformations such as translation, rotation, and scale. Other transformations could be included, such as elastic transformations.
  • the algorithm is applied in a hierarchical manner by sub-sampling the data and clustering the sub-samples.
  • the resulting cluster centers for all the runs are then clustered again.
  • the resulting cluster centers are considered to be the most significant profiles.
  • a ranking of the genes is first performed according to a given quality criterion combining saliency (significant difference in expression in the profile), smoothness, and reliability (low noise level). Other criteria for ranking include the local density of examples.
  • FIGS. 1 a - 1 c are a plot of data which is a mixture of four Gaussians, a histogram of the correlation score for the Gaussian data mixture, and a plot showing the overlay of cumulative distributions of the correlation score, respectively.
  • FIGS. 2 a and 2 b are histogram of the correlation score for the yeast gene expression data and an overlay of the cumulative distributions of the correlation score, respectively.
  • FIGS. 3 a and 3 b are a histogram of the correlation score for 208 points uniformly distributed on the unit cube and an overlay of the cumulative distributions of the correlation score, respectively.
  • FIGS. 5 a - 5 b are graphic representations of gene pre-selection.
  • FIGS. 6 a - 6 b are graphs of the average profiles of the clusters of the eight clustering runs and their grouping into meta-clusters.
  • FIGS. 7 a - 7 d are graphs of curve fitting with affine transformations.
  • FIGS. 8 a - 8 h are graphs of curve fitting with affine transformations, clustering obtained from the top 800 best quality genes.
  • a typical computer system for running the inventive clustering algorithm is a Pentium®-class processor with an interface for inputting the dataset(s) of interest and a memory for storing the dataset.
  • Alternative processors include configurable logic processors made up of SRAM-based field programmable gate arrays (FPGAs) or configurable system on chip (CSOC) architectures, which include a processor and an array of configurable logic cells on a single chip, either of which can be used to accelerate computer-intensive operations such as clustering algorithms on large datasets.
  • FPGAs field programmable gate arrays
  • CSOC configurable system on chip
  • parallel implementations in multi-computers have been used for executing clustering algorithms for extracting knowledge from large-scale data repositories. Selection of an appropriate computer system for executing the inventive clustering method is within the level of skill in the art.
  • the clustering model selection algorithm works with the help of a scoring function that provides a similarity measure between two labelings.
  • a labeling is a partition of X into k subsets S 1 , . . . , S k . It can be represented by a function c:X ⁇ 1, . . . , k ⁇ where c(x i ) is the cluster to which x i belongs.
  • a less compact representation of a labeling which may be useful is the following representation by a matrix C with components:
  • this representation is label independent, i.e. there is no need to assign each point a label in ⁇ 1, . . . , k ⁇ .
  • This may be viewed as an adjacency matrix, where each cluster is a connected component of the graph.
  • This representation can also be converted into a representation of soft cluster labeling.
  • This dot product computes the number of pairs of vectors clustered together, and can also be interpreted as the number of common edges in graphs represented by C (1) and C (2) .
  • a na ⁇ ve method for computing the dot product by going over all pairs of points has complexity O(n 2 ). However, it can be computed in linear time.
  • the correlation score which is a normalized version of the dot product, is:
  • cor ⁇ ( L 1 , L 2 ) ⁇ L 1 , L 2 ⁇ ⁇ L 1 , L 1 ⁇ ⁇ ⁇ L 2 , L 2 ⁇ ( 3 )
  • N ij i,j ⁇ 0, 1 ⁇ be the number of entries on which C (1) and C (2) have values i and j, respectively.
  • the matching coefficient as the fraction of entries on which the two matrices agree:
  • the Jaccard coefficient is the corresponding ratio when “negative” matches are ignored:
  • f fraction of points sampled
  • 1 say f>0.5
  • “inherent structure” is defined as structure that is stable under sub-sampling (or alternatively, perturbing the data). Given a clustering algorithm and a dataset, this translates into a search for the best number of clusters to use in the particular instance. Note that one can also extend the search to look for a set of features where structure is apparent, however, in this case, the same set of features is kept.
  • the inventive clustering algorithm receives as input a dataset (or similarity/dissimilarity matrix) and a parameter k that controls either directly or indirectly the number of clusters that the algorithm produces. This convention is applicable to hierarchical clustering algorithms as well: given k, the tree is cut so that k clusters are produced. Next, characterize the stability of the clustering for each k. This is accomplished by producing a set of clusterings of sub-samples of the data, and comparing the labels of the intersection of pairs of clusterings using, for example, the correlation similarity measure. This is performed for increasing values of k (see above for details). The distribution of the scores for the different values of k is then compared (see, e.g., FIGS. 1 a - 1 c ). The idea is that when there is structure in the data that is well described by the clustering algorithm with that particular value of k, many sub-samples will produce similar clusterings, and their pairwise similarity score will be concentrated close to 1.
  • Each sub-sample contains a fixed fraction of the data, f.
  • the actual subsampling can be implemented in various ways:
  • FIG. 1 a shows a mixture of four Gaussians.
  • the data is a matrix which represents the mRNA expression levels of n genes across a number of experiments. Some of the genes in the data have known labels according to a functional class. Five functional classes were selected along with genes that belong uniquely to these five functional classes. This yielded a dataset with 208 genes, with 79 features (experiments). Data was normalized by subtracting the mean and dividing by the standard deviation for each column. This was also performed for the rows, and repeated for the columns. At this stage the first three principal components were extracted.
  • the distribution and histogram of scores is given in FIG. 2 a for k over the range of 2 to 7.
  • the same behavior is observed as seen in the mixture of four Gaussians data of FIG. 1 a .
  • FIGS. 3 a and 3 b The results of a test run on data uniformly distributed on the unit cube is shown in FIGS. 3 a and 3 b .
  • the distributions are quite similar to each other, with no change that can be interpreted as a transformation from a stable set of solutions to unstable solutions
  • FIG. 4 a is a plot of the distribution of the first two principle components for each of the selected letters.
  • FIG. 4 b provides histograms of the correlation score for each k in the range of 2 to 7, and
  • Table 1 provides a comparison of the number of clusters identified using the inventive method against the number of clusters obtained using other methods for selecting k. These other methods are described by R. Tibshirani, G. Walther, and T. Hastie in “Estimating the number of clusters in a dataset via the Gap statistic”, Tech. Report, Department of Statistics, Stanford University, 2000; also published in JRSSB 2000, where the Gap Statistic methods is shown to be superior to a number of other methods.
  • k-means clustering produces solutions that are highly stable with respect to sub-sampling. This good result may be due to the global optimization criterion, which differs from the local bottom up approach of hierarchical clustering that appears to be less stable to sub-sampling.
  • a data set of 1600 genes with 12 time steps was utilized to illustrate the process undergone by gene expression profiles.
  • the genes were ranked in order of “quality” to pre-select a subset for further analysis. All the genes were ranked according to three criteria: (1) saliency (the absolute difference between their min and max value; the larger the better); (2) smoothness (a coefficient assessing the smoothness of the profile was computed; the smoother the better); and (3) reliability (the average standard deviation of the experiment replicated for the whole profile).
  • FIGS. 5 a and 5 b show gene expression temporal profiles.
  • the ten best genes are depicted in FIG. 5 a
  • the ten worst genes are shown in FIG. 5 b according to a combined criterion of saliency, smoothness, and reliability.
  • the 5% top quality genes according to the above defined combined criterion 800 genes were selected.
  • a kernel clustering algorithm based on k-means was then run on random subsets of 100 genes among these 800 genes.
  • a maximum number of clusters of ten were used, but only five did not degenerate.
  • the stability of the solutions was verified by running again kernel k-means on the resulting cluster centers. The solution was robust with respect to increasing the number of genes (doubling, to 1600 genes), changing the subset size (to 200 genes) and the maximum number of cluster (to 20 genes).
  • FIGS. 6 a and 6 b illustrate the average profiles of the clusters of the eight clustering runs and their groupings into meta-clusters 1-5.
  • the cluster centers in FIG. 6 a were obtained with multiple runs of k-means using random subsets of 100 genes in the top 800 best quality genes.
  • FIG. 6 b shows the average cluster centers for the five clusters. Only a subset of the nine possible profiles that could occur is represented.
  • the clustering algorithm is based on, but is a variation of, the classical k-means algorithm.
  • the algorithm operates by the following routine:
  • This algorithm differs from the classical k-means in that it uses a special metric to measure proximity in step 2. It is based on the residuals of a fit of one profile onto another, using an affine transformation that is a combination of translation, scaling, and rotation (to first order). Such fit is remarkably simple to implement with a couple of lines of MATLAB® (The MathWorks, Inc., Natick, Mass.) and is quite fast. For example, to fit a profile (a 12 dimensional vector x 2 ) that goes through zero between time step 5 and 6, onto another vector x 1 , one can write:
  • xx 2 [x 2 ,ones(12,1),[ ⁇ 5; ⁇ 4; ⁇ 3; ⁇ 2; ⁇ 1;1;2;3;4;5;6;7]];
  • FIGS. 7 a - d depict two profiles and variations on how they can be fit one onto the other or both on their average. This provides an illustration of curve fitting with affine transformations.
  • FIG. 7 a shows the two original profiles P 1 and P 2 .
  • FIG. 7 b shows the profile P 1 fitted on P 2 .
  • FIG. 7 c shows profile P 2 fitted on P 1 .
  • FIG. 7 d shows both profiles fitted to their average.
  • Step 1 of the algorithm a simple average of the patterns is used, as opposed to an average of the curves fitted to the cluster centers. After iterating, there is a significant distortion of the cluster center profiles, some of which just become straight lines.
  • FIGS. 8 a - h illustrate the results of one run of the algorithm on a subset of 100 gene expression profiles with a maximum cluster number of 10, where expression intensity is plotted with time. The clustering is obtained from the top 800 best quality genes.
  • FIG. 8 a shows the original profiles with random class label assignments.
  • FIGS. 8 c , 8 e and 8 g illustrate fitted profiles of cluster members at successive iterations.
  • FIGS. 8 b , 8 d , 8 f , and 8 h show the cluster centers at successive iterations.
  • the present invention provides a method for model selection in clustering.
  • Most other approaches in the prior art are based on either the sum squared distances within clusters, or some combination of within-cluster distances and between-cluster distances.
  • the inventive method provides a description of the stability of a clustering method with respect to re-sampling or noise. It is believed that this stability captures the notion of “inherent” structure that is the goal in the validating clustering solutions.
  • Another advantage of present invention is that it is useful in identifying the absence of structure in the data. This differs from most clustering validation algorithms of the prior art, with the exception of the gap statistic, which can only determine the number of clusters if that number is larger than 1.

Abstract

A method is provided for unsupervised clustering of data to identify pattern similarities. A clustering algorithm randomly divides the data into k different subsets and measures the similarity between pairs of datapoints within the subsets, assigning a score to the pairs based on similarity, with the greatest similarity giving the highest correlation score. A distribution of the scores is plotted for each k. The highest value of k that has a distribution that remains concentrated near the highest correlation score corresponds to the number of classes having pattern similarities.

Description

    RELATED APPLICATIONS
  • This application is a continuation of application Ser. No. 13/019,585, filed Feb. 2, 2011, issued Jul. 16, 2013 as U.S. Pat. No. 8,489,531, which is a continuation of application Ser. No. 11/929,522, filed Oct. 30, 2007, issued Feb. 15, 2011 as U.S. Pat. No. 7,890,445, which is a continuation of application Ser. No. 10/478,191, filed Nov. 18, 2003, now abandoned, which is a U.S. National Stage filing of PCT/US02/15666, filed May 17, 2002, which claims the benefit of U.S. provisional patent applications No. 60/335,990, filed Nov. 30, 2001, and No. 60/292,221, filed May 18, 2001. Each of the identified applications is incorporated herein by reference.
  • FIELD OF THE INVENTION
  • The present invention relates to the use of learning machines to identify relevant patterns in datasets containing large quantities of diverse data, and more particularly to a method and system for unsupervised learning for determining an optimal number of data clusters into which data can be divided to best enable identification of relevant patterns.
  • BACKGROUND OF THE INVENTION
  • Knowledge discovery is the most desirable end product of data collection. Recent advancements in database technology have lead to an explosive growth in systems and methods for generating, collecting and storing vast amounts of data. While database technology enables efficient collection and storage of large data sets, the challenge of facilitating human comprehension of the information in this data is growing ever more difficult. With many existing techniques the problem has become unapproachable. Thus, there remains a need for a new generation of automated knowledge discovery tools.
  • As a specific example, the Human Genome Project has completed sequencing of the human genome. The complete sequence contains a staggering amount of data, with approximately 31,500 genes in the whole genome. The amount of data relevant to the genome must then be multiplied when considering comparative and other analyses that are needed in order to make use of the sequence data. As an illustration, human chromosome 20 alone comprises nearly 60 million base pairs. Several disease-causing genes have been mapped to chromosome 20 including various autoimmune diseases, certain neurological diseases, type 2 diabetes, several forms of cancer, and more, such that considerable information can be associated with this sequence alone.
  • One of the more recent advances in determining the functioning parameters of biological systems is the analysis of correlation of genomic information with protein functioning to elucidate the relationship between gene expression, protein function and interaction, and disease states or progression. Proteomics is the study of the group of proteins encoded and regulated by a genome. Genomic activation or expression does not always mean direct changes in protein production levels or activity. Alternative processing of mRNA or post-transcriptional or post-translational regulatory mechanisms may cause the activity of one gene to result in multiple proteins, all of which are slightly different with different migration patterns and biological activities. The human proteome is believed to be 50 to 100 times larger than the human genome. Currently, there are no methods, systems or devices for adequately analyzing the data generated by such biological investigations into the genome and proteome.
  • Clustering is a widely used approach for exploratory data analysis. Clustering analysis is unsupervised learning—it is done without suggestion from an external supervisor; classes and training examples are not given a priori. The objective of clustering is to group data points into “meaningful” subsets. There is no agreed upon definition of the clustering problem, and various definitions appear in the literature. For example, clustering has been defined as a search for some “natural” or “inherent” grouping of the data. However, most clustering algorithms do not address this problem. The vast majority of clustering algorithms produce as their output either a dendogram or a partition into a number of clusters, where the number of clusters is either the input, or there is some other parameter(s) that controls the number of clusters. In either case, a model selection technique is required in order to choose the model parameter, or in the case of hierarchical algorithms, to determine which level of the dendogram represents the “inherent” structure of the data.
  • A few examples of applications of clustering include (1) analysis of microarray data, where co-expressed genes are found, and the assumption is that co-expression might be a sign of co-regulation; (2) in medical datasets (gene expression data, clinical data etc.), where patients are divided into categories; (3) in any set or set of measurements to detect trends or artifacts in the measurement protocol; and (4) in information retrieval to partition text according to categories.
  • Most clustering algorithms either produce a hierarchical partitioning of the data into smaller and smaller clusters, or produces a partition of a dataset into a number of clusters that depend on some input parameter (the number of clusters or some other parameter(s)). The question remains, however, of how to set the input parameter, or how to determine which level of the tree representation of the data to look at: Clustering algorithms are unsophisticated in that they provide no insight into the level of granularity at which the “meaningful” clusters might be found. Occasionally, there may be prior knowledge about the domain that facilitates making such a choice. However, even in such cases, a method for determining the granularity at which to look at the data is required. This is seen as the problem of finding the optimal number of clusters in the data, relative to some clustering algorithm.
  • E. Levine and E. Domany in “Resampling Method for Unsupervised Estimation of Cluster Validity”, Neural Comp. 13, 2573-2593 (2001), assign a figure of merit to a clustering solution according to its similarity to clusterings of subsamples of the data. The “temperature” parameter of their clustering algorithm is selected according to a maximum of the similarity measure. However, in real data, such a maximum does not often occur. Other model selection techniques have difficulty detecting the absence of structure in the data, i.e., that there is a single cluster. Further, many of algorithms make assumptions as to cluster shape, and do not perform well on real data, where the cluster shape is generally not known. Accordingly, other methods for clustering are needed. The present invention is directed to such a method.
  • SUMMARY OF THE INVENTION
  • In an exemplary embodiment, a system and method are provided for analysis of data by grouping the input data points into meaningful subsets of data. The exemplary system comprises a storage device for storing at least one data set, and a processor for executing the clustering algorithm. In the preferred embodiments, the clustering algorithm is a k-means or hierarchical clustering algorithm.
  • In the method of the present invention, a “good granularity level”, i.e., an optimal number of clusters, is defined as one at which a clustering solution is stable with respect to some perturbation of the data, such as noise or sub-sampling. For each level of granularity, the algorithm chooses a number of pairs of sub-samples or other perturbations, clusters each sub-sample with the chosen level of granularity, then computes a similarity between pairs of clustering solutions. For each level of granularity, the distribution of the similarity between pairs of clustering solutions is computed, then the highest level of granularity for which highly similar solutions are obtained under perturbation is chosen.
  • According to the present invention, the probability distribution of the similarity measure is evaluated in order to find the “true” number of clusters based on a similarity measure between clustering solutions. A dot product between pairs of clustering solutions is normalized to provide a similarity measure. Other similarity measures proposed in the literature can also be expressed in terms of this dot product.
  • In one embodiment, the inventive method provides means for extraction of information from gene expression profiles, in particular, extraction of the most relevant clusters of temporal expression profiles from tens of thousands of measured profiles. In one variation of the present embodiment, the clustering algorithm is the k-means algorithm. Other embodiments include all pairwise clustering methods (e.g., hierarchical clustering) and support vector clustering (i.e., a support vector machine (SVM)). In one variation, the fit involves affine transformations such as translation, rotation, and scale. Other transformations could be included, such as elastic transformations. In the case of k-means, the algorithm is applied in a hierarchical manner by sub-sampling the data and clustering the sub-samples. The resulting cluster centers for all the runs are then clustered again. The resulting cluster centers are considered to be the most significant profiles. To facilitate the clustering task, a ranking of the genes is first performed according to a given quality criterion combining saliency (significant difference in expression in the profile), smoothness, and reliability (low noise level). Other criteria for ranking include the local density of examples.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • FIGS. 1 a-1 c are a plot of data which is a mixture of four Gaussians, a histogram of the correlation score for the Gaussian data mixture, and a plot showing the overlay of cumulative distributions of the correlation score, respectively.
  • FIGS. 2 a and 2 b are histogram of the correlation score for the yeast gene expression data and an overlay of the cumulative distributions of the correlation score, respectively.
  • FIGS. 3 a and 3 b are a histogram of the correlation score for 208 points uniformly distributed on the unit cube and an overlay of the cumulative distributions of the correlation score, respectively.
  • FIGS. 4 a-4 c are a plot of the first two principle components, the histogram of the correlation score, and an overlay of the cumulative distributions of the correlation score; k=2 (rightmost) to k−10 (leftmost), respectively.
  • FIGS. 5 a-5 b are graphic representations of gene pre-selection.
  • FIGS. 6 a-6 b are graphs of the average profiles of the clusters of the eight clustering runs and their grouping into meta-clusters.
  • FIGS. 7 a-7 d are graphs of curve fitting with affine transformations.
  • FIGS. 8 a-8 h are graphs of curve fitting with affine transformations, clustering obtained from the top 800 best quality genes.
  • DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
  • A typical computer system for running the inventive clustering algorithm is a Pentium®-class processor with an interface for inputting the dataset(s) of interest and a memory for storing the dataset. Alternative processors include configurable logic processors made up of SRAM-based field programmable gate arrays (FPGAs) or configurable system on chip (CSOC) architectures, which include a processor and an array of configurable logic cells on a single chip, either of which can be used to accelerate computer-intensive operations such as clustering algorithms on large datasets. In addition, parallel implementations in multi-computers have been used for executing clustering algorithms for extracting knowledge from large-scale data repositories. Selection of an appropriate computer system for executing the inventive clustering method is within the level of skill in the art.
  • The clustering model selection algorithm works with the help of a scoring function that provides a similarity measure between two labelings.
  • Let X={x1, . . . , xn}, and x1εRd be the dataset to be clustered. A labeling
    Figure US20130297607A1-20131107-P00001
    is a partition of X into k subsets S1, . . . , Sk. It can be represented by a function c:X→{1, . . . , k} where c(xi) is the cluster to which xi belongs.
  • A less compact representation of a labeling, which may be useful is the following representation by a matrix C with components:
  • C ij = { 1 if x i and x j belong to the same cluster , 0 otherwise ( 1 )
  • Note that this representation is label independent, i.e. there is no need to assign each point a label in {1, . . . , k}. This may be viewed as an adjacency matrix, where each cluster is a connected component of the graph. This representation can also be converted into a representation of soft cluster labeling.
  • Let labelings
    Figure US20130297607A1-20131107-P00002
    and
    Figure US20130297607A1-20131107-P00003
    have matrix representations C(1) and C(2) respectively, so that
  • ( L 1 , L 2 ) = Σ i , j C ij ( 1 ) C ij ( 2 ) ( 2 )
  • This dot product computes the number of pairs of vectors clustered together, and can also be interpreted as the number of common edges in graphs represented by C(1) and C(2). A naïve method for computing the dot product by going over all pairs of points has complexity O(n2). However, it can be computed in linear time.
  • As a dot product,
    Figure US20130297607A1-20131107-P00004
    ,
    Figure US20130297607A1-20131107-P00005
    satisfies the Cauchy-Schwartz inequality:
    Figure US20130297607A1-20131107-P00006
    ,
    Figure US20130297607A1-20131107-P00001
    2
    Figure US20130297607A1-20131107-P00007
    Figure US20130297607A1-20131107-P00008
    . The correlation score, which is a normalized version of the dot product, is:
  • cor ( L 1 , L 2 ) = L 1 , L 2 L 1 , L 1 L 2 , L 2 ( 3 )
  • A. K. Jain and R. C. Dubes, Algorithms for clustering data (Prentice Hall, Englewood Cliffs, N.J., 1988) provide a number of scores for comparing a labeling produced by a clustering algorithm with a “gold standard” labeling. This technique is known as “external” validation. In contrast, the present invention proposes the use of scoring functions for “internal” validation that do not require the “gold standard”. In the following, it is shown that two commonly used scoring functions can be expressed in terms of the dot product defined above in Equation 2. Given two matrices C(1), C(2) with 0-1 entries, let Nij i,jε{0, 1} be the number of entries on which C(1) and C(2) have values i and j, respectively. Define the matching coefficient as the fraction of entries on which the two matrices agree:
  • M ( L 1 , L 2 ) = N 00 + N 11 N 00 + N 01 + N 10 + N 11 . ( 4 )
  • The Jaccard coefficient is the corresponding ratio when “negative” matches are ignored:
  • J ( L 1 , L 2 ) = N 11 N 01 + N 10 + N 11 . ( 5 )
  • The Jaccard coefficient is more appropriate when the clusters are relatively small, since in that case, the N00 term will be the dominant factor even if the solution is far from the true one. These scores can be expressed in terms of the labeling dot product and the associated norm according to the following proposition:
  • Let C(1) and C(2) be the matrix representations of labelings
    Figure US20130297607A1-20131107-P00002
    and
    Figure US20130297607A1-20131107-P00003
    respectively. Then:
  • J ( L 1 , L 2 ) = < C ( 1 ) , C ( 2 ) > < C ( 1 ) , C ( 1 ) > + < C ( 2 ) , C ( 2 ) > - < C ( 1 ) , C ( 2 ) > M ( L 1 , L 2 ) = 1 - 1 n 2 || C ( 1 ) - C ( 2 ) || 2
  • This is a result of the observation that N11=
    Figure US20130297607A1-20131107-P00009
    C(1), C(2)
    Figure US20130297607A1-20131107-P00010
    , N01=
    Figure US20130297607A1-20131107-P00009
    1n−C(1), C(2)
    Figure US20130297607A1-20131107-P00010
    , N10=
    Figure US20130297607A1-20131107-P00009
    C(1), 1n−C(2)
    Figure US20130297607A1-20131107-P00010
    , N00=
    Figure US20130297607A1-20131107-P00009
    1n−C(1), C(2)
    Figure US20130297607A1-20131107-P00010
    , where 1n is an n×n matrix with entries equal to 1. The above expression for the Jaccard coefficient shows that it is close to the correlation score.
  • The following list provides the routine for the model explorer algorithm according to the present invention:
  • Input: A dataset D, kmin, kmax
    Require: A clustering algorithm, cluster (D, k)
      A scoring function for label comparison, s( 
    Figure US20130297607A1-20131107-P00011
    1, 
    Figure US20130297607A1-20131107-P00012
    2)
      f = 0.8
      for k = kmin to kmax do
       for i = 1 to maximum iterations do
        sub1 = subsamp(D, f) {a sub-sample with fraction f of the data}
        sub2 = subsamp(D, f)
        
    Figure US20130297607A1-20131107-P00011
     = cluster (sub1, k)
        
    Figure US20130297607A1-20131107-P00012
     = cluster (sub2, k)
        Intersect = sub1 ∩ sub2
        Score (i, k) = s( 
    Figure US20130297607A1-20131107-P00011
     (Intersect), 
    Figure US20130297607A1-20131107-P00012
     (Intersect)){Compute the
        score on the intersection of the two labels}
       end for
      end for
  • When one looks at a cloud of data points, and at a sub-sample of it for a sampling ratio, f (fraction of points sampled) is not much smaller than 1 (say f>0.5), one usually observes the same general structure. Thus, it is also reasonable to postulate that a clustering algorithm has captured the inherent structure in a dataset if clustering solutions over different subsamples are similar, e.g., according to one of the similarity measures introduced in the previous section. Thus, “inherent structure” is defined as structure that is stable under sub-sampling (or alternatively, perturbing the data). Given a clustering algorithm and a dataset, this translates into a search for the best number of clusters to use in the particular instance. Note that one can also extend the search to look for a set of features where structure is apparent, however, in this case, the same set of features is kept.
  • The inventive clustering algorithm receives as input a dataset (or similarity/dissimilarity matrix) and a parameter k that controls either directly or indirectly the number of clusters that the algorithm produces. This convention is applicable to hierarchical clustering algorithms as well: given k, the tree is cut so that k clusters are produced. Next, characterize the stability of the clustering for each k. This is accomplished by producing a set of clusterings of sub-samples of the data, and comparing the labels of the intersection of pairs of clusterings using, for example, the correlation similarity measure. This is performed for increasing values of k (see above for details). The distribution of the scores for the different values of k is then compared (see, e.g., FIGS. 1 a-1 c). The idea is that when there is structure in the data that is well described by the clustering algorithm with that particular value of k, many sub-samples will produce similar clusterings, and their pairwise similarity score will be concentrated close to 1.
  • Each sub-sample contains a fixed fraction of the data, f. The actual subsampling can be implemented in various ways:
  • 1. Select each sample independently so that the size of the intersection between two samples is random.
  • 2. Select together pairs of samples by first selecting their intersection, then selecting the rest of the data to complete the fraction f.
  • 3. Fix one clustering solution to be one produced on the whole dataset. (This third option was used by Levine and Domany (supra) to give a figure of merit to a particular clustering solution.)
  • For k=1, all clusterings are the same. This also holds for k=n, where n is the number of data points; in this case every point is in a different cluster. When the number of clusters becomes large so that there is a small number of points in each cluster, the solution becomes stable. The value of f should not be too low so that there not all clusters are represented in a sub-sample. In the Examples provided below, the shape of the distribution did not depend very much on the specific value of f. Any value between 0.6 and 0.9 worked well.
  • EXAMPLES
  • In this section experiments on artificial and real data are described. In all the experiments the distribution of the correlation score is shown. Equivalent results were obtained using other scores as well. The parameter values f=0.8 and 200 pairs of solutions were compared for each k. A hierarchical clustering algorithm was used, with the Ward criterion for merging clusters (see, e.g., Jain and Dubes, supra). Similar results were obtained using other hierarchical clustering methods (complete and average linkage). The advantage of using hierarchical clustering methods is that the same set of clusterings can be used for all values of k.
  • Example 1 Gaussian Data
  • Referring first to FIGS. 1 a-1 c, FIG. 1 a shows a mixture of four Gaussians. The histograms of the score for varying values of k for this data is plotted in FIG. 1 b. Histograms are shown for each value of k in the range of 2 to 7. Observations regarding the histograms are that at k=2, there is a peak at 1, since almost all the runs discriminated between the two upper and two lower clusters. At k=3, most runs separated the two lower clusters, and at k=4 most runs found the “correct” clustering as is reflected in the distribution of scores that is still close to 1.0. At k>4 there is no longer essentially one preferred solution. There is, in fact, a wide variety of solutions, evidenced by the widening spectrum of the similarities. FIG. 1 c plots the cumulative distributions of the correlation score for each k, where k=2 at the rightmost side of the plot (at peak 1), and k=7 being the leftmost curve.
  • Example 2 DNA Microarray Data
  • The next dataset considered was the yeast DNA microarray data of M. Eisen et. al. (“Genetics cluster analysis and display of genome-wide expression patterns”, Proc. Natl. Acad. Sci. USA, 95: 14863-14868, December 1998.). The data is a matrix which represents the mRNA expression levels of n genes across a number of experiments. Some of the genes in the data have known labels according to a functional class. Five functional classes were selected along with genes that belong uniquely to these five functional classes. This yielded a dataset with 208 genes, with 79 features (experiments). Data was normalized by subtracting the mean and dividing by the standard deviation for each column. This was also performed for the rows, and repeated for the columns. At this stage the first three principal components were extracted. The distribution and histogram of scores is given in FIG. 2 a for k over the range of 2 to 7. The same behavior is observed as seen in the mixture of four Gaussians data of FIG. 1 a. Between k=5 and k=6, there is a transition from a distribution that has a large component near 1, to a wide distribution that is very similar to the distribution on the random data. The clustering solution that was obtained for k=5 agreed well with the given labels, with a correlation score of 0.95.
  • Example 3 Uniformly Distributed Data
  • The results of a test run on data uniformly distributed on the unit cube is shown in FIGS. 3 a and 3 b. The distributions are quite similar to each other, with no change that can be interpreted as a transformation from a stable set of solutions to unstable solutions
  • The preceding examples indicate a simple way for choosing k as the value where there is a transition from a score distribution that is concentrated near 1 to a wider distribution. This can be quantified, e.g., by an increase in the area under the cumulative distribution function or by an increase in

  • S(K)=P(s>0.90).
  • The value of 0.9 is arbitrary, but any value close to 1 would work on the set of examples considered here.
  • Example 4 Isolet Letter Pronunciation
  • The next test was run on a portion of the ISOLET (Isolated Letter Speech Recognition) database created by Ron Cole and Mark Fanty of the Department of Computer Science and Engineering, Oregon Graduate Institute, Beaverton, Oreg. and available from the UCI (University of California at Irvine) Repository of Machine Learning Databases. (This data set was generated by having 150 subjects speak the name of each letter of the alphabet twice, generating 52 training examples from each speaker.) This test provides an example of what occurs with there is cluster overlap.
  • One thousand points of the original dataset, representing the letters “a”, “c”, “d”, “e”, “f” were used. The columns of the data were “whitened”, then the first three principle components were extracted. FIG. 4 a is a plot of the distribution of the first two principle components for each of the selected letters. FIG. 4 b provides histograms of the correlation score for each k in the range of 2 to 7, and FIG. 4 c is an overlay of the cumulative distribution of the correlation score with k=2 towards the far right side of the plot (near 1) and k=10 towards the left hand side of the plot.
  • Table 1 provides a comparison of the number of clusters identified using the inventive method against the number of clusters obtained using other methods for selecting k. These other methods are described by R. Tibshirani, G. Walther, and T. Hastie in “Estimating the number of clusters in a dataset via the Gap statistic”, Tech. Report, Department of Statistics, Stanford University, 2000; also published in JRSSB 2000, where the Gap Statistic methods is shown to be superior to a number of other methods.
  • TABLE 1
    Jain Silhouette KL Gap Subsamp
    4 Gauss 4 4 9 4 4
    (FIG. 1)
    Microarray 3 2 4 6 5
    (FIG. 2)
    Isolet 2 5 5 3
    (FIG. 4)
  • In the case of clustering algorithms whose output depends on the initial condition, e.g., k-means, a distribution of scores exists even when considering a fixed sub-sample. In such cases, the method produces an indication of how varied the solutions can be for various values of k. It has been observed that for a “good” value of k, a similar transition occurs, but is generally “smeared”, since k-means produces widely varying solutions. To address this, a version of k-means similar to that presented by P. Bradley and U. Fayyad in “Refining initial points for k-means clustering” (in J. Shavlik, editor, Proc. of the 15th Inter. Conf. of Machine Learning (ICML '98), pages 91-99, San Francisco, Calif., 1998. Morgan Kaufmann) was used. That method produces initial conditions that converge to near optimal solutions, and use a fixed initial condition for each value of k. According to the present invention, k-means clustering produces solutions that are highly stable with respect to sub-sampling. This good result may be due to the global optimization criterion, which differs from the local bottom up approach of hierarchical clustering that appears to be less stable to sub-sampling.
  • Example 5 Gene Expression Data
  • A data set of 1600 genes with 12 time steps was utilized to illustrate the process undergone by gene expression profiles.
  • First, the genes were ranked in order of “quality” to pre-select a subset for further analysis. All the genes were ranked according to three criteria: (1) saliency (the absolute difference between their min and max value; the larger the better); (2) smoothness (a coefficient assessing the smoothness of the profile was computed; the smoother the better); and (3) reliability (the average standard deviation of the experiment replicated for the whole profile).
  • The ranks of the genes according to these three criteria were then added to form a combined criterion according to which the genes were ranked again. The result can be seen in FIGS. 5 a and 5 b, which show gene expression temporal profiles. The ten best genes are depicted in FIG. 5 a, while the ten worst genes are shown in FIG. 5 b according to a combined criterion of saliency, smoothness, and reliability.
  • The 5% top quality genes according to the above defined combined criterion (800 genes) were selected. A kernel clustering algorithm based on k-means was then run on random subsets of 100 genes among these 800 genes. A maximum number of clusters of ten were used, but only five did not degenerate. The stability of the solutions was verified by running again kernel k-means on the resulting cluster centers. The solution was robust with respect to increasing the number of genes (doubling, to 1600 genes), changing the subset size (to 200 genes) and the maximum number of cluster (to 20 genes).
  • FIGS. 6 a and 6 b illustrate the average profiles of the clusters of the eight clustering runs and their groupings into meta-clusters 1-5. The cluster centers in FIG. 6 a were obtained with multiple runs of k-means using random subsets of 100 genes in the top 800 best quality genes. FIG. 6 b shows the average cluster centers for the five clusters. Only a subset of the nine possible profiles that could occur is represented.
  • According to the present invention, the clustering algorithm is based on, but is a variation of, the classical k-means algorithm. The algorithm operates by the following routine:
      • Initialize: Start with a random assignment of class labels to the patterns.
      • Step 1: Compute cluster centers by averaging class members in each class.
      • Step 2: Re-assign patterns to the cluster with nearest cluster center.
      • Iterate step 1 and 2 until the assignment of patterns to classes remains constant.
  • This algorithm differs from the classical k-means in that it uses a special metric to measure proximity in step 2. It is based on the residuals of a fit of one profile onto another, using an affine transformation that is a combination of translation, scaling, and rotation (to first order). Such fit is remarkably simple to implement with a couple of lines of MATLAB® (The MathWorks, Inc., Natick, Mass.) and is quite fast. For example, to fit a profile (a 12 dimensional vector x2) that goes through zero between time step 5 and 6, onto another vector x1, one can write:
  • xx2=[x2,ones(12,1),[−5;−4;−3;−2;−1;1;2;3;4;5;6;7]];
  • w=xx2\x1;
  • x2fit=xx2*w;
  • residual=mean((x1−x2fit).̂2);
  • FIGS. 7 a-d depict two profiles and variations on how they can be fit one onto the other or both on their average. This provides an illustration of curve fitting with affine transformations. FIG. 7 a shows the two original profiles P1 and P2. FIG. 7 b shows the profile P1 fitted on P2. FIG. 7 c shows profile P2 fitted on P1. FIG. 7 d shows both profiles fitted to their average.
  • After some experimentation, the following measure of dissimilarity was adopted:
      • residual0+max(residual1,residual2),
        where residual0 is the squared Euclidean distance between x1 and x2, residual1 is the residual of the fit of x1 onto x2, and residual2 is the residual of the fit of x2 onto x1. The rationale behind this choice is that a dissimilarity simply based on the symmetric fit to the average is, in some cases, too optimistic—it becomes very easy with the type of affine transformations that are being allowed to fit any curve to a straight line (but not vice versa). Residual 0 is added to avoid allowing too large transformations. The same desirable properties could be achieved by solving an optimization problem under constraints to limit the range of transformations, but this would be computationally more expensive.
  • Any positive monotonic transformation of the dissimilarity does not affect the algorithm. It should also be noted that in Step 1 of the algorithm, a simple average of the patterns is used, as opposed to an average of the curves fitted to the cluster centers. After iterating, there is a significant distortion of the cluster center profiles, some of which just become straight lines.
  • FIGS. 8 a-h illustrate the results of one run of the algorithm on a subset of 100 gene expression profiles with a maximum cluster number of 10, where expression intensity is plotted with time. The clustering is obtained from the top 800 best quality genes. FIG. 8 a shows the original profiles with random class label assignments. FIGS. 8 c, 8 e and 8 g illustrate fitted profiles of cluster members at successive iterations. FIGS. 8 b, 8 d, 8 f, and 8 h show the cluster centers at successive iterations.
  • The present invention provides a method for model selection in clustering. Most other approaches in the prior art are based on either the sum squared distances within clusters, or some combination of within-cluster distances and between-cluster distances. Not only do these prior art methods impose the notion of how a cluster should look (e.g., “compact”), most of the methods performed poorly in practice. The inventive method, on the other hand, provides a description of the stability of a clustering method with respect to re-sampling or noise. It is believed that this stability captures the notion of “inherent” structure that is the goal in the validating clustering solutions.
  • Another advantage of present invention is that it is useful in identifying the absence of structure in the data. This differs from most clustering validation algorithms of the prior art, with the exception of the gap statistic, which can only determine the number of clusters if that number is larger than 1.

Claims (18)

1. A non-transitory machine-readable medium comprising a plurality of instructions that in response to being executed result in a computing system executing a process for unsupervised classification of data in a dataset according to pattern similarities, comprising:
a clustering function to randomly assign k class labels to the data and partition the dataset in k subsets, wherein k has a range with a minimum number of two;
a compare function to, for each value of k beginning with the minimum value of k for each pair of subsets, compute a correlation score on the intersection between the pair of subsets, wherein the correlation score comprises a similarity measure between the pair of subsets and the greatest similarity has the highest score; and
a histogram function to generate a distribution of the correlation scores for each value of k, wherein the distribution comprising the highest value of k that remains concentrated near the highest correlation score corresponds to an optimal granularity level for clustering of the data according to pattern similarities within the dataset, and to generate an output corresponding to clustered data for display or storage in a storage medium.
2. The non-transitory machine-readable medium of claim 1, wherein the clustering function comprises a k-means algorithm.
3. The non-transitory machine-readable medium of claim 1, wherein the clustering function comprises a hierarchical clustering algorithm.
4. The non-transitory machine-readable medium of claim 1, wherein the optimal granularity level comprises a number of clusters for which there is a transition from a distribution that is peaked near a correlation score of “1” to a distribution that corresponds to random data.
5. The non-transitory machine-readable medium of claim 1, wherein the similarity measure is obtained by measuring a residual of a fit of one cluster onto another cluster, wherein the residual fit comprises using a fit that is invariant with respect to affine transformations, wherein the affine transformations comprise a combination of translation, scaling and rotation.
6. The non-transitory machine-readable medium of claim 1, wherein the dataset comprises gene expression data and the pattern similarities comprise co-regulation patterns.
7. A non-transitory machine-readable medium comprising a plurality of instructions that in response to being executed result in a computing system separating classes within a dataset without supervision, wherein the computing system:
selects a plurality of granularity levels k, and for each granularity level k:
(a) induces perturbations in the dataset to generate a modified dataset;
(b) applies a clustering algorithm to the at least one modified dataset to produce k clusters under each of the perturbations;
(c) creates a data subset comprising the clusters identified in step (b);
(d) applies the clustering algorithm to the data subset using the same value of k clusters;
(e) determines the stability of the clusterings at each granularity level k by measuring dissimilarity between data in the data subset and a cluster center for the cluster into which the data was assigned;
measures fit of the data to the cluster centers for all k granularity levels, wherein the fit comprises using a fit that is invariant with respect to affine transformations, wherein the affine transformations comprise a combination of translation, scaling and rotation;
selects from among the plurality of granularity levels an optimum granularity level k corresponding to the best fit;
generates an output comprising the dataset clustered into a plurality of subsets corresponding to the optimal granularity level k; and
displays a graph showing the dataset clustered into the plurality of subsets, wherein the subsets correspond to classes.
8. The non-transitory machine-readable medium of claim 7, wherein the perturbations comprise a combination of one or more of sub-sampling the dataset, changing initialization of the clustering algorithm, and adding noise to the dataset.
9. The non-transitory machine-readable medium of claim 7, further comprising, prior to selecting a plurality of granularity levels, ranking the data according to one or more quality criteria and selecting a pre-determined fraction of top ranked data.
10. The non-transitory machine-readable medium of claim 7, wherein the clustering algorithm is a k-means algorithm.
11. The non-transitory machine-readable medium of claim 7, wherein the clustering algorithm if hierarchical clustering.
12. The non-transitory machine-readable medium of claim 7, wherein the optimal granularity level comprises a number of clusters for which there is a transition from a distribution that is peaked near a correlation score of “1” to a distribution that corresponds to random data.
13. The non-transitory machine-readable medium of claim 7, wherein the similarity is obtained by measuring a residual of a fit of one cluster onto another cluster, wherein the residual fit comprises using a fit that is invariant with respect to affine transformations, wherein the affine transformations comprise a combination of translation, scaling and rotation.
14. A non-transitory machine-readable medium comprising a plurality of instructions that in response to being executed result in a computing system determining without supervision an optimal number of classes within a dataset, wherein each class has pattern similarities, wherein the computing system:
selects a range of granularity levels k, the range having a minimum of two;
applies a clustering algorithm to the dataset for that k clusters are produced;
for each value of k, selects a plurality of subsets of the dataset, wherein each sub-sample comprises a fixed fraction of the dataset;
selects a plurality of pairs of subsets;
calculates a similarity between each pair of the plurality of pairs of subsets;
determines a distribution of the similarities within the plurality of pairs of subsets;
compares distributions of the similarities for all k granularity levels;
selects from among the plurality of granularity levels an optimal granularity level k whose distribution of the similarities has the highest correlation score among correlation scores for all distributions of similarities;
generates an output comprising the dataset clustered into an optimal number of subsets corresponding to the optimal k; and
displays a graph of the clustered dataset showing pattern similarities according to the optimal number of subsets.
15. The non-transitory machine-readable medium of claim 14, wherein the clustering algorithm is a k-means algorithm.
16. The non-transitory machine-readable medium of claim 14, wherein the clustering algorithm if hierarchical clustering.
17. The non-transitory machine-readable medium of claim 14, wherein the optimal granularity level comprises a number of clusters for which there is a transition from a distribution that is peaked near a correlation score of “1” to a distribution that corresponds to random data.
18. The non-transitory machine-readable medium of claim 14, wherein the similarity is obtained by measuring a residual of a fit of one cluster onto another cluster, wherein the residual fit comprises using a fit that is invariant with respect to affine transformations, wherein the affine transformations comprise a combination of translation, scaling and rotation.
US13/934,166 2001-05-18 2013-07-02 Identification of pattern similarities by unsupervised cluster analysis Abandoned US20130297607A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US13/934,166 US20130297607A1 (en) 2001-05-18 2013-07-02 Identification of pattern similarities by unsupervised cluster analysis

Applications Claiming Priority (7)

Application Number Priority Date Filing Date Title
US29222101P 2001-05-18 2001-05-18
US33599001P 2001-11-30 2001-11-30
US10/478,191 US20050071140A1 (en) 2001-05-18 2002-05-17 Model selection for cluster data analysis
PCT/US2002/015666 WO2002095533A2 (en) 2001-05-18 2002-05-17 Model selection for cluster data analysis
US11/929,522 US7890445B2 (en) 2001-05-18 2007-10-30 Model selection for cluster data analysis
US13/019,585 US8489531B2 (en) 2001-05-18 2011-02-02 Identification of co-regulation patterns by unsupervised cluster analysis of gene expression data
US13/934,166 US20130297607A1 (en) 2001-05-18 2013-07-02 Identification of pattern similarities by unsupervised cluster analysis

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US13/019,585 Continuation US8489531B2 (en) 2001-05-18 2011-02-02 Identification of co-regulation patterns by unsupervised cluster analysis of gene expression data

Publications (1)

Publication Number Publication Date
US20130297607A1 true US20130297607A1 (en) 2013-11-07

Family

ID=26967221

Family Applications (4)

Application Number Title Priority Date Filing Date
US10/478,191 Abandoned US20050071140A1 (en) 2001-05-18 2002-05-17 Model selection for cluster data analysis
US11/929,522 Expired - Fee Related US7890445B2 (en) 2001-05-18 2007-10-30 Model selection for cluster data analysis
US13/019,585 Expired - Fee Related US8489531B2 (en) 2001-05-18 2011-02-02 Identification of co-regulation patterns by unsupervised cluster analysis of gene expression data
US13/934,166 Abandoned US20130297607A1 (en) 2001-05-18 2013-07-02 Identification of pattern similarities by unsupervised cluster analysis

Family Applications Before (3)

Application Number Title Priority Date Filing Date
US10/478,191 Abandoned US20050071140A1 (en) 2001-05-18 2002-05-17 Model selection for cluster data analysis
US11/929,522 Expired - Fee Related US7890445B2 (en) 2001-05-18 2007-10-30 Model selection for cluster data analysis
US13/019,585 Expired - Fee Related US8489531B2 (en) 2001-05-18 2011-02-02 Identification of co-regulation patterns by unsupervised cluster analysis of gene expression data

Country Status (3)

Country Link
US (4) US20050071140A1 (en)
AU (1) AU2002259250A1 (en)
WO (1) WO2002095533A2 (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9202178B2 (en) 2014-03-11 2015-12-01 Sas Institute Inc. Computerized cluster analysis framework for decorrelated cluster identification in datasets
US9424337B2 (en) 2013-07-09 2016-08-23 Sas Institute Inc. Number of clusters estimation
US20160299961A1 (en) * 2014-02-04 2016-10-13 David Allen Olsen System and method for grouping segments of data sequences into clusters
CN114492572A (en) * 2021-12-21 2022-05-13 成都产品质量检验研究院有限责任公司 Material structure classification method and system based on machine learning clustering algorithm
US11553872B2 (en) 2018-12-04 2023-01-17 L'oreal Automatic image-based skin diagnostics using deep learning

Families Citing this family (32)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7747624B2 (en) * 2002-05-10 2010-06-29 Oracle International Corporation Data summarization
KR100597089B1 (en) * 2003-12-13 2006-07-05 한국전자통신연구원 Method for identifying of relevant groups of genes using gene expression profiles
US20090112533A1 (en) * 2007-10-31 2009-04-30 Caterpillar Inc. Method for simplifying a mathematical model by clustering data
US9881415B2 (en) 2008-06-19 2018-01-30 Robert Andrew Palais Generating point clouds
US20100146518A1 (en) 2008-12-08 2010-06-10 International Business Machines Corporation All-To-All Comparisons on Architectures Having Limited Storage Space
US8099453B2 (en) * 2009-01-22 2012-01-17 Hewlett-Packard Development Company, L.P. System and method for data clustering
US8571333B2 (en) * 2010-10-17 2013-10-29 Canon Kabushiki Kaisha Data clustering
US10853828B2 (en) * 2011-02-28 2020-12-01 Flytxt B.V Methods and systems for providing multivariate time series clustering for customer segmentation
CA2832918C (en) * 2011-06-22 2016-05-10 Rogers Communications Inc. Systems and methods for ranking document clusters
CN102521202B (en) * 2011-11-18 2017-05-24 东南大学 Automatic discovery method of complex system oriented MAXQ task graph structure
US20140222817A1 (en) * 2013-02-04 2014-08-07 David Allen Olsen System and method for grouping segments of data sequences into clusters
CN103150372B (en) * 2013-03-08 2017-01-04 江苏唯实科技有限公司 The clustering method of magnanimity higher-dimension voice data based on centre indexing
US10599953B2 (en) 2014-08-27 2020-03-24 Verint Americas Inc. Method and system for generating and correcting classification models
US9183285B1 (en) * 2014-08-27 2015-11-10 Next It Corporation Data clustering system and methods
US10210246B2 (en) * 2014-09-26 2019-02-19 Oracle International Corporation Techniques for similarity analysis and data enrichment using knowledge sources
US10296192B2 (en) 2014-09-26 2019-05-21 Oracle International Corporation Dynamic visual profiling and visualization of high volume datasets and real-time smart sampling and statistical profiling of extremely large datasets
US10891272B2 (en) 2014-09-26 2021-01-12 Oracle International Corporation Declarative language and visualization system for recommended data transformations and repairs
EP3361416A1 (en) 2017-02-09 2018-08-15 Tata Consultancy Services Limited Statistical modeling techniques based neural network models for generating intelligence reports
US10902346B2 (en) * 2017-03-28 2021-01-26 International Business Machines Corporation Efficient semi-supervised concept organization accelerated via an inequality process
US11185780B2 (en) 2017-04-24 2021-11-30 International Business Machines Corporation Artificial intelligence profiling
US10810472B2 (en) 2017-05-26 2020-10-20 Oracle International Corporation Techniques for sentiment analysis of data using a convolutional neural network and a co-occurrence network
US10885056B2 (en) 2017-09-29 2021-01-05 Oracle International Corporation Data standardization techniques
US10936599B2 (en) 2017-09-29 2021-03-02 Oracle International Corporation Adaptive recommendations
US11216501B2 (en) 2018-09-07 2022-01-04 Walmart Apollo, Llc Method and apparatus to more quickly classify additional text entries
CN112789636A (en) * 2018-10-26 2021-05-11 索尼公司 Information processing apparatus, information processing method, and program
CN111126419B (en) * 2018-10-30 2023-12-01 顺丰科技有限公司 Dot clustering method and device
US11423249B2 (en) * 2018-12-03 2022-08-23 Bank Of America Corporation Computer architecture for identifying data clusters using unsupervised machine learning in a correlithm object processing system
CN110728293B (en) * 2019-08-30 2021-10-29 浙江工业大学 Hierarchical clustering method for tourist heading data
US11762819B2 (en) 2019-10-15 2023-09-19 Target Brands, Inc. Clustering model analysis for big data environments
US11757919B2 (en) * 2020-04-20 2023-09-12 Kovrr Risk Modeling Ltd. System and method for catastrophic event modeling
CN113029979B (en) * 2021-02-10 2023-03-24 河南中烟工业有限责任公司 Method for testing quality stability of cigarette paper
CN116522183B (en) * 2023-02-01 2023-09-15 华智生物技术有限公司 Sub-class center point feature extraction method, system and equipment based on seed clustering

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020054694A1 (en) * 1999-03-26 2002-05-09 George J. Vachtsevanos Method and apparatus for analyzing an image to direct and identify patterns
US6763069B1 (en) * 2000-07-06 2004-07-13 Mitsubishi Electric Research Laboratories, Inc Extraction of high-level features from low-level features of multimedia content

Family Cites Families (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1994022420A1 (en) * 1993-04-05 1994-10-13 Quest International B.V. Silicone based skin care products
US5649068A (en) * 1993-07-27 1997-07-15 Lucent Technologies Inc. Pattern recognition system using support vectors
US5572628A (en) * 1994-09-16 1996-11-05 Lucent Technologies Inc. Training system for neural networks
US5950146A (en) 1996-10-04 1999-09-07 At & T Corp. Support vector method for function estimation
US6157921A (en) * 1998-05-01 2000-12-05 Barnhill Technologies, Llc Enhancing knowledge discovery using support vector machines in a distributed network environment
US6267722B1 (en) * 1998-02-03 2001-07-31 Adeza Biomedical Corporation Point of care diagnostic systems
US6658395B1 (en) * 1998-05-01 2003-12-02 Biowulf Technologies, L.L.C. Enhancing knowledge discovery from multiple data sets using multiple support vector machines
US6789069B1 (en) * 1998-05-01 2004-09-07 Biowulf Technologies Llc Method for enhancing knowledge discovered from biological data using a learning machine
US6760715B1 (en) * 1998-05-01 2004-07-06 Barnhill Technologies Llc Enhancing biological knowledge discovery using multiples support vector machines
US6263287B1 (en) * 1998-11-12 2001-07-17 Scios Inc. Systems for the analysis of gene expression data
US6941321B2 (en) * 1999-01-26 2005-09-06 Xerox Corporation System and method for identifying similarities among objects in a collection
US6714925B1 (en) * 1999-05-01 2004-03-30 Barnhill Technologies, Llc System for identifying patterns in biological data using a distributed network
JP4320112B2 (en) * 2000-03-27 2009-08-26 日立ソフトウエアエンジニアリング株式会社 Gene experiment data display method
US6944602B2 (en) * 2001-03-01 2005-09-13 Health Discovery Corporation Spectral kernels for learning machines

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020054694A1 (en) * 1999-03-26 2002-05-09 George J. Vachtsevanos Method and apparatus for analyzing an image to direct and identify patterns
US6763069B1 (en) * 2000-07-06 2004-07-13 Mitsubishi Electric Research Laboratories, Inc Extraction of high-level features from low-level features of multimedia content

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9424337B2 (en) 2013-07-09 2016-08-23 Sas Institute Inc. Number of clusters estimation
US20160299961A1 (en) * 2014-02-04 2016-10-13 David Allen Olsen System and method for grouping segments of data sequences into clusters
US9202178B2 (en) 2014-03-11 2015-12-01 Sas Institute Inc. Computerized cluster analysis framework for decorrelated cluster identification in datasets
US11553872B2 (en) 2018-12-04 2023-01-17 L'oreal Automatic image-based skin diagnostics using deep learning
US11832958B2 (en) 2018-12-04 2023-12-05 L'oreal Automatic image-based skin diagnostics using deep learning
CN114492572A (en) * 2021-12-21 2022-05-13 成都产品质量检验研究院有限责任公司 Material structure classification method and system based on machine learning clustering algorithm

Also Published As

Publication number Publication date
US20050071140A1 (en) 2005-03-31
AU2002259250A1 (en) 2002-12-03
WO2002095533A3 (en) 2003-04-10
US7890445B2 (en) 2011-02-15
US20110125683A1 (en) 2011-05-26
US8489531B2 (en) 2013-07-16
US20080140592A1 (en) 2008-06-12
WO2002095533A2 (en) 2002-11-28

Similar Documents

Publication Publication Date Title
US8489531B2 (en) Identification of co-regulation patterns by unsupervised cluster analysis of gene expression data
Van der Laan et al. A new algorithm for hybrid hierarchical clustering with visualization and the bootstrap
Erisoglu et al. A new algorithm for initial cluster centers in k-means algorithm
Genolini et al. KmL: A package to cluster longitudinal data
US10546245B2 (en) Methods for mapping data into lower dimensions
Unnikrishnan et al. Toward objective evaluation of image segmentation algorithms
Kumar et al. Visual approaches for exploratory data analysis: A survey of the visual assessment of clustering tendency (vat) family of algorithms
Reza et al. ICA and PCA integrated feature extraction for classification
Cord et al. Feature selection in robust clustering based on Laplace mixture
Bryan Problems in gene clustering based on gene expression data
CN110795564A (en) Text classification method lacking negative cases
Kavitha et al. PCA-based gene selection for cancer classification
Li et al. A novel unsupervised feature selection method for bioinformatics data sets through feature clustering
CN114003636A (en) Multivariate time sequence similarity searching method based on variable correlation
Vengatesan et al. The performance analysis of microarray data using occurrence clustering
CN111797267A (en) Medical image retrieval method and system, electronic device and storage medium
Gan et al. Clustering by propagating probabilities between data points
Mehri et al. A pixel labeling approach for historical digitized books
US20230352119A1 (en) Method and system for subsampling of cells from single-cell genomics dataset
KR20100001177A (en) Gene selection algorithm using principal component analysis
Priscilla et al. A semi-supervised hierarchical approach: Two-dimensional clustering of microarray gene expression data
Ragunthar et al. Classification of gene expression data with optimized feature selection
Liu et al. Mining gene expression data
Nadif et al. Algorithms for Model-based Block Gaussian Clustering.
Costa et al. A symbolic approach to gene expression time series analysis

Legal Events

Date Code Title Description
STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO PAY ISSUE FEE