|Publication number||US20050210027 A1|
|Publication type||Application|
|Application number||US 10/801,420|
|Publication date||22 Sep 2005|
|Filing date||16 Mar 2004|
|Priority date||16 Mar 2004|
|Also published as||US7970772, US20070226212|
|Inventors||Charu Aggarwal, Philip Yu|
|Original assignee||International Business Machines Corporation|
The present invention is related to techniques for clustering a data stream and, more particularly, techniques for monitoring data abnormalities in the stream through the clustering of the data stream.
In general, large volumes of continuously evolving data, which may be stored, are referred to as a data stream. Data streams have received increased attention in recent years due to technological innovations, which have facilitated the creation, maintenance and storage of such data. A number of data mining studies have been conducted in the data stream context in recent years, see, e.g., C. C. Aggarwal, “A Framework for Diagnosing Changes in Evolving Data Streams,” ACM SIGMOD Conference, 2003; B. Babcock et al., “Models and Issues in Data Stream Systems,” ACM PODS Conference, 2002; P. Domingos et al., “Mining High-Speed Data Streams,” ACM SIGKDD Conference, 2000; S. Guha et al., “ROCK: A Robust Clustering Algorithm for Categorical Attributes,” Proceedings of the International Conference on Data Engineering, 1999; and L. O'Callaghan et al., “Streaming-Data Algorithms for High-Quality Clustering,” ICDE Conference, 2002.
Clustering is the partitioning of a given set of objects, such as data points, into one or more groups (clusters) of similar objects. The similarity of a data point with another data point is typically defined by a distance measure or objective function. In addition, data points that do not naturally fit into any particular cluster are referred to as outliers. Clustering has been widely studied by those in the database and data mining communities because of its applicability to a wide range of problems, see, e.g., P. Bradley et al., “Scaling Clustering Algorithms to Large Databases,” SIGKDD Conference, 1998; S. Guha et al., “CURE: An Efficient Clustering Algorithm for Large Databases,” ACM SIGMOD Conference, 1998; R. Ng et al., “Efficient and Effective Clustering Methods for Spatial Data Mining,” Very Large Data Bases Conference, 1994; A. Jain et al., “Algorithms for Clustering Data,” Prentice Hall, N.J., 1998; L. Kaufman et al., “Finding Groups in Data—An Introduction to Cluster Analysis,” Wiley Series in Probability and Math Sciences, 1990; E. Knorr et al., “Algorithms for Mining Distance-Based Outliers in Large Data Sets,” Proceedings of the VLDB Conference, September, 1998; E. Knorr et al., “Finding Intensional Knowledge of Distance-Based Outliers,” Proceedings of the VLDB Conference, September, 1999; S. Ramaswamy et al., “Efficient Algorithms for Mining Outliers from Large Data Sets,” Proceedings of the ACM SIGMOD Conference, 2000; and T. Zhang et al., “BIRCH: An Efficient Data Clustering Method for Very Large Databases,” ACM SIGMOD Conference, 1996.
The problem of categorical data clustering has also been recently studied, see, e.g., V. Ganti et al., “CACTUS-Clustering Categorical Data Using Summaries,” Proceedings of the ACM SIGKDD Conference, 1999; D. Gibson et al., “Clustering Categorical Data: An Approach Based on Dynamical Systems,” Proceedings of the VLDB Conference, 1998; and S. Guha et al., “ROCK: A Robust Clustering Algorithm for Categorical Attributes,” Proceedings of the International Conference on Data Engineering, 1999. However, these techniques cannot be utilized for clustering data streams, since they do not naturally scale well with increasing data size. Furthermore, a data stream clustering technique requires the appropriate mechanisms to deal with the temporal issues created by the evolution of the data stream.
Clustering and outlier monitoring present a number of unique challenges in an evolving data stream environment. For example, the continuous evolution of clusters makes it essential to quickly identify new patterns in the data. In addition, it is also important to provide end users with the ability to analyze the clusters in an offline fashion.
In the data stream environment, outlier and abnormality monitoring is especially problematic, since the temporal component of the data stream influences whether an outlier is defined as an abnormality. For example, the first arriving data point of a cluster may be considered an outlier at the moment of its arrival. However, as time passes, data points may join the newly created cluster, thereby initiating a new pattern of activity resulting from the evolution of the data stream. On the other hand, in many other cases, data points may not join the outlier or newly created cluster over time, thereby defining an abnormality. An important aspect of the data stream clustering process is the ability to identify and label such events effectively.
The present invention provides techniques for clustering a data stream and, more particularly, techniques for monitoring data abnormalities in the stream through the clustering of the data stream.
For example, in one aspect of the invention, a technique for monitoring abnormalities in a data stream comprises the following steps. A plurality of objects are received from the data stream, and one or more clusters are created from the plurality of objects. At least a portion of the one or more clusters have statistical data of the respective cluster. It is determined from the statistical data whether one or more abnormalities exist in the data stream.
Thus, a framework may be provided in which select statistical data may be stored at regular intervals. This results in a technique which is able to analyze different characteristics of the clusters in an effective manner. Advantageously, the inventive techniques may be useful for clustering different kinds of categorical data sets, and adapting to the rapidly evolving nature of a data stream.
Additional advantages of the inventive techniques of the present invention include the ability to explore the clusters in an online fashion, and store statistical data which may be utilized for a better understanding and analysis of the data stream. In applications in which the data stream evolves considerably, different kinds of clusters may assist in understanding the behavior of the data stream over different periods in time. This is advantageous since a fast data stream cannot be repeatedly processed in order to resolve different kinds of queries.
These and other objects, features, and advantages of the present invention will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings.
The following description will illustrate the invention using an exemplary data processing system architecture. It should be understood, however, that the invention is not limited to use with any particular system architecture. The invention is instead more generally applicable to any data processing system in which it is desirable to perform efficient and effective data stream clustering. It is to be understood that the phrase “data point,” illustratively used herein, is one example of a data “object.”
As will be illustrated in detail below, the present invention introduces techniques for clustering a data stream and, more particularly, techniques for monitoring data abnormalities in the stream through the clustering of the data stream. An abnormality, as referred to herein, is defined as an outlier cluster or outlier data point of the data stream having specifically defined values in the stored statistical data of the data point or cluster. The stored statistical data may include, for example, the number of pairwise attribute values, the number of categorical attribute values, the number of data points, the sum of the weights of the data points, and the time at which the last data point was added to the outlier. A more detailed description of the values of the statistical data required for abnormality determination is provided herein.
Referring initially to
Data points from a data stream are received at server 30 from an individual client 10 and stored on disk 60. All computations on the data stream are performed by CPU 40. The clustered data points and their corresponding statistical data are stored on disk 60, and are utilized for the purpose of answering a variety of user queries. For example, a data stream may relate to records of a credit card company corresponding to the transactions of its customers. Attributes of these records may include the age and sex of the customer.
In another example, the data points of the data stream may relate to records corresponding to user accesses, or customer connections, on a network. The queries for abnormalities in the data stream are searches for intrusions, or hacker actions. For example, a customer may attempt to bring down a web server by making millions of web accesses on the server using an automated machine, such as a crawler. The queries or searches for abnormalities may be initiated by a system administrator.
Referring now to
The methodology begins at block 202, where data stream maintenance is performed. This maintenance involves receiving data points from the data stream and creating clusters, having associated statistical information. A more detailed description of cluster and data stream maintenance is provided in
In block 206, a user queries for abnormalities within a specified time horizon (t1, t2). Block 208 receives the query and resolves the query by retrieving stored statistical data of the clusters from block 204. The statistical data is used in order to respond to user queries for abnormalities in block 210, terminating the methodology. A more detailed description of block 210 is provided in
Referring now to
In block 308, it is determined whether the data point should be added to the closest cluster. A more detailed description of block 308 is provided in
A newly created cluster containing only a single data point may be referred to as a “trend-setter.” From the point of view of a user, a trend-setter is an outlier until the arrival of other data points certifies that it is actually a cluster. If and when a sufficient number of new data points are added to the cluster, it is referred to as a mature cluster. The specific number of data points needed in order to make a mature cluster is application dependent; however, in the intrusion detection application described above, a mature cluster may contain 20-50 data points.
At a given moment in time, a mature cluster can either be “active” or “inactive.” A mature cluster is said to be active when it has received data points in the recent past. When a mature cluster has not received data points in the recent past, it is said to be inactive. Again, the specific amount of time that must pass in order for a mature cluster to become inactive is application dependent. However, in the intrusion detection application, an active mature cluster may be a mature cluster that has received data points in the last ten days. In some cases, a trend-setter cluster becomes inactive before it has a chance to mature. Such a cluster typically contains a small number of transient data points, which may typically be the result of an underlying abnormality that is short-term in nature.
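The cluster terminology above can be illustrated with a minimal Python sketch. The thresholds are taken from the intrusion detection example (20 data points for maturity, ten days for activity) and are application dependent; the names `ClusterSummary` and `cluster_state`, and the label "maturing" for a cluster with more than one point that is not yet mature, are hypothetical and do not appear in the description.

```python
from dataclasses import dataclass

# Assumed thresholds, taken from the intrusion detection example in the text.
MATURITY_POINTS = 20          # lower end of the quoted 20-50 range
ACTIVITY_WINDOW_DAYS = 10.0   # "received data points in the last ten days"


@dataclass
class ClusterSummary:
    n_points: int       # number of data points absorbed so far
    last_update: float  # time (in days) at which the last point was added


def cluster_state(c: ClusterSummary, now: float) -> str:
    """Label a cluster as active/inactive and trend-setter/maturing/mature."""
    recent = (now - c.last_update) <= ACTIVITY_WINDOW_DAYS
    if c.n_points >= MATURITY_POINTS:
        maturity = "mature"
    elif c.n_points == 1:
        maturity = "trend-setter"
    else:
        maturity = "maturing"  # hypothetical label for the in-between case
    return ("active " if recent else "inactive ") + maturity
```

A trend-setter that is reported as inactive by such a check corresponds to the short-lived abnormality case described above.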
A set of clusters may be dynamically maintained by effectively scaling with data size. In order to achieve better scalability during data stream maintenance, data structures may be constructed that allow for additive operations on the data points.
In order to achieve greater accuracy in the clustering technique, a high level of granularity is maintained in the maintenance of the underlying data structures. This may be achieved through a condensation technique in which groups of data clusters are condensed. These groups of clusters are referred to as cluster droplets.
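A cluster droplet D(t, C) at time t for a set of categorical data points C is defined as a tuple (DF2, DF1, n, w(t), l). Consistent with the stored statistical data listed earlier, the components correspond, in order, to the pairwise attribute value counts (DF2), the categorical attribute value counts (DF1), the number of data points (n), the sum of the weights of the data points at time t (w(t)), and the time at which the last data point was added (l).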
Cluster droplet maintenance involves storing the data at a high level of granularity so as to lose the least amount of information. The droplet update technique continuously maintains a set of cluster droplets C1 . . . Ck, which it updates as new data points arrive. For each cluster, the entire set of statistical data is maintained in the droplet. The maximum number of droplets k which are maintained is dependent upon the amount of available main memory 50. In receiving data points, it is first assumed that no clusters exist. As new data points arrive, unit clusters containing individual data points are created. Once a maximum number k of such clusters have been created, the online maintenance of the clusters may begin starting with a trivial set of k clusters which are updated over time with the arrival of new data points.
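As a rough illustration of a droplet whose statistics support additive updates, the following Python sketch keeps the five components of the tuple in a small class. The exact contents of DF2 and DF1 are not spelled out in this text, so the sketch assumes they are counts of co-occurring attribute value pairs and of single attribute values, respectively. The class name `Droplet`, its methods, and the example records are hypothetical, and time decay of the weights is omitted here.

```python
from collections import Counter
from dataclasses import dataclass, field
from itertools import combinations


@dataclass
class Droplet:
    df2: Counter = field(default_factory=Counter)  # assumed: counts of value pairs
    df1: Counter = field(default_factory=Counter)  # assumed: counts of single values
    n: int = 0          # number of data points
    w: float = 0.0      # sum of point weights (decay not applied in this sketch)
    last: float = 0.0   # time at which the last point was added

    def add_point(self, record, t: float, weight: float = 1.0) -> None:
        """Additively fold one categorical record into the statistics."""
        items = list(enumerate(record))  # (dimension index, value) entries
        for item in items:
            self.df1[item] += weight
        for a, b in combinations(items, 2):
            self.df2[(a, b)] += weight
        self.n += 1
        self.w += weight
        self.last = t

    def merge(self, other: "Droplet") -> None:
        """Combine two droplets by adding their statistics component-wise."""
        self.df2.update(other.df2)  # Counter.update adds counts
        self.df1.update(other.df1)
        self.n += other.n
        self.w += other.w
        self.last = max(self.last, other.last)


d = Droplet()
d.add_point(("tcp", "http"), t=1.0)
d.add_point(("tcp", "ftp"), t=2.0)
```

Because every component is a sum or a maximum, a new point can be absorbed without revisiting earlier points, which is the property the additive data structures are meant to provide.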
Referring now to
In the case of cluster droplets described above, which maintain a maximum number of droplets k, the cluster with the maximum similarity value is defined as Cmindex. If the similarity value S(X, Cmindex) is greater than the user-defined threshold, the point X is assigned to the cluster Cmindex. Otherwise, it is determined whether an inactive cluster exists in the existing set of cluster droplets. If no such inactive cluster exists, then the data point X is still added to Cmindex. In the event that the data point X is assigned to the cluster Cmindex, two steps are performed:
In the event that the newly arriving data point does not naturally fit in any of the cluster droplets and an inactive cluster does exist, then the most inactive cluster is replaced by a new cluster containing the solitary data point X. The most inactive cluster may be defined as the least recently updated cluster droplet. This new cluster is a potential outlier, or the beginning of a new trend. Further understanding of this new cluster droplet may only be obtained with the progress of the data stream.
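The assignment and replacement decision described above can be sketched in Python as follows. The similarity function S is not defined in this text, so it is passed in as a parameter; `match_fraction` is only a toy stand-in. The function name `place_point`, the dictionary layout of a cluster, and the `inactive_after` parameter are all hypothetical.

```python
def match_fraction(x, points):
    """Toy similarity (an assumption): mean fraction of matching attributes."""
    per_point = [sum(a == b for a, b in zip(x, p)) / len(x) for p in points]
    return sum(per_point) / len(per_point)


def place_point(x, t, clusters, similarity, threshold, inactive_after):
    """Assign x to its closest cluster or replace the most inactive cluster.

    clusters is a list of dicts with keys "points" and "last" (last update time).
    Returns the index of the cluster that received x.
    """
    scores = [similarity(x, c["points"]) for c in clusters]
    best = max(range(len(clusters)), key=lambda i: scores[i])
    fits = scores[best] > threshold
    inactive = [i for i, c in enumerate(clusters) if t - c["last"] > inactive_after]
    if fits or not inactive:
        # Either x fits naturally, or no droplet can be recycled.
        clusters[best]["points"].append(x)
        clusters[best]["last"] = t
        return best
    # x does not fit and an inactive cluster exists: replace the least
    # recently updated one with a new single-point cluster.
    stale = min(inactive, key=lambda i: clusters[i]["last"])
    clusters[stale] = {"points": [x], "last": t}
    return stale


clusters = [
    {"points": [("tcp", "http")], "last": 0.0},
    {"points": [("udp", "dns")], "last": 9.0},
]
first = place_point(("tcp", "http"), 10.0, clusters, match_fraction, 0.5, 5.0)
second = place_point(("icmp", "echo"), 20.0, clusters, match_fraction, 0.5, 5.0)
```

In this toy run the first point joins cluster 0, while the second matches nothing and replaces cluster 1, the least recently updated droplet, as a new single-point cluster.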
Referring now to
In order to more fully describe decay statistics, a further description of the data stream is first required. The data stream comprises a set of multi-dimensional records $X_1 \ldots X_k \ldots$ arriving at time stamps $T_1 \ldots T_k \ldots$. Each $X_i$ is a multi-dimensional categorical record containing d dimensions, which are denoted by $X_i = (x_i^1 \ldots x_i^d)$. It is assumed that the ith categorical dimension contains $v_i$ possible values. Since the stream clustering technique should attribute greater importance to recent clusters, a time-sensitive weight is provided for each data point. It is assumed that each data point has a weight defined by f(t), which is also referred to as the fading function. The fading function f(t) is a monotonically decreasing function which decays uniformly with time t. In order to formalize this concept, the half-life of a point in the data stream is defined as the time $t_0$ at which $f(t_0) = (1/2) f(0)$.
Conceptually, the aim of defining a half-life is to define the rate of decay of the weight assigned to each data point in the stream. Correspondingly, the decay rate is defined as the inverse of the half-life of the data stream. The decay rate is denoted by $\lambda = 1/t_0$. In order for the half-life property to hold, the weight of each point in the data stream is defined by $f(t) = 2^{-\lambda t}$, creating a half-life of $1/\lambda$. In the intrusion detection application described above, the decay rate may be 0.5 per day, giving a half-life of two days. However, the decay rate and half-life are application dependent, and therefore may differ from these examples.
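The relationship between the decay rate and the half-life can be checked numerically with a short Python sketch. The function names `fading_weight` and `half_life` are hypothetical; the decay rate of 0.5 per day is the example value from the text.

```python
def fading_weight(age: float, decay_rate: float) -> float:
    """Weight of a point of the given age under the fading function 2^(-lambda * t)."""
    return 2.0 ** (-decay_rate * age)


def half_life(decay_rate: float) -> float:
    """Half-life is the inverse of the decay rate."""
    return 1.0 / decay_rate


rate = 0.5                                  # per day, as in the example above
fresh = fading_weight(0.0, rate)            # weight at arrival
after_half_life = fading_weight(half_life(rate), rate)
```

With these values a point starts at weight 1, drops to half that weight after two days, and to a quarter after four days.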
By changing the value of λ, it is possible to change the rate at which the importance of the historical information in the data stream decays. The higher the value of λ, the lower the importance of the historical information compared to more recent data. By changing the value of this parameter, it is possible to obtain considerable control on the rate at which the historical statistics are allowed to decay. For more stable data streams, it is desirable to pick a smaller value of λ, whereas for rapidly evolving data streams, it is desirable to pick a larger value of λ.
Referring now to
For example, when a new cluster is created during the streaming technique by a newly arriving data point, it is allowed to remain as a trend-setting outlier for at least one half-life. During that period, if at least one more data point is added to the newly formed cluster, it becomes an active and mature cluster. If no new points arrive during a half-life, then the trend-setting outlier is recognized as a true abnormality in the data stream, and the single point cluster is removed from the current set of clusters. Thus, a new cluster containing one data point is removed when the (weighted) number of points in the cluster is 0.5.
This criterion is also used for the removal of mature clusters. In other words, a mature cluster is removed when the weighted number of points in that cluster has decayed to 0.5. This will happen only when the inactivity period in the cluster has exceeded the half-life 1/λ. The greater the number of points in the cluster, the greater the level by which the inactivity period would need to exceed its half-life in order to be considered an inactive cluster. This is a natural solution, since it is intuitively desirable to have stronger requirements (a longer inactivity period) for the elimination of a cluster containing a larger number of points.
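The removal rule can be made concrete with a small Python sketch. It assumes that a cluster's weighted point count at its last update is known and that, with no new arrivals, this count simply decays by the fading function; the 0.5 threshold is the one given for a single-point cluster above. The function names are hypothetical.

```python
import math


def weighted_count(weight_at_last_update: float, idle_time: float,
                   decay_rate: float) -> float:
    """Weighted number of points after idle_time with no new arrivals."""
    return weight_at_last_update * 2.0 ** (-decay_rate * idle_time)


def should_remove(weight_at_last_update: float, idle_time: float,
                  decay_rate: float) -> bool:
    """A cluster is dropped once its weighted count has decayed to 0.5."""
    return weighted_count(weight_at_last_update, idle_time, decay_rate) <= 0.5


def idle_time_until_removal(weight_at_last_update: float,
                            decay_rate: float) -> float:
    """Solve w * 2^(-lambda * dt) = 0.5 for dt, giving dt = log2(2w) / lambda."""
    return math.log2(2.0 * weight_at_last_update) / decay_rate
```

Under these assumptions, and with a decay rate of 0.5 per day, a single-point cluster is removed after one half-life (two days), whereas a cluster with a weighted count of 4 needs six days of inactivity, which matches the stated intent that larger clusters require a longer inactivity period.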
The inventive techniques are applicable to a large number of applications such as systems diagnosis. For example, as described above, the techniques of the present invention may be utilized for online monitoring of network intrusions. Referring now to
Although illustrative embodiments of the present invention have been described herein with reference to the accompanying drawings, it is to be understood that the invention is not limited to those precise embodiments, and that various other changes and modifications may be made by one skilled in the art without departing from the scope or spirit of the invention.
|US Classification||1/1, 707/999.006|
|International Classification||G06K9/62, G06F17/30|
|Cooperative Classification||Y10S707/952, G06K9/6284|
|4 Jun 2004||AS||Assignment|
Owner name: INTERNATIONAL BUSINESS MACHINES CORPORATION, NEW Y
Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:AGGARWAL, CHARU C.;YU, PHILIP SHI-LUNG;REEL/FRAME:014696/0502
Effective date: 20040329