United States Patent [19]
Vaithyanathan et al.
US005857179A
[11] Patent Number: 5,857,179
Date of Patent:
 COMPUTER METHOD AND APPARATUS FOR CLUSTERING DOCUMENTS AND AUTOMATIC GENERATION OF CLUSTER KEYWORDS
 Inventors: Shivakumar Vaithyanathan, Nashua, N.H.; Mark R. Adler, Lexington, Mass.; Christopher G. Hill, Cumming, Ga.
 Assignee: Digital Equipment Corporation,
 Appl. No.: 709,755
 Filed: Sep. 9, 1996
 Int. Cl. G06F 17/30
 U.S. Cl. 707/2; 395/794
 Field of Search 395/601, 602, 395/603, 604, 605, 611, 616, 11, 13, 50, 751, 752, 758, 759, 760, 761, 779, 788,
 References Cited
U.S. PATENT DOCUMENTS
4,839,853   6/1989  Deerwester et al.  364/900
5,263,120  11/1993  Bickel             395/11
5,343,554   8/1994  Koza et al.        395/13
5,481,712   1/1996  Silver et al.      395/701
5,559,940   9/1996  Hutson             395/788
5,619,709   4/1997  Caid et al.        707/532
Jain, A.K., et al., "Algorithms for Clustering Data," Michigan State University, Prentice Hall, Englewood Cliffs, New Jersey 07632, pp. 96-101 (1988).
Faloutsos, C., et al., "A Survey of Information Retrieval and Filtering," University of Maryland, College Park, MD 20742, pp. 1-22 (no date given).
Cutting, D.R., et al., "Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections," Proceedings of the Fifteenth Annual International ACM SIGIR Conference, pp. 318-329 (Jun. 1992).
Cutting, D.R., et al., "Constant Interaction-Time Scatter/Gather Browsing of Very Large Document Collections," Proceedings of the Sixteenth Annual International ACM SIGIR Conference, pp. 1-9 (Jun. 1993).
Faber, V., "Clustering and the Continuous k-Means Algorithm," (no date given).
Singhal, A., "Length Normalization in Degraded Text Collections," Department of Computer Science, Cornell University, Ithaca, NY 14853, pp. 1-19 (no date given).
Primary Examiner—Thomas G. Black
Assistant Examiner—Buay Lian Ho
Attorney, Agent, or Firm—David A. Dagg
ABSTRACT
A computer method and apparatus determine keywords of documents. An initial document-by-term matrix is formed, each document being represented by a respective M-dimensional vector, where M represents the number of terms or words in a predetermined domain of documents. The dimensionality of the initial matrix is reduced to form resultant vectors of the documents. The resultant vectors are then clustered such that correlated documents are grouped into respective clusters. For each cluster, the terms having the greatest impact on the documents in that cluster are identified. The identified terms represent keywords of each document in that cluster. Further, the identified terms form a cluster summary indicative of the documents in that cluster.
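The four steps of the abstract (matrix formation, dimensionality reduction, clustering, keyword identification) can be illustrated with a minimal sketch. The abstract does not name the reduction or clustering technique, so the choices here are assumptions made for illustration only: the reduction projects documents onto the top singular directions of the matrix (found by power iteration, a small stand-in for a truncated SVD), the clustering is plain k-means, and "greatest impact" is approximated by mapping each cluster centroid back to term space and ranking the term weights. All function names and the toy documents are hypothetical and do not come from the patent.

```python
from collections import Counter

def doc_term_matrix(docs):
    """Build a document-by-term count matrix; M = len(vocab)."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    rows = []
    for d in docs:
        counts = Counter(d.lower().split())
        rows.append([float(counts[w]) for w in vocab])
    return rows, vocab

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def top_directions(X, k, n_iter=200):
    """Assumed reduction step: top-k right singular vectors of X,
    found by power iteration on X^T X with deflation."""
    m = len(X[0])
    basis = []
    for _ in range(k):
        v = [1.0 + 0.1 * j for j in range(m)]  # arbitrary fixed start vector
        for _ in range(n_iter):
            # Remove components along directions already found.
            for b in basis:
                proj = dot(v, b)
                v = [vi - proj * bi for vi, bi in zip(v, b)]
            # Apply X^T X to v, then renormalize.
            xv = [dot(row, v) for row in X]
            v = [sum(X[i][j] * xv[i] for i in range(len(X))) for j in range(m)]
            norm = dot(v, v) ** 0.5
            v = [vi / norm for vi in v]
        basis.append(v)
    return basis

def kmeans(Z, n_clusters, n_iter=20):
    """Assumed clustering step: k-means with farthest-point seeding."""
    centers = [list(Z[0])]
    while len(centers) < n_clusters:
        far = max(Z, key=lambda z: min(sqdist(z, c) for c in centers))
        centers.append(list(far))
    labels = [0] * len(Z)
    for _ in range(n_iter):
        labels = [min(range(n_clusters), key=lambda c: sqdist(z, centers[c]))
                  for z in Z]
        for c in range(n_clusters):
            members = [z for z, l in zip(Z, labels) if l == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

def cluster_keywords(centers, basis, vocab, top=2):
    """Map each centroid back to term space and keep the heaviest terms."""
    result = []
    for center in centers:
        weights = [sum(c * b[j] for c, b in zip(center, basis))
                   for j in range(len(vocab))]
        ranked = sorted(range(len(vocab)), key=lambda j: -weights[j])
        result.append([vocab[j] for j in ranked[:top]])
    return result

# Toy corpus (invented for illustration): two obvious topics.
docs = ["cat dog pet", "cat dog pet", "cat dog",
        "stock market trade", "stock market"]
X, vocab = doc_term_matrix(docs)
basis = top_directions(X, k=2)
Z = [[dot(row, b) for b in basis] for row in X]  # reduced document vectors
labels, centers = kmeans(Z, n_clusters=2)
keywords = cluster_keywords(centers, basis, vocab, top=2)
for c, words in enumerate(keywords):
    print(c, words)
```

In this sketch the per-cluster keyword list doubles as the cluster summary, mirroring the last two sentences of the abstract. The number of retained dimensions and the number of clusters are free parameters here; the abstract gives no guidance on how they are chosen.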
20 Claims, 10 Drawing Sheets