United States Patent [19]

Vaithyanathan et al.

US005857179A

[11] Patent Number: 5,857,179
[45] Date of Patent: Jan. 5, 1999

[54] COMPUTER METHOD AND APPARATUS FOR CLUSTERING DOCUMENTS AND AUTOMATIC GENERATION OF CLUSTER KEYWORDS

[75] Inventors: Shivakumar Vaithyanathan, Nashua, N.H.; Mark R. Adler, Lexington, Mass.; Christopher G. Hill, Cumming, Ga.

[73] Assignee: Digital Equipment Corporation, Maynard, Mass.

[21] Appl. No.: 709,755
[22] Filed: Sep. 9, 1996

[51] Int. Cl. G06F 17/30

[52] U.S. Cl. 707/2; 395/794

[58] Field of Search 395/601, 602, 603, 604, 605, 611, 616, 11, 13, 50, 751, 752, 758, 759, 760, 761, 779, 788, 792, 794

[56] References Cited

U.S. PATENT DOCUMENTS

4,839,853 6/1989 Deerwester et al. 364/900

5,263,120 11/1993 Bickel 395/11

5,343,554 8/1994 Koza et al. 395/13

5,481,712 1/1996 Silver et al. 395/701

5,559,940 9/1996 Hutson 395/788

5,619,709 4/1997 Caid et al. 707/532

OTHER PUBLICATIONS

Jain, A.K., et al., "Algorithms for Clustering Data," Michigan State University, Prentice Hall, Englewood Cliffs, New Jersey 07632, pp. 96-101 (1988).

Faloutsos, C., et al., "A Survey of Information Retrieval and Filtering," University of Maryland, College Park, MD 20742, pp. 1-22 (no date given).

Cutting, D.R., et al., "Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections," Proceedings of the Fifteenth Annual International ACM SIGIR Conference, pp. 318-329 (Jun. 1992).

Cutting, D.R., et al., "Constant Interaction-Time Scatter/Gather Browsing of Very Large Document Collections," Proceedings of the Sixteenth Annual International ACM SIGIR Conference, pp. 1-9 (Jun. 1993).

Faber, V., "Clustering and the Continuous k-Means Algorithm," (no date given).

Singhal, A., "Length Normalization in Degraded Text Collections," Department of Computer Science, Cornell University, Ithaca, NY 14853, pp. 1-19 (no date given).

Primary Examiner: Thomas G. Black
Assistant Examiner: Buay Lian Ho
Attorney, Agent, or Firm: David A. Dagg

[57] ABSTRACT

A computer method and apparatus determines keywords of documents. An initial document-by-term matrix is formed, each document being represented by a respective M-dimensional vector, where M represents the number of terms or words in a predetermined domain of documents. The dimensionality of the initial matrix is reduced to form resultant vectors of the documents. The resultant vectors are then clustered such that correlated documents are grouped into respective clusters. For each cluster, the terms having greatest impact on the documents in that cluster are identified. The identified terms represent keywords of each document in that cluster. Further, the identified terms form a cluster summary indicative of the documents in that cluster.
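The steps named in the abstract can be illustrated with a minimal Python sketch. The abstract does not say which reduction or clustering technique is used, so the choices here are assumptions made only for illustration: truncated SVD for the dimensionality reduction, plain k-means for the clustering, and "greatest impact" read as the largest weight when a cluster centroid is projected back into term space. All function names (`doc_term_matrix`, `reduce_dims`, `kmeans`, `cluster_keywords`) are hypothetical and do not come from the patent.

```python
import numpy as np

def doc_term_matrix(docs):
    """Build a document-by-term count matrix; M is the vocabulary size."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    index = {w: j for j, w in enumerate(vocab)}
    X = np.zeros((len(docs), len(vocab)))
    for i, d in enumerate(docs):
        for w in d.lower().split():
            X[i, index[w]] += 1
    return X, vocab

def reduce_dims(X, k):
    """Assumption: truncated SVD stands in for the reduction step."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] * s[:k], Vt[:k]  # reduced document vectors, term basis

def kmeans(Z, n_clusters, n_iter=20):
    """Assumption: plain k-means with deterministic farthest-point seeding."""
    centers = [Z[0]]
    for _ in range(1, n_clusters):
        d = np.min([np.linalg.norm(Z - c, axis=1) for c in centers], axis=0)
        centers.append(Z[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dists = np.linalg.norm(Z[:, None] - centers[None], axis=2)
        labels = np.argmin(dists, axis=1)
        for c in range(n_clusters):
            if np.any(labels == c):
                centers[c] = Z[labels == c].mean(axis=0)
    return labels, centers

def cluster_keywords(centers, Vt, vocab, top=3):
    """Project each centroid back to term space and keep the largest weights."""
    term_weights = centers @ Vt
    return [[vocab[j] for j in np.argsort(-row)[:top]] for row in term_weights]

# Toy corpus (invented for this sketch) with two obvious topics.
docs = [
    "cat dog pet",
    "dog cat pet animal",
    "stock market trade",
    "market stock trade price",
]
X, vocab = doc_term_matrix(docs)
Z, Vt = reduce_dims(X, k=2)
labels, centers = kmeans(Z, n_clusters=2)
for c, words in enumerate(cluster_keywords(centers, Vt, vocab)):
    print(c, words)
```

On this toy corpus the two topics share no terms, so the reduced vectors separate cleanly; real collections would typically need term weighting and length normalization, which this sketch omits.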

20 Claims, 10 Drawing Sheets
