WO2003052627A1 - Information resource taxonomy - Google Patents
Information resource taxonomy
- Publication number
- WO2003052627A1 WO2003052627A1 PCT/AU2002/001719 AU0201719W WO03052627A1 WO 2003052627 A1 WO2003052627 A1 WO 2003052627A1 AU 0201719 W AU0201719 W AU 0201719W WO 03052627 A1 WO03052627 A1 WO 03052627A1
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- resources
- clusters
- generating
- taxonomy
- cluster
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/954—Navigation, e.g. using categorised browsing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/355—Class or cluster creation or modification
Definitions
- the present invention relates to taxonomies for information resources, and in particular to a system and process for generating a taxonomy for a plurality of information resources in a communications network.
- a process for generating a taxonomy for a plurality of information resources in a communications network including: collecting said resources from said network; generating cluster criteria from said resources; and generating said taxonomy as a hierarchy of resource clusters based on said criteria.
- the present invention also provides an information resource taxonomy system, including a data collector for collecting information resources from a communications network; and a taxonomy generator for generating a taxonomy represented by a hierarchy of resource clusters, using cluster criteria generated from said resources.
- Figure 1 is a schematic diagram of a preferred embodiment of an information resource taxonomy system.
- Figure 2 is a flow diagram of a data collection process executed by a data collector of the system.
- Figure 3 is a flow diagram of a pre-processing process executed by a pre-processor of the system.
- Figure 4 is a graph of the goodness value of a document set as a function of the cluster threshold.
- an information resource taxonomy system includes a data collector 10, a data processing system 12, a renderer 14, and a management system 16.
- the taxonomy system executes a taxonomy generation process that automatically generates a taxonomy from structured or unstructured documents or other information resources, and can be used to maintain the taxonomy.
- the taxonomy is a hierarchical tree structure that organizes resources into clusters or nodes based on their similarity, and can include the resources themselves.
- the taxonomy is subsequently used by the renderer 14 to generate markup code such as HTML, XML, or ASP that provides an interactive, hierarchical view into the space of documents or other information resources.
- a user of the Internet can view the hierarchy and open individual documents or other information resources over the Internet using a web browser 32 to access the markup code generated by the renderer 14 and generate a graphical display of the hierarchy.
- the taxonomy system can be applied to a variety of taxonomy generation tasks such as site management of corporate intranets and external web sites.
- An administrator of the taxonomy system can login to the system from a terminal associated with the management system 16. The administrator can then submit to the taxonomy system a text file that defines the taxonomy specifications, i.e., the taxonomy creation tasks to be performed by the system.
- This file includes a list of uniform resource identifiers (URIs) and a corresponding list of 'include' specifications.
- URIs indicate high-level domains that are to be clustered or categorised by the taxonomy system, and the 'include' specifications indicate the types of documents that are to be included in the taxonomy. For example, it may be desired to include only textual documents in one or more of the following formats: HTML, text, Microsoft Word®, FrameMaker, and StarOffice.
- the text file containing these specifications is sent to the data collector 10.
- the components of the taxonomy system can be implemented using standard computer system hardware and adding unique software modules.
- the data collector 10 and the renderer 14 are 850 MHz Pentium 3 and 1.5 GHz Pentium 4 personal computers, respectively, each running a Linux operating system.
- the data processing system 12 is a Sun Ultra Enterprise four-CPU server running a Solaris 8 operating system.
- the management system 16 is a 1.5 GHz Pentium 4 personal computer running a Windows XP operating system.
- the data processing system 12 includes a number of data processing modules 18 to 26, including a pre-processor 18, a sampler 20, a clusterer and classifier 22, a taxonomy database 24, and a post processor module 26.
- the data processing system 12 can further include parallel clusterers 28, and/or parallel classifiers 30.
- the renderer 14 includes a taxonomy rendering module 15 and a web server module 17.
- the management system 16 includes a process management component 19 and an editor module 21. Whilst these modules are preferably implemented by software code, at least some of the processing steps executed by the modules, described below, may be implemented by hardware circuits such as application-specific integrated circuits (ASICs).
- the data collector 10 executes a data collection process, as shown in Figure 2.
- the data collection process begins at step 34 when the taxonomy specifications are received.
- the collector 10 uses the specifications to navigate or "crawl" the Internet at step 36, starting at the top level domains provided by the URI lists and progressing down to sub-domains thereof.
- the crawling process is known in the art. Briefly, the data collector 10 performs HTTP GET requests to network servers indicated by the provided URIs, or by links within HTML data previously retrieved from the network, including only those links that match the include specifications. The data collector 10 converts any retrieved document that is not in HTML into HTML at step 38. The resulting HTML data is then sent to the data processing system 12 at step 40.
- At step 42, if no more data needs to be collected, the process branches to return to step 34 and waits for the next category specification to be submitted by an administrator. Alternatively, if it is determined at step 42 that more data needs to be collected, the process branches back to step 36 in order to retrieve more data from the network.
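As an illustrative sketch only, the crawl of steps 36 to 42 might be written in Python as follows. The names `crawl`, `in_scope`, and `LinkParser`, the rule that an 'include' specification is a list of file extensions, and the injected `fetch` callable are all assumptions made for the example; the specification does not define them.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)


def in_scope(url, hosts, include_exts):
    """Hypothetical 'include' test: stay on the listed hosts (or their
    sub-domains) and accept only directory URLs or listed extensions."""
    parts = urlparse(url)
    on_host = any(parts.netloc == h or parts.netloc.endswith("." + h) for h in hosts)
    path = parts.path.lower()
    wanted = path == "" or path.endswith("/") or path.endswith(tuple(include_exts))
    return on_host and wanted


def crawl(start_uris, include_exts, fetch, limit=100):
    """Breadth-first crawl. `fetch(url)` returns HTML text, or None on failure."""
    hosts = {urlparse(u).netloc for u in start_uris}
    queue, seen, pages = deque(start_uris), set(start_uris), {}
    while queue and len(pages) < limit:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links against the page URL
            if link not in seen and in_scope(link, hosts, include_exts):
                seen.add(link)
                queue.append(link)
    return pages
```

Passing `fetch` as a parameter keeps the sketch testable without network access; in practice it could wrap an HTTP GET request and the non-HTML conversion of step 38.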
- HTML data sent to the data processing system 12 from the data collector 10 is received by the pre-processor 18.
- Alternatively, HTML data can be submitted directly to the pre-processor 18 by the administrator using the management system 16.
- the pre-processor 18 executes an HTML processing process, as shown in Figure 3.
- the process begins when HTML data is received by the pre-processor 18 at step 44.
- Metadata tags are then extracted from the HTML data at step 46. This is achieved by regular expression matching on predefined patterns such as the HTML tags <TITLE>, <META ...>, and so on.
- Meta information is included in the output from the pre-processor 18 as text-delimited additions to the data.
- the delimiters are text markups that do not normally occur in the data, e.g., "xxxxxxxx:".
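The metadata extraction of step 46 could be sketched as below. This is a minimal example, assuming that only `<TITLE>` and `<META name=... content=...>` tags are of interest and that each extracted field is emitted as `delimiter + name + ":" + value`; the function name `extract_metadata` and that output layout are hypothetical.

```python
import re

# Predefined patterns for the tags of interest (case-insensitive).
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)
META_RE = re.compile(
    r"<meta\s+name=[\"']([^\"']+)[\"']\s+content=[\"']([^\"']*)[\"'][^>]*>",
    re.IGNORECASE,
)


def extract_metadata(html, delimiter="xxxxxxxx:"):
    """Return one delimited text line per metadata field found in `html`."""
    lines = []
    title = TITLE_RE.search(html)
    if title:
        lines.append(f"{delimiter}title:{title.group(1).strip()}")
    for name, content in META_RE.findall(html):
        lines.append(f"{delimiter}{name.lower()}:{content.strip()}")
    return lines
```

The delimiter only needs to be a string that is unlikely to occur in ordinary document text, so that later stages can separate metadata from body text.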
- the remaining data is then processed at step 48 by a filter that removes data that is not considered to be important. This includes removing text that appears likely to be a component of an advertising table or banner. Commonly occurring noise strings are removed by stoplists or by statistical analysis. For example, noise reduction can be achieved by building a frequency table of strings found in the document set. These strings are the characters found between matching pairs of HTML tags, such as <TD> and </TD>. A string is removed from the document set if its occurrence frequency exceeds a threshold value.
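The frequency-table form of noise reduction might look like the following sketch. It assumes that only `<TD>` cells are examined and that cells are not nested; the function names and the choice to count total occurrences across the document set are assumptions for illustration.

```python
import re
from collections import Counter

# Text between a matching <TD> ... </TD> pair (nested tables are not handled).
CELL_RE = re.compile(r"<td[^>]*>(.*?)</td>", re.IGNORECASE | re.DOTALL)


def find_noise_strings(documents, threshold):
    """Build a frequency table of cell strings and return those whose
    occurrence count across the document set exceeds `threshold`."""
    counts = Counter()
    for html in documents:
        for cell in CELL_RE.findall(html):
            cell = cell.strip()
            if cell:
                counts[cell] += 1
    return {s for s, c in counts.items() if c > threshold}


def remove_noise(html, noise):
    """Delete every <TD> cell whose text is a known noise string."""
    def drop_if_noise(match):
        return "" if match.group(1).strip() in noise else match.group(0)
    return CELL_RE.sub(drop_if_noise, html)
```

Strings such as navigation labels or banner text recur on most pages of a site, so a simple count is often enough to separate them from page-specific content.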
- the pre-processor 18 converts the remaining HTML to text by removing HTML tags.
- the resulting text document is then sent to the sampler 20 at step 52.
- the sampler 20 samples a fixed fraction of incoming documents, as described below.
- the sample documents are then processed by the clusterer/classifier 22.
- the clusterer 22 partitions the documents based on their content. It does this by forming groups or clusters of documents based on their natural affinity rather than requiring a pre-specified number of categories.
- the clustering and feature selection processes are based upon processes described in the specification of International Patent Application No. PCT/AU01/00198 ("the TACT specification"), incorporated herein by reference.
- each document is represented by a word frequency vector including words from the document and their frequencies of occurrence, where some words are excluded using feature selection criteria.
- a numeric similarity measure is then determined as a function of any two word vectors to determine the similarity of any two documents. For example, a new cluster can be formed by two documents if their similarity falls within a threshold similarity value for clustering.
- a cluster is characterised by a word frequency vector that is the average of the word frequency vectors of its constituent documents. This average word frequency vector is referred to as the cluster centroid.
- the similarity measure used is the cosine similarity function, described in the TACT specification.
- the clustering process uses this similarity measure to group similar documents into clusters by assigning each document to the most similar cluster.
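A minimal sketch of these ideas follows. It is not the process of the TACT specification, whose details are not reproduced here; it assumes a simple single-pass scheme in which a document joins the most similar existing cluster when the cosine similarity is at least `threshold` and otherwise starts a new cluster. All function names and the whitespace tokenisation are illustrative assumptions.

```python
import math
from collections import Counter


def word_vector(text, stopwords=frozenset()):
    """Word frequency vector; `stopwords` stands in for feature selection."""
    return Counter(w for w in text.lower().split() if w not in stopwords)


def cosine(a, b):
    """Cosine similarity of two sparse word vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    """Average word frequency vector of a cluster's documents."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return {w: c / len(vectors) for w, c in total.items()}


def cluster(vectors, threshold):
    """Assign each vector to its most similar cluster, or open a new one."""
    clusters, centroids = [], []
    for i, v in enumerate(vectors):
        sims = [cosine(v, c) for c in centroids]
        best = max(range(len(sims)), key=sims.__getitem__, default=None)
        if best is not None and sims[best] >= threshold:
            clusters[best].append(i)
            centroids[best] = centroid([vectors[j] for j in clusters[best]])
        else:
            clusters.append([i])
            centroids.append(dict(v))
    return clusters
```

Because no number of categories is supplied, the number of clusters that emerges depends only on the documents and the threshold.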
- An optimal similarity threshold value for creating clusters from a given document set is determined by creating different groupings of the documents at different thresholds and then evaluating these to determine the best grouping, as described in An Evaluation of Criteria for Measuring the Quality of Clusters by B. Raskutti and C. Leckie, pp.
- Hierarchical clustering is achieved by iterative clustering of larger, less coherent clusters.
- the coherence of a cluster is determined by the intra-cluster similarity value of the cluster. If the documents in a cluster are very similar, i.e., the similarity values of each document with the cluster centroid fall within a similarity threshold for coherence, then the cluster is deemed coherent. If this criterion is not met, then documents within the cluster are formed into sub-clusters of the original cluster. These sub-clusters are sub-nodes of the original parent cluster or node, thus forming a hierarchy of clusters or nodes. By performing this sub-clustering iteratively, a hierarchical tree structure of coherent clusters is formed, to provide the taxonomy.
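The iterative sub-clustering described above can be sketched recursively. The example below is an assumption-laden illustration: it uses a simple single-pass clusterer, raises the clustering threshold by a fixed `step` at each level, and caps the depth with `max_depth`; none of these choices comes from the specification, and the function names are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    total = Counter()
    for v in vectors:
        total.update(v)
    return {w: c / len(vectors) for w, c in total.items()}


def cluster(vectors, threshold):
    """Single-pass clustering: join the most similar cluster or start a new one."""
    clusters, centroids = [], []
    for i, v in enumerate(vectors):
        sims = [cosine(v, c) for c in centroids]
        best = max(range(len(sims)), key=sims.__getitem__, default=None)
        if best is not None and sims[best] >= threshold:
            clusters[best].append(i)
            centroids[best] = centroid([vectors[j] for j in clusters[best]])
        else:
            clusters.append([i])
            centroids.append(dict(v))
    return clusters


def build_tree(indices, vectors, threshold, coherence, step=0.1, max_depth=5):
    """Return a node {"docs": [...], "children": [...]} for the given documents."""
    node = {"docs": list(indices), "children": []}
    center = centroid([vectors[i] for i in indices])
    coherent = all(cosine(vectors[i], center) >= coherence for i in indices)
    if coherent or len(indices) < 2 or max_depth == 0:
        return node  # coherent clusters become leaves of the taxonomy
    groups = cluster([vectors[i] for i in indices], threshold)
    if len(groups) > 1:  # only split when sub-clustering actually separates documents
        tighter = min(threshold + step, 1.0)
        for group in groups:
            members = [indices[j] for j in group]
            node["children"].append(
                build_tree(members, vectors, tighter, coherence, step, max_depth - 1)
            )
    return node
```

Each node records its documents, and each non-coherent node carries its sub-clusters as children, which gives the tree of clusters or nodes that serves as the taxonomy.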
- the computational complexity of this clustering process is proportional to n, the number of documents, K, the number of threshold evaluations and m, the average number of clusters per threshold.
- the clustering process includes several steps for alleviating some of the scalability issues by reducing n and K. Whilst m is much smaller than n, it is proportional to n; therefore, reducing n also reduces m.
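One way to read the scaling statement, offered as an interpretation rather than a formula from the specification, is to treat the cost $T$ as proportional to the product of the three quantities. The constant $\alpha$ (clusters per document) and the sampling fraction $f$ are symbols introduced here for illustration.

$$
T \propto n \, K \, m, \qquad m \approx \alpha n \;\; (0 < \alpha \ll 1) \quad\Longrightarrow\quad T \propto \alpha \, K \, n^{2}
$$

$$
n \to f n \;\; (0 < f \le 1) \quad\Longrightarrow\quad T \to f^{2} \, T
$$

Under this reading, clustering a 10% sample ($f = 0.1$) would cost roughly 1% of clustering the full document set, and halving the number of threshold evaluations $K$ would halve the cost again.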
- execution time is reduced by using percentage-based random sampling of the document space, whereby the sampler 20 provides a fixed fraction of the document space to the clusterer 22 for clustering.
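Percentage-based sampling of this kind is straightforward to sketch. The function name `sample_documents`, the `seed` parameter, and the decision to return the unsampled remainder alongside the sample are assumptions for the example.

```python
import random


def sample_documents(documents, fraction, seed=None):
    """Split `documents` into a random sample of the given fraction and the rest."""
    rng = random.Random(seed)
    count = max(1, round(len(documents) * fraction)) if documents else 0
    chosen = set(rng.sample(range(len(documents)), count))
    sample = [d for i, d in enumerate(documents) if i in chosen]
    remainder = [d for i, d in enumerate(documents) if i not in chosen]
    return sample, remainder
```

Returning the remainder makes it easy to hand the unsampled documents to a later classification or clustering pass.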
- a second form of sampling is provided by stopping the clustering process after a predefined time interval in order to generate a clustered sample of the document space.
- the first process simply classifies documents into the existing clusters using the existing cluster centroids. That is, a new document is added to an existing cluster if its similarity to the cluster centroid falls within a fixed threshold similarity value. Any documents failing the threshold evaluation criteria for all clusters are set aside for later clustering.
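The first process could be sketched as follows. The example assumes cosine similarity over sparse word vectors, treats "falls within" the threshold as "is at least" the threshold, and picks the most similar cluster when several qualify; the function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two sparse word vectors held as dicts."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(vectors, centroids, threshold):
    """Place each new vector in an existing cluster, or set it aside.

    The centroids are left unchanged, since classification uses the
    clusters found from the sample as they stand.
    """
    assigned, set_aside = {}, []
    for i, v in enumerate(vectors):
        sims = [cosine(v, c) for c in centroids]
        best = max(range(len(sims)), key=sims.__getitem__, default=None)
        if best is not None and sims[best] >= threshold:
            assigned.setdefault(best, []).append(i)
        else:
            set_aside.append(i)  # kept for later clustering
    return assigned, set_aside
```

Each new document costs one comparison per existing cluster, which is why this step is much cheaper than re-clustering the whole document space.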
- the second process uses the sample document clusters as a training set for an alternative document classification system.
- a support vector machine (SVM)
- the SVM is described in the specification of International Patent Application No. PCT/AU01/00415, incorporated herein by reference.
- any documents not classified are set aside for later clustering.
- the third process simply continues to cluster, but using the optimal threshold similarity value determined whilst clustering the initial sample documents. This process forms new clusters for new documents that are not similar to the existing clusters.
- Figure 4 is a graph of the goodness value of a document set, as described above, as a function of the logarithm of the similarity threshold value for cluster formation.
- the solid line 54 joining data points has a well-defined minimum 56 at a log(threshold) value near 0.2.
- the general shape of this graph is typical of all document sets. Knowing the approximate shape of this graph allows the optimal threshold value for a particular document set to be located rapidly.
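The specification does not name a search method. As one possibility, if the goodness curve is assumed to have a single minimum over the range of interest, a golden-section search locates it with few evaluations. In the sketch below, `goodness` is a hypothetical callable that would cluster the document set at a given log(threshold) and return its goodness value; here it is replaced by a toy function.

```python
import math


def locate_minimum(goodness, lo, hi, tol=1e-3):
    """Golden-section search for the minimiser of a unimodal function on [lo, hi]."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc, fd = goodness(c), goodness(d)
    while b - a > tol:
        if fc < fd:
            # The minimum lies in [a, d]; reuse c as the new upper probe.
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = goodness(c)
        else:
            # The minimum lies in [c, b]; reuse d as the new lower probe.
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = goodness(d)
    return (a + b) / 2


def toy_goodness(x):
    """Stand-in curve with a single minimum at x = 0.2 (illustrative only)."""
    return (x - 0.2) ** 2 + 1.0
```

Each loop iteration costs only one new goodness evaluation, so a search of this kind reduces K, the number of threshold evaluations, compared with scanning a fine grid of thresholds.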
- the taxonomy produced by clustering is stored in the taxonomy database 24.
- the postprocessor module 26 augments the clustered data by extracting titles from metadata of each document, and adding summary text generated by the clustering process, as described in the TACT specification.
- access logs, i.e., web server or proxy cache logs
- the clusters and/or documents within each cluster can be ranked using the access frequency of each document. For example, on a corporate web server, the most popular pages are listed near the top of each category listing, and/or the most popular categories are listed near the top of a listing of categories.
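Ranking by access frequency might be sketched as below. The example assumes that the access logs have already been reduced to a flat list of requested document URLs and that a category's popularity is the sum of its documents' hits; both are assumptions, as is the function name `rank_by_access`.

```python
from collections import Counter


def rank_by_access(cluster_docs, accessed_urls):
    """Order categories, and documents within each category, by hit count.

    cluster_docs maps a category name to its list of document URLs;
    accessed_urls holds one entry per logged request. Ties are broken
    alphabetically so that the ordering is deterministic.
    """
    hits = Counter(accessed_urls)
    ranked_docs = {
        name: sorted(docs, key=lambda url: (-hits[url], url))
        for name, docs in cluster_docs.items()
    }
    ranked_clusters = sorted(
        cluster_docs,
        key=lambda name: (-sum(hits[url] for url in cluster_docs[name]), name),
    )
    return ranked_clusters, ranked_docs
```

Documents that never appear in the logs receive a count of zero and therefore sink to the bottom of their category listing.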
- the management system 16 includes an editor 21 that allows the administrator to manually edit a taxonomy to create a new document hierarchy. This new structure can then be used as the training set for adding further documents to the database using the classifier function of the clusterer/classifier 22.
- the speed of document classification by the categorisation system can be improved by using the parallel classifiers 30 to classify many documents in parallel.
- the editor 21 offers a number of editing functions, including moving branches of the hierarchical taxonomy to other branches, editing meta descriptions for documents and branches, and creating, deleting, and merging new branches in the taxonomy.
- the editor 21 presents information from the taxonomy database 24 using HTML forms. Changes can then be made to the taxonomy by modifying input fields in the forms and then submitting the changes via submit buttons of the forms.
- the taxonomy rendering module 15 of the renderer 14 generates dynamic web pages using the taxonomy database 24 to provide structure to the original resource content. These web pages can be accessed by providing to the web browser 32 a URI associated with the web server module 17.
- the visual presentation provided by these web pages is derived from a configuration file detailing the arrangement of the various fields on the rendered page.
- the pages represent a web 'view' into the hierarchy using a 'directory' style wherein the URI of the displayed page corresponds to the position or branch within the taxonomy that is being browsed.
- Each level in the 'view' can contain documents and/or categories, i.e., deeper branches in the taxonomy. Browsing into a category produces a new view with a greater level of specificity.
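The 'directory' style of view can be illustrated by resolving a URI path against a nested taxonomy. The dictionary layout with `"categories"` and `"documents"` keys and the function name `resolve_view` are hypothetical; the specification stores the taxonomy in a database whose schema is not given here.

```python
def resolve_view(taxonomy, path):
    """Return (sub-category names, documents) for the branch named by `path`.

    Each path segment selects one category at the next level down, so
    "/products/phones" selects taxonomy -> products -> phones. A KeyError
    is raised if the path does not correspond to a branch.
    """
    node = taxonomy
    for segment in (s for s in path.split("/") if s):
        node = node.get("categories", {})[segment]
    return sorted(node.get("categories", {})), list(node.get("documents", []))


# Illustrative taxonomy with two levels below the root.
example_taxonomy = {
    "documents": [],
    "categories": {
        "products": {
            "documents": ["overview.html"],
            "categories": {"phones": {"documents": ["p1.html"]}},
        }
    },
}
```

For instance, `resolve_view(example_taxonomy, "/products/")` yields the sub-category `phones` together with the document `overview.html`, while the deeper path `/products/phones` yields only the more specific document `p1.html`.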
- Each branch in the taxonomy is initially labelled automatically by extracting descriptive information from the data during taxonomy generation, as described above, and is manually editable by invoking the editor module 21 of the management system 16.
- Documents are presented using their titles and summaries. Browsing to the document opens the document or a representation of the document.
Priority Applications (5)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP02782539A EP1456774A4 (en) | 2001-12-18 | 2002-12-18 | Information resource taxonomy |
US10/499,587 US8166030B2 (en) | 2001-12-18 | 2002-12-18 | Information resource taxonomy |
AU2002347222A AU2002347222B2 (en) | 2001-12-18 | 2002-12-18 | Information resource taxonomy |
NZ533673A NZ533673A (en) | 2001-12-18 | 2002-12-18 | Information resource taxonomy |
CA2470864A CA2470864C (en) | 2001-12-18 | 2002-12-18 | Information resource taxonomy |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AUPR9589A AUPR958901A0 (en) | 2001-12-18 | 2001-12-18 | Information resource taxonomy |
AUPR9589 | 2001-12-18 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2003052627A1 (en) | 2003-06-26 |
Family
ID=3833204
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/AU2002/001719 WO2003052627A1 (en) | 2001-12-18 | 2002-12-18 | Information resource taxonomy |
Country Status (6)
Country | Link |
---|---|
US (1) | US8166030B2 (en) |
EP (1) | EP1456774A4 (en) |
AU (2) | AUPR958901A0 (en) |
CA (1) | CA2470864C (en) |
NZ (1) | NZ533673A (en) |
WO (1) | WO2003052627A1 (en) |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2006018041A1 (en) * | 2004-08-13 | 2006-02-23 | Swiss Reinsurance Company | Speech and textual analysis device and corresponding method |
US7502765B2 (en) | 2005-12-21 | 2009-03-10 | International Business Machines Corporation | Method for organizing semi-structured data into a taxonomy, based on tag-separated clustering |
US7610313B2 (en) | 2003-07-25 | 2009-10-27 | Attenex Corporation | System and method for performing efficient document scoring and clustering |
US8639044B2 (en) | 2004-02-13 | 2014-01-28 | Fti Technology Llc | Computer-implemented system and method for placing cluster groupings into a display |
WO2014177726A2 (en) | 2014-05-20 | 2014-11-06 | Advanced Silicon Sa | Method, apparatus and computer program for localizing an active stylus on a capacitive touch device |
US8909647B2 (en) | 2009-07-28 | 2014-12-09 | Fti Consulting, Inc. | System and method for providing classification suggestions using document injection |
US9176642B2 (en) | 2005-01-26 | 2015-11-03 | FTI Technology, LLC | Computer-implemented system and method for displaying clusters via a dynamic user interface |
US9208592B2 (en) | 2005-01-26 | 2015-12-08 | FTI Technology, LLC | Computer-implemented system and method for providing a display of clusters |
US9275344B2 (en) | 2009-08-24 | 2016-03-01 | Fti Consulting, Inc. | Computer-implemented system and method for generating a reference set via seed documents |
US11068546B2 (en) | 2016-06-02 | 2021-07-20 | Nuix North America Inc. | Computer-implemented system and method for analyzing clusters of coded documents |
Families Citing this family (25)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20050131946A1 (en) * | 2003-04-09 | 2005-06-16 | Philip Korn | Method and apparatus for identifying hierarchical heavy hitters in a data stream |
US7325005B2 (en) * | 2004-07-30 | 2008-01-29 | Hewlett-Packard Development Company, L.P. | System and method for category discovery |
US7325006B2 (en) * | 2004-07-30 | 2008-01-29 | Hewlett-Packard Development Company, L.P. | System and method for category organization |
US20070174268A1 (en) * | 2006-01-13 | 2007-07-26 | Battelle Memorial Institute | Object clustering methods, ensemble clustering methods, data processing apparatus, and articles of manufacture |
US7546278B2 (en) * | 2006-03-13 | 2009-06-09 | Microsoft Corporation | Correlating categories using taxonomy distance and term space distance |
WO2008030510A2 (en) * | 2006-09-06 | 2008-03-13 | Nexplore Corporation | System and method for weighted search and advertisement placement |
US7783641B2 (en) * | 2006-10-26 | 2010-08-24 | Microsoft Corporation | Taxonometric personal digital media organization |
US8000955B2 (en) * | 2006-12-20 | 2011-08-16 | Microsoft Corporation | Generating Chinese language banners |
US20080189265A1 (en) * | 2007-02-06 | 2008-08-07 | Microsoft Corporation | Techniques to manage vocabulary terms for a taxonomy system |
US8375072B1 (en) * | 2007-04-12 | 2013-02-12 | United Services Automobile Association (Usaa) | Electronic file management hierarchical structure |
US7962507B2 (en) * | 2007-11-19 | 2011-06-14 | Microsoft Corporation | Web content mining of pair-based data |
US8050965B2 (en) * | 2007-12-14 | 2011-11-01 | Microsoft Corporation | Using a directed graph as an advertising system taxonomy |
US8099430B2 (en) * | 2008-12-18 | 2012-01-17 | International Business Machines Corporation | Computer method and apparatus of information management and navigation |
US8140531B2 (en) * | 2008-05-02 | 2012-03-20 | International Business Machines Corporation | Process and method for classifying structured data |
US8560485B2 (en) * | 2009-02-26 | 2013-10-15 | Fujitsu Limited | Generating a domain corpus and a dictionary for an automated ontology |
US8200671B2 (en) * | 2009-02-26 | 2012-06-12 | Fujitsu Limited | Generating a dictionary and determining a co-occurrence context for an automated ontology |
US8949184B2 (en) | 2010-04-26 | 2015-02-03 | Microsoft Technology Licensing, Llc | Data collector |
AU2011279556A1 (en) * | 2010-07-16 | 2013-02-14 | First Wave Technology Pty Ltd | Methods and systems for analysis and/or classification of information |
WO2012172046A1 (en) * | 2011-06-15 | 2012-12-20 | The Provost, Fellows, Foundation Scholars, And The Other Members Of Board, Of The College Of The Holy And Undivided Trinity Of Queen Elizabeth, Near Dublin | A network system for generating application specific hypermedia content from multiple sources |
EP2836920A4 (en) * | 2012-04-09 | 2015-12-02 | Vivek Ventures Llc | Clustered information processing and searching with structured-unstructured database bridge |
CN103678335B (en) * | 2012-09-05 | 2017-12-08 | 阿里巴巴集团控股有限公司 | The method of method, apparatus and the commodity navigation of commodity sign label |
US9448966B2 (en) * | 2013-04-26 | 2016-09-20 | Futurewei Technologies, Inc. | System and method for creating highly scalable high availability cluster in a massively parallel processing cluster of machines in a network |
US10331621B1 (en) * | 2013-09-19 | 2019-06-25 | Trifacta Inc. | System and method for displaying a sample of uniform and outlier rows from a file |
US11531705B2 (en) * | 2018-11-16 | 2022-12-20 | International Business Machines Corporation | Self-evolving knowledge graph |
US20230130502A1 (en) * | 2021-10-21 | 2023-04-27 | Paypal, Inc. | Entity clustering |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP0704810A1 (en) * | 1994-09-30 | 1996-04-03 | Hitachi, Ltd. | Method and apparatus for classifying document information |
US5974412A (en) * | 1997-09-24 | 1999-10-26 | Sapient Health Network | Intelligent query system for automatically indexing information in a database and automatically categorizing users |
WO2000062203A1 (en) * | 1999-04-09 | 2000-10-19 | Semio Corporation | System and method for generating a taxonomy from a plurality of documents |
US6360227B1 (en) * | 1999-01-29 | 2002-03-19 | International Business Machines Corporation | System and method for generating taxonomies with applications to content-based recommendations |
US6446061B1 (en) * | 1998-07-31 | 2002-09-03 | International Business Machines Corporation | Taxonomy generation for document collections |
Family Cites Families (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5442778A (en) * | 1991-11-12 | 1995-08-15 | Xerox Corporation | Scatter-gather: a cluster-based method and apparatus for browsing large document collections |
JP3284641B2 (en) * | 1992-09-03 | 2002-05-20 | ソニー株式会社 | Method for optimizing measurement conditions of overlay accuracy measuring machine and method for optimizing alignment mark shape or alignment mark measurement method in exposure apparatus |
US6460036B1 (en) * | 1994-11-29 | 2002-10-01 | Pinpoint Incorporated | System and method for providing customized electronic newspapers and target advertisements |
US5768580A (en) | 1995-05-31 | 1998-06-16 | Oracle Corporation | Methods and apparatus for dynamic classification of discourse |
US5887120A (en) * | 1995-05-31 | 1999-03-23 | Oracle Corporation | Method and apparatus for determining theme for discourse |
US5708822A (en) * | 1995-05-31 | 1998-01-13 | Oracle Corporation | Methods and apparatus for thematic parsing of discourse |
US6178396B1 (en) * | 1996-08-02 | 2001-01-23 | Fujitsu Limited | Word/phrase classification processing method and apparatus |
JP2940501B2 (en) * | 1996-12-25 | 1999-08-25 | 日本電気株式会社 | Document classification apparatus and method |
US5963965A (en) * | 1997-02-18 | 1999-10-05 | Semio Corporation | Text processing and retrieval system and method |
US5819258A (en) * | 1997-03-07 | 1998-10-06 | Digital Equipment Corporation | Method and apparatus for automatically generating hierarchical categories from large document collections |
US6185550B1 (en) * | 1997-06-13 | 2001-02-06 | Sun Microsystems, Inc. | Method and apparatus for classifying documents within a class hierarchy creating term vector, term file and relevance ranking |
US6137911A (en) * | 1997-06-16 | 2000-10-24 | The Dialog Corporation Plc | Test classification system and method |
US6233575B1 (en) * | 1997-06-24 | 2001-05-15 | International Business Machines Corporation | Multilevel taxonomy based on features derived from training documents classification using fisher values as discrimination values |
US6424982B1 (en) | 1999-04-09 | 2002-07-23 | Semio Corporation | System and method for parsing a document using one or more break characters |
US6711585B1 (en) * | 1999-06-15 | 2004-03-23 | Kanisa Inc. | System and method for implementing a knowledge management system |
AUPQ591800A0 (en) | 2000-02-25 | 2000-03-23 | Sola International Holdings Ltd | System for prescribing and/or dispensing ophthalmic lenses |
AUPQ684400A0 (en) | 2000-04-11 | 2000-05-11 | Telstra R & D Management Pty Ltd | A gradient based training method for a support vector machine |
AUPR033800A0 (en) | 2000-09-25 | 2000-10-19 | Telstra R & D Management Pty Ltd | A document categorisation system |
US7376620B2 (en) * | 2001-07-23 | 2008-05-20 | Consona Crm Inc. | System and method for measuring the quality of information retrieval |
-
2001
- 2001-12-18 AU AUPR9589A patent/AUPR958901A0/en not_active Abandoned
-
2002
- 2002-12-18 NZ NZ533673A patent/NZ533673A/en not_active IP Right Cessation
- 2002-12-18 CA CA2470864A patent/CA2470864C/en not_active Expired - Fee Related
- 2002-12-18 US US10/499,587 patent/US8166030B2/en not_active Expired - Fee Related
- 2002-12-18 WO PCT/AU2002/001719 patent/WO2003052627A1/en not_active Application Discontinuation
- 2002-12-18 EP EP02782539A patent/EP1456774A4/en not_active Ceased
- 2002-12-18 AU AU2002347222A patent/AU2002347222B2/en not_active Ceased
Cited By (33)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7610313B2 (en) | 2003-07-25 | 2009-10-27 | Attenex Corporation | System and method for performing efficient document scoring and clustering |
US9858693B2 (en) | 2004-02-13 | 2018-01-02 | Fti Technology Llc | System and method for placing candidate spines into a display with the aid of a digital computer |
US9342909B2 (en) | 2004-02-13 | 2016-05-17 | FTI Technology, LLC | Computer-implemented system and method for grafting cluster spines |
US8942488B2 (en) | 2004-02-13 | 2015-01-27 | FTI Technology, LLC | System and method for placing spine groups within a display |
US9384573B2 (en) | 2004-02-13 | 2016-07-05 | Fti Technology Llc | Computer-implemented system and method for placing groups of document clusters into a display |
US9619909B2 (en) | 2004-02-13 | 2017-04-11 | Fti Technology Llc | Computer-implemented system and method for generating and placing cluster groups |
US8639044B2 (en) | 2004-02-13 | 2014-01-28 | Fti Technology Llc | Computer-implemented system and method for placing cluster groupings into a display |
US9984484B2 (en) | 2004-02-13 | 2018-05-29 | Fti Consulting Technology Llc | Computer-implemented system and method for cluster spine group arrangement |
US9082232B2 (en) | 2004-02-13 | 2015-07-14 | FTI Technology, LLC | System and method for displaying cluster spine groups |
US9495779B1 (en) | 2004-02-13 | 2016-11-15 | Fti Technology Llc | Computer-implemented system and method for placing groups of cluster spines into a display |
US9245367B2 (en) | 2004-02-13 | 2016-01-26 | FTI Technology, LLC | Computer-implemented system and method for building cluster spine groups |
US8428935B2 (en) | 2004-08-13 | 2013-04-23 | Infocodex Ag | Neural network for classifying speech and textural data based on agglomerates in a taxonomy table |
WO2006018041A1 (en) * | 2004-08-13 | 2006-02-23 | Swiss Reinsurance Company | Speech and textual analysis device and corresponding method |
WO2006018411A3 (en) * | 2004-08-13 | 2006-06-08 | Swiss Reinsurance Co | Speech and textual analysis device and corresponding method |
WO2006018411A2 (en) * | 2004-08-13 | 2006-02-23 | Swiss Reinsurance Company | Speech and textual analysis device and corresponding method |
US9176642B2 (en) | 2005-01-26 | 2015-11-03 | FTI Technology, LLC | Computer-implemented system and method for displaying clusters via a dynamic user interface |
US9208592B2 (en) | 2005-01-26 | 2015-12-08 | FTI Technology, LLC | Computer-implemented system and method for providing a display of clusters |
US7502765B2 (en) | 2005-12-21 | 2009-03-10 | International Business Machines Corporation | Method for organizing semi-structured data into a taxonomy, based on tag-separated clustering |
US9542483B2 (en) | 2009-07-28 | 2017-01-10 | Fti Consulting, Inc. | Computer-implemented system and method for visually suggesting classification for inclusion-based cluster spines |
US9898526B2 (en) | 2009-07-28 | 2018-02-20 | Fti Consulting, Inc. | Computer-implemented system and method for inclusion-based electronically stored information item cluster visual representation |
US9336303B2 (en) | 2009-07-28 | 2016-05-10 | Fti Consulting, Inc. | Computer-implemented system and method for providing visual suggestions for cluster classification |
US9477751B2 (en) | 2009-07-28 | 2016-10-25 | Fti Consulting, Inc. | System and method for displaying relationships between concepts to provide classification suggestions via injection |
US10083396B2 (en) | 2009-07-28 | 2018-09-25 | Fti Consulting, Inc. | Computer-implemented system and method for assigning concept classification suggestions |
US8909647B2 (en) | 2009-07-28 | 2014-12-09 | Fti Consulting, Inc. | System and method for providing classification suggestions using document injection |
US9165062B2 (en) | 2009-07-28 | 2015-10-20 | Fti Consulting, Inc. | Computer-implemented system and method for visual document classification |
US9064008B2 (en) | 2009-07-28 | 2015-06-23 | Fti Consulting, Inc. | Computer-implemented system and method for displaying visual classification suggestions for concepts |
US9679049B2 (en) | 2009-07-28 | 2017-06-13 | Fti Consulting, Inc. | System and method for providing visual suggestions for document classification via injection |
US9275344B2 (en) | 2009-08-24 | 2016-03-01 | Fti Consulting, Inc. | Computer-implemented system and method for generating a reference set via seed documents |
US9336496B2 (en) | 2009-08-24 | 2016-05-10 | Fti Consulting, Inc. | Computer-implemented system and method for generating a reference set via clustering |
US9489446B2 (en) | 2009-08-24 | 2016-11-08 | Fti Consulting, Inc. | Computer-implemented system and method for generating a training set for use during document review |
US10332007B2 (en) | 2009-08-24 | 2019-06-25 | Nuix North America Inc. | Computer-implemented system and method for generating document training sets |
WO2014177726A2 (en) | 2014-05-20 | 2014-11-06 | Advanced Silicon Sa | Method, apparatus and computer program for localizing an active stylus on a capacitive touch device |
US11068546B2 (en) | 2016-06-02 | 2021-07-20 | Nuix North America Inc. | Computer-implemented system and method for analyzing clusters of coded documents |
Also Published As
Publication number | Publication date |
---|---|
AU2002347222B2 (en) | 2008-05-29 |
CA2470864A1 (en) | 2003-06-26 |
CA2470864C (en) | 2014-08-12 |
AUPR958901A0 (en) | 2002-01-24 |
NZ533673A (en) | 2006-04-28 |
AU2002347222A1 (en) | 2003-06-30 |
EP1456774A4 (en) | 2005-02-23 |
EP1456774A1 (en) | 2004-09-15 |
US20050080781A1 (en) | 2005-04-14 |
US8166030B2 (en) | 2012-04-24 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CA2470864C (en) | Information resource taxonomy | |
US6182091B1 (en) | Method and apparatus for finding related documents in a collection of linked documents using a bibliographic coupling link analysis | |
US7428533B2 (en) | Automatic generation of taxonomies for categorizing queries and search query processing using taxonomies | |
JP3665480B2 (en) | Document organizing apparatus and method | |
US6457028B1 (en) | Method and apparatus for finding related collections of linked documents using co-citation analysis | |
US8312035B2 (en) | Search engine enhancement using mined implicit links | |
US6286018B1 (en) | Method and apparatus for finding a set of documents relevant to a focus set using citation analysis and spreading activation techniques | |
He et al. | Automatic topic identification using webpage clustering | |
US8965894B2 (en) | Automated web page classification | |
US20110307479A1 (en) | Automatic Extraction of Structured Web Content | |
Al-asadi et al. | A survey on web mining techniques and applications | |
CN112597370A (en) | Webpage information autonomous collecting and screening system with specified demand range | |
Moumtzidou et al. | Discovery of environmental nodes in the web | |
Abramowicz et al. | Supporting topic map creation using data mining techniques | |
US10380195B1 (en) | Grouping documents by content similarity | |
Tian et al. | Two-phase web site classification based on hidden markov tree models | |
KR20010102687A (en) | Method and System for Web Documents Sort Using Category Learning Skill | |
Gong et al. | An implementation of web image search engines | |
JPH10222534A (en) | Device for retrieving information | |
McCurley et al. | Mining and knowledge discovery from the Web | |
Jiang et al. | Applying associative relationship on the clickthrough data to improve web search | |
Srinivasan et al. | Identifying a threshold choice for the search engine users to reduce the information overload using link based replica removal in personalized search engine user profile | |
Ola et al. | Modified Page Ranking System | |
Onyejegbu et al. | Modified Page Ranking System | |
Htay et al. | International Journal of Engineering Technology Research & Management |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| | AK | Designated states | Kind code of ref document: A1; Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ OM PH PL PT RO RU SC SD SE SG SK SL TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW |
| | AL | Designated countries for regional patents | Kind code of ref document: A1; Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR IE IT LU MC NL PT SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG |
| | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | |
| | DFPE | Request for preliminary examination filed prior to expiration of 19th month from priority date (PCT application filed before 20040101) | |
| | WWE | WIPO information: entry into national phase | Ref document number: 2002347222; Country of ref document: AU |
| | WWE | WIPO information: entry into national phase | Ref document number: 2470864; Country of ref document: CA |
| | WWE | WIPO information: entry into national phase | Ref document number: 2002782539; Country of ref document: EP; Ref document number: 533673; Country of ref document: NZ |
| | WWP | WIPO information: published in national office | Ref document number: 2002782539; Country of ref document: EP |
| | WWE | WIPO information: entry into national phase | Ref document number: 10499587; Country of ref document: US |
| | WWP | WIPO information: published in national office | Ref document number: 533673; Country of ref document: NZ |
| | NENP | Non-entry into the national phase | Ref country code: JP |
| | WWW | WIPO information: withdrawn in national office | Ref document number: JP |