WO2002041161A1 - Method and apparatus for efficient identification of duplicate and near-duplicate documents and text spans using high-discriminability text fragments - Google Patents
- Publication number
- WO2002041161A1 (PCT/US2001/048124)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- documents
- document
- computer
- distinctive
- text
- Prior art date
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3346—Query execution using probabilistic model
Definitions
- This invention relates to a computer-assisted method and apparatus for identifying duplicate and near-duplicate documents or text spans in a collection of documents or text spans, respectively.
- 2. Description of the Prior Art
- The current art includes inventions that compare a single pair of known-to-be-similar documents to identify the differences between the documents.
- The Unix "diff" program uses an efficient algorithm for finding the longest common subsequence (LCS) between two sequences, such as the lines in two documents. Aho, Hopcroft, and Ullman, Data Structures and Algorithms, Addison-Wesley Publishing Company, April 1987, pages 189-192.
- The lines that are left when the LCS is removed represent the changes needed to transform one document into another. Additionally, U.S. Patent No. 4,807,182 uses anchor points (points in common between two files) to identify differences between an original and a modified version of a document.
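The LCS idea can be illustrated with a minimal sketch. This is the textbook dynamic-programming formulation, not the more efficient algorithm used by the actual "diff" program; the function name and the sample lines are hypothetical.

```python
def longest_common_subsequence(a, b):
    """Return one longest common subsequence of the sequences a and b."""
    m, n = len(a), len(b)
    # table[i][j] holds the LCS length of the suffixes a[i:] and b[j:].
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = 1 + table[i + 1][j + 1]
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    # Walk the table from the start to recover one subsequence.
    common, i, j = [], 0, 0
    while i < m and j < n:
        if a[i] == b[j]:
            common.append(a[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return common


# Hypothetical documents, one list entry per line.
old_lines = ["alpha", "beta", "gamma", "delta"]
new_lines = ["alpha", "gamma", "delta", "epsilon"]
common = longest_common_subsequence(old_lines, new_lines)
```

The lines of `old_lines` and `new_lines` that are not part of `common` are the deletions and insertions that turn one document into the other.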
- Another approach for comparing documents is to compute a checksum for each document. If two documents have the same checksum, they are likely to be identical. But comparing documents using checksums is an extremely fragile method, since even a single character change in a document yields a different checksum. Thus, checksums are good for identifying exact duplicates, but not for identifying near-duplicates.
- U.S. Patent No. 5,680,611 teaches the use of checksums to identify duplicate records.
- U.S. Patent No. 5,898,836 discloses the use of checksums to identify whether a region of a document has changed by comparing checksums for sub-document passages, for example, the text between HTML tags.
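The fragility of checksum comparison can be shown with a short sketch. SHA-256 is used here only as a convenient stand-in; the cited patents do not specify it, and the sample strings are hypothetical.

```python
import hashlib


def checksum(text: str) -> str:
    """Return a hex digest of the text (SHA-256 chosen for illustration)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


original = "The quick brown fox jumps over the lazy dog."
exact_copy = "The quick brown fox jumps over the lazy dog."
near_copy = "The quick brown fox jumps over the lazy dog!"  # one character changed

# Exact duplicates share a checksum; a one-character edit does not.
exact_match = checksum(original) == checksum(exact_copy)
near_match = checksum(original) == checksum(near_copy)
```

Here `exact_match` is true and `near_match` is false, which is why a checksum alone cannot flag near-duplicates.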
- This technique depends on the frequency of the n-grams within the document by requiring the n-grams and all sub-parts (at least the prefix sub-parts) to be of high frequency.
- The Juola method focuses on applications involving very small training corpora, and has been applied to a variety of areas, including language identification, determining authorship of a document, and text classification. The method does not provide a measure of distinctiveness.
- The prior art does not compare more than two documents, does not allow text fragments in each document to appear in a different or arbitrary order, is not selective in the choice of n-grams used to compare the documents, does not use the frequency of the n-grams across documents for selecting n-grams used to compare the documents, and does not permit a mixture of very low frequency and very high frequency components in the n-grams.
- Near-duplicate documents contain long stretches of identical text that are not present in other, non-duplicate documents.
- The long text fragments that are present in only a few documents represent distinctive features that can be used to distinguish similar documents from dissimilar documents in a robust fashion.
- These text fragments represent a kind of "signature" for a document which can be used to match the document with near-duplicate documents and to distinguish the document from non-duplicate documents.
- Documents that overlap significantly on such text fragments will most likely be duplicates or near-duplicates. Overlap occurs not just when the text is excerpted, but also when deliberate changes have been made to the text, such as paraphrasing, interspersing comments by another author, and outright plagiarism.
- The present invention identifies duplicate and near-duplicate documents and text spans by identifying a small number of distinctive features for each document, for example, distinctive word n-grams likely to appear in duplicate or near-duplicate documents.
- The features act as a proxy for the full document, allowing the invention to compare documents by comparing their distinctive features.
- Documents having at least one feature in common are compared with each other.
- Near-duplicate documents are identified by counting the proportion of the features in common between the two documents.
- A key to the effectiveness of this method is the ability to find distinctive features.
- The features need to be rare enough to be common among only near-duplicate documents, but not so rare as to be specific to just one document.
- An individual word may not be rare enough, but an n-gram containing the word might be. Longer n-grams might be too rare.
- The distinctive features may include glue words (i.e., very common words) within the features but, preferably, not at either end.
- Distinctive features may include words that are common to just a few documents and/or words that are common to all but a few documents.
- Applications of the present invention include removing redundancy in document collections (including web catalogs and search engines), matching summary sentences with corresponding document sentences, and detection of plagiarism and copyright infringement for text documents and passages.
- FIG. 1 is a flow diagram of a first embodiment of a method according to the present invention as applied to documents;
- Fig. 2 is a flow diagram of a second embodiment of a method according to the present invention as applied to documents;
- Figs. 3A and 3B are a flow diagram of a third embodiment of a method according to the present invention as applied to documents;
- Fig. 3C is an illustration of a document index;
- Fig. 3D is an illustration of a feature index;
- Fig. 3E is an illustration of a list 324;
- Fig. 3F is an illustration of a list 330;
- Fig. 3G is an illustration of a list 336;
- Fig. 4 is a flow diagram of an embodiment of a method according to the present invention as applied to text spans;
- Fig. 5 is a flow diagram of an embodiment of a method according to the present invention as applied to images.
- Fig. 6 is an illustration of an apparatus according to the present invention.
- DESCRIPTION OF THE PREFERRED EMBODIMENTS
- Step 110 identifies distinctive features in the document collection 100 and in each document in the collection 100.
- Loop 112 iterates for each pair of documents. Within loop 112, step 114 determines if the pair of documents has at least one distinctive feature in common. If they do, the pair is compared in step 116 to determine if they are duplicate or near-duplicate documents. Loop 112 then continues with the next pair of documents. If the pair of documents does not have at least one distinctive feature in common, no comparison is performed, and loop 112 continues with the next pair of documents.
- The method illustrated in Fig. 1 can be applied to, for example: removing duplicates in document collections; detecting plagiarism; detecting copyright infringement; determining the authorship of a document; clustering successive versions of a document from among a collection of documents; seeding a text classification or text clustering algorithm with sets of duplicate or near-duplicate documents; matching an e-mail message with responses to the e-mail message, and vice versa; and creating a document index for use with a query system to efficiently find documents that contain a particular phrase or excerpt in response to a query, even if the particular phrase or excerpt was not recorded correctly in the document or the query.
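The pairwise loop of Fig. 1 (loop 112 with the shared-feature test of step 114) might be sketched as follows. The function name, the mapping from document identifier to feature set, and the sample data are hypothetical; the detailed comparison of step 116 is left to the caller.

```python
from itertools import combinations


def candidate_pairs(features_by_doc):
    """Yield each pair of document ids that shares at least one distinctive feature.

    features_by_doc maps a document id to the set of its distinctive features.
    Pairs with no feature in common are skipped without further comparison.
    """
    for doc_a, doc_b in combinations(sorted(features_by_doc), 2):
        if features_by_doc[doc_a] & features_by_doc[doc_b]:
            yield doc_a, doc_b


# Hypothetical collection of three documents.
features_by_doc = {
    "d1": {"acme merger talks", "quarterly earnings call"},
    "d2": {"acme merger talks"},
    "d3": {"rainfall totals"},
}
pairs = list(candidate_pairs(features_by_doc))
```

Only the pair ("d1", "d2") survives the filter, so only that pair would reach the comparison step.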
- The method can also be applied to augmenting information retrieval or text classification algorithms that use single-word terms with a small number of multiword terms.
- Algorithms of this type that are based on a bag-of-words model assume that each word appears independently. Although such algorithms can be extended to apply to word bigrams, trigrams, and so on, allowing all word n-grams of a particular length rapidly becomes computationally unmanageable.
- The present invention may be used to generate a small list of word n-grams to augment the bag-of-words index. These word n-grams are likely to distinguish documents. Therefore, if they are present in a query, they can help narrow the search results considerably. This is in contrast to methods based on word co-occurrence statistics which yield word n-grams that are rather common in the document set.
- The method illustrated in Fig. 1 may be used to determine whether documents are duplicates or near-duplicates even if the distinctive features appear in a different order in each document.
- The distinctive features may be distinctive text fragments found within the collection of documents 100.
- The method may be applied to information retrieval methods, such as a text classification method or any information retrieval method that assumes word independence and adds the distinctive text fragments to an index set.
- The distinctive text fragments may be sequences of at least two words that appear in a limited number of documents in the document collection 100. If one distinctive text fragment is found within another distinctive text fragment, only the longest distinctive text fragment may be considered as a feature.
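Keeping only the longest fragment when one fragment is found within another could look like the sketch below. Fragments are represented as tuples of words, which is an assumption made for illustration, as are the function names.

```python
def contains(longer, shorter):
    """Return True if the word tuple `shorter` occurs contiguously inside `longer`."""
    n = len(shorter)
    return any(longer[i:i + n] == shorter for i in range(len(longer) - n + 1))


def keep_longest(fragments):
    """Drop every fragment that is contained in another, longer fragment."""
    fragments = set(fragments)
    return {
        f for f in fragments
        if not any(f != g and contains(g, f) for g in fragments)
    }


# Hypothetical fragments: the first is nested inside the second.
fragments = {
    ("united", "states"),
    ("united", "states", "of", "america"),
    ("merger", "talks"),
}
features = keep_longest(fragments)
```

The quadratic comparison is adequate for a small per-document fragment set; a larger collection would likely call for an index over fragment words.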
- A sequence of at least two words may be considered as appearing in a document when the document contains the sequence of at least two words at least a user-specified minimum number of times or a user-specified minimum frequency. The frequency may be defined as the number of occurrences in the document divided by the length of the document.
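The appearance test can be sketched as below, assuming the document length is measured in words (the text does not say which unit is used). The function and parameter names are hypothetical.

```python
def count_occurrences(words, phrase):
    """Count contiguous occurrences of the word list `phrase` in `words`."""
    n = len(phrase)
    return sum(1 for i in range(len(words) - n + 1) if words[i:i + n] == phrase)


def appears(words, phrase, min_count=1, min_frequency=None):
    """Decide whether `phrase` counts as appearing in the document `words`.

    If min_frequency is given, occurrences divided by document length (in words,
    an assumption) must reach it; otherwise the raw count must reach min_count.
    """
    occurrences = count_occurrences(words, phrase)
    if min_frequency is not None:
        return len(words) > 0 and occurrences / len(words) >= min_frequency
    return occurrences >= min_count


# Hypothetical six-word document in which the phrase occurs twice.
words = "big deal big deal small talk".split()
phrase = ["big", "deal"]
```

With these inputs the phrase passes `min_count=2` and `min_frequency=0.3` (2/6 is about 0.33) but fails `min_count=3` and `min_frequency=0.5`.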
- A distinctiveness score may be calculated and the highest scoring sequences that are found in at least two documents in the document collection 100 may be considered as text fragments.
- The distinctiveness score may be the reciprocal of the number of documents containing the phrase multiplied by a monotonic function of the number of words in the phrase, where the monotonic function may be the number of words in the phrase.
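One way to read the score is (1 / number of documents containing the phrase) times f(number of words), with f defaulting to the identity. The sketch below follows that reading; the names, the `limit` parameter, and the sample counts are hypothetical.

```python
def distinctiveness(doc_count, n_words, length_weight=lambda n: n):
    """Reciprocal of the document count times a monotonic function of phrase length."""
    return length_weight(n_words) / doc_count


def top_fragments(doc_counts, limit=10):
    """Rank phrases found in at least two documents by distinctiveness, best first.

    doc_counts maps a phrase (tuple of words) to the number of documents containing it.
    """
    eligible = {p: df for p, df in doc_counts.items() if df >= 2}
    ranked = sorted(
        eligible,
        key=lambda p: distinctiveness(eligible[p], len(p)),
        reverse=True,
    )
    return ranked[:limit]


# Hypothetical document counts for three candidate phrases.
doc_counts = {
    ("acme", "merger", "talks"): 2,   # score 3 / 2 = 1.5
    ("merger", "talks"): 5,           # score 2 / 5 = 0.4
    ("one", "off", "phrase"): 1,      # excluded: found in only one document
}
best = top_fragments(doc_counts)
```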
- The limited number restricting the number of documents having the sequence of at least two words may be selected by a user as a constant or a percentage.
- The limited number may be defined by a linear function of the number of documents in the document collection 100, such as a linear function of the square root or logarithm of the number of documents in the document collection 100.
- The distinctive text fragments may include glue words (i.e., words that appear in almost all of the documents and for which their absence is distinctive). Glue words include stopwords like "the" and "of" and allow phrases like "United States of America" to be counted as distinctive phrases.
- The method may exclude glue words that appear at either extreme of the distinctive text fragment.
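Excluding glue words at either extreme while keeping them inside a fragment can be sketched as a simple trim. The function name and the glue-word set are hypothetical.

```python
def trim_glue(words, glue_words):
    """Strip glue words from both ends of a fragment, keeping interior ones."""
    start, end = 0, len(words)
    while start < end and words[start] in glue_words:
        start += 1
    while end > start and words[end - 1] in glue_words:
        end -= 1
    return words[start:end]


glue_words = {"the", "of"}  # hypothetical glue-word set
fragment = ["of", "united", "states", "of", "america", "the"]
trimmed = trim_glue(fragment, glue_words)
```

The interior "of" survives, so "united states of america" remains a single fragment.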
- The sequence of at least two words may be considered as appearing in a document when the document contains the sequence of at least two words at least a user-specified minimum number of times or a user-specified minimum frequency.
- The frequency may be defined as the number of occurrences in the document divided by the length of the document.
- Fig. 2 illustrates another embodiment of the present invention which finds duplicate or near-duplicate documents within a document collection 200.
- Step 210 identifies distinctive features of the documents in the document collection 200 and in each document in the collection 200.
- Loop 212 iterates for each pair of documents. Within loop 212, step 214 determines if the pair of documents has at least one distinctive feature in common. If they do, step 216 divides the number of features that the pair of documents has in common by the smaller number of the number of features in each document.
- Step 218 determines whether the result of step 216 is greater than a threshold value.
- The threshold value may be a constant, a fixed percentage of the number of documents in the document collection 200, the logarithm of the number of documents, or the square root of the number of documents.
- If the result is greater than the threshold, step 220 deems the documents duplicates or near-duplicates, and loop 212 continues with the next pair of documents. If the result is not greater than the threshold, the documents are not duplicates or near-duplicates, and loop 212 continues.
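Steps 216 through 220 can be sketched as an overlap ratio followed by a threshold test. The default threshold of 0.5 is an arbitrary value chosen for illustration, and the function names and sample sets are hypothetical.

```python
def overlap_ratio(features_a, features_b):
    """Shared features divided by the smaller of the two feature counts."""
    if not features_a or not features_b:
        return 0.0
    return len(features_a & features_b) / min(len(features_a), len(features_b))


def is_near_duplicate(features_a, features_b, threshold=0.5):
    """Deem a pair duplicate or near-duplicate when the ratio exceeds the threshold."""
    return overlap_ratio(features_a, features_b) > threshold


long_doc = {"f1", "f2", "f3", "f4"}
excerpt = {"f1", "f2"}                # every feature also occurs in long_doc
unrelated = {"f1", "x", "y", "z"}     # shares a single feature with long_doc
```

Dividing by the smaller count gives the excerpt a ratio of 1.0 against the longer document, while the mostly unrelated document scores 0.25.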
- Figs. 3A and 3B show another embodiment of the present invention which finds duplicate or near-duplicate documents within a document collection 300.
- Step 310 identifies distinctive features of the documents in the document collection 300 and in each document in the collection 300.
- Step 312 builds a document index 314 and step 316 builds a feature index 318.
- The document index 314 maps each document to the features contained therein.
- The feature index 318 maps the features to the documents that contain them.
- The indexes 314 and 318 are built in a manner that ignores duplicates (i.e., if a feature is repeated within a document, it is mapped only once).
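The two indexes can be sketched as a pair of dictionaries of sets, so that a feature repeated within a document is recorded only once. The function name and the input shape (document id mapped to a list of features) are hypothetical.

```python
from collections import defaultdict


def build_indexes(features_by_doc):
    """Build a document index and a feature index from per-document feature lists.

    features_by_doc maps a document id to a list of features, possibly with repeats.
    Returns (document_index, feature_index), both mapping to sets.
    """
    document_index = {}
    feature_index = defaultdict(set)
    for doc_id, features in features_by_doc.items():
        unique = set(features)          # repeated features are mapped only once
        document_index[doc_id] = unique
        for feature in unique:
            feature_index[feature].add(doc_id)
    return document_index, dict(feature_index)


# Hypothetical input: feature "a" is repeated in document d1.
document_index, feature_index = build_indexes({"d1": ["a", "b", "a"], "d2": ["b"]})
```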
- Loop 320 iterates through each document such that step 322 can create a list 324 that includes each unique distinctive feature that was identified in step 310.
- Step 326 iterates through the feature index 318 so that step 328 can create a list 330 that includes each distinctive feature and the documents in which the distinctive feature is located.
- Step 334 creates a list 336 of pairs of documents that have at least one feature in common and the number of features they have in common.
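A list like list 336 can be derived from the feature index alone, without visiting every pair of documents. The sketch below is one possible construction, not necessarily the two-step construction shown in Fig. 3G; the names and sample index are hypothetical.

```python
from collections import Counter
from itertools import combinations


def shared_feature_counts(feature_index):
    """Count, for each pair of documents, how many features they have in common.

    feature_index maps a feature to the set of document ids containing it.
    Only pairs that share at least one feature ever appear in the result.
    """
    counts = Counter()
    for docs in feature_index.values():
        for pair in combinations(sorted(docs), 2):
            counts[pair] += 1
    return counts


# Hypothetical feature index over three documents.
feature_index = {
    "a": {"d1", "d2"},
    "b": {"d1", "d2", "d3"},
    "c": {"d3"},
}
pair_counts = shared_feature_counts(feature_index)
```

Because each feature is limited to a small number of documents, the inner loop stays short for every feature.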
- Loop 338 iterates through list 336.
- Step 340 divides the number of features that the pair of documents has in common by the smaller number of the number of features in each document (from the document index 314).
- Step 342 determines whether the result of step 340 is greater than a threshold value.
- The threshold value may, for example, be a constant, a fixed percentage of the number of documents in the document collection 300, the logarithm of the number of documents, or the square root of the number of documents.
- If the result is greater than the threshold, step 344 deems the documents duplicates or near-duplicates, and loop 338 continues with the next pair of documents. If the result is not greater than the threshold, the documents are not duplicates or near-duplicates, and loop 338 continues.
- Fig. 3C illustrates an example format for the document index 314.
- Fig. 3D illustrates the feature index 318
- Fig. 3E illustrates list 324
- Fig. 3F illustrates list 330
- Fig. 3G illustrates list 336 as constructed in two steps.
- A method according to the present invention is utilized to find duplicate or near-duplicate text spans, including sentences, within a text span collection 400.
- The text spans in the collection 400 may be sentences.
- Step 410 identifies distinctive features of the text spans in the text span collection 400 and in each text span in the collection 400.
- Loop 412 iterates for each pair of text spans. Within loop 412, step 414 determines if the pair of text spans has at least one distinctive feature in common. If they do, the pair is compared in step 416 to determine if they are duplicate or near-duplicate text spans. Loop 412 then continues with the next pair of text spans. If the pair of text spans does not have at least one distinctive feature in common, no comparison is performed, and loop 412 continues with the next pair of text spans.
- This method may be used to match sentences from one document with sentences from another. This would be useful in matching sentences of a human-written summary for an original document with sentences from the original document. Similarly, in a plagiarism detector, once the method as applied to documents has found duplicate documents, the sentence version can be used to match sentences in the plagiarized copy with the corresponding sentences from the original document. Another application of sentence matching would identify changes made to a document in a word processing application where such changes need not retain the sentences, lines, or other text fragments in the original order.
- Step 510 identifies distinctive features of the images in the image collection 500 and in each image in the collection 500.
- The distinctive features may be sequences of at least two adjacent tiles from the images.
- Loop 512 iterates for each pair of images. Within loop 512, step 514 determines if the pair of images has at least one distinctive feature in common. If they do, the pair is compared in step 516 to determine if they are duplicate or near-duplicate images. Loop 512 then continues with the next pair of images. If the pair of images does not have at least one distinctive feature in common, no comparison is performed, and loop 512 continues with the next pair of images.
- The method performs canonicalization of the images by converting them to black and white and sampling them at several resolutions.
- Small overlapping tiles correspond to words and horizontal and vertical sequences to text fragments.
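The tile analogy can be sketched on a plain grid of grayscale values. Everything specific here is an assumption made for illustration: the cutoff of 128, the 2x2 tile size, a stride of one pixel, and treating the tile one position to the right as "adjacent". Sampling at several resolutions is not shown.

```python
def to_black_and_white(gray, cutoff=128):
    """Convert a grid of 0-255 grayscale values to 0/1 pixels (cutoff is assumed)."""
    return [[1 if px >= cutoff else 0 for px in row] for row in gray]


def tiles(bw, size=2):
    """Map each top-left position to its overlapping size x size tile (stride 1)."""
    rows, cols = len(bw), len(bw[0])
    return {
        (r, c): tuple(tuple(bw[r + dr][c:c + size]) for dr in range(size))
        for r in range(rows - size + 1)
        for c in range(cols - size + 1)
    }


def horizontal_pairs(tile_map):
    """Sequences of two horizontally adjacent tiles, the analogue of word bigrams."""
    return {
        (tile_map[(r, c)], tile_map[(r, c + 1)])
        for (r, c) in tile_map
        if (r, c + 1) in tile_map
    }


# Hypothetical 2x3 grayscale image.
gray = [[0, 255, 0], [255, 0, 255]]
bw = to_black_and_white(gray)
tile_map = tiles(bw)
pairs = horizontal_pairs(tile_map)
```

Vertical sequences would follow the same pattern with `(r + 1, c)` in place of `(r, c + 1)`.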
- The method illustrated in Fig. 5 may be applied to detecting copyright infringement based on image content where the original image does not have a digital watermark. This method may also be applied to fingerprint identification or handwritten signature authentication, among other applications.
- The present invention also includes an apparatus that is capable of identifying duplicate and near-duplicate documents in a large collection of documents.
- The apparatus includes a means for initially selecting distinctive features contained within the collection of documents, a means for subsequently identifying the distinctive features contained in each document, and a means for then comparing the distinctive features of each pair of documents having at least one distinctive feature in common to determine whether the documents are duplicate or near-duplicate documents.
- Fig. 6 illustrates an embodiment of an apparatus of the present invention capable of enabling the methods of the present invention.
- A computer system 600 is utilized to enable the method.
- The computer system 600 includes a display unit 610 and an input device 612.
- The input device 612 may be any device capable of receiving user input, for example, a keyboard or a scanner.
- The computer system 600 also includes a storage device 614 for storing the document collection and a storage device 616 for storing the method according to the present invention.
- A processor 618 executes the method stored on storage device 616 and accesses the document collection stored on storage device 614.
- The processor is also capable of sending information to the display unit 610 and receiving information from the input device 612.
- DF(x) was the number of documents containing the text "x".
- N was the overall number of documents.
- R was a threshold on DF.
- A word in a particular document may be restricted from contributing to DF(x) if the word's frequency in that document falls below a user-specified threshold.
- A phrase consisted of at least two words which occur in more than one document and in no more than R documents ( 1 < DF(x) ≤ R ).
- The phrases also contained glue words that occurred in at least ( N - R ) documents.
- The glue words could appear within a phrase, but not in the leftmost or rightmost position in the phrase.
- The document was segmented at words of intermediate rarity ( R < DF(x) < N - R ) and what remained were considered distinctive phrases.
- The phrases may also be segmented at the glue words to obtain additional distinctive sub-phrases, for example, "United States of America" yields "United States" upon splitting at the "of".
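The segmentation can be sketched as follows. The sketch assumes that rare words satisfy 1 < DF(x) ≤ R, that glue words satisfy DF(x) ≥ N - R (a reading of a garbled bound in the source), that R is below N/2 so the two classes do not overlap, and that every other word, including words found in a single document, is a break point. All names and the sample corpus are hypothetical.

```python
from collections import Counter


def document_frequency(docs):
    """Return DF: for each word, the number of documents (word lists) containing it."""
    df = Counter()
    for words in docs:
        df.update(set(words))
    return df


def distinctive_phrases(words, df, n_docs, r):
    """Split a document at non-rare, non-glue words and keep runs of two or more words."""
    def is_rare(word):
        return 1 < df[word] <= r

    def is_glue(word):
        return df[word] >= n_docs - r

    phrases, run = [], []
    for word in words + [None]:           # the sentinel flushes the final run
        if word is not None and (is_rare(word) or is_glue(word)):
            run.append(word)
            continue
        while run and is_glue(run[0]):    # no glue word in the leftmost position
            run.pop(0)
        while run and is_glue(run[-1]):   # no glue word in the rightmost position
            run.pop()
        if len(run) >= 2:
            phrases.append(tuple(run))
        run = []
    return phrases


# Hypothetical six-document corpus with R = 2.
docs = [s.split() for s in [
    "the united states of america signed the treaty",
    "the united states of america rejected the treaty",
    "the treaty of paris",
    "the weather of today",
    "the city of rome",
    "the price of gold",
]]
df = document_frequency(docs)
phrases = distinctive_phrases(docs[0], df, n_docs=len(docs), r=2)
```

In this corpus "the" and "of" are glue words (DF 6), "united", "states" and "america" are rare (DF 2), and "treaty" (DF 3) is a break point, so the first document yields the single phrase "united states of america".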
- The second pass also built a document index that mapped each document to its set of distinctive phrases and sub-phrases using a document identifier and a phrase identifier and built a phrase index that mapped from the phrases to the documents that contained them using the phrase and document identifiers.
- The indexes were built in a manner that ignores duplicates.
- A third pass iterated over the document identifiers in the document index (it is not necessary to use the actual documents once the indexes are built).
- The document index was used to gather a list of the phrase identifiers.
- The document identifiers obtained from the phrase index were iterated over to count the total number of times each document identifier occurred.
- A list of documents that overlap with the document in at least one phrase and the number of phrases that overlap was generated. This list of document identifiers included only those documents that had at least one phrase in common with the source document in order to avoid the need to compare the source document with every other document.
- An overlap ratio was calculated by dividing the number of common phrases by the smaller of the number of phrases in each document. This made it possible to detect a small passage excerpted from a longer document. The overlap ratio was compared with a match percentage threshold. If it exceeded the threshold, the pair was reported as potential near-duplicates. Optionally, the results may be accepted as is or a more detailed comparison algorithm may be applied to the near-duplicate document pairs.
- The implementation is also very efficient.
- The first two passes are linear in N.
- The third pass runs in time N*P, where P is the average number of documents that overlap in at least one phrase.
- In the worst case P is N, but typically P is R.
- R the accuracy
- There is a trade-off between running time and accuracy. In practice, however, an acceptable level of accuracy is achieved for a running time that is linear in N. This is a significant improvement over algorithms which would require pairwise comparisons of all the documents, or at least N-squared running time.
- The implementation was executed on 125 newspaper articles and their corresponding human-written summaries, for a total of 250 documents.
- The implementation may use different thresholds for the low frequency and glue words. Sequences of mid-range DF words, where the sequence itself has low DF, may be included. Additionally, the number of words in a phrase may be factored in as a measure of the phrase's complexity in addition to rarity, for example, dividing the length of the phrase by the phrase's DF ( TL/DF or log(TL)/DF ). Although this yields a preference for longer phrases, it allows longer phrases to have higher DF and, thus, be less distinctive.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU2002229035A AU2002229035A1 (en) | 2000-11-15 | 2001-10-31 | Method and apparatus for efficient identification of duplicate and near-duplicate documents and text spans using high-discriminability text fragments |
JP2002543304A JP2004519761A (en) | 2000-11-15 | 2001-10-31 | Computer-aided method and apparatus for effectively identifying duplicated or near duplicated documents and text spans using highly identifiable text fragments |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US09/713,733 | 2000-11-15 | ||
US09/713,733 US6978419B1 (en) | 2000-11-15 | 2000-11-15 | Method and apparatus for efficient identification of duplicate and near-duplicate documents and text spans using high-discriminability text fragments |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2002041161A1 (en) | 2002-05-23 |
Family
ID=24867308
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2001/048124 WO2002041161A1 (en) | 2000-11-15 | 2001-10-31 | Method and apparatus for efficient identification of duplicate and near-duplicate documents and text spans using high-discriminability text fragments |
Country Status (4)
Country | Link |
---|---|
US (1) | US6978419B1 (en) |
JP (1) | JP2004519761A (en) |
AU (1) | AU2002229035A1 (en) |
WO (1) | WO2002041161A1 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR101221832B1 (en) | 2011-12-08 | 2013-01-15 | 동국대학교 경주캠퍼스 산학협력단 | A program recording datum and system for copyright protecting |
Families Citing this family (222)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
AU4328000A (en) | 1999-03-31 | 2000-10-16 | Verizon Laboratories Inc. | Techniques for performing a data query in a computer system |
US8572069B2 (en) * | 1999-03-31 | 2013-10-29 | Apple Inc. | Semi-automatic index term augmentation in document retrieval |
US8275661B1 (en) | 1999-03-31 | 2012-09-25 | Verizon Corporate Services Group Inc. | Targeted banner advertisements |
US6718363B1 (en) * | 1999-07-30 | 2004-04-06 | Verizon Laboratories, Inc. | Page aggregation for web sites |
US6912525B1 (en) | 2000-05-08 | 2005-06-28 | Verizon Laboratories, Inc. | Techniques for web site integration |
US8010988B2 (en) | 2000-09-14 | 2011-08-30 | Cox Ingemar J | Using features extracted from an audio and/or video work to obtain information about the work |
US8205237B2 (en) | 2000-09-14 | 2012-06-19 | Cox Ingemar J | Identifying works, using a sub-linear time search, such as an approximate nearest neighbor search, for initiating a work-based action, such as an action on the internet |
US6658423B1 (en) * | 2001-01-24 | 2003-12-02 | Google, Inc. | Detecting duplicate and near-duplicate files |
US7899825B2 (en) * | 2001-06-27 | 2011-03-01 | SAP America, Inc. | Method and apparatus for duplicate detection |
US6778995B1 (en) | 2001-08-31 | 2004-08-17 | Attenex Corporation | System and method for efficiently generating cluster groupings in a multi-dimensional concept space |
US6978274B1 (en) | 2001-08-31 | 2005-12-20 | Attenex Corporation | System and method for dynamically evaluating latent concepts in unstructured documents |
US6888548B1 (en) | 2001-08-31 | 2005-05-03 | Attenex Corporation | System and method for generating a visualized data representation preserving independent variable geometric relationships |
US7271804B2 (en) | 2002-02-25 | 2007-09-18 | Attenex Corporation | System and method for arranging concept clusters in thematic relationships in a two-dimensional visual display area |
US7219301B2 (en) * | 2002-03-01 | 2007-05-15 | Iparadigms, Llc | Systems and methods for conducting a peer review process and evaluating the originality of documents |
US20070208698A1 (en) * | 2002-06-07 | 2007-09-06 | Dougal Brindley | Avoiding duplicate service requests |
US8090717B1 (en) | 2002-09-20 | 2012-01-03 | Google Inc. | Methods and apparatus for ranking documents |
US7568148B1 (en) | 2002-09-20 | 2009-07-28 | Google Inc. | Methods and apparatus for clustering news content |
US7725544B2 (en) * | 2003-01-24 | 2010-05-25 | Aol Inc. | Group based spam classification |
US7703000B2 (en) * | 2003-02-13 | 2010-04-20 | Iparadigms Llc | Systems and methods for contextual mark-up of formatted documents |
US7590695B2 (en) | 2003-05-09 | 2009-09-15 | Aol Llc | Managing electronic messages |
JP4014160B2 (en) * | 2003-05-30 | 2007-11-28 | インターナショナル・ビジネス・マシーンズ・コーポレーション | Information processing apparatus, program, and recording medium |
US7739602B2 (en) | 2003-06-24 | 2010-06-15 | Aol Inc. | System and method for community centric resource sharing based on a publishing subscription model |
US8136025B1 (en) | 2003-07-03 | 2012-03-13 | Google Inc. | Assigning document identification tags |
US7627613B1 (en) | 2003-07-03 | 2009-12-01 | Google Inc. | Duplicate document detection in a web crawler system |
US7610313B2 (en) | 2003-07-25 | 2009-10-27 | Attenex Corporation | System and method for performing efficient document scoring and clustering |
US7644076B1 (en) * | 2003-09-12 | 2010-01-05 | Teradata Us, Inc. | Clustering strings using N-grams |
US7577655B2 (en) | 2003-09-16 | 2009-08-18 | Google Inc. | Systems and methods for improving the ranking of news articles |
US7503035B2 (en) * | 2003-11-25 | 2009-03-10 | Software Analysis And Forensic Engineering Corp. | Software tool for detecting plagiarism in computer source code |
US7823127B2 (en) * | 2003-11-25 | 2010-10-26 | Software Analysis And Forensic Engineering Corp. | Detecting plagiarism in computer source code |
US7191175B2 (en) | 2004-02-13 | 2007-03-13 | Attenex Corporation | System and method for arranging concept clusters in thematic neighborhood relationships in a two-dimensional visual display space |
US7809695B2 (en) * | 2004-08-23 | 2010-10-05 | Thomson Reuters Global Resources | Information retrieval systems with duplicate document detection and presentation functions |
US7331010B2 (en) | 2004-10-29 | 2008-02-12 | International Business Machines Corporation | System, method and storage medium for providing fault detection and correction in a memory subsystem |
US8214369B2 (en) * | 2004-12-09 | 2012-07-03 | Microsoft Corporation | System and method for indexing and prefiltering |
US20060142993A1 (en) * | 2004-12-28 | 2006-06-29 | Sony Corporation | System and method for utilizing distance measures to perform text classification |
US7356777B2 (en) | 2005-01-26 | 2008-04-08 | Attenex Corporation | System and method for providing a dynamic user interface for a dense three-dimensional scene |
US7404151B2 (en) | 2005-01-26 | 2008-07-22 | Attenex Corporation | System and method for providing a dynamic user interface for a dense three-dimensional scene |
US7401080B2 (en) * | 2005-08-17 | 2008-07-15 | Microsoft Corporation | Storage reports duplicate file detection |
US20070112752A1 (en) * | 2005-11-14 | 2007-05-17 | Wolfgang Kalthoff | Combination of matching strategies under consideration of data quality |
US7685392B2 (en) | 2005-11-28 | 2010-03-23 | International Business Machines Corporation | Providing indeterminate read data latency in a memory system |
US7542989B2 (en) * | 2006-01-25 | 2009-06-02 | Graduate Management Admission Council | Method and system for searching, identifying, and documenting infringements on copyrighted information |
US7661064B2 (en) * | 2006-03-06 | 2010-02-09 | Microsoft Corporation | Displaying text intraline diffing output |
US8073830B2 (en) * | 2006-03-31 | 2011-12-06 | Google Inc. | Expanded text excerpts |
US20070266001A1 (en) * | 2006-05-09 | 2007-11-15 | Microsoft Corporation | Presentation of duplicate and near duplicate search results |
US8175875B1 (en) * | 2006-05-19 | 2012-05-08 | Google Inc. | Efficient indexing of documents with similar content |
US7765475B2 (en) * | 2006-06-13 | 2010-07-27 | International Business Machines Corporation | List display with redundant text removal |
US8015162B2 (en) * | 2006-08-04 | 2011-09-06 | Google Inc. | Detecting duplicate and near-duplicate files |
US8321197B2 (en) * | 2006-10-18 | 2012-11-27 | Teresa Ruth Gaudet | Method and process for performing category-based analysis, evaluation, and prescriptive practice creation upon stenographically written and voice-written text files |
US7870459B2 (en) | 2006-10-23 | 2011-01-11 | International Business Machines Corporation | High density high reliability memory module with power gating and a fault tolerant address and command bus |
US8515912B2 (en) | 2010-07-15 | 2013-08-20 | Palantir Technologies, Inc. | Sharing and deconflicting data changes in a multimaster database system |
US8983970B1 (en) * | 2006-12-07 | 2015-03-17 | Google Inc. | Ranking content using content and content authors |
US8577866B1 (en) | 2006-12-07 | 2013-11-05 | Google Inc. | Classifying content |
US8930331B2 (en) | 2007-02-21 | 2015-01-06 | Palantir Technologies | Providing unique views of data based on changes or rules |
NZ553484A (en) * | 2007-02-28 | 2008-09-26 | Optical Systems Corp Ltd | Text management software |
US7849399B2 (en) * | 2007-06-29 | 2010-12-07 | Walter Hoffmann | Method and system for tracking authorship of content in data |
US20090063470A1 (en) * | 2007-08-28 | 2009-03-05 | Nogacom Ltd. | Document management using business objects |
US8037073B1 (en) * | 2007-12-31 | 2011-10-11 | Google Inc. | Detection of bounce pad sites |
AU2008255269A1 (en) * | 2008-02-05 | 2009-08-20 | Nuix Pty. Ltd. | Document comparison method and apparatus |
US10747952B2 (en) | 2008-09-15 | 2020-08-18 | Palantir Technologies, Inc. | Automatic creation and server push of multiple distinct drafts |
TW201027375A (en) | 2008-10-20 | 2010-07-16 | Ibm | Search system, search method and program |
US8572093B2 (en) * | 2009-01-13 | 2013-10-29 | Emc Corporation | System and method for providing a license description syntax in a software due diligence system |
US8479161B2 (en) * | 2009-03-18 | 2013-07-02 | Oracle International Corporation | System and method for performing software due diligence using a binary scan engine and parallel pattern matching |
US8307351B2 (en) * | 2009-03-18 | 2012-11-06 | Oracle International Corporation | System and method for performing code provenance review in a software due diligence system |
CN101859309A (en) * | 2009-04-07 | 2010-10-13 | 慧科讯业有限公司 | System and method for identifying repeated text |
US9104695B1 (en) | 2009-07-27 | 2015-08-11 | Palantir Technologies, Inc. | Geotagging structured data |
US8713018B2 (en) | 2009-07-28 | 2014-04-29 | Fti Consulting, Inc. | System and method for displaying relationships between electronically stored information to provide classification suggestions via inclusion |
EP2471009A1 (en) | 2009-08-24 | 2012-07-04 | FTI Technology LLC | Generating a reference set for use during document review |
US9053296B2 (en) | 2010-08-28 | 2015-06-09 | Software Analysis And Forensic Engineering Corporation | Detecting plagiarism in computer markup language files |
CA2810041C (en) | 2010-09-03 | 2015-12-08 | Iparadigms, Llc | Systems and methods for document analysis |
US20120084868A1 (en) * | 2010-09-30 | 2012-04-05 | International Business Machines Corporation | Locating documents for providing data leakage prevention within an information security management system |
US8607140B1 (en) * | 2010-12-21 | 2013-12-10 | Google Inc. | Classifying changes to resources |
US8799240B2 (en) | 2011-06-23 | 2014-08-05 | Palantir Technologies, Inc. | System and method for investigating large amounts of data |
US9547693B1 (en) | 2011-06-23 | 2017-01-17 | Palantir Technologies Inc. | Periodic database search manager for multiple data sources |
US8732574B2 (en) | 2011-08-25 | 2014-05-20 | Palantir Technologies, Inc. | System and method for parameterizing documents for automatic workflow generation |
JP2013149061A (en) * | 2012-01-19 | 2013-08-01 | Nec Corp | Document similarity evaluation system, document similarity evaluation method, and computer program |
FR2989189B1 (en) * | 2012-04-04 | 2017-10-13 | Qwant | METHOD AND DEVICE FOR QUICKLY PROVIDING INFORMATION |
US9798768B2 (en) | 2012-09-10 | 2017-10-24 | Palantir Technologies, Inc. | Search around visual queries |
US8843493B1 (en) * | 2012-09-18 | 2014-09-23 | Narus, Inc. | Document fingerprint |
US9348677B2 (en) | 2012-10-22 | 2016-05-24 | Palantir Technologies Inc. | System and method for batch evaluation programs |
US9081975B2 (en) | 2012-10-22 | 2015-07-14 | Palantir Technologies, Inc. | Sharing information between nexuses that use different classification schemes for information access control |
US9501761B2 (en) | 2012-11-05 | 2016-11-22 | Palantir Technologies, Inc. | System and method for sharing investigation results |
US9501507B1 (en) | 2012-12-27 | 2016-11-22 | Palantir Technologies Inc. | Geo-temporal indexing and searching |
US10140664B2 (en) | 2013-03-14 | 2018-11-27 | Palantir Technologies Inc. | Resolving similar entities from a transaction database |
US8903717B2 (en) | 2013-03-15 | 2014-12-02 | Palantir Technologies Inc. | Method and system for generating a parser and parsing complex data |
US8855999B1 (en) | 2013-03-15 | 2014-10-07 | Palantir Technologies Inc. | Method and system for generating a parser and parsing complex data |
US8868486B2 (en) | 2013-03-15 | 2014-10-21 | Palantir Technologies Inc. | Time-sensitive cube |
US10275778B1 (en) | 2013-03-15 | 2019-04-30 | Palantir Technologies Inc. | Systems and user interfaces for dynamic and interactive investigation based on automatic malfeasance clustering of related data in various data structures |
US8924388B2 (en) | 2013-03-15 | 2014-12-30 | Palantir Technologies Inc. | Computer-implemented systems and methods for comparing and associating objects |
US8930897B2 (en) | 2013-03-15 | 2015-01-06 | Palantir Technologies Inc. | Data integration tool |
US8909656B2 (en) | 2013-03-15 | 2014-12-09 | Palantir Technologies Inc. | Filter chains with associated multipath views for exploring large data sets |
US8799799B1 (en) | 2013-05-07 | 2014-08-05 | Palantir Technologies Inc. | Interactive geospatial map |
US9565152B2 (en) | 2013-08-08 | 2017-02-07 | Palantir Technologies Inc. | Cable reader labeling |
US9785317B2 (en) | 2013-09-24 | 2017-10-10 | Palantir Technologies Inc. | Presentation and analysis of user interaction data |
US8938686B1 (en) | 2013-10-03 | 2015-01-20 | Palantir Technologies Inc. | Systems and methods for analyzing performance of an entity |
US8812960B1 (en) | 2013-10-07 | 2014-08-19 | Palantir Technologies Inc. | Cohort-based presentation of user interaction data |
US9116975B2 (en) | 2013-10-18 | 2015-08-25 | Palantir Technologies Inc. | Systems and user interfaces for dynamic and interactive simultaneous querying of multiple data stores |
US8832594B1 (en) * | 2013-11-04 | 2014-09-09 | Palantir Technologies Inc. | Space-optimized display of multi-column tables with selective text truncation based on a combined text width |
US9105000B1 (en) | 2013-12-10 | 2015-08-11 | Palantir Technologies Inc. | Aggregating data from a plurality of data sources |
US9734217B2 (en) | 2013-12-16 | 2017-08-15 | Palantir Technologies Inc. | Methods and systems for analyzing entity performance |
US10579647B1 (en) | 2013-12-16 | 2020-03-03 | Palantir Technologies Inc. | Methods and systems for analyzing entity performance |
US10356032B2 (en) | 2013-12-26 | 2019-07-16 | Palantir Technologies Inc. | System and method for detecting confidential information emails |
US9514417B2 (en) | 2013-12-30 | 2016-12-06 | Google Inc. | Cloud-based plagiarism detection system performing predicting based on classified feature vectors |
US8832832B1 (en) | 2014-01-03 | 2014-09-09 | Palantir Technologies Inc. | IP reputation |
KR101577376B1 (en) * | 2014-01-21 | 2015-12-14 | (주) 아워텍 | System and method for determining infringement of copyright based on the text reference point |
US8935201B1 (en) | 2014-03-18 | 2015-01-13 | Palantir Technologies Inc. | Determining and extracting changed data from a data source |
US9836580B2 (en) | 2014-03-21 | 2017-12-05 | Palantir Technologies Inc. | Provider portal |
US9535974B1 (en) | 2014-06-30 | 2017-01-03 | Palantir Technologies Inc. | Systems and methods for identifying key phrase clusters within documents |
US9129219B1 (en) | 2014-06-30 | 2015-09-08 | Palantir Technologies, Inc. | Crime risk forecasting |
US9619557B2 (en) | 2014-06-30 | 2017-04-11 | Palantir Technologies, Inc. | Systems and methods for key phrase characterization of documents |
US9256664B2 (en) | 2014-07-03 | 2016-02-09 | Palantir Technologies Inc. | System and method for news events detection and visualization |
US20160026923A1 (en) | 2014-07-22 | 2016-01-28 | Palantir Technologies Inc. | System and method for determining a propensity of entity to take a specified action |
US9886422B2 (en) * | 2014-08-06 | 2018-02-06 | International Business Machines Corporation | Dynamic highlighting of repetitions in electronic documents |
US9454281B2 (en) | 2014-09-03 | 2016-09-27 | Palantir Technologies Inc. | System for providing dynamic linked panels in user interface |
US9390086B2 (en) | 2014-09-11 | 2016-07-12 | Palantir Technologies Inc. | Classification system with methodology for efficient verification |
US9767172B2 (en) | 2014-10-03 | 2017-09-19 | Palantir Technologies Inc. | Data aggregation and analysis system |
US9501851B2 (en) | 2014-10-03 | 2016-11-22 | Palantir Technologies Inc. | Time-series analysis system |
US9785328B2 (en) | 2014-10-06 | 2017-10-10 | Palantir Technologies Inc. | Presentation of multivariate data on a graphical user interface of a computing system |
US9984133B2 (en) | 2014-10-16 | 2018-05-29 | Palantir Technologies Inc. | Schematic and database linking system |
US9805099B2 (en) | 2014-10-30 | 2017-10-31 | The Johns Hopkins University | Apparatus and method for efficient identification of code similarity |
US9229952B1 (en) | 2014-11-05 | 2016-01-05 | Palantir Technologies, Inc. | History preserving data pipeline system and method |
US9043894B1 (en) | 2014-11-06 | 2015-05-26 | Palantir Technologies Inc. | Malicious software detection in a computing system |
US9483546B2 (en) | 2014-12-15 | 2016-11-01 | Palantir Technologies Inc. | System and method for associating related records to common entities across multiple lists |
US9348920B1 (en) | 2014-12-22 | 2016-05-24 | Palantir Technologies Inc. | Concept indexing among database of documents using machine learning techniques |
US10362133B1 (en) | 2014-12-22 | 2019-07-23 | Palantir Technologies Inc. | Communication data processing architecture |
US10552994B2 (en) | 2014-12-22 | 2020-02-04 | Palantir Technologies Inc. | Systems and interactive user interfaces for dynamic retrieval, analysis, and triage of data items |
US10452651B1 (en) | 2014-12-23 | 2019-10-22 | Palantir Technologies Inc. | Searching charts |
US9335911B1 (en) | 2014-12-29 | 2016-05-10 | Palantir Technologies Inc. | Interactive user interface for dynamic data analysis exploration and query processing |
US9817563B1 (en) | 2014-12-29 | 2017-11-14 | Palantir Technologies Inc. | System and method of generating data points from one or more data stores of data items for chart creation and manipulation |
US11302426B1 (en) | 2015-01-02 | 2022-04-12 | Palantir Technologies Inc. | Unified data interface and system |
US9727560B2 (en) | 2015-02-25 | 2017-08-08 | Palantir Technologies Inc. | Systems and methods for organizing and identifying documents via hierarchies and dimensions of tags |
EP3611632A1 (en) | 2015-03-16 | 2020-02-19 | Palantir Technologies Inc. | Displaying attribute and event data along paths |
US9886467B2 (en) | 2015-03-19 | 2018-02-06 | Palantir Technologies Inc. | System and method for comparing and visualizing data entities and data entity series |
US9348880B1 (en) | 2015-04-01 | 2016-05-24 | Palantir Technologies, Inc. | Federated search of multiple sources with conflict resolution |
US10103953B1 (en) | 2015-05-12 | 2018-10-16 | Palantir Technologies Inc. | Methods and systems for analyzing entity performance |
US10628834B1 (en) | 2015-06-16 | 2020-04-21 | Palantir Technologies Inc. | Fraud lead detection system for efficiently processing database-stored data and automatically generating natural language explanatory information of system results for display in interactive user interfaces |
US9418337B1 (en) | 2015-07-21 | 2016-08-16 | Palantir Technologies Inc. | Systems and models for data analytics |
US9392008B1 (en) | 2015-07-23 | 2016-07-12 | Palantir Technologies Inc. | Systems and methods for identifying information related to payment card breaches |
US9996595B2 (en) | 2015-08-03 | 2018-06-12 | Palantir Technologies, Inc. | Providing full data provenance visualization for versioned datasets |
US9456000B1 (en) | 2015-08-06 | 2016-09-27 | Palantir Technologies Inc. | Systems, methods, user interfaces, and computer-readable media for investigating potential malicious communications |
US9600146B2 (en) | 2015-08-17 | 2017-03-21 | Palantir Technologies Inc. | Interactive geospatial map |
US9671776B1 (en) | 2015-08-20 | 2017-06-06 | Palantir Technologies Inc. | Quantifying, tracking, and anticipating risk at a manufacturing facility, taking deviation type and staffing conditions into account |
US11150917B2 (en) | 2015-08-26 | 2021-10-19 | Palantir Technologies Inc. | System for data aggregation and analysis of data from a plurality of data sources |
US9485265B1 (en) | 2015-08-28 | 2016-11-01 | Palantir Technologies Inc. | Malicious activity detection system capable of efficiently processing data accessed from databases and generating alerts for display in interactive user interfaces |
US10706434B1 (en) | 2015-09-01 | 2020-07-07 | Palantir Technologies Inc. | Methods and systems for determining location information |
US9984428B2 (en) | 2015-09-04 | 2018-05-29 | Palantir Technologies Inc. | Systems and methods for structuring data from unstructured electronic data files |
US9639580B1 (en) | 2015-09-04 | 2017-05-02 | Palantir Technologies, Inc. | Computer-implemented systems and methods for data management and visualization |
US9576015B1 (en) | 2015-09-09 | 2017-02-21 | Palantir Technologies, Inc. | Domain-specific language for dataset transformations |
US10733237B2 (en) * | 2015-09-22 | 2020-08-04 | International Business Machines Corporation | Creating data objects to separately store common data included in documents |
US9424669B1 (en) | 2015-10-21 | 2016-08-23 | Palantir Technologies Inc. | Generating graphical representations of event participation flow |
US10223429B2 (en) | 2015-12-01 | 2019-03-05 | Palantir Technologies Inc. | Entity data attribution using disparate data sets |
US10706056B1 (en) | 2015-12-02 | 2020-07-07 | Palantir Technologies Inc. | Audit log report generator |
US9514414B1 (en) | 2015-12-11 | 2016-12-06 | Palantir Technologies Inc. | Systems and methods for identifying and categorizing electronic documents through machine learning |
US9760556B1 (en) | 2015-12-11 | 2017-09-12 | Palantir Technologies Inc. | Systems and methods for annotating and linking electronic documents |
US10114884B1 (en) | 2015-12-16 | 2018-10-30 | Palantir Technologies Inc. | Systems and methods for attribute analysis of one or more databases |
US9542446B1 (en) | 2015-12-17 | 2017-01-10 | Palantir Technologies, Inc. | Automatic generation of composite datasets based on hierarchical fields |
US10373099B1 (en) | 2015-12-18 | 2019-08-06 | Palantir Technologies Inc. | Misalignment detection system for efficiently processing database-stored data and automatically generating misalignment information for display in interactive user interfaces |
US9996236B1 (en) | 2015-12-29 | 2018-06-12 | Palantir Technologies Inc. | Simplified frontend processing and visualization of large datasets |
US10871878B1 (en) | 2015-12-29 | 2020-12-22 | Palantir Technologies Inc. | System log analysis and object user interaction correlation system |
US10089289B2 (en) | 2015-12-29 | 2018-10-02 | Palantir Technologies Inc. | Real-time document annotation |
US9792020B1 (en) | 2015-12-30 | 2017-10-17 | Palantir Technologies Inc. | Systems for collecting, aggregating, and storing data, generating interactive user interfaces for analyzing data, and generating alerts based upon collected data |
US10698938B2 (en) | 2016-03-18 | 2020-06-30 | Palantir Technologies Inc. | Systems and methods for organizing and identifying documents via hierarchies and dimensions of tags |
US9652139B1 (en) | 2016-04-06 | 2017-05-16 | Palantir Technologies Inc. | Graphical representation of an output |
US10068199B1 (en) | 2016-05-13 | 2018-09-04 | Palantir Technologies Inc. | System to catalogue tracking data |
AU2017274558B2 (en) | 2016-06-02 | 2021-11-11 | Nuix North America Inc. | Analyzing clusters of coded documents |
US10007674B2 (en) | 2016-06-13 | 2018-06-26 | Palantir Technologies Inc. | Data revision control in large-scale data analytic systems |
US10545975B1 (en) | 2016-06-22 | 2020-01-28 | Palantir Technologies Inc. | Visual analysis of data using sequenced dataset reduction |
US10909130B1 (en) | 2016-07-01 | 2021-02-02 | Palantir Technologies Inc. | Graphical user interface for a database system |
US10324609B2 (en) | 2016-07-21 | 2019-06-18 | Palantir Technologies Inc. | System for providing dynamic linked panels in user interface |
US10719188B2 (en) | 2016-07-21 | 2020-07-21 | Palantir Technologies Inc. | Cached database and synchronization system for providing dynamic linked panels in user interface |
US10552002B1 (en) | 2016-09-27 | 2020-02-04 | Palantir Technologies Inc. | User interface based variable machine modeling |
US10133588B1 (en) | 2016-10-20 | 2018-11-20 | Palantir Technologies Inc. | Transforming instructions for collaborative updates |
US10726507B1 (en) | 2016-11-11 | 2020-07-28 | Palantir Technologies Inc. | Graphical representation of a complex task |
US9842338B1 (en) | 2016-11-21 | 2017-12-12 | Palantir Technologies Inc. | System to identify vulnerable card readers |
US10318630B1 (en) | 2016-11-21 | 2019-06-11 | Palantir Technologies Inc. | Analysis of large bodies of textual data |
US11250425B1 (en) | 2016-11-30 | 2022-02-15 | Palantir Technologies Inc. | Generating a statistic using electronic transaction data |
US10467275B2 (en) | 2016-12-09 | 2019-11-05 | International Business Machines Corporation | Storage efficiency |
GB201621434D0 (en) | 2016-12-16 | 2017-02-01 | Palantir Technologies Inc | Processing sensor logs |
US9886525B1 (en) | 2016-12-16 | 2018-02-06 | Palantir Technologies Inc. | Data item aggregate probability analysis system |
US10044836B2 (en) | 2016-12-19 | 2018-08-07 | Palantir Technologies Inc. | Conducting investigations under limited connectivity |
US10249033B1 (en) | 2016-12-20 | 2019-04-02 | Palantir Technologies Inc. | User interface for managing defects |
US10728262B1 (en) | 2016-12-21 | 2020-07-28 | Palantir Technologies Inc. | Context-aware network-based malicious activity warning systems |
US11373752B2 (en) | 2016-12-22 | 2022-06-28 | Palantir Technologies Inc. | Detection of misuse of a benefit system |
US10360238B1 (en) | 2016-12-22 | 2019-07-23 | Palantir Technologies Inc. | Database systems and user interfaces for interactive data association, analysis, and presentation |
US10163227B1 (en) * | 2016-12-28 | 2018-12-25 | Shutterstock, Inc. | Image file compression using dummy data for non-salient portions of images |
US10721262B2 (en) | 2016-12-28 | 2020-07-21 | Palantir Technologies Inc. | Resource-centric network cyber attack warning system |
US10216811B1 (en) | 2017-01-05 | 2019-02-26 | Palantir Technologies Inc. | Collaborating using different object models |
US10762471B1 (en) | 2017-01-09 | 2020-09-01 | Palantir Technologies Inc. | Automating management of integrated workflows based on disparate subsidiary data sources |
US10133621B1 (en) | 2017-01-18 | 2018-11-20 | Palantir Technologies Inc. | Data analysis system to facilitate investigative process |
US10509844B1 (en) | 2017-01-19 | 2019-12-17 | Palantir Technologies Inc. | Network graph parser |
US10515109B2 (en) | 2017-02-15 | 2019-12-24 | Palantir Technologies Inc. | Real-time auditing of industrial equipment condition |
US20180276206A1 (en) * | 2017-03-23 | 2018-09-27 | Hcl Technologies Limited | System and method for updating a knowledge repository |
US10581954B2 (en) | 2017-03-29 | 2020-03-03 | Palantir Technologies Inc. | Metric collection and aggregation for distributed software services |
US10866936B1 (en) | 2017-03-29 | 2020-12-15 | Palantir Technologies Inc. | Model object management and storage system |
US10713432B2 (en) * | 2017-03-31 | 2020-07-14 | Adobe Inc. | Classifying and ranking changes between document versions |
US10133783B2 (en) | 2017-04-11 | 2018-11-20 | Palantir Technologies Inc. | Systems and methods for constraint driven database searching |
US11074277B1 (en) | 2017-05-01 | 2021-07-27 | Palantir Technologies Inc. | Secure resolution of canonical entities |
US10563990B1 (en) | 2017-05-09 | 2020-02-18 | Palantir Technologies Inc. | Event-based route planning |
US10606872B1 (en) | 2017-05-22 | 2020-03-31 | Palantir Technologies Inc. | Graphical user interface for a database system |
US10795749B1 (en) | 2017-05-31 | 2020-10-06 | Palantir Technologies Inc. | Systems and methods for providing fault analysis user interface |
US10956406B2 (en) | 2017-06-12 | 2021-03-23 | Palantir Technologies Inc. | Propagated deletion of database records and derived data |
US11748416B2 (en) * | 2017-06-19 | 2023-09-05 | Equifax Inc. | Machine-learning system for servicing queries for digital content |
US11216762B1 (en) | 2017-07-13 | 2022-01-04 | Palantir Technologies Inc. | Automated risk visualization using customer-centric data analysis |
US10942947B2 (en) | 2017-07-17 | 2021-03-09 | Palantir Technologies Inc. | Systems and methods for determining relationships between datasets |
US10430444B1 (en) | 2017-07-24 | 2019-10-01 | Palantir Technologies Inc. | Interactive geospatial map and geospatial visualization systems |
US10956508B2 (en) | 2017-11-10 | 2021-03-23 | Palantir Technologies Inc. | Systems and methods for creating and managing a data integration workspace containing automatically updated data models |
US11281726B2 (en) | 2017-12-01 | 2022-03-22 | Palantir Technologies Inc. | System and methods for faster processor comparisons of visual graph features |
US10783162B1 (en) | 2017-12-07 | 2020-09-22 | Palantir Technologies Inc. | Workflow assistant |
US11314721B1 (en) | 2017-12-07 | 2022-04-26 | Palantir Technologies Inc. | User-interactive defect analysis for root cause |
US10877984B1 (en) | 2017-12-07 | 2020-12-29 | Palantir Technologies Inc. | Systems and methods for filtering and visualizing large scale datasets |
US10769171B1 (en) | 2017-12-07 | 2020-09-08 | Palantir Technologies Inc. | Relationship analysis and mapping for interrelated multi-layered datasets |
US11061874B1 (en) | 2017-12-14 | 2021-07-13 | Palantir Technologies Inc. | Systems and methods for resolving entity data across various data structures |
US10853352B1 (en) | 2017-12-21 | 2020-12-01 | Palantir Technologies Inc. | Structured data collection, presentation, validation and workflow management |
US11263382B1 (en) | 2017-12-22 | 2022-03-01 | Palantir Technologies Inc. | Data normalization and irregularity detection system |
GB201800595D0 (en) | 2018-01-15 | 2018-02-28 | Palantir Technologies Inc | Management of software bugs in a data processing system |
US11599369B1 (en) | 2018-03-08 | 2023-03-07 | Palantir Technologies Inc. | Graphical user interface configuration system |
US10877654B1 (en) | 2018-04-03 | 2020-12-29 | Palantir Technologies Inc. | Graphical user interfaces for optimizations |
US10754822B1 (en) | 2018-04-18 | 2020-08-25 | Palantir Technologies Inc. | Systems and methods for ontology migration |
US10885021B1 (en) | 2018-05-02 | 2021-01-05 | Palantir Technologies Inc. | Interactive interpreter and graphical user interface |
US10754946B1 (en) | 2018-05-08 | 2020-08-25 | Palantir Technologies Inc. | Systems and methods for implementing a machine learning approach to modeling entity behavior |
US11061542B1 (en) | 2018-06-01 | 2021-07-13 | Palantir Technologies Inc. | Systems and methods for determining and displaying optimal associations of data items |
US11119630B1 (en) | 2018-06-19 | 2021-09-14 | Palantir Technologies Inc. | Artificial intelligence assisted evaluations and user interface for same |
US11126638B1 (en) | 2018-09-13 | 2021-09-21 | Palantir Technologies Inc. | Data visualization and parsing system |
US11294928B1 (en) | 2018-10-12 | 2022-04-05 | Palantir Technologies Inc. | System architecture for relating and linking data objects |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5634051A (en) * | 1993-10-28 | 1997-05-27 | Teltech Resource Network Corporation | Information management system |
US5913208A (en) * | 1996-07-09 | 1999-06-15 | International Business Machines Corporation | Identifying duplicate documents from search results without comparing document content |
Family Cites Families (37)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4807182A (en) | 1986-03-12 | 1989-02-21 | Advanced Software, Inc. | Apparatus and method for comparing data groups |
US5258910A (en) * | 1988-07-29 | 1993-11-02 | Sharp Kabushiki Kaisha | Text editor with memory for eliminating duplicate sentences |
JP3270783B2 (en) * | 1992-09-29 | 2002-04-02 | ゼロックス・コーポレーション | Multiple document search methods |
US5692176A (en) * | 1993-11-22 | 1997-11-25 | Reed Elsevier Inc. | Associative text search and retrieval system |
JP4518574B2 (en) * | 1995-08-11 | 2010-08-04 | ソニー株式会社 | Recording method and apparatus, recording medium, and reproducing method and apparatus |
US5680611A (en) | 1995-09-29 | 1997-10-21 | Electronic Data Systems Corporation | Duplicate record detection |
US5933823A (en) * | 1996-03-01 | 1999-08-03 | Ricoh Company Limited | Image database browsing and query using texture analysis |
US6098034A (en) * | 1996-03-18 | 2000-08-01 | Expert Ease Development, Ltd. | Method for standardizing phrasing in a document |
US5920854A (en) * | 1996-08-14 | 1999-07-06 | Infoseek Corporation | Real-time document collection search engine with phrase indexing |
US5898836A (en) | 1997-01-14 | 1999-04-27 | Netmind Services, Inc. | Change-detection tool indicating degree and location of change of internet documents by comparison of cyclic-redundancy-check(CRC) signatures |
US6076051A (en) * | 1997-03-07 | 2000-06-13 | Microsoft Corporation | Information retrieval utilizing semantic representation of text |
US5978828A (en) | 1997-06-13 | 1999-11-02 | Intel Corporation | URL bookmark update notification of page content or location changes |
US6470307B1 (en) * | 1997-06-23 | 2002-10-22 | National Research Council Of Canada | Method and apparatus for automatically identifying keywords within a document |
AU742831B2 (en) * | 1997-09-04 | 2002-01-10 | British Telecommunications Public Limited Company | Methods and/or systems for selecting data sets |
US5983216A (en) * | 1997-09-12 | 1999-11-09 | Infoseek Corporation | Performing automated document collection and selection by providing a meta-index with meta-index values indentifying corresponding document collections |
US6353824B1 (en) * | 1997-11-18 | 2002-03-05 | Apple Computer, Inc. | Method for dynamic presentation of the contents topically rich capsule overviews corresponding to the plurality of documents, resolving co-referentiality in document segments |
US6092065A (en) * | 1998-02-13 | 2000-07-18 | International Business Machines Corporation | Method and apparatus for discovery, clustering and classification of patterns in 1-dimensional event streams |
US6628824B1 (en) * | 1998-03-20 | 2003-09-30 | Ken Belanger | Method and apparatus for image identification and comparison |
US6119124A (en) * | 1998-03-26 | 2000-09-12 | Digital Equipment Corporation | Method for clustering closely resembling data objects |
US6185614B1 (en) * | 1998-05-26 | 2001-02-06 | International Business Machines Corp. | Method and system for collecting user profile information over the world-wide web in the presence of dynamic content using document comparators |
US6263348B1 (en) * | 1998-07-01 | 2001-07-17 | Serena Software International, Inc. | Method and apparatus for identifying the existence of differences between two files |
US6240409B1 (en) * | 1998-07-31 | 2001-05-29 | The Regents Of The University Of California | Method and apparatus for detecting and summarizing document similarity within large document sets |
US6741743B2 (en) * | 1998-07-31 | 2004-05-25 | Prc. Inc. | Imaged document optical correlation and conversion system |
US6104990A (en) * | 1998-09-28 | 2000-08-15 | Prompt Software, Inc. | Language independent phrase extraction |
US6473753B1 (en) * | 1998-10-09 | 2002-10-29 | Microsoft Corporation | Method and system for calculating term-document importance |
US6549897B1 (en) * | 1998-10-09 | 2003-04-15 | Microsoft Corporation | Method and system for calculating phrase-document importance |
US6643686B1 (en) * | 1998-12-18 | 2003-11-04 | At&T Corp. | System and method for counteracting message filtering |
US6295529B1 (en) * | 1998-12-24 | 2001-09-25 | Microsoft Corporation | Method and apparatus for indentifying clauses having predetermined characteristics indicative of usefulness in determining relationships between different texts |
US6598054B2 (en) * | 1999-01-26 | 2003-07-22 | Xerox Corporation | System and method for clustering data objects in a collection |
US6366950B1 (en) * | 1999-04-02 | 2002-04-02 | Smithmicro Software | System and method for verifying users' identity in a network using e-mail communication |
US6547829B1 (en) * | 1999-06-30 | 2003-04-15 | Microsoft Corporation | Method and system for detecting duplicate documents in web crawls |
US6718363B1 (en) * | 1999-07-30 | 2004-04-06 | Verizon Laboratories, Inc. | Page aggregation for web sites |
US6442606B1 (en) * | 1999-08-12 | 2002-08-27 | Inktomi Corporation | Method and apparatus for identifying spoof documents |
US6356633B1 (en) * | 1999-08-19 | 2002-03-12 | Mci Worldcom, Inc. | Electronic mail message processing and routing for call center response to same |
US6615209B1 (en) * | 2000-02-22 | 2003-09-02 | Google, Inc. | Detecting query-specific duplicate documents |
US6697998B1 (en) * | 2000-06-12 | 2004-02-24 | International Business Machines Corporation | Automatic labeling of unlabeled text data |
US6658423B1 (en) * | 2001-01-24 | 2003-12-02 | Google, Inc. | Detecting duplicate and near-duplicate files |
Family Applications (4)
Application | Date | Publication | Status |
---|---|---|---|
US09/713,733 | 2000-11-15 | US6978419B1 | Not active (Expired - Fee Related) |
AU2002229035A | 2001-10-31 | AU2002229035A1 | Not active (Abandoned) |
PCT/US2001/048124 | 2001-10-31 | WO2002041161A1 | Active (Application Filing) |
JP2002543304A | 2001-10-31 | JP2004519761A | Active (Pending) |
Non-Patent Citations (3)
Title |
---|
BHARAT: "A comparison of techniques to find mirrored hosts on the WWW", NEC RESEARCH INDEX, 1999, XP002908687 * |
CHOWDHURY ET AL.: "Collection statistics for fast duplicate document detection", GOOGLE SEARCH, 1999, pages 1 - 30, XP002908686 * |
SHIVAKUMAR ET AL.: "Finding near-replicas of documents on the web", NEC RESEARCH INDEX, 1998, XP002908688 * |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR101221832B1 (en) | 2011-12-08 | 2013-01-15 | 동국대학교 경주캠퍼스 산학협력단 | A program recording datum and system for copyright protecting |
Also Published As
Publication number | Publication date |
---|---|
JP2004519761A (en) | 2004-07-02 |
US6978419B1 (en) | 2005-12-20 |
AU2002229035A1 (en) | 2002-05-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US6978419B1 (en) | | Method and apparatus for efficient identification of duplicate and near-duplicate documents and text spans using high-discriminability text fragments |
Haveliwala et al. | | Scalable Techniques for Clustering the Web. |
US7236923B1 (en) | | Acronym extraction system and method of identifying acronyms and extracting corresponding expansions from text |
Pomikálek | | Removing boilerplate and duplicate content from web corpora |
US6654717B2 (en) | | Multi-language document search and retrieval system |
US8548972B1 (en) | | Near-duplicate document detection for web crawling |
US7191116B2 (en) | | Methods and systems for determining a language of a document |
US7707157B1 (en) | | Document near-duplicate detection |
US20120158749A1 (en) | | System and method for providing tag-based relevance recommendations of bookmarks in a bookmark and tag database |
US20120030200A1 (en) | | Topics in relevance ranking model for web search |
JP2008287698A (en) | | Indexing system and indexing program |
US8843493B1 (en) | | Document fingerprint |
JP5291523B2 (en) | | Similar data retrieval device and program thereof |
JP2005276183A (en) | | Method and system for ranking words and concepts in text using graph-based ranking |
JPH09198398A (en) | | Pattern retrieving device |
Zhang et al. | | Automatic detecting/correcting errors in Chinese text by an approximate word-matching algorithm |
Sutoyo et al. | | Detecting documents plagiarism using winnowing algorithm and k-gram method |
Manaf et al. | | Comparison of carp rabin algorithm and Jaro-Winkler distance to determine the equality of Sunda languages |
Campbell et al. | | Copy detection systems for digital documents |
Fan et al. | | Article clipper: a system for web article extraction |
Ceglarek | | Architecture of the semantically enhanced intellectual property protection system |
Sun et al. | | Near duplicate text detection using frequency-biased signatures |
JP4682627B2 (en) | | Document retrieval apparatus and method |
JP3396734B2 (en) | | Corpus error detection / correction processing apparatus, corpus error detection / correction processing method, and program recording medium therefor |
Tsai et al. | | Multilingual novelty detection |
Legal Events
Code | Title | Description |
---|---|---|
AK | Designated states | Kind code of ref document: A1; Designated state(s): AE AG AL AM AT AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ CZ DE DE DK DK DM DZ EC EE EE ES FI FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PH PL PT RO RU SD SE SG SI SK SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW |
AL | Designated countries for regional patents | Kind code of ref document: A1; Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG |
DFPE | Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101) | |
121 | Ep: the epo has been informed by wipo that ep was designated in this application | |
WWE | Wipo information: entry into national phase | Ref document number: 2002543304; Country of ref document: JP |
REG | Reference to national code | Ref country code: DE; Ref legal event code: 8642 |
122 | Ep: pct application non-entry in european phase | |