US20130332450A1 - System and Method for Automatically Detecting and Interactively Displaying Information About Entities, Activities, and Events from Multiple-Modality Natural Language Sources - Google Patents

System and Method for Automatically Detecting and Interactively Displaying Information About Entities, Activities, and Events from Multiple-Modality Natural Language Sources

Info

Publication number
US20130332450A1
Authority
US
United States
Prior art keywords
information
entities
equivalence
text
entity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US13/493,659
Inventor
Vittorio Castelli
Radu Florian
Xiaoqiang Luo
Hema Raghavan
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
International Business Machines Corp
Original Assignee
International Business Machines Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by International Business Machines Corp filed Critical International Business Machines Corp
Priority to US13/493,659 (published as US20130332450A1)
Assigned to INTERNATIONAL BUSINESS MACHINES CORPORATION (assignment of assignors interest). Assignors: CASTELLI, VITTORIO; FLORIAN, RADU; LUO, XIAOQIANG; RAGHAVAN, HEMA
Priority to US13/543,157 (published as US20140195884A1)
Assigned to DARPA (confirmatory license). Assignor: INTERNATIONAL BUSINESS MACHINES CORPORATION
Priority to DE201310205737 (published as DE102013205737A1)
Priority to CN201310122395.8A (published as CN103488663A)
Publication of US20130332450A1
Priority to US15/419,615 (published as US10698964B2)
Legal status: Abandoned

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 Details of database functions independent of the retrieved data types
    • G06F 16/95 Retrieval from the web
    • G06F 16/953 Querying, e.g. by the use of web search engines
    • G06F 16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F 40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/3331 Query processing
    • G06F 16/334 Query execution
    • G06F 16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/338 Presentation of query results
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/34 Browsing; Visualisation therefor
    • G06F 16/345 Summarisation for human users
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/36 Creation of semantic tools, e.g. ontology or thesauri

Definitions

  • the present disclosure relates to information technology, and, more particularly, to natural language processing (NLP) systems.
  • Exemplary embodiments of the present disclosure provide methods for automatically extracting and organizing data such that a user can interactively explore information about entities, activities, and events.
  • information may be automatically extracted in real time from multiple modalities and multiple languages and displayed in a navigable and compact representation of the retrieved information.
  • Exemplary embodiments may use natural language processing techniques to automatically analyze information from multiple sources, in multiple modalities, and in multiple languages, including, but not limited to, web pages, blogs, newsgroups, radio feeds, video, and television.
  • Exemplary embodiments may use the output of automatic machine translation systems that translate foreign language sources into the language of the user, and use the output from automatic speech transcription systems that convert video and audio feeds into text.
  • Exemplary embodiments may use natural language processing techniques including information extraction tools, question answering tools, and distillation tools, to automatically analyze the text produced as described above and extract searchable and summarizable information.
  • the system may perform named-entity detection, cross-document co-reference resolution, relation detection, and event detection and tracking.
  • Exemplary embodiments may use automatic relevance detection techniques and redundancy reduction methods to provide the user with relevant and non-redundant information.
  • Exemplary embodiments may display the desired information in a compact and navigable representation by providing means for the user to specify entities, activities, or events of interest (for example: by typing natural language queries; by selecting entities from an automatically generated list of entities that satisfy user-specified requirements, such as entities that are prominently featured in the data sources over a user-specified time; by selecting sections of text while browsing an article; or by selecting events or topics from representations of automatically detected events/topics over a specified period of time).
  • Exemplary embodiments may automatically generate a page in response to the user query by adaptively building a template that best matches the user's inferred intention (for example: if the user selects a person who is a politician, the system would detect this fact and search for information on the election campaign, public appearances, statements, and public service history of the person; if the user selects a company, the system would search for recent news about the company, for information on the company's top officials, for press releases, etc.)
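  • By way of a non-limiting illustration, the adaptive template selection described above might be sketched in Python as follows; the entity types, tab names, and the infer_entity_type heuristic are hypothetical placeholders rather than the disclosed implementation:

```python
# Hypothetical sketch: choose a page template from an inferred entity type.
TEMPLATES = {
    "politician": ["Elections", "Public Appearances", "Statements", "Public Service History"],
    "company":    ["Recent News", "Top Officials", "Press Releases"],
    "event":      ["News Items", "Reactions", "Outcomes", "Related Events"],
}

def infer_entity_type(query_entity, known_attributes):
    # A deployed system would rely on the IE pipeline's output; this is a toy lookup.
    attrs = known_attributes.get(query_entity, set())
    if "holds_office" in attrs:
        return "politician"
    if "stock_ticker" in attrs:
        return "company"
    return "event"

def build_page_template(query_entity, known_attributes):
    entity_type = infer_entity_type(query_entity, known_attributes)
    return {"entity": query_entity, "type": entity_type, "tabs": TEMPLATES[entity_type]}

if __name__ == "__main__":
    print(build_page_template("Barack Obama", {"Barack Obama": {"holds_office"}}))
```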
  • if the user selects an event, the system may search for news items about the event, for reactions to the event, for outcomes of the event, and for related events.
  • the system may also automatically detect the entities involved in the event, such as people, countries, local governments, companies and organizations, and retrieve relevant information about these entities.
  • Exemplary embodiments may allow the user to track entities that appear on the produced page, including automatically producing a biography of a person from available data and listing recent actions by an organization automatically extracted from the available data.
  • Exemplary embodiments may allow the user to explore events or activities that appear on the page, including: automatically constructing a timeline of the salient moments in an ongoing event.
  • Exemplary embodiments may allow the user to explore the connections between entities and events (for example: providing information on the role of a company in an event, listing quotes by a person on a topic, describing the relation between two companies, summarizing meetings or contacts between two people) and optionally retrieving images of the desired entities.
  • a method for automatically extracting and organizing information by a processing device from a plurality of data sources is provided.
  • a natural language processing information extraction pipeline that includes an automatic detection of entities is applied to the data sources.
  • Information about detected entities is identified by analyzing products of the natural language processing pipeline.
  • Identified information is grouped into equivalence classes containing equivalent information.
  • At least one displayable representation of the equivalence classes is created.
  • An order in which the at least one displayable representation is displayed is computed.
  • a combined representation of the equivalence classes that respects the order in which the displayable representation is displayed is produced.
  • Each equivalence class may include a collection of items.
  • Each item may include a span of text extracted from a document, together with a specification of information about a desired entity derived from the span of text.
  • Computing an order in which the displayable representations are displayed may include randomly computing the order.
  • Grouping identified information into equivalence classes may include assigning each identified information to a separate equivalence class.
  • Grouping identified information into equivalence classes may include computing a representative instance of each equivalence class, ensuring that representative instances of different classes are not redundant with respect to each other, and ensuring that instances of each equivalence class are redundant with respect to the representative instance of the equivalence class.
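  • A minimal sketch of this grouping step, assuming a hypothetical is_redundant test and item structure (neither is specified here), might look like the following Python: every item is redundant with respect to its class representative, while representatives of different classes are not redundant with respect to each other.

```python
# Hypothetical sketch of grouping identified information into equivalence classes.
def is_redundant(item_a, item_b):
    # Toy criterion: two items are redundant if they express the same extracted fact.
    return item_a["fact"] == item_b["fact"]

def group_into_equivalence_classes(items):
    classes = []          # each class: {"representative": item, "members": [items]}
    for item in items:
        for cls in classes:
            if is_redundant(item, cls["representative"]):
                cls["members"].append(item)
                break
        else:                                  # no existing representative subsumes it
            classes.append({"representative": item, "members": [item]})
    return classes

if __name__ == "__main__":
    items = [
        {"span": "U.S. President Barack Obama", "fact": ("Obama", "president_of", "US")},
        {"span": "Barack Obama, the President of the United States", "fact": ("Obama", "president_of", "US")},
        {"span": "Michelle Obama and Barack Obama are married", "fact": ("Obama", "spouse_of", "M. Obama")},
    ]
    for cls in group_into_equivalence_classes(items):
        print(cls["representative"]["span"], "<-", len(cls["members"]), "member(s)")
```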
  • a method for processing information by a processing device is provided.
  • a user query is received.
  • a user query intention is inferred from the user query to develop an inferred user intention.
  • a page is automatically generated in response to the user query by adaptively building a template that corresponds to the inferred user intention using natural language processing of multiple modalities comprising at least one of text, audio, and video.
  • when the user query selects a person who has a political status, the political status may be searched, information on at least one of an election campaign, public appearances, statements, and public service history may be searched, and a page in response to the user query may be automatically generated.
  • Entities in the event and retrieved relevant information about the entities may be identified and searched.
  • a method for automatically extracting and organizing information by a processing device from a corpus of documents having multiple modalities of information in multiple languages for display to a user is provided.
  • the corpus of documents is browsed to identify and incrementally retrieve documents containing audio/video files.
  • Text from the audio/video files is transcribed to provide a textual representation.
  • Text of the textual representation that is in a foreign language is translated.
  • Desired information about at least one of entities, activities, and events is incrementally extracted. Extracted information is organized. Organized extracted information is converted into a navigable display presentable to the user.
  • Incrementally extracting desired information may include applying a natural language processing pipeline to each document to iterate over all entities detected in the corpus and identifying relation mentions and event mentions that involve a selected entity, wherein an entity is at least one of a physical animate object, a physical inanimate object, something that has a proper name, something that has a measurable physical property, a legal entity and abstract concepts, a mention is a span of text that refers to an entity, a relation is a connection between two entities, a relation mention is a span of text that describes a relation, and an event is a set of relations between two or more entities involving one or more actions.
  • Organizing extracted information may include iterating on all the entities identified in the corpus, dividing the information extracted about the entity into selected equivalence classes containing equivalent information, iterating on all the equivalence classes, selecting one item in each equivalence class to represent all items in the equivalence class, and recording information about the equivalence class and about a representative selected for use in producing the navigable display, wherein each equivalence class may include a collection of items, each item having a span of text extracted from a document, together with a specification of the information about the desired entity derived from the span of text.
  • Converting organized extracted information into a navigable display presentable to the user may include scoring the equivalence classes of information by assigning to the equivalence class at least one of a highest score of the pieces of information in the class, the average score of its members, the median score of its members, and the sum of the scores of its members, sorting the equivalence classes in descending order of score to prioritize an order in which the equivalence classes are displayed to the user, iterating over each equivalence class, constructing a displayable representation of an instance selected, and combining the displayable representations to produce a displayable representation of the equivalence classes.
  • the displayable representation may include a passage containing extracted information marked up with visual highlights.
  • a non-transitory computer program storage device embodying instructions executable by a processor to interactively display information about entities, activities and events from multiple-modality natural language sources.
  • An information extraction module includes instruction code for downloading document content from text and audio/video, for parsing the document content, for detecting mentions, for co-referencing, for cross-document co-referencing and for extracting relations.
  • An information gathering module includes instruction code for extracting acquaintances, biography and involvement in events from the information extraction module.
  • An information display module includes instruction code for displaying information from the information gathering module.
  • the information extraction module further may include instruction code for transcribing audio from video sources and for translating non-English transcribed audio into English text.
  • the information extraction module may include instruction code for clustering mentions under the same entity and for linking the entity clusters across documents.
  • the information gathering module may include instruction code for inputting a sentence and an entity and extracting specific information about the entity from the sentence.
  • the information display module may include instruction code for grouping results into non-redundant sets, sorting the sets, producing a brief description of each set, selecting a representative snippet for each set, highlighting the portions of the snippet that contain information pertaining to a specific tab, constructing navigation hyperlinks to other pages, and generating data used to graphically represent tab content.
  • a non-transitory computer program storage device embodying instructions executable by a processor to automatically extract and organize information from a plurality of data sources.
  • Instruction code is provided for applying to the data sources a natural language processing information extraction pipeline that includes an automatic detection of entities.
  • Instruction code is provided for identifying information about detected entities by analyzing products of the natural language processing pipeline.
  • Instruction code is provided for grouping identified information into equivalence classes containing equivalent information.
  • Instruction code is provided for creating at least one displayable representation of the equivalence classes.
  • Instruction code is provided for computing an order in which the at least one displayable representation is displayed.
  • Instruction code is provided for producing a combined representation of the equivalence classes that respects the order in which said displayable representation is displayed.
  • FIG. 1 depicts a sequence of operational steps in accordance with an exemplary embodiment
  • FIG. 2 depicts a sequence of operational steps in accordance with a portion of the operational steps of FIG. 1 ;
  • FIG. 3 depicts a sequence of operational steps in accordance with a portion of the operational steps of FIG. 2 ;
  • FIG. 4 depicts a sequence of operational steps in accordance with a portion of the operational steps of FIG. 1 ;
  • FIG. 5 depicts a sequence of operational steps in accordance with a portion of the operational steps of FIG. 1 ;
  • FIG. 6 depicts an exemplary entity page in accordance with an exemplary embodiment
  • FIGS. 7(a) and 7(b) depict exemplary entity pages for a news broadcasting application.
  • FIG. 8 depicts a program storage device and processor for executing a sequence of operational steps in accordance with an exemplary embodiment.
  • the term “document” may refer to a textual document irrespective of its format, to media files including streaming audio and video, and to hybrids of the above, such as web pages with embedded video and audio streams.
  • the term “corpus” refers to a formal or informal collection of multimedia documents, such as all the papers published in a scientific journal or all the English web pages published by news agencies in Arabic-speaking countries.
  • the term “entity” may refer to a physical animate object (e.g., a person), to a physical inanimate object (e.g., a building), to something that has a proper name (e.g., Mount Everest), to something that has a measurable physical property (e.g., a point in time or a span of time, a company, a township, a country), to a legal entity (e.g., a nation) and to abstract concepts, such as the unit of measurement and the measure of a physical property.
  • the term “mention” denotes a span of text that refers to an entity. Given a large structured set of documents, an entity may be associated with the collection of all of its mentions that appear in the structured set of documents, and, therefore, the term entity may also be used to denote such collection.
  • the term “relation” refers to a connection between two entities (e.g., Barack Obama is the president of the United States; Michelle Obama and Barack Obama are married).
  • a relation mention is a span of text that explicitly describes a relation. Thus, a relation mention involves two entity mentions.
  • the term “event” refers to a set of relations between two or more entities, involving one or more actions.
  • FIG. 1 shows an overview of an exemplary embodiment which may be applicable to a corpus of news documents consisting of web pages created by news agencies and containing multiple modalities of information in multiple languages.
  • Multimodal corpus 100 is browsed in a methodical automated manner (i.e., crawled) in Step 110 , wherein the multi-modal documents in the corpus are identified and incrementally retrieved. Such crawling can operate in an incremental fashion, in which case it would retrieve only documents that were not available during previous crawling operations.
  • Documents containing audio information such as audio files or video files with audio, are then analyzed by transcription at Step 120 . After Step 120 , a textual representation of all the multi-modal documents is available. Text in foreign languages is translated at translation step 130 .
  • the result is textual representation 140 of the multimodal corpus that contains documents in a desired language as well as their original version in their source language.
  • Textual representation 140 of the corpus is incrementally analyzed in Step 150 , which extracts desired information (information extraction (IE)) about entities, activities, and events.
  • the extracted information is organized in Step 160 , and the organized information is converted into a navigable display form that is presented to the user.
  • FIG. 2 shows an IE process, according to an exemplary embodiment, of Step 150 wherein information on entities, activities, and events are incrementally extracted.
  • Step 210 consists of applying a natural language processing pipeline to each document of the collection. The pipeline can be applied incrementally as new documents are added to the corpus.
  • Step 220 iterates over all entities detected in the corpus. Step 220 can be applied incrementally by iterating only on the entities detected in new documents as they are added to the corpus.
  • Step 230 identifies relation mentions extracted by Step 210 that involve the entity selected by Step 220 .
  • Step 240 identifies event mentions involving mentions of the entity selected by Step 220 .
  • Step 250 extracts information pertaining to the entity selected by Step 220 .
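  • The iteration of Steps 220 through 250 might be sketched as follows; the per-document entity, relation-mention, and event-mention structures are hypothetical simplifications of the pipeline's actual products:

```python
# Hypothetical sketch: iterate over detected entities and collect the relation
# mentions (Step 230) and event mentions (Step 240) that involve each one.
def gather_entity_information(documents):
    info = {}                                                       # entity id -> mentions
    entities = {e for doc in documents for e in doc["entities"]}    # Step 220
    for entity in entities:
        info[entity] = {
            "relation_mentions": [m for doc in documents
                                  for m in doc["relation_mentions"]
                                  if entity in m["arguments"]],     # Step 230
            "event_mentions":    [m for doc in documents
                                  for m in doc["event_mentions"]
                                  if entity in m["arguments"]],     # Step 240
        }
    return info                                                     # Step 250

if __name__ == "__main__":
    docs = [{"entities": {"E1"},
             "relation_mentions": [{"arguments": {"E1", "E2"}, "text": "E1 employed by E2"}],
             "event_mentions": []}]
    print(gather_entity_information(docs)["E1"]["relation_mentions"])
```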
  • FIG. 3 shows an example of natural language processing pipeline Step 210 as described in FIG. 2 .
  • Text Cleanup Step 310 removes from the text irrelevant characters, such as formatting characters, HyperText Markup Language (HTML) tags, and the like.
  • Tokenization Step 320 analyzes the cleaned-up text and identifies word and sentence boundaries.
  • Part-of-speech tagging Step 330 associates with each word a label that describes its grammatical function.
  • Mention detection Step 340 identifies in the tokenized text the mentions of entities and the words that denote the presence of events (called event anchors).
  • Parsing Step 350 extracts the hierarchical grammatical structure of each sentence, and typically represents it as a tree.
  • Semantic role labeling Step 360 identifies how each of the nodes in the tree extracted by parsing Step 350 is semantically related to each of the verbs in the sentence.
  • Co-reference resolution Step 370 identifies the entities to which the mentions produced by the mention detection 340 belong.
  • Relation extraction Step 380 detects relations between entity mention pairs and between entity mention and event anchors.
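  • A minimal sketch of such a pipeline as a chain of stages is shown below; only the first two stages are stubbed out, and the regular expressions stand in for the statistical components named in Steps 330 through 380:

```python
# Hypothetical sketch of the pipeline of FIG. 3 as a chain of stages.
import re

def text_cleanup(raw):                 # Step 310: strip HTML tags and the like
    return re.sub(r"<[^>]+>", " ", raw)

def tokenize(text):                    # Step 320: toy word/sentence boundaries
    return [s.split() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def run_pipeline(raw_document, stages):
    artifact = raw_document
    for stage in stages:               # Steps 330-380 would be appended here
        artifact = stage(artifact)
    return artifact

if __name__ == "__main__":
    doc = "<p>Barack Obama visited Ohio. He spoke about jobs.</p>"
    print(run_pipeline(doc, [text_cleanup, tokenize]))
```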
  • FIG. 4 shows an exemplary embodiment of organizing the information about entities according to Step 160 of FIG. 1 .
  • Step 410 iterates over all the entities identified in the corpus.
  • An incremental embodiment of Step 410 consists of iterating on all the entities identified in new documents as they are added to the corpus.
  • Step 420 divides the information extracted about the entity selected by iteration Step 410 into equivalence classes, containing equivalent or redundant information.
  • each equivalence class would consist of a collection of items, where each item consists of a span of text extracted from a document, together with a specification of the information about the desired entity derived from the span of text.
  • equivalence classes could be mutually exclusive or could overlap, wherein the same item could belong to one or more equivalence classes.
  • Step 430 iterates on the equivalence classes produced by Step 420 .
  • Step 440 would select one item in the class that best represents all the items in the class.
  • Selection criteria used by selection Step 440 can include, but not be limited to: selecting the most common span of text that appears in the equivalence class (for example, the span “U.S. President Barack Obama” is more common than “Barack Obama, the President of the United States”, and, according to this selection criterion, would be chosen as the representative span to describe the relationship of “Barack Obama” to the “United States”), selecting the span of text that conveys the largest amount of information (for example, “Barack Obama is the 44th and current President of the United States” conveys more information about the relationship between “Barack Obama” and the “United States” than “U.S. President Barack Obama”, and would be chosen as representative according to this criterion), and selecting the span of text with the highest score produced by the extraction Step 150 , if the step associates a score with its results.
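  • The selection criteria of Step 440 might be sketched as follows; measuring the amount of information conveyed by span length is a hypothetical simplification used only for illustration:

```python
# Hypothetical sketch of choosing a representative item for an equivalence class.
from collections import Counter

def select_representative(equivalence_class, criterion="most_common"):
    spans = [item["span"] for item in equivalence_class]
    if criterion == "most_common":
        # e.g., "U.S. President Barack Obama" if it is the most frequent span
        return Counter(spans).most_common(1)[0][0]
    if criterion == "most_informative":
        # length as a crude proxy for the amount of information conveyed
        return max(spans, key=len)
    if criterion == "highest_score":
        return max(equivalence_class, key=lambda item: item.get("score", 0.0))["span"]
    raise ValueError(f"unknown criterion: {criterion}")
```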
  • Step 450 records the information about the equivalence class and about the representative selected by Step 440 , so that the information can be used by the subsequent Step 170 of FIG. 1 .
  • the method shown in FIG. 4 can be adapted to the case in which equivalence classes can overlap and it is still desirable to select distinct representatives for different classes, for example, by means of an optimization procedure that would combine one or more of the selection criteria listed above or of equivalent selection criteria with a dissimilarity measure that would favor the choice of distinct representatives for overlapping equivalence classes.
  • an individual instance of extracted information may consist of a span (equivalently, a passage) from a document together with a specification of the information extracted about a desired entity from the span.
  • a specification can consist of a collection of attribute-value pairs, a collection of Resource Description Framework (RDF) triples, a set of relations in a relational database, and the like.
  • the specification can be represented using a description language, such as Extensible Markup Language (XML), using the RDF representation language, using a database, and the like.
  • Step 420 may consist of identifying groups of instances of extracted information satisfying two conditions: the first being that each group contains at least one instance (main instance) given which all other instances in the group are redundant; the second being that main instances of separate groups are not redundant with respect to each other. This result can be accomplished using a traditional clustering algorithm or an incremental clustering algorithm.
  • FIG. 5 shows an exemplary embodiment of a method of Step 170 of FIG. 1 for constructing a displayable representation of the information pertaining to an entity and collected according to the method described in FIG. 4 .
  • In Step 510, the equivalence classes of information produced by Step 420 are scored, for example, by assigning to the equivalence class the highest score of the pieces of information in the class.
  • other quantities can be used as the score of the equivalence class, for example: the average score of its members, the median score of its members, the sum of the scores of its members, and the like.
  • the score is used to prioritize the order in which the equivalence classes are displayed to the user.
  • Step 520 sorts the equivalence classes in descending order of score.
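  • A minimal sketch of the scoring and sorting of Steps 510 and 520, assuming each class carries a list of scored members (the field names are hypothetical):

```python
# Hypothetical sketch: score equivalence classes and sort them for display.
import statistics

AGGREGATORS = {
    "max":    max,
    "mean":   statistics.mean,
    "median": statistics.median,
    "sum":    sum,
}

def rank_equivalence_classes(classes, how="max"):
    aggregate = AGGREGATORS[how]
    scored = [(aggregate(item["score"] for item in cls["members"]), cls)
              for cls in classes]                          # Step 510
    scored.sort(key=lambda pair: pair[0], reverse=True)    # Step 520
    return scored
```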
  • Step 530 selects each equivalence class.
  • Step 550 constructs a displayable representation of the instance selected from the equivalence class.
  • the displayable representation consists of the passage containing the extracted information, appropriately marked up with visual highlights. Such visual highlights may include color to differentiate the extracted information. Additionally, the displayable representation could include visual cues to easily identify other entities for which an information page exists.
  • Step 560 combines the representations produced by Step 550 to produce a displayable representation of the equivalence class.
  • this step consists of displaying the representative instance of the equivalence class and providing means for displaying the other members, for instance, by providing links to the representation of these members.
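  • Steps 550 and 560 might be sketched as follows; the HTML-style markup and the representative/members class structure are hypothetical choices made only for illustration:

```python
# Hypothetical sketch: render each equivalence class as its representative passage
# with the extracted span highlighted, plus links to the remaining members.
def render_class(cls):
    rep = cls["representative"]
    highlighted = rep["passage"].replace(
        rep["span"], f"<mark>{rep['span']}</mark>")          # visual highlight (Step 550)
    others = [f'<a href="#item-{i}">supporting result {i}</a>'
              for i, _ in enumerate(cls["members"][1:], start=1)]
    return highlighted + ("" if not others else " [" + ", ".join(others) + "]")

def render_page(classes_in_priority_order):
    # Combine the per-class representations in priority order (Step 560).
    return "\n".join(render_class(cls) for cls in classes_in_priority_order)
```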
  • FIG. 6 shows an exemplary page describing an entity, i.e., an Entity Page (EP).
  • the page is divided into a left and a right part.
  • the two frames in the left part contain a picture and biographical information automatically extracted from the Wikipedia internet encyclopedia or from another source of reliable information, respectively.
  • the right part contains a set of tabs that organize relevant small pieces (snippets) of text by the kind of information they convey.
  • the content in each tab is the output of a series of information extraction modules which are described in further detail below.
  • Each tab also shows a graphical summary of its content.
  • Table 1, shown below, summarizes the information conveyed by the snippets of text in each tab.
  • TABLE 1
        Entity Type   Tab Title               Description
        Person        Affiliations            Describe affiliations of the person to companies, organizations, governments, agencies, etc.
        Person        Statements              Report statements made by the person on any topic
        Person        Actions                 Describe the actions of the person
        Person        Related People          List acquaintances of the person
        Person        Locations               List places and locations visited by the person
        Person        Elections               Describe the election campaign of the person
        Person        Involvement in Events   Describe events in which the person is involved
        ORG & GPE     Actions                 Describe actions of the organization or of official representatives
        ORG & GPE     Related Orgs            Describe related organizations, such as subsidiaries
        ORG & GPE     Associated People       List people associated with the ORG/GPE
        ORG & GPE     Statements              Report statements released by the organization or made by representatives
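  • For illustration, the layout of Table 1 could be captured as a configuration mapping that a display layer iterates over when assembling an Entity Page; the dictionary keys and tuple layout are hypothetical:

```python
# Hypothetical configuration: entity type -> (tab title, description) pairs.
ENTITY_PAGE_TABS = {
    "PERSON": [
        ("Affiliations",          "Affiliations to companies, organizations, governments, agencies"),
        ("Statements",            "Statements made by the person on any topic"),
        ("Actions",               "Actions of the person"),
        ("Related People",        "Acquaintances of the person"),
        ("Locations",             "Places and locations visited by the person"),
        ("Elections",             "Election campaign of the person"),
        ("Involvement in Events", "Events in which the person is involved"),
    ],
    "ORG_GPE": [
        ("Actions",           "Actions of the organization or of official representatives"),
        ("Related Orgs",      "Related organizations, such as subsidiaries"),
        ("Associated People", "People associated with the ORG/GPE"),
        ("Statements",        "Statements released by the organization or made by representatives"),
    ],
}
```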
  • these information extraction modules are referred to as Information Gathering Modules (IGMs).
  • a typical IGM is based upon a machine learning model, further described below.
  • Each IGM also associates a relevance score with each snippet.
  • the tab content is assembled and visualized by Information Display Modules (IDMs).
  • IDMs To visualize each equivalence class, IDMs produce a title, which is a short representation of the information it conveys, and select a representative snippet. They highlight the portions of the representative snippet that contain the information of interest to the tab, and create links to pages of other entities mentioned in the snippets. Additional sentences in the equivalence class are shown by clicking a link marked “Additional Supporting Results . . . ”. Since news agencies often reuse the same sentences over time, such sentences are available by clicking “Other Identical Results”.
  • IDMs create the data used to produce a visual summary of the content in the selected tab, shown in the rightmost frame of the top half of the GUI.
  • in one exemplary embodiment, this visualization is a network of relationships.
  • in another exemplary embodiment, it is a cloud of the content words in the tab.
  • the interface is not only useful for an analyst tracking an entity in the news, but also for financial analysts following news about a company, or web users getting daily updates of the news.
  • the redundancy detection and systematic organization of information makes the content easy to digest.
  • entities can be highlighted in articles, as depicted in FIG. 7(a), and those entities for which an EP exists (i.e., there are relevant snippets for at least one tab) are hyperlinked to the EP. Users can also arrive at the EP by viewing a searchable list of entities in alphabetic order, or by frequency in the news as depicted in FIG. 7(b).
  • FIG. 8 shows an overview of an exemplary embodiment of program storage device 600 wherein instruction code contained therein for an IE, IGM and IDM are depicted.
  • Processor 700 executes the instruction code stored in program storage device 600 .
  • a crawler as previously described can periodically download new content from a set of English and Arabic text and video sites into documents 610.
  • Audio from video sources can be segmented into chunks of 2-minute intervals and then transcribed.
  • Arabic can be translated into English using a state-of-the-art machine translation system.
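  • A minimal sketch of the 2-minute segmentation mentioned above, computing chunk boundaries that a transcription system could process independently (the transcription and translation systems themselves are outside this sketch):

```python
# Hypothetical sketch: split an audio track of known duration into 2-minute chunks.
def two_minute_chunks(duration_seconds, chunk_seconds=120):
    starts = range(0, int(duration_seconds), chunk_seconds)
    return [(s, min(s + chunk_seconds, duration_seconds)) for s in starts]

if __name__ == "__main__":
    print(two_minute_chunks(400))    # [(0, 120), (120, 240), (240, 360), (360, 400)]
```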
  • Table 2 lists the average number of documents from each modality-language pair on a daily basis.
  • Each new textual document 610 may be analyzed by the IE pipeline 620 .
  • the first step after tokenization is parsing, followed by mention detection.
  • mentions are clustered by a within-document co-reference-resolution algorithm.
  • “Washington” and “White House” are grouped under the same entity (the USA), and “Leon Edward Panetta” and “Leon Panetta” under the same person (Secretary of Defense). Nominal and pronominal mentions are also added to the clusters.
  • a cross-document co-reference system then links the entity clusters across documents.
  • each cluster is linked to the knowledge base (KB) used in the Text Analysis Conference (TAC) Entity Linking task, which is derived from a subset of the Wikipedia Internet encyclopedia. If a match in the KB is found, the cluster is assigned the KB ID of the match, which allows for the cross-referencing of entities across documents. Besides exact match with titles in the KB, the cross-document co-reference system uses soft match features and context information to match against spelling variations and alternate names. The system also disambiguates between entities with identical names. The next IE component extracts relations between the entities in the document, such as employed by, son of, etc.
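  • The exact-then-soft matching against the KB might be sketched as follows; the toy KB contents and the string-similarity measure are hypothetical stand-ins for the TAC-style KB and the soft-match features described above:

```python
# Hypothetical sketch: link a within-document entity cluster to a KB entry,
# first by exact alias match, then by a soft string-similarity match.
from difflib import SequenceMatcher

KB = {"KB001": ["Leon Panetta", "Leon Edward Panetta"],
      "KB002": ["United States", "Washington", "White House"]}

def link_cluster(cluster_names, threshold=0.8):
    # Exact match against KB titles and aliases.
    for kb_id, aliases in KB.items():
        if any(name in aliases for name in cluster_names):
            return kb_id
    # Soft match: best string similarity above a threshold.
    best_id, best_sim = None, 0.0
    for kb_id, aliases in KB.items():
        for alias in aliases:
            for name in cluster_names:
                sim = SequenceMatcher(None, name.lower(), alias.lower()).ratio()
                if sim > best_sim:
                    best_id, best_sim = kb_id, sim
    return best_id if best_sim >= threshold else None    # None: unlinked (NIL)

if __name__ == "__main__":
    print(link_cluster(["Leon E. Panetta"]))    # soft match above threshold -> KB001
```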
  • the mention detection, co-reference and relation extraction modules are trained on an internally annotated set of 1301 documents labeled according to the Knowledge from Language Understanding and Extraction (KLUE) 2 ontology. On a development set of 33 documents, these components achieve an F1 of 71.6%, 83.7% and 65% respectively.
  • the entity linking component is unsupervised and achieves an accuracy of 73% on the TAC-2009 person queries.
  • Annotated documents are then analyzed by the IGMs 630 and IDMs 640 described above.
  • an IGM takes as input a sentence and an entity, and extracts specific information about that entity from the sentence. For example, a specific IGM may detect whether a family relation of a given person is mentioned in the input sentence.
  • a partial list of IGMs and the description of the extracted content is shown in Table 1.
  • the output of the IGMs is then analyzed by IDMs, which assemble the content of the GUI tabs. These tabs either correspond to a question template from a pilot program or are derived from the above-mentioned relations.
  • IDMs For each entity, IDMs selectively choose annotations produced by IGMs, group them into equivalence classes, rank the equivalence classes to prioritize the information displayed to the user, and assemble the content of the tab.
  • IGMs and IDMs are described in still further detail below.
  • IGMs extract specific information pertaining to a given entity from a specific sentence in two stages: First, they detect whether the snippet contains relevant information. Then they identify information nuggets.
  • Snippet relevance detection relies on statistical classifiers, trained on three corpora produced as part of the pilot program: i) data provided by Linguistic Data Consortium (LDC) to the pilot program teams during the early years of the program; ii) data provided by BAE Systems; and iii) internally annotated data.
  • the data consist of queries and snippets with binary relevance annotation.
  • the LDC and internally annotated data were specifically developed for training and testing purposes, while the BAE data also include queries from yearly evaluations, the answers provided by the teams that participated in the evaluations, and the official judgments of the answers.
  • the statistical models are maximum entropy classifiers or averaged perceptrons chosen based on empirical performance.
  • Table 3 summarizes the performance of the models used on the year-4 unsequestered queries, run against an internally generated development set.
  • the “TN” column denotes a template number.
  • IGMs analyze snippets selected by the template models and extract the information used by the IDMs to assemble and visualize the results. This step is called “Information Nugget Extraction”, where an information nugget is an atomic answer to a specific question. Extracted nuggets include the focus of the answer (e.g., the location visited by a person), the supporting text (a subset of the snippet), a summary of the answer (taken from the snippet or automatically generated). Different modules extract specific types of nuggets. These modules can be simple rule-based systems or full statistical models. Each tab uses a different set of nugget extractors, which can be easily assembled and configured to produce customized versions of the system.
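  • A two-stage IGM might be sketched as follows; the plain perceptron, toy features, and rule-based location extractor are hypothetical simplifications of the maximum entropy / averaged perceptron classifiers and nugget extractors described above:

```python
# Hypothetical sketch: (1) decide snippet relevance, (2) extract a nugget.
import re
from collections import defaultdict

class PerceptronRelevance:
    def __init__(self):
        self.w = defaultdict(float)

    def features(self, snippet, entity):
        feats = {f"word={t}": 1.0 for t in snippet.lower().split()}
        feats["entity_mentioned"] = 1.0 if entity.lower() in snippet.lower() else 0.0
        return feats

    def score(self, snippet, entity):
        return sum(self.w[f] * v for f, v in self.features(snippet, entity).items())

    def train(self, examples, epochs=5):
        for _ in range(epochs):
            for snippet, entity, label in examples:      # label: +1 relevant, -1 not
                if label * self.score(snippet, entity) <= 0:
                    for f, v in self.features(snippet, entity).items():
                        self.w[f] += label * v

def extract_location_nugget(snippet, entity):
    # Toy nugget extractor: "ENTITY visited X" -> X is the location nugget focus.
    m = re.search(re.escape(entity) + r"\s+visited\s+([A-Z]\w+)", snippet)
    return {"focus": m.group(1), "support": snippet} if m else None

if __name__ == "__main__":
    clf = PerceptronRelevance()
    clf.train([("Leon Panetta visited Kabul", "Leon Panetta", +1),
               ("The stock market rose on Tuesday", "Leon Panetta", -1)])
    snippet = "Leon Panetta visited Kabul"
    if clf.score(snippet, "Leon Panetta") > 0:
        print(extract_location_nugget(snippet, "Leon Panetta"))
```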
  • IDMs use the information produced by IGMs to visualize the results. This involves grouping results into non-redundant sets, sorting the sets, producing a brief description of each set, selecting a representative snippet for each set, highlighting the portions of the snippet that contain information pertaining to the specific tab, constructing navigation hyperlinks to other pages, and generating the data used to graphically represent the tab content.
  • IGMs produce results in a generic format that supports a well-defined Application Program Interface (API).
  • IDMs query this API to retrieve selected IGM products.
  • a configuration file specifies which IGM products to use for redundancy detection. For example, the content of the Affiliations tab for persons (see Table 1) is constructed from automatic content extraction (ACE)-style relations.
  • the configuration file instructs the IDM to use the relation type and the KB-ID of the affiliated entity for redundancy reduction.
  • Redundancy detection groups results into equivalence classes.
  • Each class contains unique values of the IGM products specified in the configuration file.
  • IDMs can further group classes into superclasses or split the equivalence classes according to the values of IGM products. For example, they can partition the equivalence classes according to the date of the document containing the information.
  • the resulting groups of documents constitute the unit of display.
  • IDMs assign a score to each of these groups, for example, using a function of the score of the individual snippets and of the number of results in the group or in the equivalence class.
  • the groups are sorted by score, and the highest scoring snippet is selected as representative for the group.
  • Each group is then visualized as a section in the tab, with a title that is constructed using selected IGM products.
  • the score of the group is also optionally shown.
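  • The redundancy detection, scoring, sorting, and representative selection performed by an IDM might be sketched as follows; the field names and the scoring function are hypothetical examples of the configuration-driven behavior described above:

```python
# Hypothetical sketch: group IGM results by configured products, score and sort
# the groups, and pick the highest-scoring snippet as each group's representative.
from collections import defaultdict

def assemble_tab(results, key_fields=("relation_type", "kb_id")):
    groups = defaultdict(list)
    for r in results:
        groups[tuple(r[k] for k in key_fields)].append(r)     # redundancy detection
    sections = []
    for key, members in groups.items():
        # Example score: best snippet score, boosted by the number of results.
        score = max(m["score"] for m in members) * (1 + 0.1 * (len(members) - 1))
        rep = max(members, key=lambda m: m["score"])           # representative snippet
        sections.append({"title": " / ".join(map(str, key)),
                         "score": round(score, 3),
                         "representative": rep["snippet"],
                         "supporting": len(members) - 1})
    sections.sort(key=lambda s: s["score"], reverse=True)      # display order
    return sections
```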
  • the text of the representative snippet containing the evidence for the relevant information is highlighted in yellow. Named mentions are linked to the corresponding page, if available, and links to different views of the document are provided.
  • Each tab is associated with a graphical representation that summarizes its content, and that is shown in the rightmost section of the top half of the GUI of FIG. 6 .
  • This visualization is generated dynamically by invoking an application on a server when the tab is visualized.
  • Exemplary embodiments of the system can support three different visualizations: a word cloud, and two styles of graphs that show connections between entities.
  • a configuration file instructs the IDMs on which IGM products contain the information to be shown in the graphical representation. This information is then formatted to comply with the API of the program that dynamically constructs the visualization.
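  • Preparing the data behind a word-cloud visualization of a tab might be sketched as follows; the stopword list and weighting are hypothetical choices:

```python
# Hypothetical sketch: count content words across a tab's snippets and emit
# (word, weight) pairs for the visualization component.
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "on", "for", "is", "are"}

def word_cloud_data(snippets, top_n=25):
    words = (w for s in snippets for w in re.findall(r"[a-z]+", s.lower())
             if w not in STOPWORDS)
    counts = Counter(words)
    total = sum(counts.values()) or 1
    return [(word, count / total) for word, count in counts.most_common(top_n)]

if __name__ == "__main__":
    print(word_cloud_data(["Obama visited Kabul", "Obama spoke in Kabul"]))
```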
  • the exemplary embodiments described above can utilize natural language processing methods well known in the art.
  • a fundamental reference is the book "Foundations of Statistical Natural Language Processing" by Manning and Schütze, which covers the main techniques that form such methods.
  • Constructing language models based on co-occurrences is taught in Chapter 6. Identifying the sense of words using their context, called word-sense disambiguation is taught in Chapter 7. Recognizing the grammatical type of words in a sentence, called part-of-speech tagging, is taught in Chapter 9. Recognizing the grammatical structure of a sentence, called parsing, is taught in Chapter 11. Automatically translating from a source language to a destination language is taught in Chapter 13. The main topics of Information Retrieval are taught in Chapter 15. Automatic methods for text categorization are taught in Chapter 16.
  • named entities form a key aspect of news documents and one is often interested in tracking stories about a person (e.g., Leon Panetta), an organization (e.g., Apple Inc.), or a geopolitical entity (GPE) (e.g., the United States).
  • exemplary embodiments described above provide a system that automatically constructs summary pages for named entities from news data.
  • the EP page describing an entity is organized into sections that answer specific questions about that entity, such as Biographical Information, Statements made, Acquaintances, Actions, and the like. Each section contains snippets of text that support the facts automatically extracted from the corpus.
  • Redundancy detection yields a concise summary with only novel and useful snippets being presented in the default display.
  • the system can be implemented using a variety of sources, and shows information extracted not only from English newswire text, but also from machine-translated text and automatically transcribed audio.
  • the exemplary embodiments described above provide a system that organizes and summarizes the content in a systematic way that is useful to the user.
  • the system is not limited to a bag-of-words search, but uses deeper NLP technology to detect mentions of named entities, to resolve co-reference (both within a document and across documents), and to mine relationships such as employed by, spouse of, subsidiary of, etc., from the text.
  • the framework is highly scalable and can generate a summary for every entity that appears in the news in real time.
  • the flexible architecture of the system allows it to be quickly adapted to domains other than news, such as collections of scientific papers where the entities of interest are authors, institutions, and countries.
  • exemplary embodiments may take the form of an embodiment combining software and hardware aspects that may all generally be referred to as a “processor”, “circuit,” “module” or “system.”
  • exemplary implementations may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code stored thereon.
  • the computer-usable or computer-readable medium may be a computer readable storage medium.
  • a computer readable storage medium may be, for example but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, device, or any suitable combination of the foregoing.
  • a computer-readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus or device.
  • Computer program code for carrying out operations of the exemplary embodiments may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages.
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
  • the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
  • the computer program instructions may be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
  • processor as used herein is intended to include any processing device, such as, for example, one that includes a central processing unit (CPU) and/or other processing circuitry (e.g., digital signal processor (DSP), microprocessor, etc.). Additionally, it is to be understood that the term “processor” may refer to more than one processing device, and that various elements associated with a processing device may be shared by other processing devices.
  • memory as used herein is intended to include memory and other computer-readable media associated with a processor or CPU, such as, for example, random access memory (RAM), read only memory (ROM), fixed storage media (e.g., a hard drive), removable storage media (e.g., a diskette), flash memory, etc.
  • I/O circuitry as used herein is intended to include, for example, one or more input devices (e.g., keyboard, mouse, etc.) for entering data to the processor, and/or one or more output devices (e.g., printer, monitor, etc.) for presenting the results associated with the processor.
  • each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s).
  • the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.

Abstract

A method for automatically extracting and organizing information by a processing device from a plurality of data sources is provided. A natural language processing information extraction pipeline that includes an automatic detection of entities is applied to the data sources. Information about detected entities is identified by analyzing products of the natural language processing pipeline. Identified information is grouped into equivalence classes containing equivalent information. At least one displayable representation of the equivalence classes is created. An order in which the at least one displayable representation is displayed is computed. A combined representation of the equivalence classes that respects the order in which the displayable representation is displayed is produced.

Description

    STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH OR DEVELOPMENT
  • This invention was made with Government support under Contract No.: HR0011-08-C-0110 (awarded by Defense Advanced Research Project Agency) (DARPA). The Government has certain rights in this invention.
  • BACKGROUND
  • 1. Technical Field
  • The present disclosure relates to information technology, and, more particularly, to natural language processing (NLP) systems.
  • 2. Discussion of Related Art
  • News agencies, bloggers, Twitter users, scientific journals and conferences all produce extremely large amounts of unstructured data in textual, audio, and video form. Large amounts of such unstructured data and information can be gathered from multiple modalities in multiple languages, e.g., internet text, audio, and video sources. There is a need for analyzing the information and producing a compact representation of: 1) information such as actions of specific entities (e.g., persons, organizations, countries); 2) activities (e.g., the presidential election campaign); and 3) events (e.g., the death of a celebrity). Currently, such representations can be produced manually, but this solution is not cost-effective and requires skilled workers, especially when the information is gathered from multiple languages. Such manually produced representations are also generally not scalable.
  • BRIEF SUMMARY
  • Exemplary embodiments of the present disclosure provide methods for automatically extracting and organizing data such that a user can interactively explore information about entities, activities, and events.
  • In accordance with exemplary embodiments information may be automatically extracted in real time from multiple modalities and multiple languages and displayed in a navigable and compact representation of the retrieved information.
  • Exemplary embodiments may use natural language processing techniques to automatically analyze information from multiple sources, in multiple modalities, and in multiple languages, including, but not limited to, web pages, blogs, newsgroups, radio feeds, video, and television.
  • Exemplary embodiments may use the output of automatic machine translation systems that translate foreign language sources into the language of the user, and use the output from automatic speech transcription systems that convert video and audio feeds into text.
  • Exemplary embodiments may use natural language processing techniques including information extraction tools, question answering tools, and distillation tools, to automatically analyze the text produced as described above and extract searchable and summarizable information. The system may perform named-entity detection, cross-document co-reference resolution, relation detection, and event detection and tracking.
  • Exemplary embodiments may use automatic relevance detection techniques and redundancy reduction methods to provide the user with relevant and non-redundant information.
  • Exemplary embodiments may display the desired information in a compact and navigable representation by providing means for the user to specify entities, activities, or events of interest (for example: by typing natural language queries; by selecting entities from an automatically generated list of entities that satisfy user-specified requirements, such as entities that are prominently featured in the data sources over a user-specified time; by selecting sections of text while browsing an article; or by selecting events or topics from representations of automatically detected events/topics over a specified period of time).
  • Exemplary embodiments may automatically generate a page in response to the user query by adaptively building a template that best matches the user's inferred intention (for example: if the user selects a person who is a politician, the system would detect this fact and search for information on the election campaign, public appearances, statements, and public service history of the person; if the user selects a company, the system would search for recent news about the company, for information on the company's top officials, for press releases, etc.)
  • In accordance with exemplary embodiments, if the user selects an event, the system may search for news items about the event, for reactions to the event, for outcomes of the event, and for related events. The system may also automatically detect the entities involved in the event, such as people, countries, local governments, companies and organizations, and retrieve relevant information about these entities.
  • Exemplary embodiments may allow the user to track entities that appear on the produced page, including automatically producing a biography of a person from available data and listing recent actions by an organization automatically extracted from the available data.
  • Exemplary embodiments may allow the user to explore events or activities that appear on the page, including: automatically constructing a timeline of the salient moments in an ongoing event.
  • Exemplary embodiments may allow the user to explore the connections between entities and events (for example: providing information on the role of a company in an event, listing quotes by a person on a topic, describing the relation between two companies, summarizing meetings or contacts between two people) and optionally retrieving images of the desired entities.
  • According to an exemplary embodiment, a method for automatically extracting and organizing information by a processing device from a plurality of data sources is provided. A natural language processing information extraction pipeline that includes an automatic detection of entities is applied to the data sources. Information about detected entities is identified by analyzing products of the natural language processing pipeline. Identified information is grouped into equivalence classes containing equivalent information. At least one displayable representation of the equivalence classes is created. An order in which the at least one displayable representation is displayed is computed. A combined representation of the equivalence classes that respects the order in which the displayable representation is displayed is produced.
  • Each equivalence class may include a collection of items. Each item may include a span of text extracted from a document, together with a specification of information about a desired entity derived from the span of text.
  • Computing an order in which the displayable representations are displayed may include randomly computing the order.
  • Grouping identified information into equivalence classes may include assigning each identified information to a separate equivalence class.
  • Grouping identified information into equivalence classes may include computing a representative instance of each equivalence class, ensuring that representative instances of different classes are not redundant with respect to each other, and ensuring that instances of each equivalence class are redundant with respect to the representative instance of the equivalence class.
  • According to an exemplary embodiment, a method for processing information by a processing device is provided. A user query is received. A user query intention is inferred from the user query to develop an inferred user intention. A page is automatically generated in response to the user query by adaptively building a template that corresponds to the inferred user intention using natural language processing of multiple modalities comprising at least one of text, audio, and video.
  • When the user query selects a person who has a political status, the political status may be searched, information on at least one of an election campaign, public appearances, statements, and public service history, may be searched, and a page in response to the user query may be automatically generated.
  • When the user query selects a company, information on at least one of recent news about the company, information on the company's top officials, and press releases for the company may be searched, and a page in response to the user query may be automatically generated.
  • When the user query selects an event, information on at least one of news items about the event and reactions to the event may be searched, and a page in response to the user query may be automatically generated.
  • Entities in the event may be identified, and relevant information about the entities may be retrieved and searched.
  • According to an exemplary embodiment, a method for automatically extracting and organizing information by a processing device from a corpus of documents having multiple modalities of information in multiple languages for display to a user is provided. The corpus of documents is browsed to identify and incrementally retrieve documents containing audio/video files. Text from the audio/video files is transcribed to provide a textual representation. Text of the textual representation that is in a foreign language is translated. Desired information about at least one of entities, activities, and events is incrementally extracted. Extracted information is organized. Organized extracted information is converted into a navigable display presentable to the user.
  • Incrementally extracting desired information may include applying a natural language processing pipeline to each document to iterate all entities detected in the corpus and identifying relation mentions and event mentions that involve a selected entity, wherein an entity is at least one of a physical animate object, a physical inanimate object, something that has a proper name, something that has a measurable physical property, a legal entity and abstract concepts, a mention is a span of text that refers to an entity, a relation is a connection between two entities, a relation mention is a span of text that describes a relation, and an event is a set of relations between two or more entities involving one or more actions.
  • Organizing extracted information may include iterating on all the entities identified in the corpus, dividing the information extracted about the entity into selected equivalence classes containing equivalent information, iterating on all the equivalence classes, selecting one item in each equivalence class to represent all items in the equivalence class, and recording information about the equivalence class and about a representative selected for use in producing the navigable display, wherein each equivalence class may include a collection of items, each item having a span of text extracted from a document, together with a specification of the information about the desired entity derived from the span of text.
  • Converting organized extracted information into a navigable display presentable to the user may include scoring the equivalence classes of information by assigning to the equivalence class at least one of a highest score of the pieces of information in the class, the average score of its members, the median score of its members, and the sum of the scores of its members, sorting the equivalence classes in descending order of score to prioritize an order in which the equivalence classes are displayed to the user, iterating over each equivalence class, constructing a displayable representation of a selected instance, and combining the displayable representations to produce a displayable representation of the equivalence classes.
  • The displayable representation may include a passage containing extracted information marked up with visual highlights.
  • According to an exemplary embodiment, a non-transitory computer program storage device embodying instructions executable by a processor to interactively display information about entities, activities and events from multiple-modality natural language sources is provided. An information extraction module includes instruction code for downloading document content from text and audio/video, for parsing the document content, for detecting mentions, for co-referencing, for cross-document co-referencing and for extracting relations. An information gathering module includes instruction code for extracting acquaintances, biography and involvement in events from the information extraction module. An information display module includes instruction code for displaying information from the information gathering module.
  • The information extraction module further may include instruction code for transcribing audio from video sources and for translating non-English transcribed audio into English text.
  • The information extraction module may include instruction code for clustering mentions under the same entity and for linking the entity clusters across documents.
  • The information gathering module may include instruction code for inputting a sentence and an entity and extracting specific information about the entity from the sentence.
  • The information display module may include instruction code for grouping results into non-redundant sets, sorting the sets, producing a brief description of each set, selecting a representative snippet for each set, highlighting the portions of the snippet that contain information pertaining to a specific tab, constructing navigation hyperlinks to other pages, and generating data used to graphically represent tab content.
  • According to an exemplary embodiment, a non-transitory computer program storage device embodying instructions executable by a processor to automatically extract and organize information from a plurality of data sources, is provided. Instruction code is provided for applying to the data sources a natural language processing information extraction pipeline that includes an automatic detection of entities. Instruction code is provided for identifying information about detected entities by analyzing products of the natural language processing pipeline. Instruction code is provided for grouping identified information into equivalence classes containing equivalent information. Instruction code is provided for creating at least one displayable representation of the equivalence classes. Instruction code is provided for computing an order in which the at least one displayable representation is displayed. Instruction code is provided for producing a combined representation of the equivalence classes that respects the order in which said displayable representation is displayed.
  • BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS
  • Exemplary embodiments will be more clearly understood from the following detailed description taken in conjunction with the accompanying drawings in which:
  • FIG. 1 depicts a sequence of operational steps in accordance with an exemplary embodiment;
  • FIG. 2 depicts a sequence of operational steps in accordance with a portion of the operational steps of FIG. 1;
  • FIG. 3 depicts a sequence of operational steps in accordance with a portion of the operational steps of FIG. 2;
  • FIG. 4 depicts a sequence of operational steps in accordance with a portion of the operational steps of FIG. 1;
  • FIG. 5 depicts a sequence of operational steps in accordance with a portion of the operational steps of FIG. 1;
  • FIG. 6 depicts an exemplary entity page in accordance with an exemplary embodiment;
  • FIGS. 7(a) and 7(b) depict exemplary entity pages for a news broadcasting application; and
  • FIG. 8 depicts a program storage device and processor for executing a sequence of operational steps in accordance with an exemplary embodiment.
  • DETAILED DESCRIPTION
  • Reference will now be made in more detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to like elements throughout.
  • In the exemplary embodiments, the term “document” may refer to a textual document irrespective of its format, to media files including streaming audio and video, and to hybrids of the above, such as web pages with embedded video and audio streams.
  • In the exemplary embodiments, the term “corpus” refers to a formal or informal collection of multimedia documents, such as all the papers published in a scientific journal or all the English web pages published by news agencies in Arabic-speaking countries.
  • In the exemplary embodiments, the term “entity” may refer to a physical animate object (e.g., a person), to a physical inanimate object (e.g., a building), to something that has a proper name (e.g., Mount Everest), to something that has a measurable physical property (e.g., a point in time or a span of time, a company, a township, a country), to a legal entity (e.g., a nation) and to abstract concepts, such as the unit of measurement and the measure of a physical property.
  • In the exemplary embodiments, the term “mention” denotes a span of text that refers to an entity. Given a large structured set of documents, an entity may be associated with the collection of all of its mentions that appear in the structured set of documents, and, therefore, the term entity may also be used to denote such collection.
  • In the exemplary embodiments, the term “relation” refers to a connection between two entities (e.g., Barack Obama is the president of the United States; Michelle Obama and Barack Obama are married). A relation mention is a span of text that explicitly describes a relation. Thus, a relation mention involves two entity mentions.
  • In the exemplary embodiments, the term “event” refers to a set of relations between two or more entities, involving one or more actions.
  • FIG. 1 shows an overview of an exemplary embodiment which may be applicable to a corpus of news documents consisting of web pages created by news agencies and containing multiple modalities of information in multiple languages. Multimodal corpus 100 is browsed in a methodical automated manner (i.e., crawled) in Step 110, wherein the multi-modal documents in the corpus are identified and incrementally retrieved. Such crawling can operate in an incremental fashion, in which case it would retrieve only documents that were not available during previous crawling operations. Documents containing audio information, such as audio files or video files with audio, are then analyzed by transcription at Step 120. After Step 120, a textual representation of all the multi-modal documents is available. Text in foreign languages is translated at translation step 130. The result is textual representation 140 of the multimodal corpus that contains documents in a desired language as well as their original version in their source language.
  • Textual representation 140 of the corpus is incrementally analyzed in Step 150, which extracts desired information (information extraction (IE)) about entities, activities, and events. The extracted information is organized in Step 160, and the organized information is converted into a navigable display form that is presented to the user.
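  • As an illustration only, the end-to-end flow of FIG. 1 can be sketched in Python as a composition of pluggable stages. The stage callables and the document fields ("audio", "language", "text") below are hypothetical placeholders standing in for Steps 110 through 170, not the disclosed implementation.

    def process_corpus(corpus_urls, crawl, transcribe, translate, extract, organize, render):
        """Minimal sketch of the crawl -> transcribe -> translate -> extract -> organize -> display flow."""
        documents = crawl(corpus_urls)                      # Step 110: retrieve multi-modal documents
        for doc in documents:
            if doc.get("audio") is not None:
                doc["text"] = transcribe(doc["audio"])      # Step 120: audio/video to text
            if doc.get("language") != "en":
                doc["text"] = translate(doc["text"])        # Step 130: foreign text to the desired language
        extracted = extract(documents)                      # Step 150: entities, relations, events
        organized = organize(extracted)                     # Step 160: group equivalent information
        return render(organized)                            # Step 170: navigable display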
  • FIG. 2 shows an IE process, according to an exemplary embodiment, of Step 150 wherein information on entities, activities, and events are incrementally extracted. Step 210 consists of applying a natural language processing pipeline to each document of the collection. The pipeline can be applied incrementally as new documents are added to the corpus. Step 220 iterates over all entities detected in the corpus. Step 220 can be applied incrementally by iterating only on the entities detected in new documents as they are added to the corpus. Step 230 identifies relation mentions extracted by Step 210 that involve the entity selected by Step 220. Step 240 identifies event mentions involving mentions of the entity selected by Step 220. Step 250 extracts information pertaining to the entity selected by Step 220.
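  • A rough sketch of this iteration follows, assuming each per-document analysis is a dictionary with hypothetical "entities", "relation_mentions", and "event_mentions" fields produced by Step 210; the field names are assumptions for illustration.

    def gather_entity_information(analyses):
        """Iterate over detected entities (Step 220) and collect the relation mentions (Step 230)
        and event mentions (Step 240) that involve each entity."""
        per_entity = {}
        for analysis in analyses:
            for entity_id in analysis.get("entities", []):
                record = per_entity.setdefault(entity_id, {"relations": [], "events": []})
                record["relations"] += [r for r in analysis.get("relation_mentions", [])
                                        if entity_id in r.get("entities", [])]
                record["events"] += [e for e in analysis.get("event_mentions", [])
                                     if entity_id in e.get("entities", [])]
        return per_entity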
  • FIG. 3 shows an example of natural language processing pipeline Step 210 as described in FIG. 2. Text Cleanup Step 310 removes from the text irrelevant characters, such as formatting characters, HyperText Markup Language (HTML) tags, and the like. Tokenization Step 320 analyzes the cleaned-up text and identifies word and sentence boundaries. Part-of-speech tagging Step 330 associates with each word a label that describes its grammatical function. Mention detection Step 340 identifies in the tokenized text the mentions of entities and the words that denote the presence of events (called event anchors). Parsing Step 350 extracts the hierarchical grammatical structure of each sentence, and typically represents it as a tree. Semantic role labeling Step 360 identifies how each of the nodes in the tree extracted by parsing Step 350 is semantically related to each of the verbs in the sentence. Co-reference resolution Step 370 identifies the entities to which the mentions produced by mention detection Step 340 belong. Relation extraction Step 380 detects relations between entity mention pairs and between entity mentions and event anchors. Those of ordinary skill in the art would appreciate that these steps can be implemented using generally known statistical methods, rules, or combinations thereof.
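  • The ordering of Steps 310 through 380 can be expressed as a list of stage functions applied to a shared analysis record. This is a minimal sketch with toy cleanup and tokenization stages; the real stages are the statistical or rule-based components described above.

    import re

    def run_nlp_pipeline(raw_text, stages):
        """Apply the FIG. 3 stages in order; each stage reads and extends a shared analysis dict."""
        analysis = {"text": raw_text}
        for stage in stages:
            analysis = stage(analysis)
        return analysis

    def cleanup(analysis):
        """Step 310 (toy version): strip HTML tags and collapse whitespace."""
        text = re.sub(r"<[^>]+>", " ", analysis["text"])
        analysis["text"] = re.sub(r"\s+", " ", text).strip()
        return analysis

    def tokenize(analysis):
        """Step 320 (toy version): whitespace segmentation as a stand-in for real tokenization."""
        analysis["tokens"] = analysis["text"].split()
        return analysis

    # Part-of-speech tagging, mention detection, parsing, semantic role labeling, co-reference
    # resolution, and relation extraction (Steps 330-380) would follow the same signature.
    print(run_nlp_pipeline("<p>Leon Panetta visited Kabul.</p>", [cleanup, tokenize])["tokens"])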
  • FIG. 4 shows an exemplary embodiment of organizing the information about entities according to Step 160 of FIG. 1.
  • Step 410 iterates over all the entities identified in the corpus. An incremental embodiment of Step 410 consists of iterating on all the entities identified in new documents as they are added to the corpus.
  • Step 420 divides the information extracted about the entity selected by iteration Step 410 into equivalence classes containing equivalent or redundant information. In an exemplary embodiment, each equivalence class would consist of a collection of items, where each item consists of a span of text extracted from a document, together with a specification of the information about the desired entity derived from the span of text. Those of ordinary skill in the art would appreciate that such equivalence classes could be mutually exclusive or could overlap, wherein the same item could belong to more than one equivalence class.
  • Step 430 iterates on the equivalence classes produced by Step 420.
  • Step 440 would select one item in the class that best represents all the items in the class. Selection criteria used by selection Step 440 can include, but not be limited to: selecting the most common span of text that appears in the equivalence class (for example, the span “U.S. President Barack Obama” is more common than “Barack Obama, the President of the United States”, and, according to this selection criterion, would be chosen as the representative span to describe the relationship of “Barack Obama” to the “United States”), selecting the span of text that conveys the largest amount of information (for example, “Barack Obama is the 44th and current President of the United States” conveys more information about the relationship between “Barack Obama” and the “United States” than “U.S. President Barack Obama”, and would be chosen as representative according to this criterion), and selecting the span of text with the highest score produced by the extraction Step 150, if the step associates a score with its results.
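  • The selection criteria of Step 440 can be written as interchangeable functions over the items of a class. In this sketch each item is assumed to be a (span, score) pair, and "largest amount of information" is approximated by span length, which is only a rough proxy.

    from collections import Counter

    def most_common_span(items):
        """Criterion 1: the span of text that occurs most often in the equivalence class."""
        return Counter(span for span, _ in items).most_common(1)[0][0]

    def most_informative_span(items):
        """Criterion 2: approximate information content by span length (an assumption)."""
        return max(items, key=lambda item: len(item[0]))[0]

    def highest_scoring_span(items):
        """Criterion 3: the span with the highest score assigned by the extraction Step 150."""
        return max(items, key=lambda item: item[1])[0]

    items = [("U.S. President Barack Obama", 0.8),
             ("U.S. President Barack Obama", 0.7),
             ("Barack Obama is the 44th and current President of the United States", 0.9)]
    print(most_common_span(items))      # favors the most frequent span
    print(highest_scoring_span(items))  # favors the extractor's highest-confidence span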
  • Step 450 records the information about the equivalence class and about the representative selected by Step 440, so that the information can be used by the subsequent Step 170 of FIG. 1. The method shown in FIG. 4 can be adapted to the case in which equivalence classes can overlap and it is still desirable to select distinct representatives for different classes, for example, by means of an optimization procedure that would combine one or more of the selection criteria listed above or of equivalent selection criteria with a dissimilarity measure that would favor the choice of distinct representatives for overlapping equivalence classes.
  • In an exemplary embodiment of Step 420, an individual instance of extracted information may consist of a span (equivalently, a passage) from a document together with a specification of the information extracted about a desired entity from the span. Such specification can consist of a collection of attribute-value pairs, a collection of Resource Description Framework (RDF) triples, a set of relations in a relational database, and the like. The specification can be represented using a description language, such as Extensible Markup Language (XML), using the RDF representation language, using a database, and the like.
  • Step 420 may consist of identifying groups of instances of extracted information satisfying two conditions: the first being that each group contains at least one instance (main instance) given which all other instances in the group are redundant; the second being that main instances of separate groups are not redundant with respect to each other. This result can be accomplished using a traditional clustering algorithm or an incremental clustering algorithm.
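  • A minimal sketch of such grouping, assuming a boolean is_redundant(a, b) predicate is supplied (its implementation, statistical or rule-based, is not specified here):

    def group_into_equivalence_classes(instances, is_redundant):
        """Greedy grouping: each group keeps a main instance; an instance joins the first group
        whose main instance makes it redundant, otherwise it starts a new group."""
        groups = []
        for instance in instances:
            for group in groups:
                if is_redundant(instance, group["main"]):
                    group["members"].append(instance)
                    break
            else:
                groups.append({"main": instance, "members": [instance]})
        return groups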
  • FIG. 5 shows an exemplary embodiment of a method of Step 170 of FIG. 1 for constructing a displayable representation of the information pertaining to an entity and collected according to the method described in FIG. 4.
  • In Step 510 the equivalence classes of information produced by Step 420 are scored, for example, by assigning to the equivalence class the highest score of the pieces of information in the class. Alternatively, other quantities can be used as the score of the equivalence class, for example: the average score of its members, the median score of its members, the sum of the scores of its members, and the like. According to the method described in FIG. 5, the score is used to prioritize the order in which the equivalence classes are displayed to the user.
  • Step 520 sorts the equivalence classes in descending order of score.
  • Step 530 selects each equivalence class. For all the instances of the equivalence class selected (Step 540), Step 550 constructs a displayable representation of the instance selected from the equivalence class. In an exemplary embodiment, such displayable representation consists of the passage containing the extracted information, appropriately marked up with visual highlights. Such visual highlights may include color to differentiate the extracted information. Additionally, the displayable representation could include visual cues to easily identify other entities for which an information page exists.
  • Step 560 combines the representations produced by Step 550 to produce a displayable representation of the equivalence class. In an exemplary embodiment, this step consists of displaying the representative instance of the equivalence class and providing means for displaying the other members, for instance, by providing links to the representation of these members.
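  • Steps 510 through 560 can be sketched as follows. The class structure ("main", "members", per-item "score") and the render_instance callable are illustrative assumptions; the scorer is selectable among the quantities listed for Step 510.

    import statistics

    SCORERS = {"max": max, "average": statistics.mean, "median": statistics.median, "sum": sum}

    def build_display(classes, scorer="max", render_instance=str):
        """Score each equivalence class (Step 510), sort in descending order (Step 520), and emit
        one displayable section per class with a representative and its supporting members (Steps 530-560)."""
        score_fn = SCORERS[scorer]
        scored = [(score_fn([item["score"] for item in cls["members"]]), cls) for cls in classes]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [{"score": score,
                 "representative": render_instance(cls["main"]),
                 "supporting": [render_instance(m) for m in cls["members"] if m is not cls["main"]]}
                for score, cls in scored]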
  • Referring now to FIG. 6, an exemplary page describing an entity (i.e., an Entity page (EP)) for the individual Leon Panetta is depicted. The page is divided into a left and a right part. The two frames in the left part contain a picture and biographical information automatically extracted from the Wikipedia internet encyclopedia or from another source of reliable information, respectively. The right part contains a set of tabs that organize relevant small pieces (snippets) of text by the kind of information they convey. The content in each tab is the output of a series of information extraction modules which are described in further detail below. Each tab also shows a graphical summary of its content.
  • Table 1, shown below, summarizes the information conveyed by the snippets of text in each tab.
  • TABLE 1
    Description of Information Contained in the GUI Tabs, Organized by Entity Type.

    Entity Type  Tab Title              Description
    Person       Affiliations           Describe affiliations of the person to companies, organizations, governments, agencies, etc.
                 Statements             Report statements made by the person on any topic
                 Actions                Describe the actions of the person
                 Related People         List acquaintances of the person
                 Locations              List places & locations visited by the person
                 Elections              Describe the election campaign of the person
                 Involvement in Events  Describe events in which the person is involved
    ORG & GPE    Actions                Describe actions of the organization or of official representatives
                 Related Orgs           Describe related organizations, such as subsidiaries
                 Associated People      List people associated with the ORG/GPE
                 Statements             Report statements released by the organization or made by representatives
  • These snippets are selected by a collection of Information Gathering Modules (IGMs) specified in a configuration file. A typical IGM is based upon a machine learning model, further described below. Each IGM also associates a relevance score with each snippet.
  • To assemble the tab content, the snippets selected and scored by the IGMs are analyzed by appropriate Information Display Modules (IDMs), specified in a configuration file. IDMs group snippets with identical information for a tab into the same equivalence class. IDMs associate a score with each equivalence class, and sort the classes according to the score.
  • To visualize each equivalence class, IDMs produce a title, which is a short representation of the information it conveys, and select a representative snippet. They highlight the portions of the representative snippet that contain the information of interest to the tab, and create links to pages of other entities mentioned in the snippets. Additional sentences in the equivalence class are shown by clicking a link marked “Additional Supporting Results . . . ”. Since news agencies often reuse the same sentences over time, such sentences are available by clicking “Other Identical Results”.
  • IDMs create the data used to produce a visual summary of the content in the selected tab, shown in the rightmost frame of the top half of the GUI. For the Related People tab depicted in FIG. 6, this visualization is a network of relationships. For other tabs, it is a cloud of the content words in the tab.
  • The interface is useful not only for an analyst tracking an entity in the news, but also for a financial analyst following news about a company or for web users getting daily updates of the news. The redundancy detection and systematic organization of information make the content easy to digest.
  • In a news browsing application, entities can be highlighted in articles, as depicted in FIG. 7(a), and those entities for which an EP exists (i.e., there are relevant snippets for at least one tab) are hyperlinked to the EP. Users can also arrive at the EP by viewing a searchable list of entities in alphabetic order, or by frequency in the news as depicted in FIG. 7(b).
  • FIG. 8 shows an overview of an exemplary embodiment of program storage device 600 wherein instruction code contained therein for an IE, IGM and IDM are depicted. Processor 700 executes the instruction code stored in program storage device 600.
  • A crawler as described above can periodically download new content from a set of English text sites and Arabic text and video sites into documents 610. Audio from video sources can be segmented into 2-minute chunks and then transcribed. Arabic can be translated into English using a state-of-the-art machine translation system. Table 2 lists the average number of documents downloaded daily from each modality-language pair.
  • TABLE 2
    Number of articles downloaded by the
    crawler daily in different modalities.
    Source # docs
    En-Text 1317
    Ar-Text 813
    Ar-Video 843
  • Subsequent components in the pipeline work on English text documents, and the framework can be easily extended to any language for which translation and transcription systems exist.
  • Each new textual document 610 may be analyzed by the IE pipeline 620. The first step after tokenization is parsing, followed by mention detection. Within each document, mentions are clustered by a within-document co-reference-resolution algorithm. Thus, in the appropriate context, “Washington” and “White House” are grouped under the same entity (the USA), and “Leon Edward Panetta” and “Leon Panetta” under the same person (Secretary of Defense). Nominal and pronominal mentions are also added to the clusters. A cross-document co-reference system then links the entity clusters across documents. This is done by linking each cluster to the knowledge base (KB) used in the Text Analysis Conference (TAC) Entity Linking task, which is derived from a subset of the Wikipedia Internet encyclopedia. If a match in the KB is found, the cluster is assigned the KB ID of the match, which allows for the cross-referencing of entities across documents. Besides exact match with titles in the KB, the cross-document co-reference system uses soft match features and context information to match against spelling variations and alternate names. The system also disambiguates between entities with identical names. The next IE component extracts relations between the entities in the document, such as employed by, son of, etc. The mention detection, co-reference and relation extraction modules are trained on an internally annotated set of 1301 documents labeled according to the Knowledge from Language Understanding and Extraction (KLUE) 2 ontology. On a development set of 33 documents, these components achieve an F1 of 71.6%, 83.7% and 65% respectively. The entity linking component is unsupervised and achieves an accuracy of 73% on the TAC-2009 person queries.
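  • The cross-document linking just described (exact title match against the KB, then softer matching against spelling variations and alternate names) might be approximated as below. The KB structure, the similarity measure, and the threshold are assumptions for illustration, not the trained components of the actual system.

    from difflib import SequenceMatcher

    def link_cluster_to_kb(cluster_names, kb, threshold=0.7):
        """Return a KB ID for an entity cluster: exact title match first, then a soft string match."""
        for name in cluster_names:
            if name in kb:                                  # exact match with a KB title
                return kb[name]
        best_id, best_sim = None, 0.0
        for title, kb_id in kb.items():                     # soft match for variations / alternate names
            for name in cluster_names:
                sim = SequenceMatcher(None, name.lower(), title.lower()).ratio()
                if sim > best_sim:
                    best_id, best_sim = kb_id, sim
        return best_id if best_sim >= threshold else None

    kb = {"Leon Panetta": "KB:001", "International Business Machines": "KB:002"}
    print(link_cluster_to_kb(["Leon Edward Panetta", "Panetta"], kb))   # -> KB:001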
  • Annotated documents are then analyzed by the IGMs 630 and IDMs 640 described above. In its basic form, an IGM takes as input a sentence and an entity, and extracts specific information about that entity from the sentence. For example, a specific IGM may detect whether a family relation of a given person is mentioned in the input sentence. A partial list of IGMs and the description of the extracted content is shown in Table 1. The output of the IGMs is then analyzed by IDMs, which assemble the content of the GUI tabs. These tabs either correspond to a question template from a pilot program or are derived from the above-mentioned relations. For each entity, IDMs selectively choose annotations produced by IGMs, group them into equivalence classes, rank the equivalence classes to prioritize the information displayed to the user, and assemble the content of the tab. The IGMs and IDMs are described in still further detail below.
  • IGMs extract specific information pertaining to a given entity from a specific sentence in two stages: First, they detect whether the snippet contains relevant information. Then they identify information nuggets.
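  • The two-stage behavior of an IGM can be captured by a small interface; the relevance model and nugget extractor below are placeholders for the trained components described next.

    class InformationGatheringModule:
        """Sketch of an IGM: stage 1 scores snippet relevance, stage 2 extracts information nuggets."""

        def __init__(self, relevance_model, nugget_extractor, threshold=0.5):
            self.relevance_model = relevance_model    # callable: (sentence, entity) -> probability
            self.nugget_extractor = nugget_extractor  # callable: (sentence, entity) -> list of nuggets
            self.threshold = threshold

        def process(self, sentence, entity):
            score = self.relevance_model(sentence, entity)
            if score < self.threshold:                # stage 1: discard irrelevant snippets
                return score, []
            return score, self.nugget_extractor(sentence, entity)  # stage 2: nugget extraction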
  • Snippet relevance detection relies on statistical classifiers, trained on three corpora produced as part of the pilot program: i) data provided by the Linguistic Data Consortium (LDC) to the pilot program teams during the early years of the program; ii) data provided by BAE Systems; and iii) internally annotated data. The data consist of queries and snippets with binary relevance annotation. The LDC and internally annotated data were specifically developed for training and testing purposes, while the BAE data also include queries from yearly evaluations, the answers provided by the teams that participated in the evaluations, and the official judgments of the answers. The statistical models are maximum entropy classifiers or averaged perceptrons chosen based on empirical performance. They use a broad array of features including lexical, structural, syntactic, dependency, and semantic features. Table 3 summarizes the performance of the models used on the year-4 unsequestered queries, run against an internally generated development set. The “TN” column denotes a template number; an illustrative relevance-classifier sketch follows Table 3.
  • TABLE 3
    Performance of the IGM models
    Template TN P R F
    Templates for Person Entities
    Information T3 75.60 90.07 82.20
    Actions T13 50.00 18.33 26.83
    Whereabouts T17 86.11 43.66 57.94
    Election Campaign T21 78.72 26.81 40.00
    Templates for ORG/GPE Entities
    Information T4 71.50 90.79 80.00
    Actions T14 45.83 29.73 36.07
    Arrests of Members T15 75.51 74.00 74.75
    Location of Representative T18 36.36 44.94 40.20
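  • As a rough stand-in for the maximum entropy and averaged perceptron relevance models summarized in Table 3, the sketch below trains a logistic regression classifier over bag-of-words features of joined (query, snippet) text. The toy data and features are illustrative only and do not reflect the lexical, structural, syntactic, dependency, and semantic features of the actual models.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression

    # Toy (query, snippet, relevant?) examples; the real training data are the LDC, BAE Systems,
    # and internally annotated corpora described above.
    pairs = [("Leon Panetta whereabouts", "Leon Panetta visited Kabul on Tuesday.", 1),
             ("Leon Panetta whereabouts", "The weather in Kabul was mild on Tuesday.", 0),
             ("IBM affiliations", "Sam Palmisano was manager of IBM.", 1),
             ("IBM affiliations", "Sam Palmisano enjoys golf.", 0)]

    texts = [query + " || " + snippet for query, snippet, _ in pairs]   # crude joint representation
    labels = [label for _, _, label in pairs]

    vectorizer = CountVectorizer()
    model = LogisticRegression().fit(vectorizer.fit_transform(texts), labels)

    test = vectorizer.transform(["Leon Panetta whereabouts || Panetta arrived in Brussels."])
    print(model.predict_proba(test)[0][1])   # estimated probability that the snippet is relevant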
  • IGMs analyze snippets selected by the template models and extract the information used by the IDMs to assemble and visualize the results. This step is called “Information Nugget Extraction”, where an information nugget is an atomic answer to a specific question. Extracted nuggets include the focus of the answer (e.g., the location visited by a person), the supporting text (a subset of the snippet), and a summary of the answer (taken from the snippet or automatically generated). Different modules extract specific types of nuggets. These modules can be simple rule-based systems or full statistical models. Each tab uses a different set of nugget extractors, which can be easily assembled and configured to produce customized versions of the system.
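  • An information nugget, as just described, can be represented as a small record; the field names here are illustrative assumptions rather than the system's internal format.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class InformationNugget:
        """Atomic answer to a specific question, as produced by a nugget extractor."""
        focus: str                        # e.g., the location visited by a person
        supporting_text: str              # subset of the snippet containing the evidence
        summary: str                      # taken from the snippet or automatically generated
        score: float = 0.0                # relevance score assigned by the IGM
        source_doc: Optional[str] = None  # identifier of the originating document

    nugget = InformationNugget(focus="Kabul",
                               supporting_text="Leon Panetta visited Kabul on Tuesday.",
                               summary="Panetta visited Kabul.")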
  • IDMs use the information produced by IGMs to visualize the results. This involves grouping results into non-redundant sets, sorting the sets, producing a brief description of each set, selecting a representative snippet for each set, highlighting the portions of the snippet that contain information pertaining to the specific tab, constructing navigation hyperlinks to other pages, and generating the data used to graphically represent the tab content.
  • IGMs produce results in a generic format that supports a well-defined Application Program Interface (API). IDMs query this API to retrieve selected IGM products. For each tab, a configuration file specifies which IGM products to use for redundancy detection. For example, the content of the Affiliations tab for persons (see Table 1) is constructed from automatic content extraction (ACE)-style relations. The configuration file instructs the IDM to use the relation type and the KB-ID of the affiliated entity for redundancy reduction. Thus, if a snippet states that Sam Palmisano was manager of “IBM”, and another that Sam Palmisano was manager of “International Business Machines”, and “IBM” and “International Business Machines” have the same KB-ID, then the snippets are marked as redundant for the purpose of the Affiliations tab.
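  • The configuration-driven redundancy check described above amounts to grouping snippets by a key built from the configured IGM products. A minimal sketch using (relation type, KB-ID) as the key, with hypothetical field names, mirrors the Affiliations example:

    from collections import defaultdict

    def group_by_redundancy_key(snippets, key_fields=("relation_type", "kb_id")):
        """Group IGM results whose configured products are identical, hence redundant for the tab."""
        classes = defaultdict(list)
        for snippet in snippets:
            classes[tuple(snippet[field] for field in key_fields)].append(snippet)
        return classes

    snippets = [{"text": "Sam Palmisano was manager of IBM.",
                 "relation_type": "employed_by", "kb_id": "KB:002"},
                {"text": "Sam Palmisano was manager of International Business Machines.",
                 "relation_type": "employed_by", "kb_id": "KB:002"}]
    print(len(group_by_redundancy_key(snippets)))   # 1: both snippets fall in the same class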
  • Redundancy detection groups results into equivalence classes. Each class contains unique values of the IGM products specified in the configuration file. IDMs can further group classes into superclasses or split the equivalence classes according to the values of IGM products. For example, they can partition the equivalence classes according to the date of the document containing the information. The resulting groups of documents constitute the unit of display. IDMs assign a score to each of these groups, for example, using a function of the score of the individual snippets and of the number of results in the group or in the equivalence class. The groups are sorted by score, and the highest scoring snippet is selected as representative for the group. Each group is then visualized as a section in the tab, with a title that is constructed using selected IGM products. The score of the group is also optionally shown. The text of the representative snippet containing the evidence for the relevant information is highlighted in yellow. The named mentions are linked to the corresponding page, if available, and links to different views of the document are provided.
  • Each tab is associated with a graphical representation that summarizes its content, and that is shown in the rightmost section of the top half of the GUI of FIG. 6. This visualization is generated dynamically by invoking an application on a server when the tab is visualized.
  • Exemplary embodiments of the system can support three different visualizations: a word cloud, and two styles of graphs that show connections between entities. A configuration file instructs the IDMs on which IGM products contain the information to be shown in the graphical representation. This information is then formatted to comply with the API of the program that dynamically constructs the visualization.
  • The exemplary embodiments described above can utilize natural language processing methods well known in the art. A fundamental reference is the book “Foundations of Statistical Natural Language Processing” by Manning and Schutze, which covers the main techniques that form such methods. Constructing language models based on co-occurrences (n-gram models) is taught in Chapter 6. Identifying the sense of words using their context, called word-sense disambiguation is taught in Chapter 7. Recognizing the grammatical type of words in a sentence, called part-of-speech tagging, is taught in Chapter 9. Recognizing the grammatical structure of a sentence, called parsing, is taught in Chapter 11. Automatically translating from a source language to a destination language is taught in Chapter 13. The main topics of Information Retrieval are taught in Chapter 15. Automatic methods for text categorization are taught in Chapter 16.
  • A significant proportion of new material on the Internet is news centered on people, organizations, and geopolitical entities (GPEs). Named entities therefore form a key aspect of news documents, and one is often interested in tracking stories about a person (e.g., Leon Panetta), an organization (e.g., Apple Inc.) or a GPE (e.g., the United States). Exemplary embodiments described above provide a system that automatically constructs summary pages for named entities from news data. The EP describing an entity is organized into sections that answer specific questions about that entity, such as Biographical Information, Statements made, Acquaintances, Actions, and the like. Each section contains snippets of text that support the facts automatically extracted from the corpus. Redundancy detection yields a concise summary in which only novel and useful snippets are presented in the default display. The system can be implemented using a variety of sources, and shows information extracted not only from English newswire text, but also from machine-translated text and automatically transcribed audio.
  • While publicly available news aggregators like Google News show the top entities in the news, clicking on these typically results in a keyword search (with, perhaps, some redundancy detection). In contrast, the exemplary embodiments described above provide a system that organizes and summarizes the content in a systematic way that is useful to the user. The system is not limited to a bag-of-words search, but uses deeper NLP technology to detect mentions of named entities, to resolve co-reference (both within a document and across documents), and to mine relationships such as employed by, spouse of, subsidiary of, etc., from the text. The framework is highly scalable and can generate a summary in real time for every entity that appears in the news. The flexible architecture of the system allows it to be quickly adapted to domains other than news, such as collections of scientific papers where the entities of interest are authors, institutions, and countries.
  • The methodologies of the exemplary embodiments of the present disclosure may be particularly well-suited for use in an electronic device or alternative system. Accordingly, exemplary embodiments may take the form of an embodiment combining software and hardware aspects that may all generally be referred to as a “processor”, “circuit,” “module” or “system.” Furthermore, exemplary implementations may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code stored thereon.
  • Any combination of one or more computer usable or computer readable medium(s) may be utilized. The computer-usable or computer-readable medium may be a computer readable storage medium. A computer readable storage medium may be, for example but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium would include the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fibre, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus or device.
  • Computer program code for carrying out operations of the exemplary embodiments may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
  • Exemplary embodiments are described herein with reference to flowchart illustrations and/or block diagrams. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions.
  • The computer program instructions may be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
  • It is to be appreciated that the term “processor” as used herein is intended to include any processing device, such as, for example, one that includes a central processing unit (CPU) and/or other processing circuitry (e.g., digital signal processor (DSP), microprocessor, etc.). Additionally, it is to be understood that the term “processor” may refer to more than one processing device, and that various elements associated with a processing device may be shared by other processing devices. The term “memory” as used herein is intended to include memory and other computer-readable media associated with a processor or CPU, such as, for example, random access memory (RAM), read only memory (ROM), fixed storage media (e.g., a hard drive), removable storage media (e.g., a diskette), flash memory, etc. Furthermore, the term “I/O circuitry” as used herein is intended to include, for example, one or more input devices (e.g., keyboard, mouse, etc.) for entering data to the processor, and/or one or more output devices (e.g., printer, monitor, etc.) for presenting the results associated with the processor.
  • The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
  • Although illustrative embodiments of the present disclosure have been described herein with reference to the accompanying drawings, it is to be understood that the present disclosure is not limited to those precise embodiments, and that various other changes and modifications may be made therein by one skilled in the art without departing from the scope of the appended claims.

Claims (15)

What is claimed is:
1. A method for automatically extracting and organizing information by a processing device from a plurality of data sources, comprising:
applying to the data sources a natural language processing information extraction pipeline that includes an automatic detection of entities;
identifying information about detected entities by analyzing products of the natural language processing pipeline;
grouping identified information into equivalence classes containing equivalent information;
creating at least one displayable representation of the equivalence classes;
computing an order in which the at least one displayable representation is displayed; and
producing a combined representation of the equivalence classes that respects the order in which the displayable representation is displayed.
2. The method of claim 1, wherein each equivalence class comprises a collection of items, each item comprising a span of text extracted from a document, together with a specification of information about a desired entity derived from the span of text.
3. The method of claim 1, wherein computing an order in which the displayable representations are displayed further comprises randomly computing the order.
4. The method of claim 1, wherein grouping identified information into equivalence classes further comprises assigning each identified information to a separate equivalence class.
5. The method of claim 1, wherein grouping identified information into equivalence classes further comprises:
computing a representative instance of each equivalence class;
ensuring that representative instances of different classes are not redundant with respect to each other;
ensuring that instances of each equivalence class are redundant with respect to the representative instance of the equivalence class.
6. A method for processing information by a processing device, the method comprising:
receiving a user query;
inferring a user query intention from the user query to develop an inferred user intention; and
automatically generating a page in response to the user query by adaptively building a template that corresponds to the inferred user intention using natural language processing of multiple modalities comprising at least one of text, audio and video.
7. The method of claim 6, further comprising: when the user query selects a person who has a political status,
detecting the political status,
searching for information on at least one of an election campaign, public appearances, statements, and public service history, and
automatically generating a page in response to the user query.
8. The method of claim 6, further comprising when the user query selects a company:
searching for information on at least one of recent news about the company, information on the company's top officials, and press releases for the company; and
automatically generating a page in response to the user query.
9. The method of claim 6, further comprising when the user query selects an event:
searching for information on at least one of news items about the event and reactions to the event; and
automatically generating a page in response to the user query.
10. The method of claim 9, wherein entities in the event are identified and retrieved relevant information about the entities is searched.
11. A method for automatically extracting and organizing information by a processing device from a corpus of documents having multiple modalities of information in multiple languages for display to a user, the method comprising:
browsing the corpus of documents to identify and incrementally retrieve documents containing audio/video files;
transcribing text from the audio/video files to provide a textual representation;
translating text of the textual representation that is in a foreign language;
incrementally extracting desired information about at least one of entities, activities, and events;
organizing extracted information; and
converting organized extracted information into a navigable display presentable to the user.
12. The method of claim 11, wherein incrementally extracting desired information comprises:
applying a natural language processing pipeline to each document to iterate all entities detected in the corpus;
identifying relation mentions and event mentions that involve a selected entity,
wherein an entity is at least one of a physical animate object, a physical inanimate object, something that has a proper name, something that has a measurable physical property, a legal entity and abstract concepts,
wherein a mention is a span of text that refers to an entity,
wherein a relation is a connection between two entities,
wherein a relation mention is a span of text that describes a relation, and
wherein an event is a set of relations between two or more entities involving one or more actions.
13. The method of claim 11, wherein organizing extracted information comprises:
iterating on all the entities identified in the corpus;
dividing the information extracted about the entity into selected equivalence classes containing equivalent information;
iterating on all the equivalence classes;
selecting one item in each equivalence class to represent all items in the equivalence class; and
recording information about the equivalence class and about a representative selected for use in producing the navigable display,
wherein each equivalence class comprises a collection of items, each item comprising a span of text extracted from a document, together with a specification of the information about the desired entity derived from the span of text.
14. The method of claim 11, wherein converting organized extracted information into a navigable display presentable to the user comprises:
scoring the equivalence classes of information by assigning to the equivalence class at least one of a highest score of the pieces of information in the class, the average score of its members, the median score of its members, and the sum of the scores of its members;
sorting the equivalence classes in descending order of score to prioritize an order in which the equivalence classes are displayed to the user;
iterating for each equivalence class a constructing of a displayable representation of an instance selected; and
combining the displayable representations to produce a displayable representation of the equivalence classes.
15. The method of claim 14, wherein the displayable representation comprises a passage containing extracted information marked up with visual highlights.
US13/493,659 2012-06-11 2012-06-11 System and Method for Automatically Detecting and Interactively Displaying Information About Entities, Activities, and Events from Multiple-Modality Natural Language Sources Abandoned US20130332450A1 (en)

Priority Applications (5)

Application Number Priority Date Filing Date Title
US13/493,659 US20130332450A1 (en) 2012-06-11 2012-06-11 System and Method for Automatically Detecting and Interactively Displaying Information About Entities, Activities, and Events from Multiple-Modality Natural Language Sources
US13/543,157 US20140195884A1 (en) 2012-06-11 2012-07-06 System and method for automatically detecting and interactively displaying information about entities, activities, and events from multiple-modality natural language sources
DE201310205737 DE102013205737A1 (en) 2012-06-11 2013-04-02 Method for automatically extracting and organizing information from data sources in e.g. web pages, involves producing combined representation of the equivalence classes in which the order for displayable representation is displayed
CN201310122395.8A CN103488663A (en) 2012-06-11 2013-04-10 System and method for automatically detecting and interactively displaying information about entities, activities, and events from multiple-modality natural language sources
US15/419,615 US10698964B2 (en) 2012-06-11 2017-01-30 System and method for automatically detecting and interactively displaying information about entities, activities, and events from multiple-modality natural language sources

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US13/493,659 US20130332450A1 (en) 2012-06-11 2012-06-11 System and Method for Automatically Detecting and Interactively Displaying Information About Entities, Activities, and Events from Multiple-Modality Natural Language Sources

Related Child Applications (2)

Application Number Title Priority Date Filing Date
US13/543,157 Continuation US20140195884A1 (en) 2012-06-11 2012-07-06 System and method for automatically detecting and interactively displaying information about entities, activities, and events from multiple-modality natural language sources
US15/419,615 Division US10698964B2 (en) 2012-06-11 2017-01-30 System and method for automatically detecting and interactively displaying information about entities, activities, and events from multiple-modality natural language sources

Publications (1)

Publication Number Publication Date
US20130332450A1 true US20130332450A1 (en) 2013-12-12

Family

ID=49716127

Family Applications (2)

Application Number Title Priority Date Filing Date
US13/493,659 Abandoned US20130332450A1 (en) 2012-06-11 2012-06-11 System and Method for Automatically Detecting and Interactively Displaying Information About Entities, Activities, and Events from Multiple-Modality Natural Language Sources
US15/419,615 Active 2032-07-01 US10698964B2 (en) 2012-06-11 2017-01-30 System and method for automatically detecting and interactively displaying information about entities, activities, and events from multiple-modality natural language sources

Family Applications After (1)

Application Number Title Priority Date Filing Date
US15/419,615 Active 2032-07-01 US10698964B2 (en) 2012-06-11 2017-01-30 System and method for automatically detecting and interactively displaying information about entities, activities, and events from multiple-modality natural language sources

Country Status (2)

Country Link
US (2) US20130332450A1 (en)
CN (1) CN103488663A (en)

Cited By (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130311166A1 (en) * 2012-05-15 2013-11-21 Andre Yanpolsky Domain-Specific Natural-Language Processing Engine
US20140039877A1 (en) * 2012-08-02 2014-02-06 American Express Travel Related Services Company, Inc. Systems and Methods for Semantic Information Retrieval
US20140095147A1 (en) * 2012-10-01 2014-04-03 Nuance Communications, Inc. Situation Aware NLU/NLP
CN103744841A (en) * 2013-12-23 2014-04-23 武汉传神信息技术有限公司 Information fragment translating method and system
US20140200879A1 (en) * 2013-01-11 2014-07-17 Brian Sakhai Method and System for Rating Food Items
CN104035975A (en) * 2014-05-23 2014-09-10 华东师范大学 Method utilizing Chinese online resources for supervising extraction of character relations remotely
US20140372102A1 (en) * 2013-06-18 2014-12-18 Xerox Corporation Combining temporal processing and textual entailment to detect temporally anchored events
US20150220539A1 (en) * 2014-01-31 2015-08-06 Global Security Information Analysts, LLC Document relationship analysis system
US20170017897A1 (en) * 2015-07-17 2017-01-19 Knoema Corporation Method and system to provide related data
US20170091653A1 (en) * 2015-09-25 2017-03-30 Xerox Corporation Method and system for predicting requirements of a user for resources over a computer network
US9672827B1 (en) * 2013-02-11 2017-06-06 Mindmeld, Inc. Real-time conversation model generation
US20170199867A1 (en) * 2014-10-30 2017-07-13 Mitsubishi Electric Corporation Dialogue control system and dialogue control method
CN107180030A (en) * 2016-03-09 2017-09-19 阿里巴巴集团控股有限公司 Relation data generation method and device on a kind of network
US20180089181A1 (en) * 2015-03-18 2018-03-29 Nec Corporation Text visualization system, text visualization method, and recording medium
US10176251B2 (en) * 2015-08-31 2019-01-08 Raytheon Company Systems and methods for identifying similarities using unstructured text analysis
CN109446336A (en) * 2018-09-18 2019-03-08 平安科技(深圳)有限公司 Method, apparatus, computer equipment and the storage medium of news screening
US10282419B2 (en) 2012-12-12 2019-05-07 Nuance Communications, Inc. Multi-domain natural language processing architecture
US10586156B2 (en) 2015-06-25 2020-03-10 International Business Machines Corporation Knowledge canvassing using a knowledge graph and a question and answer system
US10650192B2 (en) * 2015-12-11 2020-05-12 Beijing Gridsum Technology Co., Ltd. Method and device for recognizing domain named entity
US20200183971A1 (en) * 2017-08-22 2020-06-11 Subply Solutions Ltd. Method and system for providing resegmented audio content
US10698964B2 (en) 2012-06-11 2020-06-30 International Business Machines Corporation System and method for automatically detecting and interactively displaying information about entities, activities, and events from multiple-modality natural language sources
CN111615696A (en) * 2017-11-18 2020-09-01 科奇股份有限公司 Interactive representation of content for relevance detection and review
US10783321B2 (en) * 2018-03-28 2020-09-22 Konica Minolta, Inc. Document creation support device and program
US10867256B2 (en) * 2015-07-17 2020-12-15 Knoema Corporation Method and system to provide related data
US20220084508A1 (en) * 2020-09-15 2022-03-17 International Business Machines Corporation End-to-End Spoken Language Understanding Without Full Transcripts
US20220215029A1 (en) * 2021-01-05 2022-07-07 Salesforce.Com, Inc. Personalized nls query suggestions using language models
US20230091949A1 (en) * 2021-09-21 2023-03-23 NCA Holding BV Data realization for virtual collaboration environment

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170161319A1 (en) * 2015-12-08 2017-06-08 Rovi Guides, Inc. Systems and methods for generating smart responses for natural language queries
WO2017107010A1 (en) * 2015-12-21 2017-06-29 浙江核新同花顺网络信息股份有限公司 Information analysis system and method based on event regression test
US10817527B1 (en) * 2016-04-12 2020-10-27 Tableau Software, Inc. Systems and methods of using natural language processing for visual analysis of a data set
US10607463B2 (en) * 2016-12-09 2020-03-31 The Boeing Company Automated object and activity tracking in a live video feed
KR20190006680A (en) * 2017-07-11 2019-01-21 에스케이하이닉스 주식회사 Data storage device and operating method thereof
CN110046261B (en) * 2019-04-22 2022-01-21 山东建筑大学 Construction method of multi-modal bilingual parallel corpus of construction engineering
CN110245352A (en) * 2019-06-18 2019-09-17 北京智合大方科技有限公司 A kind of public sentiment hot word analysis method and device
CN111126069B (en) * 2019-12-30 2022-03-29 华南理工大学 Social media short text named entity identification method based on visual object guidance
CN112446792A (en) * 2020-12-01 2021-03-05 中国人寿保险股份有限公司 Benefit demonstration generation method and device, electronic equipment and storage medium
US11437038B2 (en) 2020-12-11 2022-09-06 International Business Machines Corporation Recognition and restructuring of previously presented materials
CN116415005B (en) * 2023-06-12 2023-08-18 中南大学 Relationship extraction method for academic network construction of scholars

Family Cites Families (92)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5696916A (en) 1985-03-27 1997-12-09 Hitachi, Ltd. Information storage and retrieval system and display method therefor
US5604899A (en) * 1990-05-21 1997-02-18 Financial Systems Technology Pty. Ltd. Data relationships processor with unlimited expansion capability
US5873076A (en) * 1995-09-15 1999-02-16 Infonautics Corporation Architecture for processing search queries, retrieving documents identified thereby, and method for using same
GB9726654D0 (en) 1997-12-17 1998-02-18 British Telecomm Data input and retrieval apparatus
US6438741B1 (en) * 1998-09-28 2002-08-20 Compaq Computer Corporation System and method for eliminating compile time explosion in a top down rule based system using selective sampling
US7143434B1 (en) * 1998-11-06 2006-11-28 Seungyup Paek Video description system and method
US6205441B1 (en) * 1999-03-31 2001-03-20 Compaq Computer Corporation System and method for reducing compile time in a top down rule based system using rule heuristics based upon the predicted resulting data flow
US7185016B1 (en) * 2000-09-01 2007-02-27 Cognos Incorporated Methods and transformations for transforming metadata model
US6601026B2 (en) * 1999-09-17 2003-07-29 Discern Communications, Inc. Information retrieval by natural language querying
US20030191625A1 (en) * 1999-11-05 2003-10-09 Gorin Allen Louis Method and system for creating a named entity language model
US6963867B2 (en) * 1999-12-08 2005-11-08 A9.Com, Inc. Search query processing to provide category-ranked presentation of search results
US6757646B2 (en) 2000-03-22 2004-06-29 Insightful Corporation Extended functionality for an inverse inference engine based web search
US6567805B1 (en) * 2000-05-15 2003-05-20 International Business Machines Corporation Interactive automated response system
US7606796B2 (en) * 2000-06-15 2009-10-20 Generate, Inc. Method of and system for determining connections between parties using private links
AU2001268489B2 (en) * 2000-06-15 2006-07-20 Generate, Inc. Method of and system for determining connections between parties over a network
US6952666B1 (en) * 2000-07-20 2005-10-04 Microsoft Corporation Ranking parser for a natural language processing system
US20070027672A1 (en) 2000-07-31 2007-02-01 Michel Decary Computer method and apparatus for extracting data from web pages
US7526425B2 (en) * 2001-08-14 2009-04-28 Evri Inc. Method and system for extending keyword searching to syntactically and semantically annotated data
US7227517B2 (en) 2001-08-23 2007-06-05 Seiko Epson Corporation Electronic device driving method, electronic device, semiconductor integrated circuit, and electronic apparatus
WO2003040963A1 (en) * 2001-11-02 2003-05-15 Medical Research Consultants L.P. Knowledge management system
US7177799B2 (en) * 2002-01-14 2007-02-13 Microsoft Corporation Semantic analysis system for interpreting linguistic structures output by a natural language linguistic analysis system
US7260570B2 (en) * 2002-02-01 2007-08-21 International Business Machines Corporation Retrieving matching documents by queries in any national language
US7716199B2 (en) 2005-08-10 2010-05-11 Google Inc. Aggregating context data for programmable search engines
US7191119B2 (en) * 2002-05-07 2007-03-13 International Business Machines Corporation Integrated development tool for building a natural language understanding application
US7805302B2 (en) * 2002-05-20 2010-09-28 Microsoft Corporation Applying a structured language model to information extraction
US20030220913A1 (en) * 2002-05-24 2003-11-27 International Business Machines Corporation Techniques for personalized and adaptive search services
US7113950B2 (en) * 2002-06-27 2006-09-26 Microsoft Corporation Automated error checking system and method
US7472110B2 (en) * 2003-01-29 2008-12-30 Microsoft Corporation System and method for employing social networks for information discovery
EP1652155A4 (en) * 2003-06-27 2009-11-11 Generate Inc Method of and system for determining connections between parties using private links
US7308464B2 (en) * 2003-07-23 2007-12-11 America Online, Inc. Method and system for rule based indexing of multiple data structures
US20050043940A1 (en) * 2003-08-20 2005-02-24 Marvin Elder Preparing a data source for a natural language query
US20050120003A1 (en) * 2003-10-08 2005-06-02 Drury William J. Method for maintaining a record of searches and results
US20070214126A1 (en) 2004-01-12 2007-09-13 Otopy, Inc. Enhanced System and Method for Search
US7747601B2 (en) 2006-08-14 2010-06-29 Inquira, Inc. Method and apparatus for identifying and classifying query intent
US20060009966A1 (en) * 2004-07-12 2006-01-12 International Business Machines Corporation Method and system for extracting information from unstructured text using symbolic machine learning
US20060059129A1 (en) * 2004-09-10 2006-03-16 Hideyuki Azuma Public relations communication methods and systems
US7610191B2 (en) * 2004-10-06 2009-10-27 Nuance Communications, Inc. Method for fast semi-automatic semantic annotation
WO2006093394A1 (en) * 2005-03-04 2006-09-08 Chutnoon Inc. Server, method and system for providing information search service by using web page segmented into several information blocks
US7401073B2 (en) * 2005-04-28 2008-07-15 International Business Machines Corporation Term-statistics modification for category-based search
US20070016580A1 (en) * 2005-07-15 2007-01-18 International Business Machines Corporation Extracting information about references to entities from a plurality of electronic documents
JP4314221B2 (en) * 2005-07-28 2009-08-12 株式会社東芝 Structured document storage device, structured document search device, structured document system, method and program
US20070078850A1 (en) * 2005-10-03 2007-04-05 Microsoft Corporation Commercial web data extraction system
US8060357B2 (en) * 2006-01-27 2011-11-15 Xerox Corporation Linguistic user interface
US20070219986A1 (en) * 2006-03-20 2007-09-20 Babylon Ltd. Method and apparatus for extracting terms based on a displayed text
US8144151B2 (en) * 2006-05-10 2012-03-27 Hireright, Inc. Spatial and temporal graphical display of verified/validated data organized as complex events
US7693805B2 (en) * 2006-08-01 2010-04-06 Yahoo, Inc. Automatic identification of distance based event classification errors in a network by comparing to a second classification using event logs
US20080228675A1 (en) * 2006-10-13 2008-09-18 Move, Inc. Multi-tiered cascading crawling system
US7822734B2 (en) * 2006-12-12 2010-10-26 Yahoo! Inc. Selecting and presenting user search results based on an environment taxonomy
US8515728B2 (en) 2007-03-29 2013-08-20 Microsoft Corporation Language translation of visual and audio input
US7797309B2 (en) * 2007-06-07 2010-09-14 Datamaxx Applied Technologies, Inc. System and method for search parameter data entry and result access in a law enforcement multiple domain security environment
US8849831B2 (en) * 2007-06-07 2014-09-30 Datamaxx Applied Technologies, Inc. System and method for efficient indexing of messages in a law enforcement data network
US20090119276A1 (en) * 2007-11-01 2009-05-07 Antoine Sorel Neron Method and Internet-based Search Engine System for Storing, Sorting, and Displaying Search Results
WO2009061399A1 (en) * 2007-11-05 2009-05-14 Nagaraju Bandaru Method for crawling, mapping and extracting information associated with a business using heuristic and semantic analysis
US10157195B1 (en) * 2007-11-29 2018-12-18 Bdna Corporation External system integration into automated attribute discovery
US20090171929A1 (en) 2007-12-26 2009-07-02 Microsoft Corporation Toward optimized query suggestion: user interfaces and algorithms
US20090228439A1 (en) 2008-03-07 2009-09-10 Microsoft Corporation Intent-aware search
US20090307209A1 (en) * 2008-06-10 2009-12-10 David Carmel Term-statistics modification for category-based search
US20100023319A1 (en) * 2008-07-28 2010-01-28 International Business Machines Corporation Model-driven feedback for annotation
US9740753B2 (en) * 2008-12-18 2017-08-22 International Business Machines Corporation Using spheres-of-influence to characterize network relationships
US7958109B2 (en) 2009-02-06 2011-06-07 Yahoo! Inc. Intent driven search result rich abstracts
US20100205198A1 (en) * 2009-02-06 2010-08-12 Gilad Mishne Search query disambiguation
US8326637B2 (en) * 2009-02-20 2012-12-04 Voicebox Technologies, Inc. System and method for processing multi-modal device interactions in a natural language voice services environment
WO2010101540A1 (en) * 2009-03-02 2010-09-10 Panchenko Borys Evgenijovich Method for the fully modifiable framework distribution of data in a data warehouse taking account of the preliminary etymological separation of said data
US8311330B2 (en) 2009-04-06 2012-11-13 Accenture Global Services Limited Method for the logical segmentation of contents
US8190601B2 (en) * 2009-05-22 2012-05-29 Microsoft Corporation Identifying task groups for organizing search results
US20100306249A1 (en) * 2009-05-27 2010-12-02 James Hill Social network systems and methods
US8744978B2 (en) * 2009-07-21 2014-06-03 Yahoo! Inc. Presenting search results based on user-customizable criteria
US8583673B2 (en) * 2009-08-17 2013-11-12 Microsoft Corporation Progressive filtering of search results
US8473501B2 (en) * 2009-08-25 2013-06-25 Ontochem Gmbh Methods, computer systems, software and storage media for handling many data elements for search and annotation
IL201130A (en) * 2009-09-23 2013-09-30 Verint Systems Ltd Systems and methods for large-scale link analysis
US9405841B2 (en) * 2009-10-15 2016-08-02 A9.Com, Inc. Dynamic search suggestion and category specific completion
US8176032B2 (en) * 2009-10-22 2012-05-08 Ebay Inc. System and method for automatically publishing data items associated with an event
US20110128288A1 (en) * 2009-12-02 2011-06-02 David Petrou Region of Interest Selector for Visual Queries
US9176986B2 (en) * 2009-12-02 2015-11-03 Google Inc. Generating a combination of a visual query and matching canonical document
US8811742B2 (en) * 2009-12-02 2014-08-19 Google Inc. Identifying matching canonical documents consistent with visual query structural information
US8805079B2 (en) * 2009-12-02 2014-08-12 Google Inc. Identifying matching canonical documents in response to a visual query and in accordance with geographic information
US9710556B2 (en) 2010-03-01 2017-07-18 Vcvc Iii Llc Content recommendation based on collections of entities
CN102207936B (en) * 2010-03-30 2013-10-23 国际商业机器公司 Method and system for indicating content change of electronic document
US8874432B2 (en) * 2010-04-28 2014-10-28 Nec Laboratories America, Inc. Systems and methods for semi-supervised relationship extraction
US8874550B1 (en) * 2010-05-19 2014-10-28 Trend Micro Incorporated Method and apparatus for security information visualization
US8862458B2 (en) * 2010-11-30 2014-10-14 Sap Ag Natural language interface
WO2012088023A2 (en) * 2010-12-20 2012-06-28 Akamai Technologies, Inc. Methods and systems for delivering content to differentiated client devices
AU2012207503A1 (en) * 2011-01-17 2013-09-05 Chacha Search, Inc. Method and system of selecting responders
US8838582B2 (en) * 2011-02-08 2014-09-16 Apple Inc. Faceted search results
US9064004B2 (en) * 2011-03-04 2015-06-23 Microsoft Technology Licensing, Llc Extensible surface for consuming information extraction services
US9824138B2 (en) * 2011-03-25 2017-11-21 Orbis Technologies, Inc. Systems and methods for three-term semantic search
US8856099B1 (en) * 2011-09-27 2014-10-07 Google Inc. Identifying entities using search results
US8768910B1 (en) * 2012-04-13 2014-07-01 Google Inc. Identifying media queries
EP2839391A4 (en) * 2012-04-20 2016-01-27 Maluuba Inc Conversational agent
US20130332450A1 (en) 2012-06-11 2013-12-12 International Business Machines Corporation System and Method for Automatically Detecting and Interactively Displaying Information About Entities, Activities, and Events from Multiple-Modality Natural Language Sources
US9020962B2 (en) * 2012-10-11 2015-04-28 Wal-Mart Stores, Inc. Interest expansion using a taxonomy
US9846836B2 (en) * 2014-06-13 2017-12-19 Microsoft Technology Licensing, Llc Modeling interestingness with deep neural networks

Patent Citations (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5980096A (en) * 1995-01-17 1999-11-09 Intertech Ventures, Ltd. Computer-based system, methods and graphical interface for information storage, modeling and stimulation of complex systems
US6816858B1 (en) * 2000-03-31 2004-11-09 International Business Machines Corporation System, method and apparatus providing collateral information for a video/audio stream
US7013323B1 (en) * 2000-05-23 2006-03-14 Cyveillance, Inc. System and method for developing and interpreting e-commerce metrics by utilizing a list of rules wherein each rule contain at least one of entity-specific criteria
US20050108001A1 (en) * 2001-11-15 2005-05-19 Aarskog Brit H. Method and apparatus for textual exploration discovery
US20060116994A1 (en) * 2004-11-30 2006-06-01 Oculus Info Inc. System and method for interactive multi-dimensional visual representation of information content and properties
US20070282665A1 (en) * 2006-06-02 2007-12-06 Buehler Christopher J Systems and methods for providing video surveillance data
US20080027707A1 (en) * 2006-07-28 2008-01-31 Palo Alto Research Center Incorporated Systems and methods for persistent context-aware guides
US20080294978A1 (en) * 2007-05-21 2008-11-27 Ontos Ag Semantic navigation through web content and collections of documents
US20120011428A1 (en) * 2007-10-17 2012-01-12 Iti Scotland Limited Computer-implemented methods displaying, in a first part, a document and in a second part, a selected index of entities identified in the document
US8700604B2 (en) * 2007-10-17 2014-04-15 Evri, Inc. NLP-based content recommender
US20100017427A1 (en) * 2008-07-15 2010-01-21 International Business Machines Corporation Multilevel Hierarchical Associations Between Entities in a Knowledge System
US20100076972A1 (en) * 2008-09-05 2010-03-25 Bbn Technologies Corp. Confidence links between name entities in disparate documents
US20100114899A1 (en) * 2008-10-07 2010-05-06 Aloke Guha Method and system for business intelligence analytics on unstructured data
US20110225155A1 (en) * 2010-03-10 2011-09-15 Xerox Corporation System and method for guiding entity-based searching
US8645125B2 (en) * 2010-03-30 2014-02-04 Evri, Inc. NLP-based systems and methods for providing quotations
US20110258556A1 (en) * 2010-04-16 2011-10-20 Microsoft Corporation Social home page
US20110282892A1 (en) * 2010-05-17 2011-11-17 Xerox Corporation Method and system to guide formulations of questions for digital investigation activities
US20120117475A1 (en) * 2010-11-09 2012-05-10 Palo Alto Research Center Incorporated System And Method For Generating An Information Stream Summary Using A Display Metric
US20120158687A1 (en) * 2010-12-17 2012-06-21 Yahoo! Inc. Display entity relationship
US20120197862A1 (en) * 2011-01-31 2012-08-02 Comsort, Inc. System and Method for Creating and Maintaining a Database of Disambiguated Entity Mentions and Relations from a Corpus of Electronic Documents
US20130124490A1 (en) * 2011-11-10 2013-05-16 Microsoft Corporation Contextual suggestion of search queries

Cited By (43)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130311166A1 (en) * 2012-05-15 2013-11-21 Andre Yanpolsky Domain-Specific Natural-Language Processing Engine
US10698964B2 (en) 2012-06-11 2020-06-30 International Business Machines Corporation System and method for automatically detecting and interactively displaying information about entities, activities, and events from multiple-modality natural language sources
US20160132483A1 (en) * 2012-08-02 2016-05-12 American Express Travel Related Services Company, Inc. Systems and methods for semantic information retrieval
US20140039877A1 (en) * 2012-08-02 2014-02-06 American Express Travel Related Services Company, Inc. Systems and Methods for Semantic Information Retrieval
US9805024B2 (en) * 2012-08-02 2017-10-31 American Express Travel Related Services Company, Inc. Anaphora resolution for semantic tagging
US20160328378A1 (en) * 2012-08-02 2016-11-10 American Express Travel Related Services Company, Inc. Anaphora resolution for semantic tagging
US9424250B2 (en) * 2012-08-02 2016-08-23 American Express Travel Related Services Company, Inc. Systems and methods for semantic information retrieval
US9280520B2 (en) * 2012-08-02 2016-03-08 American Express Travel Related Services Company, Inc. Systems and methods for semantic information retrieval
US20140095147A1 (en) * 2012-10-01 2014-04-03 Nuance Communications, Inc. Situation Aware NLU/NLP
US9619459B2 (en) * 2012-10-01 2017-04-11 Nuance Communications, Inc. Situation aware NLU/NLP
US10282419B2 (en) 2012-12-12 2019-05-07 Nuance Communications, Inc. Multi-domain natural language processing architecture
US20140200879A1 (en) * 2013-01-11 2014-07-17 Brian Sakhai Method and System for Rating Food Items
US9672827B1 (en) * 2013-02-11 2017-06-06 Mindmeld, Inc. Real-time conversation model generation
US20140372102A1 (en) * 2013-06-18 2014-12-18 Xerox Corporation Combining temporal processing and textual entailment to detect temporally anchored events
WO2015096625A1 (en) * 2013-12-23 2015-07-02 语联网(武汉)信息技术有限公司 Information fragment translating method and system
CN103744841A (en) * 2013-12-23 2014-04-23 武汉传神信息技术有限公司 Information fragment translating method and system
US9928295B2 (en) * 2014-01-31 2018-03-27 Vortext Analytics, Inc. Document relationship analysis system
US11243993B2 (en) * 2014-01-31 2022-02-08 Vortext Analytics, Inc. Document relationship analysis system
US20150220539A1 (en) * 2014-01-31 2015-08-06 Global Security Information Analysts, LLC Document relationship analysis system
CN104035975A (en) * 2014-05-23 2014-09-10 华东师范大学 Method for distant-supervision extraction of person relations using Chinese online resources
US20170199867A1 (en) * 2014-10-30 2017-07-13 Mitsubishi Electric Corporation Dialogue control system and dialogue control method
US10489514B2 (en) * 2015-03-18 2019-11-26 Nec Corporation Text visualization system, text visualization method, and recording medium
US20180089181A1 (en) * 2015-03-18 2018-03-29 Nec Corporation Text visualization system, text visualization method, and recording medium
US10586156B2 (en) 2015-06-25 2020-03-10 International Business Machines Corporation Knowledge canvassing using a knowledge graph and a question and answer system
US10867256B2 (en) * 2015-07-17 2020-12-15 Knoema Corporation Method and system to provide related data
US20170017897A1 (en) * 2015-07-17 2017-01-19 Knoema Corporation Method and system to provide related data
US10108907B2 (en) * 2015-07-17 2018-10-23 Knoema Corporation Method and system to provide related data
US10176251B2 (en) * 2015-08-31 2019-01-08 Raytheon Company Systems and methods for identifying similarities using unstructured text analysis
US10417578B2 (en) * 2015-09-25 2019-09-17 Conduent Business Services, Llc Method and system for predicting requirements of a user for resources over a computer network
US20170091653A1 (en) * 2015-09-25 2017-03-30 Xerox Corporation Method and system for predicting requirements of a user for resources over a computer network
US10650192B2 (en) * 2015-12-11 2020-05-12 Beijing Gridsum Technology Co., Ltd. Method and device for recognizing domain named entity
CN107180030A (en) * 2016-03-09 2017-09-19 阿里巴巴集团控股有限公司 Method and device for generating relationship data in a network
US11693900B2 (en) * 2017-08-22 2023-07-04 Subply Solutions Ltd. Method and system for providing resegmented audio content
US20200183971A1 (en) * 2017-08-22 2020-06-11 Subply Solutions Ltd. Method and system for providing resegmented audio content
CN111615696A (en) * 2017-11-18 2020-09-01 科奇股份有限公司 Interactive representation of content for relevance detection and review
US10783321B2 (en) * 2018-03-28 2020-09-22 Konica Minolta, Inc. Document creation support device and program
CN109446336A (en) * 2018-09-18 2019-03-08 平安科技(深圳)有限公司 News screening method, apparatus, computer device, and storage medium
US20220084508A1 (en) * 2020-09-15 2022-03-17 International Business Machines Corporation End-to-End Spoken Language Understanding Without Full Transcripts
US11929062B2 (en) * 2020-09-15 2024-03-12 International Business Machines Corporation End-to-end spoken language understanding without full transcripts
US20220215029A1 (en) * 2021-01-05 2022-07-07 Salesforce.Com, Inc. Personalized nls query suggestions using language models
US11755596B2 (en) * 2021-01-05 2023-09-12 Salesforce, Inc. Personalized NLS query suggestions using language models
US20230091949A1 (en) * 2021-09-21 2023-03-23 NCA Holding BV Data realization for virtual collaboration environment
US20230094459A1 (en) * 2021-09-21 2023-03-30 NCA Holding BV Data modeling for virtual collaboration environment

Also Published As

Publication number Publication date
US20170140057A1 (en) 2017-05-18
CN103488663A (en) 2014-01-01
US10698964B2 (en) 2020-06-30

Similar Documents

Publication Publication Date Title
US10698964B2 (en) System and method for automatically detecting and interactively displaying information about entities, activities, and events from multiple-modality natural language sources
US20140195884A1 (en) System and method for automatically detecting and interactively displaying information about entities, activities, and events from multiple-modality natural language sources
US11698920B2 (en) Methods, systems, and computer-readable media for semantically enriching content and for semantic navigation
Nasar et al. Textual keyword extraction and summarization: State-of-the-art
US20180246890A1 (en) Providing answers to questions including assembling answers from multiple document segments
Ortega Academic search engines: A quantitative outlook
US9659084B1 (en) System, methods, and user interface for presenting information from unstructured data
US20130305149A1 (en) Document reader and system for extraction of structural and semantic information from documents
US20090070322A1 (en) Browsing knowledge on the basis of semantic relations
Yi et al. Revisiting the syntactical and structural analysis of Library of Congress Subject Headings for the digital environment
Fernandez et al. Linking data across universities: an integrated video lectures dataset
Hinze et al. Improving access to large-scale digital libraries through semantic-enhanced search and disambiguation
Armentano et al. NLP-based faceted search: Experience in the development of a science and technology search engine
Wimalasuriya et al. Using multiple ontologies in information extraction
Spitz et al. EVELIN: Exploration of event and entity links in implicit networks
GB2592884A (en) System and method for enabling a search platform to users
Qumsiyeh et al. Searching web documents using a summarization approach
Cameron et al. Semantics-empowered text exploration for knowledge discovery
Stranisci et al. The World Literature Knowledge Graph
Uma et al. A survey paper on text mining techniques
Sheela et al. Criminal event detection and classification in web documents using ANN classifier
Johnny et al. Key phrase extraction system for agricultural documents
Norouzi et al. A spatiotemporal semantic search engine for cultural events
Schoen et al. AI Supports Information Discovery and Analysis in an SPE Research Portal
Bashaddadh et al. Topic detection and tracking interface with named entities approach

Legal Events

Date Code Title Description
AS Assignment

Owner name: INTERNATIONAL BUSINESS MACHINES CORPORATION, NEW YORK

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:CASTELLI, VITTORIO;FLORIAN, RADU;LUO, XIAOQIANG;AND OTHERS;REEL/FRAME:028355/0150

Effective date: 20120611

AS Assignment

Owner name: DARPA, VIRGINIA

Free format text: CONFIRMATORY LICENSE;ASSIGNOR:INTERNATIONAL BUSINESS MACHINES CORPORATION;REEL/FRAME:029034/0526

Effective date: 20120820

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION