US20170161257A1

US20170161257A1 - System and method for linguistic term differentiation

Info

Publication number: US20170161257A1
Application number: US15/437,297
Authority: US
Inventors: Athena Ann Smyros; Constantine Smyros
Original assignee: Individual
Current assignee: Intelligent Language LLC
Priority date: 2013-05-02
Filing date: 2017-02-20
Publication date: 2017-06-08
Also published as: US9575958B1

Abstract

A representative system and method for linguistic differentiation comprises a computing device: receiving input data from a requestor; generating a plurality of term units from the input data, where the plurality of term units comprise of a first number of term units; identifying a plurality of differentiable terms of the plurality of term units, where the plurality of differentiable terms comprise a second number of term units; determining a differentiability score for each term unit of the plurality of term units; determining an input data score for the input data by evaluating a ratio of the second number of term units to the first number of term units; and transmitting, a plurality of differentiable term units to the requestor in order of their differentiability scores.

Description

RELATED APPLICATIONS

This application is a continuation of U.S. patent application Ser. No. 14/268,581, entitled “DIFFERENTIATION TESTING” and filed on 2 May 2014, which application claims priority to U.S. Provisional Patent Application No. 61/818,904, entitled “DIFFERENTIATION TESTING” and filed on 2 May 2013, which applications are hereby incorporated herein by reference.

BACKGROUND

Currently, various communication devices are being rapidly introduced that need to interact with natural language in an unstructured manner. Communication systems are finding it difficult to keep pace with the introduction of devices, as well as the growth of information.

BRIEF DESCRIPTION OF THE DRAWINGS

The accompanying drawings are incorporated in and are a part of this specification. Understanding that the drawings illustrate only representative embodiments of the invention and are not therefore to be considered as limiting scope. Representative embodiments will be described and explained more fully with reference to the accompanying drawing in which:

FIG. 1 illustrates a differentiation workflow that is usable with representative embodiments described herein; and

FIG. 2 representatively illustrates a block diagram of a computer system which is adapted for use in accordance with representative embodiments.

DETAILED DESCRIPTION

Differentiation in language starts with a set of terms used in the language, such as words, numbers, alphabetical and/or numerical codes, and/or the like. This forms a largest language set of words that provide a basis for most language analysis that concentrates on using meaning to perform useful functions. This term list may be similar to a dictionary of terms for a language, but may include terms normally not found in dictionaries, such as email addresses, phone numbers, product codes, and other information. In representative embodiments, such a list represents a largest set of usable terms within a language. The set contains all such terms, and all such possible combination of terms, that are currently usable within a data or document repository. Note that the term list may not have all possible combinations within a given language's alphabet, since it generally contains numbers like integers, and such numbers may comprise an infinite set that is not usable for language operations on a computer. For instance, a company repository may not contain all possible combinations of the alphabet in a given language, just those that have meaning at the point in time that the repository exists and is being analyzed by the system.
From a maximum set of terms, there may be, for any language-based functions performed by a computer, ways of distinguishing one term from another within a fixed set of meaning(s), such as a dictionary sense of the meaning of a word. This refers to the ability of a term to be put into a set of meanings that does not require further modification or explanation. Hence, such a term can be used to separate one text stream from another. For instance, the use of the term “application” does not indicate the fixed meaning that would allow it to be placed into a set of meanings without some explanation of what kind of “application” is being indicated. It could be an insurance application, a software application, a car paint application, or the like. Therefore, the term “application” is not useful on its own to distinguish itself from another text stream that contains the same term without requiring more terms to put it into a fixed meaning that is useful to perform any analysis. This is a nondifferentiated term. This can be contrasted with the term “fireplace,” which has a fixed meaning that is useful to perform analysis and is a differentiated term. The type of fireplace enhances the meaning but does not alter the fixed meaning of fireplace, as in the above-described examples, where the word “application” has various fixed meanings that require more terms to provide fixed meanings.
Differentiation testing may be used to determine the usefulness of a term on a single-term basis or within a set of terms. Note that the term may also be a separator between one text stream and another text stream in any given language. Differential testing may use any form of text as an input. For example, a message, file, blog, document, email, as well as input directly obtained from a user, or another request from a system, can be used as input. Differential testing involves performing a comparison between the input against one or more members of a repository or comparison set, depending on the implementation. Note that the input may comprise a plurality of different input types. Thus, input can check for differentiation with other input, in addition to comparing input to a comparison set. The input is checked for differentiation to the terms in each document in the undifferentiated comparison set, or it may be compared in general after filtering. There is no requirement that both or all operands of a comparison require differentiation to derive benefits of differentiated input. Any text stream within the system can be used at any time to run differentiated testing.
FIG. 1 representatively illustrates a workflow that is usable with the system. The differentiation list is a set of terms 101, typically a single term unit (TU) in most languages, including English, that are considered to be differentiable in a language. These may be restricted by part of speech (POS), such as noun or verb. Some languages, such as English, can count all verbs as nondifferentiable to some degree. There may be both a binary classification or an n-ary classification scheme used separately or for the entire set of POS used in a given language; this may also vary depending on implementation, since while POS is normally used to construct a list, the list may be constructed using any metric that causes something to be differentiated from something else. The number of total terms may be less than the number of differentiable terms in a given language. Implication, such as pronoun usage, will cause measurement problems for implementations that do not take into account such uses, and also are not necessarily directly differentiable, depending on their antecedent. Each list that is generated is based on a specific language, and some languages may be difficult to classify at the term level, especially where idiomatic-based pictographs (characters) may be used to construct sentences, as in Chinese.
User input 102 represents input in text form or a non-text communication method that has been converted to text form. There is generally no restriction on the size of the input. In addition, the output required from the system can be sent or a default for a particular implementation can be used.
There are several variables that may be measured, depending on the implementation and its scope of input. One of these is writing type. Generally, from a discourse level, there are four basic writing types: expository, argumentation, description and narration. The writing type indicates the purpose for the communication—such as technical manuals or textbooks (expository), marketing collateral or job applications (argumentation), news stories or poetry (description), and novels or biographies (narration). There are several different types of writing that are used throughout a given language ID, and these may occur at random within any repository, even when the repository is limited in scope, such as representing a single device or a single information domain.
Another variable that may need to be measured is writing style. Some writing styles are terse, while others contain significant modification or extra words based on use of inversion and other sentence constructions. This refers to the range of expression that is possible within a language, and differentiation can be measured without regard to such expression ranges. A significant measure in processing the input is the apparent level of summarization. This triggers the use of differentiation when the text is already in a summarized or highly condensed form, and lacks a single particular focus. The use of differentiation may be implemented for the development of an outline/hierarchy or a summary when various language analyzers are used, such as topical, date, and location.
Some measurements of differentiation may involve the use of a functional scope, which can be used to limit operation of the measurement of differentiation, such as restricting the measure to a certain amount of text, such as a document, message, section of a file, a single input, or a part of, or even an entire repository. The scope can be defined as the input range over and including the repository itself. For a given implementation, there may be a document scope, a section of a document scope, a paragraph scope, or a TU scope. The TU represents a discrete word, number, or other symbol in the language that has specific meaning within the language. The differentiable measure can then be applied to these different functional scopes because it is applied at the TU scope. In addition, a scope may also include parts of a word, such as a suffix or a prefix, or an individual character within a specific language, such as the letter “a” in English. Scopes such as these, below the TU scope, generally do not use differentiation since it measures TUs for their ability to restrict an object range within a given language.
An optional intersection 103 between the differentiated list and the repository word list, such as Windex, can be used if implemented. Windexes are discussed in U.S. patent application Ser. No. 12/192,794, entitled “SYSTEMS AND METHODS FOR INDEXING INFORMATION FOR A SEARCH ENGINE” filed 15 Aug. 2008, the entirety of which is incorporated herein by reference.
Lists for differentiation can be kept up-to-date by intersection and using only those terms that are in the term-encoding scheme. This may be especially useful in data caches, since these have limited memory and there is generally no reason to store an entire differentiated list for a language if the current repository does not contain a particular term. Updating can be triggered in real-time with new addition(s) to the repository word list or Windex, which comprises a simple binary search of the differentiated list.
A test at step 104 to see if differentiable terms exist in the input may be performed next. If a differentiated term exists, then a term extraction process can begin. If no differentiated term exists, then there is generally no raw material to determine differentiability, and the system may be configured to apply an input value of 0 (zero). For this condition, the system may be adapted to terminate operations at step 105. A message can be generated that indicates that follow-on processes cannot be performed with the input because, as far as searching is concerned, there is generally no mechanism to differentiate the terms in the text stream input against what is used as the comparison set.
When a differentiated term exists, then step 106 performs term extraction. Text extraction involves finding terms that might be used for the differentiation. This is based on what was determined for the differentiation list in step 101. Depending on implementation, this may involve isolation of terms that meet similar grammatical requirements, such as all the nouns or all the verbs found in the input. It may also be based on uses of the input. E.g., some input may require testing for a specific set of terms, and these may not fall into a single grammatical category like POS. Terms are extracted and any appropriate filters, such as a grammatical categorization, applied as part of the term extraction process to eliminate terms that will not be used in the differentiation process. Typically, the input may use a single differentiation test, but any number may be applied. A differentiation test may be unary, binary or n-ary in nature, and the differentiable list is usually substantially equivalent to the control set when the comparison is made. However, tests can be performed when a document has passed through the differentiation testing process, or with one that has no such test. Binary and n-ary tests occur when the term extraction process indicates that the terms comprise both single word and multi-word terms. This generally occurs in most implementations of differentiation testing. Therefore, the extraction 106 should result in a list of phrases as well as single words that are used by the differentiation tests.
Thereafter, the process for assigning differentiation weights/scores at step 107 can begin. A term length is generally recorded as part of the process, so that a term has at least one TU or word (as in English). Depending on implementation, there may be several stages of differentiation testing at this point, set up by the grammatical or functional boundaries in the extracted terms. A noun that ends a multi-term set may be used, or a verb may be used, or a set of terms found in a list may be used, depending on implementation. For instance, a set of time indicators may be found to be differentiators for a given implementation. These are generated as a list, and then checked to see if any of the terms are in a phrase or not. In English, a verb phrase has auxiliary and main parts; generally, only the main part is tested for differentiation, since auxiliary verbs are used more frequently. In some cases, both verbs and nouns can be used, such has when there is a set of functions that are to be differentiated for a device, such as the a representative differentiation list comprising “play, run, stop, start, pause”, where it is possible that different inflections can refer to the same button or some other feature of the device. A weight is given to each term found in the input based on this list. In this example, the differentiation list may score a phrase “play the DVD” higher than “stop the DVD,” because the DVD must be played before it can be stopped. Therefore, the weight in the differentiated list would be higher for “play” and “run”, but lower for “start,” “stop,” and “pause.”
Determining a score/weight in step 107 usually takes place at TU level, but may be performed with any character that maps to an identifiable unit in a specific language. Each TU in input may be intersected with the differentiation control set. Each TU may be graded based on any linear arrangement, and may be a simple binary value (e.g., 0 or 1) or a more complex system that takes into account variations in differentiation. In some cases, this may be based on a differentiable list suitable for the process. Each TU in a grammatical scheme may be either a noun, modifier, adverb, or other language classification based on the individual function of a word. Typically, a noun-based test is sufficient, as verbs might generally be classified as nondifferentiable. This can be changed as required, depending on implementation, as when the function of a device is used, as in the above-described example.
For some implementations, an object test may be the major differentiator test, since it may be a significant element. An object is usually a noun, but depending on implementation, may also be a subset of a noun or may include verbs or other terms that are functioning as a noun in a particular text stream input. The object functions would get a higher score and impact the score for a multi-word test more than a modifier test, if used. Such a follow-on modifier test may be used to determine differentiation when included as part of a larger set of terms, as it serves to restrict the set of possible objects referred to by the object. Note that these are not a required part of the measurement, as the noun is still a more significant determiner of differentiation when an object test is indicated. However, a more precise test will generally require differentiation of adjectives. This may also be measured along grammatical lines, such as which modifier has more impact on limiting the object in question. For many implementations, associations, substitutions, and other such similarity tests may be used to group like terms together to weight differentiated terms similarly. This can vary based on implementation. Also, some prefixes and suffixes in some languages will receive attention when compared against other terms with the same root words. This process also may affect the outcome of a process that has sensitivity in this area, such as sentiment analysis, feature-based analysis, abstractions, and/or the like. Therefore, a weight may be assigned as well to distinguish differentiation when the differentiation list does not contain prefix and suffix variations. These may be applied to a modifier relation of an object, or they may be applied to the object itself, depending on where it occurred in the input.
Several examples follow. The differentiation list for these examples=(acoustic, paint). The weighting comprises a simple binary scale that adds an additional weight when the object value is differentiable, and adds a one (1) when a term is differentiable, regardless where the term occurs in the input. Example 1: “special qualities” =nondifferentiable (both individually and as an multi-word term), but still more differentiable than a nondifferentiable single word (weight 0 or norm). Example 2: “special (nondifferentiable) acoustics (differentiable)” and the term is differentiable owing to the object value (acoustics). Weight=2, since the object is higher than the nonobject. Example 3: “acoustical (differentiable) qualities (nondifferentiable)” and is differentiable, but less differentiable than the object that has a differentiable value, making the weight=1 (e.g., less than the object version). Example 4: “acoustical (differentiable) paints (differentiable)” is differentiable and is the most differentiable (among the examples herein) of any phrase or multi-word term since all members are differentiable (weight=3).
Once all terms have been differentiated, then they can be grouped together based on some criteria in optional step 108. One approach may comprise the creation of groups based on term length (e.g., cardinality). There are usually at least two group types, the single and the multi-word term. In the single TU, the score doesn't have other factors, unless the function or POS is considered. There are implication issues and these affect the classification system in POS-based implementations. In multiple-unit terms, the end or terminating object may be weighted more than other TUs in length. Modifiers or terms are measured and scored for each multiple unit term. Each unique group has an individual score, along with each component of an multiple-unit term. For example, for the input “the dog went shopping at the Cordova Mall,” terms that are in the exemplary filter=“dog,” “shopping,” “Cordova,” “mall.” No functional words or verbs are being used in this representative basic filter measured for differentiation. These are considered nondifferentiable for this implementation. Each term is analyzed for differentiation as follows: “dog” =differentiable; “shopping” =nondifferentiable; “Cordova”=differentiable; “mall” =nondifferentiable. “Cordova mall,” in the example, is an multi-word term, so its scores are combined. In a binary context, it gets a score of 1. This means that it has a differentiable member. Note that the terminating object is nondifferentiable. If it were differentiable, it would get a special weight, and the score would be higher.
An optional input score at step 109 can be calculated for an entire input by using a summation, division, multiplication, other mathematical method based on the input size, output requirements, and/or other implementation-specific data. Each differentiable term of the input participates in the final score, for most purposes. Using the list in the above-described example of “acoustic” and “paint,” and an input: “Acoustical paint has been found to be useful.” —in this this input, the differentiated multi-word term is equal to “acoustical paint.” The score based on step 107 is equal to 3, since both terms are differentiable and the term “paint” is an object, using an object scoring system as described above. A score can be generated for an entire input based on implementation; for this example, a score for the input may be based on the number of differentiated terms that have been located within the set of object and modifier-object term sets. In this example, the addition term (a modifier) to be considered is “useful,” which has a score of 0. Therefore, the aggregate score for this input would be equal to 3. Depending on implementation requirements, a deduction can be made for this term since it is undifferentiated, such as subtracting one from the initial phrase, giving the score for this a 2. It is also possible to assign weights based on function; since the term “useful” is not an object, maybe only half a point is subtracted, making the score 2.5. This is advantageous to gauge the input as a whole and how beneficial it might be in different situations. For instance, for a search term, only the differentiated portion of the input is used, and that should be reflected in those term(s) that comprise the input score. In general, the other terms may be safely dropped in certain implementations, since they will not help locate documents that correspond to the input.
A representative use of differentiation testing may be to analyze a topical outline. A branch of the outline may be considered to start at the top node or a chosen topic. The topic may or may not have enough differentiation to convey a meaning that is related to the underlying document.
In this case, the topic path may be augmented by differentiation-based multi-word terms and single words that provide significant information in the form of subtopics to each topic that implements differentiation. The root of the path may or may not be differentiable. The ability to provide this information is important for summarized, argumentation documents, like job descriptions, marketing communications, and even summaries of larger documents. A quick view can be generated for such documents based on augmented differentiation information, and may be a smaller subset of the full topical outline. The quick view can give significant information without having to read a lot of extraneous words and still provide an overview of a document. This can be implemented in outline form, giving several subtopics that contain differentiable information about each parent topic. Any number of topics in the chain can be augmented using differentiation, and therefore can build any number of quick views.
For further information on topical analysis, see U.S. patent application Ser. No. 13/689,656, entitled “SYSTEMS and METHODS OF TOPICAL ANALYSIS” filed on 29 Nov. 2012, the entirety of which is incorporated herein by reference.
Another representative use of differentiation testing involves listings of product features. In some instances, descriptions of product features use nondifferentiable labels, such as best, perfect, etc. This may also be true of various other marketing documents or collateral. These labels may or may not contain actual data that is useful for a particular analysis, and the determination of what differentiable features may be desired to show the user (or input process) why a certain product is better than another product. Feature extraction, in generic terms, is the ability to ascertain characteristics about a focus object, such as a camera or a lawn mower. Most nondifferentiable descriptors do not impart any actual meaning with respect to the object's characteristics; rather, they qualify such features in a judgmental fashion that may not concur with the judgement of a current viewer/user.
Corrections to an input in step no are optional and may be performed by differentiation testing to make the input more usable. For instance, a calling function may want to compare two documents that contain flowery language, but the information desired is of the feature of a robot to perform a certain task. Most flowery words, such as “quality,” “feature,” or the like, are generally not helpful for running a search, and may need to be located and separated from terms that are helpful, such as “line-of-sight requirements,” “range-of-motion,” and “actuator arm.” These represent differentiators that will make the comparison beneficial to the task at hand. These terms may then be isolated in each document. A comparison is made and an initial decision can be made if there is enough information to proceed with more in-depth analysis. Any types of corrections to the inputs with respect to differentiation may be used in any number of information retrieval schemes. For instance, removing non sequitur portions to send to a search engine can be considered; e.g., in “acoustical paint has been found to be useful,” only the differentiated terms “acoustical paint” would be sent to a search engine to find documents that relate to that kind of paint.
The data can be returned in step iii in any number of forms, since there are several different ways in which an input can be represented. The system can return the differentiated part of the input only (as in “acoustical paint” above). This may be useful for information retrieval tasks and other such data analysis that depend on being able to find a distinguishable point from which to perform a set of functions. Another return may be to indicate the differentiated portions using any number of methods, such as encoding the output to show differentiated portions. Any such returns may also indicate weights and/or scores for each individual component of the input, as well as optionally for the entire input. The returned data may be presented to a user, e.g., via a display, or other man-machine interface, or the returned data may be provided to another program application that uses the returned data as input for further processing.
The returned data may be used to determine if a secondary process can be successfully run, or if more information is required about the input. This may be the case when automated processes are in place and human intervention is not feasible for a given business environment. For instance, a set of messages being compared against a control message, such as a filter that needs to determine if the message is indicating that a part of a system needs to be shut down. If there is not enough information to determine whether there is a problem, then the user (or input process) needs to be informed that the message is incomplete, which may be determined by a lack of differentiated terms in the input message.
Another example of this is the use of a control document (as a requirements document) with filtering of documents that do not contain any more detailed information on indicated requirements. Assuming that the documents must have required information, then the user (or input process) may be informed that a document does not meet the requirements. This may be accomplished by using differentiation systems and/or processes as representatively disclosed. Accordingly, a message can be raised to the user (or input process) without human intervention. These scenarios can be addressed with the use of a differentiable measure that determines suitability of a document to perform a particular task at hand. In order to accomplish this, differentiation testing can be used to develop a metric that a document can be measured against.
A representative system example follows. There is a single document in the repository, and the document contains the following text: “The use of acoustic paint is necessary in sound sensitive environments to remove ambient sound from test equipment for hearing tests. This removes the expense of hearing booths.” Differentiation is used to build a set of search terms that will allow the system to distinguish between other documents such that documents that contain similar information can be found automatically.
The system builds a differentiated list in step 101. This is done by performing differentiation on the language being used for possible inputs from the user. Depending on implementation, this list contains all possible differentiable terms for the language at the point in time of system invocation. This process may be performed at different points in time based on system requirements.
The system obtains input from the requestor in step 102, which contains the single document in the repository as representatively identified above. The document is then reduced to a set of terms that are separated by POS or other such grammar function. In this case, only nouns and modifiers will be used. The complete set of terms ignores functional words, since they are typically undifferentiated in a language. The list is equal to “use, acoustic, paint, necessary, sound-sensitive, environments, remove, ambient, sound, test, equipment, hearing, booth.” This list represents the repository word list, which is then intersected with the differentiated list in step 103 to determine if there are any terms that are considered differentiable in this representative example.
Intersection produces differentiable terms in step 104. This list is equal to “acoustic, paint, booth.” Once this list has been found, the terms are extracted in their full form, including any phrases in the above text, by text extraction in step 106 to produce the results: “use,” “acoustic paints,” “necessary,” “sound-sensitive environments,” “remove,” “ambient sound,” “test equipment,” “expense,” “hearing booths.” Note that term extraction may treat prepositional phrases and infinitives differently based on implementation.
Each term is then processed for differentiable score weights in step 107 based on the term extraction results. For the given example, a simple binary system is employed as representatively described above. For this application of a search term, only those found to have differentiable scores equal or above 1 will be used. The first term “use” is nondifferentiated, so it gets a score of 0 and is not used as a search term. The second multi-word term “acoustic paints” is differentiated, where both terms are also individually differentiated, getting the score of 3, and is therefore used as a search term. The third term “necessary” is not differentiated, so it gets a score of 0 and is not used as a search term. The fourth multi-word term “sound-sensitive environments” contains only nondifferentiable terms, but since it contains multiple terms it gets a score of 1 and is used as a search term. The fifth single-word term, “remove” is not differentiated, and therefore gets a score of 0 and is not used as a search term. The sixth term “ambient sound” is also nondifferentiated, but contains more than one term, so it gets a score of 1 and is used as a search term, as is the seventh term “test equipment.” The eighth term “expense” is a single term that is nondifferentiated, and gets a score of 0 and is not used as a search term. The last term, “hearing booths,” contains a differentiated object and gets a score of 2, and is used as a search term.
In this process, there is no grouping that is required so optional step 108 is not performed. If there is a need to calculate an input score in step 109, it can be done by determining the document score to determine how differentiable the document is and how well-formed its group of search terms will be. In this example, the document score will be the number of terms that are differentiated over the total number of terms. There are 9 terms in this example, and 5 are differentiated. Accordingly, it would get a score of 5/9, which is a moderate differentiated score for a document. In this case, no corrections to the input would be required. The output in step 111 is the five differentiated terms: “acoustic paints,” “hearing booths,” “sound-sensitive environments,” “ambient sound,” and “test equipment” in order of their differentiated scores.
FIG. 2 representatively illustrates computer system 200 adapted to use representative embodiments. Central processing unit (CPU) 201 is coupled to system bus 202. The CPU 201 may be any general-purpose CPU, such as an Intel Pentium processor. However, embodiments herein are not restricted by the architecture of CPU 201, as long as CPU 201 supports operations as described herein. Bus 202 is coupled to random access memory (RAM) 203, which may be SRAM, DRAM, or SDRAM. ROM 204 is also coupled to bus 202, which may be PROM, EPROM, or EEPROM. RAM 203 and ROM 204 hold user and system data and programs, as is appreciated in the art.
Bus 202 is also coupled to input/output (I/O) controller card 205, communications adapter card 211, user interface card 208, and display card 209. The I/O adapter card 205 connects to storage devices 206, such as one or more of a hard drive, a CD drive, a floppy disk drive, or a tape drive, to the computer system. The I/O adapter 205 may also be connected to a printer (not illustrated), which would allow the system to print paper copies of information such as document, photographs, articles, etc. Note that the printing device may comprise a printer (e.g., inkjet, laser, etc.), a fax machine, or a copier machine. Communications card 211 is adapted to couple computer system 200 to a network 212, which may be one or more of a telephone network, a local (LAN) and/or a wide-area (WAN) network, an Ethernet network, and/or the Internet. User interface card 208 couples user input devices, such as keyboard 213, pointing device 207, and microphone (not shown), to the computer system 200. User interface card 208 may also provide sound output to a user via speaker(s) (not illustrated). The display card 209 may be driven by CPU 201 to control display on display device 210.
Note that any of the functions described herein may be implemented in hardware, software, and/or firmware, and/or any combination thereof. When implemented in software, elements of representative embodiments may comprise code segments to perform operations or tasks. The program or code segments can be stored in a computer-readable medium. The “computer-readable medium” may include any physical medium configured to store or transfer information. Examples of a processor-readable medium include an electronic circuit, a semiconductor memory device, a ROM, a flash memory, an erasable ROM (EROM), a floppy diskette, a compact disk CD-ROM, an optical disk, a hard disk, a fiber optic medium, or the like. The code segments may be downloaded via computer networks such as the Internet, Intranet, or the like.
Embodiments described herein may operate on or in conjunction with any network attached storage (NAS), storage array network (SAN), blade server storage, rack server storage, jukebox storage, cloud, storage mechanism, flash storage, solid-state drive, magnetic disk, read only memory (ROM), random access memory (RAM), or any other computing device, including scanners, embedded devices, mobile, desktop, server, or the like. Such devices may comprise one or more of: a computer, a laptop computer, a personal computer, a personal data assistant, a camera, a phone, a cell phone, a mobile phone, a computer server, a media server, a music player, a game box, a smart phone, a data storage device, a measuring device, a handheld scanner, a scanning device, a barcode reader, a point-of-sale device, a digital assistant, a desk phone, an IP phone, a solid-state memory device, a tablet, and/or a memory card.
In a representative embodiment, a method comprises steps of a computing device: receiving an input from a requestor; generating a plurality of term units (TUs) from the input, the plurality of TUs consisting of a first number of terms; identifying a plurality of differentiable terms of the plurality of TUs, the plurality of differentiable terms consisting of a second number of terms; determining a differentiability score for each term unit (TU) of the plurality of TUs; determining an input score for the input by dividing the second number of terms by the first number of terms; and transmitting a plurality of differentiable TUs to the requestor in order of differentiability score. The input may comprise a document or a plurality of documents. The method may further comprise the computing device prioritizing a plurality of documents based on their input scores. The method may further comprise the computing device prioritizing a set of topics based on their differentiability scores. Determination of the differentiability scores may be based on a grammatical scheme or a functional scheme. The grammatical scheme may comprise classification as a noun, a modifier, an adverb, or a verb. The functional scheme may comprise classification based on a linguistic scope. The method may further comprise the computing device producing a topical analysis of the input based on a plurality of differentiability-scored TUs. The method may further comprise the computing device performing a search of the input based on a plurality of differentiability-scored TUs. The linguistic scope may comprise a type of writing, a style of writing, or a linguistic functional scope of the input. The computing device may comprise a computer, a laptop computer, a personal computer, a server computer, a personal data assistant, a camera, a phone, a cell phone, a mobile phone, a smart phone, a tablet, a media server, a music player, a game box, a data storage device, a measuring device, a handheld scanner, a scanning device, a barcode reader, a point-of-sale (POS) device, a digital assistant, a desk phone, an Internet Protocol (IP) phone, a solid-state memory device, or a memory card.
In another representative embodiment, a method may comprise a computing device: receiving a computing device, a document from an originator; generating a plurality of term units (TUs) from the document, wherein each term unit (TU) of the plurality of TUs comprises a word, a multi-word, a number, or a symbol, the plurality of TUs consisting of a first number of TUs; identifying a plurality of differentiable terms of the plurality of TUs, the plurality of differentiable terms consisting of a second number of TUs; determining a differentiability score for each TU of the plurality of TUs; determining a document score for the document by dividing the second number of TUs by the first number of TUs; and transmitting a plurality of differentiable TUs to the originator in differentiability-scored order. The method may further comprise the computing device prioritizing a plurality of documents based on their document scores. The method may further comprise the computing device prioritizing a set of topics based on their differentiability scores. Determination of the differentiability score may be based on a grammatical scheme or a functional scheme. A grammatical scheme may comprise classification as a noun, a modifier, an adverb, or a verb. A functional scheme may comprise classification based on a linguistic scope. A plurality of differentiability-scored TUs may be configured for the originator to use the plurality of differentiability-scored TUs in a topical analysis of a plurality of documents. A plurality of differentiability-scored TUs may be configured for the originator to use the plurality of differentiability-scored TUs in a search of a plurality of documents.
In yet another representative embodiment, a computing device has one or more processors and a non-transitory, computer-readable medium storing a program that is executable by the one or more processors. The program comprises instructions to: receive a document from a requestor; generate a plurality of term units (TUs) from the document, wherein each term unit (TU) of the plurality of TUs comprises a word, a multi-word, a number, or a symbol, the plurality of TUs consisting of a first number of TUs; identify a plurality of differentiable terms of the plurality of TUs, the plurality of differentiable terms consisting of a second number of TUs; determine a differentiability score for each TU of the plurality of TUs; determine a document score for the document, the document score based on a ratio of the second number of TUs to the first number of TUs; and transmit a plurality of differentiable TUs to the requestor in order of at least one of ascending differentiability score or descending differentiability score. The program may further comprise instructions to prioritize a set of topics based on differentiability scores, wherein determination of the differentiability score is based on a grammatical scheme or a functional scheme, wherein: the grammatical scheme may comprise classification as a noun, a modifier, an adverb, or a verb; and the functional scheme may comprise classification based on a linguistic scope. The computing device may comprise a computer, a laptop computer, a personal computer, a server computer, a personal data assistant, a camera, a phone, a cell phone, a mobile phone, a smart phone, a tablet, a media server, a music player, a game box, a data storage device, a measuring device, a handheld scanner, a scanning device, a barcode reader, a point-of-sale (POS) device, a digital assistant, a desk phone, an Internet Protocol (IP) phone, a solid-state memory device, or a memory card. The requestor may comprise a human user, a process of the computing device, or a second (different) computing device.

Claims

What is claimed is:

1. A method comprising:

receiving, by a computing device, an input;

generating, by the computing device, a plurality of term units (TUs) from the input, the plurality of TUs consisting of a first number of TUs;

identifying, by the computing device, a plurality of differentiable TUs of the plurality of TUs, the plurality of differentiable TUs consisting of a second number of TUs;

determining, by the computing device, a differentiability score for each term unit (TU) of the plurality of TUs, wherein the differentiability score indicates a substantially fixed meaning or a non-fixed meaning;

determining, by the computing device, an input score for the input by dividing the second number of TUs by the first number of TUs; and

transmitting, by the computing device, the plurality of differentiable TUs in differentiability-scored order.

2. The method of claim 1, wherein the input comprises a document, and the method is repeated for a plurality of documents.

3. The method of claim 2, further comprising prioritizing, by the computing device, the plurality of documents based on input score.

4. The method of claim 2, wherein the plurality of TUs comprises a set of topics, and further comprising prioritizing, by the computing device, the set of topics based on differentiable TUs of the set of topics.

5. The method of claim 1, wherein determining the differentiability score is based on a grammatical scheme or a functional scheme.

6. The method of claim 5, wherein:

the grammatical scheme comprises classification as a noun, a modifier, an adverb, or a verb; and

the functional scheme comprises classification based on a type of writing.

7. The method of claim 6 further comprising, producing, by the computing device, a topical analysis of the input based on the plurality of differentiable TUs.

8. The method of claim 6 further comprising, performing, by the computing device, a search of the input based on the plurality of differentiable TUs.

9. The method of claim 6, wherein the type of writing comprises a style of writing or a linguistic functional scope of the input.

10. The method of claim 1, wherein the computing device comprises a computer, a laptop computer, a personal computer, a server computer, a personal data assistant, a camera, a phone, a cell phone, a mobile phone, a smart phone, a tablet, a media server, a music player, a game box, a data storage device, a measuring device, a handheld scanner, a scanning device, a barcode reader, a point-of-sale (POS) device, a digital assistant, a desk phone, an Internet Protocol (IP) phone, a solid-state memory device, or a memory card.

11. A method comprising:

receiving, by a computing device, a document;

generating, by the computing device, a plurality of term units (TUs) from the document, wherein each term unit (TU) of the plurality of TUs comprises a word, a multi-word, a number, or a symbol, wherein the plurality of TUs consist of a plurality of differentiable TUs and a plurality of non-differentiable TUs, and the plurality of TUs consists of a first number of TUs;

identifying, by the computing device, the plurality of non-differentiable TUs, wherein each of the plurality of non-differentiable TUs comprises non-fixed meaning;

identifying, by the computing device, the plurality of differentiable TUs, wherein each of the plurality of differentiable TUs comprises substantially fixed meaning, and the plurality of differentiable TUs consists of a second number of TUs;

determining, by the computing device, a differentiability score for each TU of the plurality of TUs;

determining, by the computing device, a document score for the document by dividing the second number of TUs by the first number of TUs; and

transmitting, by the computing device, the plurality of differentiable TUs in order of differentiability score for each TU of the plurality of differentiable TUs.

12. The method of claim 11, wherein the method is repeated for a plurality of documents, and further comprising prioritizing, by the computing device, the plurality of documents based on document score.

13. The method of claim 12, wherein the plurality of TUs comprises a set of topics, and further comprising prioritizing, by the computing device, the set of topics based on differentiability scores of the plurality of TUs of the set of topics.

14. The method of claim 13, wherein determining the differentiability score is based on a grammatical scheme or a functional scheme.

15. The method of claim 14, wherein:

the grammatical scheme comprises classification as a noun, an adjective, a modifier, an adverb, or a verb; and

the functional scheme comprises classification based on a linguistic scope comprising at least one of a type of writing or a style of writing.

16. The method of claim 15, wherein the determining the differentiability score comprises providing a plurality of differentiability-scored TUs configured for use in a topical analysis of the plurality of documents.

17. The method of claim 15, wherein the determining the differentiability score comprises providing a plurality of differentiability-scored TUs configured for use in a search of the plurality of documents.

18. A computing device comprising:

one or more processors; and

a non-transitory, computer-readable medium storing a program that is executable by the one or more processors, the program comprising instructions to:

receive a document;

generate a plurality of term units (TUs) from the document, wherein each term unit (TU) of the plurality of TUs comprises a word, a multi-word, a number, or a symbol, the plurality of TUs consisting of a plurality of differentiable TUs and a plurality of non-differentiable TUs, the plurality of TUs consisting of a first number of TUs;

identify the plurality of non-differentiable TUs, wherein each of the plurality of non-differentiable TUs have a non-fixed linguistic meaning;

identify a plurality of differentiable TUs of the plurality of TUs, wherein each of the plurality of differentiable TUs have a fixed linguistic meaning, and the plurality of differentiable TUs consists of a second number of TUs;

determine a differentiability score for each TU of the plurality of TUs, wherein the differentiability score indicates a substantially fixed linguistic meaning or a non-fixed linguistic meaning;

determine a document score for the document, the document score based on a ratio of the second number of TUs to the first number of TUs; and

transmit the plurality of differentiable TUs in order of at least one of ascending differentiability score or descending differentiability score.

19. The computing device of claim 18, wherein the program further comprises instructions to prioritize a set of topics based on differentiability scores of TUs comprising the set of topics, wherein determination of the differentiability score is based on a grammatical scheme or a functional scheme, wherein:

the functional scheme comprises classification based on type of writing.

20. The computing device of claim 19, wherein:

the computing device comprises a computer, a laptop computer, a personal computer, a server computer, a personal data assistant, a camera, a phone, a cell phone, a mobile phone, a smart phone, a tablet, a media server, a music player, a game box, a data storage device, a measuring device, a handheld scanner, a scanning device, a barcode reader, a point-of-sale (POS) device, a digital assistant, a desk phone, an Internet Protocol (IP) phone, a solid-state memory device, or a memory card; and

the plurality of scored differentiable TUs are transmitted to a requestor, and the requestor is a human user, a process of the computing device, or a second computing device.