WO2005048120A1 - Text summarization - Google Patents

Info

Publication number
WO2005048120A1
WO2005048120A1 PCT/US2004/036896 US2004036896W WO2005048120A1 WO 2005048120 A1 WO2005048120 A1 WO 2005048120A1 US 2004036896 W US2004036896 W US 2004036896W WO 2005048120 A1 WO2005048120 A1 WO 2005048120A1
Authority
WO
WIPO (PCT)
Prior art keywords
sentence
word
sentences
value
words
Prior art date
Application number
PCT/US2004/036896
Other languages
French (fr)
Inventor
Ke-Song Han
Fang Chen
Gui-Lin Chen
Original Assignee
Motorola Inc.
Priority date
Filing date
Publication date
Application filed by Motorola Inc.
Publication of WO2005048120A1
Priority to US11/416,978 (US20060206806A1)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34 Browsing; Visualisation therefor
    • G06F16/345 Summarisation for human users

Definitions

  • This invention concerns automatic text summarization of documents.
  • the invention is particularly useful for, but not necessarily limited to, summarizing text received by a radio communications port or memory module associated with an electronic device.
  • a method for summarizing text comprising the steps of: evaluating selected words of the text according to predetermined criteria to provide word score values for each of the selected words; calculating for each of the selected words a word weighted score that is dependent on the word score values and a number of occurrences of each of the selected words; scoring sentences of the text to determine a sentence weighted score for the sentences, the sentence weighted score depending on sentence type and a combined word weighted score for words therein; and selecting at least one of the sentences to provide a summary of the text, the selecting being dependent on the sentence weighted score of at least some of the sentences.
  • the sentence type is dependent on predetermined indicator words and phrases.
  • the sentence type may be dependent on the case of a word or the sentence type can be from a group comprising: a title sentence, a supplementary title sentence, sub-title without any symbol, first sentence in a paragraph, second sentence in a paragraph, middle sentences in a paragraph, and last sentence in a paragraph.
  • the predetermined criteria may include word length or a type of sentence the word appears in, or a word part-of-speech, or a word inherent value, or a words syntax function value in the sentence.
  • the word weighted score W is determined by the formula: $W = W_L \times W_{pos} \times W_{type} \times W_{value} \times W_m$
  • W is a word's weighted score for a single occurrence in the text
  • $W_L$ is a word length value
  • $W_{pos}$ is a word part-of-speech value
  • $W_{type}$ is a word sentence type value for the sentence in which the word appears
  • $W_{value}$ is a word inherent value
  • $W_m$ is a word syntax function value in the sentence in which the word appears.
  • the following non-linear formula can be used to determine the word weighted score of a word that has more than one occurrence: $W(n+1) = W(n) + \frac{1}{n+1} \times W_{n+1}$, where $W(1) = W_1$
  • $W(n+1)$ is the word's total weight when it has n+1 occurrences
  • $W(n)$ is the word's accumulated weight when it has a total of n occurrences
  • $W_{n+1}$ is the weight of the individual word at its (n+1)th occurrence.
  • the following formula is used to provide the sentence weighted score: $WS = \sum_i W(w_i) \times S(type) / S(len)$
  • the step of selecting sentences for the summary involves selecting only sentences of a sentence length between a minimum sentence length threshold value and a maximum sentence length threshold value, the sentence length being determined by a number of words therein.
  • selecting at least one of the sentences can be based on selecting a proportion of sentences ordered according to their sentence weighted score. In one alternative, the selecting at least one of the sentences can be based on selecting sentences having their sentence weighted scores above a threshold value.
  • the invention is a text summarizing system to perform the method described above, the system comprising: memory to receive a document and store a program; and a processor to perform the method on the document in memory using the program.
  • the invention is an engine embedded into a browser to perform the method described above, the system comprising: memory to receive a document and store a program; and a processor to perform the method on the document in memory using the program.
  • the invention is an electronic communication device to perform the method described above, the system comprising: memory to receive a document and store a program; and a processor to perform the method on the document in memory using the program.
  • the electronic communication device may include a mobile phone or personal digital assistant.
  • Fig. 1 is a block diagram of an electronic device
  • Fig. 2 is a flow diagram illustrating a method for summarizing text that may be performed on the device of Fig. 1.
  • an electronic device in the form of a radio telephone 1 comprises a radio frequency communications unit 2 coupled to be in communication with a processor 3.
  • An input interface in the form of a screen 5 and a keypad 6 are also coupled to be in communication with the processor 3.
  • the processor 3 includes an encoder/decoder 11 with an associated Read Only Memory (ROM) 12 storing data for encoding and decoding voice or other signals that may be transmitted or received by the radio telephone 1.
  • the processor 3 also includes a micro-processor 13 coupled, by a common data and address bus 17, to the encoder/decoder 11 and an associated character Read Only Memory (ROM) 14, a Random Access Memory (RAM) 4, static programmable memory 16 and a removable SIM module 18.
  • the static programmable memory 16 and SIM module 18 each can store, amongst other things, selected incoming text messages and a telephone book database TDb.
  • the micro-processor 13 has ports for coupling to the keypad 6, the screen 5 and an alert module 15 that typically contains a speaker, vibrator motor and associated drivers.
  • the character Read Only Memory 14 stores code for decoding or encoding text messages that may be received by the communication unit 2 or input at the keypad 6.
  • the character Read Only Memory 14 also stores operating code (OC) for micro-processor 13 and code for performing text summarization as described below with reference to Fig. 2.
  • the radio frequency communications unit 2 is a combined receiver and transmitter having a common antenna 7.
  • the communications unit 2 has a transceiver 8 coupled to antenna 7 via a radio frequency amplifier 9.
  • the transceiver 8 is also coupled to a combined modulator/demodulator 10 that couples the communications unit 2 to the processor 3.
  • in Fig. 2 there is illustrated a method 20 for summarizing text.
  • the method 20 is typically invoked, at a start step 21, by a user entering a command at the keypad 6.
  • the method 20 then includes a step of providing text 22 that may be provided by a user inserting a memory module containing text into the SIM module 18 or by the device 1 receiving a text message via the radio frequency unit 2 that is subsequently stored in the static memory 16. It should be noted that the text can be received by other means including downloading from the internet (via a port not shown). After the text is provided, typically in the form of an electronic document, appropriate resources may be flagged for use, these resources being stored in ROM 14. For instance, for Chinese text a Chinese word lexicon and a Chinese part-of-speech (POS) dictionary may be flagged for use.
  • the method 20 then performs a step of identifying text structure 23 that is essentially a pre-processing stage where the text is prepared for automatic summarization.
  • in step 23, the unnecessary spaces and blank lines are identified and deleted. This step 23 also generally involves determining an average length of a text line and the number of sentences.
  • the text is also structurally analysed to identify its various parts, such as: title; subtitle; author; abstract; paragraph numbering; relative sentence numbering in a paragraph and in the complete text; and references.
  • the method 20 next performs a step of evaluating 24 selected words of the text according to predetermined criteria to provide word score values for each of the selected words.
  • the words in the text are scored depending upon how likely they are to be useful in the summary.
  • Chinese words are subjected to segmentation that involves a coarse segmentation by word matching. Any ambiguity is processed using the well known Chinese character grouping of "right priority" and "high-frequency priority" (selecting frequently used character groups). Then person and place names are processed, since in Chinese text there can be a single surname and a double surname.
  • English words are stemmed, which involves removing variable word endings such as "ing" and "ed".
  • a score value is allocated to each selected word in the text, depending on the following criteria:
  • a word length value $W_L$ (where an integer value of 1 is given per character forming the word when the word is represented by alphanumeric characters, the word length value being the square root (SQR) of the integer value; and when the text is in Chinese characters a default word length value of 1 is allocated); hence the word "dog" has a word length value of SQR(3), the word "begin" has a word length value of SQR(5) and the word "iterative" has a word length value of 3.
  • an overriding rank (value of 14) for the word is selected when it is identified as a 'subject indicative' word or an 'exemplitive' word.
  • examples of subject indicative words are "This text", "In a word", "All in all", "Mainly introduce", "Mainly research", "Mainly analyze", "highly commend", "particularly point out", "Unanimously think", "intensively accuse" and "Unanimously overpass".
  • examples of exemplitive words are "for example", "for instance", "instance", "give an example" and "example".
  • a word inherent value $W_{value}$ (values of 0, 1 or 2). Different words have different inherent importance depending on historical, geographical or other factors. For example, there are two Chinese words for a hard disk. One is mainly used in mainland China, while the other is mainly used in Hong Kong and Taiwan, so these two words have different values for a geographical reason.
  • a word syntax function value $W_m$ in the sentence. For instance, subjective, objective or predicative words receive a value of 2; complementary words receive a value of 1.
  • a step of calculating 25 is effected for calculating for each of the selected words a word weighted score that is dependent on the word score values and a frequency of occurrence of each of the selected words.
  • the actual word weighted scores $W_i$ for the selected words are determined by a non-linear formula as follows:
  • when the word has more than one occurrence, the word weighted scores are calculated as follows:
  • $W(n+1) = W(n) + \frac{1}{n+1} \times W_{n+1}$
  • $W(n+1)$ is a word's total weighted score when it has n+1 occurrences
  • $W(n)$ is a word's accumulated weighted score when it has a total of n occurrences
  • $W_{n+1}$ is the individual word weighted score at the (n+1)th occurrence
  • $W(1)$ is taken as $W_1$.
  • a scoring sentences step 26 provides for scoring sentences of the text to determine a sentence weighted score for the sentences, the sentence weighted score depending at least on sentence type value S(type) and a combined word weighted score of words in the sentence.
  • Default sentence type values S(type) range from 14 to 1 as illustrated in Table 1 below.
  • the sentence type value is dependent on the case of a word. For upper case sentences the Default Sentence Type Value DSTV is multiplied by a Case Factor CF of unity, whereas for lower case sentences the DSTV is multiplied by a Case Factor CF of 0.9. Also, sentences containing any of a list of predetermined indicator words and phrases affect the DSTV. For example, "In conclusion", "this letter", "results", "summary", "argue", "propose", "develop" and "attempt" are identified as indicator words since these are most likely to be useful in the summary. Hence, sentences with such indicator words have their DSTV multiplied by an Indicator Word Factor IWF of 1.2, whereas sentences without such indicator words have an IWF of unity.
  • a sentence is weighed in a non-linear fashion depending on the weight of the words in it, the sentence type value S(type) or rank and its length.
  • the following formula is used to weigh a sentence: $WS = \sum_i W(w_i) \times S(type) / S(len)$
  • WS is the sentence weighted score of a sentence
  • $\sum_i W(w_i)$ is the sum of all the word weighted scores in this sentence
  • S(len) is another weighting factor related to sentence length.
  • the sum of the word weighted scores takes account of each word's individual weight, and so takes account of whether the sentence contains subject indicative or exemplitive words.
  • if a sentence contains a subject indicative word, it has a larger probability of being a summary sentence than sentences that do not contain any subject indicative words.
  • sentences that contain exemplitive words usually have a smaller probability of being a summary sentence than those that do not.
  • a selecting step 27 provides for selecting sentences (candidate summary sentences) of the text to provide a summary of the text, the selecting being dependent on the sentence weighted score of at least some of the sentences.
  • the sentences are typically sorted by their weight in descending order. Sentences that are too short or too long tend not to be included in summaries.
  • a Minimum Sentence Length threshold MST value of, say, 5 words is set for the shortest allowable sentence length and a Maximum Sentence Length threshold LST value of 50 words is set for the longest. Sentences outside this range are excluded from selection.
  • $L(S_i)$ relates to the length of $S_i$
  • $W(S_i)$ relates to the weight of $S_i$.
  • An overall sentence weighted score can be calculated to order the sentences in order of selection.
  • a default length L of summary is set to 30% of the original text document and the top 30% of the sentences are selected and concatenated to create a summary.
  • the selecting provides for selecting a proportion of sentences ordered according to their sentence weighted score.
  • the selecting provides for selecting sentences having their sentence weighted scores above a threshold value.
  • the summary is smoothed by standard known techniques and is then displayed at the screen 5 at a displaying step 28, and at a test step 29 a user can decide if the summary is satisfactory by selecting relevant keys of keypad 6.
  • the user may, at an adjusting parameters step 30, adjust the thresholds MST, LST, adjust the default length L of the summary and also change bias weightings of certain words. Also, different readers may have different interests in an article.
  • the method 20 therefore automatically maintains a bias word list, and the user can add to or delete from the list prior to invoking the method 20 or at step 30.
  • steps 27 and 28 are performed and the parameters may be adjusted again if at the test step 29 the summary is deemed unsatisfactory, otherwise the summary is selected as satisfactory (or a user terminates the method 20) at test step 29 and the summary can be stored in memory 16 before the method 20 terminates at an end step 31.
  • the present invention provides a useful method for efficiently summarizing text. It will be appreciated by persons skilled in the art that numerous variations and/or modifications may be made to the invention as shown in the specific embodiments without departing from the spirit or scope of the invention as broadly described. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive.
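
The selecting step 27 described above (length thresholds, descending sort by sentence weighted score, top 30% of sentences) can be sketched in Python as follows. This is only an illustration under stated assumptions, not the patented implementation: the function name `select_summary`, counting words by whitespace splitting, treating both thresholds as inclusive, taking the proportion over all sentences of the text, and emitting the chosen sentences in document order are all hypothetical choices that the text does not specify.

```python
def select_summary(sentences, scores, min_len=5, max_len=50, proportion=0.3):
    """Pick the highest-scoring sentences whose word count is within range.

    sentences: sentences in document order.
    scores: sentence weighted scores, parallel to `sentences`.
    min_len / max_len: the MST and LST thresholds, in words (assumed inclusive).
    proportion: fraction of all sentences to keep (default length L = 30%).
    """
    # Exclude sentences that are too short or too long.
    candidates = [
        (score, index)
        for index, (sentence, score) in enumerate(zip(sentences, scores))
        if min_len <= len(sentence.split()) <= max_len
    ]
    # Sort by weighted score, highest first.
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    # Keep the requested proportion, but at least one sentence.
    keep = max(1, round(len(sentences) * proportion))
    chosen = sorted(index for _, index in candidates[:keep])
    # Concatenate in document order (an assumption; the text only says
    # the selected sentences are concatenated).
    return " ".join(sentences[i] for i in chosen)
```

Selecting by a score threshold instead of a proportion, which the text gives as an alternative, would replace the `keep` slice with a filter on `score`.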

Abstract

A method for summarizing text (20), comprising evaluating (24) selected words of the text according to predetermined criteria to provide word score values for each of the selected words. The method then provides for calculating (25) for each of the selected words a word weighted score that is dependent on the word score values and a number of occurrences of each of the selected words. Thereafter a step (26) of scoring sentences of the text to determine a sentence weighted score for the sentences is conducted. The sentence weighted score depends on sentence type and a combined word weighted score for words in the sentence. The method then provides for selecting (27) sentences to provide a summary of the text, the selecting being dependent on the sentence weighted score of the sentences.

Description

TEXT SUMMARIZATION
FIELD OF THE INVENTION
This invention concerns automatic text summarization of documents. The invention is particularly useful for, but not necessarily limited to, summarizing text received by a radio communications port or memory module associated with an electronic device.
BACKGROUND OF THE INVENTION
Each day individuals are exposed to text in documents such as newspapers, technical papers, e-mails, technical reports and general news. The volume of literature published annually in a specific field is generally far too large for an individual to read and assimilate. Ideally, a title and abstract should convey to the reader the main themes of the document and consequently whether the complete document is of any relevance. However, these document sections, although rich in content, can be misleading and inaccurate. Hence, there is a need to provide automatic document summary generation tools. Having a summary of a document allows the reader to determine whether that document is of interest and hence whether reading more of the document might be desirable. Conversely, reading the summary of a document could suffice to inform the reader about the document, or could instead indicate to the reader that the particular document is not of interest.
SUMMARY OF THE INVENTION
According to one aspect of the invention, there is provided a method for summarizing text, comprising the steps of: evaluating selected words of the text according to predetermined criteria to provide word score values for each of the selected words; calculating for each of the selected words a word weighted score that is dependent on the word score values and a number of occurrences of each of the selected words; scoring sentences of the text to determine a sentence weighted score for the sentences, the sentence weighted score depending on sentence type and a combined word weighted score for words therein; and selecting at least one of the sentences to provide a summary of the text, the selecting being dependent on the sentence weighted score of at least some of the sentences. Suitably, the sentence type is dependent on predetermined indicator words and phrases. The sentence type may be dependent on the case of a word or the sentence type can be from a group comprising: a title sentence, a supplementary title sentence, sub-title without any symbol, first sentence in a paragraph, second sentence in a paragraph, middle sentences in a paragraph, and last sentence in a paragraph.
Preferably, the predetermined criteria may include word length, or a type of sentence the word appears in, or a word part-of-speech, or a word inherent value, or a word's syntax function value in the sentence. Suitably, the word weighted score W is determined by the formula:
$$W = W_L \times W_{pos} \times W_{type} \times W_{value} \times W_m$$
given that W is a word's weighted score for a single occurrence in the text, $W_L$ is a word length value, $W_{pos}$ is a word part-of-speech value, $W_{type}$ is a word sentence type value for the sentence in which the word appears, $W_{value}$ is a word inherent value and $W_m$ is a word syntax function value in the sentence in which the word appears. Preferably, the following non-linear formula can be used to determine the word weighted score of a word that has more than one occurrence:
$$W(n+1) = W(n) + \frac{1}{n+1} \times W_{n+1}, \quad \text{where } W(1) = W_1$$
given that $W(n+1)$ is the word's total weight when it has n+1 occurrences, $W(n)$ is the word's accumulated weight when it has a total of n occurrences, and $W_{n+1}$ is the weight of the individual word at its (n+1)th occurrence. Suitably, the following formula is used to provide the sentence weighted score:
$$WS = \sum_i W(w_i) \times S(type) / S(len)$$
where WS is the sentence weighted score of a sentence, $\sum_i W(w_i)$ is the sum of all the word weighted scores in this sentence, and S(len) is another weighting factor related to sentence length. Preferably, the step of selecting sentences for the summary involves selecting only sentences of a sentence length between a minimum sentence length threshold value and a maximum sentence length threshold value, the sentence length being determined by a number of words therein. Suitably, selecting at least one of the sentences can be based on selecting a proportion of sentences ordered according to their sentence weighted score. In one alternative, the selecting at least one of the sentences can be based on selecting sentences having their sentence weighted scores above a threshold value.
In a second aspect the invention is a text summarizing system to perform the method described above, the system comprising: memory to receive a document and store a program; and a processor to perform the method on the document in memory using the program.
In a third aspect the invention is an engine embedded into a browser to perform the method described above, the system comprising: memory to receive a document and store a program; and a processor to perform the method on the document in memory using the program.
In a fourth aspect the invention is an electronic communication device to perform the method described above, the system comprising: memory to receive a document and store a program; and a processor to perform the method on the document in memory using the program. The electronic communication device may include a mobile phone or personal digital assistant.
BRIEF DESCRIPTION OF THE DRAWINGS
Examples of the invention will now be described with reference to the accompanying drawings, in which: Fig. 1 is a block diagram of an electronic device; and Fig. 2 is a flow diagram illustrating a method for summarizing text that may be performed on the device of Fig. 1.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT OF THE INVENTION
In the drawings, like numerals are used to indicate like elements throughout. With reference to Fig. 1, an electronic device in the form of a radio telephone 1 comprises a radio frequency communications unit 2 coupled to be in communication with a processor 3. An input interface in the form of a screen 5 and a keypad 6 are also coupled to be in communication with the processor 3. The processor 3 includes an encoder/decoder 11 with an associated Read Only Memory (ROM) 12 storing data for encoding and decoding voice or other signals that may be transmitted or received by the radio telephone 1. The processor 3 also includes a micro-processor 13 coupled, by a common data and address bus 17, to the encoder/decoder 11 and an associated character Read Only Memory (ROM) 14, a Random Access Memory (RAM) 4, static programmable memory 16 and a removable SIM module 18. The static programmable memory 16 and SIM module 18 each can store, amongst other things, selected incoming text messages and a telephone book database TDb. The micro-processor 13 has ports for coupling to the keypad 6, the screen 5 and an alert module 15 that typically contains a speaker, vibrator motor and associated drivers. The character Read Only Memory 14 stores code for decoding or encoding text messages that may be received by the communication unit 2 or input at the keypad 6. In this embodiment the character Read Only Memory 14 also stores operating code (OC) for micro-processor 13 and code for performing text summarization as described below with reference to Fig. 2. The radio frequency communications unit 2 is a combined receiver and transmitter having a common antenna 7. The communications unit 2 has a transceiver 8 coupled to antenna 7 via a radio frequency amplifier 9. The transceiver 8 is also coupled to a combined modulator/demodulator 10 that couples the communications unit 2 to the processor 3.
Referring now to Fig. 2, there is illustrated a method 20 for summarizing text. The method 20 is typically invoked, at a start step 21, by a user entering a command at the keypad 6. The method 20 then includes a step of providing text 22 that may be provided by a user inserting a memory module containing text into the SIM module 18 or by the device 1 receiving a text message via the radio frequency unit 2 that is subsequently stored in the static memory 16.
It should be noted that the text can be received by other means including downloading from the internet (via a port not shown). After the text is provided, typically in the form of an electronic document, appropriate resources may be flagged for use, these resources being stored in ROM 14. For instance, for Chinese text a Chinese word lexicon and a Chinese part-of-speech (POS) dictionary may be flagged for use.
The method 20 then performs a step of identifying text structure 23 that is essentially a pre-processing stage where the text is prepared for automatic summarization. All the processing for summarisation is performed by the micro-processor 13 using code stored in the character Read Only Memory 14. The text will generally be written in an author's particular style and with the author's preferred layout. For example, one writer may like to insert a blank line between two paragraphs, while another may add four blank spaces at the beginning of each paragraph. Also, there are special problems associated with Chinese text since it is based on the double-byte-character set (DBCS). Most characters in a Chinese document are stored using two bytes, but there will usually be many single byte symbols, such as English letters, numbers and punctuation marks. Punctuation, for instance a stop '.', creates additional problems. The stop could be a full stop of the single-byte-character set (SBC) which can identify the end of a sentence, so it should be transformed into "D". But if it is a decimal symbol in a number string, or if it is a part of suspension points, it doesn't need further processing. In step 23, the unnecessary spaces and blank lines are identified and deleted. This step 23 also generally involves determining an average length of a text line and the number of sentences. The text is also structurally analysed to identify its various parts, such as: title; subtitle; author; abstract; paragraph numbering; relative sentence numbering in a paragraph and in the complete text; and references.
The method 20 next performs a step of evaluating 24 selected words of the text according to predetermined criteria to provide word score values for each of the selected words. In this step 24 the words in the text are scored depending upon how likely they are to be useful in the summary. Also, Chinese words are subjected to segmentation that involves a coarse segmentation by word matching. Any ambiguity is processed using the well known Chinese character grouping of "right priority" and "high-frequency priority" (selecting frequently used character groups). Then person and place names are processed, since in Chinese text there can be a single surname and a double surname. Also, English words are stemmed, which involves removing variable word endings such as "ing" and "ed". After segmentation or stemming a score value is allocated to each selected word in the text, depending on the following criteria:
1. A word length value $W_L$ (where an integer value of 1 is given per character forming the word when the word is represented by alphanumeric characters, the word length value being the square root (SQR) of the integer value; and when the text is in Chinese characters a default word length value of 1 is allocated); hence the word "dog" has a word length value of SQR(3), the word "begin" has a word length value of SQR(5) and the word "iterative" has a word length value of 3.
2. A word part-of-speech value $W_{pos}$ (noun = 1.2; verb = 1.3; adjective = 1.1; pronoun = 1.1; others = 0.5).
3. A word sentence type value $W_{type}$, or rank of the type of sentence the word appears in or, if appropriate, an overriding rank for the word. A word is classified depending on the rank of the sentence it is in. There are 14 types for $W_{type}$; they are:
   word in the title = 14
   word in vice title = 13
   word in text's abstract = 12
   word in subtitle with no symbol = 11
   word in first level subtitle = 10
   word in second level subtitle = 9
   word in third level subtitle = 8
   word in fourth level subtitle = 7
   word in the first sentence of a paragraph = 6
   word in the second sentence of a paragraph = 5
   word in a last sentence of a paragraph = 4
   word in middle sentences of a paragraph = 3
   word in independent sentence = 2
   word in reference article = 1
   Alternatively, an overriding rank (value of 14) for the word is selected when it is identified as a 'subject indicative' word or an 'exemplitive' word. For instance, subject indicative words are "This text", "In a word", "All in all", "Mainly introduce", "Mainly research", "Mainly analyze", "highly commend", "particularly point out", "Unanimously think", "intensively accuse" and "Unanimously overpass". Examples of exemplitive words are "for example", "for instance", "instance", "give an example" and "example".
4. A word inherent value $W_{value}$ (values of 0, 1 or 2). Different words have different inherent importance depending on historical, geographical or other factors. For example, there are two Chinese words for a hard disk. One is mainly used in mainland China, while the other is mainly used in Hong Kong and Taiwan, so these two words have different values for a geographical reason. Also there may be two words with the same meaning, but one is rarely used, so these two words have different values for a historical reason. The word's inherent value is determined by experience and stored in the dictionary, from where it can be retrieved.
5. A word syntax function value $W_m$ in the sentence. For instance, subjective, objective or predicative words receive a value of 2; complementary words receive a value of 1.
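
The five criteria above combine into a single-occurrence word score by multiplication. The following Python sketch illustrates that combination only; the function names, the dictionary `POS_VALUES`, the `is_chinese` flag and the way the sentence rank, inherent value and syntax function value are passed in as plain numbers are assumptions made for illustration, since the text obtains them from lexicons and structural analysis that are not reproduced here.

```python
import math

# Part-of-speech values quoted in the description; any other tag scores 0.5.
POS_VALUES = {"noun": 1.2, "verb": 1.3, "adjective": 1.1, "pronoun": 1.1}


def word_length_value(word, is_chinese=False):
    """Square root of the character count, or the default of 1 for Chinese."""
    return 1.0 if is_chinese else math.sqrt(len(word))


def word_score(word, pos, sentence_rank, inherent_value, syntax_value,
               is_chinese=False):
    """Single-occurrence score: W = W_L * W_pos * W_type * W_value * W_m.

    sentence_rank (1..14), inherent_value (0, 1 or 2) and syntax_value
    (1 or 2) are supplied by the caller in this sketch.
    """
    w_len = word_length_value(word, is_chinese)
    w_pos = POS_VALUES.get(pos, 0.5)
    return w_len * w_pos * sentence_rank * inherent_value * syntax_value
```

For example, `word_length_value("iterative")` gives 3.0, matching the value quoted in criterion 1. Note that, because the factors are multiplied, a word with an inherent value of 0 scores 0 regardless of the other criteria.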
After the step of evaluating 24, a step of calculating 25 is effected for calculating for each of the selected words a word weighted score that is dependent on the word score values and a frequency of occurrence of each of the selected words. The actual word weighted scores $W_i$ for the selected words are determined by a non-linear formula as follows:
$$W = W_L \times W_{pos} \times W_{type} \times W_{value} \times W_m$$
When the word has more than one occurrence, the word weighted scores are calculated as follows:
$$W(n+1) = W(n) + \frac{1}{n+1} \times W_{n+1}$$
to accumulate the weight, where $W(n+1)$ is a word's total weighted score when it has n+1 occurrences, $W(n)$ is a word's accumulated weighted score when it has a total of n occurrences, $W_{n+1}$ is the individual word weighted score at the (n+1)th occurrence, and $W(1)$ is taken as $W_1$. In a linear weighting system the weighting is multiplied by the frequency of occurrence. For example, if a word "Clone" appears 5 times and has an inherent value of 3, then it will be given a value of 5*3 = 15. In contrast, this non-linear approach to frequency weighting, when $W_1 = 3$, $W_2 = 3$, $W_3 = 3$, $W_4 = 5.5$ and $W_5 = 7.375$, results in the accumulated word weighted score W as:
W(1) = 3
W(2) = 3 + 1/2*3 = 4.5
W(3) = 4.5 + 1/3*3 = 5.5
W(4) = 5.5 + 1/4*5.5 = 6.875
W(5) = 6.875 + 1/5*6.875 = 8.25
After the step of calculating 25, a scoring sentences step 26 provides for scoring sentences of the text to determine a sentence weighted score for the sentences, the sentence weighted score depending at least on sentence type value S(type) and a combined word weighted score of words in the sentence. Default sentence type values S(type) range from 14 to 1 as illustrated in Table 1 below.
| Macro Name | Default Sentence Type Value (DSTV) | Sentence Rank |
|---|---|---|
| MAIN_TITLE | 14 | A title sentence |
| VICE_TITLE | 13 | A supplementary title sentence |
| SYMBOL_LESS_TITLE | 12 | Sub-title without any symbol |
| FIRST_LEVEL_TITLE | 11 | First level sub-title |
| SECOND_LEVEL_TITLE | 10 | Second level sub-title |
| THIRD_LEVEL_TITLE | 9 | Third level sub-title |
| FOURTH_LEVEL_TITLE | 8 | Fourth level sub-title |
| ABSTRACT_SENTENCE | 7 | Sentence in author's abstract |
| PARAGRAPH_FIRST_SENTENCE | 6 | First sentence in a paragraph |
| PARAGRAPH_SECOND_SENTENCE | 5 | Second sentence in a paragraph |
| PARAGRAPH_MIDDLE_SENTENCE | 4 | Middle sentences in a paragraph |
| PARAGRAPH_TAIL_SENTENCE | 3 | Last sentence in a paragraph |
| INDEPENDENT_SENTENCE | 2 | Independent sentence |
| REFERENCE_SENTENCE | 1 | Sentence in reference |

Table 1. Default Sentence Type Value
The sentence type value is also dependent on the case of a word. For upper case sentences the Default Sentence Type Value DSTV is multiplied by a Case Factor CF of unity, whereas for lower case sentences the Default Sentence Type Value DSTV is multiplied by a Case Factor CF of 0.9. In addition, sentences containing any of a list of predetermined indicator words and phrases affect the Default Sentence Type Value DSTV. For example, "In conclusion", "this letter", "results", "summary", "argue", "propose", "develop" and "attempt" are identified as indicator words, since sentences containing them are most likely to be useful in the summary. Hence, for sentences with such indicator words the Default Sentence Type Value DSTV is multiplied by an Indicator Word Factor IWF of 1.2, whereas sentences without such indicator words have an Indicator Word Factor IWF of unity.
Thus the sentence type value S(type) = DSTV * CF * IWF
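The relation S(type) = DSTV * CF * IWF can be illustrated with a short Python sketch. The dictionary keys follow the macro names of Table 1. The function name, the test for an "upper case sentence" (every cased letter is upper case) and the case-insensitive substring matching of indicator phrases are all assumptions made for illustration; the description does not specify them.

```python
# Default Sentence Type Values, keyed by the macro names of Table 1.
DEFAULT_SENTENCE_TYPE_VALUE = {
    "MAIN_TITLE": 14,
    "VICE_TITLE": 13,
    "SYMBOL_LESS_TITLE": 12,
    "FIRST_LEVEL_TITLE": 11,
    "SECOND_LEVEL_TITLE": 10,
    "THIRD_LEVEL_TITLE": 9,
    "FOURTH_LEVEL_TITLE": 8,
    "ABSTRACT_SENTENCE": 7,
    "PARAGRAPH_FIRST_SENTENCE": 6,
    "PARAGRAPH_SECOND_SENTENCE": 5,
    "PARAGRAPH_MIDDLE_SENTENCE": 4,
    "PARAGRAPH_TAIL_SENTENCE": 3,
    "INDEPENDENT_SENTENCE": 2,
    "REFERENCE_SENTENCE": 1,
}

# Indicator words and phrases listed in the description.
INDICATOR_PHRASES = (
    "in conclusion", "this letter", "results", "summary",
    "argue", "propose", "develop", "attempt",
)


def sentence_type_value(sentence, macro_name):
    """Return S(type) = DSTV * CF * IWF for one sentence."""
    dstv = DEFAULT_SENTENCE_TYPE_VALUE[macro_name]
    # Assumption: an upper case sentence has all cased letters in upper case.
    case_factor = 1.0 if sentence.isupper() else 0.9
    # Assumption: indicator phrases are matched as case-insensitive substrings.
    lowered = sentence.lower()
    has_indicator = any(phrase in lowered for phrase in INDICATOR_PHRASES)
    indicator_word_factor = 1.2 if has_indicator else 1.0
    return dstv * case_factor * indicator_word_factor
```

Substring matching is the simplest choice but also matches inflected forms such as "attempted"; a token-based match would be stricter.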
In this step 26 a sentence is weighted in a non-linear fashion depending on the weight of the words in it, the sentence type value S(type) or rank, and its length. The following formula is used to weight a sentence:

$$WS = \sum_i W(w_i) \times S(type) / S(len)$$

where $WS$ is the sentence weighted score of a sentence, $\sum_i W(w_i)$ is the sum of all the word weighted scores in this sentence, and $S(len)$ is another weighting factor related to sentence length. The sum of the word weighted scores takes account of each word's individual weight, and so takes account of whether the sentence contains subject indicative or exemplitive words. Experience shows that a sentence containing a subject indicative word has a larger probability of being a summary sentence than sentences that do not contain any subject indicative words. Analogously, sentences that contain subject exemplitive words usually have a smaller probability than those that do not contain any subject exemplitive words.

Statistical analysis of sentence length distributions in source text and in human prepared summaries was conducted on a corpus of documents. The longest sentence had 180 words. These two distributions were found to be very alike. A Minimum Mean-Square Error method was therefore used to process the relationship between sentence length and importance, and a cubic equation was derived to describe this relationship quantitatively:

$$S(len) = y, \quad \text{where } y = ax^3 + bx^2 + cx + d$$

where $x$ is the length in words of a sentence. Also, using the longest sentence of 180 words, a 180 by 180 matrix $X$ can be derived of elements $(x_i, y_i)$. We therefore get $Y = X \cdot \theta$; in other words, the following is obtained:
[Equation image imgf000012_0001: matrix form of $Y = X \cdot \theta$]
Since it can be deduced that $\theta = [X^T X]^{-1} X^T Y$, the values of the four parameters a, b, c and d can be determined. These values are: a = 0.0002; b = 0.2127; c = 4.9961; and d = 6.8755.

After the scoring sentences step 26, a selecting step 27 selects sentences (candidate summary sentences) of the text to provide a summary of the text, the selecting being dependent on the sentence weighted score of at least some of the sentences. In this regard, before selecting candidate summary sentences, the sentences are typically sorted by their weight in descending order. Sentences that are too short or too long tend not to be included in summaries. A Minimum Sentence Length Threshold MST value of, say, 5 words is set for the shortest allowable sentence length, and a Maximum Sentence Length Threshold LST value of 50 words is set for the longest allowable sentence length. Sentences outside this range are excluded from selection. In other words, the selecting step 27 selects only sentences of a sentence length between the Minimum Sentence Length Threshold MST value and the Maximum Sentence Length Threshold LST value, the sentence length being determined by the number of words therein. Given a certain length L of the resulting summary, sentences $S_i$ are selected from a set of sentences S to satisfy two conditions simultaneously:
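[Equation image imgf000013_0001: first condition, relating the sentence lengths $L(S_i)$ to the summary length L]

$$\sum_i W(S_i) = \max$$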
where $L(S_i)$ relates to the length of $S_i$, and $W(S_i)$ relates to the weight of $S_i$. An overall sentence weighted score can be calculated to order the sentences in order of selection. A default length L of the summary is set to 30% of the original text document, and the top 30% of the sentences are selected and concatenated to create a summary. In other words, the selecting provides for selecting a proportion of sentences ordered according to their sentence weighted score. In one alternative, the selecting provides for selecting sentences having sentence weighted scores above a threshold value.

The summary is smoothed by standard known techniques and is then displayed on the screen 5 at a displaying step 28. At a test step 29 a user can decide whether the summary is satisfactory by selecting relevant keys of the keypad 6. If the summary is unsatisfactory the user may, at an adjusting parameters step 30, adjust the thresholds MST and LST, adjust the default length L of the summary, and change the bias weightings of certain words. Also, different readers may have different interests in an article. The method 20 therefore automatically maintains a bias word list, and the user can add to or delete from the list prior to invoking the method 20 or at step 30. After step 30, steps 27 and 28 are performed again, and the parameters may be adjusted again if the summary is deemed unsatisfactory at the test step 29. Otherwise the summary is accepted as satisfactory (or the user terminates the method 20) at the test step 29, and the summary can be stored in memory 16 before the method 20 terminates at an end step 31.

Advantageously, the present invention provides a useful method for efficiently summarizing text. It will be appreciated by persons skilled in the art that numerous variations and/or modifications may be made to the invention as shown in the specific embodiments without departing from the spirit or scope of the invention as broadly described.
The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive.
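By way of illustration only, the scoring sentences step 26 and the selecting step 27 described above can be sketched in Python. The function names are hypothetical, the cubic coefficients are those printed in the description, and several details are assumptions not stated in the text: words are counted by splitting on whitespace, the length thresholds are inclusive, the 30% proportion is taken over the total number of sentences, and the selected sentences are returned in document order.

```python
import math


def length_factor(num_words, a=0.0002, b=0.2127, c=4.9961, d=6.8755):
    """S(len) = a*x^3 + b*x^2 + c*x + d, with x the sentence length in words."""
    x = num_words
    return a * x ** 3 + b * x ** 2 + c * x + d


def sentence_score(word_scores, s_type, num_words):
    """WS = sum of word weighted scores * S(type) / S(len)."""
    return sum(word_scores) * s_type / length_factor(num_words)


def select_summary(scored_sentences, proportion=0.30, mst=5, lst=50):
    """Select the top-scoring sentences whose length lies within [mst, lst].

    scored_sentences is a list of (text, score) pairs in document order.
    """
    # Assumption: the quota is a proportion of all sentences, at least one.
    quota = max(1, math.floor(len(scored_sentences) * proportion))
    # Exclude sentences shorter than MST or longer than LST (inclusive bounds).
    eligible = [
        (index, text, score)
        for index, (text, score) in enumerate(scored_sentences)
        if mst <= len(text.split()) <= lst
    ]
    # Sort by sentence weighted score in descending order and keep the quota.
    ranked = sorted(eligible, key=lambda item: item[2], reverse=True)
    # Assumption: restore document order before concatenation.
    chosen = sorted(ranked[:quota], key=lambda item: item[0])
    return [text for _, text, _ in chosen]
```

The thresholds, the proportion and the coefficients are exposed as parameters because the description allows the user to adjust MST, LST and the summary length at step 30.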

Claims

WE CLAIM:
1. A method for summarizing text, comprising the steps of: evaluating selected words of the text according to predetermined criteria to provide word score values for each of the selected words; calculating for each of the selected words a word weighted score that is dependent on the word score values and a number of occurrences of each of the selected words; scoring sentences of the text to determine a sentence weighted score for the sentences, the sentence weighted score depending on sentence type and a combined word weighted score for words therein; and selecting at least one of the sentences to provide a summary of the text, the selecting being dependent on the sentence weighted score of at least some of the sentences.
2. A method according to claim 1, characterized in that the sentence type is dependent on predetermined indicator words and phrases.
3. A method according to claim 1, characterized in that the sentence type is dependent on the case of a word.
4. A method according to claim 1, characterized in that sentence type is from a group comprising: a title sentence, a supplementary title sentence, sub-title without any symbol, first sentence in a paragraph, second sentence in a paragraph, middle sentences in a paragraph, and last sentence in a paragraph.
5. A method according to claim 1, characterized in that the predetermined criteria includes word length.
6. A method according to claim 1, characterized in that the predetermined criteria includes a type of sentence the word appears in.
7. A method according to claim 1, characterized in that the predetermined criteria includes a word part-of-speech.
8. A method according to claim 1, characterized in that the predetermined criteria includes a word inherent value.
9. A method according to claim 1, characterized in that the predetermined criteria includes the word's syntax function value in the sentence.
10. A method according to claim 1, characterized in that the word weighted score W is determined by the formula:
$$W = W_L \times W_{POS} \times W_{ST} \times W_{value} \times W_m$$

given that $W$ is a word's weighted score for a single occurrence in the text, $W_L$ is a word length value, $W_{POS}$ is a word part-of-speech value, $W_{ST}$ is a word sentence type value for the sentence in which the word appears, $W_{value}$ is a word inherent value and $W_m$ is a word syntax function value in the sentence in which the word appears.
11. A method according to claim 10, characterized in that the following non-linear formula is used to determine the word weighted score of a word that has more than one occurrence:

$$W(n+1) = W(n) + \frac{1}{n+1} \times W_{n+1}, \quad \text{where } W(1) = W_1$$

given that $W(n+1)$ is the word's total weight when it has n+1 occurrences, $W(n)$ is the word's accumulated weight when it has a total of n occurrences, and $W_{n+1}$ is the weight of the individual word at its (n+1)th occurrence.
12. A method according to claim 11, characterized in that the following formula is used to provide the sentence weighted score:
$$WS = \sum_i W(w_i) \times S(type) / S(len)$$

where $WS$ is the sentence weighted score of a sentence, $\sum_i W(w_i)$ is the sum of all the word weighted scores in this sentence, and $S(len)$ is another weighting factor related to sentence length.
13. A method according to claim 1, characterized in that the step of selecting sentences for the summary involves selecting only sentences of a sentence length between a minimum sentence length threshold value and a maximum sentence length threshold value, the sentence length being determined by a number of words therein.
14. A method according to claim 1, characterized in that selecting at least one of the sentences is based on selecting a proportion of sentences ordered according to their sentence weighted score.
15. A method according to claim 1, characterized in that selecting at least one of the sentences is based on selecting sentences having their sentence weighted scores above a threshold value.
PCT/US2004/036896 2003-11-07 2004-11-04 Text summarization WO2005048120A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US11/416,978 US20060206806A1 (en) 2004-11-04 2006-05-03 Text summarization

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CNA200310114860XA CN1614585A (en) 2003-11-07 2003-11-07 Context Generality
CN0310114860.X 2003-11-07

Publications (1)

Publication Number Publication Date
WO2005048120A1 true WO2005048120A1 (en) 2005-05-26

Family

ID=34580578

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2004/036896 WO2005048120A1 (en) 2003-11-07 2004-11-04 Text summarization

Country Status (2)

Country Link
CN (1) CN1614585A (en)
WO (1) WO2005048120A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110162781A (en) * 2019-04-09 2019-08-23 国金涌富资产管理有限公司 A kind of finance text subjectivity sentence automatic identifying method
WO2020012483A1 (en) * 2018-07-11 2020-01-16 Ofek - Eshkolot Research And Development Ltd. Method for defining meaning and extracting novelty from text

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5104762B2 (en) * 2006-10-23 2012-12-19 日本電気株式会社 Content summarization system, method and program
US9916309B2 (en) 2011-10-14 2018-03-13 Yahoo Holdings, Inc. Method and apparatus for automatically summarizing the contents of electronic documents
CN103885935B (en) * 2014-03-12 2016-06-29 浙江大学 Books chapters and sections abstraction generating method based on books reading behavior
CN103942182B (en) * 2014-04-29 2018-04-27 百度在线网络技术(北京)有限公司 A kind of English text form optimization method and device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5384703A (en) * 1993-07-02 1995-01-24 Xerox Corporation Method and apparatus for summarizing documents according to theme
US6334132B1 (en) * 1997-04-16 2001-12-25 British Telecommunications Plc Method and apparatus for creating a customized summary of text by selection of sub-sections thereof ranked by comparison to target data items
US20020052901A1 (en) * 2000-09-07 2002-05-02 Guo Zhi Li Automatic correlation method for generating summaries for text documents
US6493663B1 (en) * 1998-12-17 2002-12-10 Fuji Xerox Co., Ltd. Document summarizing apparatus, document summarizing method and recording medium carrying a document summarizing program

Also Published As

Publication number Publication date
CN1614585A (en) 2005-05-11

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NA NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GM KE LS MW MZ NA SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LU MC NL PL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
WWE Wipo information: entry into national phase

Ref document number: 11416978

Country of ref document: US

122 Ep: pct application non-entry in european phase