CN102385574A - Method and device for extracting sentences from document - Google Patents

Method and device for extracting sentences from document

Info

Publication number
CN102385574A
Authority
CN
China
Prior art keywords
sentence
cue
structure pattern
document
predetermined special
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2010102686756A
Other languages
Chinese (zh)
Other versions
CN102385574B (en)
Inventor
游赣梅
孙军
谢宣松
赵利军
郑继川
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ricoh Co Ltd
Original Assignee
Ricoh Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ricoh Co Ltd filed Critical Ricoh Co Ltd
Priority to CN201010268675.6A priority Critical patent/CN102385574B/en
Publication of CN102385574A publication Critical patent/CN102385574A/en
Application granted granted Critical
Publication of CN102385574B publication Critical patent/CN102385574B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Machine Translation (AREA)

Abstract

The invention provides a method and a device for extracting sentences with a predetermined special meaning from a document. The method comprises the following steps: obtaining a sentence structure pattern of sentences with the predetermined special meaning; obtaining cue words, wherein a sentence containing a cue word is more likely to be a sentence with the predetermined special meaning than a sentence without it; combining the sentence structure pattern and the cue words to obtain sentence structure pattern-cue word combinations that conform to sentence grammatical structure; determining the score of each sentence in the document based on the sentence structure pattern-cue word combinations it contains; and extracting the sentences with the predetermined special meaning from the document based on the scores. With this method and device, the interference caused by noise sentences can be reduced, and sentences with the predetermined special meaning can be extracted more accurately and efficiently.

Description

Method and apparatus for extracting sentences from a document
Technical field
The present invention relates generally to document processing and information extraction, and more specifically to a method and apparatus for extracting sentences from a document.
Background art
Many techniques have been proposed for automatically extracting sentences from a document or for generating a document summary.
Patent document US7051024 B2, entitled "Document summarizer for word processors" (Microsoft Corp), proposes a method of automatically generating a document summary. The frequency with which each content word occurs in the document is counted, the score of a sentence is obtained by summing the frequencies of the content words it contains, and the sentences are ranked by score. In addition, certain potentially problematic phrases or words are defined in advance; that document calls them cue phrases. A sentence containing such a phrase or word should not be added to the summary, or should be added only when a certain precondition holds. While the content-word frequencies are being counted, the phrases in each sentence are compared with the predefined cue phrases; if a sentence contains a cue phrase, it is decided whether to exclude the sentence from the summary or to keep it conditionally as a candidate for inclusion.
In addition, in US Patent 5924108, "Document summarizer for word processors" (Microsoft Corp), whether a sentence is a key sentence is judged according to whether it contains a cue word or a combination of cue words.
In addition, in S. Teufel and M. Moens, "Sentence extraction as a classification task", in Proceedings of the ACL '97/EACL '97 Workshop on Intelligent Scalable Text Summarization (July 1997), cue phrases are used to filter meta-discourse. The cue phrases are manually divided into five classes, each corresponding to a different likelihood that a sentence containing the cue is a summary sentence. Each sentence is scored on several features (cue phrases, position in the article, sentence length, number of occurrences of dictionary words, and occurrence of proper names), which yields the likelihood that the sentence appears in the summary.
Summary of the invention
However, the conventional methods above have some problems. For instance, when sentences containing cue words are assumed to be the desired sentences, one usually finds that a document contains many sentences that include a cue word but are not desired (hereinafter, noise sentences). As a result, the conventional methods often fail to find the desired sentences properly.
In addition, the inventors have found that in many cases it is desirable to extract from a document sentences that have a special meaning or serve a special function. For example, for a patent document, one may wish to automatically extract the sentences that state the technical problem the invention is to solve. As another example, from a product manual one may wish to extract the sentences about the product's advantages. As yet another example, from a contract one may wish to extract the clauses that are unfavorable to one's own side, and so on.
According to one aspect of the present invention, a method of extracting sentences with a predetermined special meaning from a document is provided, which may comprise the following steps: obtaining a sentence structure pattern of sentences with the predetermined special meaning; obtaining cue words, wherein a sentence containing a cue word is more likely to be a sentence with the predetermined special meaning than a sentence that does not contain it; combining the sentence structure pattern and the cue words to obtain sentence structure pattern-cue word combinations that conform to sentence grammatical structure; determining the score of a sentence based on the sentence structure pattern-cue word combinations contained in the sentence in the document; and extracting the sentences with the predetermined special meaning from the document based on the scores of the sentences.
According to another aspect of the present invention, an apparatus for extracting sentences with a predetermined special meaning from a document is provided, which may comprise: a sentence structure pattern obtaining component for obtaining a sentence structure pattern of sentences with the predetermined special meaning; a cue word obtaining component for obtaining cue words, wherein a sentence containing a cue word is more likely to be a sentence with the predetermined special meaning than a sentence that does not contain it; a sentence structure pattern-cue word combining component for combining the sentence structure pattern and the cue words to obtain sentence structure pattern-cue word combinations that conform to sentence grammatical structure; a sentence score determining component for determining the score of a sentence based on the sentence structure pattern-cue word combinations contained in the sentence in the document; and a sentence extracting component for extracting the sentences with the predetermined special meaning from the document based on the scores of the sentences.
According to yet another aspect of the present invention, a method of extracting sentences with a predetermined special meaning from a document is provided, which may comprise the following steps: obtaining a sentence structure pattern of sentences with the predetermined special meaning; obtaining cue words, wherein a sentence containing a cue word is more likely to be a sentence with the predetermined special meaning than a sentence that does not contain it; combining the sentence structure pattern and the cue words to obtain sentence structure pattern-cue word combinations that conform to sentence grammatical structure; and extracting the sentences with the predetermined special meaning from the document based on the sentence structure pattern-cue word combinations contained in the sentences in the document.
With the method and apparatus of the present invention for extracting sentences with a predetermined special meaning from a document, the interference caused by noise sentences can be reduced, and sentences with the predetermined special meaning can be extracted more accurately and efficiently.
Brief description of the drawings
Fig. 1 is an overall flowchart of a method of extracting sentences with a predetermined special meaning from a document according to one embodiment of the present invention;
Fig. 2 is a flowchart of a method of extracting sentences with a predetermined special meaning from a document according to another embodiment of the present invention;
Fig. 3 is a flowchart of a method of extracting sentences with a predetermined special meaning from a document according to yet another embodiment of the present invention;
Fig. 4 is a schematic block diagram of an apparatus for extracting sentences with a predetermined special meaning from a document according to one embodiment of the present invention; and
Fig. 5 shows an exemplary computer system in which the present invention can be practiced, according to one embodiment of the present invention.
Detailed description of the embodiments
To help those skilled in the art better understand the present invention, it is described in further detail below with reference to the accompanying drawings and specific embodiments.
For ease of understanding and description, the general concept of the present invention is set out first. As stated above, relying on cue phrases alone may yield many noise sentences, that is, sentences that contain a cue phrase but are not the ones wanted. Meanwhile, many sentences with a special meaning or special function usually have some specific sentence structure pattern. Therefore, if both the sentence structure pattern and the cue phrases are considered when extracting sentences with a special meaning, a more satisfactory extraction result can be expected.
In this document, a phrase may refer to a single word or to an expression made up of several words, and a word refers to one word in Chinese or one word in English.
In addition, to avoid obscuring the main points of the present invention, well-known features and structures are not described in this document. For example, sentence extraction usually begins by splitting the document into sentences and segmenting the sentences into words, and the importance of a word is usually assessed using word features such as term frequency, inverse document frequency, word position, word length, and part of speech. Many well-known techniques exist for word segmentation and sentence splitting, such as the ICTCLAS word segmentation tool. Because these are not the aspects the present invention focuses on, they are not described in detail. Note, however, that this does not mean the present invention cannot include these well-known features or structures; on the contrary, these word segmentation and word feature selection techniques can all be used with the present invention.
For ease of understanding and description, the examples below generally concern extracting, from patent documents, the sentences that describe the technical problem to be solved. It should be emphasized, however, that the present invention is not limited to this case; it can be applied to extracting any sentence with a special meaning from a document, for example extracting sentences about product advantages from a product manual, or extracting clauses unfavorable to one's own side from a contract.
Fig. 1 shows an overall flowchart of a method of extracting sentences with a predetermined special meaning from a document according to one embodiment of the present invention.
As shown in Fig. 1, the method 100 of extracting sentences with a predetermined special meaning from a document according to one embodiment of the present invention may comprise: a sentence structure pattern obtaining step S110, a cue word obtaining step S120, a sentence structure pattern-cue word combining step S130, a sentence score determining step S140, and a sentence extracting step S150. Each step is described in detail below.
At step S110, a sentence structure pattern of sentences with the predetermined special meaning is obtained.
A sentence structure pattern of sentences with the predetermined special meaning is a pattern such that a text sequence matching it is more likely to be a sentence with the predetermined special meaning than a text sequence that does not match it. For example, a sentence structure pattern may be characterized in the following respects: it contains two or more phrases; the phrases are separated by punctuation marks or by a predetermined number of characters; and each phrase has the part of speech that corresponds to its role in the sentence structure. For example, a phrase serving as an adverbial may be an adverb phrase; a phrase serving as a subject may be a noun or pronoun phrase; a phrase serving as the main predicate may be a verb phrase; and so on.
A sentence structure pattern differs from a phrase in that a sentence structure pattern lets a reader glimpse or grasp the framework of a sentence and the aspect it sets out to describe, and it is generally more complex. A phrase, by contrast, is generally a sentence constituent at a level between word and sentence; it has a relatively fixed meaning, but the framework of the sentence generally cannot be inferred from it.
Taking the extraction of sentences that describe the technical problem to be solved in patent documents as an example, typical sentence structure patterns are: "accordingly, the object of this method" (hereinafter sentence structure pattern SP1), "as a result, the problem of the paper" (hereinafter sentence structure pattern SP2), and "therefore ... what is needed for{4,20}invention" (hereinafter sentence structure pattern SP3). Here {4,20} indicates that 4 to 20 characters intervene.
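The three example patterns can be sketched as regular expressions. This is only an illustration under stated assumptions: the "..." in SP3 is read as any run of characters within the sentence, "{4,20}" is read as 4 to 20 arbitrary characters, and the names SP_PATTERNS and match_patterns are hypothetical rather than taken from the disclosure.

```python
import re

# Hypothetical regex encodings of the three example patterns.
# "..." is read as any run of characters; "{4,20}" as 4 to 20 characters.
SP_PATTERNS = {
    "SP1": re.compile(r"accordingly,\s*the object of this method", re.IGNORECASE),
    "SP2": re.compile(r"as a result,\s*the problem of the paper", re.IGNORECASE),
    "SP3": re.compile(r"therefore.*what is needed for.{4,20}invention", re.IGNORECASE),
}


def match_patterns(sentence):
    """Return the names of all sentence structure patterns found in the sentence."""
    return [name for name, rx in SP_PATTERNS.items() if rx.search(sentence)]
```

Calling match_patterns on a sentence returns the list of pattern names it matches, which later steps could pair with cue words.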
Sentence structure patterns of sentences with the predetermined special meaning can, for example, be learned automatically from a training document set, or be defined manually by experienced experts in the relevant field. When they are learned automatically, the training document set may consist of a large number of training documents. For extracting technical-problem sentences from patent documents, the training document set may consist of a large number of patent documents such as published patent applications, in which the technical-problem sentences have been manually confirmed and labeled. The sentence structure patterns of the technical-problem sentences can then be learned, for example by whole-sentence matching or partial-sentence matching, and the learned patterns can be stored.
The obtained sentence structure patterns may be treated equally, without distinction. Alternatively, different weights may be set for the obtained sentence structure patterns; for example, a weight may be set according to the frequency with which the pattern occurs in the training document set.
At step S120, cue words are obtained, wherein a sentence containing a cue word is more likely to be a sentence with the predetermined special meaning than a sentence that does not contain it.
Sentences with a special meaning usually contain certain cue words. For example, in technical-problem sentences in patent documents, words such as solve, provide, need, increase, decrease, optimize, high, and poorer often occur. These words can be extracted as cue words.
Likewise, cue words can be learned automatically from a training document set or determined manually by experienced experts.
Likewise, different cue words may be treated equally without distinction, or different weights may be set for different cue words.
In addition, note that the sentence structure patterns and cue words for sentences with the predetermined special meaning may be obtained from an external source. For example, they may be obtained over a network from another computing device, be input by a user, or be stored in advance in a removable storage medium such as a flash memory and then read from that medium. The method or means of obtaining them does not limit the present invention.
At step S130, the sentence structure patterns and the cue words are combined to obtain sentence structure pattern-cue word combinations that conform to sentence grammatical structure.
For example, sentence structure pattern SP1, namely "accordingly, the object of this method", can be combined with the cue words solve, provide, need, increase, decrease, and optimize, but is not suited to combination with high or poorer.
Likewise, which sentence structure patterns can combine with which cue words, and which sentence structure patterns never or rarely combine with which cue words, can be derived from a large number of training documents. Of course, the combinations can also be defined manually by experts in the relevant field based on experience.
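One way to derive the usable combinations from labeled training sentences is to count how often each pattern and cue word occur together in sentences labeled as having the special meaning. The sketch below assumes patterns are plain lowercase substrings and cue words are matched as whole tokens; learn_combinations and its min_count parameter are hypothetical names, not part of the disclosure.

```python
import re
from collections import Counter


def learn_combinations(labeled_sentences, patterns, cues, min_count=1):
    """Return (pattern, cue) pairs that co-occur in at least min_count labeled sentences."""
    counts = Counter()
    for text, is_special in labeled_sentences:
        if not is_special:
            continue  # only special-meaning sentences contribute evidence
        lowered = text.lower()
        words = set(re.findall(r"[a-z]+", lowered))
        for name, phrase in patterns.items():
            if phrase in lowered:
                for cue in cues:
                    if cue in words:
                        counts[(name, cue)] += 1
    return {combo for combo, n in counts.items() if n >= min_count}
```

Raising min_count drops combinations that occur only rarely, which corresponds to the "never or rarely combine" case described above.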
Likewise, different sentence structure pattern-cue word combinations may be treated equally without distinction, or different weights may be set for different combinations. How the weights of sentence structure pattern-cue word combinations are specifically set or learned is described in detail later with reference to Fig. 3.
At step S140, the score of a sentence is determined based on the sentence structure pattern-cue word combinations contained in the sentence in the document.
After the sentence structure pattern-cue word combinations have been obtained through steps S110, S120, and S130, the score of each sentence in any document (hereinafter, a test document) can be calculated according to whether the sentence contains an obtained sentence structure pattern-cue word combination.
For example, suppose one sentence in the document is "accordingly, the object of this method is to provide an improved inkjet printing system having a specialized orifice plate". This sentence contains the combination "accordingly, the object of this method" (pattern SP1)-"provide" (cue word). Assuming that all sentence structure pattern-cue word combinations have the same weight of 1, the score of this sentence is 1.
Suppose another sentence in the document is "A description will be given below, with reference to the drawings, of embodiments of the present invention". Because this sentence contains no sentence structure pattern-cue word combination, its score can be, for example, 0.
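The two worked examples can be reproduced with a short scoring sketch. It assumes the score is the sum of the weights of all combinations found in the sentence (the text only fixes the single-combination case), and the names PATTERNS, COMBINATION_WEIGHTS, and score_sentence are hypothetical.

```python
import re

PATTERNS = {"SP1": "accordingly, the object of this method"}

# Hypothetical combination weights; the text assumes every weight is 1.
COMBINATION_WEIGHTS = {("SP1", "provide"): 1.0, ("SP1", "solve"): 1.0}


def score_sentence(sentence):
    """Sum the weights of every pattern-cue word combination the sentence contains."""
    lowered = sentence.lower()
    words = set(re.findall(r"[a-z]+", lowered))
    score = 0.0
    for (pattern, cue), weight in COMBINATION_WEIGHTS.items():
        if PATTERNS[pattern] in lowered and cue in words:
            score += weight
    return score
```

With these tables, the inkjet sentence above scores 1.0 and the "A description will be given below" sentence scores 0.0.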
In the score calculation above, the sentence score is calculated simply according to whether the sentence contains a sentence structure pattern-cue word combination. This is only an example given for ease of understanding; other methods of calculating the sentence score are possible. For example, a sentence that contains a sentence structure pattern but no cue word can be given the weight corresponding to the sentence structure pattern, and a sentence that contains a cue word but no sentence structure pattern can be given the weight corresponding to the cue word.
In addition, for some cue words, a list of synonyms, near-synonyms, or phrases that can serve as synonymous substitutes can be provided, together with a corresponding scaling factor such as 0.9. Then, when a sentence is matched or searched against the sentence structure pattern-cue word combinations and no matching combination is found, the list can be consulted to check whether the sentence contains a combination of a sentence structure pattern with one of these synonyms, near-synonyms, or substitute phrases. If so, a corresponding score can be obtained, for example the score for a matching sentence structure pattern-cue word combination multiplied by 0.9.
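The synonym fallback can be sketched as follows. The synonym list ("supply", "offer") and the names CUE_WEIGHTS, SYNONYMS, and score_with_synonyms are hypothetical; only the 0.9 factor comes from the text.

```python
import re

PATTERN = "accordingly, the object of this method"
CUE_WEIGHTS = {"provide": 1.0}               # weight of each SP1-cue word combination
SYNONYMS = {"provide": ["supply", "offer"]}  # hypothetical synonym list
SYNONYM_FACTOR = 0.9                         # scaling factor given in the text


def score_with_synonyms(sentence):
    """Score an exact pattern-cue match fully, a pattern-synonym match at 0.9."""
    lowered = sentence.lower()
    if PATTERN not in lowered:
        return 0.0
    words = set(re.findall(r"[a-z]+", lowered))
    for cue, weight in CUE_WEIGHTS.items():
        if cue in words:
            return weight
    # No exact cue word found: fall back to the synonym list.
    for cue, weight in CUE_WEIGHTS.items():
        if any(syn in words for syn in SYNONYMS.get(cue, [])):
            return weight * SYNONYM_FACTOR
    return 0.0
```

The exact match is tried first so that a synonym never lowers the score of a sentence that already contains the cue word itself.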
At step S150, sentences with the predetermined special meaning are extracted from the document based on the scores of the sentences.
For example, the sentences whose scores exceed a predetermined threshold, or the top-ranked sentences by score, can be extracted as the sentences with the predetermined special meaning.
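Both selection rules are straightforward to sketch. The input is assumed to be a list of (sentence, score) pairs, and the function names are hypothetical.

```python
def extract_by_threshold(scored, threshold):
    """Keep sentences whose score exceeds the threshold, in document order."""
    return [sentence for sentence, score in scored if score > threshold]


def extract_top_k(scored, k):
    """Keep the k highest-scoring sentences, best first."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return [sentence for sentence, _ in ranked[:k]]
```

The threshold rule returns a variable number of sentences, while the top-k rule always returns at most k, which may suit a fixed-length summary better.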
The extracted special-meaning sentences can be output to an output device such as a display or a printer, or to another electronic device for further use or processing.
The method of extracting sentences with a predetermined special meaning from a document according to one embodiment of the present invention has been described above with reference to Fig. 1. Note, however, that the above embodiment is only an example and should not be taken as limiting the present invention. Many alternatives and modifications are possible without departing from the scope of protection of the present invention.
For example, the step of determining the sentence score described above is not essential. Instead, a classification algorithm or learning algorithm can be used to extract the sentences with the predetermined special meaning directly from the document, based on the sentence structure pattern-cue word combinations contained in the sentences. In the simplest case, one can merely check whether a sentence contains a sentence structure pattern-cue word combination and, if it does, extract the sentence as a special-meaning sentence, without any explicit score calculation.
Alternatively, for example, a decision tree can be used for classification. In this case, each sentence structure pattern-cue word combination can serve as a decision feature or variable at a node of the tree. For example, one node tests whether a sentence contains sentence structure pattern-cue word combination A, another node tests whether it contains combination B, the tree branches according to each test result, and the classification of the sentence is finally obtained at a leaf node. The constructed decision tree is trained with the training document set. In this example too, no score is determined when a test sentence is judged; instead, the sentence is passed through the decision tree according to the sentence structure pattern-cue word combinations it contains, and it is assigned the class of the leaf node it reaches.
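The traversal part of this idea can be sketched with a small hand-built tree. This is an assumption-laden illustration: the tree below is written by hand rather than trained on a training document set, and the node layout, the labels "special" and "other", and the function name classify are all hypothetical.

```python
# Hypothetical hand-built tree; in the text the tree is trained on a training document set.
# Each inner node tests for one pattern-cue word combination; each leaf holds a class label.
TREE = {
    "feature": ("SP1", "provide"),
    "yes": {"label": "special"},
    "no": {
        "feature": ("SP3", "increase"),
        "yes": {"label": "special"},
        "no": {"label": "other"},
    },
}


def classify(combinations, node=TREE):
    """Walk the tree using the set of combinations found in a sentence."""
    while "label" not in node:
        branch = "yes" if node["feature"] in combinations else "no"
        node = node[branch]
    return node["label"]
```

No score is computed at any point; the sentence simply receives the label of the leaf it reaches.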
As another example, when a Bayes classifier is used, the prior probabilities for the various cases can be obtained from statistics on the training document set, and from them the probability that a sentence is a special-meaning sentence given that it contains each sentence structure pattern-cue word combination. For a test sentence, the probability that it is a special-meaning sentence is calculated, and the sentence is classified according to this probability. Here too, no sentence score needs to be determined.
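For a single combination, the posterior probability can be estimated from training counts with Bayes' rule. The sketch below assumes the training data is a list of (set of combinations, is_special) pairs containing at least one example of each class; posterior_special is a hypothetical name, and a full Bayes classifier would combine several features.

```python
def posterior_special(training, combo):
    """Estimate P(special | combo present) with Bayes' rule from training counts.

    Assumes the training set contains at least one sentence of each class.
    """
    n = len(training)
    n_special = sum(1 for _, special in training if special)
    n_other = n - n_special
    prior_special = n_special / n
    prior_other = n_other / n
    # Likelihood of seeing the combination in each class.
    like_special = sum(1 for c, s in training if s and combo in c) / n_special
    like_other = sum(1 for c, s in training if not s and combo in c) / n_other
    evidence = like_special * prior_special + like_other * prior_other
    return like_special * prior_special / evidence if evidence else 0.0
```

A test sentence containing the combination could then be classified as special-meaning when this probability exceeds a chosen cutoff such as 0.5, which is itself an assumption.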
Decision trees and Bayes classifiers have been used above to illustrate training and testing with a learning algorithm. These are only examples; other learning algorithms, such as logistic regression classification and rule-based methods, can also be used with the present invention. How logistic regression is used to calculate the weights of sentence structure pattern-cue word combinations and to classify test sentences is described in detail later with reference to Fig. 3.
In addition, the cue words above are all positive cue words; that is, a sentence containing the cue word is considered more likely to be a special-meaning sentence than one that does not. This is only an example. For instance, negative cue words can be introduced: a penalty factor can be set for a sentence containing a negative cue word, for example to reduce its score, or a sentence containing a negative cue word can simply be excluded from the special-meaning sentences.
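Both treatments of negative cue words can be sketched in one function. The negative cue word "embodiment" and its penalty factor of 0.5 are invented for illustration, as are the names NEGATIVE_CUES and apply_negative_cues; excluded sentences are signaled here by returning None.

```python
import re

# Hypothetical negative cue word and penalty factor, for illustration only.
NEGATIVE_CUES = {"embodiment": 0.5}


def apply_negative_cues(sentence, score, exclude=False):
    """Multiply the score by each matching penalty, or return None to exclude."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    for cue, penalty in NEGATIVE_CUES.items():
        if cue in words:
            if exclude:
                return None  # sentence is dropped from the candidates
            score *= penalty
    return score
```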
In addition, the examples above consider only the sentence structure patterns of sentences with the predetermined special meaning. Alternatively or additionally, the sentence structure patterns of noise sentences can be obtained, a noise sentence being a sentence that contains a cue word but does not have the predetermined special meaning. It is then judged whether each sentence in the document matches a noise sentence structure pattern, and the sentences judged to be noise sentences are deleted from the document. For example, when extracting technical-problem sentences from patent documents, one noise sentence pattern can be "invention ... problem."
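Noise filtering can be sketched as a pre-pass over the sentences. The regular expression is one possible reading of "invention ... problem.", assuming "..." means any run of characters and that "problem." closes the sentence; NOISE_PATTERNS and remove_noise are hypothetical names.

```python
import re

# One reading of "invention ... problem.": "invention", any characters,
# then "problem." at the end of the sentence (an assumption).
NOISE_PATTERNS = [re.compile(r"invention.*problem\.\s*$", re.IGNORECASE)]


def remove_noise(sentences):
    """Drop every sentence that matches a noise sentence structure pattern."""
    return [s for s in sentences if not any(rx.search(s) for rx in NOISE_PATTERNS)]
```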
In addition, the sentence structure patterns of certain fixed-form sentences without the special meaning can also be considered; a sentence with such a pattern generally does not have the desired special meaning. It is then checked whether a sentence matches such a pattern; if it does, the sentence is excluded from the special-meaning sentences, or a penalty factor is set for it.
In addition, the sentence structure patterns of sentences without the special meaning can also be combined with negative cue words. It is then checked whether a sentence matches such a combination of a non-special-meaning sentence structure pattern and a negative cue word; if it does, the sentence is excluded from the special-meaning sentences, or a penalty factor is set for it.
In addition, note that "document" here (whether a training document or a test document) is a broad concept: it can be a complete document in the ordinary sense, or a part of a document.
Fig. 2 is a flowchart of a method 200 of extracting sentences with a predetermined special meaning from a document according to another embodiment of the present invention.
Steps S210, S220, and S250 shown in Fig. 2 are essentially the same as steps S110, S120, and S150 shown in Fig. 1, and their description is omitted here.
The method 200 shown in Fig. 2 differs from the method 100 shown in Fig. 1 in that it introduces cue word clusters; that is, it works at the level of cue word clusters, or cue word groups, rather than individual cue words. The reason is that in some cases there may be a great many cue words, so the number of sentence structure pattern-cue word combinations increases sharply, especially when there are also many sentence structure patterns. Working in units of cue word clusters greatly reduces the complexity of the problem and the amount of computation, and saves resources.
Specifically, at step S221, the cue words obtained at step S220 are clustered to obtain several cue word clusters.
Clustering is an unsupervised machine learning technique that groups individuals or samples into several classes, where each individual can be regarded as a point in a feature space. Its basic idea is to gather points that are close together and dense in the feature space into one class, or cluster.
In the cue word clustering here, each word is a sample, and the similarity between words can be regarded as the distance between them. Accordingly, various existing clustering algorithms can be applied to the present invention, for example the clustering algorithms mentioned in Zhiyuan Liu, Peng Li, Yabin Zheng, and Maosong Sun, "Clustering to Find Exemplar Terms for Keyphrase Extraction", in the natural language processing conference EMNLP 2009, pages 257-266.
The final number of clusters k can be predetermined, for example as the number of keywords specified by the user or the system, or it can be left open and determined by the final result of the clustering algorithm.
The objective of the clustering can be that the cue words in a cluster share the same semantics or the same grammatical role and part of speech. The objective can also take into account factors such as the distance between clusters and/or the number of members in each cluster. Clustering methods include semantics-based clustering, grammar-based clustering, a combination of the two, and so on.
The similarity between words can be determined in advance and stored in a word similarity database, or calculated on the fly from the document being processed. It can be calculated with the mutual information method, with statistical methods such as the log-likelihood ratio or the chi-squared test, or with knowledge-based methods that use a dictionary (for example WordNet or HowNet).
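As an illustration of the mutual information approach, the sketch below computes pointwise mutual information (PMI) for word pairs that co-occur in the same sentence. PMI is one common variant and is an assumption here, as the text does not specify the exact formula; the input is assumed to be a list of tokenized sentences, and pmi_table is a hypothetical name.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_table(sentences):
    """Pointwise mutual information (base 2) of word pairs co-occurring in a sentence."""
    n = len(sentences)
    word_counts = Counter()
    pair_counts = Counter()
    for words in sentences:
        unique = sorted(set(words))  # sorted so each pair has one canonical order
        word_counts.update(unique)
        pair_counts.update(combinations(unique, 2))
    return {
        pair: math.log2(
            (count / n) / ((word_counts[pair[0]] / n) * (word_counts[pair[1]] / n))
        )
        for pair, count in pair_counts.items()
    }
```

A higher PMI value indicates that two words co-occur more often than chance would predict, so it can serve as a similarity signal for clustering.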
A simple example of a clustering process is described below. Take the cue words of the above example: solve, provide, need, increase, decrease, optimize, high, poorer. According to part of speech (verb or adjective), they can be divided into two clusters, namely "solve, provide, need, increase, decrease, optimize" and "high, poorer" (hereinafter called cluster C3).
Then, according to semantics, for example whether a word expresses solving or expresses rising and falling, the cue words solve, provide, need, increase, decrease, optimize can be further divided into two clusters: "solve, provide, need" (hereinafter called C1) and "increase, decrease, optimize" (hereinafter called C2). Three cue word clusters, C1, C2, and C3, are thus obtained in total.
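The two-stage grouping in this example can be sketched as follows. The lookup tables `PART_OF_SPEECH` and `SEMANTIC_CLASS` are hypothetical hand-written annotations standing in for a tagger and a semantic resource; a real system would derive them from data or use one of the clustering algorithms mentioned earlier.

```python
from collections import defaultdict

# Hypothetical annotations for the cue words of the example.
PART_OF_SPEECH = {
    "solve": "verb", "provide": "verb", "need": "verb",
    "increase": "verb", "decrease": "verb", "optimize": "verb",
    "high": "adjective", "poorer": "adjective",
}
SEMANTIC_CLASS = {
    "solve": "solving", "provide": "solving", "need": "solving",
    "increase": "change", "decrease": "change", "optimize": "change",
}


def cluster_cue_words(cue_words):
    """Group cue words by part of speech, then split verbs by semantic class."""
    clusters = defaultdict(list)
    for word in cue_words:
        pos = PART_OF_SPEECH[word]
        # Only verbs are split further; adjectives stay in one cluster.
        semantic = SEMANTIC_CLASS.get(word) if pos == "verb" else None
        clusters[(pos, semantic)].append(word)
    return list(clusters.values())


cue_words = ["solve", "provide", "need", "increase",
             "decrease", "optimize", "high", "poorer"]
clusters = cluster_cue_words(cue_words)
```

With these annotations the three resulting groups correspond to C1, C2, and C3 of the example.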
The cue words, the cue word clusters, and the number of cue word clusters above are only examples; different cue words, cue word clusters, and numbers of clusters may be used as required.
A major benefit of introducing cue word clusters is that all words in a cluster are treated as having the same status, role, weight, and so on. These factors therefore need not be considered for each cue word individually, which reduces the processing workload.
In step S230, unlike step S130 shown in Fig. 1, what is combined is not the sentence structure pattern and the cue word but the sentence structure pattern and the cue word cluster, so as to obtain combined sentence structure pattern-cue word clusters that conform to sentence grammatical structure.
For example, for the typical sentence structure patterns mentioned above, "accordingly, the object of this method" (SP1), "as a result, the problem of the paper" (SP2), and "therefore ... what is needed for{4,20}invention" (SP3), and the cue word clusters "solve, provide, need" (C1), "increase, decrease, optimize" (C2), and "high, poorer" (C3), the following meaningful sentence structure pattern-cue word cluster combinations can be obtained: SP1-C1, SP1-C2, SP2-C3, SP3-C2, SP3-C3.
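One possible way to represent such combinations is as regular expressions in which the sentence structure pattern is followed, later in the same sentence, by any word of the cue word cluster. This ordering rule, the names `build_combined_patterns` and `matched_combinations`, and the restriction to SP1 and SP2 are assumptions of this sketch; SP3 is left out because its `{4,20}` gap notation is not fully specified here.

```python
import re

SENTENCE_PATTERNS = {
    "SP1": r"accordingly, the object of this method",
    "SP2": r"as a result, the problem of the paper",
}
CUE_CLUSTERS = {
    "C1": ["solve", "provide", "need"],
    "C2": ["increase", "decrease", "optimize"],
    "C3": ["high", "poorer"],
}
# Subset of the meaningful combinations listed in the text.
MEANINGFUL_COMBINATIONS = [("SP1", "C1"), ("SP1", "C2"), ("SP2", "C3")]


def build_combined_patterns(pairs):
    """Compile one regex per (pattern, cluster) pair.

    Assumption: the structure pattern must be followed, anywhere later in
    the sentence, by one of the cluster's cue words as a whole word.
    """
    combined = {}
    for sp, cluster in pairs:
        words = "|".join(re.escape(w) for w in CUE_CLUSTERS[cluster])
        regex = SENTENCE_PATTERNS[sp] + r"\b.*?\b(?:" + words + r")\b"
        combined[f"{sp}-{cluster}"] = re.compile(regex, re.IGNORECASE)
    return combined


def matched_combinations(sentence, combined):
    """Return the names of all combined patterns found in the sentence."""
    return [name for name, rx in combined.items() if rx.search(sentence)]


combined = build_combined_patterns(MEANINGFUL_COMBINATIONS)
example = ("accordingly, the object of this method is to provide an improved "
           "inkjet printing system having a specialized orifice plate")
matches = matched_combinations(example, combined)
```

Because every word of a cluster shares one compiled pattern, adding a cue word to a cluster does not add a new combination to weight or match.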
When weights are taken into account, the weight of each individual sentence structure pattern-cue word combination no longer needs to be considered; the task reduces to considering the weight of each sentence structure pattern-cue word cluster combination. This further reduces the processing workload.
In step S240, the score of a sentence is determined based on whether the sentence in the test document contains a sentence structure pattern-cue word cluster combination. Then, in step S250, sentences with the predetermined special meaning are extracted from the document based on the sentence scores.
Likewise, the above method of extracting sentences with a special meaning is only an example. The sentence extraction method may further take into account noise sentence structure patterns and/or cue words that act as negations.
Fig. 3 is a flowchart of a method 300 for extracting sentences with a predetermined special meaning from a document according to another embodiment of the present invention.
Steps S310, S320, S321, S330, and S350 shown in Fig. 3 are substantially the same as steps S210, S220, S221, S230, and S250 shown in Fig. 2, and their detailed description is omitted here.
Method 300 shown in Fig. 3 differs from method 200 shown in Fig. 2 in the additional step S331, which determines the weights of the combined sentence structure pattern-cue word clusters. The step S340 of determining the sentence score may differ accordingly.
The weights of the sentence structure pattern-cue word clusters can be calculated by using a classification method to classify the sentences in a training document set into sentences with the predetermined special meaning and sentences without it. The classification method may be at least one of, or a combination of, a logistic regression classification method, a Bayesian classification method, a rule-based method, and a function-based method.
An example of determining the weights of the sentence structure pattern-cue word cluster combinations with the logistic regression classification method is given below.
Suppose the score of a sentence is denoted by the variable z, and the sentence structure pattern-cue word cluster combinations (k in total) are denoted by x1, x2, …, xk. When a linear logistic regression classification method is used, the sentence score z can be expressed by the following linear logistic regression formula (1).
z = β0 + β1*x1 + β2*x2 + … + βk*xk ……(1)
Here β0 is the constant term, and β1, β2, …, βk are the coefficients of the combinations x1, x2, …, xk, that is, the weights of the respective sentence structure pattern-cue word cluster combinations.
When the sentence structure pattern-cue word cluster combinations are SP1-C1, SP1-C2, and SP2-C3, k = 3 and formula (1) becomes formula (2):
z = β0 + β1*x1 + β2*x2 + β3*x3 ……(2)
Using the training document set, each sentence is assigned a score according to whether it is a sentence with the predetermined special meaning, and the values of x1, x2, and x3 are determined according to whether the sentence contains the corresponding combination (for example, the value is 1 if the combination is present and 0 if it is not). The coefficients β0, β1, β2, and β3 can then be solved for, where β1, β2, and β3 are the weights of the combinations SP1-C1, SP1-C2, and SP2-C3.
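The fitting step can be sketched with a small batch gradient descent implementation of logistic regression. The training rows below are invented toy data (binary indicators for SP1-C1, SP1-C2, SP2-C3 and a 0/1 label), and the function name `fit_logistic_regression`, the learning rate, and the epoch count are illustrative assumptions rather than values taken from the patent.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic_regression(features, labels, learning_rate=0.5, epochs=2000):
    """Fit beta = [b0, b1, ..., bk] by batch gradient descent on the log loss."""
    k = len(features[0])
    n = len(features)
    beta = [0.0] * (k + 1)
    for _ in range(epochs):
        grad = [0.0] * (k + 1)
        for x, y in zip(features, labels):
            z = beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))
            error = sigmoid(z) - y           # derivative of log loss w.r.t. z
            grad[0] += error
            for j, xi in enumerate(x):
                grad[j + 1] += error * xi
        beta = [b - learning_rate * g / n for b, g in zip(beta, grad)]
    return beta


# Hypothetical toy training set: columns are x1 (SP1-C1), x2 (SP1-C2), x3 (SP2-C3).
X = [[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1],
     [0, 0, 0], [0, 0, 0], [0, 0, 0]]
y = [1, 1, 1, 1, 0, 0, 0]   # 1 = sentence has the predetermined special meaning

beta = fit_logistic_regression(X, y)
```

In this toy set every sentence containing a combination is a positive example, so the fitted β1 to β3 come out positive and β0 negative; real training data would be noisier and would normally call for regularization or an established library.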
Once the weights of the combinations SP1-C1, SP1-C2, and SP2-C3 have been obtained, every sentence in the test document is matched against all sentence structure pattern-cue word cluster patterns, and the weights of the matched patterns are accumulated linearly to obtain the sentence score.
For example, suppose sentence S is "accordingly, the object of this method is to provide an improved inkjet printing system having a specialized orifice plate". It matches the SP1-C1 pattern, so the score of the sentence is
Score(S) = β0 + β1.
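The scoring and extraction steps can be sketched as below. The numeric weights, the intercept, and the rule of extracting every sentence whose score exceeds a threshold are hypothetical choices for illustration; the text only states that extraction is based on the score.

```python
def sentence_score(matched, weights, intercept):
    """Formula (1) with x_i = 1 for each matched combination and 0 otherwise."""
    return intercept + sum(weights[name] for name in matched)


def extract_sentences(sentences_with_matches, weights, intercept, threshold=0.0):
    """Keep sentences whose score exceeds the (assumed) threshold."""
    return [
        sentence
        for sentence, matched in sentences_with_matches
        if sentence_score(matched, weights, intercept) > threshold
    ]


# Hypothetical values standing in for beta0 and beta1..beta3.
INTERCEPT = -1.5
WEIGHTS = {"SP1-C1": 2.0, "SP1-C2": 1.0, "SP2-C3": 0.5}

candidates = [
    ("sentence matching SP1-C1", ["SP1-C1"]),
    ("sentence matching SP2-C3", ["SP2-C3"]),
    ("sentence matching nothing", []),
]
selected = extract_sentences(candidates, WEIGHTS, INTERCEPT)
```

For the first candidate the score is β0 + β1, exactly as in the worked example for sentence S.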
In the above example, the logistic regression classification method is used to determine the weights of the sentence structure pattern-cue word cluster combinations; alternatively, at least one of, or a combination of, a Bayesian classification method, a rule-based method, and a function-based method may be used to determine these weights.
Fig. 4 is a schematic block diagram of a device 400 for extracting sentences with a predetermined special meaning from a document according to an embodiment of the present invention.
The device 400 for extracting sentences with a predetermined special meaning from a document may comprise: a sentence structure pattern obtaining component 410 for obtaining the sentence structure patterns of sentences with the predetermined special meaning; a cue word obtaining component 420 for obtaining cue words, where a sentence containing a cue word is more likely to be a sentence with the predetermined special meaning than a sentence that does not contain it; a sentence structure pattern-cue word combining component 430 for combining the sentence structure patterns and the cue words to obtain combined sentence structure pattern-cue words that conform to sentence grammatical structure; a sentence score determining component 440 for determining the score of a sentence based on the sentence structure pattern-cue words contained in the sentence in the document; and a sentence extracting component 450 for extracting sentences with the predetermined special meaning from the document based on the sentence scores.
The sentence structure patterns of sentences with the predetermined special meaning may be learned automatically from a training document set or defined manually.
A sentence structure pattern of a sentence with the predetermined special meaning may be a pattern such that a word combination matching the pattern is more likely to be a sentence with the predetermined special meaning than a word combination that matches no structure pattern.
The device 400 may further comprise a component for determining the weight of each combined sentence structure pattern-cue word; the sentence score determining component then determines the score of a sentence based on the sentence structure pattern-cue words contained in each sentence of the document and the weights of the corresponding sentence structure pattern-cue words.
The device 400 may further comprise a component for clustering the obtained cue words to obtain cue word clusters. The sentence structure pattern-cue word combining component 430 then combines the sentence structure patterns and the cue word clusters. The device 400 may further comprise a component for determining the weight of each combined sentence structure pattern-cue word cluster. The sentence score determining component 440 may then determine the score of a sentence based on the sentence structure pattern-cue word clusters contained in each sentence of the document and the weights of the corresponding sentence structure pattern-cue word clusters.
The component for determining the weight of each combined sentence structure pattern-cue word cluster may calculate the weights by using a classification method to classify the sentences of a training document set into sentences with the predetermined special meaning and sentences without it. The classification method may be at least one of, or a combination of, a logistic regression classification method, a Bayesian classification method, a rule-based method, and a function-based method.
The device 400 may further comprise: a component for obtaining the sentence structure patterns of noise sentences, a noise sentence being a sentence that contains a cue word but is not a sentence with the predetermined special meaning; a component for judging whether a sentence in the document matches a sentence structure pattern of a noise sentence; and a component for deleting from the document the sentences judged to be noise sentences.
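A noise filter of this kind might look like the following sketch. The single noise pattern (sentences beginning with "in the prior art") and the name `remove_noise_sentences` are hypothetical; actual noise sentence structure patterns would be learned or defined for the target documents.

```python
import re

# Hypothetical noise sentence structure pattern, for illustration only.
NOISE_PATTERNS = [re.compile(r"^in the prior art\b", re.IGNORECASE)]


def remove_noise_sentences(sentences):
    """Drop sentences that match any noise sentence structure pattern."""
    return [
        s for s in sentences
        if not any(rx.search(s) for rx in NOISE_PATTERNS)
    ]


document = [
    "In the prior art, a pump is needed to increase the pressure.",
    "Accordingly, the object of this method is to provide a simpler pump.",
]
filtered = remove_noise_sentences(document)
```

Both toy sentences contain cue words, but only the second survives the filter and goes on to scoring.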
Fig. 5 is a schematic diagram of an exemplary computer system 700 in which the present invention can be practiced according to an embodiment of the present invention.
A description is given with reference to Fig. 5 as an example of a hardware configuration that implements the above-described sentence extraction device. A CPU (central processing unit) 701 performs various processing according to programs stored in a ROM (read-only memory) 702 or a storage section 708. For example, the CPU executes the program of the method, described in the above embodiments, for extracting sentences with a predetermined special meaning from a document. A RAM (random access memory) 703 stores, as appropriate, the programs executed by the CPU 701, data, and so on. The CPU 701, the ROM 702, and the RAM 703 are interconnected via a bus 704.
The CPU 701 is connected to an input/output interface 705 via the bus 704. An input section 706 including a keyboard, a mouse, a microphone, and the like, and an output section 707 including a display, a speaker, and the like, are connected to the input/output interface 705. The CPU 701 performs various processing according to instructions input from the input section 706 and outputs the processing results to the output section 707.
The storage section 708 connected to the input/output interface 705 includes, for example, a hard disk, and stores the programs executed by the CPU 701 and various data. A communication section 709 communicates with external devices through a network such as the Internet or a local area network.
A drive 710 connected to the input/output interface 705 drives a removable medium 711 such as a magnetic disk, an optical disc, a magneto-optical disc, or a semiconductor memory, and obtains the programs, data, and so on recorded there. The obtained programs and data are transferred to the storage section 708 when needed and stored there.
The basic principles of the present invention have been described above with reference to specific embodiments. It should be noted, however, that those of ordinary skill in the art will understand that all or any of the steps or components of the method and device of the present invention can be implemented in hardware, firmware, software, or a combination thereof, in any computing device (including a processor, a storage medium, and the like) or network of computing devices, and that they can do so with basic programming skills after reading the description of the present invention.
Therefore, the object of the present invention can also be achieved by running a program or a set of programs on any computing device. The computing device may be a well-known general-purpose device. The object of the present invention can therefore also be achieved merely by providing a program product containing program code that implements the method or device. That is, such a program product also constitutes the present invention, and a storage medium storing such a program product also constitutes the present invention. Obviously, the storage medium may be any known storage medium or any storage medium developed in the future.
It should also be noted that, in the device and method of the present invention, the components or steps can obviously be decomposed and/or recombined. Such decompositions and/or recombinations should be regarded as equivalents of the present invention. Moreover, the steps of the above series of processing can naturally be performed chronologically in the order described, but they need not be performed in that order; the order of execution may be changed. For example, there is no strict ordering between the step of correcting identification information based on historical identification information and the step of correcting identification information based on the mutual relationship between objects.
The above embodiments do not limit the scope of protection of the present invention. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions may occur depending on design requirements and other factors. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (10)

1. A method for extracting sentences with a predetermined special meaning from a document, comprising the steps of:
obtaining sentence structure patterns of sentences with the predetermined special meaning;
obtaining cue words, wherein a sentence containing a cue word is more likely to be a sentence with the predetermined special meaning than a sentence not containing the cue word;
combining the sentence structure patterns and the cue words to obtain combined sentence structure pattern-cue words that conform to sentence grammatical structure;
determining the score of a sentence based on the sentence structure pattern-cue words contained in the sentence in the document; and
extracting sentences with the predetermined special meaning from the document based on the scores of the sentences.
2. The method of claim 1, wherein the sentence structure patterns of sentences with the predetermined special meaning are learned automatically from a training document set or defined manually.
3. The method of claim 1, wherein a sentence structure pattern of a sentence with the predetermined special meaning means that a word combination matching the structure pattern is more likely to be a sentence with the predetermined special meaning than a word combination not matching the structure pattern.
4. The method of claim 1, further comprising:
determining the weight of each combined sentence structure pattern-cue word;
wherein determining the score of a sentence comprises: determining the score of the sentence based on the sentence structure pattern-cue words contained in each sentence of the document and the weights of the corresponding sentence structure pattern-cue words.
5. The method of claim 4, wherein:
the obtained cue words are clustered to obtain cue word clusters;
the sentence structure patterns and the cue word clusters are combined;
the weight of each combined sentence structure pattern-cue word cluster is determined; and
the score of a sentence is determined based on the sentence structure pattern-cue word clusters contained in each sentence of the document and the weights of the corresponding sentence structure pattern-cue word clusters.
6. The method of claim 1, wherein the weights of the sentence structure pattern-cue words are calculated by using a classification method to classify the sentences in a training document set into sentences with the predetermined special meaning and sentences without the predetermined special meaning.
7. The method of claim 5, wherein the weights of the sentence structure pattern-cue word clusters are calculated by using a classification method to classify the sentences in a training document set into sentences with the predetermined special meaning and sentences without the predetermined special meaning.
8. The method of claim 1, further comprising:
obtaining sentence structure patterns of noise sentences, a noise sentence being a sentence that contains a cue word but is not a sentence with the predetermined special meaning;
judging whether a sentence in the document matches a sentence structure pattern of a noise sentence; and
deleting from the document the sentences judged to be noise sentences.
9. A device for extracting sentences with a predetermined special meaning from a document, comprising:
a sentence structure pattern obtaining component for obtaining sentence structure patterns of sentences with the predetermined special meaning;
a cue word obtaining component for obtaining cue words, wherein a sentence containing a cue word is more likely to be a sentence with the predetermined special meaning than a sentence not containing the cue word;
a sentence structure pattern-cue word combining component for combining the sentence structure patterns and the cue words to obtain combined sentence structure pattern-cue words that conform to sentence grammatical structure;
a sentence score determining component for determining the score of a sentence based on the sentence structure pattern-cue words contained in the sentence in the document; and
a sentence extracting component for extracting sentences with the predetermined special meaning from the document based on the scores of the sentences.
10. A method for extracting sentences with a predetermined special meaning from a document, comprising the steps of:
obtaining sentence structure patterns of sentences with the predetermined special meaning;
obtaining cue words, wherein a sentence containing a cue word is more likely to be a sentence with the predetermined special meaning than a sentence not containing the cue word;
combining the sentence structure patterns and the cue words to obtain combined sentence structure pattern-cue words that conform to sentence grammatical structure; and
extracting sentences with the predetermined special meaning from the document based on the sentence structure pattern-cue words contained in the sentences in the document.
CN201010268675.6A 2010-09-01 2010-09-01 Method and device for extracting sentences from document Expired - Fee Related CN102385574B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201010268675.6A CN102385574B (en) 2010-09-01 2010-09-01 Method and device for extracting sentences from document

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201010268675.6A CN102385574B (en) 2010-09-01 2010-09-01 Method and device for extracting sentences from document

Publications (2)

Publication Number Publication Date
CN102385574A true CN102385574A (en) 2012-03-21
CN102385574B CN102385574B (en) 2014-08-20

Family

ID=45824995

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201010268675.6A Expired - Fee Related CN102385574B (en) 2010-09-01 2010-09-01 Method and device for extracting sentences from document

Country Status (1)

Country Link
CN (1) CN102385574B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5924108A (en) * 1996-03-29 1999-07-13 Microsoft Corporation Document summarizer for word processors
US7051024B2 (en) * 1999-04-08 2006-05-23 Microsoft Corporation Document summarizer for word processors
CN1470047A (en) * 2000-11-20 2004-01-21 Method of vector analysis for a document
CN101398814A (en) * 2007-09-26 2009-04-01 北京大学 Method and system for simultaneously abstracting document summarization and key words
CN101382962A (en) * 2008-10-29 2009-03-11 西北工业大学 Superficial layer analyzing and auto document summary method based on abstraction degree of concept

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108959312A (en) * 2017-05-23 2018-12-07 华为技术有限公司 A kind of method, apparatus and terminal that multi-document summary generates
CN108959312B (en) * 2017-05-23 2021-01-29 华为技术有限公司 Method, device and terminal for generating multi-document abstract
US10929452B2 (en) 2017-05-23 2021-02-23 Huawei Technologies Co., Ltd. Multi-document summary generation method and apparatus, and terminal

Also Published As

Publication number Publication date
CN102385574B (en) 2014-08-20

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20140820

Termination date: 20200901
