CN100419753C - Method and device for digital data central searching target file according to classified information - Google Patents

Publication number
CN100419753C
Authority
CN
China
Prior art keywords
classification
keyword
layer
discrimination
searching documents
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CNB2005100229632A
Other languages
Chinese (zh)
Other versions
CN1987849A (en)
Inventor
鲁耀杰
游赣梅
王晓霞
李刚
刘旸
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ricoh Co Ltd
Original Assignee
Ricoh Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ricoh Co Ltd
Priority to CNB2005100229632A (CN100419753C)
Priority to JP2006340412A (JP2007172616A)
Publication of CN1987849A
Application granted
Publication of CN100419753C
Legal status: Expired - Fee Related

Abstract

The invention discloses a method and a device for searching for a target document in a digital data set that is hierarchically classified from layer 1 to layer N, using the classification information of layers 1 to M. The method calculates, for each keyword in the current keyword sequence, its discrimination with respect to each category of the current layer, and from these values calculates the probability that the target document appears in each category of the current layer. If the current layer number is less than M, the next layer becomes the current layer and the above steps are repeated. Otherwise, the per-category discriminations of the keywords and the per-category occurrence probabilities of the target document calculated at every layer are combined to obtain, for each keyword in the current keyword sequence, an integrated discrimination with respect to the target document over layers 1 to M.

Description

Method and apparatus for searching for a target document in a digital data set according to classification information
Technical field
The present invention relates to searching for a target document in a digital data set that is hierarchically classified into at least one layer, according to the classification information of those layers, and more particularly to a method and an apparatus for performing such a search.
Background art
In recent years, more and more text documents have appeared on the Internet, in digital libraries, in news services and on company local area networks (LANs). To manage this electronic data, people pay increasing attention to the retrieval of textual digital information, and such retrieval is becoming more intelligent. Digital information is no longer closed and fixed as it used to be: it is now open, dynamic, quickly updated and usually distributed. The users of digital information systems have also expanded from professional searchers to general users, including business staff, managers, students and so on. This places a wide range of individual requirements on digital information systems. Personalization and intelligence are therefore new demands on digital information retrieval systems.
A very important characteristic of today's digital information is that much of it is classified in advance, for example in digital library classifications (such as ACM and IEEE) and web directories (such as Yahoo, Google and Sina). However, current digital information retrieval systems seldom use this classification information to improve query accuracy.
Summary of the invention
In view of the above, the object of the present invention is to provide a method and an apparatus that make effective use of classification information and improve query accuracy by estimating keyword weights.
To achieve this object, one aspect of the present invention provides a method for searching for a target document, according to the classification information of layers 1 to M, in a digital data set that is successively subdivided into categories from layer 1 to layer N, where N ≥ 1 and N ≥ M ≥ 1. The method comprises the steps of:
(a) extracting, from the query entered by the user, a keyword sequence comprising at least one keyword as the current keyword sequence;
(b) obtaining from the digital data set the classification information of layer 1 as the current layer;
(c) calculating the discrimination of each keyword in the current keyword sequence with respect to each category indicated by the classification information of the current layer;
(d) calculating, based on the discrimination of each keyword with respect to each category of the current layer, the probability that the target document appears in each category of the current layer;
(e) if the current layer number is less than M, incrementing the current layer number by 1 and returning to step (c); otherwise combining the discriminations calculated in step (c) for each keyword with respect to each category of each layer with the probabilities calculated in step (d) that the target document appears in each category of each layer, to obtain for each keyword in the current keyword sequence an integrated discrimination with respect to the target document over layers 1 to M; and
(f) searching for the target document based on the magnitude of the integrated discrimination.
Another aspect of the present invention provides an apparatus for searching for a target document, according to the classification information of layers 1 to M, in a digital data set that is successively subdivided into categories from layer 1 to layer N, where N ≥ 1 and N ≥ M ≥ 1. The apparatus comprises:
a term extractor, which extracts, from the query entered by the user, a keyword sequence comprising at least one keyword as the current keyword sequence;
a category selection/subdivision module, which obtains from the digital data set the classification information of layer 1 as the current layer;
a discrimination calculator, which calculates the discrimination of each keyword in the current keyword sequence with respect to each category indicated by the classification information of the current layer;
a target document estimator, which calculates, based on the discrimination of each keyword with respect to each category of the current layer, the probability that the target document appears in each category of the current layer;
a discrimination integration module, which combines the discriminations calculated by the discrimination calculator for each keyword with respect to each category of each layer with the probabilities calculated by the target document estimator that the target document appears in each category of each layer, to obtain for each keyword in the current keyword sequence an integrated discrimination with respect to the target document over layers 1 to M;
a search engine, which searches for the target document based on the magnitude of the integrated discrimination; and
a processor which, if the current layer number is less than M, causes the discrimination calculator and the target document estimator to perform their operations on the current layer, and otherwise causes the discrimination integration module and the search engine to perform their operations.
The method and apparatus for searching for a target document according to classification information effectively improve the precision of information retrieval. They make effective use of the auxiliary information contained in the classification of the electronic data set and can therefore estimate relatively accurate keyword weights. Experiments also show that the invention effectively improves query accuracy.
Description of drawings
Fig. 1 is a block diagram of the target document search apparatus according to the preferred embodiment of the invention;
Fig. 2 is a flowchart of the target document search method according to the preferred embodiment of the invention;
Fig. 3 is a schematic flow diagram of a target document search according to the invention;
Fig. 4 is a schematic flow diagram of the discrimination calculator;
Fig. 5 is a schematic flow diagram of the target document estimator.
Embodiment
The preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings. In the following description, steps and units that are well known in existing digital information retrieval methods and systems are not described in detail, so that unnecessary detail does not obscure the invention.
Fig. 1 is a block diagram of the target document search apparatus according to the preferred embodiment of the invention. As shown in Fig. 1, the apparatus searches for a target document, according to the classification information of layers 1 to M, in a digital data set that is successively subdivided into categories from layer 1 to layer N. Here M is the number of layers to be searched, set by the user as required: although the data set is divided into N layers, the user may choose to search only M of them. The apparatus comprises:
a term extractor (TE) 101, which extracts, from the query entered by the user, a keyword sequence comprising at least one keyword as the current keyword sequence;
a term selection module (TSM) 102, which removes noise from the current keyword sequence based on the keyword sequence of the current-layer categories, the corresponding discriminations and the term frequencies, thereby determining the keywords for the categories of the layer below the current layer;
a category selection/subdivision module (CSM) 103, which obtains from the digital data set the classification information of layer 1 as the current layer;
a discrimination calculator (DPC) 104, which calculates the discrimination of each keyword in the current keyword sequence with respect to each category indicated by the classification information of the current layer;
a target document estimator (PRE) 105, which calculates, based on the discrimination of each keyword with respect to each category of the current layer, the probability that the target document appears in each category of the current layer;
a discrimination integration module (DIM) 106, which combines the discriminations calculated by the discrimination calculator for each keyword with respect to each category of each layer with the probabilities calculated by the target document estimator that the target document appears in each category of each layer, to obtain for each keyword in the current keyword sequence an integrated discrimination with respect to the target document over layers 1 to M;
a category locator (CL) 107, which removes noise categories based on the probability of the document appearing in each category of the current layer, thereby determining the classification information for the categories of the layer below the current layer;
a weight combination module (TWC) 108, which combines in the global discrimination; and
a search engine 109, which searches for the target document based on the magnitude of the integrated discrimination.
While the target document is searched for according to the classification information of layers 1 to M, if the current layer number is less than M, the discrimination calculator and the target document estimator perform their operations on the current layer; otherwise the discrimination integration module and the search engine are controlled to perform their operations. By using category-based keyword weight calculation, this apparatus effectively improves query precision.
Fig. 1 merely illustrates a preferred embodiment and does not limit the invention. Those skilled in the art will understand that the main technical effect of the target document search apparatus is to use the auxiliary information contained in the classification of the digital data set to estimate relatively accurate keyword weights, and thereby to improve the precision of information retrieval. The term selection module (TSM) 102 improves precision and reduces response time by removing keyword noise during processing; the invention can therefore also be implemented without it, in which case the discrimination calculator (DPC) 104 receives the keyword sequence directly from the term extractor (TE) 101. Likewise, the category locator (CL) 107 improves query precision and reduces response time by removing category noise during processing, and the weight combination module (TWC) 108 improves query precision and the general applicability of the system by incorporating a global keyword weighting method when keyword weights are calculated. The invention can therefore also be implemented without these two modules, in which case the target document estimator (PRE) 105 does not feed information about removed noise categories back to the category selection/subdivision module (CSM) 103 through the category locator (CL) 107, and the search engine 109 receives the integrated discrimination of the keyword sequence directly from the discrimination integration module (DIM) 106. A keyword may be a word or a phrase.
The document search apparatus of the invention is also compatible with digital data sets that have no classification, and the integration module also combines all the discriminations calculated by the discrimination calculator, which improves the general applicability of the system. The global keyword weight is preferably calculated on a probabilistic basis.
Preferably, the ability of a keyword to distinguish categories is determined according to the following criteria:
(1) It is estimated from the keyword's ability to distinguish between categories.
(2) It is estimated from the differences in the keyword's ability to describe different categories.
(3) It is obtained from the frequency of the keyword within a category while also taking the attributes of the category itself into account.
Fig. 2 is a flowchart of the target document search method according to the preferred embodiment of the invention. As shown in Fig. 2, the method searches for a target document, according to the classification information of layers 1 to M, in a digital data set that is successively subdivided into categories from layer 1 to layer N, where N ≥ 1 and N ≥ M ≥ 1, and comprises the following steps:
extracting, from the query entered by the user, a keyword sequence comprising at least one keyword as the current keyword sequence (S201);
obtaining from the digital data set the classification information of layer 1 as the current layer (S202);
calculating the discrimination of each keyword in the current keyword sequence with respect to each category indicated by the classification information of the current layer (S203);
calculating, based on the discrimination of each keyword with respect to each category of the current layer, the probability that the target document appears in each category of the current layer (S204);
if the current layer number is less than M (S205), taking the next layer as the current layer (S206), removing noise from the current keyword sequence based on the keyword sequence of the current-layer categories, the corresponding discriminations and the term frequencies, thereby determining the keywords for the categories of the layer below the current layer (S207), and removing noise categories based on the probability of the document appearing in each category of the current layer, thereby determining the classification information for the categories of the layer below the current layer (S208);
otherwise, combining the discriminations calculated in step S203 for each keyword with respect to each category of each layer with the probabilities calculated in step S204 that the target document appears in each category of each layer, to obtain for each keyword in the current keyword sequence an integrated discrimination with respect to the target document over layers 1 to M (S209);
combining in the global discrimination (S210); and
searching for the target document based on the magnitude of the integrated discrimination (S211).
Fig. 2 merely illustrates a preferred embodiment and does not limit the invention. Those skilled in the art will understand that the main technical effect of the target document search method is to use the auxiliary information contained in the classification of the digital data set to estimate relatively accurate keyword weights, and thereby to improve the precision of information retrieval. Step S207 improves precision and reduces response time by removing keyword noise during processing; the invention can therefore also be implemented without step S207, proceeding directly from step S206 to step S208. Likewise, step S208 improves query precision and reduces response time by removing category noise during processing, and step S210 improves query precision and the general applicability of the system by incorporating a global keyword weighting method when keyword weights are calculated. The invention can therefore also be implemented without steps S208 and S210, proceeding directly from step S206 to step S203 and directly from step S209 to step S211. A keyword may be a word or a phrase.
The document search method of the invention is also compatible with digital data sets that have no classification, and the integration step also combines all the discriminations calculated in the discrimination calculation, which improves the general applicability of the system. The global keyword weight is preferably calculated on a probabilistic basis.
Preferably, the ability of a keyword to distinguish categories is determined according to the following criteria:
(1) It is estimated from the keyword's ability to distinguish between categories.
(2) It is estimated from the differences in the keyword's ability to describe different categories.
(3) It is obtained from the frequency of the keyword within a category while also taking the attributes of the category itself into account.
Fig. 3 is a schematic flow diagram of a target document search according to the invention. The apparatus and the method of the invention are described together below with reference to Fig. 3.
First, the user enters a query that fully expresses the user's search intent. In this system the query may be a few words, a sentence, a descriptive paragraph or even a whole article.
The term extraction module of the system first extracts terms from the user's query, yielding a sequence of query keywords:
$$T = (t_1, t_2, \ldots, t_m)$$
The data sets queried here have the following feature: the data are divided into several categories, each category can be divided into several subcategories, and these subcategories can be divided again in turn, and so on.
We first use the CSM module to select the first-layer categories:
$$C = (c_1, c_2, \ldots, c_n)$$
Each keyword in the vector T has a different ability to distinguish each document (here we call a word's ability to distinguish documents the weight of the word). Estimating these keyword weights is a key issue for a query system. This document implements a keyword weight estimation system based on multi-layer classification, in which the final weight of a word is approached step by step through successive subdivision of the classified data.
From the vectors T and C obtained above, we use the DPC to calculate the ability of each keyword in T to distinguish the categories in C, which yields a corresponding discrimination vector:
$$DP = (dp_1, dp_2, \ldots, dp_m)$$
We then pass the discrimination vector DP to the TSM module, which filters out noise keywords. This gives a new keyword vector and the corresponding sequence of keyword discriminations with respect to the categories C:
$$T = (t_1, t_2, \ldots, t_m), \quad DP = (dp_1, dp_2, \ldots, dp_m)$$
Next we pass the new keyword vector and its discrimination sequence to the PRE module, which estimates the likelihood (PC) that the target document of the user's query lies in each category:
$$PC = (pc_1, pc_2, \ldots, pc_n)$$
In practice, the document the user wants often belongs to a category $c_k$, but some query terms in the user's query statement point mistakenly to other categories. We call those categories noise categories.
To avoid interference from noise categories, we use the CL to delete them, which gives a new category vector:
$$C = (c_1, c_2, \ldots, c_q)$$
For each category in C, we use the CSM module to subdivide it further. For example, for $c_k \in C$ we obtain a new category vector:
$$C_k = (c_{k1}, c_{k2}, \ldots, c_{ku})$$
As with the upper-layer categories described above, we pass the keyword sequence and the classification information to the DPC module to calculate the discrimination of each keyword at this layer, use the PRE module to estimate the likelihood of the target document in each category, and then use the CL to select the target categories. If necessary, we continue to subdivide the categories and run another round of calculation. For different data sets and different precision requirements, the depth to which categories are processed can be configured.
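The layer-by-layer procedure described above can be sketched in Python as a simple loop. This is only an illustration of the control flow: the callables `dpc`, `pre`, `cl` and `csm` are hypothetical stand-ins for the DPC, PRE, CL and CSM modules, their signatures are assumptions made for this sketch, and the optional TSM keyword filter is omitted.

```python
def layered_discrimination(keywords, first_layer, max_layers, dpc, pre, cl, csm):
    """Collect per-layer discriminations and category likelihoods for layers 1..M.

    All four callables are hypothetical placeholders:
      dpc(keywords, categories)     -> list of discriminations, one per keyword
      pre(keywords, dp, categories) -> list of likelihoods, one per category
      cl(categories, pc)            -> categories kept after removing noise categories
      csm(category)                 -> list of subcategories (empty if none)
    """
    history = []
    categories = list(first_layer)
    for layer in range(1, max_layers + 1):
        dp = dpc(keywords, categories)
        pc = pre(keywords, dp, categories)
        history.append({"layer": layer, "categories": categories, "dp": dp, "pc": pc})
        if layer == max_layers:
            break
        # Drop noise categories, then subdivide the remaining ones.
        kept = cl(categories, pc)
        categories = [child for c in kept for child in csm(c)]
        if not categories:
            break  # nothing left to subdivide
    return history
```

The returned list holds, for every processed layer, the values that the DIM module would later combine.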
At this point the discrimination calculation is complete. We have the keyword sequence of the user's query, the discrimination of each keyword at every classification layer, and the likelihood of the target document in each category.
These data are the input of the DIM module, which calculates the final keyword discrimination from them.
This discrimination separates categories very well, and with it we can easily locate the category to which the target document belongs. It has one shortcoming, however: if a word is a common word (with a very high frequency of occurrence) in the articles of the target category, it is hard to pick out the document the user wants within that category by means of this word. We therefore also use another, statistics-based keyword weighting method to obtain the final keyword weight.
We use a TF*IDF weight calculator to calculate the TF*IDF-based weight of a keyword:
$$w_i' = \log\left(\frac{k \cdot N}{n_t} + 1\right)$$
where N is the total number of documents and $n_t$ is the number of documents that contain the keyword t.
This formula is a variant of the well-known Robertson/Sparck Jones formula.
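As a minimal sketch, the statistics-based weight can be computed as below. The code assumes the formula reads $w' = \log(k \cdot N / n_t + 1)$; the function name, the default `k = 1.0` and the error handling are assumptions made for illustration and are not taken from the patent.

```python
import math


def statistical_weight(total_docs, docs_with_term, k=1.0):
    """Statistics-based keyword weight, assuming w' = log(k * N / n_t + 1).

    total_docs      N, the total number of documents
    docs_with_term  n_t, the number of documents containing the keyword
    k               scaling constant (its value is not given in the text)
    """
    if docs_with_term <= 0:
        raise ValueError("the keyword must occur in at least one document")
    return math.log(k * total_docs / docs_with_term + 1)
```

Under this reading, a keyword that occurs in few documents receives a larger weight than one that occurs in most documents, and the weight is always positive.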
We feed the TF*IDF-based weight and the overall discrimination obtained earlier into the TWC module to obtain the final keyword weight.
The components and steps of the apparatus and method of the invention are analysed below and explained with examples.
Term extractor (TE):
The term extractor extracts a keyword sequence from the query entered by the user. Its input is the user's query and its output is a keyword sequence.
Processing procedure:
(1) Segment the user query into words.
(2) Apply a preliminary filter to the segmented words by part of speech. This step removes words that carry little meaning, such as measure words and numerals.
(3) Use a stop-word list to remove noise words such as "use" and "effect".
(4) Noise reduction. This step removes words that may lead to wrong results. A threshold (ts) is used to filter words: any word whose frequency of occurrence is below this threshold is filtered out.
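The four steps above can be sketched in Python as follows. This is an illustration under stated assumptions: word segmentation and part-of-speech tagging (step 1) are assumed to be done by an external tool that yields `(word, pos)` pairs, and the tag names `"n"`, `"v"`, `"a"`, the stop-word set and the frequency table are hypothetical inputs.

```python
def extract_keywords(tagged_tokens, stop_words, corpus_freq, ts, keep_pos=("n", "v", "a")):
    """Filter segmented query tokens down to a keyword sequence.

    tagged_tokens  list of (word, pos) pairs from a segmenter (step 1, done upstream)
    stop_words     set of words to discard (step 3)
    corpus_freq    dict mapping a word to its frequency of occurrence (step 4)
    ts             frequency threshold; rarer words are treated as noise
    keep_pos       part-of-speech tags to keep (assumed: noun, verb, adjective)
    """
    # Step 2: keep only the selected parts of speech.
    kept = [word for word, pos in tagged_tokens if pos in keep_pos]
    # Step 3: remove stop words.
    kept = [word for word in kept if word not in stop_words]
    # Step 4: remove words whose frequency is below the threshold ts.
    kept = [word for word in kept if corpus_freq.get(word, 0) >= ts]
    return kept
```

The frequency filter in step 4 is what removes rare misspellings, at the cost of also dropping genuinely rare terms when `ts` is set too high.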
For example, the user enters the following query, in which two words are written wrongly (shown in square brackets as they were rendered):
"Apparatus for introducing sunlight into a room: the present invention [neat regular script] an apparatus for introducing sunlight into a room that can automatically track the sun and always collect sunlight with the maximum area of the light collector, and that always transmits the light at a fixed angle to a fixed position, where it enters the room and is dispersed. The present invention [all has] good prospects."
The query contains two wrongly written words: "neat regular script" (intended: "discloses") and "all has" (intended: "has"). The first error is uncommon, but the second is fairly common.
After the first step we obtain the following word sequence:
sunlight, into, room, device, this, invention, neat regular script, a kind of, can, automatically, track, and, always, with, light collector, maximum, area, collect, will, light, fixed, angle, transmit, to, position, enter, indoor, disperse, all has, good, prospect
In the second step we select only nouns, verbs and adjectives, which gives the following sequence:
sunlight, into, room, device, invention, neat regular script, can, automatically, track, always, with, light collector, maximum, area, collect, will, light, fixed, angle, transmit, to, position, enter, indoor, disperse, all has, prospect
In the third step we find that "can", "with" and "will" are in the stop-word list, so we filter them out:
sunlight, into, room, device, invention, neat regular script, automatically, track, always, light collector, maximum, area, collect, light, fixed, angle, transmit, position, enter, indoor, disperse, all has, prospect
In the fourth step, the frequency of occurrence of the wrongly written word "neat regular script" is too low, so we filter it out and obtain the final keyword sequence:
sunlight, into, room, device, invention, automatically, track, always, light collector, maximum, area, collect, light, fixed, angle, transmit, position, enter, indoor, disperse, all has, prospect.
Category selection/subdivision module (CSM)
This module obtains classification information from the data set. Its input may be empty or a category, and its output is a set of categories. The module also decides whether the input category needs further subdivision, so the CSM differs from data set to data set.
Processing procedure:
If the input is empty:
  Determine whether the data set is classified. If it is:
    Read the configuration and determine whether discrimination information is to be used.
    Output the first-layer classification information.
  Otherwise, tell the system to finish the discrimination calculation.
Otherwise the input of this module is a category:
  Determine whether this category has subcategories. If it has:
    Read the configuration and determine whether further subdivision is needed.
    Output the classification information of the next layer.
  Otherwise, tell the system to finish the discrimination calculation.
For example, the data set is a patent data set based on the IPC, and the system is configured to calculate discrimination down to the third layer.
The initial input is then N/A, and the output is the first layer of the IPC:
Section A: Human necessities
Section B: Performing operations; transporting
Section C: Chemistry; metallurgy
Section D: Textiles; paper
Section E: Fixed constructions
Section F: Mechanical engineering; lighting; heating; weapons; blasting
Section G: Physics
Section H: Electricity
1) If the input is "Section H: Electricity", the output is:
H01 Basic electric elements
H02 Generation, conversion or distribution of electric power
H03 Basic electronic circuitry
H04 Electric communication technique
H05 Electric techniques not otherwise provided for
2) If the input of this module is "H01C Resistors", this category is at the third layer, so the output is N/A. This output indicates that there are no further categories, and when the system receives it, it finishes the discrimination calculation.
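The behaviour in this example can be sketched in Python with a nested dictionary standing in for the classified data set. The dictionary layout, the function name `csm`, the `path` argument and the use of `None` for the N/A output are assumptions made for this sketch; only the category codes come from the example above.

```python
# Partial IPC tree for illustration; only the branch used in the example is filled in.
IPC_SAMPLE = {
    "A": {}, "B": {}, "C": {}, "D": {}, "E": {}, "F": {}, "G": {},
    "H": {"H01": {"H01C": {}}, "H02": {}, "H03": {}, "H04": {}, "H05": {}},
}


def csm(taxonomy, path=None, max_depth=3):
    """Return the categories of the next layer, or None (the N/A output).

    taxonomy   nested dict: category code -> dict of subcategories
    path       codes from layer 1 down to the input category; empty for no input
    max_depth  configured number of layers for the discrimination calculation
    """
    path = list(path or [])
    if len(path) >= max_depth:
        return None  # configured depth reached: no further subdivision
    node = taxonomy
    for code in path:
        node = node[code]
    children = sorted(node)
    return children or None  # None also covers unclassified data and leaf categories
```

With an empty input the sketch returns the eight IPC sections; with the path `["H"]` it returns `H01` to `H05`; and with the third-layer path `["H", "H01", "H01C"]` it returns `None`, which corresponds to the N/A output that ends the discrimination calculation.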
Discrimination counter (DPC)
The input of this module is keyword sequence and sorting sequence, and output is the discrimination value.
Every piece of document all has the characteristic of himself, and these characteristics all are by speech greatly---and the most basic semantic expressiveness unit shows.In other words, each word all more or less must show the characteristic of document.Whether this word occurs in this piece document, and the frequency of appearance is how, position of appearance or the like, and these all information can both be described the attribute of document.
Each keyword t in T has a different descriptive power for each class; we denote this descriptive power by Pt. For all $t_i \in T$ and all $c_j \in C$, we obtain a Pt matrix:

$$
P = \begin{array}{c|cccccc}
 & t_1 & t_2 & \cdots & t_i & \cdots & t_m \\ \hline
c_1 & p_{11} & p_{21} & \cdots & p_{i1} & \cdots & p_{m1} \\
c_2 & p_{12} & p_{22} & \cdots & p_{i2} & \cdots & p_{m2} \\
\vdots & \vdots & \vdots & & \vdots & & \vdots \\
c_j & p_{1j} & p_{2j} & \cdots & p_{ij} & \cdots & p_{mj} \\
\vdots & \vdots & \vdots & & \vdots & & \vdots \\
c_n & p_{1n} & p_{2n} & \cdots & p_{in} & \cdots & p_{mn}
\end{array}
$$

Each entry is computed as

$$p_{ij} = \frac{f_{ij}}{K_j + f_{ij}}$$

where:

$$f_{ij} = \frac{n_{t_i}}{l_j}$$

$$K_j = k \cdot \left( (1 - b) + b \cdot \frac{l_j}{l_{ave}} \right)$$

$n_{t_i}$ is the number of times $t_i$ occurs in $c_j$;
$l_j$ is the length of $c_j$ (the number of all words in $c_j$);
$l_{ave} = \frac{\sum_{j=1}^{n} l_j}{n}$ is the average class length.
Here, $f_{ij}$ is the frequency of occurrence of a word in a class; it indicates how often the word occurs in that class. If a word leans toward a class, this value is comparatively high. Because the relation between this frequency and the class length must also be considered, the calculation takes both the class length and the average class length into account. If a word has the same frequency of occurrence in every class, then the shorter a class is, the stronger the word's descriptive power for that class. The parameters k and b are factors that adjust the influence of class length and word frequency.
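A minimal sketch of this calculation is given below, assuming the formula reads $p_{ij} = f_{ij} / (K_j + f_{ij})$ with $l_{ave}$ taken as the mean class length. The function name `descriptive_power`, the keyword-major layout of its result, and the toy counts are assumptions made for the illustration; they do not come from the patent.

```python
def descriptive_power(counts, lengths, k=2.0, b=0.7):
    """Descriptive power of each keyword for each class.

    counts[i][j] -- number of times keyword t_i occurs in class c_j
    lengths[j]   -- number of words in class c_j
    Returns P with P[i][j] = f_ij / (K_j + f_ij).
    """
    n = len(lengths)
    l_ave = sum(lengths) / n  # average class length
    # Length-dependent factor K_j for every class
    K = [k * ((1 - b) + b * l / l_ave) for l in lengths]
    P = []
    for row in counts:
        freqs = [c / lengths[j] for j, c in enumerate(row)]  # f_ij
        P.append([f / (K[j] + f) for j, f in enumerate(freqs)])
    return P


# Toy data: one keyword with the same frequency (0.1) in a short and a long class.
P = descriptive_power(counts=[[10, 30]], lengths=[100, 300])
print(P[0])  # the shorter class receives the larger value
```

With equal frequencies, the only difference between the two entries is $K_j$, which grows with class length, so the shorter class is described more strongly, as the paragraph above states.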
For each keyword we obtain a vector (a column of the matrix) that describes the keyword's descriptive power for the classes. It is defined as follows:

$$PT_i = \left( \frac{p_{i1}}{SPT}, \frac{p_{i2}}{SPT}, \ldots, \frac{p_{in}}{SPT} \right) = (pt_1, pt_2, \ldots, pt_n)$$

where:

$$SPT = \sum_{j=1}^{n} p_{ij}$$

From the vector $PT_i$ we can estimate the discrimination of the word at the current level of the classification. We calculate this discrimination with the following formula:

$$dp_i = \frac{n \sum_{k=1}^{n} pt_k^2 - 1}{n - 1}$$

where n is the number of classes.
In this formula, we assume that keywords are uniformly distributed within each class and that the target document is equally likely to fall into any of the classes.
The dp value reflects how unevenly a word is distributed across the classes. If a word is concentrated in one or a few classes, dp takes a relatively high value, which means the word pulls the whole query toward those classes. (When a word occurs in only one class, dp takes its maximum value of 1.) On the other hand, if a word occurs in every class, dp takes a lower value. (When the word is distributed identically across all classes, dp takes its minimum value of 0.)
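The two extreme values can be checked directly. The short derivation below assumes the discrimination formula is read as $dp = (n \sum_k pt_k^2 - 1)/(n - 1)$ over a normalized vector $(pt_1, \ldots, pt_n)$ with $n \ge 2$; this reading is a reconstruction, chosen because it reproduces the stated maximum and minimum.

```latex
% Word occurs in a single class: pt = (1, 0, \ldots, 0)
\sum_{k=1}^{n} pt_k^2 = 1
\quad\Rightarrow\quad
dp = \frac{n \cdot 1 - 1}{n - 1} = 1

% Word distributed identically over all classes: pt_k = 1/n
\sum_{k=1}^{n} pt_k^2 = n \cdot \frac{1}{n^2} = \frac{1}{n}
\quad\Rightarrow\quad
dp = \frac{n \cdot \frac{1}{n} - 1}{n - 1} = 0
```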
Fig. 4 shows the flowchart of the discrimination calculator. For example, let the input keyword sequence be T = (t1, t2, t3) and the class sequence be C = (c1, c2, c3).
From the data set we obtain the total numbers of words in c1, c2 and c3: 2000, 3000 and 5000 respectively. The parameters are initialized as k = 2 and b = 0.7.
Querying the database gives:
t1 occurs 100 times in c1, 300 times in c2, and 200 times in c3.
t2 occurs 70 times in c1, 500 times in c2, and 1000 times in c3.
t3 occurs 200 times in c1, 300 times in c2, and 500 times in c3.
We obtain the following matrix:

$$
P = \begin{array}{c|ccc}
 & t_1 & t_2 & t_3 \\ \hline
c_1 & 0.0583 & 0.0383 & 0.102 \\
c_2 & 0.0893 & 0.140 & 0.0893 \\
c_3 & 0.0299 & 0.133 & 0.0714
\end{array}
$$

$PT_1 = (0.328, 0.503, 0.168)$
$PT_2 = (0.123, 0.450, 0.427)$
$PT_3 = (0.388, 0.340, 0.272)$
$dp_1 = 0.083$
$dp_2 = 0.100$
$dp_3 = 0.0102$
Normalizing these values so that they sum to 1 gives the discrimination sequence:
$DP = (0.430, 0.518, 0.0528)$
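The steps from the matrix P to the sequence DP can be replayed with a few lines of code. This is a sketch, not the patented implementation: the helper names `normalize` and `discrimination` are assumptions, the discrimination formula is read as $(n \sum_k pt_k^2 - 1)/(n - 1)$, and the final step assumes DP is the dp vector scaled to sum to 1. Because the example rounds its intermediate values, the computed numbers agree with the printed ones only to about two decimal places.

```python
def normalize(values):
    """Scale a list of non-negative numbers so that it sums to 1."""
    total = sum(values)
    return [v / total for v in values]


def discrimination(pt):
    """dp = (n * sum(pt_k^2) - 1) / (n - 1) for a normalized vector pt (needs n >= 2)."""
    n = len(pt)
    return (n * sum(x * x for x in pt) - 1) / (n - 1)


# Matrix from the example: rows are classes c1..c3, columns are keywords t1..t3.
P = [
    [0.0583, 0.0383, 0.102],
    [0.0893, 0.140, 0.0893],
    [0.0299, 0.133, 0.0714],
]

columns = [list(col) for col in zip(*P)]  # one column per keyword
PT = [normalize(col) for col in columns]  # PT_1, PT_2, PT_3
dp = [discrimination(pt) for pt in PT]    # dp_1, dp_2, dp_3
DP = normalize(dp)                        # discrimination sequence

print([[round(x, 3) for x in pt] for pt in PT])
print([round(x, 4) for x in dp])
print([round(x, 3) for x in DP])
```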
Keyword selection module (TSM)
The function of this module is to remove noise from the keywords. The input of this module is a keyword sequence with the corresponding discrimination values and word frequencies; the output is the selected keyword sequence and the corresponding discrimination values.
In this module we delete the keywords whose discrimination is less than the threshold Max(dp) × parameter and whose df is at the same time less than a given threshold, where parameter is a preset parameter.
After these filtering steps, the module outputs the new keyword sequence and the corresponding discrimination values:
$T_u = (t_1, t_2, \ldots, t_p)$
$DP = (dp_1, dp_2, \ldots, dp_p)$
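The filtering rule can be sketched as below. The sketch assumes that a keyword is dropped only when both conditions hold at once (low discrimination and low frequency), which is one reading of the description; the function name `select_keywords`, the parameter names `ratio` and `df_min`, and the frequency values are hypothetical.

```python
def select_keywords(keywords, dp, df, ratio=0.2, df_min=5):
    """Drop keywords whose discrimination AND frequency are both below threshold.

    keywords -- keyword sequence
    dp       -- discrimination value of each keyword
    df       -- frequency of each keyword
    ratio    -- preset parameter; the dp threshold is max(dp) * ratio
    df_min   -- given frequency threshold
    """
    dp_threshold = max(dp) * ratio
    kept = [
        (t, d)
        for t, d, f in zip(keywords, dp, df)
        if not (d < dp_threshold and f < df_min)
    ]
    return [t for t, _ in kept], [d for _, d in kept]


# Hypothetical frequencies: t3 has low discrimination and is also rare, so it is removed.
T_u, DP_u = select_keywords(["t1", "t2", "t3"], [0.430, 0.518, 0.0528], [600, 1570, 3])
print(T_u, DP_u)
```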
Target document estimator (PRE)
The function of this module is to estimate the probability that the target document lies in a given class. The input of this module is a keyword sequence, the corresponding discrimination values, and the classification information; the output is the probability (pq) of the target document in each class.
Fig. 5 shows a schematic flowchart of the target document estimator.
From the input keyword sequence and classification information, we obtain a Pt matrix in the same way as in the DPC module:

$$
P = \begin{array}{c|cccccc}
 & t_1 & t_2 & \cdots & t_i & \cdots & t_p \\ \hline
c_1 & p_{11} & p_{21} & \cdots & p_{i1} & \cdots & p_{p1} \\
c_2 & p_{12} & p_{22} & \cdots & p_{i2} & \cdots & p_{p2} \\
\vdots & \vdots & \vdots & & \vdots & & \vdots \\
c_j & p_{1j} & p_{2j} & \cdots & p_{ij} & \cdots & p_{pj} \\
\vdots & \vdots & \vdots & & \vdots & & \vdots \\
c_n & p_{1n} & p_{2n} & \cdots & p_{in} & \cdots & p_{pn}
\end{array}
$$

For each class, we take the corresponding row of the matrix and define:

$$PC_j = \left( \frac{p_{1j}}{SPC}, \frac{p_{2j}}{SPC}, \ldots, \frac{p_{pj}}{SPC} \right) = (pc_1, pc_2, \ldots, pc_p)$$

where:

$$SPC = \sum_{i=1}^{p} p_{ij}$$

Using the discriminations, we can estimate the probability that the target document belongs to a given class. For $c_j \in C$, define:

$$pq_j = PC'_j \cdot DP$$

where $PC'_j$ is the transpose of $PC_j$, and $DP = (dp_1, dp_2, \ldots, dp_p)$ is the input discrimination sequence.
For example, suppose we obtain the following matrix from the input:

$$
P = \begin{array}{c|ccc}
 & t_1 & t_2 & t_3 \\ \hline
c_1 & 0.0583 & 0.0383 & 0.102 \\
c_2 & 0.0893 & 0.140 & 0.0893 \\
c_3 & 0.0299 & 0.133 & 0.0714
\end{array}
$$

$DP = (0.430, 0.518, 0.0528)$
We then obtain:
$PC_1 = (0.294, 0.193, 0.514)$
$PC_2 = (0.280, 0.439, 0.280)$
$PC_3 = (0.128, 0.568, 0.305)$
$pq_1 = PC'_1 \cdot DP = 0.254$
$pq_2 = PC'_2 \cdot DP = 0.363$
$pq_3 = PC'_3 \cdot DP = 0.365$
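The estimator reduces to a row normalization followed by a dot product, which the sketch below replays on the example data. The function name `estimate_target_class` is an assumption, and the matrix entry for t2 in c1 is taken as 0.0383, the value used in the discrimination calculator example. The example rounds its intermediate values, so the computed probabilities match the printed ones only to roughly three decimal places.

```python
def estimate_target_class(P, DP):
    """Probability pq_j of the target document for every class c_j.

    P  -- P[j][i] is the descriptive power of keyword t_i for class c_j (rows = classes)
    DP -- discrimination sequence, one value per keyword
    """
    pq = []
    for row in P:
        spc = sum(row)                 # SPC for this class
        pc = [p / spc for p in row]    # normalized row PC_j
        pq.append(sum(a * b for a, b in zip(pc, DP)))  # dot product PC_j . DP
    return pq


P = [
    [0.0583, 0.0383, 0.102],
    [0.0893, 0.140, 0.0893],
    [0.0299, 0.133, 0.0714],
]
DP = [0.430, 0.518, 0.0528]

pq = estimate_target_class(P, DP)
print([round(x, 3) for x in pq])
```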
Classification locator (CL)
Sometimes the query entered by the user should belong to class $c_k$, but some of the keywords extracted from it lean more toward another class $c_q$. We call such a $c_q$ a noise class.
To improve the hit rate of the final query, we add a filtering step that removes these noise classes. We remove the classes whose probability value is less than $tc_k = Max(PQ'_k) \times (1 - pq_{1k}) \times k$, where $Max(PQ'_k)$ is the maximum value and $pq_{1k}$ is the probability of the target document being in class $C_k$, a value calculated in the upper-layer classification. If the current layer is the first layer, then pq = 1/n, where n is the number of classes in the current classification.
The input of this module is the PQ sequence and the class sequence; the output is the selected classes:
$C = (c_1, c_2, \ldots, c_q)$
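A minimal sketch of this filter follows. It assumes that Max(PQ') is the largest probability among the classes being filtered, that the probability from the upper layer is passed in as `pq_parent`, and that the factor k is a free tuning parameter; the function name `locate_classes` and the values of k are hypothetical.

```python
def locate_classes(classes, pq, pq_parent=None, k=1.0):
    """Keep the classes whose probability reaches the threshold tc.

    classes   -- class sequence of the current layer
    pq        -- probability of the target document in each of these classes
    pq_parent -- probability of the parent class from the upper layer;
                 None means the current layer is the first layer (1/n is used)
    k         -- tuning factor in the threshold (assumed to be a preset parameter)
    """
    n = len(classes)
    if pq_parent is None:
        pq_parent = 1.0 / n
    threshold = max(pq) * (1.0 - pq_parent) * k
    return [c for c, p in zip(classes, pq) if p >= threshold]


classes = ["c1", "c2", "c3"]
pq = [0.254, 0.363, 0.365]
print(locate_classes(classes, pq))         # threshold about 0.243, all classes kept
print(locate_classes(classes, pq, k=1.2))  # threshold about 0.292, c1 is removed
```

A confident estimate from the upper layer (large `pq_parent`) lowers the threshold, so fewer subclasses are discarded as noise.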
Discrimination integration module (DIM)
The function of the DIM module is to integrate the discriminations of the different layers. The input of this module is the discriminations and the probabilities that the target document belongs to each class; the output is the final integrated discrimination.

$$tw_i = k_1 \cdot dp_i + k_2 \cdot \sum_{t=1}^{u} \left( \left( dp_{ti} + k_3 \cdot \sum_{u=1}^{r} \left( \left( dp_{tui} + \cdots \right) \cdot pq_{tu} \right) \right) \cdot pq_t \right)$$

where $k_1$, $k_2$ and $k_3$ are given parameters.
$tw_i$ is the weight of the i-th keyword.
$dp_i$ is the discrimination of the i-th keyword on the first-layer classification.
$dp_{ti}$ is the discrimination of the i-th keyword on the subclasses of $C_t$.
$pq_t$ is the probability that the target document belongs to class $C_t$.
$dp_{tui}$ is the discrimination of the i-th keyword on a subclass of $C_t$.
$pq_{tu}$ is the probability that the target document belongs to a subclass of class $C_t$.
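The nested sum in this formula maps naturally onto a recursion over the class tree, sketched below for a single keyword. The `Node` structure, the function names, and all numbers are hypothetical. The sketch also assumes that layers deeper than the third, which the formula only hints at with an ellipsis, would use further coefficients from the same list `ks`; when the list runs out, deeper layers are ignored.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """Discrimination of one keyword on the classification below one class."""
    dp: float
    # (pq, child): probability of the target document in the subclass, and its node
    children: List[Tuple[float, "Node"]] = field(default_factory=list)


def _inner(node: Node, ks: List[float], level: int) -> float:
    """Bracketed term: dp + k_(level+1) * sum(pq * inner(child))."""
    if not node.children or level >= len(ks):
        return node.dp
    nested = sum(pq * _inner(child, ks, level + 1) for pq, child in node.children)
    return node.dp + ks[level] * nested


def integrated_weight(root: Node, ks: List[float]) -> float:
    """tw = k1 * dp_root + k2 * sum_t(pq_t * (dp_t + k3 * ...))."""
    k1, k2 = ks[0], ks[1]
    return k1 * root.dp + k2 * sum(pq * _inner(child, ks, 2) for pq, child in root.children)


# Hypothetical three-layer tree for one keyword.
tree = Node(0.5, [
    (0.6, Node(0.4, [(0.5, Node(0.2)), (0.5, Node(0.1))])),
    (0.4, Node(0.3)),
])
print(integrated_weight(tree, ks=[1.0, 0.5, 0.25]))
```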
Weight combination module (TWC)
As stated above, discrimination makes it easy to locate the class of the target document, but it also has a shortcoming: when a keyword is very common within the target class, that keyword can no longer be used for further differentiation. We therefore need to combine this method with other global weight calculation methods. The input of this module is the keyword sequence, the corresponding discrimination values, and the weights calculated by a global weighting method. The output is the weight of each keyword.
We use the following formula:

$$tw^{f}_{i} = \left( tw_i^{\alpha} \cdot w_i^{\beta} \right)^{\delta}$$
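A small sketch of the combination step is shown below. It assumes the formula is read with α, β and δ as exponents, that is, final weight = (tw^α · w^β)^δ, which is a reconstruction of the printed formula; the function name `combine_weight` and the default parameter values are hypothetical.

```python
def combine_weight(tw, w, alpha=1.0, beta=1.0, delta=0.5):
    """Combine the integrated discrimination tw with a global weight w.

    With alpha = beta = 1 and delta = 0.5 this is the geometric mean of the two.
    """
    return (tw ** alpha * w ** beta) ** delta


# A keyword with integrated discrimination 0.25 and global weight 4.0
print(combine_weight(0.25, 4.0))
```

Because the two weights are multiplied, a keyword needs a reasonable value on both measures to receive a high final weight.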
The preferred embodiments of the present invention have been described above with reference to the accompanying drawings, but the present invention is not limited to these specific embodiments. Various changes may be made to them without departing from the spirit and scope of the claims.

Claims (22)

1. A method for searching for a target document in a digitized data set that is subdivided layer by layer from layer 1 to layer N, the search proceeding successively according to the classification information of layer 1 to layer M, where N ≥ 1 and N ≥ M ≥ 1, comprising the steps of:
(a) extracting, from a query entered by a user, a keyword sequence comprising at least one keyword as the current keyword sequence;
(b) obtaining from the digitized data set the classification information corresponding to layer 1, which serves as the current layer;
(c) calculating the discrimination of each keyword in the current keyword sequence for each class in the classification indicated by the classification information of the current layer;
(d) calculating, based on the discrimination of each of the keywords for each class indicated by the classification information of the current layer, the probability that the target document appears in each class of the current layer;
(e) if the current layer number is less than M, taking the next layer as the current layer and performing steps (c), (d) and (e); otherwise, combining the discriminations of each keyword for each class of each layer calculated in step (c) with the probabilities of the target document appearing in each class of each layer calculated in step (d), thereby obtaining, for each keyword in the current keyword sequence, an integrated discrimination for the target document over layer 1 to layer M; and
(f) searching for the target document based on the magnitude of the integrated discrimination.
2. The document search method of claim 1, wherein the classification is also compatible with a digitized data set that has no classification.
3. The document search method of claim 2, wherein step (e) further comprises:
combining all of the discriminations calculated in step (c).
4. The document search method of claim 3, wherein step (e) also combines a global discrimination.
5. The document search method of claim 4, wherein the discrimination is estimated according to the keyword's ability to distinguish between classes.
6. The document search method of claim 5, wherein the keyword's ability to distinguish between classes is estimated according to the differences in the keyword's descriptive power for different classes.
7. The document search method of claim 6, wherein the keyword's descriptive power for different classes is obtained from the frequency of occurrence of the keyword in each class while also taking into account the attributes of the class itself.
8. The document search method of claim 1, further comprising, before the next layer is taken as the current layer in step (e), the step of:
(g) removing noise from the current keyword sequence based on the keyword sequence of the current-layer classification, its corresponding discriminations, and the word frequencies, thereby determining the keywords for the classification of the layer following the current layer.
9. The document search method of claim 4, wherein the method of calculating the global discrimination is probability-based.
10. The document search method of claim 1, further comprising, before the next layer is taken as the current layer in step (e), the step of:
(h) removing noise classes based on the probabilities of the target document appearing in each class of the current-layer classification, thereby determining the classification information for the classification of the layer following the current layer.
11. The document search method of any one of claims 1 to 10, wherein a keyword is a word or a phrase.
12. A device for searching for a target document in a digitized data set that is subdivided layer by layer from layer 1 to layer N, the search proceeding successively according to the classification information of layer 1 to layer M, where N ≥ 1 and N ≥ M ≥ 1, comprising:
a keyword extractor, which extracts, from a query entered by a user, a keyword sequence comprising at least one keyword as the current keyword sequence;
a classification selection/subdivision module, which obtains from the digitized data set the classification information corresponding to layer 1, which serves as the current layer;
a discrimination calculator, which calculates the discrimination of each keyword in the current keyword sequence for each class in the classification indicated by the classification information of the current layer;
a target document estimator, which calculates, based on the discrimination of each of the keywords for each class indicated by the classification information of the current layer, the probability that the target document appears in each class of the current layer;
a discrimination integration module, which combines the discriminations of each keyword for each class of each layer calculated by the discrimination calculator with the probabilities of the target document appearing in each class of each layer calculated by the target document estimator, thereby obtaining, for each keyword in the current keyword sequence, an integrated discrimination for the target document over layer 1 to layer M; and
a search engine, which searches for the target document based on the magnitude of the integrated discrimination;
wherein, in the process of searching for the target document according to the classification information of layer 1 to layer M, if the current layer number is less than M, the discrimination calculator and the target document estimator perform said operations on the current layer; otherwise, the discrimination integration module and the search engine are controlled to perform said operations.
13. The document search device of claim 12, wherein the classification is also compatible with a digitized data set that has no classification.
14. The document search device of claim 13, wherein the discrimination integration module also combines all of the discriminations calculated by the discrimination calculator.
15. The document search device of claim 14, further comprising:
a weight combination module, which combines a global discrimination.
16. The document search device of claim 15, wherein the discrimination is estimated according to the keyword's ability to distinguish between classes.
17. The document search device of claim 16, wherein the keyword's ability to distinguish between classes is estimated according to the differences in the keyword's descriptive power for different classes.
18. The document search device of claim 17, wherein the keyword's descriptive power for different classes is obtained from the frequency of occurrence of the keyword in each class while also taking into account the attributes of the class itself.
19. The document search device of claim 12, further comprising:
a keyword selection module, which removes noise from the current keyword sequence based on the keyword sequence of the current-layer classification, its corresponding discriminations, and the word frequencies, thereby determining the keywords for the classification of the layer following the current layer.
20. The document search device of claim 15, wherein the global discrimination is calculated based on probability.
21. The document search device of claim 12, further comprising:
a classification locator, which removes noise classes based on the probabilities of the target document appearing in each class of the current-layer classification, thereby determining the classification information for the classification of the layer following the current layer.
22. The document search device of any one of claims 12 to 21, wherein a keyword is a word or a phrase.
CNB2005100229632A 2005-12-19 2005-12-19 Method and device for digital data central searching target file according to classified information Expired - Fee Related CN100419753C (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CNB2005100229632A CN100419753C (en) 2005-12-19 2005-12-19 Method and device for digital data central searching target file according to classified information
JP2006340412A JP2007172616A (en) 2005-12-19 2006-12-18 Document search method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNB2005100229632A CN100419753C (en) 2005-12-19 2005-12-19 Method and device for digital data central searching target file according to classified information

Publications (2)

Publication Number Publication Date
CN1987849A CN1987849A (en) 2007-06-27
CN100419753C true CN100419753C (en) 2008-09-17

Family

ID=38184648

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2005100229632A Expired - Fee Related CN100419753C (en) 2005-12-19 2005-12-19 Method and device for digital data central searching target file according to classified information

Country Status (2)

Country Link
JP (1) JP2007172616A (en)
CN (1) CN100419753C (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107122980B (en) * 2011-01-25 2021-08-27 阿里巴巴集团控股有限公司 Method and device for identifying categories to which commodities belong
JP6237334B2 (en) * 2014-02-27 2017-11-29 富士通株式会社 Query generation method, query generation program, and query generation apparatus
CN109145108A (en) * 2017-06-16 2019-01-04 贵州小爱机器人科技有限公司 Classifier training method, classification method, device and computer equipment is laminated in text

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH11134364A (en) * 1997-10-31 1999-05-21 Omron Corp Systematized knowledge analyzing method and device therefor, and classifying method and device therefor
US20040078224A1 (en) * 2002-03-18 2004-04-22 Merck & Co., Inc. Computer assisted and/or implemented process and system for searching and producing source-specific sets of search results and a site search summary box
US6778975B1 (en) * 2001-03-05 2004-08-17 Overture Services, Inc. Search engine for selecting targeted messages
US20050149504A1 (en) * 2004-01-07 2005-07-07 Microsoft Corporation System and method for blending the results of a classifier and a search engine
CN1701324A (en) * 2001-11-02 2005-11-23 Dba西方集团西方出版社 Systems, methods, and software for classifying text

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3178406B2 (en) * 1998-02-27 2001-06-18 日本電気株式会社 Hierarchical sentence classification device and machine-readable recording medium recording program
JP2002230005A (en) * 2001-02-05 2002-08-16 Seiko Epson Corp Support center system
JP3677006B2 (en) * 2002-02-22 2005-07-27 日本ユニシス株式会社 Information processing apparatus and method
JP2003316819A (en) * 2002-04-22 2003-11-07 Shinkichi Himeno Object classification researching device and program for executing it
JP2004355069A (en) * 2003-05-27 2004-12-16 Sony Corp Information processor, information processing method, program, and recording medium
JP4510483B2 (en) * 2004-02-23 2010-07-21 株式会社エヌ・ティ・ティ・データ Information retrieval device

Also Published As

Publication number Publication date
JP2007172616A (en) 2007-07-05
CN1987849A (en) 2007-06-27

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20080917

Termination date: 20141219

EXPY Termination of patent right or utility model