CN101984435B

CN101984435B - Method and device for distributing texts

Info

Publication number: CN101984435B
Application number: CN201010549183A
Authority: CN
Inventors: 蔡勋梁; 彭学政; 王广彬
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2010-11-17
Filing date: 2010-11-17
Publication date: 2012-10-10
Anticipated expiration: 2030-11-17
Also published as: CN101984435A

Abstract

The invention provides a method and device for distributing texts, applied to a column framework comprising at least two levels of columns. The method comprises the following steps: A, performing the following distribution step for each crawled text: matching the keywords of the text currently to be distributed against the center vector of each column for similarity and, according to the matching result, distributing the text to a column that satisfies a distribution matching policy, wherein the center vector of a column is generated from seed words set for that column in advance; and B, according to the hierarchical relationship among the columns, distributing all or part of the texts under designated columns to the upper-level parent column or to the lower-level sub-columns. The method and device reduce the workload and cost of text distribution, shorten the distribution time, and make it easy to add and remove columns flexibly.

Description

A method and apparatus for distributing texts

[technical field]

The present invention relates to the field of Internet technology, and in particular to a method and apparatus for distributing texts.

[background technology]

With the worldwide spread of the Internet and the continuous development of Internet applications, the volume of text on web pages has grown explosively. How to make full and effective use of this text, and how to organize it effectively and present it to users, has gradually become an important research direction in data mining with high industrial value. Text classification is already applied in many fields, for example recalling news pages for each column, e-mail delivery, and generating user interest models.

Text classification means distributing a large number of texts under different columns, where the columns may belong to different categories or to different subcategories of the same category. Existing text distribution is based on training samples: a set of manually classified documents is prepared, and a model trained on this set performs the distribution. This training-sample-based approach has the following defects:

First, building the training samples involves stages such as corpus collection and model training, which require a great deal of work. Corpus collection in particular needs extensive manual annotation by domain professionals, so the workload and cost of text distribution are excessive.

Second, training takes a long time; building the training samples usually adds a delay on the order of weeks to distribution.

In addition, the training samples correspond to the column framework. Once the column framework changes, the training samples must be determined again. Because training samples are hard to obtain and take a long time to prepare, this further raises the cost of text distribution, lengthens the distribution time, and prevents columns from being added or removed flexibly.

[summary of the invention]

The invention provides a method and apparatus for distributing texts, so as to reduce the cost of text distribution, shorten the distribution time, and make it easy to add or remove columns flexibly.

The specific technical solution is as follows:

A method for distributing texts, applied to a column framework comprising at least two levels of columns, the method comprising:

A, performing the following distribution step for each crawled text:

Distribution step: matching the keywords of the text currently to be distributed against the center vector of each column for similarity and, according to the matching result, distributing the text under a column that satisfies the distribution matching strategy; wherein the center vector of a column is generated from seed words set for that column in advance;

B, according to the hierarchical relationship among the columns, distributing all or part of the texts under designated columns to the upper-level parent column or to the next-level sub-columns.

The distribution matching strategy of a column comprises at least: the similarity between the keywords of the text to be distributed and the center vector of the column exceeds the similarity threshold set for that column; or,

the similarity between the keywords of the text to be distributed and the center vector of the column, minus the similarity between those keywords and the opposite vector of the same column, exceeds the similarity threshold set for that column, wherein the opposite vector of a column is generated from reverse words set for that column in advance.

Preferably, step B comprises one or any combination of the following modes:

The columns to which texts are distributed in step A are sub-columns; all texts distributed under each sub-column in step A, or the top N1 texts by ranking, are aggregated into the upper-level parent column, where N1 is a preset positive integer; or,

The columns to which texts are distributed in step A are parent columns; all texts distributed under a parent column in step A are distributed to its next-level sub-columns; or,

The columns to which texts are distributed in step A include both parent columns and sub-columns; part of the texts distributed under a parent column in step A are distributed to those next-level sub-columns that received no texts in step A.

Further, the columns may include common columns, which have the attribute of displaying texts, and hidden columns, which have the attribute of not displaying texts.

Preferably, the method further comprises: extracting keywords from the texts distributed under a column for which seed words are set, and combining the extracted keywords with the seed words of that column to form a new center vector for the column.

Further, after step B, the following steps are performed for each column:

C1, clustering the texts under the column to form one or more clusters under the column;

C2, according to a preset top-news selection strategy, selecting a top text in each cluster as the representative of that cluster.

After step C2, the method further comprises:

calculating the weight of each text under the column from text attributes, determining the weight of each cluster from the weights of the texts in it, and ranking the clusters under the column by cluster weight; or,

according to a preset focus text selection strategy, selecting focus texts from the texts under each column and displaying them under that column.

The top-news selection strategy comprises one or any combination of the following: selecting texts whose publication time falls within a set range, selecting texts whose titles satisfy set requirements, selecting texts whose similarity to the cluster's center vector falls within a set range, and selecting texts whose quality satisfies preset requirements.

Particularly, the weights W of each text _PageComputing formula be:

W_{Page} = \frac{α}{Δ_{t} + α} \times δ (Site) \times φ (Segcount);

Wherein, α is preset inverse ratio factor die-away time, Δ _tBe the current mistiming of text issuing time distance, δ (site) is the computing function of text quality's factor, and φ (segcount) is the computing function of the reprinting rate factor.
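As a quick numerical illustration of the weight formula, the following sketch evaluates W_{Page} for given inputs. It is only a sketch: the function name `page_weight` is hypothetical, and the values of δ(Site) and φ(Segcount) are passed in as plain numbers because this section does not define how those two functions are computed.

```python
def page_weight(alpha, delta_t, site_quality, reprint_factor):
    """Evaluate W_Page = alpha / (delta_t + alpha) * delta(Site) * phi(Segcount).

    alpha          -- preset time-decay factor (same time unit as delta_t)
    delta_t        -- time elapsed since the text was published
    site_quality   -- value of delta(Site), assumed to be precomputed
    reprint_factor -- value of phi(Segcount), assumed to be precomputed
    """
    if alpha <= 0 or delta_t < 0:
        raise ValueError("alpha must be positive and delta_t non-negative")
    decay = alpha / (delta_t + alpha)  # 1 at publication time, tends to 0 with age
    return decay * site_quality * reprint_factor


# With alpha = 24 (hours), a page published 24 hours ago keeps half its weight.
print(page_weight(24.0, 24.0, 1.0, 1.0))  # 0.5
```

The decay term makes α the age at which a text's weight halves, so a larger α keeps older texts ranked longer.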

An apparatus for distributing texts, applied to a column framework comprising at least two levels of columns, the apparatus comprising: a text acquiring unit, a first dispatching unit, and a second dispatching unit;

the text acquiring unit is configured to deliver each crawled text to the first dispatching unit as a text to be distributed;

the first dispatching unit is configured to match the keywords of the text currently to be distributed against the center vector of each column for similarity and, according to the matching result, distribute the text under a column that satisfies the distribution matching strategy; wherein the center vector of a column is generated from seed words set for that column in advance;

the second dispatching unit is configured to, after the first dispatching unit has finished distributing all texts to be distributed, distribute all or part of the texts under designated columns to the upper-level parent column or to the next-level sub-columns according to the hierarchical relationship among the columns.

Where the columns to which the first dispatching unit distributes texts are sub-columns, the second dispatching unit aggregates all texts under each of those sub-columns, or the top N1 texts by ranking, into the upper-level parent column, where N1 is a preset positive integer; or,

where the columns to which the first dispatching unit distributes texts are parent columns, the second dispatching unit distributes all texts under each of those parent columns to the next-level sub-columns; or,

where the columns to which the first dispatching unit distributes texts include both parent columns and sub-columns, the second dispatching unit distributes part of the texts under those parent columns to the next-level sub-columns that received no texts.

Specifically, the columns include common columns, which have the attribute of displaying texts, and hidden columns, which have the attribute of not displaying texts.

Preferably, the apparatus further comprises a keyword extraction unit, configured to extract keywords from the texts distributed under a column for which seed words are set, combine the extracted keywords with the seed words of that column to form a new center vector for the column, and provide the new center vector to the first dispatching unit.

Further, the apparatus also comprises a text clustering unit and a top-news selection unit;

the text clustering unit is configured to cluster the texts under each column according to the distribution results of the first dispatching unit and the second dispatching unit, forming one or more clusters under each column;

the top-news selection unit is configured to select, according to a preset top-news selection strategy, a top text in each cluster as the representative of that cluster.

Preferably, the apparatus further comprises one or both of a cluster ranking unit and a focus selection unit;

the cluster ranking unit is configured to calculate the weight of each text under a column from text attributes, determine the weight of each cluster from the weights of the texts in it, and rank the clusters under the column by cluster weight;

the focus selection unit is configured to select, according to the distribution results of the first dispatching unit and the second dispatching unit and a preset focus text selection strategy, focus texts from the texts under each column and display them under that column.

Particularly, the weights W of each text _PageComputing formula be:

W_{Page} = \frac{α}{Δ_{t} + α} \times δ (Site) \times φ (Segcount);

As can be seen from the above technical solution, the invention distributes texts to columns using center vectors generated from column seed words, combined with text distribution between levels. This keeps the text distribution time at the level of seconds and greatly improves the efficiency of text classification. In addition, the method and apparatus avoid the complicated process of building training samples. When the column framework changes, it suffices to set suitable seed words and inter-level distribution rules for added columns and to modify the inter-level distribution rules for deleted columns. Compared with the prior art, in which training samples must be determined again, this clearly reduces the cost of text distribution and allows columns to be added or removed more flexibly.

[description of drawings]

Fig. 1 is a flow chart of the main method provided by the invention;

Fig. 2 is a flow chart of news distribution for each column provided by embodiment one of the invention;

Fig. 3a shows the first news page distribution mode provided by embodiment one of the invention;

Fig. 3b shows the second news page distribution mode provided by embodiment one of the invention;

Fig. 3c shows the third news page distribution mode provided by embodiment one of the invention;

Fig. 4 is a schematic diagram of mixed news page distribution modes provided by embodiment one of the invention;

Fig. 5 is a flow chart of forming news clusters provided by embodiment two of the invention;

Fig. 6 is a schematic structural diagram of the apparatus provided by the invention.

[embodiment]

To make the objects, technical solutions, and advantages of the invention clearer, the invention is described below with reference to the accompanying drawings and specific embodiments.

Fig. 1 is a flow chart of the main method provided by the invention. As shown in Fig. 1, the method mainly comprises the following steps:

Step 101: perform the following distribution step for each crawled text:

Distribution step: match the keywords of the text currently to be distributed against the center vector of each column for similarity and, according to the matching result, distribute the text under a column that satisfies the distribution matching strategy; the center vector of a column is generated from seed words set for that column in advance.

In this step, the distribution matching strategy of a column can be set flexibly and comprises at least: the similarity between the keywords of the text to be distributed and the center vector of the column exceeds the similarity threshold set for that column. In addition, the strategy may further include, but is not limited to, one or any combination of the following: the similarity between the keywords of the text to be distributed and the center vector of the column is the highest; the source website of the text meets the column's website requirement; the author of the text meets the column's author requirement; the text meets the column's requirement regarding pictures or video; the title of the text matches the column's title regular expression; or the uniform resource locator (URL) type of the text meets the column's URL type requirement.

Step 102: according to the hierarchical relationship among the columns, distribute all or part of the texts under designated columns to the upper-level parent column or to the next-level sub-columns, thereby completing text distribution for every column in the column framework.

In the column framework, it can be specified in advance that, after certain columns have received texts through step 101 or through other existing methods, the texts under those columns are distributed to the upper-level parent column or to the next-level sub-columns. Through this step, texts can be distributed to columns for which no seed words are set; this is described in detail in embodiment one.

The method provided by the invention is described below through specific embodiments, all of which take news pages as the example of the text to be distributed. Embodiment one first describes the news page distribution flow for each column in detail.

Embodiment one

Fig. 2 is a flow chart of news distribution for each column provided by embodiment one of the invention. As shown in Fig. 2, the flow comprises the following steps:

Step 201: set seed words in advance for columns in the column framework, and form a center vector for each column for which seed words are set.

In the column framework, seed words are usually set manually. A column for which seed words are set may be a root column or a sub-column. One or more seed words can be set for a column, forming a group of seed words.

Because manually set seed words are limited and cannot exhaust all possible keywords of a column, a center vector that relies only on them may fail to recall some news pages (recall means that a news page is distributed under a column). Preferably, therefore, once some news pages have been recalled under a column, keywords are extracted from the recalled pages and combined with the column's seed words to form a new center vector for the column. The resulting center vector describes the content orientation of the column more accurately and improves the precision and recall of the news recalled by the column. This corresponds to step 206 below; the number of rounds of keyword extraction from recalled pages can be set from experience, for example to 3.

Step 202: perform steps 203 to 204 one by one on each crawled news page.

After a search engine crawls news pages in batches, the crawled pages can be distributed one by one.

Preferably, after the news pages are crawled, processing such as feature selection and deduplication is first applied to them to filter out useless or duplicate pages, which improves the efficiency of news recall.

Step 203: extract the keywords of the news page currently to be distributed, and match the extracted keywords against the center vector of each candidate column for similarity.

Step 204: according to the matching result, distribute the news page currently to be distributed under the column whose similarity is the highest and exceeds that column's similarity threshold.

In this embodiment, the distribution matching strategy of highest similarity that also exceeds the column's similarity threshold is used as the example; any of the other strategies described in step 101 can also be used and are not repeated here.
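Steps 203 to 204 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the text does not name the similarity measure, so cosine similarity over sparse term-weight dictionaries is assumed; the names `cosine` and `assign_column` and the sample columns are hypothetical; and "highest and over the threshold" is read here as "the best-scoring column among those whose own threshold is exceeded".

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse vectors given as {term: weight}."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def assign_column(keywords, columns):
    """Return the name of the best matching column, or None if none qualifies.

    keywords -- {term: weight} extracted from the page to be distributed
    columns  -- {column name: (center vector, similarity threshold)}
    """
    best_name, best_sim = None, -1.0
    for name, (center, threshold) in columns.items():
        sim = cosine(keywords, center)
        # Keep the column only if it beats its own threshold and the best so far.
        if sim > threshold and sim > best_sim:
            best_name, best_sim = name, sim
    return best_name


columns = {
    "sports": ({"football": 1.0, "match": 1.0}, 0.3),
    "finance": ({"stock": 1.0, "market": 1.0}, 0.3),
}
print(assign_column({"football": 2.0, "goal": 1.0}, columns))  # sports
print(assign_column({"weather": 1.0}, columns))                # None
```

A page that matches no column is simply left unassigned in this sketch; how such pages are handled is not specified in this step.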

In addition, because seed words are usually coarse-grained, recalling news pages under a column tends to introduce noise. Therefore, when recalling news pages under each column, reverse words can additionally be set for the column and an opposite vector formed from them. During similarity matching, the similarity between the keywords of the news page to be distributed and the opposite vector is subtracted from the similarity between those keywords and the center vector, and the result is checked against the distribution matching strategy, which at least includes checking whether the result exceeds the similarity threshold set for the column.
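The subtraction described above can be illustrated with a short sketch. Cosine similarity is again an assumption, and the function names, the sample terms, and the threshold of 0.5 are hypothetical values chosen only for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse vectors given as {term: weight}."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def net_similarity(keywords, center, opposite):
    """Similarity to the center vector minus similarity to the opposite vector."""
    return cosine(keywords, center) - cosine(keywords, opposite)


def satisfies_strategy(keywords, center, opposite, threshold):
    """True if the net similarity exceeds the column's similarity threshold."""
    return net_similarity(keywords, center, opposite) > threshold


center = {"stock": 1.0, "index": 1.0}   # built from the column's seed words
opposite = {"hongkong": 1.0}            # built from the column's reverse words

# A page about the index alone passes; a page dominated by a reverse word does not.
print(satisfies_strategy({"stock": 1.0, "index": 1.0}, center, opposite, 0.5))     # True
print(satisfies_strategy({"stock": 1.0, "hongkong": 1.0}, center, opposite, 0.5))  # False
```

In the second call the page is closer to the opposite vector than to the center vector, so the net similarity is negative and the page is rejected.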

In the column framework, the way each column obtains news can be configured in the column's attributes. Specifically, a column can be configured to obtain news pages through the seed-word-based center vector (the set such a column draws from can be the entire pool of news pages crawled by the web crawler), to obtain news pages from its parent column or sub-columns (the set such a column draws from is the set of news pages obtained by its parent column or sub-columns), or to obtain news pages in some other way. For example, columns for which seed words are configured can recall news pages through steps 203 to 205, whereas columns without seed words can obtain news pages from other columns. Obtaining news pages from a parent column or sub-columns is described in the following step.

Step 205: according to the hierarchical relationship among the columns, distribute all or part of the news under a column to the upper-level parent column or to the next-level sub-columns.

There is usually a hierarchical relationship among the columns. The following three ways of recalling news pages can be used here, among others:

First mode: each sub-column recalls its news pages through steps 203 to 204, and the news pages under the sub-columns are then aggregated and distributed to the upper-level parent column. As shown in Fig. 3a, the shaded nodes represent columns for which seed words are set, and the arrows indicate the direction in which news pages are distributed. This mode usually suits cases where the sub-columns differ considerably and the seed words of the columns overlap little. For example, the parent column is "entertainment" and the sub-columns are "domestic entertainment", "Hong Kong, Macao and Taiwan entertainment", "Japanese and Korean entertainment", "European and American entertainment", and so on. The seed words of each sub-column are the names of entertainers from the corresponding region. Because the seed words of the sub-columns overlap little, each sub-column recalls news pages through steps 203 to 204, and the pages are then aggregated into the parent column "entertainment".

Either all news pages under each sub-column are aggregated and distributed to the upper-level parent column, or only the top-ranked news pages of each sub-column are. Within a sub-column, news pages can be ranked by the similarity between their keywords and the column's center vector, or by the weight of the news cluster they belong to together with their relevance to that cluster; the ranking criterion can be set flexibly. The formation of news clusters under a column is described in embodiment two.

The total number of news pages aggregated into the parent column can be limited. For example, if the total number of news items for the parent column is set to N and the parent column has m sub-columns, the number of news pages each sub-column distributes to the parent column can be capped at 2 × N/m.
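The per-sub-column cap of 2 × N/m can be sketched as below. The function name `aggregate_to_parent` and the data layout are hypothetical, rounding the cap down to an integer is an assumption, and the sketch does not trim the merged list to N because the text does not say how that would be done.

```python
def aggregate_to_parent(sub_columns, parent_total):
    """Collect the top-ranked pages of each sub-column for the parent column.

    sub_columns  -- {sub-column name: list of pages, best ranked first}
    parent_total -- N, the total number of news items set for the parent column
    """
    m = len(sub_columns)
    if m == 0:
        return []
    cap = (2 * parent_total) // m  # each sub-column contributes at most 2 * N / m pages
    merged = []
    for pages in sub_columns.values():
        merged.extend(pages[:cap])
    return merged


sub_columns = {
    "a": ["a1", "a2", "a3", "a4", "a5", "a6"],
    "b": ["b1", "b2"],
}
# N = 4 and m = 2 give a cap of 4 pages per sub-column.
print(aggregate_to_parent(sub_columns, 4))  # ['a1', 'a2', 'a3', 'a4', 'b1', 'b2']
```

Allowing each sub-column up to twice its even share lets a busy sub-column fill slots that a quiet one leaves unused.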

Second mode: the parent column recalls its news pages through steps 203 to 204, and the news pages under the parent column are then distributed to the next-level sub-columns. As shown in Fig. 3b, the shaded node indicates that the parent node recalls news pages by similarity matching against the center vector formed from seed words, and the arrows indicate the direction in which news pages are distributed. This mode usually suits cases where the sub-columns differ little and the seed words of the columns overlap heavily. For example, the parent node is "electronic products" and the sub-columns are "new products" and "product shopping guide". Because these two sub-columns differ little, their seed words overlap heavily; both might contain seed words such as "new model" and "electronics". Seed words can therefore be configured on the parent column, which then distributes to the next-level sub-columns.

A next-level sub-column can also use the seed-word-based center vector matching of steps 203 to 204 to recall part of the news pages distributed by the parent column. Other matching methods can be used as well, for example matching by source website, author, picture or video requirement, or URL type of the news page.

If a news page issued by the parent column does not belong to any existing sub-column, it can be distributed to a separate sub-column. Supposing there are m existing sub-columns, m+1 sub-columns result in total; if the parent node distributes N news pages, the number of news pages entering each sub-column can be capped at 2 × N/(m+1).

Third mode: the parent column and some of the sub-columns recall their news pages through steps 203 to 204, and the remaining sub-columns obtain matching news pages from the parent column. As shown in Fig. 3c, the shaded nodes represent columns for which seed words are set, and the arrows indicate the direction in which news pages are distributed. This mode usually suits cases where some sub-columns under a parent column are hard to distinguish while others are relatively distinctive. For example, the parent column is "society" and the sub-columns are "society and law" and "social miscellany". Because "society and law" is highly distinctive while "social miscellany" is not, seed words can be configured for the parent column "society" and the sub-column "society and law", which recall news pages based on center vectors, while the sub-column "social miscellany" obtains part of the news pages from the parent column.

It should be noted that, because a column framework may contain multiple levels of columns, more than one of the above modes can be mixed, and they can even be mixed with existing recall methods within one column framework. An example is shown in Fig. 4, where the arrows indicate the direction of news distribution, dashed boxes are hidden columns (described below), and solid boxes are non-hidden columns (that is, common columns). In this example, seed words are configured on first-level columns 2, 3, 4 and 5 and on second-level columns a, b and e, which receive news pages based on center vectors. Columns a and b aggregate the news pages they receive into their upper-level parent column, column 1, corresponding to the first mode. Column 2 further distributes the news pages it receives to its next-level sub-columns, columns c and d, corresponding to the second mode. Column 3 distributes part of the news pages it receives to its next-level sub-columns other than column e, namely columns f and g, corresponding to the third mode.

Column definitions may be incomplete. For example, a "market index" column is meant to carry index information for the domestic stock market, but because Hong Kong stocks and US stocks are not distinguished, noise from news pages about them is introduced. In this case, hidden columns for Hong Kong stocks and US stocks can be set up; hidden columns are not displayed, so the related news pages are filtered out. As another example, hidden columns for pornographic or reactionary content can be set up under the column framework, so that such news pages are recalled from the crawled news and hidden from display. Hidden columns also recall news pages through the seed-word-based center vector of steps 203 to 204. Likewise, keywords can be extracted from the news pages a hidden column has recalled to expand its seed words, achieving a better filtering effect than configuring reverse words.

Step 206: extract keywords from the news pages under a column and combine the extracted keywords with the column's seed words to form a new center vector, which can be used when the next round of crawled news pages is distributed.

Keywords can be extracted from news pages according to term frequency, word-sense weight, part-of-speech weight, and so on. The specific extraction methods are prior art and are not described further here.
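One possible shape of step 206 is sketched below, using term frequency only. The weighting scheme is an assumption made for illustration: seed words keep a weight of 1.0 and each extracted keyword is weighted by its relative frequency in the recalled pages. The function name `update_center_vector` and the parameter `top_k` are hypothetical.

```python
from collections import Counter


def update_center_vector(seed_words, recalled_pages, top_k=5):
    """Combine seed words with the most frequent terms of the recalled pages.

    seed_words     -- list of manually configured seed words
    recalled_pages -- list of pages, each already tokenized into a list of terms
    top_k          -- number of extracted keywords to add (hypothetical setting)
    """
    counts = Counter(term for page in recalled_pages for term in page)
    total = sum(counts.values())
    center = {word: 1.0 for word in seed_words}  # seed words keep full weight
    for term, count in counts.most_common(top_k):
        if term not in center:
            center[term] = count / total  # relative frequency as the weight
    return center


pages = [["football", "goal", "goal"], ["goal", "league"]]
print(update_center_vector(["football"], pages, top_k=1))
# {'football': 1.0, 'goal': 0.6}
```

Running the function once per crawl round, with the pages recalled in the previous round, matches the idea of refining the center vector over a small number of rounds.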

As the above flow shows, the following can be configured for each column node in the column framework: the column's distribution matching strategy, the column's node structure (that is, its upper-level parent node and next-level child nodes), its display attribute (whether it is a hidden column), and so on.
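The per-column configuration listed above could be held in a small record such as the following. This is a hypothetical data layout, not one given in the text: the class `ColumnNode`, its field names, the `news_source` values, and the sample framework are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ColumnNode:
    name: str
    parent: Optional[str] = None                          # upper-level parent node
    children: List[str] = field(default_factory=list)     # next-level child nodes
    seed_words: List[str] = field(default_factory=list)   # empty if none configured
    similarity_threshold: float = 0.0                     # part of the matching strategy
    news_source: str = "center_vector"                    # or "from_parent", "from_children"
    hidden: bool = False                                  # display attribute


def visible_columns(framework: Dict[str, ColumnNode]) -> List[str]:
    """Names of the columns that are displayed, i.e. not hidden."""
    return [name for name, node in framework.items() if not node.hidden]


framework = {
    "parent": ColumnNode("parent", children=["child_a", "child_b", "noise"],
                         seed_words=["w1", "w2"], similarity_threshold=0.3),
    "child_a": ColumnNode("child_a", parent="parent",
                          seed_words=["w3"], similarity_threshold=0.4),
    "child_b": ColumnNode("child_b", parent="parent", news_source="from_parent"),
    "noise": ColumnNode("noise", parent="parent", seed_words=["w9"],
                        similarity_threshold=0.3, hidden=True),
}
print(visible_columns(framework))  # ['parent', 'child_a', 'child_b']
```

With such a layout, adding or removing a column amounts to adding or removing one record and adjusting the `children` list of its parent.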

This concludes the flow of embodiment one. The large number of news pages recalled under each column cannot all be displayed under it, so focus news must be selected for display. Embodiment two describes this process.

Embodiment two

Fig. 5 is a flow chart of forming news clusters provided by embodiment two of the invention. As shown in Fig. 5, the following steps are performed on the news pages under each column:

Step 501: cluster the news pages under the column to form one or more news clusters.

Because a large number of news pages are recalled under each column, and the column is too coarse a granularity for classifying news pages, the news pages under each column can be divided by clustering into several news clusters, within each of which the news pages are highly similar.

Embodiments of the invention can use, without limitation, hierarchical clustering, agglomerative clustering, partitional clustering, density-based clustering, grid-based clustering, and so on. Specifically, if this embodiment uses hierarchical clustering, the termination condition can be set as the similarity falling below a preset similarity threshold or the number of news clusters falling below a preset threshold.
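The termination condition for hierarchical clustering can be illustrated with a small bottom-up sketch. Several choices here are assumptions not fixed by the text: cosine similarity, average linkage between clusters, and stopping once the cluster count reaches (rather than drops below) the preset number. The names `agglomerative`, `min_similarity`, and `min_clusters` are hypothetical.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two sparse vectors given as {term: weight}."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def linkage(c1, c2, vectors):
    """Average pairwise similarity between the members of two clusters."""
    sims = [cosine(vectors[i], vectors[j]) for i in c1 for j in c2]
    return sum(sims) / len(sims)


def agglomerative(vectors, min_similarity, min_clusters=1):
    """Merge the closest clusters until a termination condition is met.

    Returns a list of clusters, each a list of indices into `vectors`.
    """
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > max(min_clusters, 1):
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]], vectors))
        if linkage(clusters[i], clusters[j], vectors) < min_similarity:
            break  # the closest pair is no longer similar enough
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because i < j
    return clusters


vectors = [{"a": 1.0}, {"a": 1.0, "b": 0.1}, {"c": 1.0}, {"c": 1.0, "d": 0.1}]
print(agglomerative(vectors, 0.5))  # [[0, 1], [2, 3]]
```

This brute-force version recomputes all pairwise linkages on every merge, which is adequate for illustration but not for the page volumes the text has in mind.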

Clustering the news pages under a column directly may give poor results: because all news pages under the same column are highly similar to the same center vector, a large share of the news may be merged into one cluster while the rest splits into many small groups. Therefore, when clustering the news pages under a column, the weight of the column's center vector in the clustering computation can first be reduced. This brings out the content of each news item beyond the center vector, and the news items are then grouped according to the differences in that content.
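One way to read "reducing the weight of the column's center vector" is to scale down, in each page vector, the terms that also appear in the center vector before clustering. The sketch below follows that reading; it is an interpretation rather than the method stated in the text, and the function name `damp_center_terms` and the factor 0.2 are hypothetical.

```python
def damp_center_terms(page_vector, center_vector, factor=0.2):
    """Scale down the terms of a page vector that belong to the column's center vector.

    page_vector   -- {term: weight} for one news page
    center_vector -- {term: weight} of the column
    factor        -- multiplier in [0, 1] applied to shared terms (hypothetical value)
    """
    return {
        term: weight * factor if term in center_vector else weight
        for term, weight in page_vector.items()
    }


center = {"stock": 1.0}
page = {"stock": 1.0, "merger": 0.5}
print(damp_center_terms(page, center))  # {'stock': 0.2, 'merger': 0.5}
```

After damping, the term that every page in the column shares no longer dominates the similarity, so pages group by what distinguishes them.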

Preferably, the news pages under each column may be screened before step 501 is performed, for example by keeping only the M news pages with the highest similarity to the column's center vector, where M is a preset positive integer.
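The screening before step 501 could, for instance, look like the sketch below. It assumes each news page is a dictionary carrying a precomputed `center_sim` value (its similarity to the column center vector); that field name and the function name are hypothetical.

```python
import heapq

def keep_top_m(pages, m):
    """Keep the m pages with the highest similarity to the column center vector."""
    return heapq.nlargest(m, pages, key=lambda p: p["center_sim"])
```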

Step 502: According to a preset top news selection strategy, select a top news item in each news cluster as the representative of that news cluster.

The top news selection strategy of a news cluster can be set flexibly and may include, without limitation, one or any combination of the following strategies: selecting news pages whose publication time falls within a set range; selecting news pages whose titles satisfy set requirements; selecting news pages whose similarity to the news cluster's center vector falls within a set range; selecting news pages whose news quality satisfies preset requirements. For example, a news page that was published recently, has a long title, and has high similarity to the news cluster's center vector may be selected as the top news. News quality may depend on one or any combination of: the website weight, the traffic of the news page, the response speed of the news page, the degree of clutter, and so on. Note that this embodiment uses news pages as the example text; for other texts, a form of text quality suited to the specific text may be used.

As an example of selecting the top news in a news cluster: obtain the 3 news pages in the cluster with the highest similarity to the cluster's center vector, and from them select one whose title is readable as the top news. If none of the titles is readable, take the next 3 news pages by similarity to the cluster's center vector and select one with a readable title from them, and so on until a readable one is selected.
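The batch-by-batch selection in this example might be sketched as follows. The description does not define how title readability is judged or which page is taken within a batch; this sketch assumes a caller-supplied scoring function `readability`, a hypothetical minimum score `min_score`, and picks the most readable title in each batch. `toy_readability` is a placeholder scorer for demonstration only.

```python
def pick_top_news(pages, readability, min_score=0.5, batch_size=3):
    """pages: list of dicts with 'title' and 'cluster_sim'.
    Walk the pages in batches of descending cluster similarity and return
    the first batch's most readable page that reaches min_score, else None."""
    ranked = sorted(pages, key=lambda p: p["cluster_sim"], reverse=True)
    for start in range(0, len(ranked), batch_size):
        batch = ranked[start:start + batch_size]
        best = max(batch, key=lambda p: readability(p["title"]))
        if readability(best["title"]) >= min_score:
            return best
    return None

def toy_readability(title):
    """Toy scorer (assumption): titles of 4 to 12 words count as readable."""
    return 1.0 if 4 <= len(title.split()) <= 12 else 0.0
```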

Step 503: Calculate the weight of each news page under the column according to the attributes of the news pages, determine the weight of each news cluster from the weights of the news pages in that cluster, and sort the news clusters under the column by cluster weight.

The attributes of a news page mentioned in this step may include, without limitation, one or any combination of the following: news publication time, news quality, and reprint rate. As an example of calculating the weight of a news page, formula (1) may be used to calculate the weight W_{page}:

$$W_{page} = \frac{\alpha}{\Delta_t + \alpha} \times \delta(site) \times \varphi(segcount) \tag{1}$$

where α is a preset inverse-proportional time-decay factor, Δ_t is the time difference between the news publication time and the current time, δ(site) is the function that computes the news quality factor, and φ(segcount) is the function that computes the reprint rate factor.

The weight of a news cluster can be determined in several ways, for example by taking the sum of the weights of the news pages in the cluster as the cluster weight, or by taking the mean of the weights of the news pages in the cluster as the cluster weight.
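A small sketch of step 503 under stated assumptions: the values of δ(site) and φ(segcount) are passed in as already-computed numbers, Δ_t is measured in hours, and the default α of 24 is arbitrary. None of these choices is fixed by the description, and the function names are illustrative.

```python
def page_weight(age_hours, site_quality, reprint_factor, alpha=24.0):
    """Formula (1): alpha / (delta_t + alpha) * delta(site) * phi(segcount)."""
    return alpha / (age_hours + alpha) * site_quality * reprint_factor

def cluster_weight(page_weights, mode="sum"):
    """Cluster weight as the sum or the mean of its page weights."""
    if mode == "sum":
        return sum(page_weights)
    if mode == "mean":
        return sum(page_weights) / len(page_weights)
    raise ValueError("mode must be 'sum' or 'mean'")

def rank_clusters(clusters, mode="sum"):
    """clusters: dict mapping cluster name to its list of page weights.
    Returns cluster names in descending order of cluster weight."""
    return sorted(clusters, key=lambda name: cluster_weight(clusters[name], mode),
                  reverse=True)
```

The choice between sum and mean matters: summing favors clusters with many pages (widely covered stories), while averaging favors clusters whose individual pages are fresh and of high quality.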

Step 504: According to a preset focus selection strategy, select focus news from the news pages under the column and display it under that column.

The focus selection strategy can be set flexibly. For example, several news pages may be selected from each news cluster as the focus news of the column; or, according to the ranking of the news clusters, K2 news pages may be selected from each of the top K1 news clusters as the focus news of the column, where K1 and K2 are positive integers. Other strategies are possible and are not enumerated here.
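The K1/K2 strategy could be sketched as below. The description does not say how the K2 pages are chosen inside a cluster; this sketch assumes they are the pages with the highest page weight, and the function name is hypothetical.

```python
def pick_focus(ranked_clusters, k1, k2):
    """ranked_clusters: clusters already sorted by descending cluster weight,
    each a list of (page_id, page_weight) tuples.
    Returns up to k2 page ids from each of the first k1 clusters."""
    focus = []
    for cluster in ranked_clusters[:k1]:
        top = sorted(cluster, key=lambda pw: pw[1], reverse=True)[:k2]
        focus.extend(page_id for page_id, _ in top)
    return focus
```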

Steps 502, 503, and 504 have no fixed order; this flow is only one possible implementation.

Note that whether focus news is displayed in each column, and whether each news cluster displays its top news, are both configurable. That is, the text content to be displayed and the specific display mode can be configured in the display attributes of the column.

This concludes the flow shown in Embodiment 2.

The above describes the method provided by the invention; the device provided by the invention is described in detail below. As shown in Fig. 6, the device may include: a text acquisition unit 601, a first distribution unit 602, and a second distribution unit 603.

The text acquisition unit 601 is configured to send each crawled text to the first distribution unit 602 as a text to be distributed.

The first distribution unit 602 is configured to perform similarity matching between the keywords of the current text to be distributed and the center vector of each column and, according to the matching result, to distribute the current text to the columns that satisfy the distribution matching strategy, where the center vector of a column is generated from seed words set for that column in advance.

The second distribution unit 603 is configured to, after the first distribution unit 602 has finished distributing all texts to be distributed, distribute all or part of the texts under set columns to the upper-level parent column or to next-level sub-columns according to the hierarchical relationship between the columns.

The distribution matching strategy of a column includes at least: the similarity between the keywords of the text to be distributed and the center vector of the column exceeds the similarity threshold set for that column; or, the similarity between the keywords of the text to be distributed and the center vector of the column, minus the similarity between the keywords of the text and the opposite vector of the same column, exceeds the similarity threshold set for that column, where the opposite vector of a column is generated from reverse words set for that column in advance.
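The two threshold strategies above can be illustrated with the following sketch. It assumes keywords and vectors are sparse term-weight dictionaries compared by cosine similarity; the description does not fix the similarity measure, and `matches_column` is a hypothetical name.

```python
import math

def cosine(a, b):
    """Cosine similarity of two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def matches_column(keywords, center_vec, threshold, opposite_vec=None):
    """True if similarity to the center vector, minus similarity to the
    opposite vector when one is configured, exceeds the column threshold."""
    score = cosine(keywords, center_vec)
    if opposite_vec:
        score -= cosine(keywords, opposite_vec)
    return score > threshold
```

For a column about the company Apple with reverse words such as "fruit" and "juice", a text about apple juice can pass the plain threshold test yet be rejected once its similarity to the opposite vector is subtracted.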

In addition, the distribution matching strategy may further include, without limitation, one or any combination of the following: the similarity between the keywords of the text to be distributed and the center vector of the column is the highest; the source website of the text to be distributed meets the column's website requirement; the author of the text to be distributed meets the column's author requirement; the text to be distributed meets the column's requirement regarding pictures or video; the title of the text to be distributed matches the title regular expression required by the column; the URL type of the text to be distributed meets the column's URL type requirement.

Specifically, if the columns to which the first distribution unit 602 distributes are sub-columns, the second distribution unit 603 may aggregate into the upper-level parent column either all texts under each sub-column distributed by the first distribution unit 602 or the top N1 ranked texts under each such sub-column, where N1 is a preset positive integer.

If the column of first Dispatching Unit, 602 distributions is father's column, this moment, second Dispatching Unit 603 can be distributed to the sub-column of next stage with all texts under each sub-column of first Dispatching Unit, 602 distributions.

If the column of first Dispatching Unit, 602 distributions comprises father's column and sub-column, this moment, second Dispatching Unit 603 can be distributed to the part text under father's column of first Dispatching Unit, 602 distributions the sub-column of next stage of not distributed text.

The columns involved in the invention may include ordinary columns, which have the text display attribute, and hidden columns, which do not have the text display attribute. Hidden columns can be used to implement text filtering.

The device may further include a keyword extraction unit 604, configured to extract keywords from the texts distributed under a column for which seed words are set, and to combine the extracted keywords with the seed words of that column to form a new center vector for the column, which is provided to the first distribution unit 602. Updating the column center vector through the keyword extraction unit 604 lets the updated center vector describe the content orientation of the column more accurately, which improves the accuracy with which texts are distributed to the column.
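One possible shape for the center vector update is sketched below. The description does not state how keywords are extracted or weighted; this sketch assumes tokenized texts, picks the `top_k` most frequent non-seed tokens as keywords, weights them by frequency relative to the most frequent one, and keeps seed words at weight 1.0. All of these choices and names are illustrative.

```python
from collections import Counter

def update_center_vector(seed_words, distributed_texts, top_k=3):
    """seed_words: collection of seed words set for the column.
    distributed_texts: list of token lists for texts already under the column.
    Returns a new center vector as a dict mapping term to weight."""
    counts = Counter(tok for text in distributed_texts
                     for tok in text if tok not in seed_words)
    center = {word: 1.0 for word in seed_words}
    if counts:
        max_count = max(counts.values())
        for tok, count in counts.most_common(top_k):
            center[tok] = count / max_count
    return center
```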

Further, the device may also include a text clustering unit 605 and a top news selection unit 606.

The text clustering unit 605 is configured to cluster the texts under each column according to the distribution results of the first distribution unit 602 and the second distribution unit 603, forming one or more clusters under each column.

The top news selection unit 606 is configured to select, according to a preset top news selection strategy, a top text in each cluster formed by the text clustering unit 605 as the representative of that cluster.

More preferably, the device may also include one or both of a cluster sorting unit 607 and a focus selection unit 608 (Fig. 6 shows the case in which both units are included).

The cluster sorting unit 607 is configured to, after the text clustering unit 605 has formed the clusters under each column, calculate the weight of each text under the column according to text attributes, determine the weight of each cluster from the weights of the texts in that cluster, and sort the clusters under the column by cluster weight.

The focus selection unit 608 is configured to select, according to the distribution results of the first distribution unit 602 and the second distribution unit 603 and a preset focus text selection strategy, focus texts from the texts under each column and to display them under that column.

The top news selection strategy may include one or any combination of the following strategies: selecting texts whose publication time falls within a set range; selecting texts whose titles satisfy set requirements; selecting texts whose similarity to the cluster center vector falls within a set range; selecting texts whose text quality satisfies preset requirements.

Preferably, the weight W_{Page} of each text may be calculated with the following formula:

$$W_{Page} = \frac{\alpha}{\Delta_t + \alpha} \times \delta(Site) \times \varphi(Segcount)$$

As can be seen from the above technical solutions, the method and device provided by the invention can have the following advantages:

1) The invention distributes texts to columns based on center vectors generated from the columns' seed words, combined with distribution between levels. This keeps the duration of text distribution at the level of seconds and greatly improves the efficiency of text distribution. In addition, the method and device of the invention avoid the complex process of building training samples. Once the column framework changes, it suffices to set suitable seed words and inter-level text distribution rules for an added column, and to modify the inter-level text distribution rules for a deleted column. Compared with prior-art approaches that must redetermine the training samples, this reduces the cost of text distribution and allows columns to be added and removed more flexibly.

2) The invention can filter texts by means such as setting reverse words in a column or setting hidden columns, which improves the accuracy of the texts displayed in columns.

3) In the invention, keywords can be extracted from the texts distributed to a column and combined with the seed words of that column to form a new center vector. The center vector then describes the content orientation of the column more accurately, which improves both the accuracy and the recall of text distribution to the column.

4) The invention provides multiple distribution matching strategies, so the text recall rate of a column can be controlled flexibly as required.

The above are merely preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A method for distributing text, applied to a column framework comprising at least two levels of columns, characterized in that the method comprises:

A. performing the following distribution step for each crawled text:

Distribution step: performing similarity matching between the keywords of the current text to be distributed and the center vector of each column and, according to the matching result, distributing the current text to be distributed to the columns that satisfy a distribution matching strategy; wherein the center vector of said column is generated from seed words set for that column in advance; and said columns comprise ordinary columns, which have the text display attribute, and hidden columns, which do not have the text display attribute;

2. The method according to claim 1, characterized in that said distribution matching strategy of a column comprises at least: the similarity between the keywords of said text to be distributed and the center vector of the column exceeds the similarity threshold set for that column; or,

3. The method according to claim 1, characterized in that said step B specifically comprises one or any combination of the following:

4. The method according to any one of claims 1 to 3, characterized in that the method further comprises: extracting keywords from the texts distributed under a column for which seed words are set, and combining the extracted keywords with the seed words of that column to form a new center vector for the column.

5. The method according to any one of claims 1 to 3, characterized in that, after said step B, the following steps are performed for each column:

6. The method according to claim 5, characterized in that, after said step C2, the method further comprises:

7. The method according to claim 5, characterized in that said top news selection strategy comprises one or any combination of the following strategies: selecting texts whose publication time falls within a set range; selecting texts whose titles satisfy set requirements; selecting texts whose similarity to the cluster center vector falls within a set range; selecting texts whose text quality satisfies preset requirements.

8. The method according to claim 6, characterized in that the weight W_{Page} of each text is calculated by the formula:

$$W_{Page} = \frac{\alpha}{\Delta_t + \alpha} \times \delta(Site) \times \varphi(Segcount)$$

9. A device for distributing text, applied to a column framework comprising at least two levels of columns, characterized in that the device comprises: a text acquisition unit, a first distribution unit, and a second distribution unit;

said first distribution unit is configured to perform similarity matching between the keywords of the current text to be distributed and the center vector of each column and, according to the matching result, to distribute the current text to be distributed to the columns that satisfy a distribution matching strategy; wherein the center vector of said column is generated from seed words set for that column in advance; and said columns comprise ordinary columns, which have the text display attribute, and hidden columns, which do not have the text display attribute;

10. The device according to claim 9, characterized in that said distribution matching strategy of a column comprises at least: the similarity between the keywords of said text to be distributed and the center vector of the column exceeds the similarity threshold set for that column; or,

11. The device according to claim 9, characterized in that the columns to which said first distribution unit distributes are sub-columns, in which case said second distribution unit aggregates into the upper-level parent column either all texts under each sub-column distributed by said first distribution unit or the top N1 ranked texts, wherein N1 is a preset positive integer; or,

12. The device according to any one of claims 9 to 11, characterized in that the device further comprises: a keyword extraction unit, configured to extract keywords from the texts distributed under a column for which seed words are set, and to combine the extracted keywords with the seed words of that column to form a new center vector for the column, which is provided to said first distribution unit.

13. The device according to any one of claims 9 to 11, characterized in that the device further comprises: a text clustering unit and a top news selection unit;

14. The device according to claim 13, characterized in that the device further comprises one or both of a cluster sorting unit and a focus selection unit;

15. The device according to claim 13, characterized in that said top news selection strategy comprises one or any combination of the following strategies: selecting texts whose publication time falls within a set range; selecting texts whose titles satisfy set requirements; selecting texts whose similarity to the cluster center vector falls within a set range; selecting texts whose text quality satisfies preset requirements.

16. The device according to claim 14, characterized in that the weight W_{Page} of each text is calculated by the formula:

$$W_{Page} = \frac{\alpha}{\Delta_t + \alpha} \times \delta(Site) \times \varphi(Segcount)$$