CN100444591C

CN100444591C - Method for acquiring web page keywords and application system thereof

Info

Publication number: CN100444591C
Application number: CNB2006101124628A
Authority: CN
Inventors: 田野; 陈亮; 李晶
Original assignee: Beijing Kingsoft Software Co Ltd
Current assignee: Beijing Kingsoft Office Software Inc
Priority date: 2006-08-18
Filing date: 2006-08-18
Publication date: 2008-12-17
Anticipated expiration: 2026-08-18
Also published as: CN1909522A

Abstract

The invention relates to a method for obtaining the keywords of a web page and to a system that applies the method. The method comprises: segmenting the web page title into words to obtain title roots; counting how many times each title root occurs in the web page; and selecting at least one of the most frequently occurring title roots as a keyword of the page. The method obtains web page keywords quickly and accurately. It can also be used in a web crawling system to analyze the crawled pages, obtain their keywords, and store the pages together with their keywords in a database, so that more pages can be offered to the user.

Description

Method for obtaining web page keywords and application system thereof

Technical field

The present invention relates to the field of network technology, and in particular to a method for obtaining web page keywords and an application system thereof.

Background technology

With the rapid development of networks, the network has become an important means by which people obtain information. To help users quickly pick out the pages they care about from a large number of web pages, a web page provider needs to preprocess the page content, extract the web page keywords, and store each keyword together with the page content in a database. When a user requests a page, the server first obtains the keywords of that page from the database, then searches the database for pages having the same keywords and offers them to the viewer.

At present, web page keywords are obtained by having a person read the page content. The shortcomings of this approach are that, when the number of pages is huge, it requires a great deal of manual work, efficiency is low, and the accuracy of the resulting keywords is not high. Moreover, it is only suitable for websites whose page information is published by the website provider and whose number of pages is limited, such as news websites. It is not suitable for websites where information is published by users, such as forums, or for websites with a huge number of pages.

Summary of the invention

The technical problem to be solved by the present invention is to provide a method for obtaining web page keywords and an application system thereof, so that web page keywords can be obtained quickly and accurately.

To solve the above technical problem, the object of the invention is achieved through the following technical solutions.

A method for obtaining a web page keyword, the method comprising:

obtaining the web page title and segmenting it into words to obtain title roots; searching the web page for each title root and counting the number of times the title root occurs in the page; and selecting at least one title root with a high occurrence count in the page as the web page keyword.

In the above method, segmenting the web page title to obtain title roots is specifically:

traversing each character of the web page title in reading order; in each traversal, first saving the current character as a title root, then appending the following character or character string in order to this title root and saving each result as a further title root.

The above method further comprises: providing a counter for each title root.

Searching the web page for title roots and counting the number of times each title root occurs in the page is specifically:

reading, in reading order, the effective text data from the web page source file and traversing each character of the effective text data; in each traversal, first matching the current character, taken as a page content root, against the title roots and, if a match succeeds, adding 1 to the counter of the corresponding title root; then appending the following character or character string in order to this page content root, matching the result, as a page content root, against the title roots and, if a match succeeds, adding 1 to the counter of the corresponding title root.

The above method further comprises: storing the web page keyword and the web page in a web page database.

The above method further comprises: counting the number of times the web page keywords occur across multiple web pages, and selecting at least one web page keyword with a high occurrence count as a hottest keyword.

The above method further comprises: listing the web page keywords and/or the hottest keywords on a web page and setting a link for each of them.

A system applying the method for obtaining a web page keyword, the system comprising: a web page keyword acquisition unit, a web page storage unit, and a web page search unit;

The web page keyword acquisition unit is used to obtain the web page title and segment it into words to obtain title roots; to search the web page for each title root and count the number of times the title root occurs in the page; and to select at least one title root with a high occurrence count in the page as the web page keyword;

The web page storage unit is used to store the page content, the page address, and the web page keywords obtained by the web page keyword acquisition unit;

The web page search unit is used to search the web page storage unit and obtain the pages having the same keywords as the currently browsed page.

The above system further comprises: a web crawling unit, used to obtain web pages;

The web page keyword acquisition unit is used to perform word segmentation on the title of a page crawled by the web crawling unit, or of the page the user is currently browsing, to obtain title roots and, according to the number of times each title root occurs in the page, to select at least one title root with a high occurrence count as a keyword of the crawled page or of the currently browsed page.

As can be seen from the above technical solution, the present invention segments the web page title to obtain title roots and, according to the number of times each title root occurs in the page, selects at least one title root with a high occurrence count as a keyword of the page, so web page keywords can be obtained quickly. Furthermore, with the method provided by the invention for obtaining title roots and counting their occurrences in the page, the keywords obtained are more accurate than those obtained manually. In addition, the method is applicable to all types of websites, for example comprehensive websites with a very large number of pages, or websites such as forums whose page information is published by users. For pages obtained by web crawling, the keywords of the crawled pages can be obtained rapidly, and the crawled pages and their keywords are stored in a database, so that the web page provider can offer the user more pages to query according to the user's needs.

Description of drawings

Fig. 1 illustrates a web page title;

Fig. 2 is a flow chart of the method for obtaining web page keywords;

Fig. 3 is a flow chart of the method for counting the occurrences of title roots in a web page;

Fig. 4 is a block diagram of a web crawling system to which the method for obtaining web page keywords is applied;

Fig. 5 shows the workflow of the system of Fig. 4.

Embodiment

The core idea of the present invention is to segment the web page title into words to obtain title roots and, according to the number of times each title root occurs in the page, to select at least one title root with a high occurrence count as a keyword of the page.

The web page title is chosen for word segmentation because it is generally a summary of the page content and often contains the web page keywords.

Referring to Fig. 1, the web page title is explained as follows: label 101 indicates the title bar of the web page, label 102 indicates the title of the page content, and label 103 indicates the page source code corresponding to the web page title. Every web page has a title attribute, whose value is generally displayed in the title bar of the browser. When viewing the source code, one can see the pair of tags <title></title>; the value enclosed by this pair of tags is the title attribute value of the page, and the web page provider can give a page its title by setting the value enclosed by <title></title>. In general, the web page provider sets the web page title to the title of the page content, for example a news headline, an article title, or a post title.

The above is the core idea of the method of the invention and the basis on which it is proposed. The method provided by the invention is described in detail below. Fig. 2 shows the flow of the method, which comprises the following steps:

Step 201: obtain the web page title. Development tools generally provide an interface function for obtaining the title attribute of a web page, and the title of a page can be obtained by calling that function. For example, in the VC development environment the web page title can be obtained with the following code:

HRESULT IHTMLDocument2::get_title(BSTR *p); where IHTMLDocument2 points to the current web page data;

Step 202: segment the web page title into words, extract the title roots, and save them temporarily in a list;

Step 203: search the web page for the title roots and count the number of times each title root occurs in the page;

Step 204: select at least one title root with a high occurrence count as a keyword of the page;
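By way of illustration only, step 201 can also be sketched outside the VC environment. The following Python sketch reads the content of the <title></title> tag pair from page source with the standard library HTML parser. The names TitleParser and page_title are assumptions made for this sketch and do not appear in the original disclosure.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text found between <title> and </title>."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so collect all of them.
        if self.in_title:
            self.parts.append(data)


def page_title(source):
    """Return the web page title contained in the page source (step 201)."""
    parser = TitleParser()
    parser.feed(source)
    parser.close()
    return "".join(parser.parts).strip()
```

For a page whose source contains <title> News today </title>, page_title returns "News today"; a page without a title tag yields an empty string.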

This completes the acquisition of the web page keywords. In practical applications, the method further comprises:

storing the web page keywords together with the page content in the web page database;

listing the web page keywords on the web page and setting a link for each keyword;

counting the number of times the web page keywords occur across multiple web pages, selecting at least one web page keyword with a high occurrence count as a hottest keyword, listing it on the web page, setting a link for it, and offering it to the viewer.

The invention provides two ways of obtaining title roots. Embodiment one of the invention uses title root acquisition mode (one), which is specifically:

Traverse each character of the web page title in reading order. In each traversal, first save the current character as a title root; then, on the basis of this title root, append the next character and save the result as another title root, and so on, until the last character has been appended. The title roots are saved temporarily in a list;

For example, the title "Chinese Government releases new intellectual property measures" (a title written in Chinese characters) is divided, starting from its first character 中, into the following roots: 中, 中国, 中国政, 中国政府, 中国政府推, and so on. After the traversal starting from 中 is finished, the traversal starting from 国 begins and yields the roots 国, 国政, 国政府, 国政府推, and so on, until the last character 措 has been traversed;
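By way of illustration only, title root acquisition mode (one) can be sketched in Python as follows. The sketch enumerates every contiguous substring of the title in the traversal order described above. The function name title_roots is an assumption made for this sketch and does not appear in the original disclosure.

```python
def title_roots(title):
    """Return all title roots of mode (one), in traversal order."""
    roots = []
    for start in range(len(title)):
        root = ""
        for ch in title[start:]:
            root += ch          # append the next character in reading order
            roots.append(root)  # save the extended string as a title root
    return roots
```

For the four-character title 中国政府 the function returns the ten roots 中, 中国, 中国政, 中国政府, 国, 国政, 国政府, 政, 政府, 府. A title of n characters yields n(n+1)/2 roots, and a title with repeated characters yields duplicate roots, which a caller may wish to remove before counting.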

Embodiment two of the invention uses title root acquisition mode (two), which is specifically:

The web page title is passed to third-party word segmentation software, which segments it to obtain the title roots;

This mode effectively reduces the number of roots and improves search efficiency. For example, analyzing the title "Chinese Government releases new intellectual property measures" with word segmentation software yields roots such as "Chinese Government", "intellectual property", and "measures";

In the embodiments of the invention, a counter with an initial value of 0 can be provided for each title root to record the number of times that title root occurs in the web page;

In other embodiments of the invention, other counting approaches can be used to record the number of times a title root occurs in the web page without affecting the implementation of the invention;

The method provided by the embodiments of the invention for searching the web page for title roots and counting the number of times each title root occurs in the page is shown in Fig. 3 and specifically comprises:

Step 301: read the effective text data from the web page source file in reading order;

The web page source file contains effective text data, tag data, and descriptive data. Different kinds of data carry different tags; during reading, the present invention uses regular expressions or another string processing method to remove the non-text content of the source file and obtain the effective text data;

Those skilled in the art know that regular expressions are a commonly used string processing method;

The effective text data is the text content displayed on the web page, which may be Chinese or text in another language. The tag data and descriptive data are the commands of the markup language used in the source file to display the page content. Taking HTML as an example, these include the text display command <p></p>, the image display command <img>, the table display command <table></table>, the link display command <a href=www.sina.com.cn>Sina</a>, and so on;

Text can be displayed on a web page without any markup in the source code. Markup is used when attributes or a position need to be set for the text; for example, <font size=10 color=red>hello</font> displays the text "hello" on the page in red with a font size of 10;

Step 302: set a comparison string variable Str;

Step 303: read a character S_i from the effective text data as the current page content root and set Str = S_i. Match Str against each title root in the list; if a title root matches, add 1 to the counter of that title root, indicating that the title root has occurred once in the page. After the match is finished, append the character S_{i+1} after S_i in reading order, set Str = S_i S_{i+1}, and match Str against each title root in the list again; if a match succeeds, add 1 to the counter of the corresponding title root. Continue in this way until the page content root consists of 15 characters appended in reading order, that is, Str = S_i S_{i+1} S_{i+2} ... S_{i+j}, where S_i to S_{i+j} are 15 characters. Then end this step and go to step 304;

Step 304: read the next character S_{i+1} of the effective text data and determine whether it is the last character of the effective text data; if not, repeat step 303; if so, go to step 305;

Step 305: match this character against the title roots; if a match succeeds, add 1 to the counter of the corresponding title root; the whole flow then ends;

In step 303, if a non-text character such as a punctuation mark or a space is read, this step may likewise be ended and step 304 entered;

In step 303, the page content root is extended to at most 15 characters because the length of a title root generally does not exceed 15 characters;

The probability that a title root containing only one character becomes a keyword is very low, so title roots containing only one character may be disregarded when selecting keywords;

The number of keywords for a web page can be determined according to the needs of the web page provider.
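By way of illustration only, steps 301 to 305 can be sketched in Python as follows. The sketch first strips tags with regular expressions to obtain the effective text data, then grows a page content root from every starting character up to 15 characters, stopping early at punctuation or spaces. The names effective_text, count_title_roots, and MAX_ROOT_LEN are assumptions made for this sketch, and treating any non-alphanumeric character as a non-text character is a simplification rather than a rule stated in the original disclosure.

```python
import re
from html import unescape

MAX_ROOT_LEN = 15  # a title root generally does not exceed 15 characters


def effective_text(source):
    """Step 301: remove non-text content from the page source."""
    source = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", " ", source)
    source = re.sub(r"(?s)<!--.*?-->", " ", source)
    text = re.sub(r"(?s)<[^>]+>", " ", source)  # drop the remaining tags
    return re.sub(r"\s+", " ", unescape(text)).strip()


def count_title_roots(text, roots, max_len=MAX_ROOT_LEN):
    """Steps 302 to 305: count how often each title root occurs in text."""
    counters = {root: 0 for root in roots}  # one counter per root, initially 0
    for i in range(len(text)):
        candidate = ""                       # the comparison string Str
        for ch in text[i:i + max_len]:
            if not ch.isalnum():             # punctuation or space ends the pass
                break
            candidate += ch                  # Str = S_i S_{i+1} ...
            if candidate in counters:
                counters[candidate] += 1
    return counters
```

Replacing each tag with a space means that text separated by markup is never joined into a single page content root. Storing the counters in a dictionary keyed by root also removes duplicate title roots, so each distinct root is counted once per occurrence.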
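By way of illustration only, the selection of step 204 can be sketched in Python as follows. The sketch discards title roots of a single character and roots that never occurred, then returns the requested number of roots with the highest counts. The function name select_keywords is an assumption made for this sketch, and breaking ties in favour of the longer root is likewise an assumption, since the original disclosure does not state how ties are resolved.

```python
def select_keywords(counters, how_many=1):
    """Pick the most frequent title roots as web page keywords."""
    candidates = {
        root: count
        for root, count in counters.items()
        if len(root) > 1 and count > 0  # ignore one-character and unseen roots
    }
    # Highest count first; among equal counts, prefer the longer root.
    ranked = sorted(candidates, key=lambda r: (-candidates[r], -len(r)))
    return ranked[:how_many]
```

The how_many parameter corresponds to the number of keywords chosen by the web page provider.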

More than be the description of method provided by the present invention, the inventive method has multiple application, will introduce respectively below:

(1) Application one:

On a website where information is published by users, such as a forum, or on a website with a huge number of pages, the keyword acquisition method provided by the invention is used to obtain the keywords of each page and store them in a database together with the page content. When a user requests a page, the server obtains the keywords of that page from the database, searches the database for pages having the same keywords according to the user's needs, and offers them to the user;

Because the content of forum posts is determined by ordinary users, obtaining keywords by manually reading the page content makes it impossible to store the keywords in the database in real time. Moreover, if a user modifies the page content so that the keywords change, a manual approach cannot update the stored keywords in time, and the pages found no longer meet the user's needs. The method provided by the present invention avoids these problems;
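By way of illustration only, the storage and lookup of application one can be sketched in Python as follows. The sketch keeps an in-memory index from keyword to page addresses and re-indexes a page whenever its keywords change, which corresponds to the case of a user editing a post. The class name PageStore and its methods are assumptions made for this sketch; an actual system would use the web page database described above.

```python
from collections import defaultdict


class PageStore:
    """Hypothetical in-memory stand-in for the web page database."""

    def __init__(self):
        self.keywords_by_url = {}
        self.urls_by_keyword = defaultdict(set)

    def save(self, url, keywords):
        """Store a page's keywords, replacing any earlier ones."""
        # Drop stale index entries first so an edited page is re-indexed.
        for old in self.keywords_by_url.get(url, ()):
            self.urls_by_keyword[old].discard(url)
        self.keywords_by_url[url] = list(keywords)
        for keyword in keywords:
            self.urls_by_keyword[keyword].add(url)

    def related(self, url):
        """Return the other pages sharing at least one keyword with url."""
        found = set()
        for keyword in self.keywords_by_url.get(url, ()):
            found |= self.urls_by_keyword[keyword]
        found.discard(url)
        return sorted(found)
```

Calling save again for an address whose content has changed keeps the lookup consistent with the current keywords of the page.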

(2) Application two:

Application two further optimizes application one to make it more convenient for the user. After the keywords of the current page have been obtained with the method provided by the invention, the web page provider not only stores each keyword together with the page content in the database, but also lists the keywords on the page and sets a link for each keyword. The link points to the addresses of the one or more pages having that keyword, and the user can follow the keyword links of interest according to his or her own needs;

(3) Application three:

This application provides the hottest recent keywords on the network. Web page keywords are obtained with the method provided by the invention, the number of times each web page keyword occurs across multiple web pages is counted, and the keywords with the highest counts are the hottest keywords;

The hottest keywords are the web page keywords that have recently occurred frequently in multiple web pages with the same or similar topics;
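By way of illustration only, the counting of application three can be sketched in Python as follows. The sketch tallies, over a set of pages, how many pages carry each web page keyword and returns the most frequent ones. The function name hottest_keywords and the mapping from page address to keyword list are assumptions made for this sketch; restricting the input to recently published pages is left to the caller.

```python
from collections import Counter


def hottest_keywords(page_keywords, how_many=1):
    """Return the keywords that occur in the largest number of pages."""
    tally = Counter()
    for keywords in page_keywords.values():
        tally.update(set(keywords))  # count each keyword once per page
    return [keyword for keyword, _ in tally.most_common(how_many)]
```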

(4) Application four:

Because the invention provides a method for obtaining web page keywords automatically, when related page information is retrieved, related pages from websites other than the present website can also be provided as needed. It is only necessary to crawl the related pages of the other websites, obtain the keywords of the crawled pages with the method provided by the invention, and store them in the web page database.

Web crawling is a technique for obtaining web pages, and comprises the following steps:

One: obtain the content of a web page according to its address. Different programming languages provide different interface functions for obtaining page content; for example, in the PHP language a GetContentString() function can be used to obtain the page content at a specified address;

Two: after a page has been obtained, analyze its content and use regular expressions to obtain the links it contains; then use the GetContentString() function again to obtain the page content behind each link, and so on. Multiple levels of pages can be obtained as needed, and the page content and its corresponding address are then stored in the web page storage unit.

The number of pages obtained by crawling is huge. If keywords were still obtained manually in this case, a great deal of manual work would be required, which is time-consuming and laborious.
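By way of illustration only, the two crawling steps above can be sketched in Python as follows. The sketch fetches a page, extracts its links with a regular expression, and follows them level by level up to a chosen depth. The names fetch, crawl, and LINK_RE are assumptions made for this sketch; fetch plays the role that the text assigns to GetContentString(), and the link pattern is a simplification that does not cover every form of HTML link.

```python
import re
from urllib.parse import urljoin
from urllib.request import urlopen

# Simplified pattern for the href value of an <a> tag.
LINK_RE = re.compile(r'''(?i)<a\s[^>]*?href\s*=\s*["']?([^"'\s>]+)''')


def fetch(url):
    """Step one: obtain the content of the page at the given address."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, depth=1, fetch_page=fetch):
    """Step two: follow links level by level; return {address: content}."""
    pages = {}
    frontier = [start_url]
    for _ in range(depth + 1):
        next_frontier = []
        for url in frontier:
            if url in pages:      # already crawled
                continue
            source = fetch_page(url)
            pages[url] = source
            for href in LINK_RE.findall(source):
                next_frontier.append(urljoin(url, href))  # resolve relative links
        frontier = next_frontier
    return pages
```

The fetch_page parameter allows a different retrieval function to be supplied, for example one that reads from a local cache. The returned mapping of address to content corresponds to what the text stores in the web page storage unit.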

Fig. 4 is a block diagram of a system that applies the method provided by the invention in a web crawling system. The system comprises:

a web crawling unit, used to obtain web pages;

a web page search unit, used to search the web page storage unit and obtain the pages having the same keywords as the currently browsed page.

Fig. 5 is the system shown in Figure 4 workflow, comprising:

Step 501: the user requests a web page from the website server;

Step 502: the web page keyword acquisition unit analyzes this page and obtains at least one keyword of the page;

Step 503: the web crawling unit crawls web pages as needed and stores them in the database;

Step 504: the web page keyword acquisition unit processes each of the pages crawled in step 503, obtains the keywords of each page, and stores each page together with its keywords in the web page storage unit;

Step 505: the web page search unit searches the web page storage unit, according to the keywords obtained in step 502, for pages having the same keywords and offers them to the user;

Steps 503 and 504 can be carried out in advance.

The method for obtaining web page keywords and its application system provided by the present invention have been described in detail above. Specific examples have been used herein to set forth the principle and implementation of the invention, and the description of the above embodiments is only intended to help in understanding the method of the invention and its core idea. At the same time, a person of ordinary skill in the art may, in accordance with the idea of the invention, make changes to the specific implementation and the scope of application. In summary, the content of this description should not be construed as limiting the present invention.

Claims

1. A method for obtaining a web page keyword, characterized in that the method comprises:

obtaining the web page title and segmenting it into words to obtain title roots;

searching the web page for each title root and counting the number of times the title root occurs in the page;

selecting at least one title root with a high occurrence count in the page as the web page keyword.

2. The method according to claim 1, characterized in that segmenting the web page title to obtain title roots is specifically:

traversing each character of the web page title in reading order;

in each traversal, first saving the current character as a title root;

then appending the following character or character string in order to this title root and saving each result as a further title root.

3. The method according to claim 1 or 2, characterized in that the method further comprises providing a counter for each title root;

reading, in reading order, the effective text data from the web page source file;

traversing each character of the effective text data;

in each traversal, first matching the current character, taken as a page content root, against the title roots and, if a match succeeds, adding 1 to the counter of the corresponding title root;

then appending the following character or character string in order to this page content root, matching the result, as a page content root, against the title roots and, if a match succeeds, adding 1 to the counter of the corresponding title root.

4. The method according to claim 1, characterized in that the method further comprises: storing the web page keyword and the web page in a web page database.

5. The method according to claim 1 or 4, characterized in that the method further comprises: counting the number of times the web page keywords occur across multiple web pages, and selecting at least one web page keyword with a high occurrence count as a hottest keyword.

6. The method according to claim 5, characterized in that the method further comprises: listing the web page keywords and/or the hottest keywords on a web page and setting a link for each of them.

7. A system applying the method for obtaining a web page keyword, characterized in that the system comprises: a web page keyword acquisition unit, a web page storage unit, and a web page search unit;

8. The system according to claim 7, characterized in that the system further comprises: a web crawling unit, used to obtain web pages;