CN102253998A - Method for automatically discovering and sequencing outdated webpage based on Web time inconsistency - Google Patents


Info

Publication number
CN102253998A
Authority
CN
China
Prior art keywords
time
webpage
inconsistent
web
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN 201110194133
Other languages
Chinese (zh)
Other versions
CN102253998B (en
Inventor
李石君
甘琳
杨莎
刘世超
刘咏宁
李宇轩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN 201110194133 priority Critical patent/CN102253998B/en
Publication of CN102253998A publication Critical patent/CN102253998A/en
Application granted granted Critical
Publication of CN102253998B publication Critical patent/CN102253998B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

The invention provides a method for automatically discovering and ranking outdated webpages based on Web time inconsistency. Based on the life cycle of a webpage, the method comprises the following steps: establishing a multi-dimensional time vector of a webpage; extracting the multi-dimensional time values of the webpage; establishing a Web time inconsistency model; providing a novel method for reasoning about and measuring Web time inconsistency; constructing a principle framework for resolving Web time inconsistency; and applying the framework to (1) the automatic discovery and ranking of outdated webpages on a website, which relieves website maintainers of laborious manual inspection and saves human resources, (2) quality ranking of similar websites based on the time inconsistency measure, such as ranking government or university websites by information freshness, and (3) a time-sensitive ranking method for search engines, so that users can find the latest information more conveniently and the content quality and user evaluation of websites improve.

Description

Method for automatically discovering and ranking outdated webpages based on Web time inconsistency
Technical field
The present invention relates to technical fields such as the temporal Web, webpage quality evaluation and timeliness measurement, time-aware Web information retrieval systems, Web information extraction, and temporal databases, and in particular to a method and system for automatically discovering and ranking outdated webpages based on Web time inconsistency.
Background art
Temporal Web: in recent years, the temporal Web has increasingly become a focus of scholarly attention. WWW (International World Wide Web Conference), the most authoritative international conference in Web science, set up a dedicated "Temporal Web" workshop in 2011. Na Dai et al. built a temporal Web graph from multiple Web snapshots taken at different points in time and established a Web surfing model to construct the freshness of each page [1]. Marius Pasca found that, when timestamped Web documents are retrieved, users' search terms may be highly time-related, and that these time terms can be divided into explicit and implicit classes [2]. Yun Chi et al. analyzed structure and temporal dynamics to discover communities through "community factorization" [3]. These results on the temporal Web provide a theoretical basis for the present invention; however, although temporal Web research is fairly systematic and in-depth, it has neither defined nor studied Web time inconsistency.
Webpage quality evaluation and timeliness measurement: existing studies show that, although some results have revealed to a certain extent the temporal distribution characteristics and time sensitivity of network resources [2,5], few scholars have specifically studied the time consistency of webpage content. Fang Binxing et al. studied a new dimension (social annotation) for using webpage quality assessment to improve Web search performance [4]. Zhong Hua, Huang Tao et al. of the Chinese Academy of Sciences proposed a resource-sensitive performance diagnosis method [5]. Chen Chuanfu et al. constructed a judgment matrix of timeliness indicators while using the analytic hierarchy process to determine the index weights at each level [6]. Brian D. Davison et al. used webpage freshness to assess webpage quality, measuring freshness from both the page itself and its in-link pages [7]. In fact, all of the above evaluation indicators address the overall quality of website content and information timeliness in a general sense; none of them models or measures the time inconsistency of webpages.
Time-aware Web information retrieval systems: link-analysis scoring methods, represented by PageRank, are used to express the rank and importance of webpages. Although mainstream search algorithms such as PageRank have begun to consider the time dimension, they merely refer to the update time of a webpage and consider neither the consistency of the webpage nor the temporal information in the keywords that express the user's search intent. Their ranking results therefore show a certain deviation in time-sensitive searches and are often unsatisfactory [8]. Extending and deepening the temporal aspects of existing retrieval models has thus become inevitable. In recent years, retrieval systems for time-based information have continued to appear. Klaus Berberich et al. proposed an index structure that efficiently supports high-performance retrieval over documents carrying temporal information [9], but this structure supports only queries based on time points, not queries carrying time-interval information. Susan T. Dumais studied how time-varying user interests affect user access patterns and proposed an information retrieval model that incorporates the temporal evolution of content [10]. Zhumin Chen studied a ranking model for time-sensitive webpages based on publication time (P-time) and proposed a method for inferring the publication time of a webpage when its text contains no explicit P-time [11].
Web information extraction: Gerhard Weikum et al. studied knowledge harvesting of named entities, their semantic classes, and their mutual relationships [12]. Steven Schockaert et al. proposed a fuzzification framework based on Allen's interval algebra that extracts temporal information from Web documents with simple heuristics, improves the reliability of the extracted information through fuzzy temporal reasoning, and handles conflicts caused by the vagueness of events [13]. Utku Irmak and Reiner Kraft studied the recognition of named structured entities and proposed a new three-level bootstrapping framework for detecting semi-structured entities, covering phone, date, and time entities, with an extensive evaluation on English, German, Polish, Swedish, and Turkish documents [14]. Tim Weninger et al. proposed a method for extracting the content of various webpages based on the ratio of text to tags [15]. Mohammed Kayed et al. proposed a page-level Web extraction method based on webpage templates [16].
Temporal databases and other related research: temporal database technology incorporates temporal information into traditional databases. Research on temporal data has recently achieved major results in theory, models, and standardization, including temporal database models, the historical relational model, historical relational algebra, and the GT model [18]; the most widely used at present is TSQL2 (Temporal Extension to the SQL-92 Language), which is based on the bitemporal conceptual model. In China, Tang Yong et al. used temporal logic and dynamic logic to model the time axis of temporal databases axiomatically, and designed and implemented a temporal data processing principle system [19]. Alessandro Artale et al. extended conceptual data models to the time dimension, proposed temporal conceptual models, studied reasoning over aspects such as timestamping, evolution, transition, and life cycle, and measured the complexity of that reasoning [20]. Haiquan Chen et al. cleaned spatio-temporally redundant data using a method based on Bayesian inference [21].
In summary, existing research has gone fairly deep into the temporal Web and into webpage quality evaluation and timeliness measurement, but no systematic, in-depth research has yet been carried out on the modeling, reasoning, and measurement of Web time inconsistency, or on the automatic discovery and ranking of outdated website information.
List of references
[1] Na Dai, Brian D. Davison: Freshness Matters: In Flowers, Food, and Web Authority. SIGIR 2010: 114-121.
[2] Marius Pasca: Towards Temporal Web Search. SAC, March 16-20, 2008: 1117-1121.
[3] Yun Chi, Shenghuo Zhu, Xiaodan Song, Jun'ichi Tatemura, Belle L. Tseng: Structural and temporal analysis of the blogosphere through community factorization. KDD 2007: 163-172.
[4] Liu Kaipeng, Fang Binxing: A webpage ranking algorithm based on social annotation. Chinese Journal of Computers, Vol. 33(6), 2010: 1014-1023.
[5] Wang Wei, Zhang Wenbo, Wei Jun, Zhong Hua, Huang Tao: A resource-sensitive performance diagnosis method for Web applications. Journal of Software, Vol. 21(2), 2010: 194-208.
[6] Chen Chuanfu, Tang Qiong, Yu Yuan, Wu Zhiqiang, et al.: Timeliness measurement of scientific information on the network. Information Journal, Vol. 28(4), 2009: 610-617.
[7] Na Dai, Brian D. Davison: Capturing Page Freshness for Web Search. SIGIR 2010: 871-872.
[8] Junghoo Cho, Sourashis Roy, Robert E. Adams: Page Quality: In Search of an Unbiased Web Ranking. SIGMOD 2005: 551-562.
[9] Klaus Berberich, Srikanta J. Bedathur, Thomas Neumann, Gerhard Weikum: A time machine for text search. SIGIR 2007: 519-526.
[10] Susan T. Dumais: Temporal dynamics and information retrieval. CIKM 2010: 7-8.
[11] Zhumin Chen, Jun Ma, Chaoran Cui, Hongxing Rui, Shaomang Huang: Web Page Publication Time Detection and its Application for Page Rank. SIGIR 2010: 859-860.
[12] Gerhard Weikum, Martin Theobald: From information to knowledge: harvesting entities and relationships from web sources. PODS 2010: 65-76.
[13] Steven Schockaert, Martine De Cock, Etienne E. Kerre: Reasoning about fuzzy temporal information from the web: towards retrieval of historical events. Soft Comput. (SOCO), 2010, Vol. 14(8): 869-886.
[14] Utku Irmak, Reiner Kraft: A scalable machine-learning approach for semi-structured named entity recognition. WWW 2010: 461-470.
[15] Tim Weninger, William H. Hsu, Jiawei Han: CETR: content extraction via tag ratios. WWW 2010: 971-980.
[16] Mohammed Kayed, Chia-Hui Chang: FiVaTech: Page-Level Web Data Extraction from Template Pages. IEEE Transactions on Knowledge and Data Engineering, 2010, Vol. 22(2): 249-263.
[17] Li Shijun, Yu Junqing, Ou Weijie: A Web information extraction method based on HTML pattern algebra. Journal of Computer Research and Development, 2006, Vol. 43(9): 1644-1650.
[18] Fusheng Wang, Carlo Zaniolo, Xin Zhou: ArchIS: an XML-based approach to transaction-time temporal database systems. The VLDB Journal, 2008, 17: 1445-1463.
[19] Liu Dongning, Tang Yong: A dynamic logic model of the time axis of temporal databases. Journal of Software, Vol. 21, No. 4, April 2010: 694-701.
[20] Alessandro Artale, Roman Kontchakov, Vladislav Ryzhikov, Michael Zakharyaschev: Complexity of Reasoning over Temporal Data Models. ER 2010: 174-187.
[21] Haiquan Chen, Wei-Shinn Ku, Haixun Wang, Min-Te Sun: Leveraging Spatio-Temporal Redundancy for RFID Data Cleansing. SIGMOD 2010: 51-62.
Summary of the invention
To address the technical problems described above, the present invention, based on the life cycle of a webpage, proposes a method for automatically discovering and ranking outdated webpages based on Web time inconsistency. The invention involves the concepts of "Web time consistency" and "Web time inconsistency". Web time inconsistency means that, at the current moment, the time expressed by a webpage is ambiguous with respect to, or conflicts with, the actual time that the user cares about and understands. This concept is an important indicator for evaluating the quality of network information and bears on the timeliness and accuracy of webpage content.
To solve the above technical problems, the present invention adopts the following technical solution:
One: a method for automatically discovering and ranking outdated webpages based on Web time inconsistency, comprising the following steps:
Step 1: to address the differing time sensitivity of webpage information and the time inconsistency problems existing on the Web, establish the Web time inconsistency model, which comprises a webpage time inconsistency model, a webpage-column time inconsistency model, and a time inconsistency model for identical columns of different websites. This step further comprises the following substeps:
1-1: analyze the time sensitivity of webpage information: classify webpages by topic and by how their information changes over time, and estimate the interval of time sensitivity for each class of webpage;
1-2: establish the Web time relationship vector model using the logical order of Web information on the time axis;
1-3: for the time inconsistency that exists within each class of webpage itself, build the webpage time inconsistency model from the Web time relationship vector model; this model comprises a time-delay inconsistency model, a constrained inconsistency model, and an unconstrained inconsistency model;
1-4: for the time inconsistency between the temporal information of the webpages in a website column and the meaning of that column, build the webpage-column time inconsistency model from the Web time relationship vector model; this model comprises a time-delay inconsistency model, a constrained inconsistency model, and an unconstrained inconsistency model;
1-5: for the time inconsistency among webpages that describe the same information under identical columns of different websites, build the time inconsistency model for identical columns of different websites from the Web time relationship vector model; this model comprises a comparison inconsistency model and a prediction inconsistency model;
Step 2: perform multi-dimensional time extraction on Web information using a time knowledge concept model, regular-grammar matching, and pattern algebra, where the time dimensions comprise the event occurrence time, writing time, publication time, reading time, reprint time, and text expiration time;
Step 3: classify webpages according to the Web time inconsistency model and, using the extracted multi-dimensional times, measure Web time inconsistency to obtain the time inconsistency degree of each webpage. The measurement covers the time inconsistency of the webpage itself, between a webpage and its column, and between identical columns of different websites;
The time inconsistency degree of a webpage itself, and between a webpage and its column, is $\mathrm{InCon}(W) = \sum_{i=1}^{n} \alpha_i \times \mathrm{webpage.Inconsistency}(i)$, where $\mathrm{InCon}(W)$ is the time inconsistency degree of webpage $W$; $n$ is the number of types of time inconsistency problem found among the time-inconsistent webpages of the website; $\alpha_i$ is the weight of time inconsistency problem type $i$; and $\mathrm{webpage.Inconsistency}(i)$ is the degree of the type-$i$ time inconsistency problem of webpage $W$;
The time inconsistency degree between identical columns of different websites comprises the comparison inconsistency degree and the prediction inconsistency degree. The comparison inconsistency degree is $\mathrm{InConCompare}(W_1, W_2) = \frac{1}{2}\left(\frac{\cos\langle x, y\rangle}{|x||y|} + 1\right)$, where $\mathrm{InConCompare}(W_1, W_2)$ is the comparison inconsistency degree of webpages $W_1$ and $W_2$, and $x$, $y$ are the event description vectors of $W_1$ and $W_2$ respectively. The prediction inconsistency degree is $\mathrm{InConPre}(W_1, W_2) = \frac{1}{2}\left(\frac{\cos\langle x, y\rangle}{|x||y|} + 1\right)$, where $\mathrm{InConPre}(W_1, W_2)$ is the prediction inconsistency degree of webpages $W_1$ and $W_2$, and $x$, $y$ are the event description vectors of $W_1$ and $W_2$ respectively;
Step 4: construct the time inconsistency rule set and, by means of this rule set, logical reasoning operators based on it, and time knowledge concepts, perform Web time inconsistency reasoning on the extracted multi-dimensional times. The time inconsistency rule set comprises a time-delay inconsistency rule set, a constrained inconsistency rule set, an unconstrained inconsistency rule set, a comparison inconsistency rule set, and a prediction inconsistency rule set. Web time inconsistency reasoning comprises reasoning about unknown dimension time values in a webpage's time relationship vector, time inconsistency reasoning over webpage information on the same topic, and statistical reasoning about time inconsistency across identical columns of different websites;
Step 5: given a website address entered by the user, derive the time inconsistency degree of each webpage from the Web time inconsistency model, Web time inconsistency measurement, and Web time inconsistency reasoning; automatically discover the outdated webpages of the website; and produce a list of outdated webpages ordered by time inconsistency degree.
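As an illustration of the weighted sum InCon(W) used in Step 3, the following minimal Python sketch combines per-type inconsistency degrees with weights. The type names (`delay`, `constrained`, `unconstrained`) and the weight values are hypothetical; the source does not specify how the weights are chosen.

```python
from typing import Mapping


def incon(degrees: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """Return InCon(W) = sum_i alpha_i * Inconsistency(i) for one webpage."""
    total = 0.0
    for kind, degree in degrees.items():
        if not 0.0 <= degree <= 1.0:
            raise ValueError(f"degree for {kind!r} must lie in [0, 1]")
        total += weights[kind] * degree
    return total


# Hypothetical inconsistency types and weights, chosen only for illustration.
weights = {"delay": 0.5, "constrained": 0.3, "unconstrained": 0.2}
degrees = {"delay": 1.0, "constrained": 0.4, "unconstrained": 0.0}
score = incon(degrees, weights)  # 0.5*1.0 + 0.3*0.4 + 0.2*0.0
```

With weights that sum to 1, the result stays in [0, 1], which keeps it comparable across webpages.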
On the basis of Steps 1 to 5, the method may further comprise Step 6:
Step 6: based on the Web time inconsistency model, Web time inconsistency measurement, and Web time inconsistency reasoning, rank similar websites and webpages by site information freshness, where:
The site information freshness is $\mathrm{FScore} = \frac{\mathrm{FineFScore} + \mathrm{CourseFScore}}{2}$. FineFScore is the fine-grained freshness, $\mathrm{FineFScore} = 1 - \frac{\sum_{i=1}^{n} \sum_{j=1}^{m} \mathrm{webpage}(j).\mathrm{Inconsistency}(i)}{n \times m}$; CourseFScore is the coarse-grained freshness, $\mathrm{CourseFScore} = 1 - \frac{N_{inconsistency}(\mathrm{webpage})}{m}$. Here $m$ is the number of webpages in the website; $n$ is the number of types of time inconsistency problem found among the time-inconsistent webpages of the website; $\mathrm{webpage}(j).\mathrm{Inconsistency}(i)$ is the degree of the type-$i$ time inconsistency problem in the $j$-th webpage; $\mathrm{Inconsistency}(i)$ is the degree of the type-$i$ time inconsistency problem, $0 \le \mathrm{Inconsistency}(i) \le 1$; and $N_{inconsistency}(\mathrm{webpage})$ is the number of webpages in the website that exhibit time inconsistency.
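The freshness formulas of Step 6 can be sketched in Python as follows. The source does not state how a webpage is counted as time-inconsistent for the coarse-grained score; this sketch assumes the Step 5-2 criterion (largest degree at or above a threshold `a`), which is an assumption rather than something the patent spells out here.

```python
from typing import Sequence


def freshness(pages: Sequence[Sequence[float]], a: float = 0.5) -> float:
    """Return FScore for one site; pages[j][i] is webpage(j).Inconsistency(i)."""
    if not pages or not pages[0]:
        raise ValueError("need at least one page and one inconsistency type")
    m, n = len(pages), len(pages[0])
    # Fine-grained freshness: one minus the mean of all inconsistency degrees.
    fine = 1.0 - sum(sum(p) for p in pages) / (n * m)
    # Assumption: a page counts as time-inconsistent when its largest degree >= a.
    n_inconsistent = sum(1 for p in pages if max(p) >= a)
    # Coarse-grained freshness: one minus the share of inconsistent pages.
    coarse = 1.0 - n_inconsistent / m
    return (fine + coarse) / 2.0


# Two pages, two inconsistency types: the first page is clean, the second is not.
site = [[0.0, 0.0], [1.0, 0.5]]
fscore = freshness(site)
```

For this example, FineFScore is 1 - 1.5/4 = 0.625 and CourseFScore is 1 - 1/2 = 0.5, so FScore is 0.5625.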
On the basis of Steps 1 to 5, the method may further comprise Step 7:
Step 7: based on the Web time inconsistency model and on the time semantics expressed in the user's search keywords, rank the retrieval results in a time-aware manner according to their Web time inconsistency problems and the corresponding time inconsistency degrees.
The time-delay inconsistency model, constrained inconsistency model, and unconstrained inconsistency model in substeps 1-3 and 1-4 of Step 1 are as follows:
Time-delay inconsistency model:
The delay between the occurrence time and the publication time of webpage event $e$ is $D_{happen}^{publish}(e) = T_{publish}(e) - T_{happen}(e)$, where $T_{publish}(e)$ and $T_{happen}(e)$ are the publication time and the occurrence time of webpage event $e$ respectively. When
Figure BDA0000075106470000072
holds, the time delay is consistent, and the time-delay consistency degree is
Figure BDA0000075106470000073
When
Figure BDA0000075106470000074
holds, the time delay is inconsistent and the inconsistency degree is ConD = 1. Here $a$ is the critical value of time-delay inconsistency, set according to the time sensitivity of the webpage information;
Constrained inconsistency model:
$\mathrm{ConUC} = e^{T_{relative} \ln 0.6} = 0.6^{T_{relative}}$, where ConUC is the constraint time consistency degree and $T_{relative}$ is the relative time, $T_{relative} = \frac{T_{read} - T_{publish}}{T_{out} - T_{publish}}$;
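The constrained model can be sketched directly from its two formulas. The function names below are illustrative, and the example dates are invented; only the formulas themselves come from the text.

```python
from datetime import datetime


def relative_time(t_read: datetime, t_publish: datetime, t_out: datetime) -> float:
    """T_relative = (T_read - T_publish) / (T_out - T_publish)."""
    lifetime = (t_out - t_publish).total_seconds()
    if lifetime <= 0:
        raise ValueError("expiration time must be later than publication time")
    return (t_read - t_publish).total_seconds() / lifetime


def con_uc(t_relative: float) -> float:
    """ConUC = e^(T_relative * ln 0.6) = 0.6 ** T_relative."""
    return 0.6 ** t_relative


# Hypothetical page: published on 1 July, expires on 11 July, read on 6 July.
t_publish = datetime(2011, 7, 1)
t_out = datetime(2011, 7, 11)
t_rel = relative_time(datetime(2011, 7, 6), t_publish, t_out)
consistency = con_uc(t_rel)
```

Under this formula the consistency degree is 1 at publication time, falls to 0.6 exactly at the expiration time, and keeps decaying afterwards.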
Unconstrained inconsistency model:
$\mathrm{ConUC} = \begin{cases} 1, & 0 \le t \le b \\ \left(1 + a^{-2}(t - b)^{-2}\right)^{-1}, & t \ge b \end{cases}$, $\mathrm{InConUC} = \begin{cases} 0, & 0 \le t \le c \\ \left(1 + \left(\frac{t - c}{a}\right)^{-2}\right)^{-1}, & t \ge c \end{cases}$
where ConUC is the constraint time consistency degree of the webpage; InConUC is the constraint time inconsistency degree of the webpage; $a$ is the information sensitivity of the webpage; and $[b, c]$ is the neighbourhood interval of the expiration time $T_{out}$, with $T_{out} = T_{publish} + \frac{1}{a}$.
The workflows of the comparison inconsistency model and the prediction inconsistency model in substep 1-5 of Step 1 are as follows:
The workflow of the comparison inconsistency model is:
1. Perform event mining on webpage $W_1$ to obtain the event description vector $x$ of this webpage;
2. From the publication time $T_{publish}(e)$ of the event $e$ described by webpage $W_1$, determine the neighbourhood time interval $[T_{publish}(e) - \delta, T_{publish}(e) + \delta]$ of $W_1$, with $\delta > 0$, $\delta \to 0$;
3. From the event description vector $x$ of webpage $W_1$, determine the retrieval keywords and search for related webpages;
4. Match the publication time of each related webpage found against the neighbourhood time interval: keep the webpage if its publication time lies within the interval, otherwise discard it; this yields a related webpage set $W$;
5. Perform time dimension extraction on all webpages in the related webpage set $W$ and determine their event description vectors $y_1, y_2, \ldots, y_n$, where $n$ is the number of webpages in $W$;
6. Compare the event description vectors $x$ and $y_i$, $i \in [1, n]$: if $\cos\langle x, y_i\rangle \le 0$, the webpages are comparison-inconsistent; otherwise, stop;
7. Stop.
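Steps 4 and 6 of the comparison workflow can be sketched as below. The sketch assumes that event description vectors are plain numeric sequences of equal length and that publication times are expressed as numbers (for example, days); the source does not say how the vectors are built, so both are assumptions. Event mining and keyword search (steps 1, 3, and 5) are outside the sketch.

```python
import math
from typing import List, Sequence, Tuple


def cosine(x: Sequence[float], y: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    if norm == 0.0:
        raise ValueError("event description vectors must be non-zero")
    return dot / norm


def comparison_inconsistent(
    x: Sequence[float],
    t_publish: float,
    related: Sequence[Tuple[float, Sequence[float]]],
    delta: float,
) -> List[int]:
    """Return indices of related pages that are comparison-inconsistent with x."""
    hits = []
    for index, (t_other, y) in enumerate(related):
        # Step 4: keep only pages published inside [t_publish - delta, t_publish + delta].
        if abs(t_other - t_publish) > delta:
            continue
        # Step 6: a non-positive cosine marks the pair as comparison-inconsistent.
        if cosine(x, y) <= 0:
            hits.append(index)
    return hits


# Hypothetical related pages: (publication time in days, event description vector).
related = [(10.0, [1.0, 1.0]), (10.5, [-1.0, 0.0]), (30.0, [-1.0, 0.0])]
hits = comparison_inconsistent([1.0, 0.0], 10.0, related, delta=1.0)
```

Only the second page is reported: the first agrees with `x`, and the third falls outside the neighbourhood interval even though its vector conflicts.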
The workflow of the prediction inconsistency model is:
1. Prediction webpage test for $W_1$: if $T_{happen}(e) > T_{publish}(e)$, or $T_{happen}(e)$ is empty, webpage $W_1$ is considered a prediction webpage; otherwise stop. Here $e$ is the event in webpage $W_1$;
2. Compare $T_{read}(e)$ and $T_{out}(e)$: if $T_{read}(e) \ge T_{out}(e)$, the webpage is prediction-inconsistent; stop. Otherwise go to step 3;
3. Perform event mining on the prediction webpage $W_1$ to obtain its event description vector $x$;
4. From the event description vector $x$ of webpage $W_1$, determine the retrieval keywords and search for related webpages;
5. Screen the retrieval results and choose the highly relevant webpages to obtain a related webpage set $W$; perform time dimension extraction on all webpages in $W$ and determine their event description vectors $y_1, y_2, \ldots, y_n$, where $n$ is the number of webpages in $W$;
6. Compare the event description vectors $x$ and $y_i$, $i \in [1, n]$: if $\cos\langle x, y_i\rangle \le 0$, the webpage is prediction-inconsistent; otherwise, stop;
7. Stop.
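The decision logic of the prediction workflow (steps 1, 2, and 6) can be sketched as follows. As with any sketch of this workflow, the event description vectors are assumed to be numeric sequences supplied by an event-mining step that is not shown; the function names are illustrative.

```python
import math
from datetime import datetime
from typing import Optional, Sequence


def cosine(x: Sequence[float], y: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    if norm == 0.0:
        raise ValueError("event description vectors must be non-zero")
    return dot / norm


def is_prediction_page(t_happen: Optional[datetime], t_publish: datetime) -> bool:
    """Step 1: the event lies after publication, or has no occurrence time."""
    return t_happen is None or t_happen > t_publish


def prediction_inconsistent(
    t_happen: Optional[datetime],
    t_publish: datetime,
    t_read: datetime,
    t_out: datetime,
    x: Sequence[float],
    related: Sequence[Sequence[float]],
) -> Optional[bool]:
    """Return None for non-prediction pages, else whether the page is inconsistent."""
    if not is_prediction_page(t_happen, t_publish):
        return None
    if t_read >= t_out:  # Step 2: read at or after the expiration time.
        return True
    # Step 6: any related page whose vector conflicts with x.
    return any(cosine(x, y) <= 0 for y in related)


published = datetime(2011, 7, 1)
happens = datetime(2011, 8, 1)
expires = datetime(2011, 8, 2)
late_read = prediction_inconsistent(
    happens, published, datetime(2011, 9, 1), expires, [1.0, 0.0], []
)
```

Returning `None` for non-prediction pages mirrors the "otherwise stop" branch of step 1, so callers can tell "not applicable" apart from "consistent".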
The time inconsistency rule sets in Step 4 are as follows:
Time-delay inconsistency rule set: let $t$ be the critical value of time-delay inconsistency for event $e$. If
Figure BDA0000075106470000081
then the time delay is consistent; if
Figure BDA0000075106470000091
then the time delay is inconsistent. Here $R$ is the temporal order relation in the temporal system $T$;
Constrained inconsistency rule set: for a webpage event $e$ and the current time $t$: a) if there exists a time point $s$ such that $Ge = 1$ holds when $R(t, s) = 1$, and
Figure BDA0000075106470000093
with $\tau(t, Ge) = 0$ when $R(s_0, t) = 1$ and $\tau(t, He) = 1$ when $R(t, s_0) = 1$, then event $e$ has expired, the expiration time point is $s_0$, and the positive expiration distance of event $e$ is $s_0 - t$. b) If
Figure BDA0000075106470000094
$Ge = 1$ holds when $R(s, s_0) = 1$ and $R(t, s) = 1$, and
Figure BDA0000075106470000095
then event $e$ is called valid, the expiration time point is $s_0$, and the negative expiration distance of event $e$ is $s_0 - t$. Here $R$ is the temporal order relation in the temporal system $T$; $Ge$ means that $e$ is true at all future times; $\tau$ is the truth-assignment function in the temporal system $T$; and $He$ means that $e$ has always been true in the past;
Unconstrained inconsistency rule set: let $t$ be the critical time, $t_0$ the current system time, $e$ the event described by webpage $W$, and $e'$ the event described by webpage $W'$. When $R(T_{publicate}(e), t) = 1$, webpage $W$ is considered unconstrained-inconsistent. For unconstrained-consistent webpages satisfying
Figure BDA0000075106470000096
where $T_{publicate}(e)$ and $T_{publicate}(e')$ satisfy $R(T_{publicate}(e), t_0) = 1$ and $R(T_{publicate}(e'), t_0) = 1$: when $R(T_{publicate}(e), T_{publicate}(e')) = 1$, webpage $W'$ has a higher priority than webpage $W$;
Comparison inconsistency rule set: let $e$ be the event described by webpage $W$ and $e'$ the event described by webpage $W'$. For any similar webpages $W$ and $W'$, when $T_{publish}(e') \in [T_{publish}(e) - \delta, T_{publish}(e) + \delta]$, if $\cos\langle \mathrm{Vector}(W), \mathrm{Vector}(W')\rangle \le 0$, the webpages are comparison-inconsistent;
Prediction inconsistency rule set: for every prediction-type webpage, if $R(T_{happen}(e), T_{publish}(e)) = 1$ or $R(T_{out}(e), T_{publish}(e)) = 1$, the webpage is considered prediction-inconsistent.
Step 5 further comprises the following substeps:
5-1: extraction, reasoning, and measurement for time-inconsistent webpages:
1. Information extraction: for the time-sensitive webpages selected by screening, extract temporal information, including the temporal information in the title of the column containing the webpage, the temporal information in the webpage title, and the time dimensions of the webpage content;
2. Web time inconsistency reasoning: apply the Web time inconsistency reasoning method of Step 4 to the time dimensions that could not be extracted, inferring the unknown dimensions of a webpage from its known dimensions and from the time dimensions of similar webpages;
3. Web time inconsistency measurement: on the basis of the extracted temporal information and through Web time inconsistency reasoning, recognize the time inconsistency pattern of each webpage and, depending on the pattern, apply the corresponding time inconsistency model to measure the time inconsistency;
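The information-extraction part of substep 5-1 can be illustrated with a small regular-expression sketch that pulls calendar dates out of text and stores them in a six-dimensional time vector. The `TimeVector` class, its field names, and the date pattern are all hypothetical; the patent's time knowledge concept model and pattern algebra are not reproduced here, and deciding which extracted date belongs to which dimension is left to the caller.

```python
import re
from dataclasses import dataclass, fields
from datetime import date
from typing import List, Optional

# Matches dates such as 2011-07-12, 2011/7/12 and 2011年7月12日.
DATE_RE = re.compile(r"(\d{4})[-/年](\d{1,2})[-/月](\d{1,2})日?")


@dataclass
class TimeVector:
    """The six time dimensions named in Step 2; None marks an unknown value."""

    happen: Optional[date] = None
    write: Optional[date] = None
    publish: Optional[date] = None
    read: Optional[date] = None
    reprint: Optional[date] = None
    out: Optional[date] = None

    def unknown(self) -> List[str]:
        """Names of the dimensions that still need to be inferred."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


def extract_dates(text: str) -> List[date]:
    """Return every valid calendar date found in the text, in order."""
    found = []
    for year, month, day in DATE_RE.findall(text):
        try:
            found.append(date(int(year), int(month), int(day)))
        except ValueError:  # e.g. month 13 or day 32
            continue
    return found


dates = extract_dates("Posted 2011-07-12; registration closes 2011年8月1日.")
# Assumption for illustration: first date is publication, second is expiration.
vector = TimeVector(publish=dates[0], out=dates[1])
```

The `unknown()` list is what the reasoning step would then try to fill in from known dimensions or from similar webpages.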
5-2: automatic discovery of the outdated webpages of the website:
1. From each type of time inconsistency measurement, obtain each type of time inconsistency degree for each webpage;
2. If $\max\{\mathrm{Inconsistency}(i)\} \ge a$, the webpage is considered outdated. Here $i$ is the time inconsistency type; $\mathrm{Inconsistency}(i)$ is the degree of the type-$i$ time inconsistency of the webpage; and $a$ is the outdated-webpage threshold, $0.5 \le a \le 1$, typically $a = 0.5$;
5-3: ranking of outdated webpages based on Web time inconsistency:
Rank the outdated webpages by each time inconsistency degree and by $\max\{\mathrm{Inconsistency}(i)\}$, and output classified outdated-webpage lists and a final ranking list. The classified lists comprise the list of webpages outdated by webpage time inconsistency, the list of webpages outdated by webpage-column time inconsistency, and the list of webpages outdated by time inconsistency between identical columns of different websites.
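Substeps 5-2 and 5-3 reduce to a threshold test followed by a sort. The sketch below shows the final ranking list only; the page identifiers and type names are invented, and the per-class lists described above would be built the same way from a single inconsistency type.

```python
from typing import Dict, List, Tuple


def rank_outdated(
    pages: Dict[str, Dict[str, float]], a: float = 0.5
) -> List[Tuple[str, float]]:
    """Return (page, max degree) for outdated pages, most inconsistent first."""
    if not 0.5 <= a <= 1.0:
        raise ValueError("threshold a must satisfy 0.5 <= a <= 1")
    scored = ((url, max(degrees.values())) for url, degrees in pages.items())
    # Step 5-2: a page is outdated when max{Inconsistency(i)} >= a.
    outdated = [(url, score) for url, score in scored if score >= a]
    # Step 5-3: order the outdated pages by their largest inconsistency degree.
    return sorted(outdated, key=lambda item: item[1], reverse=True)


# Hypothetical pages with per-type inconsistency degrees.
pages = {
    "/news/1": {"delay": 0.2, "constrained": 0.9},
    "/news/2": {"delay": 0.1, "constrained": 0.3},
    "/notice/7": {"delay": 1.0, "constrained": 0.0},
}
ranking = rank_outdated(pages)
```

Here `/news/2` is not reported because its largest degree, 0.3, is below the default threshold of 0.5.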
Step 7 further comprises the following substeps:
7-1: build a dictionary of words that express time, and classify the time words in the dictionary;
7-2: define a different constraint function for each class of time word;
7-3: according to the time words in the user's search keywords, match the constraint function, apply the corresponding retrieval model, and rank the retrieval results in a time-aware manner.
In step 7-1 the time words in the dictionary are divided into two classes: the first class contains time words that express the notion of "latest", and the second class contains time words that express "a period of time";
The constraint functions in step 7-2 fall into two classes. The first class of constraint function is defined for the first class of time words, as:
Figure BDA0000075106470000101
$T_{publish}(W') < t_0$, where $W'$ and $W$ are different webpages in the webpage list and $t_0$ is the current system time; when $T_{publish}(W) < T_{publish}(W')$, webpage $W'$ has a higher priority than webpage $W$, where $T_{publish}(W)$ and $T_{publish}(W')$ are the publication times of webpages $W$ and $W'$ respectively. The second class of constraint function is defined for the second class of time words, as $t(\mathrm{keyword}) < T_{publish}(W) < t_0$, where $W$ is any webpage in the webpage list, $t_0$ is the current system time, $t(\mathrm{keyword})$ is the time period expressed by the search keyword, and $T_{publish}(W)$ is the publication time of webpage $W$;
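The two classes of constraint function can be sketched as a sort and a filter over publication times. The `Page` tuple and function names are illustrative, and the first function follows only the textual description (publication before the current time, later publication ranks higher), since the corresponding formula image is not available.

```python
from datetime import datetime
from typing import List, Tuple

Page = Tuple[str, datetime]  # (URL, publication time); an illustrative shape


def latest_constraint(pages: List[Page], t0: datetime) -> List[Page]:
    """First class ("latest"): keep pages published before t0, newest first."""
    published = [p for p in pages if p[1] < t0]
    return sorted(published, key=lambda p: p[1], reverse=True)


def period_constraint(pages: List[Page], t_keyword: datetime, t0: datetime) -> List[Page]:
    """Second class ("a period of time"): t(keyword) < T_publish(W) < t0."""
    return [p for p in pages if t_keyword < p[1] < t0]


now = datetime(2011, 7, 12)
pages = [
    ("/a", datetime(2011, 3, 1)),
    ("/b", datetime(2011, 7, 1)),
    ("/c", datetime(2010, 12, 25)),
]
newest_first = latest_constraint(pages, now)
since_january = period_constraint(pages, datetime(2011, 1, 1), now)
```

A query containing a word such as "latest" would use the first function; a query such as "since January 2011" would map to the second with `t_keyword` set to the start of the period.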
Step 7-3 further comprises the following substeps:
7-3-1: use a query evaluation technique and a retrieval model to obtain a preliminary webpage list L;
7-3-2: using the retrieval model, match the publication times of the webpages in list L against the constraint;
7-3-3: determine the order of the webpages in the webpage list according to ConScore(q, W):
$\mathrm{ConScore}(q, W) = \alpha(q, W) \times \mathrm{Sim}(q, W) + \beta(q, W) \times \mathrm{Sim\_t}(q, W) + \gamma(q, W) \times (1 - \mathrm{InCon}(W))$
where
ConScore(q, W): the similarity of webpage W to a given query q;
Sim(q, W): the webpage similarity;
Sim_t(q, W): the result of matching the publication time of a webpage in list L against the constraint; if the publication time satisfies the constraint, Sim_t(q, W) = 1, otherwise Sim_t(q, W) = 0;
InCon(W): the time inconsistency degree of webpage W;
α(q, W): the weight of Sim(q, W);
β(q, W): the freshness of webpage W,
Figure BDA0000075106470000111
where $T_{publish}(W)$ is the publication time of the webpage, $t$ is the start time point of the time constraint specified by the user, and $t_0$ is the current system time point;
γ(q, W): the weight of the time consistency degree of webpage W in the ranking;
$\alpha(q, W) + \beta(q, W) + \gamma(q, W) = 1$, with $\alpha(q, W), \beta(q, W), \gamma(q, W) \ge 0$;
7-3-4: arrange the webpages in the web document list in descending order of ConScore(q, W).
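The scoring and ordering in substeps 7-3-3 and 7-3-4 can be sketched as follows. In the source the three weights depend on the query and the webpage, and the freshness weight is defined by a formula image that is not available; this sketch therefore treats them as constants supplied by the caller, which is a simplifying assumption.

```python
from typing import List, Tuple


def con_score(
    sim: float, sim_t: int, incon: float, alpha: float, beta: float, gamma: float
) -> float:
    """ConScore = alpha*Sim + beta*Sim_t + gamma*(1 - InCon)."""
    if min(alpha, beta, gamma) < 0 or abs(alpha + beta + gamma - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    return alpha * sim + beta * sim_t + gamma * (1.0 - incon)


def rank(
    results: List[Tuple[str, float, int, float]],
    alpha: float,
    beta: float,
    gamma: float,
) -> List[str]:
    """Order (url, Sim, Sim_t, InCon) tuples by descending ConScore."""
    scored = [
        (con_score(sim, sim_t, incon, alpha, beta, gamma), url)
        for url, sim, sim_t, incon in results
    ]
    return [url for _, url in sorted(scored, key=lambda item: item[0], reverse=True)]


# Hypothetical results: "/a" matches the text better but is stale and outside
# the time constraint; "/b" satisfies the constraint and is largely consistent.
results = [("/a", 0.9, 0, 0.8), ("/b", 0.7, 1, 0.1)]
ordered = rank(results, alpha=0.5, beta=0.3, gamma=0.2)
```

With these weights `/b` scores 0.83 against 0.49 for `/a`, so the time-aware terms outweigh the small difference in textual similarity.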
II. A system for automatically discovering and ranking outdated webpages based on Web time inconsistency, comprising:
1. an automatic outdated-webpage discovery module, 2. a similar-website information freshness ranking module, and 3. a time-aware search module;
1. The automatic outdated-webpage discovery module further comprises the following submodules: a time information extraction module, a Web time inconsistency reasoning module, a Web time inconsistency measurement module, and an outdated-webpage ranking module based on Web time inconsistency, wherein:
The time information extraction module performs time information extraction on the time-sensitive webpages that have been screened out; it comprises a submodule for extracting the time information of the title of the column containing the webpage, a submodule for extracting the time information of the webpage title, and a submodule for extracting the time dimensions of the webpage content;
The Web time inconsistency reasoning module performs webpage time inconsistency reasoning for the time dimensions that could not be extracted from a webpage, and identifies the time inconsistency pattern of the webpage; it comprises a reasoning submodule that infers the unknown dimensions of a webpage from its known dimensions, and a reasoning submodule that infers unknown webpage time dimensions from the time dimensions of similar webpages;
The Web time inconsistency measurement module measures Web time inconsistency using the time inconsistency model that corresponds to the time inconsistency pattern of each webpage; it comprises a submodule for measuring the time inconsistency of a webpage itself, a submodule for measuring the time inconsistency between a webpage and its column, and a submodule for measuring the time inconsistency between identical columns of different websites;
The outdated-webpage ranking module based on Web time inconsistency ranks outdated webpages according to the measurement results of the Web time inconsistency measurement module; it comprises a submodule for ranking outdated webpages with webpage time inconsistency, a submodule for ranking outdated webpages with webpage-column time inconsistency, and a submodule for ranking outdated webpages with time inconsistency between identical columns of different websites;
2. The similar-website information freshness ranking module further comprises a webpage acquisition module and a ranking module, wherein:
The webpage acquisition module obtains similar websites and their webpages;
The ranking module ranks the similar websites and webpages obtained by the webpage acquisition module by website information freshness

$$FScore = \frac{FineFScore + CourseFScore}{2},$$

where FineFScore is the fine-grained freshness,

$$FineFScore = 1 - \frac{\sum_{i=1}^{n}\sum_{j=1}^{m} webpage(j).Inconsistency(i)}{n \times m},$$

and CourseFScore is the coarse-grained freshness,

$$CourseFScore = 1 - \frac{N_{inconsistency}(webpage)}{m}.$$

Here m is the number of webpages in the website; n is the number of types of time inconsistency problem exhibited by the time-inconsistent webpages of the website; webpage(j).Inconsistency(i) is the time inconsistency degree of the i-th class of time inconsistency problem in the j-th webpage; Inconsistency(i) is the time inconsistency degree of the i-th class of time inconsistency problem, 0 ≤ Inconsistency(i) ≤ 1; and N_inconsistency(webpage) is the number of time-inconsistent webpages in the website.
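A minimal Python sketch of the freshness score, under two assumptions that the text does not state: the inconsistency degrees are supplied as an m × n matrix (one row per webpage, one column per problem type), and a webpage counts as time-inconsistent when any of its degrees is greater than zero. The function names are hypothetical.

```python
from typing import Dict, List


def fscore(inconsistency: List[List[float]]) -> float:
    """FScore = (FineFScore + CourseFScore) / 2 for one website.

    inconsistency[j][i] is webpage(j).Inconsistency(i), each value in [0, 1].
    """
    m = len(inconsistency)     # number of webpages in the website
    n = len(inconsistency[0])  # number of time inconsistency problem types
    fine = 1.0 - sum(sum(row) for row in inconsistency) / (n * m)
    # Assumption: a page is "time-inconsistent" if any of its degrees is positive.
    n_inconsistent = sum(1 for row in inconsistency if any(d > 0 for d in row))
    coarse = 1.0 - n_inconsistent / m
    return (fine + coarse) / 2.0


def rank_sites(sites: Dict[str, List[List[float]]]) -> List[str]:
    """Return website names ordered from freshest to least fresh."""
    return sorted(sites, key=lambda name: fscore(sites[name]), reverse=True)
```

For example, a two-page site where one page is fully consistent and the other has degree 0.5 for both problem types gets FineFScore 0.75, CourseFScore 0.5 and FScore 0.625.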
3. The time-aware search module further comprises a meta-search module, a webpage re-ranking module and a feedback module, wherein:
The meta-search module uses meta-search to obtain the search results returned by multiple search engines;
The webpage re-ranking module re-ranks the search results obtained by the meta-search module according to the time-aware ranking score ConScore(q, W), yielding a new webpage ordering, where ConScore(q, W) is the similarity of webpage W to a given query q,
ConScore(q,W)=α(q,W)×Sim(q,W)+β(q,W)×Sim_t(q,W)+γ(q,W)×(1-InCon(W))
Sim(q, W): the webpage similarity;
Sim_t(q, W): the result of matching the publication times of the webpages in list L against the time constraint; Sim_t(q, W) = 1 if the publication time of the webpage satisfies the constraint, and Sim_t(q, W) = 0 otherwise;
InCon(W): the time inconsistency degree of webpage W;
α(q, W): the weight of Sim(q, W);
β(q, W): the freshness of webpage W,
Figure BDA0000075106470000131
T_Publish(W) is the publication time of the webpage, t is the start time point of the time constraint specified by the user, and t_0 is the current system time;
γ(q, W): the weight of the time consistency degree of webpage W in the ranking;
α(q,W)+β(q,W)+γ(q,W)=1, with α(q,W), β(q,W), γ(q,W) ≥ 0;
The feedback module returns the new webpage ranking results to the user.
Compared with the prior art, the invention has the following advantages and positive effects:
1) the invention automatically discovers the outdated webpages of a website and ranks them automatically, freeing website maintainers from laborious manual inspection and helping to save human resources;
2) the invention can rank similar websites on the basis of time inconsistency measurement, for example ranking government or university websites by the quality of their information freshness;
3) the invention provides search engines with a time-sensitive information ranking method, so that users can find the latest information more conveniently and the content quality and user evaluation of websites can be improved.
Description of the drawings
Fig. 1 shows the overall scheme of the invention;
Fig. 2 is a schematic diagram of time delay in the invention;
Fig. 3 is a schematic diagram of the time-delay consistency degree of several major websites;
Fig. 4 is a diagram of the exponential model of constraint inconsistency in the invention;
Fig. 5 is a schematic diagram of the constraint-time consistency degree of the DIALOGUE column of www.xinhuanet.com;
Fig. 6 is a diagram of the fuzzy-set model of the time consistency degree and time inconsistency degree of unconstrained webpages in the invention;
Fig. 7 is a schematic diagram of the unconstrained inconsistency degree and unconstrained consistency degree of the new-energy power generation webpages in State Grid's Information Network;
Fig. 8 is a flowchart of the comparison inconsistency model of the invention;
Fig. 9 is a flowchart of the prediction inconsistency model of the invention;
Fig. 10 is an ontology description of the time knowledge concepts of the invention and the relations between them.
Embodiments
Fig. 1 shows the overall scheme of the invention, which is divided into two levels:
The first level is the "theoretical framework". Webpages undergo time sensitivity analysis and a multi-dimensional webpage time model is established; multi-dimensional webpage time data are extracted using the time knowledge concept model, regular grammar and time extraction based on pattern algebra; from these the Web time inconsistency models are established, comprising the webpage time inconsistency model, the webpage-column time inconsistency model and the time inconsistency model for identical columns of different websites. Web time inconsistency reasoning and measurement are then carried out in two parts. The first part is webpage time inconsistency reasoning and measurement: reasoning consists of inferring unknown time dimensions from known time dimensions and inferring unknown time dimensions from the time dimensions of similar webpages, and measurement consists of webpage time inconsistency measurement, webpage-column time inconsistency measurement, and time inconsistency measurement for identical columns of different websites. The second part is time-aware ranking: by measuring the time of the results returned by a search engine, the webpages are re-ranked on the basis of webpage time inconsistency measurement to form a new ranking.
The second level is the "application scenarios", of which there are three. First, the user enters a website address and the system automatically discovers outdated webpages, produces a list of them ranked by degree of outdatedness, notifies the user of the type and degree of time inconsistency of each webpage, alerts the website administrators, and proposes a new automatic correction scheme to correct the time inconsistency points in the webpage. Second, similar websites, such as the government websites of different regions, are ranked on the basis of time inconsistency reasoning and measurement. Third, through a search engine and according to the query keywords entered by the user, the timeliness of webpages across websites is analysed, the webpages with the highest time consistency are recommended to the user first, and a time-consistent result list is returned, achieving time consistency of web search results.
To facilitate understanding of the invention, the theoretical basis it relies on is first described in detail, as follows:
Step 1: in view of the differing time sensitivity of webpage information and the time inconsistency problems that exist on the Web, establish the Web time inconsistency models, which comprise the webpage time inconsistency model, the webpage-column time inconsistency model and the time inconsistency model for identical columns of different websites. This step further comprises the following sub-steps:
1-1: perform sensitivity analysis on different webpage information, classify webpages by topic and by the trend of their information over time, and estimate the interval of time sensitivity of each class of webpage. This step comprises the following sub-steps:
1. Information classification: divide webpages by topic into history, finance, literature, news, general knowledge and so on;
2. Trend of information over time: statistically analyse how each class of webpage changes over time, and estimate the interval of time sensitivity of each class;
3. Filter out webpages with very low time sensitivity: statistical analysis shows that webpages such as general knowledge, history and literature have very low sensitivity to time; such webpages have no time inconsistency problem and are not considered, and only webpage information that is relatively sensitive to time is studied.
1-2: use the logical order relation of Web information on the time axis to establish the Web time relation vector model;
This step defines the following notions:
1) Definition of the tense system: the tense system is T = <T, R, τ>, where, on the right-hand side, T is a non-empty set of time points, R is the order relation on the time set, and τ is the truth assignment function. T is a finite set of time points, T = {t_1, t_2, ..., t_n}. The order relation R on the time set is the (non-partial) order relation "<"; R(t_i, t_{i+1}) denotes t_i < t_{i+1}, i.e. t_i occurs before t_{i+1}. The truth assignment function τ: T × (set of atoms of L_T) → {0, 1} is defined on L_T with the temporal operators F, P, G, H and the operators ∨, ∧,
Figure BDA0000075106470000161
and gives a truth value to each atomic sentence at every time point, where Fe denotes that e is true at some future time; Pe denotes that e was true at some past time; Ge denotes that e is true at all future times; and He denotes that e has always been true in the past.
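The four temporal operators can be illustrated with a short Python sketch over a finite set of time points. This is an assumption-laden illustration: time points are represented as comparable numbers, the truth assignment of one atom e is a dictionary from time point to boolean, and the function names are hypothetical.

```python
from typing import Dict


def future(truth: Dict[int, bool], t: int) -> bool:
    """F e at time t: e is true at some time point after t."""
    return any(v for s, v in truth.items() if s > t)


def past(truth: Dict[int, bool], t: int) -> bool:
    """P e at time t: e is true at some time point before t."""
    return any(v for s, v in truth.items() if s < t)


def always_future(truth: Dict[int, bool], t: int) -> bool:
    """G e at time t: e is true at every time point after t."""
    return all(v for s, v in truth.items() if s > t)


def always_past(truth: Dict[int, bool], t: int) -> bool:
    """H e at time t: e is true at every time point before t."""
    return all(v for s, v in truth.items() if s < t)
```

Note that on a finite set of time points G and H hold vacuously at the last and first time point respectively, because there are no later or earlier points to check.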
According to the life cycle of a webpage and its content, Web time is divided into 6 dimensions: event occurrence time (T_Happen), writing time (T_Write), publication time (T_Publish), reading time (T_Read), reprint time (T_Share) and text expiration time (T_Out), with T_Happen, T_Write, T_Publish, T_Read, T_Share, T_Out ∈ T. A webpage is represented as a Web time vector over these 6 dimensions, as follows:
2) Definition of event occurrence time (T_Happen): the original time at which the event described by the webpage text occurred.
3) Definition of writing time (T_Write): the time at which the journalist or event describer, i.e. the author, wrote the event up as a manuscript.
4) Definition of publication time (T_Publish): the time at which the manuscript describing the event was published online.
5) Definition of reading time (T_Read): the time at which a user browses the webpage text describing the event.
6) Definition of reprint time (T_Share): the time at which a user reprints or shares the webpage text.
7) Definition of text expiration time (T_Out): for webpages whose text content has a validity period, the time at which the text content expires.
8) Definition of the Web time relation vector: for any webpage W, the Web time relation vector is Vector(W) = <T_Happen(e), T_Write(e), T_Publish(e), T_Read(e), T_Share(e), T_Out(e)>, where e is the event described by webpage W, and T_Happen(e), T_Write(e), T_Publish(e), T_Read(e), T_Share(e), T_Out(e) are respectively the event occurrence time, writing time, publication time, reading time, reprint time and text expiration time of event e.
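One way to hold the six-dimensional Web time relation vector in code is a small data class. The sketch below is illustrative only: the class and field names are hypothetical, and dimensions that could not be extracted are represented as `None` so that a later reasoning step can see which ones are still unknown.

```python
from dataclasses import dataclass, fields
from datetime import datetime
from typing import List, Optional


@dataclass
class WebTimeVector:
    """Vector(W) = <T_Happen, T_Write, T_Publish, T_Read, T_Share, T_Out> for event e."""
    t_happen: Optional[datetime] = None   # event occurrence time
    t_write: Optional[datetime] = None    # writing time
    t_publish: Optional[datetime] = None  # publication time
    t_read: Optional[datetime] = None     # reading time
    t_share: Optional[datetime] = None    # reprint time
    t_out: Optional[datetime] = None      # text expiration time

    def unknown_dimensions(self) -> List[str]:
        """Names of the dimensions that have not been extracted or inferred yet."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]
```

A vector built from a page where only the publication and expiration times were extracted reports the other four dimensions as unknown.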
1-3: for the time inconsistency problems that exist within each class of webpage itself, build the webpage time inconsistency model from the Web time relation vector model; the webpage time inconsistency model comprises the time-delay inconsistency model, the constraint inconsistency model and the unconstrained inconsistency model;
Webpage time inconsistency is time inconsistency that exists only among the title, the publication time and the body text of a single webpage; it comprises time-delay inconsistency, constraint inconsistency and unconstrained inconsistency. Based on the Web time axis theory, a model is built for each class of time inconsistency problem.
9) Definition of time-delay inconsistency: there is a time interval between the event occurrence time and the writing time, and another between the writing time and the publication time; these are the writing delay and the publication delay respectively. When writing delay + publication delay <= a certain value, the webpage text is considered time-delay consistent; otherwise it is time-delay inconsistent.
The time-delay inconsistency model is built from the definition of time-delay inconsistency:
The delay between the occurrence time and the publication time of webpage event e is

$$D_{happen}^{publish}(e) = T_{Publish}(e) - T_{Happen}(e),$$

where T_Publish(e) and T_Happen(e) are respectively the publication time and the event occurrence time of webpage event e. When
Figure BDA0000075106470000172
the time delay is consistent, with time-delay consistency degree [formula not preserved]; when
Figure BDA0000075106470000174
the time delay is inconsistent, with inconsistency degree ConD = 1, where a is the time-delay inconsistency threshold, set according to the time sensitivity of the webpage information, as shown in Fig. 2.
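The delay computation can be sketched as below. The two threshold conditions are figures that are not legible in this text, so the sketch assumes, following definition 9), that a webpage is time-delay consistent when the delay does not exceed the threshold a. The function names are hypothetical, and the consistency degree itself is not computed because its formula is not preserved.

```python
from datetime import datetime, timedelta


def publish_delay(t_happen: datetime, t_publish: datetime) -> timedelta:
    """D(e) = T_Publish(e) - T_Happen(e)."""
    return t_publish - t_happen


def is_delay_consistent(t_happen: datetime, t_publish: datetime, a: timedelta) -> bool:
    """Assumed rule: consistent when the delay is at most the threshold a."""
    return publish_delay(t_happen, t_publish) <= a
```

With a threshold of 12 hours, a page published 90 minutes after the event is classed as consistent and one published more than a day later is not.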
For example, for the news item "Obama delivers a nationally televised address announcing that bin Laden has been shot dead", the time delay of the related webpages of several major portal websites is calculated. Bin Laden was shot dead on the morning of May 1 (U.S. time), and Obama delivered the nationally televised address announcing it at 11:36 p.m. on May 1 (U.S. time), i.e. 10:36 on May 2, Beijing time. Taking the time-delay inconsistency threshold a = 12 h, the time-delay inconsistency model gives the following table; the distribution is shown in Fig. 3:
Table 1: publication time and time delay
Figure BDA0000075106470000175
10) Definition of constraint inconsistency: this applies to webpage information with an obvious time constraint, i.e. information that can be valid or expired. When a webpage is within its validity period, it is constraint-consistent; when it has expired, it is constraint-inconsistent.
The invention adopts the notion of relative time to build the constraint inconsistency model, as follows:
Let the relative time be
Figure BDA0000075106470000176
and establish the function relating the relative time T_Relative to the constraint-time consistency degree
Figure BDA0000075106470000177
where ConC denotes the constraint-time consistency degree. Normalizing the range of ConC to the interval [0, 1] gives

$$ConC = e^{T_{Relative}\ln 0.6} = 0.6^{T_{Relative}},$$

as shown in Fig. 4.
Analysis shows that within the validity period of a webpage the constraint-time consistency degree changes quickly, and once the webpage has expired it changes more and more slowly. For each specific piece of webpage information, the expiration time and its order of magnitude differ, so the curve of its constraint-time consistency degree differs as well.
For example, in the DIALOGUE column of www.xinhuanet.com, the first 100 topics were selected; the publication times of the webpages range from 2011-1-27 16:53:00 to 2011-5-5 16:48:00, and 1 item is invalid data while 99 are valid data. The reading time T_Read (i.e. the current system time) is 2011-05-04 21:14:00, and from "DIALOGUE" the expiration time T_Out is known to be 2011-05-05 00:00:00. By extracting the publication times of the webpages under the column and applying the constraint inconsistency model, constraint inconsistency is measured for the webpages under this column; the constraint-time consistency degree follows an exponential distribution from 0.136521392 down to 6.43397E-75, as shown in Fig. 5.
11) Definition of unconstrained inconsistency: this applies to webpage information for which no explicit validity period is given, i.e. the validity period of the information is infinite. When the webpage text was published too long ago and has not been updated for a long time, it is called unconstrained-inconsistent.
When, for webpage e,
Figure BDA0000075106470000182
or T_Out(e) = ∞, the webpage is an unconstrained webpage. For this kind of webpage, considering how long ago the text was published, the invention takes the publication time as the zero point and, according to the sensitivity of the webpage information to time, models the webpage over time with fuzzy sets, obtaining the degrees of membership of the webpage in the two classes of unconstrained webpage, valid and expired; the time inconsistency degree of the webpage is then its degree of membership in the expired webpages. The unconstrained inconsistency model is constructed as follows:
Define the universe X = {all webpages}, A = {valid webpages}, A′ = {expired webpages}, with A, A′
Figure BDA0000075106470000183
Then
Figure BDA0000075106470000184
The set A determines a mapping μ_A: X → [0, 1], x ↦ μ_A(x). μ_A(x) is the degree of membership of x in the valid webpages, i.e. the constraint-time consistency degree ConUC of the webpage; μ_A′(x) is the degree of membership of x in the expired webpages, i.e. the constraint-time inconsistency degree InConUC of the webpage. Let A = {valid webpages} and let t denote time; the membership function is

$$ConUC(x) = \mu_A(x) = \begin{cases} 1, & 0 \le t \le b \\ \left(1 + a^{-2}(t-b)^{-2}\right)^{-1}, & t \ge b \end{cases}$$

Let A′ = {expired webpages} and let t denote time; the membership function is

$$InConUC(x) = \mu_{A'}(x) = \begin{cases} 0, & 0 \le t \le c \\ \left(1 + \left(\dfrac{t-c}{a}\right)^{-2}\right)^{-1}, & t \ge c \end{cases}$$

as shown in Fig. 6. Here a is the information sensitivity of the webpage and [b, c] is a neighborhood interval of the expiration time T_Out, with

$$T_{Out} = T_{Publish} + \frac{1}{a}.$$
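The membership function of the expired set, which the text uses as the time inconsistency degree of an unconstrained webpage, can be sketched as follows. The sketch assumes that t, c and a are plain numbers expressed in the same time unit, measured from the publication time as zero point; the text does not fix the unit, and the function name is hypothetical.

```python
def incon_uc(t: float, c: float, a: float) -> float:
    """InConUC: 0 for t <= c, otherwise (1 + ((t - c) / a) ** -2) ** -1."""
    if a <= 0:
        raise ValueError("the sensitivity parameter a must be positive")
    if t <= c:
        return 0.0
    return 1.0 / (1.0 + ((t - c) / a) ** -2)
```

The value rises from 0 at t = c towards 1 as t grows; at t = c + a it equals 0.5.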
For example, in the popular science garden column of State Grid's Information Network, all articles were published before May 9, 2008. All webpages about new-energy power generation in the popular science garden column are selected, and the unconstrained inconsistency model is used to measure unconstrained inconsistency: the publication times T_Publish range from 2008-05-8 to 2007-04-28 (hours, minutes and seconds default to 00:00:00), and the reading time T_Read (i.e. the current system time x) is 2011-05-04 22:00:00. Because popular science information is general-knowledge information with low sensitivity to time, its time sensitivity is taken as a = 0.1/year and δ = 1 month, so the neighborhood is

$$[b, c] = \left[T_{Publish} + \frac{1}{a} - \delta,\; T_{Publish} + \frac{1}{a} + \delta\right],$$

that is, $b = T_{Publish} + \frac{1}{a} - \delta$ = 2009-06-08 and $c = T_{Publish} + \frac{1}{a} + \delta$ = 2009-04-08.
From

$$\mu_A = \begin{cases} 1, & 0 \le x \le b \\ \left(1 + a^{-2}(x-b)^{-2}\right)^{-1}, & x \ge b \end{cases} \qquad \text{and} \qquad \mu_{A'} = \begin{cases} 0, & 0 \le x \le c \\ \left(1 + \left(\dfrac{x-c}{a}\right)^{-2}\right)^{-1}, & x \ge c \end{cases}$$

the unconstrained inconsistency degree and unconstrained consistency degree are obtained; see Fig. 7.
1-4: for the time inconsistency between the time information of the webpages in a website column and the meaning of the column, build the webpage-column time inconsistency model from the Web time relation vector model; the webpage-column time inconsistency model comprises the time-delay inconsistency model, the constraint inconsistency model and the unconstrained inconsistency model;
Time inconsistency often arises between the time information of the webpages in a website column and the meaning of the column: the column title is time-related, and the information of a webpage in the column (including the webpage title, publication time and body text) does not agree with the time in the column title.
When the column title contains a time word expressing the notion of "latest", the constraint inconsistency model from the webpage time inconsistency model is adopted; when the column title contains a time word expressing "a period of time", such as today, this week, next week or this month, the unconstrained inconsistency model from the webpage time inconsistency model is adopted. The concrete models are similar to the time-delay, constraint and unconstrained inconsistency models in the webpage time inconsistency model, with T_Out determined by the column title.
1-5: for the time inconsistency of webpages that describe the same information under identical columns of different websites, establish the time inconsistency model for identical columns of different websites from the Web time relation vector model; this model comprises the comparison inconsistency model and the prediction inconsistency model;
Under identical columns of different websites, webpages describing the same information are often inconsistent; for example, the new-stock prediction column of Netease predicts a certain new stock while the Eastmoney website already shows the listing information of that stock. For this class of time inconsistency, the time inconsistency model for identical columns of different websites is built from the Web time relation vector model, comprising comparison inconsistency and prediction inconsistency.
12) Definition of comparison inconsistency: within the same time period, webpages describing the same event conflict in their event description vectors, for example in the event occurrence time.
Fig. 8 is the flowchart of the comparison inconsistency model; its workflow is as follows:
1. Perform event mining on webpage W_1 to obtain the event description vector x of the webpage;
2. From the publication time T_Publish(e) of the event e described by webpage W_1, determine the neighborhood time interval [T_Publish(e) − δ, T_Publish(e) + δ] of webpage W_1, with δ > 0, δ → 0;
3. From the event description vector x of webpage W_1, determine the retrieval keywords and search for related webpages;
4. Match the publication time of each related webpage found against the neighborhood time interval: keep the webpage if it lies within the interval, otherwise delete it; this finally yields a related-webpage set W;
5. Perform time dimension extraction on all webpages in the related-webpage set W and determine their event description vectors y_1, y_2, ..., y_n, where n is the number of webpages in W;
6. Compare the event description vectors x and y_i, i ∈ [1, n]: if cos<x, y_i> ≤ 0, the webpages are comparison-inconsistent; otherwise, stop;
7. Stop.
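Steps 2, 4 and 6 of this workflow can be sketched in Python as below. Event mining and keyword search (steps 1, 3 and 5) are outside the sketch: it assumes the event description vectors are already available as numeric lists of equal length, an encoding the text does not specify, and the function names and the candidate tuple layout are hypothetical.

```python
import math
from datetime import datetime, timedelta
from typing import List, Sequence, Tuple


def cosine(x: Sequence[float], y: Sequence[float]) -> float:
    """Cosine of the angle between two event description vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    if norm == 0:
        raise ValueError("event description vectors must be non-zero")
    return dot / norm


def comparison_inconsistent(
    x: Sequence[float],
    t_publish: datetime,
    candidates: List[Tuple[str, datetime, Sequence[float]]],
    delta: timedelta,
) -> List[str]:
    """Return the URLs of related pages whose vectors conflict with x.

    candidates holds (url, publication time, event description vector) triples.
    """
    lo, hi = t_publish - delta, t_publish + delta
    # Step 4: keep only pages published inside the neighborhood interval.
    related = [(url, y) for url, t, y in candidates if lo <= t <= hi]
    # Step 6: a page is comparison-inconsistent when cos<x, y_i> <= 0.
    return [url for url, y in related if cosine(x, y) <= 0]
```

Pages published outside the neighborhood interval are never flagged, even if their vectors conflict, because step 4 removes them before the comparison.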
13) Definition of prediction inconsistency: when the event e predicted by a webpage W has already occurred, or e did not occur at the predicted time, and prediction information about e still exists on the network, the webpage information predicting e conflicts with the facts, i.e. webpage W is prediction-inconsistent.
Building the prediction inconsistency model:
Prediction inconsistency falls into two cases: (1) prediction inconsistency caused by lack of awareness: after the prediction and before expiry the event has occurred, i.e. T_Publish(e) < T_Happen(e) < T_Out(e), but the prediction webpage has not been processed and still shows the prediction, so the webpage is considered prediction-inconsistent; (2) prediction inconsistency caused by untimely updating: after the prediction has expired, i.e. T_Read(e) ≥ T_Out(e), the prediction webpage has not been processed (for example marked as expired, or deleted) and still shows the "prediction". The two cases are unified into one model, shown in Fig. 9. The workflow of the model is as follows:
1. Prediction webpage judgment: if T_Happen(e) > T_Publish(e), or T_Happen(e) is empty, webpage W_1 is considered a prediction webpage; otherwise stop. Here e is the event in webpage W_1;
2. Compare T_Read(e) and T_Out(e): if T_Read(e) ≥ T_Out(e), the webpage is prediction-inconsistent and the procedure stops; otherwise go to step 3;
3. Perform event mining on the prediction webpage W_1 to obtain the event description vector x of webpage W_1;
4. From the event description vector x of webpage W_1, determine the retrieval keywords and search for related webpages;
5. Screen the retrieval results and select the highly relevant webpages to obtain a related-webpage set W; perform time dimension extraction on all webpages in W and determine their event description vectors y_1, y_2, ..., y_n, where n is the number of webpages in W;
6. Compare the event description vectors x and y_i, i ∈ [1, n]: if cos<x, y_i> ≤ 0, the webpage is prediction-inconsistent; otherwise, stop;
7. Stop.
For example, consider the webpage entitled "Three major guesses and countermeasures for tomorrow's stock market" in the finance and economics rolling-news column of China's economic net financial instrument net. For this article T_Publish = 2011-05-03 15:59, 2011-05-04 00:00:00 < T_Happen < 2011-05-05 00:00:00, and T_Out = 2011-05-05 00:00:00, so T_Publish < T_Happen < T_Out. From T_Publish < T_Happen, the webpage is a prediction webpage; the current system time is T_Read = 2011-05-05 17:00:00, so T_Publish < T_Happen < T_Out < T_Read, and from T_Out < T_Read the webpage is prediction-inconsistent.
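The first two steps of the prediction inconsistency workflow, which are enough to decide the example above, can be sketched as follows. The function names and the returned labels are hypothetical; steps 3 to 6 (event mining and vector comparison) are only signalled by the "needs comparison" result rather than implemented.

```python
from datetime import datetime
from typing import Optional


def is_prediction_page(t_happen: Optional[datetime], t_publish: datetime) -> bool:
    """Step 1: a page is a prediction page if T_Happen is empty or later than T_Publish."""
    return t_happen is None or t_happen > t_publish


def prediction_status(
    t_happen: Optional[datetime],
    t_publish: datetime,
    t_out: datetime,
    t_read: datetime,
) -> str:
    """Apply steps 1 and 2 of the workflow and report the outcome as a label."""
    if not is_prediction_page(t_happen, t_publish):
        return "not a prediction page"
    if t_read >= t_out:
        # Step 2: the prediction has expired but the page still shows it.
        return "prediction-inconsistent"
    return "needs comparison"  # steps 3 to 6 would compare event description vectors
```

Using the publication, expiration and reading times of the example, with an illustrative occurrence time inside the stated interval, the page is reported as prediction-inconsistent.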
Step 2: use the time knowledge concept model, regular grammar matching and pattern algebra to perform multi-dimensional time extraction on Web information, where the multi-dimensional time comprises event occurrence time, writing time, publication time, reading time, reprint time and text expiration time;
2-1: build the time knowledge concept model over time knowledge and the relations within it, to describe the semantics of time-related knowledge concepts in natural language;
Fig. 10 is the ontology description of time knowledge concepts and the relations between them. A webpage describes an event, and the webpage information describes the time of the event. According to how it is stated in natural language, the natural-language expression of time is divided into two classes: the first class is explicit time expressions, such as today, April 27, 2011, the next three days, ...; the second class is implicit time expressions, such as National Day, New Year's Eve, leap year, ...
Standard time is expressed with 6 units: year, month, day, hour, minute and second; natural-language time statements are converted to and from the standard time representation by means of time instants and periods.
2-2: use regular grammar matching and pattern algebra to match the concrete entity names in the time knowledge concept model and, using the time knowledge concept model and the time concept definitions, perform semantics-based multi-dimensional time extraction on Web information according to webpage template features and temporal feature words. The concrete entity names in the time knowledge concept model comprise time objects and time concepts, specifically: 1. year, month, day, hour, minute, second; 2. time dimensions such as event occurrence time and publication time; 3. time knowledge such as festivals. The multi-dimensional time comprises event occurrence time, writing time, publication time, reading time, reprint time and text expiration time;
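A very small illustration of regular-expression-based time extraction is given below. It is a sketch, not the extraction method of the invention: it only recognises dates written as YYYY-MM-DD with an optional clock time, plus three English relative words resolved against a reference date, and all names (`ABSOLUTE`, `RELATIVE`, `extract_times`) are hypothetical. The time knowledge concept model and pattern algebra are not modelled.

```python
import re
from datetime import datetime, timedelta
from typing import List

# Explicit time expressions: YYYY-MM-DD, optionally followed by HH:MM or HH:MM:SS.
ABSOLUTE = re.compile(r"(\d{4})-(\d{1,2})-(\d{1,2})(?:\s+(\d{1,2}):(\d{2})(?::(\d{2}))?)?")
# A few relative words, mapped to day offsets from the reference date.
RELATIVE = {"yesterday": -1, "today": 0, "tomorrow": 1}


def extract_times(text: str, reference: datetime) -> List[datetime]:
    """Return standard (year, month, day, hour, minute, second) times found in text."""
    found = []
    for match in ABSOLUTE.finditer(text):
        year, month, day, hour, minute, second = match.groups()
        try:
            found.append(datetime(int(year), int(month), int(day),
                                  int(hour or 0), int(minute or 0), int(second or 0)))
        except ValueError:
            continue  # skip strings that look like dates but are not valid ones
    base = reference.replace(hour=0, minute=0, second=0, microsecond=0)
    for word, offset in RELATIVE.items():
        if re.search(rf"\b{word}\b", text, re.IGNORECASE):
            found.append(base + timedelta(days=offset))
    return found
```

Relative words are resolved to midnight of the corresponding day, which matches the six-unit standard representation with the clock fields set to zero.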
Step 3: classify webpages according to the Web time inconsistency models and carry out Web time inconsistency measurement, which comprises the time inconsistency measurement of a webpage itself, the time inconsistency measurement between a webpage and its column, and the time inconsistency measurement between identical columns of different websites;
3-1 Time inconsistency measurement of a webpage itself:
The time inconsistency degree of webpage W is expressed as

$$InCon(W) = \sum_{i=1}^{n} \alpha_i \times webpage.Inconsistency(i),$$

where n is the number of time inconsistency problems exhibited by the time-inconsistent webpages of the website; α_i is the weight of time inconsistency problem type i; and webpage.Inconsistency(i) is the time inconsistency degree of the i-th class of time inconsistency problem of webpage W.
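The weighted sum can be written directly in Python. The function name is hypothetical, and the text does not say how the weights α_i are chosen; if InCon(W) is meant to stay within [0, 1], one natural assumption is that the weights are non-negative and sum to 1.

```python
from typing import Sequence


def incon(degrees: Sequence[float], weights: Sequence[float]) -> float:
    """InCon(W) = sum over i of alpha_i * webpage.Inconsistency(i)."""
    if len(degrees) != len(weights):
        raise ValueError("one weight is needed per inconsistency problem type")
    return sum(w * d for w, d in zip(weights, degrees))
```

For instance, degrees (1.0, 0.5, 0.0) with weights (0.5, 0.25, 0.25) give InCon(W) = 0.625.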
3-2 Time inconsistency measurement between a webpage and its column:
Because a column contains time-sensitive information that can conflict with the time information of the webpages under it, different time inconsistency measurement functions are defined according to the time semantics of the column. Based on column information extraction and webpage information extraction, time inconsistency is measured for each webpage under the column. The measurement method is the same as for the time inconsistency of a webpage itself, except that T_Out is determined from the column title information and the time sensitivity of the information rather than from the webpage information.
3-3 Time inconsistency measurement between identical columns of different websites:
Time inconsistency also exists between identical columns of different websites; it is caused by differences in how quickly websites update their content. Measuring it requires considering not only the inconsistency degree between webpage and column but also the effective time interval of the measurement. This class of measurement comprises comparison inconsistency measurement and prediction inconsistency measurement, as follows:
1. Comparison inconsistency measurement: $InConCompare(W_1, W_2) = \frac{1}{2}\left(\frac{\cos\langle x, y\rangle}{|x|\,|y|} + 1\right)$, where $InConCompare(W_1, W_2)$ is the comparison inconsistency degree of webpages $W_1$ and $W_2$, and x, y are the event description vectors of webpages $W_1$ and $W_2$, respectively;
2. Prediction inconsistency measurement: $InConPre(W_1, W_2) = \frac{1}{2}\left(\frac{\cos\langle x, y\rangle}{|x|\,|y|} + 1\right)$, where $InConPre(W_1, W_2)$ is the prediction inconsistency degree of webpages $W_1$ and $W_2$, and x, y are the event description vectors of webpages $W_1$ and $W_2$, respectively.
Step 4, construct the time inconsistency rule sets and, using the time inconsistency rule sets, the rule set based on time knowledge concepts, and the time inconsistency logical reasoning operators, carry out Web time inconsistency reasoning. The time inconsistency rule sets comprise the time delay inconsistency rule set, the constrained inconsistency rule set, the unconstrained inconsistency rule set, the comparison inconsistency rule set, and the prediction inconsistency rule set. Web time inconsistency reasoning comprises inference of unknown dimension time values in the webpage time relation vector, time inconsistency reasoning over webpage information on the same topic, and statistical reasoning about time inconsistency between identical columns of different websites;
4-1 Construct the time inconsistency rule sets and the rule set based on time knowledge concepts
1. The time inconsistency rule sets comprise the time delay inconsistency rule set, the constrained inconsistency rule set, the unconstrained inconsistency rule set, the comparison inconsistency rule set, and the prediction inconsistency rule set, where:
Time delay inconsistency rule set: let t be the time delay inconsistency critical value of event e. If
Figure BDA0000075106470000232
then the time delay is consistent; if
Figure BDA0000075106470000233
then the time delay is inconsistent, where R is the temporal order relation in the temporal system T;
Constrained inconsistency rule set: for a webpage event e and the current time t, a) if there exists
Figure BDA0000075106470000234
where s is a time point, such that Ge = 1 holds when $R(t, s) = 1$, and
Figure BDA0000075106470000235
such that $\tau(t, Ge) = 0$ when $R(s_0, t) = 1$ and $\tau(t, He) = 1$ when $R(t, s_0) = 1$, then event e has expired and its expiry time point is $s_0$; the positive expiry distance of event e is $s_0 - t$. b) If
Figure BDA0000075106470000236
such that Ge = 1 holds when $R(s, s_0) = 1$ and $R(t, s) = 1$, and
Figure BDA0000075106470000237
holds, then event e has not expired and is called valid, and its expiry time point is $s_0$; the negative expiry distance of event e is $s_0 - t$. Here R is the temporal order relation in the temporal system T; Ge means that e is true at all future times; $\tau$ is the truth assignment function in the temporal system T; He means that e has always been true in the past;
Unconstrained inconsistency rule set: let t be the critical time, $t_0$ the current system time, e the event described by webpage W, and e' the event described by webpage W'. When $R(T_{publicate}(e), t) = 1$, webpage W is considered unconstrained-inconsistent. For unconstrained-consistent webpages, $T_{publicate}(e')$ satisfies $R(T_{publicate}(e), t_0) = 1$ and $R(T_{publicate}(e'), t_0) = 1$; when $R(T_{publicate}(e), T_{publicate}(e')) = 1$, webpage W' has higher priority than webpage W;
Comparison inconsistency rule set: e is the event described by webpage W and e' is the event described by webpage W'. For any similar webpages W and W', when $T_{publish}(e') \in [T_{publish}(e) - \delta,\ T_{publish}(e) + \delta]$, if $\cos\langle Vector(W), Vector(W')\rangle \le 0$, then the webpages are comparison-inconsistent;
Prediction inconsistency rule set: for every prediction-type webpage, if $R(T_{happen}(e), T_{publish}(e)) = 1$ or $R(T_{out}(e), T_{publish}(e)) = 1$, then the webpage is considered prediction-inconsistent.
2. Rule set based on time knowledge concepts: describes the natural laws that hold between time knowledge concepts and the conventions people have agreed on.
For example: "leap year" → "has a February 29"; "China's National Day" → "October 1"; "China's National Day holiday" → "October 1 to October 5".
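The knowledge-concept rules in the example can be sketched as a small lookup that expands a concept into concrete dates. This is a minimal illustration in Python; the function name `concept_dates` and the choice of concept strings as keys are assumptions, and the holiday range simply follows the dates given in the text.

```python
import calendar
from datetime import date
from typing import List


def concept_dates(concept: str, year: int) -> List[date]:
    """Expand a time knowledge concept into the concrete dates it implies."""
    if concept == "leap year":
        # Natural rule: a leap year contains a February 29.
        return [date(year, 2, 29)] if calendar.isleap(year) else []
    if concept == "China's National Day":
        # Convention: National Day falls on October 1.
        return [date(year, 10, 1)]
    if concept == "China's National Day holiday":
        # Convention as stated in the text: October 1 to October 5.
        return [date(year, 10, day) for day in range(1, 6)]
    raise KeyError(f"no rule for concept: {concept}")


feb29 = concept_dates("leap year", 2012)
holiday = concept_dates("China's National Day holiday", 2011)
```

A rule of this kind lets an extracted phrase such as "National Day holiday" be compared directly with a page's publish or reading time.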
4-2 Construct time inconsistency logical reasoning operators for the time inconsistency relations, and carry out inference of unknown dimension time values in the webpage time relation vector, time inconsistency reasoning over webpage information on the same topic, and statistical reasoning about time inconsistency between identical columns of different websites.
1. Logical reasoning operators: the temporal reasoning method adopts the event calculus and interval algebra, and uses existing temporal reasoning theory to construct the webpage time inconsistency reasoning operators. There are two main kinds of temporal primitive: the instant (also called a point) and the period (time interval). The time structure is infinite continuous time. A temporal entity is a conceptual entity related to time, such as an event or a news item. Temporal constraints include expiry time and freshness.
2. Based on webpage text information, column information, related webpages, webpage links, and adjacent webpages, carry out inference of unknown dimension time values in the webpage time relation vector, time inconsistency reasoning over webpage information on the same topic, and statistical reasoning about time inconsistency between identical columns of different websites.
Webpage text information: from the webpage text, extract the temporal entities of the webpage and their time dimension information (including occurrence time, publish time, expiry time, etc.); based on this information, carry out inference of unknown time dimensions and time inconsistency reasoning for the webpage itself.
Column information: according to the temporal constraints of the column information, carry out time inconsistency reasoning between the column and its webpages by interval algebra operations.
Webpage link information: the time dimension information of the pages linking to a webpage and the corresponding time dimension information of the pages it links to form a time interval; applying interval algebra to the linked webpages of this webpage yields inferences about its time information.
Adjacent webpage information: webpages under the same column of a website are generally arranged in reverse chronological order. The sequence of publish times is extracted accordingly and subjected to time series analysis to infer the time consistency trend of the website. In addition, the time dimension information of a webpage is inferred from the time interval formed by the time dimension information of its adjacent webpages.
Related webpage information: in general, webpages describing the same temporal entity are published at very close times. Accordingly, event calculus is applied to related webpages describing the same temporal entity to complete the time inconsistency reasoning for identical columns of different websites.
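The adjacent-webpage inference can be illustrated with a short Python sketch that bounds an unknown publish time by its neighbours in a reverse-chronological column listing. The function name `infer_publish_interval`, the use of `None` for an unextracted time, and the sample dates are all assumptions made for illustration; the text does not prescribe a concrete procedure.

```python
from datetime import date
from typing import List, Optional, Tuple


def infer_publish_interval(
    times: List[Optional[date]], index: int
) -> Tuple[Optional[date], Optional[date]]:
    """Bound the unknown publish time of times[index] by its neighbours.

    `times` lists publish times in the column's display order, newest first;
    None marks a page whose publish time could not be extracted.
    Returns (earliest, latest); a bound is None when no neighbour supplies it.
    """
    # Pages listed above are newer, so the nearest known one is the upper bound.
    latest = next((t for t in reversed(times[:index]) if t is not None), None)
    # Pages listed below are older, so the nearest known one is the lower bound.
    earliest = next((t for t in times[index + 1:] if t is not None), None)
    return earliest, latest


# Hypothetical column listing with one missing publish time.
column = [date(2011, 7, 10), None, date(2011, 7, 4), date(2011, 6, 30)]
interval = infer_publish_interval(column, 1)
```

The resulting interval can then be intersected with intervals obtained from link information, in the spirit of the interval algebra mentioned above.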
Based on the above theory, the method proposed by the present invention for automatically discovering and sorting outdated webpages based on Web time inconsistency modeling takes into account the influence of time inconsistency on webpage information, and can be readily integrated into existing webpage information quality evaluation, website ranking, and information retrieval systems.
The application scenarios of the present invention are further described below:
One, automatic discovery and sorting of outdated webpages in a website
Step 1, extraction, reasoning, and measurement for time-inconsistent webpages
Substep 1, information extraction: for the time-sensitive webpages that have been screened out, extract time information, including the time information in the title of the column containing the webpage, the time information in the webpage title, and the time dimensions of the webpage content;
Substep 2, Web time inconsistency reasoning: using the Web time inconsistency reasoning method of step 4, perform webpage time inconsistency reasoning for the time dimensions that were not extracted, inferring the unknown dimensions of a webpage from its known dimensions and from the time dimensions of similar webpages.
Substep 3, Web time inconsistency measurement: on the basis of the extracted time information, identify the time inconsistency pattern of the webpage through Web time inconsistency reasoning, and apply the time inconsistency model that corresponds to each pattern to measure time inconsistency.
Step 2, automatic discovery of outdated webpages in the website
Substep 1, from each class of time inconsistency measurement, obtain each class of time inconsistency degree for every webpage;
Substep 2, if $\max_i\{Inconsistency(i)\} \ge a$, the webpage is considered outdated, where i is a time inconsistency type; $Inconsistency(i)$ is the time inconsistency degree of the webpage for the i-th time inconsistency type; and a is the outdated-webpage critical value, $0.5 \le a \le 1$, generally a = 0.5;
Step 3, sorting of outdated webpages based on Web time inconsistency:
Sort the outdated webpages according to each time inconsistency degree and $\max_i\{Inconsistency(i)\}$, and provide classified lists of outdated webpages and a final ranking list. The classified lists comprise the list of outdated webpages with webpage time inconsistency, the list of outdated webpages with webpage-column time inconsistency, and the list of outdated webpages with time inconsistency between identical columns of different websites.
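Steps 2 and 3 above can be sketched as a threshold test followed by a descending sort. The following Python is a minimal illustration only; the function name `rank_outdated`, the dictionary layout, and the sample URLs and degrees are assumptions.

```python
from typing import Dict, List, Tuple


def rank_outdated(
    pages: Dict[str, List[float]], a: float = 0.5
) -> List[Tuple[str, float]]:
    """Return outdated pages sorted by their largest inconsistency degree.

    pages maps a page URL to its per-type inconsistency degrees
    Inconsistency(i); a is the outdated critical value, 0.5 <= a <= 1.
    """
    if not 0.5 <= a <= 1:
        raise ValueError("the critical value a must lie in [0.5, 1]")
    # A page is outdated when its largest per-type degree reaches a.
    outdated = [(url, max(degs)) for url, degs in pages.items() if max(degs) >= a]
    # Most inconsistent pages first.
    return sorted(outdated, key=lambda item: item[1], reverse=True)


# Hypothetical site with three pages and three inconsistency types each.
site = {
    "/news/1": [0.2, 0.9, 0.1],
    "/news/2": [0.1, 0.3, 0.0],
    "/news/3": [0.6, 0.5, 0.4],
}
ranking = rank_outdated(site)
```

Producing the classified lists would amount to running the same test per inconsistency type instead of on the maximum.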
Two, freshness ranking of information on similar websites
Step 1, obtain similar websites and webpages in one of several ways;
Several ways of obtaining similar websites and webpages are provided, including sampling, crawling based on a domain topic, and crawling based on links. Different crawling schemes are offered to the user for different application scenarios. For example, for a large news website, webpages are obtained by sampling; for evaluating how quickly news events are updated, the domain-topic-based crawling method is selected.
Step 2, rank the freshness of website information based on webpage time inconsistency, webpage-column time inconsistency, and time inconsistency between identical columns of different websites.
The website information freshness is $FScore = \frac{FineFScore + CourseFScore}{2}$,
where
FineFScore is the fine-grained freshness, $FineFScore = 1 - \frac{\sum_{i=1}^{n} \sum_{j=1}^{m} webpage(j).Inconsistency(i)}{n \times m}$;
CourseFScore is the coarse-grained freshness, $CourseFScore = 1 - \frac{N_{inconsistency}(webpage)}{m}$;
m is the number of webpages in the website; n is the number of time inconsistency problem types exhibited by the time-inconsistent webpages in the website;
$webpage(j).Inconsistency(i)$ is the time inconsistency degree of the i-th class of time inconsistency problem in the j-th webpage;
$Inconsistency(i)$ is the time inconsistency degree of the i-th class of time inconsistency problem, $0 \le Inconsistency(i) \le 1$; $N_{inconsistency}(webpage)$ is the number of time-inconsistent webpages in the website.
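As a worked illustration of the freshness formulas, the following Python sketch computes FScore from a matrix of per-page, per-type degrees. The function name `f_score` and the sample matrix are assumptions; the rule that a page counts as time-inconsistent when any of its degrees is greater than zero is also an assumption, since the text does not say how $N_{inconsistency}(webpage)$ is counted.

```python
from typing import List


def f_score(degrees: List[List[float]]) -> float:
    """FScore = (FineFScore + CourseFScore) / 2 for one website.

    degrees[j][i] is webpage(j).Inconsistency(i) in [0, 1]:
    m pages (rows) by n inconsistency problem types (columns).
    """
    m = len(degrees)
    n = len(degrees[0])
    # Fine-grained freshness: one minus the mean of all degrees.
    fine = 1 - sum(sum(row) for row in degrees) / (n * m)
    # Assumption: a page is time-inconsistent if any of its degrees is > 0.
    n_inconsistent = sum(1 for row in degrees if any(d > 0 for d in row))
    # Coarse-grained freshness: share of pages with no inconsistency.
    course = 1 - n_inconsistent / m
    return (fine + course) / 2


# Hypothetical site: 4 pages, 2 inconsistency types.
site = [[0.0, 0.0], [0.5, 0.0], [1.0, 0.5], [0.0, 0.0]]
score = f_score(site)
```

Here the fine-grained part is 1 - 2.0/8 = 0.75 and the coarse-grained part is 1 - 2/4 = 0.5, so the site scores 0.625.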
Three, according to the semantics of the time expressions in the user's search keywords, and based on the Web time inconsistency model, rank the retrieval results in a time-aware manner according to the Web time inconsistency problems and the corresponding time inconsistency degrees.
Step 1, build a dictionary of time-expressing words and classify the time words in it. First class: time words expressing the concept of "latest", such as "latest" and "recently". Second class: time words expressing "a period of time", such as today, this week, next week, this month, this year, the coming week, the coming month, and the coming year, as well as festival periods such as "Spring Festival" and "National Day".
Step 2, establish a different constraint function for each class of time word;
The first-class constraint function is established for the first class of time words:
For any webpages W and W' in the webpage list satisfying $T_{publish}(W) < t_0$ and $T_{publish}(W') < t_0$, when $T_{publish}(W) < T_{publish}(W')$, webpage W' has higher priority than webpage W, where $t_0$ is the current system time and $T_{publish}(W)$, $T_{publish}(W')$ are the publish times of webpages W and W', respectively;
The second-class constraint function is established for the second class of time words: any webpage W in the webpage list satisfies $t(keyword) < T_{publish}(W) < t_0$, where $t_0$ is the current system time; $t(keyword)$ denotes the time period expressed by the search keyword, for example, if the time keyword is "this week", then $t(keyword) = t_0 - 7\ \mathrm{days}$; and $T_{publish}(W)$ is the publish time of webpage W.
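The two constraint functions can be sketched in Python as an ordering and a filter over publish times. This is only an illustration: the names `PERIOD_WORDS`, `newest_first`, and `within_period`, the single dictionary entry, and the sample pages are assumptions, with "this week" mapped to 7 days as in the example above.

```python
from datetime import datetime, timedelta
from typing import Dict, List

# Hypothetical dictionary entry: "this week" spans the last 7 days.
PERIOD_WORDS = {"this week": timedelta(days=7)}


def newest_first(pages: Dict[str, datetime], t0: datetime) -> List[str]:
    """First-class constraint: already published pages, newest first."""
    published = [(t, url) for url, t in pages.items() if t < t0]
    return [url for t, url in sorted(published, reverse=True)]


def within_period(pages: Dict[str, datetime], word: str, t0: datetime) -> List[str]:
    """Second-class constraint: t(keyword) < T_publish(W) < t0."""
    start = t0 - PERIOD_WORDS[word]
    return [url for url, t in pages.items() if start < t < t0]


t0 = datetime(2011, 7, 12)
pages = {
    "a": datetime(2011, 7, 11),
    "b": datetime(2011, 6, 1),
    "c": datetime(2011, 7, 8),
}
latest = newest_first(pages, t0)
this_week = within_period(pages, "this week", t0)
```

In this example page "b" falls outside the 7-day window and is dropped by the second-class constraint, while the first-class constraint keeps it but ranks it last.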
Step 3, according to the time words in the user's search keywords, match the constraint function, adopt the corresponding retrieval model, and rank the retrieval results in a time-aware manner according to the Web time inconsistency problems and the corresponding time inconsistency degrees.
1. Using an ordinary query evaluation technique and a chosen retrieval model (such as the Boolean model, a probabilistic model, or a language model), obtain a preliminary webpage list L;
2. Perform constraint matching on the publish times of the webpages in webpage list L; the Boolean model may be used for the constraint matching;
3. Determine the order of the webpages in the webpage list according to ConScore(q, W):
ConScore(q,W)=α(q,W)×Sim(q,W)+β(q,W)×Sim_t(q,W)+γ(q,W)×(1-InCon(W))
where
ConScore(q, W): the similarity of webpage W to a given query q;
Sim(q, W): the webpage similarity computed by an existing information retrieval model;
Sim_t(q, W): the result of constraint matching on the publish times of the webpages in webpage list L; if the publish time of the webpage satisfies the constraint, Sim_t(q, W) = 1, otherwise Sim_t(q, W) = 0;
InCon(W): the time inconsistency degree of webpage W;
α(q, W): the weight of Sim(q, W);
β(q, W): the freshness of webpage W; among the webpages that satisfy the constraint, the closer the publish time is to the current time, the greater the freshness, and the earlier the webpage text was published, the smaller its freshness;
Figure BDA0000075106470000282
$T_{publish}(W)$ is the time at which the webpage was published; t is the start time point of the time constraint specified by the user, determined from the time keyword in the user's query; $t_0$ is the current system time point;
γ(q, W): the weight of the time consistency degree of webpage W in the ranking;
α(q,W)+β(q,W)+γ(q,W)=1, with α(q,W), β(q,W), γ(q,W) ≥ 0.
4. Arrange the webpages in the webpage list in descending order of ConScore(q, W) and return the ranking result to the user.
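The scoring and ordering in items 3 and 4 can be sketched as follows in Python. This is a simplified illustration: the function name `con_score` and the sample candidates are assumptions, and the three weights are passed in as fixed numbers because the formula for β(q, W) is given only in a figure that is not reproduced in the text.

```python
from typing import List, Tuple


def con_score(sim: float, sim_t: int, in_con: float,
              alpha: float, beta: float, gamma: float) -> float:
    """ConScore = alpha*Sim + beta*Sim_t + gamma*(1 - InCon).

    The weights must be non-negative and sum to 1, as the text requires.
    """
    if min(alpha, beta, gamma) < 0 or abs(alpha + beta + gamma - 1) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    return alpha * sim + beta * sim_t + gamma * (1 - in_con)


# Hypothetical candidates: (url, Sim, Sim_t, InCon); weights are assumed values.
candidates = [("p1", 0.8, 0, 0.5), ("p2", 0.6, 1, 0.0), ("p3", 0.9, 1, 1.0)]
ranked: List[Tuple[str, float]] = sorted(
    ((url, con_score(s, st, ic, 0.5, 0.25, 0.25)) for url, s, st, ic in candidates),
    key=lambda item: item[1],
    reverse=True,  # descending order of ConScore
)
```

In this example "p2" ranks first despite the lowest content similarity, because it satisfies the time constraint and shows no time inconsistency.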

Claims (10)

1. A method for automatically discovering and sorting outdated webpages based on Web time inconsistency, characterized by comprising the following steps:
Step 1, in view of the differing time sensitivity of webpage information and the time inconsistency problems existing on the Web, establish the Web time inconsistency model, where the Web time inconsistency model comprises the webpage time inconsistency model, the webpage-column time inconsistency model, and the time inconsistency model for identical columns of different websites; this step further comprises the following substeps:
1-1 perform sensitivity analysis on different webpage information, classify webpages according to their topic and the way their information changes over time, and estimate the time sensitivity range of each class of webpage;
1-2 establish the Web time relation vector model using the logical order relation of Web information on the time axis;
1-3 for the time inconsistency problems that each class of webpage itself exhibits, construct the webpage time inconsistency model according to the Web time relation vector model, where the webpage time inconsistency model comprises the time delay inconsistency model, the constrained inconsistency model, and the unconstrained inconsistency model;
1-4 for the time inconsistency between the time information of the webpages in a website column and the meaning of the column, construct the webpage-column time inconsistency model according to the Web time relation vector model, where the webpage-column time inconsistency model comprises the time delay inconsistency model, the constrained inconsistency model, and the unconstrained inconsistency model;
1-5 for the time inconsistency among webpages that describe the same information under identical columns of different websites, establish the time inconsistency model for identical columns of different websites according to the Web time relation vector model, where this model comprises the comparison inconsistency model and the prediction inconsistency model;
Step 2, extract multi-dimensional time from Web information using the time knowledge concept model, regular grammar matching, and pattern algebra, where the multi-dimensional time comprises the event occurrence time, writing time, publish time, reading time, reprint time, and text expiry time;
Step 3, classify webpages according to the Web time inconsistency model and, based on the extracted multi-dimensional time of the Web information, measure Web time inconsistency to obtain the time inconsistency degree of each webpage, where the Web time inconsistency measurement comprises the time inconsistency measurement of a webpage itself, the time inconsistency measurement between a webpage and its column, and the time inconsistency measurement between identical columns of different websites;
The time inconsistency degree of a webpage itself and the time inconsistency degree between a webpage and its column are: $InCon(W) = \sum_{i=1}^{n} \alpha_i \times webpage.Inconsistency(i)$, where $InCon(W)$ is the time inconsistency degree of webpage W; n is the number of time inconsistency problem types exhibited by the time-inconsistent webpages in the website; $\alpha_i$ is the weight of time inconsistency problem type i; and $webpage.Inconsistency(i)$ is the time inconsistency degree of webpage W for the i-th class of time inconsistency problem;
The time inconsistency degree between identical columns of different websites comprises the comparison inconsistency degree and the prediction inconsistency degree. The comparison inconsistency degree is $InConCompare(W_1, W_2) = \frac{1}{2}\left(\frac{\cos\langle x, y\rangle}{|x|\,|y|} + 1\right)$, where $InConCompare(W_1, W_2)$ is the comparison inconsistency degree of webpages $W_1$ and $W_2$, and x, y are the event description vectors of webpages $W_1$ and $W_2$, respectively. The prediction inconsistency degree is $InConPre(W_1, W_2) = \frac{1}{2}\left(\frac{\cos\langle x, y\rangle}{|x|\,|y|} + 1\right)$, where $InConPre(W_1, W_2)$ is the prediction inconsistency degree of webpages $W_1$ and $W_2$, and x, y are the event description vectors of webpages $W_1$ and $W_2$, respectively;
Step 4, construct the time inconsistency rule sets and, using the time inconsistency rule sets, the rule set based on time knowledge concepts, and the time inconsistency logical reasoning operators, carry out Web time inconsistency reasoning on the extracted multi-dimensional time of the Web information, where the time inconsistency rule sets comprise the time delay inconsistency rule set, the constrained inconsistency rule set, the unconstrained inconsistency rule set, the comparison inconsistency rule set, and the prediction inconsistency rule set; Web time inconsistency reasoning comprises inference of unknown dimension time values in the webpage time relation vector, time inconsistency reasoning over webpage information on the same topic, and statistical reasoning about time inconsistency between identical columns of different websites;
Step 5, for the website address entered by the user, obtain the time inconsistency degree of each webpage based on the Web time inconsistency model, the Web time inconsistency measurement, and the Web time inconsistency reasoning, automatically discover the outdated webpages of the website, and provide a list of outdated webpages ordered by the time inconsistency degree of the webpages.
2. The method for automatically discovering and sorting outdated webpages based on Web time inconsistency according to claim 1, characterized by further comprising the step of:
Based on the Web time inconsistency model, the Web time inconsistency measurement, and the Web time inconsistency reasoning, ranking similar websites and webpages according to the website information freshness, where:
The website information freshness is $FScore = \frac{FineFScore + CourseFScore}{2}$; FineFScore is the fine-grained freshness, $FineFScore = 1 - \frac{\sum_{i=1}^{n} \sum_{j=1}^{m} webpage(j).Inconsistency(i)}{n \times m}$; CourseFScore is the coarse-grained freshness, $CourseFScore = 1 - \frac{N_{inconsistency}(webpage)}{m}$; m is the number of webpages in the website; n is the number of time inconsistency problem types exhibited by the time-inconsistent webpages in the website; $webpage(j).Inconsistency(i)$ is the time inconsistency degree of the i-th class of time inconsistency problem in the j-th webpage; $Inconsistency(i)$ is the time inconsistency degree of the i-th class of time inconsistency problem, $0 \le Inconsistency(i) \le 1$; $N_{inconsistency}(webpage)$ is the number of time-inconsistent webpages in the website.
3. The method for automatically discovering and sorting outdated webpages based on Web time inconsistency according to claim 1, characterized by further comprising the step of:
Based on the Web time inconsistency model and according to the semantics of the time expressions in the user's search keywords, ranking the retrieval results in a time-aware manner according to the Web time inconsistency problems and the corresponding time inconsistency degrees.
4. The method for automatically discovering and sorting outdated webpages based on Web time inconsistency according to claim 1, 2 or 3, characterized in that:
The time delay inconsistency model, the constrained inconsistency model, and the unconstrained inconsistency model in substeps 1-3 and 1-4 of step 1 are:
Time delay inconsistency model:
The delay between the occurrence time and the publish time of webpage event e is $D_{happen}^{publish}(e) = T_{publish}(e) - T_{happen}(e)$, where $T_{publish}(e)$ and $T_{happen}(e)$ are the publish time and the occurrence time of webpage event e, respectively; when
Figure FDA0000075106460000033
holds, the time delay is consistent, and the time delay consistency degree is
Figure FDA0000075106460000034
When […] holds, the time delay is inconsistent and the inconsistency degree is ConD = 1; a is the time delay inconsistency critical value, set according to the time sensitivity of the webpage information;
Constrained inconsistency model:
$ConUC = e^{T_{relative} \ln 0.6} = 0.6^{T_{relative}}$, where ConUC is the constraint time consistency degree and $T_{relative}$ is the relative time, $T_{relative} = \frac{T_{read} - T_{publish}}{T_{out} - T_{publish}}$;
Unconstrained inconsistency model:
$ConUC = \begin{cases} 1, & 0 \le t \le b \\ \left(1 + a^{-2}(t - b)^{-2}\right)^{-1}, & t \ge b \end{cases}$, $InConUC = \begin{cases} 0, & 0 \le t \le c \\ \left(1 + \left(\frac{t - c}{a}\right)^{-2}\right)^{-1}, & t \ge c \end{cases}$
where ConUC is the constraint time consistency degree of the webpage; InConUC is the constraint time inconsistency degree of the webpage; a is the information sensitivity of the webpage; and [b, c] is the neighborhood interval of the expiry time $T_{out}$, $T_{out} = T_{publish} + \frac{1}{a}$.
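As a worked illustration of the constrained inconsistency model only, the following Python sketch evaluates $ConUC = 0.6^{T_{relative}}$ from three timestamps. The function name `con_uc` and the sample dates are assumptions; the time delay and unconstrained models are not covered here.

```python
from datetime import datetime


def con_uc(t_publish: datetime, t_out: datetime, t_read: datetime) -> float:
    """Constraint consistency degree ConUC = 0.6 ** T_relative.

    T_relative = (T_read - T_publish) / (T_out - T_publish): 0 when the page
    is read at publication, 1 when it is read exactly at its expiry time.
    """
    if t_out <= t_publish:
        raise ValueError("T_out must be later than T_publish")
    # Dividing two timedeltas yields a plain float ratio.
    t_relative = (t_read - t_publish) / (t_out - t_publish)
    return 0.6 ** t_relative


# Hypothetical page published on July 1 that expires on July 11.
published = datetime(2011, 7, 1)
expires = datetime(2011, 7, 11)
at_publication = con_uc(published, expires, published)  # T_relative = 0
at_expiry = con_uc(published, expires, expires)         # T_relative = 1
```

The consistency degree therefore starts at 1, falls to 0.6 at the expiry time, and keeps decaying exponentially for later reading times.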
5. The method for automatically discovering and sorting outdated webpages based on Web time inconsistency according to claim 1, 2 or 3, characterized in that:
The workflows of the comparison inconsistency model and the prediction inconsistency model in substep 1-5 of step 1 are as follows:
The workflow of the comparison inconsistency model is:
1. Perform event mining on webpage $W_1$ to obtain the event description vector x of the webpage;
2. From the publish time $T_{publish}(e)$ of the event e described by webpage $W_1$, determine the neighborhood time interval of webpage $W_1$, $[T_{publish}(e) - \delta,\ T_{publish}(e) + \delta]$, $\delta > 0$, $\delta \to 0$;
3. From the event description vector x of webpage $W_1$, determine the search keywords and search for related webpages;
4. For each related webpage found, match its publish time against the neighborhood time interval: if it falls within the interval, keep the related webpage; otherwise discard it; this finally yields a related webpage set W;
5. For all webpages in the related webpage set W, extract the time dimensions and determine their event description vectors $y_1, y_2, \ldots, y_n$, where n is the number of webpages in the related webpage set W;
6. Compare the event description vectors x and $y_i$, $i \in [1, n]$: if $\cos\langle x, y_i\rangle \le 0$, the webpages are comparison-inconsistent; otherwise, stop;
7. Stop.
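Steps 4 and 6 of the comparison workflow can be sketched in Python as a time-window filter followed by a cosine test. This is a minimal illustration under several assumptions: event mining, keyword search, and vector extraction (steps 1, 3, and 5) are taken as already done, the names `cosine` and `comparison_inconsistent` are hypothetical, and a zero vector is treated as having cosine 0.

```python
import math
from datetime import datetime, timedelta
from typing import List, Sequence, Tuple


def cosine(x: Sequence[float], y: Sequence[float]) -> float:
    """Cosine of the angle between two event description vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    # Assumption: a zero vector yields cosine 0 instead of a division error.
    return dot / norm if norm else 0.0


def comparison_inconsistent(
    x: Sequence[float],
    t_publish: datetime,
    related: List[Tuple[datetime, Sequence[float]]],
    delta: timedelta,
) -> bool:
    """True if any related page in the neighbourhood has cosine(x, y_i) <= 0."""
    lo, hi = t_publish - delta, t_publish + delta
    # Step 4: keep only related pages published inside the neighbourhood.
    kept = [y for t, y in related if lo <= t <= hi]
    # Step 6: compare the event description vectors.
    return any(cosine(x, y) <= 0 for y in kept)


t = datetime(2011, 7, 1)
related = [
    (t + timedelta(hours=1), [1.0, 0.0]),   # inside the neighbourhood
    (t + timedelta(days=30), [-1.0, 0.0]),  # outside, ignored
]
agrees = comparison_inconsistent([1.0, 0.2], t, related, timedelta(days=1))
conflicts = comparison_inconsistent([-1.0, 0.0], t, related, timedelta(days=1))
```

Only the page published within one day of the reference page is compared, so the distant page with an opposing vector does not affect the result.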
The workflow of the prediction inconsistency model is:
1. Determine whether webpage $W_1$ is a prediction webpage: if $T_{happen}(e) > T_{publish}(e)$, or $T_{happen}(e)$ is empty, webpage $W_1$ is considered a prediction webpage; otherwise, stop, where e is the event in webpage $W_1$;
2. Compare $T_{read}(e)$ and $T_{out}(e)$: if $T_{read}(e) \ge T_{out}(e)$, the webpage is prediction-inconsistent, stop; otherwise, go to step 3;
3. Perform event mining on the prediction webpage $W_1$ to obtain the event description vector x of webpage $W_1$;
4. From the event description vector x of webpage $W_1$, determine the search keywords and search for related webpages;
5. Screen the retrieval results and select the highly relevant webpages to obtain a related webpage set W; for all webpages in the related webpage set W, extract the time dimensions and determine their event description vectors $y_1, y_2, \ldots, y_n$, where n is the number of webpages in the related webpage set W;
6. Compare the event description vectors x and $y_i$, $i \in [1, n]$: if $\cos\langle x, y_i\rangle \le 0$, the webpage is prediction-inconsistent; otherwise, stop;
7. Stop.
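The first two steps of the prediction workflow depend only on timestamps and can be sketched directly in Python. The function names `is_prediction_page` and `expired_prediction` and the sample dates are assumptions; the vector comparison in steps 3 to 6 is not repeated here.

```python
from datetime import datetime
from typing import Optional


def is_prediction_page(t_happen: Optional[datetime], t_publish: datetime) -> bool:
    """Step 1: the event lies after publication, or has no known time."""
    return t_happen is None or t_happen > t_publish


def expired_prediction(
    t_happen: Optional[datetime],
    t_publish: datetime,
    t_read: datetime,
    t_out: datetime,
) -> bool:
    """Steps 1-2: a prediction page that is read at or after its expiry time."""
    return is_prediction_page(t_happen, t_publish) and t_read >= t_out


# Hypothetical page published July 1 about an event on August 1,
# expiring on August 2.
publish = datetime(2011, 7, 1)
happen = datetime(2011, 8, 1)
out = datetime(2011, 8, 2)
read_late = expired_prediction(happen, publish, datetime(2011, 9, 1), out)
read_early = expired_prediction(happen, publish, datetime(2011, 7, 15), out)
```

A page for which this check returns False is not yet decided: if it is a prediction page, it still has to pass the vector comparison of the later steps.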
6. The method for automatically discovering and sorting outdated webpages based on Web time inconsistency according to claim 1, 2 or 3, characterized in that:
The time inconsistency rule sets in step 4 are as follows:
Time delay inconsistency rule set: let t be the time delay inconsistency critical value of event e. If
Figure FDA0000075106460000051
then the time delay is consistent; if
Figure FDA0000075106460000052
then the time delay is inconsistent, where R is the temporal order relation in the temporal system T;
Constrained inconsistency rule set: for a webpage event e and the current time t, a) if there exists
Figure FDA0000075106460000053
where s is a time point, such that Ge = 1 holds when $R(t, s) = 1$, and
Figure FDA0000075106460000054
such that $\tau(t, Ge) = 0$ when $R(s_0, t) = 1$ and $\tau(t, He) = 1$ when $R(t, s_0) = 1$, then event e has expired and its expiry time point is $s_0$; the positive expiry distance of event e is $s_0 - t$. b) If Ge = 1 holds when $R(s, s_0) = 1$ and $R(t, s) = 1$, and
Figure FDA0000075106460000056
holds, then event e has not expired and is called valid, and its expiry time point is $s_0$; the negative expiry distance of event e is $s_0 - t$. Here R is the temporal order relation in the temporal system T; Ge means that e is true at all future times; $\tau$ is the truth assignment function in the temporal system T; He means that e has always been true in the past;
Unconstrained inconsistency rule set: let t be the critical time, $t_0$ the current system time, e the event described by webpage W, and e' the event described by webpage W'. When $R(T_{publicate}(e), t) = 1$, webpage W is considered unconstrained-inconsistent. Unconstrained-consistent webpages satisfy:
Figure FDA0000075106460000057
where $T_{publicate}(e')$ satisfies $R(T_{publicate}(e), t_0) = 1$ and $R(T_{publicate}(e'), t_0) = 1$; when $R(T_{publicate}(e), T_{publicate}(e')) = 1$, webpage W' has higher priority than webpage W;
Comparison inconsistency rule set: e is the event described by webpage W and e' is the event described by webpage W'. For any similar webpages W and W', when $T_{publish}(e') \in [T_{publish}(e) - \delta,\ T_{publish}(e) + \delta]$, if $\cos\langle Vector(W), Vector(W')\rangle \le 0$, then the webpages are comparison-inconsistent;
Prediction inconsistency rule set: for every prediction-type webpage, if $R(T_{happen}(e), T_{publish}(e)) = 1$ or $R(T_{out}(e), T_{publish}(e)) = 1$, then the webpage is considered prediction-inconsistent.
7. The method for automatically discovering and sorting outdated webpages based on Web time inconsistency according to claim 1, 2 or 3, characterized in that:
Step 5 further comprises the following substeps:
5-1 Extraction, reasoning, and measurement for time-inconsistent webpages:
1. Information extraction: for the time-sensitive webpages that have been screened out, extract time information, including the time information in the title of the column containing the webpage, the time information in the webpage title, and the time dimensions of the webpage content;
2. Web time inconsistency reasoning: using the Web time inconsistency reasoning method of step 4, perform webpage time inconsistency reasoning for the time dimensions that were not extracted, inferring the unknown dimensions of a webpage from its known dimensions and from the time dimensions of similar webpages;
3. Web time inconsistency measurement: on the basis of the extracted time information, identify the time inconsistency pattern of the webpage through Web time inconsistency reasoning, and apply the time inconsistency model that corresponds to each pattern to measure time inconsistency;
5-2 Automatic discovery of outdated webpages in the website:
1. From each class of time inconsistency measurement, obtain each class of time inconsistency degree for every webpage;
2. If $\max_i\{Inconsistency(i)\} \ge a$, the webpage is considered outdated, where i is a time inconsistency type; $Inconsistency(i)$ is the time inconsistency degree of the webpage for the i-th time inconsistency type; and a is the outdated-webpage critical value, $0.5 \le a \le 1$, generally a = 0.5;
5-3 Sorting of outdated webpages based on Web time inconsistency:
Sort the outdated webpages according to each time inconsistency degree and $\max_i\{Inconsistency(i)\}$, and provide classified lists of outdated webpages and a final ranking list, where the classified lists comprise the list of outdated webpages with webpage time inconsistency, the list of outdated webpages with webpage-column time inconsistency, and the list of outdated webpages with time inconsistency between identical columns of different websites.
8. The method for automatically discovering and ranking out-of-date web pages based on Web time inconsistency according to claim 3, characterized in that:
This step further comprises the following substeps:
1. Build a dictionary of words that express time, and classify the time words in the dictionary;
2. Establish a different constraint function for each class of time words;
3. According to the time word in the user's search keywords, match the constraint function, apply the corresponding retrieval model, and perform time-aware ranking of the retrieval results.
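The three substeps above can be sketched as a small lookup followed by a constraint check. The sketch borrows the two word classes and the two constraints that the dependent claim below defines ("latest" words prefer later publication times; "period" words require t(keyword) < T_Publish(W) < t_0). The dictionary contents, the names `LATEST_WORDS`, `PERIOD_WORDS`, `classify` and `time_aware_order`, and the mapping of a year word to the start of that year are hypothetical choices made only for illustration.

```python
from datetime import datetime
from typing import List, Optional, Tuple

Page = Tuple[str, datetime]  # (url, publication time T_Publish)

# Hypothetical dictionary, split into the two classes of time words.
LATEST_WORDS = {"latest", "newest", "recent"}
PERIOD_WORDS = {"2010": datetime(2010, 1, 1), "2011": datetime(2011, 1, 1)}


def classify(query: str) -> Tuple[Optional[str], Optional[datetime]]:
    """Return the class of the first time word in the query, plus t(keyword) if any."""
    for token in query.lower().split():
        if token in LATEST_WORDS:
            return "latest", None
        if token in PERIOD_WORDS:
            return "period", PERIOD_WORDS[token]
    return None, None


def time_aware_order(query: str, pages: List[Page], now: datetime) -> List[Page]:
    """Apply the constraint that matches the time word found in the query."""
    kind, start = classify(query)
    if kind == "latest":
        # First class: a later publication time means a higher priority.
        return sorted(pages, key=lambda p: p[1], reverse=True)
    if kind == "period":
        # Second class: keep pages with t(keyword) < T_Publish(W) < t_0.
        return [p for p in pages if start < p[1] < now]
    return list(pages)  # no time word: leave the retrieval order untouched
```

A query without a time word falls through unchanged, so the time-aware step only affects queries that actually carry temporal intent.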
9. The method for automatically discovering and ranking out-of-date web pages based on Web time inconsistency according to claim 8, characterized in that:
In step 1, the time words in the dictionary are divided into two classes: the first class comprises time words expressing the notion of "latest", and the second class comprises time words expressing "a period of time";
The constraint functions in step 2 fall into two classes:
The first class of constraint function is established for the first class of time words:
Figure FDA0000075106460000071
W′ and W are different web pages in the web page list and t_0 is the current system time; when T_Publish(W) < T_Publish(W′), web page W′ has a higher priority than web page W, where T_Publish(W) and T_Publish(W′) are the publication times of web pages W and W′ respectively;
The second class of constraint function is established for the second class of time words: t(keyword) < T_Publish(W) < t_0, where W is any web page in the web page list, t_0 is the current system time, t(keyword) is the time period denoted by the search keyword, and T_Publish(W) is the publication time of web page W;
Step 3 further comprises the following substeps:
1. Using query evaluation techniques and a retrieval model, obtain a preliminary web page list L;
2. Using the retrieval model, match the publication times of the web pages in list L against the constraint;
3. Determine the order of the web pages in the list according to ConScore(q, W),
ConScore(q,W)=α(q,W)×Sim(q,W)+β(q,W)×Sim_t(q,W)+γ(q,W)×(1-InCon(W))
Wherein,
ConScore(q, W): the similarity of web page W to a query q;
Sim(q, W): the web page similarity;
Sim_t(q, W): when the publication times of the web pages in list L are matched against the constraint, Sim_t(q, W) = 1 if the publication time satisfies the constraint, and Sim_t(q, W) = 0 otherwise;
InCon(W): the time inconsistency degree of web page W;
α(q, W): the weight of Sim(q, W);
β(q, W): the freshness of web page W,
Figure FDA0000075106460000081
T_Publish(W) is the publication time of the web page, t is the start time point of the time constraint specified by the user, and t_0 is the current system time point;
γ(q, W): the weight of the time consistency degree of web page W in the ranking;
α(q,W)+β(q,W)+γ(q,W)=1, α(q,W), β(q,W), γ(q,W) ≥ 0.
4. Arrange the web pages in the web page list in descending order of ConScore(q, W).
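The scoring and ordering in substeps 3 and 4 can be sketched as below. The claim lets α, β and γ depend on the query and the page, and defines β through a formula that is not reproduced in this text, so the sketch simply takes the three weights as given inputs; that simplification, the `Candidate` record and the function names are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    url: str
    sim: float    # Sim(q, W): content similarity to the query
    sim_t: int    # Sim_t(q, W): 1 if the publication time satisfies the constraint, else 0
    incon: float  # InCon(W): time inconsistency degree of the page, in [0, 1]


def con_score(c: Candidate, alpha: float, beta: float, gamma: float) -> float:
    """ConScore = alpha * Sim + beta * Sim_t + gamma * (1 - InCon)."""
    if min(alpha, beta, gamma) < 0 or abs(alpha + beta + gamma - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    return alpha * c.sim + beta * c.sim_t + gamma * (1.0 - c.incon)


def rerank(cands: List[Candidate], alpha: float, beta: float, gamma: float) -> List[Candidate]:
    """Arrange candidates in descending order of ConScore."""
    return sorted(cands, key=lambda c: con_score(c, alpha, beta, gamma), reverse=True)
```

For example, with weights 0.5, 0.3 and 0.2, a page with Sim = 0.6 that satisfies the time constraint and has InCon = 0.1 scores 0.3 + 0.3 + 0.18 = 0.78, which outranks a textually closer page (Sim = 0.9) that fails the constraint and has InCon = 0.8, scoring 0.45 + 0 + 0.04 = 0.49.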
10. A system for automatically discovering and ranking out-of-date web pages based on Web time inconsistency, characterized by comprising:
1. an automatic out-of-date web page discovery module, 2. a similar-website information freshness ranking module, and 3. a time-aware search module;
1. The automatic out-of-date web page discovery module further comprises the following submodules: a time information extraction module, a Web time inconsistency reasoning module, a Web time inconsistency measurement module, and a Web-time-inconsistency-based out-of-date web page ranking module, wherein:
The time information extraction module extracts time information from the time-sensitive web pages that have been screened out; it comprises a submodule for extracting time information from the title of the column containing the web page, a submodule for extracting time information from the web page title, and a submodule for extracting the time dimensions of the web page content;
The Web time inconsistency reasoning module performs time inconsistency reasoning for the time dimensions that were not extracted from a web page and identifies the time inconsistency pattern of the page; it comprises a submodule that infers the unknown dimensions of a page from its known dimensions and a submodule that infers unknown web page time dimensions from the time dimensions of web pages;
The Web time inconsistency measurement module applies the time inconsistency model that corresponds to the time inconsistency pattern of the web page in order to measure Web time inconsistency; it comprises a submodule that measures time inconsistency within the web page itself, a submodule that measures time inconsistency between the web page and its column, and a submodule that measures time inconsistency between identical columns of different websites;
The Web-time-inconsistency-based out-of-date web page ranking module ranks out-of-date web pages according to the results of the Web time inconsistency measurement module; it comprises a submodule that ranks pages that are out of date because of time inconsistency within the page, a submodule that ranks pages that are out of date because of time inconsistency between the page and its column, and a submodule that ranks pages that are out of date because of time inconsistency between columns of different websites;
2. The similar-website information freshness ranking module further comprises a web page acquisition submodule and a ranking submodule, wherein:
The web page acquisition module obtains similar websites and their web pages;
The ranking module ranks the similar websites and web pages obtained by the web page acquisition module by site information freshness, based on the site information freshness score $FScore = \frac{FineFScore + CourseFScore}{2}$, where FineFScore is the fine-grained freshness, $FineFScore = 1 - \frac{\sum_{i=1}^{n} \sum_{j=1}^{m} webpage(j).Inconsistency(i)}{n \times m}$; CourseFScore is the coarse-grained freshness, $CourseFScore = 1 - \frac{N_{inconsistency}(webpage)}{m}$; m is the number of web pages in the website; n is the number of time inconsistency problem types present among the time-inconsistent web pages of the website; webpage(j).Inconsistency(i) is the degree of time inconsistency of the class-i problem in the j-th web page; Inconsistency(i) is the degree of time inconsistency of the class-i problem, 0 ≤ Inconsistency(i) ≤ 1; and $N_{inconsistency}(webpage)$ is the number of web pages in the website that exhibit time inconsistency.
3. The time-aware search module further comprises a meta-search submodule, a web page re-ranking submodule, and a feedback submodule, wherein:
The meta-search module uses meta-search to obtain the search results returned by multiple search engines;
The web page re-ranking module re-ranks the search results obtained by the meta-search module according to the time-aware ranking score ConScore(q, W) to obtain a new web page ranking, where ConScore(q, W) is the similarity of web page W to a query q,
ConScore(q,W)=α(q,W)×Sim(q,W)+β(q,W)×Sim_t(q,W)+γ(q,W)×(1-InCon(W))
Sim(q, W): the web page similarity;
Sim_t(q, W): when the publication times of the web pages in list L are matched against the constraint, Sim_t(q, W) = 1 if the publication time satisfies the constraint, and Sim_t(q, W) = 0 otherwise;
InCon(W): the time inconsistency degree of web page W;
α(q, W): the weight of Sim(q, W);
β(q, W): the freshness of web page W,
Figure FDA0000075106460000101
T_Publish(W) is the publication time of the web page, t is the start time point of the time constraint specified by the user, and t_0 is the current system time point;
γ(q, W): the weight of the time consistency degree of web page W in the ranking;
α(q,W)+β(q,W)+γ(q,W)=1, α(q,W), β(q,W), γ(q,W) ≥ 0;
The feedback module returns the new web page ranking to the user.
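The site freshness score used by the similar-website ranking module in this claim can be sketched as follows. Each site is represented as an m × n matrix of inconsistency degrees (one row per page, one column per inconsistency type). The matrix layout, the function names, and the rule that a page counts as inconsistent when any of its degrees exceeds a `threshold` (default 0) are assumptions of this example, not details given in the claim.

```python
from typing import Dict, List

Site = List[List[float]]  # site[j][i] = webpage(j).Inconsistency(i), each in [0, 1]


def _shape(site: Site):
    """Return (m, n) after checking the matrix is non-empty and rectangular."""
    m = len(site)
    n = len(site[0]) if m else 0
    if m == 0 or n == 0 or any(len(row) != n for row in site):
        raise ValueError("site must be a non-empty rectangular matrix")
    return m, n


def fine_fscore(site: Site) -> float:
    """FineFScore = 1 - (sum of all degrees) / (n * m)."""
    m, n = _shape(site)
    return 1.0 - sum(sum(row) for row in site) / (n * m)


def coarse_fscore(site: Site, threshold: float = 0.0) -> float:
    """CourseFScore in the claim: 1 - (number of inconsistent pages) / m."""
    m, _ = _shape(site)
    inconsistent = sum(1 for row in site if any(d > threshold for d in row))
    return 1.0 - inconsistent / m


def fscore(site: Site) -> float:
    """FScore = (FineFScore + CourseFScore) / 2."""
    return (fine_fscore(site) + coarse_fscore(site)) / 2.0


def rank_sites(sites: Dict[str, Site]) -> List[str]:
    """Site names ordered from freshest to least fresh."""
    return sorted(sites, key=lambda name: fscore(sites[name]), reverse=True)
```

The fine-grained term rewards sites whose inconsistencies are mild, while the coarse-grained term penalizes the share of affected pages regardless of severity; averaging the two keeps a single badly inconsistent page from dominating the score of a large site.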
CN 201110194133 2011-07-12 2011-07-12 Method for automatically discovering and sequencing outdated webpage based on Web time inconsistency Expired - Fee Related CN102253998B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201110194133 CN102253998B (en) 2011-07-12 2011-07-12 Method for automatically discovering and sequencing outdated webpage based on Web time inconsistency

Publications (2)

Publication Number Publication Date
CN102253998A true CN102253998A (en) 2011-11-23
CN102253998B CN102253998B (en) 2013-08-14

Family

ID=44981262

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201110194133 Expired - Fee Related CN102253998B (en) 2011-07-12 2011-07-12 Method for automatically discovering and sequencing outdated webpage based on Web time inconsistency

Country Status (1)

Country Link
CN (1) CN102253998B (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102521225A (en) * 2011-09-29 2012-06-27 用友软件股份有限公司 Incremental data extraction device and incremental data extraction method
CN102737125A (en) * 2012-06-15 2012-10-17 武汉大学 Web temporal object model-based outdated webpage information automatic discovering method
CN102880660A (en) * 2012-09-03 2013-01-16 常州嘴馋了信息科技有限公司 Website hot-spot information sequencing system
CN103927365A (en) * 2014-04-21 2014-07-16 武汉大学 Web page time sensibility measurement method based on energy function
CN106874308A (en) * 2015-12-14 2017-06-20 北京搜狗科技发展有限公司 It is a kind of to recommend method and apparatus, a kind of device for recommending
CN107729153A (en) * 2017-10-31 2018-02-23 麦格创科技(深圳)有限公司 Web retrieval method for allocating tasks and system
CN111241379A (en) * 2018-11-28 2020-06-05 阿里巴巴集团控股有限公司 Search result processing method and device, electronic equipment and computer readable medium
CN112084452A (en) * 2020-09-22 2020-12-15 扆亮海 Webpage time efficiency obtaining method for temporal consistency constraint judgment
CN112256987A (en) * 2020-10-19 2021-01-22 中国互联网金融协会 Method, device, equipment and storage medium for monitoring overseas stock trading website
US11250015B2 (en) 2020-02-07 2022-02-15 Coupang Corp. Systems and methods for low-latency aggregated-data provision
CN114626306A (en) * 2022-03-22 2022-06-14 华北电力大学 Method and system for guaranteeing freshness of regulation and control information of park distributed energy
CN115186163A (en) * 2022-06-27 2022-10-14 北京百度网讯科技有限公司 Training method and device of search result ranking model and search result ranking method and device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1798127A (en) * 2004-12-30 2006-07-05 伺服网路科技股份有限公司 Method of reducing time to download a web page
US20060271533A1 (en) * 2005-05-26 2006-11-30 Kabushiki Kaisha Toshiba Method and apparatus for generating time-series data from Web pages
US20090119276A1 (en) * 2007-11-01 2009-05-07 Antoine Sorel Neron Method and Internet-based Search Engine System for Storing, Sorting, and Displaying Search Results
CN101499074A (en) * 2008-01-31 2009-08-05 株式会社日立制作所 Method, apparatus and server for presenting web page contents
CN101546308A (en) * 2008-09-25 2009-09-30 厦门市美亚柏科资讯科技有限公司 Web page search method and web page search system based on overdue retrieval

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
《中文信息学报》 20080331 王勇等 基于用户兴趣分析的网页生命周期建模 76-80 第22卷, 第2期 *

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102521225B (en) * 2011-09-29 2013-09-11 用友软件股份有限公司 Incremental data extraction device and incremental data extraction method
CN102521225A (en) * 2011-09-29 2012-06-27 用友软件股份有限公司 Incremental data extraction device and incremental data extraction method
CN102737125A (en) * 2012-06-15 2012-10-17 武汉大学 Web temporal object model-based outdated webpage information automatic discovering method
CN102880660A (en) * 2012-09-03 2013-01-16 常州嘴馋了信息科技有限公司 Website hot-spot information sequencing system
CN103927365A (en) * 2014-04-21 2014-07-16 武汉大学 Web page time sensibility measurement method based on energy function
CN103927365B (en) * 2014-04-21 2017-01-25 武汉大学 Web page time sensibility measurement method based on energy function
CN106874308B (en) * 2015-12-14 2021-03-26 北京搜狗科技发展有限公司 Recommendation method and device and recommendation device
CN106874308A (en) * 2015-12-14 2017-06-20 北京搜狗科技发展有限公司 It is a kind of to recommend method and apparatus, a kind of device for recommending
CN107729153A (en) * 2017-10-31 2018-02-23 麦格创科技(深圳)有限公司 Web retrieval method for allocating tasks and system
CN111241379A (en) * 2018-11-28 2020-06-05 阿里巴巴集团控股有限公司 Search result processing method and device, electronic equipment and computer readable medium
CN111241379B (en) * 2018-11-28 2023-04-25 阿里巴巴集团控股有限公司 Search result processing method and device, electronic equipment and computer readable medium
US11250015B2 (en) 2020-02-07 2022-02-15 Coupang Corp. Systems and methods for low-latency aggregated-data provision
TWI794709B (en) * 2020-02-07 2023-03-01 南韓商韓領有限公司 Computer -implemented system and method for low-latency aggregated-data provision
US11899678B2 (en) 2020-02-07 2024-02-13 Coupang Corp. Systems and methods for low latency aggregated data provision
CN112084452A (en) * 2020-09-22 2020-12-15 扆亮海 Webpage time efficiency obtaining method for temporal consistency constraint judgment
CN112256987A (en) * 2020-10-19 2021-01-22 中国互联网金融协会 Method, device, equipment and storage medium for monitoring overseas stock trading website
CN114626306A (en) * 2022-03-22 2022-06-14 华北电力大学 Method and system for guaranteeing freshness of regulation and control information of park distributed energy
CN115186163A (en) * 2022-06-27 2022-10-14 北京百度网讯科技有限公司 Training method and device of search result ranking model and search result ranking method and device

Also Published As

Publication number Publication date
CN102253998B (en) 2013-08-14

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20130814

Termination date: 20210712