Publication number: CN1536483 A
Publication type: Application
Application number: CN 03109338
Publication date: 13 Oct 2004
Filing date: 4 Apr 2003
Priority date: 4 Apr 2003
Publication numbers: 03109338.8, CN 03109338, CN 1536483 A, CN 1536483A, CN-A-1536483, CN03109338, CN03109338.8, CN1536483 A, CN1536483A
Inventor: 陈文中
Applicant: 陈文中
External links: SIPO, Espacenet
Method for extracting and processing network information and its system
CN 1536483 A
Abstract
The invention relates to a method and system for extracting and processing network information. Using artificial intelligence and natural language processing techniques, it automatically downloads each day's latest news and information from designated websites, performs content extraction, classification, and automatic summarization to condense the full text, stores the full text, and then builds a full-text index over it so that efficient full-text retrieval can be performed later.
Claims (24)  (translated from Chinese)
1. A method for extracting and processing network information, comprising the following steps:
I. A news download step, comprising: a URL analysis step: the system specifies certain URLs, and the program automatically derives the final news-content URLs from them without building a site-specific URL module for each news website; using URL statistics and URL correlation analysis, pages containing links to final news content are analyzed statistically to find the useful final URLs; an automatic news-page crawling step: downloading all link pages at the target address that conform to the URL format; a noise-filtering step: filtering the downloaded news content pages to remove HTML tags and useless Chinese text, finally yielding Chinese text vectors; an information extraction step: extracting information from the Chinese vectors obtained above, where an early stage extracts the title and body, and a later stage performs feature extraction, correlation analysis, document classification, duplicate elimination, and so on for the web news content;
II. An automatic summarization step: performing word segmentation, feature-word analysis, sentence-importance analysis, summary generation, and summary output;
III. A full-text indexing step: building a full-text index over all news content files that have been downloaded and content-extracted, comprising: a pass-in step that passes in the next file name; an index-check step that determines whether the file has already been indexed, returning to the pass-in step if so and proceeding to the next step otherwise; a filtering step that filters out all noise and meaningless words; a dictionary-matching segmentation step that performs dictionary-based segmentation; an n-gram segmentation step that performs n-gram segmentation to catch words the dictionary segmentation failed to separate; and an update step that, for every word, updates the relevant index files, including the keyword, date, and category indexes;
IV. A hierarchical text classification step: assigning a new document to one class within a given class hierarchy; each document can be assigned to only one class; each class in the hierarchy is associated with many words and terms carrying larger weights; a given term sits at one level of the hierarchy while stopwords sit at another; the feature words of the extracted documents (financial news) serve as the terms and vocabulary in this system; the step comprises a hierarchy training step and a document classification step; hierarchy training is the preprocessing for document classification: before classification, the class hierarchy is trained; training collects a set of features (feature words) from the training documents and then assigns feature weights to each node (class) in the hierarchy; in the document classification algorithm, the feature weights are used to compute class ranks for a new document; in the document classification step, after the hierarchy has been trained, a document can be classified into a class; classification starts from the root class, and all child classes of the root are assigned ranks computed by
R_cd = Σ_f N_fd · W_fc
where c is a class, d is a document, f is a feature occurring in d, R_cd is the rank of c, N_fd is the number of times f occurs in d, and W_fc is the weight of f in class c; if the ranks of all child classes are zero or negative, d remains in the root class; if one child class has the definite, positive, largest rank, that class is selected; if the selected class is a leaf class, document d is assigned to it; if the selected class is not a leaf class, the computation continues among that class's child classes; thus document d can be assigned to a leaf class or an internal class.
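The ranking rule of claim 1, R_cd = Σ_f N_fd · W_fc, can be sketched as follows. This is an illustrative reconstruction only; the class and method names (CategoryRanker, rank) are not from the patent.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the claim-1 ranking rule R_cd = sum_f N_fd * W_fc.
public class CategoryRanker {

    // termCounts: N_fd, occurrences of each feature f in document d.
    // weights:    W_fc, weight of each feature f in category c.
    public static double rank(Map<String, Integer> termCounts,
                              Map<String, Double> weights) {
        double r = 0.0;
        for (Map.Entry<String, Integer> e : termCounts.entrySet()) {
            Double w = weights.get(e.getKey());
            if (w != null) {
                r += e.getValue() * w;   // N_fd * W_fc
            }
        }
        return r;
    }

    public static void main(String[] args) {
        Map<String, Integer> doc = new HashMap<>();
        doc.put("stock", 3);
        doc.put("bank", 1);

        Map<String, Double> financeWeights = new HashMap<>();
        financeWeights.put("stock", 0.8);
        financeWeights.put("bank", 0.5);

        // R = 3*0.8 + 1*0.5 = 2.9; a positive, largest rank among the
        // children means the document descends into that child class.
        System.out.println(rank(doc, financeWeights));
    }
}
```

Per claim 1, the document descends into the child class with the largest positive rank, recursing until it reaches a leaf class or until all child ranks are zero or negative.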
2. The method for extracting and processing network information of claim 1, wherein the news download step further comprises a management step that manages the locally stored news data, such as deletion and updating.
3. The method for extracting and processing network information of claim 1, wherein the method further comprises a news query step, comprising: a submission step in which the user submits query conditions; a search step that searches the index and obtains a result set; and a return step that returns the results to the user.
4. The method for extracting and processing network information of claim 1, wherein the method further comprises a logging and transaction-processing step: even if a run is terminated halfway, the next run can still restore the previous index results and resume indexing from the point of failure, and work such as downloading and summarization is recorded; when analyzing URLs, the download thread's URL analysis module first reads the count file and loads the two most recent log files to decide whether a page has already been downloaded; whenever a news content page is downloaded, its URL is stored in the newest log file; during indexing, the index position information is read first, then the log-file information for the files that must be indexed; the corresponding content files are then indexed and the index position information in the index log file is updated; during summarization, the summary position information is read first, then the log-file information for the files that must be summarized; the corresponding content files are then summarized and the summary position information in the summary log file is updated; each time a file's source code is downloaded, its content extracted, its summary completed, or its indexing completed, that work is recorded; the three threads for downloading, summarizing, and indexing never stop: even when a task has finished (for example, summarization is complete), the summary log file is reloaded and summarization begins again.
5. The method for extracting and processing network information of claim 1, wherein the method further comprises a management step, which mainly implements local data management, category management, news-source management, index updates after data deletion, log updates, and the like.
6. The method for extracting and processing network information of claim 1, wherein the automatic summarization step may be an independent step; the only API it needs to expose externally is getAbstraction, with the prototype public String getAbstraction(String FileName, boolean FileMode, int Ratio); the FileName parameter is interpreted according to FileMode: if FileMode = true, FileName is a file name, otherwise it is the text of the document to be summarized; FileMode is the mode parameter; Ratio is the extraction ratio, and only integers between 0 and 100 are allowed.
7. The method for extracting and processing network information of claim 1, wherein the full-text indexing step may be an independent step, and the only interface parameter it needs is a file name.
8. The method for extracting and processing network information of claim 1, wherein the news download step uses a token analysis method for HTML: making full use of Java's object-oriented approach, each HTML source file is treated as an object, and a class named token is defined; a token describes a meaningful string in the HTML, and a urltoken class derived from token describes tokens whose features match the URL format; during HTML source analysis, each file is treated as an object, and every HTML tag, as well as every string between HTML tags, is treated as a string; each token has the attributes:
    String tokenstr = null;    // the token's string value
    int tokenloc = 0;          // the token's position in the original file
    int gbnum = 0;             // the number of Chinese characters in the token
    boolean iskeentag = false; // whether the token is entirely content-intimate
    float keenvalue = 0;       // the token's degree of intimacy with the content
and token has one distinctive method:
    public boolean ishref() {
        String flag1 = "href=";
        int flag2 = -1;
        if (tokenstr.indexOf(flag1) == flag2)
            return false;
        else
            return true;
    }
which decides whether the token is a URL HTML tag; URL analysis is implemented mainly by two classes, urlanalyse.class and contentanalyse.class, which implement the analysis of the token stream; the main analysis method is geturl(String filename) in urlanalyse.class, which first converts the source code into a token stream and reads it in, then adds each URL-formatted token, together with the token after that URL whose gbnum is nonzero, into a cached HashMap; in general, the token after a URL whose gbnum is nonzero is the news headline.
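The token analysis of claim 8 can be sketched as below: a token records its string and its Chinese-character count (gbnum), ishref() tests for "href=", and each URL token is paired with the next token whose gbnum is nonzero, taken as the headline. The parsing is deliberately simplified and the class shape is an assumption, not the patent's actual urlanalyse implementation.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified sketch of the claim-8 token idea (names mirror the claim:
// tokenstr, gbnum, ishref, geturl). Not the patent's real parser.
public class UrlAnalyse {

    static class Token {
        String tokenstr;
        int gbnum;   // number of CJK characters in the token

        Token(String s) {
            tokenstr = s;
            for (char ch : s.toCharArray()) {
                if (ch >= 0x4E00 && ch <= 0x9FFF) gbnum++;
            }
        }

        // claim 8: a token is a URL tag if it contains "href="
        boolean ishref() {
            return tokenstr.indexOf("href=") != -1;
        }
    }

    // Pair each href token with the next token having gbnum > 0,
    // which the claim treats as the news headline.
    static Map<String, String> geturl(List<Token> stream) {
        Map<String, String> urlToTitle = new LinkedHashMap<>();
        for (int i = 0; i < stream.size(); i++) {
            if (stream.get(i).ishref()) {
                for (int j = i + 1; j < stream.size(); j++) {
                    if (stream.get(j).gbnum > 0) {
                        urlToTitle.put(stream.get(i).tokenstr,
                                       stream.get(j).tokenstr);
                        break;
                    }
                }
            }
        }
        return urlToTitle;
    }

    public static void main(String[] args) {
        List<Token> stream = new ArrayList<>();
        stream.add(new Token("<a href=\"http://news.example/1.html\">"));
        stream.add(new Token("今日新闻"));
        stream.add(new Token("</a>"));
        System.out.println(geturl(stream));
    }
}
```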
9. The method for extracting and processing network information of claim 1, wherein word segmentation in the automatic summarization step uses a "dictionary-free" segmentation method based on word frequency and word weight; the formula scoring how likely a string w is to be a word is:
P(w) = F(w) · L(w)^c  when F(w) > minFreq and L(w) > minLen, and P(w) = 0 otherwise;
minFreq is the preset minimum frequency of occurrence for a word, usually ≥ 2, and demotes strings that are not words; minLen is the preset minimum word length, usually ≥ 1, and keeps low-frequency words from being split; c is a preset constant, usually ≥ 4, and keeps long words from being split; the procedure is as follows: treat the whole text as one string, enumerate its substrings from the beginning, score every substring, and take the highest-scoring substrings as words (at the cost of many useless scans); the system takes one string at a time and uses all files as the background.
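A minimal sketch of the claim-9 word score, assuming P(w) = 0 when the thresholds are not met (the claim does not spell out the fallback value):

```java
// Sketch of the "dictionary-free" word score of claim 9:
// P(w) = F(w) * L(w)^c when F(w) > minFreq and L(w) > minLen, else 0.
public class WordScore {

    static double score(int freq, int len, int minFreq, int minLen, int c) {
        if (freq > minFreq && len > minLen) {
            return freq * Math.pow(len, c);
        }
        return 0.0;  // assumed fallback for strings that are not words
    }

    public static void main(String[] args) {
        // typical presets from the claim: minFreq >= 2, minLen >= 1, c >= 4
        System.out.println(score(5, 2, 2, 1, 4));  // 5 * 2^4 = 80
        System.out.println(score(1, 2, 2, 1, 4));  // below minFreq -> 0
    }
}
```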
10. The method for extracting and processing network information of claim 1, wherein feature-word extraction in the automatic summarization step is computed from the word's frequency and its frequency relative to the background knowledge base:
P(w) = F(w) · (numdoc / advnumdoc) · (L(w) − D)^c
where F(w) is the frequency with which the word occurs, L(w) is the word's length, numdoc is the number of occurrences of the word in the present document, advnumdoc is the average number of occurrences across all documents, and D is the preset minimum word length.
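The claim-10 feature-word score translates directly to code; the parameter values in the example (D = 1, c = 4) are assumptions for illustration only.

```java
// Sketch of the claim-10 feature-word score against a background corpus:
// P(w) = F(w) * (numdoc / advnumdoc) * (L(w) - D)^c
public class FeatureWordScore {

    static double score(double freq, double numdoc, double advnumdoc,
                        int len, int d, int c) {
        return freq * (numdoc / advnumdoc) * Math.pow(len - d, c);
    }

    public static void main(String[] args) {
        // a word of length 3 with frequency 4, occurring 6 times in this
        // document vs. 2 times on average in the background; D = 1, c = 4
        System.out.println(score(4, 6, 2, 3, 1, 4));  // 4 * 3 * 2^4 = 192
    }
}
```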
11. The method for extracting and processing network information of claim 1, wherein in the automatic summarization step the relation between sentence importance and summary generation is:
T(s) = (Σ_i T_i / S0) · S1 · S2 · m
and each sentence's weight is computed by this formula; T_i is the weight of a word making up the sentence, S0 is the total number of words in the sentence, S1 is the number of clauses in the sentence, S2 is the number of numerals, and m is an integer constant, usually 1.
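The claim-11 formula is garbled in the source text; the sketch below assumes one plausible reading, T(s) = (Σ_i T_i / S0) · S1 · S2 · m, i.e. the average word weight scaled by clause count, numeral count, and the constant m. This reading is an assumption, not a confirmed reconstruction.

```java
// Assumed reconstruction of the claim-11 sentence weight:
// T(s) = (sum of word weights T_i / S0) * S1 * S2 * m
public class SentenceWeight {

    static double weight(double[] wordWeights, int s1, int s2, int m) {
        double sum = 0.0;
        for (double t : wordWeights) sum += t;
        int s0 = wordWeights.length;   // S0: total words in the sentence
        return (sum / s0) * s1 * s2 * m;
    }

    public static void main(String[] args) {
        double[] t = {0.2, 0.4, 0.6};            // per-word weights T_i
        System.out.println(weight(t, 2, 1, 1));  // (1.2/3) * 2 * 1 * 1 = 0.8
    }
}
```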
12. The method for extracting and processing network information of claim 1, wherein the hierarchy training step comprises four steps:
1) collecting feature words from the leaf classes: in the hierarchy, among the feature words of each leaf class's training documents (news), only those that occur more than twice in a single training document or more than ten times in the training document set are collected; these words ultimately appear in the summaries; the collected feature words represent the features of the leaf class; when a leaf class belongs to a training document set, its parent class must contain the leaf class's features; the features of a non-leaf class include all features of its child nodes and the sum of each feature's frequency of occurrence across all child nodes;
2) a hierarchy optimization step: optimization resolves the competition between a class node and its parent class, because a document (news item) can be assigned to only one class in the hierarchical organization of classes; when classes compete, the algorithm should decide the appropriate class for the document; this comprises: a collection step that collects all the features in a class; a feature-judgment step that checks whether a feature's frequency in the parent is greater than in this class, proceeding to the next step if so and doing nothing otherwise; a descendant-lookup step that examines the descendants' feature lists and finds the descendants with the highest and lowest frequencies for the feature; a ratio-judgment step that checks whether the ratio of the difference between the highest and lowest frequencies to the highest frequency exceeds a threshold, proceeding to the next step if so and otherwise deleting the feature from all descendants, so that only the parent retains it; and a deletion step that removes the feature from the descendants, except the descendant in which the feature has its highest frequency;
3) a class feature-weight assignment step: a weight is assigned to each feature of a class, and a higher weight means the feature is more important to the class; all features in each class are assigned weights defined by
W_fc = λ + (1 − λ) × N_fc / M_c
where f ranges over the existing features, c is the class, W_fc is the weight assigned to the feature, λ is a parameter currently set to 0.4, N_fc is the number of occurrences of f in c, and M_c is the maximum frequency of any feature in c; when a feature occurs only in sibling classes and not in c itself, it is assigned a negative weight and added to c's feature list, the negative weight being defined by
W_fc = −(λ + (1 − λ) × N_fp / M_p)
where f ranges over the existing features, c is the class, W_fc is the weight assigned to the feature, λ is a parameter currently set to 0.4, N_fp is the number of occurrences of f in c's parent class, and M_p is the maximum frequency of any feature in c's parent class;
4) filtering each class's feature list: each class's feature list is filtered so that only the top 200 positive features and the top 200 negative features are retained in the class's final feature list, whether the class is a parent class or a leaf class; the other features are discarded. Limiting the number of features reduces the computational complexity of classifying a document.
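The weight assignment in step 3 of claim 12 can be sketched as follows, with λ = 0.4 as the claim states; the class and method names are illustrative.

```java
// Sketch of the claim-12 feature-weight assignment:
// positive: W_fc =  lambda + (1 - lambda) * N_fc / M_c
// negative: W_fc = -(lambda + (1 - lambda) * N_fp / M_p)
public class FeatureWeight {

    static final double LAMBDA = 0.4;   // the claim sets lambda to 0.4

    // feature f occurs in category c itself
    static double positive(int nfc, int mc) {
        return LAMBDA + (1 - LAMBDA) * (double) nfc / mc;
    }

    // feature f occurs only in sibling categories: use the parent's counts
    static double negative(int nfp, int mp) {
        return -(LAMBDA + (1 - LAMBDA) * (double) nfp / mp);
    }

    public static void main(String[] args) {
        System.out.println(positive(10, 10));  // most frequent feature: 1.0
        System.out.println(positive(5, 10));   // 0.4 + 0.6 * 0.5 = 0.7
        System.out.println(negative(5, 10));   // -0.7
    }
}
```

Weights are bounded between λ and 1 for positive features, so even a rare feature of a class still contributes; the negative weights penalize classes whose siblings own a feature.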
13. A system for extracting and processing network information, characterized by comprising the following means:
I. News download means, comprising: URL analysis means: the system specifies certain URLs, and the program automatically derives the final news-content URLs from them without building a site-specific URL module for each news website; using URL statistics and URL correlation analysis, pages containing links to final news content are analyzed statistically to find the useful final URLs; automatic news-page crawling means: downloading all link pages at the target address that conform to the URL format; noise-filtering means: filtering the downloaded news content pages to remove HTML tags and useless Chinese text, finally yielding Chinese text vectors; information extraction means: extracting information from the Chinese vectors obtained above, where an early stage extracts the title and body, and a later stage performs feature extraction, correlation analysis, document classification, duplicate elimination, and so on for the web news content;
II. Automatic summarization means: performing word segmentation, feature-word analysis, sentence-importance analysis, summary generation, and summary output;
III. Full-text indexing means: building a full-text index over all news content files that have been downloaded and content-extracted, comprising: pass-in means that pass in the next file name; index-check means that determine whether the file has already been indexed, returning to the pass-in means if so and proceeding to the next step otherwise; filtering means that filter out all noise and meaningless words; dictionary-matching segmentation means that perform dictionary-based segmentation; n-gram segmentation means that perform n-gram segmentation to catch words the dictionary segmentation failed to separate; and update means that, for every word, update the relevant index files, including the keyword, date, and category indexes;
IV. Hierarchical text classification means: assigning a new document to one class within a given class hierarchy; each document can be assigned to only one class; each class in the hierarchy is associated with many words and terms carrying larger weights; a given term sits at one level of the hierarchy while stopwords sit at another; the feature words of the extracted documents (financial news) serve as the terms and vocabulary in this system; comprising hierarchy training means and document classification means; the hierarchy training means perform the preprocessing for document classification: before classification, the class hierarchy is trained; training collects a set of features (feature words) from the training documents and then assigns feature weights to each node (class) in the hierarchy; in the document classification algorithm, the feature weights are used to compute class ranks for a new document; the document classification means classify a document into a class after the hierarchy has been trained; classification starts from the root class, and all child classes of the root are assigned ranks computed by
R_cd = Σ_f N_fd · W_fc
where c is a class, d is a document, f is a feature occurring in d, R_cd is the rank of c, N_fd is the number of times f occurs in d, and W_fc is the weight of f in class c; if the ranks of all child classes are zero or negative, d remains in the root class; if one child class has the definite, positive, largest rank, that class is selected; if the selected class is a leaf class, document d is assigned to it; if the selected class is not a leaf class, the computation continues among that class's child classes; thus document d can be assigned to a leaf class or an internal class.
14. The system for extracting and processing network information of claim 13, wherein the news download means further comprise management means that manage the locally stored news data, such as deletion and updating.
15. The system for extracting and processing network information of claim 13, wherein the system further comprises news query means, comprising: submission means by which the user submits query conditions; search means that search the index and obtain a result set; and return means that return the results to the user.
16. The system for extracting and processing network information of claim 13, wherein the system further comprises logging and transaction-processing means: even if a run is terminated halfway, the next run can still restore the previous index results and resume indexing from the point of failure, and work such as downloading and summarization is recorded; when analyzing URLs, the download thread's URL analysis module first reads the count file and loads the two most recent log files to decide whether a page has already been downloaded; whenever a news content page is downloaded, its URL is stored in the newest log file; during indexing, the index position information is read first, then the log-file information for the files that must be indexed; the corresponding content files are then indexed and the index position information in the index log file is updated; during summarization, the summary position information is read first, then the log-file information for the files that must be summarized; the corresponding content files are then summarized and the summary position information in the summary log file is updated; each time a file's source code is downloaded, its content extracted, its summary completed, or its indexing completed, that work is recorded; the three threads for downloading, summarizing, and indexing never stop: even when a task has finished (for example, summarization is complete), the summary log file is reloaded and summarization begins again.
17. The system for extracting and processing network information of claim 13, wherein the system further comprises management means, which mainly implement local data management, category management, news-source management, index updates after data deletion, log updates, and the like.
18. The system for extracting and processing network information of claim 13, wherein the automatic summarization means may be an independent device; the only API it needs to expose externally is getAbstraction, with the prototype public String getAbstraction(String FileName, boolean FileMode, int Ratio); the FileName parameter is interpreted according to FileMode: if FileMode = true, FileName is a file name, otherwise it is the text of the document to be summarized; FileMode is the mode parameter; Ratio is the extraction ratio, and only integers between 0 and 100 are allowed.
19. The system for extracting and processing network information of claim 13, wherein the full-text indexing means may be an independent device, and the only interface parameter it needs is a file name.
20. The system for extracting and processing network information of claim 13, wherein the news download means use a token analysis method for HTML: making full use of Java's object-oriented approach, each HTML source file is treated as an object, and a class named token is defined; a token describes a meaningful string in the HTML, and a urltoken class derived from token describes tokens whose features match the URL format; during HTML source analysis, each file is treated as an object, and every HTML tag, as well as every string between HTML tags, is treated as a string; each token has the attributes:
    String tokenstr = null;    // the token's string value
    int tokenloc = 0;          // the token's position in the original file
    int gbnum = 0;             // the number of Chinese characters in the token
    boolean iskeentag = false; // whether the token is entirely content-intimate
    float keenvalue = 0;       // the token's degree of intimacy with the content
and token has one distinctive method:
    public boolean ishref() {
        String flag1 = "href=";
        int flag2 = -1;
        if (tokenstr.indexOf(flag1) == flag2)
            return false;
        else
            return true;
    }
which decides whether the token is a URL HTML tag; URL analysis is implemented mainly by two classes, urlanalyse.class and contentanalyse.class, which implement the analysis of the token stream; the main analysis method is geturl(String filename) in urlanalyse.class, which first converts the source code into a token stream and reads it in, then adds each URL-formatted token, together with the token after that URL whose gbnum is nonzero, into a cached HashMap; in general, the token after a URL whose gbnum is nonzero is the news headline.
21. The system for extracting and processing network information of claim 13, wherein word segmentation in the automatic summarization means uses a "dictionary-free" segmentation method based on word frequency, where:
P(w) = F(w) · L(w)^c  when F(w) > minFreq and L(w) > minLen, and P(w) = 0 otherwise;
minFreq is the preset minimum frequency of occurrence for a word, usually ≥ 2, and demotes strings that are not words; minLen is the preset minimum word length, usually ≥ 1, and keeps low-frequency words from being split; c is a preset constant, usually ≥ 4, and keeps long words from being split; the procedure is as follows: treat the whole text as one string, enumerate its substrings from the beginning, score every substring, and take the highest-scoring substrings as words (at the cost of many useless scans); the system takes one string at a time and uses all files as the background.
22. The network information extraction and processing system of claim 13, wherein the automatic summary generation means extracts feature words based on word frequency relative to a background knowledge base: P(w) = F(w)·(numdoc/advnumdoc)·(L(w) − D)^c, where F(w) is the frequency of the word, L(w) is the length of the word, numdoc is the number of occurrences of the word in the current document, advnumdoc is its average number of occurrences across all documents, and D is the preset minimum word length.
23. The network information extraction and processing system of claim 13, wherein in the automatic summary generation means the importance of a sentence, and hence the generated summary, is determined by computing for each sentence the weight T(s) = (ΣTi/s0)·s1·s2·m, where Ti are the weights of the words making up the sentence, s0 is the total number of words in the sentence, s1 is the number of clauses in the sentence, s2 is the number of numerals, and m is an integer constant, usually 1.
24. The network information extraction and processing system of claim 13, wherein the hierarchy training means comprises four means:
1) a collection means: collects feature words from the leaf classes; in the hierarchy, among the feature words of the training documents (news) of each leaf class, only those that occur at least twice in a single training document, or at least ten times in the whole training document set, are collected, and these words also appear in the summaries; the collected feature words represent the characteristics of the leaf class; when a leaf class belongs to a given training document set, its parent class must contain the features of that leaf class, so the features of a non-leaf class comprise all features of its child nodes together with the sum of each feature's frequencies over all child nodes;
2) a hierarchy optimization means: optimization resolves the competition between a category and its parent category, because a document (news item) can be assigned to only one category in the hierarchical organization; when categories compete, the algorithm must decide the appropriate category for the document; the means comprises: a gathering means, which gathers all the features in a category; a feature judgment means, which judges whether the frequency of a feature in the parent is greater than in this category, and if so passes to the next means, otherwise does nothing; a descendant-checking means, which inspects the feature lists of the descendants and finds the highest and lowest frequencies of the feature among them; a ratio judgment means, which judges whether the ratio of the difference between the highest and lowest frequencies to the highest frequency exceeds a threshold, and if so passes to the next means, otherwise deletes the feature from all descendants so that only the parent retains it; and a deletion means, which deletes the feature from every descendant except the descendant in which it has the highest frequency;
3) a category feature weight assignment means: assigns a weight to each feature of a category, a higher weight meaning the feature is more important to the category; in each category all features are assigned weights defined by Wfc = λ + (1 − λ)×Nfc/Mc, where f ranges over the existing features, c is the category, Wfc is the weight assigned to the feature, λ is a parameter currently set to 0.4, Nfc is the number of occurrences of f in c, and Mc is the maximum frequency of any feature in c; when a feature occurs only in sibling categories but not in c itself, it is assigned a negative weight and is added to the feature list of c, the negative weight being defined by Wfc = −(λ + (1 − λ)×Nfp/Mp), where Nfp is the number of occurrences of f in the parent category of c and Mp is the maximum frequency of any feature in the parent category of c;
4) a filtering means: filters the feature list of each category; only the top 200 positive features and the top 200 negative features are retained in the final feature list of a category, whether it is a parent category or a leaf category, and all other features are discarded; limiting the number of features reduces the computational complexity of classifying a document.
Description
Method and system for network information extraction and processing

Technical Field

The present invention relates to a data processing method and system, and more particularly to a method and system for extracting and processing various kinds of information on a computer network, especially online news.

Background Art

Today is an era of information explosion; with the rapid development of the Internet, more and more people obtain the latest news and information through the network.

Nowadays almost everyone has the habit of reading newspapers, especially individuals and enterprises with urgent information needs, who must gather the information they need from many newspapers. Almost all news can now be found online, and many people already get the latest news through the Internet. However, merely reading news online does not reduce the time we need: we still have to read through a long article to learn what it describes, or browse many pages before obtaining the information we need. Moreover, online news disappears quickly; many people need to query news from several days earlier, or even from months or a year before. In such cases, browsing the web alone can no longer meet our requirements.

Traditional statistics-based automatic summarization methods generally use mathematical statistics to assign a weight to every word in a document; the weight is usually computed from the word's frequency of occurrence in the article. Words that occur more frequently receive higher weights, and a word with a high weight is taken to be central to the article.

Sentence weights are likewise derived from word weights: once every word has been weighted, the weight of each sentence can be computed, and sentences with higher weights better represent the central idea of the article. The highest-weighted sentences can be used directly to produce a summary.

This method generates summaries quickly, but since high-frequency words are not necessarily the central idea of the article, and no syntactic analysis is performed, the readability of a summary pieced together from high-weight sentences is rather poor.

However, by improving the weighting method and the selection of central sentences, reasonably acceptable results can be achieved.

Automatic Chinese word segmentation is a necessary step in building a full-text index. Segmentation means dividing a sentence or an article into individual words. Unlike English, Chinese has no explicit segmentation markers. Words vary in length, their definitions differ, and polysemy and synonymy are common, so automatic Chinese segmentation is very difficult.

The more popular segmentation methods at present include the following. Forward maximum matching: the earliest proposed segmentation method; each time, the longest forward candidate (for example, 6 characters) is matched against the dictionary; if the match succeeds, segmentation continues from the next position, otherwise the last character is dropped and matching continues.
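A minimal sketch of forward maximum matching, using a toy dictionary of letter strings in place of Chinese words; the 6-character window follows the text, and the dictionary contents are illustrative:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ForwardMaxMatch {
    // Maximum candidate length, as in the text (6 characters).
    static final int MAX_LEN = 6;

    // Segment text by repeatedly matching the longest dictionary word
    // starting at the current position; fall back to a single character
    // when no dictionary word matches.
    static List<String> segment(String text, Set<String> dict) {
        List<String> words = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int end = Math.min(pos + MAX_LEN, text.length());
            // Shrink the window until a dictionary word is found.
            while (end > pos + 1 && !dict.contains(text.substring(pos, end))) {
                end--;
            }
            words.add(text.substring(pos, end));
            pos = end;
        }
        return words;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>();
        dict.add("net");
        dict.add("work");
        dict.add("news");
        System.out.println(segment("networknews", dict)); // [net, work, news]
    }
}
```

The same loop works over Chinese text, where each char is one Chinese character; the method's known weakness with ambiguous divisions is exactly what the approaches below try to address.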

High-frequency-first method: this method is based on word-frequency statistics, the combinational behavior of adjacent characters, and the phenomenon of ambiguous divisions. It improves segmentation efficiency, but it can do nothing about ambiguity, and the error rate is not reduced.

Neural-network segmentation: works by imitating the brain's parallel, distributed processing and building a numerical model. Segmentation knowledge is stored implicitly and in distributed form inside the neural network, and the internal weights are modified through self-learning and training to achieve better segmentation results.

Expert-system segmentation: this method separates segmentation knowledge (both common-sense segmentation knowledge and the heuristic knowledge for disambiguation, i.e. ambiguity segmentation rules) from the inference engine that carries out the segmentation, so that maintenance of the knowledge base and implementation of the inference engine are independent of each other. It can also detect overlapping-ambiguity and combinational-ambiguity fields and has a certain self-learning capability.

Current full-text indexes generally use an inverted file as the index mechanism; the inverted file stores, for each term, the list of document numbers in which it occurs.

For text retrieval, the most efficient index structure is the inverted file: it is a collection of lists, with one record per term t, and the record lists the identifiers of all documents d containing that term.

The inverted file can be viewed as the transpose of the document-term frequency matrix, converting (d, t) into (t, d), because row-major access is more efficient than column-major access.

The index consists of three parts: the dictionary (invf.dict), the inverted file (invf), and the mapping file between the two (invf.idx). The index file structure is shown in Figure 2.

In the dictionary (invf.dict): for each distinct term t, the term string t, the total number of documents containing t, f_t, and the total number of occurrences of t in the whole document collection, F_t, are stored.

In the mapping file (invf.idx): for each distinct term t, a pointer to the start address of the corresponding inverted list is stored.

In the inverted file (invf): for each distinct term t, the identifier d (a sequential number) of every document containing t and the frequency fd,t of t in each document d are stored as a list of <d, fd,t> pairs.

Together with a document-weight array Wd, this suffices to support both Boolean queries and ranked queries.
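A minimal in-memory sketch of the inverted-file idea above: term → list of <d, fd,t> postings, from which f_t and F_t are derivable, plus a Boolean AND query. The on-disk dictionary and mapping files are replaced by a HashMap for illustration:

```java
import java.util.*;

public class InvertedIndex {
    // Term -> list of {docId, frequency} postings, i.e. the (t, d) view
    // obtained by transposing the document-term matrix.
    private final Map<String, List<int[]>> postings = new HashMap<>();

    // Add one document, given its id and its already-segmented terms.
    public void addDocument(int docId, List<String> terms) {
        Map<String, Integer> freq = new HashMap<>();
        for (String t : terms) {
            freq.merge(t, 1, Integer::sum);
        }
        for (Map.Entry<String, Integer> e : freq.entrySet()) {
            postings.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                    .add(new int[]{docId, e.getValue()});
        }
    }

    // f_t: the number of documents containing term t.
    public int docFrequency(String t) {
        return postings.getOrDefault(t, Collections.emptyList()).size();
    }

    // F_t: total occurrences of t over the whole collection.
    public int collectionFrequency(String t) {
        int sum = 0;
        for (int[] p : postings.getOrDefault(t, Collections.emptyList())) {
            sum += p[1];
        }
        return sum;
    }

    // Boolean AND query: ids of documents containing every query term.
    public Set<Integer> and(String... terms) {
        Set<Integer> result = null;
        for (String t : terms) {
            Set<Integer> ids = new HashSet<>();
            for (int[] p : postings.getOrDefault(t, Collections.emptyList())) {
                ids.add(p[0]);
            }
            if (result == null) result = ids; else result.retainAll(ids);
        }
        return result == null ? Collections.emptySet() : result;
    }
}
```

A ranked query would additionally weight each posting's fd,t by the Wd array, which is omitted here.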

Summary of the Invention

The object of the present invention is to provide a method and system for network information extraction and processing that, using computer technology and natural language processing technology, can automatically download the latest daily news from designated sites, perform content extraction, classification, and automatic summarization to condense the full text, store the full text in the system, and index it so that efficient full-text retrieval can be performed later.

To achieve the above object, the technical solution of the present invention is as follows. A method for network information extraction and processing comprises the following steps. I. A news download step, comprising: a URL analysis step, in which the system specifies certain URLs and the program automatically derives from them the URLs of the final news content, without a site-specific URL module for each news website; using URL statistics and URL correlation analysis, a page containing links to the final news content is analyzed statistically to find the useful final URL addresses; an automatic news-page crawling step, in which all pages linked from the target address whose links match the URL format are downloaded; a garbage filtering step, in which the downloaded news pages are filtered to remove HTML tags and useless Chinese text, finally yielding Chinese vector information; and an information extraction step, in which information is extracted from the Chinese vectors obtained above; in the early stage the title and body are extracted, and in the later stage feature extraction, correlation analysis, document classification, duplicate elimination, and so on are performed on the web news content;

II. An automatic summary generation step: performing word segmentation, feature word analysis, sentence importance analysis, summary generation, and summary output. III. A full-text index generation step: building a full-text index over all news content files that have been downloaded and content-extracted, comprising: a pass-in step, passing in the next file name; an index judgment step, judging whether the file has already been indexed, returning to the pass-in step if so, otherwise proceeding; a filtering step, filtering out all garbage and meaningless words; a dictionary-matching segmentation step, performing dictionary-matched segmentation; an n-gram segmentation step, performing n-gram segmentation in case dictionary segmentation fails to separate some words; and an update step, updating the relevant index files, including the keyword, date, and category indexes, for every word. IV. A hierarchical text classification step: classifying a new document into one class of a given hierarchy of categories; each document can be assigned to only one class; each class in the hierarchy is associated with many words and terms with relatively large weights; a given term may appear at one level of the hierarchy while being a stopword at another.
The feature words of the excerpted documents (financial news) are used as the terms and vocabulary in this system. The step comprises a hierarchy training step and a document classification step. Hierarchy training is the preprocessing for document classification: before classification, the hierarchy of categories is trained. The function of training is to collect a set of features (feature words) from the training documents and then assign feature weights to every node (category) in the hierarchy; in the document classification algorithm, the feature weights are used to compute category ranks for a new document. In the document classification step, after the hierarchy has been trained, a document can be classified into a category. Classification starts from the root category, and all subcategories of the root are ranked by the equation Rcd = Σf Nfd·Wfc, where c is a category, d is a document, f is a feature occurring in d, Rcd is the rank of c, Nfd is the number of times f occurs in d, and Wfc is the weight of f in category c. If the ranks of all subcategories are zero or negative, d stays in the root category; if some subcategory has a definite positive maximum rank, that category is selected; if the selected category is a leaf category, d is assigned to it; if the selected category is not a leaf, the computation continues among its subcategories. Thus d can be assigned either to a leaf category or to an internal category.
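A minimal sketch of this top-down classification with Rcd = Σf Nfd·Wfc, using a toy two-category hierarchy; category names and weights are illustrative:

```java
import java.util.*;

public class HierarchicalClassifier {
    // A category node: feature weights Wfc and child categories.
    static class Category {
        final String name;
        final Map<String, Double> weights = new HashMap<>();
        final List<Category> children = new ArrayList<>();
        Category(String name) { this.name = name; }
    }

    // Rcd = sum over features f of Nfd * Wfc.
    static double rank(Category c, Map<String, Integer> docFreq) {
        double r = 0;
        for (Map.Entry<String, Integer> e : docFreq.entrySet()) {
            r += e.getValue() * c.weights.getOrDefault(e.getKey(), 0.0);
        }
        return r;
    }

    // Descend from the root: pick the child with the highest strictly
    // positive rank; stop at the current node if no child ranks positive.
    static Category classify(Category root, Map<String, Integer> docFreq) {
        Category current = root;
        while (!current.children.isEmpty()) {
            Category best = null;
            double bestRank = 0; // ranks must be positive to win
            for (Category child : current.children) {
                double r = rank(child, docFreq);
                if (r > bestRank) { bestRank = r; best = child; }
            }
            if (best == null) break; // all child ranks zero or negative
            current = best;
        }
        return current;
    }
}
```

A document whose feature counts favor a subcategory descends into it; otherwise it stays at the internal node, matching the "leaf or internal category" outcome described above.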

A system for network information extraction and processing comprises the following means. I. A news download means, comprising: a URL analysis means, in which the system specifies certain URLs and the program automatically derives from them the URLs of the final news content, without a site-specific URL module for each news website; using URL statistics and URL correlation analysis, a page containing links to the final news content is analyzed statistically to find the useful final URL addresses; an automatic news-page crawling means, which downloads all pages linked from the target address whose links match the URL format; and a garbage filtering means, which filters the downloaded news pages to remove HTML tags and useless Chinese text, finally yielding Chinese vector information;

An information extraction means: extracting information from the Chinese vectors obtained above; in the early stage the title and body are extracted, and in the later stage feature extraction, correlation analysis, document classification, duplicate elimination, and so on are performed on the web news content. II. An automatic summary generation means: performing word segmentation, feature word analysis, sentence importance analysis, summary generation, and summary output. III. A full-text index generation means: building a full-text index over all news content files that have been downloaded and content-extracted, comprising: a pass-in means, passing in the next file name; an index judgment means, judging whether the file has already been indexed, returning to the pass-in means if so, otherwise proceeding; a filtering means, filtering out all garbage and meaningless words; a dictionary-matching segmentation means, performing dictionary-matched segmentation; an n-gram segmentation means, performing n-gram segmentation in case dictionary segmentation fails to separate some words; and an update means, updating the relevant index files, including the keyword, date, and category indexes, for every word. IV. A hierarchical text classification means: classifying a new document into one class of a given hierarchy of categories; each document can be assigned to only one class; each class in the hierarchy is associated with many words and terms with relatively large weights; a given term may appear at one level of the hierarchy while being a stopword at another.
The feature words of the excerpted documents (financial news) are used as the terms and vocabulary in this system. The means comprises a hierarchy training means and a document classification means. The hierarchy training means performs the preprocessing for document classification: before classification, the hierarchy of categories is trained. The function of training is to collect a set of features (feature words) from the training documents and then assign feature weights to every node (category) in the hierarchy; in the document classification algorithm, the feature weights are used to compute category ranks for a new document. The document classification means operates after the hierarchy has been trained, when a document can be classified into a category. Classification starts from the root category, and all subcategories of the root are ranked by the equation Rcd = Σf Nfd·Wfc, where c is a category, d is a document, f is a feature occurring in d, Rcd is the rank of c, Nfd is the number of times f occurs in d, and Wfc is the weight of f in category c. If the ranks of all subcategories are zero or negative, d stays in the root category; if some subcategory has a definite positive maximum rank, that category is selected; if the selected category is a leaf category, d is assigned to it; if the selected category is not a leaf, the computation continues among its subcategories. Thus d can be assigned either to a leaf category or to an internal category.

With the above method and system, the latest news page source can be downloaded automatically every day from the designated sections of the specified web sites; the downloaded HTML code can be analyzed to obtain the valuable news content; the extracted content can be automatically summarized and condensed; the extracted content can be segmented and indexed for retrieval; and the extracted content can be automatically classified.

Brief Description of the Drawings

Figure 1 is a system structure diagram of an existing method and program for automatically downloading network information; Figure 2 is an index file structure diagram of an existing network information processing method;

Figure 3 is a flowchart of the news download step in the network information extraction and processing method of the present invention; Figure 4 shows the news list page of the People's Daily Online news center; Figure 5 is a flowchart of the method for obtaining the token stream; Figure 6 shows a China.com finance channel page; Figure 7 shows the page at http://www.chinahd.com/news/stock/2002-3/161628.htm; Figure 8 shows the source code of Figure 7; Figure 9 shows a news page from the China.com finance channel; Figure 10 shows the content information obtainable by content analysis of the news page of Figure 9; Figure 11 is a flowchart of the automatic summary generation method; Figure 12 is an analysis diagram of the automatic summary generation method; Figure 13 shows an example original text as stored; Figure 14 shows a summary automatically generated according to the present invention; Figure 15 is a flowchart of the full-text index generation step of the present invention; Figure 16 is a flowchart of the news query step of the present invention.

Detailed Description of the Embodiments

The present invention is described in further detail below with reference to the drawings and embodiments. We consider only the automatic download and content analysis process; instead of building a matching model for each website, we implemented a general algorithm for news-type sites that determines which part of a page is the news content from the frequency of Chinese text and the frequency and position of content-affine HTML tags. This is described in detail in the implementation below.

Because we need content of relatively high accuracy, from which information is extracted and delivered to the end user, we do not need the robot to perform deep recursive crawling. The concrete automatic download method is described below. For the sake of generality we do not rely on page-specific features of the text; instead we use automatic summarization based purely on content against a background corpus.

A method for network information extraction and processing comprises the following steps. I. The news download step: as shown in Figure 3, automatic news download is divided into two parts, URL analysis and source code crawling. Thanks to Java's strengths in network programming, we can open a connection to any resource on the web and obtain a stream, so that resources on the network can be operated on just like local files.

1. The URL analysis step: the system specifies certain URLs, and the program automatically derives the URLs of the final news content from them, without a site-specific URL module for each news website.

Using URL statistics and URL correlation analysis, a page containing links to the final news content is analyzed statistically to find the final URL addresses useful to us. For example, the program specifies a number of already-classified URLs; each such URL should be a news list page, i.e. clicking a news link on that page opens the news content page.

Take People's Daily Online as an example: this page is the news list page of its news center, as shown in Figure 4.

By analyzing this page we find that the final page URLs have the format http://www.people.com.cn/GB/guoii/25/96/20020312/*.html, which is saved in the corresponding final-URL format file.

We use a token-based analysis of HTML: making full use of Java's object orientation, we treat each HTML source file as an object and define a class named token; a token describes a meaningful string in the HTML, and a urltoken class derived from token describes tokens that match the URL format.

Thus during HTML source analysis we treat each file as an object, and we treat every HTML tag in the file, as well as every string between HTML tags, as a string.

Each token has the following attributes:
String tokenstr = null;    // the string value of this token
int tokenloc = 0;          // the position of this token in the original file
int gbnum = 0;             // the number of Chinese characters in this token
boolean iskeentag = false; // whether this is purely a content-affinity token
float keenvalue = 0;       // degree of affinity with the content

The token class also has a rather special method:

public boolean ishref() {
    String flag1 = "href=";
    int flag2 = -1;
    if (tokenstr.indexOf(flag1) == flag2)
        return false;
    else
        return true;
}

This method judges whether a token is a URL HTML tag. In fact, by applying object-oriented ideas to HTML source analysis and Java's stream idiom, we built a token stream, and the results proved this works well: 1. The program structure is very clear; the object-oriented design is plainly visible.

2. The analysis works well, with high accuracy.

3. There is no need to define site-specific stop markers or other special analysis rules for each website.

4. Any HTML code that follows the specification can be processed normally.

The method for obtaining the token stream is shown in Figure 5.
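A minimal sketch of such a tokenizer, producing a token stream of tags and inter-tag strings and counting the Chinese characters in each; the Unicode-range test and the splitting rule are simplified illustrations of the text's gbnum idea:

```java
import java.util.ArrayList;
import java.util.List;

public class HtmlTokenizer {
    // A token: one HTML tag, or one run of text between tags.
    static class Token {
        String tokenstr; // the string value of this token
        int tokenloc;    // position of this token in the source
        int gbnum;       // number of Chinese characters in this token

        Token(String s, int loc) {
            tokenstr = s;
            tokenloc = loc;
            for (int i = 0; i < s.length(); i++) {
                // CJK Unified Ideographs block covers common Chinese characters.
                if (s.charAt(i) >= 0x4E00 && s.charAt(i) <= 0x9FFF) gbnum++;
            }
        }

        // Whether this token is a URL html tag.
        boolean ishref() {
            return tokenstr.indexOf("href=") != -1;
        }
    }

    // Turn HTML source into a token stream: "<...>" tags and the text between them.
    static List<Token> tokenize(String html) {
        List<Token> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < html.length()) {
            int next;
            if (html.charAt(pos) == '<') {
                next = html.indexOf('>', pos);
                next = (next == -1) ? html.length() : next + 1;
            } else {
                next = html.indexOf('<', pos);
                if (next == -1) next = html.length();
            }
            String piece = html.substring(pos, next).trim();
            if (!piece.isEmpty()) tokens.add(new Token(piece, pos));
            pos = next;
        }
        return tokens;
    }
}
```

Run over a fragment like <a href="x.html">新闻标题</a>, this yields three tokens: an href tag, a text token with gbnum = 4, and a closing tag.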

For each news section of any site we define the following feature items: the category the section belongs to, such as politics, industry, or sports (these categories are also defined by the management module); the server address of the section, e.g. news.sina.com.cn; the current directory of the section (on a typical well-organized site, all news of a section lives under one directory); and the path attribute of the section's list page, i.e. absolute or relative path.

URL analysis is implemented mainly by two classes, urlanalyse.class and contentanalyse.class, which chiefly implement the analysis of the token stream.

The main analysis method: urlanalyse.class has a method geturl(String filename) that first converts the source code into a token stream and reads it in, then puts every URL token matching the format, together with the first following token whose gbnum is non-zero, into a cached HashMap. Under normal circumstances, the token after the URL whose gbnum is non-zero is the news headline.
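The pairing heuristic in geturl can be sketched as follows. This is a minimal illustration, not the actual urlanalyse.class: the `href=` token format and the CJK range used to count GB characters are simplifying assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified stand-in for the token stream: each token is a string; gbCount plays
// the role of the gbnum attribute.
public class UrlTitlePairing {
    public static int gbCount(String s) {
        int n = 0;
        for (char c : s.toCharArray())
            if (c >= 0x4E00 && c <= 0x9FFF) n++; // CJK Unified Ideographs (assumed range)
        return n;
    }

    // Walk the token stream; for each href token, take the next token with a
    // non-zero GB count as the headline, caching url -> headline pairs.
    public static Map<String, String> pair(String[] tokens) {
        Map<String, String> map = new LinkedHashMap<>();
        String pendingUrl = null;
        for (String t : tokens) {
            if (t.startsWith("href=")) {
                pendingUrl = t.substring(5);
            } else if (pendingUrl != null && gbCount(t) > 0) {
                map.put(pendingUrl, t); // first gbnum != 0 token after a URL: assumed headline
                pendingUrl = null;
            }
        }
        return map;
    }
}
```

A tag token such as `<b>` between the link and the headline has a GB count of zero, so it is skipped and the pairing still succeeds.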

For example, the China.com finance channel page is shown in Figure 6.

After URL analysis, we obtain the corresponding HashMap entries:

http://finance.china.com/zh_cn/news/financenews/10001254/20020506/10255883.html -> 600 billion invested over ten years: Chongqing to build an international metropolis
http://finance.china.com/zh_cn/news/financenews/10001254/20020506/10255882.html -> Losing 400-500 yuan per ton of oil: tax-control machines refuse fuel coupons and take only cash
http://finance.china.com/zh_cn/news/financenews/10001254/20020506/10255881.html -> Hong Kong tourism: a breath of spring in the economic recovery

Having obtained these, we proceed to automatic crawling, fetching the page source of every URL produced by the analysis.

2. Automatic news page crawling step: every time the program starts, we download every page linked from the target addresses whose URL matches the format. No information extraction or related analysis is performed during downloading, so as not to add load and slow the download. Pages that have already been downloaded are not downloaded again. Downloading must account for the influence of encoding factors such as GB and Big5.
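The bookkeeping this step needs (skip pages already downloaded; distinguish GB/Big5 before decoding) might look like the sketch below. The meta-tag charset sniffing is an assumption, since the patent does not say how encodings are detected.

```java
import java.nio.charset.Charset;
import java.util.HashSet;
import java.util.Set;

public class CrawlBookkeeping {
    private static final Set<String> downloaded = new HashSet<>();

    // Returns true the first time a URL is seen, i.e. when it still needs downloading.
    public static boolean claim(String url) {
        return downloaded.add(url);
    }

    // Rough charset sniffing over the page head; this meta-tag check is an
    // illustrative assumption, not the patent's detection method.
    public static Charset sniff(String headHtml) {
        String h = headHtml.toLowerCase();
        if (h.contains("charset=big5")) return Charset.forName("Big5");
        if (h.contains("charset=gb2312") || h.contains("charset=gbk")) return Charset.forName("GBK");
        return Charset.forName("UTF-8");
    }
}
```

In the real system the set of downloaded URLs would be persisted via the log files described in step VI, so restarts do not repeat work.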

3. Garbage filtering module: this step filters the fetched news content pages, removing the HTML tags and useless Chinese, and finally yields the Chinese vector information. It must run in a background thread while downloading proceeds. Later, weight information can be added to the resulting Chinese vector (weights are determined by the position where the text appears, the surrounding HTML tags, and so on, and require a certain number of documents for familiarization and training).
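A minimal sketch of the filtering, assuming tags and entities can be dropped with simple patterns; the tag-weight training mentioned above is omitted here.

```java
import java.util.ArrayList;
import java.util.List;

// Garbage filtering sketch: drop HTML tags and entities, keep runs of Chinese text
// as the "Chinese vector".
public class GarbageFilter {
    public static List<String> chineseVector(String html) {
        String noTags = html.replaceAll("<[^>]*>", " ");       // strip tags
        noTags = noTags.replaceAll("&[a-zA-Z]+;", " ");        // strip entities like &nbsp;
        List<String> vec = new ArrayList<>();
        for (String run : noTags.split("[^\\u4E00-\\u9FFF]+")) // keep Chinese runs only
            if (!run.isEmpty()) vec.add(run);
        return vec;
    }
}
```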

4. Information extraction module: performs information extraction on the Chinese vector obtained above. The first phase extracts the title and content; a later phase adds feature extraction, correlation analysis, document classification, duplicate removal, and so on for web news content. Generality and high accuracy must be guaranteed. The first-phase function can be implemented by simple means (e.g. counting how often the words of a**** appear in content b***c**d**). Which block is the content can be judged from the distance between sentences and the surrounding HTML tags (each tag carries a certain weight).

As shown in Figure 7 (source: http://www.chinahd.com/news/stock/2002-3/161628.htm), with its source code shown in Figure 8, the distances between content segments are very small, and the HTML tags between them are generally things like <p>, &nbsp;, <br> (paragraph, space, line break). We can locate the content by the distances and the particularity of the tags.

News content extraction differs from the traditional content extraction method: we do not construct a model for each website. In the program it is implemented mainly by contentanalyse.class, token.class, and related classes.

The concrete method is as follows: 1. first convert the file whose content is to be extracted into a concrete token stream; 2. score the token stream by content intimacy; 3. extract the contiguous run of tokens in which the GB character count is most concentrated and the intimacy is simultaneously highest; 4. if GB count and intimacy cannot both meet the above requirements, cancel directly.
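Steps 2 and 3 can be read as a maximum-sum contiguous-run problem over per-token scores. The sketch below uses Kadane's algorithm under that reading; how intimate vs. non-intimate tokens are scored is our assumption, not the patent's exact rule.

```java
// Content extraction sketch: pick the contiguous token run with the largest total
// score, where scores[i] would be the gbnum of token i if it is content-intimate and
// a negative penalty otherwise.
public class DenseRun {
    // Kadane's algorithm; returns {start, endExclusive} of the best-scoring run.
    public static int[] bestRun(int[] scores) {
        int best = 0, bestStart = 0, bestEnd = 0;
        int cur = 0, curStart = 0;
        for (int i = 0; i < scores.length; i++) {
            if (cur <= 0) { cur = 0; curStart = i; }   // restart the run
            cur += scores[i];
            if (cur > best) { best = cur; bestStart = curStart; bestEnd = i + 1; }
        }
        return new int[]{bestStart, bestEnd};
    }
}
```

If every score is non-positive (no sufficiently dense Chinese run), the returned range is empty, matching step 4's "cancel directly".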

For example, a china.com finance channel news page is shown in Figure 9.

After content analysis, since china.com's pages are fairly well-formed, we can generally achieve very high accuracy; the concrete test data are described in detail later.

The content information obtained by content analysis is shown in Figure 10.

At storage time, we store all five parts of the news (source, category, downloadtime, title, and content); these serve as the source for building the keyword, date, and other indexes, and also as the source for summaries.

5. Management step: manages the news data stored locally, e.g. deletion and updates.

II. Automatic summary generation step: first preprocess the original document, then perform word segmentation, feature-word analysis, sentence-importance analysis, and summary generation, and output the summary. The automatic summarization step can be an independent step; the only API it needs to expose to external callers is getAbstraction. Its interface prototype is

    public String getAbstraction(String FileName, boolean FileMode, int Ratio)

The FileName parameter is interpreted according to FileMode: if FileMode = true, FileName is a file name; otherwise it is the document text itself to be summarized. FileMode is the mode parameter; Ratio is the extraction ratio, and only integers between 0 and 100 are allowed. The automatic summary generation step is an independent step with its own log and transaction handling module; whether summarization has completed does not affect downloading or indexing.

The system flow of the automatic summarization system is shown in Figure 11.

Word segmentation uses a "dictionary-free" segmentation method based on word frequency. The idea of the algorithm is the same as the old one; only some improvements are made to speed up segmentation. Word weight is a measure of how likely a string is to be a word.

P(w) = F(w) * L(w)^c   when F(w) > minFreq and L(w) > minLen; otherwise P(w) = 0

minFreq is the preset minimum occurrence frequency of a word, usually >= 2; it suppresses strings that are not words. minLen is the preset minimum word length, usually >= 1. c is a preset constant, usually >= 4; it keeps long words from being split. Flow: treat the whole text as one string, enumerate substrings from the start, score every substring, and take the highest-scoring ones as words (many useless scans); the system takes one string value and uses all files as background, so considerable scanning time is spent.
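Taking the formula at face value, a candidate string would be scored as below; the threshold values are the "usually" figures from the text, used here as illustrative constants.

```java
public class WordScore {
    // "Usually" values from the text: minFreq >= 2, minLen >= 1, c >= 4.
    static final int MIN_FREQ = 2, MIN_LEN = 1, C = 4;

    // P(w) = F(w) * L(w)^c when F(w) > minFreq and L(w) > minLen, else 0.
    public static double score(int freq, int len) {
        if (freq <= MIN_FREQ || len <= MIN_LEN) return 0.0;
        return freq * Math.pow(len, C);
    }
}
```

The length term L(w)^c grows fast, which is what keeps long candidate words from being broken into shorter pieces with similar frequencies.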

Feature-word extraction: the basic idea is to base the statistics on word frequency, relative to the word frequencies of a background knowledge base.

Algorithm:

P(w) = F(w) * (numdoc / advnumdoc) * (L(w) - D)^2

where F(w) is the frequency of the word, L(w) is the word's length, numdoc is the number of times the word occurs in this document, advnumdoc is its average number of occurrences over all documents, and D is the preset minimum word length. There are two reasons for modifying the algorithm:

1. The original algorithm had to use a large background corpus (BWID), which made the system spend far more time and space; the new algorithm performs relative statistics based on occurrence counts within the corpus itself.

2. The new algorithm is also theoretically persuasive. Because a background corpus is broad, common words have high frequencies there, so for them numdoc/advnumdoc is roughly equal to one; whereas a feature word usually appears many times in this document but not nearly as often in the BWID, so on average numdoc/advnumdoc becomes large, and the feature word's weight is accordingly large. Details are shown in Figure 12.
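The modified feature-word score can be sketched directly from the formula; numdoc and advnumdoc are supplied by the caller, and the value of D is illustrative.

```java
public class FeatureScore {
    static final int D = 1; // illustrative preset minimum word length

    // P(w) = F(w) * (numdoc / advnumdoc) * (L(w) - D)^2
    public static double score(double freq, int len, double numdoc, double advnumdoc) {
        return freq * (numdoc / advnumdoc) * Math.pow(len - D, 2);
    }
}
```

For a common word, numdoc is close to advnumdoc and the ratio term is near 1; for a document-specific feature word the ratio is large, boosting the score as described above.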

Relation between sentence importance and summary generation:

T(s) = (Σ Ti) * S0 * S1 * S2 * m

The weight of every sentence is computed by this formula.

Ti is the weight of each word making up the sentence, S0 is the total number of words in the sentence, S1 is the number of clauses in the sentence, S2 is the number of numerals, and m is an integer constant, usually 1.
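Once every sentence has a weight, the Ratio parameter of getAbstraction suggests a selection rule like the following. Keeping the top Ratio percent of sentences in original order is our assumption; the patent specifies only the weighting formula and the 0-100 extraction ratio.

```java
import java.util.Arrays;

// Summary assembly sketch: given per-sentence weights, keep roughly the top Ratio
// percent of sentences, preserving the original order.
public class SummaryPick {
    public static boolean[] pick(double[] weights, int ratio) {
        int n = weights.length;
        int keep = Math.max(1, n * ratio / 100);      // at least one sentence
        Double[] sorted = new Double[n];
        for (int i = 0; i < n; i++) sorted[i] = weights[i];
        Arrays.sort(sorted, (a, b) -> Double.compare(b, a));
        double cutoff = sorted[keep - 1];             // weight of the keep-th sentence
        boolean[] mask = new boolean[n];
        int taken = 0;
        for (int i = 0; i < n && taken < keep; i++)
            if (weights[i] >= cutoff) { mask[i] = true; taken++; }
        return mask;
    }
}
```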

The stored original text is shown in Figure 13.

The article after summarization is shown in Figure 14.

III. Full-text index generation step: this step builds a full-text index over all news content files that have been downloaded and content-extracted; the indexing work runs in real time in the background. It too can be an independent step; the only interface parameter it needs is a file name.

The flow of the full-text index generation step is shown in Figure 15 and comprises the following steps: a pass-in step, passing in the next file name; an index-check step, checking whether the file has already been indexed (if so, return to the pass-in step, otherwise continue); a filtering step, filtering out all garbage and meaningless words; a dictionary-match segmentation step, performing dictionary-matching word segmentation; an ngram segmentation step, performing ngram segmentation in case dictionary segmentation failed to separate some words; and an update step, updating the relevant index files for every word, including the keyword, date, and category indexes.
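The two segmentation passes can be sketched as a forward-maximum dictionary match with a character-bigram fallback. The actual dictionaries, the choice of n for the ngram pass, and the index file layout are not specified at this level of detail, so those are assumptions here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Indexing-term sketch: forward-maximum dictionary matching, then character bigrams
// as the ngram fallback, so strings the dictionary misses are still indexed.
public class IndexTerms {
    public static List<String> terms(String text, Set<String> dict, int maxWordLen) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int matched = 0;
            for (int len = Math.min(maxWordLen, text.length() - i); len >= 2; len--)
                if (dict.contains(text.substring(i, i + len))) { matched = len; break; }
            if (matched > 0) {
                out.add(text.substring(i, i + matched)); // dictionary word
                i += matched;
            } else {
                if (i + 2 <= text.length())
                    out.add(text.substring(i, i + 2));   // bigram fallback
                i++;
            }
        }
        return out;
    }
}
```

Each returned term would then be posted to the keyword, date, and category index files in the update step.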

IV. Hierarchical text classification step: this is the step of classifying a new document into one class within a given hierarchy of categories. Each document can be assigned to only one class. Each class in the hierarchy is associated with many words and terms, and the classification algorithm itself is tuned repeatedly over the hierarchy; hence a given term may have a large weight at one level of the hierarchy while being a stopword at another. The feature words of the extracted documents (financial news) are used as the terms and vocabulary in this system.

It comprises two parts: a hierarchy training step and a document classification step. Hierarchy training is the preprocessing for document classification; before classifying, the category hierarchy is trained. 1. Hierarchy training: the function of training is to collect a set of features (feature words) from the training documents and then assign feature weights to every node (category) in the hierarchy. In the document classification algorithm, the feature weights are used to compute category ranks for a new document.

Training comprises four steps: 1) collect feature words from the leaf classes. In the hierarchy, for the training documents (news) of each leaf class, only feature words that appear more than twice in a single training document or more than 10 times in the training document set are collected; these words ultimately appear in the summary. The collected feature words represent the characteristics of the leaf class. When a leaf class belongs to some training document set, its parent class must contain the leaf class's features. The features of a non-leaf class comprise all features of its child nodes together with the sum of each feature's frequencies over all child nodes.

2) Hierarchy optimization step: optimization resolves the competition between a category node and its parent category. Because a document (news item) can be assigned to only one category in the hierarchical organization of categories, when categories compete, the algorithm must decide the appropriate category for the document.

It comprises the following steps: a collection step, collecting all features in a category; a feature-check step, checking whether the feature's frequency in the parent is larger than in this category (if so, go to the next step, otherwise do nothing); a descendant-check step, examining the feature lists of the descendants to find the highest and lowest frequencies of the feature among them; a ratio-check step, checking whether the ratio of the difference between the highest and lowest frequencies to the highest frequency exceeds the threshold value (if so, go to the next step; otherwise delete the feature from all descendants, so that only the parent retains it); and a deletion step, deleting the feature from the descendants except the descendant that has the feature's highest frequency.

The above method can find the common features that, for a parent category, its descendants all possess, together with their frequencies. When the feature's frequency does not carry over uniformly to the descendants (i.e. there is a large gap between the highest and lowest frequencies among the descendants), the common feature is deleted from all descendants except the one holding its highest frequency. Thus the features and frequencies of all leaf categories are propagated upward to the categories above them, excluding the root category; in the root category they take no part in any document rank computation.

The algorithm cannot directly delete a feature from the parent category while a subcategory retains it. This is because we may need the feature to route a document to the parent category; if the document cannot be routed to the parent category, it cannot be routed to the subcategory either. Hence disagreements at the lower levels (subcategories) are passed upward to the level above (the parent category).

3) Category feature weight assignment step: a weight is assigned to each feature of a category; a feature with a higher weight is more important to the category. All features in each category are assigned weights, defined by:

Wfc = λ + (1 - λ) × Nfc / Mc

where f ranges over the existing features, c is the category, Wfc is the weight assigned to the feature, λ is a parameter currently set to 0.4, Nfc is the number of times f occurs in c, and Mc is the maximum frequency of any feature in c.

When a feature appears only in sibling categories but not in c itself, it is assigned a negative weight, and features with negative weights are added to c's feature list. The negative weight is defined by the corresponding negated form:

Wfc = -(λ + (1 - λ) × Nfp / Mc)

where f ranges over the existing features, c is the category, Wfc is the weight assigned to the feature, λ is a parameter currently set to 0.4, Nfp is the number of times f occurs in c's parent category, and Mc is the maximum frequency of any feature in c's parent category.
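Both weight formulas can be sketched together; the negated form of the negative weight is our reading of the text, since the original equation image did not survive.

```java
// Category feature weighting sketch: Wfc = lambda + (1 - lambda) * Nfc / Mc for
// features present in category c; a feature seen only in sibling categories gets the
// negated counterpart computed from the parent's counts (assumed form).
public class FeatureWeight {
    static final double LAMBDA = 0.4;

    public static double positive(int nfc, int mc) {
        return LAMBDA + (1 - LAMBDA) * (double) nfc / mc;
    }

    public static double negative(int nfp, int mcParent) {
        return -(LAMBDA + (1 - LAMBDA) * (double) nfp / mcParent);
    }
}
```

Note that even a feature with Nfc = 0 occurrences keeps the floor weight λ, so presence in the category's feature list alone contributes; the most frequent feature reaches weight 1.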

4) Filter each category's feature list: each category's feature list is filtered. Only the top 200 positive features and the top 200 negative features are retained in the category's final feature list, whether it is a parent category or a leaf category; the other features are discarded. Limiting the number of features reduces the computational complexity of classifying a document.

2. Document classification method: after the hierarchy has been trained, a document can now be classified into a category. The document classification method starts from the root category. All subcategories of the root category are assigned ranks, computed by the following equation:

Rcd = Σf Nfd * Wfc

where c is a category, d is a document, f is a feature in d, Rcd is the rank of c, Nfd is the number of times f occurs in d, and Wfc is the weight of f in category c.

If the ranks of all subcategories are zero or negative, d is left in the root category. If some subcategory has the definite positive maximum rank, that category is selected. If the selected category is a leaf category, document d is assigned to it; if it is not a leaf category, the computation continues among that category's subcategories. Thus document d can be assigned to a leaf category or an internal category.
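The descent described above can be sketched as follows; the Category type and its fields are illustrative, not the patent's data structures.

```java
import java.util.List;
import java.util.Map;

// Classification sketch: rank children by Rcd = sum over features f of Nfd * Wfc and
// descend while some child has a positive rank; otherwise the document stays at the
// current (root or internal) category.
public class HierarchyClassifier {
    public static class Category {
        final String name;
        final Map<String, Double> weights;   // Wfc per feature
        final List<Category> children;
        public Category(String name, Map<String, Double> w, List<Category> ch) {
            this.name = name; this.weights = w; this.children = ch;
        }
    }

    static double rank(Category c, Map<String, Integer> docCounts) {
        double r = 0;
        for (Map.Entry<String, Integer> e : docCounts.entrySet())
            r += e.getValue() * c.weights.getOrDefault(e.getKey(), 0.0);
        return r;
    }

    public static String classify(Category root, Map<String, Integer> docCounts) {
        Category cur = root;
        while (!cur.children.isEmpty()) {
            Category best = null;
            double bestRank = 0;                 // only strictly positive ranks win
            for (Category ch : cur.children) {
                double r = rank(ch, docCounts);
                if (r > bestRank) { bestRank = r; best = ch; }
            }
            if (best == null) break;             // all child ranks zero or negative: stay here
            cur = best;
        }
        return cur.name;
    }
}
```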

V. News query step: as shown in Figure 16, it comprises the following steps: a submit step, in which the user submits query conditions; a search step, performing the search over the index to obtain the result set; and a return step, returning the results to the user.

The preceding steps only realize background automatic downloading, automatic summarization, and index building. The function realized by the news query subsystem is interaction with the user, letting the user issue news queries in the foreground, including news keyword queries, news category queries, news date queries, news source queries, and so on.

VI. Log and transaction handling step: a running program frequently encounters abnormal termination, such as a sudden crash or a sudden power failure.

In such cases we must guarantee the integrity of the background data. For instance, the index must remain complete: even if the program terminates halfway through execution, the next run can still restore the previous index results and resume indexing from the position of the failure.

Likewise, to avoid repeated work and to save time, the downloading and summarization work must also be recorded.

Log file system functions: 1. When the download thread's URL analysis module analyzes URLs, it first reads in the count file and loads the two most recent log files, to determine whether a page has already been downloaded.

2. Whenever a news content page is downloaded, its URL is stored in the newest log file.

3. During indexing, the index position information must be read in first, then the log file information that must be indexed; then the corresponding content files are indexed, while the index position information in the index log file is updated.

4. During summarization, the summary position information must be read in first, then the log file information that must be summarized; then the corresponding content files are summarized, while the summary position information in the summary log file is updated.

5. Whenever a file's source code has been downloaded, its content analyzed, its summary produced, or its indexing completed, the work is recorded, so that accidents do not become unrecoverable and repeated work is avoided.

6. The download, summary, and index threads never stop. Even when some piece of work has finished (for example, summarization is complete), the summary log file is reloaded and summarization begins again.

VII. Management step: the management step mainly implements local data management, category management, news source management, index updates after data deletion, log updates, and so on.

Classifications
International Classification: G06F17/27, G06F17/30, G06F9/445, G06F17/00
Legal events
13 Oct 2004: C06 Publication
19 Apr 2006: C10 Request for substantive examination
21 Jul 2010: C02 Deemed withdrawal of patent application after publication (Patent Law 2001)