CN103559297A - Breakpoint continuous acquisition method and system for book retrieval information - Google Patents

Breakpoint continuous acquisition method and system for book retrieval information

Info

Publication number
CN103559297A
Authority
CN
China
Prior art keywords
page
information
module
book
breakpoint
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201310562445.4A
Other languages
Chinese (zh)
Inventor
肖波
赵琳
蔺志青
陆月明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications filed Critical Beijing University of Posts and Telecommunications
Priority to CN201310562445.4A priority Critical patent/CN103559297A/en
Publication of CN103559297A publication Critical patent/CN103559297A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24553 Query execution of query operations

Abstract

The embodiment of the invention discloses a breakpoint continuous acquisition method and a system for book retrieval information. The method comprises the following steps: (1) loading breakpoint information, (2) jumping to a corresponding crawling position, (3) saving the breakpoint information, and (4) downloading and processing the book information, and repeating Step (4). The invention further discloses the breakpoint continuous acquisition system for the book retrieval information. With the adoption of the system, breakpoint continuous acquisition can be realized, the acquisition efficiency is improved, and the system has a high practical value.

Description

A method and system for breakpoint continuous acquisition of book retrieval information
Technical field
The present invention relates to network information acquisition technology within the field of text information processing, and in particular to a method and system for breakpoint continuous acquisition of book retrieval information.
Background
With the emergence of the World Wide Web, people began to disseminate information over the network, and the volume of network information has grown geometrically. As the amount of information surged, how to collect the needed information quickly became a focus of attention, and web crawlers emerged in response. A web crawler is a program that starts from an entry point, uses a graph traversal algorithm to fetch web page information from the Internet, and processes and stores the crawled information.
A library is an institution that collects, organizes, and holds books and reference materials for people to read and consult. Early libraries operated manually. With the development of computers, libraries gradually moved toward automated operation, and the advent of library management systems accelerated this trend. In contrast to the geometrically growing unstructured information on the network, library information is structured, organized information.
Library collection information contains a large amount of valuable information, and obtaining it accurately and efficiently is of practical significance. Comparing the resources of different libraries can assist metasearch. Analyzing the book information of major universities yields the collection structure of each university, where the collection structure is an important reflection of a library's document support capability and service level. Analyzing the proportions of the various categories of books held by a university makes it possible to infer that university's disciplinary character and key academic directions. Analysis of book information can also reveal the publication status of various kinds of books, the market share of publishers, and the book purchasing of each university.
At present, the mainstream way of obtaining the collection information of large libraries is to build a web crawler for the library system. A web crawler can automatically crawl all book information held by a library. However, network instability, server failures, and similar causes can interrupt the crawler program. The conventional way of handling a crawler interruption is to restart the crawler. Because the program does not remember the last interruption point, it crawls the book information again from the entry point, which causes a large amount of repeated work and reduces the crawler's efficiency.
Summary of the invention
In view of the problems in the prior art, the object of the invention is to provide a method for breakpoint continuous acquisition of book retrieval information.
To achieve the above object, the method for breakpoint continuous acquisition of book retrieval information proposed by the invention comprises the following steps:
(1) a breakpoint information loading step;
(2) a step of jumping to the corresponding crawl position;
(3) a breakpoint information saving step;
(4) a book information download and processing step, after which step (2) is executed again.
In the above method, step (1) further comprises:
(11) a step of reading the breakpoint information file, which stores the call number, the page number, and the position within the page of the interruption point;
(12) a step of obtaining the breakpoint information: the information read in is processed to obtain the call number S, the page number P, and the position N within the page at the breakpoint.
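Steps (11) and (12) can be sketched as follows. The source does not specify the layout of the breakpoint information file, so the single tab-separated line, the `Breakpoint` type, and the function name `load_breakpoint` are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Breakpoint:
    call_number: str  # S: call number being searched at the interruption
    page: int         # P: list page number at the interruption
    index: int        # N: position of the record within that page


def load_breakpoint(path):
    """Read the breakpoint file; return None when no breakpoint is recorded.

    Assumed layout (not given in the source): one line "S<TAB>P<TAB>N".
    """
    file = Path(path)
    if not file.exists():
        return None  # first run: crawl from the entry point
    text = file.read_text(encoding="utf-8").rstrip("\r\n")
    if not text:
        return None
    call_number, page, index = text.split("\t")
    return Breakpoint(call_number, int(page), int(index))
```

Returning `None` for a missing or empty file lets the caller treat a first run and a resumed run through the same code path.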
In the above method, step (2) further comprises:
(21) a step of jumping to the search result page: according to the call number S and the state of the previous crawl, the call number Sn for this search is determined, and the corresponding search result page is opened;
(22) a step of downloading and parsing the result page: the result page is downloaded and the query result information is obtained by regular expression;
(23) a step of determining whether there is a maximum displayed-record limit: if there is no limit on the number of records, step (24) is executed; if there is a limit, it is determined whether the current search result exceeds the maximum displayed number; if it does, the call-number range is narrowed by keeping the current call number unchanged as a prefix and appending one further character position to it, traversing all cases, where the appended position should cover all characters that may occur in a call number, and the method jumps back to step (21) and searches again; if it does not, step (24) is executed;
(24) a step of jumping to the list page: according to the page number P and the state of the previous crawl, the page number Page for this acquisition is determined, and the URL of the list page to be crawled is assembled from Sn, Page, and other information;
(25) a step of downloading and parsing the list page: the list page is crawled, and the bibliographic record links in it are matched by regular expression;
(26) a step of jumping to the crawl position: according to the position N within the page, the bibliographic records already crawled are skipped, and the crawl position for this run is determined.
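The narrowing rule of step (23) can be read as a recursive prefix expansion. The sketch below is one possible interpretation, not the patent's implementation: `count_results` is a hypothetical callable standing in for steps (21) and (22), the alphabet `CALL_NUMBER_CHARS` is an assumed character set, and `max_depth` is an added safety bound.

```python
import string

# Assumed set of characters that may occur in a call number.
CALL_NUMBER_CHARS = string.ascii_uppercase + string.digits + "./-"


def expand_prefixes(prefix, count_results, max_display, max_depth=8):
    """Yield call-number prefixes whose result count fits the display limit.

    count_results(prefix) -> int is a hypothetical stand-in for searching
    the catalogue and parsing the total from the result page.
    max_display is None when the catalogue imposes no limit.
    """
    total = count_results(prefix)
    if total == 0:
        return  # nothing is catalogued under this prefix
    if max_display is None or total <= max_display:
        yield prefix
        return
    if len(prefix) >= max_depth:
        yield prefix  # safety bound reached: accept a truncated listing
        return
    # Too many results: keep the prefix and append every possible character.
    for ch in CALL_NUMBER_CHARS:
        yield from expand_prefixes(prefix + ch, count_results, max_display, max_depth)
```

One caveat of this reading: a record whose call number equals the prefix exactly is not covered by any of the extended prefixes, and the source does not say how such records are handled.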
In the above method, step (4) further comprises:
(41) a step of downloading the book information: the book page is crawled;
(42) a step of obtaining the book information: if the system provides the MARC information of the book, the MARC information is matched by regular expression; if MARC information is not provided, the basic information of the book is matched by regular expression;
(43) a step of storing the book information: the obtained book information is saved.
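Step (42) depends on the markup of the particular catalogue, which the source does not describe. The sketch below therefore assumes a hypothetical page layout (a `<pre id="marc">` block for MARC data and `<span class="...">` elements for basic fields); every pattern and field name in it is illustrative only.

```python
import html
import re

# Hypothetical page markup; a real catalogue needs its own patterns.
MARC_RE = re.compile(r'<pre id="marc">(.*?)</pre>', re.S)
BASIC_RES = {
    "author": re.compile(r'<span class="author">(.*?)</span>', re.S),
    "publisher": re.compile(r'<span class="publisher">(.*?)</span>', re.S),
    "year": re.compile(r'<span class="year">(\d{4})</span>'),
    "isbn": re.compile(r'<span class="isbn">([\dXx-]+)</span>'),
}


def extract_book_info(page):
    """Return the MARC text if the page carries it, else the basic fields found."""
    marc = MARC_RE.search(page)
    if marc:
        return {"marc": html.unescape(marc.group(1)).strip()}
    info = {}
    for field, pattern in BASIC_RES.items():
        match = pattern.search(page)
        if match:
            info[field] = html.unescape(match.group(1)).strip()
    return info
```

Checking for MARC first mirrors the order in step (42); the basic-field patterns are only consulted when no MARC block is present.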
The system for breakpoint continuous acquisition of book retrieval information proposed by the invention comprises the following modules:
(1) a breakpoint information loading module;
(2) a module for jumping to the corresponding crawl position;
(3) a breakpoint information saving module;
(4) a book information download and processing module.
In the above system, module (1) further comprises:
(11) a module for reading the breakpoint information file, which stores the call number, the page number, and the position within the page of the interruption point;
(12) a module for obtaining the breakpoint information: the information read in is processed to obtain the call number S, the page number P, and the position N within the page at the breakpoint.
In the above system, module (2) further comprises:
(21) a module for jumping to the search result page: according to the call number S and the state of the previous crawl, the call number Sn for this search is determined, and the corresponding search result page is opened;
(22) a module for downloading and parsing the result page: the result page is downloaded and the query result information is obtained by regular expression;
(23) a module for determining whether there is a maximum displayed-record limit: if there is no limit on the number of records, module (24) is executed; if there is a limit, it is determined whether the current search result exceeds the maximum displayed number; if it does, the call-number range is narrowed by keeping the current call number unchanged as a prefix and appending one further character position to it, traversing all cases, where the appended position should cover all characters that may occur in a call number, and control returns to module (21) to search again; if it does not, module (24) is executed;
(24) a module for jumping to the list page: according to the page number P and the state of the previous crawl, the page number Page for this acquisition is determined, and the URL of the list page to be crawled is assembled from Sn, Page, and other information;
(25) a module for downloading and parsing the list page: the list page is crawled, and the bibliographic record links in it are matched by regular expression;
(26) a module for jumping to the crawl position: according to the position N within the page, the bibliographic records already crawled are skipped, and the crawl position for this run is determined.
In the above system, module (4) further comprises:
(41) a module for downloading the book page: the book page is crawled;
(42) a module for obtaining the book information: if the system provides the MARC information of the book, the MARC information is matched by regular expression; if MARC information is not provided, the basic information of the book is matched by regular expression;
(43) a module for storing the book information: the obtained book information is saved.
Brief description of the drawings
Fig. 1 is a flow chart of the steps of the method for breakpoint continuous acquisition of book retrieval information according to the invention;
Fig. 2 is a flow chart of the steps of breakpoint information loading;
Fig. 3 is a flow chart of the steps of jumping to the corresponding crawl position;
Fig. 4 is a flow chart of the steps of book information download and processing;
Fig. 5 is a structural block diagram of the system for breakpoint continuous acquisition of book retrieval information according to the invention.
Embodiments
Specific embodiments of the invention are described in detail below with reference to the drawings.
Fig. 1 is a flow chart of one embodiment of the invention, which comprises the following steps:
Step S1: breakpoint information loading. The call number S, the page number P, and the position N within the page are obtained.
Step S2: jumping to the corresponding crawl position. The crawler moves step by step to the acquisition position according to S, P, and N.
Step S3: breakpoint information saving. Before each crawl of book information, the call number, the page number, and the position within the page of the book are saved to a file.
Step S4: book information download and processing. The processed book information is saved to the file system, and step S2 is executed again.
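The interplay of steps S1 to S4 can be sketched as a single loop. This is a minimal illustration under stated assumptions, not the patent's code: the four helpers are injected as parameters, and their names and signatures (`load_bp`, `iter_positions`, `save_bp`, `process_book`) are hypothetical.

```python
def crawl_with_resume(load_bp, save_bp, iter_positions, process_book):
    """Run steps S1 to S4 with injected, hypothetical helpers.

    load_bp() -> (S, P, N) or None                      # S1
    iter_positions(bp) -> iterable of (Sn, Page, Num, url),
        starting at the breakpoint bp                   # S2
    save_bp(Sn, Page, Num)                              # S3
    process_book(url)                                   # S4
    """
    bp = load_bp()
    for sn, page, num, url in iter_positions(bp):
        save_bp(sn, page, num)  # record the position before crawling it
        process_book(url)       # may raise; the saved breakpoint survives
```

Because the breakpoint is written before the download, an interruption during `process_book` leaves the file pointing at the unfinished record, so the next run resumes with that record rather than after it.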
Each step is described in detail below:
Step S1 performs breakpoint information loading and sets the starting point of this acquisition. Fig. 2 shows the flow chart of this process; the specific operation steps are as follows:
Step 11: read the breakpoint information file. This file stores the call number, the page number, and the position within the page at which the acquisition was interrupted.
Step 12: obtain the breakpoint information. The call number S, the page number P, and the position N within the page are parsed from the content read.
Step S2 performs the jump to the specified crawl position. Fig. 3 shows the flow chart of this process; the specific operation steps are as follows:
Step 21: jump to the search result page. According to the breakpoint information S and the state of the previous crawl, the call number Sn for this acquisition is determined, and the search result page URL is assembled from Sn.
Step 22: download the result page, and obtain the total number of books Count and the display limit information by regular expression matching.
Step 23: determine whether there is a maximum displayed-record limit; if so, execute step 24; otherwise, execute step 26.
Step 24: determine whether the current search result exceeds the maximum displayed number; if so, execute step 25; otherwise, execute step 26.
Step 25: narrow the call-number range by keeping the current call number unchanged as a prefix and appending one further character position to it, traversing all cases, where the appended position should cover all characters that may occur in a call number; then jump back to step 21 and search again.
Step 26: jump to the list page. According to the breakpoint information P and the state of the previous crawl, the page number Page for this acquisition is determined, and the list page URL is assembled from Sn, Page, and Count.
Step 27: download the list page, and match the bibliographic record links by regular expression.
Step 28: jump to the crawl position. According to the position N within the page, the bibliographic records already crawled are skipped, and the crawl position for this run is determined.
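Steps 26 to 28 amount to starting the page loop at P and, on that first page only, starting the record loop at N. The sketch below assumes 1-based page numbers and 0-based positions within a page; `fetch_links` is a hypothetical stand-in for downloading a list page and matching its record links.

```python
def resume_positions(fetch_links, total_pages, start_page=1, start_index=0):
    """Yield (page, index, link) from the breakpoint onward.

    fetch_links(page) -> list of record links on that list page
    (hypothetical stand-in for steps 26 and 27).
    start_page and start_index correspond to P and N; N applies only to
    the first page visited after the restart.
    """
    skip = start_index
    for page in range(start_page, total_pages + 1):
        links = fetch_links(page)
        for index in range(skip, len(links)):
            yield page, index, links[index]
        skip = 0  # later pages are crawled from their first record
```

The record at position N itself is yielded again, on the assumption that the breakpoint was written before that record was downloaded and the download may not have finished.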
Step S3 performs breakpoint information saving: before each crawl of book information, the call number Sn, the page number Page, and the position Num within the page of the book are written to the breakpoint information file.
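A minimal sketch of step S3 follows. The one-line tab-separated layout and the function name `save_breakpoint` are assumptions; writing through a temporary file and renaming it is an added precaution that the source does not mention.

```python
import os
import tempfile


def save_breakpoint(path, call_number, page, index):
    """Write "Sn<TAB>Page<TAB>Num" to path, replacing any earlier breakpoint.

    The temporary-file-and-rename pattern is an added precaution (not from
    the source) so that a crash in the middle of the write cannot leave a
    half-written breakpoint file.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(f"{call_number}\t{page}\t{index}\n")
        os.replace(tmp_path, path)  # replaces the old file in one step
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Since the file is rewritten before every book, it always holds exactly one record: the position of the book about to be crawled.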
Step S4 performs the download and processing of book information. Fig. 4 shows the flow chart of this process; the specific operation steps are as follows:
Step 41: download the book page corresponding to the call number Sn, the page number Page, and the position Num within the page;
Step 42: determine whether the system provides MARC information; if so, execute step 43; otherwise, execute step 44;
Step 43: match the MARC information by regular expression;
Step 44: match the basic information of the book by regular expression. In one embodiment, the author, publisher, publication year, ISBN, and so on are matched. This is only one embodiment of the basic book information; other embodiments do not limit the invention;
Step 45: store the book information. In one embodiment, the processed book information is written to a file named after the call number. This is only one embodiment of writing the book information; other embodiments do not limit the invention.
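The storage embodiment of step 45 (a file named after the call number) might look like the sketch below. The JSON Lines format, the `.jsonl` suffix, and the file-name sanitizing rule are assumptions added for illustration; the source only states that the file is named with the call number.

```python
import json
import re
from pathlib import Path


def store_book_info(out_dir, call_number, info):
    """Append one book record as a JSON line to a file named after the call number."""
    # Replace characters that are unsafe in file names (e.g. "/" in "TP311/12").
    safe_name = re.sub(r"[^A-Za-z0-9._-]", "_", call_number) or "unknown"
    target = Path(out_dir) / f"{safe_name}.jsonl"
    target.parent.mkdir(parents=True, exist_ok=True)
    with target.open("a", encoding="utf-8") as out:
        out.write(json.dumps(info, ensure_ascii=False) + "\n")
    return target
```

Because records are appended, a book that is crawled again after a restart would appear twice in the file; deduplicating on read is left to the caller in this sketch.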
The embodiments of the steps of the method for breakpoint continuous acquisition of book retrieval information have been set forth above with reference to the drawings.
In another aspect, the invention also discloses a system for breakpoint continuous acquisition of book retrieval information. With reference to Fig. 5, the system comprises the following modules:
Module (1): breakpoint information loading. The call number S, the page number P, and the position N within the page are obtained.
Module (2): jumping to the corresponding crawl position. The crawler moves step by step to the acquisition position according to S, P, and N.
Module (3): breakpoint information saving. Before each crawl of book information, the call number, the page number, and the position within the page of the book are saved to a file.
Module (4): book information download and processing. The processed book information is saved to the file system.
The embodiments of the modules of the system for breakpoint continuous acquisition of book retrieval information have the same technical effects as the method embodiments and are not repeated here.
In summary, the core of the breakpoint continuous acquisition of the invention is to record, before each crawl, the call number, the page number, and the position within the page of the bibliographic record to be collected. After a program interruption, the starting point of the next acquisition is determined by reading the recorded interruption point, which avoids repeated work and improves acquisition efficiency. Following the idea of the invention, the specific embodiments and the scope of application may vary. Accordingly, this description should not be construed as limiting the invention.
The embodiments of the invention described above do not limit the scope of protection of the invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims (10)

1. A method for breakpoint continuous acquisition of book retrieval information, characterized in that it comprises the following steps:
(1) a breakpoint information loading step;
(2) a step of jumping to the corresponding crawl position;
(3) a breakpoint information saving step;
(4) a book information download and processing step, after which step (2) is executed again.
2. The method of claim 1, characterized in that step (1) further comprises:
(11) a step of reading the breakpoint information file, which stores the call number, the page number, and the position within the page of the interruption point;
(12) a step of obtaining the breakpoint information: the information read in is processed to obtain the call number S, the page number P, and the position N within the page at the breakpoint.
3. The method of claim 1, characterized in that step (2) further comprises:
(21) a step of jumping to the search result page: according to the call number S and the state of the previous crawl, the call number Sn for this search is determined, and the corresponding search result page is opened;
(22) a step of downloading and parsing the result page: the result page is downloaded and the query result information is obtained by regular expression matching;
(23) a step of determining whether the displayed quantity is limited: if there is no limit on the number of records, step (24) is executed; if there is a limit, it is determined whether the current search result exceeds the maximum displayed number; if it does, the call-number range is narrowed by keeping the current call number unchanged as a prefix and appending one further character position to it, traversing all cases, where the appended position should cover all characters that may occur in a call number, and the method jumps back to step (21) and searches again; if it does not, step (24) is executed;
(24) a step of jumping to the list page: according to the page number P and the state of the previous crawl, the page number Page for this acquisition is determined, and the URL of the list page to be crawled is assembled from Sn, Page, and other information;
(25) a step of downloading and parsing the list page: the list page is crawled, and the bibliographic record links in it are matched by regular expression;
(26) a step of jumping to the crawl position: according to the position N within the page, the bibliographic records already crawled are skipped, and the crawl position for this run is determined.
4. The method of claim 1, characterized in that in step (3) the breakpoint information is written to the breakpoint information file, the breakpoint information comprising the call number, the page number, and the position within the page.
5. The method of claim 1, characterized in that step (4) further comprises:
(41) a step of downloading the book information: the book page is crawled;
(42) a step of obtaining the book information: if the system provides the MARC information of the book, the MARC information is matched by regular expression; if not, the basic information of the book is matched by regular expression;
(43) a step of storing the book information: the obtained book information is saved.
6. A system for breakpoint continuous acquisition of book retrieval information, characterized in that it comprises the following modules:
(1) a breakpoint information loading module;
(2) a module for jumping to the corresponding crawl position;
(3) a breakpoint information saving module;
(4) a book information download and processing module.
7. The system of claim 6, characterized in that module (1) further comprises:
(11) a module for reading the breakpoint information file, which stores the call number, the page number, and the position within the page of the interruption point;
(12) a module for obtaining the breakpoint information, which obtains the call number S, the page number P, and the position N within the page of the breakpoint.
8. The system of claim 6, characterized in that module (2) further comprises:
(21) a module for jumping to the search result page: according to the call number S and the state of the previous crawl, the call number Sn for this search is determined, and the corresponding search result page is opened;
(22) a module for downloading and parsing the result page: the result page is downloaded and the query result information is obtained by regular expression;
(23) a module for determining whether the displayed quantity is limited: if there is no limit on the number of records, module (24) is executed; if there is a limit, it is determined whether the current search result exceeds the maximum displayed number; if it does, the call-number range is narrowed by keeping the current call number unchanged as a prefix and appending one further character position to it, traversing all cases, where the appended position should cover all characters that may occur in a call number, and control returns to module (21) to search again; if it does not, module (24) is executed;
(24) a module for jumping to the list page: according to the page number P and the state of the previous crawl, the page number Page for this acquisition is determined, and the URL of the list page to be crawled is assembled from Sn, Page, and other information;
(25) a module for downloading and parsing the list page: the list page is crawled, and the bibliographic record links in it are matched by regular expression;
(26) a module for jumping to the crawl position: according to the position N within the page, the bibliographic records already crawled are skipped, and the crawl position for this run is determined.
9. The system of claim 6, wherein module (3) saves the breakpoint information, which comprises the call number, the page number, and the position within the page.
10. The system of claim 6, characterized in that module (4) further comprises:
(41) a module for downloading the book page: the book page is crawled;
(42) a module for obtaining the book information: if the system provides the MARC information of the book, the MARC information is matched by regular expression; if MARC information is not provided, the basic information of the book is matched by regular expression;
(43) a module for storing the book information: the obtained book information is saved.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310562445.4A CN103559297A (en) 2013-11-12 2013-11-12 Breakpoint continuous acquisition method and system for book retrieval information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310562445.4A CN103559297A (en) 2013-11-12 2013-11-12 Breakpoint continuous acquisition method and system for book retrieval information

Publications (1)

Publication Number Publication Date
CN103559297A true CN103559297A (en) 2014-02-05

Family

ID=50013543

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310562445.4A Pending CN103559297A (en) 2013-11-12 2013-11-12 Breakpoint continuous acquisition method and system for book retrieval information

Country Status (1)

Country Link
CN (1) CN103559297A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6339785B1 (en) * 1999-11-24 2002-01-15 Idan Feigenbaum Multi-server file download
CN1627290A (en) * 2003-12-12 2005-06-15 鸿富锦精密工业(深圳)有限公司 Data down loading system and method capable of continuous transmission from breakpoint
CN101291195A (en) * 2008-05-23 2008-10-22 中兴通讯股份有限公司 File downloading method, system and terminal realizing breaker point continuous transmission
CN101299219A (en) * 2008-06-27 2008-11-05 北京邮电大学 Multithread breakpoint continued transmission customizable internal net reptile system

Similar Documents

Publication Publication Date Title
CN105243159A (en) Visual script editor-based distributed web crawler system
US9767082B2 (en) Method and system of retrieving ajax web page content
CN102567516B (en) Script loading method and device
CN103106196B (en) A kind of method and apparatus recovering browsing device net page
CN104408204A (en) Method and device for obtaining webpage page link address
CN102855318A (en) Method and system for preloading of webpages
CN102982162B (en) The acquisition system of info web
US20170124213A1 (en) Automating Web Tasks Based on Web Browsing Histories and User Actions
JP2018514846A (en) Web page access method, apparatus, device, and program
CN104572043A (en) Method and device for embedding points for controls of client application in real time
CN102982161A (en) Method and device for acquiring webpage information
CN106126747A (en) Data capture method based on reptile and device
CN104516982A (en) Method and system for extracting Web information based on Nutch
CN103530292A (en) Webpage displaying method and device
CN104750851A (en) Webpage content lazy loading method and system
CN103714116A (en) Webpage information extracting method and webpage information extracting equipment
CN114491206A (en) General low-code crawler method and system for news blog websites
KR101287371B1 (en) Method and Device for Collecting Web Contents and Computer-readable Recording Medium for the same
CN112818201A (en) Network data acquisition method and device, computer equipment and storage medium
CN103365877A (en) Method and server for making directory after webpage is transcoded
CN103020179A (en) Method, device and equipment for extracting webpage contents
CN102955852A (en) Method, device and equipment for webpage resource processing
CN101957848A (en) Method and device for navigating browser
CN103164438B (en) The acquisition method of a kind of network comment and system
CN105653550A (en) Web page filtering method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20140205