CN103559297A - Breakpoint continuous acquisition method and system for book retrieval information - Google Patents
- Publication number
- CN103559297A (application CN201310562445.4A)
- Authority
- CN
- China
- Prior art keywords
- page
- information
- module
- book
- breakpoint
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2455—Query execution
- G06F16/24553—Query execution of query operations
Abstract
The embodiment of the invention discloses a breakpoint continuous acquisition method and a system for book retrieval information. The method comprises the following steps: (1) loading breakpoint information, (2) jumping to a corresponding crawling position, (3) saving the breakpoint information, and (4) downloading and processing the book information, and repeating Step (4). The invention further discloses the breakpoint continuous acquisition system for the book retrieval information. With the adoption of the system, breakpoint continuous acquisition can be realized, the acquisition efficiency is improved, and the system has a high practical value.
Description
Technical field
The present invention relates to network information collection technology within the field of text information processing, and in particular to a method and system for breakpoint continuous acquisition of book retrieval information, that is, for resuming a crawl from the point at which it was interrupted.
Background technology
With the emergence of the World Wide Web, people began to disseminate information over the network, and the volume of network information has grown geometrically. As the amount of information surged, how to collect the required information quickly became a focus of attention, and web crawlers emerged in response. A web crawler is a program that starts from an entry point, uses a graph traversal algorithm to fetch web page information from the internet, and processes and stores the information it has crawled.
A library is an institution that collects, organizes, and holds books and reference materials for people to read and consult. Early libraries were operated manually; with the development of computers, libraries have gradually moved toward automated operation, and the birth of book management systems accelerated this development. Compared with the geometrically growing unstructured information on the network, the information in a library is structured information that has already been organized.
The collection information of libraries contains a large amount of valuable information, and obtaining it accurately and efficiently has important practical significance. Comparing the book resources of different libraries can assist searches across institutions; analyzing the book information of major colleges and universities reveals the collection structure of each institution, which is an important reflection of a library's document support capability and service level; analyzing the proportions of the various categories of books held by a university makes it possible to infer that university's disciplinary characteristics and key academic directions. In addition, analysis of book information can reveal the publication status of various books, the market share of publishers, and the book purchasing of each university.
At present, the mainstream way of obtaining the collection information of major libraries is to build a web crawler for the library's book system. The crawler can automatically crawl all book information held by the library, but causes such as network instability and server failure can interrupt the crawler program. The conventional way of handling an interruption is to restart the crawler. Because the program does not remember the last interruption point, it crawls the book information again from the entry point, which causes a great deal of repeated work and reduces the crawler's efficiency.
Summary of the invention
In view of the problems in the prior art, the object of the present invention is to provide a method for breakpoint continuous acquisition of book retrieval information.
To achieve the above object, the method for breakpoint continuous acquisition of book retrieval information proposed by the present invention comprises the following steps:
(1) a breakpoint information loading step;
(2) a step of jumping to the corresponding crawl position;
(3) a breakpoint information saving step;
(4) a book information download and processing step, after which step (2) is executed again.
In the above method, step (1) further comprises:
(11) a step of reading the breakpoint information file, in which the call number, the page number, and the position within the page of the interruption point are stored;
(12) a step of obtaining the breakpoint information: the information read in is processed to obtain the call number S, the page number P, and the position N within the page at the breakpoint.
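Steps (11) and (12) can be sketched in Python as follows. This is only an illustration under stated assumptions: the patent does not specify a file format, so the JSON layout, the field names `call_number`, `page`, and `index`, the `Breakpoint` class, and the default starting point used when no file exists are all hypothetical.

```python
import json
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Breakpoint:
    call_number: str  # S: call number being searched at the interruption
    page: int         # P: page number of the result list
    index: int        # N: position of the record within that page


def load_breakpoint(path: Path) -> Breakpoint:
    """Read the breakpoint file; fall back to a fresh start if it is absent."""
    if not path.exists():
        # Assumed default (not from the patent): first call number, page 1, first item.
        return Breakpoint(call_number="A", page=1, index=0)
    data = json.loads(path.read_text(encoding="utf-8"))
    return Breakpoint(str(data["call_number"]), int(data["page"]), int(data["index"]))
```

Returning a default when the file is missing lets the same code path serve both a first run and a resumed run.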
In the above method, step (2) further comprises:
(21) a step of jumping to the search result page: according to the call number S and the state of the previous crawl, determine the current search number Sn and jump to the corresponding search result page;
(22) a step of downloading and parsing the result page: download the result page and obtain the query result information by regular expression;
(23) a step of judging whether there is a limit on the maximum number of displayed records: if there is no limit, execute step (24); if there is a limit, judge whether the current search results exceed the maximum display number. If they do, narrow the range of the call number: keep the current call number unchanged as the leading part and append a sub-position after it to traverse all cases, where the sub-position should cover every character that may appear in a call number; then jump back to step (21) and search again. If they do not, execute step (24);
(24) a step of jumping to the list page: according to the page number P and the state of the previous crawl, determine the page number Page to be collected this time, and assemble the URL of the list page to be crawled from Sn, Page, and other information;
(25) a step of downloading and parsing the list page: crawl the list page and match the bibliographic record links in it by regular expression;
(26) a step of jumping to the crawl position: according to the position N within the page, skip the bibliographic records already crawled and determine the current crawl position.
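The call-number narrowing of step (23) and the URL assembly of step (24) might look roughly like the sketch below. Everything concrete here is an assumption made for illustration: the character set `SUB_POSITION_CHARS`, the `count_hits` callback standing in for steps (21) and (22), the `max_depth` guard, the choice to skip prefixes with no results, and the query parameter names `callno` and `page`.

```python
from typing import Callable, Iterator, Optional
from urllib.parse import urlencode

# Assumed set of characters that may follow a call-number prefix.
SUB_POSITION_CHARS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"


def expand_prefixes(prefix: str, count_hits: Callable[[str], int],
                    max_display: Optional[int], max_depth: int = 6) -> Iterator[str]:
    """Yield call-number prefixes whose result sets fit within the display limit."""
    hits = count_hits(prefix)
    if hits == 0:
        return  # assumption: prefixes with no results are skipped
    if max_display is None or hits <= max_display or len(prefix) >= max_depth:
        yield prefix  # no limit, within the limit, or depth guard reached
        return
    # Too many results: keep the prefix and append each possible sub-position.
    for ch in SUB_POSITION_CHARS:
        yield from expand_prefixes(prefix + ch, count_hits, max_display, max_depth)


def build_list_url(base: str, sn: str, page: int) -> str:
    """Assemble a list-page URL from Sn and Page (parameter names are hypothetical)."""
    return f"{base}?{urlencode({'callno': sn, 'page': page})}"
```

For example, with a toy catalogue and a display limit of two records:

```python
catalogue = ["TP311", "TP312", "TP391", "TN91", "TB1"]
hits = lambda p: sum(c.startswith(p) for c in catalogue)
print(list(expand_prefixes("T", hits, max_display=2)))
```

The depth guard is there only so that the recursion terminates even if a single full call number holds more records than the limit.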
In the above method, step (4) further comprises:
(41) a step of downloading the book information: crawl the book page;
(42) a step of obtaining the book information: if the system provides MARC information for the book, match the MARC information by regular expression; if MARC information is not provided, match the basic information of the book by regular expression;
(43) a book information storing step: save the book information obtained.
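The MARC-first fallback in step (42) could be sketched as below. The HTML markup assumed by the three patterns (`<pre id="marc">`, `<h1 class="title">`, `<span class="author">`) is invented for the example; a real library system would need patterns written against its own pages.

```python
import re
from typing import Dict

# Hypothetical markup; real book pages will differ.
MARC_BLOCK = re.compile(r'<pre id="marc">(.*?)</pre>', re.S)
TITLE = re.compile(r'<h1 class="title">(.*?)</h1>', re.S)
AUTHOR = re.compile(r'<span class="author">(.*?)</span>', re.S)


def extract_book_info(html: str) -> Dict[str, str]:
    """Prefer the MARC record; otherwise fall back to basic fields."""
    marc = MARC_BLOCK.search(html)
    if marc:
        return {"marc": marc.group(1).strip()}
    info: Dict[str, str] = {}
    for field, pattern in (("title", TITLE), ("author", AUTHOR)):
        match = pattern.search(html)
        if match:
            info[field] = match.group(1).strip()
    return info
```

The returned dictionary is what step (43) would then persist.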
The system for breakpoint continuous acquisition of book retrieval information proposed by the present invention comprises the following modules:
(1) a breakpoint information loading module;
(2) a module for jumping to the corresponding crawl position;
(3) a breakpoint information saving module;
(4) a book information download and processing module.
In the above system, module (1) further comprises:
(11) a module for reading the breakpoint information file, in which the call number, the page number, and the position within the page of the interruption point are stored;
(12) a module for obtaining the breakpoint information, which processes the information read in to obtain the call number S, the page number P, and the position N within the page at the breakpoint.
In the above system, module (2) further comprises:
(21) a module for jumping to the search result page, which, according to the call number S and the state of the previous crawl, determines the current search number Sn and jumps to the corresponding search result page;
(22) a module for downloading and parsing the result page, which downloads the result page and obtains the query result information by regular expression;
(23) a module for judging whether there is a limit on the maximum number of displayed records: if there is no limit, module (24) is executed; if there is a limit, the module judges whether the current search results exceed the maximum display number. If they do, the range of the call number is narrowed: the current call number is kept unchanged as the leading part and a sub-position is appended after it to traverse all cases, where the sub-position should cover every character that may appear in a call number; control then returns to module (21) and the search is repeated. If they do not, module (24) is executed;
(24) a module for jumping to the list page, which, according to the page number P and the state of the previous crawl, determines the page number Page to be collected this time and assembles the URL of the list page to be crawled from Sn, Page, and other information;
(25) a module for downloading and parsing the list page, which crawls the list page and matches the bibliographic record links in it by regular expression;
(26) a module for jumping to the crawl position, which, according to the position N within the page, skips the bibliographic records already crawled and determines the current crawl position.
In the above system, module (4) further comprises:
(41) a module for downloading the book page, which crawls the book page;
(42) a module for obtaining the book information: if the system provides MARC information for the book, the MARC information is matched by regular expression; if MARC information is not provided, the basic information of the book is matched by regular expression;
(43) a book information storage module, which saves the book information obtained.
Brief description of the drawings
Fig. 1 is a flow chart of the steps of the method for breakpoint continuous acquisition of book retrieval information according to the present invention;
Fig. 2 is a flow chart of the steps of breakpoint information loading;
Fig. 3 is a flow chart of the steps of jumping to the corresponding crawl position;
Fig. 4 is a flow chart of the steps of book information download and processing;
Fig. 5 is a structural block diagram of the system for breakpoint continuous acquisition of book retrieval information according to the present invention.
Embodiment
Below in conjunction with accompanying drawing, specific embodiments of the present invention is described in detail.
Fig. 1 is a flow chart of an embodiment of the invention, which comprises the following steps:
Step S1: load the breakpoint information and obtain the call number S, the page number P, and the position N within the page.
Step S2: jump to the corresponding crawl position, moving step by step to the collection position according to S, P, and N.
Step S3: save the breakpoint information; each time before book information is crawled, the call number, the page number, and the position within the page of that book are saved to a file.
Step S4: download and process the book information, save the processed book information to the file system, and execute step S2 again.
Each step is described in detail below:
Step S1 loads the breakpoint information and sets the starting point of the current collection. Fig. 2 shows the flow chart of this process; the specific operations are as follows:
Step 11: read the breakpoint information file. This file stores the call number, the page number, and the position within the page at which collection was interrupted.
Step 12: obtain the breakpoint information. The call number S, the page number P, and the position N within the page are parsed from the content read.
Step S2 jumps to the specified crawl position. Fig. 3 shows the flow chart of this process; the specific operations are as follows:
Step 27: download the list page and match the bibliographic record links by regular expression.
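As a rough illustration of matching bibliographic record links on a list page and then skipping the records already crawled according to N, consider the following sketch. The link pattern `item.php?marc_no=...` is a hypothetical example, not something the patent specifies, and N is assumed to be a zero-based count of records to skip.

```python
import re
from typing import List

# Hypothetical pattern: record links look like href="item.php?marc_no=12345".
RECORD_LINK = re.compile(r'href="(item\.php\?marc_no=\d+)"')


def pending_links(list_page_html: str, skip: int) -> List[str]:
    """Return the record links on a list page, minus the first `skip` entries."""
    links = RECORD_LINK.findall(list_page_html)
    return links[skip:]
```

On a resumed run `skip` would be the N read from the breakpoint file for the first page only; later pages would use `skip=0`.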
Step S3 saves the breakpoint information: each time before book information is crawled, the call number Sn, the page number Page, and the position Num within the page of that book are written to the breakpoint information file.
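One way to write the breakpoint file in step S3 is sketched below. The JSON layout and field names are assumptions, and the write-to-temporary-file-then-rename pattern is a common safeguard added here for illustration rather than something the patent describes; it is meant to avoid leaving a half-written breakpoint file if the program is interrupted during the save itself.

```python
import json
import os
import tempfile
from pathlib import Path


def save_breakpoint(path: Path, call_number: str, page: int, index: int) -> None:
    """Write the breakpoint to a temporary file, then rename it over the old one."""
    record = {"call_number": call_number, "page": page, "index": index}
    # Create the temporary file in the same directory so the rename stays
    # on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=str(path.parent), suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(record, handle)
    os.replace(tmp_name, path)  # replaces any existing breakpoint file
```

Calling this immediately before each book page is fetched means the saved position always names the record that was in progress when a failure occurred.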
Step S4 downloads and processes the book information. Fig. 4 shows the flow chart of this process; the specific operations are as follows:
The embodiments of the steps of the method for breakpoint continuous acquisition of book retrieval information have been set forth above with reference to the accompanying drawings.
In another aspect, the invention also discloses a system for breakpoint continuous acquisition of book retrieval information. With reference to Fig. 5, the system comprises the following modules:
Module (1): loads the breakpoint information and obtains the call number S, the page number P, and the position N within the page.
Module (2): jumps to the corresponding crawl position, moving step by step to the collection position according to S, P, and N.
Module (3): saves the breakpoint information; each time before book information is crawled, the call number, the page number, and the position within the page of that book are saved to a file.
Module (4): downloads and processes the book information and saves the processed book information to the file system.
The embodiments of the modules of the above system for breakpoint continuous acquisition of book retrieval information have the same technical effects as the method embodiments and are not repeated here.
In summary, the core of the breakpoint continuous acquisition of the present invention is to record, before each crawl, the call number, the page number, and the position within the page of the bibliographic record to be collected. After the program is interrupted, the starting point of the current collection is determined by reading the recorded interruption point, which avoids repeated work and improves collection efficiency. In accordance with the idea of the present invention, both the specific embodiments and the scope of application may vary. Accordingly, this description should not be construed as limiting the present invention.
The embodiments of the present invention described above do not limit the scope of protection of the invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.
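The record-before-crawl idea can be demonstrated with a small simulation. The sketch below is not the patented implementation: the in-memory `CATALOGUE`, the dictionary used as the breakpoint store, zero-based page and item indices, and the `fail_at` parameter that simulates a network failure are all assumptions made so that the example is self-contained.

```python
from typing import Dict, List, Optional

# Toy catalogue standing in for a library system: call number -> pages of record ids.
CATALOGUE: Dict[str, List[List[str]]] = {
    "TB": [["b1", "b2"], ["b3"]],
    "TP": [["p1", "p2"]],
}


def crawl(breakpoint_store: dict, saved: List[str], fail_at: Optional[str] = None) -> None:
    """Crawl every record, resuming from breakpoint_store if it is populated."""
    start_s = breakpoint_store.get("call_number")
    start_p = breakpoint_store.get("page", 0)
    start_n = breakpoint_store.get("index", 0)
    resuming = start_s is not None
    for s in sorted(CATALOGUE):
        if resuming and s < start_s:
            continue  # call numbers finished before the interruption
        for p, records in enumerate(CATALOGUE[s]):
            if resuming and s == start_s and p < start_p:
                continue  # pages finished before the interruption
            first = start_n if (resuming and s == start_s and p == start_p) else 0
            for n in range(first, len(records)):
                # Record the position before fetching, as in step S3.
                breakpoint_store.update(call_number=s, page=p, index=n)
                if records[n] == fail_at:
                    raise ConnectionError("simulated network failure")
                saved.append(records[n])


store: dict = {}
collected: List[str] = []
try:
    crawl(store, collected, fail_at="b3")  # first run is interrupted at b3
except ConnectionError:
    pass
crawl(store, collected)  # second run resumes at b3 instead of b1
```

Because the position is saved before the fetch, the interrupted record is fetched again on the next run, so nothing is skipped and nothing already saved is repeated.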
Claims (10)
1. A method for breakpoint continuous acquisition of book retrieval information, characterized in that it comprises the following steps:
(1) a breakpoint information loading step;
(2) a step of jumping to the corresponding crawl position;
(3) a breakpoint information saving step;
(4) a book information download and processing step, after which step (2) is executed again.
2. the method for claim 1, is characterized in that, step (1) further comprises:
(11) read breakpoint information file step, in breakpoint information file, preserved in call number, the page number and the page number of the point of interruption which;
(12) obtain breakpoint information step, process the information of reading in, obtain the call number S of breakpoint place, which N in page number P and page.
3. the method for claim 1, is characterized in that, step (2) further comprises:
(21) jump to retrieval result page step,, according to call number S and the front state that once crawls, determine this searching number Sn, jump to corresponding retrieval result page;
(22) download parsing result page step, downloads result page and obtains Query Result information by matching regular expressions;
(23) judge whether that restriction shows quantity step, if incalculability restriction, execution step (24); If there is restricted number, judge whether current result for retrieval surpasses the maximum number that shows, if surpassed, dwindle the scope of call number, current cable book number is constant as first place, adding son position to travel through all situations thereafter, son position should comprise all characters that may occur in call number, step (21) is returned in redirect, again retrieval; If do not surpassed, perform step (24);
(24) jump to list page step, according to page number P and the front state that once crawls, determine that this gathers page number Page, by the assembled list page url to be crawled of the information such as Sn, Page;
(25) download parsing list page step, crawls list page, and the bibliography going out in list page by matching regular expressions links;
(26) jump to and crawl position step, according to which N in page, skip and crawl bibliography, and determine that this crawls position.
4. the method for claim 1, is characterized in that, in step (3), by breakpoint information write break point message file, breakpoint information comprise in call number, the page number and the page number which.
5. the method for claim 1, is characterized in that, step (4) further comprises:
(41) download book information step, crawl book page;
(42) obtain book information step, if system provides the MARC information of books, by matching regular expressions, go out MARC information, if do not provided, by matching regular expressions, go out the essential information of books;
(43) book information storing step, preserves the book information obtaining.
6. A system for breakpoint continuous acquisition of book retrieval information, characterized in that it comprises the following modules:
(1) a breakpoint information loading module;
(2) a module for jumping to the corresponding crawl position;
(3) a breakpoint information saving module;
(4) a book information download and processing module.
7. The system as claimed in claim 6, characterized in that module (1) further comprises:
(11) a module for reading the breakpoint information file, in which the call number, the page number, and the position within the page of the interruption point are stored;
(12) a module for obtaining the breakpoint information, which obtains the call number S, the page number P, and the position N within the page.
8. The system as claimed in claim 6, characterized in that module (2) further comprises:
(21) a module for jumping to the search result page, which, according to the call number S and the state of the previous crawl, determines the current search number Sn and jumps to the corresponding search result page;
(22) a module for downloading and parsing the result page, which downloads the result page and obtains the query result information by regular expression;
(23) a module for judging whether the number of displayed records is limited: if there is no limit, module (24) is executed; if there is a limit, the module judges whether the current search results exceed the maximum display number. If they do, the range of the call number is narrowed: the current call number is kept unchanged as the leading part and a sub-position is appended after it to traverse all cases, where the sub-position should cover every character that may appear in a call number; control then returns to module (21) and the search is repeated. If they do not, module (24) is executed;
(24) a module for jumping to the list page, which, according to the page number P and the state of the previous crawl, determines the page number Page to be collected this time and assembles the URL of the list page to be crawled from Sn, Page, and other information;
(25) a module for downloading and parsing the list page, which crawls the list page and matches the bibliographic record links in it by regular expression;
(26) a module for jumping to the crawl position, which, according to the position N within the page, skips the bibliographic records already crawled and determines the current crawl position.
9. The system as claimed in claim 6, wherein module (3) saves the breakpoint information, which comprises the call number, the page number, and the position within the page.
10. The system as claimed in claim 6, characterized in that module (4) further comprises:
(41) a module for downloading the book page, which crawls the book page;
(42) a module for obtaining the book information: if the system provides MARC information for the book, the MARC information is matched by regular expression; if MARC information is not provided, the basic information of the book is matched by regular expression;
(43) a book information storage module, which saves the book information obtained.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310562445.4A CN103559297A (en) | 2013-11-12 | 2013-11-12 | Breakpoint continuous acquisition method and system for book retrieval information |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310562445.4A CN103559297A (en) | 2013-11-12 | 2013-11-12 | Breakpoint continuous acquisition method and system for book retrieval information |
Publications (1)
Publication Number | Publication Date |
---|---|
CN103559297A (en) | 2014-02-05 |
Family
ID=50013543
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310562445.4A Pending CN103559297A (en) | 2013-11-12 | 2013-11-12 | Breakpoint continuous acquisition method and system for book retrieval information |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103559297A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6339785B1 (en) * | 1999-11-24 | 2002-01-15 | Idan Feigenbaum | Multi-server file download |
CN1627290A (en) * | 2003-12-12 | 2005-06-15 | 鸿富锦精密工业(深圳)有限公司 | Data down loading system and method capable of continuous transmission from breakpoint |
CN101291195A (en) * | 2008-05-23 | 2008-10-22 | 中兴通讯股份有限公司 | File downloading method, system and terminal realizing breaker point continuous transmission |
CN101299219A (en) * | 2008-06-27 | 2008-11-05 | 北京邮电大学 | Multithread breakpoint continued transmission customizable internal net reptile system |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105243159A (en) | Visual script editor-based distributed web crawler system | |
US9767082B2 (en) | Method and system of retrieving ajax web page content | |
CN102567516B (en) | Script loading method and device | |
CN103106196B (en) | A kind of method and apparatus recovering browsing device net page | |
CN104408204A (en) | Method and device for obtaining webpage page link address | |
CN102855318A (en) | Method and system for preloading of webpages | |
CN102982162B (en) | The acquisition system of info web | |
US20170124213A1 (en) | Automating Web Tasks Based on Web Browsing Histories and User Actions | |
JP2018514846A (en) | Web page access method, apparatus, device, and program | |
CN104572043A (en) | Method and device for embedding points for controls of client application in real time | |
CN102982161A (en) | Method and device for acquiring webpage information | |
CN106126747A (en) | Data capture method based on reptile and device | |
CN104516982A (en) | Method and system for extracting Web information based on Nutch | |
CN103530292A (en) | Webpage displaying method and device | |
CN104750851A (en) | Webpage content lazy loading method and system | |
CN103714116A (en) | Webpage information extracting method and webpage information extracting equipment | |
CN114491206A (en) | General low-code crawler method and system for news blog websites | |
KR101287371B1 (en) | Method and Device for Collecting Web Contents and Computer-readable Recording Medium for the same | |
CN112818201A (en) | Network data acquisition method and device, computer equipment and storage medium | |
CN103365877A (en) | Method and server for making directory after webpage is transcoded | |
CN103020179A (en) | Method, device and equipment for extracting webpage contents | |
CN102955852A (en) | Method, device and equipment for webpage resource processing | |
CN101957848A (en) | Method and device for navigating browser | |
CN103164438B (en) | The acquisition method of a kind of network comment and system | |
CN105653550A (en) | Web page filtering method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C02 | Deemed withdrawal of patent application after publication (patent law 2001) | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20140205 |