WO2009140590A1 - Method and system for large volume data processing - Google Patents

Method and system for large volume data processing

Info

Publication number
WO2009140590A1
Authority
WO
WIPO (PCT)
Prior art keywords
file
small
source file
files
server
Application number
PCT/US2009/044127
Other languages
French (fr)
Inventor
Yipeng Tang
Wenqi Hong
Original Assignee
Alibaba Group Holding Limited
Application filed by Alibaba Group Holding Limited
Priority to US12/601,606 (US8229982B2)
Priority to EP20090747671 (EP2283428A4)
Priority to JP2011509735A (JP5438100B2)
Publication of WO2009140590A1


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 - Arrangements for program control, e.g. control units
    • G06F 9/06 - Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 - Multiprogramming arrangements
    • G06F 9/50 - Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 - Allocation of resources to service a request
    • G06F 9/5027 - Allocation of resources to service a request, the resource being a machine, e.g. CPUs, servers, terminals
    • G06F 9/5055 - Allocation of resources to service a request, considering software capabilities, i.e. software resources associated with or available to the machine
    • G06F 9/5061 - Partitioning or combining of resources
    • G06F 9/5066 - Algorithms for mapping a plurality of inter-dependent sub-tasks onto a plurality of physical CPUs

Definitions

  • the present disclosure relates to data processing technologies, and particularly to methods and systems for large volume data processing.
  • a sender saves a certain data into a file in a certain format, and then sends the file to a recipient.
  • upon receiving the file, the recipient analyzes the content within the file and performs logical processing accordingly.
  • the processing system using a single server or a single thread may not be able to satisfy this need.
  • file data is transferred from a sender to a recipient on a regular basis, once every five minutes, for example.
  • the recipient may have a maximum delay tolerance for the data. If the recipient cannot finish processing the transmitted data within the corresponding interval, a vicious cycle may result: unfinished processing of data from the previous period, combined with the arrival of new data, will increase the recipient's data delay and eventually lead to a system collapse.
  • the requirement to process such a large volume of data is commonly seen in a number of large-scale applications. Examples include reporting students' data from schools to an education authority in the educational sector, web log processing at large-scale websites, and inter-system data synchronization. Therefore, a method for processing a large volume of data within a scheduled time is required to alleviate data processing delay.
  • One aspect of the disclosure is a method for processing large volume data.
  • the method allocates a server to divide a source file into multiple small files, and allocates multiple servers to distributedly process the small files.
  • the allocation of servers can be based on the filenames assigned according to a file naming scheme.
  • the disclosed method deploys multiple servers to divide and process large data files, thereby improving the processing power of the system and ensuring that the system completes the processing of the files within scheduled times.
  • the system promises good scalability. As files become bigger or the number of files increases, new servers may be added to satisfy the demands.
  • the system can be linearly expanded without having to purchase more advanced servers and to re-configure and re-deploy the servers which have been operating previously.
  • allocating the source file dividing server according to the filename of the source file is done by parsing the filename of the source file and obtaining a source file sequence number; computing ((the source file sequence number) % (total number of servers available for allocation)) + 1, where % represents the modulus operator; and allocating the source file dividing server according to the computed result.
  • Allocating the small file processing servers according to the filenames of the small files can be done in a similar manner.
  • the servers may also be allocated according to the data types to be processed by the servers.
  • the method configures each server to process a data type; parses a filename of a file and obtains the data type of data stored in the file; and allocates the file to a server that is configured to process the data type of the data. After dividing the source file into small files, the method may save the small files into a disk.
  • the method further allows the source file processing server to retry to divide the source file upon failure; and allows the plurality of small file processing servers to retry to process the respective allocated small files upon failure.
  • the method may allow only a single retry to divide the source file, but allow multiple retries to process the allocated small files.
  • the method may place the source file waiting to be divided and small files waiting to be processed under different directories.
  • the data flow under the directory of the source files waiting to be divided includes the following steps: placing the source file into a directory for to-be-divided source files; after allocating the source file processing server, placing the source file waiting to be divided into a temporary directory for file division; dividing the source file; and backing up the source file into a directory storing successfully divided source files if the source file has been divided successfully, and saving the small files thus obtained into a directory storing post-division small files, or backing up the source file into a directory storing source files failed to be divided if the source file has failed to be divided after a retry.
  • the data flow under the directory of the small files waiting to be processed may include the following steps: after allocating the small file processing servers, placing the small files that are waiting to be processed into a temporary directory for small file processing; processing the small files in the temporary directory; and backing up one or more of the small files into a directory storing successfully processed small files if the one or more small files have been processed successfully, backing up one or more of the small files into a directory storing small files having partially unsuccessfully processed records if the one or more of the small files need to be re-processed, and backing up one or more of the small files into a directory storing small files failed to be processed upon retries if the one or more of the small files have failed to be processed upon retries.
  • the system includes multiple servers, with each server including a pre-processing unit, a dividing unit and a processing unit.
  • the pre-processing unit is used for determining whether a source file waiting to be divided is to be processed by the server based on a source file naming scheme and for triggering a dividing unit if affirmative. The pre-processing unit is also used for determining whether a small file is to be processed by the server based on a post-division small file naming scheme and for triggering a processing unit if affirmative.
  • the dividing unit is used for dividing the source file into small files.
  • the processing unit is used for performing logical processing for the small file.
  • the pre-processing unit determines whether the source file is to be processed by the server based on a source file sequence number in the filename of the source file, and determines whether the small file is to be processed by the server based on a source file sequence number in the filename of the small file or a small file sequence number in the filename of the small file.
  • the pre-processing unit determines whether the source file is to be processed by the server based on a type of data stored in the source file.
  • each server further has a configuration unit used for configuring data type(s) that can be processed by the server.
  • each server may further include a storage unit which is used for saving small files obtained from dividing the source file into a disk.
  • the storage unit may adopt a directory structure, and places files waiting to be divided and files waiting to be processed under different directories.
  • each server may further include a retry unit used for retrying to divide the source file upon failure and/or to process the small files upon failure.
  • the retry unit may allow a single retry to divide the source file but allow multiple retries to process the small files.
  • the disclosed method and system have several potential advantages.
  • the present disclosure provides a method and a system capable of distributed and concurrent processing of large files. Under the control of a concurrency strategy, multiple servers may be deployed to divide and process large data files at the same time, thereby greatly improving the processing power of a system and ensuring that the system completes processing of the files within a scheduled time.
  • a concurrency strategy which allocates servers for dividing and processing files based on a file naming scheme ensures that each source file is divided by just one server, and each small file obtained by dividing the source file is also processed by only one server, thereby avoiding resource competition.
  • the present disclosure discloses several different concurrency strategies.
  • One strategy allocates a server according to a source file sequence number in the filename. This strategy can guarantee a balance among servers when there are a relatively large number of files.
  • Another strategy configures, for each server, a data type that is allowed to be processed by the server, and allocates a suitable server for respective file(s) waiting for processing. This latter strategy requires only modification of a configuration table when a new server is added.
  • the present disclosure may use a combination of these two concurrency strategies so that the disclosed system may maximally balance the activity levels of the servers.
  • the system promises good scalability. As files become bigger or the number of files increases, new servers may be added to satisfy corresponding demands. Specifically, the system can be linearly expanded, without having to purchase more advanced servers and to re-deploy the servers which have been operating previously.
  • FIG. 1 shows a flow chart of an exemplary method for large volume data processing in accordance with the present disclosure.
  • FIG. 2 shows a flow chart of an exemplary process of dividing a file in a directory structure.
  • FIG. 3 shows a flow chart of an exemplary process of processing a file in a directory structure.
  • FIG. 4 shows a logical structural diagram of a system for large volume data processing in accordance with the present disclosure.
  • FIG. 5 shows a diagram illustrating an exemplary internal logical structure of each server in the system of FIG. 4.
  • the present disclosure provides a method and a system for processing a large file in a concurrent or distributed manner.
  • through a concurrency strategy which employs a file naming scheme to allocate servers, a number of servers can be deployed to divide and process the large data file at the same time.
  • the method can greatly improve the processing power of a system and ensure that the system completes the processing of a large data file within a scheduled time.
  • a sender creates large files FileA of multiple types in a certain format, and sends one or more files FileA to a recipient from time to time (e.g., every two minutes).
  • FileA has a large file size (e.g., 200M).
  • the multiple types may be data types such as product data and order data, with each type possibly having multiple data files.
  • upon receiving the files FileA, the recipient performs subsequent logical processing of the files FileA.
  • the logical processing may be relatively complicated, and may require a number of related matters such as saving the files FileA in a database.
  • a goal of the disclosed method is to enable the system of the recipient to complete processing of all the files FileA within two minutes.
  • FIG. 1 shows a flow chart of an exemplary method for large volume data processing in accordance with the present disclosure.
  • the present disclosure adopts a method of distributed and concurrent processing by multiple servers to process a large data file according to business logic.
  • the multiple servers may execute the same procedures concurrently.
  • a prerequisite for the exemplary embodiment is that the file FileA sent from a sender is named using a unified file naming standard or file naming scheme. A file which fails to follow this naming standard will not be processed. Therefore, files sent from the sender are named according to the standard, and a recipient may directly parse the filenames.
  • a filename of a source file FileA sent from a sender is parsed.
  • a server is allocated to divide the source file FileA.
  • the server divides the source file FileA into multiple small files FileY.
  • the value of N (the number of small files) is set according to practical situations, and is primarily determined by the number of servers available for processing.
  • each small file FileY obtained by dividing the source file FileA is named automatically according to a post-division small file naming standard.
  • the filenames of the small files FileY are parsed.
  • a server is allocated to process the small file FileY.
  • the allocated servers distributedly perform subsequent logical processing.
  • the source file and each small file (which is obtained by dividing the source file) each has its own naming rule. Specific details are given below.
  • An exemplary source file naming standard is as follows.
  • An example filename: DateTime_Sequence_DataType.TXT, where the meaning of each parameter in the filename is illustrated as follows.
  • DateTime represents the time when data is exported, and has a precision up to one second.
  • An exemplary corresponding format is year, month, day, hour, minute, second (e.g., 20070209133010).
  • Sequence represents a sequence number of data paging (used to prevent a single file from being too large), and may also be called a source file sequence number. Sequence may be set to have three digits, with the accumulation beginning from 001.
  • DataType represents the type of data stored in the file. For example, "user” represents user data, "order” represents order data, and "sample” represents product data. As shown above, each data type DataType may have multiple data files.
  • An exemplary post-division small file naming standard is as follows. An example filename: SubDateTime_DateTime_Sequence_SubSequence_retryX_DataType.TXT.
  • SubDateTime represents the time when a subfile (i.e., a small file) is exported, and its default value is the same as DateTime. However, upon a retry, SubDateTime and DateTime may not be the same. Specific details are found in the subsequent description of retries.
  • SubSequence represents a sequence number of the small file which is created upon dividing the corresponding source file.
  • An exemplary SubSequence is set to have four digits, with the accumulation starting from 0001.
  • the X in retryX represents the number of retries made to process the corresponding file in practice. When the file is processed for the first time, X is 0. After one failure, X is changed to 1, and X accumulates accordingly thereafter. During accumulation, care is taken to accordingly modify the corresponding SubDateTime, which is in front of retryX.
  • An exemplary rule sets SubDateTime to be the current system time + retry time interval.
  • DateTime of the source file is still contained in the filename of the post-division small file.
  • the primary purpose of keeping DateTime is to allow error checking in case a problem occurs in subsequent processing.
  • Blocks S102 and S106 both rely on filename parsing results to rationally allocate servers. Specifically, servers are dispatched to divide the source file and to process the post-division small files. In view of the execution of a whole server system, this is a process of concurrent execution. Specifically, at the time when one server is allocated to divide a source file, another server may be allocated to process a small file obtained in previous division.
  • the present disclosure provides two exemplary kinds of concurrency strategies for dividing a source file and processing a post-division small file.
  • An exemplary principle of the concurrency strategies is to allow a source file to be divided by one server only, and to allow each post-division small file to be processed by one server only, in order to avoid resource competition.
  • One exemplary strategy is to apply a modulus operation to Sequence or SubSequence in the filename using the following formula: (Sequence or SubSequence) % (number of servers available for allocation) + 1, where "%" represents the modulus operator, and "+1" (the addition of one) ensures that the computed result will not be zero.
  • x % 3 + 1 is computed first. If the result is one, the corresponding file is allocated to be processed by server1. If the result is two, the file is allocated to be processed by server2. If the result is three, the file is allocated to be processed by server3.
  • the same determination rule for allocation may be used for both processing a small file and for dividing a source file.
  • the allocation for processing the small file and the allocation for dividing the source file may both be based on Sequence in the corresponding filename.
  • determination of allocation may preferably be based on SubSequence for processing a small file.
  • the foregoing strategy may achieve relatively even allocation of servers.
  • server1 and server2 may process a few more files than server3.
  • a file having a relatively small data volume may have only one divided small file, and thus will always be processed by server1 according to this rule. Therefore, this strategy is suitable for situations where the number of files to be processed is large. The larger the number of files, the greater the evenness among servers will be.
  • if a new server is added, the number of available servers changes, causing a different server allocation; as a result, all the servers are required to be re-deployed.
  • Another strategy configures, for each server, a DataType that can be processed and is only allowed to be processed by that server.
  • the system provides a configuration table, with configuration items showing, for each server, a DataType that can be processed and is only allowed to be processed by that server.
  • server1, server2 and server3 may be configured to process order data, user data and sample data, respectively.
  • it may be required to guarantee that no inter-server conflicts exist in the configuration. If a conflict exists, a warning about the configuration error is given.
  • either concurrency strategy can schedule multiple servers well for dividing source files and for processing small files at the same time. Depending on specific application requirements, either of the two strategies may be selected in practical applications.
  • the concurrency strategies are not mutually exclusive, they may preferably be combined. For example, multiple servers may be assigned to process files having "order" as DataType (i.e., files having order data), while server3 is allocated to process user data only, and server2 is allocated to process sample data only. This helps to maximize the evenness of the processing among servers, and to maximize the balance among the activities of the servers.
  • the corresponding post-division small files are stored into a disk.
  • the small file is obtained from the disk. If a small file obtained by dividing the source file by the server has not been completely written into a disk, this small file is not allowed to be processed. This precaution is needed because if the small file is processed in this circumstance, the content of the small file read by another server (a small file processing server) may not be complete.
  • the files may be processed in a chronological order.
  • this method of distributed and concurrent processing by multiple servers may greatly improve processing power of a system, particularly when large files are processed, and can maximally balance the activity levels of the servers. Moreover, this method promises good scalability. If files become bigger or the number of files increases, servers may be added to satisfy the demands. Specifically, the system can be linearly expanded, without having to purchase more advanced servers and to re-deploy the servers which have been operating previously.
  • the present disclosure preferably places files to be divided and files to be processed under different directories, and caches the files in their respective directories. After a file in the cache is processed, another file is read.
  • the processing order depends on a natural ordering of the filenames (i.e., according to the ascending order of DateTime in the filenames, with files of earlier DateTime being processed first).
  • An exemplary directory structure for file processing may look like the following:
  • /tmp_execute // temporary directory used during processing of a certain small file according to business logic
    /hostnameA // hostname of a server for current processing
    /hostnameB // ... these directory names are dynamically allocated according to the number of servers
    /bak_sucs_file // backup source files that have been successfully divided
  • the above directory structure is used in these exemplary embodiments for illustrative purposes only.
  • the directory structure may be self-adjusted. For example, whether /hour directory is needed depends on the number of small files obtained from dividing the source file by the system. Because an operating system has restrictions on the number of files, the size of directory tree, and the maximum number of files under a directory, performance of the system may be severely affected if the number of files under one directory is too large.
  • the process of creating the above directory structure is in synchronization with the processes of dividing the source file and processing the small files shown in FIG. 1, and may reflect the process shown in FIG. 1 through the data flow of files across directories. Such a process primarily includes data flows in file division and file processing. Specific details are described as follows.
  • FIG. 2 shows a flow chart illustrating an exemplary process of dividing a large source file in a directory structure. The process is described below with reference to the directory structure described above.
  • a source file transmitted from a sender is placed under a directory /source_file.
  • the source file which is under the directory /source_file and waiting to be divided is allocated a server based on its filename.
  • the file is further placed under a corresponding directory /tmp_divide/hostname, which is preferably a directory of the allocated server. This process is referred to as renaming.
  • the server divides the source file which is waiting to be divided. If the source file is successfully divided, small files obtained from the division are saved under a directory /source_smallfile. The source file which has been divided successfully is backed up under a directory /bak_sucs_file.
  • if the division fails, a retry to divide is attempted. If the retry is successful, the process returns to S203, and the small files obtained from the division are saved under the directory /source_smallfile, while the successfully divided source file is backed up under the directory /bak_sucs_file. If the retry also fails, the source file is backed up under a directory /error_divide.
  • FIG. 3 shows a flow chart illustrating an exemplary process of processing a post-division small file in a directory structure. The process is described with reference to the directory structure described above.
  • a small file obtained from dividing the source file is placed under a directory /source_smallfile.
  • the small file which is under the directory /source_smallfile and waiting to be processed is allocated a server based on its filename.
  • the small file is further placed under a corresponding directory /tmp_execute/hostname, which is preferably a directory of the allocated server. This process is called renaming.
  • the allocated server logically processes the small file that is waiting to be processed. If successful, the processed small file is saved under a directory /bak_sucs_small.
  • the number of allowable retries may be set to be five, for example.
  • the small file which is waiting for a retry for processing is backed up under a directory /bak_error.
  • the process returns to S301.
  • the small file which has failed to be processed at S303 is placed back under the directory
  • the small file is backed up under a directory /error_execute.
  • the number of retries to divide a source file and the number of retries to process a small file are not the same. If an error occurs in dividing a large source file, a retry to divide is made. If the retry still fails to divide, the source file is transferred directly to the "error directory" (i.e., /error_divide). An error log is written while a warning is provided. However, a more flexible retry mechanism is used to handle an error occurring in the business logic processing of a small file.
  • because the error rate for dividing a source file is very small but failures in the subsequent business logic processing occur more frequently, multiple retries are allowed for processing small files. Because multiple retries may be attempted for a small file, a list of retry intervals is configured. The number of allowable retries and the duration to wait before making the next retry may be set manually. For example, the number of allowable retries may be set to five. If five unsuccessful retries have been made, the small file will not be processed, but is transferred to the "error directory". An error log is written while a warning is provided. Therefore, when a server obtains a specific file for processing, the server needs to determine whether the SubDateTime of that file is earlier than the current time. If yes, the server processes the file. Otherwise, the file is not processed.
  • a directory structure is used. The small file is first written to a temporary directory /source_smallfile. Upon successful writing, the small file is renamed, and transferred to another directory /tmp_execute/hostname. Because the renaming is atomic, and only a file pointer needs to be modified, this will guarantee the integrity of the data.
  • FIG. 4 shows a logical structural diagram of an exemplary system for large volume data processing in accordance with the present disclosure.
  • system 400 adopts a distributed structure including multiple servers 410, with each server having the abilities of dividing source files 420 and processing small files 430 obtained from dividing the source files 420.
  • the system 400 is deployed on the recipient side and is referred to as recipient system 400 below.
  • the multiple servers 410 of the recipient system 400 may execute the same procedures concurrently.
  • a large file FileA (420) is first placed under a directory /source_file.
  • each server serverA (410) determines, based on the file's filename, whether a file 420 is to be divided by the present server. Upon obtaining a file FileA needed to be processed by the present server, the server serverA divides the file into N smaller files FileY.
  • the value of N depends on practical situations and is primarily determined by the number of servers 410 available to be used for processing the small files. According to one concurrency strategy, the multiple servers serverX further determine whether any of the small files FileY obtained from dividing the source file are to be processed by the multiple servers serverX. Upon obtaining a small file FileY which is to be processed by server serverB, for example, the server serverB performs subsequent logical processing of the small file FileY.
  • the disclosed system is very flexible, has good scalability, and can support independent configuration of each server for processing certain types of files. In some embodiments, therefore, in case a new server is added, previously operated servers are not required to be reconfigured and deployed again with a new configuration.
  • the small files obtained from dividing the source file are saved into a disk. Furthermore, if an error occurs in the file division or file processing, the erroneous part is processed again. If a lot of files are waiting to be divided or processed, the files will be divided or processed in a chronological order.
  • FIG. 5 shows a structural diagram of an exemplary internal logic of each server in the system 400.
  • Server 510 includes a pre-processing unit U501, a dividing unit U502 and a processing unit U503.
  • the pre-processing unit U501 is used for scanning a directory of a disk on a regular basis to determine whether a file within the directory is to be processed by the server 510.
  • the pre-processing unit U501 determines whether a source file is to be divided by the present server 510. The determination is based on the filename of the source file which is assigned according to a source file naming scheme. If the answer is affirmative, the pre-processing unit U501 triggers the dividing unit U502 to divide the source file.
  • the pre-processing unit U501 may further determine whether a small file is to be processed by the server 510. The determination is based on the filename of the small file which is assigned according to a post-division small file naming scheme. If yes, the pre-processing unit U501 triggers the processing unit U503 to process the small file.
  • the dividing unit U502 is used for dividing the source file into small files.
  • the processing unit U503 is used for performing logical processing of the small file which, according to the above determination, is to be processed by the server 510.
  • the pre-processing unit U501 determines whether a source file is to be processed by the server 510 based on a source file sequence number in the file's filename, and determines whether a small file is to be processed by the server 510 based on a corresponding source file sequence number in the small file's filename or a small file sequence number in the small file's filename. If the second strategy is adopted, the pre-processing unit U501 determines whether a source file is to be divided by the server 510 based on the type of data stored in the source file.
  • the server 510 further includes a configuration unit U504 which is used for configuring data type(s) that can be processed by the server 510. During determination, the pre-processing unit U501 determines whether data type of the data stored in the source file is the type that can be processed by the server 510.
  • the server 510 may further include a storage unit U505 which is used for saving the small files into a disk.
  • the storage unit U505 adopts a directory structure, and places files that are waiting to be divided and files that are waiting for processing under different directories.
  • the server may further include a retry unit U506, which is used for retrying to divide a source file or to process a small file upon operation failure. Based on application needs, a single retry is attempted to divide a source file, while multiple retries may be attempted to process a small file.
  • a retry unit U506 which is used for retrying to divide a source file or to process a small file upon operation failure. Based on application needs, a single retry is attempted to divide a source file, while multiple retries may be attempted to process a small file.
  • the system 400 supports concurrent executions by multiple servers 410 (510) for improving processing power of the system 400.
  • the system 400 described also has a very good scalability.
  • servers 410 may be divided into two groups, one placed on the sender's side, and the other on the receiving side.
  • the source file is divided by the first group of servers on the sender's side and the resultant post-division small files are sent to the recipient side to be processed by the second group of servers.
  • the system may not involve a sender and recipient but has only one side which contains multiple servers to conduct file division and file processing.

Abstract

Disclosed are a method and a system for large volume data processing, for solving the problem of system collapse caused by processing delays resulting from a failure to process a large volume of data within a scheduled time. The method allocates a server to divide a source file into multiple small files according to a source file naming scheme, and allocates multiple servers to distributedly process the small files. The allocation of servers can be based on the filenames assigned according to a file naming scheme. The disclosed method deploys multiple servers to divide and process large data files, thereby maximally improving the processing power of the system and ensuring that the system completes the processing of the files within scheduled times. Furthermore, the system promises good scalability.

Description

METHOD AND SYSTEM FOR LARGE VOLUME DATA PROCESSING
RELATED APPLICATIONS
The present application claims priority benefit of Chinese patent application No. 200810097594.7, filed May 15, 2008, entitled "METHOD AND SYSTEM FOR LARGE VOLUME DATA PROCESSING", which Chinese application is hereby incorporated in its entirety by reference.
BACKGROUND ART
The present disclosure relates to data processing technologies, and particularly to methods and systems for large volume data processing.
Data processing is commonly observed in a number of applications. In a typical scenario, a sender saves certain data into a file in a certain format, and then sends the file to a recipient. Upon receiving the file, the recipient analyzes the content within the file, and performs logical processing accordingly.
In the above data process, if the file is not too big, and the recipient does not have a strict processing time requirement, a single server or a single thread may be used for processing. Under that circumstance, the corresponding system may still operate normally, though the time taken by the recipient to process the data of these files may be quite long.
However, if the file is very big (and/or the number of files is large), and the recipient has a very strict processing time requirement (e.g., the recipient may require the data of the file transmitted from the sender to be completely processed within one minute or even a shorter period of time), a processing system using a single server or a single thread may not be able to satisfy this need.
In many instances, file data is transferred from a sender to a recipient on a regular basis, once every five minutes, for example. In addition, the recipient may have a maximum delay tolerance for the data. If the recipient cannot finish processing the transmitted data within the corresponding interval, a vicious cycle may result: unfinished processing of data from the previous period, combined with the arrival of new data, will increase the recipient's data delay and eventually lead to a system collapse.
The requirement to process such a large volume of data is commonly seen in a number of large-scale applications. Examples include reporting students' data from schools to an education authority in the educational sector, web log processing at large-scale websites, and inter-system data synchronization. Therefore, a method for processing a large volume of data within a scheduled time is required to alleviate data processing delay.
SUMMARY
Disclosed are a method and a system for large volume data processing to solve the problem of system collapse caused by processing delays resulting from failure to process a large volume of data within a scheduled time. One aspect of the disclosure is a method for processing large volume data. The method allocates a server to divide a source file into multiple small files, and allocates multiple servers to distributedly process the small files. The allocation of servers can be based on the filenames assigned according to a file naming scheme. The disclosed method deploys multiple servers to divide and process large data files, thereby improving the processing power of the system and ensuring that the system completes the processing of the files within scheduled times. Furthermore, the system promises good scalability. As files become bigger or the number of files increases, new servers may be added to satisfy the demands. The system can be linearly expanded without having to purchase more advanced servers and to re-configure and re-deploy the servers which have been operating previously.
In one embodiment, allocating the source file dividing server according to the filename of the source file is done by parsing the filename of the source file and obtaining a source file sequence number; computing ((the source file sequence number) % (total number of servers available for allocation)) + 1, where % represents the modulus operator; and allocating the source file dividing server according to the computed result.
Allocating the small file processing servers according to the filenames of the small files can be done in a similar manner. The servers may also be allocated according to the data types to be processed by the servers. In one embodiment, the method configures each server to process a data type; parses a filename of a file and obtains the data type of data stored in the file; and allocates the file to a server that is configured to process the data type of the data. After dividing the source file into small files, the method may save the small files into a disk.
In one embodiment, the method further allows the source file processing server to retry to divide the source file upon failure; and allows the plurality of small file processing servers to retry to process the respective allocated small files upon failure. The method may allow only a single retry to divide the source file, but allow multiple retries to process the allocated small files.
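By way of illustration only, the sketch below shows this retry policy as a simple wrapper; divide_source_file and process_small_file are hypothetical callables, and the interval-based deferral of small-file retries (handled later in the disclosure through the SubDateTime field in the filename) is not modeled here.

```python
# Minimal sketch of the retry policy described above: one retry when dividing a
# source file, several retries when processing a small file.

def run_with_retries(action, path, max_retries):
    """Run action(path); allow up to max_retries additional attempts on failure.

    Returns True on success, False once every attempt has failed.
    """
    for _ in range(max_retries + 1):      # first attempt plus the allowed retries
        try:
            action(path)
            return True
        except Exception:
            # a real system would write an error log and raise a warning here
            continue
    return False

# Per the policy in the text (divide_source_file and process_small_file are
# hypothetical):
#   run_with_retries(divide_source_file, source_path, max_retries=1)
#   run_with_retries(process_small_file, small_path, max_retries=5)
```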
The method may place the source file waiting to be divided and small files waiting to be processed under different directories. In one embodiment, the data flow under the directory of the source files waiting to be divided includes the following steps: placing the source file into a directory for to-be-divided source files; after allocating the source file processing server, placing the source file waiting to be divided into a temporary directory for file division; dividing the source file; and backing up the source file into a directory storing successfully divided source files if the source file has been divided successfully, and saving the small files thus obtained into a directory storing post-division small files, or backing up the source file into a directory storing source files failed to be divided if the source file has failed to be divided after a retry. Furthermore, the data flow under the directory of the small files waiting to be processed may include the following steps: after allocating the small file processing servers, placing the small files that are waiting to be processed into a temporary directory for small file processing; processing the small files in the temporary directory; and backing up one or more of the small files into a directory storing successfully processed small files if the one or more small files have been processed successfully, backing up one or more of the small files into a directory storing small files having partially unsuccessfully processed records if the one or more of the small files need to be re-processed, and backing up one or more of the small files into a directory storing small files failed to be processed upon retries if the one or more of the small files have failed to be processed upon retries.
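To make the directory-based data flow concrete, the following sketch traces the division-side flow only; the directory names mirror the exemplary structure given later in the detailed description, and divide_fn is a hypothetical callable, so this is an illustration under stated assumptions rather than the disclosed implementation.

```python
import os
import shutil
import socket

# Illustrative directory layout (assumed to exist and to share one filesystem).
SOURCE_DIR   = "/root/source_file"        # source files waiting to be divided
TMP_DIVIDE   = "/root/tmp_divide"         # temporary directory used during division
SMALL_DIR    = "/root/source_smallfile"   # post-division small files
BAK_SUCS     = "/root/bak_sucs_file"      # successfully divided source files
ERROR_DIVIDE = "/root/error_divide"       # source files that failed division after a retry

def divide_flow(filename, divide_fn):
    """Move one source file through the division data flow.

    divide_fn(path, out_dir) is a hypothetical callable that splits the file and
    writes the small files into out_dir, raising an exception on failure.
    """
    host_dir = os.path.join(TMP_DIVIDE, socket.gethostname())
    os.makedirs(host_dir, exist_ok=True)
    working = os.path.join(host_dir, filename)
    os.rename(os.path.join(SOURCE_DIR, filename), working)   # the "renaming" step

    for _ in range(2):                                        # first try plus one retry
        try:
            divide_fn(working, SMALL_DIR)
            shutil.move(working, os.path.join(BAK_SUCS, filename))
            return True
        except Exception:
            continue
    shutil.move(working, os.path.join(ERROR_DIVIDE, filename))
    return False
```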
Another aspect of the disclosure is a system for large volume data processing. The system includes multiple servers, with each server including a pre-processing unit, a dividing unit and a processing unit. The pre-processing unit is used for determining whether a source file waiting to be divided is to be processed by the server based on a source file naming scheme and for triggering the dividing unit if affirmative. The pre-processing unit is also used for determining whether a small file is to be processed by the server based on a post-division small file naming scheme and for triggering the processing unit if affirmative. The dividing unit is used for dividing the source file into small files. The processing unit is used for performing logical processing for the small file. In one embodiment, the pre-processing unit determines whether the source file is to be processed by the server based on a source file sequence number in the filename of the source file, and determines whether the small file is to be processed by the server based on a source file sequence number in the filename of the small file or a small file sequence number in the filename of the small file.
In another embodiment, the pre-processing unit determines whether the source file is to be processed by the server based on a type of data stored in the source file. In this case, each server further has a configuration unit used for configuring data type(s) that can be processed by the server. Preferably, each server may further include a storage unit which is used for saving small files obtained from dividing the source file into a disk. The storage unit may adopt a directory structure, and places files waiting to be divided and files waiting to be processed under different directories.
Preferably, each server may further include a retry unit used for retrying to divide the source file upon failure and/or to process the small files upon failure. The retry unit may allow a single retry to divide the source file but allow multiple retries to process the small files.
According to some of the exemplary embodiments of the present disclosure, the disclosed method and system have several potential advantages. First, the present disclosure provides a method and a system capable of distributed and concurrent processing of large files. Under the control of a concurrency strategy, multiple servers may be deployed to divide and process large data files at the same time, thereby greatly improving the processing power of a system and ensuring that the system completes processing of the files within a scheduled time. Moreover, a concurrency strategy which allocates servers for dividing and processing files based on a file naming scheme ensures that each source file is divided by just one server, and each small file obtained by dividing the source file is also processed by only one server, thereby avoiding resource competition.
Second, the present disclosure discloses several different concurrency strategies. One strategy allocates a server according to a source file sequence number in the filename. This strategy can guarantee a balance among servers when there are a relatively large number of files. Another strategy configures, for each server, a data type that is allowed to be processed by the server, and allocates a suitable server for respective file(s) waiting for processing. This latter strategy requires only modification of a configuration table when a new server is added. Depending on the practical application needs, the present disclosure may use a combination of these two concurrency strategies so that the disclosed system may maximally balance the activity levels of the servers. Furthermore, the system promises good scalability. As files become bigger or the number of files increases, new servers may be added to satisfy corresponding demands. Specifically, the system can be linearly expanded, without having to purchase more advanced servers and to re-deploy the servers which have been operating previously.
Furthermore, in order to reduce disk IO (Input/Output) pressure, files that are waiting to be divided and files that are waiting to be processed may be placed under different directories, in which all files may be subsequently cached. A new file is read only after the files in the cache have been completely processed. This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
DESCRIPTION OF DRAWINGS
FIG. 1 shows a flow chart of an exemplary method for large volume data processing in accordance with the present disclosure.
FIG. 2 shows a flow chart of an exemplary process of dividing a file in a directory structure.
FIG. 3 shows a flow chart of an exemplary process of processing a file in a directory structure.
FIG. 4 shows a logical structural diagram of a system for large volume data processing in accordance with the present disclosure.
FIG. 5 shows a diagram illustrating an exemplary internal logical structure of each server in the system of FIG. 4.
DETAILED DESCRIPTION
In order to better understand the characteristics of the present disclosure, the disclosed method and system are described in further detail using accompanying figures and exemplary embodiments. The present disclosure provides a method and a system for processing a large file in a concurrent or distributed manner. Through a concurrency strategy which employs a file naming scheme to allocate servers, a number of servers can be deployed to divide and process the large data file at the same time. As a result, the method can greatly improve the processing power of a system and ensure that the system completes the processing of a large data file within a scheduled time.
For example, a sender creates large files FileA of multiple types in a certain format, and sends one or more files FileA to a recipient from time to time (e.g., every two minutes). FileA has a large file size (e.g., 200M). The multiple types may be data types such as product data and order data, with each type possibly having multiple data files. Upon receiving the files FileA, the recipient performs subsequent logical processing of the files FileA. The logical processing may be relatively complicated, and may require a number of related matters such as saving the files FileA in a database.
In the above example, if the processing power of a single server is 10M/Min, only 20M of data can be processed within two minutes, leaving 180M of data unprocessed. A goal of the disclosed method, therefore, is to enable the system of the recipient to complete processing of all the files FileA within two minutes.
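As a rough back-of-the-envelope check of these figures (assuming processing capacity simply adds up linearly across servers, which is an assumption rather than a statement from the disclosure):

```python
import math

# Example figures from the text: 200M of data arriving every two minutes,
# and a single server processing 10M per minute.
data_per_interval_m = 200
interval_min = 2
per_server_rate_m_per_min = 10

single_server_capacity = per_server_rate_m_per_min * interval_min     # 20M per interval
servers_needed = math.ceil(data_per_interval_m / single_server_capacity)
print(servers_needed)   # 10, assuming the work can be divided evenly across servers
```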
FIG. 1 shows a flow chart of an exemplary method for large volume data processing in accordance with the present disclosure. Using the above example, the present disclosure adopts a method of distributed and concurrent processing by multiple servers to process a large data file according to business logic. The multiple servers may execute the same procedures concurrently.
A prerequisite for the exemplary embodiment is that the file FileA sent from a sender is named using a unified file naming standard or file naming scheme. A file which fails to follow this naming standard will not be processed. Therefore, files sent from the sender are named according to the standard, and a recipient may directly parse the filenames.
The following uses an example of processing a file sent from a sender for illustration. In this description, the order in which a process is described is not intended to be construed as a limitation, and any number of the described process blocks may be combined in any order to implement the method, or an alternate method. The exemplary process is described as follows.
At S101, a filename of a source file FileA sent from a sender is parsed. At S102, based on a parsing result of the filename of the source file FileA, a server is allocated to divide the source file FileA.
At S103, the server divides the source file FileA into multiple small files FileY.
When a file is divided, the file is divided into N smaller files FileY. The value of N (the number of small files) is set according to practical situations, and is primarily determined by the number of servers available for processing.
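One possible way to picture the division step is to distribute the records of the source file (here assumed to be one per line) across N smaller files; the round-robin split and the output paths are illustrative assumptions, and the actual post-division naming standard is given below.

```python
# Illustrative division of a text source file into N smaller files, assuming one
# record per line. How records are actually grouped is left open by the disclosure.

def divide_into_n(source_path, out_paths):
    """Distribute the lines of source_path across the files named in out_paths."""
    n = len(out_paths)
    outs = [open(p, "w", encoding="utf-8") for p in out_paths]
    try:
        with open(source_path, "r", encoding="utf-8") as src:
            for i, line in enumerate(src):
                outs[i % n].write(line)       # simple round-robin distribution
    finally:
        for f in outs:
            f.close()
```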
At S104, each small file FileY obtained by dividing the source file FileA is named automatically according to a post-division small file naming standard.
At S105, the filenames of the small files FileY are parsed. At S106, based on the parsing results of the filenames of each small file FileY, a server is allocated to process the small file FileY.
At S107, the allocated servers distributedly perform subsequent logical processing. The source file and each small file (which is obtained by dividing the source file) each has its own naming rule. Specific details are given below. An exemplary source file naming standard is as follows.
An example filename: DateTime_Sequence_DataType.TXT, where the meaning of each parameter in the file name is illustrated as follows. DateTime represents the time when data is exported, and has a precision up to one second. An exemplary corresponding format is year, month, day, hour, minute, second (e.g., 20070209133010).
Sequence represents a sequence number of data paging (used to prevent a single file from being too large), and may also be called a source file sequence number. Sequence may be set to have three digits, with the accumulation beginning from 001.
DataType represents the type of data stored in the file. For example, "user" represents user data, "order" represents order data, and "sample" represents product data. As shown above, each data type DataType may have multiple data files. An exemplary post-division small file naming standard is as follows. An example of filename:
SubDateTime_DateTime_Sequence_SubSequence_retryX_DataType.TXT, where the meaning of each parameter is described as follows. The meaning of DateTime, Sequence and DataType is the same as that for the source file.
SubDateTime represents the time when a subfile (i.e., a small file) is exported, and its default value is the same as DateTime. However, upon a retry, SubDateTime and DateTime may not be the same. Specific details are found in the subsequent description of retries.
SubSequence represents a sequence number of the small file which is created upon dividing the corresponding source file. An exemplary SubSequence is set to have four digits, with the accumulation starting from 0001. The X in retryX represents the number of retries made to process the corresponding file in practice. When the file is processed for the first time, X is 0. After one failure, X is changed to 1, and X accumulates accordingly thereafter. During accumulation, care is taken to accordingly modify the corresponding SubDateTime, which is in front of retryX. An exemplary rule sets SubDateTime to be the current system time + retry time interval.
DateTime of the source file is still contained in the filename of the post-division small file. The primary purpose of keeping DateTime is to allow error checking in case a problem occurs in subsequent processing.
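The two naming standards above can be parsed mechanically, for example with regular expressions; the sketch below follows the exemplary field widths (14-digit times, 3-digit Sequence, 4-digit SubSequence), and the function names are assumptions rather than part of the disclosure.

```python
import re

# Source file:        DateTime_Sequence_DataType.TXT, e.g. 20070209133010_001_order.TXT
# Post-division file: SubDateTime_DateTime_Sequence_SubSequence_retryX_DataType.TXT

SOURCE_RE = re.compile(r"^(\d{14})_(\d{3})_([A-Za-z]+)\.TXT$", re.IGNORECASE)
SMALL_RE = re.compile(
    r"^(\d{14})_(\d{14})_(\d{3})_(\d{4})_retry(\d+)_([A-Za-z]+)\.TXT$", re.IGNORECASE)

def parse_source_name(name):
    m = SOURCE_RE.match(name)
    if not m:
        return None                 # files outside the naming standard are not processed
    date_time, sequence, data_type = m.groups()
    return {"DateTime": date_time, "Sequence": int(sequence), "DataType": data_type}

def parse_small_name(name):
    m = SMALL_RE.match(name)
    if not m:
        return None
    sub_dt, dt, seq, sub_seq, retry_x, data_type = m.groups()
    return {"SubDateTime": sub_dt, "DateTime": dt, "Sequence": int(seq),
            "SubSequence": int(sub_seq), "RetryX": int(retry_x), "DataType": data_type}

# Example: parse_source_name("20070429160001_002_order.txt")["Sequence"] == 2
```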
Blocks S102 and S106 both rely on filename parsing results to rationally allocate servers. Specifically, servers are dispatched to divide the source file and to process the post-division small files. In view of the execution of a whole server system, this is a process of concurrent execution. Specifically, at the time when one server is allocated to divide a source file, another server may be allocated to process a small file obtained in previous division.
The present disclosure provides two exemplary kinds of concurrency strategies for dividing a source file and processing a post-division small file. An exemplary principle of the concurrency strategies is to allow a source file to be divided by one server only, and to allow each post-division small file to be processed by one server only, in order to avoid resource competition.
One exemplary strategy is to apply a modulus operation to Sequence or SubSequence in the filename using the following formula: (Sequence or SubSequence) % (number of servers available for allocation) + 1, where "%" represents the modulus operator, and "+1" (the addition of one) ensures that the computed result will not be zero.
For example, assume that the number of available servers is three, and the filename is 20070429160001_00x_order.txt, with 00x representing Sequence. Using the above formula, x % 3 + 1 is computed first. If the result is one, the corresponding file is allocated to be processed by server1. If the result is two, the file is allocated to be processed by server2. If the result is three, the file is allocated to be processed by server3.
By default, the same determination rule for allocation may be used for both processing a small file and for dividing a source file. For example, the allocation for processing the small file and the allocation for dividing the source file may both be based on Sequence in the corresponding filename. However, in order to ensure further evenness among servers, determination of allocation may preferably be based on SubSequence for processing a small file.
It is appreciated that any other suitable formula based on Sequence or SubSequence in a filename may be used for allocating the servers evenly. The above formula which uses modulus operation is only meant to be an example for illustrative purpose.
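A minimal sketch of this modulus-based allocation, under the assumption that the available servers are numbered consecutively starting from 1:

```python
# (Sequence or SubSequence) % (number of available servers) + 1, so that the
# computed result is never zero.

def allocate_server(sequence_number, server_count):
    """Return the 1-based index of the server that should handle this file."""
    return sequence_number % server_count + 1

def is_mine(sequence_number, server_count, my_index):
    """Each server checks whether the file falls to it; exactly one server answers yes."""
    return allocate_server(sequence_number, server_count) == my_index

# With three servers, as in the example above:
#   allocate_server(1, 3) == 2   -> server2
#   allocate_server(2, 3) == 3   -> server3
#   allocate_server(3, 3) == 1   -> server1
```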
The foregoing strategy may achieve relatively even allocation of servers. However, because the number of files is not always a multiple of the number of available servers (e.g., three in the present example), server1 and server2 may process a few more files than server3. Furthermore, a file having a relatively small data volume may have only one divided small file, and thus will always be processed by server1 according to this rule. Therefore, this strategy is suitable for situations where the number of files to be processed is large. The larger the number of files, the greater the evenness among servers will be. However, if a new server is added, the number of available servers changes, causing a different server allocation; as a result, all the servers are required to be re-deployed.
Another strategy configures, for each server, a DataType that can be processed and is only allowed to be processed by that server. According to this strategy, the system provides a configuration table, with configuration items showing, for each server, a DataType that can be processed and is only allowed to be processed by that server. For example, server1, server2 and server3 may be configured to process order data, user data and sample data, respectively. In setting these configuration items, it may be required to guarantee that no inter-server conflicts exist in the configuration. If a conflict exists, a warning about the configuration error is given.
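The configuration-table strategy can be sketched as a simple mapping that is checked for inter-server conflicts before use; the table contents below merely mirror the server1/server2/server3 example in the text, and the validation raises an error where the disclosure speaks of giving a warning.

```python
# DataType-based allocation: each server is configured with the data type(s) it
# alone is allowed to process.
DATA_TYPE_TABLE = {
    "server1": {"order"},
    "server2": {"user"},
    "server3": {"sample"},
}

def check_no_conflict(table):
    """Reject a configuration in which two servers claim the same DataType."""
    seen = {}
    for server, types in table.items():
        for data_type in types:
            if data_type in seen:
                raise ValueError(
                    f"configuration error: {data_type} assigned to both "
                    f"{seen[data_type]} and {server}")
            seen[data_type] = server

def server_for_data_type(data_type, table=DATA_TYPE_TABLE):
    for server, types in table.items():
        if data_type in types:
            return server
    return None      # an unknown DataType is simply not processed
```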
Either of the above-described concurrency strategies can schedule multiple servers to divide source files and to process small files at the same time. Depending on specific application requirements, either strategy may be selected in practical applications. Because the two strategies are not mutually exclusive, they may preferably be combined. For example, multiple servers may be assigned to process files having "order" as DataType (i.e., files containing order data), while server3 is allocated to process user data only and server2 is allocated to process sample data only. This helps to maximize the evenness of the processing among the servers and to balance the servers' activity levels.
In the processes of dividing the source file and processing the small files as shown in FIG. 1, if the first strategy above is used, the filename of the source file is parsed to obtain Sequence, while the filename of the small file is parsed to obtain Sequence or SubSequence. If the second strategy is used, DataType is obtained from the filenames. If a combination of the two strategies is used, Sequence or SubSequence and DataType may be obtained at the same time. No matter which concurrency strategy is adopted, the server used for dividing or processing a file is determined based on the filename of that file, and each server processes only the files assigned to it. Preferably, after a server divides a source file, the corresponding post-division small files are stored on a disk. When a small file needs to be processed, the small file is obtained from the disk. If a small file obtained by dividing the source file has not been completely written to the disk, that small file is not allowed to be processed. This precaution is needed because, if the small file were processed in this circumstance, the content read by another server (a small file processing server) might not be complete.
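As a hedged illustration of how a server might decide, from a filename alone, whether a file belongs to it (the underscore-delimited layout and field positions are assumed from the earlier example, not a definitive naming scheme):

# Illustrative dispatch check combining both strategies. The filename
# layout "DateTime_Sequence_DataType.txt" (with SubSequence in place of
# Sequence for small files) is an assumption for this sketch.

def parse_fields(filename):
    stem = filename.rsplit(".", 1)[0]
    return stem.split("_")

def is_mine(filename, my_index, server_count, my_datatypes=None):
    fields = parse_fields(filename)
    sequence = int(fields[1])                     # Sequence or SubSequence
    if sequence % server_count + 1 != my_index:   # first strategy
        return False
    if my_datatypes is not None:                  # second strategy (optional)
        return fields[-1] in my_datatypes
    return True

print(is_mine("20070429160001_003_order.txt", 1, 3, {"order"}))   # True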
Preferably, if an error occurs during division or processing, the corresponding erroneous part is redone. Moreover, if many files are waiting to be divided or processed, the files may be processed in chronological order.
All in all, this method of distributed and concurrent processing by multiple servers may greatly improve the processing power of a system, particularly when large files are processed, and can maximally balance the activity levels of the servers. Moreover, this method offers good scalability. If files become bigger or the number of files increases, servers may be added to satisfy the demands. Specifically, the system can be expanded linearly, without having to purchase more advanced servers or to re-deploy the servers that have been operating previously.
In order to reduce the disk I/O pressure caused by file scanning, the present disclosure preferably places files to be divided and files to be processed under different directories, and caches the files in their respective directories. After a file in the cache is processed, another file is read. The processing order follows a natural ordering of the filenames (i.e., the ascending order of DateTime in the filenames, with files having earlier DateTime being processed first). An exemplary directory structure for file processing may look like the following:
/root                  // root directory
  /source_file         // directory for source files to be divided
  /tmp_divide          // temporary directory used during file division
    /hostnameA         // hostname of a server performing the current division
    /hostnameB
    ...                // these directory names are dynamically allocated according to the number of servers
  /source_smallfile    // storing small files obtained from dividing the source file
  /tmp_execute         // temporary directory used while processing a certain small file according to business logic
    /hostnameA         // hostname of a server performing the current processing
    /hostnameB
    ...                // these directory names are dynamically allocated according to the number of servers
  /bak_sucs_file       // backup of source files that have been successfully divided
    /20070212          // current date
      /hour            // hour of the current date, e.g., 10, 20
    /20070213
      /hour
    ...                // dynamically added according to the current date
  /bak_sucs_small      // backup of small files that have been successfully processed
    /20070212          // current date
      /hour            // hour of the current date, e.g., 10, 20
    /20070213
      /hour
    ...                // dynamically added according to the current date
  /bak_error           // backup of small files having records that partially failed to be processed
    /20070212          // current date
      /hour            // hour of the current date, e.g., 10, 20
    /20070213
      /hour
    ...                // dynamically added according to the current date
  /error_divide        // backup of source files for which an error occurred while dividing into small files
    /20070212          // current date
      /hour            // hour of the current date, e.g., 10, 20
    /20070213
      /hour
    ...                // dynamically added according to the current date
  /error_execute       // backup of small files whose processing failed after exceeding five retries
    /20070212          // current date
      /hour            // hour of the current date, e.g., 10, 20
    /20070213
      /hour
    ...                // dynamically added according to the current date
The above directory structure is used in these exemplary embodiments for illustrative purposes only. The directory structure may be self-adjusted. For example, whether the /hour directory is needed depends on the number of small files the system obtains from dividing the source file. Because an operating system has restrictions on the number of files, the size of the directory tree, and the maximum number of files under a directory, system performance may be severely affected if the number of files under one directory is too large. In practice, the process of creating the above directory structure is synchronized with the processes of dividing the source file and processing the small files shown in FIG. 1, and may reflect the process shown in FIG. 1 through the data flow of files across directories. Such data flow primarily consists of flows in file division and in file processing, described in detail as follows.
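As a minimal sketch of how the date- and hour-level backup subdirectories described above might be created on the fly (the directory names follow the illustrative tree; the helper below is an assumption, not part of the disclosed system):

import os
from datetime import datetime

def backup_dir(root, kind, when=None):
    # kind is one of the backup directories in the tree above,
    # e.g. "bak_sucs_file", "bak_sucs_small", "bak_error".
    when = when or datetime.now()
    path = os.path.join(root, kind, when.strftime("%Y%m%d"), when.strftime("%H"))
    os.makedirs(path, exist_ok=True)   # created lazily, in step with division/processing
    return path

print(backup_dir("/root", "bak_sucs_file", datetime(2007, 2, 12, 10)))
# -> /root/bak_sucs_file/20070212/10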
FIG. 2 shows a flow chart illustrating an exemplary process of dividing a large source file within the directory structure. The process is described below with reference to the directory structure described above.
At S201, a source file transmitted from a sender is placed under the directory /source_file.
At S202, the source file which is under the directory /source_file and waiting to be divided is allocated a server based on its filename. In order to avoid having the same source file divided by multiple threads, the file is further placed under the corresponding directory /tmp_divide/hostname, which is preferably a directory of the allocated server. This process is referred to as renaming.
At S203, the server divides the source file which is waiting to be divided. If the source file is successfully divided, the small files obtained from the division are saved under the directory /source_smallfile, and the source file which has been divided successfully is backed up under the directory /bak_sucs_file.

At S204, if the above dividing process fails, a retry to divide is attempted. If the retry is successful, the process returns to S203: the small files obtained from the division are saved under the directory /source_smallfile, and the successfully divided source file is backed up under the directory /bak_sucs_file. If the retry also fails, the source file is backed up under the directory /error_divide.
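A hedged sketch of this S201-S204 data flow, expressed as directory moves (the splitting routine and path layout are placeholders assumed for illustration; the date/hour backup subdirectories are omitted for brevity):

import os, shutil

ROOT = "/root"   # root of the directory tree shown earlier

def divide_source_file(filename, hostname, split_fn):
    # split_fn is a placeholder that yields (small_filename, bytes) pairs.
    src = os.path.join(ROOT, "source_file", filename)
    work = os.path.join(ROOT, "tmp_divide", hostname, filename)
    os.rename(src, work)                    # S202: "renaming" claims the file for this server
    for attempt in range(2):                # S203 plus at most one retry (S204)
        try:
            for small_name, data in split_fn(work):
                with open(os.path.join(ROOT, "source_smallfile", small_name), "wb") as f:
                    f.write(data)
            shutil.move(work, os.path.join(ROOT, "bak_sucs_file", filename))
            return True
        except Exception:
            if attempt == 1:                # the retry also failed
                shutil.move(work, os.path.join(ROOT, "error_divide", filename))
    return False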
FIG. 3 shows a flow chart illustrating an exemplary process of processing a post-division small file within the directory structure. The process is described with reference to the directory structure described above.
At S301, a small file obtained from dividing the source file is placed under the directory /source_smallfile.
At S302, the small file which is under the directory /source_smallfile and waiting to be processed is allocated a server based on its filename. In order to avoid having the same small file processed by multiple threads, the small file is further placed under the corresponding directory /tmp_execute/hostname, which is preferably a directory of the allocated server. This process is called renaming.
At S303, the allocated server logically processes the small file that is waiting to be processed. If successful, the processed small file is saved under the directory /bak_sucs_small.
At S304, if an error occurs in processing, a retry is attempted. The number of allowable retries may be set to be five, for example.
The small file which is waiting for a retry is backed up under the directory /bak_error. When a retry is attempted, the process returns to S301. Specifically, the small file which has failed to be processed at S303 is placed back under the directory /source_smallfile and processed again by the originally allocated server. If the number of unsuccessful retries exceeds the maximum number of retries allowed (e.g., five), the small file is backed up under the directory /error_execute. In the exemplary embodiments, the number of retries to divide a source file and the number of retries to process a small file are not the same. If an error occurs in dividing a large source file, a single retry to divide is made. If the retry still fails, the source file is transferred directly to the "error directory" (i.e., /error_divide), an error log is written, and a warning is provided. However, a more flexible retry mechanism is used to handle errors occurring in the business logic processing of a small file. Because the error rate for dividing a source file is very small, while failures in the subsequent business logic processing occur more frequently, multiple retries are allowed for processing small files. Because multiple retries may be attempted for a small file, a list of retry intervals is configured. The number of allowable retries and the interval to wait before making the next retry may be set manually. For example, the number of allowable retries may be set to five. If five unsuccessful retries have been made, the small file will not be processed further, but is transferred to the "error directory" (i.e., /error_execute), an error log is written, and a warning is provided. Because retries are scheduled according to these intervals, when a server obtains a specific file for processing, the server needs to determine whether the SubDateTime of that file is earlier than the current time. If yes, the server processes the file; otherwise, the file is not processed.
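A hedged sketch of this retry scheduling (it assumes, for illustration, that the next retry time is carried as SubDateTime in the small file's name; the interval values and the exact encoding are assumptions, configured manually in practice):

from datetime import datetime, timedelta

# Illustrative retry-interval list: wait 1, 5, 15, 30, then 60 minutes
# before retries 1..5.
RETRY_INTERVALS = [1, 5, 15, 30, 60]
MAX_RETRIES = len(RETRY_INTERVALS)

def next_sub_datetime(retry_count, now=None):
    # Returns the SubDateTime string after which the next retry may run,
    # or None when the file should go to the error directory instead.
    if retry_count >= MAX_RETRIES:
        return None
    now = now or datetime.now()
    due = now + timedelta(minutes=RETRY_INTERVALS[retry_count])
    return due.strftime("%Y%m%d%H%M%S")

def ready_to_process(sub_datetime, now=None):
    # A server only processes a small file whose SubDateTime is earlier than now.
    now = now or datetime.now()
    return sub_datetime <= now.strftime("%Y%m%d%H%M%S")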
As described above, if a small file obtained from dividing the source file by a certain server has not been completely written to a disk, that small file is not allowed to be processed. In order to avoid processing a small file which has not been completely written to the disk, the directory structure is used. The small file is first written to the temporary directory /source_smallfile. Upon successful writing, the small file is renamed and transferred to another directory /tmp_execute/hostname. Because the renaming is atomic and only a file pointer needs to be modified, the integrity of the data is guaranteed.
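A minimal sketch of this write-then-rename pattern (paths follow the illustrative tree; the rename being atomic assumes both paths reside on the same filesystem):

import os

def publish_small_file(data, small_name, hostname, root="/root"):
    # Write the complete content first, then hand the file to the
    # processing server's directory with a single rename, so a reader
    # never sees a partially written file.
    staging = os.path.join(root, "source_smallfile", small_name)
    target = os.path.join(root, "tmp_execute", hostname, small_name)
    with open(staging, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())      # make sure the bytes are on disk
    os.rename(staging, target)    # atomic on a single filesystem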
The present disclosure further provides an exemplary system for large volume data processing. FIG. 4 shows a logical structural diagram of an exemplary system for large volume data processing in accordance with the present disclosure.
As shown in FIG. 4, system 400 adopts a distributed structure including multiple servers 410, with each server having the abilities of dividing source files 420 and processing small files 430 obtained from dividing the source files 420. In the context of an example of processing data transferred from a sender to a recipient within a scheduled time, the system 400 is deployed on the recipient side and is referred to as recipient system 400 below. The multiple servers 410 of the recipient system 400 may execute the same procedures concurrently.
An exemplary process in which multiple servers 410 of the recipient system 400 concurrently process a large source file FileA sent from a sender is described as follows.
A large file FileA (420) is first placed under the directory /source_file. Each server 410 constantly scans the large files 420 which are sent from the sender and placed under this directory. Based on the foregoing concurrency strategies (not repeated here), each server serverA (410) determines, based on a file's filename, whether that file 420 is to be divided by the present server. Upon obtaining a file FileA that needs to be processed by the present server, the server serverA divides the file into N smaller files FileY. The value of N depends on practical situations and is primarily determined by the number of servers 410 available for processing the small files. According to the concurrency strategy in use, the multiple servers serverX further determine whether any of the small files FileY obtained from dividing the source file are to be processed by them. Upon obtaining a small file FileY which is to be processed by server serverB, for example, the server serverB performs subsequent logical processing of that small file FileY.
In the above process, in order to prevent resource competition, it is preferred to ensure that only one server is allowed to divide the source file FileA, and that only one server is allowed to process each of the small files FileY obtained from dividing the source file FileA. Moreover, the activity levels of the servers are balanced to the maximum extent to avoid situations in which certain servers are too busy while other servers are idling.
This can be accomplished by the concurrency strategies described in this disclosure. Furthermore, the disclosed system is very flexible, has good scalability, and can support independent configuration of each server for processing certain types of files. In some embodiments, therefore, when a new server is added, the previously operating servers are not required to be reconfigured or deployed again.
Preferably, in order to ensure read integrity of the small files, after a source file is divided, the small files obtained from dividing the source file are saved to a disk. Furthermore, if an error occurs in the file division or file processing, the erroneous part is processed again. If many files are waiting to be divided or processed, the files will be divided or processed in chronological order.
FIG. 5 shows a structural diagram of an exemplary internal logic of each server in the system 400. Server 510 includes a pre-processing unit U501, a dividing unit U502 and a processing unit U503. The pre-processing unit U501 scans a directory of a disk on a regular basis to determine whether a file within the directory is to be processed by the server 510. Specifically, the pre-processing unit U501 determines whether a source file is to be divided by the present server 510. The determination is based on the filename of the source file, which is assigned according to a source file naming scheme. If the answer is affirmative, the pre-processing unit U501 triggers the dividing unit U502 to divide the source file. The pre-processing unit U501 may further determine whether a small file is to be processed by the server 510. This determination is based on the filename of the small file, which is assigned according to a post-division small file naming scheme. If yes, the pre-processing unit U501 triggers the processing unit U503 to process the small file. The dividing unit U502 is used for dividing the source file into small files. The processing unit U503 is used for performing logical processing of the small file which, according to the above determination, is to be processed by the server 510.
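As a rough structural sketch of these units (the class and method names are illustrative only; the actual embodiment is a server containing the units U501-U503 described above):

class Server:
    # Sketch of the internal units U501-U503 described above.
    def __init__(self, name, dividing_unit, processing_unit):
        self.name = name
        self.dividing_unit = dividing_unit        # U502
        self.processing_unit = processing_unit    # U503

    def pre_process(self, directory_listing):
        # U501: scan the directory regularly and trigger U502/U503
        # for the files this server is responsible for.
        for filename in directory_listing:
            if self.is_source_file(filename) and self.should_divide(filename):
                self.dividing_unit.divide(filename)
            elif self.is_small_file(filename) and self.should_process(filename):
                self.processing_unit.process(filename)

    # The predicates below would apply the naming schemes and the
    # concurrency strategies (Sequence/SubSequence modulus, or DataType).
    def is_source_file(self, filename): ...
    def is_small_file(self, filename): ...
    def should_divide(self, filename): ...
    def should_process(self, filename): ...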
According to the concurrency strategies described above, if the first strategy is used, the pre-processing unit U501 determines whether a source file is to be processed by the server 510 based on a source file sequence number in the source file's filename, and determines whether a small file is to be processed by the server 510 based on a corresponding source file sequence number or a small file sequence number in the small file's filename. If the second strategy is adopted, the pre-processing unit U501 determines whether a source file is to be divided by the server 510 based on the type of data stored in the source file. The server 510 further includes a configuration unit U504 which is used for configuring the data type(s) that can be processed by the server 510. During the determination, the pre-processing unit U501 determines whether the data type of the data stored in the source file is a type that can be processed by the server 510.
Preferably, in order to guarantee read integrity of the small files obtained from dividing the source file, the server 510 may further include a storage unit U505 which is used for saving the small files to a disk. Moreover, in order to reduce the disk I/O pressure caused by file scanning, the storage unit U505 adopts a directory structure, and places files that are waiting to be divided and files that are waiting to be processed under different directories.
Preferably, the server may further include a retry unit U506, which is used for retrying to divide a source file or to process a small file upon operation failure. Based on application needs, a single retry is attempted to divide a source file, while multiple retries may be attempted to process a small file.
In conclusion, the system 400 supports concurrent executions by multiple servers 410 (510) for improving the processing power of the system 400. The system 400 described also has very good scalability.
Any details of the system 400 left out in FIG. 4 and FIG. 5 can be found in related sections of the method disclosed in FIG. 1, FIG. 2 and FIG. 3, and therefore are not repeated here.
It is appreciated that configurations alternative to the above recipient system 400 may also be used. For example, the servers 410 may be divided into two groups, one placed on the sender's side and the other on the recipient's side. The source file is divided by the first group of servers on the sender's side, and the resultant post-division small files are sent to the recipient's side to be processed by the second group of servers. Alternatively, the system may not involve a sender and a recipient, but may have only one side which contains multiple servers to conduct file division and file processing.
The method and the system for large volume data processing in the present disclosure have been described in detail above. Exemplary embodiments have been employed herein to illustrate the concept and implementation of the present disclosure.
It is appreciated that the potential benefits and advantages discussed herein are not to be construed as a limitation or restriction to the scope of the appended claims.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described.
Rather, the specific features and acts are disclosed as exemplary forms of implementing the claims.

Claims

1. A method for large volume data processing, the method comprising: receiving a source file named according to a source file naming scheme; allocating a source file dividing server according to the filename of the source file to divide the source file into a plurality of small files named according to a small file naming scheme; and allocating a plurality of small file processing servers according to the filenames of the small files to process the plurality of small files, each allocated small file processing server processing one or more respective small files.
2. The method as recited in claim 1, wherein allocating the source file dividing server according to the filename of the source file comprises: parsing the filename of the source file and obtaining a source file sequence number; computing ( ( (the source file sequence number) % (total number of servers available for allocation) ) + 1 ), wherein % represents a modulus operator; and allocating the source file dividing server according to the computed result.
3. The method as recited in claim 1, wherein allocating the plurality of small file processing servers according to the filenames of the small files comprises: parsing the filename of each small file and obtaining a small file sequence number; computing ( ( (the small file sequence number) % (total number of servers available for allocation) ) + 1 ), wherein % represents a modulus operator; and allocating one of the plurality of small file processing servers to process the small file according to the computed result.
4. The method as recited in claim 1, wherein allocating the source file dividing server according to the filename of the source file comprises: for each available server, configuring a data type to be processed by the server; parsing the filename of the source file, and obtaining the data type of the source file; and allocating to the source file one of the available servers configured to process the data type.
5. The method as recited in claim 1, wherein allocating the plurality of small file processing servers according to the filenames of the small files comprises: for each available server, configuring a data type to be processed by the server; for each small file, parsing the filename of the small file and obtaining the data type of the small file; and allocating to the small file one of the available servers configured to process the data type.
6. The method as recited in claim 1, wherein after dividing the source file into the plurality of small files, the method further comprises: saving the plurality of small files into a disk.
7. The method as recited in claim 1, further comprising: allowing the source file dividing server to retry to divide the source file upon failure; and allowing the plurality of small file processing servers to retry to process the respective allocated small files upon failure.
8. The method as recited in claim 7, wherein only a single retry is allowed to divide the source file, and multiple retries are allowed to process the allocated small files.
9. The method as recited in claim 1, further comprising: placing the source file waiting to be divided and small files waiting to be processed under different directories.
10. The method as recited in claim 9, wherein data flow under a directory of the source files waiting to be divided comprises: placing the source file into a directory for to-be-divided source files; after allocating the source file dividing server, placing the source file waiting to be divided into a temporary directory for file division; dividing the source file; and backing up the source file into a directory storing successfully divided source files if the source file has been divided successfully, and saving the small files thus obtained into a directory storing post-division small files, or backing up the source file into a directory storing source files failed to be divided if the source file has failed to be divided after a retry.
11. The method as recited in claim 9, wherein data flow under a directory of the small files waiting to be processed comprises: saving the small files into a directory storing post-division small files; after allocating the small file processing servers, placing the small files that are waiting to be processed into a temporary directory for small file processing; processing the small files in the temporary directory; and backing up one or more of the small files into a directory storing successfully processed small files if the one or more small files have been processed successfully, backing up one or more of the small files into a directory storing small files having partially unsuccessfully processed records if the one or more of the small files need to be re-processed, and backing up one or more of the small files into a directory storing small files failed to be processed upon retries if the one or more of the small files have failed to be processed upon retries.
12. A system for large volume data processing, wherein the system comprises multiple servers, each server comprising: a pre-processing unit used for determining whether a source file waiting to be divided is to be processed by the server based on a source file naming scheme and triggering a dividing unit if affirmative, and for determining whether a small file waiting to be processed is to be processed by the server based on a post-division small file naming scheme and triggering a processing unit if affirmative; said dividing unit used for dividing the source file into small files; and said processing unit used for performing logical processing for the small file.
13. The system as recited in claim 12, wherein the pre-processing unit determines whether the source file waiting to be divided is to be processed by the server based on a source file sequence number in the filename of the source file, and determines whether the small file waiting to be processed is to be processed by the server based on a source file sequence number in the filename of the small file or a small file sequence number in the filename of the small file.
14. The system as recited in claim 12, wherein the pre-processing unit determines whether the source file waiting to be divided is to be processed by the server based on a type of data stored in the source file, and each server further comprising: a configuration unit used for configuring data type(s) that can be processed by the server.
15. The system as recited in claim 12, wherein each server further comprising: a storage unit used for saving the small files into a disk.
16. The system as recited in claim 15, wherein the storage unit adopts a directory structure, and places the source file to be divided and the small files waiting to be processed under different directories.
17. The system as recited in claim 12, wherein each server further comprising: a retry unit used for retrying to divide the source file upon failure and/or to process the small files upon failure, wherein a single retry is allowed to divide while multiple retries are allowed to process.
18. A method for large volume data processing, the method comprising: assigning a filename to a source file according to a source file naming scheme; allocating a source file dividing server according to the filename of the source file to divide the source file into a plurality of small files; assigning filenames to the plurality of small files according to a small file naming scheme; allocating a plurality of small file processing servers according to the filenames of the small files to distributedly process the plurality of small files; and processing the plurality of small files by the allocated small file processing servers.
PCT/US2009/044127 2008-05-15 2009-05-15 Method and system for large volume data processing WO2009140590A1 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
US12/601,606 US8229982B2 (en) 2008-05-15 2009-05-15 Method and system for large volume data processing
EP20090747671 EP2283428A4 (en) 2008-05-15 2009-05-15 Method and system for large volume data processing
JP2011509735A JP5438100B2 (en) 2008-05-15 2009-05-15 Mass data processing method and system

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN200810097594.7 2008-05-15
CN200810097594.7A CN101582064B (en) 2008-05-15 2008-05-15 Method and system for processing enormous data

Publications (1)

Publication Number Publication Date
WO2009140590A1 true WO2009140590A1 (en) 2009-11-19

Family

ID=41319073

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2009/044127 WO2009140590A1 (en) 2008-05-15 2009-05-15 Method and system for large volume data processing

Country Status (6)

Country Link
US (1) US8229982B2 (en)
EP (1) EP2283428A4 (en)
JP (1) JP5438100B2 (en)
CN (1) CN101582064B (en)
HK (1) HK1137250A1 (en)
WO (1) WO2009140590A1 (en)

Families Citing this family (39)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102214184B (en) * 2010-04-07 2013-08-14 腾讯科技(深圳)有限公司 Intermediate file processing device and intermediate file processing method of distributed computing system
CN101976269B (en) * 2010-11-26 2012-12-05 山东中创软件工程股份有限公司 File scheduling method and system thereof
CN102650956B (en) * 2011-02-23 2014-08-27 蓝盾信息安全技术股份有限公司 Program concurrent method and system
CN102655471B (en) * 2011-03-03 2015-10-07 腾讯科技(深圳)有限公司 Method for routing and system
US9043914B2 (en) * 2012-08-22 2015-05-26 International Business Machines Corporation File scanning
US9460107B2 (en) * 2013-07-16 2016-10-04 International Business Machines Corporation Filename-based inference of repository actions
CN103605664B (en) * 2013-10-22 2017-01-18 芜湖大学科技园发展有限公司 Massive dynamic data fast query method meeting different time granularity requirements
CN103647790B (en) * 2013-12-24 2017-01-25 常州工学院 Extra-large file protocol analytical and statistical method
US20150186370A1 (en) * 2013-12-27 2015-07-02 A4 Data, Inc. System and method for updating files through differential compression
US9277067B2 (en) * 2014-01-24 2016-03-01 Ricoh Company, Ltd. System, apparatus and method for enhancing scan functionality
CN104049917A (en) * 2014-06-25 2014-09-17 北京思特奇信息技术股份有限公司 Data processing method and system
CN105824745B (en) * 2015-01-04 2019-03-01 中国移动通信集团湖南有限公司 A kind of gray scale dissemination method and device
CN104615736B (en) * 2015-02-10 2017-10-27 上海创景计算机系统有限公司 Big data fast resolving storage method based on database
CN106469152A (en) * 2015-08-14 2017-03-01 阿里巴巴集团控股有限公司 A kind of document handling method based on ETL and system
CN105205174B (en) * 2015-10-14 2019-10-11 北京百度网讯科技有限公司 Document handling method and device for distributed system
CN105228131B (en) * 2015-11-05 2018-07-31 上海斐讯数据通信技术有限公司 Assist process method, system and the terminal device of operational data
US10149002B1 (en) * 2016-03-21 2018-12-04 Tribune Broadcasting Company, Llc Systems and methods for retrieving content files
CN105869048A (en) * 2016-03-28 2016-08-17 中国建设银行股份有限公司 Data processing method and system
CN107346312A (en) * 2016-05-05 2017-11-14 中国移动通信集团内蒙古有限公司 A kind of big data processing method and system
CN106446254A (en) * 2016-10-14 2017-02-22 北京百度网讯科技有限公司 File detection method and device
US10387207B2 (en) * 2016-12-06 2019-08-20 International Business Machines Corporation Data processing
CN106777180B (en) * 2016-12-22 2020-09-01 北京京东金融科技控股有限公司 Method, device and system for high-performance distributed data conversion
KR101914347B1 (en) * 2016-12-23 2018-11-01 부산대학교 산학협력단 Method for replaying large event log, and large event log replaying system
CN106980669B (en) * 2017-03-23 2019-07-02 珠海格力电器股份有限公司 A kind of storage of data, acquisition methods and device
CN108959292B (en) * 2017-05-19 2021-03-30 北京京东尚科信息技术有限公司 Data uploading method, system and computer readable storage medium
CN107273542B (en) * 2017-07-06 2020-11-27 华泰证券股份有限公司 High-concurrency data synchronization method and system
CN107590054B (en) * 2017-09-21 2020-11-03 大连君方科技有限公司 Ship server log monitoring system
CN107908737B (en) * 2017-11-15 2022-08-19 中国银行股份有限公司 File splitting control method and device
CN108509478B (en) * 2017-11-23 2021-04-27 平安科技(深圳)有限公司 Splitting and calling method of rule engine file, electronic device and storage medium
CN108304554B (en) * 2018-02-02 2020-07-28 平安证券股份有限公司 File splitting method and device, computer equipment and storage medium
CN109271361B (en) * 2018-08-13 2020-07-24 华东计算技术研究所(中国电子科技集团公司第三十二研究所) Distributed storage method and system for massive small files
US11372812B2 (en) * 2018-10-08 2022-06-28 Silicon Motion, Inc. Mobile device and method capable of earlier determining that a number of files in a directory of an external connected storage device is about to full
CN109343962A (en) * 2018-10-26 2019-02-15 北京知道创宇信息技术有限公司 Data processing method, device and distribution service
CN111258748B (en) * 2018-12-03 2023-09-05 中国移动通信集团上海有限公司 Distributed file system and control method
CN109960630B (en) * 2019-03-18 2020-09-29 四川长虹电器股份有限公司 Method for rapidly extracting logs from large-batch compressed files
CN110109881B (en) * 2019-05-15 2021-07-30 恒生电子股份有限公司 File splitting method and device, electronic equipment and storage medium
CN111142791A (en) * 2019-12-12 2020-05-12 江苏苏宁物流有限公司 Data migration method and device
CN111190868A (en) * 2020-01-02 2020-05-22 中国建设银行股份有限公司 File processing method and device
CN111901223A (en) * 2020-07-21 2020-11-06 湖南中斯信息科技有限公司 Method and device for splitting document, storage medium and processor

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6061733A (en) * 1997-10-16 2000-05-09 International Business Machines Corp. Method and apparatus for improving internet download integrity via client/server dynamic file sizes
US20040088380A1 (en) 2002-03-12 2004-05-06 Chung Randall M. Splitting and redundant storage on multiple servers
US20040268068A1 (en) * 2003-06-24 2004-12-30 International Business Machines Corporation Efficient method for copying and creating block-level incremental backups of large files and sparse files
US20060167838A1 (en) * 2005-01-21 2006-07-27 Z-Force Communications, Inc. File-based hybrid file storage scheme supporting multiple file switches
US20060277434A1 (en) * 2005-06-03 2006-12-07 Tsern Ely K Memory system with error detection and retry modes of operation
US20070136540A1 (en) * 2005-12-08 2007-06-14 Matlock Clarence B Jr Restore accelerator for serial media backup systems

Family Cites Families (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2078315A1 (en) * 1991-09-20 1993-03-21 Christopher L. Reeve Parallel processing apparatus and method for utilizing tiling
JPH11312150A (en) * 1998-04-30 1999-11-09 Nippon Telegr & Teleph Corp <Ntt> Parallel processing method, its system and recording medium with parallel processing program recorded therein
KR100532274B1 (en) 1999-09-08 2005-11-29 삼성전자주식회사 Apparatus for transfering long message in portable terminal and method therefor
KR20010049041A (en) 1999-11-30 2001-06-15 윤종용 Method for transmitting and receiving multimedia data using short message service in portable radio telephone
CN100431320C (en) 2000-10-26 2008-11-05 普里斯梅迪亚网络有限公司 Method and appts. for real-time parallel delivery of segments of large payload file
US6912543B2 (en) * 2000-11-14 2005-06-28 International Business Machines Corporation Object-oriented method and system for transferring a file system
JP2004348338A (en) * 2003-05-21 2004-12-09 Ntt Data Corp Data division processor, data division processing method, and data division processing program
KR100619812B1 (en) 2003-09-06 2006-09-08 엘지전자 주식회사 A method and a apparatus of transmitting multimedia signal with divide for mobile phone
US7596782B2 (en) * 2003-10-24 2009-09-29 Microsoft Corporation Software build extensibility
KR100574960B1 (en) 2003-11-25 2006-05-02 삼성전자주식회사 The dividing method for payload intra-frame
CN1280761C (en) * 2004-05-18 2006-10-18 中兴通讯股份有限公司 Method for realizing relation-database automatic upgrading in communication apparatus
JP4249110B2 (en) * 2004-09-30 2009-04-02 富士通株式会社 File mediation program, file mediation device, and file mediation method
CN100411341C (en) 2005-08-10 2008-08-13 华为技术有限公司 Parallel downloading method and terminal
CN101086732A (en) * 2006-06-11 2007-12-12 上海全成通信技术有限公司 A high magnitude of data management method
US9521186B2 (en) 2007-09-13 2016-12-13 International Business Machines Corporation Method and system for file transfer over a messaging infrastructure
US8341285B2 (en) 2007-10-22 2012-12-25 Hewlett-Packard Development Company, L.P. Method and system for transferring files

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
GRAMA A ET AL.: "INTRODUCTION TO PARALLEL COMPUTING", 1 January 2003, PRENTICE HALL, article "INTRODUCTION TO PARALLEL COMPUTING, PRINCIPLES OF PARALLEL ALGORITHM DESIGN", pages: 85 - 147
See also references of EP2283428A4 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPWO2011099099A1 (en) * 2010-02-10 2013-06-13 日本電気株式会社 Storage device
JP5533888B2 (en) * 2010-02-10 2014-06-25 日本電気株式会社 Storage device
US9021230B2 (en) 2010-02-10 2015-04-28 Nec Corporation Storage device
KR20160069351A (en) * 2014-12-08 2016-06-16 에스케이텔레콤 주식회사 Apparatus for Counting the Number of Large-Scale Data by Taking Account of Data Distribution and Computer-Readable Recording Medium with Program therefor
KR102229311B1 (en) 2014-12-08 2021-03-17 에스케이텔레콤 주식회사 Apparatus for Counting the Number of Large-Scale Data by Taking Account of Data Distribution and Computer-Readable Recording Medium with Program therefor

Also Published As

Publication number Publication date
JP2011523738A (en) 2011-08-18
US8229982B2 (en) 2012-07-24
HK1137250A1 (en) 2010-07-23
JP5438100B2 (en) 2014-03-12
US20110072058A1 (en) 2011-03-24
EP2283428A4 (en) 2012-10-31
EP2283428A1 (en) 2011-02-16
CN101582064A (en) 2009-11-18
CN101582064B (en) 2011-12-21

Similar Documents

Publication Publication Date Title
US8229982B2 (en) Method and system for large volume data processing
US11226847B2 (en) Implementing an application manifest in a node-specific manner using an intent-based orchestrator
RU2429529C2 (en) Dynamic configuration, allocation and deployment of computer systems
US9110600B1 (en) Triggered data shelving to a different storage system and storage deallocation
CN109344000B (en) Block chain network service platform, recovery tool, fault processing method thereof and storage medium
EP2751662B1 (en) Method for an efficient application disaster recovery
US20190391880A1 (en) Application backup and management
US20160275123A1 (en) Pipeline execution of multiple map-reduce jobs
US20140136779A1 (en) Method and Apparatus for Achieving Optimal Resource Allocation Dynamically in a Distributed Computing Environment
US11860741B2 (en) Continuous data protection
US11126505B1 (en) Past-state backup generator and interface for database systems
JP2012523043A5 (en)
CN111352717A (en) Method for realizing kubernets self-defined scheduler
WO2017040439A1 (en) Target-driven tenant identity synchronization
US9164849B2 (en) Backup jobs scheduling optimization
CN110648178A (en) Method for increasing kafka consumption capacity
CN113111129A (en) Data synchronization method, device, equipment and storage medium
JP5619179B2 (en) Computer system, job execution management method, and program
CN114722119A (en) Data synchronization method and system
US7313786B2 (en) Grid-enabled ANT compatible with both stand-alone and grid-based computing systems
JPH05257908A (en) Highly reliable distributed processing system
US11042454B1 (en) Restoration of a data source
CN105760215A (en) Map-reduce model based job running method for distributed file system
Baranowski et al. Evolution of the hadoop platform and ecosystem for high energy physics
CN115495265A (en) Method for improving kafka consumption capacity based on hadoop

Legal Events

Date Code Title Description
WWE   Wipo information: entry into national phase (Ref document number: 12601606; Country of ref document: US)
121   Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 09747671; Country of ref document: EP; Kind code of ref document: A1)
WWE   Wipo information: entry into national phase (Ref document number: 2009747671; Country of ref document: EP)
WWE   Wipo information: entry into national phase (Ref document number: 2011509735; Country of ref document: JP)
NENP  Non-entry into the national phase (Ref country code: DE)