WO1993013475A1 - Method for performing disk array operations using a nonuniform stripe size mapping scheme - Google Patents
- The present invention is directed toward a method for improving performance for multiple disk drives in computer systems, and more particularly to a method for performing write operations in a disk array utilizing parity data redundancy and recovery protection.
- Microprocessors and the computers which utilize them have become increasingly more powerful during the recent years.
- Currently available personal computers have capabilities in excess of the mainframe and minicomputers of ten years ago.
- Microprocessor data bus sizes of 32 bits are widely available whereas in the past 8 bits was conventional and 16 bits was common.
- Data is distributed across each of the disks comprising the disk array so that each disk holds a portion of the data comprising a data file. If n drives are ganged together, then the effective data transfer rate may be increased up to n times.
- This technique, known as striping, originated in the supercomputing environment, where the transfer of large amounts of data to and from secondary storage is a frequent requirement.
- In striping, a sequential data block is broken into segments of a unit length, such as the sector size, and sequential segments are written to sequential disk drives, not to sequential locations on a single disk drive.
- The unit length or amount of data that is stored "across" each disk is referred to as the stripe size.
- The stripe size affects data transfer characteristics and access times and is generally chosen to optimize data transfers to and from the disk array. If the data block is longer than n unit lengths, the process repeats for the next stripe location on the respective disk drives. With this approach, the n physical drives become a single logical device, and this may be implemented either through software or hardware.
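The stripe-to-disk mapping described above can be sketched as follows. This is an illustrative model only, not code from the patent; the disk count and stripe size are assumed parameters chosen to match the Figure 1 geometry.

```python
# Illustrative sketch of uniform striping (not from the patent): a
# logical sector number is mapped to a data disk and a physical sector,
# with `stripe_sectors` consecutive sectors stored on each disk before
# moving to the next disk, and the whole pattern repeating per stripe.

def map_logical_sector(logical, n_data_disks=3, stripe_sectors=4):
    """Return (data disk index, physical sector on that disk)."""
    per_stripe = n_data_disks * stripe_sectors   # data sectors in a complete stripe
    stripe = logical // per_stripe               # which stripe row
    within = logical % per_stripe                # offset inside the stripe
    disk = within // stripe_sectors              # which data disk
    offset = within % stripe_sectors             # sector within that disk's stripe
    return disk, stripe * stripe_sectors + offset
```

With three data disks and four sectors per disk stripe, logical sector 4 lands at the start of disk 1, and logical sector 12 wraps to disk 0 of the next stripe.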
- One technique that is used to provide for data protection and recovery in disk array subsystems is referred to as a parity scheme.
- Data blocks being written to the various drives within the array are combined using the known EXCLUSIVE-OR (XOR) technique to create parity information, which is written to a reserved or parity drive within the array.
- The advantage of a parity scheme is that it may be used to minimize the amount of data storage dedicated to data redundancy and recovery purposes within the array.
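A minimal sketch of the XOR parity idea (illustrative only; the function name is ours, not the patent's): parity is the byte-wise XOR of the data blocks in a stripe, and any single lost block can be rebuilt from the parity and the surviving blocks.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of a list of equal-length data blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

d0, d1, d2 = b"\x0f\xf0", b"\x33\x33", b"\x55\xaa"
parity = xor_blocks([d0, d1, d2])   # written to the parity drive

# If the drive holding d1 fails, XOR-ing the parity with the survivors
# regenerates the lost data:
assert xor_blocks([d0, d2, parity]) == d1
```

Only one drive's worth of capacity is spent on redundancy, which is the space advantage the paragraph above describes.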
- Figure 1 illustrates a traditional 3+1 mapping scheme wherein three disks, disk 0, disk 1 and disk 2, are used for data storage, and one disk, disk 3, is used to store parity information.
- Each rectangle enclosing a number or the letter "p" coupled with a number corresponds to a sector, which is preferably 512 bytes.
- Each complete stripe uses four sectors from each of disks 0, 1 and 2 for a total of 12 sectors of data storage per stripe. Assuming a standard sector size of 512 bytes, the stripe size of each of these disk stripes, which is defined as the amount of storage allocated to a stripe on one of the disks comprising the stripe, is 2 kbytes. Thus each complete stripe, which includes the total of the portion of each of the disks allocated to a stripe, can store 6 kbytes of data. Disk 3 of each of the stripes is used to store parity information.
- This structure is called an FNODE in the OS/2 file system. The FNODE includes file access and modification dates and file size information.
- These structures are relatively small compared with typical data stripe sizes used in disk arrays, thus resulting in a large number of partial stripe write operations.
- When a full stripe is written, parity information may be generated directly from the data being written to the drive array, and therefore no extra read of the disk stripe is required.
- A problem occurs when the computer writes only a partial stripe to the disk array because the disk array controller does not have sufficient information from the data to be written to compute parity for the complete stripe.
- Partial stripe write operations generally require data stored on a disk to first be read, modified by the process active on the host system, and written back to the same address on the data disk. This operation consists of a data disk READ, modification of the data, and a data disk WRITE to the same address.
- A partial stripe write to a data disk in an XOR parity fault tolerant system includes issuing a READ command in order to maintain parity fault tolerance.
- The computer system first reads the parity information from the parity disk for the data disk sectors which are being updated and the old data values that are to be replaced from the data disk.
- The XOR parity information is then recalculated by the host, a local processor, or dedicated logic by XORing the old data sectors to be replaced with the related parity sectors. This recovers the parity value without those data values.
- The new data values are then XORed onto this recovered value to produce the new parity data.
- A WRITE command is then executed, writing the updated data to the data disks and the new parity information to the parity disk. It will be appreciated that this process requires two additional partial sector READ operations, one from the parity disk and one reading the old data, prior to the generation of the new XOR parity information. Additionally, the WRITE operations are to locations which have just been read. Consequently, data transfer performance suffers.
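The read-modify-write sequence just described reduces to a single XOR identity: new parity = old parity XOR old data XOR new data. A sketch, illustrative rather than the patent's implementation:

```python
def rmw_parity(old_parity, old_data, new_data):
    """Partial stripe write, first method: XOR the old data out of the
    old parity, then XOR the new data in (byte-wise)."""
    return bytes(p ^ o ^ n for p, o, n in zip(old_parity, old_data, new_data))

# Three-disk stripe with parity 0x07 = 0x01 ^ 0x02 ^ 0x04; replacing the
# 0x02 block with 0x08 needs only the old data and the old parity:
new_parity = rmw_parity(b"\x07", b"\x02", b"\x08")
assert new_parity == b"\x0d"   # equals 0x01 ^ 0x08 ^ 0x04
```

The two extra READ operations in the text correspond to fetching `old_parity` and `old_data` before this computation can run.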
- The second method requires reading the remainder of the data in the stripe, despite the fact that it is not being replaced by the WRITE operation.
- From this data and the new data, the new parity information may be determined for the entire stripe which is being updated. This process requires a READ operation of the data not being replaced and a full stripe WRITE operation to save the parity information.
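The second method can be sketched the same way (again illustrative): parity is recomputed from scratch over the new data plus the unchanged remainder of the stripe, which must first be read back.

```python
from functools import reduce

def full_stripe_parity(blocks):
    """Partial stripe write, second method: recompute parity over the
    whole stripe, i.e. the byte-wise XOR of every data block in it."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

# Example: the unchanged blocks 0x01 and 0x04 are read back from the
# stripe and combined with the new block 0x08 being written.
assert full_stripe_parity([b"\x01", b"\x08", b"\x04"]) == b"\x0d"
```

Both methods yield the same parity; they differ only in which sectors must be read before the WRITE can complete.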
- A low level format operation involves the creation of sectors on the disk along with their address markings, which are used to identify the sectors after the formatting is completed. The data portion of each sector is established and filled in with dummy data.
- The disk controller and the respective operating system in control of the computer system must perform a high level or logical format of the disk drive to place the "file system" on the disk and make the disk drive conform to the standards of the operating system. This high level formatting is performed by the respective disk controller in conjunction with an operating system service referred to as a "make file system" program.
- The make file system program works in conjunction with the disk controller to create the file system on the disk array.
- The operating system views the disk as a sequential list of blocks or sectors, and the make file system program is unaware of the topology of these blocks.
- The present invention is directed toward a method and apparatus for improving disk performance in a computer system having a disk array subsystem.
- A nonuniform mapping scheme is used wherein the disk array includes certain designated regions having varying sizes of data stripes.
- The disk array includes a region comprised of a number of data stripes having a stripe size that is approximately the same as the size of internal data structures frequently used by the file system, in addition to a region which includes a number of data stripes having a larger stripe size which are used for general data storage.
- The data structure is preferably mapped to the small stripe region in the disk array, wherein the complete stripe size matches the size of the data structure. In this manner, whenever the file system data structure is updated, the operation is a full stripe write. This reduces the number of partial stripe write operations, thus reducing the performance penalty associated with these operations.
- Figure 1 is a prior art diagram of a traditional 3+1 disk array mapping scheme having a uniform stripe size;
- Figures 2 and 3 are block diagrams of an illustrative computer system on which the method of the present invention may be practiced;
- Figure 4 is a block diagram of the disk subsystem of the preferred embodiment;
- Figure 5 is a functional block diagram of the transfer controller of Fig. 4 according to the preferred embodiment;
- Figure 6 is a diagram of a 3+1 disk array mapping scheme having varying stripe sizes according to a first embodiment;
- Figure 7 is a diagram of a RAID 5 3+1 disk array mapping scheme having varying stripe sizes according to a second embodiment of the invention;
- Figure 8 is a diagram of a 4+1 disk array mapping scheme according to the preferred embodiment of the invention;
- Figure 9 is a flowchart diagram of a WRITE operation according to the method of the present invention; and
- Figure 10 is a flowchart diagram of a READ operation according to the method of the present invention.
- The computer system and disk array subsystem described below represent the preferred embodiment of the present invention. It is also contemplated that other computer systems, not having the capabilities of the system described below, may be used to practice the present invention.
- Referring now to Figures 2 and 3, the letter C generally designates a computer system on which the present invention may be practiced.
- System C is shown in two portions, with the interconnections between Figs. 2 and 3 designated by reference to the circled numbers 1 to 8.
- System C is comprised of a number of block elements interconnected via four buses.
- A central processing unit (CPU) comprises a system processor 20, a numerical co-processor 22, a cache memory controller 24, and associated logic circuits connected to a system processor bus 26.
- Associated with the cache controller 24 are a high speed cache data random access memory (RAM) 28, non-cacheable memory address (NCA) map programming logic circuitry 30, a non-cacheable address or NCA memory map 32, address exchange latch circuitry 34, a data exchange transceiver 36 and page hit detect logic 43.
- Also associated with the CPU are a system processor ready logic circuit 38, a next address (NA) enable logic circuit 40 and a bus request logic circuit 42.
- The system processor is preferably an Intel Corporation 80386 microprocessor.
- The system processor 20 has its control, address and data lines interfaced to the system processor bus 26.
- The co-processor 22 is preferably an Intel 80387 and/or Weitek WTL3167 numerical processor interfacing with the local processor bus 26 and the system processor 20 in the conventional manner.
- The cache RAM 28 is preferably a suitable high-speed static random access memory which interfaces with the address and data elements of bus 26 under the control of the cache controller 24 to carry out required cache memory operations.
- The cache controller 24 is preferably an Intel 82385 cache controller configured to operate in two-way set associative master mode. In the preferred embodiment, the components are the 33 MHz versions of the respective units.
- Address latch circuitry 34 and data transceiver 36 interface the cache controller 24 with the processor 20 and provide a local bus interface between the processor bus 26 and a host or memory bus 44.
- Circuit 38 is a logic circuit which provides a bus ready signal to control access to the bus 26 and indicate when the next cycle may begin.
- The enable circuit 40 is utilized to indicate that the next address of data or code to be utilized by sub-system elements in pipelined address mode may be placed on the local bus 26.
- The non-cacheable memory address (NCA) map programmer 30 cooperates with the processor 20 and the non-cacheable address memory 32 to map non-cacheable memory locations.
- The non-cacheable address memory 32 is utilized to designate areas of the system memory that are non-cacheable to avoid various types of cache coherency problems.
- The bus request logic circuit 42 is utilized by the processor 20 and associated elements to request access to the host bus 44 in situations such as when requested data is not located in the cache memory 28 and access to system memory is required.
- The main memory array or system memory 58 is coupled to the host bus 44.
- The main memory array 58 is preferably dynamic random access memory.
- Memory 58 interfaces with the host bus 44 via an EISA bus buffer (EBB) data buffer circuit 60, a memory controller circuit 62, and a memory mapper 68.
- The buffer 60 performs data transceiving and parity generating and checking functions.
- The memory controller 62 and memory mapper 68 interface with the memory 58 via address multiplexor and column address strobe (ADDR/CAS) buffers 66 and row address strobe (RAS) enable logic circuit 64.
- System C is configured as having the processor bus 26, the host bus 44, an extended industry standard architecture (EISA) bus 46 (Fig. 3) and an X bus 90 (Fig. 3).
- The details of the portions of the system illustrated in Fig. 3 and not discussed in detail below are not significant to the present invention other than to illustrate an example of a fully configured computer system.
- The portion of System C illustrated in Fig. 3 is essentially a configured EISA system which includes the necessary EISA bus 46 and EISA bus controller 48, data latches and transceivers referred to as EBB data buffers 50, and address latches and buffers 52 to interface between the EISA bus 46 and the host bus 44.
- An integrated system peripheral (ISP) 54 includes a direct memory access (DMA) controller 56 for controlling access to main memory 58 (Fig. 2) or memory contained in an EISA slot and input/output (I/O) locations without the need for access to the processor 20.
- The ISP 54 also includes interrupt controllers 70, non-maskable interrupt logic 72 and system timer 74, which allow control of interrupt signals and generate necessary timing signals and wait states in a manner according to the EISA specification and conventional practice.
- Processor generated interrupt requests are controlled via dual interrupt controller circuits emulating and extending conventional Intel interrupt controller functionality.
- The ISP 54 also includes bus arbitration logic 75 which, in cooperation with the bus controller 48, controls and arbitrates among the various requests for the EISA bus 46 by the cache controller 24, the DMA controller 56, and bus master devices located on the EISA bus 46.
- The EISA bus 46 includes ISA and EISA control buses 76 and 78 and ISA and EISA data buses 80 and 82, which are interfaced via the X bus 90 in combination with the ISA control bus 76 from the EISA bus 46. Control and data/address transfer for the X bus 90 are facilitated by X bus control logic 92, data buffers 94 and address buffers 96.
- Attached to the X bus are various peripheral devices such as a keyboard/mouse controller 98, which interfaces the X bus 90 with a suitable keyboard and mouse via connectors 100 and 102, respectively. Also attached to the X bus are read only memory (ROM) circuits 106 which contain basic operation software for the system C and for system video operations.
- A serial communications port 108 is also connected to the system C via the X bus 90.
- Floppy disk support, a parallel port, a second serial port, and video support circuits are provided in block circuit 110.
- The computer system C includes a disk subsystem 111 which includes a disk array controller 112, a fixed disk connector 114, and a fixed disk array 116.
- The disk array controller 112 is connected to the EISA bus 46, preferably in a slot, to provide for the communication of data and address information through the EISA bus 46.
- The fixed disk connector 114 is connected to the disk array controller 112 and is in turn connected to the fixed disk array 116.
- The disk array controller 112 has a local processor 130, preferably an Intel 80186.
- The local processor 130 has a multiplexed address/data bus UAD and control outputs UC.
- The multiplexed address/data bus UAD is connected to a transceiver 132 whose output is the local processor data bus UD.
- The multiplexed address/data bus UAD is also connected to the D inputs of a latch 134 whose Q outputs form the local processor address bus UA.
- The local processor 130 has associated with it random access memory (RAM) 136 coupled via the multiplexed address/data bus UAD and the address bus UA.
- The RAM 136 is connected to the processor control bus UC to develop proper timing signals.
- The local processor address bus UA, the local processor data bus UD and the local processor control bus UC are also connected to a bus master interface controller (BMIC) 142.
- The BMIC 142 serves the function of interfacing the disk array controller 112 with a standard bus, such as the EISA or MCA bus, and acts as a bus master.
- In the preferred embodiment, the BMIC 142 is the Intel 82355 and is interfaced with the EISA bus 46.
- The BMIC 142 can interface with the local processor 130 to allow data and control information to be passed between the host system C and the local processor 130.
- The local processor data bus UD and local processor control bus UC are preferably connected to a transfer controller 144.
- The transfer controller 144 is generally a specialized multi-channel direct memory access (DMA) controller used to transfer data between the transfer buffer RAM 146 and the various other devices present in the disk array controller 112.
- The transfer controller 144 is connected to the BMIC 142 by the BMIC data lines BD and the BMIC control lines BC.
- The transfer controller 144 can transfer data from the transfer buffer RAM 146 to the BMIC 142 if a READ operation is requested. If a WRITE operation is requested, data can be transferred from the BMIC 142 to the transfer buffer RAM 146.
- The transfer controller 144 can then pass this information from the transfer buffer RAM 146 to the disk array 116.
- The transfer controller 144 is described in greater detail in U.S. Application No. 431,735, and in its European counterpart, European Patent Office Publication No. 0427119, published April 4, 1991, which is hereby incorporated by reference.
- The transfer controller 144 includes a disk data bus DD and a disk address and control bus DAC.
- The disk address and control bus DAC is connected to two buffers 165 and 166 which are part of the fixed disk connector 114 and are used to send and receive control signals between the transfer controller 144 and the disk array 116.
- The disk data bus DD is connected to two data transceivers 148 and 150 which are part of the fixed disk connector 114.
- The outputs of the transceiver 148 and the buffer 165 are connected to two disk drive port connectors 152 and 154.
- Two connectors 160 and 162 are connected to the outputs of the transceiver 150 and the buffer 166.
- Two hard disks can be connected to each connector 152, 154, 160 and 162.
- Thus, up to 8 disk drives can be connected to the transfer controller 144.
- In the preferred embodiment, five disk drives are coupled to the transfer controller 144, and a 4+1 mapping scheme is used.
- A compatibility port controller (CPC) 164 is also connected to the EISA bus 46.
- The CPC 164 is connected to the transfer controller 144 over the compatibility data lines CD and the compatibility control lines CC.
- The CPC 164 is provided so that software written for previous computer systems, which do not have a disk array controller 112 and its BMIC 142 (which are addressed over an EISA-specific space and allow very high throughputs), can operate without requiring a rewrite of the software.
- The CPC 164 emulates the various control ports previously utilized in interfacing with hard disks.
- The transfer controller 144 is itself comprised of a series of separate circuitry blocks.
- The transfer controller 144 includes two main units referred to as the RAM controller 170 and the disk controller 172.
- The RAM controller 170 has an arbiter to control the various interface devices that have access to the transfer buffer RAM 146 and a multiplexor so that data can be passed to and from the transfer buffer RAM 146.
- The disk controller 172 includes an arbiter to determine which of the various devices has access to the integrated disk interface 174 and includes multiplexing capability to allow data to be properly transferred back and forth through the integrated disk interface 174.
- The transfer controller 144 preferably includes seven DMA channels. One DMA channel 176 is assigned to cooperate with the BMIC 142.
- A second DMA channel 178 is designed to cooperate with the CPC 164.
- These two devices, the BMIC 142 and the compatibility port controller 164, are coupled only to the transfer buffer RAM 146 through their appropriate DMA channels 176 and 178 and the RAM controller 170.
- The BMIC 142 and the compatibility port controller 164 do not have direct access to the integrated disk interface 174 and the disk array 116.
- The local processor 130 (Fig. 3) is connected to the RAM controller 170 through a local processor DMA channel 180 and is connected to the disk controller 172 through a local processor disk channel 182.
- Thus, the local processor 130 is connected to both the transfer buffer RAM 146 and the disk array 116 as desired.
- The transfer controller 144 includes four DMA disk channels 184, 186, 188 and 190 which allow information to be independently and simultaneously passed between the disk array 116 and the RAM 146. It is noted that the fourth DMA/disk channel 190 also includes XOR capability so that parity operations can be readily performed in the transfer controller 144 without requiring computations by the local processor 130.
- The above computer system C and disk array subsystem 111 represent the preferred computer system for the practice of the method of the present invention.
- The computer system C preferably utilizes the UNIX operating system, although other operating systems may be used.
- The UNIX operating system includes a service referred to as the make file system program.
- The make file system program provides information to the disk array controller 112 as to how many INODEs are being created and the size of the INODEs.
- The make file system program includes sufficient intelligence to inform the disk array controller 112 as to the desired stripe size in the small stripe and large stripe regions and the boundary separating these regions.
- The number of INODEs is approximately equal to the number of files which are to be allowed in the system.
- The disk array controller 112 uses this information to develop the file system on each of the disks comprising the array 116.
- The disk array controller 112 uses a multiple mapping scheme according to the present invention which partitions the disk array 116 into small stripe and large stripe regions.
- The small stripe region preferably occupies the first N sectors of each disk and is reserved for the INODE data structures; the remaining stripes in the array form the large stripe region, which comprises free space used for data storage. Therefore, in the preferred embodiment, the disk array controller 112 allocates the first N sectors of each of the disks in the array for the small stripe region. The remaining sectors of each of the disks are formatted into the large stripe region.
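The two-region geometry just described could be represented along these lines. This is a hypothetical sketch: the constant values and names are illustrative assumptions, not taken from the patent.

```python
# Hypothetical region table: the boundary N (which the patent's scheme
# stores in the RAM 136) separates the small stripe region from the
# large stripe region on each disk. All values below are assumed for
# illustration only.
SMALL_REGION_SECTORS = 1024   # N: first N sectors of each disk
SMALL_STRIPE_SECTORS = 1      # 512-byte disk stripes for INODE structures
LARGE_STRIPE_SECTORS = 4      # 2-kbyte disk stripes for general data

def stripe_geometry(disk_sector):
    """Return (region name, sectors per disk stripe) for a per-disk sector."""
    if disk_sector < SMALL_REGION_SECTORS:
        return "small", SMALL_STRIPE_SECTORS
    return "large", LARGE_STRIPE_SECTORS
```

A single comparison against the stored boundary is enough to select the stripe size for any incoming request.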
- The disk array controller 112 stores the boundary separating the small stripe and large stripe regions in the RAM 136.
- The disk array controller 112 utilizes this boundary and writes the INODEs to the small stripe portion of the array 116.
- In an alternative embodiment, the small stripe region does not occupy the first N sectors of each disk; rather, the small stripe region includes a plurality of regions interspersed among the large stripe regions.
- In this embodiment, a plurality of boundaries which separate the small stripe and large stripe regions are stored in the RAM 136 so that the disk array controller 112 can write the INODE data structures to the small stripe regions.
- In a second embodiment of the invention, the OS/2 operating system is used.
- In this embodiment, an OS/2 service similar to the make file system program discussed above provides information to the disk array controller 112 as to how many FNODEs are being created and the size of the FNODEs.
- The disk array controller 112 uses a multiple mapping scheme similar to that discussed above to partition the disk array 116 into small stripe and large stripe regions, wherein the small stripe region is reserved for the FNODE data structures. It is noted that the present invention can operate in conjunction with any type of operating system or file system.
- The resulting operation in this stripe will consist of a partial stripe write operation, which has performance penalties as described above.
- Referring now to Figure 6, a diagram illustrating a 3+1 mapping scheme utilizing multiple stripe sizes according to one embodiment of the present invention is shown.
- Figure 6 is exemplary only, it being noted that the disk array 116 will utilize a much larger number of stripes of each size.
- The disk drives used in the preferred embodiment include a number of sectors each having 512 bytes of storage.
- The disk array utilizes two stripe sizes, a disk stripe size of one kbyte and a disk stripe size of two kbytes.
- Stripes 0, 1 and 2 utilize a disk stripe size of one kbyte, using two sectors per disk in disks 0, 1 and 2 for a total of six sectors or 3 kbytes of data storage per complete stripe.
- Stripes 0, 1 and 2 utilize two sectors in disk 3 to store parity information for each stripe.
- Stripes 3-6 utilize a 2 kbyte disk stripe size wherein four sectors per disk on disks 0, 1 and 2 are allocated for data storage for each stripe and four sectors on disk 3 are reserved for parity information for each stripe.
- The INODE data structures are written to the portion of the disk array having the small stripe size, i.e., stripes 0, 1 or 2. As previously discussed, INODE structures are assumed to be 2 kbytes in size in the preferred embodiment.
- INODE structures written to stripes 0, 1 or 2 would fill up the area in disks 0 and 1, disk 2 would generally be unused, and disk 3 would be used to store the respective parity information.
- Data would not be allowed to be written to disk 2 of the respective stripe after an INODE is written there, and thus partial stripe write operations are prevented from occurring. Therefore, by using a smaller stripe size in a portion of the disk array 116 and preventing data from being written to the unused space after an INODE is written, a write operation of these structures emulates a full stripe write.
- However, disk 2 is unused or unwritten during this full stripe write, and thus an inefficient use of the disk area results.
- In addition, since disk 2 will generally be unused for each small stripe where an INODE structure is written, the data transfer bandwidth from the disk array system is reduced, and the array essentially operates as a 2+1 mapping scheme in these instances.
- In the preferred embodiment, a 4+1 mapping scheme is used, as shown in Figure 8. It is again noted that Figure 8 is exemplary only, and the disk array 116 of the preferred embodiment will utilize a much larger number of stripes in each of the small stripe and large stripe regions.
- The disk stripe size of the stripes in the small stripe region, stripes 0-4, wherein stripe size is defined as the amount of each disk that is allocated to the stripe, is 512 bytes. In this manner, each complete stripe in the small stripe region holds exactly 2 kbytes of data, which is approximately equivalent to the size of an INODE structure.
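The sizing arithmetic is worth making explicit (a sketch; the helper function is ours): with 512-byte sectors, a complete stripe holds data disks × sectors per disk stripe × 512 bytes, so a one-sector disk stripe in the 4+1 scheme yields exactly 2-kbyte complete stripes.

```python
SECTOR_BYTES = 512

def complete_stripe_bytes(data_disks, sectors_per_disk_stripe):
    """Data capacity of one complete stripe (parity disk excluded)."""
    return data_disks * sectors_per_disk_stripe * SECTOR_BYTES

assert complete_stripe_bytes(4, 1) == 2048   # Fig. 8 small stripe: one 2-kbyte INODE
assert complete_stripe_bytes(3, 4) == 6144   # Fig. 1 traditional stripe
assert complete_stripe_bytes(3, 2) == 3072   # Fig. 6 small stripe
```

Matching the complete stripe capacity to the INODE size is what turns every INODE update into a full stripe write without wasting a data disk.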
- A disk request is preferably submitted by the system processor 20 to the disk array controller 112 through the EISA bus 46 and the BMIC 142.
- The local processor 130, on receiving this request through the BMIC 142, builds a data structure in the local processor RAM memory 136.
- This data structure is known as a command list and may be a simple READ or WRITE request directed to the disk array 116, or it may be a more elaborate set of requests containing multiple READ/WRITE or diagnostic and configuration requests.
- The command list is then submitted to the local processor 130 for processing.
- The local processor 130 then oversees the execution of the command list, including the transferring of data.
- Once execution of the command list is complete, the local processor 130 notifies the operating system device driver running on the system microprocessor 20.
- The submission of the command list and the notification of the command list completion are achieved by a protocol which uses input/output (I/O) registers located in the BMIC 142.
- the READ and WRITE operations executed by the disk array controller 112 are implemented as a number of application tasks running on the local processor 130. Because of the interactive nature of the input/output operations, it is impractical for the illustrative computer system C to process disk commands as single batch tasks on the local processor 130. Accordingly, the local processor 130 utilizes a real-time multitasking operating system which permits multiple tasks to be addressed by the local processor 130, including the method of the present invention.
- the operating system on the local processor 130 is the AMX86 multi-tasking executive by Kadak Products, Ltd.
- the AMX operating system kernel provides a number of system services in addition to the applications set forth in the method of the present invention.
- the WRITE operation begins at step 200, in which the active process or application causes the system processor 20 to generate a WRITE request which is passed to the disk device driver.
- the disk device driver is a portion of the software contained within the computer system C, preferably the system memory 58, which performs the actual interface operations with the disk units.
- the disk device driver software assumes control of the system processor 20 to perform specific tasks to carry out the required I/O operations. Control transfers to step 202, wherein the disk device driver assumes control of the system processor 20 and generates a WRITE command list.
- step 204 the device driver submits the WRITE command list to the disk controller 112 via the BMIC 142 or the CPC 164.
- the device driver then goes into a wait state to await a completion signal from the disk array controller 112.
- Logical flow of the operations proceeds to step 206, wherein the local processor 130 receives the WRITE command list and determines whether an INODE data structure is being written to the disk array 116. In making this determination, the local processor preferably utilizes the boundary between the small stripe and large stripe regions.
- intelligence is incorporated into the device driver wherein the device driver utilizes the boundary between the small stripe and large stripe regions and incorporates this information into the WRITE command list.
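The region determination described above reduces to a single comparison against the stored boundary. A hedged sketch (the boundary value here is assumed for illustration; in the patent it is held in the RAM 136 or carried in the WRITE command list by the device driver):

```python
# Hypothetical region test: addresses below the boundary between the small
# stripe and large stripe regions are INODE (small stripe) writes.
SMALL_LARGE_BOUNDARY = 5 * 4 * 512  # assumed: 5 small stripes, 4 data disks, 512 B each

def is_inode_write(logical_address):
    """True if the WRITE targets the small stripe (INODE) region."""
    return logical_address < SMALL_LARGE_BOUNDARY
```

With this assumed layout, `is_inode_write(0)` is true while any address at or past the boundary falls in the large stripe region.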
- step 208 the local processor 130 builds disk specific WRITE instructions for the full stripe WRITE operation to the small stripe region.
- step 210 the transfer controller chip (TCC) 144 generates parity data from the INODE being written to the disk array 116.
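Because the INODE fills a complete small stripe, step 210 can compute parity directly from the data being written, with no preceding disk reads. A sketch of that XOR computation (illustrative only; the patent performs this in the TCC 144 hardware, not in software):

```python
# Full stripe parity: XOR the data blocks destined for each data disk,
# byte column by byte column, to produce the parity block.
from functools import reduce

def full_stripe_parity(data_blocks):
    """XOR equal-sized data blocks (one per data disk) into a parity block."""
    return bytes(reduce(lambda a, b: a ^ b, column)
                 for column in zip(*data_blocks))
```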
- Control thereafter transfers to step 214, wherein the local processor 130 determines whether additional data is to be written to the disk array 116. If additional data is to be written to the disk array 116, control transfers to step 216 wherein the local processor 130 increments the memory addresses and decrements the number of bytes to be transferred. Control then returns to step 206. If no additional data is to be written to the disk array 116, control transfers from step 214 to step 224 where the local processor 130 signals WRITE complete. If the local processor 130 receives the WRITE command list and determines that an INODE structure is not being written to the disk array 116, then in step 218 the local processor 130 builds disk specific WRITE instructions for the data to be written to the large stripe region.
- this operation requires the local processor 130 to utilize the boundary between the small stripe and large stripe regions stored in the RAM 136 to develop the proper bias or offset to correct for the differing size stripes so that the proper physical disk addresses are developed.
- this intelligence can be built into the device driver wherein the device driver has access to and utilizes the boundary between the small stripe and large stripe regions and incorporates this offset information into the WRITE command list.
- the local processor 130 is not required to utilize the boundary between the small stripe and large stripe regions because this intelligence is incorporated into the device driver.
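The bias/offset computation described above can be sketched as a two-region address map. All constants are illustrative assumptions; the patent stores the actual boundary in the RAM 136 (or receives it from the device driver) rather than hard-coding it:

```python
# Hedged sketch of mapping a logical byte address to a (stripe number,
# offset within stripe) pair across the two differently sized regions.
SMALL_STRIPE_SIZE = 512    # bytes per disk per stripe, small region
LARGE_STRIPE_SIZE = 16384  # assumed large-region stripe size (not from the patent)
NUM_SMALL_STRIPES = 5      # stripes 0-4, as in the exemplary Figure 8
DATA_DISKS = 4             # 4+1 mapping

SMALL_REGION_BYTES = NUM_SMALL_STRIPES * DATA_DISKS * SMALL_STRIPE_SIZE

def physical_location(logical_address):
    """Return (stripe_number, offset_within_stripe) for a logical address."""
    if logical_address < SMALL_REGION_BYTES:
        stripe_bytes = DATA_DISKS * SMALL_STRIPE_SIZE
        return (logical_address // stripe_bytes,
                logical_address % stripe_bytes)
    # Bias past the small stripe region before dividing by the large
    # stripe's capacity, so large-region addresses land on the right stripe.
    biased = logical_address - SMALL_REGION_BYTES
    stripe_bytes = DATA_DISKS * LARGE_STRIPE_SIZE
    return (NUM_SMALL_STRIPES + biased // stripe_bytes,
            biased % stripe_bytes)
```

The subtraction of `SMALL_REGION_BYTES` is the "bias or offset" the text refers to: without it, addresses in the large stripe region would be divided by the wrong stripe capacity and resolve to incorrect physical disk addresses.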
- the transfer controller chip 144 generates parity information solely for the data being written.
- the disk controller 112 can generate the parity information solely from the data to be written.
- the write operation will be a partial stripe write operation
- a preceding read operation may need to be performed to read the data or parity information currently on the disk.
- these additional read operations resulting from partial stripe write operations reduce the performance of the disk system 111.
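The read-modify-write cost described above follows from the standard XOR parity update identity: new parity = old parity XOR old data XOR new data, which requires reading the old data and parity from disk before the write can complete. A minimal sketch (this is only the general identity, not the method of the companion applications cited below):

```python
# Partial stripe write parity update: the old data and old parity must be
# read from disk first, which is the extra I/O cost the text describes.
def updated_parity(old_parity, old_data, new_data):
    """Recompute one block of parity via new = old_parity ^ old_data ^ new_data."""
    return bytes(p ^ od ^ nd
                 for p, od, nd in zip(old_parity, old_data, new_data))
```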
- U.S. patent application serial number 752,773 titled METHOD FOR PERFORMING WRITE OPERATIONS IN A PARITY FAULT TOLERANT DISK ARRAY filed on August 30, 1991 and U.S.
- step 222 the disk controller 112 writes the data and parity information to the large stripe region. Control then transfers to step 214 where the local processor 130 determines whether additional data is to be written to the disk array 116. If in step 214 it is determined that no additional data is to be transferred, control transfers to step 224, wherein the disk array controller 112 signals WRITE complete to the disk device driver. Control then passes to step 226, wherein the device driver releases control of the system processor 20 to continue execution of the application program. This completes operation of the WRITE sequence.
- a READ operation as carried out on the disk array subsystem 111 using the intelligent disk array controller 112 begins at step 250 when the active process or application program causes the system processor 20 to generate a READ command which is passed to the disk device driver. Control transfers to step 252, wherein the disk device driver assumes control of the system processor 20 and causes the system processor 20 to generate a READ command list similar to that described in U.S. patent application serial no. 431,737 assigned to Compaq Computer Corporation, assignee of the present invention.
- the READ command list is sent to the disk subsystem 111 in step 254, after which operation the device driver waits until it receives a READ complete signal.
- the disk controller 112 receives the READ command list, via the BMIC 142 or CPC 164 and transfer controller 144, and determines if the read operation is intended to access data in the small stripe region, i.e., an INODE, or data in the large stripe region. In making this determination, the disk controller 112 preferably compares the disk address of the requested data with the boundary between the small stripe and large stripe regions stored in the RAM 136 to determine which region is being accessed. Optionally, more intelligence can be built into the device driver such that the device driver incorporates information as to which region is being accessed in the READ command list. According to this embodiment, the disk controller 112 would require little extra intelligence and would merely utilize this information in the READ command list in generating the disk specific READ requests.
- If the small stripe region is being accessed, the local processor 130 generates disk specific READ requests for the requested INODE and its associated parity information in the small stripe region in step 260 and queues the requests in local RAM 136. Control transfers to step 264, wherein the requests are executed and the requested data is transferred from the disk array 116 through the transfer controller 144 and the BMIC 142 or the CPC 164 to the system memory 58 addresses indicated by the requesting task. If the disk controller 112 determines that the read operation is intended to access data in the large stripe region, the local processor 130 generates disk specific READ requests for the requested data and its associated parity information in the large stripe region in step 262 and queues the requests in local RAM 136. These requests are executed and the data transferred in step 264. Upon completion of the data transfer in step 264, the disk array controller 112 signals READ complete to the disk device driver in step 266, which releases control of the system processor 20.
Abstract
Description
Claims
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP93902805A EP0619896A1 (en) | 1991-12-27 | 1992-12-18 | Method for performing disk array operations using a nonuniform stripe size mapping scheme |
JP5511941A JPH06511099A (en) | 1991-12-27 | 1992-12-18 | How to perform disk array operations using a non-uniform stripe size mapping scheme |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US81400091A | 1991-12-27 | 1991-12-27 | |
US814,000 | 1991-12-27 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO1993013475A1 true WO1993013475A1 (en) | 1993-07-08 |
Family
ID=25213949
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US1992/011283 WO1993013475A1 (en) | 1991-12-27 | 1992-12-18 | Method for performing disk array operations using a nonuniform stripe size mapping scheme |
Country Status (5)
Country | Link |
---|---|
EP (1) | EP0619896A1 (en) |
JP (1) | JPH06511099A (en) |
AU (1) | AU3424993A (en) |
CA (1) | CA2126754A1 (en) |
WO (1) | WO1993013475A1 (en) |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP0584804A2 (en) * | 1992-08-26 | 1994-03-02 | Mitsubishi Denki Kabushiki Kaisha | Redundant array of disks with improved storage and recovery speed |
EP0670553A1 (en) * | 1992-09-02 | 1995-09-06 | Aton Systemes S.A. | Procedure for interleaved data transfers between a computer memory and peripheral equipment comprising a management system and several storage units |
EP0701716A1 (en) * | 1993-06-03 | 1996-03-20 | Network Appliance Corporation | A method for allocating files in a file system integrated with a raid disk sub-system |
US5948110A (en) * | 1993-06-04 | 1999-09-07 | Network Appliance, Inc. | Method for providing parity in a raid sub-system using non-volatile memory |
US5963962A (en) * | 1995-05-31 | 1999-10-05 | Network Appliance, Inc. | Write anywhere file-system layout |
WO2002029539A2 (en) * | 2000-10-02 | 2002-04-11 | Sun Microsystems, Inc. | A data storage subsystem including a storage disk array employing dynamic data striping |
US6636879B1 (en) | 2000-08-18 | 2003-10-21 | Network Appliance, Inc. | Space allocation in a write anywhere file system |
US6658528B2 (en) * | 2001-07-30 | 2003-12-02 | International Business Machines Corporation | System and method for improving file system transfer through the use of an intelligent geometry engine |
US6728922B1 (en) | 2000-08-18 | 2004-04-27 | Network Appliance, Inc. | Dynamic data space |
US7072916B1 (en) | 2000-08-18 | 2006-07-04 | Network Appliance, Inc. | Instant snapshot |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP0249091A2 (en) * | 1986-06-12 | 1987-12-16 | International Business Machines Corporation | Parity spreading to enhance storage access |
1992
- 1992-12-18 AU AU34249/93A patent/AU3424993A/en not_active Abandoned
- 1992-12-18 EP EP93902805A patent/EP0619896A1/en not_active Ceased
- 1992-12-18 JP JP5511941A patent/JPH06511099A/en active Pending
- 1992-12-18 WO PCT/US1992/011283 patent/WO1993013475A1/en not_active Application Discontinuation
- 1992-12-18 CA CA002126754A patent/CA2126754A1/en not_active Abandoned
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP0249091A2 (en) * | 1986-06-12 | 1987-12-16 | International Business Machines Corporation | Parity spreading to enhance storage access |
Non-Patent Citations (1)
Title |
---|
PERFORMANCE EVALUATION REVIEW vol. 18, no. 1, May 1990, CA,US pages 74 - 85 CHEN ET AL 'AN EVALUATION OF REDUNDANT ARRAYS OF DISKS USING AN AMDAHL 5890' * |
Cited By (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5644697A (en) * | 1992-08-26 | 1997-07-01 | Mitsubishi Denki Kabushiki Kaisha | Redundant array of disks with improved storage and recovery speed |
EP0584804A3 (en) * | 1992-08-26 | 1994-08-10 | Mitsubishi Electric Corp | Redundant array of disks with improved storage and recovery speed |
EP0584804A2 (en) * | 1992-08-26 | 1994-03-02 | Mitsubishi Denki Kabushiki Kaisha | Redundant array of disks with improved storage and recovery speed |
US5517632A (en) * | 1992-08-26 | 1996-05-14 | Mitsubishi Denki Kabushiki Kaisha | Redundant array of disks with improved storage and recovery speed |
EP0670553A1 (en) * | 1992-09-02 | 1995-09-06 | Aton Systemes S.A. | Procedure for interleaved data transfers between a computer memory and peripheral equipment comprising a management system and several storage units |
EP0701716A4 (en) * | 1993-06-03 | 1999-11-17 | Network Appliance Corp | A method for allocating files in a file system integrated with a raid disk sub-system |
EP0701716A1 (en) * | 1993-06-03 | 1996-03-20 | Network Appliance Corporation | A method for allocating files in a file system integrated with a raid disk sub-system |
US6038570A (en) * | 1993-06-03 | 2000-03-14 | Network Appliance, Inc. | Method for allocating files in a file system integrated with a RAID disk sub-system |
US5948110A (en) * | 1993-06-04 | 1999-09-07 | Network Appliance, Inc. | Method for providing parity in a raid sub-system using non-volatile memory |
US5963962A (en) * | 1995-05-31 | 1999-10-05 | Network Appliance, Inc. | Write anywhere file-system layout |
US6636879B1 (en) | 2000-08-18 | 2003-10-21 | Network Appliance, Inc. | Space allocation in a write anywhere file system |
US6728922B1 (en) | 2000-08-18 | 2004-04-27 | Network Appliance, Inc. | Dynamic data space |
US7072916B1 (en) | 2000-08-18 | 2006-07-04 | Network Appliance, Inc. | Instant snapshot |
US7930326B2 (en) | 2000-08-18 | 2011-04-19 | Network Appliance, Inc. | Space allocation in a write anywhere file system |
WO2002029539A2 (en) * | 2000-10-02 | 2002-04-11 | Sun Microsystems, Inc. | A data storage subsystem including a storage disk array employing dynamic data striping |
WO2002029539A3 (en) * | 2000-10-02 | 2003-08-14 | Sun Microsystems Inc | A data storage subsystem including a storage disk array employing dynamic data striping |
US6745284B1 (en) | 2000-10-02 | 2004-06-01 | Sun Microsystems, Inc. | Data storage subsystem including a storage disk array employing dynamic data striping |
US6658528B2 (en) * | 2001-07-30 | 2003-12-02 | International Business Machines Corporation | System and method for improving file system transfer through the use of an intelligent geometry engine |
Also Published As
Publication number | Publication date |
---|---|
CA2126754A1 (en) | 1993-07-08 |
AU3424993A (en) | 1993-07-28 |
JPH06511099A (en) | 1994-12-08 |
EP0619896A1 (en) | 1994-10-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US5333305A (en) | Method for improving partial stripe write performance in disk array subsystems | |
US5522065A (en) | Method for performing write operations in a parity fault tolerant disk array | |
EP0426185B1 (en) | Data redundancy and recovery protection | |
US5206943A (en) | Disk array controller with parity capabilities | |
EP0428021B1 (en) | Method for data distribution in a disk array | |
EP0768607B1 (en) | Disk array controller for performing exclusive or operations | |
US5822584A (en) | User selectable priority for disk array background operations | |
US5961652A (en) | Read checking for drive rebuild | |
US6018778A (en) | Disk array controller for reading/writing striped data using a single address counter for synchronously transferring data between data ports and buffer memory | |
US5720027A (en) | Redundant disc computer having targeted data broadcast | |
US5210860A (en) | Intelligent disk array controller | |
US6505268B1 (en) | Data distribution in a disk array | |
US5761526A (en) | Apparatus for forming logical disk management data having disk data stripe width set in order to equalize response time based on performance | |
US5694581A (en) | Concurrent disk array management system implemented with CPU executable extension | |
JP3247075B2 (en) | Parity block generator | |
WO1998000776A1 (en) | Cache memory controller in a raid interface | |
US5283880A (en) | Method of fast buffer copying by utilizing a cache memory to accept a page of source buffer contents and then supplying these contents to a target buffer without causing unnecessary wait states | |
WO1993013475A1 (en) | Method for performing disk array operations using a nonuniform stripe size mapping scheme | |
US6370616B1 (en) | Memory interface controller for datum raid operations with a datum multiplier | |
US6513098B2 (en) | Method and apparatus for scalable error correction code generation performance | |
WO1992004674A1 (en) | Computer memory array control |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AK | Designated states |
Kind code of ref document: A1 Designated state(s): AT AU BG BR CA CH CS DE DK ES FI GB HU JP KR NL NO PL RO RU SE |
|
AL | Designated countries for regional patents |
Kind code of ref document: A1 Designated state(s): AT BE CH DE DK ES FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN ML MR SN TD TG |
|
DFPE | Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101) | ||
WWE | Wipo information: entry into national phase |
Ref document number: 1993902805 Country of ref document: EP |
|
WWE | Wipo information: entry into national phase |
Ref document number: 2126754 Country of ref document: CA |
|
REG | Reference to national code |
Ref country code: DE Ref legal event code: 8642 |
|
WWP | Wipo information: published in national office |
Ref document number: 1993902805 Country of ref document: EP |
|
WWR | Wipo information: refused in national office |
Ref document number: 1993902805 Country of ref document: EP |
|
WWW | Wipo information: withdrawn in national office |
Ref document number: 1993902805 Country of ref document: EP |