WO2003050697A1 - Floating point intensive reconfigurable computing system for iterative applications - Google Patents

Floating point intensive reconfigurable computing system for iterative applications

Info

Publication number
WO2003050697A1
Authority
WO
WIPO (PCT)
Prior art keywords
instruction
floating point
computing system
processing elements
processing element
Prior art date
Application number
PCT/US2002/038645
Other languages
French (fr)
Inventor
Benjamin Bishop
Thomas P. Kelliher
Shrirang Madhav Yardi
Original Assignee
University Of Georgia
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University Of Georgia filed Critical University Of Georgia
Priority to EP20020795726 priority Critical patent/EP1451701A1/en
Priority to CA002468800A priority patent/CA2468800A1/en
Priority to AU2002360469A priority patent/AU2002360469A1/en
Publication of WO2003050697A1 publication Critical patent/WO2003050697A1/en
Priority to US10/862,269 priority patent/US20070067380A2/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 - Digital computers in general; Data processing equipment in general
    • G06F15/76 - Architectures of general purpose stored program computers
    • G06F15/80 - Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
    • G06F15/8007 - Single instruction multiple data [SIMD] multiprocessors
    • G06F15/8023 - Two dimensional arrays, e.g. mesh, torus

Definitions

  • the present invention is related in general to physical modeling of solid objects, and more particularly, to the use of specialized reconfigurable hardware architecture to provide acceleration for the execution of floating point intensive iterative applications.
  • the idea of deformable object modeling is to generate a realistic animation for an object when it deforms due to external forces.
  • Performance of deformable object modeling tends to be very poor on general purpose CPUs.
  • chip area is not well utilized during physical modeling computations. Much of the chip area is devoted to integer computation, instruction decoding and branching.
  • general purpose CPUs cannot dedicate all of their resources to physical modeling. The CPU must also deal with operating system overhead, input processing, sound processing, etc. A bus bottleneck also exists between the CPU and graphics hardware. Due to this performance problem, deformable object modeling has been used for off-line generation of animation sequences for movies. Deformable objects are typically modeled as a mass-spring system.
  • Hooke's Law is iteratively solved to update the simulation. Implicit integration is typically used due to stability concerns. This results in a system of equations that are then linearized. This system of linear equations can then be solved using an iterative solver such as conjugate gradient.
  • Ray tracing is also a very computationally intensive operation.
  • the basic idea is to model the behavior of individual rays of light in a three dimensional scene. Since there are a very large number of rays, the application is very computationally intensive. However, parallelism can be easily exploited.
  • Ray tracing differs from the techniques used in 3D accelerator cards in that it models light much more accurately. This leads to very realistic shadows, reflections, and lighting.
  • SIMD single instruction stream, multiple data stream
  • the basic concept of the invention is to construct a specialized hardware system in order to accelerate physical modeling and other floating point intensive iterative applications. This acceleration allows for the simulation of complex scenes at interactive speeds.
  • the invention is directed to a reconfigurable computing system for floating point intensive iterative applications.
  • the main objective of the architecture of the present invention is to achieve the highest performance at the lowest cost for iterative floating point intensive applications. Since the applications typically perform a large number of relatively simple iterations, it is possible using the system of the present invention to distribute computation to a large number of independent processing elements.
  • Each processing element (referred to herein as PE) is complex enough to handle significant precision of floating point numbers. It requires a modest control and data memory.
  • An efficient schedule of operations for each iteration can be determined a priori and stored locally in each processing element.
  • the reconfigurable computing system of the present invention includes a plurality of interconnected processing elements mounted on a custom printed circuit board (PCB), a host processing system (such as Linux) for displaying real-time outputs of the floating point calculations performed by the processing elements, and a bus interface (such as a PCI bus or AGP) for connecting the custom printed circuit board to the host system.
  • PCB custom printed circuit board
  • a host processing system such as Linux
  • a bus interface such as a PCI bus or AGP
  • Each of the interconnected processing elements includes a floating point functional unit, operand memory, control memory and a control unit.
  • the floating point unit provides floating point add, subtract, multiply, divide/reciprocate and multiply-accumulate operations. It provides suitable checks to detect overflow/underflow exceptions for each operation.
  • the local memory is divided into operand memory and control memory.
  • the operand memory includes a plurality of banks of static random access memory. In one embodiment, the operand memory is in the form of four banks of 128 x 32 SRAM, while the control memory is one bank of 128 x 40 SRAM. The use of local SRAM cells provides the required amount of high speed and bandwidth, giving high memory performance for the target application.
  • each PE contains a program counter (PC) and a communications register (COMM).
  • PC program counter
  • COMM communications register
  • the processing elements are interconnected using a nearest neighbor implementation (although a hierarchical implementation can also be used).
  • the instruction set performed by the floating point functional unit includes arithmetic, control and interconnect instructions.
  • the PCI bus interface can be implemented as an FPGA.
  • the device is not limited to PCI ("plug-and-play") connectivity or configuration.
  • An Accelerated Graphics Port (AGP) interface can be used in an alternate embodiment.
  • the alternate embodiment based on the same principles of operation and same inventive concept allows for full compatibility with AGP standards, including AGP8x. Data transfer rate is not limited by internal architecture, but rather by the choice of PCI or AGP connectivity.
  • Figs. 1A-1B illustrate a diagram of a processor architecture for a proof-of-concept system.
  • Fig. 2 illustrates an animation sequence generated using the processor architecture hardware of Fig. 1.
  • Fig. 3 illustrates a PCI board floor plan of an exemplary embodiment of the floating point intensive reconfigurable computer for iterative applications of the present invention.
  • Fig. 4 illustrates a high level overview of the system components in an exemplary embodiment of the present invention.
  • Fig. 5 illustrates the identification of processing elements via column and row number in accordance with an exemplary embodiment of the present invention.
  • Fig. 6 illustrates a diagram of a custom-integrated circuit processing element in accordance with an exemplary embodiment of the present invention.
  • Fig. 7 illustrates the traversal path for a control word in override mode in accordance with an exemplary embodiment of the present invention.
  • Figs. 8A-8B illustrate the components of a processing element and the data flow through the processing element in accordance with an exemplary embodiment of the present invention.
  • a proof-of-concept system was first constructed to implement a mass-spring deformable object simulation. This system uses a forward Euler solver. The proof-of-concept system was implemented on a high-density Altera EPF10K250 FPGA (250k gates) used in conjunction with a custom circuit board and specialized memory. This field programmable gate array (FPGA) was placed on the custom-printed circuit board (PCB), which was then connected to a host machine for graphic output via a parallel cable.
  • FPGA field programmable gate array
  • Fig. 2 shows an animation sequence that was generated on the hardware system.
  • the performance of this system can be examined in comparison to existing general-purpose machines.
  • a Pentium II/300 MHz machine can achieve 0.32M iterations per second.
  • the proof-of-concept system could achieve 30M iterations per second. This represents a 92X speedup over the general-purpose machine. This would be even higher for 3D simulation, since the pipeline can exploit greater parallelism.
  • the custom integrated circuits can be fabricated using Taiwan Semiconductor Manufacturing Corp. (TSMC) design technology available through MOSIS.
  • MOSIS is a non-profit microelectronics broker providing low-cost prototyping and small-volume production service for VLSI circuit development.
  • Several of these custom integrated chips populate a custom printed circuit board.
  • the custom printed circuit board is connected to a LINUX host system through a Peripheral Component Interconnect (PCI) bus 15 (Fig. 3).
  • PCI Peripheral Component Interconnect
  • the bus interface can be implemented as an FPGA using a PCI core.
  • the PCI bus interface increases the design complexity, but it is necessary in order to supply simulation results to the host quickly enough to achieve real-time display updates.
  • Fig. 3 shows a diagram of the printed circuit board organization using a PCI bus interface. It depicts the FPGA PCI interface 10 and four processing element array integrated circuits 20.
  • An AGP interface can also be used in an alternate exemplary embodiment.
  • DIME direct memory execute
  • A high level overview of the entire system of the present invention is illustrated in Fig. 4.
  • the system includes a known-good host system 90, a controller FPGA 92 and the custom PE array 94 on a custom printed circuit board 96.
  • the system has only two global signals: clock and PC-enable.
  • PEs 20 identify themselves by column and row, as illustrated in Fig. 5.
  • a simple locking circuit is used to prevent two adjacent PEs from performing interconnect stores to each other simultaneously, possibly burning out transistors. All that is required is to ensure that no two adjacent PEs are executing store instructions simultaneously. If this condition occurs, the write buffers of the PEs are disabled and a flag is set. At the end of program execution, the flag is examined. If it is clear, then all went well. If it is set, a simulator is used to determine precisely where the program faltered. It is possible to perform a more specific check within the array hardware itself, but this would unnecessarily increase the complexity of the PE array.
  • each processing element 20 includes a floating point functional unit 22, operand memory 24, control memory 26, and a simple control unit 28 on one integrated circuit (i.e., a single chip) .
  • the floating point functional unit also includes a multiply accumulate (MAC) function. Although there is also a divide/reciprocation function, simulations show that the divide operation is rare.
  • MAC multiply accumulate
  • local SRAM is used for operand and control storage. Distributing a large amount of high speed, high bandwidth SRAM such as this offers very high memory performance for the target operation.
  • the control memory 26 is a separate storage and relatively small in comparison (128 x 40 SRAM). The maximum instruction length is 40 bits.
  • the size of on-board RAM is not critical for performance. The device can function faster by replacing the type of RAM used (from SIMM to DIMM) at the same size of on-board memory.
  • SIMM and DIMM are acronyms for single in-line memory module and dual in-line memory module, respectively. They are small circuit boards that can hold a group of memory chips. The bus from the SIMM to the actual memory chips is 32 bits wide. DIMM provides a 64-bit bus.
  • DRAM Dynamic random access memory
  • Static RAMs do not require refresh circuitry as do dynamic RAMs, but they do take up more space and require more power.
  • SRAM was chosen for robustness. Specifically, SRAM was chosen because of the concern about noise affecting sensing on the memory bitlines.
  • interconnect 32 is more limited than within the IC. A large number of 32-bit busses will cause packaging difficulties, increase chip-to-chip delay due to routing congestion, and increase the power dissipated in the chip I/O pads.
  • the interconnect method 18 can be more flexible.
  • the primary concern is to allow for efficient communication without significantly increasing area overhead. Delay is not expected to be a problem in the present invention since the operand bitlines have a higher capacitive load.
  • the main interconnect options for this invention are a nearest neighbor style and a hierarchical approach.
  • the processing elements 20 can be efficiently connected to their cardinal neighbors (north, south, east, west). This results in a simple and regular layout as illustrated in Fig. 6.
  • in a nearest neighbor interconnect strategy, a number of options exist for treating boundaries (torus, wrap, etc.); however, in this design, wiring complexity can be reduced by not allowing on-chip boundary communication.
  • the position of each PE is hard-coded in terms of its (X, Y) coordinates and every PE is identified by its coordinates.
  • a hierarchical or tree-based approach offers the advantage of faster communication between distant neighbors. This interconnection method is commonly used in modern FPGAs. Hierarchical communication tends to be necessary when the number of processing elements is large, as in modern FPGA-based systems which are fine-grained in terms of the logic blocks used. In the present invention, a hierarchical interconnect approach leads to unnecessary overhead.
  • in each method, a processing element 20 must be allowed to quickly broadcast its data to all other processing elements. This rules out a one-dimensional interconnection method. However, broadcasts can be formed over multiple cycles by rippling through a nearest neighbor network, for example.
  • Clock distribution is also an important consideration. The primary concern is to minimize the probability of a system-crippling design error. Since phase-locked loop (PLL) design is a fairly complex and error-prone process, it may be safer to combine two or more out-of-phase board level clock signals in the IC.
  • PLL phase- locked loop
  • the PE supports a novel I/O scheme for loading the PE with programs and data and retrieving results from the PE array. Therefore, every PE operates in two modes: I/O mode (also called override mode) and compute mode.
  • a global PC-enable signal controls the mode of operation.
  • every PE has two different instruction classes for each mode.
  • the instruction format and instruction sets for both compute mode and override mode are provided in greater detail below.
  • In the override mode while loading the PEs, each PE receives a 56-bit control word from its north and west neighbors, possibly modifies the word, and passes it on to its south and east neighbors.
  • Fig. 7 illustrates the traversal path for a control word in override mode. This is accomplished by using the "PUT" command to write to the control memory (program), operand memory (data) or the program counter (PC).
  • a "LOOK-UP" instruction is sent to the northwest corner PE. This command fans out through the array. Every PE receiving this command compares its coordinates with the PE field of the instruction. If there is no match, then the instruction is simply passed on. If there is a match, then the PE performs the indicated SRAM (operand or control) read and inserts the results into a Result Response, which propagates through the array and eventually appears at the southeast corner from where it can be read.
  • the entire array thus depends on only two global signals, i.e., PC-enable and clock.
  • each PE starts execution by reading the instruction pointed to by the program counter.
  • the instruction set provides for ADD, SUB, MULT and DIV, which are two operand instructions, and MAC, which is a three operand instruction.
  • the arithmetic class also provides NOP, which is a zero operand instruction.
  • Each operand field independently specifies an operand SRAM bank (2 bits) and a location within the bank (7 bits).
  • Control instructions consist of conditional branching which is provided by the BLZ instruction.
  • Interconnect instructions deal with the COMM register.
  • the LOAD instruction enables the PE to read from the COMM register of any of its four neighbors.
  • the LOAD instruction uses the DIR field (north, south, east, or west) to load the given operand location from the given direction.
  • the STORE instruction is used by a PE to place data into the COMM register for its neighbors to read.
  • the STORE instruction uses the DIR field to enable its write buffers in the given direction and reads its data from the given operand location.
  • the array of processing elements has two modes: compute and override.
  • in compute mode, each PE executes an instruction stream as defined by its control memory and program counter (referred to herein as PC).
  • in override mode, the override instructions are streamed through the array from the upper-left-hand corner (PE 0, 0) toward the lower-right-hand corner (PE (n-1), (n-1)).
  • in override mode, each PE forms its next override word by a bit-wise OR operation on the override messages received from its neighbors to the north and west. Simultaneously, it transmits its current override word to its neighbors to the south and east.
  • the override instructions are used for array I/O.
  • the current array mode is determined by a global PC-enable signal.
  • when PC-enable is 1, the array is in compute mode. Otherwise, it is in override mode. Unconnected communication input lines on the edges of the array are tied to ground.
  • Each PE has four operand banks; each bank contains 128 32-bit words.
  • Each PE's control memory is a single bank of 128 40-bit words. No instruction may use an operand bank more than once (each bank has a single read and a single write port).
  • Each PE contains a 32-bit floating point unit which is capable of performing floating point add, subtract, multiply, divide and multiply-and- accumulate (MAC) operations.
  • MAC multiply-and- accumulate
  • the compute mode instruction format is provided in Table 2. Operands are specified with a two-dimensional address.
  • the operand format is provided in Table 3. Therefore, the compute mode instruction is a 40-bit word which specifies the opcode and the banks and the offsets of each operand. Again, note that no instruction may use an operand bank more than once, hence R1, R2, R3, R4 each point to a different operand memory bank.
  • The compute mode instructions are shown in Tables 4-7. Arithmetic instructions are provided in Table 4. Control instructions are provided in Table 5. If the word stored in R1 is negative, then the PC is loaded with the value stored in the seven rightmost bits of R4. PE-PE communication instructions must specify a direction as well as a command. Adjacent PEs may read each other's COMM register during the same clock cycle. This allows for full-duplex communication. Nearest neighbor compass direction encodings are provided in Table 6. The direction is stored in the two right-most bits of R1. The instructions for communication between processing elements are provided in Table 7.
  • the override mode instruction format is provided in Table 8.
  • the format of the Location field is provided in Table 9.
  • the X-Coordinate increases to the east; the Y-Coordinate increases to the south.
  • the memory banks (accessed by bank and offset) are defined in Table 10.
  • the override mode instruction set is provided in Table 11.
  • Put instructions targeting an operand memory must right-align the 32-bit datum within the 40-bit NAL field.
  • PEs responding to a lookup instruction reading an operand memory must also right-align the 32-bit datum.
  • decimal values stored to operand memories will be stored as single precision floating point values.
  • Hexadecimal values will be stored to operand memories without conversion. Only unsigned hexadecimal values should be used.
  • control signal equations are provided in pseudo-code.
  • the fields for the overword are identified in Table 12.
  • Overword[55..0] is obtained by "OR"-ing overwords from West and North neighbors.
  • the fields for control memory word are identified in Table 13.
  • the control memory word is a 40-bit word read from control memory.
  • PEHit = 1 in the override mode if XCoord and YCoord refer to the coordinates of the current PE.
  • OMxAddrSel is used as a select line for the multiplexer which selects the address for bank x.
  • OMxOE_n is the active low output enable signal for bank x. It is de-asserted during a write operation in compute mode and for a "PUT" instruction in override mode (which is again a write).
  • OMxWrt_n is the active low write enable signal for bank x. It is active for an operand memory bank write during override mode (a "PUT" instruction) or compute mode (writing the result back to R4). In all other cases it is de-asserted.
  • OMxWS_n is the write strobe, which is controlled by the global write strobe (WS_n). The write is executed only when the write strobe is asserted.
  • OMxWS_n = WS_n | OMxWrt_n, where x = 0, 1, 2, or 3.
  • OMxCE_n is the active low chip enable signal for bank x. It is active at all times in compute mode. In override mode it is asserted for a "PUT" or "LOOKUP" instruction.
  • ResultSel is a select line used to select the correct output from the various sub-units of the functional unit.
  • R1SrcSel = !PCEnable: used to select the correct address for R1 (either CM.R1Bank or OWC.Bank).
  • SignBit = R1[31:31]: the sign of the data stored at R1, used to make the branching decision.
  • PCInSel = !PCEnable: used to select the correct input for the program counter.
  • Control Memory signals - The functions of the following signals are similar to the operand memory signals.
  • CMAddrSel = !PCEnable;
  • CMWS_n = WS_n
  • OWrdOutSel acts as the select line to select the final result. This is combined with the control information to form the output Ovrword.
  • OWC[55:55] is the "FOUNDIT" signal (refer to instruction set) which changes the opcode to 11 when the value is found.
  • OWC[55:55] = OWC[55:55] | ((PEHit && OWC.Op == 1 && (OWC.Bank ...
  • FIGs. 8A-8B illustrate the components of a processing element and the data flow through the processing element
  • the inputs to each PE are:
  • the PE can be divided into various blocks, namely OvrWrd block, PC block, CM block, Functional Unit and Output block.
  • OvrWrd Block - The OvrWrd block consists of an OR gate and the OvrWrd register. The inputs WOvrWrd and NOvrWrd are ORed, and the result acts as the OvrWrd for the current PE. This is stored in the OvrWrd register after being split into its components.
  • the upper 16 bits of the OvrWrd (i.e., bits 55 to 40) are labeled OvrWrdCtrl, and the lower 40 bits (bits 39 to 0) are labeled OvrWrdData. OvrWrdCtrl specifies the opcode and bank for the PE, and OvrWrdData is the data or value used in that instruction.
  • PC Block - This consists of the Program Counter (PC) and the supporting logic for reading, writing and incrementing the PC.
  • PC Program Counter
  • in compute mode, the PC is automatically incremented after every instruction to point to the next instruction in the control memory. The PC is rewritten with a new value when a branch instruction causes control to branch to a different address.
  • CM Block - The Control Memory (CM) block consists of a single 128 x 40 SRAM bank. OvrWrdData is written into the CM at the address pointed to by OwdCtrlAddr.
  • the CM is such that when CMWS_n (CM write strobe) and CMWrt_n (CM write) are low and CMOE_n (CM output enable) is high, data is written into the CM.
  • Functional Unit - This consists of the 32-bit floating point unit with MAC support, 4 banks of 128 x 32 operand memories, and a Result multiplexer to select the correct result to be written back to the memory.
  • Output Block - This consists of the "Found-It" logic and the output multiplexer.
  • the "Found-It" logic operates in the override mode and sets the MSB of the OvrWrd to 1 if a look-up instruction is successful.
  • the output multiplexer is used to select the correct output from the functional unit.
  • PCInSel = 0, so the input to the PC is OvrWrdData[6..0].
  • OwdCtrlAddr (which in this case is 0).
  • OvrWrd Block - Since the PE is in compute mode, the OvrWrd block is disabled.
  • CMAddrSel = 0, so the address given to the CM is the PC output.
  • the instruction is read from address (0000000)b of the CM. It is then decoded (the decoding is logical, is part of the VHDL code, and is not shown in the diagram) to give the values of CM.Op (the opcode), CM.R1Bank, CM.R1Addr, CM.R2Bank, CM.R2Addr, CM.R3Bank, CM.R3Addr, CM.R4Bank and CM.R4Addr.
  • OM0 as D0, OM1 as D1, and OM2 as D2.
  • (R2) is D1.
  • R2orR3 = D2.
  • R1 and R2 are provided as inputs to the multiplier block (X), and the inputs to the add/subtract block (+/-) are the output of the multiplexer and R2orR3.
  • the small delay provided by the +/- input multiplexer causes the correct input to be given to the add/subtract block.
  • the load and store instructions are slightly different.
  • CM.R1Addr[1..0] causes the multiplexer to select the data from the correct neighbor and this data is then written back to the address given by the R4 field.
  • A signal which is written as SIGNAL_n denotes an active-low signal.
  • the global signals Clk, PCEnable, XCoord and YCoord are not shown connected to any component to keep the diagram as simple as possible.
  • Clk is connected to every sequential component and PCEnable is connected to every component.
  • the XCoord and YCoord signals are connected to a logic block which is used to identify the PE.
  • Some inputs/outputs of the peripheral PEs which are not used are connected to ground. For example, for the PEs in the rightmost column, there is no east neighbor, hence their EOvrWrdOut/ECommOut lines are connected to ground.
  • the present invention attempts to pack as much computation into as small a space as possible. This computational density should lead to a higher level of switching activity than what is seen in general-purpose processors. Therefore, power consumption and heat generation may become problematic. For example, heat generation can constrain the clock rate. It may be possible to apply a number of techniques in order to reduce the effects of high switching activity. These techniques include heat tolerant packaging, low-swing interconnect, and possibly scheduling operations in order to reduce switching activity.

Abstract

A reconfigurable computing system for accelerating execution of floating point intensive iterative applications. The reconfigurable computing system includes a plurality of interconnected processing elements (20) mounted on a printed circuit board, a host processing system for displaying real-time outputs of the floating point calculations performed by the processing elements (20), and an interface for connecting the processing elements to the host system. Each of the interconnected processing elements (20) includes a floating point functional unit (22), operand memory (24), control memory (26) and a control unit (28). The floating point functional unit (22) includes a multiply accumulate function. The operand memory (24) includes a plurality of banks of static random access memory. The processing elements (20) are interconnected using a nearest neighbor or hierarchical implementation. The instruction set performed by the floating point functional unit (22) includes arithmetic, control and communication instructions. The interface can be implemented as a PCI bus interface using a field programmable gate array or as an AGP bus interface.

Description

FLOATING POINT INTENSIVE RECONFIGURABLE COMPUTING SYSTEM FOR ITERATIVE APPLICATIONS
Background of the Invention
[0001] The present invention is related in general to physical modeling of solid objects, and more particularly, to the use of specialized reconfigurable hardware architecture to provide acceleration for the execution of floating point intensive iterative applications.
[0002] There are a number of application areas which have characteristics such as intensive floating point operations, a simple but large number of iterations requiring large computational power, and support for significant parallelism. These characteristics make the prospect of using special-purpose hardware to speed up the computation for such applications particularly attractive. Specialized hardware can make it easy to efficiently exploit parallelism, and given that the iterations are relatively simple, it is easy to unroll the iteration into hardware. Using a field-programmable gate array ("FPGA") to implement the specialized hardware would be ideal; however, there are well-known efficiency problems with the FPGA implementation of floating point arithmetic.
[0003] Recently, there has been a great deal of interest in physical-based modeling from the graphics research community. In the context of graphics, physical modeling is used to simulate the behavior of objects for the purpose of animation. Physical modeling applies to a broad range of applications with highly variable requirements.
[0004] Much of the previous work in physical modeling has come from groups interested in generating animation sequences for use in the film industry. However, there are many untapped, emerging application areas such as warfare simulation, interactive entertainment, and virtual reality. It is important to note that the requirements for these applications differ greatly. In physical modeling for film, interactivity is not required and a large number of machines could potentially be used. On the other hand, consumer grade electronic entertainment requires interactivity and very low hardware costs.
[0005] In physical modeling of solid objects, a number of simulation techniques exist. In rigid body simulation, objects are not allowed to deform. This lack of deformation allows a great deal of pre-computation, which makes interactive simulation possible for certain scenes. However, the lack of deformation can be an important limitation. Deformable object techniques allow for the simulation of a much wider variety of materials, but have very high computational requirements.
[0006] The idea of deformable object modeling is to generate a realistic animation for an object when it deforms due to external forces. Performance of deformable object modeling tends to be very poor on general purpose CPUs. For example, chip area is not well utilized during physical modeling computations. Much of the chip area is devoted to integer computation, instruction decoding and branching. In addition, general purpose CPUs cannot dedicate all of their resources to physical modeling. The CPU must also deal with operating system overhead, input processing, sound processing, etc. A bus bottleneck also exists between the CPU and graphics hardware. Due to this performance problem, deformable object modeling has been used for off-line generation of animation sequences for movies. Deformable objects are typically modeled as a mass-spring system. Then, Hooke's Law is iteratively solved to update the simulation. Implicit integration is typically used due to stability concerns. This results in a system of equations that are then linearized. This system of linear equations can then be solved using an iterative solver such as conjugate gradient.
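For concreteness, one common way to write this step (a generic formulation shown for illustration only; the symbols below are not defined in this document) is: Hooke's law gives the spring forces, a backward (implicit) Euler step couples the unknown new state to those forces, and linearizing the forces yields the sparse system handed to a conjugate gradient solver.

    f_{ij} = -k_{ij}\left(\lVert x_i - x_j \rVert - l_{ij}\right)\frac{x_i - x_j}{\lVert x_i - x_j \rVert}
    \qquad \text{(Hooke's law between particles } i \text{ and } j\text{)}

    v^{t+h} = v^{t} + h\,M^{-1} f\!\left(x^{t+h}, v^{t+h}\right), \qquad
    x^{t+h} = x^{t} + h\,v^{t+h}
    \qquad \text{(implicit Euler)}

    \left(M - h\,\frac{\partial f}{\partial v} - h^{2}\,\frac{\partial f}{\partial x}\right)\Delta v
    = h\left(f^{t} + h\,\frac{\partial f}{\partial x}\,v^{t}\right)
    \qquad \text{(linearized system solved each step)}

Here M is the mass matrix, h the time step, k the spring stiffnesses and l the rest lengths.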
[0007] In addition to solid objects, there is also interest in realistic animation of fluids and gases. The computational requirements of realistic fluid dynamics are similar to deformable object modeling in that they are far too high for interactive simulation on current general-purpose processors. Typically, fluid modeling is done using three-dimensional voxels (volume elements). Navier-Stokes equations are solved in advancing the simulation. Care must be taken to ensure the simulation remains stable. Using implicit integration, this results in a large, sparse linear system to be solved on every iteration, similar to what occurs in deformable object modeling.
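For reference, the incompressible Navier-Stokes equations advanced by such a solver are, in their standard form (notation not taken from this document):

    \frac{\partial \mathbf{u}}{\partial t} + (\mathbf{u}\cdot\nabla)\,\mathbf{u}
    = -\frac{1}{\rho}\nabla p + \nu\,\nabla^{2}\mathbf{u} + \mathbf{f},
    \qquad \nabla\cdot\mathbf{u} = 0

where u is the velocity field, p the pressure, ρ the density, ν the kinematic viscosity and f the external body forces. Discretizing these on a voxel grid with implicit integration again produces a large, sparse linear system per iteration.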
[0008] Ray tracing is also a very computationally intensive operation. The basic idea is to model the behavior of individual rays of light in a three dimensional scene. Since there are a very large number of rays, the application is very computationally intensive. However, parallelism can be easily exploited. Ray tracing differs from the techniques used in 3D accelerator cards in that it models light much more accurately. This leads to very realistic shadows, reflections, and lighting.
[0009] The common characteristic of all these applications is the large number of computationally intensive operations at each iteration. General purpose CPUs perform very poorly in such scenarios. Hence, such systems are incapable of providing interactive graphics processing and most of the processing needs to be done off-line. The modeling of such systems provided by 3-D accelerator cards is sometimes not realistic and is much less accurate for scientific applications.
[0010] As the above examples illustrate, there are a number of applications, especially in the domain of graphics and animation that share many important characteristics such as: (1) floating point intensive; (2) relatively simple iterations; (3) very computationally demanding; and (4) significant parallelism. Since these applications are computationally intensive (i.e., far from interactive performance on general purpose CPUs for non-trivial simulations), efficiently exploiting parallelism through the use of specialized hardware is attractive.
[0011] Rather than building an application specific integrated circuit (ASIC) for one particular algorithm, it is desirable to have a reconfigurable system that can execute many different algorithms. The main motivation behind making such a system reconfigurable is to support a wide variety of applications. For example, in the areas of graphics and image processing, different kinds of algorithms are needed for a single application, and rather than building an ASIC for only a particular algorithm, it is desirable to have a system which can be reconfigured to execute different algorithms. This would allow for a number of simulation techniques to be used. Using an FPGA to implement such specialized hardware would be ideal, since FPGAs are off-the-shelf building blocks which would make the system very cheap. Unfortunately, there have been well-known efficiency problems when using FPGAs to implement floating point arithmetic. This is mainly a result of the demand for interconnect when aligning/normalizing.
[0012] One important development in the area of physical modeling on general-purpose hardware is the introduction of special streaming floating-point instructions on processors from Intel (e.g., SSE) and AMD (e.g., 3DNow!). These extensions add a number of single instruction stream, multiple data stream (SIMD) floating point instructions, allowing for additional performance when performing low-precision floating point operations. SIMD is a computer architecture that performs one operation on multiple sets of data, e.g., an array processor. However, this approach results in a rather incremental performance improvement, and compiler support remains problematic. The performance figures achieved by these systems are nowhere near those provided by fully specialized systems.
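As a minimal illustration of the SIMD style being described (a sketch using the x86 SSE intrinsics; this code is not part of the patent), a single packed instruction operates on four single-precision values at once:

    #include <xmmintrin.h>  /* x86 SSE intrinsics */

    /* Add two arrays of four floats with one packed-add instruction. */
    void add4(const float *a, const float *b, float *out)
    {
        __m128 va = _mm_loadu_ps(a);      /* load four floats              */
        __m128 vb = _mm_loadu_ps(b);
        __m128 vr = _mm_add_ps(va, vb);   /* one operation, four data sets */
        _mm_storeu_ps(out, vr);           /* store four results            */
    }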
[0013] A number of new 3D graphics accelerator cards are now offering some onboard programmability. In the future, it may be possible to perform physical modeling directly on the graphics card, which would avoid bus bottlenecks, and would potentially offer higher floating point performance. However, this scheme too is unlikely to offer orders of magnitude improvement in floating point performance. [0014] In specialized hardware for iterative floating point codes, there is little prior work. The GRAPE (Gravity Pipe) is a machine for performing N-body gravity computations that are of use to astronomers. The GRAPE project at the University of Tokyo has evolved over several generations. The current GRAPE-6 system is able to achieve a peak performance of 100 trillion floating point operations per second (TFLOPS). The Pixel Flow system developed by the University of North Carolina, Chapel Hill, is a scalable machine for real-time advanced graphics rendering. The core idea of the Pixel Flow project is to accelerate rendering by assigning each piece of the final image to a specialized processing element.
Summary of the Invention
[0015] The basic concept of the invention is to construct a specialized hardware system in order to accelerate physical modeling and other floating point intensive iterative applications. This acceleration allows for the simulation of complex scenes at interactive speeds.
[0016] The invention is directed to a reconfigurable computing system for floating point intensive iterative applications. The main objective of the architecture of the present invention is to achieve the highest performance at the lowest cost for iterative floating point intensive applications. Since the applications typically perform a large number of relatively simple iterations, it is possible using the system of the present invention to distribute computation to a large number of independent processing elements. Each processing element (referred to herein as PE) is complex enough to handle significant precision of floating point numbers. It requires a modest control and data memory. An efficient schedule of operations for each iteration can be determined a priori and stored locally in each processing element.
[0017] The reconfigurable computing system of the present invention includes a plurality of interconnected processing elements mounted on a custom printed circuit board (PCB), a host processing system (such as Linux) for displaying real-time outputs of the floating point calculations performed by the processing elements, and a bus interface (such as a PCI bus or AGP) for connecting the custom printed circuit board to the host system. Using a parallel interface will enable the system to provide the results to the host quickly enough to achieve real-time simulation updates.
[0018] Each of the interconnected processing elements includes a floating point functional unit, operand memory, control memory and a control unit. The floating point unit provides floating point add, subtract, multiply, divide/reciprocate and multiply-accumulate operations. It provides suitable checks to detect overflow/underflow exceptions for each operation. The local memory is divided into operand memory and control memory. The operand memory includes a plurality of banks of static random access memory. In one embodiment, the operand memory is in the form of four banks of 128 x 32 SRAM, while the control memory is one bank of 128 x 40 SRAM. The use of local SRAM cells provides the required amount of high speed and bandwidth, giving high memory performance for the target application. In addition, each PE contains a program counter (PC) and a communications register (COMM). [0019] The processing elements are interconnected using a nearest neighbor implementation (although a hierarchical implementation can also be used). The instruction set performed by the floating point functional unit includes arithmetic, control and interconnect instructions. The PCI bus interface can be implemented as an FPGA. The device is not limited to PCI ("plug-and-play") connectivity or configuration. An Accelerated Graphics Port (AGP) interface can be used in an alternate embodiment. The alternate embodiment, based on the same principles of operation and the same inventive concept, allows for full compatibility with AGP standards, including AGP8x. Data transfer rate is not limited by internal architecture, but rather by the choice of PCI or AGP connectivity.
Description of the Drawings
[0020] The invention is better understood by reading the following detailed description of an exemplary embodiment in conjunction with the accompanying drawings. [0021] Figs. 1A-1B illustrate a diagram of a processor architecture for a proof-of-concept system. [0022] Fig. 2 illustrates an animation sequence generated using the processor architecture hardware of Fig. 1. [0023] Fig. 3 illustrates a PCI board floor plan of an exemplary embodiment of the floating point intensive reconfigurable computer for iterative applications of the present invention. [0024] Fig. 4 illustrates a high level overview of the system components in an exemplary embodiment of the present invention. [0025] Fig. 5 illustrates the identification of processing elements via column and row number in accordance with an exemplary embodiment of the present invention. [0026] Fig. 6 illustrates a diagram of a custom-integrated circuit processing element in accordance with an exemplary embodiment of the present invention. [0027] Fig. 7 illustrates the traversal path for a control word in override mode in accordance with an exemplary embodiment of the present invention. [0028] Figs. 8A-8B illustrate the components of a processing element and the data flow through the processing element in accordance with an exemplary embodiment of the present invention.
Detailed Description of the Invention
[0029] The following description of the present invention is provided as an enabling teaching of the invention in its best, currently known embodiment. Those skilled in the relevant art will recognize that many changes can be made to the embodiment described, while still obtaining the beneficial results of the present invention. It will also be apparent that some of the desired benefits of the present invention can be obtained by selecting some of the features of the present invention without using other features. Accordingly, those who work in the art will recognize that many modifications and adaptations to the present invention are possible and may even be desirable in certain circumstances, and are a part of the present invention. Thus, the following description is provided as illustrative of the principles of the present invention and not in limitation thereof, since the scope of the present invention is defined by the claims.
[0030] In order to show that physical modeling performance improvements are possible through the use of specialized hardware, a proof-of-concept system was first constructed to implement a mass-spring deformable object simulation. This system uses a forward Euler solver. The proof-of-concept system was implemented on a high-density Altera EPF10K250 FPGA (250k gates) used in conjunction with a custom circuit board and specialized memory. This field programmable gate array (FPGA) was placed on the custom-printed circuit board (PCB), which was then connected to a host machine for graphic output via a parallel cable.
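The forward Euler update that the proof-of-concept pipeline unrolls can be sketched in software as follows (a hypothetical scalar reference in C; the data layout and the absence of damping are illustrative assumptions, not the patent's implementation):

    #include <math.h>

    typedef struct { float x[2], v[2], f[2], mass; } Particle;   /* 2D simulation */
    typedef struct { int a, b; float k, rest_len; } Spring;      /* Hooke spring  */

    /* One forward Euler time step of size h over a mass-spring system.
     * Assumes no spring ever has zero length.                              */
    void euler_step(Particle *p, int np, const Spring *s, int ns, float h)
    {
        for (int i = 0; i < np; i++) { p[i].f[0] = 0.0f; p[i].f[1] = 0.0f; }

        for (int j = 0; j < ns; j++) {                 /* accumulate spring forces */
            Particle *pa = &p[s[j].a], *pb = &p[s[j].b];
            float dx = pb->x[0] - pa->x[0], dy = pb->x[1] - pa->x[1];
            float len = sqrtf(dx * dx + dy * dy);
            float mag = s[j].k * (len - s[j].rest_len) / len;    /* Hooke's law */
            pa->f[0] += mag * dx;  pa->f[1] += mag * dy;
            pb->f[0] -= mag * dx;  pb->f[1] -= mag * dy;
        }

        for (int i = 0; i < np; i++) {                 /* explicit (forward) Euler */
            p[i].v[0] += h * p[i].f[0] / p[i].mass;
            p[i].v[1] += h * p[i].f[1] / p[i].mass;
            p[i].x[0] += h * p[i].v[0];
            p[i].x[1] += h * p[i].v[1];
        }
    }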
[0031] The organization of the specialized processor was pipeline oriented. The idea was to take the Euler function and basically unroll it into a pipeline on the FPGA. Figs. 1A-1B show a diagram of the solver pipeline. Table 1 below shows the number and type of functional units used in the processor.
TABLE 1
[0032] The biggest problem with the proof-of-concept system was utilization of the FPGA. This was a result of the well-known problems in implementing floating point arithmetic in programmable logic. An FPGA-based implementation makes it very difficult to implement barrel shifters due to FPGA routing problems. This leads to high area and low performance for normalization/alignment steps in floating point operations. In addition, it is impossible to implement the full collision detection/resolution and spring force calculation pipeline in hardware due to area constraints on the FPGA.
[0033] In order to compress the design so that it would fit on the FPGA, a number of simplifications were necessary. These simplifications included a removal of pipeline registers, two-dimensional simulation, reduced floating point precision (the mantissa was reduced), and simple collision detection. The final FPGA utilization was 79%.
[0034] After the simplifications were made, the proof-of-concept project was completed and successfully tested. Fig. 2 shows an animation sequence that was generated on the hardware system. The performance of this system can be examined in comparison to existing general-purpose machines. For the 2D simulation above, a Pentium II/300 MHz machine can achieve 0.32M iterations per second. Assuming pipelining and 300 MHz operations, the proof-of-concept system could achieve 30M iterations per second. This represents a 92X speedup over the general-purpose machine. This would be even higher for 3D simulation, since the pipeline can exploit greater parallelism.
[0035] The lessons learned on the construction of the developmental proof-of- concept system were applied in designing the present invention. The most important distinctions are that the present invention must implement an implicit solver (stability in forward Euler is very poor) and be constructed as a custom integrated circuit (IC).
[0036] The custom integrated circuits can be fabricated using Taiwan
Semiconductor Manufacturing Corp. (TSMC) design technology available through MOSIS. MOSIS is a non-profit microelectronics broker providing low- cost prototyping and small-volume production service for VLSI circuit development. Several of these custom integrated chips populate a custom printed circuit board. The custom printed circuit board is connected to a LINUX host system through a Peripheral Component Interconnect (PCI) bus 15 (Fig. 3). The bus interface can be implemented as an FPGA using a PCI core. The PCI bus interface increases the design complexity, but it is necessary in order to supply simulation results to the host quickly enough to achieve real-time display updates. Fig. 3 shows a diagram of the printed circuit board organization using a PCI bus interface. It depicts the FPGA PCI interface 10 and four processing element array integrated circuits 20.
[0037] An AGP interface can also be used in an alternate exemplary embodiment.
The most important feature of the AGP is probably direct memory execute (DIME). The DIME gives AGP chips the capability to access main memory directly for complex mapping operations. AGP is a dedicated connection that can only be used by the graphics subsystem.
[0038] A high level overview of the entire system of the present invention is illustrated in Fig. 4. The system includes a known-good host system 90, a controller FPGA 92 and the custom PE array 94 on a custom printed circuit board 96. The system has only two global signals: clock and PC-enable. PEs 20 identify themselves by column and row, as illustrated in Fig. 5. A simple locking circuit is used to prevent two adjacent PEs from performing interconnect stores to each other simultaneously, possibly burning out transistors. All that is required is to ensure that no two adjacent PEs are executing store instructions simultaneously. If this condition occurs, the write buffers of the PEs are disabled and a flag is set. At the end of program execution, the flag is examined. If it is clear, then all went well. If it is set, a simulator is used to determine precisely where the program faltered. It is possible to perform a more specific check within the array hardware itself, but this would unnecessarily increase the complexity of the PE array.
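The condition the locking circuit guards against can also be checked off-line. The sketch below is a hypothetical host-side (or simulator) check, not the hardware circuit itself; the array dimensions are illustrative:

    #include <stdbool.h>

    #define ROWS 8   /* illustrative array size */
    #define COLS 8

    /* stores[r][c] is true if PE (r, c) executes an interconnect STORE this
     * cycle.  Returns true if any two cardinal neighbors store at the same
     * time, i.e. the situation in which the hardware would set its flag.    */
    bool store_conflict(const bool stores[ROWS][COLS])
    {
        for (int r = 0; r < ROWS; r++)
            for (int c = 0; c < COLS; c++)
                if (stores[r][c] &&
                    ((c + 1 < COLS && stores[r][c + 1]) ||   /* east neighbor  */
                     (r + 1 < ROWS && stores[r + 1][c])))    /* south neighbor */
                    return true;
        return false;
    }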
[0039] As shown in Fig. 6, each processing element 20 includes a floating point functional unit 22, operand memory 24, control memory 26, and a simple control unit 28 on one integrated circuit (i.e., a single chip). The floating point functional unit also includes a multiply accumulate (MAC) function. Although there is also a divide/reciprocation function, simulations show that the divide operation is rare. [0040] In each processing element 20, there is local static random access memory
(SRAM) 24 for operand and control storage. Distributing a large amount of high speed, high bandwidth SRAM such as this offers very high memory performance for the target operation. In order to achieve optimal performance, the operand storage should contain four banks 24 of 128 32-bit words. The four banks allow for operation in the form A * B + C => D . The control memory 26 is a separate storage and relatively small in comparison (128 x 40 SRAM). The maximum instruction length is 40 bits. The size of on-board RAM is not critical for performance. The device can function faster by replacing the type of RAM used (from SIMM to DIMM) at the same size of on-board memory. SIMM and DIMM are acronyms for single in-line memory module and dual in-line memory module, respectively. They are small circuit boards that can hold a group of memory chips. The bus from the SIMM to the actual memory chips is 32 bits wide. DIMM provides a 64-bit bus.
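The memory organization just described can be pictured with a small C model (hypothetical and purely illustrative; the field names and the bank-to-operand assignment are assumptions): four 128-word operand banks supply the A * B + C => D pattern, and a separate 128 x 40-bit bank holds instructions.

    #include <stdint.h>

    #define BANKS      4
    #define BANK_WORDS 128

    typedef struct {
        float    operand[BANKS][BANK_WORDS]; /* four banks of 128 x 32-bit operands */
        uint64_t control[BANK_WORDS];        /* 128 x 40-bit instructions, kept in
                                                the low 40 bits of each word        */
        uint8_t  pc;                         /* program counter                     */
        float    comm;                       /* COMM register seen by neighbors     */
    } PE;

    /* One MAC in the A * B + C => D form, drawing each operand from a
     * distinct bank (which bank holds which operand is assumed here).       */
    void mac(PE *pe, uint8_t a_off, uint8_t b_off, uint8_t c_off, uint8_t d_off)
    {
        pe->operand[3][d_off] =
            pe->operand[0][a_off] * pe->operand[1][b_off] + pe->operand[2][c_off];
    }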
[0041] Since most of the processing element area is consumed by memory, memory cell area is of critical importance. Dynamic random access memory (DRAM) offers considerable advantages in this regard. Static RAMs do not require refresh circuitry as do dynamic RAMs, but they do take up more space and require more power. However, for the initial implementation of the invention, SRAM was chosen for robustness. Specifically, SRAM was chosen because of the concern about noise affecting sensing on the memory bitlines.
[0042] Since computation is distributed over a number of processing elements 20, efficient communication is very important. At the board level 30, interconnect 32 is more limited than within the IC. A large number of 32-bit busses will cause packaging difficulties, increase chip-to-chip delay due to routing congestion, and increase the power dissipated in the chip I/O pads.
[0043] Within the IC 20, the interconnect method 18 can be more flexible. The primary concern is to allow for efficient communication without significantly increasing area overhead. Delay is not expected to be a problem in the present invention since the operand bitlines have a higher capacitive load. The main interconnect options for this invention are a nearest neighbor style and a hierarchical approach.
[0044] Using a nearest neighbor interconnect strategy, the processing elements 20 can be efficiently connected to their cardinal neighbors (north, south, east, west). This results in a simple and regular layout as illustrated in Fig. 6. In a nearest neighbor interconnect strategy, a number of options exist for treating boundaries (torus, wrap, etc.); however, in this design, wiring complexity can be reduced by not allowing on-chip boundary communication. The position of each PE is hard-coded in terms of its (X, Y) coordinates and every PE is identified by its coordinates.
[0045] A hierarchical or tree-based approach offers the advantage of faster communication between distant neighbors. This interconnection method is commonly used in modern FPGAs. Hierarchical communication tends to be necessary when the number of processing elements is large, as in modern FPGA-based systems which are fine-grained in terms of the logic blocks used. In the present invention, a hierarchical interconnect approach leads to unnecessary overhead.
[0046] In each method, a processing element 20 must be allowed to quickly broadcast its data to all other processing elements. This rules out a one-dimensional interconnection method. However, broadcasts can be formed over multiple cycles by rippling through a nearest neighbor network, for example.
[0047] One additional interconnection problem is making sure that interconnect is not being driven by separate processing elements. This problem can be addressed by host-side software checking prior to execution. However, hardware locking is preferable and is discussed more fully below.
[0048] Clock distribution is also an important consideration. The primary concern is to minimize the probability of a system-crippling design error. Since phase-locked loop (PLL) design is a fairly complex and error-prone process, it may be safer to combine two or more out-of-phase board level clock signals in the IC.
[0049] The PE supports a novel I/O scheme for loading the PE with programs and data and retrieving results from the PE array. Therefore, every PE operates in two modes: I/O mode (also called override mode) and compute mode. A global PC-enable signal controls the mode of operation. The PE is in compute mode if PC-enable = 1 and is in the override mode when PC-enable = 0. Thus, every PE has two different instruction classes for each mode. The instruction format and instruction sets for both compute mode and override mode are provided in greater detail below. [0050] In the override mode, while loading the PEs, each PE receives a 56-bit control word from its north and west neighbors, possibly modifies the word, and passes it on to its south and east neighbors. Therefore, I/O commands fan-out in a 2-dimensional wave, starting from the northwest corner of the array, towards the southeast corner. Fig. 7 illustrates the traversal path for a control word in override mode. This is accomplished by using the "PUT" command to write to the control memory (program), operand memory (data) or the program counter (PC). To read data from the PE array, a "LOOK-UP" instruction is sent to the northwest corner PE. This command fans out through the array. Every PE receiving this command compares its coordinates with the PE field of the instruction. If there is no match, then the instruction is simply passed on. If there is a match, then the PE performs the indicated SRAM (operand or control) read and inserts the results into a Result Response, which propagates through the array and eventually appears at the southeast corner from where it can be read.
[0051 ] This scheme dramatically reduces the number of pins required for the chip.
The entire array thus depends on only two global signals, i.e., PC-enable and clock.
[0052] In the compute mode, each PE starts execution by reading the instruction pointed to by the program counter. There are three classes of instructions in this mode: arithmetic, control and interconnect. In the arithmetic class, the instruction set provides for ADD, SUB, MULT and DIV, which are two operand instructions, and MAC, which is a three operand instruction. The arithmetic class also provides NOP, which is a zero operand instruction. [0053] Each operand field independently specifies an operand SRAM bank (2 bits) and a location within the bank (7 bits). Control instructions consist of conditional branching which is provided by the BLZ instruction.
[0054] Interconnect instructions deal with the COMM register. The LOAD instruction enables the PE to read from the COMM register of any of its four neighbors. The LOAD instruction uses the DIR field (north, south, east, or west) to load the given operand location from the given direction. The STORE instruction is used by a PE to place data into the COMM register for its neighbors to read. The STORE instruction uses the DIR field to enable its write buffers in the given direction and reads its data from the given operand location.
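Behaviorally, the two interconnect instructions can be modeled as follows (a hypothetical software sketch; the struct, field names and direction encoding are assumptions made only for illustration):

    #include <stddef.h>

    typedef enum { NORTH, SOUTH, EAST, WEST } Dir;

    typedef struct PEIO {
        float         comm;              /* COMM register visible to neighbors  */
        int           wbuf_enabled[4];   /* write-buffer enables, per direction  */
        float         operand[4][128];   /* operand banks                        */
        struct PEIO  *nbr[4];            /* N/S/E/W neighbors, NULL at edges     */
    } PEIO;

    /* STORE dir, (bank, off): publish an operand word in COMM and enable the
     * write buffer toward the given direction so that neighbor may read it.  */
    void pe_store(PEIO *pe, Dir dir, int bank, int off)
    {
        pe->comm = pe->operand[bank][off];
        pe->wbuf_enabled[dir] = 1;
    }

    /* LOAD dir, (bank, off): read the COMM register of the neighbor in the
     * given direction into the given operand location of this PE.            */
    void pe_load(PEIO *pe, Dir dir, int bank, int off)
    {
        if (pe->nbr[dir] != NULL)
            pe->operand[bank][off] = pe->nbr[dir]->comm;
    }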
Processing Element Cell Design
[0055] The array of processing elements has two modes: compute and override. In compute mode, each PE executes an instruction stream as defined by its control memory and program counter (referred to herein as PC). In override mode, the override instructions are streamed through the array from the upper-left-hand corner (PE 0, 0) toward the lower-right-hand corner (PE (n-1), (n-1)). In override mode, each PE forms its next override word by a bit-wise OR operation on the override messages received from its neighbors to the north and west. Simultaneously, it transmits its current override word to its neighbors to the south and east.
[0056] The override instructions are used for array I/O. The current array mode is determined by a global PC-enable signal. When PC-enable is 1, the array is in compute mode. Otherwise, it is in override mode. Unconnected communication input lines on the edges of the array are tied to ground.
[0057] Each PE has four operand banks; each bank contains 128 32-bit words.
Each PE's control memory is a single bank of 128 40-bit words. No instruction may use an operand bank more than once (each bank has a single read and a single write port).
[0058] Each PE contains a 32-bit floating point unit which is capable of performing floating point add, subtract, multiply, divide and multiply-and-accumulate (MAC) operations.

[0059] The compute mode instruction format is provided in Table 2. Operands are specified with a two-dimensional address. The operand format is provided in Table 3. The compute mode instruction is therefore a 40-bit word which specifies the opcode and the bank and offset of each operand. Again, note that no instruction may use an operand bank more than once; hence R1, R2, R3 and R4 each point to a different operand memory bank.
TABLE 2
Field Opcode R1 R2 R3 R4
Length (bits) 4 9 9 9 9
TABLE 3
Field Bank Offset
Length (bits) 2 7

[0060] The compute mode instructions are shown in Tables 4-7. Arithmetic instructions are provided in Table 4. Control instructions are provided in Table 5. If the word stored in R1 is negative, then the PC is loaded with the value stored in the seven rightmost bits of R4. PE-PE communication instructions must specify a direction as well as a command. Adjacent PEs may read each other's COMM register during the same clock cycle. This allows for full-duplex communication. Nearest-neighbor compass direction encodings are provided in Table 6. The direction is stored in the two rightmost bits of R1. The instructions for communication between processing elements are provided in Table 7.
TABLE 4
[Table 4, the arithmetic instructions (NOP, ADD, SUB, MULT, DIV and MAC), appears as an image in the original document.]
Examples:
ADD 0,0 2,7 1,19
MAC 0,12 3,44 1,101 2,98
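Tables 2 and 3 imply a straightforward bit layout: a 4-bit opcode followed by four 9-bit operand fields, each field being a 2-bit bank plus a 7-bit offset. The C sketch below packs such a 40-bit word and checks the rule that no instruction may use an operand bank more than once. It is an illustrative encoding helper written for this description; in particular the opcode values are assumptions, since the actual encodings are given in Table 4, which appears only as an image.

    #include <stdint.h>
    #include <stdbool.h>

    /* Illustrative opcode values only -- the real encodings are in Table 4. */
    enum { OP_NOP = 0, OP_ADD, OP_SUB, OP_MULT, OP_MAC, OP_DIV };

    /* Pack a 9-bit operand field: 2-bit bank, 7-bit offset (Table 3). */
    static uint16_t operand_field(unsigned bank, unsigned offset)
    {
        return (uint16_t)(((bank & 0x3u) << 7) | (offset & 0x7Fu));
    }

    /* Pack a 40-bit compute-mode word: opcode(4) R1(9) R2(9) R3(9) R4(9). */
    static uint64_t pack_instr(unsigned op, uint16_t r1, uint16_t r2,
                               uint16_t r3, uint16_t r4)
    {
        return ((uint64_t)(op & 0xFu)   << 36) |
               ((uint64_t)(r1 & 0x1FFu) << 27) |
               ((uint64_t)(r2 & 0x1FFu) << 18) |
               ((uint64_t)(r3 & 0x1FFu) <<  9) |
                (uint64_t)(r4 & 0x1FFu);
    }

    /* Conservative check that every operand field names a distinct bank
     * (each bank has one read and one write port).  Strictly, the R3
     * comparisons only matter for MAC, the sole three-operand instruction. */
    static bool banks_unique(uint16_t r1, uint16_t r2, uint16_t r3, uint16_t r4)
    {
        unsigned b1 = r1 >> 7, b2 = r2 >> 7, b3 = r3 >> 7, b4 = r4 >> 7;
        return b1 != b2 && b1 != b3 && b1 != b4 &&
               b2 != b3 && b2 != b4 && b3 != b4;
    }

For example, the MAC instruction shown above would be built as pack_instr(OP_MAC, operand_field(0, 12), operand_field(3, 44), operand_field(1, 101), operand_field(2, 98)).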
TABLE 5
[Table 5, the control instructions (the BLZ conditional branch), appears as an image in the original document.]
Examples:
BLZ 3,4 34
BLZ 2,23 0x37
TABLE 6
[Table 6, the nearest-neighbor compass direction encodings, appears as an image in the original document.]
TABLE 7
Examples:
LOAD EAST 0, 0
STORE 3, 6

[0061] The override mode instruction format is provided in Table 8. The format of the Location field is provided in Table 9. The X-Coordinate increases to the east; the Y-Coordinate increases to the south. The memory banks (accessed by bank and offset) are defined in Table 10.
TABLE 8
[Table 8, the override mode instruction format, appears as an image in the original document.]
TABLE 9
[Table 9, the format of the Location field, appears as an image in the original document.]
TABLE 10
[Table 10, the memory bank definitions, appears as an image in the original document.]
[0062] The override mode instruction set is provided in Table 11. Put instructions targeting an operand memory must right-align the 32-bit datum within the 40-bit VAL field. PEs responding to a lookup instruction that reads an operand memory must also right-align the 32-bit datum. Furthermore, because the PEs perform single precision floating point arithmetic, decimal values stored to operand memories are stored as single precision floating point values. Hexadecimal values are stored to operand memories without conversion. Only unsigned hexadecimal values should be used.
TABLE 11
[Table 11, the override mode instruction set (PUT, LOOKUP and FOUNDIT), appears as an image in the original document.]
Examples:
PUT 0, 0 4, 0 "STORE 0,0"    # Store an instruction.
PUT 0, 0 0, 0 "0x5555"       # Store to operand memory.
PUT 2, 2 3, 127 "-10"        # Store to operand memory.
PUT 2, 2 3, 127 "120.35"     # Store to operand memory.
PUT 2, 2 3, 127 "345E-6"     # Store to operand memory.
PUT 0, 0 7, 0 "0"            # Reset PC.
LOOKUP 1, 0 0, 0             # Read from operand memory.
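Paragraph [0062] in effect describes how a host-side loader must format the 40-bit data field of a PUT: decimal values become their 32-bit single-precision bit pattern, hexadecimal values pass through untouched, and either way the 32-bit datum is right-aligned. The C fragment below is an illustrative host-side helper written for this description, not code from the patent.

    #include <stdint.h>
    #include <string.h>

    /* Datum for a PUT to operand memory when the source text is a decimal
     * value: the IEEE-754 single-precision bit pattern, right-aligned in
     * the 40-bit field (upper 8 bits zero). */
    static uint64_t put_decimal_datum(float value)
    {
        uint32_t bits;
        memcpy(&bits, &value, sizeof bits);   /* reinterpret, no rounding step */
        return (uint64_t)bits;
    }

    /* Datum for a PUT when the source text is an unsigned hexadecimal
     * value: stored without conversion, e.g. 0x5555 above. */
    static uint64_t put_hex_datum(uint32_t raw)
    {
        return (uint64_t)raw;
    }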
Control Signals and Fields
[0063] Before describing the data flow diagram for a processing element, the external inputs, internal control signals and fields of a processing element are described below. The control signal equations are provided in pseudo-code. The fields of the override word (override mode) are identified in Table 12. Overword[55..0] is obtained by OR-ing the override words from the west and north neighbors. The fields of the control memory word are identified in Table 13. The control memory word is a 40-bit word read from control memory.
TABLE 12
[Table 12, the fields of the override word, appears as an image in the original document.]
TABLE 13
[Table 13, the fields of the control memory word, appears as an image in the original document.]
External Inputs

XCoord[1:0]   X-coordinate of the PE
YCoord[1:0]   Y-coordinate of the PE
Clk           Clock
PCEnable      Signal to switch between override and compute modes
WS_n          Write strobe
Internal Signals

[0064] PEHit = 1 in override mode if XCoord and YCoord refer to the coordinates of the current PE.

PEHit = (!PCEnable && (OWC.PE == ((XCoord << 2) || YCoord)));

[0065] The Unique signal provides a sanity check to ensure that an instruction accesses an operand memory bank only once.
Unique = CM.Op == 7 ||
         !( CM.R1Bank == CM.R2Bank ||
            (CM.R1Bank == CM.R3Bank && CM.Op == 4) ||
            CM.R1Bank == CM.R4Bank ||
            (CM.R2Bank == CM.R3Bank && CM.Op == 4) ||
            CM.R2Bank == CM.R4Bank ||
            (CM.R3Bank == CM.R4Bank && CM.Op == 4) );

[0066] The WriteBack signal is used to enable the tri-state buffers which allow the result to be written back into an operand memory bank. When WriteBack = 1, the memory stops driving the output bus and the buffers drive the bus to write the data into the memory bank.

WriteBack = (CM.Op > 0 && CM.Op < 6 || CM.Op == 7);

Operand Memory Signals

[0067] OMxAddrSel is used as the select line for the multiplexer which selects the address for bank x.
OMxAddrSel[2:0] = if (!PCEnable)
                      4;
                  elseif (CM.R1Bank == x)
                      0;
                  elseif (CM.R2Bank == x)
                      1;
                  elseif (CM.R3Bank == x)
                      2;
                  else
                      3;
where x = 0, 1, 2, or 3.
[0068] OMxOE_n is the active low output enable signal for bank x. It is deasserted during a write operation in compute mode and for a "PUT" instruction in override mode (which is again a write).

OMxOE_n = if (PCEnable && CM.R4Bank == x)
              1;
          elseif (!PCEnable && OWC.Op == 2)
              1;
          else
              0;
where x = 0, 1, 2, or 3.
[0069] OMxWrt_n is the active low write enable signal for bank x. It is active for an operand memory bank write during override mode (a "PUT" instruction) or compute mode (writing the result back to R4). In all other cases it is deasserted.

OMxWrt_n = if (Unique && WriteBack && PCEnable && CM.R4Bank == x)
               0;
           elseif (!PCEnable && PEHit && OWC.Bank == x && OWC.Op == 2)
               0;
           else
               1;
where x = 0, 1, 2, or 3.
[0070] OMxWS_n is the write strobe for bank x, controlled by the global write strobe (WS_n). The write is executed only when the write strobe is asserted.

OMxWS_n = WS_n || OMxWrt_n;
where x = 0, 1, 2, or 3.
[0071] OMxCE_n is the active low chip enable signal for bank x. It is active at all times in compute mode. In override mode it is asserted for a "PUT" or "LOOKUP" instruction which accesses bank x.

OMxCE_n = if (PCEnable)
              0;
          elseif (!PCEnable && OWC.Bank == x)
              0;
          else
              1;
where x = 0, 1, 2, or 3.
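Taken together, the operand-memory equations above amount to one small combinational function per bank. The following C transcription is a behavioral model written for this description (not the VHDL itself); the struct and parameter names are assumptions, and the active-low _n polarity of the pseudo-code is kept as boolean values.

    #include <stdbool.h>

    struct om_ctl {
        unsigned addr_sel;               /* OMxAddrSel[2:0] */
        bool oe_n, wrt_n, ws_n, ce_n;    /* active-low controls */
    };

    /* Control signals for operand bank x, transcribed from the equations. */
    static struct om_ctl om_signals(bool pc_enable, bool pe_hit, bool unique,
                                    bool write_back, bool ws_n,
                                    unsigned r1_bank, unsigned r2_bank,
                                    unsigned r3_bank, unsigned r4_bank,
                                    unsigned owc_op, unsigned owc_bank,
                                    unsigned x)
    {
        struct om_ctl c;

        c.addr_sel = !pc_enable   ? 4 :
                     r1_bank == x ? 0 :
                     r2_bank == x ? 1 :
                     r3_bank == x ? 2 : 3;

        c.oe_n  = (pc_enable && r4_bank == x) || (!pc_enable && owc_op == 2);
        c.wrt_n = !((unique && write_back && pc_enable && r4_bank == x) ||
                    (!pc_enable && pe_hit && owc_bank == x && owc_op == 2));
        c.ws_n  = ws_n || c.wrt_n;
        c.ce_n  = !(pc_enable || (!pc_enable && owc_bank == x));

        return c;
    }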
Functional Unit Signals

[0072] ResultSel is the select line that selects the correct output from the various sub-units of the functional unit.

ResultSel[2:0] = if (!PCEnable)
                     3;
                 elseif (CM.Op == 3)
                     0;
                 elseif (CM.Op == 1 || CM.Op == 2 || CM.Op == 4)
                     1;
                 elseif (CM.Op == 5)
                     2;
                 else
                     4;
[0073] R1SrcSel = !PCEnable: Used to select the correct address source for R1 (either CM.R1Bank or OWC.Bank).

[0074] MACOp = (CM.Op == 4): Used to select between the MULT and MAC outputs.

[0075] Sub/Add_n = (CM.Op == 2): Used to select between the ADD and SUB operations.

[0076] SignBit = R1[31:31]: Sign of the data stored at R1. Used to make the branching decision.

[0077] CommLd = PCEnable && (CM.Op == 8): Load signal for the COMM register.
Program Counter Signals

[0078] PCInSel = !PCEnable: Used to select the correct input for the program counter.

[0079] PCCnt = PCEnable && !PCLd: When PCCnt = 1, the PC auto-increments to point to the next instruction.

[0080] PCLd = (PCEnable && CM.Op == 6 && SignBit) || (!PCEnable && PEHit && OWC.Op == 2 && OWC.Bank == 7);

[0081] PCLd = 1 in override mode for a PC write and in compute mode for a taken BLZ instruction. When PCLd = 1, the new value is loaded into the PC regardless of its previous contents.
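The three program counter signals reduce to a small next-state function: a load takes priority, otherwise the PC auto-increments in compute mode. The C sketch below is an illustrative model written for this description; load_value stands for either the branch target taken from R4 (a taken BLZ) or OvrWrdData[6..0] (an override write to bank 7).

    #include <stdint.h>
    #include <stdbool.h>

    /* Next value of the 7-bit program counter (128 control memory words). */
    static uint8_t next_pc(uint8_t pc, bool pc_ld, bool pc_cnt,
                           uint8_t load_value)
    {
        if (pc_ld)
            return load_value & 0x7F;
        if (pc_cnt)
            return (uint8_t)((pc + 1) & 0x7F);
        return pc;
    }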
Control Memory Signals

[0082] The functions of the following signals are similar to the operand memory signals.

CMAddrSel = !PCEnable;
CMOE_n = !PCEnable && OWC.Op == 2;
CMWrt_n = !(!PCEnable && PEHit && OWC.Op == 2 && OWC.Bank == 4);
CMWS_n = WS_n || CMWrt_n;
CMCE_n = if (PCEnable)
             0;
         elseif (!PCEnable && PEHit && OWC.Op == 2 && OWC.Bank == 4)
             0;
         else
             1;
Other Override Mode Signals

[0083] OWrdOutSel acts as the select line that selects the final result. This result is combined with the control information to form the output override word.

OWrdOutSel[2:0] = if (PCEnable)
                      4;
                  elseif (PEHit && OWC.Op == 1 && OWC.Bank < 4)
                      1;
                  elseif (PEHit && OWC.Op == 1 && OWC.Bank == 4)
                      2;
                  elseif (PEHit && OWC.Op == 1 && OWC.Bank == 7)
                      3;
                  else
                      0;
[0084] OWC[55:55] is the "FOUNDIT" signal (refer to instruction set) which changes the opcode to 11 when the value is found.
OWC[55:55] = OWC[55:55] || (PEHit && OWC.Op == 1 && (OWC.Bank < 5 || OWC.Bank == 7));
Data Flow Description

[0085] Figs. 8A-8B illustrate the components of a processing element and the data flow through the processing element. The inputs to each PE are:

- Clk - the global clock signal
- PCEnable - control bit to switch between override and compute modes
- XCoord, YCoord - (X, Y) coordinates of the PE; in the VHDL code these are hard-coded for every PE
- WOvrWrdIn/CommIn - OvrWrd (56 bits) from the west neighbor
- NOvrWrdIn/CommIn - OvrWrd (56 bits) from the north neighbor
- SCommIn - CommWord (32 bits) from the south neighbor
- ECommIn - CommWord (32 bits) from the east neighbor

[0086] The outputs of each PE are:

- OvrWrd - 56-bit OvrWrd to the east and south neighbors
- CommOut - 32-bit CommWord to the west and north neighbors

[0087] Every PE receives OvrWrds from its north and west neighbors, processes the OvrWrd and provides the OvrWrd output to its east and south neighbors. The Comm output from each PE goes to all of its neighbors. Consequently, every PE's COMM register can be read by any of its neighbors.

[0088] The components of the PE are shown in the data flow diagram of Figs. 8A-8B. The PE can be divided into several blocks: the OvrWrd block, PC block, CM block, Functional Unit and Output block.

[0089] OvrWrd Block - The OvrWrd block consists of an OR gate and the OvrWrd register. The WOvrWrd and NOvrWrd inputs are ORed and the result acts as the OvrWrd for the current PE. It is stored in the OvrWrd register after being split into its components. The upper 16 bits of the OvrWrd (bits 55 to 40) are labeled OvrWrdCtrl and the lower 40 bits (bits 39 to 0) are labeled OvrWrdData. OvrWrdCtrl specifies the opcode and bank for the PE, and OvrWrdData is the data or value used in that instruction.
[0090] PC Block - This block consists of the program counter (PC) and the supporting logic for reading, writing and incrementing the PC. In override mode, PCInSel selects the lower 7 bits of OvrWrdData as the PC input, and PCCnt = 0; when a PUT instruction targets the PC, PCLd = 1 and the PC is loaded with that value. In compute mode, the PC is automatically incremented after every instruction to point to the next instruction in the control memory. The PC is rewritten with a new value when a branch instruction causes control to branch to a different address.
[0091] CM Block - The control memory (CM) block consists of a single 128x40 SRAM bank and supporting logic to read from and write to the memory. In override mode, CMAddrSel = 1, CMWS_n = 0, CMOE_n = 1 and CMWrt_n = 0; hence the OvrWrdData is written into the CM at the address given by OwdCtrlAddr. In compute mode, CMWS_n = 1, CMOE_n = 0, CMWrt_n = 1 and CMAddrSel = 0; hence the instruction at the address held in the PC is read from the CM. The CM is such that when CMWS_n (CM write strobe) and CMWrt_n (CM write) are low and CMOE_n (CM output enable) is high, data is written into the CM. When CMWS_n = CMWrt_n = 1 and CMOE_n = 0, data is read from the CM.

[0092] Functional Unit - This consists of the 32-bit floating point unit with MAC support, four banks of 128x32 operand memory and a result multiplexer to select the correct result to be written back to memory.

[0093] Output Block - This consists of the "Found-It" logic and the output multiplexer. The "Found-It" logic operates in override mode and sets the MSB of the OvrWrd to 1 if a look-up instruction is successful. The output multiplexer is used to select the correct output from the functional unit.

[0094] The data flow is described herein by considering a representative instruction from each of the override and compute modes.

Override Mode: PUT 1,0 4,0 <instruction>

[0095] This instruction is used to write the value <instruction> into the CM. The logical flow of steps for this instruction is as follows:
- OvrWrd Block - Assume the above instruction arrives as WOvrWrdIn/CommIn and NOvrWrdIn/CommIn = 0. An OR operation is performed and OvrWrdCtrl and OvrWrdData are obtained. Hence OvrWrdCtrl = PUT 1,0 4,0 and OvrWrdData = <instruction>.

- PC Block - In override mode, PCInSel selects OvrWrdData[6..0] as the input to the PC. The PC is viewed as bank 7. Since this is a write to bank 4, offset 0, PCLd = 0 and the data is ignored by the PC.

- CM Block - For the CM block, CMAddrSel = 1, CMOE_n = 1, CMWS_n = 0 and CMWrt_n = 0. The value <instruction> is written to the CM at the address given by OwdCtrlAddr (which in this case is 0).

- Functional Unit - Since this instruction does not concern any of the operand memory banks, OMxAddrSel = 0, OMxOE_n = 1, OMxWrt_n = 1 and OMxWS_n = 1, where x = 0, 1, 2, or 3. None of the operand memory banks is affected in any way. Also, in override mode, PCEnable = 0, so the floating point unit and result multiplexer are disabled. If this were a write to an operand memory bank, then the OMxWS_n and OMxWrt_n signals for the corresponding bank would be asserted and OMxOE_n would be deasserted.

- Output Block - In override mode, the PE does not compute anything but simply passes the received data on to its east and south neighbors. Hence OWrdOutSel = 0.
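The override-mode walkthrough can be condensed into a single dispatch routine per PE. The C sketch below is a behavioral model written for this description, not the hardware description itself; the opcode and bank numbers follow the reading of the pseudo-code equations above (Op == 2 for PUT, Op == 1 for LOOKUP, bank 4 addressing control memory and bank 7 the PC), and the struct layout is assumed for illustration.

    #include <stdint.h>

    struct pe_state {
        uint64_t control_mem[128];   /* 40-bit instructions, held in 64 bits */
        uint32_t operand[4][128];    /* four operand banks */
        uint8_t  pc;                 /* 7-bit program counter */
    };

    /* Behavioral model of one override word handled by the matching PE.
     * op, bank, offset and val are the already-decoded fields of the word.
     * The return value is the datum a LOOKUP would place in its Result
     * Response; a PUT returns 0. */
    static uint64_t handle_override(struct pe_state *s, unsigned op,
                                    unsigned bank, unsigned offset,
                                    uint64_t val)
    {
        if (op == 2) {                                          /* PUT */
            if (bank < 4)
                s->operand[bank][offset] = (uint32_t)val;       /* right-aligned datum */
            else if (bank == 4)
                s->control_mem[offset] = val & 0xFFFFFFFFFFULL; /* 40-bit word */
            else if (bank == 7)
                s->pc = (uint8_t)(val & 0x7F);                  /* PC write */
            return 0;
        }
        if (op == 1) {                                          /* LOOKUP */
            if (bank < 4)
                return s->operand[bank][offset];
            if (bank == 4)
                return s->control_mem[offset];
            if (bank == 7)
                return s->pc;
        }
        return 0;
    }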
Compute Mode: MAC R1, R2, R3, R4 where R1 = 0,0; R2 = 1,7; R3 = 2,12; R4 = 3,5

[0096] This instruction calculates R1 * R2 + R3 and places the result into R4. The banks and addresses pointed to by R1, R2, R3 and R4 are as given above. The logical flow of steps for this instruction is as follows:

- OvrWrd Block - Since the PE is in compute mode, the OvrWrd block is disabled.

- PC Block - Assume this instruction is stored at address 0 in the CM. Hence the PC output is (0000000)b. PCCnt = 1, so the PC is auto-incremented to point to the next instruction.

- CM Block - In compute mode, CMAddrSel = 0, so the address given to the CM is the PC output. CMOE_n = 0, CMWrt_n = 1 and CMWS_n = 1. Hence the instruction is read from address (0000000)b of the CM. It is then decoded (the decoding is performed in logic as part of the VHDL code and is not shown in the diagram) to give the values of CM.Op (the opcode), CM.R1Bank, CM.R1Addr, CM.R2Bank, CM.R2Addr, CM.R3Bank, CM.R3Addr, CM.R4Bank and CM.R4Addr.
- Functional Unit -
    - The operands are first read from banks 0, 1 and 2. Hence OM0AddrSel = 0 (since R1 = 0,0), OM1AddrSel = 2 (since R2 = 1,7), OM2AddrSel = 3 (since R3 = 2,12) and OM3AddrSel = 4 (since R4 = 3,5). For banks 0, 1 and 2, OMxOE_n = 0, OMxWS_n = 1 and OMxWrt_n = 1; for bank 3, OM3OE_n = 1, OM3Wrt_n = 0 and OM3WS_n = 0. The outputs of the operand memory banks are denoted as follows: OM0 as D0, OM1 as D1 and OM2 as D2.
    - In compute mode, R1SrcSel = 0, so the output of that multiplexer is CM.R1Bank (i.e., 0) and consequently the output of the R1 multiplexer (i.e., R1) is D0. Similarly, the output of the R2 multiplexer (i.e., R2) is D1.
    - Since this is a MAC instruction, MACOp = 1, so the output of the bottom multiplexer is CM.R3Bank. Consequently, the output of the R2orR3 multiplexer (i.e., R2orR3) is D2. Thus all the operands needed to execute the instruction are available.
    - R1 and R2 are provided as inputs to the multiplier block (X), and the inputs to the add/subtract block (+/-) are the multiplier output and R2orR3. The small delay provided by the +/- input multiplexer ensures that the correct input is presented to the +/- block.
    - For the MAC operation, ResultSel = 1, so the output of the +/- block is selected as the result to be written back.
    - During write-back, OM3Wrt_n = 0, OM3WS_n = 0 and OM3OE_n = 1, so the result is written into OM3.

- Output Block - In compute mode, OWrdOutSel = 4, so the output of the COMM register is appended to OvrWrdCtrl and provided as output to the east and south neighbors.
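In software terms the compute-mode walkthrough is a fetch-decode-execute step; the MAC case alone is sketched below, with single-precision floats standing in for the hardware functional unit. This is an illustrative behavioral model written for this description, assuming the bank/offset fields have already been decoded from the 40-bit word.

    #include <stdint.h>
    #include <string.h>

    /* Read one 32-bit operand word and reinterpret it as a float. */
    static float om_read(uint32_t om[4][128], unsigned bank, unsigned off)
    {
        float f;
        memcpy(&f, &om[bank][off], sizeof f);
        return f;
    }

    /* MAC: R4 <- R1 * R2 + R3, with each Ri given as a (bank, offset) pair. */
    static void exec_mac(uint32_t om[4][128],
                         unsigned b1, unsigned o1, unsigned b2, unsigned o2,
                         unsigned b3, unsigned o3, unsigned b4, unsigned o4)
    {
        float r = om_read(om, b1, o1) * om_read(om, b2, o2)
                + om_read(om, b3, o3);
        memcpy(&om[b4][o4], &r, sizeof r);
    }

For the example above, exec_mac(om, 0, 0, 1, 7, 2, 12, 3, 5) reads D0, D1 and D2 from banks 0, 1 and 2 and writes the sum of products into bank 3, offset 5.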
[0097] The load and store instructions are slightly different. The store instruction stores the data pointed to by the R1 field into the COMM register. Hence, after the data from the bank pointed to by R1 is read, CommLd = 1 so that the COMM register is loaded with this data, and then OWrdOutSel = 4 so that this data can be read by any of this PE's neighbors.

[0100] For the load instruction, the lower 2 bits of the data pointed to by the R1 field denote the direction of the neighbor from which the current PE reads: 00 denotes west, 01 denotes east, 10 denotes south and 11 denotes north. Hence CM.R1Addr[1..0] causes the multiplexer to select the data from the correct neighbor, and this data is then written back to the address given by the R4 field.
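A literal transcription of that two-bit encoding, written for this description with illustrative names, is:

    /* Direction of the neighbor to read from, taken from the two lowest
     * bits of the word addressed by the R1 field (paragraph [0100]). */
    enum load_dir { LD_WEST = 0, LD_EAST = 1, LD_SOUTH = 2, LD_NORTH = 3 };

    static enum load_dir load_direction(unsigned r1_data)
    {
        return (enum load_dir)(r1_data & 0x3u);
    }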
[0101] The following general comments apply to the data flow diagram of Figs. 8A-8B:

- A signal written as SIGNAL_n denotes an active low signal.
- The global signals Clk, PCEnable, XCoord and YCoord are not shown connected to any component in order to keep the diagram as simple as possible. In the VHDL derived from the data flow diagram, Clk is connected to every sequential component and PCEnable is connected to every component. The XCoord and YCoord signals are connected to a logic block which is used to identify the PE.
- Inputs and outputs of the peripheral PEs that are not used are tied to ground. For example, the PEs in the rightmost column have no east neighbor, so their EOvrWrdOut/ECommOut lines are connected to ground.
[0102] The present invention attempts to pack as much computation into as small a space as possible. This computational density should lead to a higher level of switching activity than what is seen in general-purpose processors. Therefore, power consumption and heat generation may become problematic. For example, heat generation can constrain the clock rate. It may be possible to apply a number of techniques in order to reduce the effects of high switching activity. These techniques include heat tolerant packaging, low-swing interconnect, and possibly scheduling operations in order to reduce switching activity.
[0103] Those skilled in the art will appreciate that many modifications to the exemplary embodiments of the present invention are possible without departing from the spirit and scope of the invention. In addition, it is possible to use some of the features of the present invention without the corresponding use of the other features. Accordingly, the foregoing description of the exemplary embodiments is provided for the purpose of illustrating the principles of the present invention and not in limitation thereof since the scope of the present invention is defined solely by the appended claims.

Claims

What is Claimed is:
1. A reconfigurable computing system for accelerating execution of floating point intensive iterative applications, comprising: a plurality of interconnected processing elements, each processing element including a floating point functional unit, operand memory, control memory and a control unit; a host processing system for displaying real-time outputs of the floating point intensive iterative applications; and an interface for connecting the plurality of interconnected processing elements to the host processing system.
2. The reconfigurable computing system for accelerating execution of floating point intensive iterative applications of claim 1 wherein the floating point functional unit includes a multiply accumulate (MAC) function.
3. The reconfigurable computing system for accelerating execution of floating point intensive iterative applications of claim 1 wherein the operand memory comprises a plurality of banks of static random access memory.
4. The reconfigurable computing system for accelerating execution of floating point intensive iterative applications of claim 1 wherein the operand memory comprises dynamic random access memory.
5. The reconfigurable computing system for accelerating execution of floating point intensive iterative applications of claim 1 wherein the operand memory comprises four banks of static random access memory.
6. The reconfigurable computing system for accelerating execution of floating point intensive iterative applications of claim 1 wherein the control memory comprises a bank of random access memory.
7. The reconfigurable computing system for accelerating execution of floating point intensive iterative applications of claim 1 wherein the plurality of processing elements are interconnected using a nearest neighbor implementation in which each processing element is connected to its cardinal neighbor processing elements.
8. The reconfigurable computing system for accelerating execution of floating point intensive iterative applications of claim 1 wherein the plurality of processing elements are interconnected using a hierarchical implementation.
9. The reconfigurable computing system for accelerating execution of floating point intensive iterative applications of claim 1 wherein each processing element executes a plurality of classes of instructions.
10. The reconfigurable computing system for accelerating execution of floating point intensive iterative applications of claim 1 wherein each processing element has both a compute mode and an override mode of operation.
11. The reconfigurable computing system for accelerating execution of floating point intensive iterative applications of claim 10 wherein the mode of operation for the plurality of processing elements is determined by a global signal.
12. The reconfigurable computing system for accelerating execution of floating point intensive iterative applications of claim 10 wherein each processing element in compute mode executes an instruction stream as defined by a program counter and a control memory.
13. The reconfigurable computing system for accelerating execution of floating point intensive iterative applications of claim 10 wherein each processing element in override mode forms an override word by a logical OR operation of control words received from a pair of cardinal neighbor processing elements.
14. The reconfigurable computing system for accelerating execution of floating point intensive iterative applications of claim 12 wherein the compute mode instructions include arithmetic, control and communication instructions.
15. The reconfigurable computing system for accelerating execution of floating point intensive iterative applications of claim 14 wherein the arithmetic instructions include a no operation instruction, an add instruction, a subtract instruction, a multiply instruction, a divide instruction and a multiply accumulate instruction.
16. The reconfigurable computing system for accelerating execution of floating point intensive iterative applications of claim 14 wherein the control instructions include a conditional branch instruction.
17. The reconfigurable computing system for accelerating execution of floating point intensive iterative applications of claim 14 wherein the communication instructions include a load instruction and a store instruction.
18. The reconfigurable computing system for accelerating execution of floating point intensive iterative applications of claim 13 wherein the override mode instructions include a put instruction, a lookup instruction and a found instruction.
19. The reconfigurable computing system for accelerating execution of floating point intensive iterative applications of claim 18 wherein the put instruction stores to a control memory, an operand memory or a program counter.
20. The reconfigurable computing system for accelerating execution of floating point intensive iterative applications of claim 18 wherein the lookup instruction reads from an operand memory.
21. The reconfigurable computing system for accelerating execution of floating point intensive iterative applications of claim 18 wherein the found instruction sets the most significant bit of an override word if a lookup instruction is successful.
22. The reconfigurable computing system for accelerating execution of floating point intensive iterative applications of claim 1 wherein the interface is implemented using a field programmable gate array (FPGA).
23. The reconfigurable computing system for accelerating execution of floating point intensive iterative applications of claim 1 wherein the interface is a Peripheral Component Interconnect (PCI) bus interface.
24. The reconfigurable computing system for accelerating execution of floating point intensive iterative applications of claim 1 wherein the interface is an Accelerated Graphics Port (AGP) interface.
25. A processing element for use in an array of processing elements forming a reconfigurable computing system to accelerate the execution of computationally intensive instructions comprising: a functional component for performing a plurality of floating point instructions; a plurality of operand banks for providing inputs to and writing outputs from the functional component; a control memory for providing and storing instructions that control operation of the processing element; and an output component for providing an output signal to an adjacent processing element.
26. The processing element for use in an array of processing elements of claim 25 further comprising a program counter for pointing to an instruction in control memory that is to be executed by the functional unit.
27. The processing element for use in an array of processing elements of claim 25 further comprising an input register for storing an instruction determined from a logical combination of inputs from a pair of adjacent processing elements.
28. The processing element for use in an array of processing elements of claim 25 wherein the processing element operates in a compute mode or an override mode.
29. The processing element for use in an array of processing elements of claim 25 wherein the operand memory comprises a plurality of banks of random access memory.
30. The processing element for use in an array of processing elements of claim 25 wherein the control memory comprises a bank of random access memory.
31. The processing element for use in an array of processing elements of claim 25 wherein the array of processing elements are interconnected using a nearest neighbor implementation in which each processing element is connected to its cardinal neighbor processing elements.
32. The processing element for use in an array of processing elements of claim 28 wherein the mode of operation for the processing element is determined by a global signal.
33. The processing element for use in an array of processing elements of claim 28 wherein the processing element in compute mode executes an instruction stream as defined by the program counter and the control memory.
34. The processing element for use in an array of processing elements of claim 28 wherein the processing element in override mode forms an override word by a logical operation on control words received from a pair of cardinal neighbor processing elements.
35. The processing element for use in an array of processing elements of claim 33 wherein the compute mode instructions include arithmetic, control and communication instructions.
36. The processing element for use in an array of processing elements of claim 35 wherein the arithmetic instructions include a no operation instruction, an add instruction, a subtract instruction, a multiply instruction, a divide instruction and a multiply accumulate instruction.
37. The processing element for use in an array of processing elements of claim 35 wherein the control instructions include a conditional branch instruction.
38. The processing element for use in an array of processing elements of claim 35 wherein the communication instructions include a load instruction and a store instruction.
39. The processing element for use in an array of processing elements of claim 34 wherein the override mode instructions include a put instruction, a lookup instruction and a found instruction.
40. The processing element for use in an array of processing elements of claim 39 wherein the put instruction stores to a control memory, an operand memory or a program counter.
41. The processing element for use in an array of processing elements of claim 39 wherein the lookup instruction reads from an operand memory.
42. The processing element for use in an array of processing elements of claim 39 wherein the found instruction sets the most significant bit of an override word if a lookup instruction is successful.
PCT/US2002/038645 2001-12-06 2002-12-06 Floating point intensive reconfigurable computing system for iterative applications WO2003050697A1 (en)

Priority Applications (4)

Application Number Priority Date Filing Date Title
EP20020795726 EP1451701A1 (en) 2001-12-06 2002-12-06 Floating point intensive reconfigurable computing system for iterative applications
CA002468800A CA2468800A1 (en) 2001-12-06 2002-12-06 Floating point intensive reconfigurable computing system for iterative applications
AU2002360469A AU2002360469A1 (en) 2001-12-06 2002-12-06 Floating point intensive reconfigurable computing system for iterative applications
US10/862,269 US20070067380A2 (en) 2001-12-06 2004-06-07 Floating Point Intensive Reconfigurable Computing System for Iterative Applications

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US33834701P 2001-12-06 2001-12-06
US60/338,347 2001-12-06

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US10/862,269 Continuation-In-Part US20070067380A2 (en) 2001-12-06 2004-06-07 Floating Point Intensive Reconfigurable Computing System for Iterative Applications

Publications (1)

Publication Number Publication Date
WO2003050697A1 true WO2003050697A1 (en) 2003-06-19

Family

ID=23324454

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2002/038645 WO2003050697A1 (en) 2001-12-06 2002-12-06 Floating point intensive reconfigurable computing system for iterative applications

Country Status (4)

Country Link
EP (1) EP1451701A1 (en)
AU (1) AU2002360469A1 (en)
CA (1) CA2468800A1 (en)
WO (1) WO2003050697A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1998258A1 (en) 2007-05-31 2008-12-03 VNS Portfolio LLC Method and apparatus for connecting multiple multimode processors
CN113760814A (en) * 2017-03-28 2021-12-07 上海山里智能科技有限公司 Integrated computing system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5892962A (en) * 1996-11-12 1999-04-06 Lucent Technologies Inc. FPGA-based processor
US6289440B1 (en) * 1992-07-29 2001-09-11 Virtual Computer Corporation Virtual computer of plural FPG's successively reconfigured in response to a succession of inputs
US6289434B1 (en) * 1997-02-28 2001-09-11 Cognigine Corporation Apparatus and method of implementing systems on silicon using dynamic-adaptive run-time reconfigurable circuits for processing multiple, independent data and control streams of varying rates
US6408382B1 (en) * 1999-10-21 2002-06-18 Bops, Inc. Methods and apparatus for abbreviated instruction sets adaptable to configurable processor architecture
US6507947B1 (en) * 1999-08-20 2003-01-14 Hewlett-Packard Company Programmatic synthesis of processor element arrays

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6289440B1 (en) * 1992-07-29 2001-09-11 Virtual Computer Corporation Virtual computer of plural FPG's successively reconfigured in response to a succession of inputs
US5892962A (en) * 1996-11-12 1999-04-06 Lucent Technologies Inc. FPGA-based processor
US6289434B1 (en) * 1997-02-28 2001-09-11 Cognigine Corporation Apparatus and method of implementing systems on silicon using dynamic-adaptive run-time reconfigurable circuits for processing multiple, independent data and control streams of varying rates
US6507947B1 (en) * 1999-08-20 2003-01-14 Hewlett-Packard Company Programmatic synthesis of processor element arrays
US6408382B1 (en) * 1999-10-21 2002-06-18 Bops, Inc. Methods and apparatus for abbreviated instruction sets adaptable to configurable processor architecture

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
KNITTEL G.: "A PCI-compatible FPGA-coprocessor for 2D/3D image processing FPGAs for custom computing machines", IEEE SYMPOSIUM ON PROCEEDINGS, 17 April 1996 (1996-04-17) - 19 April 1996 (1996-04-19), pages 136 - 145, XP010206376 *
LIGON W.B., III. ET AL.: "Implementation and analysis of numerical components for reconfigurable computing", AEROSPACE CONFERENCE, 1999. PROCEEDINGS. 1999 IEEE, vol. 2, 1999, pages 325 - 335, XP010350338 *
MIRSKY E., DEHON A.: "MATRIX: a reconfigurable computing architecture with configurable instruction distribution and deployable resources FPGAs for custom computing machines", IEEE SYMPOSIUM ON PROCEEDINGS, 17 April 1996 (1996-04-17) - 19 April 1996 (1996-04-19), pages 157 - 166, XP010206378 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1998258A1 (en) 2007-05-31 2008-12-03 VNS Portfolio LLC Method and apparatus for connecting multiple multimode processors
US7840826B2 (en) 2007-05-31 2010-11-23 Vns Portfolio Llc Method and apparatus for using port communications to switch processor modes
CN113760814A (en) * 2017-03-28 2021-12-07 上海山里智能科技有限公司 Integrated computing system

Also Published As

Publication number Publication date
EP1451701A1 (en) 2004-09-01
CA2468800A1 (en) 2003-06-19
AU2002360469A1 (en) 2003-06-23

Similar Documents

Publication Publication Date Title
US11544191B2 (en) Efficient hardware architecture for accelerating grouped convolutions
CN108572850B (en) Vector processing unit, computing system including the same, and method performed therein
EP3637265B1 (en) Memory device performing in-memory prefetching and system including the same
KR20220054357A (en) Method for performing PROCESSING-IN-MEMORY (PIM) operations on serially allocated data, and related memory devices and systems
CN112463719A (en) In-memory computing method realized based on coarse-grained reconfigurable array
US5053986A (en) Circuit for preservation of sign information in operations for comparison of the absolute value of operands
US5119324A (en) Apparatus and method for performing arithmetic functions in a computer system
KR20220051006A (en) Method of performing PIM (PROCESSING-IN-MEMORY) operation, and related memory device and system
Kingyens et al. The potential for a GPU-like overlay architecture for FPGAs
US11822510B1 (en) Instruction format and instruction set architecture for tensor streaming processor
US10185560B2 (en) Multi-functional execution lane for image processor
US20110185151A1 (en) Data Processing Architecture
Kwon et al. A 1ynm 1.25 v 8gb 16gb/s/pin gddr6-based accelerator-in-memory supporting 1tflops mac operation and various activation functions for deep learning application
US20050171990A1 (en) Floating point intensive reconfigurable computing system for iterative applications
US7769981B2 (en) Row of floating point accumulators coupled to respective PEs in uppermost row of PE array for performing addition operation
KR20210113099A (en) Adjustable function-in-memory computation system
Yousefzadeh et al. Energy-efficient in-memory address calculation
Stepchenkov et al. Recurrent data-flow architecture: features and realization problems
US8539207B1 (en) Lattice-based computations on a parallel processor
WO2003050697A1 (en) Floating point intensive reconfigurable computing system for iterative applications
Iniewski Embedded Systems: Hardware, Design and Implementation
Gayles et al. The design of the MGAP-2: A micro-grained massively parallel array
CN115129464A (en) Stochastic sparsity handling in systolic arrays
Todaro et al. Enhanced soft gpu architecture for fpgas
US20230289398A1 (en) Efficient Matrix Multiply and Add with a Group of Warps

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ OM PH PL PT RO RU SD SE SG SK SL TJ TM TN TR TT TZ UA UG US UZ VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR IE IT LU MC NL PT SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
WWE Wipo information: entry into national phase

Ref document number: 2468800

Country of ref document: CA

WWE Wipo information: entry into national phase

Ref document number: 2002795726

Country of ref document: EP

WWE Wipo information: entry into national phase

Ref document number: 10862269

Country of ref document: US

WWE Wipo information: entry into national phase

Ref document number: 2002360469

Country of ref document: AU

WWP Wipo information: published in national office

Ref document number: 2002795726

Country of ref document: EP

WWW Wipo information: withdrawn in national office

Ref document number: 2002795726

Country of ref document: EP

NENP Non-entry into the national phase

Ref country code: JP

WWW Wipo information: withdrawn in national office

Country of ref document: JP