WO2003050697A1 - Floating point intensive reconfigurable computing system for iterative applications - Google Patents
- Publication number
- WO2003050697A1 (PCT/US2002/038645)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- instruction
- floating point
- computing system
- processing elements
- processing element
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F15/00—Digital computers in general; Data processing equipment in general
- G06F15/76—Architectures of general purpose stored program computers
- G06F15/80—Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
- G06F15/8007—Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors single instruction multiple data [SIMD] multiprocessors
- G06F15/8023—Two dimensional arrays, e.g. mesh, torus
Definitions
- the present invention is related in general to physical modeling of solid objects, and more particularly, to the use of specialized reconfigurable hardware architecture to provide acceleration for the execution of floating point intensive iterative applications.
- the goal of deformable object modeling is to generate a realistic animation of an object as it deforms due to external forces.
- Performance of deformable object modeling tends to be very poor on general purpose CPUs.
- CPU chip area is not well utilized during physical modeling computations. Much of the chip area is devoted to integer computation, instruction decoding and branching.
- general purpose CPUs cannot dedicate all of their resources to physical modeling. The CPU must also deal with operating system overhead, input processing, sound processing, etc. A bus bottleneck also exists between the CPU and graphics hardware. Due to this performance problem, deformable object modeling has been used for off-line generation of animation sequences for movies. Deformable objects are typically modeled as a mass-spring system.
- Hooke's Law is iteratively solved to update the simulation. Implicit integration is typically used due to stability concerns. This results in a system of equations that are then linearized. This system of linear equations can then be solved using an iterative solver such as conjugate gradient.
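The iterative solve described above can be illustrated with a minimal conjugate-gradient sketch. This is plain Python for exposition, not the patented hardware; the 2x2 matrix stands in for the symmetric positive-definite system produced by linearizing the implicitly integrated mass-spring equations.

```python
# Minimal conjugate-gradient solver for A x = b (A symmetric
# positive-definite), the kind of solve the text describes.
def conjugate_gradient(A, b, tol=1e-10, max_iters=100):
    n = len(b)
    x = [0.0] * n
    r = b[:]                      # residual r = b - A x (x starts at 0)
    p = r[:]                      # search direction
    rs_old = sum(ri * ri for ri in r)
    for _ in range(max_iters):
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rs_old / sum(p[i] * Ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        rs_new = sum(ri * ri for ri in r)
        if rs_new < tol:          # squared residual norm is small enough
            break
        p = [r[i] + (rs_new / rs_old) * p[i] for i in range(n)]
        rs_old = rs_new
    return x

# Hypothetical 2x2 system; the exact solution is x = [1, 1].
A = [[4.0, 1.0], [1.0, 3.0]]
b = [5.0, 4.0]
x = conjugate_gradient(A, b)
```

Each iteration is dominated by a matrix-vector product and a few dot products, which is exactly the regular, parallelizable floating point work the processing element array targets.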
- Ray tracing is also a very computationally intensive operation.
- the basic idea is to model the behavior of individual rays of light in a three dimensional scene. Since there are a very large number of rays, the application is very computationally intensive. However, parallelism can be easily exploited.
- Ray tracing differs from the techniques used in 3D accelerator cards in that it models light much more accurately. This leads to very realistic shadows, reflections, and lighting.
- SIMD single instruction stream, multiple data stream
- the basic concept of the invention is to construct a specialized hardware system in order to accelerate physical modeling and other floating point intensive iterative applications. This acceleration allows for the simulation of complex scenes at interactive speeds.
- the invention is directed to a reconfigurable computing system for floating point intensive iterative applications.
- the main objective of the architecture of the present invention is to achieve the highest performance at the lowest cost for iterative floating point intensive applications. Since the applications typically perform a large number of relatively simple iterations, it is possible using the system of the present invention to distribute computation to a large number of independent processing elements.
- Each processing element (referred to herein as PE) is complex enough to handle significant precision of floating point numbers. It requires a modest control and data memory.
- An efficient schedule of operations for each iteration can be determined a priori and stored locally in each processing element.
- the reconfigurable computing system of the present invention includes a plurality of interconnected processing elements mounted on a custom printed circuit board (PCB), a host processing system (such as Linux) for displaying real-time outputs of the floating point calculations performed by the processing elements, and a bus interface (such as a PCI bus or AGP) for connecting the custom printed circuit board to the host system.
- PCB custom printed circuit board
- a host processing system such as Linux
- a bus interface such as a PCI bus or AGP
- Each of the interconnected processing elements includes a floating point functional unit, operand memory, control memory and a control unit.
- the floating point unit provides floating point add, subtract, multiply, divide/reciprocate and multiply-accumulate operations. It provides suitable checks to detect overflow/underflow exceptions for each operation.
- the local memory is divided into operand memory and control memory.
- the operand memory includes a plurality of banks of static random access memory. In one embodiment, the operand memory is in the form of four banks of 128 x 32 SRAM, while the control memory is one bank of 128 x 40 SRAM. The use of local SRAM cells provides the required amount of high speed and bandwidth, giving high memory performance for the target application.
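A back-of-envelope check of the per-PE SRAM budget implied by these figures (illustrative arithmetic only, derived from the stated 4 x 128 x 32 operand banks and 1 x 128 x 40 control bank):

```python
# Per-PE local SRAM implied by the stated organization.
OPERAND_BANKS = 4
operand_bits = OPERAND_BANKS * 128 * 32   # four 128 x 32 banks
control_bits = 1 * 128 * 40               # one 128 x 40 bank
total_bytes = (operand_bits + control_bits) // 8
```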
- each PE contains a program counter (PC) and a communications register (COMM).
- PC program counter
- COMM communications register
- the processing elements are interconnected using a nearest neighbor implementation (although a hierarchical implementation can also be used).
- the instruction set performed by the floating point functional unit includes arithmetic, control and interconnect instructions.
- the PCI bus interface can be implemented as a FPGA.
- the device is not limited to a PCI ("plug-and-play") connectivity or configuration.
- An Accelerated Graphics Port (AGP) interface can be used in an alternate embodiment.
- the alternate embodiment based on the same principles of operation and same inventive concept allows for full compatibility with AGP standards, including AGP8x. Data transfer rate is not limited by internal architecture, but rather by the choice of PCI or AGP connectivity.
- Figs. 1A-1B illustrate a diagram of a processor architecture for a proof-of-concept system.
- Fig. 2 illustrates an animation sequence generated using the processor architecture hardware of Fig. 1.
- Fig. 3 illustrates a PCI board floor plan of an exemplary embodiment of the floating point intensive reconfigurable computer for iterative applications of the present invention.
- Fig. 4 illustrates a high level overview of the system components in an exemplary embodiment of the present invention.
- Fig. 5 illustrates the identification of processing elements via column and row number in accordance with an exemplary embodiment of the present invention.
- Fig. 6 illustrates a diagram of a custom-integrated circuit processing element in accordance with an exemplary embodiment of the present invention.
- Fig. 7 illustrates the traversal path for a control word in override mode in accordance with an exemplary embodiment of the present invention.
- Figs. 8A-8B illustrate the components of a processing element and the data flow through the processing element in accordance with an exemplary embodiment of the present invention.
- a proof-of-concept system was first constructed to implement a mass-spring deformable object simulation. This system uses a forward Euler solver. The proof-of-concept system was implemented on a high-density Altera EPF10K250 FPGA (250k gates) used in conjunction with a custom circuit board and specialized memory. This field programmable gate array (FPGA) was placed on the custom printed circuit board (PCB), which was then connected to a host machine for graphic output via a parallel cable.
- FPGA field programmable gate array
- Fig. 2 shows an animation sequence that was generated on the hardware system.
- the performance of this system can be examined in comparison to existing general-purpose machines.
- a Pentium II/300 MHz machine can achieve 0.32M iterations per second.
- the proof-of-concept system could achieve 30M iterations per second. This represents a 92X speedup over the general-purpose machine. This would be even higher for 3D simulation, since the pipeline can exploit greater parallelism.
- the custom integrated circuits can be fabricated by Taiwan Semiconductor Manufacturing Corp. (TSMC) via the MOSIS service.
- MOSIS is a non-profit microelectronics broker providing low- cost prototyping and small-volume production service for VLSI circuit development.
- Several of these custom integrated chips populate a custom printed circuit board.
- the custom printed circuit board is connected to a LINUX host system through a Peripheral Component Interconnect (PCI) bus 15 (Fig. 3).
- PCI Peripheral Component Interconnect
- the bus interface can be implemented as an FPGA using a PCI core.
- the PCI bus interface increases the design complexity, but it is necessary in order to supply simulation results to the host quickly enough to achieve real-time display updates.
- Fig. 3 shows a diagram of the printed circuit board organization using a PCI bus interface. It depicts the FPGA PCI interface 10 and four processing element array integrated circuits 20.
- An AGP interface can also be used in an alternate exemplary embodiment.
- DIME direct memory execute
- A high level overview of the entire system of the present invention is illustrated in Fig. 4.
- the system includes a known-good host system 90, a controller FPGA 92 and the custom PE array 94 on a custom printed circuit board 96.
- the system has only two global signals: clock and PC-enable.
- PEs 20 identify themselves by column and row, as illustrated in Fig. 5.
- a simple locking circuit is used to prevent two adjacent PEs from performing interconnect stores to each other simultaneously, possibly burning out transistors. All that is required is to ensure that no two adjacent PEs are executing store instructions simultaneously. If this condition occurs, the write buffers of the PEs are disabled and a flag is set. At the end of program execution, the flag is examined. If it is clear, then all went well. If it is set, a simulator is used to determine precisely where the program faltered. It is possible to perform a more specific check within the array hardware itself, but this would unnecessarily increase the complexity of the PE array.
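The adjacency check can be sketched in software as follows. This is an illustrative model, not the patented locking circuit; the coordinate convention (X increases east, Y increases south) matches the description elsewhere in the text, and `check_stores` is a hypothetical helper.

```python
# Detect the forbidden condition: two adjacent PEs issuing STORE
# toward each other in the same cycle.  Returns the PEs whose write
# buffers would be disabled and whether the error flag would be set.
def check_stores(stores):
    """stores maps (x, y) -> STORE direction ('N','S','E','W')
    for every PE storing this cycle."""
    offset = {'N': (0, -1), 'S': (0, 1), 'E': (1, 0), 'W': (-1, 0)}
    opposite = {'N': 'S', 'S': 'N', 'E': 'W', 'W': 'E'}
    disabled, flag = set(), False
    for (x, y), d in stores.items():
        dx, dy = offset[d]
        neighbor = (x + dx, y + dy)
        # Conflict: the neighbor is also storing, back toward us.
        if stores.get(neighbor) == opposite[d]:
            disabled.update({(x, y), neighbor})
            flag = True
    return disabled, flag

# PE (0,0) stores east while PE (1,0) stores west: a conflict.
disabled, flag = check_stores({(0, 0): 'E', (1, 0): 'W'})
```

As in the text, the model only flags the condition; pinpointing the faulting instruction is left to an off-line simulator.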
- each processing element 20 includes a floating point functional unit 22, operand memory 24, control memory 26, and a simple control unit 28 on one integrated circuit (i.e., a single chip).
- the floating point functional unit also includes a multiply accumulate (MAC) function. There is also a divide/reciprocation function, but simulations show that the divide operation is rare.
- MAC multiply accumulate
- Each PE contains SRAM operand and control storage. Distributing a large amount of high speed, high bandwidth SRAM such as this offers very high memory performance for the target application.
- the control memory 26 is a separate storage and relatively small in comparison (128 x 40 SRAM). The maximum instruction length is 40 bits.
- the size of on-board RAM is not critical for performance. The device can be made faster by changing the type of RAM used (e.g., from SIMM to DIMM) while keeping the same amount of on-board memory.
- SIMM and DIMM are acronyms for single in-line memory module and dual in-line memory module, respectively. They are small circuit boards that can hold a group of memory chips. The bus from the SIMM to the actual memory chips is 32 bits wide. DIMM provides a 64-bit bus.
- DRAM Dynamic random access memory
- Static RAMs do not require refresh circuitry as do dynamic RAMs, but they do take up more space and require more power.
- SRAM was chosen for robustness. Specifically, SRAM was chosen because of the concern about noise affecting sensing on the memory bitlines.
- between ICs, interconnect 32 is more limited than within the IC. A large number of 32-bit busses would cause packaging difficulties, increase chip-to-chip delay due to routing congestion, and increase the power dissipated in the chip I/O pads.
- the interconnect method 18 can be more flexible.
- the primary concern is to allow for efficient communication without significantly increasing area overhead. Delay is not expected to be a problem in the present invention since the operand bitlines have a higher capacitive load.
- the main interconnect options for this invention are a nearest neighbor style and a hierarchical approach.
- the processing elements 20 can be efficiently connected to their cardinal neighbors (north, south, east, west). This results in a simple and regular layout as illustrated in Fig. 6.
- a nearest neighbor interconnect strategy a number of options exist for treating boundaries (torus, wrap, etc.), however, in this design, wiring complexity can be reduced by not allowing on-chip boundary communication.
- the position of each PE is hard- coded in terms of its (X, Y) coordinates and every PE is identified by its coordinates.
- a hierarchical or tree-based approach offers the advantage of faster communication between distant neighbors. This interconnection method is commonly used in modern FPGAs. Hierarchical communication tends to be necessary when the number of processing elements is large, as in modern FPGA-based systems, which are fine-grained in terms of the logic blocks used. In the present invention, a hierarchical interconnect approach leads to unnecessary overhead.
- In each method, a processing element 20 must be allowed to quickly broadcast its data to all other processing elements. This rules out a one-dimensional interconnection method. However, broadcasts can be performed over multiple cycles by rippling through a nearest neighbor network, for example.
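Such a rippled broadcast can be modeled as follows (an illustrative sketch; `broadcast_cycles` is a hypothetical helper). The cycle count equals the maximum Manhattan distance from the source PE to any other PE in the n x n array.

```python
# Simulate a broadcast rippling through a nearest-neighbor n x n
# array: each cycle, every PE holding the datum passes it to its
# four cardinal neighbors.  Count cycles until every PE has it.
def broadcast_cycles(n, src):
    have = {src}
    cycles = 0
    while len(have) < n * n:
        frontier = set()
        for (x, y) in have:
            for dx, dy in ((0, 1), (0, -1), (1, 0), (-1, 0)):
                nx, ny = x + dx, y + dy
                if 0 <= nx < n and 0 <= ny < n:
                    frontier.add((nx, ny))
        have |= frontier
        cycles += 1
    return cycles

# From a corner of a 4x4 array the ripple needs (4-1)+(4-1) = 6 cycles.
corner_cycles = broadcast_cycles(4, (0, 0))
```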
- Clock distribution is also an important consideration. The primary concern is to minimize the probability of a system-crippling design error. Since phase-locked loop (PLL) design is a fairly complex and error-prone process, it may be safer to combine two or more out-of-phase board level clock signals in the IC.
- PLL phase-locked loop
- the PE supports a novel I/O scheme for loading the PE with programs and data and retrieving results from the PE array. Therefore, every PE operates in two modes: I/O (override) mode and compute mode.
- a global PC-enable signal controls the mode of operation.
- every PE has two different instruction classes for each mode.
- the instruction format and instruction sets for both compute mode and override mode are provided in greater detail below.
- In the override mode while loading the PEs, each PE receives a 56-bit control word from its north and west neighbors, possibly modifies the word, and passes it on to its south and east neighbors.
- Fig. 7 illustrates the traversal path for a control word in override mode. This is accomplished by using the "PUT" command to write to the control memory (program), operand memory (data) or the program counter (PC).
- a "LOOK-UP" instruction is sent to the northwest corner PE. This command fans out through the array. Every PE receiving this command compares its coordinates with the PE field of the instruction. If there is no match, then the instruction is simply passed on. If there is a match, then the PE performs the indicated SRAM (operand or control) read and inserts the results into a Result Response, which propagates through the array and eventually appears at the southeast corner from where it can be read.
- the entire array thus depends on only two global signals, i.e., PC-enable and clock.
- each PE starts execution by reading the instruction pointed to by the program counter.
- the instruction set provides for ADD, SUB, MULT and DIV, which are two operand instructions, and MAC, which is a three operand instruction.
- the arithmetic class also provides NOP, which is a zero operand instruction.
- Each operand field independently specifies an operand SRAM bank (2 bits) and a location within the bank (7 bits).
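The 9-bit operand field described above (2-bit bank, 7-bit offset) can be sketched as a simple pack/unpack pair; the function names are illustrative, not from the patent.

```python
# Pack a (bank, offset) pair into the 9-bit operand field:
# bank in the top 2 bits, offset in the low 7 bits.
def pack_operand(bank, offset):
    assert 0 <= bank < 4 and 0 <= offset < 128
    return (bank << 7) | offset

def unpack_operand(field):
    return field >> 7, field & 0x7F

field = pack_operand(2, 100)   # bank 2, word 100
```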
- Control instructions consist of conditional branching which is provided by the BLZ instruction.
- Interconnect instructions deal with the COMM register.
- the LOAD instruction enables the PE to read from the COMM register of any of its four neighbors.
- the LOAD instruction uses the DIR field (north, south, east, or west) to load the given operand location from the given direction.
- the STORE instruction is used by a PE to place data into the COMM register for its neighbors to read.
- the STORE instruction uses the DIR field to enable its write buffers in the given direction and reads its data from the given operand location.
- the array of processing elements has two modes: compute and override.
- In compute mode, each PE executes an instruction stream as defined by its control memory and program counter (referred to herein as PC).
- In override mode, the override instructions are streamed through the array from the upper-left-hand corner (PE (0, 0)) toward the lower-right-hand corner (PE ((n-1), (n-1))).
- In override mode, each PE forms its next override word by a bitwise OR operation on the override messages received from its neighbors to the north and west. Simultaneously, it transmits its current override word to its neighbors to the south and east.
- the override instructions are used for array I/O.
- the current array mode is determined by a global PC-enable signal.
- When PC-enable is 1, the array is in compute mode. Otherwise, it is in override mode. Unconnected communication input lines on the edges of the array are tied to ground.
- Each PE has four operand banks; each bank contains 128 32-bit words.
- Each PE's control memory is a single bank of 128 40-bit words. No instruction may use an operand bank more than once (each bank has a single read and a single write port).
- Each PE contains a 32-bit floating point unit which is capable of performing floating point add, subtract, multiply, divide and multiply-and- accumulate (MAC) operations.
- MAC multiply-and- accumulate
- the compute mode instruction format is provided in Table 2. Operands are specified with a two-dimensional address.
- the operand format is provided in Table 3. Therefore, the compute mode instruction is a 40-bit word which specifies the opcode and the banks and the offsets of each operand. Again, note that no instruction may use an operand bank more than once, hence R1, R2, R3, R4 each point to a different operand memory bank.
- The compute mode instructions are shown in Tables 4-7. Arithmetic instructions are provided in Table 4. Control instructions are provided in Table 5. If the word stored in R1 is negative, then the PC is loaded with the value stored in the seven rightmost bits of R4. PE-PE communication instructions must specify a direction as well as a command. Adjacent PEs may read each other's COMM register during the same clock cycle. This allows for full-duplex communication. Nearest neighbor compass direction encodings are provided in Table 6. The direction is stored in the two rightmost bits of R1. The instructions for communication between processing elements are provided in Table 7.
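The exact field layout of Tables 2 and 3 is not reproduced in this text. The sketch below assumes one layout consistent with the stated widths: a 4-bit opcode followed by four 9-bit operand fields (2-bit bank, 7-bit offset), totaling 40 bits. Treat the bit positions as an assumption, not the patented encoding.

```python
# Pack a hypothetical 40-bit compute-mode instruction:
# [opcode:4][R1:9][R2:9][R3:9][R4:9], each Rn = (bank, offset).
def pack_instr(opcode, r1, r2, r3, r4):
    banks = [r[0] for r in (r1, r2, r3, r4)]
    # No instruction may use an operand bank more than once.
    assert len(set(banks)) == 4, "operand banks must be distinct"
    word = opcode & 0xF
    for bank, offset in (r1, r2, r3, r4):
        word = (word << 9) | ((bank & 0x3) << 7) | (offset & 0x7F)
    return word  # fits in 40 bits

instr = pack_instr(0x3, (0, 10), (1, 20), (2, 30), (3, 40))
```

The distinct-bank assertion mirrors the single read/write port per bank noted in the text.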
- the override mode instruction format is provided in Table 8.
- the format of the Location field is provided in Table 9.
- the X-Coordinate increases to the east; the Y-Coordinate increases to the south.
- the memory banks (accessed by bank and offset) are defined in Table 10.
- the override mode instruction set is provided in Table 11.
- Put instructions targeting an operand memory must right-align the 32-bit datum within the 40-bit VAL field.
- PEs responding to a lookup instruction reading an operand memory must also right-align the 32-bit datum.
- decimal values stored to operand memories will be stored as single precision floating point values.
- Hexadecimal values will be stored to operand memories without conversion. Only unsigned hexadecimal values should be used.
- control signal equations are provided in pseudo-code.
- the fields for the overword are identified in Table 12.
- Overword[55..0] is obtained by "OR"-ing overwords from West and North neighbors.
- the fields for control memory word are identified in Table 13.
- the control memory word is a 40-bit word read from control memory.
- PEHit is 1 in override mode if XCoord and YCoord refer to the coordinates of the current PE.
- OMxAddrSel is used as a select line for the multiplexer which selects the address for the bank x.
- OMxOE_n is the active low output enable signal for bank x. It is deasserted during a write operation in compute mode and for a "PUT" instruction in override mode (which is again a write).
- OmxWrt_n is the active low write enable signal for bank x. It is active for an operand memory bank write during override mode ("PUT" instruction) or compute mode (writing result back to R4). In all other cases it is de-asserted.
- OmxWS_n is the write strobe which is controlled by the global write strobe (WS_n). The write will be executed only when write strobe is asserted.
- OmxWS_n = WS_n | OmxWrt_n; where x = 0, 1, 2, or 3.
- OmxCE_n is the active low chip enable signal for bank x. It is active at all times in the compute mode. In the override mode it is asserted for a "PUT" or
- the ResultSel is a select line to select the correct output from various sub-units of the functional unit.
- R1SrcSel = !PCEnable: Used to select the correct address for R1 (either CM.R1Bank or OWC.bank).
- SignBit = R1[31:31]: Sign of data stored at R1. Used to make the branching decision.
- PCInSel = !PCEnable: Used to select the correct input for the program counter.
- Control Memory signals - The functions of the following signals are similar to the operand memory signals.
- CMAddrSel = !PCEnable;
- CMWS_n = WS_n
- OWrdOutSel acts as the select line to select the final result. This is combined with the control information to form the output Ovrword.
- OWC[55:55] is the "FOUNDIT" signal (refer to instruction set) which changes the opcode to 11 when the value is found.
- OWC[55:55] = OWC[55:55] | ( (PEHit && OWC.op == 1 && (OWC.Bank
- Figs. 8A-8B illustrate the components of a processing element and the data flow through the processing element.
- the inputs to each PE are:
- the PE can be divided into various blocks, namely OvrWrd block, PC block, CM block, Functional Unit and Output block.
- OvrWrd Block - This consists of an OR gate and the OvrWrd register. The inputs WOvrWrd and NOvrWrd are ORed and the result acts as the OvrWrd for the current PE. This is stored in the OvrWrd register after splitting it into its components.
- the upper 16 bits of the OvrWrd (i.e., bits 55 to 40) are labeled as OvrWrdCtrl; the lower 40 bits (bits 39 to 0) are labeled as OvrWrdData.
- OvrWrdCtrl specifies the opcode and bank for the PE and OvrWrdData is the data or value used in that instruction.
- PC Block - This consists of the Program Counter (PC) and the supporting logic for reading, writing and incrementing the PC.
- PC Program Counter
- In compute mode, the PC is automatically incremented after every instruction to point to the next instruction in the control memory. The PC is rewritten with a new value when a branch instruction causes control to branch to a different address.
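The PC update rule just described can be sketched as follows. The 7-bit masking mirrors the 128-word control memory; the wraparound behavior is an assumption of this sketch, not stated in the text.

```python
# Compute-mode PC update: increment by default, load the branch
# target when a branch (e.g. BLZ with a negative R1 word) is taken.
def next_pc(pc, branch_taken, target):
    return target & 0x7F if branch_taken else (pc + 1) & 0x7F
```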
- CM Block - The Control Memory (CM) block consists of a single 128x40 SRAM bank.
- OvrWrdData is written in the CM at the address pointed by OwdCtrlAddr.
- the CM is such that when CMWS_n (CM write strobe) and CMWrt_n (CM write) are low and CMOE_n (CM output enable) is high, data is written into the CM.
- Functional Unit This consists of the 32-bit floating point unit with MAC support, 4 banks of 128x32 operand memories and a Result multiplexer to select the correct result to be written back to the memory.
- Output Block - This consists of the "Found-It" logic and the output multiplexer.
- the "Found-It” logic operates in the override mode and sets the MSB of the OvrWrd to 1 if a look-up instruction is successful.
- the output multiplexer is used to select the correct output from the functional unit.
- PCInSel = 0, so the input to the PC is OvrWrdData[6..0].
- OwdCtrlAddr, which in this case is 0
- OvrWrd Block - Since the PE is in compute mode, the OvrWrd block is disabled.
- CMAddrSel = 0, so that the address given to the CM is the PC output.
- the instruction is read from address (0000000)b of the CM. It is then decoded (the decoding is logical and is a part of the VHDL code and is not shown in the diagram) to give the values of CM.Op (the opcode), CM.R1Bank, CM.R1Addr, CM.R2Bank, CM.R2Addr, CM.R3Bank, CM.R3Addr, CM.R4Bank, and CM.R4Addr.
- OM0 as D0
- OM1 as D1
- OM2 as D2.
- R2) is D1.
- R2orR3 = D2.
- R1 and R2 are provided as inputs to the multiplier block (X) and the inputs to the add/subtract block
- (+/-) are the output of the multiplexer and R2orR3.
- the small delay provided by the +/- input multiplexer causes correct input to be given to the
- the load and store instructions are slightly different.
- CM.R1Addr[1..0] causes the multiplexer to select the data from the correct neighbor, and this data is then written back to the address given by the R4 field.
- A signal written as SIGNAL_n denotes an active-low signal.
- the global signals Clk, PCEnable, XCoord and YCoord are not shown connected to any component to keep the diagram as simple as possible.
- Clk is connected to every sequential component and PCEnable is connected to every component.
- the XCoord and YCoord signals are connected to a logic block which is used to identify the PE.
- Some inputs/outputs for the peripheral PEs which are not used are connected to ground. For example, for the PEs in the rightmost column, there is no East neighbor, hence their EOvrWrdOut/ECommOut lines are connected to ground.
- the present invention attempts to pack as much computation into as small a space as possible. This computational density should lead to a higher level of switching activity than what is seen in general-purpose processors. Therefore, power consumption and heat generation may become problematic. For example, heat generation can constrain the clock rate. It may be possible to apply a number of techniques in order to reduce the effects of high switching activity. These techniques include heat tolerant packaging, low-swing interconnect, and possibly scheduling operations in order to reduce switching activity.
Priority Applications (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP20020795726 EP1451701A1 (en) | 2001-12-06 | 2002-12-06 | Floating point intensive reconfigurable computing system for iterative applications |
CA002468800A CA2468800A1 (en) | 2001-12-06 | 2002-12-06 | Floating point intensive reconfigurable computing system for iterative applications |
AU2002360469A AU2002360469A1 (en) | 2001-12-06 | 2002-12-06 | Floating point intensive reconfigurable computing system for iterative applications |
US10/862,269 US20070067380A2 (en) | 2001-12-06 | 2004-06-07 | Floating Point Intensive Reconfigurable Computing System for Iterative Applications |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US33834701P | 2001-12-06 | 2001-12-06 | |
US60/338,347 | 2001-12-06 |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US10/862,269 Continuation-In-Part US20070067380A2 (en) | 2001-12-06 | 2004-06-07 | Floating Point Intensive Reconfigurable Computing System for Iterative Applications |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2003050697A1 (en) | 2003-06-19 |
Family
ID=23324454
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2002/038645 WO2003050697A1 (en) | 2001-12-06 | 2002-12-06 | Floating point intensive reconfigurable computing system for iterative applications |
Country Status (4)
Country | Link |
---|---|
EP (1) | EP1451701A1 (en) |
AU (1) | AU2002360469A1 (en) |
CA (1) | CA2468800A1 (en) |
WO (1) | WO2003050697A1 (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP1998258A1 (en) | 2007-05-31 | 2008-12-03 | VNS Portfolio LLC | Method and apparatus for connecting multiple multimode processors |
CN113760814A (en) * | 2017-03-28 | 2021-12-07 | 上海山里智能科技有限公司 | Integrated computing system |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5892962A (en) * | 1996-11-12 | 1999-04-06 | Lucent Technologies Inc. | FPGA-based processor |
US6289440B1 (en) * | 1992-07-29 | 2001-09-11 | Virtual Computer Corporation | Virtual computer of plural FPG's successively reconfigured in response to a succession of inputs |
US6289434B1 (en) * | 1997-02-28 | 2001-09-11 | Cognigine Corporation | Apparatus and method of implementing systems on silicon using dynamic-adaptive run-time reconfigurable circuits for processing multiple, independent data and control streams of varying rates |
US6408382B1 (en) * | 1999-10-21 | 2002-06-18 | Bops, Inc. | Methods and apparatus for abbreviated instruction sets adaptable to configurable processor architecture |
US6507947B1 (en) * | 1999-08-20 | 2003-01-14 | Hewlett-Packard Company | Programmatic synthesis of processor element arrays |
-
2002
- 2002-12-06 AU AU2002360469A patent/AU2002360469A1/en not_active Abandoned
- 2002-12-06 CA CA002468800A patent/CA2468800A1/en not_active Abandoned
- 2002-12-06 EP EP20020795726 patent/EP1451701A1/en not_active Withdrawn
- 2002-12-06 WO PCT/US2002/038645 patent/WO2003050697A1/en not_active Application Discontinuation
Non-Patent Citations (3)
Title |
---|
KNITTEL G.: "A PCI-compatible FPGA-coprocessor for 2D/3D image processing FPGAs for custom computing machines", IEEE SYMPOSIUM ON PROCEEDINGS, 17 April 1996 (1996-04-17) - 19 April 1996 (1996-04-19), pages 136 - 145, XP010206376 * |
LIGON W.B., III. ET AL.: "Implementation and analysis of numerical components for reconfigurable computing", AEROSPACE CONFERENCE, 1999. PROCEEDINGS. 1999 IEEE, vol. 2, 1999, pages 325 - 335, XP010350338 * |
MIRSKY E., DEHON A.: "MATRIX: a reconfigurable computing architecture with configurable instruction distribution and deployable resources FPGAs for custom computing machines", IEEE SYMPOSIUM ON PROCEEDINGS, 17 April 1996 (1996-04-17) - 19 April 1996 (1996-04-19), pages 157 - 166, XP010206378 * |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP1998258A1 (en) | 2007-05-31 | 2008-12-03 | VNS Portfolio LLC | Method and apparatus for connecting multiple multimode processors |
US7840826B2 (en) | 2007-05-31 | 2010-11-23 | Vns Portfolio Llc | Method and apparatus for using port communications to switch processor modes |
CN113760814A (en) * | 2017-03-28 | 2021-12-07 | 上海山里智能科技有限公司 | Integrated computing system |
Also Published As
Publication number | Publication date |
---|---|
EP1451701A1 (en) | 2004-09-01 |
CA2468800A1 (en) | 2003-06-19 |
AU2002360469A1 (en) | 2003-06-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11544191B2 (en) | Efficient hardware architecture for accelerating grouped convolutions | |
CN108572850B (en) | Vector processing unit, computing system including the same, and method performed therein | |
EP3637265B1 (en) | Memory device performing in-memory prefetching and system including the same | |
KR20220054357A (en) | Method for performing PROCESSING-IN-MEMORY (PIM) operations on serially allocated data, and related memory devices and systems | |
CN112463719A (en) | In-memory computing method realized based on coarse-grained reconfigurable array | |
US5053986A (en) | Circuit for preservation of sign information in operations for comparison of the absolute value of operands | |
US5119324A (en) | Apparatus and method for performing arithmetic functions in a computer system | |
KR20220051006A (en) | Method of performing PIM (PROCESSING-IN-MEMORY) operation, and related memory device and system | |
Kingyens et al. | The potential for a GPU-like overlay architecture for FPGAs | |
US11822510B1 (en) | Instruction format and instruction set architecture for tensor streaming processor | |
US10185560B2 (en) | Multi-functional execution lane for image processor | |
US20110185151A1 (en) | Data Processing Architecture | |
Kwon et al. | A 1ynm 1.25V 8Gb 16Gb/s/pin GDDR6-based accelerator-in-memory supporting 1TFLOPS MAC operation and various activation functions for deep learning application
US20050171990A1 (en) | Floating point intensive reconfigurable computing system for iterative applications | |
US7769981B2 (en) | Row of floating point accumulators coupled to respective PEs in uppermost row of PE array for performing addition operation | |
KR20210113099A (en) | Adjustable function-in-memory computation system | |
Yousefzadeh et al. | Energy-efficient in-memory address calculation | |
Stepchenkov et al. | Recurrent data-flow architecture: features and realization problems | |
US8539207B1 (en) | Lattice-based computations on a parallel processor | |
WO2003050697A1 (en) | Floating point intensive reconfigurable computing system for iterative applications | |
Iniewski | Embedded Systems: Hardware, Design and Implementation | |
Gayles et al. | The design of the MGAP-2: A micro-grained massively parallel array | |
CN115129464A (en) | Stochastic sparsity handling in systolic arrays | |
Todaro et al. | Enhanced soft GPU architecture for FPGAs
US20230289398A1 (en) | Efficient Matrix Multiply and Add with a Group of Warps |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AK | Designated states |
Kind code of ref document: A1 Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ OM PH PL PT RO RU SD SE SG SK SL TJ TM TN TR TT TZ UA UG US UZ VN YU ZA ZM ZW |
|
AL | Designated countries for regional patents |
Kind code of ref document: A1 Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR IE IT LU MC NL PT SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG |
|
121 | Ep: the epo has been informed by wipo that ep was designated in this application | ||
DFPE | Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101) | ||
WWE | Wipo information: entry into national phase |
Ref document number: 2468800 Country of ref document: CA |
|
WWE | Wipo information: entry into national phase |
Ref document number: 2002795726 Country of ref document: EP |
|
WWE | Wipo information: entry into national phase |
Ref document number: 10862269 Country of ref document: US |
|
WWE | Wipo information: entry into national phase |
Ref document number: 2002360469 Country of ref document: AU |
|
WWP | Wipo information: published in national office |
Ref document number: 2002795726 Country of ref document: EP |
|
WWW | Wipo information: withdrawn in national office |
Ref document number: 2002795726 Country of ref document: EP |
|
NENP | Non-entry into the national phase |
Ref country code: JP |
|
WWW | Wipo information: withdrawn in national office |
Country of ref document: JP |