US20050235173A1 - Reconfigurable integrated circuit - Google Patents
- Publication number
- US20050235173A1 (application US10/516,626)
- Authority
- US
- United States
- Prior art keywords
- processing elements
- processing
- processing element
- integrated circuit
- elements
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F15/00—Digital computers in general; Data processing equipment in general
- G06F15/76—Architectures of general purpose stored program computers
- G06F15/80—Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
- G06F15/8007—Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors single instruction multiple data [SIMD] multiprocessors
- G06F15/8023—Two dimensional arrays, e.g. mesh, torus
Definitions
- the invention relates to an integrated circuit having a plurality of processing elements for executing substantially in parallel at least a subset of a plurality of instructions; issuing means for configuring the plurality of processing elements by issuing a program-counter-driven instruction flow to the plurality of processing elements; and configurable interconnection means for connecting each processing element from the plurality of processing elements to at least a subset of other processing elements from the plurality of processing elements.
- the required resources for the processing architecture are combined in each processing element and distributed over the available silicon real estate in a regular grid, e.g. a two-dimensional repetitive layout.
- the integrated circuit of the present invention can simply reuse the one design by redefining the interconnect structure between the processing elements, or by redesigning only a single processor element, thus greatly reducing the time-to-market of the second IC.
- the second IC will also be less costly to produce, because the lithographic mask set of the first IC can be completely reused apart from the mask defining the interconnect, e.g. the VIA mask.
- the IC can simply be extended by adding an additional row or column of processing elements to the grid, which involves a minor design effort only.
- the integrated circuit comprises very long instruction word (VLIW) processor architecture and the subset of the plurality of instructions comprises a very long instruction word. More and more processing elements are being integrated in VLIW processors, which leads to serious routing issues between the various processing elements.
- By realizing a VLIW processor according to the teachings of the present invention, a processor architecture is obtained where these routing problems are avoided because every processing element is always close to a required resource.
- the configurable interconnection means connect each processing element to each nearest neighboring processing element in the grid. Consequently, this yields a regular grid with complete connectivity.
- the grid of processing elements can be used as a data flow machine, where each processing element is configured by the issuing means and kept in that configuration for several clock cycles, with the data being rippled from one side of the grid to another side of the grid.
- This is particularly advantageous for loop executions, because the dimensions of the grid can be tuned to the dimensions of the loop body, which can result in a whole loop or a large data-autonomous part of the loop being mapped on the grid.
- the performance of the loop execution will be dramatically enhanced, because the slow communication between the issuing means and/or the processing elements with data and instruction memories is greatly reduced.
- data flow applications can also be executed on a grid lacking full connectivity, albeit with reduced flexibility compared to the grid with complete connectivity, e.g. a grid in which each processing element is connected to all its nearest neighbors.
- the processing elements can also be operated in the traditional VLIW way exploiting instruction-level parallelism on a cycle-by-cycle basis.
- the IC can be seen as a reconfigurable device, because during operation the configuration of the IC can be switched from the dataflow mode to a traditional VLIW mode.
- the configurable interconnection means comprise bypassing means for bypassing a processing element from the plurality of processing elements.
- the use of bypassing means, e.g. multiplexers or other switching elements, in or around the processing elements further improves the performance of the IC, because non-neighboring processing elements can be in direct connection with each other if the processing elements in between the two communicating processing elements are bypassed.
- more than one connection path can be available between two different processing elements, configurable routing means like multiplexers being available for choosing which connection path is to be used.
- longer-distance connection paths can be provided, connecting processing elements that are not nearest neighbors. Again, configurable routing means can be used for choosing the appropriate connection paths.
- a processing element from the plurality of processing elements comprises a data storage unit, a function unit and an internal intercommunication network coupling the function unit to the data storage unit.
- the processing element comprises at least a further unit; the function unit, the further unit and the data storage unit being organized as a very long instruction word (VLIW) processor data path.
- the further unit can either be a function unit or a data storage unit.
- the issuing means are distributed over the processing elements in this embodiment.
- each VLIW processing element is equipped with its own operation register holding the control words that configure the data and control paths, e.g. the functionality of the function units and the routing between function units and data storage elements, of the VLIW processing element.
- an electronic device is provided as claimed in claim 8 .
- Integration of an IC according to the present invention into an electronic device leads to an electronic device with increased functional flexibility as well as a lower cost price, which substantially improves the marketability of such devices.
- a method for designing an integrated circuit is provided as claimed in claim 9 .
- Application of this method, for instance by means of a computer aided design (CAD) tool, will lead to an integrated circuit design having all the advantageous features as claimed in claim 1 .
- the step of connecting each processing element from the plurality of processing elements to at least a subset of other processing elements from the plurality of processing elements includes connecting each processing element to each nearest neighboring processing element in the grid.
- FIG. 1 depicts an integrated circuit according to the present invention
- FIG. 2 depicts an exemplary embodiment of a processing element according to the present invention
- FIG. 3 depicts another exemplary embodiment of a processing element according to the present invention.
- FIG. 4 depicts a flow chart of the method according to the present invention.
- integrated circuit 100 has a processor comprising a plurality of processing elements 120 organized in a regular grid.
- the processing elements 120 which are all substantially similar to each other, e.g. have substantially the same functionality, are interconnected by reconfigurable interconnection network 140 , e.g. an addressable data communication bus or a hardwired multiplexer network.
- Interconnection network 140 can be complete in the sense that every processing element 120 is connected to its nearest neighbor, or it can implement an incomplete network. In the latter case, some interconnects between processing elements 120 are absent, as indicated in FIG. 1 by the dashed lines.
- multiple connection paths may be provided between two processing elements, or longer-distance lines may be provided that connect processing elements that are not nearest neighbors. These alternatives have not been depicted in FIG. 1 for reasons of clarity only.
- the processing elements 120 are coupled to an issuing device 160 , as symbolized by the dashed box surrounding processing elements 120 .
- Issuing device 160 is responsible for dispatching global communication, e.g. instructions, from a central memory 180 to the plurality of processing elements 120 .
- the issuing device is responsible for handling exceptions and other configuration context switches, i.e. VLIW changes, in the grid of processing elements 120 .
- issuing device 160 is responsible for the program sequencing to and the control of processing elements 120 .
- the issuing device 160 will fetch instruction bundles, like VLIW instructions, from a central memory 180 on the basis of a value of its program counter, and will partition the bundles and dispatch the separate instructions to the appropriate processing elements 120 .
- the program counter of the issuing device will be routinely altered, e.g. incrementally increased or decreased, and a next instruction bundle will be fetched.
- one of the processing elements 120 signals the detection of an exception, e.g.
- issuing device 160 will reset its program counter according to the exception and, if necessary, will flush the redundant data from processing elements 120 before issuing new instructions to the processing elements 120 on the basis of the reset value of the program counter. It will be recognized by those skilled in the art that this is a well-known way of controlling a processing architecture implementing instruction-level parallelism.
- the combination of the mapping of the desired processor functionality of the integrated circuit 100 on every processing element 120 of the processor with the organization of the processing elements 120 in a regular grid with the at least partial interconnect between the processing elements 120 provides an important advantage over prior art instruction-level-parallelized processor architectures.
- the direct data communication between any processing element 120 and a neighboring processing element has the same latency throughout the whole grid.
- a set of instructions are mapped on the processing elements 120 of integrated circuit 100 and the interconnection network 140 is configured to connect a processing element 120 to its appropriate neighbors.
- this configuration is frozen and data is allowed to ripple through the grid in a classical data flow manner. This is particularly useful if the grid is large enough to map a complete loop body onto, which then means that loop execution can be realized in a highly effective and parallel manner.
- the data flow concept can still be utilized by breaking up the loop into smaller loops, data dependencies permitting, that can be mapped onto the grid in their entirety.
- intercommunication network 140 can include hardware to bypass individual processing elements 120 in the grid, for instance by means of multiplexers that provide a direct routing through or around a processing element 120 or by means of hard-wired bypasses.
- Processing element 120 has a data storage unit 122 , e.g. a memory or a part of a distributed register file, and a function unit 124 , which can be an arithmetic logic unit (ALU), an address computation unit (ACU), a multiplier, a multiply-accumulate unit (MAC) and so on.
- the data storage unit 122 is coupled to function unit 124 through an internal intercommunication network 140 b , which is either directly coupled to an external intercommunication network 140 a or coupled to external intercommunication network 140 a through a control unit 142 .
- the control unit 142 can for instance be a distributed bus controller or a network of multiplexers responsive to issuing device 160 .
- Both internal communication network 140 b and external communication network 140 a , which together form intercommunication network 140 , can be realized as a point-to-point hard-wired network, as a data communication bus, or as a combination thereof.
- In FIG. 3 , which is described with back-reference to FIG. 2 and its detailed description, another exemplary embodiment of a processing element 120 is given.
- Multiplexers 220 a - b, 220 c - d and 220 e - f are respectively coupled to a function unit 224 , a further unit 226 and a data storage unit 228 through buffers, e.g. register files, 222 a - f.
- the further unit 226 may be a further function unit or a further data storage unit. This is by way of non-limiting example only; other configurations, for instance a configuration in which several units share a buffer, can be thought of without departing from the scope of the invention.
- function unit 224 can be a 2-input ALU with its data inputs coupled to buffers 222 a and 222 b, respectively.
- Further unit 226 can be a 2-input MAC with its data inputs coupled to buffers 222 c and 222 d, respectively and data storage unit 228 can be a random access memory with an address input coupled to buffer 222 e and a data input coupled to buffer 222 f , although many other configurations are of course possible.
- the inputs of multiplexers 220 a - f are coupled to an external interconnection network 140 a and an internal interconnection network 140 b.
- External interconnection network 140 a is coupled to processing element 120 through data input ports 152 a - c on the data input side and through output arrangement 250 on the output side.
- the number of data input ports is defined by the number of neighbors the processing element 120 is connected to.
- Output arrangement 250 has a multiplexer 252 , an optional buffer 254 and an output port 256 for coupling processing element 120 to its neighboring processing elements. This ensures that only relevant data is broadcast to connected neighboring processing elements through output port 256 .
- output arrangement 250 can also serve as a bypass for the processing element 120 ; the data input received through input ports 152 a - c can be directly forwarded to other processing elements through the appropriate configuration of multiplexer 252 .
- internal interconnection network 140 b is fully connected, e.g. each output of units 224 , 226 and 228 is coupled to multiplexers 220 a - f and multiplexer 252 . It is emphasized that this is by way of non-limiting example only; a partially connected interconnection network 140 b can alternatively be used without departing from the scope of the present invention.
- Issuing device 160 can be distributed over processing elements 120 .
- a local issuing device 260 is responsible for the control of the data path of processing element 120 , by controlling the configuration of multiplexers 220 a - f, issuing opcodes to the function units, addresses to the data storage units, and, optionally, controlling the configuration of multiplexer 252 .
- Local issuing device 260 could have its own local operation register, so the global VLIW instruction can simply be formed by linking all local operation registers.
- the processor instruction memory itself could be partitioned into multiple memory blocks, each memory block being local to a processing element 120 , each memory block containing the part of the very long instruction word relevant to its corresponding processing element.
- each local issuing device 260 , having its own local instruction memory block and local operation register, could be associated with its own local program sequencing and control logic, and its own Program Counter (PC), which means that each processing element 120 could operate as a VLIW processor itself.
- the vast flexibility of the integrated circuit 100 according to the present invention enables the integration of very large scale parallelism in its architecture, which renders integrated circuit 100 suitable for the performance of very demanding computations, e.g. broadband digital signal processing, that are difficult, if not currently impossible, to achieve with known architectures. Therefore, integration of an integrated circuit 100 according to the present invention into an electronic device requiring such demanding computations, e.g. future generation mobile telecommunication devices, will not only make the realization of such future technologies feasible, but will also make the technology affordable, because of the limited design cost of the integrated circuit 100 .
- a flow chart 400 depicts the crucial steps for designing an integrated circuit with a processing architecture according to the present invention.
- in a first step 420 , the processing elements from the plurality of processing elements are designed to be substantially similar to each other and each processing element from the plurality of processing elements is designed to be capable of executing each instruction from the plurality of instructions. Obviously, this has only to be done for a single one of the processing elements 120 , since all other processing elements in the grid should be largely similar to this single processing element 120 . This approach drastically reduces the design effort for such very large scale integration circuits utilizing instruction-level parallelism.
- in a second step 440 , the plurality of processing elements are laid out in a regular grid wherein a distance between a processing element from the plurality of processing elements and a nearest neighboring processing element from the plurality of processing elements in a first direction is substantially the same as a distance between the processing element and a nearest neighboring processing element from the plurality of processing elements in a second direction.
- the organization of the processing elements in the regular grid not only enables the aforementioned reconfigurable behavior of the integrated circuit 100 , e.g. the ability to switch between a data flow mode and an instruction-level parallelism mode, but it also offers the possibility to reuse the logic layout for other applications when another interconnection structure is required.
- each processing element 120 from the plurality of processing elements is connected to at least a subset of other processing elements from the plurality of processing elements.
- each processing element 120 can be connected to each nearest neighboring processing element in the grid to yield a completely connected two-dimensional grid in the sense that each processing element 120 is connected to each nearest neighbor.
- the definition of different interconnection networks 140 for a grid of processing elements 120 enables the reuse of the grid of processing elements 120 for other applications based on the same overall logic layout. In such a case, only the interconnect has to be redefined, which means that only a small design effort is required and only one or a few interconnect masks (e.g. a VIA mask, or an upper metal layer mask) have to be redeveloped. Both these advantages realize a substantial cost reduction in the development of follow-up IC designs.
- any reference signs placed between parentheses shall not be construed as limiting the claim.
- the word “comprising” does not exclude the presence of elements or steps other than those listed in a claim.
- the word “a” or “an” preceding an element does not exclude the presence of a plurality of such elements.
- the invention can be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the device claim enumerating several means, several of these means can be embodied by one and the same item of hardware. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.
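The program-counter-driven issuing attributed above to issuing device 160 (fetch a bundle from central memory 180, partition it, dispatch one instruction per processing element 120, then alter the program counter or reset it on an exception) can be sketched roughly as follows. This is only an illustrative model, not the patent's implementation: the class name `IssuingDevice`, the `trace` list, the `exception_target` parameter and the sample program are all assumptions made for the example, and flushing of redundant data is not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class IssuingDevice:
    """Toy stand-in for issuing device 160 (all names are hypothetical)."""
    memory: list        # stands in for central memory 180: a list of bundles
    num_elements: int   # number of processing elements 120 in the grid
    pc: int = 0         # program counter
    trace: list = field(default_factory=list)  # (pc, element, instruction)

    def step(self, exception_target=None):
        """Issue one bundle; if an exception was signalled, reset the PC first."""
        if exception_target is not None:
            self.pc = exception_target          # PC reset according to the exception
        bundle = self.memory[self.pc]           # fetch on the basis of the PC value
        if len(bundle) != self.num_elements:
            raise ValueError("bundle must hold one instruction per processing element")
        # Partition the bundle and dispatch each instruction to its element.
        for element, instruction in enumerate(bundle):
            self.trace.append((self.pc, element, instruction))
        self.pc += 1                            # routine alteration of the PC
        return bundle


# Hypothetical three-bundle program for a grid of two processing elements.
program = [("add", "mul"), ("sub", "nop"), ("ld", "st")]
device = IssuingDevice(memory=program, num_elements=2)
device.step()                     # issues bundle 0
device.step(exception_target=2)   # an element signalled an exception: jump to bundle 2
```

In this sketch the second call skips bundle 1 entirely, which mirrors the described behavior of issuing new instructions on the basis of the reset program counter value.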
Landscapes
- Engineering & Computer Science (AREA)
- Computer Hardware Design (AREA)
- Theoretical Computer Science (AREA)
- Computing Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Design And Manufacture Of Integrated Circuits (AREA)
- Logic Circuits (AREA)
- Advance Control (AREA)
- Semiconductor Integrated Circuits (AREA)
Abstract
The present invention describes an integrated circuit (100) having a processor that consists of a plurality of identical, or at least very similar, processing elements (120) organized in a regular grid. Each processing element (120) is capable of executing the desired functionality of the processor. The processing elements (120) are interconnected by a configurable interconnection network (140) and are controlled by a program sequencing issuing device (160) capable of handling exceptions in the instruction flow through the processing elements (120). Consequently, the integrated circuit (100) can be easily redesigned, thus reducing design effort and time-to-market for such architectures.
Description
- The invention relates to an integrated circuit having a plurality of processing elements for executing substantially in parallel at least a subset of a plurality of instructions; issuing means for configuring the plurality of processing elements by issuing a program-counter-driven instruction flow to the plurality of processing elements; and configurable interconnection means for connecting each processing element from the plurality of processing elements to at least a subset of other processing elements from the plurality of processing elements.
- The ongoing downscaling of semiconductor dimensions has led and still leads to an increase of the number of building blocks being integrated on the available area of a semiconductor device, e.g. an integrated circuit. Consequently, such devices become more versatile and the performance demands for such devices increase accordingly. This is particularly the case for circuits that are being designed to perform a dedicated task, e.g. real time digital audio or video signal processing, and which include so-called application-specific instruction set processors (ASIPs), which may have architectures as defined in the opening paragraph.
- The ever increasing performance demands for ASIPs combined with the technology downscaling typically imply that for a next generation ASIP not only more processing elements are integrated into the design, but also that the IC architecture is redesigned from scratch, because the performance of the previous generation processing elements is no longer sufficient to meet the requirements for the new ASIP.
- However, this trend is associated with a problem that becomes an increasingly difficult hurdle to overcome for forthcoming integrated circuit technologies. The increase of processing elements in those integrated circuits and the aforementioned limited reusability of these processing elements in future generation ICs imply an ongoing increase in design effort for the designers of these ICs. In addition, the increasing number of processing elements to be included in the IC design introduces design complications, because the necessary interconnect between those processing elements becomes increasingly complex. This is already starting to lead to difficult routing issues; interconnect lines between two processing elements can become so long that the transmission delay on the line jeopardizes or even prevents the performance requirements from being met. This is a very serious problem, because the required time-to-market for ICs is becoming shorter and shorter, which obviously clashes with the aforementioned increasing design complications.
- It is an object of the present invention to provide an integrated circuit of the kind described in the opening paragraph that can be upgraded with a relatively small design effort.
- The invention is defined by the independent claims. Advantageous embodiments are defined in the dependent claims.
- According to the present invention, the required resources for the processing architecture are combined in each processing element and distributed over the available silicon real estate in a regular grid, e.g. a two-dimensional repetitive layout. This obviously creates some area overhead because, in contrast to prior art ASICs, all or at least most processing elements will comprise building blocks that might not be used during certain clock cycles. It is emphasized, however, that this is not considered to be a drawback, since the ongoing semiconductor dimension downscaling allows for more and more functionality to be integrated onto an integrated circuit. More importantly, the combination of predominantly homogeneous processing elements and the regular grid allows for fast and cheap redesign of processing architectures. In contrast to prior art integrated circuits, where two architectures for two application domains typically both had to be redesigned from scratch, the integrated circuit of the present invention can simply reuse the one design by redefining the interconnect structure between the processing elements, or by redesigning only a single processor element, thus greatly reducing the time-to-market of the second IC. Furthermore, the second IC will also be less costly to produce, because the lithographic mask set of the first IC can be completely reused apart from the mask defining the interconnect, e.g. the VIA mask. Furthermore, when the number of resources integrated in the first design is no longer sufficient to meet the performance requirements of the IC, the IC can simply be extended by adding an additional row or column of processing elements to the grid, which involves a minor design effort only.
- It is particularly advantageous if the integrated circuit comprises very long instruction word (VLIW) processor architecture and the subset of the plurality of instructions comprises a very long instruction word. More and more processing elements are being integrated in VLIW processors, which leads to serious routing issues between the various processing elements. By realizing a VLIW processor according to the teachings of the present invention, a processor architecture is obtained where these routing problems are avoided because every processing element is always close to a required resource.
- It is a further advantage if the configurable interconnection means connect each processing element to each nearest neighboring processing element in the grid. Consequently, this yields a regular grid with complete connectivity. This provides increased flexibility in the use of the integrated circuit. For instance, the grid of processing elements can be used as a data flow machine, where each processing element is configured by the issuing means and kept in that configuration for several clock cycles, with the data being rippled from one side of the grid to another side of the grid. This is particularly advantageous for loop executions, because the dimensions of the grid can be tuned to the dimensions of the loop body, which can result in a whole loop or a large data-autonomous part of the loop being mapped on the grid. Consequently, the performance of the loop execution will be dramatically enhanced, because the slow communication between the issuing means and/or the processing elements with data and instruction memories is greatly reduced. Obviously, such data flow applications can also be executed on a grid lacking full connectivity, albeit with reduced flexibility compared to the grid with complete connectivity, e.g. a grid in which each processing element is connected to all its nearest neighbors. On the other hand, the processing elements can also be operated in the traditional VLIW way exploiting instruction-level parallelism on a cycle-by-cycle basis. Thus, the IC can be seen as a reconfigurable device, because during operation the configuration of the IC can be switched from the dataflow mode to a traditional VLIW mode.
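To make the nearest-neighbor grid and the data flow mode more concrete, the following sketch models both in a few lines. It rests on assumptions that the text does not state: "nearest neighbor" is taken to mean the north, south, west and east elements of a two-dimensional mesh, only a single grid row is simulated, and the function names `nearest_neighbors` and `ripple` are invented for the example.

```python
def nearest_neighbors(rows, cols, r, c):
    """Return the grid coordinates of the nearest neighbors of element (r, c).

    Assumption: nearest neighbors are the north/south/west/east elements
    of a mesh; elements on the border simply have fewer neighbors.
    """
    candidates = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
    return [(i, j) for i, j in candidates if 0 <= i < rows and 0 <= j < cols]


def ripple(frozen_ops, stream):
    """Data flow mode for one grid row.

    The configuration (one operation per element) stays frozen while the
    data moves one element to the right on every simulated clock cycle.
    None marks an empty pipeline slot, so the data itself must not be None.
    """
    stages = [None] * len(frozen_ops)   # output register of each element
    results = []
    # Feed the stream, then feed empty slots until the row has drained.
    for value in list(stream) + [None] * len(frozen_ops):
        if stages[-1] is not None:
            results.append(stages[-1])  # data leaving the far side of the row
        # Update right to left so each element reads its left neighbor's
        # output from the previous cycle.
        for i in range(len(frozen_ops) - 1, 0, -1):
            stages[i] = None if stages[i - 1] is None else frozen_ops[i](stages[i - 1])
        stages[0] = None if value is None else frozen_ops[0](value)
    return results


# Hypothetical loop body "y = (x + 1) * 2" mapped onto two elements.
loop_body = [lambda x: x + 1, lambda x: x * 2]
outputs = ripple(loop_body, [1, 2, 3])
```

Once the row is filled, one result leaves it per cycle without any further instruction being issued, which is the source of the reduced traffic to the issuing means described above.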
- At this point, it is emphasized that there are important fundamental differences between known reconfigurable devices like field programmable gate arrays (FPGAs) and the regularly structured IC according to the present invention. Not only are the known reconfigurable devices typically very slow because of the large number of reconfiguration points that have to be accessed during configuration of the device, but the known reconfigurable devices are also not capable of exception handling, like the switching of a configuration context, i.e. a very long instruction word, of the processor architecture following the execution of a jump instruction or a conditional expression like a branch instruction. Therefore, those skilled in the art of designing high-performance ICs will look away from the FPGA-related domain, because those architectures offer neither the necessary performance nor the required functionality.
- It is another advantage if the configurable interconnection means comprise bypassing means for bypassing a processing element from the plurality of processing elements. The use of bypassing means, e.g. multiplexers or other switching elements, in or around the processing elements further improves the performance of the IC, because non-neighboring processing elements can be in direct connection with each other if the processing elements in between the two communicating processing elements are bypassed. In addition, more than one connection path can be available between two different processing elements, configurable routing means like multiplexers being available for choosing which connection path is to be used. Furthermore, longer-distance connection paths can be provided, connecting processing elements that are not nearest neighbors. Again, configurable routing means can be used for choosing the appropriate connection paths.
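The effect of the bypassing means can be illustrated with a very small model. It assumes, purely for illustration, that each processing element in a row is represented by a `(bypass, operation)` pair, where the boolean plays the role of a multiplexer select signal; the function name `route_through_row` is hypothetical.

```python
def route_through_row(value, elements):
    """Pass a value along a row of processing elements.

    Each element is a (bypass, operation) pair. A bypassed element forwards
    its input unchanged, so the elements on either side of it behave as if
    they were directly connected.
    """
    for bypass, operation in elements:
        if not bypass:
            value = operation(value)
    return value


# The middle element is bypassed, so the first and third elements
# communicate as if they were neighbors.
row = [
    (False, lambda x: x + 1),
    (True, lambda x: x * 100),   # configured but bypassed: has no effect
    (False, lambda x: x * 2),
]
routed = route_through_row(3, row)
```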
- It is yet another advantage if a processing element from the plurality of processing elements comprises a data storage unit, a function unit and an internal intercommunication network coupling the function unit to the data storage unit. By providing each processing element with a function unit and a data storage element, e.g. a small memory or a distributed register file, the slow communications between function units and central memories and/or register files can be avoided or at least reduced and the IC performance is enhanced. This is even more the case if the data storage element is also coupled to the configurable interconnection means, because then it can also serve as data supplier for function units in other processing elements.
- In an embodiment of the present invention, the processing element comprises at least a further unit; the function unit, the further unit and the data storage unit being organized as a very long instruction word (VLIW) processor data path. This embodies a hierarchical VLIW architecture, which enhances the flexibility of the design. The further unit can either be a function unit or a data storage unit.
- Advantageously, the issuing means are distributed over the processing elements in this embodiment. For instance, each VLIW processing element is equipped with its own operation register holding the control words that configure the data and control paths, e.g. the functionality of the function units and the routing between function units and data storage elements, of the VLIW processing element. Thus, a delocalized issuing architecture is obtained, which is again advantageous in terms of performance.
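- A minimal sketch of such a delocalized issuing scheme follows. It assumes that each processing element holds exactly one control word in a local operation register and that the global instruction is the ordered collection of those words; the class and field names (`ControlWord`, `LocalIssuer`, `mux_select`) are hypothetical illustrations, not terms from the application.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ControlWord:
    opcode: str                  # operation selected for the function unit
    mux_select: Tuple[int, ...]  # input chosen by each element-internal multiplexer


class LocalIssuer:
    """Hypothetical local issuing logic holding one operation register."""

    def __init__(self) -> None:
        # An idle element holds a no-operation control word.
        self.operation_register = ControlWord("nop", ())

    def load(self, word: ControlWord) -> None:
        self.operation_register = word


def global_instruction(issuers: List[LocalIssuer]) -> List[ControlWord]:
    """Link all local operation registers into one very long instruction word."""
    return [issuer.operation_register for issuer in issuers]


issuers = [LocalIssuer() for _ in range(3)]
issuers[0].load(ControlWord("add", (0, 1)))
issuers[2].load(ControlWord("mac", (2, 0)))
word = global_instruction(issuers)
```

In this sketch no central register ever stores the whole word; it exists only as the concatenation of the local registers.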
- According to a further aspect of the invention, an electronic device is provided as claimed in claim 8. Integration of an IC according to the present invention into an electronic device leads to an electronic device with increased functional flexibility as well as a lower cost price, which substantially improves the marketability of such devices.
- According to yet a further aspect of the invention, a method for designing an integrated circuit is provided as claimed in claim 9. Application of this method, for instance by means of a computer aided design (CAD) tool, will lead to an integrated circuit design having all the advantageous features as claimed in claim 1.
- It is an advantage if the step of connecting each processing element from the plurality of processing elements to at least a subset of other processing elements from the plurality of processing elements includes connecting each processing element to each nearest neighboring processing element in the grid. By connecting a processing element to all its nearest neighbors, an IC design with a grid having complete interconnect can be obtained, which yields an IC design having the advantageous characteristics of the IC as claimed in claim 3.
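- The connecting step with complete interconnect can be sketched as follows. The sketch assumes a rectangular grid in which the nearest neighbors of an element are the up to four elements directly above, below, left and right of it, without wrap-around at the edges; that neighborhood definition and the function name `connect_nearest_neighbors` are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Coord = Tuple[int, int]  # (row, column) position in the regular grid


def connect_nearest_neighbors(rows: int, cols: int) -> Dict[Coord, List[Coord]]:
    """Connect every processing element to each of its nearest neighbors."""
    links: Dict[Coord, List[Coord]] = {}
    for r in range(rows):
        for c in range(cols):
            candidates = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
            # Keep only candidates that actually lie inside the grid.
            links[(r, c)] = [
                (y, x) for (y, x) in candidates if 0 <= y < rows and 0 <= x < cols
            ]
    return links


grid = connect_nearest_neighbors(3, 3)
```

An incomplete network would be obtained from the same grid by removing entries from `links`, leaving the placement of the elements untouched.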
- The invention is described in more detail and by way of non-limiting examples with reference to the accompanying drawings, wherein:
- FIG. 1 depicts an integrated circuit according to the present invention;
- FIG. 2 depicts an exemplary embodiment of a processing element according to the present invention;
- FIG. 3 depicts another exemplary embodiment of a processing element according to the present invention; and
- FIG. 4 depicts a flow chart of the method according to the present invention.
- In FIG. 1, integrated circuit 100 has a processor comprising a plurality of processing elements 120 organized in a regular grid. The processing elements 120, which are all substantially similar to each other, e.g. have substantially the same functionality, are interconnected by reconfigurable interconnection network 140, e.g. an addressable data communication bus or a hardwired multiplexer network. Interconnection network 140 can be complete in the sense that every processing element 120 is connected to its nearest neighbor, or it can implement an incomplete network. In the latter case, some interconnects between processing elements 120 are absent, as indicated in FIG. 1 by the dashed lines. In addition, multiple connection paths may be provided between two processing elements, or longer-distance lines may be provided that connect processing elements that are not nearest neighbors. These alternatives have not been depicted in FIG. 1 for reasons of clarity only.
- The processing elements 120 are coupled to an issuing device 160, as symbolized by the dashed box surrounding processing elements 120. Issuing device 160 is responsible for dispatching global communication, e.g. instructions, from a central memory 180 to the plurality of processing elements 120. Furthermore, the issuing device is responsible for handling exceptions and other configuration context switches, i.e. VLIW changes, in the grid of processing elements 120. In short, issuing device 160 is responsible for the program sequencing to and the control of processing elements 120.
- For instance, the issuing device 160 will fetch instruction bundles, like VLIW instructions, from a central memory 180 on the basis of a value of its program counter, and will partition the bundles and dispatch the separate instructions to the appropriate processing elements 120. In a next step, the program counter of the issuing device will be routinely altered, e.g. incrementally increased or decreased, and a next instruction bundle will be fetched. However, if one of the processing elements 120 signals the detection of an exception, e.g. a jump instruction being taken or a branch condition being met, or if an interrupt is being signaled and so on, issuing device 160 will reset its program counter according to the exception and, if necessary, will flush the redundant data from processing elements 120 before issuing new instructions to the processing elements 120 on the basis of the reset value of the program counter. It will be recognized by those skilled in the art that this is a well-known way of controlling a processing architecture implementing instruction-level parallelism.
- However, the combination of the mapping of the desired processor functionality of the integrated circuit 100 on every processing element 120 of the processor with the organization of the processing elements 120 in a regular grid with the at least partial interconnect between the processing elements 120 provides an important advantage over prior art instruction-level-parallelized processor architectures. In the integrated circuit 100 according to the present invention, the direct data communication between any processing element 120 and a neighboring processing element has the same latency throughout the whole grid. Thus, by definition, if a timing constraint is satisfied between any of the processing elements 120 and a connected neighboring processing element, this holds for all (connected) nearest neighbors of processing elements 120. Not only does this imply that the design of the processor architecture becomes more straightforward, but it also provides a data flow driven processing mode that is not typically associated with instruction level parallelized processing.
- In a data flow mode, a set of instructions is mapped on the processing elements 120 of integrated circuit 100 and the interconnection network 140 is configured to connect a processing element 120 to its appropriate neighbors. Now, for a period of time, e.g. a number of clock cycles, this configuration is frozen and data is allowed to ripple through the grid in a classical data flow manner. This is particularly useful if the grid is large enough to map a complete loop body onto, which then means that loop execution can be realized in a highly effective and parallel manner. In addition, if the loop is too large to be mapped in its entirety onto the grid, the data flow concept can still be utilized by breaking up the loop into smaller loops, data dependencies permitting, that can be mapped onto the grid in their entirety. If, instead, the loop body is too small to keep a majority of the processing elements in the grid busy, software pipelining can be applied, which can be particularly effective if the processing elements 120 have a data storage unit like a part of a distributed register file or a random access memory, because intermediate results can be stored in the local storage unit and can be forwarded to a neighboring processing element when necessary. This enables high speed, distributed communication, which typically means that very few communication conflicts occur in the processor architecture of integrated circuit 100, if any. The time period that the grid is kept in data flow mode can be monitored by a simple clock cycle counter, which is coupled to and can be integrated in the issuing device 160, although other control schemes are feasible as well, like data or control output monitoring in a synchronous or asynchronous data flow mode. To increase flexibility even further, intercommunication network 140 can include hardware to bypass individual processing elements 120 in the grid, for instance by means of multiplexers that provide a direct routing through or around a processing element 120 or by means of hard-wired bypasses.
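- The data flow mode described above can be sketched in a few lines. The sketch assumes a loop body mapped onto a one-dimensional chain of processing elements, each with a single local result register, and a configuration that stays frozen for a fixed number of clock cycles; the function `run_data_flow` and its parameters are hypothetical and serve only to illustrate the rippling of data through a frozen configuration.

```python
from typing import Callable, List, Optional


def run_data_flow(stages: List[Callable[[int], int]],
                  inputs: List[int],
                  cycles: int) -> List[int]:
    """Ripple inputs through a frozen chain of processing elements.

    stages[i] is the operation configured on element i; it stays unchanged
    for the whole run, and the loop variable plays the role of the clock
    cycle counter that bounds the data flow period.
    """
    regs: List[Optional[int]] = [None] * len(stages)  # local storage per element
    pending = list(inputs)
    outputs: List[int] = []
    for _ in range(cycles):
        # The value held by the last element leaves the grid this cycle.
        if regs[-1] is not None:
            outputs.append(regs[-1])
        # Update from the back so each element reads its neighbor's old value.
        for i in range(len(stages) - 1, 0, -1):
            prev = regs[i - 1]
            regs[i] = stages[i](prev) if prev is not None else None
        regs[0] = stages[0](pending.pop(0)) if pending else None
    return outputs


# Loop body y = (x + 1) * 2 - 3 spread over three elements.
body = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
results = run_data_flow(body, inputs=[1, 2, 3], cycles=6)
```

Once the chain is filled, one result leaves per cycle, so the cycle budget has to cover the number of inputs plus the depth of the chain for all results to emerge.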
- Now, the following Figs. will be described with backreference to FIG. 1 and its detailed description. Corresponding reference numerals will have the same meaning, unless explicitly stated otherwise. In FIG. 2, an exemplary embodiment of a processing element 120 is depicted. Processing element 120 has a data storage unit 122, e.g. a memory or a part of a distributed register file, and a function unit 124, which can be an arithmetic logic unit (ALU), an address computation unit (ACU), a multiplier, a multiply-accumulate unit (MAC) and so on. The data storage unit 122 is coupled to function unit 124 through an internal intercommunication network 140 b, which is either directly coupled to an external intercommunication network 140 a or coupled to external intercommunication network 140 a through a control unit 142. The control unit 142 can for instance be a distributed bus controller or a network of multiplexers responsive to issuing device 160. Both internal communication network 140 b and external communication network 140 a, which together form intercommunication network 140, can be realized as a point-to-point hard-wired network, as a data communication bus, or as a combination thereof.
- In FIG. 3, which is described in backreference to FIG. 2 and its detailed description, another exemplary embodiment of a processing element 120 is given. Multiplexers 220 a-b, 220 c-d and 220 e-f are respectively coupled to a function unit 224, a further unit 226 and a data storage unit 228 through buffers, e.g. register files, 222 a-f. The further unit 226 may be a further function unit or a further data storage unit. This is by way of non-limiting example only; other configurations, for instance a configuration in which several units share a buffer, can be thought of without departing from the scope of the invention. In the embodiment of FIG. 3, function unit 224 can be a 2-input ALU with its data inputs coupled to buffers Further unit 226 can be a 2-input MAC with its data inputs coupled to buffers data storage unit 228 can be a random access memory with an address input coupled to buffer 222 e and a data input coupled to buffer 222 f, although many other configurations are of course possible.
- The inputs of multiplexers 220 a-f are coupled to an external interconnection network 140 a and an internal interconnection network 140 b. External interconnection network 140 a is coupled to processing element 120 through data input ports 152 a-c on the data input side and through output arrangement 250 on the output side. The number of data input ports is defined by the number of neighbors the processing element 120 is connected to. Output arrangement 250 has a multiplexer 252, an optional buffer 254 and an output port 256 for coupling processing element 120 to its neighboring processing elements. This ensures that only relevant data is broadcast to connected neighboring processing elements through output port 256. It is pointed out that output arrangement 250 can also serve as a bypass for the processing element 120; the data input received through input ports 152 a-c can be directly forwarded to other processing elements through the appropriate configuration of multiplexer 252. In FIG. 3, internal interconnection network 140 b is fully connected, e.g. each output of units multiplexer 252. It is emphasized that this is by way of non-limiting example only; a partially connected interconnection network 140 b can alternatively be used without departing from the scope of the present invention.
- Issuing device 160 can be distributed over processing elements 120. In FIG. 3, a local issuing device 260 is responsible for the control of the data path of processing element 120, by controlling the configuration of multiplexers 220 a-f, issuing opcodes to the function units, addresses to the data storage units, and, optionally, controlling the configuration of multiplexer 252. Local issuing device 260 could have its own local operation register, so the global VLIW instruction can simply be formed by linking all local operation registers. Optionally, the processor instruction memory itself could be partitioned into multiple memory blocks, each memory block being local to a processing element 120, each memory block containing the part of the very long instruction word relevant to its corresponding processing element. In a further embodiment, each local issuing device 260, having its own local instruction memory block and local operation register, could be associated with its own local program sequencing and control logic, and its own Program Counter (PC), which means that each processing element 120 could operate as a VLIW processor itself.
- At this point, it is emphasized that the vast flexibility of the integrated circuit 100 according to the present invention enables the integration of very large scale parallelism in its architecture, which renders integrated circuit 100 suitable for the performance of very demanding computations, e.g. broadband digital signal processing, that are difficult, if not currently impossible, to achieve with known architectures. Therefore, integration of an integrated circuit 100 according to the present invention into an electronic device requiring such demanding computations, e.g. future generation mobile telecommunication devices, will not only make the realization of such future technologies feasible, but will also make the technology affordable, because of the limited design cost of the integrated circuit 100.
- In FIG. 4, a flow chart 400 depicts the crucial steps for designing an integrated circuit with a processing architecture according to the present invention.
- In a first step 420, the processing elements from the plurality of processing elements are designed to be substantially similar to each other and each processing element from the plurality of processing elements is designed to be capable of executing each instruction from the plurality of instructions. Obviously, this has to be done for only a single one of the processing elements 120, since all other processing elements in the grid should be largely similar to this single processing element 120. This approach drastically reduces the design effort for such very large scale integration circuits utilizing instruction-level parallelism.
- In a second step 440, the plurality of processing elements are laid out in a regular grid wherein a distance between a processing element from the plurality of processing elements and a nearest neighboring processing element from the plurality of processing elements in a first direction is substantially the same as a distance between the processing element and a nearest neighboring processing element from the plurality of processing elements in a second direction. The organization of the processing elements in the regular grid not only enables the aforementioned reconfigurable behavior of the integrated circuit 100, e.g. the ability to switch between a data flow mode and an instruction-level parallelism mode, but it also offers the possibility to reuse the logic layout for other applications when another interconnection structure is required.
- This can be realized in a third step 460, where each processing element 120 from the plurality of processing elements is connected to at least a subset of other processing elements from the plurality of processing elements. Optionally, each processing element 120 can be connected to each nearest neighboring processing element in the grid to yield a completely connected two-dimensional grid in the sense that each processing element 120 is connected to each nearest neighbor. The definition of different interconnection networks 140 for a grid of processing elements 120 enables the reuse of the grid of processing elements 120 for other applications based on the same overall logic layout. In such a case, only the interconnect has to be redefined, which means that only a small design effort is required and only one or a few interconnect masks (e.g. a VIA mask, or an upper metal layer mask) have to be redeveloped. Both these advantages realize a substantial cost reduction in the development of follow-up IC designs.
- It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design many alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word “comprising” does not exclude the presence of elements or steps other than those listed in a claim. The word “a” or “an” preceding an element does not exclude the presence of a plurality of such elements. The invention can be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the device claim enumerating several means, several of these means can be embodied by one and the same item of hardware. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.
Claims (10)
1. An integrated circuit comprising:
a plurality of processing elements for executing substantially in parallel at least a subset of a plurality of instructions;
issuing means for configuring the plurality of processing elements by issuing a program-counter-driven instruction flow to the plurality of processing elements; and
configurable interconnection means for connecting each processing element from the plurality of processing elements to at least a subset of other processing elements from the plurality of processing elements;
characterized in that:
the processing elements from the plurality of processing elements are substantially similar to each other, each processing element from the plurality of processing elements being capable of executing each instruction from the plurality of instructions; and
the plurality of processing elements are laid out in a regular grid wherein a distance between a processing element and a neighboring processing element from the plurality of processing elements in a first direction is substantially the same as a distance between the processing element and a neighboring processing element from the plurality of processing elements in a second direction that is different from the first direction.
2. An integrated circuit as claimed in claim 1, wherein the integrated circuit comprises a very long instruction word processor architecture and the subset of the plurality of instructions comprises a very long instruction word.
3. An integrated circuit as claimed in claim 1, characterized in that the configurable interconnection means connect each processing element to each nearest neighboring processing element in the grid.
4. An integrated circuit as claimed in claim 1, characterized in that the configurable interconnection means comprise bypassing means for bypassing a processing element from the plurality of processing elements.
5. An integrated circuit as claimed in claim 1, characterized in that a processing element from the plurality of processing elements comprises a data storage unit, a function unit and an internal intercommunication network coupling the function unit to the data storage unit.
6. An integrated circuit as claimed in claim 5, characterized in that the processing element comprises at least a further unit; the function unit, the further unit and the data storage unit being organized as a very long instruction word processor data path.
7. An integrated circuit as claimed in claim 6, characterized in that the issuing means are distributed over the processing elements.
8. A data processing device having an input for receiving a digital data stream and having an output for transmitting a humanly perceptible data result resulting from the digital data stream, characterized in that the input is coupled to the output via an integrated circuit as claimed in claim 1, the integrated circuit being arranged for extracting the data result from the digital data stream.
9. A method for designing an integrated circuit, the integrated circuit comprising:
a plurality of processing elements for executing substantially in parallel at least a subset of a plurality of instructions;
issuing means for configuring the plurality of processing elements by issuing a program-counter-driven instruction flow to the plurality of processing elements; and
configurable interconnection means for connecting each processing element from the plurality of processing elements to at least a subset of other processing elements from the plurality of processing elements;
characterized by the method comprising the steps of:
designing the processing elements from the plurality of processing elements to be substantially similar to each other, and each processing element from the plurality of processing elements to be capable of executing each instruction from the plurality of instructions;
laying out the plurality of processing elements in a regular grid wherein a distance between a processing element and a neighboring processing element from the plurality of processing elements in a first direction is substantially the same as a distance between the processing element and a neighboring processing element from the plurality of processing elements in a second direction; and
connecting each processing element from the plurality of processing elements to at least a subset of other processing elements from the plurality of processing elements.
10. A method as claimed in claim 9, characterized in that the step of connecting each processing element from the plurality of processing elements to at least a subset of other processing elements from the plurality of processing elements includes connecting each processing element to each nearest neighboring processing element in the grid.
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP02077168.9 | 2002-06-03 | ||
EP02077168 | 2002-06-03 | ||
PCT/IB2003/002198 WO2003103015A2 (en) | 2002-06-03 | 2003-05-21 | Reconfigurable integrated circuit |
Publications (1)
Publication Number | Publication Date |
---|---|
US20050235173A1 (en) | 2005-10-20 |
Family ID=29595034
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US10/516,626 Abandoned US20050235173A1 (en) | 2002-06-03 | 2003-05-21 | Reconfigurable integrated circuit |
Country Status (7)
Country | Link |
---|---|
US (1) | US20050235173A1 (en) |
EP (1) | EP1514198A2 (en) |
JP (1) | JP2005528792A (en) |
CN (1) | CN1659540A (en) |
AU (1) | AU2003228062A1 (en) |
TW (1) | TW200405546A (en) |
WO (1) | WO2003103015A2 (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060101237A1 (en) * | 2003-03-17 | 2006-05-11 | Stefan Mohl | Data flow machine |
US20090281784A1 (en) * | 2007-11-01 | 2009-11-12 | Silicon Hive B.V. | Method And Apparatus For Designing A Processor |
US20130227255A1 (en) * | 2012-02-28 | 2013-08-29 | Samsung Electronics Co., Ltd. | Reconfigurable processor, code conversion apparatus thereof, and code conversion method |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2010034167A1 (en) * | 2008-09-28 | 2010-04-01 | 北京大学深圳研究生院 | Processor structure of integrated circuit |
CN109523019A (en) * | 2018-12-29 | 2019-03-26 | 百度在线网络技术(北京)有限公司 | Accelerator, the acceleration system based on FPGA and control method, CNN network system |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5915123A (en) * | 1997-10-31 | 1999-06-22 | Silicon Spice | Method and apparatus for controlling configuration memory contexts of processing elements in a network of multiple context processing elements |
US5956518A (en) * | 1996-04-11 | 1999-09-21 | Massachusetts Institute Of Technology | Intermediate-grain reconfigurable processing device |
US6094726A (en) * | 1998-02-05 | 2000-07-25 | George S. Sheng | Digital signal processor using a reconfigurable array of macrocells |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP0728337B1 (en) * | 1994-09-13 | 2000-05-03 | Teranex, Inc. | Parallel data processor |
US6839728B2 (en) * | 1998-10-09 | 2005-01-04 | Pts Corporation | Efficient complex multiplication and fast fourier transform (FFT) implementation on the manarray architecture |
US6041400A (en) * | 1998-10-26 | 2000-03-21 | Sony Corporation | Distributed extensible processing architecture for digital signal processing applications |
-
2003
- 2003-05-21 EP EP03725531A patent/EP1514198A2/en not_active Withdrawn
- 2003-05-21 US US10/516,626 patent/US20050235173A1/en not_active Abandoned
- 2003-05-21 CN CN03812744.XA patent/CN1659540A/en active Pending
- 2003-05-21 AU AU2003228062A patent/AU2003228062A1/en not_active Abandoned
- 2003-05-21 JP JP2004510004A patent/JP2005528792A/en not_active Withdrawn
- 2003-05-21 WO PCT/IB2003/002198 patent/WO2003103015A2/en not_active Application Discontinuation
- 2003-05-30 TW TW092114757A patent/TW200405546A/en unknown
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060101237A1 (en) * | 2003-03-17 | 2006-05-11 | Stefan Mohl | Data flow machine |
US20090281784A1 (en) * | 2007-11-01 | 2009-11-12 | Silicon Hive B.V. | Method And Apparatus For Designing A Processor |
US8433553B2 (en) * | 2007-11-01 | 2013-04-30 | Intel Benelux B.V. | Method and apparatus for designing a processor |
US20130227255A1 (en) * | 2012-02-28 | 2013-08-29 | Samsung Electronics Co., Ltd. | Reconfigurable processor, code conversion apparatus thereof, and code conversion method |
Also Published As
Publication number | Publication date |
---|---|
TW200405546A (en) | 2004-04-01 |
WO2003103015A3 (en) | 2004-12-29 |
AU2003228062A1 (en) | 2003-12-19 |
EP1514198A2 (en) | 2005-03-16 |
WO2003103015A2 (en) | 2003-12-11 |
AU2003228062A8 (en) | 2003-12-19 |
JP2005528792A (en) | 2005-09-22 |
CN1659540A (en) | 2005-08-24 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | AS | Assignment | Owner name: KONINKLIJKE PHILIPS ELECTRONICS N.V., NETHERLANDS Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:DE OLIVEIRA KASTRUP PEREIRA, BERNARDO;REEL/FRAME:016724/0954 Effective date: 20031223 |
 | STCB | Information on status: application discontinuation | Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION |