US7809926B2 - Systems and methods for reconfiguring on-chip multiprocessors - Google Patents

Systems and methods for reconfiguring on-chip multiprocessors

Info

Publication number
US7809926B2
US7809926B2 (application US11/556,454)
Authority
US
United States
Prior art keywords
processor
units
unit
instructions
processing
Prior art date
Legal status
Active, expires
Application number
US11/556,454
Other versions
US20080109637A1 (en)
Inventor
Jose F. Martinez
Engin Ipek
Meyrem Kirman
Nevin Kirman
Current Assignee
Cornell Research Foundation Inc
Original Assignee
Cornell Research Foundation Inc
Priority date
Filing date
Publication date
Application filed by Cornell Research Foundation Inc filed Critical Cornell Research Foundation Inc
Priority to US11/556,454
Assigned to CORNELL RESEARCH FOUNDATION, INC. Assignment of assignors interest (see document for details). Assignors: IPEK, ENGIN; KIRMAN, MEYREM; KIRMAN, NEVIN; MARTINEZ, JOSE F.
Publication of US20080109637A1
Application granted
Publication of US7809926B2
Status: Active
Adjusted expiration

Classifications

    • All codes fall under G06F (GPHYSICS; COMPUTING; CALCULATING OR COUNTING; ELECTRIC DIGITAL DATA PROCESSING); the most specific classifications are:
    • G06F9/3802 Instruction prefetching
    • G06F9/30032 Movement instructions, e.g. MOVE, SHIFT, ROTATE, SHUFFLE
    • G06F9/30181 Instruction operation extension or modification
    • G06F9/3828 Bypassing or forwarding of data results with global bypass, e.g. between pipelines, between clusters
    • G06F9/3834 Maintaining memory consistency
    • G06F9/3836 Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
    • G06F9/384 Register renaming
    • G06F9/3851 Instruction issuing from multiple instruction streams, e.g. multistreaming
    • G06F9/3858 Result writeback, i.e. updating the architectural state or memory
    • G06F9/3885 Concurrent instruction execution using a plurality of independent parallel functional units
    • G06F9/3897 Plural parallel functional units controlled in tandem, for complex operations, with adaptable data path
    • G06F15/7867 Architectures of general purpose stored program computers comprising a single central processing unit with reconfigurable architecture

Definitions

  • Improving the performance of computer or other processing systems generally improves overall throughput and/or provides a better user experience.
  • One technique of improving the overall quantity of instructions processed in a system is to increase the number of processors in the system.
  • Implementing multiprocessing (MP) systems typically requires more than merely interconnecting processors in parallel. For example, tasks or programs may need to be divided so they can execute across parallel processing resources, memory consistency systems may be needed, etc.
  • Chip multiprocessors hold the prospect of delivering long-term performance scalability while dramatically reducing design complexity compared to monolithic wide-issue processors. Complexity is reduced by designing and verifying a single, relatively simple core, and then replicating it. Performance is scaled by integrating larger numbers of cores on the die and harnessing increasing levels of thread level parallelism (TLP) with each new technology generation.
  • TLP: thread level parallelism
  • ACMPs: asymmetric chip multiprocessors
  • In ACMPs, the number and the complexity of cores are fixed at design time. The hope is to match the demands of a variety of sequential and parallel workloads by executing them on an appropriate subset of these cores.
  • Recently, the impact of performance asymmetry on explicitly parallelized applications has been studied, finding that asymmetry hurts parallel application scalability and renders the applications' performance less predictable unless relatively sophisticated software changes are introduced.
  • While ACMPs may deliver increased performance on sequential codes, they may do so at the expense of parallel performance, requiring a high level of software sophistication to maximize their potential.
  • the reconfigurable multiprocessor system of these teachings comprises a number of processor units, and one or more cross unit connection components operatively connecting at least two processor units and capable of reconfigurably linking one processor unit to another processor unit, thereby enabling collective fetching and providing instructions to be executed collectively by the processor units and collectively committing instructions executed by the processor units.
  • Embodiments of the method of using the reconfigurable multiprocessor system of these teachings are also disclosed.
  • FIG. 1 a is a schematic block diagram of an embodiment of the system of this invention.
  • FIG. 1 b is a graphical schematic block diagram of a conventional unit of a chip multi-processor
  • FIG. 2 is a graphical schematic diagram of a component of an embodiment of the system of this invention.
  • FIG. 3 represents a graphical schematic diagram of a knowledge component of an embodiment of the system of this invention.
  • the reconfigurable multiprocessor system of these teachings comprises a number of processor units, and one or more cross unit connection components operatively connecting at least two processor units and capable of reconfigurably linking one processor unit to another processor unit, thereby enabling collective fetching and providing instructions to be executed collectively by the processor units and collectively committing instructions executed by the processor units.
  • the method of these teachings for, in a number of processing units, making adjustments for manner of processing, parallel or sequential includes reconfiguring cross connection between processing units upon change of manner of processing, updating instruction characteristics upon change of manner of processing and reconfiguring instruction cache memories in each processing unit upon change of manner of processing.
  • the method of these teachings for enabling executing code collectively at a number of processing units includes reconfigurably linking one processor unit to another processor unit and fetching and providing instructions to be executed collectively by the processor units.
  • FIG. 1 a shows a graphical schematic block diagram of the exemplary embodiment.
  • the chip consists of eight two-issue, out-of-order cores 15 .
  • a system bus 20 connects the L1 data caches 25 and enables snoop-based cache coherence.
  • the chip contains a shared L2 cache 30 and an integrated memory controller (not shown). The figure does not represent an actual floorplan, but rather a conceptual one.
  • For reference and to aid with the disclosure of the exemplary embodiment herein below, a graphical schematic block diagram of a conventional single core is shown in FIG. 1 b.
  • the exemplary embodiment delivers substantially high performance on sequential codes by empowering basic CMP cores with the ability to collaboratively exploit high levels of instruction level parallelism (ILP).
  • ILP instruction level parallelism
  • Fetch and rename crossbars coordinate the operation of the distributed front-end, while operand and commit crossbars are responsible for data communication and distributed commit, respectively. Because these wires link nearby cores together, they require a few clock cycles to transmit data between any two cores.
  • the connections of the crossbars to the cores can be rendered reconfigurable by conventional techniques.
  • the reconfigurable crossbars are implemented as a packet based dynamically routed connection as described in K. Mai, T. Paaske, N. Jayasena, R. Ho, W J. Dally, and M. Horowitz. Smart Memories: a modular reconfigurable architecture. Intl. Symp. On Computer Architecture, Vancouver, Canada, June 2000, pages 161-171, which is incorporated by reference herein. (It should be noted that these teachings are not limited to only the previously mentioned implementation of the reconfigurable crossbar connection. For other techniques see, for example, Katherine Leigh Compton, Architecture Generation of Customized Reconfigurable Hardware, Ph. D. Thesis, NORTHWESTERN UNIVERSITY, December 2003, which is incorporated by reference herein.)
  • The architecture of the exemplary embodiment, in one instance, can fuse groups of two or four cores, making it possible to provide the equivalent of eight two-issue, four four-issue, or two eight-issue processor configurations.
  • Asymmetric fusion is also possible, e.g., one eight-issue fused CPU and four more two-issue cores. This flexibility allows the system of these teachings to accommodate workloads of a widely diverse nature, including workloads with multiple parallel or sequential applications.
  • a RISC ISA where every instruction can be encoded in one word, is utilized in the exemplary embodiment.
  • predecoding/translation support is utilized.
  • each core in the exemplary embodiment is naturally equipped with its own program counter (PC) ( 50 , FIG. 1 a ), instruction cache ( 55 , FIG. 1 a ), branch predictor ( 60 , FIG. 1 a ), branch target buffer (BTB) ( 65 , FIG. 1 a ), and return address stack (RAS) ( 70 , FIG. 1 a ).
  • PC program counter
  • A small control unit called the fetch management unit (FMU) ( 75 , FIG. 1 a ) is attached to the fetch crossbar.
  • the crossbar latency in this exemplary embodiment, is two cycles. When cores are fused, the FMU coordinates the distributed operation of all core fetch units.
  • Cores collectively fetch an eight-instruction block in one cycle by each fetching a two-instruction portion (their default fetch capacity) from their own instruction cache.
  • Fetch is generally eight-instruction aligned, with core zero being responsible for the oldest two instructions in the fetch group, core one for the next two, and so forth.
  • fetch still starts aligned at the appropriate core (lower-order cores skip fetch in that cycle), and it is truncated accordingly so that fully aligned fetch can resume on the next cycle.
  • Cache blocks as delivered by the L2 cache 30 on an instruction (i)-cache miss, are eight words regardless of the configuration. On an i-cache miss, a full block is requested. This block is delivered to the requesting core if it is operating independently or distributed across all four cores in a fused configuration to permit collective fetch.
  • i-caches 55 in this embodiment, are reconfigurable. In one instance, the i-caches 55 are rendered reconfigurable as in K. Mai, T. Paaske, N. Jayasena, R. Ho, W J. Dally, and M. Horowitz. Smart Memories: a modular reconfigurable architecture. Intl. Symp. On Computer Architecture, Vancouver, Canada, June 2000, pages 161-171.
  • Each i-cache 55 in this exemplary embodiment, has enough tags to organize its data in two- or eight-word blocks, and each tag has enough bits to handle the worst of the two cases.
  • When running independently, three out of every four tags are unused, and the i-cache handles block transfers in eight-word blocks.
  • When in fused configuration, the i-cache uses all tags, covering two-word blocks.
  • The i-translation lookaside buffers (i-TLBs), in this exemplary embodiment, are replicated across all cores in a fused configuration. It should be noted that this would be accomplished "naturally" as cores miss on their i-TLBs; however, taking multiple i-TLB misses for a single eight-instruction block is unnecessary, since the FMU can be used to refill all i-TLBs upon a first i-TLB miss by a core. Finally, the FMU can also be used to gang-invalidate an i-TLB entry, or gang-flush all i-TLBs as needed.
  • each core accesses its own branch predictor ( 60 , FIG. 1 a ) and BTB ( 65 , FIG. 1 a ). Because collective fetch is always aligned, dynamic instances of the same static branch instruction are guaranteed to access the same branch predictor and BTB. Consequently, the effective branch predictor and BTB capacity is four times as large. This is a desirable feature since the penalty of branch misprediction is bound to be higher with the more aggressive fetch/issue width and the higher number of in-flight instructions in the fused configuration.
  • Each core can handle up to one branch prediction per cycle.
  • the redirection of the (distributed) PC upon taken branches and branch mispredictions is enabled by the FMU.
  • Each cycle every core that predicts a taken branch, as well as every core that detects a branch misprediction, sends the new target PC to the FMU.
  • the FMU selects the correct PC by giving priority to the oldest misprediction-redirect PC first, and the youngest branch-prediction PC last, and sends the selected PC to all fetch units. Once the transfer of the new PC is complete, cores use it to fetch from their own i-cache as disclosed hereinabove.
  • misspeculated instructions are squashed in all cores. This is also the case for instructions fetched along the not-taken path on a taken branch, since, in this exemplary embodiment, the target PC will arrive with a delay of two cycles.
  • the FMU can also provide the ability to keep global history across all four cores if needed for accurate branch prediction.
  • The Global History register (GHR) can be simply replicated across all cores, and updates can be coordinated through the FMU.
  • GHR Global History register
  • Each core communicates its prediction—whether taken or not taken—to the FMU. Two bits, in this exemplary embodiment, suffice to accomplish this.
  • The FMU receives nonspeculative updates from every back-end upon branch mispredictions. The FMU communicates such events to each core, which in turn update their GHR. Upon nonspeculative updates, earlier (checkpointed) GHR contents are recovered on each core.
  • the fix-up mechanism employed to checkpoint and recover GHR contents can, in one embodiment, be along the lines of the outstanding branch queue (OBQ) mechanism in the Alpha 21264 microprocessor, as described in R. E. Kessler, The Alpha 21264 microprocessor, IEEE Micro, March 1999, 9(2):24-36, which is incorporated by reference herein.
  • OBQ outstanding branch queue
  • core zero pushes the return address into its RAS ( 70 , FIG. 1 a ).
  • RAS: return address stack
  • Core zero pops its RAS and communicates the return address back through the FMU. Notice that, since all RAS operations are processed by core zero, the effective RAS size does not increase when cores are fused.
  • the other cores may over-fetch in the shadow of the stall handling by the FMU if (a) on an i-cache or i-TLB miss, one of the other cores does hit in its i-cache or i-TLB (very unlikely in practice, given how fused cores fetch), or (b) generally in the case of contention for branch predictor ports by two back-to-back branches fetched by the same core (itself exceedingly unlikely).
  • Once all cores have been informed, the cores discard any over-fetched instruction (similarly to the handling of a taken branch) and resume fetching in sync from the right PC, as if all fetch engines had synchronized through a "fetch barrier."
  • After fetch, each core pre-decodes its instructions independently. Subsequently, all instructions in the fetch group need to be renamed. Steering consumers (of instructions) to the same core as their producers can improve performance by eliminating communication delays. Renaming and steering of instructions is achieved through a small control unit called the steering management unit (SMU) ( 80 , FIG. 1 a ).
  • the SMU consists of a global steering table to track the mapping of architectural registers to any core, four free-lists, in one instance of the exemplary embodiment, for register allocation (one for each core), four rename maps, in one instance of the exemplary embodiment, and steering/renaming logic (See FIG. 2 ).
  • the steering table and the four rename maps together allow up to four valid mappings of each architectural register, and enable operands to be replicated across multiple cores. Cores still retain their individual renaming structures, but these are bypassed when cores are fused.
  • FIG. 2 depicts the high level organization of the rename pipeline.
  • each core sends up to two instructions to the SMU through a set of links.
  • the SMU receives up to two instructions and six architectural register specifiers (three per instruction) from each core.
  • the SMU uses a second set of links to dispatch up to six physical register specifiers, two instructions and two copy operations to each core.
  • the SMU uses the incoming architectural register specifiers and the four free lists, in one instance of the exemplary embodiment, to rename up to eight instructions every pipeline cycle.
  • Each instruction is dispatched to one of the cores via dependence-based steering (for dependence-based steering, see, for example, S. Palacharla, N. P. Jouppi, and J. E. Smith, Complexity-effective superscalar processors, Intl. Symp. on Computer Architecture, pages 206-218, Denver, Colo., June 1997, which is incorporated by reference herein).
  • the SMU consults the steering table to steer every instruction to the core that will produce most of its operands among all cores with free rename crossbar ports. Copy instructions are also inserted into the fetch group in this cycle. In the next cycle, instructions (and the generated copies) are renamed by accessing the appropriate rename map and free list.
  • Since each core receives no more than two instructions and two copy instructions, each rename map has only six read and eight write ports.
  • the steering table requires eight read and sixteen write ports; note that each steering table entry contains only a single bit, and thus the overhead of multi-porting this small table is relatively low.
  • regular and copy instructions are dispatched to the appropriate cores. If a copy instruction cannot be sent due to bandwidth restrictions, renaming stops at the offending instruction that cycle, and starts with the same instruction next cycle, thereby draining crossbar links and guaranteeing forward progress. Similarly, if resource occupancies prevent the SMU from dispatching an instruction, renaming stops that cycle, and resumes when resources are available. To facilitate this, cores inform the SMU through a four-bit interface, in one instance of the exemplary embodiment, when their issue queues, reorder buffers (ROBs), and load/store queues are full.
  • ROBs reorder buffers
  • Registers are recycled through two mechanisms. As in many existing microprocessors, at commit time, any instruction that renames an architectural register releases the physical register holding the result of the previous instruction that renamed the same register. This is accomplished in core fusion over a portion of the rename crossbar, by having each ROB send the specifiers for these registers to the SMU.
  • copy instructions do not allocate ROB entries, and recycling them requires an alternative strategy. Every copy instruction generates a replica of a physical register in one core on some other core. These replicas are not recovered on branch mispredictions. Therefore, a register holding a redundant replica can be recycled at any point in time as long as all instructions whose architectural source registers are mapped to that physical register have read its value.
  • the SMU keeps a one-bit flag, in one instance, for each register indicating whether the corresponding register is currently holding a redundant replica (i.e., it was the target of a copy instruction).
  • The SMU keeps a table of per-register read counters for each core, where every counter entry corresponds to the number of outstanding reads to a specific physical register (in one instance, each counter is four bits in a sixteen-entry issue queue). These counters are incremented at the time the SMU dispatches instructions to cores. Every time an instruction leaves a core's issue queue, the core communicates the specifiers for the physical registers read by the instruction, and the SMU decrements the corresponding counters. When a branch misprediction or a replay trap is encountered, as squashed instructions are removed from the instruction window, the counters for the corresponding physical registers are updated appropriately in the shadow of the refetch. (A bookkeeping sketch of this replica recycling appears after this list.)
  • Each core's back-end includes separate floating-point and integer issue queues, a copy-out queue ( 95 , FIG. 1 a ), a copy-in queue ( 85 , FIG. 1 a ), a physical register file, functional units, load/store queues and a ROB ( 90 , FIG. 1 a ).
  • Each core's load/store queue has access only to its private L1 data cache ( 25 , FIG. 1 a ).
  • the L1 data caches are connected via a split-transaction bus and are kept coherent, in one instance, via a MESI protocol.
  • This split-transaction bus is also connected to an on-chip L2 cache that is shared by all cores.
  • the operation of the back-end is no different from the operation of a core in a homogeneous CMP.
  • back-end structures are coordinated to form a large virtual back-end capable of consuming instructions at a rate of, in one instance of the exemplary embodiment, eight instructions per cycle.
  • Copy instructions wait in the copy-out queues for their operands to become available, and once issued, they transfer their source operand and destination physical register specifier to a remote core over the operand crossbar ( FIG. 1 a ).
  • The operand crossbar is capable of supporting, every cycle, two copy instructions per core. In addition to copy instructions, the operand crossbar is used by loads to deliver values to their destination registers.
  • When copy instructions reach the consumer core, they are placed in a copy-in queue to wait for the selection logic to schedule them. Each cycle, the issue queue scheduler considers the two copy instructions at the queue head for scheduling along with the instructions in the issue queue. Once issued, copies wake up their dependent instructions and update the physical register file, just as regular instructions would do.
  • the goal of the fused in-order retirement operation is to coordinate the operation of four ROBs to commit up to eight instructions per cycle. Instructions allocate ROB entries locally at the end of fetch. If the fetch group contains less than eight instructions, NOPs (No Operation instructions) are allocated at the appropriate cores to guarantee alignment. Of course, on a pipeline bubble, no ROB entries are allocated.
  • each core commits two instructions from the oldest fetch group every cycle.
  • If any core must stall, all other cores must also stop committing in time to ensure that fetch blocks are committed atomically and in order. This is accomplished via the commit crossbar, which transfers stall/resume signals across all ROBs.
  • each ROB is extended with a speculative head pointer in addition to the conventional head and tail pointers. Instructions always pass through the speculative ROB head before they reach the actual ROB head and commit. If not ready to commit at that time, a stall signal is sent to all cores.
  • When the instructions at the speculative head are ready to commit (the commit component "decides" when it is safe to store a result), they move past the speculative head and send a resume signal to the other cores. (A simplified sketch of this coordinated commit appears after this list.)
  • the number of ROB entries between the speculative head pointer and the actual head pointer is enough to cover the crossbar delay. This guarantees that ROB stalls always take effect in a timely manner to prevent committing speculative state.
  • In one instance, a banked-by-address LSQ implementation is utilized. This allows keeping data coherent without requiring cache flushes after dynamic reconfiguration, and elegantly supports store forwarding and speculative loads. (A sketch of the address-based bank selection appears after this list.)
  • the core that issues each load/store to the memory system is determined based on effective addresses.
  • the two bits, in one instance of the exemplary embodiment, that follow the block offset are used as the LSQ bank-ID to select one of the four cores (see FIG. 3 ), and enough index bits to cover the L1 cache are allocated from the remaining bits.
  • the rest of the effective address and the bank-ID are stored as a tag (note that this does not increase the number of tag bits compared to a conventional indexing scheme). Making the bank-ID bits part of the tag properly disambiguates cache lines regardless of the configuration.
  • Effective addresses for loads and stores are generally not known at the time they are renamed. At rename time memory operations need to allocate LSQ entries from the core that will eventually issue them to the memory system.
  • bank prediction is utilized (see, for example, M. Bekerman, S. Jourdan, R. Ronen, G. Kirshenboim, L. Rappoport, A. Yoaz, and U. Weiser, Correlated load - address predictors , Intl Symp. on Computer Architecture, Atlanta, Ga., May 1999, pages 54-63, which is incorporated by reference herein).
  • each core accesses its bank predictor by using the lower bits of the load/store PC.
  • Bank predictions are sent to the SMU through the rename crossbar, and the SMU steers each load and store to the predicted core.
  • Each core allocates load queue entries for the loads it receives.
  • the SMU also signals all cores to allocate dummy store queue entries regardless of the bank prediction. Dummy store queue entries guarantee in-order commit for store instructions by reserving place-holders across all banks for store bank mispredictions.
  • remote cores with superfluous store queue dummies are signaled to discard their entries (recycling these entries requires a collapsing LSQ implementation). If a bank misprediction is detected, the store is sent to the correct queue.
  • In the case of loads, if a bank misprediction is detected, the load queue entry is recycled (LSQ collapse) and the load is sent to the correct core. There, it allocates a load queue entry and resolves its memory dependences locally. Note that, as a consequence of bank mispredictions, loads can allocate entries in the load queues out of program order. Fortunately, this is not a problem for store-to-load forwarding because load queue entries are typically tagged by instruction age to facilitate forwarding. However, there is a danger of deadlock in cases where the mispredicted load is older than all other loads in its (correct) bank and the load queue is full at the time the load arrives at the consumer core. To prevent this situation, loads search the load queue for older instructions when they cannot allocate entries.
  • the memory consistency model should be enforced on all loads and stores.
  • relaxed consistency models are utilized where special primitives like memory fences (weak consistency) or acquire/release operations (release consistency) enforce ordering constraints on ordinary memory operations.
  • memory fences: the ordering primitives of weak consistency
  • acquire/release operations: the ordering primitives of release consistency
  • fences: memory fence instructions
  • these operations are dispatched to all the queues, but only the copy in the correct queue performs the actual synchronization operation.
  • the fence is considered complete once each one of the local fences completes locally and all memory operations preceding each fence commit. Local fence completion is signaled to all cores through a one-bit interface in the portion of the operand crossbar that connects the load-store queues.
  • The embodiments presented hereinabove disclose the operation of the cores in a static fashion. While the embodiments disclosed hereinabove are capable of improving performance on highly parallel or purely sequential applications by configuring the CMP prior to executing the application, partially parallelized applications will benefit most from the ability to fuse/split cores at run time, as they dynamically switch between sequential and parallel code regions, respectively.
  • Embodiments of the method of these teachings include several actions to be taken to ensure correctness upon core fusion and fission, and embodiments of the system of these teachings include an application interface to coordinate and trigger reconfiguration. (For an example of an application interface, these teachings not being limited by this example, using a higher-level language, see Laufer, R.; Taylor, R.
  • run-time reconfiguration is enabled through a simple application interface.
  • The application requests core fusion/fission actions through system calls. (A usage sketch of such an interface appears after this list.)
  • the requests can be readily encapsulated in conventional parallelizing macros or directives.
  • the application may, in one instance, request cores to be fused to execute the upcoming sequential region.
  • Cores need not get fused on every parallel-to-sequential region boundary: if the sequential region is not long enough to amortize the cost of fusion, execution can continue without reconfiguration on one of the small cores. All in-flight instructions following the system call are flushed, and the appropriate rename map on the SMU is updated with mappings to the architectural state on this core.
  • Data caches do not need any special actions to be taken upon reconfigurations; the coherence protocol naturally ensures correctness across configuration changes. I-caches undergo tag reconfiguration upon fusion and fission as described hereinabove, and all cache blocks are invalidated for consistency.
  • the programmer or the run-time system may choose not to reconfigure across fast-alternating program regions (e.g., short serial sections in between parallel sections).
  • the shared L2 cache is not affected by reconfiguration.
  • fission is achieved through a second system call, where the application informs the processor about the approaching parallel region. Fetch is stalled, in-flight instructions are allowed to drain, and enough copy instructions are generated to gather the architectural state into core zero's physical register file. When the transfer is complete, control is returned to the application.
  • Embodiments in which one cross unit connection component incorporates the functionality of several of the crossbars in the exemplary embodiment are also within the scope of these teachings.
  • The crossbars in the exemplary embodiment, and one or more cross unit connection components incorporating the functionality of those crossbars, comprise means for enabling executing sequential code collectively at processor units and for enabling changing the architectural configuration of the processing units.
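
The following C++ sketch is a minimal model of the replica-recycling bookkeeping described in the bullet on per-register read counters above: the SMU flags physical registers that hold only redundant replicas, counts outstanding reads per core, and reports a replica as recyclable once its flag is set and its read count reaches zero. Register-file and counter sizes, and all identifiers, are assumptions for illustration only, not taken from the patent.

```cpp
#include <array>
#include <cstdint>
#include <vector>

// Sketch of SMU-side replica-recycling bookkeeping.  A physical register that
// was the target of a copy instruction holds a redundant replica; it can be
// freed on its core as soon as every instruction steered there that reads it
// has left the issue queue (or been squashed).  Sizes and names are assumed.

constexpr unsigned kCores = 4;
constexpr unsigned kPhysRegs = 128;

struct ReplicaTracker {
    std::array<bool, kPhysRegs> isReplica{};                      // copy-target flag
    std::array<std::array<uint8_t, kPhysRegs>, kCores> reads{};   // outstanding reads

    // The SMU dispatched an instruction to `core` that reads `srcs`:
    // increment the per-core read counters of its source physical registers.
    void onDispatch(unsigned core, const std::vector<uint16_t>& srcs) {
        for (uint16_t r : srcs) ++reads[core][r];
    }

    // A copy instruction created a replica in this physical register.
    void markReplica(uint16_t physReg) { isReplica[physReg] = true; }

    // The core reports that an issued (or squashed) instruction has read, or
    // will never read, `srcs`: decrement the counters and return any replicas
    // that just became dead and may be recycled on that core.
    std::vector<uint16_t> onReadsRetired(unsigned core, const std::vector<uint16_t>& srcs) {
        std::vector<uint16_t> freed;
        for (uint16_t r : srcs) {
            if (reads[core][r] > 0) --reads[core][r];
            if (isReplica[r] && reads[core][r] == 0) freed.push_back(r);
        }
        return freed;
    }
};
```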
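
The coordinated commit described in the bullets on the speculative ROB head can be abstracted as in the sketch below: a fetch group retires only when every core reports its two entries ready, otherwise a stall keeps all four ROBs from committing it, preserving atomic, in-order commit of eight-instruction groups. The sketch is timing-free (the speculative-head lead that hides the crossbar delay is only noted in comments) and its names and sizes are assumptions.

```cpp
#include <array>
#include <deque>

// Sketch of fused in-order commit across four ROBs.  Each core contributes a
// two-entry slice per eight-instruction fetch group (NOPs fill short groups).
// In the real design a speculative head runs ahead of the actual head by
// enough entries to cover the commit-crossbar delay; here only the essential
// rule is kept: a group commits only when every slice is ready.

constexpr unsigned kCores = 4;

struct RobSlice {                     // one core's two entries for one group
    std::array<bool, 2> ready{};      // results produced, safe to commit
    bool groupReady() const { return ready[0] && ready[1]; }
};

struct FusedRob {
    // Oldest fetch group at the front; one slice per core per group.
    std::deque<std::array<RobSlice, kCores>> groups;

    // One commit step: the oldest group retires only if all four slices are
    // ready; otherwise a stall signal is (conceptually) broadcast over the
    // commit crossbar and nothing commits, keeping commit atomic and in order.
    bool commitOldestGroup() {
        if (groups.empty()) return false;
        for (const RobSlice& slice : groups.front())
            if (!slice.groupReady()) return false;   // stall all cores
        groups.pop_front();                          // all cores commit together
        return true;
    }
};
```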
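
The address-based bank selection described in the banked-LSQ bullets above maps directly to a few bit-field operations, sketched below under an assumed 32-byte block size and 128-set L1 data cache: the two bits above the block offset pick the owning core, the next bits index that core's L1, and the remaining bits together with the bank-ID form the tag. The specific field widths are assumptions chosen only for the example.

```cpp
#include <cstdint>
#include <cstdio>

// Sketch of banked-by-address LSQ/L1 routing.  The two bits just above the
// cache-block offset select one of the four fused cores (the LSQ bank); the
// next bits index that core's L1 data cache; the rest of the address plus the
// bank-ID are kept as the tag, so lines are disambiguated identically whether
// the cores run fused or independent.  All sizes below are assumptions.

constexpr unsigned kBlockOffsetBits = 5;   // assumed 32 B L1 blocks
constexpr unsigned kBankBits = 2;          // selects one of four cores
constexpr unsigned kIndexBits = 7;         // assumed 128-set L1 data cache

struct MemAccessRoute {
    unsigned bank;      // which fused core's LSQ/L1 handles this access
    unsigned index;     // L1 set index within that core
    uint64_t tag;       // remaining address bits plus the bank-ID
};

MemAccessRoute route(uint64_t effectiveAddr) {
    unsigned bank  = (effectiveAddr >> kBlockOffsetBits) & ((1u << kBankBits) - 1);
    unsigned index = (effectiveAddr >> (kBlockOffsetBits + kBankBits))
                     & ((1u << kIndexBits) - 1);
    uint64_t rest  = effectiveAddr >> (kBlockOffsetBits + kBankBits + kIndexBits);
    return {bank, index, (rest << kBankBits) | bank};   // bank-ID kept in the tag
}

int main() {
    MemAccessRoute r = route(0x7ffe1234);
    std::printf("bank=%u index=%u tag=0x%llx\n", r.bank, r.index,
                (unsigned long long)r.tag);
    return 0;
}
```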
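
Finally, the run-time fusion/fission interface described in the bullets above would be driven by the application roughly as sketched below. The system-call names, the fusion-cost constant, and the amortization test are hypothetical; the text above only states that fusion and fission are requested through system calls that can be hidden inside parallelizing macros or directives, and that short sequential regions need not trigger fusion.

```cpp
#include <cstdint>

// Stand-in stubs for hypothetical fuse/split system calls; a real
// implementation would trap into a runtime that flushes or drains in-flight
// instructions, updates the SMU rename map, and reconfigures the i-cache tags
// as described above.
static long sys_core_fuse(unsigned /*groupSize*/) { return 0; }
static long sys_core_split()                      { return 0; }

// Rough amortization test: only fuse if the upcoming sequential region looks
// long enough to pay for reconfiguration (both numbers are assumptions).
constexpr uint64_t kFusionCostCycles = 400;
inline bool worthFusing(uint64_t expectedSequentialCycles) {
    return expectedSequentialCycles > 10 * kFusionCostCycles;
}

// Typical usage around a sequential region sandwiched between parallel work;
// in practice these calls would be hidden inside parallelizing macros or
// directives rather than written by hand.
void runRegion(uint64_t expectedSequentialCycles,
               void (*sequentialPart)(), void (*parallelPart)()) {
    bool fused = false;
    if (worthFusing(expectedSequentialCycles)) {
        sys_core_fuse(4);            // gather architectural state, fuse four cores
        fused = true;
    }
    sequentialPart();                // runs on the fused (or a single small) core
    if (fused) sys_core_split();     // announce the approaching parallel region
    parallelPart();                  // runs across the independent cores
}

static void sequentialRegion() { /* e.g. a long reduction */ }
static void parallelRegion()   { /* e.g. a data-parallel loop */ }

int main() {
    runRegion(/*expectedSequentialCycles=*/1000000, sequentialRegion, parallelRegion);
    return 0;
}
```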

Abstract

A reconfigurable multiprocessor system including a number of processing units and components enabling executing sequential code collectively at processing units and enabling changing the architectural configuration of the processing units.

Description

BACKGROUND
Improving the performance of computer or other processing systems generally improves overall throughput and/or provides a better user experience. One technique of improving the overall quantity of instructions processed in a system is to increase the number of processors in the system. Implementing multiprocessing (MP) systems, however, typically requires more than merely interconnecting processors in parallel. For example, tasks or programs may need to be divided so they can execute across parallel processing resources, memory consistency systems may be needed, etc.
As logic elements continue to shrink due to advances in fabrication technology, integrating multiple processors into a single component becomes more practical, and in fact a number of current designs implement multiple processors on a single component or chip.
Chip multiprocessors (CMPs) hold the prospect of delivering long-term performance scalability while dramatically reducing design complexity compared to monolithic wide-issue processors. Complexity is reduced by designing and verifying a single, relatively simple core, and then replicating it. Performance is scaled by integrating larger numbers of cores on the die and harnessing increasing levels of thread level parallelism (TLP) with each new technology generation.
Unfortunately, high-performance parallel programming constitutes a tedious, time-consuming, and error-prone effort.
In that respect, the complexity shift from hardware to software in ordinary CMPs is one of the most serious hurdles to their success. In the short term, on-chip integration of a modest number of relatively powerful (and relatively complex) cores may yield high utilization when running multiple sequential workloads, temporarily avoiding the complexity of parallelization. However, although sequential codes are likely to remain important, they alone are not sufficient to sustain long-term performance scalability. Consequently, harnessing the full potential of CMPs in the long term makes the adoption of parallel programming very attractive.
To amortize the cost of parallelization, many programmers choose to parallelize their applications incrementally. Typically, the most promising loops/regions in a sequential execution of the program are identified through profiling. A subset of these regions are then parallelized, and the rest of the application is left as “future work.” Over time, more effort is spent on portions of the remaining code. We call these evolving workloads. As a result of this “pay-as-you-go” approach, the complexity (and cost) associated with software parallelization is amortized over a greater time span. In fact, some of the most common shared-memory programming models in use today (for example, OpenMP) are designed to facilitate the incremental parallelization of sequential codes. We envision a diverse landscape of software in different stages of parallelization, from purely sequential, to fully parallel, to everything in between. As a result, it will remain important to efficiently support sequential as well as parallel code, whether standalone or as regions within the same application at run time. This requires a level of flexibility that is hard to attain in ordinary CMPs.
Asymmetric chip multiprocessors (ACMPs) attempt to address this by providing cores with varying degrees of sophistication and computational capabilities. The number and the complexity of cores are fixed at design time. The hope is to match the demands of a variety of sequential and parallel workloads by executing them on an appropriate subset of these cores. Recently, the impact of performance asymmetry on explicitly parallelized applications has been studied, finding that asymmetry hurts parallel application scalability and renders the applications' performance less predictable unless relatively sophisticated software changes are introduced. Hence, while ACMPs may deliver increased performance on sequential codes, they may do so at the expense of parallel performance, requiring a high level of software sophistication to maximize their potential.
Instead of trying to find the right design trade-off between complex and simple cores (as ACMPs do), there is a need for a CMP that provides the flexibility to dynamically synthesize the right mix of simple and complex cores based on application requirements.
BRIEF SUMMARY
In one embodiment, the reconfigurable multiprocessor system of these teachings comprises a number of processor units, and one or more cross unit connection components operatively connecting at least two processor units and capable of reconfigurably linking one processor unit to another processor unit, thereby enabling collective fetching and providing instructions to be executed collectively by the processor units and collectively committing instructions executed by the processor units.
Other embodiments including different cross unit connection components are also disclosed.
Embodiments of the method of using the reconfigurable multiprocessor system of these teachings are also disclosed.
For a better understanding of the present teachings, together with other and further needs thereof, reference is made to the accompanying drawings and detailed description and its scope will be pointed out in the appended claims.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 a is a schematic block diagram of an embodiment of the system of this invention;
FIG. 1 b is a graphical schematic block diagram of a conventional unit of a chip multi-processor;
FIG. 2 is a graphical schematic diagram of a component of an embodiment of the system of this invention; and
FIG. 3 represents a graphical schematic diagram of a knowledge component of an embodiment of the system of this invention.
DETAILED DESCRIPTION
In one embodiment, the reconfigurable multiprocessor system of these teachings comprises a number of processor units, and one or more cross unit connection components operatively connecting at least two processor units and capable of reconfigurably linking one processor unit to another processor unit, thereby enabling collective fetching and providing instructions to be executed collectively by the processor units and collectively committing instructions executed by the processor units.
In another embodiment, the method of these teachings for, in a number of processing units, making adjustments for manner of processing, parallel or sequential, includes reconfiguring cross connection between processing units upon change of manner of processing, updating instruction characteristics upon change of manner of processing and reconfiguring instruction cache memories in each processing unit upon change of manner of processing.
In a further embodiment, the method of these teachings for enabling executing code collectively at a number of processing units includes reconfigurably linking one processor unit to another processor unit and fetching and providing instructions to be executed collectively by the processor units.
An exemplary embodiment is presented hereinbelow in order to better illustrate the systems and methods of these teachings. It should be noted, however, that the systems and methods of these teachings are not limited to the exemplary embodiment presented hereinbelow.
The exemplary embodiment presented hereinbelow starts from a CMP substrate with a homogeneous set of small cores. The embodiment maximizes the core count to exploit high levels of thread level parallelism (TLP), and has the modularity advantages of fine-grain CMPs. FIG. 1 a shows a graphical schematic block diagram of the exemplary embodiment. In one instance, the chip consists of eight two-issue, out-of-order cores 15. A system bus 20 connects the L1 data caches 25 and enables snoop-based cache coherence. Beyond the system bus, the chip contains a shared L2 cache 30 and an integrated memory controller (not shown). The figure does not represent an actual floorplan, but rather a conceptual one. For reference and to aid with the disclosure of the exemplary embodiment herein below, a graphical schematic block diagram of a conventional single core is shown in FIG. 1 b.
The exemplary embodiment delivers substantially high performance on sequential codes by empowering basic CMP cores with the ability to collaboratively exploit high levels of instruction level parallelism (ILP). This is made possible primarily by employing fetch 35, rename 40, operand 45, and commit 50 cross-core wiring (FIG. 1 a), which are hereinafter referred to as crossbars. Fetch and rename crossbars coordinate the operation of the distributed front-end, while operand and commit crossbars are responsible for data communication and distributed commit, respectively. Because these wires link nearby cores together, they require a few clock cycles to transmit data between any two cores.
The connections of the crossbars to the cores can be rendered reconfigurable by conventional techniques. In one instance, the reconfigurable crossbars are implemented as a packet based dynamically routed connection as described in K. Mai, T. Paaske, N. Jayasena, R. Ho, W J. Dally, and M. Horowitz. Smart Memories: a modular reconfigurable architecture. Intl. Symp. On Computer Architecture, Vancouver, Canada, June 2000, pages 161-171, which is incorporated by reference herein. (It should be noted that these teachings are not limited to only the previously mentioned implementation of the reconfigurable crossbar connection. For other techniques see, for example, Katherine Leigh Compton, Architecture Generation of Customized Reconfigurable Hardware, Ph. D. Thesis, NORTHWESTERN UNIVERSITY, December 2003, which is incorporated by reference herein.)
The architecture of the exemplary embodiment, in one instance, can fuse groups of two or four cores, making it possible to provide the equivalent of eight two-issue, four four-issue, or two eight-issue processor configurations. Asymmetric fusion is also possible, e.g., one eight-issue fused CPU and four more two-issue cores. This flexibility allows the system of these teachings to accommodate workloads of a widely diverse nature, including workloads with multiple parallel or sequential applications.
In the instance disclosed below, the fusion mechanism involving four basic cores is described. A RISC ISA, where every instruction can be encoded in one word, is utilized in the exemplary embodiment. In embodiments where CISC-like ISAs are used, predecoding/translation support is utilized.
Due to their fundamentally independent nature, each core in the exemplary embodiment is naturally equipped with its own program counter (PC) (50, FIG. 1 a), instruction cache (55, FIG. 1 a), branch predictor (60, FIG. 1 a), branch target buffer (BTB) (65, FIG. 1 a), and return address stack (RAS) (70, FIG. 1 a). A small control unit called the fetch management unit (FMU) (75, FIG. 1 a) is attached to the fetch crossbar. The crossbar latency, in this exemplary embodiment, is two cycles. When cores are fused, the FMU coordinates the distributed operation of all core fetch units.
Cores collectively fetch an eight-instruction block in one cycle by each fetching a two-instruction portion (their default fetch capacity) from their own instruction cache. Fetch is generally eight-instruction aligned, with core zero being responsible for the oldest two instructions in the fetch group, core one for the next two, and so forth. When a branch target is not fully aligned in this way, fetch still starts aligned at the appropriate core (lower-order cores skip fetch in that cycle), and it is truncated accordingly so that fully aligned fetch can resume on the next cycle.
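The alignment rule above can be made concrete with a short C++ sketch. It models, under assumed constants, which word addresses each of four fused two-issue cores fetches in a given cycle, including the truncation that follows an unaligned branch target; the function and constant names are illustrative and are not taken from the patent.

```cpp
#include <cstdint>
#include <iostream>
#include <vector>

// Minimal sketch (not the patented logic): four fused two-issue cores share an
// eight-instruction, eight-word-aligned fetch group.  Addresses are word
// addresses for simplicity; all names and constants are illustrative.

constexpr unsigned kCores = 4;        // four fused two-issue cores
constexpr unsigned kGroupSize = 8;    // eight-instruction fetch group

// Core 0 owns the oldest two slots of every aligned group, core 1 the next
// two, and so forth.
unsigned coreForWord(uint64_t wordAddr) {
    return (wordAddr % kGroupSize) / (kGroupSize / kCores);
}

// For a (possibly unaligned) fetch target, list the word addresses each core
// fetches this cycle.  Lower-order cores skip the cycle when the target lands
// in a higher slot, and the group is truncated at the next eight-instruction
// boundary so that fully aligned fetch resumes on the following cycle.
std::vector<std::vector<uint64_t>> fetchCycle(uint64_t targetWordAddr) {
    std::vector<std::vector<uint64_t>> perCore(kCores);
    uint64_t groupBase = targetWordAddr - (targetWordAddr % kGroupSize);
    for (uint64_t a = targetWordAddr; a < groupBase + kGroupSize; ++a)
        perCore[coreForWord(a)].push_back(a);
    return perCore;
}

int main() {
    // Branch target at word address 0x103: core 0 fetches nothing this cycle,
    // core 1 fetches only 0x103, cores 2 and 3 fetch their full two slots.
    auto plan = fetchCycle(0x103);
    for (unsigned c = 0; c < kCores; ++c) {
        std::cout << "core " << c << ":";
        for (uint64_t a : plan[c]) std::cout << " 0x" << std::hex << a << std::dec;
        std::cout << "\n";
    }
    return 0;
}
```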
Cache blocks, as delivered by the L2 cache 30 on an instruction (i)-cache miss, are eight words regardless of the configuration. On an i-cache miss, a full block is requested. This block is delivered to the requesting core if it is operating independently or distributed across all four cores in a fused configuration to permit collective fetch. To achieve this, i-caches 55, in this embodiment, are reconfigurable. In one instance, the i-caches 55 are rendered reconfigurable as in K. Mai, T. Paaske, N. Jayasena, R. Ho, W J. Dally, and M. Horowitz. Smart Memories: a modular reconfigurable architecture. Intl. Symp. On Computer Architecture, Vancouver, Canada, June 2000, pages 161-171. Each i-cache 55, in this exemplary embodiment, has enough tags to organize its data in two- or eight-word blocks, and each tag has enough bits to handle the worst of the two cases. When running independently, three out of every four tags are unused, and the i-cache handles block transfers in eight-word blocks. When in fused configuration, the i-cache uses all tags, covering two-word blocks.
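As a rough illustration of why the tag array is provisioned for the worst of the two block organizations, the following sketch computes the tag count and tag width for an assumed 16 KB direct-mapped per-core i-cache under the eight-word (independent) and two-word (fused) organizations; the cache geometry and address width are assumptions chosen only for the example.

```cpp
#include <algorithm>
#include <cstdio>

// Sketch of sizing the reconfigurable i-cache tag array.  Assumed geometry
// (not from the patent): 16 KB direct-mapped per-core i-cache, 4-byte
// instructions, 48-bit addresses.  Independent mode stores eight-word (32 B)
// blocks; fused mode stores two-word (8 B) slices, one per distinct
// eight-word block, so it needs four times as many tags.

struct Geometry { unsigned numTags, tagBits; };

Geometry independentMode() {             // 512 blocks of 32 B
    unsigned numTags = 16 * 1024 / 32;   // 512 tags
    unsigned indexBits = 9;              // log2(512)
    return {numTags, 48 - 5 - indexBits};     // 5 block-offset bits
}

Geometry fusedMode() {                   // 2048 slices of 8 B
    unsigned numTags = 16 * 1024 / 8;    // 2048 tags
    unsigned indexBits = 11;             // log2(2048), taken from the 32 B block address
    return {numTags, 48 - 5 - indexBits};
}

int main() {
    Geometry ind = independentMode(), fus = fusedMode();
    std::printf("independent: %u tags, %u tag bits each\n", ind.numTags, ind.tagBits);
    std::printf("fused:       %u tags, %u tag bits each\n", fus.numTags, fus.tagBits);
    // The physical array is sized for the worst case in each dimension; three
    // out of every four tags then sit idle when the core runs independently.
    std::printf("provisioned: %u tags x %u bits\n",
                std::max(ind.numTags, fus.numTags), std::max(ind.tagBits, fus.tagBits));
    return 0;
}
```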
Because fetch is collective, the i-translation lookaside buffers (i-TLBs), in this exemplary embodiment, are replicated across all cores in a fused configuration. It should be noted that this would be accomplished "naturally" as cores miss on their i-TLBs; however, taking multiple i-TLB misses for a single eight-instruction block is unnecessary, since the FMU can be used to refill all i-TLBs upon a first i-TLB miss by a core. Finally, the FMU can also be used to gang-invalidate an i-TLB entry, or gang-flush all i-TLBs as needed.
During collective fetch, each core accesses its own branch predictor (60, FIG. 1 a) and BTB (65, FIG. 1 a). Because collective fetch is always aligned, dynamic instances of the same static branch instruction are guaranteed to access the same branch predictor and BTB. Consequently, the effective branch predictor and BTB capacity is four times as large. This is a desirable feature since the penalty of branch misprediction is bound to be higher with the more aggressive fetch/issue width and the higher number of in-flight instructions in the fused configuration.
Each core can handle up to one branch prediction per cycle. The redirection of the (distributed) PC upon taken branches and branch mispredictions is enabled by the FMU. Each cycle, every core that predicts a taken branch, as well as every core that detects a branch misprediction, sends the new target PC to the FMU. The FMU selects the correct PC by giving priority to the oldest misprediction-redirect PC first, and the youngest branch-prediction PC last, and sends the selected PC to all fetch units. Once the transfer of the new PC is complete, cores use it to fetch from their own i-cache as disclosed hereinabove.
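The FMU's selection rule can be summarized in a few lines of C++: among the redirect requests received in a cycle, a detected misprediction always beats a predicted-taken branch, and among requests of the same kind the oldest wins, so the youngest branch-prediction PC has the lowest priority. This is a minimal sketch with invented type and field names, not the patented logic.

```cpp
#include <algorithm>
#include <cstdint>
#include <optional>
#include <vector>

// Sketch of the FMU's per-cycle redirect arbitration.  Every core may report a
// predicted-taken branch and/or a detected misprediction, each carrying the
// branch's age (program order) and the new target PC.

struct RedirectRequest {
    uint64_t targetPC;
    uint64_t age;            // smaller value = older in program order
    bool     isMispredict;   // back-end misprediction vs. front-end prediction
};

std::optional<uint64_t> selectRedirect(const std::vector<RedirectRequest>& reqs) {
    if (reqs.empty()) return std::nullopt;        // no redirect this cycle
    auto best = std::min_element(reqs.begin(), reqs.end(),
        [](const RedirectRequest& a, const RedirectRequest& b) {
            if (a.isMispredict != b.isMispredict) return a.isMispredict;  // mispredicts first
            return a.age < b.age;                                         // then oldest first
        });
    return best->targetPC;    // broadcast to every core's fetch unit
}

int main() {
    std::vector<RedirectRequest> reqs = {
        {0x400100, /*age=*/7, /*isMispredict=*/false},   // young predicted-taken branch
        {0x400040, /*age=*/2, /*isMispredict=*/true},    // older misprediction: wins
    };
    auto pc = selectRedirect(reqs);
    // pc now holds 0x400040; all fetch units restart from it after the
    // two-cycle crossbar delay, squashing anything fetched down the wrong path.
    return pc ? 0 : 1;
}
```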
On a misprediction, misspeculated instructions are squashed in all cores. This is also the case for instructions fetched along the not-taken path on a taken branch, since, in this exemplary embodiment, the target PC will arrive with a delay of two cycles.
The FMU can also provide the ability to keep global history across all four cores if needed for accurate branch prediction. To accomplish this, the Global History register (GHR) can be simply replicated across all cores, and updates can be coordinated through the FMU. Specifically, in this exemplary embodiment, upon every branch prediction, each core communicates its prediction—whether taken or not taken—to the FMU. Two bits, in this exemplary embodiment, suffice to accomplish this. Additionally, as discussed hereinabove, the FMU receives nonspeculative updates from every back-end upon branch mispredictions. The FMU communicates such events to each core, which in turn update their GHR. Upon nonspeculative updates, earlier (checkpointed) GHR contents are recovered on each core. The fix-up mechanism employed to checkpoint and recover GHR contents can, in one embodiment, be along the lines of the outstanding branch queue (OBQ) mechanism in the Alpha 21264 microprocessor, as described in R. E. Kessler, The Alpha 21264 microprocessor, IEEE Micro, March 1999, 9(2):24-36, which is incorporated by reference herein.
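A minimal sketch of the replicated global-history mechanism is shown below: each core keeps its own copy of the GHR, updates it with the taken/not-taken bits broadcast through the FMU, and checkpoints the history per branch so it can be restored on a misprediction, in the spirit of the OBQ-style fix-up cited above. The history length, class names, and checkpoint structure are assumptions for illustration.

```cpp
#include <bitset>
#include <cstdint>
#include <deque>

// Sketch of a replicated global-history register (GHR) with FMU-coordinated
// updates and OBQ-style checkpointing.  Each core runs one instance; because
// every predicted direction and every misprediction is broadcast through the
// FMU, all instances stay identical.

constexpr unsigned kHistoryBits = 12;

class GlobalHistory {
public:
    // On every predicted branch (after the FMU broadcast): checkpoint the
    // pre-update history, then shift in the predicted direction.
    void speculativeUpdate(uint64_t branchId, bool predictedTaken) {
        checkpoints_.push_back({branchId, ghr_});
        ghr_ = (ghr_ << 1) | std::bitset<kHistoryBits>(predictedTaken ? 1 : 0);
    }

    // When the FMU reports a misprediction for branchId: discard younger
    // checkpoints, restore the saved history, and redo the update with the
    // resolved direction.
    void recover(uint64_t branchId, bool actuallyTaken) {
        while (!checkpoints_.empty() && checkpoints_.back().branchId > branchId)
            checkpoints_.pop_back();
        if (!checkpoints_.empty() && checkpoints_.back().branchId == branchId) {
            ghr_ = checkpoints_.back().saved;
            checkpoints_.pop_back();
        }
        ghr_ = (ghr_ << 1) | std::bitset<kHistoryBits>(actuallyTaken ? 1 : 0);
    }

    // When a branch retires, its checkpoint can never be needed again.
    void retire(uint64_t branchId) {
        while (!checkpoints_.empty() && checkpoints_.front().branchId <= branchId)
            checkpoints_.pop_front();
    }

    std::bitset<kHistoryBits> value() const { return ghr_; }

private:
    struct Checkpoint { uint64_t branchId; std::bitset<kHistoryBits> saved; };
    std::bitset<kHistoryBits> ghr_;
    std::deque<Checkpoint> checkpoints_;   // ordered oldest (front) to youngest (back)
};
```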
As the target PC of a subroutine call is sent to all cores by the FMU (which flags the fact that it is a subroutine call), core zero pushes the return address into its RAS (70, FIG. 1 a). When a return instruction is encountered (possibly by a different core from the one that fetched the subroutine call) and communicated to the FMU, core zero pops its RAS and communicates the return address back through the FMU. Notice that, since all RAS operations are processed by core zero, the effective RAS size does not increase when cores are fused.
When one fetch engine stalls as a result of an i-cache or i-TLB miss, or there is contention on a fetch engine for branch predictor ports (two consecutive branches fetched by the same core in the same cycle), it is necessary for all fetch engines to stall, so that correct ordering of the instructions in one fetch block can be maintained (e.g., for orderly FMU resolution of branch targets), and to allow instructions in the same fetch group to flow through later stages of the fused front-end (most notably, through rename) in a lock-step fashion. To support this, cores communicate fetch stalls to the FMU, which informs the other cores. In one instance of the exemplary embodiment, because of the two-cycle crossbar latency, it is possible that the other cores may over-fetch in the shadow of the stall handling by the FMU if (a) on an i-cache or i-TLB miss, one of the other cores does hit in its i-cache or i-TLB (very unlikely in practice, given how fused cores fetch), or (b) generally in the case of contention for branch predictor ports by two back-to-back branches fetched by the same core (itself exceedingly unlikely). Once all cores have been informed, including the delinquent core, the cores discard any over-fetched instruction (similarly to the handling of a taken branch) and resume fetching in sync from the right PC-as if all fetch engines had synchronized through a “fetch barrier.”
After fetch, each core pre-decodes its instructions independently. Subsequently, all instructions in the fetch group need to be renamed. Steering consumers (of instructions) to the same core as their producers can improve performance by eliminating communication delays. Renaming and steering of instructions is achieved through a small control unit called the steering management unit (SMU) (80, FIG. 1 a). The SMU consists of a global steering table to track the mapping of architectural registers to any core, four free-lists, in one instance of the exemplary embodiment, for register allocation (one for each core), four rename maps, in one instance of the exemplary embodiment, and steering/renaming logic (See FIG. 2). The steering table and the four rename maps together allow up to four valid mappings of each architectural register, and enable operands to be replicated across multiple cores. Cores still retain their individual renaming structures, but these are bypassed when cores are fused.
FIG. 2 depicts the high level organization of the rename pipeline. After pre-decode, each core sends up to two instructions to the SMU through a set of links. In one instance, it is possible to support three-cycle cross-core communication over a repeated link. In one instance of the exemplary embodiment, three cycles after pre-decode, the SMU receives up to two instructions and six architectural register specifiers (three per instruction) from each core. After renaming and steering, it uses a second set of links to dispatch up to six physical register specifiers, two instructions and two copy operations to each core. Restricting the SMU dispatch bandwidth in this way keeps the wiring overhead manageable, lowers the number of required rename map ports, and also helps achieve load balancing among the fused cores. Collectively, the incoming and outgoing SMU links constitute the rename crossbar.
The SMU uses the incoming architectural register specifiers and the four free lists, in one instance of the exemplary embodiment, to rename up to eight instructions every pipeline cycle. Each instruction is dispatched to one of the cores via dependence-based steering (for dependence-based steering, see for example, S. Palacharla, N. P. Jouppi, and J. E. Smith, Complexity-effective superscalar processors, Intl. Symp. on Computer Architecture, pages 206-218, Denver, Colo., June 1997, which is incorporated by reference herein). For each instruction, the SMU consults the steering table and steers the instruction to the core that will produce the most of its operands, among all cores with free rename crossbar ports. Copy instructions are also inserted into the fetch group in this cycle. In the next cycle, instructions (and the generated copies) are renamed by accessing the appropriate rename map and free list.
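By way of illustration only, the following Python sketch shows dependence-based steering over a steering table that records which cores hold a mapping for each architectural register: each instruction goes to the core that already holds the most of its source operands among cores with free rename-crossbar slots, and copy operations are generated for sources that live elsewhere. The table encoding, slot counts, and tie-breaking rule are assumptions of the example.

NUM_CORES = 4
SLOTS_PER_CORE = 2     # the SMU dispatches at most two instructions per core per cycle

def steer(instructions, steering_table):
    # instructions: list of (dest_reg, [src_regs]); steering_table: arch reg -> set of holder cores.
    free_slots = [SLOTS_PER_CORE] * NUM_CORES
    placement, copies = [], []
    for dest, srcs in instructions:
        candidates = [c for c in range(NUM_CORES) if free_slots[c] > 0]
        best = max(candidates,
                   key=lambda c: sum(c in steering_table.get(s, set()) for s in srcs))
        free_slots[best] -= 1
        for s in srcs:
            holders = steering_table.get(s, set())
            if holders and best not in holders:
                copies.append((s, next(iter(holders)), best))    # copy s from a holder core to 'best'
                steering_table[s].add(best)
        steering_table[dest] = {best}      # the destination value is now produced on 'best'
        placement.append((dest, srcs, best))
    return placement, copies

table = {"r1": {0}, "r2": {1}}
insts = [("r3", ["r1", "r2"]), ("r4", ["r3"])]
print(steer(insts, table))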
Since each core receives no more than two instructions and two copy instructions, each rename map has only six read and eight write ports. The steering table requires eight read and sixteen write ports; note that each steering table entry contains only a single bit, and thus the overhead of multi-porting this small table is relatively low. After rename, regular and copy instructions are dispatched to the appropriate cores. If a copy instruction cannot be sent due to bandwidth restrictions, renaming stops at the offending instruction that cycle, and starts with the same instruction next cycle, thereby draining crossbar links and guaranteeing forward progress. Similarly, if resource occupancies prevent the SMU from dispatching an instruction, renaming stops that cycle, and resumes when resources are available. To facilitate this, cores inform the SMU through a four-bit interface, in one instance of the exemplary embodiment, when their issue queues, reorder buffers (ROBs), and load/store queues are full.
Registers are recycled through two mechanisms. As in many existing microprocessors, at commit time, any instruction that renames an architectural register releases the physical register holding the result of the previous instruction that renamed the same register. This is accomplished in core fusion over a portion of the rename crossbar, by having each ROB send the specifiers for these registers to the SMU. However, copy instructions do not allocate ROB entries, and recycling them requires an alternative strategy. Every copy instruction generates a replica of a physical register in one core on some other core. These replicas are not recovered on branch mispredictions. Therefore, a register holding a redundant replica can be recycled at any point in time as long as all instructions whose architectural source registers are mapped to that physical register have read its value. To facilitate recycling, the SMU keeps a one-bit flag, in one instance, for each register indicating whether the corresponding register is currently holding a redundant replica (i.e., it was the target of a copy instruction). In addition, the SMU keeps a table of per-register read counters for each core, where every counter entry corresponds to the number of outstanding reads to a specific physical register (in one instance, each counter is four bits in a sixteen-entry issue queue). These counters are incremented at the time the SMU dispatches instructions to cores. Every time an instruction leaves a core's issue queue, the core communicates the specifiers for the physical registers read by the instruction, and the SMU decrements the corresponding counters. When a branch misprediction or a replay trap is encountered, as squashed instructions are removed from the instruction window, the counters for the corresponding physical registers are updated appropriately in the shadow of the refetch.
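By way of illustration only, the following Python sketch captures the replica-recycling idea: a one-bit flag marks physical registers that hold redundant copies, and a per-register read counter is incremented when the SMU dispatches a reader and decremented when that reader leaves an issue queue; the replica can be recycled once its count reaches zero. The table layout is an assumption of the example.

class ReplicaRecycler:
    def __init__(self):
        self.is_replica = {}       # phys reg -> True if it holds a redundant copy
        self.pending_reads = {}    # phys reg -> readers dispatched but not yet issued

    def on_copy_dispatch(self, preg):
        self.is_replica[preg] = True
        self.pending_reads.setdefault(preg, 0)

    def on_reader_dispatch(self, preg):
        # The SMU increments the counter when dispatching an instruction that reads preg.
        self.pending_reads[preg] = self.pending_reads.get(preg, 0) + 1

    def on_reader_issue(self, preg):
        # The core reports the registers read as the instruction leaves its issue queue.
        self.pending_reads[preg] -= 1
        if self.is_replica.get(preg) and self.pending_reads[preg] == 0:
            del self.is_replica[preg]
            del self.pending_reads[preg]
            return True            # preg may be returned to the free list
        return False

r = ReplicaRecycler()
r.on_copy_dispatch(42)
r.on_reader_dispatch(42)
r.on_reader_dispatch(42)
print(r.on_reader_issue(42), r.on_reader_issue(42))   # False True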
Each core's back-end includes separate floating-point and integer issue queues, a copy-out queue (95, FIG. 1 a), a copy-in queue (85, FIG. 1 a), a physical register file, functional units, load/store queues and a ROB (90, FIG. 1 a). Each core's load/store queue has access only to its private L1 data cache (25, FIG. 1 a). The L1 data caches are connected via a split-transaction bus and are kept coherent, in one instance, via a MESI protocol. (The name of the MESI protocol is based on four possible states of the cache blocks: Modified, Exclusive, Shared and Invalid.) This split-transaction bus is also connected to an on-chip L2 cache that is shared by all cores. When running independently on one core, the operation of the back-end is no different from the operation of a core in a homogeneous CMP. When cores get fused, back-end structures are coordinated to form a large virtual back-end capable of consuming instructions at a rate of, in one instance of the exemplary embodiment, eight instructions per cycle.
Copy instructions wait in the copy-out queues for their operands to become available, and once issued, they transfer their source operand and destination physical register specifier to a remote core over the operand crossbar (FIG. 1 a). The operand crossbar is capable of supporting two copy instructions per core every cycle. In addition to copy instructions, loads use the operand crossbar to deliver values to their destination registers.
When copy instructions reach the consumer core, the copy instructions are placed in a copy-in queue to wait for the selection logic to schedule them. Each cycle, the issue queue scheduler considers the two copy instructions at the queue head for scheduling along with the instructions in the issue queue. Once issued, copies wake up their dependent instructions and update the physical register file, just as regular instructions would do.
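By way of illustration only, the following Python sketch models a single copy-out/copy-in link of the operand crossbar: a copy waits in the producer's copy-out queue until its operand is ready, then up to two ready copies per cycle cross to the consumer's copy-in queue, whose head entries the scheduler considers alongside ordinary issue-queue instructions. Queue sizes and timing are assumptions of the example.

from collections import deque

class OperandCrossbarLink:
    def __init__(self, per_cycle_limit=2):
        self.per_cycle_limit = per_cycle_limit    # two copies per core per cycle
        self.copy_out = deque()                   # entries: [dest_preg, value, ready]
        self.copy_in = deque()

    def enqueue_copy(self, dest_preg, value, ready):
        self.copy_out.append([dest_preg, value, ready])

    def tick(self):
        # Move up to per_cycle_limit ready copies to the remote core's copy-in queue.
        sent = 0
        for entry in list(self.copy_out):
            if sent == self.per_cycle_limit:
                break
            if entry[2]:                          # source operand is available
                self.copy_out.remove(entry)
                self.copy_in.append((entry[0], entry[1]))
                sent += 1
        return sent

link = OperandCrossbarLink()
link.enqueue_copy(dest_preg=17, value=99, ready=True)
link.enqueue_copy(dest_preg=21, value=None, ready=False)   # still waiting on its producer
print(link.tick(), list(link.copy_in))    # 1 [(17, 99)]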
The goal of the fused in-order retirement operation is to coordinate the operation of four ROBs to commit up to eight instructions per cycle. Instructions allocate ROB entries locally at the end of fetch. If the fetch group contains fewer than eight instructions, NOPs (No Operation instructions) are allocated at the appropriate cores to guarantee alignment. Of course, on a pipeline bubble, no ROB entries are allocated.
When commit is not blocked, each core commits two instructions from the oldest fetch group every cycle. When one of the ROBs is blocked, all other cores must also stop committing in time to ensure that fetch blocks are committed atomically and in order. This is accomplished via the commit crossbar, which transfers stall/resume signals across all ROBs.
To accommodate the communication delay across the crossbars, each ROB is extended with a speculative head pointer in addition to the conventional head and tail pointers. Instructions always pass through the speculative ROB head before they reach the actual ROB head and commit. If not ready to commit at that time, a stall signal is sent to all cores.
Later, when the instructions are ready to commit (the commit logic "decides" when it is safe to store a result), they move past the speculative head, and a resume signal is sent to the other cores. The number of ROB entries between the speculative head pointer and the actual head pointer is enough to cover the crossbar delay. This guarantees that ROB stalls always take effect in a timely manner to prevent committing speculative state.
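By way of illustration only, the following Python sketch shows one core's view of the speculative-head mechanism: an entry that reaches the speculative head while not yet ready to commit causes a stall to be broadcast, and because the speculative head leads the real head by enough entries to cover the crossbar delay, the stall reaches the other cores before speculative state could commit. The gap value and entry format are assumptions of the example.

class FusedROB:
    def __init__(self, crossbar_delay=2):
        self.entries = []             # in program order; each entry: {"ready": bool}
        self.gap = crossbar_delay     # entries between the speculative head and the real head

    def allocate(self, ready=False):
        self.entries.append({"ready": ready})

    def step_commit(self):
        # Returns ("stall",) when the entry at the speculative head is not ready,
        # otherwise ("commit", n) with up to two instructions committed this cycle.
        if len(self.entries) > self.gap and not self.entries[self.gap]["ready"]:
            return ("stall",)         # broadcast a stall over the commit crossbar
        committed = 0
        while self.entries and self.entries[0]["ready"] and committed < 2:
            self.entries.pop(0)
            committed += 1
        return ("commit", committed)

rob = FusedROB()
for ready in (True, True, False, True):
    rob.allocate(ready)
print(rob.step_commit())    # ('stall',): the entry at the speculative head is not ready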
Conventional clustered architectures, having a centralized L1 data cache or a data cache distributed based on bank assignment, utilize a scheme for handling loads and stores as described in, for example, R. Balasubramonian, S. Dwarkadas, and D. H. Albonesi, Dynamically managing the communication-parallelism trade-off in future clustered processors, Intl. Symp. on Computer Architecture, San Diego, Calif., June 2003, pages 275-287, which is incorporated by reference herein. The exemplary embodiment of the system of these teachings utilizes a scheme for handling loads and stores conceptually similar to that of clustered architectures, but retaining the private nature of the L1 cache of each core, thereby requiring substantially minimal modifications of the conventional CMP cache subsystem.
In the fused mode, in the exemplary embodiment, a banked-by-address LSQ implementation is utilized. This allows data to be kept coherent without requiring cache flushes after dynamic reconfiguration, and elegantly supports store forwarding and speculative loads. The core that issues each load/store to the memory system is determined based on effective addresses. The two bits, in one instance of the exemplary embodiment, that follow the block offset are used as the LSQ bank-ID to select one of the four cores (see FIG. 3), and enough index bits to cover the L1 cache are allocated from the remaining bits. The rest of the effective address and the bank-ID are stored as a tag (note that this does not increase the number of tag bits compared to a conventional indexing scheme). Making the bank-ID bits part of the tag properly disambiguates cache lines regardless of the configuration.
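By way of illustration only, the following Python sketch shows the address split implied by the banked-by-address scheme: the two bits above the block offset select the owning core, the next bits index that core's L1, and the bank-ID is folded into the tag so that lines remain disambiguated across configurations. The block size and index width are assumptions of the example.

BLOCK_OFFSET_BITS = 6     # 64-byte blocks (assumed)
BANK_BITS = 2             # four fused cores
INDEX_BITS = 7            # 128 sets per private L1 (assumed)

def split_address(addr):
    block_offset = addr & ((1 << BLOCK_OFFSET_BITS) - 1)
    bank_id = (addr >> BLOCK_OFFSET_BITS) & ((1 << BANK_BITS) - 1)
    index = (addr >> (BLOCK_OFFSET_BITS + BANK_BITS)) & ((1 << INDEX_BITS) - 1)
    upper = addr >> (BLOCK_OFFSET_BITS + BANK_BITS + INDEX_BITS)
    tag = (upper << BANK_BITS) | bank_id   # keeping the bank-ID in the tag disambiguates lines
    return {"offset": block_offset, "bank": bank_id, "index": index, "tag": tag}

print(split_address(0x32A74))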
Effective addresses for loads and stores are generally not known at the time they are renamed. At rename time memory operations need to allocate LSQ entries from the core that will eventually issue them to the memory system. In the exemplary embodiment, bank prediction is utilized (see, for example, M. Bekerman, S. Jourdan, R. Ronen, G. Kirshenboim, L. Rappoport, A. Yoaz, and U. Weiser, Correlated load-address predictors, Intl Symp. on Computer Architecture, Atlanta, Ga., May 1999, pages 54-63, which is incorporated by reference herein). Upon pre-decoding loads and stores, each core accesses its bank predictor by using the lower bits of the load/store PC. Bank predictions are sent to the SMU through the rename crossbar, and the SMU steers each load and store to the predicted core. Each core allocates load queue entries for the loads it receives. On stores, the SMU also signals all cores to allocate dummy store queue entries regardless of the bank prediction. Dummy store queue entries guarantee in-order commit for store instructions by reserving place-holders across all banks for store bank mispredictions. Upon effective address calculation, remote cores with superfluous store queue dummies are signaled to discard their entries (recycling these entries requires a collapsing LSQ implementation). If a bank misprediction is detected, the store is sent to the correct queue.
In the case of loads, if a bank misprediction is detected, the load queue entry is recycled (LSQ collapse) and the load is sent to the correct core. There, it allocates a load queue entry and resolves its memory dependences locally. Note that, as a consequence of bank mispredictions, loads can allocate entries in the load queues out of program order. Fortunately, this is not a problem for store-to-load forwarding because load queue entries are typically tagged by instruction age to facilitate forwarding. However, there is a danger of deadlock in cases where the mispredicted load is older than all other loads in its (correct) bank and the load queue is full at the time the load arrives at the consumer core. To prevent this situation, loads search the load queue for older instructions when they cannot allocate entries. If no such entry is found, a replay trap is taken, and the load is steered to the right core at rename time. Otherwise, the load is buffered until a free load queue entry becomes available. Address banking of the LSQ also facilitates speculative loads and store forwarding.
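By way of illustration only, the following Python sketch shows the decision made when a bank-mispredicted load arrives at the correct core: allocate if space exists, buffer if the queue is full but holds an older load (which will eventually free an entry), and take a replay trap if the arriving load is the oldest, since waiting could deadlock. Queue representation and age encoding are assumptions of the example.

def place_mispredicted_load(load_age, correct_queue, capacity):
    # correct_queue holds the ages of loads already in the correct bank's load queue
    # (smaller age = older in program order).
    if len(correct_queue) < capacity:
        correct_queue.append(load_age)
        return "allocated"
    if any(age < load_age for age in correct_queue):
        return "buffered"       # an older load will retire and free an entry
    return "replay_trap"        # arriving load is the oldest: re-steer it at rename time

print(place_mispredicted_load(load_age=5, correct_queue=[3, 9], capacity=4))   # allocated
print(place_mispredicted_load(load_age=5, correct_queue=[3, 9], capacity=2))   # buffered
print(place_mispredicted_load(load_age=5, correct_queue=[7, 8], capacity=2))   # replay_trap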
Since any load instruction is free of bank mispredictions at the time it issues to the memory system, loads and stores to the same address are guaranteed to be processed by the same core. Dependence speculation can be achieved by integrating a storeset predictor (as detailed, for example, in G. Chrysos and J. Emer. Memory dependence prediction using store sets. Intl. Symp. on Computer Architecture, Barcelona, Spain, June-July 1998, pages 142-153, which is incorporated by reference herein) on each core (since cores perform aligned fetches, the same load is guaranteed to access the same predictor at all times).
When running parallel application threads in fused mode, the memory consistency model should be enforced on all loads and stores. In one embodiment, relaxed consistency models are utilized where special primitives like memory fences (weak consistency) or acquire/release operations (release consistency) enforce ordering constraints on ordinary memory operations. The operation of memory fences is presented hereinbelow. Acquire and release operations are handled similarly.
For the correct functioning of synchronization primitives, fences (memory fence instructions) must be made visible to all load/store queues belonging to the same thread. In this instance of the exemplary embodiment, these operations are dispatched to all the queues, but only the copy in the correct queue performs the actual synchronization operation. The fence is considered complete once each one of the local fences completes locally and all memory operations preceding each fence commit. Local fence completion is signaled to all cores through a one-bit interface in the portion of the operand crossbar that connects the load-store queues.
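By way of illustration only, the following Python sketch states the fused-mode fence completion condition: every core's local copy of the fence must have completed, and all memory operations older than the fence must have committed, as signalled over the one-bit interface. The per-core boolean representation is an assumption of the example.

def fence_complete(local_fence_done, older_mem_ops_committed):
    # Both arguments are per-core lists of booleans for the four fused cores.
    return all(local_fence_done) and all(older_mem_ops_committed)

print(fence_complete([True, True, True, True], [True, True, True, True]))    # True
print(fence_complete([True, True, False, True], [True, True, True, True]))   # False: one local fence pending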
The embodiments presented hereinabove disclose the operation of the cores in a static fashion. While the embodiments disclosed hereinabove are capable of improving performance on highly parallel or purely sequential applications by configuring the CMP prior to executing the application, partially parallelized applications benefit most from the ability to fuse/split cores at run time, as they dynamically switch between sequential and parallel code regions. In order to support dynamic reconfiguration of the architecture of these teachings, embodiments of the method of these teachings include several actions to be taken to ensure correctness upon core fusion and fission, and embodiments of the system of these teachings include an application interface to coordinate and trigger reconfiguration. (For an example of an application interface, these teachings not being limited by this example, using a higher-level language, see Laufer, R.; Taylor, R. R.; Schmit, H., PCI-PipeRench and the SWORDAPI: a system for stream-based reconfigurable computing, Seventh Annual IEEE Symposium on Field-Programmable Custom Computing Machines, 1999, FCCM '99. Proceedings, Date: 1999, Pages: 200-208, which is incorporated by reference herein.)
The modular nature of the architecture of these teachings makes reconfiguration relatively easy. In one embodiment, run-time reconfiguration is enabled through a simple application interface. The application requests core fusion/fission actions through system calls. In many embodiments, the requests can be readily encapsulated in conventional parallelizing macros or directives.
After the completion of a parallel region, the application may, in one instance, request cores to be fused to execute the upcoming sequential region. Cores need not get fused on every parallel-to-sequential region boundary: if the sequential region is not long enough to amortize the cost of fusion, execution can continue without reconfiguration on one of the small cores. All in-flight instructions following the system call are flushed, and the appropriate rename map on the SMU is updated with mappings to the architectural state on this core. Data caches do not need any special actions to be taken upon reconfigurations; the coherence protocol naturally ensures correctness across configuration changes. I-caches undergo tag reconfiguration upon fusion and fission as described hereinabove, and all cache blocks are invalidated for consistency. This is generally harmless if it can be amortized over the duration of a configuration. In any case, the programmer or the run-time system may choose not to reconfigure across fast-alternating program regions (e.g., short serial sections in between parallel sections). The shared L2 cache is not affected by reconfiguration.
In one embodiment, fission is achieved through a second system call, where the application informs the processor about the approaching parallel region. Fetch is stalled, in-flight instructions are allowed to drain, and enough copy instructions are generated to gather the architectural state into core zero's physical register file. When the transfer is complete, control is returned to the application.
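By way of illustration only, the following Python sketch shows the state-gathering step of fission: once fetch is stalled and in-flight instructions have drained, a copy operation is generated for every architectural register whose latest mapping resides on a core other than core zero. The rename-map representation is an assumption of the example.

def gather_state_to_core_zero(rename_maps):
    # rename_maps: arch reg -> (core_id, phys_reg) for the latest valid mapping.
    # Returns the copy operations needed to collect all architectural state on core 0.
    copies = []
    for arch_reg, (core_id, phys_reg) in sorted(rename_maps.items()):
        if core_id != 0:
            copies.append({"arch": arch_reg, "from_core": core_id, "src_preg": phys_reg})
    return copies

latest = {"r1": (0, 11), "r2": (3, 40), "r3": (1, 7)}
for op in gather_state_to_core_zero(latest):
    print(op)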
Results on the performance of the exemplary embodiment described hereinabove are reported in Engin Ipek, M. Kirman, N. Kirman, and J. F. Martinez, Accommodating workload diversity in chip multiprocessors via adaptive core fusion, Workshop on Complexity-effective Design, conc. with ISCA, Boston, Mass., June 2006, which is incorporated by reference herein.
While these teachings have been illustrated by the exemplary embodiment disclosed herein above, it should be noted that these teachings are not limited only to the exemplary embodiment. Embodiments in which a cross unit connection component incorporates the functionality of several of the crossbars in the exemplary embodiment are also within the scope of these teachings. The crossbars in the exemplary embodiment and one or more cross unit connection components incorporating the functionality of the crossbars in the exemplary embodiment comprise means for enabling executing sequential code collectively at processor units and enabling changing the architectural configuration of the processing units.
Although these teachings have been described with respect to various embodiments, it should be realized this invention is also capable of a wide variety of further and other embodiments within the spirit and scope of the appended claims.

Claims (16)

1. A reconfigurable multiprocessor system comprising:
a plurality of processor units, wherein the plurality of processor units are independent cores and dynamically fused into a single processor, wherein the single processor is dynamically split into distinct processing units at run time; and
at least one cross unit connection component operatively connecting at least two processor units from said plurality of processor units, said at least one cross unit connection component reconfigurably linking one processor unit to another processor unit, thereby fusing them as a single processor, said reconfigurable linking adjusting processing to sequential when in a fused mode and to parallel when in a split mode; said at least one cross unit connection component enabling collective fetching and providing instructions to be executed collectively by said at least two processor units and collectively committing executed instructions by said at least two processor units; said collective fetching comprising cooperatively fetching an instruction block; said instruction block comprising a number of subsets, said fetching being performed cooperatively by fetching one subset of said instruction block by each of said at least two processor units, when the processor units are in the fused mode; said instruction block comprising each subset of instructions fetched by each of said at least two processor units; said instruction block being constructed from each subset of instructions fetched by each of said at least two processor units; said collective fetching enabling operation of said at least two processor units substantially as a single processor unit; whereby executing sequential code collectively at processor units is enabled; and
when changing manner of processing from fused to split or from split to fused, bringing an internal state of said plurality of processing units to a form consistent with a new configuration, resulting from the reconfiguring of cross connection unit, and
reconfiguring instruction cache memories in each processing unit and updating instruction characteristic upon change of manner of processing.
2. The reconfigurable multiprocessor system of claim 1 wherein each processor unit from said plurality of processor units comprises:
a program counter;
an instruction cache memory;
a data cache memory;
a copy in queue memory;
a copy out queue memory, and
a reorder buffer memory;
and wherein said at least one connection component comprises:
a first cross unit connection component reconfigurably and operatively connecting said at least two processor units from said plurality of processor units and enabling reconfigurably linking one processor unit to another processor unit, and enabling collective fetching and branch prediction;
a second cross unit connection component reconfigurably and operatively connected to said at least two processor units from said plurality of processor units, said second cross unit connection component enabling steering and renaming instructions for/from each processor unit;
a third cross unit connection component reconfigurably and operatively connecting said at least two processor units from said plurality of processor units and enabling transfer of output instruction values from one processor unit as input instruction values to another processor unit;
a fourth cross unit connection component reconfigurably and operatively connecting at least two processor units from said plurality of processor units, said fourth cross unit connection component enabling transferring signals between reorder buffer memories in processor units from said plurality of processor units, and of enabling collective instruction commit.
3. The reconfigurable multiprocessor system of claim 2 further comprising:
a fetch management component operatively connected to said first cross unit connection component, said fetch management component enabling coordinating distributed operation of fetch components in each processor unit.
4. The reconfigurable multiprocessor system of claim 2 further comprising:
a steering management component enabling steering and renaming instructions for/from each processor unit; and said steering management component being operatively connected to each processor unit through said second cross unit connection component.
5. The reconfigurable multiprocessor system of claim 2 further comprising:
an application interface for coordinating and starting reconfiguration.
6. A method for executing code collectively at a plurality of processing units, wherein the plurality of processing units are independent cores and dynamically fused into a single processing unit, wherein the single processing unit is dynamically split into distinct processing units at run time, the method comprising the steps of:
reconfiguring cross connection between processing units in said plurality of processing units, said reconfiguring comprising:
reconfigurably linking one processor unit to another processor unit, thereby fusing them as a single processor, said reconfigurable linking adjusting processing to sequential when in a fused mode and to parallel when in a split mode;
fetching and providing instructions to be executed collectively by the plurality of processor units; said fetching comprising fetching an instruction block cooperatively by fetching a subset of said instruction block by each of at least two processor units from the plurality of processor units;
said instruction block comprising each subset of instructions fetched by each of said at least two processor units, when the processing units are in the fused mode; said instruction block being constructed from each subset of instructions fetched by each processor unit from the plurality of processor units; said fetching enabling operation of said at least two processor units as a single processing unit; and thereby enabling execution of sequential code collectively at processing units;
when changing manner of processing from fused to split or from split to fused, bringing an internal state of said plurality of processing units to a form consistent with a new configuration, resulting from the reconfiguring of cross connection; and
reconfiguring instruction cache memories in each processing unit and updating instruction characteristic upon change of manner of processing.
7. The method of claim 6 wherein the step of fetching and providing instructions comprises the steps of:
providing a cross unit connection component having a fetch management component operatively connected thereto;
accepting at the fetch management component, a target instruction address from at least one providing processing unit;
selecting, at the fetch management component, contents of a program counter; and
providing the selected contents of a program counter to a fetch component in each processor unit;
discarding, at each processing unit, over fetched instructions; and
resuming fetching, in synchronization with each processing unit, from contents of the selected contents of the program counter.
8. The method of claim 7 wherein the step of fetching and providing instructions further comprises the steps of:
communicating, from processing units experiencing fetch stalls, the fetch stalls to the fetch management component;
reporting from the fetch management component, to processing units possibly not experiencing fetch stalls, the communicated fetch stalls.
9. The method of claim 6 wherein the step of fetching and providing instructions comprises the steps of:
providing a steerable cross unit connection component having a steering management component attached thereto;
sending to the steering management component, from each processing unit, at most a predetermined number of instructions;
receiving, at a predetermined number of cycles after sending, at the steering management component at most the predetermined number of instructions from each processing unit and a corresponding predetermined number of architectural register specifiers from each processing unit; renaming and steering instructions received from each processing unit; and sending, after renaming and steering, to each processing unit, from the steering management component, a predetermined number of renamed and steered instructions and copy operations.
10. The method of claim 6 further comprising the step of: accepting reconfiguration requests indicating a change in manner of processing.
11. The method of claim 6 further comprising the step of:
transferring operands and destination register specification from one processing unit to another processing unit.
12. The method of claim 6 further comprising the step of:
providing stall/resume signals to a Reorder buffer in each processing unit.
13. The method of claim 9 further comprising the steps of:
generating bank predictions at each processing unit after identifying load and store instructions;
providing the bank predictions from each processing unit to the steering management component;
directing load and store instructions to a predicted processing unit.
14. The method of claim 9 further comprising the steps of:
generating address predictions at each processing unit after identifying load and store instructions;
directing load and store instructions to processing units based on said address predictions;
redirecting load and store instructions to other processing units, if misprediction is detected.
15. The method of claim 14 wherein the directing load and store instructions is achieved by means of said steerable cross unit connection component and said steering management component.
16. The method of claim 14 wherein load and store instructions are redirected to other processing units upon effective address calculation.
US11/556,454 2006-11-03 2006-11-03 Systems and methods for reconfiguring on-chip multiprocessors Active 2027-01-30 US7809926B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US11/556,454 US7809926B2 (en) 2006-11-03 2006-11-03 Systems and methods for reconfiguring on-chip multiprocessors

Publications (2)

Publication Number Publication Date
US20080109637A1 US20080109637A1 (en) 2008-05-08
US7809926B2 true US7809926B2 (en) 2010-10-05

Family

ID=39361024

Family Applications (1)

Application Number Title Priority Date Filing Date
US11/556,454 Active 2027-01-30 US7809926B2 (en) 2006-11-03 2006-11-03 Systems and methods for reconfiguring on-chip multiprocessors

Country Status (1)

Country Link
US (1) US7809926B2 (en)

Families Citing this family (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8006003B2 (en) * 2008-02-29 2011-08-23 International Business Machines Corporation Apparatus, system, and method for enqueue prioritization
US8108573B2 (en) * 2008-02-29 2012-01-31 International Business Machines Corporation Apparatus, system, and method for enqueue prioritization
US8127119B2 (en) * 2008-12-05 2012-02-28 The Board Of Regents Of The University Of Texas System Control-flow prediction using multiple independent predictors
JP5411530B2 (en) * 2009-03-04 2014-02-12 キヤノン株式会社 Parallel processor system
US10698859B2 (en) 2009-09-18 2020-06-30 The Board Of Regents Of The University Of Texas System Data multicasting with router replication and target instruction identification in a distributed multi-core processing architecture
JP5707011B2 (en) 2010-06-18 2015-04-22 ボード・オブ・リージエンツ,ザ・ユニバーシテイ・オブ・テキサス・システム Integrated branch destination / predicate prediction
KR102010317B1 (en) * 2013-03-14 2019-08-13 삼성전자주식회사 Reorder-buffer-based dynamic checkpointing for rename table rebuilding
US9417879B2 (en) * 2013-06-21 2016-08-16 Intel Corporation Systems and methods for managing reconfigurable processor cores
US9632947B2 (en) * 2013-08-19 2017-04-25 Intel Corporation Systems and methods for acquiring data for loads at different access times from hierarchical sources using a load queue as a temporary storage buffer and completing the load early
US9703708B2 (en) * 2013-09-27 2017-07-11 Intel Corporation System and method for thread scheduling on reconfigurable processor cores
US9892803B2 (en) * 2014-09-18 2018-02-13 Via Alliance Semiconductor Co., Ltd Cache management request fusing
US20160154649A1 (en) * 2014-12-01 2016-06-02 Mediatek Inc. Switching methods for context migration and systems thereof
US11681531B2 (en) 2015-09-19 2023-06-20 Microsoft Technology Licensing, Llc Generation and use of memory access instruction order encodings
US11016770B2 (en) 2015-09-19 2021-05-25 Microsoft Technology Licensing, Llc Distinct system registers for logical processors
US10776115B2 (en) 2015-09-19 2020-09-15 Microsoft Technology Licensing, Llc Debug support for block-based processor
US10678544B2 (en) 2015-09-19 2020-06-09 Microsoft Technology Licensing, Llc Initiating instruction block execution using a register access instruction
US10180840B2 (en) 2015-09-19 2019-01-15 Microsoft Technology Licensing, Llc Dynamic generation of null instructions
US10198263B2 (en) 2015-09-19 2019-02-05 Microsoft Technology Licensing, Llc Write nullification
US10936316B2 (en) 2015-09-19 2021-03-02 Microsoft Technology Licensing, Llc Dense read encoding for dataflow ISA
US10719321B2 (en) 2015-09-19 2020-07-21 Microsoft Technology Licensing, Llc Prefetching instruction blocks
US10768936B2 (en) 2015-09-19 2020-09-08 Microsoft Technology Licensing, Llc Block-based processor including topology and control registers to indicate resource sharing and size of logical processor
US10871967B2 (en) 2015-09-19 2020-12-22 Microsoft Technology Licensing, Llc Register read/write ordering
US11126433B2 (en) 2015-09-19 2021-09-21 Microsoft Technology Licensing, Llc Block-based processor core composition register
US20170083327A1 (en) 2015-09-19 2017-03-23 Microsoft Technology Licensing, Llc Implicit program order
US10452399B2 (en) 2015-09-19 2019-10-22 Microsoft Technology Licensing, Llc Broadcast channel architectures for block-based processors
CN111190645B (en) * 2020-02-25 2024-03-19 江苏华创微系统有限公司 Separated instruction cache structure
CN114741137B (en) * 2022-05-09 2024-02-20 潍柴动力股份有限公司 Software starting method, device, equipment and storage medium based on multi-core microcontroller

Patent Citations (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5471592A (en) * 1989-11-17 1995-11-28 Texas Instruments Incorporated Multi-processor with crossbar link of processors and memories and method of operation
US5522083A (en) 1989-11-17 1996-05-28 Texas Instruments Incorporated Reconfigurable multi-processor operating in SIMD mode with one processor fetching instructions for use by remaining processors
US5475856A (en) * 1991-11-27 1995-12-12 International Business Machines Corporation Dynamic multi-mode parallel processing array
US5664214A (en) * 1994-04-15 1997-09-02 David Sarnoff Research Center, Inc. Parallel processing computer containing a multiple instruction stream processing architecture
US5638533A (en) * 1995-10-12 1997-06-10 Lsi Logic Corporation Method and apparatus for providing data to a parallel processing array
US6018796A (en) * 1996-03-29 2000-01-25 Matsushita Electric Industrial Co.,Ltd. Data processing having a variable number of pipeline stages
US5922066A (en) * 1997-02-24 1999-07-13 Samsung Electronics Co., Ltd. Multifunction data aligner in wide data width processor
US6725354B1 (en) * 2000-06-15 2004-04-20 International Business Machines Corporation Shared execution unit in a dual core processor
US20030145189A1 (en) * 2001-12-27 2003-07-31 Stmicroelectronics S.R.L. Processing architecture, related system and method of operation
US6920545B2 (en) 2002-01-17 2005-07-19 Raytheon Company Reconfigurable processor with alternately interconnected arithmetic and memory nodes of crossbar switched cluster
US20030149862A1 (en) * 2002-02-05 2003-08-07 Sudarshan Kadambi Out-of-order processor that reduces mis-speculation using a replay scoreboard
US20040022094A1 (en) * 2002-02-25 2004-02-05 Sivakumar Radhakrishnan Cache usage for concurrent multiple streams
US20030177288A1 (en) * 2002-03-07 2003-09-18 Kabushiki Kaisha Toshiba Multiprocessor system
US20030200419A1 (en) * 2002-04-19 2003-10-23 Industrial Technology Research Institute Non-copy shared stack and register file device and dual language processor structure using the same
US6940512B2 (en) * 2002-05-22 2005-09-06 Sony Corporation Image processing apparatus and method of same
US6976131B2 (en) 2002-08-23 2005-12-13 Intel Corporation Method and apparatus for shared cache coherency for a chip multiprocessor or multiprocessor system
US20050044319A1 (en) * 2003-08-19 2005-02-24 Sun Microsystems, Inc. Multi-core multi-thread processor
US20050219253A1 (en) * 2004-03-31 2005-10-06 Piazza Thomas A Render-cache controller for multithreading, multi-core graphics processor
US20050289286A1 (en) * 2004-06-15 2005-12-29 Akihiko Ohwada Multi-core processor control method
US20060004988A1 (en) * 2004-06-30 2006-01-05 Jordan Paul J Single bit control of threads in a multithreaded multicore processor
US7185178B1 (en) * 2004-06-30 2007-02-27 Sun Microsystems, Inc. Fetch speculation in a multithreaded processor
US7240160B1 (en) * 2004-06-30 2007-07-03 Sun Microsystems, Inc. Multiple-core processor with flexible cache directory scheme
US20060075192A1 (en) * 2004-10-01 2006-04-06 Advanced Micro Devices, Inc. Dynamic reconfiguration of cache memory
US20060248318A1 (en) * 2005-04-30 2006-11-02 Zohner Thayl D Method and apparatus for sharing memory among a plurality of processors
US20070006213A1 (en) * 2005-05-23 2007-01-04 Shahrokh Shahidzadeh In-system reconfiguring of hardware resources
US20060294326A1 (en) * 2005-06-23 2006-12-28 Jacobson Quinn A Primitives to enhance thread-level speculation

Non-Patent Citations (11)

* Cited by examiner, † Cited by third party
Title
Changkyu Kim et al. Elastic Threads on Composable Processors, Technical Report TR-2006-09, The University of Texas at Austin, Mar. 8, 2006.
Engin Ipek, Meyrem Kirman, Nevin Kirman, Jose F. Martinez, Core fusion: accommodating software diversity in chip multiprocessors, Proceedings of the 34th annual international symposium on Computer architecture, Jun. 9-13, 2007, San Diego, California, USA. *
G. Chrysos et al. Memory dependence prediction using store sets. Intl. Symp. on Computer Architecture, Barcelona, Spain, Jun.-Jul. 1998, pp. 142-153.
K. Mai et al. Smart Memories: a modular reconfigurable architecture. Intl. Symp. On Computer Architecture, Vancouver, Canada, Jun. 2000, pp. 161-171.
K. Sankaralingam et al. Exploiting ILP, TLP, and DLP with the polymorphous TRIPS architecture. In Intl. Symp. on Computer Architecture, pp. 422-433, San Diego, CA, Jun. 2003.
Katherine Leigh Compton, Architecture Generation of Customized Reconfigurable Hardware, Ph. D. Thesis, Northwestern University, Dec. 2003.
Laufer, R. et al., PCI-PipeRench and the SWORDAPI: a system for stream-based reconfigurable computing, 7th Annual IEEE Symp on Field-Prog Custom Computing Machines 200-208.
M. Bekerman et al., Correlated load-address predictors, Intl Symp. on Computer Architecture, Atlanta, GA, May 1999, pp. 54-63.
R. Balasubramonian et al., Dynamically managing the communication-parallelism trade-off in future clustered processors, Intl Symp on Computer Architecture, Jun. 2003, 275-287.
R. E. Kessler, The Alpha 21264 microprocessor, IEEE Micro, Mar. 1999, 9(2):24-36.
S. Palacharla et al., Complexity-effective superscalar processors, Intl. Symp. on Computer Architecture, pp. 206-218, Denver, CO, Jun. 1997.

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8266267B1 (en) 2005-02-02 2012-09-11 Juniper Networks, Inc. Detection and prevention of encapsulated network attacks using an intermediate device
US20100049963A1 (en) * 2008-08-25 2010-02-25 Bell Jr Robert H Multicore Processor and Method of Use That Adapts Core Functions Based on Workload Execution
US8327126B2 (en) * 2008-08-25 2012-12-04 International Business Machines Corporation Multicore processor and method of use that adapts core functions based on workload execution
US8645673B2 (en) 2008-08-25 2014-02-04 International Business Machines Corporation Multicore processor and method of use that adapts core functions based on workload execution
US9448799B2 (en) 2013-03-14 2016-09-20 Samsung Electronics Co., Ltd. Reorder-buffer-based dynamic checkpointing for rename table rebuilding
US9448800B2 (en) 2013-03-14 2016-09-20 Samsung Electronics Co., Ltd. Reorder-buffer-based static checkpointing for rename table rebuilding
CN104375805A (en) * 2014-11-17 2015-02-25 天津大学 Method for simulating parallel computation process of reconfigurable processor through multi-core processor
US9946549B2 (en) 2015-03-04 2018-04-17 Qualcomm Incorporated Register renaming in block-based instruction set architecture

Also Published As

Publication number Publication date
US20080109637A1 (en) 2008-05-08

Similar Documents

Publication Publication Date Title
US7809926B2 (en) Systems and methods for reconfiguring on-chip multiprocessors
US10338927B2 (en) Method and apparatus for implementing a dynamic out-of-order processor pipeline
Kessler The alpha 21264 microprocessor
US10013391B1 (en) Architecture emulation in a parallel processing environment
US5592679A (en) Apparatus and method for distributed control in a processor architecture
Ipek et al. Core fusion: accommodating software diversity in chip multiprocessors
US8028152B2 (en) Hierarchical multi-threading processor for executing virtual threads in a time-multiplexed fashion
US8275976B2 (en) Hierarchical instruction scheduler facilitating instruction replay
US6240502B1 (en) Apparatus for dynamically reconfiguring a processor
US8296550B2 (en) Hierarchical register file with operand capture ports
US7055021B2 (en) Out-of-order processor that reduces mis-speculation using a replay scoreboard
US9176741B2 (en) Method and apparatus for segmented sequential storage
WO2013081556A1 (en) Polymorphic heterogeneous multi-core architecture
WO2007027671A2 (en) Scheduling mechanism of a hierarchical processor including multiple parallel clusters
CN109196485A (en) Method and apparatus for maintaining the data consistency in non-homogeneous computing device
Nakazawa et al. Pseudo Vector Processor Based on Register-Windowed Superscalar Pipeline.
US20050182915A1 (en) Chip multiprocessor for media applications
Chiu et al. Hyperscalar: A novel dynamically reconfigurable multi-core architecture
Chiu et al. A unitable computing architecture for chip multiprocessors
US20240020120A1 (en) Vector processor with vector data buffer
Ipek et al. A reconfigurable chip multiprocessor architecture to accommodate software diversity
Keckler et al. Architecture and Implementation of the TRIPS Processor
Fung Dynamic warp formation: exploiting thread scheduling for efficient MIMD control flow on SIMD graphics hardware
Kırman et al. A Reconfigurable Chip Multiprocessor Architecture to Accommodate Software Diversity

Legal Events

Date Code Title Description
AS Assignment

Owner name: CORNELL RESEARCH FOUNDATION, INC., NEW YORK

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:MARTINEZ, JOSE F.;IPEK, ENGIN;KIRMAN, MEYREM;AND OTHERS;REEL/FRAME:018834/0740

Effective date: 20070126

FEPP Fee payment procedure

Free format text: PAYOR NUMBER ASSIGNED (ORIGINAL EVENT CODE: ASPN); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

STCF Information on status: patent grant

Free format text: PATENTED CASE

CC Certificate of correction
FPAY Fee payment

Year of fee payment: 4

FEPP Fee payment procedure

Free format text: PAT HOLDER NO LONGER CLAIMS SMALL ENTITY STATUS, ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: STOL); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

CC Certificate of correction
MAFP Maintenance fee payment

Free format text: PAYMENT OF MAINTENANCE FEE, 8TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1552)

Year of fee payment: 8

MAFP Maintenance fee payment

Free format text: PAYMENT OF MAINTENANCE FEE, 12TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1553); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

Year of fee payment: 12