US20050278505A1 - Microprocessor architecture including zero impact predictive data pre-fetch mechanism for pipeline data memory - Google Patents


Publication number
US20050278505A1
Authority
US
United States
Prior art keywords
fetch
pipeline
instruction
memory
microprocessor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US11/132,447
Inventor
Seow Lim
Kar-Lik Wong
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ARC International
Original Assignee
ARC International
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ARC International filed Critical ARC International
Priority to US11/132,447 priority Critical patent/US20050278505A1/en
Assigned to ARC INTERNATIONAL reassignment ARC INTERNATIONAL ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: LIM, SEOW CHUAN, WONG, KAR-LIK
Publication of US20050278505A1 publication Critical patent/US20050278505A1/en

Classifications

    • G06F5/01 — Data conversion without changing the order or content of the data handled, for shifting, e.g. justifying, scaling, normalising
    • G06F9/3861 — Recovery, e.g. branch miss-prediction, exception handling
    • G06F11/3648 — Software debugging using additional hardware
    • G06F15/7867 — Architectures of general purpose stored program computers comprising a single central processing unit with reconfigurable architecture
    • G06F9/30032 — Movement instructions, e.g. MOVE, SHIFT, ROTATE, SHUFFLE
    • G06F9/30036 — Instructions to perform operations on packed data, e.g. vector, tile or matrix operations
    • G06F9/30145 — Instruction analysis, e.g. decoding, instruction word fields
    • G06F9/30149 — Instruction analysis of variable length instructions
    • G06F9/30181 — Instruction operation extension or modification
    • G06F9/32 — Address formation of the next instruction, e.g. by incrementing the instruction counter
    • G06F9/325 — Address formation of the next instruction for loops, e.g. loop detection or loop counter
    • G06F9/3802 — Instruction prefetching
    • G06F9/3806 — Instruction prefetching for branches using address prediction, e.g. return stack, branch history buffer
    • G06F9/3816 — Instruction alignment, e.g. cache line crossing
    • G06F9/3844 — Speculative instruction execution using dynamic branch prediction, e.g. using branch history tables
    • G06F9/3846 — Speculative instruction execution using static prediction, e.g. branch taken strategy
    • G06F9/3885 — Concurrent instruction execution using a plurality of independent parallel functional units
    • G06F9/3897 — Parallel functional units controlled in tandem for complex operations, with adaptable data path
    • G06F12/0802 — Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • This invention relates generally to microprocessor architecture and more specifically to systems and methods for achieving improved performance through a predictive data pre-fetch mechanism for a pipeline data memory, including specifically XY-type data memory.
  • Multistage pipeline microprocessor architecture is known in the art.
  • A typical microprocessor pipeline consists of several stages of instruction-handling hardware, wherein each rising pulse of a clock signal propagates instructions one stage further in the pipeline.
  • While the clock speed dictates the number of pipeline propagations per second, the effective operational speed of the processor also depends on the rate at which instructions and operands are transferred between memory and the processor.
  • To speed these transfers, processors typically employ one or more relatively small cache memories built directly into the processor.
  • Cache memory typically is an on-chip random access memory (RAM) used to store a copy of memory data in anticipation of future use by the processor.
  • The cache is positioned between the processor and the main memory to intercept calls from the processor to the main memory. Access to cache memory is generally much faster than access to off-chip RAM. When data that has previously been accessed is needed again, it can be retrieved directly from the cache rather than from the relatively slower off-chip RAM.
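The intercepting role of the cache described above can be sketched with a minimal model; the class, field names and dict-based "memory" are illustrative, not from the patent:

```python
# Minimal model of a cache sitting between the processor and main memory.
class Cache:
    def __init__(self, main_memory):
        self.main_memory = main_memory   # slow, off-chip backing store
        self.lines = {}                  # fast on-chip copies: addr -> data
        self.hits = 0
        self.misses = 0

    def load(self, addr):
        if addr in self.lines:           # previously accessed: serve on-chip
            self.hits += 1
            return self.lines[addr]
        self.misses += 1                 # first access: go to main memory
        data = self.main_memory[addr]
        self.lines[addr] = data          # keep a copy for future use
        return data

c = Cache({0x4: 111, 0x8: 222})
first = c.load(0x4)                      # miss: fetched from main memory
second = c.load(0x4)                     # hit: retrieved from the cache
```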
  • A microprocessor pipeline advances instructions on each clock signal pulse to subsequent pipeline stages.
  • Because memory accesses cannot always keep pace with the pipeline, effective pipeline performance can be slower than that implied by the processor speed. Therefore, simply increasing microprocessor clock speed does not usually provide a corresponding increase in system performance. Accordingly, there is a need for a microprocessor architecture that enhances effective system performance through methods in addition to increased clock speed.
  • One known approach employs X and Y memory structures in parallel to the microprocessor pipeline.
  • For example, the ARCtangent-A4™ and ARCtangent-A5™ lines of embedded microprocessors designed and licensed by ARC International, Inc. of Hertfordshire, UK (ARC), employ such an XY memory structure.
  • XY memory was designed to facilitate executing compound instructions on a RISC architecture processor without interrupting the pipeline.
  • XY memory is typically located in parallel to the main processor pipeline, after the instruction decode stage, but prior to the execute stage. After decoding an instruction, source data is fetched from XY memory using address pointers. This source data is then fed to the execution stage.
  • The two X and Y memory structures source two operands and receive results in the same cycle.
  • Data in the XY memory is indexed via pointers from address generators and supplied to the ARC CPU pipeline for processing by any ARC instruction.
  • The memories are software-programmable to provide 32-bit, 16-bit, or dual 16-bit data to the pipeline.
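The dual-operand sourcing described above can be modelled as follows; the pointer register names AX0/AY0 follow the patent's later examples, while the Python structures and data values are illustrative:

```python
# Illustrative model of XY memory sourcing two operands in one cycle
# via address pointer registers.
X = [0x1111, 0x2222, 0x3333]         # X memory
Y = [0xAAAA, 0xBBBB, 0xCCCC]         # Y memory
pointers = {"AX0": 1, "AY0": 2}      # address pointer registers

def fetch_operands(src_x, src_y):
    """Read one operand from X and one from Y using pointer registers."""
    return X[pointers[src_x]], Y[pointers[src_y]]

op1, op2 = fetch_operands("AX0", "AY0")   # both reads in the same cycle
```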
  • Various embodiments of the invention may ameliorate or overcome one or more of the shortcomings of conventional microprocessor architecture through a predictively fetched XY memory scheme.
  • In various embodiments, an XY memory structure is located in parallel to the instruction pipeline.
  • A speculative pre-fetching scheme is spread over several sections of the pipeline in order to maintain a high processor clock speed.
  • Operands are speculatively pre-fetched from X and Y memory before the current instruction has even been decoded.
  • The speculative pre-fetching occurs in an alignment stage of the instruction pipeline.
  • Speculative address calculation of the operands also occurs in the alignment stage of the instruction pipeline.
  • The XY memory is accessed in the instruction decode stage based on the speculative address calculation, and the resolution of the predictive pre-fetching occurs in the register file stage of the pipeline. Because the actual decoded instruction is not available in the pipeline until after the decode stage, all pre-fetching is done without explicit knowledge of what the current instruction is while that instruction is being pushed out of the decode stage into the register file stage. Thus, in various embodiments, a comparison is made in the register file stage between the operands specified by the actual instruction and those predictively pre-fetched. The pre-fetched values that match are selected to be passed to the execute stage of the instruction pipeline. Therefore, in a microprocessor architecture employing such a scheme, data memory fetches, arithmetic operations and result write-back can be performed using a single instruction without slowing down the instruction pipeline clock speed or stalling the pipeline, even at high processor clock frequencies.
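The resolution step described above, comparing the operands named by the now-decoded instruction with what was speculatively pre-fetched, might look like this; the data values and structures are illustrative only:

```python
# Sketch of resolving predictive pre-fetches in the register file stage:
# matched operands are forwarded to execute; a miss means a stall.
prefetch_buffer = {"AX0": 0x2222, "AY0": 0xCCCC}   # pointer -> data

def resolve(decoded_sources):
    """Return pre-fetched operand data if all sources matched, else None."""
    if all(src in prefetch_buffer for src in decoded_sources):
        return [prefetch_buffer[src] for src in decoded_sources]
    return None                # mis-prediction: stall and fetch for real

hit = resolve(["AX0", "AY0"])  # both operands were predicted correctly
miss = resolve(["AX1"])        # AX1 was never pre-fetched
```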
  • At least one exemplary embodiment of the invention may provide a predictive pre-fetch XY memory pipeline for a microprocessor pipeline.
  • The predictive pre-fetch XY memory pipeline may comprise a first pre-fetch stage comprising a pre-fetch pointer address register file and X and Y address generators; a second pre-fetch stage comprising X and Y memory structures accessed using address pointers generated in the first pre-fetch stage; and a third data-select stage comprising at least one pre-fetch buffer in which speculative operand data and address information are stored.
  • At least one additional exemplary embodiment may provide a method of predictively pre-fetching operand address and data information for an instruction pipeline of a microprocessor.
  • The method according to this embodiment may comprise, prior to decoding a current instruction in the pipeline: accessing a set of registers containing pointers to specific locations in pre-fetch memory structures; fetching operand data information from the specific locations in the pre-fetch memory structures; and storing the pointer and operand data information in at least one pre-fetch buffer.
  • At least one further exemplary embodiment may provide a microprocessor architecture comprising a multi-stage microprocessor pipeline and a multi-stage pre-fetch memory pipeline in parallel to at least a portion of the instruction pipeline, wherein the pre-fetch pipeline comprises a first stage having a set of registers serving as pointers to specific pre-fetch memory locations; a second stage having pre-fetch memory structures for storing predicted operand address information corresponding to operands in an un-decoded instruction in the microprocessor pipeline; and a third stage comprising at least one pre-fetch buffer, wherein said first, second and third stages respectively operate in parallel with, simultaneously with, and in isolation from corresponding stages of the microprocessor pipeline.
  • FIG. 1 is a block diagram illustrating a processor core in accordance with at least one exemplary embodiment of this invention;
  • FIG. 2 is a block diagram illustrating a portion of an instruction pipeline of a microprocessor core architecture employing an XY memory structure, and a typical multi-operand instruction processed by such an instruction pipeline, in accordance with a conventional non-speculative XY memory;
  • FIG. 3 is an exemplary instruction format for performing a multiply instruction on two operands and a memory write-back with a single instruction in accordance with at least one embodiment of this invention;
  • FIG. 4 is a block diagram illustrating a microprocessor instruction pipeline architecture including a parallel predictive pre-fetch XY memory pipeline in accordance with at least one embodiment of this invention;
  • FIG. 5 is a block diagram illustrating in greater detail the structure and operation of a predictively pre-fetching XY memory pipeline in accordance with at least one embodiment of this invention;
  • FIG. 6 is a block diagram illustrating the specific pre-fetch operations in an XY memory structure in accordance with at least one embodiment of this invention; and
  • FIG. 7 is a flow chart detailing the steps of a method for predictively pre-fetching instruction operand addresses in accordance with at least one embodiment of this invention.
  • FIG. 1 illustrates in block diagram form, an architecture for a microprocessor core 100 and peripheral hardware structure in accordance with at least one exemplary embodiment of this invention.
  • The align stage 120 formats the words coming from the fetch stage 110 into the appropriate instructions.
  • Instructions are fetched from memory in 32-bit words.
  • The entry at a given fetch address may contain an aligned 16-bit or 32-bit instruction, an unaligned 16-bit instruction preceded by a portion of a previous instruction, or an unaligned portion of a larger instruction preceded by a portion of a previous instruction, depending on the actual instruction address.
  • For example, a fetched word may have an instruction fetch address of 0x4, but an actual instruction address of 0x6.
  • The 32-bit word fetched from memory is passed to the align stage 120, where it is aligned into a complete instruction.
  • This alignment may include discarding superfluous 16-bit instructions or assembling unaligned 32-bit or larger instructions into a single instruction. After completely assembling the instruction, the N-bit instruction is forwarded to the decoder 130.
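The alignment case described above (fetch address 0x4, instruction address 0x6) can be sketched for a single 16-bit parcel; the big-endian packing, with the parcel at the lower address in the upper half of the word, is an assumption made for illustration only:

```python
# Sketch of the align stage's task: a 32-bit word fetched at an aligned
# address may hold the wanted 16-bit parcel at an offset.
def align(word32, fetch_addr, inst_addr):
    """Extract the 16-bit parcel at inst_addr from a word fetched at fetch_addr."""
    assert fetch_addr % 4 == 0 and inst_addr - fetch_addr in (0, 2)
    if inst_addr - fetch_addr == 2:
        return word32 & 0xFFFF          # discard the leading 16-bit parcel
    return (word32 >> 16) & 0xFFFF      # parcel starts at the fetch address

# fetch address 0x4, actual instruction address 0x6: the first 16 bits
# of the fetched word belong to a previous instruction and are dropped
parcel = align(0xABCD1234, 0x4, 0x6)
```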
  • An instruction extension interface 180 is also shown, which permits the interfacing of customized processor instructions used to complement the standard instruction set architecture of the microprocessor. Interfacing of these customized instructions occurs through a timing-registered interface to the various stages of the microprocessor pipeline 100 in order to minimize the effect of critical-path loading when attaching customized logic to a pre-existing processor pipeline.
  • A custom opcode slot is defined in the extension instruction interface for each specific custom instruction in order for the microprocessor to correctly acknowledge the presence of a custom instruction 182, as well as to extract the source operand addresses used to index the register file 142.
  • The custom instruction flag interface 184 allows the addition of custom instruction flags that are used by the microprocessor for conditional evaluation, using either the standard condition code evaluators or custom extension condition code evaluators 184, to determine whether the instruction is executed based upon the condition evaluation result of the execute stage (EXEC) 150.
  • A custom ALU interface 186 permits user-defined arithmetic and logical extension instructions, the results of which are selected in the result select stage (SEL) 160.
  • FIG. 2 is a block diagram illustrating a portion of an instruction pipeline of a microprocessor core architecture employing an XY memory structure, and a typical multi-operand instruction processed by such an instruction pipeline, in accordance with a conventional non-speculative XY memory.
  • XY-type data memory is known in the art.
  • In a RISC processor, typically only one memory load or store can be effected per pipelined instruction.
  • A Digital Signal Processor (DSP), by contrast, frequently needs to perform memory loads, an arithmetic operation and a result write-back within a single instruction; XY memory supports such compound instructions on a RISC pipeline.
  • FIG. 2 illustrates such an XY memory implementation.
  • An instruction is fetched from memory in the fetch stage 210 and, in the next clock cycle, is passed to the align stage 220.
  • In the align stage, the instruction is formatted into proper form. For example, if in the fetch stage 210 a 32-bit word is fetched from memory with the fetch address 0x4, but the actual instruction address is that of the 16-bit word at instruction address 0x6, the first 16 bits of the 32-bit word are discarded.
  • This properly formatted instruction is then passed to the decode stage 230 , where it is decoded into an actual instruction, for example, the decoded instruction 241 shown in FIG. 2 .
  • This decoded instruction is then passed to the register file stage 240 .
  • FIG. 2 illustrates the format of such a decoded instruction 241 .
  • The instruction is comprised of a name (an arbitrary name used to reference the instruction), the destination address pointer and update mode, the first source address pointer and update mode, and the second source address pointer and update mode.
  • In the register file stage 240, the addresses of the source and destination operands are selected from the decoded instruction 241 using the register numbers (windowing registers) as pointers to a set of address registers 242.
  • The source addresses are then used to access X memory 243 and Y memory 244.
  • Within the available cycle time, the address to use for the access must be selected, the memory access performed, and the selected data fed to the execution stage 250.
  • An alternative approach is to move the XY memory to an earlier stage of the instruction pipeline, ahead of the register file stage, to allow for more cycle time for the data selection. However, doing so may result in the complication that, when XY memory is moved into the decode stage, the windowing register number is not yet decoded before accessing memory.
  • Instead, in various embodiments, the source data is predictively pre-fetched and stored in data buffers for later use.
  • A comparison may then be made to check whether the desired data was already pre-fetched; if so, the data is simply taken from the pre-fetched data buffer and used. If it has not been pre-fetched, the instruction is stalled and the required data is fetched. In order to reduce the number of stalled instructions, it is essential to ensure that data is pre-fetched correctly most of the time. Two schemes may be used to assist in this function. Firstly, a predictable way of using windowing registers may be employed.
  • FIG. 3 illustrates the format of a compound instruction, such as an instruction that might be used in a DSP application that would require extendible processing functions including XY memory in accordance with various embodiments of this invention.
  • The compound instruction 300 consists of four sub-components: the name of the instruction 301, the destination pointer 302, the first operand pointer 303 and the second operand pointer 304.
  • The instruction Muldw is a dual 16-bit multiply instruction.
  • The destination pointer 302 specifies that the result of the calculation is to be written to X memory using the pointer address AX1.
  • The label u0 specifies the update mode.
  • The source operand pointers 303 and 304 specify that the first operand is to be read from X memory using the pointer address AX0 and updated using update mode u1, and that the second operand is to be read from Y memory using the pointer address AY0 and update mode u0.
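The four fields of the compound instruction can be pulled apart mechanically. The textual syntax `muldw ax1_u0, ax0_u1, ay0_u0` used below is an invented rendering of the FIG. 3 example, not the patent's actual encoding:

```python
# Parse the compound instruction's four sub-components: name,
# destination pointer + update mode, and two source pointers + modes.
def parse_compound(text):
    name, operands = text.split(maxsplit=1)
    fields = [f.strip() for f in operands.split(",")]
    dest, src1, src2 = (tuple(f.rsplit("_", 1)) for f in fields)
    return {"name": name, "dest": dest, "src1": src1, "src2": src2}

inst = parse_compound("muldw ax1_u0, ax0_u1, ay0_u0")
# inst["dest"] is ("ax1", "u0"): write the result via AX1, update mode u0
```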
  • FIG. 4 is a block diagram illustrating a microprocessor instruction pipeline architecture including a parallel predictive pre-fetch XY memory pipeline in accordance with at least one embodiment of this invention.
  • The instruction pipeline is comprised of seven stages: FCH 401, ALN 402, DEC 403, RF 404, EX 405, SEL 406 and WB 407.
  • Each rising pulse of the clock propagates an instruction to the next stage of the instruction pipeline.
  • The predictive pre-fetch XY memory pipeline is comprised of six stages: PF1 412, PF2 413, DSEL 414, P0 415, P1 416 and C 417.
  • Speculative pre-fetching may begin in stage PF1 412.
  • Pre-fetching does not have to begin at the same time as the instruction fetch in stage FCH 401.
  • Pre-fetching can happen much earlier, for example when a pointer is first set up, or the data may already have been fetched because it was recently used. Pre-fetching can also happen later if the pre-fetch was predicted incorrectly.
  • The two stages PF1 412 and PF2 413, occurring prior to the register file stage 404, allow sufficient time for the access address to be selected, the memory access to be performed, and the selected data to be fed to the execution stage 405.
  • FIG. 5 is a block diagram, illustrating in greater detail the structure and operation of a predictively pre-fetching XY memory pipeline in accordance with at least one embodiment of this invention.
  • Six pipeline stages of the predictive pre-fetch XY memory pipeline are illustrated.
  • These stages may include PF1 500, PF2 510, DSEL (data select) 520, P0 530, P1 540 and C 550.
  • Stage PF1 500, which occurs simultaneously with the align stage of the instruction pipeline, includes the pre-fetch shadow pointer address register file 502 and the X and Y address generators (used to update the pointer addresses) 504 and 506.
  • Stage PF2 510 includes access to the X memory unit 512 and Y memory unit 514, using the pointer addresses generated in stage PF1 500.
  • In stage DSEL 520, the data accessed from X memory 512 and Y memory 514 in stage PF2 510 are written to one of multiple pre-fetch buffers 522.
  • four pre-fetch buffers 522 are illustrated in FIG. 5 . In various embodiments, multiple queue-like pre-fetch buffers will be used.
  • each queue is associated to any pointer, but each pointer associated with at most one queue.
  • the pre-fetched data is reconciled with the pointer of the operands contained in the actual instruction forwarded from the decode stage. If the actual data have been pre-fetched, they are passed to the appropriate execute unit in the execute stage.
  • P 0 530 , P 1 540 and C 550 stages are used to continue to pass down the source address and destination address (destination address is selected in DSEL stage) so that when they reach the C 550 stage, they update the actual pointer address registers, and the destination address is also used for writing the results of execution (if required, as specified by the instruction) back to XY memory.
  • the address registers in PF 1 500 stage are only shadowing address registers which are predictively updated when required. These values only become committed at the C stage 550 .
  • Pre-fetch hazard detection performs the task of matching the addresses used in PF 1 500 and PF 2 510 stages to the destination addresses in DSEL 520 , P 0 530 , P 1 540 , and C 550 stage, so that if there is a write to a location in memory that is to be pre-fetched, the pre-fetch is stalled until, or restarted when, this Read after Write hazard has disappeared.
  • a pre-fetch hazard can also occur when there is a write to a location in memory that has already been prefetched and stored in the buffer in DSEL stage. In this case, the item in the buffer is flushed and refetched when the write operation is complete
  • FIG. 6 is a block diagram illustrating the specific structure of the pre-fetch logic in an XY memory structure in accordance with at least one embodiment of this invention.
  • speculative pre-fetch is performed by accessing a set of registers 610 that serve as pointers pointing to specific locations in the X and Y memories 614 and 612 .
  • the data is fetched from the XY memory and then on the next clock pulse, the speculative operand data and address information is stored in pre-fetch buffers 620 .
  • matching and select block 622 checks for the pre-fetched addresses. If the required operand addresses from the decoded instruction are in the pre-fetch buffers, they are selected and registered for use in the execution stage.
  • the pre-fetch buffers may be one, two, three or more deep such that a first in, first out storing scheme is used. When a data item is read out of one of the pre-fetch buffers 620 , it no longer resides in the buffer. The next data in the FIFO buffer automatically moves to the front of the queue.
  • FIG. 7 is a flow chart detailing the steps of a method for predictively pre-fetching instruction operand addresses in accordance with at least one embodiment of this invention.
  • the steps of a pre-fetch method as well as the steps of a typical instruction pipeline are illustrated in parallel.
  • the individual steps of the pre-fetch method may occur at the same time as, or even before, the corresponding steps of the instruction pipeline.
  • the steps of the pre-fetch process occur in isolation from the steps of the instruction pipeline method until matching and selection.
  • operation of the pre-fetch method begins in step 700 and proceeds to step 705, where a set of registers is accessed that serves as pointers to specific locations in the X and Y memory structures.
  • step 705 may occur simultaneously with a compound instruction entering the fetch stage of the microprocessor's instruction pipeline.
  • because the pre-fetch process is not based on any information in the instruction, it may occur before an instruction is fetched in step 707.
  • step 705 may occur after a compound instruction is pre-fetched but prior to decoding.
  • a compound instruction is one that performs multiple steps, such as, for example, a memory read, an arithmetic operation and a memory write.
  • in step 710, the X and Y memory structures are accessed at the locations specified by the pointers in the pre-fetch registers.
  • in step 715, the data read from the X and Y memory locations is written to pre-fetch buffers.
  • in step 720, the results of the pre-fetch method are matched with the actual decoded instruction in the matching and selection step. Matching and selection is performed to reconcile the addresses of the operands contained in the actual instruction forwarded from the decode stage of the instruction pipeline with the pre-fetched data in the pre-fetch buffers. If the pre-fetched data is correct, operation continues to the appropriate execute unit of the execute pipeline in step 725, depending upon the nature of the instruction, i.e., shift, add, etc. It should be appreciated that if the pre-fetched operand addresses are not correct, a pipeline flush will occur while the actual operands are fetched and injected into the pipeline. Operation of the pre-fetch method terminates after matching and selection.
  • steps 700-715 are performed in parallel with, and in isolation from, the processor pipeline operations 703-720, such that they do not affect or otherwise delay the processor pipeline operations of fetching, aligning, decoding, register file access or execution.
  • predictive pre-fetching is an effective means of taking advantage of the benefits of XY memory without impacting the instruction pipeline.
  • Processor clock frequency may be maintained at high speeds despite the use of XY memory.
  • the XY memory functionality is completely transparent to the applications. Normal instruction pipeline flow and branch prediction are completely unaffected by this XY memory functionality both when it is invoked and when it is not used.
  • the auxiliary unit of the execute branch provides an interface for applications to select this extendible functionality.
  • operands can be predictively pre-fetched with sufficient accuracy to outweigh the overhead associated with mispredictions and without any impact on the processor pipeline.
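For illustration only, the pre-fetch method of FIG. 7 can be sketched as a small Python simulation. All names, memory contents and the dictionary-based memories are invented for the example; this is a behavioral sketch of the steps (705, 710/715, 720, 725) described above, not the patent's hardware:

```python
# Hypothetical sketch of the FIG. 7 pre-fetch flow. Pointer registers are
# read (step 705), XY memory is accessed and buffered (steps 710/715), then
# the buffers are reconciled against the decoded instruction (step 720).

x_mem = {0: 11, 1: 22}
y_mem = {0: 33, 1: 44}
pointer_regs = {"AX0": 0, "AY0": 0}   # step 705: access the pointer registers

prefetch_buffers = []

def prefetch():
    # steps 710/715: read XY memory at the pointer addresses, buffer the data
    for name, addr in pointer_regs.items():
        mem = x_mem if name.startswith("AX") else y_mem
        prefetch_buffers.append((name, addr, mem[addr]))

def match_and_select(decoded_operands):
    # step 720: reconcile decoded operand pointers with pre-fetched entries
    selected = []
    for op in decoded_operands:
        hit = next((e for e in prefetch_buffers if e[0] == op), None)
        if hit is None:
            return None               # misprediction: stall and re-fetch
        selected.append(hit[2])
    return selected                   # step 725: feed the execute unit

prefetch()
print(match_and_select(["AX0", "AY0"]))   # → [11, 33]
```

A `None` result models the stall-and-refetch path taken on a misprediction; the pipeline itself is unaffected on a hit, which is the "zero impact" property claimed above.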

Abstract

A microprocessor architecture including a predictive pre-fetch XY memory pipeline in parallel with the processor's instruction pipeline for processing compound instructions with enhanced processor performance through predictive pre-fetch techniques. Instruction operands are predictively pre-fetched from X and Y memory based on the historical use of operands in instructions that target X and Y memory. After the compound instruction is decoded in the pipeline, the pre-fetched operand pointer, address and data are reconciled with the operands contained in the actual instruction. If the actual data has been pre-fetched, it is passed to the appropriate execute unit in the execute stage of the processor pipeline. As a result, if the prediction is correct, the data to use for access can be selected and the selected data fed to the execution stage without any additional processor overhead. This pre-fetch mechanism avoids the need to slow down the clock speed of the processor or insert stalls for each compound instruction when using XY memory.

Description

    CROSS REFERENCE TO RELATED APPLICATIONS
  • This application claims priority to provisional application No. 60/572,238 filed May 19, 2004, entitled “Microprocessor Architecture,” hereby incorporated by reference in its entirety.
  • FIELD OF THE INVENTION
  • This invention relates generally to microprocessor architecture and more specifically to systems and methods for achieving improved performance through a predictive data pre-fetch mechanism for a pipeline data memory, including specifically XY-type data memory.
  • BACKGROUND OF THE INVENTION
  • Multistage pipeline microprocessor architecture is known in the art. A typical microprocessor pipeline consists of several stages of instruction handling hardware, wherein each rising pulse of a clock signal propagates instructions one stage further in the pipeline. Although the clock speed dictates the number of pipeline propagations per second, the effective operational speed of the processor is dependent partially upon the rate that instructions and operands are transferred between memory and the processor. For this reason, processors typically employ one or more relatively small cache memories built directly into the processor. Cache memory typically is an on-chip random access memory (RAM) used to store a copy of memory data in anticipation of future use by the processor. Typically, the cache is positioned between the processor and the main memory to intercept calls from the processor to the main memory. Access to cache memory is generally much faster than off-chip RAM. When data is needed that has previously been accessed, it can be retrieved directly from the cache rather than from the relatively slower off-chip RAM.
  • Generally, the microprocessor pipeline advances instructions on each clock signal pulse to subsequent pipeline stages. However, effective pipeline performance can be slower than that implied by the processor speed. Therefore, simply increasing microprocessor clock speed does not usually provide a corresponding increase in system performance. Accordingly, there is a need for a microprocessor architecture that enhances effective system performance through methods in addition to increased clock speed.
  • One method of doing this has been to employ X and Y memory structures in parallel to the microprocessor pipeline. The ARCtangent-A4™ and ARCtangent-A5™ line of embedded microprocessors designed and licensed by ARC International, Inc. of Hertfordshire, UK, (ARC) employ such an XY memory structure. XY memory was designed to facilitate executing compound instructions on a RISC architecture processor without interrupting the pipeline. XY memory is typically located in parallel to the main processor pipeline, after the instruction decode stage, but prior to the execute stage. After decoding an instruction, source data is fetched from XY memory using address pointers. This source data is then fed to the execution stage. In the exemplary ARC XY architecture the two X and Y memory structures source two operands and receive results in the same cycle. Data in the XY memory is indexed via pointers from address generators and supplied to the ARC CPU pipeline for processing by any ARC instruction. The memories are software-programmable to provide 32-bit, 16-bit, or dual 16-bit data to the pipeline.
  • It should be appreciated that the description herein of various advantages and disadvantages associated with known apparatus, methods, and materials is not intended to limit the scope of the invention to their exclusion. Indeed, various embodiments of the invention may include one or more of the known apparatus, methods, and materials without suffering from their disadvantages.
  • As background to the techniques discussed herein, the following references are incorporated herein by reference: U.S. Pat. No. 6,862,563 issued Mar. 1, 2005 entitled “Method And Apparatus For Managing The Configuration And Functionality Of A Semiconductor Design” (Hakewill et al.); U.S. Ser. No. 10/423,745 filed Apr. 25, 2003, entitled “Apparatus and Method for Managing Integrated Circuit Designs”; and U.S. Ser. No. 10/651,560 filed Aug. 29, 2003, entitled “Improved Computerized Extension Apparatus and Methods”, all assigned to the assignee of the present invention.
  • SUMMARY OF THE INVENTION
  • Various embodiments of the invention may ameliorate or overcome one or more of the shortcomings of conventional microprocessor architecture through a predictively fetched XY memory scheme. In various embodiments, an XY memory structure is located in parallel to the instruction pipeline. In various embodiments, a speculative pre-fetching scheme is spread over several sections of the pipeline in order to maintain high processor clock speed. In order to prevent impact on clock speed, operands are speculatively pre-fetched from X and Y memory before the current instruction has even been decoded. In various exemplary embodiments, the speculative pre-fetching occurs in an alignment stage of the instruction pipeline. In various embodiments, speculative address calculation of operands also occurs in the alignment stage of the instruction pipeline. In various embodiments, the XY memory is accessed in the instruction decode stage based on the speculative address calculation of the pipeline, and the resolution of the predictive pre-fetching occurs in the register file stage of the pipeline. Because the actual decoded instruction is not available in the pipeline until after the decode stage, all pre-fetching is done without explicit knowledge of what the current instruction is while this instruction is being pushed out of the decode stage into the register file stage. Thus, in various embodiments, a comparison is made in the register file stage between the operands specified by the actual instruction and those predictively pre-fetched. The pre-fetched values that match are selected to be passed to the execute stage of the instruction pipeline. Therefore, in a microprocessor architecture employing such a scheme, data memory fetches, arithmetic operation and result write back can be performed using a single instruction without slowing down the instruction pipeline clock speed or stalling the pipeline, even at high processor clock frequencies.
  • At least one exemplary embodiment of the invention may provide a predictive pre-fetch XY memory pipeline for a microprocessor pipeline. The predictive pre-fetch XY memory pipeline according to this embodiment may comprise a first pre-fetch stage comprising a pre-fetch pointer address register file and X and Y address generators, a second pre-fetch stage comprising X and Y memory structures accessed using address pointers generated in the first pre-fetch stage, and a third data select stage comprising at least one pre-fetch buffer in which speculative operand data and address information are stored.
  • At least one additional exemplary embodiment may provide a method of predictively pre-fetching operand address and data information for an instruction pipeline of a microprocessor. The method of predictively pre-fetching operand address and data information for an instruction pipeline of a microprocessor according to this embodiment may comprise, prior to decoding a current instruction in the pipeline, accessing a set of registers containing pointers to specific locations in pre-fetch memory structures, fetching operand data information from the specific locations in the pre-fetch memory structures, and storing the pointer and operand data information in at least one pre-fetch buffer.
  • Yet another exemplary embodiment of this invention may provide a microprocessor architecture. The microprocessor architecture according to this embodiment may comprise a multi-stage microprocessor pipeline, and a multi-stage pre-fetch memory pipeline in parallel to at least a portion of the instruction pipeline, wherein the pre-fetch pipeline comprises a first stage having a set of registers serving as pointers to specific pre-fetch memory locations, a second stage having pre-fetch memory structures for storing predicted operand address information corresponding to operands in an un-decoded instruction in the microprocessor pipeline, and a third stage comprising at least one pre-fetch buffer, wherein said first, second and third stages are respectively parallel to, simultaneous with, and in isolation from corresponding stages of the microprocessor pipeline.
  • Other aspects and advantages of the invention will become apparent from the following detailed description, taken in conjunction with the accompanying drawings, illustrating by way of example the principles of the invention.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • FIG. 1 is a block diagram illustrating a processor core in accordance with at least one exemplary embodiment of this invention;
  • FIG. 2 is a block diagram illustrating a portion of an instruction pipeline of a microprocessor core architecture employing an XY memory structure and a typical multi-operand instruction processed by such an instruction pipeline in accordance with a conventional non-speculative XY memory;
  • FIG. 3 is an exemplary instruction format for performing a multiply instruction on 2 operands and a memory write back with a single instruction in accordance with at least one embodiment of this invention;
  • FIG. 4 is a block diagram illustrating a microprocessor instruction pipeline architecture including a parallel predictive pre-fetch XY memory pipeline in accordance with at least one embodiment of this invention;
  • FIG. 5 is a block diagram, illustrating in greater detail the structure and operation of a predictively pre-fetching XY memory pipeline in accordance with at least one embodiment of this invention;
  • FIG. 6 is a block diagram illustrating the specific pre-fetch operations in an XY memory structure in accordance with at least one embodiment of this invention; and
  • FIG. 7 is a flow chart detailing the steps of a method for predictively pre-fetching instruction operand addresses in accordance with at least one embodiment of this invention.
  • DETAILED DESCRIPTION OF THE DISCLOSURE
  • The following description is intended to convey a thorough understanding of the invention by providing specific embodiments and details involving various aspects of a new and useful microprocessor architecture. It is understood, however, that the invention is not limited to these specific embodiments and details, which are exemplary only. It further is understood that one possessing ordinary skill in the art, in light of known systems and methods, would appreciate the use of the invention for its intended purposes and benefits in any number of alternative embodiments, depending upon specific design and other needs.
  • Discussion of the invention will now be made by way of example in reference to the various drawing figures. FIG. 1 illustrates, in block diagram form, an architecture for a microprocessor core 100 and peripheral hardware structure in accordance with at least one exemplary embodiment of this invention. Several novel features will be apparent from FIG. 1 which distinguish the illustrated microprocessor architecture from that of a conventional microprocessor architecture. Firstly, the exemplary microprocessor architecture of FIG. 1 features a processor core 100 having a seven-stage instruction pipeline. However, it should be appreciated that additional pipeline stages may also be present. An align stage 120 is shown in FIG. 1 following the fetch stage 110. Because the microprocessor core 100 shown in FIG. 1 is operable to work with a variable bit-length instruction set, namely 16-bit, 32-bit, 48-bit or 64-bit instructions, the align stage 120 formats the words coming from the fetch stage 110 into the appropriate instructions. In various exemplary embodiments, instructions are fetched from memory in 32-bit words. Thus, when the fetch stage 110 fetches a 32-bit word at a specified fetch address, the entry at that fetch address may contain an aligned 16-bit or 32-bit instruction, an unaligned 16-bit instruction preceded by a portion of a previous instruction, or an unaligned portion of a larger instruction preceded by a portion of a previous instruction, depending on the actual instruction address. For example, a fetched word may have an instruction fetch address of 0x4, but an actual instruction address of 0x6. In various exemplary embodiments, the 32-bit word fetched from memory is passed to the align stage 120, where it is aligned into a complete instruction. In various exemplary embodiments, this alignment may include discarding superfluous 16-bit instructions or assembling unaligned 32-bit or larger instructions into a single instruction. 
After completely assembling the instruction, the N-bit instruction is forwarded to the decoder 130.
  • Still referring to FIG. 1, an instruction extension interface 180 is also shown, which permits the interfacing of customized processor instructions that complement the standard instruction set architecture of the microprocessor. Interfacing of these customized instructions occurs through a timing-registered interface to the various stages of the microprocessor pipeline 100 in order to minimize the effect of critical path loading when attaching customized logic to a pre-existing processor pipeline. Specifically, a custom opcode slot is defined in the extension instruction interface for the specific custom instruction in order for the microprocessor to correctly acknowledge the presence of a custom instruction 182, as well as to extract the source operand addresses that are used to index the register file 142. The custom instruction flag interface 184 allows the addition of custom instruction flags that are used by the microprocessor for conditional evaluation, using either the standard condition code evaluators or custom extension condition code evaluators 184, in order to determine whether the instruction is executed or not based upon the condition evaluation result of the execute stage (EXEC) 150. A custom ALU interface 186 permits user-defined arithmetic and logical extension instructions, the results of which are selected in the result select stage (SEL) 160.
  • Referring now to FIG. 2, a block diagram illustrating a portion of an instruction pipeline of a microprocessor core architecture employing an XY memory structure, and a typical multi-operand instruction processed by such an instruction pipeline in accordance with a conventional non-speculative XY memory, is illustrated. XY-type data memory is known in the art. Typically, in a RISC processor, only one memory load or store can be effected per pipelined instruction. However, in some cases, in order to accelerate pipeline efficiency, i.e., the number of operations executed per clock, it is desirable to have a single instruction perform multiple operations. For example, a single instruction could perform a memory read, an arithmetic operation and a memory write operation. The ability to decode and execute these kinds of compound instructions is particularly important for achieving high performance in Digital Signal Processor (DSP) operations. DSP operations typically involve repetitive calculations on large data sets, and thus high memory bandwidth is required. By using an XY memory structure, up to 2×32 bits of source data memory read access and 1×32 bits of destination data memory write access per clock cycle are possible, resulting in a very high data memory bandwidth. (For example, a 4.8 Gbyte/s memory bandwidth can be achieved based on three 32-bit accesses, two reads and one write, per instruction in a 400 MHz processor: 3×32 bits×400 MHz = 38.4 Gbit/s = 4.8 Gbyte/s.)
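The bandwidth figure quoted in the parenthetical above can be checked with a few lines of Python (the numbers are taken directly from the example in the text):

```python
# Worked version of the quoted bandwidth figure: three 32-bit accesses
# (two source reads, one destination write) per cycle at 400 MHz.
bits_per_access = 32
accesses_per_cycle = 3          # 2 reads + 1 write per compound instruction
clock_hz = 400e6

gbit_per_s = bits_per_access * accesses_per_cycle * clock_hz / 1e9
gbyte_per_s = gbit_per_s / 8
print(gbit_per_s, gbyte_per_s)  # → 38.4 4.8
```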
  • In a typical XY memory implementation, data used for XY memory is fetched from memory using addresses that are selected using register numbers decoded from the instruction in the decode stage. This data is then fed back to the execution units in the processor pipeline. FIG. 2 illustrates such an XY memory implementation. In FIG. 2, an instruction is fetched from memory in the fetch stage 210 and, in the next clock cycle, is passed to the align stage 220. In the align stage 220, the instruction is formatted into its proper form. For example, if in the fetch stage 210 a 32-bit word is fetched from memory with the fetch address 0x4, but the actual instruction address is for the 16-bit word having instruction address 0x6, the first 16 bits of the 32-bit word are discarded. This properly formatted instruction is then passed to the decode stage 230, where it is decoded into an actual instruction, for example, the decoded instruction 241 shown in FIG. 2. This decoded instruction is then passed to the register file stage 240.
  • FIG. 2 illustrates the format of such a decoded instruction 241. The instruction is comprised of a name (any arbitrary name used to reference the instruction), the destination address pointer and update mode, the first source address pointer and update mode, and the second source address pointer and update mode. In the register file stage 240, the addresses of the source and destination operands are selected from the decoded instruction 241 using the register numbers (windowing registers) as pointers to a set of address registers 242. The source addresses are then used to access X memory 243 and Y memory 244. Thus, between the decode stage 230 and the execute stage 250, the address to use for access needs to be selected, the memory access performed, and the selected data fed to the execution stage 250. As microprocessor clock speeds increase, it becomes difficult, if not impossible, to perform all these steps in a single clock cycle. As a result, either the processor clock frequency must be decreased to accommodate these extra steps, or multiple clock cycles must be used for each instruction using XY memory, both of which negate, or at least reduce, the benefits of using XY memory in the first place.
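Purely as an illustration, the serialized select-access-feed sequence of the conventional flow described above might be sketched as follows (register contents, addresses and data values are hypothetical):

```python
# Sketch of the conventional (non-speculative) flow of FIG. 2: the register
# number from the decoded instruction selects an address register, that
# address reads XY memory, and the data is fed to execute. These are three
# dependent steps squeezed between the decode and execute stages.

address_regs = [0x10, 0x14, 0x18, 0x1C]   # pointer (windowing) registers
x_mem = {0x10: 5, 0x14: 7}                # X memory contents (illustrative)

def fetch_operand(reg_number):
    addr = address_regs[reg_number]       # 1) select the access address
    data = x_mem[addr]                    # 2) perform the memory access
    return data                           # 3) feed the execute stage

print(fetch_operand(1))                   # → 7
```

At high clock frequencies these three dependent steps no longer fit in one cycle, which is the problem the predictive pre-fetch pipeline addresses.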
  • One method of solving this problem is extending the processor pipeline to add more pipeline stages between the decode and the execution stage. However, extra processor stages are undesirable for several reasons. Firstly, they complicate the architecture of the processor. Secondly, any penalties from incorrect predictions in the branch prediction stage will be increased. Finally, because XY memory functions may only be needed when certain applications are being run on the processor, extra pipeline stages will necessarily be present even when these applications are not being used.
  • An alternative approach is to move the XY memory to an earlier stage of the instruction pipeline, ahead of the register file stage, to allow for more cycle time for the data selection. However, doing so may result in the complication that, when XY memory is moved into the decode stage, the windowing register number is not yet decoded before accessing memory.
  • Therefore, in accordance with at least one embodiment of this invention, to overcome these problems, the source data is predictively pre-fetched and stored for use in data buffers. When the source data from X or Y memory is required, just before the execution stage, a comparison may be made to check if the desired data was already pre-fetched, and if so, the data is simply taken from the pre-fetched data buffer and used. If it has not been pre-fetched, then the instruction is stalled and the required data is fetched. In order to reduce the number of instructions that are stalled, it is essential to ensure that data is pre-fetched correctly most of the time. Two schemes may be used to assist in this function. Firstly, a predictable way of using windowing registers may be employed. For example, if the same set of N windowing registers are used most of the time, and each pointer address is incremented in a regular way (sequentially as selected by the windowing registers), then the next few data for each of these N windowing registers can be pre-fetched fairly accurately. This reduces the number of prediction failures.
  • Secondly, by having more prediction data buffers, more predictive fetches can be made in advance, reducing the chance of a prediction miss. Because compound instructions also include updating addresses, these addresses must also be predictively updated. In general, address updates are predictable as long as the user uses the same modifiers along with its associated non-modify mode in a sequence of code and the user sticks to a set of N pointers for an implementation with N pre-fetch data buffers. Since the data is pre-fetched, the pre-fetched data can become outdated due to write-backs to XY memory. In cases such as this, the specific pre-fetch buffer can be flushed and the out-of-date data re-fetched, or, alternatively, data forwarding can be performed to update these buffers.
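The predictability argument above, that a pointer advancing by a fixed modifier makes its next few operand addresses foreseeable, can be illustrated with a minimal Python sketch (the fixed-step update mode and all values are assumptions for the example):

```python
# If a pointer advances by a fixed modifier each time it is used, the next
# few addresses, and hence the next few operands, can be fetched ahead of
# time with high accuracy.

def predicted_addresses(base, step, depth):
    """Next `depth` addresses for a pointer with a fixed update modifier."""
    return [base + step * i for i in range(depth)]

print(predicted_addresses(0x20, 4, 3))   # → [32, 36, 40]
```

A deeper `depth` corresponds to having more prediction data buffers, as the text notes, at the cost of more state to flush on a misprediction or write-back.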
  • FIG. 3 illustrates the format of a compound instruction, such as an instruction that might be used in a DSP application that would require extendible processing functions including XY memory in accordance with various embodiments of this invention. The compound instruction 300 consists of four sub-components, the name of the instruction 301, the destination pointer 302, the first operand pointer 303 and the second operand pointer 304. In the instruction 300 shown in FIG. 3, the instruction, Muldw, is a dual 16-bit multiply instruction. The destination pointer 302 specifies that the result of the calculation instruction is to be written to X memory using the pointer address AX1. The label u0 specifies the update mode. This is a user defined address update mode and must be specified before calling the extendible function. The source operand pointers 303 and 304, specify that the first operand is to be read from X memory using the pointer address AX0 and updated using update mode u1 and the second operand is to be read from Y memory using the pointer address AY0 and the update mode u0.
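For illustration, the four sub-components of such a compound instruction might be pulled apart as in the following Python sketch. The textual encoding ("Muldw AX1_u0 AX0_u1 AY0_u0") is an invented stand-in for the binary instruction format, chosen only to mirror the fields of FIG. 3:

```python
# Hypothetical decomposition of a compound instruction into its four
# sub-components: mnemonic, destination pointer + update mode, and two
# source pointers + update modes.

def parse_compound(text):
    name, dst, src1, src2 = text.split()
    return {"name": name,
            "dest": dst.split("_"),    # e.g. pointer AX1, update mode u0
            "src1": src1.split("_"),
            "src2": src2.split("_")}

print(parse_compound("Muldw AX1_u0 AX0_u1 AY0_u0"))
```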
  • FIG. 4 is a block diagram illustrating a microprocessor instruction pipeline architecture including a parallel predictive pre-fetch XY memory pipeline in accordance with at least one embodiment of this invention. In the example illustrated in FIG. 4, the instruction pipeline is comprised of seven stages, FCH 401, ALN 402, DEC 403, RF 404, EX 405, SEL 406 and WB 407. As stated above, each rising pulse of the clock propagates an instruction to the next stage of the instruction pipeline. In parallel to the instruction pipeline is the predictive pre-fetch XY memory pipeline, comprised of six stages: PF1 412, PF2 413, DSEL 414, P0 415, P1 416 and C 417. It should be appreciated that various embodiments may utilize more or fewer pipeline stages. In various exemplary embodiments, speculative pre-fetching may begin in stage PF1 412. However, in various exemplary embodiments, pre-fetching does not have to begin at the same time as the instruction fetch 401. Pre-fetching can happen much earlier, for example, when a pointer is first set up, or the data may already have been fetched because it was recently used. Pre-fetching can also happen later if the pre-fetched data was predicted incorrectly. The two stages PF1 412 and PF2 413, prior to the register file stage 404, allow sufficient time for the access address to be selected, the memory access performed, and the selected data fed to the execution stage 405.
  • FIG. 5 is a block diagram illustrating in greater detail the structure and operation of a predictively pre-fetching XY memory pipeline in accordance with at least one embodiment of this invention. In FIG. 5, six pipeline stages of the predictive pre-fetch XY memory pipeline are illustrated. It should be appreciated that in various embodiments, more or fewer stages may be employed. As stated above in the context of FIG. 4, these stages may include PF1 500, PF2 510, DSEL (data select) 520, P0 530, P1 540 and C 550. Stage PF1 500, which occurs simultaneously with the align stage of the instruction pipeline, includes the pre-fetch shadow pointer address register file 502 and the X and Y address generators (used to update the pointer address) 504 and 506. Next, stage PF2 510 includes access to X memory unit 512 and Y memory unit 514, using the pointers 504 and 506 from stage PF1 500. In stage DSEL 520, the data accessed from X memory 512 and Y memory 514 in stage PF2 510 are written to one of multiple pre-fetch buffers 522. For purposes of example only, four pre-fetch buffers 522 are illustrated in FIG. 5. In various embodiments, multiple queue-like pre-fetch buffers will be used. It should be noted that a queue may be associated with any pointer, but each pointer is associated with at most one queue. In the DSEL stage 520, the pre-fetched data is reconciled with the pointers of the operands contained in the actual instruction forwarded from the decode stage. If the actual data have been pre-fetched, they are passed to the appropriate execute unit in the execute stage.
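The queue-to-pointer association rule stated above (any queue may serve a pointer, but each pointer maps to at most one queue) can be sketched in Python. The queue count of four follows FIG. 5; the assignment policy and all names are assumptions:

```python
# A queue may be associated with any pointer, but each pointer is
# associated with at most one queue at a time.

queue_of = {}                      # pointer name → queue index
queues = [[] for _ in range(4)]    # four pre-fetch queues, as in FIG. 5

def enqueue(pointer, data):
    if pointer not in queue_of:
        # bind the pointer to the first queue not already claimed
        free = next(i for i in range(len(queues))
                    if i not in queue_of.values())
        queue_of[pointer] = free
    queues[queue_of[pointer]].append(data)

enqueue("AX0", 5)
enqueue("AX0", 6)
enqueue("AY0", 9)
print(queue_of)                    # → {'AX0': 0, 'AY0': 1}
```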
  • The P0 530, P1 540 and C 550 stages continue to pass down the source address and destination address (the destination address is selected in the DSEL stage) so that when they reach the C 550 stage, they update the actual pointer address registers, and the destination address is also used for writing the results of execution (if required, as specified by the instruction) back to XY memory. The address registers in the PF1 500 stage are only shadow address registers, which are predictively updated when required. These values only become committed at the C stage 550. Pre-fetch hazard detection performs the task of matching the addresses used in the PF1 500 and PF2 510 stages against the destination addresses in the DSEL 520, P0 530, P1 540 and C 550 stages, so that if there is a write to a location in memory that is about to be pre-fetched, the pre-fetch is stalled until, or restarted when, this read-after-write hazard has disappeared. A pre-fetch hazard can also occur when there is a write to a location in memory that has already been pre-fetched and stored in the buffer in the DSEL stage. In this case, the item in the buffer is flushed and re-fetched when the write operation is complete.
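  • For purposes of illustration only, the hazard check described above may be sketched as an address comparison against the destination addresses still in flight. This is a minimal model, not the patented circuit; the address values and stage bookkeeping are assumptions of the sketch:

```python
# Illustrative sketch: a read-after-write hazard exists when a
# pre-fetch read address matches any destination address still in
# flight in the DSEL/P0/P1/C stages (i.e., not yet committed).

def raw_hazard(prefetch_addr, inflight_dest_addrs):
    """True if the location about to be pre-fetched will be written
    by an instruction that has not yet reached the commit (C) stage."""
    return prefetch_addr in inflight_dest_addrs

def step_prefetch(prefetch_addr, inflight_dest_addrs):
    """Stall the pre-fetch while the hazard persists; fetch otherwise."""
    return "stall" if raw_hazard(prefetch_addr, inflight_dest_addrs) else "fetch"
```

In this model the pre-fetch simply re-tries on a later cycle, once the conflicting write has committed and its destination address has left the in-flight set.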
  • FIG. 6 is a block diagram illustrating the specific structure of the pre-fetch logic in an XY memory structure in accordance with at least one embodiment of this invention. In various exemplary embodiments, in the PF1 stage 605, a speculative pre-fetch is performed by accessing a set of registers 610 that serve as pointers to specific locations in the X and Y memories 614 and 612. In the PF2 stage 602, the data is fetched from the XY memory, and then on the next clock pulse, the speculative operand data and address information is stored in pre-fetch buffers 620. While still in the DSEL stage, which corresponds with the processor's register file stage 603, the matching and select block 622 checks the pre-fetched addresses. If the required operand addresses from the decoded instruction are in the pre-fetch buffers, they are selected and registered for use in the execution stage. In various exemplary embodiments, the pre-fetch buffers may be one, two, three or more entries deep, such that a first-in, first-out (FIFO) storage scheme is used. When a data item is read out of one of the pre-fetch buffers 620, it no longer resides in the buffer. The next data item in the FIFO buffer automatically moves to the front of the queue.
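  • For purposes of illustration only, the FIFO behavior of a pre-fetch buffer may be sketched as follows. The buffer depth and the policy of dropping a push when full are assumptions of the sketch (a real design would more likely stall the pre-fetch pipeline):

```python
from collections import deque

class PrefetchBuffer:
    """Minimal sketch of one FIFO pre-fetch buffer: each entry pairs
    the pointer address used for the speculative access with the data
    that was read from XY memory at that address."""

    def __init__(self, depth=3):
        self.depth = depth
        self.entries = deque()

    def push(self, addr, data):
        """Store a speculative (address, data) pair; ignore the push
        when full (assumption: a real design would stall instead)."""
        if len(self.entries) < self.depth:
            self.entries.append((addr, data))

    def pop_front(self):
        """Reading removes the front entry; the next item automatically
        moves to the front of the queue, as described for FIG. 6."""
        return self.entries.popleft() if self.entries else None
```

A data item read out of the buffer no longer resides there, mirroring the behavior described above for the pre-fetch buffers 620.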
  • Referring now to FIG. 7, a flow chart detailing the steps of a method for predictively pre-fetching instruction operand addresses in accordance with at least one embodiment of this invention is depicted. In FIG. 7, the steps of the pre-fetch method and the steps of a typical instruction pipeline are illustrated in parallel. The individual steps of the pre-fetch method may occur at the same time as the corresponding instruction pipeline steps, or even before them.
  • Any correspondence between steps of the pre-fetch process and the instruction pipeline process implied by the figure is merely for ease of illustration. It should be appreciated that the steps of the pre-fetch method occur in isolation from the steps of the instruction pipeline method until matching and selection.
  • With continued reference to FIG. 7, operation of the pre-fetch method begins in step 700 and proceeds to step 705, where a set of registers is accessed that serve as pointers to specific locations in the X and Y memory structures. In various embodiments, step 705 may occur simultaneously with a compound instruction entering the fetch stage of the microprocessor's instruction pipeline. However, as noted herein, because the actual compound instruction has not yet been decoded, the pre-fetch process is not based on any information in the instruction; in various other embodiments, therefore, step 705 may occur before an instruction is fetched in step 707. Alternatively, step 705 may occur after a compound instruction is fetched but prior to decoding.
  • As used herein, a compound instruction is one that performs multiple steps, such as, for example, a memory read, an arithmetic operation and a memory write.
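  • For purposes of illustration only, a hypothetical compound instruction of the kind described above may be sketched in Python as a single operation that reads from X and Y memory via pointers, performs a multiply-accumulate, writes a result back, and post-increments the pointers. The instruction semantics shown are an assumption (a typical XY-memory DSP idiom), not the patent's instruction set:

```python
# Hypothetical compound instruction (illustration only): one "step"
# performs a memory read, an arithmetic operation and a memory write,
# with post-increment pointer updates.

def mac_compound(x_mem, y_mem, ax, ay, acc):
    """Multiply-accumulate through X/Y pointers, write back, update."""
    operand_x = x_mem[ax]            # memory read via X pointer
    operand_y = y_mem[ay]            # memory read via Y pointer
    acc += operand_x * operand_y     # arithmetic operation
    x_mem[ax] = acc                  # memory write back to X memory
    return acc, ax + 1, ay + 1      # post-increment pointer updates
```

Because the pointer update modes are regular, the next operand addresses for such an instruction can be predicted before the instruction itself is decoded, which is what makes the speculative pre-fetch described above effective.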
  • With continued reference to the method of FIG. 7, in step 710, the X and Y memory structures are accessed at locations specified by the pointers in the pre-fetch registers.
  • Operation of the method then goes to step 715 where the data read from the X and Y memory locations are written to pre-fetch buffers.
  • Next, in step 720, the results of the pre-fetch method are matched with the actual decoded instruction in the matching and selection step. Matching and selection is performed to reconcile the addresses of the operands contained in the actual instruction forwarded from the decode stage of the instruction pipeline with the pre-fetched data in the pre-fetch buffers. If the pre-fetched data is correct, operation continues to the appropriate execute unit of the execute pipeline in step 725, depending upon the nature of the instruction, i.e., shift, add, etc. It should be appreciated that if the pre-fetched operand addresses are not correct, a pipeline flush will occur while the actual operands are fetched and injected into the pipeline. Operation of the pre-fetch method terminates after matching and selection. It should be appreciated that if necessary, that is, if the instruction requires a write operation to XY memory, the results of execution are written back to XY memory. Furthermore, it should be appreciated that because steps 700-715 are performed in parallel with, and in isolation from, the processor pipeline operations 703-720, they do not affect or otherwise delay the processor pipeline operations of fetching, aligning, decoding, register file access or execution.
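  • For purposes of illustration only, the matching and selection step may be sketched as follows. Representing the pre-fetch buffer as a list of (address, data) pairs and on-demand memory as a dictionary are assumptions of the sketch, not structures recited by the claims:

```python
# Illustrative sketch of matching and selection: on a hit the
# pre-fetched data is forwarded toward the execute stage; on a miss
# the speculative entries are discarded and the operand is fetched
# on demand (modeling the pipeline flush described above).

def match_and_select(decoded_addr, prefetch_buffer, memory):
    """Reconcile the decoded operand address with the pre-fetch buffer."""
    for entry in list(prefetch_buffer):
        addr, data = entry
        if addr == decoded_addr:
            prefetch_buffer.remove(entry)   # consumed entry leaves buffer
            return data, "hit"
    prefetch_buffer.clear()                 # misprediction: flush buffer
    return memory[decoded_addr], "miss"     # fetch the actual operand
```

In the hit case the instruction pipeline sees no delay at all; only in the miss case is the pre-fetch overhead visible, which is why prediction accuracy must be high enough to outweigh misprediction cost.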
  • As stated above, when performing repetitive functions, such as DSP extension functions where data is repeatedly read from and written to XY memory, predictive pre-fetching is an effective means of taking advantage of the benefits of XY memory without impacting the instruction pipeline. Processor clock frequency may be maintained at high speeds despite the use of XY memory. Also, when applications being run on the microprocessor do not require XY memory, the XY memory functionality is completely transparent to the applications. Normal instruction pipeline flow and branch prediction are completely unaffected by this XY memory functionality both when it is invoked and when it is not used. The auxiliary unit of the execute branch provides an interface for applications to select this extendible functionality. Therefore, as a result of the above-described microprocessor architecture, with careful use of pointers and their associated update modes, operands can be predictively pre-fetched with sufficient accuracy to outweigh the overhead associated with mispredictions and without any impact on the processor pipeline.
  • It should be appreciated that, while the descriptors “X” and “Y” have been used throughout the specification, these terms are purely descriptive to the extent that they do not imply any specific structure. That is to say, any two-dimensional pre-fetch memory structure can be considered “XY memory.”
  • While the foregoing description includes many details and specificities, it is to be understood that these have been included for purposes of explanation only. The embodiments of the present invention are not to be limited in scope by the specific embodiments described herein. For example, although many of the embodiments disclosed herein have been described with reference to particular embodiments, the principles herein are equally applicable to microprocessors in general. Indeed, various modifications of the embodiments of the present inventions, in addition to those described herein, will be apparent to those of ordinary skill in the art from the foregoing description and accompanying drawings. Thus, such modifications are intended to fall within the scope of the following appended claims. Further, although the embodiments of the present inventions have been described herein in the context of a particular implementation in a particular environment for a particular purpose, those of ordinary skill in the art will recognize that its usefulness is not limited thereto and that the embodiments of the present inventions can be beneficially implemented in any number of environments for any number of purposes. Accordingly, the claims set forth below should be construed in view of the full breadth and spirit of the embodiments of the present inventions as disclosed herein.

Claims (20)

1. A microprocessor comprising:
a multistage instruction pipeline; and
a predictive pre-fetch memory pipeline comprising:
a first pre-fetch stage comprising a pre-fetch pointer address register file and memory address generators;
a second pre-fetch stage comprising pre-fetch memory structures accessed using address pointers generated in the first pre-fetch stage; and
a data select stage comprising at least one pre-fetch buffer in which predictive operand address and data information from the pre-fetch memory structures are stored.
2. The microprocessor of claim 1, wherein the pre-fetch memory structures comprise X and Y memory structures storing operand address data.
3. The microprocessor of claim 1, wherein the first and second pre-fetch stages and the data select stage occur in parallel to stages preceding an execute stage of the instruction pipeline.
4. The microprocessor of claim 1, wherein the instruction pipeline comprises align, decode and register file stages, and the first and second pre-fetch stages and the data select stage occur in parallel to the align, decode and register file stages, respectively.
5. The microprocessor of claim 1, wherein the predictive pre-fetch memory pipeline further comprises hardware logic in the data select stage adapted to reconcile actual operand address information contained in an actual decoded instruction with the predictive operand address information.
6. The microprocessor of claim 5, wherein the predictive pre-fetch memory pipeline further comprises hardware logic adapted to pass the predictive operand address information from the pre-fetch buffer to an execute stage of the instruction pipeline if the actual operand address information matches the predictive operand address information.
7. The microprocessor of claim 1, wherein the predictive pre-fetch memory pipeline further comprises a write back structure invoked after the execute stage and being adapted to write the results of execution back to XY memory if the instruction requires a write to at least one of the pre-fetch memory structures.
8. A method of predictively pre-fetching operand address and data information for an instruction pipeline of a microprocessor, the method comprising:
prior to decoding a current instruction in the instruction pipeline, accessing at least one register containing pointers to specific locations in pre-fetch memory structures;
fetching predictive operand data from the specific locations in the pre-fetch memory structures; and
storing the pointer and predictive operand data in at least one pre-fetch buffer.
9. The method according to claim 8, wherein accessing, fetching and storing occur in parallel to, simultaneous to and in isolation of the instruction pipeline.
10. The method according to claim 9, wherein accessing, fetching and storing occur, respectively, in parallel to align, decode and register file stages of the instruction pipeline.
11. The method according to claim 8, further comprising, after decoding the current instruction, reconciling actual operand data contained in the decoded current instruction with the predictive operand data.
12. The method according to claim 8, further comprising decoding the current instruction and passing the pre-fetched predictive operand data to an execute unit of the microprocessor pipeline if the pre-fetched predictive operand data matches actual operand data contained in the current instruction.
13. The method according to claim 8, wherein accessing, fetching and storing are performed on successive clock pulses of the microprocessor.
14. The method according to claim 8 further comprising, performing pre-fetch hazard detection.
15. The method according to claim 14, wherein performing pre-fetch hazard detection comprises at least one operation selected from the group consisting of: stalling pre-fetch operation or restarting pre-fetch operation when the read after write hazard has disappeared, if it is determined that there is a read after write hazard characterized by a memory write to a location in memory that is to be pre-fetched; and clearing the pre-fetch buffers if there is a write to a memory location previously pre-fetched.
16. A microprocessor comprising:
a multistage microprocessor pipeline; and
a multistage pre-fetch memory pipeline in parallel to at least a portion of the microprocessor pipeline, wherein the pre-fetch memory pipeline comprises:
a first stage having at least one register serving as pointers to specific pre-fetch memory locations;
a second stage, having pre-fetch memory structures for storing predicted operand address information corresponding to operands in a pre-decoded instruction in the microprocessor pipeline; and
a third stage comprising at least one pre-fetch buffer;
wherein said first, second and third stages, respectively, are parallel to, simultaneous to and in isolation of corresponding stages of the microprocessor pipeline.
17. The microprocessor according to claim 16, wherein the microprocessor pipeline comprises align, decode, and register file stages, and the first, second and third stages of the pre-fetch memory pipeline, respectively, are parallel to the align, decode and register file stages.
18. The microprocessor according to claim 16, further comprising hardware logic in the third stage adapted to reconcile operand address information contained in an actual instruction forwarded from a decode stage of the microprocessor pipeline with the predicted operand address information.
19. The microprocessor according to claim 16, further comprising circuitry adapted to pass the predicted operand address information from the pre-fetch buffer to an execute stage of the microprocessor pipeline if the operand pointer in the actual instruction matches the predicted operand address information.
20. The microprocessor according to claim 16, further comprising post-execute stage hardware logic adapted to write the results of execution back to pre-fetch memory if a decoded instruction specifies a write back to at least one pre-fetch memory structure.
US11/132,447 2004-05-19 2005-05-19 Microprocessor architecture including zero impact predictive data pre-fetch mechanism for pipeline data memory Abandoned US20050278505A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US11/132,447 US20050278505A1 (en) 2004-05-19 2005-05-19 Microprocessor architecture including zero impact predictive data pre-fetch mechanism for pipeline data memory

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US57223804P 2004-05-19 2004-05-19
US11/132,447 US20050278505A1 (en) 2004-05-19 2005-05-19 Microprocessor architecture including zero impact predictive data pre-fetch mechanism for pipeline data memory

Publications (1)

Publication Number Publication Date
US20050278505A1 true US20050278505A1 (en) 2005-12-15

Family

ID=35429033

Family Applications (7)

Application Number Title Priority Date Filing Date
US11/132,448 Abandoned US20050289323A1 (en) 2004-05-19 2005-05-19 Barrel shifter for a microprocessor
US11/132,424 Active 2031-02-12 US8719837B2 (en) 2004-05-19 2005-05-19 Microprocessor architecture having extendible logic
US11/132,447 Abandoned US20050278505A1 (en) 2004-05-19 2005-05-19 Microprocessor architecture including zero impact predictive data pre-fetch mechanism for pipeline data memory
US11/132,423 Abandoned US20050278513A1 (en) 2004-05-19 2005-05-19 Systems and methods of dynamic branch prediction in a microprocessor
US11/132,432 Abandoned US20050273559A1 (en) 2004-05-19 2005-05-19 Microprocessor architecture including unified cache debug unit
US11/132,428 Abandoned US20050278517A1 (en) 2004-05-19 2005-05-19 Systems and methods for performing branch prediction in a variable length instruction set microprocessor
US14/222,194 Active US9003422B2 (en) 2004-05-19 2014-03-21 Microprocessor architecture having extendible logic

Family Applications Before (2)

Application Number Title Priority Date Filing Date
US11/132,448 Abandoned US20050289323A1 (en) 2004-05-19 2005-05-19 Barrel shifter for a microprocessor
US11/132,424 Active 2031-02-12 US8719837B2 (en) 2004-05-19 2005-05-19 Microprocessor architecture having extendible logic

Family Applications After (4)

Application Number Title Priority Date Filing Date
US11/132,423 Abandoned US20050278513A1 (en) 2004-05-19 2005-05-19 Systems and methods of dynamic branch prediction in a microprocessor
US11/132,432 Abandoned US20050273559A1 (en) 2004-05-19 2005-05-19 Microprocessor architecture including unified cache debug unit
US11/132,428 Abandoned US20050278517A1 (en) 2004-05-19 2005-05-19 Systems and methods for performing branch prediction in a variable length instruction set microprocessor
US14/222,194 Active US9003422B2 (en) 2004-05-19 2014-03-21 Microprocessor architecture having extendible logic

Country Status (5)

Country Link
US (7) US20050289323A1 (en)
CN (1) CN101002169A (en)
GB (1) GB2428842A (en)
TW (1) TW200602974A (en)
WO (1) WO2005114441A2 (en)




Family Cites Families (189)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4342082A (en) 1977-01-13 1982-07-27 International Business Machines Corp. Program instruction mechanism for shortened recursive handling of interruptions
US4216539A (en) 1978-05-05 1980-08-05 Zehntel, Inc. In-circuit digital tester
US4400773A (en) 1980-12-31 1983-08-23 International Business Machines Corp. Independent handling of I/O interrupt requests and associated status information transfers
JPS63225822A (en) * 1986-08-11 1988-09-20 Toshiba Corp Barrel shifter
US4905178A (en) 1986-09-19 1990-02-27 Performance Semiconductor Corporation Fast shifter method and structure
JPS6398729A (en) 1986-10-15 1988-04-30 Fujitsu Ltd Barrel shifter
US4914622A (en) 1987-04-17 1990-04-03 Advanced Micro Devices, Inc. Array-organized bit map with a barrel shifter
DE3889812T2 (en) 1987-08-28 1994-12-15 Nec Corp Data processor with a test structure for multi-position shifters.
JPH01263820A (en) 1988-04-15 1989-10-20 Hitachi Ltd Microprocessor
EP0344347B1 (en) 1988-06-02 1993-12-29 Deutsche ITT Industries GmbH Digital signal processing unit
GB2229832B (en) 1989-03-30 1993-04-07 Intel Corp Byte swap instruction for memory format conversion within a microprocessor
EP0415648B1 (en) * 1989-08-31 1998-05-20 Canon Kabushiki Kaisha Image processing apparatus
JPH03185530A (en) * 1989-12-14 1991-08-13 Mitsubishi Electric Corp Data processor
JPH03248226A (en) 1990-02-26 1991-11-06 Nec Corp Microprocessor
JP2560889B2 (en) * 1990-05-22 1996-12-04 日本電気株式会社 Microprocessor
CA2045790A1 (en) * 1990-06-29 1991-12-30 Richard Lee Sites Branch prediction in high-performance processor
US5155843A (en) 1990-06-29 1992-10-13 Digital Equipment Corporation Error transition mode for multi-processor system
US5778423A (en) * 1990-06-29 1998-07-07 Digital Equipment Corporation Prefetch instruction for improving performance in reduced instruction set processor
JP2556612B2 (en) * 1990-08-29 1996-11-20 日本電気アイシーマイコンシステム株式会社 Barrel shifter circuit
US5636363A (en) * 1991-06-14 1997-06-03 Integrated Device Technology, Inc. Hardware control structure and method for off-chip monitoring entries of an on-chip cache
DE69229084T2 (en) * 1991-07-08 1999-10-21 Canon Kk Color imaging process, color image reader and color image processing apparatus
US5450586A (en) * 1991-08-14 1995-09-12 Hewlett-Packard Company System for analyzing and debugging embedded software through dynamic and interactive use of code markers
CA2073516A1 (en) 1991-11-27 1993-05-28 Peter Michael Kogge Dynamic multi-mode parallel processor array architecture computer system
US5485625A (en) 1992-06-29 1996-01-16 Ford Motor Company Method and apparatus for monitoring external events during a microprocessor's sleep mode
US5274770A (en) 1992-07-29 1993-12-28 Tritech Microelectronics International Pte Ltd. Flexible register-based I/O microcontroller with single cycle instruction execution
US5294928A (en) 1992-08-31 1994-03-15 Microchip Technology Incorporated A/D converter with zero power mode
US5333119A (en) 1992-09-30 1994-07-26 Regents Of The University Of Minnesota Digital signal processor with delayed-evaluation array multipliers and low-power memory addressing
US5542074A (en) 1992-10-22 1996-07-30 Maspar Computer Corporation Parallel processor system with highly flexible local control capability, including selective inversion of instruction signal and control of bit shift amount
GB2275119B (en) 1993-02-03 1997-05-14 Motorola Inc A cached processor
US5577217A (en) * 1993-05-14 1996-11-19 Intel Corporation Method and apparatus for a branch target buffer with shared branch pattern tables for associated branch predictions
JPH06332693A (en) 1993-05-27 1994-12-02 Hitachi Ltd Issuing system of suspending instruction with time-out function
US5454117A (en) 1993-08-25 1995-09-26 Nexgen, Inc. Configurable branch prediction for a processor performing speculative execution
US5584031A (en) 1993-11-09 1996-12-10 Motorola Inc. System and method for executing a low power delay instruction
US5590350A (en) 1993-11-30 1996-12-31 Texas Instruments Incorporated Three input arithmetic logic unit with mask generator
US6116768A (en) 1993-11-30 2000-09-12 Texas Instruments Incorporated Three input arithmetic logic unit with barrel rotator
US5509129A (en) 1993-11-30 1996-04-16 Guttag; Karl M. Long instruction word controlling plural independent processor operations
US5590351A (en) 1994-01-21 1996-12-31 Advanced Micro Devices, Inc. Superscalar execution unit for sequential instruction pointer updates and segment limit checks
TW253946B (en) * 1994-02-04 1995-08-11 Ibm Data processor with branch prediction and method of operation
US5517436A (en) 1994-06-07 1996-05-14 Andreas; David C. Digital signal processor for audio applications
US5809293A (en) * 1994-07-29 1998-09-15 International Business Machines Corporation System and method for program execution tracing within an integrated processor
US5566357A (en) 1994-10-06 1996-10-15 Qualcomm Incorporated Power reduction in a cellular radiotelephone
JPH08202469A (en) 1995-01-30 1996-08-09 Fujitsu Ltd Microcontroller unit equipped with universal asynchronous transmitting and receiving circuit
US5600674A (en) 1995-03-02 1997-02-04 Motorola Inc. Method and apparatus of an enhanced digital signal processor
US5655122A (en) 1995-04-05 1997-08-05 Sequent Computer Systems, Inc. Optimizing compiler with static prediction of branch probability, branch frequency and function frequency
US5835753A (en) 1995-04-12 1998-11-10 Advanced Micro Devices, Inc. Microprocessor with dynamically extendable pipeline stages and a classifying circuit
US5659752A (en) * 1995-06-30 1997-08-19 International Business Machines Corporation System and method for improving branch prediction in compiled program code
US5768602A (en) 1995-08-04 1998-06-16 Apple Computer, Inc. Sleep mode controller for power management
US5842004A (en) 1995-08-04 1998-11-24 Sun Microsystems, Inc. Method and apparatus for decompression of compressed geometric three-dimensional graphics data
US5727211A (en) * 1995-11-09 1998-03-10 Chromatic Research, Inc. System and method for fast context switching between tasks
US5774709A (en) 1995-12-06 1998-06-30 Lsi Logic Corporation Enhanced branch delay slot handling with single exception program counter
US5778438A (en) 1995-12-06 1998-07-07 Intel Corporation Method and apparatus for maintaining cache coherency in a computer system with a highly pipelined bus and multiple conflicting snoop requests
JP3663710B2 (en) * 1996-01-17 2005-06-22 ヤマハ株式会社 Program generation method and processor interrupt control method
US5896305A (en) 1996-02-08 1999-04-20 Texas Instruments Incorporated Shifter circuit for an arithmetic logic unit in a microprocessor
JPH09261490A (en) * 1996-03-22 1997-10-03 Minolta Co Ltd Image forming device
US5752014A (en) 1996-04-29 1998-05-12 International Business Machines Corporation Automatic selection of branch prediction methodology for subsequent branch instruction based on outcome of previous branch prediction
US5784636A (en) 1996-05-28 1998-07-21 National Semiconductor Corporation Reconfigurable computer architecture for use in signal processing applications
US20010025337A1 (en) 1996-06-10 2001-09-27 Frank Worrell Microprocessor including a mode detector for setting compression mode
US5826079A (en) 1996-07-05 1998-10-20 Ncr Corporation Method for improving the execution efficiency of frequently communicating processes utilizing affinity process scheduling by identifying and assigning the frequently communicating processes to the same processor
US5805876A (en) * 1996-09-30 1998-09-08 International Business Machines Corporation Method and system for reducing average branch resolution time and effective misprediction penalty in a processor
US5964884A (en) * 1996-09-30 1999-10-12 Advanced Micro Devices, Inc. Self-timed pulse control circuit
US5848264A (en) * 1996-10-25 1998-12-08 S3 Incorporated Debug and video queue for multi-processor chip
GB2320388B (en) 1996-11-29 1999-03-31 Sony Corp Image processing apparatus
US6061521A (en) 1996-12-02 2000-05-09 Compaq Computer Corp. Computer having multimedia operations executable as two distinct sets of operations within a single instruction cycle
US5909572A (en) 1996-12-02 1999-06-01 Compaq Computer Corp. System and method for conditionally moving an operand from a source register to a destination register
KR100236533B1 (en) 1997-01-16 2000-01-15 윤종용 Digital signal processor
EP0855718A1 (en) 1997-01-28 1998-07-29 Hewlett-Packard Company Memory low power mode control
US6154857A (en) 1997-04-08 2000-11-28 Advanced Micro Devices, Inc. Microprocessor-based device incorporating a cache for capturing software performance profiling data
US6185732B1 (en) 1997-04-08 2001-02-06 Advanced Micro Devices, Inc. Software debug port for a microprocessor
US6584525B1 (en) 1998-11-19 2003-06-24 Edwin E. Klingman Adaptation of standard microprocessor architectures via an interface to a configurable subsystem
US6021500A (en) 1997-05-07 2000-02-01 Intel Corporation Processor with sleep and deep sleep modes
US5950120A (en) 1997-06-17 1999-09-07 Lsi Logic Corporation Apparatus and method for shutdown of wireless communications mobile station with multiple clocks
US5931950A (en) 1997-06-17 1999-08-03 Pc-Tel, Inc. Wake-up-on-ring power conservation for host signal processing communication system
US6035374A (en) 1997-06-25 2000-03-07 Sun Microsystems, Inc. Method of executing coded instructions in a multiprocessor having shared execution resources including active, nap, and sleep states in accordance with cache miss latency
US6088786A (en) 1997-06-27 2000-07-11 Sun Microsystems, Inc. Method and system for coupling a stack based processor to register based functional unit
US5878264A (en) 1997-07-17 1999-03-02 Sun Microsystems, Inc. Power sequence controller with wakeup logic for enabling a wakeup interrupt handler procedure
US6760833B1 (en) 1997-08-01 2004-07-06 Micron Technology, Inc. Split embedded DRAM processor
US6226738B1 (en) 1997-08-01 2001-05-01 Micron Technology, Inc. Split embedded DRAM processor
US6026478A (en) 1997-08-01 2000-02-15 Micron Technology, Inc. Split embedded DRAM processor
JPH1185515A (en) * 1997-09-10 1999-03-30 Ricoh Co Ltd Microprocessor
JPH11143571A (en) 1997-11-05 1999-05-28 Mitsubishi Electric Corp Data processor
US6014743A (en) 1998-02-05 2000-01-11 Integrated Device Technology, Inc. Apparatus and method for recording a floating point error pointer in zero cycles
US6151672A (en) 1998-02-23 2000-11-21 Hewlett-Packard Company Methods and apparatus for reducing interference in a branch history table of a microprocessor
US6374349B2 (en) 1998-03-19 2002-04-16 Mcfarling Scott Branch predictor with serially connected predictor stages for improving branch prediction accuracy
US6289417B1 (en) 1998-05-18 2001-09-11 Arm Limited Operand supply to an execution unit
US6308279B1 (en) 1998-05-22 2001-10-23 Intel Corporation Method and apparatus for power mode transition in a multi-thread processor
JPH11353225A (en) 1998-05-26 1999-12-24 Internatl Business Mach Corp &lt;Ibm&gt; Memory accessed by a processor addressing in a Gray-code system in sequential execution style, and method for storing code and data in the memory
US6466333B2 (en) * 1998-06-26 2002-10-15 Canon Kabushiki Kaisha Streamlined tetrahedral interpolation
US20020053015A1 (en) 1998-07-14 2002-05-02 Sony Corporation And Sony Electronics Inc. Digital signal processor particularly suited for decoding digital audio
US6327651B1 (en) 1998-09-08 2001-12-04 International Business Machines Corporation Wide shifting in the vector permute unit
US6253287B1 (en) * 1998-09-09 2001-06-26 Advanced Micro Devices, Inc. Using three-dimensional storage to make variable-length instructions appear uniform in two dimensions
US6240521B1 (en) 1998-09-10 2001-05-29 International Business Machines Corp. Sleep mode transition between processors sharing an instruction set and an address space
US6347379B1 (en) 1998-09-25 2002-02-12 Intel Corporation Reducing power consumption of an electronic device
US6339822B1 (en) * 1998-10-02 2002-01-15 Advanced Micro Devices, Inc. Using padded instructions in a block-oriented cache
US6862563B1 (en) 1998-10-14 2005-03-01 Arc International Method and apparatus for managing the configuration and functionality of a semiconductor design
US6671743B1 (en) * 1998-11-13 2003-12-30 Creative Technology, Ltd. Method and system for exposing proprietary APIs in a privileged device driver to an application
DE69910826T2 (en) * 1998-11-20 2004-06-17 Altera Corp., San Jose COMPUTER SYSTEM WITH RECONFIGURABLE PROGRAMMABLE LOGIC DEVICE
US6189091B1 (en) 1998-12-02 2001-02-13 Ip First, L.L.C. Apparatus and method for speculatively updating global history and restoring same on branch misprediction detection
US6341348B1 (en) * 1998-12-03 2002-01-22 Sun Microsystems, Inc. Software branch prediction filtering for a microprocessor
US6957327B1 (en) 1998-12-31 2005-10-18 Stmicroelectronics, Inc. Block-based branch target buffer
US6826748B1 (en) * 1999-01-28 2004-11-30 Ati International Srl Profiling program execution into registers of a computer
US6477683B1 (en) 1999-02-05 2002-11-05 Tensilica, Inc. Automated processor generation system for designing a configurable processor and method for the same
US6418530B2 (en) * 1999-02-18 2002-07-09 Hewlett-Packard Company Hardware/software system for instruction profiling and trace selection using branch history information for branch predictions
US6499101B1 (en) * 1999-03-18 2002-12-24 I.P. First L.L.C. Static branch prediction mechanism for conditional branch instructions
US6427206B1 (en) * 1999-05-03 2002-07-30 Intel Corporation Optimized branch predictions for strongly predicted compiler branches
US6438700B1 (en) 1999-05-18 2002-08-20 Koninklijke Philips Electronics N.V. System and method to reduce power consumption in advanced RISC machine (ARM) based systems
US6772325B1 (en) 1999-10-01 2004-08-03 Hitachi, Ltd. Processor architecture and operation for exploiting improved branch control instruction
US6546481B1 (en) 1999-11-05 2003-04-08 Ip - First Llc Split history tables for branch prediction
US6571333B1 (en) 1999-11-05 2003-05-27 Intel Corporation Initializing a memory controller by executing software in second memory to wakeup a system
US6909744B2 (en) 1999-12-09 2005-06-21 Redrock Semiconductor, Inc. Processor architecture for compression and decompression of video and images
KR100395763B1 (en) * 2000-02-01 2003-08-25 삼성전자주식회사 A branch predictor for microprocessor having multiple processes
US6412038B1 (en) 2000-02-14 2002-06-25 Intel Corporation Integral modular cache for a processor
JP2001282548A (en) 2000-03-29 2001-10-12 Matsushita Electric Ind Co Ltd Communication equipment and communication method
US6519696B1 (en) 2000-03-30 2003-02-11 I.P. First, Llc Paired register exchange using renaming register map
US20030070013A1 (en) 2000-10-27 2003-04-10 Daniel Hansson Method and apparatus for reducing power consumption in a digital processor
US6948054B2 (en) * 2000-11-29 2005-09-20 Lsi Logic Corporation Simple branch prediction and misprediction recovery method
TW477954B (en) * 2000-12-05 2002-03-01 Faraday Tech Corp Memory data accessing architecture and method for a processor
US20020073301A1 (en) 2000-12-07 2002-06-13 International Business Machines Corporation Hardware for use with compiler generated branch information
US7139903B2 (en) * 2000-12-19 2006-11-21 Hewlett-Packard Development Company, L.P. Conflict free parallel read access to a bank interleaved branch predictor in a processor
US6877089B2 (en) 2000-12-27 2005-04-05 International Business Machines Corporation Branch prediction apparatus and process for restoring replaced branch history for use in future branch predictions for an executing program
US20020087851A1 (en) * 2000-12-28 2002-07-04 Matsushita Electric Industrial Co., Ltd. Microprocessor and an instruction converter
US8285976B2 (en) 2000-12-28 2012-10-09 Micron Technology, Inc. Method and apparatus for predicting branches using a meta predictor
US6925634B2 (en) * 2001-01-24 2005-08-02 Texas Instruments Incorporated Method for maintaining cache coherency in software in a shared memory system
US7039901B2 (en) * 2001-01-24 2006-05-02 Texas Instruments Incorporated Software shared memory bus
US6823447B2 (en) 2001-03-01 2004-11-23 International Business Machines Corporation Software hint to improve the branch target prediction accuracy
AU2002238325A1 (en) 2001-03-02 2002-09-19 Atsana Semiconductor Corp. Data processing apparatus and system and method for controlling memory access
JP3890910B2 (en) 2001-03-21 2007-03-07 株式会社日立製作所 Instruction execution result prediction device
US7010558B2 (en) 2001-04-19 2006-03-07 Arc International Data processor with enhanced instruction execution and method
US7165168B2 (en) 2003-01-14 2007-01-16 Ip-First, Llc Microprocessor with branch target address cache update queue
US20020194462A1 (en) * 2001-05-04 2002-12-19 Ip First Llc Apparatus and method for selecting one of multiple target addresses stored in a speculative branch target address cache per instruction cache line
US7200740B2 (en) 2001-05-04 2007-04-03 Ip-First, Llc Apparatus and method for speculatively performing a return instruction in a microprocessor
US6886093B2 (en) * 2001-05-04 2005-04-26 Ip-First, Llc Speculative hybrid branch direction predictor
US20020194461A1 (en) 2001-05-04 2002-12-19 Ip First Llc Speculative branch target address cache
US7165169B2 (en) * 2001-05-04 2007-01-16 Ip-First, Llc Speculative branch target address cache with selective override by secondary predictor based on branch instruction type
GB0112269D0 (en) 2001-05-21 2001-07-11 Micron Technology Inc Method and circuit for alignment of floating point significands in a simd array mpp
GB0112275D0 (en) 2001-05-21 2001-07-11 Micron Technology Inc Method and circuit for normalization of floating point significands in a simd array mpp
JP3805339B2 (en) * 2001-06-29 2006-08-02 コーニンクレッカ フィリップス エレクトロニクス エヌ ヴィ Method for predicting branch target, processor, and compiler
US7162619B2 (en) * 2001-07-03 2007-01-09 Ip-First, Llc Apparatus and method for densely packing a branch instruction predicted by a branch target address cache and associated target instructions into a byte-wide instruction buffer
US7010675B2 (en) 2001-07-27 2006-03-07 Stmicroelectronics, Inc. Fetch branch architecture for reducing branch penalty without branch prediction
US7191445B2 (en) 2001-08-31 2007-03-13 Texas Instruments Incorporated Method using embedded real-time analysis components with corresponding real-time operating system software objects
US6751331B2 (en) 2001-10-11 2004-06-15 United Global Sourcing Incorporated Communication headset
JP2003131902A (en) * 2001-10-24 2003-05-09 Toshiba Corp Software debugger, system-level debugger, debug method and debug program
US7051239B2 (en) 2001-12-28 2006-05-23 Hewlett-Packard Development Company, L.P. Method and apparatus for efficiently implementing trace and/or logic analysis mechanisms on a processor chip
US20030225998A1 (en) 2002-01-31 2003-12-04 Khan Mohammed Noshad Configurable data processor with multi-length instruction set architecture
US7168067B2 (en) * 2002-02-08 2007-01-23 Agere Systems Inc. Multiprocessor system with cache-based software breakpoints
US7181596B2 (en) 2002-02-12 2007-02-20 Ip-First, Llc Apparatus and method for extending a microprocessor instruction set
US7529912B2 (en) 2002-02-12 2009-05-05 Via Technologies, Inc. Apparatus and method for instruction-level specification of floating point format
US7315921B2 (en) 2002-02-19 2008-01-01 Ip-First, Llc Apparatus and method for selective memory attribute control
US7328328B2 (en) 2002-02-19 2008-02-05 Ip-First, Llc Non-temporal memory reference control mechanism
US7546446B2 (en) 2002-03-08 2009-06-09 Ip-First, Llc Selective interrupt suppression
US7395412B2 (en) 2002-03-08 2008-07-01 Ip-First, Llc Apparatus and method for extending data modes in a microprocessor
US7155598B2 (en) 2002-04-02 2006-12-26 Ip-First, Llc Apparatus and method for conditional instruction execution
US7185180B2 (en) 2002-04-02 2007-02-27 Ip-First, Llc Apparatus and method for selective control of condition code write back
US7373483B2 (en) 2002-04-02 2008-05-13 Ip-First, Llc Mechanism for extending the number of registers in a microprocessor
US7302551B2 (en) 2002-04-02 2007-11-27 Ip-First, Llc Suppression of store checking
US7380103B2 (en) 2002-04-02 2008-05-27 Ip-First, Llc Apparatus and method for selective control of results write back
US7380109B2 (en) 2002-04-15 2008-05-27 Ip-First, Llc Apparatus and method for providing extended address modes in an existing instruction set for a microprocessor
US20030204705A1 (en) * 2002-04-30 2003-10-30 Oldfield William H. Prediction of branch instructions in a data processing apparatus
KR100450753B1 (en) 2002-05-17 2004-10-01 한국전자통신연구원 Programmable variable length decoder including interface of CPU processor
US6938151B2 (en) * 2002-06-04 2005-08-30 International Business Machines Corporation Hybrid branch prediction using a global selection counter and a prediction method comparison table
US7493480B2 (en) 2002-07-18 2009-02-17 International Business Machines Corporation Method and apparatus for prefetching branch history information
US7000095B2 (en) 2002-09-06 2006-02-14 Mips Technologies, Inc. Method and apparatus for clearing hazards using jump instructions
US20050125634A1 (en) * 2002-10-04 2005-06-09 Fujitsu Limited Processor and instruction control method
US6968444B1 (en) 2002-11-04 2005-11-22 Advanced Micro Devices, Inc. Microprocessor employing a fixed position dispatch unit
US7266676B2 (en) 2003-03-21 2007-09-04 Analog Devices, Inc. Method and apparatus for branch prediction based on branch targets utilizing tag and data arrays
US20040193855A1 (en) * 2003-03-31 2004-09-30 Nicolas Kacevas System and method for branch prediction access
US7174444B2 (en) * 2003-03-31 2007-02-06 Intel Corporation Preventing a read of a next sequential chunk in branch prediction of a subject chunk
US7590829B2 (en) 2003-03-31 2009-09-15 Stretch, Inc. Extension adapter
US20040225870A1 (en) 2003-05-07 2004-11-11 Srinivasan Srikanth T. Method and apparatus for reducing wrong path execution in a speculative multi-threaded processor
US7010676B2 (en) * 2003-05-12 2006-03-07 International Business Machines Corporation Last iteration loop branch prediction upon counter threshold and resolution upon counter one
US20040255104A1 (en) * 2003-06-12 2004-12-16 Intel Corporation Method and apparatus for recycling candidate branch outcomes after a wrong-path execution in a superscalar processor
US7668897B2 (en) 2003-06-16 2010-02-23 Arm Limited Result partitioning within SIMD data processing systems
US7783871B2 (en) * 2003-06-30 2010-08-24 Intel Corporation Method to remove stale branch predictions for an instruction prior to execution within a microprocessor
US7373642B2 (en) 2003-07-29 2008-05-13 Stretch, Inc. Defining instruction extensions in a standard programming language
US20050027974A1 (en) * 2003-07-31 2005-02-03 Oded Lempel Method and system for conserving resources in an instruction pipeline
US7133950B2 (en) 2003-08-19 2006-11-07 Sun Microsystems, Inc. Request arbitration in multi-core processor
JP2005078234A (en) * 2003-08-29 2005-03-24 Renesas Technology Corp Information processor
US7237098B2 (en) * 2003-09-08 2007-06-26 Ip-First, Llc Apparatus and method for selectively overriding return stack prediction in response to detection of non-standard return sequence
US20050066305A1 (en) * 2003-09-22 2005-03-24 Lisanke Robert John Method and machine for efficient simulation of digital hardware within a software development environment
KR100980076B1 (en) * 2003-10-24 2010-09-06 삼성전자주식회사 System and method for branch prediction with low-power consumption
US7363544B2 (en) 2003-10-30 2008-04-22 International Business Machines Corporation Program debug method and apparatus
US7219207B2 (en) 2003-12-03 2007-05-15 Intel Corporation Reconfigurable trace cache
US8069336B2 (en) 2003-12-03 2011-11-29 Globalfoundries Inc. Transitioning from instruction cache to trace cache on label boundaries
US7293164B2 (en) 2004-01-14 2007-11-06 International Business Machines Corporation Autonomic method and apparatus for counting branch instructions to generate branch statistics meant to improve branch predictions
US8607209B2 (en) 2004-02-04 2013-12-10 Bluerisc Inc. Energy-focused compiler-assisted branch prediction
US20050216713A1 (en) * 2004-03-25 2005-09-29 International Business Machines Corporation Instruction text controlled selectively stated branches for prediction via a branch target buffer
US7281120B2 (en) 2004-03-26 2007-10-09 International Business Machines Corporation Apparatus and method for decreasing the latency between an instruction cache and a pipeline processor
US20050223202A1 (en) * 2004-03-31 2005-10-06 Intel Corporation Branch prediction in a pipelined processor
US20060015706A1 (en) * 2004-06-30 2006-01-19 Chunrong Lai TLB correlated branch predictor and method for use thereof
TWI305323B (en) * 2004-08-23 2009-01-11 Faraday Tech Corp Method for verifying branch prediction mechanisms and readable recording medium storing a program thereof

Patent Citations (38)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4594659A (en) * 1982-10-13 1986-06-10 Honeywell Information Systems Inc. Method and apparatus for prefetching instructions for a central execution pipeline unit
US5148532A (en) * 1987-12-25 1992-09-15 Hitachi, Ltd. Pipeline processor with prefetch circuit
US4926323A (en) * 1988-03-03 1990-05-15 Advanced Micro Devices, Inc. Streamlined instruction processor
US5317701A (en) * 1990-01-02 1994-05-31 Motorola, Inc. Method for refilling instruction queue by reading predetermined number of instruction words comprising one or more instructions and determining the actual number of instruction words used
US6948052B2 (en) * 1991-07-08 2005-09-20 Seiko Epson Corporation High-performance, superscalar-based computer system with out-of-order instruction execution
US5493687A (en) * 1991-07-08 1996-02-20 Seiko Epson Corporation RISC microprocessor architecture implementing multiple typed register sets
US5423011A (en) * 1992-06-11 1995-06-06 International Business Machines Corporation Apparatus for initializing branch prediction information
US5696958A (en) * 1993-01-11 1997-12-09 Silicon Graphics, Inc. Method and apparatus for reducing delays following the execution of a branch instruction in an instruction pipeline
US5642500A (en) * 1993-11-26 1997-06-24 Fujitsu Limited Method and apparatus for controlling instruction in pipeline processor
US6038649A (en) * 1994-03-14 2000-03-14 Texas Instruments Incorporated Address generating circuit for block repeat addressing for a pipelined processor
US5530825A (en) * 1994-04-15 1996-06-25 Motorola, Inc. Data processor with branch target address cache and method of operation
US5692168A (en) * 1994-10-18 1997-11-25 Cyrix Corporation Prefetch buffer using flow control bit to identify changes of flow within the code stream
US5920711A (en) * 1995-06-02 1999-07-06 Synopsys, Inc. System for frame-based protocol, graphical capture, synthesis, analysis, and simulation
US6292879B1 (en) * 1995-10-25 2001-09-18 Anthony S. Fong Method and apparatus to specify access control list and cache enabling and cache coherency requirement enabling on individual operands of an instruction of a computer
US5996071A (en) * 1995-12-15 1999-11-30 Via-Cyrix, Inc. Detecting self-modifying code in a pipelined processor with branch processing by comparing latched store address to subsequent target address
US5909566A (en) * 1996-12-31 1999-06-01 Texas Instruments Incorporated Microprocessor circuits, systems, and methods for speculatively executing an instruction using its most recently used data while concurrently prefetching data for the instruction
US5808876A (en) * 1997-06-20 1998-09-15 International Business Machines Corporation Multi-function power distribution system
US20040068643A1 (en) * 1997-08-01 2004-04-08 Dowling Eric M. Method and apparatus for high performance branching in pipelined microsystems
US6157988A (en) * 1997-08-01 2000-12-05 Micron Technology, Inc. Method and apparatus for high performance branching in pipelined microsystems
US5978909A (en) * 1997-11-26 1999-11-02 Intel Corporation System for speculative branch target prediction having a dynamic prediction history buffer and a static prediction history buffer
US6044458A (en) * 1997-12-12 2000-03-28 Motorola, Inc. System for monitoring program flow utilizing fixwords stored sequentially to opcodes
US6560754B1 (en) * 1999-05-13 2003-05-06 Arc International Plc Method and apparatus for jump control in a pipelined processor
US6622240B1 (en) * 1999-06-18 2003-09-16 Intrinsity, Inc. Method and apparatus for pre-branch instruction
US6550056B1 (en) * 1999-07-19 2003-04-15 Mitsubishi Denki Kabushiki Kaisha Source level debugger for debugging source programs
US6609194B1 (en) * 1999-11-12 2003-08-19 Ip-First, Llc Apparatus for performing branch target address calculation based on branch type
US6681295B1 (en) * 2000-08-31 2004-01-20 Hewlett-Packard Development Company, L.P. Fast lane prefetching
US6718460B1 (en) * 2000-09-05 2004-04-06 Sun Microsystems, Inc. Mechanism for error handling in a computer system
US6963554B1 (en) * 2000-12-27 2005-11-08 National Semiconductor Corporation Microwire dynamic sequencer pipeline stall
US6823444B1 (en) * 2001-07-03 2004-11-23 Ip-First, Llc Apparatus and method for selectively accessing disparate instruction buffer stages based on branch target address cache hit and instruction stage wrap
US6718504B1 (en) * 2002-06-05 2004-04-06 Arc International Method and apparatus for implementing a data processor adapted for turbo decoding
US6774832B1 (en) * 2003-03-25 2004-08-10 Raytheon Company Multi-bit output DDS with real time delta sigma modulation look up from memory
US20050138607A1 (en) * 2003-12-18 2005-06-23 John Lu Software-implemented grouping techniques for use in a superscalar data processing system
US20050204121A1 (en) * 2004-03-12 2005-09-15 Arm Limited Prefetching exception vectors
US20050273559A1 (en) * 2004-05-19 2005-12-08 Aris Aristodemou Microprocessor architecture including unified cache debug unit
US20050278513A1 (en) * 2004-05-19 2005-12-15 Aris Aristodemou Systems and methods of dynamic branch prediction in a microprocessor
US20050278517A1 (en) * 2004-05-19 2005-12-15 Kar-Lik Wong Systems and methods for performing branch prediction in a variable length instruction set microprocessor
US20050289321A1 (en) * 2004-05-19 2005-12-29 James Hakewill Microprocessor architecture having extendible logic
US20050289323A1 (en) * 2004-05-19 2005-12-29 Kar-Lik Wong Barrel shifter for a microprocessor

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050289323A1 (en) * 2004-05-19 2005-12-29 Kar-Lik Wong Barrel shifter for a microprocessor
US8719837B2 (en) 2004-05-19 2014-05-06 Synopsys, Inc. Microprocessor architecture having extendible logic
US9003422B2 (en) 2004-05-19 2015-04-07 Synopsys, Inc. Microprocessor architecture having extendible logic
US7971042B2 (en) 2005-09-28 2011-06-28 Synopsys, Inc. Microprocessor system and method for instruction-initiated recording and execution of instruction sequences in a dynamically decoupleable extended instruction pipeline
US20090198905A1 (en) * 2008-02-01 2009-08-06 Arimilli Ravi K Techniques for Prediction-Based Indirect Data Prefetching
US20090198948A1 (en) * 2008-02-01 2009-08-06 Arimilli Ravi K Techniques for Data Prefetching Using Indirect Addressing
US8166277B2 (en) 2008-02-01 2012-04-24 International Business Machines Corporation Data prefetching using indirect addressing
US8209488B2 (en) 2008-02-01 2012-06-26 International Business Machines Corporation Techniques for prediction-based indirect data prefetching
US20190012177A1 (en) * 2017-07-04 2019-01-10 Arm Limited Apparatus and method for controlling use of a register cache
US10732980B2 (en) * 2017-07-04 2020-08-04 Arm Limited Apparatus and method for controlling use of a register cache
US11243880B1 (en) * 2017-09-15 2022-02-08 Groq, Inc. Processor architecture
US11263129B1 (en) 2017-09-15 2022-03-01 Groq, Inc. Processor architecture
US11360934B1 (en) 2017-09-15 2022-06-14 Groq, Inc. Tensor streaming processor architecture
US11645226B1 (en) 2017-09-15 2023-05-09 Groq, Inc. Compiler operations for tensor streaming processor
US11822510B1 (en) 2017-09-15 2023-11-21 Groq, Inc. Instruction format and instruction set architecture for tensor streaming processor
US11868250B1 (en) 2017-09-15 2024-01-09 Groq, Inc. Memory design for a processor
US11875874B2 (en) 2017-09-15 2024-01-16 Groq, Inc. Data structures with multiple read ports
US11868908B2 (en) 2017-09-21 2024-01-09 Groq, Inc. Processor compiler for scheduling instructions to reduce execution delay due to dependencies
US11809514B2 (en) 2018-11-19 2023-11-07 Groq, Inc. Expanded kernel generation
US11868804B1 (en) 2019-11-18 2024-01-09 Groq, Inc. Processor instruction dispatch configuration
US11392535B2 (en) 2019-11-26 2022-07-19 Groq, Inc. Loading operands and outputting results from a multi-dimensional array using only a single side

Also Published As

Publication number Publication date
CN101002169A (en) 2007-07-18
US20050273559A1 (en) 2005-12-08
WO2005114441A2 (en) 2005-12-01
TW200602974A (en) 2006-01-16
US9003422B2 (en) 2015-04-07
US20050289321A1 (en) 2005-12-29
US20140208087A1 (en) 2014-07-24
GB2428842A (en) 2007-02-07
WO2005114441A3 (en) 2007-01-18
US8719837B2 (en) 2014-05-06
US20050278517A1 (en) 2005-12-15
US20050278513A1 (en) 2005-12-15
GB0622477D0 (en) 2006-12-20
US20050289323A1 (en) 2005-12-29

Similar Documents

Publication Publication Date Title
US20050278505A1 (en) Microprocessor architecture including zero impact predictive data pre-fetch mechanism for pipeline data memory
US8069336B2 (en) Transitioning from instruction cache to trace cache on label boundaries
US7836287B2 (en) Reducing the fetch time of target instructions of a predicted taken branch instruction
JP3182740B2 (en) A method and system for fetching non-consecutive instructions in a single clock cycle.
US6880073B2 (en) Speculative execution of instructions and processes before completion of preceding barrier operations
US7257699B2 (en) Selective execution of deferred instructions in a processor that supports speculative execution
US7444501B2 (en) Methods and apparatus for recognizing a subroutine call
JP2002525742A (en) Mechanism for transfer from storage to load
US7877586B2 (en) Branch target address cache selectively applying a delayed hit
US20090049286A1 (en) Data processing system, processor and method of data processing having improved branch target address cache
EP1849061A2 (en) Unaligned memory access prediction
US6260134B1 (en) Fixed shift amount variable length instruction stream pre-decoding for start byte determination based on prefix indicating length vector presuming potential start byte
US7257700B2 (en) Avoiding register RAW hazards when returning from speculative execution
US7143269B2 (en) Apparatus and method for killing an instruction after loading the instruction into an instruction queue in a pipelined microprocessor
EP3171264A1 (en) System and method of speculative parallel execution of cache line unaligned load instructions
JP2003515214A (en) Method and apparatus for performing calculations with narrow operands
CN106557304B (en) Instruction fetch unit for predicting the target of a subroutine return instruction
US5946468A (en) Reorder buffer having an improved future file for storing speculative instruction execution results
US5915110A (en) Branch misprediction recovery in a reorder buffer having a future file
US7865705B2 (en) Branch target address cache including address type tag bit
JP3683439B2 (en) Information processing apparatus and method for suppressing branch prediction
US20090198985A1 (en) Data processing system, processor and method of data processing having branch target address cache with hashed indices
US20050144427A1 (en) Processor including branch prediction mechanism for far jump and far call instructions
US6219784B1 (en) Processor with N adders for parallel target addresses calculation

Legal Events

Date Code Title Description
AS Assignment
Owner name: ARC INTERNATIONAL, UNITED KINGDOM
Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:LIM, SEOW CHUAN;WONG, KAR-LIK;REEL/FRAME:016933/0023
Effective date: 20050721
STCB Information on status: application discontinuation
Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION