CA1323937C - Virtual instruction cache refill algorithm - Google Patents

Virtual instruction cache refill algorithm

Info

Publication number
CA1323937C
Authority
CA
Canada
Prior art keywords
instruction
bytes
buffer
decoder
stream
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CA000607160A
Other languages
French (fr)
Inventor
David B. Fite
Michael M. Mckeon
Ricky C. Hetherington
John E. Murray
Dwight P. Manley
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Digital Equipment Corp
Original Assignee
Digital Equipment Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30145 Instruction analysis, e.g. decoding, instruction word fields
    • G06F9/30149 Instruction analysis, e.g. decoding, instruction word fields of variable length instructions
    • G06F9/38 Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3802 Instruction prefetching
    • G06F9/3816 Instruction alignment, e.g. cache line crossing

Abstract

An instruction buffer of a high speed digital computer controls the flow of instruction stream to an instruction decoder. The buffer provides the decoder with nine bytes of sequential instruction stream. The instruction set used by the computer is of the variable length type, such that the decoder consumes a variable number of the instruction stream bytes, depending upon the type of instruction being decoded. As each instruction is consumed, a shifter removes the consumed bytes and repositions the remaining bytes into the lowest order positions. The byte positions left empty by the shifter are filled by instruction stream retrieved from one of a pair of prefetch buffers (IBEX, IBEX2) or from a virtual instruction cache. These prefetch buffers are arranged to hold the next two subsequent quadwords of instruction stream and provide the desired missing bytes.
The IBEX prefetch buffer is filled from the instruction cache after being emptied, but prior to those particular bytes being requested to fill the instruction decoder.
This two level prefetching allows the relatively slow process of cache access to be performed during noncritical time. The instruction decoder is not stalled, waiting for a cache refill, but can ordinarily obtain the desired bytes of instruction stream from the prefetch buffer.

Description

VIRTUAL INSTRUCTION CACHE REFILL ALGORITHM

The present application discloses certain aspects of a computing system that is further described in the following Canadian patent applications: Evans et al., AN INTERFACE BETWEEN A SYSTEM CONTROL UNIT AND A SERVICE PROCESSING UNIT OF A DIGITAL COMPUTER, Serial No. 604,515, filed 30 June 1989; Arnold et al., METHOD AND APPARATUS FOR INTERFACING A SYSTEM CONTROL UNIT FOR A MULTIPROCESSOR SYSTEM WITH THE CENTRAL PROCESSING UNITS, Serial No. 604,514, filed 30 June 1989; Gagliardo et al., METHOD AND MEANS FOR INTERFACING A SYSTEM CONTROL UNIT FOR A MULTI-PROCESSOR SYSTEM WITH THE SYSTEM MAIN MEMORY, Serial No. 604,068, filed 27 June 1989; D. Fite et al., METHOD AND APPARATUS FOR RESOLVING A VARIABLE NUMBER OF POTENTIAL MEMORY ACCESS CONFLICTS IN A PIPELINED COMPUTER SYSTEM, Serial No. 603,222, filed 19 June 1989; D. Fite et al., DECODING MULTIPLE SPECIFIERS IN A VARIABLE LENGTH INSTRUCTION ARCHITECTURE, Serial No. 605,969, filed 18 July 1989; Murray et al., PIPELINE PROCESSING OF REGISTER AND REGISTER MODIFYING SPECIFIERS WITHIN THE SAME INSTRUCTION, Serial No. 2,009,163, filed 2 Feb. 1990; Murray et al., MULTIPLE INSTRUCTION PREPROCESSING SYSTEM WITH DATA DEPENDENCY RESOLUTION FOR DIGITAL COMPUTERS, Serial No. 2,008,238, filed 22 Jan. 1990; Murray et al., PREPROCESSING IMPLIED SPECIFIERS IN A PIPELINED PROCESSOR, Serial No. 607,178, filed 1 Aug. 1989; D. Fite et al., BRANCH PREDICTION, Serial No. 607,982, filed 10 Aug. 1989; Fossum et al., PIPELINED FLOATING POINT ADDER FOR DIGITAL COMPUTER, Serial No. 611,711, filed 18 Sep. 1989; Grundmann et al., SELF TIMED REGISTER FILE, Serial No. 611,061, filed 12 Sep. 1989; Beaven et al., METHOD AND APPARATUS FOR DETECTING AND CORRECTING ERRORS IN A PIPELINED COMPUTER SYSTEM, Serial No. 609,638, filed 29 Aug. 1989; Flynn et al., METHOD AND MEANS FOR ARBITRATING COMMUNICATION REQUESTS USING A SYSTEM CONTROL UNIT IN A MULTI-PROCESSOR SYSTEM, Serial No. 610,688, filed 8 Sep. 1989; E. Fite et al., CONTROL OF MULTIPLE FUNCTION UNITS WITH PARALLEL OPERATION IN A MICROCODED EXECUTION UNIT, Serial No. 605,958, filed 18 July 1989; Webb, Jr. et al., PROCESSING OF MEMORY ACCESS EXCEPTIONS WITH PRE-FETCHED INSTRUCTIONS WITHIN THE INSTRUCTION PIPELINE OF A VIRTUAL MEMORY SYSTEM-BASED DIGITAL COMPUTER, Serial No. 611,918, filed 19 Sep. 1989; Hetherington et al., METHOD AND APPARATUS FOR CONTROLLING THE CONVERSION OF VIRTUAL TO PHYSICAL MEMORY ADDRESSES IN A DIGITAL COMPUTER SYSTEM, Serial No. 608,692, filed 18 Aug. 1989; Hetherington, WRITE BACK BUFFER WITH ERROR CORRECTING CAPABILITIES, Serial No. 609,565, filed 28 Aug. 1989; Chinnaswamy et al., MODULAR CROSSBAR INTERCONNECTION NETWORK FOR DATA TRANSACTIONS BETWEEN SYSTEM UNITS IN A MULTI-PROCESSOR SYSTEM, Serial No. 607,983, filed 10 Aug. 1989; Polzin et al., METHOD AND APPARATUS FOR INTERFACING A SYSTEM CONTROL UNIT FOR A MULTI-PROCESSOR SYSTEM WITH INPUT/OUTPUT UNITS, Serial No. 611,907, filed 19 Sep. 1989; Gagliardo et al., MEMORY CONFIGURATION FOR USE WITH MEANS FOR INTERFACING A SYSTEM CONTROL UNIT FOR A MULTI-PROCESSOR SYSTEM WITH THE SYSTEM MAIN MEMORY, Serial No. 607,967, filed 10 Aug. 1989; and Gagliardo et al., METHOD AND MEANS FOR ERROR CHECKING OF DRAM CONTROL SIGNALS BETWEEN SYSTEM MODULES, Serial No. 611,046, filed 12 Sep. 1989.

This invention relates generally to a virtual instruction cache (VIC) of a high-speed digital computer and, more particularly, to controlling the VIC to prefetch and align variable length instructions.

In the field of high speed computers, most advanced computers pipeline the entire sequence of instruction activities. A prime example is the "VAX 8600" (Trademark) computer manufactured and sold by Digital Equipment Corporation, 111 Powdermill Road, Maynard MA 97154-1418. The instruction pipeline for the "VAX 8600" (Trademark) computer is described in T. Fossum et al., "An Overview of the VAX 8600 System," Digital Technical Journal, No. 1, August 1985, pp. 8-23. Separate pipeline stages are provided for instruction fetch, instruction decode, operand address generation, operand fetch, instruction execute, and result store.

To make effective use of this pipelining capability, it is desirable to keep each stage of the pipeline occupied, performing its intended function on the next instruction to be executed. In order to do this, the instruction fetch stage must retrieve an instruction and pass it to the next stage between each transition of the system clock. Otherwise, such a disruption in the instruction stream causes the pipeline to drain, necessitating a time-consuming restart of the entire pipeline. Of course, the purpose of the pipeline is to increase the overall speed of the computer. Thus, it is highly advantageous to avoid these situations where the pipeline is interrupted.

However, the instruction set employed in some computers is of the variable length type, thereby forcing the instruction buffer to have added complexity. In other words, until the instruction (opcode) is decoded, the instruction buffer does not "know" how many of the subsequent bytes of the instruction stream belong with the current instruction. Therefore, the instruction buffer can only respond by loading a preselected number of bytes of the instruction stream, which may or may not include an entire instruction. The instruction decoder will only consume those bytes associated with the immediate instruction. Thereafter, the instruction buffer must determine how many of the present bytes were used by the decoder, shift the unused bytes into the lowest order locations, and then fill the empty buffer locations with subsequent bytes of the instruction stream.
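The shift-and-fill behaviour described above can be illustrated with a minimal sketch. It is not the disclosed hardware: the function name `consume_and_refill`, the flat `stream` of bytes, and the `next_pos` cursor are hypothetical, and the nine-byte buffer size is taken from the abstract.

```python
def consume_and_refill(ibuf, consumed, stream, next_pos):
    """Drop `consumed` bytes from the front of `ibuf`, shift the rest down,
    and top the buffer back up to its original size from `stream`."""
    size = len(ibuf)
    remaining = ibuf[consumed:]               # unused bytes move to the lowest positions
    need = size - len(remaining)              # positions left empty by the shift
    fill = stream[next_pos:next_pos + need]   # subsequent bytes of instruction stream
    return remaining + fill, next_pos + len(fill)


stream = bytes(range(32))                     # stand-in instruction stream
ibuf, pos = stream[:9], 9                     # nine-byte buffer, as in the abstract
ibuf, pos = consume_and_refill(ibuf, 3, stream, pos)   # decoder used three bytes
```

After the call, the six unused bytes occupy the lowest positions and three new bytes follow them, so the buffer again holds nine sequential bytes.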

Reference to the main memory to retrieve these subsequent bytes of instruction stream necessarily involves multiple clock cycles. To avoid accessing main memory, many digital computers include a high speed cache between the processing unit and the main memory. Access to this cache takes only a small number of cycles of the processor's clock but often involves translating virtual addresses to physical addresses. To further accelerate the access to the instruction stream, some systems dedicate a cache solely to store the instructions. The access to this "instruction cache" often does not entail translating from virtual to physical addresses as the instructions are stored under their virtual addresses. This access to the instruction stream in a high speed virtual instruction cache may only involve one cycle of the processor's clock. The virtual instruction cache, however, contains only a portion of the main memory; each reference to the virtual instruction cache involves comparing the requested address with the desired address to first determine if the desired instruction stream is present and then retrieving the requested instruction stream. Therefore, owing to the variable length nature of the instruction set, the instruction buffer cannot predict whether a reference to the VIC will be required by the instruction currently being decoded.
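The address comparison that precedes each retrieval can be sketched as follows. This is an illustrative model only: the class `VirtualCache`, its direct-mapped organization, and the sizes used are assumptions, not details stated at this point in the text.

```python
class VirtualCache:
    """Toy direct-mapped cache that is indexed and tagged by virtual address."""

    def __init__(self, num_lines=8, line_bytes=8):
        self.num_lines = num_lines
        self.line_bytes = line_bytes
        self.tags = [None] * num_lines
        self.data = [None] * num_lines

    def _split(self, vaddr):
        line = vaddr // self.line_bytes
        return line % self.num_lines, line // self.num_lines   # (index, tag)

    def fill(self, vaddr, block):
        index, tag = self._split(vaddr)
        self.tags[index], self.data[index] = tag, block

    def lookup(self, vaddr):
        index, tag = self._split(vaddr)
        if self.tags[index] == tag:      # stored address matches the request: hit
            return self.data[index]
        return None                      # miss: the caller must go to slower memory


cache = VirtualCache()
cache.fill(0x40, b"ABCDEFGH")
hit = cache.lookup(0x44)                 # same line as 0x40
miss = cache.lookup(0x80)                # same index, different tag
```

No address translation appears anywhere in the lookup, which is the property that lets a virtually addressed cache respond quickly.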

To prevent numerous references to the virtual instruction cache, a prefetch buffer is provided to maintain a preselected number of the subsequent bytes of instruction stream which are expected to be used by the instruction decoder. This process forestalls the inevitable reference to the virtual instruction cache.

Since the virtual instruction cache contains only a portion of the instruction stream, refills to the instruction buffer can result in "misses" in the virtual instruction cache, which require fetches from the main memory. These main memory fetches generally require many clock cycles, thereby interrupting the pipeline.
The present invention may be summarized according to a first broad aspect, as an instruction buffer system for a digital computer for controlling the delivery of instruction stream bytes between a memory and an instruction decoder, said instruction stream bytes being grouped together into variable length instructions, and said instruction decoder including means for decoding each of said bytes within said variable length instruction, the instruction buffer system comprising: an instruction buffer coupled between said memory and said instruction decoder and having multiple byte locations for receiving a first series of said instruction bytes, at least a portion of said first series of instruction bytes forming a variable length instruction to be decoded by said instruction decoder; first and second prefetch buffers for storing a preselected number of a second, subsequent series of bytes of instruction stream; means for delivering a shift signal responsive to the number of bytes of instruction stream contained in the variable length instruction currently being decoded by said decoder; a shifter for receiving said shift signal and shifting the contents of said instruction buffer by a preselected number of bytes indicated by said shift signal; means for merging said shifted bytes with at least a portion of the contents of one of said first and second prefetch buffers, and delivering said merged bytes to said instruction buffer; means for refilling said first prefetch buffer with sequential bytes of said instruction stream when said first prefetch buffer is emptied; and means for refilling said second prefetch buffer with sequential bytes of said instruction stream when said second prefetch buffer is emptied.
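The interaction of the shifter, the merge step, and the two prefetch buffers can be modelled in a few lines. The sketch below is a simplification under stated assumptions: the class `PrefetchModel` and its attribute names are hypothetical, each prefetch buffer holds one quadword as in the abstract, an emptied buffer is refilled immediately, and quadword alignment is ignored.

```python
from collections import deque

QUADWORD = 8  # bytes held by each prefetch buffer


class PrefetchModel:
    """Nine-byte instruction buffer fed by two prefetch buffers (assumed roles:
    the front of the deque plays IBEX, the back plays IBEX2)."""

    def __init__(self, stream, ibuf_size=9):
        self.stream = stream
        self.ibuf = bytearray(stream[:ibuf_size])
        self.next_fetch = ibuf_size        # next stream position to prefetch
        self.refills = 0
        self.prefetch = deque()
        self._refill()
        self._refill()

    def _refill(self):
        chunk = self.stream[self.next_fetch:self.next_fetch + QUADWORD]
        self.next_fetch += len(chunk)
        self.prefetch.append(bytearray(chunk))
        self.refills += 1

    def consume(self, n):
        """Decoder consumed n bytes: shift them out, then merge in prefetched bytes."""
        del self.ibuf[:n]                      # shifter: remaining bytes move down
        for _ in range(n):
            front = self.prefetch[0]
            if not front:                      # stream exhausted in this toy model
                break
            self.ibuf.append(front.pop(0))     # merge one prefetched byte
            if not front:                      # buffer emptied: refill it
                self.prefetch.popleft()
                self._refill()


model = PrefetchModel(bytes(range(64)))
model.consume(3)   # served entirely from the first prefetch buffer
model.consume(6)   # empties the first buffer, which triggers a refill
```

In this model the decoder never waits on the refill itself, because the second buffer still holds sequential bytes while the first is being replenished.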

According to another aspect, the present invention provides an instruction buffer system for a digital computer for controlling the delivery of instruction stream to an instruction decoder, said instruction stream being grouped into variable length instructions, and said instruction decoder including means for decoding each of said bytes within said variable length instruction, said instruction buffer system comprising: an instruction buffer having a plurality of storage locations for receiving a preselected number of the next sequential bytes of instruction stream desired by the decoder and delivering said preselected number of instruction stream bytes to said decoder; said instruction decoder including means for delivering a shift signal responsive to the number of bytes of the instruction stream located in the instruction buffer which are currently being decoded; first means for prefetching and maintaining in a first prefetch buffer a first preselected number of sequential bytes of the instruction stream; second means for prefetching and maintaining in a second prefetch buffer a second preselected number of sequential bytes of the instruction stream, said second preselected number of sequential bytes of the instruction stream being subsequent to the first preselected number of sequential bytes of the instruction stream; a shifter coupled to said instruction buffer for receiving said shift signal and shifting the bytes of said instruction buffer by a preselected number of storage locations responsive to said shift signal and delivering the shifted bytes to the instruction buffer; means for retrieving sequential bytes of the instruction stream from one of the first and second prefetch buffers and filling the instruction buffer storage locations from which bytes of the instruction stream have been removed by the shifter; means for refilling said first prefetch buffers with instruction stream bytes in response to said first prefetch buffers being emptied by said means for retrieving; and means for refilling said second prefetch buffer with instruction stream bytes in response to said second means being emptied by said means for retrieving.

Other objects and advantages of the invention will become apparent upon reading the following detailed description and upon reference to the drawings in which:

FIG. 1 is a top level block diagram of a portion of a central processing unit and associated memory;

FIG. 2 is a functional diagram of the pipeline processing of a longword ADD operand;

FIG. 3 is a block diagram of the virtual instruction cache;

FIG. 4 is a general block diagram of the instruction buffer interfaced with the virtual instruction cache;

FIG. 5 is a detailed block diagram of the instruction buffer and the interface to the instruction decoder;

FIG. 6 is a schematic diagram of the shifter of the instruction buffer;

FIG. 7 is a schematic diagram of the rotator of the instruction buffer;

FIG. 8 is a schematic diagram of the merge multiplexer of the instruction buffer; and

FIG. 9 is a block diagram of the two-unit valid block store strams of the virtual instruction cache.
While the invention is susceptible to various modifications and alternative forms, specific embodiments thereof have been shown by way of example in the drawings and will herein be described in detail. It should be understood, however, that it is not intended to limit the invention to the particular forms disclosed, but on the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the invention as defined by the appended claims.

PD88-0255 U.S.: DIGM:009 FOREIGN: DIGM:040

Turning now to the drawings, FIG. 1 is a top level block diagram of a portion of a pipelined computer system 10. The system 10 includes at least one central processing unit (CPU) 12 having access to main memory 14. It should be understood that additional CPUs could be used in such a system by sharing the main memory 14.

Inside the CPU 12, the execution of an individual instruction is broken down into multiple smaller tasks. These tasks are performed by dedicated, separate, independent functional units that are optimized for that purpose.

Although each instruction ultimately performs a different operation, many of the smaller tasks into which each instruction is broken are common to all instructions. Generally, the following steps are performed during the execution of an instruction: instruction fetch, instruction decode, operand fetch, execution, and result store. Thus, by the use of dedicated hardware stages, the steps can be overlapped, thereby increasing the total instruction throughput.

The data path through the pipeline includes a respective set of registers for transferring the results of each pipeline stage to the next pipeline stage. These transfer registers are clocked in response to a common system clock. For example, during a first clock cycle, the first instruction is fetched by hardware dedicated to instruction fetch. During the second clock cycle, the fetched instruction is transferred and decoded by instruction decode hardware, but, at the same time, the next instruction is fetched by the instruction fetch hardware. During the third clock cycle, each instruction is shifted to the next stage of the pipeline and a new instruction is fetched. Thus, after the pipeline is filled, an instruction will be completely executed at the end of each clock cycle.
This process is analogous to an assembly line in a manufacturing environment. Each worker is dedicated to performing a single task on every product that passes through his or her work stage. As each task is performed the product comes closer to completion. At the final stage, each time the worker performs his or her assigned task a completed product rolls off the assembly line.
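The overlap described above can be made concrete with a small timing model. It is only a sketch of an ideal, stall-free pipeline: the function `schedule` is hypothetical, and the five stage names are the generic steps listed earlier rather than the stages of any particular machine.

```python
STAGES = ["fetch", "decode", "operand fetch", "execute", "store"]


def schedule(num_instructions, stages=STAGES):
    """Return {cycle: [(instruction, stage), ...]} for an ideal, stall-free pipeline."""
    timeline = {}
    for instr in range(num_instructions):
        for depth, stage in enumerate(stages):
            cycle = instr + depth + 1      # instruction i occupies stage d in cycle i + d + 1
            timeline.setdefault(cycle, []).append((instr, stage))
    return timeline


timeline = schedule(4)
# Cycles in which some instruction finishes its last stage.
completions = sorted(cycle for cycle, work in timeline.items()
                     if any(stage == STAGES[-1] for _, stage in work))
```

With five stages and four instructions, the first result appears only after the pipeline has filled, and one instruction completes in every cycle after that.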

As shown in FIG. 1, each CPU 12 is partitioned into at least three functional units: the memory access unit 16, the instruction unit 18, and the execution unit 20.

The memory access unit 16 includes a main cache 22 which, on an average basis, enables the instruction and execution units 18, 20 to process data at a faster rate than the access time of the main memory 14. This cache 22 includes means for storing selected predefined blocks of data elements, means for receiving requests from the instruction unit 18 via a translation buffer 24 to access a specified data element, means for checking whether the data element is in a block stored in the cache 22, and means operative when data for the block including the specified data element is not so stored for reading the specified block of data in the cache 22. In other words, the cache provides a "window" into the main memory, and contains data likely to be needed by the instruction and execution units 18, 20. The organization and operation of a similar cache and translation buffer are further described in Chapter 11 of Levy and Eckhouse, Jr., Computer Programming and Architecture, The VAX-11, Digital Equipment Corporation, pp. 351-368 (1980).

If a data element needed by the instruction and execution units 18, 20 is not found in the cache 22, then the data element is obtained from the main memory 14, but in the process, an entire block, including additional data, is obtained from the main memory 14 and written into the cache 22. Due to the principle of locality in time and memory space, the next time the instruction and execution units desire a data element, there is a high degree of likelihood that this data element will be found in the block which includes the previously addressed data element. Consequently, it is probable that the cache 22 will already include the data element required by the instruction and execution units 18, 20. In general, since the cache 22 is accessed at a much higher rate than the main memory 14, the main memory 14 can have a proportionally slower access time than the cache 22 without substantially degrading the average performance of the computer system 10. Therefore, the main memory 14 is constructed of slower and less expensive memory elements.
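The benefit of fetching an entire block on a miss can be shown with a short model. This is a sketch under assumptions: the class `BlockCache`, the block size of eight elements, and the absence of any eviction policy are all illustrative choices, not properties of the cache 22.

```python
BLOCK = 8  # hypothetical block size, in data elements


class BlockCache:
    """Toy cache that loads a whole block from memory whenever an element misses."""

    def __init__(self, memory):
        self.memory = memory
        self.blocks = {}                 # block base address -> cached block
        self.hits = 0
        self.misses = 0

    def read(self, addr):
        base = addr - addr % BLOCK
        if base in self.blocks:
            self.hits += 1
        else:
            self.misses += 1             # fetch the element and its neighbours together
            self.blocks[base] = self.memory[base:base + BLOCK]
        return self.blocks[base][addr - base]


memory = list(range(100, 164))
cache = BlockCache(memory)
values = [cache.read(addr) for addr in range(16)]   # sequential access pattern
```

For sequential accesses, only the first element of each block pays the main memory penalty; the rest are served from the cache.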

The translation buffer 24 is a high speed associative memory which stores the most recently used virtual-to-physical address translations. In a virtual memory system, a reference to a single virtual address can cause several memory references before the desired information is made available. However, where the translation buffer 24 is used, translation is reduced to simply finding a "hit" in the translation buffer 24.
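A translation buffer of this kind behaves like a small table of recent translations placed in front of the page tables. The sketch below is hypothetical throughout: the class `TranslationBuffer`, the 512-byte page size, the capacity, and the least-recently-used replacement are assumptions chosen for illustration.

```python
from collections import OrderedDict

PAGE = 512  # assumed page size in bytes


class TranslationBuffer:
    """Toy associative store of the most recently used page translations."""

    def __init__(self, page_table, capacity=4):
        self.page_table = page_table       # virtual page -> physical page (slow path)
        self.capacity = capacity
        self.entries = OrderedDict()       # most recently used translations
        self.walks = 0                     # number of slow page-table lookups

    def translate(self, vaddr):
        vpn, offset = divmod(vaddr, PAGE)
        if vpn in self.entries:            # hit: no extra memory references needed
            self.entries.move_to_end(vpn)
        else:                              # miss: consult the page table
            self.walks += 1
            self.entries[vpn] = self.page_table[vpn]
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)   # evict the least recently used entry
        return self.entries[vpn] * PAGE + offset


tb = TranslationBuffer({0: 7, 1: 3})
first = tb.translate(10)     # misses, walks the page table
second = tb.translate(20)    # same page, served from the buffer
```

Only the first reference to a page pays for the page-table lookup; later references to the same page are resolved by the buffer alone.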

The instruction unit 18 includes a program counter 26 and a virtual instruction cache (VIC) 28 for fetching instructions from the main cache 22. The program counter 26 preferably addresses virtual memory locations rather than the physical memory locations of the main memory 14 and the cache 22. Thus, the virtual address of the program counter 26 must be translated into the physical address of the main memory 14 before instructions can be retrieved. Accordingly, the contents of the program counter 26 are transferred to the memory access unit 16 where the translation buffer 24 performs the address conversion. The instruction is retrieved from its physical memory location in the cache 22 using the converted address. The cache 22 delivers the instruction over data return lines to the VIC 28.

Generally, the VIC 28 contains prestored instructions at the addresses specified by the program counter 26, and the addressed instructions are available immediately for the transfer into an instruction buffer (IBUFFER) 30. From the buffer 30, the addressed instructions are fed to an instruction decoder 32 which decodes both the opcodes and the specifiers. An operand processing unit (OPU) 34 fetches the specified operands and supplies them to the execution unit 20.

The OPU 34 also produces virtual addresses. In particular, the OPU 34 produces virtual addresses for memory source (read) and destination (write) operands. For the memory read operands, the OPU 34 delivers these virtual addresses to the memory access unit 16 where they are translated to physical addresses. The physical memory locations of the cache 22 are then accessed to fetch the operands for the memory source operands.

In each instruction, the first byte contains the opcode, and the following bytes are the operand specifiers to be decoded. The first byte of each specifier indicates the addressing mode for that specifier. This byte is usually broken in halves, with one-half specifying the addressing mode and the other half specifying a register to be used for addressing. The instructions preferably have a variable length, and various types of specifiers can be used with the same opcode, as disclosed in Strecker et al., U.S. Patent 4,241,397 issued December 23, 1980.
The first step in processing the instructions is to decode the opcode portion of the instruction. The first portion of each instruction consists of its opcode which specifies the operation to be performed in the instruction, and the number and type of specifiers to be used. Decoding is accomplished using a table-look-up technique in the instruction decoder 32. Later, the execution unit 20 performs the specified operation by executing prestored microcode, beginning at a predetermined starting address for the specified operation. Also, the decoder 32 determines where source-operand and destination-operand specifiers occur in the instruction and passes these specifiers to the OPU 34 for preprocessing prior to execution of the instruction. A preferred instruction decoder for use with the refill method and apparatus of the present invention is described in the above referenced D. Fite et al. Canadian patent application Serial No. 605,969, filed 18 July 1989, and entitled "Decoding Multiple Specifiers in a Variable Length Instruction Architecture."
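The split of a specifier byte into two halves can be sketched as below. The text does not say which half carries the mode; the helper `split_specifier` assumes the VAX convention of mode in the high half and register in the low half, and the byte values shown are an assumed encoding of the ADDL3 example used later in this description, given for illustration only.

```python
def split_specifier(byte):
    """Split a specifier byte into (mode, register) halves.

    Assumes the mode occupies the high four bits and the register number
    the low four bits.
    """
    return byte >> 4, byte & 0x0F


# Assumed encoding of ADDL3 R0,B^12(R1),R2: opcode 0xC1, then the specifiers
# 0x50 (register R0), 0xA1 0x0C (byte displacement 12 from R1), 0x52 (register R2).
encoded = bytes([0xC1, 0x50, 0xA1, 0x0C, 0x52])
opcode = encoded[0]
specifiers = [split_specifier(encoded[i]) for i in (1, 2, 4)]
```

The second specifier occupies two bytes, which is why the decoder cannot know the instruction length until the specifiers themselves have been examined.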

After an instruction has been decoded, the OPU 34 parses the operand specifiers and computes their effective addresses; this process involves reading GPRs and possibly modifying the GPR contents by autoincrementing or autodecrementing. The operands are then fetched from those effective addresses and passed on to the execution unit 20, which executes the instruction and writes the result into the destination identified by the destination pointer for that instruction.

Each time an instruction is passed to the execution unit 20, the instruction unit 18 sends a microcode dispatch address and a set of pointers for (1) the location in the execution unit register file where the source operands can be found, and (2) the location where the results are to be stored. Within the execution unit 20, a set of queues 36 includes a fork queue for storing the microcode dispatch address, a source pointer queue for storing the source-operand locations, and a destination pointer queue for storing the destination location. Each of these queues is a FIFO buffer capable of holding the data for multiple instructions.
The execution unit 20 also includes a source list 38, which is a multi-ported register file containing a copy of the GPRs and a list of source operands. Thus, entries in the source pointer queue will either point to GPR locations for register operands, or point to the source list for memory and literal operands. Both the memory access unit 16 and the instruction unit 18 write entries in the source list 38, and the execution unit 20 reads operands out of the source list 38 as needed to execute the instructions. For executing instructions, the execution unit 20 includes an instruction issue unit 40, a microcode execution unit 42, an arithmetic and logic unit (ALU) 44, and a retire unit 46.

The present invention is particularly useful with pipelined processors. As discussed above, in a pipelined processor, the processor's instruction fetch hardware may be fetching one instruction while other hardware is decoding the operation code of a second instruction, fetching the operands of a third instruction, executing a fourth instruction, and storing the processed data of a fifth instruction. FIG. 2 illustrates a pipeline for a typical instruction such as:

ADDL3 R0,B^12(R1),R2

This is a long-word addition using the displacement mode of addressing.

In the first stage of the pipelined execution of this instruction, the program count (PC) of the instruction is created; this is usually accomplished either by incrementing the program counter 26 from the previous instruction, or by using the target address of a branch instruction. The PC is then used to access the VIC 28 in the second stage of the pipeline.

In the third stage of the pipeline, the instruction data is available from the cache 22 for use by the instruction decoder 32, or to be loaded into the IBUFFER 30. The instruction decoder 32 decodes the opcode and the three specifiers in a single cycle, as will be described in more detail below. The R0 and R2 numbers are passed to the ALU 44, and the R1 number along with the byte displacement is sent to the OPU 34 at the end of the decode cycle.

In stage four, the OPU 34 reads the contents of its GPR register file at location R1, adds that value to the specified displacement (12), and sends the resulting address to the translation buffer 24 in the memory access unit 16, along with an OP READ request, at the end of the address generation stage.
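The address generation in stage four amounts to a single addition. A minimal sketch follows; the function `displacement_address`, the sixteen-entry register list, and the value placed in R1 are hypothetical, and the result is wrapped to 32 bits to match the 32-bit virtual address used elsewhere in the description.

```python
def displacement_address(gprs, reg, displacement, width=32):
    """Effective address for displacement mode: register contents plus the
    displacement, wrapped to `width` bits."""
    return (gprs[reg] + displacement) & ((1 << width) - 1)


gprs = [0] * 16
gprs[1] = 0x2000                              # example contents of R1
addr = displacement_address(gprs, 1, 12)      # operand B^12(R1)
```

The resulting virtual address is what would be handed to the translation buffer together with the read request.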

In stage five, the memory access unit 16 selects the address generated in stage four for execution. Using the translation buffer 24, the memory access unit 16 translates the virtual address to a physical address during the address translation stage. The physical address is then used to address the cache 22, which is read in stage six of the pipeline.

In stage seven of the pipeline, the instruction is issued to the ALU 44 which adds the two operands and sends the result to the retire unit 46. During stage 4, the register numbers for R1 and R2, and a pointer to the source list location for the memory data, are sent to the execution unit and stored in the pointer queues. Then during the cache read stage, the execution unit looks for the two source operands in the source list. In this particular example, it finds only the register data R0, but at the end of this stage the memory data arrives and is substituted for the invalidated read-out of the register file. Thus, both operands are available in the instruction execution stage.

In the retire stage eight of the pipeline, the result data is paired with the next entry in the retire queue. Although several functional execution units can be busy at the same time, only one instruction is retired in a single cycle.

In the last stage nine of the illustrative pipeline, the data is written into the GPR portion of the register files in both the execution unit 20 and the instruction unit 18.

Referring now to FIG. 3, a block diagram of the virtual instruction cache (VIC) 28 is illustrated. The VIC 28 is constructed of four groups of self-timed rams (STRAMS), and acts as a window into the main memory 14. In this regard the VIC 28 functions in a similar fashion as the main cache 22. The first group of VIC STRAMS is the data stram 50 which provides storage space for the actual instruction stream (ISTREAM) retrieved from the main cache 22. Specifically, the data stram 50 contains 1024 storage locations, with each storage location being 64 bits in width. From the size of the data stram 50, it should be apparent that the ISTREAM is retrieved in quadword (8-byte) packets. Accordingly, the data path between the main cache 22 and the VIC 28 is also 64-bits in width and a quadword of ISTREAM can be transferred during each system clock cycle.

The PC 26 delivers bits 12:3 of the 32-bit virtual address to the data stram 50 in order to address each quadword of ISTREAM. Bits 2:0 are unnecessary, as they are only needed to address individual bytes within each quadword. Individual byte addressability is not necessary for the proper operation of the VIC 28. Rather, the smallest increment of ISTREAM which can be addressed in the VIC 28 is a quadword. Further, the upper bits 31:13 are not used to address the data stram 50 because only 1024 quadword locations are available for storing the ISTREAM. Accordingly, the 10-bits 12:3 are sufficient to provide a unique address for each of the 1024 data storage locations (i.e., 2^10 = 1024).

However, it should be clear that since the upper bits 31:13 are not used to address the data stram 50, there are multiple quadwords which must be stored at identical data stram locations. For example, the quadword located at address 11111111111111111110000000000 will be stored at the same data stram location as the quadword located at address 01111111111111111110000000000. Both addresses share the same lower 10-bits and must, therefore, share the same data stram storage location. In fact, each data stram location can host any one of 524,288 (2^19 = 524,288) quadwords.
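The aliasing just described can be illustrated with a short Python sketch. It is an illustrative model only and not circuitry from the disclosure: the function name `data_stram_index` and the two sample addresses are hypothetical, and the only facts taken from the text are that bits 12:3 select one of the 1024 locations while bits 31:13 are ignored by the data stram 50.

```python
def data_stram_index(vaddr: int) -> int:
    """Return the data stram location (0..1023) selected by bits 12:3."""
    return (vaddr >> 3) & 0x3FF

# Two hypothetical 32-bit virtual addresses that differ only in bits 31:13.
addr_a = 0xFFFFE000
addr_b = 0x7FFFE000
same_location = data_stram_index(addr_a) == data_stram_index(addr_b)

# One alias exists for every combination of the 19 ignored upper bits.
aliases_per_location = 1 << 19
```

Because the index alone cannot tell these quadwords apart, some additional stored state is needed to identify which one currently occupies a location.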

Accordingly, in order to determine which of these quadwords is stored in each of the data stram locations, a set of tag strams 52 is provided. The tag strams 52 store the upper nineteen bits 31:13 of the quadword address. However, ISTREAM is retrieved from the main cache 22 in four quadword blocks. In other words, a request to the main cache 22 for the first quadword in a block causes the main cache 22 to also return the three following quadwords. Retrieving ISTREAM in blocks satisfies the principle of locality in time and memory space and aids the overall performance of the VIC 28. Accordingly, the 1024 data stram locations are identified by only 256 tag stram locations (1 for each four quadword block). Thus, the tag stram 52 contains 256 19-bit storage locations and 8-bits (12:5) of the virtual address are sufficient to identify each of the 256 storage locations (2^8 = 256).
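As a hedged illustration of the tag organization above, the following Python sketch splits a 32-bit virtual address into the stored tag (bits 31:13) and the tag stram location (bits 12:5). The names `TagFields` and `split_tag_fields` are hypothetical and chosen only for this example.

```python
from typing import NamedTuple

class TagFields(NamedTuple):
    tag: int    # bits 31:13, the 19-bit value held in the tag stram
    block: int  # bits 12:5, one of 256 four-quadword blocks

def split_tag_fields(vaddr: int) -> TagFields:
    """Extract the tag and the tag stram location from a virtual address."""
    return TagFields(tag=(vaddr >> 13) & 0x7FFFF, block=(vaddr >> 5) & 0xFF)

fields = split_tag_fields(0xFFFFE020)
```

All four quadwords of a block yield the same `block` value, which is why one tag entry per block suffices in this model.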
Operation of the VIC 28 is enhanced by the method used for retrieving ISTREAM from the main cache 22. The request for ISTREAM is always quadword aligned and can be for any quadword within a block. However, the main cache 22 only responds with the requested quadword and all subsequent quadwords to fill the block. Quadwords prior to the request in the block are not returned from the main cache 22. For example, if the VIC 28 requests the third quadword in a block, only the third and fourth quadwords are returned from the main cache 22 and are written into the data stram 50. This method of retrieving ISTREAM is employed for two reasons. First, by returning the requested quadword first, rather than the first quadword in that block, the requested ISTREAM address is available immediately and the critical response time is enhanced. Second, performance models indicate that the remainder of the block (the quadwords preceding the requested one) is hardly used.
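The refill order described above can be sketched in Python as follows. This is a minimal model under stated assumptions: the function name `refill_addresses` and the constants are hypothetical, and the sketch only lists which quadword addresses would be returned, not the data itself.

```python
from typing import List

QUADWORD_BYTES = 8
BLOCK_BYTES = 4 * QUADWORD_BYTES  # four quadwords per block

def refill_addresses(request_vaddr: int) -> List[int]:
    """Quadword addresses returned for one request, in arrival order."""
    first = request_vaddr & ~(QUADWORD_BYTES - 1)   # requests are quadword aligned
    block_end = (first & ~(BLOCK_BYTES - 1)) + BLOCK_BYTES
    # Only the requested quadword and those after it in the block are returned.
    return list(range(first, block_end, QUADWORD_BYTES))

# A request for the third quadword of the block starting at 0x1000.
returned = refill_addresses(0x1010)
```

In this model the requested quadword is always the first element, which mirrors the stated goal of making the requested ISTREAM available immediately.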

Since it is possible for only a portion of a block to be present in the data stram 50, it is necessary to keep track of which quadwords are valid. Therefore, a quadword valid stram 54 is provided. A valid bit is maintained for each quadword in the data stram 50. The quadword valid stram 54 is organized similarly to the tag stram 52, in that it contains 256 4-bit storage locations. Each storage location corresponds to a four quadword block of data stored in the data stram 50, with each of the four valid bits corresponding to a quadword within the block. Thus, like the tag stram 52, the quadword valid stram is addressed by the eight bits 12:5 of the virtual address.

Further, however, the individual quadword valid bits must also be independently addressable in order to determine if a particular ISTREAM quadword requested by the IBUFFER 30 is valid. A multiplexer 56 is connected to the 4-bit output of the quadword valid stram 54. The select input of the multiplexer 56 is connected to quadword identifying bits 4:3 of the virtual address.

For example, a request from the IBUFFER 30 for the quadword stored at location 00000000000000000001111111101000 results in the four quadword valid bits stored at location 11111111 of the quadword valid stram being delivered to the multiplexer 56. Bits 4:3 of the virtual address indicate that the first quadword (location 01) is the desired quadword. Thus, the select lines of the multiplexer 56 cause the quadword valid bit corresponding to the selected quadword to be delivered at the multiplexer output.
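The selection performed by the multiplexer 56 can be modeled with the Python sketch below. The function name `quadword_valid` is hypothetical, and the encoding of each 4-bit entry (bit n marks the quadword at location n) is an assumption made for illustration; the text does not specify the bit ordering.

```python
from typing import List

def quadword_valid(valid_stram: List[int], vaddr: int) -> bool:
    """Read the 4-bit entry at bits 12:5, then pick one bit using bits 4:3."""
    entry = valid_stram[(vaddr >> 5) & 0xFF]  # four valid bits for the block
    select = (vaddr >> 3) & 0x3               # select input of multiplexer 56
    return bool((entry >> select) & 1)

# The address from the example in the text, written in hexadecimal.
vaddr = 0x1FE8                   # bits 12:5 = 11111111, bits 4:3 = 01
stram = [0] * 256
stram[0b11111111] = 0b0010       # assumed encoding: only location 01 is valid
is_valid = quadword_valid(stram, vaddr)
```

A request for a different quadword of the same block (for instance `0x1FE0`) reads the same 4-bit entry but selects a different bit.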

Finally, the fourth group of VIC strams 58 contains valid bits for each block stored in the data stram 50. Thus, the block valid stram 58 contains 256 1-bit storage locations and is addressed by bits 12:5 of the virtual address. Not only is it necessary for the VIC 28 to "know" which quadwords within a block are valid, but also, the VIC 28 needs to verify that the block itself is valid. At this time it is sufficient to understand that the block valid bit must be set before the VIC 28 will allow the selected quadword to be transferred to the IBUFFER 30. However, it should be noted that the block valid stram actually consists of two sets of strams to speed operation of the VIC 28 during a flush. At any given time, a selected one of the two sets of strams stores the block valid bits which reflect the current status of the data in the VIC 28. The addressed block valid bit, representing the validity of the addressed block of data in the VIC 28, is selected by a multiplexer 236 as either the "BLOCK_A_VALID" bit from the first set of strams (set A), or the "BLOCK_B_VALID" bit from the second set of strams (set B). This aspect of the VIC 28 is discussed in greater detail in conjunction with the description of the operation of the circuit shown in FIG. 9.

During an IBUFFER request for a selected quadword of ISTREAM, the virtual address contained in the PC 26 is delivered to the VIC 28. The VIC 28 responds to the request by determining if the requested quadword is present in the data stram 50 and, if so, whether it is valid. Bits 31:13 of the PC virtual address are delivered to one input of a 19 bit comparator 60. The second input to the comparator 60 is connected to the output of the tag stram 52. Previously, bits 31:13 of the address of the quadword stored in the data stram 50 were stored in the tag stram 52. Therefore, those previously stored bits 31:13 are presented as the second input to the comparator 60. If the two addresses match, the asserted output of the comparator 60 is delivered as one input to the 3-input AND gate 62. At the same time, the block and quadword valid bits are also delivered as inputs to the AND gate 62. Accordingly, if any of the three signals is not asserted, the AND gate 62 produces a MISS signal. Conversely, if all three signals are asserted, the AND gate 62 produces a HIT signal. A MISS signal initiates a request to the main cache 22, while a HIT signal causes the data STRAM 50 to deliver the selected quadword of data.
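The hit determination described above amounts to a three-way AND, which the following Python sketch models. The function name `vic_hit` and the list-based strams are hypothetical, and the bit ordering inside each quadword valid entry (bit n for quadword n) is an assumption; the two-set block valid arrangement is not modeled.

```python
from typing import List

def vic_hit(vaddr: int, tag_stram: List[int],
            block_valid_stram: List[bool], quadword_valid_stram: List[int]) -> bool:
    """Model of comparator 60 feeding the 3-input AND gate 62."""
    block = (vaddr >> 5) & 0xFF
    tag_match = tag_stram[block] == ((vaddr >> 13) & 0x7FFFF)   # comparator 60
    block_valid = block_valid_stram[block]
    quadword_ok = bool((quadword_valid_stram[block] >> ((vaddr >> 3) & 0x3)) & 1)
    return tag_match and block_valid and quadword_ok            # HIT, else MISS

tags = [0] * 256
block_valid = [False] * 256
qw_valid = [0] * 256
block_valid[0xFF] = True
qw_valid[0xFF] = 0b0010          # only quadword 01 of block 0xFF is present
hit = vic_hit(0x1FE8, tags, block_valid, qw_valid)
```

Any one deasserted input is enough to produce a miss in this model, for example a matching tag whose quadword valid bit is clear.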
The PC 26 is actually constructed of several separate program counters. During each system clock cycle, one of two PCs (PREFETCH PC or MTAG) is selected and its virtual address is delivered to the VIC 28. Generally, the virtual address contained in the PREFETCH PC is selected and delivered to the VIC 28. The PREFETCH PC always points to the next quadword that the IBUFFER is likely to accept. In sequential code the PREFETCH PC is incremented by one quadword each time the IBUFFER accepts ISTREAM from the VIC 28. When the ISTREAM branches, the PREFETCH PC is loaded with the correct destination address.

However, when ISTREAM is requested from and delivered by the main cache 22, the virtual address contained in the MTAG is selected and delivered to the VIC 28. When the VIC 28 receives multiple quadwords of ISTREAM from the main cache 22, the address of the VIC 28 must be incremented by a quadword in each cycle of the main cache response. The PREFETCH PC would serve this purpose if the instruction decoder 32 could always consume all of the ISTREAM as it arrives from the main cache 22. In practice this is not always possible. Therefore, a second PC, independent from the PREFETCH PC, is used to store the ISTREAM in the VIC 28. Once the response from the main cache 22 is complete, the PREFETCH PC is again used to address the VIC 28. The MTAG is loaded with the previous value of the VIC address when there is no request to the main cache 22.

Referring now to FIG. 4, the IBUFFER 30 is illustrated. The IBUFFER 30 aligns the data for decoding and performs the function of increasing the processing speed of the instruction unit 18 by prefetching subsequent sequential instructions. The IBUFFER 30 retrieves a selected quadword of the ISTREAM and positions that quadword, such that the instruction decoder 32 receives the instruction with the opcode positioned in the zero byte location. In order to accomplish this complex task of repositioning the ISTREAM, the IBUFFER 30 is separated into five major functional sections: IBEX 64 and IBEX2 66, ROTATOR 68, SHIFTER 70, MERGE MULTIPLEXER 72, and IBUF 74.

Rather than simply increase the size of the instruction decoder 32 to contain more bytes of the ISTREAM, a pair of prefetching buffers IBEX 64 and IBEX2 66 are disposed intermediate the decoder 32 and the VIC 28. IBEX 64 and IBEX2 66 are quadword buffers functionally positioned between the VIC 28 and the IBUF 74 and operational to retrieve the next sequential quadword of ISTREAM while the decoder 32 is operating on the present instruction. This prefetching normally hides the time required for a VIC access by performing the instruction fetch during the time in which the decoder 32 is busy. Any one of the quadwords stored in the VIC 28 is controllably storable in the IBEX 64 and IBEX2 66. As discussed previously, the PREFETCH PC controls operation of the VIC 28 to select and deliver a quadword of ISTREAM. The quadword currently selected by the PREFETCH PC is stored in the IBEX 64 while the next subsequent quadword of ISTREAM is retrieved from the VIC 28 and stored in the IBEX2 66.

The purpose of the IBEX 64 and IBEX2 66 is to prefetch the subsequent two quadwords of ISTREAM and sequentially provide these bytes of ISTREAM to fill the IBUF 74 as each instruction is consumed by the instruction decoder 32. It should be noted that the present computer system preferably employs an instruction set which is of the variable length type. Accordingly, until the instruction decoder 32 actually decodes the opcode of the instruction, the number of bytes dedicated to the instant instruction is not "known" by the IBUFFER 30. Therefore, the IBUFFER 30 does not "know" how many bytes will be consumed by the instruction decoder 32 and will need to be refilled by the IBUFFER 30. Thus, the logic which controls the operation of the IBEX 64, IBEX2 66, and VIC 28 must be capable of determining the number of bytes needed to fill the decoder 32, which location or multiple locations contain the desired bytes, and whether those bytes are valid.

The control logic for operating the IBEX 64, IBEX2 66, and VIC 28 includes a multiplexer 76 with control logic 78 operating the select inputs of the multiplexer 76. The IBEX 64, IBEX2 66, and VIC 28 each includes an 8-byte wide data path connected to the inputs of the multiplexer 76 such that any input may be selected by the control logic 78 and delivered over an 8-byte wide data path to the rotator 68 and to the IBEX 64. The IBEX2 66 is connected directly to the VIC 28 and receives the next sequential quadword of ISTREAM over the 8-byte data path therebetween. Operation of the multiplexer 76 and control logic 78 is discussed in greater detail in conjunction with the description accompanying FIGS. 9 and 10.

The merge multiplexer 72, rotator 68 and shifter 70 interact to maintain the 9-byte instruction decoder 32 filled with the next nine sequential bytes of ISTREAM. As the decoder 32 completes the decoding stage of each instruction, those consumed bytes are shifted out and discarded by the shifter 70. The rotator 68 acts to provide the next sequential bytes of ISTREAM to replace those bytes which were discarded. In this manner, the instruction buffer 30 attempts to provide at least the next 9-bytes of ISTREAM to the instruction decoder 32. Therefore, independent of the length of the present instruction, the decoder 32 is assured that for the majority of instructions (relatively few instructions require more than 9 bytes) the entire instruction is present and available for decoding.

The IBUF 74 is a 9-byte register for storing the results of the merge multiplexer 72 until the decoder 32 is available to accept the ISTREAM. Further, the output of the IBUF 74 is also connected to the input of the shifter 70.

Turning now to FIG. 5, the data paths to and from the instruction decoder 32 are shown in greater detail. In order to simultaneously decode a number of operand specifiers, the IBUF 74 is linked to the instruction decoder 32 by a data path 80 for conveying the values of up to nine bytes of an instruction currently being decoded. Associated with the eight bits of each byte is a parity bit for detecting any single bit errors in the byte, and also a valid data flag for indicating whether the IBUF 74 has, in fact, been filled with data from the VIC 28 as requested by the program counter 26.

The instruction decoder 32 decodes a variable number of specifiers depending upon the particular opcode being decoded, the amount of valid data in the IBUF 74, and whether the downstream stages in the pipeline are available to accept more specifiers. Specifically, the instruction decoder 32 inspects the opcode to determine the number of subsequent bytes which are associated with that particular instruction. Then the decoder 32 checks the valid data flags to determine how many of the associated specifiers can be decoded and then decodes these specifiers in a single cycle. The instruction decoder 32 delivers a signal indicating the number of bytes that were decoded in order to remove these bytes from the IBUF 74. For example, if the opcode includes four bytes of associated specifiers, the decoder inspects the valid bytes to ensure that these four bytes are valid and then decodes these specifiers. Thereafter, the decoder instructs the shifter 70 to remove the opcode and the consumed four bytes and move the upper four bytes into the low order four byte locations. This shifting process is effective to move the next opcode into the zero byte location of the IBUF 74.

The IBUF 74 need not be large enough to hold an entire instruction, so long as it may hold at least three specifiers of the kind which are typically found in an instruction. The instruction decoder 32 is somewhat simplified if the byte 0 position of the IBUF 74 holds the opcode while the other bytes of the instruction are shifted into and out of the IBUF 74. In effect, the IBUF 74 holds the opcode in byte 0 and functions as a first-in, first-out buffer for byte positions 1 through 8. The instruction decoder 32 is also simplified by the operating criterion that only the specifiers for a single instruction are decoded during each cycle of the system clock. Therefore, at the end of a cycle in which all of the specifiers for an instruction will have been decoded, the instruction decoder 32 transmits a "shift opcode" signal to the shifter 70 in order to shift the opcode out of the byte 0 position of the IBUF 74 so that the next opcode may be received in the byte 0 position.

The VIC 28 is preferably arranged to receive and transmit instruction data in blocks of multiple bytes of data. The block size is preferably a power of two so that the blocks have memory addresses specified by a certain number of most significant bits in the address provided by the program counter 26. For example, in the preferred embodiment, each block consists of 32-bytes or four quadwords and is addressed by a 32-bit address. Thus, bits 31:5 are unique for each block. Further, owing to the instructions being of variable length, the addresses of the opcodes within the ISTREAM occur at various positions within the block. To load byte 0 of the IBUF 74 with the next opcode to be executed, which may occur at any byte position within a block of instruction data from the cache, the rotator 68 is disposed in the data path from the VIC 28 to the IBUF 74. The rotator 68, as well as the shifter 70, are comprised of cross-bar switches. The data path from the VIC 28 includes eight parallel busses, one bus being provided for each byte of the ISTREAM.

In the general case, it is necessary to keep track of the number of valid bytes in the IBUF 74. The number of valid bytes at any particular instance is kept in a register called IBUF VALID COUNT 81. The value of this register is the previous IBUF VALID COUNT minus the number of bytes shifted plus the number of new bytes merged through MERGE MUX 72. Similarly it is necessary to know how many bytes remain in IBEX 64. Any bytes that have been moved into the IBUF 74 are considered invalid. As IBUF 74 becomes full the remaining bytes from the quadword of data or a complete new quadword are stored in IBEX. The number of valid bytes in IBEX is stored in a 'virtual' register called IBEX VALID COUNT. This is not a hardware register but the output from combinational logic that produces either the previous IBEX VALID COUNT minus the number of bytes merged into the IBUF 74 if IBEX is being selected into MUX 76, or eight bytes minus the number of bytes merged into the IBUF 74 if IBEX2 or VIC is selected into MUX 76.
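The two byte-count rules above can be written out as a brief Python sketch. The function names and the string values used to name the source selected into MUX 76 are hypothetical and serve only to illustrate the arithmetic stated in the text.

```python
def next_ibuf_valid_count(previous: int, bytes_shifted: int, bytes_merged: int) -> int:
    """IBUF VALID COUNT: previous count, minus bytes shifted, plus bytes merged."""
    return previous - bytes_shifted + bytes_merged

def ibex_valid_count(previous: int, bytes_merged: int, mux76_source: str) -> int:
    """IBEX VALID COUNT as produced by combinational logic (not a register)."""
    if mux76_source == "IBEX":
        return previous - bytes_merged
    if mux76_source in ("IBEX2", "VIC"):
        return 8 - bytes_merged      # a fresh quadword supplies eight bytes
    raise ValueError("unknown MUX 76 source: " + mux76_source)

# Nine valid bytes, two consumed, two refilled from IBEX holding three bytes.
ibuf_count = next_ibuf_valid_count(previous=9, bytes_shifted=2, bytes_merged=2)
ibex_count = ibex_valid_count(previous=3, bytes_merged=2, mux76_source="IBEX")
```

In this sample case the IBUF stays full at nine bytes while one unmerged byte remains in IBEX.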
At the beginning of a program or after a branch or jump instruction is executed, it is desirable to load the IBUF 74 with entirely new data from the VIC 28. For this purpose, combinational logic 82 controlling the merge multiplexer 72 receives an IBUF VALID COUNT of zero so that all of the select lines S0-S8 are not asserted and the merge multiplexer 72 selects data from only the B0 to B8 inputs. Since none of the instructions in the IBUF 74 are valid they are discarded, and only the new instructions contained in ROTATOR 68 are presented to the IBUF 74.

In order to load new ISTREAM into the IBUF 74 from the VIC 28, the MERGE MUX 72 is used to select the number of bytes from the ROTATOR 68 to be merged with a select number of bytes from the shifter 70. If the signal SHIFT OP is asserted the output of the SHIFTER 70 will be the IBUF 74 bytes 0 through 8 shifted down by the number to shift, otherwise if SHIFT OP is not asserted the output of the shifter will be IBUF 74 byte 0 in position A0 with IBUF 74 bytes 1 through 8 shifted down by the number of bytes to shift.

Also when the IBUF 74 is initially loaded, there will be an offset between the address corresponding to the opcode and the start of the data from the VIC 28. In particular, this offset is given by the least significant bits of the program counter 26. As shown in FIG. 5, a quadword of ISTREAM (eight bytes) is delivered to the ROTATOR 68; thus, using the three least significant bits from the program counter 26 as the rotate value, the opcode byte is delivered to the B0 input of merge mux 72. For example, suppose the program branches to B0D (hexadecimal), i.e., the fifth byte of the second quadword in a block. The quadword address is B08 (hexadecimal) and the least significant three bits are 5, so when the VIC provides the quadword the ROTATOR 68 rotates by 5 bytes and delivers byte 5 to the B0 input of MERGE MUX 72.

In the general case, though, the rotate value is calculated using the formula:

rotate value = 8 - IBEX VALID COUNT - (IBUF VALID COUNT - NO. BYTES TO SHIFT)

For example, if there are nine valid bytes in the IBUF 74 and three in IBEX (bytes 5, 6, 7 of a quadword) and the number of bytes to shift is two, the rotate value is minus two, therefore the rotator shifts up by two (as the result was negative). Thus, the rotator 68 delivers byte 5 of the quadword in IBEX 64 to the B7 input on merge mux 72, and byte 6 to B8 (byte 7 is of no interest as it will not be merged; it is, however, delivered to the B0 input). Positive rotate values will cause the ROTATOR 68 to shift down. Thus, combinational logic 90 controlling the rotator 68 calculates the relevant rotate value.
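The rotate value arithmetic above can be checked with a small Python sketch. The function names are hypothetical; the formulas themselves are the ones stated in the text, applied to the two worked examples it gives.

```python
def rotate_value(ibex_valid_count: int, ibuf_valid_count: int, bytes_to_shift: int) -> int:
    """General-case rotate value computed by combinational logic 90."""
    return 8 - ibex_valid_count - (ibuf_valid_count - bytes_to_shift)

def initial_rotate_value(pc: int) -> int:
    """On an initial load, the three least significant PC bits are the rotate value."""
    return pc & 0x7

general_case = rotate_value(ibex_valid_count=3, ibuf_valid_count=9, bytes_to_shift=2)
branch_case = initial_rotate_value(0xB0D)
branch_quadword = 0xB0D & ~0x7   # quadword address requested from the VIC
```

A negative result corresponds to the "shift up" case in the text and a positive result to the "shift down" case.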

The control for the MERGE MUX in combinational logic 82 produces individual select lines S0 - S8 for the merge mux 72 such that the relevant bytes from the SHIFTER and ROTATOR are delivered to the IBUF 74. If SHIFT OP is not asserted then S0 always selects the A0 input such that the opcode byte remains in byte 0 of the IBUF 74. The remaining selects are calculated as follows:

MERGE VALUE = IBUF VALID COUNT - NO. BYTES TO SHIFT

Any select (S1-S8) less than MERGE VALUE selects the SHIFTER 70, and the rest select the ROTATOR 68.

For example, if there are eight valid bytes in the IBUF 74 and the number to shift is three, the merge value is five so S1, S2, S3, S4 select the output from the SHIFTER 70 but S5, S6, S7, S8 select the output from the ROTATOR 68.
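The select calculation can be sketched in Python as shown below. The function name `merge_selects` and the string labels are hypothetical; only select lines S1 through S8 are modeled, since the text treats S0 separately.

```python
from typing import Dict

def merge_selects(ibuf_valid_count: int, bytes_to_shift: int) -> Dict[int, str]:
    """Source chosen by each of the select lines S1..S8 of the merge mux."""
    merge_value = ibuf_valid_count - bytes_to_shift
    return {k: "SHIFTER" if k < merge_value else "ROTATOR" for k in range(1, 9)}

# Worked example from the text: eight valid bytes, three bytes to shift.
selects = merge_selects(ibuf_valid_count=8, bytes_to_shift=3)
from_shifter = [k for k, source in selects.items() if source == "SHIFTER"]
from_rotator = [k for k, source in selects.items() if source == "ROTATOR"]
```

With a valid count of zero the merge value is zero, so every modeled select falls through to the rotator, which matches the initial-load behavior described earlier.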

Since the ROTATOR 68 receives eight bytes of data but transmits nine bytes to the MERGE MUX 72, the nine bytes delivered to B0 - B8 inputs are never all valid. The ninth byte gets the same data as the first byte but it is only valid when the rotate value is negative.

Once an opcode has been loaded into the byte 0 position of the IBUF 74, the instruction decoder 32 examines it and the other bytes in the IBUF 74 to determine whether it is possible to simultaneously decode up to three operand specifiers. The instruction decoder 32 further separates the source operands from the destination operands. In particular, in a single cycle of the system clock, the instruction decoder 32 may decode up to two source operands and one destination operand. Flags indicating whether source operands or a destination operand are decoded for each cycle are transmitted from the instruction decoder 32 to the OPU 34.

The instruction decoder 32 simultaneously decodes up to three register specifiers per cycle. When a register specifier is decoded, its register address is placed on the transfer bus TR and sent to the source list queue 38 via a transfer unit 92 in the OPU 34.

The instruction decoder 32 may decode one short literal specifier per cycle. According to the VAX instruction architecture, the short literal specifier must be a source operand specifier. When the instruction decoder 32 decodes a short literal specifier, the short literal data is transmitted over a bus (EX) to an expansion unit 94 in the OPU 34.

Preferably the instruction decoder 32 is capable of decoding one complex specifier per cycle. The complex specifier data is transmitted by the instruction decoder 32 over a general purpose bus (GP) to a general purpose unit 96 in the OPU 34.

Once all of the specifiers for the instruction have been decoded, the instruction decoder 32 transmits the "shift op" signal to the shifter 70. The instruction decoder 32 also transmits a microprogram "fork" address to a fork queue in the queues 36, as soon as a valid opcode is received by the IBUF 74.

Referring now to FIG. 6, a schematic diagram of the shifter 70 is shown. The A0-A8 byte inputs of the merge multiplexer 72 are illustrated connected to the 8-bit outputs of a bank of multiplexers which comprise the shifter 70. It should be remembered that the purpose of the shifter 70 is to move the unused portion of the instruction stream contained in the IBUF 74 into those bytes of the IBUF 74 which were previously consumed by the instruction decoder 32. For example, if, during the previous cycle, the instruction decoder 32 used the three lowest bytes (0, 1, 2) of the IBUF 74, then in order to properly present the next instruction to the decoder 32, it is preferable to shift the remaining valid six bytes (3-8) into the low order six bytes of the IBUF 74.

Accordingly, the consumed low order bytes are no longer of any immediate use to the decoder 32 and are discarded. Thus, the shifter 70 need only move high order bytes into low order byte positions and does not rotate the low order bytes into the high order byte positions. This requirement simplifies the shifter configuration for the higher order bytes since each byte position only receives shifted bytes from those positions which are relatively higher. For example, byte position six only receives shifted bytes from its two higher order positions (7 and 8), while byte position one receives shifted bytes from its seven higher order positions (2-8).

To better describe this process, the internal configuration of one of the multiplexer banks is illustrated and generally shown at 102. The multiplexer bank 102 receives bytes 6, 7, and 8 from the IBUF 74 and delivers an output to the A6 input of the merge multiplexer 72. Within the multiplexer bank 102 is a group of eight 3-input multiplexers 102a-102h. The multiplexer 102a receives the zero bit of each of the input bytes 6, 7, and 8 at input locations 0, 1, and 2 respectively. Similarly, the multiplexers 102b-102h receive bits 1-7 respectively of the three input bytes. The select lines for each of the multiplexers 102a-102h are connected to the instruction decoder 32 and carry the 3-bit signal "number to shift". The "number to shift" signal is, of course, the number of bytes that were consumed by the instruction decoder 32. Therefore, it can be seen that the select lines of the multiplexers 102a-102h act to deliver all eight bits of the selected byte. For example, if the decoder 32 consumes two bytes of the ISTREAM, then the contents of the IBUF 74 are shifted by two bytes, such that byte eight is moved into the sixth byte location. Accordingly, the "number to shift" signal is set to the value two, thereby selecting the third input to the multiplexers 102a-102h. Thus, the byte eight position is selected and delivered to the merge multiplexer input A6.
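The bit-sliced arrangement of the multiplexer bank 102 can be modeled with the Python sketch below. The function name `mux_bank_102` and the sample byte values are hypothetical, and the sketch does not define behavior for select values above two, which the text does not describe for this bank.

```python
def mux_bank_102(byte6: int, byte7: int, byte8: int, number_to_shift: int) -> int:
    """Eight 3-input multiplexers, one per bit, sharing the same select value."""
    inputs = (byte6, byte7, byte8)             # input locations 0, 1 and 2
    output = 0
    for bit in range(8):                       # multiplexers 102a through 102h
        slice_inputs = [(value >> bit) & 1 for value in inputs]
        output |= slice_inputs[number_to_shift] << bit
    return output

# Hypothetical IBUF contents for bytes 6, 7 and 8; a shift of two selects byte 8.
a6 = mux_bank_102(0x66, 0x77, 0x88, number_to_shift=2)
```

Because every bit slice uses the same select value, the bank as a whole behaves like a single byte-wide 3-input multiplexer.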

The internal structure of the remaining multiplexer banks 104-114 is substantially similar, varying only in the number of input bytes. The multiplexer bank 114 has an output connected to the A7 input of the merge multiplexer 72. The inputs to the multiplexer 114 include only bytes 7 and 8 of the IBUF 74. The multiplexer bank 112 has an output connected to the A5 input of the merge multiplexer 72. The inputs to the multiplexer 112 include bytes 5, 6, 7, and 8 of the IBUF 74. The multiplexer bank 110 has an output connected to the A4 input of the merge multiplexer 72. The inputs to the multiplexer 110 include bytes 4, 5, 6, 7, and 8 of the IBUF 74. The multiplexer bank 108 has an output connected to the A3 input of the merge multiplexer 72. The inputs to the multiplexer 108 include bytes 3, 4, 5, 6, 7, and 8 of the IBUF 74. The multiplexer bank 106 has an output connected to the A2 input of the merge multiplexer 72. The inputs to the multiplexer 106 include bytes 2, 3, 4, 5, 6, 7, and 8 of the IBUF 74.

The multiplexer bank 104 differs slightly from the other multiplexer banks, in that its output is directly connected to the merge multiplexer 72 and also the zero byte position of the IBUF 74. The byte zero case is additionally complicated by a requirement that in addition to the shifter 70 being capable of moving any of the higher order bytes into the zero byte position, the shifter 70 must also be capable of retaining the current zero byte while the remaining bytes are shifted. This feature is desired because byte zero contains the opcode. Thus, if the specifiers extend beyond the length of the IBUF 74, then the consumed bytes must be shifted out and new specifiers rotated in, but the opcode must remain until the entire instruction is decoded. Accordingly, the inputs to the multiplexer 104 include bytes 1, 2, 3, 4, 5, 6, 7, and 8 of the IBUF 74. However, the output of the multiplexer 104 is delivered to one input of a bank of multiplexers 116. The second input to the multiplexer bank 116 is connected to the zero byte position of the IBUF 74. A single bit select line is connected to the instruction decoder 32 through an OR gate 118, so that when the instruction decoder 32 issues either a "shift opcode" or an "FD shift opcode" signal, the select line is asserted and the output of the multiplexer 104 is delivered to the A0 input of the merge multiplexer 72. Otherwise, if neither of these signals is asserted, then byte 0 is selected and delivered to the A0 input of the merge multiplexer 72.
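The behavior of the shifter as a whole, including the opcode hold at byte zero, can be approximated by the following Python sketch. The function name `shifter_outputs` is hypothetical, positions that receive no IBUF byte are shown as `None` as a modeling convenience, and the treatment of a zero shift with "shift opcode" asserted is an assumption rather than something the text specifies.

```python
from typing import List, Optional

def shifter_outputs(ibuf: List[int], number_to_shift: int,
                    shift_opcode: bool, fd_shift_opcode: bool = False) -> List[Optional[int]]:
    """Values presented on merge multiplexer inputs A0..A8."""
    n = number_to_shift
    out: List[Optional[int]] = [ibuf[i + n] if i + n <= 8 else None for i in range(9)]
    if not (shift_opcode or fd_shift_opcode):   # OR gate 118 not asserted
        out[0] = ibuf[0]                        # multiplexer bank 116 retains byte 0
    return out

ibuf = list(range(9))                           # byte k holds the value k
opcode_consumed = shifter_outputs(ibuf, 3, shift_opcode=True)
opcode_held = shifter_outputs(ibuf, 2, shift_opcode=False)
```

In the second call the opcode stays in position zero while the remaining bytes still move down by two, which is the case needed when the specifiers extend beyond the length of the IBUF 74.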
Referring now to FIG. 7, there is shown a schematic diagram of the rotator 68. The B0-B8 byte inputs of the merge multiplexer 72 are illustrated as connected to the 8-bit outputs of a bank of multiplexers which comprise the rotator 68. It should be remembered that the purpose of the rotator 68 is to rotate the next quadword of ISTREAM so that the merge multiplexer 72 can fill the IBUF 74 with the valid low order bytes of the shifter 70 and the rotated high order bytes of the rotator 68. Further, unlike the shifter (70 in FIG. 5), each of the multiplexer banks in the rotator 68 is capable of delivering any of the input bytes at its output.

For example, if, during the previous cycle, the instruction decoder 32 uses the three lowest bytes (0, 1, 2) of the IBUF 74, then the shifter 70 moves the remaining valid six bytes (3-8) into the low order six bytes (0-5) of merge multiplexer inputs A0-A5. Thus, the rotator 68 rotates its low order three bytes into positions 6, 7, and 8 so that the merge multiplexer 72 can combine A0-A5 and B6-B8 to fill the IBUF 74. The low order three bytes available from the multiplexer 76 could be the low order three bytes of IBEX2 66 or the VIC 28 or any three consecutive bytes of IBEX 64.

To better describe this process, the internal configuration of one of the multiplexer banks is illustrated and generally shown at 132. The multiplexer bank 132 receives bytes 0-7 from either the VIC 28, IBEX 64, or IBEX2 66, as described in conjunction with FIGs. 4, 9, and 10. The output of the multiplexer bank 132 is delivered to the B4 input of the merge multiplexer 72. Within the multiplexer bank 132 is a group of eight 8-input multiplexers 132a-132h. The multiplexer 132a receives the zero bit of each of the input bytes 0-7 at multiplexer 132a input locations 4-3 (that is, locations 4, 5, 6, 7, 0, 1, 2, 3) respectively. Similarly, the multiplexers 132b-132h receive bits 1-7 respectively of all of the eight input bytes. The select lines for each of the multiplexers 132a-132h receive the 3-bit rotate value as described in conjunction with FIG. 5. The signal is, of course, the number of byte positions that the ISTREAM should be rotated to properly fill the IBUF 74.

It can be seen that if the rotate value is selected to be a value of three by the rotator control logic 90, the multiplexers 132a-132h will each select the input located at position three. Accordingly, bits 0-7 of input byte seven are selected and delivered to the B4 input of the merge multiplexer 72. Therefore, in response to a request for a three byte rotate, the input byte seven is delivered to byte position four.
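
The bit-sliced selection performed by multiplexer bank 132 can be illustrated with a short Python model. This is only a sketch: the function and variable names are hypothetical, and the wiring order (input byte i at multiplexer input location (i + 4) mod 8) is an assumption read off the description above, in which a rotate value of three selects input byte seven.

```python
def bank_132(input_bytes, rotate_value):
    """Hypothetical model of multiplexer bank 132 (drives merge-mux input B4).

    input_bytes: eight ints in 0..255 (bytes 0-7 from the VIC, IBEX or IBEX2).
    rotate_value: the 3-bit select value shared by multiplexers 132a-132h.
    """
    assert len(input_bytes) == 8 and 0 <= rotate_value <= 7

    # Assumed wiring: input byte i is connected to mux input location
    # (i + 4) % 8, so bytes 0-7 occupy locations 4, 5, 6, 7, 0, 1, 2, 3.
    location_to_byte = [0] * 8
    for byte_index in range(8):
        location_to_byte[(byte_index + 4) % 8] = byte_index

    out = 0
    for bit in range(8):  # one 8-input, 1-bit multiplexer per bit position
        # This bit of every input byte, ordered by mux input location.
        mux_inputs = [(input_bytes[location_to_byte[loc]] >> bit) & 1
                      for loc in range(8)]
        out |= mux_inputs[rotate_value] << bit
    return out
```

With eight distinct input bytes, a rotate value of three returns input byte seven, which matches the case described in the text.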

The remaining multiplexer banks 134-148 are substantially similar to the multiplexer bank 132, differing only in the order in which the input bytes are connected to the multiplexer banks 132-148. For example, the same request for a three byte rotate causes multiplexer bank 140 to deliver the sixth input byte to byte position three (B3).

Consider now the combined effect of the operation of the rotator 68 and shifter 70. Assume both IBUF 74 and IBEX 64 are full. Also assume that the decoder 32 has consumed the low order three bytes of the IBUF 74. The decoder 32 produces a value of three as the "number to shift" signal. The shifter 70 responds to this signal by relocating the ISTREAM so that positions A0-A8 of the merge multiplexer 72 respectively receive positions 3, 4, 5, 6, 7, 8, 6, 7, 8. At the same time the rotator control logic 90 delivers the rotate value to the rotator 68. The rotate value is set to the value minus six. Accordingly, the rotator 68 rotates its contents so that positions B0-B8 of the merge multiplexer 72 respectively receive positions 3, 4, 5, 6, 7, 8, 0, 1, 2. Therefore, the merge multiplexer successfully combines the two inputs to deliver the next nine bytes of ISTREAM to the IBUF 74 by selecting inputs A0-A5 and B6-B8.
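
The three-byte example above can be traced in a few lines of Python. This is only a sketch of that one case: the byte labels ("I3" for IBUF position 3, "R0" for position 0 of the rotator's source) and the function name are hypothetical, and the two position lists are copied from the example rather than derived from a general rule.

```python
IBUF_SIZE = 9  # the IBUF holds byte positions 0-8


def merge_three_byte_example():
    """Trace the worked example: the decoder has consumed three bytes."""
    ibuf = [f"I{i}" for i in range(IBUF_SIZE)]            # current IBUF contents
    rotator_source = [f"R{i}" for i in range(IBUF_SIZE)]  # next ISTREAM bytes

    # Source positions delivered to A0-A8 and B0-B8, as listed in the text.
    shifter_positions = [3, 4, 5, 6, 7, 8, 6, 7, 8]
    rotator_positions = [3, 4, 5, 6, 7, 8, 0, 1, 2]

    a_inputs = [ibuf[p] for p in shifter_positions]
    b_inputs = [rotator_source[p] for p in rotator_positions]

    # The merge multiplexer selects A0-A5 from the shifter and B6-B8
    # from the rotator.
    return a_inputs[:6] + b_inputs[6:]
```

The result is the six unconsumed IBUF bytes followed by the first three bytes of the new ISTREAM, which is the contiguous nine-byte window the text describes.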

Referring now to FIG. 8, there is shown a schematic diagram of the merge multiplexer 72 and merge multiplexer control logic 82. It should be remembered that the merge multiplexer 72 operates under control of the logic 82 to select the next nine bytes of ISTREAM from the two sets of 9 byte inputs from the rotator 68 and shifter 70. Generally, the low order bytes are selected from the shifter 70 while the rotator 68 fills the remaining high order byte positions.

The control logic 82 receives the "number to shift" signal (m) and the IBUF VALID COUNT and uses the values of these signals to select the proper input bytes.

The merge multiplexer 72 includes nine banks of multiplexers 150, 152, 154, 156, 158, 160, 162, 164, 166 with each bank receiving two byte position inputs, one byte each from the rotator 68 and shifter 70. Thus, the select line connected to each bank of multiplexers is asserted to select the rotator input and unasserted to select the shifter input.

To better describe this process, the internal configuration of one of the multiplexer banks is illustrated and generally shown at 150. The multiplexer bank 150 receives bits 0-7 from the zero byte position of both the shifter 70 (A00-A07) and rotator 68 (B00-B07). The output of the multiplexer bank 150 is delivered to the zero byte position of the IBUF 74. Contained within the multiplexer bank 150 is a group of eight 2-input multiplexers 150a-150h. The multiplexer 150a receives the zero bit of both of the zero position input bytes such that an asserted value on the select line delivers B00 and an unasserted value delivers A00. Similarly, the multiplexers 150b-150h receive bits 1-7 respectively of both of the input bytes. The select lines for each of the multiplexers 150a-150h receive a 1-bit select signal from the priority decoder 82 in order to commonly deliver all eight bits of the selected byte to the zero input position of the IBUF 74.

Within the control logic 82, the "number to shift" signal (m) is subtracted from the IBUF VALID COUNT to determine the lowest order byte position into which the rotator inputs should be delivered. The signal m is delivered to a 1s complement generator 168 to convert the signal m into a negative value. The signal -m is delivered to an adder 170 which performs the arithmetic operation and delivers the result to a 4:16 decoder 172. Accordingly, the lower order nine output bits of the decoder produce a single asserted signal at the numeric position corresponding to the lowest order byte position into which the rotator inputs should be delivered. Therefore, this asserted byte position and all higher order byte positions should be asserted to properly select rotator inputs at the corresponding multiplexers.

For example, as discussed previously, if the "number to shift" signal is set to a value of three, then the rotator inputs should be selected for byte positions 6 through 8. The output of the decoder 172 asserts only the line corresponding to byte position 6. Thus, a bank of OR gates 174 is connected to the outputs of the decoder 172 to provide asserted signals to the multiplexers corresponding to the asserted line and all higher order byte positions.
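
The select-line generation just described (subtract, decode to one-hot, then OR upward) can be sketched in Python as follows. This is an illustrative model only: the function name is hypothetical, and the hardware subtraction is modelled with ordinary integer arithmetic rather than a complement generator and adder.

```python
IBUF_SIZE = 9  # byte positions 0-8, one select line per merge-mux bank


def merge_select_lines(ibuf_valid_count, number_to_shift):
    """Return nine select bits: 1 selects the rotator (B), 0 the shifter (A)."""
    # Adder 170: IBUF VALID COUNT minus m is the lowest order byte position
    # into which a rotator byte should be delivered.
    first_rotator_position = ibuf_valid_count - number_to_shift
    assert 0 <= first_rotator_position <= 15  # range of a 4:16 decoder

    # Decoder 172: one-hot output; only the low order nine lines are used.
    one_hot = [1 if position == first_rotator_position else 0
               for position in range(IBUF_SIZE)]

    # OR gate bank 174: assert the decoded line and every higher order line.
    selects = []
    asserted = 0
    for line in one_hot:
        asserted |= line
        selects.append(asserted)
    return selects
```

For a full IBUF (valid count of nine) and a "number to shift" of three, the model selects the shifter for positions 0 through 5 and the rotator for positions 6 through 8.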

During normal operation the "number to shift" signal controls the operation of the merge multiplexer 72. However, at the beginning of a program or at a context switch, the "number to shift" signal is zero and the IBUF VALID COUNT is zero and the entire contents of the rotator 68 are loaded into the IBUF 74. Therefore, the output of the adder 170 is zero, enabling all of the outputs of the bank of OR gates 82. Thus, the select lines to the multiplexers 150-166 all act to select the B inputs and pass the entire contents of the rotator to the IBUF 74.

The control logic 78 for operating the multiplexer 76 of FIG. 4 selects either IBEX 64, IBEX2 66 or VIC 28 according to the following priority scheme.

The control logic 78 selects IBEX 64, IBEX2 66 or VIC 28 with a simple priority algorithm. If IBEX is not empty, then IBEX 64 is delivered to the ROTATOR 68; otherwise, if IBEX2 is valid, it is delivered to the ROTATOR 68; and if IBEX is empty and IBEX2 is not valid, VIC data is delivered to the ROTATOR 68.
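
The priority scheme can be written as a three-way selection. The following Python sketch is illustrative only; the function name, the boolean parameters and the string return values are hypothetical stand-ins for the hardware signals.

```python
def select_rotator_source(ibex_empty, ibex2_valid):
    """Pick the source that multiplexer 76 passes to the rotator."""
    if not ibex_empty:
        return "IBEX"    # highest priority: unused bytes already in IBEX
    if ibex2_valid:
        return "IBEX2"   # next: the stalled quadword held in IBEX2
    return "VIC"         # otherwise: data straight from the VIC
```

Note that IBEX wins even when IBEX2 is also valid, which keeps the bytes in instruction-stream order.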

IBEX is loaded each cycle with the data delivered by MUX 76, but it is marked empty either on a FLUSH or when all valid data on the ROTATOR 68 is consumed by the IBUF 74. In other words, IBEX VALID COUNT becomes non-zero when MUX 76 provides data to ROTATOR 68 that cannot find a place in IBUF 74. For example, after a branch or jump instruction has been executed, IBUF 74, IBEX 64 and IBEX2 are cleared (FLUSHED) and the VIC is accessed for the new ISTREAM. Assume it branches to the first byte of a block that is in the VIC 28. The first quadword from the VIC 28 is presented to MUX 76; this passes the data through the ROTATOR 68 and MERGE MUX to IBUF 74. IBEX is loaded with the data but is not marked valid, as all eight bytes went into the IBUF 74. In the following cycle the VIC 28 presents the second quadword to MUX 76, which passes it to the ROTATOR 68. Now assuming the DECODER 32 decodes less than eight bytes, say four bytes, the SHIFTER 70 shifts out 4 bytes, the ROTATOR 68 rotates by four and the MERGE MUX 82 passes four bytes from the shifter 70 and five bytes from the ROTATOR 68; then IBEX contains three unused bytes of ISTREAM, so IBEX VALID COUNT is set to three.
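
The byte accounting in this example can be checked with a small Python calculation. It is a sketch under stated assumptions: the function and parameter names are hypothetical, the IBUF is taken to hold nine bytes and a quadword eight, and the formula simply generalizes the two cycles described above.

```python
IBUF_SIZE = 9   # byte positions 0-8
QUADWORD = 8    # bytes delivered by MUX 76 in one cycle


def ibex_valid_count(ibuf_valid, consumed, delivered=QUADWORD):
    """Bytes left over in IBEX after the merge mux has refilled the IBUF."""
    kept_from_shifter = ibuf_valid - consumed      # bytes the decoder has not used
    free_positions = IBUF_SIZE - kept_from_shifter # room for rotator bytes
    accepted = min(delivered, free_positions)      # bytes that fit in the IBUF
    return delivered - accepted                    # the rest stays valid in IBEX
```

In the first cycle after the flush the IBUF is empty, so all eight bytes are accepted and the count is zero. In the second cycle the IBUF holds eight bytes, four are consumed, five new bytes are accepted and three remain in IBEX.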

IBEX2 can be considered a stall buffer for the VIC 28. Because of the pipelined nature of creating a new prefetch address, accessing the VIC strams, then checking for a VIC HIT, it is impractical to stop this process as soon as IBEX contains some valid bytes. Thus data from the VIC 28 is loaded into IBEX2 66 the cycle after IBEX 64 is loaded with some valid data, and IBEX2 66 is marked valid if it is a VIC HIT. Take the above example, where a branch to the first byte of a valid block in the VIC 28 is executed. The address of the first quadword is moved to PREFETCH PC in the first cycle. In the second cycle the first quadword is delivered to IBUF 74 and PREFETCH PC moves on to the second quadword. In the third cycle, the second quadword is delivered to IBUF 74 and IBEX 64 and the PREFETCH PC moves to the third quadword. In the fourth cycle, assuming DECODER 32 consumes no more bytes, the third quadword is delivered to IBEX2 and PREFETCH PC moves to the fourth quadword and we decide to stall. In the fifth cycle the VIC 28 delivers the fourth quadword to MUX 76 but IBEX 64 data is passed to the ROTATOR 68.

As can be seen in the above example, prefetching of ISTREAM can move significantly ahead of the instruction in the IBUF. One benefit of the VIC 28 is that accesses to the main cache 22 are significantly reduced. However, this benefit will be severely reduced if prefetching continues too far ahead of the decoded instruction stream. On average, a branch instruction occurs once in every sixteen bytes of ISTREAM so it is essential that prefetching does not access the main cache 22 unless there is a reasonable chance the data will be used. Thus, a request to the main cache for data is only made if there is a VIC MISS, IBEX2 is not valid and IBEX is empty. This usually means seven or eight bytes are still available to the DECODER 32 when the request for a VIC block is made.
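
The condition for issuing a main cache request is a conjunction of three terms, sketched below in Python. The function and parameter names are hypothetical; only the three conditions themselves come from the text.

```python
def should_request_main_cache(vic_hit, ibex2_valid, ibex_empty):
    """True only on a VIC MISS while both prefetch buffers hold nothing usable."""
    return (not vic_hit) and (not ibex2_valid) and ibex_empty
```

Any buffered prefetch data therefore suppresses the request, which is what limits how far prefetching runs ahead of the decoder.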

Referring now to FIG. 9, there is shown a block diagram of the two unit valid block store stram 58 of the virtual instruction cache 28. Since the VIC 28 is a virtual cache, it must be flushed on a context switch or REI instruction. In other words, all 256 of the 1-bit storage locations must be marked as invalid. Unfortunately, only one storage location can be marked as invalid during each clock cycle. Accordingly, it is possible that if all 256 bits are set to their valid condition, then it takes 256 clock cycles to clear the block valid stram 58.

As shown in FIG. 9, there are two block valid strams 220, 222 (BVSA, BVSB). One of the strams is used to determine if the presently requested address "hits" or "misses" in the VIC 28. While the first stram is determining hit/miss the second stram is being cleared at the rate of one storage location during each clock cycle. Therefore, assuming that 256 cycles have elapsed since the last context switch, then the second stram is clear and a context switch is accomplished in only a single cycle by switching the functions of the two strams. It should be appreciated that each stram 220, 222 is configured to perform either hit/miss determination or valid bit clearing. In fact, each context switch causes BVSA and BVSB to switch to the opposite function.
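
The alternating roles of the two block valid strams can be modelled in software as a pair of bit arrays that swap on each context switch. The Python class below is a behavioural sketch only: the class and method names are hypothetical, and raising an error when a switch is attempted before clearing has finished is an assumption made for the model, not a description of the hardware.

```python
class DualBlockValidStore:
    """Two 256-entry valid-bit stores: one answers hit/miss, one is cleared."""

    SIZE = 256

    def __init__(self):
        self.strams = [[False] * self.SIZE, [False] * self.SIZE]
        self.active = 0              # index of the stram used for hit/miss
        self.clear_addr = self.SIZE  # next location to clear; SIZE means done

    def is_valid(self, addr):
        return self.strams[self.active][addr]

    def mark_valid(self, addr):
        self.strams[self.active][addr] = True

    def clock(self):
        """One clock cycle: clear one location of the inactive stram."""
        if self.clear_addr < self.SIZE:
            self.strams[1 - self.active][self.clear_addr] = False
            self.clear_addr += 1

    def clearing_done(self):
        return self.clear_addr >= self.SIZE

    def context_switch(self):
        # Assumption: switching is refused until the inactive stram is clear.
        if not self.clearing_done():
            raise RuntimeError("inactive stram is still being cleared")
        self.active = 1 - self.active  # swap the two functions
        self.clear_addr = 0            # start clearing the old active stram
```

After a switch, the newly active stram reports every block as invalid at once, while the previous stram is cleared in the background over the next 256 calls to `clock()`.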

BVSA and BVSB each receive a single 8-bit address from respective multiplexers 224, 226. Both of the multiplexers 224, 226 receive a pair of addresses from the PC 26 and a reset control 228. In order to present the PC address to one of the strams 220, 222 and the reset address to the other stram 220, 222, the select lines to the multiplexers 224, 226 are operated in a complementary fashion.

The reset control 228 receives a CONTEXT SWITCH signal from the execution unit 20 and begins to sequentially present addresses 0-255 to the multiplexers 224, 226. One of the multiplexers 224, 226 passes these sequential addresses to the selected stram 220, 222, such that the 256 valid bits contained therein are reset over a period of 256 clock cycles.

In order to prevent the execution unit from initiating a context switch before one of the strams 220, 222 is reset, the reset control delivers a handshaking signal to indicate that the reset process is complete. An S-R flip flop 230 receives the handshaking signal at its set input, causing the flip flop 230 to latch a PROCEED WITH CONTEXT SWITCH signal to the execution unit 20. The SWITCH CONTEXT signal from the execution unit 20 is also connected to the reset input of the flip flop 230 so that the PROCEED WITH CONTEXT SWITCH signal is reset at the beginning of each context switch.

Control of the select lines to the multiplexers 224, 226 is provided by a J-K flip flop 232 which toggles between asserted and unasserted in response to each CONTEXT SWITCH signal. Both inputs of the flip flop 232 are connected to a logical "1" and the clock input is connected to the CONTEXT SWITCH signal. Thus, the Q output (USE BLOCK B) of the flip-flop 232 switches between "0" and "1" in response to a transition in the SWITCH CONTEXT signal. The select input of the multiplexer 224 is connected directly to the Q output of the flip-flop 232, while the select input of the multiplexer 226 is connected to the Q output of the flip-flop 232 through an inverter 234.
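
The behaviour of the two flip flops can be summarised in a small Python state model. This is a sketch only: the class and method names are hypothetical, and the power-up state of both flip flops (unasserted) is an assumption.

```python
class ContextSwitchControl:
    """Behavioural model of S-R flip flop 230 and J-K flip flop 232."""

    def __init__(self):
        self.proceed_with_context_switch = False  # Q of flip flop 230 (assumed reset)
        self.use_block_b = False                  # Q of flip flop 232 (assumed reset)

    def reset_complete(self):
        # Handshaking signal from the reset control drives the set input of 230.
        self.proceed_with_context_switch = True

    def switch_context(self):
        # SWITCH CONTEXT resets 230 and clocks 232; with J = K = 1 it toggles.
        self.proceed_with_context_switch = False
        self.use_block_b = not self.use_block_b

    def mux_selects(self):
        # Mux 224 sees Q directly; mux 226 sees Q through inverter 234.
        return self.use_block_b, not self.use_block_b
```

Because the two select values are always complementary, one stram receives the PC address while the other receives the reset address.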

In a similar fashion the block valid data (MARKER BLOCK VALID) from the PC unit (26 in FIG. 1) is multiplexed between the data inputs of the strams 220, 222 in response to the USE BLOCK B signal. For this purpose, the data input of the "B" stram 222 is connected to the MARKER BLOCK VALID line through an AND gate 237 which is enabled by the USE BLOCK B signal, and the data input of the "A" stram 220 is connected to the MARKER BLOCK VALID line through an AND gate enabled by the complement of the USE BLOCK B signal as provided by an inverter 239. Therefore, when the USE BLOCK B signal is asserted, the MARKER BLOCK VALID data is fed into the "B" stram 222 while the "A" stram receives zero data and is therefore cleared. Conversely, when the USE BLOCK B signal is not asserted, the MARKER BLOCK VALID data is fed into the "A" stram 220 while the "B" stram receives zero data and is therefore cleared.
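
The data-input gating reduces to two AND terms, shown here as a Python sketch. The function name and the tuple return are hypothetical; the gating itself follows the description above.

```python
def stram_data_inputs(marker_block_valid, use_block_b):
    """Return (data input of stram A, data input of stram B)."""
    b_data = marker_block_valid and use_block_b        # AND gate 237
    a_data = marker_block_valid and (not use_block_b)  # AND gate fed by inverter 239
    return a_data, b_data
```

Whichever stram is not in use always sees zero data, so every location written to it during the reset sequence is cleared.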

Finally, the valid bit outputs of the strams 220, 222 are connected to a pair of inputs to a multiplexer 236. The select line of the multiplexer 236 is also connected to the Q output of the flip flop 232 to operate in conjunction with the multiplexers 224, 226. Accordingly, the stram 220, 222 which is selected to receive the PC address is also selected to deliver its output as the BLOCK VALID BIT.

Claims (9)

1. An instruction buffer system for a digital computer for controlling the delivery of instruction stream bytes between a memory and an instruction decoder, said instruction stream bytes being grouped together into variable length instructions, and said instruction decoder including means for decoding each of said bytes within said variable length instruction, the instruction buffer system comprising:
an instruction buffer coupled between said memory and said instruction decoder and having multiple byte locations for receiving a first series of said instruction bytes, at least a portion of said first series of instruction bytes forming a variable length instruction to be decoded by said instruction decoder,
first and second prefetch buffers for storing a preselected number of a second, subsequent series of bytes of instruction stream,
means for delivering a shift signal responsive to the number of bytes of instruction stream contained in the variable length instruction currently being decoded by said decoder,
a shifter for receiving said shift signal and shifting the contents of said instruction buffer by a preselected number of bytes indicated by said shift signal,
means for merging said shifted bytes with at least a portion of the contents of one of said first and second prefetch buffers, and delivering said merged bytes to said instruction buffer,
means for refilling said first prefetch buffer with sequential bytes of said instruction stream when said first prefetch buffer is emptied, and
means for refilling said second prefetch buffer with sequential bytes of said instruction stream when said second prefetch buffer is emptied.
2. An instruction buffer system, as set forth in claim 1, wherein the merging means includes means for retrieving a preselected number of sequential bytes of instruction stream from one of the first and second prefetch buffers and loading said preselected number of bytes into the buffer locations from which instruction stream bytes have been removed by the shifter, said preselected number of sequential bytes of instruction stream being responsive to said predetermined number of bytes indicated by said shift signal.
3. An instruction buffer system, as set forth in claim 2, wherein the merging means includes means for receiving said sequential bytes of instruction stream retrieved from said first and second prefetch buffers and rotating the bytes by a preselected number of byte locations responsive to said predetermined number of bytes indicated by said shift signal before refilling said instruction buffer.
4. An instruction buffer system, as set forth in claim 1, wherein said means for refilling the first and second prefetch buffers refills the first and second prefetch buffers in response to the absence of said shift signal.
5. An instruction buffer system, as set forth in claim 1, including means for retrieving instruction stream bytes from the memory in response to both of said first and second prefetch buffers being empty of instruction stream bytes.
6. An instruction buffer system for a digital computer for controlling the delivery of instruction stream to an instruction decoder, said instruction stream being grouped into variable length instructions, and said instruction decoder including means for decoding each of said bytes within said variable length instruction, said instruction buffer system comprising:
an instruction buffer having a plurality of storage locations for receiving a preselected number of the next sequential bytes of instruction stream desired by the decoder and delivering said preselected number of instruction stream bytes to said decoder;
said instruction decoder including means for delivering a shift signal responsive to the number of bytes of the instruction stream located in the instruction buffer which are currently being decoded;
first means for prefetching and maintaining in a first prefetch buffer a first preselected number of sequential bytes of the instruction stream;
second means for prefetching and maintaining in a second prefetch buffer a second preselected number of sequential bytes of the instruction stream, said second preselected number of sequential bytes of the instruction stream being subsequent to the first preselected number of sequential bytes of the instruction stream;
a shifter coupled to said instruction buffer for receiving said shift signal and shifting the bytes of said instruction buffer by a preselected number of storage locations responsive to said shift signal and delivering the shifted bytes to the instruction buffer;

means for retrieving sequential bytes of the instruction stream from one of the first and second prefetch buffers and filling the instruction buffer storage locations from which bytes of the instruction stream have been removed by the shifter;
means for refilling said first prefetch buffers with instruction stream bytes in response to said first prefetch buffers being emptied by said means for retrieving; and
means for refilling said second prefetch buffer with instruction stream bytes in response to said second means being emptied by said means for retrieving.
7. An instruction buffer system, as set forth in claim 6, wherein the instruction buffer filling means includes means for receiving said sequential bytes of instruction stream retrieved from said first and second means and rotating the bytes by a preselected number of byte locations responsive to said shift signal before filling said instruction buffer.
8. An instruction buffer system, as set forth in claim 6, wherein said means for refilling said first and second prefetch buffers refills said first and second prefetch buffers in response to the absence of said shift signal.
9. An instruction buffer system, as set forth in claim 8, wherein said means for retrieving includes means for retrieving instruction stream from a memory in response to said first and second prefetch buffers being empty.
CA000607160A 1989-02-03 1989-08-01 Virtual instruction cache refill algorithm Expired - Fee Related CA1323937C (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US07/306,831 US5113515A (en) 1989-02-03 1989-02-03 Virtual instruction cache system using length responsive decoded instruction shifting and merging with prefetch buffer outputs to fill instruction buffer
US306,831 1989-02-03

Publications (1)

Publication Number Publication Date
CA1323937C true CA1323937C (en) 1993-11-02

Family

ID=23187060

Family Applications (1)

Application Number Title Priority Date Filing Date
CA000607160A Expired - Fee Related CA1323937C (en) 1989-02-03 1989-08-01 Virtual instruction cache refill algorithm

Country Status (7)

Country Link
US (1) US5113515A (en)
EP (1) EP0380854B1 (en)
JP (1) JPH02208728A (en)
AT (1) ATE156607T1 (en)
AU (1) AU628527B2 (en)
CA (1) CA1323937C (en)
DE (1) DE68928236T2 (en)

Families Citing this family (76)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2505887B2 (en) * 1989-07-14 1996-06-12 富士通株式会社 Instruction processing system
KR940000968B1 (en) * 1989-08-28 1994-02-07 니뽄 덴끼 가부시끼가이샤 Microprocessor
JPH0395629A (en) * 1989-09-08 1991-04-22 Fujitsu Ltd Data processor
DE69030648T2 (en) * 1990-01-02 1997-11-13 Motorola Inc Method for sequential prefetching of 1-word, 2-word or 3-word instructions
CA2045791A1 (en) * 1990-06-29 1991-12-30 Richard Lee Sites Branch performance in high speed processor
US5530941A (en) * 1990-08-06 1996-06-25 Ncr Corporation System and method for prefetching data from a main computer memory into a cache memory
US5493662A (en) * 1990-08-20 1996-02-20 Nec Corporation Apparatus for enabling exchange of data of different lengths between memories of at least two computer systems
EP0477598A2 (en) * 1990-09-26 1992-04-01 Siemens Aktiengesellschaft Instruction unit for a processor featuring 'n' processing elements
JPH04340145A (en) * 1991-05-17 1992-11-26 Nec Corp Cache memory device
GB2263565B (en) * 1992-01-23 1995-08-30 Intel Corp Microprocessor with apparatus for parallel execution of instructions
GB2263987B (en) * 1992-02-06 1996-03-06 Intel Corp End bit markers for instruction decode
GB2263985B (en) * 1992-02-06 1995-06-14 Intel Corp Two stage window multiplexors for deriving variable length instructions from a stream of instructions
JP3547740B2 (en) * 1992-03-25 2004-07-28 ザイログ,インコーポレイテッド High-speed instruction decoding pipeline processor
US5438668A (en) * 1992-03-31 1995-08-01 Seiko Epson Corporation System and method for extraction, alignment and decoding of CISC instructions into a nano-instruction bucket for execution by a RISC computer
US5572682A (en) * 1992-04-03 1996-11-05 Cyrix Corporation Control logic for a sequential data buffer using byte read-enable lines to define and shift the access window
US5471628A (en) * 1992-06-30 1995-11-28 International Business Machines Corporation Multi-function permutation switch for rotating and manipulating an order of bits of an input data byte in either cyclic or non-cyclic mode
US6735685B1 (en) 1992-09-29 2004-05-11 Seiko Epson Corporation System and method for handling load and/or store operations in a superscalar microprocessor
JP3644959B2 (en) 1992-09-29 2005-05-11 セイコーエプソン株式会社 Microprocessor system
US5367657A (en) * 1992-10-01 1994-11-22 Intel Corporation Method and apparatus for efficient read prefetching of instruction code data in computer memory subsystems
JPH06222990A (en) * 1992-10-16 1994-08-12 Fujitsu Ltd Data processor
CA2123442A1 (en) * 1993-09-20 1995-03-21 David S. Ray Multiple execution unit dispatch with instruction dependency
DE69427265T2 (en) 1993-10-29 2002-05-02 Advanced Micro Devices Inc Superskalarbefehlsdekoder
DE69434669T2 (en) * 1993-10-29 2006-10-12 Advanced Micro Devices, Inc., Sunnyvale Speculative command queue for variable byte length commands
US5689672A (en) * 1993-10-29 1997-11-18 Advanced Micro Devices, Inc. Pre-decoded instruction cache and method therefor particularly suitable for variable byte-length instructions
US5630082A (en) * 1993-10-29 1997-05-13 Advanced Micro Devices, Inc. Apparatus and method for instruction queue scanning
JP3442118B2 (en) * 1993-11-19 2003-09-02 富士通株式会社 Buffer circuit
US5604909A (en) 1993-12-15 1997-02-18 Silicon Graphics Computer Systems, Inc. Apparatus for processing instructions in a computing system
US5608885A (en) * 1994-03-01 1997-03-04 Intel Corporation Method for handling instructions from a branch prior to instruction decoding in a computer which executes variable-length instructions
US5600806A (en) * 1994-03-01 1997-02-04 Intel Corporation Method and apparatus for aligning an instruction boundary in variable length macroinstructions with an instruction buffer
US5559975A (en) * 1994-06-01 1996-09-24 Advanced Micro Devices, Inc. Program counter update mechanism
US5644752A (en) * 1994-06-29 1997-07-01 Exponential Technology, Inc. Combined store queue for a master-slave cache system
US5758116A (en) * 1994-09-30 1998-05-26 Intel Corporation Instruction length decoder for generating output length indicia to identity boundaries between variable length instructions
US5860096A (en) * 1994-10-17 1999-01-12 Hewlett-Packard Company Multi-level instruction cache for a computer
US5640526A (en) * 1994-12-21 1997-06-17 International Business Machines Corporation Superscaler instruction pipeline having boundary indentification logic for variable length instructions
US6525971B2 (en) * 1995-06-30 2003-02-25 Micron Technology, Inc. Distributed write data drivers for burst access memories
US5526320A (en) 1994-12-23 1996-06-11 Micron Technology Inc. Burst EDO memory device
US6006324A (en) 1995-01-25 1999-12-21 Advanced Micro Devices, Inc. High performance superscalar alignment unit
US5832249A (en) * 1995-01-25 1998-11-03 Advanced Micro Devices, Inc. High performance superscalar alignment unit
US5737550A (en) * 1995-03-28 1998-04-07 Advanced Micro Devices, Inc. Cache memory to processor bus interface and method thereof
US5822558A (en) * 1995-04-12 1998-10-13 Advanced Micro Devices, Inc. Method and apparatus for predecoding variable byte-length instructions within a superscalar microprocessor
US5991869A (en) * 1995-04-12 1999-11-23 Advanced Micro Devices, Inc. Superscalar microprocessor including a high speed instruction alignment unit
US5758114A (en) * 1995-04-12 1998-05-26 Advanced Micro Devices, Inc. High speed instruction alignment unit for aligning variable byte-length instructions according to predecode information in a superscalar microprocessor
US5680564A (en) * 1995-05-26 1997-10-21 National Semiconductor Corporation Pipelined processor with two tier prefetch buffer structure and method with bypass
US5809529A (en) * 1995-08-23 1998-09-15 International Business Machines Corporation Prefetching of committed instructions from a memory to an instruction cache
US5781789A (en) * 1995-08-31 1998-07-14 Advanced Micro Devices, Inc. Superscaler microprocessor employing a parallel mask decoder
US6093213A (en) * 1995-10-06 2000-07-25 Advanced Micro Devices, Inc. Flexible implementation of a system management mode (SMM) in a processor
US5809273A (en) * 1996-01-26 1998-09-15 Advanced Micro Devices, Inc. Instruction predecode and multiple instruction decode
US5794063A (en) * 1996-01-26 1998-08-11 Advanced Micro Devices, Inc. Instruction decoder including emulation using indirect specifiers
US5819056A (en) * 1995-10-06 1998-10-06 Advanced Micro Devices, Inc. Instruction buffer organization method and system
US5926642A (en) 1995-10-06 1999-07-20 Advanced Micro Devices, Inc. RISC86 instruction set
US5920713A (en) * 1995-10-06 1999-07-06 Advanced Micro Devices, Inc. Instruction decoder including two-way emulation code branching
US5796974A (en) * 1995-11-07 1998-08-18 Advanced Micro Devices, Inc. Microcode patching apparatus and method
US5740392A (en) * 1995-12-27 1998-04-14 Intel Corporation Method and apparatus for fast decoding of 00H and OFH mapped instructions
US5829010A (en) * 1996-05-31 1998-10-27 Sun Microsystems, Inc. Apparatus and method to efficiently abort and restart a primary memory access
US6981126B1 (en) * 1996-07-03 2005-12-27 Micron Technology, Inc. Continuous interleave burst access
US5907702A (en) * 1997-03-28 1999-05-25 International Business Machines Corporation Method and apparatus for decreasing thread switch latency in a multithread processor
US5872946A (en) * 1997-06-11 1999-02-16 Advanced Micro Devices, Inc. Instruction alignment unit employing dual instruction queues for high frequency instruction dispatch
US6170050B1 (en) 1998-04-22 2001-01-02 Sun Microsystems, Inc. Length decoder for variable length data
US7114056B2 (en) 1998-12-03 2006-09-26 Sun Microsystems, Inc. Local and global register partitioning in a VLIW processor
US6321325B1 (en) 1998-12-03 2001-11-20 Sun Microsystems, Inc. Dual in-line buffers for an instruction fetch unit
US7117342B2 (en) 1998-12-03 2006-10-03 Sun Microsystems, Inc. Implicitly derived register specifiers in a processor
US6694423B1 (en) * 1999-05-26 2004-02-17 Infineon Technologies North America Corp. Prefetch streaming buffer
EP1150213B1 (en) * 2000-04-28 2012-01-25 TELEFONAKTIEBOLAGET LM ERICSSON (publ) Data processing system and method
US7082516B1 (en) 2000-09-28 2006-07-25 Intel Corporation Aligning instructions using a variable width alignment engine having an intelligent buffer refill mechanism
WO2002065249A2 (en) 2001-02-13 2002-08-22 Candera, Inc. Storage virtualization and storage management to provide higher level storage services
US7203730B1 (en) 2001-02-13 2007-04-10 Network Appliance, Inc. Method and apparatus for identifying storage devices
US6738792B1 (en) 2001-03-09 2004-05-18 Advanced Micro Devices, Inc. Parallel mask generator
US7472231B1 (en) * 2001-09-07 2008-12-30 Netapp, Inc. Storage area network data cache
US7032136B1 (en) 2001-09-07 2006-04-18 Network Appliance, Inc. Auto regression test for network-based storage virtualization system
US7032097B2 (en) * 2003-04-24 2006-04-18 International Business Machines Corporation Zero cycle penalty in selecting instructions in prefetch buffer in the event of a miss in the instruction cache
US8806177B2 (en) * 2006-07-07 2014-08-12 International Business Machines Corporation Prefetch engine based translation prefetching
US10001993B2 (en) 2013-08-08 2018-06-19 Linear Algebra Technologies Limited Variable-length instruction buffer management
US11768689B2 (en) 2013-08-08 2023-09-26 Movidius Limited Apparatus, systems, and methods for low power computational imaging
US9672110B1 (en) 2015-09-22 2017-06-06 Amazon Technologies, Inc. Transmission time refinement in a storage system
US10254980B1 (en) 2015-09-22 2019-04-09 Amazon Technologies, Inc. Scheduling requests from data sources for efficient data decoding
US11204768B2 (en) 2019-11-06 2021-12-21 Onnivation Llc Instruction length based parallel instruction demarcator

Family Cites Families (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US3771138A (en) * 1971-08-31 1973-11-06 Ibm Apparatus and method for serializing instructions from two independent instruction streams
US4521850A (en) * 1977-12-30 1985-06-04 Honeywell Information Systems Inc. Instruction buffer associated with a cache memory unit
US4236206A (en) * 1978-10-25 1980-11-25 Digital Equipment Corporation Central processor unit for executing instructions of variable length
JPS5927935B2 (en) * 1980-02-29 1984-07-09 株式会社日立製作所 Information processing equipment
CA1174370A (en) * 1980-05-19 1984-09-11 Hidekazu Matsumoto Data processing unit with pipelined operands
US4500958A (en) * 1982-04-21 1985-02-19 Digital Equipment Corporation Memory controller with data rotation arrangement
JPS59123053A (en) * 1982-12-28 1984-07-16 Fujitsu Ltd Instruction control system
US4626988A (en) * 1983-03-07 1986-12-02 International Business Machines Corporation Instruction fetch look-aside buffer with loop mode control
US4602368A (en) * 1983-04-15 1986-07-22 Honeywell Information Systems Inc. Dual validity bit arrays
US4635194A (en) * 1983-05-02 1987-01-06 International Business Machines Corporation Instruction buffer bypass apparatus
JPS6051948A (en) * 1983-08-31 1985-03-23 Hitachi Ltd Branch destination buffer storage device
JPH0670773B2 (en) * 1984-11-01 1994-09-07 富士通株式会社 Advance control method
US4860192A (en) * 1985-02-22 1989-08-22 Intergraph Corporation Quadword boundary cache system
JP2539357B2 (en) * 1985-03-15 1996-10-02 株式会社日立製作所 Data processing device
US4853840A (en) * 1986-01-07 1989-08-01 Nec Corporation Instruction prefetching device including a circuit for checking prediction of a branch instruction before the instruction is executed
US4722050A (en) * 1986-03-27 1988-01-26 Hewlett-Packard Company Method and apparatus for facilitating instruction processing of a digital computer
EP0243879B1 (en) * 1986-04-23 1989-12-27 Siemens Aktiengesellschaft Method and arrangement to accelerate the transfer of an instruction to the instruction register of a micro-programme-controlled processor
JPH0810553B2 (en) * 1986-06-13 1996-01-31 松下電器産業株式会社 Memory circuit
JPS63163634A (en) * 1986-12-26 1988-07-07 Hitachi Ltd Instruction fetch system
DE3802025C1 (en) * 1988-01-25 1989-07-20 Otto 7750 Konstanz De Mueller

Also Published As

Publication number Publication date
EP0380854A2 (en) 1990-08-08
AU5394090A (en) 1991-12-19
ATE156607T1 (en) 1997-08-15
DE68928236D1 (en) 1997-09-11
EP0380854B1 (en) 1997-08-06
JPH02208728A (en) 1990-08-20
DE68928236T2 (en) 1998-03-12
US5113515A (en) 1992-05-12
EP0380854A3 (en) 1993-01-07
AU628527B2 (en) 1992-09-17

Legal Events

Date Code Title Description
MKLA Lapsed