CA1315896C - Store queue for a tightly coupled multiple processor configuration with two-level cache buffer storage - Google Patents

Store queue for a tightly coupled multiple processor configuration with two-level cache buffer storage

Info

Publication number
CA1315896C
CA1315896C CA000588790A CA588790A
Authority
CA
Canada
Prior art keywords
cache
data
store
level
processor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CA000588790A
Other languages
French (fr)
Inventor
Steven Lee Gregor
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
International Business Machines Corp
Original Assignee
International Business Machines Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by International Business Machines Corp filed Critical International Business Machines Corp
Application granted granted Critical
Publication of CA1315896C publication Critical patent/CA1315896C/en
Anticipated expiration legal-status Critical
Expired - Fee Related legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0893Caches characterised by their organisation or structure
    • G06F12/0897Caches characterised by their organisation or structure with two or more cache hierarchy levels
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0806Multiuser, multiprocessor or multiprocessing cache systems
    • G06F12/0811Multiuser, multiprocessor or multiprocessing cache systems with multilevel cache hierarchies

Abstract

ABSTRACT OF THE DISCLOSURE

A multiprocessor system includes a system of store queues and write buffers in a hierarchical first level and second level memory system including a first level store queue for storing instructions and/or data from a processor of the multiprocessor system prior to storage in a first level of cache, a second level store queue for storing the instructions and/or data from the first level store queue and a plurality of write buffers for storing the instructions and/or data from the second level store queue prior to storage in a second level of cache. The multiprocessor system includes hierarchical levels of caches, including a first level of cache associated with each processor, a single shared second level of cache shared by all the processors, and a third level of main memory connected to the shared second level cache. A first level store queue, associated with each processor, receives the data and/or instructions from its processor and stores the data and/or instructions in the first level of cache. A second level store queue, associated with each processor, receives the data and/or instructions from its first level store queue and temporarily stores the information therein. For sequential stores, the data and/or instructions are stored in corresponding second level write buffers. For non-sequential stores, the data and/or instructions bypass the corresponding second level write buffers and are stored directly in a final L2 cache write buffer. When stored in the second level write buffers, access to the shared second level cache is requested; and, when access is granted, the data and/or instructions are moved from the second level write buffers to the shared second level cache. When stored in the shared second level cache, corresponding obsolete entries in the first level of cache are invalidated before any other processor "sees" the obsolete data and the new data and/or instructions are over-written in the first level of cache.
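The storage path summarized in the abstract can be illustrated with a minimal, toy model. The class and attribute names below are ours, not the patent's; this is a sketch of the two-level store queue flow, the sequential/non-sequential routing, and the cross-invalidation of obsolete L1 entries, not a description of the actual hardware:

```python
from collections import deque

class StoreQueuePath:
    """Toy model of the described store path: L1 store queue -> L2 store
    queue -> (write buffers for sequential stores, bypassed for
    non-sequential stores) -> final L2 cache write buffer -> shared L2
    cache, followed by cross-invalidation of the other L1 caches."""

    def __init__(self, cpu_id, all_l1_caches):
        self.cpu_id = cpu_id
        self.all_l1_caches = all_l1_caches   # one {addr: value} dict per CPU
        self.l1_cache = all_l1_caches[cpu_id]
        self.l1_store_queue = deque()
        self.l2_store_queue = deque()
        self.l2_write_buffers = deque()      # staging for sequential stores
        self.final_write_buffer = deque()    # last stop before the L2 cache
        self.l2_cache = {}                   # shared second-level cache

    def store(self, addr, value, sequential):
        # Queue in the L1 store queue simultaneously with the L1 cache write.
        self.l1_store_queue.append((addr, value))
        self.l1_cache[addr] = value
        # Forward to the L2 store queue.
        self.l2_store_queue.append(self.l1_store_queue.popleft())
        entry = self.l2_store_queue.popleft()
        if sequential:
            # Sequential stores stage through the L2 write buffers.
            self.l2_write_buffers.append(entry)
            entry = self.l2_write_buffers.popleft()
        # Non-sequential stores arrive here directly, bypassing the buffers.
        self.final_write_buffer.append(entry)

    def l2_access_granted(self):
        # Arbitration won: drain into the shared L2 cache, then invalidate
        # the obsolete copies in every other processor's L1 cache.
        while self.final_write_buffer:
            addr, value = self.final_write_buffer.popleft()
            self.l2_cache[addr] = value
            for cpu, l1 in enumerate(self.all_l1_caches):
                if cpu != self.cpu_id:
                    l1.pop(addr, None)
```

Because each stage is a queue, the storing processor can continue issuing stores while an earlier store is still waiting for the shared L2 cache.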

Description


STORE QUEUE FOR A TIGHTLY COUPLED MULTIPLE PROCESSOR CONFIGURATION WITH TWO-LEVEL CACHE BUFFER STORAGE

The subject matter of this invention pertains to computing systems, and more particularly, to a store queue in a first level of memory hierarchy and a store queue and write buffers in a second level of cache memory hierarchy of a multiprocessor computer system for queuing data intended for storage in the second level of cache memory hierarchy.

Computing systems include multiprocessor systems. Multiprocessor systems comprise a plurality of processors, each of which may at some point in time require access to main memory. This requirement may arise simultaneously with respect to two or more of the processors in the multiprocessing system. Such systems may comprise intermediate level caches for temporarily storing instructions and data. For example, US Patents 4,445,174 and 4,442,487 disclose multiprocessor systems including intermediate first level cache storage (L1 cache) and intermediate second level cache storage (L2 cache). These patents do not specifically disclose a system configuration which comprises an L1 cache for each processor of the multiprocessor system, an L2 cache connected to and shared by the individual L1 caches of each processor, and a main memory, designated L3, connected solely to the shared L2 cache. In such a configuration, if data and/or instructions, stored in an L1 cache of one processor of the multiprocessor system, should be moved for storage into the shared L2 cache, and ultimately into L3, and if the shared L2 cache is busy storing data and/or instructions associated with another of the processors of the multiprocessor system, the L1 cache of the one processor must wait until the shared L2 cache is no longer busy before it may begin its storage operations. Therefore, a queuing system is needed, for connection between the individual L1 caches of each processor and the shared L2 cache, to queue the data and/or instructions from the L1 cache of the one processor prior to storage in the shared L2 cache so that the one processor may begin another operation. In such a queuing system, when a first set of data and/or instructions are stored in the queue, the queue itself may be full; therefore, when the one processor intends to store a second set of data and/or instructions in the queue, as well as in its L1 cache, since the queue is full, the one processor cannot begin another operation until the queue is no longer full. Therefore, it is desirable that the queue be designed in stages, in a pipelined manner, such that the second set of data and/or instructions may be queued along with the first set of data and/or instructions. In this manner, multiple sets of data and/or instructions, intended for storage in an L1 cache, may be queued for ultimate storage in the shared L2 cache, thereby permitting continued operation of the one processor associated with the L1 cache.

In addition, when data is modified by one processor of a multiprocessor configuration, which includes a plurality of processors, a single main memory, and intermediate level caches L1 and L2, if the corresponding un-modified data is stored in the processor's cache, the modified data must be re-stored in the processor's cache. The modified data must be re-stored in its cache before the other processors may "see" the modified data. Therefore, some method of policing the visibility of the modified data vis-a-vis the other processors in the multiprocessor configuration is required. Furthermore, an apparatus is needed to maintain accurate control over access to main memory and the caches.
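The staged, pipelined queue argued for above can be sketched with a small illustration. The depth and names here are assumptions for the example only: the processor keeps running as long as a pipeline stage is free, and stalls only when every stage is occupied.

```python
from collections import deque

class StagedStoreQueue:
    """Pipelined queue between an L1 cache and the busy shared L2 cache."""

    def __init__(self, depth=2):
        self.stages = deque(maxlen=depth)  # pipeline stages holding stores
        self.l2_busy = True                # L2 serving another processor

    def try_store(self, entry):
        """Issue a store; return True if the processor may continue."""
        if len(self.stages) < self.stages.maxlen:
            self.stages.append(entry)      # queued: no stall
            return True
        return False                       # every stage full: processor waits

    def l2_cycle(self):
        """Drain one queued store once the shared L2 cache is free."""
        if not self.l2_busy and self.stages:
            return self.stages.popleft()
        return None
```

With a single-entry (unpipelined) queue, the second store would already stall the processor; the extra stage is what buys the continued operation described above.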
In this application, the apparatus for policing the visibility of the modified data vis-a-vis other processors and for maintaining control over access to the main memory and the caches is termed a "Storage Subsystem".

Accordingly, it is an object of the present invention to introduce a novel storage subsystem for a computer system including a system of novel store queues for queuing data intended for storage in an L1 cache and for queuing data intended for storage in a shared L2 cache of a Storage Subsystem.

It is a further object of the present invention to introduce a novel storage subsystem including a system of write buffers to be used in conjunction with the novel store queues for increasing the capacity of the store queues so that the various processors of the multiprocessor system may not be inhibited in their operation when attempting a store operation to the shared L2 cache and to main memory.

In accordance with this and other objects of the present invention, as shown in figure 6 of the drawings, a storage subsystem includes a system of store queues. A store queue is associated with a level one L1 cache and a store queue is associated with a level two L2 cache of a multiprocessor system. The multiprocessor system includes at least first and second processors, each processor having its own L1 cache. The L1 cache of each processor is connected to its own first store queue for queuing data prior to storage in the L1 cache. In addition, the two processors share a second level of cache, an L2 cache. The L2 cache is connected to its own second store queue as well. The L2 cache is connected to main memory, considered an L3 level of memory hierarchy. Further, a system of write buffers connects the second store queue to the L2 cache. When data is intended for storage in L1 cache, it is first queued in its first store queue simultaneously with storage in L1 cache. Once stored in L1 cache, the data must be stored in L2 cache; but, prior to storage in L2 cache, the data is first queued in the second store queue of the L2 cache. When stored in the second store queue, the data may be further stored in one of the write buffers which interconnect the second store queue to the L2 cache. Ultimately, the data is stored in a final L2 cache write buffer prior to actual storage in the L2 cache. The data "may be" stored in one of the write buffers since data and/or instructions associated with "sequential stores (SS)" are stored in the write buffers prior to actual storage in the final L2 cache write buffer, but data and/or instructions associated with "non-sequential stores (NS)" are not stored in the write buffers; rather, they bypass the write buffers and are stored directly into the final L2 cache write buffer from the second store queue. When access to the L2 cache is obtained, via arbitration, the data stored in the final L2 cache write buffer is stored in L2 cache. Once the data is stored in L2 cache, subsequent cross-invalidation requests, originating from the L2 cache, invalidate other corresponding entries of the data in the L1 caches. As a result of this use of store queues at the L1 cache, the store queue at the L2 cache level, and the write buffers interconnected between the L2 store queue and the L2 cache, one processor is not inhibited in its execution of various instructions even though the L2 cache is busy with a store operation associated with another of the processors in the multiprocessor system. Performance of the computer system of this invention is maximized and interference between processors at the L2 cache level is minimized.

Further scope of applicability of the present invention will become apparent from the detailed description presented hereinafter. It should be understood, however, that the detailed description and the specific examples, while representing a preferred embodiment of the invention, are given by way of illustration only, since various changes and

modifications within the spirit and scope of the invention will become obvious to one skilled in the art from a reading of the following detailed description.

BRIEF DESCRIPTION OF THE DRAWINGS

A full understanding of the present invention will be obtained from the detailed description of the preferred embodiment presented hereinbelow, and the accompanying drawings, which are given by way of illustration only and are not intended to be limitative of the present invention, and wherein:

figure 1 illustrates a uniprocessor computer system;

figure 2 illustrates a triadic computer system;

figure 3 illustrates a detailed construction of the I/D Caches (L1 cache), the I-unit, E-unit, and Control Store (C/S) illustrated in figures 1 and 2;

figure 4 represents another diagram of the triadic computer system of figure 2;

figure 5 illustrates a detailed construction of the storage subsystem of figure 4;

figure 6, which comprises figures 6a through 6c, illustrates a detailed construction of a portion of the L2 cache/Bus Switching Unit of figure 5 in accordance with the present invention;

figure 7 illustrates a construction of the L1 store queue of figure 6;

figure 8 illustrates the L1 field address registers connected to the L1 store queue of figures 6 and 7;

figure 9 illustrates the L2 store queue of figure 6;


figure 10 illustrates the L2 hold registers and write buffers connected to the L2 store queue of figures 6 and 9; and

figures 11 through 49 illustrate time line diagrams corresponding to the store routines associated with the triadic computer system of figure 2.

DESCRIPTION OF THE PREFERRED EMBODIMENT

Referring to figure 1, a uniprocessor computer system of the present invention is illustrated.

In figure 1, the uniprocessor system comprises an L3 memory 10 connected to a storage controller (SCL) 12. On one end, the storage controller 12 is connected to an integrated I/O subsystem controls 14, the controls 14 being connected to integrated adapters and single card channels 16. On the other end, the storage controller 12 is connected to I/D caches (L1) 18, which comprise an instruction cache and a data cache, collectively termed the "L1" cache. The I/D caches 18 are connected to an instruction unit (I-unit), Execution unit (E-unit), control store 20 and to a vector processor (VP) 22. The vector processor 22 is described in U.S. Patent No. 4,967,343, which issued October 30, 1990, entitled "High Performance Parallel Vector Processor". The uniprocessor system of figure 1 also comprises the multisystem channel communication unit 24.

The memory 10 comprises 2 "intelligent" memory cards, each memory card including a level three (L3) memory portion and an extended (L4) memory portion. The cards are "intelligent" due to the existence of certain specific features: error checking and correction, extended error checking and correction (ECC), refresh address registers and counters, and

bit spare capability. The interface to the L3 memory 10 is 8 bytes wide. Memory sizes are 8, 16, 32, and 64 megabytes. The L3 memory is connected to a storage controller (SCL) 12.

The storage controller 12 comprises three bus arbiters arbitrating for access to the L3 memory 10, to the I/O subsystem controls 14, and to the I/D caches 18. The storage controller further includes a directory which is responsible for searching the instruction and data caches 18, otherwise termed the L1 cache, for data. If the data is located in the L1 caches 18, but the data is obsolete, the storage controller 12 invalidates the obsolete data in the L1 caches 18, thereby allowing the I/O subsystem controls 14 to update the data in the L3 memory 10. Thereafter, instruction and execution units 20 must obtain the updated data from the L3 memory 10. The storage controller 12 further includes a plurality of buffers for buffering data being input to L3 memory 10 from the I/O subsystem controls 14 and for buffering data being input to L3 memory 10 from instruction/execution units 20. The buffer associated with the instruction/execution units 20 is a 256 byte line buffer which allows the building of entries 8 bytes at a time for certain types of instructions, such as sequential operations. This line buffer, when full, will cause a block transfer of data to L3 memory to occur. Therefore, memory operations are reduced from a number of individual store operations to a much smaller number of line transfers.
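The line buffer behavior just described can be sketched as follows. The sizes come from the text; the interface is our own illustration: sequential 8-byte entries accumulate, and only a full line triggers a single block transfer to L3 memory.

```python
class LineBuffer:
    """256-byte line buffer built up 8 bytes at a time; a full line is
    sent to L3 memory as one block transfer instead of 32 stores."""
    LINE_SIZE = 256
    ENTRY_SIZE = 8

    def __init__(self):
        self.entries = []
        self.block_transfers = 0    # completed line transfers to L3 memory

    def store_entry(self, data):
        self.entries.append(data)
        if len(self.entries) * self.ENTRY_SIZE == self.LINE_SIZE:
            self.block_transfers += 1   # one transfer replaces 32 stores
            self.entries.clear()
```

Thirty-two sequential 8-byte stores thus cost one memory transaction instead of thirty-two.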
The instruction cache/data cache 18 are each 16K byte caches. The interface to the storage controller 12 is 8 bytes wide; thus, an inpage operation from the storage controller 12 takes 8 data transfer cycles. The data cache 18 is a "store through" cache, which means that data from the instruction/execution units 20 are stored in L3 memory and, if the corresponding obsolete data is not present in the L1 caches 18, the data is not


brought into and stored in the L1 caches. To assist this operation, a "store buffer" is present with the L1 data cache 18 which is capable of buffering up to 8 store operations.

The vector processor 22 is connected to the data cache 18. It shares the dataflow of the instruction/execution unit 20 into the storage controller 12, but the vector processor 22 will not, while it is operating, permit the instruction/execution unit 20 to make accesses into the storage controller 12 for the fetching of data.

The integrated I/O subsystem 14 is connected to the storage controller 12 via an 8-byte bus. The subsystem 14 comprises three 64-byte buffers used to synchronize data coming from the integrated I/O subsystem 14 with the storage controller 12. That is, the instruction/execution unit 20 and the I/O subsystem 14 operate on different clocks, the synchronization of the two clocks being achieved by the three 64-byte buffer structure.

The multisystem channel communication unit 24 is a 4-port channel to channel adapter, packaged externally to the system.

Referring to figure 2, a triadic (multiprocessor) system is illustrated.
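Before moving on to the triadic system, the "store through" policy of the L1 data cache described above can be sketched as follows (function and parameter names are illustrative): every store updates L3 memory, the L1 copy is updated only on a hit, and a miss does not allocate a line.

```python
def store_through(addr, value, l1_cache, l3_memory):
    """Sketch of a store to a store-through L1 data cache."""
    l3_memory[addr] = value      # data always goes to L3 memory
    if addr in l1_cache:
        l1_cache[addr] = value   # update the L1 copy only on a hit
    # on a miss, the line is NOT brought into the L1 cache
```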
In figure 2, a Storage Subsystem 26 is connected to the ports of a pair of L3 memories 10a/10b. The Storage Subsystem 26 includes a bus switching unit (BSU) 26a and an L2 cache 26b. The Storage Subsystem 26 will be set forth in more detail in figure 5. The BSU 26a is connected to the integrated I/O subsystem 14, to shared channel processor A (SHCP-A) 28a, to shared channel processor B (SHCP-B) 28b, and to three processors: a first processor including instruction/data caches 18a and instruction/execution units/control store 20a, a second processor including instruction/data

caches 18b and instruction/execution units/control store 20b, and a third processor including instruction/data caches 18c and instruction/execution units/control store 20c. Each of the instruction/data caches 18a, 18b, 18c are termed "L1" caches. The cache in the Storage Subsystem 26 is termed the L2 cache 26b, and the main memory 10a/10b is termed the L3 memory.

The Storage Subsystem 26 connects together the three processors 18a/20a, 18b/20b, and 18c/20c, the two L3 memory ports 10a/10b, the two shared channel processors 28, and an integrated I/O subsystem 14. The Storage Subsystem 26 comprises circuits which decide the priority for requests to be handled, such as requests from each of the three processors to L3 memory, or requests from the I/O subsystem 14 or shared channel processors, circuits which operate the interfaces, and circuits to access the L2 cache 26b. The L2 cache 26b is a "store in" cache, meaning that operations which access the L2 cache, to modify data, must also modify data resident in the L2 cache (the only exception to this rule is that, if the operation originates from the I/O subsystem 14, and if the data is resident only in L3 memory 10a/10b and not in L2 cache 26b, the data is modified only in L3 memory, not in L2 cache).

The interface between the Storage Subsystem 26 and L3 memories 10a/10b comprises two 16-byte ports in lieu of the single 8-byte port in figure 1. However, the memory 10 of figure 1 is identical to the memory cards 10a/10b of figure 2. The two memory cards 10a/10b of figure 2 are accessed in parallel.

The shared channel processor 28 is connected to the Storage Subsystem 26 via two ports, each port being an 8-byte interface. The shared channel processor 28 is operated at a frequency which is independent of the BSU 26, the clocks within the BSU being synchronized with the clocks in the shared channel

processor 28 in a manner which is similar to the clock synchronization between the storage controller 12 and the integrated I/O subsystem 14 of figure 1.

A functional description of the operation of the uniprocessor computer system of figure 1 will be set forth in the following paragraphs with reference to figure 1.

Normally, instructions are resident in the instruction cache (L1 cache) 18, waiting to be executed. The instruction/execution unit 20 searches a directory disposed within the L1 cache 18 to determine if the typical instruction is stored therein. If the instruction is not stored in the L1 cache 18, the instruction/execution unit 20 will generate a storage request to the storage controller 12. The address of the instruction, or the cache line containing the instruction, will be provided to the storage controller 12. The storage controller 12 will arbitrate for access to the bus connected to the L3 memory 10. Eventually, the request from the instruction/execution unit 20 will be passed to the L3 memory 10, the request comprising a command indicating a line in L3 memory is to be fetched for transfer to the instruction/execution unit 20. The L3 memory will latch the request, decode it, select the location in the memory card wherein the instruction is stored, and, after a few cycles of delay, the instruction will be delivered to the storage controller 12 from the L3 memory in 8-byte increments. The instruction is then transmitted from the storage controller 12 to the instruction cache (L1 cache) 18, wherein it is temporarily stored. The instruction is re-transmitted from the instruction cache 18 to the instruction buffer within the instruction/execution unit 20. The instruction is decoded via a decoder within the instruction unit 20.
Quite often, an operand is needed in order to execute the instruction, the operand being resident in memory 10. The instruction/execution unit 20 searches the directory

in the data cache 18; if the operand is not found in the directory of the data cache 18, another storage access is issued by the instruction/execution unit 20 to access the L3 memory 10, exactly in the manner described above with respect to the instruction cache miss. The operand is stored in the data cache, the instruction/execution unit 20 searching the data cache 18 for the operand. If the instruction requires the use of microcode, the instruction/execution unit 20 makes use of the microcode resident on the instruction/execution unit 20 card. If an input/output (I/O) operation need be performed, the instruction/execution unit 20 decodes an I/O instruction, resident in the instruction cache 18. Information is stored in an auxiliary portion of L3 memory 10, which is sectioned off from instruction execution. At that point, the instruction/execution unit 20 informs the integrated I/O subsystem 14 that such information is stored in L3 memory, the subsystem 14 processors accessing the L3 memory 10 to fetch the information.

A functional description of the operation of the multiprocessor computer system of figure 2 will be set forth in the following paragraphs with reference to figure 2.

In figure 2, assume that a particular instruction/execution unit, one of 20a, 20b, or 20c, requires an instruction and searches its own L1 cache, one of 18a, 18b, or 18c, for the desired instruction. Assume further that the desired instruction is not resident in the L1 cache. The particular instruction/execution unit will then request access to the Storage Subsystem 26 in order to search the L2 cache 26b disposed therein. The Storage Subsystem 26 contains an arbiter which receives requests from each of the instruction/execution units 20a, 20b, 20c, from the shared channel processor 28 and from the integrated I/O subsystem 14.
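A minimal fixed-priority sketch of such an arbiter follows. The actual priority rules are not given in the text above, so the ordering used here is purely an assumption for illustration:

```python
def arbitrate(pending,
              priority=("io", "shcp_a", "shcp_b", "cpu_a", "cpu_b", "cpu_c")):
    """Grant the highest-priority pending requester, or None if idle.

    `pending` is a set of requester names; the `priority` ordering is an
    assumed example, not the patent's actual rule.
    """
    for requester in priority:
        if requester in pending:
            return requester
    return None
```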
When the particular instruction/execution unit (one of 20a-20c) is

granted access to the Storage Subsystem 26 to search the L2 cache 26b, the particular instruction/execution unit searches the directory of the L2 cache 26b disposed within the Storage Subsystem 26 for the desired instruction. Assume that the desired instruction is found in the L2 cache 26b. In that case, the desired instruction is returned to the particular instruction/execution unit. If the desired instruction is not located within the L2 cache, as indicated by its directory, a request is made to the L3 memory, one of 10a or 10b, for the desired instruction. If the desired instruction is located in the L3 memory, it is immediately transmitted to the Storage Subsystem 26, 16 bytes at a time, and is bypassed to the particular instruction/execution unit (one of 20a-20c) while simultaneously being stored in the L2 cache 26b in the Storage Subsystem 26. Additional functions resident within the Storage Subsystem 26 relate to rules for storage consistency in a multiprocessor system. For example, when a particular instruction/execution unit 20c (otherwise termed "processor" 20c) modifies data, that data must be made visible to all other instruction/execution units, or "processors", 20a, 20b in the complex. If processor 20c modifies data presently stored in its L1 cache 18c, a search for that particular data is made in the L2 cache directory 26j of the Storage Subsystem 26. If found, the particular data in L2 cache is modified to reflect the modification in the L1 cache 18c. When this is done, the other processors 20a and 20b are permitted to see the modified, correct data now resident in the L2 cache 26b in order to permit such other processors to modify their corresponding data resident in their L1 caches 18a and 18b.
The subject processor 20c cannot re-access the particular data in L1 cache until the other processors 20a and 20b have had a chance to modify their corresponding data accordingly. The term "cross-interrogation" refers to checking other processors' L1 caches for copies of data modified by

the store request; cross-interrogation is used to eliminate such copies; other L1 caches are not updated with new data when the store occurs.

Referring to figure 3, a detailed construction of each instruction/execution unit (20 in figure 1 or one of 20a-20c in figure 2) and its corresponding L1 cache (18 in figure 1 or one of 18a-18c in figure 2) is illustrated.

In figure 1, and in figure 2, the instruction/execution unit 20, 20a, 20b, and 20c is disposed in a block labelled "I-unit E-unit C/S (92KB)". This block may be termed the "processor", the "instruction processing unit", or, as indicated above, the "instruction/execution unit". For the sake of simplicity in the description provided below, the block 20, 20a-20c will be called the "processor". In addition, the "I/D caches (L1)" will be called the "L1 cache". Figure 3 provides a detailed construction for the processor (20, 20a, 20b, or 20c) and for the L1 cache (18, 18a, 18b, or 18c).

In figure 3, a processor (one of 20, 20a-20c) comprises the following elements. A control store subsystem 20-1 comprises a high speed fixed control store 20-1a of 84k bytes, a pagable area (8k byte, 2k word, 4-way associative pagable area) 20-1b, a directory 20-1c for the pagable control store 20-1b, a control store address register (CSAR) 20-1d, and an 8-element branch and link (BAL STK) facility 20-1e. Machine state controls 20-2 include the global controls 20-2a for the processor, an op branch table 20-2b connected to the CSAR via the control store origin address bus and used to generate the initial address for microcoded instructions. An address generation unit 20-3 comprises 3 chips, a first being an instruction cache DLAT and L1 directory 20-3a, a second being a data cache DLAT and L1 directory 20-3b, and a third being an address generation chip 20-3c connected to
the L1 cache 18, 18a-18c via the address bus. The instruction DLAT and L1 directory 20-3a is connected to the instruction cache portion of the L1 cache via four "hit" lines which indicate that the requested instruction will be found in the instruction cache portion 18-1a of the L1 cache. Likewise, four "hit" lines connect the data DLAT and L1 directory 20-3b indicating that the requested data will be found in the data cache 18-2b portion of the L1 cache. The address generation unit 20-3 contains copies of the 16 general purpose registers used to generate addresses (see the GPR COPY 20-3d) and includes three storage address registers (SARs) 20-3e, used to provide addresses to the microcode for instruction execution. A fixed point instruction execution unit 20-4 is connected to the data cache 18-2 via the data bus (D-bus) and contains a local store stack (local store) 20-4a which contains the 16 general purpose registers mentioned above and a number of working registers used exclusively by the microcode; condition registers 20-4b which contain the results of a number of arithmetic and shift type operations and contain the results of a 370 condition code; a four-byte arithmetic logic unit (ALU) 20-4c; an 8-byte rotate merge unit 20-4d; and branch bit select hardware 20-4e which allows the selection of bits from various registers which determine the direction of a branch operation, the bits being selected from general purpose registers, working registers, and the condition registers. A floating point processor 20-5 includes floating point registers and four microcode working registers 20-5e, a command decode and control function 20-5a, a floating point adder 20-5b, a fixed point and floating point multiply array 20-5c, and a square-root and divide facility 20-5d. The floating point processor 20-5 is disclosed in U.S. Patent No.
4,916,652, which issued April 10, 1990, entitled "Dynamic Multiple Instruction Stream Multiple Data Multiple Pipeline Apparatus for Floating Point Single Instruction Stream Single Data Architectures".



The ALU 20-4c contains an adder, the adder being disclosed in U.S. Patent No. …,617, which issued April 3, 1990, entitled "A High Performance Parallel Binary Byte Adder". An externals chip 20-6 includes timers and an interrupt structure, the interrupts being provided from the I/O subsystem 14, and others. An interprocessor communication facility (IPC) 20-7 is connected to the storage subsystem via a communication bus, thereby allowing the processors to pass messages to each other and providing access to the time of day clock.

In figure 3, the L1 cache (one of 18, 18a, 18b, or 18c) comprises the following elements. An instruction cache 18-1 comprises a 16k byte/4-way cache 18-1a, a 16-byte instruction buffer 18-1b at the output thereof, and an 8-byte inpage register 18-1c at the input from storage.
The storage bus, connected to the instruction cache 18-1, is eight bytes wide, being connected to the inpage register 18-1c. The inpage register 18-1c is connected to the control store subsystem 20-1 and provides data to the subsystem in the event of a pagable control store miss, when new data must be brought into the control store. A
data cache 18-2 comprises: an inpage buffer 18-2a also connected to the storage bus; a data cache 18-2b which is a 16k byte/4-way cache; a cache dataflow 18-2c which comprises a series of input and output registers and is connected to the processor via an 8-byte data bus (D-bus) and to the vector processor (22a-22c) via an 8-byte "vector bus"; and an 8-element store buffer (STOR BFR) 18-2d.
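The geometry just given for the L1 arrays (16k bytes, 4-way associative) determines how an absolute address divides into tag, congruence class, and byte offset. The following is a minimal sketch of that division; the line size is an assumption made for illustration (128 bytes), since this passage does not state it:

```python
# Sketch of the 16KB / 4-way L1 geometry described above.
# LINE_BYTES is an assumption for illustration; the text does not state it.
CACHE_BYTES = 16 * 1024
WAYS = 4
LINE_BYTES = 128
SETS = CACHE_BYTES // (WAYS * LINE_BYTES)   # number of congruence classes

def split_address(addr):
    """Divide an absolute address into tag, congruence-class index, and byte offset."""
    offset = addr % LINE_BYTES
    index = (addr // LINE_BYTES) % SETS
    tag = addr // (LINE_BYTES * SETS)
    return tag, index, offset

print(SETS)                     # 32 classes under the assumed line size
print(split_address(0x12345))
```

With a different assumed line size, only SETS and the bit split change; the 4-way structure, and therefore the four directory entries compared per access, stay the same.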

A description of the functional operation of a processor and L1 cache shown in figure 3 will be provided in the following paragraphs with reference to figure 3 of the drawings.

Assume that an instruction to be executed is located in the instruction cache 18-1a. The instruction is fetched from the instruction cache 18-1a and is stored in the instruction buffer 18-1b (every attempt is made to keep the instruction buffer full at all times). The instruction is fetched from the instruction buffer 18-1b and is stored in the instruction registers of the address generation chip 20-3, the fixed point execution unit 20-4, and the machine state controls 20-2, at which point the instruction decoding begins. Operands are fetched from the GPR COPY 20-3d in the address generation unit 20-3 if an operand is required (normally, GPR COPY is accessed if operands are required for the base and index registers for an RX instruction). In the next cycle, the address generation process begins. The base and index register contents are added to a displacement field from the instruction, and the effective address is generated and sent to the data cache 18-2 and/or the instruction cache 18-1. In this example, an operand is sought. Therefore, the effective address will be sent to the data cache 18-2. The address is also sent to the data DLAT and L1 directory chip 20-3b (since, in this example, an operand is sought). Access to the cache and the directories will begin in the third cycle. The DLAT 20-3b will determine if the address is translatable from an effective address to an absolute address. Assuming that this translation has been previously performed, we will have recorded the translation. The translated address is compared with the output of the cache directory 20-3b.
Assuming that the data has previously been fetched into the cache 18-2b, the directory output and the DLAT output are compared; if they compare equal, one of the four "hit" lines is generated from the data DLAT and directory 20-3b. The hit lines are connected to the data cache 18-2b; a generated "hit" line will indicate which of the four associativity classes contains the data that we wish to retrieve. On the next cycle, the data cache 18-2b output is gated through a fetch alignment shifter in the cache dataflow 18-2c, is shifted appropriately, is transmitted along the D-bus to the fixed point execution unit 20-4, and is latched into the ALU 20-4c. This will be the access of operand 2 of an RX type of instruction. In parallel with this shifting process, operand 1 is accessed from the general purpose registers in local store 20-4a. As a result, two operands are latched in the input of the ALU 20-4c, if necessary. In the fifth cycle, the ALU 20-4c will process (add, subtract, divide, etc.) the two operands accordingly, as dictated by the instruction opcode. The output of the ALU 20-4c is latched and the condition registers 20-4b are latched, at the end of the fifth cycle, to indicate an overflow or zero condition. In the sixth cycle, the output of the ALU 20-4c is written back into the local store 20-4a and into the GPR copy 20-3d of the address generation unit 20-3 in order to keep the GPR copy 20-3d in sync with the content of the local store 20-4a. When the decode cycle of this instruction is complete, the decode cycle of the next instruction may begin, so that there will be up to six instructions in either decoding or execution at any one time. Certain instructions require the use of microcode to complete execution. Therefore, during the decode cycle, the op-branch table 20-2b is searched, using the opcode from the instruction as an address, the op-branch table providing the beginning address of the microcode routine needed to execute the instruction. These instructions, as well as others, require more than one cycle to execute. Therefore, instruction decoding is suspended while the op-branch table is being searched.
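The six cycles just described (decode, address generation, cache/directory access, operand alignment, ALU execution, writeback) overlap, which is why up to six instructions are in flight at once. A toy model of that overlap, not of the actual hardware timing:

```python
# Toy model of the six overlapped cycles described above: one new
# instruction enters decode per cycle, so up to six are in flight.
STAGES = ["decode", "agen", "cache", "align", "execute", "writeback"]

def in_flight(instructions, cycle):
    """Return {instruction: stage} at the given cycle; instruction i enters decode at cycle i."""
    active = {}
    for i in instructions:
        stage = cycle - i
        if 0 <= stage < len(STAGES):
            active[i] = STAGES[stage]
    return active

# At cycle 5, instructions 0..5 occupy all six stages simultaneously.
snapshot = in_flight(range(10), cycle=5)
print(snapshot)
```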
In the case of microcode, the I-BUS is utilized to provide microinstructions to the decoding hardware. The instruction cache 18-1a is shut off, the control store 20-1a is turned on, and the microinstructions are passed over the I-BUS. For floating point instructions, decoding proceeds as previously described, except that, during the address generation cycle, a command is sent to the floating point unit 20-5 to indicate and identify the proper operation to perform. In an RX floating point instruction, for example, an operand is fetched from the data cache 18-2b, as described above, and the operand is transmitted to the floating point processor 20-5 in lieu of the fixed point processor 20-4. Execution of the floating point instruction is commenced. When complete, the results of the execution are returned to the fixed point execution unit 20-4, the "results" being the condition code and any interrupt conditions, such as overflow.

The following description represents an alternate functional description of the system set forth in figure 3 of the drawings.

In figure 3, the first stage of the pipeline is termed instruction decode. The instruction is decoded. In the case of an RX instruction, where one operand is in memory, the base and index register contents must be obtained from the GPR COPY 20-3d. A displacement field is added to the base and index registers. At the beginning of the next cycle, the addition of the base, index, and displacement fields is completed, to yield an effective address. The effective address is sent to the DLAT and directory chips 20-3a/20-3b. The high order portion of the effective address must be translated, but the low order portion is not translated and is sent to the cache 18-1a/18-2b. In the third cycle, the cache begins an access operation, using the bits it has obtained.
The DLAT directories are searched, using a virtual address to obtain an absolute address. This absolute address is compared with the absolute address kept in the cache directory. If this compare is successful, the "hit" line is generated and sent to the cache chip 18-1a/18-2b. Meanwhile, the cache chip has accessed all four associativity classes and latches an output accordingly. In the fourth cycle, one of the four "slots" or associativity classes is chosen, the data is aligned, and is sent across the data bus to the fixed or floating point processor 20-4, 20-5. Therefore, at the end of the fourth cycle, one operand is latched in the ALU 20-4c input. Meanwhile, in the processor, other instructions are being executed. The GPR COPY 20-3d and the local store 20-4a are accessed to obtain the other operand. At this point, both operands are latched at the input of the ALU 20-4c. One cycle is taken to do the computation, set the condition registers, and finally write the result in the general purpose registers in the GPR COPY 20-3d. The result may be needed, for example, for address computation purposes. Thus, the result would be input to the AGEN ADDER 20-3c. During the execution of certain instructions, no access to the caches 18-1a/18-2b is needed. Therefore, when instruction decode is complete, the results are passed directly to the execution unit without further delay (in terms of access to the caches). Therefore, as soon as an instruction is decoded and passed to the address generation chip 20-3, another instruction is decoded.

Referring to figure 4, another diagram of the data processing system of figure 2 is illustrated.
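The directory compare and slot selection just described can be sketched as follows: all four associativity classes are read in parallel, the translated absolute address is compared against the four directory entries, and at most one "hit" line selects the latched output. The directory contents and tags below are invented for illustration:

```python
# Sketch of the 4-way compare described above: the translated absolute
# address is matched against the four directory entries of one congruence
# class, and at most one "hit" line selects the latched slot.
# Directory contents are invented for illustration.
directory = ["0xA0", "0xB4", "0x17", "0xF2"]      # absolute-address tags, one per slot
slots = ["data-A", "data-B", "data-C", "data-D"]  # outputs latched from all 4 classes

def select(absolute_tag):
    hit_lines = [entry == absolute_tag for entry in directory]
    assert sum(hit_lines) <= 1        # at most one class can hold the line
    for hit, data in zip(hit_lines, slots):
        if hit:
            return data
    return None                       # directory miss: an inpage is required

print(select("0x17"))   # one hit line raised; that slot's latched data is chosen
print(select("0x99"))   # miss
```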
In figure 4, the data processing system is a multiprocessor system and includes: a storage subsystem 26; a first L1 cache storage 18a; a second L1 cache storage 18b; a third L1 cache storage 18c; a first processing unit 20a, including an instruction unit, an execution unit, and a control store, connected to the first L1 cache storage 18a; a first vector processing unit 22a connected to the first L1 cache storage 18a; a second processing unit 20b, including an instruction unit, an execution unit, and a control store, connected to the second L1 cache storage 18b; a second vector processing unit 22b connected to the second L1 cache storage 18b; a third processing unit 20c, including an instruction unit, an execution unit, and a control store, connected to the third L1 cache storage 18c; and a third vector processing unit 22c connected to the third L1 cache storage 18c. A shared channel processor A 28a and a shared channel processor B 28b are jointly connected to the storage subsystem 26, and an integrated adapter subsystem 14, 16 is also connected to the storage subsystem 26.

Referring to figure 5, the storage subsystem 26 of figures 2 and 4 is illustrated.

In figure 5, the storage subsystem 26 includes an L2 control 26k, an L2 cache/bus switching unit 26b/26a, an L3/L4 memory 10a and an L3/L4 memory 10b connected to the L2 cache/bus switching unit 26b/26a, a memory control 26e connected to the L2 control 26k, a bus switching unit control 26f connected to the L2 cache/bus switching unit 26b/26a and to the memory control 26e, storage channel data buffers 26g connected to the bus switching unit control 26f and to the L2 cache/bus switching unit 26b/26a, an address/key control 26h connected to the memory control 26e and to the L2 control 26k, L3 storage keys 26i connected to the address/key control 26h, and a channel L2 cache directory 26j connected to the memory control 26e and to the address/key control 26h. The L2 control 26k includes an arbitration unit in the BSU portion of the storage subsystem 26, termed the "L2 cache priority". As noted later, the L2 cache priority decides if a request to store information in L2 cache 26b should be granted.

The storage subsystem maintains storage consistency among up to three central processing units through use of a shared, serially reusable L2 cache storage with separate store queues for each processor; to support channel devices, up to three channel interfaces are supported, with two parallel paths into L3/L4 storage.
The function of the storage subsystem is broken into several major units. Two of these units are considered master controllers in that they grant access to the critical resources, namely, the L2 cache and L3/L4 memory ports. The remaining units are considered subordinate to L2 control and memory control.

L2 Control 26K

L2 control provides the primary interface for the central processors to access the lower levels of the storage hierarchy: L2 cache, L3, and L4 memory. L2 control maintains a unique command/address interface with each processor in the configuration. Across this interface, each processor sends fetch requests from the L1 cache when an L1 cache miss occurs for a processor fetch request. All processor store requests are transferred to the L2 control across this interface as well. L2 control maintains request queues for each of the processors at the L2
cache level. An L2 cache priority circuit selects one request from among the pending requests for service on any given cycle. The L2 cache directory exists in L2 control and is used to determine if the selected request can complete by accessing data in L2 cache. If so, the request completes and is discarded. If the request cannot complete due to an L2 cache directory miss, the request remains pending in L2 control and a request is sent to memory control to cause an inpage for the desired data from L3 storage. L2 control is responsible for maintaining storage consistency throughout the configuration, keeping track of the L1 cache contents in L2 status arrays. When necessary, requests for L1 cache copy invalidation are sent to the necessary processors across their respective command/address interfaces.

Memory Control 26E

Memory control is the unit responsible for allocating access to the L3/L4 storage ports. Two independent ports exist, each containing half of the storage contents. Memory control queues all channel requests, up to a maximum of seven, as well as processor L2 cache miss requests, up to one per processor. From this queue of requests, memory control selects a request per memory port on which to operate. For the processor requests, this is the major function performed. For channels, however, memory control also controls access to the storage key arrays and the channel L2 cache directory. When memory control is given a channel request from address/key, it must first determine if the desired data exists in L2 cache by searching an L2 cache
directory copy labeled the channel L2 cache directory. Additionally, memory control determines if the access is allowed by comparing the storage key associated with the request against the storage key in the storage key arrays. If no protection exceptions are found, memory control allows the request to contend for L3 access. When selected by L3 priority, the request is directed to L2 control if a channel L2 cache directory search has resulted in a hit, or to the L3 port if the search resulted in an L2 cache directory miss.

Address/Key Control 26H

Address/key control has two primary functions. First, it is the command/address interface for the external channel devices, supporting three separate channel interfaces. It accepts storage requests from the channel devices, converting them to the storage subsystem clock rate, and queues them in internal buffers. It also forwards the requests to memory control. Address/key returns the status of all channel operations to the channel subsystem as well. The other function is to support the storage key arrays and reference/change (R/C) bit arrays. The key arrays support the storage keys required by the S/370 architecture. Memory control is the primary controller for granting access to these arrays. Address/key controls granting access to the R/C arrays, which are used by processor L2 cache accesses to update the reference and change bits of the storage key arrays. Multiple copies of the R/C bits exist and must be merged on request by address/key.

Bus Switching Unit (BSU) Control 26F

BSU control represents the primary controller for the L2 cache/BSU data flow and storage channel data buffer (SCDB) data flow. It is the focal point for the L2 control and memory control requests to move data throughout the storage subsystem. BSU control manages the data buses capable of moving information to/from the L2 cache, L3/L4 ports, and SCDBs.

L2 Cache/Bus Switching Unit 26B/26A Data Flow

The L2 cache data arrays reside here. Each central processor has an eight-byte bidirectional data interface into the L2 cache data flow unit. This supports data movement from processor to L2 cache or L3/L4 storage as well as from L2 cache or L3/L4 storage to the processor. Two 16-byte interfaces, one to each L3/L4 port, are also supported in this unit. Last, two 32-byte interfaces to the storage channel data buffers are maintained here. These interfaces support data movement between the SCDBs and the L2 cache or L3/L4 storage.

Storage Channel Data Buffers 26G

To support the three channel storage interfaces, the storage channel data buffer unit maintains a set of buffers for each independent channel data interface. This supports movement of data from the channel devices to L2 cache or L3/L4 storage, and back again. The channel data buffer controls are split, some coming from the channel interface itself, some from the storage subsystem (BSU control). The storage channel data buffer unit also supports the memory buffer for allowing L3/L4 memory-to-memory transfers requested by the central processors.

In figure 5, the L2 cache/bus switching unit 26b/26a generates three output signals: cp0, cp1, and cp2.
The L2 control 26k also generates three output signals: cp0, cp1, and cp2. The cp0 output signal of the L2 cache/bus switching unit 26b/26a and the cp0 output signal of the L2 control 26k jointly comprise the output signal from storage subsystem 26 of figure 1 energizing the first L1 cache storage 18a. Similarly, the cp1 output signals from the L2 cache/bus switching unit 26b/26a and L2 control 26k jointly comprise the output signal from storage subsystem 26 of figure 1 energizing the second L1 cache storage 18b, and the cp2 output signals from the L2 cache/bus switching unit 26b/26a and L2 control 26k jointly comprise the output signal from storage subsystem 26 of figure 1 energizing the third L1 cache storage 18c.

In figure 5, the storage channel data buffers 26g generate three output signals: shcpa, shcpb, and nio, where shcpa refers to shared channel processor A 28a, shcpb refers to shared channel processor B 28b, and nio refers to the integrated I/O and adapter subsystem 14/16. Similarly, the address/key control 26h generates the three output signals shcpa, shcpb, and nio. The shcpa output signal from the storage channel data buffers 26g in conjunction with the shcpa output signal from the address/key control 26h jointly comprise the output signal generated from the storage subsystem 26 of figure 1 to the shared channel processor A 28a. The shcpb output signal from the storage channel data buffers 26g in conjunction with the shcpb output signal from the address/key control 26h jointly comprise the output signal generated from the storage subsystem 26 of figure 1 to the shared channel processor B 28b. The nio output signal from the storage channel data buffers 26g in conjunction with the nio output signal from the address/key control 26h jointly comprise the output signal generated from the storage subsystem 26 of figure 1 to the integrated adapter subsystem 14/16.

Referring to figure 6, a detailed construction of a portion of the L2 cache/BSU 26b/26a of figure 5 is illustrated, this portion of the L2 cache/BSU 26b/26a including an L2 store queue arrangement in association with the L2 cache. Further, in figure 6, a detailed construction of the L1 cache storage 18a, 18b, 18c of figure 4 is illustrated, the L1 cache storage including an L1 store queue arrangement in association with the L1 cache.

In figure 6, the L1 cache storage 18a of figure 4 comprises the L1 cache 18a connected to an L1 store queue 18a1. The L1 cache is connected, at its input, to an inpage data register (INPG DATA) 18a2, and is connected, at its output, to a fetch data (FETCH DATA) register 18a3. The L1 cache storage 18b of figure 4 comprises the L1 cache 18b connected to an L1 store queue 18b1. The L1 cache is connected, at its input, to an inpage data register (INPG DATA) 18b2, and is connected, at its output, to a fetch data (FETCH DATA) register 18b3. The L1 cache storage 18c of figure 4 comprises the L1 cache 18c connected to an L1 store queue 18c1. The L1 cache is connected, at its input, to an inpage data register (INPG DATA) 18c2, and is connected, at its output, to a fetch data (FETCH DATA) register 18c3. The L1 store queue 18a1 is connected to an L2 store queue 26a1. Similarly, the L1 store queue 18b1 is connected to an L2 store queue 26a2, and the L1 store queue 18c1 is connected to an L2 store queue 26a3. Therefore, each L1 store queue is uniquely associated with a specific processor of the multiprocessor configuration. Each L1 store queue is also uniquely associated with an L2 store queue. Therefore, each L2 store queue is uniquely associated with a specific processor of the multiprocessor configuration.
Each L2 store queue has certain write buffers connected to the output thereof. For example, L2 store queue 26a1 is connected, at its output, to an L2 write buffer 0 (L2WB-0) 26a1(a) and to an L2 write buffer 1 (L2WB-1) 26a1(b). The L2 store queue 26a1 is also connected, at its output, to storage subsystem L2 write buffer controls (SS L2WB CTLS) 26a1(c). L2 store queue 26a2 is connected, at its output, to an L2 write buffer 0 (L2WB-0) 26a2(a) and to an L2 write buffer 1 (L2WB-1) 26a2(b). The L2 store queue 26a2 is also connected, at its output, to storage subsystem L2 write buffer controls (SS L2WB CTLS) 26a2(c). L2 store queue 26a3 is connected, at its output, to an L2 write buffer 0 (L2WB-0) 26a3(a) and to an L2 write buffer 1 (L2WB-1) 26a3(b). The L2 store queue 26a3 is also connected, at its output, to storage subsystem L2 write buffer controls (SS L2WB CTLS) 26a3(c). The aforementioned write buffers and L2 store queues are all connected, at their outputs, to an L2 cache write buffer (L2 CACHE WB) 26a4, the L2 cache write buffer 26a4 being connected, at its output, to the L2 cache 26b. The aforementioned L2 write buffer controls (L2WB CTLS) are each connected, at their outputs, to an L2 address register (L2 ADDR) 26a5 which addresses the L2 cache. The L2 cache 26b is connected, at its output, to an L2 cache read buffer (L2 CACHE RB) 26a6, which is further connected, at its output, to an L1 0 inpage buffer (L10IPB) 26a7, to an L1 1 inpage buffer (L11IPB) 26a8, and to an L1 2 inpage buffer (L12IPB) 26a9. Inpage buffer 26a7 is connected to the aforementioned inpage data register 18a2; inpage buffer 26a8 is connected to the aforementioned inpage data register 18b2; and inpage buffer 26a9 is connected to the aforementioned inpage data register 18c2.
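The one-to-one wiring just described — each processor feeding its own L1 store queue, each L1 store queue feeding its own L2 store queue, with only the L2 cache itself shared — can be sketched as a data structure. The names here are illustrative stand-ins, not the hardware's:

```python
from collections import deque

# Sketch of the per-processor two-level store queue wiring described
# above: each processor owns one L1 store queue and one L2 store queue;
# only the L2 cache itself is shared. Names are illustrative.
class StoreQueuePair:
    def __init__(self):
        self.l1 = deque()   # e.g. 18a1
        self.l2 = deque()   # e.g. 26a1

    def enqueue_l1(self, request):
        self.l1.append(request)

    def transfer_to_l2(self):
        # A request moves down once prior stores have been transferred.
        if self.l1:
            self.l2.append(self.l1.popleft())

queues = {cpu: StoreQueuePair() for cpu in ("cp0", "cp1", "cp2")}
queues["cp0"].enqueue_l1({"addr": 0x100, "data": b"\x01"})
queues["cp0"].transfer_to_l2()
print(len(queues["cp0"].l2))
```

The point of the structure is isolation: one processor's pending stores never sit in another processor's queue, so interference appears only at the shared L2 cache arbitration.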
By way of background, the L1 store queues 18a1, 18b1, 18c1 and the L2 store queues 26a1, 26a2, 26a3 shown in figure 6 are designed to support the S/370 and 370-XA instruction sets, maximizing performance for a given processor while minimizing interference between processors at the highest level of common storage, the L2 cache 26a buffer storage. The store queue organization is structured as a two-level queue, assuming the attributes of the two-level cache storage. Each processor possesses its own store queue. At the L1 cache level, the L1 store queue controls will administer the enqueue of requests and maintain some storage consistency. At the L2 cache level, the L2 control 26k of figure 5 will administer the dequeue of requests and maintain global storage consistency between cache levels and processors. The store requests are divided into categories which allow the most efficient processing of storage at the shared L2 cache level. This store queue design maintains retriability of the instruction set while allowing instruction execution to proceed even when stores for the instructions have not completed to the highest level of common storage, the L2 cache buffer storage. This allows for improved machine performance by permitting the stores of one instruction to overlap with the pipeline execution stages of succeeding instructions, limited only by architectural consistency rules. The store queue design avoids the necessity to pretest instructions which may suffer from page faults in a virtual storage environment by delaying storing results to the highest level of common storage until the true end of the instruction. Also, an efficient mechanism for the machine to recover from such situations using only processor microcode is supported.

The 370-XA instruction set is divided into several types which process operands residing in real storage. These instructions can be split into two categories based on the length of the results stored to real storage: non-sequential stores (NS) and sequential stores (SS). The NS type consists primarily of the instructions whose operand lengths are implied by the instruction operation code.
Their result length is from one to eight bytes and they typically require a single store access to real storage. A special case is one where the starting address of the resultant storage field plus the length of the operand yields a result which crosses a doubleword boundary, requiring two store accesses to store the appropriate bytes into each doubleword. The SS type consists of the instructions whose operand lengths are explicitly noted within the text of the instruction or in general registers used by the instruction. Also, such instructions as store multiple may be classified this way. Their result length ranges from one to 256 bytes, requiring multiple store accesses. Results can be stored a single byte at a time, or in groups of bytes, up to eight per request. Special consideration is given to other types of instructions, requiring additional processing modes. Certain instructions require the ability to store multiple results to non-contiguous locations in real storage. A mode of operation allowing multiple NS types per instruction is supported for this type of instruction. Instructions that use explicit lengths for their operands may actually store only one to eight bytes. These store requests are converted in the storage subsystem to NS types for performance reasons which will be made clear later. In some situations, the ability to support a mixed mode of store queue operations is required: SS types, followed by NS types within the same instruction, support these instruction requirements. The other requirement to support the store request handling in the storage subsystem is that an end-of-operation (EOP) indicator be associated with each store request: '0'b implies no EOP; '1'b implies this is the last store in the instruction. The EOP indicator marks the successful completion of the 370-XA instruction and its associated store requests.
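The doubleword special case noted above is plain address arithmetic: if the starting address plus the operand length runs past an 8-byte boundary, one NS store becomes two accesses, one per doubleword. A sketch:

```python
# Sketch of the special case described above: an NS store whose result
# field crosses an 8-byte (doubleword) boundary needs two store accesses.
DW = 8

def split_store(start, length):
    """Return the list of (doubleword_address, byte_offsets) accesses needed."""
    assert 1 <= length <= 8          # NS result length is one to eight bytes
    accesses = []
    first_dw = (start // DW) * DW
    last_dw = ((start + length - 1) // DW) * DW
    for dw in range(first_dw, last_dw + DW, DW):
        lo = max(start, dw) - dw
        hi = min(start + length, dw + DW) - dw
        accesses.append((dw, list(range(lo, hi))))
    return accesses

print(split_store(0x10, 4))   # fits in one doubleword: a single access
print(split_store(0x16, 4))   # crosses the 0x18 boundary: two accesses
```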
Multiple requests to store data, during the execution of a single instruction, into the common level of storage, the L2 cache buffer storage, are not allowed unless an end-of-operation (EOP) indication has been received, indicating the end of the execution of that particular instruction. If the EOP indication has not been received, the data may be stored in the L1 store queue, the L2 store queue, and the L2 write buffers; however, the data may not be stored in the L2 cache unless the EOP indication has been received, signalling the end of execution of the particular instruction. When the EOP indication is received, the store from the L2 write buffers to the L2 cache may begin. This supports the philosophy that an instruction must successfully complete before it is allowed to modify storage. This does not preclude the modification of the requesting processor's L1 cache, however. A special mode of operation is supported for this store request status indicator as well. In the level of interrupt processing where 370-XA instructions are not actually being executed, all store requests are forced to contain the EOP indication to allow efficient processing of microcode interrupt routines.

A functional description of the L1 and L2 store queue design of the present invention will be set forth in the following paragraphs with particular reference to figure 6 of the drawings, with ancillary reference to the block of figure 5 labelled "L2 cache/Bus Switching Unit (BSU) 26b/26a", which is shown connected on one end to L1 caches 18a, 18b, 18c and, on the other end, to L3/L4 (main) memory 10a/10b, and with general reference to figures 1-3 of the drawings.

A processor, one of 20a, 20b, or 20c of figure 2, issues a processor store request to the L1 cache function, one of 18a, 18b, or 18c of figure 2. The command type (NS or SS, EOP), starting field address, one to eight bytes of data, and a field length are presented simultaneously to the L1 cache function. The starting field address consists of program address bits 1:31, identifying the first byte of the field in storage. The field length indicates the number of bytes to be modified starting at that address, one to eight.
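The EOP rule set out above amounts to a gate in front of the shared L2 cache: stores accumulate in the queues and L2 write buffers, and nothing commits to L2 until the request carrying the EOP indicator ('1'b) arrives; in the interrupt-processing mode noted above, every request is forced to carry EOP. A sketch of that gating, with simplified stand-ins for the buffers:

```python
# Sketch of the EOP gating described above: stores for one instruction
# accumulate in the L2 write buffers and commit to the shared L2 cache
# only when the request carrying the EOP indicator arrives.
class EopGate:
    def __init__(self, force_eop=False):
        self.force_eop = force_eop   # interrupt-level mode: every store carries EOP
        self.write_buffer = []
        self.l2_cache = []

    def store(self, data, eop=False):
        self.write_buffer.append(data)
        if eop or self.force_eop:
            # Instruction completed successfully: commit buffered stores to L2.
            self.l2_cache.extend(self.write_buffer)
            self.write_buffer.clear()

gate = EopGate()
gate.store("st1")             # held in the write buffers
gate.store("st2", eop=True)   # EOP: both stores commit to L2 cache
print(gate.l2_cache, gate.write_buffer)
```

This mirrors the stated philosophy: the requesting processor's L1 may be updated early, but the common storage level changes only after the instruction is known to have completed.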
If the storage field modified by the request should cross a doubleword boundary, it is interpreted as two requests by the L1 cache, each requiring a cache access and store queue enqueue. Sequential stores are comprised of a number of such store requests, with the sequencing under control of the execution unit. Referring to figure 2, the L1 cache function 18 receives the information. The storage address, presented by the processor 20, is translated to an absolute address, via the data DLAT and L1 directory 20-3a, 20-3b of figure 2, and the low-order bits of the absolute address and the field length from the absolute address are used to generate store byte flags (STBF). The store byte flags identify the bytes to be stored within the doubleword, the doubleword being identified by absolute address bits 1:28. The L1 cache directory, 20-3a, 20-3b of figure 2, is searched to determine if the data exists in the L1 cache. Next, the L1 cache function 18 makes an entry on the L1 store queue. If processor 20a is making the store request, L1 cache 18a makes an entry on the L1 store queue 18a1 of figure 6, enqueuing the absolute address, command type, data, and store byte flags in L1 store queue 18a1. In parallel with this action, if the requested data exists in the L1 cache 18a (an L1 cache 18a hit), this data in L1 cache 18a is updated according to the absolute address and store byte flags. If all prior stores have been transferred to the L2 cache 26b function, and the interface to L2 cache 26b is available, the store request, enqueued in L1 store queue 18a1, is transferred to the L2 cache function, i.e., to the L2 store queue 26a1, 26a2, or 26a3 and to their respective write buffers. The L2 cache function receives the following information: the doubleword absolute address, the command type, the data, and the store byte flags.
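The store byte flag generation just described can be sketched as follows: the low-order three bits of the absolute address give the starting byte within the doubleword, and together with the field length they mark which of the eight bytes are to be modified (a field that would cross the doubleword becomes two requests, as noted above):

```python
# Sketch of store byte flag (STBF) generation described above: low-order
# absolute address bits plus the field length mark which bytes of the
# addressed doubleword are to be stored.
def store_byte_flags(absolute_address, field_length):
    assert 1 <= field_length <= 8
    start = absolute_address % 8          # byte position within the doubleword
    assert start + field_length <= 8      # a crossing field is split into two requests
    return [1 if start <= i < start + field_length else 0 for i in range(8)]

print(store_byte_flags(0x1003, 4))   # marks bytes 3..6 of the doubleword
```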
This information is enqueued onto the L2 store queue; in this example, this information is stored in the L2 store queue 26a1. The next step is to store this information, from the L2 store queue 26a1, into the L2 cache 26b. As shown in figure 6, for data and/or instructions which are part of a sequential store (SS) operation, these instructions are stored into L2 cache 26b via the L2 write buffers, that is, via the L2 WB-0 26a1(a), L2 cache WB 26a4 or via L2 WB-1 26a1(b), L2 cache WB 26a4. Note that non-sequential store (NS) operations do not utilize L2WB-0 or L2WB-1 when storing information from the L2 store queue into L2 cache 26b; rather, they are stored directly into the L2 cache WB 26a4 from the L2 store queue 26a1, 26a2, 26a3. If all preceding stores for this processor have been serviced and other conditions are satisfied, the request enters L2 cache priority in L2 control 26k (i.e., L2 cache priority is an arbitration unit in the BSU portion of the storage subsystem 26 which decides if the request to store this information in L2 cache 26b should be granted at this time). When the request is granted, the absolute address, obtained from the data DLAT and L1 directory 20-3a, 20-3b of figure 2, is used to search the L2 cache directory 26j of figure 5. If the corresponding data is found in L2 cache 26a, the data is stored to L2 cache 26a of figure 6 under the control of the store byte flags. L1 status arrays, reflecting the contents of each processor's L1 caches 18a, 18b, 18c, are interrogated and the appropriate L1 cross-invalidation requests are sent to the L1 caches 18a-18c in order to invalidate the corresponding obsolete data stored in one or more of the L1 caches 18a-18c, thereby maintaining storage consistency. Once the data is stored in L2 cache 26b, the corresponding data entry, stored in the L1 store queue 18a1, the L2 store queue 26a1, and the L2 write buffers, is then dequeued (erased or removed) from both the L1 and L2 store queues 18a1 and 26a1 and the L2 write buffers.

Referring to figure 7, a construction of the contents of the L1 store queues 18a1, 18b1, and 18c1 of figure 6 is illustrated.
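The L1 status array interrogation and cross-invalidation step described above can be sketched as a small Python function. The dictionary representation of the status arrays and the processor names are illustrative assumptions; the patent describes only the behavior: when a store completes in L2, every other processor's L1 copy of the affected line is invalidated to maintain storage consistency.

```python
def cross_invalidate(l1_status, line_addr, storing_cpu):
    """Model of the L1 status array interrogation: l1_status maps each
    processor to the set of L1 line addresses it currently holds.
    Returns the processors sent a cross-invalidation request, after
    discarding their now-obsolete copies of the line."""
    invalidated = []
    for cpu, lines in l1_status.items():
        if cpu != storing_cpu and line_addr in lines:
            lines.discard(line_addr)
            invalidated.append(cpu)
    return invalidated

status = {"P0": {0x1000, 0x2000}, "P1": {0x1000}, "P2": {0x3000}}
# P0 stores into line 0x1000: only P1 holds an obsolete copy.
assert cross_invalidate(status, 0x1000, "P0") == ["P1"]
assert 0x1000 not in status["P1"] and 0x1000 in status["P0"]
```

Note that the storing processor's own L1 copy is left intact, matching the description: the requesting processor's L1 cache is updated in parallel with the store, so only the other processors' copies are obsolete.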
In figure 7, the L1 store queues 18a1, 18b1, and 18c1 each comprise a logical address portion 18a1(a), an absolute address portion 18a1(b), a command field 18a1(c), a data field 18a1(d), and a store byte flag (STBF) field 18a1(e).

In figure 6, each processor or execution unit 20a-20c in the configuration of figure 2 maintains its own L1 store queue, totally independent of the other processors. Each L1 store queue 18a1-18c1 of figures 6 and 7 can be considered a one-dimensional array. It is a first-in, first-out, cyclic queue in regards to requests for transfer to the L2 store queue. The contents of the L1 store queue 18a1-18c1 comprise five primary fields. The first, the logical address 18a1(a), is not necessarily required. It can be maintained to allow detection of stores into the instruction stream within the same processor, labeled program store compare. The absolute address can be used for this purpose, however. The second field contains the absolute address 18a1(b). This address represents the address of the doubleword of the data in the store queue entry. It is the address resulting from dynamic address translation performed prior to enqueuing the store request. This address is used to update the L1 and L2 cache buffers. It is also used as the address for succeeding instructions' attempts to fetch results enqueued on the L1 store queue, but not yet stored into L2 cache, labeled operand store compare. The command field 18a1(c) contains the sequential store bit: '0'b if a non-sequential store request, '1'b if a sequential store request. The other bit is the EOP bit. It delimits instruction boundaries within the store queue. The next field, the data field 18a1(d), contains up to eight bytes of data, aligned according to the logical address bits 29:31. As an example, if four bytes are to be stored starting at byte position 1, then bytes 1:4 contain the resultant four bytes to store to memory in this field. The final field, the store byte flag (STBF) field 18a1(e), indicates which bytes are to be written to storage within the enqueued doubleword. From the previous example, store byte flags 1:4 would be ones, while 0 and 5:7 would be zeros.

In a multiple processor configuration, if a processor attempts to fetch data enqueued on the L1 store queue, one of two actions results. First, if the fetch request results in an L1 cache hit, all "conceptually completed" store queue entry absolute addresses are compared, to the doubleword boundary, against the doubleword absolute address of the fetch request. Should one or more equal compares be found, the fetch is held pending the dequeue of the last matched queue entry to L2 cache. This prevents the requesting processor from seeing the data before the other processors in the configuration. Second, if the fetch request results in an L1 cache miss, all "conceptually completed" store queue entry absolute addresses are compared, to the L1 cache line boundary, against the L1 cache line absolute address of the fetch request. Should one or more equal compares be found, the fetch is held pending the dequeue of the last matched queue entry to L2 cache. This guarantees storage consistency between the L1 and L2 caches.

Still referring to figure 7, four "pointers" will be discussed: the L1EP pointer and the L1TP pointer, as shown in figure 7, and the L1IP pointer and the L1DP pointer, not shown in the drawings. An entry is placed onto the L1 store queue each time a store request is presented to L1 cache, provided the logical address can successfully be translated and no access exceptions occur, and regardless of the L1 cache hit/miss status. Just prior to enqueue, the L1 store queue enqueue pointer (L1EP) is incremented to point to the next available entry in the queue. Enqueue is permitted provided the L1 store queue is not full. Store queue full is predicted, accounting for the pipeline stages from the instruction register to the L1 cache, to prevent store queue overflow. Enqueue is controlled by the execution unit store requests. An L1 store queue transfer pointer (L1TP) is implemented to support bidirectional command/address and data interfaces with the L2 cache. Normally, both interfaces are available, and when a store request is placed onto the L1 store queue, it is also transferred to the L2 store queue. Under certain situations, the transfer of the store request to L2 must be delayed, perhaps due to data transfers from L2 cache to L1 cache for an L1 cache fetch miss. This allows the execution unit to continue after requesting the store without requiring that the request has been successfully transferred to L2. The L1TP is incremented each time a request is transferred to L2. An instruction boundary pointer (L1IP) is required to delimit 370-XA instruction boundaries. Each time an EOP indication is received, the L1EP is copied into the L1IP. The L1IP is used to mark "conceptually completed" stores in the L1 store queue. These are complete from the viewpoint of the execution unit, as the 370-XA instruction has completed, even though they may not yet be reflected in L2 cache, the common level of storage in the configuration. This pointer marks the boundary for the entries checked for program store compare and operand store compare. Last, the L1 dequeue pointer (L1DP) identifies the most recent entry removed from the store queue. It actually points to an invalid L1 store queue entry, marking the last available entry for enqueue. L1 store queue entries are dequeued only by signal from the L2 store queue controls, whenever the corresponding entry is removed from the L2 store queue. This pointer is used along with the L1EP to identify store queue full and empty conditions, controlling the execution unit as necessary.
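The pointer discipline of the cyclic L1 store queue can be sketched, purely for illustration, as the following Python model. The class name, queue size, and exact pointer semantics are assumptions of the sketch; the patent describes the pointers' roles (L1EP enqueue, L1TP transfer to L2, L1IP instruction boundary on EOP, L1DP dequeue on signal from L2) but not an implementation.

```python
class L1StoreQueue:
    """Sketch of the four L1 store queue pointers over a cyclic array.
    Indices stand in for queue entries; the data fields are omitted."""

    def __init__(self, size=8):
        self.size = size
        # L1EP (enqueue), L1TP (transfer), L1IP (instruction
        # boundary), L1DP (dequeue), all advancing cyclically.
        self.ep = self.tp = self.ip = self.dp = 0

    def full(self):
        # L1DP marks the last available entry for enqueue; L1EP may
        # not advance onto it (full/empty detection per the patent).
        return (self.ep + 1) % self.size == self.dp

    def enqueue(self, eop=False):
        assert not self.full(), "overflow is prevented by prediction"
        self.ep = (self.ep + 1) % self.size
        if eop:
            # EOP: copy L1EP into L1IP; entries up to L1IP are
            # "conceptually completed".
            self.ip = self.ep

    def transfer(self):
        # Advances when the L1/L2 interface is free.
        if self.tp != self.ep:
            self.tp = (self.tp + 1) % self.size

    def dequeue(self):
        # Driven only by a signal from the L2 store queue controls.
        if self.dp != self.tp:
            self.dp = (self.dp + 1) % self.size

q = L1StoreQueue(size=4)
q.enqueue()
q.enqueue(eop=True)
assert q.ip == q.ep          # EOP copied L1EP into L1IP
q.transfer(); q.transfer()
assert q.tp == q.ep          # all entries transferred to L2
q.dequeue()
assert q.dp != q.ep          # queue not yet empty
```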
Referring to figure 8, two field address registers are connected to an output of the L1 store queue of figure 7, and in particular, to the absolute address portion 18a1(b) of the output of the L1 store queues 18a1-18c1 of figure 7. These field address registers are termed the Starting Field Absolute Address (SFAA) field address register and the Ending Field Absolute Address (EFAA) field address register.

In figure 8, to support sequential store processing, two additional address registers, called field address registers, are required for comparison purposes: the SFAA and the EFAA field address registers. Consider that a 370-XA instruction may modify up to 256 bytes of storage with single-byte store requests. Given that each request presented to the L1 store queue 18a1-18c1 requires a unique entry, 256 entries would be required to contain the entire instruction to support the retry philosophy. To avoid this situation, when a sequential store is started in the L2 cache 26a, each entry associated with the same instruction is dequeued and its address is loaded into the appropriate field address register of figure 8, either the SFAA or the EFAA. This allows the limits of the storage field, modified by the sequential store, to be maintained for comparison purposes in a minimal amount of hardware, at the L1 cache or L1 store queue level. These registers are used to support "program store compare" and "operand store compare". The following sub-paragraphs describe the program store compare and the operand store compare concepts:

Operand Store Compare

As required by the conceptual sequence within a processor, if an instruction stores a result to a location in storage and a subsequent instruction fetches an operand from that same location, the operand fetch must see the updated contents of the storage location. The comparison is required on an absolute address basis. With the queuing of store requests, it is required that the operand fetch be delayed until the store is actually completed at the L2 cache and made apparent to all processors in the configuration. For the uniprocessor, the restriction that the store complete to L2 cache before allowing the fetch to continue is waived, as there exists no other processor to be made cognizant of the change to storage. It is not required that channels be made aware of the processor stores in any prescribed sequence, as channels execute asynchronously with the processor. In this case, enqueuing on the L1 store queue, and updating the L1 operand cache if the data exist there, is sufficient to mark completion of the store. However, if the data are not in L1 cache at the time of the store, the fetch request with operand store compare must wait for the store to complete to L2 cache before allowing the inpage to L1 cache, to guarantee data consistency in all levels of the cache storage hierarchy.

Program Store Compare
Within a processor, two cases of program store compare exist: the first involves an operand store to memory followed by an instruction fetch from the same location (store-then-fetch); the second involves prefetching an instruction into the instruction buffers and subsequently storing into that memory location prior to execution of the prefetched instruction (fetch-then-store). As required by the conceptual sequence within a processor, if an instruction stores a result to a location in storage and a subsequent instruction fetch is made from that same location, the instruction fetch must see the updated contents of the storage location. The comparison is required on a logical address basis. With the queuing of store requests, it is required that the instruction fetch be delayed until the store is actually completed at the L2 cache and made apparent to all processors in the configuration. For the second case, the address of each operand store executed within a processor is compared against any prefetched instructions in the instruction stream and, if equal, the appropriate instructions are invalidated. The source of the prefetched instructions, the L1 instruction cache line, is not actually invalidated until the operand store occurs in L2 cache. At that time, L2 cache control requests invalidation of the L1 instruction cache line. There can be no relaxation of the rules for the uniprocessor, as the program instructions reside in a physically separate L1 cache from the program operands, and stores are made to the L1 operand cache only. As such, the store-then-fetch case requires that the L2 cache contain the most recent data stored by the processor prior to the inpage to the L1 instruction cache.

The field address registers (SFAA and EFAA) of figure 8 are also used for operand overlap detection within the same storage-storage 370-XA instruction. The concept of operand overlap is described in the following paragraph:

Operand Overlap

Within the storage-to-storage instructions, where both operands exist in storage, it is possible for the operands to overlap. Detection of this condition is required on a logical address basis. The memory system hardware actually detects this overlap on an absolute address basis. The destination field in storage is actually being built in the L1 store queue, and in L1 cache if an L1 cache directory hit, and in the L2 cache write buffers, not in the L2 cache itself. When operand overlap occurs, the L1 cache store queue data and the old L1 line data from L2 cache are merged on inpage to L1 cache. In the case of destructive overlap, the fetches for the overlapped portion are not necessarily fetched from storage. Hence, the actual updating of L2 cache is postponed until end-of-operation for the instruction.

When operand overlap is detected on a fetch access, no problem exists provided the data is stored in L1 cache, as the stores modify the contents of L1 cache in parallel with enqueuing at the L1 cache level.
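The merge performed on inpage, selecting modified bytes from the write buffer and unmodified bytes from the old L2 cache line under control of the store byte flags, can be sketched as a one-line byte selector. The function name and the byte-string representation are illustrative assumptions; the selection rule itself is the one the patent describes.

```python
def merge_inpage(cache_line, write_buffer, stbf):
    """Bytes whose store byte flag is set come from the write buffer
    (the most recent modifications); all other bytes come from the
    old line fetched out of L2 cache."""
    return bytes(w if f else c
                 for c, w, f in zip(cache_line, write_buffer, stbf))

old = bytes(range(8))            # old doubleword from L2 cache
new = bytes([0xAA]) * 8          # doubleword image in the write buffer
# Flags 1:4 set (the patent's four-byte example): those bytes come
# from the write buffer, bytes 0 and 5:7 from the old line.
merged = merge_inpage(old, new, [0, 1, 1, 1, 1, 0, 0, 0])
assert merged == bytes([0x00, 0xAA, 0xAA, 0xAA, 0xAA, 0x05, 0x06, 0x07])
```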
If an L1 cache miss occurs on a fetch with operand overlap, the L1 store queue is emptied (L2 processes all entries in the L1 store queue related to this instruction) prior to allowing the fetch request transfer to L2 cache. This guarantees that L2 has the most recent data for the instruction in its L2 write buffers. The L1 cache miss can then be handled in L2 cache, merging the contents of the L2 write buffers and the L2 cache line to give the most recent data to L1 cache.

Referring to figure 9, the contents of the L2 store queues 26a1, 26a2, and 26a3 of figure 6 are illustrated.

In figure 9, the contents of the L2 store queue consist of four primary fields. The first field contains the absolute address 26a1(a). This address represents the address of the doubleword of the data in the store queue entry. It is the absolute address transferred with the request from the L1 store queue. This address is used to update the L2 cache and L2 write buffers. It is also used as the address for interrogating the L1 status arrays, which maintain a record of the data in each L1 cache for each processor in the configuration. The command (CMND) field 26a1(b) contains the sequential store bit: '0'b if a non-sequential store request, '1'b if a sequential store request. The other bit is the EOP bit. It delimits instruction boundaries within the store queue. The next field, the DATA field 26a1(c), contains up to eight bytes of data, transferred as loaded into the L1 store queue. The final field, the store byte flag (STBF) field 26a1(d), indicates which bytes are to be written to storage within the enqueued doubleword, as in the L1 store queue.

Each processor 20a, 20b, 20c of figures 2 and 4 maintains its own L2 store queue 26a1, 26a2, 26a3, respectively, totally independent of the other processors. The storage subsystem manages the L2 store queues 26a1-26a3 and L2 write buffers (see figure 6) for each processor. The L2 write buffers may be seen in figure 6 as L2 write buffer 0 (L2WB-0) 26a1(a) and L2 write buffer 1 (L2WB-1) 26a1(b), associated with processor 20a, and the L2 cache write buffer (L2 cache WB) 26a4. Similarly, L2WB-0 26a2(a) and L2WB-1 26a2(b) are associated with processor 20b, and L2WB-0 26a3(a) and L2WB-1 26a3(b) are associated with processor 20c.

Generally speaking, the L2 store queue is comprised of two major parts: (1) the first part is the L2 store queue one-dimensional array 26a1-26a3; it is a first-in, first-out, cyclic queue in regards to dequeuing requests to L2 cache 26a or to the L2 write buffers itemized above from figure 6; its structure is identical to the L1 store queue; however, the entry pointers are slightly different; and (2) the second part is the set of L2 write buffers 26a1(a)/26a1(b), 26a2(a)/26a2(b), 26a3(a)/26a3(b), and 26a4 of figure 6, used for sequential store processing for each processor in the configuration.

In figure 9, an entry is placed onto the L2 store queue each time a store request is transferred across the L1 cache/storage subsystem interface from the requesting processor. Just prior to enqueue, the L2 store queue enqueue pointer (L2EP) is incremented to point to the next available entry in the queue. Enqueue is always allowed, as L1 prevents store queue overflow, provided the L1 store queue and L2 store queue support the same number of entries. A completed pointer (L2CP) is required to delimit serviceable stores within the L2 store queue. Each time an EOP indication is received, the L2EP is copied into the L2CP.
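The L2 completed pointer's two advancement rules, copied from L2EP on EOP for non-sequential stores but incremented immediately on each sequential store enqueue, can be sketched as follows. Linear indices are used in place of the real cyclic queue for clarity; the class and method names are assumptions of the sketch.

```python
class L2StoreQueue:
    """Sketch of L2EP and L2CP. NS entries become serviceable only
    when EOP copies L2EP into L2CP; SS entries become serviceable at
    once, so they can move to the L2 write buffers before EOP."""

    def __init__(self):
        self.ep = 0   # L2EP: enqueue pointer (linear for clarity)
        self.cp = 0   # L2CP: delimits serviceable stores

    def enqueue(self, ss=False, eop=False):
        self.ep += 1
        if ss:
            self.cp += 1        # SS: serviceable immediately
        if eop:
            self.cp = self.ep   # EOP: all enqueued entries serviceable

    def serviceable(self):
        return self.cp          # entries 1..cp may be serviced

q = L2StoreQueue()
q.enqueue()                     # NS, no EOP: held back
assert q.serviceable() == 0
q.enqueue(eop=True)             # EOP: both entries now serviceable
assert q.serviceable() == 2
q.enqueue(ss=True)              # SS: serviceable before EOP arrives
assert q.serviceable() == 3
```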
Also, for sequential store processing, each time a sequential store request is enqueued, the L2CP is incremented. The L2CP is used to mark serviceable stores in the L2 store queue. These are store requests that can be written into L2 cache, in the case of non-sequential stores, and store requests that can be moved into the L2 write buffers, in the case of sequential stores. Last, the L2 dequeue pointer (L2DP) identifies the most recent entry removed from the L2 store queue. It actually points to an invalid L2 store queue entry, marking the last available entry for enqueue. L2 store queue entries are dequeued whenever they are written into L2 cache 26a (NS) or moved into the L2 write buffers (SS).

Referring to figure 10, a set of L2 store queue line hold registers (disposed in the SS L2WB CTLS 26a1(c), 26a2(c), and 26a3(c)) and L2 store queue write buffers (as shown in figure 6) is illustrated.
As shown in figure 6, and seen again in figure 10, the L2 store queue 26a1, 26a2, 26a3 DATA fields 26a1(c) are each connected at the output to an L2 write buffer 0 (L2WB-0) 26a1(a), 26a2(a), and 26a3(a), and to an L2 write buffer 1 (L2WB-1) 26a1(b), 26a2(b), and 26a3(b). However, as seen in figure 10, the L2 store queue 26a1, 26a2, 26a3 ABSOLUTE ADDRESS fields 26a1(a) are each connected to the storage subsystem L2 write buffer controls (SS L2WB CTLS) 26a1(c), 26a2(c), and 26a3(c), each of the write buffer controls (SS L2WB CTLS) including a set of line hold registers comprising a line hold 0 register 26a1(d), 26a2(d), and 26a3(d), a line hold 1 register 26a1(e), 26a2(e), and 26a3(e), and a line hold 2 register 26a1(f), 26a2(f), and 26a3(f). The line hold registers are address registers, and they are needed to support sequential store processing. Further, although not shown in figure 6 but seen in figure 10, the L2 store queue 26a1, 26a2, 26a3 store byte flags (STBF) 26a1(d) are each connected to an L2 write buffer store byte flag 0 (L2WB STBF 0) register 26a1(g), 26a2(g), 26a3(g), to an L2 write buffer store byte flag 1 (L2WB STBF 1) register 26a1(h), 26a2(h), 26a3(h), and to an L2 write buffer store byte flag 2 (L2WB STBF 2) register 26a1(i), 26a2(i), 26a3(i).

In operation, referring to figure 10 in conjunction with figure 6, consider that a 370-XA instruction may modify up to 256 bytes of storage. With an L2 cache line size of 128 bytes, this 256-byte field can span three L2 cache lines. When a sequential store is started in L2, the first line-hold register is loaded with the absolute address portion of the L2 store queue entry, and the L2 cache directory is searched, using the absolute address, to determine if the data currently exists in L2 cache. If it does, the L2 cache set is also loaded into the line-hold register and the cache line is now pinned in L2 cache for the duration of the sequential store. Pinning simply means the L2 cache line cannot be replaced by another line for the duration of the sequential store, but it does not restrict access in any other way. If the L2 cache directory search results in a miss, the data continue to be moved into the L2 write buffer and dequeued from the store queue. Processing continues up to the end of the current cache line. If the end of the cache line is reached before the required data have been inpaged into L2 cache, processing is suspended; otherwise, processing continues with the next L2 cache line. Each time another L2 cache line is stored into, the L2 cache directory is searched again, and another line-hold register is established. Once EOP is detected for the sequential store, the data are stored into L2 cache in successive cache write cycles, and the line-hold registers are reset. A final dequeue signal is transferred to L1 to release the field address registers associated with this sequential store in L1.
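The line-span arithmetic behind the three line-hold registers can be verified with a short sketch. The function names are illustrative; the parameters (128-byte L2 lines, 8-byte doublewords, a 256-byte maximum field) are the ones the patent states.

```python
def lines_spanned(addr, length, line=128):
    """Number of L2 cache lines a storage field touches; each line
    stored into gets a line-hold register and is pinned for the
    duration of the sequential store."""
    first = addr // line
    last = (addr + length - 1) // line
    return last - first + 1

def doublewords_spanned(addr, length):
    # Same arithmetic at doubleword granularity: the cost of treating
    # each doubleword of the field as a unique store.
    return lines_spanned(addr, length, line=8)

# A 256-byte field spans at most three 128-byte L2 cache lines,
# hence three line-hold registers per processor.
assert lines_spanned(0x100, 256) == 2    # line-aligned start: two lines
assert lines_spanned(0x17F, 256) == 3    # misaligned: three lines

# Worst case per doubleword: 33 accesses, versus a maximum of three
# line-write cycles (plus directory searches) with the write buffers.
assert doublewords_spanned(0x7, 256) == 33
```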
This allows the limits of the storage field modified by the sequential store to be maintained with minimal hardware while improving concurrency at the L2 cache level. Note that the L2 cache directory is accessed a maximum of 6 times for the 256-byte storage field. This is a significant reduction in L2 cache busy time over an implementation which treats each doubleword of the field as a unique store, possibly requiring 33 cache accesses. In addition to the line-hold registers, L2 write buffers are required to handle the data portion of the L2 store queue to permit sequential store processing. As sequential store entries are dequeued from the L2 store queue, the image of the storage field is built in the L2 write buffers, address-aligned as if the data were placed into real storage. Upon receipt of EOP for the sequential store, up to three contiguous line-write cycles are taken in L2 cache, moving one to 128 bytes into cache on each write cycle under control of the store byte flags. In this way, a 256-byte field can be written into L2 cache with a maximum of 3 write operations. This is a considerable improvement to L2 cache availability. For cases of operand overlap, when an L1 inpage request is handled in L2 cache, controls detect the fact that data in the L2 write buffer is to be merged with data from L2 cache. The store byte flags associated with the L2 write buffers control which bytes are gated from the L2 write buffer and which bytes are gated from L2 cache. The result is that L1 cache receives the requested L1 cache line with its most recent modifications for the currently executing instruction. As each processor possesses a set of such facilities, the storage subsystem supports the concurrent processing of sequential store operations for each processor in the configuration. The only points of contention are the L2 cache directory, required for interrogation, and the actual L2 write buffer storing into L2 cache.

To support efficient recovery from page faults in a virtual storage environment, microcode may issue a reset processor storage interface command. This allows the L1 and L2 store queues to clear entries associated with a partially completed 370-XA instruction. The scenario is that microcode first guarantees that all previously completed instructions' stores are written into L2 cache. The reset processor storage interface command can then be issued. Both the L1 and L2 store queues are placed into their system reset state, along with any related controls. Microcode invalidates any data in the L1 cache that this instruction may have modified. The effect of the instruction on storage has now been nullified.

In summary, the L1 and L2 store queue design of the present invention allows maximum isolation of each processor's execution within a tightly-coupled multiple processor while minimizing the utilization of the shared L2 cache buffer resource. Storage is not modified by any instruction until successful completion of the 370-XA instruction within the processor. Instructions are processed in such a way as to maximize storage availability by handling them according to result storage field length. This allows simplified storage handling in that partial results do not appear in L2 cache, the common level of storage. Therefore, lines in L2 cache do not have to be held exclusive to a particular processor during its operations. It also supports simplified page fault handling by eliminating the need for pretesting storage field addresses or requiring exclusive access to data in L2 cache to allow partial result storing. This maximizes concurrent availability to data within the shared L2 cache, further improving the performance of the overall multiple processor configuration.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

A more detailed description of the functional operation of the L1 store queue and the L2 store queue of the present invention, and of store request processing in general, will be set forth in the following paragraphs with reference to figures 4 through 10, and with the assistance of the time line diagrams illustrated in figures 11 through 49.

1.0 Storage Routines - MP/3 Processor Storage Store Routines

1.1 Storage Store, TLB Miss

Refer to figure 11 for a time line diagram.

The execution unit issues a processor storage store request to the L1 operand cache. The set-associative TLB search fails to yield an absolute address for the logical address presented by the request. A request for dynamic address translation is presented to the execution unit and the current storage operation is nullified. The TLB miss overrides the results of the L1 cache directory search due to the lack of a valid absolute address for comparison from the TLB. The write to the L1 cache is canceled. The L1 store queue does not enqueue the request due to the TLB miss. Any prefetched instructions which succeed the current instruction are checked for modification by the store request through logical address comparison. As a TLB miss has occurred for the L1 operand cache, no valid absolute address exists to complete the store request. The program store compare checks are blocked. The store request is not transferred to L2 cache due to the TLB miss. For a hardware-executed instruction, program execution is restarted at this instruction address if the address translation is successful. For a microinstruction store request, the microinstruction is re-executed if address translation is successful. For either case, L1 control avoids enqueuing any repeated store requests, to avoid transferring duplicate store requests to the L2 store queue, and commences L1 store queue enqueues with the first new store request.

1.2 Storage Store, TLB Hit, Access Exception

Refer to figure 12 for a time line diagram.

The execution unit issues a processor storage store request to the L1 operand cache.
The 9 set-associative TLB search yields an absolute 10 address for the logical address presented by the 11 request. However, an access exception, either 12 protection or addressing, is detected as a result of 13 the TLB access. The execution unit is notified of 14 the access exception and the current storage 15 operation is nullified. The access exception 16 overrides the results o~ the Ll cache directory 17 search. The write to the Ll cache is canceled. The 18 Ll store queue does not enqueue the request due to 19 the access exception. Any prefetched instructions 20 which succeed the current instruction are checked 21 ~or modification by the store request through 22 logical address comparison. As an access exception 23 has occurred, no valid absolute address exists to 24 complete the store request. The program store 25 compare checks are blocked. The store request is 26 not transferred to the L2 store queue as the current 27 program will abnormally end. Eventually the28 processor L2 interface will be reset by microcode as 29 part o~ the processor recovery routine to purge any 30 enqueued stores associated with this instruction. 31 1.3 Storage Store, Non-sequential, TLB Hit, No Access 33 Exceptions, Delayed Store Queue Transfer, L2 Cache 34 Busy See figure 13 for a time line diagram. - 37 The execution unit issues a non-sequential processor 3g storage store request to the Ll operand cache. The 40 set-associative TLB search ~ields an absolute 41 address, with no access exceptions, for the logical 42 address presented by the request. ~f the search o~ 43 the Ll cache directory ~inds the data in cache, an 44 Ll hit, through equal comparison with the absolute 45 address from the TLB, a write to the selected Ll 46 cache set is enabled. The store request data are 47 EN988001 - ~6 - ~3~89~

written into the Ll cache congruence and selected 6 set using the store byte control flags to write only 7 the desired bytes within the doubleword. If the 8 directory search results in an Ll cache miss, due to 9 a miscompare with the absolute address from the TLB, 10 the write of the Ll cache is canceled. In either11 case, the store request is enqueued on the Ll store 12 queue. The queue entry information consists of the 13 absolute address, data, store byte flags, and store 14 request type (non-sequential or sequential store,15 end-of-operation). The transfer of the processor 16 store request to the L2 cache store queue is 17 delayed. Any combination of three situations can18 delay the transfer. First, store requests must be 19 serviced in the sequence they enter the store queue. 20 If the Ll store queue enqueue pointer is greater21 than the Ll transfer pointer, due to some previous 22 Ll/L2 interface busy condition, this request cannot 23 be transferred to L2 cache until all preceding 24 entries are first transferred. Second, the Ll cache 25 store queue enqueue pointer equals the Ll transfer 26 pointer, but the Ll/L2 interface is busy with data 27 transfers to another Ll cache or a request for Ll 28 cache line invalidation from L2. Third, the L2 29 store queue is currently full and unable to accept 30 another store request from the Ll store queue. Any 31 prefetched instructions which succeed the current 32 instruction are checked for modification by the 33 store request through logical address comparison. 34 If an equal match occurs, the instruction buffers 35 are invalidated. Eventually, the processor store36 request is transferred to the L2 cache. If the L2 37 store queue associated with this processor i8 empky 3~
at the time the request is received and end-of-operation is indicated with the store request, this request can be serviced immediately if selected by L2 cache priority. In any case, an entry is made on the L2 store queue for the requesting processor. The L2 cache store queue is physically divided into two portions: control and data. The absolute address and store request type are maintained in the L2 control 26k function. The
associated data and store byte flags are enqueued in the L2 cache data flow function. The L2 cache priority does not select this processor store request for service.

1.4 Storage Store, Non-sequential, TLB Hit, No Access Exceptions, L2 Cache Hit

See figures 14-21 for time line diagrams.

The execution unit issues a non-sequential processor storage store request to the L1 operand cache. The set-associative TLB search yields an absolute address, with no access exceptions, for the logical address presented by the request. If the search of the L1 cache directory finds the data in cache, an L1 hit, through equal comparison with the absolute address from the TLB, a write to the selected L1 cache set is enabled. The store request data are written into the L1 cache congruence and selected set using the store byte control flags to write only the desired bytes within the doubleword. If the directory search results in an L1 cache miss, due to a miscompare with the absolute address from the TLB, the write of the L1 cache is canceled. In either case, the store request is enqueued on the L1 store queue. The queue entry information consists of the absolute address, data, store byte flags, and store request type (non-sequential or sequential store, end-of-operation). If the store queue is empty prior to this request or the L1 store queue enqueue pointer equals the transfer pointer, and the L1/L2 interface is available, the store request is
transferred to L2 immediately. Otherwise, the transfer is delayed until the L1 store queue transfer pointer selects this entry while the L1/L2 interface is available. Any prefetched instructions which succeed the current instruction are checked for modification by the store request through logical address comparison. If an equal match occurs, the instruction buffers are invalidated. L2 control receives the store request. If the L2 store queue is empty and end-of-operation is indicated with the store request, this request can be serviced immediately if selected by L2 cache priority. If the store queue is empty, but no end-of-operation is associated with the store request, it must wait on the store queue until end-of-operation is received before being allowed to enter L2 cache priority. If the L2 store queue for this processor is not empty, then this request must wait on the store queue until all preceding stores for this processor have completed to L2 cache. In any case, an entry is made on the L2 store queue for the requesting processor. The L2 cache store queue is physically divided into two portions: control and data. The absolute address and store request type are maintained in the L2 control 26k function. The associated data and store byte flags are enqueued in the L2 cache data flow function. The L2 cache priority selects this processor store request for service. L2 control 26k transfers a processor L2 cache store command and L2 cache congruence to L2 cache control and a processor L2 cache store command to memory control. As the L1 operand cache is a store-thru cache, an inpage to L1 cache is not required regardless of the original store request L1 cache hit/miss status. L2 control 26k dequeues the store request from the control portion of the L2 cache store queue for this processor.
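The L2 store queue service discipline just described can be summarized in a small model. This is illustrative only: the class and method names below are inventions of this sketch, and it simplifies by carrying the end-of-operation flag on the last enqueued store rather than modeling the separate control and data portions.

```python
from collections import deque

class L2StoreQueue:
    """Illustrative model of the per-processor L2 store queue discipline.
    All identifiers are inventions of this sketch, not the patent's terms."""

    def __init__(self):
        self.entries = deque()   # FIFO: oldest store first

    def enqueue(self, store, end_of_operation=False):
        # An entry is made on the L2 store queue in any case.
        self.entries.append({"store": store, "eop": end_of_operation})

    def receive_end_of_operation(self):
        # End-of-operation arrives with (or after) the last store of the
        # instruction; it frees the queued stores to enter L2 cache priority.
        if self.entries:
            self.entries[-1]["eop"] = True

    def next_serviceable(self):
        # Only the oldest entry may be serviced, and only once
        # end-of-operation has been received for its instruction.
        if self.entries and self.entries[0]["eop"]:
            return self.entries[0]["store"]
        return None
```

A store enqueued without end-of-operation waits (`next_serviceable()` returns `None`) until `receive_end_of_operation()` is called, mirroring the wait conditions above.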
One of four conditions that yield an L2 cache hit results from the L2 cache directory 26J search.

Case 1

The search of the L2 cache directory 26J results in an L2 cache hit, but a freeze register with uncorrectable storage error indicator active or line-hold register with uncorrectable storage error indicator active is set for an alternate processor for the requested L2 cache line. L2 control 26k suspends this store request pending release of the freeze or line-hold with uncorrectable storage error. The store request is restored onto the control portion of the L2 cache store queue for this processor. Command buffer requests for this processor can still be serviced by L2 control 26k. No information is transferred to address/key. The L2 cache line status and cache set are transferred to L2 cache control, the cache set modifier is transferred to L2 cache, and the L2 cache line status is transferred to memory control. Locked status is forced due to the alternate processor freeze or line-hold with uncorrectable storage error conflict. The L1 status array compares are blocked due to the freeze or line-hold with uncorrectable storage error conflict. L2 control 26k blocks the transfer of instruction complete to the requesting processor's L1 cache due to the freeze or line-hold with uncorrectable storage error conflict. L2 cache control receives the processor L2 cache store command and L2 cache congruence and starts the access to L2 cache. L2 cache control transfers the command to L2 data flow to dequeue the oldest entry from the L2 store queue and write through the L2 write buffer into L2 cache. Upon receipt of the L2 cache line status, L2 hit and locked, L2 cache control cancels the dequeue of the data store queue entry and the write of the L2 cache. Memory control receives the L2 command and L3 port identification. Upon receipt of the L2 cache line status, L2 hit and locked, the request is dropped.

Case 2

The search of the L2 cache directory 26J results in an L2 cache hit, but a lock register is set for an alternate processor for the requested doubleword. L2 control 26k suspends this store request pending release of the lock. The store request is restored onto the control portion of the L2 cache store queue for this processor. Command buffer requests for this processor can still be serviced by L2 control 26k. No information is transferred to address/key.
The L2 cache line status and cache set are transferred to L2 cache control, the cache set
modifier is transferred to L2 cache, and the L2 cache line status is transferred to memory control. Locked status is forced due to the alternate processor lock conflict. The L1 status array compares are blocked due to the lock conflict. L2 control 26k blocks the transfer of instruction complete to the requesting processor's L1 cache due to the lock conflict. L2 cache control receives the processor L2 cache store command and L2 cache congruence and starts the access to L2 cache. L2 cache control transfers the command to L2 data flow to dequeue the oldest entry from the L2 store queue and write through the L2 write buffer into L2 cache. Upon receipt of the L2 cache line status, L2 hit and locked, L2 cache control cancels the dequeue of the data store queue entry and the write of the L2 cache. Memory control receives the L2 command and L3 port identification. Upon receipt of the L2 cache line status, L2 hit and locked, the request is dropped.

Case 3

The search of the L2 cache directory 26J results in an L2 cache hit, but an inpage freeze register with uncorrectable storage error indication is active for this processor. This situation occurs for a processor after an uncorrectable storage error has been reported for an L2 cache inpage due to a store request. The L2 cache line is marked invalid. The absolute address is transferred to address/key with a set reference and change bits command. The L2 cache line status and cache set are transferred to L2 cache control, the cache set modifier is transferred to L2 cache, and the L2 cache line status is transferred to memory control 26e. L2 control clears the command buffer request block latch, the freeze register, and the uncorrectable storage error indication associated with the freeze register as a result of the store request.
All L1 status arrays, excluding the requesting processor's L1 operand cache status, are searched for copies of the modified L1 cache line. The low-order L2 cache congruence is used to address the L1 status arrays and the L2 cache set and high-order congruence are used as the comparand with the L1 status array outputs. If an equal match is found in the requesting processor's L1 instruction cache status array, the entry is cleared, and the L1 cache congruence and L1 cache set are transferred to the requesting processor for local-invalidation of the L1 cache copy after the request for the address buss has been granted by the L1. If any of the alternate processors' L1 status arrays yield a match, the necessary entries are cleared in L1 status, and the L1 cache congruence and L1 cache sets, one for the L1 operand cache and one for the L1 instruction cache, are simultaneously transferred to the required alternate processors for cross-invalidation of the L1 cache copies after the request for the address buss has been granted by that L1. The L2 store access is not affected by the request for local-invalidation or cross-invalidation as L1 guarantees the granting of the required address interface in a fixed number of cycles. Note that no L1 copies should be found for this case as the store is taking place after an L2 cache miss inpage was serviced for the store request and an uncorrectable storage error was detected in the L3 line. If end-of-operation is associated with this store request, L2 control 26k transfers an instruction complete signal to the requesting processor's L1 cache to remove all L1 store queue entries associated with this instruction; the stores have completed into L2 cache. The dequeue from the L1 store queue occurs simultaneously with the last, or only, update to L2 cache.
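The L1 status array search and invalidation selection described above can be sketched as follows. The data layout (a per-processor, per-cache array of status sets) is an assumption of this model, not the disclosed hardware structure:

```python
def find_l1_invalidations(l1_status_arrays, low_congruence,
                          l2_cache_set, high_congruence,
                          requesting_cpu):
    """Illustrative sketch of the L1 status array search. The layout
    l1_status_arrays[cpu][cache][low_congruence], holding a set of
    (l2_cache_set, high_congruence, l1_cache_set) tuples, is an
    assumption of this model."""
    invalidations = []
    for cpu, caches in l1_status_arrays.items():
        for cache_name, array in caches.items():
            # The requesting processor's own L1 operand cache status
            # is excluded from the search.
            if cpu == requesting_cpu and cache_name == "operand":
                continue
            # The low-order L2 congruence addresses the status arrays;
            # the L2 cache set and high-order congruence are the comparand.
            row = array[low_congruence]
            for entry in list(row):
                set_id, high, l1_set = entry
                if set_id == l2_cache_set and high == high_congruence:
                    row.remove(entry)   # clear the L1 status entry
                    invalidations.append((cpu, cache_name, l1_set))
    return invalidations
```

Each returned tuple stands for one local- or cross-invalidation request transferred to the corresponding L1 cache.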
The dequeue from the L2 store queue occurs as each non-sequential store completes to L2 cache. L2 cache control receives the processor L2 cache store command and L2 cache congruence and starts the access to L2 cache. L2 cache control transfers the command to L2 data flow to dequeue the oldest entry from the L2 store queue and write through the L2 write buffer into L2 cache. Upon receipt of the L2 cache line status, L2 hit and not locked, L2 cache control uses the L2 cache set to control the store into L2 cache and the write occurs under control of the store byte flags in what would be the second cycle of the processor L2 cache read sequence. Memory control receives the L2 command and L3 port identification. Upon receipt of the L2 cache line status, L2 hit and not locked, the request is dropped. Address/key receives the absolute address for reference and change bits updating. The reference and change bits for the 4KB page containing the L2 cache line updated by the store request are set to '1'b.

Case 4

The search of the L2 cache directory 26J results in an L2 cache hit. The L2 cache line is marked modified. The absolute address is transferred to address/key with the set reference and change bits command. The L2 cache line status and cache set are transferred to L2 cache control, the cache set modifier is transferred to L2 cache, and the L2 cache line status is transferred to memory control 26e. If the requesting processor holds a lock, the lock address is compared with the store request address. If a compare results, the lock is cleared; if a miscompare results, a machine check is set. All L1 status arrays, excluding the requesting processor's L1 operand cache status, are searched for copies of the modified L1 cache line. The low-order L2 cache congruence is used to address the L1 status arrays and the L2 cache set and high-order congruence are used as the comparand with the L1 status array outputs. If an equal match is found in the requesting processor's L1 instruction cache status array, the entry is cleared, and the L1 cache congruence and L1 cache set are transferred to the requesting processor for local-invalidation of the L1 cache copy after the request for the address buss has been granted by the L1.
If any of the alternate processors' L1 status arrays yield a match, the necessary entries are cleared in L1 status, and the L1 cache congruence and L1 cache sets, one for the L1 operand cache and one for the L1 instruction cache, are simultaneously transferred to the required alternate processors for cross-invalidation of the L1 cache copies after the request for the address buss has been granted by that L1. The L2 store access is not affected by the request for local-invalidation or cross-invalidation as L1 guarantees the granting of the required address interface in a fixed number of cycles. If end-of-operation is associated with this store request, L2 control 26k transfers an instruction complete signal to the requesting processor's L1 cache to remove all L1 store queue entries associated with this instruction; the stores have completed into L2 cache. The dequeue from the L1 store queue occurs simultaneously with the last, or only, update to L2 cache. The dequeue from the L2 store queue occurs as each non-sequential store completes to L2 cache. L2 cache control receives the processor L2 cache store command and L2 cache congruence and starts the access to L2 cache. L2 cache control transfers the command to L2 data flow to dequeue the oldest entry from the L2 store queue and write through the L2 write buffer into L2 cache. Upon receipt of the L2 cache line status, L2 hit and not locked, L2 cache control uses the L2 cache set to control the store into L2 cache and the write occurs under control of the store byte flags in what would be the second cycle of the processor L2 cache read sequence. Memory control receives the L2 command and L3 port identification. Upon receipt of the L2 cache line status, L2 hit and not locked, the request is dropped. Address/key receives the absolute address for reference and change bits updating.
The reference and change bits for the 4KB page containing the L2 cache line updated by the store request are set to '1'b.

1.5 Storage Store, Non-sequential, TLB Hit, No Access Exceptions, L2 Cache Miss

See figures 22-30 for time line diagrams.

The execution unit issues a non-sequential processor storage store request to the L1 operand cache. The set-associative TLB search yields an absolute address, with no access exceptions, for the logical address presented by the request. If the search of the L1 cache directory finds the data in cache, an L1 hit, through equal comparison with the absolute address from the TLB, a write to the selected L1 cache set is enabled. The store request data are written into the L1 cache congruence and selected set using the store byte control flags to write only the desired bytes within the doubleword. If the directory search results in an L1 cache miss, due to a miscompare with the absolute address from the TLB, the write of the L1 cache is canceled. In either case, the store request is enqueued on the L1 store queue. The queue entry information consists of the absolute address, data, store byte flags, and store request type (non-sequential or sequential store, end-of-operation). If the store queue is empty prior to this request or the L1 store queue enqueue pointer equals the transfer pointer, and the L1/L2 interface is available, the store request is transferred to L2 immediately. Otherwise, the transfer is delayed until the L1 store queue transfer pointer selects this entry while the L1/L2 interface is available. Any prefetched instructions which succeed the current instruction are checked for modification by the store request through logical address comparison. If an equal match occurs, the instruction buffers are invalidated. L2 control receives the store request. If the L2 store queue is empty and end-of-operation is indicated with the store request, this request can be serviced immediately if selected by L2 cache priority.
If the store queue is empty, but no end-of-operation is associated with the store request, it must wait on the store queue until end-of-operation is received before being allowed to enter L2 cache priority. If the L2 store queue for this processor is not empty, then this request must wait on the store queue until all preceding stores for this processor have completed to L2 cache. In any case, an entry is made on the L2 store queue for the requesting processor. The L2 cache store queue is physically divided into two portions: control and data. The absolute address and store request type are maintained in the L2 control 26k function. The associated data and store byte flags are enqueued in the L2 cache data flow function. The L2 cache priority selects this processor store request for service. L2 control 26k transfers a processor L2 cache store command and L2 cache congruence to L2 cache control and a processor L2 cache store command to memory control. As the L1 operand cache is a store-thru cache, an inpage to L1 cache is not required regardless of the original store request L1 cache hit/miss status. L2 control 26k dequeues the store request from the control portion of the L2 cache store queue for this processor. One of three conditions that yield an L2 cache miss results from the L2 cache directory 26J search. As the L2 cache is a store-in cache, the L2 cache line must be inpaged from L3 processor storage prior to completion of the store request. The store request is suspended as a result of the L2 cache miss to allow other requests to be serviced in the L2 cache while the inpage for the requested L3 line occurs.

Case A

The search of the L2 cache directory 26J results in an L2 cache miss, but a previous L2 cache inpage is pending for this processor.
L2 control 26k suspends this store request pending completion of the previous inpage request. The store request is restored onto the control portion of the L2 cache store queue for this processor. No further requests can be serviced for this processor in L2 cache as both the command buffers and store queue are pending completion of an L2 cache inpage. No information is transferred to address/key. The L2 cache line status and cache set are transferred to L2 cache control, the cache set modifier is transferred to L2 cache, and the L2 cache line status is transferred to memory control. Locked status is forced due to the previous inpage request. The L1 status array compares are blocked due to the L2 cache miss. L2 control 26k blocks the transfer of instruction complete to the requesting processor's L1 cache due to the L2 cache miss. L2 cache control receives the processor L2 cache store command and L2 cache congruence and starts the access to L2 cache. L2 cache control transfers the command to L2 data flow to dequeue the oldest entry from the L2 store queue and write through the L2 write buffer into L2 cache. Upon receipt of the L2 cache line status, L2 miss and locked, L2 cache control cancels the dequeue of the store queue entry and the write of the L2 cache. Memory control receives the L2 command and L3 port identification. Upon receipt of the L2 cache line status, L2 miss and locked, the request is dropped.

Case B

The search of the L2 cache directory 26J results in an L2 cache miss, but a previous L2 cache inpage is pending for an alternate processor to the same L2 cache line. L2 control 26k suspends this store request pending completion of the previous inpage request. The store request is restored onto the control portion of the L2 cache store queue for this processor. Command buffer requests for this processor can still be serviced by L2 control 26k. No information is transferred to address/key. The L2 cache line status and cache set are transferred to L2 cache control, the cache set modifier is transferred to L2 cache, and the L2 cache line status is transferred to memory control 26e. Locked status is forced due to the previous inpage freeze conflict. The L1 status array compares are blocked due to the L2 cache miss.
L2 control 26k blocks the transfer of instruction complete to the requesting processor's L1 cache due to the L2 cache miss. L2 cache control receives the processor L2 cache store command and L2 cache congruence and starts the access to L2 cache. L2 cache control transfers the command to L2 data flow to dequeue the oldest entry from the L2 store queue and write through the L2 write buffer into L2 cache. Upon receipt of the L2 cache line status, L2 miss and locked, L2 cache control cancels the dequeue of the store queue entry and the write of the L2 cache. Memory control receives the L2 command and L3 port identification. Upon receipt of the L2 cache line status, L2 miss and locked, the request is dropped.

Case C

The search of the L2 cache directory 26J results in an L2 cache miss. L2 control 26k suspends this store request and sets the processor inpage freeze register. The store request is restored onto the control portion of the L2 cache store queue for this processor. Command buffer requests for this processor can still be serviced by L2 control 26k. The absolute address is transferred to address/key. The L2 cache line status and cache set are transferred to L2 cache control, the cache set modifier is transferred to L2 cache, and the L2 cache line status is transferred to memory control 26e. The L1 status array compares are blocked due to the L2 cache miss. L2 control 26k blocks the transfer of instruction complete to the requesting processor's L1 cache due to the L2 cache miss. L2 cache control receives the processor L2 cache store command and L2 cache congruence and starts the access to L2 cache. L2 cache control transfers the command to L2 data flow to dequeue the oldest entry from the L2 store queue and write through the L2 write buffer into L2 cache. Upon receipt of the L2 cache line status, L2 miss and not locked, L2 cache control cancels the dequeue of the store queue entry and the write of the L2 cache.
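The discrimination among Cases A, B, and C can be summarized in a small decision sketch. The `inpage_pending` map below is a modeling assumption standing in for the command buffers and inpage freeze registers:

```python
def classify_l2_miss(cpu, line_address, inpage_pending):
    """Illustrative decision among the three L2 cache miss cases.
    inpage_pending maps a processor id to the L2 line address of its
    pending inpage; this representation is an assumption of the sketch."""
    if cpu in inpage_pending:
        # Case A: this processor already has an L2 cache inpage pending;
        # no further requests for it can be serviced in L2 cache.
        return "A"
    if line_address in inpage_pending.values():
        # Case B: an alternate processor's pending inpage targets the
        # same L2 cache line.
        return "B"
    # Case C: suspend the store, set this processor's inpage freeze
    # register, and initiate the L3 fetch for the missing line.
    return "C"
```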
Memory control receives the L2 command and L3 port identification. Upon receipt of the L2 cache line status, L2 miss and not locked, the request enters priority for the required L3 memory port. When all resources are available, including an inpage/outpage buffer pair, a command is transferred to BSU control to start the L3 fetch access for the processor. Memory control instructs L2 control 26k to set L2 directory status normally for the pending inpage. Address/key receives the absolute address. The reference bit for the 4KB page containing the requested L2 cache line is set to '1'b. The associated change bit is not altered as only an L2 cache inpage is in progress; the store access will be re-executed after the inpage completes. The absolute address is converted to an L3 physical address. The physical address is transferred to BSU control as soon as the interface is available as a result of the L2 cache miss. BSU control, upon receipt of the memory control 26e command and address/key L3 physical address, initiates the L3 memory port 128-byte fetch by transferring the command and address to processor storage and selecting the memory cards in the desired port. Data are transferred 16 bytes at a time across a multiplexed command/address and data interface with the L3 memory port. Eight transfers from L3 memory are required to obtain the 128-byte L2 cache line. The sequence of quadword transfers starts with the quadword containing the doubleword requested by the store access. The next three transfers contain the remainder of the L1 cache line. The final four transfers contain the remainder of the L2 cache line. While the last data transfer completes to the L2 cache inpage buffer, BSU control raises the appropriate processor inpage complete to L2 control 26k. During the data transfers to L2 cache, address/key monitors the L3 uncorrectable error lines.
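The quadword return order can be illustrated with a short sketch. The grouping (requested quadword first, then the rest of its 64-byte L1 line, then the other half of the 128-byte L2 line) follows the description above; the exact wrap order within each group is an assumption of this sketch:

```python
def inpage_transfer_order(requested_byte_offset):
    """Order of the eight 16-byte transfers for a 128-byte L2 line inpage.
    Returns quadword indices 0..7 within the L2 line. The wrap order
    inside each 64-byte half is assumed, not specified."""
    qw = requested_byte_offset // 16      # quadword holding the request
    half = qw // 4                        # which 64-byte L1 line holds it
    # First four transfers: the requested quadword, then the remainder
    # of its L1 cache line (assumed to wrap within the half).
    first_half = [half * 4 + (qw + i) % 4 for i in range(4)]
    # Final four transfers: the remainder of the L2 cache line.
    other = 1 - half
    second_half = [other * 4 + i for i in range(4)]
    return first_half + second_half
```

For example, a store to byte offset 0x48 of the line is served by quadword 4, so the second 64-byte half returns first, followed by quadwords 0 through 3.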
Should an uncorrectable error be detected during the inpage process, several functions are performed. With each quadword transfer to the L2 cache, an L3 uncorrectable error signal is transferred to the processor originally requesting the store access. At most, the processor receives one storage uncorrectable error indication for a given L2 cache inpage request, the first one detected by address/key. The doubleword address of the first storage uncorrectable error detected by address/key is recorded for the requesting processor. Should an uncorrectable storage error occur for any data in the L1 line accessed by the processor, an indicator is set for storage uncorrectable error handling. Finally, should an uncorrectable error occur for any data transferred to the L2 cache inpage buffer, address/key sends a signal to L2 control to alter the handling of the L2 cache inpage and subsequent store request. L2 cache priority selects the inpage complete for the processor for service. L2 control 26k transfers a write inpage buffer command and L2 cache congruence to L2 cache control and an inpage complete status reply to memory control 26e. One of two conditions results from the L2 cache directory 26J search.
Case 1

L2 control 26k selects an L2 cache line for replacement. In this case, the status of the replaced line reveals that it is unmodified; no castout is required. The L2 directory is updated to reflect the presence of the new L2 cache line. If no L3 storage uncorrectable error was detected on inpage to the L2 cache inpage buffer, the freeze register established for this L2 cache miss inpage is cleared. If an L3 storage uncorrectable error was detected on inpage to the L2 cache inpage buffer, the freeze register established for this L2 cache miss inpage is left active and the storage uncorrectable error indication associated with the freeze register is set; the command buffers for the processor which requested the inpage are blocked from entering L2 cache priority; all L1 cache indicators for this processor are set for storage uncorrectable error reporting. The selected L2 cache set is transferred to address/key and L2 cache control. The status of the replaced L2 cache line is transferred to L2 cache control and memory control 26e, and the cache set modifier is transferred to L2 cache. The L1 status arrays for all L1 caches in the configuration are checked for copies of the replaced L2 cache line. Should any be found, the appropriate requests for invalidation are transferred to the L1 caches. The L1 status is cleared of the L1 copy status for the replaced L2 cache line. L2 cache control receives the write inpage buffer command and prepares for an L2 line write to complete the L2 cache inpage, pending status from L2 control 26k. L2 cache control receives the L2 cache set and replaced line status. As the replaced line is unmodified, L2 cache control signals L2 cache that the inpage buffer is to be written to L2 cache. As this is a full line write and the cache sets are interleaved, the L2 cache set must be used to manipulate address bits 25 and 26 to permit the L2 cache line write. BSU control transfers end-of-operation to memory control 26e. Address/key receives the L2 cache set from L2 control 26k. The L2 mini directory update address register is set from the inpage address buffers and the L2 cache set received from L2 control. Memory control receives the status of the replaced line. As no castout is required, memory control 26e releases the resources held by the inpage request. Memory control transfers a command to address/key to update the L2 mini directory using the L2 mini directory update address register associated with this processor. Memory control then marks the current operation completed and allows the requesting processor to enter memory resource priority again. The original L2 store queue request now reenters the L2 cache service priority circuitry. The store access is attempted again, once selected for L2 cache service, and executed as if this is the first attempt to service the request within L2 control 26k.
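The freeze-register handling at inpage completion, common to Case 1 and Case 2, can be sketched as follows (the register layout is a modeling assumption of this sketch):

```python
def complete_inpage(freeze_registers, cpu, l3_ue_detected):
    """Illustrative handling of a processor's inpage freeze register when
    an L2 cache miss inpage completes. freeze_registers[cpu] is a dict of
    flags; this representation is an assumption, not the hardware layout."""
    reg = freeze_registers[cpu]
    if not l3_ue_detected:
        # No L3 uncorrectable error: the freeze register established
        # for this L2 cache miss inpage is cleared.
        reg["active"] = False
        reg["ue"] = False
        reg["block_command_buffers"] = False
    else:
        # Uncorrectable error: leave the freeze active, set the error
        # indication, and block this processor's command buffers from
        # entering L2 cache priority.
        reg["ue"] = True
        reg["block_command_buffers"] = True
    return reg
```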
Case 2

L2 control 26k selects an L2 cache line for replacement. In this case, the status of the replaced line reveals that it is modified; an L2 cache castout is required. The L2 directory is updated to reflect the presence of the new L2 cache line. If no L3 storage uncorrectable error was detected on inpage to the L2 cache inpage buffer, the freeze register established for this L2 cache miss inpage is cleared. If an L3 storage uncorrectable error was detected on inpage to the L2 cache inpage buffer, the freeze register established for this L2 cache miss inpage is left active and the storage uncorrectable error indication associated with the freeze register is set; the command buffers for the processor which requested the inpage are blocked from entering L2 cache priority; all L1 cache indicators for this processor are set for storage uncorrectable error reporting. The address read from the directory, along with the selected L2 cache set, are transferred to address/key. The selected L2 cache set is transferred to L2 cache control. The status of the replaced L2 cache line is transferred to L2 cache control and memory control 26e, and the cache set modifier is transferred to L2 cache. The L1 status arrays for all L1 caches in the configuration are checked for copies of the replaced L2 cache line. Should any be found, the appropriate requests for invalidation are transferred to the L1 caches. The L1 status is cleared of the L1 copy status for the replaced L2 cache line. L2 cache control receives the write inpage buffer command and prepares for an L2 line write to complete the L2 cache inpage, pending status from L2 control 26k. L2 cache control receives the L2 cache set and replaced line status.
As the replaced line is modified, L2 cache control signals L2 cache that a full line read is required to the outpage buffer paired with the inpage buffer prior to writing the inpage buffer data to L2 cache. As these are full line accesses and the cache sets are interleaved, the L2 cache set must be used to manipulate address bits 25 and 26 to permit the L2 cache line accesses. Address/key receives the outpage address from L2 control 26k, converts it to a physical address, and holds it in the outpage address buffers along with the L2 cache set. The L2 mini directory update address register is set from the inpage address buffers and the L2 cache set received from L2 control. Address/key transfers the outpage physical address to BSU control in preparation for the L3 line write. Memory control receives the status of the replaced line. As a castout is required, memory control 26e cannot release the L3 resources until the memory update has completed. Castouts are guaranteed to occur to the same memory port used for the inpage. Memory control transfers a command to address/key to update the L2 mini directory using the L2 mini directory update address register associated with this processor. Memory control then marks the current operation completed and allows the requesting
processor to enter memory resource priority again. 25 The original L2 store queue request now reenters the 26 L2 cache service priority circuitry. The store 27 access is attempted again, once selected for L2 28 cache service, and executed as if this is the first 29 attempt to service the request within L2 control 30 26k. BSU control, recognizing that the replaced L2 31 cache line is modified, starts the castout sequence 32 after receiving the outpage address from address/key 33 by transferring a full line write command and 34 address to the selected memor~ port through the L2 35 cache data flow. Data are transferred from the 36 outpage buffer to memory 16 bytes at a time. After 37 the last quadword transfer to memory, BSU control 38 transfers end-of-operation to memory control 26e. 39 Memory control, upon receipt of end-of-operation 40 from B5U control, releases the L3 port to permit 41 overlapped access to the memory port. 42 1.6 Storage Store, Sequential, Initial L2 Line Access, 44 TLB Hit, No Access Exceptions, L2 Cache ~lit 45 See figures 31-35 for time line diagrams. 47 The execution unit issues ~ sequential processor 7 storage store request to the Ll operand cache The 8 set-associative TLB search yields an absolute 9 address, with no access exceptions, for the logical 10 address presented by the request. If the search of 11 the Ll cache directory finds the data in cache, an 12 Ll hit, through equal comparison with the absolute 13 address ~rom the TLB, a write to the selected Ll14 cache set is enabled. The store request data are15 written into the Ll cache congruence and selected 16 set using the store byte control flags to write only 17 the desired bytes within the doubleword~ I~ the 18 directory search results in an Ll cache miss, due to 19 a miscompare with the absolute address ~rom the TLB, 20 the write of the Ll cache is canceled. In either21 case, the store request is enqueued on the Ll store 22 queue. 
The queue entry information consists of the absolute address, data, store byte flags, and store request type (non-sequential or sequential store, end-of-operation). If the store queue is empty prior to this request or the L1 store queue enqueue pointer equals the transfer pointer, and the L1/L2 interface is available, the store request is transferred to L2 immediately. Otherwise, the transfer is delayed until the L1 store queue transfer pointer selects this entry while the L1/L2 interface is available. Any prefetched instructions which succeed the current instruction are checked for modification by the store request through logical address comparison. If an equal match occurs, the instruction buffers are invalidated.

L2 control receives the store request. If the sequential store routine has not been started, then this request is the initial sequential store access as well as the initial store access to the L2 cache line. If the initial sequential store request has been serviced and a sequential operation is in progress, this represents the initial store access to a new L2 cache line in the sequential store routine. If the L2 store queue is empty, this request can be serviced immediately if selected by L2 cache priority. If the L2 store queue for this processor is not empty, then this request must wait on the store queue until all preceding stores for this processor have completed to L2 cache or the L2 cache write buffers. In either case, an entry is made on the L2 store queue for the requesting processor. The L2 cache store queue is physically divided into two portions: control and data. The absolute address and store request type are maintained in the L2 control 26k function. The associated data and store byte flags are enqueued in the L2 cache data flow function.

If this store request is the start of a sequential store operation, L2 control 26k must check the L2 cache directory 26J for the presence of the line in L2 cache. If a sequential operation is in progress for this processor, comparison of address bits 24, 25, 27, and 28 with those of the previous sequential store request for this processor has detected that absolute address bit 24 of this store request differs from that of the previous store request. This store request is to a different L2 cache line. As such, L2 control 26k must check the L2 cache directory 26J for the presence of this line in L2 cache. No repeat command is transferred to L2 cache control and no information is immediately transferred to address/key and memory control 26e. As this is not the first line to be accessed by the sequential store operation, L2 control 26k checks the status of the previous sequentially accessed L2 cache line. If the previous line is not resident in L2 cache, L2 control 26k holds sequential processing on the current line until the inpage completes. Otherwise, L2 control 26k can continue sequential stores to the current L2 cache line. See the description of 'Sequential, Secondary L2 Line Accesses'. The L2 cache priority selects this processor store request for service.
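The enqueue-time transfer rule for the L1 store queue described above can be sketched as follows; the function and parameter names are illustrative, not from the patent.

```python
def can_transfer_immediately(queue_empty: bool,
                             enqueue_ptr: int,
                             transfer_ptr: int,
                             interface_available: bool) -> bool:
    """Decide whether a newly enqueued L1 store queue entry may be sent
    to L2 at enqueue time: the queue must have been empty before this
    request, or the enqueue pointer must equal the transfer pointer,
    and in either case the L1/L2 interface must be free."""
    return (queue_empty or enqueue_ptr == transfer_ptr) and interface_available
```

An entry that fails this test simply waits on the queue until the transfer pointer selects it while the interface is available.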
L2 control 26k transfers a store to L2 cache write buffer command and L2 cache congruence to L2 cache control and a processor L2 cache store command to memory control 26e. As the L1 operand cache is a store-thru cache, an inpage to L1 cache is not required regardless of the original store request L1 cache hit/miss status. L2 control 26k dequeues the store request from the control portion of the L2 store queue to allow overlapped processing of subsequent sequential store requests to the same L2 cache line. L2 control 26k recognizes that this store request is the start of a new L2 cache line within the sequential store operation. If this store request is the start of a sequential store operation, L2 control 26k sets the sequential operation in-progress indicator for this processor. Store queue request absolute address bits 24, 25, 27, and 28 are saved for future reference in the sequential store routine. If an alternate processor lock conflict is detected, it is ignored as the data are destined to the L2 cache write buffers for the requesting processor, not L2 cache. If the requesting processor holds a lock, a machine check is set. One of two conditions results from the L2 cache directory 26J search which yields an L2 cache hit.

Case 1

The search of the L2 cache directory 26J results in an L2 cache hit, but a freeze register with uncorrectable storage error indicator active or line-hold register with uncorrectable storage error indicator active is set for an alternate processor for the requested L2 cache line. L2 control 26k suspends this store request and succeeding sequential store requests pending release of the freeze or line-hold with uncorrectable storage error. The store request is restored onto the control portion of the L2 cache store queue for this processor. Command buffer requests for this processor can still be serviced by L2 control 26k. No information is transferred to address/key.
The L2 cache line status and cache set are transferred to L2 cache control, the cache set modifier is transferred to L2 cache, and the L2 cache line status is transferred to memory control 26e. Locked status is forced due to the alternate processor freeze or line-hold with uncorrectable storage error conflict. The L1 status array compares are blocked due to the sequential store operation being in progress. L2 control 26k does not transfer instruction complete to the requesting processor's L1 cache due to the sequential store operation being in progress. L2 cache control receives the store to L2 cache write buffer command and L2 cache congruence and starts the access to L2 cache. L2 cache control transfers the command to L2 data flow to dequeue the oldest entry from the L2 store queue and write into the next L2 cache write buffer. Upon receipt of the L2 cache line status, L2 hit and locked, L2 cache control cancels the dequeue of the data store queue entry and the write of the L2 cache write buffer. Memory control receives the L2 command and L3 port identification. Upon receipt of the L2 cache line status, L2 hit and locked, the request is dropped.

Case 2

The search of the L2 cache directory 26J results in an L2 cache hit. The L2 cache line is not marked modified. No information is transferred to address/key. The L2 cache line status and cache set are transferred to L2 cache control, the cache set modifier is transferred to L2 cache, and the L2 cache line status is transferred to memory control. A line-hold, comprised of absolute address bits 4:24 and the L2 cache set, is established for the L2 cache line to be modified by this store request. Absolute address bit 25 is used to record whether this store request modifies the high half-line or low half-line of the L2 cache line. Bit 25 equal to '0'b sets the high half-line modifier of the current line-hold register; bit 25 equal to '1'b sets the low half-line modifier. The L1 status array compares are blocked due to the sequential store operation being in progress. L2 control 26k does not transfer instruction complete to the requesting processor's L1 cache due to the sequential store operation being in progress.

L2 cache control receives the store to L2 cache write buffer command and L2 cache congruence and starts the access to L2 cache. L2 cache control transfers the command to L2 data flow to dequeue the oldest entry from the L2 store queue and write into the next L2 cache write buffer. Upon receipt of the L2 cache line status, L2 hit and not locked, L2 cache control completes the store to the L2 cache write buffer, loading the data and store byte flags, address-aligned, into the write buffer for the requesting processor. The L2 cache congruence is saved for subsequent sequential store requests associated with this operation and L2 cache write buffer in L2 data flow. For this portion of the sequential store operation, the cache set is not required, but pipeline stages force the store queue data to be moved into the L2 cache write buffer in a manner consistent with non-sequential store requests. The data store queue entry is dequeued from the L2 store queue, but not the L1 store queue, at the time the data are written into the L2 cache write buffer. Memory control receives the L2 command and L3 port identification. Upon receipt of the L2 cache line status, L2 hit and not locked, the request is dropped.

1.7 Storage Store, Sequential, Initial L2 Line Access, TLB Hit, No Access Exceptions, L2 Cache Miss

See figures 36-44 for time line diagrams.
The execution unit issues a sequential processor storage store request to the L1 operand cache. The set-associative TLB search yields an absolute address, with no access exceptions, for the logical address presented by the request. If the search of the L1 cache directory finds the data in cache, an L1 hit, through equal comparison with the absolute address from the TLB, a write to the selected L1 cache set is enabled. The store request data are written into the L1 cache congruence and selected set using the store byte control flags to write only the desired bytes within the doubleword. If the directory search results in an L1 cache miss, due to a miscompare with the absolute address from the TLB, the write of the L1 cache is canceled. In either case, the store request is enqueued on the L1 store queue. The queue entry information consists of the absolute address, data, store byte flags, and store request type (non-sequential or sequential store, end-of-operation). If the store queue is empty prior to this request or the L1 store queue enqueue pointer equals the transfer pointer, and the L1/L2 interface is available, the store request is transferred to L2 immediately. Otherwise, the transfer is delayed until the L1 store queue transfer pointer selects this entry while the L1/L2 interface is available. Any prefetched instructions which succeed the current instruction are checked for modification by the store request through logical address comparison. If an equal match occurs, the instruction buffers are invalidated.

L2 control receives the store request. If the sequential store routine has not been started, then this request is the initial sequential store access as well as the initial store access to the L2 cache line. If the initial sequential store request has been serviced and a sequential operation is in progress, this represents the initial store access to a new L2 cache line in the sequential store routine. If the L2 store queue is empty, this request can be serviced immediately if selected by L2 cache priority. If the L2 store queue for this processor is not empty, then this request must wait on the store queue until all preceding stores for this processor have completed to L2 cache or the L2 cache write buffers. In either case, an entry is made on the L2 store queue for the requesting processor.
The L2 cache store queue is physically divided into two portions: control and data. The absolute address and store request type are maintained in the L2 control 26k function. The associated data and store byte flags are enqueued in the L2 cache data flow function. If this store request is the start of a sequential store operation, L2 control 26k must check the L2 cache directory 26J for the presence of the line in L2 cache. If a sequential operation is in progress for this processor, comparison of address bits 24, 25, 27, and 28 with those of the previous sequential store request for this processor has detected that absolute address bit 24 of this store request differs from that of the previous store request. This store request is to a different L2 cache line. As such, L2 control 26k must check the L2 cache directory 26J for the presence of this line in L2 cache. No repeat command is transferred to L2 cache control and no information is immediately transferred to address/key and memory control 26e. As this is not the first line to be accessed by the sequential store operation, L2 control 26k checks the status of the previous sequentially accessed L2 cache line. If the previous line is not resident in L2 cache, L2 control 26k holds sequential processing on the current line until the inpage completes. Otherwise, L2 control 26k can continue sequential stores to the current L2 cache line. See the description of 'Sequential, Secondary L2 Line Accesses'.

The L2 cache priority selects this processor store request for service. L2 control 26k transfers a store to L2 cache write buffer command and L2 cache congruence to L2 cache control and a processor L2 cache store command to memory control 26e. As the L1 operand cache is a store-thru cache, an inpage to L1 cache is not required regardless of the original store request L1 cache hit/miss status. L2 control 26k dequeues the store request from the control portion of the L2 store queue to allow overlapped processing of subsequent sequential store requests to the same L2 cache line. One of three conditions results from the L2 cache directory 26J search which yields an L2 cache miss. As the L2 cache is a store-in cache, the L2 cache line must be inpaged from L3 processor storage prior to the start of the sequential store completion routine.
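The line-boundary test that forces this directory search can be sketched as follows, assuming 32-bit absolute addresses in IBM bit numbering (bit 0 is the most significant, so bit 24 has weight 2**7 = 128 bytes, the L2 cache line size); the helper names are hypothetical.

```python
def ibm_bit(addr: int, bit: int, width: int = 32) -> int:
    """Extract one bit using IBM big-endian numbering: bit 0 is the
    MSB. For a 32-bit address, bit 24 has weight 2**7 = 128, so it
    toggles at every 128-byte L2 cache line boundary."""
    return (addr >> (width - 1 - bit)) & 1

def crosses_l2_line(prev_addr: int, cur_addr: int) -> bool:
    # A sequential store starts a new L2 cache line exactly when
    # absolute address bit 24 differs from the previous request's.
    return ibm_bit(prev_addr, 24) != ibm_bit(cur_addr, 24)
```

When this predicate is true, the directory must be searched for the new line; otherwise the store can be serviced without involving the L2 cache or its directory.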

Case A

The search of the L2 cache directory 26J results in an L2 cache miss, but a previous L2 cache inpage is pending for this processor. L2 control 26k suspends this store request and succeeding sequential store requests pending completion of the previous inpage request. The store request is restored onto the control portion of the L2 cache store queue for this processor. No further requests can be serviced for this processor in L2 cache as both the command buffers and the store queue are pending completion of an L2 cache inpage. No information is transferred to address/key. The L2 cache line status and cache set are transferred to L2 cache control, the cache set modifier is transferred to L2 cache, and the L2 cache line status is transferred to memory control. Locked status is forced due to the previous inpage request. The L1 status array compares are blocked due to the sequential store operation being in progress. L2 control does not transfer instruction complete to the requesting processor's L1 cache due to the sequential store operation being in progress. L2 cache control receives the store to L2 cache write buffer command and L2 cache congruence, and starts the access to L2 cache. L2 cache control transfers the command to L2 data flow to dequeue the oldest entry from the L2 store queue and write into the next L2 cache write buffer. Upon receipt of the L2 cache line status, L2 miss and locked, L2 cache control cancels the dequeue of the data store queue entry and the write of the L2 cache write buffer. Memory control receives the L2 command and L3 port identification. Upon receipt of the L2 cache line status, L2 miss and locked, the request is dropped.

Case B

The search of the L2 cache directory 26J results in an L2 cache miss, but a previous L2 cache inpage is pending for an alternate processor to the same L2 cache line. L2 control 26k suspends this store request and succeeding sequential store requests pending completion of the previous inpage request. The store request is restored onto the control portion of the L2 cache store queue for this processor. Command buffer requests for this processor can still be serviced by L2 control. No information is transferred to address/key. The L2 cache line status and cache set are transferred to L2 cache control, the cache set modifier is transferred to L2 cache, and the L2 cache line status is transferred to memory control. Locked status is forced due to the previous inpage freeze conflict. The L1 status array compares are blocked due to the sequential store operation being in progress. L2 control does not transfer instruction complete to the requesting processor's L1 cache due to the sequential store operation being in progress. L2 cache control receives the store to L2 cache write buffer command and L2 cache congruence and starts the access to L2 cache. L2 cache control transfers the command to L2 data flow to dequeue the oldest entry from the L2 store queue and write into the next L2 cache write buffer. Upon receipt of the L2 cache line status, L2 miss and locked, L2 cache control cancels the dequeue of the data store queue entry and the write of the L2 cache write buffer. Memory control receives the L2 command and L3 port identification. Upon receipt of the L2 cache line status, L2 miss and locked, the request is dropped.

Case C
The search of the L2 cache directory 26J results in an L2 cache miss. To permit sequential store processing to overlap the servicing of the L2 cache miss, L2 control 26k does not suspend this store request, but does set the processor inpage freeze register. Both command buffer requests, and sequential store requests for the current L2 cache line, can be serviced by L2 control 26k for this processor. The absolute address is transferred to address/key. The L2 cache line status and cache set are transferred to L2 cache control, the cache set modifier is transferred to L2 cache, and the L2 cache line status is transferred to memory control 26e. If this store request is the start of a sequential store operation, L2 control 26k sets the sequential operation in-progress indicator for this processor. Store queue request absolute address bits 24, 25, 27, and 28 are saved for future reference in the sequential store routine. A line-hold, comprised of absolute address bits 4:24 and the L2 cache set, is established for the L2 cache line to be modified by this store request. Absolute address bit 25 is used to record whether this store request modifies the high half-line or low half-line of the L2 cache line. Bit 25 equal to '0'b sets the high half-line modifier of the current line-hold register; bit 25 equal to '1'b sets the low half-line modifier. The L1 status array compares are blocked due to the sequential store operation being in progress. L2 control 26k does not transfer instruction complete to the requesting processor's L1 cache due to the sequential store operation being in progress.

L2 cache control receives the store to L2 cache write buffer command and L2 cache congruence and starts the access to L2 cache. L2 cache control transfers the command to L2 data flow to dequeue the oldest entry from the L2 store queue and write into the next L2 cache write buffer. Upon receipt of the L2 cache line status, L2 miss and not locked, L2 cache control completes the store to the L2 cache write buffer, loading the data and store byte flags, address-aligned, into the write buffer for the requesting processor. The L2 cache congruence is saved for subsequent sequential store requests associated with this operation and L2 cache write buffer in L2 data flow.
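A minimal model of the line-hold register described above, under the same assumptions used earlier (32-bit absolute address, IBM bit numbering, so bits 4:24 identify the 128-byte line and bit 25, weight 64, selects the half-line); the class and field names are invented for illustration.

```python
class LineHold:
    """Hypothetical model of one line-hold register: absolute address
    bits 4:24 (the 128-byte L2 line address) plus the L2 cache set,
    with one modifier flag per 64-byte half-line."""
    def __init__(self, abs_addr: int, cache_set: int):
        self.line = (abs_addr >> 7) & ((1 << 21) - 1)  # bits 4:24 = 21 bits
        self.cache_set = cache_set
        self.high_half_modified = False
        self.low_half_modified = False

    def record_store(self, abs_addr: int) -> None:
        # Bit 25 (weight 64) selects the half-line: '0'b sets the
        # high half-line modifier, '1'b the low half-line modifier.
        if (abs_addr >> 6) & 1:
            self.low_half_modified = True
        else:
            self.high_half_modified = True
```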
For this portion of the sequential store operation, the cache set is not required, but pipeline stages force the store queue data to be moved into the L2 cache write buffer in a manner consistent with non-sequential store requests. The data store queue entry is dequeued from the L2 store queue, but not the L1 store queue, at the time the data are written into the L2 cache write buffer. Memory control receives the L2 command and L3 port identification. Upon receipt of the L2 cache line status, L2 miss and not locked, the request enters priority for the required L3 memory port. When all resources are available, including an inpage/outpage buffer pair, a command is transferred to BSU control to start the L3 fetch access for the processor. Memory control instructs L2 control to set L2 directory status normally for the pending inpage. Address/key receives the absolute address. The reference bit for the 4KB page containing the requested L2 cache line is set to '1'b. The associated change bit is not altered as only an L2 cache inpage is in progress; the store access will be executed during the sequential store completion routine. The absolute address is converted to an L3 physical address. The physical address is transferred to BSU control as soon as the interface is available as a result of the L2 cache miss. BSU control, upon receipt of the memory control 26e command and address/key L3 physical address, initiates the L3 memory port 128-byte fetch by transferring the command and address to processor storage and selecting the memory cards in the desired port. Data are transferred 16 bytes at a time across a multiplexed command/address and data interface with the L3 memory port. Eight transfers from L3 memory are required to obtain the 128-byte L2 cache line.
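This fetch sequence, spelled out in the next paragraph, can be sketched as follows: the quadword containing the requested doubleword arrives first, then the rest of its 64-byte L1 line, then the four quadwords of the other half of the 128-byte L2 line. The wrap order within each half is an assumption, as is the function name.

```python
def inpage_transfer_order(dw_offset: int) -> list[int]:
    """Return the quadword (16-byte) offsets of a 128-byte L2 line in
    fetch order: first the quadword holding the requested doubleword,
    then the remaining three quadwords of its 64-byte L1 line
    (assumed to wrap within that half), then the other half."""
    q = (dw_offset // 16) * 16              # quadword containing the doubleword
    half = q - (q % 64)                     # start of its 64-byte L1 line
    first_half = [half + ((q - half + 16 * i) % 64) for i in range(4)]
    other = (half + 64) % 128               # the other half of the L2 line
    return first_half + [other + 16 * i for i in range(4)]
```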
The sequence of quadword transfers starts with the quadword containing the doubleword requested by the store access. The next three transfers contain the remainder of the L1 cache line. The final four transfers contain the remainder of the L2 cache line. While the last data transfer completes to the L2 cache inpage buffer, BSU control raises the appropriate processor inpage complete to L2 control 26k. During the data transfers to L2 cache, address/key monitors the L3 uncorrectable error lines. Should an uncorrectable error be detected during the inpage process, several functions are performed. With each quadword transfer to the L2 cache, an L3 uncorrectable error signal is transferred to the processor originally requesting the store access. At most, the processor receives one storage uncorrectable error indication for a given L2 cache inpage request, the first one detected by address/key. The doubleword address of the first storage uncorrectable error detected by address/key is recorded for the requesting processor. Should an uncorrectable storage error occur for any data in the L1 line accessed by the processor, an indicator is set for storage uncorrectable error handling. Finally, should an uncorrectable error occur for any data transferred to the L2 cache inpage buffer, address/key sends a signal to L2 control 26k to alter the handling of the L2 cache inpage and subsequent sequential store completion routine. L2 cache priority selects the inpage complete for the processor for service. L2 control 26k transfers a write inpage buffer command and L2 cache congruence to L2 cache control and an inpage complete status reply to memory control 26e. One of two conditions results from the L2 cache directory 26J search.

Case 1

L2 control 26k selects an L2 cache line for replacement.
In this case, the status of the replaced line reveals that it is unmodified; no castout is required. The L2 directory is updated to reflect the presence of the new L2 cache line. The freeze register established for this L2 cache miss inpage is cleared. If an L3 storage uncorrectable error was detected on inpage to the L2 cache inpage buffer, the uncorrectable storage error indicator associated with the line-hold register related to this L2 cache miss inpage is set; all L1 cache indicators for this processor are set for storage uncorrectable error reporting. The selected L2 cache set is transferred to address/key and L2 cache control. The status of the replaced L2 cache line is transferred to L2 cache control and memory control 26e, and the cache set modifier is transferred to L2 cache. The L1 status arrays for all L1 caches in the configuration are checked for copies of the replaced L2 cache line. Should any be found, the appropriate requests for invalidation are transferred to the L1 caches. The L1 status is cleared of the L1 copy status for the replaced L2 cache line.

L2 cache control receives the write inpage buffer command and prepares for an L2 line write to complete the L2 cache inpage, pending status from L2 control 26k. L2 cache control receives the L2 cache set and replaced line status. As the replaced line is unmodified, L2 cache control signals L2 cache that the inpage buffer is to be written to L2 cache. As this is a full line write and the cache sets are interleaved, the L2 cache set must be used to manipulate address bits 25 and 26 to permit the L2 cache line write. BSU control transfers end-of-operation to memory control 26e. Address/key receives the L2 cache set from L2 control 26k. The L2 mini directory update address register is set from the inpage address buffers and the L2 cache set received from L2 control. Memory control receives the status of the replaced line. As no castout is required, memory control 26e releases the resources held by the inpage request. Memory control transfers a command to address/key to update the L2 mini directory using the L2 mini directory update address register associated with this processor. Memory control then marks the current operation completed and allows the requesting processor to enter memory resource priority again.

Case 2

L2 control 26k selects an L2 cache line for replacement. In this case, the status of the replaced line reveals that it is modified; an L2 cache castout is required.
The L2 directory is updated to reflect the presence of the new L2 cache line. The freeze register established for this L2 cache miss inpage is cleared. If an L3 storage uncorrectable error was detected on inpage to the L2 cache inpage buffer, the uncorrectable storage error indicator associated with the line-hold register related to this L2 cache miss inpage is set; all L1 cache indicators for this processor are set for storage uncorrectable error reporting. The address read from the directory, along with the selected L2 cache set, are transferred to address/key. The selected L2 cache set is transferred to L2 cache control. The status of the replaced L2 cache line is transferred to L2 cache control and memory control 26e, and the cache set modifier is transferred to L2 cache. The L1 status arrays for all L1 caches in the configuration are checked for copies of the replaced L2 cache line. Should any be found, the appropriate requests for invalidation are transferred to the L1 caches. The L1 status is cleared of the L1 copy status for the replaced L2 cache line.

L2 cache control receives the write inpage buffer command and prepares for an L2 line write to complete the L2 cache inpage, pending status from L2 control 26k. L2 cache control receives the L2 cache set and replaced line status. As the replaced line is modified, L2 cache control signals L2 cache that a full line read is required to the outpage buffer paired with the inpage buffer prior to writing the inpage buffer data to L2 cache. As these are full line accesses and the cache sets are interleaved, the L2 cache set must be used to manipulate address bits 25 and 26 to permit the L2 cache line accesses. Address/key receives the outpage address from L2 control 26k, converts it to a physical address, and holds it in the outpage address buffers along with the L2 cache set. The L2 mini directory update address register is set from the inpage address buffers and the L2 cache set received from L2 control. Address/key transfers the outpage physical address to BSU control in preparation for the L3 line write.

Memory control receives the status of the replaced line. As a castout is required, memory control 26e cannot release the L3 resources until the memory update has completed. Castouts are guaranteed to occur to the same memory port used for the inpage. Memory control transfers a command to address/key to update the L2 mini directory using the L2 mini directory update address register associated with this processor. Memory control then marks the current operation completed and allows the requesting processor to enter memory resource priority again. BSU control, recognizing that the replaced L2 cache line is modified, starts the castout sequence after receiving the outpage address from address/key by transferring a full line write command and address to the selected memory port through the L2 cache data flow. Data are transferred from the outpage buffer to memory 16 bytes at a time. After the last quadword transfer to memory, BSU control transfers end-of-operation to memory control 26e. Memory control, upon receipt of end-of-operation from BSU control, releases the L3 port to permit overlapped access to the memory port.

1.8 Storage Store, Sequential, Secondary L2 Line Access, TLB Hit, No Access Exceptions

See figures 45-49 for time line diagrams.

The execution unit issues a sequential processor storage store request to the L1 operand cache. The set-associative TLB search yields an absolute address, with no access exceptions, for the logical address presented by the request. If the search of the L1 cache directory finds the data in cache, an L1 hit, through equal comparison with the absolute address from the TLB, a write to the selected L1 cache set is enabled. The store request data are written into the L1 cache congruence and selected set using the store byte control flags to write only the desired bytes within the doubleword.
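The store-byte-flag write can be sketched as a byte-wise merge into one 8-byte doubleword; the flag bit ordering (bit 0 controlling the leftmost byte) and the function name are assumptions for illustration.

```python
def merge_doubleword(old: bytes, store_data: bytes, byte_flags: int) -> bytes:
    """Write only the flagged bytes of an 8-byte doubleword: a set
    flag takes the byte from the store data, a clear flag keeps the
    cached byte. Flag bit 0 is assumed to govern the leftmost byte."""
    assert len(old) == 8 and len(store_data) == 8
    return bytes(store_data[i] if (byte_flags >> (7 - i)) & 1 else old[i]
                 for i in range(8))
```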
If the directory search results in an L1 cache miss, due to

a miscompare with the absolute address from the TLB, the write of the L1 cache is canceled. In either case, the store request is enqueued on the L1 store queue. The queue entry information consists of the absolute address, data, store byte flags, and store request type (non-sequential or sequential store, end-of-operation). If the L1 store queue enqueue pointer equals the transfer pointer and the L1/L2 interface is available, the store request is transferred to L2 immediately. Otherwise, the transfer is delayed until the L1 store queue transfer pointer selects this entry while the L1/L2 interface is available. Any prefetched instructions which succeed the current instruction are checked for modification by the store request through logical-address comparison. If an equal match occurs, the instruction buffers are invalidated. L2 control 26k receives the store request. If the initial sequential store request has been serviced and a sequential operation is in progress, this and succeeding store requests are given special consideration. If the L2 store queue is empty, this request can be serviced immediately by a special sequential store operation sequencer exclusive to this processor. If the L2 store queue for this processor is not empty, then this request must wait on the store queue until all preceding stores for this processor have completed to the L2 cache write buffers. In either case, an entry is made on the L2 store queue for the requesting processor. The L2 cache store queue is physically divided into two portions: control and data. The absolute address and store request type are maintained in the L2 control 26k function. The associated data and store byte flags are enqueued in the L2 cache data flow function.
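A minimal software model of the enqueue/transfer pointer behavior described above may help; the class and field names are invented for illustration, and the later dequeue at instruction completion is not modeled:

```python
from collections import deque

class L1StoreQueueModel:
    """Entries cross to L2 immediately when no older entries are waiting
    and the L1/L2 interface is free; otherwise they wait until the
    transfer pointer reaches them."""

    def __init__(self):
        self.waiting = deque()   # enqueued, not yet transferred
        self.sent_to_l2 = []     # transferred across the L1/L2 interface

    def enqueue(self, entry, interface_available: bool) -> None:
        # Enqueue pointer equals transfer pointer when nothing is waiting.
        if not self.waiting and interface_available:
            self.sent_to_l2.append(entry)
        else:
            self.waiting.append(entry)

    def transfer_cycle(self, interface_available: bool) -> None:
        if self.waiting and interface_available:
            self.sent_to_l2.append(self.waiting.popleft())

q = L1StoreQueueModel()
q.enqueue(("addr0", b"d0"), interface_available=True)    # sent immediately
q.enqueue(("addr1", b"d1"), interface_available=False)   # delayed
q.transfer_cycle(interface_available=True)               # now transferred
```

The deque preserves the first-in, first-out order that the transfer pointer imposes on delayed entries.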
L2 control 26k, recognizing that a sequential operation is in progress for this processor, compares address bits 24, 25, 27, and 28 with those of the previous sequential store request for this processor. Absolute address bit 24 of this store request matches that of the previous store request. This store request is to the same L2 cache line. As such, this store queue request can be serviced regardless of whether the L2 cache line presently exists in L2 cache, as the L2 cache and its directory are not involved in the dequeue. The store queue request is serviced and the request is dequeued from the control portion of the L2 cache store queue for this processor. If absolute address bit 25 equals '1'b, the low half-line modifier of the current line-hold register is set to '1'b, indicating that this half-line is modified. L2 control transfers one of three commands, on an interface specifically allocated for this processor, to L2 cache control based on the difference between address bits 27 and 28 of this sequential store request and those of the previous store request. The command is repeat with no address increment if the difference is '00'b, repeat and increment by 8 if the difference is '01'b, and repeat and increment by 16 if the difference is '10'b. Store queue request absolute address bits 24, 25, 27, and 28 are saved for future reference in the sequential store routine and the store queue entry is dequeued, allowing the next entry to be serviced in the following cycle. L2 control 26k transfers no information to address/key or memory control 26e. L2 cache control transfers the command to L2 data flow to dequeue the oldest entry from the L2 store queue using the most recently supplied L2 cache congruence for this processor, with the address adjusted as specified by L2 control 26k.
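The command selection from the bit-27/28 difference can be expressed directly. A '11'b difference is not described in the text, so this sketch deliberately leaves that case unhandled; the function name is invented:

```python
def repeat_command(prev_bits_27_28: int, curr_bits_27_28: int) -> str:
    """Map the doubleword-address delta (bits 27-28) to one of the three
    L2 cache control commands described above."""
    diff = (curr_bits_27_28 - prev_bits_27_28) & 0b11
    commands = {
        0b00: "repeat, no address increment",
        0b01: "repeat, increment by 8",
        0b10: "repeat, increment by 16",
    }
    return commands[diff]   # KeyError on the undescribed '11'b case
```

For example, consecutive doubleword stores (a delta of one in bits 27-28) yield the increment-by-8 command, since bits 27-28 step through 8-byte doublewords.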
The data and store byte flags are written, address-aligned, into the L2 cache write buffers for the requesting processor. For this portion of the sequential store operation, the cache set is not required, but pipeline stages force the store queue data to be moved into the L2 cache write buffer in a manner consistent with non-sequential store requests. The data store queue entry is dequeued from the L2 store queue, but not the L1 store queue, at the time the data are written into the L2 cache write buffer.

1.9 Storage Store, Sequential, Completion Routine, L2

Cache Hit

The sequential store completion routine is a series of commands generated by L2 control 26k which cause the L2 cache write buffers for a processor to be written to L2 cache. This is normally started by the receipt of end-of-operation for the instruction executing the sequential stores. End-of-operation can be associated with the last sequential store request of a sequential operation or it may be transferred later, as a separate end-of-operation storage command for this processor. In either case, once detected by L2 control 26k for a sequential store operation, the sequential operation sequencer for this processor starts the completion routine. The sequential operation sequencer checks all active line-holds against the locks of the alternate processors and verifies that all required L2 cache lines are resident in cache. If any lock conflicts exist or any L2 cache miss is outstanding for the sequential store operation, the sequential operation completion routine is held pending. If no lock conflicts exist and the required lines are resident in L2 cache, the sequential operation completion routine enters L2 control 26k priority. The L2 cache priority selects this sequential store operation completion request for service. Recognizing the number of active line-holds, L2 control 26k holds the L2 cache exclusive to this request for a contiguous number of cycles necessary to complete all L2 cache line writes associated with the line-hold registers. This routine finishes the sequential operation by storing all valid L2 cache write buffer contents to L2 cache with consecutive store L2 cache write buffer to L2 cache commands. The following sequence is executed up to three times, depending on the number of valid line-hold registers associated with the sequential store operation.
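The gating condition for starting the completion routine can be sketched as a predicate over the (up to three) active line-holds; the field names here are invented for illustration:

```python
def completion_ready(line_holds, alternate_locks) -> bool:
    """True only when every active line-hold is resident in L2 cache and
    does not conflict with a lock held by an alternate processor."""
    for hold in line_holds:
        if not hold["resident_in_l2"]:              # L2 cache miss outstanding
            return False
        if hold["line_address"] in alternate_locks:  # lock conflict
            return False
    return True

holds = [{"line_address": 0x40, "resident_in_l2": True},
         {"line_address": 0x80, "resident_in_l2": True}]
ready = completion_ready(holds, alternate_locks={0x200})
```

Until the predicate holds, the completion routine stays pending rather than entering L2 priority.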
L2 control 26k transfers a store L2 cache write buffer to L2 cache command and the L2 cache congruence, taken from the line-hold register, to L2 cache control. L2 control transfers an L2

cache store command to memory control. One of two conditions results from the L2 cache directory search that yields an L2 cache hit.

Case 1

The search of the L2 cache directory 26J results in an L2 cache hit, but the uncorrectable storage error indicator associated with the line-hold register is active. This situation occurs for a processor after an uncorrectable storage error has been reported for an L2 cache inpage due to a sequential store request. The L2 cache line is marked invalid. The absolute address is transferred to address/key with a set reference and change bits command. The L2 cache line status and cache set are transferred to L2 cache control, the cache set modifier is transferred to L2 cache, and the L2 cache line status is transferred to memory control. The line-hold register associated with this L2 cache line of the sequential store operation is cleared and the corresponding uncorrectable storage error indicator is reset. All L1 status arrays, excluding the requesting processor's L1 operand cache status, are searched for copies of the modified L2 cache half-lines under control of the half-line modifiers from the associated line-hold register. The low-order L2 cache congruence is used to address the L1 status arrays, and the L2 cache set and high-order congruence are used as the comparand with the L1 status array outputs. If an equal match is found in the requesting processor's L1 instruction cache status array, the necessary entries are cleared, and the L1 cache congruence and L1 cache sets are transferred to the requesting processor for local-invalidation of the L1 instruction cache copies after the request for the address bus has
been granted by the L1. If any of the alternate processors' L1 status arrays yield a match, the necessary entries are cleared in L1 status, and the L1 cache congruence and L1 cache sets, two for the L1 operand cache and two for the L1 instruction

cache, are simultaneously transferred to the required alternate processors for cross-invalidation of the L1 cache copies after the request for the address bus has been granted by that L1. The L2 store access is not affected by the request for local-invalidation or cross-invalidation as L1 guarantees the granting of the required address interface in a fixed number of cycles. L2 control 26k transfers an instruction complete signal to the requesting processor's L1 cache to remove all entries associated with the sequential store with the last store L2 cache write buffer to L2 cache command in the completion routine; all associated stores have completed into L2 cache. The dequeue from the L1 store queue and the release of the L2 cache write buffers occur simultaneously with the final update of the L2 cache. L2 cache control receives the store L2 cache write buffer to L2 cache command and L2 cache congruence and starts the access to L2 cache. L2 cache control transfers the command to L2 data flow to store the required L2 cache write buffer contents into L2 cache. Upon receipt of the L2 cache line status, L2 hit and not locked, L2 cache control uses the L2 cache set to control the store into L2 cache, manipulating address bits 25 and 26 to accomplish the full line write. The write occurs under control of the L2 cache write buffer store byte flags in two cycles: the update to quadwords zero and one (32 bytes) occurs in the first cycle; in the second cycle, the remaining quadwords (96 bytes) in the L2 cache line are updated. Memory control receives the L2 command and L3 port identification. Upon receipt of the L2 cache line status, L2 hit and not locked, the request is dropped.
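The two-cycle full-line write under the store byte flags can be sketched as follows, assuming a 128-byte L2 line of eight 16-byte quadwords; the cache-set manipulation of address bits 25 and 26 is omitted from this sketch:

```python
QUADWORD = 16
LINE_BYTES = 8 * QUADWORD   # 128-byte L2 cache line

def full_line_write(l2_line: bytearray, write_buffer: bytes, flags) -> None:
    """Cycle 1 updates quadwords zero and one (32 bytes); cycle 2 updates
    the remaining quadwords (96 bytes). flags holds one boolean per byte,
    modeling the store byte flags."""
    for start, end in ((0, 2 * QUADWORD), (2 * QUADWORD, LINE_BYTES)):
        for i in range(start, end):
            if flags[i]:
                l2_line[i] = write_buffer[i]

line = bytearray(LINE_BYTES)
buf = bytes(range(LINE_BYTES))
full_line_write(line, buf, flags=[True] * 40 + [False] * 88)
```

Only the flagged 40 bytes are overwritten, spanning both write cycles; unflagged bytes keep their prior contents.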
Address/key receives the absolute address for reference and change bits updating. The reference and change bits for the 4KB page containing the L2 cache line updated by the store request are set to '1'b.

Case 2

The search of the L2 cache directory 26J results in an L2 cache hit. The L2 cache line is marked modified. The absolute address is transferred to address/key with a set reference and change bits command. The L2 cache line status and cache set are transferred to L2 cache control, the cache set modifier is transferred to L2 cache, and the L2 cache line status is transferred to memory control 26e. The line-hold register associated with this L2
cache line of the sequential store operation is cleared. All L1 status arrays, excluding the requesting processor's L1 operand cache status, are searched for copies of the modified L2 cache half-lines under control of the half-line modifiers from the associated line-hold register. The low-order L2 cache congruence is used to address the L1 status arrays, and the L2 cache set and high-order congruence are used as the comparand with the L1 status array outputs. If an equal match is found in the requesting processor's L1 instruction cache status array, the necessary entries are cleared, and the L1 cache congruence and L1 cache sets are transferred to the requesting processor for local-invalidation of the L1 instruction cache copies after the request for the address bus has been granted by the L1. If any of the alternate processors' L1 status arrays yield a match, the necessary entries are cleared in L1 status, and the L1 cache congruence and L1 cache sets, two for the L1 operand cache and two for the L1 instruction cache, are simultaneously transferred to the required alternate processors for cross-invalidation of the L1 cache copies after the request for the address bus has been granted by that L1. The L2 store access is not affected by the request for local-invalidation or cross-invalidation as L1 guarantees the granting of the required address interface in a fixed number of cycles. L2 control 26k transfers an instruction complete signal to the requesting processor's L1 cache to remove all entries associated with the sequential store with the last store L2 cache write buffer to L2 cache

command in the completion routine; all associated stores have completed into L2 cache. The dequeue from the L1 store queue and the release of the L2 cache write buffers occur simultaneously with the final update of the L2 cache. L2 cache control receives the store L2 cache write buffer to L2 cache command and L2 cache congruence and starts the access to L2 cache. L2 cache control transfers the command to L2 data flow to store the required L2 cache write buffer contents into L2 cache. Upon receipt of the L2 cache line status, L2 hit and not locked, L2 cache control uses the L2 cache set to control the store into L2 cache, manipulating address bits 25 and 26 to accomplish the full line write. The write occurs under control of the L2 cache write buffer store byte flags in two cycles: the update to quadwords zero and one (32 bytes) occurs in the first cycle; in the second cycle, the remaining quadwords (96 bytes) in the L2 cache line are updated. Memory control receives the L2 command and L3 port identification. Upon receipt of the L2 cache line status, L2 hit and not locked, the request is dropped. Address/key receives the absolute address for reference and change bits updating. The reference and change bits for the 4KB page containing the L2 cache line updated by the store request are set to '1'b.

The invention being thus described, it should be obvious that the same may be varied in many ways. Such variations are not to be regarded as a departure from the spirit and scope of the invention, and all such modifications as would be obvious to one skilled in the art are intended to be included within the scope of the following claims.
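The reference and change bit update that concludes both cases can be sketched as a per-4KB-page table; the dictionary representation and class name are illustrative assumptions:

```python
PAGE_SIZE = 4096

class ReferenceChangeBits:
    """Reference and change bits kept per 4KB page of L3 storage."""

    def __init__(self):
        self.bits = {}   # page index -> {"ref": bool, "chg": bool}

    def set_for_store(self, absolute_address: int) -> None:
        # A completed store both references and changes the containing page.
        self.bits[absolute_address // PAGE_SIZE] = {"ref": True, "chg": True}

rc = ReferenceChangeBits()
rc.set_for_store(0x1234)   # a store into page 1 marks it referenced and changed
```

A single update covers the whole 4KB page regardless of how many bytes within the L2 cache line the store modified.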

Claims (15)

1. In a multiprocessor system having a plurality of processors including a first processor and at least one second processor, a first level cache connected to each processor, a single second level cache connected to each first level cache and shared by the processors, and a third level main memory connected to the second level cache, a system for queuing and buffering data and/or instructions, comprising:
a first level store queue means associated with each processor and having an input connected to its corresponding processor and connected to an input of its corresponding first level cache for receiving said data and/or instructions from its corresponding processor intended for potential storage in its corresponding first level cache and for queuing said data and/or instructions therein simultaneously within its corresponding first level cache, each of the first level store queue means having outputs; and a plurality of second level store queue means each of which is associated with a first level store queue means and interconnected between the output of such associated first level store queue means and an input of the single second level cache for receiving said data and/or instructions from said first level store queue means and for queuing said data and/or instructions therein prior to storage of said data and/or instructions in said second level cache, said first and second level store queues temporarily holding data and/or instructions to prevent the first processor from delaying its execution of additional instructions due to the second level cache being busy with a store operation associated with a second processor.
2. In a multiprocessor system having a plurality of processors including a first processor and at least one second processor, a first level cache connected to each processor, a single second level cache connected to each first level cache and shared by the processors, and a third level main memory connected to the second level cache, a system for queuing and buffering data and/or instructions, comprising:
a first level store queue means associated with each processor and having an input connected to its corresponding processor and connected to an input of its corresponding first level cache for receiving said data and/or instructions from its corresponding processor intended for potential storage in its corresponding first level cache and for queuing said data and/or instructions therein, each of the first level store queue means having outputs, a plurality of second level store queue means each of which is associated with a first level store queue means and interconnected between the output of such associated first level store queue means and an input of the single second level cache for receiving said data and/or instructions from said first level store queue means and for queuing said data and/or instructions therein prior to storage of said data and/or instructions in said second level cache; and wherein each said second level store queue means comprises a queue means connected to the output of its first level store queue means for receiving said data and/or instructions from its respective first level store queue means and initially storing said data and/or instructions therein; and write buffer and control means connected to an output of said queue means for receiving said data and/or instructions stored in said queue means and for secondarily storing at least some of said data and/or instructions therein, said at least some of said data and/or instructions stored in said write buffer and control means being stored in said second level cache when said second level cache is not busy and allows the storage of said data and/or instructions therein.
3. The system of claim 2, wherein said at least some of said data and/or instructions are stored sequentially in said second level cache from said write buffer and control means.
4. The system of claim 3, wherein certain of said data and/or instructions are allowed write access to said second level cache directly from said queue means.
5. The system of claim 2, further comprising:
addressing means interconnected between each of said second level store queue means and the single shared second level cache for addressing said second level cache, said data and/or instructions stored in a said second level store queue means being stored in said single second level cache in response to the addressing thereof by said addressing means; and buffer means connected to an output of said second level cache for storing said data and/or instructions therein when said data and/or instructions are read out of said second level cache, said data and/or instructions stored in said buffer means of one processor invalidating corresponding obsolete entries of said data and/or instructions in the first level caches of other processors, the invalidation being accomplished before any of said other processors have access to said corresponding obsolete entries of said data and/or instructions.
6. The system of claim 2, wherein each of the first level store queue means comprise an address field means for storing an absolute address.
7. In a multiprocessor system having a plurality of processors including a first processor and at least one second processor, a first level cache connected to each processor, a single second level cache connected to each first level cache and shared by the processors, and a third level main memory connected to the second level cache, a system for queuing and buffering data and/or instructions, comprising:

a first level store queue means associated with each processor and having an input connected to its corresponding processor and connected to an input of its corresponding first level cache for receiving said data and/or instructions from its corresponding processor intended for potential storage in its corresponding first level cache and for queuing said data and/or instructions therein, each of the first level store queue means having outputs and address field means for storing an absolute address;
a plurality of second level store queue means each of which is associated with a first level store queue means and interconnected between the output of such associated first level store queue means and an input of the single second level cache for receiving said data and/or instructions from said first level store queue means and for queuing said data and/or instructions therein prior to storage of said data and/or instructions in said second level cache;
starting field absolute address register means connected to the address field means of each of said first level store queue means for storing a starting absolute address therein representing an initial absolute address associated with a first of said data and/or instructions to be stored in said first level store queue means; and ending field absolute address register means connected to the address field means of each of said first level store queue means for storing an ending absolute address therein representing the last absolute address associated with a final one of said data and/or instructions to be stored in said first level store queue means.
8. The system of claim 5, wherein each of said queue means of each said second level store queue means comprise:
an address field means for storing an absolute address of said data and/or instructions within said second level cache, a data field means for storing said data and/or instructions, and store byte flag field means for storing store byte flags indicative of specific locations at said absolute address within said second level cache wherein said data and/or instructions are stored.
9. The system of claim 8, wherein each said write buffer and control means further comprises:
control means including a plurality of line hold register means connected to the address field means of each of said queue means for storing the absolute address of said data and/or instructions within said second level cache;
a plurality of second level write buffer means interconnected between the data field means of each of said queue means and the shared single second level cache for storing said data and/or instructions therein; and a plurality of store byte flag register means connected to the store byte flag field means of each of said queue means for storing the store byte flags therein indicative of specific locations at said absolute address within said second level cache wherein the obsolete entry of said data and/or instructions is stored.
10. The system of claim 9, wherein:
a directory associated with each of said first level caches of said other processors and with the single second level cache is interrogated using the absolute address stored in one of said line hold register means of said one processor, the corresponding obsolete entries of said data and/or instructions stored in the first level caches of said other processors are invalidated if said absolute address in said one of said line hold register means of said one processor is found in the directories of said first level caches associated with said other processors, and the corresponding obsolete entry of said data and/or instructions stored in the single second level cache is invalidated if said absolute address in said one of said line hold register means of said one processor is found in said directory of said second level cache.
11. The system of claim 9, wherein said data and/or instructions stored in a said second level write buffer means over-writes the data and/or instructions stored in the specific locations at the absolute address of said second level cache wherein the corresponding obsolete entry of said data and/or instructions is stored, the specific locations being determined and identified in accordance with the store byte flags stored in said store byte flag register means.
12. In a multiprocessor system including a first processor and at least one second processor where each processor includes an execution unit and a translation lookaside buffer (TLB), a first level cache (L1 cache) connected to each processor, a first level cache directory (L1 cache directory) associated with each said L1 cache, a first level store queue (L1 store queue) connected to each processor and to its said L1 cache, and a second level store queue (L2 store queue) connected to each said L1 store queue, a method of operating said multiprocessor system, comprising the steps of:
(a) issuing a first storage request by said execution unit of said first processor, said first storage request including a logical address and a new set of data, said new set of data being associated with a sequential store operation;
(b) locating an absolute address in said TLB using said logical address in said first storage request;
(c) using said absolute address, searching said L1 cache directory to determine if corresponding data is located at said absolute address of said L1 cache;
(d) if said corresponding data is found in said L1 cache, writing said new set of data into said L1 cache at said absolute address and writing said new set of data into said L1 store queue;

(e) if said corresponding data is not found in said L1 cache, writing said new set of data into said L1 store queue; and (f) writing said new set of data from said L1 store queue into said L2 store queue; whereby, upon completion of step (a), said first processor may issue a second storage request including a further new set of data and repeat steps (a) through (e) to write said further new set of data into said L1 store queue, said further new set of data being associated with said sequential store operation.
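As an illustrative software analogy of steps (a) through (f) of claim 12, not the claimed hardware itself, the flow can be modeled with invented names:

```python
from types import SimpleNamespace

def sequential_store(proc, logical_addr: int, new_data: bytes) -> int:
    abs_addr = proc.tlb[logical_addr]                  # (b) locate absolute address in TLB
    if abs_addr in proc.l1_cache:                      # (c) search the L1 cache directory
        proc.l1_cache[abs_addr] = new_data             # (d) L1 hit: write the new data
    proc.l1_store_queue.append((abs_addr, new_data))   # (d)/(e) enqueue on the L1 store queue
    proc.l2_store_queue.append((abs_addr, new_data))   # (f) forward to the L2 store queue
    return abs_addr

proc = SimpleNamespace(
    tlb={0x100: 0x9100},            # logical -> absolute address
    l1_cache={0x9100: b"old"},      # line resident in L1
    l1_store_queue=[], l2_store_queue=[])
sequential_store(proc, 0x100, b"new")
```

Because the request is queued rather than held, the processor is free to issue the next storage request of the sequential operation immediately, as the whereby clause recites.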
13. The method of claim 12, wherein said multiprocessor system further includes at least two second level write buffers (L2 write buffers) connected to each said L2 store queue, and wherein said method further comprises the step of:
(g) writing said new set of data associated with said sequential store operation from said L2 store queue associated with said first processor into one of the L2 write buffers, whereby, upon completion of step (g), said further new set of data may be written from said L1 store queue into said L2 store queue.
14. The method of claim 13, wherein said multiprocessor system further includes a single second level cache (L2 cache) connected to the L2 write buffers of each processor and shared by each processor, a directory associated with said L2 cache (L2 cache directory), and a second level control (L2 control) including an arbitrating means for receiving requests from the processors to access said L2 cache, and wherein said method further comprises the steps of:
(h) using said arbitrating means in said L2 control, requesting access to said L2 cache;
(i) when said access to said L2 cache is granted, searching said L2 cache directory using said absolute address to determine if corresponding obsolete entries of said new set of data are present in said L2 cache; and (j) if an L2 cache hit occurs, writing said new set of data from said one of the L2 write buffers into a location of said L2 cache defined by said absolute address; whereby, said further new set of data may be written from said L2 store queue into another of the L2 write buffers.
15. The method of claim 14, wherein said multiprocessor system further includes a third level main memory (L3 memory) connected to said L2 cache, and wherein said method further comprises the steps of:
(k) if an L2 cache miss occurs, during the writing of said further new set of data from said L2 store queue into said another of the L2 write buffers, inpaging said corresponding obsolete entries of said new set of data from said L3 memory into a location of said L2 cache defined by said absolute address; and (l) following step (k), writing said new set of data from said one of the L2 write buffers into said location of said L2 cache defined by said absolute address.
CA000588790A 1988-02-22 1989-01-20 Store queue for a tightly coupled multiple processor configuration with two-level cache buffer storage Expired - Fee Related CA1315896C (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US07/159,016 US5023776A (en) 1988-02-22 1988-02-22 Store queue for a tightly coupled multiple processor configuration with two-level cache buffer storage
US159,016 1988-02-22

Publications (1)

Publication Number Publication Date
CA1315896C true CA1315896C (en) 1993-04-06

Family

ID=22570714

Family Applications (1)

Application Number Title Priority Date Filing Date
CA000588790A Expired - Fee Related CA1315896C (en) 1988-02-22 1989-01-20 Store queue for a tightly coupled multiple processor configuration with two-level cache buffer storage

Country Status (6)

Country Link
US (1) US5023776A (en)
EP (1) EP0329942B1 (en)
JP (1) JPH0648479B2 (en)
BR (1) BR8900552A (en)
CA (1) CA1315896C (en)
DE (1) DE68922326T2 (en)

Families Citing this family (159)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2714952B2 (en) * 1988-04-20 1998-02-16 株式会社日立製作所 Computer system
US5371874A (en) * 1989-01-27 1994-12-06 Digital Equipment Corporation Write-read/write-pass memory subsystem cycle
US5214777A (en) * 1989-03-27 1993-05-25 Ncr Corporation High speed read/modify/write memory system and method
US5155832A (en) * 1989-07-05 1992-10-13 Hewlett-Packard Company Method to increase performance in a multi-level cache system by the use of forced cache misses
JPH0340046A (en) * 1989-07-06 1991-02-20 Hitachi Ltd Cache memory control system and information processor
US5214765A (en) * 1989-08-31 1993-05-25 Sun Microsystems, Inc. Method and apparatus for executing floating point instructions utilizing complimentary floating point pipeline and multi-level caches
US5230070A (en) * 1989-09-08 1993-07-20 International Business Machines Corporation Access authorization table for multi-processor caches
US5307477A (en) * 1989-12-01 1994-04-26 Mips Computer Systems, Inc. Two-level cache memory system
US5136700A (en) * 1989-12-22 1992-08-04 Digital Equipment Corporation Apparatus and method for reducing interference in two-level cache memories
US5317718A (en) * 1990-03-27 1994-05-31 Digital Equipment Corporation Data processing system and method with prefetch buffers
US5197139A (en) * 1990-04-05 1993-03-23 International Business Machines Corporation Cache management for multi-processor systems utilizing bulk cross-invalidate
ATE170642T1 (en) * 1990-06-15 1998-09-15 Compaq Computer Corp MULTI-LEVEL INCLUSION IN MULTI-LEVEL CACHE MEMORY HIERARCHICES
DE59008668D1 (en) * 1990-06-26 1995-04-13 Siemens Ag Program-controlled communication system.
US5317708A (en) * 1990-06-29 1994-05-31 Digital Equipment Corporation Apparatus and method for an improved content addressable memory
US5193167A (en) * 1990-06-29 1993-03-09 Digital Equipment Corporation Ensuring data integrity by locked-load and conditional-store operations in a multiprocessor system
US5287512A (en) * 1990-08-06 1994-02-15 Ncr Corporation Computer memory system and method for cleaning data elements
US5530941A (en) * 1990-08-06 1996-06-25 Ncr Corporation System and method for prefetching data from a main computer memory into a cache memory
EP0470735B1 (en) * 1990-08-06 1999-03-10 NCR International, Inc. Computer memory system
US5724548A (en) * 1990-09-18 1998-03-03 Fujitsu Limited System including processor and cache memory and method of controlling the cache memory
CA2043493C (en) * 1990-10-05 1997-04-01 Ricky C. Hetherington Hierarchical integrated circuit cache memory
JP3144794B2 (en) * 1990-11-09 2001-03-12 株式会社日立製作所 Multiprocessor system
US5249282A (en) * 1990-11-21 1993-09-28 Benchmarq Microelectronics, Inc. Integrated cache memory system with primary and secondary cache memories
US5287473A (en) * 1990-12-14 1994-02-15 International Business Machines Corporation Non-blocking serialization for removing data from a shared cache
US5276835A (en) * 1990-12-14 1994-01-04 International Business Machines Corporation Non-blocking serialization for caching data in a shared cache
JPH04246745A (en) * 1991-02-01 1992-09-02 Canon Inc Memory access system
US5490261A (en) * 1991-04-03 1996-02-06 International Business Machines Corporation Interlock for controlling processor ownership of pipelined data for a store in cache
DE4115152C2 (en) * 1991-05-08 2003-04-24 Gao Ges Automation Org Card-shaped data carrier with a data-protecting microprocessor circuit
US5265233A (en) * 1991-05-17 1993-11-23 Sun Microsystems, Inc. Method and apparatus for providing total and partial store ordering for a memory in multi-processor system
US5440752A (en) * 1991-07-08 1995-08-08 Seiko Epson Corporation Microprocessor architecture with a switch network for data transfer between cache, memory port, and IOU
US5361368A (en) * 1991-09-05 1994-11-01 International Business Machines Corporation Cross interrogate synchronization mechanism including logic means and delay register
US5530835A (en) * 1991-09-18 1996-06-25 Ncr Corporation Computer memory data merging technique for computers with write-back caches
US5423016A (en) * 1992-02-24 1995-06-06 Unisys Corporation Block buffer for instruction/operand caches
US5485592A (en) * 1992-04-07 1996-01-16 Video Technology Computers, Ltd. Write back cache controller method and apparatus for use in a system having a CPU with internal cache memory
JP2788836B2 (en) * 1992-05-15 1998-08-20 インターナショナル・ビジネス・マシーンズ・コーポレイション Digital computer system
US5450561A (en) * 1992-07-29 1995-09-12 Bull Hn Information Systems Inc. Cache miss prediction method and apparatus for use with a paged main memory in a data processing system
US6735685B1 (en) * 1992-09-29 2004-05-11 Seiko Epson Corporation System and method for handling load and/or store operations in a superscalar microprocessor
US5455942A (en) * 1992-10-01 1995-10-03 International Business Machines Corporation Partial page write detection for a shared cache using a bit pattern written at the beginning and end of each page
AU5598794A (en) * 1992-11-09 1994-06-08 Ast Research, Inc. Write buffer with full rank byte gathering
US5367701A (en) * 1992-12-21 1994-11-22 Amdahl Corporation Partitionable data processing system maintaining access to all main storage units after being partitioned
US5375223A (en) * 1993-01-07 1994-12-20 International Business Machines Corporation Single register arbiter circuit
CA2107056C (en) * 1993-01-08 1998-06-23 James Allan Kahle Method and system for increased system memory concurrency in a multiprocessor computer system
US5809525A (en) * 1993-09-17 1998-09-15 International Business Machines Corporation Multi-level computer cache system providing plural cache controllers associated with memory address ranges and having cache directories
US5530832A (en) * 1993-10-14 1996-06-25 International Business Machines Corporation System and method for practicing essential inclusion in a multiprocessor and cache hierarchy
US5615402A (en) * 1993-10-18 1997-03-25 Cyrix Corporation Unified write buffer having information identifying whether the address belongs to a first write operand or a second write operand having an extra wide latch
US5740398A (en) * 1993-10-18 1998-04-14 Cyrix Corporation Program order sequencing of data in a microprocessor with write buffer
US5471598A (en) * 1993-10-18 1995-11-28 Cyrix Corporation Data dependency detection and handling in a microprocessor with write buffer
US6219773B1 (en) 1993-10-18 2001-04-17 Via-Cyrix, Inc. System and method of retiring misaligned write operands from a write buffer
US5668985A (en) * 1994-03-01 1997-09-16 Intel Corporation Decoder having a split queue system for processing instructions in a first queue separate from their associated data processed in a second queue
US5623628A (en) * 1994-03-02 1997-04-22 Intel Corporation Computer system and method for maintaining memory consistency in a pipelined, non-blocking caching bus request queue
US5590309A (en) * 1994-04-01 1996-12-31 International Business Machines Corporation Storage protection cache and backing storage having system control element data cache pipeline and storage protection bits in a stack array with a stack directory for the stack array
US5649092A (en) * 1994-04-21 1997-07-15 Unisys Corporation Fault tolerant apparatus and method for maintaining one or more queues that are shared by multiple processors
US5590304A (en) * 1994-06-13 1996-12-31 Convex Computer Corporation Circuits, systems and methods for preventing queue overflow in data processing systems
US5551001A (en) * 1994-06-29 1996-08-27 Exponential Technology, Inc. Master-slave cache system for instruction and data cache memories
US5692152A (en) * 1994-06-29 1997-11-25 Exponential Technology, Inc. Master-slave cache system with de-coupled data and tag pipelines and loop-back
US5644752A (en) * 1994-06-29 1997-07-01 Exponential Technology, Inc. Combined store queue for a master-slave cache system
JP3277730B2 (en) * 1994-11-30 2002-04-22 株式会社日立製作所 Semiconductor memory device and information processing device using the same
JP3132749B2 (en) * 1994-12-05 2001-02-05 インターナショナル・ビジネス・マシーンズ・コーポレ−ション Multiprocessor data processing system
US5584013A (en) * 1994-12-09 1996-12-10 International Business Machines Corporation Hierarchical cache arrangement wherein the replacement of an LRU entry in a second level cache is prevented when the cache entry is the only inclusive entry in the first level cache
US5717942A (en) * 1994-12-27 1998-02-10 Unisys Corporation Reset for independent partitions within a computer system
US5603005A (en) * 1994-12-27 1997-02-11 Unisys Corporation Cache coherency scheme for XBAR storage structure with delayed invalidates until associated write request is executed
US5638538A (en) * 1995-01-13 1997-06-10 Digital Equipment Corporation Turbotable: apparatus for directing address and commands between multiple consumers on a node coupled to a pipelined system bus
US5663961A (en) * 1995-02-24 1997-09-02 Motorola, Inc. Packet switch with centralized buffering for many output channels
US5701313A (en) * 1995-02-24 1997-12-23 Unisys Corporation Method and apparatus for removing soft errors from a memory
US5511164A (en) * 1995-03-01 1996-04-23 Unisys Corporation Method and apparatus for determining the source and nature of an error within a computer system
US5680598A (en) * 1995-03-31 1997-10-21 International Business Machines Corporation Millicode extended memory addressing using operand access control register to control extended address concatenation
US5649155A (en) * 1995-03-31 1997-07-15 International Business Machines Corporation Cache memory accessed by continuation requests
US5666494A (en) * 1995-03-31 1997-09-09 Samsung Electronics Co., Ltd. Queue management mechanism which allows entries to be processed in any order
US5752264A (en) * 1995-03-31 1998-05-12 International Business Machines Corporation Computer architecture incorporating processor clusters and hierarchical cache memories
US5638534A (en) * 1995-03-31 1997-06-10 Samsung Electronics Co., Ltd. Memory controller which executes read and write commands out of order
US5767856A (en) * 1995-08-22 1998-06-16 Rendition, Inc. Pixel engine pipeline for a 3D graphics accelerator
US5963981A (en) * 1995-10-06 1999-10-05 Silicon Graphics, Inc. System and method for uncached store buffering in a microprocessor
US5724533A (en) * 1995-11-17 1998-03-03 Unisys Corporation High performance instruction data path
US5680571A (en) * 1995-12-28 1997-10-21 Unisys Corporation Multi-processor data processing system with multiple, separate instruction and operand second level caches
US5875462A (en) * 1995-12-28 1999-02-23 Unisys Corporation Multi-processor data processing system with multiple second level caches mapable to all of addressable memory
US6279077B1 (en) * 1996-03-22 2001-08-21 Texas Instruments Incorporated Bus interface buffer control in a microprocessor
US5911051A (en) * 1996-03-29 1999-06-08 Intel Corporation High-throughput interconnect allowing bus transactions based on partial access requests
US6317803B1 (en) 1996-03-29 2001-11-13 Intel Corporation High-throughput interconnect having pipelined and non-pipelined bus transaction modes
US5829010A (en) * 1996-05-31 1998-10-27 Sun Microsystems, Inc. Apparatus and method to efficiently abort and restart a primary memory access
US5867699A (en) * 1996-07-25 1999-02-02 Unisys Corporation Instruction flow control for an instruction processor
US5835908A (en) 1996-11-19 1998-11-10 Microsoft Corporation Processing multiple database transactions in the same process to reduce process overhead and redundant retrieval from database servers
US5781182A (en) * 1996-11-19 1998-07-14 Winbond Electronics Corp. Line buffer apparatus with an extendible command
US6035424A (en) * 1996-12-09 2000-03-07 International Business Machines Corporation Method and apparatus for tracking processing of a command
US6000011A (en) * 1996-12-09 1999-12-07 International Business Machines Corporation Multi-entry fully associative transition cache
US5889996A (en) * 1996-12-16 1999-03-30 Novell Inc. Accelerator for interpretive environments
US6279098B1 (en) 1996-12-16 2001-08-21 Unisys Corporation Method of and apparatus for serial dynamic system partitioning
US5960455A (en) * 1996-12-30 1999-09-28 Unisys Corporation Scalable cross bar type storage controller
US5875201A (en) * 1996-12-30 1999-02-23 Unisys Corporation Second level cache having instruction cache parity error control
US6122711A (en) 1997-01-07 2000-09-19 Unisys Corporation Method of and apparatus for store-in second level cache flush
US5970253A (en) * 1997-01-09 1999-10-19 Unisys Corporation Priority logic for selecting and stacking data
US5822766A (en) * 1997-01-09 1998-10-13 Unisys Corporation Main memory interface for high speed data transfer
US6341301B1 (en) 1997-01-10 2002-01-22 Lsi Logic Corporation Exclusive multiple queue handling using a common processing algorithm
US5922057A (en) * 1997-01-10 1999-07-13 Lsi Logic Corporation Method for multiprocessor system of controlling a dynamically expandable shared queue in which ownership of a queue entry by a processor is indicated by a semaphore
US5966547A (en) * 1997-01-10 1999-10-12 Lsi Logic Corporation System for fast posting to shared queues in multi-processor environments utilizing interrupt state checking
US5860093A (en) * 1997-01-21 1999-01-12 Unisys Corporation Reduced instruction processor/storage controller interface
US5983310A (en) 1997-02-13 1999-11-09 Novell, Inc. Pin management of accelerator for interpretive environments
US5926645A (en) * 1997-07-22 1999-07-20 International Business Machines Corporation Method and system for enabling multiple store instruction completions in a processing system
US5956714A (en) * 1997-08-13 1999-09-21 Southwestern Bell Telephone Company Queuing system using a relational database
US6138209A (en) * 1997-09-05 2000-10-24 International Business Machines Corporation Data processing system and multi-way set associative cache utilizing class predict data structure and method thereof
US6079002A (en) * 1997-09-23 2000-06-20 International Business Machines Corporation Dynamic expansion of execution pipeline stages
US6263404B1 (en) 1997-11-21 2001-07-17 International Business Machines Corporation Accessing data from a multiple entry fully associative cache buffer in a multithread data processing system
US6073129A (en) * 1997-12-29 2000-06-06 Bull Hn Information Systems Inc. Method and apparatus for improving the performance of a database management system through a central cache mechanism
US6345339B1 (en) * 1998-02-17 2002-02-05 International Business Machines Corporation Pseudo precise I-cache inclusivity for vertical caches
US6141732A (en) * 1998-03-24 2000-10-31 Novell, Inc. Burst-loading of instructions into processor cache by execution of linked jump instructions embedded in cache line size blocks
US6578193B1 (en) 1998-03-24 2003-06-10 Novell, Inc. Endian-neutral loader for interpretive environment
US6356996B1 (en) 1998-03-24 2002-03-12 Novell, Inc. Cache fencing for interpretive environments
US6173393B1 (en) * 1998-03-31 2001-01-09 Intel Corporation System for writing select non-contiguous bytes of data with single instruction having operand identifying byte mask corresponding to respective blocks of packed data
US6272597B1 (en) * 1998-12-31 2001-08-07 Intel Corporation Dual-ported, pipelined, two level cache system
US6473834B1 (en) 1999-12-22 2002-10-29 Unisys Corporation Method and apparatus for preventing stalling of cache reads during return of multiple data words
US6415357B1 (en) 1999-12-23 2002-07-02 Unisys Corporation Caching method and apparatus
JP2001195250A (en) * 2000-01-13 2001-07-19 Mitsubishi Electric Corp Instruction translator and instruction memory with translator and data processor using the same
US7065096B2 (en) * 2000-06-23 2006-06-20 Mips Technologies, Inc. Method for allocating memory space for limited packet head and/or tail growth
US7155516B2 (en) 2000-02-08 2006-12-26 Mips Technologies, Inc. Method and apparatus for overflowing data packets to a software-controlled memory when they do not fit into a hardware-controlled memory
US7032226B1 (en) * 2000-06-30 2006-04-18 Mips Technologies, Inc. Methods and apparatus for managing a buffer of events in the background
US7165257B2 (en) * 2000-02-08 2007-01-16 Mips Technologies, Inc. Context selection and activation mechanism for activating one of a group of inactive contexts in a processor core for servicing interrupts
US7058065B2 (en) * 2000-02-08 2006-06-06 Mips Tech Inc Method and apparatus for preventing undesirable packet download with pending read/write operations in data packet processing
US7058064B2 (en) * 2000-02-08 2006-06-06 Mips Technologies, Inc. Queueing system for processors in packet routing operations
US7139901B2 (en) * 2000-02-08 2006-11-21 Mips Technologies, Inc. Extended instruction set for packet processing applications
US7649901B2 (en) * 2000-02-08 2010-01-19 Mips Technologies, Inc. Method and apparatus for optimizing selection of available contexts for packet processing in multi-stream packet processing
US7076630B2 (en) * 2000-02-08 2006-07-11 Mips Tech Inc Method and apparatus for allocating and de-allocating consecutive blocks of memory in background memory management
US7082552B2 (en) * 2000-02-08 2006-07-25 Mips Tech Inc Functional validation of a packet management unit
US7502876B1 (en) 2000-06-23 2009-03-10 Mips Technologies, Inc. Background memory manager that determines if data structures fits in memory with memory state transactions map
US20010052053A1 (en) * 2000-02-08 2001-12-13 Mario Nemirovsky Stream processing unit for a multi-streaming processor
US7042887B2 (en) 2000-02-08 2006-05-09 Mips Technologies, Inc. Method and apparatus for non-speculative pre-fetch operation in data packet processing
US6778444B1 (en) * 2000-08-18 2004-08-17 Intel Corporation Buffer for a split cache line access
US6857049B1 (en) 2000-08-30 2005-02-15 Unisys Corporation Method for managing flushes with the cache
US7069391B1 (en) 2000-08-30 2006-06-27 Unisys Corporation Method for improved first level cache coherency
US6928517B1 (en) 2000-08-30 2005-08-09 Unisys Corporation Method for avoiding delays during snoop requests
US6697925B1 (en) 2000-12-22 2004-02-24 Unisys Corporation Use of a cache ownership mechanism to synchronize multiple dayclocks
DE10122422A1 (en) 2001-05-09 2002-11-21 Siemens Ag Method for adjusting bandwidth in a connection between two communications terminals in a data network allocates a transmission channel to the connection for transmitting data.
US6785775B1 (en) 2002-03-19 2004-08-31 Unisys Corporation Use of a cache coherency mechanism as a doorbell indicator for input/output hardware queues
US6941421B2 (en) * 2002-10-29 2005-09-06 International Business Machines Corporation Zero delay data cache effective address generation
US7039836B2 (en) * 2003-04-01 2006-05-02 Hewlett-Packard Development Company, L.P. High performance computer system having a firmware error queue and automatic error handling
US7360021B2 (en) * 2004-04-15 2008-04-15 International Business Machines Corporation System and method for completing updates to entire cache lines with address-only bus operations
US8352712B2 (en) * 2004-05-06 2013-01-08 International Business Machines Corporation Method and system for speculatively sending processor-issued store operations to a store queue with full signal asserted
JP4504132B2 (en) * 2004-07-30 2010-07-14 富士通株式会社 Storage control device, central processing unit, information processing device, and storage control device control method
US7603528B2 (en) * 2004-10-08 2009-10-13 International Business Machines Corporation Memory device verification of multiple write operations
JP2006113882A (en) * 2004-10-15 2006-04-27 Fujitsu Ltd Data management device
US8645973B2 (en) * 2006-09-22 2014-02-04 Oracle International Corporation Mobile applications
US20080104333A1 (en) * 2006-10-31 2008-05-01 Veazey Judson E Tracking of higher-level cache contents in a lower-level cache
US7941728B2 (en) * 2007-03-07 2011-05-10 International Business Machines Corporation Method and system for providing an improved store-in cache
US8205064B2 (en) * 2007-05-11 2012-06-19 Advanced Micro Devices, Inc. Latency hiding for a memory management unit page table lookup
DE102007062974B4 (en) * 2007-12-21 2010-04-08 Phoenix Contact Gmbh & Co. Kg Signal processing device
US8195881B2 (en) * 2008-02-26 2012-06-05 International Business Machines Corporation System, method and processor for accessing data after a translation lookaside buffer miss
US7808849B2 (en) * 2008-07-08 2010-10-05 Nvidia Corporation Read leveling of memory units designed to receive access requests in a sequential chained topology
US7796465B2 (en) * 2008-07-09 2010-09-14 Nvidia Corporation Write leveling of memory units designed to receive access requests in a sequential chained topology
US8461884B2 (en) * 2008-08-12 2013-06-11 Nvidia Corporation Programmable delay circuit providing for a wide span of delays
US8719510B2 (en) * 2010-03-29 2014-05-06 Via Technologies, Inc. Bounding box prefetcher with reduced warm-up penalty on memory block crossings
US8762649B2 (en) * 2010-03-29 2014-06-24 Via Technologies, Inc. Bounding box prefetcher
US8645631B2 (en) * 2010-03-29 2014-02-04 Via Technologies, Inc. Combined L2 cache and L1D cache prefetcher
US9075732B2 (en) 2010-06-15 2015-07-07 International Business Machines Corporation Data caching method
US9483406B2 (en) 2013-03-11 2016-11-01 Via Technologies, Inc. Communicating prefetchers that throttle one another
US9251083B2 (en) 2013-03-11 2016-02-02 Via Technologies, Inc. Communicating prefetchers in a microprocessor
US10083035B2 (en) * 2013-07-15 2018-09-25 Texas Instruments Incorporated Dual data streams sharing dual level two cache access ports to maximize bandwidth utilization
US9531829B1 (en) * 2013-11-01 2016-12-27 Instart Logic, Inc. Smart hierarchical cache using HTML5 storage APIs
US20150378900A1 (en) * 2014-06-27 2015-12-31 International Business Machines Corporation Co-processor memory accesses in a transactional memory
US10133489B2 (en) * 2014-09-16 2018-11-20 Oracle International Corporation System and method for supporting a low contention queue in a distributed data grid
US10394566B2 (en) * 2017-06-06 2019-08-27 International Business Machines Corporation Banked cache temporarily favoring selection of store requests from one of multiple store queues
US10891227B2 (en) * 2018-11-29 2021-01-12 International Business Machines Corporation Determining modified tracks to destage during a cache scan
CN112019589B (en) * 2020-06-30 2023-09-05 浙江远望信息股份有限公司 Multi-level load balancing data packet processing method

Family Cites Families (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4136386A (en) * 1977-10-06 1979-01-23 International Business Machines Corporation Backing store access coordination in a multi-processor system
US4354232A (en) * 1977-12-16 1982-10-12 Honeywell Information Systems Inc. Cache memory command buffer circuit
US4167782A (en) * 1977-12-22 1979-09-11 Honeywell Information Systems Inc. Continuous updating of cache store
US4167779A (en) * 1978-03-10 1979-09-11 Digital Equipment Corporation Diagnostic apparatus in a data processing system
US4323968A (en) * 1978-10-26 1982-04-06 International Business Machines Corporation Multilevel storage system having unitary control of data transfers
US4225922A (en) * 1978-12-11 1980-09-30 Honeywell Information Systems Inc. Command queue apparatus included within a cache unit for facilitating command sequencing
US4349871A (en) * 1980-01-28 1982-09-14 Digital Equipment Corporation Duplicate tag store for cached multiprocessor system
US4467414A (en) * 1980-08-22 1984-08-21 Nippon Electric Co., Ltd. Cache memory arrangement comprising a cache buffer in combination with a pair of cache memories
US4425615A (en) * 1980-11-14 1984-01-10 Sperry Corporation Hierarchical memory system having cache/disk subsystem with command queues for plural disks
US4445174A (en) * 1981-03-31 1984-04-24 International Business Machines Corporation Multiprocessing system including a shared cache
US4484267A (en) * 1981-12-30 1984-11-20 International Business Machines Corporation Cache sharing control in a multiprocessor
US4442487A (en) * 1981-12-31 1984-04-10 International Business Machines Corporation Three level memory hierarchy using write and share flags
DE3584318D1 (en) * 1984-06-29 1991-11-14 Ibm HIGH-SPEED BUFFER ARRANGEMENT FOR FAST DATA TRANSFER.
US4823259A (en) * 1984-06-29 1989-04-18 International Business Machines Corporation High speed buffer store arrangement for quick wide transfer of data
US4755930A (en) * 1985-06-27 1988-07-05 Encore Computer Corporation Hierarchical cache memory system and method

Also Published As

Publication number Publication date
BR8900552A (en) 1989-10-17
JPH0648479B2 (en) 1994-06-22
US5023776A (en) 1991-06-11
DE68922326T2 (en) 1995-10-26
DE68922326D1 (en) 1995-06-01
EP0329942A2 (en) 1989-08-30
EP0329942B1 (en) 1995-04-26
JPH01246655A (en) 1989-10-02
EP0329942A3 (en) 1990-08-29

Similar Documents

Publication Publication Date Title
CA1315896C (en) Store queue for a tightly coupled multiple processor configuration with two-level cache buffer storage
US5276848A (en) Shared two level cache including apparatus for maintaining storage consistency
US5202972A (en) Store buffer apparatus in a multiprocessor system
US5148533A (en) Apparatus and method for data group coherency in a tightly coupled data processing system with plural execution and data cache units
US5222224A (en) Scheme for insuring data consistency between a plurality of cache memories and the main memory in a multi-processor system
US6161208A (en) Storage subsystem including an error correcting cache and means for performing memory to memory transfers
US6145054A (en) Apparatus and method for handling multiple mergeable misses in a non-blocking cache
US5809530A (en) Method and apparatus for processing multiple cache misses using reload folding and store merging
US4394731A (en) Cache storage line shareability control for a multiprocessor system
US6226713B1 (en) Apparatus and method for queueing structures in a multi-level non-blocking cache subsystem
US5418916A (en) Central processing unit checkpoint retry for store-in and store-through cache systems
US5291586A (en) Hardware implementation of complex data transfer instructions
US6430654B1 (en) Apparatus and method for distributed non-blocking multi-level cache
US6148372A (en) Apparatus and method for detection and recovery from structural stalls in a multi-level non-blocking cache system
KR100228940B1 (en) Method for maintaining memory coherency in a computer system having a cache
US5898866A (en) Method and apparatus for counting remaining loop instructions and pipelining the next instruction
US5265233A (en) Method and apparatus for providing total and partial store ordering for a memory in multi-processor system
US20020144061A1 (en) Vector and scalar data cache for a vector multiprocessor
WO1996012227A1 (en) An address queue capable of tracking memory dependencies
EP0303648B1 (en) Central processor unit for digital data processing system including cache management mechanism
JPH0670779B2 (en) Fetch method
US6418513B1 (en) Queue-less and state-less layered local data cache mechanism
EP0380842A2 (en) Method and apparatus for interfacing a system control unit for a multiprocessor system with the central processing units
EP0348616B1 (en) Storage subsystem including an error correcting cache
EP0375892B1 (en) Data processing system

Legal Events

Date Code Title Description
MKLA Lapsed