WO2002099652A1 - Method and apparatus for facilitating flow control during accesses to cache memory

Method and apparatus for facilitating flow control during accesses to cache memory

Info

Publication number
WO2002099652A1
Authority
WO
WIPO (PCT)
Prior art keywords
cache
cache memory
outstanding
miss
accesses
Prior art date
Application number
PCT/US2002/017620
Other languages
French (fr)
Inventor
Marc Tremblay
Shailender Chaudhry
Original Assignee
Sun Microsystems, Inc.
Priority date
Filing date
Publication date
Application filed by Sun Microsystems, Inc.
Publication of WO2002099652A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00: Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02: Addressing or allocation; Relocation
    • G06F12/08: Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802: Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0844: Multiple simultaneous or quasi-simultaneous cache accessing
    • G06F12/0855: Overlapped cache accessing, e.g. pipeline
    • G06F12/0859: Overlapped cache accessing, e.g. pipeline with reload from main memory
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00: Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02: Addressing or allocation; Relocation
    • G06F12/08: Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802: Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0806: Multiuser, multiprocessor or multiprocessing cache systems
    • G06F12/0813: Multiuser, multiprocessor or multiprocessing cache systems with a network or matrix configuration

Definitions

  • the present invention relates to the design of cache memories in computer systems. More specifically, the present invention relates to a method and an apparatus for facilitating flow control in order to support pipelined accesses to and from a cache memory.
  • Caches are typically designed with a set-associative architecture that uses a number of address bits from a request to determine a "set" to which the request is directed.
  • A set-associative cache stores a number of entries for each set, and these entries are typically referred to as "ways". For example, a four-way set-associative cache contains four entries for each set. This means that a four-way set-associative cache essentially provides a small four-entry cache for each set.
  • One embodiment of the present invention provides a system that facilitates flow control to support pipelined accesses to a cache memory.
  • When an access to the cache memory generates a miss, the system increments a number of outstanding misses that are currently in process for a set in the cache to which the miss is directed. If the number of outstanding misses is greater than or equal to a threshold value, the system stalls generation of subsequent accesses to the cache memory until the number of outstanding misses for each set in the cache memory falls below the threshold value.
  • Upon receiving a cache line from a memory subsystem in response to an outstanding miss, the system identifies a set that the outstanding miss is directed to. The system then installs the cache line in an entry associated with the set.
  • the system also decrements a number of outstanding misses that are currently in process for the set. If the number of outstanding misses falls below the threshold value as a result of decrementing, and if no other set has a number of outstanding misses that is greater than or equal to the threshold value, the system removes the stall condition so that subsequent accesses can be generated for the cache memory. In one embodiment of the present invention, the system determines whether to remove the stall condition by examining a state machine. This state machine keeps track of a number of outstanding misses that cause sets in the cache memory to meet or exceed the threshold value.
  • the system additionally replays the access that caused the cache line to be retrieved.
  • the system increments the number of outstanding misses that are currently in process for the set by setting a prior miss bit that is associated with an entry for a specific set and way in the cache memory.
  • This prior miss bit indicates that an outstanding miss is in process and will eventually fill the entry for the specific set and way.
  • the prior miss bit is stored along with a tag for the specific set and way, so that a tag lookup returns the prior miss bit.
  • the cache memory is a Level Two (L2) cache and the access is received from a Level One (L1) cache.
  • receiving the access involves receiving the access from a queue located at the L2 cache, wherein the queue contains accesses generated by the L1 cache.
  • the system uses credit-based flow control to limit sending of accesses from the L1 cache into the queue, so that the queue does not overflow.
  • the L2 cache receives accesses from a plurality of L1 caches.
  • the threshold value is less than a number of entries in the cache memory associated with each set. This effectively reserves one or more additional entries for each set to accommodate in-flight accesses that have been generated but not received at the cache memory.
  • FIG. 1 illustrates a multiprocessor system in accordance with an embodiment of the present invention.
  • FIG. 2 illustrates in more detail the multiprocessor system illustrated in FIG. 1 in accordance with an embodiment of the present invention.
  • FIG. 3 illustrates the structure of an L2 bank in accordance with an embodiment of the present invention.
  • FIG. 4 illustrates status bits and a tag associated with an L2 cache entry in accordance with an embodiment of the present invention.
  • FIG. 5 illustrates an exemplary pattern of pending miss operations in accordance with an embodiment of the present invention.
  • FIG. 6 illustrates a state diagram for pending miss operations in accordance with an embodiment of the present invention.
  • FIG. 7 is a flow chart illustrating processing of a cache access in accordance with an embodiment of the present invention.
  • FIG. 8 is a flow chart illustrating processing of a cache line return in accordance with an embodiment of the present invention.
  • FIG. 1 illustrates a multiprocessor system 100 in accordance with an embodiment of the present invention. Note that most of multiprocessor system 100 is located within a single semiconductor chip 101. More specifically, semiconductor chip 101 includes a number of processors 110, 120, 130 and 140, which contain level one (L1) caches 112, 122, 132 and 142, respectively. Note that L1 caches 112, 122, 132 and 142 may be separate instruction and data caches, or alternatively, unified instruction/data caches.
  • L1 caches 112, 122, 132 and 142 are coupled to level two (L2) cache 106.
  • L2 cache 106 is in turn coupled to off-chip memory 102 through memory controller 104.
  • L1 caches 112, 122, 132 and 142 are write-through caches, which means that all updates to L1 caches 112, 122, 132 and 142 are automatically propagated to L2 cache 106. This simplifies the coherence protocol, because if processor 110 requires a data item that is present in L1 cache 112, processor 110 can receive the data item from L2 cache 106 without having to wait for L1 cache 112 to source the data item.
  • FIG. 2 illustrates in more detail the multiprocessor system illustrated in FIG. 1 in accordance with an embodiment of the present invention.
  • L2 cache 106 is implemented with four banks 202-205, which can be accessed in parallel by processors 110, 120, 130 and 140 through switches 215 and 216.
  • Switch 215 handles communications that feed from processors 110, 120, 130 and 140 into L2 banks 202-205.
  • Switch 216 handles communications in the reverse direction, from L2 banks 202-205 to processors 110, 120, 130 and 140.
  • switch 215 additionally includes an I/O port 150 for receiving communications from I/O devices, and switch 216 includes an I/O port 152 for sending communications to I/O devices. Note that by using this "banked" architecture, it is possible to concurrently connect each L1 cache to its own bank of L2 cache, thereby increasing the bandwidth of L2 cache 106.
  • FIG. 3 illustrates the structure of an L2 bank 202 in accordance with an embodiment of the present invention.
  • L2 bank 202 is a four-way set-associative cache, wherein there are four entries for each set 350. Note that each entry can be structured as a standard set-associative cache entry, including storage for a cache line as well as storage for tag and status bits. There are additionally four comparators to perform an associative lookup for each set. These standard cache structures, such as comparators, are known to those skilled in the art and are not illustrated in FIG. 3 in the interests of clarity.
  • Processors 110, 120, 130 and 140 generate accesses to L2 bank 202 as a result of cache misses that arise during accesses to L1 caches 112, 122, 132 and 142.
  • accesses can include both read and write accesses.
  • These accesses feed through switch 215 (illustrated in FIG. 2) into request queues 310, 320, 330 and 340, which are associated with L2 bank 202.
  • Request queues 310, 320, 330 and 340 are used to store accesses to be processed by L2 bank 202.
  • request queues 310, 320, 330 and 340 can be located at L2 bank 202.
  • request queues 310, 320, 330 and 340 are located within switch 215.
  • Request queues 310, 320, 330 and 340 can send associated flow control feedback signals 313, 323, 333 and 343 back to processors 110, 120, 130 and 140, respectively, in order to prevent processors 110, 120, 130 and 140 from sending additional accesses. This ensures that request queues 310, 320, 330 and 340 do not overflow.
  • the flow control mechanism that operates between request queues 310, 320, 330, 340 and processors 110, 120, 130 and 140 is credit-based.
  • Each time processor 110 sends a request to request queue 310, the number of credits is decremented.
  • As requests in request queue 310 are processed, feedback signal 313 sends one or more credits back to processor 110. This causes the number of credits in processor 110 to increase.
  • Without using a credit-based flow control system, room must be reserved in request queue 310 to accommodate all possible accesses that may be in-flight between processor 110 and request queue 310.
  • If no additional room is available in request queue 310, a stall signal must be sent to processor 110. In contrast, by using a credit-based control system, processor 110 is able to keep sending accesses to request queue 310 so long as it has credits remaining, even if there is not enough room in request queue 310 to accommodate all possible in-flight transactions.
  • L2 bank 202 is also associated with a pending transaction buffer 360.
  • Pending transaction buffer 360 keeps track of transactions (accesses) that have been stalled by cache misses. This allows these stalled transactions to be replayed when a desired cache line returns from memory.
  • pending transaction buffer 360 may specify the set and way location for each pending transaction.
  • L2 bank 202 makes accesses to memory 102 in order to retrieve desired cache lines. These accesses can be pipelined through memory buffer 362, which may be located at memory controller 104 (illustrated in FIG. 1). Note that if memory buffer 362 becomes too full, processors 110, 120, 130 and 140 may be halted through a mechanism that makes use of a feedback signal 363 between memory buffer 362 and L2 bank 202. Note that this mechanism may also be credit-based.
  • FIG. 4 illustrates status bits 400 and a tag 401 associated with an L2 cache entry in accordance with an embodiment of the present invention.
  • Tag 401 includes higher order address bits that are used to perform an associative lookup.
  • Status bits 400 include bits that indicate if the corresponding cache entry is dirty 408 and/or valid 406, as well as ownership bits 402 that specify an ownership state for a cache coherence protocol. For example, ownership states can be specified by the MOESI standard.
  • Status bits 400 also include a prior miss bit 404, which indicates that a miss transaction is in process for the entry. Note that during an associative lookup in the set, prior miss bits for all entries in a set are returned along with the tag information. This allows the system to add the prior miss bits together in order to determine how many cache entries for the set are associated with pending miss transactions.
  • FIG. 5 illustrates an exemplary pattern of pending miss operations in accordance with an embodiment of the present invention.
  • This example illustrates a number of pending miss transactions for each of four sets, 510, 520, 530 and 540, in a four-way set-associative cache. (Of course, a more realistic set-associative cache has hundreds or thousands of sets.)
  • set 510 has two pending miss transactions
  • a stall is generated, even though one entry remains in set 540.
  • a stall is generated at this time because there might be one more future miss 533 in the pipeline that will not be caught by the stall. Consequently, this future miss 533 may cause the last entry in set 540 to be filled.
  • the future miss may cause set 530 to have three entries.
  • FIG. 6 illustrates a state diagram for a state machine that keeps track of pending miss operations in accordance with an embodiment of the present invention.
  • The system starts out in no stall state 602. If the system encounters a miss, and the total number of prior miss bits associated with the set is greater than or equal to two, the miss creates a third outstanding miss for the set. Hence, the system enters stall1 state 604.
  • In stall1 state 604, if the system encounters a miss and the total number of prior miss bits associated with the set is greater than or equal to two, the miss may create a third outstanding miss for another set, or possibly a fourth outstanding miss for the set that caused the system to enter stall1 state 604. This causes the system to enter stall2 state 606, indicating that two pending misses must be cleared before the system can be unstalled.
  • In stall2 state 606, if a cache line is returned from memory 102 to L2 bank 202, and if the number of prior miss bits in the associated set is greater than or equal to three, the state machine returns to stall1 state 604.
  • In stall1 state 604, if a cache line is returned from memory 102 to L2 bank 202, and if the number of prior miss bits in the associated set is greater than or equal to three, the state machine returns to no stall state 602.
  • the prior miss bit is set after the prior miss bits are totaled to determine whether a state transition needs to take place. Note that it is also possible to set the prior miss bit before the bits are totaled. In this case, the number of prior miss bits must be greater than or equal to three (instead of two) to cause a state transition.
  • a prior miss bit is reset after the number of prior miss bits is totaled. Note that it is possible to reset the prior miss bit before the prior miss bits are totaled. In this case, the number of prior miss bits must be greater than or equal to two (instead of three) to cause a state transition.
  • FIG. 7 is a flow chart illustrating processing of a cache access in accordance with an embodiment of the present invention.
  • the system starts by receiving an access to L2 bank 202 from processor 110 (step 702).
  • the system then performs an associative lookup in L2 bank 202 (step 704).
  • the system determines whether a cache miss occurs (step 706). If no cache miss occurs, the system performs the access (which can be a read or write operation) to a line in L2 bank 202 (step 708). If a cache miss occurs, the system determines if the number of prior miss bits for the set is greater than or equal to two (step 710). If not, the system sets a prior miss bit for a cache entry associated with the miss.
  • the system also generates a cache miss by placing the transaction in pending transaction buffer 360 (illustrated in FIG. 3) and requesting a cache line from memory 102 (step 714). If the number of prior miss bits is greater than or equal to two, the system causes a transition in the state machine, either from state 602 to state 604, or from state 604 to state 606. If the transition is from state 602 to state 604, the system issues a stall request to processor 110, so that processor 110 does not generate additional accesses to L2 bank 202 (step 712). The system then sets a prior miss bit for a cache entry associated with the miss, places the transaction in pending transaction buffer 360, and requests a cache line from memory 102 (step 714).
  • FIG. 8 is a flow chart illustrating processing of a cache line return in accordance with an embodiment of the present invention.
  • the system first receives a cache line from memory 102 in response to a pending miss transaction (step 802).
  • the system identifies the set and way location in L2 bank 202 for the returned cache line (step 804).
  • the system then installs the returned cache line into the set and way location in L2 bank 202 (step 806).
  • the system determines if the number of prior miss bits for the set is greater than or equal to three (step 808). If not, the system unsets the prior miss bit for the set and way location that the cache line was installed in and replays any pending transactions for the cache line from pending transaction buffer 360 (step 812).
  • the system causes a transition in the state machine illustrated in FIG. 6. This transition can either be from state 606 to state 604, or from state 604 to state 602. If the transition is from state 604 to state 602, the system removes the stall condition (step 810). The system then unsets the prior miss bit for the set and way location and replays any pending transactions for the cache line from pending transaction buffer 360 (step 812). Note that in performing the above-described operations, the system only has to examine prior miss bits for a single set. It does not have to examine prior miss bits for other sets. This makes the process of examining prior miss bits a purely local operation, which greatly decreases the complexity of the resulting circuit.
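The state machine and the two flow charts described above can be sketched in Python as follows. This is only an illustrative model under stated assumptions, not the patented circuit: the class name `BankModel`, the choice of the first free way as the fill target, and the default of four sets are hypothetical. The thresholds (two on a miss, three on a line return) follow the text, with prior miss bits totaled before they are set or reset.

```python
NO_STALL, STALL1, STALL2 = 0, 1, 2   # states 602, 604 and 606 in the text

class BankModel:
    """Hypothetical software model of one L2 bank's prior miss bits."""

    def __init__(self, num_sets=4, num_ways=4):
        # One prior miss bit per (set, way) entry.
        self.prior_miss = [[False] * num_ways for _ in range(num_sets)]
        self.state = NO_STALL

    @property
    def stalled(self):
        return self.state != NO_STALL

    def on_miss(self, set_index):
        """Steps 710-714: total the bits, maybe transition, then set a bit."""
        bits = self.prior_miss[set_index]
        if sum(bits) >= 2 and self.state < STALL2:
            self.state += 1              # 602 -> 604 or 604 -> 606
        # Assumption: fill the first way with no pending miss. A free way is
        # expected to exist because the stall limits further misses.
        way = bits.index(False)
        bits[way] = True
        return way

    def on_line_return(self, set_index, way):
        """Steps 808-812: total the bits, maybe transition, then reset the bit."""
        bits = self.prior_miss[set_index]
        if sum(bits) >= 3 and self.state > NO_STALL:
            self.state -= 1              # 606 -> 604 or 604 -> 602
        bits[way] = False
```

Note that both methods only read the prior miss bits of the one set involved, which mirrors the point above that the check is a purely local operation.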

Abstract

One embodiment of the present invention provides a system that facilitates flow control to support pipelined accesses to a cache memory. When an access to the cache memory generates a miss, the system increments a number of outstanding misses that are currently in process for a set in the cache to which the miss is directed. If the number of outstanding misses is greater than or equal to a threshold value, the system stalls generation of subsequent accesses to the cache memory until the number of outstanding misses for each set in the cache memory falls below the threshold value. Upon receiving a cache line from a memory subsystem in response to an outstanding miss, the system identifies a set that the outstanding miss is directed to. The system then installs the cache line in an entry associated with the set. The system also decrements a number of outstanding misses that are currently in process for the set. If the number of outstanding misses falls below the threshold value as a result of decrementing, and if no other set has a number of outstanding misses that is greater than or equal to the threshold value, the system removes the stall condition so that subsequent accesses can be generated for the cache memory.

Description

METHOD AND APPARATUS FOR
FACILITATING FLOW CONTROL DURING
ACCESSES TO CACHE MEMORY
Inventor(s): Shailender Chaudhry and Marc Tremblay
BACKGROUND
Field of the Invention
The present invention relates to the design of cache memories in computer systems. More specifically, the present invention relates to a method and an apparatus for facilitating flow control in order to support pipelined accesses to and from a cache memory.
Related Art
As microprocessor clock speeds continue to increase at an exponential rate, it is becoming increasingly harder to provide sufficient data transfer rates between functional units on a microprocessor chip. For example, data transfers between a Level Two (L2) cache and a Level One (L1) cache can potentially require a large number of processor clock cycles. Moreover, the processor will be severely underutilized if it has to wait a large number of clock cycles to complete each access to L2 cache. Hence, in order to keep the processor busy, it is necessary to pipeline data transfers between L2 cache and the processor.
However, pipelining introduces problems. In a pipelined architecture, a number of accesses from the processor to the L2 cache can potentially be in flight at any given time. Furthermore, service times for accesses to the L2 cache are unpredictable because each access can potentially cause a cache miss if the desired data item is not present in L2 cache. Hence, what is needed is a mechanism for halting subsequent accesses to the L2 cache, as well as a mechanism for queuing in flight transactions in case preceding accesses generate time-consuming cache misses.
Additionally, there are limitations on the number of outstanding cache misses that can be pending at any given time. Caches are typically designed with a set-associative architecture that uses a number of address bits from a request to determine a "set" to which the request is directed. A set-associative cache stores a number of entries for each set, and these entries are typically referred to as "ways". For example, a four-way set-associative cache contains four entries for each set. This means that a four-way set-associative cache essentially provides a small four-entry cache for each set.
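To make the role of the address bits concrete, the following Python sketch splits an address into a tag, a set index and a line offset. The line size (64 bytes) and the number of sets (1024) are assumptions chosen for illustration; the text does not specify them, and the function name `split_address` is hypothetical.

```python
# Illustrative only: LINE_BYTES and NUM_SETS are assumed values, not from the text.
LINE_BYTES = 64    # bytes per cache line (assumption)
NUM_SETS = 1024    # number of sets (assumption)
NUM_WAYS = 4       # four entries ("ways") per set, as in the example above

OFFSET_BITS = LINE_BYTES.bit_length() - 1   # log2 of the line size
INDEX_BITS = NUM_SETS.bit_length() - 1      # log2 of the number of sets

def split_address(addr):
    """Return (tag, set_index, offset) for an address."""
    offset = addr & (LINE_BYTES - 1)
    set_index = (addr >> OFFSET_BITS) & (NUM_SETS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, set_index, offset
```

Under these assumptions, two addresses that differ by a multiple of `LINE_BYTES * NUM_SETS` map to the same set and compete for its four ways.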
Note that it is desirable not to allow more than four outstanding miss operations to be pending on any given set in a four-way set-associative cache. For example, if a system allows five outstanding misses, the five misses could potentially return at about the same time, and there would only be room to accommodate four of them. In this case, one of the returned cache lines would immediately be kicked out of the cache. Dealing with this problem can greatly complicate the design of a cache. Hence, what is needed is a mechanism for halting subsequent accesses to the L2 cache when a given set has too many pending miss operations.
SUMMARY
One embodiment of the present invention provides a system that facilitates flow control to support pipelined accesses to a cache memory. When an access to the cache memory generates a miss, the system increments a number of outstanding misses that are currently in process for a set in the cache to which the miss is directed. If the number of outstanding misses is greater than or equal to a threshold value, the system stalls generation of subsequent accesses to the cache memory until the number of outstanding misses for each set in the cache memory falls below the threshold value. Upon receiving a cache line from a memory subsystem in response to an outstanding miss, the system identifies a set that the outstanding miss is directed to. The system then installs the cache line in an entry associated with the set. The system also decrements a number of outstanding misses that are currently in process for the set. If the number of outstanding misses falls below the threshold value as a result of decrementing, and if no other set has a number of outstanding misses that is greater than or equal to the threshold value, the system removes the stall condition so that subsequent accesses can be generated for the cache memory. In one embodiment of the present invention, the system determines whether to remove the stall condition by examining a state machine. This state machine keeps track of a number of outstanding misses that cause sets in the cache memory to meet or exceed the threshold value.
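The counting scheme in this summary can be sketched as a small Python model. It is a minimal sketch under assumptions, not the claimed implementation: the class name `MissTracker`, the dictionary of counters, and the default threshold of three are all hypothetical. The model tracks how many sets are at or above the threshold so that the stall is removed only when no set remains at the limit.

```python
from collections import defaultdict

class MissTracker:
    """Hypothetical model of per-set outstanding-miss counting."""

    def __init__(self, threshold=3):
        self.threshold = threshold            # assumed value for illustration
        self.outstanding = defaultdict(int)   # set index -> misses in process
        self.sets_at_limit = 0                # sets with count >= threshold

    @property
    def stalled(self):
        # Stall while any set has reached the threshold.
        return self.sets_at_limit > 0

    def on_miss(self, set_index):
        self.outstanding[set_index] += 1
        if self.outstanding[set_index] == self.threshold:
            self.sets_at_limit += 1

    def on_fill(self, set_index):
        # A returned cache line drops the set below the threshold only if
        # the set was exactly at the threshold before the decrement.
        if self.outstanding[set_index] == self.threshold:
            self.sets_at_limit -= 1
        self.outstanding[set_index] -= 1
```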
In one embodiment of the present invention, the system additionally replays the access that caused the cache line to be retrieved.
In one embodiment of the present invention, the system increments the number of outstanding misses that are currently in process for the set by setting a prior miss bit that is associated with an entry for a specific set and way in the cache memory. This prior miss bit indicates that an outstanding miss is in process and will eventually fill the entry for the specific set and way. In a variation on this embodiment, the prior miss bit is stored along with a tag for the specific set and way, so that a tag lookup returns the prior miss bit.
In one embodiment of the present invention, the cache memory is a Level Two (L2) cache and the access is received from a Level One (L1) cache. In one embodiment of the present invention, receiving the access involves receiving the access from a queue located at the L2 cache, wherein the queue contains accesses generated by the L1 cache. In this embodiment, the system uses credit-based flow control to limit sending of accesses from the L1 cache into the queue, so that the queue does not overflow. In one embodiment of the present invention, the L2 cache receives accesses from a plurality of L1 caches.
In one embodiment of the present invention, the threshold value is less than a number of entries in the cache memory associated with each set. This effectively reserves one or more additional entries for each set to accommodate in-flight accesses that have been generated but not received at the cache memory.
BRIEF DESCRIPTION OF THE FIGURES
FIG. 1 illustrates a multiprocessor system in accordance with an embodiment of the present invention.
FIG. 2 illustrates in more detail the multiprocessor system illustrated in FIG. 1 in accordance with an embodiment of the present invention.
FIG. 3 illustrates the structure of an L2 bank in accordance with an embodiment of the present invention.
FIG. 4 illustrates status bits and a tag associated with an L2 cache entry in accordance with an embodiment of the present invention.
FIG. 5 illustrates an exemplary pattern of pending miss operations in accordance with an embodiment of the present invention.
FIG. 6 illustrates a state diagram for pending miss operations in accordance with an embodiment of the present invention.
FIG. 7 is a flow chart illustrating processing of a cache access in accordance with an embodiment of the present invention.
FIG. 8 is a flow chart illustrating processing of a cache line return in accordance with an embodiment of the present invention.
DETAILED DESCRIPTION
The following description is presented to enable any person skilled in the art to make and use the invention, and is provided in the context of a particular application and its requirements. Various modifications to the disclosed embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other embodiments and applications without departing from the spirit and scope of the present invention. Thus, the present invention is not intended to be limited to the embodiments shown, but is to be accorded the widest scope consistent with the principles and features disclosed herein.
Multiprocessor System
FIG. 1 illustrates a multiprocessor system 100 in accordance with an embodiment of the present invention. Note that most of multiprocessor system 100 is located within a single semiconductor chip 101. More specifically, semiconductor chip 101 includes a number of processors 110, 120, 130 and 140, which contain level one (L1) caches 112, 122, 132 and 142, respectively. Note that L1 caches 112, 122, 132 and 142 may be separate instruction and data caches, or alternatively, unified instruction/data caches. L1 caches 112, 122, 132 and 142 are coupled to level two (L2) cache 106. L2 cache 106 is in turn coupled to off-chip memory 102 through memory controller 104.
In one embodiment of the present invention, L1 caches 112, 122, 132 and 142 are write-through caches, which means that all updates to L1 caches 112, 122, 132 and 142 are automatically propagated to L2 cache 106. This simplifies the coherence protocol, because if processor 110 requires a data item that is present in L1 cache 112, processor 110 can receive the data item from L2 cache 106 without having to wait for L1 cache 112 to source the data item.
FIG. 2 illustrates in more detail the multiprocessor system illustrated in FIG. 1 in accordance with an embodiment of the present invention. In this embodiment, L2 cache 106 is implemented with four banks 202-205, which can be accessed in parallel by processors 110, 120, 130 and 140 through switches 215 and 216. Switch 215 handles communications that feed from processors 110, 120, 130 and 140 into L2 banks 202-205. Switch 216 handles communications in the reverse direction, from L2 banks 202-205 to processors 110, 120, 130 and 140.
Note that only two bits of the address are required to determine which of the four banks 202-205 a memory request is directed to. Also note that switch 215 additionally includes an I/O port 150 for receiving communications from I/O devices, and switch 216 includes an I/O port 152 for sending communications to I/O devices. Note that by using this "banked" architecture, it is possible to concurrently connect each L1 cache to its own bank of L2 cache, thereby increasing the bandwidth of L2 cache 106.
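As a small illustration of the two-bit bank selection, the Python sketch below picks a bank from an address. The text does not say which two address bits are used; taking the two bits just above the line offset, and a 64-byte line, are assumptions made here so that consecutive cache lines spread across the four banks. The function name `bank_for_address` is hypothetical.

```python
LINE_BYTES = 64   # bytes per cache line (assumption, not from the text)
NUM_BANKS = 4     # banks 202-205
OFFSET_BITS = LINE_BYTES.bit_length() - 1

def bank_for_address(addr):
    """Return a bank number 0..3 from two address bits (bit positions assumed)."""
    return (addr >> OFFSET_BITS) & (NUM_BANKS - 1)
```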
Furthermore, note that although the present invention is described in the context of a banked L2 cache, the present invention can be applied to any type of cache, and is not meant to be limited to a banked architecture.
L2 Bank
FIG. 3 illustrates the structure of an L2 bank 202 in accordance with an embodiment of the present invention. L2 bank 202 is a four-way set-associative cache, wherein there are four entries for each set 350. Note that each entry can be structured as a standard set-associative cache entry, including storage for a cache line as well as storage for tag and status bits. There are additionally four comparators to perform an associative lookup for each set. These standard cache structures, such as comparators, are known to those skilled in the art and are not illustrated in FIG. 3 in the interests of clarity.
Processors 110, 120, 130 and 140 generate accesses to L2 bank 202 as a result of cache misses that arise during accesses to L1 caches 112, 122, 132 and 142. Note that these accesses can include both read and write accesses. These accesses feed through switch 215 (illustrated in FIG. 2) into request queues 310, 320, 330 and 340, which are associated with L2 bank 202. Request queues 310, 320, 330 and 340 are used to store accesses to be processed by L2 bank 202. In one embodiment of the present invention, request queues 310, 320, 330 and 340 can be located at L2 bank 202. In another embodiment, request queues 310, 320, 330 and 340 are located within switch 215.
Request queues 310, 320, 330 and 340 can send associated flow control feedback signals 313, 323, 333 and 343 back to processors 110, 120, 130 and 140, respectively, in order to prevent processors 110, 120, 130 and 140 from sending additional accesses. This ensures that request queues 310, 320, 330 and 340 do not overflow.
In one embodiment of the present invention, the flow control mechanism that operates between request queues 310, 320, 330, 340 and processors 110, 120, 130 and 140 is credit-based. This means a given processor 110 is initially allocated a certain number of "credits" specifying a number of accesses the processor can send to L2 bank 202. Each time processor 110 sends a request to request queue 310, the number of credits is decremented. As requests in request queue 310 are processed, feedback signal 313 sends one or more credits back to processor 110. This causes the number of credits in processor 110 to increase. Without using a credit-based flow control system, room must be reserved in request queue 310 to accommodate all possible accesses that may be in-flight between processor 110 and request queue 310. If no additional room is available in request queue 310, a stall signal must be sent to processor 110. In contrast, by using a credit-based control system, processor 110 is able to keep sending accesses to request queue 310 so long as it has credits remaining, even if there is not enough room in request queue 310 to accommodate all possible in-flight transactions.
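The credit exchange described in this paragraph can be sketched in Python as follows. This is an illustrative model only: the names `CreditedSender` and `process_one`, and the use of a `deque` for the request queue, are hypothetical, and one credit is returned per processed request as a simplifying assumption (the text allows one or more).

```python
from collections import deque

class CreditedSender:
    """Hypothetical model of a processor holding credits for one request queue."""

    def __init__(self, credits):
        self.credits = credits   # initial allocation of credits

    def send(self, queue, access):
        """Send an access if a credit is available; return True on success."""
        if self.credits == 0:
            return False         # no credits left, so the sender must wait
        self.credits -= 1        # each request sent consumes one credit
        queue.append(access)
        return True

    def receive_credits(self, n=1):
        self.credits += n        # credits returned by the feedback signal

def process_one(queue, sender):
    """Process the oldest queued access and return one credit to the sender."""
    access = queue.popleft()
    sender.receive_credits(1)
    return access
```

Because the sender can never hold more credits than the queue has free slots, the queue cannot overflow in this model without needing a separate stall signal.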
Note that L2 bank 202 is also associated with a pending transaction buffer 360. Pending transaction buffer 360 keeps track of transactions (accesses) that have been stalled by cache misses. This allows these stalled transactions to be replayed when a desired cache line returns from memory. Note that pending transaction buffer 360 may specify the set and way location for each pending transaction.
During cache misses, L2 bank 202 makes accesses to memory 102 in order to retrieve desired cache lines. These accesses can be pipelined through memory buffer 362, which may be located at memory controller 104 (illustrated in FIG. 1). Note that if memory buffer 362 becomes too full, processors 110, 120, 130 and 140 may be halted through a mechanism that makes use of a feedback signal 363 between memory buffer 362 and L2 bank 202. Note that this mechanism may also be credit-based.
FIG. 4 illustrates status bits 400 and a tag 401 associated with an L2 cache entry in accordance with an embodiment of the present invention. Tag 401 includes higher order address bits that are used to perform an associative lookup. Status bits 400 include bits that indicate if the corresponding cache entry is dirty 408 and/or valid 406, as well as ownership bits 402 that specify an ownership state for a cache coherence protocol. For example, ownership states can be specified by the MOESI standard.
Status bits 400 also include a prior miss bit 404, which indicates that a miss transaction is in process for the entry. Note that during an associative lookup in the set, prior miss bits for all entries in a set are returned along with the tag information. This allows the system to add the prior miss bits together in order to determine how many cache entries for the set are associated with pending miss transactions.
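As a rough illustration of how the prior miss bits can be totaled during a lookup, the following Python sketch models one set as a list of entries. The `Entry` dataclass and the `lookup` function are hypothetical names introduced for this example; ownership bits and the cache line data are omitted for brevity.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """Simplified tag and status bits for one way of a set (ownership bits omitted)."""
    tag: int = 0
    valid: bool = False
    dirty: bool = False
    prior_miss: bool = False


def lookup(cache_set, tag):
    """Return (hit way or None, number of prior miss bits set in this set)."""
    hit_way = None
    for way, entry in enumerate(cache_set):
        if entry.valid and entry.tag == tag:
            hit_way = way
    # The prior miss bits come back with the tags, so they can simply be added up.
    pending = sum(entry.prior_miss for entry in cache_set)
    return hit_way, pending
```

For a four-way set, `pending` ranges from 0 to 4 and is available from the same lookup that produces the hit or miss result.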
Pending Miss Operations
FIG. 5 illustrates an exemplary pattern of pending miss operations in accordance with an embodiment of the present invention. This example illustrates a number of pending miss transactions for each of four sets, 510, 520, 530 and 540, in a four-way set-associative cache. (Of course, a more realistic set-associative cache has hundreds or thousands of sets.) In the example illustrated in FIG. 5, set 510 has two pending miss transactions, 511 and 512; set 520 has one pending miss transaction, 521; set 530 has two pending miss transactions, 531 and 532; and set 540 has two pending miss transactions, 541 and 542.
When a new miss 543 arrives for set 540, a stall is generated, even though one entry remains in set 540. A stall is generated at this time because there might be one more future miss 533 in the pipeline that will not be caught by the stall. Consequently, this future miss 533 may cause the last entry in set 540 to be filled. Alternatively, the future miss may cause set 530 to have three entries.
FIG. 6 illustrates a state diagram for a state machine that keeps track of pending miss operations in accordance with an embodiment of the present invention. The system starts out in no stall state 602. If the system encounters a miss, and the total number of prior miss bits associated with the set is greater than or equal to two, the miss creates a third outstanding miss for the set. Hence, the system enters stall1 state 604.
In stall1 state 604, if the system encounters a miss and the total number of prior miss bits associated with the set is greater than or equal to two, the miss may create a third outstanding miss for another set, or possibly a fourth outstanding miss for the set that caused the system to enter stall1 state 604. This causes the system to enter stall2 state 606, indicating that two pending misses must be cleared before the system can be unstalled.
In stall2 state 606, if a cache line is returned from memory 102 to L2 bank 202, and if the number of prior miss bits in the associated set is greater than or equal to three, the state machine returns to stall1 state 604.
Similarly, in stall1 state 604, if a cache line is returned from memory 102 to L2 bank 202, and if the number of prior miss bits in the associated set is greater than or equal to three, the state machine returns to no stall state 602.
During a miss operation, the prior miss bit is set after the prior miss bits are totaled to determine whether a state transition needs to take place. Note that it is also possible to set the prior miss bit before the bits are totaled. In this case, the number of prior miss bits must be greater than or equal to three (instead of two) to cause a state transition.
Similarly, during a cache line return, a prior miss bit is reset after the number of prior miss bits is totaled. Note that it is possible to reset the prior miss bit before the prior miss bits are totaled. In this case, the number of prior miss bits must be greater than or equal to two (instead of three) to cause a state transition.
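The three-state machine of FIG. 6 can be modeled with a small Python class. This is a minimal sketch under stated assumptions: the counts passed in are totaled before the prior miss bit is set or reset (the first ordering described above), the names `StallStateMachine`, `on_miss` and `on_line_return` are invented for the example, and saturating at the stall2 state when a further qualifying miss arrives is an assumption, since the description does not cover that case.

```python
NO_STALL, STALL1, STALL2 = 0, 1, 2  # states 602, 604 and 606


class StallStateMachine:
    def __init__(self):
        self.state = NO_STALL

    def on_miss(self, prior_miss_count):
        """Handle a miss; prior_miss_count is totaled before the new bit is set."""
        if prior_miss_count >= 2 and self.state < STALL2:
            self.state += 1  # 602 -> 604, or 604 -> 606
        return self.state

    def on_line_return(self, prior_miss_count):
        """Handle a returned line; the count is totaled before the bit is reset."""
        if prior_miss_count >= 3 and self.state > NO_STALL:
            self.state -= 1  # 606 -> 604, or 604 -> 602
        return self.state
```

If the prior miss bit were instead set or reset before totaling, the two thresholds would become three and two respectively, as noted above.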
Processing of a Cache Access
FIG. 7 is a flow chart illustrating processing of a cache access in accordance with an embodiment of the present invention. The system starts by receiving an access to L2 bank 202 from processor 110 (step 702). The system then performs an associative lookup in L2 bank 202 (step 704). As a result of this lookup, the system determines whether a cache miss occurs (step 706). If no cache miss occurs, the system performs the access (which can be a read or write operation) to a line in L2 bank 202 (step 708). If a cache miss occurs, the system determines if the number of prior miss bits for the set is greater than or equal to two (step 710). If not, the system sets a prior miss bit for a cache entry associated with the miss. The system also generates a cache miss by placing the transaction in pending transaction buffer 360 (illustrated in FIG. 3) and requesting a cache line from memory 102 (step 714). If the number of prior miss bits is greater than or equal to two, the system causes a transition in the state machine, either from state 602 to state 604, or from state 604 to state 606. If the transition is from state 602 to state 604, the system issues a stall request to processor 110, so that processor 110 does not generate additional accesses to L2 bank 202 (step 712). The system then sets a prior miss bit for a cache entry associated with the miss, places the transaction in pending transaction buffer 360, and requests a cache line from memory 102 (step 714).
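The flow of FIG. 7 might be sketched as follows. This is a simplified model rather than the disclosed hardware: the `Bank` class, its field names, the string return values, and the victim-selection rule (the first way with no pending miss) are assumptions for illustration. The sketch also does not model a second access to a line that is already in flight, and it relies on the stall to keep a set from needing more than four pending ways.

```python
WAYS = 4
THRESHOLD = 2  # prior miss bits are totaled before the new bit is set


class Bank:
    def __init__(self, num_sets):
        self.sets = [
            [{"tag": None, "prior_miss": False} for _ in range(WAYS)]
            for _ in range(num_sets)
        ]
        self.stall_level = 0       # 0 = no stall (602), 1 = stall1 (604), 2 = stall2 (606)
        self.pending = []          # pending transaction buffer: (set, way, tag)
        self.memory_requests = []  # cache lines requested from memory

    def access(self, set_index, tag):
        entries = self.sets[set_index]
        # Steps 704/706: associative lookup within the set.
        if any(entry["tag"] == tag for entry in entries):
            return "hit"  # step 708: perform the read or write
        # Step 710: total the prior miss bits for this set only.
        count = sum(entry["prior_miss"] for entry in entries)
        stall_requested = False
        if count >= THRESHOLD:
            if self.stall_level == 0:
                stall_requested = True  # step 712: stall only on 602 -> 604
            self.stall_level = min(self.stall_level + 1, 2)
        # Step 714: mark an entry, record the transaction, request the line.
        # Assumed policy: take the first way that has no pending miss.
        way = next(w for w, entry in enumerate(entries) if not entry["prior_miss"])
        entries[way]["prior_miss"] = True
        entries[way]["tag"] = None  # the old line in this way is no longer usable
        self.pending.append((set_index, way, tag))
        self.memory_requests.append(tag)
        return "miss+stall" if stall_requested else "miss"
```

With a fresh bank, the third miss to the same set returns `"miss+stall"`, and a fourth miss that was already in flight is still accepted and moves the model to the stall2 level.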
Processing of a Cache Line Return
FIG. 8 is a flow chart illustrating processing of a cache line return in accordance with an embodiment of the present invention. The system first receives a cache line from memory 102 in response to a pending miss transaction (step 802). Next, the system identifies the set and way location in L2 bank 202 for the returned cache line (step 804). The system then installs the returned cache line into the set and way location in L2 bank 202 (step 806).
The system also determines if the number of prior miss bits for the set is greater than or equal to three (step 808). If not, the system unsets the prior miss bit for the set and way location that the cache line was installed in and replays any pending transactions for the cache line from pending transaction buffer 360 (step 812).
If the number of prior miss bits for the set is greater than or equal to three, the system causes a transition in the state machine illustrated in FIG. 6. This transition can either be from state 606 to state 604, or from state 604 to state 602. If the transition is from state 604 to state 602, the system removes the stall condition (step 810). The system then unsets the prior miss bit for the set and way location and replays any pending transactions for the cache line from pending transaction buffer 360 (step 812). Note that in performing the above-described operations, the system only has to examine prior miss bits for a single set. It does not have to examine prior miss bits for other sets. This makes the process of examining prior miss bits a purely local operation, which greatly decreases the complexity of the resulting circuit.
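The cache line return path of FIG. 8 can be sketched in the same spirit. The function name `handle_line_return`, the dictionary-based entries, and the representation of the pending transaction buffer as a list of `{"way", "tag", "access"}` records for a single set are all assumptions made for this example; the stall level is encoded as 0 for no stall, 1 for stall1 and 2 for stall2.

```python
def handle_line_return(cache_set, way, tag, stall_level, pending):
    """Install a returned line and return (new stall level, replayed accesses).

    cache_set: list of entries, each a dict with 'tag' and 'prior_miss'
    pending:   list of dicts with 'way', 'tag' and 'access' for this set
    """
    entry = cache_set[way]
    entry["tag"] = tag  # step 806: install the line at its set and way location
    # Step 808: total the prior miss bits for this set before resetting the bit.
    count = sum(e["prior_miss"] for e in cache_set)
    if count >= 3 and stall_level > 0:
        # 606 -> 604, or 604 -> 602; reaching 0 removes the stall (step 810).
        stall_level -= 1
    entry["prior_miss"] = False  # step 812: unset the prior miss bit
    # Step 812: replay the transactions that were waiting on this line.
    replayed = [p["access"] for p in pending if p["way"] == way and p["tag"] == tag]
    pending[:] = [p for p in pending if not (p["way"] == way and p["tag"] == tag)]
    return stall_level, replayed
```

Only the entries of the affected set are examined, which mirrors the point above that examining prior miss bits is a purely local operation.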
The foregoing descriptions of embodiments of the present invention have been presented for purposes of illustration and description only. They are not intended to be exhaustive or to limit the present invention to the forms disclosed. Accordingly, many modifications and variations will be apparent to practitioners skilled in the art.
Additionally, the above disclosure is not intended to limit the present invention. The scope of the present invention is defined by the appended claims.

Claims

What Is Claimed Is:
1. A method for facilitating flow control at a cache memory in order to support pipelined accesses to the cache memory, comprising: receiving an access to the cache memory; wherein the cache memory is set-associative; if the access generates a miss in the cache memory, identifying a set that the access is directed to, incrementing a number of outstanding misses that are currently in process for the set, and if the number of outstanding misses is greater than or equal to a threshold value, stalling generation of subsequent accesses to the cache memory until the number of outstanding misses for each set in the cache memory falls below the threshold value.
2. The method of claim 1, further comprising: receiving a cache line from a memory subsystem in response to an outstanding miss; identifying a set that the outstanding miss is directed to; installing the cache line in an entry for the set in the cache memory; decrementing a number of outstanding misses that are currently in process for the set; if the number of outstanding misses will fall below the threshold value as a result of decrementing, and if no other set has a number of outstanding misses that is greater than or equal to the threshold value, removing the stall condition so that subsequent accesses can be generated for the cache memory.
3. The method of claim 2, wherein the method further comprises determining whether to remove the stall condition by examining a state machine; wherein the state machine keeps track of a number of outstanding misses that cause sets in the cache memory to meet or exceed the threshold value.
4. The method of claim 2, wherein the method further comprises replaying the access that caused the cache line to be retrieved.
5. The method of claim 1, wherein incrementing the number of outstanding misses that are currently in process for the set involves setting a prior miss bit that is associated with an entry for a specific set and way in the cache memory; and wherein the prior miss bit indicates that an outstanding miss is in process and will eventually fill the entry for the specific set and way.
6. The method of claim 5, wherein the prior miss bit is stored along with a tag for the specific set and way, so that a tag lookup returns the prior miss bit.
7. The method of claim 1, wherein the cache memory is a Level Two (L2) cache and the access is received from a Level One (L1) cache.
8. The method of claim 7, wherein receiving the access involves receiving the access from a queue located at the L2 cache, wherein the queue contains accesses generated by the L1 cache; and wherein the method additionally comprises using credit-based flow control to limit sending of accesses from the L1 cache into the queue, so that the queue does not overflow.
9. The method of claim 7, wherein the L2 cache receives accesses from a plurality of L1 caches.
10. The method of claim 1, wherein the threshold value is less than a number of entries in the cache memory associated with each set, thereby reserving one or more additional entries for each set to accommodate in-flight accesses that have been generated but not received at the cache memory.
11. An apparatus that facilitates flow control at a cache memory in order to support pipelined accesses to the cache memory, comprising: the cache memory, wherein the cache memory is set-associative; wherein the cache memory is configured to receive an access; a stalling mechanism within the cache memory, wherein if the access generates a miss in the cache memory, the stalling mechanism is configured to, identify a set that the access is directed to, increment a number of outstanding misses that are currently in process for the set, and if the number of outstanding misses is greater than or equal to a threshold value, to stall generation of subsequent accesses to the cache memory until the number of outstanding misses for each set in the cache memory falls below the threshold value.
12. The apparatus of claim 11, further comprising a stall removal mechanism, wherein upon receiving a cache line from a memory subsystem in response to an outstanding miss, the stall removal mechanism is configured to: identify a set that the outstanding miss is directed to; decrement a number of outstanding misses that are currently in process for the set; and if the number of outstanding misses will fall below the threshold value as a result of decrementing, and if no other set has a number of outstanding misses that is greater than or equal to the threshold value, the stall removal mechanism is configured to remove the stall condition so that subsequent accesses can be generated for the cache memory.
13. The apparatus of claim 12, further comprising a state machine that the stall removal mechanism examines to determine whether to remove the stall condition; wherein the state machine keeps track of a number of outstanding misses that cause sets in the cache memory to meet or exceed the threshold value.
14. The apparatus of claim 12, further comprising a replay mechanism that is configured to replay the access that caused the cache line to be retrieved.
15. The apparatus of claim 11, wherein the stalling mechanism is configured to increment the number of outstanding misses that are currently in process for the set by setting a prior miss bit that is associated with an entry for a specific set and way in the cache memory; and wherein the prior miss bit indicates that an outstanding miss is in process and will eventually fill the entry for the specific set and way.
16. The apparatus of claim 15, wherein the prior miss bit is stored along with a tag for the specific set and way, so that a tag lookup returns the prior miss bit.
17. The apparatus of claim 11, wherein the cache memory is a Level Two (L2) cache that is configured to receive the access from a Level One (L1) cache.
18. The apparatus of claim 17, further comprising: a queue located at the L2 cache that is configured to receive accesses from the L1 cache; and a flow control mechanism that uses credit-based flow control to limit sending of accesses from the L1 cache into the queue, so that the queue does not overflow.
19. The apparatus of claim 17, wherein the L2 cache is configured to receive accesses from a plurality of L1 caches.
20. The apparatus of claim 11, wherein the threshold value is less than a number of entries in the cache memory associated with each set, thereby reserving one or more additional entries for each set to accommodate in-flight accesses that have been generated but not received at the cache memory.
21. An apparatus that facilitates flow control at a Level Two (L2) cache in order to support pipelined accesses to the L2 cache, comprising: the L2 cache, wherein the L2 cache is set-associative; wherein the L2 cache is configured to receive an access from an L1 cache; a stalling mechanism within the L2 cache, wherein if the access generates a miss in the L2 cache, the stalling mechanism is configured to, identify a set that the access is directed to, increment a number of outstanding misses that are currently in process for the set, and if the number of outstanding misses is greater than or equal to a threshold value, to stall generation of subsequent accesses to the L2 cache until the number of outstanding misses for each set in the L2 cache falls below the threshold value; a stall removal mechanism, wherein upon receiving a cache line from a memory subsystem in response to an outstanding miss, the stall removal mechanism is configured to, identify a set that the outstanding miss is directed to, decrement a number of outstanding misses that are currently in process for the set, and if the number of outstanding misses will fall below the threshold value as a result of decrementing, and if no other set has a number of outstanding misses that is greater than or equal to the threshold value, the stall removal mechanism is configured to remove the stall condition so that subsequent accesses can be generated for the L2 cache; and a replay mechanism that is configured to replay an access that caused the cache line to be retrieved.
22. A computer system that facilitates flow control at a cache memory in order to support pipelined accesses to the cache memory, comprising: a processor; the cache memory; wherein the cache memory is set-associative and is configured to receive an access; a stalling mechanism within the cache memory, wherein if the access generates a miss in the cache memory, the stalling mechanism is configured to, identify a set that the access is directed to, increment a number of outstanding misses that are currently in process for the set, and if the number of outstanding misses is greater than or equal to a threshold value, to stall generation of subsequent accesses to the cache memory until the number of outstanding misses for each set in the cache memory falls below the threshold value.
PCT/US2002/017620 2001-06-06 2002-06-04 Method and apparatus for facilitating flow control during accesses to cache memory WO2002099652A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US29655301P 2001-06-06 2001-06-06
US60/296,553 2001-06-06

Publications (1)

Publication Number Publication Date
WO2002099652A1 (en) 2002-12-12

Family

ID=23142498

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2002/017620 WO2002099652A1 (en) 2001-06-06 2002-06-04 Method and apparatus for facilitating flow control during accesses to cache memory

Country Status (2)

Country Link
US (1) US6754775B2 (en)
WO (1) WO2002099652A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5526510A (en) * 1994-02-28 1996-06-11 Intel Corporation Method and apparatus for implementing a single clock cycle line replacement in a data cache unit
US6226713B1 (en) * 1998-01-21 2001-05-01 Sun Microsystems, Inc. Apparatus and method for queueing structures in a multi-level non-blocking cache subsystem

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6085294A (en) * 1997-10-24 2000-07-04 Compaq Computer Corporation Distributed data dependency stall mechanism
US6148372A (en) * 1998-01-21 2000-11-14 Sun Microsystems, Inc. Apparatus and method for detection and recovery from structural stalls in a multi-level non-blocking cache system
US6594701B1 (en) * 1998-08-04 2003-07-15 Microsoft Corporation Credit-based methods and systems for controlling data flow between a sender and a receiver with reduced copying of data
JP3438650B2 (en) * 1999-05-26 2003-08-18 日本電気株式会社 Cache memory

Also Published As

Publication number Publication date
US6754775B2 (en) 2004-06-22
US20020188807A1 (en) 2002-12-12

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ OM PH PL PT RO RU SD SE SG SI SK SL TJ TM TN TR TT TZ UA UG UZ VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: JP

WWW Wipo information: withdrawn in national office

Country of ref document: JP