WO2017044070A1 - Multicasting in shared non-volatile memory

Multicasting in shared non-volatile memory

Info

Publication number
WO2017044070A1
Authority
WO
WIPO (PCT)
Prior art keywords
payload
region
receiving
nvm
multicast
Application number
PCT/US2015/048866
Other languages
French (fr)
Inventor
Charles Johnson
Harumi Kuno
Tuan Tran
Original Assignee
Hewlett Packard Enterprise Development Lp
Application filed by Hewlett Packard Enterprise Development Lp filed Critical Hewlett Packard Enterprise Development Lp
Priority to PCT/US2015/048866 priority Critical patent/WO2017044070A1/en
Publication of WO2017044070A1 publication Critical patent/WO2017044070A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 12/0238 Memory management in non-volatile memory, e.g. resistive RAM or ferroelectric memory
    • G06F 11/0727 Error or fault processing not based on redundancy, the processing taking place in a storage system, e.g. in a DASD or network based storage system
    • G06F 11/0751 Error or fault detection not based on redundancy
    • G06F 11/0772 Means for error signaling, e.g. using interrupts, exception flags, dedicated error registers
    • G06F 11/0793 Remedial or corrective actions
    • G06F 11/2028 Failover techniques eliminating a faulty processor or activating a spare
    • G06F 11/2038 Failover techniques with a single idle spare processing component
    • G06F 11/2046 Failover techniques where the redundant components share persistent storage
    • G06F 12/0684 Configuration or reconfiguration with feedback, e.g. presence or absence of unit detected by addressing, overflow detection
    • G06F 12/0692 Multiconfiguration, e.g. local and global addressing
    • G06F 2201/835 Timestamp (indexing scheme relating to error detection, error correction and monitoring)
    • G06F 2212/1032 Reliability improvement, data loss prevention, degraded operation etc. (indexing scheme relating to accessing, addressing or allocation within memory systems or architectures)

Definitions

  • NVM: non-volatile memory
  • DRAM: dynamic random access memory
  • the node at the sending region 210 determines that a failure of receipt of the acknowledgments of the node in the fifth receiving region 215 has occurred.
  • the example process 200 proceeds to block 208 where the node in the sending region 210 unicasts the payload to each of the receiving regions for which failure of receipt of an acknowledgment was determined at decision block 206, as indicated by the arrow 280 in Figure 5.
  • Upon unicasting the payload at block 208, the process 200 returns to block 204 to monitor for an acknowledgment for the newly sent unicast payload. If the acknowledgment is received, the process 200 ends.
  • the node in the sending region 210 may store an indication of the error indicated by the NAK message 270 in the vector source list or mailbox 140 of the node in the sending region 210, which may result in an immediate wakeup or a collective multicast wakeup for the multicast at some point later.
  • repair can be attempted by another unicast attempt to store the payload in the vector or mailbox of the node in the receiving region 215-5, for every failed leg of the multicast.
  • the example process 200 allows special multicast addresses to be used cheaply and reliably, because every failed leg may be followed up by a unicast repair, which may be repeated as many times as it is desired to retry (e.g., a predetermined number of retries). This provides a significant potential performance improvement when compared to large numbers of duplicate single store operations across an NVM fabric or network.
  • Figure 6 illustrates an example of buffered or scattered multicast of data payloads to a multicast set of receiving region addresses in shared non-volatile memory of the shared non-volatile memory multicast apparatus 100.
  • the multicast payloads 250-1, 250-2, 250-3, 250-4 and 250-5 illustrated in Figure 6 are similar to the multicast payloads 250 illustrated in Figure 3, with the addition of buffered or scattered payloads, as indicated by first, second, third, fourth and fifth large arrows 290-1, 290-2, 290-3, 290-4 and 290-5.
  • These buffered or scattered payloads 290 may be multicast with the multicast payloads 250 that were stored directly in the vector source lists or mailboxes 240 of the receiving regions 215 at block 204 of the process 200 in Figure 2.
  • the buffered or scattered payloads 290 of Figure 6 may be used for data sizes that are too large to fit in the vector source lists or mailboxes 240 of the receiving regions 215 of the shared NVM address space, for example.
  • the multicast set of target addresses may be located anywhere in the shared NVM address space of the receiving regions 215, and the underlying software interface may identify the vectors or mailboxes to notify the monitoring nodes in the sending region 210 that an individual multicast store has been completed in that particular receiving region 215 of shared NVM addresses.
  • multiple buffers 290 may be acknowledged with a single unicast stored acknowledgement and a single wakeup at the node in the sending region 210.
  • the scattered or buffered payload acknowledgement logic may be either passive or active.
  • the unicasting of the payload again at block 208 of the process 200 in Figure 2 may remain unchanged.
  • Figure 7 illustrates an example of three-way mirroring of multicast data payloads sent to three receiving regions of shared non-volatile memory.
  • the multicast payloads 250-1, 250-4 and 250-5 illustrated in the example of Figure 7 are similar to the multicast payloads 250-1, 250-4 and 250-5 illustrated in Figures 3 and 6.
  • the mirroring operation illustrated in Figure 7 involves storing of copies 300-1, 300-2 and 300-3 of the multicast payloads 250-1, 250-4 and 250-5, respectively, where the copies 300 are directed to separate physical NVM components in the receiving regions 215-1, 215-4 and 215-5.
  • the acknowledgement logic, either passive or active, successful or with a unicast retry on failure, for the mirroring operations may be the same as that described above with reference to the example process 200 of Figure 2 and the buffered or scattered multicast depicted in Figure 6. Active mirroring allows for automatic unicast error retry, which may be transparent to the layers above.
  • passive multicasts can be employed, which depend completely on the NVM store operational integrity of the NVM fabric or network. Alternatively, they can fall back on majority consensus and the Thomas Write Rule to handle concurrency and consistency issues on reads and writes. In this regard, some number of reads of different copies should be done, and timestamps should be included in the metadata on each write. Since there is a single writer of a mirroring software entity and not multiple writers involved across nodes, no clock synchronization is necessary.
  • In the event of a failover, a death message may be sent by the dying active (e.g., primary) node to the standby (e.g., backup) node out of the dying node's non-maskable interrupt (as in OpenSAF's use of the OpenHPI kernel interface on Linux systems). Receipt of that death message from the dying active node, or the timeout that backs up that death message with heartbeat logic, could reset the Lamport clock in the standby node, such that the Thomas Write Rule (the most recent timestamp wins ties) would still apply.
  • Figure 8 illustrates an example of mirrored 5-way replication of data payloads sent in an active ricochet multicast to one receiving/sending, or ricochet, region 410, and four receiving regions 415-1, 415-2, 415-3 and 415-4 of shared non-volatile memory in the shared NVM multicast apparatus 100 from a remote non-volatile memory region 400.
  • Ricochets allow active and passive multicast operations to be ganged together at a low level, such that the successful completion wakeup, interrupt or ringing of the doorbell from one operation kicks off the next one at a low level, potentially in the interrupt driver handling the interrupt that causes the ricochet, or in the kernel, without dispatching a user mode thread, with hundreds of times less performance impact (a sketch of this chaining appears after this list).
  • a node in a different rack may include the remote NVM region 400 and may replicate a chunk of data for safety, as well as for fan-out performance for remote readers (such as a live video system).
  • a first unicast payload 430 is sent from the remote NVM region 400 to the ricochet region 410.
  • the ricochet region 410 replicates a copy 440 and stores it in the ricochet region 410.
  • the ricochet region 410 further copies the first unicast payload 430 to four multicast payloads 450-1, 450-2, 450-3 and 450-4 to first, second, third and fourth receiving regions 415-1, 415-2, 415-3 and 415-4, respectively.
  • Replicated copies 460-1, 460-2, 460-3 and 460-4 of the multicast payloads 450-1, 450-2, 450-3 and 450-4 are stored in the first, second, third and fourth receiving regions 415-1, 415-2, 415-3 and 415-4, respectively.
  • Figure 9 illustrates an example of acknowledgments sent in response to successful mirrored 5-way replication of data payloads sent in the example active ricochet multicast of Figure 8.
  • First, second, third and fourth acknowledgments 470-1, 470-2, 470-3 and 470-4 are sent from the first, second, third and fourth receiving regions 415-1, 415-2, 415-3 and 415-4 that successfully received the ricocheted copies 450 and 460 from the ricochet node 410.
  • Successful arrival of the first, second, third and fourth acknowledgements 470 for all legs of the multicast causes the ricochet region 410 to send a final acknowledgement 480, acknowledging the successful reception of the first unicast payload 430 and successful replication of the replicated copy 440, thereby allowing higher-level software that invoked the replication operation in the first place to proceed consistently to the next buffer of data to be replicated.
  • Negative acknowledgements may be handled in a similar fashion as described above in reference to the example process 200 of Figure 2.
  • Figure 10 illustrates an example of active ricochet multicast of data payloads and buffered payloads sent to a first set of two receiving regions 515-1 and 515-2 and each passively mirrored to a set of two mirror receiving regions 520-1, 520-2, 525-1 and 525-2.
  • the first receiving region 515-1 receives a first multicast payload 550-1 and passively mirrors first and second copies 590-1 and 590-2 to first mirror receiving regions 520-1 and 520-2.
  • the second receiving region 515-2 receives a second multicast payload 550-2 and passively mirrors first and second copies 570-1 and 570-2 to second mirror receiving regions 525-1 and 525-2.
  • each mirrored copy 590-1, 590-2, 570-1 and 570-2 is replicated by respective copies 595-1, 595-2, 580-1 and 580-2 in the respective regions 520-1, 520-2, 525-1 and 525-2.
  • each of the first multicast payloads 550-1 and 550-2 is replicated by respective copies 560-1 and 560-2 in respective regions 515-1 and 515-2.
  • each of the first multicast payloads 550-1 and 550-2 is actively multicast and acknowledged via first and second acknowledgments 535-1 and 535-2.
  • the replication counts, or amplification, of the first multicast regions 515, the first mirror receiving regions 520 and the second mirror receiving regions 525 could be of greater number, or the copies could be placed at different locations within the same region (one monitoring node), etc.
  • Figure 11 illustrates a block diagram of an example system with a computer-readable storage medium including example instructions executable by a processor to multicast a payload to a plurality of regions in a shared NVM.
  • the system 600 includes the processor 610 and the computer-readable storage medium 620.
  • the computer-readable storage medium 620 includes example instructions 621-623 executable by the processor 610 to perform various functionalities described herein.
  • the example instructions include multicast payload to plurality of receiving regions instructions 621.
  • the instructions 621 cause the processor 610 to multicast a payload from a node of a sending region to a plurality of receiving regions of the shared NVM.
  • the example instructions 622 cause the processor 610 to determine a failure of receipt of the payload by at least one receiving region.
  • the failure of receipt may be determined by, for example, failure to receive an acknowledgement from a receiving region or receiving a negative acknowledgement from the receiving region.
  • the example instructions further include unicast payload to each region for which failure of receipt is determined instructions 623.
  • the example instructions 623 cause the processor 610 to unicast the payload to each region for which, for example, a negative acknowledgement is received.
  • the instructions for determining failure of receipt 622 and unicasting the payload 623 may be repeated a predetermined number of times as necessary.
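
Relating to the ricochet multicast of Figures 8 and 9 above, the following C sketch shows how a unicast arrival could be chained into a local replica plus an onward multicast, with the final acknowledgement to the original sender posted only after every onward leg has acknowledged. All names and the bitmask bookkeeping are assumptions introduced here; the point is only the low-level chaining of one completion into the next operation.

    #include <stddef.h>
    #include <stdint.h>
    #include <stdbool.h>

    /* Assumed fabric primitives, as in the sketches above. */
    extern int  nvm_store(uint64_t global_addr, const void *src, size_t len);
    extern void send_ack(uint64_t origin_mailbox_addr, bool success);

    struct ricochet_state {
        uint64_t        origin_mailbox;   /* where the final ack must go          */
        uint64_t        local_copy_addr;  /* where the local replica is stored    */
        const uint64_t *onward;           /* onward multicast target mailboxes    */
        size_t          n_onward;         /* assumed to be fewer than 32 legs     */
        uint32_t        pending_mask;     /* onward legs still awaiting their ack */
    };

    /* Called from the doorbell/interrupt path when the unicast payload arrives:
     * keep a local replica, then fan the payload out to the onward regions. */
    void ricochet_on_arrival(struct ricochet_state *st, const void *payload, size_t len)
    {
        nvm_store(st->local_copy_addr, payload, len);
        st->pending_mask = (1u << st->n_onward) - 1;
        for (size_t i = 0; i < st->n_onward; i++)
            nvm_store(st->onward[i], payload, len);
    }

    /* Called as each onward acknowledgement arrives; once all legs are in,
     * acknowledge the whole replication back to the original sending region. */
    void ricochet_on_ack(struct ricochet_state *st, unsigned leg)
    {
        st->pending_mask &= ~(1u << leg);
        if (st->pending_mask == 0)
            send_ack(st->origin_mailbox, true);
    }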

Abstract

An example device comprises a shared non-volatile memory (NVM) comprising a plurality of regions; a plurality of receiving nodes in receiving regions of the shared NVM; and a sending node in a sending region of the shared NVM. The sending node is to multicast a payload to the plurality of receiving regions of the shared NVM; monitor for an acknowledgment from each of the receiving regions; determine a failure of receipt of the acknowledgment from at least one of the receiving nodes in at least one region of the plurality of receiving regions; and unicast the payload to each of the at least one region from which failure of receipt of the acknowledgment was determined.

Description

MULTICASTING IN SHARED NON-VOLATILE MEMORY
BACKGROUND
[0001] A new era of computing is emerging involving systems that support large amounts of shared non-volatile memory (NVM) that are directly addressable by instructions from a fleet of large multi-core and many-core processors. As component counts in these systems climb in number, node and cluster software and hardware are bound to encounter halts, panics, crashes, hangs, protocol failures and other faults that cause many kinds of failures and unscheduled outages.
BRIEF DESCRIPTION OF THE DRAWINGS
[0002] For a more complete understanding of various examples, reference is now made to the following description taken in connection with the accompanying drawings in which:
[0003] Figure 1 illustrates an example shared non-volatile memory multicast apparatus;
[0004] Figure 2 illustrates an example process of multicasting data payloads in non-volatile memory of the example apparatus of Figure 1;
[0005] Figure 3 illustrates an example of data payloads multicast by a node in a sending region of shared non-volatile memory to a plurality of nodes in receiving regions of the shared non-volatile memory;
[0006] Figure 4 illustrates an example of acknowledgement messages transmitted by the plurality of nodes in the receiving regions of shared non-volatile memory to the node in the sending region of non-volatile memory;
[0007] Figure 5 illustrates an example of unicast retransmission of the data payload sent in response to determining a failure of one of the nodes in the shared non-volatile memory regions to receive the multicast data payload;
[0008] Figure 6 illustrates an example of buffered or scattered multicast of data payloads to a multicast set of receiving region addresses in shared non-volatile memory;
[0009] Figure 7 illustrates an example of 3-way mirroring of multicast data payload sent to three receiving regions of shared non-volatile memory;
[0010] Figure 8 illustrates an example of mirrored 5-way replication of data payloads sent in an active ricochet multicast to five receiving regions of shared non-volatile memory from a remote non-volatile memory region;
[0011] Figure 9 illustrates an example of acknowledgments sent in response to successful mirrored 5-way replication of data payloads sent in the active ricochet multicast of Figure 8;
[0012] Figure 10 illustrates an example of active ricochet multicast of data payloads sent to a first set of two receiving regions and passively mirrored to a set of 3 other receiving regions of shared non-volatile memory; and
[0013] Figure 11 illustrates a block diagram of an example system with a computer-readable storage medium including instructions executable by a processor to multicast payload in a nonvolatile memory.
DETAILED DESCRIPTION
[0014] Various examples described herein provide a reliable shared non-volatile memory (NVM). Data payloads in large shared NVMs may be replicated and distributed in a reliable manner. A payload in one region of the shared NVM may be multicast to a plurality of other regions by a node in the sending region. Each region is provided with at least a primary node. In various examples, each region is provided with a primary (or active) node and a backup (or standby) node. In the event the primary node becomes inactive or unavailable, the backup node takes over all processes of the primary node. After multicasting the payload, the sending node receives unicast acknowledgments from a node in each receiving region. If an acknowledgment is not received from one or more of the regions to which the payload was multicast, the sending node retries sending the payload to those regions in unicast messages. This may be attempted a limited number of times. The sending node may use any of a variety of types of addressing for storage of the payload at the regions to which the payload is multicast.
[0015] Systems and methods described herein provide reliable methods for the sending of small and large data into addressable locations of NVM (and multicast sets of addressable locations) across distributed nodes sharing vast regions of non-volatile memory. For these systems to maintain some form of single system image and integrity, high availability for those components and subsystems which confer that single system image reliably to middleware and applications is desirable. In availability work, lower probability of failure and higher reliability can be obtained by bringing the repair time down to near zero rather than by eradicating different kinds of single and multiple failures and outages, which can be elusively out of a system's control. The following equations may be used to estimate system reliability for systems described herein. These equations use the following variables: A = availability, MTBF = mean time before failure, MTR = mean time to repair, F = probability of failure, and R = reliability.
A = MTBF / (MTBF + MTR) ≈ 1 - (MTR / MTBF), for MTBF » MTR (1)
F = 1 - A ≈ MTR / MTBF (2)
R = 1 / (1 - A) ≈ MTBF / MTR (3)
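As a rough worked example with illustrative numbers (not taken from the source): with MTBF = 10,000 hours and MTR = 1 hour, equation (1) gives A ≈ 0.9999, equation (2) gives F ≈ 0.0001, and equation (3) gives R ≈ 10,000. Cutting the repair time to 0.1 hours improves each figure by roughly a factor of ten, which is the sense in which driving repair time toward zero dominates availability.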
[0016] Various systems and methods described herein may improve or maximize availability A, reduce or minimize probability of failure F, and improve or maximize reliability R for computing nodes spread across a fabric of shared NVM address spaces, thereby providing reliable methods for sending (e.g., multicasting and/or unicasting) of small and large data into addressable locations and multicasting to sets of a plurality of addressable locations.
[0017] Further, those reliable multicast and unicast operations may be combined to make other methods for reliable replication and mirroring of data on large pools of shared NVM, examples of which are also disclosed herein. In addition, example methods are described that are able to deal with cases where endurance measures are available and the greater or lesser robustness in the distributed NVM methods may be indicated by these endurance measures.
[0018] Messaging systems mask failures by the method of automatic or transparent retry on the sending side, coupled with idempotent removal of duplicates on the receiving side.
However, messaging systems focus on message exchanges, and they do not consider the reliable replication and mirroring of data on a fabric of shared NVM. Example systems and methods described herein extend the techniques of transparent retry and idempotent removal of duplicates to a shared NVM fabric's load/store operations by combining those methods with various example methods of fault-tolerant execution (such as quorums or process pairs), thereby managing the state of a distributed NVM storage system in an atomic, consistent, and durable manner to ensure reliable and transparent execution of the combined operations of the shared NVM storage system.
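The following sketch illustrates, in C, one way the combination of sender-side retry and receiver-side idempotent duplicate removal could look for mailbox store operations. It is a minimal illustration under stated assumptions, not the implementation described by the source: the structure layout, the fixed-size mailbox and the per-sender sequence number are all introduced here for clarity.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define MAX_SENDERS 64   /* assumed upper bound on source regions */

    /* Hypothetical receiver-side state: the last sequence number applied from
     * each sender, kept alongside the mailbox contents themselves. */
    struct mailbox {
        uint64_t last_seq[MAX_SENDERS];
        uint8_t  data[4096];
    };

    /* Apply a store only if it has not been seen before (idempotent receive).
     * A duplicate caused by a sender-side retry is still reported as success,
     * so a successful store remains a success across retries. */
    bool mailbox_store(struct mailbox *mb, unsigned sender, uint64_t seq,
                       const void *payload, size_t len)
    {
        if (sender >= MAX_SENDERS || len > sizeof(mb->data))
            return false;                  /* reject malformed stores       */
        if (seq <= mb->last_seq[sender])
            return true;                   /* duplicate: already applied    */
        memcpy(mb->data, payload, len);    /* apply the new payload         */
        mb->last_seq[sender] = seq;        /* remember it for deduplication */
        return true;
    }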
[0019] Referring now to the figures, Figure 1 illustrates an example shared non-volatile memory multicast apparatus 100. The apparatus 100 includes a shared NVM address space sprawling across a number of NVM regions 110 with monitoring nodes. Each NVM region, in this example, includes an active node 120 and a standby node 130, where each active node 120 and standby node 130 has a globally known address slice or set of slices of the shared NVM address space. Having a globally known address slice or set of slices allows hardware or software methods to wake up the active node 120 and alert it to the arrival of some stored data sent from a particular source node into a globally known stored vector or mailbox address in source vector lists or mailboxes 140.
[0020] In one example, the source vector lists or mailboxes 140 in each NVM region 110 include storage space 145 for the source vectors or mailboxes for all the globally known address slice or slices for each of the NVM regions 110. The wakeup could arrive by a hardware interrupt, or a thread dispatch or any other method that rings the doorbell of the active node 120 and allows the active node 120 to know that some number and identity of NVM storage operations have occurred in its NVM region of interest. For a receiving node, the source vector or mailbox storage 145 could include separate vectors or mailboxes for each source node, or one hardware operated queue of messages from different source nodes, with source addresses in headers, for example.
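As a data-structure sketch, one possible per-region layout is shown below in C, under the assumption (not stated in the source) that each region keeps one source-vector slot per known source region; all field names and sizes are illustrative.

    #include <stdint.h>

    #define MAX_REGIONS 128   /* assumed fabric-wide region count          */
    #define SLOT_BYTES  256   /* assumed size of a per-sender mailbox slot */

    /* One source-vector slot: where a store from a particular sender lands. */
    struct source_slot {
        uint64_t sender_region;          /* which region wrote this slot    */
        uint64_t seq;                    /* store sequence, for idempotence */
        uint8_t  payload[SLOT_BYTES];    /* small payload or a descriptor   */
    };

    /* Per-region monitoring state: an active node and a standby node own a
     * globally known slice of the shared NVM address space. */
    struct nvm_region {
        uint64_t           global_base;          /* base of this region's address slice */
        uint32_t           active_node_id;
        uint32_t           standby_node_id;
        struct source_slot mailbox[MAX_REGIONS]; /* one slot per source region          */
    };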
[0021] There are various ways to divide up responsibility for sections of the global NVM address space, spanning many NVM regions and NVM failure domains for the shared NVM apparatus 100. In the example shared NVM apparatus 100, the active node 120 and the standby node 130, alternatively referred to as the primary node and the backup node, respectively, (or a quorum of more than two nodes) are responsible for each NVM region 110 such that a live node is most likely to be available to respond to active operations needing immediate acknowledgement.
[0022] Passive multicast and unicast operations do not necessarily need acknowledgment messages other than what is provided by the shared NVM fabric itself, as is described below. However, fabrics, networks and storage devices may not be perfect, and addressably shared NVM across large numbers of nodes, along with all the related networking issues, may benefit from more robust methods for reliable operation, if only because some of the NVM fabric may have aged due to endurance and environmental impacts and because some operations involve valuable transactions that require more trust and reliability.
[0023] The shared NVM address space of the example shared NVM multicast apparatus 100 includes large NVM regions 110, each assigned to a node pair, such that the standby node 130 will take over responsibility for the NVM region 110 if the active node 120 fails or hangs for any reason. The usual idempotence requirements for acknowledgement messages are provided for in the shared NVM multicast apparatus 100 such that multiple acknowledgements for the same message might arrive back at the sender, but a successful store remains a success and an error remains an error across failures.
[0024] In various examples, the shared NVM addressing of the shared NVM multicast apparatus 100 may support a mixture of NVM and DRAM (dynamic random access memory) addresses under the same addressing scheme, where the vectors or mailboxes could be placed in DRAM for speed. However, this may complicate takeover logic after the monitoring active node 120 fails and the standby node 130 takes over. Since vectors and mailboxes in private DRAM, or in local DRAM that is remotely accessible, are not persistent or even readable by another node after the node possessing the DRAM has failed, checkpoints or transfers of vector or mailbox contents to NVM can be made before they are acted upon, to preserve idempotence of the resulting actions.
[0025] Figure 2 illustrates an example process 200 of multicasting data payloads in nonvolatile memory of the apparatus of Figure 1. Figures 3, 4 and 5 illustrate example payload and acknowledgment flow across the shared NVM regions 110 for various steps of the example process 200. The process 200 is an example only and may be modified. The example process 200 of Figure 2 will now be described with further references to Figures 1, 3, 4 and 5.
[0026] The example process 200 may begin with a node in a sending region of a shared NVM multicasting a payload to a plurality of receiving regions in the NVM. Figure 3 illustrates an example of data payloads multicast by an active node 120 in a sending region 210 of shared NVM to a plurality of receiving regions 215 of the shared NVM. In the example of Figure 3, the payload is multicast from the sending region 210 to five receiving regions 215 including a first receiving region 215-1, a second receiving region 215-2, a third receiving region 215-3, a fourth receiving region 215-4 and a fifth receiving region 215-5. Of course, in other examples, the payload may be multicast to any desired number of receiving regions.
[0027] In the example multicast operation illustrated in Figure 3, the multicast messages 250 are stored directly into the source vector list or mailboxes 240 of the five receiving regions 215 (including source vector lists or mailboxes 240-1, 240-2, 240-3, 240-4 and 240-5). Five store operations sent from the sending region 210 to the receiving regions 215 move copies of the payload directly into the source vector list or mailbox 240 of each respective receiving region 215. In this regard, the size of the payload may be limited by the size of the source vector list or mailbox 240. As is described below, in various examples, other addressing methods may be employed.
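A sketch of the multicast step of block 202 follows, expressed as one store per receiving region into that region's mailbox address. The nvm_store() primitive stands in for whatever load/store operation the fabric actually provides and is an assumption of this sketch.

    #include <stddef.h>
    #include <stdint.h>

    /* Assumed fabric primitive: store len bytes of src at a global NVM address
     * and ring the owning node's doorbell.  Not defined by the source document. */
    extern int nvm_store(uint64_t global_addr, const void *src, size_t len);

    /* Multicast a payload by issuing one store per receiving region's mailbox. */
    int multicast_payload(const uint64_t *mailbox_addrs, size_t nregions,
                          const void *payload, size_t len)
    {
        int failures = 0;
        for (size_t i = 0; i < nregions; i++) {
            /* Each leg is an independent store; a failed leg is repaired later
             * by a unicast retry rather than failing the whole operation. */
            if (nvm_store(mailbox_addrs[i], payload, len) != 0)
                failures++;
        }
        return failures;   /* 0 means every leg was issued successfully */
    }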
[0028] Special global multicast byte-addresses can be permanently allocated from the shared NVM address space to create globally known multicast addresses that the NVM fabric or network will recognize statically for each receiving region 215. Alternatively, global multicast byte-addresses can be controlled by a configuration or programmatic interface to know the multicast set of target addresses that the special global multicast address represents when used in a multicast or unicast operation.
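One way a configured global multicast byte-address could be represented is as a small table mapping the special address to the set of member target addresses it expands to. The table shape and lookup below are assumptions of this sketch, not an interface defined by the source.

    #include <stddef.h>
    #include <stdint.h>

    #define MAX_GROUP_MEMBERS 32   /* assumed limit per multicast group */

    /* A special global multicast address and the per-region target addresses it
     * represents when used in a multicast or unicast operation. */
    struct multicast_group {
        uint64_t group_addr;                  /* the well-known special address */
        size_t   nmembers;
        uint64_t members[MAX_GROUP_MEMBERS];  /* expanded target addresses      */
    };

    /* Expand a special address into its member list, or return NULL if the
     * address is not a configured multicast group. */
    const struct multicast_group *
    lookup_group(const struct multicast_group *table, size_t n, uint64_t addr)
    {
        for (size_t i = 0; i < n; i++)
            if (table[i].group_addr == addr)
                return &table[i];
        return NULL;
    }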
[0029] In various examples, a choice for specifying a special global multicast byte-address is to send a broadcast message to the vector source list or mailbox 140 of every NVM region 110 in the system connected to the NVM fabric or network. This could be done for special occasions, system-wide alerts or system-wide operations such as flushing all processor caches to NVM in case of an earthquake lockdown, for example.
[0030] Use of content-addressable special addresses for separate regions of the entire NVM fabric may also be possible in various examples. Content-addressable special addresses of this type may take advantage of a logical separation of the NVM address space into regions, and the assignment of attributes to nodes and pushing those down into the vector source list or mailbox 140 metadata allowing checking for combinatorial matches with preconfigured attributes of the special multicast address. In various examples, NVM regions 110 in certain rows or columns may get separate broadcast messages from NVM regions 110 in different rows or columns.
[0031] In various examples, each memory region can support offset addressing from the vector or mailbox NVM address, or from some configured NVM address. This would allow offset addressing by memory region, for distributed control and management functions, or for pre-configured global headers for services. Large chunks of NVM can be fixed and preconfigured for node and cluster startup to preclude the need for all NVM to be allocated dynamically. In one example, all processes up to a configured counter may receive a fixed NVM memory allocation to allow for faster node startup of NVM file systems and memory brokers.
[0032] The shared NVM address space can support dynamic remapping or NVM pointer swizzling for special, low-level functions. This allows a broadcast to be performed without moving extremely large data. In this regard, readers may get the notification of a multicast normally, but the address points to a block that is elsewhere and is remapped into a standard local offset address. In one example, a new version of a database is built in the well-known offset NVM address in one NVM region and then broadcast atomically into a set of NVM regions/nodes which do the atomic switchover out of the vector or mailbox wakeup. In this regard, all nodes may see the same shift in data at the same moment. This may also allow the sharing or federation of databases by connection pooling under the same shared NVM global address, such that compiled database aggregate functions are shipped to a local node and some of the addresses are general, and some are specifically remapped to a local shared copy of the data.
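A sketch of the atomic switchover idea: each region keeps a well-known cell that names the currently published data block, and the broadcast only flips that cell instead of moving the data. The use of a C11 atomic store and load for the flip is an assumption here; a real fabric would use whichever visibility primitive it defines.

    #include <stdatomic.h>
    #include <stdint.h>

    /* Well-known cell at a fixed local offset in each region: it holds the
     * offset of the currently published version of, say, a database image. */
    struct published_ptr {
        _Atomic uint64_t current_offset;
    };

    /* Writer: after the new version is fully built at new_offset, publish it
     * with a single atomic store; readers never observe a half-built copy. */
    void publish_version(struct published_ptr *cell, uint64_t new_offset)
    {
        atomic_store_explicit(&cell->current_offset, new_offset,
                              memory_order_release);
    }

    /* Reader: pick up whichever version is currently published. */
    uint64_t read_version(struct published_ptr *cell)
    {
        return atomic_load_explicit(&cell->current_offset, memory_order_acquire);
    }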
[0033] Shared NVM addresses can also be remapped on failure, for either the well-known local offset addresses or global shared NVM addresses (e.g., when a physical NVM module fails and the address is automatically remapped to a mirror). Finally, remapping the shared NVM addresses could allow scatter and gather functions under a single addressed move instruction.
[0034] Referring again to the example process of Figure 2, upon the active node 120 or the standby node 130 in the sending region multicasting the payload at block 202, the active node 120 or the standby node 130 in the sending region monitors for an acknowledgment from a node in each of the plurality of receiving regions 215 (block 204). Figure 4 illustrates an example of acknowledgement messages transmitted by the plurality of nodes in the receiving regions 215 of shared non-volatile memory to the active node 120 (or the standby node 130) in the sending region 210 of non-volatile memory. Based on receiving the successful acknowledgments, the active node 120 or the standby node 130 in the sending region determines, at block 206, that no acknowledgment failures have occurred and the process 200 may end.
[0035] In the example of Figure 4, each of the five receiving regions 215 receives a wakeup or an interrupt or has a doorbell rung. In response, each of the five receiving regions sends an acknowledgment back to the node in the sending region 210, as indicated by the arrows 260-1, 260-2, 260-3, 260-4 and 260-5, respectively. The acknowledgments 260 may, in various examples, comprise sending a unicast acknowledgement store operation into the vector source list or mailbox 140 of the node in the sending region 210. The wakeup to the receiving regions and an acknowledgement to the sending region constitute a type of acknowledgement referred to as active.
[0036] In various example methods described herein, passive multicasts, with no acknowledgements, may be employed, depending completely on the NVM store operational integrity of the NVM fabric or network of the shared non-volatile memory multicast apparatus 100. Alternatively, acknowledgments may employ a majority consensus and the Thomas Write Rule to handle concurrency and consistency issues on reads and writes of multiple mirrored targets. In this regard, some number of reads of different copies may be multicast, and timestamps may be included (as Thomas Write Rule tie-breakers) in the metadata on each write. If there are multiple writers involved across nodes, then either clocks must be synchronized tightly, a network clock resource must be utilized (the Thomas Write Rule compares timestamps for the latest value), or Lamport clocks must be used with some continuing Lamport timestamp synchronization traffic, such that the Thomas Write Rule (the most recent timestamp wins ties) still applies.
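A minimal sketch of the Thomas Write Rule as it might be applied to one mirrored cell: each write carries a timestamp in its metadata and is applied only if it is at least as recent as the timestamp already stored. The structure and field names are illustrative assumptions.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* A mirrored cell: the stored value plus the timestamp of the write that
     * produced it (a Lamport or synchronized clock value, per the text above). */
    struct mirrored_cell {
        uint64_t write_ts;
        uint8_t  value[64];
    };

    /* Thomas Write Rule: an incoming write with an older timestamp is simply
     * discarded (it is logically already overwritten); the newest timestamp wins. */
    bool apply_write(struct mirrored_cell *cell, uint64_t ts,
                     const void *value, size_t len)
    {
        if (len > sizeof(cell->value) || ts < cell->write_ts)
            return false;          /* stale or oversized write: drop it     */
        memcpy(cell->value, value, len);
        cell->write_ts = ts;       /* the most recent timestamp wins ties   */
        return true;
    }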
[0037] In various examples, after the wakeup and before the acknowledgment for any of the active multicast store operations in this disclosure, there can be checking of the payload by checksum, or other data-dependent checks (length, type, etc.) depending on the stability and error rate on the load-store fabric or network, as well as on the endurance indicator of the store operation's target (in this case the vector or mailbox of the target memory region's active node). The endurance indicator for the NVM can be (1) static and assigned, (2) dynamic and changing due to the statistical measurements of the error rates generally for this complex, (3) specific to this individual NVM memory component, (4) a computed function based on the aging of the specific NVM component, or (5) based on the utilization of the individual areas on the NVM component. Greater risk can incur more and deeper checking, and risk can change over time.
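The check between wakeup and acknowledgement might, as one hypothetical example, look like the following; the header layout, the additive checksum, and the risk_level parameter are assumptions introduced for illustration and are not dictated by the disclosure.

#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

/* Hypothetical payload header stored ahead of the data in the
 * receiving region's vector or mailbox. */
struct multicast_payload {
    uint32_t length;      /* declared payload length              */
    uint32_t checksum;    /* simple additive checksum of the data */
    uint8_t  data[];
};

static uint32_t checksum32(const uint8_t *p, size_t n)
{
    uint32_t sum = 0;
    while (n--)
        sum += *p++;
    return sum;
}

/* Depth of checking escalates with the endurance or error-rate risk:
 * length checks always, data checksum only at higher risk. */
static bool payload_ok(const struct multicast_payload *m,
                       size_t mailbox_bytes, int risk_level)
{
    if (mailbox_bytes < sizeof(*m) ||
        m->length > mailbox_bytes - sizeof(*m))
        return false;                          /* length check */
    if (risk_level > 0 &&
        checksum32(m->data, m->length) != m->checksum)
        return false;                          /* data-dependent check */
    return true;
}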
[0038] Less risk can indicate that passive NVM multicasts with no acknowledgement are sufficient; such passive NVM multicasts are considered acknowledged once the NVM fabric or memory network has physically completed each of the store operations. This allows a range of safe operations under a single umbrella of multicast that, according to the dynamic risk assessment, transmutes from passive, to active, to active with checksums, to active with checksums and type or field validation, with the goal of mitigating risk automatically through escalating efforts to prevent or repair corruption.
[0039] As an individual unicast acknowledgement NVM store operation arrives into the originating active multicast node's vector or mailbox, the wakeup or interrupt or ringing of the doorbell can occur immediately, allowing low-level or high-level multicast management in the node, or low-level hardware or software that ticks off the list of multicast targets and allows propagation of the completion of the active multicast to higher-level software. Alternatively, smarter hardware or firmware or embedded processing on the memory component itself can have pre-knowledge of the original multicast operation and can count down or tick off the list of multicast acknowledgements. Once all acknowledgements are collected, or an error is indicated on one or more legs of the multicast targets, the wakeups may be condensed or joined together in a reliable way into a single multicast acknowledgement wakeup. In one example, if the wakeup load starts to impact the performance of the monitoring node, then a fast timer can be used to boxcar multiple multicast (and other) NVM wakeups together.
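One way to picture the counting down or ticking off of acknowledgements is a per-multicast bitmask kept by the originating node, as in the hypothetical sketch below (the names multicast_tracker and record_ack are assumptions, and the sketch supports up to 32 target regions).

#include <stdint.h>
#include <stdbool.h>

/* Hypothetical bookkeeping for one outstanding multicast: one bit per
 * target region, cleared as each unicast acknowledgement arrives in
 * the originating node's vector or mailbox. */
struct multicast_tracker {
    uint32_t pending;   /* bit i set while target i has not acknowledged */
    uint32_t failed;    /* bit i set if target i returned a NAK          */
};

/* Record one acknowledgement (or NAK) for multicast leg `leg`.
 * Returns true when every leg has either acknowledged or failed, so the
 * wakeups can be condensed into a single multicast acknowledgement
 * wakeup for higher-level software. */
static bool record_ack(struct multicast_tracker *t, unsigned leg, bool nak)
{
    t->pending &= ~(1u << leg);
    if (nak)
        t->failed |= 1u << leg;
    return t->pending == 0;
}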
[0040] Figure 5 illustrates an example of unicast retransmission of the data payload, sent in response to determining a failure of one of the nodes in a shared non-volatile memory region to receive the multicast data payload. At every receiving region 215 in the multicast depicted in Figure 3, there may be a failure or corruption that is either detectable or undetectable. Active methods, such as sending a negative acknowledgment, or NAK, employ some level of checking to mitigate some of this, which allows errors to be propagated back to the originating node in the sending region 210, thereby allowing a unicast retry to store that payload in that target again (the target may be remapped after the failure to a different physical NVM module under the same shared global NVM address or local offset NVM address). In the example illustrated in Figure 5, a corruption is detected at the receiving region 215-5 by hardware or firmware checking, resulting in the wakeup or interrupt or the ringing of the doorbell on the target node, which then stores a NAK message in the vector source list or mailbox 140 of the originating node in the sending region 210. The NAK message is sent to the sending region 210 from the receiving region, as indicated by the arrow 270 originating at the receiving region 215-5 and unicast to the sending region 210.
[0041] Upon receiving the NAK, the node in the sending region 210 determines at block 206 of Figure 2 that a failure of receipt of the acknowledgment from the node in the fifth receiving region 215-5 has occurred. In response to the failure determination, the example process 200 proceeds to block 208, where the node in the sending region 210 unicasts the payload to each of the receiving regions for which failure of receipt of an acknowledgment was determined at decision block 206, as indicated by the arrow 280 in Figure 5. Upon unicasting the payload at block 208, the process 200 returns to block 204 to monitor for an acknowledgment for the newly sent unicast payload. If the acknowledgment is received, the process 200 ends. If another acknowledgment failure is determined at decision block 206, the node in the sending region 210 may store an indication of the error indicated by the NAK message 270 in the vector source list or mailbox 140 of the node in the sending region 210, which may result in an immediate wakeup or a collective multicast wakeup at some point later. Upon the failure of any leg of the multicast, repair can be attempted by another unicast attempt to store the payload in the vector or mailbox of the node in the receiving region 215-5, and likewise for every failed leg of the multicast.
[0042] The example process 200 allows special multicast addresses to be used cheaply and reliably, because every failed leg may be followed up by a unicast repair, which may be repeated as many times as desired (e.g., a predetermined number of retries). This provides a significant potential performance improvement when compared to large numbers of duplicate single store operations across an NVM fabric or network.
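The overall multicast-then-repair flow of the example process 200 might be sketched as follows; the fabric primitives nvm_multicast_store and nvm_unicast_store are assumed placeholders for whatever store operations the NVM fabric exposes, not a real API.

#include <stdbool.h>
#include <stddef.h>

/* Assumed fabric primitives (placeholders, not a real API): each store
 * reports, per target, whether a positive acknowledgement came back. */
extern void nvm_multicast_store(const void *payload, size_t len,
                                bool acked[], size_t nregions);
extern bool nvm_unicast_store(const void *payload, size_t len,
                              size_t region);

/* Multicast once, then repair each failed leg with up to max_retries
 * unicast attempts (blocks 202 through 208 of the example process 200).
 * Supports up to 64 receiving regions in this sketch. */
static bool multicast_with_repair(const void *payload, size_t len,
                                  size_t nregions, int max_retries)
{
    bool acked[64] = { false };

    nvm_multicast_store(payload, len, acked, nregions);

    for (size_t r = 0; r < nregions; r++) {
        int tries = 0;
        while (!acked[r] && tries++ < max_retries)
            acked[r] = nvm_unicast_store(payload, len, r);
        if (!acked[r])
            return false;   /* this leg still failed after all retries */
    }
    return true;
}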
[0043] Figure 6 illustrates an example of buffered or scattered multicast of data payloads to a multicast set of receiving region addresses in shared non-volatile memory of the shared non-volatile memory multicast apparatus 100. The multicast payloads 250-1, 250-2, 250-3, 250-4 and 250-5 illustrated in Figure 6 are similar to the multicast payloads 250 illustrated in Figure 3, with the addition of buffered or scattered payloads, as indicated by first, second, third, fourth and fifth large arrows 290-1, 290-2, 290-3, 290-4 and 290-5. These buffered or scattered payloads 290 may be multicast with the multicast payloads 250 that were stored directly in the vector source lists or mailboxes 240 of the receiving regions 215 at block 204 of the process 200 in Figure 2.
[0044] The buffered or scattered payloads 290 of Figure 6 may be used for data sizes that are too large to fit in the vector source lists or mailboxes 240 of the receiving regions 215 of the shared NVM address space, for example. The multicast set of target addresses may be located anywhere in the shared NVM address space of the receiving regions 215, and the underlying software interface may identify the vectors or mailboxes to notify the monitoring nodes in the sending region 210 that an individual multicast store has been completed in that particular receiving region 215 of shared NVM addresses. In the example of Figure 6, there is only one buffer being stored or scattered in any one receiving region 215 of NVM addresses. In other examples, there may be multiple buffers, such as a thousand, for example, going into one receiving region 215. In that case, multiple buffers 290 may be acknowledged with a single unicast stored acknowledgement and a single wakeup at the node in the sending region 210. The scattered or buffered payload acknowledgement logic may be either passive or active.
If any of the scattered or buffered payloads 290 were not successfully stored, the unicasting of the payload again at block 208 of the process 200 in Figure 2 may proceed unchanged.
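A buffered or scattered payload could, purely as an illustration, be described to the fabric by a descriptor such as the one below; every field name here is an assumption introduced for the example.

#include <stdint.h>

/* Hypothetical descriptor for one buffered or scattered payload: the
 * data is stored directly at a target address in the receiving region,
 * and the notify_mailbox field tells the underlying software interface
 * which vector or mailbox to use when signalling the monitoring node
 * that this particular store has completed. */
struct scatter_descriptor {
    uint64_t target_addr;     /* shared NVM address receiving the buffer   */
    uint64_t length;          /* buffer size, possibly larger than mailbox */
    uint64_t notify_mailbox;  /* vector/mailbox to ring on completion      */
    uint32_t region_id;       /* receiving region of shared NVM addresses  */
    uint32_t flags;           /* e.g. active vs. passive acknowledgement   */
};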
[0045] Figure 7 illustrates an example of three-way mirroring of multicast data payloads sent to three receiving regions of shared non-volatile memory. Figure 7 illustrates multicast payloads for an N-way mirroring of multicast payloads, where N=3. The multicast payloads 250-1, 250-4 and 250-5 illustrated in the example of Figure 7 are similar to the multicast payloads 250-1, 250-4 and 250-5 illustrated in Figures 3 and 6. The mirroring operation illustrated in Figure 7 involves storing copies 300-1, 300-2 and 300-3 of the multicast payloads 250-1, 250-4 and 250-5, respectively, where the copies 300 are directed to separate physical NVM components in the receiving regions 215-1, 215-4 and 215-5. This ensures that a single failure of a physical NVM component will not take out more than one mirror. The acknowledgement logic, either passive or active, successful or with a unicast retry on failure, for the mirroring operations may be the same as that described above with reference to the example process 200 of Figure 2 and the buffered or scattered multicast depicted in Figure 6. Active mirroring allows for automatic unicast error retry, which may be transparent to the layers above.
[0046] In mirroring, passive multicasts can be employed, which depend completely on the NVM store operational integrity of the NVM fabric or network. Alternatively, they can fall back on majority consensus and the Thomas Write Rule to handle concurrency and consistency issues on reads and writes. In this regard, some number of reads of different copies should be done, and timestamps should be included in the metadata on each write. Since there is a single writer (the mirroring software entity) and not multiple writers across nodes, no clock synchronization is necessary. However, if the mirroring software can fail over to another node, then either clocks should be synchronized tightly, a network clock resource should be utilized (the Thomas Write Rule compares timestamps to determine the latest value), or Lamport clocks can be used, which depend on continuing Lamport timestamp synchronization on checkpoints between the active (primary) and standby (backup) nodes to function properly. Alternatively, in fault tolerance paradigms where a death message is sent by the dying active (e.g., primary) node to the standby (e.g., backup) node out of the dying node's non-maskable interrupt (as in OpenSAF's use of the OpenHPI kernel interface on Linux systems), that death message from the dying active node, or the heartbeat timeout logic that backs up that death message, could reset the Lamport clock in the standby node, such that the Thomas Write Rule (the most recent timestamp wins ties) would still apply.
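For the Lamport-clock variant mentioned above, the timestamp maintenance reduces to the two standard operations sketched below; this is a minimal illustration with hypothetical names, and resetting the standby node's clock on receipt of a death message would simply assign it a value derived from the last checkpoint.

#include <stdint.h>

/* Minimal Lamport clock: bumped on every local event (such as a write)
 * and merged with the peer's timestamp on every checkpoint or message
 * exchanged between the active and standby mirroring nodes. */
struct lamport_clock { uint64_t t; };

static uint64_t lamport_tick(struct lamport_clock *c)
{
    return ++c->t;                  /* local event, e.g. a mirrored write */
}

static void lamport_merge(struct lamport_clock *c, uint64_t peer_ts)
{
    if (peer_ts > c->t)
        c->t = peer_ts;             /* adopt the larger timestamp */
    c->t++;                         /* then advance past it */
}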
[0047] Figure 8 illustrates an example of mirrored 5-way replication of data payloads sent in an active ricochet multicast to one receiving/sending, or ricochet, region 410, and four receiving regions 415-1, 415-2, 415-3 and 415-4 of shared non-volatile memory in the shared NVM multicast apparatus 100 from a remote non-volatile memory region 400. Ricochets allow ganging together active and passive multicast operations at a low level, such that the successful completion wakeup, interrupt or ringing of the doorbell from one operation kicks off the next one at a low level, potentially in the interrupt driver handling the interrupt that causes the ricochet, or in the kernel, without dispatching a user mode thread, and with hundreds of times less performance impact. In this example, a node in a different rack may include the remote NVM region 400 and may replicate a chunk of data for safety, as well as for fan-out performance for remote readers (such as a live video system).
[0048] In the example of Figure 8, a first unicast payload 430 is sent from the remote NVM region 400 to the ricochet region 410. The ricochet region 410 replicates a copy 440 and stores it in the ricochet region 410. The ricochet region 410 further copies the first unicast payload 430 into four multicast payloads 450-1, 450-2, 450-3 and 450-4, sent to the first, second, third and fourth receiving regions 415-1, 415-2, 415-3 and 415-4, respectively. Replicated copies 460-1, 460-2, 460-3 and 460-4 of the multicast payloads 450-1, 450-2, 450-3 and 450-4 are stored in the first, second, third and fourth receiving regions 415-1, 415-2, 415-3 and 415-4, respectively.
[0049] Figure 9 illustrates an example of acknowledgments sent in response to successful mirrored 5-way replication of data payloads sent in the example active ricochet multicast of Figure 8. First, second, third and fourth acknowledgments 470-1, 470-2, 470-3 and 470-4 are sent from the first, second, third and fourth receiving regions 415-1, 415-2, 415-3 and 415-4 that successfully received the ricocheted copies 450 and 460 from the ricochet node 410. Successful arrival of the first, second, third and fourth acknowledgements 470 for all legs of the multicast causes the ricochet region 410 to send a final acknowledgement 480, acknowledging the successful reception of the first unicast payload 430 and successful replication of the replicated copy 440, thereby allowing higher-level software that invoked the replication operation in the first place to proceed consistently to the next buffer of data to be replicated. Negative acknowledgements may be handled in a similar fashion as described above in reference to the example process 200 of Figure 2.
[0050] Figure 10 illustrates an example of active ricochet multicast of data payloads and buffered payloads sent to a first set of two receiving regions 515-1 and 515-2 and each passively mirrored to a set of two mirror receiving regions 520-1, 520-2, 525-1 and 525-2. The first receiving region 515-1 receives a first multicast payload 550-1 and passively mirrors first and second copies 590-1 and 590-2 to first mirror receiving regions 520-1 and 520-2. The second receiving region 515-2 receives a second multicast payload 550-2 and passively mirrors first and second copies 570-1 and 570-2 to second mirror receiving regions 525-1 and 525-2.
[0051] In the example illustrated in Figure 10, each mirrored copy 590-1, 590-2, 570-1 and 570-2 is replicated by respective copies 595-1, 595-2, 580-1 and 580-2 in the respective regions 520-1, 520-2, 525-1 and 525-2. In addition, each of the multicast payloads 550-1 and 550-2 is replicated by respective copies 560-1 and 560-2 in the respective regions 515-1 and 515-2. Further, each of the multicast payloads 550-1 and 550-2 is actively multicast and acknowledged via first and second acknowledgments 535-1 and 535-2. As the acknowledgements 535-1 and 535-2 are processed and NVM stored back into the vector or mailbox of the originating region 510, the ricochets of the first and second mirrored copies 590 and 570 may be triggered at the end of the acknowledgement-sending functions. This creates a total of 2 x 3 = 6 copies of the data buffer with a single NVM multicast operation, which could be an NVM store operation into a special NVM multicast address. The replication counts, or amplification, of the first multicast regions 515, the first mirror receiving regions 520 and the second mirror receiving regions 525 could be greater, or the copies could be locations within the same region (one monitoring node), etc.
[0052] Figure 11 illustrates a block diagram of an example system with a computer-readable storage medium including example instructions executable by a processor to multicast a payload to a plurality of regions in a shared NVM. The system 600 includes the processor 610 and the computer-readable storage medium 620. The computer-readable storage medium 620 includes example instructions 621-623 executable by the processor 610 to perform various functionalities described herein.
[0053] The example instructions include multicast payload to plurality of receiving regions instructions 621. The instructions 621 cause the processor 610 to multicast a payload from a node of a sending region to a plurality of receiving regions of the shared NVM.
[0054] The example instructions 622 cause the processor 610 to determine a failure of receipt of the payload by at least one receiving region. The failure of receipt may be determined by, for example, failure to receive an acknowledgement from a receiving region or receiving a negative acknowledgement from the receiving region.
[0055] The example instructions further include unicast payload to each region for which failure of receipt is determined instructions 623. The example instructions 623 cause the processor 610 to unicast the payload to each region for which, for example, a negative acknowledgement is received. The instructions for determining failure of receipt 622 and unicasting the payload 623 may be repeated a predetermined number of times as necessary.
[0056] Various examples described herein are described in the general context of method steps or processes, which may be implemented in one example by a software program product or component, embodied in a machine-readable medium, including executable instructions, such as program code, executed by entities in networked environments. Generally, program modules may include routines, programs, objects, components, data structures, etc. which may be designed to perform particular tasks or implement particular abstract data types. Executable instructions, associated data structures, and program modules represent examples of program code for executing steps of the methods disclosed herein. The particular sequence of such executable instructions or associated data structures represents examples of corresponding acts for implementing the functions described in such steps or processes.
[0057] Software implementations of various examples can be accomplished with standard programming techniques with rule-based logic and other logic to accomplish various database searching steps or processes, correlation steps or processes, comparison steps or processes and decision steps or processes.
[0058] The foregoing description of various examples has been presented for purposes of illustration and description. The foregoing description is not intended to be exhaustive or limiting to the examples disclosed, and modifications and variations are possible in light of the above teachings or may be acquired from practice of various examples. The examples discussed herein were chosen and described in order to explain the principles and the nature of various examples of the present disclosure and its practical application to enable one skilled in the art to utilize the present disclosure in various examples and with various modifications as are suited to the particular use contemplated. The features of the examples described herein may be combined in all possible combinations of methods, apparatus, modules, systems, and computer program products.
[0059] It is also noted herein that while the above describes examples, these descriptions should not be viewed in a limiting sense. Rather, there are several variations and modifications which may be made without departing from the scope as defined in the appended claims.

Claims

WHAT IS CLAIMED IS:
1. A device, comprising:
a shared non-volatile memory (NVM) comprising a plurality of regions;
a plurality of receiving nodes in receiving regions of the shared NVM; and
a sending node in a sending region of the shared NVM, the sending node being to:
multicast a payload to the plurality of receiving regions of the shared NVM;
monitor for an acknowledgment from each of the receiving regions;
determine a failure of receipt of the acknowledgment from at least one of the receiving nodes in at least one region of the plurality of receiving regions; and
unicast the payload to each of the at least one region from which failure of receipt of the acknowledgment was determined.
2. The device of claim 1, wherein the sending node is further to:
determine a failure of receipt of an acknowledgment associated with at least one payload unicast; and
re-unicast the payload associated with the failure of receipt of an acknowledgment associated with at least one payload unicast.
3. The device of claim 1, wherein the sending node is to determine a failure by receiving a negative acknowledgement.
4. The device of claim 1, wherein the multicasting of the payload includes sending the payload to a vector source list or a mailbox of a receiving region.
5. The device of claim 1, wherein the multicasting of the payload includes sending the payload directly to a target address in a receiving region.
6. The device of claim 1, wherein the payload multicast to at least one receiving region is mirrored to separate physical components in the at least one receiving region.
7. The device of claim 1, wherein the payload multicast to at least one receiving region is ricocheted to at least one other region of the shared NVM.
8. A method, comprising:
multicasting, by a node in a sending region of a shared non-volatile memory (NVM), a payload to a plurality of receiving regions in the shared NVM; and
at least one of:
mirroring the payload multicast to at least one receiving region to separate physical components in the at least one receiving region;
ricocheting the payload multicast to at least one receiving region to at least one other region of the shared NVM; or
determining a failure of receipt of the multicast payload by at least one region of the plurality of receiving regions and unicasting the payload to the at least one region for which failure of receipt was determined.
9. The method of claim 8, wherein the determining a failure includes receiving a negative acknowledgement.
10. The method of claim 8, wherein the multicasting of a payload includes sending the payload to a vector source list or a mailbox of a receiving region.
11. A non-transitory computer-readable medium encoded with instructions executable by a processor of a computing system, the computer-readable storage medium comprising instructions to:
multicast, by a node in a sending region of a shared non-volatile memory (NVM), a payload to a plurality of receiving regions in the shared NVM;
determine a failure of receipt of the payload by at least one region of the plurality of receiving regions; and
unicast the payload to each of the at least one region from which failure of receipt of the payload was determined.
12. The non-transitory computer-readable medium of claim 11, wherein a failure is determined by receiving a negative acknowledgement.
13. The non-transitory computer-readable medium of claim 11, wherein the instructions to multicast the payload include instructions to send the payload to a vector source list or a mailbox of a receiving region.
14. The non-transitory computer-readable medium of claim 11, further comprising instructions to mirror the payload multicast to at least one receiving region to a separate physical component in the at least one receiving region.
15. The non-transitory computer-readable medium of claim 11, further comprising instructions to ricochet the payload multicast to at least one receiving region to at least one other region of the shared NVM.
PCT/US2015/048866 2015-09-08 2015-09-08 Multicasting in shared non-volatile memory WO2017044070A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/US2015/048866 WO2017044070A1 (en) 2015-09-08 2015-09-08 Multicasting in shared non-volatile memory

Publications (1)

Publication Number Publication Date
WO2017044070A1 true WO2017044070A1 (en) 2017-03-16

Family

ID=58239594

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2015/048866 WO2017044070A1 (en) 2015-09-08 2015-09-08 Multicasting in shared non-volatile memory

Country Status (1)

Country Link
WO (1) WO2017044070A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6510496B1 (en) * 1999-02-16 2003-01-21 Hitachi, Ltd. Shared memory multiprocessor system and method with address translation between partitions and resetting of nodes included in other partitions
US20030069939A1 (en) * 2001-10-04 2003-04-10 Russell Lance W. Packet processing in shared memory multi-computer systems
US20030227934A1 (en) * 2002-06-11 2003-12-11 White Eric D. System and method for multicast media access using broadcast transmissions with multiple acknowledgements in an Ad-Hoc communications network
US20070076739A1 (en) * 2005-09-30 2007-04-05 Arati Manjeshwar Method and system for providing acknowledged broadcast and multicast communication
US20070255865A1 (en) * 2006-04-28 2007-11-01 Gaither Blaine D System for controlling I/O devices in a multi-partition computer system


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15903711

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15903711

Country of ref document: EP

Kind code of ref document: A1