US20080071809A1 - Concurrency control for b-trees with node deletion - Google Patents


Info

Publication number: US20080071809A1
Application number: US11/859,597
Authority: US (United States)
Prior art keywords: node, nodes, target node, index, delete
Legal status: Abandoned (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Inventor: David Lomet
Current Assignee: Microsoft Technology Licensing LLC (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original Assignee: Microsoft Corp
Application filed by Microsoft Corp
Assigned to Microsoft Technology Licensing, LLC (assignment of assignors interest; assignors: Microsoft Corporation)

Classifications

    • G06F16/2246: Trees, e.g. B+trees (G06F Electric digital data processing; G06F16/22 Indexing, data structures therefor, storage structures; G06F16/2228 Indexing structures)
    • G06F16/2272: Management of indexing structures
    • G06F16/2308: Concurrency control (G06F16/23 Updating)
    • Y10S707/99941: Database schema or data structure (Y10S707/00 Data processing: database and file management or data structures)
    • Y10S707/99942: Manipulating data structure, e.g. compression, compaction, compilation
    • Y10S707/99943: Generating database or data structure, e.g. via user interface

Definitions

  • This invention relates to methods and systems for data storage. More particularly, this invention relates to methods and systems for utilizing B-trees for storing data, maintaining a robust and simple data structure and allowing high concurrency of access.
  • A B-tree is a tree structure which stores data, and allows operations to find, delete, insert, and browse the data.
  • Each data record stored in a B-tree has an associated key.
  • In order to be used for a B-tree, these keys must be orderable according to a predetermined function.
  • For example, the keys may be numeric, in which case the ordering may be from least to greatest.
  • As another example, the keys may be names, in which case the ordering may be alphabetical.
  • A B-tree is height-balanced, so all leaves are at the same level of the tree. Insertions and deletions of records to the B-tree are managed so that the height-balanced property of the B-tree is maintained.
  • The insertion of a new data record may require the split of a node into two nodes; a deletion may require the deletion of a node. Insertion and deletion procedures must maintain the properties of the B-tree (e.g. height balance) in order to ensure that they result in valid B-trees.
  • Each B-tree leaf contains one or more of the stored records in one of a disjoint set of ranges of key values, while each index node (non-leaf node) of a B-tree provides access to a range of key values stored in one or more adjacent key ranges contained in data nodes.
  • Each index node of the B-tree stores, for each of its child nodes, an ordered pair consisting of a key value within the range and a pointer to the child node.
  • The key values break the range of key values represented by the node into sub-ranges, and the pointers point to a leaf within the sub-range (if the index node is one level above the leaf level) or to an index node corresponding to that sub-range.
  • FIG. 1 is a block diagram of an exemplary subtree in a B-tree data structure.
  • As shown in FIG. 1, a sub-tree of a B-tree contains leaves 1010 storing records with the keys shown in those leaves 1010.
  • Leaf nodes are also known as data nodes.
  • Index node 1000 corresponds to the range between 21 and 133.
  • Index node 1000 contains three ordered pairs (index pairs). The first ordered pair contains the key value 21 and first pointer 1020 of index node 1000, which points to index node 1025.
  • A second ordered pair contains the key value 49 and second pointer 1030.
  • This indicates that the pointer in the first ordered pair should be followed to reach any record with a key greater than or equal to 21 (the key value in the first pair) and less than 49 (the key value in the second pair).
  • The key value in the second ordered pair, along with the key value of 93 in the third ordered pair, indicates that any record with a key greater than or equal to 49 and less than 93 will be found in the sub-tree whose root is index node 1035.
  • The third ordered pair, containing third pointer 1040, indicates that any record with a key greater than or equal to 93 will be found in the sub-tree whose root is index node 1045.
  • An index node will have as many ordered pairs of <key, pointer> as it has child nodes.
  • The range represented by each index node need not be explicitly stored in the index node.
  • Node 1035 corresponds to the range of key values v where 49 ≤ v ≤ 93, though this range is not explicitly stored in node 1035 in the example. Any search for key values in the range 49 ≤ v ≤ 93, though, will reach node 1035.
  • In addition to being height-balanced, another B-tree constraint concerns the number of nodes which can exist below a given node, which is determined by the order assigned to the B-tree.
  • In practice, the order of a B-tree is determined dynamically, when a node of the tree fills up. In this case, a node split occurs, as described below.
  • To search a B-tree for a record, the search begins at the root node and follows pointers from node to node based on the key value for the record being sought, descending down the tree, until either the key is located or the search fails because a leaf node is reached which does not contain a record with the key being searched for. For example, if the record with key value 113 is being sought, when index node 1000 is reached, the key values are consulted. Since the key value being sought is greater than the key value in the rightmost pair in node 1000, the pointer 1040 from that pair is followed. Node 1045 is reached. When the key values are consulted, it can be seen that pointer 1048 should be followed to find any record with a key value 109 ≤ v ≤ 122.
  • This pointer 1048 leads to the appropriate leaf from leaves 1010 which contains the record for the specified key value. If a record was searched for with a key value of 112, the search would end in the same location, but because no record is found with that key value in the leaf node, the search would return an unsuccessful result.
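  • As an illustration (not part of the patent text), the descent just described can be sketched in Python; the class and function names are hypothetical, and the fragment simply follows, at each index node, the rightmost pair whose key value does not exceed the search key.

```python
from dataclasses import dataclass, field

@dataclass
class LeafNode:
    records: dict = field(default_factory=dict)   # key -> record

@dataclass
class IndexNode:
    pairs: list = field(default_factory=list)     # sorted (key value, child) index pairs

def search(node, key):
    """Descend from an index node to the leaf that would hold `key`."""
    while isinstance(node, IndexNode):
        child = node.pairs[0][1]
        for key_value, candidate in node.pairs:
            if key_value <= key:
                child = candidate      # rightmost pair whose key value <= search key
            else:
                break
        node = child
    return node.records.get(key)       # None models an unsuccessful search
```

  • In the example of FIG. 1, a search for 113 and a search for 112 would both reach the same leaf; the latter simply finds no matching record there.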
  • When a node has the maximum number of key values (when there is not sufficient space for any additional index term or data record), if a new key value must be inserted into the range covered by the node, the node will be split. In order to ensure that concurrent accesses are not reading data from the node during the split, it is necessary to deny concurrent access to the node being changed. Because two nodes will now hold the information previously held by the node being split, an additional link is necessary in the parent node of the node being split. Concurrent accesses to that parent node must therefore be denied while the parent is updated. If the addition of a new key value and pointer in the parent node will overfill the parent node, the parent node will be split as well.
  • It can be seen that node insertions may cause splits recursively up the B-tree. This may require that a node high in the tree be locked while nodes much further down in the tree are being split, and while the split slowly propagates its way up to the locked node. This greatly impairs concurrent access to the tree.
  • The necessity for a number of locks or latches to prevent concurrent accesses to nodes being changed slows access to the information stored in the B-tree by limiting concurrent access.
  • FIG. 2 is a block diagram of an exemplary subtree in a Blink-tree data structure.
  • Each non-leaf node contains an additional ordered pair, a side pair, including a side key value and a pointer (termed the “side pointer”) which points to the next node at the same level of the tree as the current node.
  • The side pointer of the rightmost node on a level is a null pointer.
  • The side pointer from side pair 1147, because it is the side pointer of a rightmost node on a level, is null.
  • The side pointer from side pair 1107 is also shown as null; this could indicate that node 1000 is the root node or that it is the rightmost node on a level.
  • The side key value indicates the lowest value found in the next node at the same level of the tree.
  • Therefore, the range of values in a node may be seen by examining the index term for the node in its parent node (which is the lower bound and is included in the range) and the side key value (which is the upper bound but is not included in the range).
  • The purpose of the side pointer is to provide an additional method for reaching a node.
  • Each leaf node also contains a side pointer which points at the next leaf node, such as side pointer 1117 .
  • One benefit of using these side pointers is to enable highly concurrent operation by allowing splits to occur with each atomic action of the split involving only one level of the tree.
  • With B-link trees, in order for a split to occur on a full node, the contents of the full node are divided (one atomic action), and a new index term is posted to the parent (second atomic action). This avoids the situation in which multiple levels of the tree are involved in a single atomic action.
  • If a split is occurring in a node at the same time that a search is being performed for a key value in the range for that node, and the node has been split, with the lefthand node replacing the node which has been split, the tree can be traversed to find data even if no index term has yet been inserted into the parent of the node for the righthand node from the new pair.
  • In such a case, the parent node will point to the lefthand node, and if the data is not found in the lefthand node, the side pointer of the lefthand node provides access to the righthand node.
  • Thus a node split need not be a single atomic operation with the parent and child nodes both inaccessible until the split is completed.
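  • As an illustrative sketch only (the field names side_key and side_ptr stand in for the side pair and are not taken from the patent), a search that stays correct during such a half-completed split moves right through side pointers whenever the key falls beyond a node's range:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class BLinkNode:
    level: int                                    # 0 for leaf (data) nodes
    pairs: list = field(default_factory=list)     # sorted (key value, child) index pairs
    records: dict = field(default_factory=dict)   # leaf payload
    side_key: Optional[int] = None                # low key of the right sibling
    side_ptr: Optional["BLinkNode"] = None        # side pointer; None if rightmost

def blink_search(node, key):
    """Find the record for `key`, following side pointers where an index
    term for a righthand split node has not yet been posted."""
    while True:
        # Key beyond this node's range: the node was split, so move right.
        while node.side_key is not None and key >= node.side_key:
            node = node.side_ptr
        if node.level == 0:
            return node.records.get(key)
        child = node.pairs[0][1]
        for key_value, candidate in node.pairs:
            if key_value <= key:
                child = candidate
            else:
                break
        node = child
```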
  • In B-trees and Blink-trees, latches are used in order to provide mutual exclusion when a node split or node deletion is occurring.
  • A latch is a low-cost, usually short-duration lock, one which does not include deadlock control. Hence, it is not necessary to access a lock manager in order to acquire or release a latch.
  • Latches are therefore more lightweight than locks; they typically require only tens of instructions, not hundreds like locks. They prevent access of incorrect or outdated data during concurrent access of the data structure by allowing only an updater holding the latch to use the resource that has been latched.
  • Because no deadlock control exists for latches, a partial ordering is imposed on latches.
  • The holder of a latch on a parent node may request the latch for a child node of that parent node. Latches can propagate downward. However, the holder of a latch on a child node cannot request the latch for the parent without first releasing its latch on the child; latches do not propagate upwards. In this way, the deadlock situation in which the holder of a latch for parent node A is requesting a latch for child node B at the same time that the holder of a latch for child node B requests a latch for parent node A is avoided.
  • In a standard B-tree, the latch must be maintained for the node being updated, and for the parent of that node (and possibly for multiple ancestors up the tree, even perhaps to the root), so the pointers and key values in the parent can be modified to reflect the change. If the latch is not maintained for the parent, the tree can become inconsistent.
  • The latches must typically be maintained for all the nodes on the path to a leaf node that may need to be updated because of a node split at the leaf.
  • In a Blink-tree, a node split therefore need not be an atomic operation that includes posting the index term to the parent, but can be divided into two parts (“half splits”): a first “half split” where a child node is split, moving some data from an old node to a new node, and setting up a side link from the old node to the new node. After such a “half split” the Blink-tree will be well formed. A subsequent second “half split” posts an index term to the parent node.
  • A Blink-tree data structure, method and system is presented which includes the advantages of B-tree data structures and conventional Blink-tree data structures, yet allows highly concurrent access of the data and deals robustly with node deletion.
  • The “delete state” is tracked for a Blink-tree data structure.
  • This delete state is guaranteed to capture when a node among some set of nodes has been deleted.
  • The absence of state indicating that any node among the set of nodes has been deleted ensures that some specific node in that set has not been deleted.
  • Two delete states are tracked to deal separately with the two complications resulting from node deletes: (i) a parent to which an index term is scheduled to be posted may have been deleted; (ii) a new node for which an index term is scheduled to be posted may have been deleted.
  • Blink-tree node split operations avoid tree re-traversals to find the parent node to be updated; and they avoid having to verify that a newly created node whose index term is scheduled to be posted still exists. Additionally, split operations are divided into two atomic operations, and the second atomic operation does not need to be completed for the tree to be used.
  • The two atomic operations allow for high concurrency, and the tolerance for “lazy” scheduling of the second atomic operation (index term posting) is a simple solution which allows for easy implementation and coherence.
  • For a target node to be split, first the side pointer and a portion of the stored data are moved to a new index node and the side pointer of the target node is set to point to the new index node. Then, a post operation is queued. When this operation is performed, the information regarding the new index node is posted to the parent node. Should a node delete be detected that might cause the need to re-traverse the tree to find a parent, or to re-verify that a new node still exists, the index term posting half of the node split operation is terminated, thus avoiding making this more complex and expensive. Such incomplete postings of index terms are completed when it is detected that the index term is missing in a subsequent traversal of the tree.
  • FIG. 1 is a block diagram of an exemplary subtree in a B-tree data structure
  • FIG. 2 is a block diagram of an exemplary subtree in a Blink-tree data structure
  • FIG. 3 is a block diagram of an exemplary computing environment in which aspects of the invention may be implemented
  • FIG. 4 is a block diagram of an exemplary modified Blink-tree data structure according to one embodiment of the invention.
  • FIG. 5 is a flow diagram of the use of the modified Blink-tree according to one embodiment of the invention.
  • FIG. 6 is a flow diagram of a split node operation according to one embodiment.
  • FIG. 7 is a flow diagram of a delete node operation according to one embodiment.
  • The first tracked delete state, the index delete state, indicates whether it is safe to directly access a parent node (hence an index node, not a data node) without re-traversing the B-tree.
  • DX contains this information for all nodes above the leaf level, and in one embodiment it is maintained outside of the tree since any index node may be deleted.
  • DX(nodeA) can be consulted to determine that index nodeA may have been deleted, or that index nodeA cannot have been deleted.
  • In one embodiment, DX(n) is a binary function over all index nodes n in the B-tree, with one possible value indicating that the node cannot have been deleted, and the other possible value indicating that the node may have been deleted.
  • In another embodiment, DX is a counter which is incremented when an index node has been deleted, so that a change in DX indicates that a node delete has occurred, while no change means that no nodes have been deleted since the earlier inspection of DX.
  • The second tracked delete state, the data delete state DD, indicates whether it is safe to post an index term for a leaf node that resulted from a data node split. Since the parent of the leaf node resulting from a split is accessed in any event to post the index term, the DD state can be stored in the parent, where each node is assigned to a disk page, without incurring any extra I/O to access the page.
  • Leaf node deletes are much more common than index node deletes, and so there is real value to localizing leaf node deletes to a sub-tree without requiring additional latching.
  • A DD state is maintained in each level 1 node (nodes which are parents of a leaf).
  • DD is a binary function over all leaf nodes, and DD(nodeB) returns one value if leaf nodeB may have been deleted, and another if leaf nodeB cannot have been deleted.
  • For index nodes, DX is used for this verification.
  • The value indicating that the node may have been deleted may be returned even when the node being asked about has not been deleted. In one embodiment, if this value is returned, further activity on the node is abandoned or postponed. No index term will be posted. The absence of a posted index term will be re-discovered when a Blink-tree traversal is required to include a side link traversal.
  • The abandonment of the posting of the index term when delete states indicate that a node may have been deleted allows concurrency to be accomplished in a simple manner. The tree will always allow searches to be executed correctly, and where an index node posting is abandoned due to the delete state, subsequent actions will allow the missing index node posting to be discovered and requeued. This allows the node split to be simple and avoids retraversals, and yet allows for a way for such node postings to be detected and performed later.
  • In another embodiment, the tree is re-traversed when the node may have been deleted. This may cause a delay while the presence of the node is found (or while the correct node which is the parent of the node being split or deleted is found). However, deletions and splits of nodes will still occur correctly, and will enable index terms to be propagated up the tree correctly despite this delay.
  • In one embodiment, delete states are maintained as binary functions for each node. In another embodiment, delete states are maintained as counters for a group of nodes, which are updated when a node is deleted.
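  • A minimal sketch of the delete states in their binary-function form is shown below; the class and method names are illustrative and not taken from the patent, and nodes are assumed to be identified by hashable handles.

```python
class IndexDeleteState:
    """D_X: maintained outside the tree for all index nodes. A node never
    recorded here cannot have been deleted."""
    def __init__(self):
        self._maybe_deleted = set()

    def note_index_delete(self, node):
        self._maybe_deleted.add(node)

    def may_have_been_deleted(self, node):
        return node in self._maybe_deleted


class DataDeleteState:
    """D_D: kept in a level 1 index node for its own leaf children, so it is
    available with no extra I/O when that parent is accessed to post an index term."""
    def __init__(self):
        self._maybe_deleted = set()

    def note_leaf_delete(self, leaf):
        self._maybe_deleted.add(leaf)

    def may_have_been_deleted(self, leaf):
        return leaf in self._maybe_deleted
```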
  • FIG. 3 shows an exemplary computing environment in which aspects of the invention may be implemented.
  • The computing system environment 100 is only one example of a suitable computing environment and is not intended to suggest any limitation as to the scope of use or functionality of the invention. Neither should the computing environment 100 be interpreted as having any dependency or requirement relating to any one or combination of components illustrated in the exemplary operating environment 100.
  • The invention is operational with numerous other general purpose or special purpose computing system environments or configurations.
  • Examples of well known computing systems, environments, and/or configurations that may be suitable for use with the invention include, but are not limited to, personal computers, server computers, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, embedded systems, distributed computing environments that include any of the above systems or devices, and the like.
  • The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer.
  • Program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types.
  • The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network or other data transmission medium.
  • Program modules and other data may be located in both local and remote computer storage media including memory storage devices.
  • An exemplary system for implementing the invention includes a general purpose computing device in the form of a computer 110.
  • Components of computer 110 may include, but are not limited to, a processing unit 120, a system memory 130, and a system bus 121 that couples various system components including the system memory to the processing unit 120.
  • The processing unit 120 may represent multiple logical processing units such as those supported on a multi-threaded processor.
  • The system bus 121 may be any of several types of bus structures including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures.
  • Such architectures include Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA (EISA) bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus.
  • The system bus 121 may also be implemented as a point-to-point connection, switching fabric, or the like, among the communicating devices.
  • Computer 110 typically includes a variety of computer readable media.
  • Computer readable media can be any available media that can be accessed by computer 110 and includes both volatile and nonvolatile media, removable and non-removable media.
  • Computer readable media may comprise computer storage media and communication media.
  • Computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data.
  • Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CDROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can accessed by computer 110 .
  • Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media.
  • The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal.
  • Communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above should also be included within the scope of computer readable media.
  • the system memory 130 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 131 and random access memory (RAM) 132 .
  • RAM 132 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 120 .
  • FIG. 3 illustrates operating system 134 , application programs 135 , other program modules 136 , and program data 137 .
  • The computer 110 may also include other removable/non-removable, volatile/nonvolatile computer storage media.
  • FIG. 3 illustrates a hard disk drive 141 that reads from or writes to non-removable, nonvolatile magnetic media, a magnetic disk drive 151 that reads from or writes to a removable, nonvolatile magnetic disk 152, and an optical disk drive 155 that reads from or writes to a removable, nonvolatile optical disk 156, such as a CD ROM or other optical media.
  • Other removable/non-removable, volatile/nonvolatile computer storage media that can be used in the exemplary operating environment include, but are not limited to, magnetic tape cassettes, flash memory cards, digital versatile disks, digital video tape, solid state RAM, solid state ROM, and the like.
  • The hard disk drive 141 is typically connected to the system bus 121 through a non-removable memory interface such as interface 140.
  • Magnetic disk drive 151 and optical disk drive 155 are typically connected to the system bus 121 by a removable memory interface, such as interface 150.
  • Hard disk drive 141 is illustrated as storing operating system 144, application programs 145, other program modules 146, and program data 147. Note that these components can either be the same as or different from operating system 134, application programs 135, other program modules 136, and program data 137. Operating system 144, application programs 145, other program modules 146, and program data 147 are given different numbers here to illustrate that, at a minimum, they are different copies.
  • A user may enter commands and information into the computer 110 through input devices such as a keyboard 162 and pointing device 161, commonly referred to as a mouse, trackball or touch pad.
  • Other input devices may include a microphone, joystick, game pad, satellite dish, scanner, or the like.
  • These and other input devices are often connected to the processing unit 120 through a user input interface 160 that is coupled to the system bus, but may be connected by other interface and bus structures, such as a parallel port, game port or a universal serial bus (USB).
  • A monitor 191 or other type of display device is also connected to the system bus 121 via an interface, such as a video interface 190.
  • Computers may also include other peripheral output devices such as speakers 197 and printer 196, which may be connected through an output peripheral interface 195.
  • The computer 110 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 180.
  • The remote computer 180 may be a personal computer, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 110, although only a memory storage device 181 has been illustrated in FIG. 3.
  • The logical connections depicted in FIG. 3 include a local area network (LAN) 171 and a wide area network (WAN) 173, but may also include other networks.
  • Such networking environments are commonplace in offices, enterprise-wide computer networks, intranets and the Internet.
  • When used in a LAN networking environment, the computer 110 is connected to the LAN 171 through a network interface or adapter 170.
  • When used in a WAN networking environment, the computer 110 typically includes a modem 172 or other means for establishing communications over the WAN 173, such as the Internet.
  • The modem 172, which may be internal or external, may be connected to the system bus 121 via the user input interface 160, or other appropriate mechanism.
  • In a networked environment, program modules depicted relative to the computer 110 may be stored in the remote memory storage device.
  • FIG. 3 illustrates remote application programs 185 as residing on memory device 181 . It will be appreciated that the network connections shown are exemplary and other means of establishing a communications link between the computers may be used.
  • A Blink-tree according to one embodiment of the invention is shown in FIG. 4.
  • The root node of the Blink-tree 400 contains three ordered pairs including pointers 420, 430, and 440 to index nodes 425, 435, and 445. These index nodes point to leaf nodes 410a through 410j, also known as data nodes. Each data node contains data values and a side pointer which points to the next data node in sequence.
  • The root node also contains a side pair 407, and index nodes 425, 435, and 445 contain side pairs 427, 437, and 447, respectively.
  • An index delete state DX 470 is stored, which, for each index node, indicates whether the index node has not been deleted, or whether it may have been deleted.
  • In each of the index nodes just above the leaf level, a data delete state DD (427, 437, and 447, respectively) is stored.
  • This data delete state stores the delete state for each of the child leaf nodes of an index node which is just above the leaf node level.
  • For example, data delete state DD 447 tracks the delete state of the leaf nodes 410g, 410h, 410i and 410j.
  • Events may occur that may indicate that one of these nodes has been deleted. In such situations, the state for that leaf node in data delete state DD 447 will be changed to indicate that it may have been deleted.
  • As shown in FIG. 5, at least one delete state associated with one or more of the nodes of the Blink-tree is stored, step 510.
  • A data record operation (adding, modifying, and deleting data records) is performed, step 520.
  • Delete states are updated if necessitated by the data record operation, step 530.
  • Latches come in multiple modes: share, update, and exclusive. These latch modes support different levels of concurrent access.
  • An exclusive latch on a resource prohibits other latches from being obtained for the resource, and can only be obtained if no other latch is held on the resource. No other latches are allowed if an exclusive latch is held.
  • A share latch is compatible with other share latches, and with another type of latch known as an update latch. Share latches allow reading of the resource but not modification of it. Only one update latch may be held on a resource at one time, though share latches may be held on it concurrently; update latches are used to allow a user to upgrade the latch to an exclusive latch without relinquishing the update latch. Users with a share latch may not upgrade to an exclusive latch without first releasing control of the resource.
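  • The compatibility rules among these three latch modes can be summarized in a small table; the following sketch is illustrative only and not taken from the patent.

```python
SHARE, UPDATE, EXCLUSIVE = "S", "U", "X"

# Whether a requested latch mode (outer key) is compatible with a latch
# already held in a given mode (inner key), per the description above.
COMPATIBLE = {
    SHARE:     {SHARE: True,  UPDATE: True,  EXCLUSIVE: False},
    UPDATE:    {SHARE: True,  UPDATE: False, EXCLUSIVE: False},
    EXCLUSIVE: {SHARE: False, UPDATE: False, EXCLUSIVE: False},
}

def can_grant(requested, held_modes):
    """A latch is granted only if it is compatible with every latch already held."""
    return all(COMPATIBLE[requested][held] for held in held_modes)

# A share latch can join other share latches and one update latch,
# but nothing can be granted once an exclusive latch is held.
assert can_grant(SHARE, [SHARE, UPDATE])
assert not can_grant(EXCLUSIVE, [SHARE])
```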
  • Node deletion and node split operations may be necessitated by the addition or deletion of records and the structural requirements of the Blink-tree. In one embodiment, the following operations are defined:
  • Tree Traversal: traversal of a tree to find a desired node. This node may be a leaf node or an internal node. This is used to find a node for a user (for example for a record lookup) and also to provide traversal for other operations;
  • Node Update: inserting, modifying, or deleting information in a node;
  • Node Split: splitting a node into two separate nodes, in order to accommodate more information at that point in the tree than can fit into one node. According to one embodiment of the invention, this is performed in two distinct “half-split” operations, with the second half-split capable of being queued to be performed later (or abandoned, if the tree has changed too much to make it a simple change);
  • Access Parent: used to access a parent node and to check if a deletion may have occurred, in order to allow simplicity in splitting nodes and realize the efficiency gain from keeping the delete states;
  • Post Index Term: used to post an index term to a node, in order to perform the second part of a node split operation;
  • Delete Node: used to delete a node, in order to consolidate index nodes with low occupancy.
  • The node may be a leaf node or an intermediate node. Because nodes may be split or deleted, and a parent node may not contain index terms for all its child nodes, traversals may occur which utilize side pointers rather than pointers which appear as part of an index pair.
  • For example, suppose node 400 is a sub-tree which has been reached during a tree traversal to find a record with a key value of 80.
  • A share latch is obtained for node 400.
  • Node 400 is then consulted, and pointer 430 is followed.
  • When node 435 is reached, it will in turn be latched in share mode.
  • The latch on node 400 is then released.
  • A read node operation will be run on node 435 to find the entry that points to the leaf node that is the home for a record with key value of 80. In this case, this is 410e.
  • Node 410e is latched in share mode, and then the latch on node 435 is released.
  • Node 410e is then searched for the record with key value of 80.
  • This sequence of latching a node before unlatching its parent in the path is called latch coupling.
  • In some traversals, a side pointer is followed rather than an index pointer. If this is the case, the node pointed to by the side pointer is not referenced by an index term in the node which should be its parent node. An index term posting is therefore scheduled.
  • Additionally, underutilized nodes may be discovered during tree traversal, and such underutilized nodes should be deleted. Range reading is also possible, to traverse the tree and return any records with key values in the specified range.
  • In one embodiment, the tree traversal operation proceeds according to the following pseudocode, where tree traversal begins with nodeA (which has been latched), the key value being searched for is K and the requested level of the node to be found is R:
  • If nodeA is underutilized, enqueue a node deletion action for nodeA.
  • If nodeB is at a higher level than has been requested, or if nodeB is a sibling node of nodeA, then recursively perform a tree traversal on nodeB for key K at requested level R; otherwise, return nodeB.
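  • The full pseudocode (including the step that locates nodeB within nodeA) is not reproduced in this text; the Python fragment below is a hedged reconstruction from the surrounding description rather than the patent's own code, and the latch helpers and underutilization test are placeholders.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass(eq=False)                              # identity-based, so nodes are hashable
class Node:
    level: int                                    # 0 for leaf (data) nodes
    pairs: list = field(default_factory=list)     # sorted (key value, child) index pairs
    records: dict = field(default_factory=dict)   # leaf payload
    side_key: Optional[int] = None
    side_ptr: Optional["Node"] = None
    latched: bool = False                         # stand-in for a real latch

def latch(node): node.latched = True
def unlatch(node): node.latched = False
def underutilized(node): return node.level > 0 and len(node.pairs) < 2   # illustrative test

def traverse(node_a, key, requested_level, work_queue):
    """Traverse from latched node_a toward the node at requested_level covering
    `key`, latch coupling and enqueueing any repairs noticed along the way."""
    if underutilized(node_a):
        work_queue.append(("delete node", node_a))
    if node_a.side_key is not None and key >= node_a.side_key:
        node_b = node_a.side_ptr                  # sibling reached via side pointer
        # Such a node has no index term in its parent yet; schedule the posting.
        work_queue.append(("post index term", node_b))
    elif node_a.level > requested_level:
        node_b = node_a.pairs[0][1]
        for key_value, child in node_a.pairs:     # rightmost pair with key value <= key
            if key_value <= key:
                node_b = child
    else:
        return node_a                             # node_a is at the requested level
    latch(node_b)
    unlatch(node_a)                               # latch coupling
    return traverse(node_b, key, requested_level, work_queue)
```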
  • The tree traversal also receives data as to the requested type of latch to be used.
  • In order to read a record with a given key, the tree is traversed to find the leaf node which will contain the desired key if it exists.
  • A share latch is obtained for each node in the path to the leaf node as the tree is being traversed. Once the appropriate leaf node is returned, a read operation is performed.
  • A read node operation can, in one embodiment, be represented by the following pseudocode, where “Traverse” is the tree traversal operation detailed above and “root” is the root node of the tree:
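  • Continuing the sketch above (and again not reproducing the patent's pseudocode), the Read operation reduces to a traversal to the leaf level followed by a lookup:

```python
LEAF_LEVEL = 0

def read_record(root, key, work_queue):
    """Traverse from the root to the leaf that would hold `key`, read the
    record, then release the latch the traversal returned with."""
    latch(root)
    leaf = traverse(root, key, LEAF_LEVEL, work_queue)
    try:
        return leaf.records.get(key)    # None models an unsuccessful read
    finally:
        unlatch(leaf)
```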
  • An update to a node is similar to the reading of a node, except that the latch obtained before the update must be exclusive, rather than shared.
  • An update of a node consists of a tree traverse to the node followed by an insert, modify or delete of the node via an update operation. During the traversal, nodes are latched in share mode until a leaf node for the record to be inserted, updated, or deleted is reached. The leaf node is then latched with an update latch, and an exclusive latch is obtained when we confirm that the node accessed is the desired node (which, to find it, may require further side traversal). The leaf node is returned from the tree traversal latched exclusively and an update operation may then be performed on the leaf node.
  • If an insert is attempted on a node that is at maximum capacity, a split node action for the maximum capacity node is undertaken, and then the insert is retried.
  • If an update leaves a node underutilized, a delete node operation is enqueued for that node.
  • In the pseudocode for this operation, Action represents the action to be performed on the target node and Action Information represents information needed for the action (for example, the record to be inserted where the Action is an insertion).
  • Update is a function which inserts, modifies or deletes the node according to the specified action. This may, in the case of an insert or of certain updates, cause a node to be too full. If this is the case, the update will fail, a split node will be attempted, and the action retried:
  • If the Update fails due to TargetNode being full, then perform a Split Node (TargetNode) and then retry the Update Record action.
  • If TargetNode is underutilized, enqueue (Delete Node (TargetNode)).
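  • These fragments can be pulled together into the following sketch, which reuses the traversal helpers above; the capacity constant is illustrative, latch upgrading (update to exclusive) is elided, and split_node refers to the Split Node sketch given later, so this is an assumption-laden outline rather than the patent's pseudocode.

```python
MAX_RECORDS = 4                                   # illustrative node capacity

def update_record(root, key, action, value, work_queue):
    """Traverse to the target leaf and apply the action: split and retry when
    an insert finds the node full, enqueue a delete when it becomes underutilized."""
    latch(root)
    target = traverse(root, key, LEAF_LEVEL, work_queue)    # returned latched
    if action == "insert" and len(target.records) >= MAX_RECORDS:
        split_node(target, work_queue)            # first half split, latch already held
        unlatch(target)
        return update_record(root, key, action, value, work_queue)   # retry
    if action in ("insert", "modify"):
        target.records[key] = value
    elif action == "delete":
        target.records.pop(key, None)
        if not target.records:                    # underutilized (illustrative test)
            work_queue.append(("delete node", target))
    unlatch(target)
```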
  • The first half split operation involves moving the high range contents of the target node being split to a new node.
  • The side pair is also moved to the new node.
  • The side pair in the target node is changed to point to the new node, and also contains the low key value for the new node.
  • This first half split operation can be done immediately, however, because the exclusive latch will already be held by the calling operation on the node.
  • A latch need not be held on the parent of the target node being split, only on the target node itself. No latch is needed on the new node as others cannot access the new node.
  • The only path to the new node is via the original target node, on which we are already holding an exclusive latch.
  • The second half split operation involves the posting of an index term and index pointer for the new node in the parent node. This is done by enqueueing this index posting on the queue of work.
  • To perform the enqueued index posting, an Access Parent operation is first performed.
  • An Access Parent operation is used to access a parent node when a split or delete has occurred to one of its children.
  • The Access Parent operation accesses the parent of a node so that the index term for a node can be inserted or deleted or so that a deletion can occur. It is given the remembered parent node address (RememberedParent) of the node (Node) and the type of action being performed, and returns with the latched parent node or an error if the parent may have been deleted.
  • The Access Parent operation can be described, in one embodiment, with the following pseudocode:
  • If the parent access is for an index node deletion, update DX.
  • When an access parent operation is run, it returns with the latch on the parent of the target node which has changed (by being split or deleted), if the parent node exists.
  • The parent node being accessed in an access parent operation will be level 1 or higher. That is, it will not be a leaf node. Thus, there will be a delete state stored in DX for the node.
  • Access parent first latches DX with a share latch, and then, if the state stored therein indicates that the parent must exist, the parent node is latched. Once the parent node is latched, it cannot be deleted until it is unlatched.
  • Access parent returns with an error if the delete state of the parent node indicates that the parent node may not exist. In this way, access parent verifies without the necessity of tree traversal that the parent node definitely exists, and only if it does definitely exist is a traversal undertaken.
  • Access parent is also called with an indication of whether it is handling a delete or an index posting due to a split. If access parent is called for the index posting to a parent node of child node information, the delete state of the child node is also checked to ensure that the child node still exists. If it might not exist, access parent returns with an error.
  • For a second half split index posting, if access parent returns with an error, the index posting is not performed. All data in the tree still remains properly accessible through side pointers, and an error returned from access parent in this situation is generally sufficiently rare that the lack of an index posting for the new node is not an issue. In another embodiment, when an access parent returns with an error, the tree is traversed to find the correct parent for the index posting and the index posting is then made.
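  • A sketch of the Access Parent behavior described above follows, reusing the latch helpers and delete-state classes from the earlier sketches; the parameters, the error type, and the action labels are assumptions of this sketch, not the patent's interface.

```python
class AccessParentError(Exception):
    """The remembered parent (or, for a posting, the new child) may have been deleted."""

def access_parent(remembered_parent, child, action, dx, dd_of_parent=None):
    """Return with the parent latched, or raise if the delete state cannot
    guarantee that the needed nodes still exist (no tree re-traversal here)."""
    latch(dx)                                     # share-latch D_X while consulting it
    try:
        if action == "delete index node":
            dx.note_index_delete(child)           # record the impending index node delete
        if dx.may_have_been_deleted(remembered_parent):
            raise AccessParentError(remembered_parent)
        latch(remembered_parent)                  # once latched, it cannot be deleted
    finally:
        unlatch(dx)
    if action == "post index term" and dd_of_parent is not None:
        if dd_of_parent.may_have_been_deleted(child):
            # The new node whose index term was to be posted may itself be gone.
            unlatch(remembered_parent)
            raise AccessParentError(child)
    return remembered_parent
```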
  • The update node operation is then used to post the index term. This may lead to a split of the parent node; however, such a parent node split will be a separate atomic action, decoupled from the split that caused it.
  • The split node operation can be described, in one embodiment, with the following pseudocode, where a latch is held on the OriginalNode (the node to be split):
  • In step 1, where a new node is allocated, no latch is required as the node is invisible to the rest of the tree.
  • The first half of the split operation is embodied in steps 1-4, and the second in the operation enqueued in step 5.
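  • The Split Node pseudocode itself is not reproduced in this text; the sketch below reconstructs the operation from the description, using the node and queue conventions of the earlier sketches. The mapping of lines to the numbered steps, the even division of contents, and the shape of the queued work item are assumptions.

```python
def split_node(original, work_queue):
    """Split `original`, which the caller has already latched exclusively."""
    # Allocate the new node; no latch is required since it is invisible to the
    # rest of the tree until the original node points to it.
    new = Node(level=original.level)
    # First half split: move the high-range contents and the side pair to the
    # new node, then aim the original node's side pair at the new node.
    keys = sorted(original.records) if original.level == 0 else [k for k, _ in original.pairs]
    split_key = keys[len(keys) // 2]
    if original.level == 0:
        new.records = {k: v for k, v in original.records.items() if k >= split_key}
        original.records = {k: v for k, v in original.records.items() if k < split_key}
    else:
        new.pairs = [p for p in original.pairs if p[0] >= split_key]
        original.pairs = [p for p in original.pairs if p[0] < split_key]
    new.side_key, new.side_ptr = original.side_key, original.side_ptr
    original.side_key, original.side_ptr = split_key, new
    # Second half split: enqueue the index term posting. In the counter
    # embodiment described later, the current D_X value and the remembered
    # parent address would be stored with this work item as well.
    work_queue.append(("post index term", new))
    return new
```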
  • The Post Index Term operation can be described, in one embodiment, with the following pseudocode:
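  • Since that pseudocode is likewise not reproduced here, the sketch below reconstructs the Post Index Term behavior from the FIG. 6 description that follows: access the remembered parent through Access Parent and abandon the posting if either the parent or the new node may have been deleted. Parameter names are assumptions.

```python
def post_index_term(remembered_parent, new_node, new_low_key, dx, dd_of_parent):
    """Second half split: post (new_low_key, new_node) into the remembered parent,
    or abandon the posting when a relevant delete may have occurred."""
    try:
        parent = access_parent(remembered_parent, new_node, "post index term",
                               dx, dd_of_parent)
    except AccessParentError:
        # Abandon: the new node stays reachable through its left sibling's side
        # pointer, and the missing term is rediscovered on a later side traversal.
        return False
    parent.pairs.append((new_low_key, new_node))
    parent.pairs.sort(key=lambda pair: pair[0])
    # A parent left too full by this posting would itself be split, but as a
    # separate atomic action decoupled from the split that caused it.
    unlatch(parent)
    return True
```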
  • FIG. 6 is a flow diagram of a split node operation according to one embodiment.
  • As shown in FIG. 6, first the side pointer and a portion of the key value data of the target node are copied to a new node, in step 610.
  • The side pointer of the target node is set to point to the new node in step 620.
  • A post index term operation for the new node is added to a work queue in step 630.
  • Steps 610 through 630 are the first half-split.
  • A check is performed to see if the stored parent node for which an index term is being posted may have been deleted, in decision 640. This is done, in one embodiment, by consulting a delete state.
  • In decision 650, a check is performed to see if the new node for which an index term is being posted may have been deleted. This is done, in one embodiment, by consulting a delete state.
  • In step 660, if the parent node and the new node for which a term is being posted have both not been deleted, the new index node data is added to the parent node.
  • If it is possible that the stored parent node has been deleted (in other words, a “yes” answer to decision 640) or that the new node has been deleted (a “yes” answer to decision 650), then step 660 does not occur.
  • In that case, the second half-split operation is abandoned. As described above, this abandonment of the posting of the index term when delete states indicate that a deletion may have occurred allows B-link tree concurrency to be accomplished in a simple manner and avoids costly retraversals while allowing for the resultant “missing” node postings to be detected and the node posting to be performed at a later time.
  • In another embodiment, a retraversal may be done to ensure that the stored parent node and new node are both still in existence.
  • When a node is to be deleted, the delete node operation is run on the node. This permits the consolidation of index nodes with low occupancy.
  • The delete node operation first calls the access parent operation to access the parent node of the node targeted for deletion. Access parent finds the parent node. Delete state information for the target node will be found in the parent node; this delete state information is updated in the access parent operation. Access parent returns with the parent node latched.
  • The left sibling of the target node is then accessed and latched.
  • The target node is then accessed and latched.
  • The contents of the target node are moved to the left sibling.
  • The target node is then de-allocated and its index term removed from the parent.
  • FIG. 7 is a flow diagram of a delete data node operation according to one embodiment.
  • The delete state DX is latched (step 702) and checked (step 705) and, if the delete state indicates that the parent node has not been deleted, the Blink-tree data structure is traversed, starting at the remembered parent, to find the current parent for the target node, step 710.
  • This parent node is latched in step 720.
  • If an index node is being deleted, the delete state DX is set to indicate that an index node has been deleted, step 723.
  • The delete state DX is unlatched in step 725.
  • The index term for the target node is deleted in step 730.
  • The left sibling of the target node is then accessed and latched in step 740.
  • A side traversal is then performed on the left sibling, for example, by following the side pointer of the left sibling. If the target node is found via said side traversal (decision 760), the deletion proceeds. Otherwise, the operation is abandoned. If a data node is being deleted, the delete state (DD) for the node is updated to reflect that it will be deleted in step 765. The parent node can then be unlatched in step 770 and the target node is latched.
  • The target node is then examined to determine if it is under-utilized and if the contents of the target node can be consolidated with the contents of the left sibling in decision 780. If this is the case, the data including the side pointer and key data from the target node is copied to the left sibling in step 790.
  • The target node is then deleted in step 792.
  • The left sibling and target nodes are then unlatched in step 794.
  • In step 796, if the parent node had been found under-utilized, a delete operation is scheduled for the parent.
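  • The flow of FIG. 7 can be sketched as follows, reusing the helpers and delete states from the earlier sketches. The caller is assumed to supply the left sibling, the consolidation capacity test is elided, and the latching order is simplified, so this is an outline rather than the patent's procedure.

```python
def delete_node(target, remembered_parent, left_sibling, dx, dd_of_parent, work_queue):
    """Consolidate `target` into its left sibling and remove it from the tree."""
    action = "delete index node" if target.level > 0 else "delete data node"
    try:
        parent = access_parent(remembered_parent, target, action, dx)
    except AccessParentError:
        return False                                       # parent may be gone: abandon
    # Remove the target's index term from the parent (a no-op if it was never posted).
    parent.pairs = [p for p in parent.pairs if p[1] is not target]
    latch(left_sibling)
    if left_sibling.side_ptr is not target:                # side traversal did not find it
        unlatch(left_sibling)
        unlatch(parent)
        return False
    if target.level == 0 and dd_of_parent is not None:
        dd_of_parent.note_leaf_delete(target)              # record the data node delete in D_D
    unlatch(parent)
    latch(target)
    # Move the remaining contents and the side pair into the left sibling.
    left_sibling.records.update(target.records)
    left_sibling.pairs = sorted(left_sibling.pairs + target.pairs, key=lambda p: p[0])
    left_sibling.side_key, left_sibling.side_ptr = target.side_key, target.side_ptr
    unlatch(left_sibling)
    unlatch(target)                                        # target is now unreachable
    if underutilized(parent):
        work_queue.append(("delete node", parent))         # schedule the parent's own delete
    return True
```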
  • A delete state generally stores state information regarding whether a node was definitely not deleted, or whether it may have been deleted. In one embodiment, this is done using a stored binary state for each index node.
  • In another embodiment, delete state is stored using counters.
  • In this embodiment, index delete state DX is maintained as a counter that is incremented whenever an index node is deleted. Before an action is placed on the work queue, the present value of DX is stored along with the action on the queue.
  • When an index posting is abandoned, the new DX counter value is saved so that when the need for the index posting is detected again, the more recent DX value is entered with the enqueued action, hence making it possible for this later action to complete successfully.
  • A change in DX is used to determine whether a node may have been deleted between the time a structure modification is scheduled and the time when it is actually done. For example, because of a sibling traversal, it may be discovered that an index term has not been posted. A new index term posting is scheduled. The latest DX value is stored when the posting is scheduled, and it is compared to the current DX value when the parent node is accessed to post the index term. The new posting thus will only fail if there are further node deletes between the scheduling of the new posting and the time the parent node is accessed for execution of the index term posting.
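  • In this counter embodiment, the check reduces to comparing the DX value remembered when the action was enqueued against the current value when the action is executed; a small illustrative sketch (names assumed) follows.

```python
class IndexDeleteCounter:
    """Counter form of D_X: incremented whenever an index node is deleted."""
    def __init__(self):
        self.value = 0

    def note_index_delete(self):
        self.value += 1

def enqueue_posting(work_queue, remembered_parent, new_node, dx_counter):
    # Store the present D_X value alongside the scheduled structure modification.
    work_queue.append((remembered_parent, new_node, dx_counter.value))

def execute_posting(entry, dx_counter):
    remembered_parent, new_node, remembered_dx = entry
    if dx_counter.value != remembered_dx:
        # An index node was deleted since scheduling, so the remembered parent
        # may no longer exist: abandon, or re-enqueue with the newer D_X value.
        return False
    # ... post the index term for new_node into remembered_parent here ...
    return True
```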
  • The DX state need not be maintained across a system crash, but can be restarted when the system is brought up again.
  • The data delete state is used to determine whether leaf nodes that are immediate descendants of a lowest level index node may have been deleted. If not, a new node resulting from a split will not have been deleted, and the index term for that new node may be posted in that lowest level index node, without further verification that the node still exists.
  • The DD state describing node deletes among leaf nodes is stored in their parent index node. Some access savings are achieved, since the parent index node is accessed in any event during the posting operation.
  • In one embodiment, DD may be maintained as a binary state for each leaf node in a given index node.
  • In another embodiment, DD is maintained as a counter. Whenever a leaf node is deleted, the parent is latched and accessed in any event, in order to post the index term. Hence, the update of DD during a leaf node delete occurs with little overhead.
  • If DD for the parent node has changed when we attempt to post an index term for a new leaf node split, then the new node may already have been deleted, and hence no index term posting is required. In that case, we abort the posting.
  • The prior value for DD is stored when the node is visited in the traversal on the way to a leaf node. No additional latching is required.

Abstract

A data structure, added to a modified form of the Blink-tree data structure, tracks delete states for nodes. The index delete state (DX) indicates whether it is safe to directly access an index node without re-traversing the B-tree. The DX state is maintained globally, outside of the tree structure. The data delete state (DD) indicates whether it is safe to post an index term for a new leaf node. A DD state is maintained in each level 1 node for its leaf nodes. Delete states indicate whether a specific node has not been deleted, or whether it may have been deleted. Delete states are used to remove the necessity for atomic node splits and chains of latches for deletes, while not requiring retraversal. This property of not requiring a retraversal is exploited to simplify the tree modification operations.

Description

    CROSS-REFERENCE TO RELATED APPLICATIONS
  • This application is a Continuation of U.S. patent application Ser. No. 10/768,527, filed Jan. 30, 2004, entitled “Concurrency Control For B-Trees With Node Deletion,” which is incorporated herein by reference in its entirety.
  • FIELD OF THE INVENTION
  • This invention relates to methods and systems for data storage. More particularly, this invention relates to methods and systems for utilizing B-trees for storing data, maintaining a robust and simple data structure and allowing high concurrency of access.
  • BACKGROUND OF THE INVENTION
  • A B-tree is a tree structure which stores data, and allows operations to find, delete, insert, and browse the data. Each data record stored in a B-tree has an associated key. In order to be used for a B-tree, these keys must be orderable according to a predetermined function. For example, the keys may be numeric, in which case the ordering may be from least to greatest. As another example, the keys may be names, in which case the ordering may be alphabetical.
  • A B-tree is height-balanced, so all leaves are at the same level of the tree. Insertions and deletions of records to the B-tree are managed so that the height-balanced property of the B-tree is maintained. The insertion of a new data record may require the split of a node into two nodes; a deletion may require the deletion of a node. Insertion and deletion procedures must maintain the properties of the B-tree (e.g. height balance) in order to ensure that they result in valid B-trees.
  • Each B-tree leaf contains one or more of the stored records in one of a disjoint set of ranges of key values, while each index node (non-leaf node) of a B-tree provides access to a range of key values stored in one or more adjacent key ranges contained in data nodes. Each index node of the B-tree stores, for each of its child nodes, an ordered pair consisting of a key value within the range and a pointer to the child node. The key values break the range of key values represented by the node into sub-ranges, and the pointers point to a leaf within the sub-range (if the index node is one level above the leaf level) or to an index node corresponding to that sub-range.
  • FIG. 1 is a block diagram of an exemplary subtree in a B-tree data structure. As shown in FIG. 1, a sub-tree of a B-tree contains leaves 1010 storing records with the keys shown in those leaves 1010. Leaf nodes are also known as data nodes. Index node 1000 corresponds to the range between 21 and 133. Index node 1000 contains three ordered pairs (index pairs). The first ordered pair contains the key value 21 and first pointer 1020 of index node 1000, which points to index node 1025. A second ordered pair contains the key value 49 and second pointer 1030. This indicates that the pointer in the first ordered pair should be followed to reach any record with a key greater than or equal to 21 (the key value in the first pair) and less than 49 (the key value in the second pair). The key value in the second ordered pair, along with the key value of 93 in the third ordered pair, indicates that any record with a key greater than or equal to 49 and less than 93 will be found in the sub-tree whose root is index node 1035. The third ordered pair, containing third pointer 1040, indicates that any record with a key greater than or equal to 93 will be found in the sub-tree whose root is index node 1045.
  • It can be seen that an index node will have as many ordered pairs of <key, pointer> as it has child nodes. The range represented by each index node need not be explicitly stored in the index node. In the sub-tree of FIG. 1, node 1035 corresponds to the range of key values v where 49 ≤ v ≤ 93, though this range is not explicitly stored in node 1035 in the example. Any search for key values in the range 49 ≤ v ≤ 93, though, will reach node 1035. In addition to being height-balanced, another B-tree constraint concerns the number of nodes which can exist below a given node, which is determined by the order assigned to the B-tree. When an additional node is being added below a parent node which already has the maximum number of nodes, the result would violate this constraint. In practice, the order of a B-tree is determined dynamically, when a node of the tree fills up. In this case, a node split occurs, as described below.
  • To search a B-tree for a record, the search begins at the root node and follows pointers from node to node based on the key value for the record being sought, descending down the tree, until either the key is located or the search fails because a leaf node is reached which does not contain a record with the key being searched for. For example, if the record with key value 113 is being sought, when index node 1000 is reached, the key values are consulted. Since the key value being sought is greater than the key value in the rightmost pair in node 1000, the pointer 1040 from that pair is followed. Node 1045 is reached. When the key values are consulted, it can be seen that pointer 1048 should be followed to find any record with a key value 109≦v≦122. This pointer 1048 leads to the appropriate leaf from leaves 1010 which contains the record for the specified key value. If a record was searched for with a key value of 112, the search would end in the same location, but because no record is found with that key value in the leaf node, the search would return an unsuccessful result.
  • When a node has the maximum number of key values (when there is not sufficient space for any additional index term or data record), if a new key value must be inserted into the range covered by the node, the node will be split. In order to ensure that concurrent accesses are not reading data from the node during the split, it is necessary to deny concurrent access to the node being changed. Because two nodes will now hold the information previously held by the node being split, an additional link is necessary in the parent node of the node being split. Concurrent accesses to that parent node must therefore be denied while the parent is updated. If the addition of a new key value and pointer in the parent node will overfill the parent node, the parent node will be split as well. It can be seen that node insertions may cause splits recursively up the B-tree. This may require that a node high in the tree be locked while nodes much further down in the tree are being split, and while the split slowly propagates its way up to the locked node. This greatly impairs concurrent access to the tree. The necessity for a number of locks or latches to prevent concurrent accesses to nodes being changed slows access to the information stored in the B-tree by limiting concurrent access.
  • A Blink-tree is a modification of the B-tree which addresses this issue. FIG. 2 is a block diagram of an exemplary subtree in a Blink-tree data structure. Each non-leaf node contains an additional ordered pair, a side pair, including a side key value and a pointer (termed the “side pointer”) which points to the next node at the same level of the tree as the current node. The side pointer of the rightmost node on a level is a null pointer. Thus, as shown in FIG. 2, the subtree of a B-tree shown in FIG. 1 may be converted into a subtree of a Blink-tree with the addition of side pairs 1107, 1127, 1137, and 1147. The side pointer from side pair 1147 is null because it is the side pointer of a rightmost node on a level. The side pointer from side pair 1107 is also shown as null; this could indicate that node 1000 is the root node or that it is the rightmost node on a level. The side key value indicates the lowest key value found in the next node at the same level of the tree. Therefore, the range of values in a node may be seen by examining the index term for the node in its parent node (which is the lower bound and is included in the range) and the side key value (which is the upper bound but is not included in the range). The purpose of the side pointer is to provide an additional method for reaching a node. Each leaf node also contains a side pointer which points at the next leaf node, such as side pointer 1117.
  • One benefit of using these side pointers is to enable highly concurrent operation by allowing a split to occur as a series of atomic actions, each involving only one level of the tree. With Blink-trees, a split of a full node divides the contents of the full node between two nodes (one atomic action), and then posts a new index term to the parent (a second atomic action). This avoids the situation in which multiple levels of the tree are involved in a single atomic action. Suppose a node has been split, with the lefthand node replacing the node which was split, while a search is concurrently being performed for a key value in the range covered by that node. The tree can still be traversed to find the data even if no index term for the righthand node of the new pair has yet been inserted into the parent. In such a case, the parent node will point to the lefthand node, and if the data is not found in the lefthand node, the side pointer of the lefthand node provides access to the righthand node. Thus a node split need not be a single atomic operation with the parent and child nodes both inaccessible until the split is completed.
  • In B-trees and Blink-trees, latches are used in order to provide mutual exclusion when a node split or node deletion is occurring. A latch is a low-cost, usually short-duration lock, one which does not include deadlock control. Hence, it is not necessary to access a lock manager in order to acquire or release a latch. Latches are therefore more lightweight than locks; they typically require only tens of instructions, not hundreds like locks. They prevent access of incorrect or outdated data during concurrent access of the data structure by allowing only an updater holding the latch to use the resource that has been latched.
  • Because no deadlock control exists for latches, a partial ordering is imposed on latches. The holder of a latch on a parent node may request the latch for a child node of that parent node; latches can propagate downward. However, the holder of a latch on a child node cannot request the latch for the parent without first releasing its latch on the child; latches do not propagate upward. In this way, the deadlock situation in which the holder of a latch for parent node A is requesting a latch for child node B at the same time that the holder of a latch for child node B requests a latch for parent node A is avoided. In a standard B-tree, the latch must be maintained for the node being updated, and for the parent of that node (and possibly for multiple ancestors up the tree, perhaps even to the root), so the pointers and key values in the parent can be modified to reflect the change. If the latch is not maintained for the parent, the tree can become inconsistent. The latches must typically be maintained for all the nodes on the path to a leaf node that may need to be updated because of a node split at the leaf.
  • In a Blink-tree, however, a latch is not required on the parent node (and any further ancestors) while the child node is being split. As described above, where the child node has been latched for the node split, the parent latch need not be held during the child node split, while the new nodes have been created but the parent node for these new nodes has not yet been updated. A node split therefore need not be an atomic operation that includes posting the index term to the parent, but can be divided into two parts (“half splits”), the first “half split” where a child node is split, moving some data from an old node to a new node, and setting up a side link from the old node to the new node. After such a “half split” the Blink-tree will be well formed. A subsequent second “half split” posts an index term to the parent node.
  • However, there is a risk that several changes (node deletes, described below, and splits) will occur, and that when the parent node is changed to reflect the new child node, that child node will no longer exist. Guarding against this requires that the existence of the child node be re-verified, which requires re-visiting the left-hand (originally full) node and ensuring that the side pointer for that node still references the right-hand (new) node. Additionally, when a node split occurs, the path to the node being split is remembered. There is a risk that when the key value and pointer for the split are to be added to the remembered parent node, that parent node no longer exists because it may have been deleted. Guarding against this requires a tree re-traversal, which is resource intensive. Thus, the prior art methods of Blink-tree node splitting incur extra execution costs, which in turn limit concurrency and throughput, and increase the complexity of the implementation.
  • SUMMARY OF THE INVENTION
  • A Blink-tree data structure, method, and system are presented which include the advantages of B-tree data structures and conventional Blink-tree data structures, yet allow highly concurrent access to the data and deal robustly with node deletion.
  • In order to do this, the “delete state” is tracked for a Blink-tree data structure. This delete state is guaranteed to capture when a node among some set of nodes has been deleted. Thus, the absence of state indicating that any node among the set of nodes has been deleted ensures that some specific node in that set has not been deleted. Two delete states are tracked to deal separately with the two complications resulting from node deletes: (i) a parent to which an index term is scheduled to be posted may have been deleted; (ii) a new node for which an index term is scheduled to be posted may have been deleted. By tracking delete states, Blink-tree node split operations avoid tree re-traversals to find the parent node to be updated; and they avoid having to verify that a newly created node whose index term is scheduled to be posted still exists. Additionally, split operations are divided into two atomic operations, and the second atomic operation does not need to be completed for the tree to be used. The two atomic operations allow for high concurrency, and the tolerance for “lazy” scheduling of the second atomic operation (index term posting) is a simple solution which allows for easy implementation and coherence.
  • For a target node to be split, first, the side pointer and a portion of the stored data are moved to a new index node and the side pointer of the target node is set to point to the new index node. Then, a post operation is queued. When this operation is performed, the information regarding the new index node is posted to the parent node. Should a node delete be detected that might require re-traversing the tree to find a parent, or re-verifying that a new node still exists, the index term posting half of the node split operation is terminated, thus avoiding making it more complex and expensive. Such incomplete postings of index terms are completed when it is detected that the index term is missing in a subsequent traversal of the tree.
  • Other features of the invention are described below.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • The foregoing summary, as well as the following detailed description of preferred embodiments, is better understood when read in conjunction with the appended drawings. For the purpose of illustrating the invention, there is shown in the drawings exemplary constructions of the invention; however, the invention is not limited to the specific methods and instrumentalities disclosed. In the drawings:
  • FIG. 1 is a block diagram of an exemplary subtree in a B-tree data structure;
  • FIG. 2 is a block diagram of an exemplary subtree in a Blink-tree data structure;
  • FIG. 3 is a block diagram of an exemplary computing environment in which aspects of the invention may be implemented;
  • FIG. 4 is a block diagram of an exemplary modified Blink-tree data structure according to one embodiment of the invention;
  • FIG. 5 is a flow diagram of the use of the modified Blink-tree according to one embodiment of the invention;
  • FIG. 6 is a flow diagram of a split node operation according to one embodiment; and
  • FIG. 7 is a flow diagram of a delete node operation according to one embodiment.
  • DETAILED DESCRIPTION OF ILLUSTRATIVE EMBODIMENTS
  • Overview
  • The first tracked delete state, the index delete state (DX), indicates whether it is safe to directly access a parent node (hence an index node, not a data node) without re-traversing the B-tree. DX contains this information for all nodes above the leaf level, and in one embodiment it is maintained outside of the tree since any index node may be deleted. DX(nodeA) can be consulted to determine that index nodeA may have been deleted, or that index nodeA cannot have been deleted. In one embodiment, DX(n) is a binary function over all index nodes n in the B-tree, with one possible value indicating that the node cannot have been deleted, and the other possible value indicating that the node may have been deleted. In another embodiment, DX is a counter which is incremented when an index node has been deleted, so that a change in DX indicates that a node delete has occurred, while no change means that no nodes have been deleted since the earlier inspection of DX.
  • The second tracked delete state, the data delete state (DD), indicates whether it is safe to post an index term for a leaf node that resulted from a data node split. Since the parent of the leaf node resulting from a split is accessed in any event to post the index term, the DD state can be stored in the parent and, where each node is assigned to a disk page, consulted without incurring any extra I/O to access the page. Leaf node deletes are much more common than index node deletes, and so there is real value in localizing leaf node deletes to a sub-tree without requiring additional latching. A DD state is maintained in each level 1 node (nodes which are parents of leaf nodes). In one embodiment, DD is a binary function over all leaf nodes, and DD(nodeB) returns one value if leaf nodeB may have been deleted, and another if leaf nodeB cannot have been deleted. For index nodes which are higher up in the tree, DX is used for this verification.
  • In one embodiment, for both delete state tests, the value indicating that the node may have been deleted may be returned even when the node being asked about has not been deleted. In one embodiment, if this value is returned, further activity on the node is abandoned or postponed. No index term will be posted. The absence of a posted index term will be re-discovered when a Blink-tree traversal is required to include a side link traversal. The abandonment of the posting of the index term when delete states indicate that a node may have been deleted allows concurrency to be accomplished in a simple manner. The tree will always allow searches to be executed correctly, and where an index node posting is abandoned due to the delete state, subsequent actions will allow the missing index node posting to be discovered and requeued. This allows the node split to be simple and avoids retraversals, and yet allows for a way for such node postings to be detected and performed later.
  • In another embodiment, the tree is re-traversed when the node may have been deleted. This may cause a delay while the presence of the node is found (or while the correct node which is the parent of the node being split or deleted is found). However, deletions and splits of nodes will still occur correctly, and will enable index terms to be propagated up the tree correctly despite this delay.
  • Because a Blink-tree is used, the tree remains search correct even when index terms are missing. Since the delete state need only be checked during structure modifications, normal Blink-tree operations can be almost completely unaffected.
  • In one embodiment, delete states are maintained as binary functions for each node. In another embodiment, delete states are maintained as counters for a group of nodes, which are updated when a node is deleted.
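  • As a rough illustration of these two embodiments (the class and member names below are assumptions of this sketch, not of the disclosure), the binary form answers a question about a single node, while the counter form answers a question about a whole group of nodes since the last time the counter was read:

    class BinaryDeleteState:
        """One flag per node: False means the node cannot have been deleted."""
        def __init__(self):
            self._maybe_deleted = {}            # node id -> bool

        def mark_deleted(self, node_id):
            self._maybe_deleted[node_id] = True

        def may_have_been_deleted(self, node_id):
            return self._maybe_deleted.get(node_id, False)


    class CounterDeleteState:
        """One counter per group of nodes: any change since the counter was
        remembered means some node in the group may have been deleted."""
        def __init__(self):
            self.value = 0

        def note_delete(self):
            self.value += 1

        def changed_since(self, remembered_value):
            return self.value != remembered_value


    dx = CounterDeleteState()
    remembered = dx.value          # remembered when a structure change is enqueued
    dx.note_delete()               # an index node delete occurs in the meantime
    assert dx.changed_since(remembered)   # the queued change must be abandoned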
  • Exemplary Computing Environment
  • FIG. 3 shows an exemplary computing environment in which aspects of the invention may be implemented. The computing system environment 100 is only one example of a suitable computing environment and is not intended to suggest any limitation as to the scope of use or functionality of the invention. Neither should the computing environment 100 be interpreted as having any dependency or requirement relating to any one or combination of components illustrated in the exemplary operating environment 100.
  • The invention is operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well known computing systems, environments, and/or configurations that may be suitable for use with the invention include, but are not limited to, personal computers, server computers, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, embedded systems, distributed computing environments that include any of the above systems or devices, and the like.
  • The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network or other data transmission medium. In a distributed computing environment, program modules and other data may be located in both local and remote computer storage media including memory storage devices.
  • With reference to FIG. 3, an exemplary system for implementing the invention includes a general purpose computing device in the form of a computer 110. Components of computer 110 may include, but are not limited to, a processing unit 120, a system memory 130, and a system bus 121 that couples various system components including the system memory to the processing unit 120. The processing unit 120 may represent multiple logical processing units such as those supported on a multi-threaded processor. The system bus 121 may be any of several types of bus structures including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA (EISA) bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus (also known as Mezzanine bus). The system bus 121 may also be implemented as a point-to-point connection, switching fabric, or the like, among the communicating devices.
  • Computer 110 typically includes a variety of computer readable media. Computer readable media can be any available media that can be accessed by computer 110 and includes both volatile and nonvolatile media, removable and non-removable media. By way of example, and not limitation, computer readable media may comprise computer storage media and communication media. Computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CDROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can accessed by computer 110. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above should also be included within the scope of computer readable media.
  • The system memory 130 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 131 and random access memory (RAM) 132. A basic input/output system 133 (BIOS), containing the basic routines that help to transfer information between elements within computer 110, such as during start-up, is typically stored in ROM 131. RAM 132 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 120. By way of example, and not limitation, FIG. 3 illustrates operating system 134, application programs 135, other program modules 136, and program data 137.
  • The computer 110 may also include other removable/non-removable, volatile/nonvolatile computer storage media. By way of example only, FIG. 3 illustrates a hard disk drive 141 that reads from or writes to non-removable, nonvolatile magnetic media, a magnetic disk drive 151 that reads from or writes to a removable, nonvolatile magnetic disk 152, and an optical disk drive 155 that reads from or writes to a removable, nonvolatile optical disk 156, such as a CD ROM or other optical media. Other removable/non-removable, volatile/nonvolatile computer storage media that can be used in the exemplary operating environment include, but are not limited to, magnetic tape cassettes, flash memory cards, digital versatile disks, digital video tape, solid state RAM, solid state ROM, and the like. The hard disk drive 141 is typically connected to the system bus 121 through a non-removable memory interface such as interface 140, and magnetic disk drive 151 and optical disk drive 155 are typically connected to the system bus 121 by a removable memory interface, such as interface 150.
  • The drives and their associated computer storage media discussed above and illustrated in FIG. 3, provide storage of computer readable instructions, data structures, program modules and other data for the computer 110. In FIG. 3, for example, hard disk drive 141 is illustrated as storing operating system 144, application programs 145, other program modules 146, and program data 147. Note that these components can either be the same as or different from operating system 134, application programs 135, other program modules 136, and program data 137. Operating system 144, application programs 145, other program modules 146, and program data 147 are given different numbers here to illustrate that, at a minimum, they are different copies. A user may enter commands and information into the computer 110 through input devices such as a keyboard 162 and pointing device 161, commonly referred to as a mouse, trackball or touch pad. Other input devices (not shown) may include a microphone, joystick, game pad, satellite dish, scanner, or the like. These and other input devices are often connected to the processing unit 120 through a user input interface 160 that is coupled to the system bus, but may be connected by other interface and bus structures, such as a parallel port, game port or a universal serial bus (USB). A monitor 191 or other type of display device is also connected to the system bus 121 via an interface, such as a video interface 190. In addition to the monitor, computers may also include other peripheral output devices such as speakers 197 and printer 196, which may be connected through an output peripheral interface 195.
  • The computer 110 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 180. The remote computer 180 may be a personal computer, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 110, although only a memory storage device 181 has been illustrated in FIG. 3. The logical connections depicted in FIG. 3 include a local area network (LAN) 171 and a wide area network (WAN) 173, but may also include other networks. Such networking environments are commonplace in offices, enterprise-wide computer networks, intranets and the Internet.
  • When used in a LAN networking environment, the computer 110 is connected to the LAN 171 through a network interface or adapter 170. When used in a WAN networking environment, the computer 110 typically includes a modem 172 or other means for establishing communications over the WAN 173, such as the Internet. The modem 172, which may be internal or external, may be connected to the system bus 121 via the user input interface 160, or other appropriate mechanism. In a networked environment, program modules depicted relative to the computer 110, or portions thereof, may be stored in the remote memory storage device. By way of example, and not limitation, FIG. 3 illustrates remote application programs 185 as residing on memory device 181. It will be appreciated that the network connections shown are exemplary and other means of establishing a communications link between the computers may be used.
  • Blink-tree with Delete States
  • A Blink-tree according to one embodiment of the invention is shown in FIG. 4. The root node of the Blink-tree 400 contains three ordered pairs including pointers 420, 430, and 440 to index nodes 425, 435, and 445. These index nodes point to leaf nodes 410 a through 410 j, also known as data nodes. Each data node contains data values and a side pointer which points to the next data node in sequence. The root node also contains a side pair 407, and index nodes 425, 435, and 445 contain side pairs 427, 437, and 447, respectively. Additionally, an index delete state DX 470 is stored, which, for each index node, indicates whether the index node has not been deleted, or whether it may have been deleted. In index nodes 425, 435, and 445, a data delete state DD (427, 437, and 447, respectively) is stored. This data delete state stores the delete state for each of the child leaf nodes of an index node which is just above the leaf node level. For example, for index node 445, data delete state DD 447 tracks the delete state of the leaf nodes 410 g, 410 h, 410 i and 410 j. As will be described later, events may occur that may indicate that one of these nodes has been deleted. In such situations, the state for that leaf node in data delete state DD 447 will be changed to indicate that it may have been deleted.
  • As shown in FIG. 5, at least one delete state associated with one or more of the nodes of the Blink-tree is stored, step 510. A data record operation (adding, modifying, and deleting data records) is performed, step 520. Additionally, delete states are updated if necessitated by the data record operation, step 530.
  • According to one embodiment of the invention, the normal data record operations of reading a record, range reading records, inserting a record, updating a record, and deleting a record are supported. To implement one embodiment of the invention, latches come in multiple modes: share, update, and exclusive. These latch modes support different levels of concurrent access.
  • An exclusive latch on a resource prohibits other latches from being obtained for the resource, and can only be obtained if no other latch is held on the resource. No other latches are allowed if an exclusive latch is held. A share latch is compatible with other share latches, and with another type of latch known as an update latch. Share latches allow reading of the resource but not modification of it. Only one update latch may be held on a resource at one time, though share latches may be held on it concurrently; update latches are used to allow a user to upgrade the latch to an exclusive latch without relinquishing the update latch. Users with a share latch may not upgrade to an exclusive latch without first releasing control of the resource.
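  • A minimal sketch of such a latch is given below, assuming the compatibility rules just described (share/share and share/update compatible, exclusive incompatible with everything else). This is an illustrative simplification and not the disclosed implementation; in particular, new share requests here can delay an upgrade, which a production latch manager would prevent:

    import threading

    class Latch:
        def __init__(self):
            self._cv = threading.Condition()
            self._shares = 0                 # number of share holders
            self._update_held = False        # at most one update holder
            self._exclusive_held = False     # at most one exclusive holder

        def acquire_share(self):
            with self._cv:
                while self._exclusive_held:
                    self._cv.wait()
                self._shares += 1

        def acquire_update(self):
            with self._cv:
                while self._exclusive_held or self._update_held:
                    self._cv.wait()
                self._update_held = True

        def upgrade_to_exclusive(self):
            # Called only by the update holder; waits for share holders to drain,
            # then converts the update latch into an exclusive latch.
            with self._cv:
                while self._shares > 0:
                    self._cv.wait()
                self._update_held = False
                self._exclusive_held = True

        def acquire_exclusive(self):
            with self._cv:
                while self._exclusive_held or self._update_held or self._shares:
                    self._cv.wait()
                self._exclusive_held = True

        def release_share(self):
            with self._cv:
                self._shares -= 1
                self._cv.notify_all()

        def release_update(self):
            with self._cv:
                self._update_held = False
                self._cv.notify_all()

        def release_exclusive(self):
            with self._cv:
                self._exclusive_held = False
                self._cv.notify_all()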
  • Because adding records and deleting records may require changes in the structure of the tree (deletion or addition of index nodes), node deletion and node split operations may be necessitated by the addition or deletion of records and the structural requirements of the Blink-tree.
  • Tree Functionality
  • To use the tree, according to one embodiment of the invention, several functions/operations are supported. These include:
  • Tree Traversal—traversal of a tree to find a desired node. This node may be a leaf node or an internal node. This is used to find a node for a user (for example for a record lookup) and also to provide traversal for other operations;
  • Node Update—inserting, modifying, or deleting information in a node;
  • Node Split—splitting a node into two separate nodes, in order to accommodate more information at that point in the tree than can fit into one node. According to one embodiment of the invention, this is performed in two distinct “half-split” operations, with the second half-split capable of being queued to be performed later (or abandoned, if the tree has changed too much to make it a simple change);
  • Access Parent—used to access a parent node and to check if a deletion may have occurred, in order to allow simplicity in splitting nodes and realize the efficiency gain from keeping the delete states;
  • Post Index Term—used to post an index term to a node, in order to perform the second part of a node split operation; and
  • Delete Node—used to delete a node, in order to consolidate index nodes with low occupancy.
  • A more complete description of these operations, according to one embodiment, is supplied below.
  • Tree Traversal
  • In order to perform operations on the tree, the tree must be traversed to find the desired node. The node may be a leaf node or an intermediate node. Because nodes may be split or deleted, and a parent node may not contain index terms for all its child nodes, traversals may occur which utilize side pointers rather than pointers which appear as part of an index pair.
  • With reference again to FIG. 4, as an example of tree traversal, if, for example, the tree is a sub-tree which has been reached during a tree traversal to find a record with a key value of 80, a share latch is obtained for node 400. Node 400 is then consulted, and pointer 430 followed. When node 435 is reached, it will in turn be latched in share mode. At that point, the latch on node 400 is released. A read node operation will be run on node 435 to find the entry that points to the leaf node that is the home for a record with key value of 80. In this case, this is 410 e. Node 410 e is latched in share mode, and then the latch on node 435 is released. Node 410 e is then searched for the record with key value of 80. This sequence of latching a node before unlatching its parent in the path is called latch coupling. During tree traversal, it may be the case that a side pointer is followed rather than an index pointer. If this is the case, the node pointed to by the side pointer is not referenced by an index term in the node which should be its parent node. An index term posting is therefore scheduled. Additionally, underutilized nodes may be discovered during tree traversal, and such underutilized nodes should be deleted. Range reading is also possible, to traverse the tree and return any records with key values in the specified range.
  • The tree traversal operation, in one embodiment, proceeds according to the following pseudocode, where tree traversal begins with nodeA (which has been latched), the key value being searched for is K, and the requested level of the node to be found is L:
  • Traverse (nodeA, K, L)
  • 1. Search nodeA for the correct entry entryA for K;
  • 2. Latch the node (nodeB) referenced by the correct entry for K
  • 3. Release the latch on nodeA
  • 4. If entryA was a side pointer rather than an index pointer, enqueue a post index term action to post the pair <entryA, nodeB> to nodeA
  • 5. If nodeA is underutilized, enqueue a node deletion action for nodeA
  • 6. If nodeB is a higher level than has been requested, or if nodeB is a sibling node of nodeA, then recursively perform a tree traversal on nodeB for key K at requested level L; otherwise, return nodeB.
  • In one embodiment, the tree traversal also receives data as to the requested type of latch to be used.
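  • A single-threaded Python sketch of this traversal logic is shown below; latch coupling is omitted for brevity, and the node layout, the work queue format, and the helper names are assumptions of the sketch rather than part of the disclosed embodiments:

    from bisect import bisect_right
    from dataclasses import dataclass, field

    @dataclass
    class Node:
        level: int                                    # 0 denotes a leaf node
        pairs: list = field(default_factory=list)     # sorted (key, child or record)
        side_key: float = float("inf")                # low key of the right sibling
        side: "Node" = None                           # side pointer

    work_queue = []                                   # deferred structure changes

    def traverse(node, key, requested_level):
        while True:
            if node.side is not None and key >= node.side_key:
                # A side pointer is being followed, so the sibling has no index
                # term in the parent; schedule a post index term action.
                work_queue.append(("post index term", node.side_key, node.side))
                node = node.side
                continue
            if node.level == requested_level:
                return node
            keys = [k for k, _ in node.pairs]
            node = node.pairs[max(bisect_right(keys, key) - 1, 0)][1]

    # A two-level example: one index node over two leaves joined by a side link.
    left_leaf = Node(level=0, pairs=[(21, "record 21")], side_key=49)
    right_leaf = Node(level=0, pairs=[(49, "record 49")])
    left_leaf.side = right_leaf
    root = Node(level=1, pairs=[(21, left_leaf)])     # index term for right_leaf missing
    assert traverse(root, 49, 0) is right_leaf        # found via the side pointer
    assert work_queue                                 # and a posting was scheduled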
  • Reading Records
  • In order to read a record with a given key, the tree is traversed to find the leaf node which will contain the desired key if it exists. When tree traversal is being done to perform a read operation, a share latch is obtained for each node in the path to the leaf node as the tree is being traversed. Once the appropriate leaf node is returned, a read operation is performed.
  • Thus a read node operation can, in one embodiment, be represented by the following pseudocode, where “Traverse” is the tree traversal operation detailed above and “root” is the root node of the tree:
  • Read Node (Key value)
  • 1. Perform Traverse (root, Key Value, leaf)
  • 2. Set Leaf Node equal to the node returned from the Traverse
  • 3. Read (Leaf Node, Key Value)
  • Update (for Insertion, Modification, or Deletion of a Node)
  • An update to a node is similar to the reading of a node, except that the latch obtained before the update must be exclusive, rather than shared. Thus, an update of a node consists of a tree traverse to the node followed by an insert, modify or delete of the node via an update operation. During the traversal, nodes are latched in share mode until a leaf node for the record to be inserted, updated, or deleted is reached. The leaf node is then latched with an update latch, and the update latch is upgraded to an exclusive latch once it is confirmed that the node accessed is the desired node (finding which may require further side traversal). The leaf node is returned from the tree traversal latched exclusively, and an update operation may then be performed on the leaf node.
  • If the operation necessitates the insertion of an entry into a node which is at maximum capacity, a split node action for the maximum capacity node is undertaken, and then the insert is retried. As in the case of a read record, if at the completion of the update node operation the node is found to be under-utilized, a delete node operation is enqueued for that node.
  • The pseudocode for an insert, modify, or delete in one embodiment is as follows, where Action represents the action to be performed on the target node and Action Information represents information needed for the action (for example, the record to be inserted where the Action is an insertion). Update is a function which inserts, modifies or deletes data in the node according to the specified action. In the case of an insert or of certain updates, the node may not have room for the change; if so, the update fails, a split node operation is attempted, and the action is retried:
  • Update Node (Key Value, Level, Action, Action Information)
  • 1. Perform Traverse (root, Key Value, Level)
  • 2. Set TargetNode equal to the node returned from the Traverse
  • 3. Update (TargetNode, Action, Action Information)
  • 4. If Update fails due to TargetNode being full, then perform a Split Node (TargetNode) and then retry the Update action
  • 5. If TargetNode is underutilized, enqueue (Delete Node (TargetNode))
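  • A simplified sketch of this update flow appears below. The capacity constants, the list-based node, and the return convention are illustrative assumptions; traversal and latching are elided:

    MAX_PAIRS = 4      # assumed node capacity
    MIN_PAIRS = 2      # assumed under-utilization threshold

    def update_node(pairs, action, key, record=None, work_queue=None):
        """pairs: the sorted (key, record) list of the latched target node."""
        if action == "insert":
            if len(pairs) >= MAX_PAIRS:
                return "needs split"        # caller performs Split Node, then retries
            pairs.append((key, record))
            pairs.sort()
        elif action == "delete":
            pairs[:] = [(k, r) for k, r in pairs if k != key]
        if work_queue is not None and len(pairs) < MIN_PAIRS:
            work_queue.append(("delete node", pairs))   # lazy consolidation
        return "ok"

    queue = []
    leaf = [(1, "a"), (2, "b")]
    assert update_node(leaf, "delete", 2, work_queue=queue) == "ok"
    assert queue and queue[0][0] == "delete node"       # node is now under-utilized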
  • Split Node Operations
  • In order to perform a split node operation, two distinct half split operations are performed. The first half split operation involves moving the high range contents of the target node being split to a new node. The side pair is also moved to the new node. The side pair in the target node is changed to point to the new node, and also contains the low key value for the new node. When this first half split operation is completed, all the data is accessible and can be found during a tree traversal. The first half split operation is not enqueued to be performed at some future time, because split node operations create room for adding data to a node; if this is not done promptly, the update operation which called for the split node will need to wait or be aborted. This first half split operation can be done immediately, however, because the exclusive latch on the node will already be held by the calling operation. A latch need not be held on the parent of the target node being split, only on the target node itself. No latch is needed on the new node, as others cannot access the new node; the only path to the new node is via the original target node, on which an exclusive latch is already held.
  • The second half split operation involves the posting of an index term and index pointer for the new node in the parent node. This is done by enqueueing this index posting on the queue of work.
  • In order to perform an index posting, an Access Parent operation is performed. An Access Parent operation is used to access a parent node when a split or delete has occurred to one of its children. The Access Parent operation accesses the parent of a node so that the index term for a node can be inserted or deleted or so that a deletion can occur. It is given the remembered parent node address (RememberedParent) of the node (Node) and the type of action being performed, and returns with the latched parent node or an error if the parent may have been deleted. The Access Parent operation can be described, in one embodiment, with the following pseudocode:
  • 1. Latch DX in share mode if the call is for post index term operation, in exclusive mode if the call is for a delete node operation.
  • 2. If test of DX shows delete has occurred, release DX latch and return error.
  • 3. If the parent access is for an index node deletion, update DX.
  • 4. Latch node requested (RememberedParent) and release DX latch.
  • 5. Use Traverse to find the parent for the given Node. Use the results of this traversal to check whether RememberedParent continues as the parent or whether the parent has split and the real parent is a sibling of the remembered node.
  • 6. If the Access Parent is for a data node deletion, then update DD state.
  • 7a. Else if Access Parent is to post index term for a data node: if DD(node) has changed then release the node latch and return error.
  • 7b. Else if Access Parent is to post an index term for an index node: if DX has changed, then release node latch and return error.
  • 8. Return the parent found in the Traverse step.
  • When an access parent operation is run, it returns with a latch on the parent of the target node that has changed (by being split or deleted), provided that the parent node exists. The parent node being accessed in an access parent operation will be level 1 or higher; that is, it will not be a leaf node. Thus, there will be a delete state stored in DX for the node. Access parent first latches DX with a share latch, and then, if the state stored therein indicates that the parent must exist, the parent node is latched. Once the parent node is latched, it cannot be deleted until it is unlatched.
  • Access parent returns with an error if the delete state of the parent node indicates that the parent node may not exist. In this way, access parent verifies without the necessity of tree traversal that the parent node definitely exists, and only if it does definitely exist is a traversal undertaken.
  • In one embodiment, access parent is also called with an indication of whether it is handling a delete or an index posting due to a split. If access parent is called for the index posting to a parent node of child node information, the delete state of the child node is also checked to ensure that the child node still exists. If it might not exist, access parent returns with an error.
  • For a second half split index posting, if access parent returns with an error, the index posting is not performed. All data in the tree still remains properly accessible through side pointers, and an error returned from access parent in this situation is generally sufficiently rare that the lack of an index posting for the new node is not an issue. In another embodiment, when an access parent returns with an error, the tree is traversed to find the correct parent for the index posting and the index posting is then made.
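  • The check performed by Access Parent under the counter embodiment can be sketched as follows (latching is reduced to comments, and the parameter names and dictionary layout are assumptions of this sketch):

    def access_parent(dx, remembered_dx, remembered_parent, for_index_delete):
        """Return the parent node to update, or None if a delete may have occurred."""
        # Steps 1-2: inspect DX; any change means some index node may have been
        # deleted, so directly reusing the remembered parent is no longer safe.
        if dx["value"] != remembered_dx:
            return None                    # caller aborts (or re-traverses the tree)
        # Step 3: an index node deletion bumps DX so that later direct accesses
        # based on older remembered values will fail this same test.
        if for_index_delete:
            dx["value"] += 1
        # Steps 4-5: latch the remembered parent; a Traverse (omitted here) would
        # confirm it is still the parent or find the sibling that now is.
        return remembered_parent

    dx = {"value": 7}
    assert access_parent(dx, 7, "parent node", for_index_delete=False) == "parent node"
    assert access_parent(dx, 6, "parent node", for_index_delete=False) is None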
  • When the second half split has identified the parent node and has obtained the latch on the parent node, the update node operation is then used to post the index term. This may lead to a split of the parent node; however, such a parent node split will be a separate atomic action, decoupled from the split that caused it.
  • Thus, the split node operation can be described, in one embodiment, with the following pseudocode, where a latch is held on the OriginalNode (the node to be split):
  • [First Half-Split]
  • 1. Allocate new node.
  • 2. Split data between OriginalNode and the new node.
  • 3. Set the new node's side pointer and key space description to the OriginalNode's side pointer and key space description.
  • 4. Set the OriginalNode side pointer to point to the new node, and its key space description to the low key bound of the new node.
  • [Second Half-Split]
  • 5. Enqueue a Post Index Term operation for posting the index term for the new node to the parent of the OriginalNode
  • In Step 1, where a new node is allocated, no latch is required as the node is invisible to the rest of the tree. The first half of the split operation is embodied in steps 1-4, and the second in the operation enqueued in step 5.
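  • The first half-split, with the index term posting merely enqueued, can be sketched as follows (dictionary-based nodes and the queue format are assumptions of the sketch; latching is elided since the caller already holds the exclusive latch on the original node):

    def split_node(original, allocate_node, work_queue):
        """original: dict with 'pairs' (sorted (key, value) tuples), 'side_key', 'side'."""
        new = allocate_node()                        # step 1: invisible, so no latch needed
        middle = len(original["pairs"]) // 2
        split_key = original["pairs"][middle][0]
        new["pairs"] = original["pairs"][middle:]    # step 2: move the high range
        original["pairs"] = original["pairs"][:middle]
        new["side_key"] = original["side_key"]       # step 3: inherit the old side pair
        new["side"] = original["side"]
        original["side_key"] = split_key             # step 4: old node now ends at the
        original["side"] = new                       #         new node's low key bound
        # Step 5 (second half-split): post <split_key, new> to the parent, lazily.
        work_queue.append(("post index term", split_key, new))
        return new

    allocate = lambda: {"pairs": [], "side_key": float("inf"), "side": None}
    node = {"pairs": [(1, "a"), (2, "b"), (3, "c"), (4, "d")],
            "side_key": float("inf"), "side": None}
    queue = []
    right = split_node(node, allocate, queue)
    assert node["side"] is right and node["side_key"] == 3 and len(queue) == 1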
  • The Post Index Term operation can be described, in one embodiment, with the following pseudocode:
  • 1. Access parent of split node via Access Parent. If an error is returned, abort posting.
  • 2. If no error was returned, use Update to update the node with the index term.
  • FIG. 6 is a flow diagram of a split node operation according to one embodiment. In FIG. 6, first the side pointer and a portion of the key value data of the target node are copied to a new node, in step 610. Then the side pointer of the target node is set to point to the new node in step 620. After this, a post index term operation for the new node is added to a work queue in step 630. These steps 610 through 630 are the first half-split.
  • When the post index term operation is taken off the queue to be performed, a check is performed to see if the stored parent node to which an index term is being posted may have been deleted, in decision 640. This is done, in one embodiment, by consulting a delete state. In decision 650, a check is performed to see if the new node for which an index term is being posted may have been deleted. This is done, in one embodiment, by consulting a delete state. In step 660, if neither the parent node nor the new node for which a term is being posted has been deleted, the new index node data is added to the parent node.
  • In one embodiment, if it is possible that the stored parent node has been deleted (in other words, a “yes” answer to decision 640) or that the new node has been deleted (a “yes” answer to decision 650), then step 660 does not occur. The second half-split operation is abandoned. As described above, this abandonment of the posting of the index term when delete states indicate that a deletion may have occurred allows Blink-tree concurrency to be accomplished in a simple manner and avoids costly retraversals, while allowing the resultant “missing” index term postings to be detected and performed at a later time. In an alternate embodiment, a retraversal may be done to ensure that the stored parent node and new node are both still in existence.
  • Delete Node Operations
  • When a node is deleted, the delete node operation is run on the node. This permits the consolidation of index nodes with low occupancy. The delete node operation first calls the access parent operation to access the parent node of the node targeted for deletion. Access parent finds the parent node. Delete state information for the target node will be found in the parent node; this delete state information is updated in the access parent operation. Access parent returns with the parent node latched.
  • The left sibling of the target node is then accessed and latched. The target node is then accessed and latched. The contents of the target node are moved to the left sibling. The target node is then de-allocated and its index term removed from the parent.
  • The steps in Delete Node, in one embodiment, are as follows:
  • 1. Perform Access Parent to find parent of target node. If an error is returned, abort.
  • 2. Remove the index term for the deleted node. This will cause subsequent searches to access the left sibling instead.
  • 3. Retain the latch on the parent while latching the left sibling of the target node. If the parent holds no left sibling into which the target node can be consolidated, abort.
  • 4. Latch the node to be consolidated via a side traversal from its left sibling. If the left sibling's side pointer does not reference the node to be consolidated, abort. Unlatch the parent node.
  • 5. Check whether the target node remains under-utilized, and whether its contents will fit into its left sibling. If so, it will be consolidated, i.e., the target node's data and side pointer are moved to the left sibling. Otherwise return without consolidating.
  • 6. Delete the target node.
  • 7. Unlatch the left sibling and target nodes.
  • 8. If parent is under-utilized, enqueue a Delete Node action for the parent node.
  • 9. Return
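  • Steps 2 through 6 of this consolidation can be sketched as follows; the node layout matches the earlier split sketch, and the Access Parent check and all latching are elided, so this is an illustration of the data movement only:

    def consolidate_into_left_sibling(parent_pairs, left, target):
        """parent_pairs: the parent's (key, child) list; left/target: dict nodes
        with 'pairs', 'side_key' and 'side' entries."""
        # Step 2: remove the target's index term so new searches reach the left sibling.
        parent_pairs[:] = [(k, child) for k, child in parent_pairs if child is not target]
        # Step 4: re-verify the side link before consolidating.
        if left["side"] is not target:
            return False                             # the tree changed; abort
        # Step 5: fold the target's data and side pair into the left sibling.
        left["pairs"].extend(target["pairs"])
        left["side_key"] = target["side_key"]
        left["side"] = target["side"]
        # Step 6: the target node can now be de-allocated.
        target.clear()
        return True

    left = {"pairs": [(1, "a")], "side_key": 5, "side": None}
    gone = {"pairs": [(5, "e")], "side_key": 9, "side": None}
    left["side"] = gone
    parent = [(1, left), (5, gone)]
    assert consolidate_into_left_sibling(parent, left, gone) and left["side_key"] == 9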
  • FIG. 7 is a flow diagram of a delete data node operation according to one embodiment. The delete state DX is latched (step 702) and checked (step 705) and, if the delete state indicates that the parent node has not been deleted, the Blink-tree data structure is traversed, starting at the remembered parent, to find the current parent for the target node, step 710. This parent node is latched in step 720. If the delete is for an index node, the delete state DX is set to indicate that an index node has been deleted, step 723. The delete state DX is unlatched in step 725. Next, the index term for the target node is deleted in step 730. The left sibling of the target node is then accessed and latched in step 740. In step 750, a side traversal is then performed on the left sibling, for example, by following the side pointer of the left sibling. If the target node is found via said side traversal (decision 760), the deletion proceeds. Otherwise, the operation is abandoned. If a data node is being deleted, the delete state (DD) for the node is updated to reflect that it will be deleted in step 765. The parent node can then be unlatched in step 770 and the target node is latched. The target node is then examined to determine if it is under-utilized and if the contents of the target node can be consolidated with the contents of the left sibling in decision 780. If this is the case, the data including the side pointer and key data from the target node is copied to the left sibling in step 790. The target node is then deleted in step 792. The left sibling and target nodes are then unlatched in step 794. In step 796, if the parent node had been found under-utilized, a delete operation is scheduled for the parent.
  • Index Delete State DX
  • As described above, delete state generally stores state information regarding whether a node was definitely not deleted, or whether it may have been deleted. In one embodiment, this is done using a stored binary state for each index node.
  • In another embodiment, delete state is stored using counters. Index delete state DX is maintained as a counter that is incremented whenever an index node is deleted. Before an action is placed on the work queue, the present value of DX is stored along with the action on the queue.
  • When the action is performed, if DX has changed from the remembered index delete state, this is treated as a “may have been deleted” state. While this is a conservative method (because it marks all nodes as “may have been deleted” even if only one was), if deletes are not common it will cause few problems. Because leaf node deletes, which are more common, are tracked separately from index node deletes, DX rarely changes.
  • During an unsuccessful parent access, the new DX counter value is saved so that when the need for the index posting is detected again, the more recent DX value is entered with the enqueued action, hence making it possible for this later action to complete successfully. This is done because a change in DX is used to determine whether a node may have been deleted between the time a structure modification is scheduled and the time when it is actually done. For example, because of a sibling traversal, it may be discovered that an index term has not been posted. A new index term posting is scheduled. The latest DX value is stored when the posting is scheduled, and it is compared to the current DX value when the parent node is accessed to post the index term. The new posting thus will only fail if there are further node deletes between the scheduling of the new posting and the access of the parent node to execute the index term posting.
  • If the system crashes, all queued actions are lost. Therefore DX state need not be maintained across a system crash, but can be restarted when the system is brought up again.
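  • The remembered-versus-current comparison can be sketched as follows (the queue record layout and function names are assumptions of this sketch). It also illustrates saving the newer DX value when a posting is abandoned, so that a later re-detected posting can succeed:

    dx = {"value": 0}                      # index delete state, counter embodiment
    work_queue = []

    def enqueue_posting(new_node):
        work_queue.append({"node": new_node, "remembered_dx": dx["value"]})

    def run_posting(action):
        if dx["value"] != action["remembered_dx"]:
            # Some index node was deleted in the meantime: abandon the posting,
            # but save the newer DX so a re-detected posting can complete later.
            action["remembered_dx"] = dx["value"]
            return "abandoned"
        return "posted"                    # Access Parent and Update would run here

    enqueue_posting("new node A")
    dx["value"] += 1                       # an index node delete occurs
    assert run_posting(work_queue[0]) == "abandoned"
    assert run_posting(work_queue[0]) == "posted"    # retried with the newer DX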
  • Data Delete State DD
  • The data delete state is used to determine whether leaf nodes that are immediate descendants of a lowest level index node may have been deleted. If not, a new node resulting from a split will not have been deleted, and the index term for that new node may be posted in that lowest level index node, without further verification that the node still exists.
  • The DD state describing node deletes among leaf nodes is stored in their parent index node. Some access savings are achieved, since the parent index node is accessed in any event during the posting operation.
  • DD may be maintained as a binary state for each leaf node in a given index node.
  • In another embodiment, as with the DX state, DD is maintained as a counter. Whenever a leaf node is deleted, the parent is latched and accessed in any event, in order to remove the index term for the deleted node. Hence, the update of DD during a leaf node delete occurs with little overhead.
  • If DD for the parent node has changed when we attempt to post an index term for a new leaf node split, then the new node may already have been deleted, and hence no index term posting is required. In that case, we abort the posting.
  • To make this “optimistic” approach work, the prior value for DD is stored when the node is visited in the traversal on the way to a leaf node. No additional latching is required.
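  • A compact sketch of this optimistic DD check is given below, again with an assumed layout in which the counter lives in the parent index node:

    parent = {"pairs": [], "dd": 0}        # DD counter stored in the level 1 index node

    def remember_dd(parent_node):          # read on the way down through the parent
        return parent_node["dd"]

    def on_leaf_delete(parent_node):       # parent is already latched for the delete
        parent_node["dd"] += 1

    def try_post_index_term(parent_node, remembered_dd, index_term):
        if parent_node["dd"] != remembered_dd:
            return False                   # a child leaf (possibly the new node
                                           # itself) may have been deleted; abort
        parent_node["pairs"].append(index_term)
        return True

    seen = remember_dd(parent)
    on_leaf_delete(parent)                 # some sibling leaf is deleted meanwhile
    assert try_post_index_term(parent, seen, (42, "new leaf")) is False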
  • The modifications to the operations listed above to implement the counter version of delete states include:
      • read and remember DX state prior to accessing the Blink-tree;
      • read and remember DX state whenever it is examined;
      • read and remember DD state in the parent of a leaf before accessing a leaf node;
      • include the remembered DD or DX on enqueued structure modifications;
      • latch couple from examining DX to accessing the parent node;
      • latch couple during a traverse instead of holding only a single latch at a time;
      • increment DX state when deleting an index node;
      • increment DD state when deleting a leaf node;
      • compare DX to the remembered DX before accessing the parent;
      • compare DD state to the remembered DD state to verify that a new leaf node from a split still exists; and
      • abandon structure modifications should delete states DX or DD indicate a node delete.
  • It is noted that the foregoing examples have been provided merely for the purpose of explanation and are in no way to be construed as limiting of the present invention. While the invention has been described with reference to various embodiments, it is understood that the words which have been used herein are words of description and illustration, rather than words of limitations. Further, although the invention has been described herein with reference to particular means, materials and embodiments, the invention is not intended to be limited to the particulars disclosed herein; rather, the invention extends to all functionally equivalent structures, methods and uses, such as are within the scope of the appended claims. Those skilled in the art, having the benefit of the teachings of this specification, may effect numerous modifications thereto and changes may be made without departing from the scope and spirit of the invention in its aspects.

Claims (46)

1. A method for storing one or more data records in a Blink-tree data structure comprising at least two nodes, said nodes comprising at least one index node and at least one leaf node, where each of said data records is associated with a key, the method comprising:
storing at least one delete state, where said at least one delete state is associated with one or more of said nodes;
performing a node operation on a target node from among said nodes; and
updating said at least one delete state based on said node operation.
2. The method of claim 1, where said storing at least one delete state comprises:
storing a delete state associated with each of said nodes;
and where said delete state comprises a binary function in which one possible value of said binary function indicates that said associated node has not been deleted.
3. The method of claim 1, where said storing at least one delete state comprises storing a counter value, where said counter value remaining constant over time indicates that none of said associated nodes have been deleted.
4. The method of claim 1, where said storing at least one delete state comprises:
for each index node of level one, storing one delete state in said index node associated with all of said leaf nodes that are children of said index node.
5. (canceled)
6. The method of claim 1, where said step of performing a node operation on a target node from among said nodes further comprises:
determining if said parent node is under-utilized; and if said parent node is under-utilized, enqueueing a delete node operation for said parent node.
7. (canceled)
8. (canceled)
9. The method of claim 1, where said node operation is an update node operation for a given target node, and where said step of performing a node operation on a target node from among said nodes comprises:
traversing said Blink-tree data structure to find said target node;
attempting an update of said target node;
if said update fails due to said target node being full, performing a split node operation on said target node; and
if split node operation has been performed, retrying said update of said target node.
10. The method of claim 9, where said step of performing a node operation on a target node from among said nodes further comprises:
determining if said target node is under-utilized; and
if said target node is under-utilized, enqueueing a delete node operation for said target node.
11. The method of claim 1, where said node operation is a split node operation for a given target node, and where said step of performing a node operation on a target node from among said nodes comprises:
copying a side pointer of said target node and a portion of key data of said target node to a new target node;
setting said side pointer of said target node to point to said new target node; and
adding a post operation for said target node to a work queue.
12. (canceled)
13. (canceled)
14. (canceled)
15. At least one of an operating system, a computer readable medium having stored thereon a plurality of computer-executable instructions, a co-processing device, a computing device, and a modulated data signal carrying computer executable instructions for performing the method of claim 1.
16. A system for storing one or more data records in a Blink-tree data structure comprising at least two nodes, said nodes comprising at least one index node and at least one leaf node, where each of said data records is associated with a key, the system comprising:
delete state storage for storing at least one delete state, where said at least one delete state is associated with one or more of said nodes;
node operation logic for performing a node operation on a target node from among said nodes; and
delete state logic for updating said at least one delete state based on said node operation.
17. The system of claim 16, where said delete state storage stores a delete state associated with each of said nodes, and where said delete state comprises a binary function in which one possible value of said binary function indicates that said associated node has not been deleted.
18. The system of claim 16, where said delete state storage storing at least one delete state comprises counter value storage for storing a counter value, where said counter value remaining constant over time indicates that none of said associated nodes have been deleted.
19. The system of claim 16, where said delete state storage storing at least one delete state comprises:
index node delete state storage for, for each index node of level one, storing one delete state in said index node associated with all of said leaf nodes that are children of said index node.
20. (canceled)
21. (canceled)
22. (canceled)
23. (canceled)
24. The system of claim 16, where said node operation is an update node operation for a given target node, and where said node operation logic comprises:
traversal logic for traversing said Blink-tree data structure to find said target node;
first update attempt logic for attempting an update of said target node; and
split node logic for, if said update fails due to said target node being full, performing a split node operation on said target node; and
second update attempt logic for, if split node operation has been performed, retrying said update of said target node.
25. The system of claim 24, where said node operation logic further comprises:
target node utilization logic for determining if said target node is under-utilized; and
delete enqueueing logic for, if said target node is under-utilized, enqueueing a delete node operation for said target node.
26. The system of claim 16, where said node operation is a split node operation for a given target node, and where said node operation logic comprises:
target node data copying logic for copying a side pointer of said target node and a portion of key data of said target node to a new target node;
side pointer setting logic for setting said side pointer of said target node to point to said new target node; and
post operation enqueueing logic for adding a post operation for said target node to a work queue.
27. (canceled)
28. (canceled)
29. (canceled)
30. A memory for storing data for access by an application program comprising a data structure stored in said memory, said data structure adapted for storing one or more data records stored in data nodes, where each of said data records is associated with a key, said data structure comprising a tree comprising at least one index node associated with an associated node range, where each index node from among said index nodes comprises:
at least one key value dividing said associated node range into at least two sub-ranges;
at least one child node pointer pointing to another one of said index nodes or one of said data nodes;
at least one side pointer set to a null value or pointing to an index node corresponding to a successive node range; and
a delete state with information regarding whether a data node may have been deleted.
31. The memory of claim 30, where said delete state comprises a counter which is incremented when an action occurs which indicates that a data node may have been deleted.
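Claims 30 and 31 describe the index-node layout itself. A minimal rendering in Python (the field and method names are illustrative, not taken from the specification) pairs separator keys with child pointers, keeps a side pointer to the index node for the successive key range, and carries a delete-state counter that is incremented by any action indicating a child data node may have been deleted.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional

@dataclass
class IndexNode:
    # Separator key values dividing this node's key range into sub-ranges.
    keys: List[Any] = field(default_factory=list)
    # One child pointer per sub-range; a child is another index node or a data node.
    children: List[Any] = field(default_factory=list)
    # Side pointer to the index node covering the successive key range, or None.
    side_pointer: Optional["IndexNode"] = None
    # Delete state: bumped whenever a child data node may have been deleted.
    delete_counter: int = 0

    def child_for(self, key):
        """Follow the child pointer whose sub-range contains `key`."""
        i = 0
        while i < len(self.keys) and key >= self.keys[i]:
            i += 1
        return self.children[i]

    def note_possible_delete(self):
        # Claim 31: increment the counter on any action that indicates a
        # child data node may have been deleted.
        self.delete_counter += 1
```

With n separator keys the node holds n + 1 child pointers, one per sub-range; `child_for` selects the pointer whose sub-range contains the search key.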
32. A computer-readable medium for storing one or more data records in a Blink-tree data structure comprising at least two nodes, said nodes comprising at least one index node and at least one leaf node, where each of said data records is associated with a key, said computer-readable medium comprising computer-executable modules having computer-executable instructions, the modules comprising instructions for performing the steps of:
storing at least one delete state, where said at least one delete state is associated with one or more of said nodes;
performing a node operation on a target node from among said nodes; and
updating said at least one delete state if necessitated by said node operation.
33. The computer-readable medium of claim 32, where said storing at least one delete state comprises:
storing a delete state associated with each of said nodes;
and where said delete state comprises a binary function in which one possible value of said binary function indicates that said associated node has not been deleted.
34. The computer-readable medium of claim 32, where said storing at least one delete state comprises storing a counter value, where said counter value remaining constant over time indicates that none of said associated nodes have been deleted.
35. The computer-readable medium of claim 32, where said storing at least one delete state comprises:
for each index node of level one, storing one delete state in said index node associated with all of said leaf nodes that are children of said index node.
36. (canceled)
37. (canceled)
38. (canceled)
39. (canceled)
40. (canceled)
41. The computer-readable medium of claim 32, where said node operation is an update node operation for a given target node, and where said step of performing a node operation on a target node from among said nodes comprises:
traversing said Blink-tree data structure to find said target node;
attempting an update of said target node;
if said update fails due to said target node being full, performing a split node operation on said target node; and
if a split node operation has been performed, retrying said update of said target node.
42. The computer-readable medium of claim 41, where said step of performing a node operation on a target node from among said nodes further comprises:
determining if said target node is under-utilized; and
if said target node is under-utilized, enqueueing a delete node operation for said target node.
43. The computer-readable medium of claim 32, where said node operation is a split node operation for a given target node, and where said step of performing a node operation on a target node from among said nodes comprises:
copying a side pointer of said target node and a portion of key data of said target node to a new index node;
setting said side pointer of said target node to point to said new index node; and
adding a post operation for said target node to a work queue.
44. (canceled)
45. (canceled)
46. (canceled)
US11/859,597 2004-01-30 2007-09-21 Concurrency control for b-trees with node deletion Abandoned US20080071809A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US11/859,597 US20080071809A1 (en) 2004-01-30 2007-09-21 Concurrency control for b-trees with node deletion

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US10/768,527 US7383276B2 (en) 2004-01-30 2004-01-30 Concurrency control for B-trees with node deletion
US11/859,597 US20080071809A1 (en) 2004-01-30 2007-09-21 Concurrency control for b-trees with node deletion

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US10/768,527 Continuation US7383276B2 (en) 2004-01-30 2004-01-30 Concurrency control for B-trees with node deletion

Publications (1)

Publication Number Publication Date
US20080071809A1 true US20080071809A1 (en) 2008-03-20

Family

ID=34807893

Family Applications (2)

Application Number Title Priority Date Filing Date
US10/768,527 Expired - Fee Related US7383276B2 (en) 2004-01-30 2004-01-30 Concurrency control for B-trees with node deletion
US11/859,597 Abandoned US20080071809A1 (en) 2004-01-30 2007-09-21 Concurrency control for b-trees with node deletion

Family Applications Before (1)

Application Number Title Priority Date Filing Date
US10/768,527 Expired - Fee Related US7383276B2 (en) 2004-01-30 2004-01-30 Concurrency control for B-trees with node deletion

Country Status (1)

Country Link
US (2) US7383276B2 (en)

Families Citing this family (107)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7146524B2 (en) 2001-08-03 2006-12-05 Isilon Systems, Inc. Systems and methods for providing a distributed file system incorporating a virtual hot spare
US7685126B2 (en) 2001-08-03 2010-03-23 Isilon Systems, Inc. System and methods for providing a distributed file system utilizing metadata to track information about data stored throughout the system
US7937421B2 (en) 2002-11-14 2011-05-03 Emc Corporation Systems and methods for restriping files in a distributed file system
US6961733B2 (en) * 2003-03-10 2005-11-01 Unisys Corporation System and method for storing and accessing data in an interlocking trees datastore
US8516004B2 (en) * 2003-09-19 2013-08-20 Unisys Corporation Method for processing K node count fields using an intensity variable
US7340471B2 (en) * 2004-01-16 2008-03-04 Unisys Corporation Saving and restoring an interlocking trees datastore
US8266234B1 (en) 2004-06-11 2012-09-11 Seisint, Inc. System and method for enhancing system reliability using multiple channels and multicast
US7873650B1 (en) * 2004-06-11 2011-01-18 Seisint, Inc. System and method for distributing data in a parallel processing system
US8886677B1 (en) 2004-07-23 2014-11-11 Netlogic Microsystems, Inc. Integrated search engine devices that support LPM search operations using span prefix masks that encode key prefix length
US7725450B1 (en) * 2004-07-23 2010-05-25 Netlogic Microsystems, Inc. Integrated search engine devices having pipelined search and tree maintenance sub-engines therein that maintain search coherence during multi-cycle update operations
US7747599B1 (en) 2004-07-23 2010-06-29 Netlogic Microsystems, Inc. Integrated search engine devices that utilize hierarchical memories containing b-trees and span prefix masks to support longest prefix match search operations
US20060036636A1 (en) * 2004-08-13 2006-02-16 Small Jason K Distributed object-based storage system that uses pointers stored as object attributes for object analysis and monitoring
US7213041B2 (en) 2004-10-05 2007-05-01 Unisys Corporation Saving and restoring an interlocking trees datastore
US7716241B1 (en) 2004-10-27 2010-05-11 Unisys Corporation Storing the repository origin of data inputs within a knowledge store
US7908240B1 (en) 2004-10-28 2011-03-15 Unisys Corporation Facilitated use of column and field data for field record universe in a knowledge store
US8051425B2 (en) 2004-10-29 2011-11-01 Emc Corporation Distributed system with asynchronous execution systems and methods
US8055711B2 (en) 2004-10-29 2011-11-08 Emc Corporation Non-blocking commit protocol systems and methods
US8238350B2 (en) 2004-10-29 2012-08-07 Emc Corporation Message batching with checkpoints systems and methods
US7499932B2 (en) * 2004-11-08 2009-03-03 Unisys Corporation Accessing data in an interlocking trees data structure using an application programming interface
US7676477B1 (en) 2005-10-24 2010-03-09 Unisys Corporation Utilities for deriving values and information from within an interlocking trees data store
US7348980B2 (en) 2004-11-08 2008-03-25 Unisys Corporation Method and apparatus for interface for graphic display of data from a Kstore
US20070162508A1 (en) * 2004-11-08 2007-07-12 Mazzagatti Jane C Updating information in an interlocking trees datastore
US7389301B1 (en) 2005-06-10 2008-06-17 Unisys Corporation Data aggregation user interface and analytic adapted for a KStore
JP4670496B2 (en) * 2005-06-14 2011-04-13 住友電気工業株式会社 Optical receiver
US7788303B2 (en) 2005-10-21 2010-08-31 Isilon Systems, Inc. Systems and methods for distributed system scanning
US7551572B2 (en) 2005-10-21 2009-06-23 Isilon Systems, Inc. Systems and methods for providing variable protection
US7917474B2 (en) 2005-10-21 2011-03-29 Isilon Systems, Inc. Systems and methods for accessing and updating distributed data
US7797283B2 (en) 2005-10-21 2010-09-14 Isilon Systems, Inc. Systems and methods for maintaining distributed data
US20070162506A1 (en) * 2006-01-12 2007-07-12 International Business Machines Corporation Method and system for performing a redistribute transparently in a multi-node system
US7848261B2 (en) 2006-02-17 2010-12-07 Isilon Systems, Inc. Systems and methods for providing a quiescing protocol
US20070214153A1 (en) * 2006-03-10 2007-09-13 Mazzagatti Jane C Method for processing an input particle stream for creating upper levels of KStore
US7734571B2 (en) * 2006-03-20 2010-06-08 Unisys Corporation Method for processing sensor data within a particle stream by a KStore
US20070220069A1 (en) * 2006-03-20 2007-09-20 Mazzagatti Jane C Method for processing an input particle stream for creating lower levels of a KStore
US20080275842A1 (en) * 2006-03-20 2008-11-06 Jane Campbell Mazzagatti Method for processing counts when an end node is encountered
US7656404B1 (en) * 2006-03-21 2010-02-02 Intuit Inc. Line trimming and arrow head placement algorithm
US7689571B1 (en) 2006-03-24 2010-03-30 Unisys Corporation Optimizing the size of an interlocking tree datastore structure for KStore
US7756898B2 (en) 2006-03-31 2010-07-13 Isilon Systems, Inc. Systems and methods for notifying listeners of events
US8238351B2 (en) * 2006-04-04 2012-08-07 Unisys Corporation Method for determining a most probable K location
US7676330B1 (en) 2006-05-16 2010-03-09 Unisys Corporation Method for processing a particle using a sensor structure
US7769727B2 (en) * 2006-05-31 2010-08-03 Microsoft Corporation Resolving update-delete conflicts
US8539056B2 (en) 2006-08-02 2013-09-17 Emc Corporation Systems and methods for configuring multiple network interfaces
US7680836B2 (en) 2006-08-18 2010-03-16 Isilon Systems, Inc. Systems and methods for a snapshot of data
US7882071B2 (en) 2006-08-18 2011-02-01 Isilon Systems, Inc. Systems and methods for a snapshot of data
US7899800B2 (en) 2006-08-18 2011-03-01 Isilon Systems, Inc. Systems and methods for providing nonlinear journaling
US7752402B2 (en) 2006-08-18 2010-07-06 Isilon Systems, Inc. Systems and methods for allowing incremental journaling
US7822932B2 (en) 2006-08-18 2010-10-26 Isilon Systems, Inc. Systems and methods for providing nonlinear journaling
US7676691B2 (en) 2006-08-18 2010-03-09 Isilon Systems, Inc. Systems and methods for providing nonlinear journaling
US7941451B1 (en) * 2006-08-18 2011-05-10 Unisys Corporation Dynamic preconditioning of a B+ tree
US7680842B2 (en) 2006-08-18 2010-03-16 Isilon Systems, Inc. Systems and methods for a snapshot of data
US7590652B2 (en) * 2006-08-18 2009-09-15 Isilon Systems, Inc. Systems and methods of reverse lookup
US7953704B2 (en) 2006-08-18 2011-05-31 Emc Corporation Systems and methods for a snapshot of data
US7697518B1 (en) 2006-09-15 2010-04-13 Netlogic Microsystems, Inc. Integrated search engine devices and methods of updating same using node splitting and merging operations
US8086641B1 (en) 2006-11-27 2011-12-27 Netlogic Microsystems, Inc. Integrated search engine devices that utilize SPM-linked bit maps to reduce handle memory duplication and methods of operating same
US7987205B1 (en) 2006-11-27 2011-07-26 Netlogic Microsystems, Inc. Integrated search engine devices having pipelined node maintenance sub-engines therein that support database flush operations
US7831626B1 (en) 2006-11-27 2010-11-09 Netlogic Microsystems, Inc. Integrated search engine devices having a plurality of multi-way trees of search keys therein that share a common root node
US7953721B1 (en) 2006-11-27 2011-05-31 Netlogic Microsystems, Inc. Integrated search engine devices that support database key dumping and methods of operating same
US8286029B2 (en) 2006-12-21 2012-10-09 Emc Corporation Systems and methods for managing unavailable storage devices
US7593938B2 (en) 2006-12-22 2009-09-22 Isilon Systems, Inc. Systems and methods of directory entry encodings
US7509448B2 (en) 2007-01-05 2009-03-24 Isilon Systems, Inc. Systems and methods for managing semantic locks
US9940345B2 (en) * 2007-01-10 2018-04-10 Norton Garfinkle Software method for data storage and retrieval
US8966080B2 (en) 2007-04-13 2015-02-24 Emc Corporation Systems and methods of managing resource utilization on a threaded computer system
US7779048B2 (en) 2007-04-13 2010-08-17 Isilon Systems, Inc. Systems and methods of providing possible value ranges
US7900015B2 (en) 2007-04-13 2011-03-01 Isilon Systems, Inc. Systems and methods of quota accounting
US8909677B1 (en) * 2007-04-27 2014-12-09 Hewlett-Packard Development Company, L.P. Providing a distributed balanced tree across plural servers
US8199641B1 (en) * 2007-07-25 2012-06-12 Xangati, Inc. Parallel distributed network monitoring
US7882068B2 (en) 2007-08-21 2011-02-01 Isilon Systems, Inc. Systems and methods for adaptive copy on write
US7949692B2 (en) 2007-08-21 2011-05-24 Emc Corporation Systems and methods for portals into snapshot data
US7966289B2 (en) 2007-08-21 2011-06-21 Emc Corporation Systems and methods for reading objects in a file system
US7870345B2 (en) 2008-03-27 2011-01-11 Isilon Systems, Inc. Systems and methods for managing stalled storage devices
US7949636B2 (en) 2008-03-27 2011-05-24 Emc Corporation Systems and methods for a read only mode for a portion of a storage system
US7984324B2 (en) 2008-03-27 2011-07-19 Emc Corporation Systems and methods for managing stalled storage devices
US7953709B2 (en) 2008-03-27 2011-05-31 Emc Corporation Systems and methods for a read only mode for a portion of a storage system
US10992555B2 (en) 2009-05-29 2021-04-27 Virtual Instruments Worldwide, Inc. Recording, replay, and sharing of live network monitoring views
US9626398B2 (en) 2012-05-22 2017-04-18 Hewlett Packard Enterprise Development Lp Tree data structure
KR101699779B1 (en) * 2010-10-14 2017-01-26 삼성전자주식회사 Indexing method for flash memory
US8788505B2 (en) * 2011-04-27 2014-07-22 Verisign, Inc Systems and methods for a cache-sensitive index using partial keys
US8843472B2 (en) 2011-10-11 2014-09-23 International Business Machines Corporation Recovery of inconsistent data in databases
US9703829B2 (en) * 2011-12-26 2017-07-11 Hitachi, Ltd. Database system and database management method
US11487707B2 (en) 2012-04-30 2022-11-01 International Business Machines Corporation Efficient file path indexing for a content repository
US8914356B2 (en) 2012-11-01 2014-12-16 International Business Machines Corporation Optimized queries for file path indexing in a content repository
US9323761B2 (en) 2012-12-07 2016-04-26 International Business Machines Corporation Optimized query ordering for file path indexing in a content repository
US11176111B2 (en) * 2013-03-15 2021-11-16 Nuodb, Inc. Distributed database management system with dynamically split B-tree indexes
WO2015047423A1 (en) 2013-09-30 2015-04-02 Mindjet Llc Scoring members of a set dependent on eliciting preference data amongst subsets selected according to a height-balanced tree
US9430274B2 (en) * 2014-03-28 2016-08-30 Futurewei Technologies, Inc. Efficient methods and systems for consistent read in record-based multi-version concurrency control
US20190026147A1 (en) * 2014-04-30 2019-01-24 International Business Machines Corporation Avoiding index contention with distributed task queues in a distributed storage system
US9817852B2 (en) * 2014-08-28 2017-11-14 Samsung Electronics Co., Ltd. Electronic system with version control mechanism and method of operation thereof
US11080253B1 (en) * 2015-12-21 2021-08-03 Amazon Technologies, Inc. Dynamic splitting of contentious index data pages
US10366065B2 (en) * 2016-04-29 2019-07-30 Netapp, Inc. Memory efficient lookup structure
US10002055B2 (en) * 2016-04-29 2018-06-19 Netapp, Inc. Efficient repair of B+ tree databases with variable-length records
JP6912724B2 (en) * 2017-11-29 2021-08-04 富士通株式会社 Information processing program, information processing device and information processing method
CN108737487B (en) * 2018-03-21 2020-09-29 腾讯科技(深圳)有限公司 Data synchronization method and device, storage medium and electronic device
US10915546B2 (en) 2018-10-10 2021-02-09 Micron Technology, Inc. Counter-based compaction of key-value store tree data block
US11100071B2 (en) * 2018-10-10 2021-08-24 Micron Technology, Inc. Key-value store tree data block spill with compaction
US11048755B2 (en) 2018-12-14 2021-06-29 Micron Technology, Inc. Key-value store tree with selective use of key portion
US10852978B2 (en) 2018-12-14 2020-12-01 Micron Technology, Inc. Key-value store using journaling with selective data storage format
US10936661B2 (en) 2018-12-26 2021-03-02 Micron Technology, Inc. Data tree with order-based node traversal
FR3103664B1 (en) * 2019-11-27 2023-04-07 Amadeus Sas Distributed storage system to store contextual data
KR102176715B1 (en) * 2020-03-03 2020-11-09 전운배 Method for controlling access to tri data structure and apparatus thereof
US11269837B2 (en) * 2020-03-16 2022-03-08 International Business Machines Corporation Data tree checkpoint and restoration system and method
CN111522548B (en) * 2020-03-24 2023-06-27 北京三快在线科技有限公司 Project function expansion method, apparatus, electronic device and computer readable medium
CN111538864B (en) * 2020-03-25 2023-03-31 新华三技术有限公司合肥分公司 Method and device for reducing Buildrun consumption
CN111522814A (en) * 2020-04-14 2020-08-11 西云图科技(北京)有限公司 Information management method of water affair system
CN112434035B (en) * 2020-11-20 2022-09-23 上海交通大学 Indexing method and system for concurrent Hash index data structure based on machine learning
CN112650899B (en) * 2020-12-30 2023-10-03 中国平安人寿保险股份有限公司 Data visualization rendering method and device, computer equipment and storage medium
US11694211B2 (en) * 2021-06-28 2023-07-04 Stripe, Inc. Constant-time cascading deletion of resources
CN114781295B (en) * 2022-06-21 2022-09-09 上海国微思尔芯技术股份有限公司 Logic circuit scale reduction method and device
CN116303586B (en) * 2022-12-09 2024-01-30 中电云计算技术有限公司 Metadata cache elimination method based on multi-level b+tree

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4823310A (en) * 1987-08-10 1989-04-18 Wang Laboratories, Inc. Device for enabling concurrent access of indexed sequential data files
US5430869A (en) * 1991-05-29 1995-07-04 Hewlett-Packard Company System and method for restructuring a B-Tree
US5535322A (en) * 1992-10-27 1996-07-09 International Business Machines Corporation Data processing system with improved work flow system and method
US5446887A (en) * 1993-09-17 1995-08-29 Microsoft Corporation Optimal reorganization of a B-tree
US6792432B1 (en) * 1998-03-31 2004-09-14 Sybase, Inc. Database system with methods providing high-concurrency access in B-Tree structures
US6480839B1 (en) * 2000-07-18 2002-11-12 Go2Market.Com System and method for improving database data manipulation using direct indexing within a B*tree index having a tunable index organization
US20020194483A1 (en) * 2001-02-25 2002-12-19 Storymail, Inc. System and method for authorization of access to a resource
US20030026246A1 (en) * 2001-06-06 2003-02-06 Zarlink Semiconductor V.N. Inc. Cached IP routing tree for longest prefix search
US7213040B1 (en) * 2002-10-29 2007-05-01 Novell, Inc. Apparatus for policy based storage of file data and meta-data changes over time
US7370055B1 (en) * 2003-06-04 2008-05-06 Symantec Operating Corporation Efficiently performing deletion of a range of keys in a B+ tree

Cited By (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060064432A1 (en) * 2004-09-22 2006-03-23 Pettovello Primo M Mtree an Xpath multi-axis structure threaded index
US9171100B2 (en) 2004-09-22 2015-10-27 Primo M. Pettovello MTree an XPath multi-axis structure threaded index
US8166074B2 (en) 2005-11-14 2012-04-24 Pettovello Primo M Index data structure for a peer-to-peer network
US20070174309A1 (en) * 2006-01-18 2007-07-26 Pettovello Primo M Mtreeini: intermediate nodes and indexes
US20080183844A1 (en) * 2007-01-26 2008-07-31 Andrew Gavin Real time online video editing system and method
US20090012976A1 (en) * 2007-07-04 2009-01-08 Samsung Electronics Co., Ltd. Data Tree Storage Methods, Systems and Computer Program Products Using Page Structure of Flash Memory
US9058253B2 (en) * 2007-07-04 2015-06-16 Samsung Electronics Co., Ltd. Data tree storage methods, systems and computer program products using page structure of flash memory
US20100246446A1 (en) * 2009-03-30 2010-09-30 Wenhua Du Tree-based node insertion method and memory device
US8208408B2 (en) * 2009-03-30 2012-06-26 Huawei Technologies Co., Ltd. Tree-based node insertion method and memory device
US20110066937A1 (en) * 2009-09-17 2011-03-17 International Business Machines Corporation Method and system for handling non-presence of elements or attributes in semi-structured data
US10242123B2 (en) 2009-09-17 2019-03-26 International Business Machines Corporation Method and system for handling non-presence of elements or attributes in semi-structured data
US8549398B2 (en) * 2009-09-17 2013-10-01 International Business Machines Corporation Method and system for handling non-presence of elements or attributes in semi-structured data
US9600564B2 (en) 2009-09-17 2017-03-21 International Business Machines Corporation Method and system for handling non-presence of elements or attributes in semi-structured data
US8631028B1 (en) 2009-10-29 2014-01-14 Primo M. Pettovello XPath query processing improvements
US20110252067A1 (en) * 2010-04-12 2011-10-13 Symantec Corporation Insert optimization for b+ tree data structure scalability
US8700670B2 (en) * 2010-04-12 2014-04-15 Symantec Corporation Insert optimization for B+ tree data structure scalability
US9037557B2 (en) 2011-02-28 2015-05-19 International Business Machines Corporation Optimistic, version number based concurrency control for index structures with atomic, non-versioned pointer updates
US8666981B2 (en) * 2011-02-28 2014-03-04 International Business Machines Corporation Bottom-up optimistic latching method for index trees
US20120221531A1 (en) * 2011-02-28 2012-08-30 International Business Machines Corporation Bottom-up optimistic latching method for index trees
US20150220574A1 (en) * 2012-08-10 2015-08-06 Industry Academic Cooperation Foundation Of Yeungnam University Database method for b+ tree based on pram
US9454550B2 (en) * 2012-08-10 2016-09-27 Industry Academic Cooperation Foundation Of Yeungnam University Database method for B+ tree based on PRAM
US20140279859A1 (en) * 2013-03-15 2014-09-18 International Business Machines Corporation Index record-level locking for file systems using a b+tree structure
US9361332B2 (en) * 2013-03-15 2016-06-07 International Business Machines Corporation Index record-level locking for file systems using a B+ tree structure
US20160253353A1 (en) * 2013-03-15 2016-09-01 International Business Machines Corporation Index record-level locking for record-oriented file systems
US9672220B2 (en) * 2013-03-15 2017-06-06 International Business Machines Corporation Index record-level locking for record-oriented file systems
US20230252012A1 (en) * 2022-02-09 2023-08-10 Tmaxtibero Co., Ltd. Method for indexing data

Also Published As

Publication number Publication date
US7383276B2 (en) 2008-06-03
US20050171960A1 (en) 2005-08-04

Similar Documents

Publication Publication Date Title
US7383276B2 (en) Concurrency control for B-trees with node deletion
US7716182B2 (en) Version-controlled cached data store
US7136867B1 (en) Metadata format for hierarchical data storage on a raw storage device
US6850938B1 (en) Method and apparatus providing optimistic locking of shared computer resources
EP0336035B1 (en) Tree structure database system
US5276872A (en) Concurrency and recovery for index trees with nodal updates using multiple atomic actions by which the trees integrity is preserved during undesired system interruptions
KR920000395B1 (en) Method for fetching, insertion, and deletion key record
US8326839B2 (en) Efficient file access in a large repository using a two-level cache
US7702660B2 (en) I/O free recovery set determination
JP5108749B2 (en) System and method for manipulating data in a data storage system
US7987217B2 (en) Transaction-aware caching for document metadata
US7240114B2 (en) Namespace management in a distributed file system
US7421541B2 (en) Version management of cached permissions metadata
US7548918B2 (en) Techniques for maintaining consistency for different requestors of files in a database management system
Lomet et al. Access method concurrency with recovery
Lomet et al. Concurrency and recovery for index trees
US7203709B2 (en) Transaction-aware caching for access control metadata
US20060136376A1 (en) Infrastructure for performing file operations by a database server
JP2008524694A (en) Techniques for providing locks for file operations in database management systems
Lomet Simple, robust and highly concurrent B-trees with node deletion
JP4126843B2 (en) Data management method and apparatus, and recording medium storing data management program
US7716260B2 (en) Techniques for transaction semantics for a database server performing file operations
US20050188380A1 (en) Cache control device, and method and computer program for the same
Reddy et al. Asynchronous operations in distributed concurrency control
하영 et al. A Concurrency Control Scheme over T-tree in Main Memory Databases with Multiversion

Legal Events

Date Code Title Description
STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION

AS Assignment

Owner name: MICROSOFT TECHNOLOGY LICENSING, LLC, WASHINGTON

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:MICROSOFT CORPORATION;REEL/FRAME:034766/0001

Effective date: 20141014