CA2298404A1 - Method, system and program products for managing the checkpoint/restarting of resources of a computing environment - Google Patents


Info

Publication number
CA2298404A1
CA2298404A1 (application CA002298404A)
Authority
CA
Canada
Prior art keywords
entity
resource
checkpoint
computing environment
resources
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
CA002298404A
Other languages
French (fr)
Inventor
Tushar Deepak Chandra
Ahmed-Sameh Afif Fakhouri
Liana Liyow Fong
William Francis Jerome
Srirama Mandyam Krishnakumar
Vijay Krishnarao Naik
John Arthur Pershing Jr.
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
International Business Machines Corp
Original Assignee
International Business Machines Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by International Business Machines Corp filed Critical International Business Machines Corp
Publication of CA2298404A1


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14Error detection or correction of the data by redundancy in operation
    • G06F11/1402Saving, restoring, recovering or retrying
    • G06F11/1415Saving, restoring, recovering or retrying at system level
    • G06F11/1438Restarting or rejuvenating

Abstract

Resources are checkpointed in order to save the state of the resources. The resources can then be brought back to the same running state, during a restart procedure, by making use of the saved state. The determination of when to take a checkpoint or when to restart a resource is made by an entity, such as a cluster manager, external to the entity initiating or taking the checkpoint or performing the restart. The decision to checkpoint/restart a resource is provided by the cluster manager to a resource manager associated with the resource. This communication is facilitated by interfaces to the cluster manager provided by the resource manager.

Description

METHOD, SYSTEM AND PROGRAM PRODUCTS FOR MANAGING
THE CHECKPOINTING/RESTARTING OF RESOURCES
OF A COMPUTING ENVIRONMENT
TECHNICAL FIELD
This invention relates, in general, to managing resources of a computing environment, and in particular, to managing the checkpointing/restarting of the resources of the computing environment.
BACKGROUND ART
The ability to recover from failures within a computing environment is of paramount importance to the users of that environment. Thus, steps have been taken to facilitate the recovery from failures.
One technique currently provided to facilitate the recovery from failures is to periodically take checkpoints of the resources of the computing environment. In particular, at certain times, each resource saves the current state of the resource, in the event that the state is needed to recover from a failure. The manner in which the checkpoint is taken by the resource is resource specific.
Thereafter, if a failure occurs requiring one or more resources to be restarted, the restarted resources bring themselves back to the state they were in when the checkpoints were taken. This provides a mechanism to recover from failures.
Although some recovery mechanisms exist today, a need still exists for mechanisms that improve the management of resources within the computing environment. In particular, a need exists for a capability that better manages the checkpointing and restarting of resources.
A further need exists for a capability that separates the decision to checkpoint/restart from the initiating and/or performing of the checkpoint/restart. Further, a need exists for a checkpoint/restart capability that is suitable for distributed environments, including heterogeneous environments.
A yet further need exists for a capability that cleans up checkpoint information that is no longer desired.

SUMMARY OF THE INVENTION
The shortcomings of the prior art are overcome and additional advantages are provided through the provision of a method of managing the checkpointing of resources of a computing environment. The method includes, for instance, determining, by a first entity of the computing environment, that a checkpoint of a resource of the computing environment is to be taken; and initiating the taking of the checkpoint of the resource by a second entity of the computing environment.
In one embodiment, the first entity has no knowledge of implementation details associated with the initiating the taking of the checkpoint.
In a further embodiment, the second entity is informed of the determination to checkpoint the resource by the first entity invoking an interface of the second entity indicative of the determination to take a checkpoint.
In a further embodiment, the checkpoint is used to restart the resource. As one example, the first entity makes the determination to restart the resource and forwards this determination to the second entity by invoking an interface of the second entity indicative of the determination to restart.
In another embodiment, a plurality of checkpoints of a plurality of resources is initiated by at least one second entity.
In one embodiment, at least one resource of the plurality of resources is executing on a computing node of the computing environment having a first operating system, and at least one other resource of the plurality of resources is executing on another computing node of the computing environment having a second operating system, which is different from the first operating system.
In another aspect of the present invention, a method of managing the restarting of resources of a computing environment is provided. The method includes, for instance, determining, by a first entity of the computing environment, that a resource of the computing environment is to be restarted;
and initiating the restarting of the resource by a second entity of the computing environment.
In another aspect of the present invention, a system of managing the checkpointing of resources of a computing environment is provided. The system includes, for instance, a first entity of the computing environment adapted to determine that a checkpoint of a resource of the computing environment is to be taken; and a second entity of the computing environment adapted to initiate the taking of the checkpoint of the resource.
In a further aspect of the present invention, a system of managing the restarting of resources of a computing environment is provided. The system includes, for example, a first entity of the computing environment being adapted to determine that a resource of the computing environment is to be restarted; and a second entity of the computing environment being adapted to initiate the restarting of the resource.
In yet another aspect of the present invention, an article of manufacture including at least one computer usable medium having computer readable program code means embodied therein for causing the managing of the checkpointing of the resources of a computing environment is provided.
The computer readable program code means in the article of manufacture includes, for instance, computer readable program code means for causing a computer to determine, by a first entity of the computing environment, that a checkpoint of a resource of a computing environment is to be taken;
and computer readable program code means for causing a computer to initiate the taking of the checkpoint of the resource by a second entity of the computing environment.
In a further aspect of the present invention, at least one program storage device readable by a machine, tangibly embodying at least one program of instructions executable by the machine to perform a method of managing the restarting of resources of a computing environment is provided.
The method includes, for example, determining, by a first entity of the computing environment, that a resource of the computing environment is to be restarted; and initiating the restarting of the resource by a second entity of the computing environment.
In accordance with the principles of the present invention, capabilities are provided for managing the checkpointing and restarting of resources of a computing environment, such as a homogeneous or heterogeneous distributed computing environment.
Advantageously, an entity of the computing environment, other than the entity initiating or performing the checkpoint/restart, is responsible for determining when a checkpoint or restart is to be performed.
The entity making this determination (i.e., the determining entity) need not know how to initiate or perform the checkpoint/restart. The entity responsible for initiating the checkpoint/restart provides interfaces to the determining entity, which the determining entity uses to notify the initiating entity of when to proceed.

In addition to the above, a capability is advantageously provided to clean up old checkpoint information that is no longer desired. This cleanup is initiated by the determining entity.
Additional features and advantages are realized through the techniques of the present invention. Other embodiments and aspects of the invention are described in detail herein and are considered a part of the claimed invention.
BRIEF DESCRIPTION OF THE DRAWINGS
The subject matter which is regarded as the invention is particularly pointed out and distinctly claimed in the claims at the conclusion of the specification. The foregoing and other objects, features, and advantages of the invention are apparent from the following detailed description taken in conjunction with the accompanying drawings in which:
FIG. 1 depicts one example of a computing environment incorporating and using the capabilities of the present invention;
FIG. 2 depicts one embodiment of a logical organization of the computing environment of FIG. 1, in accordance with the principles of the present invention;
FIG. 3 depicts one embodiment of a relationship tree, in accordance with the principles of the present invention;
FIG. 4 depicts one embodiment of interfaces of a resource manager of FIG. 2, in accordance with the principles of the present invention;
FIG. 5 depicts one embodiment of the logic associated with checkpointing a resource, in accordance with the principles of the present invention;
FIG. 6 depicts one embodiment of the logic associated with restarting a resource, in accordance with the principles of the present invention; and FIG. 7 depicts one embodiment of the logic associated with cleaning up checkpoint information, in accordance with the principles of the present invention.
BEST MODE FOR CARRYING OUT THE INVENTION
In accordance with the principles of the present invention, a checkpoint capability is provided that enables an entity, other than the entity performing the checkpoint, to make the decision when to checkpoint. Similarly, a restart capability is provided that enables an entity, other than the entity performing the restart, to make the decision of when to restart.
In one embodiment of the present invention, a cluster manager determines when a checkpoint of a resource is to be taken. The decision to checkpoint is then forwarded to a resource manager, which initiates the checkpoint. The checkpoint is then performed by the resource itself. The cluster manager need not know the details of how to initiate or implement the checkpoint. Likewise, in one embodiment, it is the cluster manager that determines that a resource is to be restarted. The cluster manager forwards this information to a resource manager, which is responsible for initiating the restart process. The restart is then performed by the resource itself. Again, the cluster manager need not know the details of how to initiate or perform the restart. The details associated with initiating the checkpoint/restart of the resources are encapsulated by one or more resource managers that are responsible for the resources.
One embodiment of a computing environment incorporating and using the capabilities of the present invention is depicted in FIG. 1. In one example, computing environment 100 is a distributed computing environment having a cluster 102. Cluster 102 includes one or more computing nodes 104. Each computing node is, for instance, a UNIX workstation having at least one central processing unit, storage facilities and one or more input/output devices. Each central processing unit executes an operating system, such as AIX offered by International Business Machines Corporation.
Computing nodes 104 are coupled to one another by a cluster network 106, which is, for example, a high-speed local area network (LAN). One example of this type of LAN is the high-speed switch of the RS/6000 SP, offered by International Business Machines Corporation.
In this example, cluster 102 is coupled to a public network 108 to enable the cluster to provide services to one or more computing units 110 outside the cluster.
Computing unit 110 is, for example, a UNIX workstation or any other type of computer or computing unit.
The computing environment described above is only one example of a computing environment incorporating and using the capabilities of the present invention.
The capabilities of the present invention are equally applicable to various types of computing environments, including, but not limited to, homogeneous computing environments, heterogeneous computing environments, other types of distributed environments, and environments that do not include public networks.

Further, the computing nodes may be other than UNIX workstations. They may be any other type of computers or computing units. Additionally, the operating system need not be AIX. Further, one or more of the computing nodes of the cluster may be different types of computing nodes than the others (e.g., different hardware) and/or run different types of operating systems. Likewise, the computing units may be the same or different from the computing nodes and may be the same or different from each other. The above described environment is offered as only one example.
The cluster is used to run, for instance, critical programs (applications) which require high availability. Examples of these applications include database servers, Enterprise Resource Planning (ERP) systems and web servers. These applications are referred to herein as resources, which run on the nodes in the cluster. These applications can also depend on other types of resources, such as other applications, operating system services, disks, file systems, network adapters, IP addresses and communication stacks, to name just a few examples.
A resource is an abstraction of a hardware component (e.g., a network adapter) and/or a software component (e.g., a database server). The abstraction allows resources to be defined/undefined and allows for the manipulation of the resources and for relationships to be created. One resource may have a relationship with one or more other resources of the computing environment. The relationship may be, for instance, a dependency relationship, in which one resource is dependent on one or more other resources. For example, a payroll application may depend on a database server, and as such, the database server has to be ready before the payroll application can be brought online. Another type of a relationship may be a location relationship.
For example, sometimes, resources that have a relationship with one another need to run on the same computing node. As another example, if an application, such as a database application, needs direct access to physical disks, then the application can only be run on those computing nodes having physical disks connected thereto.
The operation of bringing up a resource in order to enable it to function is known as onlining the resource. Conversely, the operation of shutting down a resource so that it stops functioning is known as offlining the resource. A resource may have a state associated with it while it is running.
The operation of saving the state is called checkpointing. A resource has a set of attributes associated therewith to support checkpointing. These attributes provide information about the capabilities of the resource, priority of the resource, whether the resource is checkpointable, as well as other types of information. A resource can be brought back to the same running state by making use of the state saved during checkpointing. This is known as restarting or recovering the resource.
The manner in which a checkpoint is taken for a particular resource, or a restart is performed for the resource, is resource dependent and known by the resource itself. As one example, the operating system provides support for the resource to checkpoint/restart itself efficiently.
In order to invoke the checkpointing or restarting of a resource, the resource provides an invocation mechanism. This mechanism is very specific to the resource and requires knowledge about the resource. In accordance with the principles of the present invention, the invocation details for a particular resource are known by an entity, referred to as a resource manager, which is coupled to the resource. As depicted in FIG. 2, each resource 200 has associated therewith a resource manager 204. (It should be noted, however, that not all resources of a computing environment need a resource manager.) One resource manager can interact with one or more resources. Each resource manager 204 provides an interface to other entities that wish to inform the resource manager, for instance, that a resource is to be brought online or offline, a checkpoint is to be taken, or a resource is to be restarted. The resource manager is responsible for knowing the details of the resource and for invoking resource specific actions to make the resource checkpoint and/or restart itself. For example, a resource specific action to checkpoint a database is to have the database write a transaction log to a disk.
Additionally, the resource manager is responsible for monitoring the state of the resource.
For example, when a resource fails (or goes offline), it is the responsibility of the resource manager to notify any entity that is interested in knowing about the state of the resource.
In accordance with the principles of the present invention, the resource managers, and in particular, the resource manager interfaces are coupled to at least one cluster manager 206. Cluster manager 206 is the entity which manages the resources in the cluster. It manages a resource by operating on its abstraction to query or change its state or other attributes.
The cluster manager typically runs on one of the nodes in the cluster, but could run anywhere in the distributed system.
One example of a cluster manager is the High-Availability Cluster Multiprocessing (HACMP) offered by International Business Machines Corporation.

The cluster manager has the task of maintaining the availability of the resources (i.e., having them online) and ensuring that the resources provide the desired quality of service. The cluster manager monitors the resources in the cluster using, for example, an event processing facility 208, coupled thereto.
Event processing facility 208 allows resource managers 204 to publish events (e.g., resource status) that could be of interest to other entities in the distributed system.
It then sends these events to entities that are interested in the events. The event processing facility decouples publishers from subscribers, so that the publishers need not know the subscribers. One example of an event processing facility is described in IBM PSSP Event Management Programming Guide and Reference, IBM Pub. No. SC23-3996-01, Second Edition (August 1997), which is hereby incorporated herein by reference in its entirety.
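As a rough illustration, the decoupling the event processing facility provides can be sketched as a minimal publish/subscribe registry. This is a sketch only; the class and method names below are assumptions, not part of the facility described here.

```python
# Minimal publish/subscribe sketch: resource managers publish
# resource-status events by topic, and the facility forwards them to
# subscribers, so publishers never know who is listening.
# All names here are illustrative assumptions.
from collections import defaultdict

class EventFacility:
    def __init__(self):
        self._subscribers = defaultdict(list)   # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, event):
        # Deliver to every interested subscriber; the publisher
        # supplies only the topic and the event payload.
        for callback in self._subscribers[topic]:
            callback(event)

facility = EventFacility()
received = []
facility.subscribe("resource-status", received.append)
facility.publish("resource-status", {"resource": "A", "state": "offline"})
```

The publisher's call site never names the subscriber, which is the decoupling the facility is described as providing.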
The cluster manager has a global view of all the resources, their states and the relationships among them. Thus, a cluster manager arrives at certain decisions by utilizing information that may not be known to the resource. The cluster manager maintains a relationship tree (see FIG. 3, which is described below), which includes one or more of the resources of the computing environment and the constraints that interrelate them. (In another embodiment, there may be multiple relationship trees.) The cluster manager uses the global view, as well as policy or goal information supplied to it by the users of the cluster, to make decisions concerning the resources.
For example, a policy might state that the cluster manager should try to online the resource on a different node from the node used when the resource failed.
In accordance with the principles of the present invention, the policy information, which may be stored on nodes or resources within the cluster or outside of the cluster, includes policies for checkpointing/restarting the resources. These policies may be associated with the relationships and/or events of interest (e.g., a periodic time pop) that affect the resources in the relationship tree.
For example, a policy may be used to determine whether checkpointing is needed when a resource is added or removed from the relationship tree. That is, resources may be dynamically added or removed from the computing environment during run-time, and thus, dynamically added or removed from the tree. Whenever a resource is added or removed from the relationship tree, the checkpointing policy is exercised to determine whether checkpointing of a set of resources is needed.
As a specific example, consider a policy which states that all the resources in a relationship tree are to be checkpointed if a resource with a lower priority (i.e., relative importance of running this resource when compared to other resources) is added to the relationship tree. When a resource with a lower priority is added to the relationship tree, the cluster manager exercises the policy and determines that all the resources in the relationship tree are to be checkpointed. An extension of this policy would be to propagate the priority of the highest priority resource down the relationship tree.
Since a high priority resource can have a relationship with lower priority resources, the high priority is propagated down the tree, so that all the resources of the tree inherit the high priority during the running of the high priority resource. The policy could also specify the order in which the resources get checkpointed and restarted.
An illustration of the above example is depicted in FIG. 3. In FIG. 3, there is an existing relationship tree 300 with Resources A, B, C and D. Resource A has a checkpoint priority (denoted as CP) of 5; Resource B has a priority of 2; and Resources C and D have a priority of 1. Since Resource A has a checkpoint priority of 5, all the resources in the relationship tree now have a propagated priority (PP) of 5, as per the policy. Consider adding a new relationship to the tree by adding another resource, Resource E, with a checkpoint priority of 3. Since this relationship contains a new resource with a lower priority, a checkpoint of all the resources in the existing tree is performed. The cluster manager initiates a checkpoint of Resources A, B, C and D using a resource manager interface for checkpointing, as described below. If there is a policy specifying the order in which the resources should be checkpointed, the cluster manager observes this policy. When the checkpointing is complete, the resource manager(s) of Resources A, B, C and D inform the cluster manager of completion by invoking a callback provided by the cluster manager.
The new relationship is established once these resources have completed their checkpointing. If Resource E causes the system to crash, then a restart of Resources A, B, C and D is initiated. Again, there may be a policy associated with the order in which the resources are restarted.
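The priority-propagation policy in this example can be sketched as follows. The tree representation, method names, and the alphabetical checkpoint order are illustrative assumptions, not the implementation described here.

```python
# Sketch of the policy above: every resource in a relationship tree
# inherits the highest checkpoint priority (PP) present in the tree,
# and adding a resource with a lower priority triggers a checkpoint
# of the existing resources before the new relationship is established.

class RelationshipTree:
    def __init__(self):
        self.priorities = {}       # resource id -> checkpoint priority (CP)
        self.checkpoint_log = []   # resources checkpointed, in order

    def propagated_priority(self):
        # PP is the maximum CP in the tree, propagated to every member.
        return max(self.priorities.values(), default=0)

    def add_resource(self, rid, cp):
        if self.priorities and cp < self.propagated_priority():
            # Policy: checkpoint all existing resources first
            # (alphabetical order is an assumed stand-in for an
            # order-of-checkpoint policy).
            self.checkpoint_log.extend(sorted(self.priorities))
        self.priorities[rid] = cp

tree = RelationshipTree()
tree.priorities = {"A": 5, "B": 2, "C": 1, "D": 1}   # the existing tree of FIG. 3
tree.add_resource("E", 3)   # CP 3 < PP 5: checkpoint A, B, C and D
```

After the call, `tree.checkpoint_log` records that A, B, C and D were checkpointed before E joined the tree.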
When the cluster manager determines that an action is to be taken by a resource, the cluster manager informs the resource manager corresponding to that resource of the action to be performed.
In accordance with the principles of the present invention, the cluster manager informs the resource manager via interfaces 400 (FIG. 4) provided by the resource manager.
Interfaces 400 are, for example, application programming interfaces (APIs) called by entities within the computing environment, such as cluster manager 206. When an API is called, the resource manager takes the appropriate action and conveys the result of the action back to the caller through a callback mechanism or by posting an event to an event processing facility 208 (FIG. 2).
The event processing facility would then inform the caller of the API about the status of the call.
In accordance with the principles of the present invention, various API calls can be made by the cluster manager, for instance. Examples of some of the API calls are described below.
As one example, an API call can be made to initiate a checkpoint.
Specifically, this call informs the resource manager corresponding to the identified resource to initiate a checkpoint of the resource. This API call has, for instance, the following syntax:
CheckpointResource(ResourceId, CheckPointKey). ResourceId is a unique identifier of the resource to be checkpointed, and CheckPointKey (e.g., a numeric value, an alphanumeric value or an alpha value) is an identifier that identifies the checkpoint. The key has semantic meaning to the entity making use of the resource manager interface. Over time, a resource may have multiple keys associated therewith. Thus, if a restart of the resource is needed, any one of the keys can be selected.
Upon receiving the CheckpointResource call, the resource manager initiates resource specific actions, to be performed by the resource, to checkpoint the resource. It uses the checkpoint key to keep track of this checkpoint. If the CheckPointKey is not specified, the resource manager uses a default key. When the checkpoint has been successfully completed, the resource manager invokes the registered callback or posts an event to the event processing facility informing the caller that the checkpointing has been completed. In case of an error, the resource manager notifies the caller in a similar way.
Another example of an API call used in conjunction with the present invention is RestartResource(ResourceId, CheckPointKey). RestartResource is used to inform the resource manager to initiate resource specific actions, to be performed by the resource, to restart the resource.
The resource, identified by ResourceId, is restarted at the specified checkpoint designated by CheckPointKey. In particular, the saved state associated with that checkpoint key is used to restart the resource. If the checkpoint key is not specified, then the resource manager uses the default checkpoint key. The results of the restart are forwarded to the caller in a similar manner as the results of CheckpointResource.
A further example of an API call relevant to the present invention is CleanupCheckpoint(ResourceId, CheckPointKey). Checkpoints accumulate over a period of time and thus, are garbage collected in order to efficiently manage system resources, such as system memory. The CleanupCheckpoint call informs the resource manager associated with the resource identified by ResourceId to clean up the checkpoint information for the specified key. This may require the resource manager to invoke certain resource specific actions to be performed by the resources.
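Taken together, the three calls suggest a resource manager surface along the following lines. This is a minimal sketch: the Python names, the in-memory checkpoint store, and the callback signature are all assumptions standing in for the resource specific actions described above.

```python
# Sketch of a resource manager exposing the three described calls:
# CheckpointResource, RestartResource and CleanupCheckpoint. The
# caller registers a completion callback; an unspecified key falls
# back to a default key. All details here are illustrative.

DEFAULT_KEY = "default"

class ResourceManager:
    def __init__(self, on_complete):
        self.on_complete = on_complete   # callback registered by the caller
        self.checkpoints = {}            # (resource_id, key) -> saved state

    def checkpoint_resource(self, resource_id, key=None):
        key = key or DEFAULT_KEY
        # Stand-in for a resource specific action, e.g. having a
        # database write its transaction log to disk.
        self.checkpoints[(resource_id, key)] = f"state-of-{resource_id}"
        self.on_complete("checkpoint", resource_id, key)

    def restart_resource(self, resource_id, key=None):
        key = key or DEFAULT_KEY
        state = self.checkpoints[(resource_id, key)]   # saved state for the key
        self.on_complete("restart", resource_id, key)
        return state

    def cleanup_checkpoint(self, resource_id, key):
        # Garbage-collect checkpoint information no longer desired.
        del self.checkpoints[(resource_id, key)]

events = []
rm = ResourceManager(lambda *e: events.append(e))
rm.checkpoint_resource("A", "Kt1")
rm.restart_resource("A", "Kt1")
rm.cleanup_checkpoint("A", "Kt1")
```

The caller (a cluster manager, in the described embodiment) sees only the three entry points and the completion callback; everything resource specific stays behind them.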
Further details regarding the taking of a checkpoint and restarting a resource, in accordance with the principles of the present invention, are described below. In particular, one embodiment of taking a checkpoint is described in detail with reference to FIG. 5 and one embodiment for restarting a resource is described with reference to FIG. 6.
Referring to FIG. 5, initially, the cluster manager determines that a checkpoint is to be taken for one or more resources of the computing environment, STEP 500. This determination is made based on consultation with the checkpoint policy and/or based on resource dependency, as examples.
Subsequent to the cluster manager determining that a checkpoint is to be taken for a resource, the cluster manager invokes the checkpoint interface of the resource manager corresponding to that resource, STEP 502. The interface is invoked via, for instance, the CheckpointResource API. With the API call, the cluster manager provides the ResourceId, as well as the CheckPointKey. For example, if a checkpoint is to be taken of a resource (such as a database system) having a resource identity of A, then a ResourceId of A is passed with the call. Further, if Resource A is running at time t1, then the cluster manager provides a key of Kt1.
When the resource manager receives the CheckpointResource call, it initiates a checkpoint for the resource identified by the resource id, STEP 504. In particular, the resource manager initiates resource specific actions necessary to initiate the checkpoint of the resource. As one example, the resource manager may initiate the copying of a file to a mirror and remember the key associated with the checkpoint. (File mirroring is offered as only one example. There are many instances in which checkpointing is performed, and the actions taken in those instances are dependent on what needs to be accomplished.) Thereafter, the resource takes its checkpoint, STEP 506. The manner in which this is accomplished is specific to the resource and need not be known by the cluster manager.
After the resource takes the checkpoint, the resource manager informs the cluster manager of completion, STEP 508. This is accomplished by invoking the registered callback or posting an event, as examples.
The above procedure is performed for each checkpoint that is to be taken.
Although in the above embodiment, only one resource is represented by the API call, in other embodiments more than one resource id and/or more than one key id may be included in the call, thereby enabling one call to initiate checkpointing of a plurality of resources.
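A one-call, multi-resource variant of the kind contemplated here might be sketched as follows; the function and parameter names are assumptions for illustration.

```python
# Sketch of initiating checkpoints of several resources with a single
# call: the caller passes parallel lists of resource ids and checkpoint
# keys, and the per-resource initiation action is invoked for each pair.

def checkpoint_resources(initiate_checkpoint, resource_ids, keys):
    """Initiate checkpoints of several resources in one call.

    initiate_checkpoint: the per-resource initiation action (standing
    in for a resource manager's CheckpointResource entry point).
    """
    for resource_id, key in zip(resource_ids, keys):
        # Each checkpoint remains resource specific; the caller only
        # names the resource and its checkpoint key.
        initiate_checkpoint(resource_id, key)

log = []
checkpoint_resources(lambda rid, key: log.append((rid, key)),
                     ["A", "B", "C"], ["Kt1", "Kt1", "Kt1"])
```

The same shape would serve a multi-resource restart, with the restart initiation action substituted for the checkpoint one.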
After a checkpoint is taken of a resource, that checkpoint can be used to restore the resource should the resource or the system fail. For example, if Resource A crashed at a time t2, then the checkpointed state of the resource at time t1 is used to recover the resource. This is described in further detail with reference to FIG. 6.
Initially, the cluster manager determines that a resource is to be restarted, STEP 600. For example, the cluster manager either detects or is informed that a resource has failed, and thus, determines that it needs to be restarted. The cluster manager may consult its policy to ensure proper invocation of the restart. Subsequently, the cluster manager invokes the restart interface of the resource manager corresponding to the resource to be restarted, STEP 602. As one example, the RestartResource API is used, which includes the resource id of the resource to be restarted and the checkpoint key identifying the checkpoint to be used for the restart. The checkpoint key is selected from one or more keys associated with the resource.
After the resource manager receives the RestartResource call, the resource manager initiates the restart for the resource, STEP 604. The manner in which this is accomplished is specific to the resource. For example, continuing with the file mirroring example, the resource manager initiates the bringing back of the mirror copy associated with the specified key.
Thereafter, the resource recovers itself via resource specific actions, STEP 606. Once the restart is complete, the resource manager informs the cluster manager of completion using, for example, a registered callback or by posting an event, STEP 608.
Similar to checkpointing, in other embodiments, more than one resource may be represented by the API call by providing multiple resource ids and one or more keys.
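The restart flow (STEPs 600 through 608) could be sketched in the same style. Again, only the RestartResource interface, the resource id, and the checkpoint key come from the text; the in-memory state store is a hypothetical stand-in for resource-specific recovery actions.

```python
# Sketch of the restart flow: the cluster manager selects a checkpoint
# key from those associated with the resource and invokes
# RestartResource; the resource manager restores the state recorded
# under that key and reports completion.
class ResourceManager:
    def __init__(self):
        self.checkpoints = {}   # (resource_id, key) -> saved state
        self.live = {}          # resource_id -> current state

    def checkpoint_resource(self, resource_id, key):
        self.checkpoints[(resource_id, key)] = dict(self.live[resource_id])

    def restart_resource(self, resource_id, key, on_complete):
        # STEPs 604/606: bring back the copy associated with the key;
        # the resource recovers itself via resource-specific actions.
        self.live[resource_id] = dict(self.checkpoints[(resource_id, key)])
        on_complete(resource_id)   # STEP 608: inform the cluster manager


rm = ResourceManager()
rm.live["A"] = {"t": 1}
rm.checkpoint_resource("A", "KA_t1")        # state at time t1
rm.live["A"] = {"t": 2, "corrupt": True}    # resource fails at time t2
done = []
rm.restart_resource("A", "KA_t1", on_complete=done.append)
print(rm.live["A"])   # {'t': 1} -- restored to the t1 checkpoint
```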
In a further aspect of the present invention, the state of a distributed system (e.g., a homogeneous or heterogeneous system) can be checkpointed by checkpointing a subset or all of the resources in the system by using one or more sets of keys. Consider an example of a cluster having Resources A, B and C running in it. The state of the cluster can be considered as the combined state of the Resources A, B and C. The resources are checkpointed using a key set K1={KA, KB, KC}, which corresponds to the resources at a time t1. Later, at a time t2, if the cluster state is to be restored to the state at time t1, Resources A, B and C are restarted with the key set K1. In particular, Resources A, B and C are restarted with keys KA, KB and KC, respectively.
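The key-set example can be made concrete with a short sketch. The key set K1={KA, KB, KC} and the t1/t2 scenario come from the text; the dictionary representation of per-resource state is an assumed illustration.

```python
# Sketch: checkpointing a cluster's combined state under a key set
# K1 = {KA, KB, KC} at time t1, then restoring every resource from
# its corresponding key after the state has diverged by time t2.
def checkpoint_cluster(states, key_set):
    # key_set maps resource name -> checkpoint key taken at time t1.
    return {(r, key_set[r]): dict(s) for r, s in states.items()}

def restore_cluster(store, key_set):
    # Restart each resource with its own key from the set.
    return {r: dict(store[(r, k)]) for r, k in key_set.items()}


states_t1 = {"A": {"x": 1}, "B": {"x": 2}, "C": {"x": 3}}
K1 = {"A": "KA", "B": "KB", "C": "KC"}
store = checkpoint_cluster(states_t1, K1)             # cluster state at t1
states_t2 = {"A": {"x": 9}, "B": {}, "C": {"x": 0}}   # diverged by t2
restored = restore_cluster(store, K1)                 # back to the t1 state
print(restored == states_t1)  # True
```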
As the checkpoints accumulate, they take up system resources, such as system memory.
Thus, a garbage collection process is invoked periodically by the cluster manager. In particular, the cluster manager monitors the accumulation (or is informed that garbage collection is to take place) and makes a determination, based on its policy, that the checkpoints are to be cleaned up, STEP 700 (FIG. 7).
After this determination is made, the cluster manager invokes the CleanupCheckpoint interface of the resource manager associated with the resource id passed with the call, STEP 702.
The resource manager then cleans up the checkpoint information for the specified key, STEP 704.
This may require the resource manager to perform certain resource specific actions. As one example, in the context of mirroring, the resource manager initiates deletion of the mirrored copy identified by the key.
When the cleanup is complete, the resource manager notifies the cluster manager of completion, STEP 706.
As with the other flows, in another embodiment, multiple resource ids and/or checkpoint keys may be presented in a single API call.
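The cleanup flow (STEPs 700 through 706) might look like the following. The CleanupCheckpoint interface, resource id, key, and completion notification are named in the text; treating a checkpoint as a stored byte string standing in for a mirror copy is an assumption for illustration.

```python
# Sketch of the garbage-collection flow: the cluster manager invokes
# CleanupCheckpoint with a resource id and key, and the resource
# manager deletes the checkpoint data for that key (in the mirroring
# example, the mirrored copy), then notifies completion.
class ResourceManager:
    def __init__(self):
        self.checkpoints = {("A", "K1"): b"mirror-1",
                            ("A", "K2"): b"mirror-2"}

    def cleanup_checkpoint(self, resource_id, key, on_complete):
        # STEP 704: resource-specific cleanup -- delete the mirror copy.
        del self.checkpoints[(resource_id, key)]
        # STEP 706: notify the cluster manager of completion.
        on_complete(resource_id, key)


rm = ResourceManager()
notified = []
rm.cleanup_checkpoint("A", "K1", lambda r, k: notified.append((r, k)))
print(sorted(rm.checkpoints))  # [('A', 'K2')] -- only K2 remains
```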
As described in detail above, the present invention provides capabilities for managing the checkpointing/restarting of resources of a computing environment. In one example, the computing environment is a clustered environment, which is, for instance, a high-bandwidth connected set of servers that can be managed from a single point and appears as a single resource to end users, system administrators, and other clients.

In accordance with the principles of the present invention, an entity (e.g., a cluster manager), other than the entity (e.g., a resource) to be checkpointed/restarted, makes the decision to checkpoint/restart. The deciding entity has knowledge of the dynamic resource relationship information of resources and of policies that are associated with the checkpointing and restarting of the resources. The policies may be a part of the cluster manager or may be defined by the user. The policies can be enforced at the discretion of the user. Dynamic resource relationship information is used to checkpoint and restart resources in a cluster.
Resources are managed by entities known as resource managers. The resource managers provide interfaces that can be used to checkpoint and restart resources that are running in the computing environment. Each resource manager is, for example, a process independent of the resource, which is running in the cluster (although, in some cases, a resource can be its own resource manager). In some cases, the resource manager is outside the cluster, but still manages the resources in the cluster. The resource manager understands the details of how the resources associated therewith can be checkpointed and restarted. It maintains one or more keys, where each key is associated with a particular checkpoint.
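The key bookkeeping attributed to the resource manager above could be as simple as the following registry; the class and method names are hypothetical, chosen only to show that several keys, each naming one checkpoint, may exist for the same resource.

```python
# Sketch of per-resource checkpoint-key bookkeeping: each key is
# associated with a particular checkpoint, and a resource may
# accumulate several keys over time (e.g., one per checkpoint epoch).
from collections import defaultdict

class KeyRegistry:
    def __init__(self):
        self._keys = defaultdict(list)  # resource_id -> keys, oldest first

    def record(self, resource_id, key):
        self._keys[resource_id].append(key)

    def keys_for(self, resource_id):
        # The cluster manager selects a restart key from this list.
        return list(self._keys[resource_id])


reg = KeyRegistry()
reg.record("A", "KA_t1")
reg.record("A", "KA_t2")
print(reg.keys_for("A"))  # ['KA_t1', 'KA_t2']
```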
The present invention can be included in an article of manufacture (e.g., one or more computer program products) having, for instance, computer usable media. The media has embodied therein, for instance, computer readable program code means for providing and facilitating the capabilities of the present invention. The article of manufacture can be included as a part of a computer system or sold separately.
Additionally, at least one program storage device readable by a machine, tangibly embodying at least one program of instructions executable by the machine to perform the capabilities of the present invention can be provided.
The flow diagrams depicted herein are just exemplary. There may be many variations to these diagrams or the steps (or operations) described therein without departing from the spirit of the invention. For instance, the steps may be performed in a differing order, or steps may be added, deleted or modified. All of these variations are considered a part of the claimed invention.
Although preferred embodiments have been depicted and described in detail herein, it will be apparent to those skilled in the relevant art that various modifications, additions, substitutions and the like can be made without departing from the spirit of the invention and these are therefore considered to be within the scope of the invention as defined in the following claims.

Claims (84)

1. A method of managing the checkpointing of resources of a computing environment, said method comprising:
determining, by a first entity of said computing environment, that a checkpoint of a resource of said computing environment is to be taken; and initiating the taking of said checkpoint of said resource by a second entity of said computing environment.
2. The method of claim 1, wherein said first entity has no knowledge of implementation details associated with said initiating the taking of said checkpoint.
3. The method of claim 1, further comprising informing said second entity of the determination to checkpoint said resource.
4. The method of claim 3, wherein said informing comprises invoking, by said first entity, an interface of said second entity indicative of said determination to take a checkpoint.
5. The method of claim 4, wherein said invoking comprises providing an identifier of said resource to be checkpointed.
6. The method of claim 5, wherein said invoking further comprises providing an identifier of the checkpoint.
7. The method of claim 1, further comprising informing said first entity, by said second entity, that the taking of the checkpoint is complete.
8. The method of claim 1, further comprising using said checkpoint to restart said resource.
9. The method of claim 8, further comprising determining, by said first entity, that said restart is to take place.
10. The method of claim 9, further comprising informing said second entity of the determination to restart said resource.
11. The method of claim 10, wherein said informing comprises invoking, by said first entity, an interface of said second entity indicative of said determination to restart.
12. The method of claim 8, further comprising informing said first entity, by said second entity, that the restart is complete.
13. The method of claim 1, wherein said initiating comprises initiating the taking of a plurality of checkpoints of a plurality of resources by at least one second entity.
14. The method of claim 13, further comprising using at least one set of checkpoint keys in checkpointing said plurality of resources.
15. The method of claim 14, further comprising using one or more sets of said at least one set of checkpoint keys to restart one or more resources of said plurality of resources.
16. The method of claim 13, wherein at least one resource of said plurality of resources is executing on a computing node of said computing environment having a first operating system, and at least one other resource of said plurality of resources is executing on another computing node of said computing environment having a second operating system, which is different from said first operating system.
17. The method of claim 13, wherein one or more resources of said plurality of resources have a priority associated therewith, and wherein said priority is usable during at least one of said determining and said initiating.
18. The method of claim 1, wherein said determining comprises using policy information in determining that said checkpoint is to be taken.
19. The method of claim 18, wherein said policy information includes a policy that a checkpoint is to be taken of any resources affected by adding or removing one or more resources from said computing environment.
20. The method of claim 1, wherein said determining comprises using resource relationships in determining that said checkpoint is to be taken.
21. The method of claim 1, wherein said resource has associated therewith an indicator specifying that said resource is checkpointable.
22. The method of claim 1, further comprising invoking, by said first entity, an interface of said second entity to clean up any unwanted checkpoint information.
23. The method of claim 22, wherein said invoking comprises providing at least one checkpoint key identifying the checkpoint information to be cleaned up.
24. The method of claim 1, wherein said second entity comprises one of said resource and a resource manager coupled to said resource and said first entity.
25. The method of claim 1, wherein said computing environment is a clustered environment.
26. The method of claim 1, wherein said computing environment is a distributed heterogeneous computing environment.
27. The method of claim 1, wherein said resource has a plurality of checkpoints associated therewith, and further comprising selecting one of said plurality of checkpoints to be used in restarting said resource.
28. A method of managing the restarting of resources of a computing environment, said method comprising:
determining, by a first entity of said computing environment, that a resource of said computing environment is to be restarted; and initiating the restarting of said resource by a second entity of said computing environment.
29. The method of claim 28, further comprising informing said second entity of the determination to restart said resource.
30. The method of claim 29, wherein said informing comprises invoking, by said first entity, an interface of said second entity indicative of said determination to restart.
31. The method of claim 30, wherein said invoking comprises providing a checkpoint key to said second entity identifying the checkpoint to be used during restart.
32. The method of claim 31, further comprising selecting said checkpoint key from one or more checkpoint keys associated with said resource.
33. The method of claim 28, wherein said initiating comprises initiating the restarting of a plurality of resources of said computing environment.
34. The method of claim 33, further comprising using at least one set of checkpoint keys to restart said plurality of resources.
35. A system of managing the checkpointing of resources of a computing environment, said system comprising:
a first entity of said computing environment adapted to determine that a checkpoint of a resource of said computing environment is to be taken; and a second entity of said computing environment adapted to initiate the taking of said checkpoint of said resource.
36. The system of claim 35, wherein said first entity has no knowledge of implementation details associated with said initiating the taking of said checkpoint.
37. The system of claim 35, further comprising means for informing said second entity of the determination to checkpoint said resource.
38. The system of claim 37, wherein said means for informing comprises said first entity being adapted to invoke an interface of said second entity indicative of said determination to take a checkpoint.
39. The system of claim 38, wherein said first entity provides to said interface an identifier of said resource to be checkpointed.
40. The system of claim 39, wherein said first entity provides to said interface an identifier of the checkpoint.
41. The system of claim 35, wherein said second entity is further adapted to inform said first entity that the taking of the checkpoint is complete.
42. The system of claim 35, further comprising means for using said checkpoint to restart said resource.
43. The system of claim 42, wherein said first entity is further adapted to determine that said restart is to take place.
44. The system of claim 43, further comprising means for informing said second entity of the determination to restart said resource.
45. The system of claim 44, wherein said means for informing comprises said first entity being adapted to invoke an interface of said second entity indicative of said determination to restart.
46. The system of claim 42, wherein said second entity is further adapted to inform said first entity that the restart is complete.
47. The system of claim 35, further comprising means for initiating the taking of a plurality of checkpoints of a plurality of resources by at least one second entity.
48. The system of claim 47, further comprising means for using at least one set of checkpoint keys in checkpointing said plurality of resources.
49. The system of claim 48, further comprising means for using one or more sets of said at least one set of checkpoint keys to restart one or more resources of said plurality of resources.
50. The system of claim 47, wherein at least one resource of said plurality of resources is executing on a computing node of said computing environment having a first operating system, and at least one other resource of said plurality of resources is executing on another computing node of said computing environment having a second operating system, which is different from said first operating system.
51. The system of claim 47, wherein one or more resources of said plurality of resources have a priority associated therewith, and wherein said priority is usable during at least one of the determining and the initiating.
52. The system of claim 35, wherein said first entity uses policy information in determining that said checkpoint is to be taken.
53. The system of claim 52, wherein said policy information includes a policy that a checkpoint is to be taken of any resources affected by adding or removing one or more resources from said computing environment.
54. The system of claim 35, wherein said first entity uses resource relationships in determining that said checkpoint is to be taken.
55. The system of claim 35, wherein said resource has associated therewith an indicator specifying that said resource is checkpointable.
56. The system of claim 35, wherein said first entity is further adapted to invoke an interface of said second entity to clean up any unwanted checkpoint information.
57. The system of claim 56, wherein said first entity provides to said interface at least one checkpoint key identifying the checkpoint information to be cleaned up.
58. The system of claim 35, wherein said second entity comprises one of said resource and a resource manager coupled to said resource and said first entity.
59. The system of claim 35, wherein said computing environment is at least one of a clustered environment and a distributed heterogeneous computing environment.
60. The system of claim 35, wherein said resource has a plurality of checkpoints associated therewith, and further comprising means for selecting one of said plurality of checkpoints to be used in restarting said resource.
61. A system of managing the restarting of resources of a computing environment, said system comprising:
a first entity of said computing environment being adapted to determine that a resource of said computing environment is to be restarted; and a second entity of said computing environment being adapted to initiate the restarting of said resource.
62. The system of claim 61, further comprising means for informing said second entity of the determination to restart said resource.
63. The system of claim 62, wherein said means for informing comprises means for invoking, by said first entity, an interface of said second entity indicative of said determination to restart.
64. The system of claim 63, wherein said means for invoking comprises means for providing a checkpoint key to said second entity identifying the checkpoint to be used during restart.
65. The system of claim 64, further comprising means for selecting said checkpoint key from one or more checkpoint keys associated with said resource.
66. The system of claim 61, further comprising means for initiating the restarting of a plurality of resources of said computing environment by at least one second entity.
67. The system of claim 66, further comprising means for using at least one set of checkpoint keys to restart said plurality of resources.
68. A system of managing the checkpointing of resources of a computing environment, said system comprising:

means for determining, by a first entity of said computing environment, that a checkpoint of a resource of said computing environment is to be taken; and means for initiating the taking of said checkpoint of said resource by a second entity of said computing environment.
69. The system of claim 68, further comprising means for informing said second entity of the determination to checkpoint said resource, said means for informing comprising means for invoking, by said first entity, an interface of said second entity indicative of said determination to take a checkpoint.
70. The system of claim 68, further comprising means for informing said second entity of a determination to restart said resource, said means for informing comprising means for invoking, by said first entity, an interface of said second entity indicative of said determination to restart.
71. A system of managing the restarting of resources of a computing environment, said system comprising:
means for determining, by a first entity of said computing environment, that a resource of said computing environment is to be restarted; and means for initiating the restarting of said resource by a second entity of said computing environment.
72. An article of manufacture, comprising:
at least one computer usable medium having computer readable program code means embodied therein for causing the checkpointing of resources of a computing environment, the computer readable program code means in said article of manufacture comprising:
computer readable program code means for causing a computer to determine, by a first entity of said computing environment, that a checkpoint of a resource of said computing environment is to be taken; and computer readable program code means for causing a computer to initiate the taking of said checkpoint of said resource by a second entity of said computing environment.
73. The article of manufacture of claim 72, further comprising computer readable program code means for causing a computer to inform said second entity of the determination to checkpoint said resource.
74. The article of manufacture of claim 73, wherein said computer readable program code means for causing a computer to inform comprises computer readable program code means for causing a computer to invoke, by said first entity, an interface of said second entity indicative of said determination to take a checkpoint.
75. The article of manufacture of claim 72, further comprising computer readable program code means for causing a computer to inform said first entity, by said second entity, that the taking of the checkpoint is complete.
76. The article of manufacture of claim 72, further comprising computer readable program code means for causing a computer to use said checkpoint to restart said resource.
77. The article of manufacture of claim 72, further comprising computer readable program code means for causing a computer to inform said second entity of a determination to restart said resource.
78. The article of manufacture of claim 77, wherein said computer readable program code means for causing a computer to inform comprises computer readable program code means for causing a computer to invoke, by said first entity, an interface of said second entity indicative of said determination to restart.
79. The article of manufacture of claim 72, wherein said computer readable program code means for causing a computer to initiate comprises computer readable program code means for causing a computer to initiate the taking of a plurality of checkpoints of a plurality of resources by at least one second entity.
80. The article of manufacture of claim 79, further comprising computer readable program code means for causing a computer to use at least one set of checkpoint keys in checkpointing said plurality of resources.
81. The article of manufacture of claim 80, further comprising computer readable program code means for causing a computer to use one or more sets of said at least one set of checkpoint keys to restart one or more resources of said plurality of resources.
82. The article of manufacture of claim 79, wherein at least one resource of said plurality of resources is executing on a computing node of said computing environment having a first operating system, and at least one other resource of said plurality of resources is executing on another computing node of said computing environment having a second operating system, which is different from said first operating system.
83. The article of manufacture of claim 72, further comprising computer readable program code means for causing a computer to invoke, by said first entity, an interface of said second entity to clean up any unwanted checkpoint information.
84. At least one program storage device readable by a machine, tangibly embodying at least one program of instructions executable by the machine to perform a method of managing the restarting of resources of a computing environment, said method comprising:
determining, by a first entity of said computing environment, that a resource of said computing environment is to be restarted; and initiating the restarting of said resource by a second entity of said computing environment.
CA002298404A 1999-03-30 2000-02-16 Method, system and program products for managing the checkpoint/restarting of resources of a computing environment Abandoned CA2298404A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US09/280,531 US6594779B1 (en) 1999-03-30 1999-03-30 Method, system and program products for managing the checkpointing/restarting of resources of a computing environment
US09/280,531 1999-03-30

Publications (1)

Publication Number Publication Date
CA2298404A1 true CA2298404A1 (en) 2000-09-30

Family

ID=23073483

Family Applications (1)

Application Number Title Priority Date Filing Date
CA002298404A Abandoned CA2298404A1 (en) 1999-03-30 2000-02-16 Method, system and program products for managing the checkpoint/restarting of resources of a computing environment

Country Status (2)

Country Link
US (1) US6594779B1 (en)
CA (1) CA2298404A1 (en)

Families Citing this family (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7464147B1 (en) * 1999-11-10 2008-12-09 International Business Machines Corporation Managing a cluster of networked resources and resource groups using rule-based constraints in a scalable clustering environment
US6823474B2 (en) * 2000-05-02 2004-11-23 Sun Microsystems, Inc. Method and system for providing cluster replicated checkpoint services
US20020188426A1 (en) * 2001-01-11 2002-12-12 Utpal Datta Check pointing of processing context of network accounting components
US6961865B1 (en) * 2001-05-24 2005-11-01 Oracle International Corporation Techniques for resuming a transaction after an error
US8423674B2 (en) * 2001-06-02 2013-04-16 Ericsson Ab Method and apparatus for process sync restart
US7082551B2 (en) * 2001-06-29 2006-07-25 Bull Hn Information Systems Inc. Method and data processing system providing checkpoint/restart across multiple heterogeneous computer systems
US7076692B2 (en) * 2001-08-31 2006-07-11 National Instruments Corporation System and method enabling execution stop and restart of a test executive sequence(s)
US6892320B1 (en) * 2002-06-03 2005-05-10 Sun Microsystems, Inc. Method and apparatus for providing multiple-version support for highly available objects
WO2004081762A2 (en) * 2003-03-12 2004-09-23 Lammina Systems Corporation Method and apparatus for executing applications on a distributed computer system
US8190780B2 (en) * 2003-12-30 2012-05-29 Sap Ag Cluster architecture having a star topology with centralized services
US20050188068A1 (en) * 2003-12-30 2005-08-25 Frank Kilian System and method for monitoring and controlling server nodes contained within a clustered environment
US7275183B2 (en) * 2004-04-30 2007-09-25 Hewlett-Packard Development Company, L.P. Method of restoring processes within process domain
US8181162B2 (en) * 2004-06-14 2012-05-15 Alcatel Lucent Manager component for checkpoint procedures
US7895528B2 (en) * 2004-08-05 2011-02-22 International Business Machines Corporation System and method for reversing a windows close action
US7987225B2 (en) * 2004-12-22 2011-07-26 International Business Machines Corporation Method for remembering resource allocation in grids
US7478278B2 (en) * 2005-04-14 2009-01-13 International Business Machines Corporation Template based parallel checkpointing in a massively parallel computer system
US7552306B2 (en) * 2005-11-14 2009-06-23 Kabushiki Kaisha Toshiba System and method for the sub-allocation of shared memory
US20070174697A1 (en) * 2006-01-19 2007-07-26 Nokia Corporation Generic, WSRF-compliant checkpointing for WS-Resources
JP5595633B2 (en) * 2007-02-26 2014-09-24 スパンション エルエルシー Simulation method and simulation apparatus
US8918490B1 (en) * 2007-07-12 2014-12-23 Oracle America Inc. Locality and time based dependency relationships in clusters
US8078421B1 (en) 2007-12-19 2011-12-13 Western Digital Technologies, Inc. Multi-cell disk drive test system providing a power recovery mode
US8145944B2 (en) * 2009-09-30 2012-03-27 International Business Machines Corporation Business process error handling through process instance backup and recovery
US9207987B2 (en) * 2010-01-15 2015-12-08 Oracle International Corporation Dispersion dependency in oracle clusterware
US8949425B2 (en) * 2010-01-15 2015-02-03 Oracle International Corporation “Local resource” type as a way to automate management of infrastructure resources in oracle clusterware
US9098334B2 (en) * 2010-01-15 2015-08-04 Oracle International Corporation Special values in oracle clusterware resource profiles
US9069619B2 (en) * 2010-01-15 2015-06-30 Oracle International Corporation Self-testable HA framework library infrastructure
US20110179173A1 (en) * 2010-01-15 2011-07-21 Carol Colrain Conditional dependency in a computing cluster
US9727421B2 (en) * 2015-06-24 2017-08-08 Intel Corporation Technologies for data center environment checkpointing

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4648031A (en) 1982-06-21 1987-03-03 International Business Machines Corporation Method and apparatus for restarting a computing system
US5214778A (en) 1990-04-06 1993-05-25 Micro Technology, Inc. Resource management in a multiple resource system
US5319774A (en) 1990-05-16 1994-06-07 International Business Machines Corporation Recovery facility for incomplete sync points for distributed application
US5884021A (en) * 1996-01-31 1999-03-16 Kabushiki Kaisha Toshiba Computer system having a checkpoint and restart function
US5845292A (en) * 1996-12-16 1998-12-01 Lucent Technologies Inc. System and method for restoring a distributed checkpointed database
JP3253883B2 (en) * 1997-01-31 2002-02-04 株式会社東芝 Process restart method and process monitoring device
US6154877A (en) * 1997-07-03 2000-11-28 The University Of Iowa Research Foundation Method and apparatus for portable checkpointing using data structure metrics and conversion functions
US6360331B2 (en) * 1998-04-17 2002-03-19 Microsoft Corporation Method and system for transparently failing over application configuration information in a server cluster
US6330689B1 (en) * 1998-04-23 2001-12-11 Microsoft Corporation Server architecture with detection and recovery of failed out-of-process application
US6351754B1 (en) * 1998-06-23 2002-02-26 Oracle Corporation Method and system for controlling recovery downtime
US6289474B1 (en) * 1998-06-24 2001-09-11 Torrent Systems, Inc. Computer system and process for checkpointing operations on data in a computer system by partitioning the data
US6338147B1 (en) * 1998-10-29 2002-01-08 International Business Machines Corporation Program products for performing checkpoint/restart of a parallel program

Also Published As

Publication number Publication date
US6594779B1 (en) 2003-07-15

Similar Documents

Publication Publication Date Title
US6594779B1 (en) Method, system and program products for managing the checkpointing/restarting of resources of a computing environment
JP4307673B2 (en) Method and apparatus for configuring and managing a multi-cluster computer system
US6983321B2 (en) System and method of enterprise systems and business impact management
US6393485B1 (en) Method and apparatus for managing clustered computer systems
US7058853B1 (en) Highly available transaction processing
US7366742B1 (en) System and method for distributed discovery and management of frozen images in a storage environment
US9514160B2 (en) Automatic recovery of a failed standby database in a cluster
KR100423687B1 (en) Cascading failover of a data management application for shared disk file system in loosely coupled node clusters
US7827438B2 (en) Distributed testing system and techniques
US6789257B1 (en) System and method for dynamic generation and clean-up of event correlation circuit
US6789114B1 (en) Methods and apparatus for managing middleware service in a distributed system
US6782408B1 (en) Controlling a number of instances of an application running in a computing environment
US20120151272A1 (en) Adding scalability and fault tolerance to generic finite state machine frameworks for use in automated incident management of cloud computing infrastructures
US7770057B1 (en) System and method for customized disaster recovery reports
US7093013B1 (en) High availability system for network elements
US8751856B2 (en) Determining recovery time for interdependent resources in heterogeneous computing environment
KR20010109090A (en) Method of, system for, and computer program product for providing a job monitor
US6735772B1 (en) System and method for handling orphaned cause and effect objects
CA2241861C (en) A scheme to perform event rollup
Maloney et al. A survey and review of the current state of rollback‐recovery for cluster systems
CA2619778C (en) Method and apparatus for sequencing transactions globally in a distributed database cluster with collision monitoring
US8595349B1 (en) Method and apparatus for passive process monitoring
Jing-fan et al. Policy driven and multi-agent based fault tolerance for Web services
Laranjeira NCAPS: Application high availability in UNIX computer clusters
Putman General framework for fault tolerance from ISO/ITU Reference Model for Open Distributed Processing (RM-ODP)

Legal Events

Date Code Title Description
EEER Examination request
FZDE Dead