US20080222296A1

US20080222296A1 - Distributed server architecture

Info

Publication number: US20080222296A1
Application number: US12/044,775
Authority: US
Inventors: Lisa Ellen Lippincott; Peter James Lincroft; Peter Benjamin Loer; John Edward Firebaugh; Dennis Sidney Goodrow
Original assignee: BigFix Inc
Current assignee: International Business Machines Corp
Priority date: 2007-03-07
Filing date: 2008-03-07
Publication date: 2008-09-11
Also published as: US7962610B2; US20080228442A1

Abstract

A method and apparatus for synchronizing network element state when a network connection between a plurality of servers is restored after a network failure includes a plurality of objects that exist within the network. Each object exists in a plurality of different versions, in which each said different object version results from modifications to an object made by different servers during the network failure when the servers are unable to communicate with each other but otherwise continue to function. Each object comprises a vector including a separate version number for each server, in which each server increments its version number in the vector when it modifies the object. An automatic conflict resolution mechanism provides, at each server, a most up to date view of all objects across all of said plurality of servers upon restoration of the network connection between said plurality of servers after said network failure. The conflict resolution mechanism reconciles the existence of said plurality of different versions of an object to determine which object version should take precedence over other object versions. Conflict resolution is performed when there are multiple versions of a same object at a server. The conflict resolution mechanism also comprises at least one tie breaking rule that is applied to decide which servers take precedence over other servers when determining which object version should take precedence over other object versions.

Description

CROSS REFERENCE TO RELATED APPLICATIONS

This application claims priority to U.S. patent application Ser. No. 60/893,528, filed Mar. 7, 2007, which application is incorporated herein in its entirety by this reference thereto.

BACKGROUND OF THE INVENTION

1. Technical Field
The invention relates to computer networks. More particularly, the invention relates to a high availability network.
2. Description of the Prior Art
One basic disaster recovery scenario for users of a network management system, such as the BES system (BigFix, Inc., Emeryville, Calif.) has been based on the distributed nature of various agents, which know enough about their state and history that, even if a system server is completely lost, the server can simply be replaced with a fresh server, which quickly rebuilds information about all the clients in the system and regains the ability to issue such management objects as fixlets, tasks, etc.
Users who take their disaster recovery planning seriously also make regular backups of their server so that it can be restored without losing custom content, the action audit trail, etc. Some users maintain a live backup server that can be switched to immediately in the event of a failure on their primary server.
However, for the users with the most demanding requirements, simple disaster recovery is not good enough. These users have critical infrastructure availability policies that require the continuing and transparent availability of services, even in the presence of severe infrastructure outage or loss.
It would be advantageous to provide a solution that allows such users to transition automatically to a hot backup if their primary server should fail, but even this capability would not necessarily meet all of their availability needs.
The destruction of the World Trade Center has taught us that we no longer have the luxury of assuming that because something catastrophic has not happened that it will not happen. This event, and other smaller events, such as the East Coast blackout in the summer of 2003, have taught us that core infrastructure components must be self-healing because, in such events, administrators who know the system may not be readily available, and those who remain are under enormous stress. Even if used exclusively for patch management, a network management system, such as the BES system, is a critical piece of Windows infrastructure. One cannot deploy such a critical infrastructure component if it does not meet the requirements described below.
A typical distributed computing environment, for example, might be segmented into three regions: Americas (NA), Europe (EMEA), and Asia Pacific (APAC). Each regional hub office might support multiple branch offices. Each regional hub office might have a primary and at least one secondary data center. In addition, certain larger branch offices might also have multiple data centers, but would generally rely on the hub office for infrastructure services.
Intra-regionally, there would be connectivity between primary and secondary data centers, and all data centers would have connectivity to NA.
It would be advantageous to provide a solution that met the following requirements for such system.
Availability requirements for this system are:
1. The system must operate with no loss of functionality in the event of loss of connectivity between any or all of the hubs.
2. The system must operate with no loss of functionality in the event of loss of connectivity between any Primary Data Center (PDC) and its corresponding Secondary Data Center (SDC).
Administrative requirements are:
1. Under normal circumstances, the system must offer a single administrative interface: tasks, actions, and configuration changes may be made in only one place and flow down through the system. In a business continuityevent, the administrative interface must fail over to an appropriate server given the circumstances of the event.
2. The system must provide full regional and global administration through the same user interface. Actions, etc., which are to be deployed to a regional subset of machines, must apply to the machines in that region only, while global actions, etc. must apply to all machines globally.
3. The system must offer a unified reporting interface: reporting data must be aggregated and presented through a single interface. In a business continuity event, reporting data must be available regionally, with data for as many of the other regions as possible given the circumstances of the event.
Operational and Support Requirements are:
1. Failover and failback must not cause adverse impact on the system: clients must continue functioning normally, data must remain consistent, administrative changes must be transactional so that they can be rolled back in the event of a failure while making a change. Failover must not result in a sharp increase in processing load, which could cause cascading failures or slow the system significantly.
2. Data replication must be implemented in such a way that either no data inconsistencies are possible, or data inconsistencies are corrected automatically within a reasonable time.
3. High Availability functionality must be supported natively in the management system, without the use of third party add-on products, such as Doubletake or Replistor
4. Failover and failback must be automatic and as real time as possible. These may not require the manipulation of domain name server (DNS) entries or external configuration files. The system must sense when a failover is required, or when failback is possible. Manual failover/failback is an acceptable interim step until a future version provides automated failover.
5. Notification (SNMP, Windows Event Log, MOM, etc.) must be provided when failover/failback occurs.
6. The system operations team must be able to manually failover to perform routine maintenance on parts of the system. They must be able to freeze the failed over configuration to prevent silly failback syndrome, i.e. an intermittent connection that causes peer servers to spend all their time performing failover operations followed by failback operations, etc., if a server is unstable, and failback when the maintenance is concluded. These changes must be enabled through an administrative interface, and not through external configuration files or DNS changes.
7. Clients must use the most local server when the move to another part of the plant. When clients roam, the system must prevent duplicate records being produced for those clients.

SUMMARY OF THE INVENTION

An embodiment of the invention provides a method and apparatus for synchronizing network element state when a network connection between a plurality of servers is restored after a network failure. A plurality of objects exist within the network, each object existing in a plurality of different versions, in which each said different object version results from modifications to an object made by different servers during the network failure when the servers are unable to communicate with each other but otherwise continue to function. Each object comprises a vector including a separate version number for each server, in which each server increments its version number in the vector when it modifies the object. An automatic conflict resolution mechanism provides, at each server, a most up to date view of all objects across all of said plurality of servers upon restoration of said network connection between said plurality of servers after said network failure. The conflict resolution mechanism reconciles the existence of said plurality of different versions of an object to determine which object version should take precedence over other object versions. Conflict resolution is performed when there are multiple versions of a same object at a server. The conflict resolution mechanism also comprises at least one tie breaking rule that is applied to decide which servers take precedence over other servers when determining which object version should take precedence over other object versions.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block schematic diagram that shows three interconnected top level servers according to the invention;

FIG. 2 is a block schematic diagram that show a traditional hierarchical control system having a root server and a single point of failure;

FIG. 3 is a block schematic diagram that shows that application of tie breaking rules pursuant to object synchronization according to the invention; and

FIG. 4 is a block diagram showing an advisor viewpoint as described in U.S. Pat. No. 7,277,919.

DETAILED DESCRIPTION OF THE INVENTION

An embodiment of the invention comprises a distributed server architecture that is useful for situations where a network infrastructure fails or an entire server infrastructure is damaged. The invention operates in a fail over system that provides an approach that is adaptable. In such system, when certain components go down, the network and other components within the network detect that components that they depend upon are gone. In such case, the system and remaining components seek out other components to provide the same services that had been provided by the damaged or missing components. The invention operates in a system that allows the network and the remaining components to connect together in ways that route around the problem area, for example, if a top level server goes down, in which case the remaining components migrate their needs to other servers.
The invention solves various technical challenges. One of these challenges is that changes to the state of components in the network are introduced through more than one of the servers. For example, if there are “N” top level servers all running, and they are dealing with load characteristics: some of them are vanishing; some of them are coming back; some of them are being taken offline because they are being backed up, and then they are being put back into service, etc., then change is happening while they are gone, i.e. while they are disconnected. Change can be coming in from a variety of different directions. Change can be coming the consoles in the sense that new actions are being created, policies are being deployed, fixlets are being created, and new properties are being set up to be measured by agents. Such change is referred to herein as being instantiated in various objects, for example as are exchanged in a network management system, such as the BigFix Enterprise Suite (BigFix, Inc., Emeryville, Calif.).
The independent creation of these objects when a particular server is unavailable means that later, when that server is back up again, it is necessary to perform a conflict resolution to bring the server into a current state, i.e. the most up to date view of all the objects of all of the operators across all of the top level servers. Key to the invention is a technique for organizing a system that can resolve such conflicts and address the fact that multiple versions of an object are in existence. The invention resolves conflicts as to whether one version of an object should take precedence over the others, and addresses such conflicts as when servers that have been disconnected from each other are then reconnected later.
FIG. 1 is a block schematic diagram that shows three interconnected top level servers 20, 23, 26, respective relays 21, 22, 24, 25, 27, 28 connected to the servers, and individual groups of clients 31-36 connected through the relays to the servers. There are consoles (not shown) that are connected to the servers, and that can change the system state as well. Thus, the clients are sending change into the system, in the form of new and/or revised objects, and they are reporting new values and properties, e.g. in the BES system, developments and/or fixlets that are becoming relevant or becoming irrelevant. The servers must communicate with each other to keep up to date on all of this information. One client reports in through a hierarchy to one server and that information must be replicated to the other servers in a fashion that, even if that client moves to a different level of servers, the most up to date version of the information is maintained across the entire infrastructure. In this embodiment, the sources of information are the clients, relays, servers, and consoles. Note that FIG. 1 shows a break in the communication links to server S_s(23), as indicated by a large “X” on each line into the server. The invention herein is particularly well suited for maintaining and restoring a consistent state in a network where, as in FIG. 1, a server or similar logical element in the network is isolated, yet continues to function otherwise.
FIG. 2 is a block schematic diagram that shows a traditional hierarchal control system having a root server 10 as a single point of failure. FIG. 1 shows a distributed server architecture having a variety of elements, such as relays 11, 12, and client devices 14-19. Some of these elements can vanish and the system can manage around it. In the traditional system, some of the elements have redundancy, e.g. the relays have this property. If the relay gets hit by a meteor or goes down or becomes unavailable to the client, the client goes and finds another relay. That is, it finds a new path by itself by reaching out and probing the network up to the server and thus stays connected. However, with a single server there is a single point of failure. If that server is unavailable by some path to a client, then the client is not controllable. If the server goes offline for whatever reason, the system is unmanageable.
A key aspect of a distributed server architecture is its ability to have multiple top level servers (see FIG. 1). Each server has information flowing up to it, each in the same way that they would previously have in the traditional system. The servers communicate with each other on a peer basis and exchange all the information that flows up to them. There is no strict hierarchy for these servers, or it would be necessary to go back and forth between servers, based upon a scheme imposed as a result of such hierarchical relationship. In the inventive system, the client can go back and forth between relays and different servers. These elements are all perceived as one network until there is a failure.
There are two important types of failure such system tolerates:
One major type of failure occurs when a server goes away, in which case the relays reach out to another server, such that everything is still controllable. In that case, the relays maintain some information about fail over servers that they can go to so when they lose contact with one server, i.e. the relays are aware of the set of servers in the network. They know who they are going to currently; otherwise, they find whichever server is closest to their network.
A second major type of failure that the system tolerates, and to which the invention is specifically directed, is a failure in which a region loses connectivity, e.g. a connection between North America and Europe is broken, and there are now two separated but fully functional networks (see FIG. 1, for example). That is, operators in London, for example, can continue to connect to the system and apply patches and have visibility and manage the part of their network that they are still connected to Everyone in North America can also continue to manage a set of computers. However, it is not possible to manage computers across the gap. In a large enterprise, local organizations must be able to continue to operate. One of the major technical challenges when the network splits and regions lose connectivity is that there are now two networks. Changes of state continue to occur in each of the two networks. From the time of the split, they are diverging in their sense of the state of the world. When the connection comes back, up one of the first things that must happen is that policies created in each region must be reconciled with each other as a key part of a recovery scenario.
An embodiment provides a scheme for taking every policy that is created during the interval that connectivity is lost and, when the connection is restored, determine which policies should supersede which other policies. For purposes of the discussion herein, a policy may be thought of as an object, as discussed above. The invention provides this functionality automatically without any manual intervention. Each client knows what to do once it starts seeing policies flowing from all different parts of the network. When presented with the same content, even if it was delayed, any particular client arrives at the same conclusion as to what set of policies are the right ones to which attention should be paid.
Key to the invention is the reconciliation that occurs when the various parts of the network start working together again after a connection is lost. The network elements automatically come back together and they all resolve to the same state. They all decide upon the correct set of information that is the correct operational set. Thus, while a fail over scheme or autonomous operation may be known in connection with the first failure type discussed above, the invention provides a scheme that determines how to get the network elements back together again in a synchronized fashion, and thus addresses heretofore unresolved failure type two above.
It should be noted that the problem with lack of consistency between servers exists all the time because there is latency in a network. Each server has a different view of the world. The more that the servers drift apart in time, the greater number of conflicts that can exist, and the harder it is to resolve them. Thus, a loss of connectivity, as discussed above, is an extreme case and it is this case that the invention herein is particularly well suited to remediate. Note that the term server, as used herein, does not refer only to a server in the generic sense of a server. Rather, a server as a collection of policy and a collection of state. Thus, in a connection failure the policy library for each one of the servers drifts farther and farther apart from each other server in time as their separation increases in time.
FIG. 3 is a block schematic diagram that shows that application of tie breaking rules pursuant to object synchronization according to the invention. The tie breaking rules are discussed below. To address the problem synchronizing network components upon restoration of a lost connection, an embodiment of the invention provides a mechanism that stamps every object with information about the objecf's origin, i.e. what server it originated from, e.g. in FIG. 3, SERVER₀(20), SERVER₁(26), or SERVER₂(23). The mechanism also stamps each object with information regarding what version of the object it is. FIG. 3 shows an object 40, 42 which, in this example is a fixlet. An object can be modified over time and a new version can be time stamped into it. The object's total version number is both its server, i.e. the server that modified it, as well as the version number of the object itself. Conflict resolution is performed when they are multiple versions of an object at a server. The server must determine which set of objects it should be working with. An embodiment of the invention provides a technique for deciding which the objects are the correct objects.
An embodiment of the invention comprises a conflict resolution technique that allows for multiple instances of a object to be presented to any particular piece of software, and that piece of software can make a decision concerning the correct object to use. One approach to doing this, e.g. time stamping the object with time, does not solve the problem that occurs because time is not correctly kept across a network, i.e. local time is not typically consistent across a network.
The invention addresses the synchronization problem from a different perspective. For conflicts, the invention provides a mechanism that applies tie breaking rules that are applied to decide which servers take precedence over other servers. In FIG. 3, each object 40, 42 is shown to maintain a version number. When any server 20, 23, 26 replaces an object or updates an object, it changes the object's version number. Objects have many versions from many servers. The invention provides a vector 41 of version number per server. The first element of the vector is associated with server number zero 20. The second element of the vector is associated with server number one 26, and so on. Whenever any particular server, e.g. server 20, decides it wants to update an object, e.g. object 40, it takes the existing many version of the object that it has and it finds its position in an array of vectors 41. It then increments the object's version number in the vector. This increments the object's version number. With reference to FIG. 3, the object 40 associated with one server 20 has a manyversion number 1,3,0. That manyversion number can be compared with the version number for the object at every other server, e.g. the object 42 associated with the server 23 has a manyversion number 1,2,1. A rule for comparison allows the server to decide which version of the object is going to take precedence based upon which server takes precedence.
For example, if only one of the version numbers has been modified, then all of the servers agree that there is only one higher number in one of the vector elements. That version takes precedence because there is no conflict. The complexity comes in if one server incremented its version number for an object and another server also incremented its version number for the same object. In this case, the invention applies a very simple tie breaker rule: the object version having the lower number server wins. In the example of FIG. 3, the server having the lower number, SERVER₀, has an object that it has modified and it has indicated this in the vector, e.g. in the manyversion number 1,3,0 the 1 indicates that the server has modified the object. Thus, even though the object 42 associated with SERVER₂is also modified, the tie breaker rule gives precedence to the object stamped by SERVER₀.
Accordingly, a presently preferred embodiment provides multiple vector versions of an object. Each server has a position in that vector. When an object is changed, the server takes a previous version of the object, increments its portion of the vector, and puts the vector in a new object. The object is then propagated into the network with the version number stamped into it. If a conflict is encountered, the version number resolution tie breaking rules decide which object should be used by each server.
During the time when there is no communication between certain of the servers, a certain number of objects may be propagated on each side of a communications break and there may be some inconsistencies. Over time as the systems starts recommunicating and the servers start applying the tie breaking rules, these inconsistencies disappear. This technique can be thought of as collapsing the objects versions to the lowest numbered server object version.
The object information is used to serialize the objects in a signed directory listing, i.e. a directory of all the objects and their version number. The signed directory listing is created at each server because each server is aware of all versions of the objects that it sent. If the conflict resolution mechanism indicates that the server has a definitive list, it publishes its version of the object. If the server can not do this because there is another version of that object that exists in a newer state than the server it has, then the server also has a directory listing from a different server that contains the object number identifier and its version. The server combines these directory listings to create a super set directory listing that contains all the information needed for conflict resolution with regard to the object in question. In this embodiment, the listing is the vector 41 shown in FIG. 3. It is also desirable to provide a technique to authenticate the super set directory listing to assure that it is valid.
An embodiment of the invention is concerned with a distributed computed relevance messaging system because there is a distribution of objects across many servers across a network. Relevance based computing is disclosed, for example, in Donoho, D. et al., Relevance clause for computed relevance messaging, U.S. Pat. No. 7,277,919 (issued Oct. 2, 2007). In such system: “a collection of computers and associated communications infrastructure to offer a new communications process . . . allows information providers to broadcast information to a population of information consumers. The information may be targeted to those consumers who have a precisely formulated need for the information. This targeting may be based on information which is inaccessible to other communications protocols. The targeting also includes a time element. Information can be brought to the attention of the consumer precisely when it has become applicable, which may occur immediately upon receipt of the message, but may also occur long after the message arrives. The communications process may operate without intruding on consumers who do not exhibit the precisely-specified need for the information, and it may operate without compromising the security or privacy of the consumers who participate.” (Abstract)
An embodiment of the invention herein provides a scheme for maintaining consistency among the different servers in a distributed system by stamping the objects based on the version number. A time stamp may also be appended to the object in this embodiment. A stamp by the server that originates the object may also be added to the object. In the event that there is an inconsistency, the tie breaking rule discussed above are applied, i.e. the lowest numbered server controls. Other embodiments can apply a different tie breaking scheme. The invention thus contemplates the generic use of tie breaking in the event of conflict between object versions. Instead of a lowest numbered server taking precedence, a particular region, e.g. the home office region, could take precedence, or other metadata could be added to the object vector that is used to determine precedence in the case of version conflicts. Thus, any predetermined tie breaking algorithm or set of rules may be used for tie breaking. Based on tie breaking, the invention allows a network to maintain a definitive version of each object. Over time, that definitive version is propagated throughout the network to restore consistency and to maintain consistency, especially where portions of the network operate for a period of time autonomously, for example due to a lost connection.
With regard to the directory listings of objects with version numbers stamped into them, the client can determine if is has a proper set by verifying a digital signature on the list. Because a digital signature is placed on every policy (object), a server can not change the policy in some way due to reconciliation because the server itself has no rights to issue a digital signature, and therefore no rights to override the signature on the policy. Only the originating operator can issue that signature. Because of the signed policy, it is necessary to have a mechanism where reconciliation can happen down to the end points (clients). A client must be able to determine what to do with an object while it waits for an operator to approve an action and stamp a new signature on the new, reconciled object.
There are different forms that any particular object can take. There is a database form that objects take. There is a serialized form, where the objects show up in a message that is communicated to other components. The version information is stored according to the object type. In a fixlet definition table (see FIG. 3), there is a column that defines a body with relevance and a many version. Each one of these elements is a version number of a particular server that represents a portion of the vector manipulated by the server, as discussed above. Each such index is implemented independently of the others by a respective server. Each server takes its own element and increments it, i.e. any particular server increments only its own version number. The left most number 0001 (in table 41 on FIG. 3) is the position for SERVER₀, the middle number is the position for SERVER₁, and the right number is the position for SERVER₂. Whenever SERVER₂modifies an object, it takes its version number in the right number position and increments it. It leaves all the other version numbers alone so that any software that is confronted with that new version number can do a comparison and decide which server did what to the object.
If the other versions of the object are propagated to SERVER₀, such that SERVER₀is aware that it is editing the highest numbered version of the object that had actually been edited on SERVER₁then when it goes to edit the object, it is actually editing the correct version of the object because it has incorporated all of the previous changes. In this case, the version number is incremented in the position for SERVER₀. All of the other changes resolve themselves underneath that. Thus, an automatic resolution takes place at the end of the process.
Relevance
Relevance based computing is disclosed, for example, in Donoho, D. et al, Relevance clause for computed relevance messaging, U.S. Pat. No. 7,277,919 (issued Oct. 2, 2007). In such system: “a collection of computers and associated communications infrastructure to offer a new communications process . . . allows information providers to broadcast information to a population of information consumers. The information may be targeted to those consumers who have a precisely formulated need for the information. This targeting may be based on information which is inaccessible to other communications protocols. The targeting also includes a time element. Information can be brought to the attention of the consumer precisely when it has become applicable, which may occur immediately upon receipt of the message, but may also occur long after the message arrives. The communications process may operate without intruding on consumers who do not exhibit the precisely-specified need for the information, and it may operate without compromising the security or privacy of the consumers who participate.” (Abstract) One network architecture that embodies such relevance based messaging system is the BigFix Enterprise Suite™ (“BES”; BigFix, Inc., Emeryville, Calif.), which brings endpoints in such system under management by installing a native agent on each endpoint.
An embodiment of the invention resides in a management system architecture that comprises a management console function and one or more agents, in communication with the management console function, either directly or indirectly, and which perform a relevance determination function. Relevance determination (see FIG. 4), for example, for targeted solution delivery 41, is carried out by an applications program, referred to as the advice reader 42 which, in the prior art (see U.S. Pat. No. 7,277,919) runs on the consumer computer and may automatically evaluate relevance based on a potentially complex combination of conditions, including:

- Hardware attributes. These are, for example, the type of computer on which the evaluation is performed, the type of hardware configuration 43, the capacity and uses of the hardware, the type of peripherals attached, and the attributes of peripherals.
- Configuration attributes. These are, for example, values of settings for variables defined in the system configuration 40, the types of software applications installed, the version numbers and other attributes of the software, and other details of the software installation 44.
- Database attributes. These are, for example, attributes of files 48 and databases on the computer where evaluation is performed, which may include existence, name, size, date of creation and modification, version, and contents.
- Environmental attributes. These are, for example, attributes that can be determined after querying attached peripherals to learn the state of the environment in which the computer is located. Attributes may include results of thermal, acoustic, optical, geographic positioning, and other measuring devices.
- Computed attributes. These are, for example, attributes that can be determined after appropriate computations based on knowledge of hardware, configuration, and database and environmental attributes, by applying specific mathematico-logical formulas, or specific computational algorithms.
- Remote attributes 49. These are, for example, hardware, configuration, database, environmental, and computed attributes that are available by communicating with other computers having an affinity for the consumer or his computer.
- Timeliness 45. These are, for example, attributes based on the current time, or a time that has elapsed since a key event, such as relevance evaluation or advice gathering.
- Personal attributes. These are, for example, attributes about the human user(s) of the computer which can either be inferred by analysis of the hardware, the system configuration, the database attributes, the environmental attributes, the remote attributes, or else can be obtained by soliciting the information directly from the user(s) or their agents.
- Randomization 46. These are, for example, attributes resulting from the application of random and pseudo-random number generators.
- Advice Attributes 47. These are, for example, attributes describing the configuration of the invention and the existence of certain advisories or types of advisories in the pool of advice.

In this way, whatever information is actually on the consumer computer or reachable from the consumer computer may in principle be used to determine relevance. The information accessible in this way can be quite general, ranging from personal data to professional work product to the state of specific hardware devices. As a result, an extremely broad range of assertions can be made the subject of relevance determination. In connection with an embodiment of the invention herein, any number of versioned objects are passed across the distributed network architecture, where such objects are useful for learning about managed devices, remediating such devices, and enforcing policies in connection with such device. This latter aspect of this embodiment is of critical concern in the event of a communications failure across a distributed network. In such situation, the various network segments continue to function autonomously and, in doing so, they create, spawn, and implement policies based upon relevance. Policy compliance may diverge during a communications failure. A critical aspect of this embodiment assures that relevance-based objects are propagated across a distributed network when the network recovers from such failure, such that a consistent policy is implemented across the entirety of the distributed network.

AN EMBODIMENT

The following describes an embodiment of the invention implemented in the BigFix BES product (BigFix, Inc., Alameda, Calif.). Those skilled in the art will appreciate that other embodiments of the invention are possible. Accordingly, the following discussion is provided as an example of the invention, and not as a limitation of the invention.
Object ID Allocation
ActionSite ID space is carved up to allow each server to create new IDs without risk of collision. One option is to use 24 bits for the ID+7 bits for the server ID+1 bit reserved to accommodate an action analysis ID. This allows for 128 servers and 16.7 million objects per server. Alternative embodiments provide a client having a 64-bit ID space, which remove the 128 and 16.7 million limitations. Namely, the client uses a 64-bit integer for fixlet IDs.
Multidimensional Version Numbers
Description
To merge independently-edited versions of sites together, one embodiment uses version numbering that represents changes multi-dimensionally, with one dimension for each server. Maintaining a version number sequence on a single server is practical, while allocating version numbers across servers is presently thought to be more difficult
For this discussion, it is assumed that servers are lettered consecutively, starting at A. For example, the first item is put into a site on server C. That version of the site is given the number (0,0,1). The “1” is in the third place because the change happened on server C. That change is copied to server A, and a further change is made there. The new version is (1,0,1). When making a change, a console takes the version of the site it knows about, and increments the place corresponding to its server. We know (1,0,1) completely supersedes (0,0,1) because it is at least as large in every place.
Suppose a change is made on server B, which has received (0,0,1) but not (1,0,1). The number of this change is (0,1,1). It supersedes (0,0,1), but conflicts with (1,0,1): neither (0,1,1) nor (1,0,1) is bigger in every place, although (1,0,1) just supersedes (0,1,1). When that happens, a rule is used to form a new, merged version (1,2,1).
To merge, it is necessary to know, for each file, the version in which the file was created (VC) and the version in which the file was last modified (VM). To keep things simple, there is a check that for each file in a list, VC<=VM<=version of the list.
The system also uses one list to win ties. One embodiment looks for the first place where the version numbers differ, and resolves any ties in favor of the list that is higher in that place. In the case of (1,0,1) vs. (0,1,1), the former wins ties.
When a file appears in both lists: Compare the modification versions. If one is larger, pick that one. Otherwise, go with the tie-winning list.
Other comparisons can be performed to distinguish various ways the conflict may have arisen, such as:

- add-vs-add;
- modify-vs-modify; and
- modify-vs-delete+add.

But we still just pick the tie-winning list.
When a file appears in only one list, let L be the version of the list from which the file is absent. If L>VM, the file was known about by the producer of L, and deleted. In this case the file is left out of the merged version.
Otherwise, if L>VC, the file is deleted on one branch, but modified on the other.
In this case, select the tie-winning list.
Otherwise, the producer of L had never seen the file. It was added on the other branch. In this case, keep the file in the merged list.
Implementation
The following describes one implementation of the invention. Those skilled in the art will appreciate that other, equally effective implementation are possible within the scope of the invention.
Multidimensional version numbers (“ManyVersions”) are represented in a database as varbinary(512). Their interpretation is as a series of 4-byte big-endian numbers representing the components of a ManyVersion. Trailing zeros are omitted. Thus, if N servers have been deployed, the typical ManyVersion value is 4*N bytes long. For example, (0,0,1) is represented as:
0x00000000 00000000 00000001
(spaces added for readability).
User-defined SQL functions are created as necessary to support comparisons and arithmetic. Each server maintains a “MaxManyVersion” value, stored as a column in the single-row DBINFO table. When an object is created in the database, the MaxManyVersion value is used to set the creation ManyVersion of the object, and the MaxManyVersion value is bumped by incrementing the current server's version component. When data from other databases are replicated, the modification ManyVersion of each object being inserted is compared to the current MaxManyVersion, and if any of its components are greater, then that component of the MaxManyVersion is increased to match it.
The VERSIONS table and Version columns are altered to use ManyVersions.
VERSIONS.CreationVersion holds the creation version and VERSIONS.
LatestVersion holds the modification version.
For external content, all version columns contain the empty ManyVersion (“0x”). When sites are propagated, the directory listing is stamped with the current MaxManyVersion (which then gets bumped), and the individual entries are stamped with the creation and modification ManyVersions of each object.
Sequence-Based Row Retrieval
One embodiment implements a replication mechanism for database rows based on sequence numbers. The basic mechanism for sequence-based row retrieval is common to many of the database tables described in the next section. Each table that participates in sequence-based replication must have the following columns:
Sequence
A rowversion column which is filled with a unique sequence number that monotonically increases whenever a row is inserted or updated.
OriginServerID
Used to indicate the server on which the row originated. The exact definition of origination depends on the table.
OriginSequence
Of type bigint, it is used to hold the sequence number from the original server for rows that have been replicated. Conceptually, for rows residing on their origin server, this column should be equal to the Sequence column. This may be hard to implement, so instead one may define an OriginSequence value of NULL to mean “this is the origin server, so use the Sequence column to get the OriginSequence value”. Each server keeps a sequence number for each other server, indicating the last row it was up-to-date with. To request an update from a server, all of the numbers are presented.
For example, if a request asking for rows beyond “Server 0: 100, Server 1: 50, Server 2: 70” is sent to server 0. In pseudo-SQL, the query looks like:


	select isnull(OriginServerID, 0 )as OriginServerID,
	isnull(OriginSequence, Sequence) as OriginSequence,
	other replicated columns
	from table
	where Sequence between 100 and current max sequence
	and not( OriginServerID is null and Sequence <= 100 )
	and not( OriginServerID = 1 and OriginSequence <= 50 )
	and not( OriginServerID = 2 and OriginSequence <= 70 )

The latter tests come into play when data may travel over multiple paths; they prevent replay and keep data from traveling in circles.
Note that this query depends on the number of servers and the server from which data are requested. It must be dynamically generated by the client server.
Conflict Notification and Resolution
It is necessary to decide what level of support is provided for notifying users of conflicts arising during replication. One embodiment adds a Boolean ManyVersionConflict flag column to the VERSIONS table which is set when conflicts are detected. The ManyVersionConflict column may either be too little information, or more than is used. Instead of setting the ManyVersionConflict column as described above, does zero or more of the following:
1. Log conflicts and do nothing else.
2. Make entries in a separate “CONFLICTS” table in the database that contains the ID of the object in question, and the two versions which conflict.
3. Log conflicts in the COMMENTS tables, where appropriate.
Domain Objects
Replication of the ActionSite domain objects, e.g. actions, Fixlets, etc., must be considered at both the database table level and the domain object level because, for certain types of object, special conflict resolution rules or changes to the existing object format are necessary.
Table-Level Synchronization

- PROPERTIES
- TEXTFIELDS
- BLOBFIELDS
- VERSIONS
- ACTIONS

PROPERTIES, TEXTFIELDS, and BLOBFIELDS are retrieved via a basic sequence based replication scheme. Only ActionSite rows are replicated. The preferred embodiment relies on GatherDB on each server to update external content.
VERSIONS are replicated as a byproduct of PROPERTIES replication. In addition to the relevant PROPERTIES columns, the retrieval query includes the CreationVersion column from VERSIONS. During insertion of the retrieved PROPERTIES rows, if the corresponding VERSIONS row does not exist, it is inserted using the retrieved CreationVersion and Version. If a VERSION row does exist, the LatestVersion column is updated whenever the Version column of the replicated row is greater (under the total order) than the existing LatestVersion. Otherwise, the object is replicated, but the VERSION table is not modified.
ACTIONS are replicated as a byproduct of PROPERTIES replication. Whenever an action PROPERTIES row is retrieved that does not have an existing VERSIONS entry, the appropriate row is inserted in the ACTIONS table. During the process of replicating these tables, the MaxManyVersion column in the DBINFO table is also updated such that when replication is complete, each MaxManyVersion component is greater than the maximum corresponding component from all objects.
Content-Specific Changes
The following types of content need changes to make them compatible with the database replication mechanism. It is assumed that action site initialization happens only on the main server; other servers just replicate the initialized data. Therefore, content that is created on initialization and never modified is automatically compatible with the database replication mechanism. This includes the ActionSite site and trash objects.
The following columns are added to the SITENAMEMAP table:
Masthead text not null

- The site masthead

ActionSMIME text not null

- A SMIME block containing a subscription action, if the site is currently subscribed, or an unsubscription action, if the site is currently unsubscribed.

The Console, BESAdmin, and GatherDB read and write to these columns instead of PROPERTIES/TEXTFIELDS. Conflicts will be managed via the SITENAMEMAP ManyVersion.
Universal Properties and Property Overrides
An embodiment uses the (SiteID, AnalysisID, PropertyID) triplet to identify all properties. The QUESTIONRESULTS and statistical tables are updated accordingly. The Universal Properties analysis still exists in the database, but only to claim an action site ID; it is never modified. Its MIME content is generated at propagation time from the custom properties, and is not signed. All custom properties are included.
As an upgrade step, for each reference to an external analysis property currently in the Universal Analysis, a copy of the external property is created as a custom property. Going forward, the ability to add external analysis properties in the universal analysis is removed (the “Make this filterable checkbox” is removed from the Custom Properties dialog, and the dialog no longer shows noncustom properties). Property overrides are no longer generated by GatherDB when importing an external site or by the console when creating a custom analysis. Backward compatibility is removed from analysis activations: new analysis activations no longer include the applicability relevance or property references. Existing analysis activations continue to work, and for backward compatibility, the existing property overrides are not deleted (though they are not propagated). The RPID translation done by FillDB is reversed: it translates results reported via the style IDs to ID triplets, rather than the other way around.
Certificate Stores
Add tables CERTIFICATES and SITE_SUPPORTING_CERTIFICATES which replace the CertStore.dat functionality. Replicate them. During propagation, each supporting certificate for that site is propagated as an individual file named <shalhash>.crt. For backwards compatibility, a CertStore.dat file is also propagated, containing the same set of certificates. Alternatively, it is possible continue to propagate only the existing CertStore.dat, and include supporting certificates in all signature blocks going forward.
Revocation Stores
Revocations.dat and Revocations.crl files are no longer used. The CERTIFICATES table contains a Revocations column containing a DER-encoded list of revocations (direct and indirect) issued by the row's certificate (or NULL if there are none). When retrieving certificates from the database, the CRLs are also retrieved and added to the X509RevocationStore. To prevent a database rollback attack, if a CRL retrieved from the database is bettered by a CRL in the store, the one in the database is replaced. To transmit revocations to clients in such a way that in the case that two separate CRLs are updated in two unreconciled versions of a site, the clients acknowledge both updates, all CRLs are included in the file list signature. It is also possible to include each CRL as an individual file in the propagated site and have clients read them.
Results

- ACTIONRESULTS
- FIXLETRESULTS
- QUESTIONRESULTS
- LONGQUESTIONRESULTS

This section assumes a familiarity with the results table schema and usage. ACTIONRESULTS, FIXLETRESULTS, and QUESTIONRESULTS rows are retrieved via Sequence-Based Row Retrieval. The OriginServerID column indicates which server received the report. In addition, a ReportNumber column is added to each of the three tables which contains the number of client report in which the result was sent. All these columns are maintained by FillDB, either during processing of client reports or during replication.
It is possible that some of the results returned by the row retrieval query have been superseded by later client reports due to the client changing servers. Superseded results must be discarded. Therefore, a pseudo-SQL stored procedure for processing a result X looks like:


	update Results
	set Value = X.Value,
	ReportNumber = X.ReportNumber,
	OriginServerID = X.OriginServerID,
	OriginSequence = X.OriginSequence
	where ID = X.ID
	and ReportNumber < X.ReportNumber
	if @@ROWCOUNT != 0
	return true
	if exists ( select * from Results where R.ID = X.ID )
	return false
	insert X into results using X's result value,
	ReportNumber, OriginServerID and OriginSequence
	return true

The returned value is used by FillDB in deciding whether to process results of the Client Administrators property. This is necessary for keeping the COMPUTER_ADMINISTRATORS table up to date. The COMPUTER_ADMINISTRATORS is essentially an index into the results data, and therefore does not need to be explicitly synchronized because it is derived from data that _is_synchronized. To keep it synchronized it is necessary to parse and process Client Administrators results during results synchronization.
ACTIONRESULTS, FIXLETRESULTS, and QUESTIONRESULTS are synchronized via this mechanism.
LONGQUESTIONRESULTS are updated during QUESTIONRESULTS synchronization.
Computers

- COMPUTERS
- COMPUTER_ADMINISTRATORS

A similar mechanism is used for synchronizing the COMPUTERS table. It already has a ReportNumber column. OriginServerID and OriginSequence columns are added. The synchronization rules must be aware of two sources for change in the COMPUTERS table:

- FillDB updates it (or inserts new rows) in response to incoming results.
- Consoles update it to mark computers as “IsDeleted”.

New results from a computer always result in the “IsDeleted” mark being removed if present.
COMPUTERS rows are retrieved via Sequence-Based Row Retrieval. The OriginServerID column indicates the server which received the report or on which the computer was deleted. The rules for processing the rows retrieved aee:

- A larger report number always wins
- In the case of a tie, deletions win
- Do not touch the row unless a change needs to be made

COMPUTER_ADMINISTRATORS are updated during synchronization of Client Administrators results.
Comments

- ACTION_COMMENTS
- COMPUTER_COMMENTS
- FIXLET_COMMENTS

Because editing of Comments is not allowed, no merging of these fields is required. Use the Sequence, OriginServer, OriginSequence scheme where the Console sets the OriginServer and sequences appropriately, and then insert all the rows t from other servers using queries similar to those described in the Results section. Note that the server ID for the FIXLET and ACTION ids are encoded in the existing ID column using the “7 of the top 8 bits” scheme. These rows can be marked as IsDeleted, and such a change can lead to a primary key collision when replicating. However, they cannot be undeleted once they have been marked. Thus, it is safe to assume that any row marked as IsDeleted is newer than the same row that is not marked IsDeleted. If a collision occurs, the IsDeleted version is the one that should be in the database.
The RowID column is an IDENTITY column. It is necessary to either set the identity seed separately on each server, or else rework the code to chose the RowID value some other way (perhaps switching to a ROWGUID type).
Custom Sites

- CUSTOM_SITES
- CUSTOM_SITE_READERS

Both of these tables may be treated the same way: Add IsDeleted, ManyVersion and VersionConflict, and when replicating select newer row using the ManyVersion, including tie-breakers. If tie-breakers were needed, set the VersionConflict flag to TRUE, otherwise set it to FALSE. If a conflict was detected, log the value of the row that was arbitrarily rejected. Sequence, OriginServer, and OriginSequence may be added to optimize replication traffic, if that seems like a win. Note that site creation may be an undelete if the site being created was previously created and deleted.
CUSTOM_SITE_WRITERS
This table is complicated because, if the same user is authorized more than once, the revocation step must revoke all authorizations, not just the newer one. To avoid throwing away certificates that may be needed in the future for revocation purposes, it is necessary to add the ServerID as a primary key column. For this table, add ServerID, IsDeleted, ManyVersion and VersionConflict. When replicating, use the ManyVersion to determine the newer row and mark the older row as IsDeleted. Set VersionConflict on both rows according to whether tie-breakers were required or not. When revoking a privilege, look for IsDeleted and VersionConflict both being true, and do an extra revocation for those rows. Sequence, OriginServer, and OriginSequence may be added to optimize replication traffic, if that seems like a win.
Note that if a site is deleted, all of the readers and writers are supposed to be revoked as well, but it is possible that a reader/writer is added on another server at the same time that the delete of the site is occurring. In this case, the result of synchronization is that a reader/writer row exists with no corresponding site row. In the event that the site is recreated, this authorization is still valid, which is somewhat unexpected.
Unmanaged Assets

- UNMANAGEDASSET_SOURCES
- UNMANAGEDASSET_ASSETS
- UNMANAGEDASSET_FIELDS_TYPES
- UNMANAGEDASSET_FIELDS
- UNMANAGEDASSET_KEY_FIELDS_TYPES
- UNMANAGEDASSET_KEY_FIELDS

We can use much the same scheme as for client reports. The key properties that make the scheme work are these: Each piece of scan data has a single source, which can provide consistent version numbers or timestamps. We can arrange for each piece of data to be reported to a single server. Newer reports from a source replace older reports.
One problem is that multiple scanners might be reporting on the same asset, and the reports would have to be collated. However, in the preferred embodiment the asset connectors have collation algorithms. We can treat data coming from other databases just like data coming from other scanners, unless a scanner switches databases. In that case, we just need to throw out the data from the older report.
All Nmap scan points report their scan data to a single server by assigning them to that server manually. That seems like a good solution, but it does limit how we can distribute scan points in a deployment. Moving console editable information to a separate table would make replicating the other tables simpler, but we still need a ManyVersion for the new table. Also, the UNMANAGEDASSET_ASSETS, UNMANAGEDASSET_SOURCES, UNMANAGEDASSET_FIELDS_TYPES, and ACTIONPRESETS tables all have identity columns for IDs, as with the COMMENTS tables.
Statistics

- STATISTICAL_SAMPLES
- STATISTICAL_BINS

One approach is to add a report number to the statistical samples, and merge by taking the higher report number. For bins, we can keep a separate set of bins for each server. The bins are filled in by considering consecutive sample pairs; in the HA world, each server can fill in its bins from sample pairs where it received the ending sample. Consoles only need to retrieve the totals, using the computerwise algorithm, of these bins. A current implementation relies on the single server accepting samples in order. This works because out-of-order reports are rejected. But in a multi-server design, there is not an ordering for reports received by different computers. Thus, when an agent reports to different servers faster than they synchronize, the servers record overlapping sample intervals, and the agent's data is double-counted.
When reports are received consecutively, there is no problem. So we can focus on the situations where a server receives non-consecutive reports. One solution involves noticing when sequence numbers are skipped, and, instead of immediately adding the sample interval, making a record of the two samples. The contained intervals can be added to the bins when the intervening samples are found, or the larger interval can be added when all servers agree that the intervening samples are missing. The net effect during ordinary operation is visible, e.g. there are some data waits for synchronization to be added to the bins, but a complete picture of the bins is not available until synchronization anyway. During a network separation, where many agents switch servers, there is a significant amount of data from right around the split that is delayed until the network comes together.
Fixlet Visibility

- FIXLET_VISIBILITY
- USER_FIXLET_VISIBILITY

These tables are treated like the PROPERTIES/TEXTFIELDS tables. They need a ManyVersion column, and the Sequence, OriginServer, and OriginSequence columns. When inserting rows from other databases, apply the ManyVersion comparison scheme and only update the existing row if the new row is newer, including tie-breakers.
Users
BESAdmin is required to do all user creation and editing against a single server, which should eliminate any merge ambiguities with the USERINFO table. If the single server is lost, BESAdmin can be moved to another server, but this requires manual intervention.
Add ManyVersion, Sequence, OriginServer, OriginSequence columns which are filled in by BESAdmin when writing to the table. Get rows from other servers using the usual scheme and insert them if they are newer based on the ManyVersion values, including tie-breakers.
An IsDeleted column is added, and deletion changed to be a “mark as deleted” operation instead of a removal. This allows the deletion change to be more easily replicated to other servers.
Finally, a column is added which contains a signed piece of text containing the username, the hash of the database password for the user, an indication as to whether the user is supposed to be deleted or not, and a ManyVersion. The signature block includes all necessary supporting certificates.
When FillDB inserts or updates a USERINFO table row during database replication, it validates the signature as follows:

- The signature must be valid and authorized.
- The user name in the signed text must match the name in the table row.
- The signing certificate must be either a site license certificate, or a publisher certificate with a serial number that matches that of the user named in the signed text.
- The ManyVersion must be greater than the current ManyVersion in the USERINFO table row for that user.

If validation is successful, FillDB does the following:

- If the signed text indicates that the user is deleted, and a login exists for that user, drop the login.
- If the signed text indicates that the user is active, and no login exists for that user, create the login using the password hash in the signed text.
- Otherwise, change the user's password using the password hash in the signed text.

Administration

- DBINFO

No need to replicate. Stores the ServerID and MaxManyVersion.

- ADMINFIELDS

There should not be any conflicts, because we limit BESAdmin to the primary server. But for consistency, add ManyVersion, Sequence, OriginServer, OriginSequence columns which are filled in by BESAdmin when writing to the table. Get rows from other servers using the usual scheme and insert them if they are newer based on the ManyVersion values, including tie-breakers.

- AGGREGATEDBY

If a web reports server is aggregating any server in the deployment, it is effectively aggregating the whole deployment. We should replicate this table as we do many others, by adding ManyVersion, IsDeleted, Sequence, OriginServer, and OriginSequence.
Historical Data

- HISTORICAL_COMPUTER_COUNTS
- HISTORICAL_FIXLET_COUNTS

No synchronization necessary.
Miscellaneous Tables

- SITENAMEMAP

Use SiteUniqueID from certificate, or if that field is not found then the serial number, as the site ID. To avoid conflict with existing site IDs, set the high bit. Add the ManyVersion, Sequence, OriginServer, and OriginSequence columns and replicate using the usual scheme.

- SITEVERSIONS

The last vestige of functionality still working with this table is the ability to remember the master action site version number so that, in the event of a loss of the propagation server's site Archive, the master action site version number is not reset to 0.

- ACTIONPRESETS

These need to be replicated in a fashion similar to other console-authored objects. Add IsDeleted, ManyVersion, Sequence, OriginServer, and OriginSequence columns. Add an IsDeleted column and change deletion to a “mark as deleted” operation. Add the server ID to the top 8 bits of the preset ID. When replicating, use ManyVersion to decide on the newer version. When inserting or editing, update the ManyVersion column to the current MaxManyVersion for the local server, as well as setting the OriginServer and sequence numbers. Use queries similar to the Results queries for replication.
Propagation and Gather
Propagation
For a given logical site S (could be an operator site, the master action site, etc.) there are N physical sites S(0) . . . S(N−1) where N is the number of servers. Each S(i) site is treated as a unique site, and represents the state of that site on the ith server the last time that a console connected to that server performed a propagation of that site. It has its own URL, and its own integral version number, and a set of differences based on that version numbering.
When consoles propagate they generate a site containing all of the latest versions of the site content on the server they are connected to. The site directory listing generated by the console includes the creation and modification ManyVersions for each object in the site. These are copied from the database. In addition, the listing itself is stamped with the MaxManyVersion of the entire database at the time the site's contents were copied out of the database. Note that the site contents must be locked against further modification until this operation is completed and the MaxManyVersion has been obtained. This site directory listing and site contents are propagated as site S(i) using the same propagation methods as in the non-HA version of BES. Note that in the single server case, this involves a minimal increase in the size of the directory listing, e.g. the inclusion of 2 one-component ManyVersions per listing entry, plus a single one-component ManyVersion for the listing itself.
Under the existing propagation and gathering scheme in BES, sites must be mirrored before they can be gathered by clients or relays. The mirror server makes available a merged version of site S. Whenever it mirrors a site S(i) it constructs a new merged version of site S and makes that available instead. Clients and relays gather site S as they did before. If one of the S(i) sites is newer than all of the others, based on their ManyVersions, then the mirror copies that site S(i) as the new version of site S. If there are two or more S(i) sites whose ManyVersions conflict with each other, then the mirror merges the sites by concatenating their signed directory listings into a multipart site directory listing. Note: for backward compatibility, the mirror also maintains a site S′ that is delivered to older clients. This site S′ never provides multipart listings, but instead always chooses the best site S(i) using tie-breakers, when necessary. Differences are provided to go from version to version of the merged site S. The versions are labeled using the 32 bit CRC hash of the ManyVersion of the site. If clients change from one root server to another, then it is not guaranteed that the difference file it needs is available because the new server may never have seen the particular version of the site that the client last gathered. In this instance, the client requests a diff file which does not exist, gets a 404, and fails over to using the fullsite file, or file by file, whichever it determines is cheaper.
It is not actually necessary to get a lock on the site contents while generating the contents of a site for propagation. Instead, when generating the site to propagate, the console could check that each item in the listing has ManyVersion numbers which are older than the ManyVersion of the listing itself. If this is not the case for an item, then the console would need to fetch an earlier version of the object from the database that _does_meet this requirement or, if no such version exists, then it should omit the object. This allows the site contents to be modified during a propagation.
Gather
When clients gather, they gather the logical site S, which may return a multipart directory listing. If it does return a multipart listing, the client must perform the merge operation to generate the actual listing of site contents, and then perform a gather using that result. After completing a gather, the client sets its current ManyVersion for the site to the ManyVersion of whichever part of the multipart directory listing wins the tie-breaker rules which, in essence, is to pick the lowest server id among those servers that have a conflict.
When the client gathers it submits the 32 bit CRC hash of the ManyVersion it has, and the server will only return the “already up to date” response if the ManyVersion of the site it has available hashes to the same value. After the client retrieves the directory listing it hashes the ManyVersion of that site and requests a diff by asking for the diff file whose name is diffsite_XX_YY where XX is the hash of the current ManyVersion, and YY is the hash of the ManyVersion it is trying to gather.
The client should always confirm that the ManyVersion of the directory listing it gathers is newer than the ManyVersion of the directory listing it currently has (“newer” in this case means “not strictly older than” . . . clients should accept versions that are “sideways” moves).
Client Register
New Computer IDs
Registration servers currently hand out new computer IDs using the “number of seconds since 1970” or, if that number has already been used, then a random number. In practice, the random numbers are rarely used. In one embodiment, each server has its own space reserved by using the high 8 bits of the computer ID as a “server id” field. The other 24 bits are assigned randomly with a check to make sure that the number chosen has not already been used. The “number of seconds since 1970” scheme had the advantage that if a registration list were wiped out for some reason, e.g. server crash, then the reg server could hand out new IDs with some confidence that those IDs were not in use by an existing client. In the case of HA, this scheme is not as vital, because if a server has its registration data wiped out, it is able to restore most of its data by synchronizing with another server. The data that are not restored by synchronization are eventually restored by client re-registration, and in the meantime collisions are unlikely due to the random number space being large enough.
Here is a rough mathematical justification for this:
If a deployment has 1 million computers, and the average lifecycle of a computer is 1 year, then about 3000 computers are brought online each day. If the server which lost its data can only restore data from the other servers up to the last day, then as many as 3000 registrations may have been lost (worst case . . . it could be much less if new registrations are distributed among the servers). When choosing a new random number from a space of 24 bits (approx 16 million) the chance that one chooses a collision with one of the 3000 unknown numbers is about 1 in 5000. If it takes a full day to restore the missing 3000 computers, then the expected number of collisions is something like 3/5 (worst case again, because the probability of collision decreases over time as more of the missing computers re-register). This seems acceptable. If you do the same math with some more reasonable numbers you get: 300000 computer deployment turning computers every two years, generates 500 IDs per day, distributed over five servers means 100 IDs were lost, which means each new ID has a chance of collision of 1 in 160,000 and chance of collision over 1 day is 1/1600.
Synchronizing Registration Data
Under HA, client register servers sync their registration lists with other servers. There are several fields in a registration record, and each must be treated differently. These fields are:

- Computer ID
- If the computer ID in the record from the remote server does not exist, add a record for it. If it matches an existing ID, then check if the Client Sequence Number of the remote record is newer. If it is older, then ignore the record. If the remote record is newer, then update the other fields of the record as described below.
- IP Address
- IsRelay
- Parent Client ID
- These values are used to reflect the path that the client used to reach the registration server, and therefore represent the path to take to reach the client when sending ping messages. The more recent value is more likely to represent a working path. Copy the remote record's values for these fields if the remote record has a newer Client Sequence Number.
- License Number
- Do not sync. If the record from the remote server is a new computer id, allocate a new license number and store that number. Otherwise, just keep the value as it was before. License numbers are calculated by each registration server independently, so they should be unaffected by the license number values being used on other servers.
- Registration Time
- If the record from the remote server is a new computer ID, use the value from the remote record. Otherwise, only use the value from the remote server if the Client Sequence number indicates that it is newer. Because server clocks may vary, we use the client sequence number as the authoritative way to decide which record is newer, so even if a remote record has a newer reg time, we ignore it if the Client Sequence number is older.
- Client Sequence Number
- If the record from the remote server is a new computer ID, use the value from the remote record. Otherwise, compare the two values, and take the max. This sequence number is used for cloning detection (see ClientRegister). It should be the high water mark of all client registrations. Note that the above technique of ignoring remote records whose Client Sequence number is lower is equivalent to taking the max of the sequence numbers.
- Resend Reports Flag
- Last Good Report Number
- Do not sync these values, use the local value. In addition to these existing fields, for the purposes of the synchronization between servers we want to add some additional fields, such as Original Server ID and Original Server Sequence Number. These fields can be used in a similar scheme to the database result tables replication scheme to allow one registration server to pull only the records it does not already have when synchronizing with another server.

Implementation
Option 1: Write a plug-in that takes a synchronization request containing the high water market sequence numbers for every server and returns an XML document containing all data known on this server that is newer. Have the existing ClientRegister plug-in periodically sync with other servers, in a similar fashion to how it currently does the flush of the registration list.
Option 2: Change ClientRegister to maintain its data in the BES database instead of the current “in memory/text file” combination. Add this table to the list of tables that are synchronized by FillDB.
Configuration
Server
New Configuration Information
Servers need the following additional configuration information for HA:

- Server ID
- Add a ServerID column to the single row DBINFO table. In addition, if client register is not using the DB, then a reg key must be set as well. For phase 1, this ID should be specified at installation time by the user.
- REPLICATION_SERVERS
- Add a table which has the following columns: ServerID, DSN, URL. The DSN is the name that can be used to create an ODBC connection. The URL is the base URL of this server which can be used to access the mirror, client register, and other cgis.
- REPLICATION_SCHEDULE
- Add a table to the DB with the following columns: SrcServerID, DstServerID, Interval, LastUpdate. On a given server, we query for rows of this table where DstServerID=DBINFO.ServerID and (now-LastUpdate)>Interval, replicate from each source server returned, and update the corresponding LastUpdate column.
- Mirror Server
- The mirror needs roughly the same info as the REPLICATION_SCHEDULE table to know which servers to mirror from. Or it could just mirror from every server.

New Server Installation
When setting up a new server, the following things need to happen:
1. The server needs to be added to the REPLICATION_SERVERS table. This involves choosing a unique ID for the server.
2. The new server needs to be added to the replication schedule of at least one existing server, and at least one existing server needs to be added to the replication schedule of the new server.
3. Authentication pathways need to be created between all servers that are going to replicate from each other.
4. The server installer needs to be run on the new server, including running BESAdmin in a mode where the database tables are all created and are initialized only as much as FillDB needs them to be to initiate its first replication.
5. FillDB needs to be provided with the necessary configuration information to establish the remote connection to the servers in its replication schedule, including password information if required.
To accomplish this, an embodiment implements the following:
Before deploying a new server, items 1 and 2 above are accomplished using BESAdmin, connecting to the master server. Note that item # 1 requires knowing the server name, and requires generating a new server ID for that server name. Before deploying a new server, item #3 is accomplished either manually or through BESAdmin. Once items 1-3 have been accomplished, the installer is run on the new server. It should install all the necessary files and settings as usual, and run BESAdmin with a special flag to prepare the database. It should then call BESAdmin in a new mode (/replicationServerlnstallUI?) to collect the following
information from the user:

- Server ID (already generated through BESAdmin)
- Authentication information (NT authentication vs. SQL Authentication, and password if SQL authentication Username can probably be generated predictably from server ID)
- Contact information for server to do initial replication from.

BESAdmin should set that information in the database, and should then provide status feedback collected from FillDB. This feedback should be rich enough that the user can either see why FillDB is failing, or else can watch FillDB work through its initial replication. A simple status string written by FillDB to the registry
might be sufficient for this. In case the user exits BESAdmin without successfully completing this step, there should be a way to get back to BESAdmin in this mode through the start menu. One approach adds a new shortcut to the start menu that calls BESAdmin with the appropriate command line argument, another approach changes BESAdmin's behavior when called with no command line arguments to give access to this UI.
Replication Schedule Management
Updating and maintaining the REPLICATION_SCHEDULE table is a routine task as administrators adjust the replication behavior of all of the HA server in their deployment. An embodiment provides a UI for managing this process through BESAdmin. Because this functionality is included in the main BESAdmin UI, note that it is only available when connecting to the master database. However, there are some circumstances in which it may be important to be able to adjust these parameters on a machine that is not connected to the master database. In the initial HA release, this situation is addressed by direct edits to the database.
The UI in BESAdmin needs to allow the user to adjust the replication schedule for each server pair, in each direction. To that end, we propose a UI in which the user selects a destination server (the recipient of the replication data) from a drop-down combo box, and based on the selection of destination server, a list control is populated with the set of all possible source servers.
For each source server, the user should be able to enable or disable replication, and if replication is enabled they should be able to specify the replication interval. As an optional extra, it would be useful to show them the ‘shortest path’ time as well because the shortest path might very well not be by direct connection.
Console
One embodiment requires consoles to be configured with a separate DSN for each server they want to connect to. It is also possible for the console to figure this out itself and allow the user to select from a list of servers for a given deployment. Instead of using the action site masthead, the client should get the server URL from the ServerURL column of the DBINFO table. The console should also get the Server ID from the DBINFO table.
WebReports
Web Reports could detect and report an error condition if it was instructed to aggregate two servers that were members of the same deployment.
Although the invention is described herein with reference to the preferred embodiment, one skilled in the art will readily appreciate that other applications may be substituted for those set forth herein without departing from the spirit and scope of the present invention. Accordingly, the invention should only be limited by the Claims included below.

Claims

1. An apparatus for synchronizing network element state when a network connection is restored between a plurality of servers after a network failure, comprising:

a plurality of objects, each object existing in a plurality of different versions, wherein each said different object version results from modifications to an object made by different servers during said network failure when said servers are unable to communicate with each other but otherwise continue to function;

each said object comprising a vector including a separate version number for each server, wherein each server increments its version number in said vector when it modifies said object;

an automatic conflict resolution mechanism for providing, at each server, a most up to date view of all objects across all of said plurality of servers upon restoration of said network connection between said plurality of servers after said network failure, said conflict resolution mechanism reconciling the existence of said plurality of different versions of an object to determine which object version should take precedence over other object versions, wherein conflict resolution is performed when there are multiple versions of a same object at a server, said conflict resolution mechanism further comprising at least one tie breaking rule that is applied to decide which servers take precedence over other servers when determining which object version should take precedence over other object versions.

2. The apparatus of claim 1, sources of said objects comprising any of the following network elements:

clients, relays, servers, and consoles.

3. The apparatus of claim 1, at least one of said servers comprising any of:

a policy collection and a state collection.

4. The apparatus of claim 1, said automatic conflict resolution mechanism, further comprising:

a mechanism that stamps every object with information about the object's origin.

5. The apparatus of claim 4, said information comprising:

what server said object originated from.

6. The apparatus of claim 4, said information comprising:

what version the object is.

7. The apparatus of claim 1, wherein an object can be modified over time and a new version can be time stamped into said object.

8. The apparatus of claim 1, wherein said object's total version number comprises:

both of the server that modified the object and the version number of the object itself.

9. The apparatus of claim 1, said automatic conflict resolution mechanism comprising:

means for maintaining an object version number, wherein when any server replaces an object or updates an object, it changes the object's version number for that server.

10. The apparatus of claim 1 said automatic conflict resolution mechanism comprising:

a vector of version number per server, wherein a first element of the vector is associated with a first server number, and each subsequent element of the vector is associated with corresponding subsequent server numbers.

11. The apparatus of claim 1, each said server comprising:

a processor programmed to execute instructions by which, whenever said server updates an object, said server takes an existing version of the object that it has and the server finds its position in an array of vectors, said server then increments the object's version number in the vector, said server then applies a rule for comparison that allows the server to decide which version of the object takes precedence.

12. The apparatus of claim 1, wherein the object version having a lower number server takes precedence.

13. The apparatus of claim 1, further comprising:

a plurality of vector versions of an object, where each server has a position in said vector, wherein when an object is changed, each server takes a previous version of the object, increments the version numbers for that particular server's portion of the vector, and puts the vector into a new object, wherein the new object is then propagated into the network, and wherein if a conflict is encountered, version number resolution tie breaking rules are applied to decide which version of said object should be used by each server.

14. The apparatus of claim 1, further comprising:

means for serializing each said object in a signed directory listing comprising a directory of all objects and their version number, wherein a signed directory listing is created at each server, wherein if said conflict resolution mechanism indicates that a server has a definitive list, it publishes its version of the object, wherein if said server can not do so because there is another version of that object that exists in a newer state than that that the server has, then the server also has a directory listing from a different server that contains the object number identifier and its version, wherein the server combines these directory listings to create a super set directory listing that contains all of the information needed for conflict resolution with regard to the object in question.

15. The apparatus of claim 1, further comprising:

means for maintaining consistency among a plurality of different servers in a distributed system by stamping said objects based upon an object version number.

16. The apparatus of claim 15, wherein a time stamp is appended to said object.

17. The apparatus of claim 1, wherein a client determines if is has a proper set by verifying a digital signature on a signed directory listing.

18. The apparatus of claim 1, wherein said object comprises any of the following different forms:

a database form; and

a serialized form.

19. The apparatus of claim 1, further comprising:

an object definition table comprising a column that defines a body with relevance and a manyversion, wherein each one of these elements is a version number of a particular server that represents a portion of the vector manipulated by the server, wherein each index is implemented independently of the others by a respective server, and wherein any software that is confronted with a new version number can perform a comparison and decide which server did what to the object.

20. The apparatus of claim 1, further comprising:

at least one inspector automatically determining relevance in connection with an associated network element in response to presentation of an object to said associated network element, said relevance determination based on any of:

hardware attributes;

configuration attributes;

database attributes;

environmental attributes;

computed attributes;

remote attributes;

timeliness;

personal attributes;

randomization; and

advice attributes.

21. The apparatus of claim 1, further comprising:

at least one inspector automatically evaluating management and/or remediation information from said management console function based upon a determination of relevance in connection with an associated network element in response to presentation of an object to said associated network element and, as a result of a determination of said object's relevance to said network element, automatically performing any of:

mathematico-logical calculations;

executing computational algorithms;

returning results of system calls;

accessing contents of said associated network element;

querying said associated network element to evaluate any of:

said properties of said associated network element;

said associated network element configuration;

contents of storage devices associated with said associated network element;

peripherals associated with said associated network element; and

said associated network element environment.

22. The apparatus of claim 1, The apparatus of claim 1, further comprising:

at least one inspector automatically evaluating management and/or remediation information based upon a determination of relevance in connection with an associated network element in response to presentation of an object to said associated network element and, as a result of a determination of said object's relevance to said network element, providing at least one notification based upon said relevance determination result.

23. A computer implemented method for synchronizing network element state when a network connection is restored between a plurality of servers after a network failure, comprising the steps of:

providing a plurality of objects, each object existing in a plurality of different versions, wherein each said different object version results from modifications to an object made by different servers during said network failure when said servers are unable to communicate with each other but otherwise continue to function;

wherein each said object comprises a vector including a separate version number for each server, wherein each server increments its version number in said vector when it modifies said object;

providing, at each server, via an automatic conflict resolution mechanism, a most up to date view of all objects across all of said plurality of servers upon restoration of said network connection between said plurality of servers after said network failure;

said conflict resolution mechanism reconciling the existence of said plurality of different versions of an object to determine which object version should take precedence over other object versions;

wherein conflict resolution is performed when there are multiple versions of a same object at a server;

said conflict resolution mechanism applying at least one tie breaking rule to decide which servers take precedence over other servers when determining which object version should take precedence over other object versions.