
Distributed fault isolation and recovery system and method

Info

Publication number
CA1213985A
CA1213985A CA000469577A
Authority
CA
Canada
Prior art keywords
module
modules
fault
recovery
active
Legal status
Expired
Application number
CA000469577A
Other languages
French (fr)
Inventor
John W. Maher
Current Assignee
Motorola Solutions Inc
Original Assignee
Motorola Inc
Application filed by Motorola Inc
Application granted
Publication of CA1213985A

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 - Error detection; Error correction; Monitoring
    • G06F11/07 - Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16 - Error detection or correction of the data by redundancy in hardware
    • G06F11/18 - Error detection or correction of the data by redundancy in hardware using passive fault-masking of the redundant circuits
    • G06F11/181 - Eliminating the failing redundant component
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 - Error detection; Error correction; Monitoring
    • G06F11/07 - Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 - Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0751 - Error or fault detection not based on redundancy
    • G06F11/0754 - Error or fault detection not based on redundancy by exceeding limits
    • G06F11/0757 - Error or fault detection not based on redundancy by exceeding limits by exceeding a time limit, i.e. time-out, e.g. watchdogs
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 - Error detection; Error correction; Monitoring
    • G06F11/07 - Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/08 - Error detection or correction by redundancy in data representation, e.g. by using checking codes
    • G06F11/10 - Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 - Error detection; Error correction; Monitoring
    • G06F11/07 - Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16 - Error detection or correction of the data by redundancy in hardware
    • G06F11/18 - Error detection or correction of the data by redundancy in hardware using passive fault-masking of the redundant circuits
    • G06F11/187 - Voting techniques

Abstract

There is disclosed a system and a method for isolating faults and recovering a distributed system of the type including a plurality of modules to optimized operation. At least some of the modules are active fault recovery modules and include fault detecting means for initializing a fault check routine and sensing faults within the distributed system. Voting means are associated with each active module for placing a vote during each fault check routine in response to a detected fault. Collective vote determining means record the votes of the active modules after each fault check routine and recovery sequence initializing means initializes a fault isolation and recovery sequence in response to a given number of consecutive collective votes exceeding a predetermined value.

Description


DISTRIBUTED FAULT ISOLATION
AND RECOVERY SYSTEM AND METHOD

FIELD OF THE INVENTION

The present invention relates to a fault isolation and recovery system and method wherein the fault isolation and recovery can be performed in any one of a plurality of modules or nodes distributed throughout the system. The system and method of the present invention finds particular application in systems of the type including a plurality of modules or nodes which must communicate with each other over a common bus.

BACKGROUND OF THE INVENTION

There are many systems which include a plurality of modules distributed along a common link, such as a bus or a plurality of buses. Often, time division multiplexing is utilized to provide efficient information transfer between the distributed modules and to achieve maximum system capacity. In systems of this kind, the bus or buses are divided into a plurality of time slots.
Each module is assigned a predetermined time slot into which it can insert information onto the bus and means for receiving information from any one of the time slots. In this manner, any one module is capable of transferring information to any other module, and in turn, capable of receiving information from any other module.
In addition to the foregoing, one of the time slots of one of the buses or a separate bus can be dedicated to permit each module to address or send data to any one of the other modules. Further, a central common control of the data bus or slot is typically provided to control the overall operation of the system. The central common control can provide, for example, system clocks, data bus or slot arbitration, and guard tone generation.
Because of the importance of the central common control to the system, it is typically provided in a redundant manner so that if one central common control develops a fault, the system can be switched over to the other redundant central common control.

A problem which can arise in such a system is the location of the fault detecting intelligent module of the system. If it resides in a central, stand alone location, system reliability is compromised should the central fault detecting module or node fail.
Protection against misuse of the buses is another problem. Module failure of any unshared circuitry associated with communicating on the buses could render the buses either totally or partially inoperative.
Prior art systems have addressed this problem by switching to redundant buses and bus devices.
While such approaches can be generally successful, they exhibit certain undesirable effects. First, they increase system cost because all bus drivers and related circuitry must be duplicated. Second, total system capacity may not be realized with just one time division multiplexed (TDM) bus. As a result, a plurality of redundant buses may be required.
Prior art systems are generally arranged so that if a module failure renders one of the buses inoperative, all the buses may be switched over to redundant buses or just the failed bus may be switched to its redundant bus. Neither of these arrangements is totally satisfactory, and in the latter one, additional input-output or decoding circuitry must be provided for every TDM switch user to selectively and properly make the switch. This approach both adds cost to the system and adversely affects system reliability.

One improvement to prior art systems of this type is fully disclosed and claimed in U.S. Patent No. 4,562,575 for Method and Apparatus For The Selection of Redundant System Modules, which application is assigned to the assignee of the present invention. The system there disclosed includes a redundant central common control referred to as MUX Common. However, the switching between the main MUX Common and the redundant MUX Common is not initiated by a centrally located fault detecting module.
Instead, a plurality of modules associated with the buses are active fault detecting modules or nodes, each continuously checking the system in parallel for faults.
When a fault is detected by one of these active modules, it places a vote indicating that a fault has been detected. If a predetermined number, for example, a majority, of the active modules vote, the system then switches from the then active MUX Common to the other MUX Common. Hence, the switching to the redundant module is not commanded by a single fault detecting module, but instead, by a majority of a plurality of fault detecting modules distributed throughout the system. As a result, since a single fault detecting node is not relied upon, system reliability is greatly improved.
Even though the foregoing system exhibits many advantages over prior systems for detecting faults, the switching to a redundant MUX Common may not always rectify the fault or problem with the system. The present invention, however, provides a further improvement thereto in that not only is the fault detection distributed throughout the system, but the fault isolation and recovery is also distributed throughout the system as well. As a result, a single node is not relied upon for fault isolation and recovery, but instead, this function is distributed throughout the system so that if one fault isolation and recovery node fails, another one immediately takes its place to restore the system to optimized operation.
It is therefore a general object of the present invention to provide a new and improved distributed fault isolation and recovery system and method for recovering a system of the type including a plurality of modules or nodes which experiences a fault to an optimized configuration.
It is a further object of the present invention to provide such a system and method wherein the fault isolation and recovery process is initialized based upon a distributed detection of a fault.
It is a still further object of the present invention to provide such a system and method wherein any one of a plurality of modules or nodes includes means for performing the testing required for recovering the system.
It is still another object of the present invention wherein the node or module performing the testing of the system must pass internal testing prior to proceeding with the fault isolation and system recovery.


SUMMARY OF THE INVENTION

The invention provides a system for isolating faults and recovering a distributed system of the type including a plurality of modules to optimized operation. At least some of the modules are active fault recovery modules and include fault detecting means for initializing a fault check routine and sensing faults within the distributed system. Voting means are associated with each active module for placing a vote during each fault check routine in response to a detected fault.
Collective vote determining means record the votes of the active modules after each fault check routine and recovery sequence initializing means initializes a fault isolation and recovery sequence in response to a given number of consecutive collective votes exceeding a predetermined value.
The present invention also provides a method of isolating faults and recovering a distributed processing system including a plurality of modules to optimized operation. The method includes the steps of periodically detecting for faults within the system at given ones of the modules, generating a vote at those given ones of the modules detecting a fault within the system, collecting the votes as a collective vote, and initializing a recovery sequence when a given consecutive number of the collective votes exceed a predetermined value.


BRIEF DESCRIPTION OF THE DRAWINGS

The features of the present invention which are believed to be novel are set forth with particularity in the appended claims. The invention, together with further objects and advantages thereof, may best be understood by making reference to the following description taken in conjunction with the accompanying drawings, in the several Figures of which like reference numerals identify identical elements, and wherein:
Figure 1 is a functional block diagram of a distributed processor system utilizing the present invention;
Figure 2 is a block and electrical schematic diagram illustrating in greater detail one of the modules of the system of Figure 1;
Figure 3 is a block and electrical schematic diagram illustrating a portion of another module of the system of Figure 1; and Figures 4 through 8D are flow charts illustrating the fault isolation and recovery routine for each of the active fault recovery modules of the system.

DESCRIPTION OF THE PREFERRED EMBODIMENT
Referring now to the drawings, and more particularly to Figure 1 thereof, a distributed system in the form of a communication control center utilizing the present invention is illustrated.
The communication control center includes multiple modules of three distinct types which communicate with each other by way of a time division multiplexing (hereinafter referred to as TDM) network.
The control center system comprises a plurality of operator MUX interface (OMI) modules 2, a plurality of Transmit Receive (TR) modules 4 and a plurality of Dual Receive (DR) modules 6, all of which communicate with each other via the BUS 9 or other acceptable equivalent. In accordance with this preferred embodiment, the OMI modules 2 are active fault recovery modules of the system.
The other modules of the system are passive modules with respect to the fault isolation and recovery process. Communication between the modules is controlled by a fourth type of module referred to as the MUX common module 8 or 10, which are redundant common modules. The MUX common modules 8 and 10 arbitrate access to a dedicated data channel or a dedicated data bus on the TDM BUS 9 or equivalent and also output synchronization signals necessary for other forms of communication over the network. The MUX common modules are therefore of critical importance to the system since they represent common elements, the failure of which would cause failure of the entire network. As disclosed in the aforementioned U.S.

Patent No. 4,562,575, an extremely reliable mechanism is provided to insure that the MUX common module 8 and 10 functions do not fail. As disclosed therein, the operation of the active MUX common module 8 or 10 is monitored and the operation from the active MUX common module 8 or 10 to the inactive MUX common module 8 or 10 is transferred should the active module fail. For a complete description of the communication control center of Figure 1, reference may be had to copending Canadian application Serial No. 456,908 filed June 19, 1984, for Time-Division Multiplex Communications Control System, which application is also assigned to the assignee of the present invention.
Referring now to Figure 2, it illustrates in general detail the configuration of each OMI module 2 in accordance with the present invention. Each OMI module 2 includes a microprocessor 11, such as an MC6809 manufactured by Motorola, Inc., a plurality of bus drivers 12, 13, and 14 which can be MC14503 bus drivers manufactured by Motorola, Inc., a buffer 15, a parallel to serial converter 16, and a serial to parallel converter 17.
The BUS 9 as illustrated in Figure 1 includes in part a status lead 20, a dedicated data bus 21, and a vote lead 22. The data bus 21 is utilized within the system for transferring data between the various modules of the system except for the MUX common modules 8 and 10. The data carried on the data bus 21 is transferred serially between the various modules. The vote lead 22 is utilized, as will be more fully explained hereinafter, for determining when there is a fault within the system. Suffice it to say here that each OMI module has stored within its microprocessor 11 a routine which continuously detects for faults within the system. When a fault is detected, each OMI module 2 will output a vote onto the vote lead 22.
The votes, as will be described hereinafter, are recorded in the MUX common modules 8 and 10, and if the collective vote exceeds a predetermined value, a fault within the system will be indicated. The first such collective vote causes transfer from the active MUX common to the other redundant MUX common. After the transfer, each OMI module 2 reinitializes its fault detecting routine to determine if the problem within the system has been corrected by transferring operation to the redundant MUX common. If the problem within the system has not been corrected by the switch to the redundant MUX common, and the resultant error is such that any OMI module cannot properly access the data bus, then any OMI module experiencing this type of error will place another vote onto the vote lead 22. If the collective vote exceeds a predetermined value, then another transfer of MUX commons, in this case back to the original MUX common, will occur. The occurrence of a second switch of MUX commons is cause for the system to proceed into the fault isolation and recovery routine which will be described hereinafter in greater detail. The status lead 20 is utilized to indicate which MUX common module 8 or 10 is presently active. This indication depends upon the level on the status lead 20.
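By way of illustration only, and not as part of the patented circuit, the following C sketch shows one way the two-step escalation just described could be tracked in an OMI module's firmware: a detected fault produces a vote for a MUX common switch, and a second consecutive switch caused by a collective vote triggers the fault isolation and recovery sequence. All names and the structure of the state are hypothetical.

    /*
     * Illustrative sketch only: first detected fault leads to a vote for a
     * MUX common switch; if the fault persists after the switch, a second
     * vote follows; two consecutive switches start the recovery sequence.
     */
    #include <stdbool.h>
    #include <stdio.h>

    enum action { NO_ACTION, VOTE_FOR_SWITCH, START_RECOVERY };

    struct omi_state {
        int mux_switches_seen;   /* consecutive switches caused by collective votes */
    };

    /* Called once per fault check routine. */
    enum action service_fault(struct omi_state *s, bool fault_detected,
                              bool mux_commons_switched)
    {
        if (mux_commons_switched)
            s->mux_switches_seen++;
        else if (!fault_detected)
            s->mux_switches_seen = 0;          /* system healthy again */

        if (s->mux_switches_seen >= 2)         /* second switch: isolate and recover */
            return START_RECOVERY;
        if (fault_detected)
            return VOTE_FOR_SWITCH;            /* place a vote on the vote lead */
        return NO_ACTION;
    }

    int main(void)
    {
        struct omi_state s = { 0 };
        /* fault persists through the first switch, so recovery is triggered */
        printf("%d\n", service_fault(&s, true,  false));  /* 1: vote */
        printf("%d\n", service_fault(&s, true,  true));   /* 1: vote again */
        printf("%d\n", service_fault(&s, false, true));   /* 2: start recovery */
        return 0;
    }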
The buffer 15, the serial to parallel converter 17, and an internal data bus 18 permit data on the data bus 21 to be received by the OMI module 2 at its microprocessor 11. To that end it can be noted that the input to the buffer 15 is coupled to the data bus 21. The output of buffer 15 is coupled to the serial to parallel converter 17. The serial to parallel converter 17 receives the serialized data and converts it to a parallel data word. That data word is inputted into the microprocessor 11 over the data bus 18 in response to a command from the microprocessor 11 over a device select lead 19.
To permit the OMI module 2 to output data onto the data bus 21 from its microprocessor 11, the internal data bus 18 is coupled to the input of the parallel to serial converter 16. The output of the parallel to serial converter 16 is coupled to the data bus 21 through the bus drivers 13 and 14. When data is to be transferred from the microprocessor 11 onto the data bus 21, a parallel digital data word will be provided from the microprocessor and transferred to the parallel to serial converter 16 over the internal data bus 18. Upon command from the microprocessor 11 over the device select lead 25, the parallel data word will be serially conveyed from the parallel to serial converter 16, through the bus drivers 13 and 14, and onto the data bus 21.
For placing a vote onto the vote lead 22, the bus driver 12 is coupled to the vote lead 22.

When an OMI module 2 detects a fault within the system and places a vote, the output of the bus driver 12 will change state between a low and a high level for placing the vote onto the vote lead 22. During the fault isolation and recovery sequence, as will be more fully described hereinafter, it is necessary for each OMI module 2 to be removed from the data bus 21. This removal is referred to herein as a tristate condition, and the tristate condition is achieved on command by the microprocessor 11 over a disable lead 27. When the OMI module 2 is tristated, the bus drivers 12, 13, and 14 will appear to the data bus 21 to be a high impedance. The term TRI-STATE is a registered trademark of National Semiconductor Corporation, used to describe a particular family of logic devices. Conversely, as used herein, the term tristate imports the common technical meaning as understood by those of ordinary skill in the art (i.e. three states: logic low, logic high, and a high-impedance state). Because these bus drivers are disabled, the OMI module 2 will not be able to place a vote onto the vote lead 22 or place data onto the data bus 21. However, a tristated OMI will still be able to receive data from the data bus 21 in the manner as previously described.
Referring now to Figure 3, it illustrates in general detail the circuitry of the MUX commons 8 or 10 for recording the collective vote from the OMI modules 2 and determining if the collective votes exceed a predetermined value. As shown in Figure 3, the MUX common modules include a comparator 23 which has its positive input terminal connected to the vote lead 22 through the series resistors 24 and 26. The output of the comparator 23 is connected to the positive input terminal thereof by way of the feedback resistor 28. The negative input terminal of the comparator 23 is connected to a 12 volt power supply through the resistor 30 and to ground through the resistor 32. A capacitor 34 has one end connected to the junction of the resistors 24 and 26 and its other end connected to ground.
The base of the NPN transistor 36 is connected to the output of the comparator 23 through the resistor 38, the base of the transistor 36 also being connected to ground through the resistor 40. The emitter of the transistor 36 is also connected to ground. The collector of the transistor 36 is connected to a 12 volt power supply through the resistor 42 and has a terminal C to enable or disable the remaining circuitry (not shown) of the MUX common module. The remaining circuitry is used to control communication between the various other modules as mentioned hereinbefore, the details of which are not necessary for a full and complete understanding of the present invention. The output of the comparator 23 is also connected to the base of an NPN transistor 44 through the resistor 46. The base of the transistor 44 is also connected to ground through the resistor 48. The emitter of transistor 44 is also connected to ground. The collector of the transistor 44 is connected to the cathode of the diode 50. A PNP transistor 52 has its collector connected to the anode of the diode 50 and to the status lead 20 through the capacitor 54-resistor 56 combination. The emitter of the transistor 52 is connected to the 12 volt power supply through the resistor 58. The base of the transistor 52 is connected to the 12 volt power source through the series diodes 60 and 62 and is also connected to ground through the resistor 64.

The operation of the MUX commons in recording the collective votes from the OMI modules 2 and for switching the MUX commons will now be described. In this example, the status lead 20 is at a low logic level when the MUX common A module 8 is active and is at a high logic level when the MUX common B module 10 is active. The status lead 20 is driven by the circuit comprised of the transistors 44 and 52, each having their respective associated circuit components. The transistor 52 with its associated bias components 60, 62, and 64 forms a constant current source and has its emitter voltage held at approximately 11.3 volts by the diode 60, the diode 62 and the resistor 64.
The emitter current is therefore 0.7 volts divided by the resistance of the resistor 58 and is used to raise the status lead 20 to a high logic level through the capacitor 54-resistor 56 combination.
When the output of the comparator 23 is at a high logic level, the transistor 44 is turned on through the resistors 46 and 48 and the current supplied by the transistor 52 is drawn to ground through the transistor 44 and the diode 50. This also pulls the status lead 20 to ground. Such a condition indicates that MUX common A module 8 is enabled and the MUX common B module 10 is disabled. When the output from the comparator 23 is at a low logic level, the transistor 44 is turned off which raises the status lead 20 to a high logic level. The resistor 56, the capacitor 54 and the diode 50 are included to protect the circuitry against accidental shorts. The status lead 20 is monitored by each of the microprocessors 11 within their respective OMI modules 2.


Each of the microprocessors 11 includes an executive diagnostic routine which detects for faults within the system. This routine and the fault isolation and recovery routine included within these microprocessors will be described in greater detail hereinafter.
When the microprocessor 11 detects an error which indicates a fault within the system, it interrogates the status lead 20 to determine whether the MUX common A module 8 or the MUX common B module 10 is active. It then adjusts its vote output signal to vote that the inactive or stand-by MUX common module 8 or 10 be activated. In this embodiment, a high logic level signal of 12 volts is outputted to vote for the MUX common A module 8 and a low logic level signal of 0 volts is outputted to vote for the MUX common B module 10.
Assuming for example that the MUX common A module 8 is active, the status lead 20 would therefore be indicating a low logic level. Then, if a failure occurs within the system, each of the microprocessors 11 of the OMI modules 2 will detect the fault and interrogate the status lead 20 to determine that the MUX common A module is currently active.
Each of the microprocessors 11 will then respond by adjusting its vote output to the low logic level indicating a vote to activate the MUX common B module 10.
The actual voltage on the vote lead 22 is determined by superposition of the vote output signals from all of the microprocessors 11 in the system. In the case where all of the microprocessors 11 determine that there is a fault within the system and that therefore the MUX common A module 8 should be active, each of them would output a high logic level signal of 12 volts and the vote lead 22 would have a level of 12 volts. Conversely, in the event that all of the microprocessors 11 determine a fault within the system and that MUX common B module 10 should be active, each of them would output a low logic level signal of 0 volts and the vote lead 22 would have a level of 0 volts. However, it is conceivable that not all of the microprocessors 11 will generate the same vote output and, therefore, since each vote output signal has the same output impedance of 4.7 Kohms, the following formula expresses the voltage on the vote lead 22 for any given combination of vote output signals:

    V_VB = 12 x A    (1)

where V_VB is the voltage on the VOTE BUS line 22; A is the number of active (i.e., from OMI modules which have not removed themselves from the system) microprocessors 11 voting for the MUX common A module 8, expressed in percent (%); and 12 is the high logic level voltage expressed in volts.
A majority vote for MUX common A module 8 can, therefore, be recognized as a voltage greater than 6 volts on the vote bus line 22 by the application of equation (1). On the other hand, a majority vote for MUX common B module 10 can be recognized as a voltage less than 6 volts on the vote lead 22. Hence, when the collective vote from all of the OMI modules 2 exceeds a predetermined value indicating that a majority of the OMI modules 2 have voted in response to the detection of a fault within the system, the MUX commons will be switched.
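The relationship expressed by equation (1) and the half-supply threshold can be checked numerically. The short C sketch below is illustrative only; the module count, and the idea of computing the vote bus voltage in software at all, are assumptions made for the example rather than anything taken from the patent.

    /*
     * Illustrative model of the vote lead: equal-impedance 0 V / 12 V vote
     * outputs superimpose, and the half-supply comparator reference decides
     * which MUX common the majority has selected.
     */
    #include <stdio.h>

    #define HIGH_LEVEL_VOLTS 12.0   /* vote for MUX common A */
    #define HALF_SUPPLY       6.0   /* comparator reference (equal resistors 30, 32) */

    /* Equation (1): V_VB = 12 * A, A = fraction of active processors voting for A. */
    static double vote_bus_voltage(int votes_for_a, int active_modules)
    {
        return HIGH_LEVEL_VOLTS * (double)votes_for_a / (double)active_modules;
    }

    int main(void)
    {
        int active = 8;                       /* OMI modules still on the bus */
        for (int votes_for_a = 0; votes_for_a <= active; ++votes_for_a) {
            double v = vote_bus_voltage(votes_for_a, active);
            printf("%d/%d vote for A -> %5.2f V -> MUX common %c enabled\n",
                   votes_for_a, active, v, v > HALF_SUPPLY ? 'A' : 'B');
        }
        return 0;
    }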
The vote lead 22 is monitored by the vote collecting circuitry on the MUX common A module 8 comprised of the comparator 23 and associated resistor 24 and capacitor 34. The voltage on the vote lead 22 is first filtered by the resistor 24 and capacitor 34 combination to remove any noise components. The filtered signal is then routed to the positive input of the comparator 23. The negative input of the comparator 23 is biased to half of the supply by the equal value resistors 30 and 32. If the voltage on the vote lead 22 exceeds half-supply indicating that a majority of the microprocessors 11 have voted for the MUX common A module 8, then the output of the comparator 23 will swing high indicating that the plus terminal is more positive than the negative terminal. Similarly, the output will swing low if the voltage on the vote lead 22 is below half-supply indicating that a majority of the microprocessors 11 have voted for the MUX common B module. The resistor 28 is used to provide positive feedback in order to create a small amount of hysteresis which prevents the comparator 23 from oscillating in the event that the vote lead 22 is substantially equal to half-supply. When the output of the comparator 23 is at a high logic level, the MUX common A module 8 is enabled by the inverter circuit formed by the transistor 36 having the base resistors 38 and 40 and the collector resistor 42, at terminal C.
When the output of the comparator 23 is at a low logic level, the MUX common A module 8 is similarly disabled at the terminal C. Thus, in the preferred embodiment, a majority (i.e., more than 50 percent) of the microprocessors 11 must vote for the desired MUX common module, that is, the collective vote must be more than 50 percent before it will be activated. However, it should be readily apparent to those skilled in the art that the comparator circuit 23 could be replaced with other components such as a microprocessor so that any predetermined number of votes could be selected to transfer the operation from one MUX common module to another. Also, as will become apparent hereinafter, and in accordance with the preferred embodiment of the present invention, when the collective votes of the OMI modules 2 represent two consecutive collective votes, and thus, after the MUX commons have been switched twice, the entire system will go into the fault isolation and recovery sequence as will be described hereinafter.
The error servicing routine, described in Figures 4 through 8, illustrates the procedure performed on any reported error, based on a majority vote. The error check routines continually monitor specific system functions and alert the error servicing routine when an error has occurred.
These error check routines are as indicated below.

1. Sound Off Checks
Each module in the system is required to send a "sound off" data message at fixed time intervals. These "sound off" messages are logged by each OMI module as they are received off the data bus 21. At the end of each sound off interval each OMI module compares the current interval with a reference interval to verify that all modules are present. One type of error is reported if one module is missing and a second type of error is reported if two or more modules are missing. The former error type is cause for only one vote since the error is not inhibiting other modules from successfully using the data bus, and the error can be isolated to one module without executing a fault isolation and recovery sequence. The persistence of the latter error type is cause for multiple votes possibly producing a fault isolation and recovery sequence.
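As a hedged illustration of the sound off comparison described above, the following C sketch counts missing modules against a reference set and classifies the result into the two error types. The array sizes, names, and data structures are hypothetical.

    /* Illustrative sound off check: compare heard messages with the reference set. */
    #include <stdio.h>

    enum sound_off_result { ALL_PRESENT, ONE_MISSING, TWO_OR_MORE_MISSING };

    #define MAX_MODULES 32

    /* expected[i] != 0 means module i must sound off; heard[i] != 0 means it did. */
    static enum sound_off_result
    check_sound_off(const unsigned char expected[MAX_MODULES],
                    const unsigned char heard[MAX_MODULES])
    {
        int missing = 0;
        for (int i = 0; i < MAX_MODULES; ++i)
            if (expected[i] && !heard[i])
                ++missing;
        if (missing == 0) return ALL_PRESENT;
        if (missing == 1) return ONE_MISSING;
        return TWO_OR_MORE_MISSING;
    }

    int main(void)
    {
        unsigned char expected[MAX_MODULES] = { 1, 1, 1, 1 };   /* modules 0..3 */
        unsigned char heard[MAX_MODULES]    = { 1, 0, 1, 1 };   /* module 1 silent */
        printf("result = %d\n", check_sound_off(expected, heard));  /* 1: ONE_MISSING */
        return 0;
    }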
2. Sent Packet Check
When each OMI module transmits a message on the data bus, it also reads back this message through buffer 15, serial to parallel converter 17 and internal data bus 18. The latter portion of this message contains cyclic redundancy check (CRC) information. Failure to read back the transmitted CRC will produce an error type to the error servicing routine which will cause multiple voting and a possible fault isolation and recovery sequence if the error persists after the first vote and switch of MUX commons.
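The read-back idea behind the sent packet check can be sketched as below. The patent does not give the CRC polynomial, so a simple byte sum stands in for the check field; all function names are hypothetical.

    /* Illustrative sent packet check: verify the check field read back off the bus. */
    #include <stdio.h>

    /* Stand-in check function; the real system would use the specified CRC. */
    static unsigned char check_field(const unsigned char *msg, size_t len)
    {
        unsigned char sum = 0;
        for (size_t i = 0; i < len; ++i)
            sum = (unsigned char)(sum + msg[i]);
        return sum;
    }

    /* The module transmitted 'sent' plus its check field, then read the message
     * back off the data bus.  Returns 1 when the read-back check field matches. */
    static int sent_packet_ok(const unsigned char *sent, size_t len,
                              unsigned char check_read_back)
    {
        return check_field(sent, len) == check_read_back;
    }

    int main(void)
    {
        unsigned char sent[4] = { 0x10, 0x22, 0x33, 0x44 };
        unsigned char good = check_field(sent, sizeof sent);
        printf("%s\n", sent_packet_ok(sent, sizeof sent, good)     ? "pass" : "fail -> vote");
        printf("%s\n", sent_packet_ok(sent, sizeof sent, good ^ 1) ? "pass" : "fail -> vote");
        return 0;
    }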
3. MUX Common Guard Tone Check
Guard tone, for keying base stations in a tone controlled system, is available to all modules in the system. This tone is generated on the MUX common in an analog fashion. This provision has allowed for a method which requires, on a periodic basis, all modules to digitize this tone and source it into its assigned slot on the TDM bus. An OMI module will listen to this slot, convert it back to analog form, and detect its presence. This error check routine allows each OMI module to verify that other modules are remaining in their assigned slots. This method is incorporated in the "recovery" sequence to isolate modules which have "jumped" to another module's slot on the TDM bus.
4. Scheduled MUX Common Switch
To prevent a "silent" failure on the redundant MUX common, "scheduled" switches occur. This verifies not only the integrity of the "idling" MUX common, but also the vote collecting circuitry which is the communications link between each OMI module in the case of a data bus fault.
Referring now to Figures 4 through 8, they illustrate the error servicing routine which is performed within the system by the OMI modules 2 for servicing faults within the system and isolating the faults and recovering the system to an optimized configuration. Each of the OMI modules 2 is capable of performing the entire error servicing routine illustrated in Figures 4 through 8. However, in accordance with the present invention, not all of the OMI modules may be called upon to perform the recovery sequence. Each of the OMI modules 2 has a specific time slot address and, upon the initiation of the recovery sequence, each OMI module sets a timer which is weighted according to its address. The OMI module with the lowest address is the first OMI module to assume command of the system for isolating the fault and recovering the system. Each OMI module, when it becomes its turn to isolate the fault and recover the system, untristates, and first performs a series of internal checks. Since this OMI module will be the only data bus user at this time, any internal check errors cannot be assumed to be dependent on other modules, as they are tristated. These internal checks include the "sent packet check," where this untristated OMI module verifies that it can access the data bus properly by sending a "dummy" message to itself, and the "MUX common guard tone" check which verifies that this OMI module is accessing the correct slot on the TDM bus. Successfully passing these internal checks will assure that errors detected during the subsequent interrogation process cannot be attributed to the "controlling" OMI module.
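A minimal sketch of the address-weighted timer and the self-check gate, assuming a per-address weighting constant that the patent does not specify, is given below in C. It is illustrative only; the timer granularity and the check names are assumptions.

    /* Illustrative weighted timer and self-check gate for the recovery sequence. */
    #include <stdbool.h>
    #include <stdio.h>

    #define SLOT_MS 50   /* assumed per-address weighting, not given in the patent */

    static unsigned weighted_timeout_ms(unsigned module_address)
    {
        /* The lowest address expires first and becomes the controlling module. */
        return (module_address + 1) * SLOT_MS;
    }

    static bool internal_checks_pass(void)
    {
        bool dummy_message_ok = true;   /* "sent packet" self-check on the data bus */
        bool guard_tone_ok    = true;   /* tone heard in this module's own TDM slot */
        return dummy_message_ok && guard_tone_ok;
    }

    int main(void)
    {
        for (unsigned addr = 0; addr < 4; ++addr)
            printf("module %u untristates after %u ms\n", addr, weighted_timeout_ms(addr));

        if (internal_checks_pass())
            printf("self-checks passed: begin interrogating the other modules\n");
        else
            printf("self-checks failed: re-create timer, go to end of the list\n");
        return 0;
    }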
If the OMI module passes these internal checks, it can proceed with the interrogation sequence. The controlling OMI module begins the interrogation process, or recovery sequence, by sending a data message to only one module which will cause this module to untristate. All data messages sent by the controlling OMI module require an acknowledgment data message by the interrogated module. Another data message, an "echo" message, and subsequent acknowledgment message will verify two way communication over the data bus with two modules untristated. Having successfully passed this portion of the interrogation process the interrogated module is next asked to source guard tone into its assigned slot on the TDM bus. The controlling OMI module will listen to this module and detect its tone. This will verify that the polled module is not sourcing into the wrong slot on the TDM bus. After completing these checks the controlling OMI module sends it a tristate message. This completes the interrogation of a given module in the system. All modules in the system are similarly polled. A complete poll of all modules will reveal a module which failed to respond correctly. At the end of the polling process only those modules which responded correctly will be sent a final "power up" message by the controlling OMI module. This will end the recovery sequence and return the system to normal operation.
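The per-module interrogation can be summarized in the C sketch that follows. The message names and the stubbed bus operations are hypothetical; the sketch only shows the order of the steps (untristate, echo, guard tone, tristate) and the final power up round for the modules that passed.

    /* Illustrative per-module interrogation performed by the controlling module. */
    #include <stdbool.h>
    #include <stdio.h>

    /* Stub bus operations; a real implementation would drive the data bus. */
    static bool send_and_wait_ack(unsigned addr, const char *msg)
    {
        printf("  -> module %u: %s ... ack\n", addr, msg);
        return true;
    }
    static bool guard_tone_heard_in_slot(unsigned addr)
    {
        printf("  -> module %u: listening for guard tone in its slot ... heard\n", addr);
        return true;
    }

    /* Returns true if the interrogated module passed every step. */
    static bool interrogate(unsigned addr)
    {
        if (!send_and_wait_ack(addr, "untristate"))        return false;
        if (!send_and_wait_ack(addr, "echo"))              return false;
        if (!send_and_wait_ack(addr, "source guard tone")) return false;
        if (!guard_tone_heard_in_slot(addr))               return false;
        return send_and_wait_ack(addr, "tristate");
    }

    int main(void)
    {
        bool passed[4];
        for (unsigned addr = 0; addr < 4; ++addr) {
            printf("interrogating module %u\n", addr);
            passed[addr] = interrogate(addr);
        }
        for (unsigned addr = 0; addr < 4; ++addr)       /* final "power up" round */
            if (passed[addr])
                printf("power up message sent to module %u\n", addr);
        return 0;
    }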
When the controlling OMI module encounters a module, during the recovery sequence, which is rendering the data bus inoperative, it would now become impossible to send it a "tristate" message. To overcome this, each module while being interrogated must continually hear from the controlling OMI module within fixed time intervals.
If this time interval is exceeded, then the interrogated module will tristate by default. The controlling OMI module will allow for this time interval to elapse before continuing with the recovery sequence.
In the event that the first OMI module in the system is also the module which had the failure causing the recovery sequence, it will proceed with the interrogation sequence described above. At the end of the polling process it will evaluate the results and note that no module responded correctly. Since no module responded while being interrogated, the controlling OMI module will tristate and, therefore, pass control to the next OMI module. If such occurs, the past controlling OMI module will recreate its weighted timer so as to be last in the list. In this way all other system OMI modules will have an opportunity to complete the interrogation sequence. This same procedure of recreating a weighted timer is executed if the controlling OMI module fails its internal checks as described earlier.
Logistical questions surface when multiple failures are considered. These problems are handled by the software algorithm by prioritizing specific errors and selecting the more functional MUX common. Prioritization of possible system failures is:
1. Data bus errors: All failures which render the data bus inoperative. These include system clock, data arbiter failures on the MUX common, two or more missing modules, and misuse of the data bus by a given module. These errors will cause a double vote and subsequent recovery sequence if they persist after the first vote and switch.
2. Failure of one module to "sound off" will cause a vote. If the MUX common switch occurs, and the module is still missing, a reset message will be sent to this module in an attempt to restore it to normal operation. Regardless of the outcome, a second vote toward a recovery sequence will not occur.
3. Loss of guard tone generation on the MUX common is reason for an OMI module to vote for the redundant MUX common. Regardless of the outcome, a second vote toward a recovery sequence will not occur.
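One possible way to encode this prioritization in software is sketched below; the enumeration names and the vote limits expressed as return values are illustrative assumptions, not part of the claimed method.

    /* Illustrative error prioritization: votes each error class may contribute. */
    #include <stdio.h>

    enum error_class {
        DATA_BUS_ERROR,        /* clock, arbiter, two or more missing modules, bus misuse */
        ONE_MODULE_MISSING,    /* single missing "sound off" */
        GUARD_TONE_LOST        /* guard tone generation lost on the MUX common */
    };

    /* Maximum votes this class may contribute toward a recovery sequence. */
    static int max_votes(enum error_class c)
    {
        switch (c) {
        case DATA_BUS_ERROR:     return 2;  /* double vote, possible recovery sequence */
        case ONE_MODULE_MISSING: return 1;  /* one vote, then a reset message only */
        case GUARD_TONE_LOST:    return 1;  /* one vote for the redundant MUX common */
        }
        return 0;
    }

    int main(void)
    {
        printf("data bus error: %d vote(s)\n", max_votes(DATA_BUS_ERROR));
        printf("one module missing: %d vote(s)\n", max_votes(ONE_MODULE_MISSING));
        printf("guard tone lost: %d vote(s)\n", max_votes(GUARD_TONE_LOST));
        return 0;
    }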
Referring now more particularly to Figure 4 as illustrated, the error servicing routine begins with each OMI module determining whether a fault or error in the system is currently being processed. If the answer is "no", then the OMI module determines whether it detected any errors.
These errors include the previously mentioned data access, sound off, or MUX common guard tone generation errors. If the answer is "no", then the OMI module proceeds to determine whether the MUX commons have been switched. If the answer is "no", the OMI module returns to the executive control or main operating system which supervises all of the routines within the system. Therefore, when a "return" is indicated, this indicates that the OMI module returns to the control of the executive control or main operating system.
If the OMI module determines that the MUX commons have not been switched and returns, this indicates that the system is operating properly and that there are no faults to be acted upon. If the OMI module did not detect any errors but the MUX commons have been switched, then the OMI module will vote to agree with the majority of the other OMI modules. This indicates that a majority of the other system OMI modules have detected a system error as described earlier, and voted before this OMI module detected any error. In any event, this OMI module must stay in sequence, should a recovery sequence be forthcoming, and vote as did the majority.
If the OMI module did detect an error, then it proceeds to vote for a MUX common switch according to the sequence indicated in Figure 4.
Referring now to Figure 4, after the OMI module votes for the MUX common switch, it then determines whether the MUX commons have been switched by the condition of the status lead 20.
If the answer is "yes", this indicates that the majority of the system OMI modules have already detected an error and voted. The OMI module will then reinitialize its error flags and its error detect routines and upon its next execution, it will perform its "post MUX" routine to be described hereinafter. In brief, the "post MUX" routine will evaluate the integrity of the redundant MUX common.
If the MUX commons were not switched, this indicates that this OMI module is a minority OMI module and should allow time for the other OMI modules to vote. Hence, its next execution is to do a "wait MUX" routine to be described hereinafter.
Referring again to Figure 4, if the OMI module determines that an error is currently being processed, then it will jump to one of four routines indicated as a Wait, Post MUX, Vote Recovery, or Recovery routine. Each of these is described hereinafter in greater detail.
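A compact sketch of this top-level dispatch, assuming a simple per-module phase variable, is shown below. The state names mirror the four routines of Figure 4, but the code itself is hypothetical and only illustrates the decision order just described.

    /* Illustrative top of the error servicing routine of Figure 4. */
    #include <stdbool.h>
    #include <stdio.h>

    enum phase { IDLE, WAIT_MUX, POST_MUX, VOTE_RECOVERY, RECOVERY };

    static const char *service_errors(enum phase p, bool error_detected,
                                      bool mux_commons_switched)
    {
        switch (p) {                       /* an error is already being processed */
        case WAIT_MUX:      return "run wait MUX routine";
        case POST_MUX:      return "run post MUX routine";
        case VOTE_RECOVERY: return "run vote-for-recovery routine";
        case RECOVERY:      return "run recovery routine";
        case IDLE:          break;
        }
        if (error_detected)
            return "vote for a MUX common switch";
        if (mux_commons_switched)
            return "vote to agree with the majority";
        return "return to executive control";
    }

    int main(void)
    {
        printf("%s\n", service_errors(IDLE, false, false));
        printf("%s\n", service_errors(IDLE, true,  false));
        printf("%s\n", service_errors(IDLE, false, true));
        printf("%s\n", service_errors(WAIT_MUX, false, false));
        return 0;
    }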

Referring now to Figure 5, this illustrates the wait MUX routine. The OMI module first determines whether the MUX commons have been switched.
If the answer is "no", then it determines whether it has time remaining to wait. If the answer is "yes", then the OMI module returns and allows time for the other system OMI modules to detect the error.
If the answer is "no", then the OMI module proceeds to determine if there has been a data bus access error. If the answer is "no", it reports a MUX common switch failure and reinitializes its error flags and returns. This results because no further error processing action can occur without a consensus of the majority of the OMI modules.
If the OMI module determined that there was no time to wait and that the data bus access error did occur, then the OMI module will tristate itself and then return. This removes this OMI module including its vote from the system due to a failure to properly access the data bus.
If initially the OMI module determined that the MUX commons had been switched, then it proceeds as indicated in Figure 5, to initialize its error flags and then execute the post MUX routine upon its next execution. This indicates that the majority of the system OMI modules have detected an error and that it is now time to evaluate the redundant MUX common. In initializing its error flags, each OMI module, in essence, is forgetting past errors and only remembering that some error, as described earlier, has caused a vote and subsequent MUX common switch. What further action is needed will be determined by new error input collected while operating under the redundant MUX common.
Referring now to Figure 6, which illustrates the post MUX routine. In evaluating the redundant MUX common, the OMI module first determines whether errors are detected. If no errors are detected, the OMI module reinitializes to normal operation. This indicates that the previously active MUX common was faulty and the switch to the redundant MUX common resolved the system fault. The OMI module then returns.
If errors are detected while operating under the redundant MUX common, then the OMI module determines whether that error was a data bus access error. If the answer is "yes", then the OMI module votes for a second MUX common switch.
After so voting, the OMI module will upon its next execution go into the vote for recovery sequence and return. In accordance with this preferred embodiment, data bus access errors are the highest priority errors in the system and constitute a second vote for a recovery sequence.
If there was no data bus access error, then the OMI module proceeds to the sequence of operations illustrated in Figure 6. First, the OMI module determines whether one node or module failed to sound off. In accordance with the present invention, each system node or module is required to send a data packet over the data bus at predetermined intervals, such as, for example, every five seconds. This allows the OMI modules to determine whether all modules are active on the bus. If one node failed to sound off, then the OMI module sends a reset data message to the missing node. This is done because the missing module is not inhibiting access to the data bus by other modules in the system. The OMI module determines whether the reset was successful. If the reset was not successful, then the missing node is excluded from the active node list and the OMI module then returns. If the reset was successful, the OMI module then also returns. As a result, a missing node or module can be restored without upsetting the rest of the system. It is of course to be understood that each of the OMI modules should have detected the missing node and each sent a reset message to it.
If initially, the OMI module determined that no node failed to sound off, the OMI module then reports any remaining errors and then reinitializes its error flags to the extent possible and then returns to normal operation.
Figure 7 illustrates the Vote for Recovery error sequence. First, the OMI module determines whether the second MUX common switch has occurred.
In other words, the OMI module determines whether two consecutive collective votes from the OMI modules have exceeded a predetermined value, here a majority of the OMI modules. If the second MUX common switch did occur, the OMI module will upon its next execution go into the recovery sequence and then return. This indicates that the majority of the system OMI modules have encountered data bus access errors and voted for a recovery sequence.

If the OMI module however did not detect a second MUX common switch, then this OMI module would tristate itself because of an internal error within itself involving data bus access. This OMI module is a minority OMI module and therefore, it will tristate, thereby completely removing itself along with its vote from the system.
Referring now to Figures 8A through 8D, these Figures illustrate the Recovery sequence which takes place after two consecutive collective votes of the OMI modules have exceeded a predetermined value. This condition causes the recovery sequence to be initialized. First, as indicated, all OMI modules tristate from all of the system buses. Also, the passive modules of the system will also tristate within a predetermined period of time after the recovery sequence has initiated. This occurs because passive modules must periodically receive "stay awake," or "remain untristated" messages from the system OMI modules. Each OMI module then initializes its recovery flags and creates a weighted timer. The weighted time is based on the module or node address. The OMI module having the lowest node address will time out first, thereby becoming the controlling OMI module, which first begins the system interrogation process.
The OMI module then determines whether its weighted timer has expired. If it has, then the OMI module knows that it is the next one, which may be the first OMI module, to assume control of the recovery process. The OMI module assuming control first untristates and performs a series of self-checks. The first self-check is a data bus access in the form of a "dummy" message to itself. As a second check it sources guard tone into its assigned slot on the TDM bus, which it then detects to verify that it has not jumped into another slot. The OMI module then determines whether the self-check routine was successful. If the self-check routine was not successful, then the OMI module reinitializes its weighted timer and then goes to the end of the list of OMI modules to perform the recovery sequence. The OMI module then tristates itself from the bus.
As indicated in Figure 8A, after tristating itself from the bus, the OMI module increments a recovery attempt counter. It then determines whether the count is greater than some maximum value, the maximum value being chosen such that any further attempts to successfully pass the self-check under this MUX common will no doubt prove fruitless. If that count is greater than the maximum value, then the OMI module resets its counter and votes for another MUX common switch.
This indicates that each OMI module has had a chance to test itself and all tests have been unsuccessful. As a result, the system will continue to switch between the MUX commons until the most intelligent or able MUX common is found.
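The recovery attempt counter can be sketched as follows; the maximum attempt value is left open by the patent, so the constant used here is only an assumption.

    /* Illustrative recovery attempt counter from Figure 8A. */
    #include <stdbool.h>
    #include <stdio.h>

    #define MAX_RECOVERY_ATTEMPTS 3   /* assumed value; the patent leaves it open */

    struct recovery_state {
        int attempts;
    };

    /* Called each time this module's self-check fails during recovery.
     * Returns true if the module should vote for another MUX common switch. */
    static bool selfcheck_failed(struct recovery_state *r)
    {
        r->attempts++;
        if (r->attempts > MAX_RECOVERY_ATTEMPTS) {
            r->attempts = 0;           /* reset counter, then vote for a switch */
            return true;
        }
        return false;                  /* re-create weighted timer, retry later */
    }

    int main(void)
    {
        struct recovery_state r = { 0 };
        for (int i = 1; i <= 5; ++i)
            printf("failure %d -> %s\n", i,
                   selfcheck_failed(&r) ? "vote for MUX common switch"
                                        : "requeue at end of list");
        return 0;
    }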
Returning now to Figure 8A, if the OMI module determines that it passed its self-check routine, it proceeds as indicated in Figure 8B to send a data message to alert all of the system OMI modules that it is the controlling OMI module. This data message will cause all remaining system OMI modules to reinitialize their weighted timers based on the module address of this now controlling OMI module. These OMI modules will continue to reinitialize these timers on every data message which originates from the controlling OMI module during the recovery sequence. This assures that the controlling OMI module can complete the recovery sequence unimpeded by other OMI modules.
The controlling OMI module then obtains the first, or next, node address to interrogate until all modules have been interrogated. If more modules must be interrogated it first sends an untristate data message to this module.
The controlling OMI module then determines whether it has received an acknowledgment message back from the module to which it sent the untristate data message. If it did not receive an acknowledgment, this indicates a node failure, and the OMI module then returns to get the next node address. If the controlling OMI module did receive an acknowledgment message from the module to which it sent the untristate data message, then it sends an echo data message. It then determines whether it has received an acknowledgment in response to the echo data message. If the answer is "no", this indicates a node failure and the OMI module returns to get the next address message.
The echo message verifies that two untristated modules can access the data bus correctly. If two untristated modules can successfully talk to each other, then all system modules which pass this portion of the interrogation should be capable of communicating with each other when all are untristated, as in normal operation.

If it had received an acknowledgment to its echo data message, then the controlling OMI module proceeds to send a time division multiplex tone continuity data message as indicated in Figure 8C. It then waits to determine whether it has received an acknowledgment. If it has not, this indicates a node failure and it once again returns to obtain the address of the next node to interrogate. If the answer is "yes", then it determines whether it detected a time division multiplex tone. This determines whether the interrogated module is placing data onto the bus within the correct time slot. Failure to source guard tone into the correct TDM slot could mean that another slot user is experiencing difficulty in using its TDM slot under normal operating conditions. If the answer is "no", then this indicates a node failure and the controlling OMI module returns to obtain the next module address. If the answer was "yes", then the controlling OMI module sends a tristate data message to tristate the module being interrogated. As indicated on the flow chart, the controlling OMI module proceeds to determine whether it has received an acknowledgment to its tristate data message as indicated in Figure 8C. If the answer is "no", this again indicates a node failure. If it did however receive an acknowledgment to its tristate data message, this indicates that the module or node has passed, which is noted by the controlling OMI module, and then the controlling OMI module returns to obtain the address of the next module to be interrogated.

This sequence continues for each module in the system until all of the modules have been interrogated. As indicated in Figure 8B, if all of the modules have been interrogated, then the controlling OMI module proceeds to determine whether at least one node passed the interrogation. Even though the controlling OMI module may have successfully passed the self-checks before polling the other modules, it may still be possible that this OMI module has failed in such a manner so as to inhibit other nodes from successfully accessing the data bus. For this reason, at least one interrogated module must have successfully passed to conclude the recovery sequence. As indicated in Figure 8C, if the answer is "yes", then the controlling OMI module sends an untristate message to each passing module for recovering the system.
After each untristate message, the controlling OMI module determines whether more nodes are to be called up. If the answer is "yes", the controlling OMI module continues to send the untristate data message to all of the passing nodes. If the answer is "no", then the controlling OMI module proceeds to reinitialize and return the system to normal operation as indicated in Figure 8D and then returns to the executive control.
Returning to Figure 8A, the execution path that an OMI module takes while waiting for its weighted timer to expire will be discussed. If the OMI module determines that its weighted timer had not expired, then it proceeds, as illustrated in Figure 8D, to determine if any interrogation data message from the controlling OMI module has been sent. If the answer is "yes", as indicated in Figure 8E, the OMI module will adjust its weighted timer to this controlling OMI module so that its timer will not time out before the controlling OMI module has had an opportunity to interrogate all modules within the system. This listening OMI module will, at this time, determine if this interrogation message is addressed to it.
If true, then this OMI module will proceed with any necessary action and respond with the correct acknowledgment message as previously described.
The OMI module then returns to determine whether its weighted timer has timed out as indicated in Figure 8A.
If the OMI module does not detect interrogation messages from the controlling OMI module, it will then look for the presence of "call up" messages which signify the conclusion of the recovery sequence and return to normal operation. "Call up" messages contain the specific module address of those modules which passed the interrogation process. This OMI module will look for a message addressed to it. If the answer is "no", then this OMI module has failed the interrogation. If the answer is "yes", then this OMI module has passed the interrogation. If the OMI module did not pass the interrogation, then as indicated in Figure 8D, it sets its tristate flag and remains off of the data bus. If it did pass the interrogation, it initializes its error flags and returns to normal operation. This completes the fault isolation and recovery procedure.
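Finally, the behaviour of a listening, non-controlling OMI module during the recovery sequence can be summarized in the following illustrative sketch; the message kinds and the returned action strings are hypothetical.

    /* Illustrative behaviour of a listening module during the recovery sequence. */
    #include <stdio.h>

    enum msg_kind { INTERROGATION, CALL_UP };

    struct message {
        enum msg_kind kind;
        unsigned      addressed_to;   /* module address carried in the message */
    };

    static const char *listen(unsigned my_address, const struct message *m)
    {
        if (m->kind == INTERROGATION) {
            /* every controlling-module message restarts the weighted timer */
            if (m->addressed_to == my_address)
                return "restart timer; perform requested check and acknowledge";
            return "restart timer; keep waiting";
        }
        /* CALL_UP: end of the recovery sequence */
        if (m->addressed_to == my_address)
            return "passed interrogation: clear error flags, resume normal operation";
        return "failed interrogation: set tristate flag, stay off the data bus";
    }

    int main(void)
    {
        struct message poll   = { INTERROGATION, 2 };
        struct message callup = { CALL_UP, 2 };
        printf("%s\n", listen(2, &poll));
        printf("%s\n", listen(3, &poll));
        printf("%s\n", listen(2, &callup));
        printf("%s\n", listen(3, &callup));
        return 0;
    }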

Claims (29)

THE EMBODIMENTS OF THE INVENTION IN WHICH AN EXCLUSIVE
PROPERTY OR PRIVILEGE IS CLAIMED ARE DEFINED AS FOLLOWS:
1. In a distributed processing system of the type including a plurality of modules, a subsystem for isolating faults within said system and recovering said system to optimize operation, said subsystem comprising:
at least some of said modules being active fault recovery modules including fault detecting means for initializing a fault check routine and sensing faults within said system including faults within a respective module;
voting means associated with each of said active modules for placing a vote during each said fault check routine in response to a detected fault;
collective vote determining means for recording the votes of said active modules after each said fault check routine;
means for cooperatively intercoupling each of said voting means and said collective vote determining means; and recovery sequence initializing means associated with each active module for initializing a fault isolation and recovery sequence in response to a predetermined number of consecutive collective votes exceeding a predetermined value.
2. A system as defined in claim 1 wherein said modules are distributed along a common bus and wherein said active modules are arranged to tristate from said bus in response to the initializing of said recovery sequence.
3. A system as defined in claim 1 wherein the other, non-active fault recovery modules of said modules are arranged to tristate from said bus within a predetermined time following the initializing of said recovery sequence.
4. A system as defined in claim 2 wherein one of said tristated active modules is arranged to untristate and become fully active on said bus within a predetermined time after all said modules are tristated from said bus.
5. A system as defined in claim 4 wherein said untristated module is arranged to perform self-checks to detect faults within itself.
6. A system as defined in claim 4 wherein said untristated module is arranged to switch back to a tristate condition upon detecting a fault within itself.
7. A system as defined in claim 4 wherein another one of said tristated modules is arranged to untristate and become fully active on said bus in response to said one module switching back to said tristate condition.
8. A system as defined in claim 4 wherein said untristated module is arranged to test each said module on said bus for faults.
9. A system as defined in claim 4 wherein said untristated module is arranged to test each said module one at a time in a given sequence.
10. A system as defined in claim 4 wherein said untristated module includes means for testing said modules over said bus.
11. A system as defined in claim 10 wherein said untristated module includes means for untristating each said module, testing each said module, and thereafter tristating each said module.
12. A system as defined in claim 4 wherein said untristated module further includes means for untristating all said modules which have successfully passed said tests after all said modules have been tested to recover said system to optimized operation without the faulting ones of said modules.
13. A system as defined in claim 1 wherein each said active fault recovery module is arranged for testing all the other of said modules sequentially.
14. A system as defined in claim 13 wherein each said active fault recovery module is arranged for testing the other said modules one at a time in a given sequence.
15. A system as defined in claim 13 wherein each said active fault recovery module includes means for testing itself prior to testing the other said modules.
16. A system as defined in claim 15 further including means for causing the first one of said active fault recovery modules which successfully completes said self test to test the other said modules.
17. A system as defined in claim 16 wherein said active fault recovery modules are arranged to perform said self tests in a given sequence.
18. A system as defined in claim 13 wherein all said modules are arranged for being deactivated upon the initializing of said recovery sequence and wherein said first one of said active fault recovery modules includes means for activating the modules which successfully pass said test for recovering said system to optimized operation.
19. A system as defined in claim 1 further including means for detecting when a fault occurs in only a faulty one of said modules, means for resetting said faulty module after one of said collective votes, and means for removing said faulty module from said system.
20. A method of isolating faults and recovering a distributed processing system including a plurality of modules to optimized operation, said method comprising the steps of:
periodically detecting for faults within said system at given ones of said modules;
generating a vote at those given ones of said modules detecting a fault within said system, including a fault detected at a respective module;
collecting said votes such that a collective vote is generated from the respective votes; and initializing a recovery sequence when a predetermined consecutive number of said collective votes exceed a predetermined value.
21. A method as defined in claim 20 including the further step of deactivating all of said modules upon the initializing of said recovery sequence.
22. A method as defined in claim 21 including the further step of activating one module of said given ones of said modules and causing said one module to test the other said modules.
23. A method as defined in claim 22 further including the step of testing said one module internally prior to testing the other said modules.
24. A method as defined in claim 22 wherein said one module tests the other said modules one at a time.
25. A method as defined in claim 24 wherein said step of testing each said other module includes: activating each said module individually, testing each said module, and thereafter deactivating each said module.
26. A method as defined in claim 23 including the further step of deactivating said one module if it fails to pass said internal test and activating another one of said given ones of modules and causing said another one of said given ones of modules to test said other modules.
27. A method as defined in claim 22 including the further step of activating all said other modules passing said test after all said other modules have been tested to recover said system.
28. A method as defined in claim 27 wherein said one module activates all said other modules passing said test.
29. A method as defined in claim 20 including the further steps of detecting when a fault occurs in only a faulty one of said modules, resetting said faulty module, and removing said faulty module from said system.
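The vote-collection trigger recited in claims 1 and 20 can likewise be sketched in a few lines. This is a minimal illustration under assumed names (recovery_trigger, vote_threshold, and consecutive_required are not terms from the patent): each fault check cycle produces one collective vote, and the recovery sequence begins only after a preset number of consecutive collective votes exceeds a preset value.

    from typing import Iterable

    def recovery_trigger(collective_votes: Iterable[int],
                         vote_threshold: int,
                         consecutive_required: int) -> bool:
        """Return True once enough consecutive collective votes exceed the threshold."""
        run = 0
        for vote in collective_votes:
            run = run + 1 if vote > vote_threshold else 0
            if run >= consecutive_required:
                return True        # initialize the fault isolation and recovery sequence
        return False

    # Example: each list entry is one fault check cycle's collective vote (the number
    # of active modules that voted "fault"); recovery starts only after three
    # consecutive cycles in which more than one module reported a fault.
    if __name__ == "__main__":
        print(recovery_trigger([0, 2, 2, 2, 1],
                               vote_threshold=1,
                               consecutive_required=3))   # prints: True

Requiring several consecutive over-threshold collective votes acts as a simple debounce, so a single noisy fault check cycle does not by itself take every module off the bus.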
CA000469577A 1983-12-09 1984-12-07 Distributed fault isolation and recovery system and method Expired CA1213985A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US06/559,981 US4570261A (en) 1983-12-09 1983-12-09 Distributed fault isolation and recovery system and method
US559,981 1983-12-09

Publications (1)

Publication Number Publication Date
CA1213985A true CA1213985A (en) 1986-11-12

Family

ID=24235865

Family Applications (1)

Application Number Title Priority Date Filing Date
CA000469577A Expired CA1213985A (en) 1983-12-09 1984-12-07 Distributed fault isolation and recovery system and method

Country Status (2)

Country Link
US (1) US4570261A (en)
CA (1) CA1213985A (en)

Families Citing this family (37)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS61120247A (en) * 1984-11-16 1986-06-07 Hitachi Ltd Detector for runaway of controller
JPH0789337B2 (en) * 1985-10-30 1995-09-27 株式会社日立製作所 Distributed file recovery method
US4967347A (en) * 1986-04-03 1990-10-30 Bh-F (Triplex) Inc. Multiple-redundant fault detection system and related method for its use
US4723242A (en) * 1986-06-27 1988-02-02 Sperry Corporation Digital adaptive voting
US5065311A (en) * 1987-04-20 1991-11-12 Hitachi, Ltd. Distributed data base system of composite subsystem type, and method fault recovery for the system
US4926414A (en) * 1987-06-12 1990-05-15 International Business Machines Corporation Control point session synchronization in a network
CA2003338A1 (en) * 1987-11-09 1990-06-09 Richard W. Cutts, Jr. Synchronization of fault-tolerant computer system having multiple processors
AU616213B2 (en) * 1987-11-09 1991-10-24 Tandem Computers Incorporated Method and apparatus for synchronizing a plurality of processors
GB8813958D0 (en) * 1988-06-13 1988-07-20 Plessey Telecomm Data path protection
AU625293B2 (en) * 1988-12-09 1992-07-09 Tandem Computers Incorporated Synchronization of fault-tolerant computer system having multiple processors
US4965717A (en) * 1988-12-09 1990-10-23 Tandem Computers Incorporated Multiple processor system having shared memory with private-write capability
US5295258A (en) * 1989-12-22 1994-03-15 Tandem Computers Incorporated Fault-tolerant computer system with online recovery and reintegration of redundant components
US5203004A (en) * 1990-01-08 1993-04-13 Tandem Computers Incorporated Multi-board system having electronic keying and preventing power to improperly connected plug-in board with improperly configured diode connections
EP0471090B1 (en) * 1990-03-05 1998-09-16 Fujitsu Limited Message communication processing system
US5740357A (en) * 1990-04-26 1998-04-14 Digital Equipment Corporation Generic fault management of a computer system
US5654695A (en) * 1991-02-22 1997-08-05 International Business Machines Corporation Multi-function network
US5495474A (en) * 1991-03-29 1996-02-27 International Business Machines Corp. Switch-based microchannel planar apparatus
US5459761A (en) * 1992-06-29 1995-10-17 Motorola, Inc. Intelligent repeater for trunked communications
US5392449A (en) * 1992-06-29 1995-02-21 Motorola, Inc. Resource management by an intelligent repeater
US5513378A (en) * 1992-06-29 1996-04-30 Ranz; Stephen J. Maintaining communications after control breaks in a trunked communication system
US5586252A (en) * 1994-05-24 1996-12-17 International Business Machines Corporation System for failure mode and effects analysis
JP2886093B2 (en) * 1994-07-28 1999-04-26 株式会社日立製作所 Fault handling method and information processing system
US5729733A (en) * 1995-05-05 1998-03-17 Harris Corporation Method of operating a distributed database based on object ownership and transaction classification utilizing an aggressive reverse one phase commit protocol
JP2687927B2 (en) * 1995-05-24 1997-12-08 日本電気株式会社 External bus failure detection method
US5742753A (en) * 1996-06-06 1998-04-21 The Boeing Company Mesh interconnected array in a fault-tolerant computer system
US5938775A (en) * 1997-05-23 1999-08-17 At & T Corp. Distributed recovery with κ-optimistic logging
GB2345161A (en) 1998-12-23 2000-06-28 Motorola Ltd Microprocessor module and method
US7801702B2 (en) 2004-02-12 2010-09-21 Lockheed Martin Corporation Enhanced diagnostic fault detection and isolation
US7584420B2 (en) * 2004-02-12 2009-09-01 Lockheed Martin Corporation Graphical authoring and editing of mark-up language sequences
US20060120181A1 (en) * 2004-10-05 2006-06-08 Lockheed Martin Corp. Fault detection and isolation with analysis of built-in-test results
US20060085692A1 (en) * 2004-10-06 2006-04-20 Lockheed Martin Corp. Bus fault detection and isolation
US20080052281A1 (en) * 2006-08-23 2008-02-28 Lockheed Martin Corporation Database insertion and retrieval system and method
US7427025B2 (en) * 2005-07-08 2008-09-23 Lockheed Martin Corp. Automated postal voting system and method
US10481963B1 (en) * 2016-06-29 2019-11-19 Amazon Technologies, Inc. Load-balancing for achieving transaction fault tolerance
WO2018048723A1 (en) 2016-09-09 2018-03-15 The Charles Stark Draper Laboratory, Inc. Methods and systems for achieving trusted fault tolerance of a system of untrusted subsystems
LU100069B1 (en) * 2017-02-10 2018-09-27 Univ Luxembourg Improved computing apparatus
CN111262651B (en) * 2019-12-11 2022-08-16 郑北京 Wired button voting data rapid acquisition system

Family Cites Families (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US3444321A (en) * 1965-09-11 1969-05-13 Athanasius J Pantos Defective circuit detector
US3649963A (en) * 1970-06-05 1972-03-14 Bell Telephone Labor Inc Error detection arrangement for register-to-register data transmission
US4231015A (en) * 1978-09-28 1980-10-28 General Atomic Company Multiple-processor digital communication system
DE2939487A1 (en) * 1979-09-28 1981-04-16 Siemens AG, 1000 Berlin und 8000 München COMPUTER ARCHITECTURE BASED ON A MULTI-MICROCOMPUTER STRUCTURE AS A FAULT-TOLERANT SYSTEM
US4400694A (en) * 1979-12-03 1983-08-23 Wong Raphael W H Microprocessor base for monitor/control of communications facilities
US4304001A (en) * 1980-01-24 1981-12-01 Forney Engineering Company Industrial control system with interconnected remotely located computer control units
US4330826A (en) * 1980-02-05 1982-05-18 The Bendix Corporation Synchronizer and synchronization system for a multiple computer system
US4298860A (en) * 1980-03-10 1981-11-03 Control Data Corporation Monitor and control apparatus
US4298982A (en) * 1980-06-03 1981-11-03 Rca Corporation Fault-tolerant interface circuit for parallel digital bus
US4342112A (en) * 1980-09-08 1982-07-27 Rockwell International Corporation Error checking circuit
US4328583A (en) * 1980-09-08 1982-05-04 Rockwell International Corporation Data bus fault detector
US4374436A (en) * 1980-10-17 1983-02-15 Paradyne Corporation System for the monitoring and restoration of series terminals in a looped communication system
US4375683A (en) * 1980-11-12 1983-03-01 August Systems Fault tolerant computational system and voter circuit
US4380067A (en) * 1981-04-15 1983-04-12 International Business Machines Corporation Error control in a hierarchical system
US4438494A (en) * 1981-08-25 1984-03-20 Intel Corporation Apparatus of fault-handling in a multiprocessing system
US4503534A (en) * 1982-06-30 1985-03-05 Intel Corporation Apparatus for redundant operation of modules in a multiprocessing system
US4503535A (en) * 1982-06-30 1985-03-05 Intel Corporation Apparatus for recovery from failures in a multiprocessing system

Also Published As

Publication number Publication date
US4570261A (en) 1986-02-11

Similar Documents

Publication Publication Date Title
CA1213985A (en) Distributed fault isolation and recovery system and method
US4562575A (en) Method and apparatus for the selection of redundant system modules
EP0449964B1 (en) Distributed switching architecture for communication module redundancy
EP0570882A2 (en) A distributed control methodology and mechanism for implementing automatic protection switching
US6385665B1 (en) System and method for managing faults in a data transmission system
KR960016648B1 (en) Duplicated switching method of common controller
JPH04359322A (en) Backup method for general-purpose input/output redundancy method in process control system
CN112346925A (en) Process-level dual-computer hot standby redundancy system and method
US3451042A (en) Redundant signal transmission system
US5513312A (en) Method for system-prompted fault clearance of equipment in communication systems
US5499336A (en) Monitoring a computer network
JPH04503434A (en) How to monitor computer networks
JP4126849B2 (en) Multi-CPU system monitoring method
JPH08194628A (en) Bus fault processing system
JPH113293A (en) Computer system
JPS6036148B2 (en) Failure handling method
JP2743893B2 (en) Driver circuit failure determination method, failure location reporting method and peripheral device
JPS63220635A (en) Transmitting circuit for exclusive line transmission system terminal equipment
Aitcheson et al. No. 1 ESS ADF: Maintenance Plan
JPH0326696Y2 (en)
JPH04329737A (en) Mutual supervisory unit for communication system
JP2728959B2 (en) Communication device
JPS62128249A (en) Supervisory system by external supervisory equipment of exchange
JPH0454747A (en) Data transfer system
JPS6375843A (en) Abnormality monitor system

Legal Events

Date Code Title Description
MKEX Expiry