WO2000055953A1 - System and method of event management and early fault detection

Info

Publication number: WO2000055953A1
Application number: PCT/US2000/006919
Authority: WO
Inventors: Kumar Gajjar; Nghiep Tran
Original assignee: Smartsan Systems, Inc.
Priority date: 1999-03-15
Filing date: 2000-03-15
Publication date: 2000-09-21
Also published as: AU3889200A

Abstract

Users or clients of a computer system can set and change, as needed, the fault reporting, fault logging, fault notification, and fault trigger point thresholds for any event in a network or system. A 'point and click' graphical user interface (GUI) allows users to conveniently perform these tasks, or they can be performed by calling API functions. Another advantage is the integration of an Event Manager into a central point of all appropriate fault management functions, including an Event Table, Registration, Event Thresholding, Logging and Notification, as well as Recovery Operations or Actions to be Taken.

Description

SYSTEM AND METHOD OF EVENT MANAGEMENT AND EARLY FAULT DETECTION

CROSS-REFERENCE TO RELATED APPLICATIONS

The present application is related to copending U.S. Provisional Patent Application Serial No. 60/124,494, entitled "System and Method of Zoning and Access Control, Event Management and Network Management in a Computer Network," filed on March 15, 1999, and extends in a novel way the tasks accomplished by, and the utility of, that application.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates generally to network fault management via a software event manager (EM) inserted in the network at a central point in the system and controlled by the user through a Graphical User Interface (GUI).

2. Description of the Background Art

The introduction and proliferation of fibre channel has greatly increased network connectivity between central servers and local storage, so that many more devices can be connected to a network over wider geographical areas.

Fibre channel is an ANSI-standard, high-speed data communications technology providing gigabit-per-second transmission rates for server/storage and large-size, high-performance, geographically dispersed networking environments. Increases in computer network speed, size and connectivity require that early fault detection and fault management controls be embedded in the central server or elsewhere with connections to all devices and storage comprising the network. The main components or functions of the fault or event manager (EM) are:

1. Event Table

2. Registration

3. Event Thresholding, Logging and Notification

4. Recovery Operations or Actions to be Taken.

The prior art for network fault or event management allows only predefined, built-in controls, that is, preset, fixed trigger thresholding. Therefore, there remains a need for an improved fault management system which permits users to conveniently set and change, as needed by each application, the trigger point threshold via an easy-to-use method, such as a Graphical User Interface (GUI).

SUMMARY OF THE INVENTION

The present invention provides a software system and method for the users or clients of the system to set and change, as needed, the fault reporting, fault logging, fault notification, and fault trigger point thresholds for any event in the network or system. A "point and click" graphical user interface (GUI) can allow users to perform these tasks, or they can be performed by calling API functions.

Another advantage of the present invention is the integration into a central point, the EM, of all appropriate fault management functions as follows:

1. Event Table

2. Registration

3. Event Thresholding, Logging and Notification

4. Recovery Operations or Actions to be Taken.

With the EM in the system, other subsystem modules do not need to have code to monitor and track any events or faults, thus freeing them of the burden of bookkeeping, reporting and the other features of the EM.

Other advantages and features of the present invention will be apparent from the drawings and detailed description as set forth below.

BRIEF DESCRIPTION OF THE DRAWINGS

Figure 1 is a block diagram of one embodiment of a controller device according to the invention embodying an Event Manager (EM) for managing events and faults in a computer network;

Figure 2 is a block diagram illustrating one embodiment of the process of client registration;

Figure 3 is a block diagram of nested hierarchical blocks illustrating one embodiment of the format and the ordered information content in the Client Event Table;

Figure 4 is a block diagram illustrating one embodiment of the Event Notification Registration List;

Figure 5 is a flow chart diagram of the Event Notification process in accordance with the present invention;

Figure 6 is a block diagram of one embodiment illustrating the Event Threshold Registration List;

Figure 7 is a flow chart diagram of Event Thresholding in accordance with the present invention;

Figure 8 is a flow chart diagram of Ordered Event Thresholding in accordance with the present invention;

Figure 9 is a block diagram of one embodiment of the Event Reporting feature; and

Figure 10 is a block diagram of an Event Reporting example in accordance with the present invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

The present invention is a novel system and method of providing fault management and early fault detection, reporting and system response in a computer or logic device network that reaches all the way down to the device level, including logical devices.

Figure 1 is a schematic block diagram illustrating one embodiment of the controller or EM 100 and identifying its key elements. These elements include the Processor Module 110 comprising a Processor 120 connected to a random access memory (RAM) 130, a non-volatile memory 140, a read-only memory (ROM) 150, a Cache/Staging memory 170 and the input/output connections to all the relevant components of the network (FC I/Os 172, 174, etc. and I/Os 182, 184, 186, etc.).

During the initialization process, each software module or client initializes itself, generates an Event Table listing all the possible events that it can detect, and registers itself with the EM. The EM then adds the new client to the Client List 132 region of the RAM 130. This is discussed in detail below with reference to Figure 2. During the registration, the client passes a Client Event Table pointer to the EM.

Next, during the operation of the system, if an event or fault occurs, the client will call the EM with that event or fault. The EM will then reference the Action Table and take the appropriate measures defined in the Action Table. This is discussed in detail below with reference to Figure 3.

Now referring to Figure 2, a block diagram illustrates one of the ways that a client XYZ registers with the EM. First, the client assembles an event/fault table, as shown in block A, which lists in the required level of detail the possible or anticipated events that can occur to the client and its components. This table is discussed in greater detail below with reference to Figure 3. Next, the client XYZ registers with the EM by passing its Client Identification (ID) and a pointer to its Event Table to the EM in step B. The EM, in turn, adds Client XYZ to its Client List, as shown in block C. The Event Manager or EM also establishes a link, D, to Client XYZ's Event Table.

Figure 3 is a set of nested hierarchical blocks of lists illustrating the format and the ordered information content in the Client Event Table 300. In block 300, all the event elements, say 301 to 309, are listed in ascending numerical order for one client. For each element, say 303, the event element block 360 gives a list of attributes or tags (361-366) that identify the components 361, 363, the event description and its severity 364, 362, the recommended correction actions 365, as well as additional actions 366. Each of these tags 361-366 also corresponds to an entire block of ordered lists of choices, attributes and actions, down to the level of detail required to identify the component, the fault and its severity and to take component and system fault remedial actions, as illustrated by the information inside the right-hand side blocks 361-366 of Figure 3. Additional options, choices and list members suitable for or required by a specific application can be added to the lists in blocks 300 and 360-366 by a designer skilled in the art.
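By way of illustration only, the registration sequence of Figure 2 and the event element layout of Figure 3 might be sketched in Python as follows. The class names, field names and sample table contents are hypothetical and are not taken from the disclosure; a dictionary stands in for the Client List, and an object reference stands in for the Client Event Table pointer.

```python
from dataclasses import dataclass, field


@dataclass
class EventElement:
    """One entry of a Client Event Table (compare block 360 of Figure 3)."""
    component_type: str          # kind of component that raised the event
    severity: str                # how serious the event is
    description: str             # human-readable event description
    corrective_action: str       # recommended correction action
    additional_actions: list = field(default_factory=list)


class EventManager:
    """Keeps the Client List and a link to each client's Event Table."""

    def __init__(self):
        self.client_list = {}    # client ID -> that client's Event Table

    def register_client(self, client_id, event_table):
        # Steps B to D of Figure 2: add the client to the Client List and
        # keep a reference (the "pointer") to its Event Table.
        self.client_list[client_id] = event_table


# Block A of Figure 2: the client assembles its own table first.
xyz_table = [
    EventElement("disk", "warning", "Bad block error", "remap block"),
    EventElement("disk", "critical", "Media error", "take device offline",
                 ["notify operator"]),
]

em = EventManager()
em.register_client("XYZ", xyz_table)
```

Because the EM holds a reference rather than a copy, later additions that the client makes to its table are visible to the EM without re-registration, which is one plausible reading of the link D.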

Next, the Event Notification Registration feature allows a client to register itself with the EM in order to be notified after the occurrence of a specified event.

Figure 4 illustrates in detail the Event Notification Registration (ENR) function of the EM. When a client registers an Event Notification request, the EM creates an ENR Element, 450, 136, 142, adds it to its ENR List, 436, and increments the ENRCount. The ENR function, block 436 of Figure 4, stores this information for the given client when the client sends to block 436 an ENR in the format of block 450, 451, etc. of Figure 4. The format of the client ENR, say block 450, includes the Client ID, the Event Code, data on the previous event occurrence, ENRPrevP, data on the next event occurrence, ENRNextP, and a Callback Function List, the contents of which are shown in block 480. For every event received, the EM 100 checks the ENR Link List of Figure 4 for a match. If a match is found, the EM calls the Callback Function 480 that was registered by the client in the EM Control Block 436 so that the client can take the appropriate response.

Figure 5 illustrates the Event Notification process in detail via a flow chart. The flow chart starts with the EM receiving an event. The EM then checks whether there is a next notification entry in the list of block 136 or block 436. If not, it exits. If so, it checks whether the event matches the one stored in that entry. If not, it returns to the start of the event notification flow chart to test the next entry. If so, it calls the Callback functions in the callback functions list 480 and then returns to the beginning of the flow chart to check the next notification entry.
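The notification loop of Figure 5 can be sketched, under stated assumptions, as the following Python fragment. The names NotificationRegistry, register and notify, and the event code strings, are hypothetical. A Python list is substituted for the linked list of Figure 4, so the ENRPrevP and ENRNextP fields are not modeled.

```python
from collections import namedtuple

# One ENR element: who asked, for which event, and what to call.
ENRElement = namedtuple("ENRElement", "client_id event_code callbacks")


class NotificationRegistry:
    def __init__(self):
        self.enr_list = []       # stands in for the ENR List
        self.enr_count = 0       # stands in for the ENRCount

    def register(self, client_id, event_code, callbacks):
        self.enr_list.append(ENRElement(client_id, event_code, list(callbacks)))
        self.enr_count += 1

    def notify(self, event_code):
        # Walk every entry; on a match, call each registered callback.
        called = 0
        for entry in self.enr_list:
            if entry.event_code == event_code:
                for callback in entry.callbacks:
                    callback(event_code)
                    called += 1
        return called            # number of callbacks invoked


seen = []
registry = NotificationRegistry()
registry.register("XYZ", "MEDIA_ERROR", [seen.append])
registry.register("FC_DRIVER", "LIP_RESET", [seen.append])
calls_made = registry.notify("LIP_RESET")
```

In this sketch only the FC_DRIVER entry matches, so a single callback runs and the XYZ entry is left untouched.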

Figure 6 illustrates in detail the Event Threshold Registration (ETR) function of the controller 100, shown as block 138, 142 in Figure 1 and as block 638 in Figure 6. When a client registers an Event Threshold request, the EM creates an ETR element, 650, 138, 142, adds it to its ETR List 638, and increments the ETRCount. The format of the client ETR, say block 650, includes the Client ID, the Event Code, data on the previous event occurrence ETRPrevP, data on the next (current) event occurrence ETRNextP, Occurrence Count, Timestamp, Threshold Type, Threshold Duration, Threshold Event Count, Callback Function, Event Count and Event Code List. The Event Code List in block 650 is further delineated into a Threshold Event List, block 680, that tags the threshold events. Each threshold event in block 680 is further delineated into a Threshold Element List, block 690, containing information on the Element Type, Event Number, Client ID, Severity Level, Component Type and Component ID.

The ETR feature of EM 100 allows clients to register event(s) with the EM, so that the EM will notify the client if the threshold parameters are met for the specified event(s). This allows a client to monitor the activity in the system. For example, for failure analysis, a client can request to be notified when 5 "Media error" events occur within 2 seconds; when this happens, the client can decide what to do with the device. There are two types of threshold, ordered and not-ordered. For every event received by the EM, the EM checks the ETR Link List for a match. If a match is found, the occurrence counter is incremented. If the interval between the timestamp and the current time is greater than the duration, then the timestamp is reset. Otherwise, if the counter is equal to the event count, the EM calls the callback. Otherwise, the timestamp is also reset if the counter is 1 (the first occurrence). The difference between ordered and not-ordered thresholds is that, in an ordered threshold, the events must occur in sequential order (i.e., event 0 must occur first, then event 1 second, and so on), as shown in Figure 6, block 680, whereas in a not-ordered threshold the events can occur in any order or combination.

Example 1: User, the client, sets the trigger parameter as:

"Notify the user via SNMP Trap if 3 Bad Block Errors occur within 10 seconds time interval from Storage Device 0".

In this example, the EM will monitor all Bad Block errors generated by Device 0, log the time the errors occurred and check whether 3 errors occurred within the 10-second time interval. If so, it will notify the user by sending an SNMP Trap to the Management station.

Example 2: Fibre Channel Driver, the client, sets the trigger parameter as:

"Call the fcdInit(port1) function if 5 LIP Resets are detected by the FC Driver within 15 seconds time interval from Fibre Channel Port 1".

In this example, the EM will monitor all LIP Resets detected on Fibre Channel Port 1, log the time the errors occurred and check whether 5 errors occurred within the 15-second time interval. If so, it will call the function fcdInit(port1).

The function emETR() returns a unique ID which can be used to de-register the ETR.

Figure 7 is a flowchart of steps in a method for checking whether the Threshold parameters are met for the specified event(s). This allows a client to monitor the activity in the system. Starting from step 700, where all events are received in the EM, a Threshold event list is created as shown in Figure 6 and placed in step 702 of Figure 7. The event thresholding program, in step 710, initiates or continues the evaluation of threshold entries. If there are no more threshold entries in the list, the program exits its evaluation process. If there is an additional threshold entry in the list, the program proceeds in step 720 to compare the time elapsed since the entry's Timestamp against the entry's Threshold Duration. If the elapsed time is greater than the Threshold Duration, then in step 722 the program resets the Timestamp, resets the Counter and proceeds to step 730. If the elapsed time is less than or equal to the Threshold Duration, the program proceeds directly to step 730, where the received event is compared to the entry's registered event for a match. If it does not match, the program returns to the initial step 710, where it looks for a next entry to evaluate. If it does match in step 730, the program increments the Counter, proceeds to step 740 and checks whether the Counter is equal to one (1). If yes, it proceeds to step 742, where it resets the Timestamp, and then returns to the initializing step 710, where it calls for a next entry to be tested. If the answer is No in step 740, it proceeds to step 750, where it checks whether the Counter value is greater than or equal to the Threshold Event Count. If the answer is No, it returns to step 710 to initiate testing of a next entry. If the answer is Yes, it continues to step 760, where it calls the Callback function, resets the Timestamp, resets the Counter and returns to the initializing step 710 to evaluate the next entry.
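As a minimal sketch only, the Figure 7 steps might be expressed in Python as shown below, using the parameters of Example 1 (3 Bad Block errors within 10 seconds). The names ThresholdEntry and check_thresholds, the event code string and the use of plain floating-point seconds for time are assumptions made for illustration. "Resetting the timestamp" is taken to mean setting it to the current time.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ThresholdEntry:
    event_code: str
    duration: float              # Threshold Duration, in seconds
    event_count: int             # Threshold Event Count
    callback: Callable[[str], None]
    timestamp: float = 0.0       # start of the current counting window
    counter: int = 0             # Occurrence Count


def check_thresholds(entries, event_code, now):
    for entry in entries:                           # step 710
        if now - entry.timestamp > entry.duration:  # step 720
            entry.timestamp = now                   # step 722
            entry.counter = 0
        if entry.event_code != event_code:          # step 730, no match
            continue
        entry.counter += 1
        if entry.counter == 1:                      # step 740
            entry.timestamp = now                   # step 742
        elif entry.counter >= entry.event_count:    # step 750
            entry.callback(event_code)              # step 760
            entry.timestamp = now
            entry.counter = 0


fired = []
entries = [ThresholdEntry("BAD_BLOCK", 10.0, 3, fired.append)]
for t in (100.0, 103.0, 120.0, 124.0, 127.0):
    check_thresholds(entries, "BAD_BLOCK", now=t)
```

In this run the errors at 100 and 103 seconds do not reach the count before the window expires, so the counter restarts at 120 seconds and the callback fires once, on the error at 127 seconds.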

Figure 8 addresses the case in which the threshold event list is ordered, as shown in block 680 of Figure 6. The only difference between FIGS. 7 and 8 is the insertion of steps 831 and 833 between steps 830 and 832. The yes option of step 830 leads to a new step 831, where the matched index is compared to the counter. If they are not equal, then the timestamp and the counter are reset and the program returns to the initializing step 810 to evaluate the next entry. If they are equal (ordered event), then the remaining steps are identical to the corresponding ones of Figure 7.
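The ordered variant can be sketched in the same hedged manner. The fragment below assumes that the event codes in the Threshold Event List are distinct, that the threshold is met once every listed event has occurred in sequence, and that the matched index is compared with the counter before the counter is incremented. None of these points is stated explicitly in the text, and the names OrderedEntry and check_ordered and the event code strings are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class OrderedEntry:
    event_codes: List[str]       # Threshold Event List, in required order
    duration: float              # Threshold Duration, in seconds
    callback: Callable[[List[str]], None]
    timestamp: float = 0.0
    counter: int = 0


def check_ordered(entry, event_code, now):
    if now - entry.timestamp > entry.duration:      # window expired
        entry.timestamp, entry.counter = now, 0
    if event_code not in entry.event_codes:         # step 830, no match
        return
    matched_index = entry.event_codes.index(event_code)
    if matched_index != entry.counter:              # step 831, out of order
        entry.timestamp, entry.counter = now, 0
        return
    entry.counter += 1
    if entry.counter == 1:                          # first event of sequence
        entry.timestamp = now
    elif entry.counter >= len(entry.event_codes):   # whole sequence seen
        entry.callback(entry.event_codes)
        entry.timestamp, entry.counter = now, 0


fired = []
entry = OrderedEntry(["LINK_DOWN", "LIP_RESET", "LINK_UP"], 15.0, fired.append)
sequence = [(1.0, "LIP_RESET"), (2.0, "LINK_DOWN"),
            (3.0, "LIP_RESET"), (4.0, "LINK_UP")]
for t, code in sequence:
    check_ordered(entry, code, now=t)
```

The first LIP_RESET arrives out of order and resets the entry; the three events that follow arrive in the listed order within the window, so the callback is invoked once.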

Figure 9 illustrates in detail how a client reports an event 900 to the EM. The client calls emReportEvent() 910 in the EM with the following parameters: client ID 920, event number 930, component ID 940 and software context 950. The software context block 950 contains File Name, Line Number and Version Number. The remaining blocks 960, 970, 980, 991, 992, 993, 994, 995 and 996 are identical in format to those in FIGS. 2 and 3. When the EM receives an Event Reporting request, it indexes into the Client Table using the Client ID and finds the Client Event Table. Then, using the Event Number, the EM indexes into the Client Event Table and gets the Event Element of FIGS. 4 and 6.
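The two-level lookup performed on an Event Reporting request might look like the following Python sketch. The function name em_report_event, the shape of the returned record, and the sample file name, line number, version and table contents are all hypothetical; only the four parameters and the contents of the software context come from the description of Figure 9.

```python
from collections import namedtuple

# Software context block 950: File Name, Line Number, Version Number.
SoftwareContext = namedtuple("SoftwareContext", "file_name line_number version")


def em_report_event(client_table, client_id, event_number, component_id, context):
    event_table = client_table[client_id]       # index by Client ID
    element = event_table[event_number]         # index by Event Number
    return {
        "client_id": client_id,
        "component_id": component_id,
        "event": element,
        "reported_from": f"{context.file_name}:{context.line_number}",
        "version": context.version,
    }


# Hypothetical Client Table holding one client's Event Table.
client_table = {"FC_DRIVER": ["Loop Down", "Loop Up", "LIP Reset"]}
ctx = SoftwareContext("fcd.c", 412, "1.0")
report = em_report_event(client_table, "FC_DRIVER", 1, "port1", ctx)
```

Here event number 1 of the FC_DRIVER table resolves to the "Loop Up" element, and the software context is carried along so that the report records where in the client code the event was raised.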

Figure 10 illustrates an event reporting example from the FC driver: the AL Loop Up Event. Block 1000 identifies the event from the event element. Block 1010 identifies the event element. Block 1020 identifies the relevant Correction Description Table. Block 1030 identifies the two actions that are enabled on this event as specified in the first two elements.

Claims

What is claimed is:

1. A system for early fault detection in a computer network, comprising: an event manager; a client list; a client event table; and an event notification registration.

2. The system of claim 1, further including an event threshold registration.

3. The system of claim 2, further including event logging and notification.

4. The system of claim 3, further including a list of actions to be taken.