WO1988001412A1 - Method and apparatus for circuit simulation using parallel processors including memory arrangement and matrix decomposition synchronization - Google Patents

Method and apparatus for circuit simulation using parallel processors including memory arrangement and matrix decomposition synchronization Download PDF

Info

Publication number
WO1988001412A1
WO1988001412A1 PCT/US1987/002079 US8702079W WO8801412A1 WO 1988001412 A1 WO1988001412 A1 WO 1988001412A1 US 8702079 W US8702079 W US 8702079W WO 8801412 A1 WO8801412 A1 WO 8801412A1
Authority
WO
WIPO (PCT)
Prior art keywords
matrix
circuit
entry
cpus
memory
Prior art date
Application number
PCT/US1987/002079
Other languages
French (fr)
Inventor
Gabriel Bischoff
Steven Greenberg
Original Assignee
Digital Equipment Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Digital Equipment Corporation filed Critical Digital Equipment Corporation
Publication of WO1988001412A1 publication Critical patent/WO1988001412A1/en

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/52Program synchronisation; Mutual exclusion, e.g. by means of semaphores
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/11Complex mathematical operations for solving equations, e.g. nonlinear equations, general mathematical optimization problems
    • G06F17/12Simultaneous equations, e.g. systems of linear equations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/16Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00Computer-aided design [CAD]
    • G06F30/30Circuit design
    • G06F30/36Circuit design at the analogue level
    • G06F30/367Design verification, e.g. using simulation, simulation program with integrated circuit emphasis [SPICE], direct methods or relaxation methods

Definitions

  • the field of the present invention relates to computer assisted circuit simulation employing parallel processing.
  • more than one processor may write information into a memory location corresponding to a particular node, at the same time another processor is attempting to read previously stored information related to that node or write new information relating to that node, based on its computations with respect to a different circuit element attached to the node.
  • the solution in the past has been to require special locking mechanisms within the programs controlling the multiple processors in order to prevent- such occurrences. This in effect lengthens the total CPU processing time required, in that much time is spent waiting for the access t the particular memory location to be opened to another of the processing units operating in parallel, which desires to read or write information into the particular memory location.
  • the proposed solution in the past has been to block write access to the entire matrix when one CPU is writing to the matrix.
  • Parallelization employing such simulation programs as SPICE uses a list of circuit components and parameters, which may be constants or variables, for example, in the case of MOSFET transistors geometric parameters, currents, charges, capacitances, etc., which are then used to calculate the individual components or terms, the sum of which equals the value for each entry at a position in a matrix.
  • the matrix is used to solve a number of simultaneous equations for the variables, for example, voltages, at each node within the circuit.
  • the matrix typically consists of an X by X matrix with X approximately equal to the number of nodes in the circuit.
  • the parallelization techniques employed in the past suffer from the disability that the summation of the terms, in order to define the value at the matrix location for a given node, may be influenced by more than one circuit element, and the parallel processors may be computing the effect of two or more circuit elements on the same matrix node location at the same time. This results in the need for the use of some technique, for example, interlocks, to prevent simultaneous writing or reading of data by more than one processor into or from the memory location corresponding to the matrix location.
  • a feature present invention relates to providing memory locations corresponding both to the matrix location and to a specific one of a plurality of terms which determine the ultimate value for the entry at the matrix, and summing the terms from the plurality of memory locations corresponding to a matrix entry to define a single value for the matrix entry.
  • a further feature of the present invention relates to synchronizing the parallel processors during matrix decomposition by assigning a single processor to a given row within the matrix and setting a flag when the values in a row extending from a diagonal matrix element are ready for use.
  • Fig. 1 is a schematic representation of the prior art . SPICE program and circuit simulation apparatus implementing that program;
  • Fig. 2 is a schematic representation of a method and apparatus according to the present invention for generating the matrix for solution of simultaneous equations necessary for circuit simulation.
  • the parallelization of circuit simulation for example, using SPICE, uses a list of circuit elements contained within the circuit, and a matrix,, with the matrix locations corresponding generally to the circuit nodes.
  • SUBSTITUTESHEET the number of circuit elements connected to a node, and/or th type of circuit elements connected to a node, a plurality of different terms can contribute to the ultimate value for the matrix location. The sum of these values is used in solving the simultaneous equations, in order to solve for the necessary variables to conduct the circuit simulation.
  • SPICE program the use of parallel processors has been employed to analyze in parallel, different circuit elements and their contributions to the matrix values.
  • the parallel processing of different circuit elements, each of which may contribute to different ones of the terms for the same matrix location entry is not allowable in the past methods of parallel processing.
  • each of the parallel processors must wait its turn to add its contributing term to sum the value for each matrix location entry.
  • the list of circuit elements contain the parameters, which determine the term of terms contained i the matrix location for a given matrix entry, and a pointer, which points to the memory location for the given matrix entry.
  • the data is represented in a double precision value requiring eight bytes of data for each memory location.
  • the present invention employs a list of circuit elements, as shown in Fig. 2, essentially identical to the list employed by circuit simulators using SPICE.
  • SUBSTITUTE SHEl elements contains a plurality of pointers, each of which point to a memory location within an array of memory locations associated with a given matrix entry. There is, therefore, a memory location assigned to each of the plurality of circuit elements which affects the sum of the terms for a given matrix entry.
  • the input from one of the plurality of parallel processors operating on the calculation of the term for a given circuit element, which contributes to the total value for the given matrix entry will be stored in a designated memory location, and the input from another processor operating in parallel on the calculation of the contribution to the total value from a different circuit element will have a different designated memory location.
  • the different memory locations correspond to the particular matrix entry by being included within an array of memory locations corresponding to the given matrix entry, but each corresponds to a contribution from a different circuit element.
  • each matrix entry having a plurality of memory locations associated with it, each corresponding to a different circuit element, the influence of which determines a portion of the total value for each entry, which is employed in solving the simultaneous equations represented by the matrix.
  • Utilization of the resent invention greatly speeds up the loading of the matrix entries for circuit simulation in order to solve the simultaneous equations.
  • the loading of the matrix entries occupies approximately 80% of the total computing time necessary for circuit simulation.
  • the utilization of the present invention including the improvement in the ability to parallelize the matrix loading
  • the present invention can be implemented on VAX8800 and 8300 computers manufactured by Digital Equipment Corporation which contain dual processors having a shared memory.
  • the parallelization program employing SPICE is divided into seria phases. Each phase is divided into tasks which are executed serially or concurrently. A phase whose task in executed serially is called a single stream phase whereas a phase whos tasks are executed concurrently is called a multiple stream phase.
  • Single stream phases are executed by a master processor and the multiple stream phases are executed by slav processors.
  • the slave processors are idle when the master is active and the master processor is idle when any-of the slave processors is active.
  • Processes are dynamically assigned to processors by the VAX VMS operating system.
  • the VAX VMS operating system also provides event flags (semaphores) used for synchronization between processors.
  • a general FORTRAN library has been designed for this environment.
  • a program using this library has the following structure:
  • This task is performed by the logical function MASTER_PROCESS. It consists of discriminating between a MASTER and SLAVE process, setting up the section of memory shared between the MASTER and SLAVE process, initializing the event flags used for synchronization in creating the slave process. This logical is set to TRUE if the process for executing the code is the master process and set to FALSE if it is the slave process.
  • This task is performed by four sub ⁇ routines. FORK, JOIN, JOIN_EXIT and JOIN_FORK.
  • the sub ⁇ routine FORK is called by the master process in order to signal the slave process to proceed and then to wait for them to signal back.
  • the sub-routine JOIN is called by the slave process in order to signal the master process to proceed and then to wait for it to call back.
  • the sub-routine JOIN_EXIT is called by the slave process in order to signal the master process to proceed and then to exit.
  • the sub-routine JOIN_FORK is called by the slave process in order to synchronize two multiple stream phases with no intervening single stream phase.
  • Interlock Memory Access This task is performed by two sub-routines locked and unlocked.
  • the variable used in the argument of sub-routines holds the state of the lock for the associated section of shared memory whose access needs to be interlocked.
  • the sub-routine LOCK and UNLOCK use the atomic test and set instruction, BBSSI and BBCCI, of the VAX
  • the interlocks are necessary in certain sub-routines not related to the present invention, for example, synchronizatio routines and load balancing routines.
  • the tasks performed in parallel by the circuit simulatio apparatus and method according to the present invention are matrix load, matrix LU decomposition and time step computation.
  • the matrix load phase for each circuit element includes evaluation and loading of the element's contribution to the matrix entry and the right-hand side of the simultaneous equations, as well as convergence checking of the branched currents.
  • Elements evaluation and convergence checking have no write contention in shared • memory.
  • the loading of the matrix and the right-han side requires writing to memory locations which may be shared by several elements. Synchronization of write access to the matrix through a single lock on the whole matrix has been proposed by Jacob, et al., supra. In such a case only one processor can write into memory for the entire matrix at a given time. This leads to contention for shared resources and decreases efficiency.
  • the number of circuit elements attached to a given node is used to create an array of memory locations associated with the node and the list of circuit elements is augmented by the incorporation of a pointer which directs the storage in a specific one of the array of memory locations of the term indicating the contribution of the particular circuit element to the total entry for the matrix location.
  • the memory structure so generated may be visualized as a three dimensional memory structure, with two dimensions defining the matrix locations and the third dimension defining the number of circuit elements affecting each matrix location. There is no unused memory in this structure since it has a variable depth in the third dimension.
  • the contents of the arrays are summed, again using parallel processing, without any contention for access to shared memory, in order to generate a single matrix for the computation of the solutions to the simultaneous equations for the circuit simulation.
  • Dynamic task allocation is achieved through an array of tasks whose number
  • SUBSTITUTESHEET exceeds the number of processors.
  • a task consists of a list of circuit elements to be loaded. Tasks are defined so that each requires approximately the same amount of work. The amount of work needed to load a circuit element is roughly estimated by neglecting bypass and evaluating the CPU time needed to load the given element. Dynamic task allocation minimizes any imbalance which may occur during simulation through device model computation bypass. Task allocation during the summing phase of the matrix load operation is done statically, since the work needed to perform this phase can b divided into tasks requiring roughly the same amount of CPU time. The only interlocked access to shared memory during th matrix load phase is the one required on the array index, which defines the next task when dynamic task allocation is employed. This index is successively read and incremented by all slave processors.
  • the present invention employs LU matrix decomposition . mathematically similar to that performed by SPICE.
  • the LU decomposition requires pivoting, only a single master processor is used. This is done twice during transient analysis, for full pivoting, once for the first LU decomposition performed in the DC operating point computation and second for the first LU decomposition of the transient wave form computation.
  • pivotin is used only relatively infrequently when a diagonal element used as a pivot is less than a predetermined threshold.
  • the present invention relates to synchronizing the parallel processors so that they will perform decomposition operations only on valid data. This is accomplished by assigning each slave processor a set of rows
  • each processor updates the rows in its assigned set.
  • the usual linked list of matrix entries below the diagonal is, according to the present invention, broken down into separate lists based upon the rows assigned to each processor.
  • a flag is associated with each diagonal element to ensure that it is never used before its final value is available. Use of this flag results in an efficient synchronization scheme.
  • the flags are shared data accessed by multiple processors in the read mode, but only one in the write mode, so that no locks are needed.
  • An heuristic algorithm assigns rows to the slave processors.
  • the first step of this algorithm is to divide the circuit matrix into blocks of consecutive columns such that the slave processors can work within blocks without synchronization.
  • the blocks are found by scanning the set of matrix columns from left to right and assigning them to blocks so that within a block no dependency among diagonal elements exists when performing LU decomposition.
  • each block rows containing nonzero subdiagonal elements are assigned to slave processors by determining the number of updates necessary to complete a row and dividing the amount of work assigned-to the slave processors during the LU decomposition so that it is balanced among them. For row assignment, the blocks are processed from right to left.
  • Each slave processor is assigned a set of energ storage elements and computes the minimum time step required by its set of energy storage elements.
  • the master processor then computes the minimum time step among the time steps returned to it by the slave processors.
  • the energy storage elements are assigned to the slave processors so that the wor among them is balanced.

Abstract

A system for simulating LSI and VLSI circuits generates a matrix for solution of simultaneous equations necessary for the circuit (G1, G2) simulation. It synchronizes parallel processors during matrix decomposition by assigning a single processor to given row (i, J, K) within the matrix and sets a flag when the values in a row extending from a diagonal matrix element are ready for use.

Description

METHOD AND APPARATUS FOR CIRCUIT SIMULATION USING PARALLEL PROCESSORS INCLUDING MEMORY ARRANGEMENT AND MATRIX DECOMPOSITION SYNCHRONIZATION
FIELD OF THE INVENTION
The field of the present invention relates to computer assisted circuit simulation employing parallel processing.
BACKGROUND AND SUMMARY OF THE INVENTION
In the field of LSI and VLSI circuits, increased circuit complexity and increased density of the devices on a single chip have resulted in an ever increasing need for computer aided circuit simulation, which is capable of carrying out th necessary circuit simulation in a reasonable period of time. Computer programs such as SPICE and ASTAP have been employed for such circuit simulation. However, these are proven to be less and less adequate as the circuits become more dense and more complex on a single chip.
A variety of algorithms for the efficient and accurate, simulation of complex circuits have been proposed for example, Nagle, "SPICE2. A Computer Program to Simulate Semiconductor Circuits," University of California, Berkeley, Memo No. ERL- M520, May 1985; Weeks, "Algorithms for ASTAP-A Network Analysis Program," IEEE Transactions on Circuit Theory, Volum CT-20, pp. 628-634, November 1983; Saleh, Kleckner and Newton, Iterated Timing Analysis and SPICE1," Proceedings of the IEEE International Conference on computer Aided Design, Santa Clara, CA. , pp. 139-140, September 1983; White and Sagiovanni Vincentelli, "Relax 2.1-A Waveform Relaxation Based Circuit Simulation Program," Proceedings, International Custom Integrated Circuits Conference, Rochester, NY, pp. 232-236, June 1984. Salkallah and Director, "An Activity Directed Circuit Simulation Algorithm," Proceedings IEEE Conference on
SUBSTiTUTE SHEET Circuits and Computers, pp. 1032-1035, October 1980.
More recently the implementation of some of these algorithms has been suggested in a multi-processing environment. Circuit simulation algorithms based on iterative methods, such as Gauss-Jacobi, however, while easy to parallelize, usually have slow convergence rates unless some of the parallelization utility is sacrificed. Iterative methods, moreover, are not appropriate for tightly coupled circuits. The RELAX simulator lumps together tightly coupled nodes and solves the sub-networks using a direct method. Such partitioning algorithms also have to deal with conflicting constraints when convergence properties, processor number and load balancing among processors need to be taken into account.
It has also been suggested in the past, that parallel processing using a plurality of computers, or computers having a plurality of central processing units which operate in parallel with each other, and closely coupling th 'plurality of processors to a shared memory, is a solution to the increasingly complex problem of circuit simulation. One such solution is proposed in Jacob, Newton, and Pederson, "Direct- Method Circuit Simulation Using Multiprocessors," Proceedings of the International Symposium on Circuits and Systems, May 1986. Another such solution is proposed in White, "Parallelizing Circuit Simulation - A Combined Algorithmic and Specialized Hardware Approach, Proceedings of the ICCD, Rye, New York, 1985.
Each of these references, however, recognizes a significant drawback to the utilization of parallel processing for circuit simulation. Since the processors are sharing the same memory, and the processors are in parallel computing values for different circuit elements throughout the LSI or VLSI circuit, or some defined sub-portion of such circuit, th processors will quite often be solving for values at the same node within the circuit at the same time. Since the processing algorithms also typically involve iterative processes, which update values for. a given node based on computations relating to circuit elements connected to the node, and a memory location within the computer is used to store the values at any given time for a particular node, the problem arises that the access to the shared memory must be strictly controlled. Otherwise, more than one processor may write information into a memory location corresponding to a particular node, at the same time another processor is attempting to read previously stored information related to that node or write new information relating to that node, based on its computations with respect to a different circuit element attached to the node.
The solution in the past has been to require special locking mechanisms within the programs controlling the multiple processors in order to prevent- such occurrences. This in effect lengthens the total CPU processing time required, in that much time is spent waiting for the access t the particular memory location to be opened to another of the processing units operating in parallel, which desires to read or write information into the particular memory location. In fact, the proposed solution in the past has been to block write access to the entire matrix when one CPU is writing to the matrix.
The Jacob et al. paper, noted above, investigates the parallel implementation of a direct method circuit simulator using SPLICE, and using up to eight processors of a shared memory machine. An efficiently close to forty-five percent
SUBS"ϊ 1 TUTE SHEET was reported. Much of the failure to achieve a higher percentage efficiency is the result of synchronization of the multiple processors in their access to common shared memory locations.
Parallelization employing such simulation programs as SPICE uses a list of circuit components and parameters, which may be constants or variables, for example, in the case of MOSFET transistors geometric parameters, currents, charges, capacitances, etc., which are then used to calculate the individual components or terms, the sum of which equals the value for each entry at a position in a matrix. The matrix is used to solve a number of simultaneous equations for the variables, for example, voltages, at each node within the circuit. The matrix typically consists of an X by X matrix with X approximately equal to the number of nodes in the circuit. Since the circuit elements are typically connected to at least two nodes, and may often be connected to more than two nodes, the parallelization techniques employed in the past suffer from the disability that the summation of the terms, in order to define the value at the matrix location for a given node, may be influenced by more than one circuit element, and the parallel processors may be computing the effect of two or more circuit elements on the same matrix node location at the same time. This results in the need for the use of some technique, for example, interlocks, to prevent simultaneous writing or reading of data by more than one processor into or from the memory location corresponding to the matrix location.
Once the matrix is loaded, it is known in the art, using SPICE, to solve the matrix using sparse matrix LU decomposition. Paralleliz.ation of the matrix solution phase •is also important, and presents unique problems. For larger
SUBSTITUTESHEET circuits, the CPU time needed for the matrix solution phase will dominate over that needed for the matrix load phase. Efficient parallelization schemes are known for full matrices as is reported in Thomas, "Using the Butterfly to Solve Simultaneous Linear Equations", BBN Laboratories Memorandum, March 1985. However, sparse matrices are more difficult to decompose efficiently in parallel. The LU decomposition algorithm has a sequential dependency and the amount of concurrent work which can be done at each step, using SPICE, in a sparse matrix is small. Algorithms detecting the maximu parallelism at each step have been proposed for vectorized circuit simulation. Yama oto and Takahashi, "Vectorized LU Decomposition Algorithms for Large Scale Circuit Simulation", IEEE Transactions on computer Aided Design, Vol. Cad-4, No. 3 pp. 232-239', July 1985. Algorithms based upon a pivot dependency graph and task queues have been proposed. Jacob e al., supra. The overhead associated with task queues makes, the efficiency of these algorithms questionable.
Recognizing the need for an improved circuit simulation apparatus and method, it is a general object of the present invention to provide a circuit simulation apparatus and metho for simulating LSI and VLSI circuits which eliminates the costly synchronization requirement of the prior art and, in addition, more efficiently implements the LU decomposition in parallel using multiple processors.
A feature present invention relates to providing memory locations corresponding both to the matrix location and to a specific one of a plurality of terms which determine the ultimate value for the entry at the matrix, and summing the terms from the plurality of memory locations corresponding to a matrix entry to define a single value for the matrix entry.
SUBSTITUTESHEET A further feature of the present invention relates to synchronizing the parallel processors during matrix decomposition by assigning a single processor to a given row within the matrix and setting a flag when the values in a row extending from a diagonal matrix element are ready for use.
The features of the present invention discussed above are not intended to be all-inclusive but rather are representative of the features of the present invention in order to enable those skilled in the art to better appreciate the advance made over the art and to understand the invention. Other features will become more apparent upon reference to the detailed description of the preferred embodiment which is contained below and with reference to the figures of the drawing.
BRIEF DESCRIPTION OF THE DRAWINGS
Fig. 1 is a schematic representation of the prior art . SPICE program and circuit simulation apparatus implementing that program;
Fig. 2 is a schematic representation of a method and apparatus according to the present invention for generating the matrix for solution of simultaneous equations necessary for circuit simulation.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT
In the past, as shown in Fig. 1, representing the prior art, the parallelization of circuit simulation, for example, using SPICE, uses a list of circuit elements contained within the circuit, and a matrix,, with the matrix locations corresponding generally to the circuit nodes. Depending on
SUBSTITUTESHEET the number of circuit elements connected to a node, and/or th type of circuit elements connected to a node, a plurality of different terms can contribute to the ultimate value for the matrix location. The sum of these values is used in solving the simultaneous equations, in order to solve for the necessary variables to conduct the circuit simulation. With the SPICE program the use of parallel processors has been employed to analyze in parallel, different circuit elements and their contributions to the matrix values. However, the parallel processing of different circuit elements, each of which may contribute to different ones of the terms for the same matrix location entry, is not allowable in the past methods of parallel processing. This is so since, in the past, data representing the most recent value for the summation of all of the terms for a given matrix location has been stored in a single memory location, and "summed with the values calculated for the next circuit elements influencing, the value of the sum of the terms for the given matrix location. Thus, in effect, each of the parallel processors must wait its turn to add its contributing term to sum the value for each matrix location entry.
As shown in Fig. 1, the list of circuit elements contain the parameters, which determine the term of terms contained i the matrix location for a given matrix entry, and a pointer, which points to the memory location for the given matrix entry. In actuality, using, e.g., the VAX11/780, the data is represented in a double precision value requiring eight bytes of data for each memory location.
The present invention employs a list of circuit elements, as shown in Fig. 2, essentially identical to the list employed by circuit simulators using SPICE. The list of circuit
SUBSTITUTE SHEl elements contains a plurality of pointers, each of which point to a memory location within an array of memory locations associated with a given matrix entry. There is, therefore, a memory location assigned to each of the plurality of circuit elements which affects the sum of the terms for a given matrix entry. The input from one of the plurality of parallel processors operating on the calculation of the term for a given circuit element, which contributes to the total value for the given matrix entry will be stored in a designated memory location, and the input from another processor operating in parallel on the calculation of the contribution to the total value from a different circuit element will have a different designated memory location. The different memory locations correspond to the particular matrix entry by being included within an array of memory locations corresponding to the given matrix entry, but each corresponds to a contribution from a different circuit element.
In effect, a three dimensional matrix of memory locations is established, with each matrix entry having a plurality of memory locations associated with it, each corresponding to a different circuit element, the influence of which determines a portion of the total value for each entry, which is employed in solving the simultaneous equations represented by the matrix.
Utilization of the resent invention greatly speeds up the loading of the matrix entries for circuit simulation in order to solve the simultaneous equations. At present, the loading of the matrix entries, occupies approximately 80% of the total computing time necessary for circuit simulation. With the utilization of the present invention, including the improvement in the ability to parallelize the matrix loading
SUBSTITUTESHEET portion of the operation, the efficiency of the circuit simulator's utilization of the plurality of parallel processors has been elevated to approximately 90%, as oppose to the 45% reported in the Jacob et al. paper noted above.
The present invention can be implemented on VAX8800 and 8300 computers manufactured by Digital Equipment Corporation which contain dual processors having a shared memory. In addition to defining the memory noted above, the parallelization program employing SPICE is divided into seria phases. Each phase is divided into tasks which are executed serially or concurrently. A phase whose task in executed serially is called a single stream phase whereas a phase whos tasks are executed concurrently is called a multiple stream phase. Single stream phases are executed by a master processor and the multiple stream phases are executed by slav processors. The slave processors are idle when the master is active and the master processor is idle when any-of the slave processors is active. Processes are dynamically assigned to processors by the VAX VMS operating system. The VAX VMS operating system also provides event flags (semaphores) used for synchronization between processors.
A general FORTRAN library has been designed for this environment. A program using this library has the following structure:
COMMON/LOCAL/LOCAL__VARIABLES COMMON/SHARED/SHARED_VARIABLES IF MASTER_PROCESS THEN
CALL MASTER_CODE ELSE
CALL SLAVE_CODE ENDIF END Master and Slave processes run the exact same executable file. In the above code MASTER_PROCESS is a logical function from the library. This library has presently seven entries and performs the following functions:
Initialization. This task is performed by the logical function MASTER_PROCESS. It consists of discriminating between a MASTER and SLAVE process, setting up the section of memory shared between the MASTER and SLAVE process, initializing the event flags used for synchronization in creating the slave process. This logical is set to TRUE if the process for executing the code is the master process and set to FALSE if it is the slave process.
Synchronization. This task is performed by four sub¬ routines. FORK, JOIN, JOIN_EXIT and JOIN_FORK. The sub¬ routine FORK is called by the master process in order to signal the slave process to proceed and then to wait for them to signal back. The sub-routine JOIN is called by the slave process in order to signal the master process to proceed and then to wait for it to call back. The sub-routine JOIN_EXIT is called by the slave process in order to signal the master process to proceed and then to exit. The sub-routine JOIN_FORK is called by the slave process in order to synchronize two multiple stream phases with no intervening single stream phase.
Interlock Memory Access. This task is performed by two sub-routines locked and unlocked. The variable used in the argument of sub-routines holds the state of the lock for the associated section of shared memory whose access needs to be interlocked. The sub-routine LOCK and UNLOCK use the atomic test and set instruction, BBSSI and BBCCI, of the VAX
SUBSTITUTE SHEET architecture.
The interlocks are necessary in certain sub-routines not related to the present invention, for example, synchronizatio routines and load balancing routines.
The tasks performed in parallel by the circuit simulatio apparatus and method according to the present invention are matrix load, matrix LU decomposition and time step computation.
Those skilled in the art will appreciate that by employing the aspects of the present invention just described a number of advantages are achieved over the existing art. I SPICE, as it is known in the art, the matrix load phase for each circuit element includes evaluation and loading of the element's contribution to the matrix entry and the right-hand side of the simultaneous equations, as well as convergence checking of the branched currents. Elements evaluation and convergence checking have no write contention in shared memory. However, the loading of the matrix and the right-han side requires writing to memory locations which may be shared by several elements. Synchronization of write access to the matrix through a single lock on the whole matrix has been proposed by Jacob, et al., supra. In such a case only one processor can write into memory for the entire matrix at a given time. This leads to contention for shared resources and decreases efficiency.
In the present invention locking the entire matrix, or even some portion thereof, is avoided by the utilization of a memory structure which accommodates the storage of the individual contributions to a given matrix location entry from each of the circuit elements affecting that matrix location entry. The same list of circuit elements is used, as had been
SUBSTITUTESHEET done in the past with SPICE. However, according to the present invention, the number of circuit elements attached to a given node is used to create an array of memory locations associated with the node and the list of circuit elements is augmented by the incorporation of a pointer which directs the storage in a specific one of the array of memory locations of the term indicating the contribution of the particular circuit element to the total entry for the matrix location.
The memory structure so generated may be visualized as a three dimensional memory structure, with two dimensions defining the matrix locations and the third dimension defining the number of circuit elements affecting each matrix location. There is no unused memory in this structure since it has a variable depth in the third dimension.
After the list of circuit elements is processed, and the memory locations within each array of memory locations thereby loaded with the terms representing contributions of the individual circuit elements to the total entry for the matrix location, the contents of the arrays are summed, again using parallel processing, without any contention for access to shared memory, in order to generate a single matrix for the computation of the solutions to the simultaneous equations for the circuit simulation.
Even distribution of tasks among the processors operating in parallel is achieved- so that no slave process stays idle while others are computing. A dynamic task allocation is used for the multiple stream phase of matrix load. Since the time needed to load the factor contribution of each given circuit element cannot be estimated exactly, dynamic task allocation is used in lieu of static task allocation. Dynamic task allocation is achieved through an array of tasks whose number
SUBSTITUTESHEET exceeds the number of processors. A task consists of a list of circuit elements to be loaded. Tasks are defined so that each requires approximately the same amount of work. The amount of work needed to load a circuit element is roughly estimated by neglecting bypass and evaluating the CPU time needed to load the given element. Dynamic task allocation minimizes any imbalance which may occur during simulation through device model computation bypass. Task allocation during the summing phase of the matrix load operation is done statically, since the work needed to perform this phase can b divided into tasks requiring roughly the same amount of CPU time. The only interlocked access to shared memory during th matrix load phase is the one required on the array index, which defines the next task when dynamic task allocation is employed. This index is successively read and incremented by all slave processors.
The present invention employs LU matrix decomposition . mathematically similar to that performed by SPICE. When the LU decomposition requires pivoting, only a single master processor is used. This is done twice during transient analysis, for full pivoting, once for the first LU decomposition performed in the DC operating point computation and second for the first LU decomposition of the transient wave form computation. For all other decompositions, pivotin is used only relatively infrequently when a diagonal element used as a pivot is less than a predetermined threshold.
During the time the parallel processing is being used fo the LU decomposition, the present invention relates to synchronizing the parallel processors so that they will perform decomposition operations only on valid data. This is accomplished by assigning each slave processor a set of rows
SUBSTITUTE SHEET of the circuit matrix. When the SPICE decomposition algorithm progresses to a_particular diagonal element, each processor updates the rows in its assigned set. The usual linked list of matrix entries below the diagonal is, according to the present invention, broken down into separate lists based upon the rows assigned to each processor. A flag is associated with each diagonal element to ensure that it is never used before its final value is available. Use of this flag results in an efficient synchronization scheme. The flags are shared data accessed by multiple processors in the read mode, but only one in the write mode, so that no locks are needed.
This technique remains efficient for up to about four processors operating in parallel. Beyond this, optimum reordering and assignment of the rows may be necessary to improve the efficiency of the algorithm. The present invention' makes use only of the existing SPICE version 2G5 reordering algorithm.
An heuristic algorithm assigns rows to the slave processors. The first step of this algorithm is to divide the circuit matrix into blocks of consecutive columns such that the slave processors can work within blocks without synchronization. The blocks are found by scanning the set of matrix columns from left to right and assigning them to blocks so that within a block no dependency among diagonal elements exists when performing LU decomposition. Then in each block rows containing nonzero subdiagonal elements are assigned to slave processors by determining the number of updates necessary to complete a row and dividing the amount of work assigned-to the slave processors during the LU decomposition so that it is balanced among them. For row assignment, the blocks are processed from right to left.
SUBSTITUTE Sϊ-iΞ≡T Parallelization of the time step computation does not present a major problem as the computation of the local truncation error for each energy storage element is independent. Each slave processor is assigned a set of energ storage elements and computes the minimum time step required by its set of energy storage elements. The master processor then computes the minimum time step among the time steps returned to it by the slave processors. The energy storage elements are assigned to the slave processors so that the wor among them is balanced.
The preferred embodiment of the present invention has been described in order to comply with the requirements of th Patent Statutes, and is not intended to limit the present invention to any particular embodiment. Rather, those skille in the art will appreciate that many modifications and change may be made- to the present invention without departing from the scope of the invention, and the inventors intend that the appended claims cover such changes and modifications as come within the scope and content of the appended claims.
What is claimed as new and desired to be secured by Letters Patent of the United States is:
SUBSTITUTE SHEET

Claims

1. In a computing apparatus employing a plurality of central processing units CPUs, operating in parallel, and shared memory shared by the CPUs, and wherein the plurality of CPUs are operated to create a matrix utilized to solve complex multivariable circuit equations for the purpose of circuit simulation evaluation, with the matrix entries assigned according to a matrix of circuit nodes, a method of avoiding the need for synchronization of the access to the shared memory by the plurality of CPUs, comprising the steps of: determining the number of circuit elements connected to each node corresponding to a matrix entry;
. assigning a plurality of distinct memory locations to each matrix entry according to the number of" circuit elements connected to the node corresponding to the matrix entry, with each memory location assigned to a specific circuit element; directing the CPUs to write into the respective memory location associated with a respective circuit element connected to the respective node when operating on the contribution of the respective circuit element to the matrix entry associated with the respective node.
2. The method of claim 1 further comprising the step of: Summing the contents of the plurality of memory locations associated with a respective matrix entry to obtain value for the matrix entry to be used in computing the solution of the matrix for the solution of a plurality of simultaneous equations.
3. A computing apparatus having a plurality of central
SUBSTITU" C'_ processing units (CPUs) and a memory shared by the CPUs, and wherein the CPUs are operated in parallel to solve complex multivariable circuit equations in a matrix format, with the matrix entry locations assigned according to a matrix of circuit nodes, comprising: means for determining the number of circuit elements connected to each node corresponding to a matrix entry; means for assigning a plurality of distinct memory locations to each matrix entry according to the number of circuit elements connected to the corresponding node, with each respective memory location assigned to a respective circuit element; means for directing each of the CPUs to write into the respective memory location associated with the respective circuit element when operating on the contribution of the respective circuit elements to the respective matrix entry corresponding to the respective node.
4. The apparatus of claim 3, further comprising: means for summing the contents of the respective memory locations associated with a respective one of the matrix entries.
5. In a computing apparatus employing a plurality of central processing units (CPUs) operating in parallel to decompose a matrix utilized to solve complex multivariable circuit equations for the purpose of circuit simulation evaluation, a method of synchronizing the matrix decomposition to prevent the CPUs from operating on invalid data during the matrix composition, comprising the steps of: assigning to each of the plurality of CPU's the tasks
SUBSTITUTE SHEET associated in updating all of the matrix entries within selected rows of the matrix as required for matrix update during matrix decomposition; generating an indicator associated with each diagonal entry in the matrix when the one of the plurality of processors has completed updating the matrix entries in the row in which the diagonal entry resides; preventing any of the others of the plurality of CPU's from updating any entries in a row assigned to such other CPU, which update depends on a given matrix diagonal entry, unless the presence of the indicator for the given matrix diagonal element is detected.
SUBSTITUTE ;. __
PCT/US1987/002079 1986-08-20 1987-08-20 Method and apparatus for circuit simulation using parallel processors including memory arrangement and matrix decomposition synchronization WO1988001412A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US89847686A 1986-08-20 1986-08-20
US898,476 1986-08-20

Publications (1)

Publication Number Publication Date
WO1988001412A1 true WO1988001412A1 (en) 1988-02-25

Family

ID=25409513

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1987/002079 WO1988001412A1 (en) 1986-08-20 1987-08-20 Method and apparatus for circuit simulation using parallel processors including memory arrangement and matrix decomposition synchronization

Country Status (2)

Country Link
CA (1) CA1276296C (en)
WO (1) WO1988001412A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4827922A (en) * 1987-03-05 1989-05-09 L'air Liquide Process and device for supplying respiratory oxygen
CN102081690A (en) * 2010-12-30 2011-06-01 南京理工大学 MDA (Matrix Decomposition Algorithm)-combined novel SVD (Singular Value Decomposition) method for complex circuit
CN110737870A (en) * 2019-09-26 2020-01-31 北京华大九天软件有限公司 method and device for merging schuler matrix on GPU

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US3629843A (en) * 1970-05-11 1971-12-21 Bell Telephone Labor Inc Machine process for assigning interconnected components to locations in a planar matrix
UST915008I4 (en) * 1973-04-06 1973-10-09 Yes yes

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US3629843A (en) * 1970-05-11 1971-12-21 Bell Telephone Labor Inc Machine process for assigning interconnected components to locations in a planar matrix
UST915008I4 (en) * 1973-04-06 1973-10-09 Yes yes

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
IEEE International Symposium on Circuits and Systems, Volume 1, issued May 1986 (Piscataway, New Jersey), "Direct-Method Circuit Simulation Using Multiprocessors", see pages 170-172. *
IEEE Transaction on Computer Aided Design, Volume Cad-4, No. 3, issued July 1985, "Vectorized LU Decomposition Algorithms for Large Scale Circuit Simulation", see pages 232-239. *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4827922A (en) * 1987-03-05 1989-05-09 L'air Liquide Process and device for supplying respiratory oxygen
CN102081690A (en) * 2010-12-30 2011-06-01 南京理工大学 MDA (Matrix Decomposition Algorithm)-combined novel SVD (Singular Value Decomposition) method for complex circuit
CN102081690B (en) * 2010-12-30 2014-07-02 南京理工大学 MDA (Matrix Decomposition Algorithm)-combined novel SVD (Singular Value Decomposition) method for complex circuit
CN110737870A (en) * 2019-09-26 2020-01-31 北京华大九天软件有限公司 method and device for merging schuler matrix on GPU

Also Published As

Publication number Publication date
CA1276296C (en) 1990-11-13

Similar Documents

Publication Publication Date Title
US5157778A (en) Method and apparatus for circuit simulation using parallel processors including memory arrangements and matrix decomposition synchronization
US5392429A (en) Method of operating a multiprocessor computer to solve a set of simultaneous equations
Capitanio et al. Partitioned register files for VLIWs: A preliminary analysis of tradeoffs
EP0428084B1 (en) Method and apparatus for compiling computer programs with interprocedural register allocation
JPH04330527A (en) Optimization method for compiler
Hammond et al. Efficient ICCG on a shared memory multiprocessor
Stamatakis et al. AxML: a fast program for sequential and parallel phylogenetic tree calculations based on the maximum likelihood method
Pinar et al. Parallel decomposition of multicommodity network flows using a linear-quadratic penalty algorithm
Ford The construction of maximal orders over a Dedekind domain
Rauber et al. Parallel execution of embedded and iterated Runge–Kutta methods
WO1988001412A1 (en) Method and apparatus for circuit simulation using parallel processors including memory arrangement and matrix decomposition synchronization
Lengauer A methodology for programming with concurrency: The formalism
Bündgen et al. Parallel ReDuX→ PaReDuX
Haring On state-dependent workload characterization by software resources
Lowe et al. On finding optimal clusterings of task graphs
Kappeller et al. Optimized extrapolation methods for parallel solution of IVPs on different computer architectures
Johnson et al. Waveform-relaxation-based circuit simulation on the Victor V256 parallel processor
Leiserson et al. Retiming synchronous circuitry
Mujić et al. Accelerating Sorting on GPUs: A Scalable CUDA Quicksort Revision
Møller-Nielsen et al. Experiments with a Multiprocessor
Moran et al. A lower bound on wait-free counting
Cérin et al. Methods for partitioning data to improve parallel execution time for sorting on heterogeneous clusters
Dorochevsky et al. Constraint handling, garbage collection and execution model issues in ElipSys
Corrêa et al. Modeling parallel branch-and-bound for asynchronous implementations
Pirozzi The fast numerical solution of mildly nonlinear elliptic boundary value problems on multiprocessors

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): JP

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): AT BE CH DE FR GB IT LU NL SE