WO2002069097A2 - Efficient implementation of a multidimensional fast fourier transform on a distributed-memory parallel multi-node computer - Google Patents

Publication number
WO2002069097A2
WO2002069097A2 PCT/US2002/005574 US0205574W WO02069097A2 WO 2002069097 A2 WO2002069097 A2 WO 2002069097A2 US 0205574 W US0205574 W US 0205574W WO 02069097 A2 WO02069097 A2 WO 02069097A2
Authority
WO
WIPO (PCT)
Prior art keywords
node
elements
nodes
array
multidimensional
Prior art date
Application number
PCT/US2002/005574
Other languages
French (fr)
Other versions
WO2002069097A3 (en
Inventor
Gyan V. Bhanot
Dong Chen
Alan G. Gara
Mark E. Giampapa
Philip Heidelberger
Burkhard D. Steinmacher-Burow
Pavlos M. Vranas
Original Assignee
International Business Machines Corporation
Priority to KR1020037011119A priority Critical patent/KR100592753B1/en
Priority to CNB02805377XA priority patent/CN1244878C/en
Priority to CA002437036A priority patent/CA2437036A1/en
Priority to EP02721139A priority patent/EP1497750A4/en
Priority to PCT/US2002/005574 priority patent/WO2002069097A2/en
Priority to JP2002568153A priority patent/JP4652666B2/en
Application filed by International Business Machines Corporation filed Critical International Business Machines Corporation
Priority to AU2002252086A priority patent/AU2002252086A1/en
Priority to US10/468,998 priority patent/US7315877B2/en
Priority to IL15751802A priority patent/IL157518A0/en
Publication of WO2002069097A2 publication Critical patent/WO2002069097A2/en
Publication of WO2002069097A3 publication Critical patent/WO2002069097A3/en
Priority to US11/931,898 priority patent/US8095585B2/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/14Fourier, Walsh or analogous domain transformations, e.g. Laplace, Hilbert, Karhunen-Loeve, transforms
    • HELECTRICITY
    • H05ELECTRIC TECHNIQUES NOT OTHERWISE PROVIDED FOR
    • H05KPRINTED CIRCUITS; CASINGS OR CONSTRUCTIONAL DETAILS OF ELECTRIC APPARATUS; MANUFACTURE OF ASSEMBLAGES OF ELECTRICAL COMPONENTS
    • H05K7/00Constructional details common to different types of electric apparatus
    • H05K7/20Modifications to facilitate cooling, ventilating, or heating
    • H05K7/20709Modifications to facilitate cooling, ventilating, or heating for server racks or cabinets; for data centers, e.g. 19-inch computer racks
    • H05K7/20836Thermal management, e.g. server temperature control
    • FMECHANICAL ENGINEERING; LIGHTING; HEATING; WEAPONS; BLASTING
    • F04POSITIVE - DISPLACEMENT MACHINES FOR LIQUIDS; PUMPS FOR LIQUIDS OR ELASTIC FLUIDS
    • F04DNON-POSITIVE-DISPLACEMENT PUMPS
    • F04D25/00Pumping installations or systems
    • F04D25/16Combinations of two or more pumps ; Producing two or more separate gas flows
    • F04D25/166Combinations of two or more pumps ; Producing two or more separate gas flows using fans
    • FMECHANICAL ENGINEERING; LIGHTING; HEATING; WEAPONS; BLASTING
    • F04POSITIVE - DISPLACEMENT MACHINES FOR LIQUIDS; PUMPS FOR LIQUIDS OR ELASTIC FLUIDS
    • F04DNON-POSITIVE-DISPLACEMENT PUMPS
    • F04D27/00Control, e.g. regulation, of pumps, pumping installations or pumping systems specially adapted for elastic fluids
    • F04D27/004Control, e.g. regulation, of pumps, pumping installations or pumping systems specially adapted for elastic fluids by varying driving speed
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/16Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/163Interprocessor communication
    • G06F15/173Interprocessor communication using an interconnection network, e.g. matrix, shuffle, pyramid, star, snowflake
    • G06F15/17356Indirect interconnection networks
    • G06F15/17368Indirect interconnection networks non hierarchical topologies
    • G06F15/17381Two dimensional, e.g. mesh, torus
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/14Fourier, Walsh or analogous domain transformations, e.g. Laplace, Hilbert, Karhunen-Loeve, transforms
    • G06F17/141Discrete Fourier transforms
    • G06F17/142Fast Fourier transforms, e.g. using a Cooley-Tukey type algorithm
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/52Program synchronisation; Mutual exclusion, e.g. by means of semaphores
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/52Program synchronisation; Mutual exclusion, e.g. by means of semaphores
    • G06F9/526Mutual exclusion algorithms
    • GPHYSICS
    • G09EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
    • G09GARRANGEMENTS OR CIRCUITS FOR CONTROL OF INDICATING DEVICES USING STATIC MEANS TO PRESENT VARIABLE INFORMATION
    • G09G5/00Control arrangements or circuits for visual indicators common to cathode-ray tube indicators and other visual indicators
    • G09G5/003Details of a display terminal, the details relating to the control arrangement of the display terminal and to the interfaces thereto
    • G09G5/006Details of the interface to the display terminal
    • G09G5/008Clock recovery
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L7/00Arrangements for synchronising receiver with transmitter
    • H04L7/02Speed or phase control by the received code signals, the signals containing no special synchronisation information
    • H04L7/033Speed or phase control by the received code signals, the signals containing no special synchronisation information using the transitions of the received signal to control the phase of the synchronising-signal-generating means, e.g. using a phase-locked loop
    • H04L7/0337Selecting between two or more discretely delayed clocks or selecting between two or more discretely delayed received code signals
    • H04L7/0338Selecting between two or more discretely delayed clocks or selecting between two or more discretely delayed received code signals the correction of the phase error being performed by a feed forward loop
    • FMECHANICAL ENGINEERING; LIGHTING; HEATING; WEAPONS; BLASTING
    • F24HEATING; RANGES; VENTILATING
    • F24FAIR-CONDITIONING; AIR-HUMIDIFICATION; VENTILATION; USE OF AIR CURRENTS FOR SCREENING
    • F24F11/00Control or safety arrangements
    • F24F11/70Control systems characterised by their outputs; Constructional details thereof
    • F24F11/72Control systems characterised by their outputs; Constructional details thereof for controlling the supply of treated air, e.g. its pressure
    • F24F11/74Control systems characterised by their outputs; Constructional details thereof for controlling the supply of treated air, e.g. its pressure for controlling air flow rate or air velocity
    • F24F11/77Control systems characterised by their outputs; Constructional details thereof for controlling the supply of treated air, e.g. its pressure for controlling air flow rate or air velocity by controlling the speed of ventilators
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02BCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO BUILDINGS, e.g. HOUSING, HOUSE APPLIANCES OR RELATED END-USER APPLICATIONS
    • Y02B30/00Energy efficient heating, ventilation or air conditioning [HVAC]
    • Y02B30/70Efficient control or regulation technologies, e.g. for control of refrigerant flow, motor or heating

Definitions

  • the present invention generally relates to the field of distributed-memory message-passing parallel multi-node computers and associated system software, as applied for example to computations in the fields of science, mathematics, engineering and the like. More particularly, the present invention is directed to a system and method for efficient implementation of a multidimensional Fast Fourier Transform (i.e., "FFT") on a distributed-memory parallel supercomputer.
  • FFT Fast Fourier Transform
  • Linear transforms such as the Fourier Transform (i.e., "FT")
  • FT Fourier Transform
  • the FT alters a given problem into one that may be more easily solved, and the FT is used in many different applications.
  • the FT essentially represents a change of the N variables from coordinate space to momentum space, where the new value of each variable depends on the values of all the old variables.
  • Such a system of N variables is usually stored on a computer as an array of N elements.
  • the FT is commonly computed using the Fast Fourier Transform (i.e., "FFT").
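For reference, the discrete form of the transform can be written as below. This is one common convention (sign and normalization differ between texts), so treat the notation as illustrative and not as the patent's own; the symbols $x_n$ and $X_k$ are introduced here for the old and new values of the N variables.

```latex
X_k \;=\; \sum_{n=0}^{N-1} x_n \, e^{-2\pi i\, k n / N},
\qquad k = 0, 1, \dots, N-1
```

Each new value $X_k$ depends on every old value $x_n$, which is the property described above. Evaluating the sum directly costs on the order of $N^2$ operations, while the FFT produces the same result in on the order of $N \log N$ operations.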
  • the FFT is described in many standard texts, such as Numerical Recipes by Press, et al. ("Numerical Recipes in Fortran", pages 490-529, by W. H. Press, S. A. Teukolsky, W. A. Vetterling and Brian P. Flannery, Cambridge University Press, 1986, 1992, ISBN: 0-521-43064-X).
  • Most computer manufacturers provide library function calls to optimize the FFT for their specific processor.
  • the FFT is fully optimized on IBM's RS/6000 processor in the Engineering and Scientific Subroutine Library. These library routines require that the data (i.e., the foregoing elements) necessary to perform the FFT be resident in a memory local to a node.
  • N elements of a multidimensional array are distributed in a plurality of dimensions across nodes of a distributed-memory parallel multi-node computer.
  • Many applications that execute on distributed-memory parallel multi-node computers spend a large fraction of their execution time on calculating the multidimensional FFT. Since a motivation for the distributed-memory parallel multi-node computers is faster execution, fast calculation of the multidimensional FFT for the distributed array is of critical importance.
  • the N elements of the array are initially distributed across the nodes in some arbitrary fashion particular to an application. To calculate the multidimensional FFT, the array of elements is then redistributed such that a portion of the array on each node consists of a complete row of elements in the x-dimension.
  • a one-dimensional FFT on each row in the x-dimension on each node is then performed. Since the row is local to a node and since each one-dimensional FFT on each row is independent of the others, the one-dimensional FFT performed on each node requires no communication with any other node and may be performed using the abovementioned library routines. After the one-dimensional FFT, the array elements are re-distributed such that a portion of the array on each node consists of a complete row in the y-dimension. Thereafter, a one-dimensional FFT on each row in the y-dimension on each node is performed.
  • the re-distribution and a one-dimensional FFT are repeated for each successive dimension of the array beyond the x-dimension and the y-dimension.
  • the resulting array may be re-distributed into some arbitrary fashion particular to the application.
  • the treatment of the x-dimension and the y-dimension in sequence is not fundamental to the multidimensional FFT. Instead, the dimensions of the array may be treated in any order. For some applications or some computers, some orders may take advantage of some efficiency and thus have a faster execution than other orders.
  • the initial distribution of the array across the nodes which is in some arbitrary fashion particular to the application, may coincide with the distribution necessary for the one-dimensional FFTs in the y-dimension. In this case, it may be fastest for the multidimensional FFT to treat the y-dimension first, before treating the x-dimension and any other remaining dimensions.
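The sequence described above (one-dimensional transforms along one dimension, a re-distribution, then one-dimensional transforms along the next dimension) can be sketched on a single machine as follows. This is a minimal illustration and not the patented implementation: a naive O(N^2) DFT stands in for an optimized library FFT, a list transpose stands in for the network re-distribution, and all function names are hypothetical.

```python
import cmath


def dft(seq):
    """Naive one-dimensional discrete Fourier transform (stand-in for a library FFT)."""
    n = len(seq)
    return [
        sum(seq[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def transpose(rows):
    """Stand-in for the all-to-all re-distribution: rows become columns."""
    return [list(col) for col in zip(*rows)]


def fft2_by_rows_then_columns(array):
    # Step 1: each row is local to one node, so transform rows independently.
    stage1 = [dft(row) for row in array]
    # Step 2: re-distribute so that each node holds a complete column.
    columns = transpose(stage1)
    # Step 3: transform along the second dimension.
    stage2 = [dft(col) for col in columns]
    # Step 4: optionally re-distribute back to the original layout.
    return transpose(stage2)


def fft2_by_columns_then_rows(array):
    # Same computation with the dimensions treated in the opposite order.
    stage1 = [dft(col) for col in transpose(array)]
    return [dft(row) for row in transpose(stage1)]


def dft2_direct(array):
    """Direct two-dimensional DFT, used only to check the decomposition."""
    rows, cols = len(array), len(array[0])
    return [
        [
            sum(
                array[r][c] * cmath.exp(-2j * cmath.pi * (u * r / rows + v * c / cols))
                for r in range(rows)
                for c in range(cols)
            )
            for v in range(cols)
        ]
        for u in range(rows)
    ]
```

Both orderings agree with the direct two-dimensional transform up to floating-point rounding, which is why the choice of order can be made purely on efficiency grounds.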
  • each re-distribution of the array between the one-dimensional FFTs is an example of an "all-to-all" communication or re-distribution.
  • each node of the distributed-memory parallel multi-node computer sends unique data (i.e., elements of the array) to all other nodes utilizing a plurality of packets.
  • fast calculation of the multidimensional FFT on the distributed-memory parallel multi-node computer is of critical importance.
  • typically a large fraction of the execution time is spent to re-distribute the array across the nodes of the distributed-memory parallel multi-node computer. More particularly, a large fraction of execution time is spent on the "all-to-all" re-distribution of elements of the array across the nodes of the distributed-memory parallel multi-node computer.
  • a method for efficiently implementing a multidimensional Fast Fourier Transform (FFT) of a multidimensional array comprising a plurality of elements initially distributed in a multi-node computer system comprising a plurality of nodes in communication over a network
  • the method comprising: distributing the plurality of elements of the array in a first dimension across the plurality of nodes of the computer system over the network to facilitate a first one-dimensional FFT; performing the first one-dimensional FFT on the elements of the array distributed at each node in the first dimension; re-distributing the one-dimensional FFT-transformed elements at each node in a second dimension via "all-to-all" distribution in random order across other nodes of the computer system over the network; and performing a second one-dimensional FFT on elements of the array re-distributed at each node in the second dimension, wherein the random order facilitates efficient utilization of the network thereby efficiently implementing the multidimensional FFT.
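To make the "random order" element of the method concrete, the sketch below compares two send schedules for an all-to-all exchange among 9 nodes: every node walking through destinations in the same ascending order, and every node using its own shuffled order. The lock-step model, the `peak_load` metric, the fixed seed and all names are assumptions made for illustration; the text itself states only that the random order facilitates efficient utilization of the network.

```python
import random


def send_schedule(node, num_nodes, rng=None):
    """Order in which `node` sends to every other node (shuffled if rng is given)."""
    dests = [d for d in range(num_nodes) if d != node]
    if rng is not None:
        rng.shuffle(dests)
    return dests


def peak_load(schedules):
    """Largest number of senders that target one destination in the same step."""
    steps = len(next(iter(schedules.values())))
    peak = 0
    for step in range(steps):
        counts = {}
        for order in schedules.values():
            counts[order[step]] = counts.get(order[step], 0) + 1
        peak = max(peak, max(counts.values()))
    return peak


NUM_NODES = 9

# Every node uses the same ascending destination order.
in_order = {n: send_schedule(n, NUM_NODES) for n in range(NUM_NODES)}

# Every node uses its own random destination order (seeded for repeatability).
rng = random.Random(0)
shuffled = {n: send_schedule(n, NUM_NODES, rng) for n in range(NUM_NODES)}

print("peak load, ascending order:", peak_load(in_order))
print("peak load, random order:   ", peak_load(shuffled))
```

In the ascending schedule, eight of the nine nodes address node 0 in the first step, so the links around that node become a hot spot. Shuffling keeps the same set of sends per node but spreads them over time.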
  • a system for efficiently implementing a multidimensional Fast Fourier Transform (FFT) of a multidimensional array comprising a plurality of elements initially distributed in a multi-node computer system comprising a plurality of nodes in communication over a network
  • the system comprising: means for distributing the plurality of elements of the array in a first dimension across the plurality of nodes of the computer system over the network to facilitate a first one-dimensional FFT; means for performing the first one-dimensional FFT on the elements of the array distributed at each node in the first dimension; means for re-distributing the one-dimensional FFT-transformed elements at each node in a second dimension via "all-to-all" distribution in random order across other nodes of the computer system over the network; and means for performing a second one-dimensional FFT on elements of the array re-distributed at each node in the second dimension, wherein the random order facilitates efficient utilization of the network thereby efficiently implementing the multidimensional FFT.
  • a program storage device tangibly embodying a program of instructions executable by a machine to perform a method for efficiently implementing a multidimensional Fast Fourier Transform (FFT) of a multidimensional array comprising a plurality of elements initially distributed in a multi-node computer system comprising a plurality of nodes in communication over a network, the method comprising: distributing the plurality of elements of the array in a first dimension across the plurality of nodes of the computer system over the network to facilitate a first one-dimensional FFT; performing the first one-dimensional FFT on the elements of the array distributed at each node in the first dimension; re-distributing the one-dimensional FFT-transformed elements at each node in a second dimension via "all-to-all" distribution in random order across other nodes of the computer system over the network; and performing a second one-dimensional FFT on elements of the array re-distributed at each node in the second dimension, wherein the random order facilitates
  • a method for efficiently re-distributing a multidimensional array comprising a plurality of elements initially distributed in a multi-node computer system comprising a plurality of nodes in communication over a network, the method comprising re-distributing the elements at each node via "all-to-all" distribution in random order across other nodes of the computer system over the network, wherein the random order facilitates efficient utilization of the network.
  • a system for efficiently re-distributing a multidimensional array comprising a plurality of elements initially distributed in a multi-node computer system comprising a plurality of nodes in communication over a network, the system comprising a means for re-distributing the elements at each node via "all-to-all" distribution in random order across other nodes of the computer system over the network, wherein the random order facilitates efficient utilization of the network.
  • a program storage device tangibly embodying a program of instructions executable by a machine to perform a method for efficiently re-distributing a multidimensional array comprising a plurality of elements initially distributed in a multi-node computer system comprising a plurality of nodes in communication over a network, the method comprising re-distributing the elements at each node via "all-to-all" distribution in random order across other nodes of the computer system over the network, wherein the random order facilitates efficient utilization of the network.
  • Figure 1 illustrates an exemplary distributed-memory parallel supercomputer that includes 9 nodes interconnected via a multidimensional grid utilizing a 2-dimensional 3x3 Torus network according to the present invention
  • Figure 2 illustrates a more detailed representation of an exemplary node from the distributed-memory parallel supercomputer of Figure 1 according to the present invention
  • Figure 3 illustrates an exemplary two-dimensional 9-row by 9-column array, which may efficiently be implemented for the multidimensional FFT according to the present invention
  • Figure 4 illustrates an exemplary distribution of the two-dimensional array of Figure 3 across the nodes of the supercomputer in Figure 1 according to the present invention
  • Figure 5 illustrates an exemplary first one-dimensional FFT of the two-dimensional array distributed across the nodes of the supercomputer of Figure 1 according to the present invention
  • Figure 6 illustrates an exemplary re-distribution of a resultant two-dimensional array after the first one-dimensional FFT of Figure 5 according to the present invention
  • Figure 7 illustrates an exemplary second one-dimensional FFT of the re-distributed array of Figure 6 according to the present invention
  • Figure 8 illustrates an exemplary method flowchart depicting the implementation of the two-dimensional FFT illustrated in Figures 4-7 according to the present invention
  • Figure 9 illustrates an exemplary method flowchart that depicts the filling of output queues on the exemplary node with packets destined for other nodes on the distributed-memory parallel supercomputer according to the present invention.
  • Figure 10 illustrates an exemplary method flowchart that depicts how the packets in the output queues on the exemplary node are drained into injection FIFOs for subsequent insertion on the Torus network 100 according to the present invention.
  • the present invention is directed to a system and method for efficiently implementing the multidimensional Fast Fourier Transform (i.e., "FFT") on the distributed-memory parallel supercomputer. More particularly, the present invention implements an efficient "all-to-all" re-distribution of elements distributed at nodes of the distributed-memory parallel supercomputer to achieve an efficient implementation of the multidimensional FFT.
  • the FFT is implemented on the distributed-memory parallel supercomputer, as a series of one-dimensional transforms, which require one or more "all-to-all" re-distributions of a multidimensional array across the nodes of the distributed-memory parallel supercomputer.
  • the distributed-memory parallel supercomputer utilizes a Torus-based network for the interconnection of and communication between nodes of the supercomputer.
  • each node implements a hardware router for efficiently routing packets that include elements of the array across the nodes of the supercomputer interconnected via the Torus-based network. Therefore, the present invention couples the implementation of the multidimensional FFT as a series of one-dimensional transforms of the multi-dimensional array with the foregoing hardware routing to obtain the efficient FFT implementation according to the present invention.
  • the distributed-memory parallel supercomputer comprises a plurality of nodes, each of which includes at least one processor that operates on a local memory.
  • the nodes are interconnected as a multidimensional grid and they communicate via grid links.
  • the multidimensional node grid of the supercomputer will be described as an exemplary 2-dimensional grid. Notwithstanding the fact that only the 2-dimensional node grid is described in the following description, it is contemplated within the scope of the present invention that node grids of other dimensions may easily be provided based on the teachings of the present invention. It is noted that the distributed-memory parallel supercomputer may utilize a 3-dimensional or greater Torus-based architecture.
  • the multidimensional array used by the multidimensional FFT will be described as an exemplary 2-dimensional array. Notwithstanding the fact that only the 2-dimensional array is described in the following description, it is contemplated within the scope of the present invention that arrays of additional dimensions may easily be provided based on the teachings of the present invention. It is further noted that there is no correspondence between the number of dimensions in the Torus-based architecture and the number of dimensions in the array. The array must be of sufficient size such that it can be distributed across the nodes or a subset of the nodes of the supercomputer for implementing the multidimensional FFT according to the present invention.
  • Figure 1 is an exemplary illustration of a distributed-memory parallel supercomputer that includes 9 nodes interconnected via a multidimensional grid utilizing a 2-dimensional 3x3 Torus network 100, according to the present invention. It is noted that the number of nodes is in exemplary fashion limited to 9 nodes for brevity and clarity, and that the number of nodes may significantly vary depending on the particular architectural requirements for the distributed-memory parallel supercomputer.
  • Figure 1 depicts 9 nodes labeled as Q11 - Q33, a pair of which is interconnected by a grid link.
  • the 9-node Torus network 100 is interconnected by 18 grid links, where each node is directly interconnected to four other nodes in the Torus network 100 via a respective grid link.
  • the exemplary 2-dimensional Torus network 100 includes no edge nodes.
  • node Q11 is interconnected to node Q31 via grid link 102; to node Q13 via grid link 104; to node Q21 via grid link 106; and finally to node Q12 via grid link 108.
  • Node Q22 is interconnected to Node Q12 via grid link 110; to node Q21 via grid link 112; to node Q32 via grid link 114 and finally to Node Q23 via grid link 116.
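The wraparound connectivity of the 3x3 Torus can be reproduced with modular arithmetic. The sketch below is illustrative only; it assumes that the two digits of a node label such as Q31 are 1-based coordinates in the two grid dimensions, and the helper names are hypothetical.

```python
def node_name(row, col):
    """Label for the node at 1-based grid coordinates (row, col), e.g. Q31."""
    return f"Q{row}{col}"


def torus_neighbors(row, col, size=3):
    """Names of the four nodes directly linked to (row, col) on a size x size torus."""

    def wrap(index):
        # Wrap a 1-based coordinate around the edge of the grid.
        return (index - 1) % size + 1

    return {
        node_name(wrap(row - 1), col),
        node_name(wrap(row + 1), col),
        node_name(row, wrap(col - 1)),
        node_name(row, wrap(col + 1)),
    }


def torus_links(size=3):
    """Set of undirected grid links, each stored as a frozenset of two node names."""
    links = set()
    for row in range(1, size + 1):
        for col in range(1, size + 1):
            for neighbor in torus_neighbors(row, col, size):
                links.add(frozenset((node_name(row, col), neighbor)))
    return links


print(sorted(torus_neighbors(1, 1)))  # neighbors of Q11
print(len(torus_links()))             # number of grid links
```

Because every coordinate wraps, no node sits on an edge, and the 9 nodes with four links each share 18 distinct grid links.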
  • Other nodes are interconnected in a similar fashion. Further with reference to Figure 1, data (i.e., elements of the array) communicated between nodes is transported on the network in one or more packets.
  • a packet comprises a packet header and the data carried by the packet.
  • the packet header includes information required by the Torus network 100 to transport the packet from a source node to a destination node.
  • each node on the network is identified by a logical address and the packet header includes a destination address so that the packet is automatically routed to a node on the network as identified by a destination.
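A packet as described above (a header carrying the destination address plus a data payload) might be modeled as follows. The field names, the string addresses and the payload limit of four elements are assumptions for illustration; the text does not specify a packet layout or size.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class PacketHeader:
    destination: str  # logical address of the destination node, e.g. "Q22"


@dataclass(frozen=True)
class Packet:
    header: PacketHeader
    payload: Tuple  # array elements carried by this packet


def packetize(elements: Sequence, destination: str, max_elements: int = 4):
    """Split `elements` into packets addressed to `destination`.

    `max_elements` is a hypothetical payload limit chosen for the example.
    """
    header = PacketHeader(destination)
    return [
        Packet(header, tuple(elements[i:i + max_elements]))
        for i in range(0, len(elements), max_elements)
    ]


packets = packetize(list(range(9)), "Q22")
print(len(packets), [p.payload for p in packets])
```

Since each packet carries its own destination in the header, the network can route packets independently of one another.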
  • Figure 2 is a more detailed representation 200 of an exemplary node, e.g., node Q11, from the distributed-memory parallel supercomputer of Figure 1 according to the present invention.
  • the node Q11 comprises at least one processor 202 that operates on local memory 204.
  • the node further comprises a router 206 that routes, i.e., sends and receives, packets on the grid links 102, 104, 106 and 108, which connect the node Q11 to its neighboring nodes Q31, Q13, Q21 and Q12, respectively, as particularly illustrated in Figure 1.
  • the node comprises a reception buffer 208 for buffering packets received by the router 206, which are destined for the local processor 202.
  • the local processor 202 may easily periodically poll the reception buffer 208 in order to determine if there are packets in the reception buffer and then retrieve the packets that are buffered in the reception buffer 208. Depending on a particular application and the packets, the local processor 202 may write the contents of the packets into memory 204.
  • the node Q11 comprises four injection First-In-First-Out (i.e., "FIFO") buffers 210, which are particularly labeled X+, X-, Y+ and Y-.
  • the processor places outbound packets into one or more output queues 212 of the local memory 204, which store packets destined for other nodes until they can be placed into the injection FIFOs 210. While injection FIFOs are not full, the processor places outbound packets into the injection FIFOs 210.
  • upon a particular packet reaching the head of an injection FIFO 210, the packet is removed from the injection FIFO 210 by the router 206 and the router 206 inserts the packet onto one of the grid links 102, 104, 106 and 108 toward a destination node for the particular packet.
  • the four injection FIFOs 210 are treated equivalently by the router 206 and by the hardware of the local processor 202.
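The interaction between the output queues in local memory and the four injection FIFOs can be sketched as below. This is a simplified model under stated assumptions: a single output queue, a hypothetical FIFO capacity of two packets, and a "first FIFO with free space" placement policy. None of these details are given in the text above.

```python
from collections import deque

FIFO_NAMES = ("X+", "X-", "Y+", "Y-")


class Node:
    def __init__(self, fifo_capacity=2):
        self.fifo_capacity = fifo_capacity  # hypothetical hardware limit
        self.output_queue = deque()         # lives in local memory
        self.injection_fifos = {name: deque() for name in FIFO_NAMES}

    def enqueue(self, packet):
        """Processor side: park an outbound packet in the output queue."""
        self.output_queue.append(packet)

    def drain(self):
        """Processor side: move packets into injection FIFOs while any has room."""
        moved = 0
        while self.output_queue:
            target = next(
                (f for f in self.injection_fifos.values()
                 if len(f) < self.fifo_capacity),
                None,
            )
            if target is None:
                break  # all four FIFOs are full; try again later
            target.append(self.output_queue.popleft())
            moved += 1
        return moved

    def router_inject(self, fifo_name):
        """Router side: remove the packet at the head of one FIFO, if any."""
        fifo = self.injection_fifos[fifo_name]
        return fifo.popleft() if fifo else None


node = Node(fifo_capacity=2)
for packet_id in range(10):
    node.enqueue(packet_id)
print(node.drain(), len(node.output_queue))
```

The output queue absorbs packets that do not yet fit, so the processor never blocks on a full FIFO, while the router empties the FIFOs at its own pace.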
  • the router 206 comprises several simultaneous routing characteristics.
  • the routing first represents virtual cut-through routing. For example, if an incoming packet on one of the grid links is not destined for the local processor 202 of node Q11, then the router 206 forwards the packet onto one of the outgoing grid links 102, 104, 106 and 108. The router 206 performs the forwarding without involving the local processor 202.
  • the routing further represents shortest-path routing. For example, a packet sent by node Q11 to node Q13 (See Figures 1 and 8) that travels over the grid link 104 represents a shortest path route. Any other path would be longer.
  • a packet sent by node Q11 to node Q22 may travel over grid links 106 and 112 or alternatively over grid links 108 and 110.
  • This type of routing represents an adaptive type of routing.
  • the packet may leave the node Q11 via the grid link 106 or 108.
  • Adaptive routing allows the router 206 to choose the less busy outgoing grid link for a packet or to choose the outgoing grid link based on some other criteria.
  • the adaptive routing is not just performed at the source node of a packet, e.g., node Q11, but is performed at each intermediate node that a packet cuts through on the way to the packet's destination node over the Torus-based network 100 of Figure 1.
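Shortest-path and adaptive routing on the torus can be illustrated by computing, at any node, the set of next hops that lie on some shortest path and then picking the least busy one. The sketch uses 0-based coordinates, so Q11 is (0, 0), Q13 is (0, 2) and Q22 is (1, 1); the functions and the queue-depth criterion are assumptions for illustration, not the router's actual logic.

```python
def ring_steps(src, dst, size):
    """Directions (+1 or -1) that shorten the distance along one ring, and that distance."""
    forward = (dst - src) % size
    backward = (src - dst) % size
    if forward == 0:
        return [], 0
    if forward < backward:
        return [+1], forward
    if backward < forward:
        return [-1], backward
    return [+1, -1], forward  # tie: both directions are equally short


def minimal_next_hops(src, dst, size=3):
    """Next-hop nodes on some shortest path from src to dst, plus the hop count."""
    (src_row, src_col), (dst_row, dst_col) = src, dst
    row_dirs, row_dist = ring_steps(src_row, dst_row, size)
    col_dirs, col_dist = ring_steps(src_col, dst_col, size)
    hops = [((src_row + d) % size, src_col) for d in row_dirs]
    hops += [(src_row, (src_col + d) % size) for d in col_dirs]
    return hops, row_dist + col_dist


def choose_hop(hops, queue_depth):
    """Adaptive choice: prefer the candidate hop whose link is least busy."""
    return min(hops, key=lambda hop: queue_depth.get(hop, 0))


print(minimal_next_hops((0, 0), (0, 2)))  # Q11 -> Q13 via the wraparound link
print(minimal_next_hops((0, 0), (1, 1)))  # Q11 -> Q22 has two candidates
```

Applying the same computation at every intermediate node gives the per-hop adaptivity described above: the candidate set is recomputed at each node the packet cuts through.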
  • the description below with reference to Figures 9 and 10 particularly describes how the present invention performs the foregoing routing of packets across the nodes of the supercomputer over the Torus network 100.
  • Figure 3 is an exemplary two-dimensional 9-row by 9-column array 300 that includes 81 elements, which may efficiently be implemented for the multidimensional FFT according to the present invention. It is noted that the exemplary two-dimensional array 300 is easily extended to other two-dimensional arrays including a different number of rows and columns (e.g., 10-row by 11-column two-dimensional array), which may be utilized for implementing the FFT on the distributed-memory parallel supercomputer according to the present invention.
  • the first row of the array comprises elements A11, A12...A19, while the first column of the array comprises elements A11, A21...A91.
  • Figure 4 is an exemplary distribution illustration 400 of how the two-dimensional array 300 of Figure 3 is distributed across the nodes Q11 - Q33 in Figure 1 according to the present invention. It is noted that the array may initially be distributed across the nodes in some arbitrary fashion that is particular to an application. According to the present invention, the array is re-distributed such that a portion of the array on each node Q11...Q33 comprises the distribution illustrated in Figure 4. This re-distribution is similar to that described below with reference to Figures 5 and 6. As particularly depicted in the distribution illustration 400, each node of Figure 1 includes a portion of the two-dimensional array 300 of Figure 3. For example, node Q11 comprises the first row of the array 300, i.e., elements A11, A12...A19.
  • node Q12 comprises the second row of the array 300, i.e., elements A21, A22...A29.
  • other nodes Q13 - Q33 of Figure 1 comprise respective rows 3 through 9 of array 300, as particularly depicted in distribution illustration 400 of Figure 4.
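The row-per-node layout can be written out explicitly. The sketch below assumes that rows are assigned to nodes in the order Q11, Q12, Q13, Q21, ..., Q33, which matches the two assignments stated above (Q11 holds row 1, Q12 holds row 2) but is otherwise a guess at what Figure 4 shows; the function name is hypothetical.

```python
def initial_distribution(array, grid_size=3):
    """Map each node name to the single row of `array` that it holds."""
    if len(array) != grid_size * grid_size:
        raise ValueError("this sketch assumes exactly one row per node")
    layout = {}
    for index, row in enumerate(array):
        i, j = divmod(index, grid_size)
        layout[f"Q{i + 1}{j + 1}"] = list(row)
    return layout


# Element labels A11 ... A99 of the 9-row by 9-column array.
array = [[f"A{r}{c}" for c in range(1, 10)] for r in range(1, 10)]
layout = initial_distribution(array)

print(layout["Q11"])
print(layout["Q12"])
```

Any other one-to-one assignment of rows to nodes would serve equally well for correctness; only the communication pattern changes.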
  • the assignment of a particular node to a particular row of array elements is not fundamental. Instead, it is noted that any assignment is feasible.
  • some assignments may take advantage of efficiencies offered by the applications and/or computers and thus produce faster execution than other assignments. For example, it may be that the fastest way to perform the multidimensional FFT may be to reverse the assignments of nodes Q11 and Q12 from those illustrated in Figure 4.
  • Figure 5 is an exemplary illustration 500 that depicts a first one-dimensional FFT on the two-dimensional array of Figure 4 that was distributed across the nodes Q11 - Q33 over the two-dimensional Torus network 100 of Figure 1.
  • the multidimensional FFT according to the present invention is accomplished by performing a series of one-dimensional FFTs.
  • the multi-dimensional FFT of the two-dimensional array 300 may be implemented as a series of one-dimensional FFTs. Therefore, a one-dimensional FFT is performed on each row of elements distributed at each node. For example, a one-dimensional FFT is performed for the elements distributed at node Q11, i.e., elements in the first row of array 300 that were distributed to node Q11.
  • One-dimensional FFTs are performed for elements (i.e., rows of elements) at each node Q12 - Q33.
  • the result is an array of elements transformed by the first one-dimensional FFT.
  • the result of the one-dimensional FFT on each row at each node is a row of the same length as particularly illustrated in Figure 5.
  • a one-dimensional FFT performed on the first row at node Q11 of Figure 4, which comprises elements A11, A12...A19
  • the one-dimensional FFT performed on each row at each node is independent of the one-dimensional FFT performed on any other row at another node.
  • Figure 6 is an exemplary "all-to-all" re-distribution illustration 600 that depicts how each resulting row of elements transformed via the first-dimension FFT of Figure 5 is re-distributed across the nodes Q11 - Q33 for performing the second-dimension FFT according to the present invention. More particularly, each resulting row of elements that is distributed at each node Q11...Q33 of Figure 5 is re-distributed over the Torus network 100 so that each successive node receives a successive column of elements as particularly depicted in Figure 6.
  • This efficient re-distribution is the "all-to-all" re-distribution, which enables an efficient implementation of the multidimensional FFT on the distributed-memory parallel supercomputer according to the present invention.
  • the first node Q11 receives the first column of elements, i.e., first elements from each of the nodes Q11...Q33.
  • node Q12 receives the second column of elements, i.e., second elements from each of the nodes Q11...Q33.
  • This redistribution is performed for each column in Figure 5.
  • the assignment of a particular node to a particular row of array elements is not fundamental. Instead, it is noted that any assignment is feasible. For various applications and/or computers, some assignments may take advantage of efficiencies offered by the applications and/or computers and thus produce faster execution than other assignments.
  • the fastest way to perform the multidimensional FFT may be to reverse the assignments of nodes Q11 and Q12 from those illustrated in Figure 6.
  • the description below with reference to Figures 9 and 10 particularly describes how the present invention performs the "all-to-all" re-distribution of array elements across the nodes of the supercomputer over the Torus network 100.
  • the "all-to-all" re-distribution of the elements at each node Q11...Q33 is fast since it takes advantage of the communication characteristics of the Torus network 100.
  • each of the nodes Q11...Q33 sends a single array element to every other node.
  • each element of the array is a quantity of data larger than the quantity of data carried by a single packet.
  • a plurality of packets is needed to transmit each element of the array to a destination node over the Torus network 100. This closely resembles the typical real-world re-distribution, where due to much larger array sizes, each node sends many array elements to every other node, typically requiring many packets.
  • Figure 7 is an exemplary illustration 700 that depicts a second one-dimensional FFT on the two-dimensional array of Figure 6 that was redistributed across the nodes Q11 - Q33 over the two-dimensional Torus network 100 of Figure 1 according to the present invention.
  • the multidimensional FFT according to the present invention is accomplished by performing a series of one-dimensional FFTs, where Figure 7 depicts the second one-dimensional FFT in that series according to the present invention. Therefore, a one-dimensional FFT is performed on the column of elements that were distributed to each node as illustrated in Figure 5.
  • a one-dimensional FFT is performed for the elements distributed at node Q11, i.e., elements B11, B21...B91 in Figure 6 that were distributed as a row to node Q11 from the first column of Figure 5.
  • one-dimensional FFTs are performed on rows of elements (i.e., distributed from successive columns of elements of Figure 5) at each node Q12 - Q33.
  • the result of the one-dimensional FFT on each row is a row of the same length as particularly illustrated in Figure 7.
  • a one-dimensional FFT performed on the first row at node Q11 of Figure 6, which comprises elements B11, B21...B91, results in a first row at node Q11 of Figure 7, which comprises elements C11, C21...C91.
  • the one-dimensional FFT performed on each row at each node is independent of the one-dimensional FFT performed on any other row at another node.
  • the particular distribution of data illustrated in Figure 6 enables each node to perform the one-dimensional FFT on the row of elements distributed at that node, without communication with any other node on the Torus network 100 of Figure 1. Therefore, since no communication is required between the nodes, these one-dimensional FFTs are performed fast.
  • Figure 8 is an exemplary method flowchart that illustrates the implementation of the two-dimensional FFT of an array on the distributed-memory parallel supercomputer of Figure 1 that utilizes a 2-dimensional Torus network 100 for communication between nodes Q11...Q33 of the supercomputer.
  • Figure 8 will be described on the basis of Figures 1-7 for efficiently performing the two-dimensional FFT.
  • At step 802, the multidimensional FFT of a two-dimensional array illustrated in Figure 3 in the distributed-memory parallel supercomputer of Figure 1 is started. It is noted that at step 802, the array illustrated in Figure 3 is distributed across the nodes in some arbitrary fashion that may be particular to an application.
  • elements (i.e., the data) of the array 300 are efficiently re-distributed across nodes Q11...Q33, as particularly illustrated in Figure 4.
  • each node performs a first one-dimensional FFT (out of a series of one-dimensional FFTs) on a row of elements of the array stored at that node, as illustrated in Figure 4, and the result particularly illustrated in Figure 5.
  • columns of one-dimensional FFT-transformed elements are re-distributed across the nodes Q11...Q33 of the supercomputer utilizing the Torus-based architecture of Figure 1 at step 808.
  • each node performs a second one-dimensional FFT on a successive column of the first one-dimensional FFT-transformed elements illustrated in Figure 5 that is distributed as a row of elements in Figure 6.
  • the result of the second one-dimensional FFT is illustrated in Figure 7.
  • the multi-dimensional FFT of the two-dimensional array illustrated in Figure 3 in the supercomputer of Figure 1 is ended. As particularly described above, between the two one-dimensional FFTs there is a fast re-distribution of elements across the nodes Q11...Q33.
  • the above-described multidimensional FFT on an array of elements distributed across nodes of a distributed-memory parallel supercomputer, coupled with redistribution of the elements across the nodes, is illustrative of the invention. More particularly, the present invention utilizes efficient hardware routing of the Torus-based architecture coupled with a series of one-dimensional FFTs to achieve an efficient implementation of the multidimensional FFT on the distributed-memory parallel supercomputer. As noted above, the teachings according to the present invention may be utilized for performing efficient multidimensional FFTs with other numbers of array dimensions, other array sizes, and other numbers of Torus network dimensions, e.g., a 3-dimensional Torus. Additionally, the teachings according to the present invention may be utilized for performing "all-to-all" communication between nodes of the distributed-memory parallel supercomputer on a Torus network of arbitrary dimensions.
  • Figure 9 is an exemplary method flowchart 900 that depicts the filling of one or more output queues 212 on an exemplary node Q11 of Figure 2 with packets destined for other nodes, e.g., nodes Q22 and Q33, on the distributed-memory parallel supercomputer according to the present invention.
  • node Qxy (e.g., node Q11) needs to send a plurality of total packets (i.e., k packets) to every node Qab for all possible values of a and b (e.g., Q12, Q13; Q21, Q22, Q23; and Q31, Q32, Q33 as illustrated in Figure 1; it is noted that Q11 does not send packets to itself).
  • the grid links of the Torus network 100 must be efficiently utilized. If packets are not scheduled in an efficient order, then the grid link utilization may be very inefficient.
  • the fast re-distribution takes advantage of the adaptive routing capability of the Torus-based network 100 such that packet scheduling is implemented efficiently, as particularly illustrated below.
  • the exemplary method starts.
  • an array i.e., random_map[ ] array
  • k packets e.g., 6 packets.
  • d the number of iterations necessary and transmit b packets per iteration for a total number of k packets.
  • b may be chosen as necessary for efficiency and may likewise be equal to 1. For example, to transmit a total of 6 packets, it can be chosen to transmit 2 packets per iteration on each of 3 iterations for the total of 6 packets. Therefore, at step 906, a loop is initiated for id from 1 to d iterations.
  • a queue counter is initialized to zero. It is assumed that there are L output queues 212 (L being greater than or equal to 1) for storing packets (or short descriptors of the packets such that the actual packet need not be copied), and all packets (or descriptors of the packets) for a given destination will be placed into the same output queue.
  • a particular output queue iL is selected in round-robin order at step 912 within nested loops of Figure 9.
  • a loop is initialized for iN value from node 0 to node Nx*Ny-2, as an index into the array (i.e., random_array[]) created at step 904.
  • a random node value is obtained from the random_array.
  • a first queue is selected in round-robin order.
  • a loop is initialized for ib from 1 to b packets per d iterations.
  • a plurality of b packets destined for a given random node iN are added to the same output queue iL as packet[node, id, ib].
  • a particular node "i" (e.g., processor 202 on node Q11 of Figure 2) will first place b number of packets that include data for an element of the array destined for a node Modulus(i+1, Nx*Ny-1) in a first output queue, then particular node "i" will place b packets that include data for an element of the array destined for a node Modulus(i+2, Nx*Ny-1) into a next output queue, and so on until reaching node Modulus(i+(Nx*Ny-1), Nx*Ny-1).
  • Figure 10 is an exemplary method flowchart 1000 that depicts how the packets in the one or more output queues 212 on the exemplary node Q11 of Figure 2 are drained into the injection FIFOs 210 for subsequent insertion on the Torus network 100 according to the present invention.
  • the exemplary method starts.
  • a loop is initiated for iL from 1 to L, to iterate over all L output queues.
  • step 1010 for a packet at the head of the output queue iL, possible directions for routing the packet over the Torus network 100 are obtained.
  • node Q11 placed a packet destined to node Q22 into an output queue iL.
  • the packet may travel from node Q11 in the X+ direction (over grid link 108) followed by Y- direction (over grid link 110) to reach node Q22, or it may travel in the Y- direction (over grid link 106) followed by the X+ direction (over grid link 112) to reach node Q22.
  • each injection FIFO 210 has a logical direction (e.g., X+) associated with it, which represents that any packet placed in the injection FIFO 210 can move in the associated logical direction (e.g., X+ direction). If all of the injection FIFOs 210 for the possible packet directions are full, then the method skips the current output queue and continues by iterating to the next output queue at step 1006. Otherwise, at step 1014, the packet is moved from the output queue to the least full injection FIFO 210 in one of the possible directions for that packet.
  • a logical direction e.g., X+
  • packets are removed from the output queues in a round-robin order for insertion into the injection FIFOs 210 illustrated in Figure 2.
  • the method continues at step 1008 for a next available packet in that output queue. Once all output queues are empty, the method ends at step 1016.
  • the order of the array elements and their destination nodes from node Q11 is as follows: {B12 to Q12; B13 to Q13; B14 to Q21; B15 to Q22; B16 to Q23; B17 to Q31; B18 to Q32; and B19 to Q33}.
  • the array elements are placed into the FIFOs 210 of node Q11 as follows: {B18 to Q32 via X+ or Y-; B15 to Q22 via X+ or Y+; B13 to Q13 via X-; B14 to Q21 via Y+; B16 to Q23 via Y+ or X-; B19 to Q33 via X- or Y-; B12 to Q12 via X+; and B17 to Q31 via Y-}.
  • the FIFOs 210 on node Q11 might be filled as illustrated in Table 1 below.
  • the use of an injection FIFO that is restricted to at least a particular grid link also is well-suited when the number of injection FIFOs is not equal to the number of grid links. For example, if there are fewer injection FIFOs than grid links, then the use of a buffer may be restricted to at least one of several particular grid links. For another example, if there are more injection FIFOs than grid links, then there may be several injection FIFOs whose use is restricted to at least the same particular grid link.
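
The queue filling of Figure 9 and the draining of Figure 10 may be illustrated by the following minimal Python sketch. It is an illustration under stated assumptions and not the implementation of the invention: the function names, the FIFO capacity, the number of output queues, and the model in which each injection FIFO releases one packet to the network per pass are all hypothetical. The direction convention (column + 1 is X+, row + 1 is Y+) is inferred from the Q11 example above.

```python
import random
from collections import deque

NX, NY = 3, 3                        # 3x3 torus; node Qrc is the tuple (r, c)
DIRECTIONS = ("X+", "X-", "Y+", "Y-")


def possible_directions(src, dst):
    """Shortest-path directions from src to dst on the torus.

    Simplification: on a ring of even length a tie is resolved towards '+'.
    """
    dirs = []
    dx = (dst[1] - src[1]) % NX
    dy = (dst[0] - src[0]) % NY
    if dx:
        dirs.append("X+" if dx <= NX - dx else "X-")
    if dy:
        dirs.append("Y+" if dy <= NY - dy else "Y-")
    return dirs


def fill_output_queues(src, k, b, num_queues, rng):
    """Fill output queues with k packets per destination, b per iteration.

    Assumes k is a multiple of b. Destinations are visited in a shuffled
    order, and each destination always maps to the same output queue.
    """
    dests = [(r, c) for r in range(1, NY + 1) for c in range(1, NX + 1)
             if (r, c) != src]
    rng.shuffle(dests)               # stands in for the random node array
    queues = [deque() for _ in range(num_queues)]
    for it in range(k // b):         # d = k / b iterations
        q = 0                        # queue counter reset on every iteration
        for dst in dests:
            for ib in range(b):
                queues[q].append((dst, it, ib))
            q = (q + 1) % num_queues  # next destination, next queue
    return queues


def drain(queues, src, fifo_capacity=2):
    """Move packets from output queues into per-direction injection FIFOs."""
    fifos = {d: deque() for d in DIRECTIONS}
    injected = []                    # (direction, packet) in injection order
    while any(queues) or any(fifos.values()):
        for q in queues:             # round-robin over the output queues
            while q:
                dirs = possible_directions(src, q[0][0])
                free = [d for d in dirs if len(fifos[d]) < fifo_capacity]
                if not free:
                    break            # candidate FIFOs full: skip this queue
                least_full = min(free, key=lambda d: len(fifos[d]))
                fifos[least_full].append(q.popleft())
        for d in DIRECTIONS:         # assumed network model, see lead-in
            if fifos[d]:
                injected.append((d, fifos[d].popleft()))
    return injected


queues = fill_output_queues((1, 1), k=6, b=2, num_queues=4, rng=random.Random(0))
injected = drain(queues, (1, 1))
```

In this sketch every packet leaves node Q11 through an injection FIFO whose direction lies on a shortest path to the packet's destination, and a full FIFO only delays the affected output queue rather than the whole node.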

Abstract

The present invention is directed to a method, system and program storage device for efficiently implementing a multidimensional Fast Fourier Transform (FFT) of a multidimensional array comprising a plurality of elements initially distributed in a multi-node computer system (100) comprising a plurality of nodes (Q11-Q33) in communication over a network, comprising distributing the plurality of elements of the array in a first dimension across the plurality of nodes of the computer system over the network to facilitate a first one-dimensional FFT; performing the first one-dimensional FFT on the elements of the array distributed at each node in the first dimension; re-distributing the one-dimensional FFT-transformed elements at each node in a second dimension via 'all-to-all' distribution in random order across other nodes of the computer system over the network; and performing a second one-dimensional FFT on elements of the array re-distributed at each node in the second dimension, wherein the random order facilitates efficient utilization of the network thereby efficiently implementing the multidimensional FFT. The 'all-to-all' re-distribution of the array elements is further efficiently implemented in applications other than the multidimensional FFT on the distributed-memory parallel supercomputer.

Description

EFFICIENT IMPLEMENTATION OF MULTIDIMENSIONAL
FAST FOURIER TRANSFORM ON A DISTRIBUTED-MEMORY
PARALLEL MULTI-NODE COMPUTER
CROSS REFERENCE
The present invention claims the benefit of commonly-owned, co-pending United States Provisional Patent Application Serial Number 60/271,124 filed February 24, 2001 entitled MASSIVELY PARALLEL SUPERCOMPUTER, the whole contents and disclosure of which is expressly incorporated by reference herein as if fully set forth herein. This patent application is additionally related to the following commonly-owned, co-pending United States Patent Applications filed on even date herewith, the entire contents and disclosure of each of which is expressly incorporated by reference herein as if fully set forth herein: U.S. patent application Serial No. (YOR920020027US1, YOR920020044US1 (15270)), for "Class Networking Routing"; U.S. patent application Serial No. (YOR920020028US1 (15271)), for "A Global Tree Network for Computing Structures"; U.S. patent application Serial No. (YOR920020029US1 (15272)), for "Global Interrupt and Barrier Networks"; U.S. patent application Serial No. (YOR920020030US1 (15273)), for "Optimized Scalable Network Switch"; U.S. patent application Serial No. (YOR920020031US1, YOR920020032US1 (15258)), for "Arithmetic Functions in Torus and Tree Networks"; U.S. patent application Serial No. (YOR920020033US1, YOR920020034US1 (15259)), for "Data Capture Technique for High Speed Signaling"; U.S. patent application Serial No. (YOR920020035US1 (15260)), for "Managing Coherence Via Put/Get Windows"; U.S. patent application Serial No. (YOR920020036US1, YOR920020037US1 (15261)), for "Low Latency Memory Access And Synchronization"; U.S. patent application Serial No. (YOR920020038US1 (15276)), for "Twin-Tailed Fail-Over for Fileservers Maintaining Full Performance in the Presence of Failure"; U.S. patent application Serial No. (YOR920020039US1 (15277)), for "Fault Isolation Through No-Overhead Link Level Checksums"; U.S. patent application Serial No. (YOR920020040US1 (15278)), for "Ethernet Addressing Via Physical Location for Massively Parallel Systems"; U.S. patent application Serial No. (YOR920020041US1 (15274)), for "Fault Tolerance in a Supercomputer Through Dynamic Repartitioning"; U.S. patent application Serial No. (YOR920020042US1 (15279)), for "Checkpointing Filesystem"; U.S. patent application Serial No. (YOR920020043US1 (15262)), for "Efficient Implementation of Multidimensional Fast Fourier Transform on a Distributed-Memory Parallel Multi-Node Computer"; U.S. patent application Serial No. (YOR9-20010211US2 (15275)), for "A Novel Massively Parallel Supercomputer"; and U.S. patent application Serial No. (YOR920020045US1 (15263)), for "Smart Fan Modules and System".
BACKGROUND OF THE INVENTION
Technical Field of the Invention
The present invention generally relates to the field of distributed-memory message-passing parallel multi-node computers and associated system software, as applied for example to computations in the fields of science, mathematics, engineering and the like. More particularly, the present invention is directed to a system and method for efficient implementation of a multidimensional Fast Fourier Transform (i.e., "FFT") on a distributed-memory parallel supercomputer.
Description of the Prior Art
Linear transforms, such as the Fourier Transform (i.e., "FT"), have widely been used for solving a range of problems in the fields of science, mathematics, engineering and the like. The FT alters a given problem into one that may be more easily solved, and the FT is used in many different applications. For example, for a system of N variables, the FT essentially represents a change of the N variables from coordinate space to momentum space, where the new value of each variable depends on the values of all the old variables. Such a system of N variables is usually stored on a computer as an array of N elements. The FT is commonly computed using the Fast Fourier Transform (i.e., "FFT"). The FFT is described in many standard texts, such as Numerical Recipes by Press, et al. ("Numerical Recipes in Fortran", pages 490-529, by W. H. Press, S. A. Teukolsky, W. A. Vetterling and Brian P. Flannery, Cambridge University Press, 1986, 1992, ISBN: 0-521-43064-X). Most computer manufacturers provide library function calls to optimize the FFT for their specific processor. For example, the FFT is fully optimized on IBM's RS/6000 processor in the Engineering and Scientific Subroutine Library. These library routines require that the data (i.e., the foregoing elements) necessary to perform the FFT be resident in a memory local to a node.
In a multidimensional FFT, N elements of a multidimensional array are distributed in a plurality of dimensions across nodes of a distributed-memory parallel multi-node computer. Many applications that execute on distributed-memory parallel multi-node computers spend a large fraction of their execution time on calculating the multidimensional FFT. Since a motivation for the distributed-memory parallel multi-node computers is faster execution, fast calculation of the multidimensional FFT for the distributed array is of critical importance. The N elements of the array are initially distributed across the nodes in some arbitrary fashion particular to an application. To calculate the multidimensional FFT, the array of elements is then redistributed such that a portion of the array on each node consists of a complete row of elements in the x-dimension. A one-dimensional FFT on each row in the x-dimension on each node is then performed. Since the row is local to a node and since each one-dimensional FFT on each row is independent of the others, the one-dimensional FFT performed on each node requires no communication with any other node and may be performed using the above-mentioned library routines. After the one-dimensional FFT, the array elements are re-distributed such that a portion of the array on each node consists of a complete row in the y-dimension. Thereafter, a one-dimensional FFT on each row in the y-dimension on each node is performed. If there are more than two dimensions for the array, then the re-distribution and a one-dimensional FFT are repeated for each successive dimension of the array beyond the x-dimension and the y-dimension. The resulting array may be re-distributed into some arbitrary fashion particular to the application.
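
The row-by-row procedure described above may be checked on a single machine with the following minimal Python sketch. It is illustrative only: a naive one-dimensional discrete Fourier transform stands in for an optimized library FFT, an in-memory transpose stands in for the re-distribution between nodes, and all function names are hypothetical.

```python
import cmath


def dft(row):
    """Naive O(n^2) one-dimensional DFT; a stand-in for a library FFT call."""
    n = len(row)
    return [sum(row[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]


def transpose(a):
    """Swap rows and columns; stands in for the re-distribution step."""
    return [list(col) for col in zip(*a)]


def dft2_by_rows(a):
    """Two-dimensional DFT as two rounds of independent one-dimensional DFTs."""
    step1 = [dft(r) for r in a]                  # each row in the x-dimension
    step2 = [dft(r) for r in transpose(step1)]   # each row in the y-dimension
    return transpose(step2)                      # back to the original layout


def dft2_direct(a):
    """Direct evaluation of the two-dimensional DFT, for comparison only."""
    ny, nx = len(a), len(a[0])
    return [[sum(a[y][x] * cmath.exp(-2j * cmath.pi * (u * x / nx + v * y / ny))
                 for y in range(ny) for x in range(nx))
             for u in range(nx)]
            for v in range(ny)]
```

Each call to `dft` in either round touches only one row, which mirrors the observation that the one-dimensional transforms require no communication once a complete row is local to a node.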
The treatment of the x-dimension and the y-dimension in sequence is not fundamental to the multidimensional FFT. Instead, the dimensions of the array may be treated in any order. For some applications or some computers, some orders may take advantage of some efficiency and thus have a faster execution than other orders. For example, the initial distribution of the array across the nodes, which is in some arbitrary fashion particular to the application, may coincide with the distribution necessary for the one-dimensional FFTs in the y-dimension. In this case, it may be fastest for the multidimensional FFT to treat the y-dimension first, before treating the x-dimension and any other remaining dimensions.
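
As an illustrative note only, the freedom to treat the dimensions in any order follows from the separability of the two-dimensional discrete transform. The symbols below do not appear in the original text and are assumptions: a_{x,y} denotes the array elements, N_x and N_y the array extents, u and v the transformed indices, and the sign of the exponent is one common convention.

```latex
\hat{a}_{u,v}
  = \sum_{y=0}^{N_y-1} e^{-2\pi i\, v y / N_y}
    \left( \sum_{x=0}^{N_x-1} a_{x,y}\, e^{-2\pi i\, u x / N_x} \right)
  = \sum_{x=0}^{N_x-1} e^{-2\pi i\, u x / N_x}
    \left( \sum_{y=0}^{N_y-1} a_{x,y}\, e^{-2\pi i\, v y / N_y} \right)
```

In the first form the inner sum is a set of one-dimensional transforms in the x-dimension followed by transforms in the y-dimension; in the second form the roles are exchanged. Because both sums are finite, exchanging them does not change the result.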
In the implementation of the multidimensional FFT described above, each re-distribution of the array between the one-dimensional FFTs is an example of an "all-to-all" communication or re-distribution. In the all-to-all re-distribution, each node of the distributed-memory parallel multi-node computer sends unique data (i.e., elements of the array) to all other nodes utilizing a plurality of packets. As mentioned above, fast calculation of the multidimensional FFT on the distributed-memory parallel multi-node computer is of critical importance. In the implementation described above, typically a large fraction of the execution time is spent to re-distribute the array across the nodes of the distributed-memory parallel multi-node computer. More particularly, a large fraction of execution time is spent on the "all-to-all" re-distribution of elements of the array across the nodes of the distributed-memory parallel multi-node computer.
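
The data movement of an "all-to-all" re-distribution may be illustrated with the following minimal Python sketch, which simulates nine nodes in one process. It is a sketch under assumptions: the function name is hypothetical, each node is taken to hold one row of nine elements, and one message is counted per element sent to another node, whereas in practice an element may require a plurality of packets.

```python
def all_to_all(rows):
    """Simulate the exchange in which node i sends its j-th element to node j.

    rows[i] is the row held by node i before the exchange. The returned list
    holds, for each node j, the elements it has afterwards, ordered by sender.
    """
    n = len(rows)
    received = [[None] * n for _ in range(n)]
    messages = 0
    for src in range(n):
        for dst in range(n):
            received[dst][src] = rows[src][dst]
            if src != dst:
                messages += 1        # a node keeps its own element locally
    return received, messages


# Labels only, following the B<row><column> naming used for the 9x9 example.
rows = [[f"B{r}{c}" for c in range(1, 10)] for r in range(1, 10)]
columns, messages = all_to_all(rows)
```

After the exchange the first simulated node holds the first column and the second node holds the second column, and every node has sent one element to each of the eight other nodes.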
Therefore there is a need in the art for providing a system and method for efficiently implementing the multidimensional FFT on the distributed-memory parallel supercomputer. In particular, there is a need in the art for providing a system and method for efficiently implementing the "all-to-all" redistribution on the distributed-memory parallel supercomputer for efficiently implementing the multidimensional FFT.
SUMMARY OF THE INVENTION
It is therefore an object of the present invention to provide a system and method for efficiently implementing the multidimensional FFT on an array distributed on a distributed-memory parallel supercomputer. It is another object of the present invention to provide a system and method for efficiently implementing the multidimensional FFT on the array by efficiently implementing the "all-to-all" re-distribution on the distributed-memory parallel supercomputer.
It is yet another object of the present invention to provide a system and method for efficiently implementing the "all-to-all" re-distribution in applications other than the multidimensional FFT on the distributed-memory parallel supercomputer.
According to an embodiment of the present invention, there is provided a method for efficiently implementing a multidimensional Fast Fourier Transform (FFT) of a multidimensional array comprising a plurality of elements initially distributed in a multi-node computer system comprising a plurality of nodes in communication over a network, the method comprising: distributing the plurality of elements of the array in a first dimension across the plurality of nodes of the computer system over the network to facilitate a first one-dimensional FFT; performing the first one-dimensional FFT on the elements of the array distributed at each node in the first dimension; re-distributing the one-dimensional FFT-transformed elements at each node in a second dimension via "all-to-all" distribution in random order across other nodes of the computer system over the network; and performing a second one-dimensional FFT on elements of the array re-distributed at each node in the second dimension, wherein the random order facilitates efficient utilization of the network thereby efficiently implementing the multidimensional FFT.
According to another embodiment of the present invention, there is provided a system for efficiently implementing a multidimensional Fast Fourier Transform (FFT) of a multidimensional array comprising a plurality of elements initially distributed in a multi-node computer system comprising a plurality of nodes in communication over a network, the system comprising: means for distributing the plurality of elements of the array in a first dimension across the plurality of nodes of the computer system over the network to facilitate a first one-dimensional FFT; means for performing the first one-dimensional FFT on the elements of the array distributed at each node in the first dimension; means for re-distributing the one-dimensional FFT-transformed elements at each node in a second dimension via "all-to-all" distribution in random order across other nodes of the computer system over the network; and means for performing a second one-dimensional FFT on elements of the array re-distributed at each node in the second dimension, wherein the random order facilitates efficient utilization of the network thereby efficiently implementing the multidimensional FFT.
According to yet another embodiment of the present invention, there is provided a program storage device, tangibly embodying a program of instructions executable by a machine to perform a method for efficiently implementing a multidimensional Fast Fourier Transform (FFT) of a multidimensional array comprising a plurality of elements initially distributed in a multi-node computer system comprising a plurality of nodes in communication over a network, the method comprising: distributing the plurality of elements of the array in a first dimension across the plurality of nodes of the computer system over the network to facilitate a first one-dimensional FFT; performing the first one-dimensional FFT on the elements of the array distributed at each node in the first dimension; re-distributing the one-dimensional FFT-transformed elements at each node in a second dimension via "all-to-all" distribution in random order across other nodes of the computer system over the network; and performing a second one-dimensional FFT on elements of the array re-distributed at each node in the second dimension, wherein the random order facilitates efficient utilization of the network thereby efficiently implementing the multidimensional FFT.
According to a further embodiment of the present invention, there is provided a method for efficiently re-distributing a multidimensional array comprising a plurality of elements initially distributed in a multi-node computer system comprising a plurality of nodes in communication over a network, the method comprising re-distributing the elements at each node via "all-to-all" distribution in random order across other nodes of the computer system over the network, wherein the random order facilitates efficient utilization of the network. According to yet a further embodiment of the present invention, there is provided a system for efficiently re-distributing a multidimensional array comprising a plurality of elements initially distributed in a multi-node computer system comprising a plurality of nodes in communication over a network, the system comprising a means for re-distributing the elements at each node via "all-to-all" distribution in random order across other nodes of the computer system over the network, wherein the random order facilitates efficient utilization of the network.
According to still a further embodiment of the present invention, there is provided a program storage device, tangibly embodying a program of instructions executable by a machine to perform a method for efficiently re-distributing a multidimensional array comprising a plurality of elements initially distributed in a multi-node computer system comprising a plurality of nodes in communication over a network, the method comprising re-distributing the elements at each node via "all-to-all" distribution in random order across other nodes of the computer system over the network, wherein the random order facilitates efficient utilization of the network.
BRIEF DESCRIPTION OF THE DRAWINGS
The objects, features and advantages of the present invention will become apparent to one skilled in the art, in view of the following detailed description taken in combination with the attached drawings, in which:
Figure 1 illustrates an exemplary distributed-memory parallel supercomputer that includes 9 nodes interconnected via a multidimensional grid utilizing a 2-dimensional 3x3 Torus network according to the present invention;
Figure 2 illustrates a more detailed representation of an exemplary node from the distributed-memory parallel supercomputer of Figure 1 according to the present invention;
Figure 3 illustrates an exemplary two-dimensional 9-row by 9-column array, which may efficiently be implemented for the multidimensional FFT according to the present invention;
Figure 4 illustrates an exemplary distribution of the two-dimensional array of Figure 3 across the nodes of the supercomputer in Figure 1 according to the present invention;
Figure 5 illustrates an exemplary first one-dimensional FFT of the two-dimensional array distributed across the nodes of the supercomputer of Figure 1 according to the present invention;
Figure 6 illustrates an exemplary re-distribution of a resultant two-dimensional array after the first one-dimensional FFT of Figure 5 according to the present invention;
Figure 7 illustrates an exemplary second one-dimensional FFT of the re-distributed array of Figure 6 according to the present invention;
Figure 8 illustrates an exemplary method flowchart depicting the implementation of the two-dimensional FFT illustrated in Figures 4-7 according to the present invention;
Figure 9 illustrates an exemplary method flowchart that depicts the filling of output queues on the exemplary node with packets destined for other nodes on the distributed-memory parallel supercomputer according to the present invention; and
Figure 10 illustrates an exemplary method flowchart that depicts how the packets in the output queues on the exemplary node are drained into injection FIFOs for subsequent insertion on the Torus network 100 according to the present invention.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT OF THE INVENTION
The present invention is directed to a system and method for efficiently implementing the multidimensional Fast Fourier Transform (i.e., "FFT") on the distributed-memory parallel supercomputer. More particularly, the present invention implements an efficient "all-to-all" re-distribution of elements distributed at nodes of the distributed-memory parallel supercomputer to achieve an efficient implementation of the multidimensional FFT.
According to the present invention, the FFT is implemented on the distributed-memory parallel supercomputer as a series of one-dimensional transforms, which require one or more "all-to-all" re-distributions of a multidimensional array across the nodes of the distributed-memory parallel supercomputer. The distributed-memory parallel supercomputer utilizes a Torus-based network for the interconnection of and communication between nodes of the supercomputer. As will be described below, each node implements a hardware router for efficiently routing packets that include elements of the array across the nodes of the supercomputer interconnected via the Torus-based network. Therefore, the present invention couples the implementation of the multidimensional FFT as a series of one-dimensional transforms of the multi-dimensional array with the foregoing hardware routing to obtain the efficient FFT implementation according to the present invention.
Further according to the present invention, the distributed-memory parallel supercomputer comprises a plurality of nodes, each of which includes at least one processor that operates on a local memory. The nodes are interconnected as a multidimensional grid and they communicate via grid links. Without losing generality and in order to make the description of this invention easily understandable to one skilled in the art, the multidimensional node grid of the supercomputer will be described as an exemplary 2-dimensional grid. Notwithstanding the fact that only the 2-dimensional node grid is described in the following description, it is contemplated within the scope of the present invention that node grids of other dimensions may easily be provided based on the teachings of the present invention. It is noted that the distributed-memory parallel supercomputer may utilize a 3-dimensional or greater Torus-based architecture. Additionally, without losing generality and in order to make the description of this invention easily understandable to one skilled in the art, the multidimensional array used by the multidimensional FFT will be described as an exemplary 2-dimensional array. Notwithstanding the fact that only the 2-dimensional array is described in the following description, it is contemplated within the scope of the present invention that arrays of additional dimensions may easily be provided based on the teachings of the present invention. It is further noted that there is no correspondence between the number of dimensions in the Torus-based architecture and the number of dimensions in the array. The array must be of sufficient size such that it can be distributed across the nodes or a subset of the nodes of the supercomputer for implementing the multidimensional FFT according to the present invention.
Figure 1 is an exemplary illustration of a distributed-memory parallel supercomputer that includes 9 nodes interconnected via a multidimensional grid utilizing a 2-dimensional 3x3 Torus network 100, according to the present invention. It is noted that the number of nodes is in exemplary fashion limited to 9 nodes for brevity and clarity, and that the number of nodes may significantly vary depending on the particular architectural requirements for the distributed-memory parallel supercomputer. Figure 1 depicts 9 nodes labeled as Q11 - Q33, pairs of which are interconnected by grid links. In total, the 9-node Torus network 100 is interconnected by 18 grid links, where each node is directly interconnected to four other nodes in the Torus network 100 via a respective grid link. It is noted that unlike a mesh, the exemplary 2-dimensional Torus network 100 includes no edge nodes. For example, node Q11 is interconnected to node Q31 via grid link 102; to node Q13 via grid link 104; to node Q21 via grid link 106; and finally to node Q12 via grid link 108. As another example, node Q22 is interconnected to node Q12 via grid link 110; to node Q21 via grid link 112; to node Q32 via grid link 114; and finally to node Q23 via grid link 116. Other nodes are interconnected in a similar fashion.

Further with reference to Figure 1, data (i.e., elements of the array) communicated between nodes is transported on the network in one or more packets. For any given communication between a pair of nodes, a plurality of packets is required if the amount of data to be communicated exceeds the packet size supported by the Torus network 100. A packet comprises a packet header and the data carried by the packet. The packet header includes information required by the Torus network 100 to transport the packet from a source node to a destination node.
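By way of illustration only, the wraparound connectivity of the exemplary 3x3 Torus network may be modeled in software. The following Python sketch is not part of the disclosed hardware; the function names `torus_neighbors` and `count_links`, and the treatment of a label such as Q31 as the 1-based pair (3, 1), are assumptions made for this example.

```python
def torus_neighbors(r, c, n_rows=3, n_cols=3):
    """Return the four neighbors of node Q<r><c> on a 2-D torus (1-based labels)."""
    def wrap(v, n):
        # Map a 1-based coordinate back into 1..n with wraparound.
        return (v - 1) % n + 1

    return [
        (wrap(r - 1, n_rows), c),
        (wrap(r + 1, n_rows), c),
        (r, wrap(c - 1, n_cols)),
        (r, wrap(c + 1, n_cols)),
    ]


def count_links(n_rows=3, n_cols=3):
    """Count the distinct undirected grid links of the torus."""
    links = set()
    for r in range(1, n_rows + 1):
        for c in range(1, n_cols + 1):
            for neighbor in torus_neighbors(r, c, n_rows, n_cols):
                # A frozenset makes the link (a, b) identical to (b, a).
                links.add(frozenset([(r, c), neighbor]))
    return len(links)
```

Under these assumptions, `torus_neighbors(1, 1)` yields the labels of nodes Q31, Q21, Q13 and Q12, and `count_links()` yields the 18 grid links stated for the 9-node network.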
In the distributed-memory parallel supercomputer of the present patent application, each node on the network is identified by a logical address, and the packet header includes a destination address so that the packet is automatically routed to the node on the network identified by that destination address.
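As a minimal sketch of the packetization just described, the following Python fragment splits a quantity of data into packets that each carry a destination address in a header. The `Packet` class, its `seq` field, the `packetize` function and the `max_payload` parameter are hypothetical names introduced for illustration; the actual packet format of the Torus network 100 is not specified here.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    dest: str        # logical destination address carried in the header
    seq: int         # position of the packet within the message (hypothetical field)
    payload: bytes   # data carried by the packet


def packetize(data: bytes, dest: str, max_payload: int):
    """Split data into as many packets as the payload size requires."""
    if max_payload <= 0:
        raise ValueError("max_payload must be positive")
    return [
        Packet(dest=dest, seq=i // max_payload, payload=data[i:i + max_payload])
        for i in range(0, len(data), max_payload)
    ]
```

For instance, 10 bytes destined for node Q22 with an assumed payload size of 4 bytes would be carried in three packets of 4, 4 and 2 bytes.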
Figure 2 is a more detailed representation 200 of an exemplary node, e.g., node Q11, from the distributed-memory parallel supercomputer of Figure 1 according to the present invention. The node Q11 comprises at least one processor 202 that operates on local memory 204. The node further comprises a router 206 that routes, i.e., sends and receives, packets on the grid links 102, 104, 106 and 108, which connect the node Q11 to its neighboring nodes Q31, Q13, Q21 and Q12, respectively, as particularly illustrated in Figure 1. Yet further, the node comprises a reception buffer 208 for buffering packets received by the router 206, which are destined for the local processor 202. The local processor 202 may periodically poll the reception buffer 208 in order to determine if there are packets in the reception buffer and then retrieve the packets that are buffered in the reception buffer 208. Depending on a particular application and the packets, the local processor 202 may write the contents of the packets into memory 204.
Further with reference to Figure 2, the node Q11 comprises four injection First-In-First-Out (i.e., "FIFO") buffers 210, which are particularly labeled X+, X-, Y+ and Y-. The processor places outbound packets into one or more output queues 212 of the local memory 204, which store packets destined for other nodes until they can be placed into the injection FIFOs 210. While the injection FIFOs are not full, the processor places outbound packets into the injection FIFOs 210. Upon a particular packet reaching the head of an injection FIFO 210, the packet is removed from the injection FIFO 210 by the router 206 and the router 206 inserts the packet onto one of the grid links 102, 104, 106 and 108 toward a destination node for the particular packet. The four injection FIFOs 210 are treated equivalently by the router 206 and by the hardware of the local processor 202.
Yet further with reference to Figure 2, the router 206 comprises several simultaneous routing characteristics. The routing first represents virtual cut-through routing. For example, if an incoming packet on one of the grid links is not destined for the local processor 202 of node Q11, then the router 206 forwards the packet onto one of the outgoing grid links 102, 104, 106 and 108. The router 206 performs the forwarding without involving the local processor 202. The routing further represents shortest-path routing. For example, a packet sent by node Q11 to node Q13 (See Figures 1 and 8) that travels over the grid link 104 represents a shortest path route. Any other path would be longer. As another example, a packet sent by node Q11 to node Q22 may travel over grid links 106 and 112 or alternatively over grid links 108 and 110. This type of routing represents an adaptive type of routing. Thus, there may be a choice of grid links by which a packet may leave a node in transit for another node over the Torus-based network 100. In the previous example, the packet may leave the node Q11 via the grid link 106 or 108. Adaptive routing allows the router 206 to choose the less busy outgoing grid link for a packet or to choose the outgoing grid link based on some other criteria. It is noted that the adaptive routing is not just performed at the source node of a packet, e.g., node Q11, but is performed at each intermediate node that a packet cuts through on the way to the packet's destination node over the Torus-based network 100 of Figure 1. The description below with reference to Figures 9 and 10 particularly describes how the present invention performs the foregoing routing of packets across the nodes of the supercomputer over the Torus network 100.
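The set of outgoing directions that lie on a shortest path may be sketched in software as follows. This Python fragment is an illustrative model only and does not describe the router hardware. The function names are hypothetical, and the direction labels rest on an assumption: the second digit of a node label is taken to advance along X and the first digit along Y, so that node Q12 lies in the X+ direction from node Q11 and node Q21 lies in the Y+ direction.

```python
def axis_directions(src, dst, n, plus, minus):
    """Shortest-path direction labels along one torus axis of size n."""
    hops_plus = (dst - src) % n      # hops needed when travelling in the + direction
    if hops_plus == 0:
        return []                    # already aligned on this axis
    hops_minus = n - hops_plus       # hops needed when travelling in the - direction
    if hops_plus < hops_minus:
        return [plus]
    if hops_minus < hops_plus:
        return [minus]
    return [plus, minus]             # tie: both ways are equally short


def possible_directions(src, dst, n_rows=3, n_cols=3):
    """Directions in which a packet may leave src on a shortest path to dst.

    Nodes are (first_digit, second_digit) pairs, e.g. Q13 -> (1, 3).
    """
    return (axis_directions(src[1], dst[1], n_cols, "X+", "X-")
            + axis_directions(src[0], dst[0], n_rows, "Y+", "Y-"))
```

Under these assumptions, a packet from node Q11 to node Q13 has the single direction X- (one hop over the wraparound link), whereas a packet from node Q11 to node Q22 has two candidate directions, which is where an adaptive choice between outgoing grid links arises.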
Figure 3 is an exemplary two-dimensional 9-row by 9-column array 300 that includes 81 elements, which may efficiently be implemented for the multidimensional FFT according to the present invention. It is noted that the exemplary two-dimensional array 300 is easily extended to other two-dimensional arrays including a different number of rows and columns (e.g., a 10-row by 11-column two-dimensional array), which may be utilized for implementing the FFT on the distributed-memory parallel supercomputer according to the present invention. In the array 300, the first row of the array comprises elements A11, A12...A19, while the first column of the array comprises elements A11, A21...A91.
Figure 4 is an exemplary distribution illustration 400 of how the two-dimensional array 300 of Figure 3 is distributed across the nodes Q11 - Q33 in Figure 1 according to the present invention. It is noted that the array may initially be distributed across the nodes in some arbitrary fashion that is particular to an application. According to the present invention, the array is re-distributed such that a portion of the array on each node Q11...Q33 comprises the distribution illustrated in Figure 4. This re-distribution is similar to that described below with reference to Figures 5 and 6. As particularly depicted in the distribution illustration 400, each node of Figure 1 includes a portion of the two-dimensional array 300 of Figure 3. For example, node Q11 comprises the first row of the array 300, i.e., elements A11, A12...A19. As another example, node Q12 comprises the second row of the array 300, i.e., elements A21, A22...A29. It is noted that other nodes Q13 - Q33 of Figure 1 comprise respective rows 3 through 9 of array 300, as particularly depicted in distribution illustration 400 of Figure 4. In the exemplary distribution of Figure 4, the assignment of a particular node to a particular row of array elements is not fundamental. Instead, it is noted that any assignment is feasible. For various applications and/or computers, some assignments may take advantage of efficiencies offered by the applications and/or computers and thus produce faster execution than other assignments. For example, it may be that the fastest way to perform the multidimensional FFT is to reverse the assignments of nodes Q11 and Q12 from those illustrated in Figure 4.
Figure 5 is an exemplary illustration 500 that depicts a first one-dimensional FFT on the two-dimensional array of Figure 4 that was distributed across the nodes Q11 - Q33 over the two-dimensional Torus network 100 of Figure 1. As particularly noted above, the multidimensional FFT according to the present invention is accomplished by performing a series of one-dimensional FFTs. Thus according to the present invention, the multidimensional FFT of the two-dimensional array 300 may be implemented as a series of one-dimensional FFTs. Therefore, a one-dimensional FFT is performed on each row of elements distributed at each node. For example, a one-dimensional FFT is performed for the elements distributed at node Q11, i.e., elements in the first row of array 300 that were distributed to node Q11. One-dimensional FFTs are likewise performed for elements (i.e., rows of elements) at each node Q12 - Q33. The result is an array of elements transformed by the first one-dimensional FFT. More particularly, the result of the one-dimensional FFT on each row at each node is a row of the same length as particularly illustrated in Figure 5. For example, a one-dimensional FFT performed on the first row at node Q11 of Figure 4, which comprises elements A11, A12...A19, results in a first row at node Q11 of Figure 5, which comprises elements B11, B12...B19. Furthermore, the one-dimensional FFT performed on each row at each node is independent of the one-dimensional FFT performed on any other row at another node. The particular distribution of data illustrated in Figure 4 enables each node to perform the one-dimensional FFT on the row of elements distributed at that node, without communication with any other node on the Torus network 100 of Figure 1. Therefore, since no communication is required between the nodes, these one-dimensional FFTs are performed fast.
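The independence of the per-node transforms may be illustrated with a short Python sketch. For brevity the sketch substitutes a naive O(n^2) discrete Fourier transform for an actual FFT; the two compute the same result. The function names and the dictionary keyed by node label are assumptions of this example and are not taken from the disclosure.

```python
import cmath


def dft(row):
    """Naive O(n^2) discrete Fourier transform of one row (a stand-in for an FFT)."""
    n = len(row)
    return [
        sum(row[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def transform_local_rows(rows_by_node):
    """Apply the 1-D transform to the row held at each node, independently.

    No entry reads another node's row, mirroring the fact that this step
    requires no communication between nodes.
    """
    return {node: dft(row) for node, row in rows_by_node.items()}
```

Each output row has the same length as its input row, and each dictionary entry could be computed on a separate node without exchanging any data.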
It is noted that at each node, in addition to the resulting row in Figure 5, the original row in Figure 4 may continue to exist and be of interest for a particular application, but the original row is no longer needed for the second one-dimensional FFT in the series of FFTs required for the multidimensional FFT according to the present invention, as particularly illustrated in Figures 6 and 7.
Figure 6 is an exemplary "all-to-all" re-distribution illustration 600 that depicts how each resulting row of elements transformed via the first-dimension FFT of Figure 5 is re-distributed across the nodes Q11 - Q33 for performing the second-dimension FFT according to the present invention. More particularly, each resulting row of elements that is distributed at each node Q11...Q33 of Figure 5 is re-distributed over the Torus network 100 so that each successive node receives a successive column of elements as particularly depicted in Figure 6. This efficient re-distribution is the "all-to-all" re-distribution, which enables an efficient implementation of the multidimensional FFT on the distributed-memory parallel supercomputer according to the present invention. For example, the first node Q11 receives the first column of elements, i.e., first elements from each of the nodes Q11...Q33. As another example, node Q12 receives the second column of elements, i.e., second elements from each of the nodes Q11...Q33. This re-distribution is performed for each column in Figure 5. In the exemplary re-distribution of Figure 6, the assignment of a particular node to a particular row of array elements is not fundamental. Instead, it is noted that any assignment is feasible. For various applications and/or computers, some assignments may take advantage of efficiencies offered by the applications and/or computers and thus produce faster execution than other assignments. For example, the fastest way to perform the multidimensional FFT may be to reverse the assignments of nodes Q11 and Q12 from those illustrated in Figure 6. The description below with reference to Figures 9 and 10 particularly describes how the present invention performs the "all-to-all" re-distribution of array elements across the nodes of the supercomputer over the Torus network 100.
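Viewed purely as a data movement, the "all-to-all" re-distribution of Figure 6 amounts to a transpose of the distributed array. The following Python sketch shows only that data movement and ignores packets, routing and scheduling entirely. The function name `all_to_all` and the representation of rows as lists keyed by node label are assumptions of this example; the sketch also assumes that each row holds exactly one element per node.

```python
def all_to_all(rows_by_node, node_order):
    """Re-distribute so that the j-th node receives the j-th element of every row."""
    received = {node: [] for node in node_order}
    for src in node_order:                      # every node sends ...
        for j, dst in enumerate(node_order):    # ... one element to every node
            received[dst].append(rows_by_node[src][j])
    return received
```

With nine nodes taken in the order Q11, Q12...Q33 and rows B11...B19 through B91...B99, node Q11 ends up holding B11, B21...B91 and node Q12 ends up holding B12, B22...B92.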
The "all-to-all" re-distribution of the elements at each node Q11...Q33 is fast since it takes advantage of the communication characteristics of the Torus network 100. In the re-distribution illustrated in Figure 6, each of the nodes Q11...Q33 sends a single array element to every other node. The following description assumes that each element of the array is a quantity of data larger than the quantity of data carried by a single packet. Thus, a plurality of packets is needed to transmit each element of the array to a destination node over the Torus network 100. This closely resembles the typical real-world re-distribution, where due to much larger array sizes, each node sends many array elements to every other node, typically requiring many packets.
Figure 7 is an exemplary illustration 700 that depicts a second one-dimensional FFT on the two-dimensional array of Figure 6 that was re-distributed across the nodes Q11 - Q33 over the two-dimensional Torus network 100 of Figure 1 according to the present invention. As particularly noted above, the multidimensional FFT according to the present invention is accomplished by performing a series of one-dimensional FFTs, where Figure 7 depicts the second one-dimensional FFT in that series according to the present invention. Therefore, a one-dimensional FFT is performed on the column of elements that were distributed to each node as illustrated in Figure 5. For example, a one-dimensional FFT is performed for the elements distributed at node Q11, i.e., elements B11, B21...B91 in Figure 6 that were distributed as a row to node Q11 from the first column of Figure 5. Additionally, one-dimensional FFTs are performed on rows of elements (i.e., distributed from successive columns of elements of Figure 5) at each node Q12 - Q33. The result of the one-dimensional FFT on each row is a row of the same length as particularly illustrated in Figure 7. For example, a one-dimensional FFT performed on the first row at node Q11 of Figure 6, which comprises elements B11, B21...B91, results in a first row at node Q11 of Figure 7, which comprises elements C11, C21...C91. As mentioned above with regard to the first FFT, the one-dimensional FFT performed on each row at each node is independent of the one-dimensional FFT performed on any other row at another node. The particular distribution of data illustrated in Figure 6 enables each node to perform the one-dimensional FFT on the row of elements distributed at that node, without communication with any other node on the Torus network 100 of Figure 1. Therefore, since no communication is required between the nodes, these one-dimensional FFTs are performed fast.
Figure 8 is an exemplary method flowchart that illustrates the implementation of the two-dimensional FFT of an array on the distributed-memory parallel supercomputer of Figure 1 that utilizes a 2-dimensional Torus network 100 for communication between nodes Q11...Q33 of the supercomputer. In the following description, Figure 8 will be described on the basis of Figures 1-7 for efficiently performing the two-dimensional FFT. At step 802, the multidimensional FFT of the two-dimensional array illustrated in Figure 3 in the distributed-memory parallel supercomputer of Figure 1 is started. It is noted that at step 802, the array illustrated in Figure 3 is distributed across the nodes in some arbitrary fashion that may be particular to an application. At step 804, elements (i.e., the data) of the array 300 are efficiently re-distributed across nodes Q11...Q33, as particularly illustrated in Figure 4. At step 806, each node performs a first one-dimensional FFT (out of a series of one-dimensional FFTs) on a row of elements of the array stored at that node, as illustrated in Figure 4, with the result particularly illustrated in Figure 5. As described with regard to Figures 5 and 6, columns of one-dimensional FFT-transformed elements are re-distributed across the nodes Q11...Q33 of the supercomputer utilizing the Torus-based architecture of Figure 1 at step 808. At step 810, each node performs a second one-dimensional FFT on a successive column of the first one-dimensional FFT-transformed elements, which is distributed as a row of elements in Figure 6. The result of the second one-dimensional FFT is illustrated in Figure 7. At step 812, the multidimensional FFT of the two-dimensional array illustrated in Figure 3 in the supercomputer of Figure 1 is ended. As particularly described above, between the two one-dimensional FFTs there is a fast re-distribution of elements across the nodes Q11...Q33.
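The sequence of steps 806, 808 and 810 may be sketched on a single machine as follows. This Python fragment is an illustration under simplifying assumptions: each list stands in for the row held by one node, a naive discrete Fourier transform stands in for the one-dimensional FFT, and an in-memory transpose stands in for the "all-to-all" re-distribution. All function names are hypothetical. The function `dft2_direct` evaluates the two-dimensional transform straight from its definition and is included only so that the row-column result can be checked against it.

```python
import cmath


def dft(row):
    """Naive 1-D discrete Fourier transform (a stand-in for an FFT)."""
    n = len(row)
    return [
        sum(row[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def transpose(rows):
    """Stand-in for the "all-to-all" re-distribution: columns become rows."""
    return [list(col) for col in zip(*rows)]


def fft2_by_rows(array):
    """2-D transform as row transforms, re-distribution, then row transforms."""
    first = [dft(row) for row in array]            # step 806: one row per node
    redistributed = transpose(first)               # step 808: re-distribution
    second = [dft(row) for row in redistributed]   # step 810: columns held as rows
    return second                                  # row v holds column v of the result


def dft2_direct(array):
    """Direct 2-D DFT from the definition, used only as a reference."""
    n_rows, n_cols = len(array), len(array[0])
    return [
        [
            sum(
                array[r][c]
                * cmath.exp(-2j * cmath.pi * (u * r / n_rows + v * c / n_cols))
                for r in range(n_rows)
                for c in range(n_cols)
            )
            for v in range(n_cols)
        ]
        for u in range(n_rows)
    ]
```

Because the result is left in the re-distributed layout, entry `[v][u]` of `fft2_by_rows(array)` corresponds to entry `[u][v]` of `dft2_direct(array)`.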
The above-described multidimensional FFT on an array of elements distributed across nodes of a distributed-memory parallel supercomputer, coupled with re-distribution of the elements across the nodes, is illustrative of the invention. More particularly, the present invention utilizes efficient hardware routing of the Torus-based architecture coupled with a series of one-dimensional FFTs to achieve an efficient implementation of the multidimensional FFT on the distributed-memory parallel supercomputer. As noted above, the teachings according to the present invention may be utilized for performing efficient multidimensional FFTs with other numbers of array dimensions, with other array sizes, and with other numbers of Torus network dimensions, e.g., a 3-dimensional Torus. Additionally, the teachings according to the present invention may be utilized for performing "all-to-all" communication between nodes of the distributed-memory parallel supercomputer on a Torus network of arbitrary dimensions.
Figure 9 is an exemplary method flowchart 900 that depicts the filling of one or more output queues 212 on an exemplary node Q11 of Figure 2 with packets destined for other nodes, e.g., nodes Q22 and Q33, on the distributed-memory parallel supercomputer according to the present invention. The "all-to-all" re-distribution illustrated in Figure 6 above is implemented as follows according to the present invention. Assume that Qxy denotes a generic node (e.g., node Q11) with an x-coordinate value x and a y-coordinate value y (e.g., x=1; y=1). Thus, according to the "all-to-all" re-distribution, node Qxy (e.g., node Q11) needs to send a plurality of total packets (i.e., k packets) to every node Qab for all possible values of a and b (e.g., Q12, Q13; Q21, Q22, Q23; and Q31, Q32, Q33 as illustrated in Figure 1; it is noted that Q11 does not send packets to itself). To perform the re-distribution as fast as possible, the grid links of the Torus network 100 must be efficiently utilized. If packets are not scheduled in an efficient order, then the grid link utilization may be very inefficient. For example, if every node first sends packets only in the positive X+ direction, then all the grid links in the negative X- direction will be idle, hence the re-distribution will not be performed as fast as possible and the multidimensional FFT will not be implemented as efficiently as possible. According to the present invention, the fast re-distribution takes advantage of the adaptive routing capability of the Torus-based network 100 such that packet scheduling is implemented efficiently, as particularly illustrated below.
Thus with reference to Figure 9, there are Nx*Ny nodes interconnected by the Torus network 100 (i.e., 3 x 3 = 9 nodes in Figure 1) that need to exchange packets, which include elements of the two-dimensional array. At step 902, the exemplary method starts. At step 904, at each node Q11...Q33 there is created an array (i.e., random_map[ ] array) that assigns each other node on the Torus network 100 a unique number from 0 to Nx*Ny-2. Since a node does not send packets to itself, the nodes with which it exchanges packets are numbered 0 to Nx*Ny-2. It is noted that the assignments at step 904 are generated randomly. At this point, assume that the total number of packets that a node requires to send an element of the array to another node is k packets (e.g., 6 packets). Thereafter, assume that total k packets = d iterations * b packets, where d is the number of iterations necessary and b is the number of packets transmitted per iteration, for a total number of k packets. It is noted that b may be chosen as necessary for efficiency and may likewise be equal to 1. For example, to transmit a total of 6 packets, it can be chosen to transmit 2 packets per iteration on each of 3 iterations for the total of 6 packets. Therefore, at step 906, a loop is initiated for id from 1 to d iterations. At step 908, a queue counter is initialized to zero. It is assumed that there are L output queues 212 (L being greater than or equal to 1) for storing packets (or short descriptors of the packets such that the actual packet need not be copied), and all packets (or descriptors of the packets) for a given destination will be placed into the same output queue. A particular output queue iL is selected in round-robin order at step 912 within nested loops of Figure 9. At step 910, a loop is initialized for iN value from node 0 to node Nx*Ny-2, as an index into the array (i.e., random_map[ ] array) created at step 904.
As the array created in step 904 is indexed for a particular iN value, a random node value is obtained from the random_map[ ] array. At step 912, a first queue is selected in round-robin order. At step 914, a loop is initialized for ib from 1 to b packets per iteration. Subsequently, at steps 914 and 916, a plurality of b packets (e.g., b = 2 packets from the above example) destined for a given random node iN are added to the same output queue iL as packet[node, id, ib]. At step 918, once all d iterations have been completed, the method ends. In sum with reference to the flowchart 900, during one of the d iterations a particular node "i" (e.g., processor 202 on node Q11 of Figure 2) will first place b packets that include data for an element of the array destined for a node Modulus(i+1, Nx*Ny-1) in a first output queue, then particular node "i" will place b packets that include data for an element of the array destined for a node Modulus(i+2, Nx*Ny-1) into a next output queue, and so on until reaching node Modulus(i+(Nx*Ny-1), Nx*Ny-1). When the b packets for each destination have been inserted into the output queues for a given iteration, this process is repeated until the d iterations have all been completed. The foregoing re-distribution achieves extremely high grid link utilization on the Torus network 100 of Figure 1, thereby efficiently implementing the multidimensional FFT according to the present invention.
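The nested loops of flowchart 900 may be sketched in Python as follows. This is an illustrative model rather than the disclosed implementation: the function name `fill_output_queues` and its parameters are hypothetical, indices start at 0 rather than 1, and each queued entry is a short descriptor (destination, id, ib) in place of a real packet. The sketch assumes that k is an exact multiple of b.

```python
import random


def fill_output_queues(self_node, all_nodes, k, b, num_queues, rng=None):
    """Queue k packet descriptors per destination, b at a time, in random node order."""
    rng = rng or random.Random()
    if b <= 0 or k % b != 0:
        raise ValueError("k must be a positive multiple of b")
    d = k // b                                   # k packets = d iterations * b packets

    # Step 904: a random ordering of every node except the sender itself.
    random_map = [node for node in all_nodes if node != self_node]
    rng.shuffle(random_map)

    queues = [[] for _ in range(num_queues)]     # the L output queues
    for i_d in range(d):                         # step 906: loop over iterations
        i_l = 0                                  # step 908: reset the queue counter
        for dest in random_map:                  # step 910: loop over destinations
            queue = queues[i_l]                  # step 912: round-robin selection
            for i_b in range(b):                 # steps 914 and 916: b packets
                queue.append((dest, i_d, i_b))
            i_l = (i_l + 1) % num_queues
    return random_map, queues
```

Because the queue counter is reset at the start of every iteration and the destinations are visited in the same order each time, all descriptors for a given destination land in the same output queue, as the description requires.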
Figure 10 is an exemplary method flowchart 1000 that depicts how the packets in the one or more output queues 212 on the exemplary node Q11 of Figure 2 are drained into the injection FIFOs 210 for subsequent insertion on the Torus network 100 according to the present invention. Before describing Figure 10 in detail, it is noted that the filling of Figure 9 and the draining of Figure 10 may be performed concurrently with one another. At step 1002, the exemplary method starts. At step 1004, it is determined whether all L output queues 212 are empty. At step 1006, a loop is initiated for iL from 1 to L, to iterate over all L output queues. At step 1008, it is determined whether a particular output queue iL is empty. If the output queue iL is empty, the method continues to the next iL output queue at step 1006. Otherwise, at step 1010, for a packet at the head of the output queue iL, possible directions for routing the packet over the Torus network 100 are obtained. For example with reference to Figure 1, assume that node Q11 placed a packet destined to node Q22 into an output queue iL. The packet may travel from node Q11 in the X+ direction (over grid link 108) followed by the Y- direction (over grid link 110) to reach node Q22, or it may travel in the Y- direction (over grid link 106) followed by the X+ direction (over grid link 112) to reach node Q22. Now back to Figure 10, at step 1012 it is further determined whether all FIFOs 210 of Figure 2 in the possible directions for the packet are full. As described above, each injection FIFO 210 has a logical direction (e.g., X+) associated with it, which represents that any packet placed in the injection FIFO 210 can move in the associated logical direction (e.g., X+ direction). If the injection FIFOs 210 for the possible packet directions are full, then the method skips the current output queue and continues by iterating to the next output queue at step 1006.
Otherwise, at step 1014, the packet is moved from the output queue to a least full FIFO 210 in one of the possible directions for that packet. It is noted that packets are removed from the output queues in a round-robin order for insertion into the injection FIFOs 210 illustrated in Figure 2. After the packet is moved, the method continues at step 1008 for a next available packet in that output queue. Once all output queues are empty, the method ends at step 1016.
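One pass over the output queues, corresponding to steps 1006 through 1014, may be sketched in Python as follows. The sketch is a simplified model: the names `drain_sweep`, `capacity` and `directions_of` are hypothetical, the FIFO capacity is an assumed parameter, and the router that concurrently removes packets from the injection FIFOs is not modeled. A caller would therefore repeat the sweep, as space becomes available, until every output queue is empty.

```python
from collections import deque


def drain_sweep(queues, fifos, capacity, directions_of):
    """Move packets from output queues into injection FIFOs; return the count moved.

    queues        -- list of deques of packets (the output queues)
    fifos         -- dict mapping a direction label to a list (the injection FIFOs)
    capacity      -- assumed maximum number of packets per injection FIFO
    directions_of -- function returning the possible directions for a packet
    """
    moved = 0
    for queue in queues:                                    # step 1006
        while queue:                                        # step 1008
            packet = queue[0]                               # packet at the head
            candidates = [
                direction
                for direction in directions_of(packet)      # step 1010
                if len(fifos[direction]) < capacity
            ]
            if not candidates:                              # step 1012: all full
                break                                       # skip to the next queue
            # Step 1014: choose the least full FIFO among the possible directions.
            target = min(candidates, key=lambda direction: len(fifos[direction]))
            fifos[target].append(queue.popleft())
            moved += 1
    return moved
```

With a capacity of one packet per FIFO, three packets that may each leave via X+ or Y+ would see the first placed in X+, the second in Y+, and the third left at the head of its output queue until a FIFO drains.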
In order to more fully demonstrate Figures 9 and 10, which describe the "all-to-all" routing, assume that the row of elements at node Q11 in Figure 5, i.e., elements B11, B12...B19, are to be re-distributed across nodes Q12...Q33 as illustrated in Figure 6 over the Torus network 100. Assume that the random mapping of nodes has the following values in the random_map array = {Q32; Q22; Q13; Q21; Q23; Q33; Q12; and Q31}. The array elements and their destination nodes from node Q11 are as follows: {B12 to Q12; B13 to Q13; B14 to Q21; B15 to Q22; B16 to Q23; B17 to Q31; B18 to Q32; and B19 to Q33}. Therefore, the array elements are placed into the FIFOs 210 of node Q11 in the following order: {B18 to Q32 via X+ or Y-; B15 to Q22 via X+ or Y+; B13 to Q13 via X-; B14 to Q21 via Y+; B16 to Q23 via Y+ or X-; B19 to Q33 via X- or Y-; B12 to Q12 via X+; and B17 to Q31 via Y-}. Thus for example, the FIFOs 210 on node Q11 might be filled as illustrated in Table 1 below.
Table 1: (provided as image imgf000022_0001 in the original publication; not reproduced in this text)
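The ordering and the candidate directions of the foregoing example may be reproduced with a short Python sketch. The sketch is illustrative only. The function names are hypothetical, and the direction labels assume that the second digit of a node label advances along X and the first digit along Y, which is the convention that matches the directions listed in the example.

```python
def axis_directions(src, dst, n, plus, minus):
    """Shortest-path direction labels along one torus axis of size n."""
    hops_plus = (dst - src) % n
    if hops_plus == 0:
        return []
    hops_minus = n - hops_plus
    if hops_plus < hops_minus:
        return [plus]
    if hops_minus < hops_plus:
        return [minus]
    return [plus, minus]


def parse(label):
    """Turn a label such as 'Q23' into the pair (2, 3)."""
    return int(label[1]), int(label[2])


def schedule(src, random_map, n=3):
    """List (element, destination, directions) in the order given by random_map."""
    labels = [f"Q{r}{c}" for r in range(1, n + 1) for c in range(1, n + 1)]
    row = labels.index(src) + 1                  # node Q11 holds row 1, and so on
    s = parse(src)
    plan = []
    for dest in random_map:
        col = labels.index(dest) + 1             # the j-th node receives element j
        t = parse(dest)
        directions = (axis_directions(s[1], t[1], n, "X+", "X-")
                      + axis_directions(s[0], t[0], n, "Y+", "Y-"))
        plan.append((f"B{row}{col}", dest, directions))
    return plan
```

Calling `schedule("Q11", ["Q32", "Q22", "Q13", "Q21", "Q23", "Q33", "Q12", "Q31"])` lists the elements in the order B18, B15, B13, B14, B16, B19, B12, B17, with the same candidate directions as enumerated above.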
Notwithstanding the fact that the number of injection FIFOs was described above as equal to the number of grid links to a node (e.g., 4 FIFOs and 4 grid links), the use of an injection FIFO that is restricted to at least a particular grid link is also well-suited when the number of injection FIFOs is not equal to the number of grid links. For example, if there are fewer injection FIFOs than grid links, then the use of an injection FIFO may be restricted to at least one of several particular grid links. For another example, if there are more injection FIFOs than grid links, then there may be several injection FIFOs whose use is restricted to at least the same particular grid link.
Although the implementation of the array re-distribution was described above with reference to efficient implementation of the multidimensional FFT, the "all-to-all" re-distribution is also well suited for any type of array re-distributions over the Torus network 100 of Figure 1.
While the invention has been particularly shown and described with regard to preferred embodiments thereof, it will be understood by those skilled in the art that the foregoing and other changes in form and details may be made therein without departing from the spirit and scope of the invention.

Claims

Having thus described our invention, what we claim as new, and desire to secure by Letters Patent is:
1. A method for efficiently implementing a multidimensional Fast Fourier Transform (FFT) of a multidimensional array comprising a plurality of elements initially distributed in a multi-node computer system comprising a plurality of nodes in communication over a network, the method comprising: (a) distributing the plurality of elements of the array in a first dimension across the plurality of nodes of the computer system over the network to facilitate a first one-dimensional FFT; (b) performing the first one-dimensional FFT on the elements of the array distributed at each node in the first dimension; (c) re-distributing the one-dimensional FFT-transformed elements at each node in a second dimension via "all-to-all" distribution in random order across other nodes of the computer system over the network; and (d) performing a second one-dimensional FFT on elements of the array re-distributed at each node in the second dimension, wherein the random order facilitates efficient utilization of the network thereby efficiently implementing the multidimensional FFT.
2. The method for efficiently implementing a multidimensional FFT according to Claim 1, wherein the method further comprises the steps of: re-distributing the elements of the array at each node in a third dimension via the "all-to-all" distribution in random order across other nodes of the computer system over the network; performing a one-dimensional FFT on elements of the array re-distributed at each node in the third dimension; and repeating the steps of re-distributing the elements of the array in random order across nodes and performing the one-dimensional FFT on the re-distributed elements at each node for subsequent dimensions.
3. The method for efficiently implementing a multidimensional FFT according to Claim 1, wherein the method comprises a step of generating a random order of other nodes for re-distributing the one-dimensional FFT-transformed elements at each node.
4. The method for efficiently implementing a multidimensional FFT according to Claim 3, wherein each of the plurality of elements is re-distributed between nodes of the computer system via a plurality of total packets.
5. The method for efficiently implementing a multidimensional FFT according to Claim 4, wherein the method further comprises the steps of: providing a plurality of output queues at each node; iterating through the other nodes in generated random order a plurality of times; and outputting to an output queue for each other node at least one packet of the plurality of total packets during each iteration.
6. The method for efficiently implementing a multidimensional FFT according to Claim 5, wherein the method further comprises the steps of: providing a plurality of injection first-in-first-out (FIFO) buffers, each FIFO buffer for transmitting packets in at least a particular direction on the network; iterating through the plurality of output queues at a node to identify a packet at the head of each queue; obtaining possible routing directions associated with the packet at the head of each queue; and moving the packet from the head of each queue to a least full FIFO buffer in one of the possible routing directions associated with the packet.
7. A system for efficiently implementing a multidimensional Fast Fourier Transform (FFT) of a multidimensional array comprising a plurality of elements initially distributed in a multi-node computer system comprising a plurality of nodes in communication over a network, the system comprising: (a) means for distributing the plurality of elements of the array in a first dimension across the plurality of nodes of the computer system over the network to facilitate a first one-dimensional FFT; (b) means for performing the first one-dimensional FFT on the elements of the array distributed at each node in the first dimension; (c) means for re-distributing the one-dimensional FFT-transformed elements at each node in a second dimension via "all-to-all" distribution in random order across other nodes of the computer system over the network; and (d) means for performing a second one-dimensional FFT on elements of the array re-distributed at each node in the second dimension, wherein the random order facilitates efficient utilization of the network thereby efficiently implementing the multidimensional FFT.
8. The system for efficiently implementing a multidimensional FFT according to Claim 7, wherein the system further comprises: means for re-distributing the elements of the array at each node in a third dimension via the "all-to-all" distribution in random order across other nodes of the computer system over the network; means for performing a one-dimensional FFT on elements of the array re-distributed at each node in the third dimension; and means for repeating the steps of re-distributing the elements of the array in random order across nodes and performing the one-dimensional FFT on the re-distributed elements at each node for subsequent dimensions.
9. The system for efficiently implementing a multidimensional FFT according to Claim 7, wherein the system comprises a means for generating a random order of other nodes for re-distributing the one-dimensional FFT-transformed elements at each node.
10. The system for efficiently implementing a multidimensional FFT according to Claim 9, wherein each of the plurality of elements is re-distributed between nodes of the computer system via a plurality of total packets.
11. The system for efficiently implementing a multidimensional FFT according to Claim 10, wherein the system further comprises: means for providing a plurality of output queues at each node; means for iterating through the other nodes in generated random order a plurality of times; and means for outputting to an output queue for each other node at least one packet of the plurality of total packets during each iteration.
12. The system for efficiently implementing a multidimensional FFT according to Claim 11, wherein the system further comprises: means for providing a plurality of injection first-in-first-out (FIFO) buffers, each FIFO buffer for transmitting packets in at least a particular direction on the network; means for iterating through the plurality of output queues at a node to identify a packet at the head of each queue; means for obtaining possible routing directions associated with the packet at the head of each queue; and means for moving the packet from the head of each queue to a least full FIFO buffer in one of the possible routing directions associated with the packet.
13. A program storage device, tangibly embodying a program of instructions executable by a machine to perform a method for efficiently implementing a multidimensional Fast Fourier Transform (FFT) of a multidimensional array comprising a plurality of elements initially distributed in a multi-node computer system comprising a plurality of nodes in communication over a network, the method comprising: (a) distributing the plurality of elements of the array in a first dimension across the plurality of nodes of the computer system over the network to facilitate a first one-dimensional FFT; (b) performing the first one-dimensional FFT on the elements of the array distributed at each node in the first dimension; (c) re-distributing the one-dimensional FFT-transformed elements at each node in a second dimension via "all-to-all" distribution in random order across other nodes of the computer system over the network; and (d) performing a second one-dimensional FFT on elements of the array re-distributed at each node in the second dimension, wherein the random order facilitates efficient utilization of the network thereby efficiently implementing the multidimensional FFT.
14. The program storage device for efficiently implementing a multidimensional FFT according to Claim 13, wherein the method further comprises the step of: re-distributing the elements of the array at each node in a third dimension via the "all-to-all" distribution in random order across other nodes of the computer system over the network; performing a one-dimensional FFT on elements of the array re-distributed at each node in the third dimension; and repeating the steps of re-distributing the elements of the array in random order across nodes and performing the one-dimensional FFT on the re-distributed elements at each node for subsequent dimensions.
15. The program storage device for efficiently implementing a multidimensional FFT according to Claim 13, wherein the method comprises a step of generating a random order of other nodes for re-distributing the one-dimensional FFT-transformed elements at each node.
16. The program storage device for efficiently implementing a multidimensional FFT according to Claim 15, wherein each of the plurality of elements is re-distributed between nodes of the computer system via a plurality of total packets.
17. The program storage device for efficiently implementing a multidimensional FFT according to Claim 16, wherein the method further comprises the steps of: providing a plurality of output queues at each node; iterating through the other nodes in generated random order a plurality of times; and outputting to an output queue for each other node at least one packet of the plurality of total packets during each iteration.
18. The program storage device for efficiently implementing a multidimensional FFT according to Claim 17, wherein the method further comprises the steps of: providing a plurality of injection first-in-first-out (FIFO) buffers, each FIFO buffer for transmitting packets in at least a particular direction on the network; iterating through the plurality of output queues at a node to identify a packet at the head of each queue; obtaining possible routing directions associated with the packet at the head of each queue; and moving the packet from the head of each queue to a least full FIFO buffer in one of the possible routing directions associated with the packet.
19. A method for efficiently re-distributing a multidimensional array comprising a plurality of elements initially distributed in a multi-node computer system comprising a plurality of nodes in communication over a network, the method comprising re-distributing the elements at each node via "all-to-all" distribution in random order across other nodes of the computer system over the network, wherein the random order facilitates efficient utilization of the network.
20. The method for efficiently re-distributing a multidimensional array according to Claim 19, wherein the method comprises a step of generating a random order of other nodes for re-distributing the elements at each node.
21. The method for efficiently re-distributing a multidimensional array according to Claim 20, wherein each of the plurality of elements is re-distributed between nodes of the computer system via a plurality of total packets.
22. The method for efficiently re-distributing a multidimensional array according to Claim 21, wherein the method further comprises the steps of: providing a plurality of output queues at each node; iterating through the other nodes in generated random order a plurality of times; and outputting to an output queue for each other node at least one packet of the plurality of total packets during each iteration.
23. The method for efficiently re-distributing a multidimensional array according to Claim 22, wherein the method further comprises the steps of: providing a plurality of injection first-in-first-out (FIFO) buffers, each FIFO buffer for transmitting packets in at least a particular direction on the network; iterating through the plurality of output queues at a node to identify a packet at the head of each queue; obtaining possible routing directions associated with the packet at the head of each queue; and moving the packet from the head of each queue to a least full FIFO buffer in one of the possible routing directions associated with the packet.
24. A system for efficiently re-distributing a multidimensional array comprising a plurality of elements initially distributed in a multi-node computer system comprising a plurality of nodes in communication over a network, the system comprising a means for re-distributing the elements at each node via "all-to-all" distribution in random order across other nodes of the computer system over the network, wherein the random order facilitates efficient utilization of the network.
25. The system for efficiently re-distributing a multidimensional array according to Claim 24, wherein the system comprises a means for generating a random order of other nodes for re-distributing the elements at each node.
26. The system for efficiently re-distributing a multidimensional array according to Claim 25, wherein each of the plurality of elements is re-distributed between nodes of the computer system via a plurality of total packets.
27. The system for efficiently re-distributing a multidimensional array according to Claim 26, wherein the system further comprises: means for providing a plurality of output queues at each node; means for iterating through the other nodes in generated random order a plurality of times; and means for outputting to an output queue for each other node at least one packet of the plurality of total packets during each iteration.
28. The system for efficiently re-distributing a multidimensional array according to Claim 27, wherein the system further comprises: means for providing a plurality of injection first-in-first-out (FIFO) buffers, each FIFO buffer for transmitting packets in at least a particular direction on the network; means for iterating through the plurality of output queues at a node to identify a packet at the head of each queue; means for obtaining possible routing directions associated with the packet at the head of each queue; and means for moving the packet from the head of each queue to a least full FIFO buffer in one of the possible routing directions associated with the packet.
29. A program storage device, tangibly embodying a program of instructions executable by a machine to perform a method for efficiently re-distributing a multidimensional array comprising a plurality of elements initially distributed in a multi-node computer system comprising a plurality of nodes in communication over a network, the method comprising re-distributing the elements at each node via "all-to-all" distribution in random order across other nodes of the computer system over the network, wherein the random order facilitates efficient utilization of the network.
30. The program storage device for efficiently re-distributing a multidimensional array according to Claim 29, wherein the method comprises a step of generating a random order of other nodes for re-distributing the elements at each node.
31. The program storage device for efficiently re-distributing a multidimensional array according to Claim 29, wherein each of the plurality of elements is re-distributed between nodes of the computer system via a plurality of total packets.
32. The program storage device for efficiently re-distributing a multidimensional array according to Claim 31, wherein the method further comprises the steps of: providing a plurality of output queues at each node; iterating through the other nodes in generated random order a plurality of times; and outputting to an output queue for each other node at least one packet of the plurality of total packets during each iteration.
33. The program storage device for efficiently re-distributing a multidimensional array according to Claim 32, wherein the method further comprises the steps of: providing a plurality of injection first-in-first-out (FIFO) buffers, each FIFO buffer for transmitting packets in at least a particular direction on the network; iterating through the plurality of output queues at a node to identify a packet at the head of each queue; obtaining possible routing directions associated with the packet at the head of each queue; and moving the packet from the head of each queue to a least full FIFO buffer in one of the possible routing directions associated with the packet.
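For illustration only, steps (a) through (d) of Claim 1, together with the random node order of Claim 3 and the packet-per-pass iteration of Claim 5, can be sketched as a single-process Python simulation. This is a minimal sketch under stated assumptions, not the claimed implementation: the function names (`dft`, `distributed_fft2`, `direct_dft2`), the `inbox` dictionary standing in for the network, the square array layout, and the naive O(n^2) DFT used in place of a true one-dimensional FFT are all hypothetical choices made here.

```python
import cmath
import random

def dft(xs):
    """Naive O(n^2) 1-D DFT; stands in for a real 1-D FFT routine (assumption)."""
    n = len(xs)
    return [sum(xs[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]

def distributed_fft2(a, num_nodes, seed=0):
    """Simulate steps (a)-(d) for an n x n array spread over num_nodes nodes."""
    n = len(a)
    assert n % num_nodes == 0
    r = n // num_nodes                      # rows (later columns) owned per node
    rng = random.Random(seed)

    # (a) node p owns rows p*r .. p*r + r - 1
    # (b) first 1-D transform along each locally held row
    rows = {p: [dft(a[p * r + i]) for i in range(r)] for p in range(num_nodes)}

    # (c) all-to-all: node p sends node q the entries in q's column range.
    # Each node draws a random order of the other nodes and makes r passes
    # over it, emitting one packet per destination per pass.
    inbox = {q: {} for q in range(num_nodes)}   # inbox[q][(row, col)] = value
    for p in range(num_nodes):
        order = [q for q in range(num_nodes) if q != p]
        rng.shuffle(order)
        for i in range(r):                  # pass i: packet cut from local row i
            for q in order:
                for c in range(q * r, q * r + r):
                    inbox[q][(p * r + i, c)] = rows[p][i][c]
        # entries a node keeps for itself need no network transfer
        for i in range(r):
            for c in range(p * r, p * r + r):
                inbox[p][(p * r + i, c)] = rows[p][i][c]

    # (d) second 1-D transform along each column now held locally
    out = [[0j] * n for _ in range(n)]
    for q in range(num_nodes):
        for c in range(q * r, q * r + r):
            col = dft([inbox[q][(row, c)] for row in range(n)])
            for row in range(n):
                out[row][c] = col[row]
    return out

def direct_dft2(a):
    """Reference 2-D DFT computed straight from the definition."""
    n = len(a)
    return [[sum(a[x][y] * cmath.exp(-2j * cmath.pi * (u * x + v * y) / n)
                 for x in range(n) for y in range(n))
             for v in range(n)] for u in range(n)]
```

In this sketch the random order affects only the sequence in which packets are issued, so the transform result is the same for any seed; on a real network the ordering would matter for link utilization, which a single-process simulation does not model.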
PCT/US2002/005574 2001-02-24 2002-02-25 Efficient implementation of a multidimensional fast fourier transform on a distributed-memory parallel multi-node computer WO2002069097A2 (en)

Priority Applications (10)

Application Number Priority Date Filing Date Title
CNB02805377XA CN1244878C (en) 2001-02-24 2002-02-25 High efficient implementation of multidimensional fast Fourier transform on distributed-memory parallel multi-node computer
CA002437036A CA2437036A1 (en) 2001-02-24 2002-02-25 Efficient implementation of a multidimensional fast fourier transform on a distributed-memory parallel multi-node computer
EP02721139A EP1497750A4 (en) 2001-02-24 2002-02-25 Efficient implementation of a multidimensional fast fourier transform on a distributed-memory parallel multi-node computer
PCT/US2002/005574 WO2002069097A2 (en) 2001-02-24 2002-02-25 Efficient implementation of a multidimensional fast fourier transform on a distributed-memory parallel multi-node computer
JP2002568153A JP4652666B2 (en) 2001-02-24 2002-02-25 Efficient implementation of multidimensional fast Fourier transform on distributed memory parallel multi-node computer
KR1020037011119A KR100592753B1 (en) 2001-02-24 2002-02-25 Efficient implementation of a multidimensional fast fourier transform on a distributed-memory parallel multi-node computer
AU2002252086A AU2002252086A1 (en) 2001-02-24 2002-02-25 Efficient implementation of a multidimensional fast fourier transform on a distributed-memory parallel multi-node computer
US10/468,998 US7315877B2 (en) 2001-02-24 2002-02-25 Efficient implementation of a multidimensional fast fourier transform on a distributed-memory parallel multi-node computer
IL15751802A IL157518A0 (en) 2001-02-24 2002-02-25 Efficient implementation of multidimensional fast fourier transform on a distributed-memory parallel multi-node computer
US11/931,898 US8095585B2 (en) 2001-02-24 2007-10-31 Efficient implementation of multidimensional fast fourier transform on a distributed-memory parallel multi-node computer

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US27112401P 2001-02-24 2001-02-24
US60/271,124 2001-02-24
PCT/US2002/005574 WO2002069097A2 (en) 2001-02-24 2002-02-25 Efficient implementation of a multidimensional fast fourier transform on a distributed-memory parallel multi-node computer

Related Child Applications (2)

Application Number Title Priority Date Filing Date
US10468998 A-371-Of-International 2002-02-25
US11/931,898 Continuation US8095585B2 (en) 2001-02-24 2007-10-31 Efficient implementation of multidimensional fast fourier transform on a distributed-memory parallel multi-node computer

Publications (2)

Publication Number Publication Date
WO2002069097A2 (en) 2002-09-06
WO2002069097A3 (en) 2002-10-24

Family

ID=68499830

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2002/005574 WO2002069097A2 (en) 2001-02-24 2002-02-25 Efficient implementation of a multidimensional fast fourier transform on a distributed-memory parallel multi-node computer

Country Status (9)

Country Link
US (2) US7315877B2 (en)
EP (1) EP1497750A4 (en)
JP (1) JP4652666B2 (en)
KR (1) KR100592753B1 (en)
CN (1) CN1244878C (en)
AU (1) AU2002252086A1 (en)
CA (1) CA2437036A1 (en)
IL (1) IL157518A0 (en)
WO (1) WO2002069097A2 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2419706A (en) * 2004-10-30 2006-05-03 Agilent Technologies Inc Permuting a vector using a permutation structure having random control bits

Families Citing this family (38)

Publication number Priority date Publication date Assignee Title
JP2008117044A (en) * 2006-11-01 2008-05-22 Oki Electric Ind Co Ltd Two-dimensional fast fourier transform operation method and two-dimensional fast fourier transform operation device
KR100592753B1 (en) * 2001-02-24 2006-06-26 인터내셔널 비지네스 머신즈 코포레이션 Efficient implementation of a multidimensional fast fourier transform on a distributed-memory parallel multi-node computer
CN100422975C (en) * 2005-04-22 2008-10-01 中国科学院过程工程研究所 Parallel computing system facing to particle method
US20060241928A1 (en) * 2005-04-25 2006-10-26 International Business Machines Corporation Load balancing by spatial partitioning of interaction centers
GB2425860A (en) * 2005-05-05 2006-11-08 Advanced Risc Mach Ltd Multi-dimensional fast fourier transform
EP2013772B1 (en) * 2006-04-28 2018-07-11 Qualcomm Incorporated Multi-port mixed-radix fft
US8082289B2 (en) 2006-06-13 2011-12-20 Advanced Cluster Systems, Inc. Cluster computing support for application programs
US8325633B2 (en) * 2007-04-26 2012-12-04 International Business Machines Corporation Remote direct memory access
US7948999B2 (en) * 2007-05-04 2011-05-24 International Business Machines Corporation Signaling completion of a message transfer from an origin compute node to a target compute node
US7889657B2 (en) * 2007-05-04 2011-02-15 International Business Machines Corporation Signaling completion of a message transfer from an origin compute node to a target compute node
US7890670B2 (en) * 2007-05-09 2011-02-15 International Business Machines Corporation Direct memory access transfer completion notification
US7779173B2 (en) * 2007-05-29 2010-08-17 International Business Machines Corporation Direct memory access transfer completion notification
US8037213B2 (en) 2007-05-30 2011-10-11 International Business Machines Corporation Replenishing data descriptors in a DMA injection FIFO buffer
US7765337B2 (en) * 2007-06-05 2010-07-27 International Business Machines Corporation Direct memory access transfer completion notification
US8018951B2 (en) 2007-07-12 2011-09-13 International Business Machines Corporation Pacing a data transfer operation between compute nodes on a parallel computer
US8478834B2 (en) * 2007-07-12 2013-07-02 International Business Machines Corporation Low latency, high bandwidth data communications between compute nodes in a parallel computer
US8959172B2 (en) * 2007-07-27 2015-02-17 International Business Machines Corporation Self-pacing direct memory access data transfer operations for compute nodes in a parallel computer
US7890597B2 (en) * 2007-07-27 2011-02-15 International Business Machines Corporation Direct memory access transfer completion notification
US20090031001A1 (en) * 2007-07-27 2009-01-29 Archer Charles J Repeating Direct Memory Access Data Transfer Operations for Compute Nodes in a Parallel Computer
JP2009104300A (en) * 2007-10-22 2009-05-14 Denso Corp Data processing apparatus and program
US9009350B2 (en) * 2008-04-01 2015-04-14 International Business Machines Corporation Determining a path for network traffic between nodes in a parallel computer
US9225545B2 (en) * 2008-04-01 2015-12-29 International Business Machines Corporation Determining a path for network traffic between nodes in a parallel computer
US8694570B2 (en) * 2009-01-28 2014-04-08 Arun Mohanlal Patel Method and apparatus for evaluation of multi-dimensional discrete fourier transforms
US9251118B2 (en) * 2009-11-16 2016-02-02 International Business Machines Corporation Scheduling computation processes including all-to-all communications (A2A) for pipelined parallel processing among plurality of processor nodes constituting network of n-dimensional space
US8544026B2 (en) 2010-02-09 2013-09-24 International Business Machines Corporation Processing data communications messages with input/output control blocks
JP5238791B2 (en) 2010-11-10 2013-07-17 株式会社東芝 Storage apparatus and data processing method in which memory nodes having transfer function are connected to each other
US8949453B2 (en) 2010-11-30 2015-02-03 International Business Machines Corporation Data communications in a parallel active messaging interface of a parallel computer
US8949328B2 (en) 2011-07-13 2015-02-03 International Business Machines Corporation Performing collective operations in a distributed processing system
US8930962B2 (en) 2012-02-22 2015-01-06 International Business Machines Corporation Processing unexpected messages at a compute node of a parallel computer
CN105308579B (en) * 2013-07-01 2018-06-08 株式会社日立制作所 Series data parallel parsing infrastructure and its parallel decentralized approach
EP3204868A1 (en) * 2014-10-08 2017-08-16 Interactic Holdings LLC Fast fourier transform using a distributed computing system
US10084860B2 (en) * 2015-04-09 2018-09-25 Electronics And Telecommunications Research Institute Distributed file system using torus network and method for configuring and operating distributed file system using torus network
CN104820581B (en) * 2015-04-14 2017-10-10 广东工业大学 A kind of method for parallel processing of FFT and IFFT permutation numbers table
US10116557B2 (en) 2015-05-22 2018-10-30 Gray Research LLC Directional two-dimensional router and interconnection network for field programmable gate arrays, and other circuits and applications of the router and network
KR102452945B1 (en) * 2015-08-27 2022-10-11 삼성전자주식회사 Apparatus and Method for performing Fourier transform
KR102526750B1 (en) * 2015-12-17 2023-04-27 삼성전자주식회사 Apparatus and Method for performing Fourier transform
US10587534B2 (en) 2017-04-04 2020-03-10 Gray Research LLC Composing cores and FPGAS at massive scale with directional, two dimensional routers and interconnection networks
CN107451097B (en) * 2017-08-04 2020-02-11 中国科学院软件研究所 High-performance implementation method of multi-dimensional FFT on domestic Shenwei 26010 multi-core processor

Citations (4)

Publication number Priority date Publication date Assignee Title
US5751616A (en) * 1995-11-29 1998-05-12 Fujitsu Limited Memory-distributed parallel computer and method for fast fourier transformation
US6073154A (en) * 1998-06-26 2000-06-06 Xilinx, Inc. Computing multidimensional DFTs in FPGA
US6119140A (en) * 1997-01-08 2000-09-12 Nec Corporation Two-dimensional inverse discrete cosine transform circuit and microprocessor realizing the same and method of implementing 8×8 two-dimensional inverse discrete cosine transform
US6237012B1 (en) * 1997-11-07 2001-05-22 Matsushita Electric Industrial Co., Ltd. Orthogonal transform apparatus

Family Cites Families (5)

Publication number Priority date Publication date Assignee Title
US5644517A (en) * 1992-10-22 1997-07-01 International Business Machines Corporation Method for performing matrix transposition on a mesh multiprocessor architecture having multiple processor with concurrent execution of the multiple processors
US5583990A (en) * 1993-12-10 1996-12-10 Cray Research, Inc. System for allocating messages between virtual channels to avoid deadlock and to optimize the amount of message traffic on each type of virtual channel
JP4057729B2 (en) 1998-12-29 2008-03-05 株式会社日立製作所 Fourier transform method and program recording medium
KR100592753B1 (en) * 2001-02-24 2006-06-26 인터내셔널 비지네스 머신즈 코포레이션 Efficient implementation of a multidimensional fast fourier transform on a distributed-memory parallel multi-node computer
US7788310B2 (en) * 2004-07-08 2010-08-31 International Business Machines Corporation Multi-dimensional transform for distributed memory network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP1497750A2 *

Also Published As

Publication number Publication date
EP1497750A4 (en) 2011-03-09
KR20040004542A (en) 2004-01-13
WO2002069097A3 (en) 2002-10-24
JP4652666B2 (en) 2011-03-16
EP1497750A2 (en) 2005-01-19
KR100592753B1 (en) 2006-06-26
US8095585B2 (en) 2012-01-10
JP2004536371A (en) 2004-12-02
US7315877B2 (en) 2008-01-01
CN1493042A (en) 2004-04-28
CA2437036A1 (en) 2002-09-06
US20080133633A1 (en) 2008-06-05
US20040078405A1 (en) 2004-04-22
CN1244878C (en) 2006-03-08
AU2002252086A1 (en) 2002-09-12
IL157518A0 (en) 2004-03-28

Similar Documents

Publication Publication Date Title
US7315877B2 (en) Efficient implementation of a multidimensional fast fourier transform on a distributed-memory parallel multi-node computer
US7620736B2 (en) Network topology having nodes interconnected by extended diagonal links
Boppana et al. A comparison of adaptive wormhole routing algorithms
Nicol Rectilinear partitioning of irregular data parallel computations
Lim et al. Efficient algorithms for block-cyclic redistribution of arrays
US20160105494A1 (en) Fast Fourier Transform Using a Distributed Computing System
US8549058B2 (en) Multi-dimensional transform for distributed memory network
Culler et al. Fast parallel sorting under LogP: from theory to practice
Bokhari Multiphase complete exchange: A theoretical analysis
Bar-Noy et al. Multiple message broadcasting in the postal model
Steenkiste A high-speed network interface for distributed-memory systems: architecture and applications
Leung et al. On multidimensional packet routing for meshes with buses
Bay et al. Deterministic on-line routing on area-universal networks
Bhuyan et al. Impact of switch design on the application performance of cache-coherent multiprocessors
Ranade et al. Nearly tight bounds for wormhole routing
Lo et al. Parallel divide and conquer on meshes
Hamdi et al. RCC-full: An effective network for parallel computations
Raghunath Interconnection network design based on packaging considerations
KR20240020539A (en) An interconnection network organization for distributed computation
Buzzard High performance communications for hypercube multiprocessors
Raghunath et al. Customizing interconnection networks to suit packaging hierarchies
Chlebus et al. Routing on meshes in optimum time and with really small queues
Seo et al. Extended flexible processor allocation strategy for mesh-connected systems using shape manipulations
Lee et al. Analysis of finite buffered multistage combining networks
Creutz Message passing on the QCDSP supercomputer

Legal Events

Date Code Title Description
AK Designated states (Kind code of ref document: A3; Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ OM PH PL PT RO RU SD SE SG SI SK SL TJ TM TN TR TT TZ UA UG US UZ VN YU ZA ZM ZW)
AL Designated countries for regional patents (Kind code of ref document: A3; Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG)
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
WWE Wipo information: entry into national phase (Ref document number: 2437036; Country of ref document: CA)
WWE Wipo information: entry into national phase (Ref document number: 157518; Country of ref document: IL) (Ref document number: 2002568153; Country of ref document: JP)
WWE Wipo information: entry into national phase (Ref document number: 10468998; Country of ref document: US) (Ref document number: 02805377X; Country of ref document: CN)
WWE Wipo information: entry into national phase (Ref document number: 1020037011119; Country of ref document: KR)
WWE Wipo information: entry into national phase (Ref document number: 2002721139; Country of ref document: EP)
REG Reference to national code (Ref country code: DE; Ref legal event code: 8642)
WWP Wipo information: published in national office (Ref document number: 1020037011119; Country of ref document: KR)
WWP Wipo information: published in national office (Ref document number: 2002721139; Country of ref document: EP)
WWG Wipo information: grant in national office (Ref document number: 1020037011119; Country of ref document: KR)