WO2010040144A1 - Multi-processor architecture and method - Google Patents

Multi-processor architecture and method

Info

Publication number
WO2010040144A1
Authority
WO
WIPO (PCT)
Prior art keywords
bus
peripheral component
peripheral
processors
data
Prior art date
Application number
PCT/US2009/059594
Other languages
French (fr)
Inventor
Shahin Solki
Stephen Morein
Mark S. Grossman
Original Assignee
Advanced Micro Devices, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US 12/245,686 (US8892804B2)
Application filed by Advanced Micro Devices, Inc.
Priority to CN200980147694.9A (CN102227709B)
Priority to EP09737270.0A (EP2342626B1)
Priority to KR1020117010206A (KR101533761B1)
Priority to JP2011530294A (JP2012504835A)
Publication of WO2010040144A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F 13/38 Information transfer, e.g. on bus
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F 13/38 Information transfer, e.g. on bus
    • G06F 13/42 Bus transfer protocol, e.g. handshake; Synchronisation
    • G06F 13/4282 Bus transfer protocol, e.g. handshake; Synchronisation on a serial bus, e.g. I2C bus, SPI bus
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F 13/14 Handling requests for interconnection or transfer
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/14 Digital output to display device; Cooperation and interconnection of the display device with other functional units
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 1/00 General purpose image data processing
    • G06T 1/20 Processor architectures; Processor configuration, e.g. pipelining
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 1/00 General purpose image data processing
    • G06T 1/60 Memory management
    • G PHYSICS
    • G09 EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
    • G09G ARRANGEMENTS OR CIRCUITS FOR CONTROL OF INDICATING DEVICES USING STATIC MEANS TO PRESENT VARIABLE INFORMATION
    • G09G 5/00 Control arrangements or circuits for visual indicators common to cathode-ray tube indicators and other visual indicators
    • G09G 5/36 Control arrangements or circuits for visual indicators common to cathode-ray tube indicators and other visual indicators characterised by the display of a graphic pattern, e.g. using an all-points-addressable [APA] memory
    • G09G 5/363 Graphics controllers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 2213/00 Indexing scheme relating to interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F 2213/0026 PCI express
    • G PHYSICS
    • G09 EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
    • G09G ARRANGEMENTS OR CIRCUITS FOR CONTROL OF INDICATING DEVICES USING STATIC MEANS TO PRESENT VARIABLE INFORMATION
    • G09G 2330/00 Aspects of power supply; Aspects of display protection and defect management
    • G09G 2330/02 Details of power systems and of start or stop of display operation
    • G09G 2330/021 Power management, e.g. power saving
    • G PHYSICS
    • G09 EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
    • G09G ARRANGEMENTS OR CIRCUITS FOR CONTROL OF INDICATING DEVICES USING STATIC MEANS TO PRESENT VARIABLE INFORMATION
    • G09G 2360/00 Aspects of the architecture of display systems
    • G09G 2360/06 Use of more than one graphics processor to process data before displaying to one or more screens
    • G PHYSICS
    • G09 EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
    • G09G ARRANGEMENTS OR CIRCUITS FOR CONTROL OF INDICATING DEVICES USING STATIC MEANS TO PRESENT VARIABLE INFORMATION
    • G09G 5/00 Control arrangements or circuits for visual indicators common to cathode-ray tube indicators and other visual indicators
    • G09G 5/003 Details of a display terminal, the details relating to the control arrangement of the display terminal and to the interfaces thereto
    • G09G 5/006 Details of the interface to the display terminal

Definitions

  • the bridge comprises logic configurable to determine routing for received write requests, received read requests, and received data.
  • the second peripheral component comprises an internal bridge configurable to receive data and to transmit data, and further configurable to be powered down when the internal bridge is not used to receive data and to transmit data.
  • the second peripheral component further comprises a dedicated power source for the use of the internal bridge.
  • the first peripheral component and the second peripheral component each comprise a graphics processing unit (GPU).
  • each of the first peripheral component and the second peripheral component further comprises a respective plurality of clients coupled to respective bus interfaces, wherein the clients comprise video processing logic comprising shader units and encoder/decoder units.
  • Embodiments of the invention include a method of communicating in a multi-processor system, the method comprising: a bus root transmitting requests directly to a first peripheral component, wherein the requests comprise read requests and write requests; the first peripheral component receiving the requests via a first bus in an internal bridge of the first peripheral component; and the internal bridge determining appropriate routing for the request, wherein appropriate routing comprises routing requests that are directed to a second peripheral component directly to a bus interface of the second peripheral component from the bridge via a second bus, and routing requests that are directed to the first peripheral component to a bus interface of the first peripheral component.
  • Embodiments of the invention include a computer-readable medium having stored thereon instructions that, when executed in a multi-processor system, cause a method of communicating to be performed, the method comprising: a bus root transmitting requests directly to a first peripheral component, wherein the requests comprise read requests and write requests; the first peripheral component receiving the requests via a first bus in an internal bridge of the first peripheral component; and the internal bridge determining appropriate routing for the request, wherein appropriate routing comprises routing requests that are directed to a second peripheral component directly to a bus interface of the second peripheral component from the bridge via a second bus, and routing requests that are directed to the first peripheral component to a bus interface of the first peripheral component.
  • the method further comprises the second peripheral component responding to a read request by transmitting data directly to the bus root via a third bus.
  • the method further comprises the second peripheral component receiving requests in an internal bridge of the second peripheral component.
  • the method further comprises: the first peripheral component transmitting a read request to the second peripheral component via the second bus; the second peripheral component transmitting data in response to the read request to the bus root via the third bus; and the bus root transmitting the data to the bridge via the first bus.
  • the instructions comprise hardware description language instructions that are usable to create an application-specific integrated circuit (ASIC) to perform the method.
  • aspects of the embodiments described above may be implemented as functionality programmed into any of a variety of circuitry, including but not limited to programmable logic devices (PLDs), such as field programmable gate arrays (FPGAs), programmable array logic (PAL) devices, electrically programmable logic and memory devices, and standard cell-based devices, as well as application-specific integrated circuits (ASICs) and fully custom integrated circuits.
  • other possibilities for implementing aspects of the embodiments include microcontrollers with memory (such as electronically erasable programmable read-only memory (EEPROM) or Flash memory), embedded microprocessors, firmware, and software.
  • aspects of the embodiments may be embodied in microprocessors having software-based circuit emulation, discrete logic (sequential and combinatorial), custom devices, fuzzy (neural) logic, quantum devices, and hybrids of any of the above device types, with underlying device technologies such as metal-oxide semiconductor field-effect transistor (MOSFET) technologies (e.g., complementary metal-oxide semiconductor (CMOS)), bipolar technologies (e.g., emitter-coupled logic (ECL)), polymer technologies (e.g., silicon-conjugated polymer and metal-conjugated polymer-metal structures), and mixed analog and digital.
  • the term "processor" includes a processor core or a portion of a processor.
  • although GPUs and one or more CPUs are usually referred to separately herein, in embodiments both a GPU and a CPU are included in a single integrated circuit package or on a single monolithic die; therefore, a single device performs the claimed method in such embodiments.
  • some or all of the hardware and software capability described herein may exist in a printer, a camera, a television, a digital versatile disc (DVD) player, a DVR or PVR, a handheld device, a mobile telephone, or some other device.
  • such computer readable media may store instructions that are to be executed by a computing device (e.g., personal computer, personal digital assistant, PVR, mobile device, or the like) or may be instructions (such as, for example, Verilog or another hardware description language) that when executed are designed to create a device (GPU, ASIC, or the like) or a software application that when operated performs aspects described above.
  • the claimed invention may be embodied in computer code (e.g., HDL, Verilog, etc.) that is created, stored, synthesized, and used to generate GDSII data (or its equivalent).
  • An ASIC may then be manufactured based on this data.

Abstract

Embodiments of a multi-processor architecture and method are described herein. Embodiments provide alternatives to the use of an external bridge integrated circuit (IC) architecture. For example, an embodiment multiplexes a peripheral bus such that multiple processors can use one peripheral interface slot without requiring an external bridge IC. Embodiments are usable with known bus protocols.

Description

MULTI-PROCESSOR ARCHITECTURE AND METHOD
RELATED APPLICATIONS
The present application claims the benefit of U.S. Application No. 12/245,686, filed on October 3, 2008. This application also claims the benefit of U.S. Patent Application No. 12/340,510, filed December 19, 2008, which is a continuation-in-part of U.S. Application No. 12/245,686. Both of these applications are incorporated by reference herein in their entirety.
TECHNICAL FIELD
The invention is in the field of data transfer in computer and other digital systems.
BACKGROUND
As computer and other digital systems become more complex and more capable, methods and hardware to enhance the transfer of data between system components or elements continually evolve. Data to be transferred include signals representing data, commands, or any other signals. Speed and efficiency of data transfer is particularly critical in systems that run very data-intensive applications, such as graphics applications. In typical systems, graphics processing capability is provided as a part of the central processing unit (CPU) capability, or provided by a separate special-purpose processor such as a graphics processing unit (GPU) that communicates with the CPU and assists in processing graphics data for applications such as video games, etc. One or more GPUs may be included in a system. In conventional multi-GPU systems, a bridged host interface (for example, a PCI express (PCIe®) bus) must share bandwidth between peer-to-peer traffic and host traffic. Traffic consists primarily of memory data transfers but may often include commands. Figure 1 is a block diagram of a prior art system 100 that includes a root 102. A typical root 102 is a computer chipset including a central processing unit (CPU). The system 100 also includes a host bridge 104 and two endpoints EP0 106a and EP1 106b. Endpoints are bus endpoints and can be various peripheral components, for example special-purpose processors such as graphics processing units (GPUs). The root 102 is coupled to the bridge 104 by one or more buses to communicate with peripheral components. Some peripheral component endpoints (such as GPUs) require a relatively large amount of bandwidth on the bus because of the large amount of data involved in their functions. It would be desirable to provide an architecture that reduced the number of components and yet provided efficient data transfer between components. For example, the cost of bridge integrated circuits (ICs) is relatively high. In addition, the size of a typical bridge IC is comparable to the size of a graphics processing unit (GPU), which requires additional printed circuit board area and could add to layer counts. Bridge ICs also require additional surrounding components for power, straps, clock, and possibly read-only memory (ROM).
BRIEF DESCRIPTION OF THE DRAWINGS
Figure 1 is a block diagram of a prior art processing system with peripheral components.
Figure 2 is a block diagram of portions of a multi-processor system with a multiplexed peripheral component bus, according to an embodiment.
Figure 3 is a block diagram of portions of a processing system with peripheral components, according to an embodiment.
Figure 4 is a more detailed block diagram of a processing system with peripheral components, according to an embodiment.
Figure 5 is a block diagram of an embodiment in which one bus endpoint includes an internal bridge.
Figure 6 is a block diagram of an embodiment that includes more than two bus endpoints, each including an internal bridge.
Figure 7 is a block diagram illustrating views of memory space from the perspectives of various components in a system, according to an embodiment.
DETAILED DESCRIPTION
Embodiments of a multi-processor architecture and method are described herein. Embodiments provide alternatives to the use of an external bridge integrated circuit (IC) architecture. For example, an embodiment multiplexes a peripheral bus such that multiple processors can use one peripheral interface slot without requiring an external bridge IC. Other embodiments include a system with multiple bus endpoints coupled to a bus root via a host bus bridge that is internal to at least one bus endpoint. In addition, the bus endpoints are directly coupled to each other. Embodiments are usable with known bus protocols.
Figure 2 is a block diagram of portions of a multi-processor system 700 with a multiplexed peripheral component bus, according to an embodiment. In this example system, there are two GPUs, a master GPU 702A and a slave GPU 702B. Each GPU 702 has 16 peripheral component interconnect express (PCIe®) transmit (TX) lanes and 16 PCIe® receive (RX) lanes. Each of GPUs 702 includes a respective data link layer 706 and a respective physical layer (PHY) 704. Eight of the TX/RX lanes of GPU 702A are connected to half of the TX/RX lanes of an x16 PCIe® connector, or slot 708. Eight of the TX/RX lanes of GPU 702B are connected to the remaining TX/RX lanes of the x16 PCIe® connector or slot 708. The remaining TX/RX lanes of each of GPU 702A and GPU 702B are connected to each other, providing a direct, high-speed connection between the GPUs 702.
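For illustration only (this sketch is not part of the patent text), the 8/8 lane partition just described can be modeled as a simple per-lane lookup. All identifiers and the lane ordering are assumptions made for this example.

```c
#include <stdio.h>

/* Hypothetical model of one GPU's 16 TX/RX lane pairs: lanes 0-7 wire to
 * this GPU's half of the x16 slot 708, and lanes 8-15 form the direct
 * high-speed link to the peer GPU. */
enum lane_target { TO_SLOT, TO_PEER_GPU };

static enum lane_target route_lane(int lane)
{
    return (lane < 8) ? TO_SLOT : TO_PEER_GPU;
}

int main(void)
{
    for (int lane = 0; lane < 16; lane++)
        printf("lane %2d -> %s\n", lane,
               route_lane(lane) == TO_SLOT ? "x16 slot (this GPU's half)"
                                           : "direct peer-GPU link");
    return 0;
}
```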
The PCIe® x16 slot 708 (which normally goes to one GPU) is split into two parts. Half of the slot is connected to GPU 702A and the other half is connected to GPU 702B. Each GPU 702 essentially echoes the other half of the data back to the other GPU 702. That is, data received by either GPU is forwarded to the other. Each GPU 702 sees all of the data received by the PCIe® bus, and internally each GPU 702 decides whether it is supposed to answer the request or command. Each GPU 702 then responds appropriately, or does nothing. Some data or commands, such as "Reset", are applicable to all of the GPUs 702.
From the system level point of view, or from the view of the peripheral bus, there is only one PCIe® load (device) on the PCIe® bus. Either GPU 702A or GPU 702B is accessed based on address. For example, for address domain access, master GPU 702A can be assigned to one half of the address domain and slave GPU 702B can be assigned to the other half. The system can operate in a Master/Slave mode or in Single/Multi-GPU modes, and the modes can be identified by straps.
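A minimal sketch of this address-domain split follows; the 4 GiB window, the midpoint boundary, and all names are invented for illustration and are not values from the disclosure.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define ADDR_SPACE_TOP 0x100000000ull      /* assumed 4 GiB bus window */
#define SPLIT          (ADDR_SPACE_TOP / 2)

typedef enum { GPU_MASTER, GPU_SLAVE } gpu_id;

/* Each GPU sees every transaction on the shared slot and decides internally
 * whether to respond; broadcast commands such as "Reset" apply to all GPUs. */
static bool gpu_claims(gpu_id id, uint64_t addr, bool is_broadcast)
{
    if (is_broadcast)
        return true;
    return (id == GPU_MASTER) ? (addr < SPLIT) : (addr >= SPLIT);
}

int main(void)
{
    printf("master claims 0x40000000: %d\n", gpu_claims(GPU_MASTER, 0x40000000ull, false));
    printf("slave claims 0xC0000000:  %d\n", gpu_claims(GPU_SLAVE, 0xC0000000ull, false));
    printf("slave claims Reset:       %d\n", gpu_claims(GPU_SLAVE, 0, true));
    return 0;
}
```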
Various data paths are identified by reference numbers. A reference clock (REF CLK) path is indicated by 711. An 8-lane RX-2 path is indicated by 709. An 8-lane RX-1 path is indicated by 713. An 8-lane TX-1 path is indicated by 715. Control signals 710 are non-PCIe® signals such as straps. The PHY 704 in each GPU 702 echoes the data to the proper lane or channel. Lane connections can be made in an order that helps to optimize silicon design and/or to support PCIe® slots with fewer than 16 lanes. Two GPUs are shown as an example of a system, but the architecture is scalable to n GPUs. In addition, GPUs 702 are one example of a peripheral component that can be coupled as described. Any other peripheral components that normally communicate with a peripheral component bus in a system could be similarly coupled.
Figure 3 is a block diagram of portions of a processing system 200 with peripheral components, according to an embodiment. System 200 includes a bus root 202 that is similar to the bus root 102 of Figure 1. The bus root 202 in an embodiment is a chipset including a CPU and system memory. The root 202 is coupled via a bus 209 to an endpoint EP0 206a that includes an internal bridge 205a. The bus 209 in an embodiment is a PCI express (PCIe®) bus, but embodiments are not so limited. EP0 206a is coupled to another endpoint EP1 206b, which includes an internal bridge 205b. EP0 206a and EP1 206b are coupled through their respective bridges via a bus 207. EP1 206b is coupled through its bridge 205b to the root 202 via a bus 211. Each of endpoints EP0 206a and EP1 206b includes respective local memories 208a and 208b. From the perspective of the root 202, buses 209 and 211 make up the transmit and receive lanes, respectively, of a standard bidirectional point-to-point data link.
In an embodiment, EP0 206a and EP1 206b are identical. As further explained below, in various embodiments, bridge 205b is not necessary, but is included for the purpose of having one version of an endpoint, such as one version of a GPU, rather than manufacturing two different versions. Note that EP0 may be used standalone by directly connecting it to root 202 via buses 209 and 207; similarly, EP1 may be used standalone by directly connecting it to root 202 via buses 207 and 211.
The inclusion of a bridge 205 eliminates the need for an external bridge such as bridge 104 of Figure 1 when both EP0 and EP1 are present. In contrast to the "Y" or "T" formation of Figure 1, system 200 moves data in a loop (in this case in a clockwise direction). The left endpoint EP0 can send data directly to the right endpoint EP1. The return path from EP1 to EP0 is through the root 202. As such, the root has the ability to reflect a packet of data coming in from EP1 back out to EP0. In other words, the architecture provides the appearance of a peer-to-peer transaction on the same pair of wires as is used for endpoint-to-root transactions.
EP0 206a and EP1 206b are also configurable to operate in the traditional configuration. That is, EP0 206a and EP1 206b are each configurable to communicate directly with the root 202 via buses 209 and 211, which are each bidirectional in such a configuration.
Figure 4 is a more detailed block diagram of a processing system with peripheral components, according to an embodiment. System 300 is similar to system 200, but additional details are shown. System 300 includes a bus root 302 coupled to a system memory 303. The bus root 302 is further coupled to an endpoint 305a via a bus 309. For purposes of illustrating a particular embodiment, endpoints 305a and 305b are GPUs, but embodiments are not so limited. GPU0 305a includes multiple clients. Clients include logic, such as shader units and decoder units, for performing tasks. The clients are coupled to an internal bridge through bus interface (I/F) logic, which controls all of the read operations and write operations performed by the GPU.
GPU0 305a is coupled to a GPU1 305b via a bus 307 from the internal bridge of GPU0 305a to the internal bridge of GPU1 305b. In an embodiment, GPU1 305b is identical to GPU0 305a and includes multiple clients, an internal bridge, and I/F logic. Each GPU typically connects to a dedicated local memory unit, often implemented as GDDR DRAM. GPU1 305b is coupled to the bus root 302 via a bus 311. In one embodiment, as the arrows indicate, data and other messages such as read requests and completions flow in a clockwise loop from the bus root 302 to GPU0 305a to GPU1 305b. In other embodiments, one of the GPUs 305 does not include a bridge. In yet other embodiments, data flows counterclockwise rather than clockwise.
In one embodiment, the protocol that determines data routing is communicated in such a way as to make the architecture appear the same as the architecture of Figure 1. In particular, the bridge in 305b must appear on link 307 to bridge 305a as an upstream port, whereas the corresponding attach point on the bridge in 305a must appear on link 309 to root 302 as a downstream port. Furthermore, the embedded bridge must be able to see its outgoing link as a return path for all requests it receives on its incoming link, even though the physical routing of the two links is different. This is achieved by setting the state of a Chain Mode configuration strap for each GPU. If the strap is set to zero, the bridge assumes both transmit and receive links are to an upstream port, either a root complex or a bridge device. If the strap is set to one, the bridge assumes a daisy-chain configuration.
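The strap decode can be sketched as follows; the struct and field names are hypothetical, and only the two strap meanings (zero: both links face an upstream port; one: daisy-chain) come from the description above.

```c
#include <stdio.h>

struct bridge_cfg {
    const char *tx_link_faces;  /* what the bridge assumes its transmit link reaches */
    const char *rx_link_faces;
    int loop_return_path;       /* 1: outgoing link is the return path for
                                   requests arriving on the incoming link */
};

static struct bridge_cfg decode_chain_mode_strap(int strap)
{
    struct bridge_cfg cfg;
    if (strap == 0) {           /* conventional: both links to an upstream port */
        cfg.tx_link_faces = "upstream port (root complex or bridge device)";
        cfg.rx_link_faces = "upstream port (root complex or bridge device)";
        cfg.loop_return_path = 0;
    } else {                    /* daisy-chain: the ring of Figures 3 and 4 */
        cfg.tx_link_faces = "next device in the ring";
        cfg.rx_link_faces = "previous device in the ring";
        cfg.loop_return_path = 1;
    }
    return cfg;
}

int main(void)
{
    struct bridge_cfg c = decode_chain_mode_strap(1);
    printf("TX faces: %s\nRX faces: %s\nloop return path: %d\n",
           c.tx_link_faces, c.rx_link_faces, c.loop_return_path);
    return 0;
}
```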
In another embodiment, the peer-to-peer bridging function of the root is a two-step process according to which GPU1 305b writes data to the system memory 303, or a buffer. Then, as a separate operation, GPU0 305a reads the data back via the bus root 302.
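The two-step transfer can be sketched as below. The dma_* helpers and the flat buffer are stand-ins invented for illustration; on real hardware each step would be a separate bus transaction through the root.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static uint8_t system_memory[4096];  /* stand-in for a buffer in system memory 303 */

/* Pass 1: GPU1 writes its data into the system-memory buffer via the root. */
static void dma_write_to_system_memory(const uint8_t *src, size_t n)
{
    memcpy(system_memory, src, n);
}

/* Pass 2: GPU0 reads the data back out of system memory via the bus root. */
static void dma_read_from_system_memory(uint8_t *dst, size_t n)
{
    memcpy(dst, system_memory, n);
}

static void gpu1_to_gpu0_transfer(const uint8_t *gpu1_data, uint8_t *gpu0_buf, size_t n)
{
    dma_write_to_system_memory(gpu1_data, n);
    /* ...GPU0 learns (e.g., via a flag or interrupt) that the data landed... */
    dma_read_from_system_memory(gpu0_buf, n);
}

int main(void)
{
    const uint8_t src[4] = { 1, 2, 3, 4 };
    uint8_t dst[4] = { 0 };
    gpu1_to_gpu0_transfer(src, dst, sizeof dst);
    printf("GPU0 received: %u %u %u %u\n", dst[0], dst[1], dst[2], dst[3]);
    return 0;
}
```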
The bus root 302 responds to requests normally, as if the internal bridge were an external bridge (as in Figure 1). In an embodiment, the bridge of GPU0 305a is configured to be active, while the bridge of GPU1 305b is configured to appear as a wire and simply pass data through. This allows the bus root 302 to see buses 309 and 311 as a normal peripheral interconnect bus. When the bus root reads from the bridge of GPU0 305a, this bridge sends the data to pass through the bridge of GPU1 305b and return to the bus root 302 as if the data came directly from GPU0 305a.
Figure 5 is a block diagram of a system 400 in which one of the multiple bus endpoints includes an internal bridge. System 400 includes a bus root 402, and an EP0 406a that includes a bridge 405a. EP0 406a is coupled to the root 402 through the bridge 405a via a bus 409, and also to EP1 406b through the bridge 405a via a bus 407. Each of endpoints EP0 406a and EP1 406b includes respective local memories 408a and 408b. Figure 6 is a block diagram of a system 500 including more than two bus endpoints, each including an internal bridge. System 500 includes a bus root 502, and an EP0 506a that includes a bridge 505a and a local memory 508a. System 500 further includes an EP1 506b that includes a bridge 505b and a local memory 508b, and an EP2 506c that includes a bridge 505c and an internal memory 508c.
EP0 506a is coupled to the root 502 through the bridge 505a via a bus 509, and also to EP1 506b through the bridge 505b via a bus 507a. EP1 506b is coupled to EP2 506c through the bridge 505c via a bus 507b. Other embodiments include additional endpoints that are added into the ring configuration. In other embodiments, the system includes more than two endpoints 506, but the rightmost endpoint does not include an internal bridge. In yet other embodiments, the flow of data is counterclockwise rather than clockwise as shown in the figures.
Referring again to Figure 4, there are two logical ports on the internal bridge according to an embodiment. One port is "on" in the bridge of GPU0 305a, and one port is "off" in the bridge of GPU1 305b. The bus root 302 may perform write operations by sending requests on bus 309. A standard addressing scheme indicates to the bridge to send the request to the bus I/F. If the request is for GPU1 305b, the bridge routes the request to bus 307. So in an embodiment, the respective internal bridges of GPU0 305a and GPU1 305b are programmed differently.
Figure 7 is a block diagram illustrating the division of bus address ranges and the view of memory space from the perspective of various components. With reference also to Figure 4, 602 is a view of memory from the perspective of the bus root, or host processor 302. 604 is a view of memory from the perspective of the GPU0 305a internal bridge. 606 is a view of memory from the perspective of the GPU1 305b internal bridge. The bus address range is divided into ranges for the GPU0 305a, GPU1 305b, and system 302 memory spaces. The GPU0 305a bridge is set up so that incoming requests to the GPU0 305a range are routed to its own local memory. Incoming requests from the root or from GPU0 305a itself to the GPU1 305b or system 302 ranges are routed to the output port of GPU0 305a. The GPU1 305b bridge is set up slightly differently, so that incoming requests to the GPU1 305b range are routed to its own local memory. Requests from GPU0 305a or from GPU1 305b itself to the root or GPU0 305a ranges are routed to the output port of GPU1 305b.
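A sketch of this per-bridge routing follows. The range boundaries are invented for illustration; only the routing rules (own range to local memory, everything else to the output port) come from the description of Figure 7.

```c
#include <stdint.h>
#include <stdio.h>

typedef enum { ROUTE_LOCAL_MEMORY, ROUTE_OUTPUT_PORT } route;

#define GPU0_BASE 0x00000000u   /* hypothetical range boundaries */
#define GPU1_BASE 0x40000000u
#define SYS_BASE  0x80000000u

/* GPU0's bridge: its own range goes to local memory; the GPU1 and system
 * ranges go to the output port (bus 307 toward GPU1). */
static route gpu0_bridge_route(uint32_t addr)
{
    return (addr < GPU1_BASE) ? ROUTE_LOCAL_MEMORY : ROUTE_OUTPUT_PORT;
}

/* GPU1's bridge: its own range goes to local memory; the root/system and
 * GPU0 ranges go to the output port (bus 311 toward the root). */
static route gpu1_bridge_route(uint32_t addr)
{
    return (addr >= GPU1_BASE && addr < SYS_BASE) ? ROUTE_LOCAL_MEMORY
                                                  : ROUTE_OUTPUT_PORT;
}

int main(void)
{
    printf("GPU0 bridge, addr 0x10000000 -> %s\n",
           gpu0_bridge_route(0x10000000u) == ROUTE_LOCAL_MEMORY ? "local memory" : "output port");
    printf("GPU1 bridge, addr 0x10000000 -> %s\n",
           gpu1_bridge_route(0x10000000u) == ROUTE_LOCAL_MEMORY ? "local memory" : "output port");
    return 0;
}
```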
The host sees the bus topology as being like the topology of Figure 1. GPU1 305b can make its own request to the host processor 302 through its own bridge, and it will pass through to the host processor 302. When the host processor 302 is returning a request, it goes through the bridge of GPU0 305a, which has logic for determining where requests and data are to be routed.
Write operations from GPU1 305b to GPU0 305a can be performed in two passes. GPU1 305b sends data to a memory location in the system memory 303. Then, separately, GPU0 305a reads the data after it learns that the data is in the system memory 303.
Completion messages for read data requests and other split-transaction operations must travel along the wires in the same direction as the requests. Therefore, in addition to the address-based request routing described above, device-based routing must be set up in a similar manner. For example, the internal bridge of GPU0 305a recognizes that the path for both requests and completion messages is via bus 307.
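One plausible way to picture the device-based setup (an assumption for illustration, not a mechanism spelled out in the text) is a small table in GPU0's bridge that maps each requester to the same outgoing port the address-based routing uses:

```c
#include <stdio.h>
#include <string.h>

#define PORT_LOCAL   0   /* to this GPU's own bus interface */
#define PORT_BUS_307 1   /* GPU0's ring output toward GPU1 */

struct id_route { const char *device; int port; };

/* Completions for remote requesters leave on bus 307, the same direction the
 * corresponding requests travel; they reach the root by passing through GPU1. */
static const struct id_route gpu0_completion_table[] = {
    { "GPU0", PORT_LOCAL },
    { "GPU1", PORT_BUS_307 },
    { "root", PORT_BUS_307 },
};

static int completion_port(const char *requester)
{
    for (size_t i = 0; i < sizeof gpu0_completion_table / sizeof gpu0_completion_table[0]; i++)
        if (strcmp(gpu0_completion_table[i].device, requester) == 0)
            return gpu0_completion_table[i].port;
    return PORT_LOCAL;
}

int main(void)
{
    printf("completion for a root-originated read leaves on port %d (bus 307)\n",
           completion_port("root"));
    return 0;
}
```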
An embodiment includes power management to improve power usage in lightly loaded usage cases. For example, in a usage case with little graphics processing, the logic of GPU1 305b is powered off and the bridging function in GPU1 305b is reduced to a simple passthrough function from input port to output port. Furthermore, the function of GPU0 305a is reduced so that it does not process transfers routed from the input port to the output port. In an embodiment, there is a separate power supply for the bridging function in GPU1 305b. Software detects the conditions under which to power down. Embodiments include a separate power regulator and/or separate internal power sources for bridges that are to be powered down separately from the rest of the logic on the device.
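The lightly loaded state can be sketched as a simple power-state record; the struct, its fields, and the software trigger are hypothetical names for the behavior described above.

```c
#include <stdbool.h>
#include <stdio.h>

struct gpu_power_state {
    bool core_logic_on;       /* shaders, decoders, and other clients */
    bool bridge_on;           /* bridge stays on its own supply/regulator */
    bool bridge_passthrough;  /* forward input port to output port only */
};

/* Software detects the low-load condition and powers GPU1 down to a
 * passthrough bridge, per the power-management embodiment above. */
static struct gpu_power_state gpu1_state(bool light_load)
{
    struct gpu_power_state s = { true, true, false };
    if (light_load) {
        s.core_logic_on = false;
        s.bridge_passthrough = true;
    }
    return s;
}

int main(void)
{
    struct gpu_power_state s = gpu1_state(true);
    printf("core on: %d, bridge on: %d, passthrough: %d\n",
           s.core_logic_on, s.bridge_on, s.bridge_passthrough);
    return 0;
}
```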
Even in embodiments that do not include the power management described above, system board area is conserved because an external bridge (as in Figure 1) is not required. The board area and power required for the external bridge and its pins are conserved. On the other hand, it is not required that each of the GPUs have its own internal bridge. In another embodiment, GPU1 305b does not have an internal bridge, as described with reference to Figure 5.
The architecture of system 300 is practical in a system that includes multiple slots for add-in circuit boards. Alternatively, system 300 is a soldered system, such as on a mobile device.
Buses 307, 309, and 311 can be PCIe® buses or any other similar peripheral interconnect bus.
Any circuits described herein could be implemented through the control of manufacturing processes and maskworks, which would then be used to manufacture the relevant circuitry. Such manufacturing process control and maskwork generation are known to those of ordinary skill in the art and include the storage of computer instructions on computer readable media including, for example, Verilog, VHDL, or instructions in other hardware description languages.
Embodiments of the invention include a system comprising: a peripheral component connector coupled to a peripheral component bus; and a plurality of peripheral components coupled directly to the peripheral component connector via a plurality of respective transmit/receive (TX/RX) lanes such that the plurality of peripheral components appear to the peripheral component bus as one peripheral device coupled to the peripheral component connector.
In an embodiment, the plurality of peripheral components are further coupled directly to each other via respective transmit/receive (TX/RX) lanes.
In an embodiment, at least one of the plurality of peripheral components is a graphics processing unit (GPU).
In an embodiment, the peripheral component connector is a peripheral component interconnect express (PCIe®) slot, and the peripheral component bus is a PCIe® bus.
In an embodiment, each peripheral component is configured to receive all of the data transmitted via the peripheral component connector and to decide which data is applicable to it.
In an embodiment, each peripheral component forwards all of the data received by it to the remaining peripheral components. In an embodiment, each of the peripheral components is accessed based on an address.
Embodiments of the invention include a multi-processor method, comprising: coupling a plurality of processors to a peripheral bus via respective groups of transmit/receive (TX/RX) lanes of the bus; coupling the plurality of processors to each other via TX/RX lanes of the plurality of processors that are not coupled to the bus; transmitting data directly from the bus to the plurality of processors, wherein each of the plurality of processors is addressable; and transmitting data directly between processors.
In an embodiment, the plurality of processors comprises a graphics processing unit (GPU).
In an embodiment, the plurality of processors comprises a plurality of GPUs, and the peripheral bus comprises a peripheral component interconnect express (PCIe®) bus to which the GPUs are directly coupled.
In an embodiment, the method further comprises the GPUs communicating directly with each other via respective TX/RX lanes.
In an embodiment, the method further comprises a GPU receiving data from the bus and transmitting the data to the other GPUs.
Embodiments of the invention include a computer readable medium storing computer readable instructions adapted to enable manufacture of a circuit comprising a peripheral component connector coupled to a peripheral component bus; and a plurality of peripheral components coupled directly to the peripheral component connector via a plurality of respective transmit/receive (TX/RX) lanes such that the plurality of peripheral components appear to the peripheral component bus as one peripheral device coupled to the peripheral component connector, wherein the plurality of peripheral components are further coupled directly to each other via respective transmit/receive (TX/RX) lanes.
In an embodiment, the instructions comprise hardware description language instructions.
Embodiments of the invention include a computer-readable medium having instructions stored thereon that, when executed in a multi-processor system, cause a method to be performed, the method comprising: coupling a plurality of processors to a peripheral bus via respective groups of transmit/receive (TX/RX) lanes of the bus; coupling the plurality of processors to each other via TX/RX lanes of the plurality of processors that are not coupled to the bus; transmitting data directly from the bus to the plurality of processors, wherein each of the plurality of processors is addressable; and transmitting data directly between processors.
In an embodiment, the plurality of processors comprises a graphics processing unit (GPU).
In an embodiment, the plurality of processors comprises a plurality of GPUs, and the peripheral bus comprises a peripheral component interconnect express (PCIe®) bus to which the GPUs are directly coupled.
In an embodiment, the method further comprises the GPUs communicating directly with each other via respective TX/RX lanes.
In an embodiment, the method further comprises a GPU receiving data from the bus and transmitting the data to the other GPUs.
Embodiments of the invention include a system comprising: a bus root comprising a central processing unit configurable to communicate with peripheral components via a bus; and a first peripheral component coupled directly to the bus root and further coupled directly to a second peripheral component, the first peripheral component comprising an internal bridge configurable to receive data and to transmit data, wherein receiving and transmitting comprises direct communication between the first peripheral component and the second peripheral component.
In an embodiment, the first peripheral component and the second peripheral component are each further configurable to communicate directly with the bus root to transmit and receive data.
In an embodiment, receiving and transmitting further comprise transmitting requests and data from the second peripheral component to the first peripheral component via the bus root.
In an embodiment, receiving and transmitting further comprise transmitting requests or data from the first peripheral component to the second peripheral component via the internal bridge of the first peripheral component.
In an embodiment, the bus root is configurable to perform write operations, wherein a write operation to the second peripheral component comprises the bus root transmitting a write request to the internal bridge of the first peripheral component, and the internal bridge of the first peripheral component transmitting the write request directly to the second peripheral component.
In an embodiment, the write request is received by an internal bus interface of the second peripheral component.
In an embodiment, the bus root is configurable to perform write operations, wherein a write operation to the first peripheral component comprises the bus root transmitting a write request to the bridge, and the bridge transmitting the write request to an internal bus interface of the first peripheral component.
In an embodiment, the bus root is configurable to perform read operations, wherein a read operation to the first peripheral component comprises the bus root transmitting a read request to the bridge, and the bridge transmitting the read request to an internal bus interface of the first peripheral component.
In an embodiment, the bus root is configurable to perform read operations, wherein a read operation to the second peripheral component comprises the bus root transmitting a read request to the bridge, and the bridge transmitting the read request directly to the second peripheral component.
In an embodiment, the bridge comprises logic configurable to determine routing for received write requests, received read requests, and received data.
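A minimal sketch of such routing logic, assuming a single address aperture per component (the aperture bounds and names are illustrative only, not taken from the disclosure), might look as follows in C: requests falling inside the first component's aperture are delivered to its local bus interface, while all others are forwarded directly to the second peripheral component.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical routing decision made by the internal bridge of the first
 * peripheral component; the aperture bounds are illustrative only. */
typedef enum {
    ROUTE_LOCAL_BUS_INTERFACE,   /* request targets the first component      */
    ROUTE_SECOND_COMPONENT       /* request is forwarded over the second bus */
} route_t;

typedef struct {
    uint64_t local_base;   /* start of the first component's aperture */
    uint64_t local_limit;  /* one past the end of that aperture       */
} bridge_t;

static route_t route_request(const bridge_t *br, uint64_t addr) {
    bool local = addr >= br->local_base && addr < br->local_limit;
    return local ? ROUTE_LOCAL_BUS_INTERFACE : ROUTE_SECOND_COMPONENT;
}

int main(void) {
    bridge_t br = { .local_base = 0x00000000u, .local_limit = 0x10000000u };
    printf("0x08000000 -> %s\n",
           route_request(&br, 0x08000000u) == ROUTE_LOCAL_BUS_INTERFACE
               ? "local bus interface" : "second component");
    printf("0x18000000 -> %s\n",
           route_request(&br, 0x18000000u) == ROUTE_LOCAL_BUS_INTERFACE
               ? "local bus interface" : "second component");
    return 0;
}
```

The same decision applies uniformly to write requests, read requests, and returned data, which is why a single decode function suffices in this sketch.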
In an embodiment, the second peripheral component comprises an internal bridge configurable to receive data and to transmit data, and further configurable to be powered down when the internal bridge is not used to receive data and to transmit data.
In an embodiment, the second peripheral component further comprises a dedicated power source for the use of the internal bridge.
In an embodiment, the first peripheral component and the second peripheral component each comprise a graphics processing unit (GPU). In an embodiment, each of the first peripheral component and the second peripheral component further comprises a respective plurality of clients coupled to respective bus interfaces, wherein the clients comprise video processing logic comprising shader units and encoder/decoder units.
Embodiments of the invention include a method of communicating in a multiprocessor system, the method comprising: a bus root transmitting requests directly to a first peripheral component, wherein the requests comprise read requests and write requests; the first peripheral component receiving the requests via a first bus in an internal bridge of the first peripheral component; and the internal bridge determining appropriate routing for the request, wherein appropriate routing comprises: routing requests that are directed to a second peripheral component directly to a bus interface of the second peripheral component from the bridge via a second bus; and routing requests that are directed to the first peripheral component to a bus interface of the first peripheral component.
In an embodiment, further comprising the second component responding to a read request by transmitting data directly to the bus root via a third bus.
In an embodiment, further comprising the second peripheral component receiving requests in an internal bridge of the second peripheral component.
In an embodiment, further comprising: the first peripheral component transmitting a read request to the second peripheral component via the second bus; the second peripheral component transmitting data in response to the read request to the bus root via the third bus; and the bus root transmitting the data to the bridge via the first bus.
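To make this asymmetric return path concrete, the following hypothetical C model traces the three buses named above; the memory array and function names are assumptions for illustration. The read request travels over the second bus, but the response returns by way of the bus root rather than retracing the request path.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical trace of the read path described above. The bus numbering
 * follows the embodiment: bus 1 couples the bus root and the first
 * component's bridge, bus 2 couples the first component to the second,
 * and bus 3 couples the second component back to the bus root. */

static uint32_t second_component_memory[16];   /* stand-in for local memory */

static uint32_t bus1_forward_to_bridge(uint32_t data) {
    /* Bus 1: the bus root delivers the data to the internal bridge of the
     * first peripheral component. */
    return data;
}

static uint32_t bus3_return_to_root(uint32_t data) {
    /* Bus 3: the second component sends the read data directly to the bus
     * root, which then relays it over bus 1. */
    return bus1_forward_to_bridge(data);
}

static uint32_t bus2_read_request(unsigned index) {
    /* Bus 2: the first component's read request arrives at the second
     * component; note that the response does not retrace bus 2. */
    return bus3_return_to_root(second_component_memory[index & 15u]);
}

int main(void) {
    second_component_memory[3] = 0xCAFEF00Du;
    printf("first component read back 0x%x\n", (unsigned)bus2_read_request(3));
    return 0;
}
```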
Embodiments of the invention include a computer-readable medium having stored thereon instructions that, when executed in a multi-processor system, cause a method of communicating to be performed, the method comprising: a bus root transmitting requests directly to a first peripheral component, wherein the requests comprise read requests and write requests; the first peripheral component receiving the requests via a first bus in an internal bridge of the first peripheral component; and the internal bridge determining appropriate routing for the request, wherein appropriate routing comprises: routing requests that are directed to a second peripheral component directly to a bus interface of the second peripheral component from the bridge via a second bus; and routing requests that are directed to the first peripheral component to a bus interface of the first peripheral component.
In an embodiment, the method further comprises the second component responding to a read request by transmitting data directly to the bus root via a third bus.
In an embodiment, the method further comprises the second peripheral component receiving requests in an internal bridge of the second peripheral component.
In an embodiment, the method further comprises: the first peripheral component transmitting a read request to the second peripheral component via the second bus; the second peripheral component transmitting data in response to the read request to the bus root via the third bus; and the bus root transmitting the data to the bridge via the first bus.
In an embodiment, the instructions comprise hardware description language instructions that are usable to create an application-specific integrated circuit (ASIC) to perform the method.
Aspects of the embodiments described above may be implemented as functionality programmed into any of a variety of circuitry, including but not limited to programmable logic devices (PLDs), such as field programmable gate arrays (FPGAs), programmable array logic (PAL) devices, electrically programmable logic and memory devices, and standard cell-based devices, as well as application specific integrated circuits (ASICs) and fully custom integrated circuits. Some other possibilities for implementing aspects of the embodiments include microcontrollers with memory (such as electronically erasable programmable read only memory (EEPROM), Flash memory, etc.), embedded microprocessors, firmware, software, etc. Furthermore, aspects of the embodiments may be embodied in microprocessors having software-based circuit emulation, discrete logic (sequential and combinatorial), custom devices, fuzzy (neural) logic, quantum devices, and hybrids of any of the above device types. Of course, the underlying device technologies may be provided in a variety of component types, e.g., metal-oxide semiconductor field-effect transistor (MOSFET) technologies such as complementary metal-oxide semiconductor (CMOS), bipolar technologies such as emitter-coupled logic (ECL), polymer technologies (e.g., silicon-conjugated polymer and metal-conjugated polymer-metal structures), mixed analog and digital, etc.
The term "processor" as used in the specification and claims includes a processor core or a portion of a processor Further, although one or more GPUs and one or more CPUs are usually referred to separately herein, in embodiments both a GPU and a CPU are included in a single integrated circuit package or on a single monolithic die Therefore a single device performs the claimed method in such embodiments,
Unless the context clearly requires otherwise, throughout the description and the claims, the words "comprise," "comprising," and the like are to be construed in an inclusive sense as opposed to an exclusive or exhaustive sense; that is to say, in a sense of "including, but not limited to." Words using the singular or plural number also include the plural or singular number, respectively. Additionally, the words "herein," "hereunder," "above," "below," and words of similar import, when used in this application, refer to this application as a whole and not to any particular portions of this application. When the word "or" is used in reference to a list of two or more items, that word covers all of the following interpretations of the word: any of the items in the list, all of the items in the list, and any combination of the items in the list.
The above description of illustrated embodiments of the method and system is not intended to be exhaustive or to limit the invention to the precise forms disclosed. While specific embodiments of, and examples for, the method and system are described herein for illustrative purposes, various equivalent modifications are possible within the scope of the invention, as those skilled in the relevant art will recognize. The teachings of the disclosure provided herein can be applied to other systems, not only for systems including graphics processing or video processing, as described above. The various operations described may be performed in a very wide variety of architectures and distributed differently than described. In addition, though many configurations are described herein, none are intended to be limiting or exclusive.
In other embodiments, some or all of the hardware and software capability described herein may exist in a printer, a camera, a television, a digital versatile disc (DVD) player, a DVR or PVR, a handheld device, a mobile telephone, or some other device. The elements and acts of the various embodiments described above can be combined to provide further embodiments. These and other changes can be made to the method and system in light of the above detailed description.
In general, in the following claims, the terms used should not be construed to limit the method and system to the specific embodiments disclosed in the specification and the claims, but should be construed to include any processing systems and methods that operate under the claims. Accordingly, the method and system is not limited by the disclosure, but instead the scope of the method and system is to be determined entirely by the claims.
While certain aspects of the method and system are presented below in certain claim forms, the inventors contemplate the various aspects of the method and system in any number of claim forms. For example, while only one aspect of the method and system may be recited as embodied in a computer-readable medium, other aspects may likewise be embodied in a computer-readable medium. Such computer-readable media may store instructions that are to be executed by a computing device (e.g., personal computer, personal digital assistant, PVR, mobile device, or the like) or may be instructions (such as, for example, Verilog or a hardware description language) that when executed are designed to create a device (GPU, ASIC, or the like) or software application that when operated performs aspects described above. The claimed invention may be embodied in computer code (e.g., HDL, Verilog, etc.) that is created, stored, synthesized, and used to generate GDSII data (or its equivalent). An ASIC may then be manufactured based on this data.
Accordingly, the inventors reserve the right to add additional claims after filing the application to pursue such additional claim forms for other aspects of the method and system.

Claims

What is claimed is:
1. A system comprising: a peripheral component connector coupled to a peripheral component bus; and a plurality of peripheral components coupled directly to the peripheral component connector via a plurality of respective transmit/receive (TX/RX) lanes such that the plurality of peripheral components appear to the peripheral component bus as one peripheral device coupled to the peripheral component connector.
2. The system of claim 1, wherein the plurality of peripheral components are further coupled directly to each other via respective transmit/receive (TX/RX) lanes.
3. The system of claim 1, wherein at least one of the plurality of peripheral components is a graphics processing unit (GPU).
4. The system of claim 1, wherein the peripheral component connector is a peripheral component interconnect express (PCIe®) slot, and wherein the peripheral component bus is a PCIe® bus.
5. The system of claim 2, wherein each peripheral component is configured to receive all of the data transmitted via the peripheral component connector and to decide which data is applicable.
6. The system of claim 2, wherein each peripheral component forwards all of the data received by it to the remaining peripheral components.
7. The system of claim 1, wherein each of the peripheral components is accessed based on an address.
8. A multi-processor method, comprising: coupling a plurality of processors to a peripheral bus via respective groups of transmit/receive (TX/RX) lanes of the bus; coupling the plurality of processors to each other via TX/RX lanes of the plurality of processors that are not coupled to the bus; transmitting data directly from the bus to the plurality of processors, wherein each of the plurality of processors is addressable; and transmitting data directly between processors.
9. The method of claim 8, wherein the plurality of processors comprises a graphics processing unit (GPU).
10. The method of claim 8, wherein the plurality of processors comprises a plurality of GPUs, and wherein the peripheral bus comprises a peripheral component interconnect express (PCIe®) bus to which the GPUs are directly coupled.
11. The method of claim 10, further comprising the GPUs communicating directly with each other via respective TX/RX lanes.
12. The method of claim 11, further comprising a GPU receiving data from the bus and transmitting the data to the other GPUs.
13. A computer-readable medium storing computer-readable instructions adapted to enable manufacture of a circuit comprising: a peripheral component connector coupled to a peripheral component bus; and a plurality of peripheral components coupled directly to the peripheral component connector via a plurality of respective transmit/receive (TX/RX) lanes such that the plurality of peripheral components appear to the peripheral component bus as one peripheral device coupled to the peripheral component connector, wherein the plurality of peripheral components are further coupled directly to each other via respective transmit/receive (TX/RX) lanes.
14. The computer-readable medium of claim 13, wherein the instructions comprise hardware description language instructions.
15. A computer-readable medium having instructions stored thereon that, when executed in a multi-processor system, cause a method to be performed, the method comprising: coupling a plurality of processors to a peripheral bus via respective groups of transmit/receive (TX/RX) lanes of the bus; coupling the plurality of processors to each other via TX/RX lanes of the plurality of processors that are not coupled to the bus; transmitting data directly from the bus to the plurality of processors, wherein each of the plurality of processors is addressable; and transmitting data directly between processors.
16. The computer-readable medium of claim 15, wherein the plurality of processors comprises a graphics processing unit (GPU).
17. The computer-readable medium of claim 15, wherein the plurality of processors comprises a plurality of GPUs, and wherein the peripheral bus comprises a peripheral component interconnect express (PCIe®) bus to which the GPUs are directly coupled.
18. The computer-readable medium of claim 17, the method further comprising the GPUs communicating directly with each other via respective TX/RX lanes.
19. The computer-readable medium of claim 18, the method further comprising a GPU receiving data from the bus and transmitting the data to the other GPUs.
PCT/US2009/059594 2008-10-03 2009-10-05 Multi-processor architecture and method WO2010040144A1 (en)

Priority Applications (4)

Application Number Priority Date Filing Date Title
CN200980147694.9A CN102227709B (en) 2008-10-03 2009-10-05 Multi-processor architecture and method
EP09737270.0A EP2342626B1 (en) 2008-10-03 2009-10-05 Multi-processor architecture and method
KR1020117010206A KR101533761B1 (en) 2008-10-03 2009-10-05 Multi-processor architecture and method
JP2011530294A JP2012504835A (en) 2008-10-03 2009-10-05 Multiprocessor architecture and method

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US12/245,686 US8892804B2 (en) 2008-10-03 2008-10-03 Internal BUS bridge architecture and method in multi-processor systems
US12/245,686 2008-10-03
US12/340,510 2008-12-19
US12/340,510 US8373709B2 (en) 2008-10-03 2008-12-19 Multi-processor architecture and method

Publications (1)

Publication Number Publication Date
WO2010040144A1 true WO2010040144A1 (en) 2010-04-08

Family

ID=41426838

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2009/059594 WO2010040144A1 (en) 2008-10-03 2009-10-05 Multi-processor architecture and method

Country Status (6)

Country Link
US (4) US8373709B2 (en)
EP (1) EP2342626B1 (en)
JP (1) JP2012504835A (en)
KR (1) KR101533761B1 (en)
CN (2) CN105005542B (en)
WO (1) WO2010040144A1 (en)

Families Citing this family (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8373709B2 (en) * 2008-10-03 2013-02-12 Ati Technologies Ulc Multi-processor architecture and method
US8892804B2 (en) 2008-10-03 2014-11-18 Advanced Micro Devices, Inc. Internal BUS bridge architecture and method in multi-processor systems
US8751720B2 (en) 2010-11-08 2014-06-10 Moon J. Kim Computationally-networked unified data bus
CN102810085A (en) * 2011-06-03 2012-12-05 鸿富锦精密工业(深圳)有限公司 PCI-E expansion system and method
US10817043B2 (en) * 2011-07-26 2020-10-27 Nvidia Corporation System and method for entering and exiting sleep mode in a graphics subsystem
CN102931546A (en) * 2011-08-10 2013-02-13 鸿富锦精密工业(深圳)有限公司 Connector assembly
CN103105895A (en) * 2011-11-15 2013-05-15 辉达公司 Computer system and display cards thereof and method for processing graphs of computer system
CN103631549A (en) * 2012-08-22 2014-03-12 慧荣科技股份有限公司 Picture processing device and external connection picture device
US8996781B2 (en) * 2012-11-06 2015-03-31 OCZ Storage Solutions Inc. Integrated storage/processing devices, systems and methods for performing big data analytics
US20140149528A1 (en) * 2012-11-29 2014-05-29 Nvidia Corporation Mpi communication of gpu buffers
WO2015016843A1 (en) * 2013-07-30 2015-02-05 Hewlett-Packard Development Company, L.P. Connector for a computing assembly
US9634942B2 (en) 2013-11-11 2017-04-25 Amazon Technologies, Inc. Adaptive scene complexity based on service quality
US9582904B2 (en) 2013-11-11 2017-02-28 Amazon Technologies, Inc. Image composition based on remote object data
US9604139B2 (en) 2013-11-11 2017-03-28 Amazon Technologies, Inc. Service for generating graphics object data
US9578074B2 (en) 2013-11-11 2017-02-21 Amazon Technologies, Inc. Adaptive content transmission
US9805479B2 (en) 2013-11-11 2017-10-31 Amazon Technologies, Inc. Session idle optimization for streaming server
US9641592B2 (en) 2013-11-11 2017-05-02 Amazon Technologies, Inc. Location of actor resources
US9596280B2 (en) 2013-11-11 2017-03-14 Amazon Technologies, Inc. Multiple stream content presentation
US10261570B2 (en) * 2013-11-27 2019-04-16 Intel Corporation Managing graphics power consumption and performance
US10535322B2 (en) 2015-07-24 2020-01-14 Hewlett Packard Enterprise Development Lp Enabling compression of a video output
US10311013B2 (en) * 2017-07-14 2019-06-04 Facebook, Inc. High-speed inter-processor communications
CN107562674B (en) * 2017-08-28 2020-03-20 上海集成电路研发中心有限公司 Bus protocol asynchronous logic circuit implementation device embedded into processor
JP6635209B2 (en) * 2018-04-18 2020-01-22 富士通クライアントコンピューティング株式会社 Information processing system
JP6579255B1 (en) 2018-12-28 2019-09-25 富士通クライアントコンピューティング株式会社 Information processing system and relay device
JP6573046B1 (en) 2019-06-05 2019-09-11 富士通クライアントコンピューティング株式会社 Information processing apparatus, information processing system, and information processing program
US11637773B2 (en) * 2020-02-11 2023-04-25 Fungible, Inc. Scaled-out transport as connection proxy for device-to-device communications
CN115981853A (en) * 2022-12-23 2023-04-18 摩尔线程智能科技(北京)有限责任公司 GPU (graphics processing Unit) interconnection architecture, method for realizing GPU interconnection architecture and computing equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060267993A1 (en) * 2005-05-27 2006-11-30 James Hunkins Compositing in multiple video processing unit (VPU) systems
US20060271713A1 (en) * 2005-05-27 2006-11-30 Ati Technologies Inc. Computing device with flexibly configurable expansion slots, and method of operation
US20070016711A1 (en) * 2005-07-13 2007-01-18 Jet Way Information Co., Ltd. Interfacing structure for multiple graphic

Family Cites Families (102)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5712664A (en) * 1993-10-14 1998-01-27 Alliance Semiconductor Corporation Shared memory graphics accelerator system
JP3454294B2 (en) * 1994-06-20 2003-10-06 インターナショナル・ビジネス・マシーンズ・コーポレーション Multiple bus information processing system and bridge circuit
US5913045A (en) * 1995-12-20 1999-06-15 Intel Corporation Programmable PCI interrupt routing mechanism
US6359624B1 (en) * 1996-02-02 2002-03-19 Kabushiki Kaisha Toshiba Apparatus having graphic processor for high speed performance
US5999183A (en) * 1997-07-10 1999-12-07 Silicon Engineering, Inc. Apparatus for creating a scalable graphics system with efficient memory and bandwidth usage
US6173374B1 (en) * 1998-02-11 2001-01-09 Lsi Logic Corporation System and method for peer-to-peer accelerated I/O shipping between host bus adapters in clustered computer network
US6560688B1 (en) * 1998-10-01 2003-05-06 Advanced Micro Devices, Inc. System and method for improving accelerated graphics port systems
US6477623B2 (en) * 1998-10-23 2002-11-05 Micron Technology, Inc. Method for providing graphics controller embedded in a core logic unit
JP2000222590A (en) * 1999-01-27 2000-08-11 Nec Corp Method and device for processing image
US6317813B1 (en) * 1999-05-18 2001-11-13 Silicon Integrated Systems Corp. Method for arbitrating multiple memory access requests in a unified memory architecture via a non unified memory controller
US6473086B1 (en) * 1999-12-09 2002-10-29 Ati International Srl Method and apparatus for graphics processing using parallel graphics processors
US6662257B1 (en) * 2000-05-26 2003-12-09 Ati International Srl Multiple device bridge apparatus and method thereof
US6587905B1 (en) * 2000-06-29 2003-07-01 International Business Machines Corporation Dynamic data bus allocation
US6668296B1 (en) * 2000-06-30 2003-12-23 Hewlett-Packard Development Company, L.P. Powering a notebook across a USB interface
US6606614B1 (en) * 2000-08-24 2003-08-12 Silicon Recognition, Inc. Neural network integrated circuit with fewer pins
US6802021B1 (en) * 2001-01-23 2004-10-05 Adaptec, Inc. Intelligent load balancing for a multi-path storage system
US7009618B1 (en) * 2001-07-13 2006-03-07 Advanced Micro Devices, Inc. Integrated I/O Remapping mechanism
US7340555B2 (en) * 2001-09-28 2008-03-04 Dot Hill Systems Corporation RAID system for performing efficient mirrored posted-write operations
US20030158886A1 (en) * 2001-10-09 2003-08-21 Walls Jeffrey J. System and method for configuring a plurality of computers that collectively render a display
US6700580B2 (en) * 2002-03-01 2004-03-02 Hewlett-Packard Development Company, L.P. System and method utilizing multiple pipelines to render graphical data
US6567880B1 (en) * 2002-03-28 2003-05-20 Compaq Information Technologies Group, L.P. Computer bridge interfaces for accelerated graphics port and peripheral component interconnect devices
US6968415B2 (en) * 2002-03-29 2005-11-22 International Business Machines Corporation Opaque memory region for I/O adapter transparent bridge
US7069365B2 (en) * 2002-04-01 2006-06-27 Sun Microsystems, Inc. System and method for controlling multiple devices via general purpose input/output (GPIO) hardware
US6947051B2 (en) * 2003-02-18 2005-09-20 Microsoft Corporation Video memory management
US6874042B2 (en) * 2003-03-11 2005-03-29 Dell Products L.P. System and method for using a switch to route peripheral and graphics data on an interconnect
US7068278B1 (en) * 2003-04-17 2006-06-27 Nvidia Corporation Synchronized graphics processing units
US7093033B2 (en) * 2003-05-20 2006-08-15 Intel Corporation Integrated circuit capable of communicating using different communication protocols
US20040257369A1 (en) * 2003-06-17 2004-12-23 Bill Fang Integrated video and graphics blender
US7119808B2 (en) * 2003-07-15 2006-10-10 Alienware Labs Corp. Multiple parallel processor computer graphics system
TWI284275B (en) * 2003-07-25 2007-07-21 Via Tech Inc Graphic display architecture and control chip set therein
US6956579B1 (en) * 2003-08-18 2005-10-18 Nvidia Corporation Private addressing in a multi-processor graphics processing system
TWI221980B (en) * 2003-09-02 2004-10-11 Prolific Technology Inc Apparatus for multiple host access to storage media
US7822105B2 (en) * 2003-09-02 2010-10-26 Sirf Technology, Inc. Cross-correlation removal of carrier wave jamming signals
US7171499B2 (en) * 2003-10-10 2007-01-30 Advanced Micro Devices, Inc. Processor surrogate for use in multiprocessor systems and multiprocessor system using same
TWI221214B (en) * 2003-10-15 2004-09-21 Via Tech Inc Interrupt signal control system and control method
US7782325B2 (en) * 2003-10-22 2010-08-24 Alienware Labs Corporation Motherboard for supporting multiple graphics cards
US7808499B2 (en) * 2003-11-19 2010-10-05 Lucid Information Technology, Ltd. PC-based computing system employing parallelized graphics processing units (GPUS) interfaced with the central processing unit (CPU) using a PC bus and a hardware graphics hub having a router
US7119810B2 (en) * 2003-12-05 2006-10-10 Siemens Medical Solutions Usa, Inc. Graphics processing unit for simulation or medical diagnostic imaging
US7567252B2 (en) * 2003-12-09 2009-07-28 Microsoft Corporation Optimizing performance of a graphics processing unit for efficient execution of general matrix operations
US7149848B2 (en) * 2004-02-26 2006-12-12 Hewlett-Packard Development Company, L.P. Computer system cache controller and methods of operation of a cache controller
US7289125B2 (en) * 2004-02-27 2007-10-30 Nvidia Corporation Graphics device clustering with PCI-express
US7424564B2 (en) * 2004-03-23 2008-09-09 Qlogic, Corporation PCI—express slot for coupling plural devices to a host system
US7246190B2 (en) * 2004-04-21 2007-07-17 Hewlett-Packard Development Company, L.P. Method and apparatus for bringing bus lanes in a computer system using a jumper board
US6985152B2 (en) * 2004-04-23 2006-01-10 Nvidia Corporation Point-to-point bus bridging without a bridge controller
US7663633B1 (en) * 2004-06-25 2010-02-16 Nvidia Corporation Multiple GPU graphics system for implementing cooperative graphics instruction execution
US7062594B1 (en) * 2004-06-30 2006-06-13 Emc Corporation Root complex connection system
US7721118B1 (en) * 2004-09-27 2010-05-18 Nvidia Corporation Optimizing power and performance for multi-processor graphics processing
TWM264547U (en) * 2004-11-08 2005-05-11 Asustek Comp Inc Main board
TWI274255B (en) * 2004-11-08 2007-02-21 Asustek Comp Inc Motherboard
US7576745B1 (en) * 2004-11-17 2009-08-18 Nvidia Corporation Connecting graphics adapters
US7598958B1 (en) * 2004-11-17 2009-10-06 Nvidia Corporation Multi-chip graphics processing unit apparatus, system, and method
US7633505B1 (en) * 2004-11-17 2009-12-15 Nvidia Corporation Apparatus, system, and method for joint processing in graphics processing units
US8066515B2 (en) * 2004-11-17 2011-11-29 Nvidia Corporation Multiple graphics adapter connection systems
US7477256B1 (en) * 2004-11-17 2009-01-13 Nvidia Corporation Connecting graphics adapters for scalable performance
US7275123B2 (en) * 2004-12-06 2007-09-25 Nvidia Corporation Method and apparatus for providing peer-to-peer data transfer within a computing environment
US7451259B2 (en) 2004-12-06 2008-11-11 Nvidia Corporation Method and apparatus for providing peer-to-peer data transfer within a computing environment
US7545380B1 (en) 2004-12-16 2009-06-09 Nvidia Corporation Sequencing of displayed images for alternate frame rendering in a multi-processor graphics system
US7372465B1 (en) * 2004-12-17 2008-05-13 Nvidia Corporation Scalable graphics processing for remote display
US7383412B1 (en) * 2005-02-28 2008-06-03 Nvidia Corporation On-demand memory synchronization for peripheral systems with multiple parallel processors
US7246191B2 (en) * 2005-03-31 2007-07-17 Intel Corporation Method and apparatus for memory interface
US7616207B1 (en) * 2005-04-25 2009-11-10 Nvidia Corporation Graphics processing system including at least three bus devices
US7793029B1 (en) * 2005-05-17 2010-09-07 Nvidia Corporation Translation device apparatus for configuring printed circuit board connectors
US7663635B2 (en) * 2005-05-27 2010-02-16 Ati Technologies, Inc. Multiple video processor unit (VPU) memory mapping
US8054314B2 (en) * 2005-05-27 2011-11-08 Ati Technologies, Inc. Applying non-homogeneous properties to multiple video processing units (VPUs)
US7649537B2 (en) * 2005-05-27 2010-01-19 Ati Technologies, Inc. Dynamic load balancing in multiple video processing unit (VPU) systems
US20060282604A1 (en) * 2005-05-27 2006-12-14 Ati Technologies, Inc. Methods and apparatus for processing graphics data using multiple processing circuits
JP2007008679A (en) * 2005-06-30 2007-01-18 Toshiba Corp Paper delivering device
US20070038794A1 (en) * 2005-08-10 2007-02-15 Purcell Brian T Method and system for allocating a bus
US7629978B1 (en) * 2005-10-31 2009-12-08 Nvidia Corporation Multichip rendering with state control
US7525548B2 (en) * 2005-11-04 2009-04-28 Nvidia Corporation Video processing with multiple graphical processing units
US8294731B2 (en) 2005-11-15 2012-10-23 Advanced Micro Devices, Inc. Buffer management in vector graphics hardware
US8412872B1 (en) * 2005-12-12 2013-04-02 Nvidia Corporation Configurable GPU and method for graphics processing using a configurable GPU
US7340557B2 (en) * 2005-12-15 2008-03-04 Via Technologies, Inc. Switching method and system for multiple GPU support
US7325086B2 (en) * 2005-12-15 2008-01-29 Via Technologies, Inc. Method and system for multiple GPU support
US7623131B1 (en) * 2005-12-16 2009-11-24 Nvidia Corporation Graphics processing systems with multiple processors connected in a ring topology
JP4869714B2 (en) * 2006-01-16 2012-02-08 株式会社ソニー・コンピュータエンタテインメント Information processing apparatus, signal transmission method, and bridge
US7461195B1 (en) * 2006-03-17 2008-12-02 Qlogic, Corporation Method and system for dynamically adjusting data transfer rates in PCI-express devices
TW200737034A (en) * 2006-03-23 2007-10-01 Micro Star Int Co Ltd Connector module of graphic card and the device of motherboard thereof
US7970956B2 (en) * 2006-03-27 2011-06-28 Ati Technologies, Inc. Graphics-processing system and method of broadcasting write requests to multiple graphics devices
US8130227B2 (en) 2006-05-12 2012-03-06 Nvidia Corporation Distributed antialiasing in a multiprocessor graphics system
US7535433B2 (en) * 2006-05-18 2009-05-19 Nvidia Corporation Dynamic multiple display configuration
US7480757B2 (en) * 2006-05-24 2009-01-20 International Business Machines Corporation Method for dynamically allocating lanes to a plurality of PCI Express connectors
US8103993B2 (en) * 2006-05-24 2012-01-24 International Business Machines Corporation Structure for dynamically allocating lanes to a plurality of PCI express connectors
JP4439491B2 (en) * 2006-05-24 2010-03-24 株式会社ソニー・コンピュータエンタテインメント Multi-graphics processor system, graphics processor and data transfer method
US8555099B2 (en) * 2006-05-30 2013-10-08 Ati Technologies Ulc Device having multiple graphics subsystems and reduced power consumption mode, software and methods
US7562174B2 (en) * 2006-06-15 2009-07-14 Nvidia Corporation Motherboard having hard-wired private bus between graphics cards
US7619629B1 (en) * 2006-06-15 2009-11-17 Nvidia Corporation Method and system for utilizing memory interface bandwidth to connect multiple graphics processing units
US7500041B2 (en) * 2006-06-15 2009-03-03 Nvidia Corporation Graphics processing unit for cost effective high performance graphics system with two or more graphics processing units
US7412554B2 (en) * 2006-06-15 2008-08-12 Nvidia Corporation Bus interface controller for cost-effective high performance graphics system with two or more graphics processing units
US7616206B1 (en) * 2006-06-16 2009-11-10 Nvidia Corporation Efficient multi-chip GPU
US20080055321A1 (en) * 2006-08-31 2008-03-06 Ati Technologies Inc. Parallel physics simulation and graphics processing
JP4421593B2 (en) * 2006-11-09 2010-02-24 株式会社ソニー・コンピュータエンタテインメント Multiprocessor system, control method thereof, program, and information storage medium
US20080200043A1 (en) * 2007-02-15 2008-08-21 Tennrich International Corp. Dual display card connection means
ATE508141T1 (en) 2007-03-07 2011-05-15 Beneo Orafti S A PRESERVATION OF NATURAL RUBBER LATEX
JP2008276568A (en) * 2007-04-27 2008-11-13 Toshiba Corp Information processing unit, and processor circuit control method
US7626418B1 (en) * 2007-05-14 2009-12-01 Xilinx, Inc. Configurable interface
US8711153B2 (en) * 2007-12-28 2014-04-29 Intel Corporation Methods and apparatuses for configuring and operating graphics processing units
US8161209B2 (en) * 2008-03-31 2012-04-17 Advanced Micro Devices, Inc. Peer-to-peer special purpose processor architecture and method
CN101639930B (en) * 2008-08-01 2012-07-04 辉达公司 Method and system for processing graphical data by a series of graphical processors
US8373709B2 (en) * 2008-10-03 2013-02-12 Ati Technologies Ulc Multi-processor architecture and method
US8892804B2 (en) * 2008-10-03 2014-11-18 Advanced Micro Devices, Inc. Internal BUS bridge architecture and method in multi-processor systems
JP2012507783A (en) * 2008-10-30 2012-03-29 エルエスアイ コーポレーション Storage controller data redistribution


Also Published As

Publication number Publication date
EP2342626A1 (en) 2011-07-13
CN105005542A (en) 2015-10-28
KR20110067149A (en) 2011-06-21
JP2012504835A (en) 2012-02-23
CN105005542B (en) 2019-01-15
US20170235700A1 (en) 2017-08-17
US8373709B2 (en) 2013-02-12
US20130147815A1 (en) 2013-06-13
US20100088453A1 (en) 2010-04-08
CN102227709A (en) 2011-10-26
KR101533761B1 (en) 2015-07-03
US20180314670A1 (en) 2018-11-01
CN102227709B (en) 2015-05-20
EP2342626B1 (en) 2015-07-08
US10467178B2 (en) 2019-11-05

Similar Documents

Publication Publication Date Title
US10467178B2 (en) Peripheral component
US9977756B2 (en) Internal bus architecture and method in multi-processor systems
US8161209B2 (en) Peer-to-peer special purpose processor architecture and method
US7340557B2 (en) Switching method and system for multiple GPU support
US7325086B2 (en) Method and system for multiple GPU support
US20200225878A1 (en) Memory buffers and modules supporting dynamic point-to-point connections
US7380045B2 (en) Protocol conversion and arbitration circuit, system having the same, and method for converting and arbitrating signals
US6950910B2 (en) Mobile wireless communication device architectures and methods therefor
US7680968B2 (en) Switch/network adapter port incorporating shared memory resources selectively accessible by a direct execution logic element and one or more dense logic devices in a fully buffered dual in-line memory module format (FB-DIMM)
US20050235090A1 (en) High speed interface with looped bus
Sharma Compute Express Link (CXL): Enabling heterogeneous data-centric computing with heterogeneous memory hierarchy
US8244994B1 (en) Cooperating memory controllers that share data bus terminals for accessing wide external devices
KR100871835B1 (en) Memory system and method of signaling of the same
KR100773932B1 (en) A data alignment chip for camera link board

Legal Events

Date Code Title Description
WWE Wipo information: entry into national phase

Ref document number: 200980147694.9

Country of ref document: CN

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 09737270

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2011530294

Country of ref document: JP

NENP Non-entry into the national phase

Ref country code: DE

WWE Wipo information: entry into national phase

Ref document number: 2501/DELNP/2011

Country of ref document: IN

WWE Wipo information: entry into national phase

Ref document number: 2009737270

Country of ref document: EP

ENP Entry into the national phase

Ref document number: 20117010206

Country of ref document: KR

Kind code of ref document: A