WO2012037494A1

WO2012037494A1 - Performance and power optimized computer system architectures and methods leveraging power optimized tree fabric interconnect

Info

Publication number: WO2012037494A1
Application number: PCT/US2011/051996
Authority: WO
Inventors: Mark Bradley Davis; David James Borland
Original assignee: Calxeda, Inc.
Priority date: 2010-09-16
Filing date: 2011-09-16
Publication date: 2012-03-22
Also published as: DE112011103123B4; GB2497493B; GB2497493A; GB201306075D0; DE112011103123T5; CN103444133A

Abstract

A performance and power optimized computer system architecture and method leveraging power optimized tree fabric interconnect are disclosed. One embodiment builds low power server clusters leveraging the fabric with tiled building blocks while another embodiment implements storage solutions or cooling solutions. Yet another embodiment uses the fabric to switch non-Ethernet packets, switch multiple protocols for network processors and other devices.

Description

Performance and Power Optimized Computer System Architectures and Methods Leveraging Power Optimized Tree Fabric Interconnect

Mark Davis

David Borland

Priority Claims/Related Applications

This application claims priority under 35 USC 120 to U.S. Patent Application Serial No. 12/794,996 filed on June 7, 2010 and entitled "System and Method for High-Performance, Low-Power Data Center Interconnect Fabric", the entirety of which is incorporated herein by reference. In addition, this patent application claims the benefit under 35 USC 119(e) and 120 of U.S. Provisional Patent Application Serial No. 61/383,585 filed on September 16, 2010 and entitled "Performance and Power Optimized Computer System Architectures and Methods Leveraging Power Optimized Tree Fabric Interconnect", the entirety of which is incorporated herein by reference.

Background

Figures 1 and 2 show a classic data center network aggregation as is currently well known. Figure 1 shows a diagrammatical view of a typical network data center architecture 100 wherein top level switches 101a-n are at the tops of racks 102a-n filled with blade servers 107a-n interspersed with local routers 103a-f. 105a-b and additional rack units 108a-n contain additional servers 104 e-k and routers 106a-g. Figure 2 shows an exemplary physical view 110 of a system with peripheral servers 11 la-bn arranged around edge router systems 1 12a-h, which are placed around centrally located core switching systems 1 13. Typically such an aggregation 110 has 1-Gb Ethernet from the rack servers to their top of rack switches, and often 10 Gb Ethernet ports to the edge and core routers.

Brief Description of the Drawings

Figures 1 and 2 illustrate a typical data center network aggregation;

Figure 3 illustrates a network aggregation using a server in accordance with one embodiment;

Figure 4 illustrates a data center in a rack according to one embodiment;

Figure 5 shows a high level topology of a network system with a switching fabric;

Figure 6 illustrates a server board that composes multiple server nodes interconnected with the described point-to-point interconnect;

Figures 6a-6c illustrates another example of the fabric topology;

Figure 7 illustrates an example of a passive backplane connected to one or more node boards and two aggregation boards;

Figure 8 shows an example of extending the fabric across shelves and linking shelves across a server rack;

Figure 9a shows an exemplary server 700 with a disk form factor;

Figures 9b and 9c show exemplary arrays of disk-server combination according to one embodiment using a storage server 1-node SATA board;

Figure 9d illustrates a standard 3.5 inch drive;

Figure 9e illustrates an implementation of multiple server nodes in a standard 3.5 inch disk drive form factor;

Figure 10 illustrates an implementation of deeply integrated servers with storage;

Figure 1 1 illustrates an implementation of a dense packing of storage and servers leveraging an existing 3.5 inch JBOD storage box;

Figure 12 illustrates an implementation of a server node instanced in the same form factor of a 2.5 inch drive;

Figure 13 illustrates an implementation of rack chimney cooling;

Figure 13a shows an exemplary illustration of the heat convection used in the chimney rack cooling shown in Figure 13;

Figure 14 illustrates server nodes that are placed diagonally with respect to each other to minimize self-heating across server nodes;

Figure 15 shows an exemplary 16-node system according to one embodiment with heat waves rising from printed circuit boards;

Figure 16 shows a higher-density variant of the 16-node system with nodes similarly arranged to minimize self-heating across the nodes;

Figure 17 illustrates the internal architecture of a server node fabric switch;

Figure 18 illustrates a server node that includes a PCIe controller connected to the internal CPU bus fabric;

Figure 18a illustrates a system with multiple protocol bridges using the fabric switch;

Figure 19 illustrates integration of the server fabric with a network processor;

Figure 20 illustrates the fabric switch and a FPGA that provides services such as IP Virtual Server (IPVS);

Figure 21 illustrates a way to build OpenFlow flow processing into the Calxeda fabric;

Figure 22 illustrates one example of an integration of the power optimized fabric switch to an existing processor via PCIe; and

Figure 23 illustrates one example of an integration of the power optimized fabric switch to an existing processor via Ethernet.

Detailed Description of One or More Embodiments

A performance and power optimized computer system architecture and method leveraging power optimized tree fabric interconnect are disclosed. One embodiment builds low power server clusters leveraging the fabric with tiled building blocks while another embodiment implements storage solutions or cooling solutions. Yet another embodiment uses the fabric to switch other things.

Co-pending patent application 12/794,996 describes the architecture of a power optimized server communication fabric that support routings using a tree-like or graph topology that supports multiple links per node, where each link is designated as an Up, Down, or Lateral link, within the topology. The system uses a segmented MAC architecture which may have a method of re-purposing MAC IP addresses for inside MACs and outside MACs, and leveraging what would normally be the physical signaling for the MAC to feed into the switch. The Calxeda XAUI system interconnect reduces power, wires and the size of the rack. There is no need for high powered, expensive Ethernet switches and high-power Ethernet Phys on the individual servers. It dramatically reduces cables (cable complexity, costs, significant source of failures). It also enables a heterogeneous server mixture inside the rack, supporting any equipment that uses Ethernet or SATA or PCIe. In this architecture, power savings is primarily from two architectural aspects: 1) the minimization of Ethernet Phys across the fabric, replacing them with point to point XAUI interconnects between nodes, and 2) the ability to dynamically adjust the XAUI width and speed of the links based upon load.

Figure 3 shows a network aggregation 200. This network supports 10-Gb/sec

Ethernet communication 201 (thick lines) between aggregation router 202 and three racks 203 a-c. In rack 203 a the Calxeda interconnect fabric provides multiple high-speed 10 Gb paths, represented by thick lines, between servers 206a-d on shelves within a rack. The embedded switch in servers 206a-d can replace a top-of-rack switch, thus saving a dramatic amount of power and cost, while still providing a 10 Gb Ethernet port to the aggregation router. The Calxeda switching fabric can integrate traditional Ethernet (1 Gb or 10 Gb) into the Calxeda XAUI fabric, and the Calxeda servers can act as a top of rack switch for third- party Ethernet connected servers.

Middle rack 203b shows another scenario where Calxeda servers 206e,f can integrate into existing data center racks that contain a top-of-rack switch 208a. In this case, the IT group can continue to have their other servers connected via 1 Gb Ethernet up to the existing top-of-rack switch. Calxeda internal servers can be connected via Calxeda 10 Gb XAUI fabric, and they can integrate up to the existing top-of-rack switch with either a 1 Gb or 10 Gb Ethernet interconnect. Rack 203c, on the right, is the current way that data center racks are traditionally deployed. The thin red lines represent 1 Gb Ethernet. So, the current deployments of data center racks is traditionally 1 Gb Ethernet up to the top-of-rack switch 308b, and then 10 Gb (thick red line 201) out from the top of rack switch to the aggregation router. Note that all servers are present in an unknown quantity, while they are pictured here in finite quantities for purposes of clarity and simplicity. Also, using the enhanced Calxeda servers, no additional routers are needed, as they operate their own XAUI switching fabric, discussed below.

Figure 4 shows an overview of an exemplary "data center in a rack" 400, according to one embodiment. It has 10-Gb Ethernet PHY 401a-n and 1-Gb private Ethernet PHY 402. Large computers (power servers) 403 a-n support search; data mining; indexing; Hadoop, a Java software framework; MapReduce, a software framework introduced by Google to support distributed computing on large data sets on clusters of computers; cloud applications; etc. Computers (servers) 404a-n with local flash and/or solid-state disk (SSD) support search, MySQL, CDN, software-as-a-service (SaaS), cloud applications, etc. A single, large, slow-speed fan 405 augments the convection cooling of the vertically mounted servers above it. Data center 400 has an array 406 of hard disks, e.g., in a Just a Bunch of Disks (JBOD) configuration, and, optionally, Calxeda servers in a disk form factor (the green boxes in arrays 406 and 407), optionally acting as disk controllers. Hard disk servers or Calxeda disk servers may be used for web servers, user applications, and cloud applications, etc. Also shown are an array 407 of storage servers and historic servers 408a, b (any size, any vendor) with standard Ethernet interfaces for legacy applications. Figure 5 shows a high-level topology 500 of the network system described in copending patent application 12/794,996 that illustrates XAUI connected SoC nodes connected by the switching fabric. The 10 Gb Ethernet ports EthO 501a and Ethl 501b come from the top of the tree. Ovals 502a-n are Calxeda nodes that comprise both computational processors as well as the embedded switch. The nodes have five XAUI links connected to the internal switch. The switching layers use all five XAUI links for switching. Level 0 leaf nodes 502d, e (i.e., NOn nodes, or Nxy, where x=level and y=item number) only use one XAUI link to attach to the interconnect, leaving four high-speed ports that can be used as XAUI, 10 Gb Ethernet, PCIe, SATA, etc., for attachment to I/O. The vast majority of trees and fat trees have active nodes only as leaf nodes, and the other nodes are pure switching nodes. This approach makes routing much more straightforward. Topology 500 has the flexibility to permit every node to be a combination computational and switch node, or just a switch node. Most tree-type implementations have I/O on the leaf nodes, but topology 500 let the I/O be on any node. In general, placing the Ethernet at the top of the tree minimizes the average number of hops to the Ethernet.

Building Power Optimized Server Fabric Boards Using Tiled Building Blocks

Figure 6 illustrates a server board that composes multiple server nodes interconnected with the described point-to-point interconnect. The server board has:

• Each of the ovals in this diagram is a standalone server node that includes processor, memory, I/O, and the fabric switch.

• The fabric switch has the ability to dynamically modify the width (number of lanes) and speed of each lane for each link independently. • The 14 node board example shows two Ethernet escapes from the fabric. These Ethernet escapes would usually be routed to a standard Ethernet switch or router. These Ethernet escapes can be either standard 1 Gb or 10 Gb Ethernet.

• The 14 node example topology is a butterfly fat tree which provides redundant paths to allow adaptive routing to both route around faults and route around localized hot spots.

• The 3 node aggregator board allows the composition of large server fabrics with only two board tiles.

^■ For redundancy, add 2nd aggregator

^■ In/Out:

• PCIe connector for smooth-stone fabric

• Optionally Ethernet support (Off, 1, 2, 5, 10 or 20 Gbs)

^■ Ethernet decision based on bandwidth required for application

• The nodes on the aggregator board can be either just switching nodes, or full computational nodes including switching.

• The board in/out may be PCIe connector that supports two x4 XAUI (2 smooth-stone fabric link) and/or optional Ethernet support (Off, 1, 2, 10 or 20 Gbs).

• Example fabric topologies like the 14-node example minimize the number of links that span off the board to minimize connectors (size and number) and associated costs, while still retaining Ethernet escapes and multi-path redundancy.

• Two aggregator boards can be used to achieve path redundancy when extending the fabric.

• Power savings can be achieved with static link configuration o Lower layer nodes in the figure (noted as Leaf Nodes) can be run at 1 Gb/sec.

o 1st layer switching nodes in the figure (noted as Layer 1 Switches) would then have an incoming bandwidth from the Leaf Nodes of 3 Gb/sec. This allows a static link configuration between the Layer 1 and Layer 2 switches of either 2.5 or 5 Gb/sec.

o The links extending off the Layer 2 Switches layer can then run at 10 Gb/sec.

o In this topology, since the bulk of the nodes are Leaf Nodes, the bulk of the links are running at the slowest rate (1 Gb/sec in this example), thus minimizing networking power consumption.

o Allows Ethernet escapes to be pulled at any node in the fabric allowing fabric designers to trade off needed bandwidth of the Ethernet escapes, the number of ports utilized by top of rack switches, and the costs and power associated with the Ethernet ports.

• Power savings can be further optimized via dynamic link configuration driven by link utilization. In this example, each link and associated port of the fabric switch contains bandwidth counters, with configurable threshold events that allow for the reconfiguration of the link width and speed, both up and down, based upon the dynamic link utilization.

• Since in many common server use cases, the Ethernet traffic is primarily node to external Ethernet and not node to node, the proposed tree fabric structure, and specifically the butterfly fat tree example, minimizes the number of hops across the fabric to Ethernet, thus minimizing latency. This allows the creation of large low latency fabrics to Ethernet while utilizing switches that have a relatively small (in this example 5) number of switching ports.

• The integration of server 209a in Figure 2 illustrates another novel system use of the defined server fabric. In this case, to take advantage of the performance and power management of the server fabric, and to minimize port utilization on the top of rack switch, this figure shows a heterogeneous integration of existing servers onto the defined server fabric such that Ethernet traffic from existing servers can be gateway' ed into the fabric, allowing communication with nodes within the fabric, as well as having 209a Ethernet traffic carried through the fabric to the uplink Ethernet port 201.

Figures 6a-6c illustrates another example of the fabric topology that is a forty eight node fabric topology that consists of 12 cards, where each card contains 4 nodes, connecting into a system board. This topology provides some redundant links are provided, but without heavy redundancy. The topology has four Ethernet gateway escapes and each of these could be either 1 Gb or 10 Gb, but not all of these Ethernet gateways need to be used or connected. In the example shown, eight fabric links are brought off the quad-node card and, in one example, a PCIe xl6 connector is used to bring 4 fabric links off the card.

Summary/Overview of Building Power Optimized Server Fabric Boards Using Tiled Building Blocks

1. A server tree fabric that allows an arbitrary number of Ethernet escapes across a server interconnect fabric, to minimize the number of Ethernet Phys utilized to save power and costs associated with the Ethernet Phys, associated cables, and ports consumed on top of rack Ethernet switches / routers. 2. The switching nodes can be either pure switching nodes saving power by turning off computational subsystems, or can be used as complete computational subsystems including fabric switching. Referring to Figure 17, in one

implementation, multiple power domains are used separate the computational subsystem, block 905, from the management processor, block 906, and the fabric switch, the remainder of the blocks. This allows the SOC to be configured with the computational subsystem, block 905, powered off, retaining management processing in block 906, and hardware packet switching and routing done by the fabric switch.

3. The Butterfly Fat Tree topology server fabric provides for minimal number of links within a board (saving power and costs), minimal number of links spanning boards (saving power and costs) while allowing for redundant link paths both within and across boards.

4. The proposed base board and aggregator board allows scalable fault- resilient server fabrics to be composed with only two board building blocks.

5. Tree oriented server fabrics, and variants like the example butterfly fat tree, allow for static link width and speed specification that can be determined by the aggregate bandwidth of children nodes of that node, allowing for easy link configuration while minimizing interconnect power.

6. Power savings can be further optimized via dynamic link configuration driven by link utilization. In this example, each link and associated port of the fabric switch contains bandwidth counters, with configurable threshold events that allow for the reconfiguration of the link width and speed, both up and down, based upon the dynamic link utilization.

7. Since in many common server use cases, the Ethernet traffic is primarily node to external Ethernet and not node to node, the proposed tree fabric structure, and specifically the butterfly fat tree example, minimizes the number of hops across the fabric to Ethernet, thus minimizing latency. This allows the creation of large low latency fabrics to Ethernet while utilizing switches that have a relatively small (in this example 5) number of switching ports.

8. Allows heterogenous server integration to the fabric carrying Ethernet traffic from existing servers into and through the defined server communication fabric.

Building Power Optimized Server Shelves and Racks Using Tiled Building

Blocks

Now these board "tiles" can be composed to construct shelves and racks of fabric connected server nodes. Figure 7 shows an example of how a passive backplane can connect 8 14-node boards and two aggregation boards to compose a shelf consisting of 236 server nodes. Each board may be, for example, 8.7" tall + mechanical < 10.75" for 6U, interleave heat sinks for density and 16 boards fit in a 19 inch wide rack. The backplane may be simple/cheap with a PCIe connectors and routing wherein the routing may be XAUI signals (blue&green) + Power which is very simple without wires. Ethernet connections shown at 8 board aggregation point.

Figure 8 shows an example of extending the fabric across shelves, linking shelves across a server rack. The Ethernet escapes can be pulled at any node in the fabric, in this example, they are pulled from the passive interconnect backplane connecting the multi-node blades.

Summary/Overview of Building Power Optimized Server Shelves and Racks Using Tiled Building Blocks 1. Utilization of a PCIe connector to bring out the Ethernet escapes and XAUI links off a board to connect boards together with a point to point server fabric, utilizing not the PCIe signaling, but using the physical connector for the power and XAUI signals of the board, while maintaining redundant communication paths for fail-over and hotspot reduction.

2. XAUI point-to-point server interconnect fabric formed with a fully passive backplane.

3. Ethernet escapes across a fabric spanning the rack at every level of tree, not just at the top of the tree.

4. Ethernet escapes across the fabric can be dynamically enabled and disabled to match bandwidth with optimized power usage.

5. Node to node traffic, including system management traffic stays on the fabric spanning a rack without ever traversing through a top of rack Ethernet switch.

Storage

Figure 9a shows an exemplary server 700 with a disk form factor, typically such as a standard 2.3-inch or 3.5-inch hard disk drive (HDD) with SCSI or SATA drive,, according to one embodiment. Server board 701 fits in the same infrastructure as disk drive 702 in a current disk rack. Server 701 is a full server, with DDR, server-on-a-chip SoC, optional flash, local power management, SATA connections to disks (1-16 ... limited by connector size). Its output could be Ethernet or Calxeda's fabric (XAUI), with two XAUI outputs for fail-over. Optionally, it could use PCIe instead of SATA (SSDs or other things that need PCIe), with 1 through 4 nodes to balance compute vs. storage needs. Such a server could do RAID implementation as well as LAMP stack server applications. Use of Calxeda

ServerNode™ on each disk would offer a full LAMP stack server with 4 GB of DDR3, and multiple SATA interfaces. Optionally, a second node for 8 GB of DDR could be added if needed.

Figures 9b and 9c show exemplary arrays 710 and 720, respectively, of disk-server combinations 700a-n, according to one embodiment, using a storage server 1-node SATA board as discussed above. Connection by some high speed network or interconnect, either standard or proprietary, eliminates the need for a large Ethernet switch, saving power, cost, heat and area. Each board 701 is smaller than the height and depth of the disk. The array may be arranged with alternating disks and board, as shown in Figure 7b, or one board can server multiple disks, for example, in a disk, disk, board, disk, disk arrangement, as shown in Figure 7c. Thus computing power may be matched to disk ratio in flexible fashion.

Connectivity of boards 701a-n may be on a per node basis, with SATA used to hook to disk and multiple SATAs to hook to multiple disks. It may also be on a node to node basis, with two XAUIs in the fabric configuration, as discussed earlier, as well as in application

61/256,723 in each node, for redundancy. Nodes are connected through the XAUI fabric. Such connections could be of a tree or fat-tree topology, i.e., node to node to node to node, with deterministic, oblivious, or adaptive routing moving data in the correct direction.

Alternatively, an all-proprietary interconnect could be used, going to other processing units. Some ports could go to an Ethernet output or any other I/O conduit. Each node could go directly to Ethernet, inside the "box," or XAUI to an XAUI aggregator (switch) then to PHY, or XAUI to PHY. Or any combination of the above could be used. In yet other cases, SATA connections could be replaced with PCIe connections, using SSDs with PCIe connections. Some SSDs are going into disk form factors with PCIe or SATA. Or PCIe and SATA could be mixed. Ethernet out of the box could be used instead of XAUI for system interconnection. In some cases, for example, standard SATA connectors may be used, but in other cases higher-density connectors with proprietary wiring through a proprietary backplane could be made.

In yet another case, a server function could be within a disk drive, offering a full server plus a disk in a single disk drive form factor. For example, the ServerNode™ could be put on the board inside a disk. This approach could be implemented with XAUI or Ethernet connectivity. In such a case, a server-on-a-chip approach known to the inventor could be used as a disk controller plus server. Figure 9d illustrates this concept. A standard 3.5 inch drive is shown in figure 9d, item 9d0. It has an integrated circuit card 9dl that controls the disk drive. A significant amount of space is unused within the drive, noted by 9d2 in which the Calxeda low-power, small server node PCB can be formed to fit within this unused space within the disk drive.

Figure 9e illustrates an implementation of putting multiple server nodes in a standard 3.5 inch disk drive form factor. In this case, connectors from the server PCB to the backplane exports the XAUI based server fabric interconnect to provide network and inter-server communication fabric, as well as 4 SATA ports for connection to adjacent SATA drives.

Figure 10 illustrates an implementation for deeply integrating servers with storage.

Server node (101) shows a complete low-power server that integrates computational cores, DRAM, integrated I/O, and the fabric switch. In this example, server node 101 is shown in the same form factor as a standard 2 1/2 inch disk drive (102). (103) illustrates combining these server nodes and disk drives in a paired one-to-one fashion, where each server node has it's own local storage. (104) shows the server node controlling 4 disk drives. System (105) illustrates combining these storage servers via the unifying server fabric, and then in this example pulling four 10-Gb/sec Ethernet escapes from the fabric to connect to an Ethernet switch or router.

Figure 1 1 illustrates a concrete realization of this dense packing of storage and servers by illustrating a usage leveraging an existing 3.5 inch JBOD (Just a Bunch of Disks) storage box. In this case the JBOD mechanicals including disk housing is unchanged, but storage nodes are shown paired one-to-one with disk drives within the unmodified JBOD box. This illustrates a concept where the server nodes are pluggable modules that plug into an underlying motherboard that contains the fabric links. In this illustration, this standard JBOD box houses 23 3.5 inch disks (shown as rectangles in the logical view), and this figure shows 31 server nodes (shown as ovals/circles in the logical view) contained within the JBOD box controlling the 23 disks, and exposing two 10 Gb/sec Ethernet links (shown as dark wide lines in the logical view). This tightly integrated server / storage concept takes an off-the- shelf storage only JBOD box, and then adds 31 server nodes in the same form factor communicating over the power optimized fabric. This maps very well to applications that prefer to have local storage.

Figure 12 shows a related concept that leverages the fact that the server nodes can be instanced in the same form factor of a 2.5 inch drive. In this case, they are integrated into a 2.5 inch JBOD that has 46 disks. This concept shows 64 server nodes integrated in the same form factor of the JBOD storage. In this example, two 10 Gb Ethernet links are pulled from the fabric, as well as a 1 Gb/sec management Ethernet link.

Summary/Overview of Storage

1. Utilization of a PCIe connector to bring out the Ethernet escapes and XAUI links off a board to connect boards together with a point to point server fabric, utilizing not the PCIe signaling, but using the physical connector for the power and XAUI signals of the board, while maintaining redundant communication paths for fault resilience and load balancing.

2. Utilization of the defined server fabric to transform existing JBOD storage systems by pairing small form-factor low-power fabric-enabled server nodes with the disks providing very high-density compute servers, tightly paired with local storage, integrated via the power and performance optimized server fabric to create new high-performance computational server and storage server solutions without impact the physical and mechanical design of the JBOD storage system.

3. For use in a high density computing system, a method of encapsulating complete servers in the form factors of hard disk drives, for the purposes of replacing some of the drives with additional servers.

4. As in claim 3, wherein the servers are connected via and additional switching fabric to a network

5. As in claim 3, wherein the backplane in the enclosure holding the drives is replaced with a backplane suitable for creating at least one internal switching pathway .

6. For use in a high-density storage system, a method of integrating a low-power server PCB into the empty space within a standard 3.5 inch disk drive, providing integrated compute capabilities within the disk drive.

Cooling of Rack Integrated Low-Power Servers

One aspect of driving to low-power computer server solutions is the management of heat, cooling, and air movement through the rack and across the boards. Minimization of fans is one aspect of lowering total cost of ownership (TCO) of low-power servers. Fans add cost, complexity, reduce reliability due to the moving parts, consume a significant amount of power, and produce a significant amount of noise. Reduction and removal of fans can provide significant benefits in reliability, TCO, and power consumption.

Figure 13 illustrates a novel implementation of rack chimney cooling that supports chimney cooling through the entire rack or in just a segment of the rack. An important aspect is the single fan in a chimney rack concept, which uses natural convection upward with help from one fan. A large fan, cooling the entire rack, can be slow speed. It may be positioned at the bottom, or within the rack below the vertically mounted convection cooled subset of the rack. As cool air comes in the bottom, the fan pushes it through the chimney and out the top. Because all boards are vertical, there is no horizontal blockage. Although in this example, the fan is shown at the bottom of the rack, it can be anywhere in the system. That is, the system could have horizontal blocking with "classic" cooling - under the vent and fan - leaving the top as a vertical chimney. This vertical, bottom-cooled approach can work on a small system. The fan can be variable speed and temperature dependent.

Figure 13a shows an exemplary illustration of the novel principles of heat convection 500 used in the chimney rack concept. Placing the components at an angled alignment causes heat streams 501a-n to rise from heat-emanating Double Data Rate (DDR) memory chips 503a-n on a printed circuit board 502, so those heat emanating chips don't form heat backup or mutual heat ups. In this example, the DDR chips are placed diagonally with one another, not stacked vertically, because they tend to heat one another. Also, the DDR chips are placed above, not below, the large computing chips 504a, such as ASICs, SOCs, or processors, because they would tend to heat the SOCs. And the coolest chips, the flash chips 506, is placed below the SOCs. Likewise, nodes are not stacked vertically, as discussed below. Figure 14 extends this concept to show how server nodes are placed diagonally with respect to each other to minimize self-heating across server nodes.

Figure 15 shows an exemplary 16-node system, according to one embodiment, with heat waves rising from printed circuit boards. For a typical 16-node system, individual are arranged so that the heat rising from each unit does not heat the unit above. The overall enclosure would typically be longer, less tall, and less dense. Also, rather than mount PCBs diagonally as shown, PCBs could be squarely aligned and be rectangular, but components could be placed in a diagonal alignment to minimize mutual heating. PCBs in different rows could either have complementary layouts or could be staggered accordingly to reduce mutual heating. Similarly, Figure 16 shows a higher-density variant of the 16-node system with nodes similarly arranged to minimize self-heating across the nodes.

An additional cooling concept for racks of low power servers is to use a pneumatic air pressure differential to create an upward air flow, without requiring fans. The technique for doing this is to create an sealed rack with an extended vertical vent pipe for the air. This vent pipe must be tall enough (approximately 20-30 feet+) to create sufficient air pressure differential to create the upward air flow. This provides a totally passive air movement and cooling system for the rack of low power servers.

Summary/Overview of Cooling of Rank Mounted Low Power Servers

1. For use in a high density computing system, a method of placing heat- emanating components on a vertically placed mounting board,

wherein none of the heat-emanating component is placed directly above or below another heat-emanating component,

2. As claim 1, wherein the components are arranged in a substantially diagonal arrangement across the mounting board

3. As in claim 1, wherein the components are arranged in several substantially cross diagonal arrangements across the mounting board

4. As in claims 1,2 and 3, wherein the mounting board is a Printed Circuit

Board

Server Fabric Switching of Non-Ethernet Packets

As described in co-pending Patent application 12/794,996, Figure 17 illustrates the internal architecture of a server node fabric switch. Figure 17 shows a block diagram of an exemplary switch 900 according to one aspect of the system and method disclosed herein. It has four areas of interest 91 Oa-d. Area 910a corresponds to Ethernet packets between the CPUs and the inside MACs. Area 910b corresponds to Ethernet frames at the Ethernet physical interface at the inside MACs, that contains the preamble, start of frame, and inter- frame gap fields. Area 910c corresponds to Ethernet frames at the Ethernet physical interface at the outside MAC, that contains the preamble, start of frame, and inter- frame gap fields. Area 910d corresponds to Ethernet packets between the processor of routing header 901 and outside MAC 904. This segmented MAC architecture is asymmetric. The inside MACs have the Ethernet physical signaling interface into the routing header processor, and the outside MAC has an Ethernet packet interface into the routing header processor. Thus the MAC IP is re-purposed for inside MACs and outside MACs, and what would normally be the physical signaling for the MAC to feed into the switch is leveraged. MAC configuration is such that the operating system device drivers of A9 cores 905 manage and control inside EthO MAC 902 and inside ETH1 MAC 903. The device driver of management processor 906 manages and controls Inside Eth2 MAC 907. Outside Eth MAC 904 is not controlled by a device driver. MAC 904 is configured in Promiscuous mode to pass all frames without any filtering for network monitoring. Initialization of this MAC is coordinated between the hardware instantiation of the MAC and any other necessary management processor initialization. Outside Eth MAC 904 registers are visible to both A9 905 and management processor 906 address maps. Interrupts for Outside Eth MAC 904 are routable to either the A9 or management processor. It is key to node that the Routing Header processor 910d adds a fabric routing header to the packet when packets are received from a MAC headed to the switch, and removes the fabric routing header when the packet is received from the switch heading to a MAC. The fabric switch itself only routes on node IDs, and other information, contained in the fabric routing header, and does no packet inspection of the original packet. Distributed PCIe Fabric

Figure 18 illustrates a server node that includes a PCIe controller connected to the internal CPU bus fabric. This allows for the creation of a novel PCIe switching fabric that leverages the high performance, power optimized server fabric to create a scalable, high- performance, power optimized PCIe fabric.

The technique follows:

• PCIe controller 902 connects to Mux 902a allowing the PCIe controller to connect directly to the external PCIe Phy, or to the PCIe Routing Header Processor 910c. When Mux 902a is configured to direct PCIe traffic to the local PCIe Phy, this is equivalent to the standard local PCIe connection. When Mux 902a is configured to direct PCIe traffic to the PCIe Routing Header Processor 910c, this enables the novel PCIe distributed fabric switch mechanism.

• PCIe Routing Header Processor 910c utilizes the embedded routing information within the packet (address, ID, or implicit) to create the fabric routing header that maps that PCIe packet route to the destination fabric node PCIe controller.

• This provides similar advantages to the distributed PCIe fabric that the server fabric provides to networking.

• PCIe transactions sourced from the processor cores (905) can be routed to local PCIe Phy (via either the Mux bypass or via the switch), can be routed to any other node on the fabric, directly to the inside PCIe controller (902) or to the outside PCIe controller / Phy (904).

• Likewise, incoming PCIe transactions enter the outside PCIe controller (904), get tagged with the fabric routing header by the PCIe Routing Header

Processor (910), and then the fabric transports the PCIe packet to its final target. Distributed Bus Protocol Fabric

Figure 18a illustrates an additional extension that shows that multiple protocol bridges can take advantage of the fact that the fabric switch routes on the routing header, not directly on the underlying packet payload (e.g. a layer 2 Ethernet frame). In this illustration, 3 protocol bridges are shown: Ethernet, PCIe, and a Bus Protocol bridge.

The role of the bus protocol bridge is to take the processor or internal SOC fabric protocol, packetize it, add a Calxeda fabric routing header, and then route it through the Calxeda fabric.

As a tangible example, consider a bus protocol such as AMBA AXI, HyperTransport, or QPI (Quick Path Interconnect) within an SOC.

Consider the following data flow:

• A processor on the internal SOC bus fabric issues a memory load (or store) request.

• The physical address target for the memory operation has been mapped to a remote node on the fabric.

• The bus transaction traverses through the bus protocol bridge:

o The bus transaction is packetized

o The physical address for the memory transaction is mapped to a remote node, that node ID is used when building the routing header.

o A routing frame is built by the bus protocol bridge consisting of a routing header with the remote node ID, and the payload being the packetized bus transaction.

• The bus transaction routing frame passes through the Fabric Switch, traverses through the fabric, and is received by the target node's frame switch.

• The target node bus protocol bridge unpacks the packetized bus transaction, issues the bus transaction into the target SOC fabric, completes the memory load, and returns the result through the same steps, with the result flowing back to the originating node. Network Processor Integration with Server Fabric

Figure 19 shows an illustration of integrating the server fabric with Network

Processors (91 1). There are several use cases for the integration of the server fabric with Network Processors, including:

• The Network processors can serve as network packet processing accelerators to both the local processors (905), as well as any other processor on the fabric.

• Can be a Network Processor centric design, where the incoming packets from the external Ethernet are targeted to the Network Processors, and the Network Processors and the Control Plane processing can be offloaded to the larger processor cores (905).

• The server fabric can serve as a communication fabric between the network processors.

To enable these novel use cases, the network processors are assigned a MAC address. In the switch architecture shown in Figure 19, there are not Routing Header Processors attached to Port 1-4. So agents connected directly to Ports 1-4 need to inject packets that have the Fabric Switch Header prepended to payload packet. The Network Processor adds fabric switch integration to their design by:

• Outgoing packets from the Network Processor are tagged with the fabric switch header, which encodes the Destination Node ID from the Destination MAC.

• Incoming packets to the Network Processor from the Fabric Switch have the fabric switch header removed before Ethernet packet processing.

Foreign Device Integration with Server Fabric

Figure 19 shows an illustration of integrating the server fabric with arbitrary Foreign Devices (912). By Foreign Device, we mean any processor, DSP, GPU, I/O, or

communication or processing device that needs an inter-device communication fabric. A typical use case would be a large processing system composed of DSP or GPU processors that need an interconnect fabric between the DSP or GPU processors. The Fabric Switch routes packets based upon the fabric routing header, and does no packet inspection of the packet payload. The packet payload has no assumptions of being formatted as an Ethernet frame, and is treated completely as an opaque payload.

This allows Foreign Devices (e.g. DSP or GPU processors) to attach to the fabric switch and leverage the scalable, high performance, power optimized communication fabric by:

• Adding a routing frame header contained the destination node ID of the packet to an arbitrary packet payload sending to the frame switch.

• Stripping the routing frame header when receiving a packet from the frame switch.

Load Balancing

When considering a fabric topology such as illustrated in Figure 5, each of the nodes in the fabric export at least one MAC address and IP address to provide external Ethernet connectivity through the gateway nodes shown in 501a and 501b.

Exposing these fine grained MAC and IP addresses is advantageous for large scale web operations that use hardware load balancers because it provides a flat list of MAC / IP addresses for the load balancers to operate against, with the internal structure of the fabric being invisible to the load balancers.

But, smaller data centers can be potentially over-whelmed with a potentially large number of new MAC / IP addresses that a high-density low-power server can provide. It is advantageous to be able to provide the option for load balancing to insulate external data center infrastructure from having to deal individually with a large number of IP addresses for tiers such as web serving.

Consider Figure 20 where we have taken one port on the fabric switch and have added a FPGA that provides services such as IP Virtual Server (IPVS). This IP virtualization can be done a range of network levels including Layer 4 (Transport) and Layer 7 (Application). In many cases, it is advantageous for load balancing to be done at layer 7 for data center tiers such as web serving such that a http session state can be maintained locally by a specific web server node. The IPVS FPGA is only attached to the gateway nodes (nodes 501a and 501b in figure 5). In this example, the fabric illustrated in Figure 5, when augmented with the IPVS FPGAs on the gateway nodes, can export a single IP address per gateway node. The IPVS FPGA then load balances the incoming requests (e.g. HTTP requests) to the nodes within the fabric. With layer 4 load balancing, the IPVS FPGA can be done stateless, and use algorithms including round robin across nodes, or instancing a max number of requests per node before using the next node. With layer 7 load balancing, the IPVS FPGA will need to maintain state such that application sessions can be targeted to specific nodes.

The resulting flow becomes:

• Incoming request (e.g. HTTP request) enters the gateway node (Port 0) in Figure 20.

• The fabric switch routing tables have been configured to direct the incoming traffic from Port 0 to the IPVS FPGA port on the fabric switch.

• The IPVS FPGA rewrites the routing header to target a specific node within the fabric, and forwards the resulting packet to the target node.

• The target node processes the request, and sends the results normally out the gateway node.

OpenFlow / Software Defined Networking Enabled Fabric

OpenFlow is a communications protocol that gives access to the forwarding plane of a switch or router over the network. OpenFlow allows the path of network packets through the network of switches to be determined by software running on a separate server. This separation of the control from the forwarding allows for more sophisticated traffic management than feasible today using ACLs and routing protocols. OpenFlow is considered an implementation of the general approach of Software Defined Networking.

Figure 21 shows a way to build OpenFlow (or more generally software defined networking (SDF)) flow processing into the Calxeda fabric. Each of the gateway nodes would instance an OpenFlow enabled FPGA on a port of the gateway node's fabric switch. The OpenFlow FPGA needs an out-of-band path to the control plane processor, this can be done by a separate networking port on the OpenFlow FPGA, or can be done by simply claiming another port off the fabric switch to talk to the control plane processor.

The resulting flow becomes: • Incoming request enters the gateway node (Port 0) in Figure 20.

• The fabric switch routing tables have been configured to direct the incoming traffic from Port 0 to the OpenFlow / SDF FPGA port on the fabric switch.

• The OpenFlow / SDF FPGA implements standard OpenFlow processing, including optionally contacting the control plane processor if necessary.

The OpenFlow / SDF FPGA rewrites the routing header to target a specific node within the fabric (by MAC address), and forwards the resulting packet to the target node.

• The target node processes the request, and sends the results back to the OpenFlow FPGA where it implements any outgoing flow processing.

Integration of Power Optimized Fabric to Standard Processors via PCIe

The power optimized server fabric depicted in Figure 5 and described previously provide compelling advantages to existing standard processors and can be integrated as an integrated chip solution with existing processors. Standard desktop and server processors often support PCIe interfaces either directly, or via an integrated chipset. Figure 22 illustrates one example of an integration of the power optimized fabric switch to an existing processor via PCIe. Item 22a depicts a standard processor that supports one or more PCIe interfaces, either directly, or via an integrated chipset. Item 22b depicts the disclosed fabric switch with integrated Ethernet MAC controllers to which a PCIe interface has been integrated. Item 22b may typically be integrated together utilizing a FPGA or ASIC implementation of the PCIe integrated fabric switch.

In this disclosure, the nodes depicted in Figure 5 can be a heterogenous combination of power-optimized server SOCs with the integrated fabric switch, as well as this disclosed integration of PCIe connected standard processor to a PCIe interfaced module containing the Ethernet MACs and the fabric switch.

Integration of Power Optimized Fabric to Standard Processors via

Ethernet

The power optimized server fabric depicted in Figure 5 and described previously provide compelling advantages to existing standard processors and can be integrated as an integrated chip solution with existing processors. Standard desktop and server processors often support Ethernet interfaces via an integrated chip, or potentially provided within an SOC. Figure 23 illustrates one example of an integration of the power optimized fabric switch to an existing processor via Ethernet. Item 23a depicts a standard processor that supports an Ethernet interface, either by means of an SOC, or via an integrated chip. Item 23b depicts the disclosed fabric switch without the integrated inside Ethernet MAC controllers. Item 23b may typically be integrated together utilizing a FPGA or ASIC implementation of the integrated fabric switch.

In this disclosure, the nodes depicted in Figure 5 can be a heterogenous combination of power-optimized server SOCs with the integrated fabric switch, as well as this disclosed integration of Ethernet connected standard processor to the integrated fabric switch implemented in an FPGA or ASIC.

While the foregoing has been with reference to a particular embodiment of the invention, it will be appreciated by those skilled in the art that changes in this embodiment may be made without departing from the principles and spirit of the disclosure, the scope of which is defined by the appended claims.

Claims

Claims:

1. A computing device comprising:

a plurality of server nodes, wherein each server node includes a processor, a memory, an input/output circuit and a fabric switch that are interconnected to each other;

a fabric switch that interconnects the plurality of server nodes together by a plurality of fabric links; and

one or more Ethernet escapes from the fabric switch that form a power optimized server fabric.

2. The computing device of claim 1, wherein the plurality of server nodes are part of a server board.

3. The computing device of claim 2 further comprising an aggregate of a set of board, wherein each board has one or more server nodes, wherein each server node includes a processor, a memory, an input/output circuit and the set of boards are interconnected with the fabric switch to produce a larger server.

4. The computing device of claim 1, wherein the server nodes further comprise one or more fabric switches that switch of routing headers concatenated to Ethernet layer 2 packets.

5. The computing device of claim 1, wherein the fabric switch has a plurality of server node links wherein a speed of each server node link is set to optimize power.

6. The computing device of claim 1, wherein the fabric switch has a plurality of server node links wherein a speed of each server node link is dynamically adjustable to optimize power.

7. The computing device of claim 6, wherein each server node link speed is dynamically adjustable based on one of instantaneous utilization of the server node link and an average utilization of the server node link.

8. The computing device of claim 3, wherein one or more fabric links and one or more Ethernet escape use a PCIe connector to connect to the set of boards.

9. The computing device of claim 1 further comprising a passive backplane that provides a point-to-point server interconnect fabric.

10. The computing device of claim 1, wherein the plurality of server nodes form a tree having one or more levels and wherein the Ethernet escapes are at any level of the tree.

11. The computing device of claim 1, wherein each Ethernet escape is one of enabled and disabled to match bandwidth with optimized power usage.

12. The computing device of claim 1, wherein data between the server nodes traverses the fabric switch but not the Ethernet escapes.

13. The computing device of claim 4, wherein each server node has the computational components turned off to reduce power.

14. The computing device of claim 2 further comprising a plurality of server boards that form one of a rack and a backplane.

15. The computing device of claim 14 further comprising a plurality of shelves that compose a rack.

16. A computing device, comprising:

a storage device having a form factor;

a server node, wherein the server node includes a processor, a memory, an input/output circuit, a switch fabric and one or more SATA interfaces for the storage device, the server node having the same form factor as the storage device.

17. The computing device of claim 16 further comprising an array of storage devices and an array of server nodes interconnected to each other.

18. The computing device of claim 16, wherein the server node is within the storage device.

19. The computing device of claim 16, wherein the server node is connected to the storage device and the storage device is local storage for the server node.

20. The computing device of claim 16 further comprising a plurality of storage devices wherein each storage device is connected to one of the SATA interfaces so that the server node controls the plurality of storage devices.

21. The computing device of claim 16 further comprising a plurality of server nodes and a plurality of storage devices, wherein the plurality of server nodes are a switch fabric and control the plurality of storage devices.

22. The computing device of claim 16 further comprising one or more Ethernet escapes and a link and wherein each Ethernet escape and link has a PCIe connector.

23. A method for producing a high density computing system, the method comprising:

providing a server node having a processor, a memory, an input/output circuit, a switch fabric and one or more SATA interfaces; and

encapsulating the server node into a form factor of a hard disk drive.

24. The method of claim 23, wherein the switch fabric connects the server node to a network.

25. The method of claim 23 further comprising replacing the a backplane of the hard disk drive with a backplane suitable for creating at least one internal switching pathway.

26. A method for producing a high density computing system, the method comprising:

providing a standard form factor disk drive; and

integrating a server node having a processor, a memory, an input/output circuit, a switch fabric and one or more SATA interfaces into the standard form factor disk drive, wherein integrated compute capabilities are provided within the standard form factor disk drive.

27. A computing device, comprising:

a circuit board;

one or more dynamic memory chips mounted on the circuit board;

one or more computing chips mounted to the circuit board;

one or more flash memory chips mounted to the circuit board;

wherein the circuit board is vertically mounted so that the one or more flash memory chips below the one or more computing chips and the one or more dynamic memory chips are above the one or more computing chips;

a chimney cooler of the vertically mounted circuit board.

28. The computing device of claim 27 further comprising a plurality of vertically oriented circuit boards that are cooled by the chimney cooling.

29. The computing device of claim 27, wherein the chimney cooler is a fan at a bottom of the circuit board that cools the circuit board.

30. The computing device of claim 27, wherein the chimney cooler is a pneumatic air source and a vent pipe.

31. The computing device of claim 27, wherein the one or more dynamic memory chips in the vertically mounted circuit board are not directly above the one or more computing chips.

32. The computing device of claim 27, wherein the circuit board is a printed circuit board.

33. The computing device of claim 27, wherein the one or more dynamic memory chips, one or more computing chips and one or more flash memory chips are mounted diagonally on the circuit board.

34. A computing device comprising:

one or more processors;

a bus fabric connected to the one or more processors;

a fabric switch connected to the bus fabric that output data from the computing device to one or more ports; and

one or more routing header processors, wherein each routing header processor is used to route a particular transport stream so that the fabric switch handles different transport streams.

35. The computing device of claim 34, wherein the different transport streams include a server transport stream, a storage transport stream and a networking transport stream.

36. The computing device of claim 34 further comprising one or more Ethernet MAC controllers connected to the bus fabric and the fabric switch is connected to the one or more Ethernet MAC controllers that output data from the computing device to one or more ports and the one or more routing header processors is a PCIe header processor for routing PCIe data across the fabric switch.

37. The computing device of claim 36 further comprising a PCIe controller connected to the bus fabric, a PCIe routing header connected to the PCIe controller that are capable of connecting to a PCIe PHY.

38. The computing device of claim 34 further comprising a network processor connected to the switch fabric.

39. The computing device of claim 36 further comprising a foreign device connected to at least one port.

40. A computing device comprising:

one or more processors;

a bus fabric connected to the one or more processors;

a fabric switch connected to the bus fabric that output data from the computing device to one or more ports;

a bus protocol bridge connected between the bus fabric and the switch fabric; and one or more routing header processors, wherein each routing header processor is used to route a particular transport stream so that the fabric switch handles different transport streams.

41. A method for switching different transport streams; comprising:

providing one or more processors and a bus fabric connected to the one or more processors;

providing a fabric switch connected to the bus fabric that output data from the computing device to one or more ports; and

switching, using one or more routing header processors, a particular transport stream so that the fabric switch handles different transport streams.

42. The method of claim 41, wherein the different transport streams include a server transport stream, a storage transport stream and a networking transport stream.

43. The method of claim 41 further comprising routing PCIe data across the fabric switch.

44. A method of load balancing using a switch fabric, comprising:

providing a server node having one or more processors, a bus fabric connected to the one or more processors; a fabric switch connected to the bus fabric that output data from the computing device to one or more ports and an IP virtual server connected to the fabric switch; receiving an incoming request;

routing the incoming request to the IP virtual server connected to the fabric switch; generating, using the IP virtual server connected to the fabric switch, a routing header to a specific node of the fabric;

forwarding the incoming request to the specific node; and

processing, using the specific node, the incoming request to provide load balancing.

45. A method of processing using a switch fabric, comprising:

providing a server node having one or more processors, a bus fabric connected to the one or more processors; a fabric switch connected to the bus fabric that output data from the computing device to one or more ports and an OpenFlow device connected to the fabric switch;

receiving an incoming request;

routing the incoming request to the OpenFlow device connected to the fabric switch; generating, using the OpenFlow device, a routing header to a specific node of the fabric;

forwarding the incoming request to the specific node;

processing, using the specific node, the incoming request to provide load balancing; and

sending the processed incoming request back to the OpenFlow device.

46. A computing device, comprising:

one or more processors;

a bus fabric connected to the one or more processors; a fabric switch connected to the bus fabric that output data from the computing devicer more ports;

a PCIe interface connected to the bus fabric; and

an external processor connected to the computing device using the PCIe interface.

47. A computing device, comprising:

a fabric switch that output data from the computing device to one or more ports; an Ethernet port connected to the fabric switch; and

an external processor connected to the computing device using an Ethernet interface.