CA1182573A - Method for partitioning mainframe instruction sets to implement microprocessor based emulation thereof - Google Patents

Method for partitioning mainframe instruction sets to implement microprocessor based emulation thereof

Info

Publication number
CA1182573A
Authority
CA
Canada
Prior art keywords
chip
subset
instructions
microprocessor
microcode
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired
Application number
CA000424284A
Other languages
French (fr)
Inventor
Palmer W. Agnew
Joseph P. Buonomo
Steven R. Houghtalen
Anne S. Kellerman
Raymond E. Losinger
James W. Valashinas
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
International Business Machines Corp
Original Assignee
International Business Machines Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by International Business Machines Corp filed Critical International Business Machines Corp
Application granted granted Critical
Publication of CA1182573A publication Critical patent/CA1182573A/en
Expired legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/22Microcontrol or microprogram arrangements
    • G06F9/223Execution means for microinstructions irrespective of the microinstruction function, e.g. decoding of microinstructions and nanoinstructions; timing of microinstructions; programmable logic arrays; delays and fan-out problems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/22Microcontrol or microprogram arrangements
    • G06F9/26Address formation of the next micro-instruction ; Microprogram storage or retrieval arrangements
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/3017Runtime instruction translation, e.g. macros
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/3017Runtime instruction translation, e.g. macros
    • G06F9/30174Runtime instruction translation, e.g. macros for non-native instruction set, e.g. Javabyte, legacy code
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3877Concurrent instruction execution, e.g. pipeline, look ahead using a slave processor, e.g. coprocessor
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3877Concurrent instruction execution, e.g. pipeline, look ahead using a slave processor, e.g. coprocessor
    • G06F9/3879Concurrent instruction execution, e.g. pipeline, look ahead using a slave processor, e.g. coprocessor for non-native instruction execution, e.g. executing a command; for Java instruction set

Abstract

METHODS FOR PARTITIONING MAINFRAME INSTRUCTION SETS
TO IMPLEMENT MICROPROCESSOR BASED EMULATION THEREOF

Abstract of the Disclosure

Methods of applying LSI and microprocessors to the design of microprocessor-based LSI implementation of mainframe processors are described. The mainframe instruction set is partitioned into two or more subsets, each of which can be implemented by a microprocessor having special on-chip microcode or by a standard off-the-shelf microprocessor running programs written for that purpose. Alternatively, one or more of the subsets can be implemented by a single microprocessor. In addition, a subset of the partitioned instruction set can be implemented by emulation software, by off-chip vertical or horizontal microcode, or by primitives. However partitioning is implemented, the end result thereof is to keep the critical flow paths, associated with the most frequently used instruction subset, as short as possible by constraining them to a single chip.

Description

METHOD FOR PARTITIONING MAINFRAME INSTRUCTION SETS TO IMPLEMENT MICROPROCESSOR BASED EMULATION THEREOF

Background of the Invention

1. Field of the Invention

This invention is concerned with methods for partitioning the large instruction sets of mainframe computing systems in order that such partitioned sets can be run by a plurality of microprocessors. More particularly, this invention relates to methodology for partitioning mainframe instruction sets to obtain the most effective cost/performance emulation of the mainframe instruction set through microprocessor implementation thereof.

2. Description of the Prior Art

One noteworthy characteristic of this era of integrated circuits is that higher performance computers use lower levels of integration. This is the result of individual optimizations across the performance spectrum. Since the price of a state-of-the-art silicon chip is, on balance, independent of the level of integration, the price per gate is lower for microcomputers than for super computers. One result of this situation has been the complete reversal of Grosch's Law, which formerly stated that payment of twice as much for a computer would provide four times as much processing power. This meant that one would achieve the best cost/performance from the largest computer that could be justified when its resources were shared among many unrelated users. As amended by the most recent technological advances and designs, the reversal of Grosch's Law now implies that the best cost/performance will be obtained from the smallest computer that will perform an application in an acceptable time.

Large scale integration or LSI has played a major role in the cost/performance improvements of all computing systems, particularly in reducing storage costs. However, LSI has been much more effective in reducing the costs of low performance processors having simple architectures than of high performance processors having complex architectures. This property of LSI favors implementing high performance computers using large numbers of low performance processors and storage chips. However, this implementation is difficult to apply to existing complex architectures intended for uniprocessors that process a single stream of instructions. This limitation is best understood by considering the basic nature and effect of LSI on digital designs.
Recent improvements in the cost/performance of digital computer systems have been driven by the availability of increasingly denser LSI chips. Denser LSI memory chips, with reduced costs per bit stored, have direct and obvious applicability to digital systems over the entire application range from hand held calculators to super computers. Denser LSI logic chips, however, apply most naturally to digital systems near the low end of the performance and complexity spectrum.

LSI, as previously noted, applies naturally to very small digital systems. The logic portion of a hand calculator, microwave oven, or wrist watch, including the necessary memory and I/O device interfaces, can be implemented on a single LSI microcomputer chip. A small personal computer can be readily realized by using a single microprocessor chip, to implement the entire instruction set of the computer, together with other LSI chips which implement the interfaces between the microprocessor and the memory, keyboard, display tube, disks, printers, and communication lines. This is an example of partitioning a digital system's function for implementation by several LSI chips. This functional partitioning method is simple, well known, and straightforward because the instruction processing function can be accomplished entirely by a single chip.

Methods of applying LSI technology to the implementation of still more powerful digital systems, in which the state of the LSI art does not permit implementing the entire instruction processing function on a single LSI chip, are far less obvious. A first approach would be simply to wait until technology advances far enough to contain a desired architecture, of a given complexity, on a single chip. Unfortunately, this approach has its pitfalls. For example, the architecture of each generation's state-of-the-art microprocessor was determined by the then current capability of the technology, which explains why today's leading microprocessors lack floating-point instructions. The most significant disadvantage of this method is that it precludes implementing a pre-defined architecture that does not happen to fit within one chip in the current technology. This has led to the major software problems inherent in having each generation of microprocessors implement an essentially new architecture.
Another method of employing LSI in the larger, more complex processing systems is to partition the instruction execution function so that the data flow is on one chip and the microcode that controls the data flow is on one or more other chips. This method is the obvious application of LSI technology, separately, to the data flow and to the control store. Unfortunately, this method relinquishes the main advantage of LSI implementation, namely, that of having the control store and the data flow that it controls both on the same chip. In most processors, the critical path runs from control store, to data flow, to arithmetic result, to address of the next control store word. Its length, in nanoseconds, determines the microcycle time and hence the instruction processing rate of the processor. For a given power dissipation, a critical path that remains wholly on one LSI chip results in a shorter cycle time than that of a critical path that must traverse several inches of conductor and a number of chip-to-card pin connections.

This off-chip microcode partitioning method also requires what LSI technology is least adept at providing, namely, large numbers of pins. The data flow chip needs at least a dozen pins to tell the control store what microword to give it next. Even worse, the data flow chip needs from 16 to 100 pins to receive that control word. A processor using this method is often limited to roughly 16-bit control words, and hence a vertical microprogram that can control only one operation at a time, whereas a far higher performance processor could be designed if a 100-bit control word were available. If available, such 100-bit control words would permit a horizontal microprogram that can control several operations in each microcycle and thus perform a given function in fewer cycles. It should be noted that the off-chip microcode partitioning method has been particularly successful when applied to bit-slice processors, in which the data flow is not reduced to a single chip, but rather is a collection of chips, each of which implements a particular group of bits throughout the data flow. Bit-slice processors usually employ bipolar technologies whose densities are limited by the number of gates available, or the ability to cool them, rather than by the number of pins on the chips. The off-chip microcode partitioning method applies to FET implementations only in more unusual cases where many pins are available and the chip density happens to exactly match the number of gates needed to implement the data flow of a desired processor. The Toshiba T88000 16-bit microprocessor happens to meet these conditions. Such an implementation can be best viewed as a bit-slice design in which the implementable slice width has widened to encompass the entire desired dataflow.

Each major microprocessor manufacturer has faced the need to implement an architecture more complex than can be put onto a single LSI chip. Some needed to implement pre-existing architectures in order to achieve software compatibility with installed machines. Others sought to enhance the functions of existing successful one-chip microprocessors by adding further instructions.

For example, Digital Equipment Corporation needed a low-end implementation of their PDP-11 minicomputer architecture. They chose the off-chip microcode partitioning method. The result was the LSI-11 four-chip set manufactured first by Western Digital Corporation and then by Digital Equipment Corporation itself.

Intel Corporation needed to add hardware computational power, particularly floating-point instructions, to its 8086 microprocessor systems. For this purpose, they developed a "co-processor", the 8087. A processing system containing both an 8086 chip and an 8087 chip operates as follows. The chips fetch each instruction simultaneously. If the instruction is one that the 8086 can execute, it executes the instruction and both chips fetch the next instruction. If the instruction is one that the 8087 executes, the 8087 starts to execute it. In the usual case where a main store address is required, the 8086 computes the address and puts it on the bus shared with the 8087. The 8087 uses that address to complete execution of the instruction and then signals the 8086 that it is ready for both of them to fetch the next instruction. Thus, each chip looks at each instruction and executes its assigned subset, but only the 8086 computes addresses.
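By way of a minimal illustrative sketch, the lock-step dispatch just described can be modelled as a loop in which both "chips" see every instruction, each executes only its assigned subset, and only the integer processor computes main-store addresses. The opcode names, instruction format, and functions below are assumptions made purely for illustration and are not the actual Intel interface.

# Hypothetical sketch of the lock-step co-processor dispatch described above.
# Opcode names and the instruction format are illustrative only.

INTEGER_OPS = {"MOV", "ADD", "JMP"}      # subset executed by the integer processor
FLOAT_OPS = {"FADD", "FMUL"}             # subset executed by the co-processor

def run(program, memory):
    for op, operand in program:          # both chips "see" every instruction
        if op in INTEGER_OPS:
            print(f"integer unit executes {op} {operand}")
        elif op in FLOAT_OPS:
            # The integer unit still computes the effective address and places it
            # on the shared bus; the co-processor uses it to finish the work.
            value = memory.get(operand, 0.0)
            print(f"co-processor executes {op} on mem[{operand}] = {value}")
        # both chips then fetch the next instruction together

if __name__ == "__main__":
    run([("MOV", 1), ("FADD", 2), ("ADD", 3)], {2: 3.14})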
Zilog Corporation similarly needed to add floating-point instructions to its Z8000 microprocessor and developed an Extended Processing Unit or EPU. A system containing a Z8000 and one or more EPUs works as follows. The Z8000 fetches an instruction. If the Z8000 can execute the instruction, it does so. Otherwise, the Z8000 issues a request for service by an EPU and supplies an identifier (ID) that it determines by examining the instruction. One EPU recognizes that ID as its own and begins executing. The EPU can use special wires to the Z8000 to instruct the Z8000 to move necessary data back and forth between the EPU and the main store. The Z8000 proceeds to fetch and execute more instructions while the EPU is working, and only stops to wait for the EPU if it requests service by the same EPU while that EPU is still busy. Thus, it is the responsibility of the Z8000 to start the EPU and respond to commands from the EPU. A great deal of execution overlap is possible in such a system.

National Semiconductor Corporation had a similar requirement to add floating-point instructions to its NS-16000 microprocessor systems. It called the NS-16000 a "master" and called the computational processor a "slave". In a system containing a master and a slave, the master fetches instructions and executes them if it can. When the master fetches an instruction it cannot execute, it selects a slave to begin execution. The master sends the instruction and any needed data to the slave, waits for the slave to signal completion, receives the result, and proceeds to fetch the next instruction. Thus, the master never overlaps its execution with the slave's execution and is responsible for knowing what the slave is doing and what it needs.
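The master/slave arrangement, in contrast to the overlapped EPU scheme, is a plain synchronous hand-off: the master dispatches anything it cannot execute and then blocks until the slave returns a result. The following sketch uses invented opcodes and function names to illustrate only the control structure.

# Illustrative synchronous master/slave hand-off; the master never overlaps the slave.
MASTER_OPS = {"ADD": lambda a, b: a + b}
SLAVE_OPS = {"FDIV": lambda a, b: a / b}

def slave_execute(op, a, b):
    return SLAVE_OPS[op](a, b)            # slave computes, then signals completion

def master_run(program):
    results = []
    for op, a, b in program:
        if op in MASTER_OPS:
            results.append(MASTER_OPS[op](a, b))
        else:
            # Send the instruction and data to the slave, wait for completion,
            # receive the result, and only then fetch the next instruction.
            results.append(slave_execute(op, a, b))
    return results

if __name__ == "__main__":
    print(master_run([("ADD", 2, 3), ("FDIV", 1.0, 4.0)]))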
implementation of its Eclipse minico~puter 39 architecture. The resulting l~licroEclipse family 40 Ei.g82014 - 6 -
3~J5~3 employs a one-chip processor that contains the data flow as well as the horizontal (35-bit) and vertical 2 (18-bit) microcode for executing the most 3 performance-critical instructions in the 4 architecture. This ~rocessor can call for vertical 5 microwords from an off-chip control store, as 6 necessary, to execute the rest of the instructions 7 in the architecture by making use of the on-chip 8 horizontal microwords. This is a variant with some 9 of the advantages of both the off-chip control-store 10 method and the instruction-set partitioning method. 11 Designs that partitioned off I/O functions for 12 implementation on dedicated microprocessors were 13 common and none of the advanced microprocessor 14 partitioning methods previously discussed had yet 15 appeared when the present invention was conceived. 16 Partitioning of func-tions within a central 17 processing unit for implementation on separate 18 processors had been employed in super computers. 19 .heir goal was separa~e execution units for 20 fixed-point, floating-point, and perhaps decimal 21 instructions! tha~ could overlap execution to 22 achieve maximum throughput. 23 Objects and Summary of the Invention 25 Accordingly, it is a principal object of the 27 present invention to provide methods for 28 imple~enting large system instruction sets in a 29 manner that minimizes the cri~ical path. 30 It is also a principal object of the present 31 invention to provide a metho~l of implementing large 32 andlor complex instruction sets in a manner that 33 retains the critical path on one LSI chip. 34 It is another object of the present inver.tion 35 to provide methodology for implementing large and/or 36 complex instruction sets in an efficient manner that 37 takes maxi~um advantage of LSI technology without 3~
the need of providing a very large number of 39 c~ifferent custom chips~ 40 ~982014 7 ~

These and other objects of the present invention, in using LSI to implement an architecture that is too large or complex to implement on one chip, are realized by partitioning the instruction set of the architecture itself into subsets that are each microprocessor implemented. This method of utilizing select architectural subsets preserves the main advantage of a one-chip implementation, namely, keeping each critical path on a single chip. For each subset of the instructions for which execution time is important to system performance, the corresponding microprocessor chip contains the data flow path and all elements, including registers, necessary for the execution of that subset as well as the microcode that controls execution. The application of this method requires partitioning that makes each identified important subset fit on one microprocessor in the current state of technology, a way to quickly pass control back and forth between all of the microprocessors, a suitable way to pass data back and forth between all of the microprocessors, and a technology in which it is economically feasible to have several copies of a complex data flow and control store mechanism.

Brief Description of the Drawings

The invention will be described further, by way of preferred examples thereof, with reference to the accompanying drawings wherein:

Figure 1 schematically illustrates a partitioned mainframe instruction set in accordance with the present invention, said partitioned set having two overlapping subsets;

Figure 2 schematically reveals the critical path of a computing system, in particular, the critical path as it is comprehended by the present invention;

Figure 3 schematically depicts another partitioned mainframe instruction set in accordance with the present invention, said partitioned instruction set having four subsets, three of which are implemented by on-chip microcode and the other of which is implemented by higher level instructions stored in memory chips;

Figure 4 schematically shows another partitioned mainframe instruction set in accordance with the present invention, said partitioned instruction set having two subsets, only one of which is implemented by on-chip microcode and the other of which is implemented by higher level instructions stored in memory chips;

Figure 5 schematically depicts a further partitioned mainframe instruction set, partitioned in accordance with the present invention, said partitioned instruction set having only one on-chip implemented subset;

Figure 6 schematically illustrates another partitioned mainframe instruction set, with said partitioning being implemented in accordance with the present invention by placing predetermined vertical microcode elsewhere than on the implementing microprocessor chip;

Figure 7 schematically shows yet another partitioned mainframe instruction set, with said partitioning being implemented in accordance with the present invention by placing predetermined horizontal microcode elsewhere than on the implementing microprocessor chip; and

Figure 8, shown on the same sheet of drawings with Figure 3, schematically depicts still another partitioned mainframe instruction set, with said partitioning being implemented in accordance with the present invention by placing one subset and a collection of primitives on the implementing microprocessor chip.
Description of the Preferred Embodiment

Mainframe architecture can be microprocessor implemented in many ways with any one or more specific goals or criteria in mind. The goal of the present invention is to optimize cost/performance, not performance, at the low end of the mainframe spectrum. To achieve that end, it was decided to use a microprocessor that was general purpose in design, that was significantly microcoded, thereby allowing architectural tuning, and that had an appropriate number of 32-bit general purpose registers. Motorola's 16-bit processor, the 68000, was an excellent choice that fit this description rather well. This MPU implementation approach was selected due to projections that HMOS and comparable FET technologies would require a wait of several years before they would permit implementation of mainframe architecture on a single chip.

As used herein, the terms "mainframe architecture" or "mainframe instruction set" identify or refer to the architecture or instruction set of general purpose digital computers of the type that have a rich and varied instruction set, typically several hundred in number, a relatively wide word size, typically four bytes, and a complete methodology for handling exception conditions. The IBM 4331, manufactured by International Business Machines Corporation, is considered to be such a mainframe computer at the low end of the spectrum. Further, as used herein, "System/370" is a term that identifies a range of computers, also manufactured by International Business Machines Corporation, the details of which are well known and publicly documented, that also fall within the scope of the foregoing definition of a mainframe. In addition, as used herein, the term "critical path" defines a path that runs from the control store, to data flow, to arithmetic result, to address of the next control store word. Its length, in nanoseconds, determines the microcycle time and hence the instruction processing rate of the processor. For a given power dissipation, a critical path that remains wholly on one LSI chip results in a shorter cycle time than that of a critical path that must traverse several inches of conductor and a number of chip-to-card pin connections.

The following descriptions of several approaches to solving the problems of single chip mainframe implementation are limited to the instruction processing portion of a computer. Each approach provides a local bus within the instruction processing portion on which one or more microprocessor chips can communicate with each other and with a local store. Each approach assumes that the local bus can be connected to a global bus to allow the instruction processing portion to communicate with I/O devices and main memory. At other times, the local bus is disconnected from the global bus so that separate communications can occur over the two buses.

A. Two Overlapping Subsets

The first approach to partitioning a mainframe architecture employs two specially microcoded microprocessors A1 and B1 that implement overlapping subsets of the architecture, as schematically depicted in Figure 1. Each of the microprocessors is provided with on-chip microcode that replaces the standard microprograms that are usually found in a 68000. This overlapping emulation is achieved in the following manner. The mainframe architecture is partitioned into three sets named P1, Q1 and R1, with most of the high-frequency use instructions being in set P1.

As employed in this description, the terms "most frequently used instructions" or "high-frequency use instructions", or any other term having similar connotation, refer to those instructions in the entire set that are used the most when a typical group of user programs is run on a mainframe and the resulting instruction mix is surveyed. It has been found that at least 70%, and usually 75%, of such frequently used instructions can be grouped in the key or prime subset, subset P1 in this approach, and will account for approximately 95% or more of use of the computing system.
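That survey of a typical instruction mix can be pictured with a short calculation. The opcode mnemonics and frequency counts in this sketch are invented solely to show how a prime subset covering a target fraction of dynamic use would be chosen; they are not measured System/370 data.

# Toy survey of a dynamic instruction mix; the counts are invented for illustration.
from collections import Counter

def prime_subset(mix: Counter, coverage: float = 0.95):
    """Return the smallest set of opcodes covering `coverage` of executions."""
    total = sum(mix.values())
    chosen, covered = [], 0
    for op, count in mix.most_common():
        chosen.append(op)
        covered += count
        if covered / total >= coverage:
            break
    return chosen, covered / total

if __name__ == "__main__":
    mix = Counter({"L": 300, "ST": 180, "BC": 170, "AR": 90, "A": 60,
                   "AD": 25, "MD": 15, "AP": 10, "SSM": 2})
    subset, frac = prime_subset(mix)
    print(f"P1 candidate: {subset} covering {frac:.0%} of executions")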
The special microcode referred to above is written for the combination of sets P1 and Q1 to reside in processor A1, and microcode is written for the combination of sets P1 and R1 to reside in processor B1, as shown in Figure 1. At any one time, only one of the processors is "active", and the other processor is "passive". Only the active processor fetches and executes instructions and controls the bus. There is no contention between the processors.

This approach functions in the following manner. Assume that the last several instructions have all been either in set P1 or in set Q1. Thus, processor A1 is active and processor B1 is passive. Note that the internal values of processor A1 (I-counter, general purpose registers, condition code, etc.) are up-to-date, and the internal values of processor B1 are not. If the next instruction is in set R1, processor A1 fetches this instruction and performs the following operations:

1.) it places all of its internal values, that processor B1 might need in order to execute any instructions in sets P1 or R1, into a mailbox in a local store;

2.) it taps processor B1 on the shoulder, telling it to become the active processor, that is, to read new internal values from the mailbox and to then execute instructions as long as instructions remain in set R1 or set P1; and

3.) it becomes the passive processor until, sometime later, it feels a shoulder tap from processor B1 telling it to read internal values and execute an instruction in set Q1 and then continue executing all instructions up to the next instruction in set R1.
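The active/passive hand-off described in the three numbered operations above can be modelled compactly. The subset contents, mailbox fields, and opcode mnemonics below are illustrative assumptions; only the control structure (dump internal values, tap the other processor, swap roles) follows the description.

# Sketch of the A1/B1 active-passive swap; subsets and mailbox fields are illustrative.
P1 = {"L", "BC", "AR"}          # common high-usage subset
Q1 = {"AD", "MD"}               # floating point, on processor A1 only
R1 = {"AP", "SSK"}              # decimal and privileged, on processor B1 only

mailbox = {}                    # shared local store

class Processor:
    def __init__(self, name, subsets):
        self.name, self.subsets = name, subsets
        self.state = {"icounter": 0, "cc": 0}

    def can_execute(self, op):
        return any(op in s for s in self.subsets)

    def execute(self, op):
        self.state["icounter"] += 1
        print(f"{self.name} executes {op}")

def run(program):
    a1, b1 = Processor("A1", (P1, Q1)), Processor("B1", (P1, R1))
    active, passive = a1, b1
    for op in program:
        if not active.can_execute(op):
            mailbox.update(active.state)          # 1) dump internal values to the mailbox
            active, passive = passive, active     # 2) shoulder tap: swap roles
            active.state.update(mailbox)          #    new active reads the mailbox
        active.execute(op)                        # 3) old active now sits passive

if __name__ == "__main__":
    run(["L", "AD", "BC", "AP", "L", "AD"])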
The sets P1, Q1, and R1 are selected based on the following criteria. First, all of the high-usage instructions are placed in set P1, which is common to both processors, thereby greatly reducing the frequency of swapping the active and passive processors. This is desirable because, between swaps, instructions are executed as fast as if they were all implemented in the microcode of a single processor. Second, the frequency of processor swaps is reduced still further if sets Q1 and R1 are selected in such a way that instructions in these two sets seldom interleave with each other. One particularly suitable instruction set partition scheme is to have set P1 contain only fixed-point, branch, and load instructions, have set Q1 contain only floating-point instructions, and have set R1 contain only decimal and privileged instructions. This selection satisfies both criteria. First, the fixed-point, branch, and load/store instructions represent about 75% of the execution time in a typical mainframe instruction mix. Second, although there is frequent interleaving of floating-point, branch, and load instructions with either fixed-point instructions or decimal instructions, there is much less frequent interleaving of floating-point instructions with decimal instructions. Therefore, there is relatively little performance lost to swapping active and passive processors if this selection of P1, Q1, and R1 is made. In fact, a need for both floating-point and decimal instructions in the same application is sufficiently rare that special-purpose systems containing only one of microprocessor A1 or microprocessor B1 could be attractive.

If a selection is made in which instructions in sets Q1 and R1 frequently interleave, but have rather independent internal value modification characteristics, then an additional manipulation could be used to shorten the processor swap overhead time. This would be to have the passive processor actually executing instructions in set P1 along with the active processor, listening to the bus, and updating its internal values, but not controlling the bus or affecting any external values. In addition, the passive processor would decode those instructions not implemented in its own microcode just enough to see whether each such instruction would affect its internal values other than the I-counter and Condition Code (CC). If so, the passive processor would set a bit indicating that it must read internal values from the mailbox when it again becomes the active processor. If it becomes the active processor when this bit is still reset, then the passive processor would read in only the I-counter and CC values when it thereafter accessed the mailbox. This strategy often reduces the time required to swap the active and passive processors, although it does not reduce the frequency of swapping.

It should be noted that the foregoing partitioning method keeps the critical path of either microprocessor chip to an absolute minimum, since there is no deviation from the path shown in Figure 2. As used herein, the "critical path" in all of the described approaches defines a path, as shown in Figure 2 by way of example, that runs from the control store, to data flow (the dashed box), to arithmetic result, to address of the next control store word. The length of the critical path, in nanoseconds, determines the microcycle time and, hence, the instruction processing rate of the processor.

B. Four Subsets, Three Microcoded

The second approach to partitioning employs four microprocessors as shown in Figure 3. Three of these, a primary processor A2 and two secondary processors, B2 and C2, are Motorola Corporation 68000s with special on-chip microprograms that replace the 68000's standard microprograms. The first of these specially microcoded processors, A2, is utilized to implement I-cycles (instruction fetch and decode and effective-address calculation) for all instructions, and E-cycles (instruction execution) for the fixed-point, load, and branch instructions. The register space of this processor is used for the general purpose registers (GPRs). It should be noted that its on-chip microcode implements all functions that make heavy use of the GPRs, so the critical path remains on and is contained within one chip. The second of the special microprocessors, B2, is employed to implement E-cycles for floating-point instructions. Half of the register space in this microprocessor is used for the floating-point registers (FPRs) and the other half is used for work space. Again, the microcode is on the same chip as the registers and, of course, the data flow that it controls. An alternative design employs a different microprocessor chip that can execute floating-point instructions faster because its data flow is wide enough to process most common floating-point variables in parallel. The third of the specially coded microprocessors, C2, is used to handle the E-cycles for decimal instructions. All of the register space in this microprocessor is available for work space, since decimal instructions have storage-to-storage format.

The fourth microprocessor, D2, is off-the-shelf, that is, it contains the standard Motorola microcode that implements the instruction set of the 68000. The parts of the System/370 architecture not implemented by microcode, namely, privileged instructions, exception or error conditions, address translation misses, and interrupt handling, are simulated by sequences of 68000 instructions that are stored in a separate local store, rather than on a microprocessor chip. This is appropriate because these instructions and functions are used infrequently, so maximum speed is not required; are error-prone, so early models should have them in easily changed PROMs; and are voluminous, so they can be written more economically in the relatively high-level 68000 machine language rather than in the very low-level 68000 horizontal microcode language.

A system containing these four microprocessors operates as follows. The first or primary microprocessor A2 fetches an instruction. If it can execute the instruction, it does so. If not, the primary hands off control to one of the other or secondary microprocessors, B2 or C2. This involves, first, passing necessary data such as the operation code and effective address in predefined local store locations and, second, setting a new value into the four-state circuit (Quatch) whose state determines which microprocessor has control of the local bus that connects all four microprocessors and their local store, in parallel, to the rest of the system. The selected secondary runs, with full control of the local bus and full access to the main store and I/O system, until it has completed execution of the instruction it was given. Then, it sets the original value back into the Quatch, handing control back to the primary. At this point, the primary looks at a return code in local store and proceeds to fetch the next instruction, or passes control to the off-the-shelf secondary microprocessor for instruction error handling. Note that this mechanism for passing control allows a secondary microprocessor responsible for floating-point or decimal instructions to call on the off-the-shelf secondary to complete an instruction that detected an error. Thus, the error handling function, which is voluminous and not critical to performance, need not occupy valuable control store space on the floating-point secondary chip.

The desirability of this approach's partitioning of the System/370 architecture can be seen by noting that the primary processor runs more than 75% of the time when executing typical job mixes, and has to hand only one instruction in twenty over to a secondary processor.
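A sketch of the hand-off through the four-state circuit (the Quatch): the primary writes the operation code and effective address into agreed local-store locations, sets the Quatch to select a secondary, and resumes when the Quatch is set back and a return code is available. The opcode classes, store layout, and error path shown here are illustrative assumptions.

# Illustrative hand-off among a primary (A2) and three secondaries through a Quatch.
# Opcode classes, local-store layout, and the error path are invented for this sketch.
FIXED_POINT = {"AR", "L", "BC"}
FLOATING = {"AD", "MD"}
DECIMAL = {"AP", "ZAP"}

local_store = {}        # mailbox area shared on the local bus
quatch = "A2"           # four-state circuit selecting A2, B2, C2 or D2

def secondary(name, op):
    print(f"{name} executes {op} using {local_store}")
    local_store["return_code"] = 0      # report success back to the primary

def primary_run(program):
    global quatch
    for op, ea in program:
        if op in FIXED_POINT:
            print(f"A2 executes {op} at {hex(ea)}")    # critical path stays on one chip
            continue
        local_store.update({"opcode": op, "ea": ea})   # pass data through the local store
        quatch = "B2" if op in FLOATING else "C2" if op in DECIMAL else "D2"
        secondary(quatch, op)                          # selected secondary owns the local bus
        quatch = "A2"                                  # Quatch set back: control returns
        if local_store["return_code"] != 0:
            secondary("D2", op)                        # off-the-shelf chip handles the error

if __name__ == "__main__":
    primary_run([("AR", 0x100), ("AD", 0x200), ("AP", 0x300), ("SSM", 0x400)])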
C. Two Subsets, One Microcoded

The third approach to partitioning is similar to the second, but only employs a single specially microcoded microprocessor A3 and a standard coded microprocessor B3. This approach combines the excellent cost/performance of on-chip microcode for the most critical functions with the flexibility, extendibility, and low development cost of off-chip microprocessor code for less critical functions. It uses the structure shown in Figure 4 and works as follows. Processor A3, called the primary processor, contains the general purpose registers (GPRs) and the microcode for all functions that make heavy use of GPRs. It performs I-cycles for all instructions. It also performs E-cycles for the most frequently used instructions, that is, for almost all instructions except floating-point, decimal, and privileged instructions. In a typical instruction mix, the instructions that the primary processor executes constitute about 35% of the instructions by frequency of occurrence and about 50% of the instructions by execution time. Because the primary processor also performs I-cycles for all instructions, it actually runs more than 50% of the time.

The primary processor A3 is also responsible for detecting instructions for which it does not contain the execution microcode. It hands over control to the secondary processor B3 to complete such instructions. Most of the decimal, floating-point, and privileged instructions do a relatively large amount of data processing or are used very infrequently in typical instruction mixes. Therefore, the time to pass control from the primary processor to the secondary processor, and back, is relatively small. The secondary processor carries out the necessary processing under control of code contained in the local store. The same local store contains other registers, such as the floating point registers, and the mailboxes in which the processors leave instruction codes, operand addresses, condition codes, and other necessary data as they pass control back and forth. Control of the two processors is simple because only one of them is ever running at any one time. There is no overlap and no bus contention. Either processor can pass control to the other by inverting the state of the two-state latch that determines which of them is granted use of the bus.

It is important to note that a state-of-the-art microprocessor, the Motorola 68000, has been used to successfully implement a reasonably high-level machine language. This is the language in which most of the mainframe architecture is coded when using this approach to partitioning. Development of this code is rapid and inexpensive, in comparison to writing in a low-level microcode language. Moreover, the code resides in local store where it is easy to change, in comparison to microcode residing on a microprocessor chip. The corresponding disadvantage is that code implementing instructions tends to run longer than microcode implementing the same instructions. Therefore, there is a performance imbalance between the high-usage instructions, which are implemented in microcode, and the low-usage instructions, which are implemented in code.
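The size of that imbalance can be gauged with a back-of-the-envelope weighted average in the style of Amdahl's law. The 50% split is taken from the mix figures above; the assumed 4x microcode speed advantage is an invented number used only to show the shape of the calculation.

# Hypothetical illustration of the microcode/code performance imbalance.
frac_microcoded = 0.50        # share of execution time handled by on-chip microcode
speedup_microcode = 4.0       # assumed speed advantage of microcode over 68000 code

# Relative run time, taking an all-code implementation as 1.0 per unit of work.
time = frac_microcoded / speedup_microcode + (1.0 - frac_microcoded)
print(f"relative run time versus an all-code implementation: {time:.2f}")
print(f"overall speedup: {1.0 / time:.2f}x, limited by the share left in code")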
D. Subset With Emulation

The fourth approach relies heavily on software to implement parts of the architecture that cannot be placed on a single microprocessor chip, as is illustrated in Figure 5. In using this approach, one first defines a suitable subset P4 of the mainframe architecture, implements this subset as the "machine" architecture of the microprocessor chip, and then writes a first layer of software to raise the level of the subset to the level of the full mainframe architecture. The subset P4 must include sufficient instructions and functions to enable the first layer of software to simulate the rest of the mainframe architecture, including preservation of system integrity.

In some applications, no such first software layer is necessary. It might be possible to run some System/360 software, that which does not use new functions introduced in System/370, directly on the machine interface of the microprocessor chip. The selected subset might suffice for many OEM type applications, such as intelligent terminals, intelligent printers, and test-equipment control. Applications in turnkey "applications machines" could be written for the subset with customers or users never knowing that the subset was there. In other applications, missing instructions can be replaced by subroutine calls at compile time. In the remaining applications, the operating system, as shown in Figure 4, can have a first layer that handles "invalid operation" program interruptions by simulating the missing instructions instead of passing these interruptions up to the next-higher layer.

This solution to the problem of insufficient control store space has the advantages of minimal hardware development cost, risk, and time, as well as excellent product cost/performance for applications that employ only the selected subset. However, it has the disadvantages of a large mix imbalance in any sort of software simulation of missing instructions, and an increased maximum interrupt latency time.
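That first software layer behaves like an interrupt handler keyed on the "invalid operation" program interruption: if the offending opcode is one that subset P4 lacks, it is simulated and the program resumes; otherwise the interruption is passed up. The handler table and opcode names in this sketch are illustrative assumptions.

# Sketch of a first software layer simulating instructions missing from subset P4.
# The opcodes and handler semantics are invented for illustration.

def simulate_decimal_add(state, operands):
    state["result"] = operands[0] + operands[1]       # stand-in for the real semantics

SIMULATED = {"AP": simulate_decimal_add}              # instructions outside subset P4

def invalid_op_handler(state, opcode, operands):
    """First-layer handler for 'invalid operation' program interruptions."""
    if opcode in SIMULATED:
        SIMULATED[opcode](state, operands)            # simulate, then resume the program
        return "resumed"
    return "passed to the next-higher layer"          # a genuine program error

if __name__ == "__main__":
    state = {}
    print(invalid_op_handler(state, "AP", (3, 4)), state)
    print(invalid_op_handler(state, "BOGUS", ()))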
E. Off-Chip Vertical Microcode

The three remaining approaches employ two levels of microcode. The fifth approach, shown in Figure 6, has the advantage of using two levels of microcode with different widths. Current microprocessors achieve excellent cost/performance by allowing a single chip to contain both the control store and the data flow that it controls. Their cost/performance is further improved if the control store is wide, or "horizontal", rather than narrow, or "vertical". A wide control store eliminates most decoding, so it reduces both complexity and propagation delay. In addition, a wide control store can control several simultaneous operations, so it improves performance. However, a wide control store usually needs to contain more bits than a narrow one in order to implement a given function. As used herein, the terms "narrow" or "vertical" storage or microcode signify the use of a word length on the order of 16 bits, while the terms "wide" or "horizontal" signify a word length on the order of 100 bits. In between these two, although not used herein, is a midrange word length of approximately 32 bits.

One common solution to the problem of a large wide control store has been described with reference to the Motorola 68000 microprocessor. This solution is based on noting that the information in a wide control store is highly redundant; i.e., many control words have bits that are identical. The solution is to have both a wide horizontal store and a narrow vertical store. The horizontal store contains the few, non-redundant control bit patterns required by the data flow. The vertical store contains the many bit patterns that are necessary for sequencing through many machine instructions. Such an approach is said to reduce the total control store size by about a factor of two in the Motorola 68000 microprocessor.

Even with this approach, current microprocessors have insufficient on-chip control store to implement all of the microcode that is necessary to implement an architecture as complex as that found in a mainframe. Yet, there is a major cost/performance advantage in having all of the horizontal microcode on the same chip as the data flow, to avoid the many pins or bus cycles required to bring a wide control word onto the chip, and there is a cost/performance advantage in having the most frequently used vertical microwords on the same chip as the data flow to avoid any accesses to the off-chip bus in most microcycles. This leaves only the infrequently used vertical microwords to be stored off the microprocessor chip, in a microprocessor-based implementation of a large system or mainframe architecture.

Such an implementation leaves two detailed design problems to be solved. These problems are accommodated in the following manner. First, branch from on-chip to off-chip vertical microcode by setting a latch attached to a microprocessor output pin, by restricting the on-chip vertical micro read-only memory (ROM), for example to 512 words, and branching to a word whose address exceeds 511, or by branching to the highest valid on-chip vertical microword address after setting the off-chip vertical microword branch address onto the data bus. Second, allow conditional branches to depend on status bits by bringing up to 16 raw status bits off the chip, by way of the data bus or dedicated pins, just before the data bus or other dedicated pins are used to bring the next vertical microword on chip, or by using the branch control fields of the horizontal microwords to select just the desired status information and bring off of the chip just the low two bits of the address of the next off-chip microword.

Note that most horizontal microwords will probably be used by both on-chip and off-chip vertical microwords. However, some specially written horizontal microwords will have to be put onto the chip just for the use of the off-chip vertical microcode. That is, the microprocessor, as seen by the off-chip vertical control store, should interpret a thoroughly general and flexible vertical microcode language. This provides the ability to implement a complex mainframe architecture. The on-chip vertical microcode provides very high performance for the most-frequently-used portions of that architecture.

Other advantages of this method of partitioning microcode are that it allows microcoding for high speed, since coding for smallest size is not necessary; it allows off-chip vertical microcode, written for a first product, to be put in the on-chip vertical microstore in subsequent products whose microprocessors have larger Read Only Memory (ROM); and it encourages a microprogramming methodology of first selecting a set of useful horizontal microwords, and then stringing them together with vertical microwords, which increases microprogrammer productivity.
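A sketch of the resulting two-level fetch: vertical microword addresses below the on-chip ROM size (512 in the example above) are read on chip, while higher addresses are requested from the off-chip store at a cost of extra bus cycles. The microword contents and the two-cycle off-chip cost are illustrative assumptions.

# Sketch of a two-level vertical microcode fetch; microword contents are invented.
ON_CHIP_WORDS = 512                   # on-chip vertical ROM size from the example above

on_chip_rom = {0: ("H_LOAD", 1), 1: ("H_ADD", 600), 2: ("H_DONE", None)}
off_chip_store = {600: ("H_STORE", 2)}          # addr -> (horizontal word, next addr)

def fetch(addr, bus_cycles):
    if addr < ON_CHIP_WORDS:
        return on_chip_rom[addr], bus_cycles        # no off-chip access needed
    return off_chip_store[addr], bus_cycles + 2     # assumed: 2 bus cycles per off-chip word

def run(start):
    addr, cycles = start, 0
    while addr is not None:
        (horizontal, nxt), cycles = fetch(addr, cycles)
        print(f"address {addr:3d}: drive the data flow with {horizontal}")
        addr = nxt
    print(f"off-chip bus cycles spent: {cycles}")

if __name__ == "__main__":
    run(0)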
F. Off-Chip Horizontal Microcode

The sixth approach, shown in Figure 7, employs two sets of microwords that have the same width. One set is on the microprocessor chip and executes very rapidly. The other set is in an external store and can be very large. In a typical instruction mix, fixed-point, branch, and load instructions account for 95% of the instructions by frequency of occurrence, and for 60% to 75% of the instructions by execution time. Thus, these instructions are suitable candidates for this partitioning scheme to have on-chip. The remaining microwords, kept in an off-chip control store, are brought onto the chip one by one for execution. This could be done in several cycles using existing address and/or data pins for microword bits, or it could be done using dedicated pins. The off-chip control store must be wide enough for both the microword bits required by the data flow and the microword-selection bits required by the sequencer. The off-chip microword sequencer must have access to on-chip status information, in order to perform conditional microprogram branches and in order to pass control back and forth between on-chip and off-chip functions and instructions.

This method of partitioning the microcode necessary for implementing a complex mainframe architecture has the advantage of permitting an architecture of unlimited complexity to be implemented by use of a sufficiently large off-chip control store. Further, difficult parts of the architecture can be placed off-chip, where they can be corrected without altering the microprocessor chip itself. In addition, off-chip microcode written for a product so implemented may be placed on chip, with minimal modifications, if a subsequent product uses a microprocessor chip with larger on-chip control store. With care, patches to the on-chip microcode can be implemented in the off-chip microcode if errors are found. Since off-chip instructions are executed in the same engine as on-chip instructions, they have full access to registers, condition code, and other facilities of the machine, yielding other advantages. A final advantage accrues from the fact that all accesses to main storage and channels are made by the same microprocessor.

This arrangement for partitioning microcode between on-chip and off-chip control stores allows the most frequently used instructions to benefit from the cost/performance of microprocessors, due to the short critical path produced by on-chip microcode, and runs the remaining instructions and functions with the cost/performance characteristics of bit slices, with the longer critical path produced by off-chip microcode.
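The cost of bringing a wide microword on chip through a narrow set of pins is simply the number of transfer cycles needed to assemble it. The 100-bit word and 16 usable pins below are assumed figures chosen to match the word lengths discussed earlier.

# Illustrative cost of assembling an off-chip horizontal microword over narrow pins.
import math

MICROWORD_BITS = 100      # assumed horizontal control word width
DATA_PINS = 16            # assumed pins available for microword transfer

def fetch_cycles(word_bits: int, pins: int) -> int:
    """Bus cycles needed to bring one wide microword onto the chip."""
    return math.ceil(word_bits / pins)

if __name__ == "__main__":
    cycles = fetch_cycles(MICROWORD_BITS, DATA_PINS)
    print(f"{MICROWORD_BITS}-bit microword over {DATA_PINS} pins: {cycles} bus cycles")
    print("on-chip microwords need no bus cycles, hence the shorter critical path")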
G. Subset With Primitives

The last approach, shown in Figure 8, could produce a very economical processor at the expense of a difficult and prolonged development process. The most difficult aspect of this approach is defining suitable "primitive" operations. In principle, a microprocessor that contains on-chip microcode for a mainframe system's fixed-point, branch, and load/store instructions can be programmed to emulate the remainder of that system's architecture, as described under "Subset With Emulation" above. In practice, that design produces relatively poor performance for the instructions and functions that are emulated by off-chip code, rather than microcoded on the microprocessor chip. Microcoding some "primitives", instead of some instructions that could occupy the same on-chip control store space, can produce significantly higher performance on a complete instruction mix. A primitive is not itself a system instruction, but rather it executes a simple function that is useful in the emulation of more complicated instructions or functions. An emulation program can achieve higher performance if it has primitives available as well as the basic instructions. Examples of primitives are "load registers with contents of instruction fields", "set condition code according to arithmetic result" and "compute effective address".

This method of implementing a large system architecture on a microprocessor is itself implemented by subdividing the microprocessor's operation code space into the following three sets:

A.) codes of high-usage instructions, each of which is implemented by a sequence of on-chip microcode;

B.) codes assigned to primitives which are useful for emulating instructions, each of which is implemented by a sequence of on-chip microcode; and

C.) codes of the remaining low-usage instructions, each of which is implemented by a sequence of high-usage instructions (A) and primitives (B).

In operation, an instruction stream is being fetched from store. As long as these instructions' codes are found to be in set A, execution is controlled by on-chip microcode. Any codes in set B are illegal in this mode. When an instruction's code is found to be in set C, direct execution of on-chip microcode is terminated after completion of that instruction's I-cycles, which can include effective address generation. The instruction code selects a starting address in a private program store, and the microprocessor fetches its next "instruction" from this address. That "instruction" code will be in set A or B, so it initiates a sequence of on-chip microcode. This sequence ends by fetching another "instruction" which initiates another sequence of on-chip microcode, and so on, until the instruction whose code was in set C has been completely emulated. Then the next instruction is fetched from store, not from the private program store. That instruction, too, is either executed directly by a sequence of on-chip microcode, or simulated by "instructions" in the private program store, which are in turn executed by sequences of on-chip microcode.

It should be noted that the emulation mode used to program a low-usage instruction, whose code is in set C, has the following special characteristics. In this mode, "instructions" are fetched from the private program store, not from main store. The instruction counter is not incremented, and codes in both sets A and B are legal while emulating an instruction in set C. In addition, interrupts must be held pending until all of the "instructions" that emulated one instruction in set C are completed. Any instructions in set A, that are used along with primitives in set B to simulate an instruction in set C, must be prevented from changing the condition code or taking their ordinary exceptions.
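The three-way split of the operation code space, and the switch into emulation mode on a set C code, can be pictured as a small interpreter loop. The opcodes, the single primitive, and the private program store contents below are all illustrative assumptions.

# Sketch of set A / set B / set C dispatch with a private program store (approach G).

def add(state):                 # set A: a high-usage instruction
    state["r1"] += state["r2"]

def set_cc(state):              # set B: a primitive, useful only for emulation
    state["cc"] = int(state["r1"] != 0)

SET_A = {"ADD": add}
SET_B = {"SET_CC": set_cc}
PRIVATE_STORE = {"MVCL": ["ADD", "SET_CC"]}   # set C codes map to emulation routines

def execute(op, state, emulating=False):
    if op in SET_A:
        SET_A[op](state)
    elif op in SET_B:
        if not emulating:
            raise RuntimeError("set B codes are illegal outside emulation mode")
        SET_B[op](state)
    else:                                     # set C: switch to the private program store
        for micro_op in PRIVATE_STORE[op]:    # interrupts held pending in this mode
            execute(micro_op, state, emulating=True)

if __name__ == "__main__":
    state = {"r1": 1, "r2": 2, "cc": 0}
    for op in ["ADD", "MVCL"]:                # instruction stream fetched from main store
        execute(op, state)
    print(state)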
Some advantages of this method of partitioning the architecture between on-chip microcode and off-chip emulation code are as follows. An instruction in set C can be simulated with relatively few bus cycles. An "instruction" brought in from the private instruction store by one or two bus cycles initiates a sequence of many microwords which do not require bus cycles. Constant data needed by difficult instructions or by interrupts, such as Translate and Test's implied register, or interrupts' many implied storage addresses, can be brought in easily as immediate fields of "instructions" fetched from the private program store. Such constants may be difficult to introduce by way of on-chip microcode. An architecture of unlimited complexity can be emulated by a sufficiently large private program store, if the codes in sets A and B supply functions of sufficient generality. The private program store can be relatively small, because it stores relatively powerful "instructions" each of which is interpreted by many microwords. This is especially true if powerful branch and subroutine call "instructions" are used to save space.

The transfer of control from on-chip microcode to an off-chip emulation program need not be limited to the time when an I-cycle completes. On-chip microcode should be allowed to call for simulation of the rest of an instruction whenever it detects an unusual condition that does not require high performance, is difficult to handle, and would otherwise consume many valuable on-chip microwords. For example, the on-chip microcode for Move Characters should be able to call an off-chip program if it detects operand overlap.
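As a sketch of how such an escape might look, assuming a byte-addressed emulated main store and a hypothetical call_offchip_program hook (neither of which is named in the patent):

    /* Illustrative only: a Move Characters routine whose on-chip path handles
     * the common case and escapes to an off-chip program when it detects
     * operand overlap.  main_store, call_offchip_program and the entry number
     * are assumed names. */
    #include <stddef.h>
    #include <stdint.h>

    extern uint8_t main_store[];                          /* emulated main storage */
    extern void call_offchip_program(unsigned entry_id);  /* assumed escape hook   */

    #define OFFCHIP_MVC_OVERLAP 42u                       /* hypothetical entry id */

    void move_characters(uint32_t dest, uint32_t src, size_t len)
    {
        /* Overlap is the unusual, performance-insensitive case: hand it off. */
        if (dest > src && dest < src + len) {
            call_offchip_program(OFFCHIP_MVC_OVERLAP);
            return;
        }

        for (size_t i = 0; i < len; i++)                  /* common case on chip   */
            main_store[dest + i] = main_store[src + i];
    }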
H. Conclusion

The foregoing description has been specifically directed to a methodology by means of which state-of-the-art microprocessors can be utilized to emulate mainframe architectures. A comparison summary of the various approaches is presented in Table I. This table should prove useful in comparing each approach with respect to different measures of goodness. Although the present invention has been described in the context of preferred embodiments thereof, it will be readily apparent to those skilled in the appertaining art that modifications and variations can be made therein without departing from its spirit and scope. Accordingly, it is not intended that the present invention be limited to the specifics of the foregoing description of the preferred embodiments. Instead, the present invention should be considered as being limited solely by the appended claims, which alone are intended to define its scope.

TABLE I

APPROACH  APPROACH NAME              RANK*     MAIN ADVANTAGE           MAIN DISADVANTAGE
                                     P B D R

A         TWO OVERLAPPING SUBSETS    7 7 2 1   LOW BUILD COST,          CANNOT IMPLEMENT
                                               GOOD BALANCE             RICH ARCHITECTURE

B         FOUR SUBSETS, MICROCODED             HIGH                     HIGH BUILD

C         TWO SUBSETS, MICROCODED              GOOD/COST                UNBALANCED

D         SUBSET WITH EMULATION                LOW                      LOW PERFORMANCE

E         OFF-CHIP MICROCODE                                            NEED COMPLETE
                                                                        HORIZONTAL MICROWORDS

F         OFF-CHIP MICROCODE                   CAN IMPLEMENT            LOW
                                               ARCHITECTURE

G         SUBSET WITH PRIMITIVES     5 6 1 5   GOOD COST/               NEED COMPLETE SET OF
                                               PERFORMANCE              SYSTEM/370 PRIMITIVES

* RANK KEY (7 IS BEST): P = PERFORMANCE, B = BUILD COST,
  D = DEVELOPMENT COST, R = RICHNESS OF IMPLEMENTABLE ARCHITECTURE

Claims (27)

The embodiments of the invention in which an exclusive property or privilege is claimed are defined as follows:
1. A method for microprocessor implemented emulation of a mainframe computer, using large scale integrated microprocessor chips, said method comprising the steps of a) partitioning the instruction set of said mainframe computer into a plurality of subsets, at least one of which completely fits on and is entirely executable by a single microprocessor chip;
b) providing chip based microprocessors on which said instruction set subsets can be implemented, each of said microprocessors being capable of supporting on-chip microcode;
c) providing each microprocessor with all of the necessary microcode to allow implementation and control execution of its resident subset instructions entirely on-chip;
d) providing at least one path between all of said microprocessors via which control can be passed back and forth between said microprocessors; and e) providing at least one path between all of said microprocessors via which data can be passed back and forth between said microprocessors.

2. The method according to claim 1 wherein said completely fitting instruction subset includes those instructions that make substantial use of the general purpose registers and wherein the microprocessor chip on which said subset is
resident is provided with sufficient general purpose registers to handle the instructions of said subset and with microcode for all of the subset's functions that make use of said general purpose registers.
3. The method according to claim 1 wherein said mainframe instruction set is partitioned into three subsets, P1, Q1 and R1, and wherein subsets P1 and Q1 are grouped together for complete implementation on a single microprocessor chip A1 and subsets P1 and R1 are grouped together for complete implementation on another single microprocessor chip B1.
4. The method according to claim 3 wherein subset P1 is partitioned to constitute about 95% of the mainframe instruction set by frequency of occurrence and about 70% of the instructions by execution time, with subsets Q1 and R1 partitioned to include only the remaining mainframe instructions.
5. The method according to claim 4 wherein subset P1 is partitioned to include only fixed-point, branch and load instructions, subset Q1 is partitioned to include only floating-point instructions and subset R1 is partitioned to include only decimal and privileged instructions.
6. The method according to claim 5 wherein said partitioning step is supported by providing on-chip microcode for microprocessors A1 and B1 to implement the respective instruction subset groups of each.
7. The method according to claim 6 wherein said microprocessors are operated in non-contention with only one being active and the other passive at any one time.
8. The method according to claim 1 wherein said mainframe instruction set is partitioned into four subsets, P2, Q2, R2 and S2,and which method further includes the steps of providing an off-chip control store and microprocessor chips A2, B2, C2 and D2, said subsets P2, Q2 and R2 being each respectively completely implemented on microprocessor chips A2, B2 and C2, with all remaining mainframe instructions not found in said partitioned subsets P2, Q2, and R2, namely those contained in subset S2, being provided by simulation instruction sequences stored in said off-chip control store, with microprocessor chip D2 being used for initialization, PSW
maintenance, start I/O and housekeeping functions.
9. The method according to claim 8 wherein subset P2 is partitioned to implement I-cycles for all instructions and E-cycles for the fixed-point, load and branch instructions to gather therein all functions that make frequent use of general purpose registers, subset Q2 is partitioned to implement E-cycles for floating-point instructions and subset R2 is partitioned to implement E-cycles for decimal instructions.
10. The method according to claim 9 wherein subset P2 is partitioned to constitute about 95% of the mainframe instruction set by frequency of occurrence and about 70% of the instructions by execution time, with subsets Q2, R2 and S2 partitioned to include only the remaining mainframe instructions.

11. The method according to claim 10 which further includes the step of providing microcode for
each of said microprocessor chips A2, B2 and C2 to enable full implementation thereon of its assigned instruction subset.
12. The method according to claim 1 wherein said mainframe instruction set is partitioned into two subsets, P3 and Q3, and which method further includes the steps of providing an off-chip control store and microprocessor chips A3 and B3, said subset P3 being completely implemented on microprocessor chip A3, with all remaining mainframe instructions not found in said partitioned subset P3, namely those of subset Q3, being provided by simulation instruction sequences stored in said off-chip control store, with microprocessor chip B3 being used for executing said simulation instructions for said subset Q3 and also for address translation misses, exception conditions and interrupt handling.
13. The method according to claim 12 wherein subset P3 is partitioned to implement I-cycles for all instructions and E-cycles for the fixed-point, load and branch instructions to gather therein all functions that make frequent use of general purpose registers.
14. The method according to claim 13 which further includes the step of providing microcode for said microprocessor chip A3 to enable full implementation thereon of its assigned instruction subset P3.
15. The method according to claim 14 wherein subset P3 is partitioned to constitute about 95% of the mainframe instruction set by frequency of occurrence and about 50% of the instructions by execution time, with subset Q3 partitioned to include only the remaining mainframe instructions.
16. The method according to claim 1 wherein said mainframe instruction set is partitioned into at least subsets P4 and Q4, which method further includes the steps of providing a microprocessor chip A4, implementing subset P4 completely on microprocessor chip A4 as the machine architecture thereof, and then providing at least one layer of software that will raise the level of subset P4 to that of the full mainframe architecture, said software layer including sufficient instructions and functions to simulate said instruction subset Q4, namely the mainframe architecture not defined in and by subset P4, including preservation of system integrity.
17. The method according to claim 1 wherein said mainframe instruction set is partitioned into two subsets, P5 and Q5, and which method further includes the steps of providing an off-chip control store that is suitable for storing vertical microcode therein, providing microprocessor chip A5, implementing said subset P5 completely on microprocessor chip A5, with all remaining mainframe instructions not found in said partitioned subset P5, namely those of subset Q5, being provided by vertical microcode stored in said off-chip control store, and using microprocessor chip A5 for managing said instruction subset Q5, address translation misses, exception conditions and interrupt handling.
18. The method according to claim 17 wherein subset P5 is partitioned to implement all instructions that do not require infrequently used vertical microcode.
19. The method according to claim 18 which further includes the step of providing microcode for said microprocessor chip A5 to enable full implementation thereon of its assigned instruction subset P5.
20. The method according to claim 19 which includes the additional step of including microcode on said microprocessor chip A5 to assist in the implementation of said off-chip vertical microcode.
21. The method according to claim 20 which includes the additional steps of providing a latch coupled to a microprocessor A5 output pin and branching from on-chip to off-chip vertical microcode by setting said latch whenever a predetermined condition occurs.
22. The method according to claim 1 wherein said mainframe instruction set is partitioned into two subsets, P6 and Q6, and which method further includes the steps of providing an off-chip control store that is suitable for storing horizontal microcode therein, providing microprocessor chip A6, implementing said subset P6 completely on microprocessor chip A6, with all remaining mainframe instructions not found in said partitioned subset P6, namely those of subset Q6, being provided by horizontal microcode stored in said off-chip control store, and using microprocessor chip A6 for this and for managing privileged instructions, address translation misses, exception conditions and interrupt handling.
23. The method according to claim 22 wherein subset P6 is partitioned to implement all instructions that do not require infrequently used horizontal microcode.
24. The method according to claim 23 which further includes the step of providing microcode for said microprocessor chip A6 to enable full implementation thereon of its assigned instruction subset P6.
25. The method according to claim 24 which includes the additional step of including microcode on said microprocessor chip A6 to assist in the implementation of said off-chip horizontal microcode.
26. The method according to claim 25 which includes the additional steps of providing a latch coupled to a microprocessor A6 output pin and branching from on-chip to off-chip horizontal microcode by setting said latch whenever a predetermined condition occurs.
27. The method according to claim 1 wherein said mainframe instruction set is partitioned into two subsets, P7 and Q7, and which method further includes the steps of providing an off-chip control store that is suitable for storing code therein, providing a microprocessor chip A7, providing on-chip microcode for implementing said subset P7 entirely on said microprocessor A7, assigning and providing operation codes to identify and implement, as primitives, additional instructions in said instruction subset P7 using said on-chip microcode, providing code for said instruction subset Q7 that is stored in said off-chip control store and is implemented by a mix of on-chip microcode and primitives.
CA000424284A 1982-04-26 1983-03-23 Method for partitioning mainframe instruction sets to implement microprocessor based emulation thereof Expired CA1182573A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US371,634 1982-04-26
US06/371,634 US4514803A (en) 1982-04-26 1982-04-26 Methods for partitioning mainframe instruction sets to implement microprocessor based emulation thereof

Publications (1)

Publication Number Publication Date
CA1182573A true CA1182573A (en) 1985-02-12

Family

ID=23464784

Family Applications (1)

Application Number Title Priority Date Filing Date
CA000424284A Expired CA1182573A (en) 1982-04-26 1983-03-23 Method for partitioning mainframe instruction sets to implement microprocessor based emulation thereof

Country Status (6)

Country Link
US (1) US4514803A (en)
EP (1) EP0092610A3 (en)
JP (1) JPS58191044A (en)
BR (1) BR8301430A (en)
CA (1) CA1182573A (en)
ES (1) ES521806A0 (en)

Families Citing this family (52)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4870614A (en) * 1984-08-02 1989-09-26 Quatse Jesse T Programmable controller ("PC") with co-processing architecture
EP0174231B1 (en) * 1984-08-02 1990-11-14 Telemecanique Programmable controller ("pc") with co-processing architecture
US4763242A (en) * 1985-10-23 1988-08-09 Hewlett-Packard Company Computer providing flexible processor extension, flexible instruction set extension, and implicit emulation for upward software compatibility
JPS6398038A (en) * 1986-10-06 1988-04-28 インタ−ナショナル・ビジネス・マシ−ンズ・コ−ポレ−ション Data processing system
US5210832A (en) * 1986-10-14 1993-05-11 Amdahl Corporation Multiple domain emulation system with separate domain facilities which tests for emulated instruction exceptions before completion of operand fetch cycle
US5025364A (en) * 1987-06-29 1991-06-18 Hewlett-Packard Company Microprocessor emulation system with memory mapping using variable definition and addressing of memory space
US5179703A (en) * 1987-11-17 1993-01-12 International Business Machines Corporation Dynamically adaptive environment for computer programs
US4951195A (en) * 1988-02-01 1990-08-21 International Business Machines Corporation Condition code graph analysis for simulating a CPU processor
US5077657A (en) * 1989-06-15 1991-12-31 Unisys Emulator Assist unit which forms addresses of user instruction operands in response to emulator assist unit commands from host processor
DE69029995T2 (en) * 1989-11-09 1997-08-21 Ibm Multiprocessor with relatively atomic instructions
US5430862A (en) * 1990-06-29 1995-07-04 Bull Hn Information Systems Inc. Emulation of CISC instructions by RISC instructions using two pipelined stages for overlapped CISC decoding and RISC execution
US5280595A (en) * 1990-10-05 1994-01-18 Bull Hn Information Systems Inc. State machine for executing commands within a minimum number of cycles by accomodating unforseen time dependency according to status signals received from different functional sections
US5263034A (en) * 1990-10-09 1993-11-16 Bull Information Systems Inc. Error detection in the basic processing unit of a VLSI central processor
US5226164A (en) * 1991-04-24 1993-07-06 International Business Machines Corporation Millicode register management and pipeline reset
US5438668A (en) * 1992-03-31 1995-08-01 Seiko Epson Corporation System and method for extraction, alignment and decoding of CISC instructions into a nano-instruction bucket for execution by a RISC computer
DE69325736T2 (en) * 1992-04-27 1999-11-18 Sony Corp Data processing system in which the compatibility between different models is guaranteed
JP3644959B2 (en) * 1992-09-29 2005-05-11 セイコーエプソン株式会社 Microprocessor system
US6735685B1 (en) 1992-09-29 2004-05-11 Seiko Epson Corporation System and method for handling load and/or store operations in a superscalar microprocessor
US5781750A (en) * 1994-01-11 1998-07-14 Exponential Technology, Inc. Dual-instruction-set architecture CPU with hidden software emulation mode
GB2307072B (en) 1994-06-10 1998-05-13 Advanced Risc Mach Ltd Interoperability with multiple instruction sets
US5632028A (en) * 1995-03-03 1997-05-20 Hal Computer Systems, Inc. Hardware support for fast software emulation of unimplemented instructions
US5819063A (en) * 1995-09-11 1998-10-06 International Business Machines Corporation Method and data processing system for emulating a program
US5812823A (en) * 1996-01-02 1998-09-22 International Business Machines Corporation Method and system for performing an emulation context save and restore that is transparent to the operating system
US5784638A (en) * 1996-02-22 1998-07-21 International Business Machines Corporation Computer system supporting control transfers between two architectures
US6356995B2 (en) 1998-07-02 2002-03-12 Picoturbo, Inc. Microcode scalable processor
US6366998B1 (en) * 1998-10-14 2002-04-02 Conexant Systems, Inc. Reconfigurable functional units for implementing a hybrid VLIW-SIMD programming model
US7013456B1 (en) 1999-01-28 2006-03-14 Ati International Srl Profiling execution of computer programs
US8065504B2 (en) * 1999-01-28 2011-11-22 Ati International Srl Using on-chip and off-chip look-up tables indexed by instruction address to control instruction execution in a processor
US6954923B1 (en) 1999-01-28 2005-10-11 Ati International Srl Recording classification of instructions executed by a computer
US7941647B2 (en) * 1999-01-28 2011-05-10 Ati Technologies Ulc Computer for executing two instruction sets and adds a macroinstruction end marker for performing iterations after loop termination
US8074055B1 (en) 1999-01-28 2011-12-06 Ati Technologies Ulc Altering data storage conventions of a processor when execution flows from first architecture code to second architecture code
US7111290B1 (en) 1999-01-28 2006-09-19 Ati International Srl Profiling program execution to identify frequently-executed portions and to assist binary translation
US8127121B2 (en) * 1999-01-28 2012-02-28 Ati Technologies Ulc Apparatus for executing programs for a first computer architechture on a computer of a second architechture
US7065633B1 (en) 1999-01-28 2006-06-20 Ati International Srl System for delivering exception raised in first architecture to operating system coded in second architecture in dual architecture CPU
US7275246B1 (en) 1999-01-28 2007-09-25 Ati International Srl Executing programs for a first computer architecture on a computer of a second architecture
US6826748B1 (en) 1999-01-28 2004-11-30 Ati International Srl Profiling program execution into registers of a computer
US6978462B1 (en) 1999-01-28 2005-12-20 Ati International Srl Profiling execution of a sequence of events occuring during a profiled execution interval that matches time-independent selection criteria of events to be profiled
US6779107B1 (en) 1999-05-28 2004-08-17 Ati International Srl Computer execution by opportunistic adaptation
US6549959B1 (en) 1999-08-30 2003-04-15 Ati International Srl Detecting modification to computer memory by a DMA device
WO2001050251A1 (en) * 1999-12-31 2001-07-12 Intel Corporation External microcode
US6934832B1 (en) 2000-01-18 2005-08-23 Ati International Srl Exception mechanism for a computer
US7020600B2 (en) * 2001-09-07 2006-03-28 Texas Instruments Incorporated Apparatus and method for improvement of communication between an emulator unit and a host device
US7331040B2 (en) * 2002-02-06 2008-02-12 Transitive Limted Condition code flag emulation for program code conversion
GB0202728D0 (en) * 2002-02-06 2002-03-27 Transitive Technologies Ltd Condition code flag emulation for program code conversion
US7221763B2 (en) * 2002-04-24 2007-05-22 Silicon Storage Technology, Inc. High throughput AES architecture
US20040019765A1 (en) * 2002-07-23 2004-01-29 Klein Robert C. Pipelined reconfigurable dynamic instruction set processor
US7360062B2 (en) * 2003-04-25 2008-04-15 International Business Machines Corporation Method and apparatus for selecting an instruction thread for processing in a multi-thread processor
US7401208B2 (en) * 2003-04-25 2008-07-15 International Business Machines Corporation Method and apparatus for randomizing instruction thread interleaving in a multi-thread processor
US7401207B2 (en) * 2003-04-25 2008-07-15 International Business Machines Corporation Apparatus and method for adjusting instruction thread priority in a multi-thread processor
CN100342319C (en) * 2005-09-29 2007-10-10 威盛电子股份有限公司 Magnetic disk array instruction processing method
US7596781B2 (en) * 2006-10-16 2009-09-29 International Business Machines Corporation Register-based instruction optimization for facilitating efficient emulation of an instruction stream
US20240017029A1 (en) * 2020-11-06 2024-01-18 Bon Secours Mercy Health, Inc. Endotracheal tube

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS51126020A (en) * 1975-04-25 1976-11-02 Hitachi Ltd Micro program control equipment
US4128876A (en) * 1977-04-28 1978-12-05 International Business Machines Corporation Synchronous microcode generated interface for system of microcoded data processors
JPS5532118A (en) * 1978-08-28 1980-03-06 Fujitsu Ltd Data processing system
US4354225A (en) * 1979-10-11 1982-10-12 Nanodata Computer Corporation Intelligent main store for data processing systems
US4321666A (en) * 1980-02-05 1982-03-23 The Bendix Corporation Fault handler for a multiple computer system
JPS5833975B2 (en) * 1980-07-30 1983-07-23 富士通株式会社 data processing system

Also Published As

Publication number Publication date
JPS58191044A (en) 1983-11-08
ES8404073A1 (en) 1984-04-01
EP0092610A3 (en) 1985-05-29
ES521806A0 (en) 1984-04-01
EP0092610A2 (en) 1983-11-02
BR8301430A (en) 1983-11-29
US4514803A (en) 1985-04-30

Similar Documents

Publication Publication Date Title
CA1182573A (en) Method for partitioning mainframe instruction sets to implement microprocessor based emulation thereof
Segars et al. Embedded control problems, Thumb, and the ARM7TDMI
US4975836A (en) Virtual computer system
EP0368332B1 (en) Pipeline data processor
US4569016A (en) Mechanism for implementing one machine cycle executable mask and rotate instructions in a primitive instruction set computing system
US4135242A (en) Method and processor having bit-addressable scratch pad memory
US5781750A (en) Dual-instruction-set architecture CPU with hidden software emulation mode
EP0211962B1 (en) Conditional branch unit for microprogrammed data processor
US5404472A (en) Parallel processing apparatus and method capable of switching parallel and successive processing modes
US4338661A (en) Conditional branch unit for microprogrammed data processor
EP0130377B1 (en) Condition register architecture for a primitive instruction set machine
US4325121A (en) Two-level control store for microprogrammed data processor
Rafiquzzaman Microprocessors and microcomputer-based system design
US4713750A (en) Microprocessor with compact mapped programmable logic array
US4312034A (en) ALU and Condition code control unit for data processor
US4562538A (en) Microprocessor having decision pointer to process restore position
Blanck et al. The superSPARC microprocessor
EP1034472A2 (en) An instruction decoder
US5859994A (en) Apparatus and method for modifying instruction length decoding in a computer processor
US4430711A (en) Central processing unit
US4991086A (en) Microprogram controlled microprocessor having a plurality of internal buses and including transfer register designation system
Hollingsworth et al. The Clipper processor: Instruction set architecture and implementation
Rafiquzzaman et al. Modern Computer Architecture
Agnew et al. Microprocessor implementation of mainframe processors by means of architecture partitioning
Hansen Coprocessor architectures for VLSI

Legal Events

Date Code Title Description
MKEC Expiry (correction)
MKEX Expiry