Introduction to Microprocessors and Microcontrollers

13. The PowerPC

Intel was producing a series of CISC microprocessors and, together with Microsoft, was in a position to dominate the market. Being increasingly squeezed out was the traditional king of computers, IBM, which, at one time, produced more computers than all other manufacturers combined. Big Blue, as IBM was called on account of its logo and the blue suits worn by its army of salesmen, laid down the standard design for the computer that now eclipses all other designs in the world.

As long ago as the mid-1970s IBM had developed a RISC microprocessor but it didn’t really make it in the marketplace. RISC did not ‘come of age’ until Acorn produced the ARM 2 and 3 microprocessors for their Archimedes microcomputer, but this, too, failed to muscle its way into the market as little attempt was made to make it compatible with Intel code. Acorn, at that time, was introducing the Archimedes as a replacement for the much-loved BBC microcomputer. By 1990, it was apparent that the terrible twins, Microsoft and Intel, would take over the world if no one fought back.

As it happens, this was about the time that AMD, a company founded back in 1969, began selling its own Intel-compatible processors and went on to become a persistent irritant to Intel. As yet, Microsoft still rules the world but there is a system called Linux that may, one day, become troublesome.

Meanwhile, an alliance was formed between IBM, Motorola and Apple Computer. To this alliance IBM brought their POWER microprocessor (Performance Optimization With Enhanced RISC). This was the successor to the earlier 801 RISC microprocessor and was chosen because it was a RISC microprocessor and already had software developed. Motorola would build the chip and Apple would bring its computer operating system, which was light years ahead of the Microsoft equivalent at that time. The new family of microprocessors was to be called the PowerPC series.

The designers took great care to make it attractive to software companies by addressing the problem of future development. They distinguished between the overall architectural features that will stay the same throughout the series and the way in which these features will actually be implemented. This allows programmers to know which parts they can rely on to be consistent and which are likely to change. For example, they designed the system for 64-bit operation even though only 32 bits were to be used in the early devices.

The PowerPC 601 (or MPC601)

The PowerPC 601 was introduced in 1994 and followed the agreed PowerPC architecture as shown in Figure 13.1. It used 2.8 million transistors, which is slightly fewer than the Pentium, but many of the Pentium’s transistors were tied up with maintaining compatibility with Intel’s earlier microprocessors.

Figure 13.1 The PowerPC 601 architecture

Many of the blocks shown are familiar after our look at the Pentium. The 601 is a 32-bit microprocessor using a 64-bit data bus and a 32-bit address bus.

Bus interface unit

This serves the usual purpose of connecting the data and address buses to the microprocessor. It also acts as a control device to determine whether the data is to be read into the microprocessor or written into the external memory.

Cache

This is a single 32 kbyte cache which is shared by data and instructions. Later versions have increased the total cache available to provide two separate 32 kbyte caches, one for data and the other for instructions.

Within the cache, the information is arranged in a series of groups or lines of 64 bytes. To provide a high-speed link between the cache, the bus interface unit and the instruction unit and queue, a 256-bit internal bus is provided.

On many occasions, the result of a particular instruction is not of great interest in itself but just provides the data to be used by a future instruction. So when an instruction is completed, the result is stored in the cache rather than being put back into the main memory. Writing the result back into the cache is called a ‘write-back’ organization as opposed to a ‘write-through’ action, in which the information is sent to the external memory. This, of course, saves a lot of time since accessing the cache is about seven times faster than accessing the main memory and a million times faster than using the hard drive.
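The difference between the two organizations can be sketched with a toy model. This is only an illustration: the `Cache` class, its method names and the address 0x2000 are invented for the example, and a real cache works on whole lines rather than single addresses.

```python
class Cache:
    """Toy cache that counts how often the external memory is written."""

    def __init__(self, write_back):
        self.write_back = write_back   # True: write-back, False: write-through
        self.lines = {}                # address -> value held in the cache
        self.dirty = set()             # addresses changed but not yet in memory
        self.memory = {}               # stands in for the external memory
        self.memory_writes = 0

    def store(self, address, value):
        self.lines[address] = value
        if self.write_back:
            # Keep the result in the cache and remember it is newer than memory.
            self.dirty.add(address)
        else:
            # Write-through: every store also goes out to the external memory.
            self._write_memory(address, value)

    def flush(self):
        # Copy every modified entry out to memory, e.g. before it is replaced.
        for address in sorted(self.dirty):
            self._write_memory(address, self.lines[address])
        self.dirty.clear()

    def _write_memory(self, address, value):
        self.memory[address] = value
        self.memory_writes += 1


wt_cache = Cache(write_back=False)
wb_cache = Cache(write_back=True)
for result in range(100):              # 100 intermediate results, same location
    wt_cache.store(0x2000, result)
    wb_cache.store(0x2000, result)
wb_cache.flush()

print(wt_cache.memory_writes, wb_cache.memory_writes)
```

Both models finish with the same value in memory, but the write-back version only had to make the slow trip to the external memory once.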

Instruction queue and instruction unit

The fast internal bus maintains a queue of up to eight instructions. Using the normal RISC ideas, all the instructions are the same length at 32 bits. Eight such instructions can fit across the 256-bit width of the internal bus.

The function of the instruction unit is to send instructions to the three pipelines: integer unit, floating-point unit and the branch prediction unit. With the right mixture of instructions, we can handle three instructions at the same time. To keep the pipelines busy, it also has the facility of running some of the instructions out of order. This is limited to instructions that are not interdependent.

The branch prediction unit

In the Pentium, the branch prediction included analysis of the history of each branch or jump instruction to help predict whether it is likely to be taken. The PowerPC uses a single-stage pipeline, which decodes and executes in a single clock cycle, and employs a very much simpler strategy that curiously seems to work just as well.

It makes no choices. If the branch is sending the program back to an earlier instruction, it always assumes that the branch will be taken. This is usually the correct choice since such loops in programs are very common. On the other hand, if the branch instruction offers the chance to jump forward, it assumes the branch will not be taken. If the predictions are correct, instructions are pre-fetched and loaded into the instruction queue and the correct data is available in the pipelines and no delay is experienced. If incorrect, the pipeline has to be flushed and reloaded losing several clock cycles.

In the case of unconditional jumps, the program just tells the microprocessor to move to another section of the program and no choice is involved. If the jump is to a distant address, the relevant instructions may not be in the cache and the cache would have to be flushed (re-loaded) (see Figure 13.2).

Figure 13.2 Branch prediction
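The rule for conditional branches described above needs nothing more than a comparison of two addresses. The following is a minimal sketch; the function names and the example addresses are made up for illustration and are not taken from the 601 documentation.

```python
def predict_taken(branch_address, target_address):
    """Static prediction: backward branches taken, forward branches not taken."""
    return target_address < branch_address


def loop_accuracy(iterations):
    """Fraction of correct predictions for a loop-closing backward branch.

    The branch is taken on every pass except the last one, when the loop exits.
    """
    outcomes = [True] * (iterations - 1) + [False]
    prediction = predict_taken(0x1040, 0x1000)   # jumps back to the loop start
    correct = sum(1 for outcome in outcomes if outcome == prediction)
    return correct / iterations


# A branch at 0x1040 jumping back to 0x1000 is predicted taken.
backward = predict_taken(0x1040, 0x1000)
# A branch at 0x1040 jumping ahead to 0x1080 is predicted not taken.
forward = predict_taken(0x1040, 0x1080)
print(backward, forward, loop_accuracy(10))
```

For a loop that runs ten times, the only wrong guess is the final pass when the loop exits, which is why such a simple rule does so well on typical programs.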

Integer unit and registers

As expected with a RISC processor, there are plenty of registers. In this section of the 601, we have 32 registers, each 32 bits wide. These registers are dual-ported. This means that two circuits can access the registers at the same time without interfering with each other. This is like someone reading the back of your newspaper as you are reading the front – except that registers don’t find it irritating. ‘Port’, by the way, is just a fancy electronic word meaning ‘connection’. Transistors, generally, have three wires going to them and so are described as three-port devices.

The integer unit handles all instructions like integer arithmetic, bit manipulation and transferring data to and from the external memory, and is organized into a three-stage pipeline. In Figure 13.3, the second clock pulse executes the first instruction. The next clock pulse executes the second instruction and the last clock pulse executes the third. We have achieved the target of one instruction per clock pulse. And in the fourth clock pulse, we can see the next instruction just arriving to be decoded immediately after the first write-back.

Figure 13.3 Integer unit pipeline
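The overlap in the integer pipeline can be tabulated with a few lines of code. This sketch assumes the three stages are decode, execute and write-back, as the description above suggests; the labels `I1`, `I2` and so on are simply invented names for successive instructions.

```python
STAGES = ["decode", "execute", "write-back"]   # assumed stage names


def pipeline_schedule(n_instructions):
    """Return {clock pulse: [what each instruction is doing]} for an ideal pipeline."""
    schedule = {}
    for instr in range(n_instructions):
        for offset, stage in enumerate(STAGES):
            # Instruction k enters the pipeline on clock k and moves on one
            # stage with every clock pulse.
            clock = instr + offset + 1
            schedule.setdefault(clock, []).append(f"I{instr + 1}:{stage}")
    return schedule


schedule = pipeline_schedule(4)
for clock in sorted(schedule):
    print(clock, schedule[clock])
```

Once the pipeline has filled, every clock pulse has one instruction in the execute stage, which is the one-instruction-per-clock target. Stalls and dependencies between instructions are ignored here.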

Floating-point unit

This has a further 32 registers but, in this case, they are 64 bits wide. To fill a register with a single clock pulse, there is an internal 64-bit bus connecting it with the cache. The pipeline has five stages: prefetch, buffer, decode, execution and write-back.

Memory buffer

This acts as a buffer for the external memory. The buffers include two reads and three writes, each up to 32 bytes. It is also used in writing-back to the cache.

Big and little endians

The main memory is divided into locations each having its own address. Each location can hold a single byte of information. If we wanted to store a 32-bit number, then we would have to utilize four consecutive locations.

Imagine that we wished to store the 32-bit number 00000000 01010101 00010001 11111111₂ and we had addresses 24646603H, 24646602H, 24646601H and 24646600H available. Little-endian format would store the most significant byte in the highest memory address so, in our example, the data 00000000 would go into address 24646603H. This is used by Intel microprocessors. Big-endian, which Motorola uses, works the other way around. The most significant byte is put in the lowest memory address so, in our example, the data 00000000 would go into address 24646600H. These are shown in Figure 13.4. All the PowerPC microprocessors are switchable to enable little or big-endian to be used.

Figure 13.4 Big and little endians
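The two layouts can be checked with Python’s standard `struct` module, which packs a number into bytes in either order. The helper `memory_map` is just an illustrative name; the value and addresses are the ones used in the example above.

```python
import struct

VALUE = 0b00000000_01010101_00010001_11111111   # the 32-bit number from the text
BASE = 0x24646600                               # lowest of the four addresses


def memory_map(data):
    """Pair each byte with the address it would occupy, starting at BASE."""
    return {BASE + offset: byte for offset, byte in enumerate(data)}


little = memory_map(struct.pack("<I", VALUE))   # "<" selects little-endian
big = memory_map(struct.pack(">I", VALUE))      # ">" selects big-endian

for address in sorted(little, reverse=True):
    print(f"{address:08X}H  little: {little[address]:08b}  big: {big[address]:08b}")
```

In the little-endian column the byte 00000000 sits at 24646603H, and in the big-endian column it sits at 24646600H, matching the description.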

PowerPC 970

A large number of PowerPCs have continued to power the Apple-Mac and IBM desktops and, in addition, support both the UNIX and Linux operating systems.

The latest offering is the 970 which, with its 52 million transistors, started life as a 1.8 GHz device and has now progressed to 2.0 GHz. This may appear slow but it has compensating attributes such as its 900 MHz bus as opposed to the 533 MHz bus of the Pentium 4.

It is a 64-bit micro so it handles data in 64-bit chunks but remains compatible with earlier 32-bit designs. It has two level 1 caches, one for instructions at 64 kB and a data cache of 32 kB, which are somewhat larger than those of the Intel product, but both companies use a level 2 cache of 512 kB.

As memory size is continuing to increase with each design, the size of memory that can be directly accessed increases with the move to 64-bit processing. The Pentium 4 can access 40 GB of memory, which seems excessively large at the moment but there was a time when 4 MB was something to wonder at. The PowerPC 970 can handle memory of Star Trek proportions measured in terabytes (thousands of Gigs).
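The link between the number of address bits and the amount of memory that can be reached is simple arithmetic: each extra bit doubles the number of locations. The sketch below shows the sums; the 42-bit figure is only an illustrative assumption for a design that reaches into terabytes, not a number taken from the text.

```python
GIB = 2 ** 30   # bytes in a gigabyte (binary)
TIB = 2 ** 40   # bytes in a terabyte (binary)


def addressable_bytes(address_bits):
    """Number of byte locations that can be selected with this many address bits."""
    return 2 ** address_bits


print(addressable_bytes(32) // GIB, "GB with 32 address bits")
print(addressable_bytes(42) // TIB, "TB with 42 address bits (assumed width)")
print(addressable_bytes(64) // TIB, "TB with a full 64 address bits")
```

Even a modest step beyond 32 address bits moves the limit from gigabytes into terabytes, which is why the move to 64-bit processing removes the ceiling for the foreseeable future.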

Table 13.1 Cache sizes

Processor	L1 Instruction	L1 Data	L2 cache
PowerPC 970	64 kB	32 kB	512 kB
Pentium 4	It’s a secret	8 kB	512 kB

For maximum microprocessor speed we need a high clock speed combined with the maximum use being made of every part of the microprocessor. The early 8-bit microprocessors would accept the first instruction and it would pass through the microprocessor being decoded, then acted upon, then having the results stored before the next instruction was considered. This meant that each part of the micro was doing nothing for much of the time.

Modern micros load many instructions at the same time and split up the tasks so that as many as possible can be carried out at the same time to have the minimum time wastage.

As with the Pentium 4, the PPC970 makes use of level 1 caches that, as is now common, are split into an Instruction cache and a Data cache. There is also a level 2 cache and an external level 3 cache.

Loading the instructions

The instructions pour down from the Instruction cache at a maximum rate of eight per cycle, though five is a more likely overall figure. But this is still fast.

The PPC970 uses a very long pipeline and can be handling up to 200 instructions simultaneously. The price of such a long pipeline is that we must be careful to ensure that it is filled with the most useful instructions and hence we need to back it up with very effective branch prediction techniques.

Branch prediction

To obtain the maximum possible speed, the PPC970 devotes a great deal of its resources to branch prediction. As the instructions are loaded, the branch prediction circuitry scans the incoming instructions looking for branch instructions. Every time we meet a branch instruction that offers a choice of outcome, the branch will have to be accepted or rejected.

The 970 has two branch prediction methods. The first is very similar to that used in the Pentium 4 and, to oversimplify the situation, it follows the same sort of reasoning as we often adopt in everyday life. If it usually happens, it is most likely to happen again. The 970 keeps a record of the previous 16384 branches in its BHT (Branch History Table) to see how often each choice was made and then this information is further sorted by a prediction program before it comes to a final decision.

The second method involves a similar sized table called a Global Predictor. This method also comes up with a final go/no go for the branch but it decides by generating an 11-bit vector that stores the actual execution path taken by the previous eleven fetch groups leading up to the branch.

So there are two independent mechanisms that make a decision as to whether the branch should be taken. If they disagree, we need a referee. This job is performed by a ‘Selector Table’ that stores the success rate for each of the two previous methods for each particular branch. It then makes the final decision – and it is said (by IBM) to be very successful, which it probably is.
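The way two predictors and a referee fit together can be sketched in software. This is a simplified model, not the 970’s actual circuitry: the table size and the 11-bit history come from the description above, but the one-bit table entries, the way the tables are indexed and the small selector counter are all assumptions made to keep the example short.

```python
TABLE_SIZE = 16384     # entries per table, as quoted above
HISTORY_BITS = 11      # length of the global history vector


class TournamentPredictor:
    def __init__(self):
        self.local = [False] * TABLE_SIZE          # last outcome of each branch
        self.global_table = [False] * TABLE_SIZE   # outcome seen after each path
        self.selector = [0] * TABLE_SIZE           # >0 trusts global, else local
        self.history = 0                           # recent outcomes, 1 = taken

    def _indices(self, address):
        # Assumed indexing: address alone for the local table, address mixed
        # with the recent path for the global table.
        local_index = address % TABLE_SIZE
        global_index = (address ^ self.history) % TABLE_SIZE
        return local_index, global_index

    def predict(self, address):
        li, gi = self._indices(address)
        if self.selector[li] > 0:
            return self.global_table[gi]
        return self.local[li]

    def update(self, address, taken):
        li, gi = self._indices(address)
        local_ok = self.local[li] == taken
        global_ok = self.global_table[gi] == taken
        # The referee only changes its mind when the two methods disagree.
        if global_ok and not local_ok:
            self.selector[li] = min(self.selector[li] + 1, 3)
        elif local_ok and not global_ok:
            self.selector[li] = max(self.selector[li] - 1, -3)
        self.local[li] = taken
        self.global_table[gi] = taken
        mask = (1 << HISTORY_BITS) - 1
        self.history = ((self.history << 1) | int(taken)) & mask


# A branch that alternates taken / not taken defeats the simple
# "same as last time" method but is easy for the path-based method.
predictor = TournamentPredictor()
results = []
for i in range(200):
    taken = (i % 2 == 0)
    results.append(predictor.predict(0x1000) == taken)
    predictor.update(0x1000, taken)

late_hits = sum(results[100:])
print(late_hits, "correct out of the last 100")
```

In this run the history-based table quickly learns the alternating pattern, the selector learns to trust it for this branch, and the later predictions are all correct, whereas the last-outcome table on its own would be wrong every time.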

Handling the instructions

Having combined the incoming instruction stream from the Instruction cache with the information from the branch prediction, the instructions are queued and passed to the Decode, Crack and Group Formation Unit. At this stage, in order to keep the instruction handling speed at a maximum, this unit takes the instruction codes from the Instruction cache, decodes them and cracks them into their component parts called Internal Operations (IOPs). These very small and simple tasks are passed out to specialized units like the five blocks shown along the bottom of Figure 13.5.

Figure 13.5 The PowerPC 970

The IOPs are executed in whatever order results in the fastest throughput. To reduce the complexity of keeping track of the execution of each and every one, they are organized in groups of five and the groups, rather than the individual IOPs, are tracked.

In the final row shown, there are the arithmetically based blocks that handle the vector, floating-point and integer calculations, the load-store block that handles the transfer of data to the memory via the second-level cache and, finally, the feedback path for the branch prediction information.

The PC market place

The PowerPC may not be in our PC but it may well be in our car. The Ford Motor Company has elected to use the PowerPC as first choice for their engine management computer into the next century.

Quiz time 13

In each case, choose the best option.

1 The maximum number of instructions that the PowerPC 970 can be dealing with simultaneously is:

(a) 200.

(b) 3.

(d) 128.

2 Write-back:

(a) reverses the order of the bits of data.

(b) is used to double-check the accuracy of data before use.

(d) stores results in the cache rather than in the external memory.

3 The PowerPC 970 has an internal bus running at a frequency of:

(a) 64 bits/s although it can run at 32 bits/s.

(b) 512 kB/s.

(d) 533 MHz.

4 A register that can be accessed by two circuits at the same time is referred to as:

(a) a second-level cache.

(b) dual-ported.

(d) a three-ported device.

5 Big endian format:

(a) stores the low byte in the highest address.

(b) stores the high byte in the highest address.

(d) is used in a cache but never in the main memory.