The Pentium is a 32-bit microprocessor just like the previous Intel 80386 and 80486 but has been considerably enhanced to improve its speed of operation. Even the 132 pins of the 80386 have increased to 296 on the Pentium.
Other full RISC chips were being well received at the time the CISC Pentium was launched in 1993 and Intel took these new designs into account, but it was boxed into a corner by its own success. It had to maintain absolute compatibility with the previous 8086, 80286, 80386 and 80486 together with their numerical co-processors. The compromise was to use as many of the RISC ideas as possible while maintaining the CISC codes. It has over 400 instruction codes. Some are performed by hardware and some by microcode. Its two million plus transistors have been incorporated into a superscalar structure. This means that it has duplicated arithmetic and logic units that allow it to carry out two instructions at the same time under favourable conditions.
It was launched at 66 MHz and in its first year became famous as the microprocessor that couldn’t count. There was a flurry of letters in the computer magazines and a host of ‘How many Pentiums does it take to change a light bulb?’ type jokes. At first, Intel denied there was a problem even though they must have known about it. ‘And, no, you can’t have your money back.’ More letters. ‘Alright, there is a very, very small matter of a few division sums.’ The error actually produced inaccuracies in the sixth or ninth decimal place in some particular division sums. This was insufficient error to affect more than a small minority of users but it started to undermine confidence in the Pentium. The real problem was that two errors occurred during its design at the same time. Either one, on its own, would have been spotted but the two mistakes served to hide each other. Anyway, it’s been fixed. It only affected the early versions and is no longer significant.
Over time the speed has increased to 200 MHz with the inevitable rumours of the Pentium II running at 400 MHz that will support a 100 MHz system clock.
See Figure 12.1.
Figure 12.1 The Pentium processor
Data and code caches
Connections to the outside world are via a 64-bit external data bus and a 32-bit address bus. The incoming data, which consists of numerical data and instruction codes, is loaded very quickly into two internal caches: an 8 kbyte data cache and an 8 kbyte code cache. These caches shift data very rapidly on internal pathways that are 128 and 256 bits wide.
Whenever possible, the Pentium uses burst mode to read and write data. Burst mode loads a cache, for example, with more data than the data bus can carry in a single transfer. If a cache line is 128 bits wide and it is fed from a 64-bit data bus, then we could completely fill the line by transferring 64 bits and then another 64 bits. Burst mode loads all 128 bits very rapidly without further intervention from the microprocessor. Putting more new data into the cache will increase the chances of the cache holding the required information. This is called a cache ‘hit’.
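The arithmetic behind the burst example can be written out as a very small sketch. The function name `burst_transfers` is simply a label chosen here for illustration; it is not a term used by Intel.

```python
def burst_transfers(line_bits: int, bus_bits: int) -> int:
    """Number of back-to-back bus transfers needed to fill one cache line."""
    # Ceiling division: a line that is not an exact multiple of the bus
    # width still needs one final, partly used transfer.
    return -(-line_bits // bus_bits)


# The example from the text: a 128-bit line fed from a 64-bit bus.
print(burst_transfers(128, 64))  # 2
```

A wider line on the same bus simply means a longer burst, for instance four transfers for a 256-bit line.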
Prefetch buffer
The prefetch buffer is a small internal memory that holds a list of instructions that are waiting to be executed. This ensures that the instruction decoder is never waiting for a new instruction from the external (slow) memory and it makes more efficient use of the external data bus since the new instructions can be loaded whenever the opportunity arises. When it gets a moment, the Pentium shifts an instruction from the external program into the cache and transfers one instruction from the cache into the prefetch buffer and also sends a signal to the microcode circuit to prepare the code for the next instruction. So, with all the housekeeping done, the instruction decoder can be fed with instructions and data at its maximum rate. The prefetch buffer is actually two independent 32-bit buffers, each providing input to one of the ALUs.
The instruction decoder
The instruction decoder performs much the same function as in other microprocessors. It has two outputs that are fed to the two ALUs called ‘u’ and ‘v’.
Arithmetic and logic units
These units are under the control of the aptly named control unit. The blocks, shown in the diagram as ALU ‘u’ and ALU ‘v’, are actually five-stage pipelines that can operate in parallel to execute two instructions in a single clock cycle. All commands other than floating point arithmetic can be executed in the ‘u’ pipeline and a more limited range can be carried out in the ‘v’ pipeline. The five-stage pipeline can speed the throughput to one instruction per clock cycle. Under the right conditions, both pipelines can be used simultaneously to handle two instructions in a single clock cycle. Sometimes this is not possible. Perhaps both instructions need access to the same piece of hardware, or perhaps the result of an instruction is needed before the next instruction can be started. As a rather simplistic example, if we wished to add two numbers then divide the result by 10, we cannot start dividing anything until the first answer is available. One minor drawback is that instructions cannot overtake each other even if the second one could have been finished very rapidly and they are not dependent on each other.
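The dependency problem can be sketched in a few lines. This is a deliberately simplified model and not the real Pentium pairing rules, which involve many more conditions; the `Instr` record and the function `can_issue_together` are hypothetical names invented for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str
    dest: str        # register that receives the result
    sources: tuple   # registers that are read


def can_issue_together(first: Instr, second: Instr) -> bool:
    """Simplified check: may 'second' run alongside 'first'?"""
    # The second instruction needs a result the first has not produced yet.
    if first.dest in second.sources:
        return False
    # Both instructions want to write to the same register.
    if first.dest == second.dest:
        return False
    return True


add = Instr("add", dest="a", sources=("a", "b"))    # a = a + b
div = Instr("div", dest="c", sources=("a", "ten"))  # c = a / 10, needs the new a
inc = Instr("inc", dest="d", sources=("d",))        # d = d + 1, unrelated

print(can_issue_together(add, div))  # False
print(can_issue_together(add, inc))  # True
```

The add-then-divide pair from the text fails the check because the divide reads the register that the add is still writing.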
Floating-point unit (FPU)
For floating-point arithmetic, the FPU has an eight-stage pipeline that is further enhanced by using a hardware multiplier and divider. This is a significant advance over the 80486, whose FPU was not pipelined. Between them, the pipeline and the hardware make the FPU run about ten times faster than that of the 80486 at equivalent clock speeds. You may remember from earlier discussions that one of the benefits of the RISC designs was the use of hardware for the execution of arithmetic operations.
The ‘u’ pipeline has some overlap with the floating-point pipeline so there are restrictions on the occasions when two instructions can be executed at the same time.
There are eight FPU registers 80-bits wide, arranged as a stack. Bits 0 to 63 hold a 64-bit mantissa. Bits 64 to 78 hold a 15-bit exponent and the last bit holds a sign bit.
Notice how the layout of the floating-point number differs from the example that we saw in Chapter 4.
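The 80-bit register layout described above can be picked apart with a few shifts and masks. This is a minimal sketch: the function names are chosen here for illustration, and two details not stated in the text are assumed, namely that the exponent is stored with a bias of 16383 and that bit 63 of the mantissa is an explicit integer bit. Only ordinary (normal) numbers are handled.

```python
def split_extended(raw: int):
    """Split an 80-bit register image into (sign, exponent, mantissa)."""
    mantissa = raw & ((1 << 64) - 1)   # bits 0 to 63
    exponent = (raw >> 64) & 0x7FFF    # bits 64 to 78 (15 bits)
    sign = (raw >> 79) & 1             # bit 79
    return sign, exponent, mantissa


def extended_to_float(raw: int) -> float:
    """Convert a normal 80-bit value to a Python float (may lose precision)."""
    sign, exponent, mantissa = split_extended(raw)
    # Assumed: exponent bias 16383, and mantissa bit 63 worth exactly 1.
    value = (mantissa / (1 << 63)) * 2.0 ** (exponent - 16383)
    return -value if sign else value


one = (0x3FFF << 64) | (1 << 63)                                 # +1.0
minus_two_and_a_half = (1 << 79) | (0x4000 << 64) | (0xA << 60)  # -2.5

print(split_extended(one))
print(extended_to_float(one))                   # 1.0
print(extended_to_float(minus_two_and_a_half))  # -2.5
```

Values too large for a Python float would raise an error in this sketch, since the 15-bit exponent covers a much wider range than the 64-bit format Python uses.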
Branch prediction
When the program reaches a ‘branch’ or ‘jump’ instruction, the microprocessor is sent to another part of the program. These instructions are usually ‘conditional’ as in ‘jump to address xxxx if the value in the accumulator is not zero’. When this jump happens, the next few instructions that are loaded into the pipelines are all incorrect and the pipeline has to be emptied and restocked with the new information. This is called ‘flushing’ the pipeline and causes an irritating delay of four or five clock cycles.
The branch prediction logic holds about 256 entries in a cache to aid the Pentium in guessing the next instruction. If we can guess what is coming next before it happens, then the data and instructions can be loaded ready to go.
But how do we guess? There are two likely outcomes: either the branch will be taken and we jump to another part of the program, or we don’t take the branch and we continue with the next instruction. The branch prediction logic argues that what the microprocessor did last time, it will probably do again. This is true more often than not. The reasoning behind this is that when a loop occurs, the program is sent back to repeat a section several, or many, times. It can only NOT take the branch once, so on average it will take a branch more often than it doesn’t.
In the cache are stored the instructions immediately before the branch or jump together with the target address assuming the branch is taken. It also stores statistical information of how often the branch was taken in the past. This information is used to predict the likely outcome of the current situation and is correct for about 85% of the time. When the branch has occurred, the history information is updated to make the next guess even better.
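The ‘do what it did last time’ rule can be tried out with a toy model. This is only a sketch of the idea and not the actual Pentium circuit: the class name `LastOutcomePredictor`, the way the table is indexed by the branch address and the starting guess of ‘not taken’ are all assumptions made for the illustration.

```python
class LastOutcomePredictor:
    """Toy predictor: guess that a branch repeats its previous outcome."""

    def __init__(self, entries: int = 256):
        self.entries = entries
        # One remembered outcome per entry; False means 'not taken'.
        self.table = [False] * entries

    def _index(self, branch_address: int) -> int:
        # Assumed indexing: the low part of the branch address picks the entry.
        return branch_address % self.entries

    def predict(self, branch_address: int) -> bool:
        return self.table[self._index(branch_address)]

    def update(self, branch_address: int, taken: bool) -> None:
        self.table[self._index(branch_address)] = taken


def accuracy(outcomes, branch_address: int = 0x1000) -> float:
    """Fraction of correct guesses for one branch over a list of outcomes."""
    predictor = LastOutcomePredictor()
    correct = 0
    for taken in outcomes:
        if predictor.predict(branch_address) == taken:
            correct += 1
        predictor.update(branch_address, taken)
    return correct / len(outcomes)


# A ten-pass loop run three times: the branch back is taken nine times,
# then falls through once on the final pass.
loop = ([True] * 9 + [False]) * 3
print(accuracy(loop))  # 0.8
```

Each run of the loop costs two wrong guesses, one on the way in and one on the way out, so longer loops push the success rate higher.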
General purpose registers
The Pentium has seven general-purpose registers, all 32 bits wide. One of them is used as an accumulator and, to maintain compatibility with the earlier processors, it can be addressed as a single 32-bit register, as a 16-bit register (its lower half) or as two 8-bit registers that make up that lower half. There are three other general-purpose registers that can be similarly split and three that only offer the choice of 32- and 16-bit use.
Interrupts
The handling of interrupts has not changed beyond all recognition since we were looking at the Z80.
There are two hardware interrupts available. The NMI or nonmaskable interrupt is activated by the pin voltage going to a logic 1 or high-level. Immediately on the completion of the current instruction, the Pentium puts the content of the flag register and the current address onto the stack. It then goes to the flag register and resets the interrupt flag to prevent any further interrupts. It then services the interrupt. The NMI normally occurs as a result of hardware failures to quickly limit the damage caused.
The IRQ or interrupt request is also activated by the appropriate pin going to a logic 1 or high-level but in this case remember that it is only a request and can be blocked by resetting the interrupt flag in the flag register. If more than one interrupt is received they are checked for priority and the highest one wins. IRQs are generally initiated by peripheral equipment such as a printer.
Exceptions
These interrupts are issued by the microprocessor itself and occur when the microprocessor has found itself in a difficulty that it cannot resolve.
When an exception occurs, an on-screen message often appears announcing that an exception has occurred and the Pentium attempts the instruction again. Asking the Pentium for an impossible answer causes some exceptions. This could be ‘division by zero’. Dividing any number by zero is not possible and the Pentium cannot respond.
Another one, which often strikes terror into the heart of the user, is ‘General Protection Error’. The software has sent the Pentium off to an address that doesn’t exist and obviously, therefore, no instructions are available.
MMX (MultiMedia eXtensions) is an addition to the standard Pentium designed to increase the speed of multimedia, communications and other applications where large numbers of repetitive calculations are required.
It started by analysing a wide range of typical applications: graphics, video, games, speech recognition etc. Intel was looking for time-consuming common characteristics. Many were found in which a fairly simple instruction like changing the colour of a pixel is applied to a large number of pixels. This gave rise to the idea called SIMD (Single Instruction Multiple Data). Using SIMD, we can perform the same operation on multiple items of data, and this is executed in parallel. MMX allows eight pixels to be moved around and processed together. SIMD is the heart of MMX.
MMX technology maintains full compatibility with previous instructions and has added a further 57 instructions. No danger of the RISC approach here!
MMX instructions take over control of the eight floating-point registers and it has a further eight registers for holding addresses, loop control, data manipulation instructions etc. The floating-point registers are highly flexible in that the 64-bit mantissa section can be used for eight separate bytes, four 16-bit words, two 32-bit ‘doublewords’ or a single 64-bit ‘quadword’.
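The four ways of viewing the same 64 bits can be demonstrated with Python’s standard `struct` module. This is only an illustration of the packing idea and has nothing to do with how the registers are accessed in real MMX code; the byte values are arbitrary and little-endian byte order is assumed, as used by Intel processors.

```python
import struct

# Eight arbitrary bytes standing in for the 64-bit section of one register.
raw = bytes([1, 2, 3, 4, 5, 6, 7, 8])

as_bytes = struct.unpack("<8B", raw)     # eight separate bytes
as_words = struct.unpack("<4H", raw)     # four 16-bit words
as_dwords = struct.unpack("<2I", raw)    # two 32-bit doublewords
(as_qword,) = struct.unpack("<Q", raw)   # one 64-bit quadword

print(as_bytes)
print([hex(w) for w in as_words])   # ['0x201', '0x403', '0x605', '0x807']
print([hex(d) for d in as_dwords])  # ['0x4030201', '0x8070605']
print(hex(as_qword))                # 0x807060504030201
```

The bits never change; only the boundaries drawn between the data items move.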
Saturation arithmetic
In normal fixed-point arithmetic adding two numbers can cause an overflow to occur and the msb can be lost. To take a simple example, adding the number 1 to the 4-bit number 1111 would give the result 10000. Since only four bits are kept, this would leave a result of zero and an overflow would have occurred as seen in Chapter 4. To check for the overflow, the microprocessor would have to take time out to check the status register to see if the overflow flag has been set. This is time consuming. When applied to graphics, perhaps shading, the sudden return to zero may cause a sudden and unwanted change in colour.
Saturation arithmetic ensures that any increase that would cause a wrap-around effect of returning the value to zero is prevented (see Figure 12.2). If we counted up from 0000, the Pentium would allow the count to proceed normally until it reached the maximum value of 1111 and it would then be held at that value. The colour in our example would reach black but would be prevented from accidentally returning to white.
Figure 12.2 Saturation arithmetic prevents wraparound
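The difference between the two kinds of addition is easy to see in a short sketch. The function names are chosen here for illustration and only unsigned numbers are handled; the real MMX instructions also offer signed versions.

```python
def wrap_add(a: int, b: int, bits: int = 4) -> int:
    """Ordinary addition: any carry out of the top bit is simply lost."""
    return (a + b) & ((1 << bits) - 1)


def saturating_add(a: int, b: int, bits: int = 4) -> int:
    """Saturating addition: the result sticks at the maximum value."""
    return min(a + b, (1 << bits) - 1)


def packed_saturating_add(xs, ys, bits: int = 8):
    """The same saturating add applied to every item in a group, SIMD style."""
    return [saturating_add(x, y, bits) for x, y in zip(xs, ys)]


print(wrap_add(0b1111, 1))        # 0  (wraps round to zero)
print(saturating_add(0b1111, 1))  # 15 (held at 1111)

# Brightening four 8-bit pixel values by 10 in one go.
print(packed_saturating_add([250, 10, 255, 0], [10, 10, 10, 10]))  # [255, 20, 255, 10]
```

Notice that no flag has to be checked afterwards, which is where the time saving comes from.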
As we have mentioned before, one of the limits on operational speed is the size of the internal components and, until recently, the smallest detail was limited to 0.18 µm. As the competition with AMD continued, it was time for the next step: AMD started using 0.13 µm technology and, as expected, the Pentium 4 also upgraded to the same technology for the faster versions of 1.8 GHz and above. The operating voltage has also been reduced from 1.75 down to 1.5 volts allowing closer spacing and a further increase in speed (and a 25% reduction in cost). The new design has allowed the Pentium 4 to increase its transistor headcount from 42 million to 55 million, increasing the number of connecting pins to 478. Intel has moved a long way from the 16 pins of their 4-bit offering in 1972.
Thermal safety
The power dissipation increases as any integrated circuit works faster and the Pentium 4 is no exception. Bearing in mind that the actual processor circuit is just 10 mm×10 mm (about 0.4 inches square) and consumes 55 watts, we must be very careful to ensure that it doesn’t overheat. This is achieved by using a large heat sink and a cooling fan. The new Pentium has a thermal safety circuit. If the microprocessor starts to overheat, the cooling fan will increase its revs and the operating speed of the microprocessor will decrease. If things get serious and it reaches a dangerous level of 69°C (156°F) the thermal circuit will call it a day and shut down the computer to prevent the microprocessor from being destroyed.
The system bus
The system bus, also called the FSB or Front Side Bus, is 64 bits wide and ‘Quad Pumped’, which is a fancy way of saying that each clock pulse, presently running at 133 MHz, will shift four lots of data along the bus. Now, rounding off the figures a bit, 133 MHz×4=533 MHz so the bus looks like a single 533 MHz bus. Incoming and outgoing information is stored in the 256 kB level 2 cache, which is fed by 256-bit wide pathways. Intel calls it the ‘Advanced Transfer Cache’; it is not quad pumped but, being wider, still matches the speed of the system bus.
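The quad-pumped figures can be turned into a data rate with one line of arithmetic. This is a rough sketch using the rounded 133 MHz clock quoted in the text, which is why the answer comes out slightly under the advertised ‘533 MHz’ figure; the function name is just a label chosen for the illustration.

```python
def bus_bytes_per_second(clock_hz: int, transfers_per_clock: int, width_bits: int) -> int:
    """Peak data rate of a bus in bytes per second."""
    return clock_hz * transfers_per_clock * width_bits // 8


# 64-bit front side bus, 133 MHz clock, four transfers per clock pulse.
rate = bus_bytes_per_second(133_000_000, 4, 64)
print(rate)        # 4256000000
print(rate / 1e9)  # roughly 4.3 thousand million bytes per second
```

This is a peak figure; the bus only reaches it while data is actually being shifted on every clock pulse.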
Instruction Decoder, Level 1 Execution Trace Cache and Branch Predictor
The data that is selected by the predictor is loaded into a buffer and then passed onto the Instruction Decoder.
At this stage, the incoming instructions are analysed and converted into an internal code sequence which can be accessed from the Micro code as we saw when we looked at the Z80180 microprocessor. Once the instructions have been decoded, up to about 12 000 instructions called ‘Micro-Operation/Operand’ or µOP are stored in order of use, all ready to go. The correct order is much assisted by the Branch prediction – known by Intel as the Branch Target Buffer (BTB). This stores previous experience to guess what is likely to happen next.
Hyper pipeline
As we saw in Chapter 11, the pipeline is the organization of the microprocessor and not a separate device within the design, so we don’t get a ‘pipeline’ block shown in Figure 12.3. The predictor designs are now very much improved, having had the experience gained with earlier versions of the Pentium. The better the prediction, the longer and faster we can risk making the pipeline. So pleased were Intel with their predictions that they called the new longer pipeline a ‘hyper pipeline’. For maximum speed we would like a long pipeline so that many simple steps can be carried out at greater speed but the overall outcome depends on the predictor circuits making the right guess. A wrong guess means that the pipeline is loaded with incorrect data and has to be emptied, or ‘flushed’, and then refilled, which takes valuable time. The Pentium 4 now has a pipeline of 20 stages, which allows 126 instructions to be in progress at one time, including up to 48 load and 24 store instructions.
Figure 12.3 The Pentium 4 processor
Micro-OP and Memory usage
The µOps that pour out of the Execution Trace Cache are arranged in order and they will be a mixture of information to be stored in memory locations and arithmetic operations. The arithmetic operations are divided into floating-point operations and integer operations. The floating-point register deals with moving and storing while the ALU (Arithmetic and Logic Unit) deals with the more complex operations such as multiplication of 128-bit numbers and MMX (multimedia instructions) as we met a little earlier. The SIMD (Single Instruction Multiple Data) instructions that were applied to the earlier Pentiums have been extended by an extra 144 instructions. This facility is now called SSE2 (Streaming SIMD Extensions 2 instructions). The general idea is that if we have to perform an action on many items of data, it is simpler and faster to collect them all together and perform the function on all of them at the same time.
Rapid Execution Engine
For the integer instructions there are two ALUs clocked at twice the core processor speed which is a four-fold improvement over the basic function and provides a data transfer rate of 48 GB/s.
A level 1 data cache handles the data outputs from the ALUs and the AGUs (Address Generation Units).
The new Pentium design with speeds over 1.8 GHz and 0.13 µm technology is given the codename ‘Northwood’ and replaces the previous ‘Willamette’. The Willamette had reached the end of its development whereas the Northwood is just starting. Since it is already running at 2.8 GHz, the magic 3 GHz chip is imminent, and we can probably look forward to the even more magic 4 GHz before the Northwood design is obsolete.
The Celeron
The bold type in computer adverts always shouts about ‘price and speed’ and many people fall into the trap of assuming that a 2.8 GHz microprocessor is obviously faster than a 2.5 GHz microprocessor. This is a false assumption but still well established so for this section of the market there is a demand for a very cheap microprocessor with a high clock speed.
The solution is to use the Pentium design and cheapen it by taking out some of the non-essential areas. There have been twelve such versions to track the Pentium releases during its development. In the 2 GHz Celeron, the price reduction has been achieved at the expense of reducing the L2 cache from 512 kB down to 128 kB and the FSB down from 533 MHz to 400 MHz.
Quiz time 12
In each case, choose the best option.
1 SIMD is:
(a) used in standard Pentiums but not in the MMX versions.
(b) a way of preventing wraparound.
(c) single in-line multimedia data.
(d) single instruction multiple data.
2 Branch prediction logic:
(a) is another name for the prefetch register.
(b) is only used in MMX versions.
(c) saves memory in 85% of occasions.
(d) attempts to guess the future steps to be taken by a program.
3 An exception:
(a) will be ignored if the I flag is set to a high level.
(b) is an unusual branching of the program.
(c) is an interrupt signal generated by the microprocessor.
(d) occurs whenever the Pentium is surprised by an arithmetic result.
4 The initials SIMD stand for:
(a) SIM card type D.
(b) Single Instruction Multiple Data.
(c) Superscalar Instruction Mode for Data.
(d) Streaming Instructions Modular Data.
5 In its construction, the Pentium 4 uses:
(a) 0.13 µm technology.
(b) 1.8 µm technology.
(c) 1.3 µm technology.
(d) 0.18 µm technology.