Fundamental Architecture: Introduction, Defining a Computer Architecture, and Single Processor Systems.

Fundamental Architecture

Introduction

The design space for computer architectures is diverse and complicated. Each new architecture strives to fulfill a different set of goals and to carve out a niche in the computer world. A system can be composed of one or several processors, and many concepts apply to both multiple processor and single processor systems. Many researchers have concluded that further advances in computational speed and throughput will come from parallelism rather than from relying heavily on technological innovation, as has occurred in the past. Implementation, however, remains an important consideration for any computer system.

Defining a Computer Architecture

Some important attributes of a computer system are as follows:

• Structure of the control path(s)

• Structure of the data path(s)

• The memory organization

• The technology used in the implementation

• The number of clocks, that is, single clocked or dual clocked

• The clock speed(s)

• The degree of pipelining of the units

• The basic topology of the system

• The degree of parallelism within the control, data paths, and interconnection networks

In some cases, the algorithms used to execute an instruction are important, as in the case of data flow architectures or systolic arrays. Some of these attributes depend on the others and can change with the implementation of a system. For example, the relative clock speed may depend on the technology and the degree of pipelining. As more diverse and complicated systems are developed or proposed, classifying a computer architecture becomes more difficult. The most often cited classification scheme, Flynn's taxonomy (Flynn, 1966), defined four distinct classes of computer architectures: single instruction, single data (SISD); single instruction, multiple data (SIMD); multiple instruction, single data (MISD); and multiple instruction, multiple data (MIMD). A particular classification is distinguished by whether single or multiple paths exist in the control or data paths. The benefits of this scheme are that it is simple and easy to apply, and its prevalent use to define computer architectures attests to its usefulness. Other classification schemes have been proposed, most of which define the parallelism incorporated in newer computer architectures in greater detail.

Single Processor Systems

A single processor system, depicted in Fig. 19.1, is usually composed of a controller, a data path, a memory, and an input/output unit. The functions of these units can be combined, and there are many names for each unit. The combined controller and data path are sometimes called the central processing unit (CPU). The data path is also called an arithmetic logic unit (ALU).

Hardware parallelism can be incorporated within the processors through pipelining and the use of multiple functional units, as shown in Fig. 19.2. These methods allow instruction execution to be overlapped in time. Several instructions from the same instruction stream or thread could be executing in each pipeline stage for a pipelined system or in separate functional units for a system with multiple functional units.

[Figure 19.1: A single processor system]

Context is the information needed to execute a process within a processor and can include the contents of a program counter or instruction counter, stack pointers, instruction registers, data registers, or general purpose registers. The instruction, data, and general purpose registers are also sometimes called the register set. Context must be stored in the processor to minimize execution time. Some systems allow multiple instructions from a single stream to be active at the same time. Allowing multiple instructions per stream to be active simultaneously requires considerable hardware overhead to schedule and monitor the separate instructions from each stream.
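The notion of a saved context can be made concrete with a minimal sketch; the field names and sizes below are illustrative only, not drawn from any particular instruction set:

```python
from dataclasses import dataclass, field

# Illustrative sketch of a per-process context; field names are
# hypothetical, not from any real architecture.
@dataclass
class Context:
    program_counter: int = 0
    stack_pointer: int = 0
    # The "register set": general purpose (and data/instruction) registers.
    general_purpose: list = field(default_factory=lambda: [0] * 16)

def context_switch(saved: dict, pid_out: int, pid_in: int, cpu: Context) -> Context:
    """Save the outgoing process's context and restore the incoming one."""
    saved[pid_out] = cpu                  # store context for later resumption
    return saved.get(pid_in, Context())   # resume (or freshly create) pid_in

contexts = {}
cpu = Context(program_counter=100)
cpu = context_switch(contexts, pid_out=1, pid_in=2, cpu=cpu)
```

Keeping this state in on-chip storage rather than memory is what allows a switch between streams to be fast.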

[Figure 19.2: Hardware parallelism through pipelining and multiple functional units]

Pipelined systems came into vogue during the 1970s. In a pipelined system, the operations that an instruction performs are broken up into stages, as shown in Fig. 19.3(a). Several instructions execute simultaneously in the pipeline, each in a different stage. If the total delay through the pipeline is D and there are n stages in the pipeline, then the minimum clock period would be D/n and, optimally, a new instruction would be completed every clock. A deeper pipeline would have a higher value of n and thus a faster clock cycle.
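The clock-period arithmetic above can be sketched directly (a toy calculation, not tied to any real design):

```python
def pipeline_metrics(total_delay_ns: float, n_stages: int) -> tuple:
    """Minimum clock period and ideal throughput of an n-stage pipeline.

    With total logic delay D split evenly across n stages, the clock
    period can shrink to D/n, and, optimally, one instruction completes
    per cycle once the pipeline is full.
    """
    clock_period = total_delay_ns / n_stages   # D/n
    throughput = 1.0 / clock_period            # instructions per ns, ideally
    return clock_period, throughput

# Example: D = 10 ns of logic split across 5 stages gives a 2 ns clock;
# doubling the depth to 10 stages halves the period again.
period, rate = pipeline_metrics(10.0, 5)
```

In practice, stage imbalance and latch overhead keep the achievable period above D/n, as the next paragraph notes.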

Today most commercial computers use pipelining to increase performance. Significant research has gone into minimizing the clock cycle in a pipeline, determining the problems associated with pipelining an instruction stream, and trying to overcome these problems through techniques such as prefetching instructions and data, compiler techniques, and caching of data and/or instructions. The delay through each stage of a pipeline is also determined by the complexity of the logic in that stage, and in many cases the actual stage delay is much larger than the optimal value, D/n. Queues can be added between pipeline stages to absorb any differences in execution time through the combinational logic or propagation delay between chips (Fig. 19.4). Asynchronous techniques, including handshaking, are sometimes used between the pipeline stages to transfer data between logic or chips running on different clocks.

[Figure 19.3: Pipelined instruction execution]

It is generally accepted that, for computer hardware design, simpler is usually better. Systems that minimize the number of logic functions are easier to design, test, and debug, and they consume less power and run faster (at higher clock rates). Two important architectures utilize this concept most effectively: reduced instruction set computers (RISC) and SIMD machines. RISC architectures trade off increased code length and fetching overhead for faster clock cycles and less instruction set complexity. SIMD architectures, described in the section on multiple processors, use single instruction streams to manipulate very large data sets on thousands of simple processors working in parallel.

RISC processors made an entrance during the early 1980s and continue to dominate small processor designs as of this writing. The performance p of a computer can be measured by the relationship

p = (computations/instruction) × (instructions/cycle) × (cycles/second)

The first component (computations/instruction) measures the complexity of the instructions being executed and varies according to the structure of the processor and the types of instructions currently being executed. The inverse of the second component (instructions/cycle) is commonly quoted for single processor designs as cycles per instruction (CPI). The inverse of the final component (cycles/second) can also be expressed as the clock period of the processor. In a RISC processor, only hardware for the most common operations is provided, reducing the number of computations per instruction and eliminating complex instructions. At compile time, several basic instructions are used to execute the same operations that had been performed by the complex instructions. Thus, a RISC processor will execute more instructions than a complex instruction set computer (CISC) processor. By simplifying the hardware, the clock cycle is shortened. If the delay associated with executing more instructions (RISC design) is less than the delay associated with an increased clock cycle for all instructions executed (CISC design), the total system performance is improved. Improved compiler design techniques and large on-chip caches have continued to contribute to higher performance RISC designs (Hennessy and Jouppi, 1991).
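The tradeoff can be illustrated with a toy calculation; the instruction counts, CPIs, and clock periods below are invented for illustration, not measured from any real machine:

```python
def execution_time(instructions: int, cpi: float, clock_period_ns: float) -> float:
    """Total run time in ns = instructions × cycles/instruction × ns/cycle."""
    return instructions * cpi * clock_period_ns

# Hypothetical numbers: the RISC version executes 30% more instructions,
# but each is simpler (lower CPI) and the clock period is halved.
cisc = execution_time(instructions=1_000_000, cpi=4.0, clock_period_ns=10.0)
risc = execution_time(instructions=1_300_000, cpi=1.2, clock_period_ns=5.0)
# Here the simpler design wins despite executing more instructions.
```

The comparison only favors RISC while the extra instruction count grows more slowly than the savings in CPI and clock period, which is exactly the condition stated above.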

One reason that RISC architectures work better than traditional CISC machines is due to the use of large on-chip caches and register sets. Since locality of reference effects (described in the section on memory hierarchy) dominate most instruction and data reference behavior, the use of an on-chip cache and large register sets can reduce the number of instructions and data fetched per instruction execution. Most RISC machines use pipelining to overlap instruction execution, further reducing the clock period. Compiler techniques are used to exploit the natural parallelism inherent in sequentially executed programs.

A register window is a subset of registers used by a particular instruction. These registers are specified as inputs, outputs, or temporary registers to be used by that instruction. One instruction's output registers become the input registers in the register window of another instruction. This technique allows more efficient use of the registers and a greater degree of pipelining in some architectures.
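The overlap can be sketched as slices of one large register file; this is a simplified model loosely in the style of SPARC-like windows, and the window and overlap sizes are arbitrary:

```python
# Simplified model of overlapping register windows: each window is a
# slice of one physical register file, and the output registers of
# window k are the input registers of window k+1. Sizes here are
# arbitrary, not taken from any real architecture.
REGISTER_FILE = list(range(64))    # physical register indices
WINDOW_SIZE = 16
OVERLAP = 4                        # outputs of window k = inputs of window k+1

def window(k: int) -> list:
    start = k * (WINDOW_SIZE - OVERLAP)
    return REGISTER_FILE[start:start + WINDOW_SIZE]

w0, w1 = window(0), window(1)
shared = set(w0) & set(w1)
# The last OVERLAP registers of w0 are the first OVERLAP registers of w1,
# so values pass between windows without being copied.
```

Because the shared registers are physically the same, no data movement is needed to pass values from one window to the next.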

Scaling these RISC concepts to large parallel processing systems poses many challenges. As larger, more complex problems are mapped onto a RISC-based parallel processing system, communication and allocation of resources significantly affect the system's ability to utilize its resources efficiently. Unless special routing chips are incorporated into the system, the processors may spend an inordinate amount of time handling requests from other processors or waiting for data and/or instructions. Using a large number of simple RISC processors means that cache accesses must be monitored (snoopy caches) or restricted (directory-based caches) to maintain data consistency across the system.

The types of problems that are difficult to execute using RISC architectures are those that do an inordinate amount of branching and those that use very large data sets. For these problems, the hit rate of large instruction or data caches may be very low. The overhead needed to fetch large numbers of new instructions or data from memory is significantly higher than the clock cycle, virtually starving the processor. Compiler techniques can only be used to a limited extent when manipulating large data sets.

Further increases in speed for a single stream, pipelined processor will probably come from either increasing the pipeline depth (superpipelining) or increasing the width of the data path or control path. The latter can be achieved either by issuing more than one instruction per cycle (superscalar) or by using a very long instruction word (VLIW) architecture, in which many operations are performed in parallel by a single instruction. Some researchers have described an orthogonal relationship between superscalar and superpipelined designs (Hennessy and Jouppi, 1991). In a superpipelined design, the pipeline depth is increased from the basic pipeline, whereas in a superscalar design, the horizontal width of the pipeline is increased (see Fig. 19.5).

To achieve an overall gain in performance, significant increases in speed due to superpipelining must be accompanied by highly utilized resources. Idle resources contribute little to performance while increasing overall system costs and power consumption. As pipeline depth increases, a single instruction stream cannot keep all of the pipeline stages in a processor fully utilized. Control and data dependencies within the instruction stream limit the number of instructions that can be active for a given stream, so no-operation (NoOp) or null instructions are inserted into the pipeline, creating bubbles. Since a NoOp does not perform any useful work, processor cycles are wasted. Some strategies improve pipeline utilization through techniques such as prefetching instructions or data, branch prediction, software pipelining, trace scheduling, alias analysis, and register renaming, keeping memory access overhead to a minimum. An undesirable consequence of this higher level of parallelism is that some prefetched instructions or data might not be used, causing memory bandwidth to be spent fetching useless information. In a single processor system this may be acceptable, but in a multiple processor system such behavior can decrease overall performance as the number of memory accesses increases.

[Figure 19.5: Superscalar vs. superpipelined designs]
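The cost of bubbles can be shown with a deliberately crude simulation: a hypothetical 5-stage pipeline in which any instruction that depends on its immediate predecessor forces exactly one stall cycle (real hazards may cost more, or be hidden by forwarding):

```python
def cycles_with_stalls(deps: list, stages: int = 5) -> int:
    """Cycles to run an instruction stream on an idealized pipeline.

    deps[i] is True if instruction i depends on instruction i-1 and
    must wait one extra cycle (a bubble). A deliberately crude model:
    one wasted cycle per dependence, no forwarding.
    """
    n = len(deps)
    fill = stages - 1                    # cycles to fill the pipeline
    bubbles = sum(1 for d in deps if d)  # one NoOp cycle per dependence
    return fill + n + bubbles

# 8 instructions, two of which depend on their immediate predecessor:
total = cycles_with_stalls([False, True, False, False, True, False, False, False])
# 4 fill cycles + 8 completions + 2 bubbles = 14 cycles instead of the ideal 12.
```

Each bubble is a cycle in which a stage holds a NoOp rather than useful work, which is exactly the utilization loss described above.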

Superscalar processors use multiple instruction issue logic to keep the processor busy. Essentially, two or three instructions are issued from a single stream on every clock cycle, which has the effect of widening the control path and part of the data path in a processor. VLIW processors perform many operations in parallel using different types of functional units. Each instruction is composed of several operation fields and is very complex. The efficient use of these techniques depends on using compilers to partition the instructions in an instruction stream or on building extra hardware to perform the partitioning at run time. Again, both techniques are limited by the amount of parallelism inherent in an individual instruction stream.

One way to utilize a pipeline fully is to use instructions from independent instruction streams, or threads. (A thread is the execution of a piece of code specified by parallel constructs.) Some machines allow multiple threads per program. A thread can be viewed as a unit of work that is defined either by the programmer or by a parallel compiler. During execution, a thread may spawn, or create, other threads as required by the parallel execution of a piece of code. Multithreading can mitigate the effects of long memory latencies in uniprocessor systems: the processor executes another thread while the memory system services a cache miss for the first. Multithreading can also be extended to multiprocessor systems, allowing the concurrent use of CPUs, network, and memory.
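The latency-hiding idea can be sketched with ordinary software threads; the sleep below stands in for a cache-miss stall, and the durations and values are arbitrary:

```python
import threading
import time

# Sketch of latency hiding: while one thread of work waits on a
# (simulated) slow memory access, progress is made on another.
results = {}

def worker(name: str, stall_s: float, value: int) -> None:
    time.sleep(stall_s)        # simulated memory latency
    results[name] = value * 2  # "computation" once the data arrives

t1 = threading.Thread(target=worker, args=("a", 0.05, 10))
t2 = threading.Thread(target=worker, args=("b", 0.05, 20))
start = time.perf_counter()
t1.start(); t2.start()         # both stalls overlap in time
t1.join(); t2.join()
elapsed = time.perf_counter() - start
# elapsed is close to one stall, not two: the latencies overlapped.
```

A multithreaded processor does the same thing in hardware, switching streams in a cycle or two rather than via an operating system.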

To get the most performance from multithreaded hardware, a compatible software environment is required. Developments in new computer languages and operating systems have provided these environments (Anderson, Lazowska, and Levy, 1989). Multithreaded architectures take advantage of these advances to obtain high-performance systems.
