Very Good Topic On Process Variation
Kevin J. Nowka
January 1996
Partially supported by an ARPA Fellowship in High Performance Computing administered by the Institute for Advanced Computer Studies, University of Maryland. Additional support for this work from NSF Contract No. MIP88-22961 using facilities provided by NASA under contract NAG2-842.
Abstract
Wave pipelining, or maximum rate pipelining, is a circuit design technique that allows digital synchronous systems to be clocked at rates higher than can be achieved with conventional pipelining techniques. It relies on the predictable finite signal propagation delay through combinational logic for virtual data storage. Wave pipelining of combinational circuits has been shown to achieve clock rates 2 to 7 times those possible for the same circuits with conventional pipelining. Conventional pipelined systems allow data to propagate from a register through the combinational network to another register prior to initiating the subsequent data transfer. Thus, the maximum operating frequency is determined by the maximum propagation delay through the longest pipeline stage. Wave pipelined systems apply the subsequent data to the network as soon as it can be guaranteed that it will not interfere with the current data wave. The maximum operating frequency of a wave pipeline is therefore determined by the difference between the maximum propagation delay and the minimum propagation delay through the combinational logic. By minimizing variations in delay, the performance of wave pipelining is maximized. Data wave interference in CMOS VLSI circuits is the result of the variation in the propagation delay due to path length differences, differences in the state of the network inputs and intermediate nodes, and differences in fabrication and environmental conditions. To maximize the performance of wave pipelined circuits, the path length variations through the combinational logic must be minimized. A method of modifying the transistor geometries of individual static CMOS gates so as to tune their delays has been developed. This method is used by CAD tools that minimize the path length variation. These tools are used to equalize delays within a wave pipelined logic block and to synchronize separate wave pipelined units which share a common reference clock.
This method has been demonstrated to limit the variation in delay of CMOS circuits to less than 20%. Delay models have demonstrated that temperature variation, supply power variations, and noise limit the number of concurrent waves in CMOS wave pipelined systems to three or less. Run-to-run process variation can have a significant impact on CMOS VLSI signal propagation delay. The ratio of maximum to minimum delay along the same path for seven different runs of a 0.8-micron feature size fabrication process was found to be 1.35. Unless this variation is controlled, the speedup of wave pipelining is limited to two to three to ensure that devices from any of these runs will operate. When aggregated with variations due to environmental factors, the maximum speedup of a wave pipeline is less than two. To counteract the effects of process variation, an adaptive supply voltage technique has been developed. An on-chip detector circuit determines when delays are faster than the nominal delays, and the power supply is lowered accordingly. In this manner, ICs fabricated with fast processes are run at a lower supply voltage to ensure correct operation at the design target frequency. To demonstrate that wave pipeline technology can be applied to VLSI system design, a CMOS wave pipelined vector unit has been developed. Wave pipelining was used extensively to achieve high clock rates in the functional units. The VLSI processor consists of a wave pipelined vector register file, a wave pipelined adder, a wave pipelined multiplier, load and store units, an instruction buffer, a scoreboard, and control logic. The VLSI vector unit contains approximately 47000 transistors and occupies an area of 43 sq mm. It has been fabricated in a 0.8 micron CMOS technology. Tests indicate wave pipelined operation at a maximum rate of 303 MHz. An equivalent vector unit design using traditional latch-based pipelining was designed and simulated. The latch-based design occupied 2% more die area, operated with a 35% longer clock period, and had multiply latency 8% longer and add latency 11% longer than the wave pipelined vector unit.
Contents
1 Introduction
1.1 Background
1.1.1 Pipelining and Wave Pipelining
1.1.2 Prior Wave Pipeline Research
1.2 CMOS VLSI System Wave Pipelining
1.3 Related Research
1.4 Contributions
1.5 Scope
3.1.3 Manipulating CMOS Delay
3.1.4 Linear Program Representation
3.1.5 Design Process and Simulated Results
3.1.6 CMOS Fine Balancing Limitations
3.2 Wave Pipeline Synchronization
3.2.1 Monoharmonic Wave Pipelines Lacking Feedback
3.2.2 Polyharmonic Wave Pipelines Lacking Feedback
3.2.3 Monoharmonic Wave Pipelines with Feedback
3.2.4 Polyharmonic Wave Pipelines with Feedback
3.3 Process and Environmental Delay Compensation
3.3.1 Sorting
3.3.2 Tunable Constructive Clock Skew
3.3.3 Biased Logic
3.3.4 Driver Current Starving
3.3.5 Driver Voltage Controlled Load
3.3.6 Thermal Control
3.3.7 Supply Voltage Control
3.4 Summary
4.1.6 External Control Logic
4.1.7 Constant Delay Power Control Logic
4.1.8 Clock Generation and Distribution
4.2 Vector Unit Operations
4.3 Balancing
4.4 Vector Unit Fabrication
4.5 Test Results
4.5.1 Functional Tests
4.5.2 Wave Pipeline Speed Tests
4.6 Comparison to Traditional Design
4.7 Summary
5.1.2 Fully Latchless Feedback Circuits
5.1.3 Self-Timed Wave Pipelines
5.2 Circuit Enhancements for Wave Pipelining
5.2.1 Low Variation Circuit Designs
5.3 Summary
6.1 Summary
6.2 Conclusions
6.3 Future Wave Pipelining Research
6.3.1 Models and Tools
6.3.2 Adaptation
List of Figures
1 Circuit Model
2 Synchronizer Edge Definitions
3 Wave Pipeline Timing Definitions
4 Inverter Chain Propagation Delay vs. Fabrication Run
5 Relative Carrier Mobilities vs. Temperature
6 Inverter Chain Propagation Delay vs. Temperature
7 Relative Propagation Delay vs. Temperature
8 Vector Unit Thermal Profile
9 Relative Charge, Discharge Delay vs. Supply Voltage
10 Inverter Propagation Delay vs. Supply Voltage
11 Relative Propagation Delay vs. Supply Voltage (5V)
12 Relative Propagation Delay vs. Supply Voltage (3.3V)
13 Externally Supplied Clocked System
14 Maximum Waves vs. ...
15 Environmental Degradation Factor
16 Internally Generated Variable Frequency Clocked System
17 Internally Generated Clocks
18 Inverter Chain Delay and Ring-Oscillator Period vs. Temperature
19 Inverter Chain Delay and VCO Period vs. Temperature
20 Example Circuit and Graph
21 Delay Tuning Circuit
22 Delay Tuning Options
23 Inverter Propagation Delay vs. Modification Factor
24 Design Process
25 Pulse Circuit Balancing
26 Unbalanced Counter Delay Histogram
27 Fine Balanced Counter Delay Histogram
28 Carry Generation Circuit
29 Example Polyharmonic Wave Pipelines
30 Monoharmonic Wave Pipeline without Feedback Optimization
31 Polyharmonic Wave Pipeline without Feedback Optimization
32 Monoharmonic Wave Pipeline with Feedback Optimization
33 Polyharmonic Wave Pipeline with Feedback Optimization
34 Biased Logic Gates
35 Compensation Using Current Starved Driver
36 Delay Tuning Range of a Current Starved Driver
37 Driver with Voltage Controlled Load
38 Delay Tuning Range of Voltage Controlled Load Driver
39 Thermal Controlled Delay Compensation
40 Power Supply Voltage Delay Compensation
41 Power Supply Voltage Delay Compensation
42 Vector Unit Organization
43 Parallel Adder Organization
44 Parallel Multiplier Organization
45 (4,2) Counter Implementation
46 Vector Register Organization
47 Vector Register Balancing
48 Vector Instruction Pipeline Stages
49 Vector Unit Die Photo
50 Dice Ring Oscillator Variation
51 Vector Register Read Operation
52 Constant Delay Power Bump Indications
53 Die Process Variation Compensation
54 High Speed Wave Pipeline Testing
55 Propagation of Waves in Wave Pipeline
56 Stall Handling in Wave Pipeline
57 Wave Pipeline with Input Register Chain
58 Stall in Wave Pipeline with Results Queue
59 Wave Pipeline with Results Queue
60 Freeze Points
61 Wave Pipeline with Freeze Points
62 Stalling Wave Pipeline
63 Relative Clock Rate (10% freeze delay)
64 Relative Clock Rate (20% freeze delay)
65 Relative Clock Rate (40% freeze delay)
66 Wave Pipeline with Latchless Feedback
67 Decoder Delay Variation
68 Phase Comparator Circuit
69 Power Converter Circuit
70 Adaptive Power Initialization
71 Adaptive Power Step Response
List of Tables
1 Simulated Process Corner Propagation Delays
2 Simulated Process Parameters
3 Inverter Chain Simulated Maximum Number of Waves
4 Vector Unit Logic Balancing Results
5 Sorting Example
6 Vector Unit Balancing Results
7 Vector Unit Results Comparison
1 Introduction
1.1 Background
In an effort to improve the throughput of digital systems, designers have long turned to pipelining. In a pipelined system, a logic network is partitioned into pipeline stages, each of which operates upon data computed in the previous cycle by the previous pipeline stage. When a logic network is pipelined, synchronizing elements, either latches or registers, are inserted to partition the network into stages. Pipelining a circuit into N stages can increase throughput by up to a factor of N. The inserted synchronizing elements increase the area and power consumption of the logic and add latency and cycle time overhead. Wave pipelining is an alternative synchronous circuit clocking technique that allows overlapped execution of multiple operations without using synchronizing elements within the logic. Instead, knowledge of the signal propagation delay characteristics of the logic network is used at design time to manage the signal delays so as to ensure that operations do not interfere with their predecessor or successor computations. Figure 1 is a block diagram of a nonpipelined circuit, a pipelined version of the same circuit, and a wave pipelined equivalent.
Figure 1: Circuit Model (nonpipelined, pipelined, and wave pipelined versions of a register, combinational logic, register circuit)
Whereas conventional pipelined designs allow data to propagate from a synchronizing element, latch or register, through the combinational network to another synchronizing element prior to initiating the subsequent data transfer, wave pipelined designs apply subsequent data to the network as soon as it can be guaranteed that it will not interfere with the current data wave. In this manner, multiple waves of data propagate simultaneously through distinct regions of the logic network. Because waves of data are applied to the logic as fast as can be guaranteed not to interfere, the throughput of wave pipelined synchronous systems can be greater than can be achieved with conventional pipelining techniques. Wave pipelining can approach the physical switching limit of the devices [10]. Wave pipelining can improve the throughput of a logic circuit while avoiding some of the overheads of traditional pipelining. Wave pipelines avoid the cycle time overhead of traditional pipelines because there are no internal synchronizers. Instead, cycle time is determined by the variation in the propagation delay of the signals through the logic and by the input and output register delays. Latency through the wave pipeline avoids the traditional pipeline overhead because the signals do not propagate through internal synchronizers. Partitioning overhead is avoided since the pipeline is not partitioned into stages separated by synchronizers. The area and power overheads of a traditional pipeline are avoided in the wave pipeline since there are no internal synchronizers. Manipulation of the circuitry to maximize the performance of wave pipelines can, however, introduce additional area and power overhead.
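The cycle-time argument above can be illustrated with a small numerical sketch. It is only a simplified model: the function names, the single lumped overhead term, and the delay values are hypothetical and are not taken from this work.

```python
def conventional_period(stage_delays, sync_overhead):
    """Minimum clock period of a conventional pipeline:
    slowest stage delay plus a lumped synchronizer overhead."""
    return max(stage_delays) + sync_overhead


def wave_period(p_max, p_min, io_overhead):
    """Idealized minimum clock period of a wave pipeline:
    spread between longest and shortest path plus a lumped
    input/output register overhead."""
    return (p_max - p_min) + io_overhead


# Hypothetical logic block (all values in ns): longest path 12,
# shortest path 9, overhead 1.
unpipelined = conventional_period([12.0], 1.0)        # one stage
three_stage = conventional_period([4.0, 4.0, 4.0], 1.0)
wave = wave_period(12.0, 9.0, 1.0)

print(f"single stage: {unpipelined} ns")
print(f"three stages: {three_stage} ns")
print(f"wave pipelined: {wave} ns")
```

In this toy model the wave pipelined period depends only on the delay spread and the register overhead, which is why balancing path delays matters more than shortening the longest path.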
by Joy, et al. [26]. Chang, et al. [6] developed a method of removing latches from a traditional pipeline when a lower clock period results and wave pipelined timing constraints can be met. Kim, et al. [28] have developed optimization tools which restructure the wave pipeline logic to improve path length balance. Wave pipelining has been successfully applied in several VLSI designs. Wong [54] developed a wave pipelined bipolar population counter with a latency of 10ns and a cycle time of 4ns, thus supporting 2.5 concurrent waves. Chappell, et al. [8] applied wave pipelining techniques in the design of an SRAM that consisted of self-resetting logic blocks which were operated in wave pipelined fashion. This SRAM had a latency of 3.9ns and a cycle time of 2ns, thus supporting 1.9 concurrent waves. Fan, et al. [14] developed a CMOS adder using wave pipelining. Simulated operation of this adder achieved 250MHz and supported more than five waves. Lien, et al. [35] applied wave pipelining to CMOS domino logic circuits and designed a 100MHz, 2-wave CMOS wave domino multiplier. Klass [29] developed a CMOS multiplier which operated at 350MHz and supported four waves. Additional wave pipelined VLSI designs include CMOS multipliers [41, 19, 55], CMOS static RAMs [40, 52], and several simple CMOS circuits. In general, VLSI implementations of wave pipelining have demonstrated up to 2 waves of data for memory devices and from 2 to 6 waves of data for arithmetic circuits.
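The wave counts quoted above follow from dividing latency by cycle time. A minimal sketch of that arithmetic is given below; the helper name is hypothetical, and the two data points are the latency and cycle time figures quoted in the text.

```python
def concurrent_waves(latency_ns, cycle_ns):
    """Number of data waves in flight: latency divided by cycle time."""
    return latency_ns / cycle_ns


# (latency, cycle time) in ns, as quoted in the text above.
designs = {
    "bipolar population counter": (10.0, 4.0),
    "self-resetting SRAM": (3.9, 2.0),
}

for name, (latency, cycle) in designs.items():
    print(f"{name}: {concurrent_waves(latency, cycle):.2f} waves")
```

For the SRAM the quotient is 1.95, which the text reports as 1.9.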
complexity of design and operation result. First, all pipelines in the system should operate over a common range of clock frequency. Second, signals flowing between pipelines, including those in feedback paths, must meet the timing constraints for proper pipeline operation. This research effort has developed design methods for systems with multiple, interconnected wave pipelines. Other CMOS wave pipelining efforts have relied on manual optimization of wave pipelined performance or on the addition of fixed circuit elements to assist in the performance optimization of the circuits. Automated CMOS optimization techniques for use in CMOS wave pipelined system design have been developed as part of this research. Unlike previous wave pipelined research, where the operating frequencies could be determined and set individually for each die, in this research it was deemed necessary to design and operate all dice at a given target frequency. Techniques that ensure the correct operation of all dice at the target frequency were developed in this research effort. Algorithmic, architectural, and circuit design issues in wave pipelined system design, such as wave pipeline stalling, low data-dependent CMOS circuit design, and wave pipeline/traditional pipeline interfacing, were also examined in this research. To validate the performance limits and the design techniques and tools, a demonstration system was designed. A wave pipelined CMOS vector unit VLSI integrated circuit was designed, fabricated, and tested. This vector unit design operates at 300MHz. It contains a wave pipelined vector register file, a wave pipelined adder, and a wave pipelined multiplier. It demonstrates the use of multiple, synchronized wave pipelines with feedback. The performance of this system is optimized through the use of the automated balancing techniques and through process and environmental compensation techniques.
low power circuits [23, 36]. Self-timed design techniques in which the completion detection logic is signaled with a timing reference which is guaranteed to be longer than the critical path logic [51] are somewhat akin to wave pipelining with critical skew as presented in Chapter 5. Recent efforts in self-timed circuit design which make use of the dynamic signal propagation delay characteristics of critical logic paths to generate "dynamic clocks" [47, 11] are also related. Clock distribution techniques with constructive clock skew which are applied to traditional pipeline designs [15, 18] are specific cases of wave pipelining in which the intentional clock skew results in a number of waves fractionally greater than one.
1.4 Contributions
This dissertation develops constraints for wave pipeline operation which extend previously presented constraints [53, 33, 21] to cover environmental and process dependencies. Using these constraints, and models of static CMOS gate delays, performance limits for CMOS wave pipelines are established. A quantification of the performance implications of delay variation for both wave pipelines and traditional pipelines is derived. To optimize the performance of CMOS wave pipelined circuits, a method of equalizing CMOS circuit path delays is presented. The transistor sizing mechanism was developed, implemented, measured for balancing accuracy, and applied to the design of a vector unit as part of this research effort. Optimization methods for systems of wave pipelines which are more general than those examined by other researchers [12, 21, 27, 6] are developed. Methods of determining constructive clock skew and intentional delay insertion for optimization of these wave pipelines are proposed. The strict constraints placed upon CMOS wave pipelines by fabrication and environmental variations, quantified in this dissertation, motivated an examination of delay compensation techniques. These techniques, which have been employed for other compensation purposes, are evaluated for suitability to wave pipelined CMOS system design. Due to the range of compensation necessary and its area and power benefits, a variable power supply technique is determined to be attractive. This technique was demonstrated in the vector unit IC developed in this research. One impediment to the use of wave pipelines in processors, the inability of a wave pipeline to stall, is also examined. Stalling wave pipelines, which employ additional transistors within the pipeline to provide stall capabilities, are proposed, their clock constraints are presented, and their performance is contrasted with that of conventional pipelines.

While previous efforts have demonstrated operation of wave pipelined memories and arithmetic circuits, this research demonstrates that systems of CMOS wave pipelines, using the tools and techniques developed, can be designed, optimized, fabricated, and operated at clock rates above those achievable using conventional techniques.
1.5 Scope
The following chapters detail the performance limits of CMOS wave pipelining, wave pipelining design and optimization techniques, VLSI vector unit design and testing results, and architectural and circuit optimizations for wave pipelining. Chapter 2 is an analysis of the performance limits of wave pipelining in CMOS systems. It presents the timing constraints for valid wave pipelined operation, presents an analytical model of the delay characteristics of CMOS circuits, details the causes and performance effects of delay variation in CMOS circuits, and contrasts the effects of variation upon performance with those exhibited by traditional pipeline designs. Chapter 3 presents design techniques for high performance wave pipelines. It details the path delay balancing method employed in this research and describes the procedure used to synchronize and optimize the performance of multiple wave pipeline systems. It relates techniques for process and environmental compensation to ensure correct operation of wave pipelined systems over all design operation ranges. Chapter 4 describes the organization, design procedure, test procedure, and test results of a CMOS wave pipelined vector unit integrated circuit. Chapter 5 is an exposition of architectural and circuit design enhancements for CMOS wave pipeline design. It describes methods of supporting pipeline stalls in wave pipelines, latchless feedback, and low data-dependent circuit techniques. Chapter 6 summarizes the results of this research and suggests further areas of wave pipelining research.
(1)
In Figure 3 this constraint is shown for the wave 1 data. $N$ is the number of clock periods between the application of the input data and the subsequent latching of the results at the output. It is also the number of concurrent waves in the wave pipeline. $\Delta_{cs}$ is the constructive skew between the clock at the input synchronizer and the clock at the output synchronizer. $P_{max}$ is the worst case maximum propagation delay through the combinational network. It is measured from the time at which the slowest input reaches the midpoint of its switching voltage to the time at which the slowest output of the logic reaches the midpoint of its switching voltage. $\Delta_C$ is the unintentional clock skew between input and output clocks. $T_s$ is the maximum setup time of the output synchronizer. $RF_{max}$ is the maximum rise/fall time of the inputs to the output synchronizer. $T_{synch}$ is the maximum time from the data initiating edge of the clock to valid output of the input synchronizer. This inequality ensures that the result of the slowest computation has sufficient time to propagate to the output, that all outputs rise or fall to their terminal values, and that all outputs meet the minimum setup time of the synchronizer prior to being latched.

Figure 2: Synchronizer Edge Definitions (input synchronizer, combinational logic, output synchronizer; input data; constructive skew $\Delta_{cs}$)

In addition, the subsequent wave must not reach the synchronizer prior to the synchronizing clock edge. Thus the race-through constraint for wave pipelines using edge-triggered registers as synchronizing elements is:
$$(N - 1)\,T_{clk} + \Delta_{cs} \le P_{min} - \Delta_C - T_h - \frac{RF_{min}}{2} + T_{synch} \qquad (2)$$
In Figure 3 this constraint is shown for the wave 2 data. This inequality ensures that the result of the fastest computation is not able to propagate through the logic fast enough to change the voltage of any output in the cycle before the results will be latched.

Figure 3: Wave Pipeline Timing Definitions (input data, output data)

This figure is for edge-triggered registers as synchronizing elements. For transparent latches as synchronizing elements:
$$(N - 1)\,T_{clk} + T_{trans} + \Delta_{cs} \le P_{min} - \Delta_C - \frac{RF_{min}}{2} + T_{synch} \qquad (3)$$
where $T_{trans}$ is the time over which the latch is open and transparent.
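The race-through constraints can be checked numerically for a candidate clock period. The sketch below assumes the inequalities read (N - 1) Tclk + Δcs ≤ Pmin - ΔC - Th - RFmin/2 + Tsynch for edge-triggered registers, and the same expression with Ttrans added on the left and Th omitted for transparent latches. The direction of the inequality, the parameter names, and the sample values are assumptions made for illustration.

```python
def race_through_ok_register(n, t_clk, d_cs, p_min, d_c, t_h, rf_min, t_synch):
    """Assumed race-through check for edge-triggered registers."""
    lhs = (n - 1) * t_clk + d_cs
    rhs = p_min - d_c - t_h - rf_min / 2 + t_synch
    return lhs <= rhs


def race_through_ok_latch(n, t_clk, t_trans, d_cs, p_min, d_c, rf_min, t_synch):
    """Assumed race-through check for transparent latches."""
    lhs = (n - 1) * t_clk + t_trans + d_cs
    rhs = p_min - d_c - rf_min / 2 + t_synch
    return lhs <= rhs


def max_period_register(n, d_cs, p_min, d_c, t_h, rf_min, t_synch):
    """Largest clock period the assumed register bound allows for n waves."""
    if n <= 1:
        return float("inf")  # a single wave cannot race through itself
    return (p_min - d_c - t_h - rf_min / 2 + t_synch - d_cs) / (n - 1)


# Hypothetical timing values, all in ns.
params = dict(d_cs=0.0, p_min=9.0, d_c=0.2, t_h=0.1, rf_min=0.4, t_synch=0.5)
print(race_through_ok_register(3, 4.0, **params))   # three waves at 4 ns
print(race_through_ok_register(3, 5.0, **params))   # three waves at 5 ns
print(max_period_register(3, **params))
```

Note that the race-through bound limits the clock period from above: for a fixed number of waves, slowing the clock too far lets the next wave overrun the one being latched.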
In addition to meeting the race-through and long-path constraints, wave pipelined circuits require that waves of data do not interfere with each other at the output synchronizing element. This constraint results in the following inequality:
(4)
In addition to the output constraint, wave pipelined circuits cannot allow wave interference at any point in the network. This can be represented by the following:
(5)
where $T_{ms}$ is the minimum amount of time a node voltage must be stable to ensure the subsequent level of logic operates correctly.
Details of the timing constraints for pipelined and wave pipelined circuits are found in [53]. To establish the performance limits of wave pipelining, constraints 1 and 2 can be combined to find the maximum number of waves which can be supported by a wave pipeline:
$$N_{max} = \cdots \qquad (6)$$
$N_{max}$, the maximum number of waves in the wave pipeline, also represents the maximum speedup of a wave pipeline over the same circuit being operated as a traditional pipeline stage. By collecting the clock overhead factors for the long path and the race-through path into single terms, $H_{max}$ and $H_{min}$, respectively, constraint 6 can be reduced to:
$$N_{max} = \cdots \qquad (7)$$
(8)
(9)
$$T_{pd} = \frac{2\,C_l V_t}{K\,(V_{dd} - V_t)^2} + \cdots \qquad (10)$$
where $C_l$ is the total load capacitance, $V_t$ is the transistor threshold, $K$ is the transistor transconductance, and $V_{dd}$ is the power supply voltage. Short-channel results are presented in Appendix B. With this model of the gate delay, the maximum delay through a combinational logic network is:
$$P_{max} = \sum_{\text{long path}} T_{pd} \qquad (11)$$
and the minimum path delay through the combinational logic network is:
$$P_{min} = \sum_{\text{short path}} T_{pd} \qquad (12)$$
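The two sums above amount to a longest-path and a shortest-path computation over the gate-level graph. The following is a minimal sketch, assuming the combinational network is a directed acyclic graph with one fixed delay per gate; the netlist, gate names, and delay values are hypothetical.

```python
from functools import lru_cache

# Hypothetical netlist: gate -> list of fan-in gates
# (gates with an empty list are driven by primary inputs).
FANIN = {
    "g1": [],
    "g2": [],
    "g3": ["g1", "g2"],
    "g4": ["g3"],
    "out": ["g4", "g2"],
}

# Hypothetical per-gate propagation delays T_pd, in ns.
T_PD = {"g1": 0.30, "g2": 0.25, "g3": 0.40, "g4": 0.35, "out": 0.20}


@lru_cache(maxsize=None)
def path_delay(gate, pick):
    """Sum of T_pd along the longest (pick=max) or shortest (pick=min)
    path from any primary input to the output of `gate`."""
    fanin = FANIN[gate]
    upstream = pick(path_delay(g, pick) for g in fanin) if fanin else 0.0
    return upstream + T_PD[gate]


p_max = path_delay("out", max)   # long path: g1 -> g3 -> g4 -> out
p_min = path_delay("out", min)   # short path: g2 -> out
print(f"Pmax = {p_max:.2f} ns, Pmin = {p_min:.2f} ns")
```

In this toy netlist the direct connection from g2 to the output gate creates a short path far faster than the long one, which is the kind of imbalance that delay balancing is meant to remove.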
As shown in equation 6, the speedup of wave pipelines is constrained by relative differences in propagation delay rather than maximum propagation delay:
$$N_{max} \approx \frac{P_{max}/P_{min}}{P_{max}/P_{min} - 1} \qquad (13)$$
Thus, the ratio of maximum to minimum propagation delays is necessary to ascertain the performance potential of wave pipelining.
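To give a feel for how strongly the delay ratio limits the number of waves, the sketch below evaluates an idealized bound in which all clock overheads are neglected, so that the bound depends only on r = Pmax/Pmin and equals r / (r - 1). Neglecting the overheads is an assumption of this sketch, and the function name is hypothetical. The ratio 1.35 is the run-to-run figure quoted in this work.

```python
def ideal_max_waves(delay_ratio):
    """Idealized bound on concurrent waves for delay_ratio = Pmax / Pmin.

    Clock overheads are ignored, so real designs will support fewer waves.
    """
    if delay_ratio <= 1.0:
        raise ValueError("delay_ratio must be greater than 1")
    return delay_ratio / (delay_ratio - 1.0)


for ratio in (1.1, 1.2, 1.35, 2.0):
    print(f"Pmax/Pmin = {ratio:.2f} -> at most {ideal_max_waves(ratio):.2f} waves")
```

Under this idealization, a ratio of 2 leaves room for only two waves, while a ratio of 1.35 allows fewer than four before any overhead is counted.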
Table 1: Simulated Process Corner Propagation Delays

Figure 4 is a diagram of the simulated propagation delay of a chain of 50 inverters using SPICE model parameters derived from measurements of seven MOSIS 0.8 micron fabrication runs. For these runs, the maximum propagation delay is longer than the minimum by a factor of 1.35. When compared to the arithmetic average, the variation is +11% to -18%. Fan, et al. [14] performed fabrication process sensitivity analysis on a wave pipelined adder design. By varying the SPICE model parameters, they found simulated delay to be most sensitive to variations in channel oxide thickness, the transistor geometry parameters, and device transconductance.
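The two ways of quoting the run-to-run spread can be cross-checked against each other. The sketch below reads the quoted +11 and -18 as percentages of the arithmetic average. The seven delay values are hypothetical, chosen only so that their mean is 1.0 and their extremes match those percentages; they are not the measured data.

```python
def spread_stats(delays):
    """Max/min ratio and extreme deviations from the arithmetic mean (in %)."""
    avg = sum(delays) / len(delays)
    slowest, fastest = max(delays), min(delays)
    return {
        "ratio": slowest / fastest,
        "above_avg_pct": 100.0 * (slowest - avg) / avg,
        "below_avg_pct": 100.0 * (fastest - avg) / avg,
    }


# Hypothetical normalized delays for seven runs (mean = 1.0).
runs = [1.11, 1.05, 1.03, 1.01, 1.00, 0.98, 0.82]
stats = spread_stats(runs)
print(f"max/min = {stats['ratio']:.3f}")
print(f"deviation from mean: {stats['above_avg_pct']:+.1f}% to {stats['below_avg_pct']:+.1f}%")
```

With extremes of 1.11 and 0.82 relative to the mean, the ratio of slowest to fastest comes to roughly 1.35, matching the factor stated in the text.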
Figure 4: Inverter Chain Propagation Delay vs. Fabrication Run

The variation of channel current with temperature is strongly related to the change in channel carrier mobility. Therefore, the variations in propagation delay are modeled as a function of variations in mobility. Secondary effects such as threshold reduction and junction capacitance variation are ignored for this analysis. Empirical studies [46, 20] have shown that the temperature dependence of the mobility of channel carriers can be represented by:
$$\mu = \mu_0\, f_v\, f_h \qquad (14)$$
where $f_v$ and $f_h$ represent degradation factors in the vertical and horizontal directions, respectively. The temperature dependence of the low-field mobility, $\mu_0$, is:
$$\mu_0(T_2) = \mu_0(T_1) \left( \frac{T_2}{T_1} \right)^{-M} \qquad (15)$$
where $M$ is an empirical constant between 1.5 and 2. HSPICE uses $M = 1.5$ for level 3 IDS MOS device modeling [38]. $T_1$ and $T_2$ are absolute temperatures. Figure 5 shows the ratio of channel carrier low-field mobility at 25°C to that for temperatures from 25°C to 125°C as derived from the above mobility formula with $M = 1.5$.
mobility(25C)/mobility(T)
1.5
| | | | |
1.4
1.3
1.2
1.1
1.0 | 25
| 50
| 75
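The mobility ratio plotted in Figure 5 can be reproduced with a few lines of Python. This is a minimal sketch that assumes only the power-law temperature dependence of the low-field mobility with exponent M; the function name and the Celsius-to-Kelvin conversion are illustrative and not part of the original work.

```python
def mobility_ratio(temp_c, ref_c=25.0, m=1.5):
    """Return mu0(ref) / mu0(temp) for temperatures given in degrees Celsius.

    Assumes mu0(T2) = mu0(T1) * (T2 / T1) ** (-m) with absolute temperatures,
    so the ratio mu0(T1) / mu0(T2) equals (T2 / T1) ** m.
    """
    kelvin_offset = 273.15
    t_ref = ref_c + kelvin_offset
    t = temp_c + kelvin_offset
    return (t / t_ref) ** m


if __name__ == "__main__":
    # Tabulate the ratio over the 25 C to 125 C range discussed in the text.
    for temp_c in (25, 50, 75, 100, 125):
        ratio = mobility_ratio(temp_c)
        print(f"{temp_c:3d} C: mobility(25C)/mobility(T) = {ratio:.3f}")
```

With M = 1.5 the ratio at 125°C falls between 1.5 and 1.6, which is the range quoted for the long-channel delay increase.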
Figure 5: Relative Carrier Mobilities vs. Temperature

The variation of mobility results in a corresponding variation in channel current and, in turn, propagation delay. Ignoring the secondary temperature effects and concentrating on the mobility variation, the ratio of the propagation delay through a network of long-channel devices at a given temperature to that at the nominal temperature should be the inverse of the mobility ratio, as suggested by the propagation delay equations in Section B. The Figure 5 data suggest that propagation delays of CMOS logic structures can be as much as 50% to 60% slower at 125°C than at 25°C due to the differences in mobility.

Figure 6 shows HSPICE simulations of the propagation delay of two chains of 50 CMOS inverters over a temperature range of 25°C to 125°C. The short-channel chain consists of inverters with 1.5/0.8 NMOS transistors and 3.5/0.8 PMOS transistors. The long-channel chain consists of inverters with 9/3 NMOS transistors and 21/3 PMOS transistors. Figure 7 shows the ratio of the propagation delay of the inverter chains for temperatures from 25°C to 125°C to the propagation delay at 25°C. Superimposed on Figure 7 is the ratio of mobilities as given previously. The mobility approximation to relative propagation delay becomes less accurate as temperature increases because of the assumption of constant thresholds.

Figure 6: Inverter Chain Propagation Delay vs. Temperature (curves: short-channel, long-channel)

Based upon the models of CMOS device behavior and SPICE simulations, the propagation delay of a long-channel CMOS network at temperature $\theta_2$, relative to that at temperature $\theta_1$, is approximately the inverse of the mobility ratio, $(\theta_2/\theta_1)^{M}$.
For short-channel devices, velocity saturation limits the channel current. Because temperature affects the saturation voltage, the expression for relative propagation delay is more complicated.

Thus, propagation along a given path for a CMOS network will be as much as 50% slower at 125°C than at room temperature.
Figure 7: Relative Propagation Delay vs. Temperature

Temperature variation is both spatial and temporal. As a transistor conducts current, heat is conducted through the surrounding die area, resulting in changes in local temperature. As power consumption increases, the spatial average of die temperature also increases, and thus a temporal temperature variation exists. If the consumption of power is not spatially uniform, temperature gradients exist.

Figure 8 illustrates the spatial and temporal variation in temperature for the vector unit IC developed as part of this research. The two thermal profiles shown are for the locations on the die which experience the thermal extremes. These profiles were derived using two-dimensional heat transfer models in which the die was represented as a mesh of thermal cells. Each cell is represented by a thermal resistance to each of its neighbors, a thermal resistance to the ambient, the specific heat of the cell, and the local power consumption. At time t = 0, the vector unit operations which consume the greatest amount of power commence. The total power consumption at this instant increases from 0.6 W to 1.9 W. The top trace gives the temperature of the maximum temperature location. The temperature at this location rises approximately 43°C with a time constant of approximately 350 ms. The location of the minimum thermal extreme rises approximately 38°C. Thus, as shown in this figure, the maximum spatial temperature difference is 5°C and the maximum temporal temperature variation is 43°C.
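The thermal transient described above can be approximated, very roughly, by a single-pole step response. The sketch below is not the two-dimensional mesh model used for Figure 8; it assumes a first-order exponential rise and assumes that both die locations share the 350 ms time constant quoted for the hottest location. All function names are illustrative.

```python
import math


def temperature_rise(t_s, final_rise_c, tau_s=0.350):
    """First-order step response: rise(t) = final_rise * (1 - exp(-t / tau)).

    t_s is the time in seconds after the power step, final_rise_c is the
    steady-state temperature rise in degrees C, tau_s is the time constant.
    """
    if t_s < 0:
        return 0.0
    return final_rise_c * (1.0 - math.exp(-t_s / tau_s))


def spatial_difference(t_s, hot_rise_c=43.0, cool_rise_c=38.0, tau_s=0.350):
    """Temperature difference between the hottest and coolest die locations.

    Assumes (hypothetically) the same time constant at both locations.
    """
    hot = temperature_rise(t_s, hot_rise_c, tau_s)
    cool = temperature_rise(t_s, cool_rise_c, tau_s)
    return hot - cool


if __name__ == "__main__":
    for t_ms in (0, 100, 350, 1000, 3000):
        t_s = t_ms / 1000.0
        print(f"t = {t_ms:4d} ms: hot rise = {temperature_rise(t_s, 43.0):5.1f} C, "
              f"spatial difference = {spatial_difference(t_s):4.2f} C")
```

Under these assumptions the temporal variation approaches 43°C and the spatial difference approaches 5°C after a few time constants, matching the extremes quoted in the text.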
Voltage Variation. Supply voltage variation affects propagation delay by altering the channel current and signal voltage swing. Using the delay expressions from Section B for the propagation delay of a capacitor discharging through an n-channel device and charging through a p-channel device, a first-order expression for the ratio of propagation delay at a given supply voltage to the propagation delay at the nominal supply voltage is derived. For the simulated process, the model parameters are given in Table 2. Figure 9 shows the propagation delay of a capacitor being driven high and low through a PMOS and NMOS device, respectively, for a range of supply voltages, relative to the nominal 5 V supply. Since, in this model, the propagation delay through a logic network is the sum of the individual delays, the ratio of the propagation delays for the network should lie between the charging and discharging ratios.
Table 2: Simulated Process Parameters

Parameter   Value
Vtn         0.71 V
Vtp         -0.90 V
Vdd         5 V
μn0         572 cm²/(V·s)
μp0         178 cm²/(V·s)
Cox         192 nF/cm²
vsatn       1980 cm/s
vsatp       3690 cm/s

Figure 10 shows the simulated propagation delay of a minimum-sized balanced inverter driving an identical inverter high and low versus supply voltage. This figure shows that static CMOS inverter gate delay is, to first order, inversely proportional to the supply voltage. Figure 11 compares the computed relative propagation delay ratios for rising and falling outputs versus supply voltage. Also included in this figure are the simulated ratios for the short-channel chain of 50 inverters. Figure 12 is a plot of simulated propagation delay versus supply voltage for a nominal supply of 3.3 V. As a first-order approximation, the variation in propagation delay due to supply voltage drift is linearly related to the supply voltage.

Thus, the propagation delay of a network shows a variation of 5% to 10% with respect to nominal over an operating supply voltage range of 4.5 to 5.5 V. The ratio of maximum delay to minimum delay due to supply changes is thus 1.2.

In addition to dc variations, dynamic power fluctuations have an effect on the propagation delay of CMOS circuits. Supply dI/dt noise due to on-chip circuitry is small; however, driver dI/dt noise can have a significant impact on the delay of the driver [2]. With separate power distribution networks, the delay variation is isolated to the driver. Thus, the relative delay variation of a CMOS circuit path due to driver dI/dt noise can be estimated by multiplying the relative delay factor of the driver by the fraction of the nominal path delay that is due to the nominal driver delay.
Coupled Noise. In addition to the power supply noise described in the previous section, noise coupled from adjacent signals to the output of a gate can have a significant impact on the delay of that gate. We model capacitively coupled noise as a change in the effective load capacitance as seen by the gate. When there is no change in the voltage on the coupled line, the effective capacitance as seen by the gate is the nominal capacitive load due to the output wire capacitance and the gate capacitance of all transistors connected to the output. When the voltage on a coupled line is moving in the same direction as the gate output, the effective capacitive load is decreased by the value of the coupled capacitance. When the voltage on the coupled line is moving in the opposite direction to the gate output, the effective capacitance is increased by the value of the coupled capacitance. The resulting delay for the gate with a capacitively coupled output is:

$$P_{\Delta} = P_{\Delta=0} \, \frac{C_l - \sum_i d_i \, C_{coupled_i}}{C_l} \qquad (22)$$

$P_{\Delta}$ indicates the delay under coupled noise conditions. $P_{\Delta=0}$ indicates the delay with all coupled wires static. $d_i$ indicates the direction of the signal switch: -1 if opposite to the output and 1 if the same as the output. $C_{coupled_i}$ is the mutual capacitance of the output and signal $i$.
Figure 10: Inverter Propagation Delay vs. Supply Voltage

Thus the maximum to minimum delay ratio of the gate output due to coupled capacitance is:

$$\frac{P_{slow}}{P_{fast}} = \frac{C_l + \sum_i C_{coupled_i}}{C_l - \sum_i C_{coupled_i}} \qquad (23)$$

This variation is most important for gates driving long wires. To obtain the effect of coupled noise on the total propagation delay, the gate delay variation must be scaled by the ratio of the wire driver's nominal contribution to the path delay to the nominal total delay.
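The coupled-noise model can be exercised numerically. The following sketch assumes the effective-capacitance model described above, in which a neighbor switching in the same direction removes its coupling capacitance from the load and a neighbor switching in the opposite direction adds it. The function names and the example capacitance values are hypothetical.

```python
def coupled_delay(p_static, c_load, couplings):
    """Gate delay under coupled noise.

    p_static  -- delay with all coupled wires static
    c_load    -- nominal load capacitance C_l
    couplings -- iterable of (direction, c_coupled) pairs, where direction is
                 +1 (same direction as output), -1 (opposite), or 0 (static)
    """
    effective = c_load - sum(d * c for d, c in couplings)
    return p_static * effective / c_load


def slow_fast_ratio(c_load, coupled_caps):
    """Worst-case ratio of slowest to fastest gate delay due to coupling."""
    total = sum(coupled_caps)
    if total >= c_load:
        raise ValueError("coupled capacitance must be smaller than the load")
    return (c_load + total) / (c_load - total)


if __name__ == "__main__":
    # Hypothetical values: 100 ps static delay, 10 fF load, two 1 fF neighbors.
    print(coupled_delay(100.0, 10.0, [(+1, 1.0), (+1, 1.0)]))  # both aid the output
    print(coupled_delay(100.0, 10.0, [(-1, 1.0), (-1, 1.0)]))  # both oppose it
    print(slow_fast_ratio(10.0, [1.0, 1.0]))
```

Scaling the resulting gate delay variation by the driver's share of the path delay, as the text describes, would be a separate multiplication outside these helpers.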
Process and Environmental Performance Limits. This section uses the clock constraints, the delay models, and the variations in process and operating environment previously discussed to establish the limits of wave pipeline performance for fixed frequency and variable frequency CMOS systems.

In a fixed frequency synchronous system, a clock with fixed period $T_{clk}$ is supplied to the device. The clock frequency is not a function of chip supply voltage, temperature, or fabrication process. Systems with external clock generation, clocks phase-locked to external fixed frequency clocks, and systems with temperature and supply voltage compensating on-chip fixed frequency oscillators are included in this category. Figure 13 is a block diagram of a synchronous system with an externally supplied clock.

Figure 11: Relative Propagation Delay vs. Supply Voltage, 5 V (curves: simulated inverter chain delay, calculated load C charging time, calculated load C discharging time)

For a fixed frequency clocked traditional pipelined system to operate properly, the worst case maximum propagation delay determines the clock rate:

$$P_{max} + RF_{max}/2 + T_s + \Delta_C \le T_{clk} + \Delta_{cs} \qquad (24)$$

$P_{max}$, $RF_{max}$, $T_s$, $\Delta_C$, and $\Delta_{cs}$ are voltage, temperature, and process dependent. $T_{clk}$ is voltage, temperature, and process independent.

For a fixed frequency clocked wave pipelined circuit to operate properly, the following two inequalities must hold for edge-triggered registers:

$$\frac{P_{max} + RF_{max}/2 + T_s + \Delta_C + T_{synch} - \Delta_{cs}}{N} \le T_{clk} \qquad (25)$$

$$\frac{P_{min} - RF_{min}/2 - T_h - \Delta_C + T_{synch} - \Delta_{cs}}{N - 1} \ge T_{clk} \qquad (26)$$

For flow latches, the following inequalities must hold:

$$\frac{P_{max} + RF_{max}/2 + T_s + \Delta_C + T_{synch} - \Delta_{cs}}{N} \le T_{clk} \qquad (27)$$

$$\frac{P_{min} - RF_{min}/2 - T_h - \Delta_C + T_{synch} - T_{trans} - \Delta_{cs}}{N - 1} \ge T_{clk} \qquad (28)$$

$P_{max}$, $P_{min}$, $RF$, $T_{synch}$, $T_s$, and $T_h$ are voltage, temperature, and process dependent. $T_{clk}$ and $T_{trans}$ are voltage, temperature, and process independent.
Deviations of process parameters on a die are relatively time invariant and relatively uniform across the entire die [20]. Thus, once a device is fabricated, its $T_{ox}$, $\mu_0$, $V_{tn}$, $V_{tp}$, etc. could be determined and the process characterized. This type of variation is termed static delay variation. Since the particular process parameters are not known a priori, wave pipelines must be designed to function over a range of expected processes. In addition to the static variation, there is dynamic variation. Dynamic variation is due to changes in the operating environment over time, and includes the temperature, voltage drift and noise, and coupled noise examined in this chapter.

For a wave pipeline to function correctly, a clock period and an integer number of waves must be specified which satisfy the above inequalities for all acceptable values of process, supply voltage, and temperature. For the first inequality, the worst condition is minimum supply voltage, maximum temperature, and slowest process. For the second inequality, the worst condition is maximum supply voltage, minimum temperature, and fastest process.

Figure 13: Externally Supplied Clocked System

The longest path in the network is some factor greater than the shortest path in the network for a given temperature, voltage, and process. This factor, represented by $\alpha$, is due to path length differences and data dependent delay variations in the network:

$$P_{max} = \alpha P_{min} \qquad (29)$$

Because the relative variation in propagation delay due to temperature and voltage variation is, to first order, independent of absolute propagation delay, $\alpha$ is a good approximation of the relative path length difference in the network for any temperature and voltage. The worst-case wave pipeline timing constraints become:

$$\frac{\alpha P^{slow}_{min} + RF^{slow}_{max}/2 + T^{slow}_s + \Delta^{slow}_C + T^{slow}_{synch} - \Delta^{slow}_{cs}}{N} \le T_{clk} \qquad (30)$$

$$\frac{P^{fast}_{min} - RF^{fast}_{min}/2 - T^{fast}_h - \Delta^{fast}_C + T^{fast}_{synch} - \Delta^{fast}_{cs}}{N - 1} \ge T_{clk} \qquad (31)$$

where slow signifies operating conditions $(V_{min}, \theta_{max}, \text{slow process})$ and fast signifies operating conditions $(V_{max}, \theta_{min}, \text{fast process})$.

The propagation delay at worst case operating temperature, supply voltage, and process will be some factor larger than the best case propagation delay. Define $\beta$ as:

$$\beta = \frac{P_{min}(V_{min}, \theta_{max}, \text{slow})}{P_{min}(V_{max}, \theta_{min}, \text{fast})} \qquad (32)$$

From the Section 2.3.4 data:

$$\beta \approx \frac{V_{max}}{V_{min}} \left( \frac{\theta_{max}}{\theta_{min}} \right)^{1.5} \beta_{proc} \qquad (33)$$

where $\beta_{proc}$ is the variation in delay due to process and the temperatures are absolute. If it is assumed that setup, hold, rise and fall, synchronizer delay, and skew times scale as propagation delay with temperature and voltage, the worst case timing inequalities become:

$$\frac{\beta \left( \alpha P^{fast}_{min} + RF^{fast}_{max}/2 + T^{fast}_s + \Delta^{fast}_C - \Delta^{fast}_{cs} + T^{fast}_{synch} \right)}{N} \le T_{clk} \qquad (34)$$

$$\frac{P^{fast}_{min} - RF^{fast}_{min}/2 - T^{fast}_h - \Delta^{fast}_C - \Delta^{fast}_{cs} + T^{fast}_{synch}}{N - 1} \ge T_{clk} \qquad (35)$$

Combining the constraints to solve for $N$, the number of waves in the wave pipelined circuit:

$$N \le \frac{\alpha\beta P^{fast}_{min} + H^{slow}_{max} - \beta\Delta^{fast}_{cs}}{(\alpha\beta - 1) P^{fast}_{min} + H^{slow}_{max} + H^{fast}_{min} - (\beta - 1)\Delta^{fast}_{cs}} \qquad (36)$$

where,

$$H^{slow}_{max} = \beta \left( RF^{fast}_{max}/2 + T^{fast}_s + \Delta^{fast}_C + T^{fast}_{synch} \right) \qquad (37)$$

$$H^{fast}_{min} = RF^{fast}_{min}/2 + T^{fast}_h + \Delta^{fast}_C - T^{fast}_{synch} \qquad (38)$$

If $P^{fast}_{min} \gg RF, T_s, T_h, \Delta_C, T_{synch}$ and the clocks are not intentionally skewed, $\Delta_{cs} = 0$, then:

$$N \le \frac{\alpha\beta}{\alpha\beta - 1} \qquad (39)$$

and, for perfectly balanced paths ($\alpha = 1$):

$$N \le \frac{\beta}{\beta - 1} \qquad (40)$$
Figure 14 gives the maximum number of waves through a wave pipelined network versus the process and environmental delay variation factor, $\beta$, for several practical values of the path length variation factor, $\alpha$. Table 3 gives the simulated results for the maximum number of waves achievable for the chain of 50 inverters for a range of temperatures and voltages. It is assumed that the process parameters are nominal.

There are two important implications from constraint 39. First, based upon data from Section 2.3.4, values of $\beta$ for temperature ranges of 25-125°C and voltage ranges of 4.5-5.5 V for CMOS circuits will be 1.4 to 1.7. Therefore, the number of waves in a static CMOS wave pipelined logic network, independent of its absolute propagation delay, is three or less. Process variation further reduces the number of waves which can be supported. Aggregating the above environmental variation with a process variation of $\beta_{proc} = 1.35$, the number of waves is limited to 1.6. Second, because operating environment changes result in significant changes in propagation delay, extremely accurate path-length balancing may not be necessary to achieve the maximum number of waves. For instance, if temperature and supply changes result in a relative propagation delay variation of 60%, i.e. $\beta = 1.6$, the path lengths through the network can differ by as much as 25% for two concurrent waves.
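The numbers in this paragraph can be checked with a short calculation. The sketch below assumes the simplified bound N ≤ αβ/(αβ − 1), which follows from the two worst-case timing constraints when register overheads and intentional skew are neglected, and assumes that β is the product of a supply term Vmax/Vmin, a temperature term (θmax/θmin)^1.5 in absolute temperature, and a process term. The function names are illustrative and are not taken from the original tools.

```python
import math


def worst_case_beta(v_max, v_min, t_max_c, t_min_c, beta_proc=1.0, m=1.5):
    """Delay variation factor from supply, temperature, and process spread."""
    theta_max = t_max_c + 273.15  # absolute temperatures
    theta_min = t_min_c + 273.15
    return (v_max / v_min) * (theta_max / theta_min) ** m * beta_proc


def max_waves(alpha, beta):
    """Upper bound on the number of waves: alpha*beta / (alpha*beta - 1)."""
    x = alpha * beta
    if x <= 1.0:
        return math.inf  # no variation at all: the bound is unbounded
    return x / (x - 1.0)


def max_alpha(n_waves, beta):
    """Largest path length variation factor that still permits n_waves."""
    if n_waves < 2:
        raise ValueError("the bound is only meaningful for two or more waves")
    return n_waves / ((n_waves - 1) * beta)


if __name__ == "__main__":
    for beta in (1.4, 1.7):
        print(f"beta = {beta}: N <= {max_waves(1.0, beta):.2f}")
    beta_all = worst_case_beta(5.5, 4.5, 125.0, 25.0, beta_proc=1.35)
    print(f"beta with process = {beta_all:.2f}: N <= {max_waves(1.0, beta_all):.2f}")
    print(f"alpha allowed for N = 2 at beta = 1.6: {max_alpha(2, 1.6):.2f}")
```

Under these assumptions the bound for β between 1.4 and 1.7 lies between roughly 2.4 and 3.5, the bound with process variation included falls near 1.6, and two waves at β = 1.6 tolerate α up to 1.25, in line with the figures quoted in the text.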
Table 3: Inverter Chain Simulated Maximum Number of Waves
Max Waves: 6 4 2 2

The effects of environmental and process variation are contrasted for traditional pipelines and wave pipelines with fixed frequency clocking. For a traditional pipeline, the minimum clock period over all acceptable temperatures and voltages is determined by the maximum propagation delay through the network. Thus the minimum clock period, $T^{min}_{clk}$, for a traditional pipeline which must operate over all possible expected supply voltages, temperatures, and process parameters is some factor, $\beta_{slow}$, larger than the clock period which could be achieved if it were expected to operate at the nominal supply, temperature, and process:

$$T^{min}_{clk}(\forall V, \forall\theta, \forall\mathcal{P}) = \beta_{slow} \, T_{clk}(V_0, \theta_0, \mathcal{P}_0) \qquad (41)$$

where,

$$\beta_{slow} = \frac{P_{max}(V_{min}, \theta_{max}, \text{slow})}{P_{max}(V_0, \theta_0, \mathcal{P}_0)} \qquad (42)$$

and,

$$\beta_{slow} \ge 1 \qquad (43)$$

In these equations, the nominal voltage, temperature, and process are $V_0$, $\theta_0$, and $\mathcal{P}_0$ respectively, and all possible ranges of operation are represented by $\forall V$, $\forall\theta$, and $\forall\mathcal{P}$. Equivalently,

$$\frac{T^{min}_{clk}(\forall V, \forall\theta, \forall\mathcal{P})}{T_{clk}(V_0, \theta_0, \mathcal{P}_0)} = \beta_{slow} \qquad (44)$$
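This factor represents the maximum throughput lost to environmental and process variation. For a wave pipeline,

$$T^{min}_{clk}(V_0, \theta_0, \mathcal{P}_0) = P_{max}(V_0, \theta_0, \mathcal{P}_0) - P_{min}(V_0, \theta_0, \mathcal{P}_0) \qquad (45)$$

$$T^{min}_{clk}(\forall V, \forall\theta, \forall\mathcal{P}) = P_{max}(V_{min}, \theta_{max}, \text{slow}) - P_{min}(V_{max}, \theta_{min}, \text{fast}) \qquad (46)$$

assuming,

$$P_{min}(V_{max}, \theta_{min}, \text{fast}) \approx P_{min}(V_0, \theta_0, \mathcal{P}_0), \qquad P_{max} = \alpha P_{min} \qquad (47)$$

Thus,

$$\frac{T^{min}_{clk}(\forall V, \forall\theta, \forall\mathcal{P})}{T^{min}_{clk}(V_0, \theta_0, \mathcal{P}_0)} = \frac{\alpha\beta - 1}{\alpha - 1} \qquad (48)$$

Figure 15 plots the degradation factors for both traditional and wave pipelines versus $\beta$. It is assumed for this figure that any propagation delay through the network at the nominal environment is approximately equal to the propagation delay at maximum voltage and minimum temperature, i.e. $\beta_{slow} \approx \beta$. Figure 15 is evidence of the need to minimize environmental fluctuations in wave pipelined design.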
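The contrast that Figure 15 illustrates can be tabulated directly. This sketch assumes that the traditional pipeline degrades by the factor β itself and that the wave pipeline degrades by (αβ − 1)/(α − 1), which is what results from taking the worst-case wave clock period as the slow-corner maximum delay minus the fast-corner minimum delay with the nominal corner approximately equal to the fast corner. The function names are illustrative.

```python
def traditional_degradation(beta):
    """Clock period degradation factor for a traditional pipeline."""
    return beta


def wave_degradation(alpha, beta):
    """Clock period degradation factor for a fixed-frequency wave pipeline.

    alpha -- path length variation factor (P_max / P_min), must exceed 1
    beta  -- environmental and process delay variation factor
    """
    if alpha <= 1.0:
        raise ValueError("alpha must exceed 1 for a finite nominal clock period")
    return (alpha * beta - 1.0) / (alpha - 1.0)


if __name__ == "__main__":
    alpha = 1.25  # hypothetical path length variation factor
    for beta in (1.0, 1.2, 1.4, 1.6, 1.8):
        print(f"beta = {beta:.1f}: traditional x{traditional_degradation(beta):.2f}, "
              f"wave x{wave_degradation(alpha, beta):.2f}")
```

The better the paths are balanced (α closer to 1), the shorter the nominal wave clock period and the larger the relative penalty from a given β, which is why the wave pipeline curve rises much more steeply than the traditional one.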
Figure 15: Environmental Degradation Factor (vertical axis: CLK degradation factor)

A strategy for maximizing the performance of externally-clocked wave pipelined circuits is tightly controlling the drift of the external power supply and minimizing Vdd and GND noise with numerous supply pins, filter capacitors on the die, and current-limiting I/O drivers. Temperature variation can be minimized by lowering the maximum junction temperature with low thermal resistivity packaging. Analysis of heat generation and flow could be used in the design process to provide tighter bounds on the expected temporal and spatial propagation delay variation. Lee [34] has suggested integrating thermal analysis in a design environment for improved reliability and performance. Temporal variation can be decreased by raising the minimum operating temperature with "warm-up" cycles.

Without tight controls on temperature and voltage, wave pipelined fixed-clock circuits are limited to 2-3 waves per stage. For designs in which full commercial operation is required and tight environmental and process control are not practical, it is unreasonable to expect greater than two waves per wave pipelined logic block. A useful strategy in this case is to partition the logic into the smallest number of pipeline stages, k, such that constraint 2.3.4 with N = 2 is satisfied for each section. In this manner, each pipeline stage will have the minimum delay which holds two simultaneous waves. Therefore, the maximum speed-up over a nonpipelined circuit becomes 2k and the increase in latency will be minimized. Klass [30] analyzes pipelines in which each pipeline stage is in turn wave pipelined.
A ring oscillator design and a voltage-controlled ring oscillator design are shown in Figure 17.
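Figure 17: Internally Generated Clocks (panels: Single Ring; Differential Ring with NMOS Bias)

For a variable frequency clocked traditional pipelined system to operate properly, the worst case maximum propagation delay determines the clock rate:

$$P_{max} + RF_{max}/2 + T_s + \Delta_C \le T_{clk} + \Delta_{cs} \qquad (49)$$

$P_{max}$, $RF_{max}$, $T_s$, and $\Delta_C$ are voltage, temperature, and process dependent. $T_{clk}$ is also voltage, temperature, and process dependent.

For a variable frequency clocked wave pipelined circuit to operate properly, the following two inequalities must hold for edge-triggered registers:

$$\frac{P_{max} + RF_{max}/2 + T_s + \Delta_C + T_{synch} - \Delta_{cs}}{N} \le T_{clk} \qquad (50)$$

$$\frac{P_{min} - RF_{min}/2 - T_h - \Delta_C + T_{synch} - \Delta_{cs}}{N - 1} \ge T_{clk} \qquad (51)$$

For flow latches, the following inequalities must hold:

$$\frac{P_{max} + RF_{max}/2 + T_s + \Delta_C + T_{synch} - \Delta_{cs}}{N} \le T_{clk} \qquad (52)$$

$$\frac{P_{min} - RF_{min}/2 - T_h - \Delta_C + T_{synch} - T_{trans} - \Delta_{cs}}{N - 1} \ge T_{clk} \qquad (53)$$

$P_{max}$, $P_{min}$, $RF$, $T_{synch}$, $T_s$, and $T_h$ are voltage, temperature, and process dependent. $T_{clk}$ and $T_{trans}$ are also voltage, temperature, and process dependent.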
The period of oscillation of a ring oscillator is determined by the propagation delay through the ring. Thus, if the temperature, voltage, and process are constant across the device, $T_{clk}$ will vary as the combinational network propagation delay. According to Glasser [20], process parameters can be approximated as constant across a die. Surface temperature profiles of a die tend to be a superposition of a baseline temperature, due to average die power dissipation and ambient temperature, and hot-spots due to localized device power dissipation [22]. Thus, there is a spatially independent component and a spatially dependent component of temperature variation. Low-frequency power supply voltage variation is likewise time dependent due to supply drift and spatially dependent due to IR drops across the power distribution network.

Figure 18 compares the variation in propagation delay of a chain of inverters with the variation in clock period for a clock generated by an on-chip ring oscillator. This figure shows that the inverter chain propagation delay and the ring oscillator period track if the temperature is spatially uniform. Figure 19 compares the variation in propagation delay of the inverter chain with the variation in period of an on-chip voltage-controlled ring oscillator for spatially uniform temperature.

Figure 18: Inverter Chain Delay and Ring-Oscillator Period vs. Temperature

Spatial temperature variation depends upon power consumption, device placement, switching behavior, and package design. In the absence of heat flow analysis, worst case spatial temperature variation should be assumed.

With internally generated clocks, the clock frequency is a function of temperature and voltage, and is therefore not time invariant. This may present problems in interfacing a device to other devices in a system. An additional problem for on-chip ring oscillators is frequency jitter due to noise. Because the clocks used in wave pipelined circuits are constrained to a range of valid frequencies which becomes increasingly narrow as the number of waves through the logic increases [21], a high degree of clock frequency stability is necessary. This jitter must be included in the $\Delta_C$ factor in the constraint computations. Low-jitter voltage-controlled and current-controlled oscillators minimize jitter through precise capacitance, current, and noise control. Jitter of less than 160 ppm is achievable for on-chip precision CMOS oscillator circuits [17]. They are, however, subject to frequency variation due to supply voltage and temperature changes. Further analysis of the impact of low-jitter on-chip oscillators on wave pipelined designs is warranted.
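Figure 19: Inverter Chain Delay and VCO Period vs. Temperature (curves: relative propagation delay; relative Tclk at bias = 0 V, 1 V, and 2 V)

For a traditional pipeline with a variable frequency clock:

$$T^{min}_{clk}(\forall V, \forall\theta, \forall\mathcal{P}) = \beta_{slow} \, T^{min}_{clk}(V_0, \theta_0, \mathcal{P}_0) \qquad (54)$$

where,

$$\beta_{slow} = \frac{P_{max}(V_{min}, \theta_{max}, \text{slow})}{P_{max}(V_0, \theta_0, \mathcal{P}_0)} \qquad (55)$$

and,

$$\beta_{slow} \ge 1 \qquad (56)$$

or,

$$\frac{T^{min}_{clk}(\forall V, \forall\theta, \forall\mathcal{P})}{T^{min}_{clk}(V_0, \theta_0, \mathcal{P}_0)} = \beta_{slow} \qquad (57)$$

This factor represents the maximum throughput lost to environmental and process variation. For a wave pipeline,

$$T^{min}_{clk}(V_0, \theta_0, \mathcal{P}_0) = P_{max}(V_0, \theta_0, \mathcal{P}_0) - P_{min}(V_0, \theta_0, \mathcal{P}_0) \qquad (58)$$

$$T^{min}_{clk}(\forall V, \forall\theta, \forall\mathcal{P}) = \alpha\beta \, P_{min}(V_{max}, \theta_{min}, \text{fast}) - \beta \, P_{min}(V_{max}, \theta_{min}, \text{fast}) \qquad (59)$$

assuming,

$$P_{min}(V_{max}, \theta_{min}, \text{fast}) \approx P_{min}(V_0, \theta_0, \mathcal{P}_0), \qquad P_{max} = \alpha P_{min} \qquad (60)$$

Thus,

$$\frac{T^{min}_{clk}(\forall V, \forall\theta, \forall\mathcal{P})}{T^{min}_{clk}(V_0, \theta_0, \mathcal{P}_0)} = \beta \qquad (61)$$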
For variable frequency clocking with a uniform surface distribution of supply voltage and temperature, environmental and process variation affects traditional and wave pipelines equally. For these circuits, speed-ups of 2-7, such as those reported in [31, 14, 41, 35], are achievable. For non-uniform surface temperature and supply voltage, wave pipelined circuits with variable-frequency on-chip clocks are subject to the performance constraints of Section 2.3.4, where $\beta$ is due to the worst-case spatial variation of environmental conditions.
Figure 20: Example Circuit and Graph

The path delay is the sum of the contributions of the individual gates along the path. The contribution from each gate is the tuple entry which results in the appropriate transition at the sink node. For positively unate gates, the input nodes inherit the output node transition direction. For negatively unate gates, the input nodes inherit the opposite transition direction.

DAGs with nonunate arcs cannot be represented with unique (rising sink delay, falling sink delay) pairs because the output of a nonunate gate can transition in a given direction as a result of either a rising or a falling input. Thus, for a given output transition direction, multiple path delays are possible. Path delays of circuits with nonunate gates are represented by the minimum and maximum bounds on the delays: (min rising delay, max rising delay, min falling delay, max falling delay). The bounds at the output of a nonunate gate are computed recursively. For example, the max rising delay to the output of nonunate gate $i$ along a path is defined:

$$\text{max rising delay}_i = \max(\text{max rising delay}_{i-1} + \text{rising gate delay}_i, \; \text{max falling delay}_{i-1} + \text{rising gate delay}_i) \qquad (62)$$

The other bounds are determined recursively in the same way, with the appropriate delay terms substituted.
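The recursive bound computation can be sketched as follows. This is an illustrative reading of the rules above, not the original tool's code: the `Bounds` class, the gate tuples, and the string labels for unateness are all hypothetical, and each gate is assumed to have a single rising and a single falling output delay.

```python
from dataclasses import dataclass


@dataclass
class Bounds:
    """Delay bounds from the source to a node, per output transition."""
    min_rise: float = 0.0
    max_rise: float = 0.0
    min_fall: float = 0.0
    max_fall: float = 0.0


def through_gate(b, unateness, rise_delay, fall_delay):
    """Propagate bounds b through one gate on the path."""
    if unateness == "positive":
        # A rising output is caused by a rising input, and likewise for falling.
        return Bounds(b.min_rise + rise_delay, b.max_rise + rise_delay,
                      b.min_fall + fall_delay, b.max_fall + fall_delay)
    if unateness == "negative":
        # A rising output is caused by a falling input, and vice versa.
        return Bounds(b.min_fall + rise_delay, b.max_fall + rise_delay,
                      b.min_rise + fall_delay, b.max_rise + fall_delay)
    if unateness == "nonunate":
        # Either input transition can cause either output transition.
        earliest = min(b.min_rise, b.min_fall)
        latest = max(b.max_rise, b.max_fall)
        return Bounds(earliest + rise_delay, latest + rise_delay,
                      earliest + fall_delay, latest + fall_delay)
    raise ValueError(f"unknown unateness: {unateness}")


def path_bounds(gates):
    """gates is a list of (unateness, rise_delay, fall_delay) along one path."""
    b = Bounds()
    for unateness, rise_delay, fall_delay in gates:
        b = through_gate(b, unateness, rise_delay, fall_delay)
    return b


if __name__ == "__main__":
    # Hypothetical path: an inverter followed by an XOR-like nonunate gate.
    print(path_bounds([("negative", 2.0, 1.0), ("nonunate", 3.0, 3.0)]))
```

For unate paths the minimum and maximum entries stay equal, which is why a simple (rising, falling) pair suffices there; the bounds only separate once a nonunate gate is encountered.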
(Equation 63)

where $C_l$ is the total load capacitance, $V_{dd}$ is the supply voltage, $I_{dsmax}$ is the maximum MOS transistor source-drain current, and $k$ is a factor to account for the difference between the maximum and average current over the switching period. The maximum current is:

(Equation 64)

where $K$ is the transconductance per unity channel width-to-length ratio, $V_t$ is the device threshold, $W$ is the effective transistor width, and $L$ is the effective transistor length. Thus, the delay can be approximated by:

(Equation 65)

The load capacitance is the sum of the wiring capacitance, $C_{int}$, and the gate capacitance of each transistor driven by the output:

$$C_l = C_{int} + \sum_i C_{gate_i} \qquad (66)$$

The first-order gate capacitance of each transistor is related to the per unit area oxide capacitance and the transistor geometry length and width:

$$C_{gate_i} = C_{ox} W_i L_i \qquad (67)$$

$$T_{pd} \propto C_l R_{eq} \qquad (68)$$

Circuit path delays are then computed as the sum of the propagation delays of the gates along the path, as detailed in the previous section.

$$W_0 L_0 = (W_0 M_w)(L_0 M_l) \qquad (69)$$

This occurs when

$$M_w = 1/M_l \qquad (70)$$

where $M_w$ is the width modification factor and $M_l$ is the length modification factor. Applying this result to the gate delay, equation 65, the transistor geometry manipulation results in:

(Equation 71)
The delay of a gate then becomes a monotonically decreasing, convex function of the modification factor.

The following delay tuning example compares the constant capacitance tuning method to the width-only and length-only tuning strategies. The delay tuning modes are applied to the simple circuit shown in Figure 21.

Figure 21: Delay Tuning Circuit (labels: Critical Path, Tuned Gate, Short Path)

For this circuit, the delay through the critical path and the delay through the short path are shown in Figure 22 for the three tuning modes: length-only scaling, width-only scaling, and constant capacitance width/length scaling. The paths are balanced when the critical path line and the short path line intersect.

With length-only tuning, represented by the dashed lines, the lengths of the transistors in the tuned gate are increased from their nominal value of unity length modification factor ($M_l = 1$) to twice their nominal length ($M_l = 2$). The transistor widths are held constant. At approximately $M_l = 1.6$, the critical path delay and short path delay are balanced. However, the delay through the critical path has been increased by about 5%.

With width-only tuning, represented by the dotted lines, the widths of the transistors in the tuned gate are decreased from their nominal width ($M_w = 1$) to 2/3 of their nominal width ($M_w = 0.67$). With this range of width modification the critical and short path lines do not intersect, and thus the circuit cannot be balanced. In this example, this method decreases the delay of the critical path. In general, this is not desirable. The goal of delay tuning is to modify all paths to have delay equal to the critical path delay. By decreasing the delay of some paths, this method may result in additional variation between paths.

With constant capacitance tuning, represented by the solid lines, the widths of the transistors in the tuned gate are decreased from their nominal width ($M_w = 1$) to 2/3 of their nominal width ($M_w = 0.67$). Over this range, the lengths are increased so as to maintain a constant capacitance. At about $M_w = 0.7$ the critical path delay and short path delay are balanced. When balanced, the delay through the critical path has decreased by about 1%.

Length-only tuning, while providing sufficient delay tuning range, alters the critical path delay. Width-only tuning does not provide sufficient delay tuning range and alters the critical path delay. Constant capacitance scaling provides sufficient delay tuning range with minimum impact on critical path delay.
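The qualitative difference between the three tuning modes can be seen in a first-order sketch. It assumes, hypothetically, that the tuned gate's drive resistance scales as L/W and that its input capacitance scales as W·L; it does not reproduce the simulated curves of Figure 22, and the function name and mode labels are illustrative.

```python
def tuned_gate(mode, m):
    """Return (delay_factor, input_cap_factor) for the tuned gate.

    mode -- "length" (M_l = m), "width" (M_w = m), or
            "constant_cap" (M_w = m, M_l = 1 / m)
    The delay factor scales the tuned gate's own delay into a fixed load;
    the input capacitance factor scales the load seen by the upstream gate,
    which is what disturbs the critical path.
    """
    if mode == "length":
        m_l, m_w = m, 1.0
    elif mode == "width":
        m_l, m_w = 1.0, m
    elif mode == "constant_cap":
        m_w, m_l = m, 1.0 / m
    else:
        raise ValueError(f"unknown tuning mode: {mode}")
    delay_factor = m_l / m_w      # assumed R_eq proportional to L / W
    input_cap_factor = m_w * m_l  # assumed C_gate proportional to W * L
    return delay_factor, input_cap_factor


if __name__ == "__main__":
    for mode, m in (("length", 1.6), ("width", 2.0 / 3.0), ("constant_cap", 0.7)):
        delay, cap = tuned_gate(mode, m)
        print(f"{mode:12s} m = {m:.2f}: gate delay x{delay:.2f}, input cap x{cap:.2f}")
```

In this simplified model, only the constant capacitance mode leaves the upstream load unchanged, and it also gives the widest delay range for a given width change because the delay factor goes as 1/M_w².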
The balancing environment uses HSPICE simulations of each gate type, rather than the simple analytic models of gate delay, to accurately determine the constant capacitance transistor length to transistor width relationship. HSPICE simulations are used to ascertain gate propagation delay versus capacitive load versus transistor width modification factor. These simulations are used to build macromodels used by the balancing linear program solver. Figure 23 shows the relationship between the propagation delay of an inverter driving a fanout of four inverters as a function of channel width modification. The widths of all transistors in a gate are modified by this factor simultaneously in order to modify the propagation delay of the gate. Although separate modification factors for rising gate output and falling gate output could be employed to separately modify the rising and falling delay, a single modification factor for the entire gate is employed so as to maintain equal rise and fall delays of the gates.

Figure 23: Inverter Propagation Delay vs. Modification Factor

The automated CMOS wave pipeline balancing process begins with the development of a circuit netlist and a parameterizable gate library. It is assumed that the critical paths of the unbalanced netlist have been minimized. Therefore, delay balancing increases the delay along the short paths through the logic network. The unbalanced netlist is first rough balanced by the Wong rough balancing algorithm [53], which inserts buffers into paths which have local delay imbalances greater than a buffer delay. The rough balanced netlist is converted to a linear program representation which is solved using the Simplex method [39]. The solution to the linear program is a vector of width modification factors from which the individual transistor geometries can be determined.
The Fine Tuning Problem is: Given a weighted DAG representation of a combinational circuit, find a vector of modification factors, $M$, such that all paths from the source to sink have delays which do not exceed a specified maximum propagation delay, $D_{max}$, and the delay of the shortest path is maximal.

The cycle time of a wave pipelined circuit is minimized when the difference between the longest and shortest path, $\Delta D$, is zero. Gates whose delay functions are not equal for rising and falling output may result in a nonzero $\Delta D$, since the rising sink delay for a given path may differ from the falling sink delay along that same path. The relative sizes of the NMOS transistors and PMOS transistors in the gates are chosen to minimize the rising and falling delay differences. Stacked transistors and parallel conduction paths in CMOS gates limit the degree to which these delay functions can be equalized.

The Symmetric Fine Tuning Problem [53] is a restricted case of the Fine Tuning Problem in which the rising delay function of each gate equals its falling delay function. The solution to this problem is also a solution to the Fine Tuning Problem with a bounded $\Delta D$. A related problem, whose solution is also a solution to the Symmetric Fine Tuning Problem, is solved by the delay balancing tools. The Width Minimization Problem is: Given a weighted DAG representation of a combinational circuit, find a vector of modification factors, $M$, such that all paths from the source to sink have delays which do not exceed $D_{max}$ and the total transistor width is minimized. Formally:

$$\text{Minimize } W = \sum_i M(i) \, W(i) \qquad (72)$$

such that:

$$W_{min}(i) \le M(i) \, W(i) \le W_{max}(i)$$

$$D(N, 0) \le D_{max}, \qquad D(N, 1) \le D_{max}$$

If the gate corresponding to arc $(i, j)$ is positive unate:

$$D(i, 1) + f_{lin}(i, j, 1, M(x), C_l(j)) \le D(j, 1)$$
$$D(i, 0) + f_{lin}(i, j, 0, M(x), C_l(j)) \le D(j, 0)$$

If the gate corresponding to arc $(i, j)$ is negative unate:

$$D(i, 1) + f_{lin}(i, j, 0, M(x), C_l(j)) \le D(j, 0)$$
$$D(i, 0) + f_{lin}(i, j, 1, M(x), C_l(j)) \le D(j, 1)$$

where $f_{lin}$ is a piecewise linear approximation to the monotonically decreasing, convex delay function $f$, $D(i, 0)$ is the delay from the source node to node $i$ falling, $D(i, 1)$ is the delay from the source node to node $i$ rising, $C_l(j)$ is the capacitive load on node $j$, $M(i)$ is the width modification factor of transistor $i$, $W(i)$ is the nominal width of transistor $i$, $W_{min}(i)$ and $W_{max}(i)$ are fixed limits on the width of a transistor, node 0 is the source node, and node $N$ is the sink node.

The optimal solution to the Width Minimization Problem, $M^*$, in which all of the delay inequality constraints in the problem description are active, i.e. the constraints are at their limit and are thus equalities, is also a solution to the Symmetric Fine Tuning Problem in which all paths have a delay of $D_{max}$ and $\Delta D = 0$. The solution to the Width Minimization Problem with active constraints represents an accurately balanced circuit. In a design system using parameterizable library cells whose cell height is fixed and whose cell width is a linear function of the transistor widths, the solution is area efficient.
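The role of the piecewise linear approximation in the linear program can be illustrated with a short sketch. It assumes a hypothetical delay model d0/M² in place of the HSPICE-derived macromodels, and it approximates that convex function from below by tangent lines. Because the maximum of several lines is at most D(j) − D(i) exactly when every line is, each delay constraint becomes one linear inequality per line. All names and values are illustrative.

```python
def delay(m, d0=100.0):
    """Hypothetical convex, decreasing gate delay model: d0 / m**2 (ps)."""
    return d0 / (m * m)


def tangent_lines(points, d0=100.0):
    """Return (slope, intercept) pairs of tangents to delay() at the points."""
    lines = []
    for p in points:
        slope = -2.0 * d0 / p ** 3            # derivative of d0 / m**2 at p
        intercept = delay(p, d0) - slope * p  # line passes through (p, delay(p))
        lines.append((slope, intercept))
    return lines


def f_lin(m, lines):
    """Piecewise linear approximation: the maximum over all tangent lines."""
    return max(slope * m + intercept for slope, intercept in lines)


def constraint_rows(lines):
    """One linear inequality per tangent line for D(i) + f_lin(M) <= D(j).

    Each row (slope, intercept) stands for:
        D(i) - D(j) + slope * M <= -intercept
    """
    return [(slope, -intercept) for slope, intercept in lines]


if __name__ == "__main__":
    lines = tangent_lines([0.6, 0.7, 0.8, 0.9, 1.0])
    for m in (0.6, 0.65, 0.75, 0.85, 1.0):
        print(f"M = {m:.2f}: f = {delay(m):7.2f} ps, f_lin = {f_lin(m, lines):7.2f} ps")
```

Tangent lines underestimate a convex function between the sample points, so a production flow would either sample densely or use chords instead; the choice here only keeps the example short.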
The wave pipelined balancing tool is incorporated into a design environment based on the Mentor Graphics GDT design tools. The design process for cell generation, design optimization, layout, and routing is shown in Figure 24.
[Figure 24: Design Process. Flowchart steps recovered from the figure: Flatten; Add Loads; Flat Netlist; Fine Balance (No Solution: Increase Dmax); Balanced Netlist; Layout (L cells, L file); Extract Cap (L file); Route.]
To test the capabilities of the balancing tool, several demonstration circuits have been balanced and simulated with HSPICE: a pulse generator, an unbalanced carry generation circuit, a (4,2) counter circuit, a 16-bit parallel adder, and an 8-bit x 8-bit multiplier. HSPICE simulations of the circuits were for typical fabrication parameters for a 0.8-micron, 3-level metal CMOS process operating at 25°C. The counter, adder, and multiplier circuits are typical of arithmetic designs. The carry tree is an unbalanced NAND and inverter logic structure. The pulse generator is a simple circuit which relies on differences in propagation delay in order to produce a pulse.
For these example circuits a static CMOS parameterizable gate library consisting of four sizes of inverters and three two-input NANDs was used. Balancing results for these circuits were consistent with a study by Klass [32] which showed that despite a large data-dependent delay of individual static CMOS gates, circuits designed with static CMOS inverters and 2-input NAND gates exhibit small data-dependent delay variation. These gates were also easily macromodeled. Figure 25 is a diagram of the pulse generator circuit prior to rough tuning, following rough tuning, and after fine tuning. When this circuit is perfectly balanced by the tool, no pulse is output. Rough-tuning allows the inputs to the NAND gate to cross the Vdd/2 volt threshold within 122ps of each other, thereby inducing only a small perturbation of the NAND gate output. Fine-tuning results in a 17ps difference in arrival time with no perturbation in the NAND output.
[Figure 25: Pulse Circuit Balancing. Panels: Original Circuit, Rough Tuned, Fine Tuned.]
A (4,2) counter circuit was rough balanced, fine balanced, and simulated with HSPICE. Figure 26 is a histogram of the delay of the carry output for all possible pairs of input vectors which cause a change in the carry output of the unbalanced circuit. The unbalanced circuit had a maximum delay of 1.64 ns and a delay variation of 970 ps. Figure 27 is a histogram of the delay of the fine balanced circuit. The maximum delay through the balanced circuit is 1.73 ns and the maximum delay variation is 370 ps. Because these counters are cascaded, the fast carry-out output was connected to the carry-in input for the balancing procedure. In the balancing procedure, the critical path delay of this circuit was increased by approximately 90ps. This increase is due to the simple delay model used by the balancing tool. The limitations of this method of fine balancing are discussed in Section 3.1.6.
[Figure 26: Unbalanced Counter Delay Histogram. Axes: Number of Vectors (0 to 64) versus Propagation Delay (0.5 to 1.8 ns).]
The carry generation circuit is shown in Figure 28. This circuit was balanced and simulated with HSPICE. For heuristically chosen critical vectors, the original circuit exhibited a maximum delay of 1.74 ns and a delay variation of 955 ps. The fine-tuned circuit had a maximum delay of 1.74 ns and a delay variation of 250 ps. The 8x8 multiplier and 16-bit adder used in the wave pipelined vector unit were balanced with this tool. Balancing included input and output bus capacitive loading. The balancing details are given in Table 4.
[Figure 27: delay histogram of the fine balanced counter. Axes: Number of Vectors (0 to 64) versus Propagation Delay (0.5 to 1.8 ns).]

Table 4
Circuit      Max Delay   Delay Variation   Unbalanced No. Trans.   Balanced No. Trans.   Unbalanced Area   Balanced Area   Balancing Time
Multiplier   10.84 ns    1.37 ns           5526                    6466                  1.95 sq mm        2.17 sq mm      10.94 hr
Adder

[Figure 28: carry generation circuit. Inputs: a0, b0, ci, a1, b1, a2, b2, a3, b3. Output: co.]
of other inputs prevent data from propagating along the specified path. Thus, false paths tend to result in a pessimistic estimate of critical path delay. Despite these limitations, this method provides balancing variation of 10 to 20% of maximum delay. This accuracy is consistent with the limits of manual balancing methods by Klass [31]. Higher accuracy in balancing provides diminishing returns, as the wave pipeline cycle time becomes dominated by other factors [43].
i can be divided into j sets of paths each of which can support up to Nij waves of data.
Figure 29 illustrates three important examples of polyharmonic wave pipelines: parallel wave pipelines, forward receiving wave pipelines, and forward generating pipelines. Parallel wave pipelines result when common input and output synchronizers are used to source data to and store data from separate combinational logic paths. For instance, in the vector unit described in Chapter 4 the multiplier and adder share common input and output synchronizers. The propagation delay through the multiplier is approximately twice that of the adder, and thus the number of waves supported by the multiplier paths is approximately twice that of the adder paths. The forward generating and receiving wave pipelines occur in interfacing of wave pipelines to other wave pipelines or traditional synchronous circuits. These polyharmonic wave pipelines result when inputs and/or outputs have more than one reference clock. In the forward generating case, intermediate results are output from the combinational logic such that the number of waves in the paths to the intermediate results differs from the number of waves in the paths to the primary outputs. In the forward receiving case, data is input into the wave pipeline such that the paths from the forward inputs to the primary outputs support fewer waves of data than the paths from primary inputs to primary outputs.
[Figure 29: Example Polyharmonic Wave Pipelines]
A wave pipeline with feedback is a wave pipeline in which there exists a path from the output of the wave pipeline through any number of synchronizers to the input of the wave pipeline. Ekroot [12] developed a method of ensuring correct wave pipeline operation at a minimum clock period for monoharmonic wave pipelines and multistage systems of monoharmonic wave pipelines with feedback. With this method, solutions to linear program representations of the circuit optimizations specify the amount of intentional delay inserted into the combinational logic and the amount of constructive clock skew for each register which achieves the minimum clock period. This method restricts wave pipelines to using edge triggered registers and to common input registers for all inputs and outputs. Extending the work of Fishburn [15] on clock skew optimization, Joy and Ciesielski [27] have formulated the optimization of constructive clock skew so as to minimize the clock period of register-based wave pipelined systems as a linear program. This method does not constrain the wave pipeline to a common input clock. Gray et al. [21], using similar formulations for skew optimization, include feedback for register-based multistage wave pipelined systems. These clock optimizations do not employ intentional delay insertion. Synchronization methods for wave pipelines of increasing complexity will be examined in the following sections. Edge triggered synchronizers are considered in this analysis.
(74)
(75)
where i spans the range of all wave pipelines in the system. cs_i is the amount of constructive clock skew between the input and output synchronizers of wave pipeline i. The magnitude of the clock skew is less than the clock period, and the sign of cs_i is positive if the output clock lags the input clock and negative otherwise. ΔP^i_max is the amount of intentional path delay added to each path between input and output synchronizers under worst case operating conditions. ΔP^i_min is the same intentional path delay under best case operating conditions. The latter two quantities are always positive.
Without constructive clock skew and intentional path delay, the minimum clock period which satisfies the constraints may not be limited by the maximum propagation delay variation. The valid clock intervals are not continuous for wave pipelines in which there is delay variation in the combinational logic and/or clock overhead. Optimization of the wave pipeline circuits, via constructive clock skew and combinational logic delay modification, can be used to achieve the minimum clock period. The minimum clock period, that which is determined by the delay variation, is:
\[
T^{opt}_{clk} = \max_i \left( P^i_{max} + H^i_{max} - P^i_{min} + H^i_{min} \right) \tag{76}
\]
This clock period can be achieved through constructive clock skew and intentional delay insertion when the skew and added delay have no variation. Insertion of additional path delay into each path can be used to ensure that the output register timing constraints are met. The bounds on the additional intentional path delay, ΔP^i_max and ΔP^i_min, for each wave pipeline i can be determined:
\[
\Delta P^i_{min} \ge (N_i - 1)\, T^{opt}_{clk} - \left( P^i_{min} - H^i_{min} \right)
\]
\[
\Delta P^i_{max} \le N_i\, T^{opt}_{clk} - \left( P^i_{max} + H^i_{max} \right)
\]
For wave pipelines without feedback, constructive clock skew can also be used to ensure that the output register timing constraints are met. The constructive clock skew, cs_i, can be determined:
\[
cs_i \le P^i_{min} - H^i_{min} - (N_i - 1)\, T^{opt}_{clk}
\]
\[
cs_i \ge P^i_{max} + H^i_{max} - N_i\, T^{opt}_{clk}
\]
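As a minimal sketch of how the optimal clock period and the delay and skew bounds above fit together, the following evaluates them for two pipelines. The dictionary keys, the function names, the delay values, and the choice of N as the smallest wave count for which the added delay is non-negative are all assumptions made for illustration; the numbers are not those of Figure 30.

```python
import math


def optimal_clock_period(pipes):
    """Largest effective delay variation over all pipelines, as in (76)."""
    return max(p["pmax"] + p["hmax"] - p["pmin"] + p["hmin"] for p in pipes)


def wave_count(p, t_clk):
    """Smallest N with N * t_clk >= Pmax + Hmax (an assumed choice of N)."""
    return math.ceil((p["pmax"] + p["hmax"]) / t_clk)


def delay_bounds(p, n, t_clk):
    """(lower bound on added min delay, upper bound on added max delay)."""
    lower = (n - 1) * t_clk - (p["pmin"] - p["hmin"])
    upper = n * t_clk - (p["pmax"] + p["hmax"])
    return lower, upper


def skew_bounds(p, n, t_clk):
    """(lower, upper) bound on the constructive clock skew cs."""
    lower = p["pmax"] + p["hmax"] - n * t_clk
    upper = p["pmin"] - p["hmin"] - (n - 1) * t_clk
    return lower, upper


# Hypothetical pipelines, delays in ns, clock overheads set to zero.
pipe_a = {"pmax": 7.0, "pmin": 4.0, "hmax": 0.0, "hmin": 0.0}
pipe_b = {"pmax": 5.0, "pmin": 3.0, "hmax": 0.0, "hmin": 0.0}

t_opt = optimal_clock_period([pipe_a, pipe_b])
for name, pipe in (("A", pipe_a), ("B", pipe_b)):
    n = wave_count(pipe, t_opt)
    print(name, n, delay_bounds(pipe, n, t_opt), skew_bounds(pipe, n, t_opt))
```

For the pipeline that sets the clock period, the lower and upper bounds coincide, which reflects the statement in the text that the optimum is reached only when the skew and added delay have no variation.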
Figure 30a is an example of a system of monoharmonic wave pipelines with no feedback. Without optimization, the minimum clock period ignoring clock overhead is 3.25ns, which is greater than the optimal clock period of 3ns. Figure 30b shows the same system which has been optimized through intentional delay insertion. This solution achieves a cycle time of 3ns at the cost of 3ns of additional latency. Figure 30c shows the same system which has been optimized through constructive clock skew. This solution achieves a cycle time of 3ns. The cost of this solution is a reduction of 2ns in the allowable delay of any circuit which uses the output of the final synchronizer.
[Figure 30: Monoharmonic Wave Pipeline without Feedback Optimization. Panel (a): unoptimized. Annotations: cs=1, delay=2.]
For monoharmonic wave pipelines without feedback, the constructive clock skew solutions found by the linear program methods of Gray et al. [21] and Joy and Ciesielski [27] will satisfy the clock skew limits (81) and (80). The intentional path delay found by the linear program method of Ekroot [12] will satisfy the limits (77) and (78). While the constructive clock skew method does not increase wave pipeline latency, it does increase the complexity of clock distribution and wave pipeline interface. The intentional path delay method increases wave pipeline latency but does not add complexity to clock generation and distribution or interface.
To optimize the performance of a polyharmonic wave pipeline which does not have feedback, the following procedure is used. The maximum propagation delay from the input synchronizer to output synchronizer, denoted by P^max_max, is determined. If P^max_max is in the set of paths which can support N^i waves, in the constraint analysis, a delay term, X_ij T^opt_clk, is determined for all paths which support fewer than N^i waves. This term represents the maximum integer number of clock periods by which the maximum propagation delay through the wave pipeline exceeds the path under consideration. The difference in the number of harmonics between paths i and j, X_ij, is an integer bounded by:
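\[
\frac{P^i_{max} - P^j_{max} + H^i_{max} - H^j_{max} - \Delta P^j_{max}}{T^{opt}_{clk}} - 1 \le X_{ij} \tag{83}
\]
and,
\[
X_{ij} \le \frac{P^i_{max} - P^j_{max} + H^i_{max} - H^j_{max} - \Delta P^j_{max}}{T^{opt}_{clk}} \tag{84}
\]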
Once the difference in the number of waves supported by the path under consideration and the maximum delay path is determined, the intentional delay added to each path, ΔP^j_max, is found. For each path j, synchronization requires that the clock period satisfy:
\[
\frac{P^j_{max} + \Delta P^j_{max} + H^j_{max} - cs}{N^i - X_{ij}} \le T^{opt}_{clk} \le \frac{P^j_{min} + \Delta P^j_{min} - H^j_{min} - cs}{N^i - 1 - X_{ij}} \tag{85}
\]
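Thus, for each path the bounds on the intentional delay introduced which ensure synchronization are:
\[
\Delta P^j_{min} \ge (N^i - X_{ij} - 1)\, T^{opt}_{clk} - \left( P^j_{min} - H^j_{min} \right) \tag{86}
\]
\[
\Delta P^j_{max} \le (N^i - X_{ij})\, T^{opt}_{clk} - \left( P^j_{max} + H^j_{max} \right) \tag{87}
\]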
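The polyharmonic procedure above can be sketched numerically as follows. The function names and the path delays are hypothetical, and the harmonic difference is computed before any intentional delay has been added (that is, with the added delay term in (83) and (84) taken as zero), which is an assumption of this example.

```python
import math


def harmonic_difference(p_i, h_i, p_j, h_j, t_clk):
    """Largest integer X such that X * t_clk <= (p_i + h_i) - (p_j + h_j)."""
    return math.floor(((p_i + h_i) - (p_j + h_j)) / t_clk)


def intentional_delay_bounds(n_i, x_ij, t_clk, pmax_j, pmin_j,
                             hmax_j=0.0, hmin_j=0.0):
    """Bounds (86) and (87) on the delay added to the shorter path j."""
    lower = (n_i - x_ij - 1) * t_clk - (pmin_j - hmin_j)
    upper = (n_i - x_ij) * t_clk - (pmax_j + hmax_j)
    return lower, upper


# Hypothetical example: longest path 9 ns supporting 3 waves at a 3 ns clock,
# shorter path with delays between 3.5 ns and 4 ns, no clock overhead.
T_CLK = 3.0
x = harmonic_difference(9.0, 0.0, 4.0, 0.0, T_CLK)
bounds = intentional_delay_bounds(3, x, T_CLK, pmax_j=4.0, pmin_j=3.5)
print(x, bounds)
```

In this example the shorter path carries one fewer wave than the longest path, and up to 2 ns of delay may be added to it under worst case conditions without violating the clock period constraint.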
Intentional clock skew can be used to reduce the amount of intentional path delay in cases where all paths in a polyharmonic wave pipeline share a common input and output synchronizer and have nonzero intentional path delay. Figure 31 is a polyharmonic wave pipeline: two combinational logic blocks share input and output registers. The delays through these units are significantly different and it is advantageous to have a different number of concurrent waves in each block. Since they share clocks, the clock period constraints (73) must be satisfied for both pipelines. In Figure 31a, assuming no clock skew nor register overhead, the minimum clock period is 5ns. By lengthening the delay through the long pipe so as to be a multiple of the minimum clock period of the longer pipe, as shown in Figure 31b, the optimal clock period of 3ns can be achieved. Figure 31c demonstrates the use of both intentional delay insertion and constructive clock skew to optimize performance.
[Figure 31: Polyharmonic Wave Pipeline without Feedback Optimization. Panel (a): unoptimized. Annotations: cs=1, delay=2, delay=1.]
The balancing method presented in Section 3.1 allows for the insertion of intentional delay for synchronization. Delay buffers are inserted and device geometries are adjusted such that all path delays are increased by the necessary delay. Since the adjusted pipelines may have a delay variation which violates the minimum clock period constraint, the delay variation of each pipe must be reverified. A polyharmonic wave pipeline exists in the vector unit: the vector register output is supplied to either the adder or multiplier, operated upon, and returned to the vector register file. The register read and write for both types of operations are performed with reference to a common clock. The delay through the multiplier is approximately double that through the adder. This optimization technique was employed to tune the delay of the demonstration vector unit's adder output buffers to ensure proper wave pipeline synchronization.
88
For wave pipelines with feedback, constructive clock skew with intentional delay insertion can be used to ensure that the output register timing constraints are met. The constructive clock skew, cs_i, can be determined:
\[
cs_i \le P^i_{min} - H^i_{min} - (N_i - 1)\, T^{opt}_{clk} \tag{89}
\]
\[
cs_i \ge P^i_{max} + H^i_{max} - N_i\, T^{opt}_{clk} \tag{90}
\]
91
To ensure the synchronization of the feedback signals, intentional path length delay may be required. The amount of intentional delay is found by the following procedure. Construct a directed graph G with |V| vertices and |E| edges, where each vertex v_i corresponds to synchronizer i, and each edge e_ij corresponds to the existence of a combinational logic connection without an intervening synchronizer from synchronizer i to synchronizer j. A weight w_ij = cs_i is attached to each arc. The amount of delay necessary at each feedback point of wave pipeline i is found by determining the amount of additional weight added to the feedback arc of the graph to make the sum of the weights around the closed path which includes that feedback path equal to zero, and computing the residue of that quantity with respect to the clock period.
\[
\Delta P^i_{max} = -1 \cdot \sum_{loop} w \tag{92}
\]
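The feedback procedure above reduces to summing the arc weights around a closed path and taking a residue. The sketch below assumes the loop is already known and given as a list of vertices; the weight table, the vertex names, and both function names are hypothetical.

```python
def loop_weight(weights, loop):
    """Sum the arc weights around a closed path given as a vertex list."""
    total = 0
    for u, v in zip(loop, loop[1:] + loop[:1]):
        total += weights[(u, v)]
    return total


def feedback_delay(weights, loop, t_clk):
    """Delay to add on the feedback arc so the loop sum is 0 modulo t_clk."""
    needed = -loop_weight(weights, loop)
    # Python's % returns a non-negative residue for a positive modulus.
    return needed % t_clk


# Hypothetical skew graph with three synchronizers, weights in ns.
skew_graph = {("A", "B"): 1, ("B", "C"): 2, ("C", "A"): -1}
added = feedback_delay(skew_graph, ["A", "B", "C"], 3)
print(added)
```

With these weights the loop sums to 2 ns, so 1 ns of intentional delay on the feedback arc brings the total to one full 3 ns clock period.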
Figure 32 demonstrates optimization of an example monoharmonic wave pipeline with feedback. In Figure 32a, the unoptimized circuit, the minimum clock period is 7ns, significantly greater than the optimal value of 3ns. Figure 32b shows the skew graph constructed for this circuit. Figure 32c shows the optimized circuit which, through intentional delay insertion and constructive clock skew, is able to achieve the optimal clock period of 3ns. This method achieves the delay variation limited clock period at the expense of additional latency in the feedback paths.
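[Figure 32: optimization of a monoharmonic wave pipeline with feedback. Panels: (a) unoptimized, (b) skew graph, (c) optimized. Annotation: cs=1.]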
93
Next, the intentional path delay necessary to balance the maximum effective propagation delay of each path is found. If the maximum propagation delay from the input synchronizer to output synchronizer, denoted by P^max_max, is in the set of paths which can support N^i waves, a delay of X_ij T^opt_clk + ΔP^j_max is introduced in the constraint inequalities for each path with number of waves N^j, j ≠ i. As in the polyharmonic wave pipeline without feedback, the term X_ij T^opt_clk is the maximum integral number of clock periods by which the longest path through the wave pipeline exceeds the path under examination.
The difference in the number of harmonics between paths i and j, X_ij, is an integer bounded by:
\[
\frac{P^i_{max} - P^j_{max} + H^i_{max} - H^j_{max} - \Delta P^j_{max}}{T^{opt}_{clk}} - 1 \le X_{ij} \tag{94}
\]
and,
\[
X_{ij} \le \frac{P^i_{max} - P^j_{max} + H^i_{max} - H^j_{max} - \Delta P^j_{max}}{T^{opt}_{clk}} \tag{95}
\]
The intentional delay added to each path, ΔP^j, is such that the maximum effective delay of each path is equal.
\[
P^i_{max} + H^i_{max} = P^j_{max} + H^j_{max} + X_{ij}\, T^{opt}_{clk} + \Delta P^j \tag{96}
\]
At this point, all paths through the polyharmonic wave pipeline have the same effective maximum delay, and the procedure for optimizing monoharmonic wave pipelined systems can be used to determine the constructive clock skew and intentional path delay necessary to achieve the optimal clock period. Figure 33 demonstrates optimization of an example polyharmonic wave pipeline with feedback. In Figure 33a, the unoptimized circuit, the minimum clock period is 7ns, significantly greater than the optimal value of 3ns. Figure 33b shows the delay insertion step to equalize the polyharmonic wave pipeline. Figure 33c shows the optimized circuit which, through intentional delay insertion and constructive clock skew, is able to achieve the optimal clock period of 3ns. As in the previous case, this method may increase latency to improve the system clock rate. In synchronizing the wave pipelines, any variation in the intentional delays added to the wave pipelines and any variation in the constructive clock skew will impact the cycle time of the system. Thus, the achievable clock period is increased by these variations.
[Figure 33: Polyharmonic Wave Pipeline with Feedback Optimization. Panel (a): unoptimized. Annotations: cs=1, delay=2, Xij=2, delay=2, delay=1.]
as high as 1.13x if the output buffer delay is 25% of the path delay, the buffer interconnect capacitance equals the buffer load gate capacitance, and the mutual capacitance between wires is 50% of the interconnection capacitance. Unless these variations are controlled or their effects compensated, the maximum number of waves, or speedup, for a perfectly balanced wave pipeline is 1.7. This section describes methods by which the delay variations can be controlled or their effects compensated.
3.3.1 Sorting
One method of compensating for the variation in delay due to process variation is sorting. Unlike traditional synchronous circuit sorting, where a bin sorted IC will run at any clock frequency below the bin upper limit (subject to the limits of any dynamic logic in the design), wave pipeline sorting is range sorting. Due to the two-sided limit on the valid clock period of a wave pipeline, sorting for wave pipelines involves determination of the valid range of clock periods given the particular device fabrication process of each VLSI device. By range sorting, the effects of process variation between dice can be minimized. Cross die spatial variations in process parameters, however, must be accounted for in the determination of the clock period. Due to the relatively narrow range of valid clock periods for aggressive wave pipelines, range sorting is not a desirable method of accounting for delay variation due to manufacturing process. Table 5 demonstrates the difficulties in sorting for wave pipelines. For three wave pipelines operated up to their maximum clock period, the valid ranges of clock period for the fastest and slowest expected processes are given. The maximum delay through each pipeline is assumed to be much greater than the clock overheads Hmax and Hmin. Inclusion of the clock overheads narrows the valid ranges even more. The three wave pipelines have (a) no path length variation, (b) maximum path / minimum path = 1.1, and (c) maximum path / minimum path = 1.2. All pipelines are assumed to have environmental variation due to spatial variation in process parameters, temperature variation, voltage variation, and noise such that the process and environment variation factor is 1.3. The minimum propagation delay of the wave pipeline is 20ns. In addition, on the slowest die the propagation delays of each path are 1.3 times as slow as on the fastest die, i.e. proc = 1.3. This table shows the valid operating ranges of the fastest and slowest expected dice off the line.
Max path /   Waves   Valid clk interval as    Fastest die           Slowest die
min path             % of avg clk period      oper range            oper range
1.0          2       42.4                     50 to 76.9 MHz        38.5 to 59.2 MHz
             3       14.3                     100 to 115.4 MHz      76.9 to 88.8 MHz
             4       2.5                      150 to 153.8 MHz      115.4 to 118.3 MHz
1.1          2       33.2                     50 to 69.9 MHz        38.5 to 53.8 MHz
             3       4.8                      100 to 104.9 MHz      76.9 to 80.7 MHz
1.2          2       24.7                     50 to 64.1 MHz        38.5 to 49.3 MHz

Table 5: Sorting Example
For each wave pipeline, as the number of waves supported is increased, the valid operating range of the pipeline is diminished. In turn, the number of divisions into which the dice must be sorted increases. For instance, if the first wave pipeline supports two waves, the operating frequency range of 50 to 59.2MHz is valid for all dice and no sort is necessary. If this pipeline is operated with 3 waves, the operating ranges for the fastest and slowest dice do not overlap. For this example, sorting into at least three frequency ranges would be necessary to capture the operating ranges of all dice. If this pipeline is operated with 4 waves, sorting into at least seven frequency ranges would be necessary to capture the operating ranges of all dice. The minimum number of ranges into which wave pipelined dice must be sorted is approximately:
Bins
2
procPmin
N Pmin
97
The term proc is the process degradation factor for the slowest fabricated die. The term CLKinterv is the width of the operating clock range of each bin. This width cannot exceed the minimum valid clock frequency range of the wave pipeline. For example, if the above pipeline is operated with 4 waves, the minimum valid clock frequency range is 2.9 MHz; thus, each bin can have at most a 2.9MHz width. Assuming that each bin must have a width of 2MHz, the center frequencies of the ten bins are: 117.3, 121.1, 124.9, 128.7, 132.5, 136.3, 140.1, 143.9, 147.7, and 151.5MHz. The narrowing of the valid clock interval as wave pipeline performance is increased, and the resulting increase in the number of sorting ranges, make frequency sorting impractical for high performance wave pipeline ICs.
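The entries of Table 5 and the overlap argument above can be checked with a short calculation. The sketch assumes, as the text does, that clock overheads are ignored, so the valid clock period lies between the worst case maximum delay divided by the number of waves and the minimum delay divided by one less than the number of waves. The function and parameter names are hypothetical.

```python
def valid_period_ns(p_min_ns, path_ratio, env_factor, waves, die_slowdown=1.0):
    """Return (shortest, longest) valid clock period in ns; needs waves >= 2."""
    longest = die_slowdown * p_min_ns / (waves - 1)
    shortest = die_slowdown * p_min_ns * path_ratio * env_factor / waves
    return shortest, longest


def valid_range_mhz(p_min_ns, path_ratio, env_factor, waves, die_slowdown=1.0):
    """Return (lowest, highest) valid clock frequency in MHz."""
    shortest, longest = valid_period_ns(p_min_ns, path_ratio, env_factor,
                                        waves, die_slowdown)
    return 1000.0 / longest, 1000.0 / shortest


def interval_percent(p_min_ns, path_ratio, env_factor, waves):
    """Width of the valid period interval as a percentage of its midpoint."""
    shortest, longest = valid_period_ns(p_min_ns, path_ratio, env_factor, waves)
    return 100.0 * (longest - shortest) / ((longest + shortest) / 2.0)


def common_range(fast, slow):
    """Overlap of two (low, high) frequency ranges, or None if disjoint."""
    low, high = max(fast[0], slow[0]), min(fast[1], slow[1])
    return (low, high) if low <= high else None


# Pipeline (a): 20 ns minimum delay, no path variation, factor 1.3.
fast2 = valid_range_mhz(20.0, 1.0, 1.3, 2)
slow2 = valid_range_mhz(20.0, 1.0, 1.3, 2, die_slowdown=1.3)
fast3 = valid_range_mhz(20.0, 1.0, 1.3, 3)
slow3 = valid_range_mhz(20.0, 1.0, 1.3, 3, die_slowdown=1.3)
print(common_range(fast2, slow2), common_range(fast3, slow3))
```

With two waves the fastest and slowest dice share the 50 to 59.2 MHz range, so no sort is needed; with three waves the ranges are disjoint, which is where sorting becomes necessary.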
\[
cs_i = (proc - 1)\, P^i_{max} \tag{98}
\]
where proc is the ratio of the delay of the critical path on a particular die to the delay of the same path fabricated with the fastest expected process. This method, while appropriate for laboratory experiments, is not practical for systems with several wave pipelines, as the clock skew mechanism for each wave pipeline must be externally accessible and controllable. In addition, wave pipelines with feedback present additional problems due to the interrelation of clocks.
An alternative to the biased pseudo-NMOS logic is biased CMOS logic in which series transistors are added to the pull-up and pull-down transistor networks for each gate. The gates of the series transistors in the pull-up network and pull-down network are driven by separate bias voltages. The bias voltages are set so as to compensate for the static delay variations due to process. This method does not suffer from the ratioing, noise margin, and power consumption problems of the biased pseudo-NMOS method, but the additional transistors significantly increase the area of the logic gates and degrade the nominal speed of the gates. Biased pseudo-NMOS and biased CMOS NAND gates are shown in Figure 34.
[Figure 34: biased pseudo-NMOS and biased CMOS NAND gates. Labels: PMOS Bias, NMOS Bias, In A, In B, Out.]
[Figure 35: Compensation Using Current Starved Driver. Labels: In, Bias, Out.]
delay variation, the current starved buffers must have a wide delay tuning range. Figure 36 illustrates a tuning range of 200ps to 1600ps. If a current starved buffer has a tuning range of Pbuf to r Pbuf in the fastest expected process, it can compensate for process variations for all circuits with a maximum propagation delay with the fastest expected process of:
99
The buffer shown in Figure 36 with proc = 1.4 can compensate for process variation for wave pipelines with maximum propagation delay up to 3.3ns. For longer wave pipelines, current starved buffers can be cascaded. Use of current starved buffers for compensation requires the routing of a bias voltage signal and may increase critical path delay, particularly if multiple levels of buffers are required. The steep slope of the delay curve in Figure 36 indicates that this method is sensitive to noise on the bias voltage line.
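The 3.3ns figure above can be checked with a small calculation. The relation used here is an assumption, not a formula taken from the text: it requires that the fastest die, with its buffer tuned to the slow end of the range (r times Pbuf), can be made as slow as the slowest die with its buffer at the fast end (proc times the sum of the path delay and Pbuf). It was chosen because it reproduces both the 3.3ns figure quoted above and the roughly 4ns figure quoted for the voltage controlled load. The function name is hypothetical.

```python
def max_compensable_delay_ps(p_buf_ps, r, proc):
    """Largest fastest-process path delay one buffer stage can compensate.

    Assumed condition: P + r * Pbuf >= proc * (P + Pbuf), solved for P.
    """
    if proc <= 1.0:
        raise ValueError("proc must exceed 1 for there to be anything to compensate")
    return p_buf_ps * (r - proc) / (proc - 1.0)


# Current starved buffer: tuning range 200 ps to 1600 ps, so r = 8.
current_starved = max_compensable_delay_ps(200.0, 1600.0 / 200.0, 1.4)

# Voltage controlled load: tuning range of about 300 ps to 2000 ps.
controlled_load = max_compensable_delay_ps(300.0, 2000.0 / 300.0, 1.4)

print(round(current_starved), round(controlled_load))
```

Under this assumption a wider tuning ratio r or a smaller process factor extends the reach of a single buffer stage, which matches the suggestion in the text to cascade buffers for longer pipelines.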
[Figure 36: Delay Tuning Range of a Current Starved Driver. Horizontal axis ticks: 1.00 to 3.50.]
Figure 37 illustrates the use of a driver with a voltage controlled load. Figure 38 shows a tuning range of approximately 300 to 2000ps of a single stage of a driver with a voltage controlled load. Applying equation (99), a single level of buffers with a voltage controlled load can compensate for process for wave pipelines with a critical path of up to 4 ns. As in the case of the current starved buffer, multiple buffers with a voltage controlled load can be cascaded. The voltage controlled load method requires the routing of a bias voltage, requires additional die area for the load transistors, is susceptible to bias noise, and may increase critical path delay.
[Figure 37: driver with a voltage controlled load. Labels: In, Out, Bias.]
[Figure 38: delay tuning range of a driver with a voltage controlled load. Horizontal axis ticks: 1.00 to 3.50. Vertical axis ticks: 200 to 2000.]
[Figure 39: Thermal Controlled Delay Compensation. Blocks: Combinational Logic, Thermal Control.]
increase in propagation delays by a factor of 1.3. Thus, while this method can maintain a constant die temperature and thereby eliminate the variation in delay due to the temperature increases which would result from device switching, there is insufficient range of tuning to compensate for the variation in process parameters.
[Figure 40: Power Supply Voltage Delay Compensation]
To compensate for the propagation delay variation, the supply voltage is adjusted to a level which maintains the target delays. By applying equation (20), the power supply is set to:
V V0
100
Since the variation factor can be approximately two, the voltage supply may need to be set as low as half of the nominal supply level. This method can be used to compensate for dynamic changes in delay, by adaptively adjusting the voltage supply of the wave pipelined logic so as to maintain circuit delays at their design targets. This adaptive method is capable of compensating for delay affecting changes for which the variation occurs with time constants greater than the closed-loop bandwidth of the adaptive circuit. The adaptive circuit consists of a delay error detector and a supply voltage converter. Delay error detection is performed by phase comparing a signal with a voltage-controlled delayed version of the same signal. A fixed frequency source is applied to the input of an inverter chain whose delay has been set to half the source period when the chain is fabricated with the slowest anticipated fabrication process. The chain input and output are phase compared. In the open loop control, if the phase difference exceeds a threshold, an external indication is toggled to indicate the device is running too fast and the power supply is lowered. In a closed loop adaptive supply circuit, the phase error is used to charge or discharge a charge pump capacitor. In each adaptation cycle, a fixed amount of charge is added to or removed from the charge pump depending on whether the delay chain is longer or shorter than the design target. The supply voltage converter consists of two parts, the delay chain supply and the chip supply. A unity gain amplifier drives the charge pump voltage to the Vdd supply rail of the inverter chain. The supply rail thus is modified until the delay matches the design target. For small wave pipelined circuits, all circuitry can be driven by an on-chip converter. For larger circuits the output of the error detector circuit is used to drive an off-chip dc-to-dc converter. Appendix C shows simulations for a closed-loop adaptive supply for the demonstration chip. Figure 41 details the effectiveness of the power control method for compensation of process variation.
For each process run from Section 2.3.3, the simulated delays are shown with the supply at Vdd=5V and at the voltage determined by the constant delay circuit. Without compensation the maximum to minimum delay variation is 1.35x. With the compensation that ratio is 1.04x. Because of the area and power efficiency of this method and its range of delay variation compensation, this method was employed in the wave pipelined vector unit system developed in this research. It is further described in Section 4.1.7. Several difficulties exist in the use of an adaptive supply. Logic which is designed to switch at set voltages or which relies on voltage references, and logic in which transistor threshold drops are allowed, may not operate properly with a lowered power supply level. In addition, adaptive modification of power supply levels may increase the probability of CMOS latch-up and may result in static power consumption at interfaces with circuitry driven with nominal voltage power supplies.
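The closed loop adaptation described above, in which a fixed amount of charge is added or removed each cycle depending on the sign of the delay error, behaves like a bang-bang controller on the supply voltage. The sketch below is a behavioral illustration only: the delay model in `chain_delay`, its parameters, the step size, and the voltage limits are all hypothetical and are not taken from the demonstration chip or from Appendix C.

```python
def chain_delay(vdd, k=1.0, vt=0.8, alpha=2.0):
    """Hypothetical inverter chain delay model; delay falls as vdd rises above vt."""
    return k * vdd / (vdd - vt) ** alpha


def adapt_supply(target_delay, v_start=5.0, step=0.01, cycles=500,
                 v_min=1.5, v_max=5.5):
    """Apply one fixed voltage step per adaptation cycle, as a charge pump would."""
    v = v_start
    for _ in range(cycles):
        if chain_delay(v) > target_delay:
            v = min(v + step, v_max)   # chain too slow: raise the supply
        else:
            v = max(v - step, v_min)   # chain too fast: lower the supply
    return v


# A fast die whose delay target corresponds to a 4.0 V supply in this model.
target = chain_delay(4.0)
settled = adapt_supply(target)
print(round(settled, 2))
```

Because each cycle moves the supply by a fixed step, the loop settles into a small oscillation around the target rather than converging exactly; the step size trades settling time against the residual delay ripple.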
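[Figure 41: simulated delays for each process run at Vdd=5V and at the compensated supply voltage. Supply labels recovered from the figure: 4.8V, 4.8V, 4.5V, 4.0V, 4.8V, 3.7V, 5.0V. Vertical axis ticks: 4.0 to 7.0. Horizontal axis ticks: 0 to 4.]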
3.4 Summary
This chapter detailed design techniques for the optimization of the performance of wave pipelines. A method of minimizing the path imbalance in the design of CMOS wave pipelined circuits based upon transistor sizing was developed. This method is an extension of the Wong method for balancing CML logic networks [53]. The optimization uses a topological model of the circuit and macromodels of gate delay based upon HSPICE simulations to generate a linear program representation of the transistor sizing problem. For several representative circuits this method of delay balancing has optimized the circuits such that the path delay and data dependent delay variation are limited to the 0 to 20% range. Klass [29] has shown that for similar circuits, the best manual balancing methods exhibit path delay and data dependent delay variation of up to 15 to 20%. Further performance potential may be lost in wave pipelining due to the difficulty of operating all wave pipelines in a system at a common clock frequency. Circuit optimizations which use intentional delay insertion and/or constructive clock skew can be used to minimize the performance impact of synchronization of wave pipelines within a system. The variations in delay due to process and operating environment impose severe limits on the performance of CMOS wave pipelined circuits. Several means of minimizing the variations or the performance effects of these variations were presented in this chapter. Frequency sorting was shown to be significantly more constrained for wave pipelines than for conventional pipelined circuits. For high performance wave pipelines sorting was demonstrated to be impractical. Delay compensation methods were evaluated. The tunable clock skew compensation method is impractical for a VLSI system in which multiple wave pipelines are part of a synchronous system with feedback, because of the need to individually skew each clock and because of the interrelation of clocks in a system with feedback. The biased logic methods suffer from area penalties and/or power consumption and noise margin problems. Methods based upon tunable delay buffers or thermal compensation suffer from limited range of delay adjustment. The adaptive power method provides sufficient range of delay adjustment, does not increase logic area, and can lower power consumption while compensating for fabrication process and temperature dependent delay variations. The techniques for the optimization of the performance of wave pipelines allow systems of CMOS wave pipelines to be implemented in VLSI ICs. The following chapter describes the demonstration processor developed in this research.
[Figure: vector unit block diagram. Labeled blocks: Decode, Control, and Scoreboard; 16 word Load Unit; 16 word Store Unit; 16+16b Adder; 8bx8b Multiplier; Data Out Bus (16b).]
The outputs of the two sets of counters are reduced in the second level of the reduction logic by a set of sixteen counters.
Figure 44: Parallel Multiplier Organization
Figure 45 is a schematic of the 4,2 counter used in this design. This circuit counts the number of the inputs A, B, C, D, and Cin which are asserted. This count is output in a redundant binary form: the sum output gives the count in the same precedence column as the input bits, and the two carry outputs represent the portion of the count in the next higher precedence column. The shaded inverters in Figure 45 are delay elements inserted by the rough-tuning pass of the delay balancing tool.
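The counting property of the 4,2 counter can be illustrated with a small behavioral model. The sketch below assumes the common construction from two cascaded full adders; it is not derived from the Figure 45 schematic, and the names `full_adder`, `counter_4_2`, and `check_all` are hypothetical.

```python
from itertools import product


def full_adder(x, y, z):
    """Return (sum, carry) for three input bits."""
    total = x + y + z
    return total & 1, total >> 1


def counter_4_2(a, b, c, d, cin):
    """Behavioral 4,2 counter returning (sum, carry, cout).

    Assumed structure: two cascaded full adders. cout is formed from
    a, b, c only, so it does not depend on cin.
    """
    s1, cout = full_adder(a, b, c)
    s, carry = full_adder(s1, d, cin)
    return s, carry, cout


def check_all():
    """Check a+b+c+d+cin == sum + 2*(carry + cout) for all 32 inputs."""
    for bits in product((0, 1), repeat=5):
        s, carry, cout = counter_4_2(*bits)
        if sum(bits) != s + 2 * (carry + cout):
            return False
    return True
```

The sum bit stays in the input column while both carries carry weight two, which matches the redundant binary output described above.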
Delay balancing of the multiplier for wave pipelining resulted in a 10% increase in the area of the multiplier. The delay buffers within the counter circuits and the final carry out delay shown in Figure 44 accounted for an 8% increase in area. The fine balancing transistor sizing accounted for 2% of the additional area. The maximum propagation delay through the multiplier is 10.8ns and the path length variation is 1.37ns.
4.1.5 Scoreboard
The vector unit scoreboard consists of timers for each functional unit, timers for each vector register for vector reads, timers for each vector register for vector writes, and a scoreboard update state machine.
External operation signals were synchronized and executed at the chip core frequency.
[Figure: pipeline timing for the Multiply, Add, Store, and Load instructions. Labeled delays: RF read 3.7ns, WP Add 5.5ns, WP Multiply 10.8ns.]
4.3 Balancing
This section summarizes the balancing results of the wave pipelined logic used in the vector unit. Table 6 details the wave pipelined overhead in number of transistors and functional unit area for the adder and multiplier circuits. The execution times of the balancing procedure on a Sun SPARCstation 10 are also given.
                                    Multiplier    Adder
Maximum Delay                       10.84ns       5.52ns
Delay Variation                     1.37ns        0.58ns
Unbalanced Transistors              5526          1322
Balanced Transistors                6466          1730
Unbalanced Area                     1.95 sq mm    0.40 sq mm
Balanced Area                       2.17 sq mm    0.45 sq mm
Balancing Exec. Time (Sun SS10)     10.94 hr      0.62 hr
Table 6: Vector Unit Balancing Results
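The balancing overhead implied by Table 6 can be worked out directly. The short sketch below only restates the table values and computes percentage increases; the dictionary layout and the helper name `overhead_percent` are illustrative assumptions.

```python
# Values transcribed from Table 6.
TABLE6 = {
    "multiplier": {"unbal_tr": 5526, "bal_tr": 6466,
                   "unbal_mm2": 1.95, "bal_mm2": 2.17},
    "adder": {"unbal_tr": 1322, "bal_tr": 1730,
              "unbal_mm2": 0.40, "bal_mm2": 0.45},
}


def overhead_percent(before, after):
    """Percentage increase from `before` to `after`."""
    return 100.0 * (after - before) / before


def balancing_overhead(unit):
    """Return (transistor overhead %, area overhead %) for one unit."""
    row = TABLE6[unit]
    return (overhead_percent(row["unbal_tr"], row["bal_tr"]),
            overhead_percent(row["unbal_mm2"], row["bal_mm2"]))


if __name__ == "__main__":
    for unit in TABLE6:
        tr, area = balancing_overhead(unit)
        print(f"{unit}: transistors +{tr:.1f}%, area +{area:.1f}%")
```

With these inputs the transistor count grows faster than the area, which is consistent with the added devices being small delay elements.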
Eleven of thirteen passed write and read verify tests of 100 random vectors at 50MHz. Vector register reads and writes at speeds up to 200MHz were performed on a single IC. A trace of a 200MHz read verify operation is shown in Figure 51. Since this test was performed from the pins, higher frequency testing exceeded the switching speed of the output pad driver design for the load imposed by the test equipment. The functional multiplier test consisted of the application of multiplier input vectors to test inputs and product checking at product testing output pins. Ten thousand pseudorandom vectors were applied and results checked at a 10MHz rate. Eleven of the 13 ICs passed this
power method the frequencies were measured at 0 to +5% faster than the target frequency.
[Plot residue: Ring Freq axis with ticks from 100 to 124; series labeled Vdd=5.]
Once the lock voltage was determined, wave pipelined speed testing was performed by loading data and instructions into the instruction buffer and load unit fifos at low speed, performing the vector instructions at high speed, and finally emptying the store unit fifos at low speed and verifying the results. This procedure is shown in detail in Figure 54. Using this procedure the eleven functional ICs were tested. Three of the eleven correctly performed a 16000 vector add test and a 16000 vector multiply test at 303MHz. Eight of the eleven correctly performed the 16000 vector add test and the 16000 vector multiply test at 222MHz. At this speed, the multiply latency was decreased to four as fewer waves were held within the multiply wave pipeline.
[Plot residue: series labeled Vdd=Constant Delay Voltage and Design Target Frequency. Constant Delay Voltage values as extracted, for dies 1 to 13: 4.7, 4.7, 4.7, 4.8, 4.7, 4.8, 4.5, 4.7, 4.7, 4.6, 4.5, 4.7, 4.9.]
[Figure 54 residue: FIFO load of operands, IBuffer load (up to 8 VInstr), store of operands and results, and compare steps; High Speed DAS Test versus Slow Speed DAS Operation.]
compared. The traditional pipeline used two phase clocking and dynamic flow latches to achieve high performance. The adder and multiplier were redesigned using the latches. The multiplier and adder were implemented using the same cell library as the wave pipelined implementation, but delay padding elements and delay balancing transistor sizing were not used. The vector register file was operated as a single stage in the pipeline and thus was not partitioned into multiple pipeline stages. Table 7 details simulated results for the wave pipelined vector unit and for a complete layout using latch-based pipelined units and a single cycle register access.
                      Wave pipelined    Traditional pipelined
Die Area              43.19 sq mm       44.09 sq mm
Min Clock Period      2.8ns             3.8ns
Mult Unit Latency     10.84ns           11.7ns
Add Unit Latency      5.5ns             6.1ns
Table 7: Vector Unit Results Comparison
The traditional pipeline, because of its higher cycle time, had a three cycle multiply execution rather than the four cycle execution in the wave pipelined design. The clocks to the output latches of both the adder and multiplier were skewed in the traditional pipeline design. Without the use of this clock skew, the latencies would be 7.6ns and 15.2ns, respectively. Because of the ability of the wave pipelined design to have more than a single register read operation occurring concurrently, the wave pipeline had a 35% faster cycle time. Because of the lack of intermediate latch delay and partitioning overhead, the wave pipeline had operation latencies up to 10% less than for the traditional pipeline design. When compared to traditional pipelines designed with less aggressive clocking technologies, such as static latch or register-based pipelines, the wave pipeline performance would be even better.
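The Table 7 comparison can be checked with a few lines of arithmetic. The sketch below restates the table values; the assumption that an unskewed traditional pipeline rounds each latency up to a whole number of clock periods is an inference that happens to reproduce the 7.6ns and 15.2ns figures, not a statement from the source. All names are hypothetical.

```python
import math

# Simulated values transcribed from Table 7 (ns).
WAVE = {"clk": 2.8, "mult": 10.84, "add": 5.5}
TRAD = {"clk": 3.8, "mult": 11.7, "add": 6.1}


def percent_more(larger, smaller):
    """How much larger `larger` is than `smaller`, in percent."""
    return 100.0 * (larger - smaller) / smaller


def unskewed_latency(latency_ns, clk_ns):
    """Assumed model: latency rounded up to whole clock periods."""
    return math.ceil(latency_ns / clk_ns) * clk_ns


if __name__ == "__main__":
    print("cycle time:", round(percent_more(TRAD["clk"], WAVE["clk"]), 1), "%")
    print("mult latency:", round(percent_more(TRAD["mult"], WAVE["mult"]), 1), "%")
    print("add latency:", round(percent_more(TRAD["add"], WAVE["add"]), 1), "%")
    print("unskewed add:", unskewed_latency(TRAD["add"], TRAD["clk"]), "ns")
    print("unskewed mult:", unskewed_latency(TRAD["mult"], TRAD["clk"]), "ns")
```

Under this reading the cycle time difference is a little over 35%, and the multiplier and adder latencies differ by roughly 8% and 11%.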
4.7 Summary
This chapter has described the wave pipelined vector unit developed for the demonstration of the techniques and tools presented in this dissertation. This vector unit integrates and synchronizes multiple heterogeneous wave pipelines: wave pipelining was employed in the design of the vector register file, the add functional unit, and the multiply functional unit. This unit was fabricated in a 0.8 micron process and was tested at operating frequencies up to 303MHz. At 300MHz, the vector register file supported 1.1 concurrent waves of data, the adder supported 1.9 waves of data, and the multiplier supported 3.7 waves of data with no intervening latches. An equivalent conventional pipelined design using aggressive two phase clocking and dynamic latches was developed and contrasted with the wave pipeline design. The vector unit design which used conventional pipeline clocking was approximately 2% larger. The simulated latencies through the multiplier and adder functional units using conventional pipelining were 8% and 11% longer, respectively. The maximum simulated clock rate achieved by using wave pipelining was 35% faster than that which could be achieved using conventional two phase clocking and dynamic latches.
Wave Pipeline Circuit Model
To improve throughput, a logic network can be partitioned into pipeline stages, each of which operates upon data computed in the previous cycle by the previous pipeline stage. When a logic network is pipelined, synchronizing elements, either latches or registers, are inserted to partition the network into stages. These synchronizing elements increase the network area, power, and latency.
Wave pipelining is a design style which allows overlapped execution of multiple operations without using synchronizing elements. Rather, knowledge of the signal propagation delay through the network is used to ensure that operations do not interfere with either their predecessor or their successor data values.
The effects of stalling in a wave pipeline are now examined. Figure 55 shows the propagation of waves down a wave pipeline with no stall. In Figure 56, a stall condition occurs at time 2, while wave 3 is in stage 2. Until the stall is released, no new inputs are applied to the pipeline. However, the waves already in the pipe continue to propagate. Once the stall is released, the pipeline must be restored to the condition prior to the stall. In a traditional pipeline, the data at each stage is not allowed to propagate to the next stage during the stall. Thus, when the stall is released the data in the pipeline which succeed the stall inducing stage have not progressed. To accomplish this in a wave pipeline, the pipeline can be refilled so as to restore the placement of the waves in the wave pipeline. This restoration occurs in times k to k+3 in the figure. Wave pipeline restoration requires that enough input values are queued at the input of the pipeline to ensure that the pipeline can be restored. For an N stage wave pipeline, up to N-1 previous inputs must be reentered into the pipeline after a stall before additional computation can occur.
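The queuing requirement for restoration can be sketched as a bounded history of recent inputs. The class below is a minimal illustration, assuming a software model in which the most recent N-1 inputs are retained and replayed in order after a stall; the class and method names are hypothetical and no hardware structure from the source is implied.

```python
from collections import deque


class InputReplayChain:
    """Keeps the most recent n_stages - 1 inputs for replay after a stall."""

    def __init__(self, n_stages):
        if n_stages < 2:
            raise ValueError("need at least two stages")
        # Depth follows the N-1 bound stated in the text.
        self._history = deque(maxlen=n_stages - 1)

    def apply(self, value):
        """Record a value as it is applied to the pipeline input."""
        self._history.append(value)
        return value

    def replay(self, count):
        """Return the last `count` applied values, oldest first."""
        if count < 0 or count > len(self._history):
            raise ValueError("not enough queued inputs to restore")
        items = list(self._history)
        return items[len(items) - count:]
```

For a four stage pipeline that has accepted waves 1 through 6, `replay(3)` would return waves 4, 5, and 6, which are the waves that may have drained during the stall.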
Figure 55: Propagation of Waves in Wave Pipeline
Figure 57 is a block diagram of a wave pipeline with an input register chain at the head of the pipeline. This approach requires additional area for the input register chain, sequencing logic, and input multiplexing. The cycle time of the wave pipeline may be increased by the addition of a multiplexor delay and the capacitive load of the input register chain gates. The refilling of the wave pipeline requires a number of penalty cycles which can be up to the number of stages in the wave pipeline. For long wave pipelines, the refill cycles can become prohibitively expensive.
An alternative to holding the inputs to the wave pipeline is holding the output of the pipeline. This works only if the stall condition does not alter the values produced by the wave pipeline. Figure 58 shows this condition. The results of waves 1, 2, and 3 are queued up at the output register. After the stall has been released, these results are multiplexed to the wave pipeline output. Additional logic is required to count the stall cycles and control the selection of the multiplexors on the release of the stall. Figure 59 is a block diagram of a wave pipeline with a results queue at the tail of the pipeline. This approach requires additional area for the results queue, sequencing logic, and output multiplexing. The cycle time of the wave pipeline may be increased by the multiplexor delay and the capacitive load of the results queue gates. No penalty cycles to refill the wave pipeline are required. Hybrid approaches can be used when the wave pipeline results are influenced by the stall.
[Figure 56 residue: wave positions in Stages 1 to 4 between the Input Register and Output Register over time, with a stall; after the stall is released at time k, waves 1 to 4 are entered at times k to k+3.]
[Figure 57 residue: block diagram with Mux, Input Register, Wave Pipeline, and Output Register.]
[Figure 58 residue: wave positions in Stages 1 to 4 over time with a stall and a Queued Results column holding waves 1, 2, and 3; after the stall is released at time k, waves 4 to 7 are entered at times k to k+3.]
[Figure 59 residue: block diagram with Mux, Input Register, Wave Pipeline, Output Register, and Results Queue.]
Figure 59: Wave Pipeline with Results Queue
Results of waves which precede the first stalling wave can be queued up at the output of the wave pipeline, while the stalling wave and its successors must be restarted at the head of the wave pipeline.
Dynamic Stalling of Wave Pipelines
Note that the wave pipeline refilling and results queuing presented in the previous section would not be necessary if the wave pipeline could be "frozen" in time until the stall condition was resolved. Traditional pipelines "freeze" the pipeline by stalling. When a pipeline is stalled, data is not allowed to proceed to the next pipeline stage until the stall is released. Latches or registers act as barriers to the propagation of data through the pipeline. In this way, a stall condition can be resolved while the remainder of the system waits. This eliminates the need to queue results and refill pipelines.
In wave pipelining, it has not been possible to stall the pipeline since there are no latches or registers within the wave pipeline to act as barriers to signal propagation. In this section a method which allows some degree of stalling within a wave pipeline is presented. To allow stalling of a wave pipeline, barriers to signal propagation are introduced at strategic positions within the wave pipeline. Transistors are used during the stall period to disconnect selected gate output nodes from the supply rails, thereby prohibiting changes in node states during the stall period. These transistors make the outputs of the selected gates dynamically latched. Figure 60 is a diagram of a NAND gate which can be frozen during a stall. When Stall is inactive, the transistors driven by the Stall and !Stall signals act as closed switches. When Stall is active, these transistors act as open switches. Thus the output, Out, is unable to transition. In a wave pipeline which supports N waves, such transistors are placed at N "freeze points" within the pipeline. These freeze points are positioned such that the maximum propagation delay between the freeze points is less than the wave pipeline clock period. At the freeze point, the tristate transistors are disabled by stall signals which are active throughout the stall period. Figure 61 is a block diagram of a wave pipeline with freeze points to allow stalling.
Figure 60: Freeze Points
Figure 61: Wave Pipeline with Freeze Points
If activation of the stall signal is coincident with the enabling edge of the pipeline clock, the first freeze point is positioned less than one clock period from the input register or latch. This ensures that when a node is frozen, it is at its terminal voltage, not at an intermediate voltage. The period for which the stall signals are held, $T_{stall}$, is:
$$T_{stall} = N_{stall} T_{clk} + T_{freeze\ point} \bmod T_{clk} \qquad (101)$$
where $N_{stall}$ is the number of required stall cycles, $T_{clk}$ is the wave pipeline clock period, and $T_{freeze\ point}$ is the propagation delay from the input of the wave pipeline to the freeze point.
Figure 62 shows HSPICE simulated waveforms for a wave pipeline with freeze points. The wave pipeline consists of a chain of 50 CMOS inverters. This wave pipeline has a maximum simulated propagation delay of 6.7ns. The circuit is clocked such that five simultaneous waves propagate through the inverter chain. To allow stalling, the 5th, 15th, 25th, 35th, and 45th inverters are freeze points. The top trace shows a one-period pulse propagating down the pipeline at the 10th, 20th, 30th, 40th, and 50th inverters with no stalling. In the bottom trace the pipeline is stalled for two cycles. The waveforms at the same inverters are shown. Note that the stall has the effect of delaying by two cycles the edges which occur after the stall is initiated. The relative spacing of the edges is the same for the poststall edges as the prestall edges. It must be noted that this technique uses dynamic holding of node voltages. The period for which the stall can be held is therefore limited. The freeze points, while logically latching gates, are not pipeline latches; they remain transparent except during a stall. The following section compares a wave pipeline with freeze points to a traditional pipeline of the same pipeline depth.
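The stall period in equation 101 can be evaluated with a one-line helper. The sketch below assumes that the modulo applies only to the freeze point delay (the other grouping would discard the stall cycle count), and the example numbers assume a 1.34ns clock period (6.7ns divided by five waves) and uniform inverter delays, neither of which is stated in the source. The function name is hypothetical.

```python
def stall_hold_time(n_stall, t_clk, t_freeze_point):
    """Stall hold period per equation 101, all times in the same unit.

    Assumption: the modulo binds to t_freeze_point only, i.e.
    n_stall * t_clk + (t_freeze_point mod t_clk).
    """
    if t_clk <= 0:
        raise ValueError("clock period must be positive")
    return n_stall * t_clk + (t_freeze_point % t_clk)


if __name__ == "__main__":
    t_clk = 6.7 / 5            # assumed clock period for five waves
    per_inverter = 6.7 / 50    # assumed uniform inverter delay
    for position in (5, 15, 25, 35, 45):
        t_fp = position * per_inverter
        print(position, round(stall_hold_time(2, t_clk, t_fp), 3))
```

Under these assumptions every freeze point in the inverter chain example sits half a clock period after a wave boundary, so each stall signal would be held for about two and a half periods for a two cycle stall.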
implementations of a logic block for wave pipelining, wave pipelining with stalling, and traditional pipelines with dynamic and static latching are compared. A single phase clocking strategy is assumed for the traditional pipelines. When dynamic latches are placed at the freeze points, the delay of each gate at a freeze point is increased due to the additional transistor between the gate output and the supply rail. The additional delay of a gate at the freeze point is $T_{dlatch}$. Thus, when N dynamic latches are placed within the combinational logic they increase the maximum propagation delay through the wave pipeline by $N T_{dlatch}$. Thus the lower bound on the clock period becomes:
$$N T_{clk} + \Delta_{cs} \ge P_{max} + H_{max} + N T_{dlatch} \qquad (102)$$
or
$$T_{clk} + \Delta_{cs}/N \ge P_{max}/N + H_{max}/N + T_{dlatch} \qquad (103)$$
where $T_{dlatch}$ is the increase in propagation delay of a freezing gate due to the dynamic latching transistors at the freeze points and $H_{max}$ is the long path clock overhead (equations 104 and 105).
Note that the difference between this constraint and constraint 1 is simply the additional delay through the freeze point dynamic latches. The upper bound on the wave pipeline clock period must also be met:
$$(N - 1) T_{clk} + \Delta_{cs} \le P_{min} - H_{min} + N T_{dlatch} \qquad (106)$$
where $H_{min}$ is the race through clock overhead (equation 107).
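The two bounds on the stalling wave pipeline clock period can be turned into a small feasibility check. The sketch below assumes that constraint 102 is a lower bound on N Tclk plus the skew term and that constraint 106 is an upper bound on (N-1) Tclk plus the skew term, as the surrounding prose states; the parameter name `skew` for the clock skew term and the function names are hypothetical.

```python
def min_clock_period(p_max, h_max, n, t_dlatch, skew=0.0):
    """Smallest clock period allowed by the long path constraint (102/103)."""
    return (p_max + h_max - skew) / n + t_dlatch


def max_clock_period(p_min, h_min, n, t_dlatch, skew=0.0):
    """Largest clock period allowed by the race through constraint (106)."""
    if n < 2:
        raise ValueError("the upper bound needs at least two waves")
    return (p_min - h_min + n * t_dlatch - skew) / (n - 1)


def feasible(p_max, p_min, h_max, h_min, n, t_dlatch, skew=0.0):
    """True if some clock period satisfies both bounds."""
    low = min_clock_period(p_max, h_max, n, t_dlatch, skew)
    high = max_clock_period(p_min, h_min, n, t_dlatch, skew)
    return low <= high
```

As a usage example in units of the clock overhead, a maximum delay of 13, an overhead of 1, eight waves, and a freeze point delay of 0.2 with zero skew give a lower bound of 1.95 on the clock period.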
If the same block of logic is pipelined with traditional techniques, $P_{max}$ is divided into N stages separated by synchronizing elements. The clock rate limit for the traditional pipeline is given by constraint (108).
With the traditional pipeline, it is assumed that the combinational logic can be equally partitioned into N stages. For cases where this approximation is not valid, the term $P_{max}/N$ should be replaced with the delay of the longest stage. Note that the stalling wave pipeline amortizes the clocking overhead over the number of waves in the wave pipeline. While the clock rate of the wave pipeline is decreased due to the freeze latches, it still outperforms the traditional pipeline of equal degree of pipelining. Figures 63, 64, and 65 show the factor by which the maximum clock rate of a stalling wave pipeline exceeds the maximum clock rate of a traditional pipeline over a range of number of pipeline stages. In these figures, the freeze point propagation delay, $T_{dlatch}$, is 10%, 20%, and 40% of the clock overhead, $H_{max}$, respectively. For instance, if the clock overhead, $H_{max}$, is 1ns, results are presented for the additional delay due to each freeze point latching gate being 100ps, 200ps, and 400ps. In each figure, plots are given for four different maximum propagation delays. The plots vary from short to long pipelines. For instance, if the clock overhead, $H_{max}$, is 1ns, results are presented for pipelines whose maximum propagation delays are 2ns, 4ns, 8ns, and 16ns. In all instances, the stalling wave pipelines are able to achieve greater performance than traditional pipelines. Clock rates up to seventy percent higher can be achieved with stalling wave pipelines. As the freeze point delays are increased, the benefits of the stalling wave pipeline diminish.
HSPICE simulations of a pipeline illustrate the performance effects of the freeze latches. The pipeline consists of fifty identical CMOS inverters. Each inverter was sized to have equal delay for rising and falling output. Since they are balanced, the effects of delay imbalance are ignored. Thus the impact of the clocking and stalling mechanisms is determined.
[Figures 63, 64, and 65 residue: plots versus number of pipeline stages (1 to 8), with panels labeled Tdlatch=0.1H, Tdlatch=0.2H, and Tdlatch=0.4H.]
Design                    Number of Stages
Wave pipeline             8
WP with stall             8
TP with dynamic latch     8
TP with static latch      8
Table 8: Performance of Pipelines
For the simulated pipelines, the clock to the output latch was skewed to ensure that an integral number of waves were held. In the traditional pipeline, this allowed the number of inverters in the final stage to be 8 as opposed to 6 in the previous stages. For this example $P_{max}$ is approximately $13 H_{max}$ and $T_{dlatch}$ is approximately $0.2 H_{max}$. Additional performance may be realized for the stalling wave pipeline due to the decreased load on the clock driver. The stalling wave pipeline has an increased area over the wave pipeline due to the freeze latches and additional signal routing. This area penalty is approximately the same as that of a traditional pipeline with dynamic latches as synchronizing elements. It is smaller than the penalty for a traditional pipeline with static latches or registers for synchronizing elements. It has been shown that wave pipelines can be designed to ease the difficulty in stall handling.
Methods have been detailed which allow wave pipelines to be restarted with input register chains or to have results queued and sequenced out following a stall. A method has been demonstrated by which a wave pipeline can be designed to dynamically stall. Despite the additional logic required to support stalling, a wave pipeline can outperform a traditional pipeline of equal depth.
Figure 66: Wave Pipeline with Latchless Feedback
Waves interfere at the output of a wave pipeline if the (i+1)st inputs, applied at their earliest possible time, can propagate through the combinational logic so as to arrive prior to the latest possible arrival time of the data generated by the ith application of inputs. Thus the general wave interference constraint for a wave pipeline is:
$$T_i^{late} + P_{max} \le T_{i+1}^{early} + P_{min} \qquad (109)$$
where $T_i^{late}$ is the latest time at which input data wave i is applied to the combinational logic and $T_{i+1}^{early}$ is the earliest time at which input data wave i+1 is applied. Notice that clock distribution and synchronizer overhead are being ignored for simplicity. For a register-based wave pipeline with latchless feedback, the time at which the primary inputs for data wave i are applied to the combinational network is $i T_{clk}$. The feedback inputs for data wave i arrive at the inputs to the logic no earlier than:
$$j T_{clk} + k P_{min} + k P^{f}_{min}$$
and no later than:
$$j T_{clk} + k P_{max} + k P^{f}_{max}$$
where $j T_{clk}$ is the time in the past ($j \le i$) at which a change in the primary inputs resulted in the current change in the feedback. The number of cycles through the pipeline and feedback from the change in the primary inputs to the current feedback input change is k. Thus, at time $j T_{clk}$ an input vector is applied to the logic network and after k iterations around the closed pipeline feedback path a change in the feedback inputs occurs. If the closed path has a maximum delay which is an integral number of clock periods, $P_{max} + P^{f}_{max} = N T_{clk}$, the interference constraint becomes:
$$j T_{clk} + k N T_{clk} + P_{max} \le (j+1) T_{clk} + k N T_{clk} - k (P_{max} + P^{f}_{max} - P_{min} - P^{f}_{min}) + P_{min} \qquad (110)$$
or,
$$T_{clk} \ge (k+1)(P_{max} - P_{min}) + k (P^{f}_{max} - P^{f}_{min}) \qquad (111)$$
The clock constraint 111 implies that for latchless feedback, the minimum period of wave separation, $T^{opt}_{clk}$, is a linear function of the number of iterations, k, through the feedback loop which resulted in the current feedback inputs. Thus, in the general case, the minimum wave pipelined clock period for a wave pipeline with latchless feedback is unbounded, and thus, wave pipelines with latchless feedback will not function. There are, however, special cases for which latchless feedback will operate correctly. The first case is if there is no variation in delay through all closed feedback paths.
The second case is if the delay variation of the k iterations through the feedback loop is independent of k. An example would be a feedback loop whose delay alternates between P0 and P1. The final case where a latchless feedback wave pipeline can operate is when the feedback inputs are periodically synchronized to the primary inputs or the primary input clock. This limits k in constraint 111 to some kmax. This synchronization may be through explicit qualifying, or logical ANDing, of the feedback signals with a clock signal or a primary input, or implicitly as a result of the function being performed by the logic.
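The growth of the minimum wave separation with the number of feedback iterations can be seen by evaluating constraint 111 directly. The sketch below assumes the constraint reads as a lower bound on the clock period of (k+1) times the forward path variation plus k times the feedback path variation; the function and parameter names are hypothetical.

```python
def min_wave_separation(k, p_max, p_min, pf_max, pf_min):
    """Lower bound on the clock period from constraint 111.

    k       : iterations around the closed feedback path
    p_max/p_min   : forward path delay bounds
    pf_max/pf_min : feedback path delay bounds
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    return (k + 1) * (p_max - p_min) + k * (pf_max - pf_min)


if __name__ == "__main__":
    # Illustrative delays only: the bound grows linearly with k.
    for k in range(4):
        print(k, min_wave_separation(k, 10.0, 9.0, 2.0, 1.5))
    # With no delay variation the bound stays at zero for any k.
    print(min_wave_separation(100, 10.0, 10.0, 2.0, 2.0))
```

Capping k at some kmax through periodic synchronization, as in the final special case above, turns the unbounded growth into a fixed bound evaluated at kmax.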
through use of methods of data encoding which can be used to distinguish when valid data has arrived at the output or through the use of a separate timing reference path through the logic which tracks the data propagation through the logic in a dynamic manner. Self-timed circuits, being asynchronous, use handshake protocols based upon completion and ready indications to ensure that operations do not interfere. Several techniques which are used in self timing can be employed in wave pipelined systems to provide additional performance. As shown in Chapter 3, the necessity of wave pipeline output synchronization can lead to cycle times which are greater than those imposed by the propagation variation constraints. In Chapter 3 intentional delay padding and intentional clock skew were shown to achieve the variation imposed optimal clock rate. Two alternative output synchronization methods based on self-timed techniques are possible: output encoding, and critical path delayed clock skew. If data encoding techniques in which valid data and invalid data could be distinguished were employed in the wave pipeline, the output synchronizer could be clocked based upon the change in the output from invalid data to valid data. Unlike self-timed systems, this change from invalid to valid would not be used to initiate a subsequent data transfer at the wave pipeline input. This is because for a wave pipeline to have more than a single wave in the pipeline, new data must be input to the pipeline prior to the arrival at the output. The input synchronizer would be clocked by the system clock, whose period is constrained by the propagation delay variation through the pipeline. In the critical path delayed clock skew method, the propagation delay of the critical path is replicated and placed in a path from the clock of the input synchronizer to the output synchronizer. 
In this manner, the interval between the input clock pulse which drives input data into the wave pipeline and the output clock which latches the results from that data tracks variations in the dynamic operation of the wave pipeline. Thus, the clock period of the wave pipeline can be guaranteed to be limited by just the propagation variation. These techniques, while easing the synchronization of the wave pipeline output, result in significant logic overhead for CMOS wave pipelines.
The variation due to the parallel paths can be minimized by minimizing the capacitance driven by a gate with parallel supply paths. Figure 67 shows how a CMOS decoder driving a large capacitance can be designed to minimize delay variation. In this figure, a wordline with a capacitance 256-times the input capacitance of a minimum-sized inverter is driven by a wordline driver which is 32-times minimum-size. In the first case, the wordline driver is driven by a NOR2-NAND3 decoder. The NOR2 stage is minimum sized and the NAND3 stage is 4-times minimum size. During disable, the NOR2 can have an equivalent channel impedance of R or R/2 depending upon whether one or both paths are activated. The NAND3 can have an equivalent impedance of R, R/2, or R/3, scaled by its 4-times size to R/4, R/8, or R/12. Using an RC delay model the maximum delay would be:
$$R \cdot 4C + (R/4) \cdot 32C + (R/32) \cdot 256C = 20RC \qquad (112)$$
The minimum delay would be:
$$(R/2) \cdot 4C + (R/12) \cdot 32C + (R/32) \cdot 256C = 12.67RC \qquad (113)$$
Thus, the maximum delay is 1.58-times greater than the minimum delay. By minimizing the capacitance of the nets which have parallel paths to the supply, this variation can be reduced. In the second case in Figure 67, the NOR2 and NAND3 gates are minimum size. The decoder output is driven through a chain of scaled inverters. The maximum delay is:
$$R \cdot C + R \cdot C + R \cdot 4C + (R/4) \cdot 32C + (R/32) \cdot 256C = 22RC \qquad (114)$$
The minimum delay would be:
$$(R/2) \cdot C + (R/3) \cdot C + R \cdot 4C + (R/4) \cdot 32C + (R/32) \cdot 256C = 20.83RC \qquad (115)$$
Hence, the maximum delay is 1.056-times the minimum delay.
As illustrated in this example, to minimize delay variation due to data dependencies of CMOS gates with parallel conduction paths, all nodes driven by gates with parallel paths to the supply rails should have minimum capacitive load. Large capacitive loads should be driven by inverters with balanced pullup and pulldown times. The intertransistor parasitic capacitance effect becomes important when the magnitude of small capacitive loads driven by gates with parallel paths approaches the intertransistor parasitic capacitances. This tends to limit the minimum delay variation which can be achieved by the above technique.
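The decoder example can be reproduced with exact arithmetic. The sketch below uses the RC delay model from the text, with resistances in units of R and loads in units of C; the per-stage values for the first case minimum delay (R/12 for the scaled NAND3) are taken from the Figure 67 labels, and the list and function names are hypothetical.

```python
from fractions import Fraction as F


def chain_delay(stages):
    """Sum of resistance * load over a chain, in units of RC."""
    return sum(F(r) * F(c) for r, c in stages)


# First case: NOR2 (M=1) -> NAND3 (M=4) -> driver (M=32) -> wordline.
CASE1_MAX = [(F(1), 4), (F(1, 4), 32), (F(1, 32), 256)]
CASE1_MIN = [(F(1, 2), 4), (F(1, 12), 32), (F(1, 32), 256)]

# Second case: minimum size NOR2 and NAND3, then scaled inverters.
CASE2_MAX = [(F(1), 1), (F(1), 1), (F(1), 4), (F(1, 4), 32), (F(1, 32), 256)]
CASE2_MIN = [(F(1, 2), 1), (F(1, 3), 1), (F(1), 4), (F(1, 4), 32), (F(1, 32), 256)]


def variation_ratio(max_stages, min_stages):
    """Ratio of maximum to minimum chain delay."""
    return chain_delay(max_stages) / chain_delay(min_stages)


if __name__ == "__main__":
    print(float(chain_delay(CASE1_MAX)), float(chain_delay(CASE1_MIN)))
    print(float(variation_ratio(CASE1_MAX, CASE1_MIN)))
    print(float(chain_delay(CASE2_MAX)), float(chain_delay(CASE2_MIN)))
    print(float(variation_ratio(CASE2_MAX, CASE2_MIN)))
```

Moving the parallel path gates onto small loads costs two extra stages of delay but cuts the data dependent variation from about 58% to under 6%.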
5.3 Summary
This chapter has exposed architectural and circuit enhancements for wave pipeline design. Methods of providing external support for stalls through operand or output stall fifos were examined. In addition, a technique for providing stalling capabilities within wave pipelines has been developed. Stalling wave pipelines make use of dynamically latching gates within the combinational logic to impede the flow of data through the logic network during a stall condition. By not incurring the clock overhead incurred by conventional pipelines, the performance of a stalling wave pipeline in which the delay variation is sufficiently small exceeds the performance of a conventional pipeline of equal depth. Wave pipelines with fully latchless feedback have been shown to be impractical for most circuits. Design techniques to ease the wave pipeline constraints on the output register clocks have been explored: the use of data encoding which distinguishes when valid data has arrived at the output, and the use of a separate timing reference path through the logic which tracks the data propagation through the logic in a dynamic manner, are suggested. Optimizations to CMOS wave pipeline circuit designs to minimize the delay variation due to parallel conduction paths have been presented in this chapter. By minimizing the capacitance driven by gates with parallel conduction paths and driving the large capacitances with balanced drivers, this delay variation can be minimized.
[Figure 67 residue: NOR, NAND, INV, and Wordline chains. First chain: M=1 (R, R/2); M=4, 4C (R/4, R/8, R/12); M=32, 32C (R/32); 256C. Second chain: M=1 (R, R/2); M=1, C (R, R/2, R/3); M=1, C (R); M=4, 4C (R/4); M=32, 32C (R/32); 256C.]
variation, independent of other factors, limits the speed-up of wave pipelining to at most three to four. Changes in operating temperature may degrade delays by a factor of up to 1.4, thereby limiting the speed-up to 3.5. Supply drops and drift, supply noise, and coupled noise can result in further propagation delay variation. For representative CMOS circuits the aggregation of the variation causes result in worst-case delays along critical circuit paths which are 2.7-times slower than best-case delay along the fast paths. Variations of this magnitude limit the clock frequency of wave pipelines to 1.6-times the rate which can be achieved without pipelining. The seminal clocking constraints for wave pipelining which were extended in this research to include variations due to process and environment, a model for CMOS propagation delay, the causes of delay variation in CMOS circuits, the e ects of these variations on the performance of CMOS wave pipelines, and a comparison of the impact of these factors on conventional pipelines and wave pipelines were presented in Chapter 2. Chapter 3 presented design techniques for the optimization of the performance of wave pipelines. A method of minimizing the path imbalance in the design of CMOS wave pipelined circuits based upon transistor sizing was developed. This method is an extension of the Wong method for balancing CML logic networks 53 . The optimization uses a topological model of the circuit and macromodels of gate delay based upon HSPICE simulations to generate a linear program representation of the transistor sizing problem. The solution to the linear program is used to set the lengths and widths of the CMOS transistors so as to minimize the path length variations in the logic network. This optimization method has been integrated into a wave pipelining circuit design environment. 
For several representative circuits this method of delay balancing has optimized the circuits such that the path delay and data dependent delay variation are limited to the 0 to 20% range. These results are consistent with the best manual balancing methods, which show variation of up to 15 to 20% [29]. Balance in this range is sufficient for high performance wave pipelines. Further performance potential may be lost in wave pipelining due to the difficulty of operating all wave pipelines in a system at a common clock frequency. Circuit optimizations which use intentional delay insertion and/or constructive clock skew can be used to minimize the performance impact of synchronization of wave pipelines within a system. The variations in delay due to process and operating environment impose severe limits on the performance of CMOS wave pipelined circuits. Several means of minimizing the variations or the performance effects of these variations are evaluated in Chapter 3. Frequency sorting was shown to be significantly more constrained for wave pipelines than for conventional pipelined circuits. For high performance wave pipelines sorting was demonstrated to be impractical. Methods to compensate for process and environmental delay variation so as to avoid the difficult frequency sorting procedure as well as to gain additional performance were examined. Tunable constructive clock skew, use of biased logic, use of tunable delay buffers, delay compensation through thermal control, and an adaptive power supply voltage
method of minimizing delay variation were presented in Chapter 3. The tunable clock skew method is impractical for a VLSI system in which multiple wave pipelines are part of a synchronous system with feedback, because of the need to individually skew each clock and because of the interrelation of clocks in a system with feedback. The biased logic methods suffer from severe area penalties and/or power consumption and noise margin problems. Methods based upon tunable delay buffers or thermal compensation suffer from limited range of delay adjustment. The adaptive power method provides sufficient range of delay adjustment, does not increase logic area, and can lower power consumption while compensating for fabrication process and temperature dependent delay variations. This method was used in the design of the wave pipelined CMOS VLSI vector unit presented in Chapter 4. To demonstrate the techniques and tools developed as part of this research, a CMOS wave pipelined VLSI vector unit was designed. This integrated circuit demonstrates that wave pipelined CMOS VLSI systems can be designed to perform within the performance limits described in Chapter 2. This system integrates and synchronizes multiple heterogeneous wave pipelines: wave pipelining was employed in the design of the vector register file, the add functional unit, and the multiply functional unit. It was designed with the assistance of automated CAD balancing tools which employ the transistor sizing method presented in Chapter 3. It contains adaptive power supply support for the maintenance of wave pipelined operation over a range of operating conditions and fabrication tolerances. This unit was fabricated in a 0.8 micron process and was tested at operating frequencies up to 303MHz. At 300MHz, the vector register file supported 1.1 concurrent waves of data, the adder supported 1.9 waves of data, and the multiplier supported 3.7 waves of data with no intervening latches.
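The wave counts reported above can be related to propagation delay if one assumes that the number of concurrent waves is roughly the logic propagation delay divided by the clock period. The sketch below applies that assumption, which ignores synchronizer overhead and clock skew, to the 300MHz figures. The helper name `delay_from_waves` is hypothetical and the resulting delays are estimates, not measurements from the thesis.

```python
def delay_from_waves(waves: float, clock_hz: float) -> float:
    """Approximate propagation delay (seconds) that holds `waves` waves.

    Assumes waves ~= delay / clock period, ignoring synchronizer overhead.
    """
    if clock_hz <= 0:
        raise ValueError("clock_hz must be positive")
    return waves / clock_hz


if __name__ == "__main__":
    units = {"vector register file": 1.1, "adder": 1.9, "multiplier": 3.7}
    for name, n_waves in units.items():
        delay_ns = delay_from_waves(n_waves, 300e6) * 1e9
        print(f"{name}: ~{delay_ns:.1f} ns")
```

With a 3.33 ns clock period, 3.7 waves in the multiplier corresponds to a propagation delay of a little over 12 ns under this approximation.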
To allow a comparison of performance and costs between wave pipelining and conventional pipelining, an equivalent vector unit was designed using an aggressive traditional clocking technique. The vector unit design which used conventional pipeline clocking was approximately 2% larger due to the latches which had to be inserted into the functional units. The simulated latencies through the multiplier and adder functional units using conventional pipelining were 8% and 11% longer, respectively. The maximum simulated clock rate achieved by using wave pipelining was 35% faster than that which could be achieved using conventional two phase clocking and dynamic latches. Details of the organization, operation, balancing, and testing of the vector unit are given in Chapter 4. Architectural and circuit enhancements for wave pipelining were detailed in Chapter 5. A significant barrier to the use of the wave pipelining design technique has been the difficulty in stalling a wave pipeline. Once a wave has been launched, it proceeds unimpeded through the combinational network. Methods of providing external support for stalls through operand or output stall FIFOs are examined. In addition, a technique for providing stalling capabilities within wave pipelines has been developed. Stalling wave pipelines make use of dynamically latching gates within the combinational logic to impede the flow of data through the logic network during a stall condition. During normal, nonstall operation, the dynamically latching gates are not switched and thus only marginally increase the propagation delay through
the logic. Because it does not incur the clock overhead of conventional pipelines, a stalling wave pipeline in which the delay variation is sufficiently small outperforms a conventional pipeline of equal depth. Because the need for synchronizers within the combinational logic of a wave pipeline is obviated, the latencies, clock frequencies, and die areas of wave pipelined implementations can be superior to conventional pipelines. Wave pipelines with fully latchless feedback have closed loop feedbacks of pipeline outputs to inputs. The lack of synchronizers in the feedback loop makes this organization impractical for most circuits. The constraints on the timing of the output clock of wave pipelines can be eased through the use of self-timing techniques. In self-timed circuit designs, data storage and synchronization are performed through the dynamic operation of the circuitry. The synchronizer's time reference is provided either through the use of methods of data encoding which distinguish when valid data has arrived at the output or through the use of a separate timing reference path through the logic which tracks the data propagation through the logic in a dynamic manner. These same techniques can be used in wave pipeline designs to generate the output clock for the wave pipeline. The self-timed output clock eliminates the need for the intentional delay insertion or clock skewing methods presented in Chapter 3. Optimizations to CMOS wave pipeline circuit designs to minimize the delay variation due to data dependent delays are presented in Chapter 5. The delay variation effects due to parallel conduction paths of outputs driving large capacitances are examined. By minimizing the capacitance driven by gates with parallel conduction paths and driving the large capacitances with balanced drivers, this delay variation can be minimized.
6.2 Conclusions
As a result of this research effort the following conclusions may be drawn:
Wave pipelining of CMOS VLSI systems can result in a 1 to 2-times increase in the rate at which the combinational logic can be clocked when the system is required to operate over reasonable environmental conditions and with typical process variation without compensation for the changes in propagation delay. With the adaptive compensation techniques presented, the clock rates of CMOS wave pipelined systems can be 2 to 6-times those which could be achieved without pipelining.
Delay balancing based upon transistor sizing can limit the variation in path delay to less than 20% for practical circuits. This degree of accuracy is consistent with manual balancing methods.
Additional performance can be gained in CMOS wave pipelined circuits through synchronization of pipeline clocks and minimizing the capacitive loads of gates with parallel conduction paths.
Stalls can be supported in wave pipelined systems through FIFOs external to the wave pipeline or through the use of dynamic stalling gates. If the variation in delay of a wave pipeline is sufficiently low, a stalling wave pipeline has higher performance than a conventional pipeline.
Relatively large CMOS wave pipelined systems can be designed, tested, and operated. The vector unit developed as part of this research effort contained wave pipelined multiply and add functional units and a wave pipelined vector register file, all of which were synchronized to each other and to the remaining conventional synchronous system. Most circuit optimizations were automated. This design performed at up to 303MHz. The vector register file supports 1.1 concurrent waves of data, the adder supports 1.9 waves of data, and the multiplier supports 3.7 waves of data with no intervening latches. By using wave pipelining, this design achieved a clock rate 35% faster than could be achieved using conventional two phase clocking and dynamic latches.
6.3.2 Adaptation
This research has shown that CMOS wave pipelines will not approach their performance potential in system applications without the ability to compensate for variations in delay due to process and operating environment. Additional research on constant propagation delay, closed loop compensation techniques and their effect on circuit performance and reliability is warranted. Additional work on variable propagation delay, self-timed output
wave pipelines could further serve to increase the acceptance of wave pipelines in the design of VLSI systems.
A Symbols
$C_l$: The total load capacitance.
$C_{coupled,i}$: The mutual capacitance of the output and the signal i.
$C_{gate}$: The capacitance of the gate of a transistor.
$C_{int}$: The capacitance of the wires connected to a gate output.
$C_{ox}$: The per area oxide capacitance.
$cs$: The constructive skew between the clock at the input synchronizer and output synchronizer.
$cs^i_{sys}$: The time by which the clock to the output latch of wave pipeline i lags the system reference clock.
$D(i, 0)$: The propagation delay from the source node to node i with falling output.
$D(i, 1)$: The propagation delay from the source node to node i with rising output.
$d_i$: The direction of coupled signal switch: -1 if opposite to output and 1 if same as output.
fast: The operating conditions (maximum supply voltage $V_{max}$, minimum operating temperature, fast fabrication process).
$f(from, to, direct, M_w, C_l)$: The propagation delay function of the gate connecting from to to under the given conditions.
$f_h$: The horizontal mobility degradation factor.
$f_{lin}$: The linear approximation to the delay function f.
$f_v$: The vertical mobility degradation factor.
$H$: The clock overhead including rise/fall time, clock skew, setup or hold time, and synchronizer output time.
$H^i_{max}$: The worst case maximum synchronizer overhead.
$H^i_{min}$: The worst case minimum synchronizer overhead.
$I_{ds\,max}$: The maximum MOS transistor source-drain current.
$k$: A factor to account for the difference between the maximum and average current over the period during which a transistor switches.
$k$: The number of iterations a signal transition makes around the feedback loop in a wave pipeline with latchless feedback.
$K$: Transconductance per unit ratio of channel width to channel length.
$K_n$: The NMOS transconductance.
$K_p$: The PMOS transconductance.
$L$: The channel length.
$L$: The effective transistor length.
$L_0$: The nominal effective transistor length prior to balancing.
$M$: The exponential temperature dependent delay constant (between 1.5 and 2).
$M_l$: The length modification factor of a transistor being balanced.
$M_w$: The width modification factor of a transistor being balanced.
$N$: The number of concurrent waves in the wave pipeline.
$N_{max}$: The maximum number of waves in the wave pipeline; also represents the maximum speed-up of a wave pipeline over the same circuit being operated as a traditional pipeline stage.
$N_{stall}$: The number of required stall cycles.
$P^f_{max}$: The maximum delay of the feedback path in a wave pipeline with latchless feedback.
$P_{max}$: The worst case maximum propagation delay through the combinational network.
$P_{min}$: The best case minimum propagation delay through the combinational network.
$RF_{max}$: The maximum rise/fall time of the inputs to the output synchronizer.
$R_{eq}$: The equivalent resistance of a conducting transistor.
slow: The operating conditions (minimum supply voltage $V_{min}$, maximum operating temperature, slow fabrication process).
$T^{early}_{i+1}$: The earliest time at which data input wave i + 1 is applied to the combinational logic.
$T^{late}_i$: The latest time at which input data wave i is applied to the combinational logic.
$T^{opt}_{clk}$: The lower limit on the clock period of a wave pipeline due to the variation in propagation delay.
$T_{clk}$: The clock period.
$T_{dlatch}$: The increase in propagation delay of a freezing gate due to the dynamic latching transistors at the freeze points.
$T_{freeze\,point}$: The propagation delay from the input of the wave pipeline to the freeze point.
$T_{ms}$: The minimum amount of time a node voltage must be stable to ensure the subsequent level of logic operates correctly.
$T_{ox}$: The device oxide thickness.
$T_{pd}$: The propagation delay of combinational logic.
$T_{phl}$: The propagation delay of a gate with output transitioning from the high to the low level.
$T_{plh}$: The propagation delay of a gate with output transitioning from the low to the high level.
$T_{stall}$: The period for which the stall signals are valid.
$T_{synch}$: The maximum time from the data initiating edge of the clock to valid output of the input synchronizer.
$T_s$: The maximum setup time of the output synchronizer.
$T_{trans}$: The time over which the latch is open and transparent.
$V$: The supply voltage.
$V_{dd}$: The supply voltage.
$V_{dmax}$: The maximum drain to source voltage during velocity saturation.
$V_{sat}$: The drain to source saturation voltage.
$v_{sat}$: The saturation velocity.
$V_{tn}$: NMOS threshold.
$V_{tp}$: PMOS threshold.
$V_t$: The device threshold.
$W$: The effective transistor width.
$W_0$: The nominal effective transistor width prior to balancing.
$W_{max}$: The maximum allowable width of a transistor.
$W_{min}$: The minimum allowable width of a transistor.
$X_{ij}$: The largest integer difference in the number of waves in the longest and shortest paths in a polyharmonic wave pipeline.
[symbol lost]: The ratio of the largest propagation delay through a logic network to the smallest propagation delay through the network.
[symbol lost]: The propagation delay degradation factor due to process and environmental variations.
[symbol lost] $C$: The unintentional clock skew between input and output clocks.
[symbol lost] $P$: The difference in propagation delay between the longest and shortest path through a combinational network.
[symbol lost] $P^i_{max}$: The worst case delay introduced into path i during the intentional delay insertion process.
[symbol lost] $P^j_{max}$: The worst case delay introduced into path j to equalize the delays in a polyharmonic wave pipeline.
[symbol lost]: The worst-case process and environmental degradation factor relative to typical process and environmental conditions.
$\mu_0$: The low-field channel mobility.
$\mu_n$: The electron channel mobility.
$\mu_p$: The hole channel mobility.
[symbol lost]: The fabrication process.
[symbol lost]: The operating temperature.
where $C_l$ is the total load capacitance, [symbol lost] represents operating temperature, and [symbol lost] distinguishes the fabrication process.
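To estimate $T_{pd}$, gates are represented as a single transistor, sized so as to match the current carrying capacity of the complex gate, charging or discharging a fixed load capacitance. Using long-channel MOS current equations, the propagation delay equations for a high-to-low transitioning output assuming step input are [50]:

$$T_{phl} = t_1 + t_2$$
$$t_1 = \frac{2 C_l V_{tn}}{K_n (V_{dd} - V_{tn})^2}$$
$$t_2 = \frac{C_l}{K_n (V_{dd} - V_{tn})} \ln\left(\frac{3V_{dd} - 4V_{tn}}{V_{dd}}\right)$$

For short-channel MOS devices, where velocity saturation limits channel current, the propagation delay for low-going outputs assuming step input is: if $V_{dmax} \ge V_{dd}/2$,

$$T_{phl} = t_1 + t_2$$
$$t_1 = \frac{2 C_l (V_{dd} - V_{dmax})}{K_n V_{dmax}^2}$$
$$t_2 = \frac{C_l}{K_n}\left[\frac{1}{V_{dd} - V_{tn}} \ln\frac{V_{dmax}(1.5V_{dd} - 2V_{tn})}{V_{dd}(V_{dd} - V_{tn} - 0.5V_{dmax})} + \frac{2}{V_{sat}} \ln\frac{1.5V_{dd} - 2V_{tn}}{2V_{dd} - 2V_{tn} - V_{dmax}}\right]$$

else,

$$T_{phl} = \frac{C_l V_{dd}}{K_n V_{dmax}^2}$$

where,

$$V_{dmax} = V_{sat}\left[\left(1 + \frac{2(V_{dd} - V_{tn})}{V_{sat}}\right)^{0.5} - 1\right]$$
$$K_n = \mu_n C_{ox} W/L$$
$$V_{sat} = \frac{L\, v_{sat}}{\mu_n} \tag{126}$$

and $W$ is channel width, $L$ is channel length, $\mu_n$ is electron channel mobility, $C_{ox}$ is per area oxide capacitance, $v_{sat}$ is the saturation velocity, and $V_{tn}$ is NMOS threshold.

Corresponding equations for the propagation from low-to-high transitioning output result from the application of the same delay model, mutatis mutandis. Using long-channel MOS current equations, the propagation delay equations for high-going outputs assuming step input are, in terms of the threshold magnitude $|V_{tp}|$:

$$T_{plh} = t_1 + t_2 \tag{127}$$
$$t_1 = \frac{2 C_l |V_{tp}|}{K_p (V_{dd} - |V_{tp}|)^2} \tag{128}$$
$$t_2 = \frac{C_l}{K_p (V_{dd} - |V_{tp}|)} \ln\left(\frac{3V_{dd} - 4|V_{tp}|}{V_{dd}}\right)$$

For short-channel MOS devices: if $V_{dmax} \ge V_{dd}/2$,

$$T_{plh} = t_1 + t_2$$
$$t_1 = \frac{2 C_l (V_{dd} - V_{dmax})}{K_p V_{dmax}^2}$$
$$t_2 = \frac{C_l}{K_p}\left[\frac{1}{V_{dd} - |V_{tp}|} \ln\frac{V_{dmax}(1.5V_{dd} - 2|V_{tp}|)}{V_{dd}(V_{dd} - |V_{tp}| - 0.5V_{dmax})} + \frac{2}{V_{sat}} \ln\frac{1.5V_{dd} - 2|V_{tp}|}{2V_{dd} - 2|V_{tp}| - V_{dmax}}\right]$$

else,

$$T_{plh} = \frac{C_l V_{dd}}{K_p V_{dmax}^2}$$

where,

$$V_{dmax} = V_{sat}\left[\left(1 + \frac{2(V_{dd} - |V_{tp}|)}{V_{sat}}\right)^{0.5} - 1\right]$$
$$K_p = \mu_p C_{ox} W/L$$
$$V_{sat} = \frac{L\, v_{sat}}{\mu_p} \tag{136}$$

and $W$ is channel width, $L$ is channel length, $\mu_p$ is hole channel mobility, $C_{ox}$ is per area oxide capacitance, $v_{sat}$ is the saturation velocity, and $|V_{tp}|$ is the magnitude of the PMOS threshold $V_{tp}$.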
If the gate is balanced, the propagation delays for rising and falling outputs are equal. For cases where propagation delays are represented by a single number and the gates are not balanced, the arithmetic average of the rising and falling delays is used. For the analysis in Chapter 2, balanced gates are assumed. For the CMOS delay tuning via transistor sizing presented in Chapter 3, the rising and falling delays are treated separately.
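The step-input delay model of this appendix can be evaluated numerically. The sketch below assumes the usual closed forms for a single equivalent NMOS transistor discharging a fixed load to Vdd/2: constant saturation current until the output reaches Vdmax, then linear-region discharge with a (1 + V/Vsat) velocity-saturation term, with Kn taken as the mobility times Cox times W/L. These closed forms, the function names, and any parameter values passed in are assumptions for illustration and are not taken from the thesis. The low-to-high case would follow by substituting the PMOS parameters.

```python
import math


def v_dmax(vdd: float, vt: float, vsat: float) -> float:
    """Drain-source voltage at the onset of velocity saturation."""
    return vsat * (math.sqrt(1.0 + 2.0 * (vdd - vt) / vsat) - 1.0)


def tphl_long_channel(cl: float, kn: float, vdd: float, vtn: float) -> float:
    """Long-channel high-to-low delay to the Vdd/2 point (step input)."""
    t1 = 2.0 * cl * vtn / (kn * (vdd - vtn) ** 2)
    t2 = cl / (kn * (vdd - vtn)) * math.log((3.0 * vdd - 4.0 * vtn) / vdd)
    return t1 + t2


def tphl_short_channel(cl: float, kn: float, vdd: float, vtn: float,
                       vsat: float) -> float:
    """Velocity-saturated high-to-low delay to the Vdd/2 point (step input)."""
    vd = v_dmax(vdd, vtn, vsat)
    if vd <= vdd / 2.0:
        # The device stays velocity saturated for the whole transition.
        return cl * vdd / (kn * vd ** 2)
    # Saturated from Vdd down to Vdmax, then linear region down to Vdd/2.
    a = vdd - vtn
    t1 = 2.0 * cl * (vdd - vd) / (kn * vd ** 2)
    t2 = (cl / kn) * (
        math.log(vd * (1.5 * vdd - 2.0 * vtn) / (vdd * (a - 0.5 * vd))) / a
        + (2.0 / vsat) * math.log((1.5 * vdd - 2.0 * vtn) / (2.0 * a - vd))
    )
    return t1 + t2


def gate_delay(tphl: float, tplh: float) -> float:
    """Single-number delay for an unbalanced gate: the arithmetic mean."""
    return 0.5 * (tphl + tplh)
```

As a sanity check on the sketch, letting `vsat` grow very large makes the short-channel result approach the long-channel one, since Vdmax then tends to Vdd - Vtn and the velocity-saturation term vanishes.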
This appendix analyzes the performance and stability of the constant delay, adaptive power technique introduced in Chapter 3.
A closed-loop adaptive power design consists of a delay error detector circuit (a phase comparator) and a power supply regulator. The supply regulator can be implemented as a dissipative circuit which lowers the power supply voltage, or a pulse-width-modulation circuit which controls the performance of a switch-mode regulator.
As the constant delay adaptive power technique was used for a relatively high power design, this analysis concentrates on the switch-mode regulation method.
A fixed frequency source is applied to an inverter chain which acts as a voltage controlled delay. The inverter chain is designed such that for the slow corner process and the worst-case operating temperature and noise, the delay through the chain with a nominal power supply is equal to 1/2 the period of the fixed frequency source.
The input to the delay chain and the output are phase compared using a master-slave latch circuit shown in Figure 68 [2]. The outputs of the comparator are exclusive. When the input to the delay chain lags the output, the delay through the chain is too short and the supply voltage to the chain must be lowered. In this situation, the chip is operating faster than the design target and the chip supply should be lowered. The lagging input results in the activation of the ChargeRemove signal, which causes a fixed amount of charge to be removed from the capacitor on the Vctrl node. The removed charge lowers slightly the
supply to the delay chain, thereby increasing the delay through the delay chain.
Figure 68: Phase Comparator Circuit (signal labels: Pulse, Charge Add)
If the delay chain input leads the output, the supply is too low. In this case, the ChargeAdd signal is activated, a small charge is added to the charge pump capacitor, and the reference voltage is increased. The value of the control capacitor and the amount of charge added to or removed from the capacitor are chosen so as to ensure the stability and response time of the control circuit. The control voltage fixed by the delay loop is an approximation to the voltage at which the delay of the chain equals the design target for the realized process and the current operating conditions. If the bandwidth of the delay loop is sufficiently high when compared to the time constant of the environmental changes, this approximation is accurate.
To approach constant delay of all circuitry, the chip supply voltage should track the delay detector control voltage. The control voltage is used as the reference voltage for the buck converter in the switching power supply. The reference voltage change results in a change in the duty cycle of the switching waveform to the switching transistor. The modeling of the buck converter shown in Figure 69 is based upon work by Chetty [9].
Figure 70 shows the response of the delay detector and converter to the initial voltage locking due to a process which is faster than the design process. The top graph is the ChargeAdd indication and the middle graph is the ChargeRemove signal. The bottom graph shows two traces: the top trace in the graph, node 6, is the chip power supply node. The bottom trace in the graph, node PULL, is the reference voltage. This graph shows that within 50 microseconds, the chip power supply voltage reaches the value which achieves design target propagation delays.
Figure 71 shows the response of the adaptive power circuit to a rise in die temperature of the vector unit chip. At time t=0 the power consumption rises from a standby power consumption of 0.5W to a power consumption of 1.9W.
The first graph in Figure 71 gives the thermal response at the location of the delay chain on the die. The temperature rise has a time constant of approximately 350ms. The temperature rise is approximately 43°C.
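The thermal step described above can be approximated by a first-order response. The sketch below assumes a single-pole exponential with the quoted final rise of about 43 degrees C and time constant of about 350ms, and estimates an effective thermal resistance from the 0.5W to 1.9W power step. Both the single-pole form and the derived resistance are assumptions for illustration; the function names are hypothetical.

```python
import math


def die_temperature_rise(t_s: float, final_rise_c: float = 43.0,
                         tau_s: float = 0.35) -> float:
    """Temperature rise (deg C) t_s seconds after the power step,
    assuming a single-pole (first-order) thermal response."""
    if t_s <= 0:
        return 0.0
    return final_rise_c * (1.0 - math.exp(-t_s / tau_s))


def effective_thermal_resistance(final_rise_c: float = 43.0,
                                 p_before_w: float = 0.5,
                                 p_after_w: float = 1.9) -> float:
    """Steady-state temperature rise per watt of added power (deg C / W)."""
    return final_rise_c / (p_after_w - p_before_w)


if __name__ == "__main__":
    for t in (0.1, 0.35, 1.0, 2.0):
        print(f"t = {t:4.2f} s: rise ~ {die_temperature_rise(t):.1f} C")
    print(f"thermal resistance ~ {effective_thermal_resistance():.1f} C/W")
```

Because the thermal time constant is several orders of magnitude longer than the 50 microsecond lock time reported for the supply loop, a loop of that bandwidth can be expected to track this kind of temperature change closely.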
Figure 69: Power Converter Circuit (block labels: Buck Converter, PWM, Error Amp, Vref, Chip Vdd, Rchip)
The second graph presents the response of the control voltage, node PULL, and the chip power supply, node 6. The power supply level is adjusted to track the temperature rise. The maximum difference between the control voltage and the chip power supply during the temperature increase is 420 microvolts for this simulation. The resulting deviation of the propagation delay of the inverter delay chain from its design target is negligible. Simulation results have shown that the adaptive power supply method can compensate for process dependent delay variation and temperature dependent delay variation in CMOS wave pipelined circuits. This method cannot, however, compensate for high-frequency, delay-affecting changes to the operating environment such as L di/dt noise and capacitively coupled noise, or for spatial variations in the process and operating environment. Any disparity in the tracking of propagation delay between the inverter delay chain and the wave pipeline combinational logic when the supply voltage is changed may also introduce some variation in circuit delay. Despite these factors, the adaptive power method is attractive for high performance CMOS wave pipelined designs.
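The locking behaviour of the delay detector can be illustrated with a very small behavioural model. The sketch below is not the simulated circuit: it assumes a long-channel-like delay law for the inverter chain, delay proportional to V / (V - Vt)^2, and models each ChargeAdd or ChargeRemove event as a fixed voltage step dQ/C on the control node. The delay constant, threshold, step size, and voltage limits are all hypothetical values chosen only to make the example run.

```python
def chain_delay(v: float, k: float = 10e-9, vt: float = 0.8) -> float:
    """Assumed inverter-chain delay model: k * V / (V - Vt)^2 (seconds)."""
    return k * v / (v - vt) ** 2


def lock_control_voltage(target_delay: float, v0: float = 5.0,
                         step_v: float = 0.005, cycles: int = 2000,
                         v_min: float = 1.5, v_max: float = 5.5) -> list:
    """Bang-bang loop: each comparison adds or removes a fixed charge,
    modelled as a fixed voltage step on the control node."""
    v = v0
    history = []
    for _ in range(cycles):
        if chain_delay(v) < target_delay:
            v -= step_v  # chain too fast: ChargeRemove lowers the supply
        else:
            v += step_v  # chain too slow: ChargeAdd raises the supply
        v = min(max(v, v_min), v_max)  # keep the model in a valid range
        history.append(v)
    return history


if __name__ == "__main__":
    # Target: the delay the assumed chain would have at 3.3 V.
    trace = lock_control_voltage(chain_delay(3.3))
    print(f"final control voltage ~ {trace[-1]:.3f} V")
```

In this model the control voltage ramps toward the value at which the chain delay equals the target and then dithers by one step around it, which is the qualitative behaviour expected of a fixed-charge pump. A smaller step reduces the dither at the cost of a longer lock time, which mirrors the trade-off between stability and response time mentioned in the text.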
References
[1] S. Anderson, J. Earle, R. Goldschmidt, and D. Powers. The IBM System/360 Model 91 floating point execution unit. IBM Journal of Research and Development, pages 34-53, January 1967.
[2] H. B. Bakoglu. Circuits, Interconnections, and Packaging for VLSI. Addison-Wesley, 1990.
[3] M. Berkelaar and J. Jess. Gate sizing in MOS digital circuits with linear programming. In Proceedings of the 1990 European Design Automation Conference, pages 217-21, Glasgow, March 1990.
[4] C. Branson, D. Murray, and S. Sullivan. Integrated pin electronics for a VLSI test system. In International Test Conference 1988 Proceedings, pages 23-27, Washington, D.C., September 1988.
[5] R. Brent and H. Kung. A regular layout for parallel adders. IEEE Transactions on Computers, C-31:260-264, 1982.
[6] C. Chang, E. Davidson, and K. Sakallah. Using constraint geometry to determine maximum rate pipeline clocking. In Proceedings of 1992 IEEE/ACM International Conference on Computer-Aided Design, pages 142-148, Santa Clara, California, November 1992.
[7] J. Chapman. High-performance CMOS based VLSI testers: timing control and compensation. In Proceedings International Test Conference 1992, pages 59-67, Baltimore, September 1992.
[8] T. Chappell, B. Chappell, S. Schuster, J. Allan, S. Klepner, R. Joshi, and R. Franch. A 2-ns cycle, 3.8-ns access 512-kb CMOS ECL SRAM with a fully pipelined architecture. IEEE Journal of Solid-State Circuits, November 1991.
[9] P. R. K. Chetty. Switch-mode power supply design. Tab Professional and Reference Books, 1986.
[10] L. Cotten. Maximum rate pipelined systems. In Proceedings AFIPS Spring Joint Computer Conference, pages 581-586, 1969.
[11] M. Dean. STRiP: a self-timed RISC processor. PhD thesis, Stanford University, Department of Electrical Engineering, 1992.
[12] B. Ekroot. Optimization of Pipelined Processors by Insertion of Combinational Logic Delay. PhD thesis, Stanford University, Department of Electrical Engineering, September 1987.
[13] W. Elmore. The transient response of damped linear networks with particular regard to wideband amplifiers. Journal of Applied Physics, January 1948.
[14] D. Fan, C. Gray, W. Farlow, T. Hughes, W. Liu, and R. Cavin. A CMOS parallel adder using wave pipelining. MIT Advanced Research in VLSI and Parallel Systems, pages 147-164, March 1992.
[15] J. Fishburn. Clock skew optimization. IEEE Transactions on Computers, 39:945-51, 1990.
[16] J. Fishburn and A. Dunlop. Tilos, a polynomial programming approach to transistor sizing. In Proceedings of the 1985 IEEE International Conference on Computer-Aided Design, pages 326-8, 1985.
[17] M. P. Flynn and S. Lidholm. A 1.2-μm CMOS current-controlled oscillator. IEEE Journal of Solid-State Circuits, July 1992.
[18] E. Friedman and J. Mulligan Jr. Clock frequency and latency in synchronous systems. International Journal of Electronics, 70:930-4, May 1991.
[19] D. Ghosh and S. K. Nandy. A 400 MHz wave-pipelined 8x8-bit multiplier in CMOS technology. In Proceedings of the International Conference on Computer Design, pages 189-201, 1993. A slightly more detailed presentation is given in: D. Ghosh and S. K. Nandy. Design and realization of high-performance wave-pipelined 8x8 b multiplier in CMOS technology. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, pages 36-48, March 1995.
[20] L. Glasser and D. Dobberpuhl. The Design and Analysis of VLSI Circuits. Addison-Wesley, 1985.
[21] C. T. Gray, W. Liu, and R. K. Cavin III. Timing constraints for wave pipelined systems. Technical Report NCSU-VLSI-92-06, North Carolina State University, Department of Electrical Engineering, December 1992.
[22] K. Hijikata, T. Nagasaki, R. Kurazume, and W. Nakayama. Study on heat transfer from small heating elements in an integrated circuit chip. In Proceedings of the 3rd ASME/JSME Thermal Engineering Joint Conference, volume 4, 1991.
[23] M. Horowitz. Personal conversation. Dept. of Electrical Engineering, Stanford University, March 1993.
[24] D. Jeong, G. Borriello, D. Hodges, and R. Katz. Design of PLL-based clock generation circuits. IEEE Journal of Solid-State Circuits, pages 255-61, April 1987.
[25] M. Johnson and E. Hudson. A variable delay line PLL for CPU-coprocessor synchronization. IEEE Journal of Solid-State Circuits, pages 1218-23, October 1988.
[26] D. Joy and M. Ciesielski. Placement for clock period minimization with multiple wave propagation. In Proceedings of the 28th Design Automation Conference, pages 640-643, San Francisco, 1991.
[27] D. Joy and M. Ciesielski. Clock period minimization with wave pipelining. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, pages 461-472, April 1993.
[28] T. Kim, W. Burleson, and M. Ciesielski. Logic restructuring for wave-pipelined circuits. In International Workshop on Logic Synthesis, 1993.
[29] E. F. Klass. Wave Pipelining: Theoretical and Practical Issues in CMOS. PhD thesis, Delft University of Technology, Department of Electrical Engineering, September 1994.
[30] F. Klass and M. Flynn. Comparative studies of pipelined circuits. Technical Report CSL-TR-93-579, Stanford University, Computer Systems Laboratory, Department of Electrical Engineering, July 1993.
[31] F. Klass, M. Flynn, and A. J. van de Goor. Fast multiplication in VLSI using wave pipelining techniques. Journal of VLSI Signal Processing, 7:233-248, 1994.
[32] F. Klass and J. Mulder. CMOS implementation of wave pipelining. Technical Report 1-68340-44199002, Delft University of Technology, Department of Electrical Engineering, December 1990.
[33] W. Lam, R. Brayton, and A. Sangiovanni-Vincentelli. Valid clocking in wave-pipelined circuits. In Proceedings of IEEE Conference on Integrated Circuits Computer Aided Design, 1992.
[34] C. Lee and A. Palisoc. Real-time thermal design of integrated circuit devices. IEEE Transactions on Components, Hybrids and Manufacturing Technology, December 1988.
[35] W. Lien and W. Burleson. Wave-domino logic: Timing analysis and applications. In Proceedings of TAU92, 1992. A short version of this work appears in: W. Lien and W. Burleson. Wave-Domino Logic: Timing Analysis and Applications. Proceedings of IEEE International Symposium on Circuits and Systems, pages 2949-52, 1992.
[36] P. Macken, M. Degrauwe, M. Van Paemel, and H. Oguey. A voltage reduction technique for digital systems. In Proceedings of the 1990 IEEE International Solid-State Circuits Conference, pages 238-9, San Francisco, CA, February 1990.
[37] D. Marple. Performance optimization of digital VLSI circuits. PhD thesis, Stanford University, Department of Electrical Engineering, 1987.
[38] Meta-Software. HSPICE User's Manual: Volume 2 Elements and Models. Meta-Software, Inc., 1992.
[39] B. Murtagh and M. Saunders. Minos 5.1 user's guide. Technical Report SOL 83-20R, Stanford University, Systems Optimization Laboratory, Dept. of Operations Research, January 1987.
[40] K. Nakamura, S. Kuhara, T. Kimura, M. Takada, H. Suzuki, H. Yoshida, and T. Yamazaki. A 220 MHz pipelined 16 Mb BiCMOS SRAM with PLL proportional self-timing generator. In Proceedings of the 1994 IEEE International Solid-State Circuits Conference, pages 258-9, San Francisco, California, February 1994.
[41] V. Nguyen, W. Liu, C. Gray, and R. Cavin. A CMOS signed multiplier using wave pipelining. In Proceedings of IEEE 1993 Custom Integrated Circuits Conference, 1993.
[42] L. S. Nielsen, C. Niessen, J. Sparso, and K. van Berkel. Low-power operation using self-timed circuits and adaptive scaling of the supply voltage. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2:391-397, 1994.
[43] K. Nowka and M. Flynn. Environmental limits on the performance of CMOS wave-pipelined circuits. Technical Report CSL-TR-94-600, Stanford University, Computer Systems Laboratory, Dept. of Electrical Engineering, January 1994.
[44] K. Nowka and M. Flynn. Wave pipelining of high performance CMOS static RAM. Technical Report CSL-TR-94-615, Stanford University, Computer Systems Laboratory, Dept. of Electrical Engineering, January 1994.
[45] K. Nowka and M. Flynn. System design using wave-pipelining: A CMOS VLSI vector unit. In Proceedings of the 1995 IEEE International Conference on Circuits and Systems, pages 2301-2304, 1995.
[46] A. Sabnis and J. Clemens. Characterization of electron mobility in the inverted <100> silicon surface. IEDM Technical Digest, 1979.
[47] M. Santoro. Design and clocking of VLSI multipliers. PhD thesis, Stanford University, Department of Electrical Engineering, 1990.
[48] S. Sapatnekar, V. Rao, and P. Vaidya. A convex optimization approach to transistor sizing for CMOS circuits. In Proceedings of the 1991 IEEE International Conference on Computer-Aided Design, pages 482-5, Santa Clara, CA, November 1991.
[49] The MOSIS Service. Mosis parametric test results. 1993.
[50] M. Shoji. CMOS Digital Circuit Technology. Prentice Hall, 1988.
[51] I. Sutherland. Micropipelines. Communications of the ACM, pages 720-38, June 1989.
[52] S. Tachibana, H. Higuchi, K. Takasugi, K. Sasaki, T. Yamanaka, and Y. Nakagome. A 2.6-ns wave-pipelined CMOS SRAM with dual-sensing-latch. In Proceedings of the 1994 Symposium on VLSI Circuits, pages 117-8, Honolulu, HI, June 1994.
[53] D. Wong. Techniques for Designing High Performance Digital Circuits Using Wave Pipelining. PhD thesis, Stanford University, Department of Electrical Engineering, 1991.
[54] D. Wong, G. De Micheli, and M. Flynn. A bipolar population counter using wave pipelining to achieve 2.5x normal clock frequency. In Proceedings of IEEE International Solid-State Circuits Conference, San Francisco, February 1992.
[55] X. Zhang and R. Sridhar. CMOS wave pipelining using transmission-gate logic. In Proceedings of Seventh Annual IEEE International ASIC Conference and Exhibit, Rochester, NY, September 1994.