INTERNATIONAL JOURNAL OF DESIGN, ANALYSIS AND TOOLS FOR INTEGRATED CIRCUITS AND SYSTEMS The International Journal of Design, Analysis and Tools for Integrated Circuits and Systems (IJDATICS) was created by a network of researchers and engineers both from academia and industry. IJDATICS is an international journal intended for professionals and researchers in all fields of design, analysis and tools for integrated circuits and systems. The objective of the IJDATICS is to serve a better understanding between the community of researchers and practitioners both from academia and industry. Editor-In-Chief Ka Lok Man Xi'an Jiaotong-Liverpool University, China, and Myongji University, South Korea Co-Editor-In-Chief Chi-Un Lei Abhilash Goyal University of Hong Kong, Hong Kong Oracle (SunMicrosystems), USA Editorial Board Vladimir Hahanov Salah Merniz Felipe Klein Kharkov National University of Radio Electronics, Ukraine Mentouri University, Algeria State University of Campinas, Brazil Paolo Prinetto Oscar Valero Enggee Lim Politecnico di Torino, Italy University of Balearic Islands, Spain Xi'an Jiaotong-Liverpool University, China Massimo Poncino Yang Yi Kevin Lee Politecnico di Torino, Italy Sun Yat-Sen University, China Murdoch University, Australia Alberto Macii Damien Woods Prabhat Mahanti Politecnico di Torino, Italy University of Seville, Spain University of New Brunswick, Saint John, Canada Joongho Choi Franck Vedrine Kaiyu Wan University of Seoul, South Korea CEA LIST, France Xi'an Jiaotong-Liverpool University, China Wei Li Bruno Monsuez Tammam Tillo Fudan University, China ENSTA, France Xi'an Jiaotong-Liverpool University, China Michel Schellekens Kang Yen Yanyan Wu University College Cork, Ireland Florida International University, USA Xi'an Jiaotong-Liverpool University, China Emanuel Popovici Takenobu Matsuura Wen Chang Huang University College Cork, Ireland Tokai University, Japan Kun Shan University, Taiwan Jong-Kug Seon R. 
Timothy Edwards Masahiro Sasaki LS Industrial Systems R&D Center, South Korea MultiGiG, Inc., USA The University of Tokyo, Japan Umberto Rossi Olga Tveretina Vineet Sahula STMicroelectronics, Italy Karlsruhe University, Germany Malaviya National Institute of Technology, India Franco Fummi Maria Helena Fino D. Boolchandani University of Verona, Italy Universidade Nova De Lisboa, Portugal Malaviya National Institute of Technology, India Graziano Pravadelli Adrian Patrick ORiordan Zhao Wang University of Verona, Italy University College Cork, Ireland Xi'an Jiaotong-Liverpool University, China Vladimir PavLov Grzegorz Labiak Shishir K. Shandilya Intl. Software and Productivity Engineering Institute, USA University of Zielona Gora, Poland NRI Institute of Information Science & Technology, India Ajay Patel Jian Chang J.P.M. Voeten Intelligent Support Ltd, United Kingdom Texas Instruments Inc, USA Eindhoven University of Technology, The Netherlands Thierry Vallee Yeh-Ching Chung Wichian Sittiprapaporn Georgia Southern University, USA National Tsing-Hua University, Taiwan Mahasarakham University, Thailand Menouer Boubekeur Anna Derezinska Aseem Gupta University College Cork, Ireland Warsaw University of Technology, Poland Freescale Semiconductor Inc., USA Monica Donno Kyoung-Rok Cho Kevin Marquet Minteos, Italy Chungbuk National University, South Korea Verimag Laboratory, France Jun-Dong Cho Yong Zhang Matthieu Moy Sung Kyun Kwan University, South Korea Shenzhen University, China Verimag Laboratory, France AHM Zahirul Alam R. Liutkevicius Ramy Iskander International Islamic University Malaysia, Malaysia Vytautas Magnus University, Lithuania LIP6 Laboratory, France Gregory Provan Yuanyuan Zeng Suryaprasad Jayadevappa University College Cork, Ireland University College Cork, Ireland PES School of Engineering, India Miroslav N. Velev D.P. Vasudevan S. Hariharan Aries Design Automation, USA University College Cork, Ireland B. S. Abdur Rahman University, India M. 
Nasir Uddin Arkadiusz Bukowiec Chung-Ho Chen Lakehead University, Canada University of Zielona Gora, Poland National Cheng-Kung University, Taiwan Dragan Bosnacki Maziar Goudarzi Kyung Ki Kim Eindhoven University of Technology, The Netherlands University College Cork, Ireland Daegu University, South Korea Dave Hickey Jin Song Dong Shiho Kim University College Cork, Ireland National University of Singapore, Singapore Chungbuk National University, South Korea Maria OKeeffe Dhamin Al-Khalili Hi Seok Kim University College Cork, Ireland Royal Military College of Canada, Canada Cheongju University, South Korea Tomas Krilavicius Zainalabedin Navabi Nan Zhang Vytautas Magnus University, Lithuania University of Tehran, Iran Xi'an Jiaotong-Liverpool University, China Milan Pastrnak Lyudmila Zinchenko Brian Logan Siemens IT Solutions and Services, Slovakia Bauman Moscow State Technical University, Russia University of Nottingham, UK John Herbert Muhammad Almas Anjum Ben Kwang-Mong Sim University College Cork, Ireland National University of Sciences and Technology, Pakistan Gwangju Institute of Science & Technology, South Korea Zhe-Ming Lu Deepak Laxmi Narasimha Asoke Nath Sun Yat-Sen University, China University of Malaya, Malaysia St. Xavier's College, India Jeng-Shyang Pan Danny Hughes Tharwon Arunuphaptrairong National Kaohsiung University of Applied Sciences, Taiwan Xi'an Jiaotong-Liverpool University, China Chulalongkorn University, Thailand Chin-Chen Chang Jun Wang Shin-Ya Takahasi Feng Chia University, Taiwan Fujitsu Laboratories of America, Inc., USA Fukuoka University, Japan Mong-Fong Horng A.P. Sathish Kumar Cheng C. Liu Shu-Te University, Taiwan PSG Institute of Advanced Studies, India University of Wisconsin at Stout, USA Liang Chen N. Jaisankar Farhan Siddiqui University of Northern British Columbia, Canada VIT University. 
India Walden University, Minneapolis, USA Chee-Peng Lim Atif Mansoor Yui Fai Lam University of Science Malaysia, Malaysia National University of Sciences and Technology, Pakistan Hong Kong University of Science & Technology, Hong Kong Ngo Quoc Tao Steven Hollands Jinfeng Huang Vietnamese Academy of Science and Technology, Vietnam Synopsys, Ireland Philips & LiteOn Digital Solutions, The Netherlands Siamak Mohammadi University of Tehran, Iran

Managing Editor: Michele Mercaldi, EnvEve, Switzerland
Journal Secretary: Jieming Ma, Xi'an Jiaotong-Liverpool University, China
Assistant Editor-In-Chief: Woonkian Chong, Xi'an Jiaotong-Liverpool University, China; Lai Khin Wee, Technische Universitat Ilmenau, Germany, and Universiti Teknologi Malaysia, Malaysia

Publisher Cooperation
Name: Distributed Thought, Inc., UK
Address: 7 Red Cat Lane, Crank, St Helens, Merseyside, WA11 8RU, UK
Email: contact@distributedthought.com
ISSN: 2071-2987 (online version)
http://ijdatics.distributedthought.com/

Preface

Welcome to the first issue of the International Journal of Design, Analysis and Tools for Integrated Circuits and Systems (IJDATICS). This issue comprises enhanced and extended versions of research papers from the International DATICS Workshops in 2009. The DATICS Workshops were created by a network of researchers and engineers from both academia and industry in the areas of i) Design, Analysis and Tools for Integrated Circuits and Systems and ii) Communication, Computer Science, Software Engineering and Information Technology. The main aim of the DATICS Workshops is to bring together software/hardware engineering researchers, computer scientists, practitioners and people from industry to exchange theories, ideas, techniques and experiences. This IJDATICS issue presents eight high quality research articles from eight different countries.
This mix provides a comprehensive snapshot of state-of-the-art research in the field and a springboard for future work and discussion. Three key themes are evident in these papers:

Analog and Digital Circuits: Three papers address issues of circuit modeling and analysis. Boolchandani presents a Support Vector Machine based feasibility macromodel for analog circuit synthesis. Mahmoud looks at the impact of power supply noise on the performance of CMOS clock and data recovery circuits. Al-Hertani addresses pattern-dependent static power estimation of logic blocks in a library-free design environment.

VLSI Digital Systems: Three papers introduce new analysis and design methodologies for VLSI digital system architectures. Lotfi-Kamran proposes a design methodology for pipelined processors to minimize unnecessary transitions in a NOP instruction. Yin introduces a hierarchical agent based Network-on-Chip (NoC) architecture with real-time autonomous re-configuration. Benhamamouch presents an analysis approach to compute an upper estimate of the worst case execution time (WCET) of current complex hardware architectures.

Power Electronic Circuits: Two papers discuss the application of electronics to the conversion of electric power. Chen illustrates a differential Class E power amplifier design with load mismatch protection and power control features. Huang describes a charge pump circuit topology which uses a voltage doubler as the clock scheme.

We are beholden to all of the authors for their contributions to the DATICS Workshops in 2009. We would also like to thank the IJDATICS editorial team.

Editors:
Massimo Poncino, Politecnico di Torino, Italy
Ka Lok Man, Xi’an Jiaotong-Liverpool University, China and Myongji University, South Korea
Chi-Un Lei, University of Hong Kong, Hong Kong

Table of Contents
Vol. 1, No. 1, June 2011
Preface
Table of Contents
1.
Exploring Efficient Kernel Functions for Support Vector Machine Based Feasibility Models for Analog Circuits. D. Boolchandani, V. Sahula
2. Dynamic Power Reduction of Stalls in Pipelined Architecture Processors. P. Lotfi-Kamran, A.-A. Salehpour, A.-M. Rahmani, A. Afzali-Kusha, Z. Navabi
3. Studies on Sensitivity of Clock and Data Recovery Circuits to Power Supply Noise. K. I. Mahmoud, J. D. Devi, R. Rajasekar, P. V. Ramakrishna
4. A Novel VSWR-Protected and Controllable CMOS Class E Power Amplifier for Bluetooth Applications. W. Chen, W. Lin, S. Huang
5. A Charge Pump Circuit by using Voltage-Doubler as Clock Scheme. W. C. Huang, J. C. Cheng, P. C. Liou
6. Hierarchical Agent Based NoC with DVFS Techniques. A. W. Yin, L. Guang, P. Liljeberg, P. Rantala, J. Isoaho, H. Tenhunen
7. Static Power Estimation of CMOS Logic Blocks in a Library Free Design Environment. H. Al-Hertani, D. Al-Khalili, C. Rozon
8. Computing Worst Case Execution Time by Symbolically Executing a Time-accurate Hardware Model. B. Benhamamouch, B. Monsuez

INTERNATIONAL JOURNAL OF DESIGN, ANALYSIS AND TOOLS FOR CIRCUITS AND SYSTEMS, VOL. 1, NO. 1, JUNE 2011

Exploring Efficient Kernel Functions for Support Vector Machine Based Feasibility Models for Analog Circuits

D. Boolchandani and Vineet Sahula

Abstract: Support Vector Machines (SVMs) have been used as classifier to identify the feasible design space of analog circuits. A feasibility design space is defined as a multidimensional space in which every point representing a design satisfies all the design constraints. The minimum set of constraints is the one that ensures the correct functionality of the given circuit topology. Performance Macromodels that facilitates accelerated analog circuit synthesis are constructed and thereby valid only in the functionally correct design space. A kernel function is an integral part of the SVM and contributes in obtaining an optimized and accurate classifier. A kernel function serves as a separating function, a hypersurface which optimally separates input data into two classes involving minimal support vectors. The support vectors are data points in input space lying on kernel function hypersurface. There is no formal way to decide, which kernel function is suited to a class of classifier problem. While most commonly used kernels are Radial Basis Function (RBF), polynomial, spline, multilayer perceptron; we have explored many other un-conventional kernel functions and kernels composed through modifications on the some of the standard kernels functions. The classifiers using these new kernel functions have been tested on different analog circuits in order to identify the feasible design space. HSPICE has been used for generation of learning data. Least Square SVM toolbox interfaced with MATLAB was used for classification. We found that use of modified kernels improves classification accuracy and shortens classifier training time as well.

Index Terms: Analog synthesis, macromodels, Support Vector Machine, kernel, feasibility classification.

I. INTRODUCTION

Given a circuit topology, we can pose three types of constraints. Further discussion is based on details as in [1].

Geometric constraints, $C_g$ are posed directly on the resistor, capacitor, bias voltage and currents and devices sizes e.g. width and lengths. The matching constraints on the devices are satisfied by assigning one design variable to the matched devices. After matching is taken in to account, the controllable device sizes are abstracted into a vector of independent design variables $x = x_1, \ldots, x_n \in \mathbb{R}^n$. The constraints on the device sizes are usually given in the form of lower and upper bounds. The lower bounds can be determined by the feature size of a technology. The upper bounds can be selected by the designer such that the devices are not excessively large. Geometric constraints are transformed into the form of eqn. (1).

$$C_g = \{ lb_i \le x_i \le ub_i,\ i = 1, \ldots, n_g \} \tag{1}$$

Functional constraints $C_f$ ensure the desired functionality of the given circuit topology. They are often biasing constraints posed on the nodal voltages $\mathbf{v}$ and branch currents $\mathbf{i}$ in analytic form. A circuit level simulator is required to obtain these values in order to check functional constraints. These constraints can be represented via simple transformation, as in eqn. (2).

$$C_f = \{ x : f_i(\mathbf{v}, \mathbf{i}) \le 0,\ i = 1, \ldots, n_f \} \tag{2}$$

Performance constraints, $C_p$ are posed on the performance parameters $p$ chosen according to the applications, viz. open loop gain, unity gain frequency, phase margin for an op-amp.

$$C_p = \{ x : f_i(p) \le 0,\ i = 1, \ldots, n_p \} \tag{3}$$

Device size ranges and functional constraints take part in defining the feasibility design space, while performance constraints do not. The feasibility design space $S \subseteq \mathbb{R}^n$ is defined as in eqn. (4). Note that $x$ is a vector of all the design variables.

$$S = \{ x : x \in \mathbb{R}^n,\ C \}; \quad C = C_g \cup C_f \cup C_p \tag{4}$$

We define a feasibility function $y(x)$, which only takes two values $\{+1, -1\}$ depending on whether $x \in S$,

$$y(x) = \begin{cases} +1 & \text{if } x \in S \\ -1 & \text{if } x \notin S \end{cases} \tag{5}$$

Feasibility design space identification is necessary in building performance macromodels since it screens out infeasible designs. It is also essential during analog circuit design and synthesis, in general since it ensures the functional correctness of the circuits. Feasibility function is approximated, since checking whether a design is feasible or not requires computationally expensive simulation. Hence it is called as feasibility macromodeling.

Feasibility macromodeling is treated as classification problem and existing classification techniques are applied to solve it. Instances from simulations are used to train a selected model with objective of minimizing the classification error on the training set.
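The feasibility function of eqn. (5) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `within_bounds`, `feasibility_label` and `toy_ratio_constraint` are hypothetical, and the toy constraint stands in for the simulator-backed functional checks of eqn. (2), which in the paper come from HSPICE runs.

```python
from typing import Callable, List, Sequence

# A constraint maps a design vector x to a value that must be <= 0,
# mirroring the f_i(.) <= 0 form of eqns. (2) and (3).
Constraint = Callable[[Sequence[float]], float]


def within_bounds(x: Sequence[float], lower: Sequence[float],
                  upper: Sequence[float]) -> bool:
    """Geometric constraints, eqn. (1): lb_i <= x_i <= ub_i for every i."""
    return all(lo <= xi <= hi for xi, lo, hi in zip(x, lower, upper))


def feasibility_label(x: Sequence[float], lower: Sequence[float],
                      upper: Sequence[float],
                      constraints: List[Constraint]) -> int:
    """Feasibility function y(x), eqn. (5): +1 inside S, -1 outside."""
    if not within_bounds(x, lower, upper):
        return -1
    return 1 if all(f(x) <= 0 for f in constraints) else -1


def toy_ratio_constraint(x: Sequence[float]) -> float:
    """Hypothetical stand-in for a simulated check: feasible when x0 <= 2*x1."""
    return x[0] - 2.0 * x[1]


# Assumed bounds for two design variables (illustrative values only).
lower = [1.0, 1.0]
upper = [100.0, 50.0]
samples = [[10.0, 20.0], [90.0, 10.0], [150.0, 20.0]]
labels = [feasibility_label(x, lower, upper, [toy_ratio_constraint])
          for x in samples]
print(labels)
```

Labelled pairs of this kind, `(x, y(x))`, are the training instances on which a classifier is fitted; in practice each label would cost one circuit simulation, which is why the function is approximated rather than evaluated directly.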
The technique of Support vector machines (SVMs) that has been successfully D. Boolchandani and V. Sahula are with the Dept. of ECE at Malaviya Na- applied in order to solve many practical problems in various tional Institute of Technology, Jaipur-302017, India. E-mail: dbool@ieee.org, fields is used for generation of feasibility classifier models. sahula@ieee.org The work was supported by a research grant from Ministry of Comm. & The SVMs are a class of machine learning algorithms. In the IT, Govt. of India through sponsored project SMDP-VLSI phase-2. next section we discuss support vector classifiers and briefly INTERNATIONAL JOURNAL OF DESIGN, ANALYSIS AND TOOLS FOR CIRCUITS AND SYSTEMS, VOL. 1, NO. 1, JUNE 2011 2 TABLE I review the work done in literature. L IST OF KERNELS WITH THEIR EXPRESSION . II. P REVIOUS W ORK Name of the kernel Expression of the kernel Linear kernel K(x, xj ) = xT kx A. Support Vector Classification kx−xk k2 − σ2 RBF kernel K(x, xj ) = e SVMs [2] were proposed originally in the context of − kx−xk k2 d machine learning, for classification problems on typically Hybrid kernel K(x, xj ) = e σ2 × τ + xT kx Multiplied kernel k(x, xk ) = a × k(x, xk ) where a > 0 large sets of data which have an unknown dependence on Power kernel k(x, xk ) = − k x − xk kβ 0 < β ≤ 1 possibly many variables. We consider each of N data points Log kernel k(x, xk ) = −log(1+ k x − xk kβ ) 0 < β ≤ 1 xk ∈ Rn , k = 1, ..., N to be associated with a label yk ∈ {+1, −1} which classifies the data into one of two sets. In the simplest SVM formulation, the problem of finding a general representation of the classifier y(x) becomes that of the construction of a hyper-plane ω T xk + b which provides 2 maximal separation kωk 2 between points xk belonging to the After elimination of the variables w and e one gets the two classes. This give rise to an optimization problem of the following solution form 1 T P : minω,b ω ω s.t. 
yk [ω T xk + b] ≥ 1, (6) 2 yT where the 12 ω T ω term represents a cost function to be min- 0 b 0 = (12) imized in order to maximize separation. The constraints are y Ω + I/γ α 1ν formulated such that the nearest points xk with labels [either where y = [y1 ; ...; yN ] , 1v = [1; ...; 1] and α = [α1 ; ...; αN ] . +1 or -1] are (with appropriate input space scaling) at least 1 kωk2 distant from the separating hyper-plane. However for the The kernel trick is applied here as follows Least-Squares SVM classification modification is done such that upon the target value an error variable ek is allowed so Ωkl = yk yl φ(xk )T φ(xl ) that misclassifcations can be tolerated in case of overlapping = yk yl K(xk , xl ) k, l = 1, ..., N (13) distributions and following optimization problem is formulated in the primal weight space for given a training set {xk , yk }N k=1 The resulting LS-SVM model for classifier then becomes N 1 T 1X "N # P : min Jp (w, e) = w w+γ e2k (7) X w,b,e 2 2 y(x) = sign αk yk K(xk , x) + b . (14) k=1 k=1 together with the N constraints as given in equation 8. This formulation involves the trade off between a cost function where αk , b are the solution to the linear system given by equa- term and a sum of squared errors governed by the trade-off tion 12 and N represents the number of non-zero Lagrange parameter γ. multipliers αk , called support vectors. yk wT φ(xk ) + b = 1 − ek , k = 1, ..., N (8) A key feature of the Support Vector Machines is the To solve this ‘primal minimization problem, we construct ability to replace the input data by a non-linear function the dual maximization of eqn. (7) using the Lagrangian form φ(x) operating on the input data. This may be viewed as mapping the input data to higher dimensional space, to enable D : max L(w, b, e; α), (9) classification of data that is not linearly separable in the α original input space. 
An equivalent interpretation is that the where kernel function is a suitably-defined dot product < xk , x > N X replacing xTk x in the Hilbert space defined by the mapping φ. L = Jp (w, e) − αk {yk [wT φ(xk ) + b] − 1 + ek },(10) In this way, we avoid ever having to represent the mapping φ k=1 explicitly. In either case, the use of a kernel function allows the and αk are Lagrange multipliers. The conditions for optimality SVM representation to be independent of the dimensionality are given by of the input space. There are different kernel functions that provide the SVM, the ability to model complicated separation PN hyperplanes, as shown in Table I. However, because there ∂L ∂w = 0 → w= αk yk φ(xk ) k=1 is no theoretical tool to predict which kernel will give the ∂L PN ∂b = 0 → k=1 αk yk = 0 best results for given data set, experimenting with different ∂L kernels is only way to identify the best function. These ∂ek = 0 → αk = γek, k = 1, ..., N (11) ∂L T kernel functions must satisfy certain criteria known as Mercer ∂αk = 0 → yk w φ(xk ) + b − 1 + ek = 0, conditions for preserving the convexity of the problem. These where, k = 1, ..., N Mercer conditions are discussed in next Section. INTERNATIONAL JOURNAL OF DESIGN, ANALYSIS AND TOOLS FOR CIRCUITS AND SYSTEMS, VOL. 1, NO. 1, JUNE 2011 3 B. Mercer kernel paper. In their work, the functional constraints of an op-amp If the kernel K is a symmetric positive definite function, are posed by inheriting all the functional constraints of sub- which satisfies the Mercer’s conditions circuits. Example of current mirror is taken and its functional ∞ constraint are enumerated giving insight into the types of X K(xk , x) = ai φi (xk )φi (x), ai > 0 and, (15) constraints necessary to ensure well behaved circuits. Each i constraint defines a sub-space in electrical space. The inter- section of all such sub-spaces forms the feasibility region for Z Z K(xk , x)g(xk )g(x)dxk dx > 0 (16) well behaved circuits. 
Authors present a method for linearizing the functional constraints as well as a formula for mapping then the kernel K would represents an inner product in feature these linear approximate constraints back to the design space. space Since the approximate linear constraints are only valid around K(xk , x) = φ(xk ) · φ(x) (17) one quiescent point in both the design and electrical space, the linearized constraints can fail to detect pathological (ill and is known as Mercer Kernel. behaved) designs. As an example a folded cascode op-amp From this condition the simple rules for composition of ker- was analyzed resulting in 18 constraints on device sizes, 59 nels can be concluded, which also satisfy Mercer’s condition functional constraints and 9 free parameters in the design [3]. space. Classification accuracy of the linearized constraint set Corollary 1 (Linear combinations of kernels): Let by simulating random selection of points in the design space k1 (xk , x), k2 (xk , x) be Mercer kernels and c1 , c2 ≥ 0, then was tested. The linearized constraints statistically misclassified k(xk , x) = c1 k1 (xk , x) + c2 k2 (xk , x) (18) points inside the true feasibility region 15% of the time while misclassified points outside the true feasibility region about is also called a Mercer kernel. Moreover, the product of two 10% of time. Modeling accuracy was found to be one order Mercer kernels is a Mercer kernel, which is proved based on of magnitude better for both the linear and quadratic regression the equivalent definition of Mercer kernel. Similarly, it has when constrained to feasible design space. However, overall been proposed in [4] that we can modify the kernel functions accuracy achieved is only 70% while other drawback is that the by multiplying it by a positive factor, adding bias, or taking selection of design on which sensitivity analysis is performed, exponential of the kernel. 
The new kernel so obtained is also can change the approximated feasibility design space. a Mercer Kernel. Mercer condition needs to be satisfied for In [9], method for the automatic sizing of integrated analog keeping the problem convex and hence obtaining a unique CMOS circuit is presented that prevents bad or pathologically solution. Some of the useful modifications on kernels [5] are sized circuits, that violate basic design rules. This is done illustrated in equations (19), (20) and (21). by introducing circuit knowledge into sizing process. Basic k(xk , x) = a × k(xk , x) where a > 0 (19) sizing rules are setup on component level for transistor pairs k(xk , x) = k(xk , x) + b where b > 0 (20) and sub circuits and formulated as constraints. These structural constraints express general function and matching conditions. k(x , x) = a × e(τ +xk x) where a > 0 T k (21) A systematic consideration of these structural constraints dur- Also, two of the other kernels that are used in the present ing the sizing significantly reduce the number of free design work are power kernel and log Kernel [6] given in Table I. parameters, speed up the sizing, and prevents pathologically List of kernels that have been explored are given in Table sized circuits. The sizing is done with an iterative trust region I. All these kernels satisfy the Mercer’s condition, which is algorithm. In each step, the circuit performance and constraints necessary for the problem to be convex, and hence provides are linearized and a parameter correction with a good ratio unique and optimum solution. between error reduction and parameter deviation is calculated based on the characteristic boundary curve. The sizing result was applied to folded-cascode operational amplifier yielding C. Related Work 165 inequality constraints, 35 equality constraints and 9 free An approach to model the feasible design space and evaluate variables. 
The over all sizing time was reduced by a factor of the performance of sub-blocks at all levels has been proposed three and resulting circuit is less sensitive to process variation. in [7]. In this work, authors have used fractional factorial Authors in [10] present sizing rule method for constraining experiment design techniques to measure the significance of CMOS analog circuits such that they are well behaved and input variables. Variable screening and grouping techniques contain a minimum of free variables. The library of analog are employed to select and organize the input variables based sub-blocks developed by the authors is comprised of four upon their influence on the output response. An adaptive levels. Level 0 recognizes single transistors operating in linear volume slicing technique is used during regression analysis region or in saturation. Level 1 contains seven sub-circuits to dynamically distribute regressors such that the number composed of transistor pairs, namely a simple current mirror, of experimental runs is minimized. However, it is a rule level shifter, voltage reference, current mirror load, differential based sizing framework, resulting in less accurate solutions. pair, voltage reference and flip-flop. When building the Level 1 In [8], authors calculate feasible design space by linear ap- library, authors have enumerated all 206 possible combinations proximation. The concept of hierarchical decomposition and of transistor pairs, discarding those not used in analog CMOS application of functional constraints are used throughout the design. On Level 2, four different pairs of transistor pairs are INTERNATIONAL JOURNAL OF DESIGN, ANALYSIS AND TOOLS FOR CIRCUITS AND SYSTEMS, VOL. 1, NO. 1, JUNE 2011 4 identified as major building blocks. These are:level shifter III. P ROPOSED W ORK bank, current mirror bank, cascode current mirror and 4- transistor current mirror. 
Lastly Level 3 contains only one sub- The scope of the present work is identification of feasible circuit, the differential stage. Once defined, all sub circuits in design space for analog circuits using SVM scheme and the hierarchy were analyzed to determine a suitable set of low- evaluation of the scheme on four analog circuits two-stage level functional constraints ensuring robust design practices op-amp, cascode op-amp, voltage controlled oscillator and as well as non-pathological behavior. An algorithm is given mixer. Models used for transistors are Berkeley BSIM3 models for sub block identification for any analog topology, there in 180 nm technology. Widths of the transistors, Coupling by making it possible to automatically generate all necessary Capacitor and Bias currents for above circuits are taken as functional constraints for a generic topology. As an example design variables. A known instance of all the design variable the sizing rules methodology is applied to three analog topol- is considered a tuple. Values of these design variables for ogy, thereby making it possible to automatically generate all both circuits were randomly generated within upper and lower necessary functional constraints for a generic topology. Three bound to get a set of 10000 tuples of design variables. These application areas are mentioned where application of sizing 10000 tuples of design variable serve as input data. HSPICE is rules might be useful, these are circuit sizing, design centering run for this set of 10000 tuples of design variables. Functional and response surface modeling. It has been shown that designs constraints and performance constraints are verified using are more robust with respect to operating tolerances when HSPICE simulation. For the given set of tuples which satisfy sizing rules are obeyed. 
Further, it has been stated that sizing rules contribute to response surface modeling in several ways. First, they provide an accurate and technically relevant feasible region for an analytical model. Second, the function domain is reduced in size by the sizing-rule constraints. Finally, the performance behavior is nearly linear in the region where the sizing rules are satisfied. This results in increased accuracy of the analytical models.

The authors of [11] presented a novel approach for modeling the performance space of analog circuits based on SVMs. An analog circuit maps a set of input design parameters to a set of performance figures. This function is evaluated through simulations, and its range defines the feasible performance space of the circuit. The resulting model provides a clear separation of abstraction levels, directly modeling performance relations in place of regression on implementation parameters.

In [12], a Pareto-optimal hyperplane, which delimits the design space for the circuit at hand, is derived by the use of multiple-objective genetic optimization and multivariate regression techniques. It helps the designer explore the trade-off between competing objectives in analog and RF integrated circuit design. The results obtained can be used both in the system-level design phase for topology selection and in the circuit-level design phase for optimal design.

The proposal in [13] is an active learning scheme for feasibility design space identification. The methodology uses a committee of classifiers to exclude a large portion of the entire design space and samples only the feasibility region and its neighborhood. It improves the accuracy of the classifier with far fewer samples, and hence reduces computation time, compared to a passive learning scheme using uniform random samples. The authors of [14] presented an approach for generating yield-aware Pareto surfaces for hierarchical circuit design space exploration. A non-dominated sorting based global optimization algorithm is used to generate the nominal Pareto front for a VCO circuit. Solutions on this Pareto front are then used, with efficient Monte Carlo analysis, to compute the yield-aware Pareto fronts. These Pareto surfaces of the VCO are then used to synthesize a PLL with a targeted yield.

both functional and performance constraints, the output is taken as ‘1’, otherwise as ‘-1’. This results in 10000 input-output data pairs. Of these, 6000 are used to train the SVM classifier and 4000 are used for validation to check the accuracy of the classifier. The Least Squares Support Vector Machine Toolbox [15], interfaced with MATLAB, is used for classification. The toolbox outputs the optimized values of α and the bias. These values are used to form a classifier as shown in eqn. (14). As is evident from Section II-B, the kernel has an important role to play in classification. The suitability of various kernels is explored. Modifications are carried out on the RBF kernel and other suitable forms of kernels to obtain the Multiplied kernel, eqn. (19), and the Bias kernel, eqn. (20). The model is trained using the Linear, RBF, Log, Power, Multiplied and Hybrid kernels. The kernels are compared for accuracy and model training time when used for classification. For the two-stage op-amp, cascode op-amp and mixer, we have kept the tuning parameters σ and γ at 1 and 10, respectively, for all kernels. However, for the VCO classifier, we have compared kernels for different combinations of the tuning parameters σ and γ.

A. Accuracy measurement

The modeled classifier can be made highly accurate by properly choosing the parameters of the SVM. The generalization ability of the classifier is examined on an independent validation data set. The learned function usually deviates from the true underlying function. Let S denote the entire design space after application of geometry constraints, as illustrated in Figure 1. In Figure 1, F is the feasibility design space and F’ is the approximated feasibility space. Thus S is divided by F and F’ into four subspaces: TP of true positives, TN of true negatives, FP of false positives and FN of false negatives. Accuracy is calculated using the formula shown below. Here TP is the set of true positives, i.e., points predicted positive by the classifier that are actually positive, and similarly TN is the set of true negatives.

$$\text{accuracy} = \frac{|TP| + |TN|}{|S|} \qquad (22)$$

INTERNATIONAL JOURNAL OF DESIGN, ANALYSIS AND TOOLS FOR CIRCUITS AND SYSTEMS, VOL. 1, NO. 1, JUNE 2011

Fig. 1. Design space and its subspace [13]. (Regions labeled in the figure: S, F, F’, TP, TN, FP, FN.)

To minimize the DC offset voltage at the output node, the width of transistor M8 is taken as 2·W3·W7/W5. This is because the current through M4 is 0.5·Ibias·W5/W6. As transistors M3 and M4 are of the same size, have equal drain currents, and have the same gate-to-source voltages, the drain voltage of M4 is equal to the drain/gate voltage of M3. Thus the gate voltage of M8 is equal to the drain voltage of M4, which is equal to the drain/gate voltage of M3. This causes M8 to mirror the current through transistors M3 and M4 by the ratio W8/W3. Putting this all together, the current through M8 is 0.5·(Ibias·W5/W6)·W8/W3 and the current through M7 is Ibias·W7/W6. Equating the currents through M8 and M7 yields the necessary width W8 = 2·W3·W7/W5.

Lastly, the compensation capacitor is left as a free variable since it controls the inherent stability of the op-amp. The load capacitor is taken as a fixed variable to simplify the modeling problem. The above arguments result in the 5-dimensional parametric configuration for the two-stage op-amp. The design variables and fixed design parameters are shown in Table II. The functional constraints shown in Table III ensure that all the transistors are on and in the saturation region with some margin. We set Von,min and Vsat,min to 0.1 V.
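The current-mirror argument and the saturation constraints above can be checked numerically. The following is a minimal Python sketch under stated assumptions, not the authors' tooling: the function names (`m8_width`, `stage_currents`, `nmos_in_saturation`, `pmos_in_saturation`) are hypothetical, the transistor widths in the demo are arbitrary illustrative values, and the inequalities are the ones listed for nMOS and pMOS devices in Table III.

```python
"""Sketch of the M8 sizing rule and the saturation checks (hypothetical helper names)."""


def m8_width(w3, w5, w7):
    """Width of M8 that equalizes the M8 and M7 currents: W8 = 2*W3*W7/W5."""
    return 2.0 * w3 * w7 / w5


def stage_currents(ibias, w3, w5, w6, w7, w8):
    """Return (I_M8, I_M7) using the mirror ratios described in the text."""
    i_m4 = 0.5 * ibias * w5 / w6   # half of the mirrored tail current flows in M4
    i_m8 = i_m4 * w8 / w3          # M8 mirrors the M3/M4 current by W8/W3
    i_m7 = ibias * w7 / w6         # M7 mirrors Ibias by W7/W6
    return i_m8, i_m7


def nmos_in_saturation(vgs, vds, vth, von_min=0.1, vsat_min=0.1):
    """nMOS: Vgs - Vth >= Von,min and Vds >= Vgs - Vth + Vsat,min."""
    overdrive = vgs - vth
    return overdrive >= von_min and vds >= overdrive + vsat_min


def pmos_in_saturation(vgs, vds, vth, von_min=0.1, vsat_min=0.1):
    """pMOS: Vgs - Vth <= -Von,min and Vds <= Vgs - Vth - Vsat,min."""
    overdrive = vgs - vth          # negative for a conducting pMOS device
    return overdrive <= -von_min and vds <= overdrive - vsat_min


if __name__ == "__main__":
    # Illustrative widths in micrometres; W6 and Ibias use the fixed values quoted in Table II.
    w3, w5, w6, w7 = 10.0, 20.0, 10.0, 40.0
    w8 = m8_width(w3, w5, w7)
    i_m8, i_m7 = stage_currents(50e-6, w3, w5, w6, w7, w8)
    print("W8 =", w8, "um; I_M8 =", i_m8, "A; I_M7 =", i_m7, "A")
```

With W8 chosen by the sizing rule, the two returned currents agree up to floating-point rounding, which is the balance condition the derivation relies on. A sample would be labeled feasible only if every transistor passes the check for its type.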
Fig. 2. Two-stage op-amp [16].

IV. EXPERIMENTAL SETUP

We use two op-amps, a voltage controlled oscillator and a mixer as our illustrative examples. We show the accuracy improvement of the feasibility classifiers constructed with the proposed kernels over those constructed with standard kernels. The classifiers constructed using the different kernels were trained and tested using data generated from HSPICE, on a workstation with a Pentium Core2 Duo (3 GHz) processor and 1 GB of RAM, running the Redhat WS4 operating system.

A. Two Stage op-amp

The two-stage op-amp is shown in Figure 2. As all transistors are required to operate in saturation mode, we fix the length of all transistors to a nominal minimum length. This immediately eliminates nearly half of the free design parameters. Further, the size of transistor M1 should equal that of M2, and the size of M3 should equal that of M4, to equalize the currents through the differential pair. Both W1 = W2 and W3 = W4 are left as free parameters [17]. Transistor M6 can be fixed to some minimum nominal size since its job is simply to mirror the reference current Ibias, which can also be fixed. The widths of transistors M5 and M7 control the current through the differential pair and the output stage, respectively, and are also left as free parameters.

TABLE II. DESIGN VARIABLES OF THE TWO STAGE OP-AMP.
Design variables:
  W1 = W2 | [1 µm, 100 µm]
  W3 = W4 | [1 µm, 50 µm]
  W5 | [1 µm, 100 µm]
  W7 | [1 µm, 100 µm]
  W8 | 2×W3×W7/W5
  C1 | [5 pF, 20 pF]
Fixed design parameters:
  L1, ..., L8 | [0.5 µm]
  W6 | 10 µm
  Ibias | 50 µA
  CL | 5 pF

TABLE III. FUNCTIONAL CONSTRAINTS OF THE TWO STAGE OP-AMP.
nMOS transistor | pMOS transistor
Vgs − Vth ≥ Von,min | Vgs − Vth ≤ −Von,min
Vds ≥ Vgs − Vth + Vsat,min | Vds ≤ Vgs − Vth − Vsat,min

B. Cascode op-amp

The circuit of the cascode op-amp is shown in Figure 3. We fix the lengths of all transistors to 0.5 µm. Imposing sizing rules similar to those of the two-stage op-amp [10], we get five design variables for the cascode op-amp. The design variables and fixed design parameters are shown in Table IV. Here W indicates the width of a transistor and L indicates its length. Ibias is the bias current as shown in Figure 3. The functional constraints in Table III apply with Von,min and Vsat,min set to 0.1 V.

Fig. 3. Cascode op-amp [17].

Fig. 4. Voltage controlled oscillator [18]. (Component values labeled in the figure: IC1 = 2 mA, L = 500 pH, R1 = 300, R2 = 300.)

TABLE IV. DESIGN VARIABLES OF THE CASCODE OP-AMP.
Design variables:
  W1 = W2 | [1 µm, 100 µm]
  W3 = W4 | [1 µm, 100 µm]
  W5 | [1 µm, 100 µm]
  W6, W7, W8, W9 | [W3]
  W15 | [2×W5×W6/W3]
  W12, W13, W14 | [0.25 × W3]
  Ibias | [2 µA, 20 µA]
  CL | [1 pF, 10 pF]
Fixed design parameters:
  L1, ..., L16 | [0.5 µm]
  W10, W11, W16 | [10 µm]

C. Voltage controlled oscillator

A voltage controlled oscillator [18] is shown in Figure 4. The widths of transistors M1, M2, M3, M4, M5, M6, M9, M10 and M11 are taken as design variables. The design variables along with the fixed design parameters are shown in Table V. The functional constraints shown in Table III ensure that all the transistors are on and in the saturation region, except for transistors M5 and M6, which will be in the linear region.

TABLE V. DESIGN VARIABLES OF THE VOLTAGE CONTROLLED OSCILLATOR.
Design variables:
  W1 = W2 | [100 µm, 500 µm]
  W3 = W4 | [50 µm, 300 µm]
  W5 = W6 | [20 µm, 250 µm]
  W9 = W11 | [100 µm, 500 µm]
  W10 | [1200 µm, 2000 µm]
Fixed design parameters:
  L1, L2 | [0.9 µm]
  L3, L4 | [0.7 µm]
  L5, L6 | [12 µm]
  L7, L8, L9, L11 | [0.2 µm]
  L10, L12, L13 | [6 µm]
  W7 | [2000 µm]
  W8 | [200 µm]
  W12 = W13 | [1500 µm]
  L (Inductor) | [500 pH]

D. Mixer

A low voltage mixer [13] is shown in Figure 5. The lengths of all transistors are fixed at 1.0 µm. The design variables and fixed design parameters are listed in Table VI. It is required that all nMOS transistors are on and biased in saturation, whereas the pMOS transistors should also be on but biased in the linear region, as they behave as resistors. The functional constraints for both pMOS and nMOS transistors are shown in Table VII.

TABLE VI. DESIGN VARIABLES OF THE MIXER.
Design variables:
  W1 = W2 = W3 = W4 | [50 µm, 200 µm]
  W5 = W6 | [100 µm, 400 µm]
  W7 = W8 = W9 | [30 µm, 120 µm]
  W_L1 = W_L2 | [6 µm, 24 µm]
  Ibias | [1 mA, 2 mA]
  VRF | [2 V, 3 V]
  VLO | [1.5 V, 2.5 V]
Fixed design parameters:
  L1, ..., L8, L_ML1, L_ML2 | [1.0 µm]

TABLE VII. FUNCTIONAL CONSTRAINTS OF THE MIXER.
nMOS transistor (Saturation region) | pMOS transistor (Linear region)
Vgs − Vth ≥ 0 | Vgs − Vth ≤ 0
Vds ≥ Vgs − Vth + Vsat,min | Vds ≥ Vgs − Vth − Vsat,min

V. RESULTS

We have observed significant improvement in the accuracy of the classifiers of the four circuits constructed with the proposed kernels. The corresponding results are shown in Tables VIII and IX, which compare accuracy and model training time for the different kernels for the two-stage op-amp, cascode op-amp, mixer and voltage controlled oscillator, respectively. We observe a significant speed-up in model training time, with similar or better accuracy, with the use of particular kernels. These results suggest an improvement in the performance of the classifier using the proposed kernels, i.e., the Log, Power and Multiplied kernels. The Log and Power kernels lead to higher accuracy, whereas the Multiplied kernel provides a speed-up with moderate accuracy. Classifiers with Linear kernels are fastest but are very low on accuracy, and Hybrid kernels perform worst

TABLE VIII. COMPARING KERNELS FOR ACCURACY AND MODEL TRAINING TIME (MODEL T-T) OF DIFFERENT CIRCUITS.
Kernels | Accuracy (in %), Two stage op-amp | Accuracy (in %), Cascode op-amp | Accuracy (in %), Mixer circuit | Two stage op-amp (Model T-T) | Two stage op-amp Speed-up | Cascode op-amp (Model T-T) | Cascode op-amp Speed-up | Mixer circuit (Model T-T) | Mixer circuit Speed-up
Linear | 89.1 | 82.7 | 83.5 | 4.5 | 89 | 4.32 | 96 | 2.57 | 10.6
RBF | 92.1 | 90.8 | 91.00 | 403.37 | 1.0 | 415.23 | 1.0 | 27.07 | 1.0
Log | 96.7 | 95.2 | 91.15 | 121.20 | 3.3 | 124.42 | 3.3 | 17.09 | 1.5
Power | 96.8 | 95.4 | 91.80 | 157.27 | 2.6 | 156.94 | 2.6 | 19.57 | 1.4
Multiplied | 94.3 | 95.8 | 90.60 | 110.31 | 3.7 | 112.33 | 3.7 | 12.73 | 2.1
Hybrid | 90.0 | 85.3 | 63.5 | 456.23 | 0.89 | 477.65 | 0.87 | 30.37 | 0.9

TABLE IX. ACCURACY AND MODEL TRAINING TIME FOR VCO WITH DIFFERENT σ AND γ.
(For each setting: accuracy in %, model training time in sec.)
Kernels | σ = 0.2, γ = 10: Accuracy | σ = 0.2, γ = 10: Time | σ = 1.0, γ = 10: Accuracy | σ = 1.0, γ = 10: Time | σ = 10, γ = 10: Accuracy | σ = 10, γ = 10: Time | σ = 27, γ = 100: Accuracy | σ = 27, γ = 100: Time
Linear | 79.80 | 3.96 | 80.80 | 3.97 | 80.80 | 3.97 | 78.80 | 3.73
RBF | 96.05 | 46.83 | 99.05 | 121.80 | 98.05 | 123.80 | 99.00 | 145.80
Log | 99.15 | 29.60 | 99.15 | 29.89 | 99.15 | 30.89 | 99.15 | 31.89
Power | 99.10 | 34.23 | 99.10 | 32.80 | 99.10 | 34.80 | 99.10 | 35.48
Multiplied | 96.25 | 35.58 | 97.25 | 35.58 | 96.25 | 49.58 | 97.25 | 49.58
Hybrid | 40.45 | 62.25 | 59.55 | 389.25 | 59.55 | 423.25 | 57.55 | 463.25

Fig. 5. Mixer circuit [13].

on both parameters in the case of the voltage controlled oscillator and mixer classifiers. For the VCO, different combinations of tuning parameters were explored, which are shown in Table IX. From Table IX, we observe that the optimum values of accuracy and model training time are obtained for σ = 1.0 and γ = 10.

VI. CONCLUSIONS & FUTURE WORK

We have presented a feasibility macromodel, which can be used during synthesis of analog circuits. The generated model, incorporating the proposed kernels, has been found to be much more efficient while generating the classifiers, compared to those constructed using standard kernels. We treated the feasible design space identification problem as a two-class classification problem so that the comparison can be done for larger data sets. Thus, we are able to build accurate and fast feasibility macromodels, which can save considerable computation time during circuit sizing, when circuit performance parameters are to be evaluated a large number of times in a stochastic optimization engine.

Further work using the proposed kernels for regression problems as well is being pursued. Also, a method is to be adopted to tune the parameters of the kernels for different sets of application circuits. Work still remains to be done to further improve the accuracy of the macromodels.

ACKNOWLEDGMENTS

We are grateful to Prof. R. Sharan, LNM-IIT, Jaipur (Ex-professor, Indian Institute of Technology Kanpur, India) and Prof. D. Nagchoudhuri, DA-IICT Gandhinagar (Ex-professor, Indian Institute of Technology Delhi, India) for very helpful suggestions during the work. We thankfully acknowledge laboratory support provided for the research work by the Ministry of Communication & Information Technology, Govt. of India, through phase-2 of the Special Manpower Development Project for VLSI Design and related software.

REFERENCES

[1] M. Ding and R. Vemuri, "A combined feasibility and performance macromodel for analog circuits," in Proc. IEEE Design Automation Conference, 2005, pp. 63–68.
[2] V. Vapnik, The Nature of Statistical Learning Theory. Springer-Verlag, 1995.
[3] G. Smits and E. Jordaan, "Improved SVM regression using mixtures of kernels," in Proceedings of International Joint Conference on Neural Networks, vol. 3, 2002, pp. 2785–2790.
[4] J. A. Suykens, T. Gestel, J. Brabenter, B. Moor, and J. Vandewalle, Least Square Support vector Machines. World Scientific Publishing Co. Pte. Ltd, 2002.
[5] D. Boolchandani, C. Gupta, and V. Sahula, "Analog circuit feasibility modeling using support vector machine with efficient kernel functions," in Proc. IAENG-ICEE: Design, Analysis and Tools for Integrated Circuits and System, 18-20 March 2009, pp. 1609–1614.
[6] S. Boughorbel, J.-P. Tarel, and N. Boujemaa, "Conditionally positive definite kernels for SVM based image recognition," in Proc. IEEE International Conference on Multimedia and Expo, 2005, pp. 113–116.
[7] R. Harjani and J. Shao, "Feasibility and performance region modeling of analog and digital circuits," Analog Integrated Circuits and Signal Processing, vol. 10, pp. 23–43, 1996.
[8] S. Zizala, J. Eckmuller, and H. Grab, "Fast calculation of analog circuits' feasibility regions by low level functional measures," in Proc. IEEE International Conference on Electronics, Circuits and Systems, vol. 2, 1998, pp. 85–88.
[9] R. Schwencker, J. Eckmueller, H. Graeb, and K. Antreich, "Automating the sizing of analog CMOS circuits by consideration of structural constraints," in Proceedings of Design, Automation and Test in Europe Conference and Exhibition, 1999, pp. 323–327.
[10] H. Graeb, S. Zizala, J. Eckmueller, and K. Antreich, "The sizing rules method for analog integrated circuit design," in Proc. IEEE/ACM International Conference on Computer Aided Design, 2001, pp. 343–349.
[11] F. D. Bernardinis, M. I. Jordan, and A. L. Sangiovanni-Vincentelli, "Support Vector Machines for analog circuit performance representation," in Proc. DAC, 2003, pp. 964–969.
[12] B. De Smedt and G. Gielen, "Watson: design space boundary exploration and model generation for analog and RFIC design," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 22, no. 2, pp. 213–224, Feb. 2003.
[13] M. Ding and R. Vemuri, "An active learning scheme using support vector machines for analog circuit feasibility classification," in Proceedings of IEEE International Conference on VLSI Design, 2005, pp. 528–534.
[14] S. K. Tiwary, P. K. Tiwary, and R. A. Rutenbar, "Generation of yield-aware pareto surfaces for hierarchical circuit design space exploration," in Proc. IEEE Design Automation Conference, 2006, pp. 31–36.
[15] Least Squares Support Vector Machine Matlab/C Toolbox. http://www.esat.kuleuven.be/sista/lssvmlab.
[16] G. Wolfe and R. Vemuri, "Extraction and use of neural network models in automated synthesis of operational amplifiers," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 22, no. 2, pp. 198–212, Feb. 2003.
[17] G. A. Wolfe, Performance Macro-Modeling Techniques for Fast Analog Circuit Synthesis. Ph.D. thesis, University of Cincinnati, US, 2004.
[18] V. R. Yelamanchili, Compilation and performance estimation of analog and mixed-signal circuits. Master's thesis, University of Cincinnati, US, 2003.

D. Boolchandani received his Bachelor of Engineering degree in Electronics with honors from Malaviya National Institute of Technology, Jaipur in 1988. He obtained his Master of Technology degree in Design & Technology from the Indian Institute of Science in 1998. He is currently Associate Professor in the Department of Electronics and Communications Engineering at National Institute of Technology Jaipur, India. He has recently submitted his doctorate thesis entitled "On Macromodeling of Analog Circuits using Support Vector Machines (SVMs)". His research interests are in the areas of analog & digital CMOS circuits and Analog CAD. He is a member of IEEE, and also a member of IETE India.

V. Sahula received his Bachelor of Engineering degree in Electronics with honors from Malaviya National Institute of Technology, Jaipur in 1987. He obtained his Master of Technology degree in Integrated Electronics & Circuits in 1989 and his Ph.D. in Electrical engineering in 2001, both from the Indian Institute of Technology, Delhi. He joined as a faculty member at the Department of Electronics and Communications Engineering, National Institute of Technology Jaipur, India, where he is currently an Associate Professor. His research interests are in the areas related to high level design, modeling & synthesis of analog & digital systems, and CAD for VLSI. He has two journal papers and more than 30 refereed conference papers to his credit. He has served on the Technical program committee of the VLSI Design and Test Symposium held in India (1998-2009). He has also served on the organizing committee as fellowship-chair of the 22nd IEEE International Conference on VLSI Design, 2009, India. He is a senior member of IEEE, Life Fellow of IETE, Life member of IMAPS and member of VLSI Society of India and ACM SIGDA.

Dynamic Power Reduction of Stalls in Pipelined Architecture Processors

Pejman Lotfi-Kamran, Ali-Asghar Salehpour, Amir-Mohammad Rahmani, Ali Afzali-Kusha, and Zainalabedin Navabi

Abstract: This paper proposes a technique for dynamic power reduction of pipelined processors. It is based on eliminating unnecessary transitions that are generated during the execution of NOP instructions. The approach includes the elimination of unnecessary changes in pipe register contents and the limitation of boundary movement of transitions caused by inevitable changes in pipe register contents due to insertion of a NOP into a pipelined processor. To assess its efficiency, the proposed technique is applied to MIPS, DLX, and PAYEH processors considering a number of benchmarks. The experimental results show that the techniques can lead to up to 10% reduction in the dynamic power consumption at a cost of negligible (almost zero) speed and (about 0.2%) area overheads.

Index Terms: Dataflow architectures, low-power design, pipelined processors, stall.

I. INTRODUCTION

Power dissipation limits have emerged as a major constraint in the design of microprocessors, where speed has traditionally been the primary goal [1]. At the low end of the performance spectrum, namely in the category of handheld and portable devices or systems, power has always been a more critical design constraint than speed [2]-[9].

In battery-powered applications, where speed is less of a concern, relatively simple RISC (Reduced Instruction Set Computers) like pipelines are often used [10], [11]. Pipelined processors frequently insert a NOP (No Operation Performed) instruction into the pipe to eliminate hazards and generate some delay for the proper execution of the instructions [13]. There are three types of hazards: structural, data, and control [13]. A structural hazard may occur when there are not enough hardware resources for the execution of a combination of instructions. While in processors with simple architectures this hazard is usually eliminated in the design phase, it occurs in architectures that use more than one functional unit for instruction level parallelism [13], [14]. A data hazard occurs when an instruction needs the result of a prior instruction that is still in the pipeline and whose result is not ready. This occurs when there is not enough latency between these two instructions, which are considered data dependent. A technique for preventing data hazards is to use a forwarding unit [13]. The forwarding unit detects the dependencies and forwards the required data from the running instruction to the dependent instructions. In some cases, it is impossible to forward the result because it may not be ready. In these situations, using a NOP instruction is inevitable [13], [14]. The last type of hazard is the control hazard, which occurs when a branch prediction is mistaken or, in general, when the system has no mechanism for branch prediction. There are two mechanisms for handling the control hazard. The first mechanism runs instructions after a branch and flushes the pipe after a misprediction. Generally, flush mechanisms are not cost effective. A better solution to handle the control hazard is to fill the pipe after the jump instruction with a specific number of NOPs. This mechanism is called the delayed jump mechanism and is used widely in DSP processors [13], [14].

The NOP instruction does not contribute to any useful work. Therefore, the power consumed for its execution is wasted. Our study indicates that the percentage of dynamic power consumed by NOP instructions in a pipelined processor is considerable. There are many works that have targeted the power optimization of pipelined processors (see, e.g., [17], [26]). Among them, several solutions have been presented to reduce the number of NOP instructions [13]. Even when these techniques are employed, a large number of stalls still remain. Therefore, the power consumption of the processors may be reduced further by reducing the power spent on executing the NOP instruction itself.

The aim of this paper is to reduce the dynamic power consumption of a pipelined processor by eliminating the useless transitions that are generated in the pipeline when a NOP instruction passes through the pipe stages.¹ This is performed by modifying the architecture of RISC processors. The rest of the paper is organized as follows. Section 2 outlines the design of the baseline pipelined processor used in this work, while Section 3 motivates the need for a technique for reducing the dynamic power consumption of a pipelined processor when a stall happens. In Section 4, our proposed technique for reducing the dynamic power consumed during a NOP execution is presented. The microarchitectural changes to the baseline pipelined processor for implementing the proposed technique are presented in Section 5. The results are discussed in Section 6, while the summary and conclusions are given in the last section.

All authors are with the School of Electrical and Computer Engineering, University of Tehran, Iran. E-mail: plotfi@computer.org, a.salehpour@ece.ut.ac.ir, am.rahmani@ece.ut.ac.ir, afzali@ut.ac.ir, navabi@ece.neu.edu. A.-M.
Rahmani is also with Computer Systems Lab., Department of Information Technology, University of Turku, Turku, Finland. E-mail: amir.rahmani@utu.fi.

¹ A preliminary version of this work appeared in the Proc. of VLSI Symposium 2008 [15].

II. BASELINE PIPELINED PROCESSOR

Figure 1 shows the diagram of the microarchitecture of a conventional pipelined processor based on a 5-stage 32-bit Von Neumann MIPS I architecture [13]. While we restrict the discussion to a MIPS like processor architecture, the proposed approach may be applied to other types of architecture. The five stages are: Instruction extraction (FETCH), Instruction decoding (DECODE), Instruction Execution (EXECUTE), Memory access (MEMORY), and Update registers (WRITE BACK). Only two instructions can access the memory. The processor contains 32 registers. Data hazards are resolved with a bypass unit, while branch hazards are resolved by predicting the address results. Interruptions and exceptions are taken care of by a system coprocessor. Furthermore, this processor has 4-way instruction and 4-way data caches. If a hit happens, the data is immediately sent to the next stage, while if a miss occurs, the processor should wait for the data to become ready. In the first stage, i.e., FETCH, the next instruction is read from the memory and is loaded into the FE/DE register at the end of this stage. In the second stage, i.e., DECODE, the instruction is decoded and the values of the registers needed for running this instruction are read from the register file. In addition, if an immediate value is used in an instruction, the immediate value is properly sign extended or zero filled. It is in this stage that the controlling signals for running the instruction are generated. These controlling signals include the signals for writing to memory and register file as well as the multiplexer selects and the ALU operation type. In the third stage, i.e., EXECUTE, the desired operation is performed on the data extracted in the previous stage. For a branch instruction, the result is computed and, based on the computed result, the next value of the PC (program counter) is determined. In the fourth stage, i.e., MEMORY, depending on the instruction, the desired value is written into a memory location or the content of a memory location is read. In the last stage, i.e., WRITE BACK, based on the instruction, the computed value is written into the register file.

In some situations, due to a dependency between two successive instructions, the data needed by the second instruction must be produced by the first instruction. In these cases, when the second instruction is in the decode stage, the data loaded from the register file are not valid. However, it is possible that when the second instruction actually needs the data in the later stages of the pipeline, the first instruction has produced the data. Therefore, a forwarding unit is added to the pipeline. If a data field is not valid, the forwarding unit tries to forward the valid data from the subsequent stages. In some situations, the first instruction cannot produce the data needed by the second instruction even by the time this instruction needs the data. In these cases, the second instruction should run with at least a clock cycle delay. Therefore, in the DECODE stage, these cases are determined and a stall is inserted between the two instructions. When a stall is inserted into the pipe, the FETCH stage stops running (the PC is not loaded with a new value) and the controlling signals in the DECODE stage are deactivated for a NOP to be inserted into the pipeline. The controlling signals can be divided into two parts, critical and non-critical. Examples of the critical control signals, which should be deactivated for the correct operation of a NOP, include writing to the memory or register files. The non-critical control signals are those signals that do not affect the correct execution of the NOP instruction and hence behave as "do not care" signals. For the NOP insertion, only the critical control signals ought to be deactivated.

Fig. 1. Dataflow diagram of MIPS pipelined architecture [13].

III. MOTIVATION FOR OUR APPROACH

After the DECODE stage, the generated control signals are used to control the flow of data. In this stage, if the control unit determines that the current instruction depends on the former instructions and forwarding cannot resolve the dependency, the control unit inserts a NOP instruction by deactivating the critical control signals of the current instruction, including the control signals for writing to memory and register file. The NOP instructions are inserted into the pipeline to eliminate hazards. These inserted NOP instructions contribute to the overall dynamic power of a pipelined processor by generating a number of unnecessary transitions. This is explained by an example in our baseline pipelined processor.

Fig. 2. A simple MIPS program.

A simple program is shown in Figure 2. The first instruction is a LOAD instruction that reads data from memory and the second instruction is an ADD instruction that uses the loaded data. Because of the dependency between these two instructions, a NOP instruction should be inserted into the pipeline after the LOAD instruction. During the execution of the simple program of Figure 2, when the LOAD instruction is in the DECODE stage, the control signals and the required data corresponding to this instruction are generated/extracted. On the rising edge of the clock, the generated/extracted control/data are latched into the DE/EX pipeline register. In the next clock cycle, the ADD instruction is in the DECODE stage and the control unit determines that a NOP instruction should be inserted into the pipeline. Therefore, the critical control signals of the ADD instruction are deactivated, and these deactivated critical control signals, along with the other control signals and the required data of the ADD instruction (the current instruction in the DECODE stage), are latched on the rising edge of the clock. Generally, the data parts of the current (i.e., ADD) and previous (i.e., LOAD) instructions are different. This means that the data part of the NOP is different from that of the former instruction (i.e., LOAD). Therefore, passing the NOP instruction through the pipe generates a number of transitions. In the third clock cycle, the ADD instruction should be passed to the pipeline. At this time, the control signals corresponding to the ADD are generated and latched along with its required data. Since the data and non-critical control signals of the NOP and ADD instructions are not the same, the number of transitions induced during the passage of the ADD instruction through the pipeline stages is not negligible. This imposes some dynamic power consumption. The objective of this paper is to minimize these transitions.

IV. THE PROPOSED TECHNIQUES

As discussed, the data part of an inserted NOP instruction is not the same as that of its preceding or subsequent instruction, which generates a number of transitions. In addition, passing the pending instruction after the NOP produces more transitions. These transitions waste power and should be minimized. For the NOP instruction to generate as few transitions as possible, its data part should be the same as that of its preceding or subsequent instruction. Because some of the data of the instruction passing through the pipe after the NOP are unavailable, the data part of the instruction preceding the NOP should be used as the data part of the NOP instruction. This way, as a NOP instruction passes through the pipe, the same operations are performed on the same data, relative to the previous cycle, in all stages of the pipeline, minimizing the number of transitions. In addition, the non-critical part of the control signals may be the same as those of the preceding instruction. The proposed idea may be implemented in the DECODE stage by modifying the architecture of the baseline processor, as will be explained in Section 5.

The technique decreases the number of unnecessary transitions generated when a NOP is inserted into the pipe. When the data part of the instruction before the NOP is valid in the DECODE stage, the proposed technique guarantees that no useless transitions are generated as a NOP instruction passes through the pipe. However, if parts of the data of the previous instruction are not valid in the DECODE stage, then, for the correct execution of this instruction, valid data will be prepared by the forwarding unit. To minimize the number of transitions generated during the execution of the NOP in this case, the same data should be prepared for the NOP instruction. If the valid data of the instruction preceding the NOP are still in some pipe registers when the NOP instruction needs them, the forwarding unit prepares the data for the NOP as well. In these cases, a few transitions are generated during the execution of the NOP instruction. On the other hand, if the valid data are not available in any pipe register when the NOP instruction needs them (because the processing of the instruction that generates those data has finished and it has left the pipe), different data are loaded into some operators, generating a number of useless transitions which may propagate to the last stage of the pipeline. Here, we propose a technique to prevent the propagation of these transitions to all the pipeline stages. For this purpose, the outputs of the NOP instruction should be the same as those of its preceding instruction in all the pipe stages to minimize useless transitions. With this technique, the value which is loaded in each pipe register for the NOP is the same as that of the previous instruction except for the critical control signals. Therefore, only the critical control signals of pipe registers should be loaded during the execution of NOP instructions. Using this approach, if the data of a NOP instruction are not valid (i.e., the NOP data differ from those of the instruction preceding it) and the valid data are not available in the pipe registers (the forwarding unit cannot provide the valid data for the NOP), the change of data, which leads to some transitions, is inevitable. However, these transitions propagate only until they reach a pipe register, where their propagation is stopped because writing into pipe registers has been stopped. Therefore, transitions cannot propagate through the entire pipeline; this limits the propagation boundary.

Fig. 3. Modified MIPS architecture based on the proposed techniques.

V. MODIFICATION OF BASELINE ARCHITECTURE

The techniques proposed in the previous section included the case where the data of the instruction preceding the NOP is known and the case where the data of the preceding instruction is not known. To implement the technique based on the first case, it is sufficient to add a load enable control signal to the data and non-critical control parts of the DE/EX pipe register. This way, only the critical control signals (such as the write to memory and register file signals), which should be loaded in each clock cycle, are not controlled by the added load enable signal. When it is decided that a NOP is to be inserted into the pipe, the controller should deactivate the load enable signal. For the second case, a load enable control signal is added to each pipe register after the DECODE stage. This control signal is only applied to the data and non-critical control parts of the pipe registers. By deactivating the load enable of a pipe register when NOP results are written into it, only the critical control signals of that pipe register are changed and its other parts remain unchanged. Like other control signals, these load enable signals are generated by the controller in the DECODE stage and are propagated through the pipe registers to the desired destination (i.e., a specific pipe register). Figure 3 illustrates the mechanism of propagating the load enable control signals in the pipe registers. In the DECODE stage, if the controller detects that the current instruction is dependent on the former instructions and hence a NOP should be inserted into the pipe, the load enable control signals of all upcoming stage registers (i.e., ID/EX, EX/MEM, MEM/WB) are activated. The load enable of the ID/EX pipe register is directly fed into it, while the load enables of the EX/MEM and MEM/WB pipe registers (i.e., LN1 and LN2, respectively) are propagated through pipe registers to reach and be fed into the desired pipe register. On the other hand, if the controller does not find it necessary to insert a NOP into the pipe, all load enable control signals are deactivated in the DECODE stage.

VI. RESULTS AND DISCUSSION

In this section, we discuss the power reduction, area overhead, and timing penalty of our proposed power reduction technique. The techniques have been implemented in three general processors: MIPS [13], DLX [13], and PAYEH [14]. MIPS is a 5 stage pipelined processor whose architecture is RISC with fixed-width 32-bit instructions. The details of this processor can be found in reference [13]. DLX is a text book example of a RISC processor with a 5 stage pipeline using forwarding to avoid data hazards. The DLX processor uses a load-store architecture. All DLX instructions are 32 bits long. It has 32 32-bit registers [13]. PAYEH is a pipelined version of SAYEH [27] with a similar instruction set and has five pipe stages. SAYEH is a multi-cycle RISC processor

TABLE I. CHARACTERISTICS OF ORIGINAL AND MODIFIED MIPS, DLX, AND PAYEH PIPELINED PROCESSORS
Processor | Original Area (µm²) | Modified Area (µm²) | Area Overhead (%) | Original Frequency (MHz) | Modified Frequency (MHz) | Frequency Overhead (%)
MIPS | 199737.92 | 200215.89 | 0.24 | 50.76 | 50.76 | ≈0
DLX | 106396.99 | 106577.86 | 0.17 | 80 | 80 | ≈0
PAYEH | 919530.45 | 921185.60 | 0.18 | 129.33 | 129.33 | ≈0

TABLE IV. POWER CONSUMPTIONS OF ORIGINAL AND MODIFIED PAYEH PROCESSORS FOR DIFFERENT BENCHMARKS AND INPUTS
Benchmark | input | ORG(mW) | MOD(mW) | IMP(%)
Factorial | n=10 | 1065.236 | 972.4543 | 0.0871
Factorial | n=20 | 1070.558 | 975.278 | 0.089
Factorial | n=30 | 1073.907 | 978.2221 | 0.0891
Factorial | n=40 | 1074.204 | 988.1603 | 0.0801
Factorial | n=50 | 1077.108 | 981.5689 | 0.0887
Fibonacci | n=10 | 963.8127 | 885.4547 | 0.0813
Fibonacci | n=20 | 976.6053 | 897.4026 | 0.0811
Fibonacci | n=30 | 998.4231 | 908.9644 | 0.0896
Fibonacci | n=40 | 1015.894 | 936.5526 | 0.0781

TABLE II. POWER CONSUMPTIONS OF ORIGINAL AND MODIFIED MIPS PROCESSORS FOR DIFFERENT BENCHMARKS AND INPUTS
Benchmark | input | ORG(mW) | MOD(mW) | IMP(%)
Factorial | n=10 | 502.47 | 453.43 | 0.0976
Factorial | n=20 | 504.98 | 455.95 | 0.0971
Factorial | n=30 | 506.56 | 457.47 | 0.0969
Factorial | n=40 | 506.7 | 458.77 | 0.0946
Factorial | n=50 | 508.07 | 460.72 | 0.0932
Fibonacci | n=10 | 465.61 | 427.85 | 0.0811
Fibonacci | n=20 | 471.79 | 431.97 | 0.0844
Fibonacci | n=30
482.33 443.79 0.0799 n=50 1041.417 953.7297 0.0842 n=40 490.77 446.85 0.0895 n=10 1027.732 949.5212 0.0761 n=50 503.1 459.18 0.0873 n=20 1030.812 956.3874 0.0722 n=10 503.79 471.75 0.0636 Power n=30 1031.812 951.6398 0.0777 n=20 505.3 473.87 0.0622 n=40 1034.504 958.158 0.0738 Power n=30 505.79 473.98 0.0629 n=50 1032.77 956.0356 0.0743 n=40 507.11 472.73 0.0678 n=10 1082.403 1004.47 0.072 n=50 506.26 475.78 0.0602 n=20 1092.483 1007.816 0.0775 n=10 515.43 483.58 0.0618 Vector Addition n=30 1090.404 1011.677 0.0722 n=20 520.23 488.65 0.0607 n=40 1095.717 1018.688 0.0703 Vector Addition n=30 519.24 483.46 0.0689 n=50 1099.623 1017.261 0.0749 n=40 521.77 490.72 0.0595 n=50 523.63 491.11 0.0621 with 16-bit data and 16-bit address buses. PAYEH architecture uses a forwarding unit. This forwarding unit can resolve all dependencies by forwarding the required data from the next pipe stages to the previous ones. TABLE III The original and modified processors were synthesized by P OWER C ONSUMPTIONS O F O RIGINAL AND M ODIFIED DLX P ROCESSORS FOR D IFFERENT B ENCHMARKS AND I NPUTS Synopsys D.C. using 130nm TSMC library. Table I shows the reported area and frequency of these processors. As expected, Benchmark input ORG(mW) MOD(mW) IMP(%) the proposed method does not have any adverse effect on n=10 600.45 564.00 0.0607 n=20 603.45 567.24 0.06 the frequency of the processors. This is due to the fact that Factorial n=30 605.34 569.75 0.0588 the method does not affect the critical paths. Table I also n=40 605.51 564.82 0.0672 reveals that the hardware overhead of the proposed technique n=50 607.14 568.71 0.0633 n=10 561.06 523.75 0.0665 is negligible (< 0.3%). n=20 568.507 529.9622 0.0678 Four benchmark programs were used to measure the effec- Fibonacci n=30 581.2077 546.8583 0.0591 tiveness of the proposed dynamic power reduction technique. 
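The IMP column of Tables II to IV appears to report the fractional reduction (ORG - MOD)/ORG, so that an entry of 0.0976 corresponds to the 9.76% quoted in the text. The following Python sketch checks this reading against the Factorial rows of Table II. The helper name `improvement` and the list name `MIPS_FACTORIAL` are illustrative choices, not part of the original work.

```python
def improvement(org_mw: float, mod_mw: float) -> float:
    """Fractional power reduction of the modified design relative to the original."""
    return (org_mw - mod_mw) / org_mw


# (input size n, ORG in mW, MOD in mW, reported IMP) for the MIPS Factorial rows of Table II
MIPS_FACTORIAL = [
    (10, 502.47, 453.43, 0.0976),
    (20, 504.98, 455.95, 0.0971),
    (30, 506.56, 457.47, 0.0969),
    (40, 506.70, 458.77, 0.0946),
    (50, 508.07, 460.72, 0.0932),
]

for n, org, mod, reported in MIPS_FACTORIAL:
    computed = improvement(org, mod)
    print(f"n={n}: computed {computed:.4f}, reported {reported:.4f}")
```

Under this assumption the computed values agree with the reported column to four decimal places for these rows.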
TABLE III (continued)

Benchmark        Input  ORG (mW)  MOD (mW)  IMP (fraction)
Fibonacci        n=40   591.3779  552.9383  0.065
Fibonacci        n=50   606.2355  569.8007  0.0601
Power            n=10   604.548   571.2979  0.055
Power            n=20   606.36    574.1623  0.0531
Power            n=30   606.948   574.2335  0.0539
Power            n=40   608.532   572.6895  0.0589
Power            n=50   607.512   575.0709  0.0534
Vector Addition  n=10   628.8246  598.8297  0.0477
Vector Addition  n=20   634.6806  603.2639  0.0495
Vector Addition  n=30   633.4728  604.9032  0.0451
Vector Addition  n=40   636.5594  604.0312  0.0511
Vector Addition  n=50   638.8286  606.1206  0.0512

The Factorial benchmark reads a number and calculates its factorial, while the Fibonacci benchmark reads a number and computes the Fibonacci series up to the requested element. The Power benchmark reads two numbers, a and b, and calculates a to the power b (i.e., a^b), and the Vector Addition benchmark reads two vectors and calculates their element-by-element sum. These benchmark programs were applied to the original and modified synthesized processors. Every benchmark program was run five times, each time with a different input size. The input size determines the run-time complexity of the program. Input n = 10 (50) corresponds to the least (most) complex run.

As Table II indicates, for the MIPS processor, a maximum dynamic power reduction of 9.76% was achieved. The average power reduction of the proposed approach was about 7.66% for this processor. The table shows that as the complexity of a program increases, the dynamic power consumed by the processors also increases. Almost the same results were achieved for the DLX and PAYEH processors. For the DLX processor, as Table III indicates, a maximum and average power reduction of 6.22% and 5.69%, respectively, were achieved. For the PAYEH processor, the maximum and average power savings were 8.86% and 8.04%, respectively.

VII. CONCLUSION

In this work, we proposed a method for minimizing unnecessary transitions that are generated when a NOP instruction is inserted into the pipe of a pipelined processor. The proposed approach consisted of two techniques. The first one focused on eliminating unnecessary changes in the pipe register contents, while the second one restricted the propagation boundary of transitions caused by inevitable changes in the pipe register contents due to insertion of a NOP instruction. To determine the efficacy of the proposed technique, we applied some benchmarks to MIPS, DLX and PAYEH pipelined processors. While the hardware overhead and timing penalty of the proposed approach were negligible, dynamic power reductions of up to 10% were achieved.

REFERENCES

[1] “International Technology Roadmap for Semiconductor,” 2007.
[2] V. Venkatachalam and M. Franz, “Power reduction techniques for microprocessor systems,” ACM Computing Surveys, vol. 37, no. 3, pp. 195–237, September 2005.
[3] D. M. Brooks et al., “Power-aware microarchitecture: design and modeling challenges for next-generation microprocessors,” IEEE Micro, vol. 20, no. 6, pp. 26–44, November/December 2000.
[4] R. Gonzalez and M. Horowitz, “Energy dissipation in general purpose microprocessors,” IEEE Journal of Solid-State Circuits, vol. 31, no. 9, pp. 1277–1284, September 1996.
[5] M. Kandemir et al., “Register relabeling: A post-compilation technique for energy reduction,” in Proc. of Workshop on Compilers and Operating Systems for Low Power, October 2000.
[6] M. T.-C. Lee et al., “Power analysis and minimization techniques for embedded DSP software,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 5, no. 1, pp. 123–135, March 1997.
[7] T. Li and C. Ding, “Instruction balance and its relation to program energy consumption,” in Proc. of Intl. Workshop on Languages and Compilers for Parallel Computing, August 2001, pp. 71–85.
[8] W. Zhang et al., “Exploiting VLIW schedule slacks for dynamic and leakage energy reduction,” in Proc. of the 34th Annual Intl. Symp. on Microarchitecture (Micro), December 2001, pp. 102–113.
[9] M. Sarrafzadeh et al., “Low power light-weight embedded systems,” in Proc. of the international symposium on Low power electronics and design, October 2006, pp. 207–212.
[10] J. Montanaro et al., “A 160-MHz, 32-b, 0.5-W CMOS RISC microprocessor,” IEEE Journal of Solid-State Circuits, vol. 31, no. 11, pp. 1703–1714, November 1996.
[11] IBM/Motorola, PowerPC 405CR User Manual.
[12] M. A. Amiri et al., “Design and implementation of a 50MHZ DXT CoProcessor,” in Proc. of EuroMicro Conference on Digital System Design Architectures, Methods, and Tools, August 2007, pp. 43–50.
[13] D. A. Patterson and J. L. Hennessy, Computer Architecture: A Quantitative Approach. Morgan-Kaufmann, 4th ed.
[14] S. Shamshiri et al., “Instruction-level test methodology for CPU core self-testing,” ACM Transactions on Design Automation of Electronic Systems, vol. 10, no. 4, pp. 673–689, October 2005.
[15] P. Lotfi-Kamran et al., “Stall power reduction for pipelined architecture processors,” in Proc. of VLSI Design, January 2008, pp. 541–546.
[16] A. Hartstein and T. R. Puzak, “The optimum pipeline depth considering both power and performance,” ACM Transactions on Architecture and Code Optimization, vol. 1, no. 4, pp. 369–388, December 2004.
[17] S.-J. Ruan et al., “Bipartitioning and encoding in low-power pipelined circuits,” ACM Transactions on Design Automation of Electronic Systems, vol. 10, no. 1, pp. 24–32, January 2005.
[18] M. Monchiero et al., “Power-aware branch prediction techniques: a compiler-hints based approach for VLIW processors,” in Proc. of ACM Great Lakes symposium on VLSI, April 2004, pp. 440–443.
[19] D. Parikh et al., “Power issues related to branch prediction,” in Proc. of International Symposium on High-Performance Computer Architecture, February 2002, pp. 233–244.
[20] R. I. Bahar and S. Manne, “Power and energy reduction via pipeline balancing,” in Proc. of the 28th Annual International Symposium on Computer Architecture, June-July 2001, pp. 218–229.
[21] A. Correale, “Overview of the power minimization techniques employed in the IBM PowerPC 4xx embedded controllers,” in Proc. of the ACM/IEEE International Symposium on Low Power Design, April 1995, pp. 75–80.
[22] V. Tiwari et al., “Guarded evaluation: pushing power management to logic synthesis/design,” IEEE Transactions on Computer Aided Design of Integrated Circuits and Systems, vol. 17, no. 10, pp. 1051–1060, October 1998.
[23] H. Kapadia et al., “Reducing switching activity on datapath buses with control-signal gating,” IEEE Journal of Solid-State Circuits, vol. 34, no. 3, pp. 405–414, March 1999.
[24] M. Münch et al., “Automating RT-level operand isolation to minimize power consumption in datapaths,” in Proc. of the Design, Automation and Test in Europe Conference and Exhibition, March 2000, pp. 624–633.
[25] G. Kucuk et al., “Low-complexity reorder buffer architecture,” in Proc. of International Conference on Supercomputing, June 2002, pp. 57–66.
[26] S. Manne et al., “Pipeline gating: speculation control for energy reduction,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 17, no. 11, pp. 1061–1079, November 1998.
[27] Z. Navabi, Digital design and implementation with field programmable devices. Kluwer Academic Publisher, 2004.

Ali-Asghar Salehpour received the B.S. degree from Shahed University, Iran, in 2006, and the M.S. degree from the University of Tehran, Tehran, Iran, in 2009, both in Computer Engineering. His research interests include low-power design, network, wireless sensor networks, and network security.

Amir-Mohammad Rahmani received his Master degree in computer architecture from the Department of Electrical and Computer Engineering, University of Tehran in 2009. He is currently pursuing his research in the Computer Systems Laboratory, University of Turku, Finland and has a Ph.D. position in Turku Center for Computer Science (TUCS). His research interests include Low-Power Design, Networks-on-chip, Multi-Processor System-on-chip, and 3D ICs. His PhD thesis is focused on Power Analysis and Optimization in 3D-Networks-on-Chip. Amir is a member of IEEE, IEEE Circuits and Systems Society, and EUROMICRO and has published dozens of refereed papers in prestigious books, journals and conferences.

Ali Afzali-Kusha received his B.Sc., M.Sc., and Ph.D. degrees, all in Electrical Engineering, from Sharif University of Technology, University of Pittsburgh, and University of Michigan in 1988, 1991, and 1994, respectively. From 1994 to 1995, he was a Post-Doctoral Fellow at The University of Michigan. Since 1995, he has been with The University of Tehran, where he is currently a Professor of the School of Electrical and Computer Engineering and the Director of the Low-Power High-Performance Nanosystems Laboratory. Also, on a research leave from the University of Tehran, he was a Research Fellow at University of Toronto and University of Waterloo in 1998 and 1999, respectively. He has published more than 200 technical papers. Dr. Afzali-Kusha, who is a senior member of IEEE, currently serves as an associate editor for the ACM Transactions on Design Automation of Electronic Systems. His current research interests include network-on-chip and low-power high-performance design methodologies from the physical design level to the system level for the nanoelectronics era.

Dr. Zainalabedin Navabi is a professor of electrical and computer engineering at the University of Tehran, and an adjunct professor at Worcester Polytechnic Institute, Worcester, MA, USA. He is the author of eight books on VHDL, Verilog and related tools and environments. Dr. Navabi began his work in the EDA area in 1976, when he started the development of a register-transfer level simulator for one of the very first HDLs. In 1981 he completed the development of an RTL synthesis tool. Since 1981, Dr. Navabi has been involved in the design, definition and implementation of HDLs. He has written numerous papers on HDLs, design automation, and digital system test. He started one of the first HDL courses in the US in 1990, and has since conducted short courses and tutorials in the United States and abroad. In addition to being a professor, he is also a consultant to CAE companies. Dr. Navabi received his M.S. and Ph.D. from the University of Arizona in 1978 and 1981, and his B.S. from the University of Texas at Austin in 1975.

INTERNATIONAL JOURNAL OF DESIGN, ANALYSIS AND TOOLS FOR CIRCUITS AND SYSTEMS, VOL. 1, NO. 1, JUNE 2011 16

Studies on Sensitivity of Clock and Data Recovery Circuits to Power Supply Noise

Khalil I. Mahmoud, J. Dhurga Devi, R. Rajasekar, and P. V. Ramakrishna

Abstract: This paper deals with the study of the impact of power supply noise on the performance of CMOS Clock and Data Recovery (CDR) Circuits. The sensitivity of the various blocks of the dual loop CDR circuit to power supply noise is first studied and then it is demonstrated that insertion of suitable Low Dropout Regulators (LDOs) can enhance the performance of the CDR system with respect to power supply noise. Based on extensive simulations, it was observed that while the system can tolerate only about 20mV/10MHz noise on the power supply, incorporation of LDOs enables it to tolerate 200mV/10MHz noise without degradation in performance.

Index Terms: CDR, PLL, VCO, jitter, power supply noise.

I. INTRODUCTION

Clock and Data Recovery (CDR) circuits have been continuously evolving over the past few decades, with the requirements for ever increasing data rates, lower supply voltages and scaling of technology. These CDR circuits cater to multiple standards like SATA, PCI Express, XAUI, OC-192 and so on. Different CDR circuits are expected to be optimized for operation at frequencies that could be anywhere between 1GHz and 100GHz for application over distances varying from a few centimeters to a few hundred meters.

The literature on CDR circuits is vast and the architectures have been continuously evolving over the past two decades. The crucial role played by CDR circuits can be gauged from the number of recent tutorial and review articles [1]-[4] published on this topic. The literature cited in [1]-[4] largely deals with techniques to increase the speed of operation (data rates), techniques for the mitigation of ISI, tradeoffs between jitter, loop bandwidth, tunability, area, power consumption and so on. There are only a few important architectures widely used for implementing CDR systems, and amongst these, the dual loop CDR architecture [5] is considered an important one [1], [6]. Being analog in nature, with independent loop filters for the frequency lock loop and the phase locked loop, it enables one to achieve a wide operating frequency range while maintaining low jitter.

For CDR systems, very few quantitative studies have been reported on the impact of power supply noise on the performance. CDR systems are closely related to the Phase Locked Loop (PLL) and both share a number of similar characteristics, including the impact of power supply noise. Even for PLLs, which cater to a much wider range of applications, studies on the mitigation of power supply noise have been reported only recently in [7], [8]. A method to minimize the supply sensitivity of a CMOS ring oscillator through joint biasing of the supply and the control voltage is proposed in [9] for PLLs. The work presented in [10] proposes adding a current source in parallel with the PMOS transistor of the inverter to form a pseudo-differential ring oscillator. The current source provides a large impedance between the output node and the supply, effectively isolating these two nodes and making the oscillator frequency less sensitive to supply voltage changes. An architecture to decouple the tradeoffs between supply noise rejection and power consumption by putting the regulator outside the main loop of the PLL has recently been proposed in [11]. The decoupling is achieved by putting the regulator in the low bandwidth coarse loop, which allows maximizing the bandwidth to suppress the oscillator phase noise without affecting the power supply-noise rejection or the power dissipation of the regulator. In the context of CDRs, the emergence of power supply noise as a design constraint is emphasized in [2], and a procedure for carrying out fast simulations to evaluate the impact of power supply noise on CDR systems has been presented recently in [12]. The present work adds to this literature, and reports, for a specific CDR architecture, the studies carried out on the impact and mitigation of power supply noise.

The present paper is organized as follows. Section II presents the details of the design of the dual loop delay interpolating CDR system. In Section III, the design of the LDO circuits is presented. In Section IV, the simulation results are presented and discussed. Section V presents the conclusions.

All authors are with the Department of Electronic and Communications Engineering, Anna University, Chennai, Tamil Nadu, 600025, India. E-mail: khalilaljabory@gmail.com.

II. THE DUAL LOOP CDR SYSTEM DESIGN

Fig. 2. Circuit diagram of a Delay Interpolator Stage.

The dual loop delay interpolating CDR architecture chosen for the present study, adapted from [5], is shown in Fig.1 in the form of a block diagram. This system is considered as the reference for the present work, and its jitter performance will be investigated. The system shown in Fig.1 consists of a coarse Frequency Lock Loop (FLL) and a fine Phase Lock Loop (PLL). The FLL consists of a Frequency Detector (FD), Charge Pump (CP), Loop Filter (LPF) and a Voltage Controlled Oscillator (VCO). The FD is realized using a digital quadricorrelator which generates a number of constant duration UP and DOWN pulses proportional to the frequency difference between the data and an internally generated clock signal from the VCO. These UP and DOWN pulses control a charge pump output current that charges or discharges a loop filter. A second order loop filter is used for the FLL. The voltage developed at the output of the loop filter is corrected to the required common mode voltage by a Common Mode Feed Back (CMFB) circuit and then fed as the coarse control voltage to the VCO. The VCO is common to the FLL and PLL loops. The VCO used in this work is the same as that of [13] and consists of a ring oscillator with four delay interpolator stages. It also has two differential control voltages, one coarse voltage (Vcoarse) from the FLL and one fine voltage (Vfine)
from the PLL. These control voltages are used to steer the tail currents of the delay interpolating stage in the fast path and in the slow path. The fast and slow paths have different current magnitudes, but their sum always remains the same.

Fig. 1. Block Diagram of Dual Loop CDR system.

A single delay interpolator stage of the ring oscillator is shown in Fig.2, with a slow path comprising a constant delay buffer stage followed by another inverter delay stage, while the fast path has only one inverting delay stage. The currents Ifastco and Islowco are derived from a current folding circuit controlled by the differential control voltage of the FLL, while the Islowfin and Ifastfin currents are derived from a current folding circuit controlled by the differential voltages of the PLL. The ring oscillator VCO is designed with four such delay interpolating stages. The VCO generates the in-phase and the quadrature phase clocks, which are required for the FD in the FLL, while the in-phase and out of phase clocks are required for the PD in the PLL. When the frequency locks, the variation on the coarse control voltage becomes insignificant and the VCO is controlled only by the PLL [5].

The design procedure and relevant equations for determining the loop parameters can be obtained by adopting the procedure given in [5] and [14]. For a nominal data rate of 1.075Gbps chosen for the CDR, the FLL crossover frequency ωc is chosen as 207Mrad/s and the VCO sensitivity is selected as 1.58GHz/V. The charge pump current of the FLL is found to be 125uA. The devices chosen for the simulations are from 0.35um CMOS technology libraries from Austriamicrosystems. The design equations used for determining the loop filter parameters are summarized below.

$$\omega_c = \frac{I_{cp} \cdot K_{VCO} \cdot R_p}{2 \cdot \pi} \tag{1}$$

$$\omega_z = \frac{\omega_c}{5} = \frac{1}{2 \cdot \pi \cdot R_p \cdot C_p} \tag{2}$$

$$\omega_p = 5 \cdot \omega_c = \frac{C_p + C_s}{C_p \cdot C_s \cdot R_p} \tag{3}$$

$$C_s = \left[ \omega_p \cdot R_p - \frac{1}{C_p} \right]^{-1} \tag{4}$$

The symbols ωc, ωp, ωz, ζ, and KVCO are the crossover frequency, pole frequency, zero frequency, damping ratio, and VCO gain respectively, while Rp, Cp, and Cs represent the shunt resistor, capacitor and parallel smoothing capacitor respectively of the FLL loop filter. Using the above design equations and system specifications, the parameters of the second order loop filter for the FLL have been determined and are listed in Table 1.

The PLL consists of an analog Phase Detector (PD), V-to-I converter, Low Pass Filter (LPF) and the VCO [5]. The PD in the PLL is a differential analog sample and hold circuit which holds the clock amplitude on a capacitor for each data transition. The V-to-I converter block is necessary for converting the differential PD output linearly to a current that charges or discharges the loop filter capacitor (which is a simple lead-lag filter) to provide a voltage. The control voltage is then corrected by the CMFB circuit to the required common mode level.

The PLL loop natural frequency ωn and damping ratio ζ are chosen to be 3Mrad/s and 4 respectively, and the VCO sensitivity (fine) is 198MHz/V for the fine control voltage. The design procedure for determining the PLL loop filter parameters is adopted from [14] and the design equations are as follows:

$$\tau_2 = \frac{2\zeta}{\omega_n} \tag{5}$$

$$K = K_{pd} \cdot K_{VCO} \cdot \frac{\pi}{4} \tag{6}$$

$$K = 2 \cdot \zeta \cdot \omega_n \cdot \sqrt{\frac{\tau_1}{\tau_2} + 1} \tag{7}$$

$$\tau_1 = (R_{f1} + R_{f2}) \cdot C \tag{8}$$

$$\tau_2 = R_{f2} \cdot C \tag{9}$$

The symbols ωn, ζ, K, Kpd, and KVCO are the natural frequency, damping ratio, open loop gain, phase detector gain, and VCO gain respectively. The symbols τ1, τ2, Rf1, Rf2, and C represent the time constants, series resistor, shunt resistor and capacitor respectively of the PLL loop filter. With these design equations and system specifications, the Lead-Lag loop filter parameters for the PLL are determined and tabulated in Table 1.

TABLE I
LOW PASS FILTER PARAMETERS FOR FLL AND PLL

FLL LPF Parameter  Value   PLL LPF Parameter  Value
R1                 1.05k   R1                 106.7k
Cp                 28.8pF  R2                 835k
Cs                 7.2pF   C                  150pF

For the above CDR design, detailed simulations were carried out first to verify its operation, and then various simulations were carried out with respect to its power supply noise immunity. The results obtained are presented subsequently in Section IV. In order to improve the power supply noise immunity of the above CDR, suitable LDO circuits have been designed and these are discussed in the next section.

III. THE PROPOSED LDO DESIGN

The CDR system with two LDOs connected between the power supply and the dual loop delay interpolating CDR system blocks is shown in Fig.3. Two LDOs are used to satisfy the different requirements of the CDR system, since the FD and PD operate at low frequency and consume high transient currents, while the rest of the CDR system operates at high frequency and low currents. One LDO is therefore connected to the FD and PD, while the other is connected to the rest of the CDR system.

Fig. 3. The Dual Loop Delay Interpolating CDR System with LDOs.

In the split-regulator architecture for LDOs suggested in [15], due to a distribution of load currents across the different LDOs, the sizes of the pass devices can be reduced. This makes the output pole of the regulator the dominant pole rather than the amplifier pole, and this in turn leads to an increase in the Power Supply Rejection (PSR) at high frequencies. The present paper uses the Miller compensated regulator with an additional NMOS cascode device as suggested in [16] along with the split-regulator architecture to feed power to the entire CDR system. The decrease in pass transistor size is used to reduce the load capacitance required rather than to change the location of poles. The required PSR is now enhanced by the NMOS cascode device.

The architecture of the LDO used is shown in Fig.4 and it is the same as that discussed in [16]. A single stage CMOS OTA is used as an error amplifier. The PMOS transistor MP is used as a pass device whereas the NMOS device M1 is used as a cascode device. The charge pump is used to boost the gate voltage Vch of the NMOS cascode device beyond VDD, while an RC filter (not shown) is used to reduce high frequency noise from the charge pump. The regulated output voltage is Vsup = 3.0V from an input VDD = 3.6V supply. The dropout of the cascode NMOS is 250mV and that of the PMOS pass device is 350mV. These results are verified through simulations. Further, a 10MHz, 200mVp-p sinusoidal signal at the input power supply of the LDO gave a ripple of 5mVp-p at the regulated output terminal.

Fig. 4. The architecture of LDO.

IV. RESULTS AND DISCUSSION

First, simulations were carried out with a clean power supply so that this can be treated as a reference against which the noisy case can be compared. Next, noise in the form of a sinusoidal signal was added to the power supply of the CDR system to test the capability of the CDR system to tolerate power supply noise. Jitter was then measured from the recovered clock for various applied noise amplitudes. The maximum supply noise amplitude that can be tolerated by the CDR system was determined by observing the failure of the system to lock as the supply noise amplitude was increased. The individual blocks of the system were separately tested with this maximum tolerable supply noise amplitude. The most sensitive block of the system was then identified by measuring the jitter from the recovered clock. Finally, the CDR system with the two LDOs was subjected to similar noise and the maximum power supply noise that can be tolerated by the system was observed.
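The measurement procedure above can be made concrete with a small sketch. The paper does not state how jitter was extracted from the recovered clock, so the following Python fragment is only an illustration under assumptions. It models the effect of supply noise as a sinusoidal displacement of the clock edges (a time-interval error) at the 10 MHz noise frequency. It also assumes a full-rate 1.075 GHz clock for the 1.075 Gbps data rate. The function names `edge_times` and `tie_jitter` and the 10 ps displacement amplitude are hypothetical.

```python
import math


def edge_times(n_edges, period_s, tie_amp_s, f_noise_hz):
    """Edge times of a clock whose edges are displaced sinusoidally (assumed noise model)."""
    return [
        k * period_s + tie_amp_s * math.sin(2.0 * math.pi * f_noise_hz * k * period_s)
        for k in range(n_edges)
    ]


def tie_jitter(times, period_s):
    """Return (peak-to-peak, rms) time-interval error relative to an ideal clock."""
    tie = [t - k * period_s for k, t in enumerate(times)]
    mean = sum(tie) / len(tie)
    pk_pk = max(tie) - min(tie)
    rms = math.sqrt(sum((x - mean) ** 2 for x in tie) / len(tie))
    return pk_pk, rms


PERIOD_S = 1.0 / 1.075e9  # assumed full-rate clock for the 1.075 Gbps data rate
F_NOISE_HZ = 10e6         # supply-noise frequency used in the simulations
TIE_AMP_S = 10e-12        # hypothetical 10 ps edge displacement, for illustration only

# 4300 edges span 40 full cycles of the 10 MHz disturbance
times = edge_times(4300, PERIOD_S, TIE_AMP_S, F_NOISE_HZ)
pk_pk, rms = tie_jitter(times, PERIOD_S)
print(f"peak-to-peak = {pk_pk * 1e12:.2f} ps, rms = {rms * 1e12:.2f} ps")
```

For a purely sinusoidal displacement of amplitude a, the peak-to-peak value approaches 2a and the rms value a/sqrt(2). A simulated recovered clock would replace `edge_times` with the measured threshold-crossing times.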
Using a clean 3.0V power supply, the control voltages generated by the FLL and the PLL for the nominal case are shown in Fig.5. It can be seen that the CDR establishes lock after about 600ns for the FLL and after 700ns for the PLL. These and the subsequent transient responses have been obtained with 2^14 Pseudo Random Bit Stream (PRBS) data as input. In order to determine its ability to tolerate power supply noise, this CDR system was then simulated with a noisy power supply common to the whole CDR. The noise applied on the power supply was a 10MHz sine wave with amplitudes varying from 10mV to 30mV. The resulting output jitter on the recovered clock and the ripple on the control voltages were measured.

Fig. 5. The coarse and fine control voltage of the CDR system with clean power supply.

Fig. 6. The fine control voltage of the CDR system with 10mV/10MHz noise on power supply.

For 10mV and 20mV noise amplitudes, the dual loop delay interpolating CDR system captured the frequency and phase correctly. From Fig.6 and Fig.7, it can be seen that the ripple on the differential control voltage is within reasonable limits. The measured jitter on the recovered clock is given in Table II, but will be discussed subsequently along with the LDO case. When the noise amplitude was increased to 30mV, the resulting control voltage was as shown in Fig.8. Though not evident in Fig.8, lock was not established in this case. In fact, it was found that clock recovery fails when the supply noise is more than 20mV in amplitude at 10MHz: the differential control voltage does not settle down but continues to oscillate.

Fig. 7. The fine control voltage of the CDR system with 20mV/10MHz noise on power supply.

Fig. 8. The fine control voltage of the CDR system with 30mV/10MHz noise on power supply.

Next, in order to determine the particular block within the CDR which is most sensitive to power supply noise, the following procedure is adopted. The noise on the power supply is applied, one at a time, to each individual block of the CDR, and the ripple on the control voltage and the jitter on the recovered clock are measured. As already pointed out, 20mV/10MHz is the maximum noise that can be tolerated by the whole system without LDO, and this noise level is applied to each block separately while the rest of the blocks are fed with a clean supply. In the present architecture, the FD is basically a digital architecture and hence less sensitive to power supply noise. Fig.9 plots the differential fine control voltage with noise injected selectively at the FD power supply node, while the other nodes were fed with a clean supply. Similarly, the PD is not susceptible to supply noise, because the noise can get rejected by the loop filter if it is outside the pass band. Fig.10 plots the differential fine control voltage with noise injected selectively at the PD power supply node, while the other nodes were fed with a clean supply. Next, the VCO is quite susceptible to noise on the VCO supply line. The noise on the VCO supply disturbs the clock frequency and is not necessarily rejected by the loop filter. The noisy power supply is applied only to the VCO. Fig.11 shows the differential fine control voltage, which has the largest ripple and proves that the VCO is the most sensitive part of the CDR system to power supply noise.

Fig. 9. The fine control voltage of the CDR system with 20mV/10MHz noise on power supply on Frequency Detector only.

Fig. 10. The fine control voltage of the CDR system with 20mV/10MHz noise on power supply of Phase Detector only.

Fig. 11. The fine control voltage of the CDR system with 20mV/10MHz noise on power supply of VCO only.

Fig. 12. The fine control voltage of the CDR system with 20mV/10MHz noise on power supply of PD, FD, and VCO alone but without any LDOs.

Fig.9, Fig.10, and Fig.11 are then superposed on the same plot and shown in Fig.12 to compare the differential control voltages (fine) for the three cases considered, and it is clear that the VCO gives rise to the largest ripple on the control voltage. The corresponding jitter on the recovered clock was also measured and found to be 37.39ps, 17.55ps, and 5.82ps when noise is injected into the VCO, PD, and FD supply nodes respectively.

As described in the previous section, inclusion of LDOs in the supply lines of the CDR is expected to improve the performance of the system with respect to power supply noise. Even though the FD and PD are not vulnerable to high frequency noise on the supply line, they are sensitive to supply noise that falls within the pass band of the loop filter. Hence one LDO is inserted into the supply line feeding the VCO and another LDO is inserted into the supply lines feeding the rest of the system.

The inputs to both the LDOs are provided with a noisy power supply of varying amplitudes and the resulting jitter on the recovered clock from the CDR was determined. With the LDOs in place, Fig.13 shows that the CDR system can operate without significant degradation even when power supply noise of 200mV at 10MHz is injected into the power supply line.

Fig. 13. The fine control voltage of the CDR system with 200mV/10MHz noise on power supply of CDR.

The jitter on the recovered clock and the ripple on the differential fine control voltage of the PLL are shown in Table II. It can be seen that the jitter for the case of 200mV/10MHz noise with LDO is less than the jitter for 20mV/10MHz noise without LDO. The ripples on the differential fine control voltages for noise amplitudes of 50mV, 100mV, and 200mV with the LDO are nearly one third of the corresponding values for the case without LDO.

In conclusion, the dual loop delay interpolating CDR with
Fig.13 should be contrasted with Fig.11 which is LDOs can tolerate noise of the order of 200mV/10MHz while for the case when only 20mV/10MHz noise applied to VCO the dual loop delay interpolating CDR without LDOs can INTERNATIONAL JOURNAL OF DESIGN, ANALYSIS AND TOOLS FOR CIRCUITS AND SYSTEMS, VOL. 1, NO. 1, JUNE 2011 21 hardly tolerate more than 20mV/10MHz before losing lock. [12] Marcus van Ierssel, Hisakatsu Yamaguchi, Ali Sheikholeslami, Hiro- taka Tamura, and William W. Walker, “Event-Driven Modeling of CDR TABLE II Jitter Induced by Power-Supply Noise, Finite Decision-Circuit Band- T HE CDR P ERFORMANCE WITH AND WITHOUT LDO S (N.L.: N OT width, and Channel ISI,” IEEE Transaction on Circuits And SystemsI, L OCKED ) vol. 55, no. 5, pp. 1306-1315, June 2008. [13] Jafar Savoj, and Behzad Razavi, “A 10-Gb/s CMOS Clock and Data Recovery Circuit with a Half-Rate Linear Phase Detector,” IEEE Journal Without LDO With LDO of Solid-State Circuits, vol. 36, no. 5, pp. 761-768, May 2001. [14] Keiji Kishine, Kiyoshi Ishii, and Haruhiko Ichino, “Loop-Parameter Noise 10mV 20mV 30mV 50mV 100mV 200mV Optimization of a PLL for a Low-Jitter 2.5-Gb/s One-Chip Optical Amplitude Receiver IC with 1: 8 DEMUX,” IEEE Journal of Solid-State Circuits, vol. 37, no. 1, pp. 38-50, Jan 2002. Jitter 16.4ps 43.4ps N.L. 14.7ps 23ps 39ps [15] V. Gupta, and G.A. Rincon Mora, “A Low Dropout, CMOS Regulator with High PSR over Wideband Frequencies,” IEEE International Sympo- Ripple 60mV 125mV 180mV 18mV 40mV 80mV sium on Circuits and Systems, vol. 5, pp. 4245-4248, May 2005. [16] V. Gupta, and G.A. Rincon Mora, “A 5mV 0.6mm CMOS Miller- Compensated LDO Regulator with -27db Worst Case Power Supply Rejection Using 60pF On-Chip Capacitance,” in Proc. IEEE International Solid-State Circuits Conference, Feb 2007, pp. 520-521. V. 
C ONCLUSIONS In conclusion, the present study indicates that (a) the dual loop CDR architecture is vulnerable to supply noise in the range of 10-20mV and (b) the incorporation of separate and Khalil I. Mahmoud is a PhD student at ECE Dept., appropriately designed LDOs can make the entire system to CEG Campus, Anna University, Guindy, Chennai- 25, India. operate without significant degradation even for supply noise of the order of 200mV up to 10MHz frequencies. ACKNOWLEDGMENT The author would like to thank Iraq-India governments for their financial support of the Scholarship through Indian Counsel for Cultural Relations (ICCR). R EFERENCES J. Dhurga Devi is a PhD student, and Lecturer at [1] Ming-ta Hsieh and Gerald E. Sobelman, “Architectures for Multi-Gigabit ECE Dept., CEG Campus, Anna University, Guindy, Wire-Linked Clock and Data Recovery,”IEEE Circuits and Systems Mag- Chennai-25, India. azine, vol. 8, no. 4, pp. 45-57, Fourth Quarter 2008. [2] Bryan Casper and F. O’Mahony, “Clocking Analysis, Implementation and Measurement Techniques for High-Speed Data Links–A Tutorial,”IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 56, no. 1, pp. 17-39, Jan. 2009. [3] Behzad Razavi, “Phase-Locking in Wire line Systems: Present and Future,” in Proc. IEEE Custom Intergrated Circuits Conference, Aug. 2008, pp. 615-622. [4] Zuoguo (Joe) Wu, and Evelina Yeung, “Multi-Gigabit I/O Link Circuit Design Challenges and Techniques,” in Electromagnetic Compatibility Symposium, July 2007, pp. 1-5. [5] Seema Butala Anand, and Behzad Razavi, “A CMOS Clock Recovery R. Rajasekar is a M.Sc. student at ECE Dept., Circuit for 2.5-Gb/s NRZ Data,”IEEE Journal of Solid-State Circuits, CEG Campus, Anna University, Guindy, Chennai- vol. 36, no. 3, pp. 432-439, Mar 2001. 25, India. [6] Miao Li, Tad Kwasniewski, and Shoujun Wang, “A 0.18 um CMOS Clock and Data Recovery Circuit with Reference-less Dual Loops,” IEEE International Symposium on Circuits and Systems, May 2008, pp. 
2358-2361.
[7] Abhijith Arakali et al., “Supply-Noise Mitigation Techniques in Phase-Locked Loops,” in Proc. European Solid-State Circuits Conference, Sep. 2008, pp. 374-377.
[8] Tzung-Je Lee and Chua-Chin Wang, “A Phase-Locked Loop with 30% Jitter Reduction Using Separate Regulators,” Journal of VLSI Design, vol. 2008, pp. 1-8, July 2008.
[9] Ping-Hsuan Hsieh, Jay Maxey, and Chih-Kong Ken Yang, “Minimizing the Supply Sensitivity of a CMOS Ring Oscillator through Jointly Biasing the Supply and Control Voltages,” IEEE Journal of Solid-State Circuits, vol. 44, no. 9, pp. 2488-2495, Sep. 2009.
[10] Xiong Liu and Alan N. Willson, Jr., “A 3 mW/GHz Near 1-V VCO with Low Supply Sensitivity in 0.18-µm CMOS for SoC Applications,” IEEE International Midwest Symposium on Circuits and Systems, Aug. 2009, pp. 90-93.
[11] Abhijith Arakali, Srikanth Gondi, and Pavan Kumar Hanumolu, “Low-Power Supply-Regulation Techniques for Ring Oscillators in Phase-Locked Loops Using a Split-Tuned Architecture,” IEEE Journal of Solid-State Circuits, vol. 44, no. 8, pp. 2169-2181, Aug. 2009.

P. V. Ramakrishna is a Professor at ECE Dept., CEG Campus, Anna University, Guindy, Chennai-25, India.

A Novel VSWR-Protected and Controllable CMOS Class E Power Amplifier for Bluetooth Applications

Wei Chen, Wei Lin, and Shizhen Huang

All authors are with the Fujian Key Laboratory of Microelectronics & Integrated Circuits, Fuzhou University, Fujian Province, 350002, China. E-mail: wchen@fzu.edu.cn.

Abstract—This paper describes the design of a differential class-E PA for Bluetooth applications in 0.18µm CMOS technology with load mismatch protection and power control features. The breakdown induced by load mismatch can be avoided by attenuating the RF power to the final stage during over voltage conditions. Power control is realized by means of “open loop” techniques that regulate the power supply voltage, and a novel controllable bias network with temperature compensation is proposed, which allows a moderate power control slope (dB/V) to be achieved. Post-layout simulation results show that the output power level can be controlled in 2dB steps, and that the output power in every step is quite insensitive to temperature variations.

Index Terms—Power amplifier, class E, VSWR.

I. INTRODUCTION

BLUETOOTH devices operate in the 2400-2483.5MHz Industrial, Scientific and Medical (ISM) band. There are basically three classes based on the transmission distance: Class 1 (transmitted output power of 20dBm), Class 2 (transmitted output power of 4dBm) and Class 3 (transmitted output power of 0dBm). Usually, the Bluetooth power amplifier works in low power mode, so the output power of a Class 1 power amplifier must be controllable down to 4dBm or less in a monotonic sequence to save power [1].

A standard method of controlling the output power of a power amplifier is to use a voltage regulator to regulate the battery or power supply voltage. Typical approaches use an “open loop” or a “closed loop” control technique. “Closed loop” techniques use an RF sensor, such as a directional coupler, to detect the power amplifier output power. The detected output power is used in a feedback loop to regulate the output power. “Open loop” techniques control the output power by regulating either the power supply voltage or the power supply current used by the power amplifier. “Open loop” techniques are popular since they do not have the loss and complexity associated with RF sensor elements. But in conventional power control schemes that regulate only the power supply voltage, the PA gain control slope (dB/V) is steep, and the PA will suffer from transmit burst shaping and potential stability problems.

Nowadays, Gallium Arsenide (GaAs), BiCMOS and silicon bipolar technologies still dominate power amplifier design. Compared with CMOS technology, these technologies offer higher breakdown voltage, lower substrate loss and higher quality monolithic inductors and capacitors, but they are expensive. CMOS technology, on the other hand, could provide a single-chip solution which greatly reduces the cost. But CMOS technology suffers from poor quality factors of monolithic passive components, low breakdown voltage of the transistors and large process variation.

Moreover, the main obstacle to the actual exploitation of CMOS PAs is the ruggedness requirement, i.e., the ability to survive under high load voltage standing wave ratio (VSWR) conditions with a full-power RF drive [2]. Typically, device testing procedures for commercial PAs can demand a VSWR as high as 10:1 under a 5V power supply. Such a strong mismatch condition results in very high voltage peaks at the collector of the final stage (much higher than the nominal supply voltage) and may eventually lead to permanent failure of the power transistor due to avalanche breakdown. As reported in [3], to comply with ruggedness requirements, collector voltage peaks in excess of 16V have to be tolerated. CMOS transistors usually exhibit lower breakdown voltages. Thick gate-oxide transistors of the TSMC 0.18µm RF CMOS process have a 6.8V breakdown voltage.

In this paper, a two-stage 0.18µm CMOS monolithic PA for Class 1 Bluetooth is proposed. The PA includes a power control circuit which can improve the PA gain control slope (dB/V) and a protection circuit to overcome the detrimental effects of load impedance mismatch. The output power level can be controlled in 2dB steps using an open loop control technique and a novel linearity control bias network with temperature compensation, and the results achieved confirm that ruggedness specifications can be fulfilled.

II. CIRCUITS DESIGN

The power amplifier is designed in a 0.18µm CMOS technology with analog and RF options. This CMOS technology has two kinds of transistors. Thick gate-oxide transistors, which are similar to 0.35µm transistors, have a higher breakdown voltage. Thin gate-oxide transistors, which are similar to 0.18µm transistors, have a higher Gm. So the thin gate-oxide transistors are chosen for the driver stage to generate a larger signal to turn the transistor on and off, and the thick gate-oxide transistors are used for the output stage. Fig. 1 shows a simplified power amplifier topology which contains three main modules: the Class E power amplifier and driver stage, the power control module, and the VSWR protection module.

Fig. 1. Simplified power amplifier topology with power control and VSWR protection.

A. Class E Power Amplifier and Driver Stage For Bluetooth

The Class E power amplifier is a switching-mode, nonlinear amplifier that achieves efficiencies approaching 100%. To achieve these conditions, all the components should be properly designed. As shown in Fig. 2, the loading inductor L6 is either an RF choke (RFC) or a finite inductance. Cs is a charging capacitor; L7 and C3 are designed to be a series LC resonator with an excess inductance at the frequency of interest. The resonator resonates at the fundamental frequency and suppresses the other harmonics. The LC resonator is designed for the optimum conditions. The optimum values for each component are calculated as follows [4].

Fig. 2. Differential class E PA with finite ground inductance.

$$R_l = \frac{0.577\,(V_{dc} - V_{knee})^2}{P_{out}} \quad (1)$$

$$C_3 = \frac{1}{5.447\,\omega R_l} \quad (2)$$

$$L_7 = \frac{Q R_l}{\omega} \quad (3)$$

$$C_s = C_3 \times \frac{5.447}{Q}\left(1 + \frac{1.42}{Q - 2.08}\right) \quad (4)$$

where Rl is the optimum load and Q is the quality factor of the LC resonator.

Fig. 3 shows a simplified circuit of the drive stage in a cascade topology, which generates a larger signal to turn the transistor on and off. The variable-gain amplifier (VGA) is operated at maximum gain under nominal conditions, when VC is at a low voltage and M7 is turned off. When M7 turns on, the gain of the VGA decreases, which attenuates the RF power to the final stage, and the drain voltage of the final transistors decreases.

Fig. 3. Drive stage in a cascade topology.

B. Open-loop Power Control for Class E Power Amplifier

An open-loop control system is one in which the output has no effect upon the input signal. Power control is realized by changing the output stage power supply voltage with an “open loop” control technique, using an LDO-like regulator, and by changing the final stage bias with a novel controllable, temperature-compensated bias network. The benefit of using this topology is that the noise of its output voltage is lower and the response to input voltage transients and output load transients is faster. The output voltage Vcon in Fig. 1 can be obtained from equation (5).

$$V_{con} = \left(1 + \frac{R_1}{R_2}\right) \times V_{ramp} \quad (5)$$

When the bias of the Class E stage is fixed and the drive signal is an ideal switching signal, and other non-ideal factors are neglected, the output power Pout can be obtained from equation (6).

$$P_{out} = 0.577 \times \frac{V_{con}^2}{R_{load}} = 0.577 \times \frac{\left[\left(1 + \frac{R_1}{R_2}\right) \times V_{ramp}\right]^2}{R_{load}} \quad (6)$$

Equation (6) shows that the output power Pout has a square-law dependence on the control signal Vramp. In practice, changing the drive signal as well can improve the PA gain control slope (dB/V).

Fig. 4. Novel linearity controllable bias network with temperature compensation.

Fig. 5. Closed-loop drain peak voltage control.
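As a rough numerical illustration of equations (1)-(6), the following Python sketch evaluates the class-E network values and the square-law power control characteristic. It is only a sketch under stated assumptions: the function names are hypothetical, and the knee voltage, loaded Q, resistor ratio and load resistance used in the example are illustrative values that do not come from the paper. Only the 2.45GHz frequency, the 1.8V supply and the 20dBm (100mW) Class 1 power level are taken from the text.

```python
import math


def class_e_components(p_out_w, v_dc, v_knee, q, freq_hz):
    """Evaluate equations (1)-(4) for the class-E output network."""
    if q <= 2.08:
        raise ValueError("equation (4) requires Q > 2.08")
    omega = 2.0 * math.pi * freq_hz
    r_l = 0.577 * (v_dc - v_knee) ** 2 / p_out_w        # eq. (1)
    c3 = 1.0 / (5.447 * omega * r_l)                     # eq. (2)
    l7 = q * r_l / omega                                 # eq. (3)
    c_s = c3 * (5.447 / q) * (1.0 + 1.42 / (q - 2.08))   # eq. (4)
    return {"R_l": r_l, "C3": c3, "L7": l7, "Cs": c_s}


def output_power_dbm(v_ramp, r1, r2, r_load):
    """Evaluate equations (5) and (6) and express Pout in dBm."""
    v_con = (1.0 + r1 / r2) * v_ramp           # eq. (5)
    p_out_w = 0.577 * v_con ** 2 / r_load      # eq. (6)
    return 10.0 * math.log10(p_out_w / 1e-3)


if __name__ == "__main__":
    # Assumed example values: v_knee, q, r1, r2 and r_load are hypothetical.
    parts = class_e_components(p_out_w=0.1, v_dc=1.8, v_knee=0.2,
                               q=5.0, freq_hz=2.45e9)
    for name, value in parts.items():
        print(f"{name} = {value:.4g}")
    for v_ramp in (0.225, 0.45, 0.9):
        p_dbm = output_power_dbm(v_ramp, r1=1e3, r2=1e3, r_load=10.0)
        print(f"Vramp = {v_ramp} V -> Pout = {p_dbm:.2f} dBm")
```

Because equation (6) is a square law, halving Vramp lowers the ideal output power by about 6dB regardless of the resistor values, so equal voltage steps do not give equal dB steps. This is the steep control slope at low Vramp that supply-only regulation produces.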
Fig. 4 shows a simplified novel linearity-controllable bias network with temperature compensation for the final stage. The output voltage can be obtained from equation (7).

$$V_{REF} = a + a_1 \times V_{ramp} \quad (7)$$

It shows that VREF is controlled by Vramp, where

$$a = I_1 \times R_{REF} = \left(\frac{V_{BE1} \times R_{REF}}{R_3 + R_4}\right) + \left(\frac{V_T \ln n \times R_{REF}}{R_5}\right)$$

$$a_1 = \frac{I_2 \times R_{REF}}{V_{ramp}} = \frac{R_{REF}}{R} \times \left(\frac{W_9}{L_9} \Big/ \frac{W_7}{L_7}\right)$$

At normal temperature,

$$\frac{\partial V_{BE}}{\partial T} \approx -1.5\,\mathrm{mV/K}, \qquad \frac{\partial V_T}{\partial T} \approx +0.087\,\mathrm{mV/K},$$

$$\frac{\partial a}{\partial T} = \left(\frac{R_{REF}}{R_3 + R_4} \times \frac{\partial V_{BE1}}{\partial T}\right) + \left(\frac{\ln n \times R_{REF}}{R_5} \times \frac{\partial V_T}{\partial T}\right)$$

The coefficients a and a1 do not depend on Vramp; a varies with temperature through VBE1 and VT, while a1 is insensitive to temperature.

C. Closed-Loop Drain Peak Voltage Control

There is no isolator between the PA and the antenna; as a result, the power amplifier can experience strong load mismatch due to faults or disconnection of the antenna. Therefore power transistors should be able to tolerate over voltage, as the peaks of the drain voltage waveforms are much higher under mismatch conditions than under nominal conditions. The worst case conditions occur when the power amplifier is operated under both oversupply and load mismatch conditions. So the ruggedness specification is usually considered in terms of a maximum tolerable output VSWR under a specified oversupply condition. Typical data sheets of commercial power amplifiers guarantee that no permanent damage occurs with a 10:1 load VSWR under a supply voltage of 5V.

The risk of breakdown can be prevented by simply attenuating the RF signal which drives the final stage during over voltage conditions [5], [6]. This can be achieved by adopting a feedback control system, which detects the peak voltage at the output collector node and decreases its value to a specific threshold by varying the circuit gain.

The drain voltage of the output transistor is scaled down by a high-input-impedance sensing network, and the scaled-down voltage is applied to an envelope detector delivering an output signal proportional to the collector peak voltage. An error amplifier then compares the rectified waveform voltage with a reference voltage. Finally, the output error is used to control the gain of the drive stage.

If R3 (R4) >> R2 and capacitors C1 and C2 are large enough to be considered short circuits at the operating frequency, then the output voltage of the rectifier (at node c) can be expressed as

$$V_c = (V_{sen(peak)} - V_{con}) \times \frac{R_1}{R_1 + R_2} + V_{con} \times \frac{R_3}{R_3 + R_4} - V_{gs(M)} \quad (8)$$

where Vgs(M) is the voltage between the gate and the source of transistor M, Vsen(peak) is the peak of the drain voltage of the final stage transistor FET1, and Vcon is its DC voltage. When the drain voltage of the final stage transistor exceeds the predetermined threshold, Vc exceeds VREF and the VSWR protection is activated.

III. LAYOUT DESIGN

The power amplifier was first laid out using Cadence Analog Artist, and then imported into ADS's Momentum RF 2.5D electromagnetic simulator. Pins are added to the layout to define the current flow direction, the polygons are meshed into rectangles and triangles, and the dielectric properties of the substrate are defined. Fig. 6 shows the layout of the differential class E PA with on-chip input and inter-stage matching. The circuit is then simulated using the planar field solver. The layouts of RF devices, especially for power amplifiers, require special attention. The output transistor carries 250mA of dc current, plus the RF current, and the output stage transistors (M8 and M9) have a total width of 2.4mm. The drain contact area of each transistor is enlarged, and parallel layers of metal 1 to metal 5 are used as drain and source connections such that the device is able to handle large currents. The output devices are placed as close as possible to the output pads. Many bond-wires can handle the large output currents. 14 ground pads were used in order to minimize the ground bond-wire inductance.

Fig. 6. Layout of differential class E with on chip input and inter-stage matching.

IV. SIMULATION RESULTS

The PA in Fig. 1 was simulated and optimized using ADS (Advanced Design System) in 0.18µm technology, with the bond wire inductance replaced by a physical lumped element model. Fig. 7 shows how the output power (spectrum) varies with temperature; it indicates that the output power changes by less than 0.3dB when the temperature varies from -25°C to 85°C. Fig. 8 shows how the PAE and output power (spectrum) vary with the control voltage Vramp. With a 0dBm input signal and a 1.8V supply voltage at 2.45GHz, the PA reaches a maximum output power of 25.1dBm and 54.2% power-added efficiency (PAE). Fig. 9 shows that, without VSWR protection, the drain voltage of FET1 would rise well above 6.8V when the supply voltage is 5V and the load resistance is 5Ω, and transistor FET1 would break down. With VSWR protection, the transistor remains in the safe operating region.

Fig. 7. Output power vary with temperature.

Fig. 8. The output power and PAE vary with control voltage.

Fig. 9. Drain voltage of FET1 under over power supply and load mismatch.

V. CONCLUSION

A two-stage class-E power amplifier for Class 1 Bluetooth applications has been designed, which includes a protection circuit preventing output stage failure due to severe load mismatch. Safe operation was achieved through the use of a feedback loop that acts on the circuit gain to limit the overdrive of the output transistor whenever an over voltage condition is detected. The output power can be controlled easily by a variable supply implemented by an “open loop” technique; in addition, a novel bias network for the final stage, controlled by Vramp and temperature compensated, is proposed, which allows a moderate power control slope (dB/V) to be achieved. In post-layout simulation, a 25.1dBm output power and 54.2% PAE were achieved at a nominal 1.8V supply voltage. The amplifier is able to sustain a load VSWR as high as 10:1 up to a 5V supply voltage without exceeding the breakdown limits. The output power level can be controlled in 2dB steps, and the output power in every step is quite insensitive to temperature variations, which satisfies the requirements of Bluetooth applications.

ACKNOWLEDGMENT

The authors would like to thank the teachers in the Fujian Key Laboratory of Microelectronics & Integrated Circuits for their kindness and patience, and the Fujian Integrated Circuit Design Center for the use of their facilities. The project was supported by the Natural Science Foundation of China (Grant No. 10871221).

REFERENCES

[1] V. Vathulya, T. Sowlati, and D. M. W. Leenaerts, “Class-1 Bluetooth power amplifier with 24-dBm output power and 48% PAE at 2.4 GHz in 0.25µm CMOS”, in Proc. of 27th European Solid-State Circuits Conf., pp. 84-87, 2001.
[2] A. Scuderi, F. Carrara, and G. Palmisano, “VSWR-protected silicon bipolar power amplifier with smooth power control slope”, in Proc. IEEE Int. Solid-State Circuits Conf., pp. 194-195, Feb. 2004.
[3] K. Yamamoto et al., “A 3.2-V operation single-chip dual-band AlGaAs/GaAs HBT MMIC power amplifier with active feedback circuit technique”, IEEE J. Solid-State Circuits, vol. 35, no. 8, pp. 1109-1120, Aug. 2000.
[4] N. Sokal and A. Sokal, “Class E - A New Class of High-Efficiency, Tuned Single-Ended Switching Power Amplifier”, IEEE J. Solid-State Circuits, vol. 10, no. 3, pp. 168-176, June 1975.
[5] A. Scuderi, F. Carrara, A. Castorina, and G. Palmisano, “A high performance RF power amplifier with protection against load mismatches”, in Proc. IEEE MTT-S Dig., pp. 699-702, Jun. 2003.
[6] A. van Bezooijen, F. van Straten, R. Mahmoudi, and A. H. M. van Roermund, “Over-temperature protection by adaptive output power control”, in Proc. IEEE 36th Eur. Microwave Conf., pp. 1645-1647, Sep. 2006.

Wei Chen was born in Putian, Fujian in 1977. He received the M.S. degree from the College of Physics and Information Engineering, Fuzhou University, in 2007. He is currently employed by Physics and Information Engineering, Fuzhou University. His current interests include analog integrated circuits design and sensor technology research.

Wei Lin was born in Fuzhou, Fujian in 1968. He received the M.S. degree from Fuzhou University in 1998. He is currently employed by Physics and Information Engineering, Fuzhou University. His current interests include analog integrated circuits design and sensor technology research.

Shizhen Huang was born in Fujian in 1968. He received the M.S. degree from Fuzhou University in 2002. He is currently employed by Physics and Information Engineering, Fuzhou University. His current interests include analog integrated circuits design and sensor technology research.

A Charge Pump Circuit by using Voltage-Doubler as Clock Scheme

Wen Chang Huang, Jin Chang Cheng, and Po Chih Liou

Abstract—A new charge pump circuit whose clock voltage increases with the number of stages is proposed in this paper. The charge pump circuit uses a cross-connected NMOS voltage doubler as the pumping stage. Each stage of the voltage-doubler provides a pair of complementary clock voltages. The clock voltage also increases as the number of voltage doubler stages is increased. HSpice simulation in a 0.35 um process with a 2V supply voltage and clock voltage shows that a voltage of up to 37.85V is obtained after eight stages of pumping.

Index Terms—Charge pump, high voltage clock generator, voltage doubler.

Fig. 1. Four-stage Dickson charge pump.

I. INTRODUCTION

CHARGE pumps are circuits used to generate dc voltages that are higher than the normal power supply voltage or lower than the ground voltage of the chip.
Charge pumps have been used in the nonvolatile memories, such as turned more effectively by the high voltage generated in the EEPROM or flash memories, to write or to erase the floating- next stage. More complicated circuit scheme for charge pump gate devices [1, 2]. They can also be used in the low-supply- have been applied to increase the voltage gain, such as all voltage circuits and switched-capacitor systems that require PMOS charge pump for low voltage operation [9], CMOS high voltage to drive the analog switches [3]. Analog circuitry charge pump [10], charge transfer switches in combination also requires efficient charge pump to augment the internal with pumping the output stage clock of enhanced voltage voltage supplies in order to achieve the increased dynamic amplitude . In our previous studied, we proposed a voltage- range and simplify the design [4]. doubler charge pump circuit (VDCP) by cascading multi- The charge pump circuit reported by Dickson had been stages of voltage-doubler [11], and the charge pump circuit widely used for generating high voltages [5, 6] and in some by using multi-staged voltage-doubler as the clock scheme circuit application. The structure of the circuit makes use of (MVDCP) [12] as shown in Fig. 2. Both these two charge capacitors, which are interconnected by diodes and coupled pump circuits showed high efficiency of pumping voltage. in parallel with two non-overlapping clocks. Diodes in the The basic structure of the MVDCP is based on a chained Dickson circuit can be replaced by NMOS, which will result of MOS-diode which combined with pumping capacitor. And a more practical implementation [7]. Fig. 1 shows a four-stage each stage of the capacitor is pumped by a serial of clock Dickson charge pump circuit. However its performance is voltage. The pumping clock of a multi-staged voltage-doubler limited due to the threshold voltage drop of the NMOS devices is designed to replace the traditional clock. 
and the reverse charge-sharing phenomenon. Moreover, for high generated output voltages, the increase in the threshold voltage due to the body effect can significantly reduce the pumping efficiency.

To overcome the problems of the Dickson charge pump mentioned above, a charge pump called NCP-2 [8] has been reported, which uses charge transfer switches (the MSi transistors). Each MSi transistor is controlled by the pass transistors MNi (nMOS) and MPi (pMOS). In this way the charge transfer switches can be turned off completely when required, preventing reverse charge flow. Also they can be

The supply voltage is constant at VDD. The input clock voltage of the first stage is also constant, at the same value as the supply voltage, VDD. The MVDCP shows high pumping efficiency, but it also has the drawback of a large transistor count.

With the aim of reducing the transistor count, we present another version of the MVDCP, called MVDCP-2. Its clock, a multi-staged voltage doubler, produces two outputs that are out of phase with each other, so each stage of the clock can provide the clock voltage for two stages of transfer transistors. The number of transistors in MVDCP-2 is greatly reduced compared with MVDCP-1. The operating principle and the simulation results are discussed in the following sections.

W. C. Huang and P. C. Liou are with the Department of Electronic Engineering, Kun Shan University, Tainan, Taiwan R.O.C. J. C. Cheng is with the Department of Accounting and Information System, Chang Jung Christian University, Tainan, Taiwan R.O.C.

II. PROPOSED NEW CHARGE PUMP CIRCUIT: MULTI-STAGED VOLTAGE-DOUBLER CHARGE PUMP CIRCUIT-2 (MVDCP-2)

The proposed charge pump circuit is shown in Fig. 3. A series of MOS transfer diodes with pumping capacitors is used to transfer charge, and the multi-staged voltage doubler serves as the pumping clock. Fig. 3 shows a four-stage connection of the multi-staged voltage-doubler charge pump circuit (MVDCP-2). In MVDCP-2, each stage of the voltage doubler provides two clock voltages for two stages of MOS diodes, whereas in MVDCP-1 each voltage doubler provides only one clock voltage to one stage of MOS diode. The operation of the high-voltage clock generator and the concept of the multi-staged voltage doubler are discussed in Section II-A. The simulation results and discussion of the proposed MVDCP-2 are given in Section II-B.

Table I compares the numbers of transistors, pumping capacitors and load capacitors of the two circuits. Both circuits have four pumping stages. The table shows that the numbers of transistors and capacitors in MVDCP-2 are greatly reduced. The smaller number of transistors and capacitors in MVDCP-2 results in a smaller chip area compared with MVDCP-1.

TABLE I
COMPONENT COUNTS OF A FOUR-STAGE CHARGE PUMP CIRCUIT

Circuit    NMOS  PMOS  Int. Cap.  Load Cap.  Comments
MVDCP-1    22    9     12         1          One clock output
MVDCP-2    12    5     8          1          Two clock outputs

Fig. 2. The charge pump circuit using a multi-staged voltage doubler as the clock scheme (MVDCP-1) [12].

Fig. 3. The proposed charge pump circuit using a multi-staged voltage doubler as the clock scheme (MVDCP-2).

A. Multi-staged voltage-doubler clock generator

The unit cell of the multi-staged voltage-doubler clock generator is a clock voltage doubler [8], as shown in Fig. 4. In this circuit, the input clock voltage, VCLK, oscillates between 0 and VDD, and the power supply voltage is constant at VDD. When the clock is high (VDD), the nMOSFET (M8) is turned on and the output of the inverter is discharged to 0V. When the clock goes low (0V), the pMOSFET (M7) is turned on, because its gate voltage is 0V and its source voltage is 2VDD. The source voltage (2VDD) of transistor M7 charges the output capacitor of the inverter, and the output node of the inverter eventually reaches 2VDD. The value of Vout4 therefore oscillates between 0V and 2VDD as the clock toggles. The left-hand side of the circuit operates on the same principle with opposite polarity.

Fig. 4. The circuit diagram of the clock voltage doubler.

The output clock voltage produced by each stage in the MVDCP-2 circuit serves two purposes. First, it is the pumping clock of the nMOS diodes in the charge pump chain. Second, it provides the clock for the following stage of the clock voltage doubler, which produces a clock voltage of higher amplitude. This concept is realized by the MVDCP-2 circuit shown in Fig. 3.

We discuss the multi-staged voltage-doubler clock scheme first. In this clock scheme, the voltage doublers are connected stage by stage. As the input clock goes from 0 to VDD, the left output of the first stage goes from 0 to 2VDD and the right output of the first stage goes from 2VDD to 0. At steady state, the left output of the second stage goes from 3VDD to 0 and the right output of the second stage goes from 0 to 3VDD.

Fig. 5. The pumping voltage of various stages of MVDCP-2, where VDD = 2V, VCLK = 2V and f = 25MHz in the circuit.

Fig. 6. The pumping voltages of a four-staged MVDCP-2 at various simulation frequencies (VDD = 2V, VCLK = 2V).

By suitable connection, these four output clock voltages are used as the pumping clocks of the nMOS diode chain. When the input clock is 0V, capacitors C1, C2, C3 and C4 receive pumping voltages of 0V, 2VDD, 0V and 3VDD, respectively.
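The two-phase clock levels described in this subsection can be tabulated with a minimal Python sketch. It assumes ideal steady-state levels (no threshold or parasitic losses) and a wiring of C1 to the first-stage left output, C2 to the first-stage right output, C3 to the second-stage right output and C4 to the second-stage left output; this wiring is inferred from the stated capacitor voltages rather than read from Fig. 3, and the function names are hypothetical.

```python
VDD = 2.0  # supply voltage in volts, the value used in the paper's simulations


def doubler_outputs(clk_high: bool, vdd: float = VDD) -> dict:
    """Ideal steady-state outputs of the two cascaded doubler stages."""
    if clk_high:  # input clock at VDD
        return {"left1": 2 * vdd, "right1": 0.0, "left2": 0.0, "right2": 3 * vdd}
    # input clock at 0 V
    return {"left1": 0.0, "right1": 2 * vdd, "left2": 3 * vdd, "right2": 0.0}


def pumping_clocks(clk_high: bool, vdd: float = VDD) -> list:
    """Clock voltages seen by C1..C4 under the assumed (inferred) wiring."""
    out = doubler_outputs(clk_high, vdd)
    return [out["left1"], out["right1"], out["right2"], out["left2"]]


if __name__ == "__main__":
    for level in (False, True):
        print("CLK high" if level else "CLK low ", pumping_clocks(level))
```

Adjacent capacitors always receive opposite clock phases, which is what lets one doubler stage drive two diode stages.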
When the input clock is VDD, capacitors C1, C2, C3 and C4 receive pumping voltages of 2VDD, 0V, 3VDD and 0V, respectively. The MVDCP-2 circuit scheme occupies a smaller chip area than the MVDCP-1 circuit scheme.

B. Simulation results of the multi-staged voltage-doubler charge pump circuit (MVDCP-2)

A four-stage multi-staged voltage-doubler charge pump circuit (MVDCP-2) is shown in Fig. 3. The simulated output voltage of each stage of the MVDCP-2 is shown in Fig. 5. Both the supply voltage VDD and the input clock voltage VCLK are 2V, and the operating frequency is 25MHz. The figure shows the output voltage versus operation time for the different output stages. The output voltage reaches 4.21V after one stage of pumping. As the number of stages increases, the output voltage increases steadily. For an eight-stage MVDCP-2, the output voltage can be pumped up to 37.85V.

Fig. 6 shows the pumping voltage of a four-stage MVDCP-2 at various operating frequencies: 25, 50, 100, 200 and 500MHz. A higher output voltage is obtained at low operating frequencies. At 25, 50 and 100MHz the circuit pumps to similar output voltages, and the output voltage degrades seriously once the operating frequency is increased to 200MHz. Under low-frequency operation the pumping capacitor has time to charge more fully. As the operating frequency is increased, the pumping capacitor is charged less, which results in a lower pumping voltage.

We next compare the pumping voltage of various charge pump circuits. The MVDCP-2 is compared with the MVDCP-1, the VDCP, the Dickson charge pump circuit [5] and the NCP-2 charge pump circuit [8]. The output voltage versus number of stages for the Dickson charge pump, NCP-2, VDCP, MVDCP-1 and MVDCP-2 is shown in Fig. 7. The results are HSpice simulations in a 0.35 um CMOS process with VCLK = VDD = 2V, a pump capacitor of 20pF, a load capacitor of 10pF and an operating frequency of 25MHz. At each stage, the MVDCP-2 shows a lower output voltage than the MVDCP-1 but a much higher output voltage than the other circuits. For the MVDCP-1, the pumping voltage is 4.31V after one stage, 31.09V after five stages and 40.37V after seven stages. The MVDCP-2 shows lower pumping voltages under the same simulation conditions: 4.21V after one stage, 21.31V after five stages and 34.28V after seven stages. These results are nevertheless much higher than those of the VDCP, the Dickson charge pump and the NCP-2. The figure also shows that the Dickson charge pump and NCP-2 saturate after many stages of pumping, while the MVDCP-2 shows no tendency to saturate. The pumping voltage can be increased further by adding stages to the MVDCP-2.

Fig. 7. A comparison of the pumping voltage of MVDCP-2, MVDCP-1, VDCP, NCP-2 and Dickson charge pump at different numbers of pumping stages.

Fig. 8 compares the simulated output voltages of the Dickson, NCP-2, VDCP and the newly proposed charge pump circuits under different power supply voltages (VDD) without output current loading. The output voltages are obtained after three stages of pumping in each circuit. The output voltages of all these charge pump circuits degrade when the power supply voltage is decreased. However, the proposed charge pump circuit still has higher output voltages at the lower supply voltages because it has better pumping efficiency. Thus, the proposed charge pump circuit is more suitable for low-voltage processes than the prior designs.

Fig. 8. A comparison of the pumping voltage of MVDCP-1, MVDCP-2, VDCP, NCP-2 and Dickson charge pump at different supply voltages.

Fig. 9 shows the simulated output voltages of the MVDCP-2, MVDCP-1, VDCP, NCP-2 and Dickson charge pump under different output currents. When the output current is increased, the output voltages of these charge pump circuits decrease. The output voltage is 7.75V, 5.95V, 4.7V, 3.7V and 2.4V, respectively, at an output current of 30 µA. The output voltages of the proposed charge pump circuit at different output currents are much higher than those of the other charge pump circuits. In particular, at the higher output current of 50 µA, the proposed charge pump circuit still has better pumping performance than the others. In addition, the high pumping clock voltage in the proposed charge pump circuit can fully turn on the transfer transistors to transfer the charge, whereas all MOSFET switches in the Dickson charge pump circuit and NCP-2 are diode-connected transistors, which suffer from the threshold voltage drop problem. Thus, the proposed charge pump circuit has better pumping performance.

Fig. 9. The output voltage vs. output current of the MVDCP-2, MVDCP-1, VDCP, NCP-2 and Dickson charge pump.

III. CONCLUSION

A new version (MVDCP-2) of the charge pump circuit using a multi-staged voltage doubler as the clock scheme is proposed. Although it shows lower pumping efficiency than the previous version (MVDCP-1), its chip area is greatly reduced. The MVDCP-2 still shows a much higher pumping voltage than the other charge pump circuits. No saturation of the pumping voltage is found as the number of pumping stages increases. The circuit is also suitable for systems with a low power supply voltage.

ACKNOWLEDGMENT

The authors would like to thank the National Science Council of the Republic of China for financially supporting the research under Contract No. NSC-97-2221-E-168-004. The simulation software was supported by CIC.

REFERENCES

[1] T. Tanzawa, T. Tanaka, K. Takeuchi, and H. Nakamura, "Circuit technologies for a single-1.8 V flash memory", IEEE J. Solid-State Circuits, vol. 31, no. 1, 2002, pp. 84-89.
[2] T. Kawahara, T. Kobayashi, Y. Jyouno, S. Saeki, N. Miyamoto, T. Adachi, M. Kato, A. Sato, J. Yugami, H. Kume, and K. Kimura, "Bit line clamped sensing multiplex and accurate high voltage generator for quarter-micron flash memories", IEEE J. Solid-State Circuits, vol. 31, pp. 1590-1600, Nov. 1996.
[3] T. B. Cho and P. R. Gray, "A 10 b, 20 M sample/s, 35 mW pipeline A/D converter", IEEE J. Solid-State Circuits, vol. 30, 1995, pp. 166-172.
[4] R. St. Pierre, "Low-power BiCMOS op amp with integrated current mode charge pump", IEEE J. Solid-State Circuits, vol. 35, no. 7, 2000, pp. 1046-1050.
[5] J. F. Dickson, "On-chip high voltage generation in MNOS integrated circuits using an improved voltage multiplier technique", IEEE J. Solid-State Circuits, vol. SC-11, no. 3, 1976, pp. 374-378.
[6] T. Tanzawa and T. Tanaka, "A dynamic analysis of the Dickson charge pump circuit", IEEE J. Solid-State Circuits, vol. 32, no. 8, 1997, pp. 1231-1240.
[7] J. S. Witters, G. Groeseneken, and H. E. Maes, "Analysis and modeling of on-chip high-voltage generator circuits for use in EEPROM circuits", IEEE J. Solid-State Circuits, vol. 24, no. 5, 1989, pp. 1372-1380.
[8] J. T. Wu and K. L. Chang, "MOS charge pumps for low-voltage operation", IEEE J. Solid-State Circuits, vol. 33, no. 4, 1998, pp. 592-597.
[9] N. Yan and H. Min, "High efficiency all-PMOS charge pump for low-voltage operations", Elect. Lett., vol. 42, no. 5, 2006, pp. 277-279.
[10] Y. Moisiadis, I. Bouras, and A. Arapoyanni, "A CMOS charge pump for low voltage operation", Proc. IEEE International Symposium on Circuits and Systems, Geneva, 2000, pp. v577-v580.
[11] W. C. Huang, J. C. Cheng, and P. C. Liou, "A charge pump circuit cascading high-voltage clock generator", Proc. IEEE International Symposium on Electronic Design, Test & Application, pp. 332-337, 2008, Hong Kong SAR, China.
[12] W. C. Huang, J. C. Cheng, and P. C. Liou, "A charge pump circuit using multi-staged voltage doubler clock scheme", Proc. International Conference on Microelectronics, pp. 330-333, 2007, Cairo, Egypt.

Wen-Chang Huang was born on February 16, 1967. He received the B.S. degree in electronics engineering from Tamkang University in 1989 and the M.S. and Ph.D. degrees in electronic engineering from Chiao Tung University, Hsinchu, Taiwan, in 1991 and 1996, respectively. From 1998 to 2003, he was an Assistant Professor at the Department of Electronic Engineering, Ta Hwa Institute of Technology, Hsin Chu, Taiwan. From 2003 to 2005, he was an Assistant Professor at the Department of Electronic Engineering, Kun Shan University, Tainan, Taiwan. He is currently an Associate Professor at the Department of Electronic Engineering, Kun Shan University, Tainan, Taiwan. His current research is in the areas of thin film materials, semiconductor devices and VLSI design.

Po-Chih Liu received the M.S. degree from the Department of Electronic Engineering, Kun Shan University, in 2007. He is currently an IC layout engineer at Synerchip Co. Ltd.

Jin-Chang Cheng received the Ph.D. degree from the Department of Electrical Engineering, State University of New York at Stony Brook, in 1986. He is now with the Department of Accounting and Information System, Chang Jung Christian University, Tainan, Taiwan, R.O.C. His research interests include digital system design, digital signal processing and image processing.
INTERNATIONAL JOURNAL OF DESIGN, ANALYSIS AND TOOLS FOR CIRCUITS AND SYSTEMS, VOL. 1, NO. 1, JUNE 2011

Hierarchical Agent Based NoC with DVFS Techniques

Alexander Wei Yin, Liang Guang, Pasi Liljeberg, Pekka Rantala, Jouni Isoaho, and Hannu Tenhunen

Abstract: Network-on-Chip (NoC) is a promising architecture for many-core on-chip systems. A hierarchical agent based NoC architecture is proposed which enables the NoC to adjust itself autonomously and to provide maximum power efficiency, fault/variation tolerance and system flexibility. Agents are software or hardware components which monitor and control the system at different granularities. Through the joint efforts and interactions of agents at all levels of the architecture, system optimization can be achieved at runtime. The agent hierarchy, the mapping of agents onto regular NoC platforms and the partition of functions among the agents are discussed in detail. Runtime power management with various Dynamic Voltage and Frequency Scaling (DVFS) techniques is analyzed on the hierarchical agent based NoC platform. Conventional power monitoring techniques can be flexibly incorporated into the functions of specific levels of agents. The network condition is observed by the agents at runtime, and the power supplies are adjusted accordingly at different granularities. This paper describes the system architectures of two adaptive schemes and presents efficient and feasible algorithms for them. Quantitative experiments demonstrate that they achieve superior power efficiency compared with the basic architecture without any runtime configuration, while different tradeoffs apply depending on the monitoring granularity.

Index Terms: Network-on-Chip, power management, Dynamic Voltage and Frequency Scaling.

I. INTRODUCTION

According to Moore's Law, the fast-developing Integrated Circuit (IC) technology will provide the industry with billions of transistors on a single chip in a few years. The rapid increase in the number of IP cores integrated on a chip will most probably lead to an exponential rise in the complexity of their interaction. The traditional and still dominant IC design approaches, which are based on shared buses and block-to-block wires, are no longer feasible, mainly due to their scalability limits. To solve this problem, NoC has been proposed as a promising approach to integrate a large number of IP cores on a single chip by leveraging well-developed computer network concepts. In this scenario, since no bus arbitration is needed, more transactions can occur simultaneously, so the delay of the packets is reduced and the throughput of the system is increased. Moreover, as the links in a NoC are point-to-point, the communication among IP cores can be pipelined to further improve system performance. With technology development, the number of cores in NoCs increases rapidly. A 167-core ASAP processor prototype was proposed in 2008 [1]. It has been predicted that thousand-core processors will be feasible in the near future [2].

All authors are with the Department of Information Technology, University of Turku, Turku, Finland, FI-20014, Turun Yliopisto. E-mail: alexander.yin@utu.fi Homepage: http://users.utu.fi/yinwei/

Despite the aforementioned advantages, NoCs face a number of problems, of which the most urgent is power consumption. In today's ICs, higher power consumption requires more expensive cooling and packaging techniques and results in lower reliability. The power concern will become more critical as the pace of transistor integration clearly oversteps the power delivery and heat dissipation capacity. In NoC based systems, the interconnection network is identified as a major power consumer among all the power contributors. For the Alpha 21364 processor with distributed memory, the integrated switches and link circuits consume 23W out of the total 125W chip power [3]. On the 16-tile RAW processor, the interconnection network accounts for as much as 35% of the total chip power [4]. In NoCs, with the trend of integrating more parallel components on a single chip, it can be predicted that the interconnection network will claim an even bigger share of power in the future.

Considering the dynamic nature of the network traffic, runtime power optimization techniques are the most favorable for minimizing power consumption. For energy constrained systems, a wide range of efforts have been made to minimize the energy consumption under different Quality-of-Service (QoS) requirements, with average packet latency as one of the essential metrics for best effort service. DVFS is one of the most widely used methods. The key idea behind DVFS is to dynamically scale the supply voltage and frequency of the processing cores to provide just enough performance to process the system workload while still meeting the timing and throughput constraints, thereby reducing the power consumption [5]. In this paper, we explore two DVFS techniques with different control granularities on NoCs, namely island based DVFS and per-core based DVFS.

To achieve maximum power efficiency, fault/variation tolerance and system flexibility, dynamic on-line monitoring services are desirable in NoCs. The demand for efficient system control while maintaining system scalability and low overhead brings in a tradeoff between distributed and centralized monitoring. On the one hand, local circuits need to be provided with distributed monitoring modules, since distributed monitoring reduces the local operation delay and interconnect delay for urgent monitoring services and also prevents a potential communication bottleneck. On the other hand, regardless of the system size, centralized monitoring is still an indispensable complement to localized monitoring schemes. Theoretically, a centralized monitor, with knowledge of all on-chip resources, is able to coordinate and balance the functioning of all components with the aim of optimizing the overall system performance. In practice, for example, a single processing unit for dynamic testing operations and a global level scheduler were adopted in [6]. For either a distributed or a centralized monitoring scheme, the energy efficiency of the monitoring services should be maximized.

In this paper, a hierarchical agent based NoC with dynamic on-line services is proposed. One of the key contributions of NoCs is the separation of the computation and communication phases. In our design methodology, a third phase, which we call the autonomous phase, is added onto the NoC platform. It is implemented as the hierarchical monitoring agent architecture. This phase enables the system to adjust itself autonomously in order to achieve low power consumption and fault/variation tolerance. Agents are functional units that monitor and control different architectural levels of the NoC platform depending on their hierarchical levels. This architecture aims to enhance system performance in terms of both power consumption and fault/variation tolerance. It also provides a wide design and synthesis space for the realization of agents at each level.

The remainder of this paper is organized as follows. Section II describes related work. The hierarchical agent-based NoC architecture is motivated and described in detail in Section III. Dynamic power monitoring with island-based and per-core DVFS, leveraging the hierarchical agent architecture, is presented in Section IV and quantitatively evaluated in Section V. Section VI concludes the paper.

II. RELATED WORK

A few previous works have addressed system level monitoring services on NoC platforms [7], [8]. In these architectures, the monitoring services are integrated into the NoC platform and reuse the NoC communication services for configuration as well as for monitoring traffic. The monitoring probes are distributed across the platform, and the monitoring information can be analyzed either centrally or in a distributed fashion. However, solely centralized or solely distributed monitoring services do not lead to the highest monitoring efficiency in large scale NoC platforms. Therefore, in this paper, we propose a hierarchical monitoring agent architecture which is able to exploit the benefits of both centralized and distributed monitoring.

DVFS is one of the most effective methods to reduce power consumption in IC systems. In [3], the DVFS technique was applied to the NoC links. The utilization ratio of links and buffers was used as an effective indicator of network load, based on which the voltage and frequency scaling was performed. However, due to the high overhead of voltage regulators in terms of delay, power and area, it is not feasible to implement such fine grained DVFS in large scale NoCs.

To reduce the overhead brought by voltage regulators, the concept of the voltage island has been proposed. A voltage island is a group of on-chip cores powered by the same voltage source, independently from the chip-level voltage supply [9]. The use of voltage islands permits operating different parts of the NoC at different voltage levels in order to minimize the overall chip power consumption. At the same time, since the NoC is divided at the granularity of islands instead of cores or links, the number of voltage regulators is greatly reduced. Our first DVFS implementation method is based on the concept of the voltage island.

In recognition of the constraints of voltage regulators, a combination of dual static supply voltages has been used to achieve dynamic power management [10]. Two transistors are implemented as header devices to connect the circuit block to two static voltage wires. Inspired by this work, we propose a per-core DVFS implementation method in this paper. Through the use of header transistors, each core in the NoC can be connected to multiple voltage supply networks and work under a number of voltage levels determined at design time. By turning off all the transistors, cores can also be power gated to further reduce the power consumption, especially the leakage power. Based on this technique, very fine grained per-core DVFS can be achieved on NoCs.

Agent based management in NoC systems was initially proposed in our previous work [11]. Since then, we have carried out intensive studies on various aspects of the design and implementation of the architecture. This paper mainly addresses the system-level architectural design of NoC platforms equipped with hierarchical agent monitoring functionality, with a focus on dynamic power management.

III. HIERARCHICAL MONITORING AGENT ARCHITECTURE

A. Hierarchy of the Monitoring Agents

There are four levels of agents in the proposed NoC architecture, namely the application agent, the platform agent, the cluster agents and the cell agents (Fig. 1). As the top level agent, the application agent is unique in a NoC platform. An application agent is a software module capturing the application functionality and the runtime performance requirements and constraints. The platform agent is also unique in a NoC. Based on the specification from the application agent and on resource availability, it (re)configures the network and the Processing Elements (PEs). The entire NoC is divided into a number of clusters, each of which is monitored and controlled by a cluster agent. A cluster is a group of processors with accompanying components (caches, scratchpad memories, switches, links, etc.). It is logically divided into cells, which are the basic units in our architecture, each consisting of a PE, a switch and the corresponding links. The cells are equipped with their own local monitors, the cell agents, which trace and adjust the local circuit conditions.

Fig. 1. Hierarchical Agent approach.

Fig. 2 illustrates the mapping of hierarchical agents onto a tile-based regular mesh NoC structure. The size and number of clusters and the location of the cluster agents are all application and platform dependent choices. In this example, we divide the platform into four clusters, each of which is a 2*2 mesh network. The cluster agents are placed on the lower right switches within each cluster. As mentioned earlier, the basic monitored units are cells, which comprise a PE, a Network Interface (NI), a switch and the corresponding links. It is intuitive to allocate a cell agent to each cell by sharing the physical area of the PE. Other cell level monitoring units may be allocated at other particular places within a cell, such as power gating sleep-transistors on the links. The cluster agent is allocated at a fixed position at design time. Since a cluster agent has more sophisticated functionality and control algorithms than a cell agent, it requires more resources, such as area, power and communication bandwidth. Therefore, a cluster agent replaces one of the PEs in the cluster. To minimize the communication latency and balance the workload of the system, the platform agent and the application agent, which monitor and control the whole system, are located at the geographic center of the platform. In order to offer scalability for extremely large scale NoC systems, clusters can be further divided into hierarchical sub-clusters and a similar partition of monitoring functions can be applied. This issue is beyond the scope of this paper.

Fig. 2. Hierarchical Agent implementation on NoCs.

B. On-line Monitoring Services

1) Fault/Variation Tolerance: The major causes affecting the reliability of NoC systems are the shrinking of the feature size and the decrease of the supply voltage, which expose them to faults of a permanent, transient or intermittent nature. Among the failure mechanisms, we can enumerate factors such as crosstalk, electromigration, electromagnetic interference, alpha particle hits and cosmic radiation [12]. These phenomena and system variations can change the timing and functionality of the NoC fabric and thus degrade its QoS or, eventually, lead to failure of the whole system. Providing resilience against such faults and against process, voltage and temperature (PVT) variations is mandatory for NoCs [13].

The proposed monitoring agent based architecture can provide fault/variation tolerance for NoC based systems through the joint efforts of all levels of agents. Before execution, the platform agent allocates a number of resources and configures the network based on the initial application requirements, with power and performance awareness. A number of resources are reserved as spares in case of component failures. The initial configuration is passed from the platform agent to the cluster agents and then to the cell agents. After the application starts running, the cell agents trace their local circuit conditions, such as failures and PVT variations. They first attempt to fix the errors if feasible (for example, by retransmission in the case of a transient crosstalk-induced error [14]). If the errors cannot be resolved by the cell agent, they are reported to the cluster agents. The cluster agents allocate spares to take the functional places of the faulty cells and re-run their instructions. If the errors cannot be resolved within a cluster, they are further reported to the platform agent, which re-maps the application or reconfigures the system if necessary.

The application agent and the platform agent are also responsible for reconfiguring the system to balance the workload in order to keep the circuit at a relatively low temperature. This is of crucial importance because the circuit is more error-prone in high temperature regions. Moreover, leakage currents increase exponentially with temperature.

2) Power Consumption Optimization: Although it has been under intensive research for decades, power consumption is still the most critical constraint to be explored. Power consumption can be categorized into dynamic power consumption and leakage power consumption.

In the proposed system, DVFS is used to minimize the dynamic power consumption at runtime. Traditional DVFS provides power optimization at the chip-wide level, which cannot use local traffic variation to exploit the supply scaling potential. Therefore, in large scale digital circuit systems, DVFS is often used together with multiple voltage and frequency domains at different granularities [15]-[17]. The overhead brought by the voltage regulators is a severe constraint on the use of DVFS. To alleviate this problem, voltage island based DVFS has been proposed in [17] and is widely used in today's IC technology. In this method, the number of voltage regulators is reduced due to the smaller number of voltage domains. Island based DVFS, however, cannot exploit the local traffic pattern or support runtime reconfiguration of spare cells into other clusters. Another design option is per-core based DVFS, which is not feasible in conventional technology due to its extremely high overhead. We propose a novel method which uses multiple voltage supply networks instead of voltage regulators and thus makes per-core based DVFS feasible to implement.

The granularity of the monitoring services is a design choice which depends on the size of the actual platform and on the workload and constraints of the application. In this paper, we examine both island based and per-core based DVFS in Section IV.
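As background for the DVFS discussion above, the standard first-order CMOS switching-power relation shows why scaling the supply voltage together with the clock frequency is effective. This is a textbook model and is not taken from this paper; the activity factor \alpha, the switched capacitance C and the scaling factor s are symbols introduced here for illustration only.

```latex
P_{\mathrm{dyn}} = \alpha \, C \, V_{dd}^{2} \, F_{clk}
\qquad\Longrightarrow\qquad
\frac{P_{\mathrm{dyn}}(s V_{dd},\, s F_{clk})}{P_{\mathrm{dyn}}(V_{dd},\, F_{clk})} = s^{3},
\quad 0 < s \le 1 .
```

Under this idealized model, running an island or a core at s = 0.8 would reduce its dynamic power to roughly 0.51 of the nominal value, at the cost of proportionally lower throughput.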
Besides dynamic power consumption, the hierarchical monitoring architecture aims to minimize leakage power, which grows exponentially as transistor size decreases. One process generation increases leakage by a factor of 6 to 10. According to the current leakage trend, a microprocessor in 100 nm technology may dissipate up to 50% of its power as leakage. To reduce leakage, dynamic power management methods, such as turning off links that are not communication intensive, should be applied in the NoC platform through the hierarchical monitoring agents.

3) Motivational Case Study: Fig. 3 provides a motivational case study of agent-monitored reconfiguration on many-core systems (originally presented in our previous work [18]). A 64-point FFT/IFFT computation is simulated on a 2-D hierarchical agent-based NoC, and each processing element, a DSP unit, runs at the same frequency. Initially, 30% of all the processing elements are set as spares. The platform agent has two architectural alternatives, one exploring more parallelism (thus finishing faster) while using more processors [19] than the other [20]. The study case is simulated in Matlab/Simulink. Every DSP works at 600 MHz with 16-bit wide data, and one complex multiplication takes 6 cycles. Fig. 3(a) shows that when the system is configured with the architecture described in [19], the computation takes 6 processors and 8 ms. If some components fail (Fig. 3(b)), the platform agent will replace them with spare processors and reconfigure the network. If the application agent specifies tougher timing constraints, the platform agent may utilize more available resources to achieve another performance/cost tradeoff. In Fig. 3(c), with the architecture alternative as in [20], the computation time is reduced to 3 ms at the cost of another 10 processors (spare ones in the data flow are only bypassed, not used in computation).

Fig. 3. Study case on fault/variation tolerance with hierarchical agents. Panels: (a) Implementation Alternative 1; (b) Reconfiguration with Fault Appearance; (c) Implementation Alternative 2. Legend: spare cell, functioning cell, faulty cell (X), cluster agent (C).

IV. ISLAND-BASED AND PER-CORE DVFS

Power consumption is a major design constraint in NoC-based platforms, given the large amount of computing and interconnection resources on-chip. Among all sources of power consumption, inter-switch communication is a major contributor [21], and it will become more substantial with further parallelization into the thousand-core NoC era.

A. System Architectures

Hierarchical agent-based NoC platforms can flexibly support power monitoring services of different granularity, relying on the functions of agents at various levels. Island-based DVFS is a coarse-grained monitoring approach (Section IV-A1), mainly handled by cluster agents. Per-core DVFS is a fine-grained approach (Section IV-A2), handled by cell agents.

1) Island-based DVFS: A voltage and frequency island is a coarse-grained architectural method to improve power or energy efficiency by exploiting the spatial variation of workloads [22]. Combined with DVFS, the island-based architecture addresses both the spatial and temporal variations of communication traffic in on-chip networks.

In a hierarchical agent-based NoC, island-based DVFS can be integrated into the functional responsibility of cluster agents (Fig. 4). The whole network is partitioned into a number of islands (clusters), each of which works under its own voltage and frequency. The cluster agent collects the network information within the cluster and configures proper voltage and frequency levels accordingly. Such configuration can be based on the tradeoff between performance and energy consumption, and one effective tradeoff method is explained in Section IV-B. Conventionally, voltage scaling is actuated by an on-chip DC regulator, and frequency scaling is realized by a PLL. Between the islands, FIFOs are implemented to interface different voltage/frequency domains.

Fig. 4. An architecture view of Island Based DVFS on NoCs. (Labels: Voltage Island 1, Voltage Island 2.)

2) Per-core DVFS: Per-core DVFS is an alternative DVFS scheme. It is finer-grained than island-based DVFS and is more capable of exploiting local traffic variation.

To support such fine-grained power monitoring, we utilize an advanced power delivery technique with multiple voltage supply networks and power selecting transistors [1], [23], which reduces the area and power overhead of DC converters [24], [25]. Per-core DVFS can be integrated on the hierarchical agent-based NoC platform (Fig. 5), and the power configuration is handled by the cell agents.

Fig. 5. An architectural view of Per-Core DVFS on NoCs with Multiple Voltage Supply Networks.

As shown in Fig. 5, each cell is connected to a number of voltage supply networks via power selecting transistors. The power selecting transistors, labeled NL, NN and NH, are controlled by their own control signals CL, CN and CH. The setting of these signals determines the supply voltage fed to each component. For example, with 3 voltage levels (low voltage VddL, normal voltage VddN and high voltage VddH), Table I lists the options of voltage scaling. Each switch is controlled independently by its own selecting vector and thus very fine-grained optimization can be achieved.

TABLE I
CONTROLLING THE VOLTAGE SUPPLIES

Selection Vector   CL   CN   CH   Voltage Supply
"00"               0    1    1    VL
"01"               1    0    1    VN
"10"               1    1    0    VH
"11"               1    1    1    Power Gating

B. Runtime Power Tradeoff

Power monitoring schemes in our design utilize a dynamic performance/energy tradeoff, based on the observation and adjustment of network load, quantified by the percentage of buffers occupied in the interconnections. As initially proposed in [26], the buffer load of each node is collected by controllers, which configure the proper level of voltage and frequency based on the network condition. Fig. 6 illustrates the relation between latency and buffer load in an 8*8 network with store-and-forward switching and deterministic routing. With lower frequency and voltage, the latency gradually increases until the network saturates. After network saturation, the latency becomes unboundedly high due to the accumulation of queuing time in the buffers. At the same time, the power and energy consumption will be reduced because of the lower frequency and voltage supply. Based on this observation, the system can be configured to adaptively trade off performance and energy by setting a desirable traffic load.

Fig. 6. Adjustment of Transmission Latency by observing Traffic Load (8*8 mesh network, uniform traffic). Axes: Average Network Buffer Load (horizontal, 0.1 to 0.55); Average Flit Latency, Normalized (vertical, 1 to 9).

In the hierarchical agent-based system, a specific level of agents takes the responsibility to perform the energy/performance tradeoff. The exact algorithm with the desirable traffic load is determined by the designer and programmed into the function of the corresponding agents. Such a setting may be reconfigured as required by the application or platform agent. Detailed algorithms are presented in Section V-A4, showing how the tradeoff is performed in per-core and island-based DVFS.

V. QUANTITATIVE EVALUATION

In this section, we evaluate the effectiveness of applying island-based and per-core DVFS to power management of on-chip communications. Average energy consumption and latency of transmission using various traffic patterns are analyzed, in order to show the benefits and tradeoffs of such techniques applied at different granularity and handled by different levels of agents.
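The selection-vector encoding of Table I can be illustrated with a short Python sketch. It is only a software model of the table, not the hardware described in the paper: the names `TABLE_I`, `decode_selection_vector` and `supply_from_signals` are hypothetical, and the control signals are read as active-low (a 0 selects the corresponding supply network), which is an assumption based on how the table rows are laid out.

```python
# Illustrative model of Table I. Names are hypothetical; values are taken from the table.
# Each entry maps a 2-bit selection vector to ((CL, CN, CH), selected supply).
TABLE_I = {
    "00": ((0, 1, 1), "VL"),
    "01": ((1, 0, 1), "VN"),
    "10": ((1, 1, 0), "VH"),
    "11": ((1, 1, 1), "Power Gating"),
}


def decode_selection_vector(vector):
    """Return ((CL, CN, CH), supply) for a 2-bit selection vector string."""
    if vector not in TABLE_I:
        raise ValueError(f"unknown selection vector: {vector!r}")
    return TABLE_I[vector]


def supply_from_signals(cl, cn, ch):
    """Derive the supply from the control signals.

    Assumption: a 0 turns the matching power selecting transistor on,
    and all signals at 1 disconnect every supply (power gating).
    """
    active = [name for name, sig in (("VL", cl), ("VN", cn), ("VH", ch)) if sig == 0]
    if not active:
        return "Power Gating"
    if len(active) > 1:
        raise ValueError("more than one supply network selected")
    return active[0]
```

Under this reading, the two functions agree on every row of the table, for example `supply_from_signals(*decode_selection_vector("01")[0])` yields `"VN"`.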
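The load-based tradeoff of Section IV-B can be sketched as a simple threshold controller. This is a minimal illustration under stated assumptions, not the algorithm of Section V-A4, which is not reproduced here: the target load of 0.3, the hysteresis margin of 0.05, the one-step-at-a-time policy, and the function names are all hypothetical. Only the three supply levels and the use of buffer occupancy as the load metric come from the text.

```python
# Hypothetical threshold controller for the buffer-load-based tradeoff.
LEVELS = ["VL", "VN", "VH"]  # low, normal and high supply, as in Table I


def average_buffer_load(occupied, capacity):
    """Fraction of buffer slots occupied across the monitored nodes."""
    total_capacity = sum(capacity)
    if total_capacity == 0:
        raise ValueError("no buffer capacity to monitor")
    return sum(occupied) / total_capacity


def next_level(current, load, target=0.3, margin=0.05):
    """Move one level up or down so that the load stays near the target.

    A load above the target means buffers are filling up, so the
    voltage/frequency level is raised; a load below the target leaves
    room to save energy, so the level is lowered.
    """
    i = LEVELS.index(current)
    if load > target + margin and i < len(LEVELS) - 1:
        return LEVELS[i + 1]
    if load < target - margin and i > 0:
        return LEVELS[i - 1]
    return current
```

In this sketch the granularity is decided by what is passed in: a cell agent would supply the buffers of its own switch (per-core DVFS), while a cluster agent would supply the buffers of every node in its island (island-based DVFS).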