Handbook of 3D Integration Vol. 4: Design, Test, and Thermal Management. 9783527338559, 3527338551


English Pages [479] Year 2019



Author / Uploaded
Peter Ramm (editor)
Muhannad S. Bakir (editor)
Paul D. Franzon (editor)
Philip Garrou (editor)
Mitsumasa Koyanagi (editor)
Erik Jan Marinissen (editor)


Handbook of 3D Integration

Related Titles

Garrou, P., Bower, C., Ramm, P. (eds.)
Handbook of 3D Integration Volumes 1 and 2: Technology and Applications of 3D Integrated Circuits
2012
Print ISBN: 978-3-527-33265-6

Garrou, P., Koyanagi, M., Ramm, P. (eds.)
Handbook of 3D Integration Volume 3: 3D Process Technology
2014
Print ISBN: 978-3-527-33466-7

Handbook of 3D Integration Design, Test, and Thermal Management

Edited by Paul D. Franzon, Erik Jan Marinissen, and Muhannad S. Bakir

Volume 4

Volume Editors:

Paul D. Franzon
North Carolina State University
Electrical and Computer Engineering
2410 Campus Shore Drive
Raleigh, NC 27606
USA

Erik Jan Marinissen
IMEC
Kapeldreef 75
3001 Leuven
Belgium

Muhannad S. Bakir
Georgia Institute of Technology
Electrical and Computer Engineering
791 Atlantic Drive NW
Atlanta, GA 30318
USA

Series Editors:

Philip Garrou
Microelectronic Consultants of North Carolina
3021 Cornwallis Road
27709 Res. Triangle Park, NC
USA

Mitsumasa Koyanagi
Tohoku University
New Industry Creation Hatchery Center
6-6-10 Aza-Aoba, Aramaki
980-8579 Sendai
Japan

Peter Ramm
Fraunhofer EMFT
Device and 3D Integration
Hansastr. 27d
80686 München
Germany

All books published by Wiley-VCH are carefully produced. Nevertheless, authors, editors, and publisher do not warrant the information contained in these books, including this book, to be free of errors. Readers are advised to keep in mind that statements, data, illustrations, procedural details or other items may inadvertently be inaccurate.

Library of Congress Card No.: applied for

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library.

Bibliographic information published by the Deutsche Nationalbibliothek
The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data are available on the Internet at .

© 2019 Wiley-VCH Verlag GmbH & Co. KGaA, Boschstr. 12, 69469 Weinheim, Germany

All rights reserved (including those of translation into other languages). No part of this book may be reproduced in any form – by photoprinting, microfilm, or any other means – nor transmitted or translated into a machine language without written permission from the publishers. Registered names, trademarks, etc. used in this book, even when not specifically marked as such, are not to be considered unprotected by law.

Print ISBN: 978-3-527-33855-9
ePDF ISBN: 978-3-527-69704-5
ePub ISBN: 978-3-527-69706-9
oBook ISBN: 978-3-527-69705-2

Cover Design: Adam-Design, Weinheim, Germany
Typesetting: SPi Global, Chennai, India
Printing and Binding

Printed on acid-free paper

Contents

Introduction to Design, Test and Thermal Management of 3D Integrated Circuits

Part I Design

1 3D Design Styles
Paul D. Franzon

1.1 Introduction
1.2 3D-IC Technology Set
1.3 Why 3D
1.4 Miniaturization
1.5 Memory Bandwidth
1.6 3D Logic
1.6.1 Power-Efficient Computing and Logic
1.6.2 Modular Partitioning: FFT Processor
1.6.3 Circuit Partitioning
1.6.4 3D Heterogeneous Processor
1.6.5 Thermal Issues
1.7 Heterogeneous Integration
1.8 Conclusions
References

2 Ultrafine Pitch 3D Stacked Integrated Circuits: Technology, Design Enablement, and Application
Dragomir Milojevic, Prashant Agrawal, Praveen Raghavan, Geert Van der Plas, Francky Catthoor, Liesbet Van der Perre, Dimitrios Velenis, Ravi Varadarajan, and Eric Beyne

2.1 Introduction
2.2 Overview of 3D Integration Technologies
2.2.1 Integration Granularity
2.2.2 Stacking Orientation
2.2.3 TSV Formation
2.3 Design Enablement of Ultrafine Pitch 3D Integrated Circuits
2.3.1 Design Flow Overview
2.3.2 3D Integration Backbone Tool
2.3.3 3D Design Flow Add-Ons
2.3.3.1 Interconnect Delay and Power Models
2.3.3.2 Repeater Area Model
2.3.3.3 Cost Model
2.4 Implementation of Mobile Wireless Application
2.4.1 Application Driver
2.4.2 Architecture Template
2.4.3 MPSoC Instance
2.4.4 Multiple Memory Organization and Bus Structure
2.4.5 Experimental Setup
2.4.6 Experimental Results
2.4.6.1 Private vs. Hybrid Memory Architecture
2.4.6.2 Interconnect Technology Comparison
2.4.6.3 Impact of System Architecture
2.4.6.4 System Parameter vs. Design Choices
2.5 Conclusions
References

3 Power Delivery Network and Integrity in 3D-IC Chips
Makoto Nagata

3.1 Introduction
3.2 PDN Structure and Integrity
3.3 PDN Simulation and Characterization
3.4 PDN in 3D Integration
References

4 Multiphysics Challenges and Solutions for the Design of Heterogeneous 3D Integrated System
Alexander Steinhardt, Dimitrios Papaioannou, Andy Heinig, and Peter Schneider

4.1 Introduction
4.1.1 Example: Stent
4.1.2 Example Interposer
4.2 Data Handling for the System View
4.3 Electrical Challenges
4.3.1 Modeling
4.3.2 Simulation
4.3.3 Optimization
4.4 Mechanical Challenges
4.5 Thermal Challenges
4.6 Thermomechanical Challenges
Acknowledgments
References


5 Physical Design Flow for 3D/CoWoS Stacked ICs
Yu-Shiang Lin, Sandeep K. Goel, Jonathan Yuan, Tom Chen, and Frank Lee

5.1 Introduction
5.2 CoWoS vs. 3D Design Paradigm
5.3 Physical Design Challenges
5.4 Physical Design Flow
5.4.1 RC Extraction and TSV Modeling
5.4.2 Interposer Connectivity Checking Technique (LVS)
5.4.3 Interposer Interface Alignment Checking
5.4.4 Cross-Die Timing Check
5.4.5 IR Drop Analysis for the Interposer
5.5 Physical Design Guideline
5.5.1 Interposer Wide Bus Routing Guideline
5.5.2 Interposer SI/PI Analysis for HBM Interface
5.5.3 Combo-Bump Design Style
5.5.4 Chip-Package Co-Design for Stacked ICs
5.5.5 Interposer Multi-Die ESD Protection Scheme
5.6 TSMC Reference Flows
5.7 Conclusion
References

6 Design and CAD Solutions for Cooling and Power Delivery for Monolithic 3D-ICs
Sandeep Samal and Sung K. Lim

6.1 Introduction
6.2 New Thermal Issues in Monolithic 3D-ICs
6.2.1 Material and Structural Differences
6.2.2 Temperature Map Comparisons
6.3 Fast Thermal Analysis with Adaptive Regression
6.3.1 Initial Experiments
6.3.2 Modeling Technique
6.3.3 Sample Generation
6.3.4 Simulation Results
6.4 New Power Delivery Issues in Monolithic 3D-ICs
6.4.1 Design and Analysis Setup
6.4.2 Impact of PDN
6.4.3 PDN Analysis Results
6.5 Power Delivery Network Optimization
6.5.1 Design Styles
6.5.2 Full PDN Analysis Results
6.5.3 PDN Design Guidelines for Monolithic 3D-ICs
6.6 Conclusions
References


7 Electronic Design Automation for 3D
Paul D. Franzon

7.1 Introduction
7.2 EDA Flows for 3D-IC
7.3 Commercial EDA Support
7.4 Modular Partitioning Approaches
7.5 Circuit Partitioning
7.6 Conclusions
References

8 3D Stacked DRAM Memories
Christian Weis, Matthias Jung, and Norbert Wehn

8.1 3D-DRAM Design Space and DRAM Technology Background
8.1.1 DRAM Evolution
8.1.2 Common DRAM Architecture
8.1.3 Architecture Study of a 22 nm 4GB 3D-DRAM Cube
8.1.4 DRAM Memory Controller
8.2 Design Space Exploration of 3D-DRAMs
8.2.1 3D-DRAM Behavioral Models
8.2.1.1 Power Model Verification
8.2.2 3D-DRAM Core Architecture and Technology
8.2.2.1 Wiring and TSV Considerations
8.2.3 3D-DRAM Architecture Exploration Results
8.2.3.1 3D-DRAM Bank Exploration Results
8.2.3.2 Complete 3D-DRAM Stack Results
8.2.4 Flexible Burst Length and Bandwidth Interface for 3D-DRAMs
8.2.4.1 Experimental Results
8.2.4.2 Subsystem Power and Energy Estimation
8.3 Architectural 3D Stacked DRAM Controller Optimizations
8.3.1 Temperature-Aware Refresh Control for 3D Stacked DRAMs
8.3.2 Advanced 3D Stacked DRAM Power-Down Policies
8.3.2.1 Staggered Power Down for Standard DDR3 DRAMs
8.3.2.2 Bankwise Staggered Power Down for WideIO DRAMs
8.4 Conclusion
References

Part II Test

9 Cost Modeling for 2.5D and 3D Stacked ICs
Mottaqiallah Taouil, Said Hamdioui, and Erik Jan Marinissen

9.1 Introduction
9.2 Testing 3D Stacked ICs
9.2.1 Importance of Testing
9.2.2 Test Moments and Test Flows
9.3 Cost Modeling
9.3.1 Cost Classification
9.3.2 Design Cost
9.3.3 Manufacturing Cost
9.3.4 Test Cost
9.3.5 Packaging Cost
9.3.6 Logistics Cost
9.4 3D-COSTAR
9.4.1 Tool Inputs and Outputs
9.4.2 Tool Flow
9.5 Case Studies
9.5.1 Reference Cases
9.5.2 Experiments
9.5.2.1 Fault Coverage of Interposer Pre-Bond Testing
9.5.2.2 Mid-Bond Testing and Logistics
9.5.2.3 Dedicated Probe Pads vs. Micro-Bump Probing
9.5.3 Fault Coverage of Interposer Pre-Bond Testing
9.5.4 Mid-Bond Testing and Logistics
9.5.5 Dedicated Probe Pads vs. Micro-Bump Probing
9.6 Conclusion
References

10 Interconnect Testing for 2.5D- and 3D-SICs
Shi-Yu Huang

10.1 Introduction
10.2 Pre-Bond TSV Testing
10.2.1 General Test Methods for Pre-Bond TSVs
10.2.2 Leakage Test by Voltage Conversion and Comparison
10.2.3 Charge-and-Sample-Based Pre-Bond TSV Test
10.2.4 Input Sensitivity Analysis (ISA)-Based Oscillation Test
10.2.4.1 Electrical Effect of a Resistive Open Fault
10.2.4.2 Electrical Effect of a Leakage Fault
10.2.4.3 Test Structure
10.2.4.4 Fault Detection Scheme
10.2.4.5 Impact of Process Variation
10.3 Post-Bond Interconnect Testing
10.3.1 Direct Measurement
10.3.2 Voltage-Divider-Based Test
10.3.3 Pulse-Vanishing Test (PV Test)
10.3.4 Characterization-Based Test Method via VOT Scheme
10.4 Concluding Remarks
References

11 Pre-Bond Testing Through Direct Probing of Large-Array Fine-Pitch Micro-Bumps
Erik Jan Marinissen, Bart De Wachter, Jörg Kiesewetter, and Ken Smith

11.1 Introduction
11.2 Pre-Bond Testing
11.3 Micro-Bumps
11.4 Probe Technology
11.4.1 Probe Cards
11.4.2 Probe Station
11.5 Test Vehicle: Vesuvius-2.5D
11.6 Experiment Results
11.6.1 Initial Hurdles
11.6.2 Probe Marks
11.6.3 PTPA Accuracy
11.6.4 Contact Resistance
11.6.5 Probe Impact on Stack Interconnect Yield
11.7 Conclusion
Acknowledgments
References

12 3D Design-for-Test Architecture
Erik Jan Marinissen, Mario Konijnenburg, Jouke Verbree, Chun-Chuan Chi, Sergej Deutsch, Christos Papameletis, Tobias Burgherr, Konstantin Shibin, Brion Keller, Vivek Chickermane, and Sandeep K. Goel

12.1 Introduction
12.2 Basic 3D-DfT Architecture
12.3 Vesuvius-3D 3D-DfT Demonstrator
12.3.1 Vesuvius-3D Technology
12.3.2 3D-DfT Demonstrator Design
12.3.3 3D-DfT Demonstrator Results
12.4 Extensions to the Basic 3D-DfT Architecture
12.4.1 Multi-tower Stacks
12.4.2 Test Data Compression
12.4.3 Hierarchical SoCs Containing Embedded Cores
12.4.4 At-Speed Interconnect Testing
12.4.5 Memory-on-Logic Stacks
12.5 Conclusion
Acknowledgments
References

13 Optimization of Test-Access Architectures and Test Scheduling for 3D ICs
Sergej Deutsch, Brandon Noia, Krishnendu Chakrabarty, and Erik Jan Marinissen

13.1 Uncertain Parameters in Optimization of 3D Test Architecture and Test Scheduling
13.2 Robust Optimization of 3D Test Architecture
13.2.1 Mathematical Model for Robust Co-optimization of Test Architecture and Test Scheduling
13.2.2 Heuristic Method for Robust Optimization Based on Simulated Annealing
13.3 Simulation Results
13.4 Conclusion
References


14 IEEE Std P1838: 3D Test Access Standard Under Development
Adam Cron, Erik Jan Marinissen, Sandeep K. Goel, Teresa McLaurin, and Sandeep Bhatia

14.1 Introduction
14.2 Overview
14.2.1 History
14.2.2 PAR Summary
14.2.3 Status
14.3 Scope and Terminology
14.4 Serial Control
14.4.1 Initial 1500 vs. 1149.1 Discussion
14.4.2 1149.1 Serial Data Path
14.4.3 TLR Hold-Off Bit
14.5 Die Wrapper Register
14.5.1 Die Wrapper Requirements
14.5.2 Shared vs. Boundary Wrapping and Stack Considerations
14.5.3 Wrapper Cell Options
14.6 Flexible Parallel Port
14.6.1 FPP Structure
14.6.2 FPP Control Signals
14.6.3 FPP Configurations
14.6.4 Usage Scenarios
14.7 Conclusion
References

15 Test and Debug Strategy for TSMC CoWoS Stacking Process-Based Heterogeneous 3D-IC: A Silicon Study
Sandeep K. Goel, Saman Adham, Min-Jer Wang, Frank Lee, Vivek Chickermane, Brion Keller, Thomas Valind, and Erik Jan Marinissen

15.1 Introduction
15.2 Overview of CoWoS Stacking Process
15.3 CoWoS Chip Architecture
15.3.1 SoC Die
15.3.2 JEDEC WideIO DRAM Die
15.3.3 DRAM Die
15.4 Test and Diagnosis Architecture
15.4.1 Known-Good-Die (KGD) Test
15.4.2 Known-Good-Stack (KGS) Test
15.4.3 Interconnect Test ATPG
15.5 Testing of Passive Silicon Interposer
15.6 Experimental and Silicon Results
15.6.1 Interconnect Failure Diagnosis
15.6.2 Cause and Effect Analysis
15.7 Conclusion
References


Part III Thermal Management

16 Thermal Isolation and Cooling Technologies for Heterogeneous 3D- and 2.5D-ICs
Yang Zhang, Hanju Oh, Yue Zhang, Li Zheng, Gary S. May, and Muhannad S. Bakir

16.1 Thermal Challenges for Heterogeneous 3D-ICs
16.1.1 A 3D-IC Architecture for Thermal Decoupling Using Air-Gap Isolation and Thermal Bridge
16.1.2 Thermal Evaluation of the Proposed Architecture
16.1.3 Impact of TSVs on the Proposed Architecture
16.1.4 Experimental Demonstration
16.2 Thermal Challenges and Solutions for 2.5D-ICs
16.2.1 Thermal Comparison of Interposer and Bridge-Chip 2.5D Integration
16.2.2 Impact of Die Thickness Mismatch and Die Spacing on 2.5D Integration
16.2.3 Thermal Solutions Using Integrated Die-Level MFHS
16.3 Electrical and Fluidic Micro-Bumps
16.3.1 Fabrication Flow
16.3.2 Assembly, Testing, and Characterization
16.4 High Aspect Ratio TSVs Embedded in a Micropin Fin Heat Sink
16.4.1 Fabrication and Testing
16.5 Conclusion
References

17 Passive and Active Thermal Technologies: Modeling and Evaluation
Craig E. Green, Vivek Sahu, Yuanchen Hu, Yogendra K. Joshi, and Andrei G. Fedorov

17.1 Introduction
17.2 Integrated Background Heat Sink Approaches
17.2.1 Single-Phase 3D-IC Cooling
17.2.2 Two-Phase (Flow Boiling) Cooling
17.3 Solid-State Cooling
17.3.1 Thermoelectric Cooler (TEC) Design Principles
17.3.2 Thin Film Coolers
17.3.3 Conventional Thermoelectric vs. Thin Film Coolers
17.3.4 Modeling of Thin Film Coolers
17.3.5 Transient Behavior of Thin Film Cooler
17.4 Passive Cooling: Phase Change Material Regeneration Concerns
17.4.1 Three-Dimensional Model for a CTC Integrated into a Three-Layer Stack with SSC Regeneration
17.4.2 Experimental Setup
17.4.2.1 Thermoelectric Cooler Regeneration Setup
17.4.2.2 Fan Cooling Regeneration
17.4.2.3 Liquid Cooling Regeneration
17.4.3 Device Operation and Characterization Process
17.4.4 Results
References

18 Thermal Modeling and Model Validation for 3D Stacked ICs
Herman Oprins, Federica Maggioni, Vladimir Cherman, Geert Van der Plas, and Eric Beyne

18.1 Introduction
18.2 Modeling Methods for the Thermal Analysis of 3D Integrated Structures
18.2.1 Package-Level Numerical Simulations (FEM/CFD)
18.2.2 Compact Thermal Models (CTMs)/Fast Thermal Models (FTMs)
18.2.2.1 Resistance–Capacitance Networks (White Box)
18.2.2.2 Analytical and Semi-analytical Solutions (White Box/Gray Box)
18.2.2.3 Green’s Function-Based Models (White Box/Gray Box)
18.2.3 Full-Chip Layout-Based Thermal Simulations
18.3 3D Stacked Thermal Test Vehicles
18.3.1 Uniform Power Dissipation Measurements
18.3.2 Hotspot-Based Thermal Test Chip
18.3.3 Full CMOS Test Chip with Programmable Power Map
18.4 Experimental Validation of Thermal Models for 3D-ICs
18.5 Inter-die Thermal Resistance
18.5.1 Experimental Characterization
18.5.2 Modeling Study
References

19 On the Thermal Management of 3D-ICs: From Backside to Volumetric Heat Removal
Thomas Brunschwiler, Gerd Schlottig, Chin L. Ong, Brian Burg, and Arvind Sridhar

19.1 Introduction: Density Scaling Drives Compute Performance and Efficiency
19.2 Thermal Management Landscape for 3D-ICs
19.3 Advances on Thermal Interfaces: Percolating Thermal Underfills
19.4 Single-Phase Interlayer Cooling: Design Rules Toward Extreme 3D
19.4.1 Interlayer Cooling Aware Chip Stack Design
19.4.2 Extreme 3D Enabled by Interlayer Cooling
19.5 Applicability of Two-Phase Cooling for 3D-ICs
19.5.1 Single vs. Two-Phase Cooling Implementations
19.5.2 3D-ICs Interlayer Cooling
19.5.3 Thermal Resistance, Pressure Drops, Local Junction, and Fluid Temperatures
19.5.4 Novel Radial Hierarchical Fluid Network for Two-Phase Interlayer Heat Transfer
19.6 Compact Thermal Modeling Framework
19.6.1 Compact Thermal Modeling for Air-Cooled ICs
19.6.2 3D-ICE: A Compact Thermal Model for Single-Phase Liquid-Cooled ICs
19.6.3 STEAM: A Compact Thermal Model for Two-Phase Liquid-Cooled ICs
19.7 Consequence of Fluid Presence for the Package Topology
19.7.1 Sealing Design
19.7.2 Additional Loads
19.7.3 Stress Testing
19.7.4 System Link and Field Replacement
19.8 Thermal Laminates Enabling Dual-Side Cooling and Electrical Interconnects
19.8.1 Thermal-Power Insert Enabling Dual-Side Cooling
19.8.2 Thermal Power Plane Enabling Dual-Side Electrical Interconnects
References

Index


Introduction to Design, Test and Thermal Management of 3D Integrated Circuits

When we started to work on the first volume of Wiley’s Handbook of 3D Integration in 2007, we intended these volumes to be an encompassing treatise on 3D integration, a new technical field of semiconductor technology and electronic systems packaging. This ambitious goal of Wiley-VCH and the editors was achieved with the help of Christopher Bower, who served as co-editor of the first two volumes. We chose to focus initially on 3D-IC process technology, such as fabrication of through-silicon vias (TSVs), wafer thinning, and temporary and permanent bonding technologies. Volume 3, released in 2014, continued coverage of new developments in process technology, as volume production was still on the horizon. We felt that volume 4 should be strongly dedicated to design, test, and thermal management of 3D-ICs.

Since the mid-2010s, 3D integration and silicon interposer technologies have become well-accepted approaches for fabrication of high-performance memory-enhanced products, in particular stacked DRAMs, which are currently in high-volume production at both Samsung and Hynix. Samsung started the production of “3D Stacked DDR4 DRAM” with via-middle technology in August 2015. So finally, after more than three decades of R&D, 3D-IC integration has arrived in the electronic industry! Another application that has gone into volume production is CMOS image sensors (CIS). In 2017 Sony announced the industry’s first three-layer stacked CIS: a 90 nm generation back-illuminated CIS top chip, 30 nm generation DRAM middle chip, and 40 nm generation image signal processor (ISP) bottom chip for smartphone cameras. All of the other major CIS manufacturers are following suit.

On the other hand, there have also been setbacks. Most significantly, 3D memory-on-logic applications, widely predicted by many sources, have been postponed several times and will still not be introduced in 2019.
While the major issue, of course, is the high cost of 3D-IC manufacturing, another reason for postponing the introduction has been the success of TSV interposer technology. These so-called “2.5D” concepts enable a high interconnection density between side-by-side devices, through TSVs and redistribution layers, without introducing “true 3D integration” (i.e., TSV interconnects through stacked active devices). Considering TSV interposer technologies as well, the market exceeded US$ 4 billion in 2016, and the forecast for TSV-based products for the different applications appears to be very promising (Figure 1).

Figure 1 Revenue forecast for TSV-based products by application, 2016–2022. Categories shown: memory cube, 3D SoC, accelerometer, FPS, other MEMS and sensors, Si interposer, imaging device, ALS, and RF filters. Source: 3D TSV and 2.5D Business Update – Market and Technology Trends Report, Yole Développement, 2017.

One of the most promising silicon interposer approaches has been TSMC’s CoWoS (Chip-on-Wafer-on-Substrate). The CoWoS technology has been in production since 2013, with one of the first applications being Xilinx’s FPGAs. The CoWoS concept opened a new track in the roadmap toward 3D-IC production, with products such as the Xilinx Virtex-7 H580T, labeled the “first heterogeneous 3D FPGA”.

In addition to the above developments, industrial consortia have been targeting 3D integration as a key technology for heterogeneous IC/MEMS products, which demand smart system integration rather than extremely high interconnect densities (for example, the European e-BRAINS platform, established as early as 2013). Heterogeneous integration technologies are being developed for functionally diversified systems, for example, integration of CMOS with other devices, such as analog/RF, solid-state lighting, HV power, passives, sensors/actuators, chemical and biological sensors, and biomedical devices. This heterogeneous integration started with system-in-package technology and is expected to evolve into 3D heterogeneous integration. Many R&D activities worldwide are focusing on heterogeneous integration for novel functionalities. Corresponding 3D integration technologies are under evaluation at several companies, research institutions, and industry-driven research consortia.

Recently, three new relevant international roadmap initiatives have started, highlighting heterogeneous 3D integration as a key element: the International Roadmap for Devices and Systems (IRDS), as follow-up to the ITRS, directed by Paolo Gargini (IEEE SSCS, among others); the Heterogeneous Integration Roadmap (HIR), initiated and directed by William Chen (SEMI, IEEE EPS, among others); and the NanoElectronics Roadmap for Europe: Identification and Dissemination (NEREID), funded by the European Commission. HIR and NEREID, which also cover sensor and MEMS/IC applications, clearly focus on heterogeneous systems. While targeting computing applications more strongly, the IRDS predicts an era of so-called 3D power scaling, with the transition to vertical device structures and heterogeneous integration becoming the key technology driver in the years 2025–2040.

To summarize, dedicated 3D integration technologies are today in a ramp-up phase toward high-volume production. Nevertheless, there is still a large number of related problems, e.g. thermal issues, design and test issues, materials optimization, robustness of the processes, thermomechanical reliability of the systems, and last but not least high production costs, which can only be solved by significant development efforts.

For this volume, we invited Paul D. Franzon (NCSU) for “Design,” Erik Jan Marinissen (IMEC) for “Test,” and Muhannad S. Bakir (Georgia Institute of Technology) for “Thermal Management” to serve as co-editors. They succeeded in assembling excellent contributions from both academic and industrial practitioners in these three key areas of interest. The book is organized into three corresponding parts:

Part I: Design
Paul D. Franzon (editor)
Contributions from Fraunhofer, Georgia Institute of Technology, IMEC, Kaiserslautern University, Kobe University, NCSU, and TSMC

Part II: Test
Erik Jan Marinissen (editor)
Contributions from ARM, Cadence Design Systems, Duke University, FormFactor, Google, IMEC, NTHU, Synopsys, TSMC, and TU Delft

Part III: Thermal Management
Muhannad S. Bakir (editor)
Contributions from Georgia Institute of Technology, IBM, and IMEC

We would like to acknowledge the three co-editors for putting together their parts of the book, the reviewers, and all authors for their chapters.
We are deeply grateful for the time and effort they each put into their contributions. On behalf of all editors and authors, we would like to acknowledge the Wiley-VCH team who greatly supported us, especially Waltraud Wuest and Nina Stadthaus.

July 2018

Philip Garrou
Microelectronic Consultants of NC, Research Triangle Park, NC, USA

Mitsumasa Koyanagi
Tohoku University, Sendai, Japan

Peter Ramm
Fraunhofer EMFT, Munich, Germany


Part I Design


1 3D Design Styles
Paul D. Franzon
North Carolina State University, 2410 Campus Shore Dr., Raleigh, NC 27606, USA

1.1 Introduction

3D-IC and interposer technologies have demonstrated their capability to reduce system size and weight, improve performance, reduce power consumption, and even reduce cost compared with baseline 2D integration approaches. Though not a replacement for Moore’s law, 3D technologies can provide significant improvements in performance per unit of power and performance per unit of cost. The main purpose of this chapter is to provide an overview of product and design scenarios that leverage 3D-IC technologies in 3D-specific ways.

The structure of this chapter is as follows. First, we briefly review the 3D technology set. Then we review the main design drivers for using 3D technologies: (i) miniaturization, (ii) provisioning power-efficient memory bandwidth, (iii) improving the performance/power of logic, and (iv) heterogeneous integration for cost reduction and to enable unique system capabilities.

Handbook of 3D Integration: Design, Test, and Thermal Management, First Edition. Edited by Paul D. Franzon, Erik Jan Marinissen, and Muhannad S. Bakir. © 2019 Wiley-VCH Verlag GmbH & Co. KGaA. Published 2019 by Wiley-VCH Verlag GmbH & Co. KGaA.

1.2 3D-IC Technology Set

There are several technology components that can be mixed and matched in the 3D technology set. The purpose of this section is to introduce them, not to review them in detail; other books in this series focus on the technology. The main 3D-IC technologies of interest are illustrated in Figures 1.1–1.4.

Figure 1.1 Interposer or 2.5D integration. Labeled elements: micro-bumps, wiring layers, through-silicon vias (TSVs), bumps, interposer. Typical dimensions: interposer thickness 100 μm; micro-bump pitch 25 μm+; wiring layers: 4–6 layers, submicron to multi-μm thickness, typically 1–10 μm width and space; bump pitch 150 μm+.

Figure 1.2 Redistribution layer. Labeled elements: bumps, micro-bumps, wiring layers, base CMOS wafer, redistribution layer (RDL). Typical dimensions: interposer thickness 100 μm; micro-bump pitch 25 μm+; wiring layers: 1–3 layers, submicron to multi-μm thickness, typically 1–10 μm width and space.

Interposers (Figure 1.1) are so called because they are placed, or posed, between the chip and the main laminate package. Using interposers is often referred to as 2.5D integration. A common way to make interposers is to use silicon processing technologies to create a microscale circuit board. Through-silicon vias (TSVs) are fabricated in a silicon wafer, and multiple metal layers are then fabricated on top. These metal layers can be fabricated with thin film processing, typically giving 3–6 metal layers up to a few micrometers thick, or with integrated circuit back-end-of-line (BEOL; see footnote 1) techniques, giving 4–6 thinner but planarized metal layers. The latter approach usually reuses a legacy BEOL process, e.g. from the 65 nm technology node. Micron-scale line width and space can be readily achieved. The interposer is usually thinned to 100 μm, so 100 μm long TSVs connect the metal layers to the package underneath. The pitch of the TSVs is also typically around 100 μm. Chips are flip-chip bumped to the top of the interposer, and the interposer connects them to each other and to the outside world. The bump pitch between the chip and the interposer can be relatively tight, down to 25 μm, but the bump pitch between the interposer and the package must be at conventional scales, typically in the 150+ μm range. The chips on top of the interposer can be single dies or multi-chip stacks themselves.

Footnote 1: The front end of line refers to transistor fabrication, which is usually done before metal interconnect fabrication; the latter is thus called BEOL.
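To make the pitch figures quoted above easier to compare, the following minimal sketch converts a pitch into an areal connection density. It assumes a fully populated square grid, which is an idealization introduced here for illustration and not a statement about any particular product; the function name `connections_per_mm2` and the dictionary labels are hypothetical.

```python
# Illustrative sketch: connections per mm^2 for a square grid at a given pitch.
# Assumption (not from the text): the array is square and fully populated.

def connections_per_mm2(pitch_um: float) -> float:
    """Areal density of a square array with the given pitch in micrometers."""
    if pitch_um <= 0:
        raise ValueError("pitch must be positive")
    per_mm = 1000.0 / pitch_um  # connections along one millimeter
    return per_mm ** 2

# Pitches quoted in the text (lower bounds where the text gives "x um +").
PITCHES_UM = {
    "micro-bump (chip to interposer)": 25.0,
    "TSV (through the interposer)": 100.0,
    "bump (interposer to package)": 150.0,
}

for name, pitch in PITCHES_UM.items():
    print(f"{name}: {connections_per_mm2(pitch):.0f} per mm^2")
```

Under these assumptions, the 25 μm micro-bump pitch gives 1600 connections per mm², against 100 per mm² for TSVs at 100 μm pitch and roughly 44 per mm² for package bumps at 150 μm pitch, which is why the interposer can offer far denser chip-to-chip wiring than the package underneath.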

Figure 1.3 3D-IC chip stacking technology set. Labeled elements: micro-bumps or direct bond, transistor/wiring layers, through-silicon vias (TSVs), micro-bumps to package or interposer, 3D chip stack. Typical dimensions: thinned chip 25–50 μm; unthinned chip 300 μm; micro-bump pitch 25 μm+; direct bond pitch 3 μm+; TSV diameter 5–10 μm; TSV pitch 25 μm+.

Figure 1.4 3D integration in silicon on insulator technology. Labeled layers: oxide, epi, metal, transistors, buried oxide, bulk silicon. Process steps: (1) oxide–oxide bond, (2) silicon etch, (3) via formation, (4) repeat.

Another interposer technology under active investigation uses glass rather than silicon as the substrate. Large panel processing techniques, such as those used to make television screens, can then potentially be used, and a price reduction achieved.

A related technology is to create interconnect on top of an already finished CMOS wafer and use that to connect to chips and inputs/outputs. This is illustrated in Figure 1.2. Additional thin film wiring layers are processed on top of a completed CMOS wafer to connect the chips in that wafer to chips that are placed on top, together with the chip stack IO. It is referred to as a redistribution layer (RDL) as the CMOS wafer IOs are redistributed. Not as many wiring layers are possible as with interposers. One application of RDL technology is to stack a larger die, e.g. a memory stack, onto a smaller die, e.g. a processor.

An exemplar 3D chip stack, or 3D-IC, is shown in Figure 1.3. This illustrates a three-chip stack, two of which incorporate TSVs. The top two chips illustrated in this stack are mated face to face (F2F). That is, the transistor and wiring layers are directly mated. This mating can be done with solder bumps or with a thermocompression or direct bonding technology. The latter technologies have been demonstrated down to 3 μm pitch and have potential for 1 μm pitch. An example of a copper direct bond interconnect technology can be found in [1]. This permits a very high interconnect density between the two chips. These F2F connections can be leveraged in multiple ways to enable higher-performance and lower power logic stacks.

TSVs can be used to connect the face of one chip, through the back of another, to that chip's transistor/wiring layer, or to connect chip stack IO through a chip backside. Thus they can connect chips face to back (F2B, shown in Figure 1.3 between the bottom two chips) or even back to back (B2B, not shown). TSVs are made using techniques that create highly vertical vias through the bulk silicon substrate. They have a lower density than an F2F connection but are important for creating chip stacks. For example, the TSVs shown in Figure 1.3 connect the primary IO and the power and ground connections at the bottom up through the chip stack. The layers with TSVs have to be thinned. The chip stack often includes one unthinned layer for mechanical stability (though this is not a requirement).

A fourth option that is only possible in a silicon on insulator (SOI) technology is shown in Figure 1.4. In this approach, fabricated wafers are joined F2F using an oxide–oxide bond. Since the transistors are built on top of an oxide layer, a silicon-selective back etch can be used to remove the silicon part of the SOI substrate while not affecting the transistors and interconnect layers. Simple through-oxide vias can then be used to create vertical connections between what were previously separate chips. An example of this process can be found in [2]. If the first two chips in the stack are fabricated without interconnect, then one gets two directly connectable transistor layers in what would be considered a monolithic 3D technology.

1.3 Why 3D

Table 1.1 presents a summary of potential drivers for 3D integration. The desire for thinner smartphone cameras has resulted in the first mainstream high volume use of 3D technologies. Such miniaturization can also be used for other image sensors and for smart dust sensors. Provisioning large amounts of power-efficient memory bandwidth appears to be the next volume application of 3D technologies. In contrast, logic stacking or logic-on-memory stacking has had strong but unrealized potential for improving system performance/power. Finally, 3D offers unique opportunities for heterogeneous integration of different technologies. Each of these potential design drivers will be explored in detail in the next four sections.


Table 1.1 Issues that are potential drivers for 3D integration.

| Driving issue | Case for 3D | Caveats |
|---|---|---|
| Miniaturization | Stacked memories; smart dust sensors; image sensors | For many smart dust cases, stacking and wire bonding is sufficient |
| Memory bandwidth | 3D memory can dramatically improve memory bandwidth and power consumption | Stacking memory on logic has thermal issues |
| Interconnect delay, bandwidth, and power | Length of critical paths can be substantially reduced through 3D integration, or benefit can be made of massive vertical bandwidth. In certain cases, a 3D architecture might have substantially lower power or performance/power over a 2D architecture | Not all cases have a substantial advantage. Thermal issues can be solved with careful floor planning and/or liquid cooling |
| Mixed technology (heterogeneous) integration | Tightly integrated mixed technology (e.g. III–V on silicon or analog on or next to digital) can bring many system advantages in performance and cost | |

1.4 Miniaturization

Obviously, 3D stacking technologies using thinned silicon have direct potential to reduce system volume. An early application of TSVs was to provide the IO connections for cell phone camera frontside imaging sensors (http://image-sensorsworld.blogspot.com/2008/09/toshiba-tsv-reverse-engineered.html; http://www.semicontaiwan.org/en/sites/semicontaiwan.org/files/docs/4._mkt__jerome__yole.pdf). The goal was not to leverage 3D chip stacks (these were single die) but to reduce the overall sensor height, at least when compared with conventional packaging approaches.

More recently Sony has leveraged a copper–copper direct bonding technology to create an image sensor as a two-chip stack [3] (http://www.sony.net/SonyInfo/News/Press/201201/12-009E/index.html, http://www.3dic.org/3D_stacked_image_sensor). One chip is a backside-illuminated pixel array that does not include interconnect layers or even complete CMOS transistors. The second chip is a complete CMOS chip on which are built the analog-to-digital converters (ADCs) and interconnect for all other functionality required of an image sensor. This approach leverages the high density capability of a direct bonding technology since pixel-scale vertical interconnect is required. Since only one of the two chips goes through a full CMOS fab, there is potential for cost reduction in comparison with a 2D sensor of the same total area having to go through a full CMOS fab. In contrast, here, the sensor-only chip should be substantially cheaper per square millimeter. In addition, the volume is reduced substantially through a smaller footprint. This image sensor is probably the first high volume application incorporating a full 3D-IC chip stack.


Figure 1.5 Food processing sensor. Panels: layouts; die photo.

In the research domain a number of 3D image sensors have been demonstrated, too many to summarize here. Separating the image and processing layers leads to the potential for improved performance in terms of sensitivity (larger pixels), frame rate (e.g. faster or more ADCs in the CMOS layer), and integrated advanced processing (e.g. edge detection for robotics). Another interesting use of 3D technologies has been to build non-visible light sensors, sometimes using a non-silicon technology for the sensing layer. Examples include IR imagers, X-ray imagers [4], and other imagers for high energy physics investigations (http://meroli.web.cern.ch/meroli/DesignMonoliticDetectorIC.html).

3D-IC has also been explored for non-image sensors. 3D chip stacking can be used to make such sensors with low integrated volume. Although their device was fabricated using wire bonding, Chen et al. demonstrated an integrated power harvesting, data collecting sensor with the photovoltaic power harvesting chip mounted on top of the logic and RF chips [5]. This maximizes the photovoltaic power harvesting area while minimizing the volume. TSVs and bonding technologies would permit further volume reduction.

Lentiro [6] describes a two-chip stack aimed at simulating a particle of meat for the purposes of calibrating a new food processing system. One chip is an RFID power harvester and communication chip, and the second is the temperature data logger (Figure 1.5). It is a two-chip stack with F2F connections and TSV-enabled IO. It is integrated with a small battery, used for data collection purposes only, as the RFID cannot be employed in the actual processing pipes. The two-chip stack permits smaller imitation food particles than would otherwise be the case.

1.5 Memory Bandwidth

Memory is positioned as the next large volume application of 3D-IC technologies. To date DRAM has relied on one-signal-per-pin signaling using low cost, low pin count, single chip plastic packaging. As a result, DRAM has continued to lag logic in terms of bandwidth potential and power efficiency. Furthermore, the IO speed of one-signal-per-pin signaling schemes is unlikely to scale much beyond what can be achieved today in double data rate DDR4 (up to 3.2Gbps per pin) and graphics GDDR6 (8Gbps). Beyond these data rates, two-pins-per-signal differential signaling is needed. In addition, the IO power consumption, measured in mW/Gbps, is relatively high, even for the LPDDR standards (intended for mobile applications).

Thus there are multiple 3D-IC enabled memory solutions available in the market, all of which offer improved data bandwidth and power efficiency over conventional memories. These include the hybrid memory cube (HMC), high bandwidth memory (HBM), WideIO, and Tezzaron disintegrated RAM (DiRAM). These are summarized in Figure 1.6 and Table 1.2. Note that in the table, B (bytes) and b (bits) are both used. Also note that 1 mW/Gbps is equivalent to 1 pJ/bit.

Figure 1.6 3D DRAMs. Panels: hybrid memory cube (DRAM layers, vertical slice, logic layer with crossbar, controller, and SerDes, package); high bandwidth memory (DRAM half bank, TSV array, logic layer with controller and 8 × 128-bit channels, micro-bump IO at 48 × 55 μm pitch); WideIO (DRAM half bank, TSV array, micro-bump IO); DiRAM (DRAM subarray layers, logic layer with global sense amps, decoders, and defect management).

Table 1.2 Comparison of 2D and 3D memories.

| Technology | Capacity | BW (GBps) | Power (W) | Efficiency (mW/GBps) | IO efficiency (mW/Gbps) | DQ count |
|---|---|---|---|---|---|---|
| DDR4-2667 | 4GB | 21.34 | 6.6 | 309 | 6.5–39 | 32 |
| LPDDR4 | 4GB | Up to 42 | 5.46 | 130 | 2.3 | 32 |
| HMC | 4Gb | 128GBps | 11.08 | 86.5 | 10.8 | 8 SerDes lanes |
| HBM | 16Gb | 256GBps | | 48 | | 1024 |
| WideIO | 8–32Gb | 51.2GBps a) | | 42 b) | | 256 |
| DiRAM4 | 64Gb | 8Tbps | | | | 4096 |

a) WideIO2. WideIO1 was half of this.
b) WideIO1. WideIO2 should be lower.

The HMC is a joint Intel–Micron standard that centers on a 3D stacked part including a logic layer and multiple DRAM layers organized as independent vertical slices. This 3D chip stack is then provided as a packaged part, so the customer does not have to deal with any 3D-IC or 2.5D packaging issues. At the time of writing this chapter, Micron offers 2GB and 4GB parts with a maximum memory bandwidth of up to 160GBps (the 128GBps part is used in Table 1.2). The data IO is organized as eight high-speed serial channels or lanes. HMC is mainly aimed at computing applications.

The HBM is a JEDEC (i.e., industry-wide) standard that is intended for integration via an RDL, 3D-IC stacking, or interposer to logic. It has a large number of data IO (DQ) pins, configured as 8 × 128-bit wide interfaces with each pin running at up to 2Gbps. Connecting this large number of pins (placed on a 48 μm × 55 μm grid) is the reason why it has to be integrated via an RDL, interposer, or direct 3D stacking. It is fabricated as a stack of multibank memory dies, connected to a logic die through TSV arrays that run through the chip centers. Each chip is F2B mounted to the chip beneath it. The eight channels are operated independently. Details for a first-generation HBM (operating at a 3.8 pJ/bit power level at 128GBps) can be found in [7]. The use of HBM in graphics module products has been announced by Nvidia and AMD.

WideIO is also a JEDEC-supported standard, aimed largely at low power mobile processors. While intended to be mounted on top of the logic die in a true 3D stack, side-by-side integration on an interposer is also possible. WideIO is a DRAM-only stack; there is no logic layer. Instead the DRAM stack is exposed through a TSV-based interface, and the memory controller is designed separately on the customer-designed CPU/logic die. An example is the ST/CEA WideIO1 test vehicle [8]. WideIO also supports multiple independent memory channels, operating at up to 800Mbps/pin. For example, the Samsung WideIO2 product supports four channels, each 64 bits wide and operating at 800Mbps/pin. The standard is currently in its second generation (WideIO2), and a third is being planned. WideIO has yet to enjoy commercial success.
To date the thermal challenges of mounting a DRAM on an already hot mobile processor logic die have been insurmountable, especially as it is desired to operate the DRAM at a lower temperature than logic (85 °C for DRAM vs. 105 °C for logic) to control leakage and refresh time. One potential solution is side-by-side integration on an interposer. WideIO has potential for employment in mobile processor-based server solutions as the thermal issues are easier to manage.

The Tezzaron DiRAM4 is a proprietary memory. It has 4096 data IO organized across 64 ports. It is intended only for 3D and interposer integration. It has a unique organization in that the logic layer is not only used for controller and IO functions but also houses the global sense amplifiers and address decoders that in other 3D memories are on the DRAM layers. This permits faster operation for these circuits. The DiRAM4 has potential for a very high bandwidth (up to 8Tbps) and fast random cycles (15 ns) [9]. DiRAM4 is being integrated into a number of specialized applications that benefit from its high bandwidth.
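The bandwidth and efficiency figures quoted in this section are tied together by simple arithmetic: peak bandwidth is the number of data pins times the per-pin rate, and 1 mW/Gbps equals 1 pJ/bit. The following Python sketch illustrates this using only numbers quoted in the text and in Table 1.2; the function names are hypothetical and chosen for illustration, and the calculation ignores protocol overhead and any power not attributable to data movement.

```python
def peak_bandwidth_gbytes_per_s(data_pins: int, gbps_per_pin: float) -> float:
    """Peak bandwidth in GB/s for data_pins pins, each running at gbps_per_pin."""
    return data_pins * gbps_per_pin / 8.0  # 8 bits per byte


def energy_per_bit_pj(power_w: float, bandwidth_gbytes_per_s: float) -> float:
    """Energy per bit in pJ/bit (numerically equal to mW/Gbps)."""
    gbps = bandwidth_gbytes_per_s * 8.0
    return power_w * 1000.0 / gbps  # W -> mW, then mW per Gbps


def power_w_from_energy(pj_per_bit: float, bandwidth_gbytes_per_s: float) -> float:
    """Power in W needed to sustain a bandwidth at a given energy per bit."""
    gbps = bandwidth_gbytes_per_s * 8.0
    return pj_per_bit * gbps / 1000.0  # mW -> W


if __name__ == "__main__":
    # HBM: 8 channels x 128 bits, up to 2 Gbps per pin (from the text).
    print("HBM peak bandwidth (GB/s):", peak_bandwidth_gbytes_per_s(8 * 128, 2.0))
    # HMC row of Table 1.2: 11.08 W at 128 GB/s.
    print("HMC energy (pJ/bit):", round(energy_per_bit_pj(11.08, 128.0), 2))
    # First-generation HBM: 3.8 pJ/bit at 128 GB/s (from the text).
    print("HBM gen-1 power (W):", round(power_w_from_energy(3.8, 128.0), 2))
```

Under these assumptions the HBM interface works out to the 256GBps listed in Table 1.2, and the HMC power and bandwidth reproduce the roughly 10.8 mW/Gbps IO efficiency listed there.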

1.6 3D Logic

It has always been assumed that the next major employment of 3D-IC, after memories, would be 3D logic, i.e., logic stacks. The argument is simple. On-chip wiring dominates the area, performance, and power consumption of many logic chips. 3D logic stacking would shorten many of those wires, leading to power reduction, performance enhancement, and area reduction. While these improvements can be achieved, the increased cost and heat flux issues have been challenging. This section describes experiments that have demonstrated these advantages and points to some solutions to the heat flux question. The main metric of interest in evaluating logic-on-logic stacks is performance per unit of power. The first two experiments to be described are ones in which a 2D logic chip is partitioned into 3D stacked chips: first at the module level and second at the circuit level. Before describing those experiments, some discussion on power efficiency in computation is warranted.

1.6.1 Power-Efficient Computing and Logic

Table 1.3 lists the energy per operation for a range of operations, where appropriate, scaled to 0.6 V operation at the 7 nm node (for logic). (Note that 1 pJ/op = 1 mW/Gbps.) This table was constructed by taking simulation or published power results and scaling them using the conservative scaling factors published by Intel authors in [10, 11]. These conservative factors capture the slowdown in performance and power scaling expected after the 22 nm node. The single instruction multiple data (SIMD) core was the one designed at NCSU in 65 nm CMOS and optimized for low power operation. Some more detail on this core can be found in [12].

For DRAM, these numbers are for the DRAM core only (not its IO or other overhead) at the 16 nm node, which is the presumed last DRAM node. These figures were taken from [13] and are for DRAM structures likely for commodity products, with high DRAM cell fill factors. The fill factor is the percentage of total area given over to DRAM cells. Energy/access for a DRAM can be improved by using smaller banks, with lower fill factor. Early studies on this aspect indicate that an improvement of about 4× is possible through this approach.

For interconnect, some of these figures are taken from the modeling and simulation study presented in [10] and again extrapolated to the 7 nm node. The interposer power was based on an extrapolation of the results presented in [14], with an assumption that 2/3 of the power is for driving the transmission line and so does not scale.

Table 1.3 Energy per operation for a range of operations generally scaled to 0.6 V at the 7 nm node.

| Category | Item | Energy/32-bit word |
|---|---|---|
| Computation | 32-bit multiply–add (SP) | 6.02 pJ/op |
| Computation | FPU | 1.4 pJ/op |
| Computation | SIMD vector processor (16 lane) | 4.6 pJ/FLOP |
| Data storage | 16 × 64-bit RF | 0.5 pJ/word |
| Data storage | 128KB SRAM | 0.9 pJ/word |
| Data storage | L1 Dcache (16KB) | 62 pJ/16 B |
| Data storage | L2 Dcache (2MB) | 24 pJ/16 B |
| Data storage | 16 nm DRAM core | 140 pJ/word |
| Communications | On-chip | 0.23 pJ/word/mm |
| Communications | PCB | 54 pJ/word |
| Communications | Interposer | 17 pJ/word |
| Communications | TSV | 1.1 pJ/word |

Source: Adapted from Borkar 2010 [10] and Esmaeilzadeh et al. 2011 [11].

What is interesting to observe is that for 2D technologies, calculation (computation) is energetically much cheaper than data storage or communications, which creates serious constraints for power-efficient computing. Power efficiency is best achieved by minimizing data motion and by minimizing memory references, especially to DRAM or via the cache hierarchy. In contrast, data motion using 3D technologies takes much less energy than when using 2D technologies. With 3D stacking, vertical data communication using TSVs or a direct bond interface consumes less power than computation. Thus it now makes sense to move data if an overall advantage can be gained. An example of this is given below as a heterogeneous computer.
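To make the comparison between computation and data motion concrete, the sketch below expresses the cost of moving one 32-bit word over several paths as a multiple of one FPU operation, using the entries of Table 1.3. The dictionary keys, function names, and the 10 mm on-chip distance are assumptions introduced purely for illustration.

```python
# Energies in pJ, copied from Table 1.3. Key names are illustrative only.
ENERGY_PJ = {
    "fpu_op": 1.4,           # one FPU operation
    "on_chip_per_mm": 0.23,  # one word moved 1 mm on chip
    "tsv": 1.1,              # one word through a TSV
    "interposer": 17.0,      # one word across an interposer
    "pcb": 54.0,             # one word across a PCB
    "dram_core": 140.0,      # one word accessed in a 16 nm DRAM core
}


def move_energy_pj(path: str, distance_mm: float = 0.0) -> float:
    """Energy in pJ to move (or access) one 32-bit word over the given path."""
    if path == "on_chip":
        return ENERGY_PJ["on_chip_per_mm"] * distance_mm
    return ENERGY_PJ[path]


def fpu_ops_equivalent(energy_pj: float) -> float:
    """Number of FPU operations that cost the same energy."""
    return energy_pj / ENERGY_PJ["fpu_op"]


if __name__ == "__main__":
    cases = {
        "on-chip, 10 mm (assumed distance)": move_energy_pj("on_chip", 10.0),
        "TSV": move_energy_pj("tsv"),
        "interposer": move_energy_pj("interposer"),
        "PCB": move_energy_pj("pcb"),
        "DRAM core access": move_energy_pj("dram_core"),
    }
    for name, energy in cases.items():
        print(f"{name}: {energy:.2f} pJ = {fpu_ops_equivalent(energy):.2f} FPU ops")
```

With these table values, a TSV hop costs less than one FPU operation, whereas a PCB hop costs tens of operations and a DRAM core access about a hundred, which is the asymmetry the paragraph above describes.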

1.6.2 Modular Partitioning: FFT Processor

This system consists of three stacked tiers with eight processing elements (PEs), one controller, thirty-two SRAMs, and eight ROMs [15]. The system performs 32 memory accesses per cycle (16 reads and 16 writes), completing a 1024-point fast Fourier transform (FFT) in 653 cycles utilizing five pipeline stages. The floor plan is designed so that all communications are vertical; there are no horizontal communications between PEs. The chip was implemented in the Lincoln Labs SOI 3D process described earlier. The die photo (Figure 1.7) clearly shows the TSV arrays, one of which is specifically pointed out; their locations were dictated to be at the SRAM bank interfaces. Figure 1.7 also shows the stacked chip floor plans. These clearly show the modular nature of the partitioning in that each PE module is preserved as an integrated 2D design. The logic in the interior of the modules is not broken into 3D. Each PE communicates vertically with the memories stacked with it. By breaking a large memory into 32 smaller memories, memory power was reduced by 58%. (A similar trade-off exists for DRAMs.)

Figure 1.7 3D FFT engine die photo and floor plans of the three chips in the stack. Labeled blocks: TSV array; processing elements 0–7; controller; ROM0–ROM7; 32 SRAM banks in even (E) and odd (O) pairs, labeled Mem17 through Mem136.

This three-tier stack was redesigned as a 2D chip. The floor plan of this chip, together with a module connectivity diagram, is shown in Figure 1.8. A comparison of this with the 3D chip is summarized in Table 1.4. The total area of the 3D chip is 25% less than that of the 2D equivalent. This difference arises from the added area needed in the 2D chip to route all the additional wiring. The total length of routed wire is 57% lower in the 3D chip than in the 2D chip. Admittedly, there are around 1800 connections between the PEs and the memories, an amount very affordable in the 3D version but expensive in the 2D version. Due to the reduced wiring load, the 3D version could operate 24.6% faster and with 4.4% less power. Even the logic power is lower in the 3D version due to the reduced capacitive load at the logic outputs. The 3D version of this architecture shows significant advantages due to improvement in routability between the modules.

Figure 1.8 2D floor plan and architecture of FFT engine. Labeled blocks: processing elements 0–7, controller, ROM0–ROM7, and the even/odd memory banks; the architecture diagram shows even and odd memories connected to PE 0 through PE 3.

Table 1.4 Comparison of 2D and 3D FFT engines.

| Metric | 2D | 3D | Change (%) |
|---|---|---|---|
| Total area (mm²) | 31.36 | 23.4 | −25 |
| Total wire length (m) | 19.1 | 8.23 | −57 |
| Maximum speed (MHz) | 63.7 | 79.4 | +24.6 |
| Power at 63.7 MHz (mW) | 340 | 325 | −4.4 |
| FFT logic energy (μJ) | 3.55 | 3.36 | −5.2 |

1.6.3 Circuit Partitioning

A modified CAD flow was applied to three different designs: a radar PE, an AES encryption engine, and a MIMO multipath radio processing engine. The CAD flow was designed to partition a 2D chip into two stacked chips. The partitioning is done at the circuit level, with connected logic gates possibly being on different chips and connected vertically. This partitioning approach leverages the high density and bandwidth of the copper–copper direct bonding interface when two dies are stacked F2F with each other. The minimum bond pitch was 6.3 μm, and the chips were made in a standard 130 nm bulk CMOS process. TSVs were used for backside IO. All flip-flops are kept in one tier so that 3D clock distribution was not required. The radar PE was implemented in the Tezzaron bulk CMOS 3D process [16] (Figure 1.9).

The results are summarized in Table 1.5. On average, performance per unit of power was increased by 22% due to the decreases in wire length achieved through this partitioning approach. The radar processor had an improvement in performance per unit of power of 21%. The other designs achieved 18% and 35%.

Table 1.5 Improvements in 3D design over 2D using logic cell partitioning.

| Design | Total wire length (% change) | Fmax (% change) | Total power (% change) | Power/MHz (% change) |
|---|---|---|---|---|
| Radar PE | −21.0% | +22.6% | −12.9% | −38% |
| AES | −8% | +15.3% | −2.6% | −18% |
| MIMO | +216% | +17.1% | −5.1% | −23% |

1.6.4 3D Heterogeneous Processor

This design is very 3D specific. It takes advantage of vertical connections and their lower power characteristics in a unique way. A stack of two different CPUs is integrated vertically using a vertical thread transfer bus that permits fast compute load migration between the high-performance CPU and the low power CPU whenever an energy advantage is found [17]. In this design, the high-performance CPU can issue two instructions per cycle, while the low power CPU is a single-issue CPU. The transfer is managed using a low-latency, self-testing multi-synchronous bus [18]. The bus can transfer the state of the CPU in one clock cycle by using a wide interface and exploiting a high density copper–copper direct bond process. The caches are switched at the same time, removing the need for a cold cache restart.

Simulation with Specmark workloads shows a 25% improvement in the power/performance ratio compared with executing the sample workload solely in the high-performance processor. In contrast, if the workload was executed solely in the single-issue (low power) CPU, there was a 28% total energy savings compared with keeping the workload in the high-performance CPU, but at the expense of a 39% reduction in performance. If the workload was allowed to switch every 10 000 cycles, there was a 27% total energy savings at the expense of only a 7% reduction in performance. That is, a 25% improvement in power per unit of performance is achieved.

This processor stack was taped out in a 3D 130 nm process in fall 2015. A copper–copper direct bond interface, with an 8 μm pitch, is used to build the required vertical connectivity. Key to this design is how the various bus elements are built into the logic tiers so that it can be further stacked with itself or other elements, such as accelerators.


Figure 1.9 Layouts and die photo of the 3D radar processing element. Panels: layout of the logic-only tier; layout of the tier with logic, clocks, and flip-flops; die photo.

Another feature of this processor is that it will use the fast multi-port Tezzaron 3D DiRAM4 memory as a combined L2/L3 cache. This DRAM can perform fast RAS–RAS cycles while providing more than 1Gb of total capacity. Compared with an SRAM-based cache hierarchy, it provides a 90% performance improvement while reducing power consumed in these caches by almost 4×.

An illustration of the overall floor plan is shown in Figure 1.10, showing the two processors in the stack integrated with each other and their caches. Between the two processors is a 2254-bit wide (1120 data in each direction and 14 control signals) thread transfer bus. This is a very short bus that runs through the copper direct bond pads between the two chips. Each processor is also connected to both caches through a switch. The buses in the switch are again very short. One bus runs horizontally between each cache and the CPU in the same chip; another runs vertically to the CPU in the other chip. If this architecture was not built as a 3D chip, then at least two of the buses shown here would be long and power hungry and would introduce additional delay. Figure 1.11 shows a layout of the two chips in the stack.

Figure 1.10 3D heterogeneous processor floor plan as a two-chip stack. Labeled blocks: thread transfer bus, CPU 1, CPU 2, cache 1, cache 2, cache connection switch.

1.6.5 Thermal Issues

Based on a simple calculation, the thermal ramifications of 3D-IC are not very positive. In the examples above, the maximum power reduction due to 3D was 13%. Since the footprint area is halved, this means the heat flux is increased 1.7×, which would lead to a significant temperature rise.

However, this simple calculation ignores the fact that temperature rise is very dependent on details of the floor plan. For example, by staggering the high power density blocks so that they do not overlap, Saeidi et al. [19] showed that a two-chip stack can achieve a junction temperature only 8 °C higher than the 2D equivalent, and the same junction temperature if a 5% reduction in dynamic power can be achieved in the 3D version. They then investigated this concept in the framework of a mobile processor design and found that, through a combination of clever floor planning and/or partitioning together with the 5–16% power reduction that 3D gives, the worst hotspot could be cooler in the 3D design than in the 2D. They achieved this for the CPU through careful modular floor planning that prevents high power density modules from overlapping. They achieved this in the GPU by leveraging the power reduction potential of circuit partitioning across an F2F connection. For servers, another potential solution is to use liquid cooling. Thus with some sophistication, thermal issues do not have to be a barrier to realizing the advantages of 3D design.
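The simple heat flux estimate at the start of this section can be written out explicitly: if stacking changes total power by some fraction and shrinks the footprint, the average heat flux scales as the ratio of the two. The sketch below is only this back-of-the-envelope calculation, with hypothetical function and parameter names; as the text notes, it says nothing about hotspots, which depend on the floor plan.

```python
def heat_flux_ratio(power_reduction: float, footprint_ratio: float) -> float:
    """Average heat flux of the 3D stack relative to the 2D chip.

    power_reduction: fractional power saving of the 3D design (0.13 for 13%).
    footprint_ratio: 3D footprint area divided by 2D footprint area
                     (0.5 for a two-chip stack with half the footprint).
    """
    return (1.0 - power_reduction) / footprint_ratio


if __name__ == "__main__":
    # Values quoted in the text: 13% power reduction, footprint halved.
    print(f"heat flux ratio: {heat_flux_ratio(0.13, 0.5):.2f}")
```

For the 13% power reduction and halved footprint quoted above, the ratio is 0.87/0.5, i.e. about 1.7, in line with the figure given in the text.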

1.7 Heterogeneous Integration

Another unique aspect of 3D-IC is that these technologies enable different technologies to be intimately mated with high connectivity. An example that has already been given is that of a CMOS image sensor fabricated as a two-chip stack. The two chips are different: one chip consists only of imaging pixels, while the second is a complete CMOS chip. This leads to lower cost than the alternative of two full CMOS chips. DRAM on top of logic also serves as an example of heterogeneous integration. Three examples of heterogeneous integration will be given in the rest of this section: (i) splitting logic for cost reduction, (ii) mixing different CMOS nodes within one module, and (iii) mixing III–V and silicon technologies within one 3D-IC chip stack.

The first example is only heterogeneous in the sense that it is mixing interposer and CMOS technologies. To a first approximation, the cost of a large CMOS chip goes up with the square of the area. This is because the probability of a defect occurring on the chip, and thus killing it, goes up with the chip area, while the cost of making the chip in the first place also goes up with the area. Thus it is worth considering partitioning a large chip into a set of smaller ones if the cost of integration and the additional test is less than the savings accrued from increased CMOS yield. Xilinx investigated this concept for large FPGAs and is now selling FPGA modules containing two to four CMOS FPGA chips tightly integrated on an interposer. Details are not available, but they claim an overall cost savings [20].

The second example is that of mixing technology nodes. In general, Moore’s law tells us that a digital logic gate costs less to make in a more advanced technology due to the reduced area for that gate in that node. In contrast, many analog and analog-like functions, such as ADCs and high-speed serializer–deserializer (SerDes) IOs, do not benefit in such a fashion. The reason is that the analog behavior of a transistor has higher variation for smaller transistors than for larger ones. Thus, for many analog functions that rely on well-matched behaviors of different transistors in the circuit, no benefit is accrued from building smaller transistors. More simply put, analog circuit blocks do not shrink in dimensions with the use of more advanced technologies. Thus the cost of these functions in a more advanced process node can actually be higher than in the old node, since the old node costs less to make per unit of area. While Wu [20] also explores this concept generically, Erdmann et al. [21] have explored it concretely for a mixed ADC/FPGA design. Their design consisted of two 28 nm FPGA logic dies, integrated with two 65 nm ADC array dies on an interposer. Thus two sets of cost benefits are accrued: first the yield-related savings from splitting the logic die in two and second the fabrication cost savings of keeping the ADCs in an older technology.

The third example is that of mixing III–V and silicon technologies. This is best exemplified by the DARPA diverse accessible heterogeneous integration (DAHI) program in which GaN and InP chips are integrated on top of CMOS chips through micro-bumps and other technologies [22, 23]. More specifically, CMOS can be used for most of the transistors in a circuit, while GaN high electron mobility transistors (HEMTs) can be used for their high power capability and InP HBTs can be used for their very high speed. An example of the latter is an ADC. In an ADC, only a few transistors generally determine the sampling rate. Thus with the DAHI technology, these few transistors can be built in a high-speed but expensive and low-yielding InP chiplet, while the rest of the ADC is built in cheaper and more robust CMOS. Northrop Grumman is the main fab in the DAHI program [23]. They use gold micro-bumps and through-silicon carbide vias to integrate GaN HEMTs on top of CMOS in a face-to-back process and gold micro-bumps to integrate InP HBT chiplets to CMOS in an F2F process (Figure 1.12). A face-to-back process is used for the GaN parts in order to allow for some heat spreading in the GaN part, as the main conduction path is through the CMOS chip.
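Returning to the first example in this section, the die-splitting argument can be sketched numerically. The code below takes the first-order model stated in the text, cost proportional to the square of the die area, and compares one large die against several equal smaller dies plus an integration cost. The constant `k`, the die area, the integration cost parameter, and the function names are all assumptions for illustration; real cost models depend on defect density, test cost, and interposer cost that are not given here.

```python
def monolithic_cost(area_mm2: float, k: float = 1.0) -> float:
    """First-order cost of one die: proportional to area squared."""
    return k * area_mm2 ** 2


def split_cost(area_mm2: float, n_dies: int,
               integration_cost: float = 0.0, k: float = 1.0) -> float:
    """Cost of n_dies equal dies with the same total area, plus integration."""
    return n_dies * monolithic_cost(area_mm2 / n_dies, k) + integration_cost


def break_even_integration_cost(area_mm2: float, n_dies: int,
                                k: float = 1.0) -> float:
    """Largest integration (and extra test) cost for which splitting still pays."""
    return monolithic_cost(area_mm2, k) - split_cost(area_mm2, n_dies, 0.0, k)


if __name__ == "__main__":
    area = 600.0  # assumed total logic area in mm^2, for illustration only
    for n in (2, 4):
        ratio = split_cost(area, n) / monolithic_cost(area)
        print(f"{n} dies: silicon cost is {ratio:.2f}x the monolithic cost")
```

Under this model, splitting into n equal dies divides the silicon cost by n, so the split is worthwhile whenever interposer integration and additional test cost less than the remaining (1 − 1/n) share of the monolithic cost.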

Figure 1.11 Layouts of the two chips in the heterogeneous processor stack.

Figure 1.12 DAHI Northrop Grumman heterogeneous integration process. Labeled elements: thinned GaN with through via, thinned InP transistor, unthinned CMOS, micro-bumps.


1.8 Conclusions

3D-IC and 2.5D (interposer) technologies have demonstrated their utility in enabling cost scaling and scaling in performance and power consumption beyond that provided by Moore’s law alone. They permit miniaturization of chip assemblies and have had widespread employment in CMOS image sensors for mobile products. Their next impact will be in the form of DRAM stacks, enabling high bandwidth and low power memory integration. By exploiting the high F2F connection density of copper direct bonding, logic-on-logic stacks can be designed that improve performance/power by 25% or more. Careful floor planning and placement can be used to solve the thermal challenges that arise. Finally, heterogeneous integration, that is, mixing different technologies on an interposer or 3D stack, leads to optimized performance at optimized cost.

References

1 Enquist, P., Fountain, G., Petteway, C. et al. (2009). Low cost of ownership scalable copper direct bond interconnect 3D IC technology for three dimensional integrated circuit applications. In: IEEE International Conference on 3D System Integration, 2009. 3DIC 2009, 1–6. San Francisco, CA.
2 Burns, J.A., Aull, B.F., Chen, C.K. et al. (2006). A wafer-scale 3-D circuit integration technology. IEEE Transactions on Electron Devices 52 (10): 2507–2516.
3 Enquist, P. (2014). 3D integration applications for low temperature direct bond technology. In: 2014 4th IEEE International Workshop on Low Temperature Bonding for 3D Integration (LTB-3D), 8. Tokyo.
4 Deptuch, G.W., Carini, G., Enquist, P. et al. (2016). Fully 3-D integrated pixel detectors for X-rays. IEEE Transactions on Electron Devices 63 (1): 205–214.
5 Chen, G., Fojtik, M., Kim, D. et al. (2010). Millimeter-scale nearly perpetual sensor system with stacked battery and solar cells. In: 2010 IEEE International Solid-State Circuits Conference (ISSCC), 288–289. San Francisco, CA.
6 Lentiro, A. (2013). Low-density, ultralow-power and smart radio frequency telemetry sensor. PhD dissertation. NCSU.
7 Lee, D.U., Kim, K.W., Kim, K.W. et al. (2015). A 1.2 V 8 Gb 8-channel 128 GB/s high-bandwidth memory (HBM) stacked DRAM with effective I/O test circuits. IEEE Journal of Solid-State Circuits 50 (1): 191–203.
8 Dutoit, D., Bernard, C., Cheramy, S. et al. (2013). A 0.9 pJ/bit, 12.8 GByte/s WideIO memory interface in a 3D-IC NoC-based MPSoC. In: 2013 Symposium on VLSI Circuits, C22–C23. Kyoto.
9 RTI 3D ASIP (2013). Evolving 2.5D and 3D integration. www.tezzaron.com (accessed 24 August 2018).
10 Borkar, S. (2010). The exascale challenge. In: 2010 International Symposium on VLSI Design Automation and Test (VLSI-DAT), 2–3.
11 Esmaeilzadeh, H., Blem, E., Amant, R.S. et al. (2011). Dark silicon and the end of multicore scaling. In: 2011 38th Annual International Symposium on Computer Architecture (ISCA), 365–376.
12 Franzon, P.D., Rotenberg, E., Tuck, J. et al. (2014). 3D-enabled customizable embedded computer (3DECC). In: Proceedings of 3DIC 2014.
13 Vogelsang, T. (2010). Understanding the energy consumption of dynamic random access memories. In: Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture, 363–374. Washington, DC.
14 Karim, A., Franzon, P.D., and Kumar, A. (2013). Power comparison of 2D, 3D, and 2.5D interconnect solutions and power optimization of interposer interconnect. Proceedings of IEEE ECTC 2013.
15 Davis, W., Oh, E., Sule, A. et al. (2009). Application exploration for 3-D integrated circuits: TCAM, FIFO and FFT case studies. IEEE Transactions on VLSI 17 (4): 496–506.
16 Thorolfsson, T., Lipa, S., and Franzon, P.D. (2012). A 10.35 mW/GFLOP stacked SAR DSP unit using fine-grain partitioned 3D integration. Proceedings of CICC 2012.
17 Rotenberg, E., Dwiel, B.H., Forbes, E. et al. (2013). Rationale for a 3D heterogeneous multi-core processor. In: 2013 IEEE 31st International Conference on Computer Design (ICCD), 154, 168.
18 Zhang, Z., Noia, B., Chakraparthy, K., and Franzon, P.D. (2013). Face to face bus design with built-in self-test in 3DICs. Proceedings of IEEE 3DIC.
19 Saeidi, M., Samadi, K., Mittal, A., and Mittal, R. (2014). Thermal implications of mobile 3D-ICs. In: 2014 International 3D Systems Integration Conference (3DIC), 1–7. Kinsale.
20 Wu, X. (2015). 3D-IC technologies and 3D FPGA. In: 3D Systems Integration Conference (3DIC), 2015 International, KN1.1–KN1.4. Sendai.
21 Erdmann, C., Lowney, D., Lynam, A. et al. (2015). A heterogeneous 3D-IC consisting of two 28 nm FPGA Die and 32 reconfigurable high-performance data converters. IEEE Journal of Solid-State Circuits 50 (1): 258–269.
22 Raman, S., Dohrman, C.L., and Chang, T.H. (2012). The DARPA diverse accessible heterogeneous integration (DAHI) program: convergence of compound semiconductor devices and silicon-enabled architectures. In: 2012 IEEE International Symposium on Radio-Frequency Integration Technology (RFIT), 1–6. Singapore.
23 Green, D.S., Dohrman, C.L., Demmin, J., and Chang, T. (2015). Path to 3D heterogeneous integration. In: 2015 International 3D Systems Integration Conference (3DIC), FS7.1–FS7.3. Sendai.
24 Gutierrez-Aitken, A., Scott, D., Sato, K. et al. (2014). Diverse accessible heterogeneous integration (DAHI) at Northrop Grumman aerospace systems (NGAS). In: 2014 IEEE Compound Semiconductor Integrated Circuit Symposium (CSICS), 1–4. La Jolla, CA.


2 Ultrafine Pitch 3D Stacked Integrated Circuits: Technology, Design Enablement, and Application

Dragomir Milojevic, Prashant Agrawal, Praveen Raghavan, Geert Van der Plas, Francky Catthoor, Liesbet Van der Perre, Dimitrios Velenis, Ravi Varadarajan, and Eric Beyne

IMEC, Kapeldreef 75, 3001 Leuven, Belgium

2.1 Introduction

Today, through-silicon via (TSV) technology is a mature process option for manufacturing 3D stacked integrated circuits (3D-SICs). When designing a 3D-SIC system, the total area required for the TSVs relative to the total die area and the number of inter-die connections are among the most important parameters to check; both typically depend on the physical dimensions of the TSV and on the technology node used. Currently, TSV-based 3D-SIC implementations are limited to designs with relatively low inter-die connection counts and/or designs with a large die area that compensates for the TSV area overhead. However, for designs with a smaller die area implemented in advanced technology nodes (14 nm and below), such as mobile MPSoCs, a TSV-based 3D-SIC will either incur a significantly larger area overhead or heavily restrict the granularity at which partitioning can be carried out, unless TSVs can be aggressively scaled down.

Although TSV dimensions are continuously being scaled down [1], there are many reasons to believe that further scaling will become more and more difficult. One reason is the limited alignment accuracy of the bonding process. In addition, extremely scaled TSVs will have larger aspect ratios (length/diameter), which heavily influence manufacturing yield and the variability of electrical characteristics. The aspect ratio can be reduced by scaling down the wafer thickness, but wafer thinning leads to challenges in wafer handling prior to bonding and to thermomechanical issues during manufacturing and operation [2, 3]. Thus, TSVs may hit a scaling wall in near-future technologies.
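To make the area argument concrete, the following minimal Python sketch estimates the fraction of die area consumed by TSVs and the TSV aspect ratio. It assumes that each TSV reserves a square whose side equals the TSV pitch (a simplification standing in for the via plus its keep-out zone). The function names and all numerical values are illustrative assumptions, not data from this chapter.

```python
def tsv_area_overhead(num_tsvs, tsv_pitch_um, die_area_mm2):
    """Fraction of the die area reserved by TSVs.

    Assumption: every TSV blocks a pitch x pitch square of silicon.
    """
    tsv_area_mm2 = num_tsvs * (tsv_pitch_um * 1e-3) ** 2
    return tsv_area_mm2 / die_area_mm2


def tsv_aspect_ratio(tsv_length_um, tsv_diameter_um):
    """Aspect ratio as defined in the text: length / diameter."""
    return tsv_length_um / tsv_diameter_um


# Hypothetical example: 25 mm^2 die, 10 um TSV pitch.
for count in (10_000, 100_000):
    overhead = tsv_area_overhead(count, tsv_pitch_um=10.0, die_area_mm2=25.0)
    print(f"{count} TSVs -> {overhead:.1%} of die area")

# Hypothetical example: 50 um thinned wafer, 5 um via diameter.
print("aspect ratio:", tsv_aspect_ratio(50.0, 5.0))
```

Under these assumed numbers the overhead grows linearly with the inter-die connection count, which illustrates why fine-grained partitioning of a small die is difficult with TSVs unless the pitch shrinks.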
Copper-to-copper (Cu–Cu) bonding is an attractive alternative to TSV process technology. It offers ultrafine pitch and higher interconnection density at significantly lower area overhead: the only overhead comes from the I/O TSVs that must be inserted to connect to the external world, and their number is typically small compared with the potential number of inter-die nets. Cu–Cu bonding also has lower resistance/capacitance, lower cost, higher mechanical reliability, and fewer electromigration issues than TSVs [4]. It also has manufacturing cost advantages over the TSV-based approach [5], since it does not require extra processing steps such as RDL formation. Mechanical reliability is also improved because no intermetallic compound (IMC) layer is formed [4].

While most of the work on Cu–Cu bonding focuses only on device-level and manufacturing-related aspects [4, 6], no work has been done to study its implications on system-level design and vice versa. The limited available work on system-level exploration typically assumes TSV-based face-to-back (F2B) stacking [7, 8]. Thus, the current state of the art lacks both an analysis of how application constraints and system-level design choices affect the scope and limitations of Cu–Cu bonding-based 3D interconnection technology and an analysis of the impact of this technology on system-level design. This is especially true for 3D-SIC implementations of mobile MPSoC designs based on fine-grained 3D partitioning.

In this chapter, we show the results of a system-level/process technology co-design analysis. We present a design flow for exploring 2D and 3D-SIC implementations and use it to analyze the implications of 2D and 3D integration technologies on system-level design and vice versa. We consider multiple variations of a complex mobile MPSoC platform instantiated for real-life streaming wireless applications. These designs vary in system-level architecture parameters such as on-chip communication architecture and memory hierarchy. For each design, both 2D and 3D-SIC implementations are carried out. For the 3D-SIC implementations we consider Cu–Cu bonding-based face-to-face (F2F) stacking, because we carry out fine-grained memory-on-logic 3D partitioning, which results in two-layer 3D-SIC implementations with a very large number of inter-die connections.

Using the design flow and the proposed case studies, we show that, compared with 2D, 3D-SIC implementations result in more than 30% reduction in total wirelength and in energy dissipation in the interconnects, more than 60% reduction in the delay of the longest wire, and about 50% reduction in wiring area overhead (calculated in terms of repeater area). 3D-SIC presents gains over 2D for both design variations. However, the extent of the impact of 3D-SIC is highly dependent on the system-level architecture. For example, the performance improvement of the 3D-SIC implementation is higher for the design with private memories and a shared bus communication structure. We also show that different combinations of system-level architecture choices and interconnect technology choices have different impacts on system design parameters. Thus, it is nontrivial to make system architecture and interconnect technology choices without taking into account physical integration and technology aspects, along with application requirements.

This chapter is structured as follows: Section 2.2 gives an overview of 3D integration technology. Section 2.3 discusses the proposed design flow, and Section 2.4 presents the case study. The chapter concludes with Section 2.5.

Handbook of 3D Integration: Design, Test, and Thermal Management, First Edition. Edited by Paul D. Franzon, Erik Jan Marinissen, and Muhannad S. Bakir. © 2019 Wiley-VCH Verlag GmbH & Co. KGaA. Published 2019 by Wiley-VCH Verlag GmbH & Co. KGaA.

2.2 Overview of 3D Integration Technologies

It is often not straightforward to classify 3D-SIC technology, because of the lack of standardized definitions on the one hand and the significant overlap between technology options on the other. This section gives a brief overview of the 3D-SIC technology relevant to this work. We use three factors to broadly classify the technology options: (i) integration granularity, (ii) stacking orientation, and (iii) TSV formation.

2.2.1 Integration Granularity

There are two basic approaches to vertical integration [9]: (i) monolithic approaches and (ii) assembly approaches.

In monolithic approaches, multiple layers are formed sequentially, starting from the bottommost layer. Each layer is isolated from the others, and devices are processed in each layer using mainstream process technology. Monolithic integration can be carried out at the gate and transistor level [10]. The dimensions and parasitics of monolithic interlayer vias (MIVs) are of the order of those of a local via, so vertical interconnects can be realized in larger numbers and at higher density than with other 3D integration technologies. This also allows ultrafine-grained vertical integration of devices and interconnects [10]. However, monolithic integration requires a tightly controlled thermal budget, high-precision process alignment, and monolithic 3D-aware physical design tools, which are not yet mature. Thus, monolithic integration technology needs to mature further before it can become mainstream.

In assembly approaches, partially or fully processed integrated circuits are integrated vertically. Assembly-based integration can be carried out at three different granularities: (i) die-to-die, (ii) die-to-wafer, and (iii) wafer-to-wafer. Each assembly method has its advantages, and the choice depends on what the application needs in order to realize the cost benefits of 3D integration.

2.2.2 Stacking Orientation

Three stacking possibilities exist based on the orientation of the face and the back of the dies: face-to-face (F2F), face-to-back (F2B), and back-to-back (B2B), as shown in Figure 2.1.

In F2F 3D-SIC, the faces of both dies are oriented toward each other and interconnected directly. I/O TSVs are used to connect the active layer of one of the dies to the package for external communication. However, F2F is limited to stacking two dies only.

In F2B, the face of one die is oriented toward the back of another die. The active layers of the dies are connected using signal TSVs. The active layer of one of the dies is exposed and is used for external communication. F2B can be used to stack more than two dies.

In B2B 3D-SIC, the backs of both dies are oriented toward each other. Signal TSVs are used in both dies to interconnect their active layers. External communication can be provided in a manner similar to F2B. B2B has two major drawbacks: (i) it does not scale well to stacks of more than two dies, and (ii) it requires signal TSVs in both dies, which adds to the manufacturing cost.

Figure 2.1 Cross-sectional view of the three different stacking possibilities based on orientation of face and back of the dies. (a) Face to back (F2B), (b) face to face (F2F), (c) back to back (B2B). (Labels in the figure: DIE 1, DIE 2, FEOL, BEOL, micro-bump, RDL, Cu pads, TSV, I/O TSV, C4 bump.)

2.2.3 TSV Formation

TSVs can be fabricated at different stages of die processing, referred to as (i) via first, (ii) via middle, (iii) via last, and (iv) after bonding. Via-first TSVs are formed before front end of line (FEOL) and back end of line (BEOL) processing, whereas via-middle TSVs are fabricated after FEOL but before BEOL processing. Via-last TSVs, as the name indicates, are formed after both FEOL and BEOL processing, whereas after-bonding TSVs are formed after the dies have been bonded. Via-first and via-middle TSVs result in lower area overhead and higher density than via-last and after-bonding TSVs.

A detailed discussion of 3D-SIC process technology is beyond the scope of this chapter. In this work we consider assembly-based approaches to 3D integration; the work is, however, orthogonal to the choice of die-to-die (D2D), die-to-wafer (D2W), or wafer-to-wafer (W2W) integration. We focus on the F2F stacking orientation, assuming Cu–Cu pad-based bonding. For TSV formation we consider via middle, and all TSV-related parameters are based on via-middle TSVs.
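The three classification factors of this section can be captured in a small data structure. The Python sketch below is only an illustrative encoding: the class, member, and method names are hypothetical, and W2W is picked arbitrarily for the granularity field because this work is orthogonal to that choice.

```python
from dataclasses import dataclass
from enum import Enum


class Granularity(Enum):
    MONOLITHIC = "monolithic (sequential layers)"
    D2D = "die-to-die assembly"
    D2W = "die-to-wafer assembly"
    W2W = "wafer-to-wafer assembly"


class Orientation(Enum):
    F2F = "face-to-face"
    F2B = "face-to-back"
    B2B = "back-to-back"


class TsvFormation(Enum):
    VIA_FIRST = "before FEOL and BEOL"
    VIA_MIDDLE = "after FEOL, before BEOL"
    VIA_LAST = "after FEOL and BEOL"
    AFTER_BONDING = "after the dies are bonded"


@dataclass(frozen=True)
class StackOption:
    granularity: Granularity
    orientation: Orientation
    tsv_formation: TsvFormation

    def suits_more_than_two_dies(self) -> bool:
        # Per the text: F2F is limited to two dies, B2B scales poorly,
        # and F2B can stack more than two dies.
        return self.orientation is Orientation.F2B


# The configuration assumed in this chapter (granularity chosen arbitrarily).
chapter_choice = StackOption(Granularity.W2W, Orientation.F2F, TsvFormation.VIA_MIDDLE)
print(chapter_choice.orientation.value, "|", chapter_choice.tsv_formation.value)
```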

2.3 Design Enablement of Ultrafine Pitch 3D Integrated Circuits

2.3.1 Design Flow Overview

In this section, we describe a design flow used for the exploration of 3D partitioning. Figure 2.2 gives an overview of the proposed flow, which is used to analyze trade-offs across system-level design choices, physical design options, and technology options. The flow distinguishes four phases: (i) application mapping and system architecture exploration, (ii) MPSoC specification to register transfer level (RTL) implementation, (iii) physical synthesis and placement and routing, and (iv) design characterization.

The first phase of the design flow is application mapping and platform architecture exploration (based on our previous work [11]). For a given input application, architecture template, and set of system specifications and constraints, we generate a set of Pareto-optimal design points. Each design point specifies a heterogeneous MPSoC platform and its configuration, together with the corresponding partitioning of the input application and the assignment of the application components to the platform resources.

Figure 2.2 Overview of the 2D and 3D design flows using SpyGlass® Physical. (Flow shown in the figure: a multimode streaming application, an architecture template, and specs/constraints feed application partitioning and platform architecture exploration in Phase 1; for each design point of the pruned set, spec-to-RTL implementation follows in Phase 2; Phase 3 comprises RTL synthesis, then 2D clustering or 3D clustering and partitioning, floor planning, and placement and routing, repeated for each stacking type and each die; Phase 4 is design characterization with area, energy, and delay estimation.)
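The pruning of design points in Phase 1 can be illustrated with a generic Pareto filter. The Python sketch below is not the exploration method of [11]; it only shows what "Pareto-optimal" means for a set of candidates. The three metrics (area, energy, delay, all to be minimized), the `DesignPoint` type, and the sample values are assumptions made for illustration.

```python
from typing import List, NamedTuple


class DesignPoint(NamedTuple):
    name: str
    area: float    # all three metrics are minimized
    energy: float
    delay: float


def dominates(a: DesignPoint, b: DesignPoint) -> bool:
    """True if a is no worse than b in every metric and strictly better in one."""
    no_worse = a.area <= b.area and a.energy <= b.energy and a.delay <= b.delay
    better = a.area < b.area or a.energy < b.energy or a.delay < b.delay
    return no_worse and better


def pareto_front(points: List[DesignPoint]) -> List[DesignPoint]:
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical candidates in arbitrary units.
candidates = [
    DesignPoint("A", area=1.0, energy=5.0, delay=3.0),
    DesignPoint("B", area=2.0, energy=4.0, delay=3.0),
    DesignPoint("C", area=2.0, energy=6.0, delay=4.0),  # dominated by A
    DesignPoint("D", area=3.0, energy=3.0, delay=2.0),
]
print([p.name for p in pareto_front(candidates)])
```

Each surviving point represents a different trade-off, which is why the later phases are run for every design point rather than for a single winner.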

The second phase of the design flow is the implementation of the set of MPSoC design points obtained in the first phase. We have used the IP Designer tool suite [12] from Synopsys to implement these designs. IP Designer is an application-specific instruction set processor (ASIP) design tool suite in which processor architectures can be quickly modeled and evaluated. It generates retargetable tools for compiling and debugging the processor model and the application code, as well as assemblers, linkers, simulators, SystemC wrappers, and synthesizable RTL. Each design point is a set of multiple heterogeneous processors, each of which is implemented in the nML language [13]. nML is a proprietary, IP Designer-specific high-level definition language used to describe the architecture and the instruction set of a processor. The tasks mapped onto each processor are implemented in C. RTL is generated for each processor using the IP Designer tool suite. The platform-level RTL, which integrates all the platform components according to the configuration of the MPSoC instance, is written manually. Although we use the IP Designer tool suite, the proposed framework is generic enough to be adapted to other tool suites. High-level synthesis tools that take the set of MPSoC designs obtained in Phase 1 as input and generate the RTL for each design can also be used in this phase.

For the third and fourth phases, we have developed a dedicated design flow that supports different flavors of 3D integration, namely F2F and F2B 3D stacking, using any combination of the 3D structures mentioned previously (TSVs, micro-bumps, Cu pads) and without any limitation on their physical dimensions. The flow is composed of a backbone electronic design automation (EDA) tool and a set of external stand-alone add-ons. The backbone tool performs the typical IC implementation tasks (synthesis and place and route; Phase 3), while the add-ons focus on design characterization (Phase 4). The backbone EDA tool is based on an existing industrial solution: SpyGlass® Physical (SGP) from Atrenta. The commercial version of this tool, used for 2D design exploration and prototyping, has been extended with dedicated functionality to handle 3D-IC design (see [14]):

• Gate-level netlist partitioning capability (into an arbitrary number of partitions).
• Appropriate 3D interconnect physical modeling depending on the 3D integration scheme adopted (F2F, F2B, and the choice and geometrical/electrical properties of the 3D interface).
• 3D floor planning and 3D place and route (including backside RDL routing).

To cover the many different aspects of 3D design exploration, we have developed independent modules that target various functionalities. These modules enable holistic design exploration of realistic multimillion-gate designs. They can be used either as stand-alone tools or directly interfaced with other EDA tools. Currently the following add-ons have been enabled: 3D interconnect delay and power models, a repeater insertion model, a cost model, a compact thermal model, and automated partitioning for block-to-die assignment. Only the first three are used in this work.

2.3.2 3D Integration Backbone Tool

The steps of the 2D and 3D design flows using SGP are depicted in Figure 2.2. The following information is supplied as input to the flow:

• RTL of the design (synthesizable Verilog or VHDL).
• Abstracted high-level models for yet undefined modules (interface definition, area, power, timing, etc.).
• Timing constraints (using the Synopsys .sdc file format).
• Technology libraries (using the OpenAccess database format, compiled from foundry .LIB/.LEF).
• 3D-IC stack configuration information (user specified using a dedicated XML scheme).
• Physical constraints (as user scripts).

The design flow steps are detailed below.

Step 1: RTL synthesis – The gate-level netlist is obtained by RTL synthesis using a standard cell technology library. The netlist is analyzed for area, timing, logical congestion, etc. Once the synthesis flow is stable, i.e., the synthesis process produces a netlist that meets the timing constraints provided, the generated netlist can be used as input to both the 2D and 3D physical design flows.

Step 2: Clustering – Prior to floor planning, the gate-level netlist is partitioned into a number of reasonably sized physical clusters of standard cells. Clustering allows the design to be floor planned with a reduced number of placeable instances. What counts as a reasonable cluster size depends on the total circuit size. In general it is chosen so that the total number of clusters is in the range of a few dozen to a few hundred physical (thus placeable) entities, which gives the floor plan engine the best trade-off between tool run time and quality of the solution. Standard cell clustering is a very important step, since the quality of the physical design depends on the clustering scheme adopted. It can be performed using many different methods, for example a top-down approach (from the top level down to the standard cell level) or a bottom-up approach (from the standard cell level up). Different objectives can be pursued during clustering: keeping and following the logical hierarchy, creating clusters of similar size, hierarchical min-cut across the clusters, etc. Note that in the case of 3D integration, the clustered netlist is further partitioned into a number of gate-level entities (which remain clustered) equal to the number of dies in the system (see Step 3). During 3D partitioning, clusters are assigned to tiers but are not partitioned themselves.

Step 3: 3D partitioning – This step is performed only in the 3D flow. The stack structure, the number of dies, the technology node on a per-die basis (to support heterogeneous integration), the stacking orientation (face-up or face-down), the 3D structure properties (TSV/micro-bump/Cu pad), the RDL net properties (width/pitch), etc. are specified in a manually generated XML file given as input to the tool. The actual 3D partitioning of the gate-level netlist is carried out in an automated fashion, using the stack configuration file, the synthesized gate-level netlist, and, in the case of manual partitioning, user-specified partitioning directives. These take the form of explicit block-to-die assignment directives issued through a dedicated tool command. In the case of automated partitioning, the external add-on creates a partition optimized for the chosen objective. Once the blocks have been assigned to tiers, the tool automatically extracts all inter-die (3D) nets. The logical view of each 3D net in the design is automatically replaced with the appropriate physical view of the net, depending on the 3D integration scheme used. This is shown in Figure 2.3, where the same logical net connecting two different components has completely different physical views depending on the stacking scheme chosen. Different 3D integration schemes, such as F2F and F2B in a 2.5D or 3D configuration with any of the 3D structures, can be explored in exactly the same design environment by changing only the stack structure in the XML file. This minimizes the design effort needed to compare the impact of different 2D, 2.5D, or 3D integration techniques, or process technology options, on the same design.

Figure 2.3 Same logical net in the face-to-face and face-to-back integrations, resulting in different physical nets. (Panels (a) and (b) show block U_A on Tier0/Die0 and block U_B on Tier1/Die1; one panel includes a TSV.)

Step 4: Floor planning – The clustering and 3D partitioning steps are followed by floor planning. In the case of 2D, floor planning is carried out automatically, with physical constraints that are generated manually from connectivity analysis and from whatever knowledge or constraints are available for the design (e.g. hard-macro pre-placements). In the case of 3D, additional physical constraints related to 3D net placement are generated (explicit TSV/micro-bump clustering and placement). Floor planning is then carried out for each die.

Step 5: Standard cell placement and routing (PNR) – Standard cell placement and routing are carried out after floor planning. In the case of 3D, this is performed separately for each die, in sequential fashion.
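The 3D net extraction of Step 3 can be sketched in a few lines of Python. This is not the SGP implementation: the XML layout, the element and attribute names, and the `PHYSICAL_VIEW` table are hypothetical stand-ins for the tool's dedicated stack configuration scheme, which is not reproduced in this chapter. The sketch only shows how inter-die nets follow from a block-to-die assignment and how the chosen scheme selects the physical view.

```python
import xml.etree.ElementTree as ET

# Hypothetical stack description (not the tool's real XML scheme).
STACK_XML = """
<stack scheme="F2F">
  <die name="Die0" tier="0"/>
  <die name="Die1" tier="1"/>
</stack>
"""

# Assumed 3D structures crossed by an inter-die net in each scheme.
PHYSICAL_VIEW = {
    "F2F": ("Cu-Cu pad",),
    "F2B": ("TSV", "micro-bump"),
}


def parse_stack(xml_text):
    """Return (scheme, {die name: tier index}) from the hypothetical XML."""
    root = ET.fromstring(xml_text.strip())
    tiers = {die.get("name"): int(die.get("tier")) for die in root.findall("die")}
    return root.get("scheme"), tiers


def extract_3d_nets(nets, block_to_die):
    """Return {net: sorted die names} for nets whose blocks sit on several dies."""
    result = {}
    for net, blocks in nets.items():
        dies = sorted({block_to_die[block] for block in blocks})
        if len(dies) > 1:
            result[net] = dies
    return result


scheme, tiers = parse_stack(STACK_XML)
# Manual partitioning: explicit block-to-die assignment (illustrative blocks).
block_to_die = {"U_A": "Die0", "U_B": "Die1", "U_C": "Die0"}
nets = {"n1": ["U_A", "U_B"], "n2": ["U_A", "U_C"]}
nets_3d = extract_3d_nets(nets, block_to_die)
for net, dies in nets_3d.items():
    print(net, "spans", dies, "through", PHYSICAL_VIEW[scheme])
```

Switching the `scheme` attribute from `F2F` to `F2B` changes only the physical view attached to net `n1`, mirroring how the same environment can compare integration schemes by editing the stack file alone.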

2.3.3 3D Design Flow Add-Ons

The 3D integration flow described above serves as a backbone to which we attach dedicated stand-alone modules, developed in various software forms (XLS, Python, MATLAB, etc.) to allow the fastest possible implementation. The necessary interfaces have been developed to enable a smooth transition from SGP to the modules themselves. The modules can be classified into the following categories:

• 3D interconnect timing and power.
• Repeater insertion.
• Cost modeling.

2.3.3.1 Interconnect Delay and Power Models

After 3D placement and routing, detailed wirelength reports (length and fan-out) are extracted for all 2D nets (on a per-die basis) and 3D nets. The output files are parsed, and the relevant information is extracted to derive interconnect delay and power (energy) using the models explained below.

The wire delay can be approximated by the RC product, where R is the resistance and C is the capacitance of the wire, both calculated using a lumped-element model. We apply Elmore’s delay approximation [15] to the lumped-element model of a wire to calculate the total wire delay ($\tau_{wire}$):

\[ \tau_{wire} = \sum_{i=1}^{N} \left( \sum_{j=1}^{i} R_j \right) C_i \tag{2.1} \]

where $R_j$ and $C_i$ are the equivalent resistance and capacitance of a wire segment and $N$ is the number of segments considered in the lumped model of the wire. For $N$ equal segments of a wire of length $l$, $R_j = R_w l / N$ and $C_i = C_w l / N$, so the total delay ($T_{wire}$) becomes

\[ T_{wire} = R_w C_w l^2 \, \frac{N+1}{2N} \approx \frac{1}{2} R_w C_w l^2 \tag{2.2} \]

where $R_w$ and $C_w$ are the resistance and capacitance per unit length of the wire. The delay of the wire thus has a quadratic dependence on its length. To reduce the delay of long wires, repeaters are inserted to split the wire into a set of shorter buffered segments. Each segment is modeled using a simple pi model, and its delay ($T_{stage}$) is calculated as follows:

\[ T_{stage} = R_T (C_T + C_g) + \frac{R_T C_w}{S} \frac{L}{N} + R_w C_g S \, \frac{L}{N} + \frac{R_w C_w}{2} \left( \frac{L}{N} \right)^2 \tag{2.3} \]

We model the repeater using its input ($C_g$) and output ($C_T$) capacitances per unit width and its equivalent resistance per unit width ($R_T$). $S$ is the width of the repeaters, and $N$ is the number of repeaters inserted in a wire of length $L$. Thus, the total delay of the repeated wire is $T_{wire,rep} = N \cdot T_{stage}$. Although inserting repeaters has a positive impact on the wire delay, it also introduces area and energy overhead. Thus, the number ($N_{opt}$) and size ($S_{opt}$) of the repeaters must be optimized such that the delay, area, or energy of the wire is minimized. We calculate $N_{opt}$ and $S_{opt}$ to minimize the total delay of the wire ($T_{opt}$):

\[ N_{opt} = \sqrt{\frac{R_w C_w}{2 R_T (C_g + C_T)}} \times L \tag{2.4} \]

\[ S_{opt} = \sqrt{\frac{R_T C_w}{R_w C_g}} \tag{2.5} \]

\[ T_{opt} = 2 \sqrt{R_w C_w R_T C_g} \left( 1 + \sqrt{0.5 (1 + \gamma)} \right) \times L \tag{2.6} \]

where $\gamma$ is the ratio of the output capacitance $C_T$ to the input capacitance $C_g$ of the repeater.

Figure 2.4 shows a physical representation of a wire connecting two dies using the 3D F2F stacking scheme. We assume that repeaters have been inserted along the wires on the two dies to minimize delay. The delay of the wire segments $L_0$ (Die 0) and $L_1$ (Die 1) is modeled using the repeated-wire model described above. The 3D net consists of Cu–Cu pads, and its delay is estimated using a lumped RC model. The total delay of a wire ($T_{F2F}$) connecting the two dies is then calculated with

\[ T_{F2F} = T_{L_0} + T_{3DF2F} + T_{L_1} \tag{2.7} \]

where $T_{L_0}$ and $T_{L_1}$ are the total delays of the wires of length $L_0$ and $L_1$ in dies 0 and 1, respectively, calculated using Eq. (2.6). $T_{3DF2F}$ is the delay of the 3D net between the two dies:

\[ T_{3DF2F} = \frac{R_{driver} C_{pad}}{2} + (R_{driver} + R_{pad}) \left( \frac{C_{pad}}{2} + C_{load} \right) \tag{2.8} \]

where $R_{driver}$ is the equivalent resistance of the gate driving the 3D net and $C_{load}$ is the input capacitance of the gate driven by the 3D net. The resistance and capacitance of the Cu–Cu pads are represented by $R_{pad}$ and $C_{pad}$.

Figure 2.4 Physical view of the 3D F2F wire. (Labels in the figure: Die 0, Die 1, source gate, sink gate, Cu–Cu pad, 3D net, wire lengths L0 and L1, repeater spacing l_opt.)

The power dissipation in the interconnects is due to the charging and discharging of the wire capacitance. The total dynamic power dissipated in a wire of length $L$ is estimated as shown in Eq. (2.9):

\[ P_{wire} = \alpha (C_w \cdot L) V^2 f \tag{2.9} \]

where $\alpha$ is the switching activity factor, $C_w$ is the wire capacitance per unit length, $V$ is the voltage swing on the interconnect, and $f$ is the clock frequency.
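The delay and power models of this section translate directly into a short script. The Python sketch below is a minimal illustration under stated assumptions, not the add-on used in this work. The function names are hypothetical, and the stage delay is written so that minimizing $N \cdot T_{stage}$ yields the optimum expressions for $N_{opt}$, $S_{opt}$, and $T_{opt}$, which the final lines check numerically. All parameter values are made-up numbers in consistent units (ohm, farad, micrometer).

```python
import math


def elmore_lumped(r_w, c_w, length, n_segments):
    """Eq. (2.1) for a wire cut into n equal RC segments."""
    r_seg = r_w * length / n_segments
    c_seg = c_w * length / n_segments
    return sum(i * r_seg * c_seg for i in range(1, n_segments + 1))


def unrepeated_delay(r_w, c_w, length):
    """Large-N limit of Eq. (2.2): quadratic in the wire length."""
    return 0.5 * r_w * c_w * length ** 2


def stage_delay(r_w, c_w, r_t, c_g, c_t, s, n, length):
    """Eq. (2.3): delay of one buffered segment of length L/N."""
    seg = length / n
    return (r_t * (c_t + c_g) + r_t * c_w * seg / s
            + r_w * c_g * s * seg + 0.5 * r_w * c_w * seg ** 2)


def optimal_repeaters(r_w, c_w, r_t, c_g, c_t, length):
    """Eqs. (2.4)-(2.6): delay-optimal count, size, and resulting delay."""
    n_opt = length * math.sqrt(r_w * c_w / (2 * r_t * (c_g + c_t)))
    s_opt = math.sqrt(r_t * c_w / (r_w * c_g))
    gamma = c_t / c_g
    t_opt = (2 * math.sqrt(r_w * c_w * r_t * c_g)
             * (1 + math.sqrt(0.5 * (1 + gamma))) * length)
    return n_opt, s_opt, t_opt


def f2f_net_delay(r_driver, r_pad, c_pad, c_load):
    """Eq. (2.8): lumped RC delay of the Cu-Cu pad connection."""
    return r_driver * c_pad / 2 + (r_driver + r_pad) * (c_pad / 2 + c_load)


def wire_power(alpha, c_w, length, v, f):
    """Eq. (2.9): dynamic power of a wire of the given length."""
    return alpha * c_w * length * v ** 2 * f


# Illustrative (made-up) parameters: ohm/um, F/um, ohm, F, um.
R_W, C_W, R_T, C_G, C_T, L = 1.0, 0.2e-15, 1.0e4, 1.0e-15, 1.0e-15, 2000.0
n_opt, s_opt, t_opt = optimal_repeaters(R_W, C_W, R_T, C_G, C_T, L)
print("unrepeated delay [s]:", unrepeated_delay(R_W, C_W, L))
print("N_opt, S_opt, T_opt :", n_opt, s_opt, t_opt)
# Consistency check: N_opt stages of Eq. (2.3) reproduce Eq. (2.6).
print("N_opt * T_stage     :",
      n_opt * stage_delay(R_W, C_W, R_T, C_G, C_T, s_opt, n_opt, L))
```

In practice $N_{opt}$ must be rounded to an integer, so the delay actually achieved is slightly larger than $T_{opt}$.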

2.3.3.2 Repeater Area Model

We have discussed in Section 2.3.3.1 that adding repeaters to a long wire reduces the overall wire delay. It also changes the dependence of the wire delay on the wirelength from quadratic to linear. However, adding repeaters to the wires creates the following design complexities:

• Repeaters are typically inverting elements, and hence an even number of repeaters is required to maintain the logic levels.
• Repeaters are typically inserted for longer wires that are routed on upper metal layers. These repeaters require many via cuts from the upper metal layers down to the substrate. This uses significant routing resources in the intervening layers, thus increasing routing congestion.
• Placement of repeaters is typically constrained to pre-planned clusters, and thus the flexibility to place them optimally is limited.

Apart from the delay and energy overhead of the repeaters, the area overhead can be significant, depending on the wirelength distribution of the design. We estimate the area overhead of the interconnects as the total area of the inserted repeaters, using the following expression:

\[ A_{wire} = A_{rep} \cdot N_{rep} = \{ S \cdot A_{inv} \} \cdot N_{rep} \tag{2.10} \]

where $A_{wire}$ is the total area of all the repeaters ($N_{rep}$) inserted for all the interconnects, $A_{rep}$ is the area of a single repeater of width $S$, and $A_{inv}$ is the area of a unit inverter. Typically $S = S_{opt}$ and $N_{rep} = N_{opt}$ when repeaters have been inserted to minimize the total wire delay.
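A minimal Python sketch of Eq. (2.10) summed over a set of wires is given below. The rounding policy is an assumption introduced here for illustration: wires whose ideal repeater count is below one receive no repeaters, and all other counts are rounded up to the next even integer to respect the logic-level constraint listed above. The function names and sample values are hypothetical.

```python
import math


def even_repeater_count(n_opt):
    """Round an ideal repeater count to a realizable even number.

    Assumed policy: no repeaters when n_opt < 1, otherwise round up to
    the next even integer so that the logic level is preserved.
    """
    if n_opt < 1:
        return 0
    n = math.ceil(n_opt)
    return n if n % 2 == 0 else n + 1


def repeater_area(n_opt_per_wire, s_opt, a_inv):
    """Eq. (2.10) accumulated over all wires: sum of (S * A_inv) * N_rep."""
    a_rep = s_opt * a_inv
    return sum(a_rep * even_repeater_count(n) for n in n_opt_per_wire)


# Hypothetical ideal counts for three wires, from short to long.
ideal_counts = [0.25, 2.0, 3.0]
print([even_repeater_count(n) for n in ideal_counts])
print("total repeater area:", repeater_area(ideal_counts, s_opt=10.0, a_inv=0.5))
```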

2.3.3.3 Cost Model

While evaluating 3D-SIC implementations for a design, it is also important to understand the cost implications, as 3D integration adds cost overheads. We extend the following 2D wafer processing cost model for estimating the cost of 3D-SIC processing¹:

$$C_{wafer} = \frac{C_{FEOL} + C_{BEOL} + C_{test}}{Y_{wafer}} \qquad (2.11)$$

where Cwafer is the total cost for processing a wafer, CFEOL is the FEOL processing cost, CBEOL is the BEOL processing cost, Ctest is the wafer testing cost, and Ywafer is the yield of the wafer. The total 3D stacked processing cost (CW2W) of N-layer wafers using W2W can be calculated with the following expression:

$$C_{W2W} = \frac{\sum_{i=1}^{N} C_{FEOL_i} + \sum_{i=1}^{N} C_{BEOL_i} + C_{3D}}{\prod_{i=1}^{N} Y_{wafer_i} \cdot \prod_{i=1}^{N-1} Y_{3D_i}} \qquad (2.12)$$

where CFEOLi and CBEOLi are the FEOL and BEOL processing costs, respectively, for the ith wafer layer. C3D includes the cost of wafer testing and the 3D integration process (bonding, thinning, and TSV processing) of all the N wafers. Y3Di is the 3D stacking yield of dies i and (i + 1). In the case of D2W integration, the total processing cost must also include the cost to test each die to separate out all the good dies [17] (known-good-die (KGD) test). The total processing cost for D2W stacking (CD2W) can then be calculated as

$$C_{D2W} = \frac{\sum_{i=1}^{N} \left( \frac{C_{FEOL_i} + C_{BEOL_i} + C_{KGD}}{Y_{wafer_i}} \right) + C_{3D}}{\prod_{i=1}^{N-1} Y_{3D_i}} \qquad (2.13)$$

where CKGD is the cost of KGD test.

¹ Cost models shown here are derived from IMEC’s internal cost models and those presented in [16].
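As a minimal sketch, Eqs. (2.11)–(2.13) can be evaluated directly in a few lines of Python. The function names and all numeric values below are hypothetical placeholders chosen for illustration; they are not taken from the IMEC cost models or from [16].

```python
import math
from typing import Sequence


def wafer_cost(c_feol: float, c_beol: float, c_test: float, y_wafer: float) -> float:
    """Eq. (2.11): yield-adjusted processing cost of a single 2D wafer."""
    return (c_feol + c_beol + c_test) / y_wafer


def w2w_cost(c_feol: Sequence[float], c_beol: Sequence[float], c_3d: float,
             y_wafer: Sequence[float], y_3d: Sequence[float]) -> float:
    """Eq. (2.12): cost of an N-layer W2W stack.

    All wafer yields and all N-1 stacking yields multiply in the denominator.
    """
    n = len(c_feol)
    if len(c_beol) != n or len(y_wafer) != n or len(y_3d) != n - 1:
        raise ValueError("need N FEOL/BEOL costs, N wafer yields, N-1 stacking yields")
    return (sum(c_feol) + sum(c_beol) + c_3d) / (math.prod(y_wafer) * math.prod(y_3d))


def d2w_cost(c_feol: Sequence[float], c_beol: Sequence[float], c_kgd: float,
             c_3d: float, y_wafer: Sequence[float], y_3d: Sequence[float]) -> float:
    """Eq. (2.13): cost of an N-layer D2W stack with a KGD test per die.

    Each layer's cost is divided by its own wafer yield before summing;
    only the stacking yields appear in the outer denominator.
    """
    n = len(c_feol)
    if len(c_beol) != n or len(y_wafer) != n or len(y_3d) != n - 1:
        raise ValueError("need N FEOL/BEOL costs, N wafer yields, N-1 stacking yields")
    per_layer = sum((f + b + c_kgd) / y for f, b, y in zip(c_feol, c_beol, y_wafer))
    return (per_layer + c_3d) / math.prod(y_3d)


# Hypothetical two-layer example in arbitrary cost units.
feol = [100.0, 100.0]   # FEOL cost per layer (placeholder)
beol = [50.0, 20.0]     # BEOL cost per layer (placeholder, fewer metals on layer 2)
print("W2W:", w2w_cost(feol, beol, c_3d=30.0, y_wafer=[0.9, 0.9], y_3d=[0.95]))
print("D2W:", d2w_cost(feol, beol, c_kgd=5.0, c_3d=30.0, y_wafer=[0.9, 0.9], y_3d=[0.95]))
```

The structural difference between the two functions mirrors the equations: in the W2W case the wafer yields compound, whereas in the D2W case each layer is penalized only by its own yield at the price of the additional KGD test cost.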

2.4 Implementation of Mobile Wireless Application

2.4.1 Application Driver

In this work we target the application domain of wireless communications using standards such as WLAN, LTE, and WiMax. These standards are evolving at a fast pace [18–20], and there is a growing trend toward supporting higher data rates and consistent quality of service at very tight power, energy, and cost budgets. As discussed in Section 2.1, process scaling and advanced packaging will not be able to meet the requirements of future mobile SoCs [21]. This is also true for wireless baseband processors, and thus they are ideal candidates for 3D-SIC-based integration because they require high performance at low energy and area.

We specifically focus on baseband processing of an LTE-3GPP receiver [22] as the application driver because LTE is an upcoming standard being introduced in new products and LTE receivers are one of the biggest new design challenges in the wireless community. We also chose LTE-3GPP for exploring memory-on-logic 3D partitioning because it is a memory-intensive application [23] and the total area of the on-chip memory can be comparable with the area occupied by the logic for LTE baseband processor implementations. Depending on the mode of operation (defined by the number of transmitting and receiving antennas, modulation schemes, data-coding rate, etc.), the baseband processing of wireless standards can be computationally demanding and memory intensive. We have considered the 2 × 2 MIMO 20 MHz mode of LTE-3GPP with 256-QAM modulation in this work. Although we use LTE-3GPP as the driver application, the findings of this work can be extended to other wireless applications such as WLAN 11n, WLAN 11ac, and DVB due to similar baseband processing characteristics.

2.4.2 Architecture Template

We have considered a template-based MPSoC architecture that is designed and optimized for streaming applications (wireless in this case). Such template-based domain-specific architectures not only have the potential to bridge the existing energy efficiency–flexibility gap between ASIC (application-specific integrated circuit) and GPP (general purpose processor) or DSP (digital signal processor) [24] but also reduce the design time and cost while enabling high reusability.

At the platform level of the considered architecture template, there are multiple processors, all of which share a unified instruction and data level-2 (I+D L2) memory. The communication between the processors and L2 is based on the advanced microcontroller bus architecture (AMBA) advanced high-performance bus (AHB)-Lite bus protocol [25]. The number of processors, the size of the L2 memory, and the bus width are configurable at the platform level. At the processor level, each processor consists of scalar and vector datapaths (DP), L1 instruction cache memory (L1-I) and L1 data cache memory (L1-D), and AHB slave (AHBS) and AHB master (AHBM) interfaces for connectivity to the system bus. Figure 2.5a,b shows the processor-level and 2D platform-level templates, respectively. The instruction set of these processors is customized to wireless application domain requirements. For each processor, the memory sizes, the scalar and the vector word sizes, and the instruction set can be configured.

Figure 2.5 (a) Processor- and (b) system-level templates for MPSoC platform. (c) MPSoC partitioning for two-die 3D-SIC: logic and memory dies, and 3D nets.

Although the architecture template considered here is basic and high level, it captures the essential elements and provides the configuration parameters that are also provided in other academic templates such as SODA [26] and commercial ones such as Tensilica (http://www.tensilica.com/products/dsps/connx-basebandengine.htm), CEVA (http://www.ceva-dsp.com/CEVA-MM3000.html), and X-Gold [27]. Based on the above template, a given MPSoC instance can be integrated as a standard 2D-IC or as a 3D-SIC. For 3D-SIC, we consider F2F integration with fine-pitch Cu–Cu pads using either D2W or W2W bonding. We consider a two-layer memory-on-logic 3D partitioning, where the top die integrates all system memories (L2, L1-I, L1-D) and the lower die integrates the DP and the register files of all the cores (Figure 2.5c). We consider that the BEOL processing of the two dies uses a different number of metal layers (four metal layers for the memory die) to optimize the total system cost.

2.4.3 MPSoC Instance

In this work, we have considered a 10-core design instance obtained using the application–architecture co-exploration framework presented in our earlier work [11]. This design instance is one of the area–energy Pareto-optimal design points obtained using the exploration framework described in Section 2.3.1. The design instance is a heterogeneous MPSoC where the cores have different micro-architectures, as indicated by the different bounding box colors in Figure 2.6. The micro-architectures of the cores differ in terms of the instruction set architecture, the degree of DP vectorization, and the sizes of the memory instances (instruction memory, vector and scalar data memory). Each core has private L1 instruction (L1-I) and L1 data (L1-D) memory. L1-D consists of scalar and vector memory. The integrated L2 instruction and data memory (L2) is shared across all the cores. The size and interface width of the vector data memory vary across the cores depending on the tasks assigned to the cores. This instance resulted in more than 5k Datapath ↔ Memory interconnections.

2.4.4 Multiple Memory Organization and Bus Structure

The MPSoC platform architecture template considers that the level-1 data memory (L1-D) is private to each core. In this work, we have also explored a hybrid organization of L1-D, wherein not all the cores may have private L1-D. The cores having inter-core data transfers share L1-D, whereas the rest have private L1-D. Thus, for the MPSoC instance used in this work, we have considered two cases of L1-D organization: private (Figure 2.6) and hybrid (Figure 2.7). We have restricted the number of cores that share an L1-D to two because memories with more than two ports are not energy and area efficient. To eliminate performance overhead due to memory contention, we have considered that both cores can access the memory simultaneously.

Figure 2.6 A 10-core heterogeneous MPSoC processor with a three-stage pipelined streaming architecture, instantiated for baseband processing for 2×2 MIMO 20 MHz with 256-QAM modulation mode of LTE-3GPP.

Figure 2.7 The 10-core instantiation of the MPSoC mobile template where each core has private L1-I memory but L1-D is shared across the cores (referred to as hybrid memory in this work). The cores are connected using a segmented bus (HyArch).

In the case of private L1-D organization, the inter-core communication over the shared bus takes place in a burst mode, and it is assumed that the bus contention overhead is negligible. However, in the case of hybrid L1-D organization, the bus will be accessed more frequently for the Core ↔ L1-D transactions, thus resulting in increased bus contention. To minimize bus contention, we have considered a segmented bus and assumed that the performance overhead due to bus contention within a segment is negligible. It may be noted that we have not derived these shared memory variations from a thorough design space exploration. We have considered one possible shared memory variation for each instance, as our objective is to analyze the difference in impact of different memory and communication architectures at the system level for the 2D and 3D-SIC integrations.

2.4.5 Experimental Setup

We have used the design framework proposed in Section 2.3.1 for design implementation and characterization. We have used the IP Designer tool suite [12] from Synopsys to implement the MPSoC design. Each processor in the design is implemented using the nML language [13]. The tasks mapped on the processor are implemented in C. RTL is generated for each processor using IP Designer. The platform-level RTL, which integrates all the platform components as per the configuration of the MPSoC instance, is written manually. The MPSoC processor instance is further synthesized, floorplanned, and placed and routed using Atrenta’s SGP 3D tool [28] for two different integration schemes: (i) 2D and (ii) F2F. In the case of F2F, we have considered a Cu pad pitch of 3 μm based on realistic figures derived from the accuracy of current wafer alignment tools. We have used a commercial 28 nm technology for gate-level synthesis. For 3D we have used memory-on-logic-based partitioning that results in about 5k inter-die (3D) nets. Both designs resulted in about 120 I/O nets, and we have considered C4 bumps to connect the dies to the package.

2.4.6 Experimental Results

We have analyzed results for the following four design variations for each of the two MPSoC instances considered: private memory (PrArch) in (i) 2D (Pr2D) and (ii) 3D-F2F (Pr3D) and hybrid memory (HyArch) in (iii) 2D (Hy2D) and (iv) 3D-F2F (Hy3D). Since the configurations of the processing cores remain the same across all four designs, we discuss memory and interconnect aspects only. We have used empirical area and energy models for system-level estimations [11]. For area, energy, and delay estimation of the interconnect network, we have used wirelength and area information obtained from post-place and route reports along with the area, energy, and delay models presented in Section 2.3.3 to carry out the comparison.

2.4.6.1 Private vs. Hybrid Memory Architecture

The HyArch results in an increase of more than 2× in the total memory area and 1.6× in the total memory access energy, as compared with the PrArch. In the case of the PrArch, each processor is individually configured, including the memory sizes, based on the requirements of the tasks mapped to the processors. In Figure 2.6, processors 1–6 account for more than 80% of the total memory accesses, whereas their corresponding memory sizes are less than 10% of the total memory. However, in the case of HyArch, the memory sizes are set to the worst case across the processors sharing them. This results in an increase in the total memory area as well as an increase in the size of the memories that see the higher number of accesses. On the other hand, HyArch results in about 60% reduction in data communication across the cores. Thus, from a system architecture perspective, PrArch is more area and energy efficient in terms of memory, whereas HyArch is more communication throughput efficient.

2.4.6.2 Interconnect Technology Comparison

In Figure 2.8, we have compared the wirelength distribution across all four designs. 3D-F2F shows a shift in the wirelength distribution from longer to shorter wires. As a result, 3D-F2F achieves more than 30% reduction in total wirelength and in energy dissipation in the interconnects, more than 60% reduction in the delay of the longest wire, and about 50% reduction in wiring area overhead (calculated in terms of repeater area), as shown in Figure 2.9. Thus, 3D-F2F offers benefits over 2D for both PrArch and HyArch. However, 3D-F2F poses manufacturing, yield, reliability, and CAD challenges that may have higher cost implications as compared with 2D. The impact of 3D-F2F on total wirelength and energy dissipation is similar for PrArch and HyArch, because the number of cores is the same in both cases and the number of Core ↔ Memory interconnections is also similar. However, PrArch sees a higher reduction in delay than HyArch, because it also sees a higher reduction in the length of the longest wire. Thus, 3D-F2F has a higher performance impact in the case of PrArch than HyArch.
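One reason the wirelength and interconnect energy reductions track each other is that Eq. (2.9) is linear in the wire length. The following minimal sketch evaluates the interconnect models of Section 2.3.3 (Eqs. (2.8)–(2.10)); the function names and every numeric value are hypothetical placeholders and do not correspond to the 28 nm technology or the post-route data used in this work.

```python
def t3d_f2f(r_driver: float, r_pad: float, c_pad: float, c_load: float) -> float:
    """Eq. (2.8): delay (s) of a 3D net through a Cu-Cu pad.

    The pad capacitance is split in two halves around the pad resistance.
    """
    return r_driver * c_pad / 2 + (r_driver + r_pad) * (c_pad / 2 + c_load)


def wire_power(alpha: float, c_w: float, length: float, v: float, f: float) -> float:
    """Eq. (2.9): dynamic power (W) of a wire of the given length.

    c_w is the capacitance per unit length, so c_w * length is the wire capacitance.
    """
    return alpha * (c_w * length) * v ** 2 * f


def repeater_area(s: float, a_inv: float, n_rep: int) -> float:
    """Eq. (2.10): total repeater area, with each repeater S times a unit inverter."""
    return s * a_inv * n_rep


# Placeholder parameters (assumptions for illustration only).
ALPHA, C_W, V, F = 0.1, 2e-10, 0.9, 1e9      # activity, F/m, volts, Hz
length_2d = 1e-3                              # 1 mm wire in the 2D design
length_3d = 0.7 * length_2d                   # same net after a 30% length reduction

p_2d = wire_power(ALPHA, C_W, length_2d, V, F)
p_3d = wire_power(ALPHA, C_W, length_3d, V, F)
print("power ratio 3D/2D:", p_3d / p_2d)      # equals the length ratio
print("3D net delay (s):", t3d_f2f(1e3, 0.1, 1e-15, 2e-15))
```

Because all other factors in Eq. (2.9) are unchanged between the two cases, the power ratio equals the wirelength ratio, which is consistent with the similar reductions reported for total wirelength and interconnect energy.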

Figure 2.8 Wirelength distribution for all four design choices. [Plot: wire count (normalized to Pr2D) versus wirelength bins (log-2 scale) for Pr2D, Pr3D, Hy2D, and Hy3D.]

Figure 2.9 Comparison of impact of HyArch with respect to PrArch on different interconnect parameters. [Bar chart, normalized to Pr2D: total wire length, longest wire delay, average wire energy, and total wiring area overhead for Pr2D, Pr3D, Hy2D, and Hy3D.]

2.4.6.3 Impact of System Architecture

HyArch results in an increase in total wirelength, average interconnect energy, and total wiring area overhead. This is because HyArch uses multiple segmented buses to minimize performance overhead, whereas PrArch uses a single shared bus across all the cores. Multiple buses increase the number of control and data signals and hence the total wirelength and energy. The interconnect energy increases in spite of the reduction in data communication across the cores because LTE-3GPP is not a communication-dominant application. On the other hand, using multiple buses instead of a single bus reduces the number of long wires, which in turn reduces the interconnect delay, as shown in Figure 2.9. Thus, from the interconnect perspective, HyArch is more performance efficient, whereas PrArch is more energy efficient. Besides this, the impact of moving from PrArch to HyArch is higher in the case of 2D than 3D-F2F.

Figure 2.10 Variation in suitability of design choices across different system parameters. [Table: rows are interconnect performance, interconnect energy, memory energy, memory area, inter-core communication, and area overhead; columns are Pr2D, Pr3D, Hy2D, and Hy3D; cells are color coded from best to worst.]

2.4.6.4 System Parameter vs. Design Choices

Figure 2.10 summarizes the above discussion by showing how the suitability of the different design choices varies across system parameters. Thus, depending on the application requirements and design trade-off objectives, a choice between PrArch and HyArch (system architecture) and between 2D and 3D-F2F (interconnect technology) should be made. For example, LTE-3GPP, which is a memory-dominant wireless application, will benefit most from a Pr3D implementation, whereas WLAN 11n, which is a compute- and communication-dominant wireless application, will benefit most from a Hy3D implementation.

2.5 Conclusions

In this chapter we have presented an EDA environment that we developed to enable exploration of 3D-IC design implementations. The environment supports various 3D integration flavors, such as F2F or F2B stacking, using typical 3D interconnect structures such as TSVs, micro-bumps, RDLs, and Cu pads. The tool flow is built on top of an existing commercial EDA solution for 2D designs: SGP from Atrenta. To enable holistic exploration of the 3D design space, we have developed a series of our own stand-alone modules and interfaced them with SGP to enable interconnect performance assessment of the 3D nets, thermal simulation, cost modeling, and automated partitioning for fine-grained 3D interconnects.

We have presented a case study where we have explored 2D and 3D-SIC implementations for a mobile MPSoC platform targeted at wireless baseband processing. We have considered ultrafine pitch 3D interconnections using Cu–Cu pad bonding and F2F stacking. We have analyzed the interdependence of the system-level architecture choices and the interconnect technology choices. We have considered variations in memory organization and communication structure for a 10-core heterogeneous MPSoC design instantiated for LTE-3GPP 2×2 MIMO baseband processing. For each design, we have evaluated 2D and Cu–Cu bonding-based F2F 3D-SIC implementations. We have shown that 3D-SIC presents gains between 30% and 60% over 2D-IC for both design variations. However, the extent of the impact of 3D-SIC is highly dependent on the system-level architecture. For example, the performance improvement for the 3D-SIC implementation is higher for the design with private memories and a shared bus communication structure. We have also shown that different combinations of system-level architecture choices and interconnect technology choices have different impacts on system parameters such as interconnect performance and energy, memory area and energy, design area, and inter-core communication. Given that the application characteristics have implications for these parameters, it is important to take application characteristics and requirements into account while making system architecture and interconnect technology choices.

References

1 Lu, J.J. (2012). (Invited) advances in materials and processes for 3D-TSV integration. ECS Transactions 45 (6): 119–129.
2 Zoschke, K., Wegner, M., Wilke, M. et al. (2010). Evaluation of thin wafer processing using a temporary wafer handling system as key technology for 3D system integration. In: Proceedings of ECTC, 1385–1392.
3 Jourdain, A., Buisson, T., Phommahaxay, A. et al. (2010). 300mm wafer thinning and backside passivation compatibility with temporary wafer bonding for 3D stacked IC applications. In: Proceedings of 3DIC, 1–4.
4 Hu, Y.H., Liu, C.S., Lii, M.J. et al. (2012). 3D stacking using Cu–Cu direct bonding. In: Proceedings of 3DIC, 1–4.
5 Velenis, D., Marinissen, E., and Beyne, E. (2010). Cost effectiveness of 3D integration options. In: Proceedings of 3DIC, 1–6.
6 Tang, Y.S., Chang, Y.J., and Chen, K.N. (2012). Wafer-level Cu–Cu bonding technology. Microelectronics Reliability 52 (2): 312–320.
7 Zhang, T., Cevrero, A., Beanato, G. et al. (2013). 3D-MMC: a modular 3D multi-core architecture with efficient resource pooling. In: Proceedings of DATE, 1241–1246.
8 Priyadarshi, S., Choudhary, N.K., Dwiel, B. et al. (2013). Hetero² 3D integration: a scheme for optimizing efficiency/cost of chip multiprocessors. In: Proceedings of ISQED, 1–7.
9 Sheibanyrad, A., Petrot, F., and Jantsch, A. (2010). 3D Integration for NoC-based SoC Architectures. New York: Springer.
10 Lee, Y.J., Morrow, P., and Lim, S.K. (2012). Ultra high density logic designs using transistor-level monolithic 3D integration. In: Proceedings of ICCAD, 539–546.
11 Agrawal, P., Raghavan, P., Hartman, M. et al. (2013). Early exploration for platform architecture instantiation with multi-mode application partitioning. In: Proceedings of DAC, 132:1–132:8.
12 Synopsys Inc. Synopsys IP Designer. http://www.synopsys.com/IP/ProcessorIP/asip/ip-mp-designer/Pages/ip-designer.aspx (accessed 24 August 2018).
13 Van Praet, J., Lanneer, D., Geurts, W., and Goossens, G. (2008). nML: a structural processor modeling language for retargetable compilation and ASIP design. In: Processor Description Languages: Applications and Methodologies, Volume 1 in Systems on Silicon (ed. P. Mishra and N. Dutt), 65–93. Elsevier Inc.
14 Milojevic, D., Marchal, P., Marinissen, E.J. et al. (2013). Design issues in heterogeneous 3D/2.5D integration. In: Proceedings of ASP-DAC 2013, 403–410.
15 Elmore, W.C. (1948). The transient response of damped linear networks with particular regard to wideband amplifiers. Journal of Applied Physics 19 (1): 55–63.
16 Dong, X. and Xie, Y. (2010). System-level 3D IC cost analysis and design exploration. In: Three Dimensional Integrated Circuit Design: Integrated Circuits and Systems. Springer US, 261–280.
17 Xie, Y. (2011). Microprocessor design using 3D integration technology. In: Three Dimensional System Integration - IC Stacking Process and Design (ed. A. Papanikolaou, D. Soudris, and R. Radojcic), 211–236. Springer US.
18 Akyildiz, I.F., Gutierrez-Estevez, D.M., and Reyes, E.C. (2010). The evolution to 4G cellular systems: LTE-advanced. Physical Communication 3 (4): 217–244.
19 Raychaudhuri, D. and Mandayam, N.B. (2012). Frontiers of wireless and mobile communications. Proceedings of the IEEE 100 (4): 824–840.
20 Alsabbagh, E., Yu, H., and Gallagher, K. (2013). 802.11ac design considerations for mobile devices. Microwave Journal 56 (2): 80, 82, 84, 86, 88.
21 Gilmore, R. (2013). System design considerations for next generation wireless mobile devices. In: Proceedings of VLSIT, T8–T13.
22 3GPP TSG-RAN, 3GPP TS36.211, Physical Channels and Modulation (Release 8), v8.1.0 (2007-11).
23 Sharma, N., Aa, T.V., Agrawal, P. et al. (2013). Data memory optimization in LTE downlink. In: Proceedings of ICASSP, 2610–2614.
24 Fasthuber, R., Catthoor, F., Raghavan, P., and Naessens, F. (2013). Energy-Efficient Communication Processors - Design and Implementation for Emerging Wireless Systems. New York: Springer.
25 ARM Limited. AMBA 3 AHB-Lite Protocol v1.0.
26 Verbauwhede, I. and Nicol, C. (2000). Low power DSP’s for wireless communications. In: Proceedings of ISLPED, 303–310. https://doi.org/10.1109/LPE.2000.155303.
27 Ramacher, U., Raab, W., Hachmann, U. et al. (2011). Architecture and implementation of a software-defined radio baseband processor. In: Proceedings of ISCAS, 2193–2196.
28 Atrenta Inc. http://www.atrenta.com/about-spyglass.htm5 (accessed 25 August 2018).

3 Power Delivery Network and Integrity in 3D-IC Chips

Makoto Nagata

Kobe University, Graduate School of Science, Technology and Innovation, 1-1 Rokkodai, Nada, Kobe 657-8501, Japan

3.1 Introduction

The power delivery network (PDN) determines the capability of powering circuits in a very large-scale integration (VLSI) chip. It also characterizes the power noise emission, interference, and susceptibility of an entire electronic system that uses the VLSI chip. The design and verification of PDNs need not only to include circuits on chips but also to consider their interaction with the packages and printed circuit boards (PCBs) with which they are assembled into a complete system. The analysis and diagnosis frameworks should combine the static impedance characteristics of PDNs with the dynamic power current consumption of the circuits to guarantee their completeness of operation or integrity in performance. The overall considerations of PDN and power domain integrity (PI) must evolve significantly for multiple chips vertically integrated in three-dimensional (3D) VLSI systems [1–4].

3.2 PDN Structure and Integrity

A PDN involves power sources such as a battery, power loads representing integrated circuits (ICs), and power lines formed on PCBs and packages, as sketched in the equivalent circuit representation of Figure 3.1 [1, 2]. Each component has its impedance in series with the power lines, such as ZPS for the internal output impedance of a power source, ZPCB for the power traces of a PCB, ZPKG for the power traces within a package, and ZCHIP for a chip. These are parasitic impedances due to the resistive and inductive nature of the metallic wires within each structure. Additionally, the shunt capacitances between the power (VDD) and ground (VSS) or return nodes are associated parasitically with individual capacitive structures or explicitly provided as decoupling capacitors (decaps). The ICs dynamically consume power currents of ILOAD when they are operating, in addition to static leakage currents that can be made relatively negligible with the low leakage devices or low power circuits pursued in present VLSI technologies. The total current supplied by the power source is IPS at the supply voltage of VOUT.

Handbook of 3D Integration: Design, Test, and Thermal Management, First Edition. Edited by Paul D. Franzon, Erik Jan Marinissen, and Muhannad S. Bakir. © 2019 Wiley-VCH Verlag GmbH & Co. KGaA. Published 2019 by Wiley-VCH Verlag GmbH & Co. KGaA.

Figure 3.1 Power delivery network (PDN). Source: Adapted from Park et al. 2010 [1] and Yoshikawa et al. 2011 [2]. [Equivalent circuit: power source (battery, regulator, power driver) with ZPS, CPS, IPS, and VOUT; PCB with ZPCB and CPCB; package with ZPKG and CPKG; IC chip (VRM, decap, power gates) with ZCHP, CCHP, and ILOAD.]

An actual example of a PDN is given in Figure 3.2. A silicon chip is packaged in a quad flat package (QFP) and mounted on a PCB. An external power source is connected to the ICs through one of the power connectors and the associated traces formed on internal metal layers of the PCB (as shown in the PCB design data). While the VDD pin and its associated traces are isolated from the other power voltage domains, the VSS pins are often connected to a single, shared ground island. The inside of the QFP is also shown, where a tiny silicon die with a side length of approximately 5 mm is located in the center of the package cavity. The power pads at the die periphery are wire-bonded to the lands within the package. The bonding wires, package, and board traces are of the order of 10 mm in length and 10–1000 times larger in physical dimensions than the on-chip metallic wires within ICs. These off-chip components primarily characterize the frequency-domain response of a PDN, particularly at and below the frequency of PDN resonance, which is typically a few hundred megahertz.

Figure 3.2 PDN components. (a) PCB after assembly, (b) PCB design data, and (c) Silicon chip in QFP package.

Figure 3.3 Power noise and impacts on signal waveform. (a) Overview and (b) actual waveforms. [Panel (a) marks the static drop and the dynamic drop on the power noise waveform, shown together with the signal waveform; panel (b) plots voltage (V) versus time (ns).]

Power currents periodically consumed by digital ICs interact with the PDN impedance. This creates voltage variations on the power rails, as exemplified in Figure 3.3, including static voltage drops caused by average power currents and dynamic ones caused by the instantaneous currents of switching logic gates and clock drivers. The power noise waveforms were captured on-chip in an IC chip. The static and dynamic voltage drops modulate the slopes and shapes of signals, respectively, and potentially lead to timing violations in digital ICs. The PDN of a whole system, even in 3D chip integration, should be co-designed across IC chips, packages, and PCBs and verified for power integrity as well as signal integrity.
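The distinction between the static and dynamic drops can be illustrated on a sampled supply waveform. The following minimal sketch assumes one possible working definition that is not stated in the text: the static drop is the offset of the time-averaged rail voltage from the nominal supply, and the dynamic drop is the deepest excursion below that average. The function name and the sample values are hypothetical and are not the measured waveforms of Figure 3.3.

```python
from statistics import mean
from typing import Sequence, Tuple


def split_supply_drop(v_nominal: float, samples: Sequence[float]) -> Tuple[float, float]:
    """Split a sampled on-chip supply waveform into static and dynamic drops.

    Assumed definitions (illustrative only):
      static drop  = nominal voltage minus the average rail voltage
      dynamic drop = average rail voltage minus the lowest sample
    """
    v_avg = mean(samples)
    static_drop = v_nominal - v_avg
    dynamic_drop = v_avg - min(samples)
    return static_drop, dynamic_drop


# Hypothetical samples of a 1.8 V rail over one clock period (volts).
rail = [1.75, 1.70, 1.62, 1.68, 1.75, 1.70]
static_drop, dynamic_drop = split_supply_drop(1.8, rail)
print(f"static drop  = {static_drop * 1e3:.0f} mV")
print(f"dynamic drop = {dynamic_drop * 1e3:.0f} mV")
```

With these placeholder samples the average rail voltage is 1.70 V, so the sketch attributes 100 mV to the static drop and a further 80 mV to the dynamic drop at the deepest point of the waveform.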

3.3 PDN Simulation and Characterization

The equivalent circuit representation of a PDN is given in Figure 3.4. An IC chip model includes on-die VDD and VSS planes in respective horizontal impedance meshes, current sources capturing the power current consumption of active circuits, and chip capacitances (CCHIP) vertically shunting those planes. A p-type silicon substrate is also considered in parallel with the VSS plane. The wires in packaging are modeled as series inductances, approximately estimated from the lengths of bonding and routing. An accurate multi-port S-parameter model is derived by full-wave electromagnetic analysis for the parasitic impedance networks among the power source terminals and the power pins of packaged IC chips on the PCB. In addition, explicit decaps are added with the equivalent series inductance (ESL) and equivalent series resistance (ESR) parasitic to the component structures.

Figure 3.4 Power noise simulation model. [Block diagram: PCB S-parameter model with decap (Cdecap, ESL, ESR); package bonding wires; chip VDD network, active power current model, CCHP, and VSS network with silicon substrate network.]

The system-level capture of Figure 3.5 involves an external power source and its connections to IC chips in a package and on a board. The series inductances parasitic to wiring components such as cables in a harness, traces on a board, wires and leads in a package, and metal wiring in a chip are estimated, along with the shunt capacitances associated with the individual components. It is noted that the metal shields of cables and the metal planes of VSS on PCBs are tightly connected to the system ground and make the series impedance of the return path sufficiently small.

Figure 3.5 PDN passive network model. [Lumped circuit from the power source through the cable (UFL, 1 m), PCB, and package to the chip, with series inductances of 300 nH, 6 nH, 5 nH, and 1 nH, shunt capacitances including 220 pF and 100 pF, chip resistances RDD and RSS//RSUB, and a decap.]

The PDN impedance (ZPDN) of Figure 3.5, seen from on-chip ICs represented by a single capacitor of CCHIP toward an external power source, is simulated in the frequency domain as given in Figure 3.6. The total inductance of 312 nH in the closed circuit of the PDN and CCHIP of 220 pF characterize the frequency response of ZPDN in this particular example. The magnitude of ZPDN increases with frequency from DC and reaches a maximum of about 300 Ω at the frequency of resonance (FRES) of 20 MHz, which follows from Eq. (3.1):

$$F_{RES} = \frac{1}{2\pi\sqrt{LC}} \qquad (3.1)$$

Above resonance, the impedance decreases with frequency because of the shunt capacitances, down to the total series resistance within the PDN closed circuit.
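As a quick numerical check of Eq. (3.1), the sketch below computes the resonance frequency for the inductance and capacitance values quoted in the text and sweeps a simplified lumped impedance seen from the chip. The lumped model (the chip capacitance in parallel with a single series R-L branch toward an ideal source) and the 5 Ω series resistance are assumptions made for illustration; the chapter does not state the resistance values, and this simplified model does not reproduce the high-frequency resistance floor.

```python
import math
from typing import Tuple


def resonance_frequency(l_total: float, c_total: float) -> float:
    """Eq. (3.1): F_RES = 1 / (2 * pi * sqrt(L * C)), in Hz."""
    return 1.0 / (2.0 * math.pi * math.sqrt(l_total * c_total))


def pdn_impedance(freq_hz: float, l_total: float, c_chip: float, r_series: float) -> float:
    """|Z| seen from the chip: C_CHIP in parallel with a series R-L branch.

    Assumed simplification: the power source is treated as an ideal AC short.
    """
    w = 2.0 * math.pi * freq_hz
    z_branch = complex(r_series, w * l_total)   # R + jwL
    z_cap = 1.0 / complex(0.0, w * c_chip)      # 1 / (jwC)
    return abs(z_branch * z_cap / (z_branch + z_cap))


def peak_impedance(l_total: float, c_chip: float, r_series: float,
                   f_start: float = 1e6, f_stop: float = 1e10,
                   points: int = 2000) -> Tuple[float, float]:
    """Log-spaced sweep returning (frequency, |Z|) at the impedance peak."""
    best_f, best_z = f_start, 0.0
    for k in range(points):
        f = f_start * (f_stop / f_start) ** (k / (points - 1))
        z = pdn_impedance(f, l_total, c_chip, r_series)
        if z > best_z:
            best_f, best_z = f, z
    return best_f, best_z


# Values from the text: 312 nH total loop inductance, C_CHIP = 220 pF.
f_res = resonance_frequency(312e-9, 220e-12)
print(f"F_RES = {f_res / 1e6:.1f} MHz")          # close to the quoted 20 MHz

# R = 5 ohm is a hypothetical value, not given in the chapter.
f_peak, z_peak = peak_impedance(312e-9, 220e-12, 5.0)
print(f"peak |Z| = {z_peak:.0f} ohm near {f_peak / 1e6:.1f} MHz")
```

The first result depends only on values given in the text and lands slightly below the rounded 20 MHz figure. The peak magnitude, in contrast, scales roughly as L/(RC) in this model, so it is set almost entirely by the assumed resistance.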

The time-domain (transient) response of the PDN is also simulated and given in Figure 3.6. It shows the power current flowing through the PDN in a resonating waveform (ringing) at FRES, after a single current impulse is drawn at the position of ICs in the PDN (this current source is not shown).
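The ringing can be sketched analytically with the textbook underdamped response of a series RLC loop. This is an illustrative approximation under stated assumptions, not the simulation setup behind Figure 3.6: the current impulse is modeled as an instantaneous charge removed from CCHIP, the power source is an ideal AC short, and the 5 Ω loop resistance and 1 pC charge are hypothetical values.

```python
import math


def ringing_frequency(l_total: float, c_chip: float, r_series: float) -> float:
    """Damped oscillation frequency (Hz) of a series RLC loop."""
    alpha = r_series / (2.0 * l_total)            # decay rate (1/s)
    w0 = 1.0 / math.sqrt(l_total * c_chip)        # undamped angular frequency
    if alpha >= w0:
        raise ValueError("loop is not underdamped; no ringing")
    return math.sqrt(w0 ** 2 - alpha ** 2) / (2.0 * math.pi)


def ringing_current(t: float, l_total: float, c_chip: float,
                    r_series: float, charge: float) -> float:
    """Loop current magnitude profile (A) at time t after the impulse.

    i(t) = V0 / (wd * L) * exp(-alpha * t) * sin(wd * t), with V0 = Q / C
    the initial voltage step on the chip capacitance.
    """
    alpha = r_series / (2.0 * l_total)
    wd = 2.0 * math.pi * ringing_frequency(l_total, c_chip, r_series)
    v0 = charge / c_chip
    return v0 / (wd * l_total) * math.exp(-alpha * t) * math.sin(wd * t)


L_TOTAL, C_CHIP = 312e-9, 220e-12    # values quoted in the text
R_LOOP, Q_IMPULSE = 5.0, 1e-12       # hypothetical resistance and impulse charge

f_ring = ringing_frequency(L_TOTAL, C_CHIP, R_LOOP)
period = 1.0 / f_ring
print(f"ringing at {f_ring / 1e6:.1f} MHz, period {period * 1e9:.0f} ns")
for n in range(3):
    t_peak = (n + 0.25) * period     # successive positive crests
    i_ma = ringing_current(t_peak, L_TOTAL, C_CHIP, R_LOOP, Q_IMPULSE) * 1e3
    print(f"t = {t_peak * 1e9:5.1f} ns  i = {i_ma:.3f} mA")
```

For a lightly damped loop the ringing frequency stays very close to FRES from Eq. (3.1), and each successive crest is smaller than the previous one by the factor exp(-αT), where T is the ringing period.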

Figure 3.6 Resonance response of PDN without decap. [Circuit: power source, cable (UFL 1 m), PCB, package, and chip with series inductances of 300 nH, 6 nH, 5 nH, and 1 nH, chip capacitance of 220 pF, and resistances RDD and RSS//RSUB. Plots: impedance (Ω) versus frequency (MHz) with FRES = 20 MHz, and current IDD (mA) versus time (ns).]

The frequency response is intentionally modified by inserting decaps in the PDN to suppress the undesired impedance peak and/or to move its frequency. While the parasitic inductance of the cable was the largest among the parasitic inductances of Figure 3.5, the power line trace on the PCB of Figure 3.7 becomes dominant when a decap is placed at the power source terminal, particularly on the shortest path.

Figure 3.7 Power trace on PCB (example). [Layout of package and PCB showing short and long V_DD traces to the chip.]

The V_DD trace on the PCB exhibits a capacitive response at the power source terminal when the other end (IC chip side) is open, as simulated with full-wave electromagnetic analysis software in Figure 3.8. The response is also measured on a blank (vacant) version of this PCB with a network analyzer, for comparison. It becomes inductive when the far end is shorted to the ground plane, as also shown.

Figure 3.8 PCB impedance of V_DD traces with the far end open (a) and shorted to ground (b). [Both panels: simulated and measured impedance (Ω) versus frequency (MHz).]

The total PDN response of Figure 3.7 is simulated as in Figure 3.9, where the frequency of resonance moves to F_RES = 100 MHz as calculated from
Eq. (3.1) with the total inductance of 12 nH and C_CHIP of 220 pF. The simulated power current of Figure 3.9 demonstrates the effect of the decap. The current through the PCB and IC chip side of the PDN resonates after the impulse current stimulus and is completely decoupled from the branch to the external power source. If the resonance is stimulated continuously, DC current should flow in the entire PDN. The decap also clearly decreases the peak impedance.

Figure 3.9 Resonance response of PDN with 10 μF proximate to power source. [Circuit of Figure 3.6 with a decap on the PCB (C = 10 μF, ESL = 2 nH, ESR = 50 mΩ). Plots: impedance (Ω) versus frequency (MHz), peaking at F_RES = 100 MHz, and currents I_DD and I_Decap (mA) versus time (ns).]

Further decoupling is demonstrated in Figure 3.10, where the decap is placed at the power pins of the IC package. The inherent frequency of resonance becomes F_RES = 140 MHz, where the impedance seen from the IC circuits is dominated only by the parasitic components within the package. The resonating current is shunted at the decap, and almost no varying current flows outside the package.

The frequency response of the PDN impedance (Z_PDN) is evaluated for an actual silicon IC chip of digital circuits, shown in Figure 3.11. The chip is assembled on a PCB where the V_DD trace involves the parasitic capacitance C_brd against the system ground, while no explicit decap is installed. The frequency-domain response of Z_PDN is primarily determined by (C_brd + C_die) and (L_brd + L_wire) at low frequencies. The derived values of C_brd, C_die, L_brd, and L_wire are 5.5 pF, 174 pF, 1.15 nH, and 10 nH, respectively, and F_RES is calculated as 112 MHz from Eq. (3.1). When the PDN is seen from the power source side, the other end (chip side) is openly terminated by the parasitic capacitance C_die, and an open-like response is shown (Figure 3.11a). On the other hand, the PDN seen from the ICs
is virtually terminated to AC ground at the power source, and a closed-loop response is shown (Figure 3.11b). The resonance frequency is equivalently 112 MHz for both directions, since C_brd is negligibly small and no decap is placed on the PCB in this particular example.

Figure 3.10 Resonance response of PDN with 1 μF proximate to IC chip. [Circuit of Figure 3.6 with a decap (C = 1 μF, ESL = 0.5 nH, ESR = 50 mΩ) next to the package. Plots: impedance (Ω) versus frequency (MHz), peaking at F_RES = 140 MHz, and currents I_DD and I_Decap (mA) versus time (ns).]

Figure 3.11 Chip-package-board unified impedance (Z) seen from power source (a) and from IC chip (b) [2]. (Copyright 2011, IEEE.) [Equivalent circuits with L_brd, L_wire, R_die, C_die, and C_brd; (a) simulated and measured impedance (Ω) versus frequency (MHz); (b) simulated impedance (Ω) versus frequency (MHz).]

The time-domain power voltage variations, namely, power noise, are recorded in the waveforms of Figure 3.12. An on-chip power noise measurement technique [3, 4] was applied. The digital ICs operated at clock frequencies (F_CLK) of 10 and 100 MHz. With F_CLK at 10 MHz, we clearly see decaying ringing at 100 MHz after each clock edge (either rising or falling); with F_CLK at 100 MHz, successive peaks with stable amplitudes appear in every half clock cycle. The power noise waveforms are strongly characterized by the PDN impedance and are especially emphasized in the frequency components at F_RES.
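The resonance frequencies quoted in this section can be cross-checked numerically. The sketch below assumes that Eq. (3.1), which is referenced but not reproduced in this part of the chapter, is the standard lossless LC resonance F_RES = 1/(2π√(LC)). The function name `f_res` and the grouping of inductances per case are illustrative assumptions based on the values given in the text and figures; in particular, the 6 nH used for the decap at the package pins is taken as the 5 nH and 1 nH elements of Figure 3.6.

```python
import math

def f_res(inductance_h, capacitance_f):
    """Resonance frequency (Hz) of a lossless LC loop: 1 / (2*pi*sqrt(L*C))."""
    return 1.0 / (2.0 * math.pi * math.sqrt(inductance_h * capacitance_f))

# (inductance in H, capacitance in F) for each case discussed in the text
cases = {
    "no decap": (300e-9 + 6e-9 + 5e-9 + 1e-9, 220e-12),
    "decap at power source": (12e-9, 220e-12),
    "decap at package pins": (5e-9 + 1e-9, 220e-12),   # assumed grouping
    "measured chip": (1.15e-9 + 10e-9, 5.5e-12 + 174e-12),
}

results_mhz = {name: f_res(l, c) / 1e6 for name, (l, c) in cases.items()}
for name, f_mhz in results_mhz.items():
    print(f"{name}: {f_mhz:.1f} MHz")
```

Under this assumption the four cases come out close to the 20, 100, 140, and 112 MHz stated above.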

Figure 3.12 Measured power noise waveforms on V_DD with clock network at 10 MHz (a) and at 100 MHz (b) [2]. (Copyright 2011, IEEE.) [Both panels: voltage (V) versus time (ns); panel (a) marks T_RES = 10 ns.]

3.4 PDN in 3D Integration

The PDN evolves with 3D-IC chip integration as sketched in Figure 3.13 [5, 6]. Each tier (die) completes its own PDN for supplying internal ICs with V_DD and V_SS wirings, while the tiers are connected to each other vertically by through-silicon vias of the power domains (power TSVs). The low-frequency response of the PDN impedance is largely characterized by the power line traces and associated decoupling components on a PCB as well as in an interposer, following the discussion of 2D PDNs in the previous sections. The differences primarily come from the 3D-IC chip stack and its interposer within a molded package.

Figure 3.13 PDN in 3D integration. Source: Adapted from Araga et al. 2014 [5] and Nagata et al. 2015 [6]. [Top tier, bottom tier, and interposer connected by V_DD and V_SS power TSVs.]

The primary features of a 3D PDN are attributed to the parasitic series impedance and parallel coupling among TSVs and μ-bumps (Figure 3.14). The power TSVs for V_DD and V_SS are placed alternately along the power supply trunks and form dense arrays. The vertical vias involve capacitive coupling to the silicon substrate, which is resistive and induces further interactions with the other vias. Placing multiple TSVs reduces the longitudinal impedance while enlarging the lateral coupling, all contributing to a tighter interaction among V_DD and V_SS planes and resulting in a larger effective PDN capacitance. The parasitic capacitances distributed among the internal tiers all contribute to the PDN response.

Figure 3.14 Power noise coupling between TSVs and substrate. Source: Reproduced with permission from Araga et al. 2014 [5]. Copyright 2014, IEEE. [Cross section: V_DD and V_SS TSVs through the top tier, μ-bumps, bottom tier, and silicon substrate.]

An experimental prototype of 3D-IC integration is introduced in Figure 3.15. It includes two tiers, each having digital noise source circuits (NSs), probing front ends (PFEs), and a data processing unit (DPU) for on-chip power noise measurements [3, 4]. Two NSs (NS1 and NS2) are located on the top (thinned) tier, while the other two (NS3 and NS4) are on the bottom one. Each NS has its own PDN and connections to an external power source through dedicated pins and routes. It is noted that the PDNs on the bottom tier have TSVs on the top tier in
their paths. The return paths on each tier have shared connections to its silicon substrate (exactly as in a 2D integration).

Figure 3.15 Case study: PDN coupling in 3D integration. Source: Reproduced with permission from Araga et al. 2014 [5]. Copyright 2014, IEEE. [Top tier with NS1, NS2, two PFEs, and a DPU; bottom tier with NS3, NS4, two PFEs, and a DPU; a separate V_DD/V_SS pair per NS, with TSVs between the tiers.]

The PFE observing voltage variations on the V_SS4 node of NS4 on the bottom tier evaluates the strength of power noise coupling, as given in Figure 3.16. The voltage variation, measured as its peak-to-peak amplitude V_PP, stays significant when NS3, adjacent to NS4 on the same tier, is operating at different frequencies while NS4 is halted. This represents power noise coupling mainly resistively through the silicon substrate within the bottom tier and also partly by the coupled power TSVs formed on the top tier. It is also clearly shown that V_PP is seen by the inactive NS4 even when only NS2 on the top tier is operating. The power noise from NS2 propagates through the silicon substrate and then couples to the power TSVs; this coupling happens on the top tier even though NS4 is located on the bottom one. This resistive–capacitive coupling exhibits the high-pass characteristics of power noise coupling observed in Figure 3.16.

Figure 3.16 Tier-to-tier PS noise coupling. Source: Reproduced with permission from Araga et al. 2014 [5]. Copyright 2014, IEEE. [Measured V0p (dB mV) at V_SS-NS4 with NS3 running and with NS2 running, and calculated S12 (dB) from V_SS-NS3 and from V_SS-NS2 to V_SS-NS4, versus frequency (MHz).]

The tight coupling of PDNs in a 3D-IC chip stack can bring about undesired power noise coupling among tiers, as observed in the prototype. This is potentially harmful for mixed analog/RF and digital integration [7, 8]. On the other hand, it can provide high-speed digital systems with the advantage of placing area-efficient decoupling capacitors close to digital processing elements as well as data interface circuits, allowing high-density decoupling capacitors to suppress power current and voltage variations at high frequencies. This supports the advancement of higher speed and density in memory–logic 3D integration [9–12].
3D-IC designs need to consider the PDN response as a whole. This involves the chip, package, and board interactions for the low-frequency response, as conventionally in 2D designs, and also the suppression, interference, and impact of high-frequency power noise components among chips and interposers in a 3D stack. An advanced power noise simulation technique will assist in estimating both global and local responses over a wide frequency range and help in completing as well as optimizing the designs of 3D PDNs.
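The high-pass character of the resistive–capacitive coupling discussed for Figure 3.16 can be illustrated with a first-order model. The sketch below is not the model used in [5]: it assumes a single TSV-to-substrate capacitance in series with a lumped substrate resistance to ground, and the values (100 fF, 1 kΩ) are placeholders chosen only to put the corner frequency above the measured frequency range. The function name `coupling_gain_db` is hypothetical.

```python
import math

def coupling_gain_db(freq_hz, r_sub_ohm, c_tsv_f):
    """Gain (dB) of a series-C, shunt-R divider: |jwRC / (1 + jwRC)|."""
    x = 2.0 * math.pi * freq_hz * r_sub_ohm * c_tsv_f
    return 20.0 * math.log10(x / math.sqrt(1.0 + x * x))

R_SUB = 1.0e3     # assumed lumped substrate resistance, ohm
C_TSV = 100e-15   # assumed TSV-to-substrate capacitance, F
f_corner = 1.0 / (2.0 * math.pi * R_SUB * C_TSV)

for f in (10e6, 100e6, 1e9):
    print(f"{f / 1e6:7.0f} MHz: {coupling_gain_db(f, R_SUB, C_TSV):6.1f} dB")
print(f"corner frequency: {f_corner / 1e9:.2f} GHz")
```

Well below the corner frequency the coupling in this model rises by about 20 dB per decade, which is the qualitative high-pass trend referred to in the text.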

References

1 Park, H.H., Song, S.-H., Han, S.-T. et al. (2010). Estimation of power switching current by chip-package-PCB cosimulation. IEEE Transactions on Electromagnetic Compatibility 52 (2): 311–319.
2 Yoshikawa, K., Sasaki, Y., Ichikawa, K. et al. (2011). Measurements and co-simulation of on-chip and on-board AC power noise in digital integrated circuits. In: Proceedings of IEEE EMC Compo, 76–81.
3 Noguchi, K. and Nagata, M. (2007). An on-chip multi-channel waveform monitor for diagnosis of systems-on-chip integration. IEEE Transactions on VLSI Systems 15 (10): 1101–1110.
4 Hashida, T. and Nagata, M. (2011). An on-chip waveform capturer and application to diagnosis of power delivery in SoC integration. IEEE Journal of Solid-State Circuits 46 (4): 789–796.
5 Araga, Y., Nagata, M., Van der Plas, G. et al. (2014). Measurements and analysis of substrate noise coupling in TSV-based 3-D integrated circuits. IEEE Transactions on Components, Packaging, and Manufacturing Technology 4 (6): 1026–1037.
6 Nagata, M., Takaya, S., and Ikeda, H. (2015). In-place signal and power noise waveform capturing within 3D chip stacking. IEEE Design and Test 32 (6): 87–98.
7 Afzali-Kusha, A., Nagata, M., Verghese, N.K., and Allstot, D.J. (2006). Substrate noise coupling in SoC design: modeling, avoidance, and validation. Proceedings of the IEEE 94 (12): 2109–2138.
8 Nagata, M. (2012). Modeling and analysis of substrate noise coupling in analog and RF ICs (Invited). IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences E95-A (2): 430–438.
9 Kim, J.-S., Oh, C.S., Lee, H. et al. (2011). A 1.2V 12.8GB/s 2Gb mobile wide-I/O DRAM with 4×128 I/Os using TSV-based stacking. In: 2011 IEEE International Solid-State Circuits Conference Digest of Technical Papers, 496–497.
10 Liu, Y., Luk, W., and Friedman, D. (2012). A compact low-power 3D I/O in 45nm CMOS. In: 2012 IEEE International Solid-State Circuits Conference Digest of Technical Papers, 142–143.
11 Takaya, S., Nagata, M., Sakai, A. et al. (2013). A 100GB/s wide I/O with 4096b TSVs through an active silicon interposer with in-place waveform capturing. In: 2013 IEEE Intl. Solid-State Circuits Conference Digest of Technical Papers, 434–435.
12 Dutoit, D., Bernard, C., Chéramy, S. et al. (2013). 0.9 pJ/bit, 12.8 GByte/s WideIO memory interface in a 3D-IC NoC-based MPSoC. In: 2013 Symposium on VLSI Circuits Digest of Technical Papers, C22–C23.

4 Multiphysics Challenges and Solutions for the Design of Heterogeneous 3D Integrated System

Alexander Steinhardt, Dimitrios Papaioannou, Andy Heinig, and Peter Schneider

Fraunhofer Institute for Integrated Circuits (IIS), Division Engineering of Adaptive Systems (EAS), Zeunerstraße 38, 01069 Dresden, Germany

Handbook of 3D Integration: Design, Test, and Thermal Management, First Edition. Edited by Paul D. Franzon, Erik Jan Marinissen, and Muhannad S. Bakir. © 2019 Wiley-VCH Verlag GmbH & Co. KGaA. Published 2019 by Wiley-VCH Verlag GmbH & Co. KGaA.

4.1 Introduction

Advanced packaging technologies open new perspectives for system designers. Fan-in/fan-out wafer-level packaging technologies (e.g. eWLB) and 2.5D/3D integration (e.g. die stacking), as well as the combination of traditional system integration techniques (e.g. flip chip, wire bonding, package on package), enrich the design space. Such advanced packaging technologies facilitate systems with reduced footprint and/or thinner packages for mobile, wearable, and medical applications, higher pin count, and reduced power consumption. Further, such technologies enable high-performance computing applications. Most of these new package types also allow the integration of existing components (e.g. bare dies), which saves NRE costs and/or time to market during the system development process.

However, advanced packaging technologies often introduce additional materials or material mixes into the system. This typically implies differences in material properties (e.g. coefficient of thermal expansion (CTE) or thermal resistivity), which can cause mechanical and electrical issues. A prominent example is a crack induced by different CTEs of the materials of different components. Furthermore, during system development, the designers face the following new (co-design) challenges:

• Find an optimal packaging setup and technology combination considering performance, cost, supply chain, and reliability.
• Interlink design data from different origins, and handle it simultaneously in a consistent way.

In addition to these economic and technical aspects, physical effects have to be considered during the design of modern electronic systems. In this chapter we specifically address the electrical, mechanical, and thermal domains. Furthermore, the multi-physical couplings within a highly integrated system
are of particular importance. Some typical challenges in the aforementioned physical domains are (see Figure 4.1):

Electrical: The power supply and the transmission of the signals have to be ensured with respect to the given specification. In this context, crosstalk and signal integrity, as well as power integrity, are the main challenges that occur; they are further affected by rise-time degradation, noise, ground bounce effects, and insertion losses. General qualitative characteristics are given in [1, 2].
Mechanical: Stress induced by the manufacturing process might have a significant impact on the system performance as well as on reliability. Depending on environmental conditions and operation modes of the system, additional mechanical loads may occur [3].
Thermal: Heat control and cooling are essential to dissipate the thermal energy from the system and to prevent damage to devices, dies, or package components [4–6].

Figure 4.1 Possible challenges for advanced packaging. [Diagram of three coupled domains. Electrical: signal/power integrity, crosstalk, signal delays, IR drop, transmission structure matching. Thermal: heat control and cooling, heat sinks, fans, thermal interface materials. Mechanical: cracks, stresses, permanent deformation, material mismatches. Couplings: electrothermal interaction (Joule/ohmic/resistive heating), electromigration, thermal-induced stress; all affect behavior and reliability.]

Between the different domains there are interactions that cause specific issues for the design (see also Figure 4.1). Examples are as follows:

Electromigration and stress migration: Depending on material properties, the cross section of conductors, and electrical or mechanical load conditions, material transport may occur in interconnect structures, which has an impact on the electrical and thermal behaviors as well as on mechanical characteristics [7].
Electrothermal interactions: Due to the power dissipation in devices and interconnects, the system will heat up, which changes its electrical properties. Depending on the boundary conditions, this will influence the system function and reliability [8–10].
Thermomechanical stress: Caused by a mismatch of CTE or by specific thermal regimes in the manufacturing process, mechanical stress occurs depending on (self-)heating or external cooling of the system. Since stress also influences the device characteristics in semiconductors, the electrical behavior of the system may change [4].

With increasing miniaturization, the risk that the abovementioned effects have a significant influence grows. For example, a smaller form factor achieved by stacking dies increases the density of active devices, which may lead to higher temperatures and additional mechanical stress in a mixed-material 3D system. Additionally, no a priori estimation of the full multi-physical behavior is possible for an application-specific 3D integrated system. This is also illustrated by our two examples. Example 1 (Section 4.1.1) is a medical device equipped with an additional electronic system designed for low power and high miniaturization. Example 2 (Section 4.1.2) is a memory–processor system considered for high-performance applications. The design goals and challenges are completely different in these two cases. Therefore, it is very important to consider the entire system, including its intended operation and the specified environmental or boundary conditions.

The challenge for designers of 3D systems is to identify components and subsystems of the considered system that allow cost-efficient improvements within the 3D methodology with high impact on the overall performance and no drawbacks regarding reliability. There are different ways to handle these challenges. On the one hand, the necessary decisions can be made from the designers' experience, supported by tools. Note that these tools implement the calculation of geometry parameters for impedance matching based on simplified formulas, e.g. those utilized for printed circuit board (PCB) designs. Depending on the level of abstraction of structural details, this way may lead to redesign cycles after creating a prototype. On the other hand, if the system is more complex or prototyping is rather expensive, it is essential to invest more effort in early design phases to avoid unnecessary redesign cycles and prototyping runs (see Figure 4.2). One
appropriate way to investigate system concepts and perform design optimization is the simulation of the system. Simulations on different levels of abstraction may help to manage the challenges. However, there are several general model and simulation issues in the design process:

(1) The trade-off between modeling effort, simulation performance, and accuracy.
(2) The availability of detailed information for the final implementation (which will be elaborated in the simulation-based design process).
(3) The validity of material data, especially w.r.t. process fluctuations during manufacturing and specific geometries (like thin layers, feature dimensions in the range of grain size, etc.).

Figure 4.2 Iterations during the system design process. [Confidence level, design flexibility, and information plotted over time across the design phases: idea and requirement definition; specification and system design; decomposition and component design; system integration; test and reliability assessment; production.]

To effectively utilize simulations within the design process, intelligent approaches for hierarchical model validation are needed. This includes the design and fabrication of test structures as well as domain-specific measurement setups. Such a hierarchical approach is also a useful prerequisite for the partitioning of simulation tasks. Usually, it is not possible to simulate all the physical effects and all structural details of a complex system in one simulation. Therefore, for a specific simulation task, it is necessary:

• to separate substructures of the entire system.
• to identify a manageable amount of physical effects.
• to choose the right simulation method, including solver parameters.
• to define appropriate loads and boundary conditions.

The results of such a (partial) simulation can be the basis for optimization steps in the early design process or input for other (partial) simulations, e.g. as boundary condition, stimuli, or reference for behavioral modeling. Simulation results are always strongly influenced by the assumptions made for modeling. Therefore, the accuracy of the models has to be taken into account when interpreting the simulation results.

A general challenge in the design process for complex and heterogeneous systems is the collaboration of different disciplines and groups for concurrent engineering and decision-making. Especially if different parts of the value chain are involved (e.g. chip design, software development, packaging, PCB design, etc.), central and consistent data handling (materials, geometries, topology of interconnects, load conditions, etc.) is essential to ensure design efficiency and quality [11] (also see Section 4.2).

Sections 4.3–4.6 of this chapter deal with the electrical, mechanical, thermal, and thermomechanical challenges of an advanced packaging system, respectively. There is a clear distinction between the mechanical, thermal, and thermomechanical behaviors of a system. The term mechanical refers to the displacements, deformations, strains, or stresses induced by purely mechanical phenomena, such as torsions, axial loads, bending moments, etc. The thermal behavior concerns the temperature distribution due to temperature gradients within the system. According to the second law of thermodynamics, the heat will flow from the
high-temperature area to the low-temperature one. This heat flow heats up all areas of the system it passes through. Thermomechanical issues are related to the mechanical behavior of a system when thermal (high temperatures, heat fluxes) and mechanical effects are combined.

These behaviors of a system have to be known in order to assess its reliability. We generate this knowledge with simulations in which boundary conditions (e.g. constrained surfaces for the mechanical case and ambient temperature for the thermal case) have to be taken into consideration. Furthermore, different material properties complicate each case. For the mechanical behavior, we need to know the Young modulus or the Poisson ratio of the materials involved. For the thermal behavior, we need the thermal conductivity, which describes how the materials spread the heat. Finally, for the thermomechanical behavior, the aforementioned properties in combination with the heat capacity or the CTE of the materials are needed.

In the following parts of the chapter, we refer to two examples in order to define and describe the aforementioned types of challenges of an advanced system. A complete stent system for medical applications is addressed in Section 4.1.1. The loads and forces deforming the stent can be considered as a purely mechanical problem. Since the (low power) ASIC (application-specific integrated circuit) produces only a small amount of heat, the resulting thermal stresses and displacements can be neglected. So, this example focuses on the mechanical and electrical challenges. The second example, in contrast, focuses on the thermal, thermomechanical, and electrical challenges. It is an interposer package, introduced in Section 4.1.2, which can be considered as a subsystem of a bigger one. Many different materials are stacked on top of each other, and all of them are subjected to a temperature increase. Moreover, in the interposer example no external loads are exerted. This means that there is no purely mechanical behavior, only thermal and thermomechanical behavior.

4.1.1 Example: Stent

This example deals with a biomedical stent that consists of sensors and other microelectronic features. It was chosen to clarify the challenges that occur in the field of advanced packaging. In general, a stent can be described as a biomedical device that is used to restore the blood flow through narrow, weak, or blocked vasculatures inside the human body by taking the form of a superelastic small mesh tube. Superelasticity can be described as the ability of shape-memory alloys, such as nitinol (composition of nickel and titanium) or other specific alloys, to return to their original shape after they are deformed under extreme loading conditions [12, 13]. Stents are usually made of superelastic materials such as martensite or austenite (both are iron-based alloys). In our case a martensitic biomedical stent, consisting of two strips with various microelectronic components (sensors, ASIC, diodes, capacitors, etc.) placed on a PCB, is used. The martensitic stent skeleton is arranged in a meshed cylinder and is connected with the two PCB strips containing the various microelectronic components; they are all placed on the copper tracks on the FR-4 body of the PCB [12, 13].


A common stent design includes only the martensitic or austenitic meshed tube that can be deformed and return to its initial state and shape. However, in our example the stent is enhanced with some microelectronic components, especially sensors. Thereby, it can gather various information related to the blood pressure and temperature, factors that may influence the stent. Furthermore, this information can be used to define the physical state of the stent and provides medical information for diagnostic purposes. The power supply of this system is provided inductively by an external device because no batteries are allowed: their placement would violate the form factor requirements, and they additionally cause biocompatibility issues. This implies that the measurements for the data acquisition can only be performed at a certain time and location. Furthermore, there are different ways to arrange the involved components on the PCB. Several devices like rectifiers, diodes, or ASICs can be included in a package that is subsequently placed on the PCB. An example of such a solution is an interposer package that achieves a smaller form factor or reduces the power consumption of the system. Here, we do not use an additional package and place the components directly on the PCB. Figure 4.3 shows the schematic of the stent; it consists of a superelastic meshed tube, colored in green, and the two PCB strips with several types of microelectronic parts, colored in gray. After the fabrication and assembly of the stent and its microelectronics, the whole structure must be compressed and inserted into the human body via the vasculature (e.g. arteries or veins). This means that excessive forces and loads are applied to its parts; they lead to stress concentrations and deformations on the PCB, sensors, ASICs, skeletal tube, etc. These parts must, of course, be able to absorb the applied loads, torsions, and compressive forces.

Figure 4.3 A martensitic biomedical stent, colored green, with two electronic parts, colored gray, with various microelectronic components (such as ASICs, diodes, capacitors, and inductors) placed on a PCB substrate.

Subsequently, after the stent is, for example, introduced into a blocked or damaged vein, it expands in a manner that keeps the vein open so that sufficient blood flows through it. The degree of expansion and compression is strongly affected by the microelectronic parts, especially the ASICs and the sensors. Another important function of the stent is the protection of the enclosed strips from damage. This means that the stent must remain in a mechanically stable position in order to hold the vessel open for a long time and also to be able to withstand shocks and vibrations. Such a structure should not detach from the vessel or get damaged, for example, during intense movements of the patient. The heat produced by the ASICs is distributed through the PCB copper tracks and the blood. The temperature increase of various components may intensify issues due to CTE mismatch. This is a low power application, and the effects caused by the heat are not a main aspect taken into account. However, the heat can violate some medical requirements. Mechanical (displacement and von Mises stresses) and thermal (temperature distribution) simulations have been performed on this design, and various graphs and numerical values of the stress level, deformations, and temperature distribution have been extracted; see Sections 4.4 and 4.5.

4.1.2 Example: Interposer

The second example is used to illustrate the challenges and issues that occur in advanced packaging. We consider a 2.5D configuration using a silicon interposer as an intermediate layer between a PCB substrate and integrated circuit (IC) dies. The system includes a processing die, which can be a field-programmable gate array (FPGA) or an ASIC, and a memory used for data storage. Solder micro-bumps are placed underneath the die and the memory; they are used to interconnect the power/ground and signal pads to the interposer. Metal wires and through-silicon vias (TSVs) on the interposer are used for (re)routing the signal and power/ground wires between the dies and the PCB. The die and the memory are placed on the interposer, which is subsequently placed on the PCB substrate. There are two ways to arrange the die and the memory. First, the die and the memory are placed side by side. Second, the die is placed on top of the interposer, and the memory is placed underneath, on its bottom side. In both cases the interposer offers better interconnection and routing conditions for the various signals that travel between the die, memory, and additional components, e.g. microelectromechanical systems (MEMS).

Various materials, such as silicon and glass, are preferred for the interposer bulk. However, we restrict our studies to those made of silicon. The silicon body of the interposer is etched to form holes that are filled with copper; these structures are the TSVs. Copper is preferred due to its higher electrical and thermal conductivities. Subsequently, the interposer is metalized in order to offer electrical interconnects on both sides. For the mounting of the interposer on the PCB, so-called C4 bumps are used. A common PCB design consists of copper tracks, which are long compared with those of the interposer, placed on an insulating flame-retardant (FR-4) substrate [3, 14–17]. In our example, two multilayer ceramic chip (MLCC) capacitors are used in order to stabilize the power supply to the network; they are mounted on the PCB (see Figure 4.4 or 4.5). Finally, extra solder balls called ball grid array (BGA)
59

60

4 Multiphysics Challenges and Solutions for the Design of Heterogeneous 3D

(10)

(2)

(1) (3) (4)

(8)

(5) (6)

(7) (9)

Figure 4.4 Representation of a 2.5D assembly consisting of an FPGA, a memory, several micro-bumps, a silicon interposer with copper TSVs, C4 bumps, and a PCB substrate. Two MLCC capacitors are also placed on the left and right sides of the interposer. Finally, BGA balls are placed on the bottom of the substrate, and a mold compound is used to enclose the assembly. (10)


Figure 4.5 Representation of a 2.5D assembly consisting of the same elements as before without the mold compound; in this configuration a heat sink is placed on top of the FPGA and the memory, with a thermal interface material between them and the heat sink.

bumps are placed under the PCB so that the system can be mounted onto other, larger substrates. The system described above may be encapsulated within a mold compound, or it may include a heat sink for better dissipation of the heat that is produced by the die. In high-power applications the presence of a heat sink is essential. It consists of a rectangular aluminum base with either 11 × 11 aluminum pins or 11 fins arranged on it. The pins can be described as very thin cylinders (like nails) placed vertically on the base of the heat sink, forming an n × m matrix (with n and m as natural numbers). Fins are the other option. They are very thin and wide rectangles placed vertically on the base, forming very thin and very high wall-like structures parallel to each other. Finally, a thermal interface material (TIM) is placed between the die and the heat sink. The functions of the heat sink and the attached TIM are explained in the subchapter on the thermal challenges of advanced package designs and in [14–18]. Figures 4.4 and 4.5 show the two configurations of the 2.5D package, with the mold compound and with the heat sink, respectively. The different parts of the assembly are indicated as follows: (1) FPGA, (2) memory, (3) micro-bumps between the FPGA/memory and the interposer, (4) silicon interposer with a top metal layer, (5) copper TSVs offering electrical interconnects through the interposer, (6) C4 bumps between the PCB substrate and the interposer, (7) PCB copper tracks


placed on and through the FR-4 substrate, (8) MLCC capacitors, and (9) BGA balls under the PCB. The item labeled (10) indicates the mold compound in the first case and the heat sink in the second case. Moreover, in the second case, a TIM (11) is placed between the base of the heat sink and the memory/FPGA.

4.2 Data Handling for the System View

Before we discuss the particular challenges in the electrical, mechanical, and thermal domains of a complex system in the next sections, we start with a problem that affects all three of them. For all domains we want to find an appropriate description and model of the considered system. Usually, each of the three domains has its own modeling approaches and data representation. This may cause inconsistencies between the data models and makes it difficult to find a representation that is optimal with respect to the issues of all three domains. Furthermore, the domains may have different optimization objectives. For example, from the electrical point of view, it is beneficial to place the electrical devices of a complex system next to each other, whereas, from the thermal point of view, this placement may heat up the system, which implies electrical losses or physical damage; see Section 4.6. In addition, the three domains are often handled by different design departments or groups that may work at different locations, which complicates the design process because of communication errors. If many designers work on a complex system, perhaps in different groups and at different locations, they have to stay in permanent contact to ensure that their models of the system do not end up in different or conflicting states; otherwise the final fabrication of the system may be delayed. To prevent these problems, we use another approach that avoids corrupt states of the system’s model a priori. The idea behind our approach, which is also used in our in-house tool, is to have a central data model or, more precisely, an advanced software tool that supports different simulation tools and passes the central data model to them as input. Changes to the model are applied only centrally to avoid corrupt states of the model. Such an advanced software tool with a central data model has several benefits.
First, if a change has been made, the designers or designer groups can call all involved simulation environments from the advanced software tool and perform the required simulations to study the impact of this change on the model with respect to the electrical, mechanical, and thermal behaviors. Second, in addition to these behaviors, we can perform checks on the central data model, such as design or package rules, i.e., rules that follow from the fabrication process. See [19] for more information about design rules and some working principles of our in-house tool. These checks can also be used to verify model changes, i.e., only changes that pass all checks are applied to the model. Note that this is a big difference to the local data handling that is usually applied in the traditional approach. For example, a designer who adapts the model for thermal issues commonly does not care about such checks and may make a change that violates a rule. In summary, the benefit is that the designers of the electrical, mechanical, or thermal domains do not have to care about such checks, as usual,


[Figure 4.6 labels: (a) old way: separate electrical, mechanical, and thermal models; (b) new approach: one model with an electrical, a mechanical, and a thermal part.]

Figure 4.6 Two different data handling approaches for electrical, mechanical, or thermal simulations of a complex system. The common local approach (a) and a novel central data model (b).

because the rules are still satisfied a priori. Finally, the advanced tool tracks the changes with an appropriate version control, which makes it possible to monitor the changes and their impact on the behavior if necessary. Figure 4.6 depicts the difference between the local data model and the central data model schematically. Both approaches, the local and the central data handling, allow us to study the electrical, mechanical, and thermal behaviors of the considered system. Furthermore, the simulation process allows us to study the impact of changes to the model or design parameters (e.g. geometry parameters, material data, loads, and boundary conditions) on the model. In particular, we can study the impact of a single parameter by varying it successively over a given range while keeping the other parameters fixed. This easy study of the impact of the parameters is a big benefit of the simulation process and supports the decision-making, i.e., finding optimal parameters with respect to a given objective (e.g. minimal losses of a signal along a transmission line). A novel approach to decision-making and optimization, which is used in another in-house tool and works on the simulation data of parameter sweeps, is introduced in Section 4.3.3. The aim of the decision-making process, i.e., deriving the right values for a given set of design parameters, is to get a model that is ready for prototyping. The general flow for prototyping is shown in Figure 4.7. The figure shows the common way, namely to create a model that is used for the simulations and to use the simulation results for prototyping, with dotted arrows between the main steps, which are colored in blue. The additional steps of our approaches, central data handling and decision-making, are colored in green. Both the dotted and the normal arrows represent a “used for” relation.
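The central data handling described in this section can be sketched in a few lines of Python. The sketch is only an illustration and is not the in-house tool: the class name CentralModel, the parameter names, and the pitch rule are assumptions made for this example. A change is committed only if all rule checks pass, every committed state is stored for tracking, and all domain simulators receive the same model as input.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Params = Dict[str, float]
Rule = Callable[[Params], bool]


@dataclass
class CentralModel:
    """Hypothetical central data model shared by all simulation domains."""
    params: Params
    rules: List[Rule] = field(default_factory=list)
    history: List[Params] = field(default_factory=list)

    def apply_change(self, **changes: float) -> bool:
        """Commit a change only if every design/package rule still holds."""
        candidate = {**self.params, **changes}
        if not all(rule(candidate) for rule in self.rules):
            return False  # rejected, the model stays unchanged
        self.history.append(dict(self.params))  # simple version tracking
        self.params = candidate
        return True

    def simulate(self, simulators: Dict[str, Callable[[Params], float]]) -> Dict[str, float]:
        """Hand a copy of the same central model to each domain simulator."""
        return {name: sim(dict(self.params)) for name, sim in simulators.items()}


def pitch_rule(p: Params) -> bool:
    """Assumed package rule: TSV pitch is at least twice the TSV diameter."""
    return p["tsv_pitch_um"] >= 2.0 * p["tsv_diameter_um"]


model = CentralModel({"tsv_diameter_um": 10.0, "tsv_pitch_um": 40.0}, [pitch_rule])
print(model.apply_change(tsv_pitch_um=15.0))  # violates the assumed rule
print(model.apply_change(tsv_pitch_um=30.0))  # satisfies the assumed rule
```

In this sketch a rejected change leaves both the parameters and the history untouched, which mirrors the idea that only changes passing all checks are applied.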

4.3 Electrical Challenges

As mentioned in Section 4.1, the main electrical challenges are crosstalk as well as signal and power integrity, which are accompanied by side effects such as rise-time degradation, noise, ground bounce, and insertion losses. Note that additional challenges can occur at high frequencies (e.g. microwave or radio frequencies), although higher frequencies increase the data rate and performance. For example, the dielectric permittivity of the substrate material depends on the frequency, and the resonance behavior of a transmission line depends on the permittivity. We describe the electrical issues with appropriate behavioral models, which depend strongly on the package that is used for the system. These packages can be divided into two main groups, PCB packages and advanced system integration technologies. PCB packages are widely used in practice, and an example is given in Section 4.1.1. This type of packaging also poses a frequency-dependent challenge, namely, the trade-off between cost and material, because more expensive materials with better dielectric properties are needed for higher frequencies. Typical PCB materials are FR-4, BT, GETEK, and ROGERS (see [20]), each of which has its specific electrical behavior. In [21] a comparison between different materials with respect to high frequencies is presented. The first difference between the advanced system integration technologies and PCB packages is the substrate material. Furthermore, these packages have a smaller form factor than PCB solutions, which implies other electrical challenges or may affect the existing ones. For example, if we model an interconnect structure in an interposer package, we can use a transmission line or RLCG model that includes lumped elements for resistance (R), inductance (L), capacitance (C), and conductance (G). This model includes more parameters than the LC model of a transmission line in a PCB package, in which the resistance and conductance are neglected; see [2]. Characterizations of other structures like TSVs and interposers are discussed in [22] and [23], respectively. An example of an interposer package is given in Section 4.1.2.
Furthermore, the advanced system integration technologies need very accurate simulations because their prototyping is much more expensive than that of PCB packages. Additionally, we can use the simulation results to optimize the parameters of the system in the post-processing step shown in Figure 4.7.
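To make the difference between the RLCG and the LC model concrete, the following Python sketch evaluates the standard transmission line expressions for the characteristic impedance, Z0 = sqrt((R + jωL)/(G + jωC)), and the propagation constant, γ = sqrt((R + jωL)(G + jωC)). The function name and all per-unit-length values are assumptions chosen for illustration; they are not taken from the examples in this chapter.

```python
import cmath
import math


def rlcg_line(r_pul, l_pul, c_pul, g_pul, freq_hz):
    """Return (Z0, gamma) of a uniform line from per-unit-length R, L, C, G."""
    omega = 2.0 * math.pi * freq_hz
    series_z = complex(r_pul, omega * l_pul)  # series impedance in ohm/m
    shunt_y = complex(g_pul, omega * c_pul)   # shunt admittance in S/m
    z0 = cmath.sqrt(series_z / shunt_y)
    gamma = cmath.sqrt(series_z * shunt_y)    # real part: attenuation in Np/m
    return z0, gamma


# Hypothetical per-unit-length values.
L_PUL, C_PUL = 250e-9, 100e-12  # H/m, F/m

# LC model: resistance and conductance are neglected.
z0_lc, gamma_lc = rlcg_line(0.0, L_PUL, C_PUL, 0.0, 10e9)
# RLCG model: the same line with assumed losses.
z0_rlcg, gamma_rlcg = rlcg_line(50.0, L_PUL, C_PUL, 1e-3, 10e9)

print("LC:  ", abs(z0_lc), gamma_lc.real)
print("RLCG:", abs(z0_rlcg), gamma_rlcg.real)
```

With R = G = 0 the attenuation (the real part of γ) vanishes, which is the sense in which the LC model cannot represent losses along the line.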

Figure 4.7 Process flow used for prototyping.

[Figure 4.7 boxes: Model; Central data handling; Electrical/thermal/mechanical simulation; Post-processing/optimization/decision-making; Prototyping.]


4.3.1 Modeling

The main challenge of the modeling process is to handle the trade-off between accuracy and the simplifications that can be applied to the model parameters to speed up the simulation. This means that many different models (e.g. lumped or finite element method [FEM] based) exist for one system, depending on the degree of simplification. Note that this degree depends strongly on the considered frequency (e.g. the operating frequency of the system or the maximum frequency of a given range), which determines the physical effects that have to be taken into account. If we choose a PCB package for the system, an equivalent circuit with lumped elements (resistance, inductance, capacitance, and conductance) is usually used to model the system’s behavior. For structures related to the interposer package of the example in Section 4.1.2, this approach is limited to frequencies below approximately 10 GHz; see [22]. Beyond this limit, we need another modeling approach, for which frequency-dependent S-parameters are commonly used. These parameters express a linear relationship between the voltage waves that run into and out of the considered device or system. They are represented by a square matrix whose dimension is the number of input or, equivalently, output voltages. S-parameters can be transformed into many other useful parameters; see [2] or, more generally, [24]. One of these transformations converts S-parameters into T-parameters; this is a reordering of the S-parameters such that they can be cascaded. This means that if we consider an electrical device that consists of two sub-devices, where the output of the first sub-device is the input of the second, then we get an S-parameter model of the whole device by cascading the T-parameters of the sub-devices and transforming the result back into S-parameters; see Figure 4.8. The term cascading refers to a multiplication of the corresponding matrices.
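The cascading step can be sketched for two-port devices as follows. The example assumes one common T-parameter convention, in which the waves at port 1 are expressed through the waves at port 2, [b1, a1]ᵀ = T [a2, b2]ᵀ; other conventions exist and lead to different but equivalent formulas. The function names and the S-parameter values are hypothetical, and the conversion requires S21 ≠ 0.

```python
def s_to_t(s):
    """Two-port S -> T, assuming the convention [b1, a1]^T = T [a2, b2]^T."""
    (s11, s12), (s21, s22) = s
    det = s11 * s22 - s12 * s21
    return ((-det / s21, s11 / s21), (-s22 / s21, 1.0 / s21))


def t_to_s(t):
    """Inverse conversion T -> S for the same convention."""
    (t11, t12), (t21, t22) = t
    det = t11 * t22 - t12 * t21
    return ((t12 / t22, det / t22), (1.0 / t22, -t21 / t22))


def matmul2(a, b):
    """Product of two 2x2 matrices given as nested tuples."""
    return tuple(
        tuple(sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2))
        for i in range(2)
    )


def cascade(s_a, s_b):
    """S-parameters of device A followed by device B (output of A feeds B)."""
    return t_to_s(matmul2(s_to_t(s_a), s_to_t(s_b)))


# Hypothetical S-parameters of two sub-devices at a single frequency.
S_A = ((0.1 + 0.2j, 0.8), (0.8, 0.05j))
S_B = ((0.2, 0.9j), (0.9j, -0.1))
print(cascade(S_A, S_B))
```

For frequency-dependent data the same conversion is simply repeated for every frequency point.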
So, the T-parameter notation allows us to describe the behavior of the whole system in terms of its parts with respect to an appropriate decomposition. Furthermore, simulating the parts instead of the whole system saves time and memory. Note that another way of cascading is to use a circuit simulator by importing the S-parameters and connecting the input and output components in an appropriate manner.

4.3.2 Simulation

Simulations are very useful because they allow us to study the impact of parameter variations on the system behavior. However, these studies are often very time

[Figure 4.8 labels: Input, (Sub-)Device A, (Sub-)Device B, Output; S-parameter Device A → T-parameter Device A; S-parameter Device B → T-parameter Device B; T-parameter Device A × T-parameter Device B = T-parameter Device → S-parameter Device.]

Figure 4.8 Schematic of a conversion between S- and T-parameters for a device that consists of two sub-devices.


and memory consuming, and the consumption of resources increases with the complexity of the system. This means that simulations incur costs in terms of the time they require. So, in practice, the simulation process is sometimes skipped. For example, if the package is cheap, such as PCB packages with FR-4 material, and we want to find an appropriate technology option for a transmission structure, then it is possible to produce several structures with different technology options and characterize them by measurements. The measurements can be used to determine the required parameters without spending a lot of money. If the package gets more expensive, this option becomes less attractive, and the goal of creating as few test structures as possible for the final prototype becomes more important. Simulations help us to reduce the number of required test structures. Furthermore, if the accuracy of the simulation process is high, i.e., the simulation results fit the measurements very well, then only one test structure is needed. Note that we need to consider fabrication errors as well by sweeping several parameters over a given range of interest. The time and memory problem of complex system simulations can be addressed by an appropriate decomposition of the system. This means that we divide the system into smaller, less complex parts, simulate these, and then compose the results in an appropriate manner as mentioned above; see Section 4.3.1. Furthermore, we can reduce the number of model parameters of the system if we abstract it, i.e., use fewer details, to speed up the simulation. For the same reason, multi-physical simulations are often avoided. This implies additional simplifications, e.g. only the boundary conditions of the considered simulation domain (e.g. electrical, mechanical, or thermal) are applied, and the others are neglected. Another simulation challenge is to get the required material data, especially if the data depends on the frequency.
In practice, the vendors usually provide the material data only for a single frequency. The data for the other frequencies remains unknown. This leads to another simplification of the model, in which the material data (e.g. the permittivity) is assumed to be constant over the whole frequency range. Note that in a practical simulation, many simplifications are applied, which of course decreases the accuracy of the simulation. A further problem is that we cannot validate the accuracy of the simulation results immediately, because for the comparison we need accurate measurements of the test structure, which can include errors as well. Further issues may occur in the high-frequency domain, since the accuracy of the measurements strongly depends on the frequency (see [25]), and in some cases it is difficult to measure a sub-device in a complex package solution; see [26].

4.3.3 Optimization

The aforementioned benefit of the simulation is that we are able to investigate the impact of design parameter variations. Here, we use this information to optimize the design parameters of the model, as represented by the post-processing step in Figure 4.7. The aim is to get optimal design parameters. Commonly, a cyclic optimization process is performed, as depicted in Figure 4.9a. This


[Figure 4.9 labels: (a) cycle of modeling, simulation, and evaluation; (b) plot of f(x) (1 to 5) over x (20 to 100) with the curves physical behavior and interpolation.]

Figure 4.9 (a) Optimization cycle and (b) interpolated function.

means that we perform a simulation with modified design parameters. Next, we evaluate the results and replace the current design parameters with the new ones if the results are better than the previous ones with respect to an objective. A different approach, used in an in-house tool, is that we first perform several simulations, namely, a parameter sweep, and then interpolate the simulation results to get a function that is a measure of the intended system behavior with respect to the considered parameter. Figure 4.9b shows such an interpolation schematically. The blue line is the original, unknown physical function with respect to a parameter. The red crosses depict some simulation results for specified values of the considered parameter. The orange dashed line is the result of an interpolation of the red crosses. This approach allows an analytic study of the interpolated function, i.e., derivatives or gradient-based methods, for the optimization of a certain parameter value. Furthermore, we are able to create more complex functions from the interpolated ones by an appropriate composition of the functions (e.g. [weighted] average, Cartesian product). For example, if p(f, w) and q(f, h) are two interpolated functions that depend on the frequency f and the width w or height h, respectively, then we get a function r(f, w, h) = p(f, w) ⊗ q(f, h) that depends on the frequency, the width, and the height, where ⊗ denotes the appropriate composition. This means that if we want to optimize the microstrip line (MSL) used in the example in Section 4.1.1, then we sweep the width and height of this line and interpolate the results. In this case, p(f, w) is the interpolated function of the insertion loss (−20 log10(|S21|) [dB]) of the MSL with width w at frequency f, and q(f, h) is the interpolated function of the insertion loss of the MSL with height h at frequency f.
If we build the average of these two functions, then we get r(f, w, h) = (p(f, w) + q(f, h))/2, which denotes the average insertion loss of the MSL with width w and height h at frequency f. Finally, we can use a gradient-based approach to get the optimal width and height of the MSL, i.e., the width and height for which this line has minimal insertion loss. Note that we can successively create very complex functions with this method and optimize the parameter values by analytic methods with respect to these functions and a given objective. Furthermore, we can use more data in addition to the parameter sweep, such as measurements, to generate more complex and accurate functions. Additionally,


if we have several results for the same parameter value, then we can use the weighted average of these results to get exactly one data set for this parameter value in the interpolation process. Note that we can also analyze the functions to identify trends, i.e., a certain parameter range in which all function values are at least as good as required. This supports the decision-making because we are able to choose a parameter value from this range with respect to safety aspects, for example, because of fluctuations in the manufacturing process. Additionally, we can use functions that were generated for other, related projects and systems for the investigation and decision-making process of the current system. This kind of IP reuse speeds up the decision-making process because fewer simulations are needed. Furthermore, an IP library can be built as a kind of know-how management. This approach holds for the other simulation domains (mechanical and thermal), too.
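The interpolation-based optimization described in this section can be sketched as follows. The example is a simplified illustration and not the in-house tool: the frequency is held fixed, so that p and q depend only on the width and the height, the sweep data are hypothetical, Lagrange polynomials stand in for whatever interpolation is actually used, and the gradient is approximated by central differences.

```python
def lagrange(xs, ys):
    """Return the polynomial that interpolates the sweep points (xs, ys)."""
    def f(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return f


def minimize(func, start, bounds, rate=0.05, steps=200, eps=1e-6):
    """Gradient descent with central differences, clipped to the sweep range."""
    x = list(start)
    for _ in range(steps):
        grad = []
        for k in range(len(x)):
            upper = x[:k] + [x[k] + eps] + x[k + 1:]
            lower = x[:k] + [x[k] - eps] + x[k + 1:]
            grad.append((func(*upper) - func(*lower)) / (2.0 * eps))
        x = [min(max(xk - rate * gk, lo), hi)
             for xk, gk, (lo, hi) in zip(x, grad, bounds)]
    return x


# Hypothetical insertion-loss sweeps (dB) at one fixed frequency.
widths = [0.10, 0.20, 0.30, 0.40]    # mm
loss_w = [0.725, 0.525, 0.525, 0.725]
heights = [0.05, 0.10, 0.15, 0.20]   # mm
loss_h = [0.498, 0.408, 0.418, 0.528]

p = lagrange(widths, loss_w)   # p(w) at the fixed frequency
q = lagrange(heights, loss_h)  # q(h) at the fixed frequency


def r(w, h):
    """Average insertion loss, the composition used in the text."""
    return 0.5 * (p(w) + q(h))


w_opt, h_opt = minimize(r, [0.15, 0.08], [(0.10, 0.40), (0.05, 0.20)])
print(w_opt, h_opt)
```

Clipping the iterate to the sweep range keeps the optimizer inside the region where the interpolation is supported by simulation data.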

4.4 Mechanical Challenges

From the mechanical point of view, any advanced packaging structure must be able to withstand harsh environmental conditions, such as shocks and vibrations. A few examples are as follows: the microelectronic parts inside a car must stay intact when we are driving on a cobblestone street, a cellphone or a tablet must stay undamaged and functional after being dropped, and so on. Not only during their use but also during further fabrication process steps, the multiple sensors, MEMS components, or ICs must stay stable, regardless of the induced vibrations or other environmental conditions. High endurance and reliability of a system can be achieved if the assembly is fabricated in such a way and with such materials that the system is able to absorb the induced stresses and loads. Moreover, depending on the materials and material combinations used, the engineer must always be careful not to exceed the materials’ nominal values for yield stress, modulus of rupture, or ultimate strength during bending, torsion, tension, or compression. These values are characteristic for each material and can be found in many books and online resources. However, these material properties often have to be extracted from experimental measurements [3]. In the mechanical domain, simulation can be a very useful method that may save a lot of time and prototyping effort, since we are able to model very complex structures, apply different materials and boundary conditions (predefined displacements, axial and torsional loads, moments, constraints), and obtain graphs and numerical results concerning displacements, stresses, or deformations. On the one hand, the example with the stent is a case where the mechanical issues play a much more important role than the thermal ones; due to the nature and function of a stent design, it is crucial to observe its mechanical behavior under the different tensile/compressive, bending, or torsional loads applied to it.
As mentioned before, the stent is uniformly compressed before it is inserted into the human vasculature. That leads to high compression and torsion exerted on it [13]. On the other hand, the example with the silicon interposer does not include any external loads applied to the system; all stresses, strains, and deformations that occur result from the self-heating of the system.


Thus, several mechanical simulations have been conducted only for the first example. They include two subsets of simulations: one only for the strips where the microelectronic components are placed and another for the whole assembly (strips and mesh tube). The applied loads and stresses cause tension, compression, or bending of the strips and subsequently of the microelectronic features. The uniform compression of the walls of the martensitic mesh tube causes extra compression of the microelectronics via the four connection rings found at the two ends of each strip. Additionally, the stent is not only uniformly compressed but also bent inside the body, taking the shape of the vein or artery where it is placed. Several numerical and graphical results for the induced stresses, deformations, and displacements have been extracted with the simulation tool COMSOL Multiphysics. Figure 4.10 shows three of these simulation results; forces and bending moments are applied to the strips (0.5 N on each side of each strip for both force and bending) and to the martensitic skeletal tube that surrounds the electronics. Forces in the human body are normally in the range of several millinewtons. For worst-case studies and because of the uniform compression of the stent during insertion into the human body, we apply these higher forces in the shown simulation cases. The proper boundary conditions concerning the constrained points and edges have also been applied. For example, if the center of the strips (ring area) is constrained and bending moments are applied to their edges in the y-direction, the left and right parts of each strip will bend (depicted in red and yellow in Figure 4.10a). The maximum deflection (meaning the distance from the initial position) can be found at the left and right extreme points of the strips. The von Mises stress is shown in Figure 4.10b.
Additionally, if we apply a load of 1 N in the z-direction to the right side of the stent tube (shown in Figure 4.10c) while its left side is mechanically constrained (zero displacement), the tube will bend and deflect by several millimeters. The red part of the figure depicts the area of maximum deflection in the z-direction.
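The von Mises stress reported by such simulations is a scalar that can be compared with the yield stress mentioned at the beginning of this section. As a hedged illustration, the following Python sketch evaluates the standard von Mises formula from three principal stresses and checks the result against a yield limit. The function names, stress values, yield stress, and safety factor are hypothetical and are not results of the stent simulations.

```python
import math


def von_mises(s1, s2, s3):
    """Equivalent (von Mises) stress from the three principal stresses."""
    return math.sqrt(0.5 * ((s1 - s2) ** 2 + (s2 - s3) ** 2 + (s3 - s1) ** 2))


def within_yield(principal, yield_stress, safety_factor=1.5):
    """True if the equivalent stress times the safety factor stays below yield."""
    return von_mises(*principal) * safety_factor <= yield_stress


# Hypothetical principal stresses and yield stress in MPa.
stresses_mpa = (120.0, 40.0, -10.0)
print(von_mises(*stresses_mpa))
print(within_yield(stresses_mpa, yield_stress=200.0))
```

A purely hydrostatic state (three equal principal stresses) gives an equivalent stress of zero, whereas a uniaxial state returns the applied stress itself.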

4.5 Thermal Challenges

The thermal behavior of an advanced package is another important aspect that engineers must take into account when they design and fabricate such a system. Components like ASICs and FPGAs may produce a thermal power of several watts and reach temperatures of up to 150 or 200 °C. The structure must show high reliability under such thermal conditions. Elevated temperatures cause parasitic effects, which subsequently affect the functionality, robustness, reliability, and life span of the electronic parts and can lead to a failure of the whole system. For example, cell phones that stop working when their battery gets very hot, or tablets that stop running on a hot summer day because their circuits cannot withstand a temperature increase above 100 °C, are definitely undesirable for customers. Several issues and questions concerning the thermal behavior of the system may arise: what is the acceptable power dissipation for electronics, and what


[Figure 4.10 color scales: total deformation (mm) and von Mises stresses (N m−2); coordinate axes x, y, z; panels (a), (b), (c).]

Figure 4.10 Deformations and stresses that occur in the parts of the stent when loads are applied in the y- and (−z)-directions for the top pictures and in the z-direction for the bottom picture. (a) Deformation of one half of the microelectronic strip. (b) Distribution of the von Mises stresses on one half of the strip. (c) Deformation of the mesh tube if a load is applied to its right part.

is the resulting temperature increase? Moreover, since some elements (e.g. certain types of sensors) are more temperature sensitive than other components, where can they be placed, and how can they be arranged on a substrate (e.g. PCB) with regard to their sensitivity? Which elements can withstand high temperatures and can be placed next to the FPGA, and which cannot? Are the featured components (e.g. diodes) still functional at 200 °C if they are placed very close to the FPGA? Note that if the distance between the FPGA and the diodes is long, the diodes are subjected to lower temperatures but require longer interconnection paths. Moreover, in which cases do we need an interposer with thermal TSVs for better heat distribution? See [5, 6] for more information about thermal TSVs. In applications with high power generation (e.g. our example with the interposer), several cooling components, such as heat sinks and fans, can be applied. They should be large enough to provide adequate heat removal and at the same


time small and compact enough to meet the form factor requirements. In the past years various configurations and concepts of heat sinks have been developed, in which the number, shape, and orientation of the pins or fins can vary a lot; see [27]. A TIM is a thin thermal layer (usually epoxy) of high thermal conductivity that is placed between the die and the heat sink (with or without an integrated heat spreader, IHS). It plays an important role in removing heat from the chip package and in reducing the thermal resistance between the die and the external cooling component (e.g. heat sink, fans); see [16, 28]. Determining the thickness of the TIM layer is an additional challenge for engineers, since we want to optimize the thermal path and the thermal resistance from the heat source through the different components to the environment (usually air). In the thermal simulations, the user can model the inner elements of the IC die and calculate how much power in watts is produced and where the hotspots inside the die are. For the stent example (see Section 4.1.1), the produced heat flows inhomogeneously through the different materials, layers, and elements, and an accurately simulated model with the proper boundary conditions (temperatures, applied heat flux) can show the temperature and heat distribution through the assembly, as shown in Figure 4.11. A critical aspect here is that the engineer must always keep in mind the operating temperatures Tave and Tmax of the specific IC used (FPGA, ASIC, etc.); see [5, 6, 28]. We apply a power of 0.2 W to the ASIC. The regions of the model with high temperature (Tmax = 56 °C) are colored in white, and those with lower temperature (about 37–39 °C) in red. Usually, an ASIC dissipates between several milliwatts and 8–9 W, and this heat is subsequently distributed through the copper tracks of the PCB. For the simulation we apply just 0.2 W and get a maximal temperature of 56 °C.
This is a critical aspect from the biomedical point of view, because a small increase of the blood temperature, even by 0.5 °C, may severely affect the function and life span of the red or white blood cells. Thus, this is an additional challenge for the thermal design of the stent:

Figure 4.11 Heat dissipation produced by the ASIC that is spread through the copper tracks of the PCB.


distribute the heat sufficiently, in such a manner that the microelectronics work and the human body is not harmed. Of course, the stent skeleton and the electronics are covered with biocompatible hermetic materials to prevent any fluid leakage and damage to the electronics. However, the maximal temperature of 56 °C implies that, without a heat sink, we can only use an ASIC or other electronic components that consume at most a few hundred millivolts and milliamperes. Further, we have forced convection, since the blood acts as an external cooling element for the system. Additionally, while the blood is flowing through the stent, there is only a very small increase of the temperature of the system, and no external heat sink or other cooling element is required. In this case, the temperature increase can be neglected, as we asserted above. Furthermore, the graphical and numerical results of the simulations show the amount of heat produced, its distribution, and the maximum temperature of the system. All this information helps to design a robust and reliable product. In the second example (see Section 4.1.2), where the FPGA and memory are placed side by side on the interposer (see Figure 4.4), the heat produced by the dies is distributed through the volume of the assembly, heating the MLCC capacitors, solder balls, interposer, TSVs, PCB, etc. For the simulation, the heat is generated by several heat sources distributed as blocks over the two dies, shown as light blue boxes inside the ellipse in Figure 4.12. Figure 4.12 also shows the heat distribution of the two dies over the interposer and the PCB for 1.5 W per die. The power of each of the two dies is varied from 0.25 to 2.5 W, and the ambient temperature is set to 20 °C. For the thermal assessment we are interested in the maximal temperatures of the die Tdie, the interposer Tinter, and the PCB TPCB. Figure 4.13 shows that they increase linearly with the applied power.
The maximal temperatures Tdie , Tinter , and TPCB , in Figure 4.13, are colored in blue, red, and green where the samples are marked as diamond, quarter, and triangle, respectively.
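The linear trend described above can be condensed into a single number, an effective thermal resistance, which is the slope of the maximal temperature over the applied power. The following is a minimal sketch, assuming the model T_max = T_amb + R_th · P. The sample points are synthetic placeholders and are not read from Figure 4.13; the function name `fit_line` and the slope of 22 K/W are hypothetical.

```python
def fit_line(powers_w, temps_c):
    """Ordinary least-squares fit of T = intercept + slope * P.

    Returns (slope, intercept); the slope plays the role of an
    effective thermal resistance in K/W.
    """
    n = len(powers_w)
    mean_p = sum(powers_w) / n
    mean_t = sum(temps_c) / n
    sxx = sum((p - mean_p) ** 2 for p in powers_w)
    sxy = sum((p - mean_p) * (t - mean_t) for p, t in zip(powers_w, temps_c))
    slope = sxy / sxx
    intercept = mean_t - slope * mean_p
    return slope, intercept


# Synthetic, exactly linear samples (hypothetical values, not measured data):
# ambient 20 degC as in the text, assumed slope of 22 K/W.
powers = [0.5, 1.0, 2.0, 3.0, 4.0, 5.0]
temps = [20.0 + 22.0 * p for p in powers]

r_th, t_amb = fit_line(powers, temps)
print(f"effective thermal resistance: {r_th:.2f} K/W")
print(f"extrapolated ambient temperature: {t_amb:.2f} degC")
```

With real simulation samples, the intercept should land close to the ambient temperature of 20 °C; a large deviation would indicate that the response is not linear over the chosen power range.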

Figure 4.12 Heat source setup and heat distribution of two dies placed on an interposer that is placed on PCB.


Figure 4.13 Maximal temperature of the dies (blue), interposer (red), and PCB (green). [Plot: maximal temperature in °C (axis 0–140) versus power in W (axis 1–5) for Tdie, Tinter, and TPCB.]

Comparing the two examples, as mentioned before, the produced heat is much more critical in the second one than in the first.

4.6 Thermomechanical Challenges

Another important aspect that combines the topics covered before and should be taken into account is the thermomechanical behavior of an advanced microelectronic assembly. Mechanical damage may occur because the assembly is unable to absorb the mechanical stresses that arise during rapid and excessive temperature changes (e.g. for engine electronics inside a car), especially within a very short time [4, 17, 18, 29, 30]. As mentioned before, due to the different CTEs of the materials involved, the different deformations and displacements may lead to high stresses and subsequently to failures and cracks. On the one hand, the CTE of silicon, for example, is 2 ppm K−1, and the CTE of copper is 17 ppm K−1. In that case, the copper tracks of a PCB, the top and bottom copper layers of the interposer, or the copper TSVs inside its bulk show a greater deformation and expansion than the silicon substrate. On the other hand, the attached ICs and the interposer substrate show more or less the same deformation, since they are all made of Si [4, 17, 18, 30].

Moreover, internal material defects, such as discontinuities, holes, or gaps between the surfaces of the bonded materials, can cause high local stress concentrations and may lead to further cracks after a few thermal test cycles. Where a mold compound is present, its low thermal conductivity offers very poor heat dissipation. Therefore, a suitable material for the mold capsule has to be chosen. Additionally, fabrication errors, like bubbles or poor contact between metal and mold compound, which can occur during the cooling of the melted mold that is injected into the package, decrease the thermal conductivity, too.

Another important part of the assembly is the solder balls connecting the different levels of the system (e.g. micro-bumps for FPGA/memory to interposer connection or C4 bumps for interposer to PCB connection), which may also crack during the fabrication or operation process. Different kinds of cracks (surface cracks, body cracks) may

be introduced at any process step, caused by rapid temperature changes during the soldering and cleaning processes. The potential size and arrangement of the solder balls is another important topic, since structures may fail at solder joints and balls because of thermal fatigue after long-term thermal cycling and operation. The stresses are also affected by the different thermal conductivities of the materials and the rate of change of the temperature [4, 17, 18, 29–31].

The von Mises yield criterion is used to evaluate the stability of a material, where the goal is to use materials that minimize the possibility of cracks, permanent deformations, or defects in the structure. It determines the stress level that a ductile and isotropic material (e.g. a metal or a semiconductor) can withstand without yielding when it is subjected to complex tensions or compressions. It is described by a mathematical formula that combines the elements of the stress tensor, i.e., the principal normal and shear stresses of the material. It is also called the maximum distortion energy hypothesis and is based on the distortion energy of a material; when this energy reaches a critical value, the material state becomes critical and the material may fail. For a plane stress state, the von Mises stress is equal to $\sqrt{s_1^2 + s_2^2 - s_1 \cdot s_2}$, where $s_1$ and $s_2$ are the principal stresses, i.e., the maximum and minimum normal stress values. According to the criterion, the von Mises stress calculated for the material should be below its yield stress, which is a material characteristic that can be ascertained with mechanical tests. However, in many cases, engineers apply a stricter form of this criterion for extra safety: the von Mises stress should not exceed 1/3 of the yield stress, instead of its full value. Additional mechanical properties like the modulus of rupture or the compressive strength are used for reliable simulations [4, 30, 32, 33].
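The criterion above can be illustrated with a short calculation. This is a minimal sketch, assuming a plane stress state described by two principal stresses; the function names, the stress values, and the yield stress of 300 MPa are hypothetical and chosen only to show the difference between the full criterion and the stricter 1/3 rule.

```python
import math


def von_mises_plane(s1, s2):
    """Von Mises equivalent stress for a plane stress state.

    s1 and s2 are the two principal stresses, in any consistent unit.
    """
    return math.sqrt(s1 ** 2 + s2 ** 2 - s1 * s2)


def passes_criterion(s1, s2, yield_stress, fraction=1.0):
    """True if the von Mises stress stays below fraction * yield_stress.

    fraction=1.0 is the plain criterion; fraction=1/3 is the
    conservative variant mentioned in the text.
    """
    return von_mises_plane(s1, s2) < fraction * yield_stress


# Hypothetical principal stresses and yield stress, all in MPa.
s1, s2, sigma_y = 120.0, 40.0, 300.0

sigma_vm = von_mises_plane(s1, s2)
print(f"von Mises stress: {sigma_vm:.2f} MPa")
print("full criterion satisfied:", passes_criterion(s1, s2, sigma_y))
print("1/3 criterion satisfied: ", passes_criterion(s1, s2, sigma_y, 1 / 3))
```

For these example values the design passes the plain criterion but fails the 1/3 rule, which is exactly the kind of case where the stricter condition changes a design decision.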
In the second example (see Section 4.1.2), where the FPGA and the memory are placed on the interposer, the heat produced by the FPGA is distributed through the whole volume of the assembly; see Figure 4.12. The different elements (MLCC capacitors, solder balls, interposer, TSVs, PCB) heat up and expand. Since they have different CTEs, higher for metals and lower for epoxy and mold, the materials show different expansions and displacements, and this may lead to cracks and defects after a few thermal cycles [3, 4].

Concerning the possible delaminations or cracks that may be caused by the high temperatures produced within the FPGA, possible areas of failure are the various solder bumps. Due to the different displacements on their top and bottom connecting surfaces caused by the different CTEs, the bumps (micro-bumps, C4 bumps, BGA bumps) may rupture because of high shear stresses during the placement or operating process. Additionally, the TSVs of the interposer are another area where cracks may appear, due to deformations and displacements caused by the high CTE mismatch between silicon and copper. The defects will appear in the silicon of the interposer because it is more brittle than copper. Furthermore, the metal parts on the top and bottom ends of the TSVs may also fail. The MLCC capacitors are very sensitive elements of the design, and their possible defects and failures are described separately [3, 4]. However, the FPGA and the interposer show more or less the same expansion, since they are both made of silicon. In contrast, a problem can occur on the metal

Figure 4.14 Possible cracks of the structure are depicted by arrows and pink ellipses.

surfaces of the interposer because metals tend to expand more than semiconductors under high temperatures. The mold, due to its low thermal conductivity and poor heat dissipation ability, shrinks during the cooling process and may cause ruptures during the fabrication process of the package, or even later during operation, if it is not able to dissipate the produced heat sufficiently. The majority of this heat during operation is directed through the thermal path that includes the micro-bumps, interposer, C4 bumps, PCB, and BGA balls, since the thermal resistance of this path is much lower than that of the path through the mold. This heat flow warms up these components even more and affects their thermal or thermomechanical behavior. The MLCC capacitors are also affected by the increased temperature. Figure 4.14 shows the areas that have a high potential for cracks and ruptures, denoted by arrows and pink ellipses. Additionally, the TSVs may crack due to the high CTE mismatch between silicon and copper.

A beneficial approach is to add a heat sink on top of the FPGA. In that case, the majority of the heat now flows in the opposite direction, through the aluminum base and pins of the heat sink and not through the micro-bumps–interposer–C4 bumps–PCB–BGA balls chain, since the excellent thermal conductivity of the aluminum offers a better thermal path. Figure 4.15 shows the different thermal paths of the produced heat for each of the two configurations; in each path, the heat moves from an area with higher temperature (hotspot) to an area with lower temperature (ambient, usually air) and passes through all intermediate elements and features, heating them up. The top part of Figure 4.15 depicts a heat sink that is placed on top of the FPGA (its red dots represent the points of concentrated produced heat, called hotspots), and the majority of the heat is dissipated through the base and the pins of the heat sink made of aluminum. Less heat is directed in the other direction, shown by the yellow arrows, causing considerably smaller deformations of the other components. The bottom part of Figure 4.15 depicts a mold compound encapsulating the assembly where the majority of the heat is

Figure 4.15 Two different thermal configurations and paths of produced heat.

directed through the interposer, various balls, and PCB. This is due to the fact that the mold exhibits low thermal conductivity; it dissipates heat very poorly. This may cause high thermal stresses and possible cracks after a few thermal cycles. The red arrows represent the direction of the heat dissipation.

Other approaches for better heat dissipation include localized thermal management solutions, such as thermal copper pillar (TCP) bumps. They are made of thin film thermoelectric material, and they are embedded into flip-chip interconnects to provide integrated cooling of flip-chip components and packages. The thermoelectric layer is used to transform the standard PCB into an active thermoelectric configuration [4, 34–36]. The utilization of thermal TSVs (filled with copper) is another approach for better thermal management of the assembly. They transfer the produced heat vertically through the Si interposer, since the thermal conductivity of copper is higher than that of silicon. However, the CTE mismatch of Cu and Si can cause severe problems in a 3D configuration due to large stresses. The number, the position, and the arrangement of the thermal TSVs are another degree of freedom for the thermal management of advanced packages [15, 37, 38].

As mentioned above, multilayer ceramic chip capacitors are part of the assembly of the second example; see Section 4.1.2. An MLCC capacitor consists of nickel (Ni) electrodes arranged horizontally inside a doped barium titanate (BaTiO3) ceramic body used as dielectric material. The terminal electrodes

Figure 4.16 Schematic (a) and a crack in the ceramic body (see [40] with print permissions from TDK) (b) of an MLCC capacitor.

include copper, nickel, and solder layers. In some configurations, an extra conductive plastic epoxy layer is used in order to absorb and reduce the induced stresses developed in the electrodes and the body of the capacitor. These epoxies are primarily used in applications with very high mechanical and thermal cracking risks that may lead to a failure of the structure. The intermediate epoxy resin layer contains a conductive silver (Ag) filler and improves the mechanical stability of the design. Another configuration for further reduction of the mechanical stresses is a silicone rubber capsule with a thickness of a few hundred micrometers that surrounds the structure and absorbs a part of the thermally induced bending stresses [18, 29, 31, 39].

Figure 4.16 depicts a schematic of an MLCC capacitor (a) and a crack in the ceramic body of a specific model of an MLCC developed by TDK [18] (b). According to the schematic, the capacitor consists of several nickel electrodes (black), a ceramic barium titanate body (yellow), and two terminal electrodes with tin (maroon), solder (blue), conductive epoxy (green), and copper layers (orange). Concerning the thermomechanical behavior of the capacitor, it is important to know the mechanical properties of nickel and BaTiO3, especially those that refer to their yield stress, rupture, and failure limits. The yield stress/ultimate strength for Ni is 140–350/140–195 MPa. For BaTiO3 we have a modulus of rupture of about 76.5–94.5 MPa and a compressive strength of 411–561 MPa [30, 32]. That means that there is a higher risk of a crack in the ceramic body, as shown on the right of Figure 4.16, or at the ceramic–Ni interface than inside the nickel electrodes.

For the purpose of our simulations, four Ni electrodes of 1 μm thickness have been designed. Different temperatures are applied to the capacitor, and various results have been extracted from the simulations, including the von Mises stress distributions for Ni and BaTiO3. Figure 4.17 shows, on the left, a simplified model of a capacitor used during the simulations performed with COMSOL Multiphysics. On the right, the distribution of the von Mises stresses through the bulk of the capacitor is shown. The thermomechanical boundary conditions applied include a temperature increase produced by the FPGA, and the bottom surfaces of the two terminal electrodes (which are in direct contact with the PCB) are set as fixed constraints.
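The strength ranges quoted above for Ni and BaTiO3 can be used for a quick first screening of a simulated stress level. The following is a minimal sketch under simplifying assumptions: a single peak stress value is compared against each quoted range, although comparing a von Mises stress with the modulus of rupture of a brittle ceramic is only a rough indicator. The function name `screen` and the peak stress of 100 MPa are hypothetical.

```python
# Strength ranges in MPa as quoted in the text (lower bound, upper bound).
LIMITS_MPA = {
    "Ni yield stress": (140.0, 350.0),
    "BaTiO3 modulus of rupture": (76.5, 94.5),
    "BaTiO3 compressive strength": (411.0, 561.0),
}


def screen(stress_mpa):
    """Classify a stress level against each quoted range.

    Returns a dict mapping the limit name to 'below', 'within',
    or 'above' the quoted range.
    """
    result = {}
    for name, (low, high) in LIMITS_MPA.items():
        if stress_mpa < low:
            result[name] = "below"
        elif stress_mpa <= high:
            result[name] = "within"
        else:
            result[name] = "above"
    return result


peak_stress = 100.0  # hypothetical simulated peak stress in MPa
for name, status in screen(peak_stress).items():
    print(f"{name}: stress is {status} the quoted range")
```

For the example value, the stress is still below the Ni yield range but already above the modulus of rupture of the ceramic, which mirrors the statement that the ceramic body is the more likely place for a crack.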

Figure 4.17 Simulation model of an MLCC capacitor (a) and distribution of the von Mises stresses through its volume (b). [Panel (b) plot title: Surface: von Mises stress (N/m²).]

Acknowledgments

This work has been funded within the projects V3DIM, ESiMED, and NEEDS under the labels 01M3191D, 16M3201B, and 01M3090 by the German Ministry for Education and Research (BMBF = Bundesministerium für Bildung und Forschung) as well as within the CATRENE project 3DIM3. The authors are solely responsible for the content of this publication.

References

1 Martens, L. (1998). High-Frequency Characterization of Electronic Packaging, vol. 1. Springer Science & Business Media.
2 Hall, S.H., Hall, G.W., and McCall, J.A. (2000). High-Speed Digital System Design: A Handbook of Interconnect Theory and Design Practices. Wiley-IEEE Press.
3 Tu, K.N. (2011). Reliability challenges in 3D IC packaging technology. Microelectronics Reliability 51 (3): 517–523.
4 Banijamali, B., Ramalingam, S., Nagarajan, K., and Chaware, R. (2011). Advanced reliability study of TSV interposers and interconnections for the 28nm technology FPGA. In: 2011 Electronic Components and Technology Conference. IEEE.
5 Szekely, V., Poppe, A., Rencz, M. et al. (2000). Therman: a thermal simulation tool for IC chips, microstructures and PW boards. Microelectronics Reliability 40 (3): 517–524.
6 Leduc, P., de Crecy, F., Fayolle, M. et al. (2007). Challenges for 3D IC integration: bonding quality and thermal management. In: International Interconnect Technology Conference, IEEE. IEEE.
7 Lienig, J. (2013). Electromigration and its impact on physical design in future technologies. In: Proceedings of the ACM 2013 International Symposium on Physical Design (ISPD'13), 33–40.
8 Sohrmann, C., Heinig, A., Dittrich, M. et al. (2013). Electro-thermal co-design of chip-package-board systems. In: THERMINIC 2013 - 19th International Workshop on Thermal Investigations of ICs and Systems, Proceedings, 39–45.
9 Jancke, R., Wilde, A., Martin, R. et al. (2010). Simulation of electro-thermal interaction. In: Electronics System Integration Technology Conference, ESTC 2010 - Proceedings.
10 Schneider, P., Reitz, S., Stolle, J. et al. (2009). Design support for 3D system integration by multi physics simulation. In: Material and Technologies for 3-D Integration, vol. 112, 235–246.
11 Schneider, P., Heinig, A., Fischbach, R. et al. (2010). Integration of multi physics modeling of 3D stacks into modern 3D data structures. In: 3DIC IEEE 3D System Integration Conference 2010, Proceedings.
12 NIH (2013). National Institutes of Health. http://www.nhlbi.nih.gov/health/health-topics/topics/stents (accessed 22 August 2018).
13 Duerig, T.W. and Wholey, M. (2000). An Overview of Superelastic Stent Design. Min. Invas. Ther. Allied Technol.
14 Bayer, C., Reitz, S., Stolle, J. et al. (2012). Enabling system level simulation of 3D integrated systems. ECCOMAS.
15 Heinig, A., Fischbach, R., and Dittrich, M. (2014). Thermal analysis and optimization of 2.5D and 3D integrated systems with wide I/O memory. 14th Intersociety Conference on Thermal and Thermomechanical Phenomena in Electronic Systems (ITherm), 86–91.
16 Tan, S.C., Gutmann, R.J., and Reif, L.R. (2008). Wafer Level 3-D ICs Process Technology. Springer.
17 Bellenger, S., Omnès, L., and Tenailleau, J. (2014). Silicon Interposers with Integrated Passive Devices: Ultra-Miniaturized Solution Using 2.5D Packaging Platform. Ipdia.
18 TDK (2010). Multilayer Ceramic Chip Capacitor. Technical Journal.
19 Fischbach, R., Heinig, A., and Schneider, P. (2014). Design rule check and layout versus schematic for 3D integration and advanced packaging. In: 2014 IEEE International 3D Systems Integration Conference (3DIC).
20 Ritchey, L.W. and Edge, S. (1999). A survey and tutorial of dielectric materials used in the manufacture of printed circuit boards. Circuitree Magazine (November).
21 Coonrod, J. (2012). Selecting PCB Materials for High-Frequency Applications. Microwave Engineering Europe.
22 Wojnowski, M., Pressel, K., Beer, G. et al. (2014). Vertical interconnections using through encapsulant via (TEV) and through silicon via (TSV) for high-frequency system-in-package integration. In: 2014 IEEE 16th Electronics Packaging Technology Conference (EPTC), 122–127.
23 Dittrich, M., Steinhardt, A., Heinig, A. et al. (2015). Characteristics and process stability of complete electrical interconnection structures for a low cost interposer technology. In: IEEE Electronic Components and Technology Conference (ECTC).
24 Frickey, D.A. (1994). Conversions between S, Z, Y, H, ABCD, and T parameters which are valid for complex source and load impedances. IEEE Transactions on Microwave Theory and Techniques 42 (2): 205–211.
25 Wartenberg, S.A. (2002). RF Measurements of Die and Packages. Artech House.
26 Johannsen, U., Smolders, A.B., and Reniers, A.C.F. (2010). Measurement package for mm-wave antennas-on-chip. In: ICECom, 2010 Conference Proceedings, 1–4.
27 Dagan, B. (2010). Pin Fin Heat Sinks: A Sharper Way to Keep Medical Electronics Cool. Cool Innovations.
28 Lu, D. and Wong, C.P. (2009). Materials for Advanced Packaging, vol. 181. Springer.
29 TDK (2013). Multilayer Ceramic Chip Capacitor. Technical Journal.
30 Grether, M.F., Coffeen, W.W., Kenner, G.H., and Park, J.B. (1980). The mechanical stability of barium titanate (ceramic) implants in vitro. Biomaterials Medical Devices and Artificial Organs 8 (3): 265–272.
31 Maxwell, J. (1988). Cracks: the hidden defect. In: Electronics Components Conference, IEEE, 376–384.
32 Howatson, A.M., Lund, P.G., and Todd, J.D. (1972). Engineering Tables and Data, vol. 1, 41. Springer Netherlands.
33 Gross, D., Hauger, W., Schröder, J. et al. (2011). Engineering Mechanics 2: Mechanics of Materials. Springer.
34 Lee, S. (2008). Enabling Cooling Strategies for 3D Packages. Advanced Packaging.
35 Magil, P.A. (2008). The Thermal CPB: An Approach to Thermal and Power Management. Advanced Packaging.
36 Magil, P.A. (2009). Localized Cooling for Data Centers. Advanced Packaging.
37 Karnik, T. (2011). IFTLE 41 SRC focus center 3D update. Semiconductor Manufacturing and Design.
38 Chien, J., Yu, H., Tsai, N. et al. (2012). Hybrid Thermal Solution for 3D-ICS: Using Thermal TSVS with Placement Algorithm for Stress Relieving Structures. IEEEXplore.
39 TDK (2013). TDK's Guide to Multilayer Ceramic Capacitors for Use with Conductive Epoxies. Technical Journal.
40 TDK (2016). Capacitor element does not develop cracks, but the terminal electrodes peel off owing to the fail-safe function, even under excessive stress. https://product.tdk.com/info/en/products/capacitor/ceramic/mlcc/technote/solution/mlcc02/index.html (accessed 22 August 2018).

5

Physical Design Flow for 3D/CoWoS® Stacked ICs

Yu-Shiang Lin, Sandeep K. Goel, Jonathan Yuan, Tom Chen, and Frank Lee

Taiwan Semiconductor Manufacturing Company Ltd., Design Technology Platform, 2585 Junction Ave, San Jose, CA 95134, USA

Handbook of 3D Integration: Design, Test, and Thermal Management, First Edition. Edited by Paul D. Franzon, Erik Jan Marinissen, and Muhannad S. Bakir. © 2019 Wiley-VCH Verlag GmbH & Co. KGaA. Published 2019 by Wiley-VCH Verlag GmbH & Co. KGaA.

5.1 Introduction

Traditional 2D-ICs have been riding on the scaling of Moore's law, becoming more powerful within the same footprint over the last few decades. However, it is becoming harder and harder to maintain this trend of doubling the number of transistors every 18 months. It is not only becoming more challenging to manufacture transistors at smaller dimensions but also getting harder to sustain the overall system performance improvement on all fronts. For example, the memory bandwidth needs to keep up with the computing power of the processor; otherwise the system performance will stall. By 2015, the data bandwidth of DRAM was already beyond 100 GB s−1 [1]. At the same time, interconnects become the bottleneck for both performance and power because further scaling of the back-end-of-line (BEOL) is more limited than that of the front-end-of-line (FEOL) devices. Compared with intra-die routing, I/Os that go off-chip suffer a performance/power penalty because of the package parasitics as well as the electrostatic-discharge (ESD) protection that is required.

3D-IC was proposed to alleviate the problem by stacking dies in close proximity. In this way, the driver strength, which is proportional to the power consumption, can be reduced while achieving the same data rate. 3D-ICs also make use of inter-die connections with smaller dimensions to relieve the limitation on I/O density. Another advantage that comes with the 3D-IC is a smaller form factor. By stacking dies of the same size, the area density is multiplied by the number of dies. After thinning of the silicon and making use of the through-silicon via (TSV) technology, the height remains comparable with that of its 2D counterpart. This feature is highly valuable for wearable devices, where the overall system volume is constrained.

On the other hand, the small form factor also brings challenges to the 3D-IC design. Increasing the density means that more heat is dissipated by the transistors per unit area and that the path to bring the heat out to the heat sink is also more thermally resistive. The same logic applies to the power delivery network, which brings in current from the external power source. Since the lead-free bumps are


becoming more commonly adopted, the allowable current density is limited by the bumps themselves. Therefore, system-level power planning and a new design methodology need to be adopted to fulfill the demands of 3D-IC designs.

The ability to build heterogeneous systems is one of the most important features of the 3D-IC. System on chip (SoC), DRAM, nonvolatile memories, and power management integrated circuits (PMICs) are common building blocks that are nowadays packed together using advanced packaging techniques. An advanced node that is used to manufacture the SoC is typically not suitable for nonvolatile memory or high voltage devices and vice versa. As the gate length keeps shrinking, the technology that is preferred for the logic gates may not be optimized for analog designs. Therefore, the ability to stack heterogeneous dies and combine their functionalities in the same chip offers great system design flexibility. It also allows partial redesign, which saves design effort and shortens the time to market for new products.

The trade-off between the yield and the testing cost is another selling point for the 3D-IC design. Individual die yield increases when a large die is partitioned into smaller ones. The dies then go through known-good-die (KGD) screening, and the functional ones can be assembled in a chip stack. While the testing time and effort are greater for the 3D-IC design, the opportunity to obtain high-yield dies is attractive compared with traditional 2D designs when scaling makes manufacturing even more challenging.

5.2 CoWoS® vs. 3D Design Paradigm

Chip on Wafer on Substrate (CoWoS) is an interposer-based solution to bridge the gap between the traditional 2D design and the true 3D design. It offers a massive interconnect network between the dies and the substrate. It is often referred to as 2.5D design since the interposer does not contain any active device and is BEOL only. The interposer is based on silicon technology with TSVs to join the bumps at the frontside to those on the backside. Figure 5.1 illustrates the silicon interposer and the true 3D die stacking system. For a true 3D-IC, the dies can be stacked in a face-to-face, face-to-back, or back-to-back manner, where the face side refers to the side that contains active devices. For face-to-face connection, micro-bumps are used for joining the two dies. In the case when the backside is used for connection, TSV has to

Figure 5.1 (a) Silicon interposer and (b) 3D-IC die stacking.

Figure 5.2 Memory stacking technologies: die stack (1998), POP (2004), and WideIO (2013).

be used to create a conducting path through the silicon substrate. Due to the physical manufacturing constraints, the dies have to be thinned down when TSVs are in use. The C4 bumps are used to connect the last die of the 3D-IC stack to the substrate, which is the same as in the traditional flip-chip process. The interposer also utilizes the micro-bumps to join the silicon dies at the frontside where the BEOL exists. For the CoWoS technology, there are up to three copper routing layers with submicron dimensions to connect between the dies. TSVs are used to connect the frontside RDL to the backside C4 bumps. In comparison, the interposer does not have the same form factor advantage as the 3D-IC design. On the other hand, the routing density of the interposer is significantly better than that of the substrate, such that the memory bandwidth is competitive with that of the 3D-IC.

Figure 5.2 shows the technologies of memory stacking, starting with the wire-bond design and gradually evolving to 3D-IC stacking with TSVs. Stacking memory chips and using wire bonds for I/O is a low-cost solution for high bit cell density, albeit with limited bandwidth. Recently, standards such as WideIO, hybrid memory cube (HMC), and high bandwidth memory (HBM) have been proposed to target both high cell density and high memory bandwidth applications. A four-high stack of HBM dies can provide eight 128-bit channels, far exceeding the bus width of the 64-bit DDR4 interface. The silicon interposer not only provides submicron interconnects but also brings the dies within tens of microns of each other, allowing a total wirelength within a few mm. This shortened distance allows the design to utilize unterminated I/O drivers and receivers to save power. Although the silicon interposer does not have the form factor advantage of the 3D system, this conversely means that the power delivery and thermal dissipation challenges are greatly relieved. Since there is no active device on the interposer, the placement of the TSVs is not constrained by large IP blocks or caches, so a low-resistance path can be provided from the C4 bumps directly to the dies. In addition, metal–insulator–metal (MIM) capacitors can be used to decouple high-frequency noise and to help suppress the dynamic IR drop.
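The bus-width comparison between HBM and DDR4 made above is simple arithmetic, sketched below. The channel count and channel widths are taken from the text; the per-pin data rate used to convert the width into a peak bandwidth is a hypothetical placeholder and not a figure from this chapter, and the function names are ours.

```python
def bus_width_bits(channels, bits_per_channel):
    """Total interface width in bits."""
    return channels * bits_per_channel


def peak_bandwidth_gbyte_s(width_bits, gbit_s_per_pin):
    """Peak bandwidth in GB/s for a given width and per-pin data rate."""
    return width_bits * gbit_s_per_pin / 8.0


# Values from the text: eight 128-bit channels vs. a 64-bit DDR4 interface.
hbm_width = bus_width_bits(8, 128)
ddr4_width = 64
ratio = hbm_width // ddr4_width

# Hypothetical per-pin data rate, used only to illustrate the conversion.
ASSUMED_GBIT_S_PER_PIN = 1.0

print(f"HBM stack width: {hbm_width} bits ({ratio}x the DDR4 width)")
print(
    "peak bandwidth at the assumed pin rate:",
    peak_bandwidth_gbyte_s(hbm_width, ASSUMED_GBIT_S_PER_PIN),
    "GB/s",
)
```

The point of the sketch is that a wide interface reaches a high aggregate bandwidth even at a modest per-pin rate, which is what makes short, unterminated interposer wiring attractive.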

5.3 Physical Design Challenges

Figure 5.3 shows the physical design scope of CoWoS, including system planning and design implementation as well as analysis and verification. 3D-IC provides several design opportunities along with design challenges, due to the scope of a system that involves multiple dies in close proximity. First of all, system-level planning can be very different from that of a 2D chip. For a true 3D stack, the die sizes are identical, and partitioning existing designs becomes

Figure 5.3 Physical design scope of CoWoS. [Diagram labels: 3D-IC components (memory, SoC, micro-bump, interposer, TSV, frontside metals, flip-chip bump (C4), package substrate, BGA balls); cross-die interface; design planning; thermal/power delivery; verification/validation; ecosystem/partner support.]

harder with more tiers on the stack unless most of the content is in structured blocks such as DRAM. To take advantage of heterogeneous stacking, design partitioning will be further constrained. Take CoWoS, for example: in order to reduce the stress on the silicon interposer, the ratio between the total die area and the interposer area needs to be defined and enforced by the design rules. On top of that, the footprint and the aspect ratio of the silicon dies should be considered as a whole such that large white spaces without silicon dies can be avoided. This means that reusing existing designs or IPs and porting them to the interposer design may be difficult, and some systems might need a complete overhaul.

New 3D-IC-specific components also need to be considered. The interposer connects to the dies through the micro-bumps and to the substrate through the TSVs. The electrical characteristics of the new components need to be considered for their impact on everything from I/O performance to power delivery efficiency. Modeling the parasitics of those components varies depending on the application. In addition, for 3D-IC, the noise coupling from the TSV to the active devices should be analyzed, and new methodologies to avoid the impact should be adopted. There is a large gap in the current density limits between the TSV and the micro-bump. A later section demonstrates how the gap can be mitigated by a new design structure called combo-bump. To enable KGD screening, probe pads are used as part of the back-end design. Although fully populated probe pads are required for at-speed testing, they become redundant once the stack is formed. Providing a good usage model for the probe pads is another key feature of the combo-bump.

The interface between the dies for 3D-IC can be considered the same as connecting different components in an SiP module in that different parts

are designed in different process corners and different supply voltages. The difference is that in 3D-IC the designer can potentially take advantage of the tighter margin and achieve faster and more power-efficient design for cross-die signaling. In order to do that, cross-die timing needs to be modeled as close to intra-die timing as possible. At the same time, the ESD protection scheme needs to be revisited, as the loading and area penalties coming from the ESD devices are large and should be reduced if possible. The interconnect length in general is much shorter for the 3D-IC, so unterminated I/O and low-swing alternatives should be considered as well for power improvement.

By packing more functional blocks into the same volume, 3D-IC has to deal with a greater thermal gradient brought by higher power density and also more thermally resistive paths to the heat sink. A general concept is to place high power blocks closer to the power source and more temperature-sensitive blocks such as SRAM closer to the heat sink to minimize the impact. However, the designer has to rely on proper thermal modeling of the materials and on thermal simulation to provide a good estimate of the thermal gradient impact on the chip. The problem of dissipating heat to the outside is similar to the problem of bringing in the current to supply the chip. Instead of thermal conductivity, power analysis relies on the parasitic RLC of the power supply network. Functional blocks demand current and produce heat as a by-product. In fact, the two problems are so highly correlated that co-design is necessary when either of them is a concern.

3D-IC sign-off can be categorized into two parts: physical and electrical. For the physical sign-off, design rule check (DRC) and layout vs. schematic (LVS) checks are the key deliverables for the 3D interconnects. A 3D-IC system can be regarded as a wrapper around the subsystems, which are individual dies that have already passed their respective physical sign-off. For LVS, the designer is responsible for generating a golden netlist for the wrapper module. The connections between the dies need to be added on top of the existing LVS deck. Unlike the 2D counterpart, 3D-IC needs to address dies from different process nodes, so database handling is a nontrivial problem. We will discuss later the black box checking methodology that aims to alleviate this overhead. From the electrical point of view, the static timing analysis (STA) tool can implicitly handle multiple technologies since it treats each library separately. However, for IR drop/electromigration (IR/EM) sign-off, current density sign-off, or point-to-point resistance sign-off, the tool must be enhanced to take into account the physical alignment between the dies as well as the technology files for the different dies together. As a result, establishing an ecosystem and support from the IP and electronic design automation (EDA) vendors is critical for any 3D-IC system.

5.4 Physical Design Flow

5.4.1 RC Extraction and TSV Modeling

The discussion of the interconnect characteristics for the interposer can be separated into two categories: wire routings including the RDL and the 3D components. For the former part, RC extraction through the typical digital IC design


5 Physical Design Flow for 3D/CoWoSⓇ Stacked ICs

Table 5.1 Critical line length for different rise times.

| Rise time (ps) | Critical line length (mm) |
|---|---|
| 50 | 3 |
| 100 | 6 |
| 250 | 15 |

process is readily available unless high-speed interconnect is being considered. At higher frequencies, a simple RC model is insufficient once transmission line effects start to show up. Whether the transmission line effect must be included can be determined by comparing the time of flight with the rise/fall time of the signal being transmitted. The relationship can be established as follows:

$$T_r < N \cdot T_{tof} \tag{5.1}$$

where $T_r$ is the rise/fall time, $T_{tof}$ is the time of flight, and $N$ is a constant from 2 to 4 [2, 3]. For CoWoS technology, the time of flight is roughly 7 ps mm$^{-1}$. Assuming $N$ equals 2.5, the critical line length above which the transmission line effect must be considered is summarized in Table 5.1. For example, when the line is 3 mm long, $T_{tof}$ is 21 ps, so $N \cdot T_{tof}$ is 52.5 ps and a rise time of 50 ps satisfies the criterion. This means that for unterminated waves propagating along the line, the reflection of the wave is added to or subtracted from the desired signal and shows up as ringing on the transient waveform. Therefore, for high-speed signals on lines above the critical length, electromagnetic (EM) simulations are required to model the high-frequency nonideal effects.

The 3D components that require modeling and characterization include the micro-bump, the TSV, and the floating substrate. The micro-bumps can be modeled as discrete components with a typical T-model consisting of two resistors and one capacitor. The electric field between neighboring micro-bumps is ignored because of the small aspect ratio of their height to their diameter. For high-frequency applications, an inductance with series parasitic resistance and capacitances is included. For CoWoS, the series impedance of the micro-bump is more than 3× smaller than that of the TSV, which is discussed next.

The TSV can be modeled as either a two-port or a three-port component depending on the application, as shown in Figure 5.4. A simple RC model with the oxide capacitance C can represent the TSV at baseband frequencies (Figure 5.4b). At RF frequencies, the self-inductance of the TSV can no longer be ignored. Additionally, the leakage through the silicon substrate starts to increase with frequency. A more complicated model is then used, as in Figure 5.4c. At lower frequencies, the insertion loss is controlled by the oxide thickness until it enters a resistive region dominated by the conductance of the substrate. At high frequencies, $C_{sub}$, which is dictated by the distance to the grounded TSVs, creates another pole. The model fits the measurement data well up to the 20 GHz range.

Figure 5.4 SPICE model for TSV.

Because of the aspect ratio between the height and the pitch, the coupling should be considered from all the surrounding TSVs. Considering a signal TSV that is surrounded by grounded TSVs as shown in Figure 5.5, the parasitic capacitance can be calculated as

$$C_{total} = \frac{N}{N+1} \cdot C_{ox} \tag{5.2}$$

where $N$ here is the number of grounded TSVs. As the number of surrounding TSVs increases, the total parasitic capacitance approaches $C_{ox}$, which is given by the TSV model.

Figure 5.5 Signal TSV surrounded by grounded TSVs. (a) Illustration and (b) $C_{total}$ with respect to the number of grounded TSVs.
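The two closed-form relations of this subsection are easy to check numerically. The following is a minimal sketch, not part of the original flow: the function names are made up for illustration, and the defaults (7 ps/mm time of flight, N = 2.5) are the values quoted in the text. Rounding the critical length up to the next millimetre reproduces the entries of Table 5.1, and the capacitance ratio follows Eq. (5.2).

```python
import math

def critical_line_length_mm(rise_time_ps, tof_ps_per_mm=7.0, n=2.5):
    """Line length (mm) at which T_r = N * T_tof; longer lines satisfy Eq. (5.1)."""
    return rise_time_ps / (n * tof_ps_per_mm)

def tsv_capacitance_ratio(num_grounded):
    """C_total / C_ox from Eq. (5.2) for a signal TSV with N grounded neighbours."""
    return num_grounded / (num_grounded + 1)

# Rise times listed in Table 5.1
for tr in (50, 100, 250):
    exact = critical_line_length_mm(tr)
    print(f"T_r = {tr:3d} ps -> {exact:5.2f} mm (rounded up: {math.ceil(exact)} mm)")

# Ratio approaches 1 as more grounded TSVs surround the signal TSV
for n in (1, 2, 4, 8, 10):
    print(f"N = {n:2d} grounded TSVs -> C_total/C_ox = {tsv_capacitance_ratio(n):.3f}")
```

With a single grounded neighbour the signal TSV sees only half of the oxide capacitance, which is why the curve in Figure 5.5b rises quickly for small N and then flattens.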

5.4.2 Interposer Connectivity Checking Technique (LVS)

The connectivity of a stacked 3D system can be verified with the following approaches:

1. A divide-and-conquer approach in which the connectivity of each individual die is checked separately. Only the interface pins of the dies plus the interposer are used to create an assembly view of the whole stack. The correctness of the inter-die connections is then verified against the assembly database.
2. A comprehensive approach that includes all the shapes from each die in the checking database. This approach requires layer mapping to bump the layer numbers of the original databases to avoid layer collisions. It also requires an LVS superset deck that includes all the connectivity information between the layers in the database.


The drawback of the latter approach is obvious: it requires additional effort in database handling and LVS deck creation. Its benefit, on the other hand, can be justified only when the parasitic extraction of the inter-die signals requires the connected database from the LVS result. When the dies are treated as black boxes during the connectivity check, the internal wire routing of a die is not seen from any other die, so the inter-die parasitic modeling is optimistic. With a chip bonding technique such as CoWoS, the signals within each die are separated far enough that the difference can be safely ignored. In most scenarios, the former approach provides a much more efficient flow that saves significant runtime while being just as good as the comprehensive approach in terms of LVS correctness. For the remainder of this section, we focus only on the LVS checking within the interposer and leave the interface checking to the next section.

The LVS checking for the interposer is almost the same as that for an SoC die except that there is no active device. A dummy device needs to be added in both the source netlist and the layout database for the following reasons:

1. The LVS engine uses the devices as initial comparison points. Without them, LVS simply exits without comparison.
2. The RC extraction process is closely related to LVS. In order to properly annotate a path, the path needs at least two terminals with unique names.

For the interposer design, a dummy device is inserted at each micro-bump. In practice, such a dummy device can be embedded by the LVS deck whenever the CAD layer representing the micro-bump is encountered. On the netlist side, a utility is needed to insert the dummy device wherever a pin name matches a certain naming convention. Typically, the ports that connect to the micro-bumps are given a different prefix from those of the C4 bumps for this purpose.
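A netlist-side utility of this kind could look like the sketch below. Everything specific in it is an assumption made for illustration: the `UB_` prefix for micro-bump ports, the `C4_` prefix for C4 bump ports, the `ubump_dummy` model name, and the SPICE-like instance line format are hypothetical and do not come from any particular LVS tool or design kit.

```python
def insert_dummy_devices(ports, ubump_prefix="UB_", model="ubump_dummy"):
    """Return one dummy-device instance line per micro-bump port.

    Ports whose names do not start with `ubump_prefix` (for example C4 bump
    ports) get no dummy device.
    """
    lines = []
    for port in ports:
        if port.startswith(ubump_prefix):
            # Two uniquely named terminals: the bump pin and an internal node.
            lines.append(f"X{port} {port} {port}_int {model}")
    return lines

# Hypothetical interposer port list mixing micro-bump and C4 ports
ports = ["UB_DQ0", "UB_DQ1", "C4_VDD", "C4_VSS"]
for line in insert_dummy_devices(ports):
    print(line)
```

Each generated instance gives the LVS engine a comparison point and gives the extracted path two distinct terminal names, which are the two reasons listed above for adding the dummy device.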
The dummy device can carry a SPICE model that is technology dependent. The model associated with the micro-bump is filtered out during LVS and takes effect only after RC extraction is done.

The TSV can be modeled either as a device or as a via. If the TSV is modeled as a device, it needs to be explicitly specified in the netlist, and specifying all the TSVs needed for power connections becomes a rather tedious job. The alternative is to simply treat the TSV as a normal via, so that the connectivity through the TSV is established as in a normal LVS flow. On the other hand, in certain situations, such as when redundant TSVs are required, a missing TSV connection cannot be screened out by the LVS tool when the TSV is simply treated as a via. Additional steps or utilities are needed to check the correctness of the redundancy.

5.4.3 Interposer Interface Alignment Checking

Inter-die LVS checking requires two components: a golden netlist and a layout view of the assembly database for the stack. The netlist can be described by Verilog modules as shown in Figure 5.6. A top die-stack module instantiates the sub-die modules and the interposer module, and all the interconnects are described through the pins of each submodule. One-to-one, one-to-many, many-to-one, and many-to-many connections are all valid in this representation. While most signal connections fall into the one-to-one category, I/O ports tend to use dummy TSVs/micro-bumps to guarantee access to the chip for signals that are not frequency critical.

    // Top die-stack module
    module top (C_A, C_D);
      input C_A;
      output C_D;
      wire A, B, C, D;

      TM1 u1 (.P_A(A), .P_B(B), .P_C(C), .P_D(D));
      // The C4-side interposer ports connect to the stack-level ports
      SII ui (.U_A(A), .U_B(B), .U_C(C), .U_D(D), .C_A(C_A), .C_D(C_D));
    endmodule

    // Sub-die module
    module TM1 (P_A, P_B, P_C, P_D);
      input P_A, P_B;
      output P_C, P_D;
    endmodule

    // Interposer module
    module SII (U_A, U_B, U_C, U_D, C_A, C_D);
      inout U_A, U_B, U_C, U_D;
      inout C_A, C_D;

      assign U_A = C_A;
      assign U_D = C_D;
      assign U_B = U_C;
    endmodule

Figure 5.6 Verilog modules for the die stack.

Because of the many-to-many connectivity of the micro-bumps, the physical design is usually based on a bump file that describes the physical connections between the dice, as shown in Figure 5.7. Unlike the netlist, a bump file lists all the one-to-one connections between the interposer and the sub-die. Since each line of the bump file is unique, it can be used to derive the golden netlist for the inter-die LVS checking.

Figure 5.7 Bump location file content. (Labels: top die bump placement/assignment, interposer die bump placement/assignment, bump file.)

For chip-on-wafer assembly, both the sub-dies and the interposer are designed facing up. This means that during the bonding process the sub-dies are flipped down toward the interposer for face-to-face joining. Therefore, the sub-die databases also need to be flipped along either the x-axis or the y-axis to align with the interposer for interface alignment checking. For a half-node process, proper scaling is needed during database merging to ensure that the physical dimensions are correctly translated.

Establishing the connectivity guarantees only part of the correctness of the interface between the sub-die and the interposer. The following guidelines should be taken into account when implementing the rules for inter-die checks:

1. Center-to-center alignment of the micro-bumps: To meet the assembly tolerance for chip stacking, the center-to-center misalignment between the chips should be kept to the minimum allowed by the ground rules.
2. Coverage of micro-bumps: The size of the micro-bumps scales from its original GDS in a half-node design; it is therefore beneficial to check for micro-bump size mismatch after database assembly. This check also guarantees that the DRC checks on the individual dies remain valid after stacking.
3. Clearance of the interposer: When sensitive analog components are used in the sub-die, a metal pattern over the top of a component might have a sizable impact on its performance. A common practice is to define a dummy layer that waives some metal density rules in the region to minimize the impact. If the impact is also non-negligible when the interposer is present, the clearance check needs to be enforced through the inter-die checking rules as well.
4. Alignment markers: The alignment markers are defined so that individual dies can be visually aligned during assembly. The inter-die check should also identify the alignment markers to make sure that they are properly placed. Proper labeling is needed when electrical connection is not required.
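The flip, scale, and center-to-center check described above can be sketched in a few lines. This is an illustrative assumption-laden example rather than a production rule deck: the coordinate convention (origin at the lower-left die corner, mirroring about the vertical axis), the 0.9 shrink factor, the bump names, and the 1 μm tolerance are all hypothetical.

```python
import math

def flip_and_scale(bumps, die_width_um, scale=1.0):
    """Mirror sub-die bump centres about the vertical axis, then apply a shrink.

    `bumps` maps bump name -> (x, y) in drawn micrometres. Mirroring
    x -> die_width - x models mounting the die face-down; `scale` models a
    half-node shrink of the drawn database.
    """
    return {name: ((die_width_um - x) * scale, y * scale)
            for name, (x, y) in bumps.items()}

def misaligned_bumps(die_bumps, interposer_bumps, pairs, offset_um, tol_um):
    """Return the (die, interposer) bump pairs whose centres differ by more than tol_um."""
    ox, oy = offset_um
    bad = []
    for die_name, ip_name in pairs:
        dx, dy = die_bumps[die_name]
        ix, iy = interposer_bumps[ip_name]
        if math.hypot(dx + ox - ix, dy + oy - iy) > tol_um:
            bad.append((die_name, ip_name))
    return bad

# Hypothetical data: a 1000 um wide sub-die placed at (2000, 500) on the interposer
die = flip_and_scale({"P_A": (100, 200), "P_B": (300, 200)}, die_width_um=1000, scale=0.9)
interposer = {"U_A": (2810.0, 680.0), "U_B": (2630.0, 685.0)}
pairs = [("P_A", "U_A"), ("P_B", "U_B")]
print(misaligned_bumps(die, interposer, pairs, offset_um=(2000, 500), tol_um=1.0))
```

In this example `U_B` is drawn 5 μm off in y, so only the `P_B`/`U_B` pair is reported. The pair list itself would come from the bump file, which already records the one-to-one connections.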
Giving a unique name to the bumps that have no electrical connection lets the inter-die check ensure that only bumps with the same intent are aligned.

5.4.4 Cross-Die Timing Check

Cross-die STA is challenging for 3D-IC design because of the following factors:

1. Multiple technologies are in use, which means different libraries and sign-off corners.
2. Applying different on-chip variation (OCV) settings and combinations of extreme corners might be too pessimistic for inter-die timing paths.
3. Timing budgeting for inter-die paths becomes harder because different corners are applied.

Modern STA tools allow the designer to load libraries and supply voltages for any individual gate along a path, so technically one can obtain the path timing if the conditions can be specified correctly. However, there may not be a corresponding library corner for the different technologies. The sensitivity at the corners also differs between technologies, which makes it difficult to define the corner conditions properly without imposing extra margins. For example, assume the driver and the receiver are in different dies. To avoid unnecessary voltage translation, the typical sign-off corner should be the same for both. If the worst-case sign-off corner is defined as the SS global corner, the supply voltage is then most likely different for each die. At the boundary of the power domains, the incoming voltage level differs from the supply voltage level, which contributes to inaccuracy in the delay and slew calculation. On the other hand, if the voltage of the SS global corner is forced to be the same, extra margin is unavoidable for one of the technologies.

OCV is a key component in addressing within-die variations in advanced-node technologies. A flat OCV that treats all paths equally has proven to be too pessimistic. The logic depth of a path and the spatial location of the gates along it result in different OCV numbers being applied. Assuming each gate delay can be represented by an independent Gaussian distribution with standard deviation $\sigma_i$, the total variation of the path can be simplified as

$$\sigma_{path}^2 = \sum_i \sigma_i^2 = n\sigma^2 \tag{5.3}$$

when all $n$ gates are identical. The total path variation therefore grows only with the square root of the number of gates in the path, so the variation relative to the path delay shrinks as the path gets deeper. In practice, the number is obtained through simulations and delivered as part of the library release. The concept works fairly well when a single technology is involved. However, the definition of traversed logic changes for a cross-technology timing path. The OCV assumption no longer holds, since the path crosses not only different dies but also different technologies.

Because of the unavoidable pessimism in cross-die STA sign-off, designers need to follow some design guidelines to alleviate the problem. Figure 5.8 shows a cross-die timing path with the interposer involved. The guidelines can be summarized as follows:

1. Leverage low-power STA to deal with multiple power domains.
2. Architect the timing partition well.
3. Data bus and clock alignment is critical.
4. Minimize interface logic to reduce OCV concerns.

Figure 5.8 Recommendations for cross-die signaling. (Labels: SoC1, interposer, SoC2; align clock and data.)
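The square-root scaling of Eq. (5.3) can be illustrated with a short calculation. The sketch below assumes identical, statistically independent gates; the 20 ps gate delay and 2 ps sigma are hypothetical numbers chosen only to make the trend visible, and real OCV derates come from library characterization as the text notes.

```python
import math

def path_sigma(gate_sigma_ps, n_gates):
    """Standard deviation of a path of n identical, independent gates (Eq. 5.3)."""
    return math.sqrt(n_gates) * gate_sigma_ps

def relative_variation(gate_delay_ps, gate_sigma_ps, n_gates):
    """Path sigma divided by nominal path delay; it falls as 1/sqrt(n)."""
    return path_sigma(gate_sigma_ps, n_gates) / (n_gates * gate_delay_ps)

for n in (1, 4, 16, 64):
    rel = relative_variation(gate_delay_ps=20.0, gate_sigma_ps=2.0, n_gates=n)
    print(f"depth {n:2d}: sigma_path = {path_sigma(2.0, n):5.2f} ps, relative = {rel:.3%}")
```

A short interface path therefore sees close to the full per-gate variation, which is one way to read the guideline of minimizing interface logic: fewer gates on the cross-die path means fewer stages to which the pessimistic derate must be applied.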


By properly aligning the clock and data right before they are sent out of the die, the OCV impact on the fast path versus the slow path can be minimized. A structured, well-designed outgoing and incoming data bus can be verified through circuit-level simulation to minimize the uncertainty caused by the STA.

5.4.5 IR Drop Analysis for the Interposer

For any 3D-IC assembly process, power delivery is a critical issue whose impact on overall system performance cannot be overemphasized. An interposer design can be modeled much like its 3D counterpart, except that there is no heat source in the interposer. One can view the power grid patterns of the interposer as an extension of the original BEOL layers of the sub-die and simply merge the databases for IR analysis. However, merging and remapping the databases is not trivial and can be beyond the tool's capability if the sub-dies are built in different technology nodes. Furthermore, creating a new technology file that includes the interposer for every existing baseline process is simply impractical. A 3D-specific power analysis flow needs to be adopted to simplify the process.

Figure 5.9 shows the power analysis flow for the interposer. The database of each die is imported, along with the associated technology file, into the integrated IR/EM engine. Additionally, IPs with technology-independent power models and package subcircuits can be loaded if necessary. The integration is handled by a mapping file that relates the micro-bumps of a given die to those of the interposer. The relationship can be established in a couple of ways:

1. Absolute x and y coordinates: The interposer is assumed to be placed at (0, 0), while the dies are flipped, and possibly rotated, to align with the micro-bumps. Specifying the coordinates allows the tool to link the micro-bumps by location.
2. Instance name for a given die: When the instance names of the bumps are predefined by the designer, this approach can be used to simplify the integration process.

Once the connections are established by the mapping file, the individual power networks can be extracted, simplified, and combined with the other power networks to form the complete database for the whole stack.
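The coordinate-based option can be sketched as follows. This is a hypothetical illustration, not the format of any vendor's mapping file: the function names, the bump names, the 1 μm snapping grid, and the convention that the flip mirrors x before a counter-clockwise rotation are all assumptions.

```python
def place_die(bumps, origin, rotation_deg=0, flip=True):
    """Transform die-local bump centres into interposer coordinates.

    `flip` mirrors x (face-down mounting), the rotation is a multiple of
    90 degrees applied after the flip, and `origin` is where the die's local
    (0, 0) lands on the interposer, whose own origin is (0, 0).
    """
    turns = (rotation_deg // 90) % 4
    placed = {}
    for name, (x, y) in bumps.items():
        if flip:
            x = -x
        for _ in range(turns):
            x, y = -y, x  # rotate 90 degrees counter-clockwise
        placed[name] = (x + origin[0], y + origin[1])
    return placed

def link_by_location(die_bumps, interposer_bumps, grid_um=1.0):
    """Pair die and interposer bumps whose centres snap to the same grid point."""
    def key(p):
        return (round(p[0] / grid_um), round(p[1] / grid_um))
    lookup = {key(p): name for name, p in interposer_bumps.items()}
    return {name: lookup[key(p)] for name, p in die_bumps.items() if key(p) in lookup}

# Hypothetical power bumps of one die and the interposer bumps beneath it
die = place_die({"VDD_1": (10, 20), "VSS_1": (65, 20)}, origin=(500, 300))
interposer = {"ub_17": (490, 320), "ub_18": (435, 320), "ub_19": (380, 320)}
print(link_by_location(die, interposer))
```

The resulting dictionary plays the role of the mapping file: each die bump is tied to the interposer bump at the same location, and an interposer bump with no partner (`ub_19` here) simply stays unlinked.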
This approach works well for static power analysis, for which a resistive network is sufficient. When dynamic analysis is needed, the capacitances and the inductances of the power lines have to be included. The integration method assumes that there is no coupling between the die and the interposer, which is generally true for capacitance since the field decreases linearly with distance. However, the mutual inductance effect needs to be evaluated by the designer to ensure that it can be safely ignored in the dynamic analysis.

Figure 5.9 Power analysis flow for the interposer. (Labels: Die 1 to Die n, IP1 to IPn, interposer, package; IC design data (LEF/DEF); SiI design data (LEF/DEF, MIMCAP); package layout (mcm, sip); power model; RLC SPICE; micro-bump mapping file; integrated IR/EM.)

Because the interposer is close to the die, it is considered part of the impedance network that dominates the high-frequency response. Therefore, the MIM capacitances inserted in the interposer decouple the high-frequency noise generated by the SoC. Figure 5.10 shows the reduction of the maximum IR drop obtained by inserting 24.4 nF of MIM capacitance in the interposer. The simulation results show that the IR drop can be reduced by more than 6 percentage points by adding dedicated decoupling capacitance in the power domain. In a reference design with blocks such as the SoC and the WideIO DRAM, experimental results also suggest IR drop savings of as much as 2% and 8%, respectively (Table 5.2).

Figure 5.10 Dynamic IR drop with MIM capacitance. (Labels: max 250 mV (21.33% drop); max 176 mV (14.69% drop); 24.4 nF MIM insertion.)

Table 5.2 Summary of experimental dynamic voltage drop reduction (with package RLC).

| Block | MIM scenario | Power domain | Voltage drop (mV) | Ideal voltage (V) | IR (%) |
|---|---|---|---|---|---|
| SOC | No MIM capacitance | VDD | 57.4 | 1.21 | 4.74 |
| SOC | No MIM capacitance | VSS | 13.2 | 0 | 1.09 |
| SOC | No MIM capacitance | VDD + VSS | 70.6 | 1.21 | 5.83 |
| SOC | With 66 nF MIM capacitance | VDD | 30.9 | 1.21 | 2.55 |
| SOC | With 66 nF MIM capacitance | VSS | 12.7 | 0 | 1.05 |
| SOC | With 66 nF MIM capacitance | VDD + VSS | 43.6 | 1.21 | 3.60 |
| WideIO DRAM | No MIM capacitance | VDDQ | 127.1 | 1.2 | 10.59 |
| WideIO DRAM | No MIM capacitance | VSS | 2.44 | 0 | 0.20 |
| WideIO DRAM | No MIM capacitance | VDDQ + VSS | 129.54 | 1.2 | 10.80 |
| WideIO DRAM | With 82 nF MIM capacitance | VDDQ | 27.15 | 1.2 | 2.26 |
| WideIO DRAM | With 82 nF MIM capacitance | VSS | 3.2 | 0 | 0.27 |
| WideIO DRAM | With 82 nF MIM capacitance | VDDQ + VSS | 30.35 | 1.2 | 2.53 |


5.5 Physical Design Guideline

5.5.1 Interposer Wide Bus Routing Guideline

As mentioned before, an important application of the interposer is to enable wide-bus memories. The following discussion is based on the HBM standard; however, the general principles can be applied to other standards as well. Figure 5.11 shows the ballout map of the HBM. The shaded region is enlarged to illustrate the regular bump pattern. The micro-bumps are laid out in a staggered pattern with x and y pitches of 55 and 96 μm, respectively. Each region contains 44 signals and 4 clocks, while the rest are supply bumps arranged in an alternating VSS and VDDQ pattern. With this basic understanding of the ballout, it can be seen that the physical design can be pieced together by replicating small but regular tiles. Since the routes run horizontally, the region under the dotted line is used as the unit cell instead of the shaded region.

Ideally, the total wiring length among the data or address bits that belong to the same bus should be kept to a minimum and as close as possible. Therefore, the ballout should be identical on the HBM side and the SoC side. Figure 5.12 illustrates the placement of the micro-bumps and the routing scenarios. A trace block represents the straight parallel wiring between the bump regions. The benefits of separating the trace from the other parts of the routes are twofold: reusability and modeling accuracy. Since the ballout is predetermined, changing the floor plan can be as simple as changing the trace length, while the rest of the design remains untouched. The latter advantage comes from the simplicity of the pattern: one can easily use a field solver to produce a wideband electrical model for the trace, which would otherwise require much more computation to cover the whole routing pattern.

Two types of micro-bump are provided by the CoWoS technology. One is called the metal layer (MT) bump and the other the AP bump. Figure 5.13 shows the routing pattern associated with each micro-bump type. For the MT bump, no RDL metal is used, and the top MT is connected directly through a via up to the under-bump metal. As Figure 5.13b shows, a route on the top MT has to get around the bumps. For the AP bump, on the other hand, no via to MT is allowed beneath the bump, and the RDL layer has to be used to access the routes outside the bump area. When signals are not routed on MT, using the MT bump might be slightly advantageous, since the via between RDL and MT is a local obstruction that the other routes need to avoid. In general, the AP bump is the preferable bump type, since most wires stay straight until they reach the micro-bumps.

CoWoS technology provides one-, two-, and four-TSV options for the BGA connection. Their usage can be classified as follows:

1. Four TSVs are used for power BGAs: Although a single TSV is more than enough to carry the current into the chip, having multiple TSVs helps reduce the parasitic resistance and inductance.
2. Two TSVs are used for general signal I/O connections: Redundancy in the I/O provides assurance that the chip is testable.
3. A single TSV is used for high-frequency signals: At high frequencies, the loading from the TSV should be reduced as much as possible to reduce the insertion loss. Adopting the single-TSV option is crucial for applications such as SERDES links at 28 GHz or higher.
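The trade-off behind the three TSV options can be seen from a first-order lumped estimate. The sketch below is an idealization: it treats the TSVs as identical and ignores mutual coupling between them, and the per-TSV values (20 mΩ and 40 fF) are hypothetical placeholders, not CoWoS data.

```python
def parallel_tsv(r_single_ohm, c_single_ff, count):
    """Lumped estimate for `count` identical TSVs wired in parallel.

    The series resistance divides by the count, while the oxide capacitance
    (the load seen by the signal) multiplies by it. Mutual coupling between
    the TSVs is ignored.
    """
    return r_single_ohm / count, c_single_ff * count

for count, use in ((4, "power"), (2, "general I/O"), (1, "high-speed signal")):
    r, c = parallel_tsv(r_single_ohm=0.02, c_single_ff=40.0, count=count)
    print(f"{count} TSV ({use}): R = {r * 1e3:.1f} mOhm, C = {c:.0f} fF")
```

More TSVs lower the series resistance, which helps power delivery, but the capacitive load grows in proportion, which is why the single-TSV option is reserved for the highest-frequency signals.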

Figure 5.11 HBM ballout and dimensions. (Labeled regions: channels a to h, test ports, power supply region, and mechanical bumps; bump pitches of 55 μm and 96 μm.)

Figure 5.12 Routing between the SoC and DRAM interface.

Figure 5.13 (a) AP bump routing and (b) MT bump routing.

Figure 5.14 BGA and TSV placements for HBM.

The TSV and BGA placements for the power supply are arranged as in Figure 5.14. The BGAs are staggered, with a center-to-center distance of around 220 μm. The pitch in x spans three micro-bump pitches horizontally. The vertical alignment of the BGA relative to the micro-bumps is defined by the aforementioned unit cell block. The guideline is to align the center of the BGA ball with the unit cell boundary. In this way, the BGA pattern can be maintained across the power supply region and the signal region of the HBM ballout. Additionally, the power micro-bumps within the unit cell can be wired to the adjacent BGA without occupying too many signal-routing resources. In the power supply region, vertical RDLs can be used to connect the micro-bumps to enhance the power delivery, as shown in Figure 5.14.

Integrating multiple HBM dies is supported in CoWoS. Figure 5.15 shows setups with one SoC and multiple HBM dies. For the two-HBM setup the solution is straightforward: the HBMs are placed along the longer side of the footprint. Note that since each HBM has to be flipped, one has to be rotated 180° with respect to the other. Figure 5.15b,c shows two setups with four HBMs. Typically, having two HBMs on each side of the SoC is preferable to the alternative unless the aspect ratio of the SoC does not allow it. This not only maximizes the utilization of the interposer silicon area but also eliminates the low-density areas around the corners of the interposer that cause manufacturing issues. Since the chip-on-wafer assembly process happens after dicing, additional silicon area should be allowed for when accounting for the distance between the SoC and the HBMs. For CoWoS with an HBM configuration, the distance between the dies should be tight enough that there is no underfill between the dies, yet wide enough to tolerate the die saw margin.

5.5.2 Interposer SI/PI Analysis for HBM Interface

In the previous section, the routing guidelines for the HBM interface were introduced. Once the routing patterns are created, RC extraction can be performed to extract the parasitics of the wires and vias. By integrating the micro-bumps and the I/O drivers, the simulation environment can be set up as shown in Figure 5.16. For clock signals such as data strobes, the toggle rate is set to half of the data rate, since the data are triggered on both edges. Clock uncertainty can be added to the clock source through the RMS jitter. The data signals, on the other hand, can switch as fast as the data rate, with their phase shifted by 90° from the clocks. To cover the most signal switching events, an LFSR pseudorandom sequence is generated with a different starting seed for each bit of the bus. To add the uncertainty caused by the clock source, a sample-and-hold circuit latches the LFSR data stream, which is then mixed with the jittered clock pattern to produce the desired output. The I/O drivers and receivers can be modeled with the Input/output Buffer Information Specification (IBIS) to save simulation runtime. Finally, the package RLC is needed for power integrity (PI) analysis. Note that the RLC values of the package should be scaled to match the number of bits under simulation to reflect the real case.

Figure 5.15 Multiple HBM dies arrangement. (a) Two-HBM setup, (b) four-HBM setup with two HBMs on each side, and (c) four-HBM setup with one HBM on each side.

Figure 5.16 Simulation setup for interposer SI/PI analysis. (Blocks shown: RMS jitter sources, LFSR bit data stream, data strobe, sample and hold, IBIS models, μ-bump models, interposer parasitics, and package RLC model; signals DQ[0:15], DQs_c, DQs_t.)

Figure 5.17 shows the eye diagram of the interposer for various trace lengths. The eye opening degrades as the trace length increases. This is expected because the wire parasitics increase linearly with the signal length. As the RC constant increases, the rise/fall time of the signal gets longer and the eye collapses vertically. At the same time, the increased mutual inductance between the wires causes the edges to deviate more from their original crossing times, so the eye collapses horizontally as well. Because of the package inductance, signal ringing can also be observed in the eye diagram.

The eye diagram is very useful for determining the inter-die signal quality. The designer can apply an eye mask to the eye diagram to assess whether it passes the test. However, since we are dealing with a simulation environment that is statistical in nature, a criterion based on the measured bit error rate (BER) is more representative of the signal quality. In most cases, the noise in the communication system can reasonably be assumed to be Gaussian distributed. Such a distribution has no bound; in other words, with enough samples measured, the system will eventually fail the test. During the test, a bit fails when the signal at the strobe time cannot be correctly interpreted, that is, when it is read as logic one although a zero was transmitted, or vice versa. Figure 5.18 shows the waveform of the transmitted signal and the corresponding noise distributions at logic one and zero. At the decision level, the tail of each distribution represents the errors seen by the receiver. One way to estimate the BER is through the statistical indexes of the two distribution functions:

$$\mathrm{SNR} = \frac{|\mu_1 - \mu_0|}{\sigma_1 + \sigma_0} \tag{5.4}$$

where $\mu_0$, $\mu_1$ are the mean values of the distributions and $\sigma_0$, $\sigma_1$ are their standard deviations. Although the SNR gives a first-order estimate of the signal quality, it does not correlate well with the BER in the real world. Alternatively, the dual-Dirac model is widely used to relate jitter analysis to the BER [4]. The total jitter in the system is composed of deterministic jitter (bounded) and random jitter (unbounded). The deterministic jitter, which results from deterministic sources such as duty cycle distortion (DCD), is modeled as two delta functions in this model. The deterministic jitter and the random jitter are combined by convolving them.
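The eye SNR of Eq. (5.4) is straightforward to evaluate from voltages sampled at the strobe time. The sketch below is only an illustration: the function name and the sample values are hypothetical, and the population standard deviation is used as one reasonable choice for the sigma terms.

```python
from statistics import mean, pstdev

def eye_snr(ones, zeros):
    """Eq. (5.4): |mu1 - mu0| / (sigma1 + sigma0) from sampled voltages."""
    return abs(mean(ones) - mean(zeros)) / (pstdev(ones) + pstdev(zeros))

ones = [1.18, 1.22, 1.20, 1.16, 1.24]     # volts sampled when a one was sent
zeros = [0.02, -0.02, 0.00, 0.04, -0.04]  # volts sampled when a zero was sent
print(f"eye SNR = {eye_snr(ones, zeros):.1f}")
```

Because only two means and two standard deviations are needed, this estimate stabilizes with far fewer samples than a direct count of bit errors would require.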
As a result, the BER(x), where x is the sample delay time, can be expressed by

$$\mathrm{BER}(x) = \int_x^{\infty} N(x')\,dx' \tag{5.5}$$

Figure 5.17 Eye diagram for interposer SI/PI analysis. (Panels for trace lengths of 2.0, 3.0, 4.0, 5.0, 6.0, and 7.0 mm; voltage in V versus time, with the time axis scaled by 10⁻¹⁰.)

Figure 5.18 Noise distribution and decision error. (Labels: bit errors, decision level, eye center.)

where N(x) is the noise distribution function. N(x) can be expressed in terms of the deterministic jitter $\mu$ and the standard deviation $\sigma$ of the random jitter. Equation (5.5) becomes

$$\mathrm{BER}(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \int_x^{\infty} e^{-\frac{(x'-\mu)^2}{2\sigma^2}}\,dx' \tag{5.6}$$

It is convenient to define quality factor Q in terms of 𝜇 and 𝜎 where (x − 𝜇) (5.7) 𝜎 Q factor defined in this way allows us to relate its value with the BER at any given x. The relationship between the Q factor and the BER is linked through the inverse error function, which does not have any close-form solution. However, it can be approximated by lower order of polynomial function when BER is low. Table 5.3 summarizes the BER and the corresponding Q factor using numerical analysis. For example, the Q factor that is needed to achieve a maximum BER of 10−9 is around 6. We can obtain experimental results and analyze them with the aforementioned SNR and the Q factor. For a BER of 10−12 , it is equivalent to 1 error bit being detected for all the bits measured over 1000 seconds of time interval for 1 Gbps data. It will take enormous amount of simulation time to approach the confidence level that is needed for Q factor. It is complicated by the fact that the measured Q=

Table 5.3 Bit error rate vs. Q factor.

Bit error rate    Q factor
10−8              5.612
10−9              5.998
10−10             6.361
10−11             6.709
10−12             7.035
10−13             7.349
10−14             7.651
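The relationship behind Table 5.3 can be checked numerically. The sketch below assumes the Gaussian tail of Eq. (5.6) with Q defined as in Eq. (5.7), which gives BER = erfc(Q/√2)/2, and inverts it by bisection. The function names are hypothetical, and the results should land close to the tabulated values rather than match them digit for digit.

```python
import math

def ber_from_q(q):
    """Gaussian tail probability beyond q standard deviations (Eqs. 5.6 and 5.7)."""
    return 0.5 * math.erfc(q / math.sqrt(2.0))

def q_from_ber(ber, lo=0.0, hi=40.0, iters=200):
    """Invert ber_from_q by bisection; the tail probability decreases as q grows."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if ber_from_q(mid) > ber:
            lo = mid   # tail still too large, so q must increase
        else:
            hi = mid
    return 0.5 * (lo + hi)

def seconds_per_error(ber, bit_rate):
    """Average time needed to observe one bit error at the given data rate."""
    return 1.0 / (ber * bit_rate)

for exponent in range(8, 15):
    print(f"BER = 1e-{exponent}: Q = {q_from_ber(10.0 ** -exponent):.3f}")
print(f"1 Gbps, BER 1e-12: {seconds_per_error(1e-12, 1e9):.0f} s per error")
```

The last line reproduces the 1000-second figure quoted in the text, which is why a direct BER measurement at this level is impractical in simulation.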


5 Physical Design Flow for 3D/CoWoSⓇ Stacked ICs

Q factor is obtained through curve fitting of the Gaussian distribution and is highly dependent on the number of samples that can be measured. On the other hand, the SNR needs fewer samples to achieve a high confidence level, although it can be skewed by intersymbol interference (ISI) and other noise. To compare experimental results and use them to guide the design, both factors are evaluated in the following analysis. To maximize the signal quality on the interposer, consider the following parameters:

1. Wire dimensions, including width, spacing, and length.
2. Signal patterns.
3. Number of RDL layers used.
4. Package RLC.
5. Driver/receiver IO and strength settings.

Among those parameters, wire width/spacing and signal patterns are design parameters that affect signal integrity, while the package RLC is more related to PI because of the so-called simultaneous switching output (SSO) noise. Figure 5.19 shows that the signal quality degrades as the trace length increases. The trace length is defined as the distance between the micro-bumps on the edge of the HBM and those on the SoC. As mentioned earlier, the accuracy of the Q factor is limited by the number of samples, so its trend is not as monotonic as that of the SNR. It can be seen that a signal width of around 1 μm produces the best result, given that all the traces are on the same RDL layer. At smaller widths, the signal trace is

Figure 5.19 SNR and Q factor vs. trace length. Curves: W = 0.5, 0.75, 1.0, 1.25, and 1.5 μm; axes: eye SNR and quality factor versus trace length (mm).

Figure 5.20 Shielding effect of the signals. Panels: eye SNR and quality factor versus width (μm) for trace lengths of 2500 μm and 5500 μm; curves: GSG:0:0_GROUND and GSSG:0:0_GROUND.

resistive, and hence the eye opening is limited by the rise/fall time. On the other hand, the capacitance starts to dominate the overall impedance as the wire width increases further. When the signals are not shielded, a wider wire also increases the cross talk between neighboring traces. Figure 5.20 shows the result when the shielding pattern on the signal wires is taken into account. Only the GSG and GSSG configurations are considered here, since at least one side of each signal has to be shielded to provide a return path. With the GSG configuration, the spacing between the signal and the shields is smaller than in the GSSG counterpart for a given channel width. Nonetheless, the result shows that the GSG configuration is superior because the signal quality is dominated by the coupling between the signals. This is especially true for long traces, where the Q factor of the GSSG configuration becomes unacceptable. The shields effectively terminate the electric field from the aggressors; however, widening the shields yields diminishing returns. For the CoWoS technology, the area impact of the shielding wires can be minimized since it allows submicron design rules. So far only one RDL routing layer has been considered. With additional RDL resources, the routing density can be reduced, which lowers the coupling between signals without sacrificing wire width. To compare different routing patterns, the pattern configuration is written as three colon-separated fields per RDL layer, with the entries of the two layers joined by an underscore, for example GSGG:−1:0_GGSG:1:−0.75. Applicable values for each data field can be summarized


in Table 5.4. For the first (pattern) field, NONE stands for not available, and GROUND stands for the slotted ground plane assigned to that RDL layer. The second (integer) field gives the additional shields inserted at the beginning or the end of the pattern to stagger the signals relative to the neighboring routing layer. Finally, the third (floating-point) field adds an offset from the origin to fine-tune the location of the traces. With these configurations, the designer can specify the patterns needed for experiments involving two RDL layers. Figure 5.21 shows the results of several predefined configurations based on two RDL routing layers. It can easily be seen that simply redistributing the wires over two RDL layers does not give a good result (GSSG:0:0_GSSG:0:0). Because of coupling, it is actually worse than assigning all the routing to one RDL layer (GSSG:0:0_GROUND). As an alternative, the signal pattern between the RDL

Table 5.4 Summary of applicable values for data patterns.

Field                     Applicable values
First field (pattern)     NONE, GROUND, GSG, GSSG, GSGG, GGSG, etc.
Second field (shields)    Integers
Third field (offset)      Floating numbers
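To make the configuration strings used in the figure legends concrete, here is a minimal parsing sketch. The attribute names `pattern`, `shields`, and `offset` are hypothetical labels chosen for illustration, not names taken from the text, and the rule that a bare keyword such as GROUND or NONE carries no numeric fields is inferred from examples like GSSG:0:0_GROUND.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class LayerPattern:
    pattern: str                     # e.g. "GSG", "GSSG", "GROUND", "NONE"
    shields: Optional[int] = None    # extra shields used to stagger the layers
    offset: Optional[float] = None   # offset from the origin

def _normalize(text: str) -> str:
    # Typeset strings often use a typographic minus or en dash instead of "-".
    return text.replace("\u2212", "-").replace("\u2013", "-")

def parse_layer(text: str) -> LayerPattern:
    parts = _normalize(text).split(":")
    if len(parts) == 1:              # bare keyword such as GROUND or NONE
        return LayerPattern(parts[0])
    if len(parts) != 3:
        raise ValueError(f"expected one or three fields, got {text!r}")
    return LayerPattern(parts[0], int(parts[1]), float(parts[2]))

def parse_config(text: str) -> List[LayerPattern]:
    """Split a two-layer configuration at the underscore and parse each layer."""
    return [parse_layer(layer) for layer in text.split("_")]

for name in ("GSGG:-1:0_GGSG:1:-0.75", "GSSG:0:0_GROUND", "NONE_GSG:0:0"):
    print(name, "->", parse_config(name))
```

Keeping each layer as a small record makes it straightforward to enumerate candidate patterns for a sweep such as the one reported in Figure 5.21.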

Figure 5.21 Experimental results with two RDL layers. Panels: eye SNR and quality factor versus width (μm) for trace lengths of 2500 μm and 5500 μm; curves: GSGG:−1:0_GGSG:1:−0.75, GSSG:0:0_GROUND, GSSG:0:0_GSSG:0:0, GROUND_GSG:0:0, GSG:−1:0_GSG:1:0.5, and NONE_GSG:0:0.

Figure 5.22 SSO noise suppression with MIM capacitance. Axes: eye SNR and quality factor versus MIM capacitance (pF); curves: inductance = 0.1 nH and inductance = 1 nH.

layers should be staggered carefully such that the cross talk is minimized (GSGG:−1:0_GGSG:1:−0.75 and GSG:−1:0_GSG:1:0.5). Note that while the SNR suggests that the optimum lies at a larger wire width, the Q factor favors a narrower wire width for those two optimized configurations. The same study can be performed for three or more RDL layers with more complicated predefined patterns. However, the designer should decide, based on the physical and electrical limitations, whether the additional cost of more RDL layers can be justified.

The impact of the MIM capacitance on power supply noise was discussed in Section 5.4.5. For SSO events, the MIM capacitance also helps to suppress unwanted high-frequency noise. Figure 5.22 shows how the SNR and the Q factor change when MIM capacitance is inserted on the interposer. With 1 nH of input inductance, the SNR can improve by almost 2× when enough MIM capacitors are implemented. With 0.1 nH of input inductance, both the SNR and the Q factor drop significantly as the MIM capacitance is increased from 0. This is due to the resonance created by the LC network on the power delivery network. It not only degrades the eye but also causes overshoot, which should be avoided.

The driving strength of the IO should also be considered in the optimization process. Even though the IO used in CoWoS with HBM applications is unterminated, overdriving the IOs still costs unnecessary dynamic power. The general guideline is to pick a driving strength setting that is intermediate


for the IO driver and optimize the other parameters from there. Once the other parameters are decided, the designer can lower the driving strength, as long as the eye diagram specification is still satisfied, to arrive at the optimal driving strength. For the clock drivers, differential IO should be adopted to reject common-mode noise as well as to minimize the DCD of the data strobes in the analysis.

5.5.3 Combo-Bump Design Style

The interposer provides connections between the sub-dies through the micro-bumps. It also provides the connections between each sub-die and the substrate along a link consisting of the micro-bump, RDL, TSV, and backside C4 bump. There is a huge gap in current-carrying capability between the micro-bump and the C4 bump. The combo-bump is therefore proposed to provide better power delivery and structured RDL routing and to reduce design complexity. A combo-bump is a combination of a probe pad, a number of micro-bumps, and the corresponding connection layers. Figure 5.23a shows one configuration of the combo-bump. In this example, eight micro-bumps are used to form the connection between the SoC and the interposer. By carefully designing the placement of the micro-bumps, the pitch of the placed combo-bumps can be aligned with that of the C4 bumps. Figure 5.23b shows the alignment of the combo-bumps and the C4 bumps. The advantage of this design pattern is twofold: the RDL routing can be minimized, and at the same time the placement can be modularized to reduce the design effort. To increase the usability of the combo-bump, modularized configurations should be supported. Figure 5.24 shows the Lego-like elements that can be used to construct the combo-bump. The number of micro-bumps, the number of TSVs, and the RDL routing patterns provide a variety of options to

Figure 5.23 (a) Structure of the combo-bump and (b) alignment between the C4 bumps and the combo-bumps. Labels: micro-bump, combo-bump, C4 bump, probe pad.

Figure 5.24 Basic building blocks for combo-bump. Labels: TSV, probe pad.

serve different design needs. For example, signal routes do not need to handle the high current density that power wires do. Hence, configurations with two micro-bumps are suitable for them. In addition, high-speed I/Os are sensitive to parasitic loading, so minimizing the number of TSVs in use helps to improve signal integrity. From the power delivery point of view, the use of combo-bumps greatly simplifies the analysis of current flow. In other words, the power delivery network within the interposer can be considered a feed-through path in which the current coming into the C4 bump is carried directly to the micro-bumps through the combo-bump structure.

Figure 5.25 summarizes the design flow with combo-bumps. Starting with the interposer, TSVs are inserted at the locations specified by the bump file. Then the micro-bumps are inserted around each TSV according to the micro-bump configuration. The adapter formed by the RDLs and the vias is inserted at the same location. In the last step, the power meshes are formed and the signal routing can be created. The SoC bump locations can be derived from the interposer after flipping, rotation, and proper shifting. Dummy micro-bumps are needed where an electrical connection is not needed or not allowed. Probe pads are inserted to perform full-chip testing of the SoC chip. IO pads are placed around the chip according to their corresponding combo-bump locations. Signal pads and power pads are interleaved to provide adequate IO driving capability and to ease RDL routing congestion. Finally, automatic RDL routing can be performed by the EDA tool, with custom fine-tuning if needed.

This design flow is considered a package-driven top-down flow. Once the bump file is decided, the SoC and the interposer designs can proceed separately. Alternatively, a bottom-up design starts with the IO pad assignment for the SoC. The bump locations are adjusted according to the IO pad assignment.
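The step of deriving SoC bump locations from the interposer by flipping, rotation, and shifting can be sketched as a simple coordinate transform. This is only an illustration under stated assumptions: the function name, the choice of mirroring about the y-axis, the restriction to multiples of 90 degrees, and the order of operations (flip, rotate, shift) are all hypothetical and would follow the actual bump file conventions in practice.

```python
def interposer_to_soc(bumps, rotate_deg=0, shift=(0.0, 0.0)):
    """Map interposer micro-bump coordinates to SoC bump coordinates.

    Assumed order: mirror about the y-axis (the die faces the interposer),
    rotate counter-clockwise by a multiple of 90 degrees, then translate.
    """
    if rotate_deg % 90 != 0:
        raise ValueError("rotation must be a multiple of 90 degrees")
    turns = (rotate_deg // 90) % 4
    result = []
    for x, y in bumps:
        x = -x                       # flip
        for _ in range(turns):
            x, y = -y, x             # one 90-degree counter-clockwise turn
        result.append((x + shift[0], y + shift[1]))
    return result

# Hypothetical micro-bump coordinates in micrometers.
interposer_bumps = [(1.0, 2.0), (41.0, 2.0), (1.0, 42.0)]
print(interposer_to_soc(interposer_bumps, rotate_deg=90, shift=(10.0, 10.0)))
```

Since the transform is a rigid mapping, bump pitch is preserved, which is what allows the SoC and interposer designs to proceed separately once the bump file is fixed.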
The interposer design can only begin after the SoC bump placement is completed. As a result, this approach

Figure 5.25 Design flow with combo-bumps. Interposer steps: bump assignments, insert micro-bumps, insert RDL acceptors, power/signal routing (labels: μ-bump, RDL, TSV). SoC steps: map and insert pads, probe pads, IO pad assignment, IO pads, RDL routing.

offers more flexibility to the SoC design, which is easier to optimize than with the fixed-bump-location approach.

5.5.4 Chip-Package Co-Design for Stacked ICs

Figure 5.26 shows the system containing the dies, the interposer, the substrate, the printed circuit board (PCB), and the heat sink. The routing dimensions in an advanced chip are submicron nowadays, while they are typically tens of microns in the substrate and hundreds of microns on the PCB. With such a wide spread of physical dimensions, it is hard to fit all the designs into one simulation environment with a high accuracy target. Thus, chip-package co-design becomes necessary to limit each design space and to extract the key information that the other domains need to model efficiently. For CoWoS, the distance between the dies is comparable to, or even smaller than, the height of the interposer. Therefore, the temperature effect of the heat generated by a power-hungry SoC has to be considered for a DRAM chip sitting next to it. Figure 5.27 shows the flow diagram for chip-package co-design. The first step is to create the initial designs at the IC, package, system, and board level, respectively. Each design can then be extracted or imported into its own simulation environment. A key concept for the co-design is that one design parameter

Figure 5.26 The system view of CoWoS. Labels: DRAM, SoC, interposer, substrate, PMIC, PCB, trace.

Figure 5.27 Chip-package co-design flow. Blocks: IC design, package design, system design, PCB design, create thermal model, chip thermal model, thermal simulation, thermal profile, power map, boundary condition, temperature, self-heating power, DC IR drop on trace/via, update R and EM limit.

affects another, so the interaction between the design parameters has to be considered. For example, to perform thermal simulation at the package level, a chip thermal model has to be created first. The chip thermal model contains a temperature-dependent power database. Initially, a fixed junction temperature is assumed for the chip thermal model. After thermal simulation is completed with the package design and the chip thermal model, the thermal profile can be generated, and the junction temperature can be fed back to the chip thermal model. It takes a few iterations to reach a steady-state junction temperature for the given design. The thermal profile can be used to update the resistance of the power delivery network, which is temperature dependent, and the EM limit, which is affected by the junction temperature as well. On top of that, the system design provides the overall thermal simulation based on the converged power map of the package-level thermal simulation. The result of the system-level simulation can provide a more accurate boundary condition for the individual components. At the same time, the PCB design can be used to simulate the DC IR drop on the traces. A higher IR drop means that more heat is dissipated by the traces, which affects the overall thermal condition of the system in addition to the package containing the active components. Conversely, the temperature change of the system can be used to update the PCB simulation, which in turn changes the IR drop on the traces. As more co-analyses are considered, the accuracy of the simulation results increases at the expense of design complexity.
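The feedback between the chip thermal model and the package-level thermal simulation described above amounts to a fixed-point iteration on the junction temperature. The sketch below illustrates only the loop structure. Both models are toy stand-ins: the exponential leakage model, the single thermal resistance, and every numeric value are assumptions for illustration, not data from the text or from any tool.

```python
import math

def chip_power(t_junction, p_dynamic=20.0, p_leak_ref=2.0, t_ref=25.0, k=0.02):
    """Toy chip thermal model: power (W) grows with temperature through leakage."""
    return p_dynamic + p_leak_ref * math.exp(k * (t_junction - t_ref))

def package_thermal_sim(power, t_ambient=25.0, r_theta=1.0):
    """Toy package simulation: one junction-to-ambient thermal resistance (K/W)."""
    return t_ambient + r_theta * power

def converge_junction_temperature(t_init=85.0, tol=0.01, max_iters=50):
    """Iterate power -> temperature -> power until the temperature settles."""
    t = t_init                       # initial fixed junction temperature guess
    for iteration in range(1, max_iters + 1):
        t_new = package_thermal_sim(chip_power(t))
        if abs(t_new - t) < tol:
            return t_new, iteration
        t = t_new
    raise RuntimeError("no convergence; the power/temperature loop may run away")

t_steady, n_iters = converge_junction_temperature()
print(f"steady-state junction temperature: {t_steady:.2f} C after {n_iters} iterations")
```

In a real flow each function call would be a full simulation run, so the small number of iterations needed for convergence is what keeps this loop practical.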

5.5.5 Interposer Multi-Die ESD Protection Scheme

ESD protection is an important topic for semiconductor devices since transistors are getting smaller while the required ESD thresholds remain the same. The human body model (HBM), machine model (MM), and charged-device model (CDM) are common tests that need to be applied before parts can be shipped. While HBM and MM charges are applied externally, CDM charges accumulate during manufacturing and are discharged only once an I/O pin makes external contact. Because of this, most ESD protection aimed at HBM or MM cannot also protect against CDM when the resistive path between the charge and the ESD device is too long. For CoWoS, the ESD specification is summarized in Table 5.5. Among the ESD models, CDM is considered the most relevant failure mechanism [5]. The difference between HBM and MM is that, while the HBM voltage is much higher, its instantaneous peak current is much lower because of the high source resistance in HBM. The time constant for MM is also shorter, so it requires a faster discharge path for protection.

Table 5.5 CoWoS ESD specification for various ESD models.

Clamp cell             ESD targeted level (HBM/MM/CDM)
Type 1 (micro-bump)    –/–/250 V
Type 2 (probe pad)     1 kV/30 V/250 V
Type 3 (C4 bump)       2 kV/100 V/500 V
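The contrast between HBM and MM stress can be illustrated with first-order estimates. The sketch below assumes the commonly used HBM discharge network (100 pF through 1.5 kΩ) and treats the MM source as 200 pF discharging through a small series inductance with negligible resistance. These component values are assumptions that do not come from the text, and the inductance in particular is only an illustrative figure. The voltages are the Type 3 levels from Table 5.5.

```python
import math

# Assumed discharge networks (not taken from the text).
HBM_C, HBM_R = 100e-12, 1500.0     # human body model: 100 pF through 1.5 kOhm
MM_C, MM_L = 200e-12, 0.75e-6      # machine model: 200 pF, illustrative 0.75 uH

def hbm_peak_current(voltage):
    """Peak current is limited by the series resistance of the source."""
    return voltage / HBM_R

def hbm_time_constant():
    """RC decay time constant of the human body model discharge."""
    return HBM_R * HBM_C

def mm_peak_current(voltage):
    """Ideal undamped LC discharge: I_peak = V * sqrt(C / L)."""
    return voltage * math.sqrt(MM_C / MM_L)

def mm_time_to_peak():
    """A quarter of the LC oscillation period."""
    return 0.5 * math.pi * math.sqrt(MM_L * MM_C)

print(f"HBM 2 kV : {hbm_peak_current(2000.0):.2f} A, tau = {hbm_time_constant() * 1e9:.0f} ns")
print(f"MM 100 V : {mm_peak_current(100.0):.2f} A, peak at {mm_time_to_peak() * 1e9:.0f} ns")
```

Under these assumptions the MM event reaches a comparable or higher peak current at a twentieth of the voltage and on a much shorter time scale, which is consistent with the need for a faster discharge path.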


The general rule for ESD protection is to provide a bypass path for the zap current to the external reference so that current through the vulnerable devices is avoided. ESD devices to both the supply and ground nodes should be included for each input pin of the IO drivers to maximize protection. An ESD clamp between supply and ground should be placed close to the power bumps to bypass zaps through the power bumps as well. The difference between the 2.5D interposer and a single die is the off-chip interface involving the interposer, C4 bumps, micro-bumps, and TSVs. For internal die-to-die signals, the requirement for ESD protection can potentially be lowered. On the other hand, adopting typical ESD protection for these IO interfaces is possible but may not be optimal in terms of system performance. With a wide memory IO interface such as HBM, the power and area penalties can be significant. For the interposer, each die has its own supply domain, and the scenarios, ordered by the required ESD protection, are as follows:

1. Share the same VDD and the same VSS at the interposer: The existing ESD clamp should be enough to cover the potential ESD events. No additional devices need to be added at the cost of IO performance. To ensure that no excessively high voltage develops along the power delivery network, the resistance from the target pin to the clamp should be kept below 1 Ω.

2. Share the same VSS but not VDD: Cross-supply-domain signaling is very common in SoC design, with various IPs and cores talking to each other. Different functional blocks might require different interface circuitry, which is vulnerable to cross-supply-domain ESD stress. Sometimes separate power domains are required for noise coupling considerations. While the ESD clamp between the supply pins can be used to bypass the peak current, the transistors may still be under stress from the high potential imposed across the gate oxide. Figure 5.28 illustrates the current flow when ESD is applied at VDD1. At the driver side, if the components in the shaded box are ignored for now, n1 rises with VDD1 and creates a high voltage across transistor Mp1 or Mn1, depending on whether VDD2 or VSS is grounded externally. This is due to the hold voltages of the ESD clamps as well as their finite turn-on resistances. The resistor and two diodes form a resistor–clamp interface to mitigate the voltage stress [6]. As long as the resistance Rin is much greater than the diode on-resistance Rd, a voltage attenuation ratio proportional to Rd/Rin can be achieved.

3. Share neither the VDD nor the VSS: In some situations where the circuit block is highly sensitive to power supply noise, the designer would like to isolate not only the supply domain but also the ground domain. For example, an analog PLL is a common element that is supplied separately from the digital blocks. In this case, extra protection is required since the resistive path for the ESD current is usually very long. In CoWoS, connections through back-to-back diodes to the global ESD bus are required, as shown in Figure 5.29. The global ESD bus is connected at the front side of the interposer to minimize the resistance. At each chip, the global ESD bus is connected to the supply domains by the ESD clamps. In the event of an ESD zap, the ESD current can flow through the global ESD bus in either direction.
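The attenuation provided by the resistor–clamp interface in scenario 2 can be viewed as a simple resistive divider. The sketch below is a first-order illustration only: it ignores the diode turn-on voltage and the clamp hold voltage, and the function name and numeric values are hypothetical.

```python
def gate_stress_voltage(v_node, r_in, r_d):
    """Voltage left at the protected gate after the Rin / Rd divider."""
    return v_node * r_d / (r_in + r_d)

# Hypothetical values: 10 V at node n1, Rin = 200 ohm, diode on-resistance 2 ohm.
v_gate = gate_stress_voltage(10.0, 200.0, 2.0)
print(f"gate stress = {v_gate:.3f} V (Rd/Rin estimate: {10.0 * 2.0 / 200.0:.3f} V)")
```

When Rin is much larger than Rd, the exact divider and the Rd/Rin estimate agree closely, which is the regime the text refers to.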

Figure 5.28 Cross-supply-domain ESD protection. Labels: VDD1, VDD2, VSS, Mp1, Mn1, node n1, ESD clamps, IESD.

Figure 5.29 Interposer ESD protection with global VSS bus. Labels: VDD1, VDD2, VSS1, VSS2, ESD clamps, global ESD bus, interposer.

EDA tools are used to compute the point-to-point resistance to make sure that the resistance of the power supply network is under the 1 Ω threshold. However, with multiple dies and the interposer involved in the power delivery network, the point-to-point resistance may not be easily available. Instead, one can partition the threshold, assign a scaled threshold to each die, and check the individual parts separately. The disadvantage of this workaround is that the result is pessimistic compared with the original threshold, although this can be alleviated by iterating the ratio allocated among the dies after each part is analyzed. To minimize the resistance, the C4 power bumps should be located directly below each die rather than wired in by RDL from outside it. In order to


avoid floating nets on the interposer, the metal and vias connected to a TSV should have at least one micro-bump connection to the die. Other general guidelines recommended to reduce ESD failures include increasing the number of vias, widening the metal, and keeping the routing from one side of the interposer to the other as short as possible.
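The budget-partitioning workaround for the 1 Ω point-to-point resistance check can be sketched as follows. The sketch assumes that the parts lie in series along the ESD path, so that their resistances add. The part names, weights, and resistance values are hypothetical.

```python
ESD_RESISTANCE_LIMIT = 1.0   # ohm, the threshold quoted in the guideline

def partition_budget(weights, limit=ESD_RESISTANCE_LIMIT):
    """Split the total limit among the parts in proportion to the given weights."""
    total = sum(weights.values())
    return {name: limit * w / total for name, w in weights.items()}

def check_parts(measured, budgets):
    """Check each part against its own scaled threshold."""
    return {name: measured[name] <= budgets[name] for name in budgets}

def rebalance(measured, limit=ESD_RESISTANCE_LIMIT):
    """Reallocate the budget in proportion to the measured resistances.

    Returns None when the total already exceeds the limit, in which case
    no reallocation can make every part pass.
    """
    total = sum(measured.values())
    if total > limit:
        return None
    return {name: limit * r / total for name, r in measured.items()}

# Hypothetical per-part resistances (ohm) from separate extractions.
measured = {"soc": 0.40, "interposer": 0.25, "hbm": 0.20}
first_pass = check_parts(measured, partition_budget({"soc": 1, "interposer": 1, "hbm": 1}))
second_pass = check_parts(measured, rebalance(measured))
print("equal split :", first_pass)
print("rebalanced  :", second_pass)
```

With an equal split the SoC fails its share even though the total of 0.85 Ω is within the limit, which shows the pessimism mentioned in the text; iterating the allocation removes it.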

5.6 TSMC Reference Flows

To guide a smooth transition from existing 2D-IC design to stacked design with minimum disruption, TSMC provides reference flows for the CoWoS technology. They address emerging problems with silicon-validated design solutions and with support from several EDA vendors. Table 5.6 summarizes the reference flow solutions and the EDA readiness. The reference flows can be categorized into designs for the top chip, the silicon interposer, the substrate, and the overall stacked system.

Table 5.6 CoWoS reference flow solutions.

EDA vendors: Ansys, Cadence, Mentor, Synopsys

CoWoS                         Capability
Top chips                     Bump assignment/RDL routing
                              Wafer-level DFT and BIST
                              Micro-bump DRC
Wafer (silicon interposer)    Bump alignment and silicon wafer routing
                              Bump assignment, routing, and multi-dies aware custom design
                              DRC
                              LVS
                              RC extraction
                              SSN (SI/PI) RLC extraction
                              PDN RLC extraction
Substrate                     Substrate routing
                              Substrate RLC extraction
Stacked system                Inter-die LVS/DRC
                              Integrated IR/EM
                              Thermal dissipation analysis
                              SSN (SI/PI)
                              PDN simulation
                              Package-level DFT and BIST
                              Stacking STA/multi-tech simulation


5.7 Conclusion

Physical design for 3D-ICs involves implementation, model validation, design verification, and electrical analysis. In each design phase, the stacked system provides new opportunities along with the challenges that come with them. The designer should not only focus on the individual physical and electrical sign-off of the dies but also pay greater attention to the overall system impact, since the dies are in closer proximity to each other. The use of TSVs and micro-bumps allows inter-die connections with tighter margins. To exploit those margins, the methodologies for device modeling, interface simulation, and system sign-off conditions need to be redefined relative to the traditional 2D flow. For the CoWoS interposer, physical design guidelines are provided for applications such as HBM integration. The design guidelines start with physical design components such as the combo-bump, interposer routing patterns, and ESD rules and conclude with chip-package co-design involving the substrate and the PCB. Finally, the TSMC reference flows were introduced to summarize the existing resources in the interposer design ecosystem.

References

1 ISSCC (2014). ISSCC trend.
2 Bakoglu, H.B. (1990). Circuits, Interconnects and Packaging for VLSI. Addison-Wesley Publishing Company, Inc.
3 Minges, M.L. (1989). Electronic Materials Handbook: Packaging. ASM International.
4 Stephens, R. (2004). Jitter Analysis: The dual-Dirac Model, RJ/DJ, and Q-Scale. Retrieved from https://www.keysight.com/upload/cmc_upload/All/dualdirac1.pdf.
5 JEDEC JESD22-C101E (2009). Field Induced Charged-Device Model Test Method for Electrostatic Discharge Withstand Thresholds of Microelectronic Components. JEDEC.
6 Worley, E. (2006). Distributed gate ESD network architecture for inter-power domain signals. In: Proceedings of EOS/ESD Symposium, 196–204.


6 Design and CAD Solutions for Cooling and Power Delivery for Monolithic 3D-ICs

Sandeep Samal and Sung K. Lim

Georgia Institute of Technology, Electrical & Computer Engineering, 791, Atlantic Dr NW, Atlanta, GA 30332, USA

6.1 Introduction

The advent of 3D-IC technology has opened up the potential for greatly improved circuit designs. 3D-ICs help to continue Moore's law. Their major advantages include reduced interconnect length, leading to lower power and higher speed, as well as reduced footprint area, high bandwidth, and integration of heterogeneous technology flavors. Through-silicon vias (TSVs) enable the vertical integration of separate dies to form a single 3D chip. However, TSVs consume a lot of area and have very large capacitance. This restricts the total number of TSVs and the types of circuits that can be used. Therefore, the greater benefits of 3D-ICs are masked by these negative characteristics of TSVs. The recently developed monolithic 3D integration technology [1] enables sequential integration of device layers, in contrast to the bonding of fabricated dies. Monolithic 3D integration uses nanoscale monolithic inter-tier vias (MIVs) to connect the vertical device layers. MIVs are similar to regular metal-layer vias, and their capacitance and area are negligible compared with those of TSVs, which are micron scale. This allows many MIVs to be used for vertical connections, which enables significantly higher integration density than that of TSV-based 3D-ICs. A side view of a typical two-tier monolithic stackup with seven metal layers in each tier is shown in Figure 6.1. The device layer is around 30 nm thick, and the inter-tier dielectric (ILD) that separates the tiers is about 100 nm thick.

Monolithic 3D-ICs can overcome the shortcomings of TSV-based 3D-ICs; however, one major concern with 3D-ICs in general is the increase in power density (PD), which leads to high temperatures. The reduction in footprint area effectively increases the PD by the same factor. Even with the power reduction in 3D-ICs, the increased PD affects the temperature, especially in the layers away from the heat sink or other equivalent cooling features in modern miniaturized electronics. Therefore, thermal-aware design methodologies become more critical in 3D-ICs. The major bottleneck in considering thermal aspects within the physical design process is the large runtime overhead required

Handbook of 3D Integration: Design, Test, and Thermal Management, First Edition. Edited by Paul D. Franzon, Erik Jan Marinissen, and Muhannad S. Bakir. © 2019 Wiley-VCH Verlag GmbH & Co. KGaA. Published 2019 by Wiley-VCH Verlag GmbH & Co. KGaA.


Figure 6.1 Side view of two-tier monolithic 3D-IC structure (with seven metal layers in each tier). Labels: top tier m1 to m7, bottom tier m1 to m7, MIV, ILD 100 nm, active layer 30 nm, handle bulk (75 µm).

for accurate temperature analysis. Including such detailed analysis within the design process is not practically feasible. Prior works have tried to develop accurate temperature evaluation models to be included within the chip design process. Cong et al. used a compact resistive thermal grid network to estimate the temperature profile of a chip [2]. They estimate the temperature during the floor planning process and insert whitespace for dummy vias. Solving the resistive network still consumes non-negligible runtime, and the insertion of whitespace increases the area further, diminishing the 3D-IC benefits. They report a 56% reduction in temperature but with a large area increase of 21%. Optimizing the silicon area is as important in 3D-ICs as reducing the temperature, and too much area overhead is not an acceptable price for a temperature improvement. Hsu et al. used the conductivity of copper in TSVs to facilitate heat flow by stacking more signal TSVs [3]. Juan et al. studied the modeling of temperature based on total leakage power dissipation and used it in the tier planning of processor chips with similar layouts [4]. In another work, Hung et al. used 3D overlap estimation along with PD calculations for thermal-aware planning [5]. All these methods are targeted at TSV-based 3D-IC design, incur extra runtime and area, or use indirect methods of thermal analysis. To justify the overall advantages of monolithic 3D over 2D, thermal-aware 3D-IC design is necessary. There has been no comprehensive study of the thermal characteristics of monolithic 3D-ICs with respect to vertical tier-to-tier coupling, nor has a compact temperature model been developed for such technologies. The PDN is another important aspect of any design, and because of the 3D stacking of tiers with very high integration density in monolithic 3D-ICs, optimal PDN design becomes much more important.
There have been very few studies on 3D PDN design specific to monolithic 3D-ICs. Wei et al. have shown that the PDN helps to reduce the temperature in monolithic 3D-ICs, using an OpenSPARC T2 processor core as an example [6]. Though the presence of the PDN improves the thermal conductivity and reduces the maximum temperature, they assume the same power dissipation of the blocks with or without the PDN, ignoring


its impact on increased congestion during signal routing. This congestion results in increased signal wirelength and hence increased net power, especially in advanced technology nodes. Also, their power simulation was carried out at the architectural level. In other work on power supply in monolithic 3D-ICs, Xie et al. focus only on cross-power-domain interface design for multiple power domains and do not study any PDN designs [7]. Other 3D PDN works cover only TSV-based 3D-ICs or 3D PDN simulation and analysis techniques. A 3D-IC floor plan and power–ground co-synthesis tool is developed in [8]. However, only block-level floor planning and power/ground design for TSV-based 3D designs are considered. There is no discussion of the PDN inside the blocks. The intra-block wirelength heavily dominates the total wirelength in any design. Therefore, the routing resource usage inside the blocks needs to be considered to obtain a correct estimate and the best PDN design. In other recent work, Luo et al. developed 3D-IC PDN benchmarks for research purposes [9]. They covered various sizes of 3D designs, but all of them are TSV-based and at the block level. System-level power delivery design for 2D and TSV-based 3D-ICs has been compared in [10]. None of the above works consider the full-chip impact of PDN design on monolithic 3D-ICs. Gate-level monolithic 3D-ICs present great power and performance benefits, and it is therefore important to study any factors influencing these benefits.
This chapter discusses the thermal characteristics of monolithic 3D-ICs in comparison with TSV-based 3D-ICs. The factors affecting the chip temperature are identified, and the development of a very fast and accurate nonlinear regression-based temperature evaluation model for monolithic 3D-ICs is explained. The impact of the PDN on the QoR of gate-level monolithic 3D-IC designs and the various PDN design techniques that reduce its adverse effects are also studied. In particular, it is demonstrated that the PDN affects monolithic 3D-ICs more severely than 2D-ICs in terms of wirelength increase and hence interconnect power overhead. Simple yet effective PDN design techniques are studied to reduce its impact on wirelength and power without exceeding the IR drop budget. Lastly, efficient PDN design guidelines specific to monolithic 3D-ICs are summarized.

6.2 New Thermal Issues in Monolithic 3D-ICs

A typical two-tier monolithic stackup is shown on the left in Figure 6.2. The stackup is shown in a flip-chip configuration to explain the thermal characteristics, including the package configuration with a heat sink on top. The first set of transistors, closer to the handle bulk, is processed with a standard silicon-on-insulator (SOI) process and makes up Tier 1. A thin ILD separates the two tiers, and devices are fabricated in the next layer. This device layer along with the metal layers makes up the other tier (Tier 0) of the 3D stackup. The transistors in these layers are processed with low temperature process (85 °C, the mentioned lateral and vertical temperature variations cause significant differences in the required refresh rate of each DRAM bank ( 303 μΩ−1 in this experiment), the RO will stop oscillation. In summary, different types of faults manifest themselves with different syndromes in the signature space, and so fault type diagnosis can be performed. In a test and diagnosis flow using this method, one copy of the abovementioned test structure is inserted at each TSV during the design stage, and then the signature of each TSV, i.e., (T_REF, T_SLOW), is measured one by one during the test stage. If there is no RO oscillation for a TSV, then we conclude that there is a catastrophic fault associated with the RO (such as an open or stuck-at fault) or there is a severe leakage fault in that TSV. Otherwise, one can further perform outlier analysis [20] on all the other TSVs’ signatures to detect whether a resistive open fault or a leakage fault has occurred.

10.2.4.5 Impact of Process Variation

Under process variation, the signatures of different fault-free TSVs may spread out over a region in the signature space, making fault detection more difficult. Figure 10.11 shows the impact of process variation. The fault-free signatures now spread over a range of (1.1 ns, 1.5 ns) along the horizontal T_REF axis and (1.2 ns, 1.65 ns) along the vertical T_SLOW axis. However, the fault-free signatures tend to cluster in a linear region, and thus a linear regression model can be derived. The distances of all fault-free signature samples from this linear regression model can then be treated as a statistical variable with mean (μ) and standard deviation (σ). Notably, the ISA, which maps the electrical features of the TSVs onto a two-dimensional signature space, is essential for effective fault detection when process variation is considered. If only one oscillation period is measured per TSV without using the proposed ISA technique, then it is like projecting the signatures of Figure 10.11 onto the one-dimensional T_REF axis. Under such a mapping, the resulting one-dimensional signatures of the faulty TSVs will overlap heavily with those of the fault-free TSVs, making fault detection almost impossible.

Figure 10.11 Signatures with process variation. (Plot of T_SLOW (ns) versus T_REF (ns) showing the ideal fault-free signatures, the fault-free signatures with C_TSV variation along the fault-free profile, the resistive open fault signatures (larger R_fault), and the leakage fault signatures (larger G_leak).)
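The regression-and-distance idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample signatures are made up, the distance is taken as the signed perpendicular distance from the fitted line, and the 3σ decision rule (parameter `k`) is a common convention rather than something the chapter specifies. All function names are hypothetical.

```python
import math
import statistics

# Hypothetical fault-free (T_REF, T_SLOW) signatures in ns, for illustration only.
FAULT_FREE = [
    (1.10, 1.21), (1.20, 1.29), (1.30, 1.41), (1.40, 1.49),
    (1.50, 1.61), (1.15, 1.25), (1.35, 1.45), (1.45, 1.55),
]

def fit_line(points):
    """Ordinary least-squares fit of T_SLOW = a * T_REF + b."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    a = sxy / sxx
    return a, mean_y - a * mean_x

def distance(point, a, b):
    """Signed perpendicular distance of a signature from the line y = a*x + b."""
    x, y = point
    return (y - (a * x + b)) / math.hypot(a, 1.0)

def build_model(fault_free):
    """Return (a, b, mu, sigma) learned from fault-free signatures."""
    a, b = fit_line(fault_free)
    dists = [distance(p, a, b) for p in fault_free]
    return a, b, statistics.fmean(dists), statistics.pstdev(dists)

def is_outlier(point, model, k=3.0):
    """Flag a signature whose distance lies more than k*sigma from mu (k is an assumption)."""
    a, b, mu, sigma = model
    return abs(distance(point, a, b) - mu) > k * sigma

MODEL = build_model(FAULT_FREE)
print(is_outlier((1.25, 1.35), MODEL))  # signature close to the fitted line
print(is_outlier((1.25, 1.60), MODEL))  # T_SLOW far above the fitted line
```

Working in the two-dimensional space is the point of the sketch: a signature can sit inside the fault-free T_REF range and still be flagged because it lies far from the regression line.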

10.3 Post-Bond Interconnect Testing

Once multiple functional dies have been stacked or bonded together, the overall manufacturing and assembly quality needs to be ensured by chip-level testing (of both the internal circuits and the die-to-die interconnects). During this post-bond stage, a die-to-die interconnect is accessible at both ends, and testing focuses on checking whether the end-to-end delay (including the gate delay of its driver and the wire delay across the entire interconnect) falls within a specified range. If it does, the interconnect is classified as fault-free; otherwise, it is classified as faulty. Notably, this allowable delay range is sometimes hard to decide in advance. In that case, statistical outlier analysis can be applied to identify any interconnect whose delay deviates abnormally from its nominal value. In other words, testing for a parametric fault on a post-bond interconnect may sometimes require delay characterization techniques that produce delay estimates for all interconnects under test before fault detection is performed on these estimates, leading to a characterization-based test methodology.

10.3.1 Direct Measurement

The delay across an interconnect could be measured by a method [21] illustrated in Figure 10.12. A low-speed clock signal is used to drive the interconnect while a two-step measurement is conducted. In step 1, the interconnect delay is converted into a pulse width by an XOR gate. Here, the interconnect delay refers to the delay from the driver’s input, node A, to the end point of the interconnect, node WO. In Figure 10.12, node A and node WO are connected to the inputs of the measurement circuit, denoted as node X and node Y, respectively. In general, the signal arriving at node WO is a delayed version of the signal at node A, and likewise the signal at node Y is a delayed version of that at node X. When the two signals at nodes X and Y are fed to an XOR gate, the result is a pulse train at node P, with the pulse width positively correlated to the delay to be measured. In step 2, a time-to-digital converter (TDC) converts the pulse width at node P into a digital code as the final output.

Although this measurement is simple and elegant for bus signals within a planar IC, it may not be suitable for die-to-die interconnects due to a hard-to-satisfy requirement. To achieve high accuracy, the delays of the two relay paths (i.e., from node A to node X and from node WO to node Y) need to be equal so that they cancel each other at the measurement circuit. This requirement is especially hard to satisfy for a die-to-die interconnect, as these two relay paths are physically spread across two different dies.

Figure 10.12 Direct measurement of the delay across an interconnect. (The driver input A and the interconnect end point WO are relayed to nodes X and Y of the delay measurement circuit through two relay paths, A→X and WO→Y; the XOR output P carries a pulse whose width represents the delay of the interconnect under measurement, and a time-to-digital converter turns it into a digital code.)

10.3.2 Voltage-Divider-Based Test

Testing whether a die-to-die interconnect has a parametric open defect can be achieved by a test method [22] illustrated in Figure 10.13.

Figure 10.13 Voltage-divider-based test method for post-bond interconnects. (Driver side with a multiplexer selecting between the functional input and the test input “0” under control signal TM; die-to-die interconnect modeled by R_TSV, R_open, and C_TSV; receiver side with observation node WO, a footer transistor enabled by EN, and a comparator against V_ref producing the pass/fail result.)

At the driver side, a multiplexer is added to support the test mode input, which is tied to logic 0. At the receiver side, a footer transistor is added between the input of the receiver (labeled as node WO) and ground. In the following discussion, node WO is also called the observation node, and it is further connected to the negative input of an analog comparator. The positive input of that analog comparator is driven by a constant reference voltage signal, denoted as V_ref. The pass-or-fail decision is made by comparing the voltage level at the observation node WO with V_ref. In test mode (in which control signal TM is 1 and EN at the receiver side is 1), the pull-up transistor of the driver is turned on, and its on-resistance, the resistance along the interconnect (including R_TSV and R_open), and the on-resistance of the footer transistor at the receiver side form a voltage divider. Depending on the value of R_open, a corresponding voltage level is set up at the observation node WO, and one of the following two outcomes is produced:

(1) When the interconnect is fault-free, R_open is very small, and the voltage at node WO is higher than the applied V_ref, the test result is 1, indicating a passing condition.
(2) When the interconnect is faulty and R_open is large enough to pull the voltage at node WO below V_ref, the test result becomes 0, indicating a failing condition.

In this voltage-divider-based test method, the size of the footer transistor for each interconnect and the voltage level of the global reference voltage, V_ref, must be selected carefully to set up a proper test threshold value in terms of the faulty resistance R_open for each interconnect. Whenever the resistance of the interconnect under test is larger than the test threshold value, the test result will become 0 to indicate a failing condition.

10.3.3 Pulse-Vanishing Test (PV Test)

In this subsection, we discuss another post-bond interconnect test method called the pulse-vanishing test (or PV test for short), originally proposed in [23]. As illustrated in Figure 10.14, it applies a short-duration pulse signal at the driving end of the interconnect, with the pulse width roughly equal to the system clock cycle time. Here, the system clock is a high-speed clock (e.g., 1 GHz) used for functional operations. If the interconnect is fault-free, as illustrated in Figure 10.14a, then the pulse signal manages to arrive at the receiving end of the interconnect (denoted as WO). This pulse waveform is said to have survived the journey through the interconnect and will be restored to full swing after passing a logic gate at the receiving end. On the other hand, if the interconnect is faulty with an excessively large resistance, as illustrated in Figure 10.14b, then the pulse signal may vanish altogether. This is because a faulty interconnect could be so resistive that the rising transition at node WO is too slow to ever reach a level above the threshold voltage of the receiver. As a result, the pulse waveform at node WO is treated as a glitch and filtered out by the receiver, leading to a no-pulse situation at the receiver’s output (i.e., node B), which indicates a failing condition.
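To make the vanishing mechanism concrete, the sketch below models node WO as a single lumped RC node charged through the driver and a series fault resistance. This first-order model and every component value in it are assumptions chosen for illustration; they are not taken from the chapter, and a real interconnect would need a distributed model and circuit simulation.

```python
import math

# All component values below are illustrative assumptions, not taken from the chapter.
VDD = 1.8            # supply voltage (V)
V_TH = 0.9           # receiver switching threshold (V)
C_WIRE = 500e-15     # lumped interconnect plus receiver capacitance (F)
R_DRIVER = 500.0     # driver on-resistance (ohm)
PULSE_WIDTH = 1e-9   # test pulse width, about one system clock cycle (s)

def peak_voltage(r_fault):
    """Voltage reached at node WO when the pulse ends, starting from 0 V."""
    tau = (R_DRIVER + r_fault) * C_WIRE
    return VDD * (1.0 - math.exp(-PULSE_WIDTH / tau))

def pulse_survives(r_fault):
    """True if the rising transition at WO crosses the receiver threshold in time."""
    return peak_voltage(r_fault) >= V_TH

def max_fault_resistance():
    """Largest series fault resistance for which the pulse still reaches V_TH.

    Solves VDD * (1 - exp(-T / (R * C))) = V_TH for the total resistance R.
    """
    r_total = PULSE_WIDTH / (C_WIRE * math.log(VDD / (VDD - V_TH)))
    return r_total - R_DRIVER

for r_fault in (0.0, 1000.0, 4000.0):
    verdict = "pass" if pulse_survives(r_fault) else "fail"
    print(f"R_fault = {r_fault:6.0f} ohm -> peak {peak_voltage(r_fault):.3f} V ({verdict})")
print(f"pulse vanishes above about {max_fault_resistance():.0f} ohm")
```

In this model the detectable fault resistance is set by the pulse width: a narrower pulse lowers the resistance at which the pulse vanishes, and a wider one raises it.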

Figure 10.14 The concept of pulse-vanishing test for an interconnect. (a) Fault-free case: a surviving pulse crosses the threshold at WO and appears at B (pass). (b) Faulty case with excessive resistance: the waveform at WO stays below the threshold, so no pulse appears at B (fail, “0”).

Figure 10.15 Primitive design-for-testability circuitry supporting PV test. (A launch cell with a launch FF at the driver side, selected by TM; a capture cell with a capture FF at the receiver side; and a PV-test controller supplying a double-pulse signal shared by all launch cells. Nodes A and B are “0” initially; the system clock cycle time is, e.g., 1 ns.)

For a given interconnect under test, the PV test can be supported by simple DfT circuitry in its primitive form as shown in Figure 10.15, where IN denotes the original functional input. There are two types of cells inserted to support delay testing, namely, the launch cell at the driver side and the capture cell at the receiver side. The launch cell is responsible for launching the required pulse signal as the test stimulus in the test mode, while the capture cell is used to detect if there is an arriving pulse signal within a designated test clock cycle. Each of these two cells incorporates an FF, namely, the launch FF and the capture FF, respectively. Before each PV-test cycle, initialization is required – i.e., both node A (i.e., the input of the driver) and node B (i.e., the output of the receiver) need to be set to 0. For the launch cell, this is achieved by the asynchronous reset of the launch FF. For the capture cell, this is achieved by scan-shifting an all-0 pattern to the capture FFs, as discussed later.
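The initialization and launch/capture behavior can be sketched as a small behavioral model. It assumes the arrangement the chapter describes for these cells: a toggle-type launch FF clocked by the two rising edges of the double-pulse signal, and a capture FF whose data input is tied to 1 and whose clock is the receiver output. Class and function names are hypothetical, and whether the pulse actually reaches node B is supplied as an input rather than simulated.

```python
from dataclasses import dataclass

@dataclass
class LaunchCell:
    a: int = 1  # node A (driver input); arbitrary value before initialization

    def async_reset(self):
        """Asynchronous reset of the launch FF forces node A to 0."""
        self.a = 0

    def clock_edge(self):
        """Toggle-type FF: each rising edge of the double-pulse signal flips A."""
        self.a ^= 1

@dataclass
class CaptureCell:
    q: int = 1  # capture FF state; arbitrary value before initialization

    def scan_init(self):
        """Scan-shifting an all-0 pattern leaves the capture FF at 0."""
        self.q = 0

    def receiver_rising_edge(self):
        """A pulse at node B clocks the FF, which loads its data input (tied to 1)."""
        self.q = 1

def run_pv_session(pulse_reaches_b):
    """Return one pass (1) / fail (0) bit per interconnect.

    pulse_reaches_b: iterable of booleans, one per interconnect, saying whether
    the launched pulse survives to the receiver output (a hypothetical input).
    """
    results = []
    for survives in pulse_reaches_b:
        launch, capture = LaunchCell(), CaptureCell()
        capture.scan_init()      # initialize node B side to 0
        launch.async_reset()     # initialize node A to 0
        launch.clock_edge()      # first rising edge: A goes 0 -> 1 (pulse starts)
        if survives:
            capture.receiver_rising_edge()
        launch.clock_edge()      # second rising edge: A goes 1 -> 0 (pulse ends)
        results.append(capture.q)
    return results

print(run_pv_session([True, False, True, True]))
```

Each capture FF holds its own pass/fail bit, so a failing interconnect can be read off by its position once the bits are scanned out.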


In the launch cell, the clock port of the launch FF is driven by a double-pulse signal generated by a local PV-test controller. The launch FF is configured as a toggle-type FF to convert this double-pulse signal into the desired single-pulse test stimulus. A single-pulse signal is not passed directly from the test controller to every launch cell because the width of a pulse may shrink or expand when it travels through a long buffered routing path, owing to unequal rise and fall times in each buffer. In contrast, the time interval between the two rising edges of a double-pulse signal is immune to the routing path (because there is no discrepancy between the rise times of two rising edges). An analogy can be drawn with a clock routing network, where the duty cycle of a clock signal (like the pulse width) can change from one node to another throughout the network, while the clock period typically remains invariant.

In a capture cell, the clock port of the capture FF is driven by the output of the receiver. Since the capture FF has been initialized to 0 before the test cycle and its data input is tied to 1, it will become 1 if its clock port is triggered by a pulse signal arriving through the interconnect, indicating a passing condition. Otherwise, it will remain 0, indicating a failing condition at the end of the test session.

Figure 10.16 shows a set of typical simulation waveforms of the key signals in a PV-test session for four interconnects. The test session uses a number of scan-in cycles to initialize the capture FFs to 0 and one extra Init cycle to initialize all launch FFs to 0. Then, in a test clock cycle (indicated by the assertion of the signal PVT_fire), every launch cell fires a short Test_Pulse; within the same test cycle, the waveforms at the termination ends of the four interconnects appear as {WO1, WO2, WO3, WO4}. It can be seen that the signal at WO2 becomes an incomplete pulse (due to a 4 kΩ resistance injected into the second interconnect). This incomplete pulse is eventually filtered out by the capture cell, causing its corresponding capture FF to stay at 0 and producing a fail bit after a number of scan-out cycles that unload the contents of all capture FFs.

Figure 10.16 Simulation waveforms of some key signals during a PV-test session for four interconnects, among which the second interconnect is faulty with a 4 kΩ resistive open fault. (Signals shown: TCLK, States (Scan-in, Init, Pulse, Scan-out), PVT_fire, Test_pulse, WO1 to WO4, and Pass/fail; WO2 shows the incomplete pulse and the resulting fail bit.)

Overall, PV test has the following distinctive features:

(1) PV test does not require die-to-die high-speed clock synchronization. Its operation is self-timed in the sense that the test threshold is carried in the pulse width of the test pulse and is insensitive to the skew of the clock signal arrival times at the launch FFs and capture FFs.
(2) PV test requires only logic-based DfT circuitry, and thus it is easy to integrate into a cell-based design flow.
(3) PV test can detect multiple delay faults and point out their exact locations without any post-processing. This feature is especially important in support of built-in self-repair, as the failing interconnect(s) can be easily identified and individually replaced on the spot. For further exploration of this application, see [24].

10.3.4 Characterization-Based Test Method via VOT Scheme

In this subsection, we discuss a delay characterization-based test method for interconnects. First, the delay of each interconnect is characterized into a digital code, and then a test decision is made on these data based on either threshold checking or outlier analysis. The RO has been used extensively as a vehicle for testing delay faults and for measuring the delay of a cell or a transmission line [25, 26]. Based on this concept, an enhancement technique called the variable output thresholding (VOT) scheme was further proposed to increase its resolution for fault detection.

Figure 10.17 shows a basic ring-oscillation-based test structure supporting the VOT analysis [27, 28]. A test unit is composed of two interconnects going in opposite directions, denoted as IW1 and IW2, respectively.

(i) At the inputs of the two original drivers of the two interconnects, two multiplexers are added to allow for switching between the functional mode and the test mode.
(ii) The two circuit paths of the two interconnects are cascaded to form an RO in test mode. Since there is an odd number of signal inversions along this ring, an oscillation signal appears at every node along the ring, with a period equal to twice the delay along the entire ring.
(iii) The receiver of each of the two interconnects is converted into a so-called VOT inverter. A VOT inverter can be switched from a normal inverter to a Schmitt trigger inverter, depending on the value of its control signal, denoted as Z.

Figure 10.17 The ring oscillator (RO) as a test structure around a pair of interconnects to support VOT analysis. (IW1 and IW2 are two interconnects going in opposite directions between Die 1 and Die 2. TM = “0”: functional mode; TM = “1”: test mode. IN1, IN2: functional inputs; OUT1, OUT2: functional outputs; WO1, WO2: observation points. A VOT inverter, controlled by Z, serves as the receiver of each interconnect: Z = 0 gives a normal inverter, and Z = 1 gives a Schmitt trigger inverter with hysteresis.) Source: Adapted from Lin et al. 2012 [27] and Huang et al. 2014 [28].

A Schmitt trigger inverter is shown in Figure 10.18. The static voltage transfer curve of this inverter indicates two threshold voltages, i.e., V_TH(1–0) = 1.27 V to cause the output to change from logic HIGH to logic LOW and V_TH(0–1) = 0.54 V to cause the output to change back from logic LOW to logic HIGH, assuming that V_DD is 1.8 V. This characteristic introduces hysteresis, implying that it is relatively hard to change the state of its output, and so a signal takes longer to pass through this inverter than through a normal inverter.

Figure 10.18 A Schmitt trigger inverter. (Symbol, schematic, and static voltage transfer curve of V_out (V) versus V_in (V) from 0 to 1.8 V, with V_TH(0–1) = 0.54 V and V_TH(1–0) = 1.27 V.)

In our application, we need a VOT inverter that can function sometimes like a normal inverter and sometimes like a Schmitt trigger inverter, depending on the value of its control signal. Such an inverter can be implemented by the schematic shown in Figure 10.19. In a sense, a VOT inverter degenerates into a normal inverter when Z is 0 and into a Schmitt trigger (ST) inverter when Z is 1.

Figure 10.19 A variable output thresholding (VOT) inverter. (a) VOT inverter. (b) Normal inverter (Z = 0). (c) ST inverter (Z = 1).

The main idea of the VOT scheme is that one can characterize the transition time at the termination end of an interconnect (since it is proportional to the interconnect delay while irrelevant to the other parts of the RO containing it) by measuring the oscillation periods under three configurations:

(Configuration 1) Both VOT inverters are in normal configuration, i.e., {Z1, Z2} = {0, 0}, producing an oscillation period denoted as T_REF.
(Configuration 2) The interconnect IW1’s VOT inverter is switched to Schmitt trigger mode, i.e., {Z1, Z2} = {1, 0}, producing an oscillation period denoted as T_ST1.
(Configuration 3) The interconnect IW2’s VOT inverter is switched to Schmitt trigger mode, i.e., {Z1, Z2} = {0, 1}, producing an oscillation period denoted as T_ST2.

Analysis shows that ΔT_ST1 (defined as T_ST1 − T_REF) is correlated to the interconnect delay across IW1 even in the presence of a delay fault, and ΔT_ST2 (defined as T_ST2 − T_REF) is similarly correlated to the interconnect delay across IW2. This is due to the subtle property that the additional delay introduced by the hysteresis of a Schmitt trigger inverter (as compared with a normal inverter) is proportional to the transition time of its input signal. Simulation has shown [27, 28] that there exists a close correlation between a measurable ΔT (e.g., ΔT_ST1 for interconnect IW1 or ΔT_ST2 for interconnect IW2) and the delay of its corresponding interconnect, as illustrated in Figure 10.20. In that experiment, a 1000 μm long interconnect was injected with a gradually increasing faulty resistance (from 1 to 100 kΩ in increments of 1 kΩ), which causes its delay to increase accordingly (from only a few hundred picoseconds up to more than 14 ns). It can be seen that, in this process, its measurable ΔT increases accordingly as well, from a small value less than 1 ns up to more than 25 ns.
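The post-processing of the three measured periods is simple arithmetic, sketched below. The period values and the pass/fail limit are hypothetical numbers in picoseconds, and the threshold check stands in for whichever decision rule (threshold checking or outlier analysis) is actually used.

```python
def vot_deltas(t_ref, t_st1, t_st2):
    """Return (dT_ST1, dT_ST2) from the periods of configurations 1, 2, and 3."""
    return t_st1 - t_ref, t_st2 - t_ref

def ring_delay(period):
    """Delay around the whole ring: the oscillation period is twice this delay."""
    return period / 2

def threshold_check(delta_t, limit):
    """Simple threshold decision on a measured delta-T (limit is an assumption)."""
    return "pass" if delta_t <= limit else "fail"

# Hypothetical measured oscillation periods, in picoseconds.
T_REF, T_ST1, T_ST2 = 4000, 4600, 9500
LIMIT_PS = 2000  # hypothetical delta-T limit

d_st1, d_st2 = vot_deltas(T_REF, T_ST1, T_ST2)
print("IW1:", d_st1, "ps ->", threshold_check(d_st1, LIMIT_PS))
print("IW2:", d_st2, "ps ->", threshold_check(d_st2, LIMIT_PS))
print("ring delay in configuration 1:", ring_delay(T_REF), "ps")
```

Subtracting T_REF removes the contribution shared by all three configurations, so each ΔT reflects only the change caused by switching that interconnect's receiver to Schmitt trigger mode.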

Figure 10.20 Measurable ΔT vs. interconnect delay in VOT-based oscillation test. (Plot of ΔT (ns), from 0 to 30 ns, versus interconnect delay (ns), from 0 to 16 ns.) Source: Adapted from Lin et al. 2012 [27] and Huang et al. 2014 [28].

10.4 Concluding Remarks

The interconnect structure of a 3D-IC could be much more complicated than its 2D counterpart due to the use of TSVs, interposers, or redistribution layers and is thus susceptible to various parametric defects during the fabrication process. These parametric defects (such as micro-voids, shorts between micro-bumps, pinholes, etc.) could cause the IC to malfunction or to degrade in performance. Even though traditional boundary scan test methods such as IEEE 1149.1 or IEEE 1500 are able to detect catastrophic faults (such as stuck-at faults or hard bridging faults), they are inadequate for catching parametric defects in the realm of 100 ps. In this chapter, we have briefly reviewed a number of promising test methods for this purpose. During the pre-bond stage, a TSV could be tested either by direct probing or by DfT circuits to quantify its leakage current and/or effective resistance/capacitance values for exposing any anomaly. During the post-bond stage, an interconnect can be thoroughly tested by several methods as well, including the voltage-divider-based method, the PV test, and the VOT-based oscillation test. These methods differ from one another in their fault detection ability, delay characterization ability, area overhead, and compatibility with traditional boundary scan test. One may need to weigh the pros and cons of these methods before choosing the one that best fits the overall needs. Beyond testing, the issues of self-repair and online monitoring of interconnects are also significant in order to enhance the yield and reliability of a 3D-IC. Readers interested in these issues are referred to the literature for further study [24, 29].

References

1 Lee, H. and Chakrabarty, K. (2009). Test challenges for 3-D integrated circuits. IEEE Design and Test of Computers 25 (5): 26–35.
2 Tsai, M., Klotz, M., Leonard, A. et al. (2009). Through silicon via (TSV) defect/pinhole self test circuit for 3D-IC. In: Proceedings of International Conference on 3D System Integration, 28–30.
3 Chi, C.-C., Marinissen, E.J., Goel, S.K., and Wu, C.-W. (2011). Post-bond testing of 2.5D-SICs and 3D-SICs containing a passive silicon interposer base. In: Proceedings of International Test Conference, 1–10.
4 Chakrabarty, K. (2012). TSV defects and TSV-induced circuit failures: the third dimension in test and design-for-test. In: Proceedings of International Reliability Physics Symposium, (IRPS), 5F1.1–5F.1.12.
5 Marinissen, E.J. (2012). Challenges and emerging solutions in testing TSV-based 2.5-D and 3D-stacked ICs. In: Proceedings of IEEE Design, Automation, and Test in Europe Conference, –1277, 1282.
6 Banijamali, B., Ramalingam, S., Nagarajan, K., and Chaware, R. (2011). Advanced reliability study of TSV interposers and interconnects for the 28nm technology FPGA. In: Proceedings of IEEE Electronic Components and Technology Conference, 285–290.
7 Kang, U., Chung, H., Heo, S. et al. (2010). 8 GB 3-D DDR3 DRAM using through-silicon-via technology. IEEE Journal of Solid-State Circuits 45 (1): 111–119.
8 Chi, C.-C., Wu, C.-W., Wang, M.-J., and Lin, H.-C. (2013). 3D-IC interconnect test, diagnosis, and repair. In: Proceedings of IEEE VLSI Test Symposium, 1–6.
9 O’Brien, P.R. and Savarino, T.L. (1989). Modeling the driving-point characteristic of resistive interconnect for accurate delay estimation. In: Proceedings of Design Automation Conference, 512–515.
10 Marinissen, E.J., Chi, C.-C., Verbree, J., and Konijnenburg, M. (2010). 3D DFT architecture for pre-bond and post-bond testing. In: Proceedings of 3D Systems Integration Conference, 1–8.
11 Noia, B. and Chakrabarty, K. (2011). Pre-bond probing of TSVs in 3D stacked ICs. In: Proceedings of International Test Conference, 1–10.
12 Cho, M., Liu, C., Kim, D.H. et al. (2011). Pre-bond and post-bond test and signal recovery structure to characterize and repair TSV defect induced signal degradation in 3-D system. IEEE Transactions on Components, Packaging and Manufacturing Technology 1 (11).
13 Sunter, S., McDonald, C., and Danialy, G. (2001). Contactless digital testing of IC pin leakage currents. In: Proceedings of IEEE International Test Conference, 204–210.
14 Chen, P.Y., Wu, C.W., and Kwai, D.M. (2009). On-chip TSV testing for 3D-IC before bonding using sense amplification. In: Proceedings of IEEE Asian Test Symposium, 450–455.
15 Chen, P.Y., Wu, C.W., and Kwai, D.M. (2010). On-chip testing of blind and open-sleeve TSVs for 3D IC before bonding. In: Proceedings of IEEE VLSI Test Symposium, 263–268.
16 Deutsch, S. and Chakrabarty, K. (2013). Non-invasive pre-bond TSV test using ring oscillator and multiple voltage levels. In: Proceedings of IEEE Design Automation and Test in Europe, 1065–1070.
17 Huang, L.-R., Huang, S.-Y., Sunter, S. et al. (2013). Oscillation-based pre-bond TSV test. IEEE Transactions on Computer-Aided Design of Electronic Circuits 31 (9): 1440–1444.
18 Huang, S.-Y., Lin, Y.-H., L.-R.H. et al. (2013). Programmable leakage test and binning for TSVs with self-timed timing control. IEEE Transactions on Computer-Aided Design of Electronic Circuits (TCAD) 32 (8): 1265–1273.
19 You, J.-W., Huang, S.-Y., Lin, Y.-H. et al. (2013). In-situ method for TSV delay testing and characterization using input sensitivity analysis. IEEE Transactions on VLSI Systems 21 (3): 443–453.
20 Wu, S.H., Drmanac, D., and Wang, L.-C. (2008). A study of outlier analysis techniques for delay testing. In: Proceedings of IEEE International Test Conference, 1–10.
21 Su, C.C., Chen, Y.T., Huang, M.J. et al. (2000). All digital built-in delay and crosstalk measurement for on-chip buses. In: Proceedings of Design, Automation & Test in Europe Conference and Exhibition (DATE), 527–531.
22 Ye, F. and Chakrabarty, K. (2012). TSV open defects in 3D integrated circuits: characterization, test, and optimal spare allocation. In: Proceedings of Design Automation Conference, 10240–11030.
23 Huang, S.-Y., Lee, J.-Y., Tsai, K.-H., and Cheng, W.-T. (2014). Pulse-vanishing test for interposers wires in 2.5-D IC. IEEE Transactions on Computer-Aided Design of Electronic Circuits (TCAD) 33 (8): 1258–1268.
24 Huang, S.-Y., Tsai, M.-T., Zeng, Z.-F. et al. (2015). General timing-aware built-in self-repair for die-to-die interconnects. IEEE Transactions on Computer-Aided Design of Electronic Circuits (TCAD) 34 (11): 1836–1846.
25 Li, K.S.-M., Lee, C.L., Su, C., and Chen, J.E. (2005). Oscillation ring based interconnect test scheme for SoC. In: Proceedings of IEEE Asia South Pacific Design Automation Conference (ASP-DAC), 184–187.
26 Das, B.P., Amrutur, B., Jamadagni, H.S. et al. (2008). Within-die gate delay variability measurement using re-configurable ring oscillator. In: Proceedings of IEEE Custom Integrated Circuits Conference (CICC), 133–136.
27 Lin, Y.-H., Huang, S.-Y., Tsai, K.-H. et al. (2012). Small delay testing for TSVs in 3D ICs. In: IEEE Proceedings of Design Automation Conference.
28 Huang, L.-R., Huang, S.-Y., Tsai, K.-H., and Cheng, W.-T. (2014). Parametric fault testing and performance characterization of post-bond interposer wires in 2.5-D ICs. IEEE Transactions on Computer-Aided Design of Electronic Circuits (TCAD) 33 (3): 476–488.
29 Huang, S.-Y., Tsai, M.-T., Li, H.-X. et al. (2015). Non-intrusive on-line transition-time binning and timing failure threat detection for die-to-die interconnects. IEEE Transactions on Computer-Aided Design of Electronic Circuits (TCAD) 34 (12): 2039–2048.

11 Pre-Bond Testing Through Direct Probing of Large-Array Fine-Pitch Micro-Bumps*

Erik Jan Marinissen 1, Bart De Wachter 1, Jörg Kiesewetter 2, and Ken Smith 3

1 IMEC, Kapeldreef 75, 3001 Leuven, Belgium
2 FormFactor GmbH, Süss Straße 1, 01561 Thiendorf, Germany
3 FormFactor Inc., 9100 SW Gemini Drive, Beaverton, OR 97008, USA

11.1 Introduction

There is a lot of excitement around and expectations from 2.5D- and 3D-stacked integrated circuits (SICs) [1]. In 2.5D-SICs, multiple active dies are placed side-by-side on top of and interconnected by a passive interposer die. In 3D-SICs, multiple active dies are stacked vertically. Both 2.5D- and 3D-SICs are enabled by the capability to manufacture through-silicon vias (TSVs) that provide an electrical connection between the front- and backside of a silicon substrate [2–4]. In 2.5D-SICs, TSVs connect the stacked active dies through the silicon interposer to the package substrate. In 3D-SICs, TSVs provide vertical interconnections between the various active dies stacked onto each other. Both types of SICs serve their particular market segments and are here to stay; 2.5D-SICs provide better chip cooling options and hence typically target high-performance computing and networking applications, whereas 3D-SICs with their small footprint are better suited for mobile applications.

In order to obtain acceptable compound stack yields, there is a need to perform pre-bond testing of the various dies before stacking [5, 6]. For non-bottom dies in the stack, the typical functional interface is through an array of fine-pitch micro-bumps. These micro-bumps are too small and too dense for conventional probe technology. Consequently, the current industrial approach to enable test access for pre-bond testing is to provide non-bottom dies with dedicated pre-bond probe pads [5, 7–9]. Although these dedicated probe pads do the job, they come at the expense of extra design effort, extra silicon area, possibly extra processing steps, extra test application time, and extra load on the micro-bump I/Os during post-bond functional stack operation, and they still leave the micro-bumps themselves untested.

* Reprinted with permission from Marinissen, E.J., De Wachter, B., Smith, K. et al. (2014). Direct probing on large-array fine-pitch micro-bumps of a wide-I/O logic-memory interface. In: Proceedings IEEE International Test Conference (ITC), Seattle, WA, USA (October 2014). IEEE. Copyright 2014, IEEE.

Handbook of 3D Integration: Design, Test, and Thermal Management, First Edition. Edited by Paul D. Franzon, Erik Jan Marinissen, and Muhannad S. Bakir. © 2019 Wiley-VCH Verlag GmbH & Co. KGaA. Published 2019 by Wiley-VCH Verlag GmbH & Co. KGaA.


11 Pre-Bond Testing Through Direct Probing of Large-Array Fine-Pitch Micro-Bumps

In this work, we set out to probe directly on large-array fine-pitch micro-bumps. We are able to do this at wafer level with a probe card in a single-site setup. This enables a test flow in which the die’s internal circuitry (logic, DRAM) is tested through dedicated pre-bond probe pads, possibly in a (massive) multisite arrangement, and in which the micro-bumps and underlying TSVs are separately tested in a single-site setup. It also enables an alternative test flow, in which the entire pre-bond test is performed single-site by probing directly on the micro-bumps; this circumvents the need for dedicated pre-bond probe pads with all their associated drawbacks and costs. Direct probing on fine-pitch micro-bumps requires advanced probe technology: fine-pitch low-force probe cards and accurate probe stations. Prior work in this domain has been reported by others [10–15] and by us [16–18], but, to the best of our knowledge, we are the first to report on pre-bond contact resistance, probe marks on both top and landing micro-bumps, and impact on stack interconnect yield.

In this chapter, we use the JEDEC WideIO mobile DRAM interface (JESD229) [19–21] as a typical target for today’s 2.5D- and 3D-SIC micro-bump arrays. We have designed and manufactured test wafers with this micro-bump interface and report on our experiences in probing and subsequent stacking of that interface. We have also used the 3D-COSTAR test flow cost modeling tool [22–25] to analyze the cost-effectiveness of our approach, in comparison with performing pre-bond testing through dedicated pre-bond probe pads; for this cost analysis, see Chapter 9.

The remainder of this chapter is organized as follows. Section 11.2 discusses the importance of pre-bond testing. Section 11.3 describes the micro-bump probe targets. Section 11.4 details the selected probe technology, while Section 11.5 describes the test vehicle. Experiment results are given in Section 11.6. Section 11.7 concludes this chapter.

11.2 Pre-Bond Testing

The post-bond compound stack yield ystack of a stack consisting of n dies cannot be greater than the product of the individual die yields yd (for 1 ≤ d ≤ n) and the interconnect yields yi (for 1 ≤ i ≤ n − 1), where yi is the yield of the interconnects between adjacent Dies i and i + 1:

$$ y_{\mathrm{stack}} \le \prod_{d=1}^{n} y_d \cdot \prod_{i=1}^{n-1} y_i \tag{11.1} $$
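As a quick numerical illustration of the bound in Eq. (11.1), the following minimal Python sketch multiplies the die yields and interconnect yields of a stack. The function name `max_stack_yield` and the example yield values are hypothetical and chosen only for illustration; they are not measurement data from this chapter.

```python
from math import prod


def max_stack_yield(die_yields, interconnect_yields):
    """Upper bound on the compound stack yield according to Eq. (11.1).

    die_yields:          yields y_d of the n dies (values between 0 and 1)
    interconnect_yields: yields y_i of the n - 1 interconnect layers
    """
    if len(interconnect_yields) != len(die_yields) - 1:
        raise ValueError("a stack of n dies has exactly n - 1 interconnect layers")
    return prod(die_yields) * prod(interconnect_yields)


if __name__ == "__main__":
    # Hypothetical example: four dies at 90% yield, three interconnect layers at 95%.
    bound = max_stack_yield([0.90] * 4, [0.95] * 3)
    print(f"maximum compound stack yield: {bound:.1%}")
```

Even with every die at 90% yield, the bound for a four-die stack in this example drops to roughly 56%, which is the effect that motivates pre-bond testing.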

Figure 11.1 plots the maximum post-bond compound stack yield ystack as a function of the die yield yd for various values of n and yi. The graph demonstrates that the compound stack yield decreases drastically if yd decreases. Consequently, it is important to test dies before stacking (the so-called pre-bond test) and to stack only dies that pass that pre-bond test, in a die-to-die or die-to-wafer scheme. Pre-bond testing obviously adds cost compared with skipping it, and the better the pre-bond test, the higher that cost will be. However, this investment typically pays off, as the alternative is that bad dies get detected only after stacking, at which point they are filtered out of the production flow together with the good dies to which they are now attached.

Figure 11.1 Post-bond compound stack yield ystack as function of pre-bond die yield yd for various stack heights n and various interconnect yields yi. [Plot: x-axis, pre-bond die yield (yd), 0–100%; y-axis, post-bond compound stack yield (ystack), 0–100%; curves for n = 2, 4, and 6, each with yi = 100% and yi = 95%.]

Figure 11.2 shows an example of a total stack cost price calculation made with 3D-COSTAR [22–25]. We assumed a three-die stack, in which Die 1 was fully tested before stacking, but for which we varied the pre-bond test coverage and associated pre-bond test costs for Dies 2 and 3. The graph shows that more pre-bond testing (at an assumed linearly increasing pre-bond test cost) actually decreases the overall stack cost price.

Figure 11.2 Good stack cost price for a three-die stack as function of pre-bond test coverage and associated test cost for Dies 2 and 3. [Surface plot: horizontal axes, pre-bond test coverage and test cost for Die 2 and for Die 3, each 0–100%; vertical axis, good stack cost price ($), 10–21.]
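The trend described for Figure 11.2 can be mimicked with a deliberately simplified cost model. The sketch below is not 3D-COSTAR and does not reproduce its results; the function `good_stack_cost`, its parameters, and all numbers are hypothetical. It assumes identical die cost and yield for all dies, perfect interconnects, a pre-bond test cost that grows linearly with coverage, and a free, perfect post-bond test.

```python
def good_stack_cost(die_cost, die_yield, full_test_cost, coverages, stacking_cost=0.0):
    """Expected cost per good stack in a toy model; every input is hypothetical.

    Assumptions (not taken from 3D-COSTAR): a pre-bond test with coverage c
    catches a fraction c of the bad dies and costs c * full_test_cost,
    interconnects are perfect, and a free, perfect post-bond test rejects
    every stack that contains at least one bad die.
    """
    spent = stacking_cost
    p_good = 1.0
    for cov in coverages:
        # Fraction of dies passing the pre-bond test: good dies plus test escapes.
        pass_frac = die_yield + (1.0 - die_yield) * (1.0 - cov)
        # Manufacturing and test money spent per die that reaches stacking.
        spent += (die_cost + cov * full_test_cost) / pass_frac
        # Probability that a die that passed the pre-bond test is really good.
        p_good *= die_yield / pass_frac
    # Bad stacks are scrapped, so their cost is carried by the good stacks.
    return spent / p_good


if __name__ == "__main__":
    for cov in (0.0, 0.5, 1.0):
        # Die 1 fully tested; Dies 2 and 3 share the same pre-bond test coverage.
        cost = good_stack_cost(1.0, 0.8, 0.1, [1.0, cov, cov])
        print(f"coverage {cov:.0%}: cost per good stack = {cost:.3f}")
```

With these made-up inputs, the cost per good stack falls as the coverage for Dies 2 and 3 rises. In this toy model the trend can reverse when the test is expensive relative to the die or when the die yield is already very high, which is why a full cost analysis needs a dedicated tool.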


Figure 11.3 Schematic cross sections of typical (a) 2.5D- and (b) 3D-SICs containing three active dies. Source: Reprinted with permission from Marinissen et al. [26]. Copyright 2014, IEEE. [Drawing labels: Die 1, Die 2, Die 3, package substrate; stack layers from top to bottom: active die 3, micro-bumps, active die 2, micro-bumps, active die 1, Cu pillars/C4 bumps, package substrate, ball grid array.]

Test access for pre-bond testing is through probing. Probing the bottom die of a stack is relatively easy, as its natural interface to the package substrate is implemented with large C4 bumps or copper pillars; a typical diameter is 50 μm at 120 μm pitch, which is no problem for today’s probe technology. However, this is not true for the non-bottom dies (see Figure 11.3). All their functional connections (for power, ground, control, clocks, digital, analog, etc.) go through large arrays of fine-pitch micro-bumps. Typical micro-bumps have a diameter of ∼20 μm at 40 μm pitch and come in arrays of several hundreds to thousands of micro-bumps. Cantilever probe cards can achieve these small pitches, but cannot handle such large arrays. Vertical probe cards can be made in arbitrary array configurations but are limited to minimum pitches around 60 μm.

Today’s solution in the industry is to equip non-bottom dies with dedicated pre-bond probe pads, with sufficiently large size and pitch to accommodate today’s probe technology [5, 7–9]. This solution requires extra design effort and possibly extra processing steps. Moreover, it causes a trade-off between extra silicon area and extra test time. The probe pads are larger than the micro-bumps; that is their whole purpose. Hence, one typically cannot afford as many probe pads as there are micro-bumps, as they would simply consume too much silicon area. As a result, the same pre-bond stimulus/response data needs to be pumped in and out of the die-under-test through a narrower interface, and consequently the die’s pre-bond test time stretches out over more clock cycles, increasing the pre-bond test application cost. Furthermore, after performing a pre-bond test through dedicated probe pads, one still cannot be certain of the correct operation of the functional interface through the micro-bumps. Finally, the dedicated pre-bond probe pads cause an extra capacitive load on the micro-bump I/Os during post-bond functional stack operation, which negatively impacts the interconnect performance.
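The test-time side of this trade-off follows from simple arithmetic: a fixed volume of test data divided over fewer terminals needs proportionally more clock cycles. The sketch below illustrates that relation only; the function name `prebond_test_cycles`, the data volume, and both interface widths are hypothetical values, and the model ignores protocol overhead as well as terminals reserved for power, ground, and control.

```python
def prebond_test_cycles(total_bits, interface_width):
    """Clock cycles needed to move total_bits through interface_width terminals.

    Simplifying assumption: every terminal transfers one bit per clock cycle.
    """
    if interface_width <= 0:
        raise ValueError("interface_width must be positive")
    # Ceiling division: a partially filled last cycle still counts as a cycle.
    return -(-total_bits // interface_width)


if __name__ == "__main__":
    data_volume = 1_200_000  # hypothetical stimulus/response volume in bits
    wide = prebond_test_cycles(data_volume, 1200)   # hypothetical: one terminal per micro-bump
    narrow = prebond_test_cycles(data_volume, 120)  # hypothetical: ten times fewer probe pads
    print(f"wide interface:   {wide} cycles")
    print(f"narrow interface: {narrow} cycles ({narrow / wide:.0f}x longer)")
```

In this idealized model, reducing the number of test terminals by a factor of ten lengthens the test by the same factor.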

11.3 Micro-Bumps

Micro-bumps come in different metallurgies, forms, and shapes. IMEC’s 40 μm pitch micro-bump reference process utilizes copper (Cu) landing bumps of 5 μm height and 25 μm diameter and copper–nickel–tin (Cu/Ni/Sn) top bumps of 9.5 μm height and 15 μm diameter [27]. Two such micro-bumps are depicted in Figure 11.4. The micro-bumps have a cylindrical shape. As can be seen, the Cu micro-bumps have a rather smooth surface. As no reflow was applied (yet) on the Cu/Ni/Sn micro-bumps, their surface is significantly rougher. During stacking, the two micro-bumps form an intermetallic bond under thermo-compression.

Figure 11.4 Typical micro-bumps at IMEC: (a) copper landing bump of 25 μm diameter, (b) copper–nickel–tin top bump of 15 μm diameter, and (c) schematic cross section. Source: (a,b) Reprinted with permission from Marinissen et al. [26]. Copyright 2014, IEEE. (c) Reprinted with permission from Marinissen et al. [28]. Copyright 2017, IEEE. [Panel labels: (a) Cu micro-bump, ∅25 μm; (b) Cu/Ni/Sn micro-bump, ∅15 μm; (c) micro-bump pairs at 40 μm pitch, with top bump of Cu 5 μm, Ni 1 μm, and Sn 3.5 μm at 15 μm diameter and bottom bump of Cu 5 μm at 25 μm diameter.]

Micro-bumps typically come in large arrays. For this work, we took as target the representative micro-bump array of the JEDEC WideIO mobile DRAM standard [19–21]. This first standard for stackable WideIO DRAMs, published as JESD229 in December 2011, defines the functional and mechanical aspects of the WideIO logic–memory interface. The interface consists of four DRAM channels (named a, b, c, and d), each consisting of an array of 6 rows × 50 columns = 300 micro-bumps with a horizontal pitch of 50 μm and a vertical pitch of 40 μm. The pad locations are symmetric between the four channels, and the spacing between the four channels is also defined. The total interface occupies 0.52 mm × 5.25 mm. Figure 11.5 shows the layout of the 1200 JEDEC WideIO micro-bumps.

Figure 11.5 Standardized micro-bump layout according to the JEDEC WideIO mobile DRAM specification [19]. Source: Reprinted with permission from Marinissen et al. [26]. Copyright 2014, IEEE. [Dimension labels per channel: 6 × 40 μm and 50 × 50 μm.]

Direct probing on large arrays of fine-pitch micro-bumps has to meet the following criteria:
• Good electrical contact with low contact resistance, to allow for pre-bond testing of the die-under-test. We specify 5 Ω as maximum contact resistance.
• Probe marks with a limited profile, to not impair downstream bonding or negatively impact the yield of that bonding process. We used as specification a probe mark profile