CUDA C PROGRAMMING GUIDE PG-02829-001_v9.2 | July 2018 Design Guide www.nvidia.

CUDA C PROGRAMMING GUIDE PG-02829-001_v9.2 | July 2018 Design Guide www.nvidia.com CUDA C Programming Guide PG-02829-001_v9.2 | ii CHANGES FROM VERSION 9.0 ‣ Documented restriction that operator-overloads cannot be __global__ functions in Operator Function. ‣ Removed guidance to break 8-byte shuffles into two 4-byte instructions. 8-byte shuffle variants are provided since CUDA 9.0. See Warp Shuffle Functions. ‣ Passing __restrict__ references to __global__ functions is now supported. Updated comment in __global__ functions and function templates. ‣ Documented CUDA_ENABLE_CRC_CHECK in CUDA Environment Variables. ‣ Warp matrix functions [PREVIEW FEATURE] now support matrix products with m=32, n=8, k=16 and m=8, n=32, k=16 in addition to m=n=k=16. ‣ Added new Unified Memory sections: System Allocator, Hardware Coherency, Access Counters www.nvidia.com CUDA C Programming Guide PG-02829-001_v9.2 | iii TABLE OF CONTENTS Chapter 1. Introduction.........................................................................................1 1.1. From Graphics Processing to General Purpose Parallel Computing............................... 1 1.2. CUDA®: A General-Purpose Parallel Computing Platform and Programming Model.............3 1.3. A Scalable Programming Model.........................................................................4 1.4. Document Structure...................................................................................... 6 Chapter 2. Programming Model............................................................................... 8 2.1. Kernels......................................................................................................8 2.2. Thread Hierarchy......................................................................................... 9 2.3. Memory Hierarchy.......................................................................................11 2.4. Heterogeneous Programming.......................................................................... 13 2.5. Compute Capability..................................................................................... 15 Chapter 3. Programming Interface..........................................................................16 3.1. Compilation with NVCC................................................................................ 16 3.1.1. Compilation Workflow.............................................................................17 3.1.1.1. Offline Compilation.......................................................................... 17 3.1.1.2. Just-in-Time Compilation....................................................................17 3.1.2. Binary Compatibility...............................................................................17 3.1.3. PTX Compatibility..................................................................................18 3.1.4. Application Compatibility.........................................................................18 3.1.5. C/C++ Compatibility...............................................................................19 3.1.6. 64-Bit Compatibility............................................................................... 19 3.2. CUDA C Runtime.........................................................................................19 3.2.1. Initialization.........................................................................................20 3.2.2. Device Memory..................................................................................... 20 3.2.3. Shared Memory..................................................................................... 24 3.2.4. Page-Locked Host Memory........................................................................29 3.2.4.1. Portable Memory..............................................................................30 3.2.4.2. Write-Combining Memory....................................................................30 3.2.4.3. Mapped Memory...............................................................................30 3.2.5. Asynchronous Concurrent Execution............................................................ 31 3.2.5.1. Concurrent Execution between Host and Device........................................32 3.2.5.2. Concurrent Kernel Execution............................................................... 32 3.2.5.3. Overlap of Data Transfer and Kernel Execution......................................... 32 3.2.5.4. Concurrent Data Transfers.................................................................. 33 3.2.5.5. Streams.........................................................................................33 3.2.5.6. Events...........................................................................................37 3.2.5.7. Synchronous Calls.............................................................................38 3.2.6. Multi-Device System............................................................................... 38 3.2.6.1. Device Enumeration.......................................................................... 38 3.2.6.2. Device Selection.............................................................................. 38 www.nvidia.com CUDA C Programming Guide PG-02829-001_v9.2 | iv 3.2.6.3. Stream and Event Behavior................................................................. 39 3.2.6.4. Peer-to-Peer Memory Access................................................................39 3.2.6.5. Peer-to-Peer Memory Copy..................................................................40 3.2.7. Unified Virtual Address Space................................................................... 41 3.2.8. Interprocess Communication..................................................................... 41 3.2.9. Error Checking......................................................................................42 3.2.10. Call Stack.......................................................................................... 42 3.2.11. Texture and Surface Memory................................................................... 42 3.2.11.1. Texture Memory............................................................................. 43 3.2.11.2. Surface Memory............................................................................. 52 3.2.11.3. CUDA Arrays..................................................................................56 3.2.11.4. Read/Write Coherency..................................................................... 56 3.2.12. Graphics Interoperability........................................................................56 3.2.12.1. OpenGL Interoperability................................................................... 57 3.2.12.2. Direct3D Interoperability...................................................................59 3.2.12.3. SLI Interoperability..........................................................................65 3.3. Versioning and Compatibility..........................................................................66 3.4. Compute Modes..........................................................................................67 3.5. Mode Switches........................................................................................... 68 3.6. Tesla Compute Cluster Mode for Windows.......................................................... 68 Chapter 4. Hardware Implementation......................................................................70 4.1. SIMT Architecture....................................................................................... 70 4.2. Hardware Multithreading...............................................................................72 Chapter 5. Performance Guidelines........................................................................ 74 5.1. Overall Performance Optimization Strategies...................................................... 74 5.2. Maximize Utilization.................................................................................... 74 5.2.1. Application Level...................................................................................74 5.2.2. Device Level........................................................................................ 75 5.2.3. Multiprocessor Level...............................................................................75 5.2.3.1. Occupancy Calculator........................................................................77 5.3. Maximize Memory Throughput........................................................................ 79 5.3.1. Data Transfer between Host and Device.......................................................80 5.3.2. Device Memory Accesses..........................................................................81 5.4. Maximize Instruction Throughput.....................................................................85 5.4.1. Arithmetic Instructions............................................................................85 5.4.2. Control Flow Instructions.........................................................................89 5.4.3. Synchronization Instruction.......................................................................90 Appendix A. CUDA-Enabled GPUs........................................................................... 91 Appendix B. C Language Extensions........................................................................92 B.1. Function Execution Space Specifiers.................................................................92 B.1.1. __device__.......................................................................................... 92 B.1.2. __global__...........................................................................................92 B.1.3. __host__............................................................................................. 93 www.nvidia.com CUDA C Programming Guide PG-02829-001_v9.2 | v B.1.4. __noinline__ and __forceinline__............................................................... 93 B.2. Variable Memory Space Specifiers....................................................................93 B.2.1. __device__.......................................................................................... 94 B.2.2. __constant__........................................................................................94 B.2.3. __shared__.......................................................................................... 94 B.2.4. __managed__....................................................................................... 95 B.2.5. __restrict__......................................................................................... 95 B.3. Built-in Vector Types................................................................................... 97 B.3.1. char, short, int, long, longlong, float, double................................................ 97 B.3.2. dim3..................................................................................................98 B.4. Built-in Variables........................................................................................ 98 B.4.1. gridDim.............................................................................................. 98 B.4.2. blockIdx..............................................................................................98 B.4.3. blockDim.............................................................................................98 B.4.4. threadIdx............................................................................................ 99 B.4.5. warpSize............................................................................................. 99 B.5. Memory Fence Functions...............................................................................99 B.6. Synchronization Functions............................................................................102 B.7. Mathematical Functions...............................................................................103 B.8. Texture Functions......................................................................................103 B.8.1. Texture Object API...............................................................................103 B.8.1.1. tex1Dfetch()..................................................................................103 B.8.1.2. tex1D()........................................................................................ 103 B.8.1.3. tex1DLod()....................................................................................103 B.8.1.4. tex1DGrad().................................................................................. 104 B.8.1.5. tex2D()........................................................................................ 104 B.8.1.6. tex2DLod()....................................................................................104 B.8.1.7. tex2DGrad().................................................................................. 104 B.8.1.8. tex3D()........................................................................................ 104 B.8.1.9. tex3DLod()....................................................................................104 B.8.1.10. tex3DGrad().................................................................................105 B.8.1.11. tex1DLayered()............................................................................. 105 B.8.1.12. tex1DLayeredLod().........................................................................105 B.8.1.13. tex1DLayeredGrad()....................................................................... 105 B.8.1.14. tex2DLayered()............................................................................. 105 B.8.1.15. tex2DLayeredLod().........................................................................105 B.8.1.16. tex2DLayeredGrad()....................................................................... 106 B.8.1.17. texCubemap().............................................................................. 106 B.8.1.18. texCubemapLod().......................................................................... 106 B.8.1.19. texCubemapLayered().....................................................................106 B.8.1.20. texCubemapLayeredLod()................................................................ 106 B.8.1.21. tex2Dgather()...............................................................................106 B.8.2. Texture Reference API...........................................................................107 www.nvidia.com CUDA C Programming Guide PG-02829-001_v9.2 | vi B.8.2.1. tex1Dfetch()..................................................................................107 B.8.2.2. tex1D()........................................................................................ 107 B.8.2.3. tex1DLod()....................................................................................108 B.8.2.4. tex1DGrad().................................................................................. 108 B.8.2.5. tex2D()........................................................................................ 108 B.8.2.6. tex2DLod()....................................................................................108 B.8.2.7. tex2DGrad().................................................................................. 108 B.8.2.8. tex3D()........................................................................................ 109 B.8.2.9. tex3DLod()....................................................................................109 B.8.2.10. tex3DGrad().................................................................................109 B.8.2.11. tex1DLayered()............................................................................. 109 B.8.2.12. tex1DLayeredLod().........................................................................110 B.8.2.13. tex1DLayeredGrad()....................................................................... 110 B.8.2.14. tex2DLayered()............................................................................. 110 B.8.2.15. tex2DLayeredLod().........................................................................110 B.8.2.16. tex2DLayeredGrad()....................................................................... 111 B.8.2.17. texCubemap().............................................................................. 111 B.8.2.18. texCubemapLod().......................................................................... 111 B.8.2.19. texCubemapLayered().....................................................................111 B.8.2.20. texCubemapLayeredLod()................................................................ 111 B.8.2.21. tex2Dgather()...............................................................................112 B.9. Surface Functions......................................................................................112 B.9.1. Surface Object API............................................................................... 112 B.9.1.1. surf1Dread()..................................................................................112 B.9.1.2. surf1Dwrite...................................................................................112 B.9.1.3. surf2Dread()..................................................................................113 B.9.1.4. surf2Dwrite().................................................................................113 B.9.1.5. surf3Dread()..................................................................................113 B.9.1.6. surf3Dwrite().................................................................................113 B.9.1.7. surf1DLayeredread()........................................................................ 114 B.9.1.8. surf1DLayeredwrite()....................................................................... 114 B.9.1.9. surf2DLayeredread()........................................................................ 114 B.9.1.10. surf2DLayeredwrite()......................................................................114 B.9.1.11. surfCubemapread()........................................................................ 115 B.9.1.12. surfCubemapwrite()....................................................................... 115 B.9.1.13. surfCubemapLayeredread()...............................................................115 B.9.1.14. surfCubemapLayeredwrite()..............................................................115 B.9.2. Surface Reference API........................................................................... 116 B.9.2.1. surf1Dread()..................................................................................116 B.9.2.2. surf1Dwrite...................................................................................116 B.9.2.3. surf2Dread()..................................................................................116 B.9.2.4. surf2Dwrite().................................................................................116 B.9.2.5. surf3Dread()..................................................................................117 www.nvidia.com CUDA C Programming Guide PG-02829-001_v9.2 | vii B.9.2.6. surf3Dwrite().................................................................................117 B.9.2.7. surf1DLayeredread()........................................................................ 117 B.9.2.8. surf1DLayeredwrite()....................................................................... 117 B.9.2.9. surf2DLayeredread()........................................................................ 118 B.9.2.10. surf2DLayeredwrite()......................................................................118 B.9.2.11. surfCubemapread()........................................................................ 118 B.9.2.12. surfCubemapwrite()....................................................................... 118 B.9.2.13. surfCubemapLayeredread()...............................................................119 B.9.2.14. surfCubemapLayeredwrite()..............................................................119 B.10. Read-Only Data Cache Load Function.............................................................119 B.11. Time Function.........................................................................................119 B.12. Atomic Functions..................................................................................... 120 B.12.1. Arithmetic Functions........................................................................... 121 B.12.1.1. atomicAdd().................................................................................121 B.12.1.2. atomicSub()................................................................................. 121 B.12.1.3. atomicExch()................................................................................122 B.12.1.4. atomicMin()................................................................................. 122 B.12.1.5. atomicMax().................................................................................122 B.12.1.6. atomicInc()..................................................................................122 B.12.1.7. atomicDec().................................................................................123 B.12.1.8. atomicCAS().................................................................................123 B.12.2. Bitwise Functions............................................................................... 123 B.12.2.1. atomicAnd().................................................................................123 B.12.2.2. atomicOr().................................................................................. 123 B.12.2.3. atomicXor()................................................................................. 124 B.13. Warp Vote Functions.................................................................................124 B.14. Warp Match Functions............................................................................... 125 B.14.1. Synopsys.......................................................................................... 125 B.14.2. Description....................................................................................... 125 B.15. Warp Shuffle Functions..............................................................................126 B.15.1. Synopsis...........................................................................................126 B.15.2. Description....................................................................................... 126 B.15.3. Return Value..................................................................................... 127 B.15.4. Notes.............................................................................................. 128 B.15.5. Examples..........................................................................................128 B.15.5.1. Broadcast of a single value across a warp............................................ 128 B.15.5.2. Inclusive plus-scan across sub-partitions of 8 threads............................... 129 B.15.5.3. Reduction across a warp................................................................. 129 B.16. Warp matrix functions [PREVIEW FEATURE]......................................................129 B.16.1. Description....................................................................................... 130 B.16.2. Example...........................................................................................132 B.17. Profiler Counter Function........................................................................... 132 B.18. Assertion............................................................................................... 133 www.nvidia.com CUDA C Programming Guide PG-02829-001_v9.2 | viii B.19. Formatted Output.................................................................................... 134 B.19.1. Format Specifiers............................................................................... 134 B.19.2. Limitations....................................................................................... 135 B.19.3. Associated Host-Side API.......................................................................136 B.19.4. Examples..........................................................................................136 B.20. Dynamic Global Memory Allocation and Operations............................................ 137 B.20.1. Heap Memory Allocation.......................................................................138 B.20.2. Interoperability with Host Memory API......................................................138 B.20.3. Examples..........................................................................................138 B.20.3.1. Per Thread Allocation.....................................................................139 B.20.3.2. Per Thread Block Allocation............................................................. 140 B.20.3.3. Allocation Persisting Between Kernel Launches...................................... 141 B.21. Execution Configuration.............................................................................142 B.22. Launch Bounds........................................................................................ 142 B.23. #pragma unroll........................................................................................145 B.24. SIMD Video Instructions..............................................................................145 Appendix C. Cooperative Groups.......................................................................... 147 C.1. Introduction.............................................................................................147 C.2. Intra-block Groups.....................................................................................148 C.2.1. Thread Groups and Thread Blocks.............................................................148 C.2.2. Tiled Partitions....................................................................................149 C.2.3. Thread Block Tiles............................................................................... 149 C.2.4. Coalesced Groups................................................................................ 150 C.2.5. Uses of Intra-block Cooperative Groups...................................................... 150 C.2.5.1. Discovery Pattern........................................................................... 150 C.2.5.2. Warp-Synchronous Code Pattern..........................................................151 C.2.5.3. Composition.................................................................................. 152 C.3. Grid Synchronization.................................................................................. 152 C.4. Multi-Device Synchronization........................................................................ 154 Appendix D. CUDA Dynamic Parallelism..................................................................156 D.1. Introduction.............................................................................................156 D.1.1. Overview........................................................................................... 156 D.1.2. Glossary............................................................................................ 156 D.2. Execution Environment and Memory Model....................................................... 157 D.2.1. Execution Environment.......................................................................... 157 D.2.1.1. Parent and Child Grids.....................................................................157 D.2.1.2. Scope of CUDA Primitives................................................................. 158 D.2.1.3. Synchronization..............................................................................158 D.2.1.4. Streams and Events.........................................................................158 D.2.1.5. Ordering and Concurrency.................................................................159 D.2.1.6. Device Management........................................................................ 159 D.2.2. Memory Model.................................................................................... 159 D.2.2.1. Coherence and Consistency............................................................... 160 www.nvidia.com CUDA C Programming Guide PG-02829-001_v9.2 | ix D.3. Programming Interface................................................................................162 D.3.1. CUDA C/C++ Reference..........................................................................162 D.3.1.1. Device-Side Kernel Launch................................................................ 162 D.3.1.2. Streams....................................................................................... 163 D.3.1.3. Events......................................................................................... 164 D.3.1.4. Synchronization..............................................................................164 D.3.1.5. Device Management........................................................................ 164 D.3.1.6. Memory Declarations....................................................................... 165 D.3.1.7. API Errors and Launch Failures........................................................... 166 D.3.1.8. API Reference................................................................................167 D.3.2. Device-side Launch from PTX.................................................................. 168 D.3.2.1. Kernel Launch APIs......................................................................... 168 D.3.2.2. Parameter Buffer Layout.................................................................. 170 D.3.3. Toolkit Support for Dynamic Parallelism......................................................170 D.3.3.1. Including Device Runtime API in CUDA Code........................................... 170 D.3.3.2. Compiling and Linking......................................................................171 D.4. Programming Guidelines.............................................................................. 171 D.4.1. Basics............................................................................................... 171 D.4.2. Performance.......................................................................................172 D.4.2.1. Synchronization..............................................................................172 D.4.2.2. Dynamic-parallelism-enabled Kernel Overhead........................................ 172 D.4.3. Implementation Restrictions and Limitations................................................173 D.4.3.1. Runtime.......................................................................................173 Appendix E. Mathematical Functions..................................................................... 176 E.1. Standard Functions.................................................................................... 176 E.2. Intrinsic Functions..................................................................................... 184 Appendix F. C/C++ Language Support.................................................................... 187 F .1. C++11 Language Features.............................................................................187 F .2. C++14 Language Features.............................................................................190 F .3. Restrictions.............................................................................................. 190 F .3.1. Host Compiler Extensions........................................................................190 F .3.2. Preprocessor Symbols.............................................................................191 F .3.2.1. __CUDA_ARCH__............................................................................. 191 F .3.3. Qualifiers........................................................................................... 192 F .3.3.1. Device Memory Space Specifiers.......................................................... 192 F .3.3.2. __managed__ Memory Space Specifier...................................................193 F .3.3.3. Volatile Qualifier.............................................................................195 F .3.4. Pointers............................................................................................. 196 F .3.5. Operators........................................................................................... 196 F .3.5.1. Assignment Operator........................................................................ 196 F .3.5.2. Address Operator............................................................................ 196 F .3.6. Run Time Type Information (RTTI)............................................................. 196 F .3.7. Exception Handling............................................................................... 196 www.nvidia.com CUDA C Programming Guide PG-02829-001_v9.2 | x F .3.8. Standard Library...................................................................................196 F .3.9. Functions........................................................................................... 196 F .3.9.1. External Linkage............................................................................. 197 F .3.9.2. Implicitly-declared and explicitly-defaulted functions................................ 197 F .3.9.3. Function Parameters........................................................................ 198 F .3.9.4. Static Variables within Function.......................................................... 198 F .3.9.5. Function Pointers............................................................................ 199 F .3.9.6. Function Recursion.......................................................................... 199 F .3.9.7. Friend Functions............................................................................. 199 F .3.9.8. Operator Function........................................................................... 200 F .3.10. Classes............................................................................................. 200 F .3.10.1. Data Members...............................................................................200 F .3.10.2. Function Members..........................................................................200 F .3.10.3. Virtual Functions........................................................................... 200 F .3.10.4. Virtual Base Classes........................................................................201 F .3.10.5. Anonymous Unions......................................................................... 201 F .3.10.6. Windows-Specific........................................................................... 201 F .3.11. Templates......................................................................................... 202 F .3.12. Trigraphs and Digraphs..........................................................................202 F .3.13. Const-qualified variables....................................................................... 203 F .3.14. Long Double...................................................................................... 203 F .3.15. Deprecation Annotation........................................................................ 203 F .3.16. C++11 Features...................................................................................204 F .3.16.1. Lambda Expressions........................................................................204 F .3.16.2. std::initializer_list..........................................................................205 F .3.16.3. Rvalue references.......................................................................... 206 F .3.16.4. Constexpr functions and function templates.......................................... 206 F .3.16.5. Constexpr variables........................................................................ 206 F .3.16.6. Inline namespaces..........................................................................207 F .3.16.7. thread_local.................................................................................208 F .3.16.8. __global__ functions and function templates......................................... 208 F .3.16.9. __device__/__constant__/__shared__ variables...................................... 210 F .3.16.10. Defaulted functions.......................................................................210 F .3.17. C++14 Features...................................................................................211 F .3.17.1. Functions with deduced return type....................................................211 F .3.17.2. Variable templates......................................................................... 212 F .4. Polymorphic Function Wrappers..................................................................... 212 F .5. Experimental Feature: Extended Lambdas.........................................................216 F .5.1. Extended Lambda Type Traits...................................................................217 F .5.2. Extended Lambda Restrictions..................................................................218 F .5.3. Notes on __host__ __device__ lambdas.......................................................226 F .5.4. *this Capture By Value........................................................................... uploads/s1/ cuda-c-programming-guide.pdf

  • 50
  • 0
  • 0
Afficher les détails des licences
Licence et utilisation
Gratuit pour un usage personnel Attribution requise
Partager
  • Détails
  • Publié le Mar 18, 2022
  • Catégorie Administration
  • Langue French
  • Taille du fichier 5.6862MB