The Paul Yao Company

Principles of CUDA Parallel Processing for C/C++ Programmers

Overview

This workshop introduces C/C++ programmers to the elements needed to access the features and functionality of the NVIDIA CUDA runtime to enable parallel processing of large volumes of data.


Who Should Attend
This lecture- and lab-based workshop is intended for software developers who are tasked with building new libraries, or enhancing existing ones, to offload computationally intensive operations from the main system CPU to the GPU when one or more CUDA-compatible GPUs are present.

Workshop Highlights
  • The role of Compute support in software development.
  • CUDA unit testing strategies.
  • Memory management of applications, CPUs, and GPUs.
  • GPU threads, streams, and events.
  • Parallel execution strategies.
  • Memory coalescing and performance.
  • Multi-GPU data management.
  • GPU occupancy.
  • Optimizing performance using tensor cores.

Performance Objectives
At workshop completion, attendees will be able to...
  • Describe how to identify a compute candidate within existing software.
  • Set up a development workstation and create a Hello CUDA program.
  • Describe three strategies for maximizing computational, memory, and instruction throughput.
  • Write code to access the CUDA device interface.
  • Write code to set up unit testing for CUDA-based software.
  • Describe the organization of application, CPU, and GPU memory.
  • Describe the difference between CPU threads and GPU threads.
  • Write code using streams and events to optimize compute throughput.
  • Write code to identify warp divergence.
  • Set up a development workstation with profiling tools and profile CUDA execution.
  • Write code to detect multi-GPU systems and collect KPIs for multi-GPU usage.
  • Describe the rationale and strategies for memory coalescing.
  • Write code to detect and use tensor cores.

Workshop Syllabus

Monday

Compute Candidates

  • Input
  • Process
  • Output

Hello CUDA

  • System Setup
  • Visual Studio Wizard
  • Program Structure
  • Run-time Activity

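For a taste of the lab work, here is a minimal "Hello CUDA" program of the kind built in this module (an illustrative sketch, not the workshop's actual lab code):

    #include <cstdio>

    // Kernel: runs on the GPU; each thread prints its global index.
    __global__ void helloKernel() {
        printf("Hello CUDA from thread %d\n",
               blockIdx.x * blockDim.x + threadIdx.x);
    }

    int main() {
        helloKernel<<<2, 4>>>();     // launch 2 blocks of 4 threads each
        cudaDeviceSynchronize();     // wait for the GPU to finish printing
        return 0;
    }
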
Compute Interfaces

  • APIs
  • Libraries
  • Function Families
  • Interfaces

Unit Testing

  • Setup
  • Test
  • Validate
  • Performance

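A common validation pattern covered here, sketched with a hypothetical scaleKernel and tolerance of our choosing: run the kernel, copy the result back, and compare it against a CPU reference.

    #include <cmath>
    #include <vector>
    #include <cuda_runtime.h>

    // Hypothetical kernel under test: multiplies each element by 2.
    __global__ void scaleKernel(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    bool testScaleKernel() {
        const int n = 1024;
        std::vector<float> host(n, 1.5f);
        float *dev = nullptr;
        cudaMalloc(&dev, n * sizeof(float));
        cudaMemcpy(dev, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);
        scaleKernel<<<(n + 255) / 256, 256>>>(dev, n);
        std::vector<float> result(n);
        cudaMemcpy(result.data(), dev, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(dev);
        // Validate every element against the CPU-computed expectation.
        for (int i = 0; i < n; ++i)
            if (std::fabs(result[i] - host[i] * 2.0f) > 1e-6f) return false;
        return true;
    }
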
Tuesday

Memory Management

  • Application
  • CPU
  • GPU

Application Memory

  • C/C++
  • Native Win32
  • Managed

CPU Memory

  • Real & Protected
  • Kernel & User
  • Shared User Memory
  • Paged & Virtual
  • Cached & Missing

GPU Memory

  • Global
  • Shared
  • Local
  • Thread

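These memory spaces map onto distinct qualifiers in CUDA C/C++; a small illustrative sketch, assuming a 256-thread block:

    __device__ float bias = 1.0f;            // global memory: visible to all threads

    __global__ void addBias(const float *in, float *out, int n) {
        __shared__ float tile[256];          // shared memory: one copy per block
        int tid = threadIdx.x;
        int i = blockIdx.x * blockDim.x + tid;
        float v = (i < n) ? in[i] : 0.0f;    // 'v' is thread-local: a register,
        tile[tid] = v;                       // or local memory if spilled
        __syncthreads();                     // make the block's writes visible
        if (i < n) out[i] = tile[tid] + bias;
    }
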
Wednesday

Threads

  • GPU vs CPU Thread
  • Thread Blocks
  • GPU Warps

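To make the CPU/GPU contrast concrete: a single launch creates thousands of lightweight GPU threads, organized into blocks and executed in 32-thread warps. A minimal sketch (the launch line and d_data are illustrative):

    // Each thread computes one element; the hardware groups every
    // 32 consecutive threads of a block into a warp.
    __global__ void scale(float *data, float factor, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
        if (i < n) data[i] *= factor;
    }

    // Launch enough 256-thread blocks (8 warps each) to cover n elements:
    //   scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);
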
Streams and Events

  • CUDA Streams
  • CUDA Events

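A sketch of the pattern taught here, assuming hypothetical buffers d_buf and h_buf of size bytes: a copy is issued asynchronously on a stream and timed with a pair of events.

    cudaStream_t stream;
    cudaEvent_t start, stop;
    cudaStreamCreate(&stream);
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, stream);    // timestamp when the stream reaches this point
    cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, stream);
    cudaEventRecord(stop, stream);     // timestamp after the async copy completes
    cudaEventSynchronize(stop);        // block the CPU until 'stop' has occurred

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);   // elapsed milliseconds between events

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaStreamDestroy(stream);
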
Warp Divergence

  • GPU Threads
  • Performance Cost
  • Design Patterns

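An illustrative example of what attendees learn to spot: when threads of the same 32-thread warp take different branches, the hardware serializes both paths.

    // Divergent: even and odd lanes of each warp take different branches.
    __global__ void divergent(float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            if (i % 2 == 0)
                out[i] = sinf((float)i);   // half the warp idles here...
            else
                out[i] = cosf((float)i);   // ...then the other half idles here
        }
    }

    // Divergence-free variant: branch on the warp index, so every
    // thread in a given warp takes the same path.
    __global__ void uniform(float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            if ((i / 32) % 2 == 0)
                out[i] = sinf((float)i);
            else
                out[i] = cosf((float)i);
        }
    }
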
Atomics & Reductions

  • Atomic Operations
  • Role of Reductions

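A classic combination of the two topics, sketched for a 256-thread block: each block reduces its slice in shared memory, then one atomic per block folds the partial sum into the global total.

    __global__ void sumReduce(const float *in, float *total, int n) {
        __shared__ float partial[256];           // assumes blockDim.x == 256
        int tid = threadIdx.x;
        int i = blockIdx.x * blockDim.x + tid;
        partial[tid] = (i < n) ? in[i] : 0.0f;
        __syncthreads();
        // Tree reduction within the block: halve the active threads each step.
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s) partial[tid] += partial[tid + s];
            __syncthreads();
        }
        if (tid == 0) atomicAdd(total, partial[0]);   // one atomic per block
    }
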
Thursday

Parallel Execution Strategies

  • Computational Throughput
  • Memory Throughput
  • Instruction Throughput
  • Memory Thrashing

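As one example of a throughput-oriented launch pattern (our illustration, not necessarily the workshop's), the grid-stride loop covers any input size with a single launch while keeping memory accesses coalesced:

    // Each thread walks the array with a stride of the whole grid,
    // so any grid size processes all n elements.
    __global__ void saxpy(float a, const float *x, float *y, int n) {
        for (int i = blockIdx.x * blockDim.x + threadIdx.x;
             i < n;
             i += blockDim.x * gridDim.x) {
            y[i] = a * x[i] + y[i];
        }
    }
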
Profiling

  • Tools
  • Metrics
  • Launch Parameters

Memory & Performance

  • Memory Hierarchy
  • Memory Coalescing
  • Bandwidth & Latency
  • Bank Conflicts
  • cudaMemcpy()

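For reference, the basic cudaMemcpy() round trip looks like this (illustrative sketch; buffer names are hypothetical):

    #include <cuda_runtime.h>
    #include <vector>

    int main() {
        const size_t n = 1 << 20;
        const size_t bytes = n * sizeof(float);
        std::vector<float> h_buf(n, 1.0f);   // pageable host buffer

        float *d_buf = nullptr;
        cudaMalloc(&d_buf, bytes);                                       // device allocation
        cudaMemcpy(d_buf, h_buf.data(), bytes, cudaMemcpyHostToDevice);  // host -> device
        // ... kernels would operate on d_buf here ...
        cudaMemcpy(h_buf.data(), d_buf, bytes, cudaMemcpyDeviceToHost);  // device -> host
        cudaFree(d_buf);
        return 0;
    }

The bandwidth and latency discussion covers why pinned (page-locked) host memory transfers faster than the pageable buffer shown here.
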
Multi-GPU

  • Hello M-GPU
  • Data Management
  • Test Strategies

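"Hello M-GPU" starts with device enumeration; a minimal sketch:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int count = 0;
        cudaGetDeviceCount(&count);              // number of CUDA-capable GPUs
        for (int d = 0; d < count; ++d) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, d);   // per-device capabilities
            printf("GPU %d: %s, %d SMs, %.1f GiB\n", d, prop.name,
                   prop.multiProcessorCount,
                   prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
        }
        return 0;                                // cudaSetDevice(d) targets one GPU
    }
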
Friday

GPU Occupancy

  • Trade-offs
  • Warp Scheduling

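The runtime can report occupancy directly; a sketch using a trivial kernel (the kernel itself is our illustrative assumption):

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void work(float *data) { data[threadIdx.x] += 1.0f; }

    int main() {
        int blocksPerSM = 0;
        // How many 256-thread blocks of 'work' can be resident per SM,
        // given its register and shared-memory footprint?
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, work, 256, 0);
        printf("Resident blocks per SM: %d\n", blocksPerSM);
        return 0;
    }
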
Memory Coalescing

  • Internal Operation
  • Register Spilling
  • Design Patterns

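The core contrast, as an illustrative sketch: consecutive threads touching consecutive addresses coalesce into a few memory transactions, while strided access does not.

    // Coalesced: thread i reads element i, so a warp's 32 loads fall
    // in adjacent addresses and collapse into a few transactions.
    __global__ void copyCoalesced(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }

    // Strided: adjacent threads touch addresses 'stride' elements apart;
    // at large strides each load becomes its own transaction.
    __global__ void copyStrided(const float *in, float *out, int n, int stride) {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
        if (i < n) out[i] = in[i];
    }
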
Tensor Cores

  • Basics
  • Usage
  • Limitations
  • Debugging
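
A simple availability check (sketch): tensor cores first shipped with compute capability 7.0 (Volta), so the device's major revision number is a reasonable first test.

    #include <cuda_runtime.h>

    // Returns true if 'device' has tensor cores (compute capability >= 7.0).
    bool hasTensorCores(int device) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, device);
        return prop.major >= 7;
    }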