The Paul Yao Company

Principles of CUDA Parallel Processing for C/C++ Programmers

Overview

This workshop introduces C/C++ programmers to the elements needed to access the features and functionality of the NVIDIA CUDA runtime to enable parallel processing of large volumes of data.


Who Should Attend
This lecture- and lab-based workshop is intended for software developers who are tasked with building new libraries, or enhancing existing ones, to offload computationally intensive operations from the main system CPU to the GPU when one or more CUDA-compatible GPUs are present.

Workshop Highlights
  • The role of Compute support in software development.
  • CUDA unit testing strategies.
  • Memory management of applications, CPUs, and GPUs.
  • GPU threads, streams, and events.
  • Parallel execution strategies.
  • Memory coalescing and performance.
  • Multi-GPU data management.
  • GPU occupancy.
  • Optimizing performance using tensor cores.

Performance Objectives
At workshop completion, attendees will be able to...
  • Describe how to identify a compute candidate within existing software.
  • Set up a development workstation and create a Hello CUDA program.
  • Describe three strategies for maximizing computational, memory, and instruction throughput.
  • Write code to access the CUDA device interface.
  • Write code to set up unit testing for CUDA-based software.
  • Describe the organization of application, CPU, and GPU memory.
  • Describe the difference between CPU threads and GPU threads.
  • Write code using streams and events to optimize compute throughput.
  • Write code to identify warp divergence.
  • Set up a development workstation with profiling tools and profile CUDA execution.
  • Write code to detect multi-GPU systems and collect KPIs for multi-GPU usage.
  • Describe the rationale and strategies for memory coalescing.
  • Write code to detect and use tensor cores.

Workshop Syllabus

Monday

Compute Candidates

  • Input
  • Process
  • Output

Hello CUDA

  • System Setup
  • Visual Studio Wizard
  • Program Structure
  • Run-time Activity

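For a taste of the lab work, here is a minimal "Hello CUDA" program of the kind built in this module (an illustrative sketch, not the workshop's actual lab code):

    #include <cstdio>

    // Kernel: runs on the GPU; each thread prints its global index.
    __global__ void helloKernel() {
        printf("Hello CUDA from thread %d\n",
               blockIdx.x * blockDim.x + threadIdx.x);
    }

    int main() {
        helloKernel<<<2, 4>>>();     // launch 2 blocks of 4 threads each
        cudaDeviceSynchronize();     // wait for the GPU to finish printing
        return 0;
    }
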
Compute Interfaces

  • APIs
  • Libraries
  • Function Families
  • Interfaces

Unit Testing

  • Setup
  • Test
  • Validate
  • Performance

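A common validation pattern covered here, sketched with a hypothetical scaleKernel and tolerance of our choosing: run the kernel, copy the result back, and compare it against a CPU reference.

    #include <cmath>
    #include <vector>
    #include <cuda_runtime.h>

    // Hypothetical kernel under test: multiplies each element by 2.
    __global__ void scaleKernel(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    bool testScaleKernel() {
        const int n = 1024;
        std::vector<float> host(n, 1.5f);
        float *dev = nullptr;
        cudaMalloc(&dev, n * sizeof(float));
        cudaMemcpy(dev, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);
        scaleKernel<<<(n + 255) / 256, 256>>>(dev, n);
        std::vector<float> result(n);
        cudaMemcpy(result.data(), dev, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(dev);
        // Validate every element against the CPU-computed expectation.
        for (int i = 0; i < n; ++i)
            if (std::fabs(result[i] - host[i] * 2.0f) > 1e-6f) return false;
        return true;
    }
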
Tuesday

Memory Management

  • Application
  • CPU
  • GPU

Application Memory

  • C/C++
  • Native Win32
  • Managed

CPU Memory

  • Real & Protected
  • Kernel & User
  • Shared User Memory
  • Paged & Virtual
  • Cached & Missing

GPU Memory

  • Global
  • Shared
  • Local
  • Thread

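These memory spaces map onto distinct qualifiers in CUDA C/C++; a small illustrative sketch, assuming a 256-thread block:

    __device__ float bias = 1.0f;            // global memory: visible to all threads

    __global__ void addBias(const float *in, float *out, int n) {
        __shared__ float tile[256];          // shared memory: one copy per block
        int tid = threadIdx.x;
        int i = blockIdx.x * blockDim.x + tid;
        float v = (i < n) ? in[i] : 0.0f;    // 'v' is thread-local: a register,
        tile[tid] = v;                       // or local memory if spilled
        __syncthreads();                     // make the block's writes visible
        if (i < n) out[i] = tile[tid] + bias;
    }
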
Wednesday

Threads

  • GPU vs CPU Thread
  • Thread Blocks
  • GPU Warps

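To make the CPU/GPU contrast concrete: a single launch creates thousands of lightweight GPU threads, organized into blocks and executed in 32-thread warps. A minimal sketch (the launch line and d_data are illustrative):

    // Each thread computes one element; the hardware groups every
    // 32 consecutive threads of a block into a warp.
    __global__ void scale(float *data, float factor, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
        if (i < n) data[i] *= factor;
    }

    // Launch enough 256-thread blocks (8 warps each) to cover n elements:
    //   scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);
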
Streams and Events

  • CUDA Streams
  • CUDA Events

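A sketch of the pattern taught here, assuming hypothetical buffers d_buf and h_buf of size bytes: a copy is issued asynchronously on a stream and timed with a pair of events.

    cudaStream_t stream;
    cudaEvent_t start, stop;
    cudaStreamCreate(&stream);
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, stream);    // timestamp when the stream reaches this point
    cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, stream);
    cudaEventRecord(stop, stream);     // timestamp after the async copy completes
    cudaEventSynchronize(stop);        // block the CPU until 'stop' has occurred

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);   // elapsed milliseconds between events

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaStreamDestroy(stream);
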
Warp Divergence

  • GPU Threads
  • Performance Cost
  • Design Patterns

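An illustrative example of what attendees learn to spot: when threads of the same 32-thread warp take different branches, the hardware serializes both paths.

    // Divergent: even and odd lanes of each warp take different branches.
    __global__ void divergent(float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            if (i % 2 == 0)
                out[i] = sinf((float)i);   // half the warp idles here...
            else
                out[i] = cosf((float)i);   // ...then the other half idles here
        }
    }

    // Divergence-free variant: branch on the warp index, so every
    // thread in a given warp takes the same path.
    __global__ void uniform(float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            if ((i / 32) % 2 == 0)
                out[i] = sinf((float)i);
            else
                out[i] = cosf((float)i);
        }
    }
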
Atomics & Reductions

  • Atomic Operations
  • Role of Reductions

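A classic combination of the two topics, sketched for a 256-thread block: each block reduces its slice in shared memory, then one atomic per block folds the partial sum into the global total.

    __global__ void sumReduce(const float *in, float *total, int n) {
        __shared__ float partial[256];           // assumes blockDim.x == 256
        int tid = threadIdx.x;
        int i = blockIdx.x * blockDim.x + tid;
        partial[tid] = (i < n) ? in[i] : 0.0f;
        __syncthreads();
        // Tree reduction within the block: halve the active threads each step.
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s) partial[tid] += partial[tid + s];
            __syncthreads();
        }
        if (tid == 0) atomicAdd(total, partial[0]);   // one atomic per block
    }
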
Thursday

Parallel Execution Strategies

  • Computational Throughput
  • Memory Throughput
  • Instruction Throughput
  • Memory Thrashing

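As one example of a throughput-oriented launch pattern (our illustration, not necessarily the workshop's), the grid-stride loop covers any input size with a single launch while keeping memory accesses coalesced:

    // Each thread walks the array with a stride of the whole grid,
    // so any grid size processes all n elements.
    __global__ void saxpy(float a, const float *x, float *y, int n) {
        for (int i = blockIdx.x * blockDim.x + threadIdx.x;
             i < n;
             i += blockDim.x * gridDim.x) {
            y[i] = a * x[i] + y[i];
        }
    }
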
Profiling

  • Tools
  • Metrics
  • Launch Parameters

Memory & Performance

  • Memory Hierarchy
  • Memory Coalescing
  • Bandwidth & Latency
  • Bank Conflicts
  • cudaMemcpy()

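For reference, the basic cudaMemcpy() round trip looks like this (illustrative sketch; buffer names are hypothetical):

    #include <cuda_runtime.h>
    #include <vector>

    int main() {
        const size_t n = 1 << 20;
        const size_t bytes = n * sizeof(float);
        std::vector<float> h_buf(n, 1.0f);   // pageable host buffer

        float *d_buf = nullptr;
        cudaMalloc(&d_buf, bytes);                                       // device allocation
        cudaMemcpy(d_buf, h_buf.data(), bytes, cudaMemcpyHostToDevice);  // host -> device
        // ... kernels would operate on d_buf here ...
        cudaMemcpy(h_buf.data(), d_buf, bytes, cudaMemcpyDeviceToHost);  // device -> host
        cudaFree(d_buf);
        return 0;
    }

The bandwidth and latency discussion covers why pinned (page-locked) host memory transfers faster than the pageable buffer shown here.
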
Multi-GPU

  • Hello M-GPU
  • Data Management
  • Test Strategies

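"Hello M-GPU" starts with device enumeration; a minimal sketch:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int count = 0;
        cudaGetDeviceCount(&count);              // number of CUDA-capable GPUs
        for (int d = 0; d < count; ++d) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, d);   // per-device capabilities
            printf("GPU %d: %s, %d SMs, %.1f GiB\n", d, prop.name,
                   prop.multiProcessorCount,
                   prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
        }
        return 0;                                // cudaSetDevice(d) targets one GPU
    }
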
Friday

GPU Occupancy

  • Trade-offs
  • Warp Scheduling

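The runtime can report occupancy directly; a sketch using a trivial kernel (the kernel itself is our illustrative assumption):

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void work(float *data) { data[threadIdx.x] += 1.0f; }

    int main() {
        int blocksPerSM = 0;
        // How many 256-thread blocks of 'work' can be resident per SM,
        // given its register and shared-memory footprint?
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, work, 256, 0);
        printf("Resident blocks per SM: %d\n", blocksPerSM);
        return 0;
    }
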
Memory Coalescing

  • Internal Operation
  • Register Spilling
  • Design Patterns

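The core contrast, as an illustrative sketch: consecutive threads touching consecutive addresses coalesce into a few memory transactions, while strided access does not.

    // Coalesced: thread i reads element i, so a warp's 32 loads fall
    // in adjacent addresses and collapse into a few transactions.
    __global__ void copyCoalesced(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }

    // Strided: adjacent threads touch addresses 'stride' elements apart;
    // at large strides each load becomes its own transaction.
    __global__ void copyStrided(const float *in, float *out, int n, int stride) {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
        if (i < n) out[i] = in[i];
    }
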
Tensor Cores

  • Basics
  • Usage
  • Limitations
  • Debugging
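
A simple availability check (sketch): tensor cores first shipped with compute capability 7.0 (Volta), so the device's major revision number is a reasonable first test.

    #include <cuda_runtime.h>

    // Returns true if 'device' has tensor cores (compute capability >= 7.0).
    bool hasTensorCores(int device) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, device);
        return prop.major >= 7;
    }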