A curated list of the best CUDA programming books
Updated May 19, 2026
Real-time stream editing pipeline powered by the FLUX.2-klein-4B model, optimized for consumer GPUs
A GPU-Accelerated First-Order LP Solver
The intelligent OptiScaler installer Linux gamers needed. Automates FSR4, XeSS & DLSS configuration with GPU-optimized profiles for RDNA3/4, Arc & RTX cards.
GVProf: A Value Profiler for GPU-based Clusters
AI Infrastructure Performance Engineer Learning Track - GPU optimization, inference optimization, and cost reduction
Boost Valheim's FPS to forge a smoother Viking journey!
Fast waifu2x converter with GPU optimization
Handwritten Flash Attention 2 CUDA kernel for Blackwell (SM120) with TMA, swizzle, double buffering & warp specialization
AI Infrastructure Senior Engineer Learning Track - Advanced ML infrastructure and technical leadership
KeSSie HUGE Context Semantic recall for Large Language Models
The GPU Optimizer for ML Models enhances GPU performance for machine learning. It offers advanced scheduling, real-time monitoring, and efficient resource management through a user-friendly web interface and robust API, integrating big data technologies for seamless data processing and model optimization. @NVIDIA
Production-ready checklists and frameworks for deploying LLMs, GenAI models, and AI infrastructure. Covers vLLM, Kubernetes, GPU optimization, observability, compliance, and Day-0 to Day-2 operations.
Bilingual (English and Chinese) CUDA SGEMM optimization tutorial and reference implementation, from naive kernels to Tensor Core WMMA
First open-source real-time face filter app using MediaPipe FaceMesh for high-performance, GPU-accelerated effects.
This is a short course covering GPU optimization techniques for LLM inference
Physics-based computation at scale — Hamiltonian dynamics, spectral theory, and statistical mechanics powering optimization, drug discovery, genomics, molecular proof, and agentic commerce.
A high-performance C++ backend for extreme-context LLM inference. It replaces item-count batching with dynamic, VRAM-aware First-Fit Decreasing (FFD) bin packing. By using PyBind11 for async queueing, 16-token alignment, and `cudaHostAlloc` for zero-copy FlashAttention-2 transfers, it mathematically eliminates OOMs and maximizes GPU throughput.
Reproduces and optimizes common deep learning operators, with implementations in both CUDA and Triton, for learning and reference
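One of the descriptions above mentions replacing item-count batching with First-Fit Decreasing (FFD) bin packing. As a rough illustration of that general technique only, and not that project's actual code, here is a minimal Python sketch. The function name `ffd_pack`, the integer request sizes, and the single `capacity` budget (standing in for available VRAM) are all assumptions made for this example.

```python
from typing import List


def ffd_pack(sizes: List[int], capacity: int) -> List[List[int]]:
    """Group request sizes into batches using First-Fit Decreasing.

    `sizes` and `capacity` are hypothetical stand-ins for per-request
    memory cost and the memory budget of one batch.
    """
    bins: List[List[int]] = []   # items placed in each batch
    free: List[int] = []         # remaining capacity of each batch

    # Decreasing: consider the largest requests first.
    for size in sorted(sizes, reverse=True):
        if size > capacity:
            raise ValueError(f"request of size {size} exceeds capacity {capacity}")
        # First-fit: place the request in the first batch with enough room.
        for i, room in enumerate(free):
            if size <= room:
                bins[i].append(size)
                free[i] -= size
                break
        else:
            # No existing batch fits, so open a new one.
            bins.append([size])
            free.append(capacity - size)
    return bins
```

With these assumptions, `ffd_pack([4, 8, 1, 4, 2, 1], 10)` returns `[[8, 2], [4, 4, 1, 1]]`: two batches that each fill the budget exactly, where fixed batches of three items would have overflowed it.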