cuda-graphs

Here are 7 public repositories matching this topic...

D3velop-llc / csm-rtx5090

Optimized CSM-1B TTS pipeline for RTX 5090 (Blackwell sm_120). CUDA graph replay via patched HF Transformers. ~0.46x RTF.

text-to-speech streaming pytorch tts sesame csm huggingface blackwell torch-compile rtx-5090 sm-120 cuda-graphs

Updated Apr 5, 2026
Python

LaelaZorana / embodied-efficiency

Measuring what makes a vision-language-action (VLA) model fast enough to run on the robot: a 5.9x CUDA-graph win, four experiments on why low-bit quantization doesn't deliver one, a budget-driven deploy-compiler, and a runtime safety supervisor. Live demo: hf.co/spaces/LaelaZ/embodied-efficiency

robotics cuda triton quantization vla embodied-ai torchao cuda-graphs

Updated Jun 7, 2026
Python

davidzha712 / vllm-qwen3-dflash-budget

vLLM v0.21 + Qwen 3.6 DFlash + real thinking_token_budget enforcement on Blackwell (sm_120 / sm_121a)

reasoning blackwell fp8 vllm qwen speculative-decoding qwen3 dgx-spark cuda-graphs dflash thinking-budget

Updated May 26, 2026
Python

davidzha712 / vllm-gemma4-dflash-budget

vLLM v0.21 + Gemma 4 DFlash + real thinking_token_budget enforcement on Blackwell (sm_120 / sm_121a)

gemma reasoning blackwell vllm speculative-decoding nvfp4 dgx-spark cuda-graphs dflash gemma-4 thinking-budget

Updated May 26, 2026
Python

blackfirebitcoin / Dreamerv4-MC-GB10

GB10 inference port; see fork.md

minecraft inference world-model diffusion-transformer flash-attention gb10 dgx-spark cuda-graphs

Updated May 14, 2026
Python

connollydavid / capturable-peer-copy-cuda

Technical note + runnable demo: cudaMemcpyPeerAsync isn't capturable into a CUDA graph; capture a peer-to-peer copy as a DeviceToDevice UVA memcpy or a peer-access kernel. Confirmed identical on WSL2 and native Linux.

gpu peer-to-peer cuda multi-gpu nvlink stream-capture cuda-graphs

Updated Jun 13, 2026
Cuda

karun2328 / qwen2.5-7b-vllm-prefill-benchmarks

Prefill performance study on Qwen2.5-7B using vLLM. Compares static vs mixed (bucketed) prefill under eager execution and CUDA Graphs, with controlled concurrency and real-world latency/throughput metrics.

gpu vllm llm-inference qwen2-5 cuda-graphs prefill-benchmarking inference-optmization

Updated Feb 10, 2026
Python
