xlite-dev

Develop ML/AI toolkits and ML/AI/CUDA Learning resources.

Pinned

  1. LeetCUDA Public

    📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉

    Cuda 10.3k 1.1k

  2. lite.ai.toolkit Public

    🛠A lite C++ AI toolkit: 100+ models with MNN, ORT and TRT, including Det, Seg, Stable-Diffusion, Face-Fusion, etc.🎉

    C++ 4.4k 774

  3. Awesome-LLM-Inference Public

    📚A curated list of Awesome LLM/VLM Inference Papers with Codes: Flash-Attention, Paged-Attention, WINT8/4, Parallelism, etc.🎉

    Python 5.1k 366

  4. Awesome-DiT-Inference Public

    📚A curated list of Awesome Diffusion Inference Papers with Codes: Sampling, Cache, Quantization, Parallelism, etc.🎉

    Python 538 26

  5. torchlm Public

    💎An easy-to-use PyTorch library for face landmarks detection: training, evaluation, inference, and 100+ data augmentations.🎉

    Python 270 28

  6. ffpa-attn Public

    🤖FFPA: Extend FlashAttention-2 w/ Split-D, ~O(1) SRAM complexity for large headdim, 1.8x~3x↑🎉 vs SDPA.

    Cuda 265 15

Repositories

Showing 10 of 60 repositories
  • LeetCUDA Public

    📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉

    Cuda 10,297 GPL-3.0 1,050 2 0 Updated Apr 18, 2026
  • ffpa-attn Public

    🤖FFPA: Extend FlashAttention-2 w/ Split-D, ~O(1) SRAM complexity for large headdim, 1.8x~3x↑🎉 vs SDPA.

    Cuda 265 GPL-3.0 15 0 0 Updated Apr 18, 2026
  • Awesome-LLM-Inference Public

    📚A curated list of Awesome LLM/VLM Inference Papers with Codes: Flash-Attention, Paged-Attention, WINT8/4, Parallelism, etc.🎉

    Python 5,149 GPL-3.0 366 0 1 Updated Apr 18, 2026
  • cache-dit Public Forked from vipshop/cache-dit

    A PyTorch-native Inference Engine with Cache Acceleration, Parallelism and Quantization for DiTs.

    Python 4 Apache-2.0 69 0 0 Updated Apr 17, 2026
  • diffusers Public Forked from huggingface/diffusers

    🤗 Diffusers: State-of-the-art diffusion models for image, video, and audio generation in PyTorch and FLAX.

    Python 1 Apache-2.0 7,012 0 0 Updated Apr 17, 2026
  • quack Public Forked from Dao-AILab/quack

    A Quirky Assortment of CuTe Kernels

    Python 2 Apache-2.0 112 0 0 Updated Apr 17, 2026
  • cutlass Public Forked from NVIDIA/cutlass

    CUDA Templates and Python DSLs for High-Performance Linear Algebra

    C++ 1 1,810 0 0 Updated Apr 13, 2026
  • sglang Public Forked from sgl-project/sglang

    SGLang is a fast serving framework for large language models and vision language models.

    Python 1 Apache-2.0 5,459 0 0 Updated Apr 2, 2026
  • TensorRT-LLM Public Forked from NVIDIA/TensorRT-LLM

    TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT LLM also contains components to create Python and C++ runtimes that orchestrate the inference execution in a performant way.

    Python 1 2,319 0 0 Updated Apr 1, 2026
  • nunchaku Public Forked from nunchaku-ai/nunchaku

    [ICLR2025 Spotlight] SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models

    Python 3 Apache-2.0 242 0 0 Updated Mar 31, 2026
