DGX-Spark-Finetune-LLM

(github.com)

2 points | by waybarrios 5 hours ago

1 comment

  • waybarrios 5 hours ago
    I built a toolkit to fine-tune LLMs using LoRA + native 4-bit quantization on NVIDIA's new Blackwell hardware (the DGX Spark, built around the GB10 Grace Blackwell superchip).

      Key features:
      - NVFP4 (4-bit) via Transformer Engine - fastest option (TE pattern sketched after this list)
      - MXFP8 (8-bit) for higher precision
      - bitsandbytes FP4 fallback for any CUDA GPU (LoRA + FP4 sketch at the end)
      - ~240MB LoRA adapters instead of ~6GB full models
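
      Roughly, the Transformer Engine path looks like the sketch below. It uses the stock DelayedScaling FP8 recipe; the NVFP4 and MXFP8 recipes plug into the same autocast context but need a recent TE build on Blackwell, so treat the recipe choice here as illustrative rather than the toolkit's exact code:

        import torch
        import transformer_engine.pytorch as te
        from transformer_engine.common.recipe import DelayedScaling, Format

        # TE modules (te.Linear, te.LayerNormLinear, ...) stand in for
        # their torch.nn counterparts so matmuls can run in low precision.
        model = te.Linear(2048, 2048, bias=True).cuda()
        inp = torch.randn(16, 2048, device="cuda")

        # The recipe object picks the low-precision format and scaling
        # scheme; DelayedScaling/Format.HYBRID is the standard FP8 recipe.
        recipe = DelayedScaling(fp8_format=Format.HYBRID, amax_history_len=16)

        with te.fp8_autocast(enabled=True, fp8_recipe=recipe):
            out = model(inp)
        out.sum().backward()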
    
      Tested on a DGX Spark (128GB of unified memory shared by CPU and GPU). Training SmolLM3-3B takes ~70GB with NVFP4.
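
      The fallback path is the standard bitsandbytes + PEFT combination. A minimal sketch (the model id and LoRA hyperparameters below are illustrative assumptions, not the toolkit's defaults):

        import torch
        from transformers import AutoModelForCausalLM, BitsAndBytesConfig
        from peft import (LoraConfig, get_peft_model,
                          prepare_model_for_kbit_training)

        # FP4 weight quantization via bitsandbytes: works on any CUDA
        # GPU, unlike the Blackwell-only NVFP4 path.
        bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="fp4",
            bnb_4bit_compute_dtype=torch.bfloat16,
        )
        model = AutoModelForCausalLM.from_pretrained(
            "HuggingFaceTB/SmolLM3-3B",  # model id assumed from the post
            quantization_config=bnb_config,
            device_map="auto",
        )
        model = prepare_model_for_kbit_training(model)

        # LoRA trains small low-rank adapters on top of the frozen 4-bit
        # base; only the adapter weights (the ~240MB artifact) get saved.
        lora_config = LoraConfig(
            r=16, lora_alpha=32, lora_dropout=0.05,
            target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
            task_type="CAUSAL_LM",
        )
        model = get_peft_model(model, lora_config)
        model.print_trainable_parameters()
        # ...train with a standard HF Trainer, then
        # model.save_pretrained("adapter/") to write just the adapter.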