CARMA: Collocation-Aware Resource Manager with GPU Memory Estimator
Aug 26, 2025
Ehsan Yousefzadeh-Asl-Miandoab
Reza Karimzadeh
Bulat Ibragimov
Florina M. Ciorba
Pinar Tözün
Abstract
GPUs running deep learning workloads are frequently underutilized due to exclusive resource allocation despite their parallel nature. Collocating multiple deep learning training tasks on shared GPUs can improve utilization, yet it introduces out-of-memory (OOM) crashes and performance interference. We present CARMA, a task-level, collocation-aware resource management system that mitigates these risks through fine-grained monitoring, GPU memory estimation via a novel ML-based estimator (GPUMemNet), utilization-aware placement policies, and lightweight recovery. CARMA significantly improves GPU utilization and reduces execution time and energy use on real-world deep learning traces.
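To make the idea of collocation-aware placement concrete, the following is a minimal sketch of how an estimated memory footprint and monitored GPU state could be combined into a placement decision. It is not CARMA's actual policy or code: the `GpuState` fields, the `pick_gpu` function, the safety margin, and the utilization cap are all hypothetical names and values chosen for illustration. The memory estimate is assumed to come from some estimator (the abstract's GPUMemNet plays that role in CARMA) and is simply passed in as a number.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class GpuState:
    """Snapshot of one GPU as a monitor might report it (hypothetical fields)."""
    gpu_id: int
    free_mem_mib: int    # currently free device memory
    sm_util_pct: float   # recent compute utilization, 0-100


def pick_gpu(
    gpus: Sequence[GpuState],
    est_mem_mib: int,
    safety_margin_mib: int = 1024,   # assumed headroom against estimator error
    util_cap_pct: float = 80.0,      # assumed cap to limit interference
) -> Optional[int]:
    """Return the id of a GPU that can host the task, or None to keep it queued."""
    candidates = [
        g for g in gpus
        if g.free_mem_mib >= est_mem_mib + safety_margin_mib
        and g.sm_util_pct <= util_cap_pct
    ]
    if not candidates:
        return None
    # Prefer the least-utilized GPU; break ties by the most free memory.
    best = min(candidates, key=lambda g: (g.sm_util_pct, -g.free_mem_mib))
    return best.gpu_id


gpus = [
    GpuState(gpu_id=0, free_mem_mib=6000, sm_util_pct=35.0),
    GpuState(gpu_id=1, free_mem_mib=20000, sm_util_pct=90.0),
    GpuState(gpu_id=2, free_mem_mib=12000, sm_util_pct=50.0),
]
choice = pick_gpu(gpus, est_mem_mib=8000)
```

In this example GPU 0 lacks the memory headroom and GPU 1 exceeds the utilization cap, so the task would go to GPU 2. The memory check addresses the OOM risk and the utilization check addresses interference; a task that fits nowhere stays queued rather than being collocated unsafely. A recovery path for tasks that still crash is outside the scope of this sketch.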
Type
Publication
arXiv