CARMA: Collocation-Aware Resource Manager with GPU Memory Estimator

Aug 26, 2025 · Ehsan Yousefzadeh-Asl-Miandoab, Reza Karimzadeh, Bulat Ibragimov, Florina M. Ciorba, Pinar Tözün
Abstract
GPUs running deep learning workloads are frequently underutilized due to exclusive resource allocation despite their parallel nature. Collocating multiple deep learning training tasks on shared GPUs can improve utilization, yet it introduces out-of-memory (OOM) crashes and performance interference. We present CARMA, a task-level, collocation-aware resource management system that mitigates these risks through fine-grained monitoring, GPU memory estimation via a novel ML-based estimator (GPUMemNet), utilization-aware placement policies, and lightweight recovery. CARMA significantly improves GPU utilization and reduces execution time and energy use on real-world deep learning traces.
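To make the idea of memory- and utilization-aware collocation concrete, here is a minimal Python sketch of one possible placement check. It is not CARMA's actual policy or API: the `GpuState` fields, the `pick_gpu` function, the safety margin, and the utilization cap are all hypothetical names and values chosen for illustration. The estimated memory is assumed to come from some estimator (GPUMemNet plays that role in the paper), and the GPU snapshots from some monitoring source.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class GpuState:
    """Snapshot of one GPU as a monitor might report it (hypothetical fields)."""
    gpu_id: int
    total_mem_mib: int
    used_mem_mib: int
    sm_util_pct: float  # compute utilization in percent, 0-100

    @property
    def free_mem_mib(self) -> int:
        return self.total_mem_mib - self.used_mem_mib


def pick_gpu(
    gpus: Sequence[GpuState],
    est_mem_mib: int,
    safety_margin_mib: int = 1024,  # assumed headroom against estimation error
    max_util_pct: float = 80.0,     # assumed cap to limit interference
) -> Optional[int]:
    """Return the id of a GPU that can host the task, or None to keep it queued."""
    candidates = [
        g for g in gpus
        if g.free_mem_mib >= est_mem_mib + safety_margin_mib
        and g.sm_util_pct <= max_util_pct
    ]
    if not candidates:
        return None
    # Prefer the least-utilized GPU; break ties by the most free memory.
    best = min(candidates, key=lambda g: (g.sm_util_pct, -g.free_mem_mib))
    return best.gpu_id


gpus = [
    GpuState(gpu_id=0, total_mem_mib=40960, used_mem_mib=30000, sm_util_pct=35.0),
    GpuState(gpu_id=1, total_mem_mib=40960, used_mem_mib=8000, sm_util_pct=60.0),
    GpuState(gpu_id=2, total_mem_mib=40960, used_mem_mib=12000, sm_util_pct=95.0),
]

small_task = pick_gpu(gpus, est_mem_mib=4000)    # fits on GPU 0 and GPU 1
large_task = pick_gpu(gpus, est_mem_mib=12000)   # only GPU 1 has enough headroom
huge_task = pick_gpu(gpus, est_mem_mib=40000)    # fits nowhere, stays queued
```

In this sketch, a task that fits nowhere is simply left in the queue rather than risking an OOM crash. A real system would also need a recovery path for the case where the estimate turns out too low, which this example does not model.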
Type
Publication
arXiv