batch size - a larger batch size reduces the number of optimizer steps that need to be performed per epoch
learning rate - a higher learning rate can improve the speed of learning, though too high a value can make training unstable
optimizer class - optimizers modify the weights using different strategies, and some strategies can converge faster than others (a combined sketch for these three knobs follows below)
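A minimal sketch of where these three knobs live in a typical PyTorch training loop (PyTorch is assumed here, since DataParallel is mentioned later; the model, dataset, and hyperparameter values are placeholders):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# placeholder dataset/model just to make the sketch runnable
dataset = TensorDataset(torch.randn(1024, 32), torch.randn(1024, 1))
loader = DataLoader(dataset, batch_size=256, shuffle=True)  # larger batch -> fewer optimizer steps per epoch

model = nn.Linear(32, 1)
# swapping the optimizer class is a one-line change, e.g. SGD -> AdamW
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)  # lr controls the step size

loss_fn = nn.MSELoss()
for x, y in loader:
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
```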
residual connections - vanishing, exploding, ... gradients can cause poor training performance; adding skip connections to the nn graph helps gradients flow and stabilizes training -> so you can achieve the same results while running fewer experiments
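For illustration, a minimal residual block in PyTorch; `ResidualBlock` and its sizes are made up for this sketch:

```python
import torch
from torch import nn

class ResidualBlock(nn.Module):
    # output = body(x) + x, i.e. a skip connection around the block body
    def __init__(self, dim: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(dim, dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # the identity path lets gradients flow around self.body,
        # mitigating vanishing/exploding gradients in deep stacks
        return x + self.body(x)

block = ResidualBlock(64)
print(block(torch.randn(8, 64)).shape)  # torch.Size([8, 64])
```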
model split architecture - we can run different parts of the model on multiple GPUs (model parallelism), as sketched below
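A minimal sketch of such a split, assuming two GPUs are visible as cuda:0 and cuda:1 (layer sizes are placeholders):

```python
import torch
from torch import nn

class TwoStageModel(nn.Module):
    # manual model parallelism: each stage lives on its own GPU
    def __init__(self):
        super().__init__()
        self.stage1 = nn.Linear(32, 64).to("cuda:0")
        self.stage2 = nn.Linear(64, 1).to("cuda:1")

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.stage1(x.to("cuda:0"))
        # activations are copied between devices at the stage boundary
        return self.stage2(x.to("cuda:1"))

model = TwoStageModel()
out = model(torch.randn(16, 32))
```

Note the device-to-device copy at the stage boundary: without pipelining batches, the two GPUs take turns idling, so this split pays off mainly when the model does not fit on one device.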
data split flow - we can shard each batch of data across multiple GPUs (data parallelism; see Partitioning at the end of this list)
number and size of hidden layers - fewer and smaller hidden layers mean fewer weights to update, so each training step is faster
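To make the knob concrete, a hypothetical `make_mlp` helper that exposes depth and width as hyperparameters:

```python
from torch import nn

def make_mlp(in_dim: int, hidden_dim: int, n_hidden: int, out_dim: int) -> nn.Sequential:
    # build an MLP whose depth (n_hidden) and width (hidden_dim) are tunable
    layers = [nn.Linear(in_dim, hidden_dim), nn.ReLU()]
    for _ in range(n_hidden - 1):
        layers += [nn.Linear(hidden_dim, hidden_dim), nn.ReLU()]
    layers.append(nn.Linear(hidden_dim, out_dim))
    return nn.Sequential(*layers)

small = make_mlp(32, 64, n_hidden=2, out_dim=1)   # cheaper per training step
large = make_mlp(32, 512, n_hidden=8, out_dim=1)  # more capacity, slower steps
```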
location of data versus location of training infrastructure - training models close to where the data is stored increases the throughput of the system by avoiding slow network transfers
mixed precision policy - improves speed by allowing computation in different precisions: float16/float32/...
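A self-contained sketch of one common way to apply such a policy in PyTorch, via torch.cuda.amp (a CUDA GPU is assumed; the model and data are placeholders):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

loader = DataLoader(TensorDataset(torch.randn(1024, 32), torch.randn(1024, 1)), batch_size=256)
model = nn.Linear(32, 1).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
loss_fn = nn.MSELoss()
scaler = torch.cuda.amp.GradScaler()  # rescales the loss so float16 gradients don't underflow

for x, y in loader:
    x, y = x.cuda(), y.cuda()
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():   # eligible ops run in float16, the rest stay in float32
        loss = loss_fn(model(x), y)
    scaler.scale(loss).backward()
    scaler.step(optimizer)            # unscales the gradients, then steps
    scaler.update()
```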
apply batch processing - minimize RAM-to-VRAM communication cycles: instead of sending samples one by one, send them in batches
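A hypothetical helper illustrating the idea (`predict_batched` is an assumption for this sketch, not an existing API):

```python
import torch
from torch import nn

@torch.no_grad()
def predict_batched(model: nn.Module, samples: torch.Tensor, batch_size: int = 64) -> torch.Tensor:
    # one RAM->VRAM transfer per batch instead of one per sample
    model.eval()
    outputs = []
    for i in range(0, len(samples), batch_size):
        batch = samples[i:i + batch_size].cuda()
        outputs.append(model(batch).cpu())
    return torch.cat(outputs)

preds = predict_batched(nn.Linear(32, 1).cuda(), torch.randn(1000, 32))
```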
Replication - when serving the model, we can use k8s to replicate it so that we can serve more requests (see the example manifest below). All we have to do is:
pick a specific node selector
set up the GPU Operator
[optional] set up MIG devices
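A hedged manifest sketch of these steps; every name, label, and the image are placeholders, and it assumes the GPU Operator is already installed so that nvidia.com/gpu (or MIG) resources and its node labels exist:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-server            # placeholder name
spec:
  replicas: 3                   # serve more requests by replicating the pod
  selector:
    matchLabels:
      app: model-server
  template:
    metadata:
      labels:
        app: model-server
    spec:
      nodeSelector:
        nvidia.com/gpu.present: "true"   # schedule only onto GPU nodes
      containers:
        - name: server
          image: registry.example.com/model-server:latest  # placeholder image
          resources:
            limits:
              nvidia.com/gpu: 1          # or a MIG slice, e.g. nvidia.com/mig-1g.5gb: 1
```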
Partitioning (data split) - we can split the data across several GPUs with DataParallel; this enables multi-GPU training and inference (sketch below).
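A minimal sketch of that one-liner (placeholder model and batch; at least two visible GPUs are assumed for an actual split):

```python
import torch
from torch import nn

model = nn.Linear(32, 1)
if torch.cuda.device_count() > 1:
    # DataParallel splits each input batch across the visible GPUs,
    # runs the replicas in parallel, and gathers the outputs on device 0
    model = nn.DataParallel(model)
model = model.cuda()

out = model(torch.randn(256, 32).cuda())  # the 256-sample batch is sharded across GPUs
```

For multi-node or higher-throughput training, PyTorch's documentation recommends DistributedDataParallel over DataParallel, but DataParallel is the simpler drop-in shown here.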