After a job is submitted, revising or deleting the code does not affect the already-submitted job.
Setting pin_memory=True can speed up the DataLoader. It is mentioned here.
In the DataLoader, pin_memory cannot be used together with persistent_workers. The issue is discussed here:
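A minimal sketch of enabling pinned memory, assuming PyTorch is available (the toy dataset and shapes are made up for illustration; persistent_workers is left at its default of False, per the note above):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset: 8 samples with 3 features each (illustrative only).
ds = TensorDataset(torch.randn(8, 3), torch.randint(0, 2, (8,)))

# pin_memory=True copies each batch into page-locked (pinned) host memory,
# which enables faster, asynchronous host-to-GPU transfers via
# batch.to("cuda", non_blocking=True).
loader = DataLoader(ds, batch_size=4, pin_memory=True)

x, y = next(iter(loader))
print(x.shape)  # torch.Size([4, 3])
```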
When loading models trained in parallel (e.g., with nn.DataParallel), an extra step is required: here
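The extra step is typically stripping the "module." prefix that nn.DataParallel adds to every state-dict key, so the checkpoint can be loaded into an unwrapped model. A minimal pure-Python sketch (`strip_module_prefix` is a hypothetical helper name, and plain strings stand in for real tensors):

```python
def strip_module_prefix(state_dict):
    """Remove the leading 'module.' that DataParallel adds to each key,
    so the checkpoint can be loaded into an unwrapped model."""
    return {k[len("module."):] if k.startswith("module.") else k: v
            for k, v in state_dict.items()}

# In practice state_dict would come from torch.load(checkpoint_path);
# plain strings stand in for tensors here.
ckpt = {"module.fc.weight": "w", "module.fc.bias": "b"}
print(strip_module_prefix(ckpt))  # {'fc.weight': 'w', 'fc.bias': 'b'}
```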
tensor.detach() returns a new tensor (detached from the computation graph), as discussed here.
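A quick illustration of that behavior, assuming PyTorch is available:

```python
import torch

x = torch.ones(3, requires_grad=True)
y = x.detach()  # new tensor object, excluded from the autograd graph

print(y.requires_grad)  # False
# The new tensor shares storage with the original, so in-place edits
# to y are visible through x as well.
y[0] = 5.0
print(x[0].item())  # 5.0
```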
Runtime on Sulis: since we can only use the home/ folder, runtime depends heavily on the load on the file system. When the load on the home/ directory is high, our code may take much longer than when the load is low.
FP16 vs FP32: https://datascience.stackexchange.com/questions/73107/fp16-fp32-what-is-it-all-about-or-is-it-just-bitsize-for-float-values-pytho
get time in bash: https://unix.stackexchange.com/questions/428217/current-time-date-as-a-variable-in-bash-and-stopping-a-program-with-a-script
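A small bash sketch of the pattern from that link (variable names are illustrative):

```shell
#!/usr/bin/env bash
# Timestamp suitable for log file names.
run_stamp=$(date +"%Y-%m-%d_%H-%M-%S")
echo "log_${run_stamp}.txt"

# Measure elapsed wall-clock seconds around a command.
start=$(date +%s)
sleep 1                      # stand-in for the real workload
end=$(date +%s)
echo "elapsed: $((end - start))s"
```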
When enumerating a data loader while using multiple GPUs, it can take a long time.
The issue is discussed here, and the solution (a multi-epoch data loader) works very well. Experimental results are as follows.
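The multi-epoch idea, as a minimal framework-free sketch: keep one long-lived iterator over the data so per-epoch setup (in PyTorch, respawning worker processes) happens only once. Class and method names are illustrative, not the exact code from the linked discussion:

```python
class MultiEpochLoader:
    """Wrap an iterable and reuse a single never-ending iterator across
    epochs, instead of rebuilding iteration state at every epoch start."""

    def __init__(self, data):
        self.data = data
        self._it = self._repeat()  # created once, reused for all epochs

    def _repeat(self):
        while True:                # cycle over the data forever
            yield from self.data

    def __len__(self):
        return len(self.data)

    def __iter__(self):
        # One "epoch" = len(self.data) items pulled from the shared iterator.
        for _ in range(len(self)):
            yield next(self._it)

loader = MultiEpochLoader([1, 2, 3])
print([list(loader) for _ in range(2)])  # [[1, 2, 3], [1, 2, 3]]
```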