GELU


  ReLU is commonly used in CNN models

  GELU (Gaussian Error Linear Unit) is commonly used in NLP models, e.g. GPT

  

  ReLU alleviates the vanishing gradient problem; however, it can cause the dying ReLU problem, in which some neurons output zero for all inputs and, because the ReLU gradient is zero for negative inputs, are unlikely to recover.
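  To make this concrete, here is a minimal NumPy sketch of a hypothetical toy neuron (not from any particular model) whose pre-activations are all negative: its output and its weight gradient are both zero, so gradient descent cannot revive it.

    import numpy as np

    def relu(z):
        return np.maximum(0.0, z)

    def relu_grad(z):
        # derivative of ReLU w.r.t. its input: 1 where z > 0, else 0
        return (z > 0).astype(float)

    rng = np.random.default_rng(0)
    x = np.abs(rng.normal(size=(8, 4)))  # nonnegative inputs, e.g. outputs of a previous ReLU layer
    w = -np.abs(rng.normal(size=4))      # weights pushed negative, e.g. by one bad update
    b = -1.0

    z = x @ w + b                        # pre-activations are all negative
    a = relu(z)                          # outputs are all zero -> the neuron is "dead"

    # Backprop through the neuron: dL/dw = x^T (upstream * relu'(z)).
    # relu'(z) is zero everywhere here, so the weight gradient vanishes
    # and gradient descent never moves the neuron out of the dead region.
    upstream = np.ones_like(z)
    dw = x.T @ (upstream * relu_grad(z))
    print("activations:", a)             # all zeros
    print("weight gradient:", dw)        # all zeros -> no learning signal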

  GELU is a smoother version of ReLU, but it is more expensive to compute.

  Unlike ReLU, which gates inputs by their sign, GELU weights inputs by their percentile under a standard Gaussian: GELU(x) = x * Φ(x), where Φ is the standard normal CDF. This means GELU lets small negative values through instead of zeroing them out, providing a richer gradient for backpropagation.
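  A minimal NumPy/SciPy sketch (with arbitrary illustrative inputs) comparing ReLU, the exact GELU, and the cheaper tanh approximation used by many GPT-style implementations; note how GELU keeps small negative outputs just below zero:

    import numpy as np
    from scipy.special import erf  # Φ(x) can be written with the error function

    def relu(x):
        return np.maximum(0.0, x)

    def gelu_exact(x):
        # GELU(x) = x * Φ(x), with Φ(x) = 0.5 * (1 + erf(x / sqrt(2)))
        return x * 0.5 * (1.0 + erf(x / np.sqrt(2.0)))

    def gelu_tanh(x):
        # cheaper tanh approximation of GELU
        return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

    x = np.array([-3.0, -1.0, -0.5, 0.0, 0.5, 1.0, 3.0])
    print("relu      :", relu(x))        # negative inputs are zeroed out
    print("gelu exact:", gelu_exact(x))  # small negative outputs survive near zero
    print("gelu tanh :", gelu_tanh(x))   # close to the exact values, cheaper to compute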