It is a linear threshold unit (LTU): the inputs and output are numbers (instead of binary on/off values), and each input connection is associated with a weight (Refer: below diagram). The LTU computes a weighted sum of its inputs: z = w1·x1 + w2·x2 + ⋯ + wn·xn = wᵀ·x.
A step function (Refer: below diagram) is then applied to that sum, and the result is the output: hw(x) = step(z) = step(wᵀ·x).
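As a minimal sketch (NumPy, with made-up weights and inputs purely for illustration), a single LTU is just a dot product followed by a step function:

```python
import numpy as np

def step(z):
    # Heaviside step function: 1 if z >= 0, else 0
    return int(z >= 0)

def ltu(x, w):
    # Weighted sum z = w . x, then the step function
    return step(np.dot(w, x))

# Hypothetical weights and input, just to show the computation
w = np.array([0.5, -1.0, 0.25])
x = np.array([1.0, 0.2, 2.0])
print(ltu(x, w))  # step(0.5*1.0 - 1.0*0.2 + 0.25*2.0) = step(0.8) = 1
```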
A Perceptron is simply composed of a single layer of LTUs, with each neuron connected to all the inputs (Refer: below diagram).
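For reference, a sketch using Scikit-Learn's Perceptron class on the iris dataset (the two-feature choice and the setosa-vs-rest target are assumptions made only for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import Perceptron

# A single layer of LTUs trained on two iris features
iris = load_iris()
X = iris.data[:, (2, 3)]            # petal length, petal width
y = (iris.target == 0).astype(int)  # 1 if Iris setosa, else 0

per_clf = Perceptron()
per_clf.fit(X, y)
print(per_clf.predict([[2.0, 0.5]]))
```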
An MLP is composed of one (passthrough) input layer, one or more layers of LTUs, called hidden layers, and one final layer of LTUs called the output layer (see Figure 10-7). Every layer except the output layer includes a bias neuron and is fully connected to the next layer.
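A minimal sketch of an MLP using Scikit-Learn's MLPClassifier (the two hidden layers of 10 neurons each are an arbitrary choice, just to show the input → hidden → output structure):

```python
from sklearn.datasets import load_iris
from sklearn.neural_network import MLPClassifier

# Input and output layer sizes are inferred from the data;
# hidden_layer_sizes=(10, 10) gives two hidden layers of 10 neurons each
iris = load_iris()
mlp_clf = MLPClassifier(hidden_layer_sizes=(10, 10), max_iter=1000, random_state=42)
mlp_clf.fit(iris.data, iris.target)
print(mlp_clf.predict(iris.data[:3]))
```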
A recurrent neural network looks very much like a feedforward neural network, except it also has connections pointing backward.
When an ANN has two or more hidden layers, it is called a deep neural network (DNN).
Loss functions of neural networks (with hidden layers) are non-convex in the weights (see the Quora link in the references below).
Refer to the links below for a more detailed understanding of what causes ML training to converge to a minimum.
ANNs frequently outperform other ML techniques on very large and complex problems, for several reasons:
There is now a huge quantity of data available to train neural networks.
The tremendous increase in computing power since the 1990s now makes it possible to train large neural networks in a reasonable amount of time.
The training algorithms have been improved. For example, ReLU is now commonly used instead of the sigmoid activation, which suffered from the vanishing gradient problem (see the sketch below).
Refer to the links below to understand the conditions for reaching a global minimum.
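A small NumPy sketch (values chosen only for illustration) of why the sigmoid contributes to vanishing gradients: its derivative is at most 0.25, so repeatedly multiplying it through many layers shrinks the gradient toward zero, whereas the ReLU derivative is 1 for positive inputs:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)           # peaks at 0.25 when z = 0

def relu_grad(z):
    return float(z > 0)            # 1 for positive inputs, 0 otherwise

z = 2.0
n_layers = 10
# Gradient factor after backpropagating through 10 layers (weights ignored for simplicity)
print(sigmoid_grad(z) ** n_layers)  # ~1.6e-10: the gradient has effectively vanished
print(relu_grad(z) ** n_layers)     # 1.0: the gradient passes through unchanged
```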
The number of layers
For many problems you can start with just one or two hidden layers and it will work just fine
For more complex problems, you can gradually ramp up the number of hidden layers, until you start overfitting the training set.
The number of neurons per layer
The type of activation function to use in each layer, and the random seed value
For the hidden layers, in most cases you can use the ReLU activation function (or one of its variants).
For the output layer, the softmax activation function is generally a good choice for classification tasks (when the classes are mutually exclusive). For regression tasks, you can simply use no activation function at all.
The weight initialization logic
Mini-batch size
Number of epochs (the sketch after this list shows where each of these hyperparameters appears in code)
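A sketch, assuming Keras and a made-up 10-class image classification problem, showing where each of these hyperparameters appears (layer count, neurons per layer, activation functions, seed, weight initialization, mini-batch size, epochs):

```python
import tensorflow as tf
from tensorflow import keras

tf.random.set_seed(42)  # seed value, so weight initialization is reproducible

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),          # passthrough input layer
    # number of hidden layers and neurons per layer are both hyperparameters
    keras.layers.Dense(300, activation="relu",
                       kernel_initializer="he_normal"),  # weight initialization logic
    keras.layers.Dense(100, activation="relu",
                       kernel_initializer="he_normal"),
    keras.layers.Dense(10, activation="softmax"),        # softmax: mutually exclusive classes
])

model.compile(loss="sparse_categorical_crossentropy",
              optimizer="sgd", metrics=["accuracy"])

# Mini-batch size and number of epochs are passed to fit();
# X_train / y_train are placeholders for your training data.
# history = model.fit(X_train, y_train, batch_size=32, epochs=30)
```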
https://www.quora.com/How-can-you-prove-that-the-loss-functions-in-Deep-Neural-nets-are-non-convex
https://www.amazon.in/Hands-Machine-Learning-Scikit-Learn-TensorFlow/dp/1491962291
https://www.linkedin.com/posts/dpkumar_convergence-machinelearningmodels-datasciences-activity-6769803533836521472-rjnu
https://images.app.goo.gl/Gp6ZN6v2vgPB8f9z7
https://youtu.be/jTzJ9zjC8nU