Deep Learning is a subfield of machine learning that employs neural networks with multiple layers (deep neural networks) to model complex patterns and representations in data.
Neural Networks are the foundation of deep learning, consisting of interconnected nodes (neurons) organized into layers. By stacking linear transformations with nonlinear activation functions and tuning the weights with optimization algorithms such as gradient descent, they can learn complex mappings and representations from data.
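As a minimal sketch of these ideas in PyTorch (the layer sizes and task here are illustrative assumptions, not taken from any particular model):

```python
import torch
import torch.nn as nn

# A minimal fully connected network: linear layers interleaved with
# nonlinear activations. Layer sizes here are arbitrary choices.
model = nn.Sequential(
    nn.Linear(784, 128),  # input layer -> hidden layer
    nn.ReLU(),            # nonlinear activation
    nn.Linear(128, 10),   # hidden layer -> output logits
)

x = torch.randn(32, 784)  # a batch of 32 dummy inputs
logits = model(x)         # forward pass
loss = nn.functional.cross_entropy(logits, torch.randint(0, 10, (32,)))
loss.backward()           # backpropagation computes the gradients
```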
Convolutional Neural Networks (CNNs) are deep learning architectures designed for processing structured grid data, such as images. They use convolutional layers to automatically learn hierarchical features from input data.
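A minimal CNN sketch along these lines (all layer sizes and the 10-class output are illustrative, not drawn from any specific paper):

```python
import torch
import torch.nn as nn

# Convolutional layers learn local features, pooling reduces spatial
# resolution, and a linear head produces class scores.
cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # RGB image -> 16 feature maps
    nn.ReLU(),
    nn.MaxPool2d(2),                              # halve spatial resolution
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # deeper, more abstract features
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),                      # global average pooling
    nn.Flatten(),
    nn.Linear(32, 10),                            # 10-class logits
)

images = torch.randn(8, 3, 32, 32)  # batch of 8 dummy RGB images
print(cnn(images).shape)            # torch.Size([8, 10])
```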
Since their introduction, a large number of CNN architectures have been proposed:
LeNet - One of the earliest CNN architectures, designed for handwritten digit recognition.
AlexNet - Developed by Alex Krizhevsky with Ilya Sutskever and Geoffrey Hinton, it gained significant attention for winning the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012. It played a key role in popularizing deep learning.
VGG (Visual Geometry Group) Networks - Known for their simplicity and uniform architecture, VGG networks have variations like VGG16 and VGG19, named after their number of weight layers.
GoogLeNet (Inception) - Introduced the concept of inception modules, which allow for the simultaneous use of filters of different sizes. This architecture aims to capture features at multiple scales.
ResNet (Residual Networks) - Addresses the vanishing gradient problem by introducing skip (residual) connections that add a block's input to its output. Because gradients can flow through these shortcuts unimpeded, very deep networks become much easier to train (a minimal residual block is sketched after this list).
MobileNet - Designed for mobile and edge devices with limited computational resources. It employs depthwise separable convolutions to reduce the number of parameters and computations.
Xception - A variant of Inception, Xception replaces the Inception modules with depthwise separable convolutions, using model parameters more efficiently.
DenseNet (Densely Connected Convolutional Networks) - Introduces dense connectivity, where each layer receives input from all preceding layers. This promotes feature reuse and strengthens feature propagation.
EfficientNet - A family of models that scale depth, width, and input resolution jointly via a compound scaling rule. It achieved state-of-the-art results with far fewer parameters than prior models of comparable accuracy.
YOLO (You Only Look Once) - A real-time object detection system that divides an image into a grid and predicts bounding boxes and class probabilities directly. YOLO has multiple versions, such as YOLOv1, YOLOv2 (YOLO9000), YOLOv3, and YOLOv4.
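To make ResNet's skip connections concrete, here is a minimal residual block in PyTorch; the channel count and layer layout are illustrative, not a faithful reproduction of the original architecture:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal residual block: output = activation(F(x) + x).

    The identity shortcut lets gradients bypass the convolutional
    path, which is what eases training of very deep networks.
    """
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residual = x                       # identity shortcut
        out = self.relu(self.conv1(x))
        out = self.conv2(out)
        return self.relu(out + residual)   # add the skip connection

block = ResidualBlock(64)
x = torch.randn(1, 64, 56, 56)
print(block(x).shape)  # torch.Size([1, 64, 56, 56]) - shape is preserved
```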
Recurrent Neural Networks (RNNs) are neural networks with loops that allow information to persist. They are well-suited for sequence data and time-series analysis, enabling the model to capture temporal dependencies.
RNNs consist of repeating units. Each unit takes an input and produces an output while maintaining a hidden state that acts as a memory of the inputs seen so far.
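A minimal sketch of that recurrence in PyTorch (all sizes are arbitrary): the same cell is applied at every time step, carrying the hidden state forward.

```python
import torch
import torch.nn as nn

cell = nn.RNNCell(input_size=8, hidden_size=16)  # one repeating unit

sequence = torch.randn(5, 3, 8)  # 5 time steps, batch of 3, 8 features each
h = torch.zeros(3, 16)           # initial hidden state (the "memory")

for x_t in sequence:             # the same cell is reused at every step
    h = cell(x_t, h)             # new state depends on input and prior memory

print(h.shape)  # torch.Size([3, 16]) - a summary of the whole sequence
```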
Since the introduction of RNNs, several variations have been proposed that have demonstrated superior performance:
Long Short-Term Memory (LSTM)
Improvement - Introduced to address the vanishing gradient problem by incorporating a more complex memory cell structure.
Memory Cell - Contains a cell state that allows information to be stored, modified, and retrieved selectively over long sequences. It uses three gates ("input", "forget", and "output" gates) to control the flow of information; an explicit implementation of these gates is sketched after this list.
Gated Recurrent Unit (GRU)
Improvement - A variation of LSTM that simplifies its architecture while retaining its ability to capture long-range dependencies.
Gating Mechanism - Utilizes two gates ("update" and "reset" gates) to control the flow of information. The "update" gate combines aspects of the "input" and "forget" gates from LSTM.
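To make the LSTM gates explicit, here is an illustrative cell implementation in PyTorch; all sizes are arbitrary, and in practice one would use the built-in nn.LSTM rather than this sketch:

```python
import torch
import torch.nn as nn

class MinimalLSTMCell(nn.Module):
    """Illustrative LSTM cell with the three gates written out."""

    def __init__(self, input_size: int, hidden_size: int):
        super().__init__()
        # One linear map produces all four gate pre-activations at once.
        self.gates = nn.Linear(input_size + hidden_size, 4 * hidden_size)

    def forward(self, x, state):
        h, c = state                          # hidden state and cell state
        z = self.gates(torch.cat([x, h], dim=1))
        i, f, o, g = z.chunk(4, dim=1)
        i = torch.sigmoid(i)                  # input gate: what to write
        f = torch.sigmoid(f)                  # forget gate: what to erase
        o = torch.sigmoid(o)                  # output gate: what to expose
        g = torch.tanh(g)                     # candidate cell contents
        c = f * c + i * g                     # selectively update the cell state
        h = o * torch.tanh(c)                 # new hidden state
        return h, c

cell = MinimalLSTMCell(8, 16)
h = c = torch.zeros(3, 16)
h, c = cell(torch.randn(3, 8), (h, c))  # one time step
```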
Generative Adversarial Networks (GANs) consist of a generator and a discriminator trained against each other: the generator produces synthetic samples while the discriminator learns to tell them apart from real data. GANs are used for generating realistic synthetic data, such as images.
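A minimal sketch of one adversarial training step in PyTorch; both networks are toy multilayer perceptrons and the "real" data is stand-in noise, purely for illustration:

```python
import torch
import torch.nn as nn

# Toy generator and discriminator; all sizes are illustrative.
G = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 32))
D = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real = torch.randn(8, 32)  # stand-in for a batch of real data
z = torch.randn(8, 16)     # latent noise fed to the generator
fake = G(z)

# Discriminator step: push real toward label 1, fake toward label 0.
d_loss = bce(D(real), torch.ones(8, 1)) + bce(D(fake.detach()), torch.zeros(8, 1))
opt_d.zero_grad()
d_loss.backward()
opt_d.step()

# Generator step: try to make the discriminator label fakes as real.
g_loss = bce(D(fake), torch.ones(8, 1))
opt_g.zero_grad()
g_loss.backward()
opt_g.step()
```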
DCGAN (Deep Convolutional GAN) - Extends GANs by using convolutional neural networks in both the generator and discriminator. DCGANs are particularly effective for image-generation tasks.
WGAN (Wasserstein GAN) - Introduces the Wasserstein distance as a measure of the difference between the generated and real data distributions. WGANs aim to provide more stable training and avoid mode collapse.
CGAN (Conditional GAN) - Extends GANs to generate data conditioned on specific input conditions. It allows the generation of targeted and controlled synthetic data.
CycleGAN - A type of GAN designed for unpaired image-to-image translation. It learns mappings between two domains without requiring paired training examples.
StyleGAN (Style Generative Adversarial Network) - Introduced by NVIDIA, StyleGAN allows control over the style and appearance of generated images. It is known for its ability to generate high-quality, diverse, and realistic faces.
BigGAN - A large-scale GAN architecture designed for high-resolution image synthesis. It has been shown to generate high-quality images across multiple classes.
ProGAN (Progressive GAN) - Progressively grows both the generator and discriminator during training. It starts with a small resolution and gradually increases it, allowing for the generation of high-resolution images.
StarGAN - A GAN architecture designed for multi-domain image-to-image translation. It can translate images across multiple domains, such as changing facial attributes.
SAGAN (Self-Attention GAN) - Incorporates self-attention mechanisms so that generation at each location can draw on distant regions of the image, capturing long-range dependencies that convolutions alone handle poorly.
Transformers are a type of deep learning architecture that relies on self-attention mechanisms to process input data in parallel. They have been highly successful in natural language processing tasks and other sequence-based applications.
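A minimal sketch of the scaled dot-product self-attention at the heart of transformers; the projection matrices and sizes here are illustrative, and real implementations add multiple heads, masking, and learned projections:

```python
import math
import torch

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a whole sequence at once.

    x: (batch, seq_len, d_model); w_q, w_k, w_v are projection matrices.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v           # queries, keys, values
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    weights = torch.softmax(scores, dim=-1)       # how much each token attends to others
    return weights @ v                            # weighted sum of the values

d = 32
x = torch.randn(2, 10, d)  # batch of 2 sequences, 10 tokens each
w_q, w_k, w_v = (torch.randn(d, d) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
print(out.shape)  # torch.Size([2, 10, 32]) - every token processed in parallel
```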
BERT (Bidirectional Encoder Representations from Transformers) - Introduces bidirectional attention and pre-training on large corpora, leading to contextualized word embeddings. BERT has been pivotal in various NLP tasks such as question answering and sentiment analysis.
GPT (Generative Pre-trained Transformer) - The GPT series, including GPT-2 and GPT-3, utilizes transformer architectures for language modeling and generation. GPT-3, in particular, is one of the largest language models with broad applications in natural language understanding and generation.
T5 (Text-to-Text Transfer Transformer) - Unifies different NLP tasks under a "text-to-text" framework, where tasks are cast as converting input text to target text. T5 has achieved state-of-the-art results across various NLP benchmarks.
Transformer-XL - Addresses the limitation of the fixed-length context window in vanilla transformers by introducing recurrence mechanisms. It allows models to capture longer-term dependencies in sequences.
XLNet - Integrates ideas from autoregressive models (like GPT) and autoencoding models (like BERT). Through permutation language modeling, it captures bidirectional context while maintaining the autoregressive property.
RoBERTa (Robustly optimized BERT approach) - An optimized version of BERT that modifies key hyperparameters and removes the next-sentence prediction objective, leading to improved performance on various NLP tasks.
DistilBERT - A smaller and faster version of BERT, trained via knowledge distillation for resource-constrained environments. It retains most of the performance of the original BERT model.
ViT (Vision Transformer) - Extends transformers to computer vision tasks by dividing an image into fixed-size patches and treating them as a sequence of tokens (see the patch-embedding sketch after this list). ViT has shown competitive performance on image classification tasks.
DeiT (Data-efficient Image Transformer) - A transformer-based model for image classification that leverages knowledge distillation (a dedicated distillation token that learns from a teacher network) to achieve strong performance without large-scale pre-training data.
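To illustrate ViT's patch-based input, a minimal patch-embedding sketch in PyTorch; the patch size, embedding width, and image size are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Each non-overlapping 16x16 patch becomes one "word" of the sequence.
# A strided convolution computes all patch embeddings in one pass.
patch_size, d_model = 16, 192
to_patches = nn.Conv2d(3, d_model, kernel_size=patch_size, stride=patch_size)

image = torch.randn(1, 3, 224, 224)          # one dummy RGB image
patches = to_patches(image)                  # (1, 192, 14, 14): one vector per patch
tokens = patches.flatten(2).transpose(1, 2)  # (1, 196, 192): a sequence of 196 tokens
print(tokens.shape)
# These tokens (plus positional embeddings and a class token)
# are then fed to a standard transformer encoder.
```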