Empirical Setup

Introduction

We chose three domains for our experiments: computer vision (CV), natural language processing (NLP), and automatic speech recognition (ASR). In total, 14 commonly used models and 7 datasets were selected. The models and datasets are described in detail below.

Model Introduction

AlexNet

AlexNet consists of 5 convolutional layers, 3 pooling layers, and 3 fully connected layers. It is similar in structure to LeNet, but uses more convolutional layers and a much larger parameter space to fit the large-scale ImageNet dataset. It is often regarded as the dividing line between shallow and deep neural networks. Its structure diagram is as follows:

Figure: AlexNet structure diagram

LeNet-5

LeNet-5 is a relatively simple convolutional neural network. As the figure below shows, a single-channel 2D input image passes through two rounds of convolution and pooling, then the fully connected layers, and finally the output layer. The full sequence is: input layer -> convolutional layer -> pooling layer -> activation function -> convolutional layer -> pooling layer -> activation function -> convolutional layer -> fully connected layer -> fully connected layer -> output layer.

The entire LeNet-5 network includes a total of 7 layers (excluding the input layer), namely: C1, S2, C3, S4, C5, F6, OUTPUT. Its structure diagram is as follows:

Figure: LeNet-5 structure diagram
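As a rough illustration of how the feature maps shrink through the layers C1 to C5 listed above, the sketch below traces the spatial size after each layer. The 32x32 input, the 5x5 convolution kernels, and the 2x2 pooling windows follow the original LeNet-5 design and are assumptions here, since this page does not state them. The helper names are hypothetical.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output width/height of a convolution over a square input."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, window):
    """Output width/height of non-overlapping pooling."""
    return size // window


def lenet5_sizes(input_size=32):
    """Trace the spatial size after C1, S2, C3, S4, C5.

    Assumes 5x5 convolutions without padding and 2x2 pooling.
    """
    sizes = {}
    s = input_size
    for name in ["C1", "S2", "C3", "S4", "C5"]:
        # "C" layers are convolutions, "S" layers are pooling (subsampling)
        s = conv_out(s, 5) if name.startswith("C") else pool_out(s, 2)
        sizes[name] = s
    return sizes


if __name__ == "__main__":
    for layer, size in lenet5_sizes().items():
        print(f"{layer}: {size}x{size}")
```

Under these assumptions, C5 reduces each feature map to a single value, which is why it can feed directly into the fully connected layer F6.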

ResNet-50

ResNet is widely used as a feature extractor in many applications. In theory, a deeper network has greater representational capacity. In practice, however, once a CNN exceeds a certain depth, it converges more slowly and its accuracy degrades. Enlarging the dataset to counter overfitting does not fix this: classification performance and accuracy still fail to improve. Kaiming He et al. showed that residual connections address this problem.

ResNet-50 is organized into 4 stages, which contain 3, 4, 6, and 3 bottleneck blocks, respectively.
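The "50" in the name can be recovered from these block counts. The sketch below assumes the standard ResNet layout, which this page does not spell out: each bottleneck stacks three convolutions (1x1, 3x3, 1x1), and the network adds one stem convolution at the input and one fully connected layer at the output. The function name is hypothetical.

```python
# Bottleneck counts per stage, as stated in the text.
BOTTLENECKS_PER_STAGE = [3, 4, 6, 3]

# Assumption (standard ResNet design, not stated on this page):
# each bottleneck stacks three convolutions (1x1, 3x3, 1x1).
CONVS_PER_BOTTLENECK = 3


def resnet_depth(bottlenecks_per_stage, convs_per_block=CONVS_PER_BOTTLENECK):
    """Count weighted layers: stem conv + all block convs + final FC layer."""
    stem_conv = 1
    final_fc = 1
    block_convs = sum(bottlenecks_per_stage) * convs_per_block
    return stem_conv + block_convs + final_fc


if __name__ == "__main__":
    print(resnet_depth(BOTTLENECKS_PER_STAGE))  # 50
```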

DenseNet

DenseNet adopts a dense connection mechanism: within a dense block, every layer is connected to all preceding layers. Each layer concatenates the feature maps of the previous layers along the channel dimension and uses the result as its input, which enables feature reuse. This design not only alleviates the vanishing gradient problem, but also achieves better performance than ResNet with fewer parameters and less computation.

Figure: DenseNet structure diagram
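To make the channel-wise concatenation concrete, the following sketch tracks how many input channels each layer of a dense block receives. The starting channel count, the growth rate (the number of new feature maps each layer emits), and the block length are illustrative assumptions, and the function name is hypothetical.

```python
def dense_block_channels(in_channels, growth_rate, num_layers):
    """Return (input channels seen by each layer, channels leaving the block)."""
    inputs_per_layer = []
    channels = in_channels
    for _ in range(num_layers):
        inputs_per_layer.append(channels)
        # Each layer emits `growth_rate` new feature maps, which are
        # concatenated onto everything produced so far.
        channels += growth_rate
    return inputs_per_layer, channels


if __name__ == "__main__":
    # Illustrative values only: 64 input channels, growth rate 32, 6 layers.
    per_layer, out_channels = dense_block_channels(64, 32, 6)
    print(per_layer)     # [64, 96, 128, 160, 192, 224]
    print(out_channels)  # 256
```

Because each layer only has to produce a small number of new feature maps, the per-layer cost stays low even though later layers see many input channels.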

MobileNet

The basic unit of MobileNet is the depthwise separable convolution, which factorizes a standard convolution into two smaller operations: a depthwise convolution and a pointwise convolution.

Depthwise convolution differs from standard convolution. In a standard convolution, each kernel spans all input channels, whereas a depthwise convolution applies a separate kernel to each input channel, so that one kernel corresponds to one input channel. It therefore consists of M kernels of size n*n*1, where M is the depth (number of channels) of the input.

The pointwise convolution consists of N kernels of size 1*1*M, where N is the depth of the output.
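The kernel shapes above translate directly into weight counts. The sketch below compares a standard n*n convolution with its depthwise separable counterpart, ignoring biases. The concrete values of n, M, and N are illustrative assumptions, and the function names are hypothetical.

```python
def standard_conv_weights(n, M, N):
    """N kernels of size n*n*M."""
    return n * n * M * N


def depthwise_separable_weights(n, M, N):
    """Depthwise (M kernels of n*n*1) plus pointwise (N kernels of 1*1*M)."""
    depthwise = n * n * 1 * M
    pointwise = 1 * 1 * M * N
    return depthwise + pointwise


if __name__ == "__main__":
    n, M, N = 3, 32, 64  # illustrative sizes, not taken from the text
    std = standard_conv_weights(n, M, N)
    sep = depthwise_separable_weights(n, M, N)
    print(std, sep)   # 18432 2336
    print(sep / std)  # equals 1/N + 1/n**2 up to rounding
```

The ratio of the two counts simplifies to 1/N + 1/n^2, so for 3*3 kernels the separable form needs roughly eight to nine times fewer weights.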

InceptionV3

One of the most important improvements in InceptionV3 is convolution factorization: a 7x7 convolution is decomposed into two one-dimensional convolutions (1x7 and 7x1), and a 3x3 convolution likewise (1x3 and 3x1). This not only speeds up computation; splitting one convolution into two also increases the depth of the network and adds nonlinearity. In addition, the network input size changes from 224x224 to 299x299, and the 35x35, 17x17, and 8x8 modules are designed in more detail.

Its structure diagram is as follows:

Figure: InceptionV3 structure diagram
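The saving from the factorization described above can be checked with a short weight count. This sketch assumes that the number of channels stays the same through both one-dimensional convolutions and ignores biases. The function names are hypothetical.

```python
def conv_weights(kh, kw, c_in, c_out):
    """Weights in a convolution with a kh x kw kernel."""
    return kh * kw * c_in * c_out


def factorized_weights(k, c):
    """A 1xk convolution followed by a kx1 convolution, c channels throughout."""
    return conv_weights(1, k, c, c) + conv_weights(k, 1, c, c)


if __name__ == "__main__":
    for k in (7, 3):
        full = conv_weights(k, k, 1, 1)
        split = factorized_weights(k, 1)
        print(f"{k}x{k}: {full} weights, factorized: {split} weights")
```

Per input/output channel pair, the 7x7 case drops from 49 weights to 14 and the 3x3 case from 9 to 6, which is where the speed-up comes from.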

VGG

VGGNet is a deep convolutional neural network developed by researchers at the University of Oxford's Visual Geometry Group and Google DeepMind. VGG mainly explores the relationship between the depth of a convolutional neural network and its performance. By repeatedly stacking small 3*3 convolution kernels and 2*2 max-pooling layers, VGGNet builds deep convolutional networks with 16 to 19 layers. Compared with the previous state-of-the-art architectures, it reduces the error rate substantially. VGG also generalizes well, performing strongly on a range of image datasets, and it is still often used to extract image features.
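One commonly cited reason for stacking small kernels is that several 3*3 convolutions cover the same receptive field as one larger kernel while using fewer weights. The sketch below illustrates this under assumptions not stated on this page: stride-1 convolutions, a constant channel count, and no biases. The function names and the channel count are hypothetical.

```python
def stacked_receptive_field(num_layers, kernel=3):
    """Receptive field of `num_layers` stride-1 convolutions applied in sequence."""
    rf = 1
    for _ in range(num_layers):
        rf += kernel - 1
    return rf


def stack_weights(num_layers, channels, kernel=3):
    """Weights in a stack of kernel x kernel convolutions with constant channels."""
    return num_layers * kernel * kernel * channels * channels


if __name__ == "__main__":
    channels = 64  # illustrative value
    print(stacked_receptive_field(2))  # 5, same as one 5x5 kernel
    print(stacked_receptive_field(3))  # 7, same as one 7x7 kernel
    print(stack_weights(3, channels), stack_weights(1, channels, kernel=7))
```

With 64 channels, three stacked 3*3 layers use 110,592 weights versus 200,704 for a single 7*7 layer, and they insert two extra nonlinearities along the way.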

Xception

Xception is an improvement on InceptionV3 proposed by Google. Its main change is to replace the multi-size convolution operations in the original InceptionV3 modules with depthwise separable convolutions.

Figure: Xception structure diagram

LSTM

LSTMs are widely used for sequential tasks (including natural gas load forecasting, stock market forecasting, language modeling, and machine translation) and often outperform other sequential models (e.g. vanilla RNNs), especially when large amounts of data are available. LSTMs are carefully designed to avoid the vanishing gradient problem of RNNs, whose main practical consequence is that the model cannot learn long-term dependencies. By avoiding this problem, LSTMs can retain information over much longer spans (hundreds of time steps) than regular RNNs. Compared with RNNs, which maintain only a single hidden state, LSTMs have more parameters, and these give them finer control over which memories are kept and which are discarded at each time step. In a regular RNN, by contrast, the hidden state must be updated at every time step, so the network cannot decide which memories to keep and which to discard.

Its structure diagram is as follows:

Figure: LSTM structure diagram
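The gating behavior described above can be sketched with a scalar LSTM cell, following the standard LSTM update equations rather than any implementation from this page. The function name, the weight layout, and the numeric values are hypothetical and chosen only to show a forget gate that keeps the old memory and an input gate that blocks new content.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, w):
    """One time step of a scalar LSTM cell.

    `w` maps a gate name to a tuple (w_x, w_h, bias).
    Returns the new hidden state h and the new cell state c.
    """
    def pre(gate):
        w_x, w_h, b = w[gate]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("forget"))        # how much of the old memory to keep
    i = sigmoid(pre("input"))         # how much new content to write
    o = sigmoid(pre("output"))        # how much of the memory to expose
    g = math.tanh(pre("candidate"))   # candidate content to write
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c


if __name__ == "__main__":
    # Forget gate saturated open, input gate saturated shut:
    # the cell state is carried over unchanged.
    keep = {
        "forget": (0.0, 0.0, 100.0),
        "input": (0.0, 0.0, -100.0),
        "output": (0.0, 0.0, 0.0),
        "candidate": (0.0, 0.0, 0.0),
    }
    h, c = lstm_step(x=1.0, h_prev=0.5, c_prev=2.0, w=keep)
    print(c)
```

A plain RNN has no counterpart to the `f` and `i` gates, which is why it overwrites its hidden state at every step.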

Wav2Vec 2.0

Wav2Vec 2.0 learns audio representations through self-supervised learning; the learned representations are then used for downstream tasks such as speech recognition.

Its structure diagram is as follows:

Figure: Wav2Vec 2.0 structure diagram

ContextNet

ContextNet borrows the squeeze-and-excitation mechanism from SENet, which plays a role similar to self-attention by extracting features from the global context.

ContextNet consists of 22 convolutional blocks.

Figure: ContextNet structure diagram

Dataset Introduction

CIFAR-10

Classification on the CIFAR-10 dataset is an open benchmark problem in machine learning. The objective of the task is to classify a set of 32x32 RGB images into 10 categories.

The CIFAR-10 dataset consists of 60,000 32x32 color images covering 10 categories (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck), with 6,000 images per category. Generally, 50,000 images are used for training and 10,000 images are used for testing.

Fashion-MNIST

Fashion-MNIST is an image dataset intended as a replacement for the MNIST handwritten digit dataset. It is provided by the research division of Zalando, a German fashion technology company. It contains a total of 70,000 front-view product images from 10 categories.

The images are 28x28 grayscale and are divided into 60,000 training examples and 10,000 test examples.

MNIST

The MNIST database of handwritten digits has a training set of 60,000 examples, and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image.

It is a good database for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal efforts on preprocessing and formatting.

IMDB

The IMDB dataset contains 50,000 heavily polarized reviews from the Internet Movie Database (IMDB). The dataset is divided into 25,000 reviews for training and 25,000 reviews for testing, and both the training and testing sets contain 50% positive reviews and 50% negative reviews.

Both train_labels and test_labels are lists of 0s and 1s, where 0 denotes a negative review and 1 denotes a positive review.

Reuters

The Reuters dataset contains many short news stories and their corresponding topics, published by Reuters in 1986. It is a simple, widely used dataset for text classification. It covers 46 different topics; some topics have more samples than others, but each topic has at least 10 samples in the training set.

There are 8,982 training samples and 2,246 test samples.
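For reference, the split sizes stated for the datasets above can be collected into a small lookup table and checked against the totals given in the text. The dictionary and function names are hypothetical; only the numbers come from the descriptions above.

```python
# (train, test) sizes as stated in the dataset descriptions above.
SPLITS = {
    "CIFAR-10": (50_000, 10_000),
    "Fashion-MNIST": (60_000, 10_000),
    "MNIST": (60_000, 10_000),
    "IMDB": (25_000, 25_000),
    "Reuters": (8_982, 2_246),
}


def total_size(name):
    """Total number of samples across both splits."""
    train, test = SPLITS[name]
    return train + test


def held_out_fraction(name):
    """Share of the samples reserved for testing."""
    train, test = SPLITS[name]
    return test / (train + test)


if __name__ == "__main__":
    for name in SPLITS:
        print(f"{name}: {total_size(name)} samples, "
              f"{held_out_fraction(name):.1%} held out")
```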

Common Voice

Common Voice covers 18 different languages (including English, French, German, Mandarin, Welsh, and Kabyle). In total, it contains about 1,00 hours of recorded voice clips from over 42,000 contributors.
