As stated in the paper, we discovered 26 bugs across 5 DL frameworks (i.e., Keras, TensorFlow, CNTK, Theano, and PyTorch), including 13 crashes, 8 NaN bugs, and 5 inconsistencies. We detail the 26 bugs as follows. Note that we count a crash or NaN as separate bugs if it occurs on different frameworks, even with the same parameters. In other words, we count such a bug multiple times, once for each affected framework.
Conv2D(kernel_size=0) on TensorFlow Confirmed and Fixed
TensorFlow lacks the necessary check for the abnormal parameter kernel_size=0.
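A minimal repro sketch (assuming Keras with the TensorFlow backend; the filter count and input shape below are arbitrary illustrative values):

```python
# Repro sketch: kernel_size=0 is an abnormal value that is not rejected with a
# clear validation error; building such a model is expected to crash on
# affected TensorFlow versions.
from keras.models import Sequential
from keras.layers import Conv2D

model = Sequential()
model.add(Conv2D(filters=8, kernel_size=0, input_shape=(28, 28, 3)))  # abnormal kernel_size
model.compile(optimizer='sgd', loss='mse')
```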
Embedding(input_dim=0) on TensorFlow Confirmed and Fixed
TensorFlow lacks the necessary check for the abnormal parameter input_dim=0.
Dense(units=0) on TensorFlow
TensorFlow lacks the necessary check for the abnormal parameter units=0.
Conv2D/DepthwiseConv2D(dilation_rate=0) on Theano Confirmed
Theano lacks the necessary check for the abnormal parameter dilation_rate=0.
MaxPooling(pool_size=0) on CNTK
CNTK lacks the necessary check for the abnormal parameter pool_size=0.
Some APIs on CNTK and Theano do not support input in the float16 dtype (e.g., Conv2D and SpatialDropout2D), even though float16 is listed among the dtype options for the two frameworks. However, there are no relevant hints or warnings in the Keras documentation about the compatibility issues of the float16 dtype across different backend operations.
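A minimal sketch of the compatibility issue (assuming Keras with CNTK or Theano as the backend; the layer and shapes are illustrative):

```python
# float16 is listed among the available dtypes, but some backend ops
# (e.g., the convolution behind Conv2D) may fail for float16 inputs on
# CNTK/Theano, with no warning in the Keras documentation.
import numpy as np
from keras import backend as K
from keras.models import Sequential
from keras.layers import Conv2D

K.set_floatx('float16')
model = Sequential()
model.add(Conv2D(filters=4, kernel_size=3, input_shape=(8, 8, 3)))
model.compile(optimizer='sgd', loss='mse')
out = model.predict(np.ones((1, 8, 8, 3), dtype=np.float16))  # may raise a dtype error on CNTK/Theano
```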
unroll support issue for GRU on Keras Confirmed
Keras has a support issue with GRU(unroll=True) on the CNTK backend. In fact, CNTK does not support the unroll feature in its source code.
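A minimal repro sketch (assuming Keras with the CNTK backend; the units, timesteps, and feature size are illustrative):

```python
# unroll=True is accepted by the Keras GRU layer, but the CNTK backend does
# not implement unrolling, so building the model is expected to fail.
from keras.models import Sequential
from keras.layers import GRU

model = Sequential()
model.add(GRU(units=8, unroll=True, input_shape=(5, 3)))  # unroll unsupported on CNTK
model.compile(optimizer='sgd', loss='mse')
```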
GRU on Keras
Keras fails to build a GRU model in some cases when using TensorFlow as the backend, and it returns a variable-multiplication error, as shown below.
RuntimeError('Variable *= value not supported. Use var.assign(var * value) to modify the variable or var = var * value to get a new Tensor object.')
BatchNormalization on Keras
Keras cannot build a model with the layer BatchNormalization under certain parameters when using CNTK as the backend, and it returns an error about dynamic axes, as shown below.
ValueError: AssignNode: None of the operands 'Parameter('batch_normalization_1/moving_mean', [], [1]), Output('Plus275_Output_0', [#], [1])' can have dynamic axes .
Conv2DTranspose(dilation_rate=(1,2)) on Keras
Keras cannot build a model when using the layer Conv2DTranspose with the parameter dilation_rate=(1,2) on the TensorFlow backend. It seems that Keras does not support dilation_rate=(x,y) where x is not equal to y when using TensorFlow as the backend.
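A minimal repro sketch (assuming Keras with the TensorFlow backend; the filters, kernel size, and input shape are illustrative):

```python
# An asymmetric dilation_rate such as (1, 2) on Conv2DTranspose is expected to
# fail at model-build time on the TensorFlow backend.
from keras.models import Sequential
from keras.layers import Conv2DTranspose

model = Sequential()
model.add(Conv2DTranspose(filters=4, kernel_size=3, dilation_rate=(1, 2),
                          input_shape=(8, 8, 3)))  # asymmetric dilation rate
model.compile(optimizer='sgd', loss='mse')
```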
ConvLSTM2D on Keras
Keras fails to build the model with the layer ConvLSTM2D in some cases on the CNTK backend because of a shape-mismatch error. Other layers like Conv2D also suffer from the same issue on the CNTK backend.
BatchNormalization on TensorFlow and Theano Confirmed
When executing BatchNormalization, TensorFlow and Theano return NaN in some cases. This is because both frameworks lack the necessary check for negative values before calculating the square root in BatchNormalization.
TensorFlow has confirmed this bug, and the relevant source code for Theano is shown in the following picture.
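Independently of the framework code, the root cause can be illustrated with a minimal NumPy sketch (the negative epsilon below is a hypothetical abnormal parameter, standing in for any negative value that reaches the square root unchecked):

```python
import numpy as np

# Inference-time batch normalization: y = gamma * (x - mean) / sqrt(var + eps) + beta
x = np.array([0.5, 1.0, 1.5], dtype=np.float32)
gamma, beta = np.float32(1.0), np.float32(0.0)
moving_mean = np.float32(1.0)
moving_variance = np.float32(1e-4)
epsilon = np.float32(-1e-2)                  # abnormal negative value, not rejected

std = np.sqrt(moving_variance + epsilon)     # sqrt of a negative number -> nan
y = gamma * (x - moving_mean) / std + beta   # NaN propagates to the whole output
print(std, y)                                # nan [nan nan nan]
```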
ReLU(threshold=None) on TensorFlow Confirmed and Fixed
TensorFlow lacks the necessary exception check for floating-point parameters. In the case of abnormal values like None, it directly converts them to NaN, which affects the subsequent calculation. TensorFlow has confirmed and fixed this issue as follows.
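Before the fix, the behavior could be reproduced with a minimal sketch like this (assuming Keras with an affected TensorFlow backend; the input shape is illustrative):

```python
# threshold=None is silently cast to NaN instead of being rejected, so the
# layer output becomes NaN on affected versions.
import numpy as np
from keras.models import Sequential
from keras.layers import ReLU

model = Sequential()
model.add(ReLU(threshold=None, input_shape=(4,)))  # abnormal parameter value
model.compile(optimizer='sgd', loss='mse')
print(model.predict(np.ones((1, 4), dtype=np.float32)))  # expected to contain NaN
```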
LeakyReLU(alpha=None) on TensorFlow Confirmed
Similar to the ReLU bug mentioned above: TensorFlow directly converts the None value to NaN for the floating-point parameter alpha in LeakyReLU.
AvgPool2d on PyTorch Confirmed
With ceil_mode=True for AvgPool2d, PyTorch may place pooling windows outside the image in some cases, which leads to a division by zero and finally triggers NaN output.
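A minimal sketch (PyTorch; the kernel size, stride, padding, and input shape below are illustrative assumptions, not the exact configuration reported to the developers):

```python
# With ceil_mode=True, the last pooling window can extend past the input; on
# affected versions the divisor for such a window can become zero (e.g., with
# count_include_pad=False), yielding NaN instead of an error.
import torch
import torch.nn as nn

pool = nn.AvgPool2d(kernel_size=3, stride=2, padding=1,
                    ceil_mode=True, count_include_pad=False)
x = torch.ones(1, 1, 6, 6)
print(pool(x))  # may contain NaN entries on affected PyTorch versions
```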
Dense/Conv/LSTM/GRU/SimpleRNN on TensorFlow, CNTK, and Theano
Using the exponential activation in some APIs (e.g., Conv2D, Dense) can easily lead to infinite outputs and trigger NaN when the infinity values are used in further calculations. Figure 1 shows a NaN example caused by the layer SimpleRNN with the parameter activation=exp on TensorFlow.
Figure 1. NaN example on SimpleRNN with the parameter activation=exp on TensorFlow
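The mechanism can be illustrated with a small NumPy sketch (not a framework repro; the pre-activation values are arbitrary):

```python
# In float32, exp overflows to inf for moderately large pre-activations, and
# inf values then turn into NaN in common follow-up computations.
import numpy as np

pre_activation = np.array([10.0, 100.0], dtype=np.float32)
out = np.exp(pre_activation)   # exp(100) overflows float32 -> inf
print(out)                     # the second entry is inf
print(out - out)               # inf - inf -> nan
print(out / out.sum())         # inf / inf -> nan (e.g., a softmax-like step)
```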
Figure 2. Keras implementation of dilation_rate for SeparableConv2D/DepthwiseConv2D on CNTK backend
ThresholdedReLU on Keras and TensorFlow Confirmed
Using ThresholdedReLU(theta=None) on Keras, different backends return inconsistent results. However, there are no relevant hints or warnings in the Keras documentation about this backend inconsistency.
TensorFlow has confirmed this bug, which is similar to the ReLU case mentioned above.
DepthwiseConv2D on CNTK
Using DepthwiseConv2D(kernel_size=2, strides=1) on the CNTK backend triggers results that are significantly inconsistent with the other backends. This is because CNTK only takes the first channel of the input into the calculation, due to a limitation of the underlying MKL library in the case of asymmetric padding, as shown in the paper.
Different Padding Implementations
Under the parameter padding='same', different padding implementations trigger inconsistent results across frameworks. This is a common problem in multiple APIs that support padding, such as MaxPooling2D, AveragePooling2D, Conv2D, and DepthwiseConv2D.
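A small NumPy sketch of why the results can diverge (an illustration of the padding conventions, not the frameworks' actual code): for a 3x3 input with pool_size=2 and strides=2, 'same' padding needs one extra row and column, and backends disagree on which side receives it.

```python
import numpy as np

x = np.arange(9, dtype=np.float32).reshape(3, 3)

# One extra padded row/column is required; conventions differ on its side.
pad_bottom_right = np.pad(x, ((0, 1), (0, 1)))  # e.g., TensorFlow-style SAME padding
pad_top_left = np.pad(x, ((1, 0), (1, 0)))      # e.g., a top/left padding convention

def max_pool_2x2(a):
    # 2x2 max pooling with stride 2 on a 4x4 array
    return a.reshape(2, 2, 2, 2).max(axis=(1, 3))

print(max_pool_2x2(pad_bottom_right))  # [[4. 5.] [7. 8.]]
print(max_pool_2x2(pad_top_left))      # [[0. 2.] [6. 8.]]
```

Both results are valid 2x2 pooling outputs, but the windows cover different input elements, so the frameworks' outputs differ element-wise.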
SeparableConv2D/DepthwiseConv2D on Keras
Keras simply assigns the value of dilation_rate to strides on the CNTK backend, resulting in output that is inconsistent with TensorFlow. This is because CNTK does not implement dilated convolution, and Keras compromises by simply replacing strides with the dilation_rate value, as shown in Figure 2. As a result, the kernel is applied to different input elements (pixels) than in the actual dilation operation.