torch.nn.CTCLoss(blank=0, reduction='mean', zero_infinity=False)
Parameters:
blank. The blank label uses index 0 by default; it represents the CTC blank inserted between characters. Indices 1 and above are the actual character labels.
reduction. One of 'none' | 'mean' | 'sum'. With 'mean', each output loss is divided by its target length and the results are averaged over the batch; 'sum' adds them up; 'none' returns the per-sample losses.
zero_infinity. Sometimes the computed CTC loss is infinite, with an infinite gradient. This typically happens when the input sequence is not much longer than the target, so a valid alignment is hard or impossible to find.
In the sample script below, set the input length T = 35 and leave the maximum target length at 30. Run the script in a loop a few hundred times and it will eventually produce an infinite loss.
Setting zero_infinity=True resets the infinite loss (and its gradient) to zero so training can carry on.
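A minimal sketch of the effect (not from the sample script below): the target here is deliberately longer than the input, so no alignment exists and the loss is infinite; zero_infinity=True zeroes it out.

import torch
import torch.nn as nn

T, N, C = 5, 1, 20   # input deliberately shorter than the target
log_probs = torch.randn(T, N, C).log_softmax(2).detach().requires_grad_()
target = torch.randint(low=1, high=C, size=(N, 10), dtype=torch.long)
input_lengths = torch.full(size=(N,), fill_value=T, dtype=torch.long)
target_lengths = torch.full(size=(N,), fill_value=10, dtype=torch.long)

print(nn.CTCLoss()(log_probs, target, input_lengths, target_lengths))                    # prints an infinite loss
print(nn.CTCLoss(zero_infinity=True)(log_probs, target, input_lengths, target_lengths))  # prints a zero loss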
Input:
Log_probs:
Tensor of size (T, N, C)
T = input sequence length, N = batch size, and C = number of classes (including blank).
The raw network output needs softmax applied first to obtain probabilities, followed by a logarithm (working with log-probabilities keeps the computation numerically stable).
torch.nn.functional.log_softmax() does both the softmax and the log in one go (a short sketch follows below).
Note that the input sequences may have different lengths. They are padded to the same length; the input_lengths parameter is used later to cut off the padding.
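A hedged sketch of preparing Log_probs, assuming the model emits raw logits in (batch, time, classes) order, which is common but not required; the names logits and log_probs are just illustrative.

import torch
import torch.nn.functional as F

N, T, C = 16, 50, 20
logits = torch.randn(N, T, C)             # stand-in for the raw network output
log_probs = F.log_softmax(logits, dim=2)  # softmax + log in one numerically stable call
log_probs = log_probs.permute(1, 0, 2)    # reorder to (T, N, C), the layout CTCLoss expects
print(log_probs.shape)                    # torch.Size([50, 16, 20])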
Targets:
Tensor of size (N, S)
N = batch size and S = max target sequence length. Target sequences are padded to the length of the longest one and stacked; the target_lengths parameter is used to cut off the padded part. Each element in a target sequence is a class index, and it cannot be the blank index (default 0).
Note that the target sequences do not have to be of the same length.
So another way to pass targets is as a 1-dimensional tensor of size sum(target_lengths), which is simply all target sequences concatenated into one 1-D tensor.
The target_lengths parameter then tells CTCLoss how to split that tensor back into sequences of various lengths (both layouts are sketched below).
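A sketch of the two accepted target layouts, using two made-up label sequences (the values are illustrative; 0 is reserved for the blank).

import torch

seq_a = torch.tensor([3, 7, 2], dtype=torch.long)        # length 3
seq_b = torch.tensor([5, 1, 1, 4, 9], dtype=torch.long)  # length 5
target_lengths = torch.tensor([3, 5], dtype=torch.long)

# Layout 1: padded to S = 5 (longest sequence) and stacked, size (N, S)
padded = torch.zeros(2, 5, dtype=torch.long)   # the padding value is ignored thanks to target_lengths
padded[0, :3] = seq_a
padded[1, :5] = seq_b

# Layout 2: concatenated into one 1-D tensor of size sum(target_lengths) = 8
concatenated = torch.cat([seq_a, seq_b])

# Either layout works; target_lengths tells CTCLoss where each sequence ends.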
Input_lengths:
Tensor of size (N)
The length of each input sequence. It must be <= T. It will be used for masking under the assumption that input sequences are padded to equal lengths.
e.g. if a shorter sequence is padded at the end, the specified length lets CTCLoss cut off the padding (sketched below).
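A hedged sketch of variable input lengths, assuming the inputs were padded at the end to a common T; the lengths used here are just examples.

import torch
import torch.nn as nn

T, N, C = 50, 3, 20
log_probs = torch.randn(T, N, C).log_softmax(2).detach().requires_grad_()
input_lengths = torch.tensor([50, 42, 37], dtype=torch.long)   # true length of each input, each <= T
targets = torch.randint(low=1, high=C, size=(N, 10), dtype=torch.long)
target_lengths = torch.full(size=(N,), fill_value=10, dtype=torch.long)

loss = nn.CTCLoss()(log_probs, targets, input_lengths, target_lengths)  # padded frames beyond each length are ignored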
Target_lengths:
Tensor of size (N)
The length of each target sequence. If targets is of size (N, S) then this parameter helps to cut off the padding.
If targets is a 1-d concatenated tensor, then this parameter helps to split the concatenated tensor back into the original sequences.
Output:
A scalar loss when reduction is 'mean' or 'sum'; a tensor of size (N,) with one loss per batch element when reduction='none'.
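For example, a small sketch (random data, arbitrary values) showing that reduction='none' returns one loss per batch element:

import torch
import torch.nn as nn

T, N, C = 50, 4, 20
log_probs = torch.randn(T, N, C).log_softmax(2)
targets = torch.randint(low=1, high=C, size=(N, 10), dtype=torch.long)
input_lengths = torch.full(size=(N,), fill_value=T, dtype=torch.long)
target_lengths = torch.full(size=(N,), fill_value=10, dtype=torch.long)

per_sample = nn.CTCLoss(reduction='none')(log_probs, targets, input_lengths, target_lengths)
print(per_sample.shape)   # torch.Size([4])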
Sample script
import torch
import torch.nn as nn

T = 50 # Input sequence length
C = 20 # Number of classes (including blank)
N = 16 # Batch size
S = 30 # Target sequence length of longest target in batch
S_min = 10 # Minimum target length, for demonstration purposes
# Initialize random batch of input vectors, for *size = (T,N,C)
# log_softmax(2) applies on dim = 2 which is the 'C' dimension
input = torch.randn(T, N, C).log_softmax(2).detach().requires_grad_()
# Initialize random batch of targets (0 = blank, 1:C = classes)
# random class value for target
target = torch.randint(low=1, high=C, size=(N, S), dtype=torch.long)
# fill tensor of (N,) with value T, so all the same length here
input_lengths = torch.full(size=(N,), fill_value=T, dtype=torch.long)
# random length between S_min and S-1 for each target sequence (high is exclusive)
target_lengths = torch.randint(low=S_min, high=S, size=(N,), dtype=torch.long)
ctc_loss = nn.CTCLoss()
loss = ctc_loss(input, target, input_lengths, target_lengths)
loss.backward()
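For comparison, a hedged variant of the same script that passes the targets in the concatenated 1-D form instead of the padded (N, S) form; variable names follow the script above.

import torch
import torch.nn as nn

T = 50      # Input sequence length
C = 20      # Number of classes (including blank)
N = 16      # Batch size
S = 30      # Maximum target length
S_min = 10  # Minimum target length

input = torch.randn(T, N, C).log_softmax(2).detach().requires_grad_()
input_lengths = torch.full(size=(N,), fill_value=T, dtype=torch.long)
target_lengths = torch.randint(low=S_min, high=S, size=(N,), dtype=torch.long)
# one flat tensor holding sum(target_lengths) labels; target_lengths marks the boundaries
target = torch.randint(low=1, high=C, size=(int(target_lengths.sum()),), dtype=torch.long)

ctc_loss = nn.CTCLoss()
loss = ctc_loss(input, target, input_lengths, target_lengths)
loss.backward()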