Neural Machine Translation By Jointly Learning To Align And Translate

Paper

Dzmitry Bahdanau, KyungHyun Cho, Yoshua Bengio
Jacobs University Bremen, Université de Montréal
ICLR’15

Link to Paper

Motivation

Challenge

Neural Machine Translation

  • A family of encoder-decoder models that encode a source sentence into a fixed-length vector, from which a decoder generates a translation.

[Figure: the basic encoder-decoder framework, which encodes a source sentence into a fixed-length vector]

Issue

  • Compressing all the necessary information of a source sentence into a single fixed-length vector causes useful information to be lost.
  • Translation performance deteriorates rapidly as the length of the input sentence increases.

Solution

Each time the proposed model generates a word in a translation, it (soft-)searches for a set of positions in the source sentence where the most relevant information is concentrated. The model then predicts the target word based on the context vector associated with these source positions and all the previously generated target words (the attention mechanism).
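
As a rough illustration (not the paper's code), one step of this soft search can be sketched in Python; `score`, `rnn_step`, and `readout` below are hypothetical stand-ins for the alignment model, the decoder RNN, and the output layer.

```python
# Minimal sketch of one decoding step with soft search (illustrative only).
import numpy as np

def decode_step(s_prev, y_prev, annotations, score, rnn_step, readout):
    """annotations: (T_x, d) array of encoder annotations h_j."""
    e = np.array([score(s_prev, h_j) for h_j in annotations])  # relevance of each source position
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()                                       # softmax -> attention weights
    c = alpha @ annotations                                    # context vector for this target word
    s = rnn_step(s_prev, y_prev, c)                            # new decoder hidden state
    return readout(y_prev, s, c), s, alpha                     # word distribution, state, weights
```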

Model

The new architecture consists of a bidirectional RNN as an encoder and a decoder that emulates searching through a source sentence while decoding a translation.

[Figure: the proposed model architecture, a BiRNN encoder with an attention-based decoder]

Encoder: Bidirectional RNN

A BiRNN consists of a forward and a backward RNN. The annotation for each word $x_j$ is obtained by concatenating the forward hidden state $h_j^{(f)}$ and the backward one $h_j^{(b)}$, i.e., $h_j = [h_j^{(f)}; h_j^{(b)}]$.

Each annotation $h_j$ contains information about the whole input sequence, with a strong focus on the parts surrounding the $j$-th word of the input sequence.
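
A minimal numpy sketch of how the annotations could be computed, assuming plain tanh RNN cells (the paper uses gated hidden units); the weight names `Wf`, `Uf`, `Wb`, `Ub` are illustrative.

```python
# Sketch: bidirectional encoder producing one annotation per source word.
import numpy as np

def birnn_annotations(X, Wf, Uf, Wb, Ub):
    """X: (T_x, d_in) embedded source words; returns (T_x, 2 * d_hid) annotations."""
    T, d_hid = X.shape[0], Uf.shape[0]
    fwd = np.zeros((T, d_hid))
    bwd = np.zeros((T, d_hid))
    h = np.zeros(d_hid)
    for j in range(T):                          # forward RNN reads x_1 ... x_T
        h = np.tanh(X[j] @ Wf + h @ Uf)
        fwd[j] = h
    h = np.zeros(d_hid)
    for j in reversed(range(T)):                # backward RNN reads x_T ... x_1
        h = np.tanh(X[j] @ Wb + h @ Ub)
        bwd[j] = h
    return np.concatenate([fwd, bwd], axis=1)   # h_j = [h_j^(f); h_j^(b)]
```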

Decoder

Definition of each conditional probability:

$$p(y_i \mid y_1, \ldots, y_{i-1}, \mathbf{x}) = g(y_{i-1}, s_i, c_i)$$

where $s_i$ is an RNN hidden state for time $i$, computed by

$$s_i = f(s_{i-1}, y_{i-1}, c_i)$$

The context vector $c_i$ is computed as a weighted sum of the annotations $h_j$; it is the expected annotation over all the annotations, with probabilities $\alpha_{ij}$.

$$c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j$$

$\alpha_{ij}$ is the probability that the target word $y_i$ is aligned to the source word $x_j$. The weight $\alpha_{ij}$ of each annotation $h_j$ is computed by

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}$$

where

$$e_{ij} = a(s_{i-1}, h_j)$$

is an alignment model which scores how well the inputs around position $j$ and the output at position $i$ match. In the paper, the alignment model $a$ is parametrized as a small feedforward neural network that is jointly trained with all the other components. It directly computes a soft alignment, which allows the gradient of the cost function to be backpropagated through.
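
A minimal sketch of the weight computation, with the alignment model realised as the feedforward network mentioned above; the parameter names `W_a`, `U_a`, `v_a` are illustrative, not taken from a released implementation.

```python
# Sketch: alignment scores, softmax weights, and the context vector for one step.
import numpy as np

def attention(s_prev, annotations, W_a, U_a, v_a):
    """s_prev: (d_s,) previous decoder state; annotations: (T_x, 2 * d_hid)."""
    e = np.tanh(s_prev @ W_a + annotations @ U_a) @ v_a   # e_ij = a(s_{i-1}, h_j)
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()                                  # softmax over source positions
    c = alpha @ annotations                               # expected annotation c_i
    return c, alpha
```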

Advantage

  • The decoder decides which parts of the source sentence to pay attention to.
  • Instead of compressing all the information in the source sentence into a fixed-length vector, the information can be spread throughout the sequence of annotations, which the decoder can selectively retrieve.

Experiment

Dataset

  • WMT’14 contains the following English-French parallel corpora: Europarl (61M words), news commentary (5.5M), UN (421M), and two crawled corpora of 90M and 272.5M words respectively, totaling 850M words.
  • The combined corpus is reduced to 348M words using a data selection method.
  • A shortlist of the 30,000 most frequent words in each language is used to train the models; any word not included in the shortlist is mapped to a special token ([UNK]) (see the sketch below).
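
A minimal sketch of the shortlist preprocessing described in the last bullet; the `[UNK]` token is from the notes, while the function name and structure are only illustrative.

```python
# Sketch: keep the 30,000 most frequent words; map everything else to [UNK].
from collections import Counter

def apply_shortlist(tokenized_sentences, size=30000, unk="[UNK]"):
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    shortlist = {w for w, _ in counts.most_common(size)}
    return [[tok if tok in shortlist else unk for tok in sent]
            for sent in tokenized_sentences]
```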

Model

Two types

  • An RNN Encoder-Decoder
    • 1000 hidden units
  • The proposed model RNNsearch
    • the forward and backward RNNs each have 1000 hidden units
    • the decoder has 1000 hidden units

Two training setups (both summarised in the sketch after this list)

  • Trained on sentences of length up to 30 words (RNNencdec-30, RNNsearch-30)
  • Trained on sentences of length up to 50 words (RNNencdec-50, RNNsearch-50)
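
For reference, the four trained models can be summarised as a plain configuration dictionary (a sketch; only the hidden-unit counts and length cut-offs come from the notes above).

```python
# Sketch: the four model configurations compared in the experiments.
CONFIGS = {
    "RNNencdec-30": {"attention": False, "hidden_units": 1000, "max_train_len": 30},
    "RNNencdec-50": {"attention": False, "hidden_units": 1000, "max_train_len": 50},
    "RNNsearch-30": {"attention": True,  "hidden_units": 1000, "max_train_len": 30},
    "RNNsearch-50": {"attention": True,  "hidden_units": 1000, "max_train_len": 50},
}
```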

Result

[Figures: BLEU score comparison of the RNNencdec and RNNsearch models]

Analysis

Alignment

[Figures: soft-alignment (attention weight) matrices between source and target words]

  • Soft-alignment lets the model attend to more than one source word when generating each target word.
  • Soft-alignment can deal with source and target phrases of different lengths, without requiring a counter-intuitive way of mapping some words to or from nowhere ([NULL]).

Long sentences

The source sentence from the test set

[Source sentence]

The RNNencdec-50

[RNNencdec-50 translation]

It replaced [based on his status as a health care worker at a hospital] in the source sentence with [en fonction de son état de santé] (‘based on his state of health’).

The RNNsearch-50

[RNNsearch-50 translation]

It preserves the whole meaning of the input sentence without omitting any details.

Future

  • To better handle unknown or rare words.
  • To reduce the expensive computation (attention weights are computed over every source word at each decoding step).