**Chris Dyer, Miguel Ballesteros, Wang Ling, Austin Matthews, Noah A. Smith**

*Marianas Labs, NLP Group of Pompeu Fabra University, Carnegie Mellon University*

*ACL’15*

# Background

## Definition

Transition-based dependency parsing formalizes the parsing problem as a series of decisions that read words sequentially from a buffer and combine them incrementally into syntactic structures.

## Advantages

- The number of operations is linear in the length of the sentence, making parsing computationally efficient.

## Challenges

Modeling which action should be taken in each of the unboundedly many states encountered as the parser progresses.

## Solutions

Alternative transition sets simplify the modeling problem, and better attachment decisions are made through feature engineering and, more recently, neural networks.

# Method

## Stack LSTMs

- An LSTM augmented with a "stack pointer".
- **POP**: moves the stack pointer to the previous element.
- **PUSH**: adds a new entry at the end of the list.
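The push/pop mechanics above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the class name, dimensions, and random initialization are assumptions. The key property is that POP only moves the pointer, so all states are retained and a later PUSH continues from the correct ancestor state.

```python
import numpy as np

class StackLSTM:
    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        # One stacked weight matrix for the four LSTM gates (i, f, o, g).
        self.W = rng.normal(0.0, 0.1, (4 * hidden_dim, input_dim + hidden_dim))
        self.b = np.zeros(4 * hidden_dim)
        self.H = hidden_dim
        # Index 0 holds the empty-stack state; every pushed state is kept.
        self.states = [(np.zeros(hidden_dim), np.zeros(hidden_dim))]
        self.parent = [-1]   # predecessor index of each state
        self.ptr = 0         # the "stack pointer"

    def push(self, x):
        h_prev, c_prev = self.states[self.ptr]
        z = self.W @ np.concatenate([x, h_prev]) + self.b
        i, f, o = (1.0 / (1.0 + np.exp(-z[k * self.H:(k + 1) * self.H]))
                   for k in range(3))
        g = np.tanh(z[3 * self.H:])
        c = f * c_prev + i * g
        h = o * np.tanh(c)
        self.states.append((h, c))
        self.parent.append(self.ptr)
        self.ptr = len(self.states) - 1

    def pop(self):
        # Move the pointer to the previous element; the state is retained.
        self.ptr = self.parent[self.ptr]

    def top(self):
        # Summary embedding of the current stack contents.
        return self.states[self.ptr][0]
```

Because popped states are kept, push–pop–push sequences branch off the right ancestor: after pushing a and b, popping, and pushing c, a further pop returns to a (not b).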

## Dependency Parser

### Transition Operations
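The parser uses the arc-standard transition inventory: SHIFT plus left and right reduce operations parameterized by a relation $r$. A minimal sketch with plain Python lists (the function names and data structures here are illustrative; the paper applies these operations to stack-LSTM-backed structures):

```python
# Arc-standard transitions on plain lists; arcs are (head, relation, dependent).
def shift(stack, buffer):
    stack.append(buffer.pop(0))          # move the next buffer word onto the stack

def left_arc(stack, arcs, r):
    dep = stack.pop(-2)                  # second-from-top becomes the dependent
    arcs.append((stack[-1], r, dep))     # top of stack is the head

def right_arc(stack, arcs, r):
    dep = stack.pop()                    # top of stack becomes the dependent
    arcs.append((stack[-1], r, dep))     # second-from-top is the head
```

For example, on "the dog barks", the sequence SHIFT, SHIFT, LEFT-ARC(det), SHIFT, LEFT-ARC(nsubj) leaves only "barks" on the stack, with arcs det(dog, the) and nsubj(barks, dog).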

### Token Embeddings and OOVs

- A learned vector representation for each word type ($w$);
- A fixed vector representation from a neural language model ($w_{LM}$);
- A learned representation ($t$) of the POS tag of the token, provided as auxiliary input to the parser.
- A linear map ($V$) is applied to the resulting vector and passed through a component-wise ReLU.
- **OOVs**: stochastically replace (with $p = 0.5$) each singleton word type in the parsing training data with the UNK token in each training iteration.
- **Pretrained word embeddings**: structured skip n-gram.
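The steps above amount to $x = \mathrm{ReLU}(V[w; w_{LM}; t] + b)$ plus training-time UNK replacement. A small sketch (all dimensions, names, and the random initialization are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d_w, d_lm, d_t, d_x = 100, 100, 20, 128          # assumed embedding sizes
V = rng.normal(0.0, 0.1, (d_x, d_w + d_lm + d_t))
b = np.zeros(d_x)

def token_repr(w, w_lm, t):
    # x = ReLU(V [w; w_LM; t] + b): concatenate, map linearly, rectify.
    return np.maximum(0.0, V @ np.concatenate([w, w_lm, t]) + b)

def maybe_unk(word, singletons, rng, p=0.5):
    # Training-time OOV handling: a singleton word becomes UNK with prob p.
    return "UNK" if word in singletons and rng.random() < p else word
```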

### Composition Functions

- Inputs: the syntactic head (**h**), the dependent (**d**), and the syntactic relation being satisfied (**r**).
- Concatenate the vector embeddings of the head, dependent, and relation, then apply a linear operator and a component-wise non-linearity.
- For the relation vector, use an embedding of the parser action that was applied to construct the relation.
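A sketch of this composition, $c = \tanh(U[h; d; r] + e)$, where the composed vector replaces the head and dependent on the stack (the matrix $U$, bias $e$, and dimension here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 64                                 # assumed representation size
U = rng.normal(0.0, 0.1, (dim, 3 * dim))
e = np.zeros(dim)

def compose(h, d, r):
    # c = tanh(U [h; d; r] + e): one vector summarizing the new subtree.
    return np.tanh(U @ np.concatenate([h, d, r]) + e)
```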

### Parser Operation

#### Algorithm

- Initialized by pushing the words of the input sentence and their representations onto $B$ in reverse order, so that the first word is at the top of $B$ and the ROOT symbol is at the bottom; $S$ and $A$ each contain an empty-stack token.
- At each time step, the parser computes a composite representation of the stack states and uses that to predict an action to take, which updates the stack.
- Completes when $B$ is empty, $S$ contains two elements, one representing the full parse tree headed by the ROOT symbol and the other the empty-stack symbol, and $A$ is the history of operations taken by the parser.
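The control flow above can be sketched with plain lists driven by a given action sequence (schematic only; in the paper the stack, buffer, and history are stack LSTMs and the actions come from the model rather than an oracle):

```python
# Arcs are (head, relation, dependent); names here are illustrative.
def parse(words, actions):
    buffer = list(reversed(words))       # first word at the top (list end)
    stack, arcs, history = [], [], []
    for act in actions:
        if act == ("SHIFT",):
            stack.append(buffer.pop())
        elif act[0] == "LEFT":           # top of stack heads second-from-top
            dep = stack.pop(-2)
            arcs.append((stack[-1], act[1], dep))
        else:                            # "RIGHT": second-from-top heads top
            dep = stack.pop()
            arcs.append((stack[-1], act[1], dep))
        history.append(act)
    return stack, arcs, history
```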

#### Model

#### Formulations

- $p_t$ is the parser state representation at time $t$, which is used to determine the transition to take.
- $W$ is a learned parameter matrix.
- $b_t$ is the stack LSTM encoding of the input buffer $B$.
- $s_t$ is the stack LSTM encoding of $S$.
- $a_t$ is the stack LSTM encoding of $A$.
- $d$ is a bias term.
- The result is passed through a component-wise ReLU nonlinearity.
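Assembled, the bullets above correspond to the paper's state computation (notation as defined above):

$$p_t = \max\left\{0,\; W[s_t; b_t; a_t] + d\right\}$$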

- The parser state $p_t$ is used to compute the probability of the parser action at time $t$.

- The chain rule may be invoked to write the probability of any valid sequence of parse actions $z$ conditional on the input.
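Concretely, writing $\mathcal{A}(S, B)$ for the set of valid actions given the current stack and buffer, and $g_z$, $q_z$ for the embedding and bias of action $z$ (notation following the paper's softmax formulation):

$$p(z_t \mid p_t) = \frac{\exp\left(g_{z_t}^{\top} p_t + q_{z_t}\right)}{\sum_{z' \in \mathcal{A}(S, B)} \exp\left(g_{z'}^{\top} p_t + q_{z'}\right)}, \qquad p(\mathbf{z} \mid \mathbf{w}) = \prod_{t=1}^{|\mathbf{z}|} p(z_t \mid p_t)$$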

# Experiments

## Datasets

### English

- The Stanford Dependency treebank contains a negligible number of non-projective arcs (Chen and Manning, 2014).
- Part-of-speech tags are predicted using the Stanford Tagger, with an accuracy of 97.3%.
- Language model word embeddings were generated from the AFP portion of the English Gigaword corpus (version 5).

### Chinese

- The Penn Chinese Tree-bank 5.1 (Zhang and Clark, 2008).
- Gold part-of-speech tags (Chen and Manning, 2014).
- Language model word embeddings were generated from the complete Chinese Gigaword corpus (version 2), as segmented by the Stanford Chinese Segmenter (Tseng et al., 2005).

## Results

Punctuation symbols are excluded from the evaluation.