Zhipeng Guo, Xiaoyuan Yi, Maosong Sun, Wenhao Li, Cheng Yang, Jiannan Liang, Huimin Chen, Yuhui Zhang, Ruoyu Li
Tsinghua University
ACL’19
Motivations
 Most existing systems are merely model-oriented: they take some user-specified keywords as input and complete the generation process in one pass, with little user participation.
 Most existing systems support only a few styles and genres, and provide limited options for users.
Contributions
 Multimodal input, such as keywords, plain text and images.
 Various styles and genres.
 Human-machine collaboration.
Architecture
Overview
Four modules:
 input preprocessing
 generation
 postprocessing
 collaborative revision
Processing: Given the user-specified genre, style and inputs, the preprocessing module extracts several keywords from the inputs and then conducts keyword expansion to introduce richer information. With these preprocessed keywords, the generation module generates a poem draft. The postprocessing module re-ranks the candidates of each line and removes the ones that do not conform to structural and phonological requirements. At last, the collaborative revision module interacts with the user and dynamically updates the draft several times according to the user's revisions, to collaboratively create a satisfying poem.
Input Preprocessing Module
Keyword Extraction

For plain text
1. Use THULAC to conduct Chinese word segmentation;
2. Compute the importance $r(w)$ of each word $w$:
\[\begin{equation} r(w)=[\alpha * ti(w)+(1-\alpha) * tr(w)] \end{equation}\]where $ti(w)$ and $tr(w)$ are the TF-IDF value and TextRank score, and $\alpha$ is a hyperparameter to balance the weights of $ti(w)$ and $tr(w)$.
3. Select the top $K$ words with the highest scores.
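As a minimal sketch, the scoring and selection step might look like this (the TF-IDF values, TextRank scores and words below are invented for illustration; the real system computes them from the segmented input):

```python
# Toy sketch of keyword selection; ti/tr values are invented for illustration.
def top_keywords(ti, tr, alpha=0.5, k=4):
    """ti, tr: dicts mapping each segmented word to its TF-IDF value
    and TextRank score; returns the k words with the highest r(w)."""
    r = {w: alpha * ti[w] + (1 - alpha) * tr[w] for w in ti}
    return sorted(r, key=r.get, reverse=True)[:k]

ti = {"moon": 0.9, "wine": 0.6, "the": 0.1}
tr = {"moon": 0.8, "wine": 0.7, "the": 0.2}
print(top_keywords(ti, tr, alpha=0.5, k=2))  # ['moon', 'wine']
```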

For each image
1. Use the Aliyun image recognition tool;
2. It returns the names of five recognized objects with corresponding probabilities $s(w)$;
3. Select the top $K$ words with the highest $s(w) \cdot r(w)$.
Keyword Mapping

Problem: The generation module will treat modern concepts that never occur in the classical poetry corpus (such as airplane and refrigerator) as UNK symbols and generate totally irrelevant poems.

Solution:
1. Build a Poetry Knowledge Graph (PKG) from Wikipedia data, which contains 616,360 entities and 5,102,192 relations.
2. Use the PKG to map modern concepts to their most relevant entities in poetry, to guarantee both the quality and relevance of generated poems.
3. For a modern concept word $w_i$, score each of its neighboring words $w_j$ by:
\[\begin{equation} g\left(w_{j}\right)=t f_{w i k i}\left(w_{j} \mid w_{i}\right) \cdot \log \left(\frac{N}{1+d f\left(w_{j}\right)}\right) \cdot \arctan \left(\frac{p\left(w_{j}\right)}{\tau}\right) \end{equation}\]where $tf_{wiki}(w_j \mid w_i)$ is the term frequency of $w_j$ in the Wikipedia article of $w_i$, $df(w_j)$ is the number of Wikipedia articles containing $w_j$, $N$ is the total number of Wikipedia articles, and $p(w_j)$ is the word frequency counted over all articles.
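A minimal sketch of this scoring, with invented counts standing in for the statistics the real system computes from the Wikipedia dump:

```python
import math

def g_score(tf_wiki, df, p, N, tau=1e-4):
    """Score a PKG neighbor w_j of a modern concept w_i.
    tf_wiki: term frequency of w_j in the article of w_i; df: number of
    articles containing w_j; p: corpus word frequency; N: total articles."""
    return tf_wiki * math.log(N / (1 + df)) * math.atan(p / tau)

# Pick the better of two hypothetical candidate mappings for a modern concept.
candidates = {"bird": (12, 5000, 3e-4), "vehicle": (3, 80000, 5e-4)}
N = 1_000_000
print(max(candidates, key=lambda w: g_score(*candidates[w], N)))  # bird
```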

Example:
Keyword Extension

Problem: More keywords could lead to richer content and emotions in the generated poems, so if the number of extracted keywords is less than $K$, keyword extension is further conducted.

Solution:
1. Build a Poetry Word Co-occurrence Graph (PWCG), which indicates the co-occurrence of two words in the same poem.
2. The weight of the edge between two words is calculated according to their Pointwise Mutual Information (PMI):
\[\begin{equation} P M I\left(w_{i}, w_{j}\right)=\log \frac{p\left(w_{i}, w_{j}\right)}{p\left(w_{i}\right) * p\left(w_{j}\right)} \end{equation}\]where $p(w_i)$ and $p(w_i, w_j)$ are the word frequency and co-occurrence frequency in the poetry corpus.
3. For a given word $w$, get all its adjacent words $w_k$ in the PWCG and select those with higher values of $\log p\left(w_{k}\right) * P M I\left(w, w_{k}\right)+\beta * r\left(w_{k}\right)$, where $\beta$ is a hyperparameter.
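A toy sketch of this expansion step; the frequencies and the graph below are invented (in Jiuge they come from the poetry corpus):

```python
import math

def expand(w, p, pp, r, beta=1.0, k=1):
    """Rank the PWCG neighbors of w by log p(w_k) * PMI(w, w_k) + beta * r(w_k).
    p: word frequencies; pp: pair co-occurrence frequencies; r: importance."""
    pmi = lambda a, b: math.log(pp[(a, b)] / (p[a] * p[b]))
    neighbors = [b for (a, b) in pp if a == w]
    score = lambda wk: math.log(p[wk]) * pmi(w, wk) + beta * r.get(wk, 0.0)
    return sorted(neighbors, key=score, reverse=True)[:k]

p = {"moon": 0.02, "autumn": 0.01, "frost": 0.005}
pp = {("moon", "autumn"): 0.004, ("moon", "frost"): 0.0008}
r = {"autumn": 0.5, "frost": 0.4}
print(expand("moon", p, pp, r))  # ['frost']
```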

Example:
Generation Module - Core Component
Reference:
Chinese Poetry Generation with a Working Memory Model
Xiaoyuan Yi, Maosong Sun, Ruoyu Li, Zonghan Yang
Tsinghua University
IJCAI’18
Overview
Three memory modules:
 topic memory $M_{1} \in \mathbb{R}^{K_{1} * d_{h}}$
 Each topic word $w_k$ is written into the topic memory in advance; it acts as the ‘major message’ and remains unchanged during the generation of a poem.
 history memory $M_{2} \in \mathbb{R}^{K_{2} * d_{h}}$
 Some salient characters of $L_{i-2}$ are selected and written into the history memory, which maintains informative partial long-distance history.
 local memory $M_{3} \in \mathbb{R}^{K_{3} * d_{h}}$
 Each character of $L_{i-1}$ is written into the local memory, which provides the full short-distance history.
where $K_1$, $K_2$ and $K_3$ are the numbers of slots and $d_h$ is the slot size.
Encoder
A bidirectional GRU is used as the encoder.
Denote the encoder input line ($L_{i-1}$) as $X=\left(x_{1} x_{2} \ldots x_{T_{e n c}}\right)$, and let $h_t$ represent the encoder hidden states.
Global Trace Vector
$v_{i}$ is a global trace vector, which records what has been generated so far and provides implicit global information for the model. Once $L_i$ is generated, it is updated by a simple vanilla RNN.
\[\begin{equation} v_{i}=\sigma\left(v_{i-1}, \frac{1}{T_{e n c}} \sum_{t=1}^{T_{e n c}} h_{t}\right), \quad v_{0}=\mathbf{0} \end{equation}\]where $\sigma$ defines a non-linear layer and $\mathbf{0}$ is a vector of all 0s.
Addressing Function
Define an Addressing Function, $\alpha=A(\tilde{M}, q)$, which calculates the probability that each slot of the memory is selected and operated on.
\[\begin{equation} z_{k}=b^{T} \sigma(\tilde{M}[k], q) \end{equation}\] \[\begin{equation} \alpha[k]=\operatorname{softmax}\left(z_{k}\right) \end{equation}\]where $q$ is the query vector, $b$ is a parameter vector, $\tilde{M}$ is the memory to be addressed, $\tilde{M}[k]$ is the $k$-th slot of $\tilde{M}$ and $\alpha[k]$ is the $k$-th element of vector $\alpha$.
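In numpy, the addressing step can be sketched as follows; the single tanh layer standing in for $\sigma$ and all shapes are assumptions, not the paper's exact parameterization:

```python
import numpy as np

def address(M, q, W, b):
    """alpha = A(M, q): softmax over per-slot scores z_k = b^T tanh(W [M[k]; q])."""
    z = np.array([b @ np.tanh(W @ np.concatenate([slot, q])) for slot in M])
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
M = rng.normal(size=(4, 8))            # 4 memory slots of size 8
q = rng.normal(size=6)                 # query vector
W, b = rng.normal(size=(5, 14)), rng.normal(size=5)
alpha = address(M, q, W, b)
print(alpha.shape, round(alpha.sum(), 6))  # (4,) 1.0
```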
Memory Writing
Before the generation, all memory slots are initialized with 0.

For topic memory, feed characters of each topic word $w_k$ into the encoder, then fill each topic vector into a slot.

For the local memory, fill in the encoder hidden states of the characters in $L_{i-1}$.

For the history memory, select a slot with the writing addressing function and fill the encoder state $h_t$ of $L_{i-2}$ into it.
\[\begin{equation} \alpha_{w}=A_{w}\left(\tilde{M}_{2},\left[h_{t} ; v_{i-1}\right]\right) \end{equation}\]where $\alpha_w$ is the vector of writing probabilities, \(\tilde{M}_{2}\) is the concatenation of the history memory $M_2$ and a null slot, and $v_{i-1}$ is the global trace vector.
For testing (non-differentiable),
\[\begin{equation} \beta[k]=I\left(k=\arg \max _{j} \alpha_{w}[j]\right) \end{equation}\]where $I$ is an indicator function.
For training, simply approximate $\beta$ as,
\[\begin{equation} \beta=\tanh \left(\gamma *\left(\alpha_{w}-\mathbf{1} * \max \left(\alpha_{w}\right)\right)\right)+\mathbf{1} \end{equation}\]where $\mathbf{1}$ is a vector of all 1s and $\gamma$ is a large positive number.
\[\begin{equation} \tilde{M}_{2}[k] \leftarrow(1-\beta[k]) * \tilde{M}_{2}[k]+\beta[k] * h_{t} \end{equation}\]If there is no need to write $h_t$ into the history memory, the model learns to write it into the null slot, which will be ignored.
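The write step can be sketched in numpy; note how, for large $\gamma$, $\beta$ approaches an indicator of the argmax slot, which is what makes the training-time approximation differentiable (the slot contents below are invented):

```python
import numpy as np

def write(M, h_t, alpha_w, gamma=1e4):
    """Soft write: beta = tanh(gamma*(alpha_w - max(alpha_w))) + 1, then
    interpolate each slot towards h_t with weight beta[k]."""
    beta = np.tanh(gamma * (alpha_w - alpha_w.max())) + 1.0
    return (1 - beta)[:, None] * M + beta[:, None] * h_t

M = np.ones((3, 4))                    # 3 slots of size 4
h_t = np.full(4, 5.0)
alpha_w = np.array([0.1, 0.7, 0.2])    # slot 1 wins the addressing
M = write(M, h_t, alpha_w)
print(M[1], M[0])  # slot 1 now holds h_t, slot 0 is untouched
```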
Memory Reading
\[\begin{equation} \alpha_{r}=A_{r}\left(M,\left[s_{t-1} ; v_{i-1} ; a_{i-1}\right]\right) \end{equation}\] \[\begin{equation} o_{t}=\sum_{k=1}^{K} \alpha_{r}[k] * M[k] \end{equation}\]where $\alpha_r$ is the reading probability vector, $s_{t-1}$ is the decoder hidden state, the trace vector $v_{i-1}$ is used to help the Addressing Function avoid reading redundant content, and $a_{i-1}$ is the topic trace vector. Joint reading from the three memory modules enables the model to flexibly decide whether to express a topic or to continue the history content.
Topic Trace Mechanism
Problem: A global trace vector $v_i$ is not enough to help the model remember whether each topic has been used or not.
Solution: Design a Topic Trace (TT) mechanism to record the usage of topics in a more explicit way.
\[\begin{equation} c_{i}=\sigma\left(c_{i-1}, \frac{1}{K_{1}} \sum_{k=1}^{K_{1}} M[k] * \alpha_{r}[k]\right), \quad c_{0}=\mathbf{0} \end{equation}\] \[\begin{equation} u_{i}=u_{i-1}+\alpha_{r}\left[1 : K_{1}\right], \quad u_{i} \in \mathbb{R}^{K_{1} * 1}, \quad u_{0}=\mathbf{0} \end{equation}\] \[\begin{equation} a_{i}=\left[c_{i} ; u_{i}\right] \end{equation}\]$c_i$ maintains the content of used topics and $u_i$ explicitly records the number of times each topic has been read. $a_i$ is the topic trace vector.
Genre embedding
$g_t$ is the concatenation of a phonology embedding and a length embedding, which are learned during training.
 The phonology embedding indicates the required category of $y_t$.
 The length embedding indicates the number of characters to be generated after $y_t$ in $L_i$ and hence controls the length of $L_i$.
Decoder
Denote the generated line in the decoder ($L_i$) as $Y=\left(y_{1} y_{2} \ldots y_{T_{d e c}}\right)$, and let $s_t$ represent the decoder hidden states. $e(y_t)$ is the word embedding of $y_t$.
\[\begin{equation} s_{t}=G R U\left(s_{t-1},\left[e\left(y_{t-1}\right) ; o_{t} ; g_{t} ; v_{i-1}\right]\right) \end{equation}\] \[\begin{equation} p\left(y_{t} \mid y_{1 : t-1}, L_{1 : i-1}, w_{1 : K_{1}}\right)=\operatorname{softmax}\left(W s_{t}\right) \end{equation}\]where $g_t$ is a special genre embedding, $o_t$ is the memory output, $W$ is the projection parameter, and $v_{i-1}$ is the global trace vector.
Generation Module - Unsupervised Style Control
Reference:
Stylistic Chinese Poetry Generation via Unsupervised Style Disentanglement
Cheng Yang, Maosong Sun, Xiaoyuan Yi, Wenhao Li
Tsinghua University
EMNLP’18
Overview
Take two arguments as input: the input sentence $s_{input}$ and a style id $k \in \{1,2, \ldots, K\}$, where $K$ is the total number of styles. Then enumerate each style id $k$ and generate the style-specific output sentence $s^{k}_{output}$ respectively.
This method is agnostic to the model structure and can be applied to any generation model. Here it is employed for the generation of Chinese quatrains (Jueju).
Mutual Information
Given two random variables $X$ and $Y$, the mutual information $I(X, Y)$ measures “the amount of information” obtained about one random variable given another one. Mutual information can also be interpreted as a measurement about how similar the joint probability distribution $p(X, Y)$ is to the product of marginal distributions $p(X)p(Y)$.
\[\begin{equation} I(X, Y)=\int_{Y} \int_{X} p(X, Y) \log \frac{p(X, Y)}{p(X) p(Y)} d X d Y \end{equation}\]
Attention Mechanism
Reference:
Neural Machine Translation by Jointly Learning to Align and Translate
Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio
Jacobs University, Université de Montréal
ICLR’15
 The encoder uses a bidirectional LSTM to project the input sentence $X$ into the vector space.

The hidden states of the LSTMs are computed by
\[\begin{equation} \overrightarrow{h_{i}}=L S T M_{\text {forward}}\left(\overrightarrow{h_{i-1}}, e\left(x_{i}\right)\right) \end{equation}\] \[\begin{equation} \overleftarrow h_{i}=L S T M_{b a c k w a r d}\left(\overleftarrow h_{i+1}, e\left(x_{i}\right)\right) \end{equation}\]for $i=1,2 \ldots T$, where \(\overrightarrow{h_{i}}\) and \(\overleftarrow h_{i}\) are the $i$-th hidden states of the forward and backward LSTMs respectively, \(e\left(x_{i}\right) \in \mathcal{R}^{d}\) is the character embedding of character $x_i$ and $d$ is the dimension of character embeddings.

Concatenate the corresponding hidden states of the forward and backward LSTMs, \(h_{i}=\left[\vec{h}_{i}, \overleftarrow h_{i}\right]\), as the $i$-th hidden state of the bi-LSTM.

The decoder module contains an LSTM decoder with attention mechanism, which computes a context vector as a weighted sum of all encoder hidden states to represent the most relevant information at each stage.

The $i$-th hidden state of the decoder LSTM is
\[\begin{equation} s_{i}=L S T M_{d e c o d e r}\left(s_{i-1},\left[e\left(y_{i-1}\right), a_{i}\right]\right) \end{equation}\]for $i=2, \ldots T$ and $s_{1}=h_{T}$, where $[\,]$ indicates the concatenation operation, $e\left(y_{i-1}\right)$ is the character embedding of $y_{i-1}$ and $a_i$ is the context vector computed by the attention mechanism.
\[\begin{equation} a_{i}=\operatorname{attention}\left(s_{i-1}, h_{1 : T}\right) \end{equation}\]
The character probability distribution when decoding the $i$-th character can be expressed as
\[\begin{equation} p\left(y_{i} \mid y_{1} y_{2} \ldots y_{i-1}, X\right)=g\left(y_{i} \mid s_{i}\right) \end{equation}\]
For details, see my other note, Neural Machine Translation by Jointly Learning to Align and Translate.
Decoder Model with Style Disentanglement
Problem: When the input style id is changed, the output sentence would probably stay the same, because no supervised loss forces the output sentences to follow the one-hot style representation.
Solution:
Add a regularization term to force a strong dependency relationship between the input style id and generated sentence.

Concatenate the one-hot representation of the style id, $\mathrm{one\_hot}(k)$, with the embedding vector $h_T$ produced by the bi-LSTM encoder, and feed \(\left[\mathrm{one\_hot}(k), h_{T}\right]\) into the decoder.

Assume the input style id is a uniformly distributed random variable $Sty$ with \(\operatorname{Pr}(Sty=k)=\frac{1}{K}\) for $k = 1, 2, \ldots, K$, where $K$ is the total number of styles.

Maximize the mutual information between the style distribution $\operatorname{Pr}(Sty)$ and the generated sentence distribution $\operatorname{Pr}(Y; X)$.
\[\begin{equation} \begin{aligned} & I(\operatorname{Pr}(Sty), \operatorname{Pr}(Y ; X)) \\=& \sum_{k=1}^{K} \operatorname{Pr}(Sty=k) \int_{Y \mid k ; X} \log \frac{\operatorname{Pr}(Y, Sty=k ; X)}{\operatorname{Pr}(Sty=k) \operatorname{Pr}(Y ; X)} d Y \\=& \sum_{k=1}^{K} \operatorname{Pr}(Sty=k) \int_{Y \mid k ; X} \log \frac{\operatorname{Pr}(Y, Sty=k ; X)}{\operatorname{Pr}(Y ; X)} d Y \\&-\sum_{k=1}^{K} \operatorname{Pr}(Sty=k) \log \operatorname{Pr}(Sty=k) \\=& \sum_{k=1}^{K} \operatorname{Pr}(Sty=k) \int_{Y \mid k ; X} \log \operatorname{Pr}(Sty=k \mid Y) d Y+\log K \\=& \int_{Y ; X} \sum_{k=1}^{K} \operatorname{Pr}(Sty=k \mid Y) \log \operatorname{Pr}(Sty=k \mid Y) d Y+\log K \end{aligned} \end{equation}\]
But the posterior distribution \(\operatorname{Pr}(Sty=k \mid Y)\) is unknown. With the help of variational inference, we can train a parameterized function $Q(Sty=k \mid Y)$ that estimates the posterior distribution, and maximize a lower bound of the mutual information.
\[\begin{equation} \begin{aligned} & I(\operatorname{Pr}(Sty), \operatorname{Pr}(Y ; X))-\log K \\=& \int_{Y ; X} \sum_{k=1}^{K} \operatorname{Pr}(Sty=k \mid Y) \log \operatorname{Pr}(Sty=k \mid Y) d Y \\=& \int_{Y ; X} \sum_{k=1}^{K} \operatorname{Pr}(Sty=k \mid Y) \log Q(Sty=k \mid Y) d Y \\ &+\int_{Y ; X} \sum_{k=1}^{K} \operatorname{Pr}(Sty=k \mid Y) \log \frac{\operatorname{Pr}(Sty=k \mid Y)}{Q(Sty=k \mid Y)} d Y \\=& \int_{Y ; X} \sum_{k=1}^{K} \operatorname{Pr}(Sty=k \mid Y) \log Q(Sty=k \mid Y) d Y \\&+\int_{Y ; X} K L(\operatorname{Pr}(Sty \mid Y) \| Q(Sty \mid Y)) d Y \\ \geq & \int_{Y ; X} \sum_{k=1}^{K} \operatorname{Pr}(Sty=k \mid Y) \log Q(Sty=k \mid Y) d Y \\=& \sum_{k=1}^{K} \operatorname{Pr}(Sty=k) \int_{Y \mid k ; X} \log Q(Sty=k \mid Y) d Y \end{aligned} \end{equation}\]Here $K L(\operatorname{Pr}(\cdot) \| Q(\cdot))$ denotes the KL-divergence between the distributions $\operatorname{Pr}(\cdot)$ and $Q(\cdot)$. The inequality comes from the fact that the KL-divergence is always no less than zero, and it is tight when $\operatorname{Pr}(\cdot) = Q(\cdot)$.

Employ a neural network to parametrize the posterior distribution estimation function $Q$.
 Compute the average character embedding of the sequence $Y$, then use a linear projection with a softmax normalizer to get the style distribution:
\[\begin{equation} Q(Sty=k \mid Y)=\operatorname{softmax}\left(W \cdot \frac{1}{T} \sum_{i=1}^{T} e\left(y_{i}\right)\right)[k] \end{equation}\]where $W \in \mathcal{R}^{K \times d}$ is the linear projection matrix.
Expected Character Embedding
Problem: The search space of the sequence $Y$ is exponential in the vocabulary size, so it is impossible to enumerate all possible sequences $Y$ to compute the integration over $Y \mid k ; X$.
Solution:

Use expected character embeddings to approximate the probabilistic space of output sequences: generate only one expected embedding sequence and assume that $Y \mid k ; X$ generates it with probability one.
\[\begin{equation} \operatorname{expect}(i ; k, X)=\sum_{c \in V} g\left(c \mid s_{i}\right) e(c) \end{equation}\]where \(\text {expect}(i ; k, X) \in \mathcal{R}^{d}\) represents the expected character embedding at the $i$-th output given style id $k$ and input sequence $X$, and $c \in V$ enumerates all characters in the vocabulary.

Feed \(\text {expect}(i ; k, X)\) into the decoder LSTM to update the hidden state for generating the next expected character embedding:
\[\begin{equation} s_{i+1}=L S T M_{d e c o d e r}\left(s_{i},\left[\text {expect}(i ; k, X), a_{i+1}\right]\right) \end{equation}\] 
Use the expected embeddings $\operatorname{expect}(i ; k, X)$ for $i = 1, 2, \ldots, T$ as an approximation of the whole probability space of $Y \mid k ; X$. The lower bound can then be rewritten as
\[\begin{equation} \mathcal{L}_{r e g}=\frac{1}{K} \sum_{k=1}^{K} \log \left\{\operatorname{softmax}\left(W \cdot \frac{1}{T} \sum_{i=1}^{T} \operatorname{expect}(i ; k, X)\right)[k]\right\} \end{equation}\]where $x[j]$ represents the $j$-th dimension of vector $x$.

Add the lower bound $\mathcal{L}_{r e g}$ to the overall training objective as a regularization term. For each training pair $(X, Y)$, we aim to maximize
\[\begin{equation} \operatorname{Train}(X, Y)=\sum_{i=1}^{T} \log p\left(y_{i} \mid y_{1} y_{2} \ldots y_{i-1}, X\right)+\lambda \mathcal{L}_{r e g} \end{equation}\]where $p\left(y_{i} \mid y_{1} y_{2} \ldots y_{i-1}, X\right)$ is the style-irrelevant generation likelihood, computed by setting the one-hot style representation to an all-zero vector, and $\lambda$ is a harmonic hyperparameter that balances the log-likelihood of generating sequence $Y$ and the lower bound of the mutual information.
The first term ensures that the decoder can generate fluent and coherent outputs, and the second term guarantees that the style-specific output depends strongly on the one-hot style representation input.
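A toy numpy sketch of the regularizer for a single input $X$; the embedding table, the per-step decoder distributions and the projection $W$ are random stand-ins for the real model's outputs:

```python
import numpy as np

def l_reg(probs_per_style, E, W):
    """probs_per_style: (K, T, V) decoder output distributions per style id;
    E: (V, d) character embeddings; W: (K, d) style projection.
    Returns (1/K) * sum_k log softmax(W @ mean expected embedding)[k]."""
    K = probs_per_style.shape[0]
    total = 0.0
    for k in range(K):
        expects = probs_per_style[k] @ E          # (T, d) expected embeddings
        logits = W @ expects.mean(axis=0)         # (K,) style logits
        q = np.exp(logits - logits.max()); q /= q.sum()
        total += np.log(q[k])
    return total / K

rng = np.random.default_rng(1)
K, T, V, d = 3, 5, 20, 8
probs = rng.dirichlet(np.ones(V), size=(K, T))    # valid distributions
out = l_reg(probs, rng.normal(size=(V, d)), rng.normal(size=(K, d)))
print(out <= 0.0)  # True: it is an average of log-probabilities
```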
Generation Module - Acrostic Poetry Generation
Given a sequence of words $seq = \left(x_{0,0}, x_{1,0}, \cdots, x_{n, 0}\right)$, create a poem using each word $x_{i, 0}$ as the first word of each line $x_i$; the created poem should also conform to the genre pattern and convey proper semantic meanings.

Use pretrained word2vec embeddings to get $K$ keywords related to $seq$ according to the cosine distance of each keyword and the words in $seq$.

Feed each $x_{i, 0}$ into the decoder at the first step.

Generate the second word with the conditional probability
\[\begin{equation} p_{g e n}\left(x_{i, 1} \mid x_{i, 0}\right) = p_{\operatorname{dec}}\left(x_{i, 1} \mid x_{i, 0}\right)+\delta * p_{\operatorname{lm}}\left(x_{i, 1} \mid x_{i, 0}\right) \end{equation}\]where $p_{dec}$ and $p_{lm}$ are the probability distributions of the decoder and a neural language model respectively.

If the length of the input sequence is less than $n$ (the number of lines in a poem), use the language model to extend it to $n$ words.
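The second-step mixing above can be sketched as follows, with toy distributions standing in for the decoder and the language model:

```python
def p_gen(p_dec, p_lm, delta=0.3):
    """Mix decoder and language-model distributions for the second character."""
    return {c: p_dec.get(c, 0.0) + delta * p_lm.get(c, 0.0)
            for c in set(p_dec) | set(p_lm)}

p_dec = {"shan": 0.5, "shui": 0.3}     # toy decoder probabilities
p_lm = {"shui": 0.6, "yue": 0.4}       # toy language-model probabilities
scores = p_gen(p_dec, p_lm, delta=0.3)
print(max(scores, key=scores.get))  # shan
```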
Postprocessing Module
Jiuge takes a line-to-line generation schema and generates each line with beam search (beam size $B$). A postprocessing module is designed to automatically check and re-rank these candidates and select the best one, which is then used for the generation of subsequent lines of the poem.
Pattern Checking
Problem: The genre embedding cannot guarantee that generated poems perfectly adhere to required patterns.
Solution: Remove the invalid candidates according to the specified length, rhythm and rhyme.
ReRanking
Reference:
Automatic Poetry Generation with Mutual Reinforcement Learning
Xiaoyuan Yi, Maosong Sun, Ruoyu Li, Wenhao Li
Tsinghua University
EMNLP’18
Problem: The best candidate may not be ranked at the top, because the training objective is Maximum Likelihood Estimation (MLE), which tends to assign generic and meaningless candidates lower costs.
Solution: Adopt automatic rewarders, including a fluency rewarder, a context coherence rewarder, and a meaningfulness rewarder. Then the candidate with the highest weightedaverage rewards given by them will be selected.
Fluency Rewarder
Use a neural language model to measure fluency.
Given a poem line $L_i$, a higher probability $P_{lm}(L_i)$ indicates that the line is more likely to exist in the corpus and thus may be more fluent and well-formed. However, it is inadvisable to use $P_{lm}(L_i)$ directly as the reward, since an overly high probability may damage diversity and innovation.
Define the fluency reward of a poem $\mathcal{P}$ as:
\[\begin{equation} \begin{aligned} r\left(L_{i}\right)&=\max \left(\left|P_{l m}\left(L_{i}\right)-\mu\right|-\delta_{1} * \sigma, 0\right) \\ R_{1}(\mathcal{P})&=\frac{1}{n} \sum_{i=1}^{n} \exp \left(-r\left(L_{i}\right)\right) \end{aligned} \end{equation}\]where $\mu$ and $\sigma$ are the mean value and standard deviation of $P_{lm}$ calculated over the whole training set, and $\delta_{1}$ is a hyperparameter to control the range.
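A sketch of this reward, assuming per-line language-model log-probabilities (the values below are invented):

```python
import math

def fluency_reward(logps, mu, sigma, delta1=1.0):
    """R1: average of exp(-r(L_i)) with r(L_i) = max(|logP - mu| - delta1*sigma, 0)."""
    r = [max(abs(lp - mu) - delta1 * sigma, 0.0) for lp in logps]
    return sum(math.exp(-ri) for ri in r) / len(r)

# lines within delta1 standard deviations of the corpus mean get reward 1
print(fluency_reward([-20.0, -21.0], mu=-20.5, sigma=1.0))  # 1.0
# an outlier line is penalized
print(fluency_reward([-30.0], mu=-20.5, sigma=1.0) < 1.0)  # True
```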
Coherence Rewarder
Use Mutual Information (MI) to measure the coherence between $L_i$ and $L_{1:i-1}$.
The MI of two sentences $S_1$ and $S_2$ is:
\[\begin{equation} M I\left(S_{1}, S_{2}\right)=\log P\left(S_{2} \mid S_{1}\right)-\lambda \log P\left(S_{2}\right) \end{equation}\]where $\lambda$ is used to regulate the weight of generic sentences.
The coherence reward:
\[\begin{equation} \begin{aligned} M I\left(L_{1 : i-1}, L_{i}\right) &=\log P_{s e q 2 s e q}\left(L_{i} \mid L_{1 : i-1}\right)-\lambda \log P_{l m}\left(L_{i}\right) \\ R_{2}(\mathcal{P}) &=\frac{1}{n-1} \sum_{i=2}^{n} M I\left(L_{1 : i-1}, L_{i}\right) \end{aligned} \end{equation}\]where $P_{seq2seq}$ is a GRU-based sequence-to-sequence model.
A better choice is to use a dynamic $\lambda$ instead of a static one:
\[\begin{equation} \lambda=\exp \left(-r\left(L_{i}\right)\right)+1 \end{equation}\]which gives smaller weights to lines with extreme language model probabilities.
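A sketch of $R_2$ with the dynamic $\lambda$, taking log-probabilities from hypothetical seq2seq and language models along with the fluency term $r(L_i)$:

```python
import math

def coherence_reward(logp_s2s, logp_lm, r):
    """R2: mean MI over lines L_2..L_n, with lambda = exp(-r(L_i)) + 1."""
    mi = [s - (math.exp(-ri) + 1.0) * l
          for s, l, ri in zip(logp_s2s, logp_lm, r)]
    return sum(mi) / len(mi)

# one line: MI = -5 - 2 * (-8) = 11 when r = 0 (lambda = 2)
print(coherence_reward([-5.0], [-8.0], [0.0]))  # 11.0
```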
Meaningfulness Rewarder
Problem: The basic model tends to generate common and meaningless words, such as bu zhi (don't know), he chu (where) and wu ren (no one).
Solution: Utilize TF-IDF to motivate the model to generate more meaningful words.
Direct use of TF-IDF leads to a serious out-of-vocabulary (OOV) problem and high variance. Therefore another neural network is used to smooth the TF-IDF values.
\[\begin{equation} R_{3}(\mathcal{P})=\frac{1}{n} \sum_{i=1}^{n} F\left(L_{i}\right) \end{equation}\]where $F(L_i)$ is a neural network which takes a line as input and predicts its estimated TF-IDF value.
For each line in the training set, the standard TF-IDF values of all words are calculated, and their average is used as the line's TF-IDF value. These values are then used to train $F(L_i)$ with the Huber loss.
Total Reward
\[\begin{equation} R(\mathcal{P})=\sum_{j=1}^{3} \alpha_{j} * \tilde{R}_{j}(\mathcal{P}) \end{equation}\]where $\alpha_{j}$ is the weight and the tilde means the three rewards are rescaled to the same magnitude.
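A sketch of the final selection over $B$ candidates; the z-scoring used for rescaling is an assumption (the paper only states that the rewards are rescaled to the same magnitude), and the reward values are invented:

```python
import numpy as np

def best_candidate(R, alphas):
    """R: (B, 3) raw rewards per candidate; alphas: weights for R1..R3.
    Rescale each reward column across candidates, then weighted-sum."""
    Rt = (R - R.mean(axis=0)) / (R.std(axis=0) + 1e-8)
    return int(np.argmax(Rt @ np.asarray(alphas)))

R = np.array([[0.9, 2.0, 0.1],
              [0.8, 5.0, 0.3]])        # two candidates, invented rewards
print(best_candidate(R, [1.0, 1.0, 1.0]))  # 1
```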
Collaborative Revision Module
The user may revise the draft several times to collaboratively create a satisfying poem together with the machine.
Revision Modes
Processing: Define an $n$-line poem draft as $X=\left(x_{1}, x_{2}, \cdots, x_{n}\right)$, with each line containing $l_i$ words: $x_{i}=\left(x_{i, 1}, x_{i, 2}, \dots, x_{i, l_{i}}\right)$. At every turn, the user can revise one word in the draft. The revision module returns the revision information to the generation module, which updates the draft according to the revised word.
Three revision modes:
 Static updating mode
 The revision is required to meet the phonological pattern of the draft, and no part of the draft except the revised word is updated.
 Local dynamic updating mode
 If the user revises the word $x_{i, j}$, Jiuge will regenerate the succeeding subsequence \(x_{i, j+1}, \cdots, x_{i, l_{i}}\) of the $i$-th line by feeding the revised word to the decoder at the revised position.
 Global dynamic updating mode
 If the user revises word $x_{i, j}$, Jiuge will regenerate all succeeding words \(x_{i, j+1}, \cdots, x_{i, l_{i}}, \cdots, x_{n, l_{n}}\).
Automatic Reference Recommendation
Problem: For poetry-writing beginners or those lacking professional knowledge, it is hard to revise the draft appropriately.
Solution: Search for several human-authored poems that are semantically similar to the generated draft and present them to the user as references. The user can decide how to make revisions with these references.

Define an $n$-line human-created poem as $Y = (y_1, \cdots, y_n)$ and a relevance scoring function $rel(x_i, y_i)$ that gives the relevance of two lines.

Return $N$ poems with the highest relevance score $rs(X, Y)$.
\[\begin{equation} r s(X, Y)=\sum_{i=1}^{n} r e l\left(x_{i}, y_{i}\right)+\gamma * I\left(Y \in D_{\text {master}}\right) \end{equation}\]where $D_{master}$ is a poetry set of masterpieces, $I$ is the indicator function, and the hyperparameter $\gamma$ is specified to balance the quality and relevance of searched poems.
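A toy sketch of the ranking; the `rel` function here (shared-word count) is a hypothetical stand-in for the system's relevance scorer, and the poems are invented:

```python
def rank_references(draft, corpus, masterpieces, rel, gamma=0.5, N=2):
    """Return the N corpus poems with the highest rs(X, Y)."""
    def rs(Y):
        return sum(rel(x, y) for x, y in zip(draft, Y)) \
               + gamma * (Y in masterpieces)
    return sorted(corpus, key=rs, reverse=True)[:N]

def rel(a, b):  # toy relevance: number of shared words
    return len(set(a.split()) & set(b.split()))

draft = ("bright moon night", "a cup of wine")
p1 = ("moon over hills", "green tea")
p2 = ("setting sun", "a cup of wine")
print(rank_references(draft, [p1, p2], {p1}, rel, N=1))
```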
System
The initial page provides some basic options: multi-modal input and the selection of genre and style.
The collaborative creation page: an example of revision.