Skip Deep LSTM (an LSTM trick)
First, network depth is critical to model performance.
However, stacking more than about three LSTM layers is usually hard to train because of vanishing or exploding gradients.
Therefore, borrowing the idea behind GNMT (Google's neural machine translation system), we build a deep LSTM with dense skip connections (Skip Deep LSTM).
In experiments its training loss on image captioning is lower than that of a plain stacked LSTM, and it also outperforms the commonly used LSTM on time-series forecasting tasks (e.g., power-generation forecasting).
Following GNMT, the first layer is a bidirectional LSTM (BiLSTM); a depth of 5-7 layers works best, beating the common LSTM on both image captioning and time-series forecasting. For comparison, a sketch of the plain GNMT-style residual baseline is shown below.
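In GNMT, the deeper LSTM layers use residual connections (each layer's input is added to its output) and only the first encoder layer is bidirectional; the dense-skip design here generalizes this by summing several earlier layer outputs. A minimal sketch of that residual baseline, with a hypothetical (36, 128) input shape, not code taken from GNMT itself:
# GNMT-style residual LSTM stack for comparison (sketch; input shape is assumed).
from tensorflow.keras.layers import Input, LSTM, Bidirectional, add

x = Input(shape=(36, 128))                               # (time_steps, features) - hypothetical
h = Bidirectional(LSTM(64, return_sequences=True))(x)    # first layer bidirectional, as in GNMT's encoder
for _ in range(4):                                       # deeper layers: residual add of layer input and output
    h = add([LSTM(128, return_sequences=True)(h), h])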
The core dense-skip code is as follows (implemented with TF2 / tf.keras):
# Dense-skip deep LSTM. The connection pattern and depth can be tuned per task; 5-7 layers currently work best.
from tensorflow.keras.layers import LSTM, Bidirectional, Dense, Permute, add, multiply

# `se2` is the output of the preceding encoder/embedding layers (defined elsewhere), shape (batch, time_steps, features).
bi1 = Bidirectional(LSTM(64, return_sequences=True))(se2)
bi2 = LSTM(128, return_sequences=True)(bi1)
bi3 = LSTM(128, return_sequences=True)(bi2)
res1 = add([bi1, bi3])            # skip: layer-1 output reused at layer 4's input
bi4 = LSTM(128, return_sequences=True)(res1)
res2 = add([bi2, bi4, bi1])       # dense skips from layers 1 and 2
bi5 = LSTM(128, return_sequences=True)(res2)
res3 = add([bi3, bi5, bi2, bi1])
bi6 = LSTM(128, return_sequences=True)(res3)
res4 = add([bi4, bi6, bi3, bi1])
se3 = LSTM(256)(res4)             # final LSTM collapses the sequence into a single vector
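For reference, a minimal self-contained sketch that wraps the dense-skip stack above into a reusable function and attaches a toy forecasting head; the function name, the (36, 128) input shape, and the single-unit output are assumptions for illustration, not part of the original model:
# Sketch only: `skip_deep_lstm` is a hypothetical helper; all shapes are assumed.
from tensorflow.keras.layers import Input, LSTM, Bidirectional, Dense, add
from tensorflow.keras.models import Model

def skip_deep_lstm(x):
    # Same wiring as the dense-skip snippet above.
    bi1 = Bidirectional(LSTM(64, return_sequences=True))(x)
    bi2 = LSTM(128, return_sequences=True)(bi1)
    bi3 = LSTM(128, return_sequences=True)(bi2)
    bi4 = LSTM(128, return_sequences=True)(add([bi1, bi3]))
    bi5 = LSTM(128, return_sequences=True)(add([bi2, bi4, bi1]))
    bi6 = LSTM(128, return_sequences=True)(add([bi3, bi5, bi2, bi1]))
    return LSTM(256)(add([bi4, bi6, bi3, bi1]))

inp = Input(shape=(36, 128))            # hypothetical: 36 time steps, 128 features per step
out = Dense(1)(skip_deep_lstm(inp))     # e.g. a one-step-ahead forecast head
model = Model(inp, out)
model.compile(optimizer='adam', loss='mse')
model.summary()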
# Fused LSTM: a BiLSTM branch and a unidirectional LSTM branch merged through dense skips.
# `add1` and `se2` are branch inputs produced earlier in the model (defined elsewhere).
bi1_1 = Bidirectional(LSTM(16, return_sequences=True))(add1)
bi1_2 = Bidirectional(LSTM(16, return_sequences=True))(bi1_1)
bi1_3 = Bidirectional(LSTM(16, return_sequences=True))(bi1_2)
bi2_1 = LSTM(32, return_sequences=True)(se2)
bi2_2 = LSTM(32, return_sequences=True)(bi2_1)
bi2_3 = LSTM(32, return_sequences=True)(bi2_2)
res1 = add([bi1_1, bi2_1, bi1_3, bi2_3])     # cross-branch fusion of early and late outputs
bi1_4 = Bidirectional(LSTM(16, return_sequences=True))(res1)
bi2_4 = LSTM(32, return_sequences=True)(res1)
res2 = add([bi1_1, bi2_1, bi1_2, bi2_2])
bi1_5 = Bidirectional(LSTM(16, return_sequences=True))(res2)
bi2_5 = LSTM(32, return_sequences=True)(res2)
res3 = add([bi1_1, bi2_1, bi1_2, bi2_2, bi1_3, bi2_3])
# se3 = LSTM(256)(res3)                      # unidirectional alternative
se3 = Bidirectional(LSTM(128))(res3)
decoder2 = Dense(256, activation='relu')(se3)
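`add1` and `se2` are not defined in the snippet above. Purely as an illustration of how such branch inputs could be produced in an image-captioning encoder (all names, the 2048-d image feature size, the 34-step caption length, and the 5000-word vocabulary are assumptions, not taken from the original model):
# Hypothetical construction of `se2` (text branch) and `add1` (image-text fusion branch).
from tensorflow.keras.layers import Input, Dense, Embedding, Dropout, RepeatVector, add

inputs1 = Input(shape=(2048,))               # e.g. pre-extracted CNN image features
fe1 = Dense(256, activation='relu')(inputs1)
inputs2 = Input(shape=(34,))                 # tokenized caption (34 time steps)
se1 = Embedding(5000, 256)(inputs2)
se2 = Dropout(0.5)(se1)                      # feeds the unidirectional LSTM branch
add1 = add([RepeatVector(34)(fe1), se2])     # image features broadcast over time, fused with the text; feeds the BiLSTM branch
Both branches must share the same number of time steps so that their outputs can be summed later.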
Embedding a time-step attention mechanism
# Attention over time steps (timeStep attention)
def attention_3d_block(inputs):
    # inputs shape: (batch, time_steps, features)
    # input_dim = int(inputs.shape[2])
    a = Permute((2, 1))(inputs)  # swap to (batch, features, time_steps)
    # The Dense width must match the time-step dimension of `inputs` (here 36): the maximum
    # number of words in image captioning, or the number of input features in the time-series
    # setup (where the features are laid out along the step axis).
    a = Dense(36, activation='tanh')(a)
    a_probs = Permute((2, 1), name='attention_vec')(a)  # back to (batch, time_steps, features)
    # output_attention_mul = merge([inputs, a_probs], name='attention_mul', mode='mul')  # legacy Keras 1 API
    output_attention_mul = multiply([inputs, a_probs], name='attention_mul')  # element-wise re-weighting of each time step
    return output_attention_mul
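A quick shape check (the 36-step, 128-feature input below is an arbitrary assumption) shows that the block returns a tensor with the same shape as its input, re-weighted over time:
# Shape sanity check for attention_3d_block (input size is hypothetical).
from tensorflow.keras.layers import Input

x = Input(shape=(36, 128))      # the time-step dimension must equal the Dense width (36)
weighted = attention_3d_block(x)
print(weighted.shape)           # (None, 36, 128)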
# Inserting the time-step attention between the first and second LSTM layers usually works best; experiment per task to choose the position.
bi1 = Bidirectional(LSTM(64, return_sequences=True))(se2)
attention_mul = attention_3d_block(bi1)   # attention inserted between the first and second LSTM layers
bi2 = LSTM(128, return_sequences=True)(attention_mul)
bi3 = LSTM(128, return_sequences=True)(bi2)
res1 = add([bi1, bi3])
bi4 = LSTM(128, return_sequences=True)(res1)
res2 = add([bi2, bi4, bi1])
bi5 = LSTM(128, return_sequences=True)(res2)
res3 = add([bi3, bi5, bi2, bi1])
bi6 = LSTM(128, return_sequences=True)(res3)
res4 = add([bi4, bi6, bi3, bi1])
se3 = LSTM(256)(res4)
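If `se2` is defined as a Keras Input before the attention-augmented stack above is built, its final tensor `se3` can be wrapped into a trainable model; the (36, 128) shape, the Dense(1) head, and the random data below are placeholders for illustration only:
# Hypothetical wrapper around the attention-augmented stack above.
import numpy as np
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

se2 = Input(shape=(36, 128))   # define before building bi1 ... se3 above
# ... build the attention-augmented stack (bi1 through se3) here ...
out = Dense(1)(se3)            # e.g. one-step-ahead forecasting head (assumed)
model = Model(se2, out)
model.compile(optimizer='adam', loss='mse')
model.fit(np.random.rand(64, 36, 128), np.random.rand(64, 1), epochs=2)  # dummy data for illustration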