ESIM for Natural Language Inference

I recently got to know ESIM, a fairly influential model in the natural language inference field, and used it as an excuse to milk Colab's free GPUs.

A brief overview of ESIM

The paper Enhanced LSTM for Natural Language Inference proposes a model for judging the relationship between two sentences. The model consists of three parts:

Input Encoding

First, the word vectors of the two input sentences, the premise $a=(a_1,…,a_{l_a})$ and the hypothesis $b=(b_1,…,b_{l_b})$, are passed through a BiLSTM, which produces new context-aware representations $(\bar{a_1}, \dots, \bar{a_{l_a}})$ and $(\bar{b_1}, \dots, \bar{b_{l_b}})$.
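As a minimal sketch of this step (the sizes and variable names below are my own, not taken from the paper or the code later in this post), the encoding is just an embedding lookup followed by a shared bidirectional LSTM:

import torch
import torch.nn as nn

# Toy sizes, purely for illustration
vocab_size, embed_dim, hidden_size = 1000, 100, 300
batch, l_a, l_b = 4, 7, 9

embed = nn.Embedding(vocab_size, embed_dim)
encoder = nn.LSTM(embed_dim, hidden_size, batch_first=True, bidirectional=True)

a = torch.randint(0, vocab_size, (batch, l_a))  # premise token ids
b = torch.randint(0, vocab_size, (batch, l_b))  # hypothesis token ids

a_bar, _ = encoder(embed(a))  # batch x l_a x (2 * hidden_size)
b_bar, _ = encoder(embed(b))  # batch x l_b x (2 * hidden_size)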

Local Inference

The paper measures how related two words are by the inner product of their encoded vectors, i.e. $e_{ij}=\bar{a_i}^T\bar{b_j}$. Computing this similarity (attention score) for every pair of words between the two sentences yields a matrix

$$(e_{ij})_{l_a \times l_b} = (\bar{a_i}^T\bar{b_j})_{l_a \times l_b}$$

Then comes an interesting idea: to judge how the two sentences relate, we check how well each can be expressed in terms of the other. That is, the word vectors $\bar{a_i}$ of the premise are re-expressed using the hypothesis's $\bar{b_j}$, and vice versa.

The formulas in the paper are:

$$\widetilde{a_i} = \sum_{j=1}^{l_b}{\frac{exp(e_{ij})}{\sum_{k=1}^{l_b}{exp(e_{ik})}}\bar{b_j}}$$
$$\widetilde{b_j} = \sum_{i=1}^{l_a}{\frac{exp(e_{ij})}{\sum_{k=1}^{l_a}{exp(e_{kj})}}\bar{a_i}}$$

In plain words: since the model does not know which pairs $a_i$ and $b_j$ actually correspond, it enumerates all pairs and uses the similarity matrix computed above as weights. The weight at each position is the softmax over the corresponding row of the matrix when computing $\widetilde{a_i}$, and over the corresponding column when computing $\widetilde{b_j}$.

To enhance the local inference information (Enhancement of inference information), the paper stacks the intermediate results obtained so far:

$$m_a = [\bar{a};\widetilde{a};\bar{a}-\widetilde{a};\bar{a} \odot \widetilde{a}]$$
$$m_b = [\bar{b};\widetilde{b};\bar{b}-\widetilde{b};\bar{b} \odot \widetilde{b}]$$
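Continuing the sketch above (still with made-up shapes), the attention matrix, the soft alignment, and the enhanced representations can be written in a few lines:

import torch
import torch.nn.functional as F

# a_bar: batch x l_a x d, b_bar: batch x l_b x d, from the encoding sketch above
e = torch.matmul(a_bar, b_bar.transpose(1, 2))  # batch x l_a x l_b

# softmax over l_b for a_tilde, over l_a for b_tilde
a_tilde = torch.matmul(F.softmax(e, dim=2), b_bar)                  # batch x l_a x d
b_tilde = torch.matmul(F.softmax(e, dim=1).transpose(1, 2), a_bar)  # batch x l_b x d

m_a = torch.cat([a_bar, a_tilde, a_bar - a_tilde, a_bar * a_tilde], dim=-1)  # batch x l_a x 4d
m_b = torch.cat([b_bar, b_tilde, b_bar - b_tilde, b_bar * b_tilde], dim=-1)  # batch x l_b x 4d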

Inference Composition

The inference composition stage works on the $m_a$ and $m_b$ obtained above, again using a BiLSTM to capture the contextual information of the two sequences.

All of this information is then aggregated and passed to fully connected layers for the final blending, roughly as sketched below.
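Continuing the same sketch (my own toy code, not the implementation further down), composition and aggregation look roughly like this: a second BiLSTM reads m_a and m_b, average and max pooling over the sequence dimension summarize each sentence, and a small MLP produces the class scores.

composer = nn.LSTM(4 * 2 * hidden_size, hidden_size, batch_first=True, bidirectional=True)

v_a, _ = composer(m_a)  # batch x l_a x (2 * hidden_size)
v_b, _ = composer(m_b)  # batch x l_b x (2 * hidden_size)

def pool(v):
    # average pooling and max pooling over the sequence dimension
    return torch.cat([v.mean(dim=1), v.max(dim=1).values], dim=-1)

v = torch.cat([pool(v_a), pool(v_b)], dim=-1)  # batch x (8 * hidden_size)
classifier = nn.Sequential(nn.Linear(8 * hidden_size, hidden_size),
                           nn.Tanh(),
                           nn.Linear(hidden_size, 3))  # 3 NLI classes
logits = classifier(v)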

Importing the required libraries

import os
import time
import logging
import pickle
from tqdm import tqdm_notebook as tqdm

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

import torchtext
from torchtext import data, datasets
from torchtext.vocab import GloVe

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import nltk
from nltk import word_tokenize
import spacy
from keras_preprocessing.text import Tokenizer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)
cuda

Mounting Google Drive

from google.colab import drive
drive.mount('/content/drive')
Go to this URL in a browser: https://accounts.google.com/o/oauth2/xxxxxxxx

Enter your authorization code:
··········
Mounted at /content/drive
!nvidia-smi
Fri Aug  9 04:45:35 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 410.79       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   60C    P0    62W / 149W |   6368MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Preparing the data with torchtext

My torchtext usage follows https://github.com/pytorch/examples/blob/master/snli/train.py

GloVe in torchtext works out of the box, but unlike torchvision it cannot read the raw downloaded files directly; it only reads its own cache. So it is best to:

  1. Download GloVe to a local directory first.
  2. Open a terminal in that directory and let torchtext build its cache there once.
  3. From then on, pass the cache argument when constructing GloVe, so torchtext reads the cache instead of downloading the huge GloVe files again (see the sketch after the directory layout below).

If you are riding Colab's free GPUs, though, none of this really matters (~ ̄▽ ̄)~

torchtext can also load the SNLI dataset directly; it expects the following directory layout:

  • root
    • snli_1.0
      • snli_1.0_train.jsonl
      • snli_1.0_dev.jsonl
      • snli_1.0_test.jsonl
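As a sketch of points 2 and 3 and of the layout above (the paths here are placeholders for wherever you keep the files): both GloVe and SNLI accept a location argument, so a pre-downloaded copy is reused instead of being fetched again.

from torchtext import data, datasets
from torchtext.vocab import GloVe

TEXT = data.Field(batch_first=True, lower=True, tokenize="spacy")
LABEL = data.Field(sequential=False)

# `cache` points at the directory where torchtext built its cache from the GloVe download
glove_vectors = GloVe(name='6B', dim=100, cache='/path/to/glove_cache')

# `root` is the directory that contains snli_1.0/snli_1.0_{train,dev,test}.jsonl
train, dev, test = datasets.SNLI.splits(TEXT, LABEL, root='/path/to/data')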
TEXT = data.Field(batch_first=True, lower=True, tokenize="spacy")
LABEL = data.Field(sequential=False)

# Split into train / dev / test sets
tic = time.time()
train, dev, test = datasets.SNLI.splits(TEXT, LABEL)
print(f"Cost: {(time.time() - tic) / 60:.2f} min")

# Load the pre-trained GloVe vectors
tic = time.time()
glove_vectors = GloVe(name='6B', dim=100)
print(f"Create GloVe done. Cost: {(time.time() - tic) / 60:.2f} min")

# Build the vocabularies
tic = time.time()
TEXT.build_vocab(train, dev, test, vectors=glove_vectors)
LABEL.build_vocab(train)
print(f"Build vocab done. Cost: {(time.time() - tic) / 60:.2f} min")

print(f"TEXT.vocab.vectors.size(): {TEXT.vocab.vectors.size()}")
num_words = int(TEXT.vocab.vectors.size()[0])

# Save the token-to-index and label-to-index dictionaries
if os.path.exists("/content/drive/My Drive/Colab Notebooks"):
    glove_stoi_path = "/content/drive/My Drive/Colab Notebooks/vocab_label_stoi.pkl"
else:
    glove_stoi_path = "./vocab_label_stoi.pkl"
pickle.dump([TEXT.vocab.stoi, LABEL.vocab.stoi], open(glove_stoi_path, "wb"))

batch_sz = 128

train_iter, dev_iter, test_iter = data.BucketIterator.splits(
    datasets=(train, dev, test),
    batch_sizes=(batch_sz, batch_sz, batch_sz),
    shuffle=True,
    device=device
)
Cost: 7.94 min
Create GloVe done. Cost: 0.00 min
Build vocab done. Cost: 0.12 min
TEXT.vocab.vectors.size(): torch.Size([34193, 100])

Global configuration

When doing this kind of "alchemy" it helps to keep a single global recipe of hyperparameters, which makes tuning much easier.

class Config:

    def __init__(self):
        # For data
        self.batch_first = True
        try:
            self.batch_size = batch_sz
        except NameError:
            self.batch_size = 512

        # For Embedding
        self.n_embed = len(TEXT.vocab)
        self.d_embed = TEXT.vocab.vectors.size()[-1]

        # For Linear
        self.linear_size = self.d_embed

        # For LSTM
        self.hidden_size = 300

        # For output
        self.d_out = len(LABEL.vocab)  # dimensionality of the output
        self.dropout = 0.5

        # For training
        self.save_path = r"/content/drive/My Drive/Colab Notebooks" if os.path.exists(
            r"/content/drive/My Drive/Colab Notebooks") else "./"
        self.snapshot = os.path.join(self.save_path, "ESIM.pt")

        self.device = device
        self.epoch = 64
        self.scheduler_step = 3
        self.lr = 0.0004
        self.early_stop_ratio = 0.985  # allows ending training early


args = Config()

ESIM implementation

The implementation is based on https://github.com/pengshuang/Text-Similarity/blob/master/models/ESIM.py

Using nn.BatchNorm1d

Normalizing the data removes the problem of different feature dimensions having very different distributions; geometrically, it turns an "ellipsoid" in n-dimensional space into a "sphere", which makes the model easier and faster to train.

Normalizing the entire dataset, however, would be very expensive. Batch Normalization is a compromise: it normalizes only the batch_size samples of the current batch, in effect estimating the statistics of the whole dataset from those samples.

As its name suggests, PyTorch's nn.BatchNorm1d performs batch normalization on 1-D feature data, which brings two constraints:

  1. During training (i.e. with model.train() active) the batch size must be at least 2; during evaluation (model.eval()) there is no restriction on the batch size.
  2. The feature dimension is expected at dim 1, i.e. the input must be shaped (batch, features) or (batch, features, seq_len).

The batches coming out of my data pipeline, after the embedding lookup, have shape batch * seq_len * embed_dim, so three dimensions. Moreover, with torchtext's data.BucketIterator.splits the seq_len of each batch is dynamic (it equals the longest sentence in that batch). Feeding such a tensor to BatchNorm1d without adjustment usually produces an error like:

RuntimeError: running_mean should contain xxx elements not yyy
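A small sketch of the shape problem with toy numbers (not the post's actual pipeline): nn.BatchNorm1d(embed_dim) expects the feature dimension at dim 1, so a batch * seq_len * embed_dim tensor would have to be transposed first.

import torch
import torch.nn as nn

batch, seq_len, embed_dim = 8, 20, 100
bn = nn.BatchNorm1d(embed_dim)
x = torch.randn(batch, seq_len, embed_dim)

# bn(x) fails: BatchNorm1d treats dim 1 (here seq_len=20) as the feature dim,
# giving "running_mean should contain 20 elements not 100"
y = bn(x.transpose(1, 2)).transpose(1, 2)  # move embed_dim to dim 1 and back
print(y.shape)  # torch.Size([8, 20, 100])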

Should a BatchNorm1d layer follow the Embedding?

The reference implementation is very clean and shows real coding skill. Its author, however, does not seem to use pre-trained vectors for the Embedding layer, whereas I use pre-trained GloVe vectors and do not fine-tune them, so it is worth asking whether an nn.BatchNorm1d layer is needed at all.

Since blindly adding layers rarely helps, the best approach is to first check whether each dimension of the GloVe vectors is already roughly "normalized".

glove = TEXT.vocab.vectors

means, stds = glove.mean(dim=0).numpy(), glove.std(dim=0).numpy()
dims = [i for i in range(glove.shape[1])]

plt.scatter(dims, means)
plt.scatter(dims, stds)
plt.legend(["mean", "std"])
plt.xlabel("Dims")
plt.ylabel("Features")
plt.show()

print(f"mean(means)={means.mean():.4f}, std(means)={means.std():.4f}")
print(f"mean(stds)={stds.mean():.4f}, std(stds)={stds.std():.4f}")

(Figure: scatter plot of the per-dimension mean and std of the GloVe vectors)

mean(means)=0.0032, std(means)=0.0809
mean(stds)=0.4361, std(stds)=0.0541

The plot shows that the distribution of each dimension is fairly stable, so I decided not to add nn.BatchNorm1d after the Embedding layer.

Using nn.LSTM

nn.LSTM(input_size, hidden_size, num_layers, bias=True,
        batch_first=False, dropout=0, bidirectional=False)

nn.LSTM defaults to batch_first=False, which feels very unnatural if you are used to the CV data layout, so I prefer to set it to True.

Below are the input/output shapes of nn.LSTM. The inputs h_0 and c_0 may be omitted, in which case the LSTM uses zero-initialized h_0 and c_0 (a shape sketch with batch_first=True follows the list).

  • Inputs: input, (h_0, c_0)
  • Outputs: output, (h_n, c_n)
  • input: (seq_len, batch, input_size)
  • output: (seq_len, batch, num_directions * hidden_size)
  • h / c: (num_layers * num_directions, batch, hidden_size)
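A quick sketch of these shapes with batch_first=True, the way the LSTMs are used below (toy sizes of my own choosing):

import torch
import torch.nn as nn

batch, seq_len, input_size, hidden_size = 4, 10, 100, 300
lstm = nn.LSTM(input_size, hidden_size, num_layers=1,
               batch_first=True, bidirectional=True)

x = torch.randn(batch, seq_len, input_size)
output, (h_n, c_n) = lstm(x)  # h_0/c_0 omitted, so they default to zeros

print(output.shape)  # torch.Size([4, 10, 600]) -> batch, seq_len, num_directions * hidden_size
print(h_n.shape)     # torch.Size([2, 4, 300])  -> num_layers * num_directions, batch, hidden_size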
class ESIM(nn.Module):

    def __init__(self, args):
        super(ESIM, self).__init__()
        self.args = args

        self.embedding = nn.Embedding(
            args.n_embed, args.d_embed)  # the weights are initialized later
        # self.bn_embed = nn.BatchNorm1d(args.d_embed)

        self.lstm1 = nn.LSTM(args.d_embed, args.hidden_size,
                             num_layers=1, batch_first=True, bidirectional=True)
        self.lstm2 = nn.LSTM(args.hidden_size * 8, args.hidden_size,
                             num_layers=1, batch_first=True, bidirectional=True)

        self.fc = nn.Sequential(
            nn.BatchNorm1d(args.hidden_size * 8),
            nn.Linear(args.hidden_size * 8, args.linear_size),
            nn.ELU(inplace=True),
            nn.BatchNorm1d(args.linear_size),
            nn.Dropout(args.dropout),
            nn.Linear(args.linear_size, args.linear_size),
            nn.ELU(inplace=True),
            nn.BatchNorm1d(args.linear_size),
            nn.Dropout(args.dropout),
            nn.Linear(args.linear_size, args.d_out),
            nn.Softmax(dim=-1)
        )

    def submul(self, x1, x2):
        mul = x1 * x2
        sub = x1 - x2
        return torch.cat([sub, mul], -1)

    def apply_multiple(self, x):
        # input: batch_size * seq_len * (2 * hidden_size)
        p1 = F.avg_pool1d(x.transpose(1, 2), x.size(1)).squeeze(-1)
        p2 = F.max_pool1d(x.transpose(1, 2), x.size(1)).squeeze(-1)
        # output: batch_size * (4 * hidden_size)
        return torch.cat([p1, p2], 1)

    def soft_attention_align(self, x1, x2, mask1, mask2):
        '''
        x1: batch_size * seq_len * dim
        x2: batch_size * seq_len * dim
        '''
        # attention: batch_size * seq_len * seq_len
        attention = torch.matmul(x1, x2.transpose(1, 2))
        # The masks keep padding positions from distorting the softmax
        mask1 = mask1.float().masked_fill_(mask1, float('-inf'))
        mask2 = mask2.float().masked_fill_(mask2, float('-inf'))

        # weight: batch_size * seq_len * seq_len
        weight1 = F.softmax(attention + mask2.unsqueeze(1), dim=-1)
        x1_align = torch.matmul(weight1, x2)
        weight2 = F.softmax(attention.transpose(
            1, 2) + mask1.unsqueeze(1), dim=-1)
        x2_align = torch.matmul(weight2, x1)

        # x_align: batch_size * seq_len * hidden_size
        return x1_align, x2_align

    def forward(self, sent1, sent2):
        """
        sent1: batch * la
        sent2: batch * lb
        """
        mask1, mask2 = sent1.eq(0), sent2.eq(0)
        x1, x2 = self.embedding(sent1), self.embedding(sent2)
        # x1, x2 = self.bn_embed(x1), self.bn_embed(x2)

        # batch * [la | lb] * dim
        o1, _ = self.lstm1(x1)
        o2, _ = self.lstm1(x2)

        # Local Inference
        # batch * [la | lb] * hidden_size
        q1_align, q2_align = self.soft_attention_align(o1, o2, mask1, mask2)

        # Inference Composition
        # batch_size * seq_len * (8 * hidden_size)
        q1_combined = torch.cat([o1, q1_align, self.submul(o1, q1_align)], -1)
        q2_combined = torch.cat([o2, q2_align, self.submul(o2, q2_align)], -1)

        # batch_size * seq_len * (2 * hidden_size)
        q1_compose, _ = self.lstm2(q1_combined)
        q2_compose, _ = self.lstm2(q2_combined)

        # Aggregate
        q1_rep = self.apply_multiple(q1_compose)
        q2_rep = self.apply_multiple(q2_compose)

        # Classifier
        similarity = self.fc(torch.cat([q1_rep, q2_rep], -1))
        return similarity


def take_snapshot(model, path):
    """Save the model weights to Drive so they survive a Colab reset"""
    torch.save(model.state_dict(), path)
    print(f"Snapshot has been saved to {path}")


def load_snapshot(model, path):
    model.load_state_dict(torch.load(path))
    print(f"Load snapshot from {path} done.")


model = ESIM(args)
# if os.path.exists(args.snapshot):
#     load_snapshot(model, args.snapshot)

# Do not train the embedding vectors
model.embedding.weight.data.copy_(TEXT.vocab.vectors)
model.embedding.weight.requires_grad = False

model.to(args.device)
ESIM(
  (embedding): Embedding(34193, 100)
  (lstm1): LSTM(100, 300, batch_first=True, bidirectional=True)
  (lstm2): LSTM(2400, 300, batch_first=True, bidirectional=True)
  (fc): Sequential(
    (0): BatchNorm1d(2400, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (1): Linear(in_features=2400, out_features=100, bias=True)
    (2): ELU(alpha=1.0, inplace)
    (3): BatchNorm1d(100, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (4): Dropout(p=0.5)
    (5): Linear(in_features=100, out_features=100, bias=True)
    (6): ELU(alpha=1.0, inplace)
    (7): BatchNorm1d(100, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (8): Dropout(p=0.5)
    (9): Linear(in_features=100, out_features=4, bias=True)
    (10): Softmax()
  )
)

Training

A few details deserve attention:

The shape of batch.label

batch.label is a 1-D vector of shape (batch), while Y_pred is a 2-D tensor of shape $batch \times 4$; even after extracting the argmax with .topk(1).indices it is still 2-D, of shape $batch \times 1$.

So if batch.label's dimensions are not expanded, PyTorch broadcasts it and the comparison yields a $batch \times batch$ result instead of $batch \times 1$, which makes the computed accuracy absurdly large. That is what the following line is for:

(Y_pred.topk(1).indices == batch.label.unsqueeze(1))
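A tiny demonstration of the broadcasting pitfall, with made-up numbers:

import torch

indices = torch.tensor([[1], [2], [0]])  # what Y_pred.topk(1).indices looks like: batch x 1
label = torch.tensor([1, 0, 0])          # what batch.label looks like: shape (batch,)

print((indices == label).shape)               # torch.Size([3, 3]) -- broadcast to batch x batch
print((indices == label.unsqueeze(1)).shape)  # torch.Size([3, 1]) -- what we actually want
print((indices == label.unsqueeze(1)).sum())  # tensor(2): two correct predictions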

Division between tensors and scalars

In Python 3.6 the / operator returns a float by default, but PyTorch (at least the version used here) does not behave the same way, another detail that is easy to overlook.

(Y_pred.topk(1).indices == batch.label.unsqueeze(1))

The result of the code above can be treated as boolean (it is actually torch.uint8). After calling .sum() the result is a torch.LongTensor, and integer division in PyTorch does not produce a float.

# for example, the code below yields 0
In [2]: torch.LongTensor([1]) / torch.LongTensor([5])
Out[2]: tensor([0])

The variable acc accumulates the number of correctly classified samples in each batch; because of automatic type promotion, acc ends up being a torch.LongTensor, so when computing the final accuracy the integer value must be extracted with .item(). Miss this detail and the reported accuracy will be 0.
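A minimal sketch of an alternative fix (illustrative numbers only): calling .item() on every batch keeps the accumulator a plain Python int, so the final division is ordinary float division regardless of the PyTorch version.

import torch

acc, cnt = 0, 0
for _ in range(3):              # pretend these are three batches
    correct = torch.tensor(40)  # e.g. (Y_pred.topk(1).indices == label.unsqueeze(1)).sum()
    acc += correct.item()       # keep acc a plain Python int
    cnt += 128

print(acc / cnt)  # 0.3125, ordinary float division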

def training(model, data_iter, loss_fn, optimizer):
    """Run one training epoch"""
    model.train()
    data_iter.init_epoch()
    acc, cnt, avg_loss = 0, 0, 0.0

    for batch in data_iter:
        Y_pred = model(batch.premise, batch.hypothesis)
        loss = loss_fn(Y_pred, batch.label)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        avg_loss += loss.item() / len(data_iter)
        # unsqueeze because label is a 1-D vector (same below)
        acc += (Y_pred.topk(1).indices == batch.label.unsqueeze(1)).sum()
        cnt += len(batch.premise)

    return avg_loss, (acc.item() / cnt)  # without .item() the accuracy would be 0


def validating(model, data_iter, loss_fn):
    """Run evaluation on a dataset"""
    model.eval()
    data_iter.init_epoch()
    acc, cnt, avg_loss = 0, 0, 0.0

    with torch.set_grad_enabled(False):
        for batch in data_iter:
            Y_pred = model(batch.premise, batch.hypothesis)

            avg_loss += loss_fn(Y_pred, batch.label).item() / len(data_iter)
            acc += (Y_pred.topk(1).indices == batch.label.unsqueeze(1)).sum()
            cnt += len(batch.premise)

    return avg_loss, (acc.item() / cnt)


def train(model, train_data, val_data):
    """Full training loop"""
    optimizer = optim.Adam(model.parameters(), lr=args.lr)
    loss_fn = nn.CrossEntropyLoss()
    scheduler = optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode='min', factor=0.5, patience=args.scheduler_step, verbose=True)

    train_losses, val_losses, train_accs, val_accs = [], [], [], []

    # Before training
    tic = time.time()
    train_loss, train_acc = validating(model, train_data, loss_fn)
    val_loss, val_acc = validating(model, val_data, loss_fn)
    train_losses.append(train_loss)
    val_losses.append(val_loss)
    train_accs.append(train_acc)
    val_accs.append(val_acc)
    min_val_loss = val_loss
    print(f"Epoch: 0/{args.epoch}\t"
          f"Train loss: {train_loss:.4f}\tacc: {train_acc:.4f}\t"
          f"Val loss: {val_loss:.4f}\tacc: {val_acc:.4f}\t"
          f"Cost time: {(time.time()-tic):.2f}s")

    try:
        for epoch in range(args.epoch):
            tic = time.time()
            train_loss, train_acc = training(
                model, train_data, loss_fn, optimizer)
            val_loss, val_acc = validating(model, val_data, loss_fn)
            train_losses.append(train_loss)
            val_losses.append(val_loss)
            train_accs.append(train_acc)
            val_accs.append(val_acc)
            scheduler.step(val_loss)

            print(f"Epoch: {epoch + 1}/{args.epoch}\t"
                  f"Train loss: {train_loss:.4f}\tacc: {train_acc:.4f}\t"
                  f"Val loss: {val_loss:.4f}\tacc: {val_acc:.4f}\t"
                  f"Cost time: {(time.time()-tic):.2f}s")

            if val_loss < min_val_loss:  # save whenever the validation loss improves
                min_val_loss = val_loss
                take_snapshot(model, args.snapshot)

            # Early stop:
            # if len(val_losses) >= 3 and (val_loss - min_val_loss) / min_val_loss > args.early_stop_ratio:
            #     print(f"Early stop with best loss: {min_val_loss:.5f}")
            #     break
            # args.early_stop_ratio *= args.early_stop_ratio

    except KeyboardInterrupt:
        print("Interrupted by user")

    return train_losses, val_losses, train_accs, val_accs


train_losses, val_losses, train_accs, val_accs = train(
    model, train_iter, dev_iter)
Epoch: 0/64	Train loss: 1.3871	acc: 0.3335	Val loss: 1.3871	acc: 0.3331	Cost time: 364.32s
Epoch: 1/64	Train loss: 1.0124	acc: 0.7275	Val loss: 0.9643	acc: 0.7760	Cost time: 998.41s
Snapshot has been saved to /content/drive/My Drive/Colab Notebooks/ESIM.pt
Epoch: 2/64	Train loss: 0.9476	acc: 0.7925	Val loss: 0.9785	acc: 0.7605	Cost time: 1003.32s
Epoch: 3/64	Train loss: 0.9305	acc: 0.8100	Val loss: 0.9204	acc: 0.8217	Cost time: 999.49s
Snapshot has been saved to /content/drive/My Drive/Colab Notebooks/ESIM.pt
Epoch: 4/64	Train loss: 0.9183	acc: 0.8227	Val loss: 0.9154	acc: 0.8260	Cost time: 1000.97s
Snapshot has been saved to /content/drive/My Drive/Colab Notebooks/ESIM.pt
Epoch: 5/64	Train loss: 0.9084	acc: 0.8329	Val loss: 0.9251	acc: 0.8156	Cost time: 996.99s
....
Epoch: 21/64	Train loss: 0.8236	acc: 0.9198	Val loss: 0.8912	acc: 0.8514	Cost time: 992.48s
Epoch: 22/64	Train loss: 0.8210	acc: 0.9224	Val loss: 0.8913	acc: 0.8514	Cost time: 996.35s
Epoch    22: reducing learning rate of group 0 to 5.0000e-05.
Epoch: 23/64	Train loss: 0.8195	acc: 0.9239	Val loss: 0.8940	acc: 0.8485	Cost time: 1000.48s
Epoch: 24/64	Train loss: 0.8169	acc: 0.9266	Val loss: 0.8937	acc: 0.8490	Cost time: 1006.78s
Interrupted by user

Plotting the loss-accuracy curves

# Guard against a KeyboardInterrupt leaving the two loss lists with different lengths
min_len = min(len(train_losses), len(val_losses))
iters = [i + 1 for i in range(min_len)]

# Plot with two y-axes
fig, ax1 = plt.subplots()
ax1.plot(iters, train_losses[: min_len], '-', label='train loss')
ax1.plot(iters, val_losses[: min_len], '-.', label='val loss')
ax1.set_xlabel("Epoch")
ax1.set_ylabel("Loss")

# Create the secondary axis
ax2 = ax1.twinx()
ax2.plot(iters, train_accs[: min_len], ':', label='train acc')
ax2.plot(iters, val_accs[: min_len], '--', label='val acc')
ax2.set_ylabel("Accuracy")

# Add a single legend covering both axes
handles1, labels1 = ax1.get_legend_handles_labels()
handles2, labels2 = ax2.get_legend_handles_labels()
plt.legend(handles1 + handles2, labels1 + labels2, loc='center right')
plt.show()

(Figure: training/validation loss and accuracy curves over epochs)

Prediction

Besides training to good numbers, the model should also be usable in practice.

nlp = spacy.load("en")

# Reload the best-performing weights from the earlier training run
load_snapshot(model, args.snapshot)
# For small inputs the CPU is actually faster
model.to(torch.device("cpu"))

with open(r"/content/drive/My Drive/Colab Notebooks/vocab_label_stoi.pkl", "rb") as f:
    vocab_stoi, label_stoi = pickle.load(f)
Load snapshot from /content/drive/My Drive/Colab Notebooks/ESIM.pt done.
def sentence2tensor(stoi, sent1: str, sent2: str):
    """Convert the two sentences into tensors"""
    sent1 = [str(token) for token in nlp(sent1.lower())]
    sent2 = [str(token) for token in nlp(sent2.lower())]

    tokens1, tokens2 = [], []

    for token in sent1:
        tokens1.append(stoi[token])

    for token in sent2:
        tokens2.append(stoi[token])

    # Pad the shorter sentence so both tensors have the same length
    delt_len = len(tokens1) - len(tokens2)

    if delt_len > 0:
        tokens2.extend([1] * delt_len)
    else:
        tokens1.extend([1] * (-delt_len))

    tensor1 = torch.LongTensor(tokens1).unsqueeze(0)
    tensor2 = torch.LongTensor(tokens2).unsqueeze(0)

    return tensor1, tensor2


def use(model, premise: str, hypothsis: str):
    """Run the model on a single premise/hypothesis pair"""
    label_itos = {0: '<unk>', 1: 'entailment',
                  2: 'contradiction', 3: 'neutral'}

    model.eval()
    with torch.set_grad_enabled(False):
        tensor1, tensor2 = sentence2tensor(vocab_stoi, premise, hypothsis)
        predict = model(tensor1, tensor2)
        top1 = predict.topk(1).indices.item()

    print(f"The answer is '{label_itos[top1]}'")

    prob = predict.cpu().squeeze().numpy()
    plt.bar(["<unk>", "entailment", "contradiction", "neutral"], prob)
    plt.ylabel("probability")
    plt.show()

Given two sentences, it prints the most likely relation and shows a bar chart with the probability of each class.

# Entailment
use(model,
    "A statue at a museum that no seems to be looking at.",
    "There is a statue that not many people seem to be interested in.")

# Contradiction
use(model,
    "A land rover is being driven across a river.",
    "A sedan is stuck in the middle of a river.")

# Neutral
use(model,
    "A woman with a green headscarf, blue shirt and a very big grin.",
    "The woman is young.")
The answer is 'entailment'

(Figure: bar chart of predicted class probabilities)

The answer is 'contradiction'

(Figure: bar chart of predicted class probabilities)

The answer is 'neutral'

(Figure: bar chart of predicted class probabilities)