
  

  Author | Li Qiujian (李秋键)   

  

  Editor | Carol   

  

  Images | 区块链大本营 (Blockchain Camp)   

  

  Semantic analysis, an important part of natural language processing, mainly covers the following tasks: at the word level, its basic task is word sense disambiguation; at the sentence level, semantic role labeling is the main concern; at the document level, coreference resolution and discourse-level semantic analysis are the focus.   

  

  Entity recognition and relation extraction are the foundation for building higher-level natural language processing applications such as knowledge graphs. Relation extraction can be understood simply as a classification problem: given two entities and a sentence in which both entities appear, decide what relation holds between the two entities.   
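
  For instance, one training example can be thought of as a triple plus a label; the concrete entities and the label set below are made up purely for illustration:

# One hypothetical relation-extraction sample: two entities, a sentence in which
# they co-occur, and the relation class the model must predict.
sample = {
    "entity1": "李雷",
    "entity2": "韩梅梅",
    "sentence": "李雷和韩梅梅于2010年结婚。",
    "relation": "夫妻",
}

# The classifier maps (entity1, entity2, sentence) to a probability distribution
# over a fixed, illustrative set of relation labels.
relations = ["unknown", "夫妻", "父母", "兄弟姐妹", "师生"]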

  

  Deep learning methods using a CNN or a bidirectional RNN with attention are considered the current state of the art for relation extraction. Most of the existing literature and code targets English corpora and uses word vectors as the training input. Here, with practice in mind, we introduce a Chinese relation extraction model built with a bidirectional GRU and a dual attention mechanism at both the character and the sentence level. It takes character embeddings, which naturally fit the characteristics of Chinese, as input, and uses data crawled from the web as the training corpus. The code is mainly developed on top of Tsinghua's open-source project thunlp/TensorFlow-NRE. The results look like this:   
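
  Because the input is character-level, each sentence only needs to be split into single characters and mapped to ids before the embedding lookup. A minimal sketch, assuming a hypothetical char2id vocabulary with 'BLANK' and 'UNK' entries (the actual project builds a word2id dictionary from the corpus):

def sentence_to_char_ids(sentence, char2id, fixlen=70):
    # Pad to a fixed length with the BLANK id, then overwrite with the real characters.
    ids = [char2id['BLANK']] * fixlen
    for i, ch in enumerate(sentence[:fixlen]):
        ids[i] = char2id.get(ch, char2id['UNK'])
    return ids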

  

     

  

  Preparation before the experiment: first, the Python version we use is 3.6.5. The modules used are as follows:   

  

  The tensorflow module: used to build the network framework and to create, train, save, and reload the whole model.   

  

  The numpy module for data processing: matrix operations.   

  

  The sklearn module: a collection of machine-learning algorithms.   
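
  Putting this together, the environment assumed in the rest of the article amounts to the following imports (TensorFlow must be a 1.x version, since the code below relies on tf.contrib and placeholders; the sklearn import is only one example of how that module might be used for evaluation):

import datetime

import numpy as np
import tensorflow as tf                               # TensorFlow 1.x
from sklearn.metrics import average_precision_score  # example: evaluating precision/recall

import network  # the network.py file defined below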

  

  Building the network of the model   

  

  The network diagram of the model is as follows:   

  

  The idea of the bidirectional GRU with word-level attention comes from the paper "Attention-Based Bidirectional Long Short-Term Memory Networks for Relation Classification". Here, the LSTM in the original model structure is replaced by a GRU, and each Chinese character of a sentence is fed in as a character embedding. This part of the model is trained on every individual sentence, with word-level attention added.   
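
  Concretely, word-level (here character-level) attention learns a weight for every time step of the Bi-GRU output H and collapses the sequence into a single sentence vector. A minimal numpy sketch of the same computation that the TensorFlow code performs later (attention_w corresponds to the attention_omega variable):

import numpy as np

def word_attention(H, attention_w):
    # H: (num_steps, gru_size) Bi-GRU outputs for one sentence
    # attention_w: (gru_size, 1) learned attention vector
    scores = np.tanh(H) @ attention_w              # (num_steps, 1) unnormalized scores
    alpha = np.exp(scores) / np.exp(scores).sum()  # softmax over the time steps
    return (alpha * H).sum(axis=0)                 # weighted sum -> (gru_size,) sentence vector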

  

  The idea of sentence-level attention comes from the paper "Neural Relation Extraction with Selective Attention over Instances". The structure diagram of the original model is shown below. Here, the CNN module that encodes each sentence is replaced by the bidirectional GRU model above. This part of the model trains all the sentences of an entity pair together and adds sentence-level attention.   
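
  Sentence-level attention weighs the encoded sentences of one entity pair, so that noisy sentences that do not actually express the relation contribute less to the bag representation. A minimal numpy sketch of the idea (the names follow the sen_a / sen_r variables in the network code below):

import numpy as np

def sentence_attention(S, sen_a, sen_r):
    # S: (num_sentences, gru_size) sentence vectors for one entity pair
    # sen_a: (gru_size,) diagonal attention matrix; sen_r: (gru_size, 1) relation query
    e = (S * sen_a) @ sen_r              # (num_sentences, 1) relevance score of each sentence
    alpha = np.exp(e) / np.exp(e).sum()  # softmax over the sentences in the bag
    return (alpha * S).sum(axis=0)       # bag representation fed to the classifier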

  

  Create the network.py file and define the word-vector size, the number of steps, the number of classes, and so on.   

  

class Settings(object):
    def __init__(self):
        self.vocab_size = 16691
        self.num_steps = 70
        self.num_epochs = 10
        self.num_classes = 12
        self.gru_size = 230
        self.keep_prob = 0.5
        self.num_layers = 1
        self.pos_size = 5
        self.pos_num = 123
        # number of entity pairs per batch during training or testing
        self.big_num = 50

  

  Then build the GRU network. Based on the network diagram given above, define the basic network framework, which is instantiated with the concrete parameters from the settings.   

  

class GRU:
    def __init__(self, is_training, word_embeddings, settings):
        self.num_steps = num_steps = settings.num_steps
        self.vocab_size = vocab_size = settings.vocab_size
        self.num_classes = num_classes = settings.num_classes
        self.gru_size = gru_size = settings.gru_size
        self.big_num = big_num = settings.big_num

        self.input_word = tf.placeholder(dtype=tf.int32, shape=[None, num_steps], name='input_word')
        self.input_pos1 = tf.placeholder(dtype=tf.int32, shape=[None, num_steps], name='input_pos1')
        self.input_pos2 = tf.placeholder(dtype=tf.int32, shape=[None, num_steps], name='input_pos2')
        self.input_y = tf.placeholder(dtype=tf.float32, shape=[None, num_classes], name='input_y')
        self.total_shape = tf.placeholder(dtype=tf.int32, shape=[big_num + 1], name='total_shape')
        total_num = self.total_shape[-1]

        word_embedding = tf.get_variable(initializer=word_embeddings, name='word_embedding')
        pos1_embedding = tf.get_variable('pos1_embedding', [settings.pos_num, settings.pos_size])
        pos2_embedding = tf.get_variable('pos2_embedding', [settings.pos_num, settings.pos_size])
        attention_w = tf.get_variable('attention_omega', [gru_size, 1])
        sen_a = tf.get_variable('attention_A', [gru_size])
        sen_r = tf.get_variable('query_r', [gru_size, 1])
        relation_embedding = tf.get_variable('relation_embedding', [self.num_classes, gru_size])
        sen_d = tf.get_variable('bias_d', [self.num_classes])

        gru_cell_forward = tf.contrib.rnn.GRUCell(gru_size)
        gru_cell_backward = tf.contrib.rnn.GRUCell(gru_size)

        if is_training and settings.keep_prob < 1:
            gru_cell_forward = tf.contrib.rnn.DropoutWrapper(gru_cell_forward, output_keep_prob=settings.keep_prob)
            gru_cell_backward = tf.contrib.rnn.DropoutWrapper(gru_cell_backward, output_keep_prob=settings.keep_prob)

        cell_forward = tf.contrib.rnn.MultiRNNCell([gru_cell_forward] * settings.num_layers)
        cell_backward = tf.contrib.rnn.MultiRNNCell([gru_cell_backward] * settings.num_layers)

        sen_repre = []
        sen_alpha = []
        sen_s = []
        sen_out = []
        self.prob = []
        self.predictions = []
        self.loss = []
        self.accuracy = []
        self.total_loss = 0.0

        self._initial_state_forward = cell_forward.zero_state(total_num, tf.float32)
        self._initial_state_backward = cell_backward.zero_state(total_num, tf.float32)

        # embedding layer: concatenate character embedding and the two position embeddings
        inputs_forward = tf.concat(axis=2, values=[tf.nn.embedding_lookup(word_embedding, self.input_word),
                                                   tf.nn.embedding_lookup(pos1_embedding, self.input_pos1),
                                                   tf.nn.embedding_lookup(pos2_embedding, self.input_pos2)])
        inputs_backward = tf.concat(axis=2,
                                    values=[tf.nn.embedding_lookup(word_embedding, tf.reverse(self.input_word, [1])),
                                            tf.nn.embedding_lookup(pos1_embedding, tf.reverse(self.input_pos1, [1])),
                                            tf.nn.embedding_lookup(pos2_embedding, tf.reverse(self.input_pos2, [1]))])

        outputs_forward = []
        state_forward = self._initial_state_forward

        # Bi-GRU layer: unroll the forward GRU over the time steps
        with tf.variable_scope('GRU_FORWARD') as scope:
            for step in range(num_steps):
                if step > 0:
                    scope.reuse_variables()
                (cell_output_forward, state_forward) = cell_forward(inputs_forward[:, step, :], state_forward)
                outputs_forward.append(cell_output_forward)

        outputs_backward = []
        state_backward = self._initial_state_backward

        # unroll the backward GRU over the reversed inputs
        with tf.variable_scope('GRU_BACKWARD') as scope:
            for step in range(num_steps):
                if step > 0:
                    scope.reuse_variables()
                (cell_output_backward, state_backward) = cell_backward(inputs_backward[:, step, :], state_backward)
                outputs_backward.append(cell_output_backward)

        output_forward = tf.reshape(tf.concat(axis=1, values=outputs_forward), [total_num, num_steps, gru_size])
        output_backward = tf.reverse(
            tf.reshape(tf.concat(axis=1, values=outputs_backward), [total_num, num_steps, gru_size]), [1])

        # word-level attention layer: add both directions and attend over the time steps
        output_h = tf.add(output_forward, output_backward)
        attention_r = tf.reshape(tf.matmul(tf.reshape(tf.nn.softmax(
            tf.reshape(tf.matmul(tf.reshape(tf.tanh(output_h), [total_num * num_steps, gru_size]), attention_w),
                       [total_num, num_steps])), [total_num, 1, num_steps]), output_h), [total_num, gru_size])

  

Training and using the model. As for obtaining the training corpus: since publicly available corpora for Chinese relation extraction are scarce, we drew inspiration from the distant supervision approach. The idea is to first find entity pairs whose relation is already known, and then collect sentences in which both entities co-occur as positive samples. Negative samples are produced by randomly generating unrelated entity pairs from the entity database and collecting the sentences in which those pairs co-occur.
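
A rough sketch of that sample-construction step is given below; build_samples, the entity-pair lists, and the corpus variable are hypothetical placeholders rather than part of the released code:

import random

def build_samples(related_pairs, all_entities, corpus_sentences):
    # related_pairs: list of (entity1, entity2, relation) with a known relation
    # all_entities: list of entity names; corpus_sentences: sentences crawled from the web
    samples = []
    for e1, e2, relation in related_pairs:
        for sent in corpus_sentences:
            if e1 in sent and e2 in sent:
                samples.append((e1, e2, sent, relation))      # positive sample
    known = {(e1, e2) for e1, e2, _ in related_pairs}
    for _ in range(len(samples)):                             # roughly balance the classes
        e1, e2 = random.sample(all_entities, 2)
        if (e1, e2) in known:
            continue
        for sent in corpus_sentences:
            if e1 in sent and e2 in sent:
                samples.append((e1, e2, sent, 'unknown'))     # negative sample
                break
    return samples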

  

The entity pairs with known relations were obtained from 复旦知识工厂 (Fudan Knowledge Works); thanks to them for the free API! One small problem is that the same relation label may correspond to different annotations in Fudan Knowledge Works. For the relation "夫妻" (spouse), for example, the scraped data sometimes says "丈夫" (husband), sometimes "妻子" (wife), sometimes "伉俪" (married couple), and so on, so the labels need to be aligned by hand.
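
That manual alignment boils down to a small mapping table; the aliases below are only examples of the kind of normalization required:

# Map the raw labels returned by the API to one canonical relation label.
LABEL_ALIASES = {
    '丈夫': '夫妻',
    '妻子': '夫妻',
    '伉俪': '夫妻',
    # ... further aliases collected by hand
}

def normalize_label(raw_label):
    return LABEL_ALIASES.get(raw_label, raw_label)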

  

(1) Training the model: create the train_GRU file and train on the preprocessed .npy files.

  

The training data are as follows:

  

The code is as follows:

  

def main(_):
    # the path to save models
    save_path = './model/'

    print('reading wordembedding')
    wordembedding = np.load('./data/vec.npy')

    print('reading training data')
    train_y = np.load('./data/train_y.npy')
    train_word = np.load('./data/train_word.npy')
    train_pos1 = np.load('./data/train_pos1.npy')
    train_pos2 = np.load('./data/train_pos2.npy')

    settings = network.Settings()
    settings.vocab_size = len(wordembedding)
    settings.num_classes = len(train_y[0])

    big_num = settings.big_num

    with tf.Graph().as_default():
        sess = tf.Session()
        with sess.as_default():
            initializer = tf.contrib.layers.xavier_initializer()
            with tf.variable_scope("model", reuse=None, initializer=initializer):
                m = network.GRU(is_training=True, word_embeddings=wordembedding, settings=settings)
            global_step = tf.Variable(0, name="global_step", trainable=False)
            optimizer = tf.train.AdamOptimizer(0.0005)

            train_op = optimizer.minimize(m.final_loss, global_step=global_step)
            sess.run(tf.global_variables_initializer())
            saver = tf.train.Saver(max_to_keep=None)

            merged_summary = tf.summary.merge_all()
            # FLAGS.summary_dir is defined with tf.app.flags in the full train_GRU.py
            summary_writer = tf.summary.FileWriter(FLAGS.summary_dir + '/train_loss', sess.graph)

            def train_step(word_batch, pos1_batch, pos2_batch, y_batch, big_num):
                feed_dict = {}
                total_shape = []
                total_num = 0
                total_word = []
                total_pos1 = []
                total_pos2 = []
                # flatten the batch of bags into one array and remember the bag boundaries
                for i in range(len(word_batch)):
                    total_shape.append(total_num)
                    total_num += len(word_batch[i])
                    for word in word_batch[i]:
                        total_word.append(word)
                    for pos1 in pos1_batch[i]:
                        total_pos1.append(pos1)
                    for pos2 in pos2_batch[i]:
                        total_pos2.append(pos2)
                total_shape.append(total_num)
                total_shape = np.array(total_shape)
                total_word = np.array(total_word)
                total_pos1 = np.array(total_pos1)
                total_pos2 = np.array(total_pos2)

                feed_dict[m.total_shape] = total_shape
                feed_dict[m.input_word] = total_word
                feed_dict[m.input_pos1] = total_pos1
                feed_dict[m.input_pos2] = total_pos2
                feed_dict[m.input_y] = y_batch

                temp, step, loss, accuracy, summary, l2_loss, final_loss = sess.run(
                    [train_op, global_step, m.total_loss, m.accuracy, merged_summary, m.l2_loss, m.final_loss],
                    feed_dict)
                time_str = datetime.datetime.now().isoformat()
                accuracy = np.reshape(np.array(accuracy), (big_num))
                acc = np.mean(accuracy)
                summary_writer.add_summary(summary, step)

                if step % 50 == 0:
                    tempstr = "{}: step {}, softmax_loss {:g}, acc {:g}".format(time_str, step, loss, acc)
                    print(tempstr)

            for one_epoch in range(settings.num_epochs):
                temp_order = list(range(len(train_word)))
                np.random.shuffle(temp_order)
                for i in range(int(len(temp_order) / float(settings.big_num))):
                    temp_word = []
                    temp_pos1 = []
                    temp_pos2 = []
                    temp_y = []

                    temp_input = temp_order[i * settings.big_num:(i + 1) * settings.big_num]
                    for k in temp_input:
                        temp_word.append(train_word[k])
                        temp_pos1.append(train_pos1[k])
                        temp_pos2.append(train_pos2[k])
                        temp_y.append(train_y[k])
                    num = 0
                    for single_word in temp_word:
                        num += len(single_word)

                    if num > 1500:
                        print('out of range')
                        continue

                    temp_word = np.array(temp_word)
                    temp_pos1 = np.array(temp_pos1)
                    temp_pos2 = np.array(temp_pos2)
                    temp_y = np.array(temp_y)

                    train_step(temp_word, temp_pos1, temp_pos2, temp_y, settings.big_num)

                    current_step = tf.train.global_step(sess, global_step)
                    if current_step > 8000 and current_step % 100 == 0:
                        print('saving model')
                        path = saver.save(sess, save_path + 'ATT_GRU_model', global_step=current_step)
                        tempstr = 'have saved model to ' + path
                        print(tempstr)
The training process:

  

  

(2) Testing the model: the trained model files obtained are as follows:

  

while True:
    # try:
    # BUG: Encoding error if user input directly from command line.
    line = input('请输入中文句子,格式为 "name1 name2 sentence":')

    # Read file from test file
    '''
    infile = open('test.txt', encoding='utf-8')
    line = ''
    for orgline in infile:
        line = orgline.strip()
        break
    infile.close()
    '''

    # word2id, relation2id, id2relation, pos_embed, test_step and test_settings
    # are defined earlier in the full test_GRU.py
    en1, en2, sentence = line.strip().split()
    print("实体1: " + en1)
    print("实体2: " + en2)
    print(sentence)
    relation = 0
    en1pos = sentence.find(en1)
    if en1pos == -1:
        en1pos = 0
    en2pos = sentence.find(en2)
    if en2pos == -1:
        en2pos = 0
    output = []

    # length of sentence is 70
    fixlen = 70
    # max length of position embedding is 60 (-60~+60)
    maxlen = 60

    # Encoding test x
    for i in range(fixlen):
        word = word2id['BLANK']
        rel_e1 = pos_embed(i - en1pos)
        rel_e2 = pos_embed(i - en2pos)
        output.append([word, rel_e1, rel_e2])

    for i in range(min(fixlen, len(sentence))):
        word = 0
        if sentence[i] not in word2id:
            word = word2id['UNK']
        else:
            word = word2id[sentence[i]]
        output[i][0] = word

    test_x = []
    test_x.append([output])

    # Encoding test y
    label = [0 for i in range(len(relation2id))]
    label[0] = 1
    test_y = []
    test_y.append(label)

    test_x = np.array(test_x)
    test_y = np.array(test_y)

    test_word = []
    test_pos1 = []
    test_pos2 = []

    for i in range(len(test_x)):
        word = []
        pos1 = []
        pos2 = []
        for j in test_x[i]:
            temp_word = []
            temp_pos1 = []
            temp_pos2 = []
            for k in j:
                temp_word.append(k[0])
                temp_pos1.append(k[1])
                temp_pos2.append(k[2])
            word.append(temp_word)
            pos1.append(temp_pos1)
            pos2.append(temp_pos2)
        test_word.append(word)
        test_pos1.append(pos1)
        test_pos2.append(pos2)

    test_word = np.array(test_word)
    test_pos1 = np.array(test_pos1)
    test_pos2 = np.array(test_pos2)

    prob, accuracy = test_step(test_word, test_pos1, test_pos2, test_y)
    prob = np.reshape(np.array(prob), (1, test_settings.num_classes))[0]
    print("关系是:")
    top3_id = prob.argsort()[-3:][::-1]
    for n, rel_id in enumerate(top3_id):
        print("No." + str(n + 1) + ": " + id2relation[rel_id] + ", Probability is " + str(prob[rel_id]))
Complete code:

  

Link: https://pan.baidu.com/s/1aY2WOAw9lgG_1I2rk_EPKw

  

Extraction code: noyv

  

About the author:

  

Li Qiujian (李秋键), CSDN blog expert and author of CSDN Daren courses, is a master's student at China University of Mining and Technology. He has developed an Android wuxia game on TapTap, a VIP video parser, a text rewriting tool, a writing robot, and other projects, has published several papers, and has won prizes in advanced mathematics competitions several times.

  
