First published on: Zhihu
Author: djh

1. Simple keyword retrieval with ES
Data preparation:
Convert the text you want to insert into JSON lines (design the structure to fit your own data).
import json

def transform_data2json(input_file, outfile):
    # Read the raw text and write one JSON object per line
    tikuDataFiles = open(input_file)
    tikuDataFilesLines = tikuDataFiles.readlines()
    f_json = open(outfile, "w")
    for i, line in enumerate(tikuDataFilesLines):
        print(i)
        attr = {"titel": line.strip()}
        new_ent_j = json.dumps(attr, ensure_ascii=False)
        f_json.write(new_ent_j + "\n")
    tikuDataFiles.close()
    f_json.close()

transform_data2json("demo.txt", "demo.json")
Insertion:
Create the index mapping, then call the ES bulk API.
import json
import requests

def mappingSetting(index_name, type_name):
    # Create the index with ik analyzers on the "titel" field
    base_url = "http://localhost:9200/" + index_name + "?pretty"
    mapping = json.dumps({
        "mappings": {
            type_name: {
                "properties": {
                    "titel": {
                        "type": "text",
                        "analyzer": "ik_max_word",
                        "search_analyzer": "ik_smart"
                    }
                }
            }
        }
    }, ensure_ascii=False)
    response = requests.post(base_url,
                             headers={"Content-Type": "application/json"},
                             data=mapping.encode("utf-8"))

def bulk_insert(base_url, data):
    # The _bulk endpoint expects newline-delimited JSON
    response = requests.post(base_url,
                             headers={"Content-Type": "application/x-ndjson"},
                             data=data.encode("utf-8"))

def begin_insert_job(index_name, type_name, json_filepath, bulk_size=1000):
    base_url = "http://localhost:9200/" + index_name + "/" + type_name + "/_bulk"
    cnt, es_id = 0, 1
    data = ""
    with open(json_filepath) as f:
        for line in f:
            # One action-metadata line precedes each document line
            action_meta = '{"index": {"_id":"' + str(es_id) + '"}}'
            data = data + action_meta + "\n" + line
            es_id += 1
            cnt += 1
            if cnt >= bulk_size:
                bulk_insert(base_url, data)
                cnt, data = 0, ""
            if not (es_id % bulk_size):
                print(es_id)
    if cnt:
        # Flush the final partial batch
        bulk_insert(base_url, data)

if __name__ == '__main__':
    begin_insert_job("doc", "word", "./demo.json")
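The payload assembled in the loop above interleaves one action-metadata line with one document line, in NDJSON form with a trailing newline. As a standalone illustration of that format (the `build_bulk_body` helper is hypothetical, not part of the original code):

```python
import json

def build_bulk_body(docs, start_id=1):
    """Build an NDJSON _bulk payload: one action line, then one document line."""
    lines = []
    for offset, doc in enumerate(docs):
        lines.append(json.dumps({"index": {"_id": str(start_id + offset)}}))
        lines.append(json.dumps(doc, ensure_ascii=False))
    # The bulk API requires the body to end with a newline
    return "\n".join(lines) + "\n"

body = build_bulk_body([{"titel": "王者荣耀"}, {"titel": "英雄联盟"}])
print(body)
```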
Query:
import requests
import json

def query(base_url, data):
    # Full-text match on the "titel" field
    query_body = json.dumps({"query": {"match": {"titel": data}}}, ensure_ascii=False)
    headers = {"Content-Type": "application/json"}
    response = requests.get(base_url,
                            data=query_body.encode("utf-8"),
                            headers=headers)
    print(response.text)

query("http://localhost:9200/doc/word/_search", "王者荣耀")
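The search response comes back as JSON in the standard ES shape, with matched documents under `hits.hits`. A small sketch of pulling out scores and titles (the `extract_hits` helper and the sample response fragment are illustrative, not from the original code):

```python
def extract_hits(resp_json):
    """Pull (score, titel) pairs out of a standard ES search response."""
    return [(h["_score"], h["_source"]["titel"])
            for h in resp_json["hits"]["hits"]]

# Hypothetical response fragment in the standard ES shape:
sample = {"hits": {"hits": [
    {"_score": 2.3, "_source": {"titel": "王者荣耀"}},
    {"_score": 1.1, "_source": {"titel": "王者农药"}},
]}}
for score, titel in extract_hits(sample):
    print(score, titel)
```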
2. Semantic retrieval with annoy
ES keyword retrieval has little real understanding of semantics. Here we combine annoy with deep-learning embeddings to do semantic retrieval.
Embedding the text
You can use word2vec, a CNN, or word/sentence vectors extracted from BERT, exposed through an encoding interface (`bc.encode` below is such an interface, e.g. a bert-as-service style client):
encodearrary = bc.encode(["王者荣耀"])
print(encodearrary)
# [0.792, -0.177, -0.107, 0.109, -0.542, ...]
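If a BERT encoder is not available, a crude stand-in is to average word vectors. This toy sketch uses made-up 3-dimensional vectors purely to illustrate the idea; real vectors would come from a trained model:

```python
# Toy word2vec-style lookup table (fabricated values, for illustration only)
word_vecs = {
    "王者": [0.9, 0.1, 0.0],
    "荣耀": [0.2, 0.8, 0.1],
}

def embed(text, dim=3):
    """Average the vectors of known words found in the text; zeros if none match."""
    hits = [word_vecs[w] for w in word_vecs if w in text]
    if not hits:
        return [0.0] * dim
    return [sum(vals) / len(hits) for vals in zip(*hits)]

print(embed("王者荣耀"))  # averages the two matched word vectors
```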
Insert the text embeddings into annoy

import os
from annoy import AnnoyIndex

tc_index = AnnoyIndex(768)  # dimensionality must match the encoder output
annFilename = "./tmp/" + inputfilename[-1] + ".ann.index"
if not os.path.exists(annFilename):
    for i, line in enumerate(needSimProcessFilelinesFilter):
        print("encode line " + str(i))
        encodenum = bc.encode([line])
        tc_index.add_item(i, encodenum[0])
    tc_index.build(100)  # 100 trees: more trees, better recall, bigger index
    tc_index.save(annFilename)
else:
    print("An index for this file already exists....")
Similarity search over embeddings with annoy

encodearrary = bc.encode(["王者荣耀"])
print(encodearrary)
tc_index = AnnoyIndex(768)
annFilename = "./tmp/" + inputfilename[-1] + ".ann.index"
tc_index.load(annFilename)
# Top-8 nearest neighbours; items = ([indices], [distances])
items = tc_index.get_nns_by_vector(encodearrary[0], 8, include_distances=True)
for j in range(len(items[0])):
    index = items[0][j]
    indexvalue = items[1][j]  # distance to the query vector
    anchorlabel = needSimProcessFilelinesFilter[index]
    print(str(j) + ' ' + anchorlabel)
王者荣耀
王者农药
王者荣耀代练平台
王者荣耀体验服务
王者荣耀皮肤碎片
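What annoy computes approximately is the exact nearest-neighbour ranking under a vector similarity measure such as cosine. A brute-force baseline (the `exact_nns` helper and the tiny 2-d corpus are hypothetical, for illustration) makes that target explicit:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def exact_nns(query_vec, corpus_vecs, k=2):
    """Exact top-k neighbours by cosine similarity (annoy approximates this)."""
    scored = sorted(enumerate(corpus_vecs),
                    key=lambda iv: cosine(query_vec, iv[1]),
                    reverse=True)
    return [i for i, _ in scored[:k]]

corpus = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(exact_nns([1.0, 0.05], corpus))  # indices of the most similar vectors first
```

Brute force is O(n) per query, which is why annoy trades a little recall for tree-based sublinear lookups.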
For more embedded-AI content, follow the 嵌入式AI column.
-END-