First published on: Zhihu
Author: djh

1. Simple keyword retrieval with ES
Data preparation:
Convert the text you want to insert into JSON lines (design the structure to fit your own data).
import json

def transform_data2json(input_file, outfile):
    # Read the raw text and write one JSON object per line
    tikuDataFiles = open(input_file)
    tikuDataFilesLines = tikuDataFiles.readlines()
    f_json = open(outfile, "w")
    for i, line in enumerate(tikuDataFilesLines):
        print(i)
        attr = {"titel": line.strip()}
        new_ent_j = json.dumps(attr, ensure_ascii=False)
        f_json.write(new_ent_j + "\n")
    tikuDataFiles.close()
    f_json.close()

transform_data2json("demo.txt", "demo.json")
Insertion:
Create the index mapping, then call the ES bulk API.
import json
import requests

def mappingSetting(index_name, type_name):
    # Create the index with ik analyzers on the "titel" field
    base_url = "http://localhost:9200/" + index_name + "?pretty"
    mapping = json.dumps({
        "mappings": {
            type_name: {
                "properties": {
                    "titel": {
                        "type": "text",
                        "analyzer": "ik_max_word",
                        "search_analyzer": "ik_smart"
                    }
                }
            }
        }
    }, ensure_ascii=False)
    response = requests.post(base_url,
                             headers={"Content-Type": "application/json"},
                             data=mapping.encode("utf-8"))

def bulk_insert(base_url, data):
    # The _bulk endpoint expects newline-delimited JSON
    response = requests.post(base_url,
                             headers={"Content-Type": "application/x-ndjson"},
                             data=data.encode("utf-8"))

def begin_insert_job(index_name, type_name, json_filepath, bulk_size=1000):
    base_url = "http://localhost:9200/" + index_name + "/" + type_name + "/_bulk"
    cnt, es_id = 0, 1
    data = ""
    with open(json_filepath) as f:
        for line in f:
            # One action-metadata line precedes each document line
            action_meta = '{"index": {"_id":"' + str(es_id) + '"}}'
            data = data + action_meta + "\n" + line
            es_id += 1
            cnt += 1
            if cnt >= bulk_size:
                bulk_insert(base_url, data)
                cnt, data = 0, ""
            if not (es_id % bulk_size):
                print(es_id)
    if cnt:
        # Flush the final partial batch
        bulk_insert(base_url, data)

if __name__ == '__main__':
    begin_insert_job("doc", "word", "./demo.json")
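The payload assembled in the loop above interleaves one action-metadata line with one document line, in NDJSON form with a trailing newline. As a standalone illustration of that format (the `build_bulk_body` helper is hypothetical, not part of the original code):

```python
import json

def build_bulk_body(docs, start_id=1):
    """Build an NDJSON _bulk payload: one action line, then one document line."""
    lines = []
    for offset, doc in enumerate(docs):
        lines.append(json.dumps({"index": {"_id": str(start_id + offset)}}))
        lines.append(json.dumps(doc, ensure_ascii=False))
    # The bulk API requires the body to end with a newline
    return "\n".join(lines) + "\n"

body = build_bulk_body([{"titel": "王者荣耀"}, {"titel": "英雄联盟"}])
print(body)
```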
Query:
import requests
import json

def query(base_url, data):
    # Full-text match on the "titel" field
    query_body = json.dumps({"query": {"match": {"titel": data}}}, ensure_ascii=False)
    headers = {"Content-Type": "application/json"}
    response = requests.get(base_url,
                            data=query_body.encode("utf-8"),
                            headers=headers)
    print(response.text)

query("http://localhost:9200/doc/word/_search", "王者荣耀")
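The search response comes back as JSON in the standard ES shape, with matched documents under `hits.hits`. A small sketch of pulling out scores and titles (the `extract_hits` helper and the sample response fragment are illustrative, not from the original code):

```python
def extract_hits(resp_json):
    """Pull (score, titel) pairs out of a standard ES search response."""
    return [(h["_score"], h["_source"]["titel"])
            for h in resp_json["hits"]["hits"]]

# Hypothetical response fragment in the standard ES shape:
sample = {"hits": {"hits": [
    {"_score": 2.3, "_source": {"titel": "王者荣耀"}},
    {"_score": 1.1, "_source": {"titel": "王者农药"}},
]}}
for score, titel in extract_hits(sample):
    print(score, titel)
```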
2. Semantic retrieval with annoy
ES keyword retrieval has little real understanding of semantics. Here we combine annoy with deep-learning embeddings to do semantic retrieval.
Embedding the text
You can use word2vec, a CNN, or word/sentence vectors extracted from BERT, exposed through an encoding interface (`bc.encode` below is such an interface, e.g. a bert-as-service style client):
encodearrary = bc.encode(["王者荣耀"])
print(encodearrary)
# [0.792, -0.177, -0.107, 0.109, -0.542, ...]
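If a BERT encoder is not available, a crude stand-in is to average word vectors. This toy sketch uses made-up 3-dimensional vectors purely to illustrate the idea; real vectors would come from a trained model:

```python
# Toy word2vec-style lookup table (fabricated values, for illustration only)
word_vecs = {
    "王者": [0.9, 0.1, 0.0],
    "荣耀": [0.2, 0.8, 0.1],
}

def embed(text, dim=3):
    """Average the vectors of known words found in the text; zeros if none match."""
    hits = [word_vecs[w] for w in word_vecs if w in text]
    if not hits:
        return [0.0] * dim
    return [sum(vals) / len(hits) for vals in zip(*hits)]

print(embed("王者荣耀"))  # averages the two matched word vectors
```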
Insert the text embeddings into annoy

import os
from annoy import AnnoyIndex

tc_index = AnnoyIndex(768)  # dimensionality must match the encoder output
annFilename = "./tmp/" + inputfilename[-1] + ".ann.index"
if not os.path.exists(annFilename):
    for i, line in enumerate(needSimProcessFilelinesFilter):
        print("encode line " + str(i))
        encodenum = bc.encode([line])
        tc_index.add_item(i, encodenum[0])
    tc_index.build(100)  # 100 trees: more trees, better recall, bigger index
    tc_index.save(annFilename)
else:
    print("An index for this file already exists....")
Similarity search over embeddings with annoy

encodearrary = bc.encode(["王者荣耀"])
print(encodearrary)
tc_index = AnnoyIndex(768)
annFilename = "./tmp/" + inputfilename[-1] + ".ann.index"
tc_index.load(annFilename)
# Top-8 nearest neighbours; items = ([indices], [distances])
items = tc_index.get_nns_by_vector(encodearrary[0], 8, include_distances=True)
for j in range(len(items[0])):
    index = items[0][j]
    indexvalue = items[1][j]  # distance to the query vector
    anchorlabel = needSimProcessFilelinesFilter[index]
    print(str(j) + ' ' + anchorlabel)
王者荣耀
王者农药
王者荣耀代练平台
王者荣耀体验服务
王者荣耀皮肤碎片
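What annoy computes approximately is the exact nearest-neighbour ranking under a vector similarity measure such as cosine. A brute-force baseline (the `exact_nns` helper and the tiny 2-d corpus are hypothetical, for illustration) makes that target explicit:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def exact_nns(query_vec, corpus_vecs, k=2):
    """Exact top-k neighbours by cosine similarity (annoy approximates this)."""
    scored = sorted(enumerate(corpus_vecs),
                    key=lambda iv: cosine(query_vec, iv[1]),
                    reverse=True)
    return [i for i, _ in scored[:k]]

corpus = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(exact_nns([1.0, 0.05], corpus))  # indices of the most similar vectors first
```

Brute force is O(n) per query, which is why annoy trades a little recall for tree-based sublinear lookups.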
For more embedded-AI content, follow the 嵌入式AI column.
-END-