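Elasticsearch Series: A Few Advanced Features

Overview

This article introduces the basics of search templates, dynamic mapping templates, highlighting, and geolocation queries.

Standard search templates

Search templates are one of Elasticsearch's advanced features. They let you turn a search into a template and run it by passing in parameters, so you don't rewrite the same query body each time. Wrapping frequently used searches in templates makes them simpler to use.

This is similar to wrapping implementation details behind an interface in code: callers only need to care about the parameters and the response, which improves reuse.

Let's look at the most basic usages.

Parameter substitution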

GET /music/children/_search/template
{
  "source": {
    "query": {
      "match": {
        "{{field}}":"{{value}}"
      }
    }
  },
  "params": {
    "field":"name",
    "value":"bye-bye"
  }
}

After rendering, the search template is equivalent to:

GET /music/children/_search
{
  "query": {
    "match": {
      "name":"bye-bye"
    }
  }
}
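
To make the substitution step concrete, here is a minimal Python sketch of what rendering does with the placeholders. The `render` helper is hypothetical and only handles plain `{{name}}` placeholders; it is a simplified stand-in, not Elasticsearch's actual Mustache engine.

```python
import json
import re


def render(template: str, params: dict) -> str:
    """Replace every {{name}} placeholder with the matching value from params."""
    return re.sub(r"\{\{(\w+)\}\}", lambda m: str(params[m.group(1)]), template)


# The template source from the example above, written as a string.
template = '{"query": {"match": {"{{field}}": "{{value}}"}}}'
params = {"field": "name", "value": "bye-bye"}

# Rendering yields the plain search body shown above.
query = json.loads(render(template, params))
print(query)
```

Keeping the template and the parameters separate is what allows the same template to be reused with different field and value pairs.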

Conditions passed as JSON

Inside a {{#toJson}} block you can pass in a more complex condition as a JSON object:

GET /music/children/_search/template
{
  "source": "{\"query\":{\"match\": {{#toJson}}condition{{/toJson}}}}",
  "params": {
    "condition": {
      "name":"bye-bye"
    }
  }
}

After rendering, the search template is equivalent to:

GET /music/children/_search
{
  "query": {
    "match": {
      "name":"bye-bye"
    }
  }
}
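
The effect of {{#toJson}} can be sketched in Python as serializing the parameter and splicing it into the template string. The `render_to_json` helper below is hypothetical and assumes the block contains only a parameter name, as in the example above.

```python
import json


def render_to_json(template: str, params: dict) -> str:
    """Replace each {{#toJson}}name{{/toJson}} block with the JSON form of params[name]."""
    rendered = template
    for name, value in params.items():
        block = "{{#toJson}}" + name + "{{/toJson}}"
        rendered = rendered.replace(block, json.dumps(value))
    return rendered


template = '{"query": {"match": {{#toJson}}condition{{/toJson}}}}'
params = {"condition": {"name": "bye-bye"}}

rendered = render_to_json(template, params)
query = json.loads(rendered)
print(query)
```

Because the whole object is spliced in, the caller can change the shape of the condition without touching the template.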

The join syntax

The names parameter used inside join can hold multiple values:

GET /music/children/_search/template
{
  "source": {
    "query": {
      "match": {
        "name": "{{#join delimiter=' '}}names{{/join delimiter=' '}}"
      }
    }
  },
  "params": {
    "names":["gymbo","you are my sunshine","bye-bye"]
  }
}

After rendering, the search template is equivalent to:

GET /music/children/_search
{
  "query": {
    "match": {
      "name":"gymbo you are my sunshine bye-bye"
    }
  }
}
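
The join step amounts to concatenating the array values with the delimiter. A minimal Python sketch, where `render_join` is a hypothetical helper used only for illustration:

```python
def render_join(values, delimiter=" "):
    """Concatenate the values with the delimiter, as a join section would."""
    return delimiter.join(str(v) for v in values)


params = {"names": ["gymbo", "you are my sunshine", "bye-bye"]}

# The joined string becomes the text passed to the match query.
query = {"query": {"match": {"name": render_join(params["names"])}}}
print(query["query"]["match"]["name"])
```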

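Default values in search templates

A search template can define default values. For example, {{^end}}500{{/end}} means that if the end parameter is not supplied, 500 is used: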

GET /music/children/_search/template
{
  "source":{
    "query":{
      "range":{
        "likes":{
          "gte":"{{start}}",
          "lte":"{{end}}{{^end}}500{{/end}}"
        }
      }
    }
  },
  "params": {
    "start":1,
    "end":300
  }
}

After rendering, the search template is equivalent to:

GET /music/children/_search
{
  "query": {
    "range": {
      "likes": {
        "gte": "1",
        "lte": "300"
      }
    }
  }
}
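
The fallback behaviour of `{{end}}{{^end}}500{{/end}}` can be sketched in Python as follows. The `render_lte` helper is hypothetical and only covers the case discussed here, where the parameter is either supplied or missing.

```python
def render_lte(params: dict, default: int = 500) -> str:
    """Return the text the template would produce for the lte bound."""
    end = params.get("end")
    # {{end}} renders the value when present; the inverted section
    # {{^end}}...{{/end}} renders the default only when it is missing.
    return str(default) if end is None else str(end)


with_end = render_lte({"start": 1, "end": 300})
without_end = render_lte({"start": 1})
print(with_end, without_end)
```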

Conditionals

Mustache has no if/else statement, but you can define a section that is skipped when the variable is false or undefined:

{{#param1}}
    "This section is skipped if param1 is null or false"
{{/param1}}

Example: create a stored Mustache script

POST _scripts/condition
{
  "script": {
    "lang": "mustache",
    "source": 
    """
        {
            "query": {
              "bool": {
                "must": {
                  "match": {
                    "name": "{{name}}"
                  }
                },
                "filter":{
                  {{#isLike}}
                    "range":{
                      "likes":{
                        {{#start}}
                          "gte":"{{start}}"
                          {{#end}},{{/end}}
                        {{/start}}
                        {{#end}}
                          "lte":"{{end}}"
                        {{/end}}
                      }
                    }
                  {{/isLike}}
                }
              }
            }
        }
    """
  }
}

Query using the stored Mustache template:

GET _search/template
{
    "id": "condition", 
    "params": {
      "name":"gymbo",
      "isLike":true,
      "start":1,
      "end":500
    }
}
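
To see which query the conditional sections produce for a given set of parameters, the same branching can be written out in Python. The `build_query` function is a hypothetical illustration that mirrors the template above; it treats a missing parameter as the "section skipped" case and does not model every Mustache truthiness rule.

```python
def build_query(name, is_like=False, start=None, end=None):
    """Build the query body that the stored template would render."""
    filter_clause = {}
    if is_like:  # corresponds to the {{#isLike}} section
        likes = {}
        if start is not None:  # {{#start}} section
            likes["gte"] = str(start)
        if end is not None:  # {{#end}} section
            likes["lte"] = str(end)
        filter_clause["range"] = {"likes": likes}
    return {
        "query": {
            "bool": {
                "must": {"match": {"name": name}},
                "filter": filter_clause,
            }
        }
    }


full = build_query("gymbo", is_like=True, start=1, end=500)
bare = build_query("gymbo")
print(full["query"]["bool"]["filter"])
```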

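Those are the common search template usages. In large projects with dedicated Elasticsearch engineers, common functionality is often turned into templates, and developers of business systems only need to use them.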

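Custom mapping templates

Elasticsearch applies its own rules to map the types of incoming data: 10 is mapped to long, and "10" is mapped to text with a built-in keyword sub-field. This is convenient, but sometimes these are not the types we want. For example, we may want the integer value 10 to be mapped as integer and "10" as keyword. In that case we can define a template in advance; when data is inserted, new fields are matched against the predefined rules to decide their types.

Note that in real projects coding standards are usually stricter: mappings for all documents are defined before data is inserted, and even for a field added later, the mapping is updated first and the data inserted afterwards.

Still, custom dynamic mapping templates are worth knowing.

Default dynamic mapping behavior

Try inserting a document: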

PUT /test_index/type/1
{
  "test_string":"hello kitty",
  "test_number":10
}

Check the mapping:

GET /test_index/_mapping/type

The response is:

{
  "test_index": {
    "mappings": {
      "type": {
        "properties": {
          "test_number": {
            "type": "long"
          },
          "test_string": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          }
        }
      }
    }
  }
}

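The default dynamic mapping rules may not be what we want.

For example, we may want numbers to default to integer, and strings to default to text with a sub-field named raw instead of keyword that keeps up to 128 characters.

Dynamic mapping templates

There are two approaches:

Match on the data type that Elasticsearch detects for a newly added field, and apply a predefined template for that type.
Match on the name of a newly added field, either against a predefined name or a wildcard pattern, and apply the corresponding template.

Matching by data type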

PUT /test_index
{
  "mappings": {
    "type": {
      "dynamic_templates": [
        {
          "integers" : {
            "match_mapping_type": "long",
            "mapping": {
              "type":"integer"
            }
          }
        },
        {
          "strings" : {
            "match_mapping_type": "string",
            "mapping": {
              "type":"text",
              "fields": {
                "raw": {
                  "type": "keyword",
                  "ignore_above": 128
                }
              }
            }
          }
        }
      ]
    }
  }
}

Delete the index, recreate it with the template above, insert the data again, and check the mapping:

{
  "test_index": {
    "mappings": {
      "type": {
        "dynamic_templates": [
          {
            "integers": {
              "match_mapping_type": "long",
              "mapping": {
                "type": "integer"
              }
            }
          },
          {
            "strings": {
              "match_mapping_type": "string",
              "mapping": {
                "fields": {
                  "raw": {
                    "ignore_above": 128,
                    "type": "keyword"
                  }
                },
                "type": "text"
              }
            }
          }
        ],
        "properties": {
          "test_number": {
            "type": "integer"
          },
          "test_string": {
            "type": "text",
            "fields": {
              "raw": {
                "type": "keyword",
                "ignore_above": 128
              }
            }
          }
        }
      }
    }
  }
}

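The fields are mapped to the intended types, as expected.

Matching by field name
Fields whose names start with "long_" and are detected as long are mapped to integer.
Fields whose names start with "string_" and are detected as string are mapped to text with a raw keyword sub-field.
Fields whose names end with "_text" and are detected as string are left unchanged.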

PUT /test_index
{
  "mappings": {
    "type": {
      "dynamic_templates":[
       {
         "long_as_integer": {
           "match_mapping_type":"long",
           "match": "long_*",
           "mapping":{
             "type":"integer"
           }
         }
       },
       {
         "string_as_raw": {
           "match_mapping_type":"string",
           "match": "string_*",
           "unmatch":"*_text",
           "mapping": {
              "type":"text",
              "fields": {
                "raw": {
                  "type": "keyword",
                  "ignore_above": 128
                }
              }
            }
         }
       }
      ]
    }
  }
}

Insert a document:

PUT /test_index/type/1
{
  "string_test":"hello kitty",
  "long_test": 10,
  "title_text":"Hello everyone"
}

Check the mapping:

{
  "test_index": {
    "mappings": {
      "type": {
        "dynamic_templates": [
          {
            "long_as_integer": {
              "match": "long_*",
              "match_mapping_type": "long",
              "mapping": {
                "type": "integer"
              }
            }
          },
          {
            "string_as_raw": {
              "match": "string_*",
              "unmatch": "*_text",
              "match_mapping_type": "string",
              "mapping": {
                "fields": {
                  "raw": {
                    "ignore_above": 128,
                    "type": "keyword"
                  }
                },
                "type": "text"
              }
            }
          }
        ],
        "properties": {
          "long_test": {
            "type": "integer"
          },
          "string_test": {
            "type": "text",
            "fields": {
              "raw": {
                "type": "keyword",
                "ignore_above": 128
              }
            }
          },
          "title_text": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          }
        }
      }
    }
  }
}

The result is as expected.
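
The way a new field is checked against match_mapping_type, match, and unmatch can be sketched in Python. This is a simplified model under stated assumptions: `pick_template` is a hypothetical helper, the first matching template wins, and wildcard patterns are evaluated with the standard library's fnmatch rather than Elasticsearch's own matcher.

```python
from fnmatch import fnmatchcase

# The two rules from the name-based example above (mappings omitted for brevity).
DYNAMIC_TEMPLATES = [
    {"long_as_integer": {"match_mapping_type": "long", "match": "long_*"}},
    {"string_as_raw": {"match_mapping_type": "string", "match": "string_*",
                       "unmatch": "*_text"}},
]


def pick_template(field_name, detected_type):
    """Return the name of the first template that applies, or None."""
    for entry in DYNAMIC_TEMPLATES:
        for name, rule in entry.items():
            if rule.get("match_mapping_type", detected_type) != detected_type:
                continue  # the detected type does not match
            if "match" in rule and not fnmatchcase(field_name, rule["match"]):
                continue  # the name does not match the required pattern
            if "unmatch" in rule and fnmatchcase(field_name, rule["unmatch"]):
                continue  # the name is explicitly excluded
            return name
    return None  # no template applies; default dynamic mapping is used


for field, kind in [("string_test", "string"), ("long_test", "long"),
                    ("title_text", "string")]:
    print(field, "->", pick_template(field, kind))
```

In this model title_text matches no template, which is consistent with it keeping the default text plus keyword mapping in the response above.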

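In some log-management scenarios, we define the type up front and create one index per day by date. Mapping templates fit this kind of index creation well: all the mapping rules we defined can be built in.

Highlighting

When you search for text in a browser, the keywords you typed are highlighted, and the page's HTML source shows that the highlighted parts are wrapped in tags. Elasticsearch supports highlighting too: it wraps matched terms in tags (<em> by default) in the returned documents, which works directly in HTML5 pages.

Basic highlight syntax

Using the music site as the example again, let's run a highlighted search: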

GET /music/children/_search 
{
  "query": {
    "match": {
      "content": "love"
    }
  },
  "highlight": {
    "fields": {
      "content": {}
    }
  }
}

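The highlight section holds the highlighting options; here it names content as the field to highlight. In the response, the matched word love is wrapped in <em> tags, which a browser renders as emphasized text (italic by default; any color comes from your own CSS). In other words, if a field listed under highlight contains the search term, the term is marked up within that field's text.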

{
  "took": 35,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.2876821,
    "hits": [
      {
        "_index": "music",
        "_type": "children",
        "_id": "5",
        "_score": 0.2876821,
        "_source": {
          "id": "1740e61c-63da-474f-9058-c2ab3c4f0b0a",
          "author_first_name": "Jean",
          "author_last_name": "Ritchie",
          "author": "Jean Ritchie",
          "name": "love somebody",
          "content": "love somebody, yes I do",
          "language": "english",
          "tags": "love",
          "length": 38,
          "likes": 3,
          "isRelease": true,
          "releaseDate": "2019-12-22"
        },
        "highlight": {
          "content": [
            "<em>love</em> somebody, yes I do"
          ]
        }
      }
    ]
  }
}
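
On the client side, the highlight section of each hit can be processed like any other part of the response. As a minimal sketch, the hypothetical `highlighted_terms` helper below pulls the marked terms out of the fragments; it assumes the tags are the ones used in the request (the default <em> here) and that they are not nested.

```python
import re


def highlighted_terms(hit, pre_tag="<em>", post_tag="</em>"):
    """Collect the text between highlight tags for every highlighted field."""
    pattern = re.compile(re.escape(pre_tag) + r"(.*?)" + re.escape(post_tag))
    terms = {}
    for field, fragments in hit.get("highlight", {}).items():
        terms[field] = [t for fragment in fragments for t in pattern.findall(fragment)]
    return terms


# The relevant part of the hit from the response above.
hit = {"highlight": {"content": ["<em>love</em> somebody, yes I do"]}}
terms = highlighted_terms(hit)
print(terms)
```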

You can list several fields under highlight so that matched keywords are highlighted in each of them, for example:

GET /music/children/_search 
{
  "query": {
    "match": {
      "content": "love"
    }
  },
  "highlight": {
    "fields": {
      "name":{},
      "content": {}
    }
  }
}

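Three highlighters

There are three highlighters:

plain highlighter: uses the standard Lucene highlighter and handles simple queries well.
unified highlighter: the default; uses the Lucene Unified Highlighter, which breaks the text into sentences and scores them with BM25, and supports both exact and fuzzy queries.
fast vector highlighter: uses the Lucene Fast Vector Highlighter. It is powerful, requires term_vector to be set to with_positions_offsets on the field in the mapping, and has a performance advantage on very large field values (over 1 MB).

For example: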

PUT /music
{
  "mappings": {
    "children": {
      "properties": {
        "name": {
          "type": "text",
          "analyzer": "ik_max_word"
        },
        "content": {
          "type": "text",
          "analyzer": "ik_max_word",
          "term_vector" : "with_positions_offsets"
        }
      }
    }
  }
}

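In most cases the plain highlighter is enough and needs no extra configuration.
If highlighting performance matters a lot, try the unified highlighter.
If field values are very large (over 1 MB), use the fast vector highlighter.

Custom highlight HTML tags

The default highlight tag is <em>. You can replace it with your own tags and style them however you like: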

GET /music/children/_search 
{
  "query": {
    "match": {
      "content": "Love"
    }
  },
  "highlight": {
    "pre_tags": ["<tag1>"],
    "post_tags": ["</tag1>"],
    "fields": {
      "content": {
        "type": "plain"
      }
    }
  }
}

Highlight fragment settings

Very long text cannot be shown in full on a page; we only need to show the context around the keyword. The fragment settings handle this:

GET /_search
{
    "query" : {
        "match": { "content": "friend" }
    },
    "highlight" : {
        "fields" : {
            "content" : {"fragment_size" : 150, "number_of_fragments" : 3, "no_match_size": 150 }
        }
    }
}

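fragment_size: the length, in characters, of each highlighted fragment; the default is 100.

number_of_fragments: a field may produce several highlighted fragments; this sets how many of them are returned.

no_match_size: the number of characters to return from the beginning of the field when there is no matching fragment to highlight.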

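Geolocation

Location-based apps keep appearing, and many components support geolocation. Elasticsearch is no exception, and it can combine geolocation with full-text search, structured search, and analytics. Let's take a look.

The geo_point data type

For location-based search, Elasticsearch provides a dedicated geo_point field type that stores a location (latitude and longitude), along with some basic queries such as geo_bounding_box.

Create a mapping with a geo_point field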

PUT /location
{
  "mappings": {
    "hotels": {
      "properties": {
        "location": {
          "type": "geo_point"
        },
        "content": {
          "type": "text"
        }
      }
    }
  }
}

Insert data

The following format is recommended:

# lat is the latitude, lon is the longitude
PUT /location/hotels/1
{
  "content":"7days hotel",
  "location": {
    "lon": 113.928619,
    "lat": 22.528091
  }
}

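There are two other formats, but they make it very easy to mix up latitude and longitude, so they are not recommended:

# In the array form, longitude comes first, then latitude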
PUT /location/hotels/2
{
  "content":"7days hotel ",
  "location": [113.923567,22.523988]
}

# In the string form, latitude comes first, then longitude
PUT /location/hotels/3
{
  "content": "7days hotel Orient Sunseed Hotel",
  "location": "22.521184, 113.914578" 
}
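
Since the three formats order the coordinates differently, it can help to normalize them in one place before sending data. The sketch below uses a hypothetical `to_lat_lon` helper that covers only the three formats shown above (object, array, and "lat, lon" string).

```python
def to_lat_lon(value):
    """Normalize a geo_point value to a (lat, lon) tuple of floats."""
    if isinstance(value, dict):
        # Object form: keys are explicit, so order does not matter.
        return float(value["lat"]), float(value["lon"])
    if isinstance(value, (list, tuple)):
        # Array form: longitude first, then latitude.
        lon, lat = value
        return float(lat), float(lon)
    if isinstance(value, str):
        # String form: latitude first, then longitude.
        lat, lon = (float(part) for part in value.split(","))
        return lat, lon
    raise TypeError(f"unsupported geo_point value: {value!r}")


print(to_lat_lon({"lon": 113.928619, "lat": 22.528091}))
print(to_lat_lon([113.923567, 22.523988]))
print(to_lat_lon("22.521184, 113.914578"))
```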

Query methods

The geo_bounding_box query finds points that fall within a rectangular area:

GET /location/hotels/_search
{
  "query": {
     "geo_bounding_box": {
      "location": {
        "top_left":{
          "lon": 112,
          "lat": 23
        },
        "bottom_right":{
          "lon": 114,
          "lat": 21
        }
      }
    } 
  }
}
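
The containment test behind a bounding box can be sketched in a few lines of Python. The `in_bounding_box` helper is hypothetical, and this simplified version assumes the box does not cross the 180° meridian.

```python
def in_bounding_box(lat, lon, top_left, bottom_right):
    """Return True if the point lies inside the rectangle (edges included)."""
    lat_ok = bottom_right["lat"] <= lat <= top_left["lat"]
    lon_ok = top_left["lon"] <= lon <= bottom_right["lon"]
    return lat_ok and lon_ok


top_left = {"lon": 112, "lat": 23}
bottom_right = {"lon": 114, "lat": 21}

# Hotel 1 from the sample data falls inside the box.
hotel_1_inside = in_bounding_box(22.528091, 113.928619, top_left, bottom_right)
print(hotel_1_inside)
```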

Common query scenarios

Using geo_bounding_box

GET /location/hotels/_search
{
  "query": {
    "bool": {
      "must": [
        {"match_all": {}}
      ],
      "filter": {
        "geo_bounding_box": {
          "location": {
            "top_left":{
              "lon": 112,
              "lat": 23
            },
            "bottom_right":{
              "lon": 114,
              "lat": 21
            }
          }
        }
      }
    }
  }
}

Using geo_polygon, here with a polygon (a triangle) made of three points

Polygons are supported, but this filter is expensive, so use it sparingly.

GET /location/hotels/_search
{
  "query": {
    "bool": {
      "must": [
        {"match_all": {}}
      ],
      "filter": {
        "geo_polygon": {
          "location": {
            "points": [
              {"lon": 115,"lat": 23},
              {"lon": 113,"lat": 25},
              {"lon": 112,"lat": 21}
            ]
          }
        }
      }
    }
  }
}
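
A polygon test does more work per point than a rectangle test, since every edge has to be examined. The sketch below uses the classic ray-casting method on flat coordinates to illustrate the idea; `in_polygon` is a hypothetical helper, and this is not a description of how Elasticsearch implements the filter.

```python
def in_polygon(lat, lon, points):
    """Ray casting: count how many polygon edges lie east of the point."""
    inside = False
    n = len(points)
    for i in range(n):
        a, b = points[i], points[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed.
        if (a["lat"] > lat) != (b["lat"] > lat):
            # Longitude at which the edge crosses that latitude.
            cross_lon = a["lon"] + (lat - a["lat"]) * (b["lon"] - a["lon"]) / (b["lat"] - a["lat"])
            if lon < cross_lon:
                inside = not inside  # an odd number of crossings means inside
    return inside


triangle = [
    {"lon": 115, "lat": 23},
    {"lon": 113, "lat": 25},
    {"lon": 112, "lat": 21},
]

hotel_1_inside = in_polygon(22.528091, 113.928619, triangle)
print(hotel_1_inside)
```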

Using geo_distance

Searching by distance from the current position is very practical.

GET /location/hotels/_search
{
  "query": {
    "bool": {
      "must": [
        {"match_all": {}}
      ],
      "filter": {
        "geo_distance": {
          "distance": 500, 
          "location": {
            "lon": 113.911231,
            "lat": 22.523375
          }
        }
      }
    }
  }
}

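Sorting by distance

A location-based search usually sets an upper bound on distance, such as 2 km or 5 km, shows each matching result's distance from the current position (in a unit you choose), and sorts results from nearest to farthest. This is a very common scenario.

Example request: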

GET /location/hotels/_search
{
  "query": {
    "bool": {
      "must": [
        {"match_all": {}}
      ],
      "filter": {
        "geo_distance": {
          "distance": 2000, 
          "location": {
            "lon": 113.911231,
            "lat": 22.523375
          }
        }
      }
    }
  },
  "sort": [
    {
      "_geo_distance": {
        "location": { 
          "lon": 113.911231,
          "lat": 22.523375
        },
        "order":         "asc",
        "unit":          "m", 
        "distance_type": "plane" 
      }
    }
  ]
}

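filter.geo_distance.distance: the maximum distance, 2000 m here
_geo_distance: a fixed keyword; the value under it is the latitude and longitude of the reference point
order: the sort order, asc or desc
unit: the unit for the returned distance, such as m or km
distance_type: how the distance is computed: arc (accurate) or plane (fastest); older versions also offered sloppy_arc and used it as the default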
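
To illustrate the filter-then-sort flow and the difference between the arc and plane calculations, here is a Python sketch over the three sample hotels. The haversine formula stands in for arc and an equirectangular approximation for plane; both functions, the Earth radius constant, and the `nearby` helper are assumptions of this sketch, not Elasticsearch's internal code.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius, assumed for this sketch


def arc_distance(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres (haversine formula)."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = p2 - p1
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def plane_distance(lat1, lon1, lat2, lon2):
    """Flat-plane approximation in metres; cheaper, less exact over long ranges."""
    x = math.radians(lon2 - lon1) * math.cos(math.radians((lat1 + lat2) / 2))
    y = math.radians(lat2 - lat1)
    return EARTH_RADIUS_M * math.hypot(x, y)


# (lat, lon) of the three sample hotels, keyed by document id.
HOTELS = {
    1: (22.528091, 113.928619),
    2: (22.523988, 113.923567),
    3: (22.521184, 113.914578),
}


def nearby(lat, lon, max_m, distance=plane_distance):
    """Ids of hotels within max_m metres, nearest first."""
    hits = [(distance(lat, lon, hlat, hlon), hid) for hid, (hlat, hlon) in HOTELS.items()]
    return [hid for d, hid in sorted(hits) if d <= max_m]


print(nearby(22.523375, 113.911231, 2000))
print(nearby(22.523375, 113.911231, 500))
```

Over a couple of kilometres the two calculations agree closely, which is why plane is a reasonable choice for "nearby" searches like this one.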