Thoughts on using the Elasticsearch synonym filter

ES synonym filter

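To broaden recall, one effective approach is to add synonyms. Synonyms widen the search scope, but they also bring two problems:

term query: the original term should score higher than its synonyms

# In the results, the original term and its synonyms carry the same weight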
GET learning_test_03/_search
{
  "_source": "post_title", 
  "explain": true, 
  "query": {
    "term": {
      "post_title.jieba_dic_all_synonym": {
        "value": "视图"
      }
    }
  }
}

match_phrase has the same problem: the synonyms should score lower than the original term

GET learning_test_03/_search
{
  "_source": "post_title",
  "explain": true, 
  "query": {
      "match_phrase": {
          "post_title.jieba_dic_all_synonym": "插入流程图"
      }
  }
}

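# result:
#  "description" : """weight(post_title.jieba_dic_all_synonym:"插入 (流程图 visio 略图 视图)" in 20533) [PerFieldSimilarity], result of:"""
# As the explanation shows, 流程图 and its synonyms (visio, 略图, 视图) are treated identically. In a query, however, the original term should usually rank above the expanded synonyms.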

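How synonyms interfere with scoring

Using an analyzer that contains a synonym filter:

The official documentation provides the synonym filter and illustrates it with an index-time example. After some investigation, though, my conclusion is that an analyzer containing a synonym filter is better suited to search than to index, for these reasons:

Synonyms increase the number of terms in a field (which raises the avgdl scoring parameter). More importantly, with a match query the matched termFreq grows to the number of synonyms, which distorts the score.
If the synonyms change, every document that involves those synonyms has to be updated as well.
When a query matches both the original term and its synonyms, the original term should usually score higher, yet ES treats them all alike. One workaround is to define two fields, one analyzed with plain tokenization and one with synonyms, and to give the plainly tokenized field a higher weight at search time. That, however, increases the storage needed for terms.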
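
The two-field workaround mentioned above could be queried roughly as follows. This is only a sketch: the subfield name `title.synonym`, the helper `boosted_synonym_query`, and the boost value 2.0 are assumptions made for illustration and do not come from the demo index. The function just builds a search body in which the plainly tokenized field outweighs the synonym field.

```python
import json


def boosted_synonym_query(text, exact_field="title",
                          synonym_field="title.synonym", exact_boost=2.0):
    """Build a bool query that scores exact matches above synonym matches.

    exact_field is assumed to be indexed without synonyms, synonym_field
    with them; both names and the boost are hypothetical.
    """
    return {
        "query": {
            "bool": {
                "should": [
                    # Plain-tokenized field: boosted so the original term wins.
                    {"match": {exact_field: {"query": text, "boost": exact_boost}}},
                    # Synonym-expanded field: default boost, used for recall.
                    {"match": {synonym_field: {"query": text}}},
                ]
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(boosted_synonym_query("学校"), ensure_ascii=False, indent=2))
```

The printed body can be pasted after a `GET <index>/_search` line in the console. The cost noted above still applies, since both fields are stored in the index.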

A demo to illustrate:

Contents of the synonym file:
工作,简历,招聘,入职
学校,老师,学生,操场
医院,护士,医生

PUT /test_synonym_1
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "jieba_synonym": {
            "tokenizer": "jieba_search",
            "filter": [
              "synonym"
            ]
          }
        },
        "filter": {
          "synonym": {
            "type": "synonym",
            "synonyms_path": "synonyms/synonyms.txt"
          }
        }
      }
    }
  }
}

PUT test_synonym_1/_mapping
{
  "properties": {
    "title": {
      "type": "text",
      "analyzer" :"jieba_synonym"
    },
    "content" :{
      "type" :"text",
      "analyzer" :"jieba_search" 
    }
  }
}

POST test_synonym_1/_doc/1
{
  "title" :"插入流程图时怎么编辑工作",
  "content":"插入流程图时怎么编辑工作"
}

POST test_synonym_1/_doc/2
{
  "title" :"怎么自定义功能区的学校",
  "content":"怎么自定义功能区的学校"
}
POST test_synonym_1/_doc/3
{
  "title" :"如何在表格中加医院",
  "content":"如何在表格中加医院"
}
POST test_synonym_1/_doc/4
{
  "title" :"首页怎么关闭?",
  "content":"首页怎么关闭?"
}
POST test_synonym_1/_doc/5
{
  "title" :"修改的关系图怎么做成一整个图",
  "content":"修改的关系图怎么做成一整个图"
}
POST test_synonym_1/_doc/6
{
  "title" :"在哪里给文档命名",
  "content":"在哪里给文档命名"
}

GET test_synonym_1/_search
{
  "explain": true,
  "query": {
    "match": {
      "title": {
        "query": "表格学生",
        "analyzer": "jieba_synonym"
      }
    }
  }
}

# result
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 2.7425265,
    "hits" : [
      {
        "_shard" : "[test_synonym_1][0]",
        "_node" : "EyNKn90XS1Otize_1yE7-w",
        "_index" : "test_synonym_1",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 2.7425265,
        "_source" : {
          "title" : "怎么自定义功能区的学校",
          "content" : "怎么自定义功能区的学校"
        },
        "_explanation" : {
          "value" : 2.7425265,
          "description" : "sum of:",
          "details" : [
            {
              "value" : 2.7425265,
              "description" : "weight(Synonym(title:学校 title:学生 title:操场 title:老师) in 0) [PerFieldSimilarity], result of:",
              "details" : [
                {
                  "value" : 2.7425265,
                  "description" : "score(freq=4.0), product of:",
                  "details" : [
                    {
                      "value" : 2.2,
                      "description" : "boost",
                      "details" : [ ]
                    },
                    {
                      "value" : 1.5404451,
                      "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                      "details" : [
                        {
                          "value" : 1,
                          "description" : "n, number of documents containing term",
                          "details" : [ ]
                        },
                        {
                          "value" : 6,
                          "description" : "N, total number of documents with field",
                          "details" : [ ]
                        }
                      ]
                    },
                    {
                      "value" : 0.80924857,
                      "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                      "details" : [
                        {
                          "value" : 4.0,
                          "description" : "termFreq=4.0",
                          "details" : [ ]
                        },
                        {
                          "value" : 1.2,
                          "description" : "k1, term saturation parameter",
                          "details" : [ ]
                        },
                        {
                          "value" : 0.75,
                          "description" : "b, length normalization parameter",
                          "details" : [ ]
                        },
                        {
                          "value" : 5.0,
                          "description" : "dl, length of field",
                          "details" : [ ]
                        },
                        {
                          "value" : 7.0,
                          "description" : "avgdl, average length of field",
                          "details" : [ ]
                        }
                      ]
                    }
                  ]
                }
              ]
            }
          ]
        }
      },
      {
        "_shard" : "[test_synonym_1][0]",
        "_node" : "EyNKn90XS1Otize_1yE7-w",
        "_index" : "test_synonym_1",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 1.7443275,
        "_source" : {
          "title" : "如何在表格中加医院",
          "content" : "如何在表格中加医院"
        },
        "_explanation" : {
          "value" : 1.7443275,
          "description" : "sum of:",
          "details" : [
            {
              "value" : 1.7443275,
              "description" : "weight(title:表格 in 0) [PerFieldSimilarity], result of:",
              "details" : [
                {
                  "value" : 1.7443275,
                  "description" : "score(freq=1.0), product of:",
                  "details" : [
                    {
                      "value" : 2.2,
                      "description" : "boost",
                      "details" : [ ]
                    },
                    {
                      "value" : 1.5404451,
                      "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                      "details" : [
                        {
                          "value" : 1,
                          "description" : "n, number of documents containing term",
                          "details" : [ ]
                        },
                        {
                          "value" : 6,
                          "description" : "N, total number of documents with field",
                          "details" : [ ]
                        }
                      ]
                    },
                    {
                      "value" : 0.5147059,
                      "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                      "details" : [
                        {
                          "value" : 1.0,
                          "description" : "freq, occurrences of term within document",
                          "details" : [ ]
                        },
                        {
                          "value" : 1.2,
                          "description" : "k1, term saturation parameter",
                          "details" : [ ]
                        },
                        {
                          "value" : 0.75,
                          "description" : "b, length normalization parameter",
                          "details" : [ ]
                        },
                        {
                          "value" : 5.0,
                          "description" : "dl, length of field",
                          "details" : [ ]
                        },
                        {
                          "value" : 7.0,
                          "description" : "avgdl, average length of field",
                          "details" : [ ]
                        }
                      ]
                    }
                  ]
                }
              ]
            }
          ]
        }
      }
    ]
  }
}


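# The first hit scores higher, and its termFreq=4.0 is exactly the number of synonyms in the group. This is because the synonym filter was also applied at search time and expanded the original query, so the frequencies of all four synonym terms indexed in the document are added together.
# A term query does not have this problem, because a term query does not run the search text through an analyzer. Still, there is no way to guarantee that an exact match on the original term scores higher than a match on one of its synonyms. For example, for the query 学校 the results may be (一个学生, 一个拥有大量面积的学校) rather than the exact match ranked first.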
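
To make the effect of termFreq concrete, the two scores in the explain output can be recomputed by hand. The sketch below plugs the values shown in the output (boost, n, N, dl, avgdl, k1, b) into the idf and tf formulas quoted in the `description` fields. The function names are my own, and the boost of 2.2 is simply taken from the output as-is.

```python
import math


def bm25_idf(n, N):
    """idf as printed in the explain output: log(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (N - n + 0.5) / (n + 0.5))


def bm25_tf(freq, dl, avgdl, k1=1.2, b=0.75):
    """tf as printed in the explain output: freq / (freq + k1 * (1 - b + b * dl / avgdl))."""
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))


def bm25_score(freq, n, N, dl, avgdl, boost=2.2):
    """Product of boost, idf and tf, matching the 'product of' node in the output."""
    return boost * bm25_idf(n, N) * bm25_tf(freq, dl, avgdl)


# Document 2: matched through the synonym group, termFreq=4.0
synonym_hit = bm25_score(freq=4.0, n=1, N=6, dl=5.0, avgdl=7.0)
# Document 3: matched on the literal term 表格, freq=1.0
exact_hit = bm25_score(freq=1.0, n=1, N=6, dl=5.0, avgdl=7.0)

if __name__ == "__main__":
    print(synonym_hit, exact_hit)
```

Both documents share the same idf and length normalization, so the whole gap between the two scores comes from freq being 4 instead of 1.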

Thoughts on solving the problems above:

Do not use an analyzer that contains a synonym filter for indexing; use it for querying.

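# analyzer takes effect when data is indexed
# search_analyzer takes effect at query time; if it is not set, it defaults to analyzer
# note: the analyzer of an existing field cannot be changed in place, so apply this mapping to a freshly created index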
PUT test_synonym_1/_mapping
{
  "properties": {
    "title": {
      "type": "text",
      "analyzer": "jieba_search",
      "search_analyzer": "jieba_synonym"
    },
    "content" :{
      "type" :"text",
      "analyzer" :"jieba_search" 
    }
  }
}

# A more flexible option is to specify the analyzer in the match query itself
GET test_synonym_1/_search
{
  "explain": true,
  "query": {
    "match": {
      "title": {
        "query": "表格学生",
        "analyzer": "jieba_synonym"
      }
    }
  }
}
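
Why search-time-only synonyms avoid the inflated termFreq can be shown with a toy model. The sketch below is not how Lucene is implemented. It mimics a synonym filter by replacing a token with its whole group, then pools the frequency of every query term the way the `Synonym(...)` node in the explain output does. The token list for the document is a hypothetical tokenization, and `expand` and `summed_freq` are illustrative names.

```python
# Synonym groups copied from the demo synonym file.
SYNONYM_GROUPS = [
    ["工作", "简历", "招聘", "入职"],
    ["学校", "老师", "学生", "操场"],
    ["医院", "护士", "医生"],
]


def expand(tokens):
    """Mimic a synonym filter: a token found in a group becomes the whole group."""
    out = []
    for tok in tokens:
        group = next((g for g in SYNONYM_GROUPS if tok in g), [tok])
        out.extend(group)
    return out


def summed_freq(index_tokens, query_tokens):
    """Count indexed tokens that equal any query term (pooled frequency)."""
    terms = set(query_tokens)
    return sum(1 for t in index_tokens if t in terms)


doc = ["怎么", "自定义", "功能区", "的", "学校"]  # hypothetical tokenization
query = expand(["学生"])                           # synonyms applied at search time

# Synonyms at index time and search time: every group member is indexed.
index_time = summed_freq(expand(doc), query)
# Synonyms at search time only: the document keeps its single original term.
search_time_only = summed_freq(doc, query)

if __name__ == "__main__":
    print(index_time, search_time_only)
```

In this model the document still matches the query 学生 when synonyms are applied only at search time, but its pooled frequency stays at the number of terms that actually occur in the text.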

Finally, if the synonym filter supports a remote dictionary, updates to the remote dictionary take effect at search time automatically.

# Search with an analyzer assembled from a synonym filter backed by a remote dictionary
PUT /test_synonym_1
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "jieba_synonym": {
            "tokenizer": "jieba_search",
            "filter": [
              "remote_synonym"
            ]
          }
        },
        "filter": {
          "remote_synonym": {
            "type": "dynamic_synonym",
            "synonyms_path": "http://localhost:8080/synonym.txt",
            "interval": "60"
          }
        }
      }
    }
  }
}

PUT test_synonym_1/_mapping
{
  "properties": {
    "title": {
      "type": "text",
      "analyzer": "jieba_search",
      "search_analyzer": "jieba_synonym"
    },
    "content" :{
      "type" :"text",
      "analyzer" :"jieba_search" 
    }
  }
}
