1. First, define an index, as follows:
PUT /person_news
{
  "settings": {
    "index": {
      "number_of_shards": "3",
      "number_of_replicas": "0",
      "max_result_window": "2000000000"
    }
  },
  "mappings": {
    "properties": {
      "companyName": {
        "type": "text",
        "analyzer": "ik_max_word",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "newsSource": {
        "type": "keyword"
      },
      "newsContent": {
        "type": "text",
        "analyzer": "ik_max_word"
      },
      "newsTitle": {
        "type": "text",
        "analyzer": "ik_max_word",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "labels": {
        "type": "keyword"
      },
      "personInfo": {
        "type": "nested",
        "properties": {
          "personName": {
            "type": "keyword"
          },
          "age": {
            "type": "integer"
          }
        }
      },
      "hotPoint": {
        "type": "long"
      }
    }
  }
}
The person_news index relates news items to people. companyName is the company name, defined as a text field analyzed with the IK analyzer (ik_max_word); it also defines a keyword sub-field, which is not analyzed (usable for aggregations and exact matching);
newsSource: the news source, not analyzed;
newsContent: the news body, analyzed;
newsTitle: the news title, analyzed, with a keyword sub-field (same as companyName above);
labels: tags, not analyzed (I store an array in this field, since one news item can carry multiple tags; see the document inserts below);
personInfo: the person objects mentioned in the news, stored as a nested structure, i.e. an array of objects, each with personName and age fields;
hotPoint: the news hotness score, usually used to sort news;
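As a quick sanity check of this mapping, the keyword sub-field supports exact matching while hotPoint drives sorting. A sketch (the query value is just an example):
GET /person_news/_search
{
  "query": {
    "term": {
      "companyName.keyword": "中国恒大有限责任公司"
    }
  },
  "sort": [
    { "hotPoint": { "order": "desc" } }
  ]
}
Note that the term query targets companyName.keyword, not companyName: the analyzed parent field stores individual IK tokens, so an exact whole-value match only works against the unanalyzed sub-field.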
2. Insert some data
PUT person_news/_doc/1
{
  "companyName": "中国恒大有限责任公司",
  "newsSource": "新华社",
  "newsContent": "今日中国证监会对中国恒大董事长许家印罚款4000万,并对其做出终身不能入市的处罚规定,其公司其他高管夏海钧也被做出相应处罚",
  "newsTitle": "恒大许家印被罚",
  "labels": [
    "恒大",
    "许家印"
  ],
  "personInfo": [
    {
      "personName": "许家印",
      "age": 60
    },
    {
      "personName": "夏海钧",
      "age": 59
    }
  ],
  "hotPoint": 1
}
PUT person_news/_doc/2
{
  "companyName": "阿里巴巴有限责任公司",
  "newsSource": "新华社",
  "newsContent": "今日阿里公司集团董事长张勇卸任,由蔡崇信接任",
  "newsTitle": "阿里张勇卸任",
  "labels": [
    "阿里",
    "蔡崇信",
    "张勇"
  ],
  "personInfo": [
    {
      "personName": "张勇",
      "age": 60
    },
    {
      "personName": "蔡崇信",
      "age": 54
    }
  ],
  "hotPoint": 2
}
PUT person_news/_doc/3
{
  "companyName": "中国恒大有限责任公司",
  "newsSource": "路透社",
  "newsContent": "中国恒大董事长传闻跳楼,恒大资产负债高达几万亿,传闻阿里张勇将对恒大进行投资,进军房地产,具体消息恒大高管夏海钧予以否认",
  "newsTitle": "恒大董事长许家印",
  "labels": [
    "恒大",
    "张勇"
  ],
  "personInfo": [
    {
      "personName": "张勇",
      "age": 60
    },
    {
      "personName": "夏海钧",
      "age": 59
    }
  ],
  "hotPoint": 3
}
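With these documents in place, the nested personInfo objects can be searched with a nested query, which applies all conditions within the same array element (a sketch; the person name and age bound are illustrative):
GET /person_news/_search
{
  "query": {
    "nested": {
      "path": "personInfo",
      "query": {
        "bool": {
          "must": [
            { "term": { "personInfo.personName": "许家印" } },
            { "range": { "personInfo.age": { "gte": 60 } } }
          ]
        }
      }
    }
  }
}
This is the point of the nested type: both conditions must hold on the same person object, rather than one document merely containing some person with that name and some other person in that age range.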
3. You can run an _analyze request in Kibana's Dev Tools to see how a given analyzer tokenizes a piece of text (here using ik_max_word, the finest-grained IK mode):
GET /person_news/_analyze
{
  "analyzer": "ik_max_word",
  "text": "中国恒大有限责任公司"
}
The result is as follows:
{
  "tokens" : [
    {
      "token" : "中国",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "恒",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "大有",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "有限责任",
      "start_offset" : 4,
      "end_offset" : 8,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "有限",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 4
    },
    {
      "token" : "责任",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "CN_WORD",
      "position" : 5
    },
    {
      "token" : "公司",
      "start_offset" : 8,
      "end_offset" : 10,
      "type" : "CN_WORD",
      "position" : 6
    }
  ]
}
With ik_smart, the coarser "smart" IK mode:
GET /person_news/_analyze
{
  "analyzer": "ik_smart",
  "text": "中国恒大有限责任公司"
}
The result is as follows:
{
  "tokens" : [
    {
      "token" : "中国",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "恒",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "大",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "CN_CHAR",
      "position" : 2
    },
    {
      "token" : "有限责任",
      "start_offset" : 4,
      "end_offset" : 8,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "公司",
      "start_offset" : 8,
      "end_offset" : 10,
      "type" : "CN_WORD",
      "position" : 4
    }
  ]
}
With Elasticsearch's built-in standard analyzer, the effect is as follows (it splits Chinese text into individual characters):
GET /person_news/_analyze
{
  "analyzer": "standard",
  "text": "中国恒大有限责任公司"
}
{
  "tokens" : [
    {
      "token" : "中",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<IDEOGRAPHIC>",
      "position" : 0
    },
    {
      "token" : "国",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "<IDEOGRAPHIC>",
      "position" : 1
    },
    {
      "token" : "恒",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "<IDEOGRAPHIC>",
      "position" : 2
    },
    {
      "token" : "大",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "<IDEOGRAPHIC>",
      "position" : 3
    },
    {
      "token" : "有",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "<IDEOGRAPHIC>",
      "position" : 4
    },
    {
      "token" : "限",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "<IDEOGRAPHIC>",
      "position" : 5
    },
    {
      "token" : "责",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "<IDEOGRAPHIC>",
      "position" : 6
    },
    {
      "token" : "任",
      "start_offset" : 7,
      "end_offset" : 8,
      "type" : "<IDEOGRAPHIC>",
      "position" : 7
    },
    {
      "token" : "公",
      "start_offset" : 8,
      "end_offset" : 9,
      "type" : "<IDEOGRAPHIC>",
      "position" : 8
    },
    {
      "token" : "司",
      "start_offset" : 9,
      "end_offset" : 10,
      "type" : "<IDEOGRAPHIC>",
      "position" : 9
    }
  ]
}
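Because newsContent is indexed with ik_max_word, a match query against it is analyzed the same way at search time, so it matches on meaningful words rather than single characters. A sketch (the query text is illustrative):
GET /person_news/_search
{
  "query": {
    "match": {
      "newsContent": "恒大许家印"
    }
  }
}
Had the field used the standard analyzer instead, the query would be split into single characters and match far too broadly; this is why a Chinese-aware analyzer such as IK matters for full-text fields.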