A Summary of ES Metric Aggregations (Elasticsearch Metric Aggregations)

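Metric aggregations mainly operate on numeric fields and are comparable to aggregate functions such as sum, avg, max, and min in a relational database.
1. avg (Average)

Computes the average of the price field. The corresponding Java example is as follows: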

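    // Imports used by the examples in this article (high-level REST client).
    // JUnit's @Test and the `log` field belong to the surrounding test class (not shown).
    import javax.annotation.Resource;
    import com.alibaba.fastjson.JSONObject;
    import org.elasticsearch.action.search.SearchRequest;
    import org.elasticsearch.action.search.SearchResponse;
    import org.elasticsearch.client.RequestOptions;
    import org.elasticsearch.client.RestHighLevelClient;
    import org.elasticsearch.index.query.QueryBuilders;
    import org.elasticsearch.search.aggregations.AggregationBuilder;
    import org.elasticsearch.search.aggregations.AggregationBuilders;
    import org.elasticsearch.search.builder.SearchSourceBuilder;

    @Resource
    private RestHighLevelClient client;

    @Test
    public void testAvgAggregation() {
        try {
            SearchRequest searchRequest = new SearchRequest();
            searchRequest.indices("items");
            SearchSourceBuilder sourceBuilder = new SearchSourceBuilder();
            // Average of "price"; documents without the field are counted as 0.
            AggregationBuilder avg = AggregationBuilders.avg("avg-price").field("price").missing(0);
            sourceBuilder.aggregation(avg);
            // size(0): return only the aggregation result, no hits.
            sourceBuilder.size(0);
            // Restrict the aggregation to documents whose category is "一级".
            sourceBuilder.query(
                    QueryBuilders.termQuery("category", "一级")
            );
            searchRequest.source(sourceBuilder);
            SearchResponse result = client.search(searchRequest, RequestOptions.DEFAULT);
            // Serialize with fastjson, as in the other examples, to get the JSON shown below.
            System.out.println(JSONObject.toJSONString(result));
        } catch (Throwable e) {
            e.printStackTrace();
        } finally {
            try {
                client.close();
            } catch (Exception e) {
                log.error(e.getMessage());
            }
        }
    }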

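Here missing(0) specifies the value to use when a document does not contain the field being averaged; in this example such documents take part in the calculation as 0.
The returned result is as follows: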

{
    "aggregations": {
        "asMap": {
            "avg-price": {
                "fragment": true,
                "name": "avg-price",
                "type": "avg",
                "value": 484.9945,
                "valueAsString": "484.9945"
            }
        },
        "fragment": true
    },
    "clusters": {
        "fragment": true,
        "skipped": 0,
        "successful": 0,
        "total": 0
    },
    "failedShards": 0,
    "fragment": false,
    "hits": {
        "fragment": true,
        "hits": [],
        "maxScore": 0,
        "totalHits": 2
    },
    "numReducePhases": 1,
    "profileResults": {},
    "shardFailures": [],
    "skippedShards": 0,
    "successfulShards": 5,
    "timedOut": false,
    "took": {
        "days": 0,
        "daysFrac": 2.3148148148148148e-8,
        "hours": 0,
        "hoursFrac": 5.555555555555555e-7,
        "micros": 2000,
        "microsFrac": 2000,
        "millis": 2,
        "millisFrac": 2,
        "minutes": 0,
        "minutesFrac": 0.000033333333333333335,
        "nanos": 2000000,
        "seconds": 0,
        "secondsFrac": 0.002,
        "stringRep": "2ms"
    },
    "totalShards": 5
}
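
To make the effect of missing(0) concrete, here is a minimal sketch in plain Java that does not touch Elasticsearch at all. The class name AvgMissingSketch, its methods, and the sample prices are hypothetical and exist only for illustration; a null entry stands in for a document that lacks the field.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of how a "missing" default changes an average.
final class AvgMissingSketch {

    // A null entry (field absent) is replaced by the default, like missing(0).
    static double avgWithMissing(List<Double> values, double missing) {
        if (values.isEmpty()) {
            throw new IllegalArgumentException("no values");
        }
        double sum = 0;
        for (Double v : values) {
            sum += (v == null) ? missing : v;
        }
        return sum / values.size();
    }

    // Null entries are skipped, which corresponds to not setting missing.
    static double avgIgnoringMissing(List<Double> values) {
        double sum = 0;
        int n = 0;
        for (Double v : values) {
            if (v != null) {
                sum += v;
                n++;
            }
        }
        if (n == 0) {
            throw new IllegalArgumentException("no values");
        }
        return sum / n;
    }

    public static void main(String[] args) {
        List<Double> prices = Arrays.asList(10.0, null, 20.0);
        System.out.println(avgWithMissing(prices, 0));   // prints 10.0
        System.out.println(avgIgnoringMissing(prices));  // prints 15.0
    }
}
```

With missing set, the document without a price lowers the average because it is counted as 0; without it, that document is simply left out.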

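2. Weighted Avg Aggregation (Weighted Average)
The weighted average is computed as ∑(value * weight) / ∑(weight).
The weighted average aggregation (weighted_avg) supports the following parameters:

value: configuration of the field or script that provides the values, i.e. which field to average. It supports the following sub-parameters:
  field: the name of the field that provides the values.
  missing: the value to use when a matching document does not have the value field.
weight: the object that defines the weights. Its optional attributes are:
  field: the field that provides the weights.
  missing: the weight to assign to a document that lacks the weight field.
format: the format applied to the numeric result.
value_type: specifies the type of the values, for example ValueType.DATE or ValueType.IP.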
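
The formula ∑(value * weight) / ∑(weight) can be sketched in plain Java as follows. This is only an illustration of the arithmetic, not of how Elasticsearch implements the aggregation; the class name WeightedAvgSketch and the sample numbers are hypothetical.

```java
// Hypothetical illustration of the weighted average formula.
final class WeightedAvgSketch {

    // Returns sum(value[i] * weight[i]) / sum(weight[i]).
    static double weightedAvg(double[] values, double[] weights) {
        if (values.length != weights.length) {
            throw new IllegalArgumentException("values and weights differ in length");
        }
        double weightedSum = 0;
        double weightSum = 0;
        for (int i = 0; i < values.length; i++) {
            weightedSum += values[i] * weights[i];
            weightSum += weights[i];
        }
        if (weightSum == 0) {
            throw new ArithmeticException("sum of weights is zero");
        }
        return weightedSum / weightSum;
    }

    public static void main(String[] args) {
        double[] prices = {10, 20};
        double[] weights = {1, 3};
        // (10 * 1 + 20 * 3) / (1 + 3) = 70 / 4
        System.out.println(weightedAvg(prices, weights)); // prints 17.5
    }
}
```

The plain average of the two prices would be 15; the heavier weight on the second price pulls the result toward 20.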

The example below takes the weight from the document field named weight. The Java example is as follows:

    @Test
    public void test_weight_avg_aggregation() {
        try {
            SearchRequest searchRequest = new SearchRequest();
            searchRequest.indices("items");
            SearchSourceBuilder sourceBuilder = new SearchSourceBuilder();
            WeightedAvgAggregationBuilder avg = AggregationBuilders.weightedAvg("avg-aggregation")
                    // Field that supplies the values to average.
                    .value(
                            (new MultiValuesSourceFieldConfig.Builder())
                                    .setFieldName("price")
                                    .build()
                    )
                    // Field that supplies each document's weight.
                    .weight(
                            (new MultiValuesSourceFieldConfig.Builder())
                                    .setFieldName("weight")
                                    .build()
                    );
            sourceBuilder.aggregation(avg);
            sourceBuilder.size(0);
            sourceBuilder.query(
                    QueryBuilders.termQuery("category", "一级")
            );
            searchRequest.source(sourceBuilder);
            SearchResponse result = client.search(searchRequest, RequestOptions.DEFAULT);
            System.out.println(JSONObject.toJSONString(result));
        } catch (Throwable e) {
            e.printStackTrace();
        } finally {
            try {
                client.close();
            }catch (Exception e){
                log.error(e.getMessage());
            }
        }
    }

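3. Cardinality Aggregation
The cardinality aggregation counts distinct values (deduplicate first, then count), similar to count(distinct) in a relational database.
An example follows: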

    @Test
    public void test_Cardinality_Aggregation() {
        try {
            SearchRequest searchRequest = new SearchRequest();
            searchRequest.indices("poems");
            SearchSourceBuilder sourceBuilder = new SearchSourceBuilder();
            AggregationBuilder aggregationBuild = AggregationBuilders.cardinality("author_count").field("author");
            sourceBuilder.aggregation(aggregationBuild);
            sourceBuilder.size(0);
            sourceBuilder.query(
                    QueryBuilders.termQuery("dynasty", "唐")
            );
            searchRequest.source(sourceBuilder);
            SearchResponse result = client.search(searchRequest, RequestOptions.DEFAULT);
            System.out.println(JSONObject.toJSONString(result));
        } catch (Throwable e) {
            e.printStackTrace();
        } finally {
            try {
                client.close();
            }catch (Exception e){
                log.error(e.getMessage());
            }
        }
    }

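The implementation above has an effect similar to the SQL statement SELECT COUNT(DISTINCT author) FROM poems WHERE dynasty = '唐';
Its core parameter is:

precision_threshold: controls accuracy. Below this count, results are expected to be close to exact; above it, counts may become less accurate. The maximum supported value is 40000; thresholds above this have the same effect as 40000. The default is 3000.

The 11 returned by the example above is an exact value; if the code is rewritten as below, the result becomes inaccurate: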

{
    "aggregations": {
        "asMap": {
            "author_count": {
                "fragment": true,
                "name": "author_count",
                "type": "cardinality",
                "value": 6,
                "valueAsString": "6.0"
            }
        },
        "fragment": true
    },
    "clusters": {
        "fragment": true,
        "skipped": 0,
        "successful": 0,
        "total": 0
    },
    "failedShards": 0,
    "fragment": false,
    "hits": {
        "fragment": true,
        "hits": [],
        "maxScore": 0,
        "totalHits": 15
    },
    "numReducePhases": 1,
    "profileResults": {},
    "shardFailures": [],
    "skippedShards": 0,
    "successfulShards": 5,
    "timedOut": false,
    "took": {
        "days": 0,
        "daysFrac": 4.2824074074074075e-7,
        "hours": 0,
        "hoursFrac": 0.000010277777777777777,
        "micros": 37000,
        "microsFrac": 37000,
        "millis": 37,
        "millisFrac": 37,
        "minutes": 0,
        "minutesFrac": 0.0006166666666666666,
        "nanos": 37000000,
        "seconds": 0,
        "secondsFrac": 0.037,
        "stringRep": "37ms"
    },
    "totalShards": 5
}

The returned result is as follows:

{
    "aggregations": {
        "asMap": {
            "author_count": {
                "fragment": true,
                "name": "author_count",
                "type": "cardinality",
                "value": 12,
                "valueAsString": "12.0"
            }
        },
        "fragment": true
    },
    "clusters": {
        "fragment": true,
        "skipped": 0,
        "successful": 0,
        "total": 0
    },
    "failedShards": 0,
    "fragment": false,
    "hits": {
        "fragment": true,
        "hits": [],
        "maxScore": 0,
        "totalHits": 22
    },
    "numReducePhases": 1,
    "profileResults": {},
    "shardFailures": [],
    "skippedShards": 0,
    "successfulShards": 5,
    "timedOut": false,
    "took": {
        "days": 0,
        "daysFrac": 2.5462962962962963e-7,
        "hours": 0,
        "hoursFrac": 0.000006111111111111111,
        "micros": 22000,
        "microsFrac": 22000,
        "millis": 22,
        "millisFrac": 22,
        "minutes": 0,
        "minutesFrac": 0.00036666666666666667,
        "nanos": 22000000,
        "seconds": 0,
        "secondsFrac": 0.022,
        "stringRep": "22ms"
    },
    "totalShards": 5
}

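Pre-computed hashes: when running a cardinality aggregation on a string field, a good practice is to index a hash of the string in advance and aggregate on the hash values, which improves efficiency.
Missing value: the missing parameter defines how documents without a value are handled. By default they are ignored, but they can also be treated as having a value, which is set through missing.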
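
The two ideas above, distinct counting and counting over pre-computed hashes, can be sketched in plain Java. This is an exact, in-memory illustration and not the approximate algorithm Elasticsearch uses; the class name CardinalitySketch and the sample authors are hypothetical, and String.hashCode merely stands in for whatever hash would be indexed in advance.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of COUNT(DISTINCT ...) and of counting over hashes.
final class CardinalitySketch {

    // Exact distinct count over the raw strings.
    static int distinctCount(List<String> values) {
        return new HashSet<>(values).size();
    }

    // Distinct count over hashes of the strings. Comparing small fixed-size
    // hashes is cheaper than comparing strings, but two different strings
    // that collide on the same hash would be counted once.
    static int distinctCountByHash(List<String> values) {
        Set<Integer> hashes = new HashSet<>();
        for (String v : values) {
            hashes.add(v.hashCode()); // stand-in for a hash computed at index time
        }
        return hashes.size();
    }

    public static void main(String[] args) {
        List<String> authors = Arrays.asList("Li Bai", "Du Fu", "Li Bai", "Wang Wei");
        System.out.println(distinctCount(authors));       // prints 3
        System.out.println(distinctCountByHash(authors)); // prints 3
    }
}
```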

4. Extended Stats Aggregation
An extended version of the stats aggregation. An example follows:

    @Test
    public void test_Extended_Stats_Aggregation() {
        try {
            SearchRequest searchRequest = new SearchRequest();
            searchRequest.indices("items");
            SearchSourceBuilder sourceBuilder = new SearchSourceBuilder();
            AggregationBuilder aggregationBuild = AggregationBuilders.extendedStats("extended_stats").field("price");
            sourceBuilder.aggregation(aggregationBuild);
            sourceBuilder.size(0);
//            sourceBuilder.query(
//                    QueryBuilders.termQuery("sellerId", 24)
//            );
            searchRequest.source(sourceBuilder);
            SearchResponse result = client.search(searchRequest, RequestOptions.DEFAULT);
            System.out.println(JSONObject.toJSONString(result));
        } catch (Throwable e) {
            e.printStackTrace();
        } finally {
            try {
                client.close();
            }catch (Exception e){
                log.error(e.getMessage());
            }
        }
    }

The returned result is as follows:

{
    "aggregations": {
        "asMap": {
            "extended_stats": {
                "avg": 281.94725,
                "avgAsString": "281.94725",
                "count": 4,
                "fragment": true,
                "max": 880.999,
                "maxAsString": "880.999",
                "min": 10.9,
                "minAsString": "10.9",
                "name": "extended_stats",
                "stdDeviation": 349.2133556190077,
                "stdDeviationAsString": "349.2133556190077",
                "sum": 1127.789,
                "sumAsString": "1127.789",
                "sumOfSquares": 805776.8781010001,
                "sumOfSquaresAsString": "805776.8781010001",
                "type": "extended_stats",
                "variance": 121949.96774268753,
                "varianceAsString": "121949.96774268753"
            }
        },
        "fragment": true
    },
    "clusters": {
        "fragment": true,
        "skipped": 0,
        "successful": 0,
        "total": 0
    },
    "failedShards": 0,
    "fragment": false,
    "hits": {
        "fragment": true,
        "hits": [],
        "maxScore": 0,
        "totalHits": 4
    },
    "numReducePhases": 1,
    "profileResults": {},
    "shardFailures": [],
    "skippedShards": 0,
    "successfulShards": 5,
    "timedOut": false,
    "took": {
        "days": 0,
        "daysFrac": 3.8194444444444445e-7,
        "hours": 0,
        "hoursFrac": 0.000009166666666666666,
        "micros": 33000,
        "microsFrac": 33000,
        "millis": 33,
        "millisFrac": 33,
        "minutes": 0,
        "minutesFrac": 0.00055,
        "nanos": 33000000,
        "seconds": 0,
        "secondsFrac": 0.033,
        "stringRep": "33ms"
    },
    "totalShards": 5
}
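
The figures in this response are consistent with the population formulas avg = sum / count, variance = sumOfSquares / count - avg², and stdDeviation = √variance. The following plain-Java sketch computes the same set of statistics for an in-memory array. The class name ExtendedStatsSketch and the sample values are hypothetical, and the code illustrates the arithmetic only, not the Elasticsearch implementation.

```java
// Hypothetical illustration of the statistics reported by extended_stats.
final class ExtendedStatsSketch {
    final long count;
    final double min;
    final double max;
    final double sum;
    final double sumOfSquares;

    ExtendedStatsSketch(double[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("no values");
        }
        double lo = values[0];
        double hi = values[0];
        double s = 0;
        double sq = 0;
        for (double v : values) {
            lo = Math.min(lo, v);
            hi = Math.max(hi, v);
            s += v;
            sq += v * v;
        }
        count = values.length;
        min = lo;
        max = hi;
        sum = s;
        sumOfSquares = sq;
    }

    double avg() {
        return sum / count;
    }

    // Population variance: mean of squares minus square of the mean.
    double variance() {
        return sumOfSquares / count - avg() * avg();
    }

    double stdDeviation() {
        return Math.sqrt(variance());
    }

    public static void main(String[] args) {
        ExtendedStatsSketch stats =
                new ExtendedStatsSketch(new double[]{2, 4, 4, 4, 5, 5, 7, 9});
        System.out.println(stats.avg());          // prints 5.0
        System.out.println(stats.variance());     // prints 4.0
        System.out.println(stats.stdDeviation()); // prints 2.0
    }
}
```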

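5. max Aggregation
Returns the maximum value. It is used in the same way as the avg aggregation, so it is not described again.
6. min Aggregation
Returns the minimum value. It is used in the same way as the avg aggregation, so it is not described again.
7. Percentiles Aggregation
Percentile calculation is another approximate metric provided by ES. It shows the value observed at a given percentage of the data; for example, the 95th percentile is the value that is greater than 95% of the observed values. Percentile aggregations are often used to find outliers, and suit problems that are examined through a statistical distribution such as the normal distribution.
Official documentation: https://www.elastic.co/guide/cn/elasticsearch/guide/current/percentiles.html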
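
To show what a percentile means, here is a minimal plain-Java sketch that computes an exact percentile with the nearest-rank method. This is a deliberately simple stand-in: ES computes percentiles approximately, and the class name PercentileSketch and the sample latencies are hypothetical.

```java
import java.util.Arrays;

// Hypothetical illustration of an exact nearest-rank percentile.
final class PercentileSketch {

    // Returns the smallest value such that at least p percent of the
    // values are less than or equal to it (0 < p <= 100).
    static double percentile(double[] values, double p) {
        if (values.length == 0 || p <= 0 || p > 100) {
            throw new IllegalArgumentException("need values and 0 < p <= 100");
        }
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[rank - 1];
    }

    public static void main(String[] args) {
        double[] latenciesMs = {40, 15, 50, 20, 35};
        System.out.println(percentile(latenciesMs, 50)); // prints 35.0
        System.out.println(percentile(latenciesMs, 95)); // prints 50.0
    }
}
```

A single slow request is enough to move a high percentile, which is why high percentiles are useful for spotting outliers that an average would hide.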

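8. HDR Histogram
HDR Histogram (High Dynamic Range Histogram) is an alternative implementation that is useful when calculating percentiles for latency measurements, because it can be faster than the t-digest implementation at the cost of a larger memory footprint. This implementation maintains a fixed worst-case percentage error (specified as a number of significant digits). This means that if data is recorded with values from 1 microsecond up to 1 hour (3,600,000,000 microseconds) in a histogram set to 3 significant digits, it maintains a value resolution of 1 microsecond for values up to 1 millisecond, and of 3.6 seconds (or better) for the maximum tracked value (1 hour).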
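
The 1 microsecond and 3.6 second figures follow from simple arithmetic: with d significant digits, the worst-case resolution at a value is roughly that value divided by 10^d. The plain-Java sketch below checks the two figures. It is a back-of-the-envelope calculation and does not reproduce the histogram's internal bucketing; the class name HdrResolutionSketch and its method are hypothetical.

```java
// Hypothetical back-of-the-envelope check of the resolution figures.
final class HdrResolutionSketch {

    // Approximate worst-case resolution at a value, in the same unit as the
    // value, for a given number of significant digits (never below 1 unit).
    static long worstCaseResolution(long value, int significantDigits) {
        long scale = 1;
        for (int i = 0; i < significantDigits; i++) {
            scale *= 10;
        }
        return Math.max(1, value / scale);
    }

    public static void main(String[] args) {
        long oneMillisecondInMicros = 1_000L;
        long oneHourInMicros = 3_600_000_000L;
        // 1 ms at 3 digits: resolution of 1 microsecond.
        System.out.println(worstCaseResolution(oneMillisecondInMicros, 3)); // prints 1
        // 1 hour at 3 digits: 3,600,000 microseconds, i.e. 3.6 seconds.
        System.out.println(worstCaseResolution(oneHourInMicros, 3)); // prints 3600000
    }
}
```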

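hdr: the hdr attribute specifies the histogram-related parameters.
number_of_significant_value_digits: specifies the resolution of histogram values as a number of significant digits.

Note: the HDR histogram only supports positive values and fails if a negative value is passed. Using HDRHistogram is also not a good idea when the range of values is unknown, because this can lead to high memory usage.
Missing value

The missing parameter defines how documents without a value are handled. By default they are ignored, but they can also be treated as having a value.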

    @Test
    public void test_Percentiles_Aggregation() {
        try {
            SearchRequest searchRequest = new SearchRequest();
            searchRequest.indices("items");
            SearchSourceBuilder sourceBuilder = new SearchSourceBuilder();
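            AggregationBuilder aggregationBuild = AggregationBuilders.percentiles("percentiles")
                    .field("price")
                    .percentiles(75, 90, 99.9)
                    // compression is a t-digest setting; kept here from the original example.
                    .compression(100)
                    // Use the HDR histogram implementation with 3 significant digits.
                    .method(PercentilesMethod.HDR)
                    .numberOfSignificantValueDigits(3);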
            sourceBuilder.aggregation(aggregationBuild);
            sourceBuilder.size(0);
//            sourceBuilder.query(
//                    QueryBuilders.termQuery("sellerId", 24)
//            );
            searchRequest.source(sourceBuilder);
            SearchResponse result = client.search(searchRequest, RequestOptions.DEFAULT);
            System.out.println(JSONObject.toJSONString(result));
        } catch (Throwable e) {
            e.printStackTrace();
        } finally {
            try {
                client.close();
            }catch (Exception e){
                log.error(e.getMessage());
            }
        }
    }

Reference blog: https://blog.csdn.net/prestigeding/article/details/88373092

郭慕荣, 博客园 (cnblogs)
