Let’s say we have an ElasticSearch index called strings with a field pattern
of {"type": "keyword"}
.
Get the top N values of the column
If we want to get the top N ( 12
in our example) entries, i.e. the pattern
s that are present in the most documents, we can use this query:
{ "aggs" : { "patterns" : { "terms" : { "field" : "pattern.keyword", "size": 12 } } } }
Full example in Python:
from elasticsearch import Elasticsearch es = Elasticsearch() result = es.search(index="strings", body={ "aggs" : { "patterns" : { "terms" : { "field" : "pattern.keyword", "size": 12 } } } }) for aggregation in result["aggregations"]["patterns"]["buckets"]: print(aggregation) # e.g. {'key': 'mypattern, 'doc_count': 2802}
See the terms aggregation documentation for more infos.
Get all the distinct values of the column
Getting all the values is slightly more complicated since we need to use a composite aggregation that returns an after_key
to paginate the query.
This Python helper function will automatically paginate the query with configurable page size:
from elasticsearch import Elasticsearch es = Elasticsearch() def iterate_distinct_field(es, fieldname, pagesize=250, **kwargs): """ Helper to get all distinct values from ElasticSearch (ordered by number of occurrences) """ compositeQuery = { "size": pagesize, "sources": [{ fieldname: { "terms": { "field": fieldname } } } ] } # Iterate over pages while True: result = es.search(**kwargs, body={ "aggs": { "values": { "composite": compositeQuery } } }) # Yield each bucket for aggregation in result["aggregations"]["values"]["buckets"]: yield aggregation # Set "after" field if "after_key" in result["aggregations"]["values"]: compositeQuery["after"] = \ result["aggregations"]["values"]["after_key"] else: # Finished! break # Usage example for result in iterate_distinct_field(es, fieldname="pattern.keyword", index="strings"): print(result) # e.g. {'key': {'pattern': 'mypattern'}, 'doc_count': 315}