ElasticSearch equivalent to MongoDB .distinct(…)

Let’s say we have an ElasticSearch index called strings with a field pattern of {"type": "keyword"}.

Now we want to do the equivalent of MongoDB db.getCollection('...').distinct('pattern'):

Solution:

In Python you can use the iterate_distinct_field() helper from this previous post on ElasticSearch distinct. Full example:

from elasticsearch import Elasticsearch

es = Elasticsearch()

def iterate_distinct_field(es, fieldname, pagesize=250, **kwargs):
    """
    Helper to get all distinct values from ElasticSearch
    (ordered by number of occurrences)
    """
    compositeQuery = {
        "size": pagesize,
        "sources": [{
                fieldname: {
                    "terms": {
                        "field": fieldname
                    }
                }
            }
        ]
    }
    # Iterate over pages
    while True:
        result = es.search(**kwargs, body={
            "aggs": {
                "values": {
                    "composite": compositeQuery
                }
            }
        })
        # Yield each bucket
        for aggregation in result["aggregations"]["values"]["buckets"]:
            yield aggregation
        # Set "after" field
        if "after_key" in result:
            compositeQuery["after_key"] = result["after_key"]
        else: # Finished!
            break

# Usage example
for result in iterate_distinct_field(es, fieldname="pattern.keyword", index="strings"):
    print(result) # e.g. {'key': {'pattern': 'mypattern'}, 'doc_count': 315}

How to query distinct field values in ElasticSearch

Let’s say we have an ElasticSearch index called strings with a field pattern of {"type": "keyword"}.

Get the top N values of the column

If we want to get the top N ( 12 in our example) entries, i.e. the patterns that are present in the most documents, we can use this query:

{
    "aggs" : {
        "patterns" : {
            "terms" : {
                "field" : "pattern.keyword",
                "size": 12
            }
        }
    }
}

Full example in Python:

from elasticsearch import Elasticsearch

es = Elasticsearch()

result = es.search(index="strings", body={
    "aggs" : {
        "patterns" : {
            "terms" : {
                "field" : "pattern.keyword",
                "size": 12
            }
        }
    }
})
for aggregation in result["aggregations"]["patterns"]["buckets"]:
    print(aggregation) # e.g. {'key': 'mypattern, 'doc_count': 2802}

See the terms aggregation documentation for more infos.

Get all the distinct values of the column

Getting all the values is slightly more complicated since we need to use a composite aggregation that returns an after_key to paginate the query.

This Python helper function will automatically paginate the query with configurable page size:

from elasticsearch import Elasticsearch

es = Elasticsearch()

def iterate_distinct_field(es, fieldname, pagesize=250, **kwargs):
    """
    Helper to get all distinct values from ElasticSearch
    (ordered by number of occurrences)
    """
    compositeQuery = {
        "size": pagesize,
        "sources": [{
                fieldname: {
                    "terms": {
                        "field": fieldname
                    }
                }
            }
        ]
    }
    # Iterate over pages
    while True:
        result = es.search(**kwargs, body={
            "aggs": {
                "values": {
                    "composite": compositeQuery
                }
            }
        })
        # Yield each bucket
        for aggregation in result["aggregations"]["values"]["buckets"]:
            yield aggregation
        # Set "after" field
        if "after_key" in result:
            compositeQuery["after_key"] = result["after_key"]
        else: # Finished!
            break

# Usage example
for result in iterate_distinct_field(es, fieldname="pattern.keyword", index="strings"):
    print(result) # e.g. {'key': {'pattern': 'mypattern'}, 'doc_count': 315}

How to fix ElasticSearch ‘Fielddata is disabled on text fields by default’ for keyword field

Problem:

You have a field in ElasticSearch named e.g. patterns of type keyword. However, when you query for an aggregation of this field e.g.

es.search(index="strings", body={
    "size": 0,
    "aggs" : {
        "patterns" : {
            "terms" : { "field" : "pattern" }
        }
    }
})

you see this error message:

elasticsearch.exceptions.RequestError: RequestError(400, 'search_phase_execution_exception', 'Fielddata is disabled on text fields by default. Set fielddata=true on [pattern] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead.'

Solution:

This error message is confusing since you already have a keyword field. However, the ElasticSearch fielddata documentation tells us that you need to to use pattern.keyword in the query instead of just pattern.

Full example:

es.search(index="strings", body={
    "size": 0,
    "aggs" : {
        "patterns" : {
            "terms" : { "field" : "pattern.keyword" }
        }
    }
})