Databases

ElasticSearch: How to iterate / scroll through all documents in index

In ElasticSearch, you can use the Scroll API to scroll through all documents in an entire index.

In Python you can scroll like this:

def es_iterate_all_documents(es, index, pagesize=250, scroll_timeout="1m", **kwargs):
    """
    Helper to iterate ALL values from a single index
    Yields all the documents.
    """
    is_first = True
    while True:
        # Scroll next
        if is_first: # Initialize scroll
            result = es.search(index=index, scroll=scroll_timeout, **kwargs, body={
                "size": pagesize
            })
            is_first = False
        else:
            result = es.scroll(body={
                "scroll_id": scroll_id,
                "scroll": scroll_timeout
            })
        scroll_id = result["_scroll_id"]
        hits = result["hits"]["hits"]
        # Stop after no more docs
        if not hits:
            break
        # Yield each entry
        yield from (hit['_source'] for hit in hits)

This function will yield each document encountered in the index.

Example usage for index my_index:

es = Elasticsearch([{"host": "localhost"}])

for entry in es_iterate_all_documents(es, 'my_index'):
    print(entry) # Prints the document as stored in the DB

 

Posted by Uli Köhler in Databases, ElasticSearch, Python

ElasticSearch: How to iterate all documents in index using Python (up to 10000 documents)

Important Note: This simple approach only works for up to ~10000 documents. Prefer using our scroll-based solution: See ElasticSearch: How to iterate / scroll through all documents in index

Use this helper function to iterate over all the documents in an index:

def es_iterate_all_documents(es, index, pagesize=250, **kwargs):
    """
    Helper to iterate ALL values from a single index
    Yields all the documents.
    """
    offset = 0
    while True:
        result = es.search(index=index, **kwargs, body={
            "size": pagesize,
            "from": offset
        })
        hits = result["hits"]["hits"]
        # Stop after no more docs
        if not hits:
            break
        # Yield each entry
        yield from (hit['_source'] for hit in hits)
        # Continue from there
        offset += pagesize

Usage example:

for entry in es_iterate_all_documents(es, 'my_index'):
    print(entry) # Prints the document as stored in the DB

How it works

You can iterate over all documents in an index in ElasticSearch by using queries like

{
    "size": 250,
    "from": 0
}

and increasing "from" by "size" after each iteration.

Posted by Uli Köhler in Databases, ElasticSearch, Python

Fixing ElasticSearch ‘Unknown key for a VALUE_NUMBER in [offset].’

The error message

Unknown key for a VALUE_NUMBER in [offset].

in ElasticSearch tells you that in the query JSON you have specified an offset (numeric value) but ElasticSearch doesn’t know what to do with offset.

This is easy to fix: In order to specify an offset to start from, use from instead of offset.

Incorrect:

{
    "size": 250,
    "offset": 1000
}

Correct:

{
    "size": 250,
    "from": 1000
}

 

Posted by Uli Köhler in Databases, ElasticSearch

Fixing ElasticSearch ‘Unknown key for a START_ARRAY in […].’

When you get an error message like

elasticsearch.exceptions.RequestError: RequestError(400, 'parsing_exception', 'Unknown key for a START_ARRAY in [size].')

in ElasticSearch, look for the value in the [brackets]. In our example this is [size].

Your query contains an array for that key but an array is not allowed.

For example, this query is malformed:

{
    "size": [10, 11]
}

because size takes a number, not an array.

In Python, if you are programmatically building your query, you might have an extra comma at the end of your line.

For example, if you have a line like

query["size"] = 250,

this is equivalent to

query["size"] = (250,)

i.e. it will set size to a tuple with one element. In the JSON this will result in

{
    "size": [250]
}

In order to fix that issue, remove the comma from the end of the line

query["size"] = 250

which will result in the correct query JSON

{
    "size": 250
}
Posted by Uli Köhler in Databases, ElasticSearch

ElasticSearch equivalent to MongoDB .count()

The ElasticSearch equivalent to MongoDB’s count() is also called count, and it can be used in a similar way.

When you have an ElasticSearch query like (example in Python)

result = es.search(index="my_index", body={
    "query": {
        "match": {
            "my_field": "my_value"
        }
    }
})
result_docs = [hit["_source"] for hit in result["hits"]["hits"]]

you can easily change it to a count-only query by replacing search with count:

result = es.count(index="my_index", body={
    "query": {
        "match": {
            "my_field": "my_value"
        }
    }
})
result_count = result["count"] # e.g. 200
Posted by Uli Köhler in Databases, ElasticSearch

Fixing ElasticSearch ‘no [query] registered for [query]’

Problem:

You want to run a query in ElasticSearch, but you get an error message like

elasticsearch.exceptions.RequestError: RequestError(400, 'parsing_exception', 'no [query] registered for [query]')

Solution:

In your query body, you have two "query" objects nested in each other. Remove the outer "query", keeping only the inner one.

Example:

Incorrect:

{
    "query": {
        "query": {
            "match": {
                "my_field": "my_value"
            }
        }
    }
}

Correct:

{
    "query": {
        "match": {
            "my_field": "my_value"
        }
    }
}

 

Posted by Uli Köhler in Databases, ElasticSearch

How to fix ElasticSearch ‘Types cannot be provided in put mapping requests, unless the include_type_name parameter is set to true’

Problem:

You want to create a mapping in ElasticSearch but you see an error message like

elasticsearch.exceptions.RequestError: RequestError(400, 'illegal_argument_exception', 'Types cannot be provided in put mapping requests, unless the include_type_name parameter is set to true.')

Solution:

As already suggested in the error message, set the include_type_name parameter to True.

With the Python API this is as simple as adding include_type_name=True to the put_mapping(...) call:

es.indices.put_mapping(index='my_index', body=my_mapping, doc_type='_doc', include_type_name=True)

In case you now see an error like

TypeError: put_mapping() got an unexpected keyword argument 'include_type_name'

you need to upgrade your elasticsearch python library, e.g. using

sudo pip3 install --upgrade elasticsearch

 

Posted by Uli Köhler in Databases, ElasticSearch, Python

How to view ElasticSearch cluster health using curl

To view the cluster health of your ElasticSearch cluster use

curl -X GET "http://localhost:9200/_cluster/health?pretty=true"

If your ElasticSearch is not running on localhost, replace localhost with the hostname or IP address that ElasticSearch is running on.

Example output:
{
  "cluster_name" : "docker-cluster",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 0,
  "active_shards" : 0,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}

The most important information from this is:

  • "cluster_name" : "docker-cluster" The name you assigned to your cluster. You should verify that you are connecting to the correct cluster. All ElasticSearch nodes from that cluster must have the same cluster name, or they won’t connect!
  • "number_of_nodes" : 1 The number of nodes currently in the cluster. Sometimes some nodes take longer to start up, so if there are some nodes missing, wait a minute and retry
  • "status" : "green" The status or cluster health of your cluster.

The cluster health can take three values:

  • green: Everything is OK with your cluster (like in our example)
  • yellow: Your cluster is mostly OK, but some shards couldn’t be replicated. This is often the case with clusters consisting of only one node (in that case, note that data lost on that single node cannot be recovered)
  • red: Something is wrong with the cluster. Usually that’s some configuration issue, so be sure to check the logs.

Also see the official reference on cluster health
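
If you are using the Python client anyway, the same information is available via the cluster API. A minimal sketch (assuming the official elasticsearch Python client and a node on localhost):

from elasticsearch import Elasticsearch

es = Elasticsearch([{"host": "localhost"}])
health = es.cluster.health()  # same JSON as the curl command above, as a dict
print(health["status"])            # e.g. "green"
print(health["number_of_nodes"])   # e.g. 1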

If you are looking for help on how to setup your ElasticSearch cluster using docker and docker-compose, you can generate your config file using our generator at ElasticSearch docker-compose.yml and systemd service generator.

Posted by Uli Köhler in Databases, ElasticSearch

How to fix ElasticSearch ‘[match] query doesn’t support multiple fields, found […] and […]’

Problem:

You want to run an ElasticSearch query like

{
    "query": {
        "match" : {
            "one_field" : "one_value",
            "another_field": "another_value"
        }
    }
}

but you only see an error message like

elasticsearch.exceptions.RequestError: RequestError(400, 'parsing_exception', "[match] query doesn't support multiple fields, found [one_field] and [another_field]")

Solution:

Match queries only support one field. You should use a bool query with a must clause containing multiple match queries instead:

{
    "query": {
        "bool": {
            "must": [
                {"match": {"one_field" : "one_value"}},
                {"match": {"another_field" : "another_value"}},
            ]
        }
    }
}

Also see the official docs for the MultiMatch query.

Posted by Uli Köhler in Databases, ElasticSearch

How to fix ElasticSearch ‘no [query] registered for [missing]’

Problem:

You are trying to run an ElasticSearch query like

{
    "query": {
        "missing" : { "field" : "myfield" }
    }
}

to find documents that do not have myfield.

However you only see an error message like this:

elasticsearch.exceptions.RequestError: RequestError(400, 'parsing_exception', 'no [query] registered for [missing]')

Solution:

As the ElasticSearch documentation tells us, there is no missing query! You instead need to use an exists query inside a must_not clause:

{
    "query": {
        "bool": {
            "must_not": {
                "exists": {
                    "field": "myfield"
                }
            }
        }
    }
}

 

Posted by Uli Köhler in Databases, ElasticSearch

How to fix ElasticSearch ‘Root mapping definition has unsupported parameters’

Problem:

You want to create an ElasticSearch index with a custom mapping or update the mapping of an existing ElasticSearch index but you see an error message like

elasticsearch.exceptions.RequestError: RequestError(400, 'mapper_parsing_exception', 'Root mapping definition has unsupported parameters:  [mappings : {properties={num_total={type=integer}, approved={type=integer}, num_translated={type=integer}, pattern_length={type=integer}, num_unapproved={type=integer}, pattern={type=keyword}, num_approved={type=integer}, translated={type=integer}, untranslated={type=integer}, num_untranslated={type=integer}, group={type=keyword}}}]')

Solution:

This can point to multiple issues. Essentially, ElasticSearch is trying to tell you that the structure of your JSON is not correct.

This error is often misinterpreted as individual field definitions being wrong, but that is rarely the issue (it only occurs if an individual field definition is completely malformed).

If your message is structured like

... unsupported parameters:  [mappings : ...

then the most likely root cause is that you have mappings nested inside mappings in your JSON. This also applies if you update a mapping (put_mapping) – in this case the outer mapping is implicit!

Example: Your code looks like this:

es.indices.put_mapping(index='my_index', doc_type='_doc', body={
    "mappings": {
        "properties": {
            "pattern": {
                "type":  "keyword"
            }
        }
    }
})

ElasticSearch will internally create a JSON structure like this:

{
    "mappings": {
        "mappings": {
            "properties": {
                "pattern": {
                    "type":  "keyword"
                }
            }
        }
    }
}

See that there are two mappings nested inside each other? ElasticSearch does not accept this structure, therefore you need to remove the "mappings": {...} wrapper from your code, resulting in

es.indices.put_mapping(index='my_index', doc_type='_doc', body={
    "properties": {
        "pattern": {
            "type":  "keyword"
        }
    }
})
Posted by Uli Köhler in Databases, ElasticSearch, Python

Fixing ElasticSearch ‘No handler for type [int] declared on field …’

Problem:

You want to create an index with a custom mapping in ElasticSearch but you see an error message like this:

elasticsearch.exceptions.RequestError: RequestError(400, 'mapper_parsing_exception', 'No handler for type [int] declared on field [id]')

Solution:

You likely have a mapping like

"id": {
    "type":  "int"
}

in your mapping properties.

The issue here is int: ElasticSearch uses integer as the type name for integers, not int!

In order to fix the issue, change the property to

"id": {
    "type":  "integer"
}

and retry creating the index.
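
For reference, here is a minimal sketch of creating an index with the corrected mapping from Python (assuming ElasticSearch 7.x, where mapping types are no longer required, and a hypothetical index named my_index):

from elasticsearch import Elasticsearch

es = Elasticsearch([{"host": "localhost"}])
es.indices.create(index="my_index", body={
    "mappings": {
        "properties": {
            "id": {"type": "integer"}  # "integer", not "int"
        }
    }
})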

 

Posted by Uli Köhler in Databases, ElasticSearch

How to fix ElasticSearch [FORBIDDEN/12/index read-only / allow delete (api)]

If you try to index a document in ElasticSearch and you see an error message like this:

elasticsearch.exceptions.AuthorizationException: AuthorizationException(403, 'cluster_block_exception', 'blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];')

you can unlock writes to your cluster (all indexes) using

curl -XPUT -H "Content-Type: application/json" http://localhost:9200/_all/_settings -d '{"index.blocks.read_only_allow_delete": null}'

(thanks to Imran273 on StackOverflow for the original solution)

Note however that often there’s an underlying reason that caused ElasticSearch to lock writes to the index. Most often it is caused by exceeding the disk watermark / quota. See How to disable ElasticSearch disk quota / watermark for details on how to work around that.
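
The same unlock can also be performed from Python. A minimal sketch (assuming the official elasticsearch Python client):

from elasticsearch import Elasticsearch

es = Elasticsearch([{"host": "localhost"}])
# Setting the block to None removes it for all indices
es.indices.put_settings(index="_all", body={
    "index.blocks.read_only_allow_delete": None
})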

Posted by Uli Köhler in Databases, ElasticSearch

How to disable ElasticSearch disk quota / watermark

In its default configuration, ElasticSearch will stop allocating data to a node when more than 90% of its disk is used overall (i.e. by ElasticSearch or other applications).

You can set the watermark extremely low using

curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient": {
    "cluster.routing.allocation.disk.watermark.low": "30mb",
    "cluster.routing.allocation.disk.watermark.high": "20mb",
    "cluster.routing.allocation.disk.watermark.flood_stage": "10mb",
    "cluster.info.update.interval": "1m"
  }
}
'
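
The same transient settings can also be applied from Python. A minimal sketch (assuming the official elasticsearch Python client):

from elasticsearch import Elasticsearch

es = Elasticsearch([{"host": "localhost"}])
es.cluster.put_settings(body={
    "transient": {
        "cluster.routing.allocation.disk.watermark.low": "30mb",
        "cluster.routing.allocation.disk.watermark.high": "20mb",
        "cluster.routing.allocation.disk.watermark.flood_stage": "10mb",
        "cluster.info.update.interval": "1m"
    }
})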

After doing that, you might need to unlock your cluster for write accesses if you had already exceeded your watermark before:

curl -XPUT -H "Content-Type: application/json" http://localhost:9200/_all/_settings -d '{"index.blocks.read_only_allow_delete": null}'

See How to fix ElasticSearch [FORBIDDEN/12/index read-only / allow delete (api)] for more details on that.

I do not recommend setting the values to zero (i.e. below 10 Megabytes), because using every last byte of available disk space might cause issues on your system, since more important applications will no longer be able to allocate disk space properly.

In order to view the current disk usage use

curl -XGET "http://localhost:9200/_cat/allocation?v&pretty"

See How to view & interpret disk space usage of your ElasticSearch cluster for more details.

Posted by Uli Köhler in Databases, ElasticSearch

How to insert test data into ElasticSearch 6.x

If you just want to insert some test documents into ElasticSearch 6.x, you can use this simple command:

curl -X POST "localhost:9200/mydocuments/_doc/" -H 'Content-Type: application/json' -d"
{
    \"test\" : true,
    \"post_date\" : \"$(date -Ins)\"
}"

Run this command multiple times to insert multiple documents!

In case of success, this will output a message like

{"_index":"mydocuments","_type":"_doc","_id":"vCxB82kBn2U9QxlET2aG","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":0,"_primary_term":1}

Also see the official docs on the indexing API.
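
If you prefer inserting the test documents from Python, a minimal sketch (assuming the official elasticsearch Python client and ElasticSearch 6.x, where a document type such as _doc is still required):

import datetime
from elasticsearch import Elasticsearch

es = Elasticsearch([{"host": "localhost"}])
result = es.index(index="mydocuments", doc_type="_doc", body={
    "test": True,
    "post_date": datetime.datetime.now().isoformat()
})
print(result["result"])  # e.g. "created"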

Posted by Uli Köhler in ElasticSearch

How to view & interpret disk space usage of your ElasticSearch cluster

In order to find out how much disk space every node of your elasticsearch cluster is using and how much disk space is remaining, use

curl -XGET "http://localhost:9200/_cat/allocation?v&pretty"

Example output for one node without any data:

shards disk.indices disk.used disk.avail disk.total disk.percent host       ip         node
     0           0b     2.4gb    200.9gb    203.3gb            1 172.18.0.2 172.18.0.2 TxYuHLF

This example output tells you:

  • shards: 0 The cluster currently has no shards. This means there is no data in the cluster
  • disk.indices: 0b The cluster currently uses 0 bytes of disk space for indexes.
  • disk.used: 2.4gb The disk ElasticSearch will store its data on has 2.4 Gigabytes of used space. This does not mean that ElasticSearch uses 2.4 Gigabytes; any other application (including the operating system) might also use (part of) that space.
  • disk.avail: 200.9gb The disk ElasticSearch will store its data on has 200.9 Gigabytes of free space. Remember that this space is not reserved for ElasticSearch alone; other applications might also consume some of it, depending on how you set up ElasticSearch.
  • disk.total: 203.3gb The disk ElasticSearch will store its data on has a total size of 203.3 gigabytes (total size as in available space on the filesystem without any files)
  • disk.percent: 1 Currently 1 % of the total disk space available (disk.total) is used. This value is always rounded to whole percent values.
  • host, ip, node: Which node this line is referring to.

Example with one node with some test data (see this TechOverflow post on how to generate test data):

shards disk.indices disk.used disk.avail disk.total disk.percent host       ip         node
     5        6.8kb     2.4gb     21.6gb       24gb           10 172.18.0.2 172.18.0.2 J3W5zqj
     5                                                                                 UNASSIGNED

As we can see, ElasticSearch now has 5 shards. Note that the second line tells us that 5 shards are UNASSIGNED. This is because ElasticSearch has been configured to make one replica for each shard and there is no second node where it can put the replica. For development configurations this is usually OK, but production configurations should usually have at least two nodes. See our docker-compose and systemd service generator for ElasticSearch for instructions on how to configure a local multi-node cluster using docker.
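
The same allocation table can be fetched from Python via the cat API. A minimal sketch (assuming the official elasticsearch Python client):

from elasticsearch import Elasticsearch

es = Elasticsearch([{"host": "localhost"}])
# Returns the same plain-text table as the curl command above
print(es.cat.allocation(v=True))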

Posted by Uli Köhler in ElasticSearch

How to backup all indices from ElasticSearch

You can use elasticdump to backup all indices from your ElasticSearch cluster. Install using

sudo npm install elasticdump -g

If you don’t have npm, see How to install NodeJS 10.x on Ubuntu in 1 minute.

This package installs two binaries: elasticdump (used to dump a single index) and multielasticdump (used to dump multiple indices in parallel).

We can use multielasticdump to dump all indexes:

mkdir -p es_backup
multielasticdump --direction=dump --input=http://localhost:9200 --output=es_backup

Restore using:

multielasticdump --direction=load --input=es_backup --output=http://localhost:9200

 

Posted by Uli Köhler in ElasticSearch, Linux

How to configure Google Cloud Kubernetes Elasticsearch Cluster with internal load balancer

Google Cloud offers a convenient way of installing an ElasticSearch cluster on top of a Google Cloud Kubernetes cluster. However, the documentation tells you to expose the ElasticSearch instance using

kubectl patch service/"elasticsearch-elasticsearch-svc" \
  --namespace "default" \
  --patch '{"spec": {"type": "LoadBalancer"}}'

However, this command will expose ElasticSearch on an external IP address, making it publicly accessible in the default configuration.

Here’s the equivalent command that will expose ElasticSearch via an internal load balancer with an internal IP address that is only reachable from within your Google Cloud network.

kubectl patch service/"elasticsearch-elasticsearch-svc" \
  --namespace "default" \
  --patch '{"spec": {"type": "LoadBalancer"}, "metadata": {"annotations": {"cloud.google.com/load-balancer-type": "Internal"}}}'

You might need to replace the name of your service (elasticsearch-elasticsearch-svc in this example) and possibly your namespace.

 

Posted by Uli Köhler in Cloud, ElasticSearch, Kubernetes

ElasticSearch equivalent to MongoDB .distinct(…)

Let’s say we have an ElasticSearch index called strings with a field pattern of {"type": "keyword"}.

Now we want to do the equivalent of MongoDB db.getCollection('...').distinct('pattern'):

Solution:

In Python you can use the iterate_distinct_field() helper from this previous post on ElasticSearch distinct. Full example:

from elasticsearch import Elasticsearch

es = Elasticsearch()

def iterate_distinct_field(es, fieldname, pagesize=250, **kwargs):
    """
    Helper to get all distinct values from ElasticSearch
    (ordered by number of occurrences)
    """
    compositeQuery = {
        "size": pagesize,
        "sources": [{
                fieldname: {
                    "terms": {
                        "field": fieldname
                    }
                }
            }
        ]
    }
    # Iterate over pages
    while True:
        result = es.search(**kwargs, body={
            "aggs": {
                "values": {
                    "composite": compositeQuery
                }
            }
        })
        # Yield each bucket
        for aggregation in result["aggregations"]["values"]["buckets"]:
            yield aggregation
        # Set "after" field
        if "after_key" in result["aggregations"]["values"]:
            compositeQuery["after"] = \
                result["aggregations"]["values"]["after_key"]
        else: # Finished!
            break

# Usage example
for result in iterate_distinct_field(es, fieldname="pattern.keyword", index="strings"):
    print(result) # e.g. {'key': {'pattern': 'mypattern'}, 'doc_count': 315}
Posted by Uli Köhler in Databases, ElasticSearch, Python

How to query distinct field values in ElasticSearch

Let’s say we have an ElasticSearch index called strings with a field pattern of {"type": "keyword"}.

Get the top N values of the column

If we want to get the top N (12 in our example) entries, i.e. the patterns that are present in the most documents, we can use this query:

{
    "aggs" : {
        "patterns" : {
            "terms" : {
                "field" : "pattern.keyword",
                "size": 12
            }
        }
    }
}

Full example in Python:

from elasticsearch import Elasticsearch

es = Elasticsearch()

result = es.search(index="strings", body={
    "aggs" : {
        "patterns" : {
            "terms" : {
                "field" : "pattern.keyword",
                "size": 12
            }
        }
    }
})
for aggregation in result["aggregations"]["patterns"]["buckets"]:
    print(aggregation) # e.g. {'key': 'mypattern', 'doc_count': 2802}

See the terms aggregation documentation for more infos.

Get all the distinct values of the column

Getting all the values is slightly more complicated since we need to use a composite aggregation that returns an after_key to paginate the query.

This Python helper function will automatically paginate the query with configurable page size:

from elasticsearch import Elasticsearch

es = Elasticsearch()

def iterate_distinct_field(es, fieldname, pagesize=250, **kwargs):
    """
    Helper to get all distinct values from ElasticSearch
    (ordered by number of occurrences)
    """
    compositeQuery = {
        "size": pagesize,
        "sources": [{
                fieldname: {
                    "terms": {
                        "field": fieldname
                    }
                }
            }
        ]
    }
    # Iterate over pages
    while True:
        result = es.search(**kwargs, body={
            "aggs": {
                "values": {
                    "composite": compositeQuery
                }
            }
        })
        # Yield each bucket
        for aggregation in result["aggregations"]["values"]["buckets"]:
            yield aggregation
        # Set "after" field
        if "after_key" in result["aggregations"]["values"]:
            compositeQuery["after"] = \
                result["aggregations"]["values"]["after_key"]
        else: # Finished!
            break

# Usage example
for result in iterate_distinct_field(es, fieldname="pattern.keyword", index="strings"):
    print(result) # e.g. {'key': {'pattern': 'mypattern'}, 'doc_count': 315}
Posted by Uli Köhler in Databases, ElasticSearch, Python