Technologies

mongodump/mongorestore minimal examples

Create & restore a database backup

mongodump --db mydb --out mydb.mongobackup

This will back up the database mydb from the MongoDB instance running at localhost and store the backup in the newly created mydb.mongobackup directory as BSON.

mongorestore mydb.mongobackup

This will restore the backup to localhost (the database name, mydb, is stored in the backup directory).

It will not overwrite or update existing documents, nor will it delete documents that are present in the database but missing from the backup.
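If you want to restore the backup under a different database name, newer versions of mongorestore (3.4 and later) support namespace remapping. A minimal sketch, assuming the hypothetical target name mydbcopy:

mongorestore --nsFrom='mydb.*' --nsTo='mydbcopy.*' mydb.mongobackup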

Restore backup with drop

mongorestore --drop mydb.mongobackup

This will drop (i.e. delete) each collection before importing from the backup. This means that

  • Existing documents will effectively be overwritten
  • Documents that are currently present but not present in the backup will be deleted

However, note that while the backup is being imported, some documents might be missing from the database until the restore has fully completed.

Note that this will not drop collections that are not present in the backup.
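If you want to run backups regularly, you can wrap mongodump in a small shell command. A minimal sketch, assuming the backup directory /var/backups/mongodb exists and daily timestamped directories are acceptable:

mongodump --db mydb --out "/var/backups/mongodb/mydb-$(date +%F)"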

Posted by Uli Köhler in Databases

What is a ‘headless’ program or application?

If you’re a backend programmer, you have most likely encountered the term headless applications (like headless Java or headless Chromium) many times. But what does headless actually mean?

Headless means that the application runs without a graphical user interface (GUI). In most cases, headless applications are command line applications or applications that are controlled from a programming language.
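A well-known example is headless Chromium, which can render a webpage entirely from the command line. A minimal sketch – depending on your distribution, the binary might be called chromium, chromium-browser or google-chrome:

chromium-browser --headless --disable-gpu --dump-dom 'https://techoverflow.net/' > page.html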

When you open your browser on your Desktop or Laptop, you’ll see the browser window pop up. This window is a graphical user interface (GUI). Unless you are living under a rock, you have had your fair share of experience with GUIs and know how convenient they can be sometimes.

However, especially in the professional IT world, command line user interfaces are also common – mostly, because they can be used on servers and local PCs with a screen alike, and they are much easier to automate than GUIs.

If you have worked in application development for some time, you might know that GUI applications require a huge number of incredibly complex libraries to run. On Linux that includes the X11 server (X11 is the system that displays and arranges all the windows and other graphical features on your screen), some utility libraries for X11 and possibly some libraries for creating the high-level user interfaces in use in most applications today: While high-level GUIs consist of buttons and text inputs, low-level GUIs consist of pixels, colored areas and callback functions (from which the buttons and text inputs are built).

As said before, headless applications run without a GUI. Under the right circumstances, this has two main advantages:

  • You don’t have to spend as much time developing a complex graphical user interface
  • You can easily run it on a server
Why can you run headless applications on a server but not normal GUI applications?

Firstly, as we said before, GUIs require a large number of complex libraries and software infrastructure to be present on a server. On many servers (like most Linux servers), this software is not installed by default – because installing it would eat up valuable resources like system memory (RAM) and hard drive space, and the system would be harder to maintain.

Contrary to common belief, running a GUI infrastructure on Linux (i.e. X server plus some utilities) does not require a screen or a dedicated graphics card. See our post on How to run X server using xserver-xorg-video-dummy driver on Ubuntu for an example on how to accomplish this.

Secondly, GUIs are hard to monitor, maintain and automate:

Imagine you have 25 servers running the same application on each of them. If that were a GUI application, you would have to look at 25 windows displaying the state of the application – not to mention the time spent to make all the windows display reliably on your local screen, or the development time spent building the GUI in the first place.

Using headless applications you can instead easily automate the task of monitoring the applications and therefore just display a summary on your local screen. Since you don’t have to spend time writing and maintaining GUIs, you can also spend your time more wisely.
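As a sketch of what such automation could look like – assuming hypothetical hostnames and a systemd service named myapp – checking the state of a headless application on many servers is a simple shell loop:

for host in server1 server2 server3; do
    echo -n "$host: " ; ssh "$host" systemctl is-active myapp
done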

Posted by Uli Köhler in Technologies

ElasticSearch: How to iterate / scroll through all documents in index

In ElasticSearch, you can use the Scroll API to scroll through all documents in an entire index.

In Python you can scroll like this:

def es_iterate_all_documents(es, index, pagesize=250, scroll_timeout="1m", **kwargs):
    """
    Helper to iterate ALL values from a single index
    Yields all the documents.
    """
    is_first = True
    while True:
        # Scroll next
        if is_first: # Initialize scroll
            result = es.search(index=index, scroll=scroll_timeout, **kwargs, body={
                "size": pagesize
            })
            is_first = False
        else:
            result = es.scroll(body={
                "scroll_id": scroll_id,
                "scroll": scroll_timeout
            })
        scroll_id = result["_scroll_id"]
        hits = result["hits"]["hits"]
        # Stop after no more docs
        if not hits:
            break
        # Yield each entry
        yield from (hit['_source'] for hit in hits)

This function will yield each document encountered in the index.

Example usage for index my_index:

from elasticsearch import Elasticsearch

es = Elasticsearch([{"host": "localhost"}])

for entry in es_iterate_all_documents(es, 'my_index'):
    print(entry) # Prints the document as stored in the DB
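Note that each scroll keeps a search context open on the cluster until scroll_timeout expires. If you stop iterating early (e.g. by breaking out of the loop), you can optionally free the context explicitly using the client's clear_scroll method – a sketch, assuming you kept track of the last scroll_id:

es.clear_scroll(scroll_id=scroll_id)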


Posted by Uli Köhler in Databases, ElasticSearch, Python

ElasticSearch: How to iterate all documents in index using Python (up to 10000 documents)

Important Note: This simple approach only works for up to ~10000 documents. Prefer using our scroll-based solution: See ElasticSearch: How to iterate / scroll through all documents in index

Use this helper function to iterate over all the documents in an index:

def es_iterate_all_documents(es, index, pagesize=250, **kwargs):
    """
    Helper to iterate ALL values from a single index
    Yields all the documents.
    """
    offset = 0
    while True:
        result = es.search(index=index, **kwargs, body={
            "size": pagesize,
            "from": offset
        })
        hits = result["hits"]["hits"]
        # Stop after no more docs
        if not hits:
            break
        # Yield each entry
        yield from (hit['_source'] for hit in hits)
        # Continue from there
        offset += pagesize

Usage example:

for entry in es_iterate_all_documents(es, 'my_index'):
    print(entry) # Prints the document as stored in the DB

How it works

You can iterate over all documents in an index in ElasticSearch by using queries like

{
    "size": 250,
    "from": 0
}

and increasing "from" by "size" after each iteration. Note that by default ElasticSearch rejects queries where from + size exceeds 10000 (controlled by the index.max_result_window setting) – this is why this simple approach only works for up to ~10000 documents.

Posted by Uli Köhler in Databases, ElasticSearch, Python

Fixing ElasticSearch ‘Unknown key for a VALUE_NUMBER in [offset].’

The error message

Unknown key for a VALUE_NUMBER in [offset].

in ElasticSearch tells you that in the query JSON you have specified an offset (numeric value) but ElasticSearch doesn’t know what to do with offset.

This is easy to fix: In order to specify an offset to start from, use from instead of offset.

Incorrect:

{
    "size": 250,
    "offset": 1000
}

Correct:

{
    "size": 250,
    "from": 1000
}


Posted by Uli Köhler in Databases, ElasticSearch

Fixing ElasticSearch ‘Unknown key for a START_ARRAY in […].’

When you get an error message like

elasticsearch.exceptions.RequestError: RequestError(400, 'parsing_exception', 'Unknown key for a START_ARRAY in [size].')

in ElasticSearch, look for the value in the [brackets]. In our example this is [size].

Your query contains an array for that key but an array is not allowed.

For example, this query is malformed:

{
    "size": [10, 11]
}

because size takes a number, not an array.

In Python, if you are programmatically building your query, you might have an extra comma at the end of your line.

For example, if you have a line like

query["size"] = 250,

this is equivalent to

query["size"] = (250,)

i.e. it will set size to a tuple with one element. In the JSON this will result in

{
    "size": [250]
}

In order to fix that issue, remove the comma from the end of the line

query["size"] = 250

which will result in the correct query JSON

{
    "size": 250
}
Posted by Uli Köhler in Databases, ElasticSearch

ElasticSearch equivalent to MongoDB .count()

The ElasticSearch equivalent to MongoDB’s count() is also called count. It can be used in a similar way.

When you have an ElasticSearch query like (example in Python)

result = es.search(index="my_index", body={
    "query": {
        "match": {
            "my_field": "my_value"
        }
    }
})
result_docs = [hit["_source"] for hit in result["hits"]["hits"]]

you can easily change it to a count-only query by replacing search with count:

result = es.count(index="my_index", body={
    "query": {
        "match": {
            "my_field": "my_value"
        }
    }
})
result_count = result["count"] # e.g. 200
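The same count query is also available through the REST API via the _count endpoint. For example, using curl (assuming ElasticSearch runs on localhost:9200):

curl -X GET "http://localhost:9200/my_index/_count" -H 'Content-Type: application/json' -d'{"query": {"match": {"my_field": "my_value"}}}'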
Posted by Uli Köhler in Databases, ElasticSearch

Fixing ElasticSearch ‘no [query] registered for [query]’

Problem:

You want to run a query in ElasticSearch, but you get an error message like

elasticsearch.exceptions.RequestError: RequestError(400, 'parsing_exception', 'no [query] registered for [query]')

Solution:

In your query body, you have two "query" objects nested in each other. Remove the outer "query", keeping only the inner one.

Example:

Incorrect:

{
    "query": {
        "query": {
            "match": {
                "my_field": "my_value"
            }
        }
    }
}

Correct:

{
    "query": {
        "match": {
            "my_field": "my_value"
        }
    }
}


Posted by Uli Köhler in Databases, ElasticSearch

How to fix ElasticSearch ‘Types cannot be provided in put mapping requests, unless the include_type_name parameter is set to true’

Problem:

You want to create a mapping in ElasticSearch but you see an error message like

elasticsearch.exceptions.RequestError: RequestError(400, 'illegal_argument_exception', 'Types cannot be provided in put mapping requests, unless the include_type_name parameter is set to true.')

Solution:

As already suggested in the error message, set the include_type_name parameter to True.

With the Python API this is as simple as adding include_type_name=True to the put_mapping(...) call:

es.indices.put_mapping(index='my_index', body=my_mapping, doc_type='_doc', include_type_name=True)

In case you now see an error like

TypeError: put_mapping() got an unexpected keyword argument 'include_type_name'

you need to upgrade your elasticsearch python library, e.g. using

sudo pip3 install --upgrade elasticsearch


Posted by Uli Köhler in Databases, ElasticSearch, Python

How to view ElasticSearch cluster health using curl

To view the cluster health of your ElasticSearch cluster use

curl -X GET "http://localhost:9200/_cluster/health?pretty=true"

If your ElasticSearch instance is not running on localhost, replace localhost with the hostname or IP address ElasticSearch is running on.

Example output:
{
  "cluster_name" : "docker-cluster",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 0,
  "active_shards" : 0,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}

The most important information from this is:

  • "cluster_name" : "docker-cluster" The name you assigned to your cluster. You should verify that you are connecting to the correct cluster. All ElasticSearch nodes from that cluster must have the same cluster name, or they won’t connect!
  • "number_of_nodes" : 1 The number of nodes currently in the cluster. Sometimes some nodes take longer to start up, so if there are some nodes missing, wait a minute and retry
  • "status" : "green" The status or cluster health of your cluster.

The cluster health can take three values:

  • green: Everything is OK with your cluster (like in our example)
  • yellow: Your cluster is mostly OK, but some shards couldn’t be replicated. This is often the case with clusters consisting of only one node (in that case, note that data lost on that node cannot be recovered)
  • red: Something is wrong with the cluster. Usually that’s some configuration issue, so be sure to check the logs.

Also see the official reference on cluster health
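If you prefer checking the cluster health from Python, the official elasticsearch client exposes the same API. A minimal sketch, assuming the client library is installed:

from elasticsearch import Elasticsearch

es = Elasticsearch([{"host": "localhost"}])
health = es.cluster.health()
print(health["status"]) # e.g. 'green'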

If you are looking for help on how to setup your ElasticSearch cluster using docker and docker-compose, you can generate your config file using our generator at ElasticSearch docker-compose.yml and systemd service generator.

Posted by Uli Köhler in Databases, ElasticSearch

How to fix ElasticSearch ‘[match] query doesn’t support multiple fields, found […] and […]’

Problem:

You want to run an ElasticSearch query like

{
    "query": {
        "match" : {
            "one_field" : "one_value",
            "another_field": "another_value"
        }
    }
}

but you only see an error message like

elasticsearch.exceptions.RequestError: RequestError(400, 'parsing_exception', "[match] query doesn't support multiple fields, found [one_field] and [another_field]")

Solution:

Match queries only support one field. You should use a bool query with a must clause containing multiple match queries instead:

{
    "query": {
        "bool": {
            "must": [
                {"match": {"one_field" : "one_value"}},
                {"match": {"another_field" : "another_value"}},
            ]
        }
    }
}
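Note that this is different from searching for the same value in multiple fields – for that use case, the multi_match query is the right tool. A minimal sketch:

{
    "query": {
        "multi_match": {
            "query": "my_value",
            "fields": ["one_field", "another_field"]
        }
    }
}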

Also see the official docs for the multi_match query.

Posted by Uli Köhler in Databases, ElasticSearch

How to fix ElasticSearch ‘no [query] registered for [missing]’

Problem:

You are trying to run an ElasticSearch query like

{
    "query": {
        "missing" : { "field" : "myfield" }
    }
}

to find documents that do not have myfield.

However you only see an error message like this:

elasticsearch.exceptions.RequestError: RequestError(400, 'parsing_exception', 'no [query] registered for [missing]')

Solution:

As the ElasticSearch documentation tells us, there is no missing query! You instead need to use an exists query inside a must_not clause:

{
    "query": {
        "bool": {
            "must_not": {
                "exists": {
                    "field": "myfield"
                }
            }
        }
    }
}


Posted by Uli Köhler in Databases, ElasticSearch

How to fix Google Cloud Build ignoring .dockerignore

Problem:

You want to run a docker image build on Google Cloud Build, but the client is trying to upload a huge build context to Google Cloud even though you have added all your large directories to your .dockerignore and the build works fine locally.

Solution:

Google Cloud Build ignores .dockerignore by design – the equivalent is called .gcloudignore.

You can copy the .dockerignore behaviour for gcloud by running

cp .dockerignore .gcloudignore
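Alternatively, .gcloudignore supports an include directive, so you can reference your .dockerignore instead of copying it (keeping both files in sync automatically). A sketch – put this single line into your .gcloudignore:

#!include:.dockerignore

See gcloud topic gcloudignore for details on the syntax.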


Posted by Uli Köhler in Cloud, Container, Docker

How to fix ElasticSearch ‘Root mapping definition has unsupported parameters’

Problem:

You want to create an ElasticSearch index with a custom mapping or update the mapping of an existing ElasticSearch index but you see an error message like

elasticsearch.exceptions.RequestError: RequestError(400, 'mapper_parsing_exception', 'Root mapping definition has unsupported parameters:  [mappings : {properties={num_total={type=integer}, approved={type=integer}, num_translated={type=integer}, pattern_length={type=integer}, num_unapproved={type=integer}, pattern={type=keyword}, num_approved={type=integer}, translated={type=integer}, untranslated={type=integer}, num_untranslated={type=integer}, group={type=keyword}}}]')

Solution:

This can point to multiple issues. Essentially, ElasticSearch is trying to tell you that the structure of your JSON is not correct.

Often this error is misinterpreted as individual field definitions being wrong, but this is rarely the issue (and only if an individual field definition is completely malformed).

If your message is structured like

... unsupported parameters:  [mappings : ...

then the most likely root cause is that you have mappings nested inside mappings in your JSON. This also applies if you update a mapping (put_mapping) – in this case the outer mapping is implicit!

Example: Your code looks like this:

es.indices.put_mapping(index='my_index', doc_type='_doc', body={
    "mappings": {
        "properties": {
            "pattern": {
                "type":  "keyword"
            }
        }
    }
})

ElasticSearch will internally create a JSON structure like this:

{
    "mappings": {
        "mappings": {
            "properties": {
                "pattern": {
                    "type":  "keyword"
                }
            }
        }
    }
}

See that there are two mappings nested inside each other? ElasticSearch does not view this as a correctly structured JSON, therefore you need to remove the outer "mappings": {...} from your code, resulting in

es.indices.put_mapping(index='my_index', doc_type='_doc', body={
    "properties": {
        "pattern": {
            "type":  "keyword"
        }
    }
})
Posted by Uli Köhler in Databases, ElasticSearch, Python

Fixing ElasticSearch ‘No handler for type [int] declared on field …’

Problem:

You want to create an index with a custom mapping in ElasticSearch but you see an error message like this:

elasticsearch.exceptions.RequestError: RequestError(400, 'mapper_parsing_exception', 'No handler for type [int] declared on field [id]')

Solution:

You likely have a mapping like

"id": {
    "type":  "int"
}

in your mapping properties.

The issue here is int: ElasticSearch uses integer as the type for integers, not int! (Other numeric types include long, short, byte, float and double.)

In order to fix the issue, change the property to

"id": {
    "type":  "integer"
}

and retry creating the index.


Posted by Uli Köhler in Databases, ElasticSearch

How to fix ElasticSearch [FORBIDDEN/12/index read-only / allow delete (api)]

If you try to index a document in ElasticSearch and you see an error message like this:

elasticsearch.exceptions.AuthorizationException: AuthorizationException(403, 'cluster_block_exception', 'blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];')

you can unlock writes to your cluster (all indexes) using

curl -XPUT -H "Content-Type: application/json" http://localhost:9200/_all/_settings -d '{"index.blocks.read_only_allow_delete": null}'

(thanks to Imran273 on StackOverflow for the original solution)

Note however that often there’s an underlying reason that caused ElasticSearch to lock writes to the index. Most often it is caused by exceeding the disk watermark / quota. See How to disable ElasticSearch disk quota / watermark for details on how to work around that.

Posted by Uli Köhler in Databases, ElasticSearch

How to disable ElasticSearch disk quota / watermark

In its default configuration, ElasticSearch will not allocate any more disk space when more than 90% of the disk is used overall (i.e. by ElasticSearch or other applications).

You can set the watermark extremely low using

curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient": {
    "cluster.routing.allocation.disk.watermark.low": "30mb",
    "cluster.routing.allocation.disk.watermark.high": "20mb",
    "cluster.routing.allocation.disk.watermark.flood_stage": "10mb",
    "cluster.info.update.interval": "1m"
  }
}
'

After doing that, you might need to unlock your cluster for write accesses if you had already exceeded your watermark before:

curl -XPUT -H "Content-Type: application/json" http://localhost:9200/_all/_settings -d '{"index.blocks.read_only_allow_delete": null}'

See How to fix ElasticSearch [FORBIDDEN/12/index read-only / allow delete (api)] for more details on that.

I do not recommend setting the values to (near) zero (i.e. below 10 Megabytes), because using every byte of available disk space might cause issues on your system, since more important applications would not be able to properly allocate disk space any more.
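If you later want to restore the default watermark behaviour, you can reset the transient settings by setting them to null:

curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient": {
    "cluster.routing.allocation.disk.watermark.low": null,
    "cluster.routing.allocation.disk.watermark.high": null,
    "cluster.routing.allocation.disk.watermark.flood_stage": null,
    "cluster.info.update.interval": null
  }
}
'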

In order to view the current disk usage use

curl -XGET "http://localhost:9200/_cat/allocation?v&pretty"

See How to view & interpret disk space usage of your ElasticSearch cluster for more details.

Posted by Uli Köhler in Databases, ElasticSearch

How to set default zone for Google Cloud project using gcloud command-line tool

Use this command to set the default zone for project myproject-123456 to europe-west4-a and the default region to europe-west4:

gcloud compute project-info add-metadata \
    --metadata google-compute-default-region=europe-west4,google-compute-default-zone=europe-west4-a \
    --project myproject-123456
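You can verify that the metadata has been set correctly using

gcloud compute project-info describe --project myproject-123456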

Also see the official reference for more detailed information.

Posted by Uli Köhler in Cloud

How to insert test data into ElasticSearch 6.x

If you just want to insert some test documents into ElasticSearch 6.x, you can use this simple command:

curl -X POST "localhost:9200/mydocuments/_doc/" -H 'Content-Type: application/json' -d"
{
    \"test\" : true,
    \"post_date\" : \"$(date -Ins)\"
}"

Run this command multiple times to insert multiple documents!

In case of success, this will output a message like

{"_index":"mydocuments","_type":"_doc","_id":"vCxB82kBn2U9QxlET2aG","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":0,"_primary_term":1}

Also see the official docs on the indexing API.
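If you prefer inserting test documents from Python, a roughly equivalent sketch using the official elasticsearch client (version 6.x) might look like this:

from datetime import datetime
from elasticsearch import Elasticsearch

es = Elasticsearch([{"host": "localhost"}])
# Insert one test document; ElasticSearch auto-generates the document ID
result = es.index(index="mydocuments", doc_type="_doc", body={
    "test": True,
    "post_date": datetime.now().isoformat()
})
print(result["result"]) # e.g. 'created'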

Posted by Uli Köhler in ElasticSearch

How to view & interpret disk space usage of your ElasticSearch cluster

In order to find out how much disk space every node of your ElasticSearch cluster is using and how much disk space is remaining, use

curl -XGET "http://localhost:9200/_cat/allocation?v&pretty"

Example output for one node without any data:

shards disk.indices disk.used disk.avail disk.total disk.percent host       ip         node
     0           0b     2.4gb    200.9gb    203.3gb            1 172.18.0.2 172.18.0.2 TxYuHLF

This example output tells you:

  • shards: 0 The cluster currently has no shards. This means there is no data in the cluster
  • disk.indices: 0b The cluster currently uses 0 bytes of disk space for indexes.
  • disk.used: 2.4gb The disk ElasticSearch will store its data on has 2.4 Gigabytes of used space. This does not mean that ElasticSearch itself uses 2.4 Gigabytes – any other application (including the operating system) might also use (part of) that space.
  • disk.avail: 200.9gb The disk ElasticSearch will store its data on has 200.9 Gigabytes of free space. Remember that this value will not only shrink when ElasticSearch stores data on said disk – other applications might also consume some of the disk space, depending on how you set up ElasticSearch.
  • disk.total: 203.3gb The disk ElasticSearch will store its data on has a total size of 203.3 Gigabytes (i.e. the total capacity of the filesystem, regardless of how much of it is currently in use).
  • disk.percent: 1 Currently 1 % of the total disk space available (disk.total) is used. This value is always rounded to full percents.
  • host, ip, node: Which node this line is referring to.

Example with one node with some test data (see this TechOverflow post on how to generate test data):

shards disk.indices disk.used disk.avail disk.total disk.percent host       ip         node
     5        6.8kb     2.4gb     21.6gb       24gb           10 172.18.0.2 172.18.0.2 J3W5zqj
     5                                                                                 UNASSIGNED

As we can see, ElasticSearch now has 5 shards. Note that the second line tells us that 5 shards are UNASSIGNED. This is because ElasticSearch has been configured to make one replica for each shard and there is no second node where it can put the replica. For development configurations this is usually OK, but production configurations should usually have at least two nodes. See our docker-compose and systemd service generator for ElasticSearch for instructions on how to configure a local multi-node cluster using docker.

Posted by Uli Köhler in ElasticSearch