Databases

Fixing ElasticSearch ‘No handler for type [int] declared on field …’

Problem:

You want to create an index with a custom mapping in ElasticSearch but you see an error message like this:

elasticsearch.exceptions.RequestError: RequestError(400, 'mapper_parsing_exception', 'No handler for type [int] declared on field [id]')

Solution:

You likely have a mapping like

"id": {
    "type":  "int"
}

in your mapping properties.

The issue here is int: the ElasticSearch type for integers is called integer, not int!

In order to fix the issue, change the property to

"id": {
    "type":  "integer"
}

and retry creating the index.
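
For reference, here is what a complete index creation request with the corrected mapping might look like. This is a minimal sketch assuming ElasticSearch 6.x, where the properties are nested under a mapping type such as _doc (on 7.x and later, drop that level); the index name myindex is just a placeholder:

curl -X PUT "localhost:9200/myindex" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "_doc": {
      "properties": {
        "id": { "type": "integer" }
      }
    }
  }
}
'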

 

Posted by Uli Köhler in Databases, ElasticSearch

How to fix ElasticSearch [FORBIDDEN/12/index read-only / allow delete (api)]

If you try to index a document in ElasticSearch and you see an error message like this:

elasticsearch.exceptions.AuthorizationException: AuthorizationException(403, 'cluster_block_exception', 'blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];')

you can unlock writes to your cluster (all indexes) using

curl -XPUT -H "Content-Type: application/json" http://localhost:9200/_all/_settings -d '{"index.blocks.read_only_allow_delete": null}'

(thanks to Imran273 on StackOverflow for the original solution)
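
To check which indexes are currently blocked (or to verify that the block is gone after running the command above), you can grep the index settings; a quick sketch:

curl -s "http://localhost:9200/_all/_settings?pretty" | grep -B5 read_only_allow_delete

If this prints nothing, no index currently carries the read-only block.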

Note however that often there’s an underlying reason that caused ElasticSearch to lock writes to the index. Most often it is caused by exceeding the disk watermark / quota. See How to disable ElasticSearch disk quota / watermark for details on how to work around that.

Posted by Uli Köhler in Databases, ElasticSearch

How to disable ElasticSearch disk quota / watermark

In its default configuration, ElasticSearch will not allocate any more disk space when more than 90% of the disk is used overall (i.e. by ElasticSearch or other applications).

You can set the watermark extremely low using

curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient": {
    "cluster.routing.allocation.disk.watermark.low": "30mb",
    "cluster.routing.allocation.disk.watermark.high": "20mb",
    "cluster.routing.allocation.disk.watermark.flood_stage": "10mb",
    "cluster.info.update.interval": "1m"
  }
}
'
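
Note that transient settings are lost when the whole cluster restarts. If you want these low watermarks to survive a restart, you can set them as persistent instead (same values, different top-level key):

curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "cluster.routing.allocation.disk.watermark.low": "30mb",
    "cluster.routing.allocation.disk.watermark.high": "20mb",
    "cluster.routing.allocation.disk.watermark.flood_stage": "10mb",
    "cluster.info.update.interval": "1m"
  }
}
'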

After doing that, you might need to unlock your cluster for writes if you had already exceeded your watermark before:

curl -XPUT -H "Content-Type: application/json" http://localhost:9200/_all/_settings -d '{"index.blocks.read_only_allow_delete": null}'

See How to fix ElasticSearch [FORBIDDEN/12/index read-only / allow delete (api)] for more details on that.

I do not recommend setting the values to zero (i.e. below 10 Megabytes): using every last byte of available disk space might cause issues on your system, since more important applications will no longer be able to allocate disk space properly.

In order to view the current disk usage, use

curl -XGET "http://localhost:9200/_cat/allocation?v&pretty"

See How to view & interpret disk space usage of your ElasticSearch cluster for more details.

Posted by Uli Köhler in Databases, ElasticSearch

How to insert test data into ElasticSearch 6.x

If you just want to insert some test documents into ElasticSearch 6.x, you can use this simple command:

curl -X POST "localhost:9200/mydocuments/_doc/" -H 'Content-Type: application/json' -d"
{
    \"test\" : true,
    \"post_date\" : \"$(date -Ins)\"
}"

Run this command multiple times to insert multiple documents!
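
If you want more than a handful of test documents, a simple shell loop around the same request does the trick; a sketch, with the document count and index name as values to adjust:

for i in $(seq 1 100); do
    curl -s -X POST "localhost:9200/mydocuments/_doc/" -H 'Content-Type: application/json' -d"
{
    \"test\" : true,
    \"post_date\" : \"$(date -Ins)\"
}"
    echo # print a newline after each JSON response
done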

In case of success, this will output a message like

{"_index":"mydocuments","_type":"_doc","_id":"vCxB82kBn2U9QxlET2aG","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":0,"_primary_term":1}

Also see the official docs on the indexing API.

Posted by Uli Köhler in ElasticSearch

How to view & interpret disk space usage of your ElasticSearch cluster

In order to find out how much disk space every node of your ElasticSearch cluster is using and how much disk space remains, use

curl -XGET "http://localhost:9200/_cat/allocation?v&pretty"

Example output for one node without any data:

shards disk.indices disk.used disk.avail disk.total disk.percent host       ip         node
     0           0b     2.4gb    200.9gb    203.3gb            1 172.18.0.2 172.18.0.2 TxYuHLF

This example output tells you:

  • shards: 0 The cluster currently has no shards, which means there is no data in the cluster.
  • disk.indices: 0b The cluster currently uses 0 bytes of disk space for indexes.
  • disk.used: 2.4gb The disk ElasticSearch will store its data on has 2.4 Gigabytes of used space. This does not mean that ElasticSearch uses 2.4 Gigabytes; any other application (including the operating system) might also use (part of) that space.
  • disk.avail: 200.9gb The disk ElasticSearch will store its data on has 200.9 Gigabytes of free space. Remember that this value does not shrink only when ElasticSearch stores data on said disk; other applications might also consume some of that space, depending on how you set up ElasticSearch.
  • disk.total: 203.3gb The disk ElasticSearch will store its data on has a total size of 203.3 Gigabytes, i.e. the total capacity of the filesystem regardless of how much of it is in use.
  • disk.percent: 1 Currently 1 % of the total disk space (disk.total) is used. This value is always rounded to whole percents.
  • host, ip, node: Which node this line is referring to.

Example with one node with some test data (see this TechOverflow post on how to generate test data):

shards disk.indices disk.used disk.avail disk.total disk.percent host       ip         node
     5        6.8kb     2.4gb     21.6gb       24gb           10 172.18.0.2 172.18.0.2 J3W5zqj
     5                                                                                 UNASSIGNED

As we can see, ElasticSearch now has 5 shards. Note that the second line tells us that 5 shards are UNASSIGNED. This is because ElasticSearch has been configured to make one replica for each shard and there is no second node where it can put the replica. For development configurations this is usually OK, but production configurations should usually have at least two nodes. See our docker-compose and systemd service generator for ElasticSearch for instructions on how to configure a local multi-node cluster using docker.
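
If you want to watch how these numbers evolve, e.g. while bulk-indexing a larger dataset, you can poll the allocation endpoint periodically; a sketch using watch with a 10 second refresh interval:

watch -n10 'curl -sS "http://localhost:9200/_cat/allocation?v&pretty"'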

Posted by Uli Köhler in ElasticSearch

How to backup all indices from ElasticSearch

You can use elasticdump to back up all indices from your ElasticSearch cluster. Install it using

sudo npm install elasticdump -g

If you don’t have npm, see How to install NodeJS 10.x on Ubuntu in 1 minute.

This package installs two binaries: elasticdump (used to dump a single index) and multielasticdump (used to dump multiple indices in parallel).

We can use multielasticdump to dump all indexes:

mkdir -p es_backup
multielasticdump --direction=dump --input=http://localhost:9200 --output=es_backup

Restore using:

multielasticdump --direction=load --input=es_backup --output=http://localhost:9200
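
In case you only need a single index, the plain elasticdump binary can dump its mapping and data into JSON files; a sketch, with myindex as a placeholder index name:

elasticdump --input=http://localhost:9200/myindex --output=myindex_mapping.json --type=mapping
elasticdump --input=http://localhost:9200/myindex --output=myindex_data.json --type=data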

 

Posted by Uli Köhler in ElasticSearch, Linux

How to configure Google Cloud Kubernetes Elasticsearch Cluster with internal load balancer

Google Cloud offers a convenient way of installing an ElasticSearch cluster on top of a Google Cloud Kubernetes cluster. However, the documentation tells you to expose the ElasticSearch instance using

kubectl patch service/"elasticsearch-elasticsearch-svc" \
  --namespace "default" \
  --patch '{"spec": {"type": "LoadBalancer"}}'

Unfortunately, this command will expose ElasticSearch on an external IP address, making it publicly accessible in the default configuration.

Here’s the equivalent command that exposes ElasticSearch via an internal load balancer instead, i.e. on an internal IP address that is only reachable from within your Google Cloud network.

kubectl patch service/"elasticsearch-elasticsearch-svc" \
  --namespace "default" \
  --patch '{"spec": {"type": "LoadBalancer"}, "metadata": {"annotations": {"cloud.google.com/load-balancer-type": "Internal"}}}'

You might need to replace the name of your service (elasticsearch-elasticsearch-svc in this example) and possibly your namespace.
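
To verify that the service actually received an internal address, inspect it using kubectl; the EXTERNAL-IP column should now show an address from your cluster’s internal subnet instead of a public IP:

kubectl get service "elasticsearch-elasticsearch-svc" --namespace "default"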

 

Posted by Uli Köhler in Cloud, ElasticSearch, Kubernetes

ElasticSearch equivalent to MongoDB .distinct(…)

Let’s say we have an ElasticSearch index called strings with a field pattern of {"type": "keyword"}.

Now we want to do the equivalent of MongoDB db.getCollection('...').distinct('pattern'):

Solution:

In Python you can use the iterate_distinct_field() helper from this previous post on ElasticSearch distinct. Full example:

from elasticsearch import Elasticsearch

es = Elasticsearch()

def iterate_distinct_field(es, fieldname, pagesize=250, **kwargs):
    """
    Helper to get all distinct values from ElasticSearch
    (ordered by number of occurrences)
    """
    compositeQuery = {
        "size": pagesize,
        "sources": [{
                fieldname: {
                    "terms": {
                        "field": fieldname
                    }
                }
            }
        ]
    }
    # Iterate over pages
    while True:
        result = es.search(**kwargs, body={
            "aggs": {
                "values": {
                    "composite": compositeQuery
                }
            }
        })
        # Yield each bucket
        for aggregation in result["aggregations"]["values"]["buckets"]:
            yield aggregation
        # Set "after" field
        if "after_key" in result["aggregations"]["values"]:
            compositeQuery["after"] = \
                result["aggregations"]["values"]["after_key"]
        else: # Finished!
            break

# Usage example
for result in iterate_distinct_field(es, fieldname="pattern.keyword", index="strings"):
    print(result) # e.g. {'key': {'pattern': 'mypattern'}, 'doc_count': 315}

Posted by Uli Köhler in Databases, ElasticSearch, Python

How to query distinct field values in ElasticSearch

Let’s say we have an ElasticSearch index called strings with a field pattern of {"type": "keyword"}.

Get the top N values of the column

If we want to get the top N (12 in our example) entries, i.e. the patterns that are present in the most documents, we can use this query:

{
    "aggs" : {
        "patterns" : {
            "terms" : {
                "field" : "pattern.keyword",
                "size": 12
            }
        }
    }
}
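
If you prefer curl over the Python client, you can run the same aggregation directly against the search endpoint. Adding "size": 0 suppresses the document hits so that only the aggregation buckets are returned; strings is the index from our example:

curl -X POST "localhost:9200/strings/_search?pretty" -H 'Content-Type: application/json' -d'
{
    "size": 0,
    "aggs" : {
        "patterns" : {
            "terms" : {
                "field" : "pattern.keyword",
                "size": 12
            }
        }
    }
}
'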

Full example in Python:

from elasticsearch import Elasticsearch

es = Elasticsearch()

result = es.search(index="strings", body={
    "aggs" : {
        "patterns" : {
            "terms" : {
                "field" : "pattern.keyword",
                "size": 12
            }
        }
    }
})
for aggregation in result["aggregations"]["patterns"]["buckets"]:
    print(aggregation) # e.g. {'key': 'mypattern', 'doc_count': 2802}

See the terms aggregation documentation for more information.

Get all the distinct values of the column

Getting all the values is slightly more complicated since we need to use a composite aggregation that returns an after_key to paginate the query.

This Python helper function will automatically paginate the query with configurable page size:

from elasticsearch import Elasticsearch

es = Elasticsearch()

def iterate_distinct_field(es, fieldname, pagesize=250, **kwargs):
    """
    Helper to get all distinct values from ElasticSearch
    (ordered by number of occurrences)
    """
    compositeQuery = {
        "size": pagesize,
        "sources": [{
                fieldname: {
                    "terms": {
                        "field": fieldname
                    }
                }
            }
        ]
    }
    # Iterate over pages
    while True:
        result = es.search(**kwargs, body={
            "aggs": {
                "values": {
                    "composite": compositeQuery
                }
            }
        })
        # Yield each bucket
        for aggregation in result["aggregations"]["values"]["buckets"]:
            yield aggregation
        # Set "after" field
        if "after_key" in result["aggregations"]["values"]:
            compositeQuery["after"] = \
                result["aggregations"]["values"]["after_key"]
        else: # Finished!
            break

# Usage example
for result in iterate_distinct_field(es, fieldname="pattern.keyword", index="strings"):
    print(result) # e.g. {'key': {'pattern': 'mypattern'}, 'doc_count': 315}

Posted by Uli Köhler in Databases, ElasticSearch, Python

How to fix ElasticSearch ‘Fielddata is disabled on text fields by default’ for keyword field

Problem:

You have a field in ElasticSearch named e.g. pattern of type keyword. However, when you query for an aggregation of this field, e.g.

es.search(index="strings", body={
    "size": 0,
    "aggs" : {
        "patterns" : {
            "terms" : { "field" : "pattern" }
        }
    }
})

you see this error message:

elasticsearch.exceptions.RequestError: RequestError(400, 'search_phase_execution_exception', 'Fielddata is disabled on text fields by default. Set fielddata=true on [pattern] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead.'

Solution:

This error message is confusing since you already have a keyword field. However, the ElasticSearch fielddata documentation tells us that you need to use pattern.keyword in the query instead of just pattern.

Full example:

es.search(index="strings", body={
    "size": 0,
    "aggs" : {
        "patterns" : {
            "terms" : { "field" : "pattern.keyword" }
        }
    }
})

Posted by Uli Köhler in Databases, ElasticSearch

How to fix ‘elasticsearch exited with code 78’

Problem:

You want to run ElasticSearch using docker, but the container immediately stops again with this error message:

elasticsearch exited with code 78

or

elasticsearch2 exited with code 78

Solution:

If you look through the entire log message, you’ll find lines like

elasticsearch     | [1]: max virtual memory areas vm.max_map_count [65530] is too low, increase to at least [262144]

Therefore we need to increase the vm.max_map_count limit:

sudo sysctl -w vm.max_map_count=524288

Now we need to edit /etc/sysctl.conf so the setting will also be in effect after a reboot.

Look for any vm.max_map_count line in /etc/sysctl.conf. If you find one, set its value to 524288. If there is no such line present, add the line

vm.max_map_count=524288

to the end of /etc/sysctl.conf
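
Both steps can also be done from the shell. Note that this appends unconditionally, so make sure there is no existing vm.max_map_count line first:

echo "vm.max_map_count=524288" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p # reload /etc/sysctl.conf
sysctl vm.max_map_count # verify the new value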

Original source: GitHub

 

Posted by Uli Köhler in Container, Databases, Docker, Linux

MongoDB: How to run db.adminCommand() in NodeJS

Problem:

You want to run a db.adminCommand() in NodeJS using the node-mongodb-native client, e.g. you want to run the NodeJS equivalent of

db.adminCommand({setParameter: 1, internalQueryExecMaxBlockingSortBytes: 100151432});

Solution:

Use db.executeDbAdminCommand() where db is a MongoDB database object:

db.executeDbAdminCommand({setParameter: 1, internalQueryExecMaxBlockingSortBytes: 100151432});

Full example:

// To install, use npm i --save mongodb
const MongoClient = require('mongodb').MongoClient;

async function configureMongoDB() {
    // Connect to MongoDB
    const conn = await MongoClient.connect('mongodb://localhost:27017/', { useNewUrlParser: true });
    const db = await conn.db('mydb');
    // Configure MongoDB settings
    await db.executeDbAdminCommand({
        setParameter: 1,
        internalQueryExecMaxBlockingSortBytes: 100151432
    });
    // Cleanup
    return conn.close();
}

// Run configureMongoDB()
configureMongoDB().then(() => {}).catch(console.error)

 

Posted by Uli Köhler in Databases, NodeJS

How to fix NodeJS MongoDB ‘Cannot read property ‘high_’ of null’

When encountering an error message like

TypeError: Cannot read property 'high_' of null
    at Long.equals (/home/uli/dev/NMUN/node_modules/bson/lib/bson/long.js:236:31)
    at nextFunction (/home/uli/dev/NMUN/node_modules/mongodb-core/lib/cursor.js:473:16)
    at Cursor.next (/home/uli/dev/NMUN/node_modules/mongodb-core/lib/cursor.js:763:3)
    at Cursor._next (/home/uli/dev/NMUN/node_modules/mongodb/lib/cursor.js:211:36)
    at nextObject (/home/uli/dev/NMUN/node_modules/mongodb/lib/operations/cursor_ops.js:192:10)
    at hasNext (/home/uli/dev/NMUN/node_modules/mongodb/lib/operations/cursor_ops.js:135:3)
    (...)

you likely have code like this:

const cursor = db.collection('mycollection').find({})
while (cursor.hasNext()) {
    const doc = cursor.next();
    // ... handle doc ...
}

The solution is quite simple: since cursor.hasNext() and cursor.next() return Promises, you can’t use their return values directly (find() itself returns the cursor synchronously, so it does not need to be awaited).

This example shows you how to do it properly using async/await:

const cursor = db.collection('mycollection').find({})
while (await cursor.hasNext()) {
    const doc = await cursor.next();
    // ... handle doc ...
}

In order to do this, remember that the function containing this code needs to be an async function. See the Mozilla documentation or search for a JavaScript async tutorial to learn about the details!

Posted by Uli Köhler in Databases, Javascript

How to install MongoDB CE on Ubuntu in 1 minute

Quick install using

wget -qO- https://techoverflow.net/scripts/install-mongodb.sh | bash

Alternatively, run these shell commands on your Ubuntu computer to install the current MongoDB community edition and start it automatically (both instantly and on bootup):

sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 4B7C549A058F8B6B
echo "deb [ arch=amd64 ] https://repo.mongodb.org/apt/ubuntu bionic/mongodb-org/4.2 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-4.2.list
sudo apt-get update
sudo apt-get install -y mongodb-org
sudo systemctl enable mongod
sudo systemctl start mongod
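
To quickly check that the server is actually up, you can ask it for its connection status using the mongo shell (which is installed as part of mongodb-org); the output should contain "ok" : 1:

mongo --eval 'db.runCommand({ connectionStatus: 1 })'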

Source: Official MongoDB documentation

Posted by Uli Köhler in Databases, Linux

How to make mongodump export JSON instead of BSON

mongodump exports your MongoDB database in BSON format; if you want JSON instead, you need to use the mongoexport tool.

Examples:

mongoexport -d mydatabase -c mycollection # Prints to stdout
mongoexport -d mydatabase -c mycollection -o mycollections.json # Write to mycollections.json
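
By default, mongoexport writes one JSON document per line. If you need a single well-formed JSON array instead, e.g. to parse the whole file at once, add the --jsonArray flag:

mongoexport -d mydatabase -c mycollection --jsonArray -o mycollections.json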

Note that some datatypes that can be stored in MongoDB can only be represented in BSON – so in some cases, information may be lost by exporting to JSON.

Posted by Uli Köhler in Databases

Exporting campaign contact CSVs from VTiger using SQL

In Exporting campaign account CSVs from VTiger using SQL we investigated how to use direct SQL database access to export names and emails for organizations for a given campaign.

This post shows how to do the same thing for contacts instead of accounts. Please refer to the previous post for a detailed explanation of how the query works.

SELECT CONCAT(cont.firstname, " ", cont.lastname) AS "Name",
       cont.email AS "EMail"
FROM vtiger_campaigncontrel AS ccr
JOIN vtiger_contactdetails cont ON (cont.contactid = ccr.contactid)
WHERE `campaignid` = (
    SELECT postvalue FROM `vtiger_modtracker_detail`
    WHERE fieldname = "record_id" AND `id` = (
        SELECT `id` FROM `vtiger_modtracker_detail`
        WHERE postvalue = (
            SELECT campaign_no FROM vtiger_campaign
            WHERE `campaignname` = "MyCampaign"
        ) AND fieldname = "campaign_no"
    )
) AND cont.email <> ''

Besides joining on vtiger_contactdetails and vtiger_campaigncontrel instead of their account-related equivalents, the only major difference is in assembling the name. For contacts, first name and last name are stored separately. Therefore we use CONCAT(cont.firstname, " ", cont.lastname) to get a single name string out of that information.

As shown in the previous post, you can use phpmyadmin to export the result of the query to CSV or to Excel directly. While this might seem like a tedious process, it is much less tedious than copy-and-pasting hundreds of contacts manually.

Posted by Uli Köhler in Databases

Exporting campaign account CSVs from VTiger using SQL

Update: Related post for exporting contacts instead of accounts

Although VTiger works great as a CRM, its campaign management is missing one of the most important features: it doesn’t support exporting the campaign members.

This post aims to provide technically-oriented users with a guide on how to use SQL queries to perform that task. We assume basic knowledge of SQL syntax, MySQL and phpMyAdmin.

The goal for our example is to export a table of Account name and E-Mail. For the email, the primary email will be used, unless it’s empty, in which case the secondary email is used.

The input for the query is the campaign name (in our example, this will be MyCampaign).

Querying the campaign number

The campaign number is a string, e.g. CAM4 or CAM9, which is stored in vtiger_campaign.

SELECT campaign_no FROM vtiger_campaign WHERE `campaignname` = "MyCampaign"

Finding the campaign record ID

In the campaign-to-account n-to-n relations table, the campaign number is not used directly – instead, the campaign record ID is used. We need to query that from vtiger_modtracker_detail, which contains key-value metadata about the campaigns (among other things). One of the K-V pairs contains the campaign identifier (key campaign_no) whereas another one contains the record ID (key record_id).

SELECT postvalue FROM `vtiger_modtracker_detail`
WHERE `id` = (
    SELECT `id` FROM `vtiger_modtracker_detail`
    WHERE postvalue = <<campaign number>> AND fieldname = "campaign_no"
) AND fieldname = "record_id"

Full query:

SELECT postvalue FROM `vtiger_modtracker_detail`
WHERE `id` = (
    SELECT `id` FROM `vtiger_modtracker_detail`
    WHERE postvalue = (
        SELECT campaign_no FROM vtiger_campaign
        WHERE `campaignname` = "MyCampaign"
    ) AND fieldname = "campaign_no"
) AND fieldname = "record_id"

The result of this query is the numeric campaign record ID, e.g. 315.

Querying campaign-to-account relations

We can now use the campaign record ID to query the n-to-n table vtiger_campaignaccountrel. There are other, similar tables for leads (vtiger_campaignleadrel) and contacts (vtiger_campaigncontrel).

SELECT accountid FROM vtiger_campaignaccountrel WHERE `campaignid` = <<campaign record ID>>

Full query:

SELECT accountid FROM vtiger_campaignaccountrel
WHERE `campaignid` = (
    SELECT postvalue FROM `vtiger_modtracker_detail`
    WHERE fieldname = "record_id" AND `id` = (
        SELECT `id` FROM `vtiger_modtracker_detail`
        WHERE postvalue = (
            SELECT campaign_no FROM vtiger_campaign
            WHERE `campaignname` = "MyCampaign"
        ) AND fieldname = "campaign_no"
    )
)

The result of this query is a list of account IDs related to the campaign. The order is usually the same as displayed in VTiger.

Querying account information

Using the list of account IDs, we can JOIN on vtiger_account to query the account details. Note that you need to use the appropriate table for leads or contacts if you used an alternative relations table in the previous step.

SELECT acc.accountname AS "Name",
       COALESCE(acc.email1, acc.email2) AS "EMail"
FROM vtiger_campaignaccountrel AS car
JOIN vtiger_account acc ON (acc.accountid = car.accountid)
WHERE `campaignid` = (
    SELECT postvalue FROM `vtiger_modtracker_detail`
    WHERE fieldname = "record_id" AND `id` = (
        SELECT `id` FROM `vtiger_modtracker_detail`
        WHERE postvalue = (
            SELECT campaign_no FROM vtiger_campaign
            WHERE `campaignname` = "MyCampaign"
        ) AND fieldname = "campaign_no"
    )
) AND (acc.email1 <> '' OR acc.email2 <> '')

Note that besides the SELECT column list there are some aspects which are specific to the Name and Email query:

  1. COALESCE(acc.email1, acc.email2) uses email2 if email1 is empty
  2. AND (email1 <> '' OR email2 <> '') ignores accounts that have neither email1 nor email2 set

If you want a custom list of columns, you need to modify that appropriately.

Exporting CSV

When entering the query into a tool like PHPMyAdmin, you can click the Export button in order to export the result to different formats, including CSV and Excel.

While this might not be the most convenient option, it’s a very flexible way of generating and exporting the data. Any more convenient ways are likely to be less flexible.

Posted by Uli Köhler in Databases

Compiling LevelDB as LLVM binary on Linux

Some time ago I wrote a guide on how to compile and install LevelDB on Linux.

Recently I’ve been desperately trying to get into LLVM, and a tutorial series on how to use LLVM with C/C++ is coming shortly.

As I’m using LevelDB in many of my projects, I’d like a way of generating an LLVM IR (intermediate representation) of the LevelDB C++ source – I could link an LLVM program to the native binary, but in order to profit from LLVM’s features I suppose using IRs for as many dependencies as possible is the way to go.

Generally there are two ways to go:

  1. Use the g++ LLVM backend
  2. Use clang++

I usually tend to use clang++ for LLVM tasks: even with colorgcc and some recent improvements in gcc’s error message generation, I still prefer the clang++ error messages, despite having way more experience with gcc’s. Additionally, g++ with the LLVM backend seems to have some bugs, including interpreting -emit-llvm as -e -m -i …. Moreover, recent distribution versions don’t work too well with the LLVM gold plugin, and it has proved difficult to reliably tell GCC to use llvm-ld as its linker.
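
For reference, generating LLVM IR from a single C++ translation unit with clang++ boils down to the following; a sketch where the source file and include path are placeholders to adapt to your LevelDB checkout:

clang++ -emit-llvm -c -Iinclude db/db_impl.cc -o db_impl.bc # binary LLVM bitcode
clang++ -emit-llvm -S -Iinclude db/db_impl.cc -o db_impl.ll # human-readable IR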


Posted by Uli Köhler in C/C++, Databases, LLVM