How to circumvent Google Cloud Storage 1000 read / 400 write limit in Python

Google Cloud Datastore has a built-in 1000 keys limit for get requests and a 400 entities per request for put limit. If you hit it, you will see one of these error messages:

google.api_core.exceptions.InvalidArgument: 400 cannot get more than 1000 keys in a single call
google.api_core.exceptions.InvalidArgument: 400 cannot write more than 500 entities in a single call

You can fix it by chunking the requests, i.e. only do 1000 requests at one time for get etc.

This code provides a ready-to-use example for a class that automates this process. As an added benefit, it performs the requests in chunks of 1000 (for get) or 400 (for put) in parallel using a concurrent.futures.Executor. As the performance is expected to be IO-bound, it is recommended to use a concurrent.futures.ThreadPoolExecutor.
If you dont give the class an executor on construction, it will create one by itself.

import itertools
from concurrent.futures import ThreadPoolExecutor

def _chunks(l, n=1000):
    """
    Yield successive n-sized chunks from l.
    https://stackoverflow.com/a/312464/2597135
    """
    for i in range(0, len(l), n):
        yield l[i:i + n]

def _get_chunk(client, keys):
    """
    Get a single chunk
    """
    missing = []
    vals = client.get_multi(keys, missing=missing)
    return vals, missing

class DatastoreChunkClient(object):
    """
    Provides a thin wrapper around a Google Cloud Datastore client providing means
    of reading nd
    """
    def __init__(self, client, executor=None):
        self.client = client
        if executor is None:
            executor = ThreadPoolExecutor(16)
        self.executor = executor
    
    def get_multi(self, keys):
        """
        Thin wrapper around client.get_multi() that circumvents
        the 1000 read requests limit by doing 1000-sized chunked reads
        in parallel using self.executor.

        Returns (values, missing).
        """
        all_missing = []
        all_vals = []
        for vals, missing in self.executor.map(lambda chunk: _get_chunk(self.client, chunk), _chunks(keys, 1000)):
            all_vals += vals
            all_missing += missing
        return all_vals, all_missing

    def put_multi(self, entities):
        """
        Thin wrapper around client.put_multi() that circumvents
        the 400 read requests limit by doing 400-sized chunked reads
        in parallel using self.executor.

        Returns (values, missing).
        """
        for none in self.executor.map(lambda chunk: self.client.put_multi(chunk), _chunks(entities, 400)):
            pass

Usage example:

# Create "raw" google datastore client
client = datastore.Client(project="myproject-123456")
chunkClient = DatastoreChunkClient(client)

# The size of the key list is only limited by memory
keys = [...]
values, missing = chunkClient.get_multi(keys)

# The size of the entity list is only limited by memory
entities = [...]
chunkClient.put_multi(entities)

 

Saving an entity in Google Cloud Datastore using Python: A minimal example

Here’s a minimal example for inserting an entity in the Google Cloud Datastore object database using the Python API:

#!/usr/bin/env python3
from google.cloud import datastore
# Create & store an entity
client = datastore.Client(project="myproject-12345")
entity = datastore.Entity(key=client.key('MyEntityKind', 'MyTestID'))
entity.update({
    'foo': u'bar',
    'baz': 1337,
    'qux': False,
})
# Actually save the entity
client.put(entity)

This assumes you have already created an entity kind with the name MyEntityKind in the project with ID myproject-12345.

How to fix Google Cloud Datastore ValueError: A Key must have a project set.

Problem:

You are trying to connect to the Google Cloud Storage object database:

#!/usr/bin/env python3
from google.cloud import datastore
# Create, populate and persist an entity
entity = datastore.Entity(key=datastore.Key('MyEntityKind')) # Line of error
# ...

but when running that code, you get this error message:

Traceback (most recent call last):
  File "./IndexIntoDB.py", line 4, in <module>
    entity = datastore.Entity(key=datastore.Key('MyEntityKind'))
  File "/usr/local/lib/python3.6/dist-packages/google/cloud/datastore/key.py", line 109, in __init__
    self._project = _validate_project(project, parent)
  File "/usr/local/lib/python3.6/dist-packages/google/cloud/datastore/key.py", line 512, in _validate_project
    raise ValueError("A Key must have a project set.")
ValueError: A Key must have a project set.

Solution:

Note: While the solution below fixes the error message listed above, you might be more interested in having a look at this minimal entity insertion example

As the error message indicates, you need to add a project name. If you don’t know the project name, go to the Google Cloud Console, select the right project at the top and then look at the URL:

https://console.cloud.google.com/datastore/welcome?project=perceptive-tape-12345

In this example, the project ID (which you have to use in the Python code is perceptive-tape-12345.

See also the Keys section of the google-cloud-datastore python documentation.