Fixing numpy.distutils.system_info.NotFoundError: No lapack/blas resources found on Ubuntu or Travis

Note: If you are on Windows, you can not install scipy using pip! Follow this guide instead: https://www.scipy.org/install.html. This blog post is only for Linux-based systems!

When building some of my libraries on Travis, I encountered this error during

sudo pip3 install numpy scipy --upgrade
numpy.distutils.system_info.NotFoundError: No lapack/blas resources

Solution

Install lapack and blas:

sudo apt-get -y install liblapack-dev libblas-dev

In most cases you will then get this error message:

error: library dfftpack has Fortran sources but no Fortran compiler found

Fix that by

sudo apt-get install -y gfortran

In Travis, you can do it like this in .travis.yml:

before_install:
    - sudo apt-get -y install liblapack-dev libblas-dev gfortran

Easily compute & visualize FFTs in Python using UliEngineering

UliEngineering is a mixed data analytics in Python – one of the utilities it provides is an easy-to-use package to compute FFTs. In contrast to other package, this library is oriented towards practical usecases and allows you to do the FFT in only one line of code! No knowledge of Math required.

Getting started

First, we’ll generate some test data. See this previous post for more details on how to generate sinusoid test data:

from UliEngineering.SignalProcessing.Simulation import *

# Generate test data: 100 Hz + 400 Hz tone
data = sine_wave(frequency=100.0, samplerate=1000, amplitude=1.) \
       + sine_wave(frequency=400.0, samplerate=1000, amplitude=0.5)

The test data consists of a 100 Hz sine wave plus a 400 Hz sine wave (with half the amplitude). That signal is sampled with a sample rate of 1000 Hz.

Now we can compute & visualize the FFT using matplotlib:

# Compute FFT. Use the same samplerate 
# NOTE: Window defaults to "blackman"!
from UliEngineering.SignalProcessing.FFT import compute_fft
fft = compute_fft(data, samplerate=1e3)

# Plot
from matplotlib import pyplot as plt
plt.style.use("ggplot")

plt.gcf().set_size_inches(10, 5) # Use (20, 10) to get a larger plot
plt.plot(fft.frequencies, fft.amplitudes)
plt.xlabel("Frequency")
plt.ylabel("Amplitude")


compute_fft(data, samplerate=1e3) returns a tuple (fftx, ffty). It performs an FFT the size of the input (i.e. since data is a length-1000 array, the FFT will be of size 1000).

fft.frequencies is an array of frequencies (in Hz), corresponding to the values in fft.amplitudes. You can also use fft.angles to get relative angles in degrees, but that is not being covered in this blogpost.

As you can see in the plot shown above, the maximum frequency that can be detected is always half the samplerate, i.e. for our samplerate of f_s = 1000\,\text{Hz} it is 500\,\text{Hz}. See this more math-centric FFT explanation if you want to know more details.

Internally, compute_fft() performs this computation:

2 \cdot \frac{\text{abs}\left(\text{FFT}(\text{data} \cdot \text{Window})\right)}{\text{len(data)}}
  • 2 is a correction factor that takes into account that we throw away the latter half of the raw FFT result (since we’re doing a real FFT)
  • \frac{1}{\text{len(data)}} normalizes the FFT results so they are indepedent of the data length (i.e. if you pass a longer sample of the same sine wave you will still get the same result
  • \text{abs}\left(\cdots\right) Converts the complex phase-aware result of the FFT to a spectrum which is easier to read & visualize.
  • Window (which defaults to blackman) is the window which is applied to the data to alleviate some mathematical effects at the beginning and the end of the dataset. See wikipedia on window functions for more details. UliEngineering currently offers this list of window functions:
    • blackman
    • bartlett
    • hamming
    • hanning
    • kaiser (Parameter is fixed to 2.0)
    • none

Selecting frequency ranges

Using the UliEngineering API, selecting a frequency range of the FFT is trivially easy: Just use fft[lowfreq:highfreq]. You can use fft[lowfreq:] to select everything starting from lowfreq or use fft[:highfreq] to select everything up to highfreq.

from UliEngineering.SignalProcessing.FFT import compute_fft
fft = compute_fft(data, samplerate=1e3)

# Select frequency range: 50 to 200 Hz
fft = fft[50.0:200.0]

# Plot
from matplotlib import pyplot as plt
plt.style.use("ggplot")

plt.gcf().set_size_inches(10, 5) # Use (20, 10) to get a larger plot
plt.plot(fft.frequencies, fft.amplitudes)
plt.xlabel("Frequency")
plt.ylabel("Amplitude")
plt.savefig("/ram/fft-frequency-range.svg")

Extract amplitude & angle at a certain frequency

By using [frequency], i.e. getitem operator with a single value, you get an FFTPoint() object containing the frequency, amplitude and relative angle for a given frequency. The library automatically selects the closest FFT bucket, so even if your FFT does not have a bucket for that specific frequency, you will get sensible results.

from UliEngineering.SignalProcessing.FFT import compute_fft
fft = compute_fft(data, samplerate=1e3)

# Show value at a certain frequency
print(fft[30]) # FFTPoint(frequency=30.0, value=2.91e-08, angle=0.0)
print(fft[100]) # FFTPoint(frequency=100.0, value=0.419, angle=0.0)

Short FFTs for long data

Using compute_fft(), if we have an extremely long data array, this means we’ll compute an extremely long FFT. In many cases, this is not desirable and you want to compute a fixed-size FFT (most often a power-of-two FFT, e.g. 1024, 2048, 4096 etc).

Let’s generate some long test data and assume we want to compute a size-1024 FFT on it

from UliEngineering.SignalProcessing.FFT import *
fftx, ffty = simple_serial_fft_reduce(data, samplerate=1e3, fftsize=1024)

Plotting the data like above yields

which looks almost exactly like our compute_fft() plot before – just as we expected.

simple_serial_fft_reduce() takes care of all the magic and normalization for us, including partitioning the data into overlapping chunks, adding the FFTs and properly normalizing the result.

The naming convention is significant here:

  • simple_...._reduce means this is a variant with sensible defaults (simple) for using a reduction function (default: sum) on multiple FFTs.
  • serial means the individual FFTs

Parallelizing the FFTs

If you have a huge dataset, you can use simple_parallel_fft_reduce() identically tosimple_serial_fft_reduce():

from UliEngineering.SignalProcessing.FFT import *
fftx, ffty = simple_parallel_fft_reduce(data, samplerate=1e3, fftsize=1024)

However in most cases you want to initialize the executor manually so you can re-use it later:

from UliEngineering.SignalProcessing.FFT import *
from concurrent.futures import ThreadPoolExecutor
executor = ThreadPoolExecutor() # No argument => use num_cpus threads
fftx, ffty = simple_parallel_fft_reduce(data, samplerate=1e3, fftsize=1024, executor=executor)

We can use a ThreadPoolExecutor() since scipy.fftpack (which UliEngineering uses to do the hard math) unlocks the Python GIL.

Note that due to the need to do a lot of housekeeping tasks, simple_parallel_fft_reduce() is much slower than simple_serial_fft_reduce() if you have a dataset so small that parallelization is not effective. My initial recommendation is to consider using the parallel variant if the total execution time of the serial variant is larger than 0.5\,s

Install:

sudo pip3 install git+https://github.com/ulikoehler/UliEngineering.git

Easily generate square/triangle/sawtooth/inverse sawtooth waveform data in Python using UliEngineering

In a previous post, I’ve detailed how to generate sine/cosine wave data with frequency, amplitude, offset, phase shift and time offset using only a single line of code.

This post extends this approach by showing how to generate square wave, triangle wave, sawtooth wave and inverse sawtooth wave data – still in only one line of code.

All of the parameters, including frequency, amplitude, offset, phase shift and time offset apply for those functions as well – see the previous post for details and examples for those parameters.

We are using the UliEngineering library, more specifically the UliEngineering.SignalProcessing.Simulation package:

Install

sudo pip3 install git+https://github.com/ulikoehler/UliEngineering.git

Square wave

from UliEngineering.SignalProcessing.Simulation import square_wave

data = square_wave(frequency=10.0, samplerate=10e3)

Triangle wave

from UliEngineering.SignalProcessing.Simulation import triangle_wave

data = triangle_wave(frequency=10.0, samplerate=10e3)

Sawtooth wave

from UliEngineering.SignalProcessing.Simulation import sawtooth

data = sawtooth(frequency=10.0, samplerate=10e3)

Inverse sawtooth wave

from UliEngineering.SignalProcessing.Simulation import inverse_sawtooth

data = inverse_sawtooth(frequency=10.0, samplerate=10e3)

Plotting code

This code was used to generate the plots for this post in Jupyter:

%matplotlib inline
from matplotlib import pyplot as plt
plt.style.use("ggplot")

from UliEngineering.SignalProcessing.Simulation import square_wave

data = square_wave(frequency=10.0, samplerate=10e3)

# set_size_inches(20, 10) to make it even larger!
plt.gcf().set_size_inches(10, 5)
plt.plot(data, label="original")
plt.savefig("/dev/shm/square-wave.svg")

Easily generate sine/cosine waveform data in Python using UliEngineering

In order to generate sinusoid test data in Python you can use the UliEngineering library which provides an easy-to-use functions in UliEngineering.SignalProcessing.Simulation:

Install

sudo pip3 install git+https://github.com/ulikoehler/UliEngineering.git

Basic example

from UliEngineering.SignalProcessing.Simulation import sine_wave
# Default: Generates 1 second of data with amplitude = 1.0 (swing from -1.0 ... 1.0)
sine = sine_wave(frequency=10.0, samplerate=10e3)
cosine = cosine_wave(frequency=10.0, samplerate=10e3)

Amplitude & offset

Use amplitude=0.5 to specify that the sine wave should swing between -0.5 and 0.5.

Use offset=2.0 to specify that the sine wave should be centered vertically around 2.0:

from UliEngineering.SignalProcessing.Simulation import sine_wave

data = sine_wave(frequency=10.0, samplerate=10e3, amplitude=0.5, offset=2.0)

Phaseshift example

You can specify the phaseshift to use – to use a 180° phaseshift, just use phaseshift=180.0

from UliEngineering.SignalProcessing.Simulation import sine_wave

original = sine_wave(frequency=10.0, samplerate=10e3)
shifted = sine_wave(frequency=10.0, samplerate=10e3, phaseshift=180.)

Time delay example

While this is functionally equivalent to phase offset, it is often convenient to specify the time delay in seconds instead of the phase shift in degrees:

from UliEngineering.SignalProcessing.Simulation import sine_wave

original = sine_wave(frequency=10.0, samplerate=10e3)
# Shifted signal is delayed 0.01s = 10 milliseconds compared to original
shifted = sine_wave(frequency=10.0, samplerate=10e3, timedelay=0.01)

Note that in the time domain, the signals appear to be shifted backwards when you use a positive timedelay value. This is in accordance with the delay naming, implying that the signal is delayed by that amount.

You can specify both phase shift and time delay, meaning that both will be applied (the offset is added)

Plotting

If you want to debug your signals visually, this is the code that was used inside Jupyter to generate the plots shown above:

%matplotlib inline
from matplotlib import pyplot as plt
plt.style.use("ggplot")

# Generate data
from UliEngineering.SignalProcessing.Simulation import sine_wave

original = sine_wave(frequency=10.0, samplerate=10e3)
shifted = sine_wave(frequency=10.0, samplerate=10e3, timedelay=0.01)

# set_size_inches(20, 10) to make it even larger!
plt.gcf().set_size_inches(10, 5)
plt.plot(original, label="original")
plt.plot(shifted, label="shifted")
plt.savefig("/dev/shm/timedelay.svg")
plt.legend(loc=1) # Top right

Easy zero crossing detection in Python using UliEngineering

In order to perform zero crossing detection in NumPy arrays you can use the UliEngineering library which provides an easy-to-use zero_crossings function:

Install:

sudo pip3 install git+https://github.com/ulikoehler/UliEngineering.git

Usage example:

from UliEngineering.SignalProcessing.Utils import zero_crossings
# To generate test data
from UliEngineering.SignalProcessing.Simulation import sine_wave

# Generate test data: 10 Hz 10ksps, 1 second (= 1D 10000 values NumPy array)
data = sine_wave(frequency=10.0, samplerate=10e3)

# Prints: [0, 500, 1000, 1500, ...]
print(zero_crossings(data))

Thanks to Jim Brissom on StackOverflow for the original solution!

How to set cv2.VideoCapture() image size in Python

Use cv2.CAP_PROP_FRAME_WIDTH and cv2.CAP_PROP_FRAME_HEIGHT in order to tell OpenCV which image size you would like.

import cv2

video_capture = cv2.VideoCapture(0)
# Check success
if not video_capture.isOpened():
    raise Exception("Could not open video device")
# Set properties. Each returns === True on success (i.e. correct resolution)
video_capture.set(cv2.CAP_PROP_FRAME_WIDTH, 160)
video_capture.set(cv2.CAP_PROP_FRAME_HEIGHT, 120)
# Read picture. ret === True on success
ret, frame = video_capture.read()
# Close device
video_capture.release()

Note that most video capture devices (like webcams) only support specific sets of widths & heights. Use uvcdynctrl -f to find out which resolutions are supported:

$ uvcdynctrl -f
Listing available frame formats for device video0:
Pixel format: YUYV (YUYV 4:2:2; MIME type: video/x-raw-yuv)
  Frame size: 640x480
    Frame rates: 30, 20, 10
  Frame size: 352x288
    Frame rates: 30, 20, 10
  Frame size: 320x240
    Frame rates: 30, 20, 10
  Frame size: 176x144
    Frame rates: 30, 20, 10
  Frame size: 160x120
    Frame rates: 30, 20, 10

How to take a webcam picture using OpenCV in Python

This code opens /dev/video0 and takes a single picture, closing the device afterwards:

import cv2

video_capture = cv2.VideoCapture(0)
# Check success
if not video_capture.isOpened():
    raise Exception("Could not open video device")
# Read picture. ret === True on success
ret, frame = video_capture.read()
# Close device
video_capture.release()

You can also use cv2.VideoCapture("/dev/video0"), but this approach is platform-dependent. cv2.VideoCapture(0) will also open the first video device on non-Linux platforms.

In Jupyter you can display the picture using

import sys
from matplotlib import pyplot as plt

frameRGB = frame[:,:,::-1] # BGR => RGB
plt.imshow(frameRGB)

 

Copying io.BytesIO content to a file in Python

Problem:

In Python, you have an io.BytesIO instance containing some data. You want to copy that data to a file (or another file-like object).

Solution:

Use this function:

def copy_filelike_to_filelike(src, dst, bufsize=16384):
    while True:
        buf = src.read(bufsize)
        if not buf:
            break
        dst.write(buf)

Usage example:

import io

myBytesIO = io.BytesIO(b"test123")

with open("myfile.txt", "wb") as outfile:
    copy_filelike_to_filelike(myBytesIO, outfile)
# myfile.txt now contains "test123" (no trailing newline)

 

How to translate using your custom AutoML model in Python

If you’ve successfully trained your first custom AutoML neuronal translation model, the next step is to integrate it into your application.

Here’s a python3 utility class that easily allows you to translate using your custom model:

class GNTMAutoMLTranslationDriver(object):
    """
    Custom AutoML model translator.

    Usage example (be sure to use your own model here!):

    >>> translator = GNTMAutoMLTranslationDriver('myproject-101472', 'TRL455090968000816104449')
    >>> translator.translate("This is a translation test")
    """
    def __init__(self, project_id, model_id):
        self.client = automl_v1beta1.PredictionServiceClient()
        self._name = 'projects/{}/locations/us-central1/models/{}'.format(project_id, model_id)
    
    def translate(self, engl):
        payload = {'text_snippet': {'content': engl}}
        params = {}
        request = self.client.predict(self._name, payload, params)
        return request.payload[0].translation.translated_content.content

See the class documentation for a usage example. Most of the code is also present in the official AutoML example, but I had to figure out some parts for myself, e.g. how to extract the string from the protobuf (request.payload[0].translation.translated_content.content).

Also note that AutoML is currently in Beta and therefore the API might change without prior notice.

How to circumvent Google Cloud Storage 1000 read / 400 write limit in Python

Google Cloud Datastore has a built-in 1000 keys limit for get requests and a 400 entities per request for put limit. If you hit it, you will see one of these error messages:

google.api_core.exceptions.InvalidArgument: 400 cannot get more than 1000 keys in a single call
google.api_core.exceptions.InvalidArgument: 400 cannot write more than 500 entities in a single call

You can fix it by chunking the requests, i.e. only do 1000 requests at one time for get etc.

This code provides a ready-to-use example for a class that automates this process. As an added benefit, it performs the requests in chunks of 1000 (for get) or 400 (for put) in parallel using a concurrent.futures.Executor. As the performance is expected to be IO-bound, it is recommended to use a concurrent.futures.ThreadPoolExecutor.
If you dont give the class an executor on construction, it will create one by itself.

import itertools
from concurrent.futures import ThreadPoolExecutor

def _chunks(l, n=1000):
    """
    Yield successive n-sized chunks from l.
    https://stackoverflow.com/a/312464/2597135
    """
    for i in range(0, len(l), n):
        yield l[i:i + n]

def _get_chunk(client, keys):
    """
    Get a single chunk
    """
    missing = []
    vals = client.get_multi(keys, missing=missing)
    return vals, missing

class DatastoreChunkClient(object):
    """
    Provides a thin wrapper around a Google Cloud Datastore client providing means
    of reading nd
    """
    def __init__(self, client, executor=None):
        self.client = client
        if executor is None:
            executor = ThreadPoolExecutor(16)
        self.executor = executor
    
    def get_multi(self, keys):
        """
        Thin wrapper around client.get_multi() that circumvents
        the 1000 read requests limit by doing 1000-sized chunked reads
        in parallel using self.executor.

        Returns (values, missing).
        """
        all_missing = []
        all_vals = []
        for vals, missing in self.executor.map(lambda chunk: _get_chunk(self.client, chunk), _chunks(keys, 1000)):
            all_vals += vals
            all_missing += missing
        return all_vals, all_missing

    def put_multi(self, entities):
        """
        Thin wrapper around client.put_multi() that circumvents
        the 400 read requests limit by doing 400-sized chunked reads
        in parallel using self.executor.

        Returns (values, missing).
        """
        for none in self.executor.map(lambda chunk: self.client.put_multi(chunk), _chunks(entities, 400)):
            pass

Usage example:

# Create "raw" google datastore client
client = datastore.Client(project="myproject-123456")
chunkClient = DatastoreChunkClient(client)

# The size of the key list is only limited by memory
keys = [...]
values, missing = chunkClient.get_multi(keys)

# The size of the entity list is only limited by memory
entities = [...]
chunkClient.put_multi(entities)

 

Saving an entity in Google Cloud Datastore using Python: A minimal example

Here’s a minimal example for inserting an entity in the Google Cloud Datastore object database using the Python API:

#!/usr/bin/env python3
from google.cloud import datastore
# Create & store an entity
client = datastore.Client(project="myproject-12345")
entity = datastore.Entity(key=client.key('MyEntityKind', 'MyTestID'))
entity.update({
    'foo': u'bar',
    'baz': 1337,
    'qux': False,
})
# Actually save the entity
client.put(entity)

This assumes you have already created an entity kind with the name MyEntityKind in the project with ID myproject-12345.

How to fix Google Cloud Datastore ValueError: A Key must have a project set.

Problem:

You are trying to connect to the Google Cloud Storage object database:

#!/usr/bin/env python3
from google.cloud import datastore
# Create, populate and persist an entity
entity = datastore.Entity(key=datastore.Key('MyEntityKind')) # Line of error
# ...

but when running that code, you get this error message:

Traceback (most recent call last):
  File "./IndexIntoDB.py", line 4, in <module>
    entity = datastore.Entity(key=datastore.Key('MyEntityKind'))
  File "/usr/local/lib/python3.6/dist-packages/google/cloud/datastore/key.py", line 109, in __init__
    self._project = _validate_project(project, parent)
  File "/usr/local/lib/python3.6/dist-packages/google/cloud/datastore/key.py", line 512, in _validate_project
    raise ValueError("A Key must have a project set.")
ValueError: A Key must have a project set.

Solution:

Note: While the solution below fixes the error message listed above, you might be more interested in having a look at this minimal entity insertion example

As the error message indicates, you need to add a project name. If you don’t know the project name, go to the Google Cloud Console, select the right project at the top and then look at the URL:

https://console.cloud.google.com/datastore/welcome?project=perceptive-tape-12345

In this example, the project ID (which you have to use in the Python code is perceptive-tape-12345.

See also the Keys section of the google-cloud-datastore python documentation.

Converting namedtuples to XLSX in Python

This Python snippet allows you to convert an iterable of namedtuple instances to an XLSX file using xlsxwriter.

The header is automatically determined from the first element of the iterable. If the iterable is empty, the resulting XLSX file will also be empty.

import xlsxwriter
import itertools
from collections import namedtuple

def xlsx_write_rows(filename, rows):
    """
    Write XLSX rows from an iterable of rows.
    Each row must be an iterable of writeable values.

    Returns the number of rows written
    """
    workbook = xlsxwriter.Workbook(filename)
    worksheet = workbook.add_worksheet()
    # Write values
    nrows = 0
    for i, row in enumerate(rows):
        for j, val in enumerate(row):
            worksheet.write(i, j, val)
        nrows += 1
    # Cleanup
    workbook.close()
    return nrows


def namedtuples_to_xlsx(filename, values):
    """
    Convert a list or generator of namedtuples to an XLSX file.
    Returns the number of rows written.
    """
    try:
        # Ensure its a generator (next() not allowed on lists)
        values = (v for v in values)
        # Use first row to generate header
        peek = next(values)
        header = list(peek.__class__._fields)
        return xlsx_write_rows(filename, itertools.chain([header], [peek], values))
    except StopIteration:  # Empty generator
        # Write empty xlsx
        return xlsx_write_rows(filename, [])

Example Usage:

MyType = namedtuple("MyType", ["ID", "Name", "Value"])
namedtuples_to_xlsx("test.xlsx", [
    MyType(1, "a", "b"),
    MyType(2, "c", "d"),
    MyType(3, "e", "f"),
])

This example will generate this table:

ID	Name	Value
1	a	b
2	c	d
3	e	f

 

Fixing TensorFlow libcublas.so.8.0: cannot open shared object file on Ubuntu

Problem:

When you run import tensorflow in Python, you get one of the following errors:

ImportError: libcublas.so.8.0: cannot open shared object file: No such file or directory
ImportError: libcusolver.so.8.0: cannot open shared object file: No such file or directory
ImportError: libcudart.so.8.0: cannot open shared object file: No such file or directory
ImportError: libcufft.so.8.0: cannot open shared object file: No such file or directory
ImportError: libcurand.so.8.0: cannot open shared object file: No such file or directory

Solution:

Install the required packages using:

apt-get install libcublas8.0 libcusolver8.0 libcudart8.0 libcufft8.0 libcurand8.0

Note that you also need to install cuDNN – see this followup post

Which version on CuDNN should you install for TensorFlow GPU on Ubuntu?

for details on how to do that.

If this method does not work, you can (as a quick workaround) uninstall tensorflow-gpu and install the tensorflow – the version without GPU support:

pip3 uninstall tensorflow-gpu
pip3 install tensorflow

However, this will likely make your applications much slower.

For other solutions see the TensorFlow bugtracker on GitHub.

How to use concurrent.futures map with a tqdm progress bar

Problem:

You have a concurrent.futures executor, e.g.

import concurrent.futures

executor = concurrent.futures.ThreadPoolExecutor(64)

Using this executor, you want to map a function over an iterable in parallel (e.g. parallel download of HTTP pages).

In order to aid interactive execution, you want to use tqdm to provide a progress bar, showing the fraction of futures

Solution:

You can use this function:

from tqdm import tqdm
import concurrent.futures

def tqdm_parallel_map(executor, fn, *iterables, **kwargs):
    """
    Equivalent to executor.map(fn, *iterables),
    but displays a tqdm-based progress bar.
    
    Does not support timeout or chunksize as executor.submit is used internally
    
    **kwargs is passed to tqdm.
    """
    futures_list = []
    for iterable in iterables:
        futures_list += [executor.submit(fn, i) for i in iterable]
    for f in tqdm(concurrent.futures.as_completed(futures_list), total=len(futures_list), **kwargs):
        yield f.result()

Note that internally, executor.submit() is used, not executor.map() because there is no way of calling concurrent.futures.as_completed() on the iterator returned by executor.map().

Note: In constract to executor.map() this function does NOT yield the arguments in the same order as the input.

requests: Download file if it doesn’t exist

Problem:

You want to download a URL to a file using the requests python library, but you want to skip the download if it doesn’t exist

Solution:

Use the following functions:

import requests
import os.path

def download_file(filename, url):
    """
    Download an URL to a file
    """
    with open(filename, 'wb') as fout:
        response = requests.get(url, stream=True)
        response.raise_for_status()
        # Write response data to file
        for block in response.iter_content(4096):
            fout.write(block)

def download_if_not_exists(filename, url):
    """
    Download a URL to a file if the file
    does not exist already.

    Returns
    -------
    True if the file was downloaded,
    False if it already existed
    """
    if not os.path.exists(filename):
        download_file(filename, url)
        return True
    return False

 

Removing spans/divs with style attributes from HTML

Occasionally I have to clean up some HTML code – mostly because parts of it were pasted into a CMS like WordPress from rich text editor like Word.

I’ve noticed that the formatting I want to remove is mostly based on span and div elements with a style attribute. Therefore, I’ve written a simple Python script based on BeautifulSoup4 which will replace certain tags with their contents if they have a style attribute. While in some cases some other formatting might be destroyed by such a script, it is very useful for some recurring usecases.

Read more