How to sort files on S3 by timestamp in filename using boto3 & Python

Let’s assume we have backup objects in an S3 directory like:

production-backup-2022-03-29_14-40-16.xz
production-backup-2022-03-29_14-50-16.xz
production-backup-2022-03-29_15-00-03.xz
production-backup-2022-03-29_15-10-04.xz
production-backup-2022-03-29_15-20-06.xz
production-backup-2022-03-29_15-30-06.xz
production-backup-2022-03-29_15-40-00.xz
production-backup-2022-03-29_15-50-07.xz
production-backup-2022-03-29_16-00-06.xz
production-backup-2022-03-29_16-10-12.xz
production-backup-2022-03-29_16-20-18.xz
production-backup-2022-03-29_16-30-18.xz
production-backup-2022-03-29_16-40-00.xz
production-backup-2022-03-29_16-50-09.xz
production-backup-2022-03-29_17-00-18.xz
production-backup-2022-03-29_17-10-13.xz
production-backup-2022-03-29_17-20-18.xz
production-backup-2022-03-29_17-30-18.xz
production-backup-2022-03-29_17-40-06.xz
production-backup-2022-03-29_17-50-21.xz
production-backup-2022-03-29_18-00-06.xz

And we want to identify the newest one. In situations like this, you often can't rely on the objects' modification timestamps, since those can change when old files are re-synced or when folder structures or names are changed.

Hence the most reliable approach is to use the timestamp from the filename as the reference point. The timestamp format we're using here is based on our post How to generate filename containing date & time on the command line; if you're using a different object key format, you might need to adjust the date_regex accordingly.
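To illustrate how the extraction works, here is a minimal, standalone sketch (separate from the full script below) that parses the timestamp out of one of the example keys from the listing above:

#!/usr/bin/env python3
import re
from datetime import datetime

date_regex = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})_(?P<hour>\d{2})-(?P<minute>\d{2})-(?P<second>\d{2})")

# One of the example object keys from the listing above
key = "production-backup-2022-03-29_18-00-06.xz"
date_match = date_regex.search(key)
# Build a datetime from the named capture groups
dt = datetime(*(int(date_match.group(g)) for g in
                ("year", "month", "day", "hour", "minute", "second")))
print(dt.isoformat())  # 2022-03-29T18:00:06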

The following example script iterates over all objects within a specific S3 prefix, sorts them by the timestamp extracted from the filename, chooses the latest one and downloads it from S3 to the local filesystem.

This script builds on a few of our previous posts.

#!/usr/bin/env python3
import boto3
import re
import os.path
from collections import namedtuple
from datetime import datetime

# Create connection to the S3-compatible storage (MinIO in this example)
s3 = boto3.resource('s3',
    endpoint_url = 'https://minio.mydomain.com',
    aws_access_key_id = 'my-access-key',
    aws_secret_access_key = 'my-password'
)

# Get bucket object
backups = s3.Bucket('mfwh-backup')

date_regex = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})_(?P<hour>\d{2})-(?P<minute>\d{2})-(?P<second>\d{2})")

DatedObject = namedtuple("DatedObject", ["Date", "Object"])
entries = []
# Iterate over objects in bucket
for obj in backups.objects.filter(Prefix="production/"):
    date_match = date_regex.search(obj.key)
    # Ignore other files (without date stamp) if any
    if date_match is None:
        continue
    dt = datetime(year=int(date_match.group("year")), month=int(date_match.group("month")),
        day=int(date_match.group("day")), hour=int(date_match.group("hour")), minute=int(date_match.group("minute")),
        second=int(date_match.group("second")))
    entries.append(DatedObject(dt, obj))
# Sort entries by date
entries.sort(key=lambda entry: entry.Date)

# Abort if no objects with a date stamp were found
if not entries:
    raise SystemExit("No backup objects with a date stamp found")

newest_date, newest_obj = entries[-1]
#print(f"Downloading {newest_obj.key} from {newest_date.isoformat()}")
filename = os.path.basename(newest_obj.key)

with open(filename, "wb") as outfile:
    backups.download_fileobj(newest_obj.key, outfile)

# Print filename for automation purposes
print(filename)
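Since the list is sorted in ascending order, the last element is always the newest backup. If you only need that single newest object, you could also skip the sort entirely and use max() with the same key function; this is just an alternative sketch operating on the entries list built above:

# Alternative: pick the newest entry directly instead of sorting the whole list
newest_date, newest_obj = max(entries, key=lambda entry: entry.Date)

The script prints the downloaded filename on stdout so that a wrapper shell script or other automation can easily pick it up.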