How to sort files on S3 by timestamp in filename using boto3 & Python
Let’s assume we have backup objects in an S3 directory like:
production-backup-2022-03-29_14-40-16.xz
production-backup-2022-03-29_14-50-16.xz
production-backup-2022-03-29_15-00-03.xz
production-backup-2022-03-29_15-10-04.xz
production-backup-2022-03-29_15-20-06.xz
production-backup-2022-03-29_15-30-06.xz
production-backup-2022-03-29_15-40-00.xz
production-backup-2022-03-29_15-50-07.xz
production-backup-2022-03-29_16-00-06.xz
production-backup-2022-03-29_16-10-12.xz
production-backup-2022-03-29_16-20-18.xz
production-backup-2022-03-29_16-30-18.xz
production-backup-2022-03-29_16-40-00.xz
production-backup-2022-03-29_16-50-09.xz
production-backup-2022-03-29_17-00-18.xz
production-backup-2022-03-29_17-10-13.xz
production-backup-2022-03-29_17-20-18.xz
production-backup-2022-03-29_17-30-18.xz
production-backup-2022-03-29_17-40-06.xz
production-backup-2022-03-29_17-50-21.xz
production-backup-2022-03-29_18-00-06.xz
We want to identify the newest one. In situations like this, you often can't rely on the objects' modification timestamps, as these change when old files are re-synced or when folder structures or names change.
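For contrast, this is roughly what the naive approach would look like; a minimal sketch, assuming the same backups bucket object and production/ prefix used in the full script below:

# Naive approach (not robust): pick the newest object by LastModified.
# This breaks when old backups are re-uploaded or synced, since
# LastModified then reflects the upload time, not the backup time.
newest = max(backups.objects.filter(Prefix="production/"),
             key=lambda obj: obj.last_modified)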
Hence, the best way is to rely on the timestamp from the filename as a reference point. The timestamp format we're using here is based on our post How to generate filename containing date & time on the command line; if you're using a different object key format, you might need to adjust the date_regex accordingly.
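For example, if your keys used a compact timestamp such as production-backup-20220329T144016.xz (a hypothetical format, not the one shown above), the regex could be adjusted like this while keeping the same named groups, so the rest of the script works unchanged:

import re

# Hypothetical alternate key format: production-backup-20220329T144016.xz
date_regex = re.compile(
    r"(?P<year>\d{4})(?P<month>\d{2})(?P<day>\d{2})T"
    r"(?P<hour>\d{2})(?P<minute>\d{2})(?P<second>\d{2})")

match = date_regex.search("production-backup-20220329T144016.xz")
assert match is not None
print(match.group("year"), match.group("hour"))  # prints: 2022 14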
The following example script iterates over all objects within a specific S3 folder, sorts them by the timestamp from the filename, chooses the latest one, and downloads it from S3 to the local filesystem.
This script is based on a few of our previous posts, including:
- How to download Wasabi/S3 object to file using boto3 in Python
- How to filter for objects in a given S3 directory using boto3
#!/usr/bin/env python3
import boto3
import re
import os.path
from collections import namedtuple
from datetime import datetime

# Create connection to Wasabi / S3
s3 = boto3.resource('s3',
    endpoint_url='https://minio.mydomain.com',
    aws_access_key_id='my-access-key',
    aws_secret_access_key='my-password'
)
# Get bucket object
backups = s3.Bucket('mfwh-backup')

date_regex = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})_(?P<hour>\d{2})-(?P<minute>\d{2})-(?P<second>\d{2})")
DatedObject = namedtuple("DatedObject", ["Date", "Object"])

entries = []
# Iterate over objects in bucket
for obj in backups.objects.filter(Prefix="production/"):
    date_match = date_regex.search(obj.key)
    # Ignore other files (without date stamp) if any
    if date_match is None:
        continue
    dt = datetime(year=int(date_match.group("year")), month=int(date_match.group("month")),
                  day=int(date_match.group("day")), hour=int(date_match.group("hour")),
                  minute=int(date_match.group("minute")), second=int(date_match.group("second")))
    entries.append(DatedObject(dt, obj))
# Sort entries by date
entries.sort(key=lambda entry: entry.Date)
newest_date, newest_obj = entries[-1]
#print(f"Downloading {newest_obj.key} from {newest_date.isoformat()}")
filename = os.path.basename(newest_obj.key)
with open(filename, "wb") as outfile:
    backups.download_fileobj(newest_obj.key, outfile)
# Print filename for automation purposes
print(filename)
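One design note: since we only need the newest entry, sorting the whole list isn't strictly necessary; max() does the same job in a single pass, and a small guard avoids an IndexError when the prefix contains no matching objects. A minimal variant of the selection step, assuming the entries list from the script above:

# Variant: pick the newest entry without sorting; bail out early if the
# prefix contained no objects with a date stamp in their key.
if not entries:
    raise SystemExit("No backup objects with a date stamp found")
newest_date, newest_obj = max(entries, key=lambda entry: entry.Date)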