S3: Dateien nach Zeitstempel im Dateinamen sortieren mit boto3 & Python

Nehmen wir an, wir haben Backup-Objekte in einem S3-Verzeichnis wie:

s3_backup_list.txt

production-backup-2022-03-29_14-40-16.xz
production-backup-2022-03-29_14-50-16.xz
production-backup-2022-03-29_15-00-03.xz
production-backup-2022-03-29_15-10-04.xz
production-backup-2022-03-29_15-20-06.xz
production-backup-2022-03-29_15-30-06.xz
production-backup-2022-03-29_15-40-00.xz
production-backup-2022-03-29_15-50-07.xz
production-backup-2022-03-29_16-00-06.xz
production-backup-2022-03-29_16-10-12.xz
production-backup-2022-03-29_16-20-18.xz
production-backup-2022-03-29_16-30-18.xz
production-backup-2022-03-29_16-40-00.xz
production-backup-2022-03-29_16-50-09.xz
production-backup-2022-03-29_17-00-18.xz
production-backup-2022-03-29_17-10-13.xz
production-backup-2022-03-29_17-20-18.xz
production-backup-2022-03-29_17-30-18.xz
production-backup-2022-03-29_17-40-06.xz
production-backup-2022-03-29_17-50-21.xz
production-backup-2022-03-29_18-00-06.xz

production-backup-2022-03-29_14-40-16.xz
production-backup-2022-03-29_14-50-16.xz
production-backup-2022-03-29_15-00-03.xz
production-backup-2022-03-29_15-10-04.xz
production-backup-2022-03-29_15-20-06.xz
production-backup-2022-03-29_15-30-06.xz
production-backup-2022-03-29_15-40-00.xz
production-backup-2022-03-29_15-50-07.xz
production-backup-2022-03-29_16-00-06.xz
production-backup-2022-03-29_16-10-12.xz
production-backup-2022-03-29_16-20-18.xz
production-backup-2022-03-29_16-30-18.xz
production-backup-2022-03-29_16-40-00.xz
production-backup-2022-03-29_16-50-09.xz
production-backup-2022-03-29_17-00-18.xz
production-backup-2022-03-29_17-10-13.xz
production-backup-2022-03-29_17-20-18.xz
production-backup-2022-03-29_17-30-18.xz
production-backup-2022-03-29_17-40-06.xz
production-backup-2022-03-29_17-50-21.xz
production-backup-2022-03-29_18-00-06.xz

Und wir möchten das neueste identifizieren. Oft kann man sich in solchen Situationen nicht auf Änderungszeitstempel verlassen, da diese sich beim Synchronisieren alter Dateien oder beim Ändern von Ordnerstrukturen oder Namen ändern können.

Daher ist es am besten, sich auf den Zeitstempel aus dem Dateinamen als Referenzpunkt zu verlassen. Der hier verwendete Datums-Zeitstempel basiert auf unserem Beitrag How to generate filename containing date & time on the command line ; wenn du ein anderes Object-Key-Format verwendest, musst du möglicherweise date_regex entsprechend anpassen.

Das folgende Beispielskript iteriert über alle Objekte in einem bestimmten S3-Ordner, sortiert sie nach dem Zeitstempel aus dem Dateinamen und wählt das neueste aus, um es von S3 auf das lokale Dateisystem herunterzuladen.

Dieses Skript basiert auf einigen unserer vorherigen Beiträge, darunter:

s3_sort_by_timestamp.py

#!/usr/bin/env python3
import boto3
import re
import os.path
from collections import namedtuple
from datetime import datetime

# Verbindung zu Wasabi / S3 herstellen
s3 = boto3.resource('s3',
    endpoint_url = 'https://minio.mydomain.com',
    aws_access_key_id = 'my-access-key',
    aws_secret_access_key = 'my-password'
)

# Bucket-Objekt abrufen
backups = s3.Bucket('mfwh-backup')

date_regex = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})_(?P<hour>\d{2})-(?P<minute>\d{2})-(?P<second>\d{2})")

DatedObject =  namedtuple("DatedObject", ["Date", "Object"])
entries = []
# Über Objekte im Bucket iterieren
for obj in backups.objects.filter(Prefix="production/"):
    date_match = date_regex.search(obj.key)
    # Andere Dateien (ohne Datumsstempel) ignorieren, falls vorhanden
    if date_match is None:
        continue
    dt = datetime(year=int(date_match.group("year")), month=int(date_match.group("month")),
        day=int(date_match.group("day")), hour=int(date_match.group("hour")), minute=int(date_match.group("minute")),
        second=int(date_match.group("second")))
    entries.append(DatedObject(dt, obj))
# Einträge nach Datum sortieren
entries.sort(key=lambda entry: entry.Date)

newest_date, newest_obj = entries[-1]
#print(f"Downloading {newest_obj.key} from {newest_date.isoformat()}")
filename = os.path.basename(newest_obj.key)

with open(filename, "wb") as outfile:
    backups.download_fileobj(newest_obj.key, outfile)

# Dateiname für Automatisierungszwecke ausgeben
print(filename)

#!/usr/bin/env python3
import boto3
import re
import os.path
from collections import namedtuple
from datetime import datetime

# Verbindung zu Wasabi / S3 herstellen
s3 = boto3.resource('s3',
    endpoint_url = 'https://minio.mydomain.com',
    aws_access_key_id = 'my-access-key',
    aws_secret_access_key = 'my-password'
)

# Bucket-Objekt abrufen
backups = s3.Bucket('mfwh-backup')

date_regex = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})_(?P<hour>\d{2})-(?P<minute>\d{2})-(?P<second>\d{2})")

DatedObject =  namedtuple("DatedObject", ["Date", "Object"])
entries = []
# Über Objekte im Bucket iterieren
for obj in backups.objects.filter(Prefix="production/"):
    date_match = date_regex.search(obj.key)
    # Andere Dateien (ohne Datumsstempel) ignorieren, falls vorhanden
    if date_match is None:
        continue
    dt = datetime(year=int(date_match.group("year")), month=int(date_match.group("month")),
        day=int(date_match.group("day")), hour=int(date_match.group("hour")), minute=int(date_match.group("minute")),
        second=int(date_match.group("second")))
    entries.append(DatedObject(dt, obj))
# Einträge nach Datum sortieren
entries.sort(key=lambda entry: entry.Date)

newest_date, newest_obj = entries[-1]
#print(f"Downloading {newest_obj.key} from {newest_date.isoformat()}")
filename = os.path.basename(newest_obj.key)

with open(filename, "wb") as outfile:
    backups.download_fileobj(newest_obj.key, outfile)

# Dateiname für Automatisierungszwecke ausgeben
print(filename)

Check out similar posts by category: S3

If this post helped you, please consider buying me a coffee or donating via PayPal to support research & publishing of new posts on TechOverflow

Buy me a coffee