Automated domain name extraction from Let’s Encrypt certificate transparency logs

A few days ago, Let’s Encrypt went into public beta. At the time of writing, almost 120k certificates have been issued, including the certificate for TechOverflow.

I really like the Let’s Encrypt service and I believe it might actually change the way people perceive HTTPS encryption. However, there is one rarely-mentioned side-effect when protecting your domains with their certificates.

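Certificates issued by Let’s Encrypt are recorded in certificate transparency logs, which can be searched at crt.sh. This transparency does not come without side-effects, however: crt.sh effectively publishes the domain names contained in every certificate it indexes.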

In other words, hiding sites from the public by not publishing their (sub-)domain names anywhere will not work once you obtain a certificate for the domain from a service like Let’s Encrypt.
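
To make this concrete, here is a minimal sketch of how the names end up in a certificate. It assumes the dictionary layout that Python's ssl.SSLSocket.getpeercert() returns; the helper dns_names and the sample certificate are hypothetical and hand-written for illustration, not taken from a real log entry.

```python
def dns_names(cert):
    """Return all DNS entries from the subjectAltName field of a certificate dict."""
    return [value for kind, value in cert.get("subjectAltName", ()) if kind == "DNS"]


# Hand-written sample (an assumption) in the layout used by
# ssl.SSLSocket.getpeercert(): subjectAltName is a tuple of (type, value) pairs.
sample_cert = {
    "subject": ((("commonName", "example.com"),),),
    "subjectAltName": (
        ("DNS", "example.com"),
        ("DNS", "internal-admin.example.com"),
    ),
}

names = dns_names(sample_cert)
for name in names:
    print(name)
```

Every name listed this way is part of the certificate itself, so anyone who can read the certificate can read the "hidden" subdomain as well.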

For demonstration, I quickly hacked together a script that automatically fetches a configurable number of crt.sh IDs and prints out their domain names. It starts at the most recent certificate from Let’s Encrypt that is present in the crt.sh database and works backwards from there.

Use it like this to fetch 1000 certificates:

$ python3 letsencrypt-domains.py 1000

Note that 1000 certificates do not necessarily correspond to 1000 domain names: on the one hand, people sometimes re-issue certificates while getting used to Let’s Encrypt’s mechanics. On the other hand, one certificate may contain multiple domain names.
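
The following sketch illustrates why the counts differ. The per-certificate lists are made-up sample data and unique_domains is a hypothetical helper; it merges the lists and removes duplicates in the same spirit as the script does.

```python
from itertools import chain


def unique_domains(domain_lists):
    """Flatten per-certificate name lists into one sorted list without duplicates."""
    return sorted(set(chain.from_iterable(domain_lists)))


# Made-up sample data: one inner list per fetched crt.sh ID
per_certificate = [
    ["example.com", "www.example.com"],  # one certificate, two names
    ["example.com", "www.example.com"],  # the same certificate re-issued
    ["example.org"],
    [],                                  # an ID that yielded no DNS entries
]

result = unique_domains(per_certificate)
print("{0} IDs, {1} unique names".format(len(per_certificate), len(result)))
```

Here four IDs yield only three distinct names; with other data the number of names can just as well exceed the number of IDs.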

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Automated extraction of domain names from certificate transparency logs

Extracts domain names from crt.sh transparency logs for certificates
issued by Let's Encrypt

FOR DEMONSTRATION PURPOSES ONLY
"""
__copyright__ = "Copyright (c) 2015 Uli Köhler"
__license__ = "Apache License v2.0"
__version__ = "1.0.0"

import requests
import re
import itertools
from multiprocessing import Pool


# Hostnames may also contain hyphens and underscores; wildcard entries start with "*"
domainRegex = re.compile(r"DNS:([A-Za-z0-9.*_-]+)<BR>")
certLinkRegex = re.compile(r'<A href="\?id=(\d+)">')


def fetchDomainsByID(i):
    "Fetch a crt.sh domain list by ID"
    response = requests.get("https://crt.sh/?id={0}".format(i))
    return domainRegex.findall(response.text)


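def findLatestCACRTSHID(caid):
    """
    For a given crt.sh CA ID, find the crt.sh ID of the most recent
    certificate listed for that CA
    """
    response = requests.get("https://crt.sh/?Identity=%25&iCAID={0}".format(caid))
    match = certLinkRegex.search(response.text)
    if match is None:
        # Fail with a clear message instead of an AttributeError on None
        raise RuntimeError("No certificate link found for CA ID {0}".format(caid))
    return int(match.group(1))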

if __name__ == "__main__":
    import argparse
    parser = argparse.ArgumentParser()
    # nargs='?' makes the positional argument optional so that the default applies
    parser.add_argument('n', type=int, nargs='?', default=100, help='Number of certificate IDs to try')
    parser.add_argument('-p', default=32, type=int, help='Number of parallel worker processes')
    parser.add_argument('-q', '--quiet', action="store_true", help='Suppress info messages')
    args = parser.parse_args()

    pool = Pool(args.p)
    start = findLatestCACRTSHID(7395)
    n = args.n

    if not args.quiet:
        print("Fetching {0} crt.sh IDs starting from {1}".format(n, start))

    # Fetch & extract domains with a pool
    domainLists = pool.map(fetchDomainsByID,
                           range(start, start - n, -1))
    domains = sorted(set(itertools.chain(*domainLists)))

    # Print domains
    for domain in domains:
        print(domain)
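
The choice of character class in the domain pattern matters more than it might seem. The sketch below compares a pattern restricted to letters, digits and dots with a broader one that also accepts hyphens, underscores and the wildcard asterisk. The HTML fragment is a made-up sample that only assumes the "DNS:name<BR>" layout the script relies on; it is not a copy of real crt.sh output.

```python
import re

# Narrow pattern: letters, digits and dots only
NARROW_RE = re.compile(r"DNS:([A-Za-z0-9.]+)<BR>")
# Broader pattern: additionally allows "-", "_" and the wildcard "*"
BROAD_RE = re.compile(r"DNS:([A-Za-z0-9.*_-]+)<BR>")

# Made-up fragment (an assumption), shaped like "DNS:name<BR>"
sample_html = (
    "X509v3 Subject Alternative Name:<BR>"
    "DNS:example.com<BR>"
    "DNS:my-site.example.org<BR>"
    "DNS:*.example.net<BR>"
)

narrow = NARROW_RE.findall(sample_html)
broad = BROAD_RE.findall(sample_html)

print("narrow:", narrow)
print("broad: ", broad)
```

With this sample, the narrow pattern silently skips the hyphenated and the wildcard entry, while the broader pattern returns all three names.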