Removing spans/divs with style attributes from HTML

Occasionally I have to clean up some HTML code, mostly because parts of it were pasted into a CMS like WordPress from a rich text editor like Word.

I’ve noticed that the formatting I want to remove is mostly based on span and div elements with a style attribute. Therefore, I’ve written a simple Python script based on BeautifulSoup4 which replaces certain tags with their contents if they have a style attribute. Such a script can also destroy formatting you wanted to keep, since it removes every matching tag regardless of what its style says, but it is very useful for some recurring use cases.

#!/usr/bin/env python3
"""
Replace tags (span by default) that have a style attribute
by their contents in an HTML file
"""
from bs4 import BeautifulSoup

__author__ = "Uli Köhler"
__copyright__ = "Copyright 2017 Uli Köhler"
__license__ = "CC0 1.0 Universal"
__version__ = "1.0"
__email__ = "ukoehler@techoverflow.net"


def modify_html(infile, outfile, tags):
    # Load and parse the HTML
    with open(infile, "r", encoding="utf-8") as f:
        soup = BeautifulSoup(f, "html.parser")

    # Replace each matching tag by its contents,
    # but only if it has a style attribute
    for tagtype in tags:  # span, div etc.
        for tag in soup.find_all(tagtype):
            if "style" in tag.attrs:
                # unwrap() is the current name of the deprecated replaceWithChildren()
                tag.unwrap()

    # Write to outfile (note: prettify() also re-indents the markup)
    with open(outfile, "w", encoding="utf-8") as f:
        f.write(soup.prettify())

if __name__ == "__main__":
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument('infile', help='The file to read the HTML from')
    parser.add_argument('outfile', help='The file to write the resulting HTML to')
    parser.add_argument('-t', '--tag', nargs='+', default=["span"], help='Which tag types to check & replace')
    args = parser.parse_args()
    # Run modifier
    modify_html(args.infile, args.outfile, args.tag)

You can use it like this:

python3 removeformat.py input.html output.html

By default, it replaces span tags only, but you can modify that behaviour using the -t/--tag command line option. For example, to replace div and span:

python3 removeformat.py input.html output.html -t div span
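
To see what the script does to a piece of markup without going through files, here is a minimal sketch that applies the same logic to a string. The helper name strip_styled_tags and the sample markup are made up for this illustration and are not part of the script above; it also returns str(soup) instead of prettify() so the output stays on one line.

```python
from bs4 import BeautifulSoup


def strip_styled_tags(html, tags=("span",)):
    """Hypothetical helper: unwrap every listed tag that has a style attribute."""
    soup = BeautifulSoup(html, "html.parser")
    for tagtype in tags:
        for tag in soup.find_all(tagtype):
            if "style" in tag.attrs:
                # Replace the tag by its children, keeping the text
                tag.unwrap()
    return str(soup)


# Only the span with a style attribute is removed
sample = '<p><span style="color: red">Hello</span> <span class="keep">world</span></p>'
print(strip_styled_tags(sample))
# prints: <p>Hello <span class="keep">world</span></p>

# Unwrapping divs also removes the block boundaries between them
blocks = '<div style="a">A</div><div style="b">B</div>'
print(strip_styled_tags(blocks, tags=("div",)))
# prints: AB
```

The second call shows the kind of collateral damage mentioned at the top: once the styled div elements are gone, their contents run together, so it is worth checking the output before publishing it.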