Removing spans/divs with style attributes from HTML
Occasionally I have to clean up some HTML code - mostly because parts of it were pasted into a CMS like WordPress from a rich text editor like Word.
I’ve noticed that the formatting I want to remove is mostly based on span and div elements with a style attribute. Therefore, I’ve written a simple Python script based on BeautifulSoup4 which replaces certain tags with their contents if they have a style attribute. While such a script may destroy other formatting in some cases, it is very useful for some recurring use cases.
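Before the full script, here is a minimal sketch of the core idea on an in-memory snippet. The sample HTML is made up for illustration; the only BeautifulSoup calls used are find_all() and unwrap(), which replaces a tag with its children.

```python
from bs4 import BeautifulSoup

# Made-up sample: one styled span and one span without a style attribute
html = '<p><span style="color: red">Hello</span> <span>World</span></p>'
soup = BeautifulSoup(html, "html.parser")

for tag in soup.find_all("span"):
    if "style" in tag.attrs:
        tag.unwrap()  # replace the tag with its contents

result = str(soup)
print(result)
```

Only the span carrying a style attribute is replaced by its contents; the plain span is left alone.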
#!/usr/bin/env python3
"""
Replace spans with style attribute by their content
in an HTML file
"""
from bs4 import BeautifulSoup
__author__ = "Uli Köhler"
__copyright__ = "Copyright 2017 Uli Köhler"
__license__ = "CC0 1.0 Universal"
__version__ = "1.0"
__email__ = "[email protected]"
def modify_html(infile, outfile, tags):
    # Load HTML
    with open(infile, "r") as f:
        soup = BeautifulSoup(f, "html.parser")
    # Replace every matching tag by its contents,
    # but only if it has some kind of style attribute
    for tagtype in tags:  # span, div etc
        for tag in soup.find_all(tagtype):
            if "style" in tag.attrs:
                # unwrap() is the current name of the deprecated replaceWithChildren()
                tag.unwrap()
    # Write to outfile
    with open(outfile, "w") as f:
        f.write(soup.prettify())
if __name__ == "__main__":
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument('infile', help='The file to read the HTML from')
    parser.add_argument('outfile', help='The file to write the resulting HTML to')
    parser.add_argument('-t', '--tag', nargs='+', default=["span"], help='Which tag types to check & replace')
    args = parser.parse_args()
    # Run modifier
    modify_html(args.infile, args.outfile, args.tag)
You can use it like this:
python3 removeformat.py input.html output.html
By default, it replaces span tags only - but you can modify that behaviour using command line options. For example, to replace div and span:
python3 removeformat.py input.html output.html -t div span
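To see what the tag selection changes without touching any files, the same logic can be tried on a string. This is only a sketch: strip_styled_tags is a hypothetical helper that is not part of the script above, and the sample markup is invented.

```python
from bs4 import BeautifulSoup


def strip_styled_tags(html, tags=("span",)):
    """Hypothetical in-memory variant of modify_html() (not in the original script)."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all() accepts a list of tag names and returns all matches up front
    for tag in soup.find_all(list(tags)):
        if "style" in tag.attrs:
            tag.unwrap()
    return str(soup)


# Invented sample: a styled div wrapping a styled span, plus a div without style
sample = (
    '<div style="margin: 0"><span style="font-size: 12pt">Text</span></div>'
    '<div class="box">Kept</div>'
)

spans_only = strip_styled_tags(sample)                    # default, like no -t option
divs_and_spans = strip_styled_tags(sample, ("div", "span"))  # like -t div span
print(spans_only)
print(divs_and_spans)
```

Note that the whole tag is dropped, including any class or id it carries in addition to style, so this is where other formatting can get lost. Tags without a style attribute, like the second div here, are kept as they are.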