Occasionally I have to clean up some HTML code, mostly because parts of it were pasted into a CMS like WordPress from a rich text editor like Word.
I’ve noticed that the formatting I want to remove is mostly based on div elements with a style attribute. Therefore, I’ve written a simple Python script based on BeautifulSoup4 which replaces certain tags with their contents if they have a style attribute. The script does not look at what the style attribute contains, so in some cases formatting you wanted to keep might be destroyed as well. Still, it is very useful for some recurring use cases.
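To see the effect on a small fragment before running anything on a real file, here is a minimal sketch of the same idea applied to an in-memory string. The helper name unwrap_styled and the sample markup are made up for illustration and are not part of the script; the sketch assumes BeautifulSoup4 is installed.

```python
from bs4 import BeautifulSoup


def unwrap_styled(html, tags=("span",)):
    """Return html with every tag in `tags` that has a style attribute
    replaced by its children. Hypothetical helper for illustration."""
    soup = BeautifulSoup(html, "html.parser")
    for tagtype in tags:
        # find_all returns a list-like result, so modifying the tree
        # while looping over it is safe
        for tag in soup.find_all(tagtype):
            if "style" in tag.attrs:
                tag.unwrap()
    return str(soup)


# Made-up sample: one span with a style attribute, one without
sample = '<p><span style="color:red">Hello</span> <span class="x">world</span></p>'
result = unwrap_styled(sample)
print(result)  # <p>Hello <span class="x">world</span></p>
```

Only the span carrying a style attribute is removed, and its text stays in place. The span that only has a class attribute is left alone.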
#!/usr/bin/env python3
"""
Replace spans (or other tags) with a style attribute by their content in an HTML file
"""
from bs4 import BeautifulSoup

__author__ = "Uli Köhler"
__copyright__ = "Copyright 2017 Uli Köhler"
__license__ = "CC0 1.0 Universal"
__version__ = "1.0"
__email__ = "email@example.com"

def modify_html(infile, outfile, tags):
    # Load HTML
    with open(infile, "r") as fin:
        soup = BeautifulSoup(fin, "html.parser")
    # Replace each matching tag by its contents,
    # but only if it has some kind of style attribute
    for tagtype in tags:  # span, div etc
        for tag in soup.find_all(tagtype):
            if "style" in tag.attrs:
                # unwrap() is the current bs4 name for replaceWithChildren()
                tag.unwrap()
    # Write to outfile
    with open(outfile, "w") as fout:
        fout.write(soup.prettify())

if __name__ == "__main__":
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument('infile', help='The file to read the HTML from')
    parser.add_argument('outfile', help='The file to write the resulting HTML to')
    parser.add_argument('-t', '--tag', nargs='+', default=["span"],
                        help='Which tag types to check & replace')
    args = parser.parse_args()
    # Run modifier
    modify_html(args.infile, args.outfile, args.tag)
You can use it like this:
python3 removeformat.py input.html output.html
By default, it replaces span tags only, but you can change that behaviour using the -t/--tag command line option. For example, to replace both div and span tags:
python3 removeformat.py input.html output.html -t div span
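If it is unclear how the -t option turns into a list of tag names, the following sketch rebuilds just the argument parser from the script and feeds it the two command lines shown above. The function name build_parser is an illustrative assumption; the argument definitions mirror the ones in the script.

```python
import argparse


def build_parser():
    """Rebuild the script's argument parser (illustrative helper)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("infile", help="The file to read the HTML from")
    parser.add_argument("outfile", help="The file to write the resulting HTML to")
    # nargs="+" collects one or more values into a list
    parser.add_argument("-t", "--tag", nargs="+", default=["span"],
                        help="Which tag types to check & replace")
    return parser


parser = build_parser()
default_args = parser.parse_args(["input.html", "output.html"])
multi_args = parser.parse_args(["input.html", "output.html", "-t", "div", "span"])

print(default_args.tag)  # ['span']
print(multi_args.tag)    # ['div', 'span']
```

Because nargs='+' collects every value that follows -t, it is safest to put the option after the two file names, as in the examples above. If -t comes first, the file names are likely to be read as tag names.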