How to extract href attributes from an HTML page using grep & regex

You can use a regular expression to grep for href="..." attributes in an HTML document like this:

grep -oP "(HREF|href)=\"\K.+?(?=\")"

grep is run with -o (print only the matched part of each line rather than the whole line) and -P (use the Perl-compatible regular expression engine, which is required for features like \K and lookahead assertions). The regular expression is basically


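href=\".+\"

where the .+ is used in non-greedy mode (.+?), so that each match ends at the first closing quote rather than the last one on the line:

href=\".+?\"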

This will give us hits like


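To see such hits without fetching a real page, here is a minimal sketch that runs the non-greedy pattern, and the greedy one for comparison, on a made-up HTML line. The sample input and the variable names are assumptions for illustration, and GNU grep built with -P support is assumed.

```shell
# Hypothetical sample input, made up for this illustration.
html='<a href="/index.html">Home</a> <a href="/about.html">About</a>'

# Non-greedy: every href attribute is reported as its own hit.
lazy_hits=$(printf '%s\n' "$html" | grep -oP "href=\".+?\"")
printf '%s\n' "$lazy_hits"
# href="/index.html"
# href="/about.html"

# Greedy, for comparison: a single hit that runs to the last quote on the line.
greedy_hits=$(printf '%s\n' "$html" | grep -oP "href=\".+\"")
printf '%s\n' "$greedy_hits"
# href="/index.html">Home</a> <a href="/about.html"
```

Note that every hit still carries the href=" prefix and the closing quote.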
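Since we only want the content of the quotes (") and not the href="..." part, we can use \K, which drops everything matched so far from the reported match (it behaves much like a variable-length positive lookbehind assertion), to remove the href part:

href=\"\K.+?\"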

We also want to get rid of the closing double quote. To do this, we can use a positive lookahead assertion ((?=\")), which requires a quote to follow but does not include it in the match:

href=\"\K.+?(?=\")

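The effect of combining \K with the lookahead can be checked with a small sketch like the following. The HTML line and the variable names are made up for illustration, and GNU grep with -P support is assumed.

```shell
# Hypothetical sample input, made up for this illustration.
html='<a href="/index.html">Home</a> <a href="/about.html">About</a>'

# \K drops the href=" prefix from the reported match,
# (?=\") checks for the closing quote without consuming it.
values=$(printf '%s\n' "$html" | grep -oP "href=\"\K.+?(?=\")")
printf '%s\n' "$values"
# /index.html
# /about.html
```

Only the attribute values remain, one per line, which makes the output easy to pipe into other tools.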
Now we want to match both href and HREF to get some degree of case insensitivity (mixed-case spellings such as Href are still not matched):

(href|HREF)=\"\K.+?(?=\")

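As a sketch of what the alternation does and does not cover, the following compares it with the inline (?i) option of Perl-compatible regular expressions. The (?i) variant is an alternative that is not part of the command above, and the sample HTML and variable names are assumptions for illustration.

```shell
# Hypothetical sample input with three different spellings of the attribute name.
html='<A HREF="/a.html">A</A> <a Href="/b.html">B</a> <a href="/c.html">C</a>'

# Alternation: matches only the all-lowercase and all-uppercase spellings.
alt_hits=$(printf '%s\n' "$html" | grep -oP "(href|HREF)=\"\K.+?(?=\")")
printf '%s\n' "$alt_hits"
# /a.html
# /c.html

# Inline (?i) option: matches any mix of upper and lower case.
ci_hits=$(printf '%s\n' "$html" | grep -oP "(?i)href=\"\K.+?(?=\")")
printf '%s\n' "$ci_hits"
# /a.html
# /b.html
# /c.html
```

Which variant is preferable depends on the pages being scanned; the alternation is stricter, while (?i) also catches unusual capitalization.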
Often we want to specifically match one file type. For example, we could match only links ending in .png:

(href|HREF)=\"\K.+?\.png(?=\")

To avoid overly long false matches on some pages, we use [^\"]+? instead of .+?:

(href|HREF)=\"\K[^\"]+?\.png(?=\")

This disallows matches containing " characters, so a match can never extend beyond the attribute value.
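The difference is easiest to see on a line where a non-.png link is followed by some other attribute ending in .png. The following is a minimal sketch on made-up input; the HTML line and variable names are assumptions for illustration.

```shell
# Hypothetical sample input: the first link is not a .png,
# but the img tag right after it has a src ending in .png.
html='<a href="page.html"><img src="logo.png"></a> <a href="photo.png">Photo</a>'

# With .+? the first match runs across the quotes into the img tag.
loose=$(printf '%s\n' "$html" | grep -oP "(href|HREF)=\"\K.+?\.png(?=\")")
printf '%s\n' "$loose"
# page.html"><img src="logo.png
# photo.png

# With [^\"]+? a match cannot cross a quote, so only the real .png link is reported.
strict=$(printf '%s\n' "$html" | grep -oP "(href|HREF)=\"\K[^\"]+?\.png(?=\")")
printf '%s\n' "$strict"
# photo.png
```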

Usage example (set the shell variable URL to the address of the page to scan):

wget -qO- "$URL" | grep -oP "(href|HREF)=\"\K[^\"]+?\.png(?=\")"
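For repeated use, the command could be wrapped in a small shell function. The following is only a sketch: extract_hrefs is a hypothetical helper name, the optional extension argument is assumed to contain no regex metacharacters, and the sample HTML is made up so the example runs without network access.

```shell
# Hypothetical helper: reads HTML on stdin and prints href values, one per line.
# Optional first argument: a file extension without the dot (e.g. png).
extract_hrefs() {
    ext="${1:-}"
    if [ -n "$ext" ]; then
        # Only values ending in the given extension.
        grep -oP "(href|HREF)=\"\K[^\"]+?\.${ext}(?=\")"
    else
        # All href values.
        grep -oP "(href|HREF)=\"\K[^\"]+?(?=\")"
    fi
}

# Made-up sample input for illustration.
page='<a href="a.png">x</a> <a href="b.html">y</a>'

png_links=$(printf '%s\n' "$page" | extract_hrefs png)
all_links=$(printf '%s\n' "$page" | extract_hrefs)
printf '%s\n' "$png_links"
# a.png
printf '%s\n' "$all_links"
# a.png
# b.html
```

In place of the printf, the output of wget -qO- for a real page can be piped into the function in the same way.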