How to extract href attributes from an HTML page using grep & regex
You can use a regular expression to grep for href="..." attributes in an HTML page like this:
grep -oP "(HREF|href)=\"\K.+?(?=\")"
grep is run with -o (print only the matched part of each line instead of the whole line) and -P (use the Perl-compatible regular expression engine, which is required for extra features like \K and lookahead assertions). The regular expression is basically href=".+", where the .+ is used in non-greedy mode (.+?):
href=".+?"
This will give us hits like
href="/files/image.png"
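To see why non-greedy mode matters, here is a minimal sketch that runs both variants on a made-up line containing two links. The sample HTML and the variable names are assumptions for illustration only, and the pattern is single-quoted so the double quotes need no escaping. It assumes a grep with -P support (GNU grep).

```shell
# Hypothetical sample line with two links (not taken from a real page)
html='<a href="/a.html">A</a> <a href="/b.html">B</a>'

# Greedy: .+ runs up to the LAST double quote on the line,
# so both links end up in one overly long match
greedy=$(printf '%s\n' "$html" | grep -oP 'href=".+"')

# Non-greedy: .+? stops at the FIRST closing double quote,
# so each attribute is reported as its own match
lazy=$(printf '%s\n' "$html" | grep -oP 'href=".+?"')

printf 'greedy:\n%s\n' "$greedy"
# greedy:
# href="/a.html">A</a> <a href="/b.html"

printf 'non-greedy:\n%s\n' "$lazy"
# non-greedy:
# href="/a.html"
# href="/b.html"
```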
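Since we only want the content of the quotes (") and not the href="..." part, we can use \K to remove the href part. \K behaves much like a positive lookbehind assertion: everything matched before it is required but dropped from the reported match. From here on, the quotes are written as \" because the pattern is passed to grep inside a double-quoted shell string: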
href=\"\K.+?\"
But we also want to get rid of the closing double quote. To do this, we can use a positive lookahead assertion ((?=\")), which requires the quote to follow but does not include it in the match:
href=\"\K.+?(?=\")
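The effect of each step can be checked on a single made-up link. This is only a sketch: the sample HTML and the variable names are assumptions, and the patterns are single-quoted, so the quotes appear as " instead of \".

```shell
# Hypothetical sample link (the path mirrors the example hit above)
html='<a href="/files/image.png">Image</a>'

# Step 1: plain non-greedy match, attribute name and quotes included
full=$(printf '%s\n' "$html" | grep -oP 'href=".+?"')
echo "$full"       # href="/files/image.png"

# Step 2: \K drops everything matched so far (href=") from the output
no_prefix=$(printf '%s\n' "$html" | grep -oP 'href="\K.+?"')
echo "$no_prefix"  # /files/image.png"

# Step 3: the lookahead (?=") requires the closing quote without printing it
value=$(printf '%s\n' "$html" | grep -oP 'href="\K.+?(?=")')
echo "$value"      # /files/image.png
```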
Now we want to match both href and HREF to get some case insensitivity:
(href|HREF)=\"\K.+?(?=\")
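Note that the alternation only covers the two spellings listed. As a possible alternative (not part of the approach above), grep's -i option makes the whole pattern case-insensitive, which also catches mixed-case spellings such as Href. The sketch below compares both on made-up input; the sample lines and variable names are assumptions.

```shell
# Hypothetical sample with three different spellings of the attribute name
html='<a href="/a">a</a>
<A HREF="/b">b</A>
<a Href="/c">c</a>'

# Alternation: matches only the exact spellings href and HREF
alt=$(printf '%s\n' "$html" | grep -oP '(href|HREF)="\K.+?(?=")')
printf 'alternation:\n%s\n' "$alt"
# alternation:
# /a
# /b

# -i: matches the attribute name regardless of case
ci=$(printf '%s\n' "$html" | grep -oiP 'href="\K.+?(?=")')
printf 'with -i:\n%s\n' "$ci"
# with -i:
# /a
# /b
# /c
```

Keep in mind that -i applies to the entire pattern, so a file extension such as \.png would then also match .PNG.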
Often we want to specifically match one file type. For example, we could match only .png:
(href|HREF)=\"\K.+?\.png(?=\")
To avoid overly long matches on some pages, we use [^\"]+? instead of .+?:
(href|HREF)=\"\K[^\"]+?\.png(?=\")
This disallows matches containing " characters, which prevents a match from extending beyond a single attribute value.
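The difference shows up as soon as a non-.png link precedes a .png link on the same line. A minimal sketch on made-up input (the sample HTML and variable names are assumptions; patterns are single-quoted, so no \" escaping is needed):

```shell
# Hypothetical line: a non-PNG link followed by a PNG link
html='<a href="/index.html">Home</a> <a href="/img/map.png">Map</a>'

# .+? may cross quotes: the match starts at the first href and keeps
# growing until it finally finds .png" in the second link
loose=$(printf '%s\n' "$html" | grep -oP '(href|HREF)="\K.+?\.png(?=")')
echo "$loose"   # /index.html">Home</a> <a href="/img/map.png

# [^"]+? cannot cross a quote: the first href fails to match,
# and only the actual PNG link is reported
strict=$(printf '%s\n' "$html" | grep -oP '(href|HREF)="\K[^"]+?\.png(?=")')
echo "$strict"  # /img/map.png
```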
Usage example:
wget -qO- https://nasagrace.unl.edu/data/NASApublication/maps/ | grep -oP "(href|HREF)=\"\K[^\"]+?\.png(?=\")"
Output:
/data/NASApublication/maps/GRACE_SFSM_20201026.png
[...]