Detecting URLs with PCRE
First published on Saturday, 21 June 2008
Warning: This blog post is more than 15 years old – read and use with care.
From time to time I need to detect URLs in some text where the URLs neither conform to the standard (regarding the characters used) nor are strictly separated from the surrounding text by whitespace or anything else. Now Derick asked me to provide him with a regular expression for that, and I finally wrote one, which should work in most cases:
(
    (?:^|[\s,.!?])
    (?# Ignore matching braces around the URL)
    (<)?
    (\[)?
    (\()?
    (?# Ignore quoting around the URL)
    ([\'"]?)
    (?# Actually match the URL)
    (?P<url>https?://[^\s]*?)
    \4
    (?(3)\))
    (?(2)\])
    (?(1)>)
    (?# Ignore common punctuation after the URL)
    [.,?!]?(?:\s|$)
)xm
Sadly, invalid characters are not always encoded, and you cannot expect URLs to contain only matching braces. Still, users like to write something like:
Check out my Blog ($url)!
In this case the braces are obviously not part of the actual URL, so they should be skipped. The same goes for the other brace types.
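To make this concrete, here is a minimal sketch that runs the expression against such a sentence. It is a port to Python's re module rather than PHP, on the assumption that the constructs used (inline comments, conditional groups, named groups) behave the same there. The outer "( ... )xm" of the PCRE version is a pair of delimiters plus flags, so it becomes re.VERBOSE | re.MULTILINE here. The helper name find_urls and the example.org URLs are made up for illustration.

```python
import re

# Python port of the expression; the flags x and m become re.VERBOSE and
# re.MULTILINE. Group numbers are unchanged: 1 = "<", 2 = "[", 3 = "(",
# 4 = optional quote, and the URL itself is the named group "url".
URL_RE = re.compile(
    r"""
    (?:^|[\s,.!?])
    (?# Ignore matching braces around the URL)
    (<)?
    (\[)?
    (\()?
    (?# Ignore quoting around the URL)
    (['"]?)
    (?# Actually match the URL)
    (?P<url>https?://[^\s]*?)
    \4
    (?(3)\))
    (?(2)\])
    (?(1)>)
    (?# Ignore common punctuation after the URL)
    [.,?!]?(?:\s|$)
    """,
    re.VERBOSE | re.MULTILINE,
)


def find_urls(text):
    """Return the named group 'url' of every match (hypothetical helper)."""
    return [m.group("url") for m in URL_RE.finditer(text)]


print(find_urls("Check out my Blog (http://example.org/blog)!"))
# ['http://example.org/blog']
```

The surrounding braces and the trailing exclamation mark are consumed by the match but stay outside the url group, which is why only the named group is read.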
The regular expression uses conditional subpatterns to check for such matching braces before and after a URL and ignores them when they are found. Quotes are handled similarly, using a backreference to the opening quote. URLs are often followed by punctuation, which should not be part of the actual URL either and is also ignored by this regular expression. Characters like commas, valid or not, still remain part of the URL when they occur inside it.
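The two mechanisms can be looked at in isolation. The sketch below uses two deliberately tiny patterns that are not part of the expression above: one with a conditional subpattern for a brace, one with a backreference for a quote. It is written for Python's re module, assuming it treats these constructs like PCRE does; the names PAREN_RE and QUOTE_RE are made up.

```python
import re

# Group 1 optionally captures "(". The conditional (?(1)\)) demands a
# closing ")" only if group 1 actually took part in the match.
PAREN_RE = re.compile(r"^(\()?(?P<word>\w+)(?(1)\))$")

# Group 1 captures an optional quote (possibly empty); \1 then requires
# exactly the same text, quote or nothing, after the word.
QUOTE_RE = re.compile(r"""^(['"]?)(?P<word>\w+)\1$""")

for sample in ["word", "(word)", "(word", "word)"]:
    print(repr(sample), "braces ok:", bool(PAREN_RE.match(sample)))

for sample in ["word", "'word'", '"word"', "'word\""]:
    print(repr(sample), "quotes ok:", bool(QUOTE_RE.match(sample)))
```

Only balanced input matches: "word" and "(word)" pass the first pattern, while a lone opening or closing brace does not. The quote pattern likewise rejects mismatched quote characters.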
There are two issues which, I think, are not really solvable with a regular expression, but additions and suggestions are very welcome:
PCRE does not reuse the end marker (?:\s|$) as the start marker for the next URL, and I see no way to get the regular expression working without these markers. This means that of two URLs separated by only a single whitespace character, only the first one would be detected when calling preg_match_all(). You can still call preg_match() in a while loop, though, and remove each URL from the text after you have found it.
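A rough sketch of that loop, again as a Python port instead of PHP: re.search stands in for preg_match(), finditer for preg_match_all(), and the pattern is the expression above written without comments. The function name find_urls_looping and the sample URLs are assumptions made for illustration.

```python
import re

# Compact Python port of the expression (same groups, no inline comments).
URL_RE = re.compile(
    r"""(?:^|[\s,.!?])(<)?(\[)?(\()?(['"]?)(?P<url>https?://[^\s]*?)\4"""
    r"""(?(3)\))(?(2)\])(?(1)>)[.,?!]?(?:\s|$)""",
    re.MULTILINE,
)


def find_urls_looping(text):
    """Search repeatedly, cutting each found URL out of the text."""
    urls = []
    while True:
        match = URL_RE.search(text)
        if match is None:
            return urls
        urls.append(match.group("url"))
        # Remove only the URL itself, so the whitespace around it is
        # still there to act as start marker for the next search.
        text = text[:match.start("url")] + text[match.end("url"):]


text = "http://a.example/ http://b.example/"

# A single pass consumes the separating space as the end marker of the
# first URL, so the second URL has no start marker left.
print([m.group("url") for m in URL_RE.finditer(text)])
# ['http://a.example/']

print(find_urls_looping(text))
# ['http://a.example/', 'http://b.example/']
```

Each round removes at least the scheme part of a URL from the text, so the loop always terminates.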
Some users put parenthetical remarks in braces, where the closing brace may come right after the URL, like this:
Hi there (Check out my blog at $url)!
Here the closing brace after the URL won't be removed, because there is no opening brace right before the URL.
I don't think this is fixable, because you can't expect users to have only matching braces in their sentences, nor can you expect that of the URLs themselves. So we can only guess which will be the more common problem: ignoring closing braces at the end of URLs, or users writing such sentences.
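Both sides of that guess can be reproduced with a short sketch. As before this is a Python port of the expression, not the PHP original, and the sample sentences and example.org URLs are invented for illustration.

```python
import re

# Compact Python port of the expression (same groups, no inline comments).
URL_RE = re.compile(
    r"""(?:^|[\s,.!?])(<)?(\[)?(\()?(['"]?)(?P<url>https?://[^\s]*?)\4"""
    r"""(?(3)\))(?(2)\])(?(1)>)[.,?!]?(?:\s|$)""",
    re.MULTILINE,
)

# The opening brace is far away from the URL, so the closing brace is
# taken as part of the URL.
remark = "Hi there (Check out my blog at http://example.org/blog)!"
print(URL_RE.search(remark).group("url"))
# http://example.org/blog)

# Here the closing brace really belongs to the URL, so blindly stripping
# trailing braces would break this case instead.
wiki = "See http://example.org/wiki/Foo_(bar) for details"
print(URL_RE.search(wiki).group("url"))
# http://example.org/wiki/Foo_(bar)
```

The expression treats both inputs the same way, which is correct for the second sentence and wrong for the first.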
Still, I think this regular expression might be useful to you; feel free to use it wherever you find it useful. As a German I am not allowed to put something into the public domain, but I grant anyone the right to use this for any purpose, without any conditions, unless such conditions are required by law.