Detecting URLs with PCRE
First published at Saturday 21 June 2008
Warning: This blog post is more than 16 years old – read and use with care.
From time to time I need to detect URLs in some text where the URLs are neither standards-compliant (regarding the characters used) nor strictly separated from the surrounding text by whitespace or something similar. Now Derick asked me to provide him with a regular expression for that, and I finally wrote one, which should work in most cases:
(
(?:^|[\s,.!?])
(?# Ignore matching braces around the URL)
(<)?
(\[)?
(\()?
(?# Ignore quoting around the URL)
([\'"]?)
(?# Actually match the URL)
(?P<url>https?://[^\s]*?)
\4
(?(3)\))
(?(2)\])
(?(1)>)
(?# Ignore common punctuation after the URL)
[.,?!]?(?:\s|$)
)xm
Sadly, invalid characters are not always encoded, and you also can't expect URLs to contain only matching braces. Still, users like to write something like:
Check out my Blog ($url)!
In this case the braces are obviously not part of the actual URL, so you should skip them. The same applies to the other brace types.
The regular expression uses conditional subpatterns to check for those matching braces before and after a URL, and ignores them when they are found. The same goes for quotes. URLs are also often followed by punctuation, which shouldn't be included in the actual URL either and is ignored by this regular expression as well. Characters like commas, however, are still included in the URL when they are used inside it, even if they are not valid there.
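To make this more concrete, here is a minimal sketch of how the expression above might be applied with preg_match_all(). The pattern is copied unchanged from above. The sample text and the example.com URLs are made up for illustration.

```php
<?php
// The pattern from above, unchanged. The outer "(" and ")" serve as the
// PCRE delimiters, followed by the modifiers x (extended) and m (multiline).
$pattern = <<<'REGEX'
(
(?:^|[\s,.!?])
(?# Ignore matching braces around the URL)
(<)?
(\[)?
(\()?
(?# Ignore quoting around the URL)
([\'"]?)
(?# Actually match the URL)
(?P<url>https?://[^\s]*?)
\4
(?(3)\))
(?(2)\])
(?(1)>)
(?# Ignore common punctuation after the URL)
[.,?!]?(?:\s|$)
)xm
REGEX;

// Made-up sample text: one URL in braces, one in quotes followed by a comma.
$text = "Check out my Blog (http://example.com/blog)!\n"
      . "Docs are at \"https://example.com/docs\", have fun.";

preg_match_all($pattern, $text, $matches);

// The named group "url" holds the URLs without the surrounding
// braces, quotes and trailing punctuation.
print_r($matches['url']);
```

With this input, the named group should contain http://example.com/blog and https://example.com/docs, without the braces, quotes, or punctuation around them.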
Issues
There are two issues which I think are still not really solvable with a regular expression, but additions and suggestions would be really welcome:
PCRE does not reuse the end marker (?:\s|$) as the start marker for the next URL, and I see no way to get the regular expression working without it. This means that of two URLs separated by only one whitespace character, only the first one is detected when calling preg_match_all(). You can still call preg_match() in a while loop, though, and remove each URL from the text after you have found it.
Some users tend to use braces for subsentences, where one brace may end right after the URL, like this:
Hi there (Check out my blog at $url)!
Here the closing brace after the URL won't be removed, because there is no opening brace right before the URL.
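A small sketch of this second issue, again with a made-up example.com URL in place of $url. It only demonstrates the behaviour described above and does not try to fix it.

```php
<?php
// The pattern from above, unchanged.
$pattern = <<<'REGEX'
(
(?:^|[\s,.!?])
(?# Ignore matching braces around the URL)
(<)?
(\[)?
(\()?
(?# Ignore quoting around the URL)
([\'"]?)
(?# Actually match the URL)
(?P<url>https?://[^\s]*?)
\4
(?(3)\))
(?(2)\])
(?(1)>)
(?# Ignore common punctuation after the URL)
[.,?!]?(?:\s|$)
)xm
REGEX;

// The opening brace is far away from the URL, so group 3 never matches
// and the closing brace is not stripped.
$text = 'Hi there (Check out my blog at http://example.com/blog)!';

preg_match($pattern, $text, $match);

echo $match['url'], "\n"; // prints "http://example.com/blog)"
```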
I don't think this is fixable, because you can't expect users to have only matching braces in their sentences, nor can you expect that for the URLs themselves. So we can only guess which will be the more common problem: ignoring closing braces at the end of URLs, or users writing such sentences...
Still, I think this regular expression might be useful to you, so feel free to use it wherever you find it useful. As a German I am not allowed to put something into the public domain, but I grant anyone the right to use this for any purpose, without any conditions, unless such conditions are required by law.
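For the first issue, the preg_match() loop mentioned above could look roughly like this. It is only a sketch: the helper name extractUrls() and the sample text are my own assumptions. It cuts just the URL itself out of the text, so the whitespace around it stays in place and can serve as the start marker for the next URL.

```php
<?php
// The pattern from above, unchanged.
$pattern = <<<'REGEX'
(
(?:^|[\s,.!?])
(?# Ignore matching braces around the URL)
(<)?
(\[)?
(\()?
(?# Ignore quoting around the URL)
([\'"]?)
(?# Actually match the URL)
(?P<url>https?://[^\s]*?)
\4
(?(3)\))
(?(2)\])
(?(1)>)
(?# Ignore common punctuation after the URL)
[.,?!]?(?:\s|$)
)xm
REGEX;

/**
 * Hypothetical helper (not part of the original expression): collects URLs
 * by calling preg_match() repeatedly and removing each found URL.
 */
function extractUrls(string $pattern, string $text): array
{
    $urls = [];
    while (preg_match($pattern, $text, $match, PREG_OFFSET_CAPTURE) === 1) {
        // With PREG_OFFSET_CAPTURE each group is [matched string, byte offset].
        [$url, $offset] = $match['url'];
        $urls[] = $url;
        // Remove only the URL itself; the surrounding whitespace is kept
        // and can act as the start marker for the next URL.
        $text = substr_replace($text, '', $offset, strlen($url));
    }
    return $urls;
}

// Two made-up URLs separated by a single space.
$text = 'http://example.com/a http://example.com/b';

preg_match_all($pattern, $text, $matches);
print_r($matches['url']);              // only the first URL is found

print_r(extractUrls($pattern, $text)); // both URLs are found
```

Since the URL group always contains at least the scheme, every iteration shortens the text, so the loop terminates.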
Comments
Mark Armendariz at Saturday, 21.6. 2008
I've been using this for years, which has been incredibly successful for me:
'/(?P<protocol>(?:(?:f|ht)tp|https):\/\/)?
(?P<domain>(?:(?!-)
(?P<sld>[a-zA-Z\d\-]+)(?<!-)
[\.]){1,2}
(?P<tld>(?:[a-zA-Z]{2,}\.?){1,}){1,}
|
(?P<ip>(?:(?(?<!\/)\.)(?:25[0-5]|2[0-4]\d|[01]?\d?\d)){4})
)
(?::(?P<port>\d{2,5}))?
(?:\/
(?P<script>[~a-zA-Z\/.0-9-_]*)?
(?:\?(?P<parameters>[=a-zA-Z+%&0-9,.\/_ -]*))?
)?
(?:\#(?P<anchor>[=a-zA-Z+%&0-9._]*))?/x';
It has an optional protocol (which you can make mandatory by removing the ? at the end of the 1st line), and names all the parts (protocol, domain, sld, tld, ip, port, script, parameters, anchor).
You can include internal ones using a 'servername' like so:
'|(?P<servername>[a-zA-Z\d\-]*[a-zA-Z\d][^:\/]?)'
after the '<ip>' line.
Mark
Kore at Saturday, 21.6. 2008
I wrote similar ones following the specifications of the relevant RFCs, but this is actually not the point of the regular expression mentioned above.
The one above does not try to detect the parts of URLs (I found the PHP function parse_url() more useful, and more readable, for that task), but focuses on filtering URLs out of random text. Your regular expression misses that part and does not accept quite common URL characters like () and ;.
But anyways - thanks for sharing that regular expression.
Mark Armendariz at Monday, 23.6. 2008
Good call about the extra characters. I had recently added semicolons and commas, but hadn't thought to add parentheses (thanks for the suggestion!). The regex I gave can be used to filter them out, but now I realize what you're showing in your post. I'd originally misread it (sorry).
As for punctuation surrounding a URL, I imagine you could get rid of anything that is "touching" a URL. Any surrounding text that is not a \s (or even \b) would likely be associated with that URL and would likely do well to be filtered out as well.
Dusko at Wednesday, 29.4. 2009
Mark's regular expression matches, for example, "image.jpg", and when a URL is at the end of a sentence (ending with a period), the trailing period is also matched.
I am not good at writing regexes, but if someone could correct these little bugs, this regex will be veeery good!
Marcos G. at Saturday, 26.9. 2009
Mark's regular expression has problems with Wikipedia's URLS: http://es.wikipedia.org/wiki/Mozart_(desambiguación)
Lars Strojny at Wednesday, 1.2. 2012
A little more complex version with more enclosings and www-detection:
(
(?P<before>
    (?:^|[\s,.!?_\-]*) # Usual characters before an URL
    (<)? # Ignore angle brackets around the URL
    (\[)? # Ignore brackets around the URL
    (\()? # Ignore braces around the URL
    ([\'"`]*) # Ignore quoting around the URL
    („)? # German double quotation marks
    (‚)? # German single quotation marks
    (“)? # English double quotation marks
    (‘)? # English single quotation marks
    («)? # Guillemets (Romance languages)
    (‹)? # Single guillemets (Romance languages)
    (»)? # Guillemets (German language usage, except Switzerland)
    (›)? # Single guillemets (German language usage, except Switzerland)
    (¿)? # Spanish question mark
    (¡)? # Spanish exclamation mark
)
(?P<url>(https?://|www\.)[^\s]*?) # Actual URL match
(?<after>
    (?(15)\!) # Closing Spanish exclamation mark
    (?(14)\?) # Closing Spanish question mark
    (?(13)‹) # Closing single guillemets (German language usage, except Switzerland)
    (?(12)«) # Closing guillemets (German language usage, except Switzerland)
    (?(11)›) # Closing single guillemets (Romance languages)
    (?(10)») # Closing guillemets (Romance languages)
    (?(9)’) # Closing English single quotation marks
    (?(8)”) # Closing English double quotation marks
    (?(7)‘) # Closing German single quotation marks
    (?(6)“) # Closing German double quotation marks
    \5 # If fake quotation mark, we expect closing fake quotation mark
    (?(4)\)) # If braces, expect closing brace
    (?(3)\]) # If brackets, expect closing bracket
    (?(2)>) # If angle bracket, expect closing angle bracket
    [.,?!_\-]*(?:\s|$) # Ignore common punctuation after the URL
)
)xm