PHP: extract & validate URLs from a string

November 9th 2010

Again, here's a script I put together. Feel free to use it as you wish.

Just pass the base URL and the data (the HTML you want to scan) into the function to get an array back. The array only holds unique values, so you won't get duplicate links.

Call with: extract_links_from_data($url,$data)

PHP Code:
<?php
/*
    Extract URLs from content and validate them using PHP

    Roger Thomas (www.rogerethomas.com)

    This was created for a specific purpose, so it may need to be torn apart to fit what you need.
    As ever, it's free, feel free to link to me, but you really don't have to!

*/

function extract_links_from_data($url, $data)
{
    $final = array();

    // Match <a ... href="..."> ... </a> and capture the href value as "foundurl"
    $regex = '/(<a\s*(.*?)\s*href=[\'"]+?\s*(?P<foundurl>\S+)\s*[\'"]+?\s*(.*?)\s*>\s*(.*?)\s*<\/a>)/i';

    preg_match_all($regex, $data, $links);

    foreach ($links['foundurl'] as $key => $value) {
        $check_slash = substr($value, 0, 1);
        $check_dotdotslash = substr($value, 0, 3);
        $first_seven = substr($value, 0, 7);

        if ($check_slash == "/") {
            // Root-relative link: prepend the scheme and host of the base URL
            $r_url = parse_url($url);
            $r_url = $r_url['scheme'] . "://" . $r_url['host'];
            $add = $r_url . $value;
        }
        elseif ($first_seven == "http://" || $first_seven == "https:/") {
            // Already absolute ("https:/" is the first seven characters of "https://")
            $add = $value;
        }
        elseif ($check_dotdotslash == "../") {
            // Parent-directory links are not resolved; this placeholder fails the validation below
            $add = "not implemented";
        }
        else {
            // Relative link: append it to the directory of the base URL
            $add = dirname($url) . "/" . $value;
        }

        if (!in_array($add, $final)) {
            // finally, let's check if the produced URL is valid:
            if (preg_match('|^http(s)?://[a-z0-9-]+(\.[a-z0-9-]+)*(:[0-9]+)?(/.*)?$|i', $add)) {
                array_push($final, $add);
            }
        }
    }

    return $final;
}
?>
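
The "../" branch in the function above is left as "not implemented", so those links get dropped by the validation step. Below is a rough sketch of one way that case could be handled. The helper name resolve_relative_url is made up for illustration and is not part of the original script, and it assumes the base URL is absolute (it has a scheme and a host). It only walks the path segments, so treat it as a starting point and not as a complete URL resolver.

```php
<?php
// Hypothetical helper (not part of the original script): resolves a relative
// link such as "../about.html" against an absolute base URL.
function resolve_relative_url($base, $relative)
{
    $parts = parse_url($base);

    // Assumption: $base has both a scheme and a host
    $root = $parts['scheme'] . '://' . $parts['host'];
    if (isset($parts['port'])) {
        $root .= ':' . $parts['port'];
    }

    // Directory of the base document, e.g. "/a/b/page.html" -> "/a/b"
    $path = isset($parts['path']) ? $parts['path'] : '/';
    $dir = (string) substr($path, 0, (int) strrpos($path, '/'));

    $segments = ($dir === '') ? array() : explode('/', ltrim($dir, '/'));

    foreach (explode('/', $relative) as $segment) {
        if ($segment === '..') {
            // Go up one level; at the root this is a no-op
            array_pop($segments);
        } elseif ($segment !== '.' && $segment !== '') {
            $segments[] = $segment;
        }
    }

    return $root . '/' . implode('/', $segments);
}
```

With a base of http://example.com/a/b/page.html, the link "../c.html" comes out as http://example.com/a/c.html. To try it in the function above, the "../" branch could set $add to resolve_relative_url($url, $value) in place of the placeholder string.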