
Web Spider using php cli

Purely as an academic exercise, I once helped cook up a web spider that builds site maps. The spider is written in PHP, uses a couple of regular-expression matches, and finally writes the full sitemap starting from a given URL. The system is assembled from two classes, WebPage and WebSpider. Then, to make it behave like a Linux command-line utility, a few helper functions were added.

Class WebPage

This class represents a single page. While the spider walks the site, it adds to the subpages array an instance of WebPage for each link found on the page, unless that link is already indexed in the global variable $done. The code will explain this better than my prose.
class WebPage {
    public $url;
    public $caption;
    public $subpages = array();
    public $error;

    function WebPage($url, $caption){
        $c = parse_url($url);
        // make sure the url carries at least a root path
        if(!isset($c['path'])) $url .= '/';
        $this->url = $url;
        $this->caption = $caption;
    }

    function walkDown(){
        global $spider;
        foreach($this->subpages as $key => $obj){
            $this->subpages[$key]->subpages = $spider->fetch($obj);
            if(!empty($this->subpages[$key]->subpages))
                $this->subpages[$key]->walkDown();
        }
    }

    function toString($rl = 0){
        echo str_repeat(' ', $rl);
        if($rl > 0) echo '<li>';
        echo '<a href="' . $this->url . '">' . $this->caption . '</a>';
        if(!empty($this->subpages)){
            echo '<ul>';
            foreach ($this->subpages as $obj){
                $obj->toString($rl + 5);
            }
            echo '</ul>';
        }
        if($rl > 0) echo '</li>';
        echo "\n";
    }
}

The function WebPage is the constructor; walkDown walks through the subpages, if any, and issues a recursive walk action; and toString builds the XHTML sitemap.
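To illustrate what toString emits, here is a trimmed-down copy of the class (renamed DemoPage, with the URL normalisation from the real constructor omitted) fed a hand-built two-level tree; it is only a sketch of the output format, not part of the spider itself:

```php
<?php
// Trimmed copy of WebPage, kept only to demonstrate the toString() output.
class DemoPage {
    public $url;
    public $caption;
    public $subpages = array();

    function __construct($url, $caption){
        $this->url = $url;
        $this->caption = $caption;
    }

    function toString($rl = 0){
        echo str_repeat(' ', $rl);
        if($rl > 0) echo '<li>';
        echo '<a href="' . $this->url . '">' . $this->caption . '</a>';
        if(!empty($this->subpages)){
            echo '<ul>';
            foreach ($this->subpages as $obj){
                $obj->toString($rl + 5);
            }
            echo '</ul>';
        }
        if($rl > 0) echo '</li>';
        echo "\n";
    }
}

// hand-built tree: a home page with two children
$home = new DemoPage('http://example.com/', 'Home');
$home->subpages[] = new DemoPage('http://example.com/about', 'About');
$home->subpages[] = new DemoPage('http://example.com/blog', 'Blog');

// prints the nested <ul>/<li> sitemap, one anchor per line, indented by depth
$home->toString();
```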

Class WebSpider

This is where a single web page is fetched and parsed to identify URLs and anchor tags, assembling one instance of the WebPage class. The fetch method gets the page contents using file_get_contents, which implies that allow_url_fopen must be enabled. Please note that this is not a full-fledged application, in which case cURL would have been a better option. For academic purposes the system shown was enough, and would do; for a production version, fetching multiple pages in parallel using cURL or similar would be best.
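For reference, a cURL-based replacement for the file_get_contents() call might look like the sketch below; curl_fetch() is a hypothetical helper name, not part of the classes in this article:

```php
<?php
// Hypothetical helper: fetch a page with cURL instead of file_get_contents().
// Returns the page body, or false on failure (error text placed in $error).
function curl_fetch($url, &$error){
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);
    $body = curl_exec($ch);
    if($body === false){
        $error = curl_error($ch);
    }
    curl_close($ch);
    return $body;
}
```

Inside fetch(), the ob_start()/file_get_contents()/ob_get_clean() sequence would then reduce to a single `$homepage = curl_fetch($webPage->url, $err);` call, with $err stored on the WebPage when the fetch fails.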

class WebSpider {
    public $restrict;

    function WebSpider(&$webPage, $restrict = false){
        $this->restrict = $restrict;
        $webPage->subpages = $this->fetch($webPage);
    }

    function fetch(&$webPage) {
        global $done;
        // buffer any warning from file_get_contents so it can be kept as the error
        ob_start();
        $homepage = file_get_contents($webPage->url);
        $w = ob_get_clean();
        if(!$homepage){
            $webPage->error = $w;
            return;
        }
        echo $webPage->url . "\n";
        // flatten the markup so each anchor ends up on its own line
        $homepage = preg_replace("@\n@", '', $homepage);
        $homepage = strip_tags($homepage, '<a>');
        $homepage = str_replace('><a', ">\n<a", $homepage);
        $arr = array();
        preg_match_all("@<a([^>]+)>([^<]+)</a>@i", $homepage, $matches);
        unset($homepage);
        foreach($matches[2] as $id => $caption){
            preg_match("@href=(\"|')([^\"']+)@", $matches[0][$id], $m);
            $url = trim($m[2], '#/');
            if(empty($url)) continue;
            // resolve relative links against the current page's path
            if(substr($url, 0, 4) !== 'http'){
                $page_path = substr($webPage->url, 0, strrpos($webPage->url, '/'));
                $url = $page_path . '/' . $url;
            }
            if($this->restrict && !preg_match('@' . $this->restrict . '@', $url)) continue;
            // skip links that are already indexed
            $key = md5($url . $caption);
            if(in_array($key, $done)) continue;
            $done[] = $key;
            $arr[] = new WebPage($url, $caption);
        }
        unset($matches);
        return $arr;
    }
}


Command line code

To combine these, as well as provide a CLI-compatible interface, the following code was appended to the class definitions, and the final suite was ready to be run using ‘php -q spider.php -u <url> -c <caption> -o <outfile>’ or ‘php -q spider.php --url <url> --caption <caption> --outfile <outfile>’.

/* show a short help */
function showUsage(){
    echo "usage: <> -u <url> -c <top caption> -o <outfile>\n";
}

/* check for parameter values; minimum we need 7, first one being ourself */
if(count($argv) < 7){
    showUsage();
    exit();
}

/* identify parameters and assign to variables; maybe there is a
   different method but this was enough for the purpose */
for($i = 1; $i < 7; $i += 2){
    switch($argv[$i]){
        case '-u':
        case '--url':
            $url = $argv[($i + 1)];
            break;
        case '-c':
        case '--caption':
            $caption = $argv[($i + 1)];
            break;
        case '-o':
        case '--outfile':
            $outfile = $argv[($i + 1)];
            break;
        default:
            showUsage();
            exit();
    }
}

/* some global variables */
$done = array();
$pages = array();

/* instances of the classes to do the real job */
$webPage = new WebPage($url, $caption);
$spider = new WebSpider($webPage, $url);
$webPage->walkDown();

/* capture all output, such that we can write it to a file */
ob_start();
$webPage->toString();
$html = ob_get_clean();

/* write the output file */
file_put_contents($outfile, $html);

echo "\nOutput written to $outfile ...\n";
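The hand-rolled argument loop above could also be replaced by PHP's built-in getopt(), which handles the short and long forms in one call. A sketch, using a hypothetical opt() helper; the $opts array is hard-coded here for illustration where the real script would call getopt():

```php
<?php
// Pick the first defined key out of an options array (as returned by getopt()).
function opt($opts, $short, $long){
    if(isset($opts[$short])) return $opts[$short];
    if(isset($opts[$long]))  return $opts[$long];
    return null;
}

// In the real script: $opts = getopt('u:c:o:', array('url:', 'caption:', 'outfile:'));
// 'u:' means -u takes a value; 'url:' is the matching long form --url.
$opts = array('url' => 'http://example.com/', 'c' => 'Home', 'o' => 'map.html'); // sample values

$url     = opt($opts, 'u', 'url');
$caption = opt($opts, 'c', 'caption');
$outfile = opt($opts, 'o', 'outfile');

if($url === null || $caption === null || $outfile === null){
    echo "usage: spider.php -u <url> -c <top caption> -o <outfile>\n";
    exit(1);
}
```

Unlike the positional loop, getopt() does not care about argument order and silently ignores options it does not know, so the explicit null check replaces the default/showUsage branch.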
 

The pretty formatting may break certain punctuation, and copying from the page may not work; hence the whole original code is attached for easy download: A spider in php (640)
