Webbots and Spiders 101: Extracting anchor text with PHP and lib_http

Lately I’ve been playing around with the LIB_http and LIB_parse libraries from Michael Schrenk’s fine guide to PHP web automation, Webbots, Spiders, and Screen Scrapers. Good stuff, except for one omission. While extracting the href attribute from anchors is a snap with the LIB_parse library, there isn’t a straightforward routine for extracting the anchor text. This matters, since many search engine ranking algorithms give considerable weight to the anchor text of links pointing to a site when ranking that site. So here’s a code snippet to accomplish that task.

First, given this page t.html on localhost:



<html>
<head>
<title>This is a Test Page</title>
<meta name="author" content="Joe Doakes">
<meta name="description" content="A test page for parsing">
</head>
<body>
<h1>First Big Header </h1>
<p>The first paragraph of random content</p>
<a href="http://www.intel.com">Intel Home Page</a>
<a href="http://www.amd.com">AMD Home Page</a>
<h1>Second Big Header</h1>
<p>The second paragraph of random content</p>
</body>
</html>

Now here’s a snippet to extract the anchors and anchor text:



<?php
require("LIB_parse.php");
require("LIB_http.php");

$target  = "http://localhost/t.html";
$referer = "localhost";

// Download the page; the HTML ends up in $web_page['FILE']
$web_page = http_get($target, $referer);

// Collect every <a ...>...</a> element on the page
$link_array = parse_array($web_page['FILE'], "<a", "</a>");

printf("Links and Anchor Text:\n\n");
for ($i = 0; $i < count($link_array); $i++) {
    // First pull out the href from the opening tag
    $tmp = return_between($link_array[$i], "<a", ">", INCL);
    $url = get_attribute($tmp, "href");
    // Next the anchor text: everything between the opening tag's ">" and "</a>"
    $anchorText = return_between($link_array[$i], ">", "</a>", EXCL);
    printf("%s - %s\n", $url, $anchorText);
}
?>

Piece of cake.
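
One caveat: the snippet above returns whatever sits between the opening tag and the closing </a>, so an anchor like <a href="..."><b>AMD</b> Home Page</a> comes back with the <b> tags still in it. If that bothers you, here’s a rough sketch of the same idea using PHP’s built-in DOM extension instead of LIB_parse. To be clear, this is not from Schrenk’s libraries; extract_anchors() is a helper name I made up for illustration, and the sketch assumes the DOM extension is enabled (it usually is).

```php
<?php
// Hypothetical helper (not part of LIB_parse): returns one
// ['url' => ..., 'text' => ...] entry per <a> element found in $html.
function extract_anchors(string $html): array
{
    $doc = new DOMDocument();

    // Real-world markup is rarely clean, so collect parser warnings
    // internally instead of printing them, then restore the old setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $anchors = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        $anchors[] = [
            'url'  => $a->getAttribute('href'),
            // textContent keeps only the text, dropping nested tags like <b>
            'text' => trim($a->textContent),
        ];
    }
    return $anchors;
}

// A small inline sample so the sketch runs without a web server.
$html = '<html><body>'
      . '<a href="http://www.intel.com">Intel Home Page</a>'
      . '<a href="http://www.amd.com"><b>AMD</b> Home Page</a>'
      . '</body></html>';

foreach (extract_anchors($html) as $link) {
    printf("%s - %s\n", $link['url'], $link['text']);
}
```

To run it against the test page instead of the inline string, you could pass in $web_page['FILE'] from http_get() in place of $html.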