Scraping Links With PHP

MerchantOS, the web’s fastest growing point of sale software company, is proud to bring you this article.

In this tutorial you will learn how to build a PHP script that scrapes links from any web page.

What You’ll Learn

  1. How to use cURL to get the content from a website (URL).
  2. Call PHP DOM functions to parse the HTML so you can extract links.
  3. Use XPath to grab links from specific parts of a page.
  4. Store the scraped links in a MySQL database.
  5. Put it all together into a link scraper.
  6. What else you could use a scraper for.
  7. Legal issues associated with scraping content.

What You Will Need

  • Basic knowledge of PHP and MySQL.
  • A web server running PHP 5.
  • The cURL extension for PHP.
  • MySQL – if you want to store the links.

Get The Page Content

cURL is a great tool for making requests to remote servers in PHP. It can imitate a browser in pretty much every way. Here’s the code to grab our target site content:

$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL,$target_url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html = curl_exec($ch);
if ($html === false) {
	echo "<br />cURL error number:" .curl_errno($ch);
	echo "<br />cURL error:" . curl_error($ch);
	exit;
}

If the request is successful $html will be filled with the content of $target_url. If the call fails then we’ll see an error message about the failure.

curl_setopt($ch, CURLOPT_URL,$target_url);

This line determines what URL will be requested. For example, if you wanted to scrape this site you’d have $target_url = "/makebeta/". I won’t go into the rest of the options that are set (except for CURLOPT_USERAGENT – see below). You can read an in-depth tutorial on PHP and cURL here.
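
If you end up making this request in more than one place, the options can be bundled into a small helper. This is only a sketch: the function names buildCurlOptions() and fetchPage() are made up for illustration, and the option values simply mirror the ones used above.

```php
<?php
// Hypothetical helper: collects the cURL options used in this tutorial.
function buildCurlOptions($targetUrl, $userAgent) {
    return array(
        CURLOPT_USERAGENT      => $userAgent,
        CURLOPT_URL            => $targetUrl,
        CURLOPT_FAILONERROR    => true,  // treat HTTP status codes >= 400 as failures
        CURLOPT_FOLLOWLOCATION => true,  // follow redirects
        CURLOPT_AUTOREFERER    => true,  // set the Referer header when redirected
        CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
        CURLOPT_TIMEOUT        => 10,    // give up after 10 seconds
    );
}

// Hypothetical helper: returns the page body, or false if the request failed.
function fetchPage($targetUrl, $userAgent) {
    $ch = curl_init();
    curl_setopt_array($ch, buildCurlOptions($targetUrl, $userAgent));
    $html = curl_exec($ch);
    if ($html === false) {
        error_log('cURL error ' . curl_errno($ch) . ': ' . curl_error($ch));
    }
    return $html;
}
```

You would then call it with something like $html = fetchPage($target_url, $userAgent); and check the result against false before parsing.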

Tip: Fake Your User Agent

Many websites won’t play nice with you if you come knocking with the wrong User Agent string. What’s a User Agent string? It’s part of every request to a web server that tells it what type of agent (browser, spider, etc) is requesting the content. Some websites will give you different content depending on the user agent, so you might want to experiment. You do this in cURL with a call to curl_setopt() with CURLOPT_USERAGENT as the option:

$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);

This would set cURL’s user agent to mimic Google’s. You can find a comprehensive list of user agents here: User Agents.

Common User Agents

I’ve done a bit of the leg work for you and gathered the most common user agents:

Search Engine User Agents

  • Google – Googlebot/2.1 (+http://www.googlebot.com/bot.html)
  • Google Image – Googlebot-Image/1.0 (+http://www.googlebot.com/bot.html)
  • MSN Live – msnbot-Products/1.0 (+http://search.msn.com/msnbot.htm)
  • Yahoo – Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
  • ask

Browser User Agents

  • Firefox (WindowsXP) – Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6
  • IE 7 – Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30)
  • IE 6 – Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)
  • Safari – Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en) AppleWebKit/522.11 (KHTML, like Gecko) Safari/3.0.2
  • Opera – Opera/9.00 (Windows NT 5.1; U; en)
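
If you want to experiment with several of these, one option is to keep them in an array and pick one by name. A minimal sketch, assuming you only care about a handful of the strings listed above; the short labels and the pickUserAgent() helper are invented for this example.

```php
<?php
// A few of the user agent strings listed above, keyed by made-up labels.
$userAgents = array(
    'msnbot'  => 'msnbot-Products/1.0 (+http://search.msn.com/msnbot.htm)',
    'firefox' => 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6',
    'ie6'     => 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)',
    'opera'   => 'Opera/9.00 (Windows NT 5.1; U; en)',
);

// Hypothetical helper: returns the string for a label, or the default one
// when the label is unknown.
function pickUserAgent(array $userAgents, $label, $default = 'firefox') {
    return isset($userAgents[$label]) ? $userAgents[$label] : $userAgents[$default];
}

$userAgent = pickUserAgent($userAgents, 'opera');
echo $userAgent, "\n";
```

The resulting $userAgent can be passed straight to curl_setopt() with CURLOPT_USERAGENT.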

Using PHP’s DOM Functions To Parse The HTML

PHP provides a really cool tool for working with HTML content: DOM Functions. The DOM Functions allow you to parse HTML (or XML) into an object structure (or DOM – Document Object Model). Let’s see how we do it:

$dom = new DOMDocument();
@$dom->loadHTML($html);

Wow, is it really that easy? Yes! Now we have a nice DOMDocument object that we can use to access everything within the HTML in a nice clean way. I discovered this over at Russell Beattie’s post on: Using PHP To Scrape Sites As Feeds, thanks Russell!

Tip: You may have noticed I put @ in front of loadHTML(). This suppresses some annoying warnings that the HTML parser throws on many pages that have non-standards-compliant code.
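
If you would rather not silence everything with @, one alternative is libxml_use_internal_errors(), which tells libxml to collect the parser warnings so you can look at them yourself. A minimal sketch; the sample HTML is invented for the example.

```php
<?php
// Deliberately sloppy sample markup (made up for this example).
$html = '<html><body><p>Unclosed paragraph <a href="/one">One</a></body></html>';

// Collect libxml parser warnings internally instead of printing them.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

// Log whatever the parser complained about, then clear the buffer.
foreach (libxml_get_errors() as $error) {
    error_log('HTML parser: ' . trim($error->message) . ' (line ' . $error->line . ')');
}
libxml_clear_errors();

// Restore the previous error-handling setting.
libxml_use_internal_errors($previous);

echo $dom->getElementsByTagName('a')->length, " link(s) parsed\n";
```

This is a bit more typing than @, but it keeps the warnings available if a page parses in a way you did not expect.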

XPath Makes Getting The Links You Want Easy

Now for the real magic of the DOM: XPath! XPath allows you to gather collections of DOM nodes (otherwise known as tags in HTML). Say you want to only get links that are within unordered lists. All you have to do is write a query like "/html/body//ul//li//a" and pass it to the evaluate() method of a DOMXPath object. I’m not going to go into all the ways you can use XPath because I’m just learning myself and someone else has already made a great list of examples: XPath Examples. Here’s a code snippet that will just get every link on the page using XPath:

$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
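
To get a feel for how the query changes what comes back, here’s a small sketch that runs three queries against the same page. The sample HTML is made up, and the third query (only <a> tags that actually have an href) is my own addition rather than something used elsewhere in this tutorial.

```php
<?php
// Made-up sample page: two links inside a list, one outside, one named anchor.
$html = '<html><body>'
      . '<ul><li><a href="/in-list-1">A</a></li><li><a href="/in-list-2">B</a></li></ul>'
      . '<p><a href="/outside">C</a> <a name="top">no href</a></p>'
      . '</body></html>';

$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$allLinks  = $xpath->evaluate("/html/body//a");          // every <a> in the body
$listLinks = $xpath->evaluate("/html/body//ul//li//a");  // only <a> inside list items
$withHref  = $xpath->evaluate("/html/body//a[@href]");   // only <a> that have an href

echo "all: {$allLinks->length}, in lists: {$listLinks->length}, with href: {$withHref->length}\n";
```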

Next we’ll iterate through all the links we’ve gathered using XPath and store them in a database. First the code to iterate through the links:

for ($i = 0; $i < $hrefs->length; $i++) {
	$href = $hrefs->item($i);
	$url = $href->getAttribute('href');
	storeLink($url,$target_url);
}

$hrefs is an object of type DOMNodeList and item() is a function that returns a DOMNode object for the specified index. The index can be anywhere from 0 to $hrefs->length - 1. So we’ve got a loop that retrieves each link as a DOMNode object.

$url = $href->getAttribute('href');

Each node matched by our query is a DOMElement, which is a subclass of DOMNode, and getAttribute() is defined on DOMElement. getAttribute() returns the value of the named attribute of the node (in this case the href attribute of an <a> tag). Now we’ve got our URL and we can store it in the database.
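
If you also want the text of each link, or you just want the results in a plain array, the same loop can be sketched like this. The sample HTML and the shape of the $links array are assumptions made for the example.

```php
<?php
// Made-up sample page with two real links and one named anchor.
$html = '<html><body>'
      . '<a href="/first">First <b>link</b></a>'
      . '<a name="top">anchor only</a>'
      . '<a href="/second">Second</a>'
      . '</body></html>';

$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$links = array();
foreach ($xpath->evaluate("/html/body//a") as $href) {
    // Skip <a> tags without an href (named anchors, for example).
    if (!$href->hasAttribute('href')) {
        continue;
    }
    $links[] = array(
        'url'  => $href->getAttribute('href'),
        'text' => trim($href->textContent), // link text with nested tags stripped
    );
}

print_r($links);
```

A DOMNodeList can be walked with foreach, so there is no need to track the index by hand here.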

We’ll want a database table that looks something like this:

CREATE TABLE `links` (
`url` TEXT NOT NULL ,
`gathered_from` TEXT NOT NULL ,
`time_stamp` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);

We’ll need a storeLink() function to put the links in the database. I’ll assume you know the basics of how to connect to a database (if not, grab a MySQL & PHP tutorial here).

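function storeLink($url,$gathered_from) {
	// assumes $db is an open mysqli connection created elsewhere
	global $db;
	// placeholders keep quotes in a scraped URL from breaking the query
	$stmt = $db->prepare("INSERT INTO links (url, gathered_from) VALUES (?, ?)");
	$stmt->bind_param('ss', $url, $gathered_from);
	$stmt->execute() or die('Error, insert query failed');
}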
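
If PDO is available to you, the insert can also be sketched with a prepared statement there. To keep the example self-contained it uses an in-memory SQLite database as a stand-in for MySQL, which is purely an assumption for the demo; for MySQL you would swap in a mysql: DSN and your own credentials. The name storeLinkPdo() is made up so it doesn’t clash with storeLink().

```php
<?php
// Assumption: PDO with the SQLite driver is installed. The in-memory
// database stands in for MySQL so the sketch runs on its own.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec("CREATE TABLE links (
    url TEXT NOT NULL,
    gathered_from TEXT NOT NULL,
    time_stamp TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
)");

// Hypothetical PDO variant of storeLink(). The ? placeholders mean a quote
// inside a scraped URL cannot break or alter the query.
function storeLinkPdo(PDO $pdo, $url, $gathered_from) {
    $stmt = $pdo->prepare("INSERT INTO links (url, gathered_from) VALUES (?, ?)");
    $stmt->execute(array($url, $gathered_from));
}

storeLinkPdo($pdo, "/it's-a-link", 'http://example.com/');
storeLinkPdo($pdo, '/about', 'http://example.com/');

$count = $pdo->query("SELECT COUNT(*) FROM links")->fetchColumn();
echo "Stored $count links\n";
```

Passing the connection in as an argument, rather than relying on a global, also makes the function easier to try out against a throwaway database like this one.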

Your Completed Link Scraper

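function storeLink($url,$gathered_from) {
	// assumes $db is an open mysqli connection created elsewhere
	global $db;
	// placeholders keep quotes in a scraped URL from breaking the query
	$stmt = $db->prepare("INSERT INTO links (url, gathered_from) VALUES (?, ?)");
	$stmt->bind_param('ss', $url, $gathered_from);
	$stmt->execute() or die('Error, insert query failed');
}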

$target_url = "//";
$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';

// make the cURL request to $target_url
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL,$target_url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html= curl_exec($ch);
if ($html === false) {
	echo "<br />cURL error number:" .curl_errno($ch);
	echo "<br />cURL error:" . curl_error($ch);
	exit;
}

// parse the html into a DOMDocument
$dom = new DOMDocument();
@$dom->loadHTML($html);

// grab all the links on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");

for ($i = 0; $i < $hrefs->length; $i++) {
	$href = $hrefs->item($i);
	$url = $href->getAttribute('href');
	storeLink($url,$target_url);
	echo "<br />Link stored: $url";
}

What Else Could I Do With This Thing?

The possibilities are limitless. For starters, you might want to store a list of sites that you want scraped in a database and then set up the script so it runs on a regular basis to scrape those sites. You could then compare the link structure over time or maybe republish the links in some sort of directory. Leave a comment below and say what you’re using this script for.
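
As a rough illustration of the “compare the link structure over time” idea, here’s a minimal sketch that diffs two snapshots of scraped links. The diffLinks() helper and the sample URLs are invented for the example; in practice the two arrays would come from separate runs of the scraper.

```php
<?php
// Hypothetical helper: compares two snapshots (arrays of URLs) taken on
// different runs and reports what appeared and what disappeared.
function diffLinks(array $previous, array $current) {
    return array(
        'added'   => array_values(array_diff($current, $previous)),
        'removed' => array_values(array_diff($previous, $current)),
    );
}

// Made-up snapshots standing in for two scraper runs.
$lastWeek = array('/about', '/contact', '/blog/old-post');
$today    = array('/about', '/contact', '/blog/new-post');

$changes = diffLinks($lastWeek, $today);
print_r($changes);
```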

Is Scraping Legal?

There is no easy answer to this question. Many organizations scrape content from all over the web – Google, Yahoo, Microsoft, and many others. These companies get away with it under fair use and because site owners want to be included in the search results. However, there have been copyright infringement rulings against these companies.

The real answer is that it depends who you scrape and what you do with the content. Basic copyright law gives authors an automatic copyright on everything they create. But the same law permits fair use of copyrighted material. Fair use includes: criticism, comment, news reporting, teaching (including multiple copies for classroom use), scholarship, or research. But even these uses could be considered copyright infringement in some circumstances. So be careful before you claim “fair use” as your defense!

Here are a couple of sites that have granted you the right to use their content. They do require you to attribute the content to the author or the URL you scraped it from:

  • Wikipedia – GNU Free Documentation License
  • Open Directory Project – Open Directory License
  • Creative Commons – Creative Commons Attribution 3.0

    Many sites publish their content under some form of the Creative Commons license. You can search for creative commons licensed works here: Creative Commons Search. Remember that it’s your responsibility to verify the copyright rules for anything you use, even stuff found using the Creative Commons Search.

148 thoughts on “Scraping Links With PHP”

  2. @naden —

    after testing ur two zipped files,i am getting 1y blank screen.

    if u dont mind can you tell me the exact problem.
    i have to scrap all the projects and post from scriptlance site.

    any idea??
    Thank you

  4. Hi Justin, Thx for this good stuff

    Please allow 2 short questions:

    1. my “target” looks like:
    (td width=”160″) TEXT-I-NEED(/td)
    i replaced with ()
    thanks for a short hint buddy

    2. Ajax requesting
    Your attempts/thoughts

    Thanks for your presence in the web. Best regards,

    Malvin

  5. Can anyone show me how they would do this with the url
    smotri.com/broadcast/list
    I would like to parse the live broadcast links into a plx playlist file that contains the following data per parsed link
    type=video
    name=The video’s name
    thumb=http://thumbnail_url
    URL=http://the_videos_url
    not the actual rtmp stream, but the sites own video url. Could some one show me how they would implement all this in this manner for
    smotri.com/broadcast/list
    Thanks in advance
    SSEO

  6. hi justin, great post I like this article
    I am trying to create a .exe file search engine Is there any possiblity to search exe files? which are in the webpages???

  7. How to get the text inside the a href links. Here u have shown getting the link using href attribute, can u please explain how to get the text in the which this href is linked.

  8. Is it possible to get the URL when the url on the page is a redirect-url (for counter click purposes).

    so my page is full of http//www.site.com/clickcounter.php?id=001 kind of urls, but all these urls lead to another one (via the clickcounter-script)..

    Is it possible to follow where it leads and then grab the url in some way or another ?

    Tom.

  9. how could i use this to scrape images i’ve tried the code that dave posted but nothing happens maybe someone can point me in the right direction cheers for the great tutorial

  10. Hey man, I was about to do something almost exactly like this and was surprised I hadn’t seen anyone use DOMDocument yet in the examples I had found, then I stumbled across yours and needless to say, you already did it, so thanks for saving me some time figuring it out.
    There’s two things I’m not having any luck with though …, recursively crawling the given url and accurately testing for local URLs only. I am trying to make a sitemap generator script that works by crawling the given URL, but the problem is inconsistency in how the local links are written. Example, some use the site’s URL, then others use relative paths. So the first approach was to just use the input URL with preg_match(), then check for relative paths.
    This concept worked pretty good until it came across an https:// protocol, which wasn’t part of the input address. Part of it was to also attach the input URL to any relative paths for correct sitemap values, so I’m really not sure how to correctly filter the urls at this point and still have that level of flexibility.

    If you have any ideas, give me a shout. Thanks again.

  11. As sima stated. Didnt work for me too. Doing something wrong?
    Have setup the database and connected to it. So that part is working. Please advise…

  12. hello.

    I am trying to parse the links or data from a file without using the Dom objects, Is this possible ?, yes it is possible as it is a lengthy process and need more time. The purpose for this is to run the php code from command line via servers ssh. because browser as time limit but not command line (in case we need to parse around 1000 pages throttling with time). I think dom objects don’t works in command line as it outputs context/html something like this and gives error. I will be waiting for the answer from u guys !

  13. I only parse from the command line with PHP-CLI.

    When writing scrapers or webbots in general you should organize them in two parts:
    1) The part that actually scrapes the data behind the scene running from a cron job automatically.
    2) Front end UI that allows you to set scraper configuration options and display the data that was scrape in which ever way you want.

    I am pretty sure you can use DOM because you are downloading a complete HTML file at first.

    Wade Cybertech

  14. Thanks for posting this well written tutorial.

    I’m using CentOS and I just needed to install php-xml in order to get the DOM Document portion of the code to work: yum install php-xml

    I didn’t try to save the files to a database, since I’m mainly creating a test application to test certain links on our site.

  15. Thank you for this wonderful script.

    However, is there a way of storing all the backlinks into an array, WITHOUT using a database. I’m no expert in PHP, but I’m sure there must be a way of doing this.

    I’m trying to create a tool for myself to capture all backlinks (inbound links) to particular websites and would like to be able to list these backlinks with their PR using an ARRAY (NO DATABASES). Something like http://www.linkdiagnosis.com but more simple.

    Any help will be greatly appreciated ;)

  17. hi, thanks for the blog post, has the domelement or domdocument got a method to retrieve the innerHTML of a tag like the javascript version, had a look through the docs but couldnt find anything

  24. hi,
    when i run above code snippet i get an warning like this..

    Warning: domdocument::domdocument() expects at least 1 parameter, 0 given in C:\wamp\htdocs\interface\linkscrapper.php on line 27

    ie the domdocument() class needs a parameter..

    Please somebody help me out of this…..

  25. Hi, i really think this is a great post, i´m learning PHP and i really love it, is so powerfull, right now im trying to biuld a link scraper (just for fun and for learning) i want to tell me all links in a webpage or a site, and then show them in a table and in the next column to show if it is follow or not
    How could i do this? any function to get all links from a webpage?
    thanks a los

  26. Actually it wasn’t the robots.. turns out I need to create a recursion… But I am having issues doing this… I can get to where it just submits the incorrect link but that is it… ideas or suggestions?

  27. Hello Justin,

    I copy past your code in my localhost & trying to execute it, it shows me “Fatal error: Call to undefined function curl_init()”. Please help me because I am new to php.

    Regards,
    Sandeep pattanaik

  28. Nice share but using all these functions what relay isn’t necessary is just going to confuse you no?

    Like you could cut that down alot and some links what will be scraped wont be full, Would be easy doing something like this. . .

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, "http://merchantos.com/makebeta/php/scraping-links-with-php");
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $response = curl_exec($ch);
    curl_close($ch);

    $matches = array();
    if (preg_match_all('@<a href="http://(.*?)"@', $response, $matches)) {
        foreach ($matches[1] as $link) {
            echo "http://" . $link;
        }
    }

    Thats a fully working one alot less code.
