
Do-It-Yourself PHP Tool for Scraping and Grabbing Web Content

Hi folks. Today we will learn how to scrape data from websites using just a few lines of code with PHP.

First, a couple of disclaimers:

Using this method you can parse almost any HTML website. However, that is not always the best idea. For example, if you want to get data from YouTube, Twitter, or a similar service, you should first check whether it offers an API for this sort of thing.

If you want to learn more about modern APIs, you can check this guide on Codecademy:
www.codecademy.com/en/tracks/youtube

or read more on Wikipedia:
en.wikipedia.org/wiki/Application_programming_interface

Since we are using PHP, you’ll need a web server to run the script. If you don’t have one, you can install a local server for free.

To scrape a webpage for data, we basically need to know two things: the address of the webpage we want to look into and the specific part of it we want to grab.
If you are not familiar with the DOM, you can find out more about it on Wikipedia.

The DOM (Document Object Model) is a representation of an HTML page as a set of elements organized in a tree structure.

It looks like this:

PAGE:
Element a
Element b
      Element b1
      Element b2
Element c
Element d

So, the logic of our code looks like this: we need to grab the page by its address and then get the value of Element b2, which is a child of Element b.
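To make the tree idea a bit more concrete, here is a minimal sketch that uses PHP's built-in DOM extension (DOMDocument and DOMXPath) rather than Goutte. The sample HTML and the ids a, b, b1, b2 are made up for illustration and simply mirror the tree above.

```php
<?php
// Hypothetical sample page mirroring the tree above; the ids are invented for this example.
$html = <<<HTML
<html><body>
  <div id="a">Element a</div>
  <div id="b">
    <span id="b1">Element b1</span>
    <span id="b2">Element b2</span>
  </div>
  <div id="c">Element c</div>
  <div id="d">Element d</div>
</body></html>
HTML;

// Parse the HTML string into a DOM tree.
$dom = new DOMDocument();
$dom->loadHTML($html);

// Select the element with id="b2" that is a direct child of the element with id="b".
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//*[@id="b"]/*[@id="b2"]');

// Read the text value of the first match, or null if nothing matched.
$value = $nodes->length > 0 ? trim($nodes->item(0)->textContent) : null;
echo $value, PHP_EOL;
```

On a real site you would fetch the HTML from the page's address instead of using a hard-coded string, but the idea of walking from a parent element down to a child is the same.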

In this article we will scrape a website called Product Hunt:
www.producthunt.com

If we look at this website, we’ll see that it contains ideas for products for three days (today, yesterday, the day before yesterday). Our task is to get three things from today’s list of ideas: idea name, idea tagline and the link to the product.

Let’s start coding. We will be using the Goutte library, which you can download from GitHub.

Simply put the .phar file in your script folder and copy the script below:

<?php
// Load the library. Alternatively, you can install Goutte with Composer.
require_once 'goutte.phar';

use Goutte\Client;

$client = new Client();

// Fetch the page and store it in a $crawler object.
$crawler = $client->request('GET', 'http://www.producthunt.com');

// The page has 3 divs with class="day".
// We need to grab the first one (it holds today's ideas).
$today = $crawler->filter('.day')->first();

// Each day has 10 ideas. We need to grab each one's title, tagline, and link.
$today->filter('li>div')->each(function ($node) {
    // $title = text value of the DOM element with class="post-url"
    $title = $node->filter('.post-url')->text();
    $tagline = $node->filter('.post-tagline')->text();
    $link = 'http://www.producthunt.com' . $node->filter('.post-url')->attr('href');

    // Print the results.
    echo $title;
    echo "|";
    echo $tagline;
    echo "|";
    echo $link;
    echo "<br>";
});
?>
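If you want to try the same selection logic without hitting the live site, the following is a rough offline sketch. It is not the Goutte script itself: it swaps in PHP's built-in DOMDocument and DOMXPath, and the sample markup, the hasClass() helper, and the extractIdeas() function are all hypothetical names made up for this example. The class names (day, post-url, post-tagline) are taken from the script above.

```php
<?php
// Hypothetical markup imitating the structure the script expects; not real Product Hunt HTML.
$html = <<<HTML
<html><body>
<div class="day"><ul>
  <li><div><a class="post-url" href="/posts/alpha">Alpha</a><span class="post-tagline">First idea</span></div></li>
  <li><div><a class="post-url" href="/posts/beta">Beta</a><span class="post-tagline">Second idea</span></div></li>
</ul></div>
<div class="day"><ul>
  <li><div><a class="post-url" href="/posts/old">Old</a><span class="post-tagline">Yesterday's idea</span></div></li>
</ul></div>
</body></html>
HTML;

// Builds an XPath test for "has this CSS class" (XPath 1.0 has no class selector).
function hasClass(string $class): string
{
    return "contains(concat(' ', normalize-space(@class), ' '), ' $class ')";
}

// Returns title, tagline, and link for every idea in the first "day" block.
function extractIdeas(string $html, string $baseUrl): array
{
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $xpath = new DOMXPath($dom);

    // Rough equivalent of filter('.day')->first()
    $day = $xpath->query('//*[' . hasClass('day') . ']')->item(0);
    if ($day === null) {
        return [];
    }

    $ideas = [];
    // Rough equivalent of filter('li>div')
    foreach ($xpath->query('.//li/div', $day) as $node) {
        $link = $xpath->query('.//*[' . hasClass('post-url') . ']', $node)->item(0);
        $tagline = $xpath->query('.//*[' . hasClass('post-tagline') . ']', $node)->item(0);
        if ($link === null || $tagline === null) {
            continue; // skip entries that do not match the expected layout
        }
        $ideas[] = [
            'title'   => trim($link->textContent),
            'tagline' => trim($tagline->textContent),
            'link'    => $baseUrl . $link->getAttribute('href'),
        ];
    }
    return $ideas;
}

$ideas = extractIdeas($html, 'http://www.producthunt.com');
foreach ($ideas as $idea) {
    echo implode('|', $idea), PHP_EOL;
}
```

Returning an array instead of echoing inside the loop keeps the extraction separate from the output, so the same data could be printed, saved to a file, or stored in a database. The null checks also mean that one entry with an unexpected layout is skipped instead of stopping the whole run.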

This is the result we are getting:

[Image: web scraping PHP script result]

So, as you can see, it’s a very simple process, one that you can follow even if you are not a programmer. Was this article useful? Maybe you have questions? Let us know in the comments section below.

P.S. To better understand how to access DOM elements and their values, check the Symfony documentation on symfony.com.