CodeIgniter - Using cURL to Steal Remote Content

Sometimes you need to get remote data from a foreign website that hasn’t been kind enough to develop an API you can tap into. What are you gonna do? This is a tricky situation.

You could just link to the remote page, maybe in a new window. You could, but that’s not an elegant solution. What if you lose your visitor?

You could use frames, but that’s also an ugly solution.

The trick is to absorb the data by force! In this case, using CodeIgniter and cURL.

Here’s the code I use to grab the remote page. I made a sample controller that connects to the remote page, grabs the remote HTML and prints it to the screen:

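<?php if (!defined('BASEPATH')) exit('No direct script access allowed.');

// CodeIgniter 1.x style; on 2.x and later, extend CI_Controller
// and call parent::__construct() instead.
class Example extends Controller {

	function __construct()
	{
		parent::Controller();
	}

	function index()
	{
		$url = 'http://remote-site.com/the-page.html';

		$curl = curl_init();
		curl_setopt($curl, CURLOPT_URL, $url);
		// Return the response as a string instead of printing it directly
		curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
		$html = curl_exec($curl);
		curl_close($curl);

		// curl_exec() returns false when the request fails
		if ($html === false) {
			show_error('Could not fetch the remote page.');
		}

		echo $html;
	}
}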
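
If you end up grabbing more than one page, the cURL calls can be pulled out into a small helper. What follows is only a sketch: the function name fetch_remote_html() and the timeout values are assumptions for illustration, not part of CodeIgniter or of the controller above. It returns null on any failure so the caller can decide what to show instead.

```php
<?php
// Hypothetical helper (not part of CodeIgniter): fetch a URL and return
// its body as a string, or null if the request fails.
function fetch_remote_html($url, $timeout = 10)
{
	if (!is_string($url) || $url === '') {
		return null;
	}

	$curl = curl_init();
	curl_setopt($curl, CURLOPT_URL, $url);
	curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);  // return the body instead of printing it
	curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);  // follow HTTP redirects
	curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 5);     // seconds allowed for connecting (assumed value)
	curl_setopt($curl, CURLOPT_TIMEOUT, $timeout);     // seconds allowed for the whole request
	curl_setopt($curl, CURLOPT_FAILONERROR, true);     // treat HTTP status codes >= 400 as failures

	$html = curl_exec($curl);
	curl_close($curl);

	return ($html === false) ? null : $html;
}
```

Inside the controller you would call it as $html = fetch_remote_html($url); and bail out or show a fallback message when it comes back null.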

The problem we now face is that the entire page has been sent, not just the portion we want. Time for some surgery.

The trouble with this type of operation is that it’s a bit of a pain in the ass to do, especially if the remote HTML is filled with markup errors, inline styles, font tags and the like.

The first thing we have to do to perform surgery is turn the entire HTML document into a single-line string so we can perform operations on it. The PHP function str_replace() can help us with that. Add this code to the index() function above, right after the page is fetched:

// Turn line breaks, returns and tabs into spaces so neighboring words don't run together
$html = str_replace(array("\n", "\r", "\t"), ' ', $html);
// Then collapse any run of spaces into a single space
$html = preg_replace('/ {2,}/', ' ', $html);

What we’ve done here is remove extra whitespace, line breaks (not HTML break tags), tabs and returns from the remote HTML. Now we can really dig into the HTML to perform the extraction. I have no idea what remote HTML is going to be your patient, but your basic routines will be similar to the following examples.

Removing inline styles and other attributes from an element (here, the opening body tag):

$html = preg_replace('/<body[^>]*>/i', '<body>', $html);

An alternate option, removing one specific attribute wherever it appears:

$html = str_replace(' align="center"', '', $html);
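
Once the markup is cleaned up, the extraction itself can be as simple as cutting out everything between two known markers in the remote HTML. Here’s a minimal sketch of that idea. The function name extract_between() and the marker strings in the sample are made up for illustration, so swap in whatever opening and closing markup surrounds the content on your remote page.

```php
<?php
// Hypothetical helper: return the text found between $start and $end,
// or null if either marker is missing.
function extract_between($html, $start, $end)
{
	$from = strpos($html, $start);
	if ($from === false) {
		return null;
	}
	// Move past the opening marker itself
	$from += strlen($start);

	// Look for the closing marker only after the opening one
	$to = strpos($html, $end, $from);
	if ($to === false) {
		return null;
	}

	return substr($html, $from, $to - $from);
}

// Example markers; yours will depend on the remote page.
$sample = '<body><div id="content"><p>Hello</p></div><!-- end content --></body>';
$content = extract_between($sample, '<div id="content">', '</div><!-- end content -->');
// $content now holds '<p>Hello</p>'
```

Plain string functions are used here rather than a regular expression because the markers are fixed strings, and a missing marker is easy to detect and handle.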

I think you get the basic idea. Other typical operations include fixing URLs. This is a powerful tool to use when helping clients integrate foreign web information.

Updated 3/8/2011 - I received a couple of requests about changing URLs in the remote HTML we’re stealing:

Example: Making a local path a full path for remote images etc.

// # is the delimiter here so the slash in the path doesn't end the pattern early
$html = preg_replace('#images/email_friend\.gif#', 'http://remotewebsite.com/images/email_friend.gif', $html);

Example: The other way around

$html = str_replace('http://remotewebsite.com/images/email_friend.gif', '/images/email_friend.gif', $html);
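
If the remote page has a lot of relative paths, replacing them one at a time gets tedious. A rough sketch of a more general approach is to prefix every root-relative src and href value with the remote site’s address in one pass. The function name absolutize_urls() is hypothetical, and the pattern assumes double-quoted attributes whose paths start with a single slash, which won’t hold for every page.

```php
<?php
// Hypothetical helper: turn root-relative src/href values ("/images/x.gif")
// into full URLs on $base. Only handles double-quoted attributes.
function absolutize_urls($html, $base)
{
	$base = rtrim($base, '/');

	// (?!/) skips protocol-relative URLs such as "//cdn.example.com/x.js"
	return preg_replace_callback(
		'#\b(src|href)="/(?!/)([^"]*)"#i',
		function ($m) use ($base) {
			return $m[1] . '="' . $base . '/' . $m[2] . '"';
		},
		$html
	);
}

$page = '<img src="/images/email_friend.gif"> <a href="http://example.com/">x</a>';
$page = absolutize_urls($page, 'http://remotewebsite.com');
// The image path now points at http://remotewebsite.com/images/email_friend.gif;
// the link that was already a full URL is left alone.
```

A callback is used instead of a plain replacement string so that any dollar signs or backslashes in the base address are inserted literally.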

About Joseph R. B. Taylor

Joseph R. B. Taylor is a humble designer/developer who makes stuff for screens of all shapes and sizes. He is currently the lead UI/UX Architect at MScience, LLC, where he works to create simple experiences on top of large rich datasets for their customers and clients.