Sunday, April 20, 2014

How to configure Grails and Geb as a web scraping tool, versus the PHP Simple HTML DOM library

I came across Geb while looking for a smart way to scrape web pages with Grails. Geb is a browser automation tool.

This definition is really important: Geb is not a library to parse HTML, it is a browser automation tool. What does this mean? It means that it will launch an external browser and execute all the operations you coded against the browser pages, just as a human would.

So it will load pages, populate form fields with text, simulate clicks, and do everything you would do by hand. You will also be able to watch the browser while it performs all the operations automatically, because Geb runs an external browser.

It is a very sophisticated framework, but I think it should be used primarily as a test automation tool. You can run tests with different browsers such as Internet Explorer, Chrome, or Safari, and query for visual properties like a div's height or width. Obviously you can also query the page for specific elements (and this is the part where you may want to use it as a scraping tool).

That is the premise I wanted to share. Geb is a powerful tool, but I think it is really useful for automated tests rather than for scraping, because it seems too complex for common batch operations like scraping when compared to an easy-to-use HTML parser library such as Simple HTML DOM.

But if you do want to use Geb for scraping, here are my two cents on how to configure it with Grails versions above 2.3.5 (which I think is worthwhile, because a lot of the good material out there is outdated).

Pay attention: I'm sure this is not the best way to configure it, but it took me a while to make it run because the docs are not clear about how to configure Geb with Grails. This is just a basic configuration that lets you run it and experiment with it.
There is a Geb plugin for Grails, but it seemed useless to me.

So, here is my recipe.

1) Download ChromeDriver and save it locally.

2) Create a GebConfig.groovy file under the conf folder with this code:

import org.openqa.selenium.chrome.ChromeDriver

driver = {
    System.setProperty('webdriver.chrome.driver', 'path/to/your/downloaded/chromedriver')
    new ChromeDriver()
}
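GebConfig also supports per-environment driver blocks, selected at runtime with the `geb.env` system property, which avoids hard-coding a single driver. A minimal sketch (the driver path and the Firefox fallback are my assumptions, adjust them to your setup):

```groovy
import org.openqa.selenium.chrome.ChromeDriver
import org.openqa.selenium.firefox.FirefoxDriver

// Select one of these with -Dgeb.env=chrome or -Dgeb.env=firefox
environments {
    chrome {
        driver = {
            System.setProperty('webdriver.chrome.driver', '/path/to/chromedriver')
            new ChromeDriver()
        }
    }
    firefox {
        // Firefox needs no external driver binary with these Selenium versions
        driver = { new FirefoxDriver() }
    }
}
```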

3) In BuildConfig.groovy, add this repository to the repositories section:

mavenRepo "https://oss.sonatype.org/content/repositories/releases/"

4) Add these under dependencies:

compile "org.gebish:geb-core:0.9.2"
compile "org.seleniumhq.selenium:selenium-support:2.26.0"
compile "org.seleniumhq.selenium:selenium-chrome-driver:2.31.0"

5) Now you can create a controller to test Geb. For example:

package gt

import geb.Browser

class GebController {

    def index() {
        Browser.drive {
            go "http://google.com/ncr"

            // make sure we actually got to the page
            assert title == "Google"

            // enter wikipedia into the search field
            $("input", name: "q").value("wikipedia")

            // wait for the change to results page to happen
            // (google updates the page dynamically without a new request)
            waitFor { title.endsWith("Google Search") }

            // is the first link to wikipedia?
            def firstLink = $("li.g", 0).find("a")
            assert firstLink.text().contains("Wikipedia")

            // click the link
            firstLink.click()

            // wait for Google's javascript to redirect to Wikipedia
            waitFor { title.startsWith("Wikipedia") }
        }

        render "OK"
    }
}

6) Hit the controller action: you will see a new Chrome window open and execute all the coded operations.
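If you use Geb for scraping rather than testing, the same Browser.drive block can collect data instead of asserting on it. A hypothetical sketch (the URL and the CSS selectors are made up, adapt them to the page you target):

```groovy
import geb.Browser

// Hypothetical: collect the text and href of every product link on a listing page
def results = []
Browser.drive {
    go "http://example.com/products"

    // iterating a Navigator yields one Navigator per matched element
    $("li.product a").each { link ->
        results << [title: link.text(), url: link.attr("href")]
    }
}
println results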


PHP and simple html dom parsing library

I will not write much about how to use the PHP library, because it is super simple to understand. But how do you integrate PHP with Grails?

Well, starting an external process from Groovy is really simple:

def process = command.execute();
process.waitForOrKill(MY_TIMEOUT);

Where command is a string. This will start the process and kill it after a fixed timeout.
It is then really easy to run an external PHP process that executes the scraping operations using the Simple HTML DOM library.

Then it is easy to capture the output of the PHP process and parse it.
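Putting the two steps together, here is a minimal Groovy sketch of launching a PHP script and parsing its output (the script name scrape.php and its JSON output format are my assumptions):

```groovy
import groovy.json.JsonSlurper

def out = new StringBuilder()
def err = new StringBuilder()

def process = "php scrape.php".execute()
process.consumeProcessOutput(out, err)  // capture stdout/stderr asynchronously
process.waitForOrKill(30000)            // kill the php process after 30 seconds

// Assuming the php script prints a JSON array of scraped items to stdout
def items = new JsonSlurper().parseText(out.toString())
items.each { println it }
```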

As an example, in a project of mine I run external PHP processes (scheduled by a Grails application) which perform the scraping and, at the end, call a Grails controller passing the result as JSON. The Grails controller persists the object and executes further operations.

I think this is a faster and cleaner approach to web scraping than using Geb.
Geb may be useful if you have to fill in a lot of forms and visit a lot of pages to perform the scraping. But if all you have to do is scrape lists (for example product lists) and navigate through pagination, then I think PHP parsing is quicker and more productive.
Now that you have read my article, I would like to show you one more thing: I've developed an app to help increase customer registrations and conversions.

You can find it at appromocodes.com

2 comments:

  1. You don't need to set the 'webdriver.chrome.driver' variable; it is sufficient that the driver installation folder is included in the PATH variable. This is better, because you can then share your project with others.

    1. You are definitely right. I experimented a little more with Geb, especially with the PhantomJS web driver. PhantomJS is a headless driver, so you can run it on a server with no graphics environment, and I found it easier to configure using this approach, having the possibility to specify different environment properties. I must say that (as expected) every driver behaves differently, and scraping AJAX-based pages becomes very hard when you are querying lists whose items change, for example when you are parsing a product-list page.
