Recently a friend asked me if it was possible to extract data from a website and display it on an Android device, rather than having a dedicated app pulling data from a server. I did a bit of research and found that with the help of Jsoup it may not be that hard at all.

Jsoup is a Java library that can manipulate HTML data from a source, be it a file, a website, a string and so on. This allows us to scrape and parse websites for the data we want quite easily and succinctly. As it is a Java library it can also be used on Android with no hassle. The best way to start with Jsoup is to use it in a simple Java program that prints the results to the console.

First download the Jsoup .jar from here. There are lots of good explanations and resources on the website so have a look around.

Make a new Java project, add a ‘lib’ folder to your project directory and copy the .jar file inside it. Hit refresh (F5) and then right click on the .jar file in the lib folder. Select ‘add to build path’. You will then see the .jar file move from lib to the referenced libraries location in your project. If you need any further assistance with that look here.

As an easy example to see how it works, I am going to grab the title from http://www.wikipedia.org using Jsoup. Create a new class and use the following code.

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class Main {

	public static void main(String[] args){

		String url = "http://www.wikipedia.org";

		Document doc;
		try {
			doc = Jsoup.connect(url).get();
		} catch (IOException e) {
			// Without a document there is nothing to print, so report the error and stop
			e.printStackTrace();
			return;
		}
		String title = doc.title();
		System.out.println("Title = " + title);
	}
}

If you run that code it should print “Title = Wikipedia”. Not much is really going on here. We pass the url string to Jsoup.connect() and then call get(), which fetches the page and returns a parsed Document for us to work with.

All we do then is use .title() on doc, which returns a string of the document’s title. If you look at the below snippet of html from Wikipedia.org you will see why this works.

<!DOCTYPE html>
<html lang="mul" dir="ltr">
<head>
<!-- Sysops: Please do not edit the main template directly; update /temp and synchronise. -->
<meta charset="utf-8">
<title>Wikipedia</title>
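
If you want to try this without hitting the network, Jsoup can also parse HTML straight from a string. The following is only a small sketch: the class name TitleFromString and the helper titleOf are names I made up for illustration, and the HTML is a cut-down copy of the snippet above.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

final class TitleFromString {

	// Parses an HTML string and returns the text of its <title> element
	// (Jsoup returns an empty string when there is no title)
	static String titleOf(String html) {
		Document doc = Jsoup.parse(html);
		return doc.title();
	}

	public static void main(String[] args) {
		// Cut-down copy of the Wikipedia head section shown above
		String html = "<!DOCTYPE html><html lang=\"mul\" dir=\"ltr\"><head>"
				+ "<meta charset=\"utf-8\"><title>Wikipedia</title></head>"
				+ "<body></body></html>";
		System.out.println("Title = " + titleOf(html));
	}
}
```

The Document you get back from Jsoup.parse() is the same type that Jsoup.connect(url).get() returns, so everything else in this post works on it in the same way.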

What if we wanted the Wikipedia logo image instead? Easy enough, all we have to do is change the last two lines of code to the below and add import org.jsoup.nodes.Element; to the imports at the top.

Element img = doc.getElementsByTag("img").first();
System.out.println(img.toString());

Run that and you should see the whole img element, including the url in its src attribute, displayed in your console.

Jsoup is searching the HTML document doc for elements with the tag “img” and then printing that Element to the console using .toString(). Also notice I used .first() because I only wanted the first image. Take it out (you will need to change the variable's type to Elements, from org.jsoup.select) and see how many you get now!

Again this makes more sense if you look at the HTML for the wiki page.

<h1 class="central-textlogo" style="font-variant: small-caps;">
<img src="//upload.wikimedia.org/wikipedia/meta/6/6d/Wikipedia_wordmark_1x.png" srcset="//upload.wikimedia.org/wikipedia/meta/a/a9/Wikipedia_wordmark_1.5x.png 1.5x, //upload.wikimedia.org/wikipedia/meta/8/8a/Wikipedia_wordmark_2x.png 2x" width="174" height="30" alt="WikipediA" title="Wikipedia">
</h1>
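
If you only want the address of the image rather than the whole element, you can read the src attribute directly. This is a rough sketch under a few assumptions: the class name ImageSource and the helper firstImage are invented for the example, the HTML is a shortened copy of the snippet above, and I pass https://www.wikipedia.org/ as the base address so that Jsoup can turn the relative src into a full url.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

final class ImageSource {

	// Returns the first img element in the HTML, or null if there is none.
	// baseUri is the address relative links are resolved against.
	static Element firstImage(String html, String baseUri) {
		Document doc = Jsoup.parse(html, baseUri);
		return doc.getElementsByTag("img").first();
	}

	public static void main(String[] args) {
		// Shortened copy of the logo markup shown above
		String html = "<h1 class=\"central-textlogo\">"
				+ "<img src=\"//upload.wikimedia.org/wikipedia/meta/6/6d/Wikipedia_wordmark_1x.png\""
				+ " alt=\"WikipediA\" title=\"Wikipedia\"></h1>";
		Element img = firstImage(html, "https://www.wikipedia.org/");
		if (img != null) {
			System.out.println(img.attr("src"));   // the attribute exactly as written in the page
			System.out.println(img.absUrl("src")); // the same address resolved against the base address
			System.out.println(img.attr("alt"));   // any other attribute is read the same way
		}
	}
}
```

When the Document comes from Jsoup.connect(url).get(), the base address is already set to the page you fetched, so absUrl("src") should work there without passing it in yourself.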

You might be questioning what use this is, but let me assure you it can be useful. The code below gets all the fixtures for a month from this website and formats them into easily readable data on the console. Give it a go. I’ve tried to explain how it works using comments in the code.

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Main {

	public static void main(String[] args){
		String url = "http://www.national-autograss.co.uk/march.htm";
		Document doc;
		Element table = null;
		Elements rows = null;
		Element date = null;
		Elements tables = null;
		try {
			doc = Jsoup.connect(url).get();
		} catch (IOException e) {
			// Without a document there is nothing to parse, so report the error and stop
			e.printStackTrace();
			return;
		}

		tables = doc.select("table"); //gets all the table elements from doc 

		//how many table elements were parsed
		for (int t = 0; t < (tables.size() - 1); t++){
			table = tables.get(t);      	//get singular table element
			rows = table.select("tr");  	//get the rows from the table element
			date = doc.select("h2").get(t);	//get date from h2 field.

			/*there is a date for each table on this website so I use the
			 * t from table size to get the appropriate one
			 */
			System.out.println("********" + date.text() + "********");

			Elements data = null;
			//cycle through the rows getting the data from each cell of data td.
			//I started at one because I didn't want the titles
			for (int i = 1; i < rows.size(); i++){
				data = rows.get(i).select("td");

				for (int k = 0; k < data.size(); k++){
					//go through td data of row and print out
					System.out.println(data.get(k).text());
				}
			}
		}
	}
}

You will get the following in your console if it has worked correctly:

********Sunday 09 March********
Nottingham
Oxton
10:30am
RO
Single Days Racing

********Sunday 16 March********
Carlow
knocknatubbrid
12:00noon
RO/Q
Single Days Racing

Pennine
Darleymoor
11:00am
RO
Single Days Racing

And so on…

The key to understanding how this all works is really knowing how HTML tables work. I found this website quite handy, but this (from W3Schools) is all you really need to know:

HTML Tables
Tables are defined with the table tag.

A table is divided into rows with the tr tag. (tr stands for table row)

A row is divided into data cells with the td tag. (td stands for table data)

The td elements are the data containers in the table.

The td elements can contain all sorts of HTML elements like text, images, lists, other tables, etc.
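
To see those table rules in action without depending on a live website, here is a small sketch that parses a table from a string. The class name TableRows, the helper readRows and the header names (Club, Venue, Start) are all made up for illustration; only the two data rows are borrowed from the output above.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

final class TableRows {

	// Returns the td text of every row in the first table.
	// Rows with no td cells (such as a header row made of th cells) are skipped.
	static List<List<String>> readRows(String html) {
		List<List<String>> result = new ArrayList<>();
		Document doc = Jsoup.parse(html);
		Element table = doc.select("table").first();
		if (table == null) {
			return result; // no table in the document
		}
		for (Element row : table.select("tr")) {
			List<String> cells = new ArrayList<>();
			for (Element cell : row.select("td")) {
				cells.add(cell.text());
			}
			if (!cells.isEmpty()) {
				result.add(cells);
			}
		}
		return result;
	}

	public static void main(String[] args) {
		// Hypothetical table: the header names are invented for this example
		String html = "<table>"
				+ "<tr><th>Club</th><th>Venue</th><th>Start</th></tr>"
				+ "<tr><td>Nottingham</td><td>Oxton</td><td>10:30am</td></tr>"
				+ "<tr><td>Pennine</td><td>Darleymoor</td><td>11:00am</td></tr>"
				+ "</table>";
		for (List<String> row : readRows(html)) {
			System.out.println(String.join(" | ", row));
		}
	}
}
```

Returning the rows as lists rather than printing them straight away is just a design choice on my part. It keeps the parsing separate from the display, which should make it easier to reuse the data somewhere other than the console.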

In the next post I will show you how to use this in an Android app.

Cheers