Using Jsoup with Android — August 11, 2014

As promised, we will now look at using Jsoup with Android. This is pretty simple and very similar to the previous Java example, where we parsed the title from Wikipedia’s page. By keeping it bare bones, you should see how this works and be able to add more to it gradually.

Start by making a new Android project with a blank activity and blank layout. Add the Jsoup .jar to the libs directory of your project and add it to the build path.

Get the quickest and easiest thing out of the way first by putting the following in your AndroidManifest.xml file beneath the <uses-sdk> element. We are going to use the device’s Internet connection, so the app needs permission to do so.

  <uses-permission android:name="android.permission.INTERNET"/>

Copy the code below into your layout file. It is just a Button and a TextView with IDs so we can access them from the main activity.

<RelativeLayout xmlns:android="http://schemas.android.com/apk/res/android"
    xmlns:tools="http://schemas.android.com/tools"
    android:layout_width="match_parent"
    android:layout_height="match_parent"
    android:paddingBottom="@dimen/activity_vertical_margin"
    android:paddingLeft="@dimen/activity_horizontal_margin"
    android:paddingRight="@dimen/activity_horizontal_margin"
    android:paddingTop="@dimen/activity_vertical_margin"
    tools:context="com.example.jsouptest.MainActivity" >

    <Button
        android:id="@+id/button1"
        android:layout_width="wrap_content"
        android:layout_height="wrap_content"
        android:layout_alignParentTop="true"
        android:layout_centerHorizontal="true"
        android:layout_marginTop="40dp"
        android:text="Button" />

    <TextView
        android:id="@+id/textView1"
        android:layout_width="wrap_content"
        android:layout_height="wrap_content"
        android:layout_below="@+id/button1"
        android:layout_centerHorizontal="true"
        android:layout_marginTop="42dp"
        android:text="TextView" />

</RelativeLayout>

Now change your Main Activity to look like the below. I have commented in the code to try and explain what is going on.

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import android.support.v7.app.ActionBarActivity;
import android.os.AsyncTask;
import android.os.Bundle;
import android.view.View;
import android.view.View.OnClickListener;
import android.widget.Button;
import android.widget.TextView;

public class MainActivity extends ActionBarActivity {

	String url = "http://www.wikipedia.org";
	Document doc = null;
	TextView textView = null;

    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        setContentView(R.layout.activity_main);

        textView = (TextView) findViewById(R.id.textView1);
        Button button = (Button) findViewById(R.id.button1);

        button.setOnClickListener(new OnClickListener() {

			@Override
			public void onClick(View v) {
				textView.setText("WORKING"); //just to show button has been pressed
				new DataGrabber().execute(); //execute the asynctask below
			}
		});

    }
    //New class for the Asynctask, where the data will be fetched in the background
    private class DataGrabber extends AsyncTask<Void, Void, Void>{

		@Override
		protected Void doInBackground(Void... params) {
			// NO CHANGES TO UI TO BE DONE HERE
			try {
				doc = Jsoup.connect(url).get();
			} catch (IOException e) {
				// TODO Auto-generated catch block
				e.printStackTrace();
			}
			return null;
		}

		@Override
		protected void onPostExecute(Void result) {
			//This is where we update the UI with the acquired data
			if (doc != null){
				textView.setText(doc.title()); //title() already returns a String
			}else{
				textView.setText("FAILURE");
			}
		}
    }
}

The main thing out of the ordinary here is the new class (DataGrabber) that extends AsyncTask. The reason is that if you want to get data from a server or the Internet, as we do here, you cannot do it from the main UI thread (recent Android versions throw a NetworkOnMainThreadException if you try). This is so that a slow request won’t make the UI unresponsive. It is quite simple though: as you can see, I have used doInBackground to fetch the page from the URL and then onPostExecute to update the UI with the result (you can’t modify the UI from a background thread!). More info on all that here if you are interested.
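
If you want to play with the same idea outside Android, here is a minimal plain-Java sketch of it. It is not AsyncTask and it will not run inside the activity above. It uses CompletableFuture from the standard library to do the slow work on a background thread and hand the result to a callback afterwards. The class name BackgroundFetch, the method fetchThenShow and the hard-coded "Wikipedia" supplier are all made up for illustration; in real code the supplier would be something like the Jsoup call from doInBackground.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Consumer;
import java.util.function.Supplier;

public class BackgroundFetch {

    // Runs slowWork on a background thread (the doInBackground role) and
    // passes its result to showResult once it finishes (the onPostExecute role).
    // Unlike onPostExecute, the callback is not guaranteed to run on the
    // thread that called this method.
    public static CompletableFuture<Void> fetchThenShow(Supplier<String> slowWork,
                                                        Consumer<String> showResult) {
        return CompletableFuture.supplyAsync(slowWork).thenAccept(showResult);
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for Jsoup.connect(url).get().title()
        Supplier<String> slowWork = () -> "Wikipedia";

        CompletableFuture<Void> task =
                fetchThenShow(slowWork, title -> System.out.println("Title = " + title));

        System.out.println("The calling thread carries on while the work runs");
        task.join(); // wait here only so the demo does not exit early
    }
}
```

The split is the same as in DataGrabber: one piece of code does the slow fetch, and a separate piece receives the result once it is ready.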

Run the app on a device with Internet access and if all has worked well you should see something like this:

[Screenshot: the app running on the device]

Now experiment by trying to get other information out of other websites, as in the last post. If you want to see this in action I have used it in an app of mine. It’s pretty simple but does what it needs to.

Hopefully that was helpful. Let me know if anything needs greater detail.

 

Attempting to use JSOUP — August 10, 2014

Recently a friend asked me if it was possible to extract data from a website and display it on an Android device, rather than having a dedicated app pulling data from a server. I did a bit of research and found that with the help of Jsoup it may not be that hard at all.

Jsoup is a Java library that can manipulate HTML data from a source, be it a file, a website or a string. This allows us to scrape and parse websites for the data that we wish to collect quite easily and succinctly. As it is a Java library it can also be used on Android with no hassle. The best way to start with Jsoup is to just use it in a simple Java program that will print the results to the console.

First download the Jsoup .jar from here. There are lots of good explanations and resources on the website so have a look around.

Make a new Java project, add a ‘lib’ folder to your project directory and copy the .jar file inside it. Hit refresh (F5) and then right click on the .jar file in the lib folder. Select ‘add to build path’. You will then see the .jar file move from lib to the referenced libraries location in your project. If you need any further assistance with that look here.

Just for an easy example and to see how it works, I am going to take the title from http://www.wikipedia.org using Jsoup. Create a new class and use the following code.

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class Main {

	public static void main(String[] args){

		String url = "http://www.wikipedia.org";

		Document doc = null;
		try {
			doc = Jsoup.connect(url).get();
		} catch (IOException e) {
			// Could not fetch the page, so there is nothing to parse
			e.printStackTrace();
			return;
		}
		String title = doc.title();
		System.out.println("Title = " + title);
	}
}

If you run that code it should print “Title = Wikipedia”. Not much is really going on here. We pass the URL string to Jsoup.connect(), which returns a Connection; calling get() on it fetches the page and returns a parsed Document for us to work with.

All we do then is use .title() on doc, which returns a string of the document’s title. If you look at the below snippet of HTML from Wikipedia.org you will see why this works.

<!DOCTYPE html>
<html lang="mul" dir="ltr">
<head>
<!-- Sysops: Please do not edit the main template directly; update /temp and synchronise. -->
<meta charset="utf-8">
<title>Wikipedia</title>
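
If you want to check this without a network connection, here is a small sketch that parses a trimmed copy of that snippet from a string using Jsoup.parse and reads the title the same way. The class name TitleFromString and the helper titleOf are just names I picked for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class TitleFromString {

    // Parses HTML held in a string and returns the text of its <title> element.
    public static String titleOf(String html) {
        Document doc = Jsoup.parse(html);
        return doc.title();
    }

    public static void main(String[] args) {
        // Trimmed copy of the Wikipedia snippet, closed off so it is a whole page
        String html = "<!DOCTYPE html>"
                + "<html lang=\"mul\" dir=\"ltr\">"
                + "<head><meta charset=\"utf-8\"><title>Wikipedia</title></head>"
                + "<body></body></html>";
        System.out.println("Title = " + titleOf(html)); // prints "Title = Wikipedia"
    }
}
```

Parsing from a string like this is also handy for trying out selectors on a saved copy of a page before pointing the code at the live site.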

What if we wanted the Wikipedia logo image instead? Easy enough: all we have to do is change the last two lines of code to the below (and add an import for org.jsoup.nodes.Element).

Element img = doc.getElementsByTag("img").first();
System.out.println(img.toString());

Run that and you should see the img element, including the source URL of the Wiki image, displayed in your console.

Jsoup is searching the HTML document doc for elements with the tag “img” and then printing that Element to the console using .toString(). Also notice I used .first() because I only wanted the first image. Take it out (storing the result as Elements rather than Element) and see how many you get now!

Again this makes more sense if you look at the HTML for the wiki page.

<h1 class="central-textlogo" style="font-variant: small-caps;">
<img src="//upload.wikimedia.org/wikipedia/meta/6/6d/Wikipedia_wordmark_1x.png" srcset="//upload.wikimedia.org/wikipedia/meta/a/a9/Wikipedia_wordmark_1.5x.png 1.5x, //upload.wikimedia.org/wikipedia/meta/8/8a/Wikipedia_wordmark_2x.png 2x" width="174" height="30" alt="WikipediA" title="Wikipedia">
</h1>
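
If you only want the image address rather than the whole element, a sketch along these lines should do it. It parses a shortened copy of the markup from a string and reads the src attribute with absUrl, which resolves the address against a base URI. The class name ImageSource and the helper firstImageSrc are made up for the example, and passing "http://www.wikipedia.org" as the base URI is my assumption about where the page came from.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ImageSource {

    // Returns the absolute src URL of the first <img> in the HTML,
    // or null if the HTML contains no <img> at all.
    public static String firstImageSrc(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        Element img = doc.getElementsByTag("img").first(); // null when nothing matched
        return img == null ? null : img.absUrl("src");
    }

    public static void main(String[] args) {
        // Shortened copy of the logo markup from the Wikipedia page
        String html = "<h1 class=\"central-textlogo\">"
                + "<img src=\"//upload.wikimedia.org/wikipedia/meta/6/6d/Wikipedia_wordmark_1x.png\""
                + " width=\"174\" height=\"30\" alt=\"WikipediA\" title=\"Wikipedia\">"
                + "</h1>";
        System.out.println(firstImageSrc(html, "http://www.wikipedia.org"));
    }
}
```

The null check matters because .first() returns null when no element matches, so calling a method on the result directly would crash on a page without images.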

You might be questioning what use this is, but let me assure you it can be useful. The code below gets all the fixtures for a month from this website and formats them into easily readable data on the console. Give it a go. I’ve tried to explain how it works using comments in the code.

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Main {

	public static void main(String[] args){
		String url = "http://www.national-autograss.co.uk/march.htm";
		Document doc = null;
		Element table = null;
		Elements rows = null;
		Element date = null;
		Elements tables = null;
		try {
			doc = Jsoup.connect(url).get();
		} catch (IOException e) {
			// Could not fetch the page, so there is nothing to parse
			e.printStackTrace();
			return;
		}

		tables = doc.select("table"); //gets all the table elements from doc 

		//how many table elements were parsed
		for (int t = 0; t < (tables.size() - 1); t++){
			table = tables.get(t);      	//get singular table element
			rows = table.select("tr");  	//get the rows from the table element
			date = doc.select("h2").get(t);	//get date from h2 field.

			/*there is a date for each table on this website so I use the
			 * t from table size to get the appropriate one
			 */
			System.out.println("********" + date.text() + "********");

			Elements data = null;
			//cycle through the rows getting the data from each cell of data td.
			//I started at one because I didn't want the titles
			for (int i = 1; i < rows.size(); i++){
				data = rows.get(i).select("td");

				for (int k = 0; k < data.size(); k++){
					//go through td data of row and print out
					System.out.println(data.get(k).text());
				}
			}
		}
	}
}

You will get the following in your console if it has worked correctly:

********Sunday 09 March********
Nottingham
Oxton
10:30am
RO
Single Days Racing

********Sunday 16 March********
Carlow
knocknatubbrid
12:00noon
RO/Q
Single Days Racing

Pennine
Darleymoor
11:00am
RO
Single Days Racing

And so on…

The key to understanding how this all works is really knowing how HTML tables work. I found this website quite handy, but this is all you really need to know for this (from W3Schools):

HTML Tables
Tables are defined with the <table> tag.

A table is divided into rows with the <tr> tag. (tr stands for table row)

A row is divided into data cells with the <td> tag. (td stands for table data)

The <td> elements are the data containers in the table.

The td elements can contain all sorts of HTML elements like text, images, lists, other tables, etc.
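
Those tags map straight onto the selectors in the fixtures code. Here is a minimal sketch that runs the same row and cell loops over a small made-up table parsed from a string, so you can try it without the website. The class name TableCells, the helper cells and the sample table (which borrows a couple of values from the output above) are my own for illustration.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class TableCells {

    // Returns the text of every <td> cell, with one inner list per <tr> row.
    public static List<List<String>> cells(String html) {
        Document doc = Jsoup.parse(html);
        List<List<String>> rows = new ArrayList<>();
        for (Element row : doc.select("table tr")) {
            List<String> cellsInRow = new ArrayList<>();
            for (Element cell : row.select("td")) {
                cellsInRow.add(cell.text());
            }
            if (!cellsInRow.isEmpty()) { // header rows built from <th> have no <td> cells
                rows.add(cellsInRow);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // Made-up table in roughly the same shape as the fixtures page
        String html = "<table>"
                + "<tr><th>Club</th><th>Venue</th><th>Start</th></tr>"
                + "<tr><td>Nottingham</td><td>Oxton</td><td>10:30am</td></tr>"
                + "<tr><td>Pennine</td><td>Darleymoor</td><td>11:00am</td></tr>"
                + "</table>";
        for (List<String> row : cells(html)) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```

Skipping rows with no td cells is an alternative to starting the loop at index one. It only works when the header row uses th cells, which I have assumed for the sample table and have not checked on the fixtures site.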

In the next post I will show you how to use this in an Android app.

Cheers