Hadoop Cluster

Warning: this page may be out of date

The Hadoop cluster has changed since this page was last updated. We'll try to update the page, but in the meantime feel free to ask Computing Support for details of current Hadoop arrangements.

What is Hadoop?

Apache Hadoop is an open-source software framework that supports data-intensive distributed applications, licensed under the Apache v2 licence. It supports the running of applications on large clusters of commodity hardware. The Hadoop framework transparently provides both reliability and data motion to applications. Hadoop implements a computational paradigm named map/reduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. In addition, it provides a distributed file system that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster.
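The map/reduce idea can be illustrated with a short sketch in plain Python (this is the paradigm only, not Hadoop's actual API): the input is split into fragments, a map function processes each fragment independently, and a reduce step merges the partial results.

```python
from collections import Counter

# Sketch of the map/reduce paradigm in plain Python (not the Hadoop API).
# In a real cluster, each fragment could be processed on a different node.

def map_fragment(fragment):
    """Map step: count the words in one fragment of the input."""
    return Counter(fragment.split())

def reduce_counts(partial_counts):
    """Reduce step: merge the per-fragment word counts."""
    total = Counter()
    for counts in partial_counts:
        total += counts
    return total

fragments = ["the quick brown fox", "the lazy dog", "the fox"]
totals = reduce_counts(map_fragment(f) for f in fragments)
print(totals["the"])  # "the" appears three times across all fragments
```

Because each map call depends only on its own fragment, the map work can be re-executed on any node if one fails, which is how Hadoop provides its reliability.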

For more information read the full Wikipedia entry or watch Isabel Drost's talk on Hadoop at FOSDEM 2010.

Access to the DICE Hadoop cluster

We run a DICE Hadoop cluster. It has two main categories of users:

  1. Extreme Computing students. If you are registered for Extreme Computing you will be given access to the cluster when you need it for your assignments.
  2. Researchers wishing to try out Hadoop can use it as a test cluster.

If you want to try using the Hadoop cluster, ask Computing Support via the request form to give you access.

Once you have access, read on.

Using the DICE Hadoop cluster

If you want to try out Hadoop, the Hadoop 0.23.6 documentation pages include a Single Node Setup which lets you experiment with your own temporary single-node Hadoop cluster. If you're ready to try a bigger cluster, work through the Map/Reduce Tutorial on the DICE Hadoop cluster. In the first example of that Map/Reduce tutorial, the correct "javac" command for our cluster would be "javac -classpath /opt/hadoop/hadoop-0.23.6/hadoop-0.23.6-core.jar -d wordcount_classes WordCount.java"
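Put together, the compile-and-run sequence for the tutorial's WordCount example looks roughly like this. This is a sketch: the jar path matches our cluster as given above, but the source file, class name (org.myorg.WordCount) and HDFS directories are the tutorial's own, and your paths may differ.

```shell
# Sketch: compile and run the tutorial's WordCount example on the cluster.
# Only the classpath jar location is specific to our cluster; the rest
# follows the Map/Reduce tutorial and is illustrative.
mkdir wordcount_classes
javac -classpath /opt/hadoop/hadoop-0.23.6/hadoop-0.23.6-core.jar \
      -d wordcount_classes WordCount.java
jar -cvf wordcount.jar -C wordcount_classes/ .

# Copy some input files into HDFS, run the job, and inspect the output.
hadoop fs -mkdir input
hadoop fs -put file01 file02 input
hadoop jar wordcount.jar org.myorg.WordCount input output
hadoop fs -cat output/part-00000
```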

The DICE Hadoop cluster runs mainly on the DICE Beowulf machines. There are currently about 70 data nodes in the cluster. It runs Hadoop 0.20.2 on Scientific Linux 6. The documentation for that exact version of the Hadoop software is no longer available; the 0.23.6 documentation is the closest match.

To get access to the web status pages of the clusters (URLs are given below) you will need to be connecting from a machine on the Informatics network. If you're connecting from outside that network (for instance if your machine is on the University's wireless network) then you may find the OpenVPN service useful.

*To get access to the cluster* run "ssh namenode" from a machine on the Informatics network.
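For example, a quick first session might look like this (a sketch; the HDFS paths shown are illustrative, not guaranteed to exist on our cluster):

```shell
# From a machine on the Informatics network:
ssh namenode

# Once logged in, check that HDFS is visible (paths are illustrative):
hadoop fs -ls /
hadoop fs -ls /user/$USER
```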

*To examine the HDFS status* see

*To examine the status of Map/Reduce jobs* look at

*If you have a question or a problem* to do with the Hadoop cluster contact Computing Support. They'll either answer your question or put you in touch with someone who can.

HadoopCareAndFeeding is a resource for computing staff. It describes how to maintain the cluster.
