NB: Nectar's Data Processing service is currently in beta. Users are encouraged to try it out and provide feedback via support, but please note that we cannot provide application-level expertise and are still developing a support model for this and other advanced services.

Nectar provides data processing services powered by the OpenStack Sahara project. Sahara is an OpenStack service that gives users an Elastic Data Processing (EDP) service that can be used to deploy, manage and operate various common data analytics clusters. Users can operate Sahara from the command line, using plain-text templates for clusters and tasks, or use the dashboard tools and wizards to easily start up clusters and execute jobs on them.

This document provides an overview of the general functionality and terms. We also encourage potential users to visit the official upstream Sahara user guide.

Data Processing Overview

OpenStack Sahara provides users with a simple means to provision a data processing framework (such as Hadoop or Spark) on OpenStack (see below for the specific frameworks currently available on Nectar's Data Processing). The deployment process can be divided into two major steps:

  • Create Data Processing cluster
  • Execute EDP jobs


Data Processing Clusters are composed of Node Groups. Each Cluster is based on a particular EDP Plugin, which amongst other things determines the data processing framework and version used. A Data Processing Cluster is launched from a Cluster Template, which incorporates one or more Node Groups, each built from a Node Group Template. The basic procedure to build a cluster is:

  • Build a Node Group Template
  • Build a Cluster Template
  • Launch the Cluster


Elastic Data Processing (EDP) is an OpenStack concept for the execution of big data processing jobs on Data Processing clusters. An EDP Job is executed on a Cluster based on a Job Template. The Job Template uses a Job Binary to store the job processing logic, and Data Sources to reference the input and output data. The Job Binary usually contains a job script or jar file (depending on the backend EDP framework). The script and data can be stored in the Cluster's internal HDFS filesystem, an external OpenStack Object Store container, or an OpenStack Shared File System share. The procedure to execute jobs in OpenStack is as follows (a CLI sketch follows the list):

  • Create a Job Binary by uploading or specifying the job script
  • Create a Data Source for input and output data
  • Create a Job Template specifying which Job Binary and Data Source to use
  • Launch the Job on an existing Cluster
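
The same workflow can be driven from the CLI. The following is a minimal sketch only: the names used (mycontainer, wordcount-binary, wordcount-input, wordcount-output, wordcount-job, my-cluster) are placeholders, and the exact flags can vary between python-saharaclient releases, so check each command's --help output first.

# Register the job logic as a Job Binary (here stored in Swift)
$ openstack dataprocessing job binary create \
>     --url "swift://mycontainer/spark-wordcount.jar" \
>     --username <username> --password <password> wordcount-binary
# Register the input and output Data Sources
$ openstack dataprocessing data source create --type swift \
>     --url "swift://mycontainer/input" \
>     --username <username> --password <password> wordcount-input
$ openstack dataprocessing data source create --type swift \
>     --url "swift://mycontainer/output" \
>     --username <username> --password <password> wordcount-output
# Create a Job Template referencing the Job Binary
$ openstack dataprocessing job template create --name wordcount-job \
>     --type Spark --mains wordcount-binary
# Launch the job on an existing cluster
$ openstack dataprocessing job execute --cluster my-cluster \
>     --job-template wordcount-job --input wordcount-input --output wordcount-output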


Provisioning Plugin

In general, a Plugin enables Data Processing to deploy a specific data processing framework (for example, Hadoop) or distribution. Currently, Nectar Data Processing supports:

  • Vanilla Plugin - deploys Vanilla Apache Hadoop, version 2.7.1
  • Spark Plugin - deploys Apache Spark with Cloudera HDFS, versions 1.3, 1.6 and 2.1


If you're new to data analytics

Then you should know that Data Processing in the Nectar cloud supports several data analytics tools - currently the Hadoop and Spark frameworks. Other platforms, such as Cloudera and the Hortonworks Data Platform, are planned. See the Hadoop and Spark project documentation to learn more about these tools and their suitability for your work. You may find that Spark, with its Python interface, is the easiest way to get started.

If you're experienced in data analytics and new to Data Processing

You know what you want to do; you just need to know how to get started. First, have a quick look at the Guides under Data Processing in the Nectar Dashboard. The Guides step you through the creation of a selected cluster and the generation of a job ready for execution. You should also read the official Sahara user documentation to learn about key Data Processing concepts for clusters, jobs and data sources.

Launching a Data Processing Cluster via the Dashboard

This section demonstrates how to use Data Processing services on the Nectar research cloud by creating a Spark cluster and executing a word count job.

The Data Processing menu is available in the left-hand toolbar of the Dashboard.

The items under Data Processing provide all of the functionality. The Guides are usually the best starting point, although some knowledge of data analytics is expected.

Node Group Template creation is the first step. After choosing the desired plugin and version (Spark 2.1 in this example), you will be presented with several tabs to specify the template details, such as the template name, instance flavor, availability zone and storage. Please be aware that the template name can only contain digits, letters and the special characters '.' and '-'.

An important setting in the Node Processes tab is Select Node Group Processes. It determines the role of the servers in the data processing cluster topology. A common case is to run one set of processes in a master node group template and another set in a slave template. All resources used in Data Processing are subject to your Nectar quota.

Each plugin has some specific cluster topology requirements. Further information can be found in the OpenStack Sahara plugin documentation, such as https://docs.openstack.org/sahara/latest/user/spark-plugin.html for the Spark plugin.
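
For example, for the Spark plugin a minimal topology is a master node group running the namenode and master processes, and a worker node group running the datanode and slave processes. As a sketch, the CLI equivalent of the dashboard steps above might look like the following (template names and flavor are placeholders):

$ openstack dataprocessing node group template create \
>     --name spark-default-master --plugin spark \
>     --plugin-version 2.1.0 --processes namenode master \
>     --flavor m2.small --auto-security-group
$ openstack dataprocessing node group template create \
>     --name spark-default-worker --plugin spark \
>     --plugin-version 2.1.0 --processes datanode slave \
>     --flavor m2.small --auto-security-group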

Repeat the same procedure to create the worker template. Once complete, both the master and worker templates will be listed under Node Group Templates.

Once the node group templates exist, the cluster template can be created from them. Click the Create Template button on the Cluster Templates toolbar, select the correct plugin and version, then specify the required details. In the Node Groups tab, the created node group templates can be added and the number of instances selected. The number of instances that can be launched is subject to your quota. The other parameters (in the General Parameters, HDFS Parameters and Spark Parameters tabs) can be left unchanged unless you have specific cluster requirements.

After the cluster template is created, the Spark cluster can be launched by clicking Launch Cluster. Select the required cluster template, cluster image, server key pair and network name, then click Launch.

A Cluster object has a Status attribute which changes as the launch proceeds. A successful cluster creation should progress from Validating through InfraUpdating, Spawning, Waiting, Preparing, Configuring and Starting to Active.
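
The launch progress can also be watched from the command line; a sketch, assuming a cluster named my-cluster and using the client's standard column filter:

$ openstack dataprocessing cluster show my-cluster -c Status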

Although the OpenStack-provided EDP interfaces can be used to run data processing jobs, the Data Processing cluster can also be used as a traditional Spark/Hadoop cluster: just SSH to the cluster using the master instance's IP address.

Executing a Data Processing Job via the Dashboard

The EDP Spark word count example job uses Swift as the medium for both the data sources and the job binary. The example word count jar provides the job binary, and any text file, such as Shakespeare's works, can be used as the input data. The example jar, spark-wordcount.jar, is part of the OpenStack sahara-tests framework.

Since object storage is used, a Swift container with an input folder and a job binary folder should be created first, and the input file and job binary file uploaded into their respective folders.
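
One way to prepare the container is with the OpenStack CLI. A sketch, where the container and file names (demo-wordcount, spark-wordcount.jar, input/alice.txt) are placeholders:

$ openstack container create demo-wordcount
# Object names containing a / behave as folders in Swift
$ openstack object create demo-wordcount spark-wordcount.jar
$ openstack object create demo-wordcount input/alice.txt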

You can then create the job binary package in Data Processing. Set the Storage type to Swift, and the URL to the job binary's path in Swift. Swift also requires your Nectar OpenStack username and password for authentication. Alternatively, the Internal storage type can be selected to store the uploaded job binary file in the Data Processing database.

At the same time, the data input and output locations must be assigned. Data Processing uses the Data Source concept for this, so you have to create an input data source and an output data source. Please note that the output data source path should NOT already exist, otherwise the job execution will fail.

After both the data (Data Sources) and the application logic (Job Binary) are ready, the Job Template can be created. The job binary you just created is the main binary. If needed, other libraries can be entered in the Libs tab.

It is now time to launch the job by clicking the Launch On Existing Cluster button. 

The job Type should be Spark, and additional parameters should be provided in the Configure tab, as described at https://github.com/openstack/sahara-tests/tree/master/sahara_tests/scenario/defaults/edp-examples/edp-spark

Wait until the job execution has finished and the results are written to the output path defined in the output data source. The Status should change to Succeeded on the right-hand side of the Jobs toolbar.
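
Since the output data source points at Swift, the results can also be inspected from the CLI; a sketch, where the container and output object names are placeholders:

$ openstack object list demo-wordcount
$ openstack object save demo-wordcount output/part-00000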

Launching a Data Processing Cluster via the CLI

Alternatively, data processing cluster creation and job execution can be performed using the Command Line Interface (CLI). All Data Processing functions are under the openstack dataprocessing command. See Manage instance via API for details on installing the CLI.
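
The dataprocessing commands are provided by the python-saharaclient plugin to the standard OpenStack client, so both packages need to be installed, for example:

$ pip install python-openstackclient python-saharaclient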

Start by checking the supported plugins, plugin versions, and available images:

$ openstack dataprocessing plugin list
+---------+---------------------+
| Name    | Versions            |
+---------+---------------------+
| vanilla | 2.7.1               |
| spark   | 1.3.1, 1.6.0, 2.1.0 |
+---------+---------------------+
$ openstack image list --property _sahara_tag_vanilla='True'
+--------------------------------------+--------------------------------------------+--------+
| ID                                   | Name                                       | Status |
+--------------------------------------+--------------------------------------------+--------+
| 622a0299-72eb-484e-9c96-3ce7d0a4d2f7 | Sahara Vanilla 2.7.1 (Ubuntu 16.04 - Beta) | active |
+--------------------------------------+--------------------------------------------+--------+

To get information on the available plugin services:

$ openstack dataprocessing plugin show vanilla --plugin-version 2.7.1
+-------------------------------+---------------------------------------------------------------+
| Field                         | Value                                                         |
+-------------------------------+---------------------------------------------------------------+
| Description                   | The Apache Vanilla plugin provides the ability to launch      |
|                               | upstream Vanilla Apache Hadoop cluster without any management |
|                               | consoles. It can also deploy the Oozie component.             |
| Name                          | vanilla                                                       |
| Required image tags           | 2.7.1, vanilla                                                |
| Title                         | Vanilla Apache Hadoop                                         |
|                               |                                                               |
| Plugin version 2.7.1: enabled | True                                                          |
| Plugin version 2.7.1: stable  | True                                                          |
| Plugin: enabled               | True                                                          |
| Plugin: stable                | True                                                          |
|                               |                                                               |
| Service:                      | Available processes:                                          |
|                               |                                                               |
| HDFS                          | datanode, namenode, secondarynamenode                         |
| Hadoop                        |                                                               |
| Hive                          | hiveserver                                                    |
| JobFlow                       | oozie                                                         |
| MapReduce                     | historyserver                                                 |
| Spark                         | spark history server                                          |
| YARN                          | nodemanager, resourcemanager                                  |
+-------------------------------+---------------------------------------------------------------+

Once you have the required image and plugin details, create a master node group template:

$ openstack dataprocessing node group template create \
>     --name vanilla-default-master --plugin vanilla \
>     --plugin-version 2.7.1  --processes namenode resourcemanager \
>     --flavor m2.small --auto-security-group
+---------------------+--------------------------------------+
| Field               | Value                                |
+---------------------+--------------------------------------+
| Auto security group | True                                 |
| Availability zone   | None                                 |
| Flavor id           | 639b8b2a-a5a6-4aa2-8592-ca765ee7af63 |
| Floating ip pool    | None                                 |
| Id                  | cde95af5-d295-4bbd-b2b2-e25c995197e6 |
| Is default          | False                                |
| Is protected        | False                                |
| Is proxy gateway    | False                                |
| Is public           | False                                |
| Name                | vanilla-default-master               |
| Node processes      | namenode, resourcemanager            |
| Plugin name         | vanilla                              |
| Plugin version      | 2.7.1                                |
| Security groups     | None                                 |
| Use autoconfig      | False                                |
| Volumes per node    | 0                                    |
+---------------------+--------------------------------------+

Then create a worker node group template:

$ openstack dataprocessing node group template create \
>     --name vanilla-default-worker --plugin vanilla \
>     --plugin-version 2.7.1 --processes datanode nodemanager \
>     --flavor m2.small --auto-security-group
+---------------------+--------------------------------------+
| Field               | Value                                |
+---------------------+--------------------------------------+
| Auto security group | True                                 |
| Availability zone   | None                                 |
| Flavor id           | 639b8b2a-a5a6-4aa2-8592-ca765ee7af63 |
| Floating ip pool    | None                                 |
| Id                  | 2e11033a-3316-4139-8afd-86e144ed31bb |
| Is default          | False                                |
| Is protected        | False                                |
| Is proxy gateway    | False                                |
| Is public           | False                                |
| Name                | vanilla-default-worker               |
| Node processes      | datanode, nodemanager                |
| Plugin name         | vanilla                              |
| Plugin version      | 2.7.1                                |
| Security groups     | None                                 |
| Use autoconfig      | False                                |
| Volumes per node    | 0                                    |
+---------------------+--------------------------------------+

Next, create a cluster template:

$ openstack dataprocessing cluster template create \
>     --name vanilla-default-cluster \
>     --node-groups vanilla-default-master:1 vanilla-default-worker:3
+----------------+----------------------------------------------------+
| Field          | Value                                              |
+----------------+----------------------------------------------------+
| Anti affinity  |                                                    |
| Description    | None                                               |
| Domain name    | None                                               |
| Id             | 078384f2-d486-4322-9a78-a007bd85eb49               |
| Is default     | False                                              |
| Is protected   | False                                              |
| Is public      | False                                              |
| Name           | vanilla-default-cluster                            |
| Node groups    | vanilla-default-worker:3, vanilla-default-master:1 |
| Plugin name    | vanilla                                            |
| Plugin version | 2.7.1                                              |
| Use autoconfig | False                                              |
+----------------+----------------------------------------------------+

You can now provision a data processing cluster from the cluster template:

$ openstack dataprocessing cluster create --name my-cluster \
>     --cluster-template vanilla-default-cluster --user-keypair my-keypair \
>     --neutron-network "Classic Provider" --image "Sahara Vanilla 2.7.1 (Ubuntu 16.04 - Beta)"
+----------------------------+----------------------------------------------------+
| Field                      | Value                                              |
+----------------------------+----------------------------------------------------+
| Anti affinity              |                                                    |
| Cluster template id        | 078384f2-d486-4322-9a78-a007bd85eb49               |
| Description                | None                                               |
| Id                         | 920ce694-af87-4a56-b6d4-0cb2cdf453b1               |
| Image                      | 622a0299-72eb-484e-9c96-3ce7d0a4d2f7               |
| Is protected               | False                                              |
| Is public                  | False                                              |
| Is transient               | False                                              |
| Name                       | my-cluster                                         |
| Neutron management network | 00000000-0000-0000-0000-000000000000               |
| Node groups                | vanilla-default-worker:3, vanilla-default-master:1 |
| Plugin name                | vanilla                                            |
| Plugin version             | 2.7.1                                              |
| Status                     | Validating                                         |
| Use autoconfig             | False                                              |
| User keypair id            | my-keypair                                         |
+----------------------------+----------------------------------------------------+

You can use the openstack network list command to get a list of the available Neutron networks.
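
For example:

$ openstack network list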

The cluster Status changes to Active when provisioning is complete:

$ openstack dataprocessing cluster show my-cluster
+----------------------------+----------------------------------------------------+
| Field                      | Value                                              |
+----------------------------+----------------------------------------------------+
| Anti affinity              |                                                    |
| Cluster template id        | 078384f2-d486-4322-9a78-a007bd85eb49               |
| Description                | None                                               |
| Id                         | 920ce694-af87-4a56-b6d4-0cb2cdf453b1               |
| Image                      | 622a0299-72eb-484e-9c96-3ce7d0a4d2f7               |
| Is protected               | False                                              |
| Is public                  | False                                              |
| Is transient               | False                                              |
| Name                       | my-cluster                                         |
| Neutron management network | 00000000-0000-0000-0000-000000000000               |
| Node groups                | vanilla-default-worker:3, vanilla-default-master:1 |
| Plugin name                | vanilla                                            |
| Plugin version             | 2.7.1                                              |
| Status                     | Active                                             |
| Use autoconfig             | False                                              |
| User keypair id            | my-keypair                                         |
+----------------------------+----------------------------------------------------+

Run an example job on the cluster to check that your Hadoop installation is working correctly. Use SSH to connect to the master node:

$ ssh -i ~/.ssh/my-keypair.pem ubuntu@<master instance IP address>

Switch to the hadoop user and run a simple MapReduce example:

$ sudo su hadoop
$ cd /opt/hadoop-2.7.1/share/hadoop/mapreduce
$ /opt/hadoop-2.7.1/bin/hadoop jar hadoop-mapreduce-examples-2.7.1.jar pi 10 100                                    
Number of Maps  = 10
Samples per Map = 100
Wrote input for Map #0
...
Starting Job
INFO client.RMProxy: Connecting to ResourceManager at my-cluster-vanilla-default-master-0/43.240.99.1:8032
INFO input.FileInputFormat: Total input paths to process : 10
INFO mapreduce.JobSubmitter: number of splits:10
INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1528858716019_0001
INFO impl.YarnClientImpl: Submitted application application_1528858716019_0001
INFO mapreduce.Job: The url to track the job: http://my-cluster-vanilla-default-master-0:8088/proxy/application_1528858716019_0001/
INFO mapreduce.Job: Running job: job_1528858716019_0001
INFO mapreduce.Job: Job job_1528858716019_0001 running in uber mode : false
INFO mapreduce.Job:  map 0% reduce 0%
...
INFO mapreduce.Job:  map 100% reduce 100%
INFO mapreduce.Job: Job job_1528858716019_0001 completed successfully
INFO mapreduce.Job: Counters: 49
        File System Counters
                ...
        Job Counters 
                ...
        Map-Reduce Framework
                ...
        Shuffle Errors
                ...
        File Input Format Counters 
                Bytes Read=1180
        File Output Format Counters 
                Bytes Written=97
Job Finished in 84.757 seconds
Estimated value of Pi is 3.14800000000000000000

If successful, the MapReduce example should output an estimated value of Pi, as shown above.

For further information please see Launching a cluster via Sahara CLI commands, in the OpenStack Sahara Quickstart guide.

Data Processing Heat Templates

When launching a cluster, Data Processing uses the OpenStack Heat orchestration engine under the covers to create and manage the required resources. Heat template examples are available in the NeCTAR-RC / heat-templates repository.