NB: Nectar's Data Processing service is currently in beta. Users are encouraged to try it out and provide feedback via support, but please note that we cannot provide application-level expertise and are still developing a support model for this and other advanced services.

Nectar provides data processing services powered by the OpenStack Sahara project. Sahara gives users an Elastic Data Processing (EDP) service for deploying, managing and operating various common data analytics clusters. Users can operate Sahara either on the command line, using plain text templates for clusters and tasks, or through the dashboard tools and wizards, which make it easy to start clusters and execute jobs on them.

This document provides an overview of the general functionality and terms. We also encourage potential users to visit the official upstream Sahara user guide.

Data Processing Overview

OpenStack Sahara provides users with a simple means to provision a data processing framework (such as Hadoop or Spark) on OpenStack (see below for the specific frameworks currently available on Nectar's Data Processing). The deployment process can be divided into two major steps:

  • Create a Data Processing cluster
  • Execute EDP jobs

Data Processing Clusters are composed of Node Groups. Each Cluster is based on a particular EDP Plugin, which amongst other things determines the data processing framework and version used. A Data Processing Cluster is launched from a Cluster Template, which combines one or more Node Groups, each built from a Node Group Template and given a number of instances. The basic procedure to build a cluster is:

  • Build a Node Group Template
  • Build a Cluster Template
  • Launch the Cluster
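
The relationship between these objects can be sketched as a small Python data model. This is an illustration only, not Sahara's API: the class names, the fields, and the check that every node group shares the cluster's plugin and version are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class NodeGroupTemplate:
    # Hypothetical model of a Node Group Template (not a Sahara class).
    name: str
    plugin: str
    plugin_version: str
    processes: Tuple[str, ...]


@dataclass
class ClusterTemplate:
    # Hypothetical model of a Cluster Template: node groups plus instance counts.
    name: str
    plugin: str
    plugin_version: str
    node_groups: List[Tuple[NodeGroupTemplate, int]] = field(default_factory=list)

    def add_node_group(self, template: NodeGroupTemplate, count: int) -> None:
        if count < 1:
            raise ValueError("each node group needs at least one instance")
        # Assumption: every node group must match the cluster's plugin and version.
        if (template.plugin, template.plugin_version) != (self.plugin, self.plugin_version):
            raise ValueError(
                f"{template.name} does not match {self.plugin} {self.plugin_version}"
            )
        self.node_groups.append((template, count))

    def total_instances(self) -> int:
        # Number of servers the cluster would launch (this counts against quota).
        return sum(count for _, count in self.node_groups)


master = NodeGroupTemplate("example-master", "vanilla", "2.7.1",
                           ("namenode", "resourcemanager"))
worker = NodeGroupTemplate("example-worker", "vanilla", "2.7.1",
                           ("datanode", "nodemanager"))

cluster = ClusterTemplate("example-cluster", "vanilla", "2.7.1")
cluster.add_node_group(master, 1)
cluster.add_node_group(worker, 3)
print(cluster.total_instances())
```

The model mirrors the three steps in the list: node group templates are defined first, a cluster template references them with counts, and the launch consumes the resulting total number of instances.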

Elastic Data Processing (EDP) is an OpenStack concept for executing big data processing jobs on Data Processing clusters. An EDP Job is executed on a Cluster based on a Job Template. The Job Template uses a Job Binary to store the job processing logic, and Data Sources to locate the input and output data. The Job Binary usually contains a job script or Jar file (depending on the backend EDP framework). The script and data can be stored in the Cluster-internal HDFS, in an external OpenStack Object Store container, or in OpenStack Shared File System shares. The procedure to execute jobs in OpenStack is:

  • Create a Job Binary by uploading or specifying the job script
  • Create a Data Source for input and output data
  • Create a Job Template specifying which Job Binary and Data Source to use
  • Launch the Job on an existing Cluster

Provisioning Plugin

In general, a Plugin enables Data Processing to deploy a specific data processing framework (for example, Hadoop) or distribution. Currently, Nectar Data Processing supports:

  • Vanilla Plugin - deploys Vanilla Apache Hadoop, version 2.7.1
  • Spark Plugin - deploys Apache Spark with Cloudera HDFS, versions 1.3.1, 1.6.0 and 2.1.0

If you're new to data analytics

Data Processing in the Nectar cloud supports several data analytics tools, currently the Hadoop and Spark frameworks. Other platforms, such as Cloudera and Hortonworks Data Platform, are planned. Consult each project's own documentation to learn more about these tools and their suitability for your work. You may find that Spark with its Python interface is the easiest way to get started.

If you're experienced in data analytics and new to Data Processing

You know what you want to do; you just need to know how to get started. First, have a quick look at the 'Guides' under Data Processing in the Nectar Dashboard. The Guides step you through the creation of a selected cluster and the generation of a job ready for execution. You should also read the official Sahara user documentation to learn about key Data Processing concepts for clusters, jobs and data sources.

Launching a Data Processing Cluster via the Dashboard

This section demonstrates how to use Data Processing services on the Nectar research cloud by creating a Spark cluster and executing a word count job.

The Data Processing menu is available in the left-hand toolbar of the Dashboard.

Items under Data Processing on the right contain all the functionality. Guides are usually the best starting point, although some knowledge of data analytics is expected. 

Node Group Template creation is the first step. After choosing the desired plugin and version (Spark 2.1 in the example), you will be presented with several tabs to specify the template details such as template name, instance flavor, availability zone, and storage. Please be aware that the template name can only contain digits, letters and the special characters '.' and '-'.
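
The naming rule above can be checked before filling in the form. The following Python sketch encodes only the rule stated in this document (ASCII letters, digits, '.' and '-'); the function name is hypothetical, and the service may enforce further constraints (for example on length) that are not covered here.

```python
import re

# Only the characters named in this document: letters, digits, '.' and '-'.
_TEMPLATE_NAME = re.compile(r"[A-Za-z0-9.-]+")


def is_valid_template_name(name: str) -> bool:
    # Hypothetical helper: True if the whole name uses only allowed characters.
    return _TEMPLATE_NAME.fullmatch(name) is not None


for candidate in ("spark-2.1-master", "spark_master", "my template"):
    print(candidate, is_valid_template_name(candidate))
```

Names containing underscores or spaces are rejected by this check, which are the most common mistakes when reusing names from other OpenStack resources.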

An important setting in the Node Processes tab is Select Node Group Processes. It determines the role of the servers in the data processing cluster topology. A common case is running one or more servers from a master node group template and another set of servers from a worker template. All resources used in Data Processing are subject to your Nectar quota.

Each plugin has some specific cluster topology requirements. Further information can be found in the OpenStack Sahara plugin documentation, such as https://docs.openstack.org/sahara/latest/user/spark-plugin.html for the Spark plugin.

Repeat the same procedure to create the worker template. Once complete, both templates should be listed in the dashboard.

Once the node group templates exist, the cluster template can be created from them. Click on the Create Template button on the Cluster Template toolbar, select the correct plugin and version, then specify the required details. In the Node Groups tab, the created node group template names can be added and instance numbers selected. The number of instances that can be launched is subject to the quota of the user. Other parameters (the General Parameters, HDFS Parameters, and Spark Parameters tabs) can be left unchanged unless you have specific cluster requirements.

After the cluster template is created, the Spark cluster can be launched by clicking Launch Cluster. Select the required cluster template, cluster image, server key pair and network name, then click Launch.

A Cluster object has a Status attribute which changes as the launch proceeds. A successful new cluster creation should pass through Validating, InfraUpdating, Spawning, Waiting, Preparing, Configuring and Starting before reaching Active.
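
When scripting around a launch, the status sequence above can be used to decide when a cluster is ready. The Python sketch below is a minimal polling loop under stated assumptions: `get_status` is a hypothetical callable that you would supply (for example, one that queries the API or parses CLI output), and any status outside the listed sequence is treated as a failure.

```python
import time
from typing import Callable, List

# Status sequence for a successful launch, as listed in this document.
LAUNCH_STATUSES = (
    "Validating", "InfraUpdating", "Spawning", "Waiting",
    "Preparing", "Configuring", "Starting", "Active",
)


def wait_until_active(get_status: Callable[[], str],
                      poll_seconds: float = 30.0,
                      timeout_seconds: float = 3600.0,
                      sleep: Callable[[float], None] = time.sleep) -> List[str]:
    # Poll get_status() until "Active"; return the distinct statuses observed.
    seen: List[str] = []
    waited = 0.0
    while True:
        status = get_status()
        if status not in LAUNCH_STATUSES:
            # Assumption: anything outside the sequence means the launch failed.
            raise RuntimeError(f"unexpected cluster status: {status}")
        if not seen or seen[-1] != status:
            seen.append(status)
        if status == "Active":
            return seen
        if waited >= timeout_seconds:
            raise TimeoutError(f"cluster still {status} after {waited:.0f}s")
        sleep(poll_seconds)
        waited += poll_seconds


# Example with canned statuses instead of a real cluster.
fake_statuses = iter(["Validating", "Spawning", "Spawning", "Active"])
observed = wait_until_active(lambda: next(fake_statuses), sleep=lambda s: None)
print(observed)
```

Injecting `sleep` keeps the loop testable without real delays; with a real cluster the default `time.sleep` and a longer poll interval are more appropriate.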

Although the OpenStack-provided EDP interfaces can be used to run data processing jobs, the Data Processing cluster can also be used as a traditional Spark/Hadoop cluster. Just SSH to the cluster using the master instance's IP address.

Executing a Data Processing Job via the Dashboard

The EDP Spark word count example job uses Swift to hold both the data sources and the job binary. The example word count jar provides the job binary, and any text file, such as Shakespeare's works, can be used as the input data. The example jar, spark-wordcount.jar, is part of the OpenStack Sahara tests framework.
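
For readers unfamiliar with the example, the computation performed by a word count job can be illustrated in a few lines of plain Python. This is not the code inside spark-wordcount.jar; the tokenisation used here (lowercasing and splitting on whitespace) is an assumption chosen for simplicity.

```python
from collections import Counter


def word_count(text: str) -> Counter:
    # Count how often each whitespace-separated word occurs (case-insensitive).
    return Counter(text.lower().split())


counts = word_count("To be or not to be")
for word, n in sorted(counts.items()):
    print(word, n)
```

A Spark job distributes the same counting across the worker nodes, which is what makes it practical for inputs far larger than a single machine's memory.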

Since object storage is used, prepare a Swift container with an input folder and a job binary folder, then upload the input file and the job binary file into their respective folders.

You can then create the job binary package in Data Processing. Set the Storage type to Swift, and the URL to the job binary path in Swift. Swift also requires your Nectar OpenStack username/password for authentication. Alternatively, the Internal Storage type can be selected to store the uploaded job binary file in the Data Processing database.

At the same time, the data input and output must be assigned. Data Processing uses the Data Source concept to implement this, so you have to create an input data source and an output data source. Please note that the output data source path must NOT already exist, otherwise the job execution will fail.
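
Because a job fails when its output path already exists, it can help to generate a fresh output location for every run. The Python sketch below builds data source URLs under the assumption that they take the form swift://<container>/<path>; the container and folder names are hypothetical, so check the Sahara documentation for the URL form your deployment expects.

```python
import uuid


def swift_url(container: str, *parts: str) -> str:
    # Join a container and path segments into an assumed swift:// URL.
    path = "/".join(part.strip("/") for part in parts)
    return f"swift://{container}/{path}"


def fresh_output_url(container: str, prefix: str = "output") -> str:
    # Append a random suffix so the output path differs on every run.
    return swift_url(container, prefix, uuid.uuid4().hex)


# Hypothetical container and object names, for illustration only.
input_url = swift_url("wordcount", "input", "shakespeare.txt")
output_url = fresh_output_url("wordcount")
print(input_url)
print(output_url)
```

Using a random suffix avoids having to delete the previous output before each run, at the cost of accumulating old results that need periodic clean-up.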

After both the data (Data Source) and application logic (Job Binary) are ready, the Job Template can be created. The job binary you just created is the main binary. If needed, other libraries can be entered in the Libs tab.

It is now time to launch the job by clicking the Launch On Existing Cluster button. 

The job Type should be Spark and additional parameters should be provided in the Configure tab, as indicated in https://github.com/openstack/sahara-tests/tree/master/sahara_tests/scenario/defaults/edp-examples/edp-spark 

Wait until the job execution is finished and the results are written to the output path defined in the output data source. The Status should change to Succeeded on the right side of the Jobs toolbar.

Launching a Data Processing Cluster via the CLI

This section is under construction.

Alternatively, cluster creation and job execution can be performed using the command-line interface (CLI). All Data Processing functions are under the openstack dataprocessing command.

Check the available images and the supported plugins and their versions:

~  » openstack dataprocessing image list
| Name                                    | Id                                   | Username   | Tags           |
| Sahara Vanilla 2.7.1 (Ubuntu 16.04)     | b29c6513-ea93-41a9-9b3f-2d7f36626681 | ubuntu     | 2.7.1, vanilla |
| Sahara Spark 2.1.0                      | 902a2c9a-e9b1-43f6-a18d-99dbe95ca2b9 | ubuntu     | 2.1.0, spark   |
| Sahara Spark 1.3.1                      | 6e4f8bd8-e2f7-4ece-8e9e-89d4daf5754a | ubuntu     | 1.3.1, spark   |
| Sahara Spark 1.6.0                      | 012d18ac-90c5-451a-8e00-699aa8a23e1c | ubuntu     | 1.6.0, spark   |

~  » openstack dataprocessing plugin list
| Name    | Versions                    |
| vanilla | 2.7.1                       |
| spark   | 1.3.1, 1.6.0, 2.1.0         |

Create the Node Group Templates, one for the master and one for the worker. Floating IPs are currently not used:

~  »  openstack dataprocessing node group template create \
--name vanilla-default-master --plugin vanilla \
--plugin-version <plugin_version> --processes namenode resourcemanager \
--flavor 2 --auto-security-group

~  »  openstack dataprocessing node group template create \
--name vanilla-default-worker --plugin vanilla \
--plugin-version <plugin_version> --processes datanode nodemanager \
--flavor 2 --auto-security-group

Create Cluster Template:

~  »  openstack dataprocessing cluster template create \
--name vanilla-default-cluster \
--node-groups vanilla-default-master:1 vanilla-default-worker:3
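
Since the number of instances is subject to quota, it can be useful to total the counts in a --node-groups specification before creating the template. The Python sketch below parses the <template-name>:<count> form used in the command above; the helper name and the quota value are hypothetical.

```python
from typing import Dict, List


def parse_node_groups(specs: List[str]) -> Dict[str, int]:
    # Parse "<template-name>:<count>" strings into a name -> count mapping.
    groups: Dict[str, int] = {}
    for spec in specs:
        name, sep, count = spec.rpartition(":")
        if not sep or not name or not count.isdigit() or int(count) < 1:
            raise ValueError(f"expected <template-name>:<count>, got {spec!r}")
        groups[name] = int(count)
    return groups


groups = parse_node_groups(["vanilla-default-master:1", "vanilla-default-worker:3"])
total = sum(groups.values())

INSTANCE_QUOTA = 10  # hypothetical value; use your project's actual quota
print(total, total <= INSTANCE_QUOTA)
```

Splitting on the last colon keeps the parser tolerant of template names, while malformed entries are rejected early rather than at launch time.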

Launch the Cluster:

~  »  openstack dataprocessing cluster create --name my-cluster-1 \
--cluster-template vanilla-default-cluster --user-keypair my_stack \
--neutron-network my_network --image sahara-vanilla-latest-ubuntu

Launching a Data Processing Cluster via a Heat Template

This section is under construction.

When launching clusters, Data Processing uses the OpenStack orchestration engine Heat to create and manage the underlying resources, although these are hidden from the user. Example Heat templates are available at: https://github.com/NeCTAR-RC/heat-templates/tree/master/ocata/sahara_cluster