Deploying a Hadoop stack to a cloud using Brooklyn and Ambari

Deploying and managing a Hadoop stack can be a daunting task.  The Hadoop ecosystem contains a huge number of services, and deploying them so that they are correctly wired up to each other can be difficult and time-consuming.  Apache Ambari aims to make this much easier, and at Cloudsoft we have started adding functionality to Apache Brooklyn that builds on what Ambari provides.

Apache Ambari aims to make it easy to provision, manage and monitor Apache Hadoop clusters.

Apache Brooklyn is an application blueprinting and management system which supports a wide range of software and services in the cloud.

By combining the two we can simplify the deployment process further, especially when deploying to a cloud. This post will walk through the process of spinning up a Hadoop cluster quickly and easily.

Brooklyn Installation

To start with, you need an Apache Brooklyn installation.  If you haven’t got one, there are instructions here: https://brooklyn.incubator.apache.org/v/latest/start/running.html

If Brooklyn is currently running, stop it (Ctrl-C).

Then download brooklyn-ambari-0.1-SNAPSHOT.jar and copy it to <brooklyn_home>/lib/dropins/, which will make it available on the Brooklyn classpath.

Launch Brooklyn (<brooklyn_home>/bin/brooklyn launch)
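
The installation steps above can be scripted.  The following is a minimal sketch, assuming the jar has already been downloaded; install_ambari_plugin is a hypothetical helper name, not part of Brooklyn, and the dropins directory is passed in as an argument rather than hard-coded.

```shell
#!/bin/sh
# Sketch: copy the brooklyn-ambari jar into Brooklyn's dropins directory.
# install_ambari_plugin is a hypothetical helper, not part of Brooklyn.
#   $1 = path to the downloaded jar
#   $2 = Brooklyn dropins directory
install_ambari_plugin() {
  jar="$1"
  dropins="$2"
  if [ ! -f "$jar" ]; then
    echo "jar not found: $jar" >&2
    return 1
  fi
  if [ ! -d "$dropins" ]; then
    echo "dropins directory not found: $dropins" >&2
    return 1
  fi
  # Brooklyn picks up jars in this directory the next time it is launched.
  cp "$jar" "$dropins/"
}
```

After stopping Brooklyn, call the function with the path to the downloaded jar and the dropins directory of your installation, then relaunch with <brooklyn_home>/bin/brooklyn launch.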

Deploy Hadoop via Ambari

Brooklyn uses YAML blueprints to describe deployments.  The following blueprint will deploy Ambari to a cluster of 4 machines (1 Ambari server plus 3 agents) on AWS EC2, and then deploy the listed services to those machines.

name: Ambari driven cloud installation
location: 
    jclouds:aws-ec2:
        region: eu-west-1
        identity: <your aws identity>
        credential: <your aws credential>
        osFamily: ubuntu
        osVersionRegex: 12.*
        minRam: 8192
services:
- type: io.brooklyn.ambari.AmbariCluster
  securityGroup: <the name of a security group exposing 8080>
  initialSize: 3
  services:
  - GANGLIA
  - HDFS
  - MAPREDUCE2
  - NAGIOS
  - YARN
  - ZOOKEEPER

Save this to a file called simple-ambari-hdp.yaml, editing the identity, credential, and security group.  Putting all the VMs in the same security group will let all the services communicate internally, but you will need to set up an inbound rule for at least port 8080.  If you’d rather deploy to a cloud provider other than AWS, see the locations documentation here: https://brooklyn.incubator.apache.org/v/latest/ops/brooklyn_properties.html#locations.

Aside: Although you can’t test Ambari using a localhost deployment, you could use a network of virtual machines.  The simplest way to do this would be to use the ambari-vagrant project here: https://brooklyn.incubator.apache.org/v/latest/ops/brooklyn_properties.html#locations.  You could then treat these VMs as a BYON (bring your own nodes) location within Brooklyn.
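
If you take the virtual machine route, the location section of the blueprint would change to a BYON location.  The sketch below writes such a section to a file.  It assumes four VMs reachable over SSH (one for the Ambari server and three for the agents); the IP addresses, user and password are placeholders rather than values taken from the ambari-vagrant project, and write_byon_location is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: write a BYON location section to the file named in $1.
# Hosts, user and password are placeholders; replace them with your VMs' details.
write_byon_location() {
  cat > "$1" <<'EOF'
location:
  byon:
    user: vagrant
    password: vagrant
    hosts:
    - 192.168.64.101
    - 192.168.64.102
    - 192.168.64.103
    - 192.168.64.104
EOF
}
```

The generated section would replace the jclouds:aws-ec2 location in the blueprint; the securityGroup setting is specific to the cloud deployment and would probably not apply to local VMs.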

Assuming Brooklyn is running locally on port 8081, you can deploy the application with the following command:

curl http://127.0.0.1:8081/v1/applications --data-binary @/path/to/simple-ambari-hdp.yaml

If you browse to http://localhost:8081/#v1/applications you will now be able to watch as Brooklyn initialises the VMs and installs Ambari on them.  It will then instruct Ambari to install the Hadoop services listed.  Currently Brooklyn uses the Ambari recommendations API to decide on the service configuration.
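
Progress can also be checked from the command line.  The following is a minimal sketch, assuming Brooklyn is reachable at the same local address without authentication and that each application summary returned by the REST API carries a "status" field; the helper names are hypothetical, and a JSON-aware tool such as jq would be more robust than the grep and sed used here.

```shell
#!/bin/sh
# Sketch: print the status of every application Brooklyn knows about.
BROOKLYN_URL="${BROOKLYN_URL:-http://127.0.0.1:8081}"

# Extract every "status" value from the JSON on stdin, one per line.
# Assumption: statuses appear as "status":"VALUE" in the response.
extract_statuses() {
  grep -o '"status"[[:space:]]*:[[:space:]]*"[^"]*"' | sed 's/.*"\([^"]*\)"$/\1/'
}

# Fetch the application list and print each application's status.
list_app_statuses() {
  curl -s "$BROOKLYN_URL/v1/applications" | extract_statuses
}
```

Running list_app_statuses every minute or so shows when the application moves from starting to running.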

Once Ambari has been installed and started you should see something like the following screenshot:

The green traffic lights indicate that the Ambari Server and Agents were installed without any problems.

Click on “Ambari Server” and then open the Sensors tab near the top of the page.

Here you will see information about the VM instance and the Ambari server process.  Find “main.uri” and hover over its row.  A small arrow will appear.  Hovering over the arrow will display the option “Open”, which will open the Ambari server web interface in a new tab.
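
The same sensor can be read through Brooklyn’s REST API.  This is a sketch only: it assumes Brooklyn is running locally on port 8081 without authentication, and the application and entity ids (visible in the Brooklyn web console) must be supplied by you.  The helper names are hypothetical.

```shell
#!/bin/sh
# Sketch: read a sensor value from a Brooklyn entity over REST.
BROOKLYN_URL="${BROOKLYN_URL:-http://127.0.0.1:8081}"

# Build the URL of a sensor.
#   $1 = application id, $2 = entity id, $3 = sensor name
sensor_url() {
  printf '%s/v1/applications/%s/entities/%s/sensors/%s\n' "$BROOKLYN_URL" "$1" "$2" "$3"
}

# Fetch the current value of a sensor, for example main.uri.
read_sensor() {
  curl -s "$(sensor_url "$1" "$2" "$3")"
}
```

For example, read_sensor <application id> <ambari server entity id> main.uri should return the address of the Ambari web interface.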

Log in to the Ambari server web interface with the username/password admin/admin.  You should see that a cluster called Cluster1 has been created and that one operation is in progress.  If you investigate the current operation you’ll see that “Install and start all services” is running.  This will take a few minutes, but eventually the dashboard should show that your Hadoop cluster is up and running.
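
If you prefer the command line, Ambari’s REST API reports the state of each service.  The sketch below assumes the Ambari server listens on port 8080 (the port opened in the security group), that the cluster is named Cluster1 and that the default admin/admin credentials are still in place; the host is a placeholder you must supply and the helper names are hypothetical.

```shell
#!/bin/sh
# Sketch: check whether every service in the Ambari cluster has started.

# Succeeds only if the JSON on stdin contains at least one "state" field
# and every "state" value is STARTED.
all_services_started() {
  states=$(grep -o '"state"[[:space:]]*:[[:space:]]*"[^"]*"' || true)
  [ -n "$states" ] || return 1
  ! printf '%s\n' "$states" | grep -qv '"STARTED"$'
}

# Query the state of each service in Cluster1.
#   $1 = Ambari server host name or IP address
fetch_service_states() {
  curl -s -u admin:admin \
    "http://$1:8080/api/v1/clusters/Cluster1/services?fields=ServiceInfo/state"
}
```

For example, fetch_service_states <ambari host> | all_services_started && echo "cluster is up" prints the message once installation has finished.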

You can stop your application, and terminate all its VMs, by navigating to the Effectors tab of the main application and clicking the stop effector.  This will ensure that all resources are removed from AWS.
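
The stop effector can also be invoked through Brooklyn’s REST API.  This is a sketch under the same assumptions as the deploy command (Brooklyn running locally on port 8081, no authentication); it further assumes that the application’s root entity id is the same as its application id.  The helper names are hypothetical.

```shell
#!/bin/sh
# Sketch: invoke the stop effector on a Brooklyn application over REST.
BROOKLYN_URL="${BROOKLYN_URL:-http://127.0.0.1:8081}"

# Build the URL for invoking an effector on an entity.
#   $1 = application id, $2 = entity id, $3 = effector name
effector_url() {
  printf '%s/v1/applications/%s/entities/%s/effectors/%s\n' "$BROOKLYN_URL" "$1" "$2" "$3"
}

# Stop the application identified by $1, passing no effector parameters.
# Assumption: the application's root entity id equals its application id.
stop_application() {
  curl -s -X POST -H 'Content-Type: application/json' -d '{}' \
    "$(effector_url "$1" "$1" stop)"
}
```

As with the web console, check in the AWS console afterwards that the VMs have actually been terminated.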

Roadmap

So far we have implemented only the simplest functionality.  We hope it is already useful, but there is a long list of features we would still like to add.

Firstly, we need to extend the parameters available in our YAML description, for example by adding a parameter that allows an Ambari blueprint to be specified.  We also need further parameters to allow us to wire up non-Ambari services, e.g. MySQL.

Similarly, we need to add a range of sensors that expose useful data for us to implement policies against.  For example, an HDFS capacity sensor would allow a policy that automatically creates extra VMs or adds extra storage when headroom drops below a specific threshold.

Effectors that allow us to change the size of the cluster would also be useful (and vital to support the above policies).  This would make it easy to extend a cluster by adding extra nodes and instructing Ambari to rebalance services.

Finally

Hopefully this article was a useful demonstration of how to quickly set up a full Hadoop stack on a cluster of machines.  If you have any thoughts or questions, please come and visit us on IRC at #brooklyncentral or contact us via our mailing lists: http://brooklyn.incubator.apache.org/community/mailing-lists.html

You can get the latest code here https://github.com/brooklyncentral/brooklyn-ambari.