JujuBigData Documentation

The jujubigdata Python library is a collection of functions and classes that simplify the development of Juju Charms for Big Data applications. It includes utilities for:

  • Interacting with Apache Hadoop
  • Connecting to the core Apache Hadoop platform bundle
  • Reading and writing configuration in various formats
  • Managing distribution-specific configuration in a generalized, maintainable fashion

Apache Hadoop Core Platform Bundle

The platform bundle deploys the core Apache Hadoop platform, providing a basic Hadoop deployment to use directly, as well as endpoints for connecting additional components, such as Apache Hive, Apache Pig, and Hue. It also serves as a reference implementation and starting point for creating charms for vendor-specific Hadoop platform distributions, such as Cloudera or Hortonworks.

Deploying the Bundle

Deploying the core platform bundle is as easy as:

juju quickstart apache-core-batch-processing

Connecting Components

Once the core platform bundle is deployed, you can add additional components, such as Apache Hive:

juju deploy cs:trusty/apache-hive
juju add-relation apache-hive plugin

Currently available components include Apache Hive, Apache Pig, and Hue.

Charming New Components

New components can be added to the ecosystem using one of the following two relations on the apache-hadoop-plugin endpoint charm:

  • hadoop-rest: This interface is intended for components that interact with Hadoop only via the REST API, such as Hue. Charms using this interface are provided with the REST API endpoint information for both the NameNode and the ResourceManager. The details of the protocol used by this interface are documented in the helper class, which is the recommended way to use this interface.
  • hadoop-plugin: This interface is intended for components that interact with Hadoop via either the Java API libraries or the command-line interface (CLI). Charms using this interface will have a JRE installed, the Hadoop API Java libraries installed, the Hadoop configuration managed in /etc/hadoop/conf, and the environment configured in /etc/environment. The endpoint will ensure that the distribution, version, Java, etc. are all compatible to ensure a properly functioning Hadoop ecosystem. The details of the protocol used by this interface are documented in the helper class, which is the recommended way to use this interface, as sketched below.
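
For illustration, a consuming charm's hook might use the HadoopPlugin helper to guard its work until the platform reports ready. This is a minimal sketch; the hook body and the job jar name are hypothetical:

from jujubigdata.relations import HadoopPlugin
from jujubigdata.utils import run_as

hadoop = HadoopPlugin()

if hadoop.is_ready():
    # Both hdfs-ready and yarn-ready have been set on the relation,
    # so the Hadoop CLI and Java libraries are safe to use.
    run_as('ubuntu', 'hadoop', 'jar', 'my-job.jar')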

Replacing the Core

As long as it supports the same interfaces described above, the core platform can be replaced with a different distribution. The recommended way to create charms for another distribution is to use the core platform charms as a base and modify the dist.yaml and resources.yaml.

API Documentation

jujubigdata.relations

  • jujubigdata.relations.DataNode: Relation which communicates DataNode info back to NameNodes.
  • jujubigdata.relations.EtcHostsRelation
  • jujubigdata.relations.FlumeAgent
  • jujubigdata.relations.Ganglia
  • jujubigdata.relations.HBase
  • jujubigdata.relations.HadoopPlugin: Helper class that manages the hadoop-plugin interface; the recommended way of interacting with the endpoint via this interface.
  • jujubigdata.relations.HadoopREST: Helper class that manages the hadoop-rest interface; the recommended way of interacting with the endpoint via this interface.
  • jujubigdata.relations.Hive
  • jujubigdata.relations.Kafka
  • jujubigdata.relations.MySQL
  • jujubigdata.relations.NameNode: Relation which communicates the NameNode (HDFS) connection & status info.
  • jujubigdata.relations.NameNodeMaster: Alternate NameNode relation for DataNodes.
  • jujubigdata.relations.NodeManager: Relation which communicates NodeManager info back to ResourceManagers.
  • jujubigdata.relations.ResourceManager: Relation which communicates the ResourceManager (YARN) connection & status info.
  • jujubigdata.relations.ResourceManagerMaster: Alternate ResourceManager relation for NodeManagers.
  • jujubigdata.relations.SSHRelation
  • jujubigdata.relations.SecondaryNameNode: Relation which communicates SecondaryNameNode info back to NameNodes.
  • jujubigdata.relations.Spark
  • jujubigdata.relations.SpecMatchingRelation: Relation base class that validates that the version and environment of two related charms match, to prevent interoperability issues.
  • jujubigdata.relations.Zookeeper
class jujubigdata.relations.DataNode(spec=None, *args, **kwargs)

Bases: jujubigdata.relations.SpecMatchingRelation

Relation which communicates DataNode info back to NameNodes.

provide(remote_service, all_ready)
relation_name = 'datanode'
required_keys = ['private-address', 'hostname']
class jujubigdata.relations.EtcHostsRelation(*args, **kwargs)

Bases: charmhelpers.core.charmframework.helpers.Relation

am_i_registered()
provide(remote_service, all_ready)
register_connected_hosts()
register_provided_hosts()
class jujubigdata.relations.FlumeAgent(port=None, *args, **kwargs)

Bases: charmhelpers.core.charmframework.helpers.Relation

provide(remote_service, all_ready)
relation_name = 'flume-agent'
required_keys = ['private-address', 'port']
class jujubigdata.relations.Ganglia(**kwargs)

Bases: charmhelpers.core.charmframework.helpers.Relation

host()
relation_name = 'ganglia'
required_keys = ['private-address']
class jujubigdata.relations.HBase(master=None, region=None, *args, **kwargs)

Bases: jujubigdata.relations.SSHRelation

provide(remote_service, all_ready)
relation_name = 'hbase'
required_keys = ['private-address', 'master-port', 'region-port', 'ssh-key']
class jujubigdata.relations.HadoopPlugin(hdfs_only=False, *args, **kwargs)

Bases: charmhelpers.core.charmframework.helpers.Relation

This helper class manages the hadoop-plugin interface, and is the recommended way of interacting with the endpoint via this interface.

Charms using this interface will have a JRE installed, the Hadoop API Java libraries installed, the Hadoop configuration managed in /etc/hadoop/conf, and the environment configured in /etc/environment. The endpoint will ensure that the distribution, version, Java, etc. are all compatible to ensure a properly functioning Hadoop ecosystem.

Charms using this interface can call is_ready() (or hdfs_is_ready()) to determine if this relation is ready to use.

hdfs_is_ready()

Check if the Hadoop libraries are installed and configured and HDFS is connected and ready to handle work (i.e., at least one DataNode is available).

(This is a synonym for is_ready().)

is_ready()
provide(remote_service, all_ready)

Used by the endpoint to provide the required_keys.

relation_name = 'hadoop-plugin'
required_keys = ['yarn-ready', 'hdfs-ready']

These keys will be set on the relation once everything is installed, configured, connected, and ready to receive work. They can be checked by calling is_ready(), or manually via Juju’s relation-get.

class jujubigdata.relations.HadoopREST(**kwargs)

Bases: charmhelpers.core.charmframework.helpers.Relation

This helper class manages the hadoop-rest interface, and is the recommended way of interacting with the endpoint via this interface.

Charms using this interface are provided with the API endpoint information for the NameNode, ResourceManager, and JobHistoryServer.

hdfs_port

Property containing the HDFS port, or None if not available.

hdfs_uri

Property containing the full HDFS URI, or None if not available.

historyserver_host

Property containing the HistoryServer host, or None if not available.

historyserver_port

Property containing the HistoryServer port, or None if not available.

historyserver_uri

Property containing the full JobHistoryServer API URI, or None if not available.

namenode_host

Property containing the NameNode host, or None if not available.

provide(remote_service, all_ready)

Used by the endpoint to provide the required_keys.

relation_name = 'hadoop-rest'
required_keys = ['namenode-host', 'hdfs-port', 'webhdfs-port', 'resourcemanager-host', 'resourcemanager-port', 'historyserver-host', 'historyserver-port']
resourcemanager_host

Property containing the ResourceManager host, or None if not available.

resourcemanager_port

Property containing the ResourceManager port, or None if not available.

resourcemanager_uri

Property containing the full ResourceManager API URI, or None if not available.

webhdfs_port

Property containing the WebHDFS port, or None if not available.

webhdfs_uri

Property containing the full WebHDFS URI, or None if not available.
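
For example, a consuming charm might compose a WebHDFS request from these properties. This sketch assumes webhdfs_uri has the form http://host:port; the LISTSTATUS operation is part of the standard WebHDFS REST API, while the target path is illustrative:

import urllib2

from jujubigdata.relations import HadoopREST

rest = HadoopREST()

if rest.webhdfs_uri is not None:
    # e.g. http://<namenode-host>:<webhdfs-port>/webhdfs/v1/tmp?op=LISTSTATUS
    url = '%s/webhdfs/v1/tmp?op=LISTSTATUS' % rest.webhdfs_uri
    listing = urllib2.urlopen(url).read()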

class jujubigdata.relations.Hive(port=None, *args, **kwargs)

Bases: charmhelpers.core.charmframework.helpers.Relation

provide(remote_service, all_ready)
relation_name = 'hive'
required_keys = ['private-address', 'port', 'ready']
class jujubigdata.relations.Kafka(port=None, *args, **kwargs)

Bases: charmhelpers.core.charmframework.helpers.Relation

provide(remote_service, all_ready)
relation_name = 'kafka'
required_keys = ['private-address', 'port']
class jujubigdata.relations.MySQL(**kwargs)

Bases: charmhelpers.core.charmframework.helpers.Relation

relation_name = 'db'
required_keys = ['host', 'database', 'user', 'password']
class jujubigdata.relations.NameNode(spec=None, port=None, webhdfs_port=None, *args, **kwargs)

Bases: jujubigdata.relations.SpecMatchingRelation, jujubigdata.relations.EtcHostsRelation

Relation which communicates the NameNode (HDFS) connection & status info.

This is the relation that clients should use.

has_slave()

Check if the NameNode has any DataNode slaves registered. This reflects whether HDFS is ready without having to wait for utils.wait_for_hdfs.

is_ready()
provide(remote_service, all_ready)
relation_name = 'namenode'
require_slave = True
required_keys = ['private-address', 'has_slave', 'port', 'webhdfs-port']
class jujubigdata.relations.NameNodeMaster(spec=None, port=None, webhdfs_port=None, *args, **kwargs)

Bases: jujubigdata.relations.NameNode, jujubigdata.relations.SSHRelation

Alternate NameNode relation for DataNodes.

relation_name = 'datanode'
require_slave = False
ssh_user = 'hdfs'
class jujubigdata.relations.NodeManager(**kwargs)

Bases: charmhelpers.core.charmframework.helpers.Relation

Relation which communicates NodeManager info back to ResourceManagers.

provide(remote_service, all_ready)
relation_name = 'nodemanager'
required_keys = ['private-address', 'hostname']
class jujubigdata.relations.ResourceManager(spec=None, port=None, historyserver_http=None, historyserver_ipc=None, *args, **kwargs)

Bases: jujubigdata.relations.SpecMatchingRelation, jujubigdata.relations.EtcHostsRelation

Relation which communicates the ResourceManager (YARN) connection & status info.

This is the relation that clients should use.

has_slave()

Check if the ResourceManager has any NodeManager slaves registered.

is_ready()
provide(remote_service, all_ready)
relation_name = 'resourcemanager'
require_slave = True
required_keys = ['private-address', 'has_slave', 'historyserver-http', 'historyserver-ipc', 'port']
class jujubigdata.relations.ResourceManagerMaster(spec=None, port=None, historyserver_http=None, historyserver_ipc=None, *args, **kwargs)

Bases: jujubigdata.relations.ResourceManager, jujubigdata.relations.SSHRelation

Alternate ResourceManager relation for NodeManagers.

relation_name = 'nodemanager'
require_slave = False
ssh_user = 'yarn'
class jujubigdata.relations.SSHRelation(*args, **kwargs)

Bases: charmhelpers.core.charmframework.helpers.Relation

install_ssh_keys()
provide(remote_service, all_ready)
ssh_user = 'ubuntu'
class jujubigdata.relations.SecondaryNameNode(spec=None, port=None, *args, **kwargs)

Bases: jujubigdata.relations.SpecMatchingRelation

Relation which communicates SecondaryNameNode info back to NameNodes.

provide(remote_service, all_ready)
relation_name = 'secondary'
required_keys = ['private-address', 'hostname', 'port']
class jujubigdata.relations.Spark(**kwargs)

Bases: charmhelpers.core.charmframework.helpers.Relation

provide(remote_service, all_ready)
relation_name = 'spark'
required_keys = ['ready']
class jujubigdata.relations.SpecMatchingRelation(spec=None, *args, **kwargs)

Bases: charmhelpers.core.charmframework.helpers.Relation

Relation base class that validates that the version and environment of two related charms match, to prevent interoperability issues.

This class adds a spec key to the required_keys and populates it in provide(). The spec value must be passed in to __init__().

The spec should be a mapping (or a callback that returns a mapping) which describes all aspects of the charm’s environment or configuration that might affect its interoperability with the remote charm. The charm on the requires side of the relation will verify that all of the keys in its spec are present and exactly equal on the provides side of the relation. This does mean that the requires side can be a subset of the provides side, but not the other way around.

An example spec might be:

{
    'arch': 'x86_64',
    'vendor': 'apache',
    'version': '2.4',
}
filtered_data(remote_service=None)
is_ready()

Validate the spec data from the connected units to ensure that it matches the local spec.

provide(remote_service, all_ready)

Provide the spec data to the remote service.

Subclasses must either delegate to this method (e.g., via super()) or include 'spec': json.dumps(self.spec) in the provided data themselves.

spec
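
A hypothetical subclass sketch; the relation name, keys, and spec contents are illustrative. The spec is passed here as a callback so that it can reflect state established earlier in the same hook:

from jujubigdata.relations import SpecMatchingRelation

class MyComponent(SpecMatchingRelation):
    relation_name = 'my-component'          # hypothetical relation name
    required_keys = ['private-address', 'port']

def build_spec():
    # Describe everything that affects interoperability with the remote charm.
    return {
        'arch': 'x86_64',
        'vendor': 'apache',
        'version': '2.4',
    }

relation = MyComponent(spec=build_spec)
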
class jujubigdata.relations.Zookeeper(port=None, *args, **kwargs)

Bases: charmhelpers.core.charmframework.helpers.Relation

provide(remote_service, all_ready)
relation_name = 'zookeeper'
required_keys = ['private-address', 'port']

jujubigdata.handlers

  • jujubigdata.handlers.HDFS
  • jujubigdata.handlers.HadoopBase
  • jujubigdata.handlers.YARN
class jujubigdata.handlers.HDFS(hadoop_base)

Bases: object

configure_client()
configure_datanode(host=None, port=None)
configure_hdfs_base(host, port)
configure_namenode(secondary_host=None, secondary_port=None)
configure_secondarynamenode(host=None, port=None)

Configure the Secondary Namenode when the apache-hadoop-hdfs-secondary charm is deployed and related to apache-hadoop-hdfs-master.

The only purpose of the SecondaryNameNode is to perform periodic checkpoints: it periodically downloads the current NameNode image and edits log files, merges them into a new image, and uploads the new image back to the (primary and only) NameNode.

create_hdfs_dirs()
format_namenode()
register_slaves(slaves=None)
start_datanode()
start_namenode()
start_secondarynamenode()
stop_datanode()
stop_namenode()
stop_secondarynamenode()
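
A sketch of how a NameNode charm might drive this handler; the call ordering is illustrative, inferred from the method names above rather than prescribed by the library:

from jujubigdata.handlers import HadoopBase, HDFS
from jujubigdata.utils import DistConfig

# Install the base Hadoop platform as described by the charm's dist.yaml
hadoop = HadoopBase(DistConfig(filename='dist.yaml'))
hadoop.install()

# Configure, format, and start HDFS on the master
hdfs = HDFS(hadoop)
hdfs.configure_namenode()
hdfs.format_namenode()
hdfs.start_namenode()
hdfs.create_hdfs_dirs()
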
class jujubigdata.handlers.HadoopBase(dist_config)

Bases: object

configure_hadoop()
configure_hosts_file()

Add the unit’s private-address to /etc/hosts to ensure that Java can resolve the hostname of the server to its real IP address. We derive our hostname from the unit_id, replacing / with -.

install(force=False)
install_base_packages()
install_hadoop()
install_java()

Run the java-installer resource to install Java and determine the JAVA_HOME and Java version.

The java-installer must be idempotent and its only output (on stdout) should be two lines: the JAVA_HOME path, and the Java version, respectively.

If there is an error installing Java, the installer should exit with a non-zero exit code.
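
A hypothetical java-installer honoring this contract: idempotent, exactly two lines on stdout (the JAVA_HOME path, then the version), and a non-zero exit on failure. The package name and paths are illustrative:

#!/usr/bin/env python
import subprocess
import sys

try:
    # apt-get install is effectively idempotent: a no-op when already installed
    subprocess.check_call(['apt-get', 'install', '-qy', 'openjdk-7-jre-headless'])
except subprocess.CalledProcessError:
    sys.exit(1)

print('/usr/lib/jvm/java-7-openjdk-amd64/jre')  # line 1: JAVA_HOME
print('1.7.0')                                  # line 2: Java version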

is_installed()
register_slaves(slaves)

Add slaves to an HDFS or YARN master, as determined by the relation name.

Parameters: relation (str) – ‘datanode’ for registering HDFS slaves; ‘nodemanager’ for registering YARN slaves.
run(user, command, *args, **kwargs)

Run a Hadoop command as the given user.

Parameters:
  • command (str) – Command to run, prefixed with bin/ or sbin/
  • args (list) – Additional args to pass to the command
setup_hadoop_config()
spec()

Generate the full spec for keeping charms in sync.

NB: This has to be a callback instead of a plain property because it is passed to the relations during construction of the Manager but needs to properly reflect the Java version in the same hook invocation that installs Java.

class jujubigdata.handlers.YARN(hadoop_base)

Bases: object

configure_client(host=None, port=None, history_http=None, history_ipc=None)
configure_jobhistory()
configure_nodemanager(host=None, port=None, history_http=None, history_ipc=None)
configure_resourcemanager()
configure_yarn_base(host, port, history_http, history_ipc)
install_demo()
register_slaves(slaves=None)
start_jobhistory()
start_nodemanager()
start_resourcemanager()
stop_jobhistory()
stop_nodemanager()
stop_resourcemanager()
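
A parallel sketch for a ResourceManager charm, mirroring the HDFS example above; the ordering is again illustrative:

from jujubigdata.handlers import HadoopBase, YARN
from jujubigdata.utils import DistConfig

hadoop = HadoopBase(DistConfig(filename='dist.yaml'))
hadoop.install()

yarn = YARN(hadoop)
yarn.configure_resourcemanager()
yarn.configure_jobhistory()
yarn.start_resourcemanager()
yarn.start_jobhistory()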

jujubigdata.utils

  • jujubigdata.utils.DistConfig: This class processes distribution-specific configuration options.
  • jujubigdata.utils.TimeoutError
  • jujubigdata.utils.cpu_arch
  • jujubigdata.utils.disable_firewall: Temporarily disable the firewall, via ufw.
  • jujubigdata.utils.environment_edit_in_place: Edit the /etc/environment file in-place.
  • jujubigdata.utils.get_kv_hosts
  • jujubigdata.utils.get_ssh_key
  • jujubigdata.utils.initialize_kv_host
  • jujubigdata.utils.install_ssh_key
  • jujubigdata.utils.jps: Get PIDs for named Java processes, for any user.
  • jujubigdata.utils.manage_etc_hosts: Manage the /etc/hosts file from the host entries stored in unitdata.kv() by the various relations.
  • jujubigdata.utils.normalize_strbool
  • jujubigdata.utils.re_edit_in_place: Perform a set of in-place edits to a file.
  • jujubigdata.utils.read_etc_env: Read /etc/environment and return it, along with proxy configuration, as a dict.
  • jujubigdata.utils.resolve_private_address
  • jujubigdata.utils.run_as: Run a command as a particular user, using /etc/environment and optionally capturing and returning the output.
  • jujubigdata.utils.strtobool
  • jujubigdata.utils.update_etc_hosts: Update /etc/hosts given a mapping of managed IP / hostname pairs.
  • jujubigdata.utils.update_kv_host
  • jujubigdata.utils.verify_resources: Predicate for specific named resources, with useful rendering in the logs.
  • jujubigdata.utils.wait_for_hdfs
  • jujubigdata.utils.wait_for_jps
  • jujubigdata.utils.xmlpropmap_edit_in_place: Edit an XML property map (configuration) file in-place.
class jujubigdata.utils.DistConfig(filename='dist.yaml', required_keys=None)

Bases: object

This class processes distribution-specific configuration options.

Some configuration options are specific to the Hadoop distribution (e.g., Apache, Hortonworks, or MapR). These options are immutable and must not change throughout the charm deployment lifecycle.

Helper methods are provided for keys that require action. Presently, this includes adding/removing directories, dependent packages, and groups/users. Other required keys may be listed when instantiating this class, but this will only validate that those keys exist in the yaml; it will not provide any helper functionality for unknown keys.

Parameters:
  • filename (str) – File to process (default dist.yaml)
  • required_keys (list) – A list of keys required to be present in the yaml

Example dist.yaml with supported keys:

vendor: '<name>'
hadoop_version: '<version>'
packages:
    - '<package 1>'
    - '<package 2>'
groups:
    - '<name>'
users:
    <user 1>:
        groups: ['<primary>', '<group>', '<group>']
    <user 2>:
        groups: ['<primary>']
dirs:
    <dir 1>:
        path: '</path/to/dir>'
        perms: 0777
    <dir 2>:
        path: '{config[<option>]}'  # value comes from config option
        owner: '<user>'
        group: '<group>'
        perms: 0755
ports:
    <name1>:
        port: <port>
        exposed_on: <service>  # optional
    <name2>:
        port: <port>
        exposed_on: <service>  # optional
add_dirs()
add_packages()
add_users()
exposed_ports(service)
path(key)
port(key)
remove_dirs()
remove_packages()
remove_users()
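
A minimal usage sketch for an install hook; the required_keys and the 'namenode' port key are illustrative and must match the charm's own dist.yaml:

from jujubigdata.utils import DistConfig

dist = DistConfig(filename='dist.yaml',
                  required_keys=['vendor', 'packages', 'users', 'dirs', 'ports'])
dist.add_packages()  # install the dependent packages listed in the yaml
dist.add_users()     # create the listed groups and users
dist.add_dirs()      # create the listed dirs with the given owner/group/perms

namenode_port = dist.port('namenode')  # look up a named port from the yaml
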
exception jujubigdata.utils.TimeoutError

Bases: exceptions.Exception

jujubigdata.utils.cpu_arch()
jujubigdata.utils.disable_firewall(*args, **kwds)

Temporarily disable the firewall, via ufw.

jujubigdata.utils.environment_edit_in_place(*args, **kwds)

Edit the /etc/environment file in-place.

There is no standard definition for the format of /etc/environment, but the convention, which this helper supports, is simple key-value pairs, separated by =, with optionally quoted values.

Note that this helper will implicitly quote all values.

Also note that the file is not locked during the edits.
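
A usage sketch, assuming the context manager yields a dict-like view of the file's key-value pairs; the values written are illustrative:

from jujubigdata.utils import environment_edit_in_place

with environment_edit_in_place() as env:
    env['JAVA_HOME'] = '/usr/lib/jvm/java-7-openjdk-amd64/jre'
    env['HADOOP_CONF_DIR'] = '/etc/hadoop/conf'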

jujubigdata.utils.get_kv_hosts()
jujubigdata.utils.get_ssh_key(user)
jujubigdata.utils.initialize_kv_host()
jujubigdata.utils.install_ssh_key(user, ssh_key)
jujubigdata.utils.jps(name)

Get PIDs for named Java processes, for any user.

jujubigdata.utils.manage_etc_hosts()

Manage the /etc/hosts file from the host entries stored in unitdata.kv() by the various relations.

jujubigdata.utils.normalize_strbool(value)
jujubigdata.utils.re_edit_in_place(filename, subs)

Perform a set of in-place edits to a file.

Parameters:
  • filename (str) – Name of file to edit
  • subs (dict) – Mapping of patterns to replacement strings
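
A usage sketch, assuming the patterns are regular expressions applied line by line; the file path and substitution are illustrative:

from jujubigdata.utils import re_edit_in_place

re_edit_in_place('/etc/hadoop/conf/hadoop-env.sh', {
    r'export JAVA_HOME=.*': 'export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64/jre',
})
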
jujubigdata.utils.read_etc_env()

Read /etc/environment and return it, along with proxy configuration, as a dict.

jujubigdata.utils.resolve_private_address(addr)
jujubigdata.utils.run_as(user, command, *args, **kwargs)

Run a command as a particular user, using /etc/environment and optionally capturing and returning the output.

Raises subprocess.CalledProcessError if command fails.

Parameters:
  • user (str) – Username to run command as
  • command (str) – Command to run
  • args (list) – Additional args to pass to command
  • env (dict) – Additional env variables (will be merged with /etc/environment)
  • capture_output (bool) – Capture and return output (default: False)
  • input (str) – Stdin for command
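
A sketch of the two common call styles; the commands themselves are illustrative:

from jujubigdata.utils import run_as

# Fire and forget; raises subprocess.CalledProcessError on failure:
run_as('hdfs', 'hdfs', 'dfs', '-mkdir', '-p', '/user/ubuntu')

# Capture and inspect output:
report = run_as('hdfs', 'hdfs', 'dfsadmin', '-report', capture_output=True)
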
jujubigdata.utils.strtobool(value)
jujubigdata.utils.update_etc_hosts(ips_to_names)

Update /etc/hosts given a mapping of managed IP / hostname pairs.

Parameters: ips_to_names (dict) – mapping of IPs to hostnames (must be one-to-one)
jujubigdata.utils.update_kv_host(ip, host)
class jujubigdata.utils.verify_resources(*which)

Bases: object

Predicate for specific named resources, with useful rendering in the logs.

Parameters: *which (str) – One or more resource names to fetch & verify. Defaults to all non-optional resources.
jujubigdata.utils.wait_for_hdfs(timeout)
jujubigdata.utils.wait_for_jps(process_name, timeout)
jujubigdata.utils.xmlpropmap_edit_in_place(*args, **kwds)

Edit an XML property map (configuration) file in-place.

This helper acts as a context manager which edits an XML file of the form:

<configuration>
    <property>
        <name>property-name</name>
        <value>property-value</value>
        <description>Optional property description</description>
    </property>
    ...
</configuration>

This context manager yields a dict containing the existing name/value mappings. Properties can then be modified, added, or removed, and the changes will be reflected in the file.

Example usage:

with xmlpropmap_edit_in_place('my.xml') as props:
    props['foo'] = 'bar'
    del props['removed']

Note that the file is not locked during the edits.
