JujuBigData Documentation
The jujubigdata Python library is a collection of functions and
classes that simplify the development of Juju Charms for Big Data
applications. It includes utilities for:
- Interacting with Apache Hadoop
- Connecting to the core Apache Hadoop platform bundle
- Reading and writing configuration in various formats
- Managing distribution-specific configuration in a generalized, maintainable fashion
Apache Hadoop Core Platform Bundle
The platform bundle deploys the core Apache Hadoop platform, providing a basic Hadoop deployment to use directly, as well as endpoints for connecting additional components such as Apache Hive, Apache Pig, and Hue. It also serves as a reference implementation and starting point for creating charms for vendor-specific Hadoop platform distributions, such as Cloudera or Hortonworks.
Deploying the Bundle
Deploying the core platform bundle is as easy as:
juju quickstart apache-core-batch-processing
Connecting Components
Once the core platform bundle is deployed, you can add additional components, such as Apache Hive:
juju deploy cs:trusty/apache-hive
juju add-relation apache-hive plugin
Currently available components include Apache Hive, Apache Pig, and Hue.
Charming New Components
New components can be added to the ecosystem using one of the following two relations on the apache-hadoop-plugin endpoint charm:
- hadoop-rest: This interface is intended for components that interact with Hadoop only via the REST API, such as Hue. Charms using this interface are provided with the REST API endpoint information for both the NameNode and the ResourceManager. The details of the protocol used by this interface are documented in the helper class (HadoopREST), which is the recommended way to use this interface.
- hadoop-plugin: This interface is intended for components that interact with Hadoop via either the Java API libraries or the command-line interface (CLI). Charms using this interface will have a JRE installed, the Hadoop API Java libraries installed, the Hadoop configuration managed in /etc/hadoop/conf, and the environment configured in /etc/environment. The endpoint will ensure that the distribution, version, Java, etc. are all compatible, to ensure a properly functioning Hadoop ecosystem. The details of the protocol used by this interface are documented in the helper class (HadoopPlugin), which is the recommended way to use this interface.
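For illustration, a component charm declares which of these interfaces it uses in its metadata.yaml. The stanza below is a hypothetical sketch: the relation name hadoop is an assumed example, and only the interface name comes from this document.

```yaml
# Hypothetical metadata.yaml stanza for a component charm.
# The relation name 'hadoop' is an assumed example; the interface
# name 'hadoop-plugin' is one of the two interfaces described above.
requires:
  hadoop:
    interface: hadoop-plugin
```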
Replacing the Core
As long as it supports the same interfaces described above, the core platform
can be replaced with a different distribution. The recommended way to create
charms for another distribution is to use the core platform charms as a base
and modify dist.yaml and resources.yaml.
API Documentation
jujubigdata.relations
jujubigdata.relations.DataNode | Relation which communicates DataNode info back to NameNodes.
jujubigdata.relations.EtcHostsRelation |
jujubigdata.relations.FlumeAgent |
jujubigdata.relations.Ganglia |
jujubigdata.relations.HBase |
jujubigdata.relations.HadoopPlugin | This helper class manages the hadoop-plugin interface, and is the recommended way of interacting with the endpoint via this interface.
jujubigdata.relations.HadoopREST | This helper class manages the hadoop-rest interface, and is the recommended way of interacting with the endpoint via this interface.
jujubigdata.relations.Hive |
jujubigdata.relations.Kafka |
jujubigdata.relations.MySQL |
jujubigdata.relations.NameNode | Relation which communicates the NameNode (HDFS) connection & status info.
jujubigdata.relations.NameNodeMaster | Alternate NameNode relation for DataNodes.
jujubigdata.relations.NodeManager | Relation which communicates NodeManager info back to ResourceManagers.
jujubigdata.relations.ResourceManager | Relation which communicates the ResourceManager (YARN) connection & status info.
jujubigdata.relations.ResourceManagerMaster | Alternate ResourceManager relation for NodeManagers.
jujubigdata.relations.SSHRelation |
jujubigdata.relations.SecondaryNameNode | Relation which communicates SecondaryNameNode info back to NameNodes.
jujubigdata.relations.Spark |
jujubigdata.relations.SpecMatchingRelation | Relation base class that validates that a version and environment between two related charms match, to prevent interoperability issues.
jujubigdata.relations.Zookeeper |
class jujubigdata.relations.DataNode(spec=None, *args, **kwargs)

    Bases: jujubigdata.relations.SpecMatchingRelation

    Relation which communicates DataNode info back to NameNodes.

    provide(remote_service, all_ready)

    relation_name = 'datanode'

    required_keys = ['private-address', 'hostname']

class jujubigdata.relations.EtcHostsRelation(*args, **kwargs)

    Bases: charmhelpers.core.charmframework.helpers.Relation

    am_i_registered()

    provide(remote_service, all_ready)

    register_connected_hosts()

    register_provided_hosts()

class jujubigdata.relations.FlumeAgent(port=None, *args, **kwargs)

    Bases: charmhelpers.core.charmframework.helpers.Relation

    provide(remote_service, all_ready)

    relation_name = 'flume-agent'

    required_keys = ['private-address', 'port']

class jujubigdata.relations.Ganglia(**kwargs)

    Bases: charmhelpers.core.charmframework.helpers.Relation

    host()

    relation_name = 'ganglia'

    required_keys = ['private-address']

class jujubigdata.relations.HBase(master=None, region=None, *args, **kwargs)

    Bases: jujubigdata.relations.SSHRelation

    provide(remote_service, all_ready)

    relation_name = 'hbase'

    required_keys = ['private-address', 'master-port', 'region-port', 'ssh-key']
class jujubigdata.relations.HadoopPlugin(hdfs_only=False, *args, **kwargs)

    Bases: charmhelpers.core.charmframework.helpers.Relation

    This helper class manages the hadoop-plugin interface, and is the recommended way of interacting with the endpoint via this interface.

    Charms using this interface will have a JRE installed, the Hadoop API Java libraries installed, the Hadoop configuration managed in /etc/hadoop/conf, and the environment configured in /etc/environment. The endpoint will ensure that the distribution, version, Java, etc. are all compatible, to ensure a properly functioning Hadoop ecosystem.

    Charms using this interface can call is_ready() (or hdfs_is_ready()) to determine if this relation is ready to use.

    hdfs_is_ready()
        Check if the Hadoop libraries are installed and configured and HDFS is connected and ready to handle work (at least one DataNode available). (This is a synonym for is_ready().)

    is_ready()

    provide(remote_service, all_ready)
        Used by the endpoint to provide the required_keys.

    relation_name = 'hadoop-plugin'

    required_keys = ['yarn-ready', 'hdfs-ready']
        These keys will be set on the relation once everything is installed, configured, connected, and ready to receive work. They can be checked by calling is_ready(), or manually via Juju's relation-get.
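As a sketch of the readiness check described above (this is not the library implementation, which lives in is_ready(); the 'true' values in the sample data are assumed examples of what relation-get might report):

```python
# Minimal sketch (not the library implementation): a relation is
# considered ready once every required key has been published by
# the remote side. For HadoopPlugin, required_keys is
# ['yarn-ready', 'hdfs-ready'].
def relation_is_ready(relation_data, required_keys):
    """Return True if all required keys are present and non-empty."""
    return all(relation_data.get(key) for key in required_keys)

required = ['yarn-ready', 'hdfs-ready']
assert relation_is_ready({'yarn-ready': 'true', 'hdfs-ready': 'true'}, required)
assert not relation_is_ready({'hdfs-ready': 'true'}, required)
```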
class jujubigdata.relations.HadoopREST(**kwargs)

    Bases: charmhelpers.core.charmframework.helpers.Relation

    This helper class manages the hadoop-rest interface, and is the recommended way of interacting with the endpoint via this interface.

    Charms using this interface are provided with the API endpoint information for the NameNode, ResourceManager, and JobHistoryServer.

    hdfs_port
        Property containing the HDFS port, or None if not available.

    hdfs_uri
        Property containing the full HDFS URI, or None if not available.

    historyserver_host
        Property containing the HistoryServer host, or None if not available.

    historyserver_port
        Property containing the HistoryServer port, or None if not available.

    historyserver_uri
        Property containing the full JobHistoryServer API URI, or None if not available.

    namenode_host
        Property containing the NameNode host, or None if not available.

    provide(remote_service, all_ready)
        Used by the endpoint to provide the required_keys.

    relation_name = 'hadoop-rest'

    required_keys = ['namenode-host', 'hdfs-port', 'webhdfs-port', 'resourcemanager-host', 'resourcemanager-port', 'historyserver-host', 'historyserver-port']

    resourcemanager_host
        Property containing the ResourceManager host, or None if not available.

    resourcemanager_port
        Property containing the ResourceManager port, or None if not available.

    resourcemanager_uri
        Property containing the full ResourceManager API URI, or None if not available.

    webhdfs_port
        Property containing the WebHDFS port, or None if not available.

    webhdfs_uri
        Property containing the full WebHDFS URI, or None if not available.
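To illustrate the relationship between the host/port keys and the uri properties, here is a sketch of how a full URI might be composed; this is assumed behavior, not the library's code, and the hdfs:// scheme and example host/port are assumptions based on standard Hadoop URIs:

```python
# Sketch (assumed behavior) of how a property like hdfs_uri could be
# composed from the relation keys 'namenode-host' and 'hdfs-port'.
# The 'hdfs://' scheme is an assumption based on standard Hadoop URIs.
def hdfs_uri(namenode_host, hdfs_port):
    """Return the full HDFS URI, or None if either part is unavailable."""
    if namenode_host and hdfs_port:
        return 'hdfs://{}:{}'.format(namenode_host, hdfs_port)
    return None

assert hdfs_uri('namenode-0', 8020) == 'hdfs://namenode-0:8020'
assert hdfs_uri(None, 8020) is None
```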
class jujubigdata.relations.Hive(port=None, *args, **kwargs)

    Bases: charmhelpers.core.charmframework.helpers.Relation

    provide(remote_service, all_ready)

    relation_name = 'hive'

    required_keys = ['private-address', 'port', 'ready']

class jujubigdata.relations.Kafka(port=None, *args, **kwargs)

    Bases: charmhelpers.core.charmframework.helpers.Relation

    provide(remote_service, all_ready)

    relation_name = 'kafka'

    required_keys = ['private-address', 'port']

class jujubigdata.relations.MySQL(**kwargs)

    Bases: charmhelpers.core.charmframework.helpers.Relation

    relation_name = 'db'

    required_keys = ['host', 'database', 'user', 'password']
class jujubigdata.relations.NameNode(spec=None, port=None, webhdfs_port=None, *args, **kwargs)

    Bases: jujubigdata.relations.SpecMatchingRelation, jujubigdata.relations.EtcHostsRelation

    Relation which communicates the NameNode (HDFS) connection & status info. This is the relation that clients should use.

    has_slave()
        Check if the NameNode has any DataNode slaves registered. This reflects whether HDFS is ready without having to wait for utils.wait_for_hdfs.

    is_ready()

    provide(remote_service, all_ready)

    relation_name = 'namenode'

    require_slave = True

    required_keys = ['private-address', 'has_slave', 'port', 'webhdfs-port']

class jujubigdata.relations.NameNodeMaster(spec=None, port=None, webhdfs_port=None, *args, **kwargs)

    Bases: jujubigdata.relations.NameNode, jujubigdata.relations.SSHRelation

    Alternate NameNode relation for DataNodes.

    relation_name = 'datanode'

    require_slave = False

    ssh_user = 'hdfs'

class jujubigdata.relations.NodeManager(**kwargs)

    Bases: charmhelpers.core.charmframework.helpers.Relation

    Relation which communicates NodeManager info back to ResourceManagers.

    provide(remote_service, all_ready)

    relation_name = 'nodemanager'

    required_keys = ['private-address', 'hostname']

class jujubigdata.relations.ResourceManager(spec=None, port=None, historyserver_http=None, historyserver_ipc=None, *args, **kwargs)

    Bases: jujubigdata.relations.SpecMatchingRelation, jujubigdata.relations.EtcHostsRelation

    Relation which communicates the ResourceManager (YARN) connection & status info. This is the relation that clients should use.

    has_slave()
        Check if the ResourceManager has any NodeManager slaves registered.

    is_ready()

    provide(remote_service, all_ready)

    relation_name = 'resourcemanager'

    require_slave = True

    required_keys = ['private-address', 'has_slave', 'historyserver-http', 'historyserver-ipc', 'port']

class jujubigdata.relations.ResourceManagerMaster(spec=None, port=None, historyserver_http=None, historyserver_ipc=None, *args, **kwargs)

    Bases: jujubigdata.relations.ResourceManager, jujubigdata.relations.SSHRelation

    Alternate ResourceManager relation for NodeManagers.

    relation_name = 'nodemanager'

    require_slave = False

    ssh_user = 'yarn'

class jujubigdata.relations.SSHRelation(*args, **kwargs)

    Bases: charmhelpers.core.charmframework.helpers.Relation

    install_ssh_keys()

    provide(remote_service, all_ready)

    ssh_user = 'ubuntu'

class jujubigdata.relations.SecondaryNameNode(spec=None, port=None, *args, **kwargs)

    Bases: jujubigdata.relations.SpecMatchingRelation

    Relation which communicates SecondaryNameNode info back to NameNodes.

    provide(remote_service, all_ready)

    relation_name = 'secondary'

    required_keys = ['private-address', 'hostname', 'port']

class jujubigdata.relations.Spark(**kwargs)

    Bases: charmhelpers.core.charmframework.helpers.Relation

    provide(remote_service, all_ready)

    relation_name = 'spark'

    required_keys = ['ready']
class jujubigdata.relations.SpecMatchingRelation(spec=None, *args, **kwargs)

    Bases: charmhelpers.core.charmframework.helpers.Relation

    Relation base class that validates that the version and environment of two related charms match, to prevent interoperability issues.

    This class adds a spec key to the required_keys and populates it in provide(). The spec value must be passed in to __init__().

    The spec should be a mapping (or a callback that returns a mapping) which describes all aspects of the charm's environment or configuration that might affect its interoperability with the remote charm. The charm on the requires side of the relation will verify that all of the keys in its spec are present and exactly equal on the provides side of the relation. This does mean that the requires side can be a subset of the provides side, but not the other way around.

    An example spec might be:

        {
            'arch': 'x86_64',
            'vendor': 'apache',
            'version': '2.4',
        }

    filtered_data(remote_service=None)

    is_ready()
        Validate the spec data from the connected units to ensure that it matches the local spec.

    provide(remote_service, all_ready)
        Provide the spec data to the remote service.

        Subclasses must either delegate to this method (e.g., via super()) or include 'spec': json.dumps(self.spec) in the provided data themselves.

    spec
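The matching rule described above (requires side may be a subset of the provides side, but not vice versa) can be sketched as follows; this is a minimal illustration, not the library's implementation:

```python
# Minimal sketch of the spec-matching rule (not the library
# implementation): the requires side checks that every key in its
# local spec is present and exactly equal on the provides side, so
# the local spec may be a subset of the remote one.
def spec_matches(local_spec, remote_spec):
    return all(remote_spec.get(k) == v for k, v in local_spec.items())

local = {'vendor': 'apache', 'version': '2.4'}
remote = {'arch': 'x86_64', 'vendor': 'apache', 'version': '2.4'}
assert spec_matches(local, remote)      # subset of remote: matches
assert not spec_matches(remote, local)  # superset of remote: does not
```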
jujubigdata.handlers
jujubigdata.handlers.HDFS |
jujubigdata.handlers.HadoopBase |
jujubigdata.handlers.YARN |
class jujubigdata.handlers.HDFS(hadoop_base)

    Bases: object

    configure_client()

    configure_datanode(host=None, port=None)

    configure_hdfs_base(host, port)

    configure_namenode(secondary_host=None, secondary_port=None)

    configure_secondarynamenode(host=None, port=None)
        Configure the SecondaryNameNode when the apache-hadoop-hdfs-secondary charm is deployed and related to apache-hadoop-hdfs-master.

        The only purpose of the SecondaryNameNode is to perform periodic checkpoints: it periodically downloads the current NameNode image and edits log files, joins them into a new image, and uploads the new image back to the (primary and only) NameNode.

    create_hdfs_dirs()

    format_namenode()

    register_slaves(slaves=None)

    start_datanode()

    start_namenode()

    start_secondarynamenode()

    stop_datanode()

    stop_namenode()

    stop_secondarynamenode()
class jujubigdata.handlers.HadoopBase(dist_config)

    Bases: object

    configure_hadoop()

    configure_hosts_file()
        Add the unit's private-address to /etc/hosts to ensure that Java can resolve the hostname of the server to its real IP address. We derive our hostname from the unit_id, replacing / with -.

    install(force=False)

    install_base_packages()

    install_hadoop()

    install_java()
        Run the java-installer resource to install Java and determine the JAVA_HOME and Java version.

        The java-installer must be idempotent and its only output (on stdout) should be two lines: the JAVA_HOME path and the Java version, respectively. If there is an error installing Java, the installer should exit with a non-zero exit code.

    is_installed()

    register_slaves(slaves)
        Add slaves to an HDFS or YARN master, determined by the relation name.

        Parameters: relation (str) – 'datanode' for registering HDFS slaves; 'nodemanager' for registering YARN slaves.

    run(user, command, *args, **kwargs)
        Run a Hadoop command as the hdfs user.

        Parameters:
            command (str) – Command to run, prefixed with bin/ or sbin/
            args (list) – Additional args to pass to the command

    setup_hadoop_config()

    spec()
        Generate the full spec for keeping charms in sync.

        NB: This has to be a callback instead of a plain property because it is passed to the relations during construction of the Manager, but needs to properly reflect the Java version in the same hook invocation that installs Java.
class jujubigdata.handlers.YARN(hadoop_base)

    Bases: object

    configure_client(host=None, port=None, history_http=None, history_ipc=None)

    configure_jobhistory()

    configure_nodemanager(host=None, port=None, history_http=None, history_ipc=None)

    configure_resourcemanager()

    configure_yarn_base(host, port, history_http, history_ipc)

    install_demo()

    register_slaves(slaves=None)

    start_jobhistory()

    start_nodemanager()

    start_resourcemanager()

    stop_jobhistory()

    stop_nodemanager()

    stop_resourcemanager()
jujubigdata.utils
jujubigdata.utils.DistConfig | This class processes distribution-specific configuration options.
jujubigdata.utils.TimeoutError |
jujubigdata.utils.cpu_arch |
jujubigdata.utils.disable_firewall | Temporarily disable the firewall, via ufw.
jujubigdata.utils.environment_edit_in_place | Edit the /etc/environment file in-place.
jujubigdata.utils.get_kv_hosts |
jujubigdata.utils.get_ssh_key |
jujubigdata.utils.initialize_kv_host |
jujubigdata.utils.install_ssh_key |
jujubigdata.utils.jps | Get PIDs for named Java processes, for any user.
jujubigdata.utils.manage_etc_hosts | Manage the /etc/hosts file from the host entries stored in unitdata.kv() by the various relations.
jujubigdata.utils.normalize_strbool |
jujubigdata.utils.re_edit_in_place | Perform a set of in-place edits to a file.
jujubigdata.utils.read_etc_env | Read /etc/environment and return it, along with proxy configuration, as a dict.
jujubigdata.utils.resolve_private_address |
jujubigdata.utils.run_as | Run a command as a particular user, using /etc/environment and optionally capturing and returning the output.
jujubigdata.utils.strtobool |
jujubigdata.utils.update_etc_hosts | Update /etc/hosts given a mapping of managed IP / hostname pairs.
jujubigdata.utils.update_kv_host |
jujubigdata.utils.verify_resources | Predicate for specific named resources, with useful rendering in the logs.
jujubigdata.utils.wait_for_hdfs |
jujubigdata.utils.wait_for_jps |
jujubigdata.utils.xmlpropmap_edit_in_place | Edit an XML property map (configuration) file in-place.
class jujubigdata.utils.DistConfig(filename='dist.yaml', required_keys=None)

    Bases: object

    This class processes distribution-specific configuration options.

    Some configuration options are specific to the Hadoop distribution (e.g. Apache, Hortonworks, MapR, etc). These options are immutable and must not change throughout the charm deployment lifecycle.

    Helper methods are provided for keys that require action. Presently, this includes adding/removing directories, dependent packages, and groups/users. Other required keys may be listed when instantiating this class, but this will only validate that those keys exist in the yaml; it will not provide any helper functionality for unknown keys.

    Parameters:
        filename (str) – File to process (default dist.yaml)
        required_keys (list) – A list of keys required to be present in the yaml

    Example dist.yaml with supported keys:

        vendor: '<name>'
        hadoop_version: '<version>'
        packages:
            - '<package 1>'
            - '<package 2>'
        groups:
            - '<name>'
        users:
            <user 1>:
                groups: ['<primary>', '<group>', '<group>']
            <user 2>:
                groups: ['<primary>']
        dirs:
            <dir 1>:
                path: '</path/to/dir>'
                perms: 0777
            <dir 2>:
                path: '{config[<option>]}'  # value comes from config option
                owner: '<user>'
                group: '<group>'
                perms: 0755
        ports:
            <name1>:
                port: <port>
                exposed_on: <service>  # optional
            <name2>:
                port: <port>
                exposed_on: <service>  # optional

    add_dirs()

    add_packages()

    add_users()

    exposed_ports(service)

    path(key)

    port(key)

    remove_dirs()

    remove_packages()

    remove_users()
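To make the ports section concrete, here is a sketch of how port() and exposed_ports() might read it; the namenode/nodemanager entries and port numbers are assumed example values, and the yaml is represented as an already-parsed dict:

```python
# Assumed example of a parsed 'ports' section from dist.yaml; the
# entry names and port numbers are illustrative, not from the docs.
dist = {
    'ports': {
        'namenode': {'port': 8020, 'exposed_on': 'hdfs-master'},
        'nodemanager': {'port': 8042},
    },
}

def port(config, key):
    """Look up a named port, as DistConfig.port(key) might."""
    return config['ports'][key]['port']

def exposed_ports(config, service):
    """List ports exposed on a service, as exposed_ports(service) might."""
    return [v['port'] for v in config['ports'].values()
            if v.get('exposed_on') == service]

assert port(dist, 'namenode') == 8020
assert exposed_ports(dist, 'hdfs-master') == [8020]
```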
exception jujubigdata.utils.TimeoutError

    Bases: exceptions.Exception
jujubigdata.utils.cpu_arch()

jujubigdata.utils.disable_firewall(*args, **kwds)

    Temporarily disable the firewall, via ufw.

jujubigdata.utils.environment_edit_in_place(*args, **kwds)

    Edit the /etc/environment file in-place.

    There is no standard definition for the format of /etc/environment, but the convention, which this helper supports, is simple key-value pairs separated by =, with optionally quoted values.

    Note that this helper will implicitly quote all values. Also note that the file is not locked during the edits.
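The key-value convention described above can be sketched as a small parser; this is an illustration of the format, not the library's implementation, and the sample values are assumptions:

```python
# Sketch of the /etc/environment convention described above (not the
# library implementation): one KEY=value pair per line, with
# optionally quoted values.
def parse_environment(text):
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#') or '=' not in line:
            continue
        key, _, value = line.partition('=')
        env[key.strip()] = value.strip().strip('"\'')
    return env

sample = 'PATH="/usr/local/bin:/usr/bin"\nJAVA_HOME=/usr/lib/jvm/default'
assert parse_environment(sample) == {
    'PATH': '/usr/local/bin:/usr/bin',
    'JAVA_HOME': '/usr/lib/jvm/default',
}
```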
jujubigdata.utils.get_kv_hosts()

jujubigdata.utils.get_ssh_key(user)

jujubigdata.utils.initialize_kv_host()

jujubigdata.utils.install_ssh_key(user, ssh_key)

jujubigdata.utils.jps(name)

    Get PIDs for named Java processes, for any user.

jujubigdata.utils.manage_etc_hosts()

    Manage the /etc/hosts file from the host entries stored in unitdata.kv() by the various relations.

jujubigdata.utils.normalize_strbool(value)

jujubigdata.utils.re_edit_in_place(filename, subs)

    Perform a set of in-place edits to a file.

    Parameters:
        filename (str) – Name of file to edit
        subs (dict) – Mapping of patterns to replacement strings
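A minimal sketch of what re_edit_in_place does, under the assumption that each pattern is applied as a regular-expression substitution over the whole file (this is an illustration, not the library's code):

```python
import re

# Minimal sketch of re_edit_in_place (assumed behavior, not the
# library implementation): read the file, apply each
# pattern -> replacement substitution, and write the result back.
def re_edit_in_place(filename, subs):
    with open(filename) as f:
        text = f.read()
    for pattern, replacement in subs.items():
        text = re.sub(pattern, replacement, text)
    with open(filename, 'w') as f:
        f.write(text)
```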
jujubigdata.utils.read_etc_env()

    Read /etc/environment and return it, along with proxy configuration, as a dict.

jujubigdata.utils.resolve_private_address(addr)

jujubigdata.utils.run_as(user, command, *args, **kwargs)

    Run a command as a particular user, using /etc/environment and optionally capturing and returning the output.

    Raises subprocess.CalledProcessError if the command fails.

    Parameters:
        user (str) – Username to run the command as
        command (str) – Command to run
        args (list) – Additional args to pass to the command
        env (dict) – Additional env variables (will be merged with /etc/environment)
        capture_output (bool) – Capture and return output (default: False)
        input (str) – Stdin for the command
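As a rough illustration of the call shape, the sketch below builds the sort of command line run_as might construct; the use of su -c is an assumption, and the real helper additionally merges /etc/environment and handles capture_output and input:

```python
# Hypothetical sketch only: 'su <user> -c' is an assumed mechanism,
# and the real run_as also merges /etc/environment into the command's
# environment and supports capture_output and input.
def build_run_as_cmd(user, command, *args):
    return ['su', user, '-c', ' '.join((command,) + args)]

assert build_run_as_cmd('hdfs', 'bin/hdfs', 'dfsadmin', '-report') == \
    ['su', 'hdfs', '-c', 'bin/hdfs dfsadmin -report']
```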
jujubigdata.utils.strtobool(value)

jujubigdata.utils.update_etc_hosts(ips_to_names)

    Update /etc/hosts given a mapping of managed IP / hostname pairs.

    Parameters: ips_to_names (dict) – mapping of IPs to hostnames (must be one-to-one)

jujubigdata.utils.update_kv_host(ip, host)

class jujubigdata.utils.verify_resources(*which)

    Bases: object

    Predicate for specific named resources, with useful rendering in the logs.

    Parameters: *which (str) – One or more resource names to fetch & verify. Defaults to all non-optional resources.

jujubigdata.utils.wait_for_hdfs(timeout)

jujubigdata.utils.wait_for_jps(process_name, timeout)

jujubigdata.utils.xmlpropmap_edit_in_place(*args, **kwds)

    Edit an XML property map (configuration) file in-place.

    This helper acts as a context manager which edits an XML file of the form:

        <configuration>
            <property>
                <name>property-name</name>
                <value>property-value</value>
                <description>Optional property description</description>
            </property>
            ...
        </configuration>

    This context manager yields a dict containing the existing name/value mappings. Properties can then be modified, added, or removed, and the changes will be reflected in the file.

    Example usage:

        with xmlpropmap_edit_in_place('my.xml') as props:
            props['foo'] = 'bar'
            del props['removed']

    Note that the file is not locked during the edits.
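The context-manager behavior described above can be sketched with the standard library's ElementTree; this is a simplified illustration, not the library's implementation (for instance, it drops description elements when writing back):

```python
import xml.etree.ElementTree as ET
from contextlib import contextmanager

# Simplified sketch of the behavior described above (not the library
# implementation): parse the property map, yield the name/value dict,
# then rebuild the file from the (possibly modified) dict.
# Note: <description> elements are dropped on write in this sketch.
@contextmanager
def xmlpropmap_edit_in_place(filename):
    tree = ET.parse(filename)
    root = tree.getroot()
    props = {p.find('name').text: p.find('value').text
             for p in root.findall('property')}
    yield props
    for p in list(root.findall('property')):
        root.remove(p)
    for name, value in props.items():
        prop = ET.SubElement(root, 'property')
        ET.SubElement(prop, 'name').text = name
        ET.SubElement(prop, 'value').text = str(value)
    tree.write(filename)
```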