Apache Oozie is a workflow scheduler engine for managing and scheduling Apache Hadoop jobs. Oozie supports several kinds of Hadoop jobs out of the box, such as MapReduce jobs, Streaming jobs, Pig, Hive and Sqoop. Oozie also supports system-specific jobs such as shell scripts and Java jobs.
Oozie is a Java web application that runs in a Java servlet container.
An Apache Oozie Workflow is a collection of actions, such as Hadoop MapReduce jobs and Pig jobs. The actions are arranged in a control dependency DAG (Directed Acyclic Graph), which controls how and when an action can run. Oozie workflow definitions are written in hPDL, an XML Process Definition Language.
Apache Oozie Workflow contains control flow nodes and action nodes.
Control Flow Nodes - Control flow nodes are the mechanisms that define the beginning and end of the workflow (start, end, kill). In addition, control flow nodes provide a mechanism to control the execution path of the workflow (decision, fork and join).
Action Nodes - Action nodes are the mechanisms that trigger the execution of a computation or processing task. Oozie supports different types of Hadoop actions out of the box, such as Hadoop MapReduce, Hadoop file system and Pig. In addition, Oozie supports system-specific jobs such as SSH, HTTP and email.
An Apache Oozie Workflow job can have the following states - PREP, RUNNING, SUSPENDED, SUCCEEDED, KILLED and FAILED.
Apache Oozie Workflow does not support cycles. Apache Oozie workflow definitions must be a strict DAG. If Oozie detects a cycle in the workflow definition at workflow application deployment time, it fails the deployment.
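The deployment-time check can be pictured as a depth-first search over the transitions between nodes. The Python sketch below is an illustration only and is not Oozie's actual implementation; the dictionary representation of a workflow and the function name has_cycle are assumptions made for this example.

```python
def has_cycle(transitions):
    """Return True if the node graph contains a cycle.

    `transitions` maps each node name to the list of node names it can
    transition to (a hypothetical in-memory form of a workflow definition).
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    colour = {node: WHITE for node in transitions}

    def visit(node):
        colour[node] = GREY
        for nxt in transitions.get(node, []):
            state = colour.get(nxt, WHITE)
            if state == GREY:  # back edge: nxt is already on the current path
                return True
            if state == WHITE and visit(nxt):
                return True
        colour[node] = BLACK
        return False

    return any(colour[node] == WHITE and visit(node) for node in list(transitions))


# A valid DAG and a definition in which "b" loops back to "a".
valid = {"start": ["pig-job"], "pig-job": ["end", "fail"], "end": [], "fail": []}
looping = {"start": ["a"], "a": ["b"], "b": ["a", "end"], "end": []}

print(has_cycle(valid))    # False
print(has_cycle(looping))  # True
```

A definition like the second one would be rejected at deployment, since the job could otherwise revisit the same node indefinitely.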
Apache Oozie workflow supports the following control flow nodes that start or end the workflow execution.
Start Control Node - The start node is the first node that an Oozie workflow job transitions to and is the entry point for a workflow job. Every Apache Oozie workflow definition must have one start node.
End Control Node - The end node is the last node that an Oozie workflow job transitions to, and it indicates that the workflow job has completed successfully. When a workflow job reaches the end node it finishes successfully and the job status changes to SUCCEEDED. Every Apache Oozie workflow definition must have one end node.
Kill Control Node - The kill node allows a workflow job to kill itself. When a workflow job reaches the kill node it finishes in error and the status of the job changes to KILLED.
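As a rough illustration of how the start, end and kill nodes determine the final job status, the sketch below walks a toy workflow. The node dictionary format, the run_workflow function and the idea of supplying each action's outcome as a precomputed boolean are assumptions for this example, not Oozie's implementation. Each action carries an ok and an error transition, as action nodes do in Oozie workflow definitions.

```python
def run_workflow(nodes, outcomes):
    """Follow transitions from the start node and return the final job status.

    nodes:    maps node name -> dict describing the node (hypothetical format)
    outcomes: maps action name -> True if the action succeeds, False otherwise
    """
    current = nodes["start"]["to"]      # the start node is the entry point
    while True:
        node = nodes[current]
        if node["type"] == "end":       # reaching the end node: job succeeded
            return "SUCCEEDED"
        if node["type"] == "kill":      # reaching a kill node: job finishes in error
            return "KILLED"
        # Action node: follow "ok" on success and "error" on failure.
        current = node["ok"] if outcomes[current] else node["error"]


workflow = {
    "start":   {"type": "start", "to": "cleanup"},
    "cleanup": {"type": "action", "ok": "report", "error": "fail"},
    "report":  {"type": "action", "ok": "end", "error": "fail"},
    "fail":    {"type": "kill"},
    "end":     {"type": "end"},
}

print(run_workflow(workflow, {"cleanup": True, "report": True}))   # SUCCEEDED
print(run_workflow(workflow, {"cleanup": True, "report": False}))  # KILLED
```

Because the definition is a DAG, the loop always terminates at an end or kill node.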
Apache Oozie workflow supports the following control flow nodes that control the execution path of the workflow.
Decision Control Node - The decision control node is like a switch-case statement, which enables a workflow to make a selection on the execution path to follow.
Fork and Join Control Node - The fork and join control nodes are used in pairs and work as follows. The fork node splits a single path of execution into multiple concurrent paths of execution. The join node waits until every concurrent execution path of the corresponding fork node arrives at it.
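The behaviour of these nodes can be approximated in a few lines of Python. This is a sketch under stated assumptions rather than Oozie code: decide and fork_join are hypothetical helper names, predicates are modelled as plain Python callables, and concurrent paths are modelled as threads. The decision helper picks the first case whose predicate is true and otherwise falls back to a default, much like a switch-case statement.

```python
from concurrent.futures import ThreadPoolExecutor


def decide(cases, default, context):
    """Return the target of the first case whose predicate is true, else the default."""
    for predicate, target in cases:
        if predicate(context):
            return target
    return default


def fork_join(branches):
    """Run every branch concurrently and return only after all have finished."""
    with ThreadPoolExecutor(max_workers=len(branches)) as pool:
        futures = [pool.submit(branch) for branch in branches]  # fork
        return [future.result() for future in futures]          # join


cases = [
    (lambda ctx: ctx["input_size"] == 0, "end"),
    (lambda ctx: ctx["input_size"] > 1000, "big-job"),
]
print(decide(cases, "small-job", {"input_size": 5000}))  # big-job
print(decide(cases, "small-job", {"input_size": 10}))    # small-job

results = fork_join([lambda: "pig done", lambda: "hive done"])
print(results)  # ['pig done', 'hive done']
```

The join step collects every result before returning, which mirrors the join node waiting for all paths of its fork.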
Apache Oozie supports the following action nodes which trigger the execution of computation and processing tasks.
Map-Reduce Action - The map-reduce action node starts a Hadoop Map-Reduce job from an Oozie workflow.
Pig Action - The pig action node starts a Pig job from an Oozie workflow.
FS (HDFS) Action - The FS action node enables an Oozie workflow to manipulate HDFS files and directories. FS action nodes support the commands move, delete, mkdir, chmod, touchz and chgrp.
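SSH Action - The ssh action node runs a shell command on a remote machine over a secure shell. The workflow job waits until the remote command completes before continuing to the next action.
Sub-workflow Action - The sub-workflow action node runs a child workflow job. The parent workflow job waits until the child workflow job has completed.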
Java Action - The java action node executes the public static void main(String[] args) method of the specified main Java class from an Oozie workflow.
The Apache Oozie workflow job transitions through the following states.
PREP - An Oozie workflow job is in the PREP state when it is first created. In this state the workflow job is defined but is not running.
RUNNING - An Oozie workflow job transitions to the RUNNING state when it is started. It remains in the RUNNING state until it reaches its end state, ends in error or is suspended.
SUSPENDED - An Oozie workflow job transitions to SUSPENDED state if it is suspended. The workflow will remain in suspended state until it is resumed or it is killed.
SUCCEEDED - A RUNNING Oozie job transitions to the SUCCEEDED state when it reaches the end node.
KILLED - A PREP, RUNNING or SUSPENDED workflow job transitions to the KILLED state when it is killed by an administrator. A RUNNING job also transitions to KILLED when it reaches a kill node.
FAILED - A RUNNING Oozie job transitions to a FAILED state when the workflow job fails with an unexpected error.
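The state descriptions above amount to a small state machine. The sketch below encodes them as a lookup table, with PREP as the initial state; the TRANSITIONS table and the transition helper are names invented for this illustration and are not part of any Oozie API.

```python
# Allowed transitions between workflow job states.
TRANSITIONS = {
    "PREP":      {"RUNNING", "KILLED"},
    "RUNNING":   {"SUSPENDED", "SUCCEEDED", "KILLED", "FAILED"},
    "SUSPENDED": {"RUNNING", "KILLED"},
    "SUCCEEDED": set(),  # terminal state
    "KILLED":    set(),  # terminal state
    "FAILED":    set(),  # terminal state
}


def transition(current, target):
    """Return the new state, or raise ValueError if the move is not allowed."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current} -> {target}")
    return target


# A job that is started, suspended, resumed and then runs to completion.
state = "PREP"
for step in ("RUNNING", "SUSPENDED", "RUNNING", "SUCCEEDED"):
    state = transition(state, step)
print(state)  # SUCCEEDED
```

Calling transition("SUCCEEDED", "RUNNING") raises ValueError, since SUCCEEDED, KILLED and FAILED have no outgoing transitions.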