Apache Flume is an open source platform for efficiently and reliably collecting, aggregating, and transferring massive amounts of data from one or more sources to a centralized data store. Data sources are customizable in Flume, so it can ingest any kind of data, including log data, event data, network data, social-media-generated data, email messages, message queues, etc.
A Flume flow has three main components: Source, Channel, and Sink.
Source: A Flume source is a Flume component that consumes events and data from sources such as a web server or message queue. Flume sources are customized to handle data from specific sources. For example, an Avro Flume source is used to ingest data from Avro clients, and a Thrift Flume source is used to ingest data from Thrift clients. You can write custom Flume sources to ingest custom data. For example, you can write a Twitter Flume source to ingest Tweets.
Channel: Flume sources ingest data and store it in one or more channels. Channels are temporary stores that keep the data until it is consumed by Flume sinks.
Sink: Flume sinks remove the data stored in channels and put it into a central repository such as HDFS or Hive.
A Flume agent is a JVM process that hosts the components through which events flow from an external source to either the central repository or the next destination. The Flume agent wires together the external sources, Flume sources, Flume channels, Flume sinks, and external destinations for each Flume data flow. The agent does this through a configuration file in which it maps the sources, channels, and sinks, and defines the properties for each component.
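As a sketch, a minimal agent configuration file might look like the following. The agent name `a1`, the netcat source, and the HDFS path are illustrative choices, not fixed names:

```properties
# Name the components of agent a1 (the name is arbitrary)
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: listen for newline-separated events on a TCP port
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Channel: buffer events in memory until the sink consumes them
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

# Sink: write events to HDFS (path is a hypothetical example)
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode/flume/events

# Wire the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```

An agent configured this way is typically started with `bin/flume-ng agent --conf conf --conf-file example.conf --name a1`, where the agent name must match the prefix used in the file.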
Flume uses a transactional approach to guarantee the delivery of data. Events are removed from a channel only after they have been successfully stored in the terminal repository for single-hop flows, or successfully stored in the channel of the next agent in the case of multi-hop flows.
In Flume, events are staged in channels. Flume sources add events to Flume channels; Flume sinks consume events from channels and publish them to terminal data stores. Channels manage recovery from failures. Flume supports different kinds of channels: in-memory channels store events in an in-memory queue, which is fast but volatile, while file channels are durable, backed by the local file system.
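The choice between the two channel types comes down to the `type` property in the agent configuration; a sketch, with illustrative capacities and paths:

```properties
# In-memory channel: fast, but events are lost if the agent dies
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# File channel: slower, but events survive an agent restart
a1.channels.c2.type = file
a1.channels.c2.checkpointDir = /var/flume/checkpoint
a1.channels.c2.dataDirs = /var/flume/data
```

The file channel's checkpoint and data directories should live on a stable local disk, since that is what gives the channel its durability.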
Flume has a plugin-based architecture. It ships with many out-of-the-box sources, channels, and sinks. Many other customized components exist separately from Flume and can be plugged into Flume and used in your applications. Or you can write your own custom components and plug them into Flume.
There are two ways to add plugins to Flume.
Add the plugin jar files to the FLUME_CLASSPATH variable in the flume-env.sh file, or place them in the plugins.d directory under the Flume installation, from which Flume loads them automatically at startup.
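As a sketch, the classpath approach amounts to an edit like this in conf/flume-env.sh; the jar path below is a hypothetical placeholder for your own plugin:

```shell
# conf/flume-env.sh -- add custom plugin jars to Flume's classpath
# (the jar path below is a hypothetical example)
FLUME_CLASSPATH="/opt/flume-plugins/my-custom-source.jar"
export FLUME_CLASSPATH
```

Multiple jars can be listed separated by colons, as in any Java classpath.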
Flume can be set up to have multiple agents process data from multiple sources and send it to one or a few intermediate destinations. Separate agents then consume messages from the intermediate data source and write them to a central data store.
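A common way to wire such a multi-hop flow is to pair an Avro sink on each first-tier agent with an Avro source on the consolidating agent. A sketch, where the host name, port, and channel names are illustrative:

```properties
# Tier-1 agent: forward events to the collector over Avro
agent1.sinks.avroSink.type = avro
agent1.sinks.avroSink.hostname = collector.example.com
agent1.sinks.avroSink.port = 4545
agent1.sinks.avroSink.channel = c1

# Collector agent: receive events from the upstream agents
collector.sources.avroSrc.type = avro
collector.sources.avroSrc.bind = 0.0.0.0
collector.sources.avroSrc.port = 4545
collector.sources.avroSrc.channels = c1
```

Because delivery between hops is transactional, an event is removed from the tier-1 agent's channel only once the collector's channel has accepted it.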
Flume provides a File Channel Integrity tool, which verifies the integrity of individual events in a file channel and removes corrupted events.
If a Flume agent goes down, all flows hosted on that agent are aborted. Once the agent is restarted, the flows resume. If a channel is set up as an in-memory channel, all events stored in the channel when the agent went down are lost. But channels set up as file channels or other stable channels will continue processing events where they left off.