Create Your First FLUME Program

Prerequisites:

This tutorial was developed on the Linux – Ubuntu operating system.

You should have Hadoop (version 2.2.0 is used in this tutorial) already installed and running on the system.

You should have Java (version 1.8.0 is used in this tutorial) already installed on the system.

You should have set JAVA_HOME accordingly.
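
A quick way to confirm these prerequisites (a minimal check; exact version strings will differ on your setup):

hadoop version      # should report Hadoop 2.2.0
java -version       # should report Java 1.8.0
echo $JAVA_HOME     # should print the Java installation directory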


Before we start with the actual process, change the user to ‘hduser’ (the user used for Hadoop).

su - hduser

Steps:

Flume, library and source code setup

  1. Create a new directory with the name ‘FlumeTutorial’

sudo mkdir FlumeTutorial

a. Give it read, write and execute permissions

sudo chmod -R 777 FlumeTutorial

b. Copy the files MyTwitterSource.java and MyTwitterSourceForFlume.java into this directory.

Download Input Files From Here

Check the file permissions of all these files; if ‘read’ permission is missing, grant it.
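
For example, assuming the files were copied into ~/FlumeTutorial:

ls -l ~/FlumeTutorial
sudo chmod +r ~/FlumeTutorial/MyTwitterSource.java ~/FlumeTutorial/MyTwitterSourceForFlume.java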

2. Download ‘Apache Flume’ from the site https://flume.apache.org/download.html

Apache Flume 1.4.0 has been used in this tutorial.

Next, click the download link for the binary tarball, apache-flume-1.4.0-bin.tar.gz.

3. Copy the downloaded tarball into the directory of your choice and extract its contents using the following command

sudo tar -xvf apache-flume-1.4.0-bin.tar.gz

This command will create a new directory named apache-flume-1.4.0-bin and extract the files into it. This directory will be referred to as <Installation Directory of Flume> in the rest of the article.
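
For example, if you extract the tarball under /usr/local (the location used by later commands in this tutorial), you can confirm the layout with:

ls /usr/local/apache-flume-1.4.0-bin    # should show, among others, the bin, conf and lib directories referenced below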

4. Flume library setup

Copy twitter4j-core-4.0.1.jar, flume-ng-configuration-1.4.0.jar, flume-ng-core-1.4.0.jar, flume-ng-sdk-1.4.0.jar to

<Installation Directory of Flume>/lib/
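
For example, assuming the four JARs were downloaded to the current directory and Flume is installed under /usr/local/apache-flume-1.4.0-bin (as in the later commands of this tutorial):

sudo cp twitter4j-core-4.0.1.jar flume-ng-configuration-1.4.0.jar \
    flume-ng-core-1.4.0.jar flume-ng-sdk-1.4.0.jar \
    /usr/local/apache-flume-1.4.0-bin/lib/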

It is possible that some or all of the copied JARs have execute permission. This may cause issues with compiling the code, so revoke execute permission on any such JAR.

In my case, twitter4j-core-4.0.1.jar had execute permission. I revoked it as below-

sudo chmod -x twitter4j-core-4.0.1.jar

After this command, give ‘read’ permission on twitter4j-core-4.0.1.jar to all.

sudo chmod a+r /usr/local/apache-flume-1.4.0-bin/lib/twitter4j-core-4.0.1.jar

Please note that I have downloaded-

– twitter4j-core-4.0.1.jar from http://mvnrepository.com/artifact/org.twitter4j/twitter4j-core

– all Flume JARs, i.e. flume-ng-*-1.4.0.jar, from http://mvnrepository.com/artifact/org.apache.flume

Load data from Twitter using Flume

1. Go to the directory containing the source code files.

2. Set CLASSPATH to contain <Flume Installation Directory>/lib/* and ~/FlumeTutorial/flume/mytwittersource/*

export CLASSPATH="/usr/local/apache-flume-1.4.0-bin/lib/*:$HOME/FlumeTutorial/flume/mytwittersource/*"

3. Compile the source code using the command-

javac -d . MyTwitterSourceForFlume.java MyTwitterSource.java

4. Create the JAR

First, create a Manifest.txt file using a text editor of your choice and add the line below to it-

Main-Class: flume.mytwittersource.MyTwitterSourceForFlume

Here, flume.mytwittersource.MyTwitterSourceForFlume is the name of the main class. Please note that you have to hit the Enter key at the end of this line, so that the manifest ends with a newline.
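
If you prefer the command line, one way to create Manifest.txt with the required trailing newline is:

printf 'Main-Class: flume.mytwittersource.MyTwitterSourceForFlume\n' > Manifest.txt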

Now, create the JAR ‘MyTwitterSourceForFlume.jar’ as-

jar cfm MyTwitterSourceForFlume.jar Manifest.txt flume/mytwittersource/*.class
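
You can verify that the classes and the manifest were packaged as expected with:

jar tf MyTwitterSourceForFlume.jar    # should list META-INF/MANIFEST.MF and flume/mytwittersource/*.class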

5. Copy this jar to <Flume Installation Directory>/lib/

sudo cp MyTwitterSourceForFlume.jar <Flume Installation Directory>/lib/

6. Go to the configuration directory of Flume, <Flume Installation Directory>/conf

If flume.conf does not exist, then copy flume-conf.properties.template and rename it to flume.conf

sudo cp flume-conf.properties.template flume.conf

If flume-env.sh does not exist, then copy flume-env.sh.template and rename it to flume-env.sh

sudo cp flume-env.sh.template flume-env.sh

7. Create a Twitter application by signing in to https://dev.twitter.com/user/login?destination=home

a. Go to ‘My applications’ (this option appears in the drop-down menu when the ‘Egg’ button at the top right corner is clicked)

b. Create a new application by clicking ‘Create New App’

c. Fill in the application details by specifying the name of the application, a description and a website. You may refer to the notes given underneath each input box.

d. Scroll down the page, accept the terms by marking ‘Yes, I agree’ and click the button ‘Create your Twitter application’

e. On the window of the newly created application, go to the ‘API Keys’ tab, scroll down the page and click the button ‘Create my access token’

f. Refresh the page.

g. Click on ‘Test OAuth’. This will display the ‘OAuth’ settings of the application.

h. Modify ‘flume.conf’ (created in Step 6) using these OAuth settings. Steps to modify ‘flume.conf’ are given in step 8 below.

We need to copy the Consumer key, Consumer secret, Access token and Access token secret to update ‘flume.conf’.

Note: These values belong to the user and hence are confidential, so they should not be shared.

8. Open ‘flume.conf’ in write mode and set values for the below parameters-

[A]

sudo gedit flume.conf

Copy the below contents into flume.conf-
MyTwitAgent.sources = Twitter
MyTwitAgent.channels = MemChannel
MyTwitAgent.sinks = HDFS
MyTwitAgent.sources.Twitter.type = flume.mytwittersource.MyTwitterSourceForFlume
MyTwitAgent.sources.Twitter.channels = MemChannel
MyTwitAgent.sources.Twitter.consumerKey = <Copy consumer key value from Twitter App>
MyTwitAgent.sources.Twitter.consumerSecret = <Copy consumer secret value from Twitter App>
MyTwitAgent.sources.Twitter.accessToken = <Copy access token value from Twitter App>
MyTwitAgent.sources.Twitter.accessTokenSecret = <Copy access token secret value from Twitter App>
MyTwitAgent.sources.Twitter.keywords = guru99
MyTwitAgent.sinks.HDFS.channel = MemChannel
MyTwitAgent.sinks.HDFS.type = hdfs
MyTwitAgent.sinks.HDFS.hdfs.path = hdfs://localhost:54310/user/hduser/flume/tweets/
MyTwitAgent.sinks.HDFS.hdfs.fileType = DataStream
MyTwitAgent.sinks.HDFS.hdfs.writeFormat = Text
MyTwitAgent.sinks.HDFS.hdfs.batchSize = 1000
MyTwitAgent.sinks.HDFS.hdfs.rollSize = 0
MyTwitAgent.sinks.HDFS.hdfs.rollCount = 10000
MyTwitAgent.channels.MemChannel.type = memory
MyTwitAgent.channels.MemChannel.capacity = 10000
MyTwitAgent.channels.MemChannel.transactionCapacity = 1000

[B]

Also, set MyTwitAgent.sinks.HDFS.hdfs.path as below,

MyTwitAgent.sinks.HDFS.hdfs.path = hdfs://<Host Name>:<Port Number>/<HDFS Home Directory>/flume/tweets/

To know <Host Name>, <Port Number> and <HDFS Home Directory>, see the value of the parameter ‘fs.defaultFS’ set in $HADOOP_HOME/etc/hadoop/core-site.xml
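
For example, assuming $HADOOP_HOME is set, the value can be read directly:

grep -A 1 'fs.defaultFS' $HADOOP_HOME/etc/hadoop/core-site.xml
hdfs getconf -confKey fs.defaultFS    # available in recent Hadoop 2.x releases; prints e.g. hdfs://localhost:54310 as used above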

[C]

In order to flush the data to HDFS as and when it arrives, delete the below entry if it exists,

MyTwitAgent.sinks.HDFS.hdfs.rollInterval = 600

9. Open ‘flume-env.sh’ in write mode and set values for the below parameters,

JAVA_HOME=<Installation directory of Java>

FLUME_CLASSPATH=”<Flume Installation Directory>/lib/MyTwitterSourceForFlume.jar”
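
For example (the Java path below is only an illustrative assumption; substitute your own installation directory):

JAVA_HOME=/usr/lib/jvm/java-8-oracle                                              # assumed path; adjust to your Java install
FLUME_CLASSPATH="/usr/local/apache-flume-1.4.0-bin/lib/MyTwitterSourceForFlume.jar"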

10. Start Hadoop

$HADOOP_HOME/sbin/start-dfs.sh

$HADOOP_HOME/sbin/start-yarn.sh
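
You can confirm that the Hadoop daemons started with:

jps    # should list NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager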

11. Two of the JAR files in the Flume tarball are not compatible with Hadoop 2.2.0, so we need to follow the below steps to make Flume compatible with Hadoop 2.2.0.

a. Move protobuf-java-2.4.1.jar out of ‘<Flume Installation Directory>/lib’.

Go to ‘<Flume Installation Directory>/lib’

cd <Flume Installation Directory>/lib

sudo mv protobuf-java-2.4.1.jar ~/

b. Find the JAR file ‘guava’ as below

find . -name "guava*"

Move guava-10.0.1.jar out of ‘<Flume Installation Directory>/lib’.

sudo mv guava-10.0.1.jar ~/

c. Download guava-17.0.jar from http://mvnrepository.com/artifact/com.google.guava/guava/17.0

Now, copy this downloaded jar file to ‘<Flume Installation Directory>/lib’
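
If you prefer the command line, the JAR can usually be fetched straight from Maven Central (the URL below follows the standard repository layout and is an assumption; verify it on the mvnrepository page above):

cd <Flume Installation Directory>/lib
sudo wget https://repo1.maven.org/maven2/com/google/guava/guava/17.0/guava-17.0.jar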

12. Go to ‘<Flume Installation Directory>/bin’ and start Flume as-

./flume-ng agent -n MyTwitAgent -c conf -f <Flume Installation Directory>/conf/flume.conf

Command prompt window where Flume is fetching tweets:

From the messages in the command window, we can see that the output is written to the /user/hduser/flume/tweets/ directory.

Now, open this directory using a web browser.

13. To see the result of the data load, open http://localhost:50070/ in a browser and browse the file system, then go to the directory where the data has been loaded, that is-

<HDFS Home Directory>/flume/tweets/
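
Alternatively, the loaded files can be listed and inspected from the command line (the FlumeData file prefix is the HDFS sink's default, since flume.conf above does not override it):

hdfs dfs -ls /user/hduser/flume/tweets/
hdfs dfs -cat /user/hduser/flume/tweets/FlumeData.*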
