The best new OpenStack tips and tricks

OpenStack is a big project, and keeping it all running smoothly (or just learning how to get started) can be a big undertaking. Even if you're a contributor to the project, there's a lot to keep track of, especially in the projects you're less familiar with. Of course, the official documentation as well as a number of OpenStack training and certification programs can be a big help with learning more, but community-authored tutorials are a great supplement.

Every month, Opensource.com pulls together the best how-tos, guides, tutorials, and tips into one place so you can sample some of the best community-produced resources to help you with your OpenStack journey. Without further ado, here’s what we’ve gathered for you this month.

  • First up, let’s take a look at automating the backup of your Cinder volumes. Cinder, the block storage project within OpenStack, allows your VMs to have volumes attached to them that persist even if the VM stops. But what are you doing to make sure these volumes are safe and secure in the event of a failure? Gorka Eguileor takes you through the basics.

  • Next, let’s take a look at a popular storage option for OpenStack, GlusterFS. Swapnil Kulkarni takes a look at installing and configuring RDO with GlusterFS to get your storage back-end up and running.
  • We also enjoyed reading through a trio of OpenStack tips from Loïc Dachary. The first was how to set up a custom name server on an OpenStack instance. Cloud-init makes this relatively easy. Another looked at how to name an OpenStack instance based on its IP address, which certainly makes keeping track of your instances easier. Finally, let’s look at how to delete the last port of an OpenStack router, a one-liner in case you’re having trouble with Neutron router-delete complaining.
  • Upgrades aren't as difficult within OpenStack as they were just a few years ago, with a number of efforts being made to make it easier to move from one version to another. With a six-month release cycle, being able to upgrade regularly can be important! To that end, here's a walkthrough of upgrading Nova to Kilo with minimal downtime.
  • Heat: It's the native orchestration tool in OpenStack, designed to make deploying elastic cloud applications simple. What if you want to make sure an instance you've just launched with Heat is IPv6-ready? Shannon McFarland shows you how.
  • In this next guide, Craige McWhirter writes “When deleting a volume snapshot in OpenStack you may sometimes get an error message stating that Cinder was unable to delete the snapshot. There are a number of reasons why a snapshot may be reported by Ceph as unable to be deleted, however the most common reason in my experience has been that a Cinder client connection has not yet been closed, possibly because a client crashed.” Here’s his guide for getting around that error.
  • Finally this month, we look at two quick guides from Matt Farina, who shows us how to theme the OpenStack dashboard in Kilo, and takes it further with how to build custom AngularJS panels to extend the Horizon dashboard even further.

What I Learned About MapR

MapR, based in San Jose, California, provides a commercial version of Hadoop noted for its fast performance.  This week at the Strata Conference, I got a chance to talk to the folks at MapR and found out how MapR differentiates itself from other Hadoop offerings.

MapR Booth at Strata Conference

MapR's speed appears to come from its filesystem design. It is fully compatible with standard open source Hadoop, including Hadoop 2.x, YARN, and HBase, but uses a more optimized filesystem structure to provide the additional speed boost.

MapR promotes the following benefits:

  • No single point of failure
    Normally the NameNode is the single point of failure for a Hadoop installation.  MapR’s design avoids this issue.
  • NFS mount data files
    MapR allows you to NFS-mount files into the cluster. This saves you the time of copying files into MapR, and you might not even need tools like Flume. Writing directly into the files opens up additional options, such as querying Hadoop on near-real-time data (a hypothetical mount sketch follows this list).
  • Fast access
    MapR has clocked record data-processing speeds, sorting 1.5 trillion bytes in one minute using its MapR Hadoop software on the Google Compute Engine cloud service.
  • Binary compatible with Hadoop
    MapR is binary compatible with open source Hadoop, which gives you more flexibility in adding third-party components or migrating between distributions.
  • Enterprise support
    Professional services, enterprise support, and training and certifications
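
As an illustration only (the host name, cluster name and paths below are hypothetical, and the exact steps depend on how the MapR NFS gateway is configured), mounting a MapR cluster over NFS from a Linux client looks roughly like this:

# mount the cluster's NFS export (served by the MapR NFS gateway) at /mapr -- names are hypothetical
sudo mkdir -p /mapr
sudo mount -t nfs -o nolock maprnode01:/mapr /mapr
# ordinary file commands now write straight into the cluster
cp web-access.log /mapr/my.cluster.com/user/analyst/logs/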

MapR has attracted a number of featured customers including the following:

  • Comscore
  • Cision
  • Linksmart
  • HP
  • Return Path
  • Dotomi
  • Solutionary
  • Trueffect
  • Sociocast
  • Zions Bank
  • Live Nation
  • Cisco
  • Rubicon Project

MapR is also partnering with both Google and Amazon Web Services for cloud-based Hadoop systems.

MapR currently comes in three editions:

  • M3 Standard Edition
  • M5 Enterprise Edition (with “99.999% high availability and self-healing”)
  • M7 Enterprise Edition for Hadoop (with fast database)

Additionally, in conjunction with the Strata Conference this week, MapR has announced the release of the MapR Sandbox. Any user can download the MapR Sandbox for free and run a full MapR Hadoop installation within a VMware or VirtualBox virtual machine. This sandbox provides a suitable learning environment for those who want to experience the use and operation of MapR Hadoop without investing a lot of effort in installation. I haven't downloaded and installed the MapR Sandbox yet. If you have already done this and tried it out, tell me what you think in the comments below.

MapR website: http://www.mapr.com

Big Data Analytics – What is that ?

In a recent study, IBM estimated that 2.5 quintillion bytes of data are created every day – so much that 90% of the data in the world today has been created in the last two years. It is a mind-boggling figure, and the irony is that we feel less informed despite having more information available today.

This dramatic growth in data volumes has put real pressure on today's businesses. Online users create content such as blog posts, tweets, social networking interactions and photos, and servers continuously log messages about what those users are doing.

This online data comes from posts on social media sites like Facebook and Twitter, YouTube videos, cell phone conversation records and so on. Such data is called Big Data.

WHAT IS BIG DATA ?

Big Data refers to datasets that keep growing until they become difficult to manage using existing database management concepts and tools. The difficulty can relate to data capture, storage, search, sharing, analytics, visualization and so on.

Big Data spans three dimensions: Volume, Velocity and Variety.

  • Volume – The size of the data is very large, measured in terabytes and petabytes.
  • Velocity – The data should be used as it streams into the enterprise in order to maximize its value to the business; the role of time is critical here.
  • Variety – The data extends beyond structured data to include unstructured data of all varieties: text, audio, video, posts, log files and so on.

WHY BIG DATA?

When an enterprise can leverage all the information available in its data, rather than just a subset, it has a powerful advantage over its market competitors. Big Data can help it gain insights and make better decisions.

Big Data presents an opportunity to create unprecedented business advantage and better service delivery. It also requires new infrastructure and a new way of thinking about how business and the IT industry work. The concept of Big Data is going to change the way we do things today.

The International Data Corporation (IDC) study predicts that overall data will grow by 50 times by 2020, driven in large part by more embedded systems such as sensors in clothing, medical devices and structures like buildings and bridges. The study also determined that unstructured information – such as files, email and video – will account for 90% of all data created over the next decade. But the number of IT professionals available to manage all that data will only grow by 1.5 times today’s levels.

The digital universe is 1.8 trillion gigabytes in size, stored in 500 quadrillion files, and it more than doubles in size every two years. Comparing the digital universe with our physical universe, there are nearly as many bits of information in the digital universe as there are stars in our physical universe.

CHARACTERISTICS OF BIG DATA

A Big Data platform should provide a solution designed specifically with the needs of the enterprise in mind. The following are the basic features of a Big Data offering:

  • Comprehensive – It should offer a broad platform and address all three dimensions of the Big Data challenge: Volume, Variety and Velocity.
  • Enterprise-ready – It should include performance, security, usability and reliability features.
  • Integrated – It should simplify and accelerate the introduction of Big Data technology to the enterprise, and it should integrate with the information supply chain, including databases, data warehouses and business intelligence applications.
  • Open source based – It should be built on open source technology with enterprise-class functionality and integration.
  • Low-latency reads and updates
  • Robust and fault-tolerant
  • Scalability
  • Extensible
  • Allows ad hoc queries
  • Minimal maintenance

BIG DATA CHALLENGES

The main challenges of Big Data are data variety, volume, analytical workload complexity and agility. Many organizations are struggling to deal with increasing volumes of data. To solve this problem, organizations need to reduce the amount of data being stored and exploit new storage techniques that improve performance and storage utilization.

SUMMARY AND CONCLUSION

Big Data is a new gold rush and a key enabler of social business. A large or medium-sized company can neither make sense of all the user-generated content online nor collaborate effectively with customers, suppliers and partners on social media channels without Big Data analytics. Collaboration with customers and insights from user-generated online content are critical for success in the age of social media.

In a study by McKinsey's Business Technology Office and the McKinsey Global Institute (MGI), the firm calculated that the U.S. faces a shortage of 140,000 to 190,000 people with analytical expertise and 1.5 million managers and analysts with the skills to understand and make decisions based on the analysis of Big Data.

The biggest gap, by roughly a factor of ten, is the lack of managers skilled at making decisions based on analysis. Growing talent and building teams that can make analytics-based decisions is the key to realizing the value of Big Data.

Thank you for reading. Happy Learning !!

Introduction To Flume and Sqoop

Before we learn more about Flume and Sqoop, let's first look at the issues with loading data into Hadoop.

Issues with Data Load into Hadoop

Analytical processing using Hadoop requires loading of huge amounts of data from diverse sources into Hadoop clusters.

This process of bulk data loading into Hadoop from heterogeneous sources, and then processing the data, comes with a certain set of challenges.

Maintaining data consistency and ensuring efficient utilization of resources are some of the factors to consider before selecting the right approach for data loading.

Major Issues:

1. Data load using Scripts

The traditional approach of using scripts to load data is not suitable for bulk data loading into Hadoop; it is inefficient and very time-consuming.

2. Direct access to external data via Map-Reduce application

Providing MapReduce applications with direct access to data residing in external systems (without loading it into Hadoop) complicates those applications, so this approach is not feasible.

In addition to being able to work with enormous amounts of data, Hadoop can work with data in several different forms. To load such heterogeneous data into Hadoop, different tools have been developed; Sqoop and Flume are two such data-loading tools.

Introduction to SQOOP

Apache Sqoop (SQL-to-Hadoop) is designed to support bulk imports of data into HDFS from structured data stores such as relational databases, enterprise data warehouses, and NoSQL systems. Sqoop is based on a connector architecture, which supports plugins that provide connectivity to new external systems.

An example use case of Sqoop is an enterprise that runs a nightly Sqoop import to load the day's data from a production transactional RDBMS into a Hive data warehouse for further analysis.
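
As a rough sketch of such a nightly job (the host, database, table and user names below are hypothetical, and it assumes the sqoop client is installed and configured for the cluster):

# nightly import of the 'orders' table into the Hive warehouse (names are hypothetical)
sqoop import \
  --connect jdbc:mysql://dbserver.example.com/sales \
  --table orders \
  --username report_user -P \
  --hive-import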

Sqoop Connectors

All existing database management systems are designed with the SQL standard in mind, but each DBMS differs to some extent in its dialect. This difference poses challenges when it comes to transferring data across systems, and Sqoop connectors are the components that help overcome these challenges.

Data transfer between Sqoop and an external storage system is made possible with the help of Sqoop's connectors.

Sqoop has connectors for working with a range of popular relational databases, including MySQL, PostgreSQL, Oracle, SQL Server, and DB2. Each of these connectors knows how to interact with its associated DBMS. There is also a generic JDBC connector for connecting to any database that supports Java’s JDBC protocol. In addition, Sqoop provides optimized MySQL and PostgreSQL connectors that use database-specific APIs to perform bulk transfers efficiently.
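
For example (the connection details below are hypothetical), the optimized MySQL path can be requested with the --direct flag, while any other JDBC-capable database can be reached through the generic connector by naming its driver class:

# use the optimized, mysqldump-based bulk path for MySQL
sqoop import --connect jdbc:mysql://dbserver.example.com/sales --table orders \
  --username report_user -P --direct

# fall back to the generic JDBC connector by naming the driver class explicitly
sqoop import --connect jdbc:otherdb://dbserver.example.com/sales --table orders \
  --username report_user -P --driver com.example.OtherDbDriver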

In addition, Sqoop has various third-party connectors for data stores, ranging from enterprise data warehouses (including Netezza, Teradata, and Oracle) to NoSQL stores (such as Couchbase). These connectors are not bundled with Sqoop; they need to be downloaded separately and can easily be added to an existing Sqoop installation.

Introduction to FLUME

Apache Flume is a system for moving massive quantities of streaming data into HDFS. Collecting log data from web servers' log files and aggregating it in HDFS for analysis is one common example use case of Flume.

Flume supports multiple sources, such as:

  • 'tail' (which pipes data from a local file and writes it into HDFS via Flume, similar to the Unix 'tail' command)
  • System logs
  • Apache log4j (which enables Java applications to write events to files in HDFS via Flume)

Data Flow in Flume

A Flume agent is a JVM process with three components – Flume Source, Flume Channel and Flume Sink – through which events propagate after being initiated at an external source. (A minimal agent configuration illustrating this flow is sketched after the list below.)

  1. The events generated by an external source (such as a web server) are consumed by the Flume source. The external source sends events to the Flume source in a format that the target source recognizes.
  2. The Flume source receives an event and stores it into one or more channels. The channel acts as a store that keeps the event until it is consumed by the Flume sink. The channel may use the local file system to store these events.
  3. The Flume sink removes the event from the channel and stores it in an external repository such as HDFS. There can be multiple Flume agents, in which case the Flume sink forwards the event to the Flume source of the next Flume agent in the flow.
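
As an illustration only (the agent name, file name and port are hypothetical; it assumes the flume-ng launcher is on the PATH), a minimal agent that reads lines from a netcat source, buffers them in a memory channel and writes them to a logger sink could be configured and started like this:

# write a minimal agent definition to a hypothetical file example.conf
cat > example.conf <<'EOF'
agent1.sources  = src1
agent1.channels = ch1
agent1.sinks    = sink1

agent1.sources.src1.type = netcat
agent1.sources.src1.bind = localhost
agent1.sources.src1.port = 44444
agent1.sources.src1.channels = ch1

agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 1000

agent1.sinks.sink1.type = logger
agent1.sinks.sink1.channel = ch1
EOF
# start the agent; events sent to localhost:44444 are logged by the sink
flume-ng agent -n agent1 -c conf -f example.conf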

Some Important features of FLUME

  • Flume has a flexible design based on streaming data flows. It is fault-tolerant and robust, with multiple failover and recovery mechanisms. Flume offers different levels of reliability, including 'best-effort delivery' and 'end-to-end delivery'. Best-effort delivery does not tolerate any Flume node failure, whereas 'end-to-end delivery' mode guarantees delivery even in the event of multiple node failures.
  • Flume carries data between sources and sinks. The gathering of data can be either scheduled or event-driven. Flume has its own query-processing engine, which makes it easy to transform each new batch of data before it is moved to the intended sink.
  • Possible Flume sinks include HDFS and HBase. Flume can also be used to transport event data including, but not limited to, network traffic data, data generated by social media websites, and email messages.

Since July 2012, Flume has been released as Flume NG (New Generation), since it differs significantly from its original release, known as Flume OG (Original Generation).

Sqoop vs. Flume vs. HDFS:

  • Purpose – Sqoop is used for importing data from structured data sources such as RDBMSs. Flume is used for moving bulk streaming data into HDFS. HDFS is the distributed file system used by the Hadoop ecosystem to store data.
  • Architecture – Sqoop has a connector-based architecture; connectors know how to connect to the respective data source and fetch the data. Flume has an agent-based architecture, in which code written as an 'agent' takes care of fetching data. HDFS has a distributed architecture in which data is distributed across multiple data nodes.
  • Role of HDFS – With Sqoop, HDFS is the destination for data import. With Flume, data flows to HDFS through zero or more channels. HDFS is the ultimate destination for data storage.
  • Loading – Sqoop data loads are not event-driven. Flume data loads can be event-driven. HDFS simply stores whatever data is delivered to it.
  • When to use – To import data from structured data sources, use Sqoop, because its connectors know how to interact with structured data sources and fetch data from them. To load streaming data such as tweets generated on Twitter or the log files of a web server, use Flume, whose agents are built for fetching streaming data. HDFS has its own built-in shell commands to store data; HDFS cannot itself import streaming data.

Introduction To Pig And Hive

In this tutorial, we will discuss Pig and Hive.

INTRODUCTION TO PIG

In the MapReduce framework, programs need to be translated into a series of Map and Reduce stages. However, this is not a programming model that data analysts are familiar with, so in order to bridge this gap, an abstraction called Pig was built on top of Hadoop.

Pig is a high-level programming language useful for analyzing large data sets. Pig was the result of a development effort at Yahoo!

Pig enables people to focus more on analyzing bulk data sets and to spend less time in writing Map-Reduce programs.

Just as pigs eat anything, the Pig programming language is designed to work with any kind of data. That's why the name, Pig!

Pig consists of two components:

  1. Pig Latin, which is the language
  2. A runtime environment for running Pig Latin programs

A Pig Latin program consists of a series of operations or transformations that are applied to the input data to produce output. These operations describe a data flow, which the Pig execution environment translates into an executable representation. Underneath, the results of these transformations are a series of MapReduce jobs that the programmer is unaware of. So, in a way, Pig allows the programmer to focus on the data rather than the nature of the execution.

Pig Latin is a relatively simple language that uses familiar data-processing keywords such as Join, Group and Filter.

Execution modes:

Pig has two execution modes:

  1. Local mode: In this mode, Pig runs in a single JVM and makes use of the local file system. This mode is suitable only for analyzing small data sets with Pig (see the sketch after this list).
  2. MapReduce mode: In this mode, queries written in Pig Latin are translated into MapReduce jobs and run on a Hadoop cluster (the cluster may be pseudo-distributed or fully distributed). MapReduce mode with a fully distributed cluster is useful for running Pig on large data sets.
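
For illustration only, here is a minimal word-count sketch in Pig Latin, run in local mode (the script and input file names are hypothetical; it assumes the pig launcher is on the PATH):

# write a tiny, hypothetical Pig Latin script
cat > wordcount.pig <<'EOF'
lines   = LOAD 'input.txt' AS (line:chararray);
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group, COUNT(words);
DUMP counts;
EOF
# run it in local mode against the local file system
pig -x local wordcount.pig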

INTRODUCTION TO HIVE

The size of the data sets being collected and analyzed in industry for business intelligence is growing, and in a way this is making traditional data warehousing solutions more expensive. Hadoop, with its MapReduce framework, is being used as an alternative solution for analyzing data sets of huge size. Although Hadoop has proved useful for working on huge data sets, its MapReduce framework is very low-level and requires programmers to write custom programs that are hard to maintain and reuse. Hive comes to the rescue of programmers here.

Hive evolved as a data warehousing solution built on top of the Hadoop MapReduce framework.

Hive provides an SQL-like declarative language, called HiveQL, which is used for expressing queries. Using HiveQL, users familiar with SQL can perform data analysis very easily.

The Hive engine compiles these queries into MapReduce jobs to be executed on Hadoop. In addition, custom MapReduce scripts can be plugged into queries. Hive operates on data stored in tables, which consist of primitive data types and collection data types such as arrays and maps.

Hive comes with a command-line shell interface which can be used to create tables and execute queries.

The Hive query language is similar to SQL and supports subqueries. With the Hive query language, it is possible to perform MapReduce joins across Hive tables. It supports simple SQL-like functions such as CONCAT, SUBSTR and ROUND, and aggregation functions such as SUM, COUNT and MAX. It also supports the GROUP BY and SORT BY clauses. It is also possible to write user-defined functions in the Hive query language.
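
As an illustration only (the table, column and path names are hypothetical, and it assumes the hive command-line client is installed and pointed at the cluster), a HiveQL script like the following could define a table over tab-separated log data and aggregate it:

# run a small, hypothetical HiveQL script from the shell
hive -e "
CREATE TABLE IF NOT EXISTS web_logs (ip STRING, url STRING, status INT)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
LOAD DATA INPATH '/user/hduser/logs/access.tsv' INTO TABLE web_logs;
SELECT status, COUNT(*) AS hits FROM web_logs GROUP BY status;
"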

Comparing MapReduce, Pig and Hive

Pig and Hive are both abstractions over the MapReduce framework: queries written in Pig Latin or HiveQL are compiled into MapReduce jobs that run on the Hadoop cluster. Writing raw MapReduce programs gives the most control, but it requires the most development effort, and the resulting code is hard to maintain and reuse. Pig Latin is a procedural, data-flow style language suited to programmers building data pipelines, while HiveQL is a declarative, SQL-like language suited to analysts who are already familiar with SQL.

Create Your First FLUME Program

Prerequisites:

This tutorial was developed on the Linux (Ubuntu) operating system.

You should already have Hadoop (version 2.2.0 is used in this tutorial) installed and running on the system.

You should already have Java (version 1.8.0 is used in this tutorial) installed on the system.

You should have set JAVA_HOME accordingly.


Before we start with the actual process, change the user to 'hduser' (the user used for Hadoop):

su - hduser

Steps :

Flume, library and source code setup

  1. Create a new directory named 'FlumeTutorial', give it read, write and execute permissions, and copy the files MyTwitterSource.java and MyTwitterSourceForFlume.java into it:

sudo mkdir FlumeTutorial
sudo chmod -R 777 FlumeTutorial

Download Input Files From Here

Check the file permissions of all these files; if 'read' permissions are missing, then grant them.

2. Download 'Apache Flume' from https://flume.apache.org/download.html

Apache Flume 1.4.0 is used in this tutorial.


3. Copy the downloaded tarball into the directory of your choice and extract its contents using the following command:

sudo tar -xvf apache-flume-1.4.0-bin.tar.gz

This command will create a new directory named apache-flume-1.4.0-bin and extract the files into it. This directory will be referred to as <Installation Directory of Flume> in the rest of this article.

4. Flume library setup

Copy twitter4j-core-4.0.1.jar, flume-ng-configuration-1.4.0.jar, flume-ng-core-1.4.0.jar, flume-ng-sdk-1.4.0.jar to

<Installation Directory of Flume>/lib/

It is possible that some or all of the copied JARs will have execute permission. This may cause issues with the compilation of the code, so revoke the execute permission on any such JAR.

In my case, twitter4j-core-4.0.1.jar had execute permission. I revoked it as follows:

sudo chmod -x twitter4j-core-4.0.1.jar

After this command, give 'read' permission on twitter4j-core-4.0.1.jar to all users:

sudo chmod a+r /usr/local/apache-flume-1.4.0-bin/lib/twitter4j-core-4.0.1.jar

Please note that I have downloaded:

– twitter4j-core-4.0.1.jar from http://mvnrepository.com/artifact/org.twitter4j/twitter4j-core

– all Flume JARs, i.e. flume-ng-*-1.4.0.jar, from http://mvnrepository.com/artifact/org.apache.flume

Load data from Twitter using Flume

1. Go to the directory containing the source code files.

2. Set CLASSPATH to contain <Flume Installation Dir>/lib/* and ~/FlumeTutorial/flume/mytwittersource/*

export CLASSPATH="/usr/local/apache-flume-1.4.0-bin/lib/*:$HOME/FlumeTutorial/flume/mytwittersource/*"

3. Compile the source code using the command:

javac -d . MyTwitterSourceForFlume.java MyTwitterSource.java

4. Create the JAR

First, create a Manifest.txt file using a text editor of your choice and add the line below to it:

Main-Class: flume.mytwittersource.MyTwitterSourceForFlume

Here, flume.mytwittersource.MyTwitterSourceForFlume is the name of the main class. Please note that you have to hit the Enter key at the end of this line.

Now, create the JAR 'MyTwitterSourceForFlume.jar' as:

jar cfm MyTwitterSourceForFlume.jar Manifest.txt flume/mytwittersource/*.class

5. Copy this jar to <Flume Installation Directory>/lib/

sudo cp MyTwitterSourceForFlume.jar <Flume Installation Directory>/lib/

6. Go to configuration directory of Flume, <Flume Installation Directory>/conf

If flume.conf does not exist, then copy flume-conf.properties.template and rename it to flume.conf

sudo cp flume-conf.properties.template flume.conf

If flume-env.sh does not exist, then copy flume-env.sh.template and rename it to flume-env.sh

sudo cp flume-env.sh.template flume-env.sh

7. Create a Twitter application by signing in to https://dev.twitter.com/user/login?destination=home

a. Go to 'My applications' (this option appears in the drop-down menu when the 'Egg' button at the top-right corner is clicked).

b. Create a new application by clicking ‘Create New App’

c. Fill in the application details by specifying the name of the application, a description and a website. You may refer to the notes given underneath each input box.

d. Scroll down the page and accept terms by marking ‘Yes, I agree’ and click on button ‘Create your Twitter application’

e. In the window of the newly created application, go to the 'API Keys' tab, scroll down the page and click the 'Create my access token' button.

f. Refresh the page.

g. Click on ‘Test OAuth’. This will display ‘OAuth’ settings of application.

h. Modify ‘flume.conf’ (created in Step 6) using these OAuth settings. Steps to modify ‘flume.conf’ are given in step 8 below.

We need to copy the Consumer key, Consumer secret, Access token and Access token secret to update 'flume.conf'.

Note: These values belong to the user and hence are confidential, so they should not be shared.

8. Open 'flume.conf' in write mode and set values for the parameters below.

[A]

sudo gedit flume.conf

Copy the contents below:
MyTwitAgent.sources = Twitter
MyTwitAgent.channels = MemChannel
MyTwitAgent.sinks = HDFS
MyTwitAgent.sources.Twitter.type = flume.mytwittersource.MyTwitterSourceForFlume
MyTwitAgent.sources.Twitter.channels = MemChannel
MyTwitAgent.sources.Twitter.consumerKey = <Copy consumer key value from Twitter App>
MyTwitAgent.sources.Twitter.consumerSecret = <Copy consumer secret value from Twitter App>
MyTwitAgent.sources.Twitter.accessToken = <Copy access token value from Twitter App>
MyTwitAgent.sources.Twitter.accessTokenSecret = <Copy access token secret value from Twitter App>
MyTwitAgent.sources.Twitter.keywords = guru99
MyTwitAgent.sinks.HDFS.channel = MemChannel
MyTwitAgent.sinks.HDFS.type = hdfs
MyTwitAgent.sinks.HDFS.hdfs.path = hdfs://localhost:54310/user/hduser/flume/tweets/
MyTwitAgent.sinks.HDFS.hdfs.fileType = DataStream
MyTwitAgent.sinks.HDFS.hdfs.writeFormat = Text
MyTwitAgent.sinks.HDFS.hdfs.batchSize = 1000
MyTwitAgent.sinks.HDFS.hdfs.rollSize = 0
MyTwitAgent.sinks.HDFS.hdfs.rollCount = 10000
MyTwitAgent.channels.MemChannel.type = memory
MyTwitAgent.channels.MemChannel.capacity = 10000
MyTwitAgent.channels.MemChannel.transactionCapacity = 1000

[B]

Also, set MyTwitAgent.sinks.HDFS.hdfs.path as below:

MyTwitAgent.sinks.HDFS.hdfs.path = hdfs://<Host Name>:<Port Number>/<HDFS Home Directory>/flume/tweets/

To find <Host Name>, <Port Number> and <HDFS Home Directory>, see the value of the parameter 'fs.defaultFS' set in $HADOOP_HOME/etc/hadoop/core-site.xml

[C]

In order to flush the data to HDFS as and when it arrives, delete the entry below if it exists:

MyTwitAgent.sinks.HDFS.hdfs.rollInterval = 600

9. Open 'flume-env.sh' in write mode and set values for the parameters below:

JAVA_HOME=<Installation directory of Java>

FLUME_CLASSPATH=”<Flume Installation Directory>/lib/MyTwitterSourceForFlume.jar”

10. Start Hadoop

$HADOOP_HOME/sbin/start-dfs.sh

$HADOOP_HOME/sbin/start-yarn.sh

11. Two of the JAR files in the Flume tarball are not compatible with Hadoop 2.2.0, so we need to follow the steps below to make Flume compatible with Hadoop 2.2.0.

a. Move protobuf-java-2.4.1.jar out of ‘<Flume Installation Directory>/lib’.

Go to ‘<Flume Installation Directory>/lib’

cd <Flume Installation Directory>/lib

sudo mv protobuf-java-2.4.1.jar ~/

b. Find the 'guava' JAR file as below:

find . -name "guava*"

Move guava-10.0.1.jar out of ‘<Flume Installation Directory>/lib’.

sudo mv guava-10.0.1.jar ~/

c. Download guava-17.0.jar from http://mvnrepository.com/artifact/com.google.guava/guava/17.0

Now, copy this downloaded jar file to ‘<Flume Installation Directory>/lib’

12. Go to '<Flume Installation Directory>/bin' and start Flume as:

./flume-ng agent -n MyTwitAgent -c conf -f <Flume Installation Directory>/conf/flume.conf

(Screenshot: command prompt window where Flume is fetching tweets.)

From the command window messages, we can see that the output is written to the /user/hduser/flume/tweets/ directory.

Now, open this directory using a web browser.

13. To see the result of the data load, open http://localhost:50070/ in a browser and browse the file system, then go to the directory where the data has been loaded, that is:

<HDFS Home Directory>/flume/tweets/
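
Alternatively (assuming the default 'FlumeData' file prefix of the HDFS sink and the path configured above), the loaded files can be inspected from the command line:

# list and peek at the files written by the HDFS sink
$HADOOP_HOME/bin/hdfs dfs -ls /user/hduser/flume/tweets/
$HADOOP_HOME/bin/hdfs dfs -cat /user/hduser/flume/tweets/FlumeData.* | head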

How does data get into Hadoop?

There are numerous ways to get data into Hadoop. Here are just a few:

  • You can load files to the file system using simple Java commands, and HDFS takes care of making multiple copies of data blocks and distributing those blocks over multiple nodes in Hadoop.
  • If you have a large number of files, a shell script that runs multiple “put” commands in parallel will speed up the process (a minimal “put” example is sketched after this list). You don’t have to write MapReduce code.
  • Create a cron job to scan a directory for new files and “put” them in HDFS as they show up. This is useful for things like downloading email at regular intervals.
  • Mount HDFS as a file system and simply copy or write files there.
  • Use Sqoop to import structured data from a relational database to HDFS, Hive and HBase. It can also extract data from Hadoop and export it to relational databases and data warehouses.
  • Use Flume to continuously load data from logs into Hadoop.
  • Use third-party vendor connectors.
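
A minimal sketch of the “put” approach with the HDFS shell (the paths are hypothetical; it assumes the hadoop client is configured for the cluster):

# create a target directory and copy local files into HDFS
hdfs dfs -mkdir -p /user/hduser/incoming
hdfs dfs -put logs/*.log /user/hduser/incoming/
hdfs dfs -ls /user/hduser/incoming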

Cisco Bets Big On Selling Hadoop

Cisco doesn’t think you should roll your own big data systems, so it’s bringing together all the hardware, software, and services you might need to quickly deploy, integrate, and scale up Hadoop deployments over time.

On Wednesday Cisco announced reseller agreements with Hadoop’s big three — Cloudera, Hortonworks, and MapR — so the company and its partners can offer Hadoop distributions along with other software and services aimed at rapid and trouble-free deployment on the Cisco Unified Computing System (UCS) Director Express for Big Data.

UCS is Cisco's fast-growing integrated hardware-and-software system offering that combines compute, networking, and storage, and offers virtualization and management software. UCS was a big part of the 30% increase in server revenue Cisco racked up during 2014, as reported this week by Gartner.

[ Want more on this topic? Read IBM Slumps, Cisco Gains In 2014 Server Sales. ]

With Hadoop deployments in mind, Cisco is offering prebuilt configurations of UCS Director Express for Big Data, incorporating UCS C240 M4 Rack Servers, UCS 6200 Fabric Interconnects, and data-virtualization software. It also incorporates UCS Director Express management software that handles Hadoop system deployment and administrative tasks in conjunction with Cloudera, Hortonworks, and MapR management software.

Where roll-your-own Hadoop deployments are often fraught with challenges in getting up and running and, later, scaling up as demands grow, Jim McHugh, VP of products and solutions marketing for Cisco UCS, said UCS Director Express for Big Data all but eliminates these problems.

“This is much more than just a reference architecture,” said McHugh in a phone interview with InformationWeek. “Cisco UCS service-profile components let customers manage by racks or blades, and templates let you apply those policies across multiple racks. In the world of Hadoop, inconsistency quickly leads to disaster, so we put a lot of effort into eliminating those sorts of problems.”

Cisco sells primarily through resellers, all of whom will have access to the new software and systems for big data. The Hadoop distributions and software will be SKUs on Cisco’s master price list. Cisco is working especially closely with big-data-specialized distributors and resellers in key regional markets around the globe, McHugh said.

Additional software Cisco can apply in big-data deployments includes Connected Analytics data-access and data-virtualization technologies, obtained in Cisco’s 2013 acquisition of Composite Software. This software is used to link data warehouses and operational systems to Hadoop. Cisco and its resellers can also act as consultants, bringing in Cisco partners such as Informatica for data movement, data integration, and data cleansing, or Splunk, SAP, SAS, Platfora, or other analytics vendors.

Cisco also has vertical industry practices and a Connected Analytics team that offers consulting services, McHugh said. He cited sports arena connectivity, Internet-of-Things-style predictive maintenance, and supply chain applications in manufacturing and connected-retailing deployments as notable areas where Cisco has deep expertise.

“We can know when customers walk into a retail location and, based on loyalty program activity, we might also know what they were recently looking at online,” McHugh said. “By collecting and storing that data in a [Hadoop] data lake and applying analytics, we can develop a more holistic customer view and figure out better promotion, product-placement, and stocking strategies.”