How big data is changing the database landscape for good

Mention the word “database,” and most people think of the venerable RDBMS that has dominated the landscape for more than 30 years. That, however, may soon change.

A whole crop of new contenders is now vying for a piece of this key enterprise market, and while their approaches are diverse, most share one thing in common: a razor-sharp focus on big data.

Much of what’s driving this new proliferation of alternatives is what’s commonly referred to as the “three V’s” underlying big data: volume, velocity and variety.

Essentially, data today is coming at us faster and in greater volumes than ever before; it’s also more diverse. It’s a new data world, in other words, and traditional relational database management systems weren’t really designed for it.

“Basically, they cannot scale to big, or fast, or diverse data,” said Gregory Piatetsky-Shapiro, president of KDnuggets, an analytics and data-science consultancy.

That’s what Harte Hanks recently found. Up until 2013 or so, the marketing services agency was using a combination of different databases including Microsoft SQL Server and Oracle Real Application Clusters (RAC).

“We were noticing that with the growth of data over time, our systems couldn’t process the information fast enough,” said Sean Iannuzzi, the company’s head of technology and development. “If you keep buying servers, you can only keep going so far. We wanted to make sure we had a platform that could scale outward.”

Minimizing disruption was a key goal, Iannuzzi said, so “we couldn’t just switch to Hadoop.”

Instead, the company chose Splice Machine, which essentially puts a full SQL database on top of the popular Hadoop big-data platform and allows existing applications to connect with it, he said.

Harte Hanks is now in the early stages of implementation, but it’s already seeing benefits, Iannuzzi said, including improved fault tolerance, high availability, redundancy, stability and “performance gains overall.”

There’s a sort of perfect storm propelling the emergence of new database technologies, said Carl Olofson, a research vice president with IDC.

First, “the equipment we’re using is much more capable of handling large data collections quickly and flexibly than in the past,” Olofson noted.

In the old days, such collections “pretty much had to be put on spinning disk” and the data had to be structured in a particular way, he explained.

Now there’s 64-bit addressability, making it possible to set up larger memory spaces, as well as much faster networks and the ability to string multiple computers together to act as single, large databases.

“Those things have opened up possibilities that weren’t available before,” Olofson said.

Workloads, meanwhile, have also changed. Whereas 10 years ago websites were largely static, for example, today we have live Web service environments and interactive shopping experiences. That, in turn, demands new levels of scalability, he said.

Companies are using data in new ways as well. Whereas traditionally most of our focus was on processing transactions — recording how much we sold, for instance, and storing that data in a place where it could be analyzed — today we’re doing more.

Application state management is one example.

Say you’re playing an online game. The technology must record each session you have with the system and connect them together to present a continuous experience, even if you switch devices or the various moves you make are processed by different servers, Olofson explained.

That data must be made persistent so that companies can analyze questions such as “why no one ever crosses the crystal room,” for example. In an online shopping context, a counterpart might be why more people aren’t buying a particular brand of shoe after they click on the color choices.
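As a minimal sketch of what that looks like in code (my own illustration, not anything Olofson or a vendor describes), the snippet below appends each gameplay event to a per-player history in Redis, so whichever server handles the next request can read the whole session back and later analysis can replay it.

```python
# Minimal session-state sketch, assuming a Redis instance at localhost:6379
# (illustrative only; the key layout and event fields are hypothetical).
import json
import time

import redis

r = redis.Redis(host="localhost", port=6379, db=0)

def record_event(player_id: str, event: dict) -> None:
    """Append one gameplay event to the player's session history."""
    event["ts"] = time.time()
    r.rpush(f"session:{player_id}", json.dumps(event))

def load_session(player_id: str) -> list:
    """Rebuild the full session, regardless of which server wrote each event."""
    return [json.loads(e) for e in r.lrange(f"session:{player_id}", 0, -1)]

record_event("player42", {"action": "enter_room", "room": "crystal"})
record_event("player42", {"action": "leave_room", "room": "crystal"})
print(load_session("player42"))
```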

“Before, we weren’t trying to solve those problems, or — if we were — we were trying to squeeze them into a box that didn’t quite fit,” Olofson said.

Hadoop is a heavyweight among today’s new contenders. Though it’s not a database per se, it’s grown to fill a key role for companies tackling big data. Essentially, Hadoop is a data-centric platform for running highly parallelized applications, and it’s very scalable.

By allowing companies to scale “out” in distributed fashion rather than scaling “up” onto bigger, more expensive servers, “it makes it possible to very cheaply put together a large data collection and then see what you’ve got,” Olofson said.
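As a rough illustration of the highly parallelized, scale-out style of work Hadoop runs, here is a classic Hadoop Streaming-style word count in Python: the mapper and reducer each read standard input and write standard output, so the framework can fan the work out across many machines. The file name and local test command are my own assumptions, not from the article.

```python
# wordcount_streaming.py -- a minimal Hadoop Streaming-style job (illustrative).
# Test the two stages locally before submitting to a cluster:
#   cat input.txt | python wordcount_streaming.py map | sort | python wordcount_streaming.py reduce
import sys
from itertools import groupby

def mapper():
    # Emit "word<TAB>1" for every word seen on standard input.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Hadoop delivers mapper output sorted by key, so equal words are adjacent.
    pairs = (line.rstrip("\n").split("\t") for line in sys.stdin)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        print(f"{word}\t{sum(int(count) for _, count in group)}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```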

Among other new RDBMS alternatives are the NoSQL family of offerings, including MongoDB — currently the fourth most popular database management system, according to DB-Engines — and MarkLogic.

“Relational has been a great technology for 30 years, but it was built in a different era with different technological constraints and different market needs,” said Joe Pasqua, MarkLogic’s executive vice president for products.

Big data is not homogeneous, he said, yet homogeneity is still a fundamental requirement of many traditional technologies.

“Imagine the only program you had on your laptop was Excel,” Pasqua said. “Imagine you want to keep track of a network of friends — or you’re writing a contract. Those don’t fit into rows and columns.”

Combining data sets can be particularly tricky.

“Relational says that before you bring all these data sets together, you have to decide how you’re going to line up all the columns,” he added. “We can take in any format or structure and start using it immediately.”

NoSQL databases don’t use a relational data model, and they typically have no SQL interface. Whereas many NoSQL stores compromise consistency in favor of speed and other factors, MarkLogic pitches its own offering as a more consistency-minded option tailored for enterprises.
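To make the “take in any format or structure and start using it immediately” point concrete, here is a small sketch using MongoDB (mentioned above) via pymongo. The connection string, collection, and field names are hypothetical; the point is that documents with different shapes land in the same collection with no upfront schema alignment.

```python
# Illustrative only: heterogeneous documents stored side by side,
# assuming a local MongoDB instance (connection details are hypothetical).
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
records = client.demo.records

records.insert_many([
    {"type": "contact", "name": "Ada", "friends": ["Grace", "Edsger"]},
    {"type": "contract", "parties": ["Acme", "Globex"], "term_months": 12},
    {"type": "clickstream", "shoe_brand": "ExampleCo", "colors_viewed": 3},
])

# Query immediately -- no schema migration required.
for doc in records.find({"type": "contract"}):
    print(doc["parties"], doc["term_months"])
```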

There’s considerable growth in store for the NoSQL market, according to Market Research Media, but not everyone thinks it’s the right approach — at least, not in all cases.

NoSQL systems “solved many problems with their scale-out architecture, but they threw out SQL,” said Monte Zweben, Splice Machine’s CEO. That, in turn, poses a problem for existing code.

Splice Machine is an example of a different class of alternatives known as NewSQL — another category expecting strong growth in the years ahead.

“Our philosophy is to keep the SQL but add the scale-out architecture,” Zweben said. “It’s time for something new, but we’re trying to make it so people don’t have to rewrite their stuff.”

Deep Information Sciences has also chosen to stick with SQL, but it takes yet another approach.

The company’s DeepSQL database uses the same application programming interface (API) and relational model as MySQL, meaning that no application changes are required in order to use it. But it addresses data in a different way, using machine learning.

DeepSQL can automatically adapt for physical, virtual or cloud hosts using any workload combination, the company says, thereby eliminating the need for manual database optimization.

Among the results are greatly increased performance as well as the ability to scale “into the hundreds of billions of rows,” said Chad Jones, the company’s chief strategy officer.
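Because DeepSQL presents the same API as MySQL, the claim is that ordinary MySQL client code runs unchanged against it. Below is a hedged sketch using the PyMySQL driver; the host, credentials, schema, and query are hypothetical, and in principle only the connection details would change when pointing the same code at a DeepSQL-backed server.

```python
# Hypothetical connection details; the point is that this is plain MySQL
# client code, which a MySQL-API-compatible engine should accept unchanged.
import pymysql

conn = pymysql.connect(host="db.example.com", user="app",
                       password="secret", database="sales")
try:
    with conn.cursor() as cur:
        cur.execute(
            "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY 2 DESC"
        )
        for region, total in cur.fetchall():
            print(region, total)
finally:
    conn.close()
```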

An altogether different approach comes from Algebraix Data, which says it has developed the first truly mathematical foundation for data.

Whereas computer hardware is modeled mathematically before it’s built, that’s not the case with software, said Algebraix CEO Charles Silver.

“Software, and especially data, has never been built on a mathematical foundation,” he said. “Software has largely been a matter of linguistics.”

Following five years of R&D, Algebraix has created what it calls an “algebra of data” that taps mathematical set theory for “a universal language of data,” Silver said.

“The dirty little secret of big data is that data still sits in little silos that don’t mesh with other data,” Silver explained. “We’ve proven it can all be represented mathematically, so it all integrates.”

Equipped with a platform built on that foundation, Algebraix now offers companies business analytics as a service. Improved performance, capacity and speed are all among the benefits Algebraix promises.

Time will tell which new contenders succeed and which do not, but in the meantime, longtime leaders such as Oracle aren’t exactly standing still.

“Software is a very fashion-conscious industry,” said Andrew Mendelsohn, executive vice president for Oracle Database Server Technologies. “Things often go from popular to unpopular and back to popular again.”

Many of today’s startups are “bringing back the same old stuff with a little polish or spin on it,” he said. “It’s a new generation of kids coming out of school and reinventing things.”

SQL is “the only language that lets business analysts ask questions and get answers — they don’t have to be programmers,” Mendelsohn said. “The big market will always be relational.”

As for new types of data, relational database products evolved to support unstructured data back in the 1990s, he said. In 2013, Oracle’s namesake database added support for JSON (JavaScript Object Notation) in version 12c.
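For illustration only, here is a hedged sketch of querying JSON stored in a relational table from Python with the cx_Oracle driver and the JSON_VALUE/JSON_EXISTS SQL functions that came with Oracle's 12c-era JSON support; the table, column, and connection details are hypothetical.

```python
# Hypothetical schema: an ORDERS table with a text column named PAYLOAD
# holding JSON documents. Connection details are placeholders.
import cx_Oracle

conn = cx_Oracle.connect("app", "secret", "dbhost/orclpdb1")
cur = conn.cursor()
cur.execute("""
    SELECT order_id,
           JSON_VALUE(payload, '$.customer.name') AS customer
    FROM   orders
    WHERE  JSON_EXISTS(payload, '$.customer')
""")
for order_id, customer in cur:
    print(order_id, customer)
conn.close()
```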

Rather than a need for a different kind of database, it’s more a shift in business model that’s driving change in the industry, Mendelsohn said.

“The cloud is where everybody is going, and it’s going to disrupt these little guys,” he said. “The big guys are all on the cloud already, so where is there room for these little guys?

“Are they going to go on Amazon’s cloud and compete with Amazon?” he added. “That’s going to be hard.”

Oracle has “the broadest spectrum of cloud services,” Mendelsohn said. “We’re feeling good about where we’re positioned today.”

Rick Greenwald, a research director with Gartner, is inclined to take a similar view.

“The newer alternatives are not as fully functional and robust as traditional RDBMSes,” Greenwald said. “Some use cases can be addressed with the new contenders, but not all, and not with one technology.”

Looking ahead, Greenwald expects traditional RDBMS vendors to feel increasing price pressure, and to add new functionality to their products. “Some will freely bring new contenders into their overall ecosystem of data management,” he said.

As for the new guys, a few will survive, he predicted, but “many will either be acquired or run out of funding.”

Today’s new technologies don’t represent the end of traditional RDBMSes, “which are rapidly evolving themselves,” agreed IDC’s Olofson. “The RDBMS is needed for well-defined data — there’s always going to be a role for that.”

But there will also be a role for some of the newer contenders, he said, particularly as the Internet of Things and emerging technologies such as Non-Volatile Dual In-line Memory Module (NVDIMM) take hold.

There will be numerous problems requiring numerous solutions, Olofson added. “There’s plenty of interesting stuff to go around.”

Who will build the Government-as-a-Service platform?

Posted 18 Jun 2015

I’ve lived in many cities during my military career. Each time I’ve moved, I’ve had to deal with a new city’s website, and what I’ve learned is that there are great differences in site design and in how much government data is online and accessible.

Some cities’ websites give the impression that they’ve put serious effort into open government, while others look as though the site was built in 1995 and the only updates have been to the names and phone numbers of staff.

There is much to be said about open government. While there are many different open government movements, I’ve not yet seen a “platform” that is available for local governments to use. There is a company called OpenGov that does address local government financial transparency, and that is a start, but it falls woefully short if you want a fully transparent local government.

If you look at the White House webpage on open government, the focus is on providing White House data. To their credit, they even provide some developer tools, such as an API for some data, but what they don’t provide is anything for local governments to use. I think that a transparent government is a better government, but it is rare to find a city that has leveraged the web for a better form of government in this way.

My hope is to start a discussion that helps spur ideas toward an open government built on open source.

Business case

A business could be built around supporting local governments (in the US and abroad) with a standardized platform that could be tailored to their needs. Think of WordPress for government, or perhaps OpenStack for government. I think that some local governments are moving toward Facebook as a platform, and I really think that is a mistake. It is easy to use and free, so it’s easy for local governments to just put their data into the walled garden of Facebook without thinking about it.

If a different solution were available that was just as easy and provided a tailored experience, I think local governments would leverage it instead. It can make sense at some level for a city to push its data to Facebook, but not for Facebook to be the authoritative repository.

Cities with a great website and online presence will not be as interested as others, who might see a turn-key hosted solution as a no-brainer. The business part of this is the customization and hosting of the platform. There are sufficient technologies available to make this scalable from a small town to a large city. Money can be made from the customization, the amount of data stored, the available interfaces, the different capabilities offered, and finally tech support.

So, if a local government wants to try to save money by hosting this solution themselves, they can. But if they want someone else to host it for them, that solution is available for a price.

Requirements

So, what are the requirements for a Government-as-a-Service Platform?

There are three different sets of requirements to look at: data, interfaces, and website. There are probably many ways to break up the requirements, but for my purposes, data, interfaces, and website will suffice.

The specific data, interface, and website requirements will be driven by the actual use cases each city needs. Whoever takes this on as a business will need to talk to many cities to see what the common requirements should be. At a minimum, I think the platform should act as a phone book, hold the city ordinances, and store all of the different committee and department rules, regulations, and meeting minutes. After the minimum is complete, and the platform is up and running, additional requirements can be addressed.

Government runs on data. So, from a data standpoint, a central encrypted database to store the data will be important. This means being able to import, store, and export data. The key issue relating to the data is security. The database needs to be encrypted at the record level, so that even if it is compromised, the stolen records are useless on their own. Not an easy task, but necessary. Given the different types of data expected, I’d recommend a NoSQL approach.

The interfaces are perhaps the most important and hardest part of the platform. The interfaces, or APIs, are how the data is exposed: to the website, but also to third-party applications. Given that, security is important, and the public APIs must only allow the appropriate data to be released.
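As a rough sketch of the record-level encryption point above (my own illustration, using symmetric Fernet keys from Python’s cryptography library, not a vetted design for government data), sensitive fields are encrypted before a record ever reaches the datastore:

```python
# Illustrative only: encrypt each record's sensitive fields before storage,
# so a dump of the database alone is not useful to an attacker.
import json
from cryptography.fernet import Fernet

key = Fernet.generate_key()   # in practice, managed by a key service, not generated inline
box = Fernet(key)

def seal(record: dict, sensitive: set) -> dict:
    """Return a copy of the record with sensitive fields encrypted."""
    sealed = {}
    for field, value in record.items():
        if field in sensitive:
            sealed[field] = box.encrypt(json.dumps(value).encode()).decode()
        else:
            sealed[field] = value
    return sealed

def unseal(record: dict, sensitive: set) -> dict:
    """Decrypt the sensitive fields of a stored record."""
    return {
        f: json.loads(box.decrypt(v.encode())) if f in sensitive else v
        for f, v in record.items()
    }

row = {"permit_id": 1017, "applicant": "Jane Q. Public", "phone": "555-0100"}
stored = seal(row, {"applicant", "phone"})
print(stored)                                   # safe to persist in the document store
print(unseal(stored, {"applicant", "phone"}))   # readable only with the key
```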

The city website

A city’s website should be:

  • easy for citizens to navigate
  • easy for the city to update

The website will need to be branded for each city, and it may be important to make it easy to tailor. Posting to your city’s website should be as easy as posting to Facebook. There are a few different ways of going about this, but here’s what I recommend:

Start with an existing open source Platform-as-a-Service (PaaS). Then, host it on an Infrastructure-as-a-Service (IaaS) platform. For example, use OpenShift as the PaaS and OpenStack as the IaaS. Begin with a customized version of OpenShift as a starting template for each city. This template will host the basic capabilities for each city. Then, copy the template, tailor it for your specific city, and host that on OpenStack.

The advantages of this open source approach are that while you can pay for support, you also have the option to support it yourself. And if you have specific requirements that should be incorporated into either OpenShift or OpenStack, you can submit the changes upstream.

I believe having a platform for local governments would be a great business. Anything that helps citizens have a better relationship with their local government is worth doing. And having an open government built on open source just seems like the right thing to do.

What I Learned About MapR

MapR, based in San Jose, California, provides a commercial version of Hadoop noted for its fast performance.  This week at the Strata Conference, I got a chance to talk to the folks at MapR and found out how MapR differentiates itself from other Hadoop offerings.

[Photo: MapR booth at the Strata Conference]

MapR’s speed appears to come from its filesystem design. It’s fully compatible with standard open source Hadoop, including Hadoop 2.x, YARN, and HBase, but uses a more optimized filesystem structure to provide the additional speed boost.

MapR promotes the following benefits:

  • No single point of failure
    Normally the NameNode is the single point of failure for a Hadoop installation.  MapR’s design avoids this issue.
  • NFS mount data files
    MapR lets you mount the cluster’s filesystem over NFS. This saves you from having to copy files into MapR, and you may not even need ingestion tools like Flume. Writing directly into the files also opens up additional options, such as querying Hadoop on near-real-time data (see the sketch after this list).
  • Fast access
    MapR has clocked record sort performance, sorting 1.5 trillion bytes in one minute using its MapR Hadoop software on the Google Compute Engine cloud service.
  • Binary compatible with Hadoop
    MapR is binary compatible with open source Hadoop, which gives you more flexibility in adding other third-party components or migrating.
  • Enterprise support
    Professional services, enterprise support, and training and certifications are available.
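
To make the NFS point above concrete, here is a small sketch: because the cluster filesystem is mounted like an ordinary directory, a plain Python process can append records to it directly, with no separate copy step. The /mapr mount path and file layout below are assumptions for illustration, not taken from MapR documentation.

```python
# Illustrative only: write events straight to an NFS-mounted cluster volume.
# The mount point below is an assumed convention, not from the article.
import json
import time
from pathlib import Path

MOUNT = Path("/mapr/demo.cluster.example/apps/clickstream")

def append_event(event: dict) -> None:
    """Append one JSON event to an hourly file on the mounted cluster filesystem."""
    MOUNT.mkdir(parents=True, exist_ok=True)
    hourly_file = MOUNT / time.strftime("events-%Y%m%d-%H.jsonl")
    with hourly_file.open("a") as fh:
        fh.write(json.dumps(event) + "\n")

append_event({"user": "u123", "action": "click", "ts": time.time()})
# Downstream Hadoop jobs can query these files as they land, which is the
# near-real-time option described in the bullet above.
```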

MapR has attracted a number of featured customers including the following:

  • Comscore
  • Cision
  • Linksmart
  • HP
  • Return Path
  • Dotomi
  • Solutionary
  • Trueffect
  • Sociocast
  • Zions Bank
  • Live Nation
  • Cisco
  • Rubicon Project

MapR is also partnering with both Google and Amazon Web Services for cloud-based Hadoop systems.

MapR currently comes in three editions:

  • M3 Standard Edition
  • M5 Enterprise Edition (with “99.999% high availability and self-healing”)
  • M7 Enterprise Edition for Hadoop (with fast database)

Additionally, in conjunction with the Strata Conference this week, MapR has announced the release of the MapR Sandbox. Any user can download the MapR Sandbox for free and run a full MapR Hadoop installation within a VMware or VirtualBox virtual machine. This sandbox provides a suitable learning environment for those who want to experience the use and operation of MapR Hadoop without investing a lot of effort in the installation. I haven’t downloaded and installed the MapR Sandbox yet. If you have already done this and tried it out, tell me what you think in the comments below.

MapR website: http://www.mapr.com

Big Data Analytics – What is that?

In a recent estimate, IBM calculated that 2.5 quintillion bytes of data are created every day – so much that 90% of the data in the world today has been created in the last two years alone. It is a mind-boggling figure, and the irony is that we feel less informed despite having more information available than ever.

This surprising growth in data volumes has hit today’s businesses hard. Online users create content such as blog posts, tweets, social-networking interactions, and photos, while servers continuously log messages about what those users are doing.

This online data comes from posts on social media sites like Facebook and Twitter, YouTube videos, cell phone call records, and so on. This is what we call Big Data.

WHAT IS BIG DATA?

Big Data refers to datasets that keep growing until they become difficult to manage with existing database management concepts and tools. The difficulty can relate to data capture, storage, search, sharing, analytics, visualization, and more.

Big Data spans three dimensions: Volume, Velocity, and Variety.

  • Volume – The size of the data is very large, measured in terabytes and petabytes.
  • Velocity – The data should be used as it streams into the enterprise in order to maximize its value to the business; the role of time is critical here.
  • Variety – The data extends beyond structured records to include unstructured data of all varieties: text, audio, video, posts, log files, and more.

WHY BIG DATA?

When an enterprise can leverage all the information available to it rather than just a subset of its data, it has a powerful advantage over its market competitors. Big Data can help it gain insights and make better decisions.

Big Data presents an opportunity to create unprecedented business advantage and better service delivery. It also requires new infrastructure and a new way of thinking about how business and the IT industry work. The concept of Big Data is going to change the way we do things today.

A study by International Data Corporation (IDC) predicts that overall data will grow by 50 times by 2020, driven in large part by more embedded systems such as sensors in clothing, in medical devices, and in structures like buildings and bridges. The study also determined that unstructured information – such as files, email, and video – will account for 90% of all data created over the next decade. But the number of IT professionals available to manage all that data will grow to only 1.5 times today’s levels.

The digital universe is 1.8 trillion gigabytes in size, stored in 500 quadrillion files, and it more than doubles in size every two years. To put that in perspective, there are nearly as many bits of information in the digital universe as there are stars in our physical universe.
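As a quick back-of-the-envelope illustration of how doubling every two years compounds, the snippet below projects the 1.8-trillion-gigabyte figure forward; the 2011 baseline year is my assumption, and the output is only meant to show the shape of the growth, not to reproduce IDC’s model.

```python
# Back-of-the-envelope projection: size doubles every two years.
# Starting size (1.8 trillion GB, roughly 1.8 zettabytes) is the article's figure;
# the 2011 baseline year is an assumption for illustration.
start_year, start_size_zb = 2011, 1.8

for year in range(start_year, start_year + 11, 2):
    size_zb = start_size_zb * 2 ** ((year - start_year) / 2)
    print(f"{year}: ~{size_zb:.1f} ZB")
```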

CHARACTERISTICS OF BIG DATA

A Big Data platform should provide a solution designed specifically with the needs of the enterprise in mind. The following are the basic features of a Big Data offering:

  • Comprehensive – It should offer a broad platform and address all three dimensions of the Big Data challenge: Volume, Variety, and Velocity.
  • Enterprise-ready – It should include performance, security, usability, and reliability features.
  • Integrated – It should simplify and accelerate the introduction of Big Data technology to the enterprise, and enable integration with the information supply chain, including databases, data warehouses, and business intelligence applications.
  • Open source based – It should be built on open source technology with enterprise-class functionality and integration.
  • Low-latency reads and updates
  • Robust and fault-tolerant
  • Scalable
  • Extensible
  • Allows ad hoc queries
  • Minimal maintenance

BIG DATA CHALLENGES

The main challenges of Big Data are data variety, volume, analytical workload complexity, and agility. Many organizations are struggling to deal with the increasing volumes of data. In order to solve this problem, organizations need to reduce the amount of data being stored and exploit new storage techniques that can further improve performance and storage utilization.

SUMMARY AND CONCLUSION

Big Data is a new gold rush and a key enabler of the social business. Without Big Data analytics, a large or medium-sized company can neither make sense of all the user-generated content online nor collaborate effectively with customers, suppliers, and partners on social media channels. Collaboration with customers and insights from user-generated online content are critical for success in the age of social media.

In a study by McKinsey’s Business Technology Office and the McKinsey Global Institute (MGI), the firm calculated that the U.S. faces a shortage of 140,000 to 190,000 people with analytical expertise and 1.5 million managers and analysts with the skills to understand and make decisions based on the analysis of Big Data.

The biggest gap, by a factor of 10x, is the lack of skilled managers able to make decisions based on analysis. Growing talent and building teams that make analytics-based decisions is the key to realizing the value of Big Data.

Thank you for reading. Happy learning!