Virtual-hiking: Hadoop Summit and Hadoop as a Service

At the beginning of this year, I mentioned working on Big Data and virtualization, and it has been a fruitful time. Next week I will be co-presenting with Chris Mutchler from Adobe on "Hadoop-as-a-Service for Lifecycle Management Simplicity" at the Hadoop Summit conference in San Jose, CA. Our session will be on Wednesday from 4:35pm-5:15pm.

I am humbled and excited to help present alongside other sessions from some of the most respected names in the industry: Yahoo!, Google, Cloudera, Hortonworks, MapR, and Microsoft. The growing depth, evolution, and community of the Big Data ecosystem are impressive, to say the least. I hope to attend other Hadoop customer sessions as well as investigate what other large players are accomplishing with their respective stacks. I see a lot of advanced sessions around new use cases for Hadoop and research into adding additional layers and abstractions to Hadoop. The Adobe session is focused on the usability of Hadoop from an IT operations perspective, with a few key points to make:

Explain why virtualizing Hadoop is good from a business, technical, and operational perspective
Accommodate the evolution and diversity of Big Data solutions
Simplify the lifecycle deployment of these layers for engineering and operations
Create Hadoop-as-a-Service with VMware vSphere, Big Data Extensions, and vCloud Automation Center

Hadoop is truly becoming a complicated stack at this point. After I started getting hands-on and working with customers on Hadoop-specific projects in 2011, I found that calling this new technology Hadoop seemed a bit disingenuous. There was really MapReduce and HDFS, a compute layer and a storage layer. Even though they were tightly coupled, that was enforced for very good and simple reasons. Spending even more time on this has given me more perspective on the different layers and their corresponding workloads. Unless you're only running one type of job for your compute layer and sucking in data from a static set of sources for your data layer, these workloads will vary, and they will vary independently of each other. However, in a physical world, with both layers tightly coupled, how can they scale independently and flexibly?

Enter virtualization and everything I've been working on around virtualizing distributed systems, data analytics, Hadoop, and so forth. Consider the layering of functionality for different distributions and look for the similarities. If you take a look at Cloudera:

Or Hortonworks:

And Pivotal HD:

As a wise donkey once talked about in a movie when describing onions and cakes, they all have layers, and so does any next-gen analytics platform. Now we have our data layer, then a scheduling layer, and on top of that we can look at batch jobs, SQL jobs, streaming, machine learning, etc. There are many moving parts, each one with probable workload variability per application and per customer. What abstraction layer helps pool resources, dynamically moves components for elastic scale-in and scale-out, and allows for flexible deployment of these many moving parts? Virtualization is a good answer, but one of the first questions I get is "How's performance?" Well, I have seen vSphere scale and perform close to bare metal. Listed below is the link to the performance whitepaper detailing performance recommendations that have been tested on large clusters.

All these layers lead to complexity very quickly, so another angle specific to the Adobe Hadoop Summit presentation is hiding this complexity from the end developers and making it easier and faster for them to develop, prototype, and release their analytics features into production. Some sessions are exploring even deeper, more complex uses of Hadoop, and I am eager to see their results. However, enabling this lifecycle management for ops is essential to adoption of the latest functionality of any vendor's Big Data stack. VMware's Big Data Extensions, in this case combined with vCloud Automation Center, allows for self-service consumption and easier experimentation. There's a (disputed) quote that has been attributed to Einstein that states "Everything should be made as simple as possible, but not simpler." There are a few vendors working on making Hadoop easier to consume, and I would argue that simplifying consumption of this technology is a worthwhile goal for the community. Dare I say even Microsoft's vision of allowing Big Data analysis via Excel is actually very intriguing if they can make it that simple to consume.

Another common question I get is "Virtualization is fine for dev/test, but how is it for production?" First, simplicity, elasticity, and flexibility are even more important to a production environment. And maybe more importantly, let's not discount the importance of experimentation to any software development lifecycle. As much as Hadoop enterprise vendors would like to make any Hadoop integration turnkey with any data source, any platform, and any application, I would argue we have a long way to go. Any innovation depends on experimentation: the ability to test out new algorithms, replace layers of the stack, and evaluate and isolate different variables in this distributed system.

One more assumption that keeps coming up is the perception that 100% utilization on a Hadoop cluster equals a high degree of efficiency. I am not a Java guru or an expert Hadoop programmer by any means, but if you think about it, it would be very easy for me to write something that drives a Yahoo!-scale set of MapReduce nodes to 100% utilization but which really gives me no benefit whatsoever. Now take that a step further: a job can have some benefit to the user but still be very resource-inefficient. Quantifying that is worthy of more research, but for now, optimizing the efficiency of any type of job or application specification will give an organization better business and operational intelligence and actually make its data lake (pond, ocean, deep murky loch?) worth the money.
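To make the utilization point concrete, here is a minimal single-machine sketch in Python. It is not Hadoop code and not anything from the Adobe session; the function names (measure, wasteful_job, useful_job, efficiency) and the "useful records per CPU-second" metric are my own illustrative assumptions. It only shows that a job can keep a core busy while producing nothing of value.

```python
import time


def measure(job, *args):
    """Run a job and return (result, cpu_seconds, utilization).

    Utilization here is CPU time divided by wall-clock time for this
    process, a rough stand-in for what a cluster dashboard reports.
    """
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    result = job(*args)
    cpu_seconds = time.process_time() - cpu_start
    wall_seconds = time.perf_counter() - wall_start
    utilization = cpu_seconds / wall_seconds if wall_seconds > 0 else 0.0
    return result, cpu_seconds, utilization


def wasteful_job(iterations):
    """Hypothetical job: keeps the CPU busy, produces no useful records."""
    x = 0
    for i in range(iterations):
        x = (x + i) % 7
    return 0  # number of useful records produced


def useful_job(lines):
    """Hypothetical job: a toy word count over the input lines."""
    counts = {}
    for line in lines:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return len(lines)  # number of useful records processed


def efficiency(useful_records, cpu_seconds):
    """Assumed metric: useful records per CPU-second."""
    return useful_records / cpu_seconds if cpu_seconds > 0 else 0.0


if __name__ == "__main__":
    jobs = (
        ("wasteful", wasteful_job, 2_000_000),
        ("useful", useful_job, ["big data is big"] * 200_000),
    )
    for name, job, arg in jobs:
        records, cpu, util = measure(job, arg)
        print(f"{name}: utilization={util:.0%} "
              f"records/cpu-sec={efficiency(records, cpu):.0f}")
```

On an otherwise idle machine both jobs should report utilization near 100%, yet only one of them has a non-zero efficiency figure. A real cluster would need a better definition of "useful," which is exactly the part that deserves more research.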

Add to these business and operational justifications the added security posture:
http://virtual-hiking.blogspot.com/2014/04/new-roles-and-security-in-virtualized.html
and now you should have a much better idea of the solutions that forward-thinking customers are adopting to weaponize their in-house and myriad vendor analytics platforms.

Really exciting tech and hope to see you next week in San Jose!

Additional links
Hadoop Summit:
http://hadoopsummit.org/san-jose/schedule/
http://hadoopsummit.org/san-jose/speakers/#andrew-nelson
http://hadoopsummit.org/san-jose/speakers/#chris-mutchler
Hadoop performance case study and recommendations for vSphere:
http://blogs.vmware.com/vsphere/2013/05/proving-performance-hadoop-on-vsphere-a-bare-metal-comparison.html
http://www.vmware.com/files/pdf/techpaper/hadoop-vsphere51-32hosts.pdf
Open source Project Serengeti for Hadoop automated deployment on vSphere:
http://www.projectserengeti.org/
vSphere Big Data Extensions product page:
http://www.vmware.com/products/vsphere/features-big-data
How to set up Big Data Extensions workflows through vCloud Automation Center v6.0:
https://solutionexchange.vmware.com/store/products/hadoop-as-a-service-vmware-vcloud-automation-center-and-big-data-extension#.U4broC8Wetg
Big Data Extensions setup on vSphere:
https://www.youtube.com/watch?v=KMG1QlS6yag
HBASE cluster setup with Big Data Extensions:
https://www.youtube.com/watch?v=LwcM5GQSFVY
Big Data Extensions with Isilon:
https://www.youtube.com/watch?v=FL_PXZJZUYg
Elastic Hadoop on vSphere:
https://www.youtube.com/watch?v=dh0rvwXZmJ0

Thursday, May 29, 2014