Kevin: My name is Kevin Leong. I am a Project Manager at VMware; I work on the big data side as part of the data team. Charles [Fain], as Jim alluded to, will be here later. He actually leads the data team; he is the Senior Vice President in charge of strategic R&D, of which the data team is a part.
Just a word of warning before I go into my presentation: Jim mentioned to me that there would be a range of people here, some from I/T and perhaps some from business units as well. Since I thought it would be useful and necessary to provide some substance behind the things I'll be talking about today, I may dip down into more technical details and then bring it back up to what it means for Hadoop users and things of that nature. If the technical details don't interest you as much, feel free to tune out during those moments and be sure to pop back in for the big picture at the end.
We're going to be talking about Hadoop as a service today. First of all, I do want to introduce you to the VMware data team and the data portfolio. Obviously VMware is better known for virtualization than for data, so it will be interesting to talk a bit about what the data team does and what products we have. Then I'll talk about some trends we're seeing in big data and virtualization, and about what we're seeing in terms of our customer needs.
That is, when we talk to enterprises, what Hadoop users need and what central I/T requires of us. Then the bulk of the presentation will be on Project Serengeti, which Jim alluded to as well; Ironfan is a part of that project. It's an open source tool for deploying virtualized Hadoop, and I'll spend the bulk of the time on virtualized Hadoop for the enterprise.
Let's kick it off with the VMware data team and where we are. This is the way VMware sees enterprise I/T, and there are a couple of trends in this area: cloud, new application types, new frameworks.
Today we are here to talk about data, big data in particular. The one trend that concerns us is data disruption. What do I mean by that? At VMware we talk about big, fast, flexible data. You've probably heard some incarnation of that; it might be in the form of several Vs. There's big, fast, and flexible, and at VMware we are above all interested in the cloud delivery of data: data, databases, analytics, Hadoop as a service. We're interested in bridging the gap between big data and virtualization.
That's what I'm here to talk to you about today. This is where we are as a data team, in terms of the products we have, and we've grouped them under big, fast, and flexible. I've mentioned Serengeti and I'll talk more about it in the rest of my presentation. So, that's big data.
We also have big data analytics, that is, analytics as a service. Cetas was a company we acquired earlier this year; they provide analytics as a service, mostly geared towards online businesses and Internet companies. We have products in the fast data space, in-memory data grids: SQLFire, GemFire. Then under flexible, what we're thinking is that we have products that meet various needs in enterprises, starting from relational to NoSQL; key value stores, object stores, and document stores are something we are looking into as well.
Underlying all of that, and I talked about cloud delivery before, we have VMware Data Director, which is our database-as-a-service product. It's a product that virtualizes databases and allows I/T to manage the entire life cycle of the virtualized database. This is where we are in terms of the data team at VMware, and now I'm going to double-click on the big data side of things, in particular Serengeti, which is deploying Hadoop as a service.
So, trends. If you are familiar with big data at all, this should not come as a surprise to you, but analysts are predicting a data explosion, and a large part of that is unstructured data. Some of the estimates are that 70, 80, 90% of data in enterprises is really in unstructured forms, so enterprises are getting around to thinking about how to store all that information, how to leverage it, how to make money out of it, and that's where Hadoop comes in.
Hadoop helps with the big aspect of it, right; it's a cheap place to store all that humongous data you have. It's also a great place for storing unstructured data, because you don't have to fit that data into any schema, into rows and columns, et cetera, and the research bears that out.
In terms of enterprises, I/T departments, and CIOs, the vast majority are either evaluating Hadoop, have it in production, or have some plan for Hadoop in the next year or so. That's what the trends are in terms of big data.
Really, the take-away from all of this is that various forms of data, structured and unstructured information, transaction data, log files, social network data, are going into all this analysis, and this analysis actually helps companies' top lines and bottom lines. That's also a reason why big data and Hadoop are ramping in terms of adoption, in terms of buzz, and in terms of people getting interested.
That said, there are probably folks in the audience who know a lot more about use cases than I do and some speakers down the road who will share that with you.
Let's change gears a bit and talk about something that VMware knows quite a bit about. We talked about big data trends; let's talk about virtualization trends. Starting from about four years back, in 2008, most workloads were on physical machines, and it took weeks to provision compute for new services.
Today we have VMs, and we have figured out how to manage and pool CPU and memory, but you still have some resources along the side that require manual intervention. Let's say it takes days or hours to provision new services today. Where we see virtualization going in the future is really the virtual data center, where all these resources (storage, networks, security, high availability, monitoring) become wrapped in a virtual data center and the management of all of them is presented as a service.
Time to provision goes from weeks and hours down to minutes, and even seconds, in the future. In terms of workloads virtualized, back in 2008 about a quarter of workloads were virtualized; like I said, most were on physical back then. Today we see about 60% virtualized, and in the future, as we provide more value and more benefit with virtualization, we are shooting for in excess of 90% of workloads virtualized.
Again, the take-away is that in the future it will be the exception to see a workload staying on physical rather than virtual.
That being the case, we've seen the big data trends and we've seen the virtualization trends, and it seems like they are on an inevitable collision course, and not in a bad way. We want to see how we can bring the big data trends together with the virtualization trends, and that's where VMware comes in.
Today, VMware has a pretty good handle on virtualizing apps within the enterprise; for example, Microsoft Exchange, SharePoint, SAP, and various other apps. At the same time, with the customers we've talked to, we do see some degree of cluster sprawl. They have these data engines, especially for big data: multi-node databases, Hadoop, HBase, potentially Greenplum, all sitting in separate clusters, and that's kind of a shame, especially if they all access the same data.
We're thinking: why not put them all on the same virtualization platform? These are all apps we can virtualize in the same vein, and we can bring the same classic benefits of virtualization to these data engines as well: elasticity, multi-tenancy, high availability, things of that nature. We want to apply virtualization to big data, avoid cluster sprawl, and give I/T departments and enterprises the benefit of simplicity and optimization within their data centers.
Let's go ahead and talk about who benefits from this intersection of virtualization and big data. On one side we have the Hadoop users: the data scientists, the analysts, the developers. These tend to be line-of-business users, not I/T users; they're intimate with the data. They know the data and how to analyze it, but they're not that well versed in I/T practices, so they depend on I/T, to an extent, to give them the databases or the Hadoop clusters that they need.
Their task is really to provide the actionable intelligence that impacts their business; like I mentioned, the top line, affecting their revenue and their profitability. They're not really interested in what goes on behind the scenes, in what I/T does to give them the Hadoop cluster they need. They just want it, and they want it now.
Their interests and concerns: they want to obtain a Hadoop cluster on demand. Make it easy for me to do that; otherwise I'm going around I/T, I'm going to circumvent I/T if I have to. They're interested in minimizing time to insight. They want to get to their answers quickly, get access to the data fast, run various iterations of their analysis quickly, and get their results as soon as possible.
Then, of course, they want reasonable performance from the Hadoop cluster. Again, they don't really care where the Hadoop cluster comes from. If it's multi-tenanted, that's fine; just make sure I get the performance I need to do my job. That's where the Hadoop user comes from. In short, they're looking for something simple, fast, and easy to use: don't bother me with the details, I just want to get my job done.
On one side you have the Hadoop user; on the other side you have the I/T folks: administrators, architects, the CIO's office. They are responsible for the technology infrastructure, for compliance, for managing their resources, and for managing budgets.
Most of the I/T budget falls under their purview, especially the architects'. Architects are also responsible for evaluating technologies and recommending best practices to the rest of the company. This is the profile of the I/T group we talk to most often.
Their interests and concerns: keeping up with the demands of the business. We have these Hadoop users and they want Hadoop clusters. I/T really doesn't want to be the bottleneck in this area; they want to be able to meet the demands of the business and of the Hadoop users as they come up.
The other consideration is cost savings and consolidation. Think of the cluster sprawl case: what a shame it is to have two different clusters at 25%, and sometimes 10%, utilization. Isn't it better to put all those workloads together and get increased utilization of one cluster? So: cost savings and consolidation. Again, budget is something they are concerned about. Then reliability, especially as you get into Hadoop production cases.
Hadoop has single points of failure, for example the name node and the job tracker, and these I/T folks are concerned with keeping the system up, running, and reliable. Then there's the general complexity of running and tuning Hadoop clusters: different configurations, perhaps, for each cluster, and the skill set to do all of this is somewhat limited.
Based on the people we talked to, they said that, yes, it was difficult to find Hadoop admins. To add a bit of levity, Big Data [Borat] says, "Give man Hadoop cluster, he gain insight for a day. Teach man build Hadoop cluster, he soon leave for better job." That just shows how hard it is to find the Hadoop admin skill set that's needed to run businesses and enterprises today.
Where VMware comes in is really to lower the bar for usage of Hadoop and say: look, you don't need to be an expert in Hadoop; you can be a Hadoop user, and we'll give you a way to set up a Hadoop cluster simply and easily. That's where our investment in Hadoop and big data has been over the past year or so, and that's what I'll share with you in the next section.
Another way to segment the market we're seeing is that there are really three stages of Hadoop users, and the concerns at each stage vary. Stage one, piloting: people are interested in getting a Hadoop sandbox up and running quickly. Again: give me a Hadoop cluster fast; give me time to insight quickly.
The next stage is Hadoop production. This is where they're more interested in the robustness, the high availability, and the reliability of the cluster. This is where the Hadoop admin is interested in providing different priorities for different groups of users, whether you have a production Hadoop job or an ad-hoc Hadoop job, for instance.
Then in the last stage, when you get into big data production, you have Hadoop and other big data use cases, perhaps SQL databases and other NoSQL databases, sitting alongside Hadoop as part of a big data suite, if you will. This is where people get more concerned about sharing, multi-tenancy, and elasticity, making sure utilization is as optimized as it can be.
We'll explore both Hadoop users and central I/T and see what benefit virtualization adds for them. We will also talk about what benefits virtualization has at each stage along this continuum. Let's get into what VMware has done in this area, what our investments and initiatives have been in virtualized Hadoop.
The question you must be asking is: why virtualized Hadoop? We have three answers to that, three pillars, three benefits, three values for the people who do it. Simplicity is first and foremost. High availability is the second one. Elasticity and multi-tenancy is the third one. I'll dive into each in turn.
Let's start off with operational simplicity. Like I said, rapid deployment, as it says up here, and a one-stop command center to configure and reconfigure.
Back in June 2012 at Hadoop Summit we announced Project Serengeti. This is a tool that leverages virtualization to deploy a Hadoop cluster when you request it, on demand, on a virtual platform. You can deploy a Hadoop cluster in 10 minutes. If you look at the demo on projectserengeti.org you will see that; I will not play the demo today. If you deploy a 5-node cluster, it comes up in about 10 minutes. You can also customize a Hadoop cluster in various ways, along various dimensions.
I'll show you that later. You have the flexibility of choice along several dimensions as well: choose your favorite Hadoop distribution; choose shared storage or local storage. Day after day, it's a one-stop command center for setting up your Hadoop cluster, running Hadoop jobs, and so on.
Serengeti is a virtual appliance that you deploy on vSphere, which is VMware's virtualization platform. It actually takes just a few simple commands to set up a Hadoop cluster. What you are seeing down there is a screenshot of the command line interface. After installing the virtual appliance you just say "cluster create" and it goes right to work; in 10 minutes it gives you a Hadoop cluster.
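For those who want the flavor of it, here is a minimal sketch of that interaction, based only on the commands mentioned in this talk; the host name is made up, and the exact flags and prompt may differ across Serengeti releases.

```
# From the Serengeti command line shell, after deploying the
# virtual appliance on vSphere (host name is illustrative):
connect --host serengeti.example.com:8080
cluster create --name myHadoop
# ...roughly ten minutes later, inspect the nodes that came up:
cluster list --name myHadoop
```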
Usually when I talk to folks I'm talking to developers and engineers, so I go into all the gory command line details, which they enjoy, by the way. Seeing as this is a CXO big data summit, you're actually privileged to be the first audience for which I'm showing some mock-ups of the GUI that we plan for the Serengeti product.
These GUI mock-ups convey the same concepts as the command line would have, so I think they will suffice. Let's do a walk-through of Serengeti. Like I said, the equivalent of cluster create you would do in the GUI. The main things you want to specify are the number of nodes (2, 3, 5, 10), what size those machines should be, that is, what CPU and memory resources to allocate to different types of nodes, and which distro you would like.
This is the basic way to set up a cluster. Once you have done that, you get a cluster with a master node, a couple of worker nodes, and a couple of [inaudible 00:19:37] nodes. It tells you what roles each type of node has: whether it runs the master services, like the name node or job tracker, or whether it's a data node or a task tracker. That's an easy, really simple way to set up a cluster, and we think just about any Hadoop user can handle that level of complexity.
Now, some more advanced features. What if you want to scale out a Hadoop cluster? That's simple as well. If you do it on the command line it's just cluster resize, and you go from a 5-node cluster to a 10-node cluster if you want to, as in the sketch below. Then there's advanced cluster creation, where you can change more configuration parameters in your cluster, like I said.
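As a hedged example of that resize step, something like the following; the nodeGroup and instanceNum parameter names are assumptions modeled on Serengeti's spec vocabulary, not a verbatim reference.

```
# Grow the worker node group of the cluster from 5 to 10 nodes;
# the exact parameter spellings are illustrative.
cluster resize --name myHadoop --nodeGroup worker --instanceNum 10
```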
This is an example of the template you would use for doing that. Choose your distro; choose shared or local storage (there might be reasons you want one or the other, which I will go into later); turn on high availability protection very simply, with just an on/off flag; and choose the number of nodes you want in each node group. You can also use this spec file, this [inaudible 00:20:47] spec file, to tune the Hadoop cluster config itself.
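To make that concrete, here is a rough sketch of what such a spec file might contain. The field names are assumptions based on the options Kevin describes (distro, node groups, storage type, HA flag, node counts), not the exact Serengeti schema.

```json
{
  "distro": "apache",
  "nodeGroups": [
    {
      "name": "master",
      "roles": ["hadoop_namenode", "hadoop_jobtracker"],
      "instanceNum": 1,
      "cpuNum": 2,
      "memCapacityMB": 4096,
      "storage": { "type": "SHARED", "sizeGB": 50 },
      "haFlag": "on"
    },
    {
      "name": "worker",
      "roles": ["hadoop_datanode", "hadoop_tasktracker"],
      "instanceNum": 5,
      "storage": { "type": "LOCAL", "sizeGB": 100 },
      "haFlag": "off"
    }
  ]
}
```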
That's a walk-through of Serengeti. We said it's an open source tool; we do this to give you freedom and flexibility. We support that aspect of user choice, so you have the flexibility to choose from the major distributions when using Serengeti. Today we ship with Apache, we support Cloudera, Greenplum, and Hortonworks, and we continue to work with these vendors to deliver more value as Serengeti integrates more tightly with the distributions.
Then there's support for multiple projects: not only does Serengeti deploy Hadoop today, it deploys Pig and Hive. One of the other things we are working on is HBase, and depending on the feedback we get from early adopters we will continue to support more projects within Serengeti. That's a work in progress, as I mentioned on this slide.
Open architecture to welcome industry participation. We welcome contributions from enterprises. In fact we have spoken to several of them who are excited and interested in contributing alongside VMware.
Hadoop virtualization extensions; let me just talk briefly about those. We came up with some extensions to Hadoop to make it virtualization aware. Hadoop wasn't written with virtualization in mind; it was written with physical infrastructure in mind. In that sense we can make improvements to Hadoop so that it better understands what, say, the topology looks like in a virtualized infrastructure.
We've added another layer between the rack and the host in the Hadoop topology to represent the physical machine, so you get the rack, then the physical machine, and then the virtual machine at the end of it. That way we can do a better job at replication, at reading, and at scheduling Hadoop jobs. We do this for performance reasons and to prevent data loss through replication, and we are contributing these extensions back to the open source community. In fact, we have submitted patches, and about half of them have been committed back to the Hadoop trunk.
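To illustrate the idea rather than VMware's actual patch: Hadoop of that era resolved node locations through a pluggable topology script (the topology.script.file.name setting), and the extension amounts to inserting a physical-host layer into the paths such a script returns. A toy sketch, with made-up IPs and host names:

```python
#!/usr/bin/env python
# Toy Hadoop topology script. Hadoop invokes it with one or more node
# IPs and expects a whitespace-separated location path for each. With
# a physical-host layer between the rack and the VM, the scheduler and
# the replica placement logic can tell that two VMs share one box.
import sys

# Illustrative mapping of VM IPs to /rack/physical-host paths.
TOPOLOGY = {
    "10.0.0.11": "/rack1/host1",
    "10.0.0.12": "/rack1/host1",  # a second VM on the same physical box
    "10.0.0.21": "/rack1/host2",
    "10.0.0.31": "/rack2/host3",
}

if __name__ == "__main__":
    # Unknown addresses fall back to a default location, per Hadoop convention.
    print(" ".join(TOPOLOGY.get(ip, "/default-rack/default-host")
                   for ip in sys.argv[1:]))
```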
In terms of flexibility and freedom of choice, you can choose local or shared storage, depending on what works for you. Why might you choose one or the other? Shared storage is definitely easier to provision and makes it easier to re-balance your cluster, and with shared storage you can leverage the high availability protection that VMware provides.
The flip case, for local storage, might be that local disk is obviously a lot cheaper, especially if you are talking big data. With large amounts of data it becomes cost prohibitive to put it all on shared storage, so you may want to put your data nodes on local disk. You typically get better bandwidth that way as well, depending on your use case.
If you have a Hadoop cluster for a day and you have gigabytes of data and just want to do some sandbox work on it, you can choose to put it on your shared storage. Or, if you're running a production cluster, you have the option to put your data nodes on local disk, or to do some hybrid of both shared and local storage, and Serengeti will support all of that.
Another question we get asked is about performance. How does Hadoop perform on virtual infrastructure? We did some performance tests. These were done with local disks, because VMware supports local disks and we get better bandwidth that way. The way to read this graph: the performance on physical is the left-most column over there, the light blue column. The main take-away is that the performance on virtual is comparable to the performance on physical, but you get all these other advantages and benefits when you put it on virtual. If you are interested in the full performance tests, there's a white paper for you to refer to.
We talked a bit about operational simplicity and we took a detour into some freedom of choice and flexibility in terms of storage, distribution, and performance. Let me dive a little bit into high availability and why it's important and why it resonates with some of our enterprise customers.
There are single points of failure within the Hadoop stack. We know about the name node, we know about the job tracker, and there are other points you may want to protect as well, such as the metastores for Hive and HCatalog, or your Hadoop management server. What we provide is a few ways to protect these; there are three of them, and I will go into each in turn.
There's vMotion. vMotion reduces planned downtime; it's for when you migrate a running VM from one machine to another. If you want to take a machine down for maintenance, this is the way to do it, and the user really sees no interruption.
To protect against failures, unplanned downtime, there's HA protection. This comes into play at various levels: if the physical machine fails, if the VM fails, or even if the application fails, HA is able to detect each of these cases and restart the VM on another machine when a failure is detected. This restart typically takes several minutes, and jobs will pause and resume; if it's the name node that's gone down, they resume when the name node is back up.
For the ultimate protection there's FT, vSphere FT, which runs two VMs in lockstep, so you have zero downtime and zero data loss when running in FT mode. Those are three ways in which you can protect your Hadoop stack.
Let's pull back up to the higher level. Why is this important? Why do enterprises like this? It's really a tried, tested, mature technology for high availability. In fact, the vast majority of our VMware customers use some aspect of this high availability technology in production. It's a very mature technology.
It's a single mechanism that you need to learn to protect your cluster up and down the stack. It's the same thing you do for the name node, the job tracker, or the metastore server; you don't have to re-learn something different for each service or component you want to protect. And with Serengeti it's simple: it's one click to enable HA and/or FT, depending on your choice.
Lastly, very quickly, let me touch on elasticity and multi-tenancy. The left-most picture is what a physical Hadoop node usually looks like: you have the compute and the storage bound together. We looked at that model, but we thought it's not so great for elasticity. Why? If each VM looked like this, it's going to be hard to turn a VM off if you need access to the data after the fact.
The moment you turn a VM off, the data is on that VM and you've lost access to it. We think a better way, as we go from left to right, is to separate the compute and the data so that the compute becomes stateless.
What you want to persist is really the storage layer; the compute can then be very elastic. You can turn compute on and off very easily, and you can raise utilization that way, because you can scale your compute in and out on an as-needed basis.
Taking it a step further, that's the multi-tenancy case, where you can have different VMs for different tenants. That gives you VM-level isolation: you can isolate different tenants, which is good for performance and for security, because you can guarantee a certain amount of resources to each tenant once you have put them in separate VMs. This is the direction we are going in terms of evolving Hadoop on virtual infrastructure.
Let me give you some scenarios to close this off, of what you can expect to do with this technology. You're an enterprise; you have several different Hadoop things going on in your business. We call this the in-house "Hadoop as a service."
If you are familiar with EMR, think of it as enterprise-grade EMR: EMR with all the benefits of high availability, better performance, and so on. You can have production jobs and ad-hoc jobs sharing the same virtual infrastructure, even sharing the same HDFS, and the idea here is that you can scale those jobs in and out depending on their priorities, on an as-needed basis. That's one scenario where this comes in useful.
Next, there's the integrated big data production case, where you have Hadoop combined with other big data use cases: Hadoop with HBase, with other NoSQL databases, with relational databases, all sitting on the same virtualization platform and the same physical infrastructure. Again, you scale in and out as needed, depending on who needs the resources at a certain point in time.
One other scenario might be: I have Hadoop and I have a non-Hadoop, or even a non-data, use case. Maybe I'm a web company, and during the day I expect a lot of web traffic, so I scale out the web servers I need for that, and then during the night I do my offline batch analysis, so I scale back my web servers and scale out my Hadoop instances. Then I'm more efficient, because I'm using the same physical infrastructure that underlies all these workloads.
That should give you some idea as to where we're going with Hadoop on a virtual platform. The three scenarios I just showed are all forward looking, as we continue to work on the elasticity and multi-tenancy pieces.
Today, Serengeti can deploy a cluster with compute and storage separated. We continue to invest in adding intelligence to the elasticity and resource allocation pieces of the multi-tenant case. That's where we are today with Serengeti.
To summarize, I've talked to you about simple, reliable, elastic Hadoop on demand; Hadoop as a service. Again, for those of you familiar with EMR, there might be reasons why you want to do this in-house: maybe for compliance reasons you can't put your data out there, or it's too much data to move out there. So you can think of it as an in-house, enterprise-grade EMR for your company.
Three aspects: simplicity, high availability, and elasticity with multi-tenancy. There is something for everyone at each stage of the maturity continuum. If you're piloting, you care about rapid deployment and time to insight. If you're in production, you care about robustness features like availability, ease of operation if you're the admin administering multiple Hadoop clusters, and differentiated levels of service: different priorities for different Hadoop jobs.
Lastly, big data production is really where the elasticity and multi-tenancy comes to the fore.
If you're interested in Serengeti, our website is projectserengeti.org. You can download it there, and there's a demo you can watch as well. We also have a VMware Hadoop site where the performance white paper I talked about is posted, along with additional resources. With that, I thank you. Jim?
Jim: Maybe you could just stay up here for a couple of seconds.
Jim: Just raise your hands: how many people here have actually tried to use Serengeti? Anybody here? So, just a few. Of course the chimps. How many people here could actually see a use case for leveraging it? Raise your hands. Great. Any questions for Kevin before we let him go?
Questioner: I wanted to ask you about the division between storage and compute. The selling point of Hadoop is that you move the computation to the storage, because the data is too big to move around. How are you going to deal with that? Once you separate them, you can't assure that anymore. Are you then going to put those smarts into the way you do the elasticity?
Kevin: Yeah. We are.
Questioner: But then you are tightly coupling the compute with the storage by other means, right? Because they just belong together; that's the whole point of Hadoop.
Kevin: Yeah, but what we're saying is that then we can turn things on and off. The other part of your question is how we do that. You recall I talked about the Hadoop virtualization extensions; that goes to performance, to preserving that sort of locality, right? If I know the data sits on the same physical machine, that's actually close, closer to me than some piece of data that sits on a different physical host. With that intelligence I can do better scheduling, as it were.
Questioner: A second question, since you mentioned it. The second part is that virtualization becomes a problem for scheduling, because Hadoop assumes that all the machines are going to behave the same. You could have a pathological situation, which might actually happen in practice, where a lot of the virtual Hadoop nodes are on the same physical host; they're actually going to compete for resources rather than run in parallel, so it's sort of shooting yourself in the foot: first you break something up and then you send it to the same machine. Is that one of the other aspects where you're going to use the extensions?
Kevin: That's not so much the extensions; that comes in when you deploy the Hadoop nodes. Serengeti can actually take input from the user to say: okay, I'm only going to put this number of nodes on this physical host.
Questioner: Then that basically means that you can only put one node per CPU core of a physical host, right? Otherwise you might run into that problem.
Kevin: Again, it depends on whether you are using the Hadoop cluster. You can put more than that. You can overlay multiple Hadoop clusters, and if you're not using one, then the other can get at those resources. The whole point of virtualization is that it does manage the resources; even if you over-allocate, or over-schedule, the virtualization platform is supposed to be able to manage those resources.
Jim: Rich, you had a question?
Rich: You went through the part where you've given the operator access to scale out and reconfigure. What are you doing to expose that as an interface or an API for automated scaling and reconfiguration? And are you, as part of VMware, doing anything on that?
Kevin: We don't have any concrete plans in that space yet, but Serengeti provides APIs to developers if they choose to develop something automated on top of it, and we have had some folks in the process of doing that. You can write a front end on top of Serengeti and put some intelligence in it.
Questioner: How do you put together the policy that is basically watching the application, or watching the external demand, and making decisions about scale-out or reconfiguration?
Kevin: Yes, you're absolutely right, and these are early days. Serengeti has been out for, I guess, three months or so, so we count on you to work on this and to put more intelligence in front of it.
Questioner: As you were going through the presentation you talked about Hadoop as a service. Is it VMware's plan to offer it as a service, or are you talking more about enterprises buying it from VMware? That is, they buy Serengeti, they buy the solution, and they offer it as a service within their own company. Or do you see VMware going down the path of standing up and providing the service itself?
Kevin: I'm talking about the latter at this point; a lot of our early beta customers are actually folks who want to offer it internally. A small fraction of them might be service providers who will purchase the VMware virtualization platform, run Serengeti on top of that, and then offer Hadoop as a service publicly, yeah.
Questioner: I like your analogy of elastic MapReduce for the enterprise, perhaps drawing from Amazon as a pioneer that has worked through a lot of these issues. Do you have analogs of some of the things that they've come to, like allowing for higher-bandwidth connections between machines so you have some notion of locality, different storage characteristics like high IOPS, SSDs, or different numbers of spindles on the local storage, and even something like Elastic Block Store?
Kevin: We are aware of those things and it's probably premature to talk about what we will be doing in those areas.
Jim: Good roadmap, Ivan. Any other questions? Great.