Jim: So, I want to first give a little plug for Infochimps. I'm the CEO, Jim Kaskade, and we basically provide big data as a service in elastic, private, and public cloud configurations. We'll bring big data compute to you, whether it's on Amazon or Rackspace, or in a Tier 1 through 4 data center where your data is located.
Our focus is making big data simple and fast by allowing you to spin up your clusters and then apply them immediately with a big data platform as a service. I'm stealing this quote from one of the evangelists at Google: "Information is powerful, but it's how we use it that really will define us."
I think vendors across the whole big data space, Hadoop included, have been focusing on the "information is powerful" part. I think a lot of what has been missing is enabling the organization to actually make use of the information and getting people access to the platforms that allow you to perform these advanced analytics.
First question: How many people here actually have a big data platform of their own, in-house? Raise your hand. Great, so we basically understand what the definition of big data is then.
I think the transformation in the industry, realized with these new web-scale technologies, is taking relational technology much further with large amounts of data that are also complex. Not just large in volume but also real-time and in motion, and being able to process those data types in a scalable way that, again, the traditional data management tools have not allowed you to do.
Second question: How many people here feel like they have a big data platform that has really leveraged all the technologies: Hadoop, NoSQL, maybe Complex Event Processing, Storm, or Sqoop? To what degree do you feel like you have been able to deploy? Is it early stage? Raise your hand. Late stage? Raise your hand.
I think when people think about reference designs there's been a lot of discussion out there as to what the perfect stack is. I think all the [distros] have actually provided a pretty good view of that, but it's definitely evolving. When we looked at deploying a big data project here at PARC there were quite a number of components. We basically went after every part of the traditional stack, as well as some of the nontraditional pieces, and looked at the entire framework, from the data-centric application all the way down to the infrastructure, whether that infrastructure is hardware or cloud-based. I think there is quite a bit of complexity in terms of the number of pieces being leveraged today.
Third question: The number of applications being deployed. If you've deployed on a big data platform, how many of you have deployed half a dozen different applications that have taken advantage of big data? Over a dozen?
I think one of the things the industry has been looking for is the big successes. What are the types of applications that are actually creating huge amounts of value? I think the answer really depends on what kind of data you have. It's not so much about the platform itself and the various technologies; it always comes back to the data itself.
If you have a lot of data from a lot of customer touch points, or throughout your business and your organization, I think the number of applications is actually quite broad. You could probably take any business question or any application that exists today and augment your data sources, providing new data elements that could increase the insights for those applications and improve the meaningful results from them.
When you look at the various verticals, whether it be Web, media, or telco, I think the top application use cases are shown here from two perspectives, from advanced analytics to just doing basic data processing. The use cases that we've seen, at least in the work that I've done over the last couple of years and with the people I've discussed big data projects with, Ron being one of those, suggest the range of applications is actually quite broad.
It really gets down to understanding what's the most important application to you in the initial proof of concept or pilot, and making sure you can bring in the data elements you haven't been processing in the past to improve the value of those applications.
Fourth question: The process of kind of starting from scratch and going to finish in terms of your big data program. How many folks here feel like they've gone through a very successful, did it right the first time, deployment of Hadoop, NoSQL, or big data? Raise your hand. So the experts in the audience basically.
I think the ability to really go through the right process and get it right the first time really comes down to how well you're willing to plan and bring in the experts who have done it many, many times. How much money are you willing to pay?
If you're going to do it on a shoestring and you're going to try to bring your own team in and educate them, I think you're going to find yourself hitting a wall. Here at PARC early on with [Serendra], when we brought in a team as well as tried to educate our folks internally, it was quite difficult even with folks who had expertise and had done this many times. I think even with a lot of good talent we struggled over the course of six months.
When I look at the process and I compare it to the days I worked at Teradata for 10 years, we spent 12 to 24 months designing enterprise data warehouses. It started with business discovery, figuring out which applications, which questions you wanted to answer. Then you figured out which information, which data sources, you had to support those, and then you went through this logical and physical data modeling exercise, which is probably much larger than I show here.
Then you worked out your system requirements. You'd spin up these extremely expensive solutions like Teradata to support those business use cases, and then you'd spend a phenomenal amount of time doing data ingestion and ETL before finally getting to the point where you could run your first query or build your applications on top of that, right? So 24 months, not an unreasonable timeframe, even today.
When we look at a big data project we might see that being cut down by half, maybe even better than this. I think the phenomenal difference is that you're not doing logical or physical data modeling, and the time spent staging the systems shrinks; with commodity hardware it's a lot simpler to roll out a big data platform, since the software has been developed for commodity machines, and data ingestion and transformation have been simplified.
I think the use of Informatica [inaudible 00:08:42] and those types of tools helps. We're also seeing the use of things like Flume and Sqoop and others to make it much simpler to actually bring data into a big data system. Project timelines are coming down, but I think the process is still evolving.
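As a rough illustration of that kind of scripted ingestion, the following is a minimal sketch that drives a Sqoop import from Python. It assumes a MySQL source database and the Sqoop CLI installed on a Hadoop edge node; the connection string, credentials, table name, and HDFS paths are hypothetical.

# Minimal scripted ingestion with Sqoop (sketch only).
# Assumes the "sqoop" CLI is on the PATH of a Hadoop edge node;
# the JDBC URL, credentials, table, and target directory are hypothetical.
import subprocess

sqoop_cmd = [
    "sqoop", "import",
    "--connect", "jdbc:mysql://db.example.com:3306/sales",  # hypothetical source database
    "--username", "etl_user",
    "--password-file", "/user/etl/.dbpass",    # keep credentials out of the command line
    "--table", "orders",                        # hypothetical source table
    "--target-dir", "/data/raw/orders",         # landing directory in HDFS
    "--num-mappers", "4",                       # parallel import tasks
]

# Run the import and fail loudly if Sqoop exits with a non-zero status.
subprocess.run(sqoop_cmd, check=True)

The same pattern, a thin script around a bulk-transfer tool, is roughly what these ingestion pipelines look like in practice, with scheduling and error handling layered on top.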
Question number five: So say I understand what I want to do, I know what my process is, and I have my talent; what are some of the deployment options? How do I bring big data to my organization?
That also depends on a number of factors. If I need something extremely quickly and I don't want to spend that 6 to 12 months going through that process, I might have to look at alternatives to bringing it in-house and actually deploying my own cluster. Cost, obviously: if you want to try a pilot or a proof of concept, but your only means of actually testing out your hypothesis is to spin up a 20, 30 or 100 node cluster, how do you prove a potential return on investment without actually having to procure new hardware or commercially licensed Hadoop and other components?
The amount of expertise you have in-house. You may have been hearing in the blogosphere and in the news how hard it is to get resources that understand Hadoop and MapReduce, et cetera. I think how you choose to deploy will be determined somewhat by whether you think you have, or you do have, the expertise in-house to do it.
Culture internally: there is a lot of fighting going on between IT today and IT tomorrow in terms of deploying these new data infrastructure opportunities and making sure they integrate well into your existing data infrastructure. Of course, determining where you're going to deploy your Hadoop cluster is going to be governed by the data security requirements, PII or HIPAA or otherwise, and then, of course, the data volume and velocity.
If you decide you're going to put your Hadoop cluster somewhere other than in a bare metal installation in your own data center or on VMware, what are the performance requirements? You may also be constrained by where you can place it, whether it's in the cloud or on premise.
And lastly, is this going to be a project that lives by itself, or how tightly does it have to integrate with your existing infrastructure? Assuming you're going to be getting data from every relational database or data store that you've built your business on, you have to be able to have this Hadoop type of environment tightly coupled with your existing infrastructure.
When you look at the deployment options, most people today, because it is so early, are deploying Hadoop clusters on premise. I call this private cloud if you take into consideration the use of Serengeti, or virtualized Hadoop with vSphere, or if you're going to deploy Hadoop and NoSQL data stores on something like OpenStack internally.
If you're going to deploy on Amazon you have EMR or EC2, or Rackspace, multi-tenant public clouds where you can deploy your Hadoop clusters. Obviously there are some considerations there in terms of data privacy.
Then there's virtual-private, for Fortune 1,000 companies who already have a strong outsourcing strategy, where they may manage 30% of their data infrastructure themselves internally and have 70% out with a third-party data center provider that they trust, like an Equinix or otherwise. There's an opportunity to have a virtual-private cloud where you have Hadoop clusters under management but not within your own data centers.
These are three, I think, prominent and obvious deployment scenarios. So the question is, if you take those factors that I just showed you and overlaid them on each of these, how would that play out within your organization?
One of the topics that we wanted to discuss today is really what's viable. Which of these is right for you? The obvious one is private cloud on premise, but maybe the cost factor, in terms of being able to show quick ROI, says I'll have to do something at Amazon first because that's what I have budget for; I want to do an on-demand model. Or maybe I want to leverage somebody that already has infrastructure in a trusted third-party data center that's close to my data, and be able to draw on their Hadoop cluster and not have to expense my own infrastructure.
With each of these you can create an elastic type of operating environment, with VMware or OpenStack, within your data center; there are other virtualized or private cloud environments as well. The same goes for your trusted third-party data center and, of course, for the public cloud, where the prominent providers are.
When I look at these various options and I look at a private big data cloud that you manage, meaning an elastic Hadoop offering like vSphere-virtualized Hadoop within your own data center, you're provisioning your own hardware, you're staging your own hardware. You're paying for vSphere. Serengeti is free but vSphere isn't, so your investment is large, right?
But when you look at the security risk it's probably the lowest: you have the most control over what it is you're doing with your data. It's with the people who work for you. It's within the four walls of your corporation.
The time to market in terms of being able to provide that solution is long, right; it's that six months. When we deployed here it took us a little over five months to get to a point where we could actually run data analytics, from the time we started the business discovery to the time we were actually executing an analytic model. Six months was a long time, and an investment in our own cluster. I think this is clearly a high-cost, long-term investment in terms of getting to an ROI.
Then you have the virtual-private big data cloud, which again is a private cloud with Hadoop, but it's virtualized, in a data center that you don't own but that you control, connected directly to your data center over a secure pipe. In this case the cost comes down because you're leveraging a data center provider who has leverage, potentially, but your time to market is still pretty high because you have to deploy that infrastructure and you have to manage it. You have to go through the same process and steps that you did here, so you benefit a little bit.
Then you move over to big data in the public cloud and you have to make sure you have people on staff who can manage EC2 or EMR. I think you will see a price drop there, especially project by project, because you can deploy something quickly and you don't have to own the infrastructure underlying your big data platform, and so this is where I think time to market and dollars benefit, but the perception of security kind of goes out the window.
That's your typical discussion around public cloud being a place that's not really secure: I can't really put my private data out there; I can't put my customers' data out there. The number of applications that you can actually leverage in a public cloud service like Amazon or Rackspace probably gets limited, so the scope of opportunities gets smaller. Maybe that's okay. Maybe you have a combination of on-premise and public cloud infrastructure for your big data projects.
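To make the on-demand point concrete, the following is a minimal, modern sketch of programmatically requesting a Hadoop cluster on Amazon EMR with the boto3 SDK (which postdates this discussion). The cluster name, instance types, counts, region, and release label are assumptions for illustration, not anything prescribed here.

# Sketch: provision an on-demand Hadoop cluster on EMR via boto3.
# All sizing and naming below is illustrative; the default EMR IAM roles are assumed to exist.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="poc-hadoop-cluster",                  # hypothetical cluster name
    ReleaseLabel="emr-6.15.0",                  # assumed EMR release
    Applications=[{"Name": "Hadoop"}, {"Name": "Hive"}],
    Instances={
        "InstanceGroups": [
            {"Name": "master", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 19},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,    # keep the cluster up for interactive use
    },
    JobFlowRole="EMR_EC2_DefaultRole",          # default EC2 instance profile
    ServiceRole="EMR_DefaultRole",              # default EMR service role
)

print("Cluster id:", response["JobFlowId"])

The point is simply that a 20-node proof-of-concept cluster can be requested, used, and torn down as an operating expense rather than procured as capital hardware.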
These three are all about you managing it, so what about the opportunity of having Hadoop as a service, but a managed Hadoop as a service? It's hard to find the expertise, it's hard to understand how to put these complex architectures together; what if you as an organization could have somebody helping you do that, allowing you to focus on your application development and deployment?
So, a virtual-private cloud with big data infrastructure, but as a managed service. The idea here is you don't hire your team of data scientists; you don't hire your team of Hadoop admins, right. You don't hire your team of MapReduce, Java, Pig, and Hive capable folks; you basically abstract that away and give more of the big data solution, or service, to application developers. Maybe you enable infrastructure for data scientists in a way where they don't have to worry about what a cluster looks like; they just know it's there and it works.
From that sense you've basically reduced your dependencies on internal staff in order to provide return on investment, so you get to an executed project that shows value to a CXO, be it the CMO, the CFO, the various people within the organization who want to see applications of this technology. You'll spend less money and less time, and if it's in a trusted data center that you're already familiar with, then maybe the risk in terms of privacy, what have you, is also low.
This is a big assumption, that you may already be working with an Equinix, a Telx, a [Sabbas], or a Terremark and that you can get big data as a service deployed within those data centers. If that is the case then I think these elements play out.
Then lastly, when you're looking at public big data that's managed, it's kind of, okay, EMR is easier than EC2, but what if you had something that abstracted EMR even further and somebody was managing that cluster for you? Big data as a service being offered up where, again, you don't have to have administrators; you don't have to have the folks who manage the Hadoop clusters. The dollars and time go down, but the risk is still high in terms of your data being out in the public cloud.
I look at these various deployments as options and I think, as with any early technology adoption, everybody is playing with it in-house. It's clear this is where everything starts, and I think for really small companies the public cloud might be fine for applications that aren't data sensitive. The public cloud you manage yourself is probably also very much in use.
But I think the opportunity here is to look at these other options depending on your sensitivity to deploying data in a public cloud service or your sensitivity to the amount of money that you have to spend to deploy these, or the sensitivity to the time it takes to deploy.
I guess what I'd ask from the audience: who has experience, whether it be with big data or just in general, in leveraging the virtual-private cloud, the trusted data center provider's infrastructure outside of your own?
Very few, just Steve from [Amlyn], probably one of the thought leaders. That is all I had to talk about. I just wanted to bring some thoughts to the surface around possible ways of deploying and have a dialogue. So maybe we can just stop here for a second and have a little Q&A or any sort of discussion around the different deployment methods.
Anybody have any thoughts around potential alternatives to just deploying in-house? Steve?
Questioner: I did have a question about the alternatives. You had the chart on there where it laid out the traditional data warehouse, 12 to 24 months and then you had the 6 to 12 months for the big data model. If you could comment a little about the price deltas that you've seen between those. You showed the timeline. What's kind of the price sensitivity between the two?
Questioner: Or cost.
Jim: I think that's an easy one, because when you look at the cost of a traditional enterprise data warehouse it's about $50,000 a terabyte if you're going to deploy on Teradata, and you can get Hadoop infrastructure, if you manage it yourself and buy commodity hardware, for maybe as low as a hundred bucks a terabyte, maybe 50 bucks a terabyte if you're really good.
I think if you look at just pure platform investment, the biggest perceived value proposition is cost definitely. I think when you're talking about proprietary systems, on proprietary software and hardware, the costs are prohibitive to bring in for example, clickstream data and load it into your relational data store enterprise data warehouse.
And with Hadoop and NoSQL, even the commercially supported solutions are really cost-effective. You're basically being charged $4,000 a node per year, and you're essentially paying on a subscription model that, if you do the math, is extremely cost beneficial to the organization. You're getting supported software from a commercial vendor on any sort of hardware, so you have your choice of what you're going to actually invest in, whether it's HP, Dell, or a white box.
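To pencil that out, here is a rough back-of-the-envelope comparison. The $50,000 per terabyte, roughly $50 to $100 per terabyte, and $4,000 per node per year figures come from this discussion; the cluster size and usable capacity are assumptions chosen only for illustration.

# Back-of-the-envelope platform cost comparison (illustrative assumptions only).
EDW_COST_PER_TB = 50_000           # traditional EDW, dollars per usable TB (figure from the talk)
HADOOP_HW_COST_PER_TB = 100        # commodity Hadoop hardware, dollars per TB (figure from the talk)
SUPPORT_PER_NODE_PER_YEAR = 4_000  # commercial Hadoop subscription per node (figure from the talk)

usable_tb = 200                    # assumed usable capacity for a pilot
nodes = 20                         # assumed cluster size, roughly 10 usable TB per node

edw_cost = usable_tb * EDW_COST_PER_TB
hadoop_cost = usable_tb * HADOOP_HW_COST_PER_TB + nodes * SUPPORT_PER_NODE_PER_YEAR

print(f"EDW:    ${edw_cost:,.0f}")     # $10,000,000
print(f"Hadoop: ${hadoop_cost:,.0f}")  # $100,000 in year one, hardware plus support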
I think when you pencil out the traditional hardware, and expanding it, versus this new approach, there is a phenomenal difference in cost, and then you just have to take into account what your headcount requirements are for managing the two different domains. That may be a similar kind of cost element for you. I think the biggest issue is that you just don't have people who are trained, or enough people who really understand the new operating environment.
Questioner: The availability of the skillsets.
Jim: Yes. It's just the availability of skillsets. We have this gap right now where you have to leverage people who have done it before like system integrators and you have to leverage automated platforms or managed services I think to get your first proof of concept, maybe first pilot, or maybe even first production executed successfully.
Speaker2: I have to throw in the counter side to that: while it may cost more to have specialists who are experts on the big data platforms, the hidden cost of operating lots of little data marts shouldn't be underestimated, right? Managing and administering lots of little systems that are separate can be quite high compared to pooling resources on operating a bigger environment.
Jim: Yes. I think even in the traditional domain, it's not just the Teradata system, it's Teradata plus a lot of different datamarts. Any other questions?
Questioner: Slightly different in nature, but I think if you went back to the last chart that you showed, my question is with respect to transmission costs, transmission performance required, et cetera. Particularly with huge degrees of ingestion required, with VMware proposing, for example, that they're going to separate the compute from the storage, and you do that within your data center. Even if you did, even there you are going to be incurring costs. How many people have 40 gig?
What do you foresee in the next one year as you go from left to right as to what the transmission costs and challenges are going to be in our industry, or is that at all an issue that we are facing, or not?
Jim: Well, I think everybody here would probably speak to the fact that the amount of data is growing significantly, so we're not talking about just moving gigabytes; it's definitely moving to terabytes and, for many like the Fortune 100, petabytes. I think transmission costs are a real concern, and when you look at deploying in a public cloud you have Sneakernet, where you basically ship your drives to get started, or you have Amazon Direct Connect where you can move from your data center directly into one of theirs.
There's definitely discussion around how do I get my data in first, which I think does put pressure on bringing your big data solution in-house and having it close to where the data already lives. My question to the industry would be: where does the data live? In a single place? It never is, right?
Even in our work on a large project here at PARC, we were looking at a dozen different data sources, and each data source was owned by a different group, and each group was, for the most part, in a different physical location. When you're looking at trying to bring all this data into one centralized location, you're going to be moving it across some sort of pipe that's been established by that company. If not, it's going to be Sneakernet, or it's going to be over a secured public network, but probably not if it's, again, sensitive data.
Getting data into your Hadoop platform is a challenge no matter where it lives. I think the best solutions will be the ones that take that into account: a solution that can live across various business units' data centers and trusted third-party data centers, with fat pipes between them.
I think the Fortune 100 and Fortune 500 CIOs that I've been able to chat with definitely have a global footprint, and they have data centers all over the world that are well connected with some pretty fat pipes between them, and there's been a lot of work to make that look like one big centralized data source; that helps. You could deploy in one location and then slurp it in and start processing it. I think transmission costs are an issue.
Speaker2: I was just going to comment on that a little bit as a facility, because that's what we do, so there's a little bit of a plug in here, but it's based on use cases. We have customers with relatively large big data deployments already, and they have gone away from the distributed data model, they've gone to a centralized model, and they have done it because our facility is an independent environment that operates at scale, and I think that's the type of environment these systems are going to require.
If you're just in your facility somewhere, you're going to be limited by carriers, you're going to be limited by choice, and you're going to be handcuffed by the one or two connectivity providers that are actually there. Get into a large third-party environment that has scale and has diversity of carriers and all of a sudden those costs are significantly reduced, significantly because the buying power starts to build upon itself.
We've been able to do that for four of our customers, Fortune 100 customers that are working with extremely large datasets and doing the analysis. What they've realized is that the compute and the data do have to be next to each other, otherwise you can't take advantage of InfiniBand, you can't take advantage of all those things that allow you to move the large datasets in and out. There's a connectivity network behind the scenes centralizing that data, but that's a low-cost connectivity network.
Where they really want the data and compute on top of each other, they're doing it right there at extremely low cost, so there are ways to lower the overall costs but it is going to take looking at it in a very different way and taking advantage of third party ecosystems that give you purchasing scale. I think if you try to do it independently and by yourself, it's going to continue to be expensive.
To give you an example, with the ecosystem we have built, our customers are buying 10 gig circuits from Las Vegas to L.A. and San Francisco for under $3,000 a month. When you get down to those costs, all of a sudden moving data over 10 gig pipes is not a big deal. I think we'll get there but it takes those types of environments. You won't get that in your own facility, in your office, or your own data center somewhere.
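As a rough sanity check on that claim, here is what a dedicated 10 gig circuit at $3,000 a month works out to per terabyte moved. The 10 Gb/s and $3,000 figures come from the discussion; the average utilization is an assumption.

# Throughput and cost per TB for a dedicated 10 Gb/s circuit (sketch).
LINK_GBPS = 10                 # circuit speed from the discussion
MONTHLY_COST = 3_000           # dollars per month, from the discussion
UTILIZATION = 0.5              # assumed average utilization of the link

bytes_per_second = LINK_GBPS * 1e9 / 8            # 1.25 GB/s at line rate
seconds_per_month = 30 * 24 * 3600
tb_per_month = bytes_per_second * seconds_per_month * UTILIZATION / 1e12

print(f"~{tb_per_month:,.0f} TB moved per month")           # ~1,620 TB
print(f"~${MONTHLY_COST / tb_per_month:.2f} per TB moved")  # roughly $2 per TB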
Questioner: So you can't even get that if you have multiple data centers of your own?
Speaker2: You'll never get the buying power. You have to get in an environment where the buying power has multiples on top of itself and then you can start to take advantage of those connectivity benefits. If you're by yourself and you're just trying to connect your independent data centers, you're going to be handcuffed by the one or two carriers that happen to be there.
And when you come and say, yeah, we have a big data project, the sales rep with the AT&T logo goes, oh sweet, this is going to be great, because you're going to be bound by this. At large scale, and even in the cloud, I think this is a limitation. In my opinion this is going to be a limitation of public cloud, because Amazon and others are charging for bits and bytes coming in and out of those environments, and you have to get away from that if you start to do this at scale. One of our customers has a 5,000 node Hadoop cluster with a 56 petabyte data store on it, so that's where we're headed.
Questioner: Have you done the math comparing public cloud at some scale, call it 1,000 nodes, versus a [private] network?
Speaker2: We haven't done the math yet, but what we know from our customers' experience is that when they look at the connectivity cost of going to the public cloud, they view that charge for the bits and bytes going in and out as a hidden cost that can never be accurately predicted; you don't know it until you get the invoice, and they're saying it's 40 to 50% of their overall cost.
When they move to private, they start buying those pipes on their own. Now they're reducing those costs by 40 to 60%, so those are some rough estimates of what we've seen. It's a funny discussion because there's a lot of talk about the technology to run it, but we always sit there and go, you know, this stuff has to sit somewhere, it has to go somewhere, and it has to connect.
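To illustrate why metered egress shows up as a surprise on the invoice, here is a rough sketch. The per-gigabyte egress rate and the monthly volumes are assumptions, not figures from this discussion; the $3,000 flat circuit figure comes from earlier in the conversation.

# Metered public-cloud egress versus a flat dedicated circuit (illustrative assumptions).
EGRESS_RATE_PER_GB = 0.10       # assumed public-cloud egress price, dollars per GB
FLAT_CIRCUIT_PER_MONTH = 3_000  # dedicated 10 Gb/s circuit figure from earlier

for tb_moved in (10, 100, 500):
    metered = tb_moved * 1_000 * EGRESS_RATE_PER_GB
    print(f"{tb_moved:>4} TB/month: metered ~${metered:>9,.0f} vs flat ~${FLAT_CIRCUIT_PER_MONTH:,}")

At small volumes the metered model is cheaper; at the volumes discussed here it crosses over quickly, which is the dynamic being described.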
Questioner: What about the benefit of the public cloud, the near-instant provisioning of capacity [inaudible 00:33:30]?
Speaker2: I think there is some benefit there for small projects but when we start to go at scale you lose it. Just like we're seeing in the enterprise where they're adopting public cloud for certain small projects but when you start to get up to scale it quickly loses its financial benefit.
Questioner: How long does provisioning take [inaudible 00:33:55] in your facility?
Speaker2: Weeks. Less than weeks. You're talking about space and power or actually compute?
Speaker2: Compute is hours, at least with some of our cloud providers, because we have cloud providers in the facility that are operating in that ecosystem. We probably have 30 of them there, and I have three or four that specialize in deploying Hadoop platforms. By being in that environment, where they can actually deploy connectivity from that connectivity network, our customers are starting to get private links into those public clouds, which they can't get in other areas. It gives you that ability to move data in and out at a low cost and the benefit of a public cloud scalable platform as well.
Questioner: [inaudible 00:34:43]
Speaker2: All of a sudden they have scale and they're getting capacity in seconds.
Questioner: [inaudible 00:34:51]
Speaker2: Even those that are taking advantage of public cloud in the ecosystem, they still can deploy their private environments right within the data center and then connect to those public clouds. Now all of a sudden you have 10 gig speeds just by a cross-connect. That becomes really valuable and we're starting to see that as well.
It's a valid issue, and for the ones that are looking at it at scale, I think you have to think about it a little bit differently, and you do have to start to look at ecosystems that will give you the economies of scale to purchase at lower cost. That's the benefit the cloud is supposed to get us, but on the transmission side it's not there yet, and it's a long way away, at least in other areas.
Questioner: [inaudible 00:35:40]