Transcript
Jim:
Welcome, and thank you for joining us for this webinar, "Making Sense of Big Data." I'm really, really excited to be able to have the opportunity to talk to you about some of the things that are happening in the big data industry, and specifically around the deployment of big data technologies. We have a lot of use cases, and hopefully a lot of insightful comments on our perspective based on the experiences that we've had as a company here at Infochimps.
My name is Jim Kaskade and I'm the CEO of Infochimps, and let's get started.
So, first, I have to provide you a quote. And this is from an evangelist at Google, Avinash Kaushik. I don't know if you saw his keynote at the Strata 2012 in Santa Clara, but I think that he gave a great, great talk, and love this quote. And if you know Avinash, he loves a lot of things. "Information is powerful." This is clearly an important statement, and I think if you listen to all the hype in the big data space, it solves for this problem. Information. Making it powerful.
But if you listen to all the vendors — including us, probably, at Infochimps — you hear a lot of emphasis on this, but very little on the second, which is, "It's how we use information that defines us." Getting data into a form that we can extract information, but then making it actionable. Making it … actually creating insights that you can act on is really where the true value comes from, and has come from, in the data infrastructure space. And I'm hoping that, with our talk today, we will be able to provide you more insights on this second problem. How we use it. How we use data, and how do we actually make it actionable.
Before I jump into that, though, I'd first like to define what big data is. And I know you all on this webinar have probably heard a lot of definitions. Gartner, what have you. But just to put us on the same page, I'll give you this simple quote. "Data sets, the data infrastructure, the data that we're analyzing, is so large in volume and so complex."
Now, what does that mean? Complexity, it can come in the form of a huge variety of data types from different sources, both structured and unstructured, coming from things that are at rest or in motion. So complexity also could be measured in terms of the variety of the data types themselves, and the way they're delivered. Whether they be, again, kind of a batch, or at rest data set, or a very high-velocity stream of data that you're analyzing.
So whether it's large or complex, what we're talking about within this big data industry is the fact that we can't process that data with the infrastructure we have today. Now, that doesn't mean that the infrastructure itself is not capable of processing large and complex data streams. I mean, it may equate to the fact that it's just too expensive. I mean, having worked at Teradata for 10 years, you're not going to put your clickstream data on your Teradata platform. It's just too expensive. So there may be situations where your current on-hand data infrastructure is just not capable, both from a technology perspective and/or from a just practical economic perspective.
All right. So when I look at data as being created, of course you look at this report from EMC and others around the amount of data being generated each year. It's huge! I mean, in 2010 it was 1.2 zettabytes just in that year, and I think this year it's over 2.5 or something north of 2.5 zettabytes, and then it might be, in the year 2020, 35 zettabytes per year. And that's a large amount of volume, and it's a large amount of data that's coming at high velocity, and it's a lot of variety in terms of it being structured and unstructured.
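To put those two figures on one scale, here is a minimal sketch of the yearly growth rate they imply. It uses only the 1.2 and 35 zettabyte numbers quoted above and assumes constant compound growth between 2010 and 2020, which is a simplification and not something the report is claimed to state. The function name is made up for illustration.

```python
# Figures quoted in the talk (zettabytes created per year).
ZB_2010 = 1.2
ZB_2020 = 35.0
YEARS = 2020 - 2010


def compound_annual_growth(start: float, end: float, years: int) -> float:
    """Return the constant yearly growth rate that takes `start` to `end`."""
    return (end / start) ** (1.0 / years) - 1.0


# Assumption: growth is smooth and compounding, which real data rarely is.
rate = compound_annual_growth(ZB_2010, ZB_2020, YEARS)
print(f"Implied growth: {rate:.1%} per year")
```

Under that assumption the volume grows by roughly 40% every year, which is the kind of curve that outruns storage budgets.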
So we've got an opportunity here. And I think another kind of meaningful statistic is the fact that we will not be able to produce hard drives fast enough to keep up with the amount of data that we're producing. And I forget which year that is. I mean, 2015, we're coming to an intersection. We just can't build data infrastructure fast enough to house the data that's being created. That's a big problem.
But let's talk about today. Let's talk about all the data that you currently have and that's available to you that you can store and analyze. And let's just keep in mind that it is about the data. That is the key asset for all of our businesses. And what's really interesting today is that we can go beyond the 15% that's being housed in our enterprise, that we're actually storing and querying and analyzing, and we can go after the other 85%. That may involve image captures, satellite images.
Or we're finally getting that Omniture clickstream data off our NFS and getting it into a system where we can analyze it. We're going to tap into Twitter, Facebook, APIs. We're going to look at the RFID information and combine it with our in-store transactions. Whatever the case may be, there's a lot of data that you, on this webinar, in your company, are not taking advantage of, guaranteed.
So what we are going to talk about, as part of us in this big data industry, is how we can take advantage of that data that's just dropping on the floor. So this is a huge problem, right? 15%'s being actually leveraged. That's a little bit of data. And it's always about the business user. I don't care what we talk about. It's about answering business questions. Business problems. The business users are the folks that are our customers in this big data infrastructure space, as a vendor, as an organization within the enterprise. We're servicing our business users. But we're serving our business users with very little data.
So I want to give you a little bit of an animation here, and I would imagine that everybody on this webinar has gone through this in one way, shape, or form. And if you haven't, you're an anomaly.
So we've got the executives on the right-hand side here who are asking critical business questions. "I need a report." "I need a new question answered." "I need to understand new things about my line of business, or across my line of businesses." And they'll go to their business analysts and present these questions that they have. These questions that change with a very dynamic environment. A business needs to respond to these questions.
And these analysts will in turn throw up their thumbs and go, "Yeah, boss, we'll get that answer for you, and we'll get it to you, and get that report to you, that answer to that question to you by tomorrow morning." And those business analysts then have to turn inside to the organization, and they'll go to the application developers, and they'll ask the application developers, "Do we have an application that already answers this, or do you have a business intelligence application or report that already gives us the answers to these questions?"
And if you're lucky, that team isn't sleeping. Not because they're not capable, but they aren't sleeping, because they're waiting for data infrastructure to change. They're waiting for more data to be presented to them. They're waiting for opportunities to change out their tools, to enhance their ability to analyze or process these questions.
Hopefully, that organization of application developers can do more than say, "Sorry, we can't answer that question." And if they can, then they turn to the demigods. The data scientists. The quants. The data analytic people. The group. The small group. The person. The people who can turn data into insight. Those folks will then be able to say, "Well, based on that question, I think we need these data points. And I think we need to do these types of analyses." These people are the data demigods. They are the folks that hold keys to the future of the company, in terms of being able to understand the direction of sales, predicting the need, the demand for supply, etc.
And they will then turn to a group that is probably the least recognized, the hardest-working, the folks that have the least amount of resources: the IT staff. The IT staff is keeping the infrastructure that drives our business up and running. They're always the unrecognized group. The group that works the hardest with the least amount of resources. And if you're lucky, and that group isn't putting out fires, has a little bit of a few seconds during their personal time in the evening to answer those data scientist demigod questions, they will then hopefully be able to turn and say, "Yes, we do have that data." Or, "Yes, our data infrastructure can support that." And in many cases, they have to figure out how to make what they have work, or augment what they have to make it work.
And they'll turn to a group, whether it be the finance team that's going to go and put in the POs and expand that cluster, or the DBAs who are going to unlock the locks to that particular silo data source, and maybe get data from two different lines of business that they can aggregate to answer those critical questions.
So we have this kind of flow coming from the executive all the way down to the data and then back to give the insights. And this is a very difficult process, and this is not one that I think has been made easy, and it's one that we've been trying to improve as an industry for decades. And again, I'd be really surprised if somebody on this call has not gone through this themselves.
As an executive, I just want my answers quickly. And so how do we overcome this challenge? How do the big data technologies, us in this big data industry, how do we solve this problem? [pause]
You've got your NoSQL solutions, and you've got your distribution chosen, and you know how you're going to process your real-time midstream data, your in-motion data. How do you integrate it? How do you get it to work within the existing infrastructure? So I show you a potential new data architecture, with all the data formats that you would love to analyze in concert, in aggregation, that are feeding traditional or legacy systems like Teradata and SQL-based data marts and BI servers. BusinessObjects and MicroStrategy, etc.
Integrating your new technologies into your existing is definitely not something to underestimate in terms of its difficulties. So all of this really sucks. And I'm sure all of you have or are banging your heads. I certainly have.
So how do we get big data for business users? How do we make this process simple? Well, let's come back to this kind of graphic representation. I'd say we take the data itself, we move it as close to the executive as possible. Maybe cloudify it within the organization. Use cloud infrastructure, whether it be on-premise or off-premise in a virtual private cloud with a trusted data center provider. Maybe it's in Amazon or Rackspace. But cloudify your resources. Make data available on these new data technologies. Make it elastic, and make it accessible, and make it in such a way that it can be automatically interfaced by an organization without having to go through that cycle.
This would be really good, right? So the big question would be, "How do we reach that goal?" And that's what we're going to talk about today, and I'm going to give you some real life use cases and some examples.
But first, let's see if we can take a little time for a poll. Hopefully my staff can initiate that for us. And if not, we'll push on.
Do we have a poll live?
Amanda:
Yes, Jim. This is Amanda. Hi.
Jim:
All right.
Amanda:
We do have a poll live.
Jim:
I see it.
Amanda:
Great.
Jim:
Great. So which best describes your position in the organization? So I just went through that scenario of kind of how hard it is to get from the executive all the way to the data and back with an answer, so give us a little feel for who you are in that life cycle of big data insights on the organizational chart. Are you an executive? Are you a business user? Are you a demigod? Are you in the IT group? I'm going to hope that we have a good cross section, Amanda. You think we'll have a … out of our 200 or so?
Amanda:
I do. I do think. We have a lot of polls coming in. About 80% of the audience has voted at this time. We'll go ahead and give it a second or two. If you haven't voted now, place your vote. We're going to close the poll in three, two, and one. Poll closed, and there are the results, Jim.
Jim:
Nice. Pretty good, except we're missing the business user. That's okay, though. I think executives will kind of be a good sounding board for us. So pretty good. And having a lot of analytics folks. That's amazing. So we do have a lot of those demigods online. That's great. So I like the fact that we have this cross section. Okay.
All right. Let's move on. Thank you for taking that time.
So what's next? So we take these Hadoop and NoSQL technologies, we take the webscale technologies that our friends at Yahoo and Facebook and Twitter and LinkedIn have all helped us appreciate. Have all made it open-source, and have all made it work at scale. So the promise of big data technologies to process these large and complex data sets has been realized by these webscale companies, and they have proven that you can do it without the challenges associated with all the legacy infrastructure.
But what's really, really clear is how disruptive this is. I always look for things that are at least 10x. We're looking at a 10x or more decrease in price. And of course that's being driven by the fact that these technologies are open source, and if you want, you can have commercial open source support, but it's still at a fraction of the price that you'd pay for these legacy systems. $50,000 per terabyte for a five-nines relational database solution, versus your NoSQL or Hadoop solutions? Just amazing.
So let's talk a little bit about the "enterprise data warehouse," right? So I was one of the designers of the BYNET that powers Teradata's infrastructure. So I'm going to give you a little bit of history of how Teradata works. Up here, on the far right, you have a request, a SQL request, that gets parsed by parsing engines that then distribute those SQL requests down to things called AMP nodes. Those are access module processors, that then basically crunch that structured data and return and aggregate the answers to you.
Very basic distributed architecture. Phenomenally scalable. It's publicly known that this architecture is powering eBay at 20 petabytes in a single relational model. That's a big database, guys. And it works. And it's only 256 nodes out of a potential of 4,096. I think we can power a lot of zettabytes of information using Teradata.
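The request flow just described, where a parsing engine fans a query out to AMPs and merges their partial answers, can be sketched in a few lines. This is a toy, in-process illustration of the scatter-gather idea only. It is not Teradata code, and the hash function, the number of AMPs, and every name below are assumptions made up for the example.

```python
from collections import defaultdict

NUM_AMPS = 4  # hypothetical; real systems run far more


def amp_for(key: str) -> int:
    """Pick the AMP that owns a row by hashing its key (illustrative hash)."""
    return sum(key.encode()) % NUM_AMPS


def load(rows):
    """Distribute (store, amount) rows across AMPs by hashed store name."""
    amps = defaultdict(list)
    for store, amount in rows:
        amps[amp_for(store)].append((store, amount))
    return amps


def amp_partial_sum(local_rows):
    """Each AMP aggregates only the rows it owns."""
    totals = defaultdict(float)
    for store, amount in local_rows:
        totals[store] += amount
    return totals


def parsing_engine_sum(amps):
    """Fan the request out to every AMP, then merge the partial answers."""
    merged = defaultdict(float)
    for local_rows in amps.values():
        for store, total in amp_partial_sum(local_rows).items():
            merged[store] += total
    return dict(merged)


# Made-up rows standing in for a sales table.
rows = [("austin", 10.0), ("dallas", 5.0), ("austin", 2.5), ("waco", 1.0)]
result = parsing_engine_sum(load(rows))
print(result)
```

Because rows are placed by hash, each worker can answer for its own slice without talking to the others, which is what lets this style of system add nodes to add capacity.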
But are you willing to pay the price? 10 terabytes may be $1 million over three years. Hmm, if I'm a small organization, I'd say no.
So let's look at Hadoop as an example of the big data infrastructure. So we have the ability to ask questions. Analytic questions. But we're asking questions of semi-structured data, so that's a key differentiation, obviously. So we send those questions in the form of MapReduce programs that then get passed on by our NameNode and JobTracker out over Ethernet down to the slave DataNodes. Does this architecture look similar to the one that I just showed you? Of course it does, because it is a distributed scalable architecture, proven to work at scale by our webscale companies. Is it five-nines capable? Nah, probably not. Will it get there? Yes.
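For anyone who has not written one, here is a minimal sketch of what a MapReduce program looks like in shape. It runs the map, shuffle, and reduce steps in a single Python process on two made-up lines of text, so it shows the programming model only and does not use Hadoop or any of its APIs. All function names are assumptions for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


counts = run_job(["big data", "big clusters"])
print(counts)
```

On a real cluster the map calls run on the nodes that hold the data and the shuffle moves pairs across the network, but the two functions a developer writes keep this same form.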
So I've given you a comparison of the legacy and the future, and it's obvious that this has got a lot of potential. And at 10 terabytes, I'd probably be stretching to say it'd cost you $100,000 over three years. If you bought all the support packages and PS and layered in a heavy amount of people behind this to help you make it real, then you may get up to about a tenth of the price. My main point is, it's cheap, and it will get there.
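As a quick check on the 10x claim, here is the arithmetic using only the two figures quoted in the talk: roughly $1 million versus roughly $100,000 for 10 terabytes over three years. Both are the speaker's ballpark numbers, not vendor pricing, and the helper name is made up.

```python
TERABYTES = 10
YEARS = 3

# Ballpark three-year totals quoted in the talk, in US dollars.
LEGACY_TOTAL = 1_000_000
HADOOP_TOTAL = 100_000


def cost_per_tb_year(total: float, terabytes: int, years: int) -> float:
    """Spread a multi-year total evenly over terabytes and years."""
    return total / terabytes / years


legacy = cost_per_tb_year(LEGACY_TOTAL, TERABYTES, YEARS)
hadoop = cost_per_tb_year(HADOOP_TOTAL, TERABYTES, YEARS)
ratio = LEGACY_TOTAL / HADOOP_TOTAL

print(f"Legacy: ${legacy:,.0f} per TB per year")
print(f"Hadoop: ${hadoop:,.0f} per TB per year")
print(f"Ratio:  {ratio:.0f}x")
```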
And so, when you look at the landscape of batch systems versus real-time systems, for large down to small enterprise, that traditional decision support bubble is Teradata, and the top one is Oracle, and in this little bubble in the middle are analytic appliances. That's the Greenplums and Netezzas of the world. This is the current landscape. And those analytic appliances, by the way, came in much later to solve some key core, tough analytic questions. Faster questions, faster response to the questions. So the Teradatas and Oracles of the world were taken by surprise by this group of analytic appliance makers like Netezza and Greenplum.
So now, the same thing's going to happen. People are being taken by surprise by what's happening with Hadoop and NoSQL and big data technologies in general, but they're starting with mid-level companies who are actually operationalizing. The big guys are kicking the tires, and there will be some deployments, obviously. But I think, to keep this simple, most mid-size companies are really going to take advantage of this first, and small companies later as they accrue more data.
This type of technology's going to become hardened and more available. It's going to be better integrated with existing toolsets like BI. And then we're going to find alternative deployment mechanisms that suit your organization, be it on-premise or in virtual private or public cloud environments, and then a whole application ecosystem will be developed around this.
Now, I've simplified this, but my main point is, these technologies are going to put pressure on the legacy systems, of course, and they're also going to create a new market opportunity for customers who are in that small company, small enterprise segment, because they can afford it, and it works. And the solutions will be combined batch and real-time solutions, a mixed workload solution.
So I'm excited about this, and I think that Google's [no audio 20:23] in a way where you can keep things simple. The analytics don't need to be fine-tuned. We just take a basic naive Bayes algorithm and throw it against all of my detailed data that's terabytes, petabytes. Not a sample. The entire data set. And we get insights. We find signals.
And this is true. I've seen it working. I've experienced it myself. I've looked at how thought leaders are bringing all of these new data sources, whether they be images, documents, XML, weblogs, from clickstream, social, sensors, GPS, all of this is finally coming together, and it's not being thrown on the floor. We're not just putting 15% of that into Teradata or IBM or Oracle. It's going into Hadoop, and now you have this big grab bag that can then power business or operational applications as well as discovery types of questions, and people asking what-ifs.
And of course, you do have your SQLs, NoSQLs, NewSQLs, kind of supporting and, in combination, working with Hadoop. This is the new data infrastructure.
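To show how little machinery "a basic naive Bayes algorithm" needs, here is a minimal sketch of a multinomial naive Bayes classifier with add-one smoothing, written in plain Python. The four training examples and their labels are made up, and this is not the code behind any system mentioned in the talk. The point is only that the model is a set of counts, which is why it can be computed over an entire data set and not a sample.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Count labels and per-label words from (tokens, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in examples:
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, word_counts, vocab


def predict(model, tokens):
    """Return the label with the highest log-probability for the tokens."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        score = math.log(n / total)  # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for token in tokens:
            # Add-one smoothing keeps unseen words from zeroing the score.
            score += math.log((word_counts[label][token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training data, for illustration only.
examples = [
    (["cheap", "deal", "coupon"], "promo"),
    (["coupon", "sale"], "promo"),
    (["invoice", "payment"], "billing"),
    (["payment", "overdue", "invoice"], "billing"),
]
model = train(examples)
print(predict(model, ["coupon", "deal"]))
```

Training is a single counting pass, so it splits naturally across many machines: each one counts its own share of the data and the counts are added together at the end.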
So let's talk about use cases. I'm sure you've heard about this. Maybe you haven't. A number of financial institutions are looking at better ways to understand what companies they should invest in, and how to better predict where companies are going with their business. Their quarterly earnings. How do I forecast this better? This particular use case, this hedge fund actually used satellite imagery. This is an image of a Walmart in Wichita, Kansas. And the analysts counted the number of cars in the Walmart parking lot. And what did they do? They measured the overall customer traffic to understand the growth versus its competition, right?
And the insight from this — amazing. You've got to process these satellite images, which are big themselves, and you've got to look at all the cars and figure out and count the cars, and then compare it to yesterday or last quarter or last year. And what did they learn? Walmart's growth was determined to come mostly from areas of high unemployment. Go figure. Hard times, unemployment, they go to Walmart. It's cheaper.
Target, in this picture, this is a plaza in Garner, North Carolina, I think. Not too many cars in the parking lot in this one. Analysts compared this satellite parking with regional unemployment trends, and they found that Target's growth tends to come from areas of lower than average unemployment. Hmm. Interesting.
So what does this all mean? Well, this means that you can take a ton of information, not just from cars in the parking lot, but news about the business, what's happening in pricing, and seeing how people are actually dropping their prices relative to their competition on the Web. Looking at social sentiment. Looking at weather reports, and how they overlay with the actual traffic in that parking lot. And the local employment, obviously. And you take all this information, you run it through a very simple, simple algorithm. Let's not spend a lot of time tweaking this. And come up with a quarterly revenue prediction.
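As an illustration of a "very simple, simple algorithm" turning a signal into a forecast, here is a minimal sketch that fits a straight line from parking-lot car counts to quarterly revenue. The numbers are invented for the example and do not come from the hedge fund or from Walmart. It also uses a single input for brevity, where the talk describes combining many (pricing, sentiment, weather, employment).

```python
def fit_line(xs, ys):
    """Ordinary least squares for one input: return (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up history: average cars counted per image, and revenue in $ billions.
cars_per_quarter = [100, 200, 300, 400]
revenue_per_quarter = [1.0, 2.0, 3.0, 4.0]

slope, intercept = fit_line(cars_per_quarter, revenue_per_quarter)

# Forecast for a quarter in which the images average 500 cars.
forecast = slope * 500 + intercept
print(f"Forecast revenue: {forecast:.2f}")
```

With real data the fit would be noisy, and each extra signal would become another input to the same kind of model.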
All right. Use case number two. Let's talk about a big media company. Over $1 billion in revenue. Media companies have a lot of traditional media sources, and they want to go into new media sources. So how do I merge my traditional media — like TV and radio, and even transcribed print — and how do I merge that with blogosphere and Twitter and Facebook to improve the insights I can provide my advertisers? Somebody advertises on TV or in the radio, or out on the Web, where's most of my traffic coming from, and who's got the loudest voice, and what's the sentiment, and how can I provide instant insight to my advertisers — my customers — in order for them to make advertising decisions?
So this particular media company takes GNIP, takes Moreover. So GNIP gives us, of course, Twitter and Moreover gives us the blogosphere/social combination new media, right? Combined with TV, radio, print. This is incredibly powerful. This particular organization was able to build and deliver a new application with a three-person team within 30 days. Amazing.
So they have data scientists, they have application developers, they have IT staff. But what's in the middle making this all work, and making an application that's on the outside being delivered to their advertisers that give you a better understanding of where the source of information, or source of the advertising response, is coming from and what the sentiment is, that those business users can act on? Well, you've got, in the middle, you've got to be able to process this data in motion. You have to have a data delivery service that knows how to deal with data real-time, in motion, as well as historic data, and be able to process it in-stream and provide instant response and information to the application. To the business user.
And of course, as it's moving through, it's being dropped down into Hadoop for long-term analytics, and into NoSQL, so you could query it. And then this application, obviously, being built on APIs makes the infrastructure transparent.
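The data flow described here, where each event is processed in-stream, archived for long-term analytics, and indexed for queries, can be sketched as a single ingest step that fans out three ways. In this toy version a list stands in for Hadoop and a dictionary stands in for the NoSQL store. The class, field names, and event shape are all assumptions for illustration and not the customer's or Infochimps' actual design.

```python
from collections import Counter


class Pipeline:
    """Toy data-delivery service: one event in, three destinations out."""

    def __init__(self):
        self.archive = []          # stand-in for Hadoop long-term storage
        self.index = {}            # stand-in for a NoSQL store keyed by id
        self.mentions = Counter()  # running in-stream metric per source

    def ingest(self, event):
        """Update the live metric, then archive and index the same event."""
        self.mentions[event["source"]] += 1
        self.archive.append(event)
        self.index[event["id"]] = event
        # The application reads this snapshot for its instant response.
        return dict(self.mentions)


pipeline = Pipeline()
pipeline.ingest({"id": 1, "source": "twitter", "text": "loved the ad"})
latest = pipeline.ingest({"id": 2, "source": "tv", "text": "spot aired"})
print(latest)
```

The application only ever calls `ingest` and reads the returned snapshot, which is the sense in which an API layer makes the infrastructure underneath transparent.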
So this is where, I think, any media company needs to go. And of course the data sources can change or expand. But the architecture, as I've simply shown it here, it works. This is actually a customer that is producing a high-value application that they charge a pretty nice price tag per month per advertiser customer.
Let's talk about another use case. Retail company. There's a lot of them out there. And retail companies touch consumers at multiple touchpoints. So how do I increase online revenue? That's a big question. Because most retailers, bricks and mortar and clicks and clicks and bricks, everybody's moving online.
So in this particular use case, on the far left, I've got a huge customer base, right? Whether they be in-store or online, the customers that I know that shop with me. And then on the top, I have all these approved offers. 15% off this and what have you. All these offers that I'm sending to my customer base. And then I have all this approved content. All these products that support these offers. And then in the middle is this big question mark. It's this analytics box.
It's this box that says, "I want to market to one. I want to be able to understand a single user's user behavior on the Web. What are they clicking? What are they looking at? What do they do in the store? And then how do I customize some sort of campaign to them? How do I personalize it?"
In this case, I'll give you an example of personalized email. This is a personalization solution that every retailer either wants, or is trying to have, or does have and they're ahead of the competition. Why is this a big data problem? Because if you can capture all of your customers' behavior across all touchpoints, obviously that personalized email or that personalized customer service call or that personalized Web interaction is going to be just that. It's going to be unique to you. It's going to be personalized.
So how does this work? Well, you take the current campaigns at the very top, in the green. You take the products that I have, and I take all the transactional information in between. And I use it to understand and target my users. I use a Hadoop cluster. I create data models. I run analytics against it. And I create targeted offer campaigns that then eventually become a personalized email.
So this is an actual use case. This is a very large retailer that went through this process, and as you would hope, you measure the performance, you improve, and you get 85% accuracy predicting what your users' interests are and actually getting them to purchase the right thing at the right time the first time. What retailer wouldn't want this?
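The targeting step can be sketched very simply: count what each customer buys by category, then pick the approved offer whose category that customer buys most. This is a deliberately naive stand-in for the data models and analytics the retailer actually ran, which the talk does not detail. The customers, categories, and offers below are invented.

```python
from collections import Counter


def category_affinity(transactions):
    """Count purchases per category for each customer."""
    by_customer = {}
    for customer, category in transactions:
        by_customer.setdefault(customer, Counter())[category] += 1
    return by_customer


def pick_offer(affinity, offers):
    """Pick the offer whose category this customer has bought most often."""
    return max(offers, key=lambda name: affinity.get(offers[name], 0))


# Hypothetical transactions: (customer, product category).
transactions = [
    ("ann", "shoes"), ("ann", "shoes"), ("ann", "toys"),
    ("bob", "toys"),
]
# Hypothetical approved offers: offer text -> category it promotes.
offers = {"15% off shoes": "shoes", "toy bundle": "toys"}

affinity = category_affinity(transactions)
for customer in sorted(affinity):
    print(customer, "->", pick_offer(affinity[customer], offers))
```

Measuring which of those emails lead to a purchase, and feeding that back into the counts, is the improvement loop the talk describes.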
So I'm going to break there for a minute before we push forward. Time for second-to-last poll, if you would humor me. Amanda?
Amanda:
Yes, Jim. So we have our second poll coming up. I'll go ahead and launch that now. Shouldn't take us too long to get a tally of votes.
Jim:
So given my use cases, hopefully we've got a few out there that are very interested in doing the same or something similar. It'll be interesting to see what kind of activity we have out there. I'm going to guess yes on most?
Amanda:
We have about …
Jim:
And if it's a no, it's …
Amanda:
Go ahead.
Jim:
We have a few nos, it just means that they're doing a little research, right?
Amanda:
Yeah, I would think so.
Jim:
Trying to figure out what to do next, or if they should jump into this.
Amanda:
Mmhmm. Or maybe how to jump into this, right?
Jim:
But given that we have a lot of data scientists and a lot of executives on this webinar, I'm going to guess that most of them are already pretty well-educated on big data, and have or will be having a project. So let's see what we've got.
Amanda:
The majority of the audience has voted. If you haven't voted yet, go ahead and place your vote now. We'll be closing the poll in just a moment. And Jim, are you ready for those results?
Jim:
All right. Let's see it.
Amanda:
And there we go.
Jim:
Okay. Not sure. Well, I would just hope that that not sure becomes a yes at the end of this webinar, and the nos will also become a yes. Good. All right.
Amanda:
All right.
Jim:
Well, let's push on, then.
Amanda:
There we go.
Jim:
Great. Thanks for participating. So even if you're not ready, I am going to tell you how you're going to get started. But the important thing is, how do I get started without spending a shitload of money, right? I want to be able to … any executive, myself included, you want to be able to prove ROI before you make a huge investment. I mean, it's kind of a little bit of a, I guess a return on investment. So you want to figure out what your return on investment can be before you make that investment. So how do you do something in big data and make a little bit of investment that shows a clear path to a huge return, so then you can turn around and pour some money into this?
And I think, when I think about the ways you can proceed with a big data project, you can obviously do it on-premise. I mean, that's the clear path. You're an early adopter. Go ahead. Let's go ahead and invest in a Dell, HP, IBM, pick your hardware vendor, get that 100-node cluster up and running, and then pick your distribution from Cloudera or Hortonworks or MapR, and then hire … we have a lot of data scientists on the line here. Hire your Hadoop experts and your NoSQL experts and your Hadoop administrators. Just staff this thing up and go.
But I do think that's quite a bit of an investment. You could also deploy this in a trusted data center. Maybe some of your infrastructure's already outsourced. Maybe you're a Fortune 1000 company, where you already have the data center consolidation strategy underway, and 30% of your infrastructure's in-house and 70%'s outsourced in an Equinix or similar partner. Or maybe you're a thought leader, and you're leveraging public cloud. You're an Amazon user.
I think each of these have their time to market, their security, their investment trade-offs, right? So let's talk about those.
So on the far left, I have three use cases. And I'm simplifying this. But three use cases where you manage your infrastructure. And then I have a couple on the right where somebody else manages it for you. So let's talk through these. And I'm showing you cost here.
So the private, and I'm assuming you want an elastic capability, but maybe you just do a fixed Hadoop cluster, for example. A certain size. Or maybe you virtualize it. And you make that — let's just assume — you do make that an elastic Hadoop infrastructure within your own data center, your own people, you manage it. Well, obviously that's going to be your biggest investment.
And then, secondly, you might move into a virtualized infrastructure where somebody else manages it. You leverage the economies of scale associated with the interconnectivity cost, the explorer cost, the remote hands. All that you leverage, the economics there. And that brings down your cost.
And then finally, if you're a big adopter, an early adopter, and you're willing to take advantage of Werner Vogels' public cloud, or Lanham's Rackspace cloud, etc., then you might find some more leverage, from an economics perspective, a time to market perspective, etc.
And then if you go further on that, I think there's opportunities where you can bring in people who do understand big data infrastructure and who can manage it for you in a virtual private setting, or a public setting. And I think those managed services, then, take some pressure off of the internal organization to become experts overnight, and that's a lower investment, quicker time to market.
So that's purely dollars. Let's talk about risk. And again, I'm oversimplifying this. But basically, as a CEO or CXO, you would obviously say, "If I manage it myself, I can keep my security issues closer to the vest, and manage those better. And if I move it into my partner's domain, maybe a little perceived security risk goes up." And perceived versus reality, obviously we all know, are two different things. But perception is reality, so public cloud is perceived to be a higher security risk, even if it's not. So then we drop back down into a secure environment.
Now, if you've got a Tier 4 data center, ladies and gentlemen, it's lights-out. It's highly automated. It's highly secure. Even if somebody else manages it that has your top-secret clearance, that's not much higher security threat than managing it yourself. And of course, public cloud managed by somebody else still has the same perceived risk.
So this is a security issue kind of representation. And then let's talk about time to market. I can tell you right now, having done a big project for a retailer, it took us on the order of six months to get the hardware in and spun up and ready to analyze some of the data. And whether you do that in-house, you do that in a third-party data center, it's going to be the same time.
Now, you can spin things up much more quickly within public infrastructure because it's geared towards that, whether it's EMR or otherwise. Amazon EMR or otherwise. And then I think if you're going into the model where you're having somebody else manage things for you, the premise there is that they've done this many times over and they can spin something up in weeks instead of months.
And so, whether it's in a virtual, private, or public setting, these all kind of bring time to market value to you.
So I oversimplify things, but I'm going to put a star in the top of this virtual private big data cloud managed service. It doesn't mean that that's the only option. You could go virtual private cloud within Amazon, for example, and I can move that star over to the public big data cloud. Or I can think maybe this virtual private is Amazon. In either case, having somebody else manage this infrastructure for you, whether it's in a single-tenant or multi-tenant environment, and they bring the expertise to you, is obviously an area that we're very high on at Infochimps, and that we'll talk about a little bit further later.
So, cat out of the bag. Who's going to help you with this? Of course I provide a little bit of a plug here for Infochimps. That's what business we're in. Infochimps is a managed big data service. We do provide elastic and secure big data technologies in a private and public cloud setting, whether it be on-premise or otherwise, and we do this across both public clouds as well as a global trusted network of data center providers.
This is a key go-to-market strategy for us, as well as our Fortune 1000 customers, because data does not want to be moved necessarily, or sneakernetted, up to a public cloud. Sometimes that's the case, and it's something that our customers do and are able to accomplish. In other cases, we've got to bring the computational capability to their data. So I think this is a key way of addressing that type of need.
And then also, we're not just focused on spinning up Hadoop and making that work for you. We have to address the business problem. We look at things in terms of the business requirements, and those business requirements typically ask for a combined batch and real-time analytic framework that supports both structured and unstructured data. So this is what we're all about.
We provide those, manage it for you, push the button, get you up in weeks. And where we come from is a history of managing thousands and thousands of data sets, so we know how to deal with lots of history and in-motion data, and we know how to take that infrastructure and make it available to you as an enterprise who needs cloud-based data services.
Big data platform as a service is basically where we are today, and this data intelligence network, this idea of bringing our computational stack to you and your data sets through Tier 4, or trusted, data centers is something that differentiates us from our peers. We install and manage on top of OpenStack and vSphere, as well as bare metal for some.
So that's us, and our offering consists of a number of blocks. It starts with the ability to push the button and spin up our cloud infrastructure, and that's done through a solution called Ironfan. And then above that, we have the core database solutions. NoSQL, NewSQL, and then a host of Hadoop distributions to give you elastic Hadoop.
And then of course, just spinning up those environments is not enough. We've got to connect you to your data sources. We've got to process your data in motion. And we do that with a service called Data Delivery Services.
But the real value for our customers, and the reason we've seen people able to take three developers and actually code up an application within 30 days, is that, as a platform-as-a-service provider, we focus on making big data simple using a scripting or data science language called Wukong that we give to application developers. And we turn application developers into data scientists, and augment your existing data scientists with more people savvy on how to use data and bring it to the organization through applications.
And then, while you're developing your application, we have a very simple iterative kind of BI environment for reporting and system management, but also understanding what is happening with your data flows and your application. We call it Dashpot. And everything is built on top of APIs, so that the infrastructure is abstracted.
And the way we offer this really suits our customers' needs. If you want to tinker with this, go to GitHub. Our code's there. We're an open source advocate, so we open source our solutions. And you can go ahead and kick the tires and deploy anywhere. If you want to have a pilot production solution in the public cloud, we'll support you with that. If you need to bring our computational capability to your data in a virtual private cloud setting, we can do that. And if you're really, really tight on where your data is, you need it on premise, we will support you behind your firewall.
So these are the offerings that we provide as a solution provider in this big data space. So let's take our last poll. Amanda?
Amanda:
Yes, Jim. I'm going to go ahead and …
Jim:
Given what I've just reviewed.
Amanda:
Yeah. This poll is, "What deployment option do you prefer?" So if the audience will take a moment to vote, we'll tally in just a second.
Jim:
So I'm going to guess 20% public cloud, just to kick the tires. Majority private cloud. Maybe a few people kind of straggling in the virtual cloud. No cloud. Let's just define no cloud as a non-virtualized, complete fixed cluster. You're buying the box, and obviously, if you're buying the box and it's a fixed resource, we can install it — or you can install it — obviously in any data center, but let's just assume you're in a data center, whether it's yours or your partner's.
Amanda:
A lot of votes pouring in.
Jim:
But your first three options are … Yeah. Your first three options are going to be elastic. Your fourth option is non-elastic. And maybe some people feel like they need to have it non-elastic just for performance reasons, but I think in today's environment, for cloud-friendly enterprises, the value of giving multiple people access to a big data infrastructure in a multi-tenant environment is going to be important. We'll see how this pans out. Let's see what our results are.
Amanda:
All right. Let's close the vote. I think these results may surprise you. Here we go.
Jim:
Holy smokes! Okay, so I was close on public cloud, and the rest is really kind of split almost evenly across private and virtualized types of cloud. Very good. I think that's very telling. So, hopefully (let's go ahead and move back to the presentation) that's a meaningful overview. Are we back live?
Amanda:
We are back live, Jim, yes.
Jim:
Okay. So hopefully that's a meaningful overview. And we can obviously dig into a deeper set of questions and open up the mic, essentially, to our audience, and see if we can speak to some of these specifics around either deployment options or philosophy around batch and real-time, or some of the use cases that we described. So did you get any …
Amanda:
Yeah. We've had a lot of questions come in, actually, throughout the webcast, and we can go ahead and answer a few here, or at least address a couple. Sarah asks, "How do we get big data into the hands of executives and business users? Right now, it feels very engineer and data scientist driven."
Jim:
Yeah. That's a great question, Sarah. And I think in any early adoption period of new technology, it is very much a technology play. Most of the big data efforts are technology vendors pushing technology solutions. And I think if you look at the history of any … even, like, a relational database, for example. In the early days, people were pushing just relational databases. I started at Teradata when they were a $200 million company, and I sold a database. And when I left after 10 years, they were an over $2 billion company selling solutions.
So I think it's a great question. And what we want to see is taking these new big data technologies and actually starting with the business questions first. So I can tell you this. There's not one existing business question today that you cannot apply a new big data technology to. Meaning, every question, whether it's "How do I better optimize store placement?" or "How do I target my customers better?" or "How do I reduce my fraud?" Whatever the question may be in your industry. Telecom, finance, or retail. Take that question (the most pressing one, the one at the top of your list that you feel is being addressed the least) and then we can literally do an information discovery around that question and say, "What information elements have not been applied to answering that question?" Because the technology can't support it, or the cost structure doesn't support it economically.
And I can point to you, as a proof of concept or a pilot, a project that you could initiate that would just aggregate more of the data that you already own, or that you're dropping on the floor, that can improve your answers to those questions. And so that's how we're going to bring, I think, big data to our executives, is we've actually got to start with the executives and say, "Okay, let's rank order the questions that you really are tired of not getting the right answers to, and we'll go back and look at the data infrastructure and see how we can augment it with some big data technologies."
This does not mean that you're going to go in and forklift any of what exists today. Let's be clear that existing data infrastructure, all the existing stack in the enterprise, is there, and it's going to stay there for quite some time. What we're talking about is augmenting that and adding in a solution, a big data solution, that first proves ROI. Once you've proven ROI around those use cases that I just showed you, then they get operationalized. And you can operationalize those in a lot of different ways.
But I would encourage all of the people on this webinar: start with the problem, and then figure out which way you want to deploy the solution in a big data space, whether you want to do it quickly in a public cloud setting, if you can actually make that happen from a security standpoint and a data governance standpoint. If you can't, just go down the list and figure out which of these solutions works best for you. But do it quickly. Bite off a small project. Make it happen fast, and then turn around and build on it. That's what I see successful organizations and enterprises adopting big data … the ones that are successful are doing it just that way. One application at a time.
They don't go and say, "I'm going to be a big data technology company, and I'm going to invest millions and get my cluster, and hire all these people, and then we're going to finally start asking questions." That may work. But I tell you, your time to market on return on investment is going to be much longer.
All right. Next question?
Amanda:
Yes, Jim. Cyrus asks, "How does Wukong compare to Pig or other frameworks for scripting distributed data processing?"
Jim:
Ooh, that's a great question. Love to bring my product manager and my CTOs and my engineers in. We don't actually displace Pig. We actually use Pig for what it's strong at, and Wukong is applied in certain use cases, and I think the two in combination have really enabled the folks that we're working with.
So I'll give you an example. One of our customers is working on what I call kind of a billion dollar pipeline project. And normally in this situation, you'd have to hire your talent, or your Informatica, or your Ab Initio specialist who knows how to connect to the data sources, knows how to manipulate them and get them into the right format for you, whether it be in HDFS or HBase or what have you. And I've seen those efforts fail. I hired a person with 10 years of experience in these data integration solutions, and they still couldn't get the work done for me.
So at this one particular customer, a new graduate out of college, with no programming experience, gets behind Wukong and is able to actually define a data pipelining workflow that's 10 times more complicated than the one I hired this 10-year specialist to do, and was able to make it work without any experience. So we basically take the data pipeline and make it very simple. And that's done by constructing a data flow with, essentially, the data sources. Kind of define them as, like, nouns. And then any sort of processing you do in-stream or on the data is like verbs, and then you construct those together in sentences. And it's so simple. It's a very simple, declarative way of actually creating a data pipeline in a language that, I think, is much more functional than Pig, in many ways.
Again, that being said, we don't necessarily say, "Don't use Pig." We use it where it needs to be used, and we use it in combination. But it's a very powerful tool for people who have very little experience in data. And the reason why it's so powerful is we had to build it ourselves. We were a small company in the beginning, and we had to do things in a simple way, so we built it for ourselves, and managed thousands of data sets. And so now we've turned that out for our customers.
And if you need more details, contact us directly and we'll dig into the details for you.
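The "nouns and verbs" idea described above can be illustrated with a short sketch. This is not Wukong's actual API (Wukong itself is a Ruby library); it is a minimal Python illustration of the general pattern, and the names `source`, `keep`, `transform`, and `flow`, along with the sample records, are hypothetical and chosen only for this example.

```python
from typing import Callable, Iterable, Iterator

Record = dict
Verb = Callable[[Iterator[Record]], Iterator[Record]]


def source(rows: Iterable[Record]) -> Iterator[Record]:
    """A 'noun': anything that yields records (hypothetical helper)."""
    yield from rows


def keep(predicate: Callable[[Record], bool]) -> Verb:
    """A 'verb' that filters records in-stream."""
    def step(records: Iterator[Record]) -> Iterator[Record]:
        return (r for r in records if predicate(r))
    return step


def transform(fn: Callable[[Record], Record]) -> Verb:
    """A 'verb' that rewrites each record in-stream."""
    def step(records: Iterator[Record]) -> Iterator[Record]:
        return (fn(r) for r in records)
    return step


def flow(noun: Iterable[Record], *verbs: Verb) -> Iterator[Record]:
    """A 'sentence': one noun followed by verbs, applied left to right."""
    stream = iter(noun)
    for verb in verbs:
        stream = verb(stream)
    return stream


# Made-up sample data, standing in for a real data source.
messages = [
    {"user": "ann", "text": "Loving Big Data"},
    {"user": "bob", "text": ""},
    {"user": "cy", "text": "Hadoop rocks"},
]

result = list(flow(
    source(messages),
    keep(lambda r: bool(r["text"])),                        # drop empty messages
    transform(lambda r: {**r, "text": r["text"].lower()}),  # normalize case
))
print(result)
```

The point of the pattern is that the pipeline reads as a declaration of what flows where, while the plumbing between steps stays hidden inside `flow`.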
Amanda:
Jim, we have another question from Anita. It's actually a two-part question. The first part: "Does your system scrape social conversations automatically, like Facebook and Twitter?" And the second part: "Can the integrated data be analyzed in traditional analysis programs like R, SAS, and Python?"
Jim:
Can you repeat the second part? You dropped out there for a second.
Amanda:
Sure. The second part of the question is: "Can the integrated data be analyzed in traditional analysis programs like R, SAS, and Python?"
Jim:
Ah, great. Great question. I'll answer the second one first.
So we're definitely big advocates of using R and even analytic solutions like Mahout, and we definitely understand the value and the power of data manipulation with Python and other scripting languages. So the quick answer is yes. I think what our customers are finding, and what we're seeing in use cases, is that once you have the data in an aggregate form, you can apply any sort of solution against it. And you can use, for example, our suite of analytics. You can use it as part of a data workflow within Wukong, or you could use it outside of our data pipeline environment.
But essentially, when we manage this platform for you, ideally we want to put an API in front of your application developers, because we do know that the data scientists need to get in there. And so there's nothing that we're doing that doesn't allow you to get in and actually get under the covers yourselves. So there's really no limit to what toolsets you can deploy against Hadoop or other parts of the data infrastructure that we're hosting and managing for you, and I think that's encouraged for lots of our customers. And I think what our hope is, is that you can get these applications developed on top of this infrastructure more quickly because of that flexibility.
And then your first question, in terms of whether we automatically scrape the Web and provide ways of getting access to the social data. My quick answer to that is, we work with a suite of partners who are phenomenal at getting that data in the form that's needed. Such as Gnip, as an example. And moreover, Intensity. So there are great solutions in terms of getting access to that data without having to do a lot of heavy lifting. And we will bring that to you as part of a solution with us. That's part of where we shine in the sense that, as a … initially our company was founded on the premise of making data accessible and easy, and making public data sets as well as these new social data sets accessible. So that's definitely something we can speak to offline in detail.
Any other questions, Amanda?
Amanda:
Yeah. We have two more, if you don't mind, Jim. We have one from Vineet, who asks, "Different enterprises deal with totally different types of data as their domain, like retail, insurance, medical, etc. How does the analytics framework provide simplified APIs across business domains?"
Jim:
Well, that's a great, great question. I think the holy grail is kind of like in the relational domain. You have a star schema, and they're all kind of well-defined for a retail data model, and for a financial data model, and for a Telco data model. I think the quick answer is it's very early right now, in terms of how this big data infrastructure is being deployed across the verticals, and it's really hard to find these kinds of out-of-the-box packages for new customers that come into this domain.
And that's one of the value propositions that you get by working with solution providers like Infochimps, where we have broader experience and we have executed across a number of domains, and we do have, for example, a library of analytics that are used to process data. So we have filters. We have sentiment analysis. We have things that we've used over the last four or five years. We've built up our own library. And we make that available for you. But that doesn't mean you're not going to build on top of it.
So we will help you and guide you through what kind of data analytics could be best applied to your business problems. Again, focusing back to what the questions are for you, and then deploying the right architecture, infrastructure, and suite of analytics that can support those business questions.
Does that help?
Amanda:
Thank you, Jim. That does help.
Last question comes in from Debosh, who asks, "Which framework or database would you suggest, keeping in mind that unstructured data needs to be augmented with structured data for deriving real value?"
Jim:
So my quick answer is, there's not one size fits all. But hopefully you don't even need to answer that question, in the sense that, again, if you work with your business users and define the business problems and the applications you're trying to address, then we, from a perspective of giving you a complete managed service … if you just have a data store underneath, which data store is there ideally shouldn't be the issue, or shouldn't be the question.
Now, that being said, we do support multiple distributions of Hadoop. We do support different NoSQL databases, whether it be HBase or Cassandra or Elasticsearch or MongoDB. And so we clearly have a grab bag of solutions from these commercial open source offerings, and we do have experience across a number of different data stores, because we had to.
But that's part of, hopefully, the value proposition that you'll find important to your business, working with us, is we'll give you the right architecture based on the problems that you're trying to solve. And then the neat thing about this is, once we provide a solution for you, and every one of our customers, you're not tied or locked into that.
So obviously we're a big supporter of open source. We're not going to lock you into any proprietary. But we're also not going to lock you into commercial open source. We're not going to just sell you Cassandra, you know? We're not just going to be focused on one data store and say, "That's all we do, and by the way, it can do everything for you." Because, again, we come from a reality perspective, just being completely practical, that there is not one size that fits all.
And you have to be able to provide at least some flexibility in your data infrastructure in this big data space, because it is very early, and not a lot of people know exactly which problems they want to solve first, and we need to spin a little bit on a dime. And the beauty, again, is you can spin on a dime. You don't have to make huge investments to prove value to your CXOs, to your executives.
So I encourage you to reach out to us and start one single, simple project, and then we can expand from there.
So with that, I'm going to thank everybody who still stayed on the line. I would encourage you to make sense of big data by reaching out to us, Infochimps.com. If you're still kind of a traditional person, want to pick up the phone, data [fun], dial us. And thank you for letting me get on the stage here. I'm extremely excited around what we can accomplish as an industry in making data valuable.
Amanda:
Thank you so much, everyone, for joining the webcast. At this time, we're going to close. We will send the slides and the recording out via email early next week, so look for that in your inbox. If anything comes up in the meantime, feel free to email me at amanda@infochimps.com. And thank you so much for your attendance today.