Amanda: Hi, welcome to the webcast. My name's Amanda. We're going to wait just a minute or two to allow a few more people to arrive and then we'll get started. Thank you.
Hi. We'll go ahead and get started. Thank you all for joining us and welcome. My name's Amanda McGuckin Hager. I'm the Director of Marketing here at Infochimps, and I'm pleased to introduce you to Tim Gasper, our Product Manager. Before I do, there are a few housekeeping items that I'd like to address.
Number one, this webcast is being recorded so we'll send you an e-mail early next week with a URL link to the recording for your future reference. We hope that you show it and share it with your friends.
Secondly, please feel free to ask questions using the control panel, the webinar control panel on the webcast. We will answer all questions. We will address a lot of them towards the end.
Without further ado, Tim Gasper will present to us the top strategies for successful big data projects. Tim.
Tim: Hi, everyone. Thanks, Amanda for the introduction. Hey, I'm Tim Gasper. I'm our Product Manager here at Infochimps. I've been here for a little over a year now. And, it's definitely been an exciting time over here because as you may know a little bit about our history at Infochimps we used to be a data market place where we would allow for the buying and selling of data. That part of our products suite definitely still exists, but definitely our flagship product now has switched over to our managed platform as a service for big data called the Infochimps Platform.
It's been an exciting transition because what we've been able to see is a lot of different types of use cases and implementation patterns for a lot of different customers that we've been able to work with that I've seen from a product side, that our engineering team has seen from an implementation side, and from where our account sales folks have come in just from the engagement side, and having those conversations about how to get these projects kicked off and how to work with a vendor to actually achieve those.
So, we've been thinking a lot about how we can make projects go more smoothly. In addition to just our own experience, we've been doing a few surveys, one of which we did recently on this topic about the data projects and how IT can be successful with these types of things, which has given us a lot of interesting nuggets of information and some stats. So, we're going to be looking at a few of those throughout this kind of speckled in as we go.
So, just to get it out of the way first, before we start diving into the main content of this, I want to talk a little bit about who we are at Infochimps and kind of what we're doing for our customers.
Just as a little reminder, if you'd like to tweet at us at any point during the webinar, feel free to use the hashtag that you can see up at the top left corner, hashtag [ddsuccess], and we'll be monitoring that as well as participating in it.
At Infochimps we like to characterize ourselves as the number one big data platform for the cloud, and that doesn't necessarily mean public cloud. That means that what we try to do is really make big data a more simple, fast experience by packaging all the tools that you need, as well as the managed services and consulting into one suite of services that you can use to get your [big] data projects done.
We have a bunch of different customers in a lot of different industries, such as media and advertising, more technology type companies, hardware, supply chain type things, finances, etc. And you can see a couple of our partners listed as well, some of the other technology providers that we like to work with in order to complete our solutions.
And the basic way that the Infochimps platform works is that we help customers not worry about the big data infrastructure side of things. There are a lot of different technologies that you want to be able to utilize when doing big data, and we don't hide any of them, but we definitely try to minimize complexity.
These systems are fickle in many cases. They require a lot of system administration, maintenance, and management. We take care of that gray square in the center, the data infrastructure. And you as a company just get to focus on which data you would like to bring into that system and what you want to be able to do with that data.
Do I want to process it and perform more of the data science on that data? Would I like to be building big data applications, whether they are custom applications that I'm building myself for various use cases in my organization or if it's bringing tools that already exist there with this infrastructure?
And then finally, things like business intelligence, whether that's visualizations, dashboards, visibility into how your business is doing, your different business processes, and really big data is a broad platform that addresses a lot of different use cases.
And, so that aside, Infochimps being a company that can help with a lot of the scalability and management around big data and getting those big projects done, today I want to set an agenda for the different topics we will be hitting.
First, why do you care about big data? A lot of this is going to be re-hashing information that you already know, but we'll move through that pretty quickly. Too many big data projects fail, and we'll throw a stat or two out there and explain why these projects fail.
And then we'll talk about seven strategies that we've seen that map to why those big data projects fail, as well as map to what we've seen in all the implementations that we've been doing, and how you can make your big data project more successful. Hopefully this will be a good initial way to kick the tires on these strategies and I'm looking forward to your feedback on them, as well as our opportunity to do a follow-up blog post and some white papers and things to really flesh this all out.
And then finally, we'll do a Q&A, and we'll look at any questions that you may have that are related to the topics here and we'll address them.
So, first let's start off with a poll. We're curious to know whether or not you're currently involved in a big data project. This will give a little bit of context for us in terms of whether or not you're kind of looking at something prospectively or if you're kind of in the middle, in the thick of something. So, feel free to fill that out and we'll read the answer in a minute.
All right. So, Amanda, do you have the results?
Amanda: I do. Many of you have voted. If you haven't placed your vote go ahead and do so now. We'll give it just a minute or two, and I will close the poll. Tim, it looks like we're about half and half audience here. 54% of the audience is currently involved in a big data project right now and 46% are not currently involved in a big data project.
Tim: Interesting. That is actually really closely matched to a lot of the surveys that we've been doing, which have shown that of respondents it's about half, sometimes a little over half are involved in a big data project at the moment actually. It's good to see that and it's good to see that a lot of people are involved in big data projects because it definitely shows a high initiative on the IT calendar. Cool.
So, that being said, that puts some interesting context on this because it seems like some of you are going to be in the planning stages. Some of you are definitely in the thick of it. Once we set a little bit of a foundation in what we're talking about when we're thinking about big data, we'll start talking about why those projects fail.
So, some of the most common things that you hear about with big data are obviously the three Vs, and every vendor is going to rehash those over and over again. Unfortunately, they are a very good way of explaining the kinds of problems that big data is addressing. The first is volume. How many terabytes are you managing? Are you entering the petabyte scale even? The second is velocity. Are you handling thousands of transactions coming in, or many tweets per second? Velocity is a recurring theme.
And the third point, variety, which is probably underemphasized, is how the data looks so different, from structured to unstructured, machine data and log data versus CSV files and financial records. Being able to bring that all together is something that we've been thinking about ever since we entered the data warehousing age, but now that we're entering big data analytics and really trying to crunch as much data as possible, getting all that data to talk to one another is a really key challenge big data is trying to solve.
And you'll hear companies trying to throw in some additional Vs. If we scoured the whole internet we could probably find 10 additional Vs that companies have suggested adding to this list. But two of the ones that I like a lot are variability and veracity. Variability, in that one of the interesting things that big data tries to handle is these spikes in data and seasonality and things like that, where you may see your web traffic spike all of a sudden. Normal systems would have trouble handling these types of things, but big data is trying to accomplish this better.
And then veracity, what is the true data that you're going to want to use to represent your organization and being able to think about this as a concept across all the data management that you're doing.
Thinking of that from a big side, I want to flip to just the fact that big data isn't just about size. It's not just about the word big. There are a lot of reasons why big data technology is actually a better approach, and I'll just lay these all out on you.
By taking a big data approach these systems are actually more scalable. You can actually start small with a smaller amount of data and grow these systems without reaching the same kinds of pains that you would have had you stuck with more traditional systems, such as a MySQL database or something like that.
You have the ability to do more intelligent, predictive types of analysis, which were more difficult or impossible before, or only able to be done on samples of data instead of your entire warehouse of data. It's agnostic in that you can take advantage of the tools that you need to. It's not going to restrict you to using only one proprietary set of tools or strategies.
And, finally it's more holistic. Now we're trying to think of the data in our organization, not in separate data marts or different business processes, but let's unlock all that data and be able to draw all the conclusions that we need to so that we're only focusing on the questions that we need to answer.
So, let's quickly go over a few use cases that commonly come up with big data. First, let's just address a few oldies but goodies. These are some examples of use cases with big data that we see coming up commonly.
So we see website and system log analysis, large scale transaction analysis, whether that be purchases or customer interactions, trade performance analytics around financial data or insurance data, risk analysis and fraud detection, especially in the financial sector, and then finally a lot of geospatial analysis has been unlocked by big data. For example, when you ask Google to please give you directions from one point to another, doing all that crunching has been enabled by big data technologies.
But there are a lot of new use cases, as well, which are a really key part of big data that many companies including yours are probably thinking about. And they include things like doing large scale brand and sentiment analysis or crisis management, being able to target your marketing and have your website, whether that's an e-commerce website or just a general website, being able to personalize content and increase your conversions.
There is more of the customer insights and behavior analysis in order to improve your marketing or your sales or make your operations more intelligent. There is just general BI in being able to have your business intelligence dashboards or your graphs be able to show you the full breadth of what's happening in your company, where that information is often real time instead of monthly dash reports where you're looking in a rear-view mirror.
There are predictive modeling and intelligent systems, as well as large scale data mining and cleansing. With something like data mining and cleansing, you would think that companies would be doing this already. But a lot of times we've seen that when companies had a production use case and knew exactly what they were trying to accomplish, that was easier for them to do in the past, and things like exploratory mining and cleansing have actually been less of a priority.
Only now are they starting to consider that, which is an interesting thing because I think one of the biggest promises of big data is around being able to figure out those unknown questions, not the things you already know you want to ask.
So this is the big statistic that we're kind of calling attention to because we did a large scale survey of a bunch of different folks in the IT field, and what we discovered is that 44% of big data projects are cancelled. And, that is a huge, kind of upsetting fact because if you compare that to IT projects in general, only about 25% of IT projects outright fail or are cancelled.
Basically the industry is saying that almost twice as often projects in solving big data use cases or big data technology aren't achieving what they are trying to do. More specifically you have to imagine that more of these projects, although they may come to fruition and aren't shelved, that they aren't achieving their project objectives.
So, this is a really big problem and challenge because as an industry, as well as just business in general, we're identifying that big data is the trend and we're all saying that this industry and this market is going to grow into the many billions of dollars and everyone is going to have some big data of some fashion. At some point close to 2020 we won't even use the term big data anymore. It will be the de facto way of doing infrastructure around analytics and data management.
However, why are there so many false starts? Why are we having so many issues around this? And, that's a key problem here. And at Infochimps that is a key problem for us particularly because it is our job to try to help our customers make sure that the success rate looks more like 100%.
And so, I'm going to talk a little bit about the challenges that are caused by this. Before I do that I want to ask one more poll around whether or not you've been involved in a failed project and we'll see if this number kind of holds up to what your experiences have been.
All right. So, Amanda, what do the results look like for that?
Amanda: So, Tim, the majority of the audience has voted. We'll close the poll right now, and the results show a large difference. We have 81% of the audience saying no, they have not been involved in a failed big data project, with only 19% saying yes, they have been involved in a failed big data project.
Tim: Very interesting. So, this could mean kind of two things. One is that this audience happens to be a little bit savvier in terms of trying to get these projects accomplished in a way that works effectively for the organization. Another possibility is we may have wanted to present another poll option which is you haven't been a part of a big data project, which may change that a little bit. But that's a good thing to see. So, maybe some of these strategies are going to confirm some of the things you've already been doing in your organizations.
So, what we've seen based on our experiences on how to try to make these projects succeed and these surveys that we've done of those people that said that they've had a failed project, why it's failed, these were the sorts of reasons that they gave. And, the first category is around business reasons.
By far, the biggest reason why these projects failed was inaccurate scope. That could mean a couple of things. It could mean that requirements got absolutely blown out of proportion and there wasn't a firm hand on exactly what was going to need to be accomplished with that project, or more specifically that they just went past the time deadline. And without a firm success criterion the project just continued to roll forward until, when the ROI couldn't be proved, it had to be shut down. It was seen as a cost center instead of a profit center.
Another key point that came up is noncooperation between departments. And by nature big data is not just going to help one stakeholder. It's not a point tool. It's something where a lot of people are involved, a lot of different parties, your IT people, your application developers, your data scientists and analysts, and business people across the organization, line of business people and executives.
And having the right talent, or rather a lack of expertise, was a key point: it is hard to find data scientists that really meet this criterion. Obviously there are a couple problems there. One is more of a nature issue. Can we get that talent? Can we retain them and empower them? And then secondly, are there tools or vendors out there that can help us mitigate the need to have that talent or that expertise, and leverage what we already have?
And finally, from a technical perspective, it was mostly around technical or roll out roadblocks, having a strategic partner or consultant that could really help and make sure that that got rolled out properly, or at least that the documentation or the ways of getting those projects rolled out were clear and straightforward.
One of the main issues that came up specifically was being able to gather data from different sources. A lot of times kind of a fun statistic I like to use is that 90% of the sexiness of big data is around visualization and the actual data science, but 90% of the actual work is around ETL, getting that data in, adjusting it, cleaning it, making it analytics ready and putting it in a database that can handle it, that infrastructure side of the problem. So, what happens is you can spend 90% of your requirements gathering around that sexy part and then realize that the infrastructure is the main problem and really blow the project out.
And finally, finding and understanding tools, platforms, and technologies. It's a complicated world out there and it's changing rapidly, and we have to stay on top of it. Not all tools are great tools but there are a lot of good options out there, so how do you parse through all that? How do you make choices? How do you have multiple POCs or vendor arrangements going at the same time, without just trying to spread your bets too far?
So how do we prevent failure? And this is where we're going to talk about our seven strategies for big data project success, and as we go through this I'll try to sprinkle in some examples from some of the companies we've seen on how they've been able to see success.
And so, problem number one, or strategy number one, is define the problem. And it sounds so simple, but it's surprising how many organizations, when we talk to them for the first time, and this may be the case for many of you as well, give a very unclear answer to the simple question of what is your big data use case, or do you have a big data project. And the main reason for that is that there hasn't been a specific goal figured out internally for the organization that requires a certain solution.
And so by identifying what it is you're trying to do, whether that's we need to decrease fraud by 10%, or we need to increase the leads that we generate off our website by 50%, or we need to increase our e-commerce sales by 5%, being able to identify what it is that you're trying to accomplish whether it's revenue related, cost related, streamlining, more efficient time, being able to hire less people, that problem will drive all the technology decisions that you make.
And, you'll see in a couple subsequent strategies a rehashing of this point more specifically, but technology never comes first. Say you go to a company like [Cloudera] and say, "I'm interested in Hadoop." It may be true. You are interested in Hadoop, but what are you going to use it for? Are you interested in just getting technology for technology's sake or are you really pursuing the problem?
So, part of that is identifying the greatest opportunities and challenges that you have, that you can address around those different goals. A phrase that I like to use is, "Eat a frog." What is the worst challenge that is currently facing your organization around this goal? What's the biggest problem? Is it that you can't have visibility into how people are using your website? Well, then we have to address that first. Always eat a frog first. And it's called eating a frog because no one wants to eat a frog. Why would you want to eat a live frog?
On the flip side, what is the greatest opportunity that you could probably do? Is there an entirely new business line that you could enable here, a new product? That is a huge opportunity. Let's address that. Let's get that accomplished.
Number one is all about figuring out what that is. What is that North Star that this project is going to be headed towards? And we always have to keep that constant, keep that strategy constant because anytime you lose sight of that problem, it just becomes about technology and there is no clear end goal in mind.
The second strategy kind of extends this point, and it's think big and be specific. And this is meant to be a little bit conflicting or paradoxical on purpose because it's easy to think big and not be specific, or to be specific and not think big. So, what big goal are you tackling? This goes back to the eat a frog. Can we pick one specific thing that we want to do? And don't waste your time in getting to that.
So that's where we really want to be specific about what are the specific objectives that we're going to use to get to that. So, that may mean creating a very detailed project plan here, not just saying IT we need to put some big data in place and waiting to see what technology shows up at the door, but rather putting a detailed project plan about what are objectives, what are our goals around this project, and not trying to eat an elephant in one bite, being specific in what you're trying to accomplish yet trying to achieve high goals.
And it has to be something you can measure. A lot of this just makes sense from a project perspective but sometimes around big data it's easy to move away from traditional concepts around successful project management because it's exciting, because there is cool technology, and because a lot of people want to get their hands on it. But if you can measure it, if it's specific what you are trying to do, and in what order, and in what priority, and it's a big enough business goal that it's actually going to affect people and you can see ROI and it's creating business value, then you're going to be able to do something amazing.
And again, setting those priorities is so key because there is going to be complexity that shows up that you didn't expect and unless you know exactly what the highest priorities are, you're not going to come up. If you try to accomplish ten things and you can only do three, you need to have that clear set of, okay, we need to do the first three things first.
So, a great example of this would be in the case of one of our customers [Blue Cava]. They are a company that we've worked with where they are basically a device identification company. They help match your laptop to your iPhone to your web browser, so that you can tie together all these different people as being the common person even though they come to your website or interact with your brand from different sources.
One of the biggest things they came to us on was this concept of scalability. They needed to be able to scale their business into the future because they were growing so quickly. But scalability is such a large goal, and it was important for them to think extremely big and that's what allowed this project to be very successful.
However, we worked really closely with them to set exactly what those key priorities were. So even if we saw a lot of complexity or mishaps along the way, just from the complexity of these technologies, because we had said, "Okay. We need to hit this number of events per minute, and we need to interact exactly with these systems, and we want to be able to handle this type of log information, or this type of system information," we were able to meet our goals despite all the natural setbacks that happen along the way with any given project. So, very key there.
And just as an added point to this, a little bonus, it's interesting to think of projects having a little bit of variety to them. It's not just about doing exciting new things or creating absolutely new functionality or features for your business. New business initiatives are one aspect, but you also need to be considering what enhancements we can be making to our current operations, systems, and processes. Or could we be replacing some of our legacy systems?
Because in a lot of organizations, you may have some data warehouses or you may have some mainframes or systems that are doing a lot of the work today, work that could be done much more efficiently and cheaply on big data. So that's an important thing to think about when considering how you are going to set those big goals and how you're going to be specific. Is it something that is brand new, or is it something where we're going to be able to fix something existing?
So, only now, once we've had that big plan and we've been able to be specific about how we want to achieve it, the problem and solution, now we can be thinking about execution. We can be thinking about how we map priorities to technologies. So business value first, then technology.
And the key point here is going to be thinking high level. Because this technology is changing so rapidly and I'm going to give a few different examples as to why that is. So, first let's look at an example of a few different technologies here. If you're kind of savvy to some of the big data trends and things, you'll have seen some of this. You'll have seen some of these terms already.
But, Cloudera has just released something called Impala, which is an add-on to Hadoop, which is going to allow you to do more traditional business analysis type queries on data that's in Hadoop using SQL-like interfaces. It's supposed to be fast. But there is a more established tool already that Cloudera has released, as well as the general Hadoop project by Apache, called Hive, which also allows a SQL-like interface to Hadoop. It's not as fast but it's definitely been around longer, so there is the perception of it being more robust.
There is a company called Hadapt which provides a SQL-like way of interacting with big data, as well as you have more of traditional data warehouse or analytics appliance vendors like Vertica.
So, you just look at this as an example and you think of just the last six months or year. Impala didn't even exist six months ago. Hive has taken leaps and bounds in the last year. Hadapt didn't exist as a company a year ago, and Vertica is constantly releasing new features. These technologies are moving so quickly that if you only think of it in terms of, "Oh, is this MPP architecture or is this parallel processing or XYZ or one thing or the other," you're going to get caught up in the actual technology itself and not the goal of that technology.
So, by staying high level you get to think about how a vendor can work with me, or how I can work with the open source tools that are out there in a way where I'm not locking myself into one tool or another but rather one class of tools.
So a couple comparisons I'll make, for example, consider the difference between batch and real time. Is the problem you're trying to accomplish one in which you don't necessarily need to be handling and processing that data instantly, or in near real time? And maybe in a best case in that instance, you're looking more to do something around Hadoop and core Hadoop products.
But if you're thinking about real time and how can I, for example, ingest millions of tweets from Twitter and be able to do things to those tweets so I can have a heartbeat on my brand or on my customer's brands, well that needs to be done in a much faster way than Hadoop can do because Hadoop is very much about doing batch offline, sometimes lengthy processing. In that case you may be thinking more about a class of technology like complex event processing or stream processing framework such as Storm or [inaudible 00:35:51].
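To make the batch versus real-time distinction concrete, here is a minimal sketch in plain Python. It is not Hadoop or Storm code. The function names, brand names, and sample tweets are all hypothetical, and it only illustrates the difference between processing a complete dataset in one pass and updating a result as each event arrives.

```python
from collections import Counter
from typing import Iterable, Iterator

# Hypothetical sample data, used only for illustration.
SAMPLE_TWEETS = [
    "Loving my new acme phone",
    "acme support was slow today",
    "Trying globex instead",
]
BRANDS = {"acme", "globex"}


def count_mentions(text: str, brands: set[str], counts: Counter) -> None:
    """Add the brand mentions found in one tweet to the running counts."""
    for word in text.lower().split():
        if word in brands:
            counts[word] += 1


def batch_mention_counts(tweets: list[str], brands: set[str]) -> Counter:
    """Batch style: the whole dataset exists up front; one answer at the end."""
    counts: Counter = Counter()
    for text in tweets:
        count_mentions(text, brands, counts)
    return counts


def streaming_mention_counts(
    tweets: Iterable[str], brands: set[str]
) -> Iterator[Counter]:
    """Streaming style: emit an updated snapshot after every incoming event."""
    counts: Counter = Counter()
    for text in tweets:
        count_mentions(text, brands, counts)
        yield counts.copy()


if __name__ == "__main__":
    print("batch:", batch_mention_counts(SAMPLE_TWEETS, BRANDS))
    for snapshot in streaming_mention_counts(iter(SAMPLE_TWEETS), BRANDS):
        print("stream:", snapshot)
```

Both functions arrive at the same final counts. What differs is when an answer becomes available: the batch version reports once after reading everything, while the streaming version has a usable "heartbeat" after every tweet.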
You have to be thinking about proprietary options versus open options. Obviously there are a lot of choices there. I won't go into depth but when you're in a proprietary situation where for example you're working with a company like Vertica you get the benefit from some of the support there and the consistency around those platforms being able to work together versus a more open source type of environment, such as Infochimps works with a lot of open source software where you have more choices. You get to swap some of those things in and out and you get more with an enterprise support vendor that is supporting that open source software in addition to just developing.
And then finally thinking about what kinds of technology relationships do we want to have. Do we want to do most of this in-house? Do we want to try to have somebody provide more of the infrastructure side of things but we take care of the rest? All the way on a continuum to a pure software as a service solution where we're working with a vendor that almost hands us the analytics on a platter. And kind of deciding what the right choice is for us.
A lot of times this has a lot to do with the talent that you have. And if you've been thinking about how you can be big and how you can be specific, requirements are going to come out about how to allocate the right resources, both financial and human, to these projects, and the choice of DIY versus infrastructure versus platform versus software as a service will start to become more obvious.
So, we have one more poll that we'd like to do now. And, we'd like to know what you're thinking about how you're going to be executing your big data projects. For those that aren't currently involved in a big data project this can be a little more hypothetical, in terms of whether or not you'd be looking to do that project more on your own or in collaboration with a few different types of strategic partners, potentially in combination, or if there is something else that you're considering.
All right. Great. So, we'll take a look at those responses.
Amanda: So, I'm closing the polls. The majority of you have placed your votes. Let's see what the results say.
Tim: Cool. So, it looks like we have a little less than half thinking about doing it yourselves. So obviously that's a common thing that we see here, a lot around how we as an IT organization or as a grid organization can accomplish these projects mostly on our own and build this as a core competency. We have a little bit in terms of collaboration with a consultancy or strategic partner, and then a lot in the combination of DIY, vendors, and consultants, which makes a lot of sense.
In a situation where you're trying to build these systems you want to be able to manage them and have visibility into them and really have ownership over them. And that's an important aspect of this. But, being able to tap into that help where it's available, whether that's a technology vendor or a consultant can really help in making sure these projects get accomplished easily and quickly.
So we have four more points now around strategies, and I'll try to run through them a little bit more quickly so that we can have some time for questions.
So the fourth point is hit the ground running. And, this is where analysis paralysis comes into play. We can't waste too much time just thinking about how we're going to get this project going. And if you've done your homework properly it's going to be easy to stay disciplined and be iterative. Take a phased approach. You don't necessarily have to eat that whole elephant at the beginning.
Focus on what you're going to accomplish first that can get you to the first checkpoint of measurable specific success, and then we can iterate on a phase two and a phase three, and continue to grow that value over time, add more data sources, and add more use cases. Test and iterate as you go to make sure you're accomplishing it along the way. We can mitigate risk this way and ensure success.
Assuming you've done your homework, you have your compass. You know who the people are that you need to bring in, in order to make this happen. You've thought about the vendors and the technologies. Now it's about making it happen.
One of the most important things that we see that is so essential that can fall short is you need to know what you're going to be doing with the data. This is not just about taking infrastructure and making a technology decision. It's about figuring out how you're going to use that technology. If it's a big data BI framework what kinds of dashboards are you going to build that are going to answer what kinds of questions?
If you're utilizing Hadoop, what sorts of MapReduce jobs are you going to create that are going to answer the business processes that you need, or that are going to crunch the data that is coming off your financial wire, or from Twitter or Facebook, or from your customer insights and your website, or your purchases, and that are going to lead to those answers that are going to lead to actual business value?
Now five, manage scope creep, and that points to that really important point in terms of challenges around what everyone saw were the really key difficult challenges. This is really about managing requirements and expectations. And some of those requirements can be involved around just the underlying complexity of these technologies. It's easy to get caught up in how much there is to learn and know about this and how much they can actually accomplish. So by being specific upfront about that project, knowing exactly what you're trying to accomplish, and then actually enforcing it, you can really ensure that you stay on task to achieving that ROI and that business value.
And really managing people creep. Because so many people can get involved and help with this, you need to understand that including everyone as a stakeholder is important, but so is making sure that only the key initial users that are going to use this system start off using the system.
And then finally managing technology creep. This is a little different from business creep, which is more common. But technology creep, you can get kind of involved with the technologies, particularly around Hadoop and learn about some of their new capabilities and decide that it's easy to say yes when you know that you have the tools in the tool chest, and it's about trying to manage that and make sure you stay on task.
One of the biggest things that we've done around managing scope creep is trying to apply a really firm timeline onto the projects that we pursue. Typically Infochimps looks at a 30, 60, 90 day type of roll out, and if you can't fit your whole project into that kind of a timeline, then that's when you have to think about, what phase one can fit in a 30, 60, 90 day timeframe so that 90 days from now we have a system out there. We've kept our requirements keen and on focus. We've only involved the people that need to be in there, while considering other feedback. And then at the end of 90 days we get to have that checkpoint and think where we have gotten to, have we proven out that business value, where can we go to next.
Embracing the cross organizational nature of big data is our sixth point. And this came up already in our presentation, so I'll kind of bring out our points here. You have to know who your initial users are, but then think largely about who else can benefit, because when it moves on to phase two or phase three in your project you're going to want to be able to figure out how this investment can grow and continue to be that profit center for your organization. That means other data sources in some cases, other tools, other use cases like some of those use cases that we addressed early in the presentation, figuring out those next steps and involving the whole organization.
And finally to end things off we're just going to drive the point home of driving business value. And this isn't necessarily seventh in terms of importance. It's more of a way to end things off and it kind of bubbles over into everything and it's about the point of big data isn't that you're supposed to be driven by technology, it's that you're driven by business needs and that the true point of big data is not for it to be an infrastructure investment and a cost center, it's to be an enabler of business.
And this is true of any IT project, but with big data it is so easy to get caught up in the hype and the different acronyms and the different tools. Really you want to go to business value and what's the fastest path. What's the quickest amount of time you can get there? If there is any key metric that you can apply to your project, it will be time to business value. We'd love to talk to any one of you about how we can apply this time to business value measure to any project that you're working on, because it is so key to staying revenue focused and staying business value focused instead of cost focused or focused on just the implementation of different functionality.
Speed to insight is another key metric. How can we get to understand our business better? This is a little harder to measure but just as important.
And then finally, sticking to those data driven decisions. You need to be able to make your organization data driven. That is the promise of big data, to be able to take all the data internally as well as externally to your organization and apply it to your use cases. And big data needs to enable that. And any vendor that you work with, or consultants, or just any DIY approach needs to be thinking about how we take all those things into account.
Finally, our little motto is, Go fast and do awesome. It sounds silly but it really wraps things up. You want to be able to be iterative and fast and prove things out quickly, but you don't want to settle for less. Just standing up a NoSQL database and calling it victory is not enough to say that you accomplished business value. So do something awesome. Set your sights on something that is big. And be specific, execute it and stick to the expectations. Make sure that you control them. Manage that scope creep and then figure out the next steps.
So here you can see our wrap up slide of the different strategies for big data success. And after this we'll go into sort of our Q&A section, but before we do that I just want to make the point that we're looking for feedback and what you think about these points, maybe some stories about your own organization where you've seen one thing come to fruition or another. We're looking forward to pushing out some additional information and really fleshing this out because I think there's a need in the industry for more of a template around big data projects and how to get them accomplished, for them to be successful, and a path towards doing that and making sure that happens. Hopefully this can be a primer for that as we start to move into additional blog posts and white papers and frameworks around this.
With that being said, let's go ahead and move into our Q&A. Obviously feel free if you're leaving us to leave your comments in other methods or if you click that link there, Infochimps.com/freebigdataconsultation, we'd be happy to set up a meeting with you and start talking about your projects. Yeah, Amanda?
Amanda: So, Tim, we did have a couple questions come in that I'd like to share with you. Audience members if you have any questions use the chat functionality in your control panel and send those over our way. We'll do our best to answer them now. But, while we're waiting on those, I do have a few here.
So, Tim, Eric asks, "To the best of your knowledge, has big data been used for homeland security?"
Tim: That is actually an interesting question. I've had a couple conversations with a few folks involved in government consulting and from what I've seen so far, it doesn't seem like the national government has been that involved in big data yet from a Homeland Security perspective. Obviously from an intelligence perspective, they've been doing a lot around big data processing and analytics.
But when it comes to monitoring social media channels for example, or looking at sort of web analytics, they've been using some more traditional approaches such as just using typical social media monitoring tools and things like that. But it's one of those situations where I think that they're coming on to this faster and faster and it's spreading into more parts of the various organizations.
Amanda: Thank you, Tim. We have a question from DJ. "How do you do iterative development when the underlying technologies are so complex to get running?"
Tim: That's a good question. So one of the ways that we think about it at Infochimps is we have to separate some concerns around both business goals and technology goals. So, I'll call back to this whole 30, 60, 90 day framework where you want to try to figure out what is that bite-sized, measurable phase one that we can start off with.
So, from a business perspective we have to consider what's the one question or the one end-to-end flow that we can make sure that we answer, or the first two or three questions that we feel like we can fit into that scope and make sure that this system works and it is proving to add value, or that that end application you're building for big data has the information it needs to run its system.
And then from a technology perspective, can you phase out some of the technology? Can we start off with just a database, just Hadoop at first, or just a real-time ingestion framework for bringing the data in and putting that into a database, and maybe we'll add Hadoop later, and figuring out things around that?
Because in the end the technologies are just a tool to get the job done, but there are going to be many different ways to do it. So that's one of the biggest things, whether you're doing it internally or you're working with a vendor like Infochimps: trying to have that back and forth conversation about what is sort of the minimum viable product, the MVP, that you can build around this and then iterate on it from there.
Just to add another point, there are technologies which are going to make this a lot easier. One of the things that we really latch onto here is infrastructure as code, and we use one of our open source tools called Ironfan to do that type of iteration. That means that because it's code that's describing your systems, you have constructs that are more testable, you get to program them more iteratively and more collaboratively with other people, so in general, it just makes it faster and easier to be working with big data.
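To make the infrastructure-as-code idea concrete, here is a minimal sketch of a system described as code and then checked like any other code. It is generic Python for illustration only, not Ironfan's actual interface; the `Facet`, `Cluster`, and `validate` names and the role strings are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Facet:
    """A group of identical machines in a cluster (hypothetical model)."""
    name: str
    instances: int
    roles: tuple


@dataclass(frozen=True)
class Cluster:
    """A whole system described as plain data that can live in version control."""
    name: str
    facets: tuple

    def total_instances(self) -> int:
        # Sum the machine counts across every facet.
        return sum(f.instances for f in self.facets)

    def roles(self) -> set:
        # Collect every role that appears anywhere in the cluster.
        return {role for f in self.facets for role in f.roles}


def validate(cluster: Cluster) -> list:
    """Return a list of problems; an empty list means the description passes."""
    problems = []
    if "hadoop_namenode" not in cluster.roles():
        problems.append("no namenode role defined")
    for f in cluster.facets:
        if f.instances < 1:
            problems.append(f"facet {f.name} has no instances")
    return problems


# Phase one: a small cluster that can be grown in later iterations.
demo = Cluster(
    name="demo",
    facets=(
        Facet("master", 1, ("hadoop_namenode",)),
        Facet("worker", 3, ("hadoop_datanode",)),
    ),
)
```

Because the description is ordinary code, a change such as adding workers for phase two can be reviewed and tested before anything is provisioned.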
Amanda: Tim, we have a lot of questions coming in from the audience, so I want to make sure that we can address them all.
Tim: All right. Sure.
Amanda: EO says he's involved in using large amounts of data, about two gigabytes of growth per week, so he's moving to using Hadoop and Solr. His question, "What is the best way to combine them both. Hadoop is mainly to do analysis, thinking of Hive, and Solr is to act as the search engine." Do you have an answer?
Tim: Yeah. So what we've found is that almost every one of our customers has wanted to use some combination of search functionality, and in the case of Solr and Lucene and things like Elasticsearch, we actually really like Elasticsearch because of its robust API and some of the enhancements it has on top of Lucene, that you want both the sort of search functionality as well as your more traditional Hadoop type MapReduce analysis or Hive or something like that.
So, what we've always tried to do is have them both be part of the solution. And, there are ways of making those things work together which are prettier or less pretty. Unfortunately that is just the nature of the technology landscape right now.
But, one of the things we've done at Infochimps is we focus primarily on making sure Elasticsearch and Hadoop can work nicely together, and we have a bridge that kind of connects the two, an open source project called Wonderdog, which you can feel free to look at. It tries to make sure that with Hadoop you can read data from Elasticsearch and then write back to it, with Elasticsearch being able to access the kind of data that it needs and Hadoop being able to send that data there.
There is no real perfect solution yet. I know a couple vendors are experimenting with search engines on top of Hadoop but it's still very much in progress, I think, from a technology landscape perspective.
Amanda: Wonderful. A few more questions. So Dan asks, "What about the semantic data? Do you know some technologies that scale with massive data volume?"
Tim: Around semantic data, that kind of depends on what you mean by semantic data. There are definitely a lot of different technologies that can handle a lot of different types of data. If you're thinking about semantic data around sort of relationships and nodes and edges, you're probably talking about graphs and graph databases that may come into play, and while that's kind of a new market, that's definitely considered part of the big data realm and something we're really interested in.
One of the best hallmarks of a good big data solution is going to be being able to use the right database for the right job, and a lot of different databases are going to be better at different things. So, graph databases combined with NoSQL databases combined with SQL databases are going to provide a more full solution. And if you need to do some of that real-time processing in order to make semantic connections, there are technologies like Storm that enable some of that linking. So, I would say that there are definitely technologies out there to address that.
So, we have a response saying, specifically ontologies for databases. And that is a good question. I would actually prefer to get our chief science officer, Dhruv, to work more closely with you on that because he actually knows a lot about ontology. So, I will connect you with him and you can talk more about that.
Amanda: We have a question from Henry who asks, "Have you or any of your partners applied Agile development best practices to big data projects, and what is your take on its applicability?"
Tim: That is a great question. Yes, we definitely use Agile and we definitely prescribe to our partners that as much as possible they kind of follow in our footsteps. It really has to do with the approach that you take, and infrastructure as code is an important aspect of this so that you can iterate. Because so much of Agile from a software development standpoint is around programming, the more you can make big data projects feel like you're just doing a normal software development project, the more you can take these concepts and prescribe them more literally.
Yeah, it's basically that and being able to if you're thinking about accomplishing these things in smaller chunks so that you can get them done faster… Oh yeah, I remember the point I was trying to make was if you can think about working with big data as more of a work locally on your own code base and then push that to the cloud, or push that to the production system, you get more of a separation of concern between your production systems and more of your local development environment and that's something that we push a lot for us and our partners and that helps with that Agile system as well.
Amanda: And then we have another question here from Dave who asks, "Who in the organization should be driving decisions around scope, the business users or the techs?"
Tim: That's a great question. It's a little bit of both. Obviously the business objectives in your organization depending on the way your organization is set up is going to be in many cases directed by some of the business executives or the managers of various business units. So in that sort of sense it may be largely some of the business users that are pushing some of those things, but the techies and the IT folks are largely in charge of making sure those systems come to fruition, that the right strategies are going in place to make sure these projects are successful and maintainable, and that ongoing they can continue to hit these objectives consistently.
So really, it's about getting both involved, getting them involved early so you can think big but be specific and really execute on that phase one and really get that project scoped out correctly and keep that scope. Because obviously the more people you involve the more requirements you can gather and the more complicated it can get.
Amanda: So, we have one last question, unless any of you want to send final questions over through the chat functionality now. The last question I have here is, "Tim, is there any correlation between projected length of a project and failure rate?"
Tim: Any correlation between projected length of project and failure rate? I think that's a good question. I don't have any data to support that one way or another right now, but my hunch would be yes, there is a correlation, that the longer the timeline you try to place on a project there is going to be a larger failure rate because of the fact that it's easy to get into a situation where you're not placing those checkpoints at the right places along the way, where you have a really long POC period. "Oh, in six months we'll check and see how that project is doing."
In six months you may check up on it and it's nowhere near where it needs to be, in which case you have to make that decision, "Oh wow. This was a failed attempt. Let's just chalk it up to the 44% and just let it be and move on to the next," or you think how can we re-architect this project, right?
And by being more iterative, thinking about it in more 30 day chunks or 60 day chunks, you can start accomplishing things more quickly and actually see if you're actually being successful with your project. Fast is better in a lot of cases as long as you're maintaining that level of quality. So I would always say that if you can move more quickly and prove out your initial objectives sooner, you're going to be able to be more successful with your big data project.
Amanda: Well, thank you Tim. That concludes our webcast today. Thank you so much for your time and presentation. Audience members we will be sending this out. Just a reminder, we'll send you a link to the deck on SlideShare and we will send you a URL to the recorded video for your future reference. Thank you so much for joining us today, and we look forward to hosting you again.
Tim: Yeah. Thanks everyone. Cheers.