Agile Big Data Webinar

Getting to Insights Faster: A Framework for Agile Big Data

Webinar (58:43)

In this recorded webinar, Infochimps Director of Product Tim Gasper discusses how Infochimps tackles business problems for customers by deploying a comprehensive Big Data infrastructure in days, sometimes in just hours. Watch as Tim unlocks how Infochimps is now taking that same aggressive approach to deliver faster time to value by helping customers develop analytic applications with impeccable speed.



Caroline: Hello and welcome to our webcast. My name is Caroline Lim and I'm the Demand Gen Manager here at Infochimps. I'm excited about our webcast today: "Discussing an Agile Approach to Big Data and CSC App Development." Here at Infochimps, we're tackling business problems for customers by deploying a comprehensive big data infrastructure in days, sometimes in just hours, but before we jump into all of that, a little housekeeping.

First, we are recording this webcast. We'll send a link to the video via email early next week, so please watch your inbox, and second, ask questions. We want your participation in this webcast. Feel free to use the chat functionality in your GoToWebinar control panel. We'll be monitoring this channel and we'll answer your questions during the webcast. We will also have a Q&A session towards the end of the webcast.

Without further ado, let me introduce you to Infochimps Director of Product, Tim Gasper.

Tim: Hey everyone, and thanks, Caroline, for setting this up and for introducing me. I look forward today to talking about this topic, because agile big data is at the center of a lot of what we do over here at Infochimps and now that we're a part of CSC, that acquisition that happened in August, we're looking to combine a lot of Infochimps' fast-paced, open source approach with a lot of the professional services and industry expertise that CSC brings to bear. So this is definitely an exciting time for everyone around.

As part of this presentation, I want to hit a few different topics, and so I'm going to show this agenda to you all just so you can see how we're going to progress from one thing to the next. And I want to invite everyone to please insert your questions into the panel. We'll probably address most of them towards the end of the presentation, but I look forward to kind of quickly moving through my content and getting more to answering some of the things that you might be interested in, in particular.

So first, I want to talk about why we're talking about applications. I'm going to make the statement that it's all about the app, and we're going to talk about what a big data app is, what makes it unique from other types of applications, and why we should be focusing on this. Third, I'm going to compare traditional application development approaches with agile approaches, particularly in the context of big data analytics applications.

We'll talk about the enablers of agile big data, whether they're specific to Infochimps or general enough that you can apply them yourselves, regardless of who you're working with. Then finally we'll do a quick demonstration of one of the agile big data applications that we developed, so you can see what something might look like once you've created it, and especially how you can create something really quickly.

So I want to start off with an example that I feel like we can all relate to and that's the healthcare industry and whether you've been in a hospital or you've had family members or friends who have had close encounters with the healthcare industry, this is something that affects everyone and it's definitely an industry where big data can really come to bear. Now, I want to point to a specific example where this technology can be useful and that's with medical device data. There are many different types of healthcare organizations, particularly hospitals and care facilities where they have a number of different health devices, and being able to analyze a lot of that information is extremely helpful in being able to provide better care for everyone.

It's very easy to take this use case and try to take all these big data technologies and only think of them as technologies. You've probably heard over and over again about the three V's, the volume, the velocity, and the variety of data, the distributed nature of Hadoop and blah, blah, blah, blah, blah. We can get so caught up in talking about technology. But really it all comes down to what problem are you trying to solve?

And so with medical device data it's not about how fast is that data coming and how much of it is there and is it structured, is it unstructured. Sure those are check boxes we need to go through, but really what we're trying to do in that particular situation is improve health for those going through the healthcare system. We want to provide better information for doctors. We want to understand better what patients are experiencing and be able to, in real-time, kick off alerts or workflows that could save lives.

We want to be able to do things like predictive medicine and personalized medicine, so that way we can tailor the treatments to the specific needs of the individual. This is all about solving the problem of better healthcare and so I really want to point to this fact and it's going to be a theme throughout this presentation. Regardless of your industry, it's all about the application. It's not about the technology. What use case do you have? What problem are you solving?
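The real-time alerting Tim describes for medical device data can be sketched very minimally. This is a toy illustration, not any actual Infochimps or hospital system; the vital names and safe ranges here are invented for the example.

```python
# Safe ranges per vital sign -- hypothetical values for illustration only.
SAFE_RANGES = {"heart_rate": (40, 140), "spo2": (92, 100)}

def check_vitals(reading):
    """Return an alert string for each vital outside its safe range.

    In a real deployment, a rule like this would run inside a stream
    processor and kick off a workflow (page a nurse, log an event)
    for every alert it emits.
    """
    alerts = []
    for vital, (lo, hi) in SAFE_RANGES.items():
        value = reading.get(vital)
        if value is not None and not (lo <= value <= hi):
            alerts.append(f"{vital}={value} outside [{lo}, {hi}]")
    return alerts

print(check_vitals({"heart_rate": 35, "spo2": 97}))
# -> ['heart_rate=35 outside [40, 140]']
```

The point is that the alerting rule itself is trivial; the hard part the platform handles is evaluating it against every event as it arrives.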

So we'll start off with a poll question to kind of understand a little bit better about the audience.

Caroline: Okay Tim, I'm going to go ahead and launch our first poll. Where are you in terms of adopting a big data application? I'll give everyone a few moments to answer this question. Thank you very much.

Okay, we're going to take a few more minutes for people to go ahead and put their poll answers in. If everyone can please participate, we appreciate it. Thank you.

Okay, I'm going to go ahead and close the poll. Here are the results.

Tim: All right, cool. So it looks like most of you in terms of adopting big data applications are in the exploratory kind of realm, so that's great. Hopefully, what we talk about today can help in evaluating the right approach to these technologies and focusing on the application and the use case and a few others are kind of scattered along the axis of being more advanced or less advanced. All right, thank you everyone for your engagement there.

So at this point, I really want to ask the question: what is a big data application, particularly an analytics big data application? And I want to point to two main ingredients. The focus on the business problem and the use case is very key. It's what allows us to define the outcomes and the success criteria of the particular project that we're putting in place for our IT group or our business unit.

Then there's the technology dimension, and rather than think of it as things like Hadoop and stream processing and databases, I really want to think of it on a time scale. There are certain actions that need to take place in real-time. There are certain actions which can take place over the course of a much longer period of time. That's your batch analytics and then finally, and in some cases most importantly, there is the actual asking of questions and interaction with the information on an interactive time scale and that's the ad hoc analytics.

So you'll see that there's a theme as we go through this. We really think of the application as pairing that business problem with a variety of time scales in order to result in that final light bulb, which is that impactful, business-affecting analytic application: that cost reducer, that revenue producer, that risk mitigator, that brand new impact in terms of new product lines, or maybe an improvement to an existing product. These things must combine to produce that application.

I want to highlight this particular graphic because I really enjoy it: a big data analysis actually went into producing it, by our friends over at Park. It shows you a cluster graph of different areas in the industry and different types of use cases and problems, where these different time scales of technology and the unique business problems are paired together to produce apps that really utilize big data technologies but solve business problems.

So you see me calling out a few just to jog our brainstorming in terms of how this applies to various industries. Maybe you're in financial services and you want to figure out how to do broader trade options and futures pricing analysis. Maybe you're a retail organization and you're thinking about how to better understand what's causing your customers to leave and how you can increase customer lifetime value.

Maybe you're thinking about the inventory levels that your supply chain organization is working with and how can we minimize those costs to our organization. Or maybe you're an energy company that's trying to provide value-adds for your customer, so your customers can better regulate and monitor their use of energy. All these things are about the business problem being solved and technology playing a supporting role to drive that value.

So in the second poll question we'd like to drive a little bit more into how you're intending to use big data and if you've already identified that use case.

Caroline: Sure, Tim. I will launch the second poll. Have you identified your first big data application use case? I'll give everyone a few moments to answer this poll question.

Okay it looks like we have almost everyone who voted. Let's go ahead and close the poll and here are the results.

Tim: Okay, cool. So it looks like we're pretty split across the board here. We've got about an even number of folks who have identified their first big data use case as those who haven't, and a slightly lower percentage who don't know yet. Well, hopefully as we go through these enablers of agile big data development, we'll be able to identify those use cases more readily.

So to connect all this together now, the use cases, the business problems, the different time scales in terms of technologies, and getting the sort of insights and analysis that you need: it really is all about the application, and we divide an application into six main categories of items. This is sort of our application framework. It's the reporting and the insights that you're going to be consuming in order to make better business decisions, or in order to understand your customers better, or to provide a new product that you've never been able to before.

It's the analytics and algorithms that power that reporting interface, so that you can derive new metrics, be predictive, and do the kinds of statistical analysis that enable you to really be more intelligent about your market or about your business. There's a data model aspect, which is very different in big data from a more traditional approach, and we'll talk about that soon, where you're describing what the data looks like, understanding the different types of data, and applying some structure to that.

There's the data pipeline of bringing data in and being able to analyze it, doing some of that processing upfront as the data comes in, in order to make it fit your model. There are data integration connectors for how you even get the data in the first place, and then the data sources themselves, whether they're structured or unstructured. Where are they coming from? How often can we get them?

And so I want to look at a specific application and apply this framework to it. This is a Tableau dashboard that is part of their repertoire of demo applications, and this one in particular is an analysis of oil wells: their output by year and by technology across the mid-African region. What you get here is a really great, rich graphic that you can mouse over and interact with, which allows you to see how these wells have essentially increased production, whether they taper off over time, and some of the derivative metrics that come with this output analysis.

But this really is just the tip of the iceberg, and a lot of times, when we think of applications, we fall into the trap of not fully appreciating the full set of things that are needed to power an application. This is the reporting and insights layer, but really there are five other areas that we have to think about. What are the analytics and algorithms that power it? The data model: does it need to be more strict, or can it be a little bit looser? The ETL, the data integration connectors, the data sources.

So here you see a diagram that tries to take that application and apply this framework to it, so we can really understand what we're building here and how it fits these different categories. Well, there are those data sources: those well information sources that are being loaded up onto their disk platform, which they can then pull that information from. In this particular use case, that information comes in batch; it's a daily dump.

So what we're going to do is take that information and pull it straight into a Hadoop cluster, so the ETL/data pipeline in this case is actually extremely simple. Then when it's in that Hadoop cluster, we're doing a bunch of analysis, like aggregate rollups, some metadata analysis, and a little bit of predictive classification of information. Then on top of that cluster, using real-time SQL on Hadoop, you get this Tableau interface. So the whole iceberg here is visualized, in a way that lets us really think about how everything pieces together.
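The aggregate rollups mentioned here are easy to picture in miniature. The sketch below groups hypothetical well records by year and technology in plain Python; on a real cluster the same shape of computation would run as a Hadoop job over the full daily dump. The field names (`year`, `technology`, `output_bbl`) are invented for the example, not the dashboard's actual schema.

```python
from collections import defaultdict

def rollup_output(records):
    """Sum well output per (year, technology) pair -- the kind of
    aggregate rollup a Hadoop job would compute across the daily dump."""
    totals = defaultdict(float)
    for rec in records:
        totals[(rec["year"], rec["technology"])] += rec["output_bbl"]
    return dict(totals)

# Hypothetical rows from one daily batch dump
daily_dump = [
    {"year": 2012, "technology": "horizontal", "output_bbl": 120.0},
    {"year": 2012, "technology": "horizontal", "output_bbl": 80.0},
    {"year": 2012, "technology": "vertical", "output_bbl": 40.0},
]
print(rollup_output(daily_dump))
# -> {(2012, 'horizontal'): 200.0, (2012, 'vertical'): 40.0}
```

The rollup table is exactly what a BI front end like Tableau then reads to draw the per-year, per-technology chart.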

And I apologize a little bit for the flip-flopping here. We've got the green diagram and the diagram underneath that are a little bit offset, but I hope you can follow this and see how they map together.

Now, that's just one example of an application, one in particular that has a little bit less of an ETL and data pipelining task. But if we look at all the other different kinds of industries, you get much more variety, and you see where that time scale really comes into play. For example, if we're looking at 360-degree customer experience management, there's a lot that you can do with data in real-time in terms of how people are using your support portal on the web and the web logs that come from that, text message support or phone support, and being able to analyze all of that as quickly as possible as the information becomes available.

A really great real-time use case is ad network analysis, where you've got people mousing over and interacting with web advertisements. There is some analysis that you want to do more historically, but a lot of that information you want to analyze as it happens. You want to blacklist scammers. You want to start counting how many impressions happen per minute and how many happen per hour. You want to be able to take an IP address and apply a geo-location to it.
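The per-minute impression counting and scammer blacklisting just described reduce to a small amount of logic. This sketch is a toy stand-in for what a stream processor such as Storm would run continuously over the live event feed; the IP addresses and event fields are invented for the example.

```python
from collections import Counter

BLACKLIST = {"203.0.113.9"}  # hypothetical known-scammer IPs

def impressions_per_minute(events):
    """Tally ad impressions into one-minute buckets, dropping
    blacklisted IPs as the events stream past."""
    counts = Counter()
    for ev in events:
        if ev["ip"] in BLACKLIST:
            continue  # filter scammer traffic in real time
        bucket = ev["ts"] - (ev["ts"] % 60)  # truncate epoch seconds to the minute
        counts[bucket] += 1
    return counts

events = [
    {"ts": 1000, "ip": "198.51.100.1"},
    {"ts": 1030, "ip": "198.51.100.2"},
    {"ts": 1030, "ip": "203.0.113.9"},  # blacklisted, ignored
    {"ts": 1075, "ip": "198.51.100.1"},
]
print(dict(impressions_per_minute(events)))
# -> {960: 1, 1020: 2}
```

In a real stream topology the counter would be emitted or flushed per window rather than returned in bulk, but the bucketing and filtering are the same.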

Social media is another great example. As those tweets or those Facebook posts are being created, can we understand who the influencers are and what the sentiments of these people are? If we're about to experience a PR fiasco, we want to know that as soon as possible. So you can see how business problems and time scales combine here, and they use this six-part framework to really show off what an application is.

Now that we've talked more about what an application is, I want to transition to talking about why there is a traditional approach and an agile approach, and why we really need to focus on a more agile approach. In the traditional way these applications are developed, and I use data warehousing as an example, there's an upfront period where we're understanding the business, we're identifying the data sources, and we're understanding how that information needs to come together. Then we spend a significant amount of time thinking about the data model, both the logical and the physical side: how do we have to force the data events to look different, and what are the tables in that data store that we have to start creating and stitching together in the right ways in order to make this system make sense?

Then we have to set up the system. We spend a significant amount of time developing the ETL pipeline to bring that data in, and I'm sure most of you know what ETL is, but it stands for extract, transform, and load, and there are massive industries based on this task alone. How do I manage all this data coming in? How do I shape it and force it to match the structure that I've identified, so that we can finally develop applications and analytics on top of it?

This whole process typically takes 12 to 24 months to reach production. You have a question today and you can't answer it for 12 to 24 months. Unacceptable. And that's why we're really talking about a new approach, and big data is an enabler of this approach. So there is a new hope. You still have to do that business discovery and that information discovery, and the faster you can do that, the better. But you can stage that system more quickly, because these tools are open source and you can pull them together quickly.

You don't have to worry about the data model as much ahead of time, because these systems support any format of data. You've probably heard "big data, unstructured data, unstructured data" on many occasions, but more importantly it's any-structure data. So what you do is literally take that data and dump it into the system, whether that's a NoSQL store or a Hadoop cluster, and then thinking about the data model and the schema becomes an application matter.

So even though that application development cycle might be a little bit longer, because you're thinking about the schema as part of that cycle, every other step along the way has been significantly shortened or eliminated, allowing you to ask a question and get an answer in just a few months. And the great thing about this is that you can move iteratively, so the next application can be developed, and the next, and the next, much faster than the first using this framework that we'll put forward.

And our customers have taken advantage of this approach. These are a few of our customers, and on the right-hand side you see the timeline over which each application was developed. Those are essentially commits to the Git repository, and in all these cases we've been able to work with just one or two developers, with anywhere from medium to no big data experience, and develop a full application in production in just three to six months. And I want to look at one particular use case.

A company that we work with, HGST, is a Western Digital company and one of the world's largest makers of storage devices. One of the biggest challenges they face as a company is making sure they keep a high yield on the storage devices they're creating. A lot of times they can detect when there are issues as things are being produced. But when those devices finally go into production, say a customer of theirs buys 1,000 of these storage devices, several may end up failing at the staging step.

So they need to be able to do a postmortem and go back and understand what the problem was, what happened. And that process can take weeks, if not months sometimes. So HGST worked with us at Infochimps and was able, within just two months, to take all the batch data coming off of their manufacturing line, their testing rigs, and their other production systems, and put it all in a centralized repository with Hadoop.

They were able to use Hive and Hue to create a SQL interface to that information, and now they have democratized access for hundreds of engineers and analysts within HGST to ask questions of this data and get answers in minutes about what was wrong.

So: greatly diminished operational burden, amazingly fast project delivery in just two months, and now they're iterating and iterating with us and our partners to address use cases two, three, four, five, six, seven, and on. They've identified 60 different use case applications that they can address, so it's all about the app and it's all about moving fast.

So let's do our third and last poll question here before then we move into our final framework for enabling agile big data application development. Caroline?

Caroline: Sure Tim, I'm going to launch the next poll question, and this is our last poll question of the webinar. What is your biggest challenge to realizing the value of a big data application? Please take a few moments and send in your poll.

Okay we're going to go ahead and close the poll and here are the results.

Tim: Cool, and by the way, we appreciate everyone engaging in this; it's interesting for us as well and worth sharing. So it looks like the vast majority of folks are experiencing challenges around the talent gap and expertise around big data, whether that's on the application development side or the operations side, and that's kind of what we expected. That's definitely what we see with our customer base and a lot of what we at Infochimps end up helping with through our managed platform.

Cost of capital is second, and technology risk is third. I'm surprised to see that there aren't more failed prior projects. That means either we're still very early (I know a lot of you said you're in the exploratory phase), or it means, unfortunately, what we see with a lot of big data implementations: the bar is set pretty low, right? It's just a sandbox environment. There's no app in mind. It's just a technology issue that people are addressing, and sure, you would achieve success, but do you make an impact on the business? Probably not. So thank you for your engagement there.

So now let's move into the next phase of this presentation, which is talking about the enablers of agile big data. We've identified that apps are important, and we've identified that there's a traditional approach versus a faster approach. Now, we should always take this faster approach, and here's how you do it. These five points are the most important ones.

Infochimps is a provider of managed infrastructure. Now, commanding these technologies is crucial to being able to focus on the application, because it's so easy for technology concerns to consume too much of our attention, and that can be extremely distracting. So we'll talk about, whether you're using managed infrastructure or managing the infrastructure yourself, how to differentiate between the underlying technology and the interfaces.

There's the community technology and what it enables. There's a customer engagement framework that we use and want to provide to you, and I'll talk about a white paper on our website that you can access, which covers how to develop a big data application; a lot of it has to do with identifying the use case properly and pairing the technology to the use case. There's an agile, iterative app development cycle that we have to engage in, and then finally I want to talk a little bit about our application reference design framework and how it's relevant to our customers for kick-starting application development.

So the first point is a managed platform, and I'll talk about this in the context of the Infochimps cloud. But regardless of whether you're working with Infochimps or interested in working with us, you can think of this as your own IT group, or whoever is going to be managing those underlying systems: we clearly see a differentiating factor between the interfaces of the system and the underlying system itself.

So our product, the Infochimps cloud, has three main cloud services that address the three areas of the time scale for big data. There's Cloud Streams for real-time, in-memory data analysis. There's Cloud Hadoop for the batch or historical analysis, the large-scale analysis. And then there's Cloud Queries for more of that interactive analysis; these are the databases, the SQL-on-Hadoop query interfaces. These three things together provide a comprehensive set of technologies and tools to address all the use cases your enterprise needs for big data.

And so if you think about the reference architectures that you're developing in-house, they probably match this overall way of looking at the world: the offline, the near-line, and the online. The interfaces sit on the data collection side and on the actual analytics and application development side. And so at Infochimps we've created standard integration connectors, which work with a variety of different data sources, whether they're more real-time in nature or batch in nature: HTTP, logs, data partners like Gnip and DataSift for social data, batch uploading, and custom connectors that may fit more specific needs.

Then on the other side you have the application development. You have the ability to process data with scripting tools, MapReduce, and SQL-on-Hadoop tools. A lot of these come natively with the open source projects, and Infochimps has also created an open source project called Wukong, which we'll mention in a little bit. We have a command center GUI that allows us to do operational management, and the application developer can see metrics of what's happening in the system. There are all the native APIs that you use to develop applications: to access the information if you're using something like Cloud Queries, or to develop analytics with things like Cloud Streams and Cloud Hadoop.

Then finally at Infochimps, we have a unique component we call the Cloud API, which orchestrates across these three different cloud services. Now, the important part of a managed platform is that centerpiece, that grey box: we want it to be as seamless as possible for the application developer, so that they get to focus on the right-hand side. That's extremely key for agile big data development, if we want to focus on what question I want to answer and how I can answer it, not worry about Java heap sizes or how to manage the block size on my HDFS clusters. I'm sure some of you are a little more technical and can certainly answer these questions for yourselves, but it can be quite confusing and quite distracting.

So now I also want to mention the community tools themselves. These technologies are all open source, developed by the community, and there's so much documentation being put out there by the community that it's really empowering and allows us to move extremely quickly. We're not locked into proprietary models where you have to depend on tons and tons of professional services. These tools were designed with the open source application developer in mind.

So I'll just point out a few different things here and there. On the Hadoop side, there are a lot of different interfaces that people use to interact with Hadoop and the different technologies there. I'll point specifically to streaming MapReduce and to SQL on Hadoop, Pig, and Hive as the really interesting ones that allow you to do agile big data development. Because with Java you're doing these compilation lifecycles where you're developing, then you're compiling, then you're seeing if it works.

But the great thing about Hadoop, and what we've tried to encourage our customers to do a lot, is to use streaming MapReduce, which allows you to just use standard in and standard out, and essentially use any language that you'd like: Python, that's fine; Ruby, that's fine. You can develop in those languages and have them interact with the Hadoop data layer.
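As a concrete illustration of streaming MapReduce, here is a word count written as plain Python functions over standard-in/standard-out style lines. This is a generic sketch of the Hadoop Streaming contract (the mapper emits tab-separated key-value lines; the reducer receives them grouped by key), not code from Infochimps' own stack.

```python
def mapper(lines):
    """Map step: emit one 'word<TAB>1' line per word, the
    tab-separated format Hadoop Streaming expects on stdout."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_pairs):
    """Reduce step: sum counts per key. Hadoop delivers the mapper
    output to the reducer already sorted (grouped) by key."""
    current, total = None, 0
    for pair in sorted_pairs:
        word, count = pair.split("\t")
        if word != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = word, 0
        total += int(count)
    if current is not None:
        yield f"{current}\t{total}"

# Simulate the map -> shuffle/sort -> reduce flow locally:
print(list(reducer(sorted(mapper(["to be or not to be"])))))
# -> ['be\t2', 'not\t1', 'or\t1', 'to\t2']
```

On a real cluster you would point the hadoop-streaming jar at two small scripts that read stdin and write stdout; the logic stays exactly this, which is what makes the iterate-in-any-language cycle so fast.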

SQL on Hadoop is a really great area to focus on, because you've got folks in your enterprise who know how to write SQL queries. Let's enable them with things like Impala from Cloudera, Hive from the open source community across all distributions, and Pig, which is a really great tool for doing a lot of the ETL and data pipelining: SQL-inspired, but an even more powerful language that's a little more Hadoop-specific.

Then you've got great NoSQL databases that you can take advantage of. All of these are document stores or key-value stores that interact with Hadoop. And the power of using them is, again, that ability to not worry about the structure of the data: structured, semi-structured, unstructured. You want to be able to think of these as document stores, as very simple ways to store the data, so that you're not as worried about that upfront logical and physical data model.
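The "don't model it upfront" idea is easy to demonstrate: in a document store, records with completely different shapes coexist under keys, and the schema becomes the reading application's concern. This toy class is a sketch of the concept only, not the API of any particular NoSQL database.

```python
import json

class DocumentStore:
    """Minimal key/document store: each value is an arbitrary JSON
    document, so no table schema is declared ahead of time."""
    def __init__(self):
        self._docs = {}

    def put(self, key, doc):
        self._docs[key] = json.dumps(doc)  # accepts any JSON-shaped record

    def get(self, key):
        return json.loads(self._docs[key])

store = DocumentStore()
# Two differently shaped records live side by side:
store.put("device:42", {"model": "X100", "readings": [3.1, 2.9]})
store.put("note:7", {"text": "sensor recalibrated"})
print(store.get("device:42")["readings"])
# -> [3.1, 2.9]
```

A real document store adds distribution, indexing, and durability on top, but the schema-on-read contract is the same: dump the data in now, decide what it means in the application later.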

Then finally there's the stream processing aspect, which is more real-time, and that's what allows you to do that in-memory processing and data queuing. Those technologies, Storm and Kafka, are kind of newer, right? But all of these, because they're open source, you can access them: you just download them, get started today, and iterate on them. It's extremely powerful to be able to use these.

Then just to tie this all together, there are a lot of interfaces that sit on top of this. I mentioned the Hadoop interfaces already, but there are statistical tools that you want to work with this stack, and business intelligence and data visualization tools that you want to work with it. This stack is flexible, whether you're using Infochimps or just the open source pieces that you're grabbing off the Internet.

They can use these SQL-on-Hadoop interfaces to integrate with things like Tableau, QlikView, or Cognos. There are things like RMR, which allows R to work with the Hadoop stack. So by allowing all these things to work together, what you're really doing is enabling an agile cycle, because you're using the tools that you feel comfortable with, the tools that allow you to hit all six of those different components. So it really all comes down again to what app you need in order to solve that business problem; let's develop that application using all the tools at our disposal. Let's let technology serve the application.

We've created a lot of unique toolsets for our customers, and I'll point to two in particular that I think are most relevant to agile big data development. One of them is the scripting tools that we've developed, along with our deploy pack orchestration. It allows you, across all three of these services in terms of time scale, to really quickly and in scripting fashion say: okay, I want to take this data source, I want to do these things to it, and I want it to be either real-time or batch; and then to put that in a folder structure, which I'll show in the demonstration part of this presentation, and push it.

A lot of these tools are available open source. For example, one is called Wukong. You can just grab these off of the Infochimps GitHub, which I recommend that you check out. This deploy pack orchestration framework is a really key part of how our customers develop applications: you have a code repository where you put all that application logic and those analytics, and regardless of whether that's a Java JAR or more of a scripting approach, with the push of a button you push it into your actual production Hadoop environment, right? So we're talking local development, then pushing that to your actual production cloud environment, and doing that extremely quickly.

The second tool I want to point to is called Ironfan, which is what we use at Infochimps to deploy comprehensive big data solutions, with all three of these cloud services and time scales, onto a variety of deployment environments really quickly, in just hours, and in some cases even faster than that. When you work with Infochimps, we use that tool to deploy in the environment of your choice. It's also an open source tool, which you can find on our GitHub and interact with yourself.

This really focuses on the staging and setup part of the traditional lifecycle: if you can deploy that system quickly and proliferate your changes quickly, that's a big deal for agile big data application development. You spend less time worrying about the infrastructure. This toolset is part of the class of tools you'd call "infrastructure as code." You've probably heard of Chef and Puppet; Ironfan actually sits on top of Chef and extends it, making it particularly powerful for big data use cases. So I definitely think you should check that out. Regardless of whether you use Ironfan specifically, an infrastructure as code tool, together with a scripting and orchestration framework for applications, is going to be key to moving quickly on both the infrastructure and the application side of things.
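As a toy illustration of the infrastructure-as-code idea, here is a hypothetical Python sketch: the cluster is declared as plain data, and a function expands that declaration into concrete nodes to provision. Ironfan actually expresses this on top of Chef in Ruby; the structure, facet names, and flavors below are invented for illustration only.

```python
# Declarative cluster spec: what the cluster should look like, as data.
CLUSTER = {
    "name": "analytics",
    "facets": {
        "namenode": {"instances": 1, "flavor": "m1.large"},
        "datanode": {"instances": 3, "flavor": "m1.xlarge"},
    },
}

def expand(cluster):
    """Turn the declarative spec into one entry per machine to provision."""
    nodes = []
    for facet, spec in cluster["facets"].items():
        for i in range(spec["instances"]):
            nodes.append({
                "name": f"{cluster['name']}-{facet}-{i}",
                "flavor": spec["flavor"],
            })
    return nodes

nodes = expand(CLUSTER)  # 4 node definitions: 1 namenode + 3 datanodes
```

Because the cluster is just data, changing it means editing the spec and re-running the expansion, which is what makes rapid, repeatable deployment possible.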

At this point I also want to talk about a customer engagement framework that we use for our customers, and I won't dive into this too much. If you have some questions, we can follow up afterwards, so feel free to reach out to us and we'll talk more about your use cases. But what this does, essentially, is help identify the key ingredients of a successful project, right? It's identifying what political things we might run into. Who are the stakeholders? What are the different data sources that I'm trying to tap into here? What is the use case and what are the success criteria for this particular project? Then, once we have those pieces, it's defining a data flow and an architecture that serves those needs, and really focusing on just one use case first, because it's easy to get distracted with a boil-the-ocean approach.

The idea is you want to very quickly get that production system out, as we mentioned with our agile framework, and then iterate from there forward. Achieve your first use case, achieve that first business problem and then move to the next use case, and the next use case, and the next use case, and iterate and iterate and improve and improve. Fail fast and succeed fast.

I won't go into this anymore, but I recommend that if you want more information on this customer engagement framework, there is a free white paper available on our website in the resources section called "How to do a Big Data Project" and it'll go into more depth about how to apply this framework to your own company in terms of the use case you're trying to solve.

Now, we've talked a lot about how you set yourself up for success. One of the important pieces towards the end of this presentation is literally the application development cycle. You can't be agile unless you have a tight develop, test, and deploy loop. So whether you're setting up a local Hadoop cluster or a local Storm cluster, working locally that way and then transitioning to your production cloud environment, that's one approach to this. Another way is to take a more scripting-based approach. Then finally, if you're working with Infochimps and utilizing some of our toolsets like the deploy pack, you're actually using that application repository, putting your logic in there, putting your analytics in there, and going through this cycle potentially many, many times per day.

So as you're thinking about how you can take a more agile approach to big data application development, consider this cycle; consider what different technologies or processes you have in place that'll allow you to move through this cycle extremely quickly. And if you need to do things like create separate, protected development environments apart from your production environment, because you have SLA concerns or you've got mission-critical applications, then do that. You need that separate system, and you need this application development cycle to move quickly, or else you simply won't be able to move forward.

And then to wrap this up, we've taken this agile application methodology and applied it in the form of these application reference designs. It's something that we announced at Strata earlier and are now really trying to help proliferate. What it does is take that six-component application framework and say: okay, we can actually create an initial application that serves these six pieces for a given use case. So it's driven around a particular use case, and it is a code repository that starts off with some sample datasets, some sample data flows, some analytics that you'll want to do in either real-time or batch, and most importantly, a visualization, so that you can see what kind of insight you'd be getting from this, and then take that application framework and apply it to your own problems and your own application needs.

Essentially, this kick starts the whole process, because you already have a framework in place, these building blocks to work from and it's about, "Okay, now what are my unique data sources? What are my unique analytics that I need to develop? Can I piggyback off of some of these existing analytics, and can I use this existing visualization for my needs? How do I want to modify it?” All in the context of what that big data problem is that you're trying to solve, what that use case is and doing so in an agile fashion.

These four use cases that I identified earlier happen to match exactly with the application reference designs that we've developed so far. We have one for predictive manufacturing and manufacturing and energy, which applies to that oil use case. There's an ad publisher campaign analytics app reference design, 360-degree customer experience management, and then social media monitoring and analytics. I invite you to hop onto our website. We've got a bunch of free datasheets on there that go into more detail about these app reference designs, and they illustrate the first half of what an application reference design is, which is a framework and our approach to handling a specific use case.

Then I invite you to reach out to us; we'd be excited to show you an actual demonstration of these different reference designs and show off what it's like to actually develop applications on the Infochimps managed cloud platform.

I want to specifically point out the social media application reference design, because I want to show you a really brief, but interesting demonstration of what this application reference design actually looks like. Let me move over to -- let me start off actually with this here. So first I just want to quickly show you the overall structure of an application within the application reference design structure.

So this here is our social media starter kit. It is the code repository where all the logic and analytics for this live, and you see there are some sample data sources in here. It's got some configuration so that our orchestration framework can grab this and essentially lift it and drop it into your cloud platform when you're ready to actually push it. And then here's our application logic. We've got various flows that we're using: a shaper, a Twitter analysis flow. We've got some jobs that do things in a more historical fashion on Hadoop, our data models, and then we've got our actual processors, which are doing things like applying a classifier, analyzing sentiment, putting a schema on the data, and applying geo-location.

This here is a dashboard that's developed in Kibana, which is an open source data visualization tool that anybody can grab off the Internet. It's really great. It sits on top of Elasticsearch, which is a full-text search engine that we use as a NoSQL store for lightweight, interactive analytical analysis. Kibana sits on top of that and is able to interact with that semi-structured data to do things like interesting time series analysis; real-time display of actual sentiment, by taking the application code we've put in place and displaying it in a very simple fashion with little dropdown widgets; who is influential, in terms of where they're coming from and in what quantities; as well as things like how many retweets are happening, and that sort of thing.
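The kind of aggregation such a dashboard performs over Elasticsearch, bucketing documents by time interval and counting sentiment, can be sketched in a few lines of Python (the document shape and field names here are hypothetical; Kibana does this via Elasticsearch aggregations, not application code):

```python
from collections import Counter
from datetime import datetime

# Hypothetical indexed documents: a timestamp and a sentiment label each.
docs = [
    {"ts": "2013-11-01T10:02:00", "sentiment": "positive"},
    {"ts": "2013-11-01T10:40:00", "sentiment": "negative"},
    {"ts": "2013-11-01T11:15:00", "sentiment": "positive"},
]

def sentiment_by_hour(docs):
    """Bucket documents into hourly intervals, counting sentiment labels."""
    buckets = {}
    for d in docs:
        hour = datetime.fromisoformat(d["ts"]).strftime("%Y-%m-%dT%H:00")
        buckets.setdefault(hour, Counter())[d["sentiment"]] += 1
    return buckets

series = sentiment_by_hour(docs)
# e.g. the 10:00 bucket holds one positive and one negative document
```

In the real stack this is a date-histogram query against the index, so the dashboard stays a thin layer over the semi-structured data.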

So this particular use case was developed about a month ago, and it actually only took us about a week to develop this entire application with all six of those components: the data sources, the ETL, the data integration, the analytics and algorithms, and the actual visualization. That's because we took advantage of the framework of the application code repository, because we used a really lightweight visualization tool, and because we focused mostly on scripting rather than on taking Java and doing a bunch of compiling. So in about a week, with just one person, we were able to develop a full application. This is a starter kit that kicks off anybody interested in doing social media analysis; you can find more information online or talk to us more about it afterwards.

I definitely invite folks to think about what could be possible if you were able to move this quickly. Then let's actually go and do it. That's the right approach to big data. Big data is not about moving slower because there's more data, but moving faster because it's a better approach to building these analytic applications.

So to close off this whole presentation, I want to summarize the Big Data benefits we talked about here. The enablers are around unstructured and semi-structured data: not worrying about the logical data model and physical data model up front, spending all that time essentially handcuffing your application, but rather making that an application concern. It's enabled by the full time scale of analysis: real-time, batch, and ad hoc. It's about using scripting tools that allow you to iterate faster and interact with the data directly, instead of having longer, more extended debug cycles.

It's about schema on read, so that your app drives the data model and the data structures, not the upfront ETL and all the work you have to do up front. It's being able to move very quickly from your local environment to a cloud environment: from small, sample datasets to the full, larger datasets, or, if you're talking about real-time data streams, from a smaller trickle to the full, fast data flow. And it's about allowing all these tools to talk to each other.
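Schema on read can be shown in miniature with Python (the field names are invented): raw records are stored as-is, and each application projects only the fields it needs at query time, instead of forcing one physical schema up front.

```python
import json

# Raw semi-structured records, stored exactly as they arrived.
raw = [
    '{"user": "a", "text": "hi", "geo": {"lat": 30.3, "lon": -97.7}}',
    '{"user": "b", "text": "yo"}',  # no geo field; still fine
]

def read_with_schema(lines, fields):
    """Project each record onto the requested fields, tolerating absences."""
    for line in lines:
        rec = json.loads(line)
        yield {f: rec.get(f) for f in fields}

# One app reads (user, text); another could read (user, geo) from the
# very same stored data, each imposing its own schema at read time.
view = list(read_with_schema(raw, ["user", "text"]))
```

The app's needs define the structure; the storage layer never had to commit to one.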

So the common thread is more data, faster data, more varied data, right? We've talked about the three V's. You know about them, but let's actually tease that out. Really, it's an overlapping Venn diagram here. There are new use cases you've never been able to address before, and new analytical techniques and approaches that are now suddenly available to you, because you have access to the full set of tools: small data, big data, BI and data warehousing, all together. And then there's this very important time dimension: faster time to business value, faster iteration of application development, increased flexibility in how you develop those applications. It's all about the application, and it's all about being more agile.

So as you think about your own company, what is the first big data app, or the next big data app, that you're going to be working on? That is what we need to be focused on. What business problem are we solving? Is there a BI tool that you need? Is there a packaged application aspect to this? Is there a statistical tool that you need to enable on all this data? Is there something more custom? What do we need to do in order to enable that particular use case?

So, thank you. I appreciate your time. That kind of outlines the overall agile approach in terms of the five different steps, as well as why agile big data applications are important and why the app is so important.

So at this point I want to really open it up to questions and just engage in a dialogue here about what you're doing, what you're interested in, and what sorts of issues I can help you resolve.

Caroline: All right, Tim. I have a few questions lined up for you already. Thank you everyone for participating, and keep the questions coming. We really want your engagement. The first question is from Onica. “How much faster is the Chimp approach compared to traditional?”

Tim: Thanks, Onica. Essentially I would say that it depends on the complexity of the application, but as long as we stick to these agile development principles, we're trying to make a really fast cycle of application development. We want to make that local-to-cloud pattern an extremely fast one which you can engage with. We're using open source tools. We're focusing mostly on scripting, as opposed to using Java, compiling, and going through that longer debug cycle. And it depends a little bit on who's doing the application development, right?

If you as a customer are looking to do more of that application development, there's going to be a little more training and coaching that we'd like to do, versus having Infochimps and Infochimps' partners develop that first application. So it can be anywhere from a month up to, at the longest we've seen, about six months to develop that first application. And then typically, once you have that foundation in place, and folks have been enabled to actually develop on this big data platform, and they're going through that loop, cycling, moving quickly, and understanding how to develop it, it's as though you're developing a website.

It's supposed to be that fast in terms of developing your analytical applications, and so the next use case, and the use case after that, should be coming around in just a month at a time. The idea is that you want to iterate, right? So you put a full flow into production, but you may iterate on and enhance that flow and push it to production over time, so it's really an iterative approach, and it's about trying to get to the value. We typically say 90 days is the goal, if not sooner.

Caroline: Thank you, Tim. Okay, the next question is from Peter. “Do you help the customer with setting up a business case based on the use case so it's possible to reach for a specific ROI?”

Tim: Yeah, absolutely. So regardless of whether it's a pilot project or a more extensive subscription to the Infochimps cloud, we like to spend a lot of time upfront developing that business use case. Some customers or partners come to us with their use cases more fully fleshed out already, and that's great; we can move even faster in that case. But sometimes a lot of help is needed to really clearly say: what is the success criteria here? What's the ROI we're going for? Is this a cost-cutting project? Is this a revenue-increasing project? If so, what are we trying to do, and by how much?

So if some of that isn't developed upfront, we really say it's necessary to develop it, and so we'll engage with our customers, partly just as part of the sales process, in order to work out what that use case is, right? We're not here to push a bunch of consulting hours of, "Oh, let's think for two years about the right way to do a big data roadmap." Let's really quickly pick that first use case, drive value, prove that big data can be successful in the organization, and then move on to the second, third, and fourth use cases.

Caroline: Okay, so our next question comes from Hakeem. “What are the key indicators you use to help clients decide on the tools that provide the most quality in results?”

Tim: “What are the key indicators you use to help clients decide on the tools that provide the most quality in results?” So it sounds like what we're talking about here is evaluating how we decide which technologies to use in order to ensure the right results. There are a lot of factors that go into that. Part of them are organizational factors, right? What are the skill sets in the customer, and how do we want to engage with that? So if you're talking about an analyst department with more SQL experience and things like that, that's going to play very heavily into choosing more SQL-on-Hadoop type tools.

If what you really need to consume in the end is a graph or a dashboard of some sort, right, you want to have the fastest time to dashboard, and you're not going to get that if your folks are all focusing on developing Java MapReduce jobs and things like that, right? So choosing tools has a lot to do with the talent set, what you're trying to achieve, and what the use case is. It's really a very holistic set of criteria that has to come into play in choosing the right technologies, including which time scale, right? A lot of times folks think of Hadoop first when it comes to big data, but I'd say 50% or more of use cases are actually better addressed by a real-time streaming framework like Storm and Kafka, as provided by our cloud streams. So, all things to consider.

Caroline: Okay, so the next question is also from Peter. Peter, thank you so much for engaging with us. “Is it possible to integrate social media analysis with customer experience analysis, or are they two separate use cases?”

Tim: That's a really good question, and I would say that we've identified them separately just because what we're trying to do is essentially find these horizontal use cases, develop these app reference designs around them, and have them apply to various vertical focus areas. So something like social media analysis is relevant to a lot of industries, and customer experience is relevant to a lot of industries. But they are all using the exact same agile framework and application framework to define what the app is and how we can develop it very quickly.

So essentially what we would do is take those customer experience flows and the social media flows, bring them together into the same code repository, and allow that data to all work together. There might be cases where you want, say, a tweet that comes in mentioning your brand with negative sentiment to kick off a support workflow or something like that, and that's something you could do in this framework. Really, even though we define these use cases separately for the sake of discussing them, applications can be as varied and integrated as you'd like.
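That kind of cross-use-case trigger, a negative brand mention opening a support ticket, can be sketched in Python as follows. The sentiment score and the ticket queue are stand-ins for real analytics and a real support system; the names here are invented.

```python
def route(tweet, brand, open_ticket):
    """Kick off a support workflow for negative tweets mentioning the brand."""
    mentions = brand.lower() in tweet["text"].lower()
    if mentions and tweet["sentiment"] < 0:
        open_ticket(tweet)  # hand off to the support workflow
        return "ticket"
    return "ignore"

tickets = []
tweet = {"text": "Acme support is terrible", "sentiment": -0.8}
action = route(tweet, "acme", tickets.append)
# the negative mention is routed into the ticket queue
```

In the integrated application, both the sentiment score and the routing rule live in the same code repository, which is what lets the two "separate" use cases share data and logic.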

Caroline: Okay, Peter. Hopefully that answers your question. The next question is from Yung, specifically about our app reference designs. “How can the app reference designs be used? Are they flexible? Can I add more sources and analytics?”

Tim: Absolutely. These app reference designs are meant to be the opposite of a full-fledged application that works right out of the box and provides everything you've ever needed just by clicking a button. They really are a framework and a skeleton for addressing your specific needs. And the reason we've taken this approach is that the big data landscape right now is still so varied, right? Someday it will mature to the point where fraud analytics on big data means just one thing, or customer experience management on big data means one very specific thing.

But right now that just isn't true. It's very different from organization to organization, and so these app reference designs are meant to be customized. They're used as building blocks upon which to develop a very custom application specific to the customer's needs, and as long as we stick to this agile approach and focus on the application and not the technology, we'll be able to achieve that first production use case in about 90 days.

Caroline: Okay, the next question is from Gustavo. “Are there any examples of big data projects that mix social media analysis with biological sensors?”

Tim: That is a very, very interesting question, a question that I would love to know the answer myself. I don't think I've actually run into any use cases right now where I've seen those two mixed, but I bet whoever does that could write an extremely interesting blog post here or a white paper about it or maybe even start a company around it. So if this is something that you're doing and are trying to explore, feel free to reach out to us after this webinar and we'd love to talk to you more about it.

Caroline: Okay and the next question is from Harsha. “My team needs to push our big data project mid-2014. Can app development run alongside infrastructure deployment?”

Tim: That's a great question, Harsha, and you can absolutely do application development alongside infrastructure deployment; that goes back to the whole local-versus-cloud separation that we tried to delineate here. As you're setting up that production environment, you should already have done a lot of the information discovery: identifying the potential architecture of the application, what the data sources are, and what insights we're trying to generate, and you should be working locally, probably with a sample set of the data, to get that application going.

So when that infrastructure's finally complete, that's when you complete this agile loop of taking what you've done locally and pushing it to the cloud. There is a little bit of a bottleneck there, in the sense that you can develop full-fledged applications locally, but until you push them to your actual cloud environment, you're not going to have full feedback on how they work in a fully distributed fashion, with multiple machines working separately from each other. There are connection issues that will start to crop up that you'll want to evaluate. But essentially the answer is yes, you can do both at the same time, and in fact, you should.

Caroline: Thank you, Tim. We have time for a couple more questions, so I encourage anyone who has a question out there to go ahead and put them into the chat functionality, so we can go over those. The next question is from Riley. “In the early stages of exploring Hadoop, why would we need cloud streams if we're not doing anything in real-time?”

Tim: That's a very interesting question. I would answer it in two ways. Firstly, what cloud streams represents, essentially, is stream processing, in-memory processing, and data pipelining, and that technology set is actually a very superior approach for taking data sources from one part of your organization, or one part of your system architecture, and bringing them into your big data stack. So we've actually standardized on cloud streams as the way we bring the majority of information into our system. So data pipelining and data ingest: very key for that.

But secondly, and maybe more importantly, we feel that Hadoop only addresses one of the three overall time scales that we talked about. There's the online, the offline, and the near-line; sort of real-time versus ad hoc versus batch. And Hadoop really is a batch tool. Hadoop does not cover the whole breadth of big data use cases and can't solve all business problems, right? It is a hammer, and it's easy to think of everything as a nail if you're thinking of Hadoop as the hammer.

Cloud streams really addresses more of the real-time side of the equation. A lot of times, what happens is you start with Hadoop: you're analyzing a historical set of data, you're building up a predictive model that can provide really good business insights, and now you want to operationalize that. You want to take real-time data, apply that model, and kick things off in near real-time, because that's the speed of business. The speed of business is real-time, and that's where cloud streams comes into play. Hadoop can't solve that use case yet.
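The "train in batch, score in the stream" pattern Tim describes can be sketched in Python. Here the "model" fit offline on historical data is just a mean and a threshold, and the streaming step scores each incoming event, the way a Storm bolt would; all numbers and names are illustrative.

```python
# Batch step (Hadoop's role): derive model parameters from historical data.
historical = [10, 12, 11, 13, 9, 10]
mean = sum(historical) / len(historical)
threshold = mean * 1.5  # flag anything 50% above the historical mean

# Streaming step (Storm's role): score each new event as it arrives.
def score(event_value):
    """Apply the batch-trained model to one live event, in near real-time."""
    return "alert" if event_value > threshold else "ok"

results = [score(v) for v in [11, 25, 12]]
# the spike to 25 triggers an alert; the other events pass as normal
```

The batch side periodically refreshes the parameters; the streaming side applies them to every event, which is how the historical insight becomes an operational, real-time one.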

Caroline: Okay, we have time for just one last question and this last question is from Siddhartha. “Do you have experience with cybersecurity use cases? Can you give some examples?”

Tim: That's a great question. So we at Infochimps have done a little bit with cybersecurity, but I would say that the majority of our cybersecurity talent at Infochimps and CSC is actually in a sister group of ours. CSC has three main groups that are kind of relevant to Big Data. There's the Big Data Analytics Group that Infochimps is a part of. There's the Cloud Group that provides really great cloud infrastructure and then there's the Cybersecurity Group, and all three of those groups work closely together.

If we wanted to dive more into cybersecurity use cases, I don't have any at my fingertips right now, but I'd say definitely reach out to us, and where appropriate we'll pull in our cybersecurity group, who's done a lot of cybersecurity use cases on both big data and small data infrastructure. We can definitely explore some of the different things CSC has done in the past and think about how that might apply to the use cases you're doing.

Caroline: All right, this concludes our webcast. Thank you, Tim, for taking the time to enlighten us on an agile approach to big data analytics app development. Look out for the recording early next week. Have a great rest of the week, and thank you everyone for joining.

Tim: Yeah, thanks everyone. Have a good day.