Amanda: Hi, good morning and welcome to our webcast.
[Recording: The broadcast is now starting. All attendees are in listen-only mode.]
Amanda: Good morning and welcome to our webcast. My name is Amanda McGuckin Hager, and I'm the Director of Marketing here at Infochimps. I'm excited to have you join us today. We're talking five big data use cases for 2013. Here at Infochimps we're seeing a lot of big data use cases coming up this year, but before we jump into that, I'd like to do a little bit of housekeeping. We are recording this webcast. We will send out a link early next week via email that links to the recorded video and to the slide decks. Those will be coming your way.
We'd love your questions during this webcast. There's a chat functionality in your GoToWebinar control panel. We'll be monitoring that throughout the webcast, and we may hold some of your questions for Tim in the Q&A at the end of the event. We have a Twitter hashtag. It's on the upper left-hand corner of every slide. We'd love to see some of you using the Twitter hashtag and talking about our event there.
Without further ado, I'd love to introduce you to our Director of Product Management, Tim Gasper.
Tim: Thanks, Amanda. Hi, everyone. Thanks so much for joining us today. I think it's exciting that we finally started 2013. It feels like it took forever to get here, but I think what's most exciting about it is that big data's been around for a few years now as a major force in terms of tools and technologies, and a lot of different companies have finally been able to experiment with some of these things, especially the community tools that are out there.
I really feel that 2013 is going to be a year for us to really lock down what those use cases are and what business value these tools can bring, so we can put some things into production in our organizations and begin to either build new revenues or achieve additional cost savings by using these tools.
Just to quickly introduce Infochimps, in case you're not fully familiar, we provide big data cloud services for enterprises to more easily accomplish their big data needs. We do that by providing the system administration, the infrastructure, hosting, and a powerful enterprise SLA for all the big data tools you need, including Hadoop databases and real time integration and stream processing.
As you can see on the screen, these are some of our customers, as well as some of the partners that we leverage to deliver that value. At a high level, we help customers collect the data that they need and process it through all the various means required to accomplish their use case, whether that's batch historical or real time. We provide tools that let them use the languages they're comfortable with, from completely visual approaches all the way to intense programming, so they can answer the questions they have about their business, visualize what's happening in their business, and build these applications.
I like the term application as it relates to use case, because big data is a platform. It's a set of tools that can help you accomplish a lot of different use cases, and what that does is put the burden on us, whether that's a vendor or you as an organization, to decide, "How can you use big data tools to accomplish the needs that your business has?" and to match those business objectives to the way these tools work.
That's a big challenge, and I'm hoping that through this presentation we can look at a few common use cases at a very high level, I don't want to get too technical today, and get your creative juices flowing about your own organization, so that whether in the questions after this presentation or offline afterward, we can talk more about what those apps, or those use cases, are for you.
Before we get started on the main content I want to do a quick poll and Amanda, can you introduce that poll for us?
Amanda: I'd love to. Before we get started, we just want to ask who do you represent within your organization. Here's the poll. We'll give you a few minutes to make your selection and we'll gather some votes.
Tim: Thanks, Amanda. This will help us cater what we're doing a little bit more towards the audience that we have today.
Amanda: Most of you have voted. We'll give it just a minute more. If you haven't placed your vote, go ahead and select your answer to the poll on your screen, and we'll close the poll. Here are the results. Tim.
Tim: Awesome. Looks like we've got a mix here of people who are more at the business function at their organization. We've got some technical people as well, whether those are analysts or programmers or system managers. We've got some consultants as well, and we have a little trickle of the other two categories [inaudible 05:57]. Looks like we've got a pretty broad swath of different people here, so that's great. I'll try to meet all of your needs in terms of how we talk to these use cases as well as talk to Infochimps.
Moving forward, I want to start with a stat and establish a little bit about big data. We've got a report that's on our website right now if you feel like grabbing it. It's at Infochimps.com, and it's about what IT teams thought that their CIOs needed to know about big data. We interviewed a lot of different people in collaboration with a couple other groups and one of the coolest, I thought, stats about this was the fact that 81 percent of the companies surveyed said that big data's a top five priority for their IT group.
I think this is great considering that big data has been around for a while, and we really need to be able to drive toward value. Now that it's on IT's desk to accomplish these things, it seems these businesses are really thinking about what those use cases need to be and putting them into production.
One of the biggest problems, though, is that, another stat I don't have listed here, about 55 percent of big data projects fail, and many more don't meet their objectives. That's a big issue, and when you're choosing a vendor or establishing your own in-house competency, you really need to figure out, "How can we mitigate that failure?"
Why is big data such a big issue? This infographic by IBM is really interesting, and it establishes the massive volume of data that different systems are seeing. Fifty million tweets per day and 2.9 million emails sent every second are quite unbelievable, and if you see the last stat about Amazon, sometimes I wonder how Amazon even does everything that it does, because there's so much data that they're managing and processing for so many different people.
When you look at your own organization, there are so many different data sources that you're looking at, and it's not just that there are so many sources; it's that they all look different. They all answer different questions, and questions don't just map to one source or another. You need to take advantage of multiple kinds of sources to solve those problems. When you bring these different pieces of data together, it's a terabyte and petabyte problem in many cases. You don't need to have Google-size infrastructure and Google-size data to have an issue. Even having just one terabyte of data can already start to cause problems when it comes to managing these different things.
The three most common things we talk about with big data are obviously volume, velocity, and variety. I don't think that's very new. If you've been looking at some of the things by Gartner and other companies in the big data space, they talk about these things. A lot of times companies will add additional Vs if they like. I like variability because I think it speaks to the fact that there can be bursts of data; things can look very different from one day to another in terms of managing these different sources. Choose your own adventure when it comes to your own Vs and your own organization.
I like to map these aspects of the bigness of data with just the fact that big data tools and technologies are providing a better approach. These systems are meant to be scalable so that you can start very small and increase them in size until you get to that petabyte point and you can do that with the same number of people without having to scale your talent and your people as much as you would if you were just completely depending on traditional methods and tools.
These tools are not only powering intelligent use cases such as predictive modeling; the systems themselves are intelligent. We use a tool called Storm that, when a component fails, knows how to automatically readjust and keep data moving through without dropping a single data point. You don't have to depend on any single vendor because most of these tools are open source, and, in general, big data is a holistic approach. It takes advantage of tools for storing data, presenting data, and processing data both in real time and in batch. That's powerful because it means once you get used to one set of tools, it meets a lot of your business needs.
Tim: We'll do one more poll and then after that we'll get into the actual use cases. Amanda.
Amanda: This poll is, "What is the category of your company's ideal big data use case?" We have a few categories I'm going to throw up on the screen right now. Go ahead and place your vote. We'll give it a minute or two so everyone has time to vote, and we'll be back to show and share the answers.
If you haven't placed your vote, go ahead and do so now. I think you can choose one on your screen, one of the radio buttons. A few last votes coming in. Are you all ready to see the answers? Let's close the poll, and here we go.
Tim: We're seeing the results now. We have about eight percent infrastructure, intelligence systems. The vast majority, almost 60 percent, are for business intelligence. A little bit of finance and also a good portion in mobile and online insight. That's great. We're going to hit a lot of these different use cases as we walk through. Unfortunately we had to force you into some general categories here. It's good to see that business intelligence is high though because of the fact that business intelligence can be so affected by big data.
Let's move forward and look at the general landscape of these use cases. I love this graphic because the creator actually did a cluster analysis of how people are using big data to create it, and there are these general categories around how the use cases come together.
There's general business intelligence for understanding your supply chain and being able to make better business decisions. There's understanding your customers with consumer and mobile insights. There's more scientific approaches, like biology and geolocation, finance obviously has its own set of interests and concerns. Then there's the more technical systems approach where you may be managing infrastructure and be trying to monitor and understand how those systems are working as well as building intelligent networks.
I've seen some really interesting things from IBM and Intel around creating smart homes and smart cities and things like this, and in order for these many different sensors to collaborate and work together, big data's a huge part of that.
We'll talk about a few individual use cases within this overall sphere, and we'll start off by talking a little bit about risk analysis and fraud detection. Any organization is going to have some element of risk, whether that's risk if your customer's going to pay you, risk as to whether or not you should bring on a customer, risk as in compliance. We'll talk about a couple of these things quickly.
One of them is, for example, customer risk and whether that's an organization that is in any industry or specifically in finance, you want to have a good picture of what your customers look like and what risk that they may be posing to your organization, whether that's how many returns are we going to get of retail, things that we're selling in our apparel locations, or how many times is a trade going to cause us to lose money?
Always with these risk analyses there's an importance around understanding all the attributes of the particular object that we're looking at, if that's a customer or something else. In order to get all those attributes, you need as much context as possible. You need internal sources within your own enterprise as well as external. Then you want to bring all those data sources together and process them even if they look completely different, bring them into a single view so that way you can start to understand patterns, determine what causes that risk and be able to predict it.
Just as a really simple graphic, here's an example of how a company might determine whether or not a person has good or bad credit. No matter what industry you're in, this is something we can understand: depending on your income and your credit history, as well as your debt, there may be different ways to determine whether you're high risk or low risk.
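To make that concrete, here's a toy sketch of the kind of rule-based risk classification the graphic describes. The attribute names and thresholds are hypothetical, invented purely for illustration, not taken from the slide:

```python
# A toy illustration of rule-based credit risk scoring.
# The attributes and thresholds here are hypothetical.

def credit_risk(income, debt, years_of_history):
    """Classify a customer as 'low' or 'high' risk from a few attributes."""
    debt_ratio = debt / income if income > 0 else float("inf")
    if debt_ratio < 0.35 and years_of_history >= 3:
        return "low"
    return "high"

print(credit_risk(income=60000, debt=15000, years_of_history=5))  # low
print(credit_risk(income=40000, debt=30000, years_of_history=1))  # high
```

In practice the rules wouldn't be hand-written like this; they'd be learned from historical data, but the output, a risk label per customer, is the same idea.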
Surveillance and fraud detection is another interesting use case. I was just talking to a potential client a few days ago about how you see all these really interesting events happen in the world where people leverage social media or other online [inaudible 15:19] to organize themselves organically, sometimes to do good things and sometimes to do not-so-good things, like violent protests or even acts of terrorism. It becomes important to be able to look at all the signals, a high volume of information as well as velocity, determine where the unusual things are, where the anomalies are, and be able to model that.
Big data's a very important part of that. Particularly by utilizing Hadoop and some of the tools in the Hadoop ecosystem, you can do some very good cleaning of that data, then model it and try to determine those things. The interesting step that comes next is: can you transition that to a real time situation? That's the second phase of this. It's not only being able to determine what causes these problems, but being able to detect them, determine in an instant whether or not something is interesting, and try to minimize those false positives.
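As a rough illustration of that real time phase, here's a minimal streaming anomaly detector that flags values far from a recent running average, the kind of check a stream-processing step might run on each event. The window size and threshold are assumptions for illustration:

```python
# A minimal sketch of streaming anomaly detection: flag values that deviate
# far from the mean of a sliding window of recent observations.
from collections import deque
import statistics

class AnomalyDetector:
    def __init__(self, window=50, threshold=3.0):
        self.values = deque(maxlen=window)  # recent history
        self.threshold = threshold          # how many stdevs is "unusual"

    def observe(self, x):
        """Return True if x is anomalous relative to the recent window."""
        anomalous = False
        if len(self.values) >= 10:  # wait for enough history
            mean = statistics.mean(self.values)
            stdev = statistics.pstdev(self.values) or 1e-9
            anomalous = abs(x - mean) / stdev > self.threshold
        self.values.append(x)
        return anomalous

d = AnomalyDetector()
for v in [10, 11, 9, 10, 12, 11, 10, 9, 11, 10]:
    d.observe(v)
print(d.observe(100))  # True: a sudden spike stands out
```

In production this logic would live inside a framework like Storm rather than a plain loop, but the core idea, compare each new point against recent history as it arrives, is the same.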
I also want to quickly mention regulatory compliance. With any organization you have to worry, from a business perspective as well as a technical perspective, about whether or not you're complying with all the regulations for your industry, the government, and so on. It's a really complicated environment, and what big data is allowing us to do is depend a lot less on bringing outside consultants and specialized vendors into our organization, and really put that power into our own hands: to look at all the data within our enterprise, process it, analyze it, and determine where that regulatory risk might be.
Finally, to end off this use case, I just want to look at an example of a global investment bank that's trying to determine trade risk. To do that, they took the trading data they're getting off the wire and combined it with what they know about their existing customers, ingesting all of that. As you can see, I've got three icons here: one of them represents real time integration, the elephant represents the batch processing that's happening, and the database icon represents the database that's holding that information and presenting it through an application.
This particular organization needed to ingest all that data, especially that high velocity trading data, be able to do analysis not only within a day, but to do large scale historical analysis into the past. This needed to be both for their production in determining those things as they were happening, as well as just trying to improve their models of that risk. Then finally, this particular bank had to deal with a lot of really important compliance issues and built into a complete big data stack is that ability to search and look through all this information and to perform that legal discovery. That was really important for them.
For our second use case I want to look at a very different way of using big data, however, using very similar sorts of analyses, and for this we're going to look at brand and sentiment analysis. For brand and sentiment analysis you're especially looking at social sources in order to figure out what people are saying about you. Your particular company may have its own forums and internal ways of collecting feedback or looking at what customers need but there's so much more happening, especially if you're a large brand on Facebook and Twitter, on new platforms like Pinterest and Tumblr.
In order to understand what these people are saying and protect your brand value, you need to be able to extract all that external information and bring it into your own environment, where you can begin to poke at it and deconstruct it and learn about what people are saying. This might be for product feedback. It might be for customer service alerting, so that if somebody says, "I'm never going to use you again. I can't believe that your customer service person hung up on me," you now have a second, third, fourth, however many chances to get that customer back.
The fact that big data systems are highly integratable with all your other systems means you can do really interesting things, like integrating with your customer service systems to create a complete way of understanding what your customers are doing in the world and tying that back to business action, which really has to be the goal.
We work with a large media conglomerate here at Infochimps. They're extracting information from all across the Web, particularly social, through Gnip, which is a great data source and a partner of Infochimps, for Twitter, Facebook, YouTube, WordPress, Disqus comments, and on and on. There's also a great data source called Moreover, which provides information from news sites, blogs, magazines, and a lot of other sources which are a little different from social media.
Then, this particular media conglomerate already was involved in traditional media like radio and their own newspapers and those types of media. They had all that data within their own confines, and so basically what we're doing here is we're bridging the old traditional media with all the new forms of media to create one unified view on what everyone is doing.
Compared to the risk analysis, this use case is much more focused on the real time element, that purple icon. As that data's being ingested and brought in, there's sentiment analysis happening, influence analysis, we're associating genders with these people so we understand if there's a difference in feedback from men and women, topic extraction, and on and on. You can imagine ten or more other sorts of metrics or augmentations that need to happen to this data to really gain more context and value. You do that in real time through frameworks that enable stream processing, then bring it into more of a database environment where you can serve it as part of an application.
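As a sketch of what that enrichment chain might look like, here each incoming message passes through a series of augmentation functions before being stored. The keyword lists and function names are invented for illustration; this is not Infochimps' actual pipeline:

```python
# Each message flows through a chain of enrichment steps, gaining
# fields (sentiment, topics, ...) along the way. Keyword lists are toy data.

POSITIVE = {"love", "great", "awesome"}
NEGATIVE = {"hate", "terrible", "never"}

def add_sentiment(msg):
    words = set(msg["text"].lower().split())
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    msg["sentiment"] = ("positive" if score > 0
                        else "negative" if score < 0 else "neutral")
    return msg

def add_topics(msg):
    # Treat hashtags as topics.
    msg["topics"] = [w.lstrip("#") for w in msg["text"].split()
                     if w.startswith("#")]
    return msg

def enrich(msg, steps=(add_sentiment, add_topics)):
    for step in steps:
        msg = step(msg)
    return msg

out = enrich({"text": "I love this #coffee"})
print(out["sentiment"], out["topics"])  # positive ['coffee']
```

Real sentiment and topic models are far more sophisticated, but the pipeline shape, a chain of small per-message augmentations applied as data streams in, is the pattern described above.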
This particular company doesn't just use this internally; they actually serve information to their customers, the advertisers and publishers that want to understand these insights. Then for Hadoop, Hadoop is a little simpler here. They're just using it for general trend analysis to see how things evolve over time. This is really much more of a real time big data use case and a great example of where Hadoop isn't the be all and end all. There you can see a little screenshot of what that application could look like.
For our third, we're going to look at customer insights and behavior. This is a little general, but it should apply to a lot of different organizations. To do that, I'd like to look specifically at the vertical of retail, although I think that a lot of these different circles will apply generally, that there's a concept of an entire customer life cycle of being able to first, make sure you're efficient at getting that customer and then bringing them through the entire funnel of what your organization does, whether it's optimizing leads, making sure you give the right price, having the product on the shelves, trying to upsell them and then making sure you keep them a passionate, excited customer, that all these aspects are important, whether you're B to C or B to B.
I'm going to look at a couple of examples of these things. In particular, customer churn analysis is something that has been highly amplified by the power of big data. To do this, it really is about taking all the different context that you can have, internal as well as external data sources, to build a model of that customer. What's cool about churn analysis is that it's really different for every company and every industry, so it really is in your power as an organization to determine that. Bringing all that data into Hadoop gives you the ability to traverse those social graphs and the different connections in that information, identify patterns, and find leading indicators, whether somebody doing X is the variable that leads them, two weeks later, to no longer be a customer.
You can start to see some leading indicators of when somebody might be leaving. Then you can start determining a strategy for your organization of, "Okay. Not only how do I do some response to these people," which is maybe more in the realms of brand and sentiment analysis, but "How do I create a long term strategy to mitigate that overall factor?" If the reason why they're leaving is because they couldn't find a store nearby and our presence just isn't as good in terms of offline, then we know that there's a need for a strategy that is to expand our store presence. A lot can be learned there.
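Here's a small sketch of that leading-indicator idea: for each candidate behavior, compare the churn rate among customers who exhibit it against those who don't. The customer records below are fabricated for illustration:

```python
# Compare churn rates between customers with and without a candidate
# indicator. The records are made-up sample data.

customers = [
    {"support_calls": 4, "nearby_store": False, "churned": True},
    {"support_calls": 0, "nearby_store": True,  "churned": False},
    {"support_calls": 5, "nearby_store": False, "churned": True},
    {"support_calls": 1, "nearby_store": True,  "churned": False},
    {"support_calls": 3, "nearby_store": False, "churned": False},
]

def churn_rate(group):
    return sum(c["churned"] for c in group) / len(group) if group else 0.0

def indicator_lift(customers, predicate):
    """Churn rate among customers matching the predicate vs. the rest."""
    hit = [c for c in customers if predicate(c)]
    miss = [c for c in customers if not predicate(c)]
    return churn_rate(hit), churn_rate(miss)

with_it, without = indicator_lift(customers, lambda c: not c["nearby_store"])
print(with_it, without)  # customers with no nearby store churn far more often
```

At scale you'd run this kind of comparison across thousands of candidate variables in Hadoop rather than a Python list, but the question being asked of the data is the same.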
Another one is customer loyalty, especially with the advent of online shopping, it's so hard to keep a shopper focused and interested for long enough to make sure that we get them through the entire purchase cycle, let alone become a repeat customer. With customer loyalty, we can use things like loyalty programs, or just in general, how they're interacting, what their online and offline presence is, to see what those trends are and be able to continue to maximize that and watch that loyalty grow over time.
For this particular use case I want to bring up Target, because I think most of us, if not all of us, have heard the story. It's kind of become a prototypical big data story now: Target was looking at a lot of different cues about their customers and was able to figure out that a particular female customer of theirs was pregnant before she even told her father.
I think that's an interesting story because it just goes to show that by taking all these different data sources, obviously starting with your purchase data, so you can understand what they're actually buying and interacting with the actual products, to things like demographics, geography and tying their website actions to their offline actions. Then being able to segment them and really understand that, "Okay. This particular person bought a diaper, a Vitamin Water and a holiday card and therefore, we know that those are predictors that this person is likely to be pregnant." I don't know if those are actually the signals. I'm doing this off the cuff, but those things lead, therefore, to this person being not only a certain type of customer, but likely to take certain types of actions.
Very related to customer insights is being able to specifically do targeted marketing to those segments and personalizing experiences so that we can maximize our conversions. Targeted offers are both an online and offline activity and the fact that you can leverage both these worlds for data is important because online provides some really interesting new methods for tracking what users are doing, whether that be through cookies, as well as looking at their click stream behavior because of all this information, we can actually do a lot more not only online, but offline as well and take what we learn and build that into our application logic.
Application logic may, in an online sense, be, "Okay. When somebody clicks on these four things on our website, we know that they might be a fickle user, and therefore we should give them a discount," or in an offline sense it may be that, "We know that people who travel through aisles two through five tend to be less loyal shoppers than those in the other aisles, and therefore we should change the way we market in those aisles." There's a lot of different ways that this could have an effect.
Another way this comes through is not just targeted offers or deals, but recommendation engines and being able to predict. I think a good example of this is Amazon. When you travel through their website, right when you're about to make a purchase, they'll show you all sorts of other products that other users like you may want, recommended for you. They know that by showing those products there, as long as they're well focused and personalized to you as a person, they can increase your market basket purchase sometimes by as much as an additional 100 percent or more.
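A bare-bones version of that "customers like you also bought" idea is item co-occurrence: recommend the items that most often appear in past baskets alongside the one being viewed. The baskets below are made up, and real recommenders use much richer signals, but this shows the core mechanic:

```python
# Item-to-item co-occurrence: recommend items that most often share a
# basket with the item being viewed. Baskets are toy sample data.
from collections import Counter

baskets = [
    {"camera", "sd_card", "tripod"},
    {"camera", "sd_card"},
    {"camera", "bag"},
    {"phone", "case"},
]

def recommend(item, baskets, n=2):
    counts = Counter()
    for basket in baskets:
        if item in basket:
            counts.update(basket - {item})  # count co-purchased items
    return [i for i, _ in counts.most_common(n)]

print(recommend("camera", baskets))  # 'sd_card' ranks first
```

Amazon's actual approach is item-to-item collaborative filtering over vastly more data, but the intuition, rank candidates by how often they co-occur with what you're looking at, starts here.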
This is a really powerful thing because now almost every organization is really leveraging online, and online is a great place for you to do things that are dynamic where, depending on the person and the attributes and the things you know about them, you can be able to focus that experience directly for them to encourage the right action.
It's not just retail. We've been talking to an investment group that has a website where they encourage people to sign up for an account to begin investing with them. Simply by tweaking the funnel by which you go into the website and sign up for their service, and the types of things they show you in terms of features and benefits, they can increase their conversion by about 30 percent. That is a huge deal because, considering how high the churn is in that industry, being able to really increase the top of the funnel is really helpful for them.
For this example, I want to show a major apparel brand. This is a particular company that our CEO worked with where they wanted to take their online storefront and use that click stream data to provide more targeted offers. They had an ongoing campaign where they wanted to provide 20 percent off to their customers, but after figuring out the budget behind it, realized that if they were to, across the board, do a 20 percent discount, that it would have a major negative impact on their quarter.
And so by leveraging a big data approach, ingesting all that data from the storefront click stream and using Hadoop to analyze it, they were able to cluster behavior. They identified that if you went through a certain path, I believe it was something like looking at a shirt and then ending up buying something small, like accessories, underwear, or socks, you were likely to be more of a fickle user, and therefore if they gave you a 20 percent discount, you would be a lot more likely to actually close a much larger purchase.
Versus if somebody else went through and did a different kind of behavior, regardless of whether or not you gave them a 20 percent discount, they would just continue along their same path. It really helped them save a lot of money and make sure that their offer was actually the proper way to push the user in the direction they needed them to go.
I'm sure you can imagine how any website or offline storefront can take advantage of this information to really appeal to not just individual users, but you can imagine geography is important as well. A store in Cleveland, Ohio and a store in Austin, Texas are going to have very different types of people coming to them and personalizing is a huge benefit.
For our last use case we're going to go a little higher level and we're going to talk about big data business intelligence. The reason why I want to go a little higher level here is because I want to show you a couple of interesting architecture diagrams where traditional ways of organizing your systems internally are being revolutionized by new approaches.
Before we do that, I want to get an understanding of how many of you currently are taking advantage of things like BI data visualization tools, or have internal data warehouses so that way we can speak a little bit to your own situation.
Amanda: Let's go ahead and launch this poll. You should see the radio buttons on your screen at this time. Go ahead and place your vote. We'll give everybody just a few minutes, and then we'll share and show the results.
Most of you have voted. If you haven't yet voted, go ahead and place your vote. We're going to close the poll in just a minute, and we'll show everyone the answers. A few last votes trickling in. Okay. Let's see what we have.
Tim: I'm looking at the results now. It seems the majority of folks have chosen, "Yes. Both BI and data warehousing." That's about 33 percent. Among those doing just one, more people have a data warehouse, whether that's data warehousing technology or a database being used as a data warehouse, than are just doing BI. There's actually a pretty significant portion of folks that aren't doing BI or data warehousing, which is interesting. This will be relevant to what we're talking about here.
If you are using a data warehouse and BI, then you know that the point of leveraging them is to bring a lot of different aspects of your company together in one place, using a common language to describe that data, so that your analytics folks and the people who want to take that information and improve your business can do that, because the data is there at their fingertips. They can use tools like statistical modeling tools or BI visualization platforms to look at that information and extract some insights.
The traditional approach has worked thus far and has been a major impacting force across the industry in terms of both big data as well as just data in general, but what traditional enterprise data warehousing approaches force you to do is often use a batch way of being able to bring that data in and process it. A lot of times that processing is very intensive because of the fact that data warehouses currently and traditionally are more structured and force a very specific schema on that data, whether that's a SQL warehouse or an Oracle warehouse or something like that.
Then you have to be very specific about how you put that data into a data mart so that any given analytics group or department can actually use that information with whatever tool they might have. All those tools speak slightly differently and have different ways of interacting with data, so overall there are all these different touch points, and definitely a lack of a real time nature in the way you're gaining insight into data.
What I love about data warehousing is that it pushes this concept of big data without actually using big data tools: you bring a lot of different sources of data together, and you have a lot of different use cases for how you're going to use that data. The problem is that it simply does not have a big data approach.
A big thing that the market is identifying now, and that a lot of vendors like us are also advocating, is that big data changes the way this picture works. By using these big data tools, specifically Hadoop and real time processing frameworks like Storm, not only can we take the data in closer to real time, but we don't have to worry so much about organizing it in such a structured and constricted way. We can let data be the way that it is and then worry about the way the schema needs to look upon read, upon absorbing and using that information.
By using data stores that don't force schemas, such as NoSQL as well as some of the NewSQL databases, and leveraging tools that know how to talk to the big data world, you can get some of the same powerful analysis using all the data that you need, and do it in a way that's so much faster and closer to real time that it's, in general, a superior way. If that's all I said, that'd be fine, but the fact that these tools are Open Source often means that they're 10 times cheaper as well.
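To make that schema-on-read idea concrete, here's a minimal Python sketch (not Infochimps code; the field names are made up for illustration): raw events are stored exactly as they arrive, and a schema is applied only at the moment a particular analysis reads the data.

```python
import json

# Raw events are stored as-is -- no upfront schema is enforced on write.
raw_events = [
    '{"user": "alice", "action": "click", "ts": 1357000000}',
    '{"user": "bob", "action": "purchase", "amount": 19.99}',
    '{"user": "alice", "action": "purchase", "amount": 5.00, "coupon": "NY13"}',
]

def read_purchases(lines):
    """Apply a schema on read: keep only the fields this analysis needs,
    tolerating records that don't match the expected shape."""
    for line in lines:
        event = json.loads(line)
        if event.get("action") == "purchase":
            yield {"user": event["user"], "amount": float(event["amount"])}

# Total purchase revenue, computed without ever restructuring the raw store.
total = sum(p["amount"] for p in read_purchases(raw_events))
```

Notice that the click event, which lacks an `amount` field entirely, never has to be forced into a purchase-shaped schema; a different analysis could read the same raw store with a different schema.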
Just to show how the Infochimps platform maps onto this, you can see some of these icons here. That purple icon with the data streaming into the three squares is our data delivery service, which is our data integration and stream processing service. You see Hadoop, which represents HDFS, the file system, or maybe HBase, the big data database that goes with Hadoop. That represents your elastic data warehouse and the processing that you can do to work with all that data. Then you have a variety of different databases as well as application frameworks to work with that data, build applications, answer questions about your organization, and use the right tool for the job, whether that's a modeling tool or a BI tool.
What's awesome about this is that what used to be a very limited set of information powering exploration and visualization of your supply chain, your resources, your financial practices, your inventory, whatever that might be, can now be a much broader set of data. You can do things that are much more big data oriented, taking advantage of social data and all these other types of context, as well as things like what you can see here, a geospatial analysis, which over a large set of data can be really complex. By utilizing things like Hadoop, and in this particular tool, Tableau, you can visualize that, get some really great insight, and then present it in a way that's friendly to the business decision makers.
Finally, for this example, I want to look at one of Infochimps' customers. This particular customer is an online [inaudible 38:05], and they wanted to build a business command center that allowed them to look at all aspects of their business, from the inventory of their deals to how they manage customer service, basically every part of their business that was important to them from a business intelligence standpoint, and create a dynamic dashboard. However, all the different data sources, especially those weblogs [inaudible 38:28], were created at such a high velocity that they needed a way to bring it all together and do large-scale analysis on it. It couldn't take days. It needed to be as close to real time as possible.
What they did is they utilized the data delivery service for streaming ingestion of that data, the stream processing and integration. Then, not only did they serve that directly to the dashboard for quick insights, but they actually used Hadoop, and particularly Hive, a tool in the Hadoop ecosystem that lets you do SQL-like analysis, so their business analysts were able to modify this iteratively to continue to get the insights they needed to run their business.
By using Hive and making sure to tailor those jobs to be very specific and run very frequently, not only could they make their data visible, but on a very frequent basis they could add in important value-added metrics that enhanced their view of the business and really helped them excel versus their deal-site competitors.
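The Hive jobs described here boil down to SQL-style aggregations run on a schedule. As a rough illustration, in Python rather than HiveQL, and with entirely hypothetical field names, each run folds the latest batch of deal events into per-deal dashboard metrics, including a value-added metric layered on top of the raw counts:

```python
from collections import defaultdict

def dashboard_metrics(events):
    """Aggregate a batch of deal events into per-deal metrics,
    the way a scheduled Hive query might (think GROUP BY deal_id)."""
    metrics = defaultdict(lambda: {"views": 0, "sales": 0, "revenue": 0.0})
    for e in events:
        m = metrics[e["deal_id"]]
        if e["type"] == "view":
            m["views"] += 1
        elif e["type"] == "sale":
            m["sales"] += 1
            m["revenue"] += e["price"]
    # A value-added metric computed on top of the raw counts:
    for m in metrics.values():
        m["conversion"] = m["sales"] / m["views"] if m["views"] else 0.0
    return dict(metrics)

batch = [
    {"deal_id": "spa-day", "type": "view"},
    {"deal_id": "spa-day", "type": "view"},
    {"deal_id": "spa-day", "type": "sale", "price": 49.0},
    {"deal_id": "sushi", "type": "view"},
]
snapshot = dashboard_metrics(batch)
```

Running a small, well-scoped job like this every few minutes, rather than one giant job per day, is what makes the dashboard feel close to real time.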
So that ends the structured part of what I was going to be talking about today, and I know we took a high level approach, but I wanted to try to hit a lot of different aspects of where big data can help to try to make your 2013 be the year where you pick a use case and bring it all the way to production and really prove out both business value and return on investment for your organization. At this point, I think we'll shift over to questions, and I definitely encourage you to not only ask general questions, but also about a use case that you might be interested in.
Amanda: Tim, we've had some great questions come in over the chat functionality in the GoToWebinar control panel. I'd like to start with Ignacio's question. Ignacio in the UK asks, "How could one leverage legacy R algorithms and data sources like [Force.com]?"
Tim: Interesting. It sounds like your organization has been taking advantage of R, which, for those who haven't used R or are unfamiliar with it, is an Open Source statistics package, and utilizing it to gain some insight through modeling. One of the cool things about the big data landscape is that I like to think of it as three major aspects: there's Hadoop for your batch processing, and there's your stream processing with things like Storm or Flume, along with a lot of other approaches to doing stream processing.
Then there are your databases, where you're storing and managing that data. Your R algorithms can actually plug in, slightly modified but relatively straightforwardly, into things like Hadoop. Hadoop has an rmr connector, which a lot of people like, that allows you to adapt your R algorithms to work with Hadoop, and there are a few other ways that you can leverage R across those three major aspects of big data.
You just have to use the right tool, whether that's Hadoop with a big batch ETL job to grab that data off of Force.com with some additional components. Or, and I actually recommend this, a streaming approach where you're using something like Storm or the Infochimps data delivery service to hit that Force.com API and, on a more real-time or streaming basis, bring that data in so that your R algorithm can run on it.
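The streaming approach Tim recommends amounts to polling the external API with a cursor and handing each micro-batch to your model. Here's a minimal Python sketch of that loop; `fetch_new_records` is a stand-in for a real API call, and the real-time hand-off to R (or any model) happens where the batch is collected:

```python
def fetch_new_records(since):
    """Stand-in for hitting an external API with a cursor;
    in practice this would be an HTTP call with auth and paging."""
    pages = {0: [{"id": 1, "score": 0.2}, {"id": 2, "score": 0.9}],
             2: [{"id": 3, "score": 0.7}]}
    return pages.get(since, [])

def stream_batches(poll_fn, start=0, max_polls=3):
    """Poll on an interval and hand each non-empty micro-batch
    downstream (to the analytics model) as soon as it arrives."""
    cursor = start
    results = []
    for _ in range(max_polls):
        batch = poll_fn(cursor)
        if batch:
            results.extend(batch)       # here: pass the batch to your model
            cursor = max(r["id"] for r in batch)  # advance past what we've seen
        # a real loop would sleep between polls; omitted so the sketch runs
    return results

records = stream_batches(fetch_new_records)
```

The key design point is the cursor: each poll asks only for records newer than the last one seen, so the same data is never reprocessed and latency stays close to the polling interval.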
Amanda: Thank you, Tim. Here's another great question from Mark in Austin, Texas. "In a big data project," this is a two-part question for you, Tim, "In a big data project, what is off the shelf, and what is Dev intensive or custom work?" That's part one. Part two is, "Is Infochimps a products company or consulting company?"
Tim: Those are great questions. I'll start off with the first one: "In a big data project, what is off the shelf, and what is Dev intensive or custom work?" That really depends on who you're working with on the big data project. If you're taking a completely community approach, a lot of the off-the-shelf stuff is the core developer tools and some of the frameworks by which you can interact with these things. But there is going to be a lot of custom work, training, and developer-intensive effort that needs to happen in order to not only get those things stood up, but tailored to the way that you want to use them, and then finally actually implementing the application or use case that you're using them for.
There are some approaches to big data where companies, and Infochimps is a platform company, try to add some additional things to make those tools either simpler or more straightforward to use, and so I recommend considering that depending on how much [traffic] you have internally and whether or not you want to be doing all that learning and working with those things yourself, given what does come off the shelf.
Leading into the second question, Infochimps is a product company. We provide some professional services and consulting to help our customers, but we really do advocate a specific platform which utilizes Hadoop, stream processing, and databases, and then wraps all three aspects together with a development framework that tries to make it easier to work with big data, as well as a templated approach to applications. Our goal is always, within 90 days, to get you to a production use case with real data, not just sample data, to accomplish your needs.
There are a few other companies out there, whether they're providing Hadoop directly or additional pieces on top, which can help you deal with the off-the-shelf nature of these things and cover some of those Dev-intensive parts.
Amanda: A few more questions. The questions keep coming in. This is great. Luke asks, "Is DDS an Open Source tool?" and then, "Is there any alternative to it for near-real-time integration with Hadoop and NoSQL?"
Tim: There are definitely alternatives. DDS is not an Open Source tool, although it leverages Open Source tools at its core. What's unique about our data delivery service is that we use it as a multi-purpose tool, not only for data integration but also for that real time processing of data. In order to do that, it needs to leverage a few different aspects. It leverages a product called Storm, which is a great Open Source tool for doing that processing.
It leverages Kafka, which is a message queue system that ensures reliability of moving data from one place to another, and it also uses a bunch of web servers at the front end to collect all that data, so that you can horizontally scale that collection and then pass it off to either Storm or Kafka to process or deliver downstream to Hadoop or your BI tools. You could try to bring all these tools together yourself and do something similar, but we package that all together with DDS.
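That pipeline, front-end collectors handing events to a queue, with workers consuming downstream, can be sketched with Python's standard library. This simulates the pattern in-process (a real deployment would put Kafka and Storm between the collectors and the consumers, and the names here are hypothetical):

```python
import queue

events = queue.Queue()

def collect(payload):
    """A front-end collector: accept an event and enqueue it.
    Many collectors can run in parallel, which is what makes
    the collection tier horizontally scalable."""
    events.put(payload)

def consume(q):
    """A downstream worker draining the queue; in a real system this
    is where Storm processes events or Kafka delivers them onward
    to Hadoop or a BI tool."""
    processed = []
    while not q.empty():
        processed.append(q.get())
        q.task_done()
    return processed

for i in range(3):
    collect({"event_id": i, "source": "web"})
delivered = consume(events)
```

The queue in the middle is what decouples the two sides: collectors never wait on slow consumers, and consumers can fall behind temporarily without dropping data.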
There are alternatives for real-time integration. Obviously I mentioned Kafka and Storm. A lot of people use Flume; however, we have found some issues with Flume as you scale higher and want to tap into a lot more data sources other than just logs. Flume is mostly a log-oriented collection system. There are also ways of using more core messaging systems, whether that's ZeroMQ or a Java-based message queue, with a bunch of custom code to do that kind of thing.
Right now, stream processing is definitely a lot earlier along than batch processing like Hadoop in terms of the new Open Source world of things. It's definitely a space to watch because it's very important for a lot of these use cases, and I would say that over the next couple of years, those real-time integration and processing tools are going to really start maturing.
Amanda: Thank you, Tim. Here's another one from Christopher. "How often have you leveraged data virtualization technology to mash up big data and typical enterprise data sources?"
Tim: Data virtualization is an interesting topic because it depends on what you mean. Virtualization gets used in a lot of different contexts, unfortunately. But let's say we're talking about data virtualization in the sense of tapping into data where it currently is, letting the data stay where it lives, crawling it, and creating a virtual representation of it, so that when you do a query it's distributed across all those systems and pulls up exactly the data that you need. We don't use data virtualization for our own big data use cases, because so much of the processing that we do is so much more effective if you bring that data into the analytics system and really embrace that big data data warehousing concept.
However, there are some tools out there that allow you to take a more virtualized approach to mashing up your data and working with data sources, so I would definitely consider the pros and cons of whether or not you want to move that data. Obviously, you've heard our approach: we like to actually move and process the data itself.
Amanda: Okay, Tim. A few more. They keep coming. [Takar] asks, "What could be use cases for any financial company other than Internet related data?"
Tim: For financial companies it depends on what that financial company is doing. If they're an insurance company, they're going to want to use it a lot for how they model all the different claims and the ways that people introduce risk to their system, from that risk use case perspective. That's going to be mostly the information within their own systems, maybe with a little bit of external context that they bring in, like demographic data or census data, etc.
Also trade information. Often there's a special system that they're pulling all those transactions off of that's highly real-time and proprietary. In those cases, it wouldn't be Internet-related information; it would be more specific to that exact organization. If you're using big data tools in the proper way, you can utilize any data source, whether that's online, offline, external, or internal.
Amanda: Thank you, Tim. I really like this next question. It reminds me of a few years ago when social media first came out and people were, as a marketer, I saw all this buzz on the Internet going around about, "Will social media kill email? Will email die in the wake of social media?" Joshua asks a very similar type of question. He asks, "What do you say to the buzz on the web that big data is on its way out?"
Tim: That's like saying that social media analysis is on the way out. It's like, "No. Not really." Social is just becoming a fabric of everything that we do, and now social media analysis has been rolled into things like general marketing and branding. Big data has the same arc: it was great to specifically call it out when it was really unique, and there were these exciting things like Hadoop, this really unstable, crappy thing that was processing millions of terabytes of data at Google, and people were excited about that. They were like, "That's big data."
As this expanded, the term big data kept on being used for the same thing but the tools became much more able to serve general and varied use cases. I don't think the term big data's on the way out yet. I wouldn't be surprised if it started to fade away in a couple of years because what's happening is that big data, the term, represents tools like Hadoop and these no-SQL databases and an approach which is more scalable and agnostic and Open Source oriented.
Big data is more of a trend than it is something specific, and all these new tools and technologies are either going to merge with the more traditional SQL or data warehouse ways of doing things, or they're going to hit the gas pedal, pull away, and become the new normal. I separate the term, which I think will fade, from the tools and technologies and the use cases they solve, which are absolutely here to stay and will only grow in importance.
Amanda: Here's one more question from Robert, who asks, "Without the traditional database controls for referential integrity what steps can be taken to ensure integrated data quality?"
Tim: Interesting. This is a really good question, and it's one of the reasons why big data hasn't completely supplanted more traditional ways of doing things. There's a difference between an analytical approach and an operational approach to data. The operational approach is much more dependent on traditional databases, because you need much more assurance of data integrity, retention, and compliance, and that entire way of thinking has evolved around more traditional databases.
What I would say for now, without getting into too much detail, because to get more technical I'd want to bring in some of our other folks at Infochimps who know more about how you program and structure these architectures, is that when you're doing data integrity and compliance and governance in a more big data context, it ends up being a bit more custom. You need to structure Hadoop jobs, processes, workflows, and user management around these tools to enforce those [inaudible 52:27].
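As a simplified, hypothetical illustration of what one of those custom integrity steps might look like, a validation job can run before the main processing and quarantine any records whose foreign keys don't resolve, enforcing in code what a relational database would have enforced with a constraint:

```python
def check_referential_integrity(orders, customers):
    """Split a batch of records into valid rows and quarantined rows,
    enforcing what a relational foreign key would have guaranteed."""
    known_ids = {c["id"] for c in customers}
    valid, quarantined = [], []
    for order in orders:
        if order["customer_id"] in known_ids:
            valid.append(order)
        else:
            quarantined.append(order)  # held for inspection, not silently dropped
    return valid, quarantined

# Illustrative data with a deliberately dangling reference:
customers = [{"id": "c1"}, {"id": "c2"}]
orders = [
    {"order_id": 1, "customer_id": "c1"},
    {"order_id": 2, "customer_id": "c9"},  # no matching customer
]
valid, quarantined = check_referential_integrity(orders, customers)
```

In a Hadoop workflow this kind of check would typically run as its own stage, with the quarantined set written somewhere auditable rather than discarded, which is part of the governance work Tim describes.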
So you may need to work closely with a vendor or a partner to help ensure that those steps are taking place, or expect that as these tools continue to evolve, the aspects that serve things like integrity will get better, and therefore they'll be able to cover more of those operational use cases.
Definitely, if you have additional follow-up questions to that, Robert, I think that's a really interesting question, and I would love to get you in touch with some of our folks here who are a little more technical and think about that every day.
Amanda: I'd like to invite anybody who might be feeling a little shy today or maybe didn't get the full answer to their question, to email us at email@example.com and address your questions to Tim or myself, Amanda, and we'll make sure to get you the answers that you're looking for.
We have a question from Cheryl. It's so nice to see another woman on the line. Cheryl asks, "Is Hadoop required for all these use cases?"
Tim: Thanks, Cheryl, for that question. That's actually interesting, and the answer is definitely no. I think Hadoop is a really great tool, and it continues to get better and better, pushed by some of the vendors that are really backing Hadoop, like Cloudera and Hortonworks and MapR, and, in general, by the community, but there are still so many use cases that Hadoop inspires but cannot by itself completely satisfy.
While Hadoop is definitely not required for all these use cases, it's often a key part of them. In combination with those other two aspects I talked about, the databases that store and present that information, as well as the stream processing that gets the information in there in the first place and handles real-time processing, those three aspects together cover the main infrastructural pieces.
And then, if you wrap that with the ability to integrate with BI tools, statistical modeling like R and SAS and SPSS, and application development, that unified approach and technology set can hit all the use cases that you need, or at least the majority. There are always going to be exceptions, particularly if you're talking about super-fast, extremely low-latency situations, where big data is often not the right approach and you instead want something that's really close to the hardware and can handle sub-50-millisecond responses and things like that. Those are situations that Hadoop is continuing to grow toward and hoping to solve soon.
Amanda: We have another question from Ben, in Seattle. "What use cases is RFID useful for?"
Tim: Interesting. RFID is a sensor approach that is particularly used in inventory management, where you want to keep track of where a particular product is in your supply chain, whether it's coming off the conveyor belt, sitting in a warehouse somewhere, just entered the truck, or just entered the actual storefront. Obviously supply chain, inventory management, and financial-type use cases are important with RFID.
But what's interesting is that as RFID and [near field] type technologies become more predominant, along with mobile interactions and this whole concept of home automation and the intelligent home, it'll become more and more useful for a lot of additional use cases.
Amanda: Thank you, Tim. We have another question from Luke who asks, "How is big data being used in telecom?" The second part of that question is, "Are the business cases currently handled in telecom related to customer insight or marketing and/or operations and/or churn?"
Tim: I think that the answer is all of the above, in that telecommunications companies are actually among the earlier adopters of big data tools. If you look at what I refer to as the stream processing, or real-time, frameworks, a lot of the precursors to some of the newer Open Source tools were pioneered by the telecommunications industry to handle things like phone call metadata and text messaging networks, and to keep track of people's usage so they could charge them properly on their bills. All that kind of stuff was very much pioneered by telecom.
I think that mostly big data's being used operationally by telecom to help them manage their networks and monitor them and be able to charge people properly, but I would be very surprised if they weren't utilizing the tools for things like customer insight, marketing and more in order to understand their customers better and not just use their investments in one area of their company but really use it everywhere it can be useful.
Amanda: This last question, I think we have time for one or two more questions. This question comes from Janice, or Janiece, please excuse me. I'm not sure exactly how to say your name. She asks, "Is there a big upfront investment required to kick off a big data project before you are certain there is real value to be extracted from the data you've already collected?" That's a great question.
Tim: That's a really good question. It's actually a question that could apply to any kind of large-scale infrastructural project that your company might be taking on: I want to be able to prove that this thing's going to work and provide ROI before I go ahead and actually commit big money to it. Obviously, everyone talks about how much ROI there is, but you don't want to be the one out of three projects, or one of the 55 percent of projects, that fail. You want to be in the camp of the ones that succeed.
I would say that it depends on the way that your organization is approaching it. If it's a really huge initiative, I've seen companies just dive headfirst into the big data front and start hiring people, buying tools, and getting IT shifting things, but really, what's more common is to start small. In fact, a previous webinar that we did talked about how to ensure your big data project's success, and one of the points is that you want to think big but be specific.
So it's often best to do a smaller-scale project, to make a smaller investment and prove out that ROI first. But you don't want to think small, because if you're thinking small, you're going to set up a database, do a couple of scripts, maybe try out Hadoop, and say, "Okay. Great. We've got it set up and it answered that question that we wanted to know," but then there's not as much of a next step, and you didn't really impact the bottom line of the company.
At Infochimps, what we really advocate is that you pick a production use case, but scope it well, and then work with a company that can get you to achieve that use case quickly. For example, at Infochimps we advocate a 90-day timeline where, at the end of 90 days, not only have you deployed all your systems, but you've done an end-to-end use case solution, whether that's solving a problem or building an application, whatever that use case may be.
Then have a major checkpoint at that point to make sure that you're ready to move forward, make a much larger commitment, and expand the number of use cases as well as the data that you're bringing into the system. I advocate an approach like that because I think it satisfies the need to start small and make a staged investment while not just accomplishing something small.
Amanda: That is our last question, Tim. Thank you all for participating. We loved your questions. We loved hearing from each and every one of you. I have nothing further.
Tim: Thanks, Amanda. Thanks everyone for joining and feel free to contact us if you have anything, or also you can find me on Twitter. I'm @TimGasper, so definitely feel free to follow me there. Thanks, everyone and we'll be talking again soon.
Amanda: Take care.