Transcript
Amanda:
Today's webcast is Real Time Analytics: The Future of Big Data in the Agency. I'm Amanda McGuckin Hager, the director of marketing here at Infochimps. Thank you for joining us. We have a few people on the line. We'll go ahead and get started today.
A few things to note before we get started, some housekeeping. If you have any questions at all throughout the webcast, please feel free to chat them over. We'll stop about 20 minutes shy and do a Q&A where we hope to answer and address a lot of your questions. The ones we can't get to we will address over e-mail privately, at the e-mail you used to register for the webcast.
We do have a Twitter hashtag. It is on the upper left corner of every slide. In case you need to reference it, it is RT Analytics. We look forward to you participating there as well. We will be recording this, so early next week we'll send you a video of the recorded webcast as well as a SlideShare link to the deck for your reference.
Without further ado, let's get started. Allow me to introduce Dhruv Bansal, one of our founders and chief science officer.
Dhruv:
Thanks, Amanda. And thank you everyone for coming and attending the webinar. My name is Dhruv Bansal. I'm one of the founders and the CSO here at Infochimps. At Infochimps we make big data infrastructure simple. I'm very proud to work with some of the most cutting-edge technology and hosting providers in the cloud big data space. A good shout-out to our friends at (?) today. We're putting on another great webinar that you guys might be interested in about using their product to do big data analysis.
At Infochimps we take the technology that our partners develop. We research it. We understand how to deploy it, to knit it together. And we build solutions for our customers so that they can get the power of big data without having to worry about the operational burdens and technology issues.
Before we get started with the bulk of the webinar, I'd like to set a quick agenda just so that we all understand what's about to happen. I'm going to start out by telling you some stories about opportunities in big data, things we're hearing from our customers in the agency space. I hope that you all agree that you see some of these same opportunities and challenges in front of you.
Next, I'm going to talk (?) about why these (?) these all ultimately (?) from or (?) based on (?). I'll next talk about (?) or mid (?). I'll talk about (?) you might be less familiar (?) that are going to be useful in this case. That (?) recognition to give you a real (?) to highlight some of what I've been saying. It gives a case for how (?) works. It can help you overcome the (?) your birthday and (?). Finally, (?) Amanda (?) will take (????) a quick poll. Just almost a diagnostic question, just to find out what amount of data folks in the audience are dealing with. So, go ahead and just try to respond. How much data do you have under your management? This doesn't have to be a very, very accurate measurement. Just estimate it. It's okay to say that you have absolutely no idea how to estimate it. Go ahead and put in your answers. We'll give it another (?) or so. Let's see here.
Great. Okay. So, it seems that you guys are looking at the results right now. We're actually pretty evenly split across. There's a few folks here that are managing a huge amount of data, over 10 terabytes. I'd love to talk to some of you about the solutions you're using to handle that much data. I know there's a lot of folks at the other end of the spectrum who don't have a lot of data at all today but maybe are going to invest in the technologies and products that are going to generate more. And, of course, just as we suspected, there's a lot of folks that don't know how much data exists in their infrastructure. That's especially common when you lack transparency into everything that you're running, and it's difficult to measure exactly how much data you're dealing with.
Great. So, let's sort of dig in right at the beginning. Big data is in some sense a pretty easy idea to explain. We produce data as a society all the time constantly. And there's a lot of it, so it's big. It's really as simple as that. There's a whole bunch of statistics that are floating around, and I kind of have a hobby of collecting some of the best ones that describe how big this problem is or how large the opportunity is as well.
Just a couple that I want to share with you real fast. Here's one I like that I saw in a tweet the other day. Data centers worldwide take up about 1.3% of global energy usage. That's as much as the entire continent of Australia put together, just for pushing bits around on the internet. There's another one that I really like, which is that between the beginning of time and, let's say, some time 25 or 30 years ago, we produced as much data as a species as we produce now every two days. That one's from Eric Schmidt. Now, there's a lot of people that will complain and say that these things are being measured improperly and you can quibble about the precise numbers, but the lesson here is that we are producing a huge amount of data all the time.
On the left hand side of this, there's just a few of the data sources that we know that our customers are excited about getting their hands on, plugging in together and building things with. But let's ask a little bit, what's going on on the right hand side, given that you have customer databases, analytics data from web applications, social media data from Twitter, Foursquare, Facebook, product reviews, search results, forums, all this wonderful stuff. What do you build with it? Well, it turns out, the answer to that is anything or rather everything. There are so many options that depend on how you mix this data together that the potential is pretty unlimited. I'll just talk about a couple of the examples. There are many, many more than you (?) on the slide.
Some of our customers say that they want to leverage a listening platform. They want to build systems that monitor their clients' brands. An example could be your client is an airline. And let's say an upset customer's tweet complaining about some bad service goes viral. We've all seen this, right. Hashtag fail. Imagine being able to spot this trend within seconds of it happening just because you detected an (?) number of tweets with negative sentiment about that airline localized within a given city. That's reactive PR at the speed of light and it can help nullify a crisis or amplify positive stories.
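As a rough illustration of the spike-spotting idea in the airline example, here is a minimal Python sketch. It assumes you already have per-minute counts of negative tweets for one brand in one city; the function name `is_spike`, the threshold factor `k`, and the sample counts are all hypothetical and are not part of any Infochimps product.

```python
from statistics import mean, pstdev

def is_spike(history, current, k=3.0, min_history=5):
    """Return True when `current` is unusually high compared with `history`.

    `history` is a list of recent per-interval counts (assumed input);
    `k` is how many standard deviations above the mean counts as a spike.
    """
    if len(history) < min_history:
        return False  # not enough data to judge yet
    baseline = mean(history)
    spread = pstdev(history)
    # Use a floor of 1.0 so a perfectly flat history does not make
    # every tiny increase look like a spike.
    threshold = baseline + k * max(spread, 1.0)
    return current > threshold

# Hypothetical per-minute counts of negative tweets about one airline in one city.
negative_counts = [2, 3, 1, 2, 4, 3, 2]

print(is_spike(negative_counts, 3))   # an ordinary minute
print(is_spike(negative_counts, 40))  # a sudden burst of complaints
```

A real listening platform would likely use a more robust baseline (time of day, seasonality), but the shape of the check is the same: compare the newest interval against recent history as each interval closes.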
Others of our clients want to connect their own clients' customer databases to what those customers are talking about online. This can lead to better segmentation and ultimately more sales and greater revenues. Imagine firing off an e-mail campaign, sending out diaper discounts to Twitter users who claimed that they have just had a baby. Some brands even know that there exist prominent users of social networks like Twitter and Facebook. Influencers. Folks who can organically and honestly spread the gospel about their brand to the rest of the web, if only they can be identified and reached.
The point is that our customers don't lack for ideas about what could be done here. The opportunities, as I said before, are endless and the ground is so poorly trod that it's easy to think of ways of combining social data feeds with existing customer data and other publicly available information to make wonderful things happen. But there's a big challenge called big data. Ingesting this data, combining it in smart ways to produce insight, and delivering that insight into your customers' hands in the way that they want it, in real time so they can act on it: this is hard. But why is it hard?
There was a good report on it a little while ago that outlined what they called the three V's of big data. I've got them listed here. Volume, velocity, and variety. Volume is the easiest one to explain. It's the most intuitive. We're just talking about a lot of information all at once. Existing data technologies, SQL and some of the other stuff that's in that ecosystem, just fall down when you ask them to solve problems that have terabytes of data as input. You need new technologies. And how do you build those kinds of systems to handle these scales?
The second V is velocity. A lot of the data sources on the slide a couple slides ago are actually producing data in real time. Twitter alone produces 6,000 tweets every second. Right there, there was another 6,000. How do you build systems that are able to keep up with a data flow that's that large and that quick?
The third V is variety. A couple slides ago I showed you just a small sample of the data sets that our customers are interested in leveraging. There are hundreds, thousands more places where you can go find fascinating data. How do you build a system that's capable of integrating all these disparate sources, with their different formats and their quirks, together into a single place so you can ask some important questions?
What do you need in order to be able to solve these problems, the three V's of big data? Well, you need staff to spend their time on it, and they've got to know what they're doing. One of the problems is that our agencies tell us that they don't necessarily have the experience. They don't want to hire additional staff, and they don't want to spend the time or resources. Big data is not a core competency for them, and they don't necessarily want it to become one.
So, that is a little unfair. A lot of our customers are starting to play with a tool called Hadoop. Hadoop is sort of at the leading front of big data technology adoption. It's a wonderful open source tool that's designed to process big data at a tremendous scale reliably and without having to use top-of-the-line hardware. Just to get a sense of the Hadoop adoption in the crowd, and I'm going to go into a little bit more what Hadoop is all about in a second, I thought we'd do another quick poll.
Just go ahead and let me know: have you evaluated Hadoop? Now, you can define "evaluated" here however you want to. It can mean you've read about Hadoop, you've got a small team working on (?), maybe you've even got something in production, however you choose to define it. So we'll just wait another 30 seconds or so. Let's just take a look at what's going on. I'm very curious to know the results.
Great. We're going to close the poll now. Let's take a look at some of those results. Okay. So there's a good fraction of us who have started to evaluate Hadoop and a good fraction that haven't. I hope by the end of the webinar that those of you that haven't started playing with Hadoop feel a little bit more confident about stepping out there. And I hope that those of you that have understand a little bit better how Hadoop fits into the larger ecosystem and what's possible in big data.
But for those of you that haven't started evaluating Hadoop, let me tell you a little bit about it. And this might be helpful even to those who have started. Despite how well known Hadoop is, even in the agency ecosystem, there's still often a lot of confusion about what it actually does and what problems it can solve. Hadoop is not a panacea for big data. It does not solve every problem you're going to encounter. In fact, it does not even solve two of the three V's of big data. Hadoop solves the first one, volume. It is the right tool to use when you've got a lot of data to process all at once. To help you understand this and to help put in context some of the other tools we're going to be showing you, I want to show you an illustrative and delicious analogy that maybe helps put it all together.
So, in addition to being kind of a data science type guy, I happen to run a sandwich shop. I call it the batch sub shop. We kind of do mostly catering. Every day we get large orders, about, let's say, a thousand sandwiches is pretty typical for us. We do pretty well. And we have these two sets of sandwich stations, right. There's the map sandwich stations and the reduce sandwich stations. The mappers, these guys kind of prepare all the ingredients for us. They slice up our bread. They chop up those veggies, and they cut the proteins, ready to go. They sort all that output, and they pass it on to the reduce sandwich stations. At the reducers, we go ahead and we assemble all those sandwiches together. We group them by sandwich type. And even though we're doing a lot of work here, this represents about 33 hours of total work, since we've got about 10 sandwich stations, we can produce a thousand sandwiches in just three hours. It's pretty impressive, and we do a good job at it.
So, those of you that have evaluated Hadoop, I hope you're kind of getting the joke here, right. Hadoop is to big data what a caterer is to this lunch order, right? When you've got a lot of work to get done, you don't need the results right away and you're willing to wait a few hours, it's the perfect choice, in the same way that catering is how you get a large lunch delivered. But this (?) in Hadoop of the map reduce stations, of course, don't correspond to making sandwiches, unfortunately. Instead, mapping prepares the "ingredients", right? Mapping is parsing. It's normalization. It's augmentation with additional data to set context. And then that data is sorted in Hadoop and it's passed to the reducers, where the final output of the calculation is produced, the so-called sandwiches. This is one of the reasons that Hadoop is really, really great at analyzing historical data. Historical data exists in large batches sitting in Hadoop clusters and we can analyze it all at once and we don't necessarily need the results right away. It's the perfect choice for that use case.
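As a rough illustration of the map, sort, and reduce steps described above, here is a minimal single-process sketch in Python. It is not Hadoop's API and does nothing in parallel; the function names and the sample orders are made up for the example. It only shows the shape of the computation: map normalizes each record into key/value pairs, the pairs are sorted by key, and reduce combines everything that shares a key.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(records):
    """Map: parse and normalize each raw record into (key, value) pairs."""
    for line in records:
        for word in line.lower().split():
            word = word.strip(".,!?")
            if word:
                yield (word, 1)

def reduce_phase(sorted_pairs):
    """Reduce: combine all values that share a key into one final result."""
    for key, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield (key, sum(value for _, value in group))

def run_batch(records):
    """Run the whole batch: map everything, sort by key, then reduce."""
    mapped = map_phase(records)
    shuffled = sorted(mapped, key=itemgetter(0))  # the sort between map and reduce
    return dict(reduce_phase(shuffled))

# Hypothetical batch of lunch orders, processed all at once.
orders = ["Turkey on rye", "Veggie on wheat", "turkey on wheat!"]
print(run_batch(orders))
```

In Hadoop the map and reduce steps run on many machines at once and the sort happens across the cluster, which is what lets the same pattern cope with terabytes rather than three lines of text.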
But there's another issue here. What about real time? What if I don't want a thousand sandwiches later on today for lunch? I just want one sandwich right now because I'm hungry. This is why I started a second sandwich shop. It's next door to the batch sub shop. They're kind of complementary, and we do quite well. It's a revolutionary business model in which folks come in and order sandwiches and then we make them right away and it (?) back. It works in a similar fashion. Orders stream in. Instead of letting them stack up and prepping them all at once and then making them all at the end, we just make each sandwich as it comes in. We slice the bread. We chop the veggies. We get that thing toasted, and I get that first sandwich in just a couple of minutes. And I can make as many sandwiches as I want as fast as they come in just by adding more workers in the middle. This is, in the big data world, an example of stream processing. Instead of taking a big chunk of historical data and processing it all at once to get a result and having to wait several hours, I let data stream into my processing framework and let it stream back out so I can get results as data comes in, in just a few seconds.
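The contrast with batch processing can be sketched with Python generators. This is only an illustration of the streaming idea described above, not the Infochimps data delivery service; the stage names and the sample orders are hypothetical. Each order flows through every stage as soon as it arrives, so the first result is available after reading a single input.

```python
def normalize(orders):
    """Stage 1: clean up each raw order as soon as it arrives."""
    for raw in orders:
        yield raw.strip().lower()

def assemble(orders):
    """Stage 2: turn each cleaned order into a finished result."""
    for order in orders:
        yield f"sandwich: {order}"

def pipeline(source):
    """Chain the stages; nothing runs until a result is requested."""
    return assemble(normalize(source))

def order_feed(log):
    """A stand-in for a live feed; records each order as it is read."""
    for raw in ["  Turkey on Rye ", "VEGGIE on wheat", "ham on white"]:
        log.append(raw)
        yield raw

consumed = []
results = pipeline(order_feed(consumed))

first = next(results)   # ask for just one result
print(first)            # the first finished sandwich
print(len(consumed))    # how many orders had to be read to produce it
```

Only one order has been read when the first result comes out, which is the point of the streaming sub shop: latency depends on one item, not on the size of the whole batch.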
Although batch processing is better known and still very, very useful, real time analytics, like our streaming sub shop, is actually the right tool for a lot of the data feeds that are relevant to the agency. Social media is not only high volume, it's high velocity, so you need to be able to react quickly, which means you need a streaming real-time computation environment.
But why isn't everybody doing this? Well, it's because it's hard, right? There's the three V's of big data and the challenge of having the right resources, the experience, and the time to get this done. At Infochimps, we're experts in the full stack of big data technologies. We want to take this webinar to help inform you guys of several factors that you should be evaluating as you start moving into the big data space.
First and foremost, you need the expertise to evaluate and choose the right tools and the right architecture to solve your big data problems. These choices need to provide you with a scalable infrastructure that can grow as your problem grows and as the amount of the data in your systems increases. It also needs to be flexible and adaptable. You don't want to lock yourself in to one fixed infrastructure that you find difficult to modify just because there's so much data. You need to be able to learn from your customers and improve your product as time goes on.
Finally, you also want to integrate with whatever technologies you're currently running. This could be traditional resources like SQL servers, web server farms, or BI technologies like Tableau or Pentaho, or even your own initial first foray into big data, like maybe a Hadoop cluster you've already got built. These systems already work and they're already creating good value for you. You don't need to abandon them. You want to integrate the new technology with the old.
At Infochimps, we have been solving these problems for quite a while. We've created a set of four major tools that work together in what we call the Infochimps platform. It's what we use to solve our own big data problems as well as problems for our clients. The platform consists of four major pieces. The first is the data delivery service. This is kind of like our streaming sub shop. We can ingest data at terabyte scale in real time. We can transport the data to wherever else in our infrastructure it needs to go, and we can transform, decorate, process or augment that data right on the fly, creating value-added results just a few seconds after they come off the wire.
Next is our database management service. The world of big data has a lot of choices for database technologies, everything from Cassandra to HBase to MongoDB to ElasticSearch, even traditional SQL has a place here. Each of these databases has its own particular strengths and weaknesses. Unlike the case of small data, in which you could just toss everything you want into some business cube and ask any question and get an answer in a reasonable amount of time, with big data, each database's strengths need to be leveraged and their weaknesses need to be mitigated. Don't lock yourself in to being able to only work with a single database, because your needs will change and grow. And don't lock yourself into working with vendors who demand that you only use a single database. Since your needs will change and since the technology here is so new and still changing itself, a federated approach to database management is the only scalable approach.
The third component in our platform is, of course, Hadoop. The traditional approach consists of building a big Hadoop cluster in the sky with a whole bunch of machines that are always on and awake 24 hours a day, seven days a week. That's a good use case for some people who have that amount of volume of data to process or that many different business units who can utilize the Hadoop cluster as a shared resource, but for many of our customers, that's not the case. That's why we offer elastic Hadoop. You can build a traditional Hadoop cluster with elastic Hadoop and have it up all the time if you choose to, but you also have the freedom to work with a different, more agile model. You can spin up the Hadoop cluster, run your jobs, and then put that cluster away. The cluster can be as big or as small as you need and you can do this over and over again, in fact, optimizing the cluster to the job you're running instead of having to modify your job for the cluster you have to execute it on. You get to have the powerful methods and abilities of Hadoop at your disposal while also seeing tremendous (?) because you don't have to run your cluster when you don't need it.
Finally, our platform has an analytics and monitoring layer. Whether it's seeing what your servers are up to, whether it's watching data stream through your infrastructure in real time, whether it's being able to write simple scripts that can automatically be turned into Hadoop jobs or dropped into the data delivery service, the analytics tools that we provide will let even non big data experts be productive in this new world. And for those of you who have been playing with Hadoop for a while and have already developed some expertise around this, don't worry. The toolset also lets you dig in at the lower level of the native APIs in Java as well as standard tools like Pig, Hive, and so on.
The advantage of working with the Infochimps platform is that we've already identified the major kinds of needs that you have in a typical big data solution. We've got long experience in this space, and we've built our framework with a set of compatible tools that fit together nicely.
I'd like to get into a specific use case that uses the platform and illustrates what it can do. But first, let's do another quick poll. I just want to know how many of us out there in the audience have actually started to hire big data talent or started to make an investment in building something. I ask because in a second I want to talk a little bit about how people and technologies fit together in an agency. So, I'm very curious. Have you hired a small team? Have you hired a little bit of a larger team or are you expanding? Maybe you haven't hired anybody yet. You may not plan on hiring. Or maybe you just don't know and you're still making up your mind. So go ahead and vote, guys, and we'll let the results come in over the next minute or so.
Just another ten to fifteen seconds, guys. Let's collect some more results and we'll move on.
All right. Let's close that poll. Let's take a look at some of the results. Very curious to find out. Again, pretty spread out. So we have some small teams. We've got some big teams and there seems to be, I would say, the majority of us maybe don't know our plans. Either we haven't decided or we haven't exactly executed our decisions yet. So there's, I think, a lot of room for some planning. And I hope that what you learn through the rest of the webinar helps you make some of these decisions.
The reason we ask about people and hiring plans is because we have an idea here about the structure of the modern digital agency. There's a few layers and these layers scale differently and they have different cost structures and different sets of requirements. People are the top layer. They are the most expensive and they are the least scalable part of your operation. But they also provide the most value. These are your analysts, your creatives, your account managers. This is your core competency. This is what differentiates your agency from any other and it's what your clients pay for. It's why they chose you. Well, your people are increasingly supported by reports, visualizations, the kinds of resources required to provide quantitative insights into the scope or the effectiveness of your campaigns. And these reports and visualizations are themselves supported by listening and content technologies that drive that value upward through the stack. It's a mistake to pull people from the top of this pyramid, where they produce the most value for you and truly differentiate your firm, and pull them into the lower tiers. Those lower tiers should really be supported by technology and operations. And that's really where Infochimps comes in. The tools that comprise our platform provide a grammar for building solutions that power your creatives and your account teams without bogging down your organization with the needless responsibility of having to manage new technologies that you don't understand.
Well, that's (?) into a specific example. This is a simplified story based on agency X, which is representative of one of the customers we're working with on the Infochimps platform. Agency X is excited about making their first inroads into leveraging big data. They want to create insights for their clients and, most importantly to them, they want to create repeatable new revenue streams and increase the net profit per client by leveraging the repeatability and scalability of technology to generate these reports and keep their people working where they're most valuable.
So, what do they want to do? They want to take a whole bunch of social media data from various feeds out there, Twitter, Facebook, RSS feeds, ads, print media, they want to ingest all that data. So we're using our data delivery service which, again, is the real time computational engine that accepts information and routes it through the system. We even do some real time analytics on the data that's coming in. We normalize it so that data from different sources looks the same. We file it away into a database. This database is chosen to be appropriate for this particular problem. It was important to agency X to be able to do complex queries on the database, to be able to do free text search across the aspects that are being pulled in, across those social media feeds by the data delivery service. They wanted to do geo queries. They needed time series. All that stuff is built into our choice of database. Then the agency's only responsibility now is to produce a beautiful front end that their clients use as a dashboard as well as reports that they can send off. This is really what agency X is good at. Their technical expertise revolves around wonderful front ends that are easy to use, that are (?) and that typically are built on a technology platform that is not traditionally big data compatible. What's nice about the solution that we've built for them is that Infochimps and the Infochimps platform handle the big data back end. Their system continues to be a simple traditional (?) that they go ahead and query all that data in, and all of a sudden they're doing big data without realizing how they got there.
I actually want to give you a real demo of a system that is like this. It's a simplified version of what we've built and I'll talk a little bit at the end about where it differs. But let's just take a quick look.
So, this is our front end app. This is not the front end app that agency X built. This is something that we built for the purposes of this webinar. It's a really, really simple search service. It's just ingesting Twitter data. Let's just do one search just to see what it does. I'm going to search for Apple products. What comes back? Okay, great. So, here's what the interface is like. This could not be any simpler. I just have a search term. I've got a bunch of tweets that are stacked up over here. Just a time series. This is natively produced by the database. All this application is doing is using HTML and JavaScript to render the contents returned by that database. You can see that the app functions in real time because the data is continuously streaming through the data delivery service as it's being produced from Twitter. This came in just a minute ago. So, this is just when you're showing the tweets here again. The point of this application is not to be a complex wonderful user (?) application that someone would pay for. It's meant to illustrate how simple it is to plug in a front end or an existing business intelligence tool into the Infochimps platform. The database (?) to be powerful. I can have some logic. I noticed there's some tweets about applesauce. That's not what I'm interested in, so I can ask for apple and not sauce and I can get back a different set of results. All right. The database is schema aware, so I can even dig in and ask questions about users' locations. I can combine stuff and ask if anybody is talking about Apple products in New York City, and we have a few people. You'll note that the number of tweets here is kind of modest. This is, again, not the full solution that we've built for our client. This is currently plugged in to the Twitter Spritzer. That's actually a great opportunity to dig in to what the back end is doing, and I'll talk about how we would scale it if we were to make this real.
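The "apple and not sauce" and location queries in the demo are answered by the database itself. Purely to illustrate the logic being asked for, here is a minimal plain-Python sketch over an in-memory list. It is not the database's query syntax, and the `matches` helper, the field names `text` and `location`, and the sample tweets are all assumptions made for the example.

```python
def matches(tweet, must=(), must_not=(), location=None):
    """Return True if the tweet contains every `must` term, none of the
    `must_not` terms, and (optionally) comes from the given location."""
    words = set(tweet["text"].lower().split())
    if not all(term in words for term in must):
        return False
    if any(term in words for term in must_not):
        return False
    if location is not None and tweet.get("location") != location:
        return False
    return True

# Hypothetical tweets with a text field and a user location field.
tweets = [
    {"text": "Loving my new apple laptop", "location": "New York City"},
    {"text": "Homemade apple sauce tonight", "location": "Austin"},
    {"text": "apple store line is huge", "location": "Austin"},
]

# "apple and not sauce"
hits = [t for t in tweets if matches(t, must=["apple"], must_not=["sauce"])]

# "apple" restricted to one city
nyc = [t for t in tweets if matches(t, must=["apple"], location="New York City")]

print(len(hits), len(nyc))
```

A search database evaluates the same kind of boolean and field conditions against an index rather than by scanning every record, which is why the front end in the demo can stay so simple.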
Going back to the architecture diagram that we've got, the data is coming in from Twitter. It's being moved into the data delivery service, which is then processing that data and moving it into the database on the back end. I want to show you what that process actually looks like, not schematically here in this slide but really (?) dashboard available in the Infochimps platform. So, this is what it actually looks like. It's almost the same as the conceptual picture. We've got the Twitter Spritzer, which is the source of the data. Tweets come out of here and they move through the data delivery service right into Elasticsearch. We don't have to think about how many machines this is running on. We don't have to sit here worrying about the hardware and software details. As the user, we just think about this conceptually. Data is produced from Twitter and moves into this database called Elasticsearch. Again, that's the database we use to power the complex search features that the front end application needed.
If we needed to scale the size of this problem, let's say, we wanted to take the system into production and we wanted it to move off of the Twitter Spritzer and start using, let's say, the Twitter Firehose or another source of data like (?) or other feeds that are really high in volume. What would we do? Traditionally we'd be in trouble because we're going to see a factor of about 100 if not more in scale between the amount of data produced by the Twitter Spritzer every second and the amount of data produced by the Twitter Firehose.
But the Infochimps platform obviates those kinds of problems for us. All we need to do is spin up more machines in the data delivery service to handle the additional (?), and we spin up more machines in our database layer. Everything that we do is horizontally scalable. This means that a solution built on our platform will grow with your company. It will grow with the amount of data that we produce as a society so that you don't have to (?) your front end just because you have more data involved in the system.
So, the next question becomes how do we extend this? Remember, with big data, one of the unfortunate things that can happen is that the systems become so complex that they become difficult to modify. You're spending all your time just trying to keep the systems up and running and you don't have time to get any real feature velocity and improve the product.
I'm going to put on my developer hat. I'm working for agency X. I want to walk you guys through what it's actually like to change what the back end is doing in the Infochimps platform. You'll notice that I've already prepped the front end. I've already talked to my front end guys and we've said, "Listen. Let me actually choose a search term that has a little bit more volume. How (?) for all retweets out there." So you'll see they're all colored blue over here. Blue is corresponding to uncategorized. We decided we would like to modify this service by adding sentiment analysis into the flow. Instead of having all our tweets coming in just raw, we want to label them as either positive tweets or negative tweets.
Conceptually, all we're saying is that we've got the data being sourced. We've got it being written into a database. Somewhere in the middle we want to stick a piece of logic. That piece of logic is something that we at agency X understand. We want to control it. We want to improve it and grow it over time based on customers' feedback. So, our mission is to come up with a sentiment analyzer that we can drop in the middle here to label the tweets that come out of the Spritzer as positive or negative in sentiment.
So putting on my developer hat again, I'm going to drop over to a command line. Those of you that aren't programmers and aren't super technical, don't worry. Everything we're about to do here should be pretty straightforward. Those of you that do have a little bit of technical background, I think, hopefully, you'll be able to follow along with the details of what I'm doing right here.
I'm a developer at agency X. I have been tasked with writing a sentiment analyzer that processes billions of tweets. I'm not a big data person, but watch how I go into it. I have written a script. It's a Ruby script because I'm just a Ruby guy. It's easy to do. I wrote it in just a few minutes. Those of you that have any experience in natural language processing are probably laughing at the simplicity of this script. That's really not the issue here. Again, we're not trying to produce a world class sentiment analyzer. We're just trying to show an example of how you would modify the system in place.
So, I'm testing it out. I'm the developer at agency X. I just wrote the script. I want to see if it works. Let me put some text through the system. Let me think of a sentence with positive sentiment. How about today is a wonderful today because it really is ladies and gentleman. Okay. Let's run this through the script that I just wrote. I happened to write in jruby. You can write it in anything you want. Ruby, Python, anything that's able to read standard in and write standard out. That's one of the important contracts that the infrastructure's platform provides you with.
So, you can see I ran my sentiment analyzer and it comes back that my sentence about today being a wonderful day is positive. Okay. So that's cool. It kind of works. Let's try it out on a sentence that obviously has negative sentiment. "Today is a terrible day." All right. Let's run it through there and see what happens. It should come back in just a second. Cool. So it's negative. All right. So I have something that seems to be working so far as a sentiment analyzer. It's my first draft. It's my first iteration.
What do I do? Well, it's really quite simple. I drop this into a (?) and I push this code into the Infochimps platform. What's going to happen is that the data delivery service will automatically pick up on that change. I've just pushed the code right now in the background. The Infochimps DDS will go through and just drop that sentiment analysis code right into the middle of that flow, exactly as we want to see. The tweets that come out of the Spritzer are going to get run through this sentiment analysis code that I wrote. I didn't test this in the context of the data delivery service. I didn't test this at scale. All I did was test it out on my command line. I could have done this on my laptop in an airplane without having access to my big data back end.
And if I go back to the application, you'll see that right away, as soon as I deployed that sentiment analysis into the flow, I'm starting to see positive and negative sentiment coming right through. And if we wait a minute or two, that will just keep going over time.
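A minimal sketch of the kind of script described in this part of the demo, written in Python rather than the Ruby used on screen. The actual demo script is not shown in the transcript, so the word lists, the function names `label_sentiment` and `label_stream`, and the tab-separated output format are all assumptions made for illustration. What it shares with the demo is the contract: read records line by line from an input stream and write labeled records to an output stream.

```python
import io
import sys

# Hypothetical word lists; the vocabulary of the real demo script is unknown.
POSITIVE = {"wonderful", "great", "good", "love", "happy"}
NEGATIVE = {"terrible", "awful", "bad", "hate", "sad"}


def label_sentiment(text):
    """Label text by counting known positive and negative words."""
    words = [w.strip(".,!?;:\"'").lower() for w in text.split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "uncategorized"


def label_stream(source, sink):
    """Read one record per line from source, write 'label<TAB>text' lines to sink."""
    for line in source:
        text = line.rstrip("\n")
        if text:
            sink.write(f"{label_sentiment(text)}\t{text}\n")


# Local test with the two sentences from the demo, using an in-memory stream.
# As a command line filter, the call would be label_stream(sys.stdin, sys.stdout).
demo_input = io.StringIO("Today is a wonderful day\nToday is a terrible day\n")
label_stream(demo_input, sys.stdout)
```

Because the logic only depends on a line-oriented input and output, it can be exercised on a laptop exactly as in the demo, without any access to the production back end.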
A question that you might ask right now is, well, hey, what about all this blue junk? What if I want to get that stuff sentiment analyzed? One of the cool things about the Infochimps platform is that I can take the same script that I wrote here on the command line, which I have only ever tested locally. I have shown how I can deploy it into the data delivery service to deal with continuous real time data as it comes in. I can actually now just spin up a new cluster, run that exact same script over all this historical data, and backfill sentiment into my historical records.
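The backfill idea can be sketched in a few lines of Python. This is an illustration only: the list of dictionaries stands in for the historical records, the `backfill` function and the stand-in `label_sentiment` are hypothetical, and nothing here reflects how the platform actually schedules the job on a cluster. The point is that the same labeling function used on the live stream is applied to records that were stored before it existed.

```python
def label_sentiment(text):
    """Stand-in labeler; any function from text to a label would do here."""
    lowered = text.lower()
    if "wonderful" in lowered:
        return "positive"
    if "terrible" in lowered:
        return "negative"
    return "uncategorized"


def backfill(records, labeler):
    """Label only the records that have no sentiment yet; return the number updated."""
    updated = 0
    for record in records:
        if record.get("sentiment") is None:
            record["sentiment"] = labeler(record["text"])
            updated += 1
    return updated


# Hypothetical historical records: two stored before the analyzer was deployed.
history = [
    {"text": "Today is a wonderful day", "sentiment": None},
    {"text": "Today is a terrible day", "sentiment": None},
    {"text": "Already labeled", "sentiment": "positive"},
]
print(backfill(history, label_sentiment), "records backfilled")
```

Records that already carry a label are left alone, so running the backfill twice does not change anything the second time.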
So, very, very, very easy for a non big data person to start playing in a big data world only using standard tools like Python, Ruby, or PHP, whatever you want. The stuff that you are already good at. We help you take it and make it real in a big data context.
So, that was a demo. (?) dig in a little bit and talk about the real system that we created for agency X. What I showed you today in the demo really was a very simplified version. In the real version, we're not just collecting tweets from Twitter and we're not using the Twitter Spritzer. We're actually taking data at scale from a variety of social media, print, and television data resources. We're moving it in tremendous volume through our data delivery service every second of every day, calculating derived quantities, transforming the data in the process, augmenting it with additional information like sentiment analysis, influence scores, all this kind of stuff. And ultimately dropping it into a database that supports ad hoc queries that can be run by researchers and analysts on the back end, dashboards of the kind that you saw us build that can be shown directly to the client, as well as reports that can be generated against the database.
I also want to (?) that what you saw was very much an example of a social media analytics or listening type service. But because the Infochimps platform is ultimately a set of tools that are designed to work together to solve big data problems, you can really use it to build anything that you see on this slide, or really anything else that is a big data solution.
This is important because solutions like the one I showed you are becoming more and more important. The world has changed. Big data is here, and your clients need to know about what their customers are doing. We can help you help your clients gain those insights, giving your agency a bigger seat at the CMO's table.
Now, agencies have been dabbling in big data 1.0 for quite a while. Black box services that provide limited abilities to do cross channel marketing or meager but passable sentiment analyses are becoming common throughout the industry. The next wave is going to be the mixing and matching of different data sources, public and private, at huge scale, and connecting them together to provide a context for your customers. This context provides true insight and turns what could be just pretty dashboards into really useful tools.
And our platform helps to build this context in an entirely new way. We aren't just a data supplier or an infrastructure hosting company or even a software shop. We're a strategic resource for the agency. The agencies who will win are making big data analysis one of their new and essential super powers and we're helping them get there.
So, this concludes our webinar. Thank you all for coming and attending and listening to me go on about big data for the last few minutes. Please do come visit our web site. I would love to talk to you, set up a free data consultation, just learn what kind of use cases you guys are dealing with, what you're engaged with, what problems you're having, and help you to understand the most effective way for you to realize your success. Thank you so much.
Amanda:
Dhruv, we have some questions that came in that I'd like to share. Jeff asked, what's a good use case where I'd have both a (?) SQL database and a SQL database in production? What's the benefit?
Dhruv:
That's actually a really, really common setup. There are a lot of folks building wonderful data analysis and (?) front ends right now in the big data world. I mentioned Data Mirror. They are one of our favorite partners in that space. You should really check out their product. But the truth is, most BI stacks are so focused on SQL, the experience of most analyst teams is so focused on SQL and SQL queries and tools that extend their power (?) the SQL world to wholly move to a completely big data solution. (?) different semantics can be a big challenge. So, one of the most common patterns is we'll take a big data resource and we'll move it into a big data database. So, there everything will be available. Everything will be scalable. And a very common pattern then becomes, well, why don't (?) ask (?) over the last day or over the last week to work the data? Or why don't we take a sample of the data and move that into a traditional SQL database? That allows our guys to continue to do the analysis in the way that they're used to. It lets us continue to use the BI tools that we're familiar with while still having the back end fundamentally scalable, as big as we need it to be.
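A minimal sketch of the pattern described in this answer, under loud assumptions: an in-memory SQLite database stands in for the traditional SQL database, a plain Python list stands in for the big data store, and the table layout and the function names `load_recent_into_sql` and `sentiment_counts` are hypothetical. It copies only the most recent window of records into SQL so that analysts can keep writing ordinary SQL queries against it.

```python
import sqlite3
from datetime import datetime, timedelta


def load_recent_into_sql(events, conn, now, window=timedelta(days=1)):
    """Copy events newer than now - window into a SQL table; return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tweets (created_at TEXT, sentiment TEXT, text TEXT)"
    )
    cutoff = now - window
    rows = [
        (e["created_at"].isoformat(), e["sentiment"], e["text"])
        for e in events
        if e["created_at"] >= cutoff
    ]
    conn.executemany("INSERT INTO tweets VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


def sentiment_counts(conn):
    """The kind of plain SQL query an analyst team might run on the sample."""
    query = "SELECT sentiment, COUNT(*) FROM tweets GROUP BY sentiment ORDER BY sentiment"
    return conn.execute(query).fetchall()


# Hypothetical data: two records inside the last day, one older record left behind.
now = datetime(2013, 1, 10, 12, 0, 0)
events = [
    {"created_at": now - timedelta(hours=2), "sentiment": "positive", "text": "a"},
    {"created_at": now - timedelta(hours=5), "sentiment": "negative", "text": "b"},
    {"created_at": now - timedelta(days=3), "sentiment": "positive", "text": "c"},
]
conn = sqlite3.connect(":memory:")
print(load_recent_into_sql(events, conn, now), "rows copied")
print(sentiment_counts(conn))
```

The full history stays in the scalable store; only the slice that the existing BI tools need is duplicated into SQL.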
Amanda:
Great. Thank you. William asked is there (?) interface, i.e. (?) database?
Dhruv:
So, you'll forgive me for being more of a Python person than an R person at the end of the day, and we won't have that fight here on the air. But I'm less (?) with the concept of a data frame. I'm assuming that what you're asking is, can we make R act like a native citizen in the Hadoop context or in the big data context? I'm here to tell you that you absolutely can. There are things that have been developed in the R community to help R talk to a Hadoop cluster. Those are just simple tools that (?) would be installed right into the platform. Also, using some of our tools, I did not talk about these in depth, but you can find them on our web site. We have a great tool called Wukong, which is sort of Ruby for Hadoop streaming. Wukong is capable of not only taking Ruby code and running it in a Hadoop context, it's capable of taking arbitrary code that's capable of reading standard in and writing standard out. That is to say, code that supports a streaming interface to data manipulation, and running that code both in the context of Hadoop as well as in the context of the data delivery service. So, long answer to a simple question: yes, if you want to use R, you can use it in the big data context on the Infochimps platform.
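The streaming interface mentioned in this answer can be illustrated with a short Python sketch. This is not Wukong's API or the data delivery service; `run_filter` and `UPPERCASE_FILTER` are hypothetical names, and the uppercase command is only a placeholder for a real R, Ruby, or other script. It shows why the contract is language neutral: the caller only needs a command that reads lines on standard in and writes lines on standard out.

```python
import subprocess
import sys


def run_filter(command, lines):
    """Send lines to an external command on stdin and return its stdout lines."""
    result = subprocess.run(
        command,
        input="".join(line + "\n" for line in lines),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()


# Placeholder filter: a one-line Python program that uppercases each record.
# Any executable honoring the same contract (an R script, for example) could
# be substituted without changing run_filter.
UPPERCASE_FILTER = [
    sys.executable,
    "-c",
    "import sys\nfor line in sys.stdin:\n    sys.stdout.write(line.upper())",
]

print(run_filter(UPPERCASE_FILTER, ["a tweet", "another tweet"]))
```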
Amanda:
Here's another question that came in. Do you have a multi tenant billing API so I can keep billing the (?) for multiple clients?
Dhruv:
That's an interesting question. We have not, ourselves, taken on that burden. Let's say you're using this platform to go out and build something that you have many, many clients using. We certainly can organize the data and the data flows for you in a way that keeps different clients' data totally separate. So if you wanted to, you could completely charge on a per client, per index, per project basis. We don't necessarily have a consolidated billing API for that.
Amanda:
So that leads us to the next question. How do you (?) our products?
Dhruv:
We typically think of working with clients in three major phases. The initial phase is, of course, consultation: figuring out exactly what the problem is, designing the solution together, taking advantage of our expertise in how to structure architecture for the problem you're trying to solve. That inevitably turns into an actual implementation phase in which we go out and we build that system, and we price appropriately, depending on the complexity of each of these two phases. Finally, there's, of course, the ongoing monthly support element. This is everything from licensing the technology that we're going to be using to make these systems run optimally and expertly, to the constant support that is required to maintain these systems so they don't experience down time, even to something just as simple as security and future upgrades. And those are sort of the three major ways in which we think about structuring deals with our clients.
Amanda:
A few more questions, Dhruv. How fast can you deliver a solution?
Dhruv:
How fast can we deliver a solution? I'm comfortable saying that a solution that is typical for us, which is to say that the complexities here are real time leading into databases with the ability to do Hadoop and the ability to plug in front ends, a sort of typical solution like that will be stood up by us in about six to eight weeks ready to be in production.
Amanda:
One more from the audience. Can you help listen to a set of e-mails or Twitter handles?
Dhruv:
Can we help listen to sets of e-mails or Twitter handles? Is this the idea that maybe there's a, I don't know, this is a bad example, but there's a baseball team and I want all the tweets of every member of the baseball team to analyze? That is not necessarily a data content service that we provide, but there are a huge number of partners that we work with. I'll mention just a couple (?) that make it very, very easy to put in queries like that. Things like, I want to listen to tweets from these kinds of people in this geographic region. We find it very easy to take the output that those provide and plug it directly into our data delivery service, at which point it can be routed into the rest of the system, ready for analysis.
Amanda:
Here's another one. Do you build the algorithm?
Dhruv:
Do we build the algorithm? I'm assuming here you're talking about the sentiment analysis algorithm, and whether we build those. Really, it depends on what is the best choice. We do not have a stable of every single possible algorithm you may want to run on every possible kind of data that may enter the system, and that's why many of our clients find it really nice that it's easy to develop algorithms on their own and use their own domain expertise to add value to the data that's entering their solution. With that said, though, there are a few common algorithms that many of our clients ask us for. I'm talking about things like network centrality finding, influence ranking, sentiment analysis, topic extraction, keyword analysis. These sorts of common asks that we've heard over and over from our clients are now sort of black boxes that we can deploy into a solution along with the back end technology and the data streams that we're familiar with.
Amanda:
Another one. Can you comment on analyzing city traffic?
Dhruv:
Can I comment on analyzing city traffic? It's hard. I once read some cool papers about phase transitions on highways. I don't know. Beyond getting into the specific details, which, by the way, whoever you are, I would love to talk about later, I think it's a question of what the data looks like. I mean, can you get access to really good information on city traffic? Where does it come from? Is this from cell phones that are in people's pockets? Is this a different source, like traffic cameras? Whatever it is, what does it look like? How do we touch it? How do we get it into the system? Once it's there, what kind of questions do you want to ask? I would love to talk to you about applying your domain knowledge and figuring out what you're interested in finding out, and help you think of how to use the tool set that we provide to get those questions answered.
Amanda:
Last question. Would I use (?) instead of Radian6?
Dhruv:
Yes and no. It depends on what you're looking for. Radian6 is an absolutely wonderful tool for social media listening, analysis, and monitoring. It's got a lot of features and it's a great place to start if what you're interested in is just trying out what listening looks like and playing with what's possible. With that said, it is a fixed solution. It is not extendable. It's not something that's going to scale to ingest your customer data to connect to what is available from social media streams. It is not something that you have control over the road map of. You cannot customize it for your needs. And in that sense, when you realize that you have outstripped the possibilities with Radian6 and you need to start thinking about ways to extend and build stuff that is ultimately better and richer and more connected, then I would say that your next option is to come and talk to us. Let us help you set up a realistic listening platform that you control but we operate, and help you do what Radian6 does but better and more quickly.
Amanda:
Okay. That wraps up our last question. We'd like to invite you to discuss your big data project and ideas with us. There's a link in the chat to register for a free big data consultation. We'll have a free one on one with Dhruv and Flip to talk about your project. Thank you so much for joining today.
Dhruv:
Thank you all.