Application Development: Getting Started with Big Data

Webinar (46:37)

Learn from Dhruv Bansal, Founder and CSO of Infochimps, in this interactive demo of the Big Data Starter Kit for Developers. Watch the story of discovery unfold in batch and stream processing as we use sample social data and write algorithms on the fly to find business insights – targeting, personalization, segmentation.



Amanda: People are still arriving. We have just a few more that we’re expecting. If we’ll hold on just a moment, we’ll get started here. We’ll give it one more minute for people to join. Thank you for your patience.

All right, good morning. We’ll go ahead and get started. We still have some people coming in. Just a few things of housekeeping that I’d like to get presented before we actually start the webcast.

Welcome, my name is Amanda McGuckin Hager. I’m the Director of Marketing here at Infochimps. We’re very excited about our webcast today, “Application Development: Getting Started with Big Data”.

Here at Infochimps we’re seeing a lot of use cases for big data applications. Today we’ll present to you just one.

Before we get into that, just a little bit of housekeeping. We are recording this webcast. Early next week we will e-mail you with a link to the video recording and we will also e-mail you a link to the slide deck. Look for that in your inbox.

Go ahead and ask questions. We want your participation in this webcast. Feel free to use the chat functionality in your GoToWebinar control panel. We will be monitoring this channel and we’ll answer your questions during the webcast.

We’re going to stop the webcast a little before the hour today, probably about 20 minutes ’til, and at that point we’ll do Q&A with Dhruv.

We have a Twitter hashtag in the bottom left corner of every slide. It’s #appdevbd. Feel free to chat about it. We hope to see some of you over there in the Twitter channel.

Without further ado, let me introduce to you Infochimps cofounder and Chief Science Officer, Dhruv Bansal.

Dhruv Bansal: Thank you, Amanda. Hi everybody. I’m eager to get started and talk to you about how you can go really, really fast developing applications, even big data applications, using Infochimps and some of the software and the services that we provide.

I’m going to start out by just jumping right into our product and our services. There’s a lot of great material on our website for folks that are a little bit less familiar with who we are or our history, what we do.

I’m going to position us right now as a set of cloud services that live in a variety of deployment environments. I’m going to talk about that in a second. We are making big data and big data transformations easier than ever for folks in the enterprise. We’re taking some cutting-edge technologies, stuff that you might have heard of, Hadoop, Storm, HBase, Cassandra, Elasticsearch, and we’re bringing them together in a suite of three major cloud services.

We’ve got our real-time cloud over here. We’ve got a batch cloud powered by Hadoop. We’ve got a variety of data management options. I’m not going to spend a lot of time during today’s conversation talking about the underlying tools here like Hadoop and Storm and the databases.

There’s other content for that online. You can come check out, learn a lot more about some of those powerful frameworks that we use.

Today I’m going to be focusing quite a lot more on one specific part of our stack. In fact, a part that’s open source and free for the community to use and start getting involved in. That’s called “Wukong”. Wukong is our big data framework for writing data transformations.

Wukong can be run in a variety of backend environments, including all the cloud services that we offer as part of our cloud services platform. Since it’s open source, Wukong can be used to control the flow of data through these clouds even outside of the Infochimps cloud service.

Just a quick diagram here that shows a little bit more structurally how data moves through systems like the ones that we build. We collect data. It’s entered into our delivery service, which is a real-time engine powered by Storm.

It’s routed into Hadoop, which can support batch operations and historical, long time trend analyses and it can also be stored in our various databases. Again, I’m going to focus today mostly on Wukong and how you model within Wukong operations and applications that are running on the rest of our infrastructure.

Keep in mind, you’ve got to work with us to take advantage of all the infrastructure that we provide but Wukong is an open source framework so you can start using that right away. Come to our website and we’ll provide a link here in just a second.

We take our cloud services and we deploy them in a variety of different environments. This is important to know because Wukong’s a big part of our flexibility as well. We take our real-time cloud and our ad hoc batch analytics clouds and we run them on a public cloud service, so Amazon, for a lot of folks that are comfortable there or are already deployed over there.

We also take our same stack and deploy it in two other locations. Virtual private cloud, this is a data center that we control, that we’re deployed in. Then in the private cloud, this is at our customer’s shop. We’re flexible about the cloud and you’ll see that with Wukong, we’re flexible about the algorithm that runs on top of that cloud.

That helps me transition into a topic that I want to introduce because it’s what we’ll really be working with for the rest of the presentation today. At Infochimps we use something called a “starter kit”.

A starter kit is a repository of code, so it’s actual code for programmers that’s written in an editor and run on your command line. It defines a big data application. What’s powerful about starter kits at Infochimps is that, because of Wukong, they’re able to run locally and they’re also able to run in our cloud services themselves. That means you can have a pretty radical, fast and agile development workflow for your applications.

You use Wukong locally to develop simple transformations, for example, parse this data, add this field, contextualize with this variable, ping this web service and add this other piece of information over here.

The simple kinds of extraction, transformation and load operations, the ETL process that every data environment has to go through when it’s ingesting information, you can program very simply and very easily in Wukong, and you can try them out right on your command line locally using a small amount of data as a sample to hone your algorithm.

Wukong then comes with a suite of tools that will let you take your local code that you’ve been running locally, and you’ll see me running a lot of this Wukong code locally here in a few minutes.

You take that local code and then run it in a Hadoop context, which means you can lift your algorithm from working on a few hundred rows or a few thousand rows locally into operating on millions or billions of records within a Hadoop environment, maybe the Hadoop environment that you’re subscribed to from the Infochimps cloud services.

You can also take that same algorithm and run it in a real-time context inside of wu-storm. So this is processing thousands or tens of thousands or hundreds of thousands of events per second (if you have that much throughput) using Storm, a cutting-edge stream processing framework that was originally released by Twitter but has ended up being popularized and used all over the place.

We also have a lot of orchestration language and functionality built in through wu-deploy, which takes all this functionality, the objects that Wukong defines, and glues them together into a cohesive whole.

If you’ve ever used Rails for web application development, you can think of the starter kit and the deploy pack as sort of a Rails for big data application development, and you’re going to see that here in just a few minutes as we get into the code.

I want to focus on an application which is synonymous with a use case at Infochimps. Every use case that our customers have, we ask them to write a deploy pack or a starter kit to solve that use case using Wukong, using the code and modeling the problem. I’m going to propose a use case for us to solve today together.

It’s a simplified version but a reasonably complex problem that we often see in customers that we’re speaking to. We simplified it just to make it easier to follow. The data isn’t as messy, perhaps, as real world data and there’s fewer fields to deal with but the basic operations are going to be the same so you’re going to be able to see how you think about modeling a complex scenario like this using the simple toolset that we provide.

Here’s the use case. I’m a retailer, a “bricks and clicks” kind of shop, an online retailer. I’m selling something. I’ve got a website with these conversion pages that I want to drive traffic to.

I’ve released a new product and specifically I want to drive traffic to this new product’s conversion page so there’s some criteria that I care about and I’ve placed a bunch of ads in two test markets: New York and Chicago.

I have these two test markets that I’m trying to cover and introduce a new product out there and I’m getting a lot of traffic to it. I have all these questions suddenly. What city are these visitors coming from? Which cities better respond to my ad campaign?

Which city has web visitors that are actually willing to buy, based on their activities, relating the impressions they receive from the ads that I’m purchasing to the actual web behavior that I’m seeing on my web properties? There are a lot of questions, and there are services that answer all of these.

There are third-party services, there are SaaS applications and there are, of course, custom things that a lot of folks will build. But what we’re seeing with our customers is that the folks who are seeing truly large amounts of, let’s say, impression data or truly large amounts of log data coming in from web applications are a different story.

These folks really have started to move to using custom-built big data solutions, because they’re finding that if they can put everything they know about their customers’ browsing behavior, and about the advertising resources that were brought to bear on those customers, all in one place, there are a lot of amazing questions they can answer.

That’s really why we picked this example because it’s so representative of what some of our customers are doing.

Let me actually get into the code at this point. Now, I’m going to pause here and just set the expectation that going forward, everything that you’re going to be seeing in this webinar is going to be pretty technical.

We’re going to be right on a command line and we’re going to be running some, we’re going to be looking at code. We’re going to be writing code. We’re going to be running jobs with real input data. So caveat emptor, here we go.

Let me pause and let’s go to GitHub. I mentioned a starter kit is a blob of code that defines an application. If you’re familiar with Rails, the way that Rails thinks about structuring an application and the way that Rails thinks about convention over configuration, then hopefully this outline is pretty familiar to you.

I hope that the font there is big enough on the screen. I’m getting some nods. Maybe I can upsize it a couple of times, even make that just a little bit easier to see.

Just like a Rails application, we’ve got some configuration directories, we’ve got some gem files. We’ve got a whole bunch of stuff. This is a Ruby based application templating framework.

My whole app lives in this app directory over here. I’ve got a configuration directory, which I’m going to use to store all sorts of information about the local and remote environments in which my app runs. I’ve got a data directory.

This is going to be important later. This is where I put sample data that I want to use to run and test and develop my application with. Then, of course, I’ve got unit tests and all sorts of stuff that I can run over here to make sure that my app is safe and easy to deploy and develop on.
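
Pulling together what’s on screen, the starter kit layout looks roughly like this (a sketch based on the description above; the exact directory names may differ from the real repository):

```
app/        models, processors, flows and jobs: the application itself
config/     settings for the local and remote environments the app runs in
data/       small sample data for local development and testing
spec/       unit tests
Gemfile     Ruby dependencies, just like a Rails app
```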

If I go into the application directory, there’s a few basic kinds of objects that Wukong defines and that this starter kit implements. Now, all these objects are going to be customized for solving the use case I’m interested in. I’m going to go back to my command line now.

I’m just sitting on one of my laptops and I’m looking at some of the files that are sitting in this directory. You’ll see that everything here is exactly what I saw over there on GitHub. I just have a copy of that repository.

Again, my starter kit is a blob of code. I’m working with that blob of code right here locally. Wukong comes with a command that I can run to give me a quick summary of what my application looks like so I’m going to go ahead and run that command and let’s take a look at some of the output. That’ll get us oriented.

I want to go through each of these categories just briefly. You can see I’ve mostly already written this application. I’m going to go through each of these categories, just briefly talk about what they are, what they represent, why Wukong thinks this way and then we’ll run some of them and we’ll see some real output, some real data and we’ll maybe even look at a visualization or two.

Starting at the top, I want to start with models. Models are the things we are talking about. These are the domain objects that we care about when we’re engaged in solving a problem and building an application. Maybe I can make the font here a little bit bigger, one more, maybe two more. That’s good.

I hope that’s pretty readable for everybody. As I mentioned this is an advertising and web conversion kind of use case that we’re interested in solving so I’ve got a bunch of model events that are pretty representative of that kind of data.

I’ve got a log event. This is related to stuff like an IP, a user or a time stamp, a verb [inaudible 17:33]. Very clearly representative of the kind of data you might see in a weblog. We’ll be looking at some weblogs here in a second. I’ve got conversions and visits.

These are sort of higher level, closer-to-the-business kinds of objects. This isn’t a log event; this is a conversion or a visit by a user to a particular product page. Then I’ve got an impression. This is when a user was served an ad, when that person receives an ad.

Models are like the nouns in my big data application. These are the things that I’m talking about. They’re like the objects that I care about. Processors are the things which change or modify or generate new nouns. They’re almost like the verbs if you want to think about it in terms of that metaphor.

Each one of these processors does something very specific, and it does that one task and does it well. The log parser’s job is to parse a line of log data and turn it into one of these log events up here. The identify-traffic processor is supposed to take a log event and figure out whether it was a conversion or a visit.

Each of these units of work is pretty simple in and of itself, and that means it’s easy for us to scale when this code starts running in a distributed environment. Right now that scale isn’t so important. We’re programming locally. We can do a lot of things and most of them will work, but not every programming pattern is going to survive when you throw terabytes of throughput at it.

This model of having unique verbs that can be chained together (you can see later we’re chaining them together) lets us have great modularity when we build these data flows, and it also lets us independently scale these components out when we’re doing this in a high-throughput scenario.
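
To make that concrete, here is a rough sketch of one of these verbs in plain Ruby. The class name, the log format and the field names are all invented for illustration; this mimics Wukong’s process-and-yield style rather than reproducing the actual starter kit code:

```ruby
require 'json'

# A hypothetical processor in the Wukong style: one small unit of work
# that consumes a raw record and yields zero or more structured records.
class LogParser
  # A simplified common-log-format pattern: ip, timestamp, verb, path.
  LINE_RE = /\A(?<ip>\S+) \S+ \S+ \[(?<timestamp>[^\]]+)\] "(?<verb>\S+) (?<path>\S+)/

  # Parse a raw log line; yield a structured event, or nothing on bad input.
  def process(line)
    match = LINE_RE.match(line)
    return unless match # a real processor might notify about bad records here
    yield({ 'ip' => match[:ip], 'timestamp' => match[:timestamp],
            'verb' => match[:verb], 'path' => match[:path] })
  end
end

parser = LogParser.new
parser.process('10.0.0.1 - - [12/Mar/2013:10:00:00 -0600] "GET /products/widget HTTP/1.1" 200 512') do |event|
  puts JSON.generate(event)
end
```

Because each processor only ever sees one record at a time, the same class can run against a local file today and a distributed stream tomorrow.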

Moving on, if models are nouns and processors are our verbs, then data flows are like sentences. Data flows are saying, “Well, I need to take a particular kind of data, namely data on impressions, I need to take it through the following stages of a data flow.

I need to parse it from JSON, turn it into a particular kind of model object, geolocate those records and write them back into a JSON format for further downstream consumption."

This is kind of a collection of these processors that we put together. Then there’s a job down here; we’ll get to that last, at the end, when we want to ask a more advanced question.

A job is a combination of data flows put together. It’s kind of like a paragraph: it’s conveying a whole bunch of meaning, more than just one sentence could.

This is the general programming paradigm. We’re using models to describe and encapsulate our data. Processors act on models to create change. Data flows combine processors together to organize that into a process and jobs let us combine data flows to be able to produce complicated insight and actually do maps and reduces and that kind of complexity.

I’m going to take a look at some of the input data here. We’re going to run through something and let’s look at some of the actual code as well. Let’s start at the beginning.

Let’s take a look at some of the data that I’ve got sitting right here on disk. I keep all my data in this directory, the data directory right here in my local repository.

This is all about rapid application development so I don’t want to start querying for data from external databases. I don’t want to be dependent on a Hadoop installation to make progress.

I don’t want there to be Storm anywhere on my laptop. I just want to be able to work locally. I want to be able to do this on an airplane if I have to.

Let’s take a look at some of the input data that I’ve got here. Let’s look at some of the, let’s say, log data that comes in, so let’s take a look at some of the traffic. There we go. These are just the kind of log lines that anybody who’s ever looked at a server log understands.

These are the kind an Apache or similar web server might produce. There are a few different kinds of events. You can see we’re getting data, we’re getting hits to a bunch of our product pages over here. All these different products are getting traffic, and most of the time we’re getting GET events.

In our made-up example today, these GET requests represent someone looking at the product. These POST requests represent someone making a purchase, converting on that product. It’s a very simplistic model: there are only two types of web traffic that we’re seeing.

That’s not true in a real-world example, but for the purposes of today’s conversation on how to talk about big data processing, we’re keeping it simple. We have a GET, which is a visit, and we’ve got a POST, which is a purchase.

I need to get this data parsed. I need to get this structured and I actually want to do some geolocation of these objects. These are just random web events that are coming in. I want to localize these to users, either using their mobile IPs or cell towers or social data or whatever information I can get about them.

That’s going to be one of my business rules. That’s something that my application is doing. In fact, you can see I’ve got this geolocate processor that I [inaudible 22:40] over here, whose [inaudible 22:42] is figuring out how to geolocate my records.

Let’s take a look at running some of this data through some of our Wukong data flows, and we’re going to run it locally. We don’t need to run this at scale or in Hadoop or anything like that.

We’re just going to run this code locally and we’re going to run it through, let’s say for example, the log parser.

Let’s see what comes out the other side of this. The log parser is a processor that I’ve defined. There we go. So the log parser runs through, and each line goes through that process and gets parsed into a log event.

Let’s write that out as, maybe, JSON, and we can go ahead and pretty-print some of the output, and we’ll see what comes out the other side. There we go. We’re parsing that log output and it’s no longer a single line. We’ve made it into a structured record.

This is an example of doing a little bit of ETL. This is just us taking that raw log line using something like a regular expression to turn that string log line into a structured record over here.

I can take a look at a more complex object, which is - excuse me, I wanted to look at the traffic first. Let’s take a look at this more complicated flow over here. I hope that’s readable. This is the domain specific language that is built into Wukong for describing data flows.

These are the sentences that we’re building out of smaller verbs that are part of our starter kit over here. I’ve got a log parser; I can show you what that log parser looks like. It’s in my processors directory. It’s a very simple log parser. I’m trying to parse each line using a model object.

If it parses, that’s great. I’m going to yield that parsed record out again. If it doesn’t parse, then I’m going to notify that I received a bad log event. There’s some basic notification built into Wukong as well, so that I can notify an out-of-band system about a range of events that are occurring.

Meanwhile, I can go get some of these other processors. Geolocate is going to take a record and use its IP address to geolocate it. Identify traffic is going to look at inbound events and decide whether they are conversion events (these could be purchase events) or just plain visits, and to_json, of course, just writes the record out as JSON at the end.

The idea is that I don’t have to understand how identify traffic is built inside of itself. I don’t need to understand the details of my geolocation code. All I need to do is be able to compose these different kinds of processors, or verbs, together into the flow that I care about.
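
The composition idea can be sketched with toy stand-ins for the verbs. Everything here is invented for illustration (the geolocation rule, the field names, the hand-rolled chaining); a real Wukong dataflow uses its own DSL rather than lambdas:

```ruby
require 'json'

# Toy stand-ins for processors: each is a callable mapping a record to a record.
parse_json       = ->(line)  { JSON.parse(line) }
# Hypothetical geolocation rule: decide the city from a static IP prefix.
geolocate        = ->(event) { event.merge('city' => event['ip'].start_with?('10.') ? 'Chicago' : 'New York') }
# Classify traffic: in our simplified model, GET is a visit, POST a conversion.
identify_traffic = ->(event) { event.merge('type' => event['verb'] == 'POST' ? 'conversion' : 'visit') }
to_json          = ->(event) { JSON.generate(event) }

# A "sentence": chain the verbs into a single flow, like a Wukong dataflow.
traffic_flow = [parse_json, geolocate, identify_traffic, to_json]
run_flow = ->(record) { traffic_flow.reduce(record) { |rec, step| step.call(rec) } }

puts run_flow.call('{"ip":"10.2.3.4","verb":"POST","path":"/products/widget"}')
```

Swapping identify_traffic out of the chain gives you the impressions flow, which is exactly the reuse described next.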

Wukong comes with a variety of these processors, kind of a processor library that’s suitable for building up simple flows, and, as you saw with the log parser, you can always write your own.

Let’s go ahead and run this through the actual full traffic flow and let’s go ahead and take a look at what some of the output looks like. I’m going to run it through not just a log parser but my full traffic flow. Let’s see what comes out the other end.

There we go. I get nice geolocated visits and conversions that come out, located to various places, and they’ve got a nice little time stamp and they’re matched to the product. So this is a way to do ETL.

I’m using, in fact, external data on latitudes and longitudes and I’m folding that right into the internal data that I’m getting from my production web servers. In fact, I can do this for the impression data as well. I’ve got a bunch of impression data that I’ll get from, let’s say, the ad networks.

Let’s take a look at what some of this data looks like. This is passing in cookie IDs of users that are coming in. It’s passing me campaigns and creative IDs. It happily also has an IP address. If I look at the flow that I’ve got for impressions, I can compare that to the flow that I was using for traffic.

They’re very similar because I get to reuse some of the verbs in [inaudible 26:46]. I get to use this geolocator again because I’ve already defined it once. It’s being used here for my web traffic, and it’s also being used over here for my impressions data.

Other parts of this flow are going to be unique. For example, traffic needs to be identified, whereas the impressions are always just impressions. I’m taking different verbs.

I’m writing two different sentences and I can have two different flows of data coming through the system. I can, for example, go ahead and parse this as well and do impression [inaudible 27:18] and get those geolocated.

Once I’m done writing this code, which by the way only takes a few minutes (these are very simple scripts; it’s all Ruby-based, all interactive and interpreted, so there’s no compilation step), I just go ahead, look at some of my changes and commit this code.

Git commit, [inaudible 27:38] and some new changes, and then I can just push this code out. Having done so, I can actually go ahead and run this code at scale in a real backend big data solution. This is the same code running behind the scenes.

If I go back to my slide deck here just a second. Let’s go back to our architecture slide here. I took the Wukong code that I wrote in front of you guys for doing that live ETL process and I decided that I would deploy it over here in the data delivery service, so in real time. Now, I could have made the decision to run it in Hadoop.

Let’s say I wanted to process a whole day’s worth of impression inputs at once. That might have been an option, and that would be a really interesting way to move back and forth between a real-time usage and a batch usage of the same code, with the geolocator and the other processors that I was using.

In this case, I’m actually running this code behind the scenes right in the data delivery service, in real time, the same code we just saw on the command line, and I’ve got Tableau hooked up to the output of that. The data is being written into a SQL database and Tableau is sitting on top of that database.

I can take a look and just go ahead and refresh some of this data and see what happens. These are the two test markets that we talked about that we defined. Let’s go ahead and refresh that again.

There we go. We’ve got impression data coming in, and a variety of other data. This changes in real time as the data comes in. You can see there are new events being added to the screen as I continually refresh. This is a real-time flow. We’ve taken our code and deployed it in that real-time context.

We’re able to answer some basic questions about our use case. Where are these impressions coming from? How many of them are converting? Where’s this distributed over space and time in these two test markets that we’re trying to analyze?

If I pause here: part of the use case that I defined was the idea of attribution. (I guess my push didn’t go through.) Let me show you the attribution process that I’ve created over here.

Attribution is a job that I’ve defined. It’s more complex than just a single data flow. I’m taking two different data flows and I’m running them as a mapper and reducer.

Anybody who’s done any kind of Hadoop programming should recognize a lot of what’s happening here. I’ve got impressions data coming in; that’s the ad that was served to the user. I’ve got visit data: which users came to my website.

If I look at some of my actual web logs that I’ve got sitting right here, if I look at the actual traffic, you’ll see that I’ve got cookie IDs all over the place. The traffic that comes from one of these ads has a cookie ID in it.

I can use that cookie ID to match against the impressions that I’m also getting from my ad partner, to see whether this particular person actually converted, where they saw this ad and what kind of person they are. That helps me understand the conversion funnel for the website and the products that I’m trying to sell in these test markets.
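
The cookie-matching step itself can be sketched in a few lines of Ruby (the records and field names here are made up for illustration, not taken from the starter kit’s data):

```ruby
# Hypothetical attribution join: match each conversion to the impression
# that shares the same cookie ID.
impressions = [
  { 'cookie' => 'abc123', 'campaign' => 'spring_launch', 'city' => 'Chicago'  },
  { 'cookie' => 'def456', 'campaign' => 'spring_launch', 'city' => 'New York' }
]
conversions = [
  { 'cookie' => 'abc123', 'product' => 'widget' }
]

# Index impressions by cookie, then look each conversion up.
by_cookie  = impressions.group_by { |imp| imp['cookie'] }
attributed = conversions.map do |conv|
  imp = by_cookie.fetch(conv['cookie'], []).first
  conv.merge('campaign' => imp && imp['campaign'], 'city' => imp && imp['city'])
end

attributed.each { |rec| puts rec.inspect }
```

An in-memory index like this only works while the data fits on one machine; the map-sort-reduce version described next is how the same join survives at Hadoop scale.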

That’s a classic problem that you would solve in a batch way. There are streaming solutions to solving the attribution problem and Infochimps’ technology does support some pretty interesting and clever things that you can do in order to be able to do real-time attribution.

The most common approach to attribution is to do it in a batch way. Say, for example, at the end of the night or at the end of the hour, look at all the most recent conversions and see if any of them had impressions that came in previously.

We’re going to kind of follow suit with that and go ahead and show the classic batch-oriented way to do that kind of analysis, and what you’ll see is that I’m just going to write a Hadoop job.

Now a Hadoop job for me, because I’m working in the starter kit, I’m starting local, I don’t need to have Hadoop installed. I don’t even initially think of it as a Hadoop job per se. I just think about it as a map reduce job.

I’m working locally with a mapper and a reducer, which are two different data flows that I have defined, and I’m going to glue them together with a sort, just like Hadoop does, and we’re going to be able to model that entire Hadoop job locally. Then we’re going to be able to run it in a production setting.

I can go into my app jobs directory and take a look at this attribution job, and you’ll see the way that it’s put together: I’m defining a mapper in simple Ruby code. For the sorting, I’m going to use the cookie to sort the records, because the cookie is my sort key.

That’s what identifies a unique user in my data set, so that’s the first field that I emit, and the records get sorted on it. I have a reducer over here that is, again, just a processor that I’ve defined, so I’m using the same terminology, the same approach, the same code, the same objects, the same code base.
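
The shape of the job can be modeled locally as a map, sort, reduce pipeline over in-memory records (the record layout is invented for illustration; in the starter kit, the mapper and reducer are Wukong processors):

```ruby
# Model the Hadoop-streaming shape locally: map each record to a
# (cookie, tag, payload) tuple, sort by cookie, then reduce over runs
# of records that share a cookie.
records = [
  ['impression', 'abc123', 'spring_launch'],
  ['visit',      'abc123', '/products/widget'],
  ['visit',      'zzz999', '/products/other']
]

# Mapper: emit the cookie first so the sort groups each user together.
mapped = records.map { |type, cookie, payload| [cookie, type, payload] }

# The sort phase, standing in for Hadoop's shuffle-and-sort.
sorted = mapped.sort

# Reducer: within each cookie's group, a visit that co-occurs with an
# impression counts as an attributed visit.
attributed = sorted.group_by(&:first).select do |cookie, group|
  types = group.map { |_, type, _| type }
  types.include?('impression') && types.include?('visit')
end.keys

puts attributed.inspect # cookies whose visits are attributable to an ad
```

The point of emitting the cookie first is exactly what the sort key buys you: after the shuffle, everything one user did sits in one contiguous run, so the reducer never needs the whole data set in memory.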

It’s all in one place. I’m not suddenly switching context between the real-time system that I use for ETL and ingestion of data and the batch system that I might use for calculating something like an attribution analysis.

Then finally, my reducer’s pretty simple. I just go ahead and run that extra [inaudible 33:11] and you can see that up here when I get a summary from [wu-show] on the way that my job is put together.

I can actually go ahead and run this on all my clean data that I’ve parsed. This is me working locally; I haven’t gone to Hadoop or anything like that. I’m going to use a new tool called “wu-hadoop”, but I’m going to say, “Please start in local mode. I want to see what this looks like,” and I’m going to run a particular [inaudible 33:33] in my attribution job.

Wu-hadoop is going to run and go through all the impressions and the traffic data, and it’s going to launch a local pipeline modeling what would happen if I were to do this in Hadoop.

There we go. It’s going to discover all the events which actually had a conversion from an impression into a real purchase.

I can run that exact same command in Hadoop as well. Here we are on my Hadoop cluster. I’m just logged in. It’s the same code base, the exact same command; I’m just telling it to run in Hadoop. Oops, I forgot the flag to remove the output if it already exists.

It’s running in Hadoop mode over here and it’s going to take that exact same code, my attribution job that I defined, and run it right on top of Hadoop. It’s no longer a local job. It’s now running in Hadoop, which means I can run it at a tremendous scale. I can process millions of events, or billions of events if I’ve got them stacked up.

Just a brief summary before we go back to the slides and try to close. In the last 20 minutes or so, we walked you through how you can quickly use a starter kit for working with big data. It’s a system that lets you prototype your application by writing simple processors, models and other components locally.

It lets you kind of build together an application very, very quickly, test it locally, write unit tests and work on local data. Then, when you are ready to run this at scale, that same code carries over because of the connectors that we provide, like wu-hadoop, wu-storm, etc. I’m going to go back to that slide.

That same code helps you run all those algorithms that you defined locally but in the context of, let’s say, Hadoop for batch processing or Storm for real-time processing.

I think that takes us through most of what we wanted to talk about today in relation to [inaudible 35:39] off our starter kit. It’s a free way for anybody to get started with Wukong and some of Infochimps’ technologies.

We encourage you to come check out our community page and apply to get a free download of the big data starter kit that you just saw me using, as well as some others.

Amanda: Thank you, Dhruv. At this time, we’ll open it up for Q&A. If any of you have questions, feel free to use the chat functionality of your GoToWebinar control panel and we will ask the questions of Dhruv.

I do have a couple questions here, Dhruv, that came in through the webcast. Travis asks, “How do I leverage Wukong to use languages other than Ruby?”

Dhruv: OK, that’s a pretty good question. You saw actually earlier, when I was launching the Hadoop job, perhaps those of you who were kind of tracking it here, I can run this command again. Let me show you the actual command that was run behind the scenes.

As far as Wukong’s concerned, wu-hadoop launches a particular process called a “mapper” and a particular process called a “reducer”, and those of you who are familiar with Hadoop will see that what’s really going on behind the scenes here is Hadoop streaming.

Hadoop streaming really just uses any Unix process that can read from standard in and write to standard out, and if you saw, that’s all local mode was ever doing as well.

Really, my short answer to your question is this: if you can write a piece of code in Python, C, Java, R (which is another common use case here), or whatever language you’re working in, and that code can read from standard in and write to standard out, then you can drive that code just as I’m showing you here with Wukong. You can specify, “Please use this as a mapper. Please use this as a reducer.” It is not a first-class experience; I’ll talk about that in a second.

It’s not a first-class experience; not everything you can do with native Ruby code in Wukong can you do with this mechanism, but it is possible.
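The stdin/stdout contract described here is simple enough to sketch. This is a minimal illustration in Ruby, but the same shape works in Python, C, or R: read lines from standard in, emit tab-separated key/value lines to standard out. The `ip<TAB>event` input format is an assumption, not the starter kit’s actual log layout.

```ruby
require "stringio"

# Hadoop-streaming-style mapper: one record per input line, one
# "key<TAB>value" pair per output line. Here we emit a count of 1
# per IP address, the classic word-count shape.
def map_stream(input, output)
  input.each_line do |line|
    ip, _event = line.chomp.split("\t", 2)
    output.puts "#{ip}\t1"
  end
end

input  = StringIO.new("10.0.0.1\tclick\n10.0.0.2\tview\n")
output = StringIO.new
map_stream(input, output)
output.string  # => "10.0.0.1\t1\n10.0.0.2\t1\n"
```

In production the same code would read `$stdin` and write `$stdout`; Hadoop streaming neither knows nor cares what language produced the process.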

In regards to a better experience: the beta build of our cloud services that we’re currently working on gives you the ability to use Java natively. The two languages that will be supported in our next release are Java and Ruby, both in a first-class way, so that you have all the same capabilities across both.

So, yes, outside of Ruby and Java you’re going to have to use the standard in/standard out kind of approach. I hope that answers the question.

Amanda: Joshua is asking . . . sorry, Joshua, let me get to your question here. “Can you talk about using any other data sets besides log files, like tapping into a data warehouse?” Yes, what that might look like.

Dhruv: Yes, we chose log examples or log data for this particular demonstration just because it’s really, really generic. Everybody seems to have log data of some kind. It’s hopefully pretty accessible, at least conceptually.

For our customers, we have a variety of different methods that get data of different kinds into our palette of services. We accept data over HTTP. We accept data over syslog and other logging-oriented protocols. We accept data via batch uploads.

For example, FTP or some other kind of similar mechanism. We do have some custom connectors to various data partners and certain web services that we are familiar with.

Now, for something like an enterprise data warehouse, it of course depends on exactly which one, but let’s take a specific example: a big Oracle server somewhere that contains a lot of the enterprise’s internal data.

For something like that, there are listeners that we have as part of our cloud services that are capable of looking at a SQL database, ingesting a bunch of information from it, and pulling it right into the data delivery service for further downstream processing.

Pulling data out of systems like that is supported by one of the custom connectors that we can attach to our cloud services.

Any more questions, Amanda?

Amanda: Steve Kramer is asking, “What is your recommended best practice for continuously tracking entities like cookies or URLs? That is, how can you maintain the map between autogenerated IDs and URLs or cookies?

For example, if a previously seen IP address comes in, look up the existing IP. If it’s new, assign the next available ID.”

Dhruv: That’s a great question and it sounds like Steve has worked with some of these kinds of problems before because that’s exactly one of the things that some of our customers in the [Ad Tech] industry will ask.

They’ll say, “This is such a great demo because you showed me how to process real-time data and you showed me how to do a batch-oriented attribution analysis. But really, I’m interested in joining information in real time.

The second I see a conversion event from a given IP, can I look up by that IP or by that cookie ID and instantly make a decision, whether that’s a new placement, whether that’s a purchase, whether that’s incrementing a counter somewhere?”

That kind of problem is a challenge because you oftentimes have a high throughput of data, and you’re essentially being asked to make a join against what is often a very large store of information as well, in [real time].

Now, the upside, I can say, is that this is part of the reason why our cloud platform comes with more than just one particular kind of service. Let me go back to that slide just to make it clear. There are folks that only do Hadoop as a service.

There are folks that only do one particular kind of database. We combine real-time ingestion, which is a big differentiator for us, with Hadoop, which I’ve shown you today. But we also add in scalable NoSQL databases.

These systems are capable of storing billions of records and capable of serving many, many thousands of lookups and similar operations per second.

Those in combination with some of the technologies that I showed you today wind up being the more complete solution to a more sophisticated approach like the one that you’re outlining. Happily, it’s the same technology stack that you saw today that you would use for it.

You would just look up in the middle of the flow just as I have a geolocator that is looking up geolocations based on IP addresses, you would look up based on whatever constraints that you have in that problem from some scalable database and make that join in real-time.
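Steve’s new-versus-seen ID assignment can be sketched with an ordinary Ruby hash; in the real flow the hash would be replaced by a lookup against the scalable database mentioned above. The keys and names here are illustrative.

```ruby
# Assign a stable auto-incrementing ID to each entity (cookie, IP,
# URL) the first time it is seen; return the existing ID on every
# later occurrence. A Hash default block performs the
# "look up existing, else assign next available" step.
ids = Hash.new { |h, key| h[key] = h.size + 1 }

ids["10.0.0.1"]  # => 1  (new: next ID assigned)
ids["10.0.0.2"]  # => 2  (new: next ID assigned)
ids["10.0.0.1"]  # => 1  (seen before: existing ID returned)
```

In a distributed flow the same pattern needs a shared store with an atomic get-or-assign operation, which is exactly where a scalable database in the middle of the stream comes in.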

Amanda: Just a few more questions here, Dhruv. Steven asks, “What infrastructure can a starter kit work with, just Storm and Hadoop?”

Dhruv: The starter kit is really, and this is true of Wukong in general, Wukong is an abstract framework for describing data transformations. What does that mean? It means it’s saying what you want to have happen to your data. It’s not saying how that’s supposed to happen or who’s supposed to do it.

Wukong has a collection of plug-ins which connect Wukong to these various engines, like for example, Hadoop and Storm like I talked about.

The short answer is that right now, Storm is the engine for real-time and Hadoop is the engine for batch. To your question, will other systems ever show up? For example, can I use something like Impala or Flume or another kind of engine, or can I plug it into my web servers?

The answer is that, no, that maybe doesn’t exist today, but because Wukong is open-sourced and because it’s designed to be extensible [inaudible 43:43] way, you could write a Wukong plugin that takes your abstract data transformation and puts it behind whatever data engine you happen to enjoy and want to use. That’s the way it’s been designed.
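The separation of the “what” from the “who runs it” can be illustrated with a tiny sketch. This is not Wukong’s actual plugin API, just the shape of the abstraction: one pure transformation, multiple interchangeable drivers.

```ruby
# The transformation: pure logic with no knowledge of its engine.
uppercase = ->(record) { record.upcase }

# Two hypothetical "drivers" that can run the same callable. A real
# plugin would wrap it for Hadoop streaming, Storm, or any other
# engine without touching the transformation itself.
def run_local(processor, records)
  records.map { |r| processor.call(r) }
end

def run_stream(processor, input, output)
  input.each_line { |line| output.puts processor.call(line.chomp) }
end

run_local(uppercase, ["a", "b"])  # => ["A", "B"]
```

Swapping engines then means swapping drivers, while the application code, the lambda here, stays untouched.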

Amanda: Awesome. We have another question here from Chris. “Do you have to use Wukong in the starter kit to use the Infochimps enterprise cloud?”

Dhruv: That’s kind of a 'yes and no', I’d say. Yes, if you want the full experience of the Infochimps enterprise cloud, if you want to be able to do all the advanced stuff like notifications, data flow, automatic archiving, you want testability, you want automatic scalability.

You want to be able to introspect the nature and structure of your application, then, yes. This is what we tell our customers: we strongly recommend that you use Wukong and that you use the starter kit when you’re working with our platform.

Now with that said, we do not hide the lower level interfaces. On this diagram, for example, there’s a reason that we call out both Wukong as an interface as well as native Java map/reduce pig [inaudible 44:43]etc., is because, for a lot of our customers especially, they have some existing pig code or they have some existing Storm code and they don’t want to throw that away.

They want to keep using those skills and those native interfaces, beyond just using Wukong. For those customers, there may be a situation where 80% of what they do is done in Wukong and the deploy packs in the starter kit, because it’s so simple and so easy to get started.

It’s not performance-critical, and it’s just the fastest possible way to do it. Then 20% of the application might be written in the native [inaudible 45:17], for example native MapReduce in Java, a Pig job, or a Storm topology compiled in Java. We don’t forbid that. That is also an interface that you can use currently in our cloud services.

Amanda: One more question has come in from Steve, who, going back to his earlier example if you’ll remember it, is asking, “Do we have an example such as that on our community site?”

Dhruv: We don’t have a built-in example that’s using something like that. A big reason is that the way that we think about it, it’s just a database lookup. If you were to work with Infochimps directly, if we were to give you an instance of our enterprise cloud, you’d have a database.

You’d be able to read the instructions for how to make that lookup, and it would be just as simple as looking up from a hash or something local. So we don’t have an example.

No, but that’s a great suggestion, and I think in the next release of the starter kit for Ad Tech we might include one and say, “Hey, if you happen to have a database handy and you want to connect it to the starter kit, here’s how you might do that,” and give you some instructions so you could see that functionality.

Amanda: OK. We have another question that has come in. “Would you say that Wukong provides a consistent abstraction of Storm and Hadoop so customers do not have to deal with different data manipulation tools across Storm and Hadoop?”

Dhruv: Oh my God, [inaudible 46:42], you said exactly what I’ve been trying to say but much more elegantly perhaps than I did. That’s exactly what we’re trying to do with Wukong. Part of the reason is because we want to abstract the difference between batch and real-time.

Part of the reason is because we don’t want customer applications to be locked into one particular choice of big data framework. It’s tempting to believe that Hadoop is going to be around forever.

It is a great technology and it’s got a lot of traction, but it may change. Storm, the engine that we use for our real-time services, didn’t exist a couple of years ago, and that just gives you a sense of how quickly some of these tools and technologies are changing.

We don’t want customers to build the next generation big data app in a very new world of technology and discover three years from now that something new and better has come along and be unable to adapt to it. So that’s a big part of also why Wukong gives us that insulation against the lower layers.

Amanda: OK. If you have any further questions, now is the time to send them in using your chat functionality in GoToWebinar control panel. We’ll give you all a minute or two to ask any remaining questions.

If there are no further questions we’ll go ahead and end in just a moment. Oh, here they come. We have one more question with, “What made you choose Storm?”

Dhruv: “What made us use Storm?” OK, so we did use [inaudible 48:09] technologies in the past. If you look through our GitHub and at some of what we’ve been doing and talking about in our blog for the last few years, you’ll notice that there’s a time when we were using Flume for quite a while.

Flume was a service that’s very similar to Storm in the sense that it handles real-time ingestion, transformation, and multiplexing of information, of data. We always had issues with Flume, before we knew about Storm and before Storm was released by Twitter.

We had issues with it because it conflated what we felt were two different problems, namely the processing of data and the queuing of data to be processed. So ingestion and egestion of data versus processing of data: that was a confusing thing to try to split apart in Flume. It seemed that it always wanted you to do both within itself. That often led to downtime and problems.

Storm is really great. It’s modular. It lets us separate those two concerns, and then we’re actually able to supplement Storm with Kafka to act as our IO buffer. We have two of the most cutting-edge technologies in the real-time space playing off each other, and I think that’s what drove our decision.
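The separation of concerns described here, a dedicated IO buffer in front of the processing engine, can be illustrated with Ruby’s thread-safe `Queue` standing in for Kafka. This is a toy single-process sketch of the idea, not the actual architecture.

```ruby
# Ingestion writes into the buffer; processing drains it. Keeping
# these concerns separate means a slow processor backs up the queue
# instead of blocking or losing incoming data outright.
buffer = Queue.new  # stand-in for the Kafka IO buffer

%w[event-1 event-2 event-3].each { |e| buffer << e }  # ingestion side
buffer << :done                                       # end-of-stream marker

processed = []
while (event = buffer.pop) != :done
  processed << event.upcase  # processing side: the "Storm" transform
end
processed  # => ["EVENT-1", "EVENT-2", "EVENT-3"]
```

Swapping either side independently, a faster buffer or a different processing engine, is exactly the modularity being credited to the Storm-plus-Kafka combination.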

I hope that’s a good enough answer. I don’t mean to talk poorly about Flume. It’s just Storm is newer and solves those problems more elegantly.

Amanda: OK, Dhruv, at this time I don’t see any further questions coming in so I’d like to thank you all for joining us. It was a pleasure to have your company today and, again, this is being recorded.

We will send the link to the recorded video and slide deck through your e-mail inbox early next week.

Thanks again for joining us and I hope you enjoy the rest of your day.

Dhruv: Thank you, everybody.