Infochimps at the NYC Data Business Meetup
Dhruv Bansal, co-founder and Chief Science Officer of Infochimps, presents at the NYC Data Business Meetup on December 18, 2012, about Big Data infrastructure. This Meetup focuses on the business aspects of the data revolution: startup opportunities, new products, business models, funding trends, emerging players, and ways for existing startups to leverage their data.
All right, let's see if we can do this. All right, so I have a funny title for my talk. The fourth word is not a word, but a symbol. But I'll get to it. I should introduce myself and talk briefly about the company, and then let's get into the title and into what I think is one of the coolest things I've learned in the last couple of years. It's really helped our company achieve a lot of success as well, so I'm going to dig into it.
I'm Dhruv Bansal. I'm formerly a physicist. I'm recovered, and now I have a small company. We make data infrastructure really easy for folks. I really wish I had Joseph's slide here in this deck, where you see infrastructure at the bottom, a platform in the middle, and these beautiful applications on top. We're definitely a platform-in-the-middle kind of company. We don't build hardware. We don't offer you hardware. We leverage clouds that are public or private, or in data centers.
That's our infrastructure, and we don't build applications, so I don't know that much about sentiment analysis. I don't actually deeply understand fraud detection. But I provide the platform in the middle that lets folks who want to build those kinds of things get them done quickly. That's what we're very proud of, how quickly, how easily we can get customers to build those applications. That's what it's all about for us.
But the title of the talk is something I've got here called "Can We Make This Red Box Run Anywhere?" That red box means something to me. It's like, I've been drawing those for four years now on whiteboards, in all sorts of weird contexts. Those of you who are elite programmer types, or build architectures, or like to argue about algorithms, you know what I'm talking about, right? You're like, "Okay. How does this really work?"
Well, the input comes in like this. All right? You might draw a diagram or something like this. You're like, "My input comes in and then what happens to it?" This thing runs, right? Does it run? I don't remember, let me draw a line there. Oh, okay. Then I do this other thing, and I do this other thing, and then I glue it together.
The reality is my co-founder Flip, who's a genius when it comes to these technical things, deeply realized at some point that all we're really doing here is we're building networks. When we think about computation, when we think about building infrastructure, when we think about solving actual business problems, we're ultimately building some kind of network with some topology or the other.
We're sending data in, and every node in this network is doing something to the data. It's cleaning it. It's transforming it. It's capitalizing it. It's extracting a record. It's adding in some context from somewhere else. And this is almost to the point of being some sort of weird tautology: all we ever do when we compute is string together little pieces of computation, one after the other.
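A minimal sketch of that picture, in Python rather than anything from the talk (the step names and the context data are made up for illustration): every node is a small function on a record, and the network is just those functions strung together.

```python
# Computation as a network of little boxes: each step takes a record
# and does one thing to it. Illustrative sketch; names are made up.
def clean(record):
    return record.strip()

def capitalize(record):
    return record.upper()

def add_context(record):
    return {"value": record, "source": "weblog"}  # hypothetical context step

def pipeline(steps, records):
    """String little pieces of computation together, one after the other."""
    for record in records:
        for step in steps:
            record = step(record)
        yield record

for out in pipeline([clean, capitalize, add_context], ["  foo ", " bar "]):
    print(out)  # {'value': 'FOO', 'source': 'weblog'} ...
```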
Now that was true on a 1970s mainframe computer, and it's still true of what gets compiled down to that level today. It's true on the command line for those of us who are hackers, when you play and you pipe stuff back and forth. What you're doing is gluing together computations.
It's certainly true of Hadoop, and I'll pause here before I get into what that means. Who here knows Hadoop, who's played with it? Lots of hands. Who here has actually written a Hadoop job? Keep them up, long and hard, folks. Who here has debugged a Hadoop job? Yes. Yes. Okay. So you might cringe, those of you who have more experience than [inaudible 00:02:49] myself.
You might cringe at that simplistic description of what a Hadoop job is. It's so much more. There's a job tracker. There's a single point of failure. There are all these big companies and billions of dollars being made. But this is actually all it is. This is really it: a map, a sort, and a reduce. That's what a Hadoop job boils down to, right?
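That shape fits in a few lines. Here's a sketch of just the shape (not Hadoop's actual machinery, which adds partitioning, shuffling, and distribution across machines); on the command line, the same shape is literally `map | sort | reduce`.

```python
# The entire shape of a Hadoop job: map, then sort, then reduce.
# Sketch only; the real framework distributes this across machines.
from itertools import groupby

def run_job(records, map_fn, reduce_fn):
    # map: each record becomes zero or more (key, value) pairs
    pairs = [kv for record in records for kv in map_fn(record)]
    # sort: Hadoop's shuffle/sort phase groups the pairs by key
    pairs.sort(key=lambda kv: kv[0])
    # reduce: each key and its values become output records
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield from reduce_fn(key, [value for _, value in group])

# tiny made-up demo: count records by length parity
result = run_job(
    ["a", "bb", "ccc"],
    lambda r: [(len(r) % 2, 1)],
    lambda k, vs: [(k, sum(vs))],
)
print(list(result))  # [(0, 1), (1, 2)]
```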
But what's cool about Hadoop is that it makes this really crucial distinction. It says that your data — you always think of your data as the most meaningful noun when it comes to Big Data. It's the thing that's big, right? But in reality, what Hadoop does is it takes the emphasis almost off the data for me. Where it says, "The data is so damn big we'll never move it around. Let's just leave it where the hell it is."
What we're going to do when we analyze data is we're going to ship around computations. So we're going to take actual functions, processes, algorithms, we're going to make those somehow our nouns. We're going to grab those and we're going to send them places. We're going to send them where our data is. We're going to hook them together like this in this particular way.
Each event is going to go through first this one, then this one, then this one, and that's what we mean by a Hadoop job. But really what we're saying is that all of a sudden the world has become functional. Not pure functional, like Haskell or anything like that.
We're not voluntarily here.
Yeah, okay. There we go. Thank you, sir. Thank you. Whew! Tough crowd at the [Haskell 00:04:00] tonight. Yes. No, I'm just kidding. You're absolutely right. But the point is I use the word loosely because it has a lot of specific meanings. But really as a programmer, someone who writes a lot of Hadoop jobs, for me I've realized it's not about the data.
The data is almost something I can't control. It's almost something I'm afraid of. It just stacks up over here and my only approach to it is to write these little functions. These are like my missiles. I send them over in there, they answer some questions, they come back all weary and I have an answer. They have journeyed to the data for me and fetched it back.
But when I think about them, I'm gluing them together. This is how I design things, right? This is how everybody designs stuff. You draw little boxes on whiteboards, take a photo, argue with friends.
Are your missiles good or bad?
It's hard to say. Really, I don't know. Big Data is messy. Here's another one.
They can be both good or bad.
They can be good or bad, I suppose. But there's a whole interpretive issue around that that we won't even get into. Here's a thing called Storm. This is a newer thing than Hadoop. It's got less of a community right now, but I think it's really cool. It does real-time computation very effectively. I'm going to make the same kind of specious argument. Hopefully I've convinced you of my basic point sufficiently well.
But Storm is, again, just functions. It's just processes that we apply, except it's more complicated than Hadoop. Right? It's not map, sort, and reduce. It's an arbitrary topology of stuff you want to glue together. So it works in a different way, and we can't send batches of records. You'll notice that in this slide I'm sending lots and lots of records through at once, right? Hadoop is awesome at doing that.
I can't do that anymore over here; I have to send a record through at a time. But I get other things. I get transactionality. I get a bunch of other cool stuff that happens. But really, the way that I still think about it when I'm programming it in my head is I'm just writing these little boxes and gluing them together.
In some sense I'm being a little object-oriented. Right? I'm separating out concerns. Like, this box handles the decoration of my log events with IP address metadata. This box handles that stupid thing that I have to do because this one data source has this problem with it. Right? I encapsulate those problems and I glue them together into a topology through which my data flows.
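A stripped-down illustration of that kind of topology, with plain Python generators standing in for Storm's actual spouts and bolts (which run as distributed processes); the names and the metadata table here are invented:

```python
# Record-at-a-time flow: each box consumes one event at a time.
# Plain generators stand in for a real Storm topology; all names
# and data are illustrative.
GEO = {"10.0.0.1": "NYC"}  # hypothetical IP-address metadata

def parse(events):
    for raw in events:
        ip, _, message = raw.partition(" ")
        yield {"ip": ip, "message": message}

def decorate(events):
    # the box that decorates log events with IP address metadata
    for event in events:
        event["city"] = GEO.get(event["ip"], "unknown")
        yield event

stream = ["10.0.0.1 GET /index.html", "10.0.0.2 GET /about"]
for event in decorate(parse(stream)):
    print(event)
```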
Here's one more example of something you might have seen. This is every API in the world, ever. Right? This is every service-oriented architecture. It is a bunch of functions that have been named with a particular path to which you can send events, and you can get back a response. This is the only one in which arguably data is maybe moving, maybe, if you want to think about it.
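As a sketch of that view, here's a "function named with a path" using Flask; the `/tokenize` endpoint and its behavior are invented for illustration:

```python
# Every service-oriented architecture: functions named with paths,
# to which you send events and get back responses. Illustrative
# Flask sketch; the endpoint is made up.
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/tokenize", methods=["POST"])
def tokenize():
    # the same kind of little function, now addressable over HTTP
    text = request.get_json().get("text", "")
    return jsonify(tokens=text.lower().split())

if __name__ == "__main__":
    app.run()
```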
But really, again, I think of these as boxes now too, and it's like I can't decide if I've just spent too much time at work, or if the reality is that all of computation is just passing functions around and gluing them together. I think I might have even read that once somewhere, so I'm suspecting it's the latter, but I'm not sure.
It brings me to the question: can we make a function run anywhere? That is essentially the mission that we've been on at Infochimps, even though we didn't articulate it that way. We didn't think about it like that. We were trying to say, "Let's make the world easier to use. Let's make Hadoop more approachable. Let's take this talent gap that exists in the world and eliminate it, because we can get folks who don't know Hadoop to run Hadoop jobs somehow. We can get folks who aren't into Big Data, but do understand their data, to ask questions and get them answered."
So the key for us was making a function run anywhere. Specifically, here's a tokenizer. I'll skip the code just because we're getting a little tight on time already, I think, so I'll skip through the details of this. The point I want to make briefly is that we actually did this. It now exists. There's a bunch of open-source code out there called Wukong, which is our attempt at letting you focus exactly on the verbs.
These processors, these functions, these lambdas if you're familiar with that concept: Wukong lets you glue them together into more complicated flows, which you can then run wherever you please. Because the function is the thing at this point, right? It's not the data anymore. This is a tokenizer. This guy is an accumulator. Right? This counts up records.
If I have a tokenizer and an accumulator and I glue them up with a sort in the middle, I've just built a Hadoop job. Right? That's word count. That's what word count is: those three particular functions glued together in series. So I can actually write this now in Wukong, in code that really exists in the world. I haven't made any claims about where this runs. I've just described what the algorithm is, and I can choose to run it on my command line.
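Wukong itself is a Ruby DSL, so what follows is only a loose Python analogue of the idea, not Wukong's actual API: a tokenizer and an accumulator, glued up with a sort in the middle, and nothing in the two functions says where they run.

```python
# Word count as three glued-together pieces: tokenize, sort, accumulate.
# A loose analogue of the Wukong idea; not Wukong's actual (Ruby) API.
from itertools import groupby

def tokenizer(record):
    for word in record.lower().split():
        yield word, 1

def accumulator(word, counts):
    yield word, sum(counts)

lines = ["the quick brown fox", "the lazy dog"]
# glue: tokenize every line, sort by word, accumulate per word
pairs = sorted(kv for line in lines for kv in tokenizer(line))
for word, group in groupby(pairs, key=lambda kv: kv[0]):
    for key, total in accumulator(word, [c for _, c in group]):
        print(key, total)  # brown 1, dog 1, fox 1, ..., the 2
```

The same two functions could just as well be handed to the `run_job` sketch above as its map and reduce steps; nothing in their definitions pins down where they execute, which is the "run anywhere" point.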
That's awesome. I can develop locally on my laptop at the airport. I can take this exact same piece of code and push it and run it in Hadoop. So now I'm doing batch analytics all of a sudden, without having touched Java, without having deeply understood anything about how that framework works.
I'm just working with my data, defining the algorithm I want, and pushing it through. I can of course build more complicated flows. Here's the previous one, reinterpreted by describing what that flow might actually mean to me in a real business problem: some ETL workflow or something like that that I'm trying to create.
At the end of the day, and I'll try to stay a little bit non-product-pitchy here, we've glued everything together. We've taken Wukong as our interface and really built what we think is the world's first truly usable Big Data platform. One that hopefully does sit in the middle, straddling the infrastructure that's going to provide you with the actual machines and resources, the unreliable resources in Joseph's slide, that are going to get you there.
Then, of course, providing you with the interfaces that are going to make it really easy for you to build those applications on top. The fraud detection, the website personalization, the customer segmentation, this is the stuff that everybody cares about. So let's give you technology like this that's simple, easy to express, easy to test, easy to run locally, and let you lift it into all those other categories. Right? So let's deploy all this stuff out there. Let's let you take these simple interfaces and inject them into Hadoop, inject them into other places.
It's been a great success. The company's doing very well. We can talk about it. But I think ultimately, and I only realized this recently, all we've actually accomplished over the last year or two of noodling on this issue is that we've admitted to ourselves that the function is more important than the data. That we're going to move the function around wherever it needs to go, and we're going to let it be free. I don't know. That's my perspective. Thanks, guys.