How Wukong™ Makes Big Data Easy
There are two complementary ways to process Big Data: batch processing and real-time (or stream) processing. These are traditionally viewed as very different approaches to solving problems, especially in a Big Data context, where the toolsets for each kind of processing differ greatly.
For batch processing, Infochimps Cloud provides Cloud Hadoop, an on-demand framework for executing Hadoop jobs on big data. Hadoop is an open-source Apache project written in Java.
For real-time processing, Infochimps Cloud provides Cloud::Streams, an extension of Storm and Apache Kafka, also open-source projects, both of which run on the Java Virtual Machine (JVM).
Big Data Applications Are Hard to Develop
Hadoop and Storm are wonderful technologies on which to build Big Data applications. Unfortunately, they each create a set of unique challenges for developers:
• You Need to Run the Whole Thing: Both Hadoop and Storm are complex programs that require several running, configured, and connected daemons of their own as well as several other programs (HDFS, Zookeeper) to even start up properly.
• You Will Wait 10 Minutes Every Time You Make a Mistake: Modifying your Java code, compiling it into a jar, sending it to the appropriate servers, launching your job, hitting an error, and finding debug logs so you can fix it will take you 10 minutes. You will do this hundreds of times as you develop your applications so you will wind up waiting...a lot.
• You Will Disrupt Production Traffic: Both Hadoop and Storm are multi-tenant environments that can run code from many different individuals or teams at the same time. When you make a mistake that is truly large (and Big Data gives you plenty of opportunities to do this) you can slow down or even crash the servers for everyone else.
Quickly write simple, composable data transformations and test them locally. Then let Wukong deploy them for you as a MapReduce job in Hadoop or as a decorator with Cloud::Streams.
Even if you are successfully leveraging one or both of Hadoop and Storm, there is one more challenge created at their interstices:
• Hadoop Does Not Understand Storm and Storm Does Not Understand Hadoop: In real-world Big Data infrastructure it is a common need to move some of your processing from batch to real-time, or vice versa. Code you write for a Hadoop MapReduce job cannot be run in the context of Storm and code you write as Infochimps Cloud::Streams decorators cannot be run in the context of a Hadoop MapReduce job. Should you need flexibility between batch and real-time processing, you’ll have to write, test, and deploy the exact same algorithm multiple times: once locally on sample data (such as with a statistical analysis tool like R), once for Hadoop, and once for Storm.
While tools such as Hadoop and Storm are truly powerful and can help you create amazing things, they present significant challenges for analytics and Big Data application development. Surely there must be an easier way?
Wukong to the Rescue
There is indeed an easier way, and it’s called Wukong.
Consider that you are a data scientist, and you are developing a sentiment analyzer algorithm.
With Wukong, you can test that algorithm with sample data, or even a live data stream, on your laptop.
Want to test it on a huge swath of historical data? Without changing your code, run your sentiment analyzer on terabytes of data within Hadoop.
Maybe your boss tells you, “We’d love to have that sentiment analysis available as part of a real-time dashboard.” Without changing your code, run your sentiment analyzer as part of a real-time streaming data flow.
Wukong is built on the following fundamental philosophy: Big Data tools accept, transform, and emit events and that is all.
• A Hadoop job takes input events and transforms them in a mapper, sorts all the output, and passes each event sequentially through a reducer. A Storm flow emits events at its source and each decorator processes those events serially until they are consumed by a writer. In all cases, the conceptual picture is that of events being transformed and then passed along.
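The map, sort, and reduce picture described above can be sketched in a few lines of plain Ruby. This is an illustration only, not Hadoop or Wukong code: the method names `map_event`, `reduce_group`, and `run_job`, and the word-count task itself, are assumptions chosen for the example.

```ruby
# Plain-Ruby sketch of "map -> sort -> reduce" as event transformations.
# Not Hadoop or Wukong code; all names here are hypothetical.

# Mapper: accepts one input event (a line) and emits [word, 1] pairs.
def map_event(line)
  line.downcase.scan(/\w+/).map { |word| [word, 1] }
end

# Reducer: accepts a key and all of its values, and emits one summary event.
def reduce_group(word, counts)
  [word, counts.sum]
end

def run_job(lines)
  mapped  = lines.flat_map { |line| map_event(line) }    # map phase
  sorted  = mapped.sort_by { |word, _count| word }       # sort phase
  grouped = sorted.chunk_while { |a, b| a[0] == b[0] }   # group adjacent equal keys
  grouped.map { |pairs| reduce_group(pairs[0][0], pairs.map(&:last)) }
end

run_job(["big data is big", "data is everywhere"]).each do |word, count|
  puts "#{word}\t#{count}"
end
```

Each phase only accepts events and emits events, which is why the same picture applies whether the events come from a file on a laptop or from a cluster.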
Wukong allows you to write code that accepts, transforms, and emits events. Wukong will run this code for you locally, processing events the same way as Hadoop or Storm, only at a much smaller scale. When you’re ready, Wukong will let you deploy your code unmodified to work either as a MapReduce job in Hadoop or as a Cloud::Streams decorator in Storm.
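As a rough sketch of what "accept, transform, emit" can look like, the plain-Ruby example below defines a processor that accepts one event and yields zero or more output events, plus a tiny local runner. It does not use Wukong's actual API: the `ErrorFilter` class, the `run_locally` helper, and the log-line format are all hypothetical.

```ruby
# Hypothetical processor showing the accept/transform/emit shape.
# This is NOT Wukong's API, only an illustration of the idea.
class ErrorFilter
  # Accepts one event (a log line) and yields zero or more output events.
  def process(line)
    level, message = line.strip.split(" ", 2)
    return unless level == "ERROR"          # emit nothing for other lines
    yield({ level: level, message: message })
  end
end

# Minimal local runner: feeds each event through the processor
# and collects whatever the processor emits.
def run_locally(processor, events)
  emitted = []
  events.each { |event| processor.process(event) { |out| emitted << out } }
  emitted
end

sample = ["INFO started", "ERROR disk full", ""]
run_locally(ErrorFilter.new, sample).each { |event| puts event.inspect }
```

Because the processor knows nothing about where its events come from, the runner is the only piece that would need to change between a laptop and a cluster.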
This brings a lot of advantages which address the challenges created by Hadoop and Storm:
• Write and Test Code Locally: You don’t have to install either Storm or Hadoop. Just write your code and Wukong will process events in the same way as Storm or Hadoop, only locally and at a smaller scale appropriate for development and testing.
• Write Simple Scripts in Ruby: Wukong is a Ruby library, and if you write your code in Ruby, you can take advantage of a lot of abstractions designed for MapReduce and streaming that Wukong provides.
• Have a Tight Debug Cycle: Since you’ll be developing and testing your code locally without Hadoop or Storm, you’ll be able to instantly run your code, find your errors, and fix them, right there on your laptop.
• Don’t Disrupt Anyone Else: Since you can develop and test all your code locally, you’ll only be running working, well-tested code on your production clusters. You’re less likely to cause a slow-down or crash for other users.
The most distinctive feature of developing with Wukong is the ability to bridge the chasm between batch processing and real-time processing.
• Seamlessly Move Between Batch and Real-Time Contexts: Because Wukong relies on the fundamental concept of accepting, transforming, and emitting events, it can take your algorithm and deploy it as a MapReduce job directly in Hadoop or as a decorator within Storm.
This ability gives you tremendous technical freedom and flexibility. You can develop an algorithm locally, test it using Wukong, and then run it on historical data in your HDFS or database using Hadoop. If you’re happy with what it does for your historical data, you could plug it in as a decorator into the flow inside Cloud::Streams and have it process all new records in real-time as they come in. You only write your algorithm once, and Wukong helps you move it from testing into production on either Hadoop or Storm, or both.
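The "write once, run in either context" idea can be illustrated in plain Ruby. In this sketch the same processor is applied to a batch (a collection available up front) and to a stream (events consumed one at a time). Everything here is an assumption made for illustration: the naive word-list scorer, the `run_batch` and `run_stream` helpers, and the sample data. It is not how Wukong itself deploys code to Hadoop or Storm.

```ruby
# Hypothetical, deliberately naive sentiment scorer; the word lists are made up.
POSITIVE = %w[good great love].freeze
NEGATIVE = %w[bad awful hate].freeze

SENTIMENT = lambda do |text|
  words    = text.downcase.scan(/\w+/)
  positive = words.count { |w| POSITIVE.include?(w) }
  negative = words.count { |w| NEGATIVE.include?(w) }
  { text: text, score: positive - negative }
end

# Batch context: all events are available up front and processed in bulk.
def run_batch(events, processor)
  events.map { |event| processor.call(event) }
end

# Streaming context: events are pulled one at a time, and each result
# is produced only when the consumer asks for it.
def run_stream(source, processor)
  source.lazy.map { |event| processor.call(event) }
end

historical = ["great day", "bad bad day"]
puts run_batch(historical, SENTIMENT).inspect

live_feed = Enumerator.new do |feed|
  feed << "I love this"
  feed << "I hate that"
end
puts run_stream(live_feed, SENTIMENT).first(1).inspect
```

The processor is defined once; only the small piece of code that drives it differs between the two contexts.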
Wukong uses a domain-specific language designed to make it easy to describe data models, data transformations, and data flows for data processing and analytics. The beauty of Wukong is that it abstracts away the details of any given processing framework, such as Hadoop or Storm, and thinks of the world only as models, processors, and data flows. A data flow might run in a streaming context, where it handles events sequentially as they are received, or in a batch context, where it handles many events in bulk.
• Models are the nouns. They describe the structure and meaning of a data object. For example, a tweet model would define fields such as id, timestamp, author, text, etc.
• Processors are the verbs. They perform a specific, modular data transformation or augmentation. For example, processors that will be transforming your tweets might include tweet_parser, sentiment_analyzer, censor, topic_extractor, etc. Processors don’t know anything about the objects they are modifying, other than what action they should perform.
• Data flows are the sentences. Stitch together inputs, outputs, models, and processors to create an overall workflow. If necessary, you can split or merge flows as well. For example, a data flow might look something like: input() > tweet_parser > sentiment_analyzer > censor > topic_extractor > output().
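To make the nouns, verbs, and sentences concrete, here is a minimal plain-Ruby sketch. It is not Wukong's DSL: the `Tweet` struct, the tab-separated input format, the censored word, and the `dataflow` helper are all assumptions made for illustration, and only two of the processors named above are shown.

```ruby
# Model (noun): the structure of a tweet, using the fields named in the text.
Tweet = Struct.new(:id, :timestamp, :author, :text, keyword_init: true)

# Processors (verbs): each performs one small, modular transformation.
# The bodies are hypothetical stand-ins, not real Wukong processors.
TWEET_PARSER = lambda do |line|
  id, timestamp, author, text = line.split("\t", 4)
  Tweet.new(id: id, timestamp: timestamp, author: author, text: text)
end

CENSOR = lambda do |tweet|
  censored = tweet.dup
  censored.text = tweet.text.gsub(/\bdarn\b/i, "****")
  censored
end

# Data flow (sentence): stitches processors together in order and
# returns a single callable that runs an event through all of them.
def dataflow(*processors)
  lambda do |event|
    processors.reduce(event) { |current, step| step.call(current) }
  end
end

FLOW = dataflow(TWEET_PARSER, CENSOR)
tweet = FLOW.call("1\t2013-01-01\talice\tDarn, it is raining")
puts "#{tweet.author}: #{tweet.text}"
```

Adding another processor, such as a sentiment analyzer, would mean passing one more callable to `dataflow` without touching the existing ones.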
The Big Data Application Lifecycle
Wukong executes your data flows in the context you specify, allowing you to move from local development and local testing, through pushing code to the cloud (git push) and testing and QA in a staging environment, and finally to production. This lets your application developers and data scientists work collaboratively, write algorithms only once, and push features into production quickly yet confidently.
See Infochimps Cloud for Big Data.
Infochimps Cloud is a suite of enterprise-ready cloud services that make it simpler, faster and far less complicated to develop and deploy Big Data applications in public, virtual private and private clouds.
Our cloud services provide a comprehensive analytics platform, including data streaming, storage, queries and administration. With Infochimps, you focus on the analytics that drive your business insights, not building and managing a complex infrastructure.