Infochimps Big Data Platform as-a-Service Technical Overview

Infochimps Big Data Platform as-a-Service is an integrated product tool set that makes it much easier to perform Big Data analytics and create Big Data applications. It’s a collection of open source and proprietary software for Big Data processing, data collection and integration, data storage, data analysis and visualization, and infrastructure management.




Gain Big Insights from Big Data

Infochimps Big Data Platform as-a-Service is a suite of robust, scalable cloud services that make it faster and far less complex to develop and deploy enterprise Big Data applications. Whether you need real-time analytics on multi-source streaming data, a scalable NoSQL database or an elastic, cloud-based Hadoop cluster — Infochimps Cloud is your easiest step to Big Data. Infochimps Cloud enables Big Data processing, data collection and integration, data storage, data analysis and visualization, and infrastructure management. Coupled with our expert team and a revolutionary approach to tying it all together, we help you accelerate your Big Data projects.

Explore the Technology

Infochimps Cloud is a suite of managed cloud services, which means we provide the system administration, maintenance, and support — so your team and resources can stay focused on creating insights, not herding machines.

The Cloud Services Layer represents the machine systems, managed by Infochimps, that you leverage to process or store your data. We categorize infrastructure into three buckets: real-time stream processing, databases and data storage, and Apache Hadoop™ batch processing.

While you can interact with the infrastructure layer directly if you’d like (for example, connecting straight to a database), we also include the Application Layer, which provides productivity software and interfaces that make performing analysis and building applications easier.

Underneath all of this is Ironfan™, an Infochimps-authored tool for systems provisioning, deployment, and updating, which serves as the foundation for Infochimps Cloud. It’s built from a combination of open source technologies such as Chef and Fog plus additional software — allowing you to orchestrate infrastructure at the system diagram level (manage clusters, not machines) and deploy to virtually any cloud provider.


Using Infochimps Cloud

Infochimps Cloud has several key interface points. This document explores each of these in more detail:

• Collect Data
• Perform Stream Processing with Decorators
• Query Data and Build Applications
• Perform Hadoop Processing


Data Streaming and Real-time Analytics

Collect Data
Cloud::Streams leverages Storm and Apache Kafka to provide scalable, distributed, fault-tolerant data flows. Storm is multi-node and modular, so you can have fine-grained control over how you modify and scale your data flows and stream processing.

Flow Components
All data processed by Cloud::Streams flows as a stream of events, each consisting of a body and some associated metadata. Events flow through a series of logical nodes chained together: Cloud::Streams connects collectors, which produce events, to writers, which consume events. Events can be modified in-stream by decorators.
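As a concrete illustration of this flow model (the class and method names below are invented for illustration, not Cloud::Streams internals), an event and a decorator might look like:

```ruby
# A minimal sketch of the event model described above: a string body
# plus key-value metadata, flowing through a chain of logical nodes.
# These are illustrative stand-ins, not the actual Cloud::Streams classes.
Event = Struct.new(:body, :metadata)

# A decorator is a node that receives an event and emits zero or more events.
def uppercase_decorator(event)
  [Event.new(event.body.upcase, event.metadata)]
end

event = Event.new("hello stream", { "source" => "http" })
uppercase_decorator(event).first.body  # => "HELLO STREAM"
```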

Cloud::Streams extends Storm and Kafka with hardened deployment and monitoring functionality, plus a development toolkit that makes building data flows and stream processes easier. While the system is very flexible, Infochimps provides a standard set of collectors and writers that serve as the primary interfaces for data input and data delivery.


Standard data collectors include:
• HTTP (post, stream, web socket)
• Logs (syslog-ng, tailing, etc.)
• Data partners (Gnip, Digital Element, etc.)
• Batch upload (S3 bucket, FTP, etc.)
• Custom connectors for unique or legacy systems

Standard data writers target:
• Applications (REST API)
• Databases (MySQL/PostgreSQL, Elasticsearch, HBase, MongoDB)
• Filesystems (Hadoop HDFS, S3 or equivalent, local filesystem)

Architecture Design
Designing data flows that take advantage of the strengths of Infochimps Cloud and match the output your applications or databases expect requires critical upfront focus on architecture and design. To speed you along in your pursuit of this goal, Infochimps will help you:
• Choose which pre-built collectors, writers, and decorators are appropriate for your data flow
• Decide what sorts of processing should be done in-stream vs. batch or offline
• Design your data flows so they match the schema of your business applications

Multiplexing and Demultiplexing
Not only can Cloud::Streams process events from a collector, through a chain of decorators, and into a writer, it can also split or recombine flows.

Splitting flows is useful for several reasons:

• Different events may need to be routed to different downstream applications or writers (high security vs. low security, regulated vs. non-regulated, etc.)
• The same event may need to be duplicated and routed to two different downstream applications or writers (taking advantage of two different databases for the same data, sending a sample of live data to a testing environment, etc.)

Aggregating events from several different flows is also a common pattern: different systems (POS systems, web logs, mobile devices) may be normalized in their own separate flows and then combined before being written to a database.
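The splitting pattern can be sketched in a few lines of Ruby; the flow names, metadata fields, and routing rules here are hypothetical, not a Cloud::Streams API:

```ruby
# Sketch of flow splitting: each event is routed to every downstream
# flow whose rule matches, so one event can be duplicated across flows.
# Flow names and metadata fields are invented for illustration.
def route(event, rules)
  rules.select { |_flow, rule| rule.call(event) }.keys
end

RULES = {
  secure:  ->(e) { e[:metadata]["pii"] == true },  # regulated data only
  general: ->(e) { true }                          # every event
}

route({ body: "order#123", metadata: { "pii" => true } }, RULES)
# => [:secure, :general] -- the event is duplicated into both flows
```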


Perform Stream Processing with Decorators

Developing Decorators
The true power of Cloud::Streams is the ability for developers to write and chain together decorators within a flow -- enabling real-time analytical processing.

While decorators can be developed by Infochimps, in many cases your organization will choose to do much of this development in-house. Decorators will often contain custom business logic (“Filter all events older than 2 weeks” or “Transform data from source X so it looks like source Y”). Infochimps makes it easy for you to develop, test, and deploy your code and business logic into the flow.

A decorator in Cloud::Streams receives an event and must decide what to do. A decorator has access to:
• The string body of the event
• Key-value metadata associated with the event
• Global config info available from the Cloud API
• Any databases or other shared Platform resources

A decorator has complete freedom as to what it does when it receives an event. It can:
• Transmit the event unchanged, perhaps incrementing counters elsewhere in the Platform (counting, monitoring)
• Modify the event and then transmit it (normalization, transformation)
• Use a database or external web service to insert new data into an event and transmit it (augmentation)
• Not emit the event at all (filtering)
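As a sketch of these behaviors (plain Ruby, not the actual decorator API), a decorator implementing the “filter all events older than 2 weeks” rule from above, plus a touch of augmentation, might look like:

```ruby
require "time"

TWO_WEEKS = 14 * 24 * 60 * 60  # seconds

# Returns the list of events to emit downstream: empty (filtering) for
# stale events, otherwise the event augmented with a processing stamp.
# The event shape and field names are invented for illustration.
def filter_old_events(event, now: Time.now)
  seen = Time.parse(event[:metadata]["timestamp"])
  return [] if now - seen > TWO_WEEKS               # filtering: emit nothing
  event[:metadata]["checked_at"] = now.utc.iso8601  # augmentation
  [event]                                           # transmit the (modified) event
end
```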

Keep in mind that a decorator can only examine the individual event it is currently processing or reach out to shared resources like configuration or databases. It cannot look at a different event at the same time.

Deploy Pack
Infochimps Cloud uses the concept of a “deploy pack” to define custom decorators and to control deployment within Cloud::Streams.

A deploy pack is a Git repository with the following purposes:
• Hold the implementation of various decorators
• Provide an environment for local testing of decorators
• Define how decorators, collectors, and writers are wired together in a real data flow

Your instance of Infochimps Cloud will come with a deploy pack that you can clone, modify, and push changes back to. When you push your changes back, Infochimps Cloud will take necessary actions to deploy the changes you’ve made live into Cloud::Streams.


Using Wukong™ to Simplify Decorator Development
Wukong™ is a Ruby framework for data transformation. Wukong’s goal is to let you describe the data transformation or flow you’re trying to build as simply, quickly and elegantly as possible.

Once written, Wukong can execute your data transformation:
• Locally in your development environment, without the aid of any other Big Data tools or frameworks
• Within Hadoop as a MapReduce job
• Within Cloud::Streams as a Storm decorator

Compared with the traditional Java write / compile / package / upload cycle familiar to most Big Data developers, this approach offers faster development and portability of your algorithm between execution environments.
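To give a feel for this style, here is plain Ruby imitating Wukong’s processor pattern, without depending on the wukong gem itself: the transformation is written once as a “process one record, emit many” method, which a local, Hadoop, or Storm runner could each drive.

```ruby
# Plain-Ruby imitation of the Wukong pattern: write the transformation
# once as a process-one-record method that yields its outputs, then let
# different runners (local, Hadoop streaming, Storm) drive it.
class TokenCounter
  def process(line)
    line.downcase.scan(/\w+/).each { |word| yield [word, 1] }
  end
end

# Local runner -- a Hadoop streaming or Storm wrapper could drive the
# same process method unchanged.
counts = Hash.new(0)
TokenCounter.new.process("To be or not to be") { |word, n| counts[word] += n }
counts["to"]  # => 2
```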

NoSQL Database and Ad Hoc, Query-Driven Analytics

Query Data and Build Applications

Application and Schema Design
Infochimps Cloud offers several databases, from traditional RDBMS like MySQL to a choice between several distributed, linearly scalable NoSQL databases. Not every database can perform every task well and one of the advantages of Infochimps Cloud is that data can be routed or mirrored across databases to optimize for different use cases or modes of access. A few of the databases currently offered by Infochimps Cloud include:
• MySQL or PostgreSQL: A fast, open source SQL-compatible database.
• Elasticsearch: A distributed, linearly-scalable database with a rich data model oriented around full-text search.
• HBase: A distributed, linearly-scalable database appropriate for time series analysis and other numerical or otherwise flat data.
• MongoDB: A fast, analytics-oriented document store.

Infochimps offers several databases for Big Data applications because leveraging the strengths of multiple databases is often the only practical way to build robust, scalable Big Data applications. Your financial time series analysis application might have thousands or even a million users (a small number) but tens of billions of data points for those users to analyze (a large number). No single database handles both the user management and the querying over billions of data points well. Commonly, developers store information like user management in a different data store, either inside or outside the scope of Infochimps Cloud.

For these and other reasons, your Big Data application will need to query data from multiple databases.

This can present a problem: since data now lives in multiple places, how can you be sure you’re reading it from the right place or writing it to the right place? The simplest answer is that you must understand the structure of the Big Data problem you’re solving and use the right databases to store the correct data in an appropriate schema so that your application can be efficient and powerful.

Infochimps Cloud will let you:
• Choose the most appropriate database for your application
• Understand what data you can continue to use from your own environment
• Design a schema for your new data which matches the strengths of the database(s) you’re using
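The routing-or-mirroring idea described above can be illustrated with a small sketch; the writer classes below are stand-ins, not real database clients:

```ruby
# Sketch of routing one logical write to multiple stores, each chosen
# for a different access pattern (e.g. full-text search vs. time series).
# ListWriter is an invented stand-in for a real database client.
class ListWriter
  attr_reader :records
  def initialize
    @records = []
  end

  def write(record)
    @records << record
  end
end

class MirroredWriter
  def initialize(*writers)
    @writers = writers
  end

  def write(record)
    @writers.each { |w| w.write(record) }
  end
end

search_store = ListWriter.new  # e.g. standing in for an Elasticsearch index
series_store = ListWriter.new  # e.g. standing in for an HBase table
MirroredWriter.new(search_store, series_store).write({ id: 1, text: "tick" })
```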

Interacting With Your Data
The way you interact with your data will depend on the database in which it is stored.

MySQL or PostgreSQL
Infochimps Cloud includes MySQL or PostgreSQL for use cases in which a traditional RDBMS is the best choice, such as small- to medium-scale web applications, or exploratory data analysis and visualization using traditional or legacy tools designed around an RDBMS (R, SPSS, SAS, Tableau, etc.).

RDBMS technology is highly mature, and Infochimps supports virtually any plugins and client libraries / interfaces. If it works with MySQL / PostgreSQL, then it works with Infochimps Cloud.

Elasticsearch

Elasticsearch is an open source, Lucene-based, distributed, linearly-scalable database with a rich data model, optimized for full-text search. Elasticsearch is a great choice for housing data that users will query via a full-text search interface. Powerful query predicates, faceted search, geo queries, and time series are all natively supported by Elasticsearch and easily accessible through its APIs.

Elasticsearch is also schema-less. Different rows of the same “table” in Elasticsearch can have completely different document structure. New fields can be added to existing documents on the fly.

Elasticsearch’s rich document model means it can support the associations necessary to power a traditional web application.

There are several options for interacting with Elasticsearch:

• An HTTP-based, REST API, accessible from any web framework that allows for external API calls (including JSONP) or any programming language or tool that can make HTTP requests and parse JSON
• Using Apache Thrift from any programming language that supports it
• Using the memcached protocol
• Using the native Java API
• Via one of many client libraries
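For example, a full-text query over the HTTP REST API can be built with nothing but a JSON body. The host, index, and field names below are hypothetical, and the request is constructed but not sent:

```ruby
require "json"
require "net/http"

# Build (but do not send) an Elasticsearch full-text search request.
# Index name, field name, and host are invented for illustration.
query = {
  "query" => { "match" => { "body" => "coca cola" } },
  "size"  => 10
}
req = Net::HTTP::Post.new("/tweets/_search", "Content-Type" => "application/json")
req.body = JSON.generate(query)

# To actually send it against a live cluster:
# Net::HTTP.start("es.example.internal", 9200) { |http| http.request(req) }
```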

HBase

HBase is a distributed, scalable key-value database with a strong focus on data consistency and the ability to partition cleanly across data centers.

HBase is the right choice when data can naturally be sorted and indexed by a “row key”. Time series data keyed by time or user data keyed by user ID are both good use cases for HBase. HBase also couples very cleanly to the Hadoop framework, meaning MapReduce jobs run very efficiently on data stored in HBase.
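A common convention for such row keys (illustrative, not an HBase requirement) is to combine the entity with a fixed-width timestamp so rows sort chronologically under HBase’s lexicographic ordering:

```ruby
# Compose an HBase-style row key: metric name plus a zero-padded
# millisecond timestamp, so lexicographic order equals time order.
# The key layout is a common convention, invented here for illustration.
def series_row_key(metric, time)
  format("%s:%013d", metric, (time.to_f * 1000).round)
end

series_row_key("cpu", Time.at(0))  # => "cpu:0000000000000"
```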

There are several options for interacting with HBase:

• An HTTP-based, REST API, accessible from any web framework that allows for external API calls or any programming language or tool that can make HTTP requests and parse JSON
• Using Apache Thrift or Avro from any programming language that supports them
• Using the native Java API
• A command-line shell

MongoDB

MongoDB is a fast, analytics-oriented document store optimized for fast, in-place updates and rich queries. MongoDB is a great choice when the total amount of data to be stored is not large (<1TB). It is schema-less, so it’s ideal for situations where your schema is evolving over time. It’s also very fast at updating records, making it a good resource for monitoring usage or real-time analytics.

MongoDB’s rich document structure means it can support the associations necessary for a web application.

There are several options for interacting with MongoDB:
• A command-line shell
• Drivers for a large number of languages and frameworks

Using the Infochimps Cloud API
In addition to direct access to databases and the Ironfan configuration layer, Infochimps Cloud provides an HTTP-based API. With simple JSON commands, you can orchestrate dynamic cloud operations and access unified views into your data and applications.

For example, you might be creating a data flow which is gathering social media data on behalf of a client tracking the words “coca cola” and “pepsi”. If this user adds the tag “sprite” within your application, you might want the end result to be that Cloud::Streams automatically starts collecting data about “sprite”.
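Since the beta API’s commands are still evolving, the payload below is purely hypothetical, but it shows the shape of the interaction: the application posts a small JSON command, and Cloud::Streams reconfigures the flow accordingly.

```ruby
require "json"

# Hypothetical JSON command for the scenario above: the application
# asks the Cloud API to add "sprite" to the terms a flow is tracking.
# Endpoint, command name, and fields are invented for illustration.
def add_tracking_term(flow, term)
  JSON.generate({
    "command"   => "update_flow",
    "flow"      => flow,
    "add_terms" => [term]
  })
end

add_tracking_term("social_media", "sprite")
```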

The Cloud API is currently in beta. As such, new configuration commands are driven by customer needs and are expanding rapidly.

Elastic Hadoop and Large-Scale Batch Analytics

Perform Hadoop Processing

Every interaction with Hadoop occurs via a MapReduce job, which executes as a batch process over accumulated historical data. Infochimps Cloud exposes Hadoop through a variety of interfaces.

Designing Clusters

Infochimps Cloud provides clusters that are optimized for your Hadoop use cases. Clusters come in a variety of configurations, each available in the appropriate size for your problem:

• Permanent clusters: up 24/7, traditional model, shared execution environment, etc.
• Persisted, elastic clusters: only up when you need them, persist data between runs
• Volatile, elastic clusters: only up when you need them, empty each time
• Ad hoc clusters: for small teams, individuals, or testing

Native Hadoop Ecosystem
Infochimps Cloud currently supports Cloudera’s Distribution of Hadoop v3 (CDH3). In addition to the core HDFS and MapReduce components required to support Hadoop MapReduce job execution, Infochimps Cloud optionally includes installations of common ecosystem tools, including:

• Pig: a high-level data-flow language and execution framework
• Hive: a SQL-like interface for data summarization and ad hoc querying
• Zookeeper: a high-performance coordination service for Big Data processes
• Mahout: a machine learning and data mining library

Combined with the traditional Java interface, these ecosystem tools mean that experienced Hadoop developers will find the Cloud Hadoop experience on the Infochimps Platform meets their interface needs.

Using Wukong™ to Simplify Hadoop Application Development
Wukong can also simplify writing Hadoop applications: you write them in Ruby, and you can run them on your local machine first, against sample data, for faster iterations. For additional information, check out the open source version of Wukong on GitHub.

Example Wukong script: ./wukong_script.rb --run=


Optimizing and Monitoring a Job
Hadoop distributions provide a built-in JobTracker for coordinating the execution of jobs across a cluster. The JobTracker provides basic counters, progress meters, and job logs and is exposed by Infochimps Cloud.

Infochimps Cloud additionally includes a Hadoop job visualization dashboard built into Infochimps Command Center. This dashboard recapitulates much of the information available through the existing JobTracker but adds helpful visualizations that illustrate the job’s completion and can be used to optimize its execution. Command Center also provides complete system monitoring, including CPU utilization and load, network and disk IO, memory usage, and process monitoring.

Running Workflows
A workflow is a set of Hadoop jobs that run in sequence, with a given job’s execution dependent on prior jobs in the sequence.
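The dependency logic behind a workflow can be illustrated with a toy runner; job names are invented, and a real deployment would hand this job-ordering work to a dedicated workflow tool:

```ruby
# Toy illustration of workflow scheduling: run each job only after all
# of its dependencies have finished. Real deployments use a workflow
# tool such as Azkaban or Oozie; job names here are invented.
def run_workflow(deps)
  done = []
  until done.length == deps.length
    job = deps.keys.find { |j| !done.include?(j) && (deps[j] - done).empty? }
    raise "dependency cycle detected" unless job
    done << job  # a real runner would launch the Hadoop job here
  end
  done
end

run_workflow("ingest"    => [],
             "clean"     => ["ingest"],
             "aggregate" => ["clean"],
             "report"    => ["aggregate", "clean"])
# => ["ingest", "clean", "aggregate", "report"]
```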

There are several tools in the Hadoop ecosystem that help to define, manage, and execute workflows that Infochimps can optionally include as part of Cloud::Hadoop, including:
• Azkaban: simple batch scheduler for constructing and running Hadoop jobs or other offline processes
• Oozie: workflow/coordination system to manage Hadoop jobs

Case Study: Infomart™


Infomart is Canada’s largest provider of news and broadcast media monitoring, and financial and corporate data. Infomart’s goal is to continue to strengthen and extend these data offerings with social media monitoring and analytics capabilities. By coupling social media with their other data sources, Infomart moves toward a much more comprehensive service and achieves significant value-adds for their customer base.

Infomart uses the Infochimps Cloud API in conjunction with Cloud::Streams to link incoming Gnip social media content with the specific rules defined by Infomart customers. Infomart owns the relationship and rules definitions within Gnip, so by using the Infochimps Cloud API, Infomart can communicate configuration changes. Cloud::Streams uses these mappings to dynamically make appropriate connections to Gnip and to route Gnip data to the correct customer indices within Elasticsearch.

Infomart is writing a series of decorators that perform in-stream analytics in Cloud::Streams, such as real-time sentiment analysis; Wukong and the deploy pack are simplifying that process. Infomart is also writing Hadoop jobs that run against data in the Elasticsearch database or directly from HDFS. The Elasticsearch API allows for powerful, complex queries of data in the database, and Infomart uses this interface point for its applications and reporting views.

About Infochimps
Infochimps makes it faster and far less complex to build and manage Big Data applications and quickly deliver actionable insights. With Infochimps Cloud, enterprises benefit from the easiest way to develop and deploy Big Data applications in public, virtual private and private clouds.

Request a Free Demo

See Infochimps Cloud for Big Data.

Infochimps Cloud is a suite of enterprise-ready cloud services that make it simpler, faster and far less complicated to develop and deploy Big Data applications in public, virtual private and private clouds.

Our cloud services provide a comprehensive analytics platform, including data streaming, storage, queries and administration. With Infochimps, you focus on the analytics that drive your business insights, not building and managing a complex infrastructure.