
Real-Time Analytics

In situations where you need to make well-informed, real-time decisions, good data isn’t enough. It must be timely and actionable. The time window for data analysis is shrinking, and you need a different set of tools to get on-the-fly answers.


Transcript

Why Real-Time Analytics?

Infochimps™ Cloud: Using the right tool for each job
At Infochimps™, we abide by the philosophy that you should use the right tool for each job. Why lock in to one set of technologies or techniques? Depending on what you are trying to accomplish – the questions you want to ask of your data, or the applications and visualizations you build on top of that data – different technologies are best suited for each unique task. You should have all the best tools at your fingertips for each task. Infochimps excels at systems and technology integration — we can take your existing tools, add powerful new ones from our kit, and glue them together into a unified whole.

We also strongly embrace open source technologies as part of a complete data solution. Not only do you benefit from the active participation of the open source community — you aren’t limited to a proprietary vendor’s finite feature set and integration connectors. We use Hadoop, Elasticsearch, Ironfan, and Wukong, among other world-class open source tools that work flexibly with each other and the rest of the tools in your enterprise.

Explore the technology that enables real-time analytics and streaming data processing, and how it differs from the world of Hadoop and batch analytics.

The Hadoop & NoSQL conundrum
Hadoop is a powerful framework for Big Data analytics. It simplifies the analysis of massive data sets by distributing the computation load across many processes and machines. Hadoop embraces a map/reduce framework, which means analytics are performed as batch processes. Depending on the quantity of data and the complexity of the computation, a set of Hadoop jobs could take anywhere from a few minutes to many days to run. Batch analytics tool sets like Hadoop are great for one-off reports, a recurring schedule of periodic runs, or dedicated data exploration environments. However, waiting hours for the analysis you need means you cannot get real-time answers from your data. Hadoop analysis ends up being a rear-view mirror instead of a pulse on the moment.
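
To make the batch model concrete, here is a minimal sketch of the map/reduce pattern in plain Python. It only illustrates the idea; Hadoop runs the same two phases distributed across many machines, over the full dataset, as a batch job.

```python
# A minimal, illustrative map/reduce in plain Python -- the same
# pattern Hadoop distributes across a cluster as a batch process.
from collections import defaultdict

def map_phase(post):
    # Emit one (brand, 1) pair per mention, like a Hadoop mapper.
    for word in post["text"].lower().split():
        word = word.strip(".,?!")
        if word in ("coca-cola", "pepsi"):
            yield (word, 1)

def reduce_phase(pairs):
    # Sum the counts for each key, like a Hadoop reducer.
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

posts = [{"text": "Loving the new Coca-Cola ad."},
         {"text": "Pepsi or Coca-Cola? Coca-Cola."}]
pairs = (pair for post in posts for pair in map_phase(post))
print(reduce_phase(pairs))  # {'coca-cola': 3, 'pepsi': 1}
```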

NoSQL databases are extremely powerful, but come with certain challenges of their own
At Infochimps we use Hadoop to run map/reduce jobs against scalable, NoSQL data stores like HBase, Cassandra, or Elasticsearch. These databases are extremely good at enabling fast queries against many terabytes of data, but each makes certain tradeoffs to enable this ability. One major tradeoff, common to all three of these examples, is the inability to do SQL-like joins: combining data from one database table with data from another table.

The usual way we work around this tradeoff is to practice denormalization. Imagine we’re asking a question such as “Find all posts that contain the phrase ‘Coca-Cola’ from all authors based in Spokane, Washington”. In a traditional relational database queried with SQL, a table of “posts” would join against a table of “authors” using a shared key like an author’s ID number. In NoSQL databases, denormalization consists of inserting a copy of the author into each row of their posts. Rather than joining the posts table with the authors table at query time, as SQL would, all the authors’ data is already contained within the posts table before the query.
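
The difference is easy to see in miniature. The records below are hypothetical, but they show the same post stored both ways:

```python
# Normalized (relational style): the post references the author by ID,
# and answering the Spokane question requires a JOIN of two tables.
author = {"id": 42, "name": "Ana Ruiz", "city": "Spokane, Washington"}
post_normalized = {"id": 1001, "author_id": 42,
                   "text": "Nothing beats an ice-cold Coca-Cola."}

# Denormalized (NoSQL style): a copy of the author travels with the
# post, so the filter becomes a single-table predicate.
post_denormalized = {"id": 1001,
                     "text": "Nothing beats an ice-cold Coca-Cola.",
                     "author": {"id": 42, "name": "Ana Ruiz",
                                "city": "Spokane, Washington"}}

# "Posts mentioning Coca-Cola by authors in Spokane" without a join:
match = ("Coca-Cola" in post_denormalized["text"] and
         post_denormalized["author"]["city"].startswith("Spokane"))
```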

The question then becomes: when should the denormalization of our NoSQL database occur? One option is to use Hadoop to “backfill” denormalized data from normalized tables before running these kinds of queries. This approach is perfectly workable, but it suffers from the same “rear-view mirror” problem as Hadoop-based batch analytics — we still cannot perform complex queries on real-time data. What if we could write denormalized data on the fly: write each incoming Twitter post into a row in the posts table, and augment that row with information about the author in real time? This would keep all data denormalized at all times, always ready for downstream applications to run complex queries and generate rich, real-time business insights. Real-time analytics and stream processing make this possible.
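
A minimal sketch of that write path, with hypothetical stand-ins (author_cache, posts_table) for a real lookup store and a real NoSQL table:

```python
# Hypothetical on-the-fly denormalization: augment each incoming post
# with its author's profile before writing, so the posts table is
# always query-ready. `author_cache` and `posts_table` are stand-ins
# for a real lookup store and a real NoSQL table.
author_cache = {42: {"name": "Ana Ruiz", "city": "Spokane, Washington"}}
posts_table = []

def write_post(incoming):
    row = dict(incoming)
    # Enrich in the stream, not in a later Hadoop backfill.
    row["author"] = author_cache[incoming["author_id"]]
    posts_table.append(row)  # lands fully denormalized

write_post({"id": 1002, "author_id": 42,
            "text": "Coca-Cola sightings all over Spokane"})
```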

Real-time + Big Data = Stream Processing
In situations where you need to make well-informed, real-time decisions, good data isn’t enough. It must be timely and actionable. As a mutual fund operator, you can’t wait hours to analyze whether or not it’s the right moment to sell 200,000 stock shares. As CMO, you can’t wait days to see if there is a PR crisis occurring around your brand. The time window for data analysis is shrinking, and you need a different set of tools to get these on-the-fly answers.

Batch Versus Streaming
Consider two hypothetical sandwich makers. Each company makes great sandwiches, but chooses to deliver them to their customers either in batches or in near real-time.

The Batch Sub Shop can provide large quantities of sandwiches by leveraging many people to accomplish the overall project. Similarly, batch analytics can leverage multiple machines to accomplish a set of analytics jobs. By adding more resources, we can increase the speed with which the tasks are accomplished, but at a higher cost.

Contrast that with the Streaming Sub Shop, which doesn’t deliver a huge set of sandwiches all at once, but instead quickly creates sandwiches on the fly. The process aims to get a sandwich into the customer’s hand as soon as possible. Real-time analytics works the same way, processing data the moment it is collected. If the data is coming in too quickly, we can flexibly increase the resources that support our real-time workflow. Is the toasting process the bottleneck of our production line? We can easily add a couple of additional toasters.

As you can imagine, the ideal sandwich company probably combines the ability to cater large orders ahead of time with an in-store, made-to-order business. Likewise, your organization can leverage both batch analytics and real-time analytics depending on your business needs. Batch analytics is the most efficient way to process a large quantity of data in a non-time-sensitive manner. Real-time analytics and stream processing are the answer when the timeliness of your insights matters, when you need to scalably process a very large influx of live data, or when NoSQL databases cannot answer the questions you are asking.

How Does Real-Time Analytics Work?

1. Collect real-time data.
Real-time data is being generated all the time. If you are a mutual fund operator, it’s real-time stock price data. If you are a CMO, it’s real-time social media posts and Google search results. Typically this is live streaming data: the moment the stock price changes, we can grab that data point, like water from a running faucet. We collect live data by “hooking a hose up” to the faucet to capture that information in real time. Many terms exist for these “hoses”, including scrapers, collectors, agents, and listeners.
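
A collector can be as simple as a loop that subscribes to a feed and hands off each event the instant it arrives. In this hypothetical sketch, a generator stands in for the real faucet (a websocket, message bus, or vendor API):

```python
# Hypothetical collector: "hook a hose up" to a live feed and push
# each event onto a queue the moment it appears.
import queue
import random
import time

def price_feed(symbol):
    price = 100.0
    while True:
        time.sleep(0.1)                     # ticks arrive continuously
        price += random.uniform(-0.5, 0.5)  # simulated market movement
        yield {"symbol": symbol, "price": round(price, 2),
               "ts": time.time()}

events = queue.Queue()

def collect(feed, out_queue, limit=5):
    # Hand each event off immediately; no batching, no waiting.
    for tick in feed:
        out_queue.put(tick)
        if out_queue.qsize() >= limit:
            break  # bounded here only so the example terminates

collect(price_feed("ACME"), events)
```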

2. Process the data as it flows in.
The key to real-time analytics is that we cannot wait until later to act on our data; we must analyze it instantly. Stream processing (also known as streaming data processing) is the term for performing operations on data the instant it is collected. Actions you can perform in real time include splitting data, merging it, doing calculations, connecting it with outside data sources, forking data to multiple destinations, and more.
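
In code, these actions are just functions applied to each record as it passes through. Continuing the hypothetical collector sketch above:

```python
# Hypothetical per-event pipeline: each stage runs the moment a
# record passes through, rather than waiting for a batch window.
def parse(raw):
    return {"symbol": raw["symbol"], "price": raw["price"], "ts": raw["ts"]}

def augment(event, reference={"ACME": "Acme Corp."}):
    # Join-like enrichment against an outside data source.
    event["company"] = reference.get(event["symbol"], "unknown")
    return event

def fork(event, sinks):
    # Send the same event to multiple destinations.
    for sink in sinks:
        sink(event)

dashboard, archive = [], []
while not events.empty():  # `events` was filled by the collector above
    fork(augment(parse(events.get())), [dashboard.append, archive.append])
```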

3. Reports and dashboards access processed data.
Now that the data has been processed, it is reliably delivered to the databases that power your reports, dashboards, and ad-hoc queries. Just seconds after the data was collected, it is visible in your charts and tables. Since real-time analytics and stream processing are flexible frameworks, you can use whatever tools you prefer, whether that’s Tableau, Pentaho, GoodData, a custom application, or something else. Integration is Infochimps’ forte.
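
For illustration, here is one hypothetical delivery step, shown with the elasticsearch-py client (Elasticsearch being one of the stores named in this paper); any reporting database your dashboards query would fit the same slot:

```python
# Hypothetical delivery step: land each processed event in the store
# that powers your reports and dashboards.
from datetime import datetime, timezone
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # adjust for your cluster

def deliver(event):
    event["indexed_at"] = datetime.now(timezone.utc).isoformat()
    # v8 client signature; older 7.x clients use body= instead.
    es.index(index="ticks", document=event)
```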

What Can You Do With Stream Processing?

Augment
• Enhance your sales leads – IP addresses of visitors to your website are augmented with the “company name” associated with that visitor when they come from an enterprise. Email addresses are linked to Twitter and Facebook handles to help your sales team leverage social selling.
• Real-time social media analytics – tweets that mention the brands you are tracking are augmented with a sentiment score (how positive or negative the comment was) and an influencer score (such as Klout). Know immediately if positive news breaks or a PR crisis arises, and gain instant insight into how influential people are and on what topics. (A naive sentiment-scoring sketch follows this list.)

Process and Transform
• On-the-fly analytics reporting – Reformat a tweet on the fly to fit an agency’s data model, so the data is visible in its reporting application immediately upon landing in the database.
• SQL-like data queries – Implement a denormalization policy to allow complex JOIN-like queries in real time in downstream analytics applications.
• Stock price algorithms – Implement your stock analyzer algorithm mid-stream. The instant an updated stock price is received, the data is run through the algorithm and placed in your reporting database.
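
As promised above, here is a deliberately naive sketch of sentiment augmentation; a production system would swap in a real sentiment model and an influencer API:

```python
# Deliberately naive sentiment augmentation, to show the shape of the
# operation: each tweet is scored the moment it arrives.
POSITIVE = {"love", "great", "awesome"}
NEGATIVE = {"hate", "awful", "broken"}

def augment_tweet(tweet):
    words = set(tweet["text"].lower().split())
    tweet["sentiment"] = len(words & POSITIVE) - len(words & NEGATIVE)
    return tweet

augment_tweet({"text": "I love the great new Coca-Cola ad"})
# -> {'text': ..., 'sentiment': 2}
```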

Calculate
• Usage monitoring – Track the number of social media posts mentioning your client company’s brand. See at any given moment how much a brand is buzzing, and even set up tiered pricing based on how many social posts you are collecting on a client’s behalf.
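
A hypothetical usage monitor can be as small as a sliding-window counter, updated as each post streams in:

```python
# Hypothetical usage monitor: count brand mentions inside a sliding
# one-minute window, updated as each post streams in.
import time
from collections import deque

WINDOW_SECONDS = 60
mentions = deque()  # timestamps of posts that matched the brand

def record_post(post, brand="Coca-Cola"):
    now = time.time()
    if brand.lower() in post["text"].lower():
        mentions.append(now)
    while mentions and now - mentions[0] > WINDOW_SECONDS:
        mentions.popleft()  # expire events older than the window
    return len(mentions)    # current buzz for the brand

record_post({"text": "Coca-Cola just launched a new flavor"})
```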

Real-time analytics with the Infochimps Cloud

Cloud::Streams
It is no longer enough simply to perform historical analysis and run batch reports. In situations where you need to make well-informed decisions in real time, the data and insights must also be timely and immediately actionable. Cloud::Streams lets you process data as it flows into your application, powering real-time dashboards and on-the-fly analytics and delivering data seamlessly to Hadoop clusters and NoSQL databases. Single-purpose ETL solutions are rapidly being replaced with multi-node, multi-purpose data integration platforms — the universal glue that connects systems together and makes Big Data analytics feasible.

Cloud::Streams is a linearly scalable, fault-tolerant distributed routing framework for data integration, collection, and streaming data processing. Ready-to-go integration connectors allow you to tap into virtually any internal or external data source that your application needs.

Infochimps Cloud::Streams Benefits
• Easily integrate with virtually any data source, both live/in-motion and bulk/at-rest
• Process data as it flows, at scale – not only generating real-time insights, but also delivering data that has already been cleaned, transformed, and augmented/enhanced to databases and Hadoop clusters
• Solve any business use case, with support for business logic of any complexity and parallel stream computing
• Write your analytics once with Wukong, then run them both in real time with Cloud::Streams and in batch with Cloud::Hadoop (see the sketch below)
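
The schematic below shows the write-once idea in plain Python; it is not Wukong’s actual DSL (Wukong is a Ruby toolkit), just the shape of the pattern: the analytic is a pure per-record function, so a batch driver and a stream driver can share it unchanged.

```python
# Schematic of "write your analytics once" -- plain Python, not
# Wukong's actual Ruby DSL.
def analyze(record):
    # Hypothetical analytic: flag any tick priced above a threshold.
    return {"symbol": record["symbol"], "alert": record["price"] > 105}

def run_batch(dataset):
    # Cloud::Hadoop-style: process the whole dataset at once.
    return [analyze(record) for record in dataset]

def run_stream(feed, emit):
    # Cloud::Streams-style: process each record the moment it arrives.
    for record in feed:
        emit(analyze(record))

run_batch([{"symbol": "ACME", "price": 103.2}])
run_stream(iter([{"symbol": "ACME", "price": 106.9}]), print)
```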

Request a Free Demo

See Infochimps Cloud for Big Data.

Infochimps Cloud is a suite of enterprise-ready cloud services that make it simpler and faster to develop and deploy Big Data applications in public, virtual private, and private clouds.

Our cloud services provide a comprehensive analytics platform, including data streaming, storage, queries, and administration. With Infochimps, you focus on the analytics that drive your business insights, not on building and managing complex infrastructure.