Wednesday, November 6, 2013

R Installation and Usage for Data Analysis


Introduction to R


Today I will give a brief introduction to R, a language that has recently become popular.
R is a very effective tool for statistics and graphics.
There are established statistical programs such as SAS and SPSS.
However, R is open source and has a large number of add-on packages.



The R project site is http://r-project.org.
There you can download R itself, documentation, add-on packages, and more.

If you need to search for R-related material, visit http://rseek.org.

Using R for Big Data

There is one thing to consider when R is used with Hadoop for big data processing.
By default, R runs on a single core and keeps its data in memory.
In other words, it is not designed for a distributed environment.

For this reason, some vendors combine R with Hadoop through R-Hadoop or R-Hive.



R Installation

Installing R is very simple.
Download the installer from the R project site. Installers are available for Mac, Windows, and Linux.


Now you can try programming in R.


Saturday, November 2, 2013

Real-Time Technology for Big Data



Hadoop is used to process big data.
However, as the need for real-time processing grows, there is growing interest in in-memory technology.

In the past, OLTP databases were used for real-time processing.
In the big data era, we need a new approach to handle data streams quickly.

Google, for example, processes huge amounts of data in a short time with Dremel.
Let's look at some technologies for real-time processing.

Redis

Redis is BSD-licensed open source software, and its name stands for "Remote Dictionary Server".
It is classified as a NoSQL database because it is a key-value store.
It is also used as a message queue and as shared memory.
So it is sometimes used for real-time processing.
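
To make the key-value and message-queue ideas concrete, here is a minimal sketch in Python. It is only a toy that runs inside one process, not the Redis client API. The class name TinyStore and its methods are hypothetical, loosely modeled on the Redis commands SET, GET, LPUSH and RPOP.

```python
from collections import deque


class TinyStore:
    """Toy in-memory key-value store (hypothetical, not the Redis API)."""

    def __init__(self):
        self._values = {}   # plain values, keyed by name
        self._lists = {}    # list values, used here as queues

    def set(self, key, value):
        self._values[key] = value

    def get(self, key):
        # Return None for a missing key
        return self._values.get(key)

    def lpush(self, key, item):
        # Push onto the head of the list stored at key
        self._lists.setdefault(key, deque()).appendleft(item)

    def rpop(self, key):
        # Pop from the tail, so lpush + rpop behaves as a FIFO queue
        queue = self._lists.get(key)
        return queue.pop() if queue else None


store = TinyStore()
store.set("page:home:hits", 42)
store.lpush("jobs", "resize-image")
store.lpush("jobs", "send-email")
first_job = store.rpop("jobs")   # the oldest item comes out first
```

Because everything lives in memory, reads and writes are fast, which is why this kind of store fits real-time work.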

This is an example architecture using node.js and Redis. (Source: http://simonhampshire.wordpress.com/)

It is similar to memcached in that both are in-memory technologies, but unlike memcached, Redis can also save its data to disk.

Apache Kafka


Kafka was created by LinkedIn to improve its log and tracking systems.
It is now an Apache project.

It is a publish/subscribe messaging system and is often used in conjunction with Hadoop.
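
The publish/subscribe idea can be sketched in a few lines of Python. This is a toy that runs inside one process, not the Kafka API. The Broker class and the consumer names are hypothetical. Like Kafka, it keeps an append-only log per topic and lets each consumer read from its own offset.

```python
from collections import defaultdict


class Broker:
    """Toy in-process broker (hypothetical, not the Kafka API)."""

    def __init__(self):
        self._logs = defaultdict(list)     # topic -> append-only message log
        self._offsets = defaultdict(int)   # (consumer, topic) -> next index

    def publish(self, topic, message):
        self._logs[topic].append(message)

    def poll(self, consumer, topic):
        # Return the messages this consumer has not seen yet
        start = self._offsets[(consumer, topic)]
        messages = self._logs[topic][start:]
        self._offsets[(consumer, topic)] = len(self._logs[topic])
        return messages


broker = Broker()
broker.publish("page-views", "/home")
broker.publish("page-views", "/about")

# Two independent subscribers each receive every message
hadoop_loader = broker.poll("hadoop-loader", "page-views")
dashboard = broker.poll("dashboard", "page-views")
```

Since publishers and subscribers only share the topic name, a batch loader and a real-time consumer can read the same stream without knowing about each other.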


I found that Netflix processes its log data using Kafka.





Esper

I believe that the best technology for in-memory, real-time processing is CEP (Complex Event Processing).
Oracle and SAP are now focusing on CEP technology.

Esper is an open source CEP engine. It is configured with EPL (Event Processing Language), an SQL-like scripting language.


CEP is a technology for filtering specific events out of a large number of events.
It performs real-time processing using an event-driven architecture.
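
As a rough illustration of filtering specific events from a stream, here is a minimal sketch in plain Python. It does not use Esper or EPL. The function name, its parameters and the sample temperature data are all hypothetical. It raises an alert when enough readings above a threshold fall inside a short time window.

```python
from collections import deque


def detect_bursts(events, threshold, window_seconds, min_count):
    """Yield a timestamp whenever min_count readings above threshold
    fall within the last window_seconds (all names are hypothetical)."""
    recent = deque()  # timestamps of matching events inside the window
    for timestamp, value in events:
        if value <= threshold:
            continue  # filter out uninteresting events
        recent.append(timestamp)
        # Drop matches that are too old to be in the window
        while timestamp - recent[0] > window_seconds:
            recent.popleft()
        if len(recent) >= min_count:
            yield timestamp


# (timestamp in seconds, temperature) pairs, ordered by time
stream = [(0, 20), (1, 95), (2, 30), (3, 97), (4, 99), (20, 98)]
alerts = list(detect_bursts(stream, threshold=90, window_seconds=5, min_count=3))
```

A CEP engine lets you declare this kind of rule as a query instead of hand-writing the loop, but the underlying idea of matching patterns over a moving window of events is the same.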

Esper has an architecture like this.


Sometimes Storm is used to get data into Esper.