Big Data Introduction

Big data is the IT industry’s hottest buzzword. Everyone from developers to decision makers, from small startups to big names, is dealing with it.

Many online resources present complex theories, but in simple terms, what is big data, and what makes it so useful that the industry is crazy about it?

As per Gartner’s updated 2012 definition, “Big data are high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization.”

In simple terms, big data is data that is too large, moves too fast, and doesn’t fit into the structure of a relational database. Its size runs into terabytes or petabytes.

Big data is characterized by the three V’s: volume, velocity, and variety. Volume is the size of the data (terabytes or petabytes); velocity is the speed at which data is generated, such as a continuous stream; variety is the type of data, such as video files, Twitter feeds, page clicks, and log files.

Companies use big data techniques to extract information from large amounts of data quickly. If processing the data and extracting meaningful information from it takes too long, that information may lose its value.

For example, consider product suggestions on an e-commerce website. The site that displays the right product suggestions to the customer at the right time gains the most. A suggestion is no longer useful later, say after a minute, because by then the user might have left or lost interest.

Other uses of big data include processing the video feed of shoppers in a department store to extract information such as a shopper’s facial expression and mood when picking up a particular product; identifying fraudulent transactions among the millions of transactions happening every second; and scanning Twitter feeds and acting on the information relevant to a particular user.

Big data can be used in a huge number of ways. It is intended to make the online world more secure (fraud detection) and smarter (real-time suggestions and analysis).

Recommended learning: http://cdn.oreillystatic.com/oreilly/radarreport/0636920025559/Planning_for_Big_Data.pdf

Cassandra Quick Basics

Cassandra is a decentralized NoSQL database. It runs on a multi-node cluster in which every node is identical to every other node (server symmetry: all nodes have the same features). Unlike Hadoop, it has no master node, hence no single point of failure.

A few features/terms

  • Elastic scalability: able to scale, up or down, dynamically without restart or disruption of services.
  • Consistency level: decides how many replicas must acknowledge a read or write before the operation is considered successful.
  • Tunable consistency (strict, causal, and weak): stronger consistency comes at the cost of availability.
  • Stores data in a multidimensional hash table.
  • Schema free: model the queries you need first, then organize the data around them.
  • Designed to take advantage of multiprocessor/multicore machines.
  • Optimized for excellent write throughput.
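
To make the consistency-level idea concrete, here is a minimal sketch of the usual quorum arithmetic: with N replicas, a read that waits for R replicas and a write that waits for W replicas are guaranteed to overlap when R + W > N. The class and method names are made up for illustration and are not part of any Cassandra API.

```java
public class ConsistencySketch {

    // Quorum size for a replication factor: a strict majority of the replicas.
    public static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    // The read set and the write set must share at least one replica
    // when R + W > N, so a read sees the latest acknowledged write.
    public static boolean isStronglyConsistent(int readReplicas, int writeReplicas, int replicationFactor) {
        return readReplicas + writeReplicas > replicationFactor;
    }

    public static void main(String[] args) {
        int n = 3;          // hypothetical replication factor
        int q = quorum(n);  // majority of 3 replicas

        System.out.println("Quorum for N=3: " + q);
        System.out.println("R=1, W=1 overlaps: " + isStronglyConsistent(1, 1, n));
        System.out.println("R=quorum, W=quorum overlaps: " + isStronglyConsistent(q, q, n));
        System.out.println("R=1, W=all overlaps: " + isStronglyConsistent(1, n, n));
    }
}
```

Lower values of R and W let an operation succeed while more nodes are down, which is the availability side of the trade-off in the list above.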

Data Model

Hunch’s “Taste Graph”

Hunch is a recommendation platform that analyzes a user’s activity trends and lists recommendations on all sorts of topics. It builds a map of user preferences and calls it the “Taste Graph”. It’s a sort of prediction tool that improves constantly as you use it (you like something, provide ratings, check in, etc.).

It’s somewhat similar to companies, such as Lifestyle, that offer you a shopping card and in turn track your purchases. What do the companies gain? They gain invaluable data: your shopping preferences, trends, etc., on which they base their market strategy. Hunch’s approach is similar, except that instead of discounts Hunch offers recommendations, and it makes a profit by selling your data to companies. It seems a great business model, as data is one of the most highly valued commodities.

I’m eager to try Hunch and see how the Taste Graph evolves and how well it recommends 😉

Mockito: Java Unit Testing with Mock Objects

Mockito is an open source testing framework for Java. The framework allows the creation of test double objects, called “mock objects”, in automated unit tests for the purpose of Test-Driven Development (TDD) or Behavior-Driven Development (BDD).

Compared to EasyMock, Mockito seems easier to use and more flexible. First, it can mock classes as well as interfaces. You don’t need additional JAR libraries. Furthermore, there is no replay mode: you first stub your mocked classes or interfaces and afterwards verify them. Further advantages can be read here.
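
The stub-then-verify flow can be sketched as follows. This is a minimal illustration that assumes Mockito is on the classpath; the class name `MockitoSketch` and the mocked `List` are chosen only for the example.

```java
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.verify;
import static org.mockito.Mockito.when;

import java.util.List;

public class MockitoSketch {

    @SuppressWarnings("unchecked")
    public static String stubThenVerify() {
        // Create a mock of an interface; a concrete class can be mocked the same way.
        List<String> names = mock(List.class);

        // 1. Stub: define what the mock returns for a given call.
        when(names.get(0)).thenReturn("first");

        // 2. Exercise: use the mock directly, with no replay step in between.
        String result = names.get(0);

        // 3. Verify: check afterwards that the expected interaction happened.
        verify(names).get(0);

        return result;
    }

    public static void main(String[] args) {
        System.out.println(stubThenVerify());
    }
}
```

If `names.get(0)` had never been called, the `verify` line would fail the test, which is how the verification step catches missing interactions.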
