Monday, 19 August 2013

Big Data, MapReduce, Hadoop and related terms


Big Data is the term used to describe a volume of structured and unstructured data so large that it is difficult to process using traditional database and software techniques.
Example: Facebook hosts approximately 10 billion photos, taking up one petabyte of storage.
The real issue is not acquiring and storing big data; what matters is what you do with the acquired data.

Big Data analytics
With Big Data and Big Data analytics it is possible to:
Analyze millions of stock keeping units (SKUs) to determine optimal prices that maximize profit and clear inventory.
Recalculate entire risk portfolios in minutes and understand future possibilities to mitigate risk.
Mine customer data for insights that drive new strategies for customer acquisition, retention, campaign optimization and next best offers.
Quickly identify customers who matter the most.
Generate retail coupons at the point of sale based on the customer's current and past purchases, ensuring a higher redemption rate.
Send tailored recommendations to mobile devices at just the right time, while customers are in the right location to take advantage of offers.
Analyze data from social media to detect new market trends and changes in demand.
Use click stream analysis and data mining to detect fraudulent behavior.
Determine root causes of failures, issues and defects by investigating user sessions, network logs and machine sensors.
Source
About 90% of the world's data was generated in the last two years or so.
All this new data is coming from smartphones, social networks, trading platforms etc.
This data might be structured, semi-structured or unstructured (the majority).

Map Reduce
A team at Google designed an algorithm to help them process the massive amounts of data they were acquiring.
Large data calculations are chopped into smaller chunks and mapped to many computers; when the calculations are done, the partial results are brought back together to produce the resulting data set. This is called MapReduce.
This algorithm was later used to develop an open source project called Hadoop.
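
The chop, map and bring-back steps described above can be sketched in plain Java. This is only an in-memory illustration of the idea, not Hadoop's API: the class name MapReduceSketch, its methods and the word-count task are assumptions made for the example, and everything runs on one machine (Java 8 or later).

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-machine illustration of the map and reduce steps.
final class MapReduceSketch {

    // Map step: turn one chunk of text into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String chunk) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : chunk.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Reduce step: sum the counts emitted for each word.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            totals.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return totals;
    }

    // Driver: each chunk stands in for the share of data one computer would get.
    static Map<String, Integer> wordCount(List<String> chunks) {
        List<Map.Entry<String, Integer>> all = new ArrayList<>();
        for (String chunk : chunks) {
            all.addAll(map(chunk));
        }
        return reduce(all);
    }

    public static void main(String[] args) {
        // Prints {big=2, clusters=1, data=1}
        System.out.println(wordCount(Arrays.asList("big data", "big clusters")));
    }
}
```

In a real cluster the map calls would run on different machines and the pairs would be grouped by key before reduction; here a simple loop plays both roles.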









Driving force

The two factors pushing businesses to look into Hadoop are:
        1. data > 10 TB
        2. High calculation complexity

Hadoop plays a central role in:
     1. Statistical analysis
     2. ETL processing
     3. Business intelligence






In pioneer days they used oxen for heavy pulling, and when one ox couldn't budge a log, they didn't try to grow a larger ox. We shouldn't be trying for bigger computers, but for more systems of computers. ~ Grace Hopper


                                                 
Thanks for reading....cheers :-) :-)


Sunday, 18 August 2013

Apache Mahout ~ Setting Up a Recommendation Engine on Ubuntu





Apache Mahout is an Apache project that implements scalable machine learning algorithms.

At a high level, Apache Mahout mainly deals with the following three areas.

1. Recommendation Engine.
2. Clustering.
3. Classification.

Recommendation Engine: A recommendation engine analyzes user data (preferences or ratings) and recommends products, friends etc. that the users did not know about before.

Ex: Facebook suggesting friends, Amazon suggesting products the user might be interested in.

Clustering: This family of machine learning techniques groups entities based on particular characteristics or features.

Ex: Google News groups the stories related to a news item and presents them to the user.

Classification: This part helps in automatically classifying documents and images, implementing spam filters, and can be extended to various domains.

Ex: Yahoo Mail identifying spam.

This post covers the Mahout recommendation part and the steps to set it up on Ubuntu.


Mahout mainly focuses on collaborative filtering techniques for recommendation. Every user has different tastes and preferences, but they tend to follow some patterns. The recommendation engine tries to identify these patterns in the existing data, predicts what the users may like or prefer, and makes recommendations accordingly.

Mahout currently implements a collaborative filtering engine that supports user-based, item-based and Slope One recommender systems. Other algorithms available in Mahout include k-means, fuzzy k-means clustering, Canopy, Dirichlet and Mean-Shift. The other approach available is content-based recommendation, but when content is considered, a generic recommendation engine may not be effective. For example, a recommendation engine built to recommend pizzas by considering toppings, cheese content etc. cannot be reused to recommend books or anything in another domain. So Mahout mainly emphasizes collaborative filtering.

Installation on Ubuntu:

Prerequisites:

a. Java version 1.6 or later
b. Maven installed
c. Hadoop downloaded ( http://www.apache.org/dyn/closer.cgi/hadoop/common/ )
d. JAVA_HOME and HADOOP_HOME set (HADOOP_HOME should point to the extracted directory of the download from step c)

(Mahout's collaborative filtering is implemented on top of Apache Hadoop using the map/reduce paradigm, hence step d.)

Steps :

1. Download the stable version of Mahout   
         https://cwiki.apache.org/confluence/display/MAHOUT/Downloads
2. Unpack the above download and navigate to the mahout directory
3. Run mvn install (the build should succeed)



Executing an Example:

The data is usually fed from a file, or it can be read directly from database tables using JDBC data models. In the following example the data is read from a file.

Each line of the input file has the format userID,itemID,rating. If user 101 rates a book with ID 450 with a rating of 4, the entry in the input file would be

101,450,4.0

Create the input file, call it input.dat, and fill it with some relevant data, for example:

1,101,5.0
1,102,3.0
1,103,2.5
2,101,2.0
2,102,2.5
2,103,5.0
2,104,2.0
3,101,2.5
3,104,4.0
3,105,4.5
3,107,5.0
4,101,5.0
4,103,3.0
4,104,4.5
4,106,4.0
5,101,4.0
5,102,3.0
5,103,2.0
5,104,4.0
5,105,3.5
5,106,4.0
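
To make the userID,itemID,rating layout concrete, here is a small plain-Java sketch that reads such lines into a nested map. It does not use Mahout's own file handling; the class name RatingsParser and its parse method are assumptions made for this illustration (Java 8 or later).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: parses "userID,itemID,rating" lines into user -> (item -> rating).
final class RatingsParser {

    static Map<Long, Map<Long, Double>> parse(Iterable<String> lines) {
        Map<Long, Map<Long, Double>> ratingsByUser = new LinkedHashMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines
            }
            String[] parts = trimmed.split(",");
            if (parts.length != 3) {
                throw new IllegalArgumentException(
                        "Expected userID,itemID,rating but got: " + line);
            }
            long userId = Long.parseLong(parts[0].trim());
            long itemId = Long.parseLong(parts[1].trim());
            double rating = Double.parseDouble(parts[2].trim());
            // Create the per-user map on first sight, then store the rating.
            ratingsByUser.computeIfAbsent(userId, id -> new LinkedHashMap<>())
                         .put(itemId, rating);
        }
        return ratingsByUser;
    }

    public static void main(String[] args) {
        Map<Long, Map<Long, Double>> ratings =
                parse(java.util.Arrays.asList("1,101,5.0", "1,102,3.0", "2,101,2.0"));
        // Prints {1={101=5.0, 102=3.0}, 2={101=2.0}}
        System.out.println(ratings);
    }
}
```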


One more file is supplied to the command: recommend.dat. This file contains just the ID of the user for whom recommendations are wanted (e.g. 101; with the sample data above, where user IDs run from 1 to 5, use one of those instead).

Now run the following command on the terminal

$ bin/mahout recommenditembased --input input.dat --usersFile recommend.dat --numRecommendations 2 --output output/ --similarityClassname SIMILARITY_PEARSON_CORRELATION

The similarity class in the above command can be changed to any of the following:

SIMILARITY_PEARSON_CORRELATION, SIMILARITY_COOCCURRENCE, SIMILARITY_LOGLIKELIHOOD, SIMILARITY_TANIMOTO_COEFFICIENT, SIMILARITY_CITY_BLOCK, SIMILARITY_COSINE, SIMILARITY_EUCLIDEAN_DISTANCE
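
To give a feel for what a similarity measure computes, here is a minimal plain-Java sketch of the textbook Pearson correlation between the ratings of two items. It is not Mahout's implementation, whose details may differ; the class name PearsonSketch and its method are assumptions made for this example. The two arrays hold the ratings of items 101 and 102 from the sample input.dat, restricted to the users who rated both (users 1, 2 and 5).

```java
// Hypothetical illustration of Pearson correlation; not Mahout's own code.
final class PearsonSketch {

    // Pearson correlation of two equally long rating vectors.
    // Returns NaN when either vector has no variance, where the value is undefined.
    static double pearson(double[] x, double[] y) {
        if (x.length != y.length || x.length == 0) {
            throw new IllegalArgumentException("vectors must be non-empty and of equal length");
        }
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < x.length; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= x.length;
        meanY /= y.length;

        double covariance = 0.0;
        double varianceX = 0.0;
        double varianceY = 0.0;
        for (int i = 0; i < x.length; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            covariance += dx * dy;
            varianceX += dx * dx;
            varianceY += dy * dy;
        }
        if (varianceX == 0.0 || varianceY == 0.0) {
            return Double.NaN;
        }
        return covariance / Math.sqrt(varianceX * varianceY);
    }

    public static void main(String[] args) {
        double[] item101 = {5.0, 2.0, 4.0}; // ratings by users 1, 2 and 5
        double[] item102 = {3.0, 2.5, 3.0}; // ratings by the same users
        System.out.println(pearson(item101, item102));
    }
}
```

A value close to 1 means the two items tend to be rated alike by the same users, which is the signal an item-based recommender builds on.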



cheers....Thanks for Reading....... :-) :-)



Thursday, 15 August 2013

Need for NoSQL Databases



First things first

It's good to understand the following terms before looking at the motivations or need for NoSQL databases.

  • RDBMS
  • Joins, multi-table transactions
  • ACID (Atomicity, Consistency, Isolation, Durability)
  • Scale up, scale out
  • Vertical scaling, horizontal scaling
  • Big Users, Big Data, cloud computing
  • Structured data, unstructured data
  • Distributed systems
  • Usage profiles
  • CAP (Consistency, Availability, Partition tolerance) theorem
  • Key challenges with RDBMS
      - Complex joins on large data sets become a bottleneck
      - As the DB scales to more machines:
          - Distributed transaction management across nodes
          - Lock management challenges
          - Failure (single point of failure)
          - Joining of tables across nodes


What is NoSQL?

  • It is not "Never SQL"
  • It is not "No To SQL"
  • It is "Not Only SQL"


Motivation for NoSQL:


Big Users

  • With the extensive use of the internet, smartphones and tablets, and the increased number of usage hours:
      - A newly launched app can go viral, growing from zero to a million users overnight, literally.
      - Some users are active frequently, while others use an app a few times, never to return.
      - Seasonal swings like those around Christmas or the festive season create spikes for short periods.
The large number of users, combined with the dynamic nature of usage patterns, is driving the need for more easily scalable database technology.

Big Data

  • Data is becoming easier to capture and access, for example:
      - Facebook
      - Personal user information
      - Geolocation data
      - Social graphs
  • Big data is characterized by volume, velocity and variety.



 Cloud Computing

Today, most applications use a three-tier Internet architecture, run in a public or private cloud, and support large numbers of users. In the cloud, a load balancer directs the incoming traffic to a scale-out tier of web/application servers that process the logic of the application. The scale-out architecture at the web/application tier works fine.
At the database tier, relational databases were originally the popular choice. Their use became increasingly problematic, however, because they are a centralized, share-everything technology that scales up rather than out.

Scale out and Scale up statistics:










The two main factors to consider when choosing a NoSQL database:

1. The system is distributed and can live with the trade-offs described by the CAP theorem rather than the ACID guarantees of an RDBMS.
2. The database tier needs horizontal scaling (scaling out).


Types of NoSQL Databases:

There have been various approaches to classifying NoSQL databases, each with different categories and subcategories. Because of the variety of approaches and the overlaps in nonfunctional requirements and feature sets, it can be difficult to get and maintain an overview of the nonrelational database market. Nevertheless, the most basic classification, and the one most would agree on, is based on the data model. A few of these categories and their prototypes are:
  • Column: HBase, Cassandra, Accumulo
  • Document: MongoDB, Couchbase, Raven
  • Key-value: Dynamo, Riak, Azure, Redis, Caché, GT.M
  • Graph: Neo4j, Allegro, Virtuoso, Bigdata


Thanks for reading...cheers :-)









Spring MVC set up using Maven in no time





Follow the steps below to get a basic Spring MVC setup done using Maven.


1. Create the following folders on one of your drives:

[project] -> src -> main -> java: the folder for the source code.
[project] -> src -> main -> resources: the folder for the resources, XMLs, property files etc.
[project] -> src -> main -> webapp: the root folder for the web application; it will contain the web content, pages, CSS, JavaScript files etc. and the WEB-INF folder.
[project] -> src -> test -> java: the folder for the test sources, JUnit test classes and anything related to JUnit testing.
[project] -> src -> test -> resources: the folder which will contain all resources for our tests.

2. Add the following pom.xml to the project folder.

pom.xml

<?xml version="1.0" encoding="UTF-8"?>
<project xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd" xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <modelVersion>4.0.0</modelVersion>

  <groupId>springtest</groupId>
  <artifactId>springTest</artifactId>
  <version>1.0-SNAPSHOT</version>
  <packaging>war</packaging>
    
  <name>Our Spring test project</name>
  <properties>
        <org.springframework.version>3.0.5.RELEASE</org.springframework.version>
        <javax.servlet.jstl.version>1.2</javax.servlet.jstl.version>
    </properties>
   
  <dependencies>
       <dependency>
            <groupId>org.springframework</groupId>
            <artifactId>spring-context</artifactId>
            <version>${org.springframework.version}</version>
       </dependency>
      

        <dependency>
            <groupId>org.springframework</groupId>
            <artifactId>spring-webmvc</artifactId>
            <version>${org.springframework.version}</version>
        </dependency>
        <dependency>
            <groupId>org.springframework</groupId>
            <artifactId>spring-web</artifactId>
            <version>${org.springframework.version}</version>
        </dependency>
        <dependency>
            <groupId>org.springframework</groupId>
            <artifactId>spring-test</artifactId>
            <version>${org.springframework.version}</version>
            <scope>test</scope>
        </dependency>
       
        <dependency>
            <groupId>javax.servlet</groupId>
            <artifactId>jstl</artifactId>
            <version>${javax.servlet.jstl.version}</version>
        </dependency>
         <dependency>
            <groupId>log4j</groupId>
            <artifactId>log4j</artifactId>
            <version>1.2.16</version>
        </dependency>
        
         <dependency>
           <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.8.1</version>
            <scope>test</scope>
         </dependency>
  </dependencies>

  <build>

    <plugins>

            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <configuration>
                    <source>1.6</source>
                    <target>1.6</target>
                    <encoding>UTF-8</encoding>
                </configuration>
                <inherited>true</inherited>
            </plugin>
           
    </plugins>
     
     
      <resources>
            <resource>
                <directory>src/main/resources</directory>
                <filtering>true</filtering>
            </resource>
        </resources>
        <testResources>
            <testResource>
                <directory>src/test/resources</directory>
                <filtering>true</filtering>
            </testResource>
            <testResource>
                <directory>src/main/resources</directory>
                <filtering>true</filtering>
            </testResource>
        </testResources>
  </build>
</project>



3. Create web.xml in src/main/webapp/WEB-INF

<?xml version="1.0" encoding="UTF-8"?>
<web-app xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://java.sun.com/xml/ns/javaee" xmlns:web="http://java.sun.com/xml/ns/javaee/web-app_2_5.xsd" xsi:schemaLocation="http://java.sun.com/xml/ns/javaee http://java.sun.com/xml/ns/javaee/web-app_2_5.xsd" id="WebApp_ID" version="2.5">
  <display-name>springTest</display-name>

  <servlet>
    <servlet-name>springapp</servlet-name>
    <servlet-class>org.springframework.web.servlet.DispatcherServlet</servlet-class>
    <load-on-startup>1</load-on-startup>
  </servlet>

  <servlet-mapping>
    <servlet-name>springapp</servlet-name>
    <url-pattern>*.html</url-pattern>
  </servlet-mapping>
</web-app>


4. Create springapp-servlet.xml in WEB-INF

<?xml version="1.0" encoding="UTF-8"?>

<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xmlns:context="http://www.springframework.org/schema/context"
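       xsi:schemaLocation="http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans-2.5.xsd
                           http://www.springframework.org/schema/context http://www.springframework.org/schema/context/spring-context-2.5.xsd">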

  <!-- the application context definition for the springapp DispatcherServlet -->
            <!-- Scans within the base package of the application for @Components to
            configure as beans -->
      <!-- @Controller, @Service, @Configuration, etc. -->
      <context:component-scan base-package="springtest" />
     
      <bean id="viewResolver"
            class="org.springframework.web.servlet.view.InternalResourceViewResolver">
            <property name="prefix" value="/WEB-INF/jsp/" />
            <property name="suffix" value=".jsp" />
      </bean>
</beans>



5. Create sample controller

package springtest.controllers;


import org.apache.log4j.Logger;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.stereotype.Controller;
import org.springframework.ui.Model;


@Controller
public class HelloController {
    protected final Logger logger = Logger.getLogger(getClass());

    @RequestMapping("/hello.html")
    public String handleRequest(Model model) {

        logger.debug("Returning index view");
        model.addAttribute("message", "HELLO!!!");
        return "hello";
    }
}


6. Create the needed JSP pages.

index.jsp in webapp
<%@ page language="java" contentType="text/html; charset=UTF-8"
    pageEncoding="UTF-8"%>
<% response.sendRedirect("hello.html"); %>




hello.jsp in WEB-INF/jsp

<%@ page language="java" contentType="text/html; charset=UTF-8"
    pageEncoding="UTF-8"%>
<%@ taglib uri="http://java.sun.com/jsp/jstl/core"
         prefix="c" %>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Hello page</title>
</head>
<body>
      <c:out value="${requestScope.message}"/>
</body>

</html>


Once done, you can import this into your IDE as an existing Maven project, do a Maven build and run it on a server.


Thanks for Reading....cheers... :-) :-)