Monday 19 August 2013

Big Data, MapReduce, Hadoop and related terms


Big Data is the term used to describe volumes of structured and unstructured data so large that they are difficult to process using traditional database and software techniques.
Example : Facebook hosts approximately 10 billion photos, taking up one petabyte of storage.
The real issue is not acquiring and storing big data; what you do with the acquired data is what matters.

Big Data analytics
With big data and big data analytics it is possible to:
Analyze millions of SKUs (stock keeping units) to determine optimal prices that maximize profit and clear inventory.
Recalculate entire risk portfolios in minutes and understand future possibilities to mitigate risk.
Mine customer data for insights that drive new strategies for customer acquisition, retention, campaign optimization and next best offers.
Quickly identify customers who matter the most.
Generate retail coupons at the point of sale based on the customer's current and past purchases, ensuring a higher redemption rate.
Send tailored recommendations to mobile devices at just the right time, while customers are in the right location to take advantage of offers.
Analyze data from social media to detect new market trends and changes in demand.
Use click stream analysis and data mining to detect fraudulent behavior.
Determine root causes of failures, issues and defects by investigating user sessions, network logs and machine sensors.
Roughly 90% of the world's data was generated in just the last couple of years.
All this new data is coming from smartphones, social networks, trading platforms, etc.
This data may be structured, semi-structured or unstructured (the majority).

MapReduce

Google's engineers designed a programming model to help them process the massive amounts of data they were acquiring.
Large data computations are chopped into smaller chunks and mapped onto many machines; when the individual calculations are done, their results are brought back together (reduced) to produce the resulting data set. This is called MapReduce.
This model was later used to develop an open source project called Hadoop.
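To make the model concrete, here is a minimal word-count sketch using the Hadoop MapReduce Java API (the class names and input/output paths are illustrative assumptions, not something from the original post). The map step emits a (word, 1) pair for every word it sees, and the reduce step sums those counts per word.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map step: split each input line into words and emit (word, 1)
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce step: sum the counts emitted for each word
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        // Job.getInstance(...) works on Hadoop 2.x; on older 1.x use new Job(conf, "word count")
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory on HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory on HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The same map/shuffle/reduce shape applies to most Hadoop jobs; only the per-record map logic and the per-key reduce logic change.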









Driving force

The two ingredients pushing a business to look into Hadoop are:
        1. data > 10 TB
        2. High calculation complexity

Hadoop will play a central role in:
     1. Statistical analysis
     2. ETL processing
     3. Business intelligence






In pioneer days they used oxen for heavy pulling, and when one ox couldn't budge a log, they didn't try to grow a larger ox. We shouldn't be trying for bigger computers, but for more systems of computers. ~ Grace Hopper


                                                 
Thanks for reading....cheers :-) :-)


Sunday 18 August 2013

Apache Mahout ~ Setting up a Recommendation Engine on Ubuntu





Apache Mahout is an Apache project that implements scalable machine learning algorithms.

At a high level, Apache Mahout mainly deals with the following three areas.

1. Recommendation Engine.
2. Clustering.
3. Classification.

Recommendation Engine : A recommendation engine analyzes user data (preferences or ratings) and helps introduce or recommend products, friends, etc. that the users did not know about earlier.

Ex : Facebook suggesting friends; Amazon suggesting products the user might be interested in.

Clustering : This set of machine learning techniques groups various entities based on particular characteristics or features.

Ex : Google News grouping the stories related to a particular news item and presenting them together to the user.

Classification : This part helps in automatically classifying documents and images, implementing spam filters, and can be extended to various other domains.

Ex : Yahoo Mail identifying spam.

This post talks about the Mahout recommendation part and the steps to set it up on Ubuntu.


Mahout mainly focuses on collaborative filtering techniques for recommendation. Every user has different tastes and preferences, but they follow some patterns. The recommendation engine tries to identify these patterns in the existing data, predicts what the users may like or prefer, and presents them with the corresponding recommendations.

Mahout currently implements a collaborative filtering engine that supports user-based, item-based and slope-one recommender systems. Other algorithms available are k-means, fuzzy k-means, canopy, Dirichlet and mean-shift clustering. The other approach is content-based recommendation, but a recommendation engine built around content is rarely generic: for example, an engine built to recommend pizzas by considering toppings, cheese content and so on cannot be used for other domains; the same pizza recommendation engine cannot recommend books. So Mahout mainly emphasizes collaborative filtering.
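For a quick taste of the collaborative filtering API, here is a minimal, non-distributed sketch using Mahout's Taste classes; the file name, user id and neighborhood size are assumptions chosen to match the input format described later in this post.

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class UserBasedRecommenderDemo {
    public static void main(String[] args) throws Exception {
        // Each line of input.dat is: userId,itemId,rating
        DataModel model = new FileDataModel(new File("input.dat"));

        // Pearson correlation between the users' rating vectors
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);

        // Use the 2 most similar users as the neighborhood (assumed value)
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(2, similarity, model);

        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Top 2 recommendations for user 1
        List<RecommendedItem> recommendations = recommender.recommend(1, 2);
        for (RecommendedItem item : recommendations) {
            System.out.println(item.getItemID() + " : " + item.getValue());
        }
    }
}

The recommenditembased command shown further below runs the distributed, Hadoop-based counterpart of the same idea.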

Installation on  Ubuntu :

Prerequisites: 

a. Java 1.6 or later
b. Maven installed
c. Hadoop downloaded ( http://www.apache.org/dyn/closer.cgi/hadoop/common/ )
d. set JAVA_HOME and HADOOP_HOME (HADOOP_HOME should point to the directory extracted from the Hadoop download in step c)

(Mahout's collaborative filtering jobs are implemented on top of Apache Hadoop using the MapReduce paradigm, hence step d.)

Steps :

1. Download the stable version of Mahout   
         https://cwiki.apache.org/confluence/display/MAHOUT/Downloads
2. Unpack the above download and navigate to the mahout directory
3. Run mvn install (the build should succeed)



Executing an Example :

The data is usually fed from a file, or it can be read directly from database tables using the JDBC data models. In the following example the data is read from a file.

Usually the data in the input file has the following format: if user 101 rates a book with id 450 a rating of 4, the entry in the input file would be

101,450,4.0

So create the input file, call it input.dat, with some relevant data, for example:

1,101,5.0
1,102,3.0
1,103,2.5
2,101,2.0
2,102,2.5
2,103,5.0
2,104,2.0
3,101,2.5
3,104,4.0
3,105,4.5
3,107,5.0
4,101,5.0
4,103,3.0
4,104,4.5
4,106,4.0
5,101,4.0
5,102,3.0
5,103,2.0
5,104,4.0
5,105,3.5
5,106,4.0


And one more file has to be supplied to the command: recommend.dat (this file just contains the id of the user the recommendations are targeted at; with the sample data above that would be e.g. 1).

Now run the following command on the terminal

$ bin/mahout recommenditembased --input input.dat --usersFile recommend.dat --numRecommendations 2 --output output/ --similarityClassname SIMILARITY_PEARSON_CORRELATION

The similarity class in the above command can be changed to any of the following:

SIMILARITY_PEARSON_CORRELATION, SIMILARITY_COOCCURRENCE, SIMILARITY_LOGLIKELIHOOD, SIMILARITY_TANIMOTO_COEFFICIENT, SIMILARITY_CITY_BLOCK, SIMILARITY_COSINE, SIMILARITY_EUCLIDEAN_DISTANCE



cheers....Thanks for Reading....... :-) :-)



Thursday 15 August 2013

Need for a NoSQL Database



First things first

It's good to understand the following terms before looking at the motivations or need for NoSQL databases.

  • RDBMS
  • Joins, multi-table transactions
  • ACID (Atomicity, Consistency, Isolation, Durability)
  • Scale up, scale out
  • Vertical scaling, horizontal scaling
  • Big users, big data, cloud computing
  • Structured data, unstructured data
  • Distributed systems
  • Usage profiles
  • CAP (Consistency, Availability, Partition tolerance) theorem
  • Key challenges with RDBMS as the DB scales to more machines:
        - Complex joins on large data sets become a bottleneck
        - Distributed transaction management across nodes
        - Lock management challenges
        - Failure (single point of failure)
        - Joining of tables across nodes


What is NoSQL?

  • It is not "never SQL"
  • It is not "no to SQL"
  • It is "Not Only SQL"


Motivation for NoSQL:


Big Users

  • With the extensive use of the internet, smartphones and tablets, and the increased number of usage hours:
        - A newly launched app can go viral, growing from zero to a million users overnight, literally.
        - Some users are active frequently, while others use an app a few times and never return.
        - Seasonal swings, like those around Christmas or other festive seasons, create spikes for short periods.
The large number of users, combined with the dynamic nature of usage patterns, is driving the need for more easily scalable database technology.

Big Data

  • Data is becoming easier to capture and access, for example on Facebook:
        - personal user information,
        - geolocation data,
        - social graphs, ...
  • Big data is characterized by volume, velocity and variety.


The growing volume, velocity and variety of data is likewise driving the need for more easily scalable database technology.

 Cloud Computing

       Today, most applications use a three-tier Internet architecture, run in a public or private cloud, and support large numbers of users. In the cloud, a load balancer directs the incoming traffic to a scale-out tier of web/application servers that process the logic of the application. The scale-out architecture at the web/application tier works fine.
       At the database tier, relational databases were originally the popular choice. Their  use was increasingly problematic however, because they are a centralized, share-everything technology that scales up rather than out.

Scale out and Scale up statistics:










The two main factors to consider when choosing NoSQL:

1. The system is distributed, and it is acceptable to work with the trade-offs of the CAP theorem rather than the ACID guarantees of an RDBMS.
2. The database tier needs horizontal scaling (scaling out).


Types of NoSql Databases:

There have been various approaches to classify NoSQL databases, each with different categories and subcategories. Because of the variety of approaches and the overlap in non-functional requirements and feature sets, it can be difficult to get and maintain an overview of the non-relational database market. Nevertheless, the most basic classification, and the one most would agree on, is based on the data model. A few of these categories and their representative products are:
  • Column: Hbase, Cassandra, Accumulo
  • Document: MongoDB, Couchbase, Raven
  • Key-value : Dynamo, Riak, Azure, Redis, Cache, GT.m
  • Graph: Neo4J, Allegro, Virtuoso, Bigdata


Thanks for reading...cheers :-)









Spring MVC set up using Maven in no time





Just follow these steps to get a basic Spring MVC setup done using Maven.


1. Create the following folders on one of your drives:

[project] -> src -> main -> java : folder for the source code
[project] -> src -> main -> resources : folder for the resources, XMLs, property files etc.
[project] -> src -> main -> webapp : the root folder for the web application; it will contain the web content, pages, CSS, JavaScript files etc. and the WEB-INF folder
[project] -> src -> test -> java : the folder for the JUnit test classes and anything related to JUnit testing
[project] -> src -> test -> resources : the folder which will contain all resources for our tests

2. Add the following pom.xml to the project folder.

pom.xml

<?xml version="1.0" encoding="UTF-8"?>
<project xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd" xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <modelVersion>4.0.0</modelVersion>

  <groupId>springtest</groupId>
  <artifactId>springTest</artifactId>
  <version>1.0-SNAPSHOT</version>
  <packaging>war</packaging>
    
  <name>Our Spring test project</name>
  <properties>
        <org.springframework.version>3.0.5.RELEASE</org.springframework.version>
        <javax.servlet.jstl.version>1.2</javax.servlet.jstl.version>
    </properties>
   
  <dependencies>
       <dependency>
            <groupId>org.springframework</groupId>
            <artifactId>spring-context</artifactId>
            <version>${org.springframework.version}</version>
       </dependency>
      

        <dependency>
            <groupId>org.springframework</groupId>
            <artifactId>spring-webmvc</artifactId>
            <version>${org.springframework.version}</version>
        </dependency>
        <dependency>
            <groupId>org.springframework</groupId>
            <artifactId>spring-web</artifactId>
            <version>${org.springframework.version}</version>
        </dependency>
        <dependency>
            <groupId>org.springframework</groupId>
            <artifactId>spring-test</artifactId>
            <version>${org.springframework.version}</version>
            <scope>test</scope>
        </dependency>
       
        <dependency>
            <groupId>javax.servlet</groupId>
            <artifactId>jstl</artifactId>
            <version>${javax.servlet.jstl.version}</version>
        </dependency>
         <dependency>
            <groupId>log4j</groupId>
            <artifactId>log4j</artifactId>
            <version>1.2.16</version>
        </dependency>
        
         <dependency>
           <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.8.1</version>
            <scope>test</scope>
         </dependency>
  </dependencies>

  <build>

    <plugins>

            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <configuration>
                    <source>1.6</source>
                    <target>1.6</target>
                    <encoding>UTF-8</encoding>
                </configuration>
                <inherited>true</inherited>
            </plugin>
           
    </plugins>
     
     
      <resources>
            <resource>
                <directory>src/main/resources</directory>
                <filtering>true</filtering>
            </resource>
        </resources>
        <testResources>
            <testResource>
                <directory>src/test/resources</directory>
                <filtering>true</filtering>
            </testResource>
            <testResource>
                <directory>src/main/resources</directory>
                <filtering>true</filtering>
            </testResource>
        </testResources>
  </build>
</project>



3. Create web.xml in webapp/WEB-INF

<?xml version="1.0" encoding="UTF-8"?>
<web-app xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://java.sun.com/xml/ns/javaee" xmlns:web="http://java.sun.com/xml/ns/javaee/web-app_2_5.xsd" xsi:schemaLocation="http://java.sun.com/xml/ns/javaee http://java.sun.com/xml/ns/javaee/web-app_2_5.xsd" id="WebApp_ID" version="2.5">
  <display-name>springTest</display-name>

  <servlet>
    <servlet-name>springapp</servlet-name>
    <servlet-class>org.springframework.web.servlet.DispatcherServlet</servlet-class>
    <load-on-startup>1</load-on-startup>
  </servlet>

  <servlet-mapping>
    <servlet-name>springapp</servlet-name>
    <url-pattern>*.html</url-pattern>
  </servlet-mapping>
</web-app>


4. Create springapp-servlet.xml in WEB-INF (the DispatcherServlet looks for a file named <servlet-name>-servlet.xml, here springapp-servlet.xml)

<?xml version="1.0" encoding="UTF-8"?>

<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xmlns:context="http://www.springframework.org/schema/context"
       xsi:schemaLocation="http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans-2.5.xsd

  <!-- the application context definition for the springapp DispatcherServlet -->
            <!-- Scans within the base package of the application for @Components to
            configure as beans -->
      <!-- @Controller, @Service, @Configuration, etc. -->
      <context:component-scan base-package="springtest" />
     
      <bean id="viewResolver"
            class="org.springframework.web.servlet.view.InternalResourceViewResolver">
            <property name="prefix" value="/WEB-INF/jsp/" />
            <property name="suffix" value=".jsp" />
      </bean>
</beans>



5. Create a sample controller

package springtest.controllers;


import org.apache.log4j.Logger;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.stereotype.Controller;
import org.springframework.ui.Model;


@Controller
public class HelloController {
    protected final Logger logger = Logger.getLogger(getClass());

    @RequestMapping("/hello.html")
    public String handleRequest(Model model) {

        logger.debug("Returning index view");
        model.addAttribute("message", "HELLO!!!");
        return "hello";
    }
}
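
Since the pom already declares junit and spring-test and the project has a src/test/java folder, a simple unit test for the controller could look like the following sketch (the test class and method names here are just assumptions):

package springtest.controllers;

import static org.junit.Assert.assertEquals;

import org.junit.Test;
import org.springframework.ui.ExtendedModelMap;

public class HelloControllerTest {

    @Test
    public void returnsHelloViewAndMessage() {
        HelloController controller = new HelloController();
        ExtendedModelMap model = new ExtendedModelMap();

        // The controller should select the "hello" view...
        String viewName = controller.handleRequest(model);
        assertEquals("hello", viewName);

        // ...and put the greeting message into the model
        assertEquals("HELLO!!!", model.get("message"));
    }
}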


6. Create the needed JSP pages.

index.jsp in webapp
<%@ page language="java" contentType="text/html; charset=UTF-8"
    pageEncoding="UTF-8"%>
<% response.sendRedirect("hello.html"); %>




hello.jsp in WEB-INF/jsp

<%@ page language="java" contentType="text/html; charset=UTF-8"
    pageEncoding="UTF-8"%>
<%@ taglib uri="http://java.sun.com/jsp/jstl/core"
         prefix="c" %>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Hello page</title>
</head>
<body>
      <c:out value="${requestScope.message}"/>
</body>

</html>


Once done, you can import this as an Existing Maven Project into your IDE, do a Maven build and run it on a server.


Thanks for Reading....cheers... :-) :-)

Thursday 21 June 2012

DPMailSniffer ~find out if your email has been read ~




Now you can find out when your email has been read by the recipient! No more guessing: "Has he or she read my email yet?" Use the sniffer anytime on anyone: your business contact, cheating girlfriend, ex-husband, stranger you met online, or just anyone who doesn't respond to your email and needs to be "verified." Just sniff them ;-)

There are already some sites providing this; still, you can develop your own notification system very easily and customize it as you wish.
Mail-sniffer is a simple email tracking system that sends you a notification by email when the recipient opens your message.


Note : Both you and the recipient must use HTML email, not plain-text or rich-text email. (Yahoo, Gmail, Hotmail and all the modern providers support this.)


The simple logic behind this tracking is to include an image served from your application in the mail being sent (secretly, of course). When the mail is opened, a request comes to your application to download the secret (invisible) image you included in the mail; you can then inspect that request and get the details.


Step 1: Have a form with the sender's mail address and the subject of the mail.


Step 2: Create a secret image (some white/blank image), or use a visible image backing your signature in the mail. Have your own logic to map the email id and subject entered in the above fields to the image created. The image should live in your application running on a server (intranet or internet).

Step 3: Copy the image (select the image and CTRL+C), paste it somewhere in the mail being composed (CTRL+V) and send your mail as usual (a programmatic alternative is sketched below).
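
If you would rather send the mail programmatically than paste the image by hand, the following is a rough JavaMail sketch; the SMTP host, the addresses and the tracking URL are assumptions. The essential part is simply the hidden <img> tag pointing back at your application.

import java.util.Properties;

import javax.mail.Message;
import javax.mail.Session;
import javax.mail.Transport;
import javax.mail.internet.InternetAddress;
import javax.mail.internet.MimeMessage;

public class TrackedMailSender {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("mail.smtp.host", "smtp.example.com");   // assumed SMTP host
        Session session = Session.getInstance(props);

        Message message = new MimeMessage(session);
        message.setFrom(new InternetAddress("me@example.com"));
        message.setRecipients(Message.RecipientType.TO,
                InternetAddress.parse("recipient@example.com"));
        message.setSubject("Hello");

        // The hidden 1x1 image points back at the tracking application;
        // "secret-42.gif" is whatever name you mapped to this mail in Step 2.
        String html = "<html><body>"
                + "<p>Hi, please find the details below.</p>"
                + "<img src='http://yourapp.example.com/secret-42.gif' width='1' height='1'/>"
                + "</body></html>";
        message.setContent(html, "text/html; charset=utf-8");

        Transport.send(message);
    }
}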

So whenever the recipient opens the mail, the image is downloaded from your running application and the request comes to your application.

Say, for example, you have included a secret, empty GIF image in the mail. The following bits of code will help you get hold of the request.

web.xml 


<servlet>
  <servlet-name>spy</servlet-name>
  <servlet-class>com.dp.controller.SpyServlet</servlet-class>
  <load-on-startup>1</load-on-startup>
</servlet>
<servlet-mapping>
  <servlet-name>spy</servlet-name>
  <url-pattern>*.gif</url-pattern>
</servlet-mapping>


Check the request and identify which image is being requested. We have already stored the from id and the subject mapped to a particular image file in Step 2. Use this from id and subject to notify that the particular mail has been opened, using the following kind of servlet.


SpyServlet :


import java.io.IOException;
import java.util.Date;

import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

/**
 * @author DurgaPrasad.K
 */
public class SpyServlet extends HttpServlet {

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {

        // The requested URI tells us which tracking image (and hence which
        // mail from Step 2) has just been opened, e.g. /yourapp/secret-42.gif
        String requestedImage = req.getRequestURI();

        System.out.println(requestedImage + " - - " + req.getRemoteAddr() + " - - "
                + req.getRemoteHost() + " - - " + req.getRemotePort() + " - - "
                + req.getRemoteUser() + " - - " + new Date(System.currentTimeMillis()));

        // notify the sender here (e.g. send a mail) that the tracked message was opened
    }
}

Your app must be hosted on the intranet or internet as per your requirement. That's all for now.


Cheers...Thanks for reading   :-) :-)