# Big Data Analytics

*Big Data is one of the most talked-about technology trends today. The real challenge for large organizations is to get the most out of the data already available and to predict what kind of data to collect in the future. How to take existing data and make it meaningful enough to give accurate insight into the past is a key discussion point in many executive meetings. There is no single solution to Big Data, and no single vendor can claim to know everything about it. Big Data is too big a concept, and there are many players: different architectures, different vendors and different technologies.*

# Notes

Big Data Analytics – B.E. – M.U. – By Anuradha Bhatia

# Video Lectures

### CLUSTERING USING REPRESENTATIVES [CURE] ALGORITHM

*Clustering is the classification of objects into groups, using either partitioning or hierarchical techniques. Top-down (divisive) clustering starts with one big cluster and splits it step by step until the desired number of clusters is reached. CURE takes a middle ground between the centroid-based and all-points techniques: it describes each cluster by a small set of representative points.*
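The representative-point idea can be illustrated with a much-simplified sketch. It assumes a bottom-up (agglomerative) merge loop and omits the random sampling and partitioning steps of the full algorithm; the function names and the `n_rep` and `shrink` parameters are hypothetical choices for this example.

```python
import math

def centroid(points):
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def representatives(points, n_rep, shrink):
    """Pick up to n_rep well-scattered points and shrink them toward the centroid."""
    distinct = list(dict.fromkeys(points))
    center = centroid(points)
    chosen = [max(distinct, key=lambda p: math.dist(p, center))]
    while len(chosen) < min(n_rep, len(distinct)):
        remaining = [p for p in distinct if p not in chosen]
        # Farthest-point heuristic: take the point farthest from those already chosen.
        chosen.append(max(remaining, key=lambda p: min(math.dist(p, q) for q in chosen)))
    return [tuple(x + shrink * (c - x) for x, c in zip(p, center)) for p in chosen]

def cure(points, n_clusters, n_rep=3, shrink=0.3):
    clusters = [[p] for p in points]
    reps = [representatives(c, n_rep, shrink) for c in clusters]
    while len(clusters) > n_clusters:
        pairs = ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters)))
        # Merge the two clusters whose representative points are closest.
        i, j = min(pairs, key=lambda ij: min(math.dist(a, b)
                                             for a in reps[ij[0]] for b in reps[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        reps[i] = representatives(clusters[i], n_rep, shrink)
        del clusters[j], reps[j]
    return clusters

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
clusters = cure(points, n_clusters=2)
```

With `shrink=0` the representatives are actual data points (closer to the all-points view); with `shrink=1` they all collapse onto the centroid.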

### DATAR-GIONIS-INDYK-MOTWANI [DGIM] ALGORITHM

*The algorithm works on a bit stream, where each bit has a timestamp giving the position at which it arrives. The input stream is a sequence of 1’s and 0’s. The first bit has timestamp 1, the second bit has timestamp 2, and so on. Positions are tracked relative to the window size N: timestamps are stored modulo N, so each one can be represented in log₂ N bits.*
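A minimal sketch of the bucket bookkeeping behind DGIM is given here. It assumes absolute timestamps rather than timestamps modulo N, purely to keep the code short; the class and method names are hypothetical.

```python
class DGIM:
    """Approximate count of 1s among the last `window` bits of a stream."""

    def __init__(self, window):
        self.window = window
        self.time = 0
        self.buckets = []  # (timestamp of newest 1 in the bucket, size), newest first

    def add(self, bit):
        self.time += 1
        # Expire the oldest bucket once its timestamp leaves the window.
        if self.buckets and self.buckets[-1][0] <= self.time - self.window:
            self.buckets.pop()
        if bit != 1:
            return
        self.buckets.insert(0, (self.time, 1))
        # Keep at most two buckets of each size: when three share a size,
        # merge the two oldest of them into one bucket of double size.
        i = 0
        while i + 2 < len(self.buckets):
            if self.buckets[i][1] == self.buckets[i + 2][1]:
                newer, older = self.buckets[i + 1], self.buckets[i + 2]
                self.buckets[i + 1:i + 3] = [(newer[0], newer[1] + older[1])]
            else:
                i += 1

    def count(self):
        if not self.buckets:
            return 0
        total = sum(size for _, size in self.buckets)
        # Only half of the oldest bucket is assumed to lie inside the window.
        return total - self.buckets[-1][1] // 2

counter = DGIM(window=6)
for bit in [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]:
    counter.add(bit)
estimate = counter.count()
```

The last six bits of this stream contain four 1s; the estimate is guaranteed to be within 50% of the true count.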

### FLAJOLET MARTIN [FM] ALGORITHM

*The Flajolet-Martin algorithm approximates the number of distinct elements in a stream or a database in a single pass. If the stream contains n elements with m of them unique, the algorithm runs in O(n) time and needs O(log(m)) memory.*
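As a rough sketch, the core of the estimator hashes every element, tracks the largest number of trailing zero bits R seen in any hash, and reports 2^R. The use of MD5 truncated to 32 bits and the function names are assumptions of this example, and a single hash function gives a noisy estimate; practical versions combine many hash functions.

```python
import hashlib

def trailing_zeros(x, width=32):
    """Number of trailing zero bits in x (width if x is 0)."""
    if x == 0:
        return width
    return (x & -x).bit_length() - 1

def fm_estimate(stream):
    max_zeros = 0
    for item in stream:
        digest = hashlib.md5(repr(item).encode()).digest()
        h = int.from_bytes(digest[:4], "big")  # 32-bit hash value
        max_zeros = max(max_zeros, trailing_zeros(h))
    return 2 ** max_zeros

estimate = fm_estimate(range(1000))
```

Only `max_zeros` is kept between elements, which is why the memory requirement is logarithmic in the number of distinct elements.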

### HUBS AND AUTHORITIES

*Hyperlink-Induced Topic Search (HITS; also known as hubs and authorities) is a link analysis algorithm that rates Web pages, developed by Jon Kleinberg. A good hub is a page that points to many other pages. A good authority is a page that is linked to by many different hubs.*
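The mutual reinforcement between hubs and authorities can be sketched as a simple iteration. The graph representation (a dict mapping each page to the pages it links to), the function name, and normalising by the maximum score are assumptions of this example.

```python
def hits(graph, iterations=50):
    nodes = set(graph) | {v for targets in graph.values() for v in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of the pages linking in.
        auth = {n: 0.0 for n in nodes}
        for page, targets in graph.items():
            for target in targets:
                auth[target] += hub[page]
        # Hub score: sum of the authority scores of the pages linked to.
        hub = {n: sum(auth[t] for t in graph.get(n, ())) for n in nodes}
        # Normalise so that the scores stay bounded.
        a_max = max(auth.values()) or 1.0
        h_max = max(hub.values()) or 1.0
        auth = {n: s / a_max for n, s in auth.items()}
        hub = {n: s / h_max for n, s in hub.items()}
    return hub, auth

graph = {"A": ["C", "D"], "B": ["C"], "C": [], "D": []}
hub, auth = hits(graph)
```

In this small graph, page A points to the most pages and comes out as the best hub, while page C is linked to by both hubs and comes out as the best authority.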

### MULTISTAGE PCY ALGORITHM

*Implements the Park-Chen-Yu (PCY) algorithm in multiple stages, as the name says. It uses m counters and a hash table H during a linear scan of the baskets. For each basket b, it increments the counter of every item in b and increments the hash-table counter for every pair of items in b.*
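One possible three-pass sketch of the multistage idea follows. It assumes two hash tables of equal size, Python's built-in `hash` as the hash function, and hypothetical function and parameter names; it finds frequent pairs only.

```python
from collections import Counter
from itertools import combinations

def multistage_pairs(baskets, support, n_buckets=11):
    def bucket1(pair):
        return hash(pair) % n_buckets

    def bucket2(pair):
        return hash(("stage2",) + pair) % n_buckets

    # Pass 1: count items and hash every pair into the first table.
    item_counts = Counter()
    table1 = [0] * n_buckets
    for basket in baskets:
        items = sorted(set(basket))
        item_counts.update(items)
        for pair in combinations(items, 2):
            table1[bucket1(pair)] += 1
    frequent_items = {i for i, c in item_counts.items() if c >= support}

    # Pass 2: rehash only the pairs that survived the first table.
    table2 = [0] * n_buckets
    for basket in baskets:
        items = sorted(set(basket) & frequent_items)
        for pair in combinations(items, 2):
            if table1[bucket1(pair)] >= support:
                table2[bucket2(pair)] += 1

    # Pass 3: count only the pairs that fall in frequent buckets of both tables.
    counts = Counter()
    for basket in baskets:
        items = sorted(set(basket) & frequent_items)
        for pair in combinations(items, 2):
            if table1[bucket1(pair)] >= support and table2[bucket2(pair)] >= support:
                counts[pair] += 1
    return {p: c for p, c in counts.items() if c >= support}

baskets = [["milk", "bread"], ["milk", "bread", "eggs"], ["bread", "eggs"],
           ["milk", "eggs"], ["milk", "bread"]]
frequent = multistage_pairs(baskets, support=3)
```

The extra pass costs one more scan of the data, in exchange for fewer candidate pairs to count in the final pass.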

### PAGE RANK ALGORITHM

*PageRank is a “vote”, by all the other pages on the Web, about how important a page is. A link to a page counts as a vote of support. The original PageRank algorithm was designed by Lawrence Page and Sergey Brin.*
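The voting idea can be sketched as an iteration in which every page splits its current rank equally among the pages it links to. The damping factor of 0.85, the function name, and the assumption that every page has at least one outbound link are choices made for this example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / n for p in pages}
        for page, targets in links.items():
            for target in targets:
                # Each outbound link passes on an equal share of the page's rank.
                new_rank[target] += damping * rank[page] / len(targets)
        rank = new_rank
    return rank

ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
```

Page C receives votes from both A and B, so it ends up with the highest rank in this three-page example.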

### PAGE RANK USING MATRIX

*An inbound link to a web page always increases that page’s PageRank. An important aspect of outbound links is their absence: when a web page has no outbound links, its PageRank cannot be distributed to other pages. Lawrence Page and Sergey Brin characterise links to such pages as dangling links.*
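A small matrix-form sketch shows one common way of handling pages without outbound links: treat them as if they linked to every page. That workaround is an assumption of this example rather than something stated above, as are the function name and the damping factor.

```python
def pagerank_matrix(adjacency, damping=0.85, iterations=100):
    """adjacency[j][i] is 1 if page j links to page i, else 0."""
    n = len(adjacency)
    # Column-stochastic matrix: M[i][j] is the probability of moving from j to i.
    M = [[0.0] * n for _ in range(n)]
    for j in range(n):
        out_degree = sum(adjacency[j])
        for i in range(n):
            if out_degree == 0:
                M[i][j] = 1.0 / n  # page with no outbound links: spread rank uniformly
            else:
                M[i][j] = adjacency[j][i] / out_degree
    r = [1.0 / n] * n
    for _ in range(iterations):
        r = [(1 - damping) / n + damping * sum(M[i][j] * r[j] for j in range(n))
             for i in range(n)]
    return r

# Page 0 links to 1 and 2, page 1 links to 2, page 2 has no outbound links.
ranks = pagerank_matrix([[0, 1, 1], [0, 0, 1], [0, 0, 0]])
```

Because every column of `M` sums to 1, the total rank stays at 1 across iterations instead of leaking out through page 2.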

### APRIORI ALGORITHM

*The Apriori algorithm mines frequent itemsets for Boolean association rules. Apriori uses a “bottom up” approach, where frequent subsets are extended one item at a time (a step known as candidate generation), and groups of candidates are tested against the data.*
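The level-by-level candidate generation can be sketched as follows. The function name and the use of an absolute support count (rather than a fraction) are assumptions of this example.

```python
from itertools import combinations

def apriori(baskets, support):
    baskets = [frozenset(b) for b in baskets]
    current = {frozenset([item]) for b in baskets for item in b}
    frequent = {}
    k = 1
    while current:
        # Test the current candidates against the data.
        counts = {c: sum(1 for b in baskets if c <= b) for c in current}
        level = {c: n for c, n in counts.items() if n >= support}
        frequent.update(level)
        k += 1
        # Candidate generation: extend frequent itemsets by one item, keeping
        # only candidates whose (k-1)-subsets are all frequent.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        current = {c for c in joined
                   if all(frozenset(s) in level for s in combinations(c, k - 1))}
    return frequent

baskets = [["milk", "bread"], ["milk", "bread", "eggs"], ["bread", "eggs"],
           ["milk", "eggs"], ["milk", "bread"]]
frequent = apriori(baskets, support=3)
```

With a support threshold of 3, the three single items are frequent but only the pair {milk, bread} survives to the second level.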

### PARK CHEN YU [PCY] ALGORITHM

*PCY is an improvement on the A-Priori algorithm. Main memory is mostly idle during Pass 1 of A-Priori; the idea behind PCY is to put that idle memory to use.*
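A minimal two-pass sketch of PCY for frequent pairs: the spare memory in Pass 1 holds a hash table of pair counts, and Pass 2 counts only pairs that hash to a frequent bucket. The bucket count, the use of Python's built-in `hash`, and the function name are assumptions of this example.

```python
from collections import Counter
from itertools import combinations

def pcy_pairs(baskets, support, n_buckets=11):
    # Pass 1: count single items and hash every pair into a bucket counter.
    item_counts = Counter()
    bucket_counts = [0] * n_buckets
    for basket in baskets:
        items = sorted(set(basket))
        item_counts.update(items)
        for pair in combinations(items, 2):
            bucket_counts[hash(pair) % n_buckets] += 1
    frequent_items = {i for i, c in item_counts.items() if c >= support}
    # Between passes: compress the bucket counts into a bitmap.
    bitmap = [c >= support for c in bucket_counts]

    # Pass 2: count a pair only if both items are frequent and its bucket is frequent.
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket) & frequent_items)
        for pair in combinations(items, 2):
            if bitmap[hash(pair) % n_buckets]:
                pair_counts[pair] += 1
    return {p: c for p, c in pair_counts.items() if c >= support}

baskets = [["milk", "bread"], ["milk", "bread", "eggs"], ["bread", "eggs"],
           ["milk", "eggs"], ["milk", "bread"]]
frequent = pcy_pairs(baskets, support=3)
```

Hash collisions can let an infrequent pair through to Pass 2, but the exact count there filters it out, so the result has no false positives or false negatives.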

### SAVASERE, OMIECINSKI, AND NAVATHE [SON] ALGORITHM

*The SON algorithm finds all frequent itemsets. It repeatedly reads small subsets of the baskets into main memory and runs the simple algorithm on each subset. An itemset becomes a candidate if it is found to be frequent in one or more subsets of the baskets.*
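A two-pass sketch of the chunked approach is shown below. It assumes brute-force counting as the per-chunk "simple algorithm", limits itemsets to a hypothetical `max_size`, and scales the support threshold by each chunk's share of the data; all names are illustrative.

```python
import math
from collections import Counter
from itertools import combinations

def local_frequent(chunk, threshold, max_size):
    """Brute-force count of all itemsets up to max_size within one chunk."""
    counts = Counter()
    for basket in chunk:
        items = sorted(set(basket))
        for k in range(1, max_size + 1):
            counts.update(combinations(items, k))
    return {s for s, c in counts.items() if c >= threshold}

def son(baskets, support, n_chunks=2, max_size=2):
    size = math.ceil(len(baskets) / n_chunks)
    chunks = [baskets[i:i + size] for i in range(0, len(baskets), size)]
    # Pass 1: an itemset is a candidate if it is frequent in at least one chunk.
    candidates = set()
    for chunk in chunks:
        threshold = support * len(chunk) / len(baskets)
        candidates |= local_frequent(chunk, threshold, max_size)
    # Pass 2: count the candidates exactly over all baskets.
    counts = Counter()
    for basket in baskets:
        items = set(basket)
        for candidate in candidates:
            if items.issuperset(candidate):
                counts[candidate] += 1
    return {c: n for c, n in counts.items() if n >= support}

baskets = [["milk", "bread"], ["milk", "bread", "eggs"], ["bread", "eggs"],
           ["milk", "eggs"], ["milk", "bread"]]
frequent = son(baskets, support=3)
```

An itemset that is frequent overall must be frequent in at least one chunk at the scaled threshold, so the first pass cannot miss it; the second pass removes the false positives.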

### BETWEENNESS CENTRALITY

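*The betweenness centrality BC(v) of a vertex v is the sum, over all pairs of nodes u and w other than v, of the number of shortest paths between u and w that pass through v, divided by the total number of shortest paths between u and w:*

$$BC(v) = \sum_{u \neq v \neq w} \frac{\sigma_{uw}(v)}{\sigma_{uw}}$$

*Here $\sigma_{uw}$ is the total number of shortest paths between nodes u and w, and $\sigma_{uw}(v)$ is the number of those paths that pass through v.*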
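For small graphs, betweenness centrality can be computed directly: for every pair of nodes, take the fraction of their shortest paths that pass through v, and add these fractions up. The sketch assumes an undirected, unweighted graph stored as a dict of neighbour lists and counts each unordered pair once; the function names are hypothetical, and faster methods exist for large graphs.

```python
from collections import deque

def bfs_counts(graph, source):
    """Distance from source and number of shortest paths to every reachable node."""
    dist = {source: 0}
    paths = {source: 1}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                paths[neighbour] = 0
                queue.append(neighbour)
            if dist[neighbour] == dist[node] + 1:
                paths[neighbour] += paths[node]
    return dist, paths

def betweenness(graph):
    info = {s: bfs_counts(graph, s) for s in graph}
    nodes = sorted(graph)
    bc = {v: 0.0 for v in nodes}
    for i, u in enumerate(nodes):
        dist_u, paths_u = info[u]
        for w in nodes[i + 1:]:
            if w not in dist_u:
                continue  # u and w are not connected
            dist_w, paths_w = info[w]
            for v in nodes:
                if v == u or v == w or v not in dist_u:
                    continue
                # v lies on a shortest u-w path iff the distances add up.
                if dist_u[v] + dist_w[v] == dist_u[w]:
                    bc[v] += paths_u[v] * paths_w[v] / paths_u[w]
    return bc

path_graph = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
bc = betweenness(path_graph)
```

In the path A-B-C, the only shortest path between A and C passes through B, so B gets a score of 1 and the end nodes get 0.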

### CLIQUE AND COMMUNITY

*Clique Percolation Method (CPM): a k-clique is a complete subgraph on k nodes. Two k-cliques are considered adjacent if they share k − 1 nodes. A community is defined as the maximal union of k-cliques that can be reached from each other through a series of adjacent k-cliques.*
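The definition translates fairly directly into a brute-force sketch: enumerate the k-cliques, link those sharing k − 1 nodes, and take the union of each linked group. The exhaustive enumeration is only practical for small graphs, and the function name and edge-list input are assumptions of this example.

```python
from itertools import combinations

def k_clique_communities(edges, k):
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    # A k-clique is a set of k nodes in which every pair is connected.
    cliques = [frozenset(c) for c in combinations(sorted(graph), k)
               if all(b in graph[a] for a, b in combinations(c, 2))]

    # Union-find over cliques: adjacent cliques end up in the same group.
    parent = list(range(len(cliques)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) == k - 1:
            parent[find(i)] = find(j)

    communities = {}
    for i, clique in enumerate(cliques):
        communities.setdefault(find(i), set()).update(clique)
    return sorted(sorted(c) for c in communities.values())

edges = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4), (4, 5), (5, 6), (5, 7), (6, 7)]
communities = k_clique_communities(edges, k=3)
```

Triangles {1, 2, 3} and {2, 3, 4} share two nodes and merge into one community, while {5, 6, 7} stays separate because the edge between 4 and 5 is not part of any triangle.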