Hadoop Basics
Blogs on BIG DATA.
1. Big Data for Starters
http://scn.sap.com/community/developer-center/hana/blog/2013/04/26/big-data-for-starters
2. Advanced level - Tech deep dive on BIG DATA Technologies & Applications.
http://scn.sap.com/community/hana-in-memory/blog/2013/04/30/big-data-technologies-applications
What is Hadoop
Hadoop is considered one of the best options for storing structured, semi-structured and unstructured data. Built as an open source software framework for data-intensive distributed applications, Apache Hadoop uses a cluster of nodes to store data, combining a MapReduce processing facility with a distributed file system to meet parallel processing requirements. The technology has become so popular that it is regarded as one of the leading open source projects.
The Hadoop framework transparently provides both reliability and data motion to applications. Hadoop implements a computational paradigm named MapReduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. In addition, it provides a distributed file system that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both map/reduce and the distributed file system are designed so that node failures are automatically handled by the framework.
Hadoop is written in the Java programming language and is an Apache top-level project being built and used by a global community of contributors. Hadoop and its related projects (Hive, HBase, Zookeeper, and so on) have many contributors from across the ecosystem. Though Java code is most common, any programming language can be used with "streaming" to implement the "map" and "reduce" parts of the system.
Why Hadoop
- Create Transparency
- It is open source technology.
- Insights from all types of data, from all types of systems.
- Expose variability and enable experimentation.
- Ability to analyze data in real time, and cost-effectiveness relative to the value of predictive analytics.
- Harnessing the power of Hadoop, while making its provisioning and integration in the corporate data center more seamless, will drive changes in the data center.
- Segment populations to customize actions.
- Replace/support human decision-making with automated algorithms.
- Innovate new business models, products, and services.
Hadoop Big Picture
Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.
Apache Hive supports analysis of large datasets stored in Hadoop-compatible file systems such as the Amazon S3 file system. It provides an SQL-like language called HiveQL while maintaining full support for map/reduce. To accelerate queries, it provides indexes, including bitmap indexes.
By default, Hive stores metadata in an embedded Apache Derby database; other client/server databases such as MySQL can optionally be used.
Currently, there are three file formats supported in Hive, which are TEXTFILE, SEQUENCEFILE and RCFILE.
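As a rough illustration of how applications talk to Hive, the Java sketch below connects to HiveServer2 over JDBC and runs a HiveQL aggregation. The host name, credentials and the sales table are placeholders chosen for this example, not part of any particular installation.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQlExample {
    public static void main(String[] args) throws Exception {
        // Load the HiveServer2 JDBC driver shipped with the Hive client libraries.
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Host, port, database and credentials below are illustrative placeholders.
        Connection con = DriverManager.getConnection(
                "jdbc:hive2://hiveserver.example.com:10000/default", "hive", "");
        Statement stmt = con.createStatement();

        // Project structure onto delimited text files and query them with HiveQL.
        stmt.execute("CREATE TABLE IF NOT EXISTS sales (region STRING, amount DOUBLE) "
                + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE");
        ResultSet rs = stmt.executeQuery(
                "SELECT region, SUM(amount) FROM sales GROUP BY region");
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getDouble(2));
        }
        con.close();
    }
}

Under the covers, Hive compiles the SELECT into one or more map/reduce jobs, which is why such queries scale to very large files but run with batch-style latency.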
HBase is a large-scale, distributed database built on top of the Hadoop Distributed File System (HDFS). Facebook Messages, which combines messages, chat and email into a real-time conversation, is the first application in Facebook to use HBase in production. As a result, we will see more Hadoop deployments involved in lightweight online transaction processing (OLTP).
HBase is Hadoop's open-source, nonrelational, distributed database, modeled after Google's BigTable and written in Java. One application that is particularly well suited to HBase is a binary large object (BLOB) store. BLOBs are typically images, audio clips or other multimedia objects; they require large databases with rapid retrieval, and storing BLOBs in a database enables a variety of innovative applications. One example is a digital wallet, which enables users to upload their credit card images, checks and receipts for online processing; the technology eases banking, purchasing and lost-wallet recovery.
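To make the BLOB-store idea concrete, here is a minimal sketch using the classic (pre-1.0) HBase Java client API; the receipts table, the img column family and the row key are assumptions made for this example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class BlobStoreExample {
    public static void main(String[] args) throws Exception {
        // Reads cluster connection details from hbase-site.xml on the classpath.
        Configuration conf = HBaseConfiguration.create();

        // Assumes a table "receipts" with column family "img" already exists.
        HTable table = new HTable(conf, "receipts");

        // Store a small binary object (e.g. a scanned receipt) under a row key.
        byte[] imageBytes = new byte[] {0x42, 0x4C, 0x4F, 0x42};
        Put put = new Put(Bytes.toBytes("user42-receipt-001"));
        put.add(Bytes.toBytes("img"), Bytes.toBytes("jpeg"), imageBytes);
        table.put(put);

        // Retrieve it again by row key; fast single-row reads are what HBase is built for.
        Result result = table.get(new Get(Bytes.toBytes("user42-receipt-001")));
        byte[] stored = result.getValue(Bytes.toBytes("img"), Bytes.toBytes("jpeg"));
        System.out.println("Stored " + stored.length + " bytes");
        table.close();
    }
}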
Apache Pig is a platform for analyzing large data sets. Pig's language, Pig Latin, is a simple query algebra that lets you express data transformations such as merging data sets, filtering them, and applying functions to records or groups of records. Users can create their own functions to do special-purpose processing.
Pig Latin queries execute in a distributed fashion on a cluster. The current implementation compiles Pig Latin programs into MapReduce jobs and executes them on a Hadoop cluster.
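A small sketch of how a Pig Latin script can be driven from Java is shown below, using Pig's PigServer class; the input path, field layout and output path are placeholders for illustration.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigLatinExample {
    public static void main(String[] args) throws Exception {
        // ExecType.MAPREDUCE compiles the script into MapReduce jobs on the cluster;
        // ExecType.LOCAL runs the same script on a single machine for testing.
        PigServer pig = new PigServer(ExecType.MAPREDUCE);

        // HDFS paths and the field layout below are illustrative placeholders.
        pig.registerQuery("logs = LOAD '/data/access_log' USING PigStorage('\\t') "
                + "AS (user:chararray, url:chararray, bytes:long);");
        pig.registerQuery("by_user = GROUP logs BY user;");
        pig.registerQuery("traffic = FOREACH by_user GENERATE group, SUM(logs.bytes);");

        // store() triggers execution of the compiled MapReduce plan.
        pig.store("traffic", "/data/traffic_by_user");
    }
}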
The talent pool with structured query language skills is well-established and will drive Hadoop's support of SQL. SQL-like languages, such as HiveQL and DrQL, are examples of tools that are making Hadoop accessible to the large SQL-fluent community.
Mahout has implementations of a wide range of machine learning and data mining algorithms: clustering, classification, collaborative filtering and frequent pattern mining.
Core algorithms for clustering, classification and batch based collaborative filtering are implemented on top of Apache Hadoop using the map/reduce paradigm.
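As a flavour of the programming model, here is a small collaborative-filtering sketch using Mahout's in-memory Taste API (the Hadoop-based variants of the same algorithms run as map/reduce jobs); the prefs.csv file of user,item,preference triples is an assumed input.

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderExample {
    public static void main(String[] args) throws Exception {
        // prefs.csv holds lines of "userID,itemID,preference"; the path is a placeholder.
        DataModel model = new FileDataModel(new File("prefs.csv"));

        // Compare users by the correlation of their ratings and keep the 10 nearest.
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Recommend 3 items for user 42 based on what similar users preferred.
        List<RecommendedItem> items = recommender.recommend(42, 3);
        for (RecommendedItem item : items) {
            System.out.println(item.getItemID() + " -> " + item.getValue());
        }
    }
}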
The goal of Apache Mahout is to build a vibrant, responsive, diverse community to facilitate discussions not only on the project itself but also on potential use cases.
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high-throughput access to application data and is suitable for applications that have large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data.
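The sketch below shows the basic write/read pattern against HDFS through Hadoop's FileSystem API; the /user/demo path is a placeholder, and the cluster address is taken from the core-site.xml configuration on the classpath.

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS (the NameNode address) from core-site.xml on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Write a file; HDFS replicates its blocks across data nodes automatically.
        Path file = new Path("/user/demo/hello.txt");   // placeholder path
        FSDataOutputStream out = fs.create(file, true); // true = overwrite if present
        out.writeBytes("hello from HDFS\n");
        out.close();

        // Stream the file back; reads are served from whichever replica is available.
        BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(file)));
        System.out.println(reader.readLine());
        reader.close();
    }
}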
MapReduce is a programming model for data processing. The model is simple, yet not too simple to express useful programs in.
Map-Reduce allows developers to write distributed programs without requiring expertise in distributed systems. It does this by exposing a very simple interface (Map and Reduce) and handling the distributed processing issues internally. These issues include:
- Managing data
- Starting and stopping execution units
- Managing failed execution units
- Collection of results
- Communications between execution units.
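The canonical word count job, sketched below along the lines of the standard example shipped with Hadoop, illustrates how little of that machinery the programmer has to touch: the mapper and reducer contain only per-record logic, while the framework handles the issues listed above. Input and output paths are supplied on the command line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the counts collected for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Input and output paths are taken from the command line.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}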
SAP HANA - Hadoop Integration # 2
http://scn.sap.com/community/developer-center/hana/blog/2013/05/20/sap-hana--hadoop-integration-2