Hadoop Fundamentals

January 22, 2016

What is Hadoop?

  • Hadoop is an open source project of the Apache Foundation
  • It is a framework written in Java
  • It uses Google MapReduce and Google File System technologies as its foundation
  • It is optimized to handle massive quantities of data in a variety of formats using inexpensive commodity hardware
  • It replicates data across multiple computers for reliability
  • It is designed for Big Data, not for OLTP or OLAP/DSS workloads. It is a poor fit for work that cannot be parallelized, data with dependencies, low-latency data access, processing lots of small files, intensive calculations on little data, and transaction processing (it lacks random access)
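To make the replication point concrete, here is a minimal sketch of how copying each piece of data to several computers keeps it available when some of them fail. The function names, the node names, and the round-robin placement are all assumptions made for illustration; this is not how Hadoop itself chooses where replicas go.

```python
def place_replicas(items, nodes, replication=3):
    """Assign each item to `replication` distinct nodes, round-robin (illustrative only)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for i, item in enumerate(items):
        # Consecutive positions modulo len(nodes) are distinct while replication <= len(nodes)
        placement[item] = [nodes[(i + k) % len(nodes)] for k in range(replication)]
    return placement


def surviving_items(placement, failed_nodes):
    """Return the items that still have at least one replica on a live node."""
    failed = set(failed_nodes)
    return {
        item
        for item, holders in placement.items()
        if any(node not in failed for node in holders)
    }


# Hypothetical four-node cluster holding four data items
nodes = ["node1", "node2", "node3", "node4"]
placement = place_replicas(["a", "b", "c", "d"], nodes, replication=3)
still_readable = surviving_items(placement, failed_nodes=["node1", "node2"])
```

With three copies of every item, losing any two of the four nodes still leaves at least one copy of each item readable, which is the reliability argument in a nutshell.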

Hadoop-related open source projects

  • Eclipse
  • Lucene
  • HBase
  • Hive
  • Pig
  • ZooKeeper
  • Avro
  • UIMA

Hadoop Architecture

  • A Hadoop node is a single computer
  • All nodes on the same network connection form a rack
  • The bandwidth between two nodes in the same rack is greater than the bandwidth between nodes in different racks
  • A Hadoop cluster is a collection of racks
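The node, rack, and cluster hierarchy can be sketched as a small data structure. The example below is only an illustration: the `Node` class, the node and rack names, and the distance values 0, 2, and 4 are assumptions chosen to reflect that nodes in the same rack are "closer" (more bandwidth) than nodes in different racks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    rack: str


def distance(a: Node, b: Node) -> int:
    """Illustrative network distance: smaller means more available bandwidth."""
    if a == b:
        return 0  # same computer
    if a.rack == b.rack:
        return 2  # different nodes, same rack
    return 4      # nodes in different racks


def closest(reader: Node, holders: list) -> Node:
    """Pick the node holding the data that is nearest to the reader."""
    return min(holders, key=lambda node: distance(reader, node))


# A hypothetical cluster: two racks, three nodes
cluster = [Node("n1", "rack1"), Node("n2", "rack1"), Node("n3", "rack2")]
nearest = closest(cluster[0], [cluster[2], cluster[1]])
```

Here a reader on `n1` prefers the copy on `n2` (same rack) over the copy on `n3` (different rack), which is the practical consequence of the bandwidth difference described above.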

[Figure: Hadoop rack]

Main Hadoop components

  1. Distributed File System
    • Hadoop Distributed File System (HDFS)
    • IBM GPFS – FPO
  2. MapReduce Engine component
    • Framework for performing calculations on the data in the distributed file system
    • Has a built-in resource manager and scheduler
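As a rough illustration of the kind of calculation the MapReduce engine performs, here is a word count written in plain Python. It does not use Hadoop at all: the function names and the in-memory "shuffle" step are assumptions that only mimic the map, group-by-key, and reduce stages on a single machine.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line of input."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data big cluster", "data node"])
```

In a real cluster the map calls would run in parallel on the nodes that hold the data, and the grouped keys would be distributed across reducers; this sketch keeps only the shape of the computation.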

Data Access Patterns for HDFS

HDFS runs on top of the

 

What is HDFS?

 

 

 
