
Hadoop Fundamentals

January 22, 2016

What is Hadoop?

  • Hadoop is an open source project of the Apache Foundation
  • It is a framework written in Java
  • It uses Google MapReduce and Google File System technologies as its foundation
  • It is optimized to handle massive quantities of data in a variety of formats using inexpensive commodity hardware
  • It replicates data across multiple computers for reliability
  • It is designed for Big Data (not for OLTP or OLAP/DSS); it is not well suited to workloads that cannot be parallelized, data with dependencies between records, low-latency data access, processing lots of small files, computation-intensive jobs with little data, or transaction processing (it lacks random writes)
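The replication point above can be illustrated with a toy sketch in plain Java: every block is copied onto several different nodes, so the loss of one machine loses no data. The node names and the round-robin placement are assumptions mimicking HDFS's default replication factor of 3; this is not real HDFS code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy sketch of replication for reliability (not the real HDFS placement
// logic): each block gets `factor` copies on distinct nodes.
public class ReplicationSketch {
    static Map<String, List<String>> placeBlocks(List<String> blocks,
                                                 List<String> nodes,
                                                 int factor) {
        Map<String, List<String>> placement = new TreeMap<>();
        for (int b = 0; b < blocks.size(); b++) {
            List<String> copies = new ArrayList<>();
            for (int r = 0; r < factor; r++) {
                // round-robin over the nodes so copies land on distinct machines
                copies.add(nodes.get((b + r) % nodes.size()));
            }
            placement.put(blocks.get(b), copies);
        }
        return placement;
    }

    public static void main(String[] args) {
        List<String> nodes = List.of("node1", "node2", "node3", "node4");
        System.out.println(placeBlocks(List.of("blk-1", "blk-2"), nodes, 3));
        // {blk-1=[node1, node2, node3], blk-2=[node2, node3, node4]}
    }
}
```

With three copies of each block, any single node (or even two) can fail and every block is still readable from a surviving replica.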

Hadoop-related open source projects

  • Eclipse
  • Lucene
  • HBase
  • Hive
  • Pig
  • Zookeeper
  • Avro
  • UIMA

Hadoop Architecture

  • A Hadoop Node is a single computer
  • Nodes connected to the same network switch form a Rack
  • The bandwidth between two nodes in the same rack is greater than the bandwidth between two nodes in different racks
  • The Hadoop cluster is a collection of Racks



Main Hadoop components

  1. Distributed File System
    • Hadoop Distributed File System (HDFS)
    • IBM GPFS-FPO (File Placement Optimizer)
  2. MapReduce Engine component
    • Framework for performing calculations on the data in the distributed file system
    • Has a built-in resource manager and scheduler
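To make the MapReduce Engine component concrete, here is the classic word-count example reduced to a single-process sketch in plain Java: a "map" step emits (word, 1) pairs and a "reduce" step sums the counts per word. This deliberately uses no Hadoop API; in a real job the framework would run the map tasks in parallel across the cluster and shuffle the pairs to reducers.

```java
import java.util.Map;
import java.util.TreeMap;

// Toy word count mimicking the MapReduce model in one process
// (not the actual Hadoop Mapper/Reducer API).
public class WordCountSketch {
    static Map<String, Integer> mapReduce(String[] lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {                        // "map": emit (word, 1)
            for (String word : line.toLowerCase().split("\\s+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);   // "reduce": sum per key
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] input = { "big data big cluster", "data" };
        System.out.println(mapReduce(input)); // {big=2, cluster=1, data=2}
    }
}
```

The key property the sketch preserves is that each input line can be mapped independently, which is what lets the real engine spread the work over many nodes and move the computation to wherever the data blocks live.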

Data Access Patterns for HDFS

HDFS is built around a write-once, read-many access pattern: data is typically written once and then read many times by analytic jobs, favoring large streaming reads over random access. HDFS runs on top of the native file system of the underlying operating system.
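The write-once, read-many pattern can be mimicked with a toy file abstraction in plain Java: data may be appended while the file is open, but once it is closed, its bytes can be read repeatedly and never rewritten. This is an illustrative sketch, not the real HDFS client API.

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch of HDFS's write-once/read-many model (not the real API):
// append while open, close, then read sequentially as often as you like.
public class WriteOnceFile {
    private final List<String> chunks = new ArrayList<>();
    private boolean closed = false;

    void append(String data) {
        if (closed) {
            throw new IllegalStateException("file is closed: no rewrites");
        }
        chunks.add(data);                 // sequential appends only
    }

    void close() { closed = true; }

    String readAll() {                    // repeatable sequential read
        return String.join("", chunks);
    }
}
```

Forbidding in-place updates is what makes cheap replication and large streaming reads practical: replicas never have to be patched, only copied whole.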


What is HDFS?

  • HDFS is Hadoop's distributed file system, designed to store very large files across a cluster of commodity machines
  • Files are split into large blocks, and each block is replicated across several nodes (three copies by default)
  • A NameNode holds the file system metadata, while DataNodes store the actual blocks
