
Background of Big Data

December 1, 2016

1. Why Big Data?

Big data -> better models -> more precise results.

2. Big Data: Where Does It Come From?

  • People: social media
  • Organizations: relational / traditional data
  • Machines: sensors

3. Characteristics of Big Data and Dimensions of Scalability

A “Small” Definition of Big Data

The term ‘big data’ seems to be popping up everywhere these days. And there seems to be as many uses of this term as there are contexts in which you find it: ‘big data’ is often used to refer to any dataset that is difficult to manage using traditional database systems; it is also used as a catch-all term for any collection of data that is too large to process on a single server; yet others use the term to simply mean “a lot of data”; sometimes it turns out it doesn’t even have to be large. So what exactly is big data?

A precise specification of ‘big’ is elusive. What is considered big for one organization may be small for another. What is large-scale today will likely seem small-scale in the near future; petabyte is the new terabyte. Thus, size alone cannot specify big data. The complexity of the data is an important factor that must also be considered.

Most now agree with the characterization of big data using the 3 V’s coined by Doug Laney of Gartner:

· Volume: Size. This refers to the vast amounts of data that are generated every second/minute/hour/day in our digitized world.

· Velocity: This refers to the speed at which data is being generated and the pace at which data moves from one point to the next.

· Variety: Complexity. This refers to the ever-increasing different forms that data can come in, e.g., text, images, voice, and geospatial data.

A fourth V is now also sometimes added:

· Veracity: This refers to the quality of the data, which can vary greatly.

There are many other V’s that get added to these depending on the context. For our specialization, we will add:

· Valence: This refers to how data items can bond with each other, forming connections between otherwise disparate datasets.

The above V’s are the dimensions that characterize big data, and also embody its challenges: We have huge amounts of data, in different formats and varying quality, that must be processed quickly.

It is important to note that the goal of processing big data is to gain insight to support decision-making. It is not sufficient to just be able to capture and store the data. The point of collecting and processing volumes of complex data is to understand trends, uncover hidden patterns, detect anomalies, etc. so that you have a better understanding of the problem being analyzed and can make more informed, data-driven decisions. In fact, many consider value as the sixth V of big data:

· Value: Processing big data must bring about value from insights gained.

To address the challenges of big data, innovative technologies are needed. Parallel, distributed computing paradigms, scalable machine learning algorithms, and real-time querying are key to the analysis of big data. Distributed file systems, computing clusters, cloud computing, and data stores supporting data variety and agility are also necessary to provide the infrastructure for processing big data. Workflows provide an intuitive, reusable, scalable, and reproducible way to process big data, to gain verifiable value from it, and to enable applying the same methods to different datasets.

 

4. Data Science: Getting Value out of Big Data

4.a> Steps in the Data Science Process

ACQUIRE -> PREPARE -> ANALYZE -> REPORT / VISUALIZE -> ACT

This is an iterative process; findings from one step may require a previous step to be repeated with new information.

4.a.1> Acquire data

The first step in acquiring data is to determine what data is available. Leave no stone unturned when it comes to finding the right data sources.

The acquire step includes anything that lets us retrieve data: finding, accessing, acquiring, and moving it. It includes identification of, and authenticated access to, all related data, and transportation of the data from its sources into distributed file systems.

It also includes ways to subset the data to match it to regions or times of interest, sometimes referred to as a geospatial query.
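As a concrete illustration, here is a minimal sketch of such a subset operation using pandas; the station data, column names, and bounding box below are made up for illustration:

```python
import pandas as pd

# Hypothetical weather-station readings; columns and values are made up.
readings = pd.DataFrame({
    "station": ["A", "B", "C", "D"],
    "lat": [32.7, 33.1, 34.5, 32.9],
    "lon": [-117.2, -116.9, -118.2, -117.0],
    "timestamp": pd.to_datetime([
        "2016-11-30 22:00", "2016-12-01 01:00",
        "2016-12-01 03:00", "2016-12-01 05:00",
    ]),
    "humidity": [8.0, 35.0, 12.0, 9.0],
})

# Subset to a region of interest (a rough bounding box) and a time window.
in_region = readings["lat"].between(32.5, 33.3) & readings["lon"].between(-117.5, -116.5)
in_window = readings["timestamp"] >= "2016-12-01"
subset = readings[in_region & in_window]
print(subset)
```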

In the course’s example, this data is then processed and compared to patterns found by the models to determine whether a weather station is experiencing Santa Ana conditions.

Depending on the source and structure of data, there are alternative ways to access it.

4.a.2> Prepare data

4.a.2.1> Explore data

In summary, what you get by exploring your data is a better understanding of the complexity of the data you have to work with.
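Exploration typically starts with summary statistics, missing-value counts, and correlations. Here is a minimal sketch with pandas, using a made-up dataset:

```python
import pandas as pd

# Hypothetical dataset for illustration.
df = pd.DataFrame({
    "age": [34, 51, 29, 44, 62],
    "income": [52000, 88000, 41000, None, 73000],
})

print(df.describe())      # summary statistics: count, mean, std, quartiles
print(df.isnull().sum())  # how many values are missing in each column
print(df.corr())          # correlation between the numeric columns
```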

 

4.a.2.2> Pre-process data
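Pre-processing generally means cleaning the data (handling missing or implausible values) and transforming it so that features are comparable. A minimal sketch with pandas; the columns and values are hypothetical:

```python
import pandas as pd

# Hypothetical raw data with quality problems; values are made up.
raw = pd.DataFrame({
    "age": [34, None, 29, 260],  # one missing value, one implausible value
    "income": [52000, 88000, None, 73000],
})

# Clean: impute missing values with the column median, drop implausible ages.
clean = raw.fillna(raw.median(numeric_only=True))
clean = clean[clean["age"].between(0, 120)]

# Transform: min-max scale each column to [0, 1] so features are comparable.
scaled = (clean - clean.min()) / (clean.max() - clean.min())
print(scaled)
```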

 

4.a.3> Analyze data

A stock price is a numeric value, not a category, so predicting it is a regression task instead of a classification task.
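For instance, here is a minimal regression sketch with scikit-learn; the prices and the single feature are made up, and a real model would use many more features:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical feature (day number) and target (stock price); values made up.
days = np.array([[1], [2], [3], [4], [5]])
price = np.array([10.0, 10.4, 10.9, 11.2, 11.8])

model = LinearRegression().fit(days, price)
print(model.predict([[6]]))  # a numeric prediction, not a category
```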

In clustering, the goal is to organize similar items into groups. An example is grouping a company’s customer base into distinct segments, such as seniors, adults, and teenagers, for more effective targeted marketing.
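A minimal sketch of such segmentation using k-means clustering in scikit-learn; the customer ages are made up:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customer ages; we look for three segments.
ages = np.array([[15], [17], [19], [35], [41], [44], [67], [70], [74]])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(ages)
print(kmeans.labels_)           # segment assigned to each customer
print(kmeans.cluster_centers_)  # roughly: teenagers, adults, seniors
```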

The goal in association analysis is to come up with a set of rules to capture associations between items or events. The rules are used to determine when items or events occur together. A common application of association analysis is market basket analysis, which is used to understand customer purchasing behavior. For example, association analysis can reveal that banking customers who hold certificate of deposit accounts (CDs) also tend to be interested in other investment vehicles, such as money market accounts. This information can be used for cross-selling. According to data mining folklore, a supermarket chain used association analysis to discover a connection between two seemingly unrelated products: many customers, likely fathers, who go to the supermarket late on Sunday night to buy diapers also tend to buy beer.

When your data can be transformed into a graph representation with nodes and links, you want to use graph analytics to analyze it. This kind of data comes about when you have many entities and connections between those entities, as in social networks. One example where graph analytics can be useful is exploring the spread of a disease or epidemic by analyzing hospitals’ and doctors’ records.
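To make the market basket example above concrete, here is a minimal sketch of association analysis in plain Python, counting how often an item pair co-occurs and computing the support and confidence of a candidate rule; the transactions are made up:

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions (market baskets).
baskets = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"milk", "bread"},
    {"diapers", "beer", "bread"},
    {"diapers", "milk"},
]

item_counts = Counter()
pair_counts = Counter()
for basket in baskets:
    item_counts.update(basket)
    pair_counts.update(combinations(sorted(basket), 2))

# Rule "diapers -> beer": how often the pair occurs overall (support), and
# how often beer appears given that diapers are in the basket (confidence).
support = pair_counts[("beer", "diapers")] / len(baskets)
confidence = pair_counts[("beer", "diapers")] / item_counts["diapers"]
print(f"support={support:.2f}, confidence={confidence:.2f}")
```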

Modeling starts with selecting the appropriate analysis technique from those listed above, depending on the type of problem you have. Then you construct the model using the data you’ve prepared. To validate the model, you apply it to new data samples; this evaluates how well the model does on data that was not used to construct it. The common practice is to divide the prepared data into one set for constructing the model and a reserved set for evaluating the model after it has been constructed. You can also use new data prepared the same way as the data that was used to construct the model. Evaluating the model depends on the type of analysis technique you used. Let’s briefly look at how to evaluate each technique.
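A minimal sketch of that hold-out practice using scikit-learn’s train_test_split; the data is synthetic:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic prepared data: one feature with a noisy linear relationship.
X = np.arange(100).reshape(-1, 1)
y = 3.0 * X.ravel() + np.random.default_rng(0).normal(0, 5, 100)

# Reserve a quarter of the data for evaluation after construction.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = LinearRegression().fit(X_train, y_train)
print(model.score(X_test, y_test))  # R^2 on data not used for construction
```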

For classification and regression, you will have the correct output for each sample in your input data. Comparing the correct output with the output predicted by the model provides a way to evaluate the model. For clustering, the groups resulting from clustering should be examined to see whether they make sense for your application: do the customer segments reflect your customer base? Are they helpful for your targeted marketing campaigns? For association analysis and graph analysis, some investigation is needed to see whether the results are correct. For example, network traffic delays need to be investigated to see whether what your model predicts is actually happening, and whether the sources of the delays are where the model predicts them to be in the real system.
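For classification and regression, the comparison of correct and predicted outputs is usually summarized with a metric. A minimal sketch with scikit-learn, using made-up labels and values:

```python
from sklearn.metrics import accuracy_score, mean_squared_error

# Classification: compare the correct labels with the predicted labels.
y_true = ["spam", "ham", "spam", "ham", "spam"]
y_pred = ["spam", "ham", "ham", "ham", "spam"]
print(accuracy_score(y_true, y_pred))  # fraction predicted correctly

# Regression: measure how far predictions fall from the correct values.
prices_true = [10.0, 10.4, 10.9]
prices_pred = [10.1, 10.2, 11.0]
print(mean_squared_error(prices_true, prices_pred))
```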

As there are different types of problems, there are also different types of analysis techniques.

This step can take a couple of iterations on its own, or it might require data scientists to go back to steps one and two to get more data or to package the data in a different way.

4.a.4> Visualize data: reporting insights

TimelineJS is a JavaScript library that allows you to create timelines.

In summary, you want to report your findings by presenting your results and value add with graphs using visualization tools.
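For example, here is a minimal sketch of a report graphic with matplotlib; the segment counts are made up:

```python
import matplotlib.pyplot as plt

# Hypothetical result: size of each customer segment found by clustering.
segments = ["teenagers", "adults", "seniors"]
counts = [120, 340, 210]

plt.bar(segments, counts)
plt.title("Customers per segment")
plt.ylabel("count")
plt.savefig("segments.png")  # save the figure for inclusion in a report
```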

4.a.5> Act, apply the results: turning insights into action

Reporting insights from the analysis, and determining actions from those insights based on the purpose you initially defined, is what we refer to as the act step.

Once we define these real-time actions, we need to make sure that there are automated systems or processes to perform such actions and to provide failure recovery in case of problems.

In summary, big data and data science are only useful if the insights can be turned into action, and if the actions are carefully defined and evaluated.

5> Introduction to Hadoop

Although it would be possible to find counterexamples, we can generally say that the Hadoop framework is not the best for working with small data sets, advanced algorithms that require a specific hardware type, task level parallelism, infrastructure replacement, or random data access.

5.a> Hadoop Ecosystem

[Figure: Hadoop ecosystem diagram]

In summary, the Hadoop ecosystem consists of a growing number of open-source tools, providing opportunities to pick the right tool for the right task for better performance and lower costs. We will review some of these tools in further detail, and analyze when to use which, in the next set of lectures.

5.b> Hadoop HDFS

[Figure: HDFS architecture with NameNode and DataNodes]

 

HDFS protects against hardware failures and provides data locality when we move analytical computations to the data.

5.c> YARN: A Resource Manager for Hadoop

[Figure: YARN architecture]

It’s a scalable platform that has enabled the growth of several applications over HDFS, enriching the Hadoop ecosystem.

6> MapReduce

Many types of tasks are suitable for MapReduce, including search engine page ranking and topic mapping. Please see the reading after this lecture for another fun application using the MapReduce programming model.
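As a concrete illustration of the programming model, here is a minimal sketch of the classic word-count example in plain Python, simulating the map, shuffle, and reduce phases that Hadoop would run in parallel across a cluster:

```python
from collections import defaultdict

documents = ["big data is big", "data science uses big data"]

# Map: emit a (word, 1) pair for every word in every document.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group values by key (Hadoop does this between the two phases).
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: sum the counts for each word.
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)  # {'big': 3, 'data': 3, 'is': 1, 'science': 1, 'uses': 1}
```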

 

 

Source: https://www.coursera.org/learn/big-data-introduction
