Skip to content

Process Mining: Data Science in Action

We will learn about

  • process mining
  • overview of the broader data science discipline
  • basic data mining techniques motivate why these are not suitable for the analysis of processes
  • automatically discovering process models from raw event data

1. Data and Process Mining

Process mining is bridging the gap between classical process model analysis. And data oriented analysis like data mining and machine learning. Process mining is bridging this gap, because it’s focusing on processes. But at the same time, using the real data.

2. Decision Tree


3. Clustering and Association Rule Learning



Background of Big Data

1. Why Big Data?

Big data -> better models -> precision results.

2. Big Data: Where Does It Come From?

  • People: Social Media
  • Organization: relational / traditional data
  • Machine: sensors

3. Characteristics of Big Data and Dimensions of Scalability

A “Small” Definition of Big Data

The term ‘big data’ seems to be popping up everywhere these days. And there seems to be as many uses of this term as there are contexts in which you find it: ‘big data’ is often used to refer to any dataset that is difficult to manage using traditional database systems; it is also used as a catch-all term for any collection of data that is too large to process on a single server; yet others use the term to simply mean “a lot of data”; sometimes it turns out it doesn’t even have to be large. So what exactly is big data?

A precise specification of ‘big’ is elusive. What is considered big for one organization may be small for another. What is large-scale today will likely seem small-scale in the near future; petabyte is the new terabyte. Thus, size alone cannot specify big data. The complexity of the data is an important factor that must also be considered.

Most now agree with the characterization of big data using the 3 V’s coined by Doug Laney of Gartner:

· Volume: Size. This refers to the vast amounts of data that is generated every second/minute/hour/day in our digitized world.

· Velocity: This refers to the speed at which data is being generated and the pace at which data moves from one point to the next.

· Variety: complexity. This refers to the ever-increasing different forms that data can come in, e.g., text, images, voice, geospatial.

A fourth V is now also sometimes added:

· Veracity: This refers to the quality of the data, which can vary greatly.

There are many other V’s that gets added to these depending on the context. For our specialization, we will add:

· Valence: This refers to how big data can bond with each other, forming connections between otherwise disparate datasets.

The above V’s are the dimensions that characterize big data, and also embody its challenges: We have huge amounts of data, in different formats and varying quality, that must be processed quickly.

It is important to note that the goal of processing big data is to gain insight to support decision-making. It is not sufficient to just be able to capture and store the data. The point of collecting and processing volumes of complex data is to understand trends, uncover hidden patterns, detect anomalies, etc. so that you have a better understanding of the problem being analyzed and can make more informed, data-driven decisions. In fact, many consider value as the sixth V of big data:

· Value: Processing big data must bring about value from insights gained.

To address the challenges of big data, innovative technologies are needed. Parallel, distributed computing paradigms, scalable machine learning algorithms, and real-time querying are key to analysis of big data. Distributed file systems, computing clusters, cloud computing, and data stores supporting data variety and agility are also necessary to provide the infrastructure for processing of big data. Workflows provide an intuitive, reusable, scalable and reproducible way to process big data to gain verifiable value from it in and enable application of same methods to different datasets.


4. Data Science: Getting Value out of Big Data

4.a> Steps in the Data Science Process


this is an iterative process and findings from one step may require the previous step to be repeated with new information.

4.a.1> Acquire data

The first step in acquiring data is to determine what data is available. Leave no stone unturned when it comes to finding the right data sources.

Acquire includes anything that makes us retrieve data including; finding, accessing, acquiring, and moving data. It includes identification of and authenticated access to all related data. And transportation of data from sources to distributed files systems.

It includes waste subset to match the data to regions or times of interest. As we sometimes refer to it as geo-spacial query.

This data is then processed and compared to patterns found by our models to determine if a weather station is experiencing Santa Ana conditions.

Depending on the source and structure of data, there are alternative ways to access it.

4.a.2> Prepare data

4.a.2.a> explore data

In summary, what you get by exploring your data is a better understanding of the complexity of the data you have to work with.


4.a.2.2> pre-process data


4.a.3> Analyze data

The stock price is a numeric value, not a category. So this is a regression task instead of a classification task.

In clustering, the goal is to organize similar items into groups. An example is grouping a company’s customer base into distinct segments for more effective targeted marketing like seniors, adults and teenagers, as we see here.

The goal in association analysis is to come up with a set of rules to capture associations within items or events. The rules are used to determine when items or events occur together. A common application of association analysis is known as market basket analysis, which is used to understand customer purchasing behavior. For example, association analysis can reveal that banking customers who have certificate of deposit accounts, surety CDs, also tend to be interested in other investment vehicles, such as money market accounts. This information can be used for cross-selling. According to data mining folklore, a supermarket chain used association analysis to discover a connection between two seemingly unrelated products. They discovered that many customers who go to the supermarket late on Sunday night to buy diapers also tend to buy beer, who are likely to be fathers. When your data can be transformed into a graph representation with nodes and links, then you want to use graph analytics to analyze your data. This kind of data comes about when you have a lot of entities and connections between those entities, like social networks. Some examples where graph analytics can be useful are exploring the spread of a disease or epidemic by analyzing hospitals’ and doctors’ records. 

Modeling starts with selecting, one of the techniques we listed as the appropriate analysis technique, depending on the type of problem you have. Then you construct the model using the data you’ve prepared. To validate the model, you apply it to new data samples. This is to evaluate how well the model does on data that was used to construct it. The common practice is to divide the prepared data into a set of data for constructing the model and reserving some of the data for evaluating the model after it has been constructed. You can also use new data prepared the same way as with the data that was used to construct model. Evaluating the model depends on the type of analysis techniques you used. Let’s briefly look at how to evaluate each technique.

For classification and regression, you will have the correct output for each sample in your input data. Comparing the correct output and the output predicted by the model, provides a way to evaluate the model. For clustering, the groups resulting from clustering should be examined to see if they make sense for your application. For example, do the customer segments reflect your customer base? Are they helpful for use in your targeted marketing campaigns? For association analysis and graph analysis, some investigation will be needed to see if the results are correct. For example, network traffic delays need to be investigated to see what your model predicts is actually happening. And whether the sources of the delays are where they are predicted to be in the real system.

As there are different types of problems, there are also different types of analysis techniques.

This step can take a couple of iterations on its own or might require data scientists to go back to steps one and two to get more data or package data in a different way

4.a.4> Visualize data: reporting insights

Timeline is a JavaScript library that allows you to create timelines.

In summary, you want to report your findings by presenting your results and value add with graphs using visualization tools.

 4.a.5> Act, apply the result: turning insights into action

Reporting insights from analysis and determining actions from insights based on the purpose you initially defined is what we refer to as the act step.

Once we define these real time actions, we need to make sure that there are automated systems, or processes to perform such actions, and provide failure recovery in case of problems.

As a summary, big data and data signs are only useful if the insites can be turned into action, and if the actions are carefully defined and evaluated.

5> Introduce to Hadoop

Although it would be possible to find counterexamples, we can generally say that the Hadoop framework is not the best for working with small data sets, advanced algorithms that require a specific hardware type, task level parallelism, infrastructure replacement, or random data access.

5.a> Hadoop Ecosystem


As a summary, the Hadoop ecosystem consists of a growing number of open-source tools. Providing opportunities to pick the right tool for the right tasks for better performance and lower costs. We will reveal some of these tools in further detail and provide an analysis of when to use which in the next set of lectures.

5.b> Hadoop HDFS



The application protects against hardware failures and provides data locality when we remove analytical complications to data.

5.c> YARN: A Resource Manager for Hadoop


It’s a scalable platform that has enabled growth of several applications over the HTFS, enriching the Hadoop ecosystem.

6> MapReduce

Many types of tasks suitable for MapReduce include search engine page ranking and topic mapping. Please see the reading after this lecture on making with MapReduce for another fun application using the MapReduce programming model. 




Big Data And The Technology

Hadoop Fundemental

What is Hadoop?

  • Hadoop is an open source project of the Apache Foundation
  • It is a framework written in Java
  • It uses Google MapReduce and Google File System technologies as its foundation
  • It is optimized to handle massive quantities of data with variety formats using inexpensive commodity hardware
  • It replicates the data into multiple computers for reliability
  • It is for Big Data (not for OLTP nor OLAP/DSS), it is not good to work on unparalleled load, data with dependencies, for low latency data access, process lots of small files, intensive calculation with little data, and process transactions (lack random)

Hadoop-related open source projects

  • Eclipse
  • Lucene
  • HBase
  • Hive
  • Pig
  • Zookeeper
  • avro
  • UIMA

 Hadoop Architecture

  • Hadoop Note is a computer
  • All Nodes in the same network connection is a Rack
  • The bandwidth between 2 nodes in the same rack is larger than the one in different racks
  • The Hadoop cluster is a collection of Racks



Main Hadoop components

  1. Distributed File System
    • Hadoop Distributed File System (HDFS)
    • IBM GPFS – FPO:
  2. MapReduce Engine component
    • Framework for performing calculations on the data in the distributed file system
    • Has a built-in resource manager and scheduler

Data Access Patterns for HDFS

HDFS runs on top of the


What is HDFS?




Accessing Hadoop Data Using Hive

Writing map/reduce programs to analyze your Big Data can get complex. Hive can help make querying your data much easier

Fundamentals of Hive

Hive System Architecture

Hadoop was the solution for large data storage but using Hadoop was not easy task for end users, especially for those who were not familiar with the map reduce concept. Hive is an easy way to work with data stored in HDFS.

Basically Hive is SQL for Hadoop cluster. It is an open source data warehouse system on top of HDFS that adds structure to the data.

HQL also has features for working with unstructured data in HDFS

The above figure shows the connection of Hive to Hadoop (HDFS + Map Reduce) and the internal structure of Hive.

The main components of Hive are:

  • Metastore: It stores all the metadata of Hive. It stores data of data stored in database, tables, columns, etc.
  • Driver: It includes compiler, optimizer and executor used to break down the Hive query language statements.
  • Query compiler: It compiles HiveQL into graph of map reduce tasks.
  • Execution engine: It executes the tasks produces by compiler.
  • Thrift server: It provides an interface to connect to other applications like MySQL, Oracle, Excel, etc. through JDBC/ODBC drivers.
  • Command line interface: It is also called Hive shell. It is used for working with data either interactively or batch data processing.
  • Web Interface: It is a visual structure on Hive used for interaction with data.

Data Storage in Hive

Hive has different forms of storage options and they include:

  1. Metastore: Metastore keeps track of all the metadata of database, tables, columns, datatypes etc. in Hive. It also keeps track of HDFS mapping.
  2. Tables: There can be 2 types of tables in Hive. First, normal tables like any other table in database. Second, external tables which are like normal tables except for the deletion part. HDFS mappings are used to create external tables which are pointers to table in HDFS. The difference between the two types of tables is that when the external table is deleted its data is not deleted. Its data is stored in the HDFS whereas in case of normal table the data also gets deleted on deleting the table.
    Partitions: Partition is slicing of tables that are stored in different subdirectory within a table’s directory. It enhances query performance especially in case of select statements with “WHERE” clause.
    Buckets: Buckets are hashed partitions and they speed up joins and sampling of data.

Hive vs. RDBMS (Relational database)

Hive and RDBMS are very similar but they have different applications and different schemas that they are based on.

  • Hive is built for OLAP that is real time reporting of data. Hive does not support inserting into an existing table or updating table data like RDBMS which is an important part of OLTP process. All data is either inserted in new table or overwritten in existing table.
  • Hive is based on read schema that means data is not checked when it is loaded so data loading is fast but reading is slower.

Hive Query Language (HQL)

Load is used for taking data from HDFS and moving it into Hive. Insert is used for moving data from one Hive table to another. Select is used for querying data. Explain gives insights into structure of data.

Hive Installation

Prerequisites for installing Hive:

  • Java 1.7
  • Hadoop 2.x

Steps to install:

  1. Download stable version of Hive from
  2. Go to downloads and select the latest mirror. Download the latest tar ball apache-Hive-1.2.1-bin.tar.gz
  3. Unzip the tar ball using following command: tar -xzvf tar -xzvf apache-Hive-1.2.1-bin.tar.gz
  4. Set the environment variable HIVE_HOME to point to the installation directory:
    export HIVE_HOME = /user/local/Hive

Finally, add HIVE_HOME/bin to PATH:

export PATH = $PATH:HIVE_HOME/bin

To start Hive shell just type Hive after setting the path and Hive shell will fire up. To verify that Hive has started use command:

set –v

All the Hive properties will show up and look for mapred.job.tracker = hname : 1002 to verify that Hive has found the Hadoop cluster. Thus Hive is installed successfully and database can be created followed by tables and queries.











Upload data from Azure blob storage to Hive tables

This section describes how to ingest data into Azure blob target environments where the data can be stored and processed during the Analystics Process

What is Azure Blob Storage?

Azure Blob storage is a service for storing large amounts of unstructured data, such as text or binary data, that can be accessed from anywhere in the world via HTTP or HTTPS. You can use Blob storage to expose data publicly to the world, or to store application data privately.

Common uses of Blob storage include:

  • Serving images or documents directly to a browser
  • Storing files for distributed access
  • Streaming video and audio
  • Performing secure backup and disaster recovery
  • Storing data for analysis by an on-premises or Azure-hosted service


  • Storage Account: All access to Azure Storage is done through a storage account
  • Container: A container provides a grouping of a set of blobs. All blobs must be in a container. An account can contain an unlimited number of containers. A container can store an unlimited number of blobs.
  • Blob: A file of any type and size. Azure Storage offers three types of blobs: block blobs, page blobs, and append blobs.

    Block blobs are ideal for storing text or binary files, such as documents and media files. Append blobs are similar to block blobs in that they are made up of blocks, but they are optimized for append operations, so they are useful for logging scenarios. A single block blob or append blob can contain up to 50,000 blocks of up to 4 MB each, for a total size of slightly more than 195 GB (4 MB X 50,000).

    Page blobs can be up to 1 TB in size, and are more efficient for frequent read/write operations. Azure Virtual Machines use page blobs as OS and data disks.

Naming and referencing containers and blobs

You can address a blob in your storage account using the following URL format:


For example, here is a URL that addresses one of the blobs in the diagram above

Container naming rules

A container name must be a valid DNS name and conform to the following rules:

  • A container name must be all lowercase.
  • Container names must start with a letter or number, and can contain only letters, numbers, and the dash (-) character.
  • Every dash (-) character must be immediately preceded and followed by a letter or number; consecutive dashes are not permitted in container names.
  • Container names must be from 3 through 63 characters long.

Every blob in Azure storage must reside in a container. The container forms part of the blob name. For example, mycontainer is the name of the container in these sample blob URIs:

lob naming rules

A blob name must conform to the following rules:

  • A blob name can contain any combination of characters.
  • A blob name must be at least one character long and cannot be more than 1,024 characters long.
  • Blob names are case-sensitive.
  • Reserved URL characters must be properly escaped.
  • The number of path segments comprising the blob name cannot exceed 254. A path segment is the string between consecutive delimiter characters (e.g., the forward slash ‘/’) that corresponds to the name of a virtual directory.

The Blob service is based on a flat storage scheme. You can create a virtual hierarchy by specifying a character or string delimiter within the blob name to create a virtual hierarchy. For example, the following list shows some valid and unique blob names:


You can use the delimiter character to list blobs hierarchically.

Create Hive database and tables


create database if not exists <database name>;
CREATE EXTERNAL TABLE if not exists <database name>.<table name>
field1 string,
field2 int,
field3 float,
field4 double,
fieWIldN string
ROW FORMAT DELIMITED FIELDS TERMINATED BY ‘<field separator>’ lines terminated by ‘<line separator>’


Here are the descriptions of the fields that users need to plug in and other configurations:

  • <database name>: the name of the database users want to create. If users just want to use the default database, the querycreate database… can be omitted.
  • <table name>: the name of the table users want to create within the specified database. If users want to use the default database, the table can be directly referred by <table name> without <database name>.
  • <field separator>: the separator that delimits fields in the data file to be uploaded to the Hive table.
  • <line separator>: the separator that delimits lines in the data file.
  • <storage location>: the Azure storage location to save the data of Hive tables. If users do not specify LOCATION <storage location>, the database and the tables are stored in hive/warehouse/ directory in the default container of the Hive cluster by default. If a user wants to specify the storage location, the storage location has to be within the default container for the database and tables. This location has to be referred as location relative to the default container of the cluster in the format of‘wasb:///<directory 1>/’ or ‘wasb:///<directory 1>/<directory 2>/’, etc. After the query is executed, the relative directories will be created within the default container. Otherwise, to create External hive table on an additional container of the cluster, the <storage location> can be a configuration value ‘${hiveconf:<defined path>}’ and defined the <defined path> in the Activity of Azure Data Factory’s Pipeline with format ‘wasb://<addional container name>@<storage account name><directory 2>’.
  • TBLPROPERTIES(“skip.header.line.count”=”1”): If the data file has a header line, users have to add this property at the end of the create table query. Otherwise, the header line will be loaded as a record to the table. If the data file does not have a header line, this configuration can be omitted in the query.


Here is the Hive query that loads data into a Hive table.

LOAD DATA INPATH '<path to blob data>' INTO TABLE <database name>.<table name>;

<path to blob data>:

If the blob file to be uploaded to the Hive table is in the default container of the HDInsight Hadoop cluster, the <path to blob data> should be in the format ‘wasb:///<directory in this container>/<blob file name>’.

The blob file can also be in an additional container of the HDInsight Hadoop cluster. In this case, <path to blob data> should be in the format ‘wasb://<container name>@<storage account name><blob file name>’.

Partitioned table and store Hive data in Optimized Row columnar (ORC)

If the data is large, partitioning the table is beneficial for queries that only need to scan a few partitions of the table. For instance, it is reasonable to partition the log data of a web site by dates.

In addition to partitioning Hive tables, it is also beneficial to store the Hive data in the Optimized Row Columnar (ORC) format. 

Partitioned table

CREATE EXTERNAL TABLE IF NOT EXISTS <database name>.<table name>
(field1 string,
fieldN string
PARTITIONED BY (<partitionfieldname> vartype) 
     lines terminated by '<line separator>' 

LOAD DATA INPATH '<path to the source file>' INTO TABLE <database name>.<partitioned table name> 
    PARTITION (<partitionfieldname>=<partitionfieldvalue>);

When querying partitioned tables, it is recommended to add the partition condition in the beginning of the where clause as this improves the efficacy of searching significantly.

    field1, field2, ..., fieldN
from <database name>.<partitioned table name> 
where <partitionfieldname>=<partitionfieldvalue> and ...;

Store Hive data in ORC format

Users cannot directly load data from blob storage into Hive tables that is stored in the ORC format. Here are the steps that the users need to take in order to load data from Azure blobs to Hive tables stored in ORC format.

  1. Create an external table STORED AS TEXTFILE and load data from blob storage to the table.

CREATE EXTERNAL TABLE IF NOT EXISTS <database name>.<external textfile table name>
( field1 string, field2 int,
fieldN date ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ‘<field separator>’ lines terminated by ‘<line separator>’ STORED AS TEXTFILE LOCATION ‘wasb:///<defaul container directory in Azure blob>’
(“skip.header.line.count”=“1”); LOAD DATA INPATH ‘<path to the source file>’ INTO TABLE <database name>.<table name>;

2. Create an internal table with the same schema as the external table in step 1, with the same field delimiter, and store the Hive data in the ORC format.

CREATE TABLE IF NOT EXISTS <database name>.<ORC table name> ( field1 string, field2 int, fieldN date ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ‘<field separator>’ STORED AS ORC;

3.  Select data from the external table in step 1 and insert into the ORC table

INSERT OVERWRITE TABLE <database name>.<ORC table name>
    SELECT * FROM <database name>.<external textfile table name>;


If the TEXTFILE table <database name>.<external textfile table name> has partitions, in STEP 3, theSELECT * FROM <database name>.<external textfile table name> command will select the partition variable as a field in the returned data set. Inserting it into the <database name>.<ORC table name> will fail since <database name>.<ORC table name> does not have the partition variable as a field in the table schema. In this case, users need to specifically select the fields to be inserted to <database name>.<ORC table name> as follows:

INSERT OVERWRITE TABLE <database name>.<ORC table name> PARTITION (<partition variable>=<partition value>)
   SELECT field1, field2, ..., fieldN
   FROM <database name>.<external textfile table name> 
   WHERE <partition variable>=<partition value>;

4. It is safe to drop the <external textfile table name> when using the following query after all data has been inserted into<database name>.<ORC table name>:

DROP TABLE IF EXISTS <database name>.<external textfile table name>;

Tuning Cube Processing