
SQL Server Connection Manager Tips

Following are the highlights of OLE DB and ADO.NET connections:

  • OLE DB connections are supported by the majority of tasks and transformations in SSIS, and a few tasks/transformations, such as the Bulk Insert Task and the Lookup Transformation, support only an OLE DB connection.
  • ADO.NET connections are supported by many tasks and transformations in SSIS, and a few tasks/transformations, such as the Data Profiling Task and the CDC Control Task, support only an ADO.NET connection.
  • Mapping parameters in the Execute SQL Task is clearer and easier to understand with an ADO.NET connection.
  • An OLE DB connection (being a native connection) is a bit faster than ADO.NET in most scenarios.

Among the various connection types supported by the Execute SQL Task, ADO.NET is the only one that lets you specify the parameter mapping clearly, using named parameters.
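As a hedged sketch (the table and variable names here are made up), with an ADO.NET connection the Execute SQL Task statement can use named parameters:

SELECT LoadDate FROM etl.BatchControl WHERE BatchId = @BatchId

and the Parameter Mapping tab maps the SSIS variable User::BatchId to the parameter name @BatchId. With an OLE DB connection the same statement would use ? placeholders mapped by ordinal position (0, 1, 2, …), which is harder to read.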

Adding an “Application Name” property to an SSIS connection string makes it easy to trace the queries being executed by an SSIS package.
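As a hedged example (the server, database, and application names below are made up), an OLE DB connection string with this property might look like:

Data Source=MyServer;Initial Catalog=MyDatabase;Provider=SQLNCLI11.1;Integrated Security=SSPI;Application Name=SSIS-DailyLoad;

The value then shows up as the program_name in sys.dm_exec_sessions and in Profiler/Extended Events traces, making it easy to pick out the queries issued by the package.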

 


SSIS Package uses incorrect Package Configuration Settings

SSIS packages evaluate configurations (XML configurations, at least) in the following order:

1) What they’re told to look for (say, in a job or a manual execute step)
2) The path saved in the package: Package Configurations
3) Default values in the package itself: hard-coded values

So it makes sense that when #1 is not available, it defaults to #2, and when #2 is also not available, it defaults to #3.

On occasion we have a package that insists on using the hard-coded values instead of the designated config file. When that happens, we have to open the package, strip out the hard-coded values, set properties so they are not validated on open (DelayValidation=True), and disconnect the local config file settings (just remove the config file name in the Package Configurations). This is a good workaround.

Package Configurations

SQL Server Integration Services provides package configurations that you can use to update the values of properties at run time.

Configurations are available for the package deployment model.

Parameters are used in place of configurations for the project deployment model.

A configuration is a property/value pair that you add to a completed package. Typically, you create a package, set properties on the package objects during package development, and then add the configuration to the package. When the package runs, it gets the new value of the property from the configuration.

Understanding How Package Configurations Are Applied at Run Time

When you use the dtexec command prompt utility (dtexec.exe) to run a deployed package, the utility applies package configurations twice: both before and after it applies the options that you specified on the command line.

As the utility loads and runs the package, events occur in the following order:

  1. The dtexec utility loads the package.
  2. The utility applies the configurations that were specified in the package at design time, in the order specified in the package (except the Parent Package Variable configurations).
  3. The utility applies any options that you specified on the command line.
  4. The utility reloads the configurations that were specified in the package at design time, in the order specified in the package (again, except the Parent Package Variable configurations). The utility uses any command-line options that were specified to reload the configurations, so different values might be reloaded from a different location.
  5. The utility applies the Parent Package Variable configurations.
  6. The utility runs the package.

The way in which the dtexec utility applies configurations affects the following command-line options:
  • You can use the /Connection or /Set option at run time to load package configurations from a location other than the location that you specified at design time.
  • You can use the /ConfigFile option to load additional configurations that you did not specify at design time, but package configurations have to be enabled in the package (an example command follows the restrictions below).

    However, these command-line options do have some restrictions:

  • You cannot use the /Set or the /Connection option to override single values that are also set by a configuration.
  • You cannot use the /ConfigFile option to load configurations that replace the configurations that you specified at design time.
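As a hedged illustration of these options (the package, file, and property paths below are hypothetical), a deployed package might be run as:

dtexec /File "C:\ETL\LoadSales.dtsx" /ConfigFile "C:\ETL\LoadSales_Prod.dtsConfig" /Set "\Package.Variables[User::BatchSize].Properties[Value]";500

Here /ConfigFile loads an additional configuration file and /Set overrides a single property value, subject to the restrictions listed above.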

The ConfiguredValue element holds the value to be used in the package for the object specified in the Path element; both the opening and closing tags must be on the same line, e.g. <ConfiguredValue>abc</ConfiguredValue>
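For reference, a configuration entry in a dtsConfig file looks roughly like the following (a generated file also contains a DTSConfigurationHeading section; the property path and value here are illustrative):

<?xml version="1.0"?>
<DTSConfiguration>
  <Configuration ConfiguredType="Property" Path="\Package.Variables[User::ServerName].Properties[Value]" ValueType="String">
    <ConfiguredValue>abc</ConfiguredValue>
  </Configuration>
</DTSConfiguration>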

 

Process Mining: Data Science in Action

We will learn about:

  • process mining
  • overview of the broader data science discipline
  • basic data mining techniques, and why these are not suitable for the analysis of processes
  • automatically discovering process models from raw event data

1. Data and Process Mining

Process mining bridges the gap between classical process model analysis and data-oriented analysis such as data mining and machine learning. It bridges this gap because it focuses on processes while, at the same time, using real event data.

2. Decision Tree

 

3. Clustering and Association Rule Learning

 

 

https://www.coursera.org/learn/process-mining/supplement/NehVf/process-mining-data-science-in-action-getting-started

 

Background of Big Data

1. Why Big Data?

Big data -> better models -> more precise results.

2. Big Data: Where Does It Come From?

  • People: social media
  • Organizations: relational / traditional data
  • Machines: sensors

3. Characteristics of Big Data and Dimensions of Scalability

A “Small” Definition of Big Data

The term ‘big data’ seems to be popping up everywhere these days. And there seems to be as many uses of this term as there are contexts in which you find it: ‘big data’ is often used to refer to any dataset that is difficult to manage using traditional database systems; it is also used as a catch-all term for any collection of data that is too large to process on a single server; yet others use the term to simply mean “a lot of data”; sometimes it turns out it doesn’t even have to be large. So what exactly is big data?

A precise specification of ‘big’ is elusive. What is considered big for one organization may be small for another. What is large-scale today will likely seem small-scale in the near future; petabyte is the new terabyte. Thus, size alone cannot specify big data. The complexity of the data is an important factor that must also be considered.

Most now agree with the characterization of big data using the 3 V’s coined by Doug Laney of Gartner:

· Volume: Size. This refers to the vast amounts of data that are generated every second/minute/hour/day in our digitized world.

· Velocity: This refers to the speed at which data is being generated and the pace at which data moves from one point to the next.

· Variety: Complexity. This refers to the ever-increasing different forms that data can come in, e.g., text, images, voice, geospatial.

A fourth V is now also sometimes added:

· Veracity: This refers to the quality of the data, which can vary greatly.

There are many other V’s that get added to these depending on the context. For our specialization, we will add:

· Valence: This refers to how big data can bond with each other, forming connections between otherwise disparate datasets.

The above V’s are the dimensions that characterize big data, and also embody its challenges: We have huge amounts of data, in different formats and varying quality, that must be processed quickly.

It is important to note that the goal of processing big data is to gain insight to support decision-making. It is not sufficient to just be able to capture and store the data. The point of collecting and processing volumes of complex data is to understand trends, uncover hidden patterns, detect anomalies, etc. so that you have a better understanding of the problem being analyzed and can make more informed, data-driven decisions. In fact, many consider value as the sixth V of big data:

· Value: Processing big data must bring about value from insights gained.

To address the challenges of big data, innovative technologies are needed. Parallel, distributed computing paradigms, scalable machine learning algorithms, and real-time querying are key to analysis of big data. Distributed file systems, computing clusters, cloud computing, and data stores supporting data variety and agility are also necessary to provide the infrastructure for processing of big data. Workflows provide an intuitive, reusable, scalable and reproducible way to process big data, gain verifiable value from it, and enable application of the same methods to different datasets.

 

4. Data Science: Getting Value out of Big Data

4.a> Steps in the Data Science Process

ACQUIRE -> PREPARE -> ANALYZE -> REPORT / VISUALIZE -> ACT

This is an iterative process, and findings from one step may require a previous step to be repeated with new information.

4.a.1> Acquire data

The first step in acquiring data is to determine what data is available. Leave no stone unturned when it comes to finding the right data sources.

Acquiring data includes anything that involves retrieving data: finding, accessing, acquiring, and moving it. It includes identification of, and authenticated access to, all related data, and transportation of data from sources to distributed file systems.

It also includes ways to subset and match the data to regions or times of interest, sometimes referred to as geospatial queries.

This data is then processed and compared to patterns found by our models to determine if a weather station is experiencing Santa Ana conditions.

Depending on the source and structure of data, there are alternative ways to access it.

4.a.2> Prepare data

4.a.2.a> Explore data

In summary, what you get by exploring your data is a better understanding of the complexity of the data you have to work with.

 

4.a.2.b> Pre-process data

 

4.a.3> Analyze data

The stock price is a numeric value, not a category. So this is a regression task instead of a classification task.

In clustering, the goal is to organize similar items into groups. An example is grouping a company’s customer base into distinct segments for more effective targeted marketing like seniors, adults and teenagers, as we see here.

The goal in association analysis is to come up with a set of rules to capture associations between items or events. The rules are used to determine when items or events occur together. A common application of association analysis is known as market basket analysis, which is used to understand customer purchasing behavior. For example, association analysis can reveal that banking customers who have certificate of deposit (CD) accounts also tend to be interested in other investment vehicles, such as money market accounts. This information can be used for cross-selling. According to data mining folklore, a supermarket chain used association analysis to discover a connection between two seemingly unrelated products: many customers, likely fathers, who go to the supermarket late on Sunday night to buy diapers also tend to buy beer.

When your data can be transformed into a graph representation with nodes and links, you want to use graph analytics to analyze it. This kind of data comes about when you have a lot of entities and connections between those entities, like social networks. An example where graph analytics can be useful is exploring the spread of a disease or epidemic by analyzing hospitals’ and doctors’ records.

Modeling starts with selecting one of the techniques we listed as the appropriate analysis technique, depending on the type of problem you have. Then you construct the model using the data you have prepared. To validate the model, you apply it to new data samples. This is to evaluate how well the model does on data that was not used to construct it. The common practice is to divide the prepared data into a set for constructing the model and to reserve some of the data for evaluating the model after it has been constructed. You can also use new data prepared the same way as the data that was used to construct the model. Evaluating the model depends on the type of analysis technique you used. Let’s briefly look at how to evaluate each technique.

For classification and regression, you will have the correct output for each sample in your input data. Comparing the correct output with the output predicted by the model provides a way to evaluate the model. For clustering, the groups resulting from clustering should be examined to see whether they make sense for your application. For example, do the customer segments reflect your customer base? Are they helpful for use in your targeted marketing campaigns? For association analysis and graph analysis, some investigation will be needed to see if the results are correct. For example, network traffic delays need to be investigated to see whether what your model predicts is actually happening, and whether the sources of the delays are where they are predicted to be in the real system.

As there are different types of problems, there are also different types of analysis techniques.

This step can take a couple of iterations on its own or might require data scientists to go back to steps one and two to get more data or package data in a different way

4.a.4> Visualize data: reporting insights

Timeline is a JavaScript library that allows you to create timelines.

In summary, you want to report your findings by presenting your results and value add with graphs using visualization tools.

 4.a.5> Act, apply the result: turning insights into action

Reporting insights from analysis and determining actions from insights based on the purpose you initially defined is what we refer to as the act step.

Once we define these real time actions, we need to make sure that there are automated systems, or processes to perform such actions, and provide failure recovery in case of problems.

As a summary, big data and data science are only useful if the insights can be turned into action, and if the actions are carefully defined and evaluated.

5> Introduction to Hadoop

Although it would be possible to find counterexamples, we can generally say that the Hadoop framework is not the best for working with small data sets, advanced algorithms that require a specific hardware type, task level parallelism, infrastructure replacement, or random data access.

5.a> Hadoop Ecosystem

[Figure: Hadoop ecosystem]

As a summary, the Hadoop ecosystem consists of a growing number of open-source tools, providing opportunities to pick the right tool for the right task for better performance and lower cost. We will review some of these tools in further detail and provide an analysis of when to use which in the next set of lectures.

5.b> Hadoop HDFS

[Figure: HDFS NameNode architecture]

 

HDFS protects against hardware failures and provides data locality when we move analytical computations to the data.

5.c> YARN: A Resource Manager for Hadoop

[Figure: Hadoop YARN architecture]

It’s a scalable platform that has enabled the growth of several applications over HDFS, enriching the Hadoop ecosystem.

6> MapReduce

Tasks suitable for MapReduce include search engine page ranking and topic mapping. Please see the reading after this lecture for another fun application of the MapReduce programming model.

 

 

Source: https://www.coursera.org/learn/big-data-introduction

Big Data And The Technology

Hadoop Fundamentals

What is Hadoop?

  • Hadoop is an open source project of the Apache Foundation
  • It is a framework written in Java
  • It uses Google MapReduce and Google File System technologies as its foundation
  • It is optimized to handle massive quantities of data in a variety of formats using inexpensive commodity hardware
  • It replicates the data into multiple computers for reliability
  • It is for Big Data (not for OLTP nor OLAP/DSS); it is not a good fit for work that cannot be parallelized, data with dependencies, low-latency data access, processing lots of small files, intensive calculations with little data, or transaction processing (which requires random access)

Hadoop-related open source projects

  • Eclipse
  • Lucene
  • HBase
  • Hive
  • Pig
  • ZooKeeper
  • Avro
  • UIMA

 Hadoop Architecture

  • A Hadoop Node is a computer
  • Nodes connected to the same network switch form a Rack
  • The bandwidth between two nodes in the same rack is greater than the bandwidth between two nodes on different racks
  • A Hadoop cluster is a collection of Racks

[Figure: Hadoop rack and cluster layout]

 

Main Hadoop components

  1. Distributed File System
    • Hadoop Distributed File System (HDFS)
    • IBM GPFS-FPO
  2. MapReduce Engine component
    • Framework for performing calculations on the data in the distributed file system
    • Has a built-in resource manager and scheduler

Data Access Patterns for HDFS

HDFS runs on top of the existing file system on each node in a Hadoop cluster.

 

What is HDFS?

 

 

 

Accessing Hadoop Data Using Hive

Writing map/reduce programs to analyze your Big Data can get complex. Hive can help make querying your data much easier.

Fundamentals of Hive

Hive System Architecture

Hadoop was the solution for large data storage, but using Hadoop was not an easy task for end users, especially for those who were not familiar with the MapReduce concept. Hive is an easy way to work with data stored in HDFS.

Basically, Hive is SQL for a Hadoop cluster. It is an open-source data warehouse system on top of HDFS that adds structure to the data.

HQL also has features for working with unstructured data in HDFS

The above figure shows the connection of Hive to Hadoop (HDFS + Map Reduce) and the internal structure of Hive.

The main components of Hive are:

  • Metastore: It stores all the metadata of Hive: metadata about databases, tables, columns, and so on.
  • Driver: It includes the compiler, optimizer and executor used to process Hive query language statements.
  • Query compiler: It compiles HiveQL into a graph of MapReduce tasks.
  • Execution engine: It executes the tasks produced by the compiler.
  • Thrift server: It provides an interface through which other applications (for example, Excel or other JDBC/ODBC clients) can connect to Hive via JDBC/ODBC drivers.
  • Command line interface: It is also called the Hive shell. It is used for working with data either interactively or in batch processing.
  • Web interface: It is a visual interface to Hive used for interaction with data.

Data Storage in Hive

Hive has different forms of storage options and they include:

  1. Metastore: The metastore keeps track of all the metadata of databases, tables, columns, data types, etc. in Hive. It also keeps track of HDFS mappings.
  2. Tables: There can be two types of tables in Hive. First, normal (managed) tables, like any other table in a database. Second, external tables, which are like normal tables except for deletion: external tables are created over HDFS mappings and act as pointers to data in HDFS. When an external table is deleted, its data is not deleted and remains in HDFS, whereas deleting a normal table also deletes its data.
  3. Partitions: A partition is a slice of a table stored as a subdirectory within the table’s directory. It improves query performance, especially for SELECT statements with a WHERE clause.
  4. Buckets: Buckets are hashed partitions, and they speed up joins and sampling of data (see the example after this list).
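A minimal HiveQL sketch of a partitioned, bucketed external table (the table, columns, and location are made up for illustration):

CREATE EXTERNAL TABLE sales (
  order_id INT,
  amount DOUBLE
)
PARTITIONED BY (sale_date STRING)
CLUSTERED BY (order_id) INTO 8 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hive/warehouse/sales';

Dropping this table would remove only its metadata; because the table is external, the files under /user/hive/warehouse/sales would remain in HDFS.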

Hive vs. RDBMS (Relational database)

Hive and RDBMSs are very similar, but they have different applications and are based on different schema approaches.

  • Hive is built for OLAP, that is, analytical reporting over data. Hive does not support inserting into an existing table or updating table data the way an RDBMS does, which is an important part of the OLTP process; all data is either inserted into a new table or overwritten in an existing table.
  • Hive is based on schema-on-read, which means data is not checked when it is loaded, so data loading is fast but reading is slower.

Hive Query Language (HQL)

Load is used for taking data from HDFS and moving it into Hive. Insert is used for moving data from one Hive table to another. Select is used for querying data. Explain shows how Hive will execute a query.
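A short hedged sketch of these statements (the table and path names are hypothetical):

LOAD DATA INPATH '/data/raw/logs.csv' INTO TABLE web_logs;

INSERT OVERWRITE TABLE log_summary
SELECT status, COUNT(*) FROM web_logs GROUP BY status;

SELECT * FROM log_summary LIMIT 10;

EXPLAIN SELECT status, COUNT(*) FROM web_logs GROUP BY status;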

Hive Installation

Prerequisites for installing Hive:

  • Java 1.7
  • Hadoop 2.x

Steps to install:

  1. Download a stable version of Hive from http://hive.apache.org/
  2. Go to downloads and select the latest mirror. Download the latest tarball, e.g. apache-hive-1.2.1-bin.tar.gz
  3. Unzip the tarball using the following command: tar -xzvf apache-hive-1.2.1-bin.tar.gz
  4. Set the environment variable HIVE_HOME to point to the installation directory:
    export HIVE_HOME=/usr/local/hive

Finally, add $HIVE_HOME/bin to PATH:

export PATH=$PATH:$HIVE_HOME/bin

To start the Hive shell, just type hive after setting the path and the Hive shell will fire up. To verify that Hive has started, use the command:

set -v;

All the Hive properties will show up; look for an entry such as mapred.job.tracker=hname:1002 to verify that Hive has found the Hadoop cluster. Hive is now installed successfully, and databases can be created, followed by tables and queries.