The Data Science Specialization – 4. Exploratory Data Analysis
Course Content
This course covers the essential exploratory techniques for summarizing data. These techniques are typically applied before formal modeling commences and can help inform the development of more complex statistical models. Exploratory techniques are also important for eliminating or sharpening potential hypotheses about the world that can be addressed by the data. We will cover in detail the plotting systems in R as well as some of the basic principles of constructing data graphics. We will also cover some of the common multivariate statistical techniques used to visualize high-dimensional data.
Week 1/Unit 1
 Making exploratory graphs
 Principles of analytic graphics
 Plotting systems and graphics devices in R
 The base, lattice, and ggplot2 plotting systems in R
 Clustering methods
 Hierarchical Clustering
 K-Means Clustering
 Dimension reduction techniques
 Working with Color in R plots
Week 1/Unit 1
Before doing modelling, prediction, or any sort of inference, we do EXPLORATORY DATA ANALYSIS: an iterative, investigative, visual process in which we look at the processed tidy data set to see what is going on, in terms of:
 What is happening (from this data)
 What kinds of plots are being used for this data
 Principles of Analytic Graphics: Basic principles for building analytic graphics
 These are general rules that one can follow when building analytic graphics from data
 Principle 1: Show comparisons
 Evidence for a hypothesis is always relative to another competing hypothesis
 Always ask “Compared to What?”
 Principle 2: Show causality, mechanism, explanation, systematic structure
 What is your causal framework for thinking about a question? (What explains the outcome? E.g. air cleaners clean and freshen the air, so they improve air quality, which in turn reduces symptoms…)
 Principle 3: Show multivariate data: show AS MUCH data AS possible on a single plot
 Multivariate = more than 2 variables
 The real world is multivariate
 Need to “escape flatland”
 Principle 4: Integration of evidence (use as many different modes of evidence/displaying evidence as possible)
 Completely integrate words, numbers, images, diagrams
 Data graphics should make use of many modes of data presentation (table, plot, …)
 Don’t let the tool drive the analysis
 Principle 5: Describe and document the evidence with appropriate labels, scales, sources (where the data is from), how the plot was made, etc.
 A data graphic should tell a complete story that is credible/reliable
 Principle 6: Content is king
 Analytical presentations ultimately stand or fall depending on the quality, relevance, and integrity of their content (if you don’t have a story to tell, no amount of presentation will help; when making plots, figures, and graphs, first think about what content/data/story you are trying to present, then about the best way to present it)
 Constructing Exploratory graphs
 Why do we use graphs in data analysis?
 To understand data properties
 To find patterns in data
 To suggest modeling strategies
 To “debug” analyses
 To communicate results
 Characteristics of exploratory graphs
 They are made quickly
 A large number are made
 The goal is for personal understanding
 Axes/legends are generally cleaned up (later)
 Color/size are primarily used for information
 Summaries of Data for selecting kinds of plots
 2 dimensions
 Multiple/overlaid 1D plots (Lattice/ggplot2)
 Scatterplots
 Smooth scatterplots
 More than 2 dimensions
 Overlaid/multiple 2D plots; coplots
 Use color, size, shape to add dimensions
 Spinning plots
 Actual 3D plots (not that useful)
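As a rough illustration of using color and size to add dimensions to a 2D scatterplot, here is a minimal base-R sketch. The choice of the built-in mtcars data set and of which variables map to color and size is an assumption made for this example, not something taken from the course material.

```r
# Draw on a null PDF device so the sketch runs without a screen
# (an assumption for portability; omit these lines in an interactive session)
pdf(NULL)

cyl_f <- factor(mtcars$cyl)   # 3rd variable: number of cylinders (4, 6, 8)

with(mtcars, plot(wt, mpg,
                  col = as.integer(cyl_f),  # color encodes cylinders
                  cex = hp / 100,           # point size encodes horsepower (4th variable)
                  pch = 19,
                  xlab = "Weight (1000 lbs)", ylab = "Miles per gallon"))
legend("topright", legend = levels(cyl_f), col = seq_along(levels(cyl_f)),
       pch = 19, title = "Cylinders")

invisible(dev.off())
```

Four variables end up on one flat plot: two as coordinates, one as color, one as size.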
 Plotting Systems in R
 R has 3 CORE PLOTTING SYSTEMS; we generally can NOT mix functions between systems within one plot
 Annotation of plots in any plotting system involves adding points, lines, or text to the plot, in addition to customizing axis labels or adding titles. Different plotting systems have different sets of functions for annotating plots in this way
 The first Plotting system is the Base Plotting System
 Artist’s palette model
 Start with blank canvas and build up from there
 Start with plot function (or similar)
 Use annotation functions to add/modify (text, lines, points, axis)
 Convenient, mirrors how we think of building plots and analyzing data
 Can’t go back once plot has started (i.e. to adjust margins); need to plan in advance
 Difficult to translate to others once a new plot has been created (no graphical language)
 Plot is just a series of R commands
 Example of Base Plot
 R> library(datasets)
 data(cars)
 with(cars, plot(speed, dist))
 The second Plotting system is the Lattice System
 Plots are created with a single function call (xyplot, bwplot, etc.): the whole plot is created at once by 1 function
 Most useful for conditioning types of plots: Looking at how y changes with x across levels of z
 Things like margins/spacing set automatically because entire plot is specified at once
 Good for putting many many plots on a screen
 Sometimes awkward to specify an entire plot in a single function call
 Annotation in plot is not especially intuitive
 Use of panel functions and subscripts difficult to wield and requires intense preparation
 Cannot “add” to the plot once it is created
 Example
 library(lattice)
 state <- data.frame(state.x77, region = state.region)
 xyplot(Life.Exp ~ Income | region, data = state, layout = c(4, 1))
 The third Plotting system is the ggplot2 System (the “grammar of graphics” plotting system)
 Splits the difference between base and lattice in a number of ways — mixed ideas from both systems
 Automatically deals with spacings, text, titles but also allows you to annotate by “adding” to a plot
 Superficial similarity to lattice but generally easier/more intuitive to use
 Default mode makes many choices for you (but you can still customize to your heart’s desire)
 Example:
 library(ggplot2)
 data(mpg)
 qplot(displ, hwy, data = mpg)
 Multiple Boxplots
 R> boxplot(pm25 ~ region, data = pollution)
 Multiple Histograms
 par(mfrow = c(2, 1), mar = c(4, 4, 2, 1))
 hist(subset(pollution, region == "east")$pm25, col = "green")
 hist(subset(pollution, region == "west")$pm25, col = "green")
 Scatterplot (with color: col = region, West/East)
 R> with(pollution, plot(latitude, pm25, col = region))
 abline(h = 12, lwd = 2, lty = 2)
 Multiple Scatterplots
 par(mfrow = c(1, 2), mar = c(5, 4, 2, 1))
 with(subset(pollution, region == "west"), plot(latitude, pm25, main = "West"))
 with(subset(pollution, region == "east"), plot(latitude, pm25, main = "East"))
 Adding a legend to a plot
 plot(c(1968, 2010), c(0, 10), type = "n",          # sets the x and y axes scales
      xlab = "Year", ylab = "Expenditures/GDP (%)")  # adds titles to the axes
 lines(year, defense, col = "red", lwd = 2.5)        # adds a line for defense expenditures
 lines(year, health, col = "blue", lwd = 2.5)        # adds a line for health expenditures
 legend(2000, 9.5,                                   # places a legend at the appropriate place
        c("Health", "Defense"),                      # puts text in the legend
        lty = c(1, 1),                               # gives the legend appropriate symbols (lines)
        lwd = c(2.5, 2.5), col = c("blue", "red"))   # gives the legend lines the correct color and width
 The Process of Making a Plot: When making a plot, one must first consider a few things (their order is not important)
 Where will the plot be made? On the screen? In a file?
 How will the plot be used?
 Is the plot for viewing temporarily on the screen?
 Will it be presented in a web browser?
 Will it eventually end up in a paper that might be printed?
 Are you using it in a presentation?
 Is there a large amount of data going into the plot? Or is it just a few points?
 Do you need to be able to dynamically resize the graphic?
 What graphics system will you use: base, lattice, or ggplot2? These generally cannot be mixed
 Base graphics are usually constructed piecemeal, with each aspect of the plot handled separately through a series of function calls; this is often conceptually simpler and allows plotting to mirror the thought process.
 Lattice graphics are usually created in a single function call, so all of the graphics parameters have to be specified at once; specifying everything at once allows R to automatically calculate the necessary spacings and font sizes.
 ggplot2 combines concepts from both base and lattice graphics but uses an independent implementation
 The Base Plotting System in R
 The core plotting system and the graphics engine/the base graphic system in R are encapsulated in the following packages:
 graphics: contains plotting functions for the “base” graphing systems, including plot, hist, boxplot and many others
 grDevices: contains all the code implementing the various graphics devices, including X11, PDF, PostScript, PNG, etc.
 We focus on using the base plotting system to create graphics on the screen device
 Base graphics: Base graphics are used most commonly and are a very powerful system for creating 2D graphics
 There are 2 phases to creating a base plot
 Initializing a new plot
 Annotating (adding to) an existing plot
 Calling plot(x,y) or hist(x) will launch a graphics device (if one is not already open) and draw a new plot on the device
 If the arguments to plot are not of some special class, then the default method for plot is called; this function has many arguments, letting you set the title, x-axis label, y-axis label, etc.
 The base graphics system has many parameters that can be set and tweaked; these parameters are documented in ?par
 Some Important Base Graphics Parameters
Many base plotting functions share a set of parameters. Here are a few key ones:
 pch: the plotting symbol (default is open circle)
 lty: the line type (default is solid line), can be dashed, dotted, etc.
 lwd: the line width, specified as an integer multiple
 col: the plotting color, specified as a number, string, or hex code; the colors() function gives you a vector of colors by name
 xlab: character string for the x-axis label
 ylab: character string for the y-axis label

Some Important Base Graphics Parameters for the par() function
The par() function is used to specify global graphics parameters that affect all plots in an R session. These parameters can be overridden when specified as arguments to specific plotting functions.
 las: the orientation of the axis labels on the plot
 bg: the background color
 mar: the margin size
 oma: the outer margin size (default is 0 for all sides)
 mfrow: number of plots per row, column (plots are filled row-wise)
 mfcol: number of plots per row, column (plots are filled column-wise)

Base Plotting Functions
 plot: make a scatterplot, or other type of plot depending on the class of the object being plotted
 lines: add lines to a plot, given a vector of x values and a corresponding vector of y values (or a 2-column matrix); this function just connects the dots
 points: add additional points to a plot
 text: add text labels to a plot using specified x, y coordinates
 title: add annotations to x, y axis labels, title, subtitle, outer margin
 mtext: add arbitrary text to the margins (inner or outer) of the plot
 axis: add axis ticks/labels
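Because par() changes global settings, it can help to save the old values and restore them afterwards. The following is a minimal sketch of that pattern, assuming the built-in airquality data; the variable names old_par, layout_during and layout_after are made up for this illustration.

```r
# Null PDF device so the sketch runs without a screen (assumption for
# portability; in an interactive session the screen device can be used instead)
pdf(NULL)

# par() returns the previous values of the parameters it changes
old_par <- par(mfrow = c(1, 2), mar = c(4, 4, 2, 1), las = 1)

with(airquality, {
  plot(Wind, Ozone, pch = 20, col = "steelblue",
       xlab = "Wind (mph)", ylab = "Ozone (ppb)")  # per-plot arguments
  title(main = "Ozone vs. Wind")                   # annotation added afterwards
  hist(Ozone, col = "grey", main = "Ozone")        # second panel of the 1 x 2 layout
})

layout_during <- par("mfrow")  # 1 row, 2 columns while the global setting is active
par(old_par)                   # restore the previous global settings
layout_after <- par("mfrow")   # back to a single plot per device

invisible(dev.off())
```

Arguments such as pch and col passed to plot() affect only that call, whereas the values given to par() stay in force until they are changed or restored.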
 Base Plot with Annotation
 library(datasets)
with(airquality, plot(Wind, Ozone))
title(main = "Ozone and Wind in New York City")  ## Add a title

with(airquality, plot(Wind, Ozone, main = "Ozone and Wind in New York City"))
with(subset(airquality, Month == 5), points(Wind, Ozone, col = "blue"))

with(airquality, plot(Wind, Ozone, main = "Ozone and Wind in New York City", type = "n"))  # type = "n": set up the plot but do not draw any points
with(subset(airquality, Month == 5), points(Wind, Ozone, col = "blue"))
with(subset(airquality, Month != 5), points(Wind, Ozone, col = "red"))
legend("topright", pch = 1, col = c("blue", "red"), legend = c("May", "Other Months"))
 Base Plot with Regression Line

with(airquality, plot(Wind, Ozone, main = "Ozone and Wind in New York City", pch = 20))
model <- lm(Ozone ~ Wind, airquality)  # fit the regression that abline() draws
abline(model, lwd = 2)

 Multiple Base Plots

par(mfrow = c(1, 2))
with(airquality, {
  plot(Wind, Ozone, main = "Ozone and Wind")
  plot(Solar.R, Ozone, main = "Ozone and Solar Radiation")
})

par(mfrow = c(1, 3), mar = c(4, 4, 2, 1), oma = c(0, 0, 2, 0))
with(airquality, {
  plot(Wind, Ozone, main = "Ozone and Wind")
  plot(Solar.R, Ozone, main = "Ozone and Solar Radiation")
  plot(Temp, Ozone, main = "Ozone and Temperature")
  mtext("Ozone and Weather in New York City", outer = TRUE)
})

 Graphics Device
 What is a Graphics Device? A graphics device is something where you can make a plot appear

A window on your computer (screen device)

A PDF file (file device)

A PNG or JPEG file (file device)

A scalable vector graphics (SVG) file (file device)


When you make a plot in R, it has to be “sent” to a specific graphics device

The most common place for a plot to be “sent” is the screen device

On a Mac the screen device is launched with quartz()

On Windows the screen device is launched with windows()

On Unix/Linux the screen device is launched with x11()

 How Does a Plot Get Created? There are two basic approaches to plotting.
 The first is most common:
 1 Call a plotting function like plot, xyplot, or qplot
 2 The plot appears on the screen device
 3 Annotate plot if necessary
 4 Enjoy
 library(datasets)
with(faithful, plot(eruptions, waiting))  ## Make plot appear on screen device
title(main = "Old Faithful Geyser data")  ## Annotate with a title
 The second approach to plotting is most commonly used for file devices:
 1 Explicitly launch a graphics device
 2 Call a plotting function to make a plot (Note: if you are using a file device, no plot will appear on the screen)
 3 Annotate plot if necessary
 4 Explicitly close graphics device with dev.off() (this is very important!)
 pdf(file = "myplot.pdf")  ## Open PDF device; create 'myplot.pdf' in my working directory
 ## Create plot and send to a file (no plot appears on screen)
 with(faithful, plot(eruptions, waiting))
 title(main = "Old Faithful Geyser data")  ## Annotate plot; still nothing on screen
 dev.off()  ## Close the PDF file device
 ## Now you can view the file 'myplot.pdf' on your computer
 Graphics File Devices
 There are two basic types of file devices: vector and bitmap devices
 Vector formats:
 pdf: useful for line-type graphics, resizes well, usually portable, not efficient if a plot has many objects/points
 svg: XML-based scalable vector graphics; supports animation and interactivity, potentially useful for web-based plots
 win.metafile: Windows metafile format (only on Windows)
 postscript: older format, also resizes well, usually portable, can be used to create encapsulated postscript files; Windows systems often don’t have a postscript viewer
 Bitmap formats:
 png: bitmapped format, good for line drawings or images with solid colors, uses lossless compression (like the old GIF format), most web browsers can read this format natively, good for plotting many many many points, does not resize well
 jpeg: good for photographs or natural scenes, uses lossy compression, good for plotting many many many points, does not resize well, can be read by almost any computer and any web browser, not great for line drawings
 tiff: creates bitmap files in the TIFF format; supports lossless compression
 bmp: a native Windows bitmapped format
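 Working with several open devices: dev.cur() returns the currently active device, dev.set(<integer>) switches the active device (device numbers are integers >= 2), and dev.copy() copies a plot from one device to another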
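The device functions mentioned above can be tried out with two file devices. This is only a sketch under stated assumptions: the temporary file names are hypothetical, and dev.control(displaylist = "enable") is called because file devices do not record their drawing by default, so without it there would be nothing for dev.copy() to copy.

```r
f1 <- tempfile(fileext = ".pdf")  # hypothetical output files in a temp directory
f2 <- tempfile(fileext = ".pdf")

pdf(f1)
dev.control(displaylist = "enable")  # record drawing so it can be copied later
dev1 <- dev.cur()                    # number of the first device

pdf(f2)
dev2 <- dev.cur()                    # a newly opened device becomes the active one

dev.set(dev1)                        # switch back to the first device
with(faithful, plot(eruptions, waiting))
title(main = "Old Faithful Geyser data")

dev.copy(which = dev2)               # replay the plot on the second device
active_after_copy <- dev.cur()       # the device copied to is now active

invisible(dev.off(dev2))             # close both devices so the files are written
invisible(dev.off(dev1))
```

Copying a plot is not an exact operation, so the copy may not look identical to the original; for final output it is usually safer to draw directly on the target device.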
 The Lattice plotting system
 Used for plotting high-dimensional data and/or many plots at once
 It is implemented using the following packages:
 lattice: contains code for producing Trellis graphics, which are independent of the “base” graphics system; includes functions like xyplot, bwplot, levelplot
 grid: implements a different graphing system independent of the “base” system; the lattice package builds on top of grid;
 we seldom call functions from the grid package directly.
 The lattice plotting system does not have a “two-phase” aspect with separate plotting and annotation like in base plotting
 All plotting/annotation is done at once with a single function call
 Lattice Functions
 xyplot: this is the main function for creating scatterplots
 bwplot: box-and-whiskers plots (“boxplots”)
 histogram: histograms
 stripplot: like a boxplot but with actual points
 dotplot: plot dots on “violin strings”
 splom: scatterplot matrix; like pairs in base plotting system
 levelplot, contourplot: for plotting “image” data
 Lattice functions generally take a formula for their first argument, usually of the form
 xyplot(y ~ x | f * g, data)
 We use the formula notation here, hence the ~
 On the left of the ~ is the y-axis variable, on the right is the x-axis variable
 f and g are conditioning variables – they are optional
 the * indicates an interaction between the two variables
 The second argument is the data frame or list from which the variables in the formula should be looked up
 If no data frame or list is passed, then the parent frame is used
 If no other arguments are passed, there are defaults that can be used
 Example:
 library(lattice); library(datasets)
 xyplot(Ozone ~ Wind, data = airquality)
 ## Convert 'Month' to a factor variable
 airquality <- transform(airquality, Month = factor(Month))
 xyplot(Ozone ~ Wind | Month, data = airquality, layout = c(5, 1))  # separate by Month: 5 columns
 Lattice Behavior: lattice functions behave differently from base graphics functions in one critical way
 Base graphics functions plot data directly to the graphics device (screen, PDF file, etc.)
 xyplot does 2 steps: return an object (invisible to user) then auto print (visible to user)
 Lattice graphics functions return an object of class trellis (do not plot anything, just return object)
 The print methods for lattice functions actually do the work of plotting the data on the graphics device
 p <- xyplot(Ozone ~ Wind, data = airquality)  ## Nothing happens!
 print(p)  ## Plot appears
 xyplot(Ozone ~ Wind, data = airquality)  ## Auto-printing
 Lattice functions return “plot objects” that can, in principle, be stored (but it’s usually better to just save the code + data)
 On the command line, trellis objects are autoprinted so that it appears the function is plotting the data
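The "return an object, then print it" behavior can be checked directly. A minimal sketch, assuming the built-in airquality data; the use of update() to modify the stored object before printing is an addition for illustration and is not part of the notes above.

```r
library(lattice)

p <- xyplot(Ozone ~ Wind, data = airquality)  # returns an object; nothing is drawn
class(p)                                      # the object is of class "trellis"

# Because the plot is an object, it can be modified before it is drawn
p2 <- update(p, main = "Ozone and Wind in New York City")

pdf(NULL)            # null device so the sketch runs without a screen (assumption)
print(p2)            # the print method does the actual drawing
invisible(dev.off())
```

Auto-printing only happens at the top level of the command line; inside a function, a loop, or a sourced script, print() has to be called explicitly for the plot to appear.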
 Lattice Panel Functions
 Lattice functions have a panel function which controls what happens inside each panel of the plot
 The lattice package comes with default panel functions, but you can supply your own if you want to customize what happens in each panel
 Panel functions receive the x/y coordinates of the data points in their panel (along with any optional arguments)
 Lattice Panel Functions
 set.seed(10) x
 Custom panel function
 xyplot(y ~ x | f, panel = function(x, y, ...) {
   panel.xyplot(x, y, ...)               # First call the default panel function for 'xyplot'
   panel.abline(h = median(y), lty = 2)  # Add a horizontal line at the median
 })
 Custom panel function with regression line
 xyplot(y ~ x | f, panel = function(x, y, ...) {
   panel.xyplot(x, y, ...)        # First call default panel function
   panel.lmline(x, y, col = 2)    # Overlay a simple linear regression line
 })
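For a version of the panel-function idea that runs on its own, here is a sketch with simulated data. The data-generating code is an assumption made for this example (two hypothetical groups with different relationships between x and y) and is not the data used in the course.

```r
library(lattice)

# Hypothetical data: in Group 1, y follows x; in Group 2, y is flat around 1
set.seed(10)
x <- rnorm(100)
f <- factor(rep(c("Group 1", "Group 2"), each = 50))
y <- ifelse(f == "Group 1", x, 1) + rnorm(100, sd = 0.5)

p <- xyplot(y ~ x | f, layout = c(2, 1),
            panel = function(x, y, ...) {
              panel.xyplot(x, y, ...)               # default scatterplot for this panel
              panel.abline(h = median(y), lty = 2)  # dashed line at the panel's median
              panel.lmline(x, y, col = 2)           # regression line fitted to the panel's data
            })

pdf(NULL)            # null device so the sketch runs without a screen (assumption)
print(p)
invisible(dev.off())
```

Inside the panel function, x and y contain only the points belonging to that panel, so the median and the regression line are computed separately for each group.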
 Summary
 Lattice plots are constructed with a single function call to a core lattice function (e.g. xyplot)
 Aspects like margins and spacing are automatically handled and defaults are usually sufficient
 The lattice system is ideal for creating conditioning plots where you examine the same kind of plot under many different conditions
 Panel functions can be specified/customized to modify what is plotted in each of the plot panels
 The ggplot2 plotting system
 It implements what is called the grammar of graphics
 The Grammar of Graphics is a description of how graphics can be broken down into abstract concepts (an abstraction of graphics ideas/objects)
 Grammar = verb + noun + adjective, which are the basic elements of ggplot2 graphics
 What are the verbs, nouns, and adjectives of a data graphic? They are the basic elements of a graphic, which you can put together to make new types of graphics. The basic elements themselves cannot be modified.
 Allows for a “theory” of graphics on which to build new graphics and graphics objects
 “Shorten the distance from mind to page”
 Grammar of Graphics: “In brief, the grammar tells us that a statistical graphic is a mapping from data to aesthetic attributes (color, shape, size) of geometric objects (points, lines, bars). The plot may also contain statistical transformations of the data and is drawn on a specific coordinate system“
 The basic function: qplot() – quick plot
 Works much like the plot function in the base graphics system
 Looks for data in a data frame, similar to lattice, or in the parent environment (if you do not specify the data frame, the plotting functions will look for the data in your workspace)
 Plots are made up of aesthetics (size, shape, color) and geoms (points, lines)
 Factors are important for indicating subsets of the data (if they are to have different properties); they should be labeled with informative labels
 The qplot() hides what goes on underneath, which is okay for most operations
 ggplot() is the core function and very flexible for doing things qplot() cannot do
 Example of qplot()
 library(ggplot2)
 str(mpg)  # example dataset
 qplot(displ, hwy, data = mpg)  # qplot(x coord, y coord, data frame)
 Modifying aesthetics: the color aesthetic is mapped to a data variable of the data frame
 qplot(displ, hwy, data = mpg, color = drv)  # a legend is shown automatically
 Adding a geom: a smoother is a kind of statistic/summary of the data
 qplot(displ, hwy, data = mpg, geom = c("point", "smooth"))  # add a smoother to the point plot
 Create a histogram: only provide the x coordinate
 qplot(hwy, data = mpg, fill = drv)  # fill color given by the data variable "drv"
 Facets in the ggplot2 system are like panels in the lattice system: they create separate plots for subsets of your data, indicated by a factor variable (data are divided by the values of the factor variable)
 qplot(displ, hwy, data = mpg, facets = . ~ drv)  # separate plots by columns, variable on the right-hand side
 qplot(hwy, data = mpg, facets = drv ~ ., binwidth = 2)  # separate histograms by rows, variable on the left-hand side
 Summary of qplot() function
 The qplot() function is the analog to plot() but with many built-in features
 Syntax somewhere in between base and lattice systems
 Produces very nice graphics, essentially publication ready (if you like the design)
 Difficult to go against the grain/customize (don’t bother; use full ggplot2 power in that case)
 The basic components of a ggplot2 Plot (review)
 A data frame
 aesthetic mappings: how data are mapped to color, size
 geoms: geometric objects like points, lines, shapes
 facets: for conditional plots (divide subset data into each plot in columns/rows; multiple panel plots)
 stats: statistical transformations like binning, quantiles, smoothing
 scales: what scale an aesthetic map uses (example: male=red, female=blue)
 coordinate system
 ggplot() function – initializes a ggplot object
 When building plots in ggplot2 system with ggplot() function, the “artist’s palette” model may be the closest analogy
 Plots are built up in layers – pieces can be added one by one after plotting the data
 Plot the data
 Overlay a summary
 Metadata and annotation
 Example of ggplot() function
 str(maacs)
 head(maacs)
 g <- ggplot(maacs, aes(logpm25, NocturnalSympt))  # initial call to ggplot
 summary(g)  # summary of ggplot object
 data: logpm25, bmicat, NocturnalSympt [554×3]
 mapping: x = logpm25, y = NocturnalSympt
 faceting: facet_null()
 No Plot Yet
 g <- ggplot(maacs, aes(logpm25, NocturnalSympt))
 print(g)  # Error: no layers in plot; it doesn’t know how to draw the data yet (as points, lines, tiles, …). This is the behavior of older ggplot2 versions; recent versions draw an empty panel instead
 p <- g + geom_point()  # add a point layer and save the plot object
 print(p)  # explicitly print the saved plot object
 Plot with Point layer
 g <- ggplot(maacs, aes(logpm25, NocturnalSympt))
 g + geom_point()  # auto-print the plot object without saving
 Adding more layers: Smoother
 g + geom_point() + geom_smooth()  # default smooth
 g + geom_point() + geom_smooth(method = "lm")  # smooth with regression line
 Adding more layers: Facets; the labels of each panel come from the levels of the factor variable that you condition on
 g + geom_point() + geom_smooth(method = "lm") + facet_grid(. ~ bmicat)
 Modifying Aesthetics
 g + geom_point(color = "steelblue", size = 4, alpha = 1/2)  # "steelblue" is a constant value
 g + geom_point(aes(color = bmicat), size = 4, alpha = 1/2)  # bmicat is a data variable
 Modifying labels: using labs() function
 g + geom_point(aes(color = bmicat)) + labs(title = "MAACS cohort") + labs(x = expression("log " * PM[2.5]), y = "Nocturnal Symptoms")
 Customizing the Smoother
 g + geom_point(aes(color = bmicat), size = 2, alpha = 1/2) + geom_smooth(size = 4, linetype = 3, method = "lm", se = FALSE)
 Changing the Theme
 g + geom_point(aes(color = bmicat)) + theme_bw(base_family = "Times")  # change font type
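The maacs data set is not publicly bundled with R, so the layering steps cannot be run as written. As a stand-in, the sketch below applies the same sequence of layers to the mpg data that ships with ggplot2; the variable mappings and labels are assumptions chosen for this illustration.

```r
library(ggplot2)

# Initial call: data and aesthetic mappings only, no layers yet
g <- ggplot(mpg, aes(displ, hwy))

# Build the plot up layer by layer
p <- g +
  geom_point(aes(color = drv), alpha = 1/2) +  # color mapped to a data variable
  geom_smooth(method = "lm", se = FALSE) +     # overlay a summary: regression line
  facet_grid(. ~ drv) +                        # one panel per level of drv
  labs(title = "Highway mileage and engine displacement",
       x = "Displacement (litres)", y = "Highway miles per gallon") +
  theme_bw()

pdf(NULL)            # null device so the sketch runs without a screen (assumption)
print(p)             # drawing happens when the object is printed
invisible(dev.off())
```

Each `+` adds one piece to the same plot object, so layers can be added or removed independently while the data and mappings in g stay untouched.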
 More complex example:
 Convert a continuous variable to a categorical one with the cut() function, which cuts the data into a reasonable series of ranges
 ## Calculate the cutpoints (quantiles) of the data
 cutpoints <- ...
 maacs$no2dec <- ...
 levels(maacs$no2dec)  # returns 3 different levels
 Create a plot conditioned on 2 factor variables
 ## Setup ggplot with data frame
 g <- ggplot(maacs, aes(logpm25, NocturnalSympt))
 ## Add layers
 g + geom_point(alpha = 1/3) +
   facet_wrap(bmicat ~ no2dec, nrow = 2, ncol = 4) +
   geom_smooth(method = "lm", se = FALSE, col = "steelblue") +
   theme_bw(base_family = "Avenir", base_size = 10) +
   labs(x = expression("log " * PM[2.5])) +
   labs(y = "Nocturnal Symptoms") +
   labs(title = "MAACS Cohort")
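The cut() step can be illustrated on simulated data. This is a sketch, not the course's code: the vector no2 is a hypothetical stand-in for a continuous exposure variable, and splitting at the tertiles (three equal-sized groups) is an assumption chosen to match the 3 levels mentioned above.

```r
set.seed(1)
no2 <- rexp(300)   # hypothetical continuous variable, for illustration only

# Cutpoints at 0%, 33.3%, 66.7% and 100% of the data
cutpoints <- quantile(no2, seq(0, 1, length.out = 4), na.rm = TRUE)

# Cut the data at those points to create a factor with 3 levels;
# include.lowest = TRUE keeps the minimum value inside the first interval
no2_cat <- cut(no2, cutpoints, include.lowest = TRUE)

levels(no2_cat)   # interval labels of the 3 ranges
table(no2_cat)    # number of observations in each range
```

Without include.lowest = TRUE the intervals are open on the left, so the smallest observation would fall outside every range and become NA. The resulting factor can then be used as a conditioning variable in facet_wrap() or facet_grid().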
 Annotation
 Labels: xlab(), ylab(), labs(), ggtitle()
 Each of the “geom” functions has options to modify
 For things that only make sense globally, use theme() function
 Ex: theme(legend.position = "none")
 Two standard appearance themes are included
 theme_gray(): the default theme (gray background)
 theme_bw(): more stark/plain
 Summary of ggplot2
 ggplot2 is very powerful and flexible if you learn the “grammar” and the various elements that can be tuned/modified
 Many more types of plots can be made; explore and mess around with the package
 Clustering methods: data are complex, so we need to summarize them and visualize the information in a proper and convenient way; clustering methods organize datasets into regions of interest
 Clustering is the task of assigning objects to groups (clusters) so that objects in the same group are more similar to each other than to those in different groups
 Clustering organizes things that are close into groups
 How do we define close?
 How do we group things?
 How do we visualize the grouping?
 How do we interpret the grouping?
 Hierarchical clustering: organize data into a kind of hierarchy
 An agglomerative approach: a bottom-up approach; start with individual data points and keep lumping them together into clusters until eventually all the data are grouped into just 1 big cluster
 Find closest two things: little balls get grouped into bigger balls, and the bigger balls get grouped together into one big massive cluster; the merged points (“superpoints”) are not original data points but are created by merging the 2 closest data points in the data set
 Put them together: replace the 2 original points with the new merged point (superpoint)
 Find next closest
 Requires
 A defined distance: a distance metric; how to calculate the distance between 2 points?
 A merging approach: how to merge 2 closest points together
 Produces
 A tree showing how close things are to each other, called the dendrogram
 How do we define close?
 Most important step
 Garbage in -> garbage out: if a distance metric doesn’t make sense then the result will be relatively meaningless
 Example of Distance or similarity
 Example of continuous data – Euclidean distance: the straight-line distance between 2 points, e.g. between the locations of 2 cities; whether that makes sense for you depends on whether you are a bird or something else. The general formula for a multidimensional problem is the square root of the sum of the squared differences in each dimension
 Continuous data – correlation similarity
 Example of binary data – Manhattan distance: look at points on a city-block grid and imagine you are in Manhattan in New York. To move from one black circle point to another you cannot go directly, because of the city blocks; you have to follow the streets, going up or down, left or right. The green line represents the Euclidean distance, which is the path you could take if you were a bird and could fly over everything. As a person walking on the ground, however, you have to take the red, blue, or yellow line.
 Pick a distance/similarity that makes sense for your problem
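The difference between the two distances can be checked numerically with the dist() function. A minimal sketch on two hypothetical points, chosen so the arithmetic is easy to follow:

```r
# Two hypothetical points: a = (0, 0) and b = (3, 4)
pts <- rbind(a = c(0, 0), b = c(3, 4))

# Euclidean: sqrt(3^2 + 4^2) = 5, the straight-line ("bird") distance
d_euclidean <- as.numeric(dist(pts, method = "euclidean"))

# Manhattan: |3| + |4| = 7, the distance when following the street grid
d_manhattan <- as.numeric(dist(pts, method = "manhattan"))
```

With more rows in pts, dist() returns all pairwise distances at once, which is the input that hclust() expects.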
 Example of hierarchical clustering – hclust() function
 df <- data.frame(x = x, y = y)
 distxy <- dist(df)
 hClustering <- hclust(distxy)
 plot(hClustering)
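The snippet above relies on x and y vectors that are not defined in these notes. As a self-contained sketch, the following uses 12 hand-picked hypothetical points forming three tight groups, and adds cutree() to turn the tree into cluster labels; both the data and the use of cutree() are assumptions for illustration.

```r
# 12 hypothetical points in three tight groups around (1, 1), (2, 2) and (3, 1)
x <- c(1.0, 1.2, 0.9, 1.1, 2.0, 2.2, 1.9, 2.1, 3.0, 3.2, 2.9, 3.1)
y <- c(1.0, 0.9, 1.1, 1.2, 2.0, 1.9, 2.1, 2.2, 1.0, 0.9, 1.1, 1.2)
df <- data.frame(x = x, y = y)

distxy <- dist(df)              # pairwise Euclidean distances (the default metric)
hc <- hclust(distxy)            # agglomerative clustering, complete linkage by default
clusters <- cutree(hc, k = 3)   # cut the tree so that 3 clusters remain

table(clusters)                 # number of points in each cluster
```

Calling plot(hc) draws the dendrogram; the height at which two branches join is the distance at which the corresponding clusters were merged.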

 Example of hierarchical clustering – heatmap() function, which runs a hierarchical cluster analysis on the rows and on the columns of the table and reorders them so that you can visualize groups of observations within the table
 df <- data.frame(x = x, y = y)
 set.seed(143)
 dataMatrix <- as.matrix(df)[sample(1:12), ]
 heatmap(dataMatrix)
 Merging points – complete
 When you merge points together, what represents the new location? There are 2 different merging approaches; it is useful to try both to see what kinds of clustering results you get in the end and whether one set makes more sense than another.
 The average approach: the merged location is just the average of the x coordinates and of the y coordinates, so the distance between 2 clusters is the distance between their centers of gravity. This distance is somewhat shorter than the one given by complete linkage.
 The complete linkage approach: to measure the distance between 2 clusters of points, you take the distance between the 2 farthest points, one from each cluster. This distance is comparatively large.
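In R, the merging strategy is selected through the method argument of hclust(). The sketch below compares "complete" and "average" linkage on four hypothetical points on a line. Note one assumption in mapping the notes to the function: hclust's "average" method averages all pairwise distances between two clusters, which is related to, but not identical to, the centers-of-gravity description above (hclust also offers method = "centroid").

```r
# Four hypothetical points on a line: two tight pairs, {0, 1} and {5, 6}
pts <- matrix(c(0, 1, 5, 6), ncol = 1)
d <- dist(pts)

hc_complete <- hclust(d, method = "complete")  # cluster distance = farthest pair
hc_average  <- hclust(d, method = "average")   # cluster distance = mean over all pairs

# Height of the final merge, i.e. the distance between {0, 1} and {5, 6}
final_complete <- max(hc_complete$height)  # 6: the farthest pair is 0 and 6
final_average  <- max(hc_average$height)   # 5: the mean of 5, 6, 4 and 5
```

The clusters found are the same here, but the merge heights differ, which is why dendrograms from different linkage methods can suggest different places to cut.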
 The problems of Hierarchical clustering
 The picture may be unstable
 Change a few points (e.g. outliers)
 Have different missing values
 Pick a different distance
 Change the merging strategy
 Change the scale of points for one variable
 Choosing where to cut is not always obvious
 The advantage of Hierarchical clustering
 Deterministic: there is no randomness in it (the same input will give the same result)
 Should be primarily used for exploration: visualize the data and get a sense of what patterns are there; if there are any patterns, you can formalize them later in more sophisticated models
 K-Means Clustering
 A partitioning approach
 Fix a number of clusters
 Get “centroids” of each cluster
 Assign things to closest centroid
 Recalculate centroids
 Requires
 A defined distance metric
 A number of clusters
 An initial guess as to cluster centroids
 Produces
 Final estimate of cluster centroids
 An assignment of each point to clusters
 K-means clustering – example
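Since the example itself did not survive in these notes, here is a minimal sketch of the partitioning approach using the kmeans() function. The 12 data points and the initial centroid guesses are hypothetical values chosen for illustration; passing explicit starting centroids makes the result reproducible without relying on a random start.

```r
# 12 hypothetical points in three tight groups around (1, 1), (2, 2) and (3, 1)
x <- c(1.0, 1.2, 0.9, 1.1, 2.0, 2.2, 1.9, 2.1, 3.0, 3.2, 2.9, 3.1)
y <- c(1.0, 0.9, 1.1, 1.2, 2.0, 1.9, 2.1, 2.2, 1.0, 0.9, 1.1, 1.2)
df <- data.frame(x = x, y = y)

# Initial guess for the 3 cluster centroids (one row per cluster)
init <- rbind(c(1, 1), c(2, 2), c(3, 1))

# Assign points to the closest centroid, recalculate centroids, repeat
km <- kmeans(df, centers = init)

km$centers   # final estimate of the cluster centroids
km$cluster   # assignment of each point to a cluster
km$size      # number of points in each cluster
```

When centers is given as a single number instead, kmeans() picks random starting centroids, so the result can change between runs; setting a seed and using the nstart argument to try several random starts is a common way to make that more stable.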
 Dimension reduction techniques
 Working with Color in R plots