The Data Science Specialization – 4. Exploratory Data Analysis

May 7, 2015

Course Content

This course covers the essential exploratory techniques for summarizing data. These techniques are typically applied before formal modeling commences and can help inform the development of more complex statistical models. Exploratory techniques are also important for eliminating or sharpening potential hypotheses about the world that can be addressed by the data. We will cover in detail the plotting systems in R as well as some of the basic principles of constructing data graphics. We will also cover some of the common multivariate statistical techniques used to visualize high-dimensional data.

 Week 1/Unit 1

  • Making exploratory graphs
  • Principles of analytic graphics
  • Plotting systems and graphics devices in R
  • The base, lattice, and ggplot2 plotting systems in R
  • Clustering methods
    • Hierarchical Clustering
    • K-Means Clustering
  • Dimension reduction techniques
  • Working with Color in R plots

 Week 1/Unit 1

Before doing modelling, prediction, or any sort of inference, we do EXPLORATORY DATA ANALYSIS: an iterative, investigative, visual process in which we look at the processed tidy data set to see what is going on, in terms of:

  • What is happening (from this data)
  • What kinds of plots are being used for this data
  1. Principles of Analytic Graphics: Basic principles for building analytic graphics
    • These are general rules that one can follow when building analytic graphics from data
    • Principle 1: Show comparisons
      • Evidence for a hypothesis is always relative to another competing hypothesis
      • Always ask “Compared to What?”
    • Principle 2: Show causality, mechanism, explanation, systematic structure
      • What is your causal framework for thinking about a question? (What explains the outcomes? E.g. air cleaners clean and freshen the air, so they improve air quality, which in turn reduces symptoms…)
    • Principle 3: Show multivariate data: show AS MUCH data AS possible on a single plot
      • Multivariate = more than 2 variables
      • The real world is multivariate
      • Need to “escape flatland”
    • Principle 4: Integration of evidence (use as many different modes of evidence/displaying evidence as possible)
      • Completely integrate words, numbers, images, diagrams
      • Data graphics should make use of many modes (table, plot, …) of data presentation
      • Don’t let the tool drive the analysis
    • Principle 5: Describe and document the evidence with appropriate labels, scales, sources (where the data come from), how the plot was made, etc.
      • A data graphic should tell a complete story that is credible/reliable
    • Principle 6: Content is king
      • Analytical presentations ultimately stand or fall depending on the quality, relevance, and integrity of their content (if you don’t have a story to tell, no amount of presentation will help; when making plots, figures, or graphs, think about what content/data/story you are trying to present and what the best way to present it is)
  2. Constructing Exploratory graphs
    • Why do we use graphs in data analysis?
      • To understand data properties
      • To find patterns in data
      • To suggest modeling strategies
      • To “debug” analyses
      • To communicate results
    • Characteristics of exploratory graphs
      • They are made quickly
      • A large number are made
      • The goal is for personal understanding
      • Axes/legends are generally cleaned up (later)
      • Color/size are primarily used for information
    • Summaries of Data for selecting kinds of plots
    • 2 dimensions
      • Multiple/overlaid 1-D plots (Lattice/ggplot2)
      • Scatterplots
      • Smooth scatterplots
    • More than 2 dimensions
      • Overlaid/multiple 2-D plots; coplots
      • Use color, size, shape to add dimensions
      • Spinning plots
      • Actual 3-D plots (not that useful)
  3. Plotting Systems in R
    • R has 3 CORE PLOTTING SYSTEMS; in general we can NOT mix functions between systems, because the resulting plot will be confused
    • Annotation of plots in any plotting system involves adding points, lines, or text to the plot, in addition to customizing axis labels or adding titles. Different plotting systems have different sets of functions for annotating plots in this way
    • The first Plotting system is the Base Plotting System
      • Artist’s palette model
      • Start with blank canvas and build up from there
      • Start with plot function (or similar)
      • Use annotation functions to add/modify (text, lines, points, axis)
      • Convenient, mirrors how we think of building plots and analyzing data
      • Can’t go back once plot has started (i.e. to adjust margins); need to plan in advance
      • Difficult to translate to others once a new plot has been created (no graphical language)
      • Plot is just a series of R commands
      • Example of Base Plot
        • R> library(datasets)
        • data(cars)
        • with(cars, plot(speed, dist))
    • The second Plotting system is the Lattice System
      • Plots are created with a single function call (xyplot, bwplot, etc.): the whole plot is created at once by 1 function
      • Most useful for conditioning types of plots: Looking at how y changes with x across levels of z
      • Things like margins/spacing set automatically because entire plot is specified at once
      • Good for putting many many plots on a screen
      • Sometimes awkward to specify an entire plot in a single function call
      • Annotation in plot is not especially intuitive
      • Use of panel functions and subscripts difficult to wield and requires intense preparation
      • Cannot “add” to the plot once it is created
      • Example
        • library(lattice)
        • state <- data.frame(state.x77, region = state.region)
        • xyplot(Life.Exp ~ Income | region, data = state, layout = c(4, 1))
    • The third Plotting system is the ggplot2 System (the “grammar of graphics” plotting system)
      • Splits the difference between base and lattice in a number of ways — mixed ideas from both systems
      • Automatically deals with spacings, text, titles but also allows you to annotate by “adding” to a plot
      • Superficial similarity to lattice but generally easier/more intuitive to use
      • Default mode makes many choices for you (but you can still customize to your heart’s desire)
      • Example:
        • library(ggplot2)
        • data(mpg)
        • qplot(displ, hwy, data = mpg)
    • Multiple Boxplots
      • R> boxplot(pm25 ~ region, data = pollution)
    • Multiple Histograms
      • par(mfrow = c(2, 1), mar = c(4, 4, 2, 1))
      • hist(subset(pollution, region == "east")$pm25, col = "green")
      • hist(subset(pollution, region == "west")$pm25, col = "green")
    • Scatterplot (with color: col = region, West/East)
      • R> with(pollution, plot(latitude, pm25, col = region))
      • abline(h = 12, lwd = 2, lty = 2)
    • Multiple Scatterplots
      • par(mfrow = c(1, 2), mar = c(5, 4, 2, 1))
      • with(subset(pollution, region == "west"), plot(latitude, pm25, main = "West"))
      • with(subset(pollution, region == "east"), plot(latitude, pm25, main = "East"))
    • Adding a legend to a plot
      • plot(c(1968, 2010), c(0, 10), type = "n",  # sets the x and y axes scales
          xlab = "Year", ylab = "Expenditures/GDP (%)")  # adds titles to the axes
      • lines(year, defense, col = "red", lwd = 2.5)  # adds a line for defense expenditures
      • lines(year, health, col = "blue", lwd = 2.5)  # adds a line for health expenditures
      • legend(2000, 9.5,  # places a legend at the appropriate place
          c("Health", "Defense"),  # puts text in the legend
          lty = c(1, 1),  # gives the legend appropriate symbols (lines)
          lwd = c(2.5, 2.5), col = c("blue", "red"))  # gives the legend lines the correct color and width
  4. The Process of Making a Plot: When making a plot, one must first make a few considerations (in no particular order)
    • Where will the plot be made? On the screen? In a file?
    • How will the plot be used?
      • Is the plot for viewing temporarily on the screen?
      • Will it be presented in a web browser?
      • Will it eventually end up in a paper that might be printed?
      • Are you using it in a presentation?
    • Is there a large amount of data going into the plot? Or is it just a few points?
    • Do you need to be able to dynamically resize the graphic?
    • What graphics system will you use: base, lattice, or ggplot2? These generally cannot be mixed
    • Base graphics are usually constructed piecemeal, with each aspect of the plot handled separately through a series of function calls; this is often conceptually simpler and allows plotting to mirror the thought process.
    • Lattice graphics are usually created in a single function call, so all of the graphics parameters have to be specified at once; specifying everything at once allows R to automatically calculate the necessary spacings and font sizes.
    • ggplot2 combines concepts from both base and lattice graphics but uses an independent implementation
  5. The Base Plotting System in R
    • The core plotting system and the graphics engine/the base graphic system in R are encapsulated in the following packages:
      • graphics: contains plotting functions for the “base” graphing systems, including plot, hist, boxplot and many others
      • grDevices: contains all the code implementing the various graphics devices, including X11, PDF, PostScript, PNG, etc.
    • We focus on using the base plotting system to create graphics on the screen device
    • Base graphics: Base graphics are used most commonly and are a very powerful system for creating 2-D graphics
      • There are 2 phases to creating a base plot
        • Initializing a new plot
        • Annotating (adding to) an existing plot
      • Calling plot(x,y) or hist(x) will launch a graphics device (if one is not already open) and draw a new plot on the device
      • If the arguments to plot are not of some special class, then the default method for plot is called; this function has many arguments, letting you set the title, x axis label, y axis label, etc.
      • The base graphics system has many parameters that can be set and tweaked; these parameters are documented in ?par
    • Some Important Base Graphics Parameters
      Many base plotting functions share a set of parameters. Here are a few key ones:
      pch : the plotting symbol (default is open circle)
      lty: the line type (default is solid line), can be dashed, dotted, etc.
      lwd: the line width, specified as an integer multiple
      col: the plotting color, specified as a number, string, or hex code; the colors() function gives you a vector of colors by name
      xlab: character string for the x-axis label
      ylab: character string for the y-axis label
    • Some Important Base Graphics Parameters for par() function
      The par() function is used to specify global graphics parameters that affect all plots in an R session. These parameters can be overridden when specified as arguments to specific plotting functions.
      las: the orientation of the axis labels on the plot
      bg: the background color
      mar: the margin size
      oma: the outer margin size (default is 0 for all sides)
      mfrow: number of plots per row, column (plots are filled row-wise)
      mfcol: number of plots per row, column (plots are filled column-wise)
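Because par() settings are global, it can help to save the previous values and restore them afterwards. The following is a minimal sketch, assuming the built-in airquality data; the temporary PDF device and the names old, layout_during, and layout_after are illustrative choices, not part of the course material.

```r
library(datasets)

# A file device is opened so the sketch also runs in a non-interactive session
pdf(file = tempfile(fileext = ".pdf"))

# par() returns the previous values of the parameters it changes
old <- par(mfrow = c(1, 2), mar = c(4, 4, 2, 1))

with(airquality, {
  plot(Wind, Ozone, main = "Ozone and Wind")   # left panel
  hist(Ozone, main = "Ozone")                  # right panel
})

layout_during <- par("mfrow")   # 1 row, 2 columns while the setting is active
par(old)                        # restore the saved values
layout_after <- par("mfrow")    # back to a single panel

dev.off()
```

Restoring with par(old) keeps later plots in the same session from silently inheriting the two-panel layout.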
    • Base Plotting Functions
      plot: make a scatterplot, or other type of plot depending on the class of the object being plotted
      lines: add lines to a plot, given a vector of x values and a corresponding vector of y values (or a 2-column matrix); this function just connects the dots
      points: add additional points to a plot
      text: add text labels to a plot using specified x, y coordinates
      title: add annotations to x, y axis labels, title, subtitle, outer margin
      mtext: add arbitrary text to the margins (inner or outer) of the plot
      axis: adding axis ticks/labels
    • Base Plot with Annotation
      • library(datasets)
        with(airquality, plot(Wind, Ozone))
        title(main = "Ozone and Wind in New York City") ## Add a title
      • with(airquality, plot(Wind, Ozone, main = "Ozone and Wind in New York City"))
        with(subset(airquality, Month == 5), points(Wind, Ozone, col = "blue"))
      • with(airquality, plot(Wind, Ozone, main = "Ozone and Wind in New York City", type = "n")) # type = "n": set up the plot region (axes, labels) without drawing any data
        with(subset(airquality, Month == 5), points(Wind, Ozone, col = "blue"))
        with(subset(airquality, Month != 5), points(Wind, Ozone, col = "red"))
        legend("topright", pch = 1, col = c("blue", "red"), legend = c("May", "Other Months"))
    • Base Plot with Regression Line
      • with(airquality, plot(Wind, Ozone, main = "Ozone and Wind in New York City", pch = 20))
        model <- lm(Ozone ~ Wind, airquality) ## Fit a linear model of Ozone on Wind
        abline(model, lwd = 2)
    • Multiple Base Plots
      • par(mfrow = c(1, 2))
        with(airquality, {
          plot(Wind, Ozone, main = "Ozone and Wind")
          plot(Solar.R, Ozone, main = "Ozone and Solar Radiation")
        })
      • par(mfrow = c(1, 3), mar = c(4, 4, 2, 1), oma = c(0, 0, 2, 0))
        with(airquality, {
          plot(Wind, Ozone, main = "Ozone and Wind")
          plot(Solar.R, Ozone, main = "Ozone and Solar Radiation")
          plot(Temp, Ozone, main = "Ozone and Temperature")
          mtext("Ozone and Weather in New York City", outer = TRUE)
        })
  6. Graphics Device
    • What is a Graphics Device? A graphics device is something where you can make a plot appear
      • A window on your computer (screen device)
      • A PDF file (file device)
      • A PNG or JPEG file (file device)
      • A scalable vector graphics (SVG) file (file device)
    • When you make a plot in R, it has to be “sent” to a specific graphics device
    • The most common place for a plot to be “sent” is the screen device
      • On a Mac the screen device is launched with quartz()
      • On Windows the screen device is launched with windows()
      • On Unix/Linux the screen device is launched with x11()
    • How Does a Plot Get Created? There are two basic approaches to plotting.
      • The first is most common:
        • 1 Call a plotting function like plot, xyplot, or qplot
        • 2 The plot appears on the screen device
        • 3 Annotate plot if necessary
        • 4 Enjoy
        • library(datasets)
          with(faithful, plot(eruptions, waiting)) ## Make plot appear on screen device
          title(main = "Old Faithful Geyser data") ## Annotate with a title
      • The second approach to plotting is most commonly used for file devices:
        • 1 Explicitly launch a graphics device
        • 2 Call a plotting function to make a plot (Note: if you are using a file device, no plot will appear on the screen)
        • 3 Annotate plot if necessary
        • 4 Explicitly close graphics device with dev.off() (this is very important!)
        • pdf(file = "myplot.pdf") ## Open PDF device; create 'myplot.pdf' in my working directory
          ## Create plot and send to a file (no plot appears on screen)
          with(faithful, plot(eruptions, waiting))
          title(main = "Old Faithful Geyser data") ## Annotate plot; still nothing on screen
          dev.off() ## Close the PDF file device
          ## Now you can view the file 'myplot.pdf' on your computer
    • Graphics File Devices
      • There are two basic types of file devices: vector and bitmap devices
      • Vector formats:
        • pdf: useful for line-type graphics, resizes well, usually portable, not efficient if a plot has many objects/points
        • svg: XML-based scalable vector graphics; supports animation and interactivity, potentially useful for web-based plots
        • win.metafile: Windows metafile format (only on Windows)
        • postscript: older format, also resizes well, usually portable, can be used to create encapsulated postscript files; Windows systems often don’t have a postscript viewer
      • Bitmap formats:
        • png: bitmapped format, good for line drawings or images with solid colors, uses lossless compression (like the old GIF format), most web browsers can read this format natively, good for plotting many many many points, does not resize well
        • jpeg: good for photographs or natural scenes, uses lossy compression, good for plotting many many many points, does not resize well, can be read by almost any computer and any web browser, not great for line drawings
        • tiff: creates bitmap files in the TIFF format; supports lossless compression
        • bmp: a native Windows bitmapped format
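      • Related functions for working with several open devices: dev.cur() reports the currently active device, dev.set(n) makes device number n the active one (every open device is assigned an integer of 2 or more), and dev.copy() copies a plot from one device to another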
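The device-management functions can be tried without a screen by opening two file devices. This is only a sketch: the temporary file names f1 and f2 and the variable names first and second are made up for illustration.

```r
library(datasets)

# Two file devices writing to temporary files (hypothetical names f1, f2)
f1 <- tempfile(fileext = ".pdf")
f2 <- tempfile(fileext = ".pdf")

pdf(file = f1)
first <- dev.cur()     # named integer identifying the first device
pdf(file = f2)
second <- dev.cur()    # the most recently opened device is the active one

dev.set(first)         # make the first device active again
with(faithful, plot(eruptions, waiting))   # drawn into f1, not f2

dev.off(second)        # close each device explicitly
dev.off(first)
```

Keeping the value returned by dev.cur() makes it possible to switch back to, or close, a specific device later.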
  7. The Lattice plotting system
    • Used for plotting high-dimensional data and/or making many plots at once
    • It is implemented using the following packages:
      • lattice: contains code for producing Trellis graphics, which are independent of the “base” graphics system; includes functions like xyplot, bwplot, levelplot
      • grid: implements a different graphing system independent of the “base” system; the lattice package builds on top of grid;
        • we seldom call functions from the grid package directly.
      • The lattice plotting system does not have a “two-phase” aspect with separate plotting and annotation like in base plotting
      • All plotting/annotation is done at once with a single function call
    • Lattice Functions
      • xyplot: this is the main function for creating scatterplots
      • bwplot: box-and-whiskers plots (“boxplots”)
      • histogram: histograms
      • stripplot: like a boxplot but with actual points
      • dotplot: plot dots on “violin strings”
      • splom: scatterplot matrix; like pairs in base plotting system
      • levelplot, contourplot: for plotting “image” data
    • Lattice functions generally take a formula for their first argument, usually of the form
      • xyplot(y ~ x | f * g, data)
      • We use the formula notation here, hence the ~
      • On the left of the ~ is the y-axis variable, on the right is the x-axis variable
      • f and g are conditioning variables – they are optional
        • the * indicates an interaction between the two variables
      • The second argument is the data frame or list from which the variables in the formula should be looked up
        • If no data frame or list is passed, then the parent frame is used
      • If no other arguments are passed, there are defaults that can be used
    • Example:
      • library(lattice); library(datasets)
        xyplot(Ozone ~ Wind, data = airquality)
      • library(datasets)
        airquality <- transform(airquality, Month = factor(Month)) # convert Month to a factor
        xyplot(Ozone ~ Wind | Month, data = airquality, layout = c(5, 1)) # separate by Month: 5 columns
    • Lattice Behavior: lattice functions behave differently from base graphics functions in one critical way
      • Base graphics functions plot data directly to the graphics device (screen, PDF file, etc.)
      • xyplot does 2 steps: it returns an object (invisible to the user), which is then auto-printed (visible to the user)
      • Lattice graphics functions return an object of class trellis (do not plot anything, just return object)
      • The print methods for lattice functions actually do the work of plotting the data on the graphics device
        • p<-xyplot(Ozone ~ Wind, data=airquality) ##Nothing happens!
        • print(p) ##Plot appears
        • xyplot(Ozone ~ Wind, data=airquality) ##Auto-printing
      • Lattice functions return “plot objects” that can, in principle, be stored (but it’s usually better to just save the code + data)
      • On the command line, trellis objects are auto-printed so that it appears the function is plotting the data
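Auto-printing only happens at the top level of the console, so inside a function or a loop the trellis object has to be printed explicitly. A minimal sketch of that behaviour, assuming the built-in airquality data; the helper save_plot() and the temporary file are hypothetical names introduced for illustration.

```r
library(lattice)
library(datasets)

p <- xyplot(Ozone ~ Wind, data = airquality)  # returns an object, draws nothing
class(p)                                      # "trellis"

# Hypothetical helper: inside a function there is no auto-printing,
# so print() must be called to actually draw the plot on the device
save_plot <- function(plot_obj, file) {
  pdf(file = file)
  on.exit(dev.off())   # always close the device, even on error
  print(plot_obj)
  invisible(file)
}

out <- tempfile(fileext = ".pdf")
save_plot(p, out)
```

Without the print() call inside save_plot(), the PDF would be created but would contain no plot.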
    • Lattice Panel Functions
      • Lattice functions have a panel function which controls what happens inside each panel of the plot
      • The lattice package comes with default panel functions, but you can supply your own if you want to customize what happens in each panel
      • Panel functions receive the x/y coordinates of the data points in their panel (along with any optional arguments)
      • Example of lattice panel functions
        1. set.seed(10) x
        2. Custom panel function
           xyplot(y ~ x | f, panel = function(x, y, ...) {
             panel.xyplot(x, y, ...) # First call the default panel function for 'xyplot'
             panel.abline(h = median(y), lty = 2) # Add a horizontal line at the median
           })
        3. Custom panel function with regression line
           xyplot(y ~ x | f, panel = function(x, y, ...) {
             panel.xyplot(x, y, ...) # First call default panel function
             panel.lmline(x, y, col = 2) # Overlay a simple linear regression line
           })
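The panel functions above can be tried end to end with simulated data. In the sketch below the data-generating step (x, y, f and the helper variable grp) is an assumption made for illustration, not the data used in the course; both the median line and the regression line are drawn in one custom panel function.

```r
library(lattice)

# Simulated data (assumed for illustration): two groups with different slopes
set.seed(10)
x <- rnorm(100)
grp <- rep(0:1, each = 50)                       # group indicator
y <- x + grp - grp * x + rnorm(100, sd = 0.5)
f <- factor(grp, labels = c("Group 1", "Group 2"))

p <- xyplot(y ~ x | f, layout = c(2, 1),
            panel = function(x, y, ...) {
              panel.xyplot(x, y, ...)               # default panel: the points
              panel.abline(h = median(y), lty = 2)  # dashed line at the panel median
              panel.lmline(x, y, col = 2)           # per-panel regression line
            })

out <- tempfile(fileext = ".pdf")   # hypothetical output file
pdf(file = out)
print(p)
dev.off()
```

Inside the panel function, x and y contain only the observations belonging to that panel, which is why the median and the fitted line differ between the two groups.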
    • Summary
      • Lattice plots are constructed with a single function call to a core lattice function (e.g. xyplot)
      • Aspects like margins and spacing are automatically handled and defaults are usually sufficient
      • The lattice system is ideal for creating conditioning plots where you examine the same kind of plot under many different conditions
      • Panel functions can be specified/customized to modify what is plotted in each of the plot panels
  8. The ggplot2 plotting system
    • It implements what is called the grammar of graphics
    • The Grammar of Graphics is a description of how graphics can be broken down into abstract concepts/objects
    • Grammar = verb + noun + adjective, which are the basic elements of ggplot2 graphics
    • What are the verbs, nouns, and adjectives of a data graphic? They are the basic elements of a graphic, which you can put together to make new types of graphics. The basic elements themselves cannot be modified.
    • Allows for a “theory” of graphics on which to build new graphics and graphics objects
    • “Shorten the distance from mind to page”
    • Grammar of Graphics: “In brief, the grammar tells us that a statistical graphic is a mapping from data to aesthetic attributes (color, shape, size) of geometric objects (points, lines, bars). The plot may also contain statistical transformations of the data and is drawn on a specific coordinate system”
    • The basic function: qplot() – quick plot
      • Works much like the plot function in the base graphics system
      • Looks for data in a data frame, similar to lattice, or in the parent environment (if you do not specify the data frame, the plotting functions will look for the data in your workspace)
      • Plots are made up of aesthetics (size, shape, color) and geoms (points, lines)
      • Factors are important for indicating subsets/divisions of the data (if they are to have different properties); they should be labeled with informative labels
      • The qplot() hides what goes on underneath, which is okay for most operations
      • ggplot() is the core function and very flexible for doing things qplot() cannot do
    • Example of qplot()
      • library(ggplot2)
        str(mpg) # example dataset
        qplot(displ, hwy, data = mpg) # qplot(x coord, y coord, data frame)
      • Modifying aesthetics: color aesthetic = a data variable of the data frame
        qplot(displ, hwy, data = mpg, color = drv) # a legend is added automatically
      • Adding a geom: a smoother is a kind of statistic/summary of the data
        qplot(displ, hwy, data = mpg, geom = c("point", "smooth")) # add a smoother to the point plot
      • Create a histogram: only provide the x coordinate
        qplot(hwy, data = mpg, fill = drv) # fill color set by the data variable "drv"
      • Facets in the ggplot2 system are like panels in the lattice system: they create separate plots for subsets of your data, indicated by a factor variable (the data are divided by the values of the factor variable)
        qplot(displ, hwy, data = mpg, facets = . ~ drv) # separate plots by columns, variable on the right-hand side
        qplot(hwy, data = mpg, facets = drv ~ ., binwidth = 2) # separate histograms by rows, variable on the left-hand side
    • Summary of qplot() function
      • The qplot() function is the analog to plot() but with many built-in features
      • Syntax somewhere in between base and lattice systems
      • Produces very nice graphics, essentially publication ready (if you like the design)
      • Difficult to go against the grain/customize (don’t bother; use the full power of ggplot() in that case)
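Recent ggplot2 releases mark qplot() as deprecated (since version 3.4.0, as far as I am aware), so it may be useful to see how the same quick plots look with ggplot(). The sketch below uses the mpg data that ships with ggplot2; the object names p_scatter, p_hist, p_facet and the temporary output file are illustrative.

```r
library(ggplot2)

# Roughly equivalent to: qplot(displ, hwy, data = mpg, color = drv)
p_scatter <- ggplot(mpg, aes(displ, hwy, color = drv)) +
  geom_point()

# Roughly equivalent to: qplot(hwy, data = mpg, fill = drv, binwidth = 2)
p_hist <- ggplot(mpg, aes(hwy, fill = drv)) +
  geom_histogram(binwidth = 2)

# Roughly equivalent to: qplot(displ, hwy, data = mpg, facets = . ~ drv)
p_facet <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  facet_grid(. ~ drv)

# Draw all three into a temporary PDF (one page each)
out <- tempfile(fileext = ".pdf")
pdf(file = out)
print(p_scatter)
print(p_hist)
print(p_facet)
dev.off()
```

Each qplot() argument maps onto an explicit piece of the grammar: the variables go into aes(), the plot type becomes a geom, and facets become a facet_grid() layer.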
    • The basic components of a ggplot2 Plot (review)
      • A data frame
      • aesthetic mappings: how data are mapped to color, size
      • geoms: geometric objects like points, lines, shapes
      • facets: for conditional plots (divide subset data into each plot in columns/rows; multiple panel plots)
      • stats: statistical transformations like binning, quantiles, smoothing
      • scales: what scale an aesthetic map uses (example: male=red, female=blue)
      • coordinate system
    • ggplot() function – initializes a ggplot object
      • When building plots in ggplot2 system with ggplot() function, the “artist’s palette” model may be the closest analogy
      • Plots are built up in layers – you can add to the plot piece by piece after plotting the data
        • Plot the data
        • Overlay a summary
        • Metadata and annotation
    • Example of ggplot() function
      • str(maacs)
      • head(maacs)
      • g <- ggplot(maacs, aes(logpm25, NocturnalSympt)) # initial call to ggplot
        summary(g) # summary of ggplot object
        ## data: logpm25, bmicat, NocturnalSympt [554×3]
        ## mapping: x = logpm25, y = NocturnalSympt
        ## faceting: facet_null()
      • No Plot Yet
        g <- ggplot(maacs, aes(logpm25, NocturnalSympt))
        print(g) # Error: no layers in plot; it doesn't know how to draw the data yet (as points, lines, tiles, …); recent ggplot2 versions draw an empty plot instead of raising this error
        g + geom_point() # Auto-print plot object without saving
      • Plot with Point layer
        g <- ggplot(maacs, aes(logpm25, NocturnalSympt))
        g + geom_point()
      • Adding more layers: Smoother
        g + geom_point() + geom_smooth() # default smooth
        g + geom_point() + geom_smooth(method = "lm") # smooth with regression line
      • Adding more layers: Facets; the labels of each panel come from the levels of the faceting (factor) variable that you condition on
        g + geom_point() + geom_smooth(method = "lm") + facet_grid(. ~ bmicat)
      • Modifying Aesthetics
        g + geom_point(color = "steelblue", size = 4, alpha = 1/2) # "steelblue" is a constant value
        g + geom_point(aes(color = bmicat), size = 4, alpha = 1/2) # bmicat is a data variable
      • Modifying labels: using labs() function
        g + geom_point(aes(color = bmicat)) + labs(title = "MAACS cohort") + labs(x = expression("log " * PM[2.5]), y = "Nocturnal Symptoms")
      • Customizing the Smoother
        g + geom_point(aes(color = bmicat), size = 2, alpha = 1/2) + geom_smooth(size = 4, linetype = 3, method = "lm", se = FALSE)
      • Changing the Theme
        g + geom_point(aes(color = bmicat)) + theme_bw(base_family = "Times") # change font type
      • More complex example:
        • Convert a continuous variable to a categorical one with the cut() function, which cuts the data into a reasonable series of ranges: calculate the cutpoints (quantiles of the data), use cut() to create the new factor variable maacs$no2dec, then call levels(maacs$no2dec), which returns 3 different levels
        • Create a plot with 2 conditions, using 2 factor variables
          ## Setup ggplot with data frame
          g <- ggplot(maacs, aes(logpm25, NocturnalSympt))
          ## Add layers
          g + geom_point(alpha = 1/3) +
            facet_wrap(bmicat ~ no2dec, nrow = 2, ncol = 4) +
            geom_smooth(method = "lm", se = FALSE, col = "steelblue") +
            theme_bw(base_family = "Avenir", base_size = 10) +
            labs(x = expression("log " * PM[2.5])) +
            labs(y = "Nocturnal Symptoms") +
            labs(title = "MAACS Cohort")
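Since the maacs data are not publicly bundled with R, the same workflow can be sketched on the mpg data that ships with ggplot2. Everything dataset-specific here is an assumption made for illustration: the copy mpg2, the new factor displ_cat, and the choice of year as the second conditioning variable.

```r
library(ggplot2)

mpg2 <- as.data.frame(mpg)   # work on a copy of the example data

# Cut points at the tertiles (0%, 33%, 67%, 100%) of a continuous variable
cutpoints <- quantile(mpg2$displ, seq(0, 1, length.out = 4), na.rm = TRUE)

# include.lowest = TRUE keeps the minimum value inside the first interval
mpg2$displ_cat <- cut(mpg2$displ, cutpoints, include.lowest = TRUE)
levels(mpg2$displ_cat)       # 3 levels, one per tertile

# Set up the plot, then add layers one at a time
g <- ggplot(mpg2, aes(cty, hwy))
p <- g + geom_point(alpha = 1/3) +
  facet_wrap(year ~ displ_cat, nrow = 2, ncol = 3) +
  geom_smooth(method = "lm", se = FALSE, col = "steelblue") +
  theme_bw(base_size = 10) +
  labs(x = "City mpg", y = "Highway mpg", title = "mpg data")

out <- tempfile(fileext = ".pdf")   # hypothetical output file
pdf(file = out)
print(p)
dev.off()
```

Without include.lowest = TRUE, cut() would leave the observation equal to the minimum outside every interval and code it as NA.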
    • Annotation
      • Labels: xlab(), ylab(), labs(), ggtitle()
      • Each of the “geom” functions has options to modify
      • For things that only make sense globally, use theme() function
        • Ex: theme(legend.position = "none")
      • Two standard appearance themes are included
        • theme_gray(): the default theme (gray background)
        • theme_bw(): more stark/plain
    • Summary of ggplot2
      • ggplot2 is very powerful and flexible if you learn the “grammar” and the various elements that can be tuned/modified
      • Many more types of plots can be made; explore and mess around with the package
  9. Clustering methods: data are complex, so we need to summarize them and visualize the information in a proper and convenient way; clustering methods organize datasets into regions of interest
    • Clustering is a task of assigning a group (a cluster) to objects so that instances from the same group are more similar than those of different groups
    • Clustering organizes things that are close into groups
      • How do we define close?
      • How do we group things?
      • How do we visualize the grouping?
      • How do we interpret the grouping?
    • Hierarchical clustering: organize data into a kind of hierarchy
      • An agglomerative approach: a bottom-up approach; start with individual data points and keep lumping them together into clusters until eventually all the data are grouped into just 1 big cluster
        • Find closest two things: start by grouping points into little balls, which get grouped into bigger balls, which in turn get grouped into one big massive cluster; the merged points (super-points) are not original data points but are created by merging the 2 closest points in the data set.
        • Put them together: replace the 2 original points with the new merged point (super-point)
        • Find next closest
      • Requires
        • A defined distance: a distance metric; how to calculate the distance between 2 points?
        • A merging approach: how to merge 2 closest points together
      • Produces
        • A tree showing how close things are merged to each other, called the dendrogram
      • How do we define close?
        • Most important step
          • Garbage in -> garbage out: if a distance metric doesn’t make sense then the result will be relatively meaningless
        • Example of Distance or similarity
          • Example of continuous data – Euclidean distance: a distance metric, e.g. the straight-line distance between the locations of 2 cities; whether that makes sense for you depends on whether you are a bird or something else. For 2 points $(x_1, y_1)$ and $(x_2, y_2)$ the straight-line distance is $\sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$; the general formula for a multidimensional problem is $\sqrt{(A_1 - A_2)^2 + (B_1 - B_2)^2 + \cdots + (Z_1 - Z_2)^2}$
          • Continuous data – correlation similarity
          • Example of Binary data – Manhattan distance: look at points on a city block grid and imagine you are in Manhattan in New York. You want to move from one point (black circle) to another; you cannot go directly because of the city blocks, so you have to follow the streets, going up or down, left or right. The green line represents the Euclidean distance, as if you were a bird and could fly over everything. As a person walking on the ground, however, you have to take the red, blue, or yellow line. In general the Manhattan distance is $|A_1 - A_2| + |B_1 - B_2| + \cdots + |Z_1 - Z_2|$
        • Pick a distance/similarity that makes sense for your problem
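The two distance metrics can be compared on a tiny example with the dist() function from base R. The two points below are made up for illustration.

```r
# Two hypothetical points: a = (0, 0) and b = (3, 4)
pts <- rbind(a = c(0, 0), b = c(3, 4))

euclid    <- dist(pts)                        # straight line: sqrt(3^2 + 4^2) = 5
manhattan <- dist(pts, method = "manhattan")  # along the grid: |3| + |4| = 7

euclid
manhattan
```

The Manhattan distance is never smaller than the Euclidean one, which matches the picture of walking along city blocks instead of flying in a straight line.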
      • Example of hierarchical clustering – hclust() function
        • df <- data.frame(x = x, y = y)
        • distxy <- dist(df)
        • hClustering <- hclust(distxy)
        • plot(hClustering)
      • Example of hierarchical clustering – heatmap() function, which runs the hierarchical cluster analysis on the rows and on the columns of a large table and reorders them so that you can visualize groups of observations within the table
        • df <- data.frame(x = x, y = y)
        • set.seed(143)
        • dataMatrix <- as.matrix(df)[sample(1:12), ]
        • heatmap(dataMatrix)
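The snippets above assume that x and y already exist. A self-contained sketch follows, in which the simulated coordinates (12 points around three centers) and the seed are assumptions made for illustration; cutree() is used to turn the tree into a fixed number of groups.

```r
# Simulated data (assumed for illustration): 12 points around 3 centers
set.seed(1234)
x <- rnorm(12, mean = rep(1:3, each = 4), sd = 0.2)
y <- rnorm(12, mean = rep(c(1, 2, 1), each = 4), sd = 0.2)
df <- data.frame(x = x, y = y)

distxy <- dist(df)              # pairwise Euclidean distances
hClustering <- hclust(distxy)   # complete linkage by default

# Cut the dendrogram into 3 groups
groups <- cutree(hClustering, k = 3)
table(groups)

# Draw the dendrogram into a temporary file (hypothetical output path)
out <- tempfile(fileext = ".pdf")
pdf(file = out)
plot(hClustering)
dev.off()
```

Choosing k in cutree() corresponds to choosing the height at which the dendrogram is cut.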
      • Merging points – complete
        • When you merge points together, what represents the new location? There are 2 different merging approaches; it is useful to try both to see what kinds of clustering results you get in the end and whether one set makes more sense than another.
          • The average approach: the new location is just the average of their x coordinates and their y coordinates, which gives you the distance between the 2 centers of gravity. The distance is somewhat shorter than with the complete linkage approach.
          • The complete linkage approach: to measure the distance between 2 clusters of points, you take the 2 farthest points from the 2 clusters as the distance, so the distance tends to be large.
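A small worked example of the two merging strategies, using four made-up points on a line. The by-hand numbers are compared with the method argument of hclust(); note that hclust's "average" method averages all pairwise distances between the two clusters, which for this 1-D example coincides with the distance between the two centers of gravity.

```r
# Two hypothetical clusters on a line
A <- c(0, 1)
B <- c(4, 6)

# All pairwise distances between a point in A and a point in B: 4, 6, 3, 5
d <- abs(outer(A, B, "-"))

complete_dist <- max(d)    # farthest pair: 6
average_dist  <- mean(d)   # average over all pairs: 4.5

# The same quantities appear as the final merge heights in hclust()
pts <- data.frame(x = c(A, B))
hc_complete <- hclust(dist(pts), method = "complete")
hc_average  <- hclust(dist(pts), method = "average")

max(hc_complete$height)    # 6
max(hc_average$height)     # 4.5
```

The last merge joins clusters A and B, so its height is exactly the between-cluster distance under the chosen linkage.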
      • The problems of Hierarchical clustering
        • The picture may be unstable
          • Change a few points (e.g. outliers)
          • Have different missing values
          • Pick a different distance
          • Change the merging strategy
          • Change the scale of points for one variable
        • Choosing where to cut is not always obvious
      • The advantage of Hierarchical clustering
        • It is deterministic: no randomness in it (the same input will give the same result)
        • Should be primarily used for exploration: visualize the data and get a sense of what patterns are there; if there are any patterns, you can formalize them later in more sophisticated models
    • K-Means Clustering
      • A partitioning approach
        • Fix a number of clusters
        • Get “centroids” of each cluster
        • Assign things to closest centroid
        • Recalculate centroids
      • Requires
        • A defined distance metric
        • A number of clusters
        • An initial guess as to cluster centroids
      • Produces
        • Final estimate of cluster centroids
        • An assignment of each point to clusters
      • K-means clustering – example
        • a
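The partitioning steps listed above are what the kmeans() function in base R carries out. A minimal sketch on simulated data; the coordinates, the seed, and the choice of 3 clusters with 10 random starts are assumptions made for illustration.

```r
# Simulated data (assumed for illustration): 12 points around 3 centers
set.seed(1234)
x <- rnorm(12, mean = rep(1:3, each = 4), sd = 0.2)
y <- rnorm(12, mean = rep(c(1, 2, 1), each = 4), sd = 0.2)
df <- data.frame(x = x, y = y)

# Fix the number of clusters; nstart tries several random initial centroids
km <- kmeans(df, centers = 3, nstart = 10)

km$centers   # final estimate of the cluster centroids (one row per cluster)
km$cluster   # cluster assignment of each of the 12 points
```

Unlike hierarchical clustering, the result depends on the random starting centroids, which is why a seed is set and several starts are used.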
  10. Dimension reduction techniques
  11. Working with Color in R plots