
The Data Science Specialization – 3. Getting and Cleaning Data

May 5, 2015

Course Description

This course will cover the basic ways that data can be obtained. The course will cover obtaining data from the web, from APIs, and from colleagues in various formats including raw text files, binary files, and databases. It will also cover the basics of data cleaning and how to make data tidy. The course will also cover the components of a complete data set including raw data, processing instructions, codebooks, and processed data. The course will cover the basics needed for collecting, cleaning, and sharing data. The goal of this course:  RAW DATA -> PROCESSING SCRIPT -> TIDY DATA -> DATA ANALYSIS -> DATA COMMUNICATION (to a colleague/collaborator/business partner)

Course Content

  • Data collection
    • Raw files (.csv, .xlsx)
    • Databases (MySQL)
    • APIs/Web applications
    • Other sources
  • Data formats
    • Flat files (.csv, .txt)
    • XML
    • JSON
    • HDF5
  • Making data tidy
  • Distributing data
  • Scripting for data cleaning

Detail/content

  1. Data collection
    • Raw files (.csv, .xlsx)
    • Databases (MySQL, PostgreSQL (RPostgreSQL), MS Access and SQLite (with RODBC))
    • APIs
    • Web application
      • Webscraping is programmatically extracting data from the HTML code of websites
        • Many websites have information you may want to programmatically read
        • In some cases, this is against the terms of service for the website
        • Attempting to read too many pages too quickly can get your IP address blocked
    • Other sources – e.g. files from other statistical packages via the foreign package; base R connection functions (a short sketch follows this list):
      • file – open a connection to a text file
      • url – open a connection to a URL
      • gzfile – open a connection to a .gz file
      • bzfile – open a connection to a .bz2 file
      • see ?connections for more information
      • Remember to close connections
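      A minimal sketch of a connection round-trip; "notes.txt" is a hypothetical local text file:
        con <- file("notes.txt", "r")        # open a read connection
        firstLines <- readLines(con, n = 5)  # read the first 5 lines
        close(con)                           # remember to close the connection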
  2. Data formats
    • Flat files (.csv, .txt)
    • XML
      • XML Tags:
        • Start tags: "<" + tag name + ">", e.g. <section>
        • End tags: "<" + "/" + tag name + ">", e.g. </section>
        • Empty tags have no separate end tag: "<" + tag name + "/>", e.g. <line-break />
      • XML Elements: are specific examples of tags (= both the tags and their values/contents)
        • <Greeting> Hello, world! </Greeting>
      • Attributes are components of the label (a parsed example follows this list)
        • <step number="3"> Connect A to B. </step>
        • <img src="jeff.jpg" alt="instructor" />
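      A minimal sketch of those pieces in R, using the XML package covered later in these notes:
        library(XML)
        doc <- xmlParse('<course><step number="3">Connect A to B.</step></course>', asText = TRUE)
        step <- xmlRoot(doc)[[1]]
        xmlName(step)               # "step" – the tag name
        xmlValue(step)              # "Connect A to B." – the element's content
        xmlGetAttr(step, "number")  # "3" – the attribute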
    • JSON: JavaScript Object Notation
      • Lightweight data storage
      • Common format for data from application programming interfaces (APIs)
      • Similar structure to XML but different syntax/format
      • Data stored as (a parsed example follows this list)
        • Numbers (doubles)
        • Strings (double quoted)
        • Booleans (true/false)
        • Arrays (ordered, comma-separated, enclosed in square brackets [])
        • Objects (unordered, comma-separated collections of key:value pairs in curly brackets {})
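      A quick sketch of those value types, parsed with the jsonlite package used later in these notes:
        library(jsonlite)
        txt <- '{"name": "jeff", "score": 9.5, "active": true, "tags": ["r", "data"]}'
        x <- fromJSON(txt)
        x$tags    # the JSON array becomes a character vector
        x$active  # the JSON boolean becomes a logical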
    • HDF5 format: Hierarchical Data Format (www.hdfgroup.org) HDF5 is a data model, library, and file format for storing and managing data. It supports an unlimited variety of datatypes, and is designed for flexible and efficient I/O and for high volume and complex data. HDF5 is portable and is extensible, allowing applications to evolve in their use of HDF5.
      • Used for storing large data sets
      • Supports storing a range of data types
      • Hierarchical data format: data are stored in groups
      • Groups containing zero or more data sets and metadata
        • Have a group header with the group name and a list of attributes of that group
        • Have a group symbol table with a list of objects in group
      • Datasets are multidimensional arrays of data elements with metadata
        • Have a header with name, datatype, dataspace, and storage layout
        • Have a data array with the data
  3. Raw and Processed Data
    • Definition of data: "Data are values of qualitative or quantitative variables, belonging to a set of items"
      • Set of items: sometimes called the population; the set of objects you are interested in
      • Variables: (a thing that you measure) A measurement or characteristic of an item
      • Qualitative: Country of origin, sex, treatment
      • Quantitative: Height, weight, blood pressure
    • Raw data: the original source of data, often hard to use for analyses
    • Processed data: Data that is ready for analysis, processing can include merging, subsetting, transforming, …
  4. Making data tidy
    • The 4 things we should have
      • 1. The raw data
      • 2. A tidy data set
      • 3. A code book describing each variable and its values in the tidy data set
      • 4. Instruction list: an explicit and exact recipe you used to go from 1 to 2 and 3
    • 1. The raw data is in the right format IF you
      • Ran no software on the data
      • Did not manipulate any of the numbers in the data
      • Did not remove any data from the data set
      • Did not summarize the data in any way
    • 2. The tidy data set is the target, the end goal of the whole process: one table per file, human-readable file names, a header row of variable names; a small sketch follows the four rules below
      • Each variable measured should be in one column (each variable forms a column)
      • Each different observation of that variable should be in a different row (each observation forms a row)
      • There should be one TABLE for each kind of variable (each table/file stores data about one kind of observation, e.g. people/hospitals)
      • If there are multiple tables, they should include a column that allows them to be linked
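      A hypothetical untidy table and its tidy form under these rules, using melt() from the Reshaping Data notes below:
        untidy <- data.frame(city = c("A", "B"), temp2014 = c(10, 12), temp2015 = c(11, 13))
        library(reshape2)
        tidy <- melt(untidy, id = "city", variable.name = "year", value.name = "temp")
        tidy  # one row per observation; "year" holds the former column names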
      • Steps to make data tidy:
        • Subset and Sorting
        • Summarizing Data
        • Creating New Variables: why create new variables
          • Often the raw data won’t have a value you are looking for
          • You will need to transform the data to get the values you would like
          • Usually you will add those values to the data frames you are working with
          • Common variables to create (all three are sketched after this list)
            • Missingness indicators
            • "Cutting up" quantitative variables
            • Applying transforms
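          A sketch of all three on a hypothetical data frame df with a numeric column x:
            df <- data.frame(x = c(1.2, NA, 3.8, 10.5, 7.1))
            df$x.missing <- is.na(df$x)                    # missingness indicator
            df$x.group <- cut(df$x, breaks = c(0, 5, 15))  # "cut up" the quantitative x
            df$x.log <- log(df$x + 1)                      # apply a transform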
        • Reshaping Data
          • Reshape the raw data into the format you would like to have it in
          • The goal is tidy data
        • Managing Data Frames with dplyr
          • The dplyr package is designed to help you work with data frames, but it also works with data tables and relational database systems
          • An optimized and distilled version of the plyr package
          • Does not provide any "new" functionality per se, but greatly simplifies existing functionality in R
          • Provides a grammar (in particular, verbs) for data manipulation
          • Is very fast, as many key operations are coded in C++
          • dplyr Verbs
            • select: return a subset of the columns of a data frame
            • filter: extract a subset of rows from a data frame based on logical conditions
            • arrange: reorder rows of a data frame
            • rename: rename variables in a data frame
            • mutate: add new variables/columns or transform existing variables
            • summarize: generate summary statistics of different variables in the data frame, possibly within strata
            • There is also a handy print method that prevents you from printing a lot of data to the console
          • dplyr Properties
            • The first argument is a data frame.
            • The subsequent arguments describe what to do with it, and you can refer to columns in the data frame directly without using the $ operator (just use the names).
            • The result is a new data frame
            • Data frames must be properly formatted and annotated for this to all be useful
          • Once you learn the dplyr “grammar”, there are a few additional benefits
            • dplyr can work with other data frame “backends”
            • data.table for large fast tables
            • SQL interface for relational databases via the DBI package
        • Merging Data: merge more datasets together by matching those datasets based on an ID
    • 3. A code book: describes each variable, its units, and any transformations/work performed to clean up the data; should have a section called "Study design" that thoroughly describes how the data was collected
      • Information about the variables (including units) in the data set
      • Information about the summary choices you made
      • Information about the experimental study design you used
    • 4. The instruction list: guides readers in using the scripts to process the data; a skeleton example follows this list
      • A computer script in R or Python to process the input raw data
      • The output is the processed, tidy data
      • There are no parameters to the script
      • The script must generate the same tidy data for the same input raw data
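    A minimal sketch of such a script; the file names and cleaning steps are hypothetical:
      # run_analysis.R – raw data in, tidy data out, no parameters
      raw <- read.csv("./data/raw_data.csv")
      tidy <- raw[!is.na(raw$value), c("id", "value")]  # drop missing rows, keep needed columns
      names(tidy) <- tolower(names(tidy))
      write.csv(tidy, "./data/tidy_data.csv", row.names = FALSE)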
  5. R programming: Reading and Processing data
    • if (!file.exists("./data")) { dir.create("./data") }
    • download.file("http://…", destfile="./data/cameras.csv", method="curl", mode="wb") # use mode="wb" for binary files such as .xlsx
    • dateDownloaded <- date() # record when the file was downloaded
    • list.files("./data")
    • read.table("./data/cameras.csv", sep=",", header=TRUE) is equivalent to read.csv("./data/cameras.csv")
      • other useful parameters: quote, na.strings, nrows, skip (number of lines to skip before starting to read)
    • library(xlsx)
      • read.xlsx("./data/cameras.xlsx", sheetIndex=1, header=TRUE, colIndex=2:3, rowIndex=1:4)
      • write.xlsx
      • read.xlsx2 is much faster than read.xlsx, but may be slightly unstable when reading subsets of rows
      • the XLConnect package has more options for writing and manipulating Excel files
      • The XLConnect vignette is a good place to start for the XLConnect package
    • library(XML)
      • doc <- xmlTreeParse("http://www.w3schools.com/xml/simple.xml", useInternal=TRUE)
      • rootNode<-xmlRoot(doc)
      • xmlName(rootNode)
      • names(rootNode)
      • reading first element: rootNode[[1]]
      • reading first element of the first one: rootNode[[1]][[1]]
      • loop through all elements of the XML object rootNode and apply the xmlValue function, which extracts the XML value
        • xmlSApply(rootNode, xmlValue)
      • get specific components of the document with xpathSApply: xpathSApply(rootNode, "//price", xmlValue)
        • /node: top level node
        • //node: Node at any level
        • //node[@attr-name]: Node with an attribute name
        • //node[@attr-name=’bob’]: Node with attribute name attr-name = ‘bob’
        • doc <- htmlTreeParse("http://…", useInternal=TRUE)
        • scores <- xpathSApply(doc, "//li[@class='score']", xmlValue)
        • teams <- xpathSApply(doc, "//li[@class='team-name']", xmlValue)
    • library(jsonlite):
      • jsonData <- fromJSON("https://…")
        • names(jsonData)
        • jsonData$owner$login
      • Writing/converting data frames to JSON
        • myjson <-toJSON(jsonData, pretty=TRUE)
        • cat(myjson)
      • Convert back to data frames from JSON
        • fromJSON(myjson)
    • library(data.table): much faster than data frames, as it is written in C
      • inherits from data.frame: all functions that accept a data.frame also work on a data.table
      • dt <- data.table(x=rnorm(9), y=rep(c("a","b","c"), each=3), z=rnorm(9)) # created just like a data.frame
      • subset: dt[dt$y=="a",] # get all rows where column y equals "a"
      • adding new columns: dt[, w:=z^2]
      • special variables
        • .N: an integer, length 1, containing the number of rows in the current group
        • count the number of rows for each value of x: dt[, .N, by=x]
      • Keys:
        • set x as the key of data.table dt: setkey(dt, x); the data will be sorted by the x column
        • dt[x=="a"] is then equivalent to dt["a"]
        • merge(dt1, dt2) is fast when both data.tables have a key on x
      • fread: fast reading function
      • tables(): see all data.tables in memory
      • Expressions: {}
        • {print(1); 2} evaluates to 2 (the value of the last expression inside the braces)
        • calculate values for variables with expressions: dt[, list(mean(x), sum(z))] reports the mean of the x column and the sum of the z column
        • dt[, table(y)] # tabulate the y column (a combined runnable sketch follows)
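      A combined, runnable sketch of the operations above:
        library(data.table)
        dt <- data.table(x = rnorm(9), y = rep(c("a", "b", "c"), each = 3), z = rnorm(9))
        dt[, w := z^2]               # add a column by reference
        dt[, .N, by = y]             # row counts per value of y
        setkey(dt, y); dt["a"]       # keyed subset, same as dt[y == "a"]
        dt[, list(mean(x), sum(z))]  # column summaries via an expression in j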
    • Reading data from a MySQL Database
      • Install MySQL: https://dev.mysql.com/doc/refman/5.7/en/installing.html (for local MySQL Server only)
      • Install RMySQL: install.packages("RMySQL") on Mac, or see http://biostat.mc.vanderbilt.edu/wiki/Main/RMySQL for all OSes
      • install.packages("RMySQL", type="source")
      • library(RMySQL)
      • ucscDb <- dbConnect(MySQL(), user="genome", host="genome-mysql.cse.ucsc.edu"); result <- dbGetQuery(ucscDb, "show databases;"); dbDisconnect(ucscDb)
      • connect to a specific Database
        • hg <- dbConnect(MySQL(), user="genome", host="genome-mysql.cse.ucsc.edu", db="hg19") # connect to one specific database (hg19)
        • allTables <-dbListTables(hg)
        • length(allTables) # total number of tables
        • allTables[1:5]
        • dbListFields(hg, "tablename") # list all fields of a table
        • dbGetQuery(hg, "select count(*) from tablename")
        • dbReadTable(hg, "tablename")
        • query <- dbSendQuery(hg, "select * from affyU133Plus2 where misMatches between 1 and 3"); affyMis <- fetch(query); quantile(affyMis$misMatches); dbClearResult(query) # fetch(query, n=10): get only the first 10 rows
        • dbDisconnect(hg) # close the connection to MySQL server
      • list of commands: http://www.r-bloggers.com/mysql-and-r/
    • Reading data from HDF5
      • source("http://bioconductor.org/biocLite.R")
      • biocLite("rhdf5")
      • library(rhdf5)
      • h5createFile("filename1.h5")
      • h5createGroup("filename1.h5", "groupname1"); h5createGroup("filename1.h5", "groupname1/subgroupname"); h5ls("filename1.h5") # list what is inside the file
      • h5write(dat, "filename1.h5", "groupname1") # dat: a hypothetical name for the R object (matrix, data frame, …) to write
      • h5write(dat, "filename1.h5", "groupname1", index=list(1:3,1)) # index=list(1:3,1): write to the first 3 rows, first column
      • h5read("filename1.h5", "groupname1")
    • Reading data from API-Application Programming Interfaces
      • library(httr)
      • myapp <- oauth_app("twitter", key="yourConsumerKeyHere", secret="yourConsumerSecretHere") # start the authorization process; sig <- sign_oauth1.0(myapp, token="yourTokenHere", token_secret="yourTokenSecretHere"); homeURL <- GET("https://api.twitter.com/1.1/statuses/home_timeline.json", sig) # get the JSON file
      • converting the JSON object: json1 <- content(homeURL); json2 <- jsonlite::fromJSON(toJSON(json1))
      • json2[1, 1:4]
    • Reading data from web application
      • con <- url("http://scholar.google.com/citations?…"); htmlCode <- readLines(con); close(con); htmlCode
      • parsing web data with XML
        • library(XML); url <- "http://…"; html <- htmlTreeParse(url, useInternal=TRUE); xpathSApply(html, "//title", xmlValue); xpathSApply(html, "//td[@id='col-citedby']", xmlValue)
      • GET data from the httr package: library(httr)
        • library(httr); html2 <- GET(url); content2 <- content(html2, as="text") # extract the content from the HTML page as one big text string; parsedHtml <- htmlParse(content2, asText=TRUE) # parse the text of content2 to get the parsed HTML; xpathSApply(parsedHtml, "//title", xmlValue)
        • Accessing websites with passwords: pg <- GET("http://…", authenticate("username","password"))
      • Using handles allows cookies to save authentication information across multiple accesses to a website
        • hlgoogle <- handle("http://google.com"); pg <- GET(handle=hlgoogle, path="/") # use the existing handle hlgoogle to access the same website; pg2 <- GET(handle=hlgoogle, path="search")
    • Subset and Sorting
      • set.seed(12345); X <- data.frame("var1"=sample(1:5), "var2"=sample(6:10), "var3"=sample(11:15)); X <- X[sample(1:5),]; X$var2[c(1,3)] <- NA # scramble the rows and add missing values
      • Logicals And(&)/Or(|): X[(X$var1 <= 3 & X$var3 > 11),]
      • Dealing with missing values: X[which(X$var2 > 8),] # which() returns the indices of the rows that match, dropping NAs
      • Sorting: sort(X$var1, decreasing=TRUE); sort(X$var2, na.last=TRUE)
      • Ordering: X[order(X$var1, X$var3),]
      • Ordering with plyr: library(plyr); arrange(X, var1); arrange(X, desc(var1))
      • Adding rows and columns: X$var4 <- rnorm(5) # var4 is the new column name; Y <- cbind(X, rnorm(5)) # bind on more columns
    • Summarizing Data
      • After loading the source data, look at the data set to check for anything odd; that is what summarizing the data is for
      • head(dataset, n=3); tail(dataset, n=4)
      • summary()
      • str()
      • table(dataset$variable, useNA="ifany")
      • check for missing values: sum(is.na(dataset)); any(is.na(dataset)) # returns TRUE/FALSE
      • check if a condition holds everywhere: all(dataset$col > 0) # returns TRUE/FALSE
      • Row and column sums: colSums(is.na(dataset)); all(colSums(is.na(dataset)) == 0)
      • Values with specific characteristics: table(dataset$col %in% c("1234", "2345")); dataset[dataset$col %in% c("1234", "2345"), ] # row subset
      • Cross tabs
        • xtabs(col1 ~ col2 + coln, data=dataset) # col1 values fill the cells; col2 and coln define the rows and columns
        • xtabs(breaks ~ ., …)
      • Size of a dataset
        • object.size(dataset)
        • print(object.size(dataset), units=”Mb”)
    • Creating New Variables
      • Creating sequences: sometimes you need an index for your data set; seq(1, 10, by=2) # min=1, max=10, step=2; returns 1, 3, 5, 7, 9; seq(1, 10, length=3) # returns exactly 3 values: 1, 5.5, 10; seq(along=x) # create an index vector with the length of vector x
      • Creating categorical variables: cut(dataset$var, breaks=quantile(dataset$var)) OR library(Hmisc); restData$zipGroups <- cut2(restData$ZipCode, g=4) # break the zipcode values into 4 different groups
      • Create a factor and revert the factor variable: yesnofac <- factor(var, levels=c("yes","no")) # create a factor variable; relevel(yesnofac, ref="yes"); as.numeric(yesnofac) # revert the factor variable to a normal numeric variable
      • Using the mutate function from the plyr package: library(Hmisc); library(plyr); restData2 <- mutate(restData, zipGroup=cut2(ZipCode, g=4)) # add the new variable zipGroup to the dataset restData
      • Common transforms
        • abs(x), sqrt(x)
        • ceiling(x): ceiling (3.475) is 4
        • floor(x): floor(3.475) is 3
        • round(x, digits=n)
        • signif(x, digits=n): signif(3.475, digits=2) is 3.5
        • cos(x), sin(x) etc.
        • log(x), log2(x), log10(x)…
        • exp(x)
    • Reshaping Data
      • library(reshape2)
      • head(mtcars)
      • Melting data frames: mtcars$carname <- rownames(mtcars) # carname is a new variable/column; carMelt <- melt(mtcars, id=c("carname","gear","cyl"), measure.vars=c("mpg","hp")) # id= names the id variables and melt() melts all the rest: the data set is reshaped to one row for every "mpg" value and one row for every "hp" value, becoming tall and skinny; the measured variables move from columns to rows
        head(carMelt,3)
                carname gear cyl variable value
        1     Mazda RX4    4   6      mpg  21.0
        2 Mazda RX4 Wag    4   6      mpg  21.0
        3    Datsun 710    4   4      mpg  22.8
      • Casting data frames: cast a molten data set into an array or data frame, i.e. reformat the data set into a different shape; cylData <- dcast(carMelt, cyl ~ variable) # recast the data set into a data frame, breaking down cylinders by the different variables ("mpg" or "hp"); values of the left-hand side (cyl) become the rows, values of the right-hand side (variable) become the columns
        cylData
          cyl mpg hp
        1   4  11 11 # for 4 cylinders we have 11 measures of mpg and 11 of horsepower
        2   6   7  7
        3   8  14 14
        • cylData <- dcast(carMelt, cyl ~ variable, mean) # take the mean within each cell instead of counting
        • the input data set need not be molten if the measured data is already in rows; melting a data set just brings measured data from columns to rows
      • Averaging values: tapply(var1, var2, sum)
        • tapply: apply a function over subsets defined by an index
        • within each value of var2, tapply() sums up var1's values
      • Another way to reshape the raw data: spIns <- split(var1, var2) # split var1 by each of var2's values; lapply(spIns, sum) # list apply: sum() the values of each element of the list produced by split()
        spIns <- split(InsectSprays$count, InsectSprays$spray)
        spIns
        $A
         [1] 10  7 20 14 14 12 10 23 17 20 14 13
        
        $B
         [1] 11 17 21 11 16 14 17 17 19 21  7 13
        
        $C
         [1] 0 1 7 2 3 1 2 1 3 0 1 4
        
        $D
         [1]  3  5 12  6  4  3  5  5  5  5  2  4
        sprCount <- lapply(spIns, sum)
        sprCount
        $A
        [1] 174
        
        $B
        [1] 184
        
        $C
        [1] 25
        
        $D
        [1] 59

        unlist() # unlist the list to get a vector back

        unlist(sprCount)
          A   B   C   D   E   F 
        174 184  25  59  42 200

        sapply(spIns, sum) # does both the apply and the combine steps of the list in one call

        sapply(spIns, sum)
          A   B   C   D   E   F 
        174 184  25  59  42 200
        • The plyr package provides a nice interface for doing this sort of action in one step: library(plyr); ddply(InsectSprays, .(spray), summarize, sum=sum(count)) # .(spray): the variable whose values are grouped into rows; you may melt() the data frame first to summarize by a list of columns
          spray sum
          1     A 174
          2     B 184
          3     C  25
          4     D  59
          5     E  42
          6     F 200
          names(InsectSprays)
          [1] "count" "spray"
    • Managing Data Frames with dplyr
      • library(dplyr)
      • chicago <- readRDS("chicago.rds")
      • dim(chicago)
      • select
        • head(select(chicago, 1:5))#select all columns from 1 to 5
        • names(chicago)[1:3] ## [1] "city" "tmpd" "dptp"
        • head(select(chicago, city:dptp)) # select all columns between the "city" and "dptp" columns
        • In dplyr you can do head(select(chicago, -(city:dptp))) # select all columns except those between "city" and "dptp". Equivalent base R: i <- match("city", names(chicago)); j <- match("dptp", names(chicago)); head(chicago[, -(i:j)])
      • filter
        • chic.f <- filter(chicago, pm25tmean2 > 30); head(select(chic.f, 1:3, pm25tmean2), 4)
          ##    city tmpd dptp pm25tmean2
          ## 1  chic   23 21.9      38.10
          ## 2  chic   28 25.8      33.95
          ## 3  chic   55 51.3      39.40
          ## 4  chic   59 53.7      35.40
        • chic.f <- filter(chicago, pm25tmean2 > 30 & tmpd > 80); head(select(chic.f, 1:3, pm25tmean2, tmpd), 4)
          ##    city tmpd dptp pm25tmean2
          ## 1  chic   81 71.2    39.6000
          ## 2  chic   81 70.4    31.5000
          ## 3  chic   82 72.2    32.3000
          ## 4  chic   84 72.9    43.7000
      • arrange: reordering rows of a data frame (while preserving corresponding order of other columns) is normally a pain to do in R
        • chicago <- arrange(chicago, date); head(select(chicago, date, pm25tmean2), 3)
          ##         date pm25tmean2
          ## 1 1987-01-01         NA
          ## 2 1987-01-02         NA
          ## 3 1987-01-03         NA
        • columns can be arranged in descending order too: chicago <- arrange(chicago, desc(date)); head(select(chicago, date, pm25tmean2), 3)
      • rename: renaming a variable in a data frame in R is surprisingly hard to do otherwise
        • head(chicago[,1:5], 3)
          ##   city tmpd dptp       date pm25tmean2
          ## 1 chic   35 30.1 2005-12-31   15.00000
          ## 2 chic   36 31.0 2005-12-30   15.05714
          ## 3 chic   35 29.4 2005-12-29    7.45000
        • chicago <- rename(chicago, dewpoint = dptp, pm25 = pm25tmean2); head(chicago[,1:5], 3)
          ##   city tmpd dewpoint       date     pm25
          ## 1 chic   35     30.1 2005-12-31 15.00000
          ## 2 chic   36     31.0 2005-12-30 15.05714
          ## 3 chic   35     29.4 2005-12-29  7.45000
      • mutate: create a new variable or transform a variable
        • chicago <- mutate(chicago, pm25detrend = pm25 - mean(pm25, na.rm=TRUE)) # new column pm25detrend: pm25 with its mean subtracted
        • head(select(chicago, pm25, pm25detrend))
          ##       pm25 pm25detrend
          ## 1 15.00000   -1.230958
          ## 2 15.05714   -1.173815
          ## 3  7.45000   -8.780958
          ## 4 17.75000    1.519042
          ## 5 23.56000    7.329042
          ## 6  8.40000   -7.830958
      • group_by: generating summary statistics by stratum
        • chicago <- mutate(chicago, tempcat = factor(1 * (tmpd > 80), labels=c("cold", "hot")))
        • hotcold <- group_by(chicago, tempcat)
        • summarize(hotcold, pm25=mean(pm25, na.rm=TRUE), o3=max(o3tmean2), no2=median(no2tmean2)) # the mean pm2.5, maximum o3 and median no2 for both hot and cold days
        • chicago <- mutate(chicago, year=as.POSIXlt(date)$year + 1900); years <- group_by(chicago, year); summarize(years, pm25=mean(pm25, na.rm=TRUE), o3=max(o3tmean2, na.rm=TRUE), no2=median(no2tmean2, na.rm=TRUE)) # summarize/group the data by year
          ## Source: local data frame [19 x 4]
          ##
          ##    year     pm25       o3      no2
          ## 1  1987      NaN 62.96966 23.49369
          ## 2  1988      NaN 61.67708 24.52296
          ## 3  1989      NaN 59.72727 26.14062
          ## 4  1990      NaN 52.22917 22.59583
          ## 5  1991      NaN 63.10417 21.38194
      • special operator: the pipeline operator (%>%) allows you to chain different operations together; it lets you feed a data set through a pipeline of operations to create a new data set
        • chicago %>% mutate(month=as.POSIXlt(date)$mon + 1) %>% group_by(month) %>% summarize(pm25=…) # mutate() creates a month variable; group_by() takes the output of mutate() and groups it by that month variable; summarize() takes the output of group_by() and summarizes it
        • when using the pipeline operator, you do not have to specify the data frame as the first argument, because that is kind of implied by the use of the pipeline operator
    • Merging Data: merge more datasets together by matching those datasets based on an ID
      • merge() function: merging data
        • merging data frames
        • important parameters: x, y, by, by.x, by.y, all; mergedData <- merge(reviews, solutions, by.x="solution_id", by.y="id", all=TRUE) # all=TRUE keeps unmatched rows from both data sets (a full outer join)
          head(mergedData,3)
            solution_id id reviewer_id    start.x     stop.x time_left.x accept problem_id subject_id    start.y     stop.y time_left.y answer
          1           1  4          26 1304095267 1304095423        2089      1        156         29 1304095119 1304095169        2343      B
          2           2  6          29 1304095471 1304095513        1999      1        269         25 1304095119 1304095183        2329      C
          3           3  1          27 1304095698 1304095758        1754      1         34         22 1304095127 1304095146        2366      C
          names(reviews)
          [1] "id"          "solution_id" "reviewer_id" "start"       "stop"       
          [6] "time_left"   "accept"
          names(solutions)
          [1] "id"         "problem_id" "subject_id" "start"      "stop"       "time_left" 
          [7] "answer"
      • merge all common column names
        • get all common column names between 2 data sets: intersect() function
          intersect(names(solutions), names(reviews))
          [1] "id"        "start"     "stop"      "time_left"
        • merge 2 datasets based on all their common column names (above): mergedData2 <- merge(reviews, solutions, all=TRUE)
          head(mergedData2)
            id      start       stop time_left solution_id reviewer_id accept problem_id subject_id answer
          1  1 1304095119 1304095169      2343          NA          NA     NA        156         29      B
          2  1 1304095698 1304095758      1754           3          27      1         NA         NA   
          3  2 1304095119 1304095183      2329          NA          NA     NA        269         25      C
          4  2 1304095188 1304095206      2306           4          22      1         NA         NA   
          5  3 1304095127 1304095146      2366          NA          NA     NA         34         22      C
          6  3 1304095276 1304095320      2192           5          28      1         NA         NA
    • Using join in the plyr package: faster, but less full featured – defaults to left join
      • df1 <-data.frame(id=sample(1:10), x=rnorm(10))
      • df2 <-data.frame(id=sample(1:10), y=rnorm(10))
      • arrange(join(df1, df2), id) #2 datasets must share/have common column names
        Joining by: id
           id           x           y
        1   1  0.61423108  1.57138208
        2   2 -2.23638174 -0.10411887
        3   3 -0.73054521  0.95628285
        4   4 -0.04250583 -0.08918296
        5   5  2.55319021  0.30713449
        6   6  1.71105730 -0.73299488
        7   7 -0.43878496  0.86411208
        8   8 -0.30357070  0.08990658
        9   9 -0.45801538  0.40962146
        10 10 -3.48889747 -0.09042853
      • df1 <-data.frame(id=sample(1:11), x=rnorm(11))
      • df2 <-data.frame(id=sample(1:10), y=rnorm(10))
        arrange(join(df1, df2), id)
        Joining by: id
           id           x           y
        1   1 -0.27103830  1.57138208
        2   2  0.01674570 -0.10411887
        3   3 -0.34927468  0.95628285
        4   4  0.08840833 -0.08918296
        5   5 -1.22798708  0.30713449
        6   6 -2.80160067 -0.73299488
        7   7 -0.16905970  0.86411208
        8   8  1.68916602  0.08990658
        9   9 -0.21505776  0.40962146
        10 10  0.42874951 -0.09042853
        11 11  1.10748133          NA
    • Joining/merging multiple data frames with the plyr package: the join_all() function; df1 <- data.frame(id=sample(1:10), x=rnorm(10)); df2 <- data.frame(id=sample(1:10), y=rnorm(10)); df3 <- data.frame(id=sample(1:10), z=rnorm(10)); dfList <- list(df1, df2, df3) # create a list with 3 data frames in it
      join_all(dfList)
      Joining by: id
      Joining by: id
         id           x            y           z
      1   4 -0.03056143 -0.559789649  1.87125897
      2   6 -1.30431162  1.010476723  1.10769169
      3   1 -0.80145591  0.002993755  0.67703283
      4   3 -2.04931140 -0.158624892  0.34393202
      5   2 -1.58799207  1.639577687 -0.17712989
      6   8  0.57319822 -2.169699642  0.88747090
      7   5 -0.82071094 -0.442751158  0.06112546
      8   7 -0.72013911 -2.105903250  0.61511933
      9  10  1.72631859  1.550345761 -0.53992913
      10  9 -1.82238933  0.461132955  0.26143661
    • Editing Text Variables: a very common data cleaning step, since text variables often arrive in a messy format
      • tolower(): convert all letters to lower case
      • toupper(): convert all letters to upper case
      • strsplit() function
        • the 6th column name of the data set ds is "Location.1"
        • splitNames <- strsplit(names(ds), "\\.") # split on a dot; only names containing a dot are split
        • splitNames[[6]] # "Location" "1"
        • splitNames[[6]][1] # "Location"
      • sub(): replace the first instance of a pattern in each element of a vector
        • sub(pattern, replacement, x-vector)
      • gsub(): replace all instances of a particular character
        • gsub("_", "", "this_is_a_test") # returns "thisisatest"
      • Finding values with grep(), grepl()
        • find all positions of the variable col1 that include the text "Alameda": grep("Alameda", ds$col1) # returns [1] 4 5 36: the positions of the matching elements/rows
        • get the values of the matching elements: grep("Alameda", ds$col1, value=TRUE) [1] "The Alameda & 33rd St" "E 33rd & The Alameda" "Harford \n & The Alameda"
        • table(grepl("Alameda", ds$col1))
        • subset: ds[grepl("Alameda", ds$col1),]
      • More useful string functions
        • library(stringr)
        • nchar()
        • substr("Jeffrey Leek", 1, 7) # "Jeffrey"
        • paste("Jeffrey", "Leek") # "Jeffrey Leek"
        • paste0("Jeffrey", "Leek") # "JeffreyLeek"
        • str_trim("Jeff   ") # "Jeff"
    • Regular Expressions: to search for a bit of text that fits a broader pattern
      • Regular expressions can be thought of as a combination of literals and metacharacters
      • To draw an analogy with natural language, think of literal text forming the words of this language, and the metacharacters defining its grammar
      • Regular expressions have a rich set of metacharacters
      • Literals
        • literals contain words that are matched exactly
        • the simplest regular expression
        • We need a way to express
          • whitespace and word boundaries
          • sets of literals
          • the beginning (^…) and end (…$) of a line
          • combinations, e.g. ^[Ii] am
        • character classes with []:
          • [list of accepted characters]: [Bb]
          • [a range (-) of letters or numbers]: [0-9]; [a-z]; [a-zA-Z]
          • ^ inside [] means "not": [^?.]$ finds lines that do not end with a question mark or a period
      • metacharacters: we can think of it as the grammar
        • "." is used to refer to any character, so 9.11 will match lines containing "9-11 rules", "9/11 we", …
        • "|" does not mean "pipe" in the context of regular expressions; instead it translates to "or"; we can use it to combine two expressions, the subexpressions being called alternatives
          • food|fire
          • you can include as many alternatives as you want: flood|coldfire|earthquake…
          • the alternatives can be real expressions and not just literals: ^[Gg]ood|[Bb]ad
        • "?" indicates that the preceding expression is optional: ()?
          • [Gg]eorge( [Ww]\.)? [Bb]ush matches lines containing "george bush combined", "George W. Bush claimed", "two george bushes"
        • "\." means: do not treat the dot as a metacharacter; treat it as a literal dot
        • back-references \1, \2 refer to the text matched by the first/second parenthesized subexpression: " +([a-zA-Z]+) +\1 +" finds a word followed by the exact same word one more time
        • the * and + signs are metacharacters used to indicate repetition; * means "any number, including none, of the item", and + means "at least one of the item"
          • (.*) will match any characters repeated any number of times between the parentheses
        • * is "greedy", so it always matches the longest possible string that satisfies the regular expression
          • ^s(.*)s matches from the first s to the last s on a line
          • ? can turn off the greediness of *: ^s(.*?)s$
        • {} are referred to as interval quantifiers; they let us specify the minimum and maximum number of matches of an expression
          • [Bb]ush( +[^ ]+ +){1,5} debate: find the parenthesized expression (a word surrounded by spaces) between 1 and 5 times (a small grepl() sketch follows this list)
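      A small sketch of these patterns with grepl(); the example strings are made up:
        x <- c("I am fine", "9/11 report", "it was night night here", "good game")
        grepl("^[Ii] am", x)              # a literal anchored at the beginning of the line
        grepl("9.11", x)                  # "." matches any character
        grepl("^[Gg]ood|[Bb]ad", x)       # alternatives
        grepl(" +([a-zA-Z]+) +\\1 +", x)  # back-reference: a repeated word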
    • Working with Dates
      • date() # current date/time; character class: [1] "Sat May 30 05:31:31 2015"
      • Sys.Date() # current date; Date class: [1] "2015-05-30"
      • Formatting dates
        • %d: day as number (01-31)
        • %a: abbreviated weekday
        • %A: unabbreviated weekday
        • %m: month (01-12)
        • %b: abbreviated month
        • %B: unabbreviated month
        • %y: 2 digit year
        • %Y: 4 digit year
        • format(Sys.Date(), "%a %b %d")
          [1] "Sat May 30"
        • > x <- c("1jan1980", "2jan1980", "31mar2010")
          > z <- as.Date(x, "%d%b%Y")
          > z
          [1] "1980-01-01" "1980-01-02" "2010-03-31"
        • > z[1]-z[2]
          Time difference of -1 days
          > as.numeric(z[1]-z[2])
          [1] -1
      • Convert date to Julian
        • weekdays()
        • months()
        • julian(date)
      • library(lubridate)
        • ymd(), mdy(), dmy(): the input string must match the date-part order implied by each function's name
        • Dealing with time
          • > ymd_hms("2015-05-30 10:05:03")
            [1] "2015-05-30 10:05:03 UTC"
          • ?Sys.timezone
            ymd_hms("2015-05-30 10:05:03", tz="Pacific/Auckland")
            [1] "2015-05-30 10:05:03 NZST"
        • Some functions may have slightly different syntax
          • > x <- dmy(c("1jan2013", "2jan2013"))
            > wday(x[1])
            [1] 3
          • wday(x[1], label=T)
            [1] Tues
            Levels: Sun < Mon < Tues < Wed < Thurs < Fri < Sat
      • Ultimately you want your dates and times as class "Date" or classes "POSIXct"/"POSIXlt"; for more information type ?POSIXlt
  6. Distributing data
    • Data stored in a database, in .csv files, or in tab-separated files (.tab/.txt) are easiest to distribute (a small sketch follows)
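    A minimal sketch, assuming a tidy data frame tidyData (a hypothetical name):
      write.csv(tidyData, "tidy_data.csv", row.names = FALSE)
      write.table(tidyData, "tidy_data.txt", sep = "\t", row.names = FALSE)  # tab-separated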
  7. Scripting for data cleaning
    • Names of variables should be
      • All lower case when possible
      • Descriptive (Diagnosis versus Dx)
      • Not duplicated
      • Free of underscores, dots, and white spaces
    • Variables with character values
      • Should usually be made into factor variables (depends on application)
      • Should be descriptive (use TRUE/FALSE instead of 0/1 and Male/Female instead of 0/1 or M/F); a small sketch follows
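    A small sketch applying these rules to a hypothetical data frame df:
      names(df) <- tolower(names(df))            # all lower case
      names(df) <- gsub("[._ ]", "", names(df))  # drop underscores, dots, white spaces
      df$diagnosis <- factor(df$diagnosis, levels=c("yes", "no"))  # character values -> factor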