Data Analysis and Statistical Inference

March 10, 2015

1. Steps in the data analysis process

  • Identifying the type of variable you are working with is always the first step: numerical (quantitative) data, which is either continuous or discrete, OR categorical data, which is either ordinal (has an inherent ordering) or not
  • Next, look for relationships between variables (columns): associated / dependent (positively or negatively) OR independent / not associated
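As a minimal sketch of the second step for two numerical variables, the direction of a linear association can be read off the sign of the Pearson correlation coefficient. The function names, the `threshold` cutoff, and the data below are made up for illustration. A coefficient near zero only rules out a linear association; it does not prove independence.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def describe_association(xs, ys, threshold=0.1):
    """Label the direction of the linear association.

    `threshold` is an arbitrary illustrative cutoff, not a standard value.
    """
    r = pearson_r(xs, ys)
    if r > threshold:
        return "positive"
    if r < -threshold:
        return "negative"
    return "no clear linear association"

# Made-up illustrative data
hours_studied = [1, 2, 3, 4, 5]
exam_score = [52, 58, 61, 70, 74]
print(describe_association(hours_studied, exam_score))  # positive
```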


2. Observational Studies and Experiments

  • Correlation does not imply causation
  • Observational studies: infer correlation only. Researchers observe without intervening, so they can establish an association / correlation between the explanatory variables (~causes) and the response variables (~effects), but not causation.
    • Retrospective: use past data
    • Prospective: collect data during the study
  • Experiments: support causal inference. Researchers RANDOMLY assign subjects to various treatments/groups => can establish a causal connection between the explanatory and the response variables
  • Most experiments use RANDOM ASSIGNMENT, while observational studies do not

2.a>Observational studies

2.b>Principles of experimental design: 4

  • Control: compare the treatment group(s) of interest to a control group
  • Randomize: randomly ASSIGN subjects to treatments
  • Replicate: collect a sufficiently large sample within a study, or replicate the entire study
  • Block: account for variables that are known or suspected to affect the response variable / outcome (e.g. professional vs. amateur status of players). First group the subjects into blocks based on these variables, then randomize cases within each block to treatment groups
    • Blocking and explanatory variables (factors)
      • Explanatory variables (factors): conditions we can impose on our experimental units
      • Blocking variables: characteristics that the experimental units come with, which we would like to control for
      • Blocking is basically like stratifying, except
        • blocking happens during random assignment
        • stratifying happens during random sampling
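The blocking principle above can be sketched as follows: group the subjects by the blocking variable, then randomize to treatments inside each block. This is only an illustration under stated assumptions; the function name `blocked_assignment`, the player list, and the two group labels are hypothetical, and the seed is fixed only to make the run reproducible.

```python
import random
from collections import defaultdict

def blocked_assignment(subjects, block_of, treatments, seed=None):
    """Randomly assign subjects to treatments separately within each block.

    subjects   -- list of subject records
    block_of   -- function mapping a subject to its block label
    treatments -- list of treatment names
    """
    rng = random.Random(seed)

    # Step 1: group subjects into blocks by the blocking variable.
    blocks = defaultdict(list)
    for subject in subjects:
        blocks[block_of(subject)].append(subject)

    # Step 2: randomize within each block.
    assignment = {}
    for members in blocks.values():
        rng.shuffle(members)
        for i, subject in enumerate(members):
            # Deal the shuffled subjects out in rotation so every treatment
            # gets a near-equal share of each block.
            assignment[subject] = treatments[i % len(treatments)]
    return assignment

# Hypothetical subjects: (name, professional/amateur status)
players = [("p1", "pro"), ("p2", "pro"), ("p3", "pro"), ("p4", "pro"),
           ("a1", "amateur"), ("a2", "amateur"), ("a3", "amateur"), ("a4", "amateur")]

result = blocked_assignment(players, block_of=lambda p: p[1],
                            treatments=["treatment", "control"], seed=0)
for player, group in sorted(result.items()):
    print(player, group)
```

Because randomization happens inside each block, both the professional and the amateur block end up split evenly between the two groups, so status cannot be confused with the treatment effect.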

Confounding variables: extraneous variables that affect both the explanatory and the response variables.

3. A few sources of sampling bias => the sample is not representative of the whole population

  • Convenience sample: individuals who are easily accessible (neighbors, colleagues) are more likely to be included in the sample
  • Non-response: there is an initial random sample, but only a (non-random) fraction of the RANDOMLY sampled people respond to the survey. Ex: people of lower socio-economic status are less likely to respond to the survey.
  • Voluntary response: the sample ONLY includes people who volunteer to respond because they have strong opinions on the issue. (There is no initial random sample here.)


4. Sampling methods

  • Simple random sampling (SRS): we randomly select cases from the population, such that each case is equally likely to be selected
  • Stratified sampling: we first divide the population into homogeneous groups called strata, then we RANDOMLY SAMPLE from within EVERY stratum. Ex: female and male groups => randomly sample within each group
  • Cluster sampling: we divide the population into clusters, RANDOMLY SAMPLE a few clusters, then randomly sample from within these clusters. A cluster is not necessarily a homogeneous group, but each cluster is similar to the others => sampling just a few of the clusters is enough. Ex: divide a city into many geographic regions that are similar to each other
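The three sampling methods above can be sketched with the standard `random` module. The helper names, the synthetic population of `(id, sex, region)` tuples, and the sample sizes are assumptions made up for this illustration.

```python
import random

def simple_random_sample(population, n, rng):
    """SRS: every case is equally likely to be selected."""
    return rng.sample(population, n)

def stratified_sample(population, stratum_of, n_per_stratum, rng):
    """Sample randomly from within EVERY stratum."""
    strata = {}
    for unit in population:
        strata.setdefault(stratum_of(unit), []).append(unit)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, n_per_stratum))
    return sample

def cluster_sample(population, cluster_of, n_clusters, n_per_cluster, rng):
    """Randomly pick a FEW clusters, then sample randomly within them."""
    clusters = {}
    for unit in population:
        clusters.setdefault(cluster_of(unit), []).append(unit)
    chosen = rng.sample(sorted(clusters), n_clusters)
    sample = []
    for label in chosen:
        sample.extend(rng.sample(clusters[label], n_per_cluster))
    return sample

rng = random.Random(42)  # seeded only to make the sketch reproducible

# Synthetic population: (id, sex, region), with 5 similar regions
population = [(i, "female" if i % 2 == 0 else "male", i % 5) for i in range(100)]

srs = simple_random_sample(population, 10, rng)
strat = stratified_sample(population, lambda u: u[1], 5, rng)
clus = cluster_sample(population, lambda u: u[2], 2, 5, rng)

print(len(srs), len(strat), len(clus))  # 10 10 10
```

The stratified sample always contains both sexes, while the cluster sample only ever touches two of the five regions. That is the practical difference between sampling within all strata and sampling within a few clusters.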

5. Random Sampling AND Random Assignment

  • Distinguish random sampling from random assignment
    • Random sampling: occurs when subjects are being selected for a study. If each subject in the population is equally likely to be selected, the resulting sample is likely to be representative of the population.
    • Random assignment: occurs only in experimental settings, where subjects are being assigned to various treatments. Through random assignment, we ensure that subjects' different characteristics are represented equally in the treatment and control groups. This allows us to attribute any observed difference between the treatment and the control group to the treatment.
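A short sketch of the distinction, assuming a hypothetical population of 1000 labelled people and a study of 20 subjects: sampling decides who enters the study, and assignment decides which group each sampled subject lands in.

```python
import random

rng = random.Random(7)  # seeded only so the sketch is reproducible

# Hypothetical population
population = [f"person_{i}" for i in range(1000)]

# Random sampling: decides WHO is in the study.
# This is what supports generalizing the results to the population.
sample = rng.sample(population, 20)

# Random assignment: decides WHICH GROUP each sampled subject is in.
# This is what supports causal conclusions about the treatment.
shuffled = sample[:]
rng.shuffle(shuffled)
treatment_group = shuffled[:10]
control_group = shuffled[10:]

print(len(treatment_group), len(control_group))  # 10 10
```

A study can use either step without the other: an observational study may sample randomly but assign nothing, and an experiment on volunteers may assign randomly without having sampled randomly.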

6. Visualization methods


  • Evaluate the relationship between variables
    • the direction: positive / negative
    • the shape: linear / curved
    • the strength: strong / weak
    • the outliers
  • Histogram
    • provides a view of data density
    • Skewness: a distribution is said to be skewed toward the side of its long tail: left skewed / symmetric / right skewed
    • Modality
      • unimodal
      • bimodal
      • uniform
      • multimodal
  • Dotplot
  • Box plot
    • Box plots do not display modality; histograms do
    • To determine the skewness of a distribution from a box plot, imagine what the histogram would look like. The peak of the distribution will be roughly around the median, and the tails will extend out along the whiskers of the box plot.
  • Intensity map
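Reading skewness off a box plot can also be approximated numerically. The sketch below computes the five-number summary that a box plot draws and compares the two halves of the box. This quartile-based measure (often called Bowley skewness) is a heuristic brought in for illustration, not something from the notes above, and the `tolerance` cutoff and the data are made up.

```python
import statistics

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max), the values a box plot displays."""
    q1, median, q3 = statistics.quantiles(data, n=4)
    return min(data), q1, median, q3, max(data)

def quartile_skew(data):
    """Quartile (Bowley) skewness: compares the upper and lower half of the box."""
    _, q1, median, q3, _ = five_number_summary(data)
    if q3 == q1:
        return 0.0  # degenerate box; treat as no detectable skew
    return ((q3 - median) - (median - q1)) / (q3 - q1)

def skew_label(data, tolerance=0.1):
    """`tolerance` is an arbitrary illustrative cutoff."""
    s = quartile_skew(data)
    if s > tolerance:
        return "right skewed"
    if s < -tolerance:
        return "left skewed"
    return "roughly symmetric"

# Made-up data with a long right tail
right_skewed = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9, 15]
print(five_number_summary(right_skewed))
print(skew_label(right_skewed))  # right skewed
```

Like the box plot itself, this summary says nothing about modality: a bimodal and a unimodal distribution can share the same five numbers.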

1C2.  Quantitative or Qualitative data?

Definitions: Quantitative data are data that are numbers. Quantitative data are also called numeric data.

Numeric data are subdivided into discrete and continuous data. Discrete data are whole numbers and typically answer the question “how many?” Continuous data can take on any value (or any value within a certain range) and typically answer the question “how much?”

Qualitative data are data that are not numbers. Qualitative data are also called non-numeric data, attribute data or categorical data.


  • Just seeing numbers in a problem does not mean you have numeric data. Consider this statement: “45% of viewers polled said they thought Candidate X performed well in the debate.” There’s a number there, all right, but you have non-numeric data because each person answered “yes” or “no”, which means the individual data points are non-numeric.
  • Some data look like numbers but aren’t: ZIP codes, for instance. When in doubt, ask yourself, “Would it make sense to average the data?” If the answer is no, you have non-numeric data.
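A small sketch of the two points above, using made-up poll responses and arbitrary example ZIP codes: the raw data points are categories, so the natural summary is a count or percentage per category, and ZIP codes are better held as text than as numbers to be averaged.

```python
from collections import Counter

def summarize_categorical(values):
    """Return {category: (count, percentage)} for non-numeric data."""
    counts = Counter(values)
    total = len(values)
    return {category: (count, 100 * count / total)
            for category, count in counts.items()}

# Made-up poll: each viewer answered "yes" or "no".
responses = ["yes"] * 9 + ["no"] * 11
summary = summarize_categorical(responses)
print(summary["yes"])  # (9, 45.0): the "45%" is a summary, not a data point

# ZIP codes look like numbers but are labels, so keep them as strings.
# Averaging them would be meaningless; counting them is fine.
zip_codes = ["02139", "10001", "02139", "94105"]
print(Counter(zip_codes).most_common(1))  # [('02139', 2)]
```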

Sometimes we talk about data types, and sometimes about variable types. They’re the same thing. For instance, “weight of a machine part” is a continuous variable, and 61.1 g, 61.4 g, 60.4 g, 61.0 g, and 60.7 g are continuous data.

Quantitative (numeric):
  • You get a number from each member of the sample.
  • The data have units (inches, pounds, dollars, IQ points, whatever) and can be sorted from low to high.
  • It makes sense to average the data.
  • Examples (discrete): number of children in a family, number of cigarettes smoked per day, age at last birthday
  • Examples (continuous): height, salary, exact age

Qualitative (categorical or non-numeric):
  • You get a yes/no or a category from each member of the sample.
  • The data may or may not have units and do not have a definite sort order.
  • Your summary is counts or percentages in each category.
  • Examples: hair color, marital status, gender, country of birth, and opinion for or against a particular issue

Continuous or discrete data? Sometimes when you have numeric data it’s hard to say whether you have discrete or continuous data. But since you’ll graph them differently, it’s important to be clear on the distinction. Here are two examples of doubtful cases: salary and age.

It’s true that your salary can be only a whole number of pennies. But there are a great many possible values, and the distance between the possible values is quite small, so you call salary a continuous variable. Besides, you don’t ask “how many pennies do you make?” but rather “how much do you make?”

What about age? Well, age at last birthday is clearly discrete since it can be only a whole number: “how many years old were you at your last birthday?” But age now, including years and months and days and fractions of days, would be continuous, again because you can subdivide it as finely as desired.

1C3.  Summary Statements

When you see a summary statement, you have to do a little mental detective work to figure out the data type. Always ask yourself, what was the original measurement taken or question asked?

Example 14: “The average salary at our corporation is $22,471.” The original measurement was the salary of each individual, so this is continuous data.

Example 15: “The average American family has 1.7 children.” Don’t let “1.7” fool you into identifying this as a continuous variable! What was the original question or measurement? “How many children are there in your family?” That’s discrete data.

Example 16: “Four out of five dentists surveyed recommend Trident sugarless gum for their patients who chew gum.” Yes, there are numbers in the summary statement, but the original question asked of each dentist was “Do you recommend Trident?” That is a yes/no question, so the data type is categorical.
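Examples 15 and 16 can be reproduced with made-up raw data to show why the summary number does not determine the data type. The lists below are invented so that the summaries come out to 1.7 and four out of five.

```python
# Example 15 (made-up data): each family reports a whole number of children.
children_per_family = [0, 1, 1, 2, 2, 2, 3, 1, 2, 3]
mean_children = sum(children_per_family) / len(children_per_family)

print(mean_children)  # 1.7
# The summary is fractional, but every raw value is a whole number: discrete data.
print(all(isinstance(c, int) for c in children_per_family))  # True

# Example 16 (made-up data): each dentist answers a yes/no question.
recommends = [True, True, True, True, False]
share_recommending = sum(recommends) / len(recommends)

print(share_recommending)  # 0.8, i.e. four out of five
# The summary is a number, but each raw answer is yes/no: categorical data.
```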

