# Data Analysis and Statistical Inference

1. Steps in the data analysis process

2. Observational Studies and Experiments

- Observational studies: researchers only observe, so they can establish an association (correlation) but not causation.
  - Retrospective: uses data from the past
  - Prospective: collects data as the study unfolds

- Experiments: support causal inference (causation).
- Most experiments use RANDOM ASSIGNMENT, while observational studies do not.

2.a>Observational studies

2.b>Principles of experimental design (4)

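- Control: compare the treatment of interest to a control group
- Randomize: randomly assign subjects to treatments
- Replicate: collect a sufficiently large sample, or replicate the entire study
- Block: group units by variables known or suspected to affect the response, then randomize within each block
- Blocking and explanatory variables (factors)
  - Explanatory variables (factors): conditions we can impose on the experimental units
  - Blocking variables: characteristics that the experimental units come with and that we would like to control for
  - Blocking is like stratifying:
    - blocking during random assignment
    - stratifying during random sampling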
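
Blocking during random assignment can be sketched in Python as follows. This is only a minimal illustration under stated assumptions: the function name `blocked_assignment`, the subject records, and the blocking variable (exercise habit) are all hypothetical and not part of the original notes.

```python
import random
from collections import defaultdict

def blocked_assignment(units, block_of, treatments=("treatment", "control"), seed=0):
    """Randomly assign units to treatments separately within each block."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    blocks = defaultdict(list)
    for unit in units:
        blocks[block_of(unit)].append(unit)

    assignment = {}
    for members in blocks.values():
        rng.shuffle(members)  # the random step happens inside each block
        # Deal the shuffled members round-robin so every block is split evenly.
        for i, unit in enumerate(members):
            assignment[unit] = treatments[i % len(treatments)]
    return assignment

# Hypothetical subjects: (id, exercise habit). The habit is the blocking variable.
subjects = [(i, "active" if i % 2 == 0 else "sedentary") for i in range(12)]
assignment = blocked_assignment(subjects, block_of=lambda s: s[1])

for subject, group in sorted(assignment.items()):
    print(subject, group)
```

Because the shuffle is done per block, each habit group contributes equally to the treatment and control groups, so the blocking variable cannot differ between the two.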


**Confounding variables:**

3. A few sources of sampling bias (each makes the sample unrepresentative of the whole population)

- Convenience sample: individuals who are easily accessible, such as neighbors or colleagues, are more likely to be included in the sample.
- Non-response: there is an initial random sample, but only a non-random fraction of the randomly sampled people respond to the survey. Example: people of lower socio-economic status are less likely to respond.
- Voluntary response: the sample includes ONLY people who volunteer to respond because they have strong opinions on the issue. There is no initial random sample here.

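4. SAMPLING METHODS

- Simple random sampling (SRS): randomly select cases from the population such that each case is equally likely to be selected.
- Stratified sampling: divide the population into homogeneous groups (strata), then randomly sample from within each stratum.
- Cluster sampling: divide the population into clusters, randomly select a few clusters, then take all cases within the selected clusters (or, in multistage sampling, a random sample within each selected cluster).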
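
The three sampling methods can be compared in a short Python sketch. Everything here is an assumption made for illustration: the population of 200 records, the `region` field used as the stratum, the `school` field used as the cluster, and the function names. The cluster version shown is the one-stage form, which keeps every case in each selected cluster.

```python
import random

def simple_random_sample(population, n, rng):
    """Every case has the same chance of being selected."""
    return rng.sample(population, n)

def stratified_sample(population, stratum_of, n_per_stratum, rng):
    """Group cases into strata, then draw a random sample inside each stratum."""
    strata = {}
    for case in population:
        strata.setdefault(stratum_of(case), []).append(case)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, n_per_stratum))
    return sample

def cluster_sample(population, cluster_of, n_clusters, rng):
    """Randomly pick whole clusters, then keep every case in the chosen ones."""
    clusters = {}
    for case in population:
        clusters.setdefault(cluster_of(case), []).append(case)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [case for name in chosen for case in clusters[name]]

# Hypothetical population: 200 people in 4 regions and 10 schools.
population = [
    {"id": i, "region": f"R{i % 4}", "school": f"S{i % 10}"} for i in range(200)
]
rng = random.Random(42)

srs = simple_random_sample(population, 20, rng)
stratified = stratified_sample(population, lambda c: c["region"], 5, rng)
clustered = cluster_sample(population, lambda c: c["school"], 3, rng)

print(len(srs), len(stratified), len(clustered))
```

Stratified sampling guarantees that every region appears in the sample, while cluster sampling only ever visits the few schools that were drawn.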

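5. Random Sampling and Random Assignment

- Distinguish random sampling from random assignment:
  - Random sampling: cases are randomly selected from the population, so the results can be generalized to that population.
  - Random assignment: subjects are randomly assigned to treatment groups, which allows causal conclusions.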

- Evaluate the relationship between variables
  - direction: positive / negative
  - shape: linear / curved
  - strength: strong / weak
  - outliers
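
For a roughly linear relationship, direction and strength can be summarized with the Pearson correlation coefficient. The sketch below is a hedged illustration: the data are invented, and the 0.7 cutoff between "strong" and "weak" is an arbitrary assumption, not a rule from these notes. The coefficient says nothing about shape or outliers, which still need a scatterplot.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)  # assumes neither variable is constant

def describe_relationship(xs, ys, strong_cutoff=0.7):
    """Return (direction, strength, r); strong_cutoff is an assumed threshold."""
    r = pearson_r(xs, ys)
    direction = "positive" if r > 0 else "negative" if r < 0 else "none"
    strength = "strong" if abs(r) >= strong_cutoff else "weak"
    return direction, strength, r

# Hypothetical data: hours studied and exam score.
hours = [1, 2, 3, 4, 5, 6]
scores = [52, 55, 61, 64, 70, 74]

direction, strength, r = describe_relationship(hours, scores)
print(direction, strength, round(r, 3))
```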

- Histogram
  - provides a view of the data density
  - Skewness: left skewed / symmetric / right skewed
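
A histogram is just a count of values per bin, and the direction of skew can often be guessed by comparing the mean with the median. The sketch below assumes invented income data and uses the mean-versus-median comparison as a rule of thumb only; it can fail for unusual distributions.

```python
from statistics import mean, median

def histogram_counts(values, bin_width, start):
    """Count values per bin; each key is the left edge of a bin."""
    counts = {}
    for v in values:
        left = start + bin_width * int((v - start) // bin_width)
        counts[left] = counts.get(left, 0) + 1
    return dict(sorted(counts.items()))

def skew_direction(values):
    """Rule of thumb: a long right tail pulls the mean above the median."""
    m, md = mean(values), median(values)
    if m > md:
        return "right skewed"
    if m < md:
        return "left skewed"
    return "symmetric"

# Hypothetical incomes in thousands; two large values form a right tail.
incomes = [20, 22, 25, 25, 27, 30, 31, 35, 60, 95]

counts = histogram_counts(incomes, bin_width=20, start=20)
for left, count in counts.items():
    print(f"{left:>3} to {left + 20:<3} {'#' * count}")
print(skew_direction(incomes))
```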

- Modality
  - unimodal
  - bimodal
  - uniform
  - multimodal

- Dotplot
- Box plot
  - Box plots do not display modality; histograms do.
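
The point about modality can be checked numerically. A box plot is drawn from the five-number summary, while a histogram is drawn from per-value (or per-bin) counts. In the sketch below the data set is invented to have two clear peaks; the quartiles come from `statistics.quantiles` with its default method, so other software may report slightly different quartile values.

```python
from collections import Counter
from statistics import quantiles

def five_number_summary(values):
    """Minimum, Q1, median, Q3, maximum: the numbers a box plot draws."""
    q1, q2, q3 = quantiles(values, n=4)
    return min(values), q1, q2, q3, max(values)

# Hypothetical bimodal data: one peak at 2 and another at 9.
data = [1, 2, 2, 2, 3, 8, 9, 9, 9, 10]

summary = five_number_summary(data)
counts = Counter(data)  # what a histogram with unit-width bins would show

print("five-number summary:", summary)
for value in range(min(data), max(data) + 1):
    print(f"{value:>2} {'#' * counts[value]}")
```

The counts show two peaks with an empty gap between them, while the five numbers alone give no hint of that gap; the median even falls in the region where there are no observations.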

- Intensity map:

http://www.tc3.edu/instruct/sbrown/swt/chap01.htm

### 1C2. Quantitative or Qualitative data?

**Definitions:** **Quantitative data** are data that are numbers. Quantitative data are also called **numeric data**.

Numeric data are subdivided into discrete and continuous data. **Discrete data** are whole numbers and typically answer the question “how many?” **Continuous data** can take on any value (or any value within a certain range) and typically answer the question “how much?”

**Qualitative data** are data that are not numbers. Qualitative data are also called **non-numeric data**, **attribute data** or **categorical data**.

**Common mistakes:**

- Just seeing numbers in a problem does not mean you have numeric data. Consider this statement: “45% of viewers polled said they thought Candidate X performed well in the debate.” There’s a number there, all right, but you have non-numeric data because each person answered “yes” or “no”, which means the individual data points are non-numeric.
- Some data look like numbers but aren’t: ZIP codes, for instance. When in doubt, ask yourself, “Would it make sense to average the data?” If the answer is no, you have non-numeric data.

Sometimes we talk about data types, and sometimes about variable types. **They’re the same thing.** For instance, “weight of a machine part” is a continuous variable, and 61.1 g, 61.4 g, 60.4 g, 61.0 g, and 60.7 g are continuous data.
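
The "would it make sense to average the data?" test can be tried out in Python. The machine-part weights are the ones quoted above; the ZIP codes are hypothetical values added for illustration. Storing ZIP codes as strings is a deliberate choice in this sketch, since it prevents anyone from averaging them by accident.

```python
from collections import Counter
from statistics import mean

# Continuous data from the text: weights of a machine part, in grams.
weights_g = [61.1, 61.4, 60.4, 61.0, 60.7]

# Hypothetical ZIP codes: they look like numbers but are really labels.
zip_codes = ["13053", "14850", "13053", "13045", "14850", "13053"]

# Averaging makes sense for the weights.
mean_weight = mean(weights_g)

# For the ZIP codes, the meaningful summary is counts or percentages per category.
zip_counts = Counter(zip_codes)
zip_shares = {z: 100 * c / len(zip_codes) for z, c in zip_counts.items()}

print(f"mean weight: {mean_weight:.2f} g")
for z, share in sorted(zip_shares.items()):
    print(f"{z}: {zip_counts[z]} ({share:.0f}%)")
```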

| Quantitative (numeric) | Qualitative (categorical or non-numeric) |
|---|---|
| You get a number from each member of the sample. | You get a yes/no or a category from each member of the sample. |
| The data have units (inches, pounds, dollars, IQ points, whatever) and can be sorted from low to high. | The data may or may not have units and do not have a definite sort order. |
| It makes sense to average the data. | Your summary is counts or percentages in each category. |
| Examples (discrete): number of children in a family, number of cigarettes smoked per day, age at last birthday. Examples (continuous): height, salary, exact age. | Examples: hair color, marital status, gender, country of birth, and opinion for or against a particular issue. |

**Continuous or discrete data?** Sometimes when you have numeric data it’s hard to say whether you have discrete or continuous data. But since you’ll graph them differently, it’s important to be clear on the distinction. Here are two examples of doubtful cases: salary and age.

It’s true that **your salary** can be only a whole number of pennies. But there are a great many possible values, and the distance between the possible values is quite small, so you call salary a continuous variable. Besides, you don’t ask “how many pennies do you make?” but rather “how much do you make?”

What about age? Well, **age at last birthday** is clearly discrete since it can be only a whole number: “how many years old were you at your last birthday?” But **age now**, including years and months and days and fractions of days, would be continuous, again because you can subdivide it as finely as desired.

### 1C3. Summary Statements

When you see a summary statement, you have to do a little mental detective work to figure out the data type. Always ask yourself, **what was the original measurement taken or question asked?**

**Example 14:** “The average salary at our corporation is $22,471.” The original measurement was the salary of each individual, so this is continuous data.

**Example 15:** “The average American family has 1.7 children.” Don’t let “1.7” fool you into identifying this as a continuous variable! What was the original question or measurement? “How many children are there in your family?” That’s discrete data.

**Example 16:** “Four out of five dentists surveyed recommend Trident sugarless gum for their patients who chew gum.” Yes, there are numbers in the summary statement, but the original question asked of each dentist was “Do you recommend Trident?” That is a yes/no question, so the data type is categorical.
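
Examples 15 and 16 can be reproduced with a few lines of Python, which shows how a fractional summary arises from non-continuous raw data. The individual responses below are invented so that they reproduce the quoted summaries; they are not real survey data.

```python
from statistics import mean

# Hypothetical answers to "How many children are there in your family?"
children = [0, 1, 1, 2, 2, 2, 3, 1, 3, 2]

# Hypothetical answers to "Do you recommend Trident?"
dentists = ["yes", "yes", "yes", "yes", "no"]

# Every raw value is a whole number, yet the summary is fractional.
average_children = mean(children)

# A yes/no question is summarized by a count or proportion, not an average.
yes_count = dentists.count("yes")
yes_share = yes_count / len(dentists)

print(f"average children per family: {average_children}")
print(f"{yes_count} out of {len(dentists)} dentists recommend it ({yes_share:.0%})")
```

The data type is decided by the raw lists, not by the printed summaries: `children` holds discrete counts and `dentists` holds categories.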