# Statistical Formulas for Data Science

1. Introduction

Before evaluating data, state the hypotheses so that the result can be tested for statistical significance.

Statistical significance means:

• the null hypothesis was rejected
• results are not likely due to chance (sampling error)

A complete statistical report should include:

• Descriptive Statistics
  • M – Mean
  • Sd – Standard Deviation
• Inferential Statistics
  • Hypothesis test at significance level $\alpha$
    • kind of test, e.g. one-sample t-test
    • test statistic, e.g. t-value
    • df – degrees of freedom
    • p-value
    • direction of test, e.g. one-tailed or two-tailed test

• Confidence intervals
  • Confidence level, e.g. 95\%
  • Lower limit
  • Upper limit
  • CI on what?

• Effect size measures
  • $d$
  • $r^2$

2. Statistical Formulas

2.1. Empirical Mean

$\gamma = \frac{\sum{x_i}}{N}$

Properties (for independent random variables X and Y):

1. $Mean(X + Y) = Mean(X) + Mean(Y)$
2. $Mean(X \times Y) = Mean(X) \times Mean(Y)$

2.2. Variance

$\sigma^2 = Var(X)$
$\sigma^2 = \frac{\sum{(x_i - \gamma)^2}}{N}$
$\sigma^2 = \frac{\sum{x_i^2}}{N} - \frac{(\sum{x_i})^2}{N^2}$
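The two variance formulas above are algebraically equivalent. A minimal Python sketch that checks this on a small data set follows; the function names and the sample values are illustrative assumptions, not part of the original notes.

```python
def mean(values):
    """Empirical mean: sum of the values divided by their count N."""
    return sum(values) / len(values)


def variance_definition(values):
    """Population variance as the mean squared deviation from the mean."""
    m = mean(values)
    return sum((x - m) ** 2 for x in values) / len(values)


def variance_shortcut(values):
    """Population variance as mean of squares minus squared mean."""
    n = len(values)
    return sum(x * x for x in values) / n - (sum(values) ** 2) / n ** 2


# Hypothetical data chosen so the arithmetic is easy to follow by hand.
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(mean(data))                 # 5.0
print(variance_definition(data))  # 4.0
print(variance_shortcut(data))    # 4.0
```

Both functions divide by $N$, so they describe the population variance; a sample estimate would divide by $N - 1$ instead.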

2.3. Standard Deviation

$\sigma = \sqrt{Var(X)}$
$\sigma = \sqrt{\frac{\sum{(x_i - \gamma)^2}}{N}}$

2.4. Standard Error

$SE = \frac{\sigma}{\sqrt{N}}$

2.5. Z-Score

$Z = \frac{x - \bar{x}}{Sd}$
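As a rough illustration, the standard deviation, standard error and z-score can be computed as follows. This sketch assumes the standard error is the standard deviation divided by $\sqrt{N}$ and that $Sd$ in the z-score is the standard deviation; the helper names and data are hypothetical.

```python
import math


def std_dev(values):
    """Population standard deviation (divides by N)."""
    m = sum(values) / len(values)
    return math.sqrt(sum((x - m) ** 2 for x in values) / len(values))


def standard_error(values):
    """Standard error of the mean: standard deviation / sqrt(N)."""
    return std_dev(values) / math.sqrt(len(values))


def z_score(x, values):
    """Number of standard deviations that x lies from the mean."""
    m = sum(values) / len(values)
    return (x - m) / std_dev(values)


# Hypothetical data with mean 5 and standard deviation 2.
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(std_dev(data))         # 2.0
print(standard_error(data))  # 2 / sqrt(8), roughly 0.707
print(z_score(9, data))      # 2.0
```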

2.6. Normal Distribution

$f(x) = \frac{1}{\sqrt{2 \cdot \pi \cdot \sigma^2}} \cdot e^{-\frac{1}{2} \cdot \frac{(x-\gamma)^2}{\sigma^2}}$

The standard normal distribution is the special case $\gamma = 0$, $\sigma = 1$.
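The density formula can be checked numerically. The sketch below implements it directly and compares it with `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the function name `normal_pdf` and the test points are assumptions made for illustration.

```python
import math
from statistics import NormalDist


def normal_pdf(x, mean=0.0, sigma=1.0):
    """Normal density; the defaults give the standard normal distribution."""
    variance = sigma ** 2
    exponent = -0.5 * (x - mean) ** 2 / variance
    return math.exp(exponent) / math.sqrt(2 * math.pi * variance)


# Peak of the standard normal density, 1 / sqrt(2 * pi).
print(normal_pdf(0.0))  # about 0.3989

# Cross-check against the standard library at an arbitrary point.
print(normal_pdf(1.5, mean=1.0, sigma=2.0))
print(NormalDist(mu=1.0, sigma=2.0).pdf(1.5))
```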

2.7. Confidence Interval

For a proportion $p$ at the 95\% confidence level:
$CI = p \pm 1.96 \cdot \sqrt{\frac{p(1 - p)}{N}}$

General Form:
Margin of error (half the width of the CI) $= a \cdot \sqrt{\frac{\sigma^2}{N}}$
$CI = \frac{\sum{X_i}}{N} \pm a \cdot \sqrt{\frac{\sigma^2}{N}}$

$\textbf{Note}$

1. $a = 1.96$ for $N \geq 30$ at the 95\% confidence level.
2. For smaller samples, $a$ is the t-value for $(N - 1)$ degrees of freedom at the chosen confidence level.
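A minimal sketch of both interval forms, assuming the interval is centred on the estimate (the proportion or the mean) and that the multiplier $a$ is supplied by the caller. The function names and inputs are hypothetical; the value 2.365 is the two-tailed 95\% t-value for 7 degrees of freedom, read from a t-table.

```python
import math


def proportion_ci(p, n, a=1.96):
    """Confidence interval for a proportion p observed in n trials."""
    margin = a * math.sqrt(p * (1 - p) / n)
    return p - margin, p + margin


def mean_ci(values, a):
    """General form: mean +/- a * sqrt(variance / N)."""
    n = len(values)
    m = sum(values) / n
    variance = sum((x - m) ** 2 for x in values) / n
    margin = a * math.sqrt(variance / n)
    return m - margin, m + margin


# Hypothetical poll: 50% of 100 respondents, 95% confidence level.
print(proportion_ci(0.5, 100))  # roughly (0.402, 0.598)

# Small sample (N = 8), so a t-value for N - 1 = 7 df is used instead of 1.96.
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(mean_ci(data, a=2.365))   # roughly (3.33, 6.67)
```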

2.8. T-Test

$t = \frac{\bar{x}_D - 0}{s_D / \sqrt{n}}$

3. Lesson 10 – Dependent samples

Dependent samples (repeated measures)

• Two Conditions
• Longitudinal
• Pre-Test, Post-Test

3.1. Important formulas

$X = [\text{array of values}] \Rightarrow$ Population data
$x = [\text{array of values}] \Rightarrow$ Sample data
$n \Rightarrow$ Sample size
$\mu = \frac{\sum{X_i}}{N} \Rightarrow$ Population mean ($N$ is the population size)
$\bar{x} = \frac{\sum{x_i}}{n} \Rightarrow$ Sample mean
$s_D = \sqrt{\frac{\sum{(x_i - \bar{x})^2}}{n - 1}} \Rightarrow$ Standard deviation of the sample (with Bessel’s correction)
$\alpha \Rightarrow$ tail probability on t-table
$t_{critical} \Rightarrow$ from t-table
$df = n - 1 \Rightarrow$ Degrees of freedom
$SEM = \frac{s_D}{\sqrt{n}} \Rightarrow$ Standard error of the mean
$t = \frac{\bar{x} - \mu}{SEM} \Rightarrow$ One-sample t-test
margin of error $= t_{critical} \times SEM$
$CI = \bar{x} \pm$ margin of error $\Rightarrow$ Confidence interval
$d = \frac{\bar{x} - \mu}{s_D} \Rightarrow$ Cohen’s d $\Rightarrow$ Standardized mean difference
$r^2 = \frac{t^2}{t^2 + df} \Rightarrow$ Strength of the relationship between two variables, expressed as the proportion of the variation in one variable that is explained by the other.

Example: $r^2$ tells how much of the difference between two samples is attributable to a person’s gender.
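The formulas of this lesson can be combined into one small routine for a pre-test/post-test design. The following is a sketch under stated assumptions: the sample standard deviation uses Bessel's correction ($n - 1$), the null hypothesis is a mean difference of 0, and $t_{critical}$ is looked up by hand in a t-table (2.776 for $df = 4$, two-tailed, $\alpha = 0.05$). The function name and the scores are hypothetical.

```python
import math


def paired_t_test(before, after, t_critical):
    """Dependent-samples statistics computed on the differences after - before."""
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    mean_diff = sum(diffs) / n
    # Sample standard deviation of the differences (Bessel's correction).
    s_d = math.sqrt(sum((d - mean_diff) ** 2 for d in diffs) / (n - 1))
    sem = s_d / math.sqrt(n)
    df = n - 1
    t = (mean_diff - 0) / sem
    margin = t_critical * sem
    return {
        "t": t,
        "df": df,
        "ci": (mean_diff - margin, mean_diff + margin),
        "cohens_d": mean_diff / s_d,
        "r_squared": t ** 2 / (t ** 2 + df),
    }


# Hypothetical pre-test and post-test scores for five subjects.
before = [10, 12, 9, 14, 11]
after = [8, 11, 7, 10, 9]
result = paired_t_test(before, after, t_critical=2.776)
for name, value in result.items():
    print(name, value)
```

With these numbers the mean difference is $-2.2$, $t \approx -4.49$ and $r^2 \approx 0.83$; because $|t|$ exceeds 2.776, the null hypothesis of no difference would be rejected at the 5\% level.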

3.2. Hypothesis Testing

Example: “US families spent an average of \$151 per week on food”
Null hypothesis: “the program did not reduce the cost of food”
Alternative hypothesis: “the program reduced the cost of food”

$null \rightarrow H_0:\mu_{program} \geq 151$
$alt \rightarrow H_A:\mu_{program} < 151$
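Because the alternative hypothesis points in one direction only, this is a one-tailed test in the lower tail: the null hypothesis is rejected when $t$ falls below $-t_{critical}$. A minimal sketch of that decision rule follows; the sample statistics (mean 142, standard deviation 20, $n = 25$) are invented purely for illustration and are not from the original example, and 1.711 is the one-tailed t-table value for $df = 24$ at $\alpha = 0.05$.

```python
import math


def one_sample_t(sample_mean, sample_sd, n, mu0):
    """t statistic for a sample mean against the hypothesised mean mu0."""
    sem = sample_sd / math.sqrt(n)
    return (sample_mean - mu0) / sem


def reject_lower_tail(t, t_critical):
    """One-tailed test in the lower tail: reject H0 when t < -t_critical."""
    return t < -abs(t_critical)


# Hypothetical program sample: 25 families, mean $142, standard deviation $20.
t = one_sample_t(sample_mean=142.0, sample_sd=20.0, n=25, mu0=151.0)
decision = reject_lower_tail(t, t_critical=1.711)
print(t)         # -2.25
print(decision)  # True
```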

4. Lesson 11 – Independent Samples

Dependent samples need fewer test subjects and are cost-effective and less time-consuming. However, they also have disadvantages: for example, subjects may already know the answers when they take a test for the second time.

That is why independent samples are used: one group receives the treatment and a separate group does not.

$s_1 = \sqrt{\frac{\sum{(x_{i1} - \bar{x}_1)^2}}{n_1 - 1}} \Rightarrow$ Standard deviation of sample 1 with Bessel’s correction (http://www.wikiwand.com/en/Bessel\%27s\_correction)
$s_2 = \sqrt{\frac{\sum{(x_{i2} - \bar{x}_2)^2}}{n_2 - 1}} \Rightarrow$ Standard deviation of sample 2 with Bessel’s correction
$n_1 \Rightarrow$ Size of Sample 1
$n_2 \Rightarrow$ Size of Sample 2
$S_D = \sqrt{s_1^2 + s_2^2} \Rightarrow$ Standard deviation of the difference between the samples
$SEM = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \Rightarrow$ Standard error for independent samples
$df = n_1 + n_2 - 2 \Rightarrow$ Degrees of freedom for independent samples
$t = \frac{\bar{x}_1 - \bar{x}_2}{SEM} \Rightarrow$ Two-sample t-test
$CI = (\bar{x}_1 - \bar{x}_2) \pm$ margin of error $\Rightarrow$ Confidence interval
$SS_x = \sum{(x_i - \bar{x})^2} \Rightarrow$ Sum of squared deviations
$s_p^2 = \frac{SS_1 + SS_2}{df_1 + df_2} \Rightarrow$ Pooled variance
$SEM_{CORRECTED} = s_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{s_p^2}{n_1} + \frac{s_p^2}{n_2}} \Rightarrow$ Corrected standard error
$t_{CORRECTED} = \frac{\bar{x}_1 - \bar{x}_2}{s_{\bar{x}_1 - \bar{x}_2}} \Rightarrow$ Corrected t-statistic
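The sketch below computes both versions of the two-sample t-statistic side by side: the uncorrected one, which uses each sample's own variance, and the corrected one, which uses the pooled variance. It assumes $df_1 = n_1 - 1$ and $df_2 = n_2 - 1$ in the pooled-variance formula; the function names and the two samples are hypothetical.

```python
import math


def sum_of_squares(values):
    """Sum of squared deviations from the sample mean."""
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values)


def two_sample_t(sample1, sample2):
    """Independent-samples t statistic, uncorrected and pooled (corrected)."""
    n1, n2 = len(sample1), len(sample2)
    mean1, mean2 = sum(sample1) / n1, sum(sample2) / n2
    ss1, ss2 = sum_of_squares(sample1), sum_of_squares(sample2)

    # Sample variances with Bessel's correction.
    var1, var2 = ss1 / (n1 - 1), ss2 / (n2 - 1)
    sem = math.sqrt(var1 / n1 + var2 / n2)

    # Pooled variance: combined sum of squares over combined degrees of freedom.
    pooled_var = (ss1 + ss2) / ((n1 - 1) + (n2 - 1))
    sem_corrected = math.sqrt(pooled_var / n1 + pooled_var / n2)

    return {
        "df": n1 + n2 - 2,
        "t": (mean1 - mean2) / sem,
        "pooled_variance": pooled_var,
        "t_corrected": (mean1 - mean2) / sem_corrected,
    }


# Hypothetical treatment and control groups of unequal size.
treatment = [5, 7, 8, 6, 9]
control = [3, 4, 6, 5, 2, 4]
stats = two_sample_t(treatment, control)
for name, value in stats.items():
    print(name, value)
```

For these values $t \approx 3.29$ and $t_{CORRECTED} \approx 3.32$ with $df = 9$. When both samples have the same size, the two statistics coincide.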

5. Lesson 12 – ANOVA (Analysis of Variance)

ANOVA for samples of the same size:
$N \Rightarrow$ total number of values across all samples
$k \Rightarrow$ number of samples
$n \Rightarrow$ number of values in each sample ($N = n \cdot k$)
$\bar{x}_k \Rightarrow$ mean of sample $k$
$\bar{x}_G = \frac{\sum{x_i}}{N} \Rightarrow$ Grand mean (for samples of equal size, the mean of the sample means)
$df_1 = k - 1 \Rightarrow$ between-group degrees of freedom
$df_2 = N - k \Rightarrow$ within-group degrees of freedom
$df_{total} = N - 1 \Rightarrow$ total degrees of freedom
$F = \frac{n \cdot \sum{(\bar{x}_k - \bar{x}_G)^2} / df_1}{\sum{(x_i - \bar{x}_k)^2} / df_2} \Rightarrow$ between-group variability / within-group variability; compare with the critical value from the F-table
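A minimal sketch of the F-ratio for groups of equal size, assuming $n$ is the number of values per group and that the within-group sum runs over every value relative to its own group mean. The function name and the three groups are hypothetical; the resulting F would still need to be compared with a critical value from an F-table.

```python
def one_way_anova_f(groups):
    """F statistic for a one-way ANOVA with equally sized groups."""
    k = len(groups)
    n = len(groups[0])
    if any(len(g) != n for g in groups):
        raise ValueError("this sketch only covers groups of equal size")
    total = n * k

    group_means = [sum(g) / n for g in groups]
    grand_mean = sum(sum(g) for g in groups) / total

    df_between = k - 1
    df_within = total - k

    # Between-group variability: spread of group means around the grand mean.
    ss_between = n * sum((m - grand_mean) ** 2 for m in group_means)
    # Within-group variability: spread of values around their own group mean.
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )

    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within


# Hypothetical data: three groups of three values each.
groups = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
f_value, df1, df2 = one_way_anova_f(groups)
print(f_value, df1, df2)  # 27.0 2 6
```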