
Statistical Formulas for Data Science


It is important to set up hypotheses so that the evaluation can be tested for statistical significance.

Statistical significance means:
\begin{itemize}
\item the null hypothesis was rejected
\item the results are not likely due to chance (sampling error)
\end{itemize}

The final report of the statistics should include:

\begin{itemize}
\item Descriptive Statistics
  \begin{itemize}
  \item M – Mean
  \item Sd – Standard Deviation
  \end{itemize}
\item Inferential Statistics
  \begin{itemize}
  \item Hypothesis test at significance level \(\alpha\)
    \begin{itemize}
    \item kind of test, e.g. one-sample t-test
    \item test statistic, e.g. t-value
    \item df – degrees of freedom
    \item p-value
    \item direction of test, e.g. one-tailed or two-tailed test
    \end{itemize}
  \item Confidence intervals
    \begin{itemize}
    \item Confidence level, e.g. 95\%
    \item Lower limit
    \item Upper limit
    \item CI on what?
    \end{itemize}
  \item Effect size measures
    \begin{itemize}
    \item \(d\)
    \item \(r^2\)
    \end{itemize}
  \end{itemize}
\end{itemize}


\section{Statistical Formulas}

\subsection{Empirical Mean}
\(\gamma = \frac{\sum{x}}{N}\)

Properties (for independent random variables X and Y):
\begin{itemize}
\item \(Mean(X + Y) = Mean(X) + Mean(Y)\)
\item \(Mean(X \times Y) = Mean(X) \times Mean(Y)\)
\end{itemize}

\subsection{Variance}
$\sigma^2 = Var(X)$

$\sigma^2 = \frac{\sum{(x_i - \gamma)^2}}{N}$

$\sigma^2 = \frac{\sum{x_i^2}}{N} - \frac{(\sum{x_i})^2}{N^2}$
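The two variance formulas above are algebraically equivalent. The following Python sketch checks this numerically; the function names and the data values are hypothetical and chosen only for illustration.

```python
def mean(values):
    """Empirical mean: sum of the values divided by their count N."""
    return sum(values) / len(values)


def variance_definition(values):
    """Population variance as the mean squared deviation from the mean."""
    m = mean(values)
    return sum((x - m) ** 2 for x in values) / len(values)


def variance_shortcut(values):
    """Population variance as the mean of squares minus the squared mean."""
    n = len(values)
    return sum(x * x for x in values) / n - (sum(values) ** 2) / n ** 2


# Hypothetical data, not taken from the text.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

print(mean(data))                 # 5.0
print(variance_definition(data))  # 4.0
print(variance_shortcut(data))    # 4.0
```

Both functions divide by $N$, so they give the population variance rather than the sample variance with Bessel's correction.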

\subsection{Standard Deviation}
$\sigma = \sqrt{Var(X)}$

$\sigma = \sqrt{\frac{\sum{(x_i - \gamma)^2}}{N}}$

\subsection{Standard Error}
SE $= \frac{\sigma}{\sqrt{N}}$

Z-Score $= \frac{x - \bar{x}}{Sd}$, where Sd is the standard deviation.
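As a small illustration, the standard error and the z-score can be computed as follows. This is a sketch with hypothetical function names and made-up numbers; it assumes the standard error is the standard deviation divided by the square root of the sample size.

```python
import math


def standard_error(sigma, n):
    """Standard error of the mean: standard deviation over sqrt(sample size)."""
    return sigma / math.sqrt(n)


def z_score(x, mean, sd):
    """Number of standard deviations that x lies away from the mean."""
    return (x - mean) / sd


# Hypothetical numbers for illustration only.
se = standard_error(sigma=2.0, n=16)
z = z_score(x=9.0, mean=5.0, sd=2.0)

print(se)  # 0.5
print(z)   # 2.0
```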

\subsection{Standard Normal Distribution}
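The density of a normal distribution with mean $\gamma$ and variance $\sigma^2$ is

$f(x) = \frac{1}{\sqrt{2 \cdot \pi \cdot \sigma^2}} \cdot e^{-\frac{1}{2} \cdot \frac{(x-\gamma)^2}{\sigma^2}}$

The standard normal distribution is the special case $\gamma = 0$ and $\sigma^2 = 1$.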
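The density formula can be evaluated directly. The sketch below uses a hypothetical helper whose defaults (mean 0, variance 1) correspond to the standard normal case.

```python
import math


def normal_pdf(x, mean=0.0, variance=1.0):
    """Normal density at x; the defaults give the standard normal."""
    exponent = -0.5 * (x - mean) ** 2 / variance
    return math.exp(exponent) / math.sqrt(2 * math.pi * variance)


peak = normal_pdf(0.0)                             # 1 / sqrt(2 * pi)
shifted = normal_pdf(5.0, mean=5.0, variance=4.0)  # peak of a wider curve

print(round(peak, 4))  # 0.3989
```

The density is symmetric around the mean, so `normal_pdf(-1.0)` and `normal_pdf(1.0)` return the same value.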

\subsection{Confidence Interval}
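For a proportion $p$ estimated from $N$ observations, the margin of error at the 95\% confidence level is:

Margin of error $= 1.96 \cdot \sqrt{\frac{p(1 - p)}{N}}$

General form:

Half-width of the CI $= a \cdot \sqrt{\frac{\sigma^2}{N}}$

CI $= \frac{\sum{X_i}}{N} \pm a \cdot \sqrt{\frac{\sigma^2}{N}}$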


\begin{itemize}
\item $a = 1.96$ for $N \geq 30$ at the 95\% confidence level
\item for smaller samples, $a$ is the t-value for $(N - 1)$ degrees of freedom at the chosen confidence level.
\end{itemize}
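The confidence interval formulas can be sketched in Python as follows. The function names and the numbers are hypothetical; the multiplier defaults to 1.96, which assumes a 95\% confidence level and a large sample.

```python
import math


def proportion_margin(p, n, a=1.96):
    """Margin of error for a proportion p estimated from n observations."""
    return a * math.sqrt(p * (1 - p) / n)


def mean_confidence_interval(values, sigma, a=1.96):
    """Interval: sample mean plus or minus a * sqrt(sigma^2 / N)."""
    n = len(values)
    center = sum(values) / n
    margin = a * math.sqrt(sigma ** 2 / n)
    return center - margin, center + margin


# Hypothetical inputs for illustration only.
margin = proportion_margin(p=0.5, n=100)
lower, upper = mean_confidence_interval([4.0, 6.0, 5.0, 5.0], sigma=2.0)

print(round(margin, 3))                 # 0.098
print(round(lower, 2), round(upper, 2))  # 3.04 6.96
```

For small samples the default multiplier would be replaced by the matching value from a t-table.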


\(t = \frac{\bar{x}_D - 0}{s_D / \sqrt{n}}\)

\section{Lesson 10 – Dependent samples}

Dependent samples (repeated measures):
\begin{itemize}
\item Two conditions
\item Longitudinal
\item Pre-test, post-test
\end{itemize}

\subsection{Important formulas}
\begin{itemize}
\item \(X = [\mbox{array of values}] \Rightarrow\) Population data
\item \(x = [\mbox{array of values}] \Rightarrow\) Sample data
\item \(n \Rightarrow\) Sample size
\item \(\mu = \frac{\sum{X_i}}{N} \Rightarrow\) Population mean, with \(N\) the population size
\item \(\bar{x} = \frac{\sum{x_i}}{n} \Rightarrow\) Sample mean
\item \(s_D = \sqrt{\frac{\sum{(x_i - \bar{x})^2}}{n - 1}} \Rightarrow\) Standard deviation of the sample (with Bessel's correction)
\item \(\alpha \Rightarrow\) significance level (tail probability in the t-table)
\item \(t_{critical} \Rightarrow\) critical value from the t-table
\item \(df = n - 1 \Rightarrow\) Degrees of freedom
\item \(SEM = \frac{s_D}{\sqrt{n}} \Rightarrow\) Standard error of the mean
\item \(t = \frac{\bar{x} - \mu}{SEM} \Rightarrow\) One-sample t-test
\item margin of error \(= t_{critical} \times SEM\)
\item \(CI = \bar{x} \pm \) margin of error \(\Rightarrow\) Confidence interval
\item \(d = \frac{\bar{x} - \mu}{s_D} \Rightarrow\) Cohen’s d \(\Rightarrow\) Standardized mean difference
\item \(r^2 = \frac{t^2}{t^2 + df} \Rightarrow\) strength of the relationship between two variables, expressed as a proportion. Example: \(r^2\) tells how much a person’s gender contributed to the difference between the two samples.
\end{itemize}
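The formulas in this list chain together into one calculation. The sketch below walks through them for a small, made-up sample; the function name and the data are hypothetical, the standard deviation is assumed to use $n - 1$ in the denominator, and the critical value is passed in by hand as it would be read from a t-table (2.776 for $df = 4$, two-tailed, $\alpha = 0.05$).

```python
import math


def one_sample_summary(sample, mu, t_critical):
    """Compute the one-sample t-test quantities listed above."""
    n = len(sample)
    x_bar = sum(sample) / n
    # Sample standard deviation; assumes Bessel's correction (n - 1).
    s = math.sqrt(sum((x - x_bar) ** 2 for x in sample) / (n - 1))
    df = n - 1
    sem = s / math.sqrt(n)
    t = (x_bar - mu) / sem
    margin = t_critical * sem
    return {
        "mean": x_bar,
        "sd": s,
        "df": df,
        "sem": sem,
        "t": t,
        "ci": (x_bar - margin, x_bar + margin),
        "cohens_d": (x_bar - mu) / s,
        "r_squared": t ** 2 / (t ** 2 + df),
    }


# Hypothetical sample and comparison value.
result = one_sample_summary([5.0, 7.0, 9.0, 11.0, 13.0], mu=6.0, t_critical=2.776)

print(round(result["t"], 3))          # 2.121
print(round(result["r_squared"], 3))  # 0.529
```

For dependent samples the same function could be applied to the list of paired differences with `mu=0`.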

\subsection{Hypothesis Testing}
“US families spent an average of \$151 per week on food”

\begin{itemize}
\item Null hypothesis: “the program did not change the cost of food”, \(H_0: \mu_{program} \geq 151\)
\item Alternative hypothesis: “the program reduced the cost of food”, \(H_A: \mu_{program} < 151\)
\end{itemize}

Because the alternative hypothesis has a direction, this is a one-tailed test.
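A one-tailed decision for this kind of hypothesis can be sketched as follows. Only the comparison value 151 comes from the example above; the sample mean, standard deviation, and sample size are invented for illustration, and the critical value 1.711 is the t-table entry for $df = 24$, one-tailed, $\alpha = 0.05$.

```python
import math


def lower_tail_t_test(sample_mean, mu_0, s, n, t_critical):
    """One-tailed test of H_0: mu >= mu_0 against H_A: mu < mu_0."""
    sem = s / math.sqrt(n)
    t = (sample_mean - mu_0) / sem
    # Reject H_0 only when t falls in the lower tail.
    reject_null = t < -t_critical
    return t, reject_null


# Hypothetical summary statistics for the program group.
t, reject = lower_tail_t_test(
    sample_mean=142.0, mu_0=151.0, s=20.0, n=25, t_critical=1.711
)

print(t)       # -2.25
print(reject)  # True
```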

\section{Lesson 11 – Independent Samples}

Dependent samples need fewer test subjects, are cost-effective, and are less time-consuming. However, this type of data also has disadvantages. For example, the subjects could already know the answers when completing a test for the second time.

That’s why you need independent samples: one group that received the treatment and a separate group that did not.

\begin{itemize}
\item \(s_1 = \sqrt{\frac{\sum{(x_{i,1} - \bar{x}_1)^2}}{n_1 - 1}} \Rightarrow\) Standard deviation of sample 1 with Bessel’s correction
\item \(s_2 = \sqrt{\frac{\sum{(x_{i,2} - \bar{x}_2)^2}}{n_2 - 1}} \Rightarrow\) Standard deviation of sample 2 with Bessel’s correction
\item \(n_1 \Rightarrow\) Size of sample 1
\item \(n_2 \Rightarrow\) Size of sample 2
\item \(S_D = \sqrt{s_1^2 + s_2^2} \Rightarrow\) Standard deviation of the difference between the samples
\item \(SEM = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \Rightarrow\) Standard error for independent samples
\item \(df = n_1 + n_2 - 2 \Rightarrow\) Degrees of freedom for independent samples
\item \(t = \frac{\bar{x}_1 - \bar{x}_2}{SEM} \Rightarrow\) Two-sample t-test
\item \(CI = (\bar{x}_1 - \bar{x}_2) \pm \) margin of error \(\Rightarrow\) Confidence interval
\item \(SS_x = \sum{(x_i - \bar{x})^2} \Rightarrow\) Sum of squared deviations
\item \(S_p^2 = \frac{SS_1 + SS_2}{df_1 + df_2} \Rightarrow\) Pooled variance
\item \(SEM_{CORRECTED} = S_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{S_p^2}{n_1} + \frac{S_p^2}{n_2}} \Rightarrow\) Corrected standard error
\item \(t_{CORRECTED} = \frac{\bar{x}_1 - \bar{x}_2}{S_{\bar{x}_1 - \bar{x}_2}} \Rightarrow\) Corrected t-statistic
\end{itemize}
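The independent-samples formulas can be followed step by step in a short Python sketch. The function names and the two samples are hypothetical; the code computes both the plain standard error and the corrected one based on the pooled variance.

```python
import math


def sum_of_squares(sample):
    """Sum of squared deviations from the sample mean."""
    m = sum(sample) / len(sample)
    return sum((x - m) ** 2 for x in sample)


def independent_t(sample_1, sample_2):
    """Two-sample t statistics, without and with pooled variance."""
    n1, n2 = len(sample_1), len(sample_2)
    diff = sum(sample_1) / n1 - sum(sample_2) / n2
    ss1, ss2 = sum_of_squares(sample_1), sum_of_squares(sample_2)
    # Sample variances with Bessel's correction.
    var1, var2 = ss1 / (n1 - 1), ss2 / (n2 - 1)
    sem = math.sqrt(var1 / n1 + var2 / n2)
    # Pooled variance: SS_1 + SS_2 over df_1 + df_2.
    pooled = (ss1 + ss2) / ((n1 - 1) + (n2 - 1))
    sem_corrected = math.sqrt(pooled / n1 + pooled / n2)
    return {
        "df": n1 + n2 - 2,
        "sem": sem,
        "t": diff / sem,
        "pooled_variance": pooled,
        "sem_corrected": sem_corrected,
        "t_corrected": diff / sem_corrected,
    }


# Hypothetical samples of unequal size.
result = independent_t([5.0, 7.0, 9.0, 11.0, 13.0], [2.0, 4.0, 6.0])

print(result["df"])               # 6
print(result["pooled_variance"])  # 8.0
```

When both samples have the same size, the two standard errors coincide, so the correction matters only for unequal sample sizes.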

\section{Lesson 12 – ANOVA (Analysis of Variance)}
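ANOVA for samples of the same size:
\begin{itemize}
\item \(N \Rightarrow\) number of values from all samples
\item \(k \Rightarrow\) number of samples
\item \(n \Rightarrow\) number of values in each sample
\item \(\bar{x}_k \Rightarrow\) mean of sample \(k\)
\item \(\bar{x}_G = \frac{\sum{x_i}}{N} \Rightarrow\) Grand mean (for samples of equal size this equals \(\frac{\sum{\bar{x}_k}}{k}\))
\item \(df_1 = k - 1 \Rightarrow\) between-group degrees of freedom
\item \(df_2 = N - k \Rightarrow\) within-group degrees of freedom
\item \(df_{total} = N - 1 \Rightarrow\) total degrees of freedom
\item \(F = \frac{n \cdot \sum{(\bar{x}_k - \bar{x}_G)^2} / df_1}{\sum{(x_i - \bar{x}_k)^2} / df_2} \Rightarrow\) between-group variability / within-group variability. The critical value is read from an F-table.
\end{itemize}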
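The F statistic for equally sized samples can be sketched as below. The function name and the three groups are hypothetical; the code raises an error for unequal sample sizes because the formula above assumes they are equal.

```python
def anova_f(samples):
    """One-way ANOVA F statistic for samples of equal size."""
    k = len(samples)
    n = len(samples[0])
    if any(len(s) != n for s in samples):
        raise ValueError("this sketch assumes equally sized samples")
    total = k * n  # N in the formulas above
    group_means = [sum(s) / n for s in samples]
    grand_mean = sum(sum(s) for s in samples) / total
    # Between-group sum of squares, weighted by the sample size n.
    ss_between = n * sum((m - grand_mean) ** 2 for m in group_means)
    # Within-group sum of squares around each sample's own mean.
    ss_within = sum(
        (x - m) ** 2 for s, m in zip(samples, group_means) for x in s
    )
    df_1 = k - 1
    df_2 = total - k
    f = (ss_between / df_1) / (ss_within / df_2)
    return f, df_1, df_2


# Hypothetical groups for illustration only.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f, df_1, df_2 = anova_f(groups)

print(round(f, 6), df_1, df_2)  # 13.0 2 6
```

The resulting value would then be compared with the F-table entry for $(df_1, df_2)$ at the chosen significance level.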



Written by mathsgee

