Topic outline

  • Welcome to Essential Statistics for GIS Data Analysts

    This introductory course will help you gain a solid understanding of statistics and basic probability for Geographical Information Science (GIS) that forms a foundation for further work in spatial data analysis and data science. It is our hope you will really enjoy the course.

    The aim of the course is to provide a basis for understanding statistical methods and their use in GIS practice. Students will gain an overview of the basic techniques needed for statistical analysis of data. After completing the course, students will be able to perform routine statistical analyses.

    The entire course is in a self-paced format and we estimate that it will take anywhere from 24 to 48 hours to complete depending on your current skill set. It is possible to sit for the course all at once, but we recommend allocating 4 to 8 hours per week for six weeks to complete this course. That includes videos, reading materials, homework, quizzes, and the final exam. We also encourage you to participate in the discussion forums when you have questions, comments, or concerns.


    Motto: 
    There are three kinds of lies: lies, damned lies, and statistics.
    This well-known saying is part of a phrase attributed to Benjamin Disraeli and popularized in the U.S. by Mark Twain.


  • Key words:
    measures of location, measures of central tendency (centre), measures of variability (spread), measures of skewness, measures of kurtosis, percentile, quartile, mean, average, mode, median, interquartile range, variance, standard deviation, coefficient of variation, coefficient of skewness, coefficient of kurtosis, boxplot, five-number summary

    Descriptive statistics can be useful for two purposes:

    1) to provide basic information about variables in a dataset and

    2) to highlight potential relationships between variables.

    The three most common descriptive statistics can be displayed graphically or pictorially and are measures of central tendency (centre), variability (spread), and the shape of the distribution (skewness and kurtosis).
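    Many of the measures listed in the key words can be computed directly with Python’s standard statistics module. This is a minimal sketch; the data values are invented for illustration:

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # a small illustrative sample

# Measures of central tendency (centre)
mean = statistics.mean(data)      # average
median = statistics.median(data)
mode = statistics.mode(data)

# Measures of variability (spread)
sample_variance = statistics.variance(data)
sample_stdev = statistics.stdev(data)
coeff_of_variation = sample_stdev / mean

# Quartiles, five-number summary, and interquartile range
q1, q2, q3 = statistics.quantiles(data, n=4)
five_number_summary = (min(data), q1, q2, q3, max(data))
iqr = q3 - q1
```

    The five-number summary computed here is exactly what a boxplot displays.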

    Useful Materials for this course

    Descriptive Statistics Notes

    List of abbreviations and standard notation used in statistics

    List of statistical formulas with description


  • Discrete Random Variables

    Example

    Select three fans randomly at a football game in which Penn State is playing Notre Dame. Identify whether the fan is a Penn State fan (P) or a Notre Dame fan (N). This experiment yields the following sample space:

    S = {PPP, PPN, PNP, NPP, NNP, NPN, PNN, NNN}

    Let X = the number of Penn State fans selected. The possible values of X are, therefore, either 0, 1, 2, or 3. Now, we could find probabilities of individual events, P(PPP) or P(PPN), for example. Alternatively, we could find P(X = x), the probability that X takes on a particular value x. Let’s do that!

    Since the game is a home game, let’s suppose that 80% of the fans attending the game are Penn State fans, while 20% are Notre Dame fans. That is, P(P) = 0.8 and P(N) = 0.2. Then, by independence:

    P(X = 0) = P(NNN) = 0.2 × 0.2 × 0.2 = 0.008

    And, by independence and mutual exclusivity of NNP, NPN, and PNN:

    P(X = 1) = P(NNP) + P(NPN) + P(PNN) = 3 × 0.2 × 0.2 × 0.8 = 0.096

    Likewise, by independence and mutual exclusivity of PPN, PNP, and NPP:

    P(X = 2) = P(PPN) + P(PNP) + P(NPP) = 3 × 0.8 × 0.8 × 0.2 = 0.384

    Finally, by independence:

    P(X = 3) = P(PPP) = 0.8 × 0.8 × 0.8 = 0.512

    There are a few things to note here:

    • The results make sense! Given that 80% of the fans in the stands are Penn State fans, it shouldn’t seem surprising that we would be most likely to select 2 or 3 Penn State fans.
    • The probabilities behave well in that (1) the probabilities are all greater than 0, that is, P(X = x) > 0 and (2) the probability of the sample space is 1, that is, P(S) = P(X = 0) + P(X = 1) + P(X = 2) + P(X = 3) = 1.
    • Because the values that it takes on are random, the variable X has a special name. It is called a random variable!  
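    The four probabilities above can be verified from first principles by enumerating the sample space S. This is a minimal sketch in Python:

```python
from itertools import product

p_single = {"P": 0.8, "N": 0.2}   # P(P) = 0.8, P(N) = 0.2

# Enumerate S = {PPP, PPN, ..., NNN} and accumulate P(X = x),
# where X counts the Penn State fans in each outcome.
pmf = {x: 0.0 for x in range(4)}
for outcome in product("PN", repeat=3):
    prob = 1.0
    for fan in outcome:               # independence: multiply the factors
        prob *= p_single[fan]
    pmf[outcome.count("P")] += prob   # mutual exclusivity: add the terms

# pmf[0] ≈ 0.008, pmf[1] ≈ 0.096, pmf[2] ≈ 0.384, pmf[3] ≈ 0.512
```

    The accumulated values match the hand calculations, and their sum confirms that P(S) = 1.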

     

    Let’s give a formal definition of a random variable.

     

    Definition. Given a random experiment with sample space S, a random variable X is a real-valued function that assigns one and only one real number to each element s that belongs to the sample space S.

    The set of all possible values of the random variable X, denoted x, is called the support, or space, of X.

     

    Note that the capital letters at the end of the alphabet, such as W, X, Y, and Z, typically represent random variables. The corresponding lowercase letters, such as w, x, y, and z, represent the random variable’s possible values.

     

    Example

    A rat is selected at random from a cage of male (M) and female rats (F). Once selected, the gender of the selected rat is noted. The sample space is thus:

    S = {M, F}

    Define the random variable X as follows:

    • Let X = 0 if the rat is male.
    • Let X = 1 if the rat is female.

    Note that the random variable X assigns one and only one real number (0 and 1) to each element of the sample space (M and F). The support, or space, of X is {0, 1}.

    Note that we don’t necessarily need to use the numbers 0 and 1 as the support. For example, we could have alternatively (and perhaps arbitrarily?!) used the numbers 5 and 15, respectively. In that case, our random variable would be defined as X = 5 if the rat is male, and X = 15 if the rat is female.

     

    Example

    A roulette wheel has 38 numbers on it: a zero (0), a double zero (00), and the numbers 1, 2, 3, …, 36. Spin the wheel until the pointer lands on number 36. One possibility is that the wheel lands on 36 on the first spin.  Another possibility is that the wheel lands on 0 on the first spin, and 36 on the second spin.  Yet another possibility is that the wheel lands on 0 on the first spin, 7 on the second spin, and 36 on the third spin. The sample space must list all of the countably infinite (!) number of possible sequences. That is, the sample space looks like this:

    S = {36, 0-36, 00-36, 1-36, … 35-36, 0-0-36, 0-1-36, …}

    If we define the random variable X to equal the number of spins until the wheel lands on 36, then the support of X is {1, 2, 3, …}, since at least one spin is required.
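    A quick simulation illustrates the support of X. The 38 slots and the stopping rule come from the example above; the sample size and seed are arbitrary choices:

```python
import random

WHEEL = ["0", "00"] + [str(i) for i in range(1, 37)]  # 38 equally likely slots

def spins_until_36(rng: random.Random) -> int:
    """Spin until the pointer lands on 36; return the number of spins."""
    spins = 0
    while True:
        spins += 1
        if rng.choice(WHEEL) == "36":
            return spins

rng = random.Random(2024)
samples = [spins_until_36(rng) for _ in range(10_000)]
# Every observed value is a positive integer, and the sample mean
# should be close to the theoretical mean of 1 / (1/38) = 38 spins.
```

    No simulation can exhibit the whole countably infinite support, but every draw lands in {1, 2, 3, …}.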

    Note that in the rat example, there were a finite (two, to be exact) number of possible outcomes, while in the roulette example, there were a countably infinite number of possible outcomes. This leads us to the following formal definition.

     

    Definition. A random variable X is a discrete random variable if:
    • there are a finite number of possible outcomes of X, or
    • there are a countably infinite number of possible outcomes of X.

     

     


    • Introduction

      In this lesson, and some of the lessons that follow in this section, we’ll be looking at specially named discrete probability mass functions, such as the geometric distribution, the hypergeometric distribution, and the Poisson distribution. As you can probably gather by the name of this lesson, we’ll be exploring the well-known binomial distribution in this lesson.

      The basic idea behind this lesson and the ones that follow is that when certain conditions are met, we can derive a general formula for the probability mass function of a discrete random variable X. We can then use that formula to calculate probabilities concerning X rather than resorting to first principles. Sometimes the probability calculations can be tedious. In those cases, we might want to take advantage of the cumulative probability tables that others have created. We’ll do exactly that for the binomial distribution. We’ll also derive formulas for the mean, variance, and standard deviation of a binomial random variable.

       

      Objectives

      • To understand the derivation of the formula for the binomial probability mass function.
      • To verify that the binomial p.m.f. is a valid p.m.f.
      • To learn the necessary conditions for which a discrete random variable X is a binomial random variable.
      • To learn the definition of a cumulative probability distribution.
      • To understand how cumulative probability tables can simplify binomial probability calculations.
      • To learn how to read a standard cumulative binomial probability table.
      • To learn how to determine binomial probabilities using a standard cumulative binomial probability table when p is greater than 0.5.
      • To understand the effect of the parameters n and p on the shape of a binomial distribution.
      • To derive formulas for the mean and variance of a binomial random variable.
      • To understand the steps involved in each of the proofs in the lesson.
      • To be able to apply the methods learned in the lesson to new problems.
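      As a preview of the formulas derived in this lesson, here is a sketch of the binomial p.m.f. and the cumulative probabilities that the standard tables report; the function names are my own:

```python
from math import comb

def binom_pmf(x: int, n: int, p: float) -> float:
    """P(X = x) for a binomial random variable: C(n, x) p^x (1 - p)^(n - x)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def binom_cdf(x: int, n: int, p: float) -> float:
    """Cumulative probability P(X <= x), as tabulated in binomial tables."""
    return sum(binom_pmf(k, n, p) for k in range(x + 1))

# The Penn State fan example is binomial with n = 3, p = 0.8:
# binom_pmf(2, 3, 0.8) recovers 3 × 0.8 × 0.8 × 0.2 = 0.384.
```

      The mean and variance formulas derived in the lesson, np and np(1 − p), can be checked numerically against this p.m.f.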

    • Introduction

      A continuous random variable differs from a discrete random variable in that it takes on an uncountably infinite number of possible outcomes.  For example, if we let X denote the height (in meters) of a randomly selected maple tree, then X is a continuous random variable. In this lesson, we’ll extend much of what we learned about discrete random variables to the case in which a random variable is continuous. Our specific goals include:

      1. Finding the probability that X falls in some interval, that is, finding P(a < X < b), where a and b are constants.  We’ll do this by using f(x), the probability density function (“p.d.f.”) of X, and F(x), the cumulative distribution function (“c.d.f.”) of X.
      2. Finding the mean μ, variance σ², and standard deviation σ of X. We’ll do this through the definitions of E(X) and Var(X) extended to a continuous random variable, as well as through the moment generating function M(t) extended to a continuous random variable.

      Objectives

      • To introduce the concept of a probability density function of a continuous random variable.
      • To learn the formal definition of a probability density function of a continuous random variable.
      • To learn that if X is continuous, the probability that X takes on any specific value x is 0.
      • To learn how to find the probability that a continuous random variable X falls in some interval (a, b).
      • To learn the formal definition of a cumulative distribution function of a continuous random variable.
      • To learn how to find the cumulative distribution function of a continuous random variable X from the probability density function of X.
      • To learn the formal definition of a (100p)th percentile.
      • To learn the formal definition of the median, first quartile, and third quartile.
      • To learn how to use the probability density function to find the (100p)th percentile of a continuous random variable X.
      • To extend the definitions of the mean, variance, standard deviation, and moment-generating function for a continuous random variable X.
      • To be able to apply the methods learned in the lesson to new problems.
      • To learn a formal definition of the probability density function of a continuous uniform random variable.
      • To learn a formal definition of the cumulative distribution function of a continuous uniform random variable.
      • To learn key properties of a continuous uniform random variable, such as the mean, variance, and moment generating function.
      • To understand and be able to create a quantile-quantile (q-q) plot.
      • To understand how randomly-generated uniform (0,1) numbers can be used to randomly assign experimental units to treatment.
      • To understand how randomly-generated uniform (0,1) numbers can be used to randomly select participants for a survey.
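      The last two objectives can be made concrete with a short sketch: the uniform(0, 1) draws come from Python’s random module, and the unit labels are invented for illustration:

```python
import random

def random_assignment(units, n_treatment, seed=7):
    """Rank units by uniform(0, 1) draws; the lowest draws get the treatment."""
    rng = random.Random(seed)
    draws = [(rng.random(), unit) for unit in units]  # one U(0, 1) per unit
    draws.sort()                                       # puts units in random order
    ordered = [unit for _, unit in draws]
    return ordered[:n_treatment], ordered[n_treatment:]

# Split 20 experimental units into treatment and control groups of 10:
treatment, control = random_assignment(list(range(1, 21)), 10)
```

      The same ranking trick selects survey participants: randomly order the sampling frame and take the first n units.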

    • Point and Interval Estimate Notes

      Key words:
      estimate, confidence level, alpha, point estimate, interval estimate, population, sample, population parameter, sample statistics, sampling error, sample size, point estimate of mean, variance, standard deviation, interval estimate of mean, variance, standard deviation
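      As a sketch of the interval-estimate idea: the z multiplier 1.96 below corresponds to a 95% confidence level and assumes a known population standard deviation and a normal sampling distribution; the numbers are invented:

```python
from math import sqrt

def interval_estimate_mean(xbar, sigma, n, z=1.96):
    """Interval estimate for a population mean around the point estimate xbar."""
    sampling_error = z * sigma / sqrt(n)   # margin of error
    return xbar - sampling_error, xbar + sampling_error

# Point estimate 100 from a sample of size n = 36, known sigma = 15:
low, high = interval_estimate_mean(100, 15, 36)
```

      Note how the interval narrows as the sample size n grows, since the sampling error shrinks like 1/√n.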


    • Introduction

      In this lesson, we’ll investigate one of the most prevalent probability distributions in the natural world, namely the normal distribution. Just as we have for other probability distributions, we’ll explore the normal distribution’s properties, as well as learn how to calculate normal probabilities.

      Objectives

      • To define the probability density function of a normal random variable.
      • To learn the characteristics of a typical normal curve.
      • To learn how to transform a normal random variable X into the standard normal random variable Z.
      • To learn how to calculate the probability that a normal random variable falls between two values a and b, below a value c, or above a value d.
      • To learn how to read standard normal probability tables.
      • To learn how to find the value x associated with a cumulative normal probability.
      • To explore the key properties, such as the moment-generating function, mean and variance, of a normal random variable.
      • To investigate the relationship between the standard normal random variable and a chi-square random variable with one degree of freedom.
      • To learn how to interpret a Z-value.
      • To learn why the Empirical Rule holds true.
      • To understand the steps involved in each of the proofs in the lesson.
      • To be able to apply the methods learned in the lesson to new problems.
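      Several of these objectives can be previewed with the error function from Python’s math module; normal_cdf is my own helper name, not a library function:

```python
from math import erf, sqrt

def normal_cdf(x, mu=0.0, sigma=1.0):
    """P(X <= x) for X ~ N(mu, sigma^2), computed via the error function."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

# Transforming X into Z: P(a < X < b) = Phi((b - mu)/sigma) - Phi((a - mu)/sigma)
prob = normal_cdf(110, mu=100, sigma=15) - normal_cdf(90, mu=100, sigma=15)

# Empirical Rule: about 68% of values lie within one standard deviation
within_one_sigma = normal_cdf(1) - normal_cdf(-1)   # ≈ 0.6827
```

      The same helper reproduces the values in a standard normal probability table, e.g. normal_cdf(1.96) ≈ 0.975.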

    • Key words:
      hypothesis, null hypothesis, alternate hypothesis, decision (do not reject the null hypothesis; reject the null hypothesis), type I error, type II error, one-tailed test (left-tailed test; right-tailed test), two-tailed test, test statistic, critical value, significance level, alpha, p-value, normal distribution, Student t distribution, independent sample, matched sample, hypothesis test about a population mean, hypothesis test about the difference between means of two populations, paired two sample for means

      Hypothesis testing was introduced by Ronald Fisher, Jerzy Neyman, Karl Pearson, and Pearson’s son, Egon Pearson.  Hypothesis testing is a statistical method that is used in making statistical decisions using experimental data.  A hypothesis is basically an assumption that we make about a population parameter.

      Key terms and concepts:

      • Null hypothesis: a statistical hypothesis that assumes the observation is due to chance alone.  The null hypothesis is denoted H0; for example, H0: μ1 = μ2 states that there is no difference between the two population means.
      • Alternative hypothesis: contrary to the null hypothesis, the alternative hypothesis states that the observations are the result of a real effect.
      • Level of significance: the threshold at which we decide to reject the null hypothesis.  Because 100% certainty is not possible when accepting or rejecting a hypothesis, we select a level of significance, usually 5%.
      • Type I error: rejecting the null hypothesis although it is true.  The probability of a Type I error is denoted by alpha.  In hypothesis testing, the part of the normal curve that shows the critical region is called the alpha region.
      • Type II error: failing to reject the null hypothesis although it is false.  The probability of a Type II error is denoted by beta.  In hypothesis testing, the part of the normal curve that shows the acceptance region is called the beta region.
      • Power: the probability of correctly rejecting a false null hypothesis.  1 − beta is called the power of the test.
      • One-tailed test: when the alternative hypothesis specifies a direction, such as H1: μ1 > μ2, the test is one-tailed.
      • Two-tailed test: when the alternative hypothesis states only that the two values differ, such as H1: μ1 ≠ μ2, the test is two-tailed.

      Statistical decision for hypothesis testing:

      In statistical analysis, we have to make decisions about the hypothesis.  These decisions are either to reject the null hypothesis or to fail to reject it.  Every test in hypothesis testing produces a significance value (p-value) for that particular test.  If the significance value of the test is greater than the predetermined significance level, then we fail to reject the null hypothesis.  If the significance value is less than the predetermined level, then we reject the null hypothesis.  For example, if we want to see the degree of relationship between two stock prices and the significance value of the correlation coefficient is greater than the predetermined significance level, then we fail to reject the null hypothesis and conclude that there is no evidence of a relationship between the two stock prices; any apparent relationship in the sample can be attributed to chance.
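      The decision rule above can be sketched for a one-sample test about a population mean. This assumes a known sigma (a z-test rather than a t-test) and uses only the standard library; the sample numbers are invented:

```python
from math import erf, sqrt

def one_sample_z_test(xbar, mu0, sigma, n, alpha=0.05):
    """Two-tailed z-test of H0: mu = mu0 against H1: mu != mu0."""
    z = (xbar - mu0) / (sigma / sqrt(n))               # test statistic
    phi = lambda t: 0.5 * (1 + erf(t / sqrt(2)))       # standard normal c.d.f.
    p_value = 2 * (1 - phi(abs(z)))                    # two-tailed p-value
    decision = "reject H0" if p_value < alpha else "fail to reject H0"
    return z, p_value, decision

# Sample mean 52 from n = 100 observations, H0: mu = 50, known sigma = 10:
z, p_value, decision = one_sample_z_test(52, 50, 10, 100)
# z = 2.0, p-value ≈ 0.0455 < 0.05, so H0 is rejected at the 5% level.
```

      With an unknown sigma and a small sample, the Student t distribution would replace the normal c.d.f. here.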

      You will learn about null and alternative hypotheses, Type I and Type II errors, one-sample tests for means and proportions, tests for the difference between means of two populations, and the chi-square test for independence.

      Useful Materials

      Hypothesis Testing Notes 1

      Hypothesis testing Notes 2

      Hypothesis testing Cheat Sheet


    • Linear regression and correlation analysis

      Key words:
      scatter plot, independent variable, dependent variable, linear pattern, residual plots, positive correlation, negative correlation, correlation coefficient, coefficient of determination, observed values, predicted values, outliers, influential points, ordinary least squares method, line of best fit, parameter estimate, slope, intercept, simple (multiple) regression analysis, power of model, model building
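      A minimal sketch of the ordinary least squares method for the line of best fit and the coefficient of determination; the data set is hypothetical, and in practice a library such as statsmodels or scikit-learn would be used:

```python
def ols_fit(xs, ys):
    """Slope and intercept of the least-squares line of best fit."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def r_squared(xs, ys):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    slope, intercept = ols_fit(xs, ys)
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

xs = [1, 2, 3, 4, 5]             # independent variable (invented data)
ys = [2.1, 3.9, 6.2, 7.8, 10.1]  # dependent variable
slope, intercept = ols_fit(xs, ys)
```

      Predicted values are slope * x + intercept; the differences between observed and predicted values are the residuals used in residual plots.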

       

      Useful Materials

      Linear Regression and Correlation Analysis Notes


      • Analysis of variance (ANOVA)

        Key words:
        hypothesis test, more than two mean values, analysis of variance, decomposition of variability, total variability, variability between groups, variability within groups, residual variability, F distribution, factor, Chi-square distribution, observed (empirical) frequencies, expected (theoretical) frequencies, qualitative data, contingency table, strength of a relationship

         

        What is Analysis Of Variance – ANOVA

        Analysis of variance (ANOVA) is an analysis tool used in statistics that splits the aggregate variability found inside a data set into two parts: systematic factors and random factors. The systematic factors have a statistical influence on the given data set, but the random factors do not. Analysts use the analysis of variance test to determine the effect that independent variables have on the dependent variable in a regression study.

         

        BREAKING DOWN Analysis Of Variance – ANOVA

        The analysis of variance test is the initial step in analyzing factors that affect a given data set. Once the analysis of variance test is finished, an analyst performs additional testing on the systematic factors that measurably contribute to the data set’s variability. The analyst utilizes the analysis of variance test results in an F-test to generate additional data that aligns with the proposed regression models.

        The test allows a comparison of more than two groups at the same time to determine whether a relationship exists between them. It analyzes the variability between and within samples. For example, a researcher might test students from multiple colleges to see if students from one of the colleges consistently outperform the others. Also, an R&D researcher might test two different processes of creating a product to see if one process is better than the other in terms of cost efficiency.

         

        How to Use ANOVA

        The type of ANOVA run depends on a number of factors. It is applied when the data are experimental. Analysis of variance can even be computed by hand when there is no access to statistical software; it is simple to use and best suited for small samples. With many experimental designs, the sample sizes have to be the same for the various factor-level combinations.

        Analysis of variance is helpful for comparing three or more groups. It is similar to running multiple two-sample t-tests; however, it results in fewer Type I errors and is appropriate for a range of problems. ANOVA assesses group differences by comparing the means of each group, and it involves partitioning the total variance into its diverse sources. It is employed with subjects, test groups, and variation between and within groups.

         

        Types of ANOVA

        There are two types of analysis of variance: one-way (or unidirectional) and two-way. One-way or two-way refers to the number of independent variables in your Analysis of Variance test. A one-way ANOVA evaluates the impact of a sole factor on a sole response variable. It determines whether all the samples are the same. The one-way ANOVA is used to determine whether there are any statistically significant differences between the means of three or more independent (unrelated) groups.

        A two-way ANOVA is an extension of the one-way ANOVA. With a one-way ANOVA, you have one independent variable affecting a dependent variable; with a two-way ANOVA, there are two independent variables. For example, a two-way ANOVA allows a company to compare worker productivity based on two independent variables, say salary and skill set. It is utilized to observe the interaction between the two factors and tests the effect of both factors at the same time.

         

        History

        The t- and z-tests developed in the 20th century were used until 1918, when Ronald Fisher created the analysis of variance. ANOVA is also called the Fisher analysis of variance, and it is the extension of the t- and the z-tests. The term became well-known in 1925, after appearing in Fisher’s book, “Statistical Methods for Research Workers.” It was employed in experimental psychology and later expanded to subjects that are more complex.

        The formula for F used in ANOVA is F = MSB / MSW, where MSB is the between-group variance estimate and MSW is the within-group variance estimate. Every variance estimate has two parts: the sum of squares (SSB and SSW) and the degrees of freedom (df).
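        The F = MSB/MSW formula can be sketched for a one-way ANOVA; the three groups of observations below are invented:

```python
def one_way_anova_f(groups):
    """Decompose variability into between-group (MSB) and within-group (MSW)."""
    observations = [x for g in groups for x in g]
    grand_mean = sum(observations) / len(observations)
    k, n = len(groups), len(observations)

    # Sums of squares: between groups (SSB) and within groups (SSW)
    ssb = sum(len(g) * ((sum(g) / len(g)) - grand_mean) ** 2 for g in groups)
    ssw = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)

    msb = ssb / (k - 1)   # between-group variance estimate, df = k - 1
    msw = ssw / (n - k)   # within-group variance estimate, df = n - k
    return msb / msw

# Three groups whose means clearly differ yield a large F statistic:
f_stat = one_way_anova_f([[1, 2, 3], [2, 3, 4], [8, 9, 10]])
```

        The resulting F statistic would then be compared against the F distribution with (k − 1, n − k) degrees of freedom.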

         

        Useful Materials

        ANOVA Notes

        Chi-Square Test of Independence Notes


        • The Statistics toolset contains tools that perform standard statistical analysis (such as mean, minimum, maximum, and standard deviation) on attribute data as well as tools that calculate area, length, and count statistics for overlapping and neighboring features. The toolset also includes the Enrich Layer tool that adds demographic and landscape facts that surround your data.

          Inherent in GIS data is information on the attributes of features as well as their locations. This information is used to create maps that can be visually analyzed. Statistical analysis helps you extract additional information from your GIS data that might not be obvious simply by looking at a map—information such as how attribute values are distributed, whether there are spatial trends in the data, or whether the features form spatial patterns. Unlike query functions—such as identify or selection, which provide information about individual features—statistical analysis reveals the characteristics of a set of features as a whole.

          Some of the statistical analysis techniques described in this lesson are most well-suited for interactive applications, such as ArcMap, that allow you to select and visualize data in an ad-hoc and fluid environment. Some of the methods described here are found in ArcMap’s menus and toolbars and don’t have a geoprocessing tool counterpart. Other methods, such as the spatial statistics tools, are only implemented as geoprocessing tools.


        • Basic Probability


          Key words:
          probability, event, opposite event, complement event, certain event, conditional probability, cumulative distribution function, continuous random variable, discrete random variable, normal distribution, probability density function, sample space

          In this lesson, we learn the fundamental concepts of probability. It is this lesson that will allow us to start putting our first tools into our new probability toolbox.

           

          Objectives

          In this lesson, we will:

          • Learn why an understanding of probability is so critically important to the advancement of most kinds of scientific research.
          • Learn the definition of an event.
          • Learn how to derive new events by taking subsets, unions, intersections, and/or complements of already existing events.
          • Learn the definitions of specific kinds of events, namely empty events, mutually exclusive (or disjoint) events, and exhaustive events.
          • Learn the formal definition of probability.
          • Learn three ways — the personal opinion approach, the relative frequency approach, and the classical approach — of assigning a probability to an event.
          • Learn five fundamental theorems, which when applied, allow us to determine the probabilities of various events.
          • Get lots of practice calculating probabilities of various events.
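          The classical approach to assigning probabilities (favourable outcomes over equally likely outcomes) can be sketched with the two-dice experiment from the materials list:

```python
from fractions import Fraction
from itertools import product

SAMPLE_SPACE = list(product(range(1, 7), repeat=2))   # 36 equally likely rolls

def prob(event):
    """Classical probability: favourable outcomes / total outcomes."""
    favourable = [s for s in SAMPLE_SPACE if event(s)]
    return Fraction(len(favourable), len(SAMPLE_SPACE))

p_seven = prob(lambda s: s[0] + s[1] == 7)        # 6/36 = 1/6
p_not_seven = 1 - p_seven                          # complement rule: 5/6
p_seven_or_doubles = prob(lambda s: s[0] + s[1] == 7 or s[0] == s[1])
# "Sum is 7" and "doubles" are mutually exclusive (a 7 cannot come from
# equal faces), so their union has probability 1/6 + 1/6 = 1/3.
```

          Using Fraction keeps the results exact, which makes checking the fundamental theorems (complements, unions of disjoint events) straightforward.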

           

          Useful Materials

          Descriptive Statistics Notes

          Probability For Dummies

          Basic Probability Concepts

          Calculating Two Dice Probability

          Why Lack of Probability Knowledge Drives Poverty

          Probability for high school learners