• Causal Inference

    This note provides an overview of causal inference for an introductory data science course. First, the note discusses observational studies and confounding variables. Next, the note describes how randomized experiments can be used to account for the effect of confounding variables. Then it walks through the steps of designing an experiment, including a discussion of how to calculate the power of a test.
  • Exploratory Data Analysis

    This module note provides an overview of exploratory data analysis for an introductory data science course. It begins by defining the term "data", and then describes the different types of data that companies work with (structured vs. unstructured, categorical vs. numeric, etc.). Next, the note describes the basic summary statistics that firms use to track key business outcomes. Finally, the note provides an overview of different visualizations. An appendix includes the R code for creating all of the figures and visualizations shown in the note.
  • Statistical Inference

    This note provides an overview of statistical inference for an introductory data science course. First, the note discusses samples and populations. Next, the note describes how to calculate confidence intervals for means and proportions. Then it walks through the logic of hypothesis testing and the interpretation of p-values (in the context of two-sample hypothesis testing for means and proportions). The appendix of the note contains R code for all of these topics.
  • Linear Regression

    This note provides an overview of linear regression for an introductory data science course. It begins with a discussion of correlation, and explains why correlation does not necessarily imply causation. The note then describes the method of least squares, and how to interpret the R-squared and coefficient values of a simple linear regression model. Next, the note describes how the interpretation of a model coefficient changes when there are multiple independent variables in the model. Finally, the note explains how to interpret the coefficients on dummy variables in a regression model. The appendix includes R code for implementing all of these topics.
  • Prediction & Machine Learning

    This note provides an introduction to machine learning for an introductory data science course. The note begins with a description of supervised, unsupervised, and reinforcement learning. Then, the note provides a brief explanation of the difference between traditional statistical modeling and machine learning. Next, the note covers two models used for classification: logistic regression and decision trees. After introducing these two models, the note explains how training, validation, and holdout sets (and k-fold cross-validation) are used to tune and evaluate different models. Finally, the note concludes with a discussion of different performance metrics (ROC curves, confusion matrices, log loss) that are used to evaluate classification models.
  • Precision Paint Co.

    Describes a marketing director about to launch a new process for demand forecasting. Provides data that allow students to do a multivariable regression analysis. A rewritten version of an earlier case.
  • Probability Distributions

    This technical note introduces students to the concept of random variables, and from there the normal and binomial distributions. After a brief introduction to random variables, the note describes the standard properties of the normal distribution: a single peak, and a symmetric, bell-shaped curve. Students observe the 68-95-99.7 rule, and see how the distribution changes with different values of the mean and standard deviation parameters. Finally, the note demonstrates how probability calculations based on the normal distribution can be done in the R programming language, and how random data can be simulated from a normal curve in R. The note then describes the standard properties of the binomial distribution, and similarly shows how binomial calculations can be performed in R.