Logistic regression is a modeling technique often used to predict a binary variable: a variable coded as 1 if an event of interest occurs (e.g., a borrower defaults on a loan) and 0 otherwise. This note details how logistic regression applies the logistic function to generate a probability forecast for a binary event. It also includes an example of how to fit a logistic regression model to loan default data using StatTools (an Excel add-in). The StatTools output is then used to predict a loan's default as a function of the borrower's credit score.
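A minimal sketch of the same workflow outside StatTools, using a handful of hypothetical (credit score, default) observations; the fitted coefficients play the role of the StatTools output, and the logistic function turns the linear predictor into a default probability.

```python
# Sketch only: hypothetical loan data, not the note's StatTools example.
import numpy as np
import statsmodels.api as sm

credit_score = np.array([580, 610, 640, 660, 690, 720, 750, 780, 800, 820])
default = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])  # 1 = borrower defaulted

X = sm.add_constant(credit_score)        # intercept + credit score
fit = sm.Logit(default, X).fit(disp=0)   # coefficients, analogous to the StatTools output
b0, b1 = fit.params

# Predicted default probability for a borrower with a 700 credit score.
prob_default = 1 / (1 + np.exp(-(b0 + b1 * 700)))  # logistic function
print(round(prob_default, 3))
```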
Google Cloud Platform offers BigQuery ML, a popular cloud computing resource for developing machine-learning models with SQL. This note provides information about creating, evaluating, and deploying models with BigQuery ML.
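A minimal sketch of that create/evaluate/deploy sequence using BigQuery ML's SQL statements submitted through the Python client library; the dataset, table, and column names below are hypothetical placeholders.

```python
# Sketch of the BigQuery ML workflow; `mydataset.*` names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()  # assumes GCP credentials and a default project

# 1. Create (train) a model with standard SQL.
client.query("""
    CREATE OR REPLACE MODEL `mydataset.default_model`
    OPTIONS (model_type = 'logistic_reg', input_label_cols = ['defaulted']) AS
    SELECT credit_score, income, defaulted
    FROM `mydataset.loans`
""").result()

# 2. Evaluate the trained model.
for row in client.query(
    "SELECT * FROM ML.EVALUATE(MODEL `mydataset.default_model`)"
).result():
    print(dict(row))

# 3. Use (deploy) the model by querying it for predictions.
predictions = client.query("""
    SELECT * FROM ML.PREDICT(MODEL `mydataset.default_model`,
        (SELECT credit_score, income FROM `mydataset.new_applicants`))
""").result()
```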
Machine-learning (ML) models have become a common tool across a multitude of industries to help people make decisions. As these models have increased in predictive power, many have also grown in complexity. The pursuit of more accurate predictions has diminished the interpretability of many models, leaving users with little understanding of a model's behavior or trust in its predictions. The field of eXplainable artificial intelligence (XAI) seeks to encourage the development of interpretable models. Google Cloud Platform offers two Explainable AI functions in BigQuery ML that allow users to examine the attribution of model features, which aids in verifying model behavior and recognizing bias. One function provides a global perspective on the features used to train the model, while the second examines in more detail the local feature attribution associated with individual predictions. This note offers an overview of Explainable AI in BigQuery ML, using as an example a (fictional) realtor's linear regression model that predicts a home's latest sale price from predictor variables such as the total tax assessment from the year of the last sale, the square footage of the house, the number of bedrooms, the number of bathrooms, and whether the condition of the home is below average. After training the linear model, the feature attribution can be studied from both a global and a local perspective in BigQuery.
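A sketch of the two BigQuery ML Explainable AI functions that provide these views, ML.GLOBAL_EXPLAIN (global attribution) and ML.EXPLAIN_PREDICT (local attribution); the dataset, table, and column names are hypothetical stand-ins for the realtor's data.

```python
# Sketch only: `realty.*` names and columns are hypothetical placeholders.
from google.cloud import bigquery

client = bigquery.Client()

# Train the linear model with global explanations enabled.
client.query("""
    CREATE OR REPLACE MODEL `realty.price_model`
    OPTIONS (model_type = 'linear_reg',
             input_label_cols = ['sale_price'],
             enable_global_explain = TRUE) AS
    SELECT total_assessment, square_feet, bedrooms, bathrooms,
           below_average_condition, sale_price
    FROM `realty.homes`
""").result()

# Global view: average attribution of each feature across the training data.
global_attr = client.query(
    "SELECT * FROM ML.GLOBAL_EXPLAIN(MODEL `realty.price_model`)"
).result()

# Local view: feature attributions for individual predictions.
local_attr = client.query("""
    SELECT * FROM ML.EXPLAIN_PREDICT(MODEL `realty.price_model`,
        (SELECT * FROM `realty.new_listings`),
        STRUCT(5 AS top_k_features))
""").result()
```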
Google Cloud Platform (GCP) offers Cloud Storage, a popular cloud computing resource for data storage. This note provides information about uploading data to Cloud Storage, creating BigQuery tables from data found in Cloud Storage, previewing files in Cloud Storage, and sharing Cloud Storage buckets.
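A minimal sketch of two of the steps the note covers, uploading a file to Cloud Storage and creating a BigQuery table from it, using the Python client libraries; the bucket, file, dataset, and table names are hypothetical.

```python
# Sketch only: bucket and table names are hypothetical placeholders.
from google.cloud import storage, bigquery

# 1. Upload a local CSV file to a Cloud Storage bucket.
bucket = storage.Client().bucket("my-analytics-bucket")
bucket.blob("loans/loans.csv").upload_from_filename("loans.csv")

# 2. Create a BigQuery table from the file in Cloud Storage.
bq = bigquery.Client()
job = bq.load_table_from_uri(
    "gs://my-analytics-bucket/loans/loans.csv",
    "mydataset.loans",
    job_config=bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
    ),
)
job.result()  # wait for the load job to finish
```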
Tableau is powerful visualization software that supports the data analysis process. The program allows users to create custom visualizations from datasets of all sizes and types using a simple drag-and-drop interface. This note introduces new users to Tableau by providing guidance on connecting to data, exploring the interface, and creating a number of common visualizations.
Google Cloud Platform (GCP) offers BigQuery, a popular cloud computing resource for writing Structured Query Language (SQL) queries. This note provides step-by-step information about accessing BigQuery and using it to upload and explore data. It also includes a quick-start guide.
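A minimal sketch of running an exploratory SQL query against BigQuery from Python; the query below reads a public sample table, and an uploaded table in your own project would be queried the same way.

```python
# Sketch only: assumes GCP credentials and a default project are configured.
from google.cloud import bigquery

client = bigquery.Client()

sql = """
    SELECT corpus, SUM(word_count) AS total_words
    FROM `bigquery-public-data.samples.shakespeare`
    GROUP BY corpus
    ORDER BY total_words DESC
    LIMIT 5
"""
for row in client.query(sql).result():
    print(row.corpus, row.total_words)
```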
This exercise explores customer transaction data generated from a business owner's website and illustrates how basic data analytics practices can uncover business insights. Students discover the business's purchase trends by writing SQL queries to answer the provided questions. The questions guide students to determine which product offerings the business should promote and which customer segments to target. Additionally, the case discusses the relational database design commonly associated with transactional data and its metadata.
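A sketch of the kind of query the exercise calls for, run here against an in-memory SQLite database with a hypothetical products/transactions schema that mirrors a typical relational design for transactional data; the case's actual schema and data may differ.

```python
# Sketch only: schema, products, and numbers are hypothetical.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE products (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE transactions (
        transaction_id INTEGER PRIMARY KEY,
        product_id INTEGER REFERENCES products(product_id),
        customer_segment TEXT,
        quantity INTEGER,
        unit_price REAL
    );
    INSERT INTO products VALUES (1, 'Espresso Beans', 'Coffee'), (2, 'Travel Mug', 'Accessories');
    INSERT INTO transactions VALUES
        (1, 1, 'Retail', 2, 14.99),
        (2, 1, 'Wholesale', 20, 11.50),
        (3, 2, 'Retail', 1, 24.00);
""")

# Revenue by product and customer segment: which offerings to promote,
# and which segments to target.
rows = con.execute("""
    SELECT p.name, t.customer_segment, SUM(t.quantity * t.unit_price) AS revenue
    FROM transactions AS t
    JOIN products AS p ON p.product_id = t.product_id
    GROUP BY p.name, t.customer_segment
    ORDER BY revenue DESC
""").fetchall()
for name, segment, revenue in rows:
    print(name, segment, round(revenue, 2))
```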
This case, which has been taught successfully in a Darden online class, allows for an introductory application of the Tableau analytics platform. In 2012, Carvana Co., an e-commerce platform for buying used cars, hosted a competition called "Don't Get Kicked!" in which 570 teams competed to predict whether a car purchased at auction was a "kick" (i.e., a bad buy): a vehicle with a major defect. To compete, teams downloaded Carvana's data from Kaggle's website. At the time of the competition, data science was a burgeoning field, and industry watchers wondered if machine learning could help a company such as Carvana develop a competitive advantage. This case analyzes the US used-car market, Carvana's history and Kaggle's role in its development, and the viability of data science, particularly visual analytics, in guiding business and consumer decisions.
Two recently graduated MBA students are tasked with developing an ad-serving learning algorithm for a mobile ad-serving company. The case illustrates how hypotheses can be tested in an A/B format, or "horse race," to establish customer preferences and superior profitability. The case was written for an elective course covering hypothesis testing.
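A minimal sketch of how such a horse race might be scored, assuming hypothetical click and impression counts for two ad-serving variants and using a two-proportion z-test; the case itself does not prescribe this particular test.

```python
# Sketch only: click and impression counts are hypothetical.
from statsmodels.stats.proportion import proportions_ztest

clicks = [310, 262]           # clicks for variants A and B
impressions = [10000, 10000]  # impressions served to each variant

z_stat, p_value = proportions_ztest(count=clicks, nobs=impressions)
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")
# A small p-value suggests the difference in click-through rates is unlikely
# to be chance, supporting a choice between the two serving rules.
```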
This case provides experience with developing a revenue growth forecast using time series analysis. Students are asked to run several exponential smoothing models to predict the growth in a high-profile social network's traffic from three years of monthly history. The model students select is used to forecast a stream of future revenues in a pro forma cash flow statement. The case is supported by a teaching note, an Excel spreadsheet, a .csv file, and an R file.
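A minimal sketch of the exponential smoothing step with hypothetical monthly traffic data; the case itself supplies the actual data and models in the Excel, .csv, and R files.

```python
# Sketch only: the traffic series below is synthetic, not the case data.
import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# 36 months of hypothetical traffic with trend and mild seasonality.
months = pd.date_range("2020-01-01", periods=36, freq="MS")
traffic = pd.Series(
    100 + 3 * np.arange(36) + 10 * np.sin(np.arange(36) * 2 * np.pi / 12),
    index=months,
)

# Holt-Winters exponential smoothing with additive trend and seasonality.
fit = ExponentialSmoothing(
    traffic, trend="add", seasonal="add", seasonal_periods=12
).fit()

forecast = fit.forecast(12)  # next 12 months, to feed the revenue pro forma
print(forecast.round(1))
```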
The purpose of this case is to introduce data visualization, advanced regression techniques, and supervised learning. Students are asked to visualize data geographically and in scatterplots. They will use stepwise regression and regression trees to select a predictive model for forecasting data in a holdout sample. In a forecasting competition, they will submit their models to be tested for accuracy. Supervised learning techniques, such as training, validation, and testing, are introduced. Regression trees serve as both predictive and graphical tools for communicating insights from data analysis to a decision maker.
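A minimal sketch of the regression-tree-plus-holdout workflow, using a bundled scikit-learn dataset as a stand-in for the case data.

```python
# Sketch only: the diabetes dataset substitutes for the case's data.
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

X, y = load_diabetes(return_X_y=True)
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Fit a regression tree on the training split, score it on the holdout.
tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_train, y_train)
print("holdout MAE:", round(mean_absolute_error(y_hold, tree.predict(X_hold)), 3))
```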
"This case serves to illustrate how averaging point forecasts harnesses the wisdom of crowds. Students access data from the Survey of Professional Forecasters (SPF) and compare the performance of the crowd (i.e., the average point forecasts) to the average performance of the individual panelists and the best performer from the previous period. The case is intended for use in a class on forecasting, and the instructor can present it in three ways: with all necessary SPF data cleaned and preprocessed in a student spreadsheet (UVA-QA-0805X, provided with the case); with code (also provided in the student spreadsheet) written by the case authors in R, the statistical computing package, as well as a supplementary handout (UVA-QA-0805H, also provided with the case), which walks students through R code, explaining how to clean and analyze the SPF data; or as a team project to be worked on over several days, providing neither the spreadsheet nor the supplement. The latter would be an excellent exercise in data retrieval, cleaning, reshaping, and analysis."
A large general-merchandise retailer misses its fiscal-year earnings-per-share guidance, so its CFO is charged with improving the firm's forecasts. This case presents the use of probability distributions for forecasting discrete and continuous uncertainties such as GDP growth, inflation, and unemployment, including the benefit of ranges and distributions over point estimates. The Federal Reserve Bank of Philadelphia's Survey of Professional Forecasters is introduced as a source of forecasts.
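A minimal sketch of the ranges-over-point-estimates idea: a small Monte Carlo draw over hypothetical macro uncertainties produces a distribution for earnings per share rather than a single number. All parameters and the EPS mapping below are illustrative, not from the case.

```python
# Sketch only: every distribution and coefficient here is made up.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

gdp_growth = rng.normal(0.025, 0.010, n)   # continuous uncertainty
inflation = rng.normal(0.030, 0.008, n)    # continuous uncertainty
recession = rng.binomial(1, 0.15, n)       # discrete uncertainty

# Illustrative mapping from macro inputs to earnings per share.
eps = 4.00 + 20 * gdp_growth - 10 * inflation - 0.50 * recession

low, median, high = np.percentile(eps, [10, 50, 90])
print(f"EPS 10th/50th/90th percentile: {low:.2f} / {median:.2f} / {high:.2f}")
```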