Created March 1, 2020 20:56
| {"cells":[{"metadata":{},"cell_type":"markdown","source":"# Simple Linear Regression: Salary as a Function of Years of Experience"},{"metadata":{},"cell_type":"markdown","source":"Welcome to my R notebook! Execute the code blocks below to see the results and plots of this simple linear regression. You can do so by pressing 'Shift+Enter' with your cursor in the cell or by pushing the 'Play' button on the left side of the code block. Enjoy!"},{"metadata":{"_uuid":"051d70d956493feee0c6d64651c6a088724dca2a","_execution_state":"idle","trusted":true,"_kg_hide-output":true},"cell_type":"code","source":"library(tidyverse)\nlist.files(path = \"../input\")","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## Step 1. Hypothesize the Model"},{"metadata":{},"cell_type":"markdown","source":"This one is pretty easy. There are only two variables, so this will be a simple linear regression with the response variable being **Salary** and the predictor variable being **Years of Experience**.\n\nHere is the hypothesized model: \\\\( y=\\beta_{0}+\\beta_{1}x+\\epsilon \\\\)"},{"metadata":{},"cell_type":"markdown","source":"## Step 2. Collect the Data"},{"metadata":{"trusted":true},"cell_type":"code","source":"salary <- read_csv(\"../input/salary-data-simple-linear-regression/Salary_Data.csv\")\nhead(salary)","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"**<span style=\"color:red\">SPOILER ALERT!</span>** The following scatter plot displays what appears to be a linear relationship between salary and years of experience in this data set. Feel free to skip ahead to **Step 3** below if you prefer to continue in suspense about whether a significant relationship really exists!"},{"metadata":{"trusted":true},"cell_type":"code","source":"scatter.smooth(salary$Salary, salary$YearsExperience, main=\"Salary vs. 
Years of Experience\",\n xlab=\"Salary\", ylab=\"Years of Experience\", col=\"blue\")","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## Step 3. Estimate Parameters"},{"metadata":{},"cell_type":"markdown","source":"Here I'll create the model and see if the relationship between the predictor and response variables is significant."},{"metadata":{"trusted":true},"cell_type":"code","source":"salary.slr <- lm(Salary ~ YearsExperience, data=salary)\nsummary(salary.slr)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"print(salary.slr)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"cor(salary$YearsExperience, salary$Salary)","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## Step 4. Check Assumptions"},{"metadata":{},"cell_type":"markdown","source":"### Assumptions 1 & 2: Mean of Error = 0, Variance of Error is constant"},{"metadata":{},"cell_type":"markdown","source":"You can see from the Predicted vs. Actual plot that there is a pretty good fit. No red flags."},{"metadata":{"trusted":true},"cell_type":"code","source":"plot(predict(salary.slr),salary$Salary, xlab=\"Predicted\",ylab=\"Actual\")\nabline(0,1, col=\"red\")","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"Residuals vs. Fitted shows similar information, but it's not as easy to read. I prefer Fitted vs. 
Actual."},{"metadata":{"trusted":true},"cell_type":"code","source":"plot(salary.slr, which=1, col=\"blue\")","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"### Assumption 3: Error is Normally distributed"},{"metadata":{},"cell_type":"markdown","source":"There are a few points that are a little ways off at the extremes of the QQ plot below, but I don't think it is indicative of a terribly non-Normal distribution."},{"metadata":{"trusted":true},"cell_type":"code","source":"plot(salary.slr, which=2, col=\"blue\")","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"car::outlierTest(salary.slr)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"plot(salary.slr, which=4, col=\"blue\")","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"### Assumption 4: Errors are independent of one another"},{"metadata":{},"cell_type":"markdown","source":"Normally, the Durbin-Watson test would be appropriate for determining independence of error; however, this dataset only has 30 observations, and the Durbin-Watson test requires at least 50 for a reliable result. Instead, I graphed the independent variable **Years of Experience** vs. the residuals of the predicted values and looked for any patterns that might indicate lack of independence. As you will see below, the residuals are independent of one another."},{"metadata":{"trusted":true},"cell_type":"code","source":"scatter.smooth(salary$YearsExperience, resid(salary.slr),\n xlab=\"Years of Experience\", ylab=\"Residuals\", main=\"Residuals vs. Independent Variable Plot\")\nabline(h = 0)","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"### Bonus: I learned how to use par(mfrow) while working on this. 
The plots are small, but it's a nice at-a-glance reference."},{"metadata":{"trusted":true},"cell_type":"code","source":"par(mfrow = c(2,2))\nplot(salary.slr, col=\"blue\")","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## Step 5. Check the Model Against New Data"},{"metadata":{},"cell_type":"markdown","source":"This is a small sample, so I can validate the model jackknife-style: run the regression again while withholding a single observation, then check the prediction for that observation."},{"metadata":{},"cell_type":"markdown","source":"The original dataset was sorted by Years of Experience. I want to take a random sample, so first I'll randomize the dataframe:"},{"metadata":{"_kg_hide-output":true,"trusted":true,"collapsed":true},"cell_type":"code","source":"set.seed(42)\nrows <- sample(nrow(salary))\nshuffled <- salary[rows, ]\nhead(shuffled) #shuffled is the randomized version of the original salary dataframe","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"salary2 <- shuffled[1:29, ] #salary2 is the first 29 observations in the randomized dataframe.\nhead(salary2) #The last observation will be used to validate the model.","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"salary2.slr <- lm(Salary ~ YearsExperience, data=salary2) #Refit the regression on the 29 training observations\nsummary(salary2.slr) #Coefficients should be close to those of the full-data model","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"salaryTest <- shuffled[30, ] #salaryTest is the validation observation\nsalaryTest","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"exp <- salaryTest$YearsExperience #Years of experience of the held-out observation\nPred_Salary <- unname(predict(salary2.slr, newdata = salaryTest)) #Predict from the refit model instead of hardcoded coefficients\ndifference <- abs(salaryTest$Salary - Pred_Salary)\nsprintf(\"The predicted salary is $%s\", round(Pred_Salary, 2)) #Print the predicted salary\nsprintf(\"The actual salary is $%s\", salaryTest$Salary) #Print the actual salary\nsprintf(\"Difference of $%s\", round(difference, 2)) #Print the difference between predicted and actual","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"scatter.smooth(salary$Salary, salary$YearsExperience)\nabline(v=Pred_Salary, h=exp, col=\"red\") #The red lines correspond to the predicted values","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## Conclusion"},{"metadata":{},"cell_type":"markdown","source":"I really enjoyed creating this regression. One of the lessons I learned was to plan ahead for validation. If I had thought ahead, I would have set aside a validation observation before creating the initial model and saved myself some extra work. In any case, it all worked out in the end.\n\nMy next step is to do something similar in Python. I would also like to experiment with multiple predictor variables and logistic regression. Thanks for reading!"}],"metadata":{"kernelspec":{"display_name":"R","language":"R","name":"ir"},"language_info":{"mimetype":"text/x-r-source","name":"R","pygments_lexer":"r","version":"3.4.2","file_extension":".r","codemirror_mode":"r"}},"nbformat":4,"nbformat_minor":4} |