{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": "<b>Project Scenario</b> <p>\nAs a Data Scientist with a housing agency in Boston, MA, you have been given access to a previous dataset on housing prices derived from the U.S. Census Service to present insights to higher management. Based on your experience in Statistics, what information can you provide them to help with making an informed decision? Upper management would like some insight into the following:\n\n- Is there a significant difference in the median value of houses bounded by the Charles River and those that are not?\n- Is there a difference in median values of houses of each proportion of owner-occupied units built before 1940?\n- Can we conclude that there is no relationship between nitric oxide concentrations and the proportion of non-retail business acres per town?\n- What is the impact of an additional weighted distance to the five Boston employment centres on the median value of owner-occupied homes?\n\nUsing appropriate graphs and charts, generate basic statistics and visualizations that will give upper management useful insight into the questions they are asking. Include an explanation of each statistic alongside your graphs.\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "<b>Column descriptions</b>\n- CRIM: per capita crime rate by town.\n- ZN: proportion of residential land zoned for lots over 25,000 sq.ft.\n- INDUS: proportion of non-retail business acres per town.\n- CHAS: Charles River dummy variable (1 if tract bounds river; 0 otherwise).\n- NOX: nitric oxides concentration (parts per 10 million).\n- RM: average number of rooms per dwelling.\n- AGE: proportion of owner-occupied units built prior to 1940.\n- DIS: weighted distances to five Boston employment centres.\n- RAD: index of accessibility to radial highways.\n- TAX: full-value property-tax rate per 10,000 US dollars.\n- PTRATIO: pupil-teacher ratio by town.\n- LSTAT: percentage of the population of lower status.\n- MEDV: median value of owner-occupied homes in thousands of US dollars."
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": "# Import libraries.\nimport numpy as np\nimport pandas as pd\nimport scipy.stats as stats\nimport matplotlib.pyplot as plt\nimport seaborn as sns"
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": "# Create a dataframe from the dataset at the given URL.\nboston_url = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ST0151EN-SkillsNetwork/labs/boston_housing.csv'\nboston_df = pd.read_csv(boston_url)"
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": "Dataframe's top entries: \n 0 1 2 3 4\nUnnamed: 0 0.00000 1.00000 2.00000 3.00000 4.00000\nCRIM 0.00632 0.02731 0.02729 0.03237 0.06905\nZN 18.00000 0.00000 0.00000 0.00000 0.00000\nINDUS 2.31000 7.07000 7.07000 2.18000 2.18000\nCHAS 0.00000 0.00000 0.00000 0.00000 0.00000\nNOX 0.53800 0.46900 0.46900 0.45800 0.45800\nRM 6.57500 6.42100 7.18500 6.99800 7.14700\nAGE 65.20000 78.90000 61.10000 45.80000 54.20000\nDIS 4.09000 4.96710 4.96710 6.06220 6.06220\nRAD 1.00000 2.00000 2.00000 3.00000 3.00000\nTAX 296.00000 242.00000 242.00000 222.00000 222.00000\nPTRATIO 15.30000 17.80000 17.80000 18.70000 18.70000\nLSTAT 4.98000 9.14000 4.03000 2.94000 5.33000\nMEDV 24.00000 21.60000 34.70000 33.40000 36.20000\n\n\nDataframe's last entries: \n 501 502 503 504 505\nUnnamed: 0 501.00000 502.00000 503.00000 504.00000 505.00000\nCRIM 0.06263 0.04527 0.06076 0.10959 0.04741\nZN 0.00000 0.00000 0.00000 0.00000 0.00000\nINDUS 11.93000 11.93000 11.93000 11.93000 11.93000\nCHAS 0.00000 0.00000 0.00000 0.00000 0.00000\nNOX 0.57300 0.57300 0.57300 0.57300 0.57300\nRM 6.59300 6.12000 6.97600 6.79400 6.03000\nAGE 69.10000 76.70000 91.00000 89.30000 80.80000\nDIS 2.47860 2.28750 2.16750 2.38890 2.50500\nRAD 1.00000 1.00000 1.00000 1.00000 1.00000\nTAX 273.00000 273.00000 273.00000 273.00000 273.00000\nPTRATIO 21.00000 21.00000 21.00000 21.00000 21.00000\nLSTAT 9.67000 9.08000 5.64000 6.48000 7.88000\nMEDV 22.40000 20.60000 23.90000 22.00000 11.90000\n"
}
],
"source": "# Sanity check of the dataset.\n# transpose() turns columns into rows so that every column stays visible in the printed output.\nprint('Dataframe\\'s top entries: \\n', boston_df.head().transpose())\nprint('\\n\\nDataframe\\'s last entries: \\n', boston_df.tail().transpose())"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "- All entries seem normal. \n- To make sure, let's check via a summary.\n- We will view the number of entries and shape of the dataframe as well."
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>Unnamed: 0</th>\n <th>CRIM</th>\n <th>ZN</th>\n <th>INDUS</th>\n <th>CHAS</th>\n <th>NOX</th>\n <th>RM</th>\n <th>AGE</th>\n <th>DIS</th>\n <th>RAD</th>\n <th>TAX</th>\n <th>PTRATIO</th>\n <th>LSTAT</th>\n <th>MEDV</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>count</th>\n <td>506.000000</td>\n <td>506.000000</td>\n <td>506.000000</td>\n <td>506.000000</td>\n <td>506.000000</td>\n <td>506.000000</td>\n <td>506.000000</td>\n <td>506.000000</td>\n <td>506.000000</td>\n <td>506.000000</td>\n <td>506.000000</td>\n <td>506.000000</td>\n <td>506.000000</td>\n <td>506.000000</td>\n </tr>\n <tr>\n <th>mean</th>\n <td>252.500000</td>\n <td>3.613524</td>\n <td>11.363636</td>\n <td>11.136779</td>\n <td>0.069170</td>\n <td>0.554695</td>\n <td>6.284634</td>\n <td>68.574901</td>\n <td>3.795043</td>\n <td>9.549407</td>\n <td>408.237154</td>\n <td>18.455534</td>\n <td>12.653063</td>\n <td>22.532806</td>\n </tr>\n <tr>\n <th>std</th>\n <td>146.213884</td>\n <td>8.601545</td>\n <td>23.322453</td>\n <td>6.860353</td>\n <td>0.253994</td>\n <td>0.115878</td>\n <td>0.702617</td>\n <td>28.148861</td>\n <td>2.105710</td>\n <td>8.707259</td>\n <td>168.537116</td>\n <td>2.164946</td>\n <td>7.141062</td>\n <td>9.197104</td>\n </tr>\n <tr>\n <th>min</th>\n <td>0.000000</td>\n <td>0.006320</td>\n <td>0.000000</td>\n <td>0.460000</td>\n <td>0.000000</td>\n <td>0.385000</td>\n <td>3.561000</td>\n <td>2.900000</td>\n <td>1.129600</td>\n <td>1.000000</td>\n <td>187.000000</td>\n <td>12.600000</td>\n <td>1.730000</td>\n <td>5.000000</td>\n </tr>\n <tr>\n <th>25%</th>\n <td>126.250000</td>\n <td>0.082045</td>\n <td>0.000000</td>\n <td>5.190000</td>\n <td>0.000000</td>\n 
<td>0.449000</td>\n <td>5.885500</td>\n <td>45.025000</td>\n <td>2.100175</td>\n <td>4.000000</td>\n <td>279.000000</td>\n <td>17.400000</td>\n <td>6.950000</td>\n <td>17.025000</td>\n </tr>\n <tr>\n <th>50%</th>\n <td>252.500000</td>\n <td>0.256510</td>\n <td>0.000000</td>\n <td>9.690000</td>\n <td>0.000000</td>\n <td>0.538000</td>\n <td>6.208500</td>\n <td>77.500000</td>\n <td>3.207450</td>\n <td>5.000000</td>\n <td>330.000000</td>\n <td>19.050000</td>\n <td>11.360000</td>\n <td>21.200000</td>\n </tr>\n <tr>\n <th>75%</th>\n <td>378.750000</td>\n <td>3.677082</td>\n <td>12.500000</td>\n <td>18.100000</td>\n <td>0.000000</td>\n <td>0.624000</td>\n <td>6.623500</td>\n <td>94.075000</td>\n <td>5.188425</td>\n <td>24.000000</td>\n <td>666.000000</td>\n <td>20.200000</td>\n <td>16.955000</td>\n <td>25.000000</td>\n </tr>\n <tr>\n <th>max</th>\n <td>505.000000</td>\n <td>88.976200</td>\n <td>100.000000</td>\n <td>27.740000</td>\n <td>1.000000</td>\n <td>0.871000</td>\n <td>8.780000</td>\n <td>100.000000</td>\n <td>12.126500</td>\n <td>24.000000</td>\n <td>711.000000</td>\n <td>22.000000</td>\n <td>37.970000</td>\n <td>50.000000</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " Unnamed: 0 CRIM ZN INDUS CHAS NOX \\\ncount 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 \nmean 252.500000 3.613524 11.363636 11.136779 0.069170 0.554695 \nstd 146.213884 8.601545 23.322453 6.860353 0.253994 0.115878 \nmin 0.000000 0.006320 0.000000 0.460000 0.000000 0.385000 \n25% 126.250000 0.082045 0.000000 5.190000 0.000000 0.449000 \n50% 252.500000 0.256510 0.000000 9.690000 0.000000 0.538000 \n75% 378.750000 3.677082 12.500000 18.100000 0.000000 0.624000 \nmax 505.000000 88.976200 100.000000 27.740000 1.000000 0.871000 \n\n RM AGE DIS RAD TAX PTRATIO \\\ncount 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 \nmean 6.284634 68.574901 3.795043 9.549407 408.237154 18.455534 \nstd 0.702617 28.148861 2.105710 8.707259 168.537116 2.164946 \nmin 3.561000 2.900000 1.129600 1.000000 187.000000 12.600000 \n25% 5.885500 45.025000 2.100175 4.000000 279.000000 17.400000 \n50% 6.208500 77.500000 3.207450 5.000000 330.000000 19.050000 \n75% 6.623500 94.075000 5.188425 24.000000 666.000000 20.200000 \nmax 8.780000 100.000000 12.126500 24.000000 711.000000 22.000000 \n\n LSTAT MEDV \ncount 506.000000 506.000000 \nmean 12.653063 22.532806 \nstd 7.141062 9.197104 \nmin 1.730000 5.000000 \n25% 6.950000 17.025000 \n50% 11.360000 21.200000 \n75% 16.955000 25.000000 \nmax 37.970000 50.000000 "
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": "# Basic statistical summary of the dataset.\nboston_df.describe()"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "- The dataframe has only 506 entries.\n- Let's check the data types as well as the completeness of the dataset, i.e. are there any missing values?"
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": "<bound method DataFrame.info of Unnamed: 0 CRIM ZN INDUS CHAS NOX RM AGE DIS RAD \\\n0 0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 \n1 1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 \n2 2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 \n3 3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 \n4 4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 \n.. ... ... ... ... ... ... ... ... ... ... \n501 501 0.06263 0.0 11.93 0.0 0.573 6.593 69.1 2.4786 1.0 \n502 502 0.04527 0.0 11.93 0.0 0.573 6.120 76.7 2.2875 1.0 \n503 503 0.06076 0.0 11.93 0.0 0.573 6.976 91.0 2.1675 1.0 \n504 504 0.10959 0.0 11.93 0.0 0.573 6.794 89.3 2.3889 1.0 \n505 505 0.04741 0.0 11.93 0.0 0.573 6.030 80.8 2.5050 1.0 \n\n TAX PTRATIO LSTAT MEDV \n0 296.0 15.3 4.98 24.0 \n1 242.0 17.8 9.14 21.6 \n2 242.0 17.8 4.03 34.7 \n3 222.0 18.7 2.94 33.4 \n4 222.0 18.7 5.33 36.2 \n.. ... ... ... ... \n501 273.0 21.0 9.67 22.4 \n502 273.0 21.0 9.08 20.6 \n503 273.0 21.0 5.64 23.9 \n504 273.0 21.0 6.48 22.0 \n505 273.0 21.0 7.88 11.9 \n\n[506 rows x 14 columns]>"
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
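"source": "# Note: without parentheses, this displays the bound method's repr (which embeds the dataframe)\n# instead of calling info(). The actual info() call is made in a later cell.\nboston_df.info"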
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": "(506, 14)"
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": "# Alternatively check the shape\nboston_df.shape"
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": "<class 'pandas.core.frame.DataFrame'>\nRangeIndex: 506 entries, 0 to 505\nData columns (total 14 columns):\n # Column Non-Null Count Dtype \n--- ------ -------------- ----- \n 0 Unnamed: 0 506 non-null int64 \n 1 CRIM 506 non-null float64\n 2 ZN 506 non-null float64\n 3 INDUS 506 non-null float64\n 4 CHAS 506 non-null float64\n 5 NOX 506 non-null float64\n 6 RM 506 non-null float64\n 7 AGE 506 non-null float64\n 8 DIS 506 non-null float64\n 9 RAD 506 non-null float64\n 10 TAX 506 non-null float64\n 11 PTRATIO 506 non-null float64\n 12 LSTAT 506 non-null float64\n 13 MEDV 506 non-null float64\ndtypes: float64(13), int64(1)\nmemory usage: 55.5 KB\n"
}
],
"source": "# Check data types and for missing entries.\nboston_df.info()"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "- Our data has 506 rows (excluding the header row) and 14 columns.\n- All data is numeric, either int64 or float64, and no column has missing values.\n- The first column (`Unnamed: 0`) is not useful, as it is just a row index. We will remove it from our dataframe over the course of the assessment.\n- As part of Exploratory Data Analysis (EDA), check the column distributions."
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": "<Figure size 1800x1800 with 0 Axes>"
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {