RorisangSitoboli · July 1, 2021 09:04
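Review note: the Project Scenario cell in the notebook diff below poses four questions, but this excerpt stops before any test is run. The following is a minimal sketch of how each question could map onto a `scipy.stats` call. It is not the notebook's own code. The synthetic frame built by `make_demo_frame`, the helper name `run_tests`, the AGE bin edges (35 and 70), and the choice of Welch's t-test are all assumptions made for illustration; only the column names come from the notebook.

```python
import numpy as np
import pandas as pd
import scipy.stats as stats


def make_demo_frame(n=200, seed=0):
    """Build a small synthetic stand-in (hypothetical) with the notebook's column names."""
    rng = np.random.default_rng(seed)
    chas = rng.integers(0, 2, n).astype(float)
    age = rng.uniform(2.9, 100.0, n)
    indus = rng.uniform(0.5, 27.0, n)
    nox = 0.4 + 0.01 * indus + rng.normal(0.0, 0.02, n)
    dis = rng.uniform(1.0, 12.0, n)
    medv = 18.0 + 1.0 * dis + 3.0 * chas + rng.normal(0.0, 2.0, n)
    return pd.DataFrame({"CHAS": chas, "AGE": age, "INDUS": indus,
                         "NOX": nox, "DIS": dis, "MEDV": medv})


def run_tests(df):
    """Run one test per scenario question and collect the key numbers."""
    results = {}

    # Q1: MEDV for tracts bounding the river (CHAS == 1) vs. the rest.
    river = df.loc[df["CHAS"] == 1, "MEDV"]
    no_river = df.loc[df["CHAS"] == 0, "MEDV"]
    # Welch's t-test (equal_var=False) does not assume equal variances.
    results["chas_ttest"] = float(
        stats.ttest_ind(river, no_river, equal_var=False).pvalue)

    # Q2: one-way ANOVA of MEDV across AGE bands (bin edges are an assumption).
    bands = pd.cut(df["AGE"], bins=[0, 35, 70, 100], include_lowest=True)
    groups = [g["MEDV"].to_numpy() for _, g in df.groupby(bands, observed=True)]
    results["age_anova"] = float(stats.f_oneway(*groups).pvalue)

    # Q3: Pearson correlation between NOX and INDUS.
    _, p_value = stats.pearsonr(df["NOX"], df["INDUS"])
    results["nox_indus_pearson"] = float(p_value)

    # Q4: simple linear regression of MEDV on DIS; the slope is the change
    # in MEDV per additional unit of weighted distance.
    fit = stats.linregress(df["DIS"], df["MEDV"])
    results["dis_slope"] = float(fit.slope)
    results["dis_pvalue"] = float(fit.pvalue)
    return results


demo_df = make_demo_frame()
results = run_tests(demo_df)
for name, value in results.items():
    print(f"{name}: {value:.4g}")
```

To try this on the real data, pass `boston_df` from the notebook to `run_tests` instead of `demo_df`. Each p-value would then be compared against a chosen significance level such as 0.05.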
diff --git a/Statistics for Data Science with Python.ipynb b/Statistics for Data Science with Python.ipynb
 {
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": "<b>Project Scenario</b> <p>\nAs a Data Scientist with a housing agency in Boston, MA, you have been given access to a previous dataset on housing prices derived from the U.S. Census Service to present insights to higher management. Based on your experience in Statistics, what information can you provide them to help with making an informed decision? Upper management would like to get some insight into the following.\n\n- Is there a significant difference in the median value of houses bounded by the Charles River or not?\n- Is there a difference in median values of houses for each proportion of owner-occupied units built before 1940?\n- Can we conclude that there is no relationship between nitric oxide concentrations and the proportion of non-retail business acres per town?\n- What is the impact of an additional weighted distance to the five Boston employment centres on the median value of owner-occupied homes?\n\nUsing the appropriate graphs and charts, generate basic statistics and visualizations that you think will be useful for upper management, giving them insight into the questions they are asking. In your graphs, include an explanation of each statistic.\n"
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": "<b>Column descriptions</b>\n- CRIM: per capita crime rate by town.\n- ZN: proportion of residential land zoned for lots over 25,000 sq.ft.\n- INDUS: proportion of non-retail business acres per town.\n- CHAS: Charles River dummy variable (1 if tract bounds river; 0 otherwise).\n- NOX: nitric oxides concentration (parts per 10 million).\n- RM: average number of rooms per dwelling.\n- AGE: proportion of owner-occupied units built prior to 1940.\n- DIS: weighted distances to five Boston employment centres.\n- RAD: index of accessibility to radial highways.\n- TAX: full-value property-tax rate per $10,000.\n- PTRATIO: pupil-teacher ratio by town.\n- LSTAT: percentage of the population of lower status.\n- MEDV: median value of owner-occupied homes in $1000s."
        },
        {
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {},
            "outputs": [],
            "source": "# Import libraries.\nimport numpy as np\nimport pandas as pd\nimport scipy.stats as stats\nimport matplotlib.pyplot as plt\nimport seaborn as sns"
        },
        {
            "cell_type": "code",
            "execution_count": 2,
            "metadata": {},
            "outputs": [],
            "source": "# Create dataframe from the given dataset in the url.\nboston_url = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ST0151EN-SkillsNetwork/labs/boston_housing.csv'\nboston_df=pd.read_csv(boston_url)"
        },
        {
            "cell_type": "code",
            "execution_count": 3,
            "metadata": {},
            "outputs": [
                {
                    "name": "stdout",
                    "output_type": "stream",
                    "text": "Dataframe's top entries: \n                     0          1          2          3          4\nUnnamed: 0    0.00000    1.00000    2.00000    3.00000    4.00000\nCRIM          0.00632    0.02731    0.02729    0.03237    0.06905\nZN           18.00000    0.00000    0.00000    0.00000    0.00000\nINDUS         2.31000    7.07000    7.07000    2.18000    2.18000\nCHAS          0.00000    0.00000    0.00000    0.00000    0.00000\nNOX           0.53800    0.46900    0.46900    0.45800    0.45800\nRM            6.57500    6.42100    7.18500    6.99800    7.14700\nAGE          65.20000   78.90000   61.10000   45.80000   54.20000\nDIS           4.09000    4.96710    4.96710    6.06220    6.06220\nRAD           1.00000    2.00000    2.00000    3.00000    3.00000\nTAX         296.00000  242.00000  242.00000  222.00000  222.00000\nPTRATIO      15.30000   17.80000   17.80000   18.70000   18.70000\nLSTAT         4.98000    9.14000    4.03000    2.94000    5.33000\nMEDV         24.00000   21.60000   34.70000   33.40000   36.20000\n\n\nDataframe's last entries: \n                   501        502        503        504        505\nUnnamed: 0  501.00000  502.00000  503.00000  504.00000  505.00000\nCRIM          0.06263    0.04527    0.06076    0.10959    0.04741\nZN            0.00000    0.00000    0.00000    0.00000    0.00000\nINDUS        11.93000   11.93000   11.93000   11.93000   11.93000\nCHAS          0.00000    0.00000    0.00000    0.00000    0.00000\nNOX           0.57300    0.57300    0.57300    0.57300    0.57300\nRM            6.59300    6.12000    6.97600    6.79400    6.03000\nAGE          69.10000   76.70000   91.00000   89.30000   80.80000\nDIS           2.47860    2.28750    2.16750    2.38890    2.50500\nRAD           1.00000    1.00000    1.00000    1.00000    1.00000\nTAX         273.00000  273.00000  273.00000  273.00000  273.00000\nPTRATIO      21.00000   21.00000   21.00000   21.00000   21.00000\nLSTAT         9.67000    9.08000    5.64000    6.48000    7.88000\nMEDV         22.40000   20.60000   23.90000   22.00000   11.90000\n"
                }
            ],
            "source": "# Sanity check of the dataset.\n# transpose() is used to make the data fit on a single line.\nprint('Dataframe\\'s top entries: \\n', boston_df.head().transpose())\nprint('\\n\\nDataframe\\'s last entries: \\n', boston_df.tail().transpose())"
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": "- All entries seem normal. \n- To make sure, let's check via a summary.\n- We will view the number of entries and shape of the dataframe as well."
        },
        {
            "cell_type": "code",
            "execution_count": 4,
            "metadata": {},
            "outputs": [
                {
                    "data": {
                        "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>Unnamed: 0</th>\n      <th>CRIM</th>\n      <th>ZN</th>\n      <th>INDUS</th>\n      <th>CHAS</th>\n      <th>NOX</th>\n      <th>RM</th>\n      <th>AGE</th>\n      <th>DIS</th>\n      <th>RAD</th>\n      <th>TAX</th>\n      <th>PTRATIO</th>\n      <th>LSTAT</th>\n      <th>MEDV</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>count</th>\n      <td>506.000000</td>\n      <td>506.000000</td>\n      <td>506.000000</td>\n      <td>506.000000</td>\n      <td>506.000000</td>\n      <td>506.000000</td>\n      <td>506.000000</td>\n      <td>506.000000</td>\n      <td>506.000000</td>\n      <td>506.000000</td>\n      <td>506.000000</td>\n      <td>506.000000</td>\n      <td>506.000000</td>\n      <td>506.000000</td>\n    </tr>\n    <tr>\n      <th>mean</th>\n      <td>252.500000</td>\n      <td>3.613524</td>\n      <td>11.363636</td>\n      <td>11.136779</td>\n      <td>0.069170</td>\n      <td>0.554695</td>\n      <td>6.284634</td>\n      <td>68.574901</td>\n      <td>3.795043</td>\n      <td>9.549407</td>\n      <td>408.237154</td>\n      <td>18.455534</td>\n      <td>12.653063</td>\n      <td>22.532806</td>\n    </tr>\n    <tr>\n      <th>std</th>\n      <td>146.213884</td>\n      <td>8.601545</td>\n      <td>23.322453</td>\n      <td>6.860353</td>\n      <td>0.253994</td>\n      <td>0.115878</td>\n      <td>0.702617</td>\n      <td>28.148861</td>\n      <td>2.105710</td>\n      <td>8.707259</td>\n      <td>168.537116</td>\n      <td>2.164946</td>\n      <td>7.141062</td>\n      <td>9.197104</td>\n    </tr>\n    <tr>\n      <th>min</th>\n      <td>0.000000</td>\n      <td>0.006320</td>\n      <td>0.000000</td>\n      <td>0.460000</td>\n      <td>0.000000</td>\n      <td>0.385000</td>\n      <td>3.561000</td>\n      <td>2.900000</td>\n      <td>1.129600</td>\n      <td>1.000000</td>\n      <td>187.000000</td>\n      <td>12.600000</td>\n      <td>1.730000</td>\n      <td>5.000000</td>\n    </tr>\n    <tr>\n      <th>25%</th>\n      <td>126.250000</td>\n      <td>0.082045</td>\n      <td>0.000000</td>\n      <td>5.190000</td>\n      <td>0.000000</td>\n      <td>0.449000</td>\n      <td>5.885500</td>\n      <td>45.025000</td>\n      <td>2.100175</td>\n      <td>4.000000</td>\n      <td>279.000000</td>\n      <td>17.400000</td>\n      <td>6.950000</td>\n      <td>17.025000</td>\n    </tr>\n    <tr>\n      <th>50%</th>\n      <td>252.500000</td>\n      <td>0.256510</td>\n      <td>0.000000</td>\n      <td>9.690000</td>\n      <td>0.000000</td>\n      <td>0.538000</td>\n      <td>6.208500</td>\n      <td>77.500000</td>\n      <td>3.207450</td>\n      <td>5.000000</td>\n      <td>330.000000</td>\n      <td>19.050000</td>\n      <td>11.360000</td>\n      <td>21.200000</td>\n    </tr>\n    <tr>\n      <th>75%</th>\n      <td>378.750000</td>\n      <td>3.677082</td>\n      <td>12.500000</td>\n      <td>18.100000</td>\n      <td>0.000000</td>\n      <td>0.624000</td>\n      <td>6.623500</td>\n      <td>94.075000</td>\n      <td>5.188425</td>\n      <td>24.000000</td>\n      <td>666.000000</td>\n      <td>20.200000</td>\n      <td>16.955000</td>\n      <td>25.000000</td>\n    </tr>\n    <tr>\n      <th>max</th>\n      <td>505.000000</td>\n      <td>88.976200</td>\n      <td>100.000000</td>\n      <td>27.740000</td>\n      <td>1.000000</td>\n      <td>0.871000</td>\n      <td>8.780000</td>\n      <td>100.000000</td>\n      <td>12.126500</td>\n      <td>24.000000</td>\n      <td>711.000000</td>\n      <td>22.000000</td>\n      <td>37.970000</td>\n      <td>50.000000</td>\n    </tr>\n  </tbody>\n</table>\n</div>",
                        "text/plain": "       Unnamed: 0        CRIM          ZN       INDUS        CHAS         NOX  \\\ncount  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000   \nmean   252.500000    3.613524   11.363636   11.136779    0.069170    0.554695   \nstd    146.213884    8.601545   23.322453    6.860353    0.253994    0.115878   \nmin      0.000000    0.006320    0.000000    0.460000    0.000000    0.385000   \n25%    126.250000    0.082045    0.000000    5.190000    0.000000    0.449000   \n50%    252.500000    0.256510    0.000000    9.690000    0.000000    0.538000   \n75%    378.750000    3.677082   12.500000   18.100000    0.000000    0.624000   \nmax    505.000000   88.976200  100.000000   27.740000    1.000000    0.871000   \n\n               RM         AGE         DIS         RAD         TAX     PTRATIO  \\\ncount  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000   \nmean     6.284634   68.574901    3.795043    9.549407  408.237154   18.455534   \nstd      0.702617   28.148861    2.105710    8.707259  168.537116    2.164946   \nmin      3.561000    2.900000    1.129600    1.000000  187.000000   12.600000   \n25%      5.885500   45.025000    2.100175    4.000000  279.000000   17.400000   \n50%      6.208500   77.500000    3.207450    5.000000  330.000000   19.050000   \n75%      6.623500   94.075000    5.188425   24.000000  666.000000   20.200000   \nmax      8.780000  100.000000   12.126500   24.000000  711.000000   22.000000   \n\n            LSTAT        MEDV  \ncount  506.000000  506.000000  \nmean    12.653063   22.532806  \nstd      7.141062    9.197104  \nmin      1.730000    5.000000  \n25%      6.950000   17.025000  \n50%     11.360000   21.200000  \n75%     16.955000   25.000000  \nmax     37.970000   50.000000  "
                    },
                    "execution_count": 4,
                    "metadata": {},
                    "output_type": "execute_result"
                }
            ],
            "source": "# Basic statistical summary of the dataset.\nboston_df.describe()"
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": "- The dataframe has only 506 entries.\n- Let's check the data types as well as the completeness of the dataset, i.e., are there any missing values?"
        },
        {
            "cell_type": "code",
            "execution_count": 5,
            "metadata": {},
            "outputs": [
                {
                    "data": {
                        "text/plain": "<bound method DataFrame.info of      Unnamed: 0     CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD  \\\n0             0  0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900  1.0   \n1             1  0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671  2.0   \n2             2  0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671  2.0   \n3             3  0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622  3.0   \n4             4  0.06905   0.0   2.18   0.0  0.458  7.147  54.2  6.0622  3.0   \n..          ...      ...   ...    ...   ...    ...    ...   ...     ...  ...   \n501         501  0.06263   0.0  11.93   0.0  0.573  6.593  69.1  2.4786  1.0   \n502         502  0.04527   0.0  11.93   0.0  0.573  6.120  76.7  2.2875  1.0   \n503         503  0.06076   0.0  11.93   0.0  0.573  6.976  91.0  2.1675  1.0   \n504         504  0.10959   0.0  11.93   0.0  0.573  6.794  89.3  2.3889  1.0   \n505         505  0.04741   0.0  11.93   0.0  0.573  6.030  80.8  2.5050  1.0   \n\n       TAX  PTRATIO  LSTAT  MEDV  \n0    296.0     15.3   4.98  24.0  \n1    242.0     17.8   9.14  21.6  \n2    242.0     17.8   4.03  34.7  \n3    222.0     18.7   2.94  33.4  \n4    222.0     18.7   5.33  36.2  \n..     ...      ...    ...   ...  \n501  273.0     21.0   9.67  22.4  \n502  273.0     21.0   9.08  20.6  \n503  273.0     21.0   5.64  23.9  \n504  273.0     21.0   6.48  22.0  \n505  273.0     21.0   7.88  11.9  \n\n[506 rows x 14 columns]>"
                    },
                    "execution_count": 5,
                    "metadata": {},
                    "output_type": "execute_result"
                }
            ],
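            "source": "# Note: without parentheses this displays the bound method (and the frame's repr), not the info() summary.\nboston_df.info"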
        },
        {
            "cell_type": "code",
            "execution_count": 6,
            "metadata": {},
            "outputs": [
                {
                    "data": {
                        "text/plain": "(506, 14)"
                    },
                    "execution_count": 6,
                    "metadata": {},
                    "output_type": "execute_result"
                }
            ],
            "source": "# Alternatively, check the shape.\nboston_df.shape"
        },
        {
            "cell_type": "code",
            "execution_count": 7,
            "metadata": {},
            "outputs": [
                {
                    "name": "stdout",
                    "output_type": "stream",
                    "text": "<class 'pandas.core.frame.DataFrame'>\nRangeIndex: 506 entries, 0 to 505\nData columns (total 14 columns):\n #   Column      Non-Null Count  Dtype  \n---  ------      --------------  -----  \n 0   Unnamed: 0  506 non-null    int64  \n 1   CRIM        506 non-null    float64\n 2   ZN          506 non-null    float64\n 3   INDUS       506 non-null    float64\n 4   CHAS        506 non-null    float64\n 5   NOX         506 non-null    float64\n 6   RM          506 non-null    float64\n 7   AGE         506 non-null    float64\n 8   DIS         506 non-null    float64\n 9   RAD         506 non-null    float64\n 10  TAX         506 non-null    float64\n 11  PTRATIO     506 non-null    float64\n 12  LSTAT       506 non-null    float64\n 13  MEDV        506 non-null    float64\ndtypes: float64(13), int64(1)\nmemory usage: 55.5 KB\n"
                }
            ],
            "source": "# Check data types and for missing entries.\nboston_df.info()"
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": "- Our data has 506 rows (excluding column names) and 14 columns.\n- All data is numeric, either int64 or float64.\n- The first column is not really useful as it is just an index/count. We will remove it from our dataframe later in the assessment.\n- As part of Exploratory Data Analysis (EDA), check the column distributions."
        },
        {
            "cell_type": "code",
            "execution_count": 8,
            "metadata": {},
            "outputs": [
                {
                    "data": {
                        "text/plain": "<Figure size 1800x1800 with 0 Axes>"
                    },
                    "metadata": {},
                    "output_type": "display_data"
                },
                {
                    "data": {