Skip to content

Instantly share code, notes, and snippets.

@amattu2
Created January 22, 2023 21:24
Show Gist options
  • Select an option

  • Save amattu2/91b730bae47ad6da154af3c41223213a to your computer and use it in GitHub Desktop.

Select an option

Save amattu2/91b730bae47ad6da154af3c41223213a to your computer and use it in GitHub Desktop.
The goal is to extract insights from the repair data, particularly revolving around repair costs and vehicle mileage.
See https://gist.github.com/amattu2/e4e6be2e6114c2229c8647fa282c1a3a for the dataset
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "code",
"execution_count": 194,
"id": "4d67132d",
"metadata": {},
"outputs": [],
"source": [
"#\n",
"# Produced: Sun Jan 22 2023\n",
"# Author: Alec M.\n",
"# GitHub: https://amattu.com/links/github\n",
"# Copyright: (C) 2023 Alec M.\n",
"# License: License GNU Affero General Public License v3.0\n",
"# \n",
"# This program is free software: you can redistribute it and/or modify\n",
"# it under the terms of the GNU Affero General Public License as published by\n",
"# the Free Software Foundation, either version 3 of the License, or\n",
"# (at your option) any later version.\n",
"# \n",
"# This program is distributed in the hope that it will be useful,\n",
"# but WITHOUT ANY WARRANTY; without even the implied warranty of\n",
"# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the\n",
"# GNU Affero General Public License for more details.\n",
"# \n",
"# You should have received a copy of the GNU Affero General Public License\n",
"# along with this program. If not, see <http://www.gnu.org/licenses/>.\n",
"#"
]
},
{
"cell_type": "code",
"execution_count": 195,
"id": "4fabb56f",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<style>\n",
"table {\n",
" float: left;\n",
" display: block;\n",
"}\n",
"</style>\n"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"%%html\n",
"<style>\n",
"table {\n",
" float: left;\n",
" display: block;\n",
"}\n",
"</style>"
]
},
{
"cell_type": "code",
"execution_count": 196,
"id": "e6f71d50",
"metadata": {},
"outputs": [],
"source": [
"# Import Libraries\n",
"import pandas as pd\n",
"import seaborn as sns\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import plotly.express as px\n",
"import plotly.graph_objects as go\n",
"import plotly.io as pio\n",
"from sklearn.preprocessing import minmax_scale"
]
},
{
"cell_type": "markdown",
"id": "9d7c9c16",
"metadata": {},
"source": [
"# Automotive Repair Classification\n",
"\n",
"This is an exploratory data analysis using a proprietary and anonymized dataset extracted from an automotive repair facility operating since 2002. The goal is to extract insights from the repair data, particularly revolving around repair costs and vehicle mileage. Each row represents a unique repair instance, not a unique vehicle.\n",
"\n",
"|Description|Date|\n",
"|:-:|:-|\n",
"|Project Created|January 22, 2023|\n",
"|Dataset Updated|December 31st, 2022|"
]
},
{
"cell_type": "markdown",
"id": "9471599c",
"metadata": {},
"source": [
"## Data Importing & Cleanup\n",
"\n",
"Import the dataset, parse data, & remove invalid datapoints."
]
},
{
"cell_type": "code",
"execution_count": 197,
"id": "19ef7993",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Date</th>\n",
" <th>TicketType</th>\n",
" <th>Total</th>\n",
" <th>Mileage</th>\n",
" <th>ModYear</th>\n",
" <th>Make</th>\n",
" <th>Model</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>2002-10-31 08:59:00</td>\n",
" <td>Invoice</td>\n",
" <td>5.00</td>\n",
" <td>37032</td>\n",
" <td>2001</td>\n",
" <td>DODGE</td>\n",
" <td>CARAVAN -GRAND CARAVAN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2002-10-31 09:11:00</td>\n",
" <td>Invoice</td>\n",
" <td>232.00</td>\n",
" <td>73475</td>\n",
" <td>1997</td>\n",
" <td>TOYOTA</td>\n",
" <td>CAMRY FWD</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2002-10-31 10:32:00</td>\n",
" <td>Invoice</td>\n",
" <td>22.00</td>\n",
" <td>89254</td>\n",
" <td>1996</td>\n",
" <td>SATURN</td>\n",
" <td>S SERIES</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2002-10-31 11:28:00</td>\n",
" <td>Invoice</td>\n",
" <td>49.95</td>\n",
" <td>51470</td>\n",
" <td>1999</td>\n",
" <td>FORD</td>\n",
" <td>RANGER</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>2002-10-31 15:23:00</td>\n",
" <td>Invoice</td>\n",
" <td>850.20</td>\n",
" <td>68480</td>\n",
" <td>1998</td>\n",
" <td>MITSUBISHI</td>\n",
" <td>ECLIPSE</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Date TicketType Total Mileage ModYear Make \\\n",
"0 2002-10-31 08:59:00 Invoice 5.00 37032 2001 DODGE \n",
"1 2002-10-31 09:11:00 Invoice 232.00 73475 1997 TOYOTA \n",
"2 2002-10-31 10:32:00 Invoice 22.00 89254 1996 SATURN \n",
"3 2002-10-31 11:28:00 Invoice 49.95 51470 1999 FORD \n",
"4 2002-10-31 15:23:00 Invoice 850.20 68480 1998 MITSUBISHI \n",
"\n",
" Model \n",
"0 CARAVAN -GRAND CARAVAN \n",
"1 CAMRY FWD \n",
"2 S SERIES \n",
"3 RANGER \n",
"4 ECLIPSE "
]
},
"execution_count": 197,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.read_csv(\"dataset.csv\", header = 0)\n",
"\n",
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": 198,
"id": "9d9bb2e0",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 54892 entries, 0 to 54891\n",
"Data columns (total 7 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 Date 54892 non-null object \n",
" 1 TicketType 54892 non-null object \n",
" 2 Total 54892 non-null float64\n",
" 3 Mileage 54892 non-null int64 \n",
" 4 ModYear 54892 non-null int64 \n",
" 5 Make 54884 non-null object \n",
" 6 Model 54892 non-null object \n",
"dtypes: float64(1), int64(2), object(4)\n",
"memory usage: 2.9+ MB\n"
]
}
],
"source": [
"df.info()"
]
},
{
"cell_type": "code",
"execution_count": 199,
"id": "1163d45e",
"metadata": {},
"outputs": [],
"source": [
"\"\"\"\n",
"Coerce the Date column from a string into a DateTime field\n",
"\"\"\"\n",
"df['Date'] = pd.to_datetime(df['Date'])"
]
},
{
"cell_type": "code",
"execution_count": 200,
"id": "bd541266",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(['DODGE', 'TOYOTA', 'SATURN', 'FORD', 'MITSUBISHI', 'NISSAN',\n",
" 'CHRYSLER', 'BMW', 'HONDA', 'PONTIAC', 'LEXUS', 'JEEP',\n",
" 'CHEVROLET', 'VOLKSWAGEN', 'ACURA', 'PLYMOUTH', 'MERCURY',\n",
" 'INFINITI', 'LINCOLN', 'MAZDA', 'HYUNDAI', 'GMC LIGHT TRUCK',\n",
" 'VOLVO', 'CADILLAC', 'SUBARU', 'MERCEDES', 'AUDI', 'BUICK', 'SAAB',\n",
" 'ISUZU', 'OLDSMOBILE', 'SUZUKI', 'LAND ROVER', 'KIA', 'DAEWOO',\n",
" 'GMC - LIGHT TRUCK', 'JAGUAR', 'PORSCHE', 'MERCEDES-BENZ',\n",
" 'IMPALA', 'SIENNA LE', 'GMC', 'TAURUS SE', 'VOLKSWAGON',\n",
" 'ROLLS ROYCE', 'SCION', 'TRAILER', 'ALTIMA 2.5S', 'MATRIX',\n",
" 'RX350', 'CARAVAN SXT', 'EDGE', 'IMPALA LTZ', 'EDGE LIMITED',\n",
" 'A7 PRESTIGE AWD', 'GRAND CHEROKEE LMT 4X4', 'MERCADES-BENS',\n",
" 'NISSAN HD TRUCK', 'MERCEDES BENZ', 'FIAT', 'CHEVY',\n",
" 'INTERNATIONAL', 'FORD HD TRUCK', 'PARTS', 'TESLA', nan, 'VW',\n",
" 'RAM', 'SUBURU', 'HUMMER', 'MINI', 'ALFA ROMEO',\n",
" 'BIG TEX MANUFACTURING', 'GENESIS'], dtype=object)"
]
},
"execution_count": 200,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.Make.unique()"
]
},
{
"cell_type": "code",
"execution_count": 201,
"id": "0d3bd4dc",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"TOYOTA 11751\n",
"HONDA 8537\n",
"FORD 5212\n",
"LEXUS 4624\n",
"NISSAN 3120\n",
" ... \n",
"IMPALA LTZ 1\n",
"EDGE 1\n",
"CARAVAN SXT 1\n",
"MATRIX 1\n",
"BIG TEX MANUFACTURING 1\n",
"Name: Make, Length: 62, dtype: int64"
]
},
"execution_count": 201,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"\"\"\"\n",
"Replace typos, abbreviations, & inconsistencies\n",
"\"\"\"\n",
"replacements = [\n",
" # From, To\n",
" (\"MERCEDES-BENZ\", \"MERCEDES\"),\n",
" (\"MERCEDES BENZ\", \"MERCEDES\"),\n",
" (\"MERCADES-BENZ\", \"MERCEDES\"),\n",
" (\"MERCADES-BENS\", \"MERCEDES\"),\n",
" (\"VW\", \"VOLKSWAGEN\"),\n",
" (\"VOLKSWAGON\", \"VOLKSWAGEN\"),\n",
" (\"RX350\", \"LEXUS\"),\n",
" (\"SIENNA LE\", \"TOYOTA\"),\n",
" (\"GENESIS\", \"HYUNDAI\"),\n",
" (\"GMC - LIGHT TRUCK\", \"GMC\"),\n",
" (\"GMC LIGHT TRUCK\", \"GMC\"),\n",
" (\"NISSAN HD TRUCK\", \"NISSAN\"),\n",
"]\n",
"\n",
"for (fr, to) in replacements:\n",
" df.Make.replace(fr, to, regex = True, inplace = True)\n",
"\n",
"df.Make.value_counts()"
]
},
{
"cell_type": "code",
"execution_count": 202,
"id": "e5f7e6a7",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"TOYOTA 11751\n",
"HONDA 8537\n",
"FORD 5212\n",
"LEXUS 4624\n",
"NISSAN 3120\n",
"ACURA 2316\n",
"CHEVROLET 2204\n",
"HYUNDAI 1975\n",
"DODGE 1576\n",
"MERCEDES 1237\n",
"BMW 1108\n",
"JEEP 1075\n",
"MAZDA 913\n",
"SUBARU 912\n",
"INFINITI 906\n",
"CHRYSLER 816\n",
"VOLVO 785\n",
"VOLKSWAGEN 646\n",
"CADILLAC 613\n",
"MERCURY 571\n",
"GMC 567\n",
"SATURN 530\n",
"AUDI 419\n",
"BUICK 359\n",
"KIA 313\n",
"LINCOLN 299\n",
"PARTS 239\n",
"PLYMOUTH 216\n",
"SAAB 200\n",
"PONTIAC 195\n",
"MITSUBISHI 167\n",
"OLDSMOBILE 102\n",
"SUZUKI 66\n",
"ISUZU 64\n",
"JAGUAR 56\n",
"LAND ROVER 36\n",
"ALFA ROMEO 34\n",
"RAM 29\n",
"MINI 25\n",
"PORSCHE 21\n",
"EDGE LIMITED 12\n",
"TRAILER 5\n",
"FIAT 5\n",
"TESLA 5\n",
"INTERNATIONAL 3\n",
"TAURUS SE 3\n",
"SCION 2\n",
"Name: Make, dtype: int64"
]
},
"execution_count": 202,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"\"\"\"\n",
"Drop makes with a single occurrence\n",
"\"\"\"\n",
"makeCounts = df[\"Make\"].value_counts() > 1\n",
"df = df[df[\"Make\"].isin(makeCounts[makeCounts].index)]\n",
"\n",
"df.Make.value_counts()"
]
},
{
"cell_type": "code",
"execution_count": 203,
"id": "88ecd56a",
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Total</th>\n",
" <th>Mileage</th>\n",
" <th>ModYear</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>54869.000000</td>\n",
" <td>5.486900e+04</td>\n",
" <td>54869.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mean</th>\n",
" <td>257.745082</td>\n",
" <td>9.977004e+04</td>\n",
" <td>2005.472908</td>\n",
" </tr>\n",
" <tr>\n",
" <th>std</th>\n",
" <td>403.686697</td>\n",
" <td>6.111207e+04</td>\n",
" <td>7.249758</td>\n",
" </tr>\n",
" <tr>\n",
" <th>min</th>\n",
" <td>-282.810000</td>\n",
" <td>0.000000e+00</td>\n",
" <td>1950.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25%</th>\n",
" <td>40.110000</td>\n",
" <td>5.514100e+04</td>\n",
" <td>2001.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50%</th>\n",
" <td>116.520000</td>\n",
" <td>9.389500e+04</td>\n",
" <td>2005.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75%</th>\n",
" <td>347.390000</td>\n",
" <td>1.366550e+05</td>\n",
" <td>2010.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>max</th>\n",
" <td>25650.000000</td>\n",
" <td>2.662311e+06</td>\n",
" <td>2022.000000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Total Mileage ModYear\n",
"count 54869.000000 5.486900e+04 54869.000000\n",
"mean 257.745082 9.977004e+04 2005.472908\n",
"std 403.686697 6.111207e+04 7.249758\n",
"min -282.810000 0.000000e+00 1950.000000\n",
"25% 40.110000 5.514100e+04 2001.000000\n",
"50% 116.520000 9.389500e+04 2005.000000\n",
"75% 347.390000 1.366550e+05 2010.000000\n",
"max 25650.000000 2.662311e+06 2022.000000"
]
},
"execution_count": 203,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.describe()"
]
},
{
"cell_type": "code",
"execution_count": 204,
"id": "45b044f5",
"metadata": {},
"outputs": [],
"source": [
"\"\"\"\n",
"Drop outliers due to typo likelyhood\n",
"\"\"\"\n",
"df.drop(df[(df.Total > 4000) | (df.Total < 0)].index, inplace = True)\n",
"df.drop(df[(df.Mileage > 550000) | (df.Mileage <= 0)].index, inplace = True)\n",
"df.drop(df[df.ModYear <= 1950].index, inplace = True) # 1950 is the default Model Year when not inputted"
]
},
{
"cell_type": "markdown",
"id": "039db813",
"metadata": {},
"source": [
"## Creating Datapoints\n",
"\n",
"Because we only have raw data, we need to define and extract insightful key-points from our dataset.\n",
"\n",
"Examples:\n",
"\n",
"- Country of Origin\n",
"- Age at repair (ModelYear - Repair Date)\n",
"- Average mileage at repair (Mileage/Age)\n",
"- Manufacturer class (Standard, Luxury)\n",
"- Cost grouped by 100s\n",
"- Mileage grouped by 10000s"
]
},
{
"cell_type": "code",
"execution_count": 205,
"id": "650c3dc7",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Date</th>\n",
" <th>TicketType</th>\n",
" <th>Total</th>\n",
" <th>Mileage</th>\n",
" <th>ModYear</th>\n",
" <th>Make</th>\n",
" <th>Model</th>\n",
" <th>Country</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>7083</th>\n",
" <td>2006-09-20 16:25:00</td>\n",
" <td>Invoice</td>\n",
" <td>0.00</td>\n",
" <td>29114</td>\n",
" <td>2006</td>\n",
" <td>TAURUS SE</td>\n",
" <td>FORD</td>\n",
" <td>NA</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7592</th>\n",
" <td>2006-12-12 13:18:00</td>\n",
" <td>Invoice</td>\n",
" <td>0.00</td>\n",
" <td>31456</td>\n",
" <td>2006</td>\n",
" <td>TAURUS SE</td>\n",
" <td>FORD</td>\n",
" <td>NA</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8284</th>\n",
" <td>2007-04-10 11:07:00</td>\n",
" <td>Invoice</td>\n",
" <td>0.00</td>\n",
" <td>34063</td>\n",
" <td>2006</td>\n",
" <td>TAURUS SE</td>\n",
" <td>FORD</td>\n",
" <td>NA</td>\n",
" </tr>\n",
" <tr>\n",
" <th>24825</th>\n",
" <td>2013-06-10 12:00:00</td>\n",
" <td>Invoice</td>\n",
" <td>46.16</td>\n",
" <td>31990</td>\n",
" <td>2013</td>\n",
" <td>EDGE LIMITED</td>\n",
" <td>FORD</td>\n",
" <td>NA</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25774</th>\n",
" <td>2013-09-19 15:16:00</td>\n",
" <td>Invoice</td>\n",
" <td>0.00</td>\n",
" <td>36273</td>\n",
" <td>2013</td>\n",
" <td>EDGE LIMITED</td>\n",
" <td>FORD</td>\n",
" <td>NA</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25944</th>\n",
" <td>2013-10-08 11:21:00</td>\n",
" <td>Invoice</td>\n",
" <td>0.00</td>\n",
" <td>37502</td>\n",
" <td>2013</td>\n",
" <td>EDGE LIMITED</td>\n",
" <td>FORD</td>\n",
" <td>NA</td>\n",
" </tr>\n",
" <tr>\n",
" <th>26536</th>\n",
" <td>2013-12-18 15:58:00</td>\n",
" <td>Invoice</td>\n",
" <td>0.00</td>\n",
" <td>42503</td>\n",
" <td>2013</td>\n",
" <td>EDGE LIMITED</td>\n",
" <td>FORD</td>\n",
" <td>NA</td>\n",
" </tr>\n",
" <tr>\n",
" <th>27927</th>\n",
" <td>2014-06-03 15:04:00</td>\n",
" <td>Invoice</td>\n",
" <td>0.00</td>\n",
" <td>50321</td>\n",
" <td>2013</td>\n",
" <td>EDGE LIMITED</td>\n",
" <td>FORD</td>\n",
" <td>NA</td>\n",
" </tr>\n",
" <tr>\n",
" <th>29729</th>\n",
" <td>2015-01-05 13:21:00</td>\n",
" <td>Invoice</td>\n",
" <td>0.00</td>\n",
" <td>66000</td>\n",
" <td>2013</td>\n",
" <td>EDGE LIMITED</td>\n",
" <td>FORD</td>\n",
" <td>NA</td>\n",
" </tr>\n",
" <tr>\n",
" <th>30143</th>\n",
" <td>2015-02-24 14:44:00</td>\n",
" <td>Invoice</td>\n",
" <td>60.87</td>\n",
" <td>65249</td>\n",
" <td>2013</td>\n",
" <td>EDGE LIMITED</td>\n",
" <td>FORD</td>\n",
" <td>NA</td>\n",
" </tr>\n",
" <tr>\n",
" <th>30597</th>\n",
" <td>2015-04-16 13:52:00</td>\n",
" <td>Invoice</td>\n",
" <td>202.30</td>\n",
" <td>67670</td>\n",
" <td>2013</td>\n",
" <td>EDGE LIMITED</td>\n",
" <td>FORD</td>\n",
" <td>NA</td>\n",
" </tr>\n",
" <tr>\n",
" <th>30887</th>\n",
" <td>2015-05-21 07:38:00</td>\n",
" <td>Invoice</td>\n",
" <td>123.30</td>\n",
" <td>69231</td>\n",
" <td>2013</td>\n",
" <td>EDGE LIMITED</td>\n",
" <td>FORD</td>\n",
" <td>NA</td>\n",
" </tr>\n",
" <tr>\n",
" <th>31956</th>\n",
" <td>2015-09-17 14:10:00</td>\n",
" <td>Invoice</td>\n",
" <td>43.87</td>\n",
" <td>74721</td>\n",
" <td>2013</td>\n",
" <td>EDGE LIMITED</td>\n",
" <td>FORD</td>\n",
" <td>NA</td>\n",
" </tr>\n",
" <tr>\n",
" <th>32973</th>\n",
" <td>2016-01-27 14:22:00</td>\n",
" <td>Invoice</td>\n",
" <td>237.22</td>\n",
" <td>80334</td>\n",
" <td>2013</td>\n",
" <td>EDGE LIMITED</td>\n",
" <td>FORD</td>\n",
" <td>NA</td>\n",
" </tr>\n",
" <tr>\n",
" <th>33813</th>\n",
" <td>2016-05-12 14:55:00</td>\n",
" <td>Invoice</td>\n",
" <td>45.54</td>\n",
" <td>85372</td>\n",
" <td>2013</td>\n",
" <td>EDGE LIMITED</td>\n",
" <td>FORD</td>\n",
" <td>NA</td>\n",
" </tr>\n",
" <tr>\n",
" <th>40009</th>\n",
" <td>2018-05-18 16:15:00</td>\n",
" <td>Invoice</td>\n",
" <td>245.95</td>\n",
" <td>252536</td>\n",
" <td>2002</td>\n",
" <td>INTERNATIONAL</td>\n",
" <td>4200-4400 SERIES</td>\n",
" <td>NA</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Date TicketType Total Mileage ModYear Make \\\n",
"7083 2006-09-20 16:25:00 Invoice 0.00 29114 2006 TAURUS SE \n",
"7592 2006-12-12 13:18:00 Invoice 0.00 31456 2006 TAURUS SE \n",
"8284 2007-04-10 11:07:00 Invoice 0.00 34063 2006 TAURUS SE \n",
"24825 2013-06-10 12:00:00 Invoice 46.16 31990 2013 EDGE LIMITED \n",
"25774 2013-09-19 15:16:00 Invoice 0.00 36273 2013 EDGE LIMITED \n",
"25944 2013-10-08 11:21:00 Invoice 0.00 37502 2013 EDGE LIMITED \n",
"26536 2013-12-18 15:58:00 Invoice 0.00 42503 2013 EDGE LIMITED \n",
"27927 2014-06-03 15:04:00 Invoice 0.00 50321 2013 EDGE LIMITED \n",
"29729 2015-01-05 13:21:00 Invoice 0.00 66000 2013 EDGE LIMITED \n",
"30143 2015-02-24 14:44:00 Invoice 60.87 65249 2013 EDGE LIMITED \n",
"30597 2015-04-16 13:52:00 Invoice 202.30 67670 2013 EDGE LIMITED \n",
"30887 2015-05-21 07:38:00 Invoice 123.30 69231 2013 EDGE LIMITED \n",
"31956 2015-09-17 14:10:00 Invoice 43.87 74721 2013 EDGE LIMITED \n",
"32973 2016-01-27 14:22:00 Invoice 237.22 80334 2013 EDGE LIMITED \n",
"33813 2016-05-12 14:55:00 Invoice 45.54 85372 2013 EDGE LIMITED \n",
"40009 2018-05-18 16:15:00 Invoice 245.95 252536 2002 INTERNATIONAL \n",
"\n",
" Model Country \n",
"7083 FORD NA \n",
"7592 FORD NA \n",
"8284 FORD NA \n",
"24825 FORD NA \n",
"25774 FORD NA \n",
"25944 FORD NA \n",
"26536 FORD NA \n",
"27927 FORD NA \n",
"29729 FORD NA \n",
"30143 FORD NA \n",
"30597 FORD NA \n",
"30887 FORD NA \n",
"31956 FORD NA \n",
"32973 FORD NA \n",
"33813 FORD NA \n",
"40009 4200-4400 SERIES NA "
]
},
"execution_count": 205,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"countryOfMake = {\n",
" # Country: (Makes, ...),\n",
" \"GERMANY\": (\"MERCEDES\", \"AUDI\", \"BMW\", \"VOLKSWAGEN\", \"MINI\", \"PORSCHE\"),\n",
" \"USA\": (\"BUICK\", \"CADILLAC\", \"CHEVROLET\", \"FORD\", \"LINCOLN\", \"DODGE\", \"OLDSMOBILE\",\n",
" \"GMC\", \"SATURN\", \"JEEP\", \"CHRYSLER\", \"TESLA\", \"RAM\", \"PONTIAC\", \"PLYMOUTH\", \"MERCURY\"),\n",
" \"JAPAN\": (\"HONDA\", \"HYUNDAI\", \"ACURA\", \"TOYOTA\", \"LEXUS\", \"SCION\",\n",
" \"SUBARU\", \"NISSAN\", \"INFINITI\", \"MAZDA\", \"MITSUBISHI\", \"KIA\", \"ISUZU\", \"SUZUKI\"),\n",
" \"SWEDEN\": (\"VOLVO\", \"SAAB\",),\n",
" \"UK\": (\"LAND ROVER\", \"JAGUAR\"),\n",
" \"ITALY\": (\"FIAT\", \"ALFA ROMEO\")\n",
"}\n",
"\n",
"def getMakeCountry(r):\n",
" make = r.Make.upper()\n",
" \n",
" for country, makers in countryOfMake.items():\n",
" if (make in makers):\n",
" return country\n",
"\n",
" return \"NA\"\n",
" \n",
"df['Country'] = df.apply(getMakeCountry, axis = 1)\n",
"\n",
"df[df.Country == \"NA\"]"
]
},
{
"cell_type": "code",
"execution_count": 206,
"id": "365a2fb7",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"count 53007.000000\n",
"mean 7.656687\n",
"std 4.272216\n",
"min -6.000000\n",
"25% 4.000000\n",
"50% 7.000000\n",
"75% 10.000000\n",
"max 44.000000\n",
"Name: Age, dtype: float64"
]
},
"execution_count": 206,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[\"Age\"] = df.apply(lambda r: r.Date.year - r.ModYear, axis = 1)\n",
"\n",
"df.Age.describe()"
]
},
{
"cell_type": "code",
"execution_count": 207,
"id": "20841432",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"count 53006.000000\n",
"mean 7.656944\n",
"std 4.271844\n",
"min -1.000000\n",
"25% 4.000000\n",
"50% 7.000000\n",
"75% 10.000000\n",
"max 44.000000\n",
"Name: Age, dtype: float64"
]
},
"execution_count": 207,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"\"\"\"\n",
"Drop invalid ages\n",
"\n",
"Note: -1 is valid, vehicles are released ~1yr early.\n",
"\"\"\"\n",
"df.drop(df[df.Age < -1].index, inplace = True)\n",
"\n",
"df.Age.describe()"
]
},
{
"cell_type": "code",
"execution_count": 208,
"id": "2763d4ba",
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"count 53006.000000\n",
"mean 14604.188677\n",
"std 6610.606383\n",
"min 0.000000\n",
"25% 10637.000000\n",
"50% 13726.000000\n",
"75% 17334.000000\n",
"max 158161.000000\n",
"Name: AvgMileage, dtype: float64"
]
},
"execution_count": 208,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def getAverageMileage(r):\n",
" return round(r.Mileage / r.Age) if r.Age > 0 else 0\n",
"\n",
"df[\"AvgMileage\"] = df.apply(getAverageMileage, axis = 1)\n",
"\n",
"df.AvgMileage.describe()"
]
},
{
"cell_type": "code",
"execution_count": 209,
"id": "8fd34525",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"count 53006\n",
"unique 3\n",
"top Standard\n",
"freq 40993\n",
"Name: Class, dtype: object"
]
},
"execution_count": 209,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"makeClasses = {\n",
" # Country: (Makes, ...)\n",
" \"Luxury\": (\"MERCEDES\", \"CADILLAC\", \"LINCOLN\", \"LEXUS\", \"JAGUAR\", \"LAND ROVER\", \"VOLVO\", \"ACURA\"),\n",
" \"Sport-Luxury\": (\"AUDI\", \"INFINITI\", \"BMW\", \"PORSCHE\")\n",
"}\n",
"\n",
"def getMakeClass(r):\n",
" make = r.Make.upper()\n",
" \n",
" for makeClass, makers in makeClasses.items():\n",
" if (make in makers):\n",
" return makeClass\n",
"\n",
" return \"Standard\"\n",
" \n",
"df['Class'] = df.apply(getMakeClass, axis = 1)\n",
"\n",
"df.Class.describe()"
]
},
{
"cell_type": "code",
"execution_count": 210,
"id": "6d9e8a9a",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Date</th>\n",
" <th>TicketType</th>\n",
" <th>Total</th>\n",
" <th>Mileage</th>\n",
" <th>ModYear</th>\n",
" <th>Make</th>\n",
" <th>Model</th>\n",
" <th>Country</th>\n",
" <th>Age</th>\n",
" <th>AvgMileage</th>\n",
" <th>Class</th>\n",
" <th>MileageRounded</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>2002-10-31 08:59:00</td>\n",
" <td>Invoice</td>\n",
" <td>5.00</td>\n",
" <td>37032</td>\n",
" <td>2001</td>\n",
" <td>DODGE</td>\n",
" <td>CARAVAN -GRAND CARAVAN</td>\n",
" <td>USA</td>\n",
" <td>1</td>\n",
" <td>37032</td>\n",
" <td>Standard</td>\n",
" <td>40000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2002-10-31 09:11:00</td>\n",
" <td>Invoice</td>\n",
" <td>232.00</td>\n",
" <td>73475</td>\n",
" <td>1997</td>\n",
" <td>TOYOTA</td>\n",
" <td>CAMRY FWD</td>\n",
" <td>JAPAN</td>\n",
" <td>5</td>\n",
" <td>14695</td>\n",
" <td>Standard</td>\n",
" <td>70000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2002-10-31 10:32:00</td>\n",
" <td>Invoice</td>\n",
" <td>22.00</td>\n",
" <td>89254</td>\n",
" <td>1996</td>\n",
" <td>SATURN</td>\n",
" <td>S SERIES</td>\n",
" <td>USA</td>\n",
" <td>6</td>\n",
" <td>14876</td>\n",
" <td>Standard</td>\n",
" <td>90000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2002-10-31 11:28:00</td>\n",
" <td>Invoice</td>\n",
" <td>49.95</td>\n",
" <td>51470</td>\n",
" <td>1999</td>\n",
" <td>FORD</td>\n",
" <td>RANGER</td>\n",
" <td>USA</td>\n",
" <td>3</td>\n",
" <td>17157</td>\n",
" <td>Standard</td>\n",
" <td>50000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>2002-10-31 15:23:00</td>\n",
" <td>Invoice</td>\n",
" <td>850.20</td>\n",
" <td>68480</td>\n",
" <td>1998</td>\n",
" <td>MITSUBISHI</td>\n",
" <td>ECLIPSE</td>\n",
" <td>JAPAN</td>\n",
" <td>4</td>\n",
" <td>17120</td>\n",
" <td>Standard</td>\n",
" <td>70000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Date TicketType Total Mileage ModYear Make \\\n",
"0 2002-10-31 08:59:00 Invoice 5.00 37032 2001 DODGE \n",
"1 2002-10-31 09:11:00 Invoice 232.00 73475 1997 TOYOTA \n",
"2 2002-10-31 10:32:00 Invoice 22.00 89254 1996 SATURN \n",
"3 2002-10-31 11:28:00 Invoice 49.95 51470 1999 FORD \n",
"4 2002-10-31 15:23:00 Invoice 850.20 68480 1998 MITSUBISHI \n",
"\n",
" Model Country Age AvgMileage Class MileageRounded \n",
"0 CARAVAN -GRAND CARAVAN USA 1 37032 Standard 40000 \n",
"1 CAMRY FWD JAPAN 5 14695 Standard 70000 \n",
"2 S SERIES USA 6 14876 Standard 90000 \n",
"3 RANGER USA 3 17157 Standard 50000 \n",
"4 ECLIPSE JAPAN 4 17120 Standard 70000 "
]
},
"execution_count": 210,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['MileageRounded'] = df.apply(lambda r: round(r.Mileage, -4), axis = 1)\n",
"\n",
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": 211,
"id": "fe34f4d3",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Date</th>\n",
" <th>TicketType</th>\n",
" <th>Total</th>\n",
" <th>Mileage</th>\n",
" <th>ModYear</th>\n",
" <th>Make</th>\n",
" <th>Model</th>\n",
" <th>Country</th>\n",
" <th>Age</th>\n",
" <th>AvgMileage</th>\n",
" <th>Class</th>\n",
" <th>MileageRounded</th>\n",
" <th>TotalRounded</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>2002-10-31 08:59:00</td>\n",
" <td>Invoice</td>\n",
" <td>5.00</td>\n",
" <td>37032</td>\n",
" <td>2001</td>\n",
" <td>DODGE</td>\n",
" <td>CARAVAN -GRAND CARAVAN</td>\n",
" <td>USA</td>\n",
" <td>1</td>\n",
" <td>37032</td>\n",
" <td>Standard</td>\n",
" <td>40000</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2002-10-31 09:11:00</td>\n",
" <td>Invoice</td>\n",
" <td>232.00</td>\n",
" <td>73475</td>\n",
" <td>1997</td>\n",
" <td>TOYOTA</td>\n",
" <td>CAMRY FWD</td>\n",
" <td>JAPAN</td>\n",
" <td>5</td>\n",
" <td>14695</td>\n",
" <td>Standard</td>\n",
" <td>70000</td>\n",
" <td>200.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2002-10-31 10:32:00</td>\n",
" <td>Invoice</td>\n",
" <td>22.00</td>\n",
" <td>89254</td>\n",
" <td>1996</td>\n",
" <td>SATURN</td>\n",
" <td>S SERIES</td>\n",
" <td>USA</td>\n",
" <td>6</td>\n",
" <td>14876</td>\n",
" <td>Standard</td>\n",
" <td>90000</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2002-10-31 11:28:00</td>\n",
" <td>Invoice</td>\n",
" <td>49.95</td>\n",
" <td>51470</td>\n",
" <td>1999</td>\n",
" <td>FORD</td>\n",
" <td>RANGER</td>\n",
" <td>USA</td>\n",
" <td>3</td>\n",
" <td>17157</td>\n",
" <td>Standard</td>\n",
" <td>50000</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>2002-10-31 15:23:00</td>\n",
" <td>Invoice</td>\n",
" <td>850.20</td>\n",
" <td>68480</td>\n",
" <td>1998</td>\n",
" <td>MITSUBISHI</td>\n",
" <td>ECLIPSE</td>\n",
" <td>JAPAN</td>\n",
" <td>4</td>\n",
" <td>17120</td>\n",
" <td>Standard</td>\n",
" <td>70000</td>\n",
" <td>900.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Date TicketType Total Mileage ModYear Make \\\n",
"0 2002-10-31 08:59:00 Invoice 5.00 37032 2001 DODGE \n",
"1 2002-10-31 09:11:00 Invoice 232.00 73475 1997 TOYOTA \n",
"2 2002-10-31 10:32:00 Invoice 22.00 89254 1996 SATURN \n",
"3 2002-10-31 11:28:00 Invoice 49.95 51470 1999 FORD \n",
"4 2002-10-31 15:23:00 Invoice 850.20 68480 1998 MITSUBISHI \n",
"\n",
" Model Country Age AvgMileage Class MileageRounded \\\n",
"0 CARAVAN -GRAND CARAVAN USA 1 37032 Standard 40000 \n",
"1 CAMRY FWD JAPAN 5 14695 Standard 70000 \n",
"2 S SERIES USA 6 14876 Standard 90000 \n",
"3 RANGER USA 3 17157 Standard 50000 \n",
"4 ECLIPSE JAPAN 4 17120 Standard 70000 \n",
"\n",
" TotalRounded \n",
"0 0.0 \n",
"1 200.0 \n",
"2 0.0 \n",
"3 0.0 \n",
"4 900.0 "
]
},
"execution_count": 211,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['TotalRounded'] = df.apply(lambda r: round(r.Total, -2), axis = 1)\n",
"\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"id": "6f798797",
"metadata": {},
"source": [
"## Graphing\n",
"\n",
"Let's take a look at the data to find correlations between datapoints."
]
},
{
"cell_type": "code",
"execution_count": 212,
"id": "47843cb6",
"metadata": {},
"outputs": [
{
"data": {
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment